2024
Larger language models do in-context learning differently.
J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu, X. Chen, H. Liu, D. Huang, D. Zhou, and T. Ma.
Google AI blog
JMLR '24 Scaling instruction-finetuned language models.
{H. W. Chung, L. Hou, S. Longpre}, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. Le, and J. Wei.
Google AI blog
NAACL '24 A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity.
S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts, B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno, and D. Ippolito.
ICLR '24 Mixture-of-experts meets instruction tuning: A winning combination for large language models.
S. Shen, L. Hou, Y. Zhou, N. Du, S. Longpre, J. Wei, H. W. Chung, B. Zoph, W. Fedus, X. Chen, T. Vu, Y. Wu, W. Chen, A. Webson, Y. Li, V. Zhao, H. Yu, K. Keutzer, T. Darrell, and D. Zhou.
2023
EMNLP '23 Inverse scaling can become U-shaped.
{J. Wei, N. Kim}, Y. Tay, and Q. Le.
EMNLP '23 Transcending scaling laws with 0.1% extra compute.
Y. Tay, J. Wei, H. W. Chung, V. Tran, D. So, S. Shakeri, X. Garcia, H. Zheng, J. Rao, A. Chowdhery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby, Q. Le, and M. Dehghani.
Nature '23 Large language models encode clinical knowledge.
K. Singhal, S. Azizi, T. Tu, S. Mahdavi, J. Wei, H. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, N. Schärli, A. Chowdhery, P. Mansfield, B. Agüera y Arcas, D. Webster, G. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomasev, Y. Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalingam, and V. Natarajan.
ICML '23 The Flan Collection: Designing data and methods for effective instruction tuning.
S. Longpre, L. Hou, T. Vu, A. Webson, H. Chung, Y. Tay, D. Zhou, Q. Le, B. Zoph, J. Wei, and A. Roberts.
Google AI blog
ACL '23 Challenging BIG-Bench tasks and whether chain-of-thought can solve them.
(Findings) M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, and J. Wei.
ICLR '23 Language models are multilingual chain-of-thought reasoners.
{F. Shi, M. Suzgun}, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei.
ICLR '23 Self-consistency improves chain of thought reasoning in language models.
X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou.
ICLR '23 UL2: Unifying language learning paradigms.
Y. Tay, M. Dehghani, V. Tran, X. Garcia, J. Wei, X. Wang, H. Chung, D. Bahri, T. Schuster, H. Zheng, D. Zhou, N. Houlsby, and D. Metzler.
ICLR '23 Least-to-most prompting enables complex reasoning in large language models.
D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, O. Bousquet, C. Cui, Q. Le, and E. Chi.
ICLR '23 Mind's Eye: Grounded language model reasoning through simulation.
R. Liu, J. Wei, S. Gu, T. Wu, S. Vosoughi, C. Cui, D. Zhou, and A. Dai.
JMLR '23 PaLM: Scaling language modeling with Pathways.
{A. Chowdhery, S. Narang, J. Devlin} and 64 additional authors including J. Wei.
Google AI blog
2022
TMLR '22 Emergent abilities of large language models.
J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus.
Google AI blog / Stanford HAI blog
NeurIPS '22 Chain-of-thought prompting elicits reasoning in large language models.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou.
Sundar explains chain of thought prompting at Google I/O 2022 / Google AI blog
ACL '22 A recipe for arbitrary text style transfer with large language models.
{E. Reif, D. Ippolito}, A. Yuan, A. Coenen, C. Callison-Burch, and J. Wei.
ICLR '22 Finetuned language models are zero-shot learners.
{J. Wei, M. Bosma, V. Zhao, K. Guu}, A. Yu, B. Lester, N. Du, A. Dai, and Q. Le.
Google AI blog / oral
ICLR '22 The MultiBERTs: BERT reproductions for robustness analysis.
{T. Sellam, S. Yadlowsky}, I. Tenney, J. Wei, N. Saphra, A. D'Amour, T. Linzen, J. Bastings, I. Turc, J. Eisenstein, D. Das, and E. Pavlick.
2021
EMNLP '21 Frequency effects on syntactic rule learning in transformers.
J. Wei, D. Garrette, T. Linzen, and E. Pavlick.
Google AI blog / oral
EMNLP '21 Good-enough example extrapolation.
J. Wei.
ACL '21 A cognitive regularizer for language modeling.
J. Wei, C. Meister, and R. Cotterell.
ACL '21 Language model augmented relevance score.
R. Liu, J. Wei, and S. Vosoughi.
ACL '21 A survey of data augmentation approaches for NLP.
(Findings) {S. Feng, V. Gangal}, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy.
ACL '21 Modulating language models with emotions.
(Findings) R. Liu, J. Wei, C. Jia, and S. Vosoughi.
NAACL '21 Linguistic complexity loss in text-based therapy.
J. Wei, K. Finn, E. Templeton, T. Wheatley, and S. Vosoughi.
NAACL '21 Few-shot text classification with triplet networks, data augmentation, and curriculum learning.
J. Wei, C. Huang, S. Vosoughi, Y. Cheng, and S. Xu.
EACL '21 Text augmentation in a multi-task view.
J. Wei, C. Huang, S. Xu, and S. Vosoughi.
AAAI '21 Mitigating political bias in language models through reinforced calibration (outstanding paper).
R. Liu, C. Jia, J. Wei, G. Xu, L. Wang, and S. Vosoughi.
2019
EMNLP '19 Easy data augmentation techniques for boosting performance on text classification tasks.
J. Wei and K. Zou.