Research I enjoy

Doing research that is enjoyable is critical to producing outstanding work. Research is a long-term endeavor (done over decades!), involving challenges, failures, and drama. It will be hard to sustain a career doing work that is not enjoyable.

In this blog post I reflect on my work in the past few years, focusing in particular on how I feel about the papers I’ve written. I realized that I’m heavily motivated by how much impact my research has on the community, and so research that makes me happy roughly translates to the following things.

Research I enjoy…

  • Is about a general idea

  • Demonstrates thought leadership or influences the community

  • Aims towards artificial general intelligence (AGI)

Research I've done before and now try to avoid…

  • Is task-specific

  • Is not of interest to the general community

  • Has a weakness that could render it quickly stale

Looking back at my work from the past few years, here are the papers that I remember fondly:

  • Chain-of-thought prompting elicits reasoning in large language models, NeurIPS 2022. It’s easy for me to like this paper since Sundar presented it at Google I/O, but the real reasons I like chain of thought (CoT) are more intrinsic. For example, CoT can work for any task; CoT performance improves with model scale; and CoT does not require finetuning. In terms of impact, I think of CoT as a framework for how to prompt language models, and people have told me that it changed their minds about the kinds of tasks that language models can do.

  • Scaling instruction-finetuned language models, 2022. This paper scaled instruction finetuning in terms of number of tasks and model size. Although the empirical results were strong, the more important implication is that instruction finetuning is likely to stand the test of time and continue improving performance as we get more data and use better pre-trained models. The positive response from the community to our Flan-T5 models was also great to see. I also really liked the first Flan paper, but at this point it seems basically like a trial run for this paper.

  • Emergent abilities of large language models, TMLR 2022. Although this paper has no experiments, I think the way that we presented emergent abilities put the existing literature together in a digestible way. The deeper point here is the implication that continued scaling may unlock even more abilities that are not present in small models. What's also special for me about this paper is that the framing went through 3-5 iterations with a diverse set of co-authors before we finally converged on the final one, and that process was a lot of fun.

Conversely, other papers I’ve worked on feel a bit less fulfilling now, for reasons that are clear in retrospect. (I feel bad critiquing this work since it also represents co-authors, but I’m first author on all of these, so I’m happy to take the hit for the limitations. Also, I learned a lot from these projects and met amazing collaborators, so these papers were fulfilling in that respect.)

  • Frequency effects on syntactic rule-learning in transformers, EMNLP 2021. In my view this paper was quite rigorous in the way the experiments were designed and executed, but the main limitation is that we only focused on a single model size and a single narrow task (subject-verb agreement). This niche setup hurt the generality of the paper quite a bit and made it unclear how to extrapolate the findings to other settings.

  • A cognitive regularizer for language modeling, ACL 2021. I liked this paper a lot when I wrote it, since it combines information theory with deep learning. In retrospect, the topic was too narrow for my taste. At best, only a subset of the NLP community would be interested in it, and the benefit of cognitive regularizers seems to diminish as we have more data. Similar critique for my other computational linguistics paper.

  • Easy data augmentation techniques for boosting performance on text classification tasks, EMNLP 2019. This paper has garnered more than 1K citations as of the time I’m writing this blog post, and I do think there are some scenarios where data augmentation is critical. However, I am not sure that boosting performance by a few percent (which is basically the point of data augmentation) is a “game-changing” direction, especially since a lot of the gains go away with large data or model pre-training. Similar critique for my other data augmentation papers.

Overall, the biggest lesson I've learned is the importance of choosing a research style that makes me happy. The topic determines the best-case scenario for the outcome of a research project, and if the topic is too narrow, the impact will be capped no matter how well the work is executed.

(This blog post represents my personal opinions only, and not those of my employer.)
