137 emergent abilities of large language models

Emergent abilities are not present in small models but can be observed in large models.

In Emergent abilities of large language models, we defined an emergent ability as an ability that is “not present in small models but is present in large models.” Is emergence a rare phenomenon, or are many tasks actually emergent?

It turns out that more than 100 examples of emergent abilities have already been empirically discovered by scaling language models such as GPT-3, Chinchilla, and PaLM. To facilitate further research on emergence, I have compiled a list of emergent abilities in this post.


Emergent few-shot prompted tasks

First, emergent few-shot prompted tasks are those for which small models perform at random chance and large models perform well above random (a minimal sketch of this criterion follows the MMLU list below). By far the largest sources of emergent tasks are BIG-Bench and Massive Multitask Language Understanding (MMLU), with 67 and 51 emergent tasks respectively. Here are the tasks:

BIG-Bench (67 tasks):

MMLU (51 tasks; see the Chinchilla paper for results):

  • Chinchilla 7B (7 tasks): Professional Medicine, High School Statistics, High School Macroeconomics, High School Psychology, Anatomy, High School Government and Politics, High School Microeconomics
  • Chinchilla 70B (44 tasks): International Law, Human Aging, Sociology, US Foreign Policy, High School World History, Marketing, Logical Fallacies, Miscellaneous, College Biology, High School US History, Security Studies, High School European History, High School Geography, Computer Security, Human Sexuality, Astronomy, Prehistory, Philosophy, Jurisprudence, Management, Moral Disputes, High School Biology, Professional Psychology, World Religions, Nutrition, Clinical Knowledge, Business Ethics, Medical Genetics, High School Computer Science, Public Relations, College Medicine, Conceptual Physics, Electrical Engineering, High School Chemistry, Machine Learning, Professional Accounting, Professional Law, Virology, Econometrics, College Physics, Elementary Mathematics, Moral Scenarios, Formal Logic, High School Physics
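
To make this criterion concrete, here is a minimal sketch of how one might flag a task as emergent from a scaling curve. The function, the tolerance and margin thresholds, and the accuracy numbers are illustrative assumptions, not taken from any of the papers above.

```python
# Minimal sketch: a task is "emergent" if the smallest model scores at
# (roughly) random chance while the largest model scores well above it.
# The tolerance and margin values below are arbitrary assumptions.

def is_emergent(accuracy_by_params: dict[float, float],
                random_chance: float,
                tolerance: float = 0.02,
                margin: float = 0.10) -> bool:
    """Flag a task as emergent given accuracy at each model size.

    accuracy_by_params maps parameter count -> accuracy in [0, 1];
    random_chance is the accuracy of guessing (0.25 for 4-way choice).
    """
    sizes = sorted(accuracy_by_params)
    small, large = sizes[0], sizes[-1]
    small_at_chance = accuracy_by_params[small] <= random_chance + tolerance
    large_above_chance = accuracy_by_params[large] >= random_chance + margin
    return small_at_chance and large_above_chance

# Hypothetical scaling curve for a 4-way multiple-choice task:
curve = {8e9: 0.26, 62e9: 0.27, 540e9: 0.55}
print(is_emergent(curve, random_chance=0.25))  # True: flat, then above chance
```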

In addition to these large repositories of tasks, several papers have also shown individual tasks as emergent abilities:

  • GPT-3 paper: 3-digit addition/subtraction (GPT-3 13B), 4-5 digit addition/subtraction (GPT-3 175B), leveraging few-shot examples for word denoising (GPT-3 13B); a sketch of a few-shot addition prompt follows this list

  • Gopher paper: Toxicity classification (Gopher 7.1B), TruthfulQA (Gopher 280B)

  • Patel & Pavlick: grounded conceptual mappings (GPT-3 175B)

  • PaLM paper: Word in Context benchmark (PaLM 540B)
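
As a concrete illustration of what “few-shot prompted” means for a task like 3-digit addition, here is a minimal sketch of how such a prompt could be assembled. The exemplar wording and the helper function are my own assumptions, not the exact format used in the GPT-3 paper.

```python
# Minimal sketch of a few-shot arithmetic prompt: worked exemplars followed
# by the query, leaving the answer for the model to complete. The "Q:/A:"
# format here is an assumption, not the GPT-3 paper's exact template.

def make_addition_prompt(a: int, b: int, shots: list[tuple[int, int]]) -> str:
    """Build a few-shot prompt from worked addition exemplars plus one query."""
    exemplars = [f"Q: What is {x} plus {y}?\nA: {x + y}" for x, y in shots]
    exemplars.append(f"Q: What is {a} plus {b}?\nA:")
    return "\n\n".join(exemplars)

print(make_addition_prompt(123, 456, shots=[(217, 384), (905, 88)]))
```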

Emergent prompting strategies

Whereas emergent prompted tasks concern performance on a particular dataset, the second category of emergence covers few-shot prompting strategies: general ways of prompting that only work for language models of sufficiently large scale. One such strategy documented in the literature is chain-of-thought prompting, in which the few-shot exemplars include intermediate reasoning steps before the final answer.
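
Below is a minimal sketch of such a prompt, assuming a simple single-exemplar format; the helper function is illustrative, and the exemplar is adapted from the chain-of-thought prompting paper.

```python
# Minimal sketch of a chain-of-thought prompt: the few-shot exemplar shows
# intermediate reasoning before the final answer, so the model imitates the
# step-by-step format. Exemplar adapted from the chain-of-thought paper.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11."
)

def cot_prompt(question: str) -> str:
    """Prepend the reasoning exemplar, then leave the answer open-ended."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

print(cot_prompt("A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?"))
```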

Looking forward

Given these new abilities of language models, I think there are several promising future research directions, beyond simply scaling up.

  1. Can we improve model architectures? E.g., sparsity, external memory, or better training objectives.

  2. Can we improve data quality and quantity? Training for longer increases pre-training compute but not inference compute.

  3. Better prompting. How can we extract the most performance out of an existing language model?

  4. Frontier tasks. What tasks can language models currently not perform that we should evaluate on future, more capable language models?

  5. Why do emergent abilities occur, and can we predict them? E.g., do language models learn compositional abilities that enable them to solve harder problems?

Overall, the existence of emergent abilities implies that scaling further would unlock even more emergent abilities. This idea is super exciting to me. If I missed any emergent abilities, feel free to email me and I’ll add them to the list! jason.weng.wei@gmail.com

Thanks to Yi Tay for feedback on this blog post.
