In their recent work, Anthropic provide a succinct definition of in-context learning:

In-context learning is where an LLM learns using just the information provided within the prompt, without any later fine-tuning.

Drake meme that reads "Add examples to the model prompt" on top and "Condition the model through in-context learning with few-shot demonstrations" on bottom. From Jo Kristian Bergum on Twitter.

Anthropic, along with many others including me, find that in-context learning helps. In-context learning techniques have blown up in the last few years, including retrieval-augmented generation, memory augmentation, few-shot learning, and more. Few-shot learning is a specific in-context learning technique that involves providing a discrete set of examples in the prompt: increasing the number of provided examples/“shots” increases performance on a wide range of tasks, with diminishing returns (i.e., power-law performance gains). For in-context learning in general, many papers have found that the benefits scale according to a power law, e.g. in 2020, 2023, and 2024. Many of the remarkable abilities of LLMs on custom tasks may be primarily attributable to in-context learning (see “Are Emergent Abilities in Large Language Models just In-Context Learning?”).
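
To make the terminology concrete, here is a minimal sketch of few-shot prompting: the model's weights never change, and the "learning" consists entirely of demonstrations concatenated into the prompt ahead of the query. The sentiment task and the commented-out `complete` call are hypothetical stand-ins, not any particular API.

```python
# Minimal sketch of few-shot prompting: no weight updates, just demonstrations
# placed in the prompt ahead of the query.
demonstrations = [
    ("The food was cold and the service was slow.", "negative"),
    ("Absolutely loved the atmosphere and the staff.", "positive"),
    ("It was fine, nothing special.", "neutral"),
]

def build_few_shot_prompt(query: str) -> str:
    shots = "\n\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in demonstrations
    )
    return f"{shots}\n\nReview: {query}\nSentiment:"

prompt = build_few_shot_prompt("Great coffee, but the seating is cramped.")
print(prompt)
# answer = complete(prompt)  # a zero-shot prompt would omit the shots entirely
```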

But why does in-context learning “work”? It does seem to help, but our knowledge of why is still preliminary. I’ve been collecting works with something to say about that topic, and I’m actively interested in seeing more work along these lines. Please send me additional papers (on Mastodon or Twitter).

Samuel Müller has recently been pushing the idea that in-context learning “works” because it attempts to approximate the posterior predictive distribution (see “Transformers Can Do Bayesian Inference”).
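
In Bayesian terms (my notation, not necessarily the paper's), the claim is that a model conditioned on $k$ in-context examples $D = \{(x_i, y_i)\}_{i=1}^{k}$ behaves as if it were approximating the posterior predictive distribution for a new input $x$:

$$
p(y \mid x, D) = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta
$$

where $\theta$ ranges over latent hypotheses about the task. The transformer never computes this integral explicitly; the argument is that its next-token predictions approximate the result.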

Laura Ruis says on Twitter that, for instruction-fine-tuned models, “although few-shot prompting helps performance, it probably just helps formatting and doesn’t teach the models pragmatics.” Here’s Ruis et al. in “The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs”:

In-context prompting can mitigate the fact that some models are better at natural prompts and others better at structured prompts by improving performance on the type of prompt the model struggles with zero-shot.

From the sharp rise in performance observed for the k = 0 to k = 1 result (from 60.2% to 72.8%) we hypothesise that the k-shot in-context examples in this task do not necessarily teach the model pragmatics in-context, but prime the model for the task format.

Based on some experiments with randomizing the labels provided with in-context examples, Ruis et al. conclude that “for base models the content of the in-context prompt seems important, whereas for [example-level instruction-fine-tuned] models the in-context examples mainly serve as a primer for the task structure.”

In “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?”, Min et al. find that randomizing the labels provided during few-shot learning still produces in-context learning benefits. From the abstract:

We show that ground truth demonstrations are in fact not required—randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multi-choice tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence.

Min et al.’s finding is consistent with later work (like Ruis et al.’s discussed above) that format is a critical aspect of in-context learning. (An important implication of this work is that researchers are probably under-reporting zero-shot performance in cases where additional unlabeled data is available.)
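
Here is a sketch of the label-randomization ablation both papers run: keep the demonstration inputs and the prompt format fixed, but swap each gold label for one drawn at random from the label space; if accuracy barely moves, the labels themselves were not doing the work. The task and labels below are hypothetical toy stand-ins.

```python
import random

LABEL_SPACE = ["positive", "negative", "neutral"]

demonstrations = [
    ("The plot dragged and the ending made no sense.", "negative"),
    ("A warm, funny film with a terrific cast.", "positive"),
    ("It exists. That's about all I can say.", "neutral"),
]

def build_prompt(demos, query):
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{shots}\n\nInput: {query}\nLabel:"

def randomize_labels(demos, rng=random):
    # Same inputs, same format, but the labels no longer match the inputs.
    return [(x, rng.choice(LABEL_SPACE)) for x, _ in demos]

query = "I walked out halfway through."
gold_prompt = build_prompt(demonstrations, query)
random_prompt = build_prompt(randomize_labels(demonstrations), query)
# Comparing accuracy with gold_prompt vs. random_prompt over a test set is the
# ablation; Min et al. report that the gap is surprisingly small.
```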

As summarized by Sebastian Ruder: Prystawski et al. “find that chain-of-thought reasoning is only useful when the training data is locally structured. In other words, when examples are about closely connected topics as is common in natural language. They find that chain-of-thought reasoning is helpful because it incrementally chains local statistical dependencies that are frequently observed in training.” See “Why think step by step? Reasoning emerges from the locality of experience”.

Ruder also summarizes the findings in “Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning”: “examples that are probable based on a latent concept of a task are useful demonstrations”.

In “The Power of Noise: Redefining Retrieval for RAG Systems”, Cuconasu et al. found that RAG improves accuracy even when documents are randomly selected. To my mind, that’s more evidence toward the “distribution” theory: any vaguely coherent text provides information about the appropriate output distribution.
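
Here is a rough sketch of that comparison, with a toy corpus and a toy lexical-overlap retriever standing in for the paper's actual setup:

```python
import random

corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Great Barrier Reef is the world's largest coral reef system.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(query, k=2):
    # Toy lexical-overlap scoring standing in for a real dense retriever.
    overlap = lambda doc: len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_rag_prompt(docs, question):
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "In what year was the Eiffel Tower completed?"
retrieved_prompt = build_rag_prompt(retrieve(question), question)
random_prompt = build_rag_prompt(random.sample(corpus, 2), question)
# Comparing answer accuracy under these two conditions is the experiment in
# question; the surprising finding is that randomly selected documents can still help.
```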

In “The Expressive Power of Transformers with Chain of Thought”, Merrill and Sabharwal argue:

Chain of thought (and similar prompting approaches that produce a “scratchpad” of intermediate results) enables transformers to solve sequential reasoning problems that they otherwise are incapable of.
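
As a minimal illustration of the scratchpad idea, take the parity of a bit string, a standard example of a sequential task; the prompts below are hypothetical, not the paper's:

```python
# Direct answer vs. a "scratchpad" of intermediate steps for a sequential task.
bits = "1011001"

direct_prompt = f"Does {bits} contain an even or odd number of 1s? Answer:"

scratchpad_prompt = (
    f"Does {bits} contain an even or odd number of 1s?\n"
    "Work through the string one bit at a time, keeping a running count of 1s,\n"
    "then state the final answer.\nScratchpad:"
)
# The intermediate tokens the model emits for scratchpad_prompt serve as extra
# sequential computation steps, which is roughly what the paper formalizes.
```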

In “The Learnability of In-Context Learning” at NeurIPS’23, Wies et al. derive theoretical results that align with some of the empirical results described above: “in-context learning is more about identifying the task than about learning it, a result which is in line with a series of recent empirical findings”.

Again, these are work-in-progress notes. Please send me additional research with insights on why in-context learning works!