YouTube video summary

Stanford CS25: Transformers United V6 I Distinct Modes of Generalization from Parameters and Context

Artificial Intelligence

25 May 2026 · 22 min summary · From Stanford Online

Introduction and Background on Andrew Lampinen and Research Focus

Andrew Lampinen, a member of the technical staff at Anthropic and former staff research scientist at Google DeepMind, is introduced, with research interests spanning AI and cognitive science, including learning, generalization, and representation in language models 10s.
The question of what computational principles are shared across natural and artificial intelligence is posed, with a focus on how language models generalize from what they learn and whether this can provide insights into how natural intelligence systems work 2m6s.
Language models are noted to learn in two different ways: through optimizing network parameters over a large training corpus to encode knowledge or skills, and through learning from information in context, which allows them to use new information to learn new things 4m30s.
The ability to conduct controlled experiments on language models is highlighted, enabling the study of how these models use novel information to make predictions and respond to tasks, which would be challenging to do with natural intelligences 6m15s.

Language Model Learning Mechanisms and Generalization

A series of experiments is mentioned, which found significant differences in how language models generalize from information stored in their parameters versus information in context, leading to further research on bridging this generalization gap 8m20s.
The inspiration for this work is attributed to a paper called "The Reversal Curse," which demonstrated that language models struggle to reverse relations they have learned in training, although the issue is not observed when the same information is given to the model in context, as in a chat 10m50s.
Researchers investigated how language models generalize from information they encounter in their context versus information stored in their parameters, and they found that in-context learning can generalize better than fine-tuning to certain kinds of tests, such as reversals of relations and syllogistic generalizations 10s.
The researchers took various data sets, fine-tuned the model on them in the standard way, and also put the entire data set in context to teach the language model about the information, and then tested how the model makes generalizations from the data 2m6s.
The results showed that the fine-tuned model is at chance on reversal accuracy, but when the entire data set is put in context, the model answers the reversals with 99% accuracy, indicating a big gap between the model's ability to generalize from information it was fine-tuned on and its ability to generalize the same information in context 4m30s.
The researchers also tested syllogistic generalization, where they set up syllogisms with nonsense nouns and found that in-context learning is able to generalize better than fine-tuning, and they believe that this is because there are many structures of these examples on the internet in the training data 6m40s.
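
The two conditions compared above (fine-tuning on the data versus placing the whole data set in context) can be sketched as follows. This is a minimal illustration, not the study's code: the `Fact` class, the nonsense entities, the relation phrasing, and `answer_fn` (a stand-in for whatever model call is used) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str   # forward phrasing, as seen in training
    inverse: str    # reversed phrasing, used only at test time
    obj: str

    def forward(self):
        return f"{self.subject} {self.relation} {self.obj}."

    def reversal_question(self):
        # The prompt starts from the object; the expected completion is the subject.
        return f"{self.obj} {self.inverse}", self.subject

# Hypothetical nonsense entities, so a model cannot rely on prior knowledge.
FACTS = [
    Fact("femp", "is more dangerous than", "is less dangerous than", "glon"),
    Fact("zork", "is more dangerous than", "is less dangerous than", "blick"),
]

def finetuning_examples(facts):
    """Parametric condition: one training document per forward statement."""
    return [f.forward() for f in facts]

def in_context_prompt(facts, question):
    """In-context condition: the whole data set, then the test question."""
    return "\n".join(f.forward() for f in facts) + "\n" + question

def reversal_accuracy(answer_fn, facts, use_context):
    """Fraction of reversal questions that answer_fn (hypothetical) completes correctly."""
    correct = 0
    for f in facts:
        question, target = f.reversal_question()
        prompt = in_context_prompt(facts, question) if use_context else question
        correct += int(answer_fn(prompt).strip() == target)
    return correct / len(facts)
```

In the fine-tuned condition the model would be trained on `finetuning_examples(FACTS)` and then scored with `use_context=False`; in the in-context condition the untouched model is scored with `use_context=True`.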

In-Context Learning vs. Fine-Tuning in Generalization

The question of why fine-tuning does not allow the models to learn how to generalize these structures appropriately is still being investigated, but one possibility is that it is something about fine-tuning itself, such as the models not really learning that much in fine-tuning 10m0s.
The hypothesis that pre-training models from scratch would enable them to generalize relations more appropriately was tested by creating a small language model and a large data set with 20,000 relationships, where 1% of the reversals were held out, and the results showed that the models were able to recall trained relations and use them flexibly in context, but failed to generalize to the held out reversals 10s.
The data set consisted of prefix data for diversity, trained relations such as "X contain Y", and in-context learning training sequences where a forward relation was followed by a reverse relation, and the models were able to reverse the trained relations in context, but not the held out reversals 42s.
The phenomenon of lack of generalization to reversals was found to be fundamental to how the models generalize from relational information in their training data, and not just a result of fine-tuning, as the models were able to recall the information well and use it flexibly in context, but not generalize to the held out tests 2m6s.
The experiment was not limited to reversals, as other structures such as codebooks were also tested, where the model was trained on information about how to encode different kinds of information in different languages, and the results showed that the models were able to recall the information well and use it flexibly in context, but not generalize to the held out tests 4m10s.
The models were able to generalize to novel sequences with known encodings, but not to the held out tests, and the question of what percentage of data with reversals is needed to achieve generalization was raised, but not answered, as the experiment did not try gradually increasing the percentage of data with reversals 8m30s.
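
One way the from-scratch data set described above might be laid out is sketched below. The entity names, the exact relation wording, and the choice to pair each non-held-out forward relation with its reversal in a single sequence are assumptions made for illustration; the summary only states that there were 20,000 relationships with 1% of reversals held out.

```python
import random

def make_pretraining_corpus(n_relations=20_000, holdout_frac=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical entity tokens; each relation links two distinct entities.
    pairs = [(f"e{2 * i}", f"e{2 * i + 1}") for i in range(n_relations)]
    n_held = int(n_relations * holdout_frac)
    held_out = set(rng.sample(range(n_relations), n_held))

    train_docs, heldout_tests = [], []
    for i, (x, y) in enumerate(pairs):
        forward = f"{x} contains {y}"
        reverse = f"{y} is contained in {x}"
        train_docs.append(forward)  # every forward relation is trained
        if i in held_out:
            # The reversal never appears in training; it is only tested.
            heldout_tests.append((f"{y} is contained in", x))
        else:
            # In-context demonstration: forward relation followed by its reversal.
            train_docs.append(f"{forward} . {reverse}")
    rng.shuffle(train_docs)
    return train_docs, heldout_tests
```

Scoring a trained model on `heldout_tests` then measures exactly the case the summary describes: the forward relation is well learned, but the reversed form was never shown.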

Language and Encoding Considerations in Model Training

The encoding of propositions was done in English, but it is not clear if other languages were used, as the question of whether the propositions were always encoded in English was asked, but not fully answered 11m40s.
Models trained from scratch are all token-based and do not know anything about English or any other language, but the tokens can be thought of as formal statements, similar to mathematical operators, allowing for a more formal calculus approach 10s.
The reversal behavior in models is not architecture-independent, but it is a systematic feature of architectures that do causal next token prediction without using tricks to avoid it, and certain modifications, such as using a bidirectional transformer or changing the learning objective, can fix this issue 2m6s.
Even when models are trained from scratch with enough data to support generalization, they may still fail to generalize certain structures, suggesting that they learn these structures from the data, and this can be attributed to the models' inability to effectively generalize latent information conveyed in the training data 4m30s.

Parametric vs. In-Context Learning and Their Roles in Generalization

Training data can have explicit information and also convey latent information, such as multihop or syllogism-like data, and models may not always be good at generalizing this latent information, leading to issues with cross-lingual generalization or generalization to alternative goals 6m20s.
Parametric learning consolidates information across many documents, making the model less flexible in using that information at test time, whereas in-context learning provides information in richer detail, allowing for more flexible use at test time, and both types of learning are important for the model's generalization capabilities 10m30s.
The distinction between parametric and in-context learning is important, as parametric learning allows models to learn how to use information more flexibly at test time, while in-context learning is better at using specific pieces of information flexibly, and both are necessary for effective model generalization 12m40s.
Parametric learning is better at extracting statistical structures that are common across many different documents, including structures that support flexible in-context learning, and it can be seen in examples from a paper on deep sequence models 10s.
Language models often generalize better than 0% in practice, and they can make generalizations based on word co-occurrences alone, such as inferring that eagles have wings from examples of other birds having wings and flying 2m6s.
Parametric learning can do types of generalization that work around its failures to do more structured relational generalization, but this can also lead to incorrect generalizations, and models will learn things that are useful in expectation but can lead to mistakes in particular instances 4m30s.
The use of word co-occurrences is a legitimate mode of generalization, and models should integrate information across examples to infer generalizations, such as that birds that fly typically have wings, but this can also lead to mistakes if the co-occurrence structures are not licensed 6m40s.
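
The eagle example can be made concrete with a toy co-occurrence count. This is only an illustration of the statistical shortcut, with a made-up four-sentence corpus and a deliberately crude association score; it is not how a language model computes anything internally.

```python
from collections import Counter, defaultdict

# Made-up corpus: the sentence about eagles never mentions wings.
corpus = [
    "sparrows are birds that fly and have wings",
    "robins are birds that fly and have wings",
    "eagles are birds that fly",
    "trout are fish that swim and have fins",
]

def cooccurrence_counts(sentences):
    """Count how often each pair of distinct words shares a sentence."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = set(sentence.split())
        for w in words:
            for v in words:
                if w != v:
                    counts[w][v] += 1
    return counts

def association(counts, context_words, candidate):
    """Sum of co-occurrence counts between the candidate and each context word."""
    return sum(counts[w][candidate] for w in context_words)

counts = cooccurrence_counts(corpus)
eagle_context = ["birds", "fly"]  # what the corpus says about eagles
scores = {c: association(counts, eagle_context, c) for c in ("wings", "fins")}
# "wings" scores higher than "fins" even though no sentence links eagles to wings.
```

The same statistic illustrates the failure mode noted above: a sentence saying only that some animal is a bird would also score highly for "fly", which is wrong for flightless birds.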

Statistical Generalization and Its Limitations

Statistical learning is helpful on average, but it will get certain kinds of things wrong, and more data is needed to fix these mistakes, and models may be able to do more sophisticated things in context or using other information sources or methods 8m50s.
The concept of generalization can be thought of in terms of compression, and it is a hard problem to get everything right all the time, not just for models, but also for humans, and it requires a refined understanding of the world to repair mistakes 11m20s.
Compression can be useful for generalization, but it can also lose important information, creating a tension between the benefits of compression and the potential loss of valuable data 10s.
The question of whether models can generalize even when there are no statistical cues is being explored, with potential paths to achieving this including train-time offline methods and online methods that retrieve information from memory at test time 4m30s.

In-Context Augmentation and Distillation Strategies

One proposed method is to augment the training data using in-context learning and then distill that back into the model's parametric knowledge, which can be done by prompting the model with the data set in context to generate reasoning traces that elaborate upon the documents 6m20s.
This augmented fine-tuning strategy can generalize as well as in-context learning on reversal tasks and possibly even better on tasks like syllogisms, as it allows the model more chances to fill in missing links 10m10s.
The use of in-context learning to augment the fine-tuning data can improve the model's generalization, and this approach does not involve creating new information, but rather uncovering information that is already present in the data 12m40s.
A reaction against the idea of synthetic data discovering new knowledge has been noted, with some researchers arguing that data processing inequality means that new knowledge cannot be discovered from synthetic data generated from existing knowledge, but this perspective may be misleading 15m0s.
The process of taking implicit information from data and making it explicit and easily accessible to the model is a key aspect of in-context augmentation, allowing the model to extract and utilize latent information, such as the implied statement "no X are Z" from the given statements "all X are Y" and "no Y are Z" 10s.
Researchers have explored this concept in various papers, including one by Tania Lombrozo titled "Learning by Thinking in Natural and Artificial Minds" and another from Stefano Ermon's group at Stanford titled "A Theory of Usable Information under Computational Constraints" 2m6s.
A major challenge with in-context augmentation is that it is not always possible to know how information will be useful in the future, making it difficult to augment the data ahead of time, as illustrated by the example of reading a paper and trying to relate it to future research 4m30s.
To address this issue, researchers considered the role of retrieval in allowing models to use information even if its usefulness was not known at training time, and proposed using episodic memory, which in humans depends on the hippocampus, to retrieve knowledge and make it available to the model 6m20s.
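
The idea of making latent statements explicit before training can be sketched with a rule-based stand-in. In the talk the elaborations are generated by prompting the language model with the data in context; the deduction loop below is a hypothetical substitute that only handles two syllogism patterns, and the nonsense nouns are invented.

```python
def augment_with_syllogisms(statements):
    """statements: set of (quantifier, A, B) meaning 'all A are B' or 'no A are B'.

    Returns only the newly derived statements.
    """
    derived = set(statements)
    changed = True
    while changed:
        changed = False
        new = set()
        for (q1, a, b) in derived:
            for (q2, c, d) in derived:
                if q1 == "all" and b == c and a != d:
                    # all A are B, all B are D  =>  all A are D
                    # all A are B, no B are D   =>  no A are D
                    new.add((q2, a, d))
        if not new <= derived:
            derived |= new
            changed = True
    return derived - set(statements)

def render(statement):
    """Turn a tuple back into a training sentence."""
    quantifier, a, b = statement
    return f"{quantifier.capitalize()} {a} are {b}."

# Hypothetical nonsense nouns.
original = {("all", "glons", "femps"), ("no", "femps", "zorks")}
augmented_docs = sorted(render(s) for s in augment_with_syllogisms(original))
```

The derived sentences would be added to the fine-tuning set alongside the originals, which is the "uncovering information already present in the data" point made above: nothing new is asserted, but the implied link no longer has to be inferred at test time.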

Episodic Memory and Retrieval in Model Generalization

Experiments were conducted using an oracle episodic memory, which has perfect recall but imperfect precision (it always returns the relevant documents, along with some irrelevant ones), and the results showed that parametric learning and episodic retrieval performed equally well on generalization tasks, but episodic retrieval outperformed parametric learning on reversal tests 10m30s.
The use of episodic retrieval allowed the system to generalize well even when retrieving distractors, and also performed well on a codebooks task, demonstrating the potential of episodic retrieval to support flexible generalization by bringing learning experiences into context 12m40s.
The concept of retrieval-augmented generation has been explored in previous work, showing its effectiveness in getting models to answer questions about factual information, and the current research builds upon this foundation 16m10s.
The mechanism by which models can use information more flexibly and answer questions better is being explored, even for documents that have been trained into the parameters, and having that information available in context can allow the models to use it more effectively 10s.
Models can extract relations from pre-training quite well, but they need to be cued in the right way, and as long as they are cued with the first part of the relation, the model will do a decent job of continuing with the second part, but they don't have a good way of going backwards unless that information is already in context 42s.
A natural question to ask is whether models could learn how to do episodic retrieval via reinforcement learning without needing to do explicit retrieval or augmentation, and the intuition is that models already know the relations and can state them in a new context 2m6s.
The approach to testing this involves fine-tuning a model on several datasets containing new information and then testing the model's generalization, as well as using reinforcement learning to teach the model how to reason on one of the datasets and then testing whether that allows the model to learn how to reason and generalize better on the other dataset 2m6s.
The results show that doing reinforcement learning on one dataset generalizes to reasoning on another dataset, whereas augmentation on one dataset does not generalize to another dataset, and the amount of generalization varies across conditions 4m30s.
Reversals do not get much generalization benefit from the in-context recall strategy, as opposed to retrieval, because solving a reversal type problem requires the model to recall the information in the other direction, which is hard, and the model is able to do this above chance on the test set by enumerating all the things it has seen in the dataset 6m40s.
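
The oracle episodic memory mentioned earlier in this section (perfect recall, imperfect precision) can be sketched as a retriever that always returns the relevant training documents and mixes in random distractors. The function names, the way relevance is supplied (`relevant_ids`), and the prompt layout are assumptions for illustration.

```python
import random

def oracle_retrieve(memory, relevant_ids, n_distractors, rng):
    """Return every relevant document plus randomly chosen distractors, shuffled."""
    relevant_set = set(relevant_ids)
    relevant = [memory[i] for i in relevant_ids]
    others = [i for i in range(len(memory)) if i not in relevant_set]
    chosen = rng.sample(others, min(n_distractors, len(others)))
    retrieved = relevant + [memory[i] for i in chosen]
    rng.shuffle(retrieved)  # the model is not told which documents matter
    return retrieved

def build_prompt(retrieved, question):
    """Place the retrieved training documents in context ahead of the question."""
    return "\n".join(retrieved) + "\n" + question

# Hypothetical stored training documents.
memory = [
    "femp contains glon",
    "zork contains blick",
    "dax contains wug",
    "tiv contains mep",
]
rng = random.Random(0)
docs = oracle_retrieve(memory, relevant_ids=[2], n_distractors=2, rng=rng)
prompt = build_prompt(docs, "wug is contained in")
```

Because the forward statement is now in context, the reversal question becomes an in-context problem rather than a parametric one, which is the mechanism the summary credits for the improved reversal performance.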

Reinforcement Learning and Generalization Across Tasks

The model's ability to generalize is limited by its need to enumerate all possible solutions, which is not a practical solution at scale, and this shows a kind of glass half full story 8m20s.
Models can learn via reinforcement learning to regenerate information needed in context at test time, which can help with some problems, but not all of them, and this approach is different from augmentation at train time 10s.
The model has to learn how to pull out relevant information to solve a task, and it can do this decently well for tasks like syllogisms, but it's harder for reversals, which are easier to solve with in-context augmentation 1m6s.
There are two paths to better generalization: a train-time or offline path where training data is augmented in context and then distilled into the model's parametric knowledge, and online test-time paths where the model either explicitly retrieves memory or is trained via reinforcement learning to implicitly retrieve information 2m6s.
The methods have different tradeoffs, including performance, with train-time augmentation achieving performance as good or better than in-context learning if the questions are known at test time, and test-time episodic retrieval and RL-based methods also achieving good performance on certain tasks 4m10s.
The differences between the methods are about what they cost and when, with train-time augmentation being expensive at training time but efficient at test time, test-time retrieval being free at training time but expensive at test time, and the RL-based method requiring extra inference compute at both training and test time 5m30s.
The RL-based method generalizes beyond anticipated tasks, but requires extra inference compute at training and test time, and the model has to generate a longer thinking trace to pull information out of its context 7m20s.

Comparative Analysis of Generalization Methods

The train-time augmentation and test-time retrieval methods are in some sense easier for the model because the information gets into its context, and all it has to do is use it in the right way 9m10s.
Test-time retrieval is considered the easiest because the system already knows which question the retrieved information will be used for, whereas train-time augmentation requires the model to anticipate the ways the information might be useful in the future 10s.
The model has to go through a lot of information to pull out something useful, especially for things like reversals, even though it knows what question it's trying to answer 42s.
Training data often convey more than their explicit content, and when a language model's parameters are trained on these data, what it learns might be overly tied to what was explicitly conveyed, leading to the question of whether natural intelligence faces similar challenges 2m6s.
Natural intelligence might use different strategies to bridge the generalization gap, such as augmenting training experiences offline and retrieving relevant information online, similar to the hippocampus in the brain, which does episodic memory and replays experiences to support generalization 2m6s.
The relationship between the hippocampus and the rest of the neocortex is complicated, with evidence suggesting that the hippocampus is doing things like offline augmentation, preemptive replay of future problems, and reorganizing experiences to support generalization 2m6s.
Natural intelligence might take a "throw everything at the wall and see what sticks" approach to solving latent generalization problems by using both offline and online methods, with cortical and hippocampal learning being complementary, allowing for the integration of information and preservation of specific experiences 2m6s.

Computational Ingredients for Effective Generalization

The computational ingredients needed for generalization include consolidated information, procedures for reasoning about information flexibly in context, and episodic memory that preserves experiences in rich detail, allowing access to information that might be important later 2m6s.
These functions can be thought of as being similar to the functions of parameters in language models and the things that are put in context for them, at least with methods like retrieval-augmented generation 2m6s.
The discussion involves the relationship between offline and online hippocampal replay, and how the brain's processing might be more advanced than current methods, with the importance of collaborations and support from organizations like DeepMind being acknowledged 10s.
A question is raised about the impact of context size, number of parameters, and sparsity on the results, with the answer being that it is model-dependent, and experiments have shown that models can perform well with limited context but are sensitive to the type of information being retrieved 4m42s.

Context Sensitivity and Model Performance

The experiments have also shown that replacing real words with nonsense words in the context can significantly decrease the model's ability to answer questions, suggesting that retrieval is content-sensitive and depends on the model's pre-training 6m6s.
Larger models are found to be better at handling context and retrieval tasks, but there is no direct experimentation on the impact of sparsity, such as mixture of experts versus dense models, on the results 8m6s.
The role of attention in modulating in-context effects is highlighted, and it is suggested that sparsity in the model, particularly in the MLP layers, may not have a significant impact on the results, but further experimentation is needed to confirm this 10m6s.
The possibility of using sparse models, such as those with redundant parameters, to learn new things more easily is discussed, and it is noted that most of the experiments were done with dense models, with a comparison to sparse cases being an interesting area for future research 12m6s.
The problem of models not generalizing well can be attributed to the fact that parametric learning systems, such as the cortex, learn slowly to integrate information across experiences without interference, allowing for statistical generalization, while episodic memory stores experiences rapidly and replays them to integrate with other knowledge, and this process can be damaged if the model's parameters are updated too quickly from a single experience 10s.
Episodic memory allows for fast generalization by storing experiences rapidly and replaying them over time to integrate with other knowledge, preserving the slow learning properties of the cortex, and this process can be applied to models by storing experiences explicitly and retrieving or replaying them to learn over time 2m6s.
Models may not generalize well because they are too context-sensitive, distinguishing between different situations and not generalizing as far as desired, and retrieval of relevant examples could help with this problem, but may also cause other issues depending on the type of generalization desired 4m30s.

Data Augmentation and Generalization Trade-Offs

Data augmentation can achieve as good or better performance than in-context learning (ICL) and reinforcement learning (RL), and the basic situation in which data augmentation performs better is when the model has limited opportunities to learn from a single experience, such as in in-context learning where the model gets one chance to learn from a question 8m30s.
The comparison of pros and cons of different approaches, including data augmentation, ICL, and RL, is important to determine which approach performs better in different situations, and understanding the strengths and weaknesses of each approach can help improve model generalization and performance 10m0s.
The model's ability to generate thinking traces from the information in context allows it to sample a higher number of reasoning traces, making it more likely to chain together the right statements to make an inference, which is why it generalizes better on syllogistic tests with augmented fine-tuning than with in-context learning 10s.
When a question is asked without providing the necessary information in context, the model must rely on reasoning about things from the training corpus to figure out the answer, as there is no other way for it to come up with the correct response 42s.
The model's ability to generalize depends on the statistics of the training corpus, and it may not always be able to fill in the gaps correctly, especially when the statements do not sound similar or rhythmic 2m6s.
Training the model on completely abstract sequences, such as nonsense words, could potentially provide all the generalization capabilities, but the statistical structure of the training corpus is actually a valuable thing for the model to learn 2m6s.

Statistical Structure and Reasoning in Language Models

Research has shown that training pre-trained models on formal languages can lead to improvements in how they learn on real language, and the model's ability to reason about logical structures is dependent on the information and structures it is reasoning about, which is similar to how humans reason 2m6s.
Statistical structure is necessary to constrain the kinds of inferences a system makes, allowing it to efficiently arrive at a feasible answer, and this is important for understanding human reasoning and language models 10s.
The comparison between the hippocampus and neocortex, and between in-context learning and parametric learning, is an interesting area of study, with researchers like James Whittington exploring the analogy between transformer-style attention and hippocampal memory 42s.
The analogy between transformer style attention and hippocampal memory is intriguing because, despite their differences in implementation, they share similar computational properties, such as pulling information from a long sequence of past states according to weighted similarity-based lookup 2m6s.
The analogy between the hippocampus and language models breaks down at the implementation level, as the hippocampus is not using transformer style key query value attention, and also in terms of the amount of information that can be stored, with the hippocampus preserving a much larger array of information than a language model's context 4m30s.
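
The "weighted similarity-based lookup" that the analogy rests on is, on the transformer side, scaled dot-product attention. A minimal single-query version in plain Python is sketched below; the toy keys and values are invented, and real implementations are batched and run on tensors.

```python
import math

def attention_lookup(query, keys, values):
    """Single-query scaled dot-product attention over a sequence of past states."""
    d = len(query)
    # Similarity of the query to each stored key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax (shifted by the max score for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the similarity-weighted average of the stored values.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Two stored states; the query resembles the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention_lookup([5.0, 0.0], keys, values)
```

Here most of the weight lands on the first stored state, so the output is pulled toward its value. The lookup is soft: every past state contributes in proportion to its similarity to the query.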

Hippocampus and Transformer Attention Analogies

The hippocampus is capable of generative retrieval, reconstructing memories and strengthening their encoding each time they are recalled, which is different from how language models process information, and this is reflected in the fact that many episodic memories are constructed rather than being entirely accurate 6m40s.
The hippocampus and transformer attention have different rates of confabulation, with the hippocampus possibly having a higher rate, but the generative nature of the hippocampus may be more flexible in some ways, allowing for more sophisticated kinds of generalization 10s.
The ability of context to match training data is dependent on various factors, including the consistency of information in context, and models have a harder time reconciling inconsistent information, making it challenging to guarantee reliability 2m6s.
The trade-off between investing in training and using in-context learning is an interesting scenario, and while there haven't been explicit studies on this topic, there has been work on optimal language model scaling, which considers the trade-off between training compute and test-time performance 4m42s.
The idea of "Chinchilla-optimal" language model scaling is mentioned as an example of how to optimize model performance given a fixed training budget, and similarly, there may be scaling laws that can be applied to trade off factors such as data augmentation, new data collection, and compute time 6m15s.
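
For readers unfamiliar with the reference, the fixed-budget trade-off can be illustrated with two commonly quoted approximations that are not taken from the talk: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and roughly 20 training tokens per parameter at the compute-optimal point. Both constants are rules of thumb, and the function name is hypothetical.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training budget into parameters N and tokens D.

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget implied by a 70B-parameter model trained on 1.4T tokens under C = 6*N*D.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_allocation(budget)
```

With that budget the function recovers about 70 billion parameters and 1.4 trillion tokens. A scaling law for the trade-offs raised in the question would need analogous terms for augmentation and inference cost, which the summary notes have not been studied explicitly.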

Context Matching and Reliability Challenges

The importance of considering the cost and time of getting a system to perform a certain task is highlighted, and it is likely that researchers are exploring these issues, including the trade-offs between offline augmenting the training data and other approaches 8m30s.
The trade-offs of making a model better without much information include inference cost, which is at least 10 times more expensive than fine-tuning over a small data set, and hallucinations in the augmented data, which can be mitigated if they are independent from each other 10s.
Fine-tuning on a larger data set can distort the model's knowledge of other things, but regularization methods can counteract this effect, and the exact trade-offs depend on the regularization method used 2m6s.
The choice of prompt for generating augmented training data is important, as it should have some knowledge of the task but not be too specific, and the prompts used in the experiments were general but may not work well for other types of data, such as math reasoning 4m30s.

Trade-Offs in Model Optimization and Scaling

The prompts used in the experiments were designed to test factual knowledge and were of the character of rewriting a document and making connections between it and other documents in the corpus, but developing prompts that are sufficiently general to cover any type of data is a hard problem 6m40s.
The experiments tried not to focus too much on logical reasoning, and the prompts were designed to be more general, but the results may not be directly applicable to other types of data or tasks 10m0s.
The discussion concluded with an acknowledgment that it is challenging to achieve perfection, and the intention was not to give the impression of cheating, but rather to provide a genuine presentation 0s.
The time allocated for the discussion had come to an end, and the audience was thanked for their thought-provoking questions 0s.
A round of applause was given to the speaker as a gesture of appreciation for their presentation, and the speaker expressed gratitude to the attendees for coming 0s.
