YouTube video summary

Stanford Webinar - Large Language Models Get the Hype, but Compound Systems Are the Future of AI

Artificial intelligence

04 Dec 2024 · 21 min summary · From Stanford Online

Large Language Models and the Rise of Compound AI Systems

Large language models receive significant attention, but compound systems are the future of AI, and this concept is often overlooked due to the focus on models in headlines 11s.
The trend of emphasizing large language models began with the GPT-3 paper, which introduced a 175 billion parameter model, an order of magnitude larger than previous models, and demonstrated the potential of scaling to create new systems 56s.
The announcement of Google's PaLM model, with 540 billion parameters, is another example of this trend, where the focus is on the model rather than the system it is part of 1m30s.
Even companies like OpenAI, which focuses on full software systems, tend to announce new models, such as GPT-4, rather than the systems they power 2m7s.
The emphasis on models can create a misconception that they are the primary focus, but in reality, people only interact with systems, not individual models 42s.
A large language model is inert on its own and requires additional components, such as a prompt and a sampling method, to function as a system 2m57s.
The choices of prompt and sampling method are non-trivial and play a crucial role in creating a functional system 3m49s.
Compound systems such as Gemini and ChatGPT are already being developed and used, but they are often described in terms of their models rather than as whole systems 1m50s.
A minimal system in AI consists of a prompt, a model, and a sampling method, but modern systems are taking this further by giving these minimal systems access to calculators, programming environments, databases, web APIs, and the web itself, making them software systems with the language model as a hub 4m23s.
The capabilities of these systems are defined by how all components work together, not solely by the language model, which is the essence of the thesis that compound AI systems are the future of AI 4m43s.

The Shift from Models to Systems

This idea has been explored in a blog post titled "The Shift from Models to Compound AI Systems" from February 18, 2024, which emphasizes the importance of thinking in terms of systems rather than just models 5m0s.
The idea of shifting from models to systems has been echoed by others in the field, such as Sam Altman, who recently mentioned expecting a shift from talking about models to talking about systems 5m31s.
Investing in the right things when developing an AI solution requires thinking in terms of systems, rather than just focusing on the model, which is analogous to designing a Formula 1 race car where all components must work together to achieve success 6m16s.
Focusing solely on the model, like focusing solely on the engine of a Formula 1 race car, will not lead to a good overall system, and it's essential to think in terms of the entire system 7m26s.
Building the best system for specific goals and constraints will likely emphasize system components, and a small model embedded in a smart system will always be better than a big model in a simplistic system 7m48s.
When designing AI systems, considerations such as cost, latency, safety, and privacy are crucial, and in such cases, a small model might be the only viable choice, especially when cost is a significant factor 7m58s.
There is a need to shift the focus from regulating model artifacts to regulating entire systems, as focusing solely on models may both neglect dangerous systems and lead to overregulation 8m30s.

Sampling Methods and Their Importance

The method used for sampling when generating output from a model is a critical component of the system, and there are various methods available, including greedy decoding, top-p sampling, beam search, and insisting on token diversity 9m12s.
These sampling methods can be used to impose high-level ideals, such as ensuring generated output conforms to valid JSON or a specific grammar, and there are innovative ideas in the literature that make use of gradients and other techniques to achieve this 10m9s.
Researchers are exploring new methods to improve sampling, including adding parameters to the model to adaptively find a good temperature for creative or constrained output, depending on the task 11m1s.
The concept of sampling can be expanded to include creative exploration, such as majority completion strategies, which involve using the model to complete a prompt with a hard reasoning task 11m31s.
Large language models can generate answers in one step, but it might be more effective to have them generate a reasoning path and then produce an answer, allowing the model to explore and reason before providing a final answer 11m46s.
By sampling multiple reasoning paths, the model can produce a distribution of answers, and the most common outcome can be considered the final answer, giving the model the ability to explore and reason before producing a final answer 11m56s.
This approach is not strictly a sampling strategy, but rather a way to let the model explore and reason before producing a final answer, and it can be made to look like a simple sampling strategy by hiding the intermediate steps from the user 12m27s.
There is no one true sampling method for a model, and the choice of sampling method is highly consequential for the overall system, as it is what makes the model "speak" and can interact in complex ways with the language model 13m0s.
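The idea of sampling several reasoning paths and keeping the most common answer can be sketched as follows. This is a minimal illustration under stated assumptions: `make_scripted_sampler` is a hypothetical stand-in for repeated stochastic calls to a language model, and the function names are invented for this example.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple

# A sampler takes a question and returns (reasoning_path, final_answer).
Sampler = Callable[[str], Tuple[str, str]]

def self_consistency(sample_path: Sampler, question: str, n_paths: int = 5) -> str:
    """Sample n reasoning paths and return the most frequent final answer."""
    answers = [sample_path(question)[1] for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

def make_scripted_sampler() -> Sampler:
    """Hypothetical stand-in for a model sampled at non-zero temperature:
    it cycles through scripted paths, one of which reasons incorrectly."""
    script = cycle([
        ("6 groups of 2 is 6 * 2 = 12", "12"),
        ("6 and 2 together make 6 + 2 = 8", "8"),
        ("2 * 6 = 12", "12"),
    ])
    return lambda question: next(script)

print(self_consistency(make_scripted_sampler(), "What is 6 times 2?"))
```

The user of such a system would see only the final answer; the intermediate paths stay hidden, which is why the procedure can look like an ordinary sampling strategy from the outside.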

Prompting: The Heart of Modern AI Systems

Prompting is a crucial aspect of modern AI system development, and it is the heart of what makes these systems work, allowing for complex and interesting behaviors to be achieved through careful engineering of prompts 13m38s.
The origins of prompting in AI systems can be traced back to the GPT-2 paper from 2019, which demonstrated that language models can perform downstream tasks in a zero-shot setting without any parameter or architecture modification 14m12s.
The GPT-2 paper showed that by adding a simple prompt, such as "TL;DR", to the input text, the model could be induced to perform tasks such as summarization, translation, question answering, and reading comprehension 14m36s.
The concept of large language models has undergone significant development, from GPT-2 with 1.5 billion parameters to GPT-3 with 175 billion parameters, showcasing the impact of scaling on complex in-context learning 15m43s.
GPT-3 demonstrated successful question-answering capabilities by learning to imitate behavior from a prompt with a context passage, demonstrations, and a target question, leading to striking behavior where the model answers questions as substrings of the passage 16m28s.
The GPT-3 paper identified a general pattern for this behavior, consisting of context or instructions, a list of demonstrations, and a target, which was applied to various tasks such as QA, reading comprehension, and machine translation 16m41s.
A template for building systems in this modern mode has been established, allowing for exciting possibilities, but also highlighting the potential dark side of prompting, including sensitivity to prompt formatting 17m20s.
Research has shown that language models can be highly sensitive to prompt choice, with minor changes leading to significant differences in performance, emphasizing the importance of considering the model-prompt combination when evaluating a model's capabilities 18m12s.
The concept of systems thinking is essential in this context, where the focus should be on finding the optimal prompt-model combination to achieve specific goals, rather than evaluating the model in isolation 18m59s.
The idea of Chain of Thought reasoning has been explored in papers such as "EchoPrompt," which empirically examines different strategies for step-by-step reasoning, highlighting the importance of considering the prompting strategy when evaluating a model's capabilities 19m17s.
The performance of a model can vary greatly based on how a question is framed, as shown by the results of an experiment with the Da Vinci 2 model, highlighting the importance of considering model-prompt combinations when evaluating a model 19m38s.
Evaluating a model requires thinking in terms of model-prompt combinations, which is a systems-level approach rather than a model-level approach, and can bring clarity to development cycles 20m7s.
A tweet describing the experience of upgrading from GPT-4 to GPT-4o and having to re-engineer prompts highlights the inextricable link between models and prompts 20m25s.
Prompts are not just written in English, but are an effort to communicate with the language model, and understanding this link is crucial for developing effective systems 20m57s.
The Apple Intelligence prompts found in system files demonstrate the tightly knit relationship between prompts and models, with prompts being more like compiled binaries meant to be paired with a particular model 21m41s.
The DSPy library is promoted as a tool for developing systems that take the interaction between models and prompts into account, adopting a systems-level approach to AI development 22m37s.
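The general prompt pattern mentioned above (context or instructions, a list of demonstrations, and a target) can be sketched as a small template function. The function name, the `Passage:`/`Q:`/`A:` labels, and the example strings are assumptions chosen for illustration; they are not the exact format used in the GPT-3 paper.

```python
from typing import Optional, Sequence, Tuple

def build_prompt(instructions: str,
                 demonstrations: Sequence[Tuple[str, str]],
                 target_question: str,
                 context: Optional[str] = None) -> str:
    """Assemble a few-shot prompt: instructions, optional context passage,
    worked demonstrations, then the target question left open for the model."""
    parts = [instructions.strip()]
    if context:
        parts.append(f"Passage: {context.strip()}")
    for question, answer in demonstrations:
        parts.append(f"Q: {question}\nA: {answer}")
    # The final answer slot is left empty so the model completes it.
    parts.append(f"Q: {target_question}\nA:")
    return "\n\n".join(parts)

prompt = build_prompt(
    instructions="Answer each question with a span from the passage.",
    demonstrations=[("Who wrote the notes?", "Ada Lovelace")],
    target_question="When were the notes published?",
    context="Ada Lovelace wrote the notes, which were published in 1843.",
)
print(prompt)
```

Small changes to such a template (labels, separators, ordering) are exactly the kind of formatting choices that models are reported to be sensitive to, so the template is best treated as part of the system being evaluated.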

From Prompt Engineering to Language Model Programming with DSPy

The core lessons of artificial intelligence, including modular system design, data-driven optimization, and generic architectures, have been successful in part because they have been adopted throughout the field 22m51s.
The development of AI systems has moved rapidly, especially in the deep learning era, leading to the creation of exciting large language models, and libraries like Torch, Theano, Chainer, and PyTorch have embodied time-tested lessons that have helped speed up development 23m7s.
However, the current moment in AI system development is characterized by a lot of manual adjustments to prompts, resulting in complete model dependence, which is tragic given the progress made by time-tested lessons 23m34s.
DSPy is a programming model and library that aims to move away from prompt engineering and towards language model programming, honoring the insight that a prompt, language model, and sampling strategy together define a software system 24m3s.
DSPy allows users to define a system using high-level tools and then compile it down into an actual prompt to a language model, abstracting away some model dependence 24m34s.
A simple example in DSPy is a minimal system for basic question answering, which can be written in just one line of code and is then compiled down into an actual prompt to a language model 24m56s.
More complex programs can also be written in DSPy, such as a program for multihop question answering, which allows users to freely express the kind of system they want to develop in code 25m43s.
DSPy's design principles are closely aligned with those of PyTorch, and a system can be optimized to find a successful prompting strategy independent of the chosen tools 26m0s.
The optimizer in DSPy can simultaneously optimize the instructions and few-shot demonstrations, moving the burden of finding good ways of doing that onto the optimizer 26m29s.
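The declare-then-compile idea described above can be illustrated with a toy class. To be clear, this is not the library's API or implementation: `ToyPredict`, its `compile_prompt` method, and the prompt layout are all hypothetical, written only to show how a short declaration such as `"question -> answer"` might be turned into a concrete prompt whose instructions and demonstrations an optimizer could later fill in.

```python
from dataclasses import dataclass, field

@dataclass
class ToyPredict:
    """Hypothetical module: declares inputs and outputs, compiles to a prompt."""
    signature: str                      # e.g. "question -> answer"
    instructions: str = ""              # slot an optimizer could rewrite
    demos: list = field(default_factory=list)  # slot an optimizer could fill

    def __post_init__(self):
        inputs, outputs = self.signature.split("->")
        self.inputs = [name.strip() for name in inputs.split(",")]
        self.outputs = [name.strip() for name in outputs.split(",")]

    def compile_prompt(self, **kwargs) -> str:
        """Render instructions, demonstrations, and the open target slot."""
        default = f"Given {', '.join(self.inputs)}, produce {', '.join(self.outputs)}."
        lines = [self.instructions or default]
        for demo in self.demos:
            lines.append("")
            for name in self.inputs + self.outputs:
                lines.append(f"{name.capitalize()}: {demo[name]}")
        lines.append("")
        for name in self.inputs:
            lines.append(f"{name.capitalize()}: {kwargs[name]}")
        lines.append(f"{self.outputs[0].capitalize()}:")
        return "\n".join(lines)

# The system is declared in one line; the prompt text is derived from it.
qa = ToyPredict("question -> answer", demos=[{"question": "2+2?", "answer": "4"}])
print(qa.compile_prompt(question="3+3?"))
```

The design point is that the program text stays the same while `instructions` and `demos` change, so swapping the underlying model means re-running an optimizer rather than hand-editing prompt strings.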

Evaluating AI Systems: A Framework and Case Study

A framework for evaluation is presented, consisting of a program, an optimizer, and a language model, to specify a complete system and demonstrate the importance of thinking in systems terms 27m7s.
The baseline system, which goes from questions to answers, achieves a score of 34 for GPT-3.5 Turbo and 27.5 for Llama 2 13B, with the metric being exact match on the desired answer 27m47s.
Adding a simple DSPy program for retrieval-augmented generation results in a boost in performance, and using bootstrap few-shot optimization leads to significant gains, with scores increasing to 42 and 38 28m23s.
The use of ReAct agents, which lets the model reflect and think about how to solve the task, shows less success but still demonstrates the power of systems thinking 28m36s.
Human-written reasoning prompts underperform compared to prompts generated through simple bootstrapping, highlighting the power of data-driven optimization 28m58s.
A program designed for multihop reasoning, which gathers evidence from multiple passages, achieves high scores, with GPT-3.5 Turbo reaching almost 55 and Llama 2 13B reaching 50, demonstrating the power of intelligent system design and small models 29m37s.
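Scoring whole systems by exact match, as in the comparison above, can be sketched as follows. The dataset, the two toy systems, and the resulting scores are invented for illustration and have no relation to the numbers reported in the talk; the normalization (case and surrounding whitespace) is also an assumption.

```python
from typing import Callable, Sequence, Tuple

def exact_match(prediction: str, gold: str) -> bool:
    """Exact match after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(system: Callable[[str], str],
             dataset: Sequence[Tuple[str, str]]) -> float:
    """Percentage of questions the system answers exactly right."""
    hits = sum(exact_match(system(question), gold) for question, gold in dataset)
    return 100.0 * hits / len(dataset)

dataset = [
    ("capital of France", "Paris"),
    ("capital of Japan", "Tokyo"),
    ("capital of Peru", "Lima"),
    ("capital of Kenya", "Nairobi"),
]

# Toy "closed-book" system: only knows what is stored in its parameters.
PARAMETRIC_MEMORY = {"capital of France": "paris"}
def closed_book(question: str) -> str:
    return PARAMETRIC_MEMORY.get(question, "unknown")

# Toy retrieval-augmented system: same model, plus a lookup in a corpus.
CORPUS = {"capital of France": "Paris", "capital of Japan": "Tokyo",
          "capital of Peru": "Lima"}
def with_retrieval(question: str) -> str:
    return CORPUS.get(question, closed_book(question))

for name, system in [("closed-book", closed_book), ("retrieval", with_retrieval)]:
    print(f"{name}: {evaluate(system, dataset):.1f}")
```

What gets a score here is a callable system, not a model, which mirrors the framing of a program, an optimizer, and a language model being evaluated together.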

The Importance of System Design for Small Models

The importance of designing systems that can get the most out of small models is emphasized, as 77% of enterprise usage of models is at the 13 billion parameter size or smaller, according to an analyst at Theory Ventures 30m16s.
In industrial systems, latency is a significant concern, with ideal latency being around 18 milliseconds, but as latency increases above 50 milliseconds and up to 750 milliseconds, it can become expensive and potentially prohibitive 30m35s.
To get the most out of small models, it's essential to think about the systems being designed around them, with the prompt being a crucial factor in system performance 31m19s.

Tool Access and System Consequences

Tool access is an area where entire systems, not just models, are considered, involving calculators, programming environments, databases, the web, and web APIs 31m36s.
When designing systems, it's essential to consider the overall consequences, such as reliability, preference, and danger, rather than just focusing on technical details 31m57s.
A giant large language model with a snapshot of the entire web may be less reliable than a tiny language model working with an up-to-date web search engine 32m20s.
A small model doing autocomplete tasks locally on a phone may be preferred over a giant large language model doing contextless autocomplete via a centralized service 32m41s.
A 10 billion parameter language model with access to databases and the web may be more dangerous than GPT-4 with no access to these tools 33m5s.
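Tool access can be sketched as a thin software layer around the model. In this minimal illustration, `toy_model` is a hypothetical stand-in that requests a calculator call, and the `CALL calculator:` and `Tool result:` conventions are assumptions invented for the example rather than any real protocol.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str):
    """Evaluate basic arithmetic safely by walking the parsed syntax tree."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

def toy_model(prompt: str) -> str:
    """Hypothetical model: asks for a tool first, then reports the tool result."""
    if "Tool result:" in prompt:
        return "The answer is " + prompt.rsplit("Tool result:", 1)[1].strip()
    return "CALL calculator: 1234 * 5678"

def run_system(question: str) -> str:
    """Route tool requests from the model to the tool, then ask the model again."""
    reply = toy_model(question)
    if reply.startswith("CALL calculator:"):
        result = calculator(reply.split(":", 1)[1].strip())
        reply = toy_model(f"{question}\nTool result: {result}")
    return reply

print(run_system("What is 1234 times 5678?"))  # The answer is 7006652
```

The arithmetic is done by ordinary code, not by the model, so the reliability of the answer depends on the system wiring as much as on the model's size.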

The Future of AI: Systems, Not Just Models

In 2026, it's expected that systems consisting of multiple models and tools working together will be more prevalent than massive foundation models doing everything in terms of their parameters 33m38s.
Recent legislation, such as California's SB 1047, has attempted to regulate models based on their size, with a focus on models that cost over $100 million to train and use more than 10^26 FLOPs of training compute 34m1s.
The concept of regulating large language models is being explored, with the idea that smaller specialized models may emerge as equally or more dangerous than larger models, and that complex systems composed of these models could be more hazardous than a single expensive model 34m35s.
The focus of future legislation should be on regulating systems rather than individual models, as systems are the entities that can cause harm 35m26s.
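To give a sense of scale for a compute threshold of 10^26 training FLOPs, the sketch below uses the common rule of thumb that training compute is roughly 6 × parameters × training tokens. That approximation and the token count (about 300 billion, the figure reported for GPT-3) are assumptions brought in for illustration; they are not from the talk.

```python
THRESHOLD_FLOPS = 1e26  # the compute threshold mentioned above

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

# Assumed figures: 175 billion parameters, roughly 300 billion training tokens.
gpt3_flops = approx_training_flops(175e9, 300e9)
print(f"estimated training compute: {gpt3_flops:.2e} FLOPs")
print(f"fraction of threshold: {gpt3_flops / THRESHOLD_FLOPS:.4f}")
print("over threshold" if gpt3_flops > THRESHOLD_FLOPS else "under threshold")
```

Under these assumptions a 175 billion parameter model sits far below the threshold, and smaller models lower still, which illustrates why a size-based rule says nothing about what such models can do once embedded in a system with tools.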

Rethinking Evaluation: Focusing on Systems, Not Just Models

Current research evaluations, such as leaderboards, are flawed as they only evaluate individual models, whereas in reality, a system consisting of a model, prompting strategy, and generation procedure is being evaluated 35m40s.
The community should reorient leaderboard evaluations to focus on entire systems, considering all components working together, rather than just individual language models 36m42s.

Scaling Up: Multiple Dimensions and the Power of Systems

The future of AI may involve multiple notions of scaling, including scaling up unsupervised training and scaling up instruction fine-tuning, with the latter driving significant gains in AI progress 37m21s.
The power of large teams of humans creating good input-output pairs for model updates has been demonstrated, particularly with the emergence of ChatGPT in 2022 38m4s.
Scaling up large language models is leading to gains, but it is not a silver bullet, which has led to the rise of a new theme: scaling sampling for generation, meaning sophisticated forms of sampling that can be thought of as scaling up inference-time processing or search 38m24s.
This theme is expected to continue from 2024 onward, with transformative things happening in virtue of the fact that perfectly good language models, even small ones, are given access to lots of different tools and other things that make them really capable as systems 38m54s.
The future of AI lies in the scaling up of systems, where language models are given access to various tools and capabilities, making them more productive and leading to bigger gains 39m0s.

Generative Agents and the Importance of Systems Thinking

Generative agents play a crucial role in this whole thing, as they can be used to take a language model and have it do things it couldn't do on its own, and also make use of tools and tool output 40m5s.
Systems thinking is essential, and people should not be purists when designing software systems, as they can design agents that depend on the model doing complex things in generation or write code to bridge the gaps between the language model's capabilities and the desired outcome 40m31s.

Reasoning Paths and Transparency in AI Systems

Modern AI systems are already producing multiple reasoning paths in the background, and the system produces reasoning paths through methods like Chain of Thought, which involves prompting the model to think step by step and generate tokens in response 41m5s.
To give users more access to the inner workings of these systems, it's essential to understand how the system produces reasoning paths and provide users with more transparency and control over the evaluation of results 41m13s.
The concept of Chain of Thought reasoning involves generating multiple inference paths to arrive at different outcomes, and then statistically analyzing those outputs to decide on a final generation, which is a systems thinking approach that considers the prompt, overall system structure, and generation methods 42m0s.
This approach is exemplified in OpenAI's o1 models, which perform extensive inference-time work before producing an answer, but the details of these processes are considered trade secrets 42m47s.
Researchers can explore the behaviors of smaller models to understand how to coax out desired behaviors, and this topic is expected to be explored in the research literature going forward 43m16s.

The Future of System-Level Scaling and Tool Access

The future of system-level scaling will involve increasingly complex systems, similar to Google search, which has evolved from a simple search technology to a complicated software system that functions through teams of people and dynamic behavior 43m41s.
The development of GenAI systems will lead to incredible systems being built over the next decades, and a key aspect to watch is when people provide tool access, allowing language models to interact with the web and other systems like humans do 44m17s.
As language models become more integrated with the web and other systems, they will have significant consequences, both productive and problematic, and it is essential to establish guardrails to mitigate potential negative consequences 44m51s.
Establishing guardrails requires considering both positive and negative consequences and thinking about how to regulate and manage the development and deployment of these systems 45m28s.

Regulating AI: Systems vs. Models and Human Considerations

Regulating AI systems is more likely to be effective than regulating the models themselves, as there is already existing legislation that governs software system behavior, which can be applied to the AI realm 45m39s.
Considering human aspects, such as requiring AI systems to identify themselves as non-human, can help people calibrate to them as agents and control the situations they are allowed to enter 46m1s.
Fundamental restrictions may be necessary, such as limiting AI models' ability to log into certain websites or interact freely on social networks 46m20s.

The Near-Term Impact of AI and the Need for Vigilance

The initial disasters caused by AI systems may not be cataclysmic, allowing society to learn from them and figure out how to respond 46m39s.
AI will impact people's lives in the next five years, even if they are not thinking about it, as more systems that can help with daily tasks and provide companionship, discovery, and creative expression will emerge 47m11s.
AI can also help with education, providing customized experiences at low costs, but there is a risk of bad actors using AI for malicious purposes, such as social engineering 47m54s.
Individuals and society need to be on the lookout for AI systems and take steps to prevent malicious activities 48m29s.

Getting Involved in the AI Community and Learning Resources

For those interested in learning more about AI, there are resources available, such as a Discord community and tutorials for technical individuals, and recommendations for business leaders to grasp the implications of AI for their businesses 48m44s.
To get involved in the community and contribute to projects, one can start by filing issues or making pull requests, which can lead to positive impacts and learning opportunities, and joining the Discord community to see what others are working on and potentially collaborate with them 49m19s.
The research community is also open to new contributors, and YouTube can be a helpful resource for learning about different prompting strategies, agent tool usage, and other relevant topics 49m46s.

The Changing Landscape of AI Research and Information Access

However, the industry is becoming increasingly closed, making it harder to gain insight into the decisions being made and why, even at the level of research innovation 50m4s.
For leaders in organizations looking to define a generative AI strategy, it's essential to think about what success looks like, what kind of testing to do, and designing a system that balances goals with known risks 50m21s.
To stay up-to-date with the field, one can dedicate 10 minutes a day to reading and learning, and recommended resources include the ACL Anthology for NLP papers, Semantic Scholar, and tools like ChatGPT that can do retrieval-augmented generation 51m14s.
Twitter, or X, is no longer as reliable a resource as it once was due to changes, and communities have spread out to other platforms like Bluesky, Threads, and Mastodon 51m48s.
The rise of generative AI has made it easier to get a sense of an area by typing common-sense questions into tools like ChatGPT, which can provide a starting point for learning and exploring the literature 52m4s.
NLP is a well-organized community with a comprehensive literature, and resources like the ACL Anthology and Semantic Scholar can be used to find relevant papers and stay current 52m42s.
A course on natural language understanding is available, which includes a project development phase that involves building a literature review, forming an experimental protocol, writing a paper, and creating associated code, providing a guided way to do a focused research project and understand the rhythms of research in the domain 52m55s.

Practical Advice for Building with DSPy and LangChain

DSPy is gaining traction in the business world, with various organizations, including JetBlue and startups, using it in different ways, and its website, dspy.ai, offers documentation, use cases, and starter code 53m35s.
When starting out with DSPy or LangChain, it's essential to make principled choices and avoid designing a system entirely around prompt templates, as this can lead to unintended consequences and make changes difficult 54m26s.
Using prompt templates can be productive for teaching purposes, but it's crucial to express things as proper software systems to avoid failure modes and ensure flexibility and adaptability 54m32s.
DSPy is a great choice for building software systems, especially for those experienced in machine learning, as it is tailored to their needs and follows PyTorch design principles, although it may have a learning curve 55m28s.
The most important thing to take away is to avoid thinking entirely in terms of models and instead focus on building software systems that can respond to new requirements and changes in the underlying environment 56m16s.

The Importance of Systems Thinking: A Final Emphasis

Software systems like ChatGPT are often viewed as models, but they are actually compound systems that require a broader focus beyond just the model choice or its properties 56m34s.
A more effective approach is to concentrate energy on the entire system, similar to an F1 race car design team that focuses on all the complicated pieces working together in concert 56m49s.
In the industry, most energy is focused on small models, which makes system design even more crucial, as simplistic system design and prompting strategies are not sufficient 57m10s.
With small models, it is necessary to do everything possible to achieve a significant impact, which places more pressure on system design, but this pressure also presents a huge opportunity 57m29s.

Conclusion and Next Steps

The discussion concluded with appreciation for the questions and participants, and the session will be posted on YouTube and shared as a recorded session 57m49s.
