YouTube video summary

Inference, Diffusion, World Models, and More | YC Paper Club

Artificial Intelligence · 02 Jun 2026 · 24 min summary · From Y Combinator

Introduction to the YC Paper Club and Attendees

  • The YC Paper Club is a community that brings great founders and researchers together to share ideas. The first meeting drew over a thousand applicants, of whom only about a hundred were selected 10s.
  • The attendees are an accomplished group: many have high citation counts, some over 10,000, and many have raised significant funding, some over $50 million 2m6s.
  • The hidden mission of the YC Paper Club is to make Pioneer great again. The location is special because many successful companies, including unicorns, were founded there, and it is close to many AI companies and research institutions, including Google DeepMind and Stanford 4m30s.
  • The meeting is also an opportunity to bring together the Bay Area AI community, whose companies and researchers are based both in the city and in Palo Alto, so that they can share ideas and collaborate 6m20s.

Overview of the First Paper on Speculative Decoding

  • The first paper is on speculative decoding. The presenter, Tanishq, is a grad student at Stanford who worked on the project with Tri Dao and Avner May, and he discusses why inference matters and how it can be improved 8m40s.
  • Tanishq's goal is to evangelize inference. He starts with his own earlier mental model of how inference works, which he now considers oversimplified, and sets out to show that inference is a complex and important part of the process 10m10s.
  • Inference at scale involves a lot of subtlety and fun algorithms. Its importance is usually framed in terms of cost: when a model is served to a large number of users, generating trillions of tokens, inference costs dominate training costs 10s.
  • Inference costs are not only high but also exceed the compute requirements of pre-training, especially with the rise of reinforcement learning (RL), which is essentially a wrapper on inference, leading to increased compute requirements 1m20s.
  • The reason for working on inference is not just about reducing costs or increasing convenience, but rather about enhancing capability, as the speed of inference directly affects the peak intelligence that can be delivered by a method or algorithm 2m6s.
  • The goal is to shift the perception of inference from being a cost or convenience factor to being a capability, enabling the development of more intelligent systems that can process and generate tokens quickly 2m50s.

Technical Details of Speculative Decoding

  • To demonstrate the importance of fast inference, a comparison of three algorithms is presented, including normal auto-regressive decoding, speculative decoding, and a custom inference engine, showcasing the potential for significant speed improvements 4m20s.
  • Speculative decoding is introduced as a technique that uses a small model as a proxy to quickly sample from a larger model, with the small model generating tokens one by one and the large model verifying these guesses 6m10s.
  • The process of speculative decoding involves the small model making predictions and the large model verifying them, allowing for faster sampling from the large model, and this technique is further improved upon by the custom inference engine 7m30s.
  • Verification in the context of speculative decoding involves doing one forward pass over generated tokens to determine how likely it is that a big model would have generated them, and this process is easier and faster than generating tokens, which is done one at a time through auto-regressive decoding 10s.
  • The verification process works by having the big model look at the probabilities of each generated token and determining how plausible it is that it would have generated those tokens, accepting tokens with reasonably high probabilities and rejecting those that are less plausible 42s.
  • In speculative decoding, once a token is rejected, an extra token can be sampled for free without any more forward passes. This is called a bonus token, and it is important for understanding how Speculative Speculative Decoding (SSD) works 2m6s.
  • Speculative decoding is a way to exchange computational power (flops) for latency, and it is a concept used not only in large language models (LLMs) but also in computer science, such as in CPUs, where it involves predicting and preparing for future outcomes to reduce latency 4m10s.
  • The goal of SSD is to parallelize the sequential operation of drafting and verification, allowing them to happen at the same time, which requires addressing the logical dependency between the draft model and the verifier, and finding a way to parallelize this inherently sequential algorithm 6m20s.
  • To achieve parallelization, the draft model sends back its draft tokens, and then the verifier does a forward pass to verify them, which is a key step in the SSD process, and this approach enables the drafting and verification processes to occur simultaneously 8m30s.
  • The process of verification involves a big model and can be time-consuming, so to speed it up, the approach is to anticipate the most likely verification outcomes immediately and start drafting the next round on top of those while verification is taking place 10s.
  • In SSD, drafting and verification happen in parallel. The principal difficulty is predicting verification outcomes ahead of time, which can be done reasonably well using information from the draft itself 42s.
  • The verification outcome is predicted by using information from the token distributions of the draft model, and once these predictions are made, they can be decoded in parallel as different sequences that are decoded on top of a shared prefix, resulting in speedups 2m6s.
  • The approach can get the prediction right about 80 to 90% of the time, which is more than enough to get big speedups, and it also gives more time to draft, allowing for more tokens to be drafted and increasing the expected tokens per round 2m6s.
  • There are implementation details to consider, such as handling cache misses, and the approach is not always optimal, requiring different ways to predict and handle cache misses, and navigating trade-offs between cache hit rate and the quality of the drafting 4m30s.
  • The complexity of the approach can be navigated, and the result is an increase in the number of inference algorithms and inference engines, which is the goal of AI research 8m0s.
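
The draft-then-verify step described above can be sketched as follows. This is a minimal illustration of the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the leftover probability mass), not the presenter's inference engine. The function names `verify_draft` and `sample` and the dictionary-based token distributions are assumptions made for the example. In this sketch the verifier always contributes one extra token: a corrected token on rejection, or a bonus token when every draft token is accepted.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Verify drafted tokens against the target model's distributions.

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    target_probs holds one extra entry: the target distribution for the
    position after the last drafted token, used for the bonus token.
    """
    output = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i][token]             # draft model's probability
        p = target_probs[i].get(token, 0.0)   # target model's probability
        if rng.random() < min(1.0, p / q):
            output.append(token)              # accepted: keep the drafted token
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        output.append(sample(residual, rng))
        return output
    # Every drafted token was accepted: sample one bonus token from the target.
    output.append(sample(target_probs[len(draft_tokens)], rng))
    return output


rng = random.Random(0)
# Draft and target agree on "a", so "a" is accepted and a bonus token follows.
print(verify_draft(["a"], [{"a": 1.0}], [{"a": 1.0}, {"b": 1.0}], rng))
```

All target distributions come from a single forward pass of the large model over the drafted tokens, which is why verification is cheaper than generating the same tokens one at a time.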

Introduction to Diffusion Policy and Model Predictive Control

  • The inference engines compared include a baseline implementation of speculative decoding, SGLang, and SSD, with SGLang being the fastest with speculative decoding, and it is noted that speculative decoding is a win for both latency and throughput in this setting 10s.
  • The concept of diffusion policy is introduced, which involves using video models to predict future steps for robotic control. It is mentioned that this idea was explored independently before discovering that DeepMind had already worked on it, resulting in a month of wasted time 2m6s.
  • Stannis, a research scientist at Google DeepMind, is introduced. He discusses his work on world modeling for robotics, which involves building general-purpose policies on top of video and world models, and mentions that his current project is early work that demonstrates similar ideas on toy problems 4m42s.
  • Model predictive control (MPC) is explained: it uses a dynamics model and an action selection mechanism to construct agents that solve tasks by maximizing an objective. MPC can adapt to novel reward functions at test time and can be easier to learn than a policy alone 6m15s.
  • Diffusion Model Predictive Control (DMPC) is introduced, which addresses the problems of MPC by using diffusion models to learn multi-step action proposals and dynamics models, reducing compounding errors and simplifying the planning algorithm 8m30s.
  • The advantages of DMPC are discussed, including the ability to reduce compounding errors and simplify the planning algorithm, allowing for the use of a simple sampling-based planner that can outperform previous approaches 10m45s.

Diffusion-Based Agents and Their Variants

  • There are various approaches to building a joint distribution of states and actions, including factorized methods with a policy and dynamics model, model predictive control (MPC), and model-free approaches that directly learn a policy, with different trade-offs in terms of runtime planning, adaptation to rewards and dynamics, and general speed 10s.
  • Diffusion models have been successful in generating images and videos, and have also found success in robotics, with different types of diffusion-based agents, including diffusion policy, diffuser, decision diffuser, and diffusion model predictive control 2m6s.
  • The diffusion policy conditions on observations to generate future actions, while the diffuser jointly models observations and actions in a latent space, and the decision diffuser generates future observations and uses an inverse dynamics model to derive actions 2m6s.
  • Diffusion model predictive control uses an action proposal and a dynamics model to evolve the state, and then selects actions using a planner, allowing for runtime adaptation to novel rewards and dynamics 2m6s.
  • The algorithm for learning diffusion-based agents is relatively simple, involving learning a policy and dynamics model from an offline dataset, and then using these models to sample and rank action proposals at inference time 2m6s.
  • The different diffusion-based agents have various benefits and trade-offs, such as the diffusion policy requiring expert demonstrations, the diffuser allowing for implicit world modeling and model-based planning, and the decision diffuser enabling observation-only learning from video data 2m6s.
  • The main difference in the approach is the adoption of a multi-step action proposal, similar to a diffusion policy, which can provide more coverage in the action space when trained on diverse data, and also utilizes a multi-step dynamics model to evolve for a long time horizon without compounding error 10s.
  • The approach leverages a diffusion model, which is a powerful way to model data, especially multimodal data, and empirically, the stronger modeling capabilities allow for simplification of the planning algorithm, enabling the use of a simple planner to solve tasks 42s.
  • The approach is contrasted with other representative works, including model-based offline control and offline planning, and a diffuser work that learns a joint model and uses classifier-free guidance for planning 1m26s.
  • The approach is competitive with state-of-the-art methods when deployed in a fixed-reward, single-task setup 2m6s.
  • The approach can adapt to novel rewards at runtime: changing the reward function makes it exhibit novel behaviors. It can also adapt to novel dynamics, which is a benefit of factorizing the action proposal and the dynamics model 2m50s.
  • The factorized representation in the approach allows for adaptation to changing dynamics by simply adapting the dynamics model on some play data collected in the new environment, resulting in recovery of performance 4m10s.
  • The different components of the approach, including diffusion action proposals, multi-step diffusion action proposals, and multi-step dynamics modeling, all contribute to improved performance 5m30s.
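
The "sample action proposals, roll them out, rank by reward" loop described above can be sketched on a toy problem. This is only an illustration of a simple sampling-based planner: `propose_actions` and `rollout` are hypothetical stand-ins for the learned diffusion action proposal and the learned multi-step dynamics model, and the one-dimensional environment is invented for the example.

```python
import random


def propose_actions(state, horizon, rng):
    """Hypothetical stand-in for a learned multi-step action proposal."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def rollout(state, actions):
    """Hypothetical stand-in for a learned multi-step dynamics model.

    Toy 1-D dynamics: each action shifts the position by its value.
    """
    states = []
    for action in actions:
        state = state + action
        states.append(state)
    return states


def plan(state, reward_fn, rng, num_samples=256, horizon=5):
    """Sample action sequences, score each predicted trajectory, keep the best."""
    best_actions, best_score = None, float("-inf")
    for _ in range(num_samples):
        actions = propose_actions(state, horizon, rng)
        score = sum(reward_fn(s) for s in rollout(state, actions))
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions


rng = random.Random(0)
# The same proposal and dynamics models are reused; only the reward changes.
go_right = plan(0.0, lambda s: s, rng)
go_left = plan(0.0, lambda s: -s, rng)
print(rollout(0.0, go_right)[-1], rollout(0.0, go_left)[-1])
```

Because the reward enters only at ranking time, swapping `reward_fn` changes the behavior without retraining anything, which is the runtime adaptation the bullets describe. Adapting to new dynamics would correspond to replacing `rollout` alone.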

Introduction to World Models and Their Applications

  • The presentation then shifts to a new topic: a paper on world models, specifically LeWorldModel from Yann LeCun's group, presented by Isaac Ward, who has been working on world models for a couple of years and is excited to share this project 7m20s.
  • World models are about learning the dynamics of the world, using a big neural network to predict how a system will change over time based on its inputs, and they enable capabilities such as generating imagined outcomes, model-based control, and surprise quantification 10s.
  • These models are not a new idea, but rather an old concept with new advertising or packaging, as described in a 1990 paper by Richard S. Sutton, who is known for his work in reinforcement learning, where he describes a black box that takes a situation and action as input and outputs a prediction of the next situation 42s.
  • World models typically use observations from sensors, such as images from a camera, and control inputs, such as yaw and movement, to make predictions about the outcome of an action, and they can handle high-dimensional observations, including images and LiDAR 2m6s.
  • Training world models can be challenging due to long action sequences and the possibility that the minimum in the optimization landscape may not correspond to the desired behavior, but having a system capable of doing this implies that it must have an internal model of the world 4m30s.
  • The ability to imbue agents with an internal model of the world is a potentially useful capability, and the big question is whether agents will have model-free or model-based policies, and whether they will have an internal model of the world or not 6m20s.

Technical Aspects of World Models and Training Challenges

  • Yann LeCun's raise of $1.03 billion to train world models is an example of the significance of this concept, and understanding world models is crucial to exploring their potential applications and limitations 10s.
  • Model-free approaches involve taking observations and feeding them into a neural network to get an optimal action, without an explicit representation of the future, and these models are pretty good, but have weaknesses such as brittleness to out-of-distribution data 10s.
  • Model-based approaches, on the other hand, train a world model explicitly and use it to predict the outcome of potential actions, allowing for the quantification of modeling error, which is important when deploying things in the real world 1m30s.
  • There is growing evidence that neural networks contain internal world models that are highly obfuscated and challenging to interpret. A paper that speaks to this will be discussed, highlighting that even model-free policies have world models in them 42s.
  • The challenges associated with training world models include learning the representation of the world and how actions change that representation, and many solutions in the optimization landscape can cause the model to collapse, but there are techniques to avoid this collapse 4m30s.
  • A toy example is used to illustrate this, where an agent is trying to push a blue ball into a green slot, and a model is trained to predict the action sequence, showing that it is possible to train models of toy environments and more complex ones 6m0s.
  • The world model approach simplifies the process of training models by avoiding the need for tricks, special methods, or hyperparameter tuning schedules, and instead uses a more elegant method, with popular world models including those mentioned in the paper 8m30s.
  • PLDM, DINO-WM (a world model built on DINO, self-distillation with no labels), Dreamer, and Temporal Difference MPC (TD-MPC) are planning methods that use latent dynamics models, and they employ various tricks to avoid collapse. LeWorldModel simplifies the process to one hyperparameter and one loss term 10s.
  • World Models can be categorized into three approaches: using explicit heuristics to enforce healthiness in the latent space, utilizing foundational methods such as autoencoders or diffusion models and adding action conditioning, or leveraging privileged data to avoid collapse 2m6s.
  • JEPA, the Joint Embedding Predictive Architecture, is a type of world model that uses an autoencoder or image encoder to turn observations into latent vectors, and it trains a predictor to forecast the next latent embedding when an action is executed 4m6s.
  • LeWorldModel, based on JEPA, uses the SIGReg regularizer to keep the latent embeddings in a healthy, Gaussian distribution. It takes one-dimensional slices of the high-dimensional embeddings and checks whether each slice is Gaussian-distributed 6m6s.
  • The SIGReg regularizer measures how Gaussian-distributed the embeddings are, which indicates whether the world model is healthy and non-collapsed, and it can be added to the training objective as an additional term 8m6s.
  • The world model provides three key capabilities. The first is open-loop prediction, the ability to predict future observations from a sequence of actions without feedback. This is demonstrated on tasks such as Push-T and push-cube, where the model generates imagined sequences that resemble real examples 10m6s.
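
The idea of checking one-dimensional slices of the embeddings for Gaussianity can be sketched as below. This is a simplified stand-in, not the regularizer from the paper: it projects embeddings onto random unit directions and compares the first four moments of each projection with those of a standard normal (0, 1, 0, 3), whereas the paper's regularizer may use a different test statistic. All function names are hypothetical.

```python
import math
import random


def random_direction(dim, rng):
    """Sample a random unit vector, which defines one 1-D slice."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slice_penalty(values):
    """Squared gap between the sample moments and those of N(0, 1)."""
    n = len(values)
    m1 = sum(values) / n
    m2 = sum(v ** 2 for v in values) / n
    m3 = sum(v ** 3 for v in values) / n
    m4 = sum(v ** 4 for v in values) / n
    return m1 ** 2 + (m2 - 1.0) ** 2 + m3 ** 2 + (m4 - 3.0) ** 2


def gaussianity_penalty(embeddings, num_slices, rng):
    """Average the slice penalty over several random projections."""
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_slices):
        direction = random_direction(dim, rng)
        projected = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        total += slice_penalty(projected)
    return total / num_slices


rng = random.Random(0)
healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
collapsed = [[0.0] * 8 for _ in range(2000)]  # every observation maps to one point
print(gaussianity_penalty(healthy, 16, rng), gaussianity_penalty(collapsed, 16, rng))
```

A collapsed encoder produces slices with no spread, so its penalty is large, while roughly Gaussian embeddings score close to zero. In training, a term of this kind would be added to the prediction loss with a single weight.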

World Model Capabilities and Performance

  • Model predictive control can be achieved by using world models, where an initial observation and a goal observation are encoded, and a search is performed over actions to reach the goal in the latent space, with well-defined optimization methods available to achieve this 10s.
  • The world model has been shown to outperform competition in small 2D tasks and 3D tasks, particularly in environments like Dino World, due to its foundational backbone trained on image data, and it is approximately 50 times faster than other models 2m6s.
  • The world model's efficiency can be attributed to its ability to operate in the latent space, requiring less than 24 gigabytes of VRAM and only 15 million parameters, making it possible to run on a single card 2m6s.
  • A key capability of world models is the ability to quantify model error, which allows agents to estimate their uncertainty and detect perturbations, such as changes in the environment, providing a powerful tool for intelligent agents 2m6s.
  • The discussion of world models raises broader themes, including the choice between model-based and model-free approaches, the importance of regularization and representation learning, and the potential for bio-inspired methods or pre-existing foundation models 2m6s.
  • The paper also touches on the topic of representational collapse and how to address it elegantly, with the current work providing a good example, but leaving room for further exploration of the best approach 2m6s.
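
Quantifying model error, as described above, amounts to comparing what the world model predicted with what was actually observed. The sketch below assumes a toy linear latent dynamics and a fixed threshold; `predict_next`, `surprise`, and `detect_perturbation` are hypothetical names, and a real system would use the learned encoder and predictor instead.

```python
def predict_next(latent, action):
    """Hypothetical world model: a fixed linear update of the latent state."""
    return [0.9 * z + 0.1 * action for z in latent]


def surprise(predicted, observed):
    """Mean squared error between the predicted and observed latents."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def detect_perturbation(latent, action, observed_next, threshold=0.05):
    """Flag a step whose prediction error exceeds the (assumed) threshold."""
    return surprise(predict_next(latent, action), observed_next) > threshold


latent, action = [1.0, -1.0], 0.5
expected = predict_next(latent, action)

# The world behaves as the model expects: low surprise, no flag.
print(detect_perturbation(latent, action, expected))
# Something in the environment changed: high surprise, flagged.
print(detect_perturbation(latent, action, [2.0, 0.0]))
```

A model-free policy has no comparable prediction to check against, which is why the bullets present this as a capability specific to agents with an explicit world model.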

Introduction to Generalization and Deep Learning Mysteries

  • The conversation then shifts to a new paper, "Deep Learning is Not So Mysterious or Different" by Andrew Gordon Wilson, which will be presented by Ashe, co-founder and president of QABs, and explores the current state of machine learning, including the relationship between scaling models and generalization 10m10s.
  • Understanding generalization is crucial because it makes it possible to optimize for it, and the payoff is significant: it bears on phenomena such as overparameterization, benign overfitting, and double descent 10s.
  • The concept of generalization is often considered a mystery, but classical theories of generalization, such as PAC-Bayes, can be used to explain it, and PAC-Bayes bounds the test loss with a training loss and a compression term 42s.
  • Overparameterization is a mystery in deep learning where increasing the model parameter size leads to better generalization, contrary to the bias-variance trade-off, and the PAC-Bayes framework provides a useful way to think about the success of overparameterization 2m6s.
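
One commonly stated form of the PAC-Bayes bound makes the "training loss plus compression term" structure concrete. The notation here is an assumption for illustration, and the talk may have used a different variant. For a loss bounded in [0, 1], n i.i.d. training examples, a prior P over hypotheses chosen before seeing the data, and any posterior Q, the following holds with probability at least 1 - δ, where R is the test loss and \hat{R}_n the training loss:

```latex
\mathbb{E}_{h \sim Q}\big[R(h)\big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{R}_n(h)\big]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

The KL term plays the role of the compression term: a solution that can be described compactly relative to the prior has a small KL divergence, so a larger model can still get a tight bound if the solution it finds is more compressible.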

Overparameterization and Model Compressibility

  • Empirical risk, or training loss, decreases as the number of parameters increases, allowing the model to fit the data better, and research by Lotfi and others has found that increasing the model size leads to more compressible solutions 2m6s.
  • The concept of model compressibility is related to the perspective of flatness, where increasing the number of parameters leads to an exponential increase in the volume of flat minima in parameter space, making the model more compressible 2m6s.
  • Benign overfitting is another mystery in deep learning, where deep neural networks can fit totally random noise but still generalize well on structured data, and a regularized polynomial model can provide intuition for how this is possible 4m30s.
  • Andrew's work dispels the mysteries of overparameterization and benign overfitting by using classical theories of generalization, providing useful bounds on generalization even for large models, and offering a new perspective on the success of overparameterization 4m30s.

Inductive Bias and Learning Efficiency

  • Neural networks are expressive models with a soft inductive bias, which allows them to generalize and avoid overfitting, and this concept can be explained using a figure that illustrates the trade-off between flexibility and inductive bias 10s.
  • The idea of inductive bias is important in machine learning, as it can help solve the overfitting problem, and having a very expressive hypothesis space with a bias towards solutions that generalize can be beneficial 42s.
  • The no free lunch theorem states that the only way to improve learning efficiency is through inductive biases, and finding the right inductive biases can help optimize for them and lead to significant gains in capability 2m6s.
  • Two major problems that need to be solved in AI are intelligence per watt and intelligence per sample, and currently, AI is still orders of magnitude off from human-level performance in these areas 4m30s.

Data Constraints and Pre-Training Challenges

  • Researchers, including those in Chris Ré's lab, have been exploring how well models can generalize with a fixed amount of data and infinite compute, and a paper co-led by Konwoo, Suhas, Percy, and Tatsu starts to answer this question 6m15s.
  • The paper is motivated by the fact that pre-training has continued to improve model capabilities in surprising ways, with recent examples including the emergence of in-context learning, alignment, and reasoning, and the development of new models like Mythos and 5.5 8m0s.
  • However, pre-training is expensive, and the focus has been on improving compute efficiency, which can be achieved by scaling the number of parameters and data points, but this will soon be constrained by data, as quantified by the Chinchilla scaling laws 10m30s.
  • The amount of human-generated text on the internet grows by roughly 3% per year, while the amount of compute spent on pre-training is growing by roughly 4 or 5x per year, indicating that the amount of compute spent per data point will continue to increase by roughly 4x year-over-year 10s.
  • This growth motivates the core question of how to approach pre-training when constrained by data but unconstrained by compute, which is a different algorithmic regime from the compute-efficient pre-training world 1m5s.
  • The question of pre-training under data constraints is not new and has been explored in classical statistics and older benchmarks like EMNLP and Penn Treebank, where data is limited and compute is not a concern 2m6s.
  • The core contribution is to bring the modern toolkit of scaling laws to answer this problem, proposing scaling recipes that monotonically decrease validation loss, and showing that these scaling laws follow a clean power law 2m45s.
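
The growth figures quoted above imply the "roughly 4x" increase in compute per data point by simple division. The snippet below only works through that arithmetic; the function name is made up, and the three-year horizon is an arbitrary choice for illustration.

```python
def compute_per_token_growth(compute_growth, data_growth):
    """Yearly growth factor of compute spent per data point."""
    return compute_growth / data_growth


DATA_GROWTH = 1.03  # roughly 3% more human-generated text per year

for compute_growth in (4.0, 5.0):
    yearly = compute_per_token_growth(compute_growth, DATA_GROWTH)
    print(
        f"compute x{compute_growth:.0f}/yr -> "
        f"compute per token x{yearly:.2f}/yr, "
        f"x{yearly ** 3:.0f} over three years"
    )
```

Since data grows so slowly, almost all of the compute growth turns into extra compute per token, which is what pushes pre-training toward the data-constrained regime.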

Scaling Laws and Algorithmic Recipes for Data Efficiency

  • The goal is to estimate the best possible loss of a recipe by looking at the asymptote of the power law, which quantifies the best possible performance under infinite compute, and to think more carefully about algorithms that allow lowering the compute asymptote 4m10s.
  • A canonical setting is introduced, simulating a data-constrained world by constraining the number of pre-training tokens to 200 million, and pre-training large models using different recipes to find those that allow spending more compute while monotonically decreasing loss 5m20s.
  • The standard approach of epoching data and scaling up models is considered, but it leads to overfitting, and alternative approaches such as aggressive regularization through weight decay are explored 7m30s.
  • The goal is to optimally tune learning rate, weight decay, and epoch count for each model, and it is shown that the loss follows a clean power law as the number of parameters in the model increases, with aggressive regularization using weight decays 30 times larger than those used for compute optimal pre-training 10s.
  • The power law has a few nice properties, including an exponent on the model parameters of one, which is predicted by the data constraint theory, and an asymptote of 3.43, which characterizes the performance of the best possible regularized model with infinite compute 42s.
  • Ensembling is a technique that can be used to improve the performance of models, and it is shown that ensembling can be incredibly data efficient, with a power law that has an exponent of one and an asymptote, and the asymptote of ensembling is much lower than the asymptote of the regularized recipe 2m6s.
  • The benefits of regularization and ensembling can be composed, and it is shown that training an ensemble of small models can be better than training one large model when data is constrained, and the joint scaling recipe quantifies the hypothetical performance of training an infinitely large ensemble of infinitely large models 4m10s.
  • The joint scaling recipe is quantified by fitting two scaling laws, first by training ensembles of models of different sizes and looking at the asymptotes of the ensembles, and then by fitting a second scaling law to the asymptotes of the ensembles, which is essentially taking the limit over the number of ensemble members and the limit over the number of parameters 6m15s.
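
The two-stage fit described above can be sketched with ordinary least squares. The sketch assumes both power laws have exponent one, as the summary states, so that loss is linear in 1/size, and it runs on synthetic losses generated by an invented formula. None of the numbers below come from the paper.

```python
def fit_asymptote(sizes, losses):
    """Least-squares fit of loss = asymptote + coeff / size.

    With exponent one the model is linear in 1/size, so the intercept of
    a straight-line fit is the predicted loss at infinite size.
    Returns (asymptote, coeff).
    """
    xs = [1.0 / s for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    coeff = sxy / sxx
    return mean_y - coeff * mean_x, coeff


def synthetic_loss(params_millions, members):
    """Invented loss surface used only to exercise the fit."""
    return 3.2 + 20.0 / params_millions + 0.3 / members


param_counts = [150, 300, 600, 1400]  # model sizes, in millions of parameters
member_counts = [1, 2, 4, 8]          # ensemble sizes

# Stage 1: for each model size, take the limit over ensemble members.
ensemble_asymptotes = []
for n_params in param_counts:
    losses = [synthetic_loss(n_params, k) for k in member_counts]
    asymptote, _ = fit_asymptote(member_counts, losses)
    ensemble_asymptotes.append(asymptote)

# Stage 2: fit a second law to those asymptotes, taking the limit over size.
joint_asymptote, _ = fit_asymptote(param_counts, ensemble_asymptotes)
print(round(joint_asymptote, 3))
```

Comparing recipes by their fitted asymptotes, rather than by loss at one model size, is what lets the approach rank algorithms by their best possible performance under unlimited compute.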

Ensembling, Distillation, and Practical Data Efficiency

  • The experiments are performed in a toy data-constrained setup of 200 million tokens, and to confirm that the recipes scale, data scaling laws are built by repeating the experiments at four different pre-training token counts up to 1.7 billion tokens 8m30s.
  • Data scaling laws are used to quantify the best possible performance of each recipe with an infinite amount of compute, and they let us quantify the data efficiency numbers of approaches, allowing for the measurement of the effective number of extra tokens that an algorithmic improvement is buying 10s.
  • The joint scaling recipe gives a roughly 5x data efficiency win over the standard recipe, and this win can be realized with finite models, such as training a five-ensemble of 1 billion parameter models, which gives a roughly 3.7x data efficiency win 42s.
  • The data scaling laws have similar exponents and asymptotes, suggesting that the data efficiency win will be constant over the actual number of token counts, even if the token count is increased to a large number, such as 10 trillion tokens 1m15s.
  • To make the data efficiency win more practical, methods such as distillation can be used to reduce the amount of inference compute needed, and it is shown that an eight-ensemble model can be distilled into a single dense 300 million parameter model while retaining roughly 83% of the loss improvement 2m6s.
  • Self-distillation can also be used to improve the loss, and it is found that self-distillation gives huge loss improvement, even beating the asymptote of the regularized recipe, and has connections to ensembling and can be viewed as implicitly training a two-ensemble 3m20s.
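
A small numeric example shows why ensembling helps and what a distilled student is trained to match. It assumes the ensemble averages its members' predicted probabilities (the paper may combine members differently), and the three member distributions are invented.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the correct label."""
    return -math.log(probs[label])


def ensemble(member_probs):
    """Average the members' predicted distributions."""
    k = len(member_probs)
    size = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / k for i in range(size)]


def distillation_loss(student_probs, teacher_probs):
    """Cross-entropy of the student against the teacher's soft targets."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Invented next-token distributions from three independently trained members.
members = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.1, 0.3]]
label = 0

member_losses = [cross_entropy(p, label) for p in members]
ensemble_probs = ensemble(members)
ensemble_loss = cross_entropy(ensemble_probs, label)

print(sum(member_losses) / len(member_losses), ensemble_loss)
# A single student is then trained to minimize this against the ensemble:
print(distillation_loss([1 / 3, 1 / 3, 1 / 3], ensemble_probs))
```

By Jensen's inequality the averaged prediction never has a higher loss than the average loss of its members. Distillation then pays the ensemble's compute cost once, at training time, so that inference can use a single model.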

Downstream Benchmarks and Continued Pre-Training

  • The trends in the paper are found to work on downstream benchmarks, which are fully held out test sets, and the standard recipe overfits, while model scaling gives improvements 4m30s.
  • Ensembling can lead to better results, and the benefits of ensembling can still be retained through distillation, allowing data efficiency tricks like aggressive epoching to be effective 10s.
  • The concept of continued pre-training is explored, where a 3B model is trained on a restricted set of 4 billion math-related tokens, and with techniques like ensembling, the performance of training on the full 73 billion tokens can be matched, resulting in a 17x data efficiency win 42s.

Conclusion and Future Directions

  • The main point to be taken away is that when data is limited but compute is not, the algorithmic choices made can significantly impact results, and it is essential to rethink every aspect of the stack, revisiting classical ideas from machine learning and deep learning, such as regularization, ensembling, and distillation 2m6s.
  • The introduction of the evaluative tool of asymptotes can help in chasing algorithms with lower compute asymptotes, potentially leading to better ideas for data efficiency, and ultimately, the goal is to develop new and better ideas under infinite compute that do not already exist 2m6s.
  • The paper discusses the details of these concepts, and follow-up work has been done on exploring the interaction between synthetic data and data efficiency, with more information available through a provided QR code 2m6s.
  • The YC Paper Club is discussed as a platform for exploring ideas and making the club a fun and engaging experience, with an invitation for attendees to join the Slack and contribute their ideas to make the club successful 10s.