YouTube video summary

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics

Artificial Intelligence · 02 Jun 2026 · 27 min summary · From Stanford Online

Overview of the CME 296 Class and Image Generation Goals

  • The last lecture of the CME 296 class is divided into two parts, where the first part aims to piece together everything learned in the class and the second part explores adjacent fields where the learned concepts can be applied 10s.
  • The primary goal of the class has been to learn how to generate images, given an input prompt, and to achieve this, the class has been decomposing the process of learning how to generate images into tractable parts 2m6s.
  • The first three lectures focused on how to generate images while treating the model as a black box, and the first lecture specifically introduced the diffusion paradigm 2m6s.

The Diffusion Paradigm and Forward Process

  • The diffusion paradigm represents images as multi-dimensional variables and involves defining a forward process to corrupt clean images into noise, with the purpose of diffusion being to learn how to reverse this process 4m30s.
  • The forward process uses Gaussian distributions, and the reverse process is derived by maximizing the likelihood of the data under p_θ, which leads to a tractable lower bound called the ELBO (evidence lower bound) 6m20s.
  • The ELBO is expanded to derive a loss function, a simple L2 regression loss on the noise added to a given image, which lets the model learn to remove noise and go from an easy-to-sample distribution to the data distribution 8m40s.
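
The forward corruption and the noise-regression loss described above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the names `forward_noise`, `noise_prediction_loss` and the cumulative signal coefficient `alpha_bar_t` are assumptions that follow the usual DDPM notation, and a three-number list stands in for an image.

```python
import math
import random

def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_pred, eps_true):
    """L2 regression loss: mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps_true)) / len(eps_true)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]  # toy "image" with three pixels (illustrative values)
x_t, eps = forward_noise(x0, alpha_bar_t=0.3, rng=rng)

# A perfect predictor recovers eps exactly, so its loss is zero.
perfect_loss = noise_prediction_loss(eps, eps)
# A predictor that always outputs zeros pays the mean squared noise.
zero_loss = noise_prediction_loss([0.0] * len(eps), eps)
```

In training, `eps_pred` would come from a neural network that sees `x_t` and the noise level; here it is supplied by hand only to show what the loss measures.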

Score Estimation and Denoising Score Matching

  • The second lecture introduced another approach to generating images, which looks at the data distribution in terms of where to go rather than what noise to remove. It relies on the score, defined as the gradient of the log-density, ∇_x log p(x), which has nice properties and makes it possible to navigate toward the data distribution 10m50s.
  • The goal of the lecture was to derive a way to estimate the score, which is a quantity that is not directly known, by leveraging the properties of the Gaussian distribution, particularly the fact that the score of a Gaussian distribution can be computed analytically 10s.
  • To estimate the score, the approach involved introducing noise to the target data distribution, which created a trade-off between being able to compute the score of the noise distribution and having the noise distribution be far from the original data distribution 42s.
  • A solution to this trade-off was to estimate the score as a function of both the location in space and the level of noise in the data distribution, which led to the development of the denoising score matching loss 2m6s.
  • This approach turns out to be closely related to diffusion: the score of the Gaussian forward process equals minus the added noise divided by the noise standard deviation, so predicting the score and predicting the noise lead to a similar formulation 2m6s.
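
The link between the score and the added noise follows directly from the Gaussian form of the forward process. The short derivation below is a sketch under assumed notation: the signal and noise coefficients α_t and σ_t are not named in the summary (in DDPM notation they would be the square roots of the cumulative signal and noise variances).

```latex
\begin{aligned}
x_t &= \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I) \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\right) \\
\nabla_{x_t} \log q(x_t \mid x_0) &= -\frac{x_t - \alpha_t x_0}{\sigma_t^2} = -\frac{\varepsilon}{\sigma_t}
\end{aligned}
```

A network trained to predict ε therefore gives a score estimate after dividing by −σ_t, and the reverse holds as well.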

Continuous Formulation and Stochastic Differential Equations

  • The discrete formulation of the noising process was then extended to a continuous formulation, which resulted in a stochastic differential equation that describes the forward process, including a drift term and a diffusion term 2m6s.
  • The continuous formulation is more general and encompasses both DDPM and noise conditional score networks (NCSN) as special cases, with DDPM being a variance-preserving formulation and NCSN a variance-exploding one 2m6s.
  • The ultimate goal is to learn the reverse process, and a result from the 1980s shows that the reverse process can be written in terms of the score, so estimating the score is enough to reverse the forward process 2m6s.
  • The key takeaway is that estimating the score is crucial for reversing the process, and this can be done by predicting the noise to remove, as seen in lecture one, or by knowing where to move towards, which is the score, as seen in lecture two 2m6s.
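
For reference, the continuous formulation summarized above is commonly written as follows. The symbols f, g, β(t) and σ(t) are the standard ones from the score-based SDE literature and are assumptions here, since the summary does not spell them out.

```latex
\begin{aligned}
\text{forward:}\quad & \mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w \\
\text{reverse:}\quad & \mathrm{d}x = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w} \\
\text{variance preserving:}\quad & f(x, t) = -\tfrac{1}{2}\beta(t)\,x, \quad g(t) = \sqrt{\beta(t)} \\
\text{variance exploding:}\quad & f(x, t) = 0, \quad g(t) = \sqrt{\tfrac{\mathrm{d}\,\sigma^2(t)}{\mathrm{d}t}}
\end{aligned}
```

The first term f is the drift and g is the diffusion coefficient; the only unknown in the reverse equation is the score, which is why estimating it is enough to run the process backwards.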

Flow Matching Formulation and Vector Fields

  • The score can be thought of as a compass that indicates the location of the data distribution, and the flow matching formulation frames the problem in a different way by considering the noisy distribution as an initial distribution and the data distribution of interest as a target distribution 10s.
  • The goal of the flow matching formulation is to figure out how to move the probability density from an initial distribution to a target distribution, and it involves a vector field or velocity, noted u_t(x), which is a vector defined at every position x and at every time t 42s.
  • There are two ways of looking at the flow matching formulation: the microscopic view, which follows a single sample through the ordinary differential equation (ODE) dx/dt = u_t(x), and the macroscopic view, in which the time derivative of the probability density equals minus the divergence of the probability flux, which is the continuity equation 1m6s.
  • The whole point of the flow matching formulation is to find a vector field that moves the initial distribution to the target data distribution, and this can be achieved by learning a model that estimates the vector field, noted u_t^θ(x) 2m6s.
  • Once such a model is learned, sampling from the complicated data distribution can be done by sampling from the easy initial distribution and numerically solving the ODE with the learned vector field 2m42s.
  • A simpler case of the flow matching formulation is the conditional probability path, where the initial distribution is transported to a single data point, and this yields a conditional flow matching loss that is equivalent to learning the full vector field 3m42s.
  • The first three lectures covered the flow matching formulation, which is a mathematically challenging topic, but it is an important paradigm for generating images, and most models nowadays use a variant of flow matching, such as rectified flow 5m6s.
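
Sampling by solving the ODE can be sketched with an explicit Euler solver. The example below is a toy under stated assumptions: it uses the straight-line conditional path toward a single known point `x1`, for which the conditional velocity has a closed form, instead of a learned model u_t^θ; the function names are illustrative.

```python
def conditional_velocity(x, t, x1):
    """Velocity of the straight-line conditional path toward x1 (valid for t < 1)."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]

def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = u_t(x) from t = 0 to t = 1 with the explicit Euler method."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        u = velocity(x, t)
        x = [xi + dt * ui for xi, ui in zip(x, u)]
    return x

def cfm_loss(v_pred, x0, x1):
    """Conditional flow matching loss for the straight path: target velocity is x1 - x0."""
    return sum((p - (b - a)) ** 2 for p, a, b in zip(v_pred, x0, x1))

x0 = [0.3, -1.2]  # sample from the initial (noise) distribution, illustrative values
x1 = [2.0, 0.5]   # target data point, illustrative values
x_end = euler_sample(x0, lambda x, t: conditional_velocity(x, t, x1), num_steps=10)
exact_target_loss = cfm_loss([b - a for a, b in zip(x0, x1)], x0, x1)
```

Because this conditional path is a straight line, the velocity is constant along it and Euler integration lands on `x1` regardless of the number of steps, which is the intuition behind straighter paths needing fewer solver steps.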

Latent Space and Variational Autoencoders

  • The rectified flow variant is an extension of the flow matching formulation that makes the paths straighter, so the numerical solver can use fewer steps at inference time and images can be sampled faster 6m0s.
  • The initial approach to generating images involved two main assumptions: generating images in an unconditioned way without input prompts, and representing images with a multidimensional vector, but without considering what representation would make sense, which was the focus of a previous lecture on latent space and guidance 10s.
  • To have a meaningful representation of images, it is necessary to keep only the useful information, and in pixel space, there is a large amount of spatial correlation, resulting in redundant information, and high dimensionality, making it necessary to compress the input into a smaller space 42s.
  • An autoencoder model was used to learn a way to represent images in a smaller space, called the latent space, by downsampling the input image and then reconstructing it, allowing the model to learn a latent space that compresses the information into fewer dimensions 1m6s.
  • However, the autoencoder model had no way of controlling the shape of the latent space, and to address this, a constraint was added to regularize the latent space, resulting in the variational autoencoder (VAE), which structures the latent space in a way that makes it easier to learn and generate images 2m6s.
  • The VAE uses a loss function with two terms: a pixelwise reconstruction loss and a term that pulls the latent distribution toward a chosen prior, resulting in a more compact and well-structured latent space 3m6s.
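
The two-term VAE loss can be illustrated as below. This is a sketch under common assumptions that the summary does not state: the encoder outputs a diagonal Gaussian (`mu`, `log_var`), the prior is a standard normal, the reconstruction term is a squared error, and `beta` is a hypothetical weight between the two terms.

```python
import math

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Return (total, reconstruction, kl) for one example.

    reconstruction: pixelwise squared error between input and reconstruction.
    kl: KL( N(mu, exp(log_var)) || N(0, I) ) for a diagonal Gaussian.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl

# Perfect reconstruction with a latent that already matches the prior: zero loss.
total_a, recon_a, kl_a = vae_loss([1.0, 2.0], [1.0, 2.0], mu=[0.0, 0.0], log_var=[0.0, 0.0])
# Shifting one latent mean away from the prior is penalized by the KL term only.
total_b, recon_b, kl_b = vae_loss([1.0, 2.0], [1.0, 2.0], mu=[1.0, 0.0], log_var=[0.0, 0.0])
```

The KL term is what gives the latent space its shape: without it, the encoder is free to place codes anywhere, which is the limitation of the plain autoencoder noted above.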

Model Architectures and Training Strategies

  • The focus of the previous lecture was on representing inputs, including transformer-based encoders such as the Vision Transformer (ViT), and methods to bring different modalities into the same space, such as CLIP with its contrastive loss 4m6s.
  • Additionally, methods were explored to incentivize generation to be more aligned with the condition, including classifier-free guidance, which allows for more controlled image generation 5m6s.
  • The current goal is to determine the model used to predict the quantity that takes us from noise to a clean image, with flow matching being the most common paradigm: the model predicts a velocity and takes the noise level, the condition of interest, and the noisy latent as input 10s.
  • A popular architecture for image generation is the U-Net, composed of a downsampling part and an upsampling part: the downsampling part builds a global understanding of the input, and the upsampling part restores the desired shape, using copy-and-crop (skip) connections to transfer lower-level details 42s.
  • The transformer and its self-attention mechanism became widely used after 2017, leading to the diffusion transformer (DiT) in 2022, which addressed a limitation of the U-Net by letting patches far from each other interact directly, and which uses adaptive layer norm to inject conditions 2m6s.
  • Today there are many models, including the multimodal diffusion transformer (MMDiT), which treats the condition as part of a joint attention, and most image generation models rely on a DiT-based architecture 4m30s.
  • After determining the model and representation, the focus shifted to training the model, which involved considering how to sample the time step during training, with the initial approach of sampling from a uniform distribution being found not optimal, and the logit normal distribution being more commonly used, as it focuses on the middle steps 6m40s.
  • The resolution of images also matters in terms of perceived noise, and the difficulty of noise levels varies, with those in the middle being the most challenging, requiring important decisions about where to go 8m10s.
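
Two of the training and sampling ideas in this section are small enough to sketch directly: drawing the training time step from a logit-normal distribution, and combining conditional and unconditional predictions for classifier-free guidance. The function names, the default `mean` and `std`, and the guidance `scale` are illustrative assumptions, not values from the lecture.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))

def classifier_free_guidance(v_uncond, v_cond, scale):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]

rng = random.Random(0)
samples = [sample_logit_normal(rng) for _ in range(10000)]
# Share of time steps that fall in the middle half of [0, 1];
# a uniform sampler would put about half of them there.
middle_fraction = sum(0.25 <= t <= 0.75 for t in samples) / len(samples)

guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

With `scale=1.0` the guidance formula returns the conditional prediction unchanged, and larger values push the output further in the direction the condition suggests.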

Model Training and Deployment Considerations

  • To train a model on higher-resolution images, it is necessary to shift toward higher noise levels, because at the same noise level a higher-resolution image appears less noisy than a lower-resolution one, owing to the spatial correlation within the image 10s.
  • The typical model training process involves several stages, starting with pre-training, which is the most time-consuming and expensive part, requiring a large corpus of high-quality images that encompass all possible images to be generated 2m6s.
  • After pre-training, the next step is to generate good images, which can be aesthetically pleasing and relevant to the field of interest, and this can be achieved through continued training, where the model is taught to generate images in a specific set of interest 2m6s.
  • An optional third step is the tuning step, where the model is fine-tuned to generate images of a specific subject or person, using techniques such as DreamBooth, which involves gathering a small set of images of that person or object and training the model to generate images containing it 2m6s.
  • To deploy the model to production, it is necessary to reduce the cost and time required for generation, and this can be achieved through methods such as distillation, including progressive distillation, which aims to shorten the number of steps needed to generate samples 2m6s.
  • Finally, evaluating the quality of generated images is crucial to determine where to focus efforts, and this was discussed in the previous lecture, which covered various methods for evaluating image quality 2m6s.

Image Quality Evaluation Metrics

  • Evaluating images is crucial, and the most common method is pairwise comparison between images generated by different models, summarized with the Elo rating, a metric that accounts for the history and strength of each model, since winning against a weak model is not the same as winning against a strong one 10s.
  • The Elo rating quantifies how good each model is and allows an expected score to be computed from the ratings of two models; the difference between the actual and expected scores, known as delta, indicates how surprising the outcome is, with a higher delta indicating a more significant win 2m6s.
  • The Elo score comes up frequently, and its key aspect is that it accounts for the strength of the opponent when comparing models 4m30s.
  • Not all models have the luxury of human ratings, which is why automated metrics like the FID score, or Fréchet Inception Distance, are used, computing the distance between the distribution of generated images and real images, with lower scores indicating better performance 6m15s.
  • The FID score is derived from a general formulation by assuming Gaussian distributions; it is a proxy metric, not perfect but useful, and other methods, such as leveraging multimodal large language models (MLLMs), can provide more clever ways of obtaining scores 8m20s.
  • Multimodal large language models take text and images as inputs, so they can be prompted to act as a judge and rate images, enabling a tighter iteration loop and potentially replacing human ratings with MLLM-based ratings 10m50s.
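
The Elo update and the Gaussian assumption behind FID can both be written out briefly. The sketch below makes assumptions beyond the summary: the 400-point scale and `k=32` are the conventional chess values, and the Fréchet distance is shown only for one-dimensional Gaussians, whereas FID applies the multivariate version to Inception features.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B on the conventional 400-point Elo scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison (score_a: 1 win, 0.5 tie, 0 loss)."""
    delta = score_a - expected_score(rating_a, rating_b)  # surprise of the outcome
    return rating_a + k * delta, rating_b - k * delta

def frechet_distance_1d(mean_a, std_a, mean_b, std_b):
    """Frechet distance between two 1D Gaussians: (m_a - m_b)^2 + (s_a - s_b)^2."""
    return (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2

# Beating an equally rated model versus beating a stronger one.
even_a, even_b = update_elo(1500.0, 1500.0, score_a=1.0)
upset_a, upset_b = update_elo(1400.0, 1600.0, score_a=1.0)
gain_even = even_a - 1500.0
gain_upset = upset_a - 1400.0
```

The upset yields a larger rating gain than the even match, which is the "strength of the opponent" effect described above; the Fréchet distance is zero only when both the means and the spreads of the two distributions agree.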
  • The process of generating an image from a prompt involves a lot of underlying complexity, which was covered in lectures 1 to 7, and now the focus is on state-of-the-art models that can perform this task 10s.

State-of-the-Art Image Generation Models

  • The best models are ranked on a leaderboard using the Elo score, with the top models coming from OpenAI, Google, and xAI, but these models are closed source and do not publish reports on how they work 1m42s.
  • The leaderboard also ranks open-weight models with published technical reports, with the top-ranked model being Hydream 01 from Hydra, followed by Qwen-Image, and then the FLUX.2 models from Black Forest Labs 2m6s.
  • The FLUX.2 models are based on rectified flow, a variant of flow matching, and use a combination of single-stream and double-stream diffusion transformer blocks, as well as a VAE to ensure a compact latent space 3m15s.
  • The Qwen-Image model uses a flow matching loss, a multimodal diffusion transformer, and a VAE, with text embeddings based on Qwen 4m30s.
  • The top-ranked Hydream 01 model uses a flow matching loss and a transformer-based architecture, but does not use a VAE or a pre-trained text encoder, instead generating images in pixel space 5m40s.
  • The Hydream 01 model uses a larger patch size of 32x32 to make computation tractable, and shifts the effort of learning to the diffusion-based transformer to address the challenges of learning in pixel space 7m10s.
  • The diffusion transformer or transformer-based model can still achieve amazing results even with a harder problem if it is scaled well enough, as seen in a paper that scaled the model to 8 billion and 200 billion parameters, which is huge for image generation models 10s.
  • The trade-off between making it easier for the transformer to learn versus doing it in the raw space is still being figured out, and learning the latent space can make it easier for the model to learn but may lose fidelity due to the lossy operation of operating in a non-original space 1m5s.
  • A newer trend is emerging where models are trained without a variational autoencoder (VAE), and this paper suggests that the downsides of dropping the VAE may be mitigated by scaling the model, which is worth keeping an eye on 2m6s.
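
The role of the larger patch size in pixel space becomes clearer with a quick token count. The arithmetic below is illustrative only: the 1024×1024 resolution, the latent patch size of 2, and the VAE downsampling factor of 8 are assumptions chosen for the example, while the 32×32 pixel-space patch is the figure quoted above.

```python
def num_tokens(height, width, patch_size, vae_downsample=1):
    """Number of transformer tokens for one image.

    vae_downsample=1 means the model works directly in pixel space.
    """
    h = height // vae_downsample
    w = width // vae_downsample
    return (h // patch_size) * (w // patch_size)

pixel_large_patch = num_tokens(1024, 1024, patch_size=32)                # pixel space, 32x32 patches
latent_small_patch = num_tokens(1024, 1024, patch_size=2, vae_downsample=8)  # latent space, 2x2 patches
pixel_small_patch = num_tokens(1024, 1024, patch_size=2)                 # pixel space, 2x2 patches
```

Under these assumptions, pixel space with 32×32 patches yields 1,024 tokens, fewer than the 4,096 of the latent setup, while pixel space with small patches would need 262,144 tokens; since self-attention cost grows quadratically with the token count, the large patch is what keeps pixel-space generation tractable.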

Text Encoding and Prompt Enhancement

  • An alternative to using a pre-trained encoder is to learn the representation of the input text itself by tokenizing it and learning its representation as part of the training, and some models use prompt enhancement to make the job of the text encoder easier 4m30s.
  • The use of a pre-trained text encoder is not necessary, and instead, a text encoder can be trained as part of the training, allowing the model to understand text without relying on a pre-trained encoder 6m15s.
  • The concept of image generation can be extended to video generation, which involves adding an extra dimension of time and ensuring temporal consistency between frames, so that the objects and characters in the video do not suddenly change or appear with new things 9m20s.

Video Generation and Temporal Consistency

  • Generating videos requires not only creating plausible frames at a given time, but also ensuring temporal consistency, meaning the sequence of frames makes sense, and this is an additional consideration compared to generating images 10s.
  • To make computations tractable, the input must be represented in a way that reduces dimensionality, since representing a video as a sequence of 2D images multiplies the dimension by the number of frames; metrics such as the Fréchet Inception Distance (FID) are extended to videos by using a pre-trained encoder to represent videos 1m30s.
  • The FID metric is used to evaluate the quality of generated videos, but it is just a proxy, and human evaluation is also necessary to ensure the generated videos are of good quality, and there are other metrics such as the Fréchet Video Distance that can be used to compare generated videos to real ones 2m30s.
  • Video generation models are based on traditional image generation models, which are DiT-based models operating in a latent space, and for videos, both spatial and temporal compression are performed to reduce the dimensionality of the input 3m30s.
  • The spatial compression ratio, denoted as f, is typically around 8, and temporal compression is also necessary to reduce the dimensionality of the input along the time axis, as consecutive frames in a video often have redundant information 4m30s.
  • The latent space in video generation models is composed of space-time latents, which represent not only the information within a given image but also the information with respect to time, and the dimension of the latent space is affected by the temporal compression ratio 5m30s.
  • With temporal compression by a factor of 4, a video of 1 + T frames maps to 1 + T/4 latent frames along the time axis; the extra 1 is there because a starting point, the initial frame, is needed to generate a video 6m30s.
  • The first frame of a video is considered special and is referred to as the anchor frame; the "1 +" term keeps it uncompressed in time so that it is represented as faithfully as possible, which helps the video continue naturally from it 10s.
  • The VAE being discussed is a 3D VAE, operating along the time dimension, and is called a causal VAE because it only uses past and current frames to compute feature maps, not future frames, allowing for efficient streaming of the encoding process 2m6s.
  • The causal VAE is asymmetric in time: each frame's features depend only on the current and past frames, which supports temporal consistency and keeps the model from inventing objects or people that contradict what happened earlier in the video 2m6s.
  • The receptive field can widen with multiple convolution operations, and making sure a given frame only depends on the current and past frames allows for computational efficiency and streaming of the encoding and decoding process 2m6s.
  • The use of past frames is important for capturing temporal consistency and making sure the video is consistent, such as in the example of a teddy bear walking across the street and a pedestrian coming back into view 4m30s.
  • To address efficiency concerns, video generation can be divided into parts, generating a video of a fixed length and then using the last frame as the anchor frame to generate another video of a fixed length 6m30s.
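
The latent shape and the causal property of the video VAE can be sketched as follows. The defaults use the ratios mentioned above (4 in time, 8 in space); the function names and the example video size are assumptions, and a one-dimensional convolution over a list of numbers stands in for the temporal axis of a real 3D convolution.

```python
def latent_shape(num_frames, height, width, temporal_ratio=4, spatial_ratio=8):
    """Map a video of 1 + T frames to (1 + T / temporal_ratio, H / f, W / f)."""
    t = num_frames - 1  # frames after the anchor frame
    if t % temporal_ratio != 0:
        raise ValueError("num_frames - 1 must be divisible by temporal_ratio")
    return (1 + t // temporal_ratio, height // spatial_ratio, width // spatial_ratio)

def causal_conv1d(signal, kernel):
    """Temporal convolution whose output at index i uses only signal[i], signal[i-1], ...

    Causality comes from zero-padding on the left (the past) only.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    return [
        sum(kernel[j] * padded[i + pad - j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]

shape = latent_shape(num_frames=81, height=480, width=832)

# Changing a future frame must not change the features of earlier frames.
out_a = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel=[0.5, 0.5])
out_b = causal_conv1d([1.0, 2.0, 3.0, 99.0], kernel=[0.5, 0.5])
```

Because earlier outputs never depend on later frames, frames can be encoded as they arrive, which is the streaming benefit described above.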

Efficient Video Generation and Causal Models

  • The model generates videos in the latent space, similar to image generation, using a DiT-based architecture that takes patches as input and produces patches as output 9m40s.
  • The main difference between image and video generation is that video generation works with space-time patches rather than purely spatial patches, and these patches interact with each other through self-attention to produce an output that is coherent across space and time 10s.
  • To ensure causality in video generation, it is essential to consider the type of data used for training the model, as the resulting model will reflect the patterns it has seen during training, and feeding the model with data that reflects causality will help it learn causal patterns 2m6s.
  • In large language models, masked self-attention is used to make things causal, and a similar approach can be applied to image generation, but in the case of video generation, people typically keep the full self-attention to ensure consistency between all parts of the video 4m30s.
  • There are multiple models available for video generation, including Wan and LTX, and reading the papers on these models can be helpful, especially for those who already understand how image generation models work 6m40s.
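
The difference between masked (causal) self-attention and full self-attention comes down to which token pairs are allowed to interact. The helper below is a minimal sketch with a hypothetical name; `True` at position `[i][j]` means token `i` may attend to token `j`.

```python
def attention_mask(num_tokens, causal):
    """Build a boolean attention mask.

    causal=True: token i sees only tokens 0..i (as in autoregressive LLMs).
    causal=False: every token sees every other token (full self-attention).
    """
    return [
        [(j <= i) or not causal for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

causal_mask = attention_mask(3, causal=True)
full_mask = attention_mask(3, causal=False)
```

With full attention, a space-time patch from the first frame can attend to one from the last frame and vice versa, which is the property that helps keep the whole clip consistent.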

Image Editing and Vision-Language Models

  • Image editing can be performed using text-to-image models, but considering it as a from-scratch generation problem may not be optimal, as it may not preserve the original image, and instead, thinking about it as an image editing problem can be a better approach 8m50s.
  • The problem with using text-to-image models for image editing is that it may not guarantee that the output image is the same as the input image, and alternative approaches are being explored to address this issue 10m40s.
  • Editing an image can be performed by feeding a prompt into a vision-language model (VLM), which is a type of multimodal large language model (MLLM), to obtain editing actions that can drive editing software like Photoshop, with the goal of preserving the initial image while making the desired changes 10s.
  • The main challenge with this approach is for the VLM to know the set of possible actions well enough for its editing actions to make sense, and a few papers attempt to address this 1m30s.
  • One possible method to ensure the output is aligned with something that makes sense is to look at the logs of people who are actually making edits, and to use the sequence of edits made as a way for the model to learn from, with the goal of inferring user intent 2m6s.
  • Papers on this topic are trying to come up with pairs of initial image and edited image, along with a user intent that is inferred by these two images, and are tuning the VLM on these golden sets to have it behave more like something that would tell you actions that would correspond to your intent 3m30s.
  • The loss function is not used to infer the user intent, but rather, one possible way of inferring user intent is to feed an off-the-shelf VLM an initial image and an output image, and to ask the VLM to tell what changed between these two images 5m40s.
  • Applying diffusion to the field of Large Language Models (LLMs) is a very hot area of research, with knowledge being transferred from the text world to the vision world, and the transformer model, initially designed for translation tasks in 2017, is an example of this transfer of knowledge 8m10s.

Diffusion in Large Language Models

  • The vision world has adapted the transformer architecture to leverage its scalability benefits, particularly with the diffusion transformer, which relies on this adaptation 10s.
  • Post-training approaches, such as injecting negative signals into models, have been explored, including methods like DPO (Direct Preference Optimization) from the LLM world, which has been adapted to the diffusion world, and GRPO, which is widely used in LLMs and has been experimented with in the vision world 42s.
  • In the text world, most tasks are performed in an autoregressive way, where one token is generated at a time and each output is a function of the previous ones, similar to how humans converse; such LLMs are referred to as autoregressive models (ARMs) 2m6s.
  • The auto-regressive approach can be time-consuming for long output sentences or responses, as the number of iterations is proportional to the number of output tokens, making it inefficient for tasks like generating large amounts of code 2m6s.
  • An alternative approach is to borrow the idea of diffusion from the vision world and apply it to text, where instead of generating tokens one at a time, the model starts with noise and progressively denoises it to obtain the final output 4m30s.
  • This diffusion-based approach for text can reduce the number of sequential model calls at inference from being proportional to the number of output tokens to being proportional to the number of diffusion steps 6m40s.
  • The idea of starting with a noisy version of the text and refining it through diffusion steps is analogous to the process of writing a speech, where one starts with a draft and refines it through multiple iterations 5m50s.
  • The diffusion-based LLM would take text with random noise and progressively reduce the noise to produce the cleaned text, similar to how diffusion models work with images 7m30s.
  • The process of denoising the whole input text is similar to denoising an entire image at once, but the concept of noise is trickier in the text world due to its discrete nature, comprising words, tokens, and other elements, unlike the continuous world of images 10s.
  • In the text world, representing noise is not as straightforward as using Gaussian noise in the image world, and simply replacing tokens with random ones can alter the semantic meaning of the input, which is why a dedicated mask token is often used to represent unknown text tokens 1m5s.
  • To apply the denoising process to text, a clean sentence is corrupted according to a noise level, with a certain percentage of tokens being masked, and the training objective is to reconstruct the masked tokens based on the remaining unmasked tokens 2m6s.
  • This approach is similar to the pre-training task used in BERT, an encoder-only architecture, but with a variable noise level that can be adjusted, allowing for more flexibility in the masking scheme 3m40s.
  • At inference time, the model is used to predict the hidden tokens behind a sequence of mask tokens, and to refine the predictions, some tokens can be remasked and re-predicted, either randomly or based on confidence scores, until a final output is obtained 5m20s.
  • This paradigm has the potential to speed up text processing by up to 10x compared to traditional auto-regressive methods, making it particularly useful for tasks that require quick responses 8m30s.
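
The masking-based corruption and the confidence-based unmasking loop can be sketched with a toy example. Everything here is illustrative: `toy_denoiser` is a hypothetical stand-in for a trained model (it simply looks up a reference sentence and attaches random confidences), and the schedule that commits an equal share of positions per step is one simple choice among many.

```python
import math
import random

MASK = "[MASK]"

def corrupt(tokens, noise_level, rng):
    """Training-time corruption: mask each token independently with probability noise_level."""
    return [MASK if rng.random() < noise_level else tok for tok in tokens]

def toy_denoiser(tokens, reference, rng):
    """Hypothetical model: for each masked position, return (predicted token, confidence)."""
    return {i: (reference[i], rng.random()) for i, tok in enumerate(tokens) if tok == MASK}

def generate(length, num_steps, denoiser):
    """Start fully masked; at each step keep the most confident predictions, leave the rest masked."""
    tokens = [MASK] * length
    calls = 0
    for step in range(num_steps):
        if MASK not in tokens:
            break
        preds = denoiser(tokens)
        calls += 1
        # Commit an equal share of the still-masked positions, most confident first.
        n_commit = math.ceil(len(preds) / (num_steps - step))
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens, calls

rng = random.Random(0)
reference = "the cat sat on the mat today ok".split()  # 8 tokens
noisy = corrupt(reference, noise_level=0.5, rng=rng)
output, model_calls = generate(len(reference), num_steps=4,
                               denoiser=lambda toks: toy_denoiser(toks, reference, rng))
```

Here eight tokens are produced with four model calls, where token-by-token autoregressive decoding would need eight; the saving grows with the output length as long as the number of diffusion steps stays fixed.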

Challenges and Applications of Diffusion Models

  • The paradigm of diffusion models can significantly speed up coding tasks, making them a good use case, and it is also beneficial for fill-in-the-middle tasks, which are common in coding 10s.
  • However, training diffusion models is more expensive than training traditional auto-regressive models, particularly because the traditional way of training allows for parallelization, which is not as effective with diffusion schemes 2m6s.
  • There are techniques that combine the benefits of auto-regressive models and diffusion models, such as block diffusion, which generates text block by block, allowing for more efficient handling of variable output lengths 2m6s.
  • To handle variable output lengths, a common approach is to set a given length for the output and stop generating text after a special "end of sentence" token, but this can be wasteful if the actual output is shorter 2m6s.
  • Block diffusion is a method that can be useful in handling variable output lengths, as it generates text in blocks and can be repeated until the desired output is achieved 2m6s.
  • The differences between text and image, such as variable length, are important considerations when applying diffusion models to text, and there are potential applications to other discrete domains 2m6s.
  • Researchers have derived mathematical ways to apply diffusion models to discrete items, and there are papers that explore this topic, such as the first paper mentioned, which discusses how to go from continuous to discrete grids 2m6s.
  • Another approach to handling text is to treat it as images of text and use OCR mechanisms to process it, which is a promising direction, as seen in papers like DeepSeek-OCR 2m6s.
  • The current price per megapixel for perfect images from top models is around 10 cents, which can be a useful metric to track over time to see the transition from a niche product to a commodity 10s.
  • There are challenges that can be tackled in the near term, such as reasoning with images, which involves generating diagrams that are more refined and concise, similar to the level of refinement achieved in the text world 2m6s.
  • Image editing use cases can be improved by using existing tools and human expertise to make editing happen in a constrained and tractable way, rather than relying on the whole generation process 2m6s.
  • Other attractive use cases include learning about a class by bringing together multiple streams of information, such as text, slides, lecture videos, and audio tracks, to generate a synthesis in a consistent and coherent way 2m6s.
  • In the longer run, these advances could benefit many industries, such as robotics and medicine, as well as desk jobs more broadly 2m6s.
  • However, there are also challenges to be addressed, such as the costs associated with these models, which can be alleviated through distillation, and the need for research on the hardware side to simplify the operations involved in transformers 2m6s.
  • Another concern is the data quality side, as the images being generated and released into the wild can have an impact on the overall quality of the models 2m6s.

Future Directions and Societal Implications

  • The next generation of models may no longer have access to the true data distribution p_data, and some papers study the phenomenon of model collapse, where training on generated data creates an echo chamber of mistakes that keep growing 10s.
  • Two ways to counter model collapse and the blurring of lines between the true data distribution and generated images are the C2PA standard, which has been gaining traction across software companies, and watermarking, such as SynthID from Google DeepMind 42s.
  • Watermarking hides a signal in the pixel patterns that reveals the origin of an image, and it is one of the ways to tackle the data quality issue; metadata-based provenance, by contrast, can disappear when someone takes a screenshot of an image 2m6s.
  • The topic of safety is very important in creating images, as generating harmful images can have societal implications, and companies have their own policies to guard against such generations, while laws are also being developed to address this issue 2m6s.

Resources for Learning and Staying Updated

  • To stay updated on the latest developments in the field, one suggestion is to look at the relevant arXiv section on computer vision, which receives hundreds of submissions every day, and some venues do a good job of distilling these works to give an idea of where the field is heading 2m6s.
  • A great pattern in academia and industry is the release of code used to design a given method, and cloning the GitHub repo and playing with AI assistant coding software can be a great way to learn and understand how these methods work 2m6s.
  • Additional resources for learning and staying updated include Twitter, where there is a community of people discussing these topics, and other Stanford courses, such as CS231N, which covers vision, as well as a study guide that is kept up to date across the years 2m6s.