YouTube video summary

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

Artificial Intelligence

02 Jun 2026 · 24 min summary · From Stanford Online

Introduction to Image Generation Model Evaluation

The lecture is about evaluating the quality of the output of text-to-image generation models, which is a crucial step in determining what to improve, and this topic is important because judging the quality of an image can be quite subjective 10s.
The previous lecture covered the training life cycle of text-to-image generation models, including the loss function used for training, the logit-normal distribution, and the concept of time step shift, which is necessary because, at a given noise level, a higher-resolution image appears less noisy than a lower-resolution one 2m6s.
The training life cycle of text-to-image generation models involves pre-training, where the goal is to teach the model how to generate images, using techniques such as curriculum learning and handling different resolutions, which a DiT-based model can do by taking a longer input sequence 4m30s.
Post-training approaches aim to generate more beautiful images, including continued training, supervised fine-tuning, and preference tuning methods, such as those derived from GRPO and DPO, which capture negative user signals 6m15s.

Post-Training and Personalization Techniques

Personalization of text-to-image generation models is also possible, using methods like DreamBooth, which relies on a rare token and trains the model on a specific object or person. Distillation methods, such as progressive distillation and distribution matching distillation, reduce the number of steps needed at inference time 8m0s.
The current lecture will focus on evaluation, which involves assessing the quality of the generated image, and a motivating example will be used to illustrate this, with an input prompt of a teddy bear reading a book, and the generated image will be evaluated to determine if it is a good image or not 10m30s.
The evaluation of generated images from text-to-image generation models typically involves assessing two main categories: whether the image is aesthetically pleasing and physically plausible, and whether it adheres to the input prompt, with the example of a teddy bear reading a book being used to illustrate this point 10s.

Evaluation Criteria for Generated Images

The two main buckets of evaluation are not exhaustive, as there are other criteria to consider, including safety, diversity, generalization capabilities, and bias, which are important for ensuring the model generates a wide range of outputs without memorizing the input or producing unsafe content 4m6s.
The image evaluation problem can be thought of as assessing whether the output image is good along the dimensions of aesthetics and prompt adherence, with these two categories being the primary focus of evaluation 6m42s.
One simple and natural way to evaluate the output of a text-to-image generation model is to leverage human evaluators, who can rate the generated images on a scale from 1 to 5, with 1 being very bad and 5 being very good, to determine the average rating of the model's performance across a dataset 8m50s.
The average rating score is nuanced, allowing for distinction between very good, good, and not great images, and can be calculated by summing the ratings and dividing by the number of ratings, providing a way to measure the performance of the model 10m50s.

Human Evaluation Methods and Limitations

The main drawback of using a one to five scale for rating images is that it can introduce noise in the ratings, as humans may have different interpretations of the scale, and it can be hard for people to decide between two adjacent ratings, such as four versus five 10s.
To reduce this problem, a second method is to use a binary scale, where humans are asked to rate an image as either good or bad, which is an easier task, and the score can be calculated as the proportion of images that pass the bar 2m6s.
However, humans still have a hard time evaluating things on an absolute scale, even with a binary scale, and it's easier for them to compare two things rather than making an absolute judgment, which leads to a third method of rating images by comparing them to other images 4m30s.
In the pair-wise comparison setting, two images are generated, and the task is to say which one is better. This is an easier task than rating on an absolute scale, and the ratings are less noisy than absolute ratings, including binary ones 6m20s.
A natural way to quantify the performance of a model in the pair-wise comparison setting is to count the number of wins and divide that by the number of times it was compared to something else, which is called the win rate, but this metric may not be sufficient, as winning against a good model is harder than winning against a bad one 10m30s.
The win proportion should not be the only factor, and the comparison should also consider who the model is being compared to, which is similar to how models are ranked in a leaderboard, where models can come and go, and their performance can change over time 14m40s.
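The three human-evaluation scores described above (average rating, pass rate, and win rate) can be sketched in a few lines. This is a minimal illustration, assuming ratings are already collected; the function names and the sample data are hypothetical and not taken from the lecture.

```python
from statistics import mean


def mean_rating(ratings):
    """Average of 1-to-5 ratings: sum of ratings divided by their count."""
    return mean(ratings)


def pass_rate(labels):
    """Proportion of images rated 'good' (True) on a binary scale."""
    return sum(labels) / len(labels)


def win_rate(outcomes):
    """Number of wins (True) divided by the number of pair-wise comparisons."""
    return sum(outcomes) / len(outcomes)


# Hypothetical annotations, for illustration only.
ratings = [5, 4, 2, 3, 4]
labels = [True, True, False, True]
outcomes = [True, False, True, True]

print(mean_rating(ratings))  # 3.6
print(pass_rate(labels))     # 0.75
print(win_rate(outcomes))    # 0.75
```

Note that `win_rate` ignores who the opponent was, which is exactly the limitation raised above.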

Improving Evaluation with Win Rate and Elo Metrics

To evaluate a model's performance, it would need to be evaluated against all models in a list, and this process would need to be repeated every time the list is updated, which would result in a large number of evaluations 10s.
To address this issue, the win rate metric can be adjusted to take into account the strength of the opponent, which would provide a more accurate assessment of a model's performance 1m42s.
The idea is to capture the difference in reaction when a model wins against a weak opponent versus a strong opponent, and to incorporate this into the metric 2m6s.
A metric per model, denoted as R or rating, can be used to evaluate a model's performance, and the rating can be updated based on the expected score and actual score against other models 3m10s.
The expected score can be computed using a formula that takes into account the ratings of the two models being compared, and the actual score can be determined by comparing the two models 4m20s.
The difference between the actual score and the expected score, denoted as delta, can be used to update the model's rating, with a positive delta indicating a better-than-expected performance and a negative delta indicating a worse-than-expected performance 6m10s.
The Elo score, or Elo rating, tracks the performance of a model by computing the score as a function of how strong the opponent is, which allows for efficient evaluation of models without requiring every model to be evaluated against every other model 9m40s.
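The rating update described above can be sketched as follows. The summary does not give the constants, so the logistic expected-score formula with a scale of 400 and the step size `k=32` are assumptions borrowed from the conventional chess formulation, not values confirmed by the lecture.

```python
def expected_score(r_a, r_b, scale=400.0):
    """Expected score of model A against model B, given their ratings.

    The base-10 logistic form and scale=400 are conventional assumptions.
    """
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def update_ratings(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is the actual score of A: 1.0 for a win, 0.0 for a loss.
    delta is the actual score minus the expected score.
    """
    delta = score_a - expected_score(r_a, r_b)
    return r_a + k * delta, r_b - k * delta


# Equal ratings: a win is only mildly surprising.
print(update_ratings(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)

# Beating a stronger opponent yields a larger gain than beating a weaker one.
gain_vs_strong = update_ratings(1400.0, 1600.0, 1.0)[0] - 1400.0
gain_vs_weak = update_ratings(1600.0, 1400.0, 1.0)[0] - 1600.0
print(gain_vs_strong > gain_vs_weak)  # True
```

Because the update only needs the two current ratings, a new model can be placed on the leaderboard without being compared against every other model.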

Challenges with Human Evaluation and Introduction to Automated Metrics

The method is named after the person who came up with it, Arpad Elo, so "Elo" is not an acronym. The problems with this approach include the expense and slowness of involving humans in the evaluation process, as well as the subjectivity of human ratings, which can be influenced by various factors 10s.
The subjectivity of human evaluations is highlighted by the example of determining whether an image is well-lit, as there are multiple notions of what constitutes good lighting, and humans are not perfect, with their ratings potentially being affected by external circumstances 1m20s.
To address these limitations, automated metrics are being explored, starting with reference-free metrics, which do not rely on a single reference image, as there are multiple ways to produce an image for a given prompt, making it unfair to compare against a single reference 2m6s.

Reference-Free Metrics and Distribution Comparison

The reference-free metrics are used to quantify the quality of generated images with respect to aesthetics and prompt adherence, and one approach is to compare the distribution of generated images to the distribution of real images in a latent space 3m30s.
To compare the distributions, the mean and covariance matrix of each distribution can be used, with the covariance matrix providing information on the spread and diversity of the generated images, which is an important aspect to quantify 5m40s.
The goal is to quantify the distance between the two distributions, and a formula can be used to calculate this distance, which is a key aspect of evaluating the quality of generated images 7m10s.
Evaluating diffusion models this way involves quantifying the difference in location and shape between the distributions of real and generated images: the first term compares the means of the two distributions and the second term compares their shapes. The resulting metric is the Fréchet Inception Distance (FID), which is derived from the Wasserstein distance 10s.
The FID metric aims to quantify the effort required to transform one distribution into another, and it has a closed-form solution when the distributions are assumed to be Gaussian, allowing for a simpler calculation of the distance between the two distributions 1m42s.
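For reference, the closed form under the Gaussian assumption is usually written as below. The symbols are notation chosen here, not taken from the lecture: \mu_r, \Sigma_r are the mean and covariance of the real-image features and \mu_g, \Sigma_g those of the generated-image features.

```latex
\mathrm{FID}
  = \underbrace{\lVert \mu_r - \mu_g \rVert_2^2}_{\text{location}}
  + \underbrace{\operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)}_{\text{shape}}
```

The first term vanishes when the two means coincide, and the second vanishes when the two covariance matrices are equal.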

FID Metric and Its Application

The FID is a distance metric, and a lower value indicates better performance, with the goal of minimizing the distance between the distributions of real and generated images 3m10s.
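To build intuition for "lower is better", here is a toy sketch of the same distance for one-dimensional features, where the covariance term reduces to the squared difference of standard deviations. This is a deliberate simplification: real FID uses high-dimensional Inception features and full covariance matrices, and the scalar "features" below are hypothetical.

```python
from statistics import fmean, pstdev


def frechet_distance_1d(real, generated):
    """Fréchet distance between two 1-D Gaussians fitted to the samples.

    For scalars, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) reduces to (sd_r - sd_g)^2.
    """
    mu_r, mu_g = fmean(real), fmean(generated)
    sd_r, sd_g = pstdev(real), pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


real = [0.0, 1.0, 2.0]
print(frechet_distance_1d(real, [0.0, 1.0, 2.0]))  # identical samples: 0.0
print(frechet_distance_1d(real, [3.0, 4.0, 5.0]))  # same spread, shifted mean: 9.0
print(frechet_distance_1d(real, [1.0, 1.0, 1.0]))  # same mean, no diversity: > 0
```

The last call shows why the shape term matters: a generator that collapses to a single output matches the mean but is still penalized.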
The FID metric uses a pre-trained encoder, specifically the Inception network, to compute the representations of real and generated images, and this allows for comparison with other models that use the same representation 5m30s.
The FID metric can be applied to diffusion models that operate in pixel space, by using the pre-trained Inception model to compute the representations of the generated images, and then comparing them to the representations of real images 7m10s.
When evaluating diffusion models using the FID metric, it is possible to condition the evaluation on a particular prompt or task, such as generating faces or indoor scenes, by comparing the generated images to a set of real images that are representative of the task or prompt of interest 9m30s.
The evaluation of generated images can be done by comparing them to a set of real images, typically using a dataset that is aligned with the task of interest, such as ImageNet, MS COCO, and others, which contain the thing that would be used as input, allowing for comparison to the distribution of real images 10s.
In practice, the set of real images used for comparison is usually around 50,000 images, such as FID 50k, where 50,000 generated images are compared to 50,000 real images to evaluate the quality of the generated images 2m6s.
The use of variance and mean to characterize the whole distribution of generated images is not necessarily fair, as the shape and distance may not be representative of the quality, and the community acknowledges that the current metric, FID, is not perfect 4m30s.

Prompt Adherence and CLIP-Based Evaluation

FID is widely used as a metric to quantify the quality of generated images, despite its limitations, such as not being reflective of the quality of the images, and the community continues to use it because it provides a way to compare methods and results 6m20s.
The location and shape of the distribution of generated images can provide insights into the quality and diversity of the generated images, with a big gap in the location potentially indicating differences in quality or style, and a constrained region potentially indicating a lack of diversity 8m40s.
The sample size used for evaluation is typically around 50,000 images, and the FID metric is part of the reference-free metrics section, which compares the distribution of generated images to the distribution of real images, rather than comparing individual images 10m10s.
One of the main limitations of FID is that it assumes a Gaussian distribution, which is not typically the case for real and generated images, and this limitation is often discussed in papers that critique the use of FID as a metric 12m20s.
The evaluation of image generation models involves assessing prompt adherence, which can be quantified using methods such as CLIP (Contrastive Language Image Pre-training) to compare the input text and output image, and this can be done using the CLIP score, which provides a measure of how aligned the text and image are 10s.
Another approach to evaluating prompt adherence is to train a CLIP-like model on a dataset of human preferences so that it predicts how good the image is with respect to the input text. This results in a preference score, such as PickScore, that combines aesthetics and prompt adherence into a holistic score 2m6s.
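The CLIP score described above boils down to a cosine similarity between the text embedding and the image embedding. The sketch below assumes the embeddings have already been produced by a CLIP text encoder and image encoder; the vectors are placeholders, and clipping negative similarities to zero with an optional `weight` is an assumption about one common formulation rather than something stated in the lecture.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(text_embedding, image_embedding, weight=1.0):
    """Alignment score: rescaled cosine similarity, floored at zero (assumed)."""
    return weight * max(cosine_similarity(text_embedding, image_embedding), 0.0)


# Placeholder embeddings; real ones would come from a CLIP model.
text_embedding = [0.2, 0.9, 0.1]
aligned_image = [0.25, 0.85, 0.05]
unrelated_image = [0.9, -0.3, 0.4]

print(clip_score(text_embedding, aligned_image) > clip_score(text_embedding, unrelated_image))  # True
```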

Reference-Based Metrics and Image Reconstruction

In addition to evaluating image generation models, other components such as the VAE (Variational Autoencoder) are also important, and reference-based metrics can be used to compare the output of the VAE with the original input, providing a clearer label for evaluation 4m42s.
Reference-based metrics are useful for evaluating image reconstruction tasks, as well as other use cases such as image editing tasks, where the goal is to compare the edited image with the original input image 6m15s.
One common reference-based metric is the Mean Squared Error (MSE), which calculates the pixel-wise distance between the reference image and the generated image, but it has the drawback of being sensitive to alignment between the two images 8m30s.
The notation for the input and output images is X and Xhat, respectively, and the goal is to find a quantified metric to compare the two images and evaluate the quality of the reconstruction 10m10s.
The Mean Squared Error (MSE) metric has a drawback in that its value depends on how pixels are encoded, making it difficult to interpret, and this issue is addressed by the Peak Signal to Noise Ratio (PSNR) metric, which normalizes MSE with respect to the maximum value it can take 10s.

Pixel-Based Metrics and Their Limitations

The PSNR metric not only normalizes MSE but also applies a logarithm to the result, which helps to put the error into context, similar to how the perceptual difference between turning on a light bulb in a dark room versus a well-lit room is different 42s.
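A minimal sketch of MSE and PSNR on flattened pixel lists, assuming the usual definition PSNR = 10 log10(MAX^2 / MSE), where MAX is the largest value a pixel can take. The pixel values are made up; the point is that MSE changes with the pixel encoding while PSNR does not.

```python
import math


def mse(x, x_hat):
    """Mean squared pixel-wise error between two flattened images."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def psnr(x, x_hat, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    err = mse(x, x_hat)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# The same image pair encoded in [0, 255] and in [0, 1].
x_255, x_hat_255 = [0.0, 128.0, 255.0], [0.0, 120.0, 255.0]
x_unit = [p / 255.0 for p in x_255]
x_hat_unit = [p / 255.0 for p in x_hat_255]

print(mse(x_255, x_hat_255), mse(x_unit, x_hat_unit))  # very different values
print(psnr(x_255, x_hat_255, 255.0), psnr(x_unit, x_hat_unit, 1.0))  # approximately equal
```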
The PSNR metric is still sensitive to pixel position and shifting, which is why other metrics, such as those that look at the structure of images, are also used to evaluate image quality 2m6s.
One such metric examines the structure of images by comparing patches of the original input and generated output, looking at aspects such as brightness, contrast, and the variation of pixels within the patches 2m6s.

Structural Similarity Metrics

This structural comparison involves computing the mean, variance, and covariance of pixel values within patches to quantify brightness, contrast, and structural similarity between the original and generated images 4m30s.
The structural similarity metric produces a quantity that represents the similarity between two patches, which can be calculated using a formula that takes into account the mean of the pixel values in each patch 6m10s.
The formula for structural similarity involves calculating a value based on the means of the two patches, with a constant added for stability purposes, and this value can be interpreted as a measure of how similar the two patches are 8m20s.
The formula can be related to a general identity from high school mathematics, the square of a difference, (a - b)^2 = a^2 - 2ab + b^2 >= 0, which implies 2ab <= a^2 + b^2. This relationship helps explain the properties of the structural similarity metric, in particular why the ratio cannot exceed 1 10m30s.
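Written out, the mean-based term takes the following form in the standard SSIM formulation. The symbols are assumptions for illustration: \mu_x and \mu_{\hat{x}} are the mean pixel values of the two patches and C_1 is the small stability constant.

```latex
l(x, \hat{x}) = \frac{2 \mu_x \mu_{\hat{x}} + C_1}{\mu_x^2 + \mu_{\hat{x}}^2 + C_1},
\qquad
(\mu_x - \mu_{\hat{x}})^2 \ge 0
\;\Longrightarrow\;
2 \mu_x \mu_{\hat{x}} \le \mu_x^2 + \mu_{\hat{x}}^2
\;\Longrightarrow\;
l(x, \hat{x}) \le 1
```

Equality holds exactly when the two means are equal, that is, when the two patches have the same brightness.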

Dice Coefficient and SSIM Metric

The Dice coefficient measures the similarity between two quantities based on their values and is bounded between 0 and 1, where 1 indicates identical quantities and 0 indicates no similarity 10s.
The Dice coefficient takes into account the relative difference between the two quantities rather than only their absolute difference, which makes it a good building block for evaluating image similarity 2m6s.
The same form can be used to compute the similarity between images along luminance, contrast, and structure, using the mean of the pixels for luminance, the variance or standard deviation for contrast, and the Pearson correlation for structure 4m30s.
The structural similarity index (SSIM) is a metric that combines the luminance, contrast, and structure similarities, and is computed by multiplying the individual similarities and averaging the scores across patches 6m40s.
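The computation described above can be sketched on flattened patches as follows. The stability constants (0.01 and 0.03 times the maximum pixel value, squared, with C3 = C2 / 2) are the conventional defaults and are an assumption here, as are the toy patches; practical implementations typically also use sliding, weighted windows.

```python
import math
from statistics import fmean


def ssim_patch(x, y, max_value=255.0):
    """SSIM between two flattened patches: luminance * contrast * structure."""
    c1 = (0.01 * max_value) ** 2  # conventional constants (assumed)
    c2 = (0.03 * max_value) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)  # Pearson correlation, stabilized
    return luminance * contrast * structure


def ssim(patches_x, patches_y, max_value=255.0):
    """Average the per-patch scores across all patches."""
    return fmean([ssim_patch(px, py, max_value) for px, py in zip(patches_x, patches_y)])


patch = [10.0, 50.0, 200.0, 120.0]
inverted = [255.0 - p for p in patch]
print(ssim([patch], [patch]))     # close to 1 for identical patches
print(ssim([patch], [inverted]))  # much lower: the structure is anti-correlated
```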
The SSIM metric is sensitive to pixel shifts, which can be a limitation, and an alternative metric is the learned perceptual image patch similarity (LPIPS), which uses a pre-trained encoder to compute the perceptual similarity between images 10m50s.

LPIPS and Perceptual Similarity

The LPIPS metric works by passing the images through a pre-trained encoder and computing the distance between the representations, which provides a measure of perceptual similarity that is less sensitive to pixel shifts 12m20s.
LPIPS has per-layer coefficients w_l that were tuned on a dataset of images so that the metric aligns with human perception. It is commonly used, but it is not very interpretable in terms of what is going wrong 10s.
The pre-trained encoders that can be used with this metric include VGG and AlexNet, and the coefficients w_l depend on the encoder used, so libraries let you specify which encoder to use 2m6s.
The formula for the metric takes the difference between the feature map of X and the feature map of Xhat, multiplies it element-wise with the coefficients w_l, and can be applied to a batch by using traditional aggregations like the mean 4m30s.
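One common way to write the distance described above is given below, as a sketch rather than the exact form shown in the lecture. The notation is assumed: \phi_l denotes the (channel-normalized) feature map of the pre-trained encoder at layer l, with spatial size H_l \times W_l, and \odot is the element-wise product over channels.

```latex
d(x, \hat{x}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w}
  \left\lVert\, w_l \odot \left( \phi_l(x)_{h,w} - \phi_l(\hat{x})_{h,w} \right) \right\rVert_2^2
```

For a batch of image pairs, the per-pair distances are then aggregated, for example with the mean.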

Multimodal Models and Text-to-Image Evaluation

The metrics discussed so far are very mathematical and hard to interpret intuitively, which is why the lecture takes a detour to visit a class of models called multimodal LLMs, which take text and image data as input and produce text 6m40s.
The models seen so far in the class include the transformer, which maps text to text, and models like DiT and MMDiT, which generate images from latent noise and potentially a condition, but these models cannot be reused as is 8m10s.
The setup being discussed involves a model that takes image and text as input and gives text as output, and an example of this is evaluating the cuteness of a teddy bear, and the goal is to find an architecture that can accept embeddings and turn them into interpretable text 10m20s.
One possible direction for finding this architecture is to leverage the cross-attention layer of a text-based model, which can enable text tokens to attend to images, and this can be done by using text tokens as input and keeping the cross-attention mechanism to interact with encoded images 12m30s.
One approach to dealing with multimodal input is to keep a text-only context and make it interact with images through cross-attention, as seen in models like Flamingo from DeepMind, where images are given as keys and values and a placeholder token gates when it's fine to attend to these images 10s.
A caveat of this approach is that it cannot directly leverage the latest advances in large language models, which are decoder-only models that have gotten rid of cross-attention, so the cross-attention layer has to be re-engineered. This is why a second branch of dealing with multimodal input is to directly feed image and text tokens as input, as seen in models like LLaVA 1m30s.
The approach of directly feeding image and text tokens as input to a decoder-only structure is what people use in practice, allowing for the reuse of existing architectures and weights, and enabling models to work with all kinds of images and generate text as part of images 2m40s.

Model Capabilities and Text Integration

Models like these have capabilities such as working with all kinds of images and generating text as part of images, and they can be designed to be aware of characters and have reasoning abilities, which is why techniques like OCR are mentioned. Leveraging these reasoning abilities is important to have a rationale behind the grade given to generated images 4m10s.
Traditional metrics have concerns, such as providing a number without context, and one way to mitigate this is to decompose what is cared about into atomic properties, as seen in the TIFA paper, which allows for reasoning about what properties in the generated image make it match the prompt 8m20s.
The goal is to have models that can generate text as part of images and be aware of characters, and to have a rationale behind the grade given to generated images, which is why leveraging reasoning abilities and techniques like OCR are important 9m40s.

Decomposing Evaluation into Atomic Properties

Text-to-image faithfulness evaluation (TIFA) uses methods like few-shot learning to generate a set of questions that decompose the judging into quantitative dimensions, allowing for a more precise assessment of image quality, with each question having a clear right or wrong answer 10s.
The evaluation process uses a model, such as a multimodal LLM (MLM), to assess each dimension independently, producing a score that reflects how well the generated image meets the specified criteria, as seen in the example of a teddy bear reading a book 42s.
This method of generating dimensions that are cared about per image is bespoke, requiring a specific grading rubric for each prompt, which can be expensive and error-prone, and may not accurately reflect the importance of each claim 2m6s.
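The question-decomposition idea above can be sketched as a small scoring loop. Everything here is hypothetical: the questions, the expected answers, and `canned_answer_fn`, which stands in for a call to a multimodal model that answers one question about the generated image.

```python
def faithfulness_score(questions, answer_fn):
    """Fraction of atomic questions whose answer matches the expected one."""
    correct = sum(
        1 for q in questions if answer_fn(q["question"]) == q["expected"]
    )
    return correct / len(questions)


# Hypothetical rubric for the prompt "a teddy bear reading a book".
questions = [
    {"question": "Is there a teddy bear?", "expected": "yes"},
    {"question": "Is there a book?", "expected": "yes"},
    {"question": "Is the teddy bear reading the book?", "expected": "yes"},
]


def canned_answer_fn(question):
    """Stand-in for a multimodal model call; returns fixed answers."""
    canned = {
        "Is there a teddy bear?": "yes",
        "Is there a book?": "yes",
        "Is the teddy bear reading the book?": "no",
    }
    return canned[question]


print(faithfulness_score(questions, canned_answer_fn))  # 2 of 3 questions pass
```

Every question counts equally in this sketch, which illustrates the concern above that the score may not reflect the relative importance of each claim.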
The CLIP score can convey whether the overall semantic meaning of the prompt was preserved in the generated image, but it may not capture all the semantic subtleties of a prompt, because it projects the image and the prompt into single vectors that may not richly convey these subtleties 4m30s.

Limitations of CLIP Score and Alternative Methods

The CLIP score's limitations can be attributed to its training process, which incentivizes matching images with their captions but provides no incentive to learn the subtleties of prompt variations, such as the difference between a teddy bear reading a book and a book reading a teddy bear 6m20s.
The CLIP score is computed from the embedding of the entire sentence: a decoder-style text encoder processes the sentence, and the embedding at the last position, which summarizes the whole sentence, is extracted 8m40s.
VQAScore is a method where the image and the sentence are fed into the same model, and the probability that they match is extracted, allowing for a more nuanced evaluation of image and sentence similarity. The template follows a simple structure: the image comes first, followed by a question asking whether it shows the content of the prompt 10s.
A regular LLM may not be able to perform this task due to the presence of image modality, but a standard MLM can be used for this purpose as it is trained to understand broader concepts, making it a zero-shot task that does not require bespoke training 2m6s.
One drawback of this technique is that it relies on access to the probability distribution of the next token, which may not be available for latest and best models, and it can be wasteful as each question requires a dedicated MLM call, which can be expensive and scale with the number of decomposed dimensions 4m30s.
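The probability-extraction step above can be sketched as a softmax over next-token logits. The template wording, the token names, and the logit values are all hypothetical; in practice the logits would come from a multimodal model that exposes its next-token distribution, which is exactly the access requirement noted above.

```python
import math

# Hypothetical question template; the exact wording is an assumption.
TEMPLATE = 'Does this image show "{prompt}"? Answer yes or no.'


def yes_probability(next_token_logits):
    """Softmax over candidate next tokens, returning the mass on 'yes'."""
    peak = max(next_token_logits.values())  # subtract the max for stability
    exps = {tok: math.exp(val - peak) for tok, val in next_token_logits.items()}
    return exps["yes"] / sum(exps.values())


question = TEMPLATE.format(prompt="a teddy bear reading a book")
# Made-up logits standing in for the model's output after the question.
logits = {"yes": 2.0, "no": 0.0, "maybe": -1.0}

print(question)
print(yes_probability(logits))  # a value between 0 and 1, used as the score
```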

Concept-Centric Evaluation and VIE Score

Instead of being prompt-centric, a more attractive approach is to be concept-centric: describe what is considered good in a generic way and leverage the reasoning capabilities of MLMs to come up with a final score. This is the idea behind VIEScore, the Visual Instruction-guided Explainable Score 6m40s.
VIEScore involves giving the prompt, the generated image, and a rubric to an MLM and asking it to judge and provide a score. The judge outputs its reasoning before the score, which makes the result interpretable, and it considers dimensions such as semantic consistency and perceptual quality 8m50s.
Evaluating the quality of generated images involves assessing semantic consistency and perceptual quality, which can be done using a multimodal LLM (MLM) as a judge that returns its decision in a format such as JSON, allowing for the parsing of scores and rationales 10s.
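Parsing such a JSON verdict might look like the sketch below. The field names, the 0-to-10 scale, and the sample response are hypothetical, and combining the two dimensions with a geometric mean is an assumption about one reasonable aggregation, not a rule stated in the lecture.

```python
import json
import math


def parse_judgement(raw_response):
    """Extract reasoning and per-dimension scores from the judge's JSON."""
    data = json.loads(raw_response)
    return {
        "reasoning": str(data["reasoning"]),
        "semantic_consistency": float(data["semantic_consistency"]),
        "perceptual_quality": float(data["perceptual_quality"]),
    }


def overall_score(judgement):
    """Geometric mean: a low score on either dimension drags the total down."""
    return math.sqrt(
        judgement["semantic_consistency"] * judgement["perceptual_quality"]
    )


# Hypothetical judge output; reasoning comes before the scores.
raw = """{
  "reasoning": "The bear and book are present, but the fur has artifacts.",
  "semantic_consistency": 8,
  "perceptual_quality": 2
}"""

judgement = parse_judgement(raw)
print(judgement["reasoning"])
print(overall_score(judgement))  # 4.0
```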

Human Alignment and MLM Evaluation Process

To ensure that the MLM's decisions align with human intuition, a three-stage process is used, starting with humans grading images alongside rubrics of interest, such as perceptual quality and semantic consistency, to establish a sense of what is good versus bad 2m6s.
The second stage involves handcrafting rubrics, or using a model to generate instructions, that match human intuition. The third stage involves using the resulting MLM judge to evaluate new images in various settings, including pointwise, pairwise, and ranking 4m30s.
The pointwise setting is useful for one-sided evaluation and debugging, as it provides a rough idea of the image quality and can guide the understanding of failure modes, while the pairwise setting is great for comparing models with each other, such as when evaluating a new iteration of a model 6m40s.
The ranking setting, although studied in papers, is typically not used in practice due to its sensitivity and variance, and is often not a desirable task, making the pointwise and pairwise settings more broadly applicable 8m50s.

Best Practices for MLM Evaluation

Best practices when dealing with MLM as a judge involve decomposing the evaluation into relevant dimensions, such as the VIE score, which can be broken down into two dimensions, allowing for a more structured and effective evaluation process 11m20s.
In real-life scenarios, task-specific rubrics are often used to evaluate models. These rubrics are typically decomposed into separate metrics, each with atomic criteria that isolate one thing being judged, to ensure accurate judging 10s.
Empirically, it is desirable for models to output their reasoning before providing a score, as this practice improves performance and intuition, similar to how humans form opinions by enumerating facts before making a judgment 1m5s.
The temperature parameter in Large Language Models (LLMs) controls the creativity and determinism of the output, and while non-deterministic outputs are often better for certain tasks, deterministic outputs are preferred for precise tasks like judgment, where the model is grounded on the input and rubrics 2m6s.
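The effect of temperature can be illustrated with a temperature-scaled softmax over next-token logits. The logit values are made up; the sketch only shows that a low temperature concentrates probability on the top token, which is what makes the output close to deterministic.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

print(softmax_with_temperature(logits, 1.0))  # probability spread across tokens
print(softmax_with_temperature(logits, 0.1))  # almost all mass on the top token
```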

Position Bias and Benchmarking

Position bias can occur in pairwise settings, and swapping the ordering of the two samples being compared can help mitigate this issue 3m30s.
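One way to apply the swap is to query the judge in both orders and only declare a winner when the two verdicts agree. The sketch below assumes a hypothetical `judge(first, second)` callable that returns "first" or "second"; treating disagreements as a tie is a design choice made here, not something prescribed by the lecture.

```python
def debiased_preference(judge, image_a, image_b):
    """Compare A and B in both orders; return 'a', 'b', or 'tie'."""
    verdict_ab = judge(image_a, image_b)  # A shown first
    verdict_ba = judge(image_b, image_a)  # B shown first
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "a"
    if not a_wins_ab and not a_wins_ba:
        return "b"
    return "tie"  # the verdict flipped with the ordering: likely position bias


def position_biased_judge(first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "first"


def quality_judge(first, second):
    """Toy judge that prefers the higher (hypothetical) quality value."""
    return "first" if first >= second else "second"


print(debiased_preference(position_biased_judge, 0.9, 0.2))  # tie
print(debiased_preference(quality_judge, 0.9, 0.2))          # a
```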
To ensure that models are aligned with human judgment, it is essential to evaluate them using benchmarks that cover characteristics such as object detection, attribute attribution, and prompt following, before trusting them blindly 4m20s.
Benchmarks like GenEval assess the ability of models to generate objects and attributes mentioned in a prompt, using object detection models as judges to evaluate the correctness of the generated objects and colors 5m40s.
The DPG-Bench benchmark evaluates the ability of models to render detailed prompts correctly, by decomposing the prompt into atomic yes or no questions and grouping them into a logical graph, allowing for efficient evaluation of attribute attribution and spatial relationships 6m50s.

OCR and Image Editing Evaluation

Text rendering, which is checked through Optical Character Recognition (OCR), is of particular interest in image generation, where images with walls of text need to be rendered accurately, and LongText-Bench can be used to evaluate this capability 10s.
To evaluate text rendering, a Vision Language Model (VLM) can be used as a judge to read the text in the generated image and match it to the reference text, and if both match, the text is considered correctly generated 42s.
Another image generation mode of interest is editing a given image, where a benchmark like GEdit-Bench can be used, which involves 11 tasks such as replacing backgrounds or changing colors, and a Multimodal Language Model (MLM) can be used as a judge model to evaluate the output 1m30s.

Challenges and Considerations in Model Evaluation

When evaluating image generation models, it is important to consider that metrics can be imperfect and dependent on tuning, and relying on a few sample images can be misleading, as the same model can produce different distributions of images based on the sampling method 2m6s.
A thought experiment can be done to compare the best and worst of three generated images from the same model, which can result in quite different distributions, highlighting the importance of considering the sampling method when evaluating image generation models 3m10s.
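The thought experiment can be simulated in a few lines. The setup is hypothetical: each image gets a scalar "quality score" drawn from a standard normal distribution, which is an arbitrary stand-in for whatever metric is used; only the gap between the two selection strategies matters.

```python
import random
from statistics import fmean


def best_and_worst_of_n(num_prompts=1000, samples_per_prompt=3, seed=0):
    """For each prompt, draw several scores from the SAME model and keep
    the best and the worst one; return the mean of each selection."""
    rng = random.Random(seed)
    best, worst = [], []
    for _ in range(num_prompts):
        scores = [rng.gauss(0.0, 1.0) for _ in range(samples_per_prompt)]
        best.append(max(scores))
        worst.append(min(scores))
    return fmean(best), fmean(worst)


best_mean, worst_mean = best_and_worst_of_n()
print(best_mean, worst_mean)  # same model, clearly different averages
```

Showing only cherry-picked samples therefore says as much about the selection procedure as about the model itself.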
