YouTube video summary

Stanford AA228/CS238 Decision Making Under Uncertainty I Policy Gradient Estimation & Optimization

Artificial intelligence

22 Nov 202410 min summaryFrom Stanford Online

Stanford AA228/CS238 Decision Making Under Uncertainty I Policy Gradient Estimation & Optimization

Stanford Online

Save to your library

Chat with this summary

Introduction and Parameterized Policies

The proposal for the project should be a sequential decision problem, and if there's uncertainty, it's recommended to come by office hours or post on Ed for clarification 27s.
A policy, denoted as π, is a function that tells us what action to take based on the state we're in, and it can be used to control the temperature in a building by turning on the heat or AC 1m15s.
The policy might look like a table with associated actions for every possible temperature, but storing such a policy wouldn't be possible for large or continuous state spaces 2m13s.
To address this issue, policies can be parameterized using parameters θ, and an example of a parameterized policy for temperature control is given using two parameters θ1 and θ2 2m24s.

Utility and Gradient Optimization

The utility of the policy, denoted as U(θ), is the thing that's generally interested in optimizing, and many optimization methods rely on the gradient of θ 3m16s.
The gradient of θ, denoted as ∇θ, is a vector of partial derivatives, one for each parameter θi, and it's used to estimate the gradient of the utility function 3m51s.
The lecture will cover a few different options for estimating the gradient of θ, and it's noted that this portion of the lecture is calculus-heavy 4m27s.

Finite Differences Method for Gradient Estimation

The first method for estimating the gradient of a function is finite differences, which involves computing the slope between two points that are a small distance apart, denoted as Delta, to estimate the derivative at a point of interest 4m56s.
In the multivariable case, the finite difference method is used to estimate the gradient of U(θ) by computing U(θ + Δe) - U(θ) / Δ, where e is a standard basis vector, and repeating this process for each parameter θ 6m8s.
The standard basis vector e has a one in the position corresponding to the parameter being estimated and zeros elsewhere, allowing the method to estimate the gradient of U(θ) around a point 6m50s.
The finite difference method can be impacted by the variance of U(θ), which may require a large number of rollouts to obtain a good estimate, and techniques can be used to mitigate this variance 7m46s.
The parameters θ are the parameters of the parameterization, and the first-order approximation may not always work for high-dimensional cases where the parameters are on different scales 8m11s.

Regression Gradient Method for Gradient Estimation

The next method discussed is the regression gradient, which involves collecting data points and fitting a regression line to estimate the gradient of U(θ) around a point 9m0s.
The regression gradient method collects data points in the neighborhood of θ and fits a regression line to estimate the gradient at that point 10m38s.
A matrix called Delta Theta is created for perturbation of each parameter, with dimensions M by N, where M is the number of perturbations and N is the number of parameters 10m42s.
The number of perturbations is chosen empirically, with a common rule of thumb being to have double the number of parameters as the number of perturbations, but this can be based on computational constraints or expert knowledge 11m21s.
The perturbations are obtained by randomly sampling from a normal distribution, which works well empirically, and these samples are normalized 11m48s.
The perturbations can be visualized as points along the surface of a sphere, with each point representing a change in Theta 12m10s.
When randomly sampling around the point Theta, the line is found by collecting data points of differences from the original Theta and differences from the original U of Theta, and then using a regression method to fit the line 12m42s.
The method of sampling on the sphere involves sampling from a normal distribution and then normalizing the samples, so that all points are at a distance of one from the original value 13m36s.
The same set of perturbations is applied to all coordinates, which gives the M by N matrix 14m31s.
The random points are called perturbations, and the perturbation is the change applied to Theta 14m53s.
The data set of points for Theta is built by applying the perturbations and computing the differences in U of Theta 15m21s.
The computation of Delta U involves using rollouts, such as Monte Carlo simulation, to find the difference in U of Theta plus Delta Theta and U of Theta for each parameter 15m29s.
The gradient of U of theta is estimated by fitting a line between Delta Theta and Delta U, where Delta Theta and Delta U come from policy rollouts 16m17s.
The gradient U of theta is approximately equal to Delta Theta times the pseudo-inverse of Delta U, where the pseudo-inverse is used instead of the regular inverse because Delta Theta is not guaranteed to be invertible 16m44s.
The pseudo-inverse of a matrix X is equivalent to the inverse of X transpose X times X transpose, which allows for the inversion of a non-square matrix 17m19s.
The dimensions of Delta Theta and Delta U are m by 1, where m is the number of perturbations, and it is important to have an equal number of perturbations of theta and U 18m15s.
The perturbations are sampled from an n-dimensional normal distribution, and it is not necessary to do this for every single theta since all parameters are moved in each perturbation 19m1s.

Likelihood Ratio Method for Gradient Estimation

The likelihood ratio method is used to compute the gradient of U of theta, which involves using an analytical form of U of theta that includes an integral over all possible trajectories 19m30s.
The integral includes the probability of a trajectory given theta, the return for that trajectory, and the discount over the entire sequence 20m14s.
The return for a trajectory can be formulated using the Bellman optimality equation, which allows for a discount over the entire sequence 21m33s.
The return for a trajectory is determined, and the likelihood of the trajectory is determined by the policy parameterized by Theta, which tells us what action to take based on a state 21m45s.
The policy is determining how likely a trajectory is, and the goal is to optimize the utility of the policy by finding the gradient of the utility 23m1s.
The process of optimizing the policy is similar to optimizing the parameters of a neural network by finding the gradient of the loss and slowly changing the parameters to improve the utility 23m42s.
The policy determines the likelihood of a trajectory, and the goal is to take the analytical expression and get an expression for the gradient of U of theta from it 24m16s.
The first step in finding the gradient of U of theta is to take the gradient of both sides of the expression and move the gradient inside the integral 24m33s.
Multiplying the expression inside the integral by one allows it to be written in a way that can be simplified, and it can be written as the expectation over a certain expression 25m44s.
The expression can be written as the expectation over tow of G Theta P Theta of to over P Theta of to R of to D to, which is a form of expectation 26m19s.
The log derivative trick, mentioned in chapter 10.5, allows the expression to be written in a certain way, but it will not be derived today 27m52s.
The expectation form of the gradient U of theta is expressed as an expectation, providing a relationship that can be used to estimate the gradient.
To estimate the gradient, it is necessary to find the gradient with respect to Theta of the log of P Theta of to, which depends on the type of policy used, either stochastic or deterministic.
A deterministic policy always returns the same action for a given state, whereas a stochastic policy provides a distribution over actions for a given state.
For a stochastic policy, Pi Theta of a specific action given a state is the probability of taking that action from that state.
The lecture will focus on the stochastic case, which is easier to compute and provides an interesting insight into why it is preferred.
To find the gradient with respect to Theta log P Theta of to for a stochastic policy, it is necessary to consider a sequence of states, actions, and rewards or returns up to a certain time horizon.
The probability of a trajectory to is calculated as the product of the probability of the initial state, the transition probabilities, and the likelihood of each action according to the policy.
The policy does not affect the initial state, which is sampled from a distribution independent of the policy.
The transition probabilities and the likelihood of each action are used to calculate the likelihood of the next state in the trajectory.
The product of these probabilities gives the likelihood of the trajectory to according to the policy parameterized by Theta.
The likelihood of a trajectory is calculated by multiplying the probability of the first state by the product of the policy and transition probabilities for each state in the sequence, resulting in the overall likelihood of the trajectory 34m39s.
The product involves the policy and transition probabilities for each state in the sequence, with the superscripts on the states representing the time step 35m17s.
The product starts from the first state, not the second, as it represents the transition from the first state to the second 35m40s.
The expression for the likelihood of the trajectory is then taken to the log, converting the product into a summation 36m36s.
The log of the probability of the first state and the summation of the log of the policy and transition probabilities for each state in the sequence are the components of the log likelihood expression 37m17s.
Taking the gradient of the log likelihood with respect to the policy parameter results in the terms that are not dependent on the policy parameter dropping out 38m19s.
The resulting expression is a summation that provides another option for estimating the log likelihood 38m52s.
The terms that drop out are those that do not have any dependence on the policy parameter, similar to taking a derivative with respect to a variable that is not present in the expression 39m12s.
The correct range for the summation is from K=1 to D-1, where D is the depth of the trajectory, as going beyond D-1 would involve states that are not part of the current trajectory 39m44s.
When taking the derivative of theta, we end up looking at the state-action pair, and the last term is only going to D minus one, but for the second summation, it can still go to D because the transition at the D level doesn't matter 41m29s.
The counter K is stopping at D, so we would do K from one to D, including the first state and action, and then the second, and so on, all the way to D in the summation 43m7s.
To find the expectation, we need to consider multiple trajectories, and in practice, it's hard to iterate over all of them, so we would be sampling trajectories, and we can use different characteristics or methods to sample 43m55s.
The goal is to calculate the gradient of the utility function, and there are three different methods to estimate the gradient, including finite differences, to optimize the parameters of our policy Theta with respect to utility 45m14s.
We're assuming we're not able to compute the gradient directly, and these methods are possible options to estimate the gradient of the utility function 45m27s.
The methods discussed are used to estimate the gradient of the utility function so that we can optimize the parameters of our policy, and Kiana will discuss what to do with these estimates 42m53s.

Made with Recall · in 3 seconds

Get a summary like this for anything you read, watch or save.

Recall summarizes any link you paste, then keeps it in your personal library so you can search, chat with it, and never lose a key idea again.

YouTube videosArticlesPodcastsPDFsAnything else

Save this summary

Keep it in your library.

Save to your library

Browse all from Stanford Online →

Stanford CS153 Frontier Systems | The Road Ahead: Resilience Required

Stanford CS153 Frontier Systems | The Road Ahead: Resilience Required

YouTube02 Jun 2026

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

Artificial Intelligence

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

YouTube02 Jun 2026

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics

Artificial Intelligence

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics

YouTube02 Jun 2026

Stanford CS153 Frontier Systems | The AI Native Company: How One Founder Becomes a 1000x Engineer

Entrepreneurship

Stanford CS153 Frontier Systems | The AI Native Company: How One Founder Becomes a 1000x Engineer

YouTube25 May 2026

Stanford CS547 HCI Seminar | Spring 2026 | HCI and Human-Centered AI for Digital Health

Health & Medicine

Stanford CS547 HCI Seminar | Spring 2026 | HCI and Human-Centered AI for Digital Health

YouTube25 May 2026

Stanford CS25: Transformers United V6 I Distinct Modes of Generalization from Parameters and Context

Artificial Intelligence

Stanford CS25: Transformers United V6 I Distinct Modes of Generalization from Parameters and Context

YouTube25 May 2026

Ready to get started?

Save, summarize and chat with your content.

IT'S FREE

No credit card required · 30 Day Refund on Premium · 24 Hour Support

Recall web app on laptop, personal AI knowledge base for summarizing and chatting with your content