YouTube video summary

Stanford CS234 Reinforcement Learning I Exploration 1 I 2024 I Lecture 11

Education

31 Oct 202422 min summaryFrom Stanford Online

Stanford CS234 Reinforcement Learning I Exploration 1 I 2024 I Lecture 11

Stanford Online

Save to your library

Chat with this summary

Introduction to Reinforcement Learning and Data Efficiency

Space repetition is helpful in learning, and periodically bringing up ideas from earlier in the course can aid in understanding 36s.
Important sampling does not leverage the Markov assumption and can work with non-Markov systems, making it a general method that can be used to estimate the value of a policy through rollouts 3m21s.
Important sampling can be used to estimate the value of a policy by using the advantage function of one policy and sampling from another policy, but this method is not exact due to the difference in states visited by the two policies 3m56s.
The approximation error of using samples from one policy to estimate the value of another policy is bounded by the average difference between the two policies over the states visited by one policy 4m28s.
Prior work has shown that it is possible to bound the error in approximating the value of a policy by using samples from another policy, which is a useful insight in reinforcement learning 4m46s.
The course will now focus on learning from data that can be gathered, rather than relying on prior data or human feedback, and will discuss how to strategically gather data to learn good policies 5m15s.
The next few lectures will discuss how to gather data for online reinforcement learning, including the influence of the data gathering process on the learning of good policies 5m39s.
Computational efficiency is an important consideration in reinforcement learning, particularly in simulated environments, where computational time is equivalent to data 6m2s.
In many domains, computation is separate from actual data, such as using mobile phones for health interventions, consumer marketing, educational technology, and climate or environmental applications, where real-world data is crucial and often limited 7m5s.
In these cases, computers can be used to compute policies, and real-world data, referred to as samples or sample efficiency, is essential and needs to be utilized carefully to make good decisions 7m55s.

Challenges of Limited Data in Reinforcement Learning

The data in these domains is often limited, unlike in cases like Atari or AlphaGo, where simulators can generate infinite data 8m17s.
When working with limited data, it's essential to consider the performance of different algorithms, including whether they converge, converge to the optimal policy, and how quickly they do so 9m1s.
The convergence of an algorithm is not always guaranteed, and even if it does converge, it may not be to the optimal policy, and the speed of convergence is also crucial 9m10s.
Different reinforcement learning algorithms can be evaluated using various measures, and their performance can vary significantly, with some algorithms being smooth and consistent, while others may make periodic mistakes 9m47s.
The performance of an algorithm can be illustrated using a graph of time versus reward, showing how different algorithms can have distinct performance profiles 9m52s.
There are different types of behavior in algorithms, such as one that is always pretty good but never awesome, and one that guarantees a certain level of performance, which are important to consider when evaluating the quality of algorithms 10m25s.
The performance guarantees of an algorithm can be thought of in terms of an AI clinician that helps manage blood pressure with a certain level of accuracy, and the difference between an algorithm that is 80% accurate on average versus one that completely manages blood pressure for 80% of the population 10m32s.

Introduction to Bandits and the Multi-Armed Bandit Problem

The course will introduce different types of settings and ways to evaluate the quality of algorithms, starting with Bandits, which are simple reinforcement learning problems that have been studied since the 1920s and have been used in various application areas 11m11s.
A Bandit is a simple RL problem with a finite set of arms, or actions, and a probability distribution over rewards for each arm, where the goal is to maximize cumulative reward over time steps 11m52s.
The Bandit setting can be applied to various areas, such as clinical trials, ads, and treatment or control, where the goal is to randomize the next person or action to take 12m57s.
A running example for the course will be a silly one, where the goal is to treat patients with broken toes, and there are three options: surgery, buddy taping the broken toe to another toe, or doing nothing, with a binary outcome measure of whether the toe is healed or not after 6 weeks 13m15s.
The goal is to figure out the best strategy for healing broken toes over time, without doing a clinical trial, but rather by trying different options and observing the outcomes 13m47s.
A problem can be modeled as a multi-arm bandit with three arms, where each arm is a newly introduced variable with an unknown parameter Theta, representing the probability of a successful outcome 13m56s.
The multi-arm bandit framework is suitable for this problem because only one decision is made per patient, and the outcome for each patient is independent of the others 15m27s.
The problem does not have sequential dependence, and each decision is made independently, like a new person arriving at a time point 15m46s.
The parameter Theta can be between zero and one, indicating that the outcomes are not deterministic, and sometimes the patient will heal, while other times they won't 15m53s.
The probability distribution of the reward is assumed to be stationary, meaning it remains the same at every time step, and does not change over time 16m15s.

Greedy Algorithms and the Need for Exploration

A greedy algorithm can be used to solve the problem, which selects the action with the highest expected reward, denoted by Q, and repeats this process 17m11s.
The expected reward Q can be estimated by counting and averaging the outcomes for each action 17m0s.
The greedy algorithm starts by sampling all actions once, and then selects the action with the highest observed reward 17m47s.
In the given example, the true set of parameters shows that surgery is the most effective action, followed by body taping, and then doing nothing 17m29s.
When applying the greedy algorithm, the arm with the highest observed reward is selected next, which in this case would be arm two, with an observed reward of one 18m24s.
A greedy algorithm that always chooses the action with the highest estimated value can get stuck in a suboptimal solution, never finding the optimal action, and will not converge to the optimal action 18m31s.
This is because the algorithm's estimate of the true value of the actions is low, and it will never sample the suboptimal action again, even if it has a higher true value 19m0s.
To avoid this, some form of exploration is needed to ensure the algorithm does not make an infinite number of bad decisions 19m30s.

Regret in Bandit Problems

The concept of regret is used to quantify the difference between the decisions made by the algorithm and the optimal decisions, with regret being the opportunity loss or the difference between the expected reward of the optimal action and the expected reward of the action taken 19m44s.
Regret is often used to score the algorithm based on the gap between the optimal value and the value achieved by the algorithm 19m59s.
The optimal value is the maximum expected reward, and the regret is the difference between the expected reward of the optimal action and the expected reward of the action taken 20m10s.
Regret is typically considered in expectation, due to stochasticity in the rewards, and is used to compare the expected reward of the optimal arm to the expected reward of the suboptimal arm 20m36s.
The goal is to minimize total regret, which is equivalent to maximizing cumulative reward, and this is often the focus in bandit problems 21m7s.
The regret can be computed by comparing the expected reward of the optimal action to the expected reward of the action taken over all time steps 21m1s.
The number of times an action is selected, NTA, is used to calculate the regret, with the gap for a particular arm being the difference between the expected reward of the optimal action and the expected reward of the action taken 21m37s.
The concept of the "Gap" is introduced, which refers to the difference in expected rewards between the optimal action and the action taken, and it plays a crucial role in learning which action is better, with larger gaps requiring fewer samples to figure out and smaller gaps requiring more data 22m29s.
The Gap (G) is a function of the gaps and counts, representing the number of times each action is taken and the difference between the optimal action and the reward received 22m55s.
The expected regret is the sum of the times each action is taken multiplied by the Gap, indicating that actions with large gaps should be avoided, while actions close to the optimal action are more acceptable 23m11s.
Many algorithms aim to bound the regret quantity, which is challenging due to the unknown optimal action and its value, but algorithms can be designed to prove how the regret grows 23m35s.
The regret can be visualized as a series of actions taken over time, with the true optimal action and observed rewards influencing the regret at each step 24m5s.
The size of the regret is determined by the Gap between the optimal arm and the arm taken, and making bad decisions forever can result in linear regret 24m48s.
Ideally, algorithms should strive for constant regret or zero regret, which means making a finite number of bad decisions before learning the optimal arm and taking it forever 25m31s.
In the worst-case scenario, linear regret can occur if mistakes are made consistently 25m55s.

Epsilon-Greedy Algorithms and Regret Analysis

Epsilon-greedy algorithms are used to balance exploration and exploitation, where the greedy action is chosen with probability 1 - Epsilon, and a random action is chosen with probability Epsilon 26m20s.
The regret of Epsilon-greedy algorithms depends on the value of Epsilon, and it is possible to have linear regret if Epsilon is not chosen carefully 26m29s.
In a setting where there is always at least one action with a non-zero gap, the regret of Epsilon-greedy can be analyzed to determine if it is sublinear or linear 28m14s.
The value of Epsilon can affect the regret of the algorithm, and it is possible to have linear regret for certain values of Epsilon 28m36s.
The algorithm's performance can be evaluated by considering the probability of choosing the greedy action versus a random action 27m3s.
In a specific example, with Epsilon = 0.1, the algorithm chooses the greedy action with 90% probability and a random action with 10% probability 27m15s.
The regret of the algorithm can be computed and analyzed to determine if it is sublinear or linear 28m1s.
Epsilon-greedy algorithms have a lower bound on the number of times each arm is sampled, which is at least Epsilon divided by the number of actions times T, where T is the total number of decisions made, resulting in at least linear regret if Epsilon is greater than zero 31m47s.
If Epsilon is set statically, it can be pretty bad, and decaying Epsilon over time can make a significant difference in the algorithm's performance 32m36s.
The growth of regret over time steps can be visualized in plots, showing that greedy algorithms can have linear regret, while Epsilon-greedy algorithms can have slightly better but still linear regret, and decaying Epsilon can achieve sublinear regret 32m55s.

Problem-Dependent and Problem-Independent Regret Bounds

A good algorithm should be able to guarantee sublinear regret without knowing the problem-dependent properties in advance, such as the gaps between rewards 33m34s.
Problem-independent bounds describe how regret grows as a function of time steps for any possible problem, while problem-dependent bounds describe regret as a function of the gap between rewards 33m52s.
Problem-dependent bounds are elegant because they don't require the algorithm to know the gaps, but rather adapt to the problem's difficulty, resulting in better regret if the problem is easier to learn 34m25s.
The gap between rewards is usually less than one, and the consideration of rewards is typically between zero and one, depending on the domain 35m4s.
In certain domains, the differences or gaps between optimal and suboptimal actions can be really tiny, requiring a lot of data and smart, data-efficient algorithms to optimize, whereas in other cases, the gaps can be large, making it easier to estimate them 35m21s.
A lower bound by Robbins from the 1950s provides a minimum regret that any algorithm will suffer as a function of the problem, indicating that the regret will be at least log T, where T is the number of time steps or decisions made 36m4s.
The lower bound also depends on the gap between the optimal and suboptimal arms, with the gap in the numerator, and is considered a problem-dependent or instance-dependent bound because it holds based on the unknown gaps 36m30s.

Optimism Under Uncertainty and Upper Confidence Bound (UCB) Algorithms

The concept of optimism under uncertainty is a principle that shows why it's provably optimal to be optimistic about things, as it allows for sublinear regret 37m9s.
Optimism in this context means choosing actions or arms that might have a high value, which can result in either getting high reward or learning something and improving estimates 37m29s.
When choosing actions that might be good, two possible outcomes are getting high reward, which is the goal, or getting low reward, which allows for learning and improving estimates 37m51s.
Both outcomes are valuable from the point of view of a reinforcement learning algorithm, as either the goal is achieved or something is learned to avoid making bad decisions in the future 38m48s.
To leverage information from low rewards, an algorithm is needed to formalize uncertainty, which involves quantifying confidence intervals or uncertainty bounds to make decisions 39m4s.
The approach involves estimating an upper confidence bound for each action value, ensuring it holds with high probability, and focusing on frequentist high-probability bounds 39m30s.
The upper confidence bound (UCB) is dependent on the number of times an action has been taken, and the agent will choose the action with the highest UCB 39m48s.
UCB algorithms are a suite of algorithms that follow this notion, and there are variants, including optimism in the face of uncertainty (OFU) 40m11s.
The performance of this approach and the quantification of uncertainty will be evaluated, using Hoeffding's inequality, which is a useful inequality for understanding how different the observed average can be from the true mean 40m36s.
Hoeffding's inequality states that the difference between the empirical estimate and the true estimate decreases exponentially as the number of samples increases, ensuring convergence to the true mean 41m14s.
The inequality implies that the probability of a large difference between the empirical and true means decreases exponentially with the number of samples, allowing for the estimation of an upper bound on the true expected reward 41m52s.
The goal is to determine how big the upper bound (u) needs to be set to get an upper bound on the true expected reward, using the equation derived from Hoeffding's inequality 42m26s.
To construct an upper confidence bound that holds with a certain probability, we use the formula u = sqrt(1/n * log(2/Delta)), where u is the range, n is the number of samples, and Delta is the confidence level 43m9s.
This formula gives us a range, and if we want the probability that our empirical estimate differs from the true mean by no more than u, it is sufficient to set u equal to this value 43m47s.
We can then say that the empirical estimate is less than or equal to the expected value of x, which is less than or equal to the empirical estimate plus u, with a probability greater than or equal to 1 - Delta 44m6s.
This creates an upper confidence bound, which we can use to estimate the upper limit of the true mean 44m9s.
The UCB1 algorithm uses this upper confidence bound, adding a bonus term to the empirical average, which depends on the number of samples 44m39s.
The UCB1 algorithm is used at every time step, computing the empirical average and adding the bonus term 44m44s.
The bonus term is calculated as sqrt(2 * log(t) / n), where t is the time step and n is the number of samples 45m0s.
The UCB1 algorithm is a variant of the Upper Confidence Bound algorithm, which was first introduced in a paper in 2002 45m31s.
The algorithm uses optimism under uncertainty, which is a concept that was around before the 2000s, but was first formally proven in the 2002 paper 45m56s.

UCB1 Algorithm and its Implementation

The UCB1 algorithm can be used in different settings, such as sampling each arm once and then computing the upper confidence bounds for each arm 46m15s.
The algorithm then selects the arm with the highest upper confidence bound, which is the argmax of the upper confidence bounds 47m10s.
Upper confidence bounds (UCB) are used to balance exploration and exploitation in reinforcement learning, with UCB A2 being 1 + √(2 * log(1/Δ)) and UCB A3 being 0 + √(2 * log(1/Δ1)) 47m32s.
The upper confidence bounds are reduced as more information is gathered, allowing for more informed decision-making 48m0s.
The value of Δ is chosen to determine the confidence bounds, with the goal of having all confidence bounds hold for all time steps and arms 48m13s.
Union bounding is used to ensure that all confidence bounds hold simultaneously, taking into account the number of arms and total decisions (Big T) 48m30s.
The confidence intervals will change over time, causing the algorithm to alternate between arms based on the upper confidence bounds 49m1s.
With a fixed number of time steps (Big T), Δ can be set to roughly Δ / (T * |A|) using a union bound, where |A| is the number of arms 49m43s.
This approach is related to false discovery and other concepts in machine learning, and is used to ensure that the confidence bounds are valid at every time step 50m1s.
The union bound is used to sum over the probability of each event, resulting in a bound of roughly the number of arms times Big T 50m29s.
Dividing Δ by (T * |A|) is generally sufficient, but can result in a larger log term, which can be mitigated using approaches such as the law of iterated logarithms 50m50s.

Proof Sketch for Sublinear Regret in UCB Algorithms

A proof sketch is provided to show how a specific type of idea can be used to achieve sublinear regret in reinforcement learning 51m18s.
The proof is based on a result from a book on banded algorithms by Tor Lattimore and Csaba Szepesvári, which provides a rigorous version of the proof 51m53s.
The result states that if an arm is suboptimal, the number of times it is pulled in Upper Confidence Bounds scales as a constant C' times log of 1/Delta, where Delta is the gap between the expected reward of the arm and the true reward 53m11s.
The constant C' does not depend on the number of arms, gaps, or other domain-specific factors, and can be a fixed value such as 37 53m33s.
The result is interesting because it shows that if the gap is large, the suboptimal arm will be pulled fewer times, and if the gap is small, it may be sampled more 53m51s.
Combining this result with another equation can lead to a sublinear regret bound, where the total expected regret is proportional to the number of actions times a constant 54m31s.
The proof sketch focuses on showing that the number of times a suboptimal arm is pulled is bounded by a constant times log of 1/Delta 54m18s.
The book by Lattimore and Szepesvári provides a more rigorous version of the proof, which is recommended for further reading 52m15s.

Formal Proof of Sublinear Regret in UCB Algorithms

The formal proof of the total number of times a particular arm is pulled in a multi-armed bandit problem scales with one over the size of the gap squared, and this concept will be explored in more detail 55m14s.
The proof relies heavily on the Hoeffding inequality and upper confidence bounds, which provide a way to estimate the expected value of an arm and bound the probability of the true expected value being outside of a certain range 55m36s.
The Hoeffding inequality states that the difference between the true expected value of an arm and its empirical average is greater than a certain quantity (the upper confidence bound) with probability no more than 1 - Delta/T, where Delta is a small probability and T is the number of times the arm is pulled 56m26s.
The goal is to bound the number of times a suboptimal arm is pulled, which is the only case where regret occurs, and this can be done by analyzing the times when a non-optimal arm is pulled and its upper confidence bound is higher than the optimal arm's upper confidence bound 57m5s.
If the confidence interval holds, then the upper confidence bound of a suboptimal arm is higher than the true expected value of the optimal arm, and this can be used to bound the number of times the suboptimal arm is pulled 57m57s.
Under the UCB algorithm, a suboptimal arm is only pulled if its upper confidence bound is higher than the optimal arm's upper confidence bound, and this can be used to derive a bound on the number of times the suboptimal arm is pulled 58m21s.
By substituting the upper confidence bounds of the suboptimal and optimal arms, it can be shown that the number of times the suboptimal arm is pulled is bounded by a function of the gap between the two arms and the number of times the suboptimal arm is pulled 59m56s.
The upper confidence bound on the optimal action also holds, so its upper confidence bound has to be higher than its true value 1h0m12s.
Q of a plus two of the confidence intervals is going to be greater than Q of a star, which means 2 < TK C log 1 Delta over NT of a is greater than Q of a star minus Q of a, which is equal to Delta a 1h2m50s.
Rearranging the equation, we get that 4 * C log 1 / Delta / NT of a is greater than equal to Delta a 2, which means NT of a is less than or equal to 4 C log 1 / Delta 1h3m48s.
Intuitively, this means that if confidence bounds hold and are used to make decisions, then the only time a wrong decision is made is when the confidence bounds are large enough to overwhelm the gap, and the number of times this occurs is finite 1h4m6s.
The size of the confidence intervals decreases over time, eventually becoming smaller than the gap, resulting in fewer suboptimal actions being taken 1h4m36s.
The algorithm achieves logarithmic asymptotic regret as a function of log T, because the log T term is inside the number of times suboptimal actions are taken 1h5m11s.

Comparison of UCB with Other Approaches and Extensions

Upper Confidence Bound (UCB) algorithms have a nice logarithmic shape if the constants are set correctly, but the constants can greatly affect the exploration time, and being more aggressive can result in better performance 1h5m41s.
An alternative to UCB is to always select the arm with the highest lower bound, which can yield linear regret, and it's helpful to think about why this approach wouldn't work in the upper confidence bound case 1h6m11s.
To demonstrate why selecting the arm with the highest lower bound can lead to linear regret, one can construct a multi-arm bandit case where this approach would result in linear regret, similar to the example shown for the greedy algorithm 1h7m30s.
When the condition for UCB is not met, the expectation can be split into a high-probability event and a low-probability event, allowing for the regret to be bounded from those time points 1h8m41s.
To achieve linear regret when selecting the arm with the highest lower bound, the upper bound and mean of the suboptimal arm (A2) should be placed such that the algorithm is pessimistic and never selects the optimal arm (A1), for example, by having a very high upper bound for A2 1h11m18s.
If the upper bound of A2 is higher than A1, the algorithm would learn to select A2 under UCB, but being pessimistic would prevent this, and the algorithm would never pull A2 1h12m22s.
The pessimistic view is that learning may not occur due to not updating other bounds 1h12m38s.
The empirical average plus its upper confidence bound being bigger than the optimal arm's empirical average plus its upper bound is based on the equation that the empirical average is always less than or equal to the true value plus the upper confidence bound 1h12m54s.
Substituting the empirical average with the true value plus the upper confidence bound results in Q of A plus 2 times the bound 1h13m15s.
This algorithm has provably sublinear regret, which is a desirable property, and is also easy to implement, especially when counts are available 1h13m41s.
The optimism under uncertainty principle can be extended to more complicated settings, such as function approximation and reinforcement learning 1h13m51s.
The principle is used to make decisions when outcomes are uncertain, with the goal of reducing regret over time 1h14m1s.

Introduction to Bayesian Bandits

The next topic to be covered is Bayesian Bandits, which involves thinking of the problem from a different perspective, using a prior over the distribution of rewards for each arm 1h14m19s.
Bayesian Bandits can also involve algorithms that use prior information to quickly gather data and make good decisions, which can be similar to optimism in certain ways 1h14m33s.

Made with Recall · in 3 seconds

Get a summary like this for anything you read, watch or save.

Recall summarizes any link you paste, then keeps it in your personal library so you can search, chat with it, and never lose a key idea again.

YouTube videosArticlesPodcastsPDFsAnything else

Save this summary

Keep it in your library.

Save to your library

Browse all from Stanford Online →

Stanford CS153 Frontier Systems | The Road Ahead: Resilience Required

Stanford CS153 Frontier Systems | The Road Ahead: Resilience Required

YouTube02 Jun 2026

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

Artificial Intelligence

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

YouTube02 Jun 2026

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics

Artificial Intelligence

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics

YouTube02 Jun 2026

Stanford CS153 Frontier Systems | The AI Native Company: How One Founder Becomes a 1000x Engineer

Entrepreneurship

Stanford CS153 Frontier Systems | The AI Native Company: How One Founder Becomes a 1000x Engineer

YouTube25 May 2026

Stanford CS547 HCI Seminar | Spring 2026 | HCI and Human-Centered AI for Digital Health

Health & Medicine

Stanford CS547 HCI Seminar | Spring 2026 | HCI and Human-Centered AI for Digital Health

YouTube25 May 2026

Stanford CS25: Transformers United V6 I Distinct Modes of Generalization from Parameters and Context

Artificial Intelligence

Stanford CS25: Transformers United V6 I Distinct Modes of Generalization from Parameters and Context

YouTube25 May 2026

Ready to get started?

Save, summarize and chat with your content.

IT'S FREE

No credit card required · 30 Day Refund on Premium · 24 Hour Support

Recall web app on laptop, personal AI knowledge base for summarizing and chatting with your content