YouTube video summary

Stanford Robotics Seminar ENGR319 | Spring 2026 | Integrated Learning and Planning

Robotics

25 May 2026 · 18 min summary · From Stanford Online

Introduction to Physical Intelligence and Learning

The topic of discussion is integrated learning and planning with neuro-symbolic concepts, with the goal of building general-purpose physical intelligence that can perceive, understand, and take actions in the physical world 10s.
One approach to achieving physical intelligence is by fitting functions to data sets, where a policy is built to map historical observations to the next action, requiring large amounts of training data, such as 100 hours of data to train a robot to fold boxes 1m30s.
Recent efforts have focused on deploying such systems on activators, training them to generate joystick commands to control an activator, but this approach still has low data efficiency and limited multitask performance 2m6s.
In contrast, humans can learn from one example and generalize reliably to different states and goals, such as manipulating objects with an activator, and can adapt to new tasks and goals with ease 3m40s.

Human vs. Machine Learning and Generalization

An example of human generalizability is the ability to play a physical game where a person must pick up an object without using their hands, but rather using tools they are currently holding, demonstrating the ability to make plans internally and adapt to new situations 5m30s.
The speaker, Jiayuan, currently a member of the technical staff at Amazon who will soon join UPenn as an assistant professor, argues that humans can solve complex tasks with minimal practice and can generalize to different kinds of goals and different kinds of objects 40s.
The goal for robotics is to create a system that can match human-level performance, learning from few demonstrations and generalizing across different scenarios, with the ability to learn from one to 10 demonstrations and generalize reliably out of distribution to novel states, objects, and goals 10s.

Conceptual Framework for Physical Intelligence

The central idea is to combine machine learning, policy learning, and planning, starting from a high-level view of physical intelligence as a combination of world modeling and planning, where humans understand objects, their properties, and possible actions 1m30s.
This paradigm involves understanding the state in an abstract way, enumerating possible actions, and imagining the outcomes before execution, which can be used with search-based algorithms to solve tasks 2m6s.
The key idea is to plan with compositional abstraction of states and actions to achieve generality, using what will be referred to as neuro-symbolic concepts, which combine world modeling and planning 3m20s.
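The loop described in this section (abstract the state, enumerate possible actions, imagine outcomes, then search) can be illustrated with a minimal sketch. It assumes a toy symbolic domain: the facts, the three actions, and the `imagine` and `plan` helpers are all hypothetical and are not taken from the talk.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # facts that must hold before execution
    add: FrozenSet[str]     # facts the action makes true
    delete: FrozenSet[str]  # facts the action makes false


def imagine(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """Predict the abstract outcome of an action without executing it."""
    return (state - action.delete) | action.add


def plan(init: FrozenSet[str], goal: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search over imagined abstract states."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal fact holds
            return path
        for a in actions:
            if a.pre <= state:  # action is applicable in this state
                nxt = imagine(state, a)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [a.name]))
    return None


# Hypothetical toy domain: the cabinet must be opened while the hand is empty.
ACTIONS = [
    Action("pick(mug)",
           frozenset({"hand-empty", "on-table(mug)"}),
           frozenset({"holding(mug)"}),
           frozenset({"hand-empty", "on-table(mug)"})),
    Action("open(cabinet)",
           frozenset({"hand-empty", "closed(cabinet)"}),
           frozenset({"open(cabinet)"}),
           frozenset({"closed(cabinet)"})),
    Action("place(mug, cabinet)",
           frozenset({"holding(mug)", "open(cabinet)"}),
           frozenset({"in(mug, cabinet)", "hand-empty"}),
           frozenset({"holding(mug)"})),
]
INIT = frozenset({"hand-empty", "on-table(mug)", "closed(cabinet)"})
GOAL = frozenset({"in(mug, cabinet)"})

result = plan(INIT, GOAL, ACTIONS)
```

In this toy domain, picking up the mug first leads to a dead end because the cabinet can only be opened with an empty hand, so the search has to reason about action order rather than chain skills greedily.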

Overview of the Talk and Neuro-Symbolic Concepts

The talk is outlined to cover neuro-symbolic action models, applications in skill learning, adding reasoning in language and spatial reasoning, and building neuro-symbolic planning-compatible action models with applications in long-horizon planning 4m40s.
To model an action, such as picking up an object, one approach is to learn a policy that maps the current state directly to the target action, but this ignores dependencies between actions, such as needing to pick up a pen in order to pivot a glass 6m10s.

Challenges in Action Generation and Constraint Optimization

Action generation in robotics can be limited by the lack of compositionality, where simple tasks cannot be directly combined to solve a more complex task, and this limitation can be addressed by formulating action generation as a constraint optimization problem 10s.
The constraint optimization approach involves generating trajectories that satisfy certain constraints, such as staying within joint limits and avoiding collisions, as well as subgoal constraints like holding a target object, and this formulation allows for the composition of multiple actions by adding up constraints temporally and spatially 42s.
Mathematically, this can be written down as a constraint optimization problem, where the goal is to find two trajectories and two states that minimize a cost, such as the total length of the trajectory, subject to various constraints like dynamics, collision avoidance, and subgoal constraints 2m6s.
The constraints can be categorized into rigid body dynamics, geometric constraints, and task-relevant constraints, and this view allows for learning to happen in two different levels: learning the set of constraints to follow and learning how to generate trajectories or states that satisfy those constraints 2m6s.
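One way to write down the two-step problem described above is sketched below. The notation is an assumption made for illustration and is not taken from the talk: \(\tau_i\) is the trajectory of step \(i\), \(s_i\) the state after it, \(s_0\) the given initial state, \(f\) the dynamics, \(c_{\text{geom}}\) the geometric constraints, and \(g_i\) the subgoal constraint of step \(i\).

```latex
\begin{aligned}
\min_{\tau_1,\tau_2,\,s_1,s_2}\quad
  & \sum_{i=1}^{2} \operatorname{length}(\tau_i) \\
\text{s.t.}\quad
  & s_i = f(s_{i-1}, \tau_i),
    && i = 1, 2 \quad \text{(dynamics)} \\
  & c_{\text{geom}}(s_{i-1}, \tau_i) \le 0,
    && i = 1, 2 \quad \text{(joint limits, collision avoidance)} \\
  & g_i(s_i) = 0,
    && i = 1, 2 \quad \text{(subgoal, e.g. holding the target object)}
\end{aligned}
```

Under this reading, composing another action amounts to adding one more trajectory, one more state, and the corresponding constraints to the same problem.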

Constraint Optimization in Robotics

Some constraints can be modeled using human-engineered models, such as physical simulators for rigid body physics and motion planners for geometric constraints, while others, like task-relevant constraints, can be learned from data 4m30s.
A concrete example of this approach is learning to hang an object from a single example, where the goal is to generalize the skill to a completely unseen situation, such as putting a mug onto a mug tree, which is challenging because the grasping pose and the hanging pose are jointly constrained 6m10s.
The approach to solving the problem of moving an object involves either model-based planning, which is slow, or more modern learning-based approaches that operate by collecting a large dataset of possible objects, training geometric features, and then training a policy to generate robot actions 10s.
The performance of learning-based approaches is still not satisfactory, and they require a lot of data for different kinds of objects, with poor transferability across categories, prompting the decomposition of the problem into two parts: learning the set of constraints to follow and computing the contact positions 42s.
The solution to the problem involves constraint optimization with visual correspondence guidance, which includes detecting the sequence of contacts between the robot and objects, and computing the contact positions using visual feature correspondence, with the particular approach using Dino V2 to compute visual feature correspondence 2m6s.
The visual feature correspondence is used as guidance and then verified with model-based planning to ensure good contact points, resulting in a significant speedup over a blind search baseline and generalizing to novel scenarios like picking up a mug and placing it onto a mug tree 2m6s.
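The guide-then-verify idea can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the feature vectors are small hand-written stand-ins rather than DINOv2 outputs, and `is_feasible` is a hypothetical placeholder for the model-based planning check.

```python
import numpy as np


def rank_by_correspondence(demo_feat, target_feats):
    """Return target point indices sorted by cosine similarity to the
    demonstrated contact point's feature (most similar first)."""
    d = demo_feat / np.linalg.norm(demo_feat)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    return np.argsort(-(t @ d))


def first_feasible(candidates, is_feasible):
    """Verify candidates in order; return (index, number of checks made)."""
    for n_checks, idx in enumerate(candidates, start=1):
        if is_feasible(int(idx)):
            return int(idx), n_checks
    return None, len(candidates)


# Stand-in features: one row per candidate contact point on the new object.
demo_feat = np.array([1.0, 0.0, 0.0])
target_feats = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 0.05],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 1.0],
])


def is_feasible(idx):
    # Hypothetical verifier: pretend only point 4 passes the planner.
    return idx == 4


guided = first_feasible(rank_by_correspondence(demo_feat, target_feats), is_feasible)
blind = first_feasible(range(len(target_feats)), is_feasible)
```

Here the most similar point (index 3) is rejected by the verifier and the second-ranked point is accepted, so the guided search needs fewer expensive checks than enumerating points in arbitrary order.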

Applications and Generalization of Constraint Optimization

The approach relies on models for 3D reconstruction and physical simulation to test the stability of contacts, and generalizes to other kinds of objects, including kitchenware, and even to alphabetical shapes with diverse geometric structures 2m6s.
The model can decide where to pick up and place objects, and benchmarking shows that one-shot policy learning performs poorly on completely unseen objects, such as alphabetical shapes, where its performance drops to zero 2m6s.
The overall model provides a high success rate of over 90% on a particular test set, and it can learn from one or very few demonstrations and generalize to different objects and situations 10s.

Neuro-Symbolic Integration for Generalization

The approach combines neural representations and physical models to provide guidance on functional correspondence and stability analysis within a constraint optimization framework, resulting in efficient and generalizable performance 2m6s.
The model can be extended to slightly longer horizon manipulation tasks, such as teaching a robot to rotate an object on a table, push it, and lift it up with just one single demonstration, and then generalize to new scenarios with unseen targets 2m6s.

Extending to Long-Horizon Tasks

The demonstrations so far have been relatively short horizon, involving a few objects making contact with each other, but to create a more general-purpose robot, the system needs to be extended with more machine learning components, such as diffusion models for spatial reasoning 4m30s.
A particular manipulation problem being addressed is setting up tables based on language instructions, which is challenging due to the need to learn human preferences, limited data, and long horizon manipulation with many variabilities 6m40s.

Formulating Table Setup as a Constraint Problem

The problem of setting up tables can be formulated as finding object poses that satisfy spatial constraints, such as the apple being on the left of the plate, and can be visualized as a graphical structure with constraints among objects 8m50s.
The problem formulation allows for learning to happen at two levels, including finding the set of constraints to satisfy when setting up a table based on human instructions 10m40s.
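The graphical structure of constraints among objects can be pictured with a small data-structure sketch. Everything here is assumed for illustration: the relation names, the 2D poses, and especially the hand-written relation checks, which stand in for the learned models that the system actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float]  # (x, y) on the table plane; orientation omitted

# Hypothetical hand-written checks standing in for learned relation models.
RELATIONS: Dict[str, Callable[[Pose, Pose], bool]] = {
    "left-of": lambda a, b: a[0] < b[0],
    "right-of": lambda a, b: a[0] > b[0],
    "horizontally-aligned": lambda a, b: abs(a[1] - b[1]) < 0.02,
}


@dataclass(frozen=True)
class Constraint:
    """One edge of the constraint graph: relation(subject, reference)."""
    relation: str
    subject: str
    reference: str


def violated(constraints: List[Constraint],
             poses: Dict[str, Pose]) -> List[Constraint]:
    """Return the constraints that the given object poses do not satisfy."""
    return [c for c in constraints
            if not RELATIONS[c.relation](poses[c.subject], poses[c.reference])]


graph = [
    Constraint("left-of", "apple", "plate"),
    Constraint("horizontally-aligned", "apple", "plate"),
]
good = {"apple": (0.2, 0.5), "plate": (0.5, 0.5)}
bad = {"apple": (0.7, 0.5), "plate": (0.5, 0.5)}
```

Finding a table setup then means searching for poses for which `violated` returns an empty list, while the graph itself is what gets inferred from the instruction.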

Leveraging Language and Vision Models for Spatial Reasoning

The system leverages pre-trained large language models or visual language models to learn common sense knowledge, such as which object should be placed on the left of another, instead of learning everything from robot demonstrations 10s.
The system starts with language goals and an initial image containing object shapes, and a vision language model generates an abstract spatial relationship graph that describes the spatial relationships between individual objects, with a finite library of possible relationships like left, right, horizontally aligned, or vertically split 42s.
The system allows for additional examples to be provided to show preferences, such as setting up a dining table according to a specific cultural tradition, and then uses a compositional diffusion model to generate the pose of all objects and control the robot to move the objects to the target pose 1m30s.

Diffusion Models for Spatial Constraints

The compositional diffusion model takes the graphical structure as input and generates continuous values, such as the pose of objects, and it uses dedicated diffusion models for each type of relationship, such as left, which are trained to predict an energy value that quantifies how well the input satisfies the constraint 2m6s.
The diffusion models predict the gradients over the inputs, which can be combined by adding up the gradients to solve the original problem, and this process can be visualized as a gradient field or an energy field for a learned diffusion model 3m30s.
The system uses program synthesis techniques to generate layouts, but the details of this graph generation procedure are not covered here 2m30s.
The diffusion model is used to represent energy values and gradients, where the energy value is lower when close to the edge of the table, and the gradients point towards the table, with different energy fields and gradients for various constraints, such as one object being to the left or right of another 10s.
The optimization process involves adding up all the energy landscapes and gradients to form a new composed field, and then finding the minimum value or the value that minimizes the energy function, which gives the position of the target object, using a technique called unadjusted Langevin dynamics 1m30s.
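The composition step can be illustrated with a minimal sketch that assumes simple analytic energies in place of learned diffusion models. The three gradient functions, the step size, and the temperature are all hypothetical choices; only the structure (sum the per-constraint gradients, then take noisy gradient steps) follows the description above.

```python
import numpy as np


def grad_left_of(p, ref, margin=0.1):
    """Gradient of E = max(0, p.x - (ref.x - margin))^2: zero once p is left of ref."""
    g = np.zeros(2)
    g[0] = 2.0 * max(0.0, p[0] - (ref[0] - margin))
    return g


def grad_aligned(p, ref):
    """Gradient of E = (p.y - ref.y)^2: pulls p onto the same horizontal line as ref."""
    return np.array([0.0, 2.0 * (p[1] - ref[1])])


def grad_on_table(p, lo=0.0, hi=1.0):
    """Gradient of the squared distance to the box [lo, hi]^2: zero on the table."""
    return 2.0 * (p - np.clip(p, lo, hi))


def langevin_sample(grads, p0, steps=500, step_size=0.05,
                    temperature=1e-4, seed=0):
    """Unadjusted Langevin dynamics on the sum of the given energy gradients."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        total = sum(g(p) for g in grads)  # composition: add per-constraint gradients
        noise = rng.standard_normal(2)
        p = p - step_size * total + np.sqrt(2.0 * step_size * temperature) * noise
    return p


plate = np.array([0.5, 0.5])
apple = langevin_sample(
    [lambda p: grad_left_of(p, plate),
     lambda p: grad_aligned(p, plate),
     grad_on_table],
    p0=[0.9, 0.1],
)
```

Because each relation contributes its own gradient term, adding or removing a constraint at inference time only changes the list passed to `langevin_sample`; nothing is retrained.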

Compositional Diffusion and Optimization

The approach allows for composing individual models for each type of relationship at inference time without additional training, generating placements of objects, and can work in various scenarios, such as study desks, coffee tables, and dining tables 2m6s.
The method can be integrated with real robot execution by considering motion constraints, such as a robot's workspace and ability to put objects only on one side of the table, and can synthesize a plan subject to these constraints and geometric constraints predicted by a large language model 3m20s.
The system can learn from few demonstrations and generalize to new spatial reasoning tasks by composing neuro-symbolic diffusion models within a constraint optimization framework, leveraging large language models for common sense knowledge 4m40s.

Long-Horizon Planning and Task Composition

The approach can be extended to diffusion models of actions, trajectories, and other tasks, but currently focuses on spatial reasoning, and the next part of the discussion will explore neuro-symbolic planning compatible action models for solving longer-term tasks 6m10s.
The problem setting involves learning to solve long-horizon planning tasks, such as washing dishes, using data from demonstrations, where each demo shows how to perform a single step, and the goal is to learn from these demonstrations to solve the entire task 8m0s.
The goal is to make long horizon plans with different kinds of obstacles or objects, such as sorting plates on a rack, which requires deliberate reasoning about the order of actions due to potential movement blockages 10s.

Language and Vision for Task Segmentation

To achieve this, language is used to help make sense of long horizon plans, starting from demonstration videos, and video language models like Gemini are utilized to segment the entire trajectory into several segments with action names 1m30s.
Vision models are then used to segment the raw observations into 3D point clouds, allowing for future learning, and providing relevant objects involved in each step, their trajectories, and the robot's hand trajectory 2m6s.
The resulting dataset, with multiple demonstrations, enables learning of individual skills or actions, such as pickup, and each action is modeled in two parts: trajectory constraints and future state prediction 3m20s.

Modeling Trajectories and Future States

The trajectory constraint satisfaction model takes the initial position of the gripper and object point cloud as input and uses a diffusion model to generate possible trajectories, depending on the data coverage in the demonstration 4m40s.
The second part of the model predicts the future state, taking the initial states, gripper position, and original point cloud, as well as the inferred trajectory, to forecast what will happen after executing an action 5m50s.
The process involves understanding how objects will change position after executing a policy or trajectory, focusing on geometric aspects such as object pose changes 10s.

Simulation and Planning for Task Execution

A high-level plan, or task skeleton, is created as a sequence of actions without deciding on the exact trajectory, and internal simulation is used to predict the outcome of taking a particular trajectory 42s.
The simulation can enumerate the next action based on the current state, determining if the plan will succeed or fail due to collisions, and if it fails, a different trajectory must be tried 1m15s.
The goal is for the policy to generalize to new heights, positions, orientations, and novel obstacles even though it is trained only on specific scenarios, such as axis-aligned objects and a limited number of books 2m6s.
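The skeleton-then-refine procedure described in this section can be sketched as a small backtracking search. This is a toy under stated assumptions: the skeleton is reduced to a list of 2D target positions, `sample_trajectories` is a hypothetical stand-in for a learned trajectory generator, and `predict_outcome` is a hypothetical stand-in for a learned transition model with a collision check.

```python
import numpy as np

OBSTACLE_CENTER = np.array([0.5, 0.0])
OBSTACLE_RADIUS = 0.2


def sample_trajectories(start, goal):
    """Stand-in trajectory generator: paths through detour waypoints."""
    for dy in (0.0, 0.15, 0.4):  # candidate vertical detours, tried in order
        mid = (start + goal) / 2 + np.array([0.0, dy])
        yield np.vstack([np.linspace(start, mid, 20),
                         np.linspace(mid, goal, 20)])


def predict_outcome(traj):
    """Stand-in transition model: (predicted end state, no-collision flag)."""
    dists = np.linalg.norm(traj - OBSTACLE_CENTER, axis=1)
    return traj[-1], bool(np.all(dists > OBSTACLE_RADIUS))


def refine(skeleton, state):
    """Depth-first search over trajectory choices for each skeleton step."""
    if not skeleton:
        return []
    goal = skeleton[0]
    for traj in sample_trajectories(state, goal):
        end_state, ok = predict_outcome(traj)
        if not ok:
            continue  # imagined collision: try a different trajectory
        rest = refine(skeleton[1:], end_state)
        if rest is not None:
            return [traj] + rest
    return None  # no trajectory works for this step; caller backtracks


start = np.array([0.0, 0.0])
skeleton = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
plan = refine(skeleton, start)
```

In this example the straight path and the small detour are both rejected in imagination because they pass through the obstacle, and only the larger detour is kept, all before anything would be executed on a robot.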

Testing and Generalization of the Model

The model is tested in different situations, including hanging multiple mugs onto a mug tree, which requires both planning and proficiency in hanging individual mugs while avoiding collisions 3m30s.
The model can make deliberate plans and execute them in the physical world, demonstrating its ability to handle complex tasks and generalize to new situations 4m20s.

Neuro-Symbolic Planning and Learning Framework

The system enables few-shot learning of action models and uses planning to combine them by sampling possible trajectories and predicting the future before executing them, allowing for learning from very few demonstrations and generalizing to different situations 10s.
The principle behind this system is to compose neural trajectory generation models and transition models (world models) within a neuro-symbolic planning framework in order to select the best plan 10s.
The system also leverages large vision and language models to perform tasks such as segmenting objects and teaching common sense knowledge at the symbolic level, making learning more efficient and planning more generalizable 10s.

Benefits and Implications of Neuro-Symbolic Systems

The idea of neuro-symbolic concepts enables data-efficient learning, such as learning actions from a single or few demonstrations, and generalization to completely unseen objects and states 10s.
Neuro-symbolic systems provide insights into the scientific understanding of tasks and learning, allowing for analysis of the expressiveness of neural networks and discussion of the parameterized circuit complexity of models 2m6s.
The system has implications for robotics, which is not just a machine learning problem but also a systems engineering problem that requires integrating high-level planning, low-level control, perception, and memory tracking 2m6s.
Neuro-symbolic systems provide a principled way to think about the integration of these pieces, making them useful for academic research 2m6s.

Introducing the Retriever Framework

A new framework called Retriever is being released: a programming model for closed-loop robot agents based on asynchronous robot action and time-explicit typing, allowing for efficient and smooth execution of tasks such as seasoning a steak and tracking memory 10s.
The Retriever framework is able to make long-running plans and has the ability to search for targets in different drawers, demonstrating its capabilities in robot action control and planning 42s.
The system's efficiency and smoothness can be attributed to its asynchronous execution and time-explicit typing, making it a significant improvement in the field of robotics 1m30s.

Composing and Scaling Action Models

With the advancement of foundation models for various tasks such as computer vision, language models, and action models, there is a growing need to compose different systems using principled algorithms for probabilistic reasoning or planning 2m6s.
A model orchestrator is necessary to scale up the composition of action models, which can tell the system what features and action models to care about in a particular environment and recognize object states, generate possible trajectories, and more 2m40s.
The development of novel paradigms for training models from diverse data is an exciting area of research in robotics, allowing for the composition of models using neuro-symbolic reasoning algorithms such as constraint satisfaction 3m30s.

Agent Harness and System Composition

The concept of an agent harness is relevant to this area of research: different agents can each perform small tasks, but a system is needed to compose them to solve more complicated tasks, and neuro-symbolic methods provide insights into designing such systems 4m10s.
Continual learning and self-improving models for task generation, planning, and policy learning are essential because raw data is very hard to collect 5m0s.

Future Directions and Self-Improving Systems

The current state of demos in the field of robotics and artificial intelligence is that they require human-provided data to learn from and solve problems, but the ultimate goal is to create systems that can start from basic compositional foundation models and use reasoning, planning, and exploration algorithms to acquire new capabilities and experiences, which can then be used to train the next generation of foundational models 10s.
This framework allows systems to start with basic capabilities such as object recognition, pick and place, and other small skills, and then explore and refine their models further through self-improvement and knowledge distillation from different foundation models 42s.
The concept of knowledge distillation is already being applied in areas such as agent research and vision language model research, where different foundation models with various capabilities are being composed and distilled into each other to create stronger capabilities, such as promptable segmentation models 1m30s.

Data Efficiency and Generalization in Intelligent Systems

The overall framework being discussed is aimed at enabling more data-efficient learning and better generalization in generally intelligent systems, which can bridge the understanding of intelligence and leverage insights into the building process to create better systems 2m6s.
The concept of common sense knowledge at the symbolic level is still being explored, and it is unclear whether current models have enough common sense knowledge, as it depends on the definition of symbolic level and the complexity of the knowledge being described 4m30s.

Evaluation and Verification of System Outputs

The evaluation of future states in systems, such as determining whether a predicted state is a success or failure, is done through the use of vision language models and physical feasibility checks, such as collision detection and stability checks 5m40s.
The vision language model uses semantic knowledge to understand tasks and goals at a high level, while the transition model handles low-level control and physical feasibility 10s.
A diffusion policy is needed to generate a probabilistic model of possible trajectories or target poses, allowing for multiple possible ways to complete a task, such as sorting books on a bookshelf 1m42s.

Incorporating Human Preferences and Utility Functions

The model's behavior can have side effects, some desirable and others not, so verifying correct behavior is crucial, potentially by incorporating human preferences as additional utility functions 4m6s.
Human preferences can be factored into the planning framework as utility functions, enabling the model to generate solutions that satisfy multiple constraints, including user preferences and physical constraints 6m30s.
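One simple reading of preferences as utility functions is sketched below: physical constraints filter the candidates, and a weighted sum of utilities ranks whatever remains. The shelf slots, the two utilities, and the weights are hypothetical and are chosen only to show how changing the weights changes the selected solution.

```python
def choose_plan(candidates, is_feasible, utilities, weights):
    """Pick the feasible candidate with the highest weighted utility."""
    feasible = [c for c in candidates if is_feasible(c)]
    if not feasible:
        return None
    return max(feasible,
               key=lambda c: sum(w * u(c) for u, w in zip(utilities, weights)))


# Hypothetical candidate placements for a book on a shelf.
candidates = [
    {"slot": "bottom", "free": True, "height": 0.3, "travel": 0.5},
    {"slot": "middle", "free": False, "height": 1.1, "travel": 0.8},
    {"slot": "top", "free": True, "height": 1.5, "travel": 1.2},
]


def slot_is_free(c):
    return c["free"]  # stand-in for collision and stability checks


def prefer_eye_level(c):
    return -abs(c["height"] - 1.2)  # user preference: close to 1.2 m


def prefer_short_travel(c):
    return -c["travel"]  # efficiency: shorter arm motion


utilities = [prefer_eye_level, prefer_short_travel]
user_first = choose_plan(candidates, slot_is_free, utilities, weights=(1.0, 0.1))
speed_first = choose_plan(candidates, slot_is_free, utilities, weights=(0.1, 1.0))
```

The occupied middle slot is never chosen regardless of the weights, which reflects the distinction between hard physical constraints and soft preferences.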

Closed-Loop Execution and Real-Time Adaptation

The planning process can involve generating an overall trajectory and executing it, with the possibility of closed-loop execution, where the model continuously predicts and adjusts based on new observations from the environment 10m30s.
In the Retriever example, closed-loop execution is already implemented, with high-level planning operating at 0.5 Hz and a memory tracking system operating at 1 Hz, allowing for adjustments based on new observations 12m10s.
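Running modules at different rates can be sketched with plain Python `asyncio`. This is not the Retriever API, which this summary does not describe; the `periodic` helper, the shared `memory` dictionary, and the time scaling (so that the example finishes in a fraction of a second) are assumptions made for illustration.

```python
import asyncio


async def periodic(period_s, step, stop):
    """Call `step()` every `period_s` seconds until `stop` is set."""
    while not stop.is_set():
        step()
        await asyncio.sleep(period_s)


async def run(duration_s=0.2, time_scale=0.02):
    memory = {"ticks": 0}
    plans = []
    stop = asyncio.Event()

    def update_memory():
        memory["ticks"] += 1  # stand-in for perception and memory tracking

    def replan():
        plans.append(memory["ticks"])  # plan against the latest memory snapshot

    tasks = [
        # 1 Hz memory tracking and 0.5 Hz planning, both scaled by time_scale.
        asyncio.create_task(periodic(1.0 * time_scale, update_memory, stop)),
        asyncio.create_task(periodic(2.0 * time_scale, replan, stop)),
    ]
    await asyncio.sleep(duration_s)
    stop.set()
    await asyncio.gather(*tasks)
    return memory["ticks"], plans


memory_ticks, plans = asyncio.run(run())
```

Each planning step reads whatever the faster memory loop has most recently written, which is the basic pattern a sequential program cannot express without blocking one module on the other.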

Framework Design for Asynchronous Execution

The development of a new framework was necessary to address the challenges of asynchronous operation of different modules at varying speeds, which was a limitation of standard sequential programming models 10s.
A simulation environment is available for testing, with similar setups to the demo, allowing the system to be run in simulation, and this simulation environment can be accessed through a website 1m42s.

Neuro-Symbolic Architecture and Constraint Graphs

The neuro-symbolic aspect of the work involves a symbolic structure that organizes the reasoning process: a constraint graph in which each edge is associated with a neural network that generates object poses, grasping poses, or contact points to satisfy constraints 4m6s.
The neural networks are integrated together to simultaneously satisfy all constraints, using a composition inference algorithm that enables the generation of values that meet all the requirements 6m1s.
The overall system operates by combining the symbolic and neural network components, allowing for the satisfaction of geometric and task-related constraints, such as avoiding collisions and achieving specific object locations or grasping poses 6m30s.
