YouTube video summary

Stanford Seminar - Leveraging Physics-Based Models To Learn Generalizable Robotic Manipulation

Robotics

26 Nov 2024 · 9 min summary · From Stanford Online

Introduction

The presentation aims to answer three questions: what is missing in robotic manipulation, what causes this deficiency, and how physics-based models can help, with a focus on learning generalizable robotic manipulation 11s.
Despite impressive robotic manipulation demos, the field is far from solved: policies learned with current methods are neither generalizable nor robust 1m20s.
State-of-the-art algorithms, such as diffusion policy and reinforcement learning, have limitations, including poor generalizability, sensitivity to demonstration data set size, and the need for expensive and tedious tuning processes 2m5s.
The key to understanding the difficulty of manipulation lies in the rich physical constraints that govern it, including collision and robot reachability constraints, contact modes, force closure, and friction cone 3m2s.

Constraints in Robotic Manipulation

Collision and robot reachability constraints are kinematic and geometric, depending on the object's shape and the scene, and are highly non-convex, making it difficult to obtain a global solution 3m45s.
Contact mode describes the contact configuration between the robot, object, and scene, and it directly affects the system's dynamics because those dynamics are hybrid 4m12s.
In robotic manipulation, there are three types of constraints: environment contact, force closure, and friction cone. They are crucial for tasks like grasping and manipulation but are generally difficult to work with because they are non-convex, even where they are differentiable 4m28s.
Environment contact constraints depend on the shape of the object and the environment, and can be detected and enforced using tactile force sensing and classical control tools like compliant or force control 4m30s.
Force closure and friction cone constraints are mostly concerned with robotic grasping: force closure is described as the external forces and torques summing to zero, and the friction cone is typically modeled with the Coulomb friction model in the literature 5m8s.
These constraints are conditioned on the contact points, and there are rich computational tools like quadratic optimization that allow us to leverage these constraints 5m32s.
The common attributes of these constraints are that they are non-convex, mostly differentiable, and computationally expensive to work with, making it difficult to obtain a global solution when solving an optimization problem with these constraints 6m0s.
These constraints are extensively covered by past literature and classical control theory tools, and they stem from physics, particularly classical mechanics, kinematics, and dynamics 6m26s.
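The friction cone and force-balance conditions described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function names `in_friction_cone` and `net_wrench` are hypothetical, contacts are treated as 3D point contacts, and the friction coefficient `mu` is assumed to be known.

```python
import math

def in_friction_cone(force, normal, mu):
    """Return True if a 3D contact force lies inside the Coulomb friction cone.

    `normal` is the contact normal pointing into the object (any length),
    `mu` is the friction coefficient (assumed known for this sketch).
    """
    n_len = math.sqrt(sum(c * c for c in normal))
    n_hat = [c / n_len for c in normal]
    # Split the force into a normal part and a tangential part.
    f_n = sum(f * n for f, n in zip(force, n_hat))
    tangential = [f - f_n * n for f, n in zip(force, n_hat)]
    f_t = math.sqrt(sum(c * c for c in tangential))
    # Coulomb model: pushing contact only, and |f_t| <= mu * f_n.
    return f_n >= 0.0 and f_t <= mu * f_n + 1e-12

def net_wrench(contacts):
    """Sum forces and torques (about the origin) over (position, force) pairs."""
    total_f = [0.0, 0.0, 0.0]
    total_t = [0.0, 0.0, 0.0]
    for p, f in contacts:
        torque = [
            p[1] * f[2] - p[2] * f[1],
            p[2] * f[0] - p[0] * f[2],
            p[0] * f[1] - p[1] * f[0],
        ]
        for i in range(3):
            total_f[i] += f[i]
            total_t[i] += torque[i]
    return total_f, total_t
```

A grasp candidate would pass this check only if every contact force is inside its cone and the summed force and torque are both (numerically) zero. Because the cone test depends on where the contacts are, the check has to be repeated whenever the contact points move.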

Leveraging Physics-Based Models

To overcome the challenges in robotic manipulation, physics-based models can be leveraged by moving expensive analytical computations offline, generating datasets or creating heuristics, and solving problems with learning 6m53s.
The network output can then be refined against these constraints, which amounts to solving a local problem. This recipe can be applied to challenging manipulation tasks like dexterous grasping, dexterous pre-grasping, and extrinsic manipulation 7m38s.

Dexterous Grasping

Dexterous grasping refers to grasping an object using a multi-fingered hand, which is difficult due to the high degrees of freedom compared to a parallel gripper 8m22s.
A fully actuated four-finger hand has over 20 degrees of freedom, compared to approximately seven degrees of freedom for a parallel gripper, and is subject to more constraints, including collision, reachability, force closure, and friction cone 8m36s.
Dexterous grasping is powerful: it allows a wide range of objects to be grasped and supports diverse strategies, such as grasping from the top, the side, and other configurations 9m9s.
A pipeline is proposed to generate dexterous grasps by combining learning and optimization: build a grasp data set with an analytical model, train a generative grasp point predictor, and refine the prediction with local optimization 9m31s.
The generative grasp point predictor is a conditional variational autoencoder that predicts where to place the fingers on an object, and the local optimization problem is solved to satisfy physical constraints, including collision, robot reachability, force closure, and friction cone 9m49s.
The pipeline achieves nearly 90% success rate over 20 objects and 120 trials, with objects ranging from those seen during training to those with similar shapes or never seen before 11m8s.
The pipeline allows for grasping in different configurations, even for the same objects, and enables the robot to use multimodal strategies to pick up objects 11m31s.
Dexterous grasping can pick up objects that cannot be picked up with a parallel gripper, which demonstrates its power 11m54s.
The takeaway of the dexterous grasping task is that a grasp data set can be generated with physics, a grasp predictor can be learned, and the prediction can be refined with optimization that considers friction cone and force closure 12m2s.
The pipeline can also resolve hand collisions, a 22-degree-of-freedom problem, using kinematics 12m21s.
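The propose-then-refine idea in this section can be illustrated with a toy planar example. Everything here is an assumption made for illustration: the object is a unit disc, the hand has two point fingers, a fixed pair of contact angles stands in for a sample from the learned predictor, and the refinement is plain gradient descent on an antipodality cost rather than the optimizer used in the talk. The force closure test uses the standard planar result that two frictional point contacts are in force closure when the line joining them lies inside both friction cones.

```python
import math

def chord_angle_to_normals(theta1, theta2):
    """Largest angle between the line joining two contacts on a unit disc
    and the inward contact normals at those contacts."""
    p1 = (math.cos(theta1), math.sin(theta1))
    p2 = (math.cos(theta2), math.sin(theta2))
    ux, uy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(ux, uy)
    if length < 1e-9:
        return math.pi  # coincident contacts: treat as the worst case
    ux, uy = ux / length, uy / length
    # On a unit disc the inward normal at p is simply -p.
    cos1 = ux * (-p1[0]) + uy * (-p1[1])
    cos2 = (-ux) * (-p2[0]) + (-uy) * (-p2[1])
    clamp = lambda c: max(-1.0, min(1.0, c))
    return max(math.acos(clamp(cos1)), math.acos(clamp(cos2)))

def is_force_closure(theta1, theta2, mu):
    """Planar two-finger test: the contact line must lie inside both cones."""
    return chord_angle_to_normals(theta1, theta2) < math.atan(mu)

def refine(theta1, theta2, step=0.2, iters=100):
    """Local refinement: gradient descent on cost = 1 + cos(theta1 - theta2),
    which is minimised when the two contacts are diametrically opposite."""
    for _ in range(iters):
        g = math.sin(theta1 - theta2)
        theta1 += step * g  # d(cost)/d(theta1) = -sin(theta1 - theta2)
        theta2 -= step * g  # d(cost)/d(theta2) = +sin(theta1 - theta2)
    return theta1, theta2
```

Starting from a hypothetical proposal such as `(2.2, 0.2)` with `mu = 0.3`, the initial guess fails the force closure test, and the refined contacts pass it. The refinement only needs to solve a local problem because the proposal is already close to a valid grasp, which is the role the summary assigns to the learned predictor.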

Dexterous Pre-grasping

In real-world scenarios, objects to be picked up may be ungraspable and require movement before a graspable position can be reached, and a pre-grasp is necessary to establish a grasp or rigid linkage to the object 12m44s.
Identifying a good pre-grasp is challenging because it is defined by the quality of the grasp it can lead to, and computing a pre-grasp is computationally expensive due to factors like finger movement, object contact, and potential collisions 13m14s.
A key insight is that only one environment contact is needed during a pre-grasp, and only two fingers are required to establish a grasp, allowing for a reduction in search space and the use of model-based methods 13m44s.
A proposed pipeline involves offline training, constructing a contact state graph, and using model-based methods for optimization and synthesizing hand motion 13m57s.
Offline training includes learning a grasp generator and a score function to evaluate pre-grasps, which offloads expensive computations to offline processing 14m12s.
A contact state graph is built based on finger placement on the object surface, and edges represent transitions between contact states 14m49s.
A scoring function is trained to evaluate contact configurations, and trajectory optimization is performed to find the best path on the graph 15m11s.
The pipeline is used to plan a contact transition and synthesize full hand motion with kinematics, resulting in physics-realistic trajectories 15m41s.
The approach is tested in various environments where direct grasping is not possible, and the pipeline achieves efficient pre-grasp planning 16m9s.
The takeaway from the project is the use of a scoring function to rank contact states, a grasp predictor to complete grasps, and a contact mode to guide pre-grasp planning 16m29s.
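The contact state graph and scoring function described above can be sketched as a shortest-path search. This is only an illustration under assumptions: the state names, the score values, and the choice of `1 - score` as the transition cost are all hypothetical, the dictionary `score` stands in for the learned scoring function, and Dijkstra's algorithm is used in place of whatever optimizer the pipeline actually runs on the graph.

```python
import heapq

def best_contact_path(edges, score, start, goal):
    """Find the cheapest sequence of contact states from start to goal.

    `edges` maps each state to the states reachable in one transition.
    `score` maps each state to a value in [0, 1] (higher is better); the
    cost of entering a state is 1 - score, so costs stay non-negative.
    Returns (total_cost, path) or None if the goal is unreachable.
    """
    frontier = [(0.0, start, [start])]
    settled = set()
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if state in settled:
            continue
        settled.add(state)
        for nxt in edges.get(state, ()):
            if nxt not in settled:
                step_cost = 1.0 - score[nxt]
                heapq.heappush(frontier, (cost + step_cost, nxt, path + [nxt]))
    return None

# Hypothetical contact states for an object lying flat on a table.
EDGES = {
    "table_only": ["thumb_side", "index_top"],
    "thumb_side": ["index_top", "thumb_index_pinch"],
    "index_top": ["thumb_index_pinch"],
}
SCORE = {"thumb_side": 0.9, "index_top": 0.4, "thumb_index_pinch": 0.8}
```

With these made-up values, `best_contact_path(EDGES, SCORE, "table_only", "thumb_index_pinch")` prefers the route through `thumb_side`. Querying a cheap score per state is what lets the search avoid evaluating full hand motion for every candidate transition.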

Extrinsic Manipulation

Extrinsic manipulation is a versatile mode of interaction in which an object is manipulated using contacts with the environment. It is challenging because of the variety of contact configurations and unknown factors like friction coefficients 17m3s.
A divide and conquer approach can be taken by breaking down extrinsic manipulation into primitives based on contact configurations, allowing for the training of robust policies within the same contact configuration using reinforcement learning 17m35s.
The challenge lies in switching between primitives with different contact requirements, and a framework using physical models can be built to enforce these constraints and stitch the primitives together 18m5s.
A primitive library has been built with four primitives: pushing, pulling, pivoting, and grasping, which can be used to describe contacts and allow the robot to move freely between contact transitions 18m44s.
The framework uses a mixture of classical control tools and learning-based tools, and it can retarget the contact configuration to different scenes and objects using a demonstration and remapping the task 18m33s.
The robot can execute long-horizon tasks, including up to four different contact configurations, using the same primitives by mapping the contact configuration 20m22s.
The framework has been demonstrated to achieve the same task on a variety of different objects and environments using a single demonstration and remapping the contact configuration 20m46s.
A divide and conquer framework can be built using physical models and contact constraints, allowing for the learning of goal-conditioned motion primitives with standard learning tools, and achieving long-horizon tasks that are otherwise impossible 21m1s.
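The stitching of primitives by contact configuration can be sketched as a small planning problem. The four primitive names come from the summary, but everything else is a hypothetical stand-in: the contact configuration labels, the single precondition and postcondition per primitive, and the breadth-first search used to chain them. In the framework described above each primitive would be a learned, goal-conditioned policy rather than a label.

```python
from collections import deque
from typing import NamedTuple

class Primitive(NamedTuple):
    name: str
    pre: str   # contact configuration required before execution
    post: str  # contact configuration reached after execution

# Hypothetical library: names from the talk, contact labels invented here.
LIBRARY = [
    Primitive("push", "flat_on_table", "against_wall"),
    Primitive("pull", "flat_on_table", "overhanging_edge"),
    Primitive("pivot", "against_wall", "tilted_on_edge"),
    Primitive("grasp", "tilted_on_edge", "in_hand"),
]

def plan_primitives(library, start, goal):
    """Breadth-first search for the shortest chain of primitives whose
    preconditions and postconditions link start to goal.
    Returns a list of primitive names, or None if no chain exists."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        config, names = queue.popleft()
        if config == goal:
            return names
        for prim in library:
            if prim.pre == config and prim.post not in visited:
                visited.add(prim.post)
                queue.append((prim.post, names + [prim.name]))
    return None
```

Here `plan_primitives(LIBRARY, "flat_on_table", "in_hand")` chains push, pivot, and grasp. Matching each precondition to the previous postcondition is the constraint that keeps a switch between primitives valid; retargeting to a new scene would amount to remapping the contact labels while reusing the same library.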

Summary and Discussion

Three tasks were covered: dexterous grasping, dexterous pre-grasping, and extrinsic manipulation. All of them used analytical methods to make computation more efficient by moving expensive computation offline and ensuring that learning methods satisfy physical constraints at all times 21m22s.
The main issues with robotic manipulation are generalizability and robustness, with current policies being mostly fragile due to complex physical constraints 21m56s.
Physical models can be used to solve these issues through synthetic data set generation, robust constraint satisfaction, composability, and multitask generalization, as shown in the three examples provided 22m13s.
The grasping project only considered finger contact, which is the minimum needed, and the fourth finger was not used, but the same method could potentially work for more fingers 22m54s.
The pre-grasp project required two fingers, but in the book case only one finger was used: two fingers plus the environment form a rigid caging, whereas a patch contact on the environment only requires one finger 23m22s.
The grasp generation pipeline starts with an offline learned grasp generator, followed by a model-based optimization procedure to refine the grasp, and requires a shape or model of the object to run the optimization 24m2s.
The method can be compared to a purely learning-free approach, which can also generate grasp points using a traditional grasp point generator, but the proposed method is more efficient due to the difficulty of global optimization 25m7s.
Learning provides a good local initial guess for robotic manipulation, allowing for diverse grasping configurations even for the same observation, and the initial guess can also be seeded differently to achieve this diversity 25m27s.
The hard part of robotic manipulation is not choosing the initial contact state, but rather the motion planning required to get the object into a state where it can be grasped, which is an accessibility problem as much as a contact problem 26m3s.
The motion planning part is the most important and computationally expensive aspect, requiring evaluation of the entire motion planning process to decide if a pre-grasp is good or not 26m26s.
To address this, a score function is learned to provide a guess of how good a pre-grasp will be without having to solve the full motion planning problem, allowing for offline motion planning 26m47s.
The current success rate of 87% is not enough for real-world applications, and failures are often due to engineering challenges such as the hand overheating or colliding with the table 27m15s.
Improving the success rate from 87% to 95% or higher requires significant engineering work to address these challenges, often referred to as the "last mile problem" 27m58s.
While there are ways to improve the success rate, they may be less principled and require more engineering effort 28m14s.