TRI’s Robots Learn New Manipulation Skills in an Afternoon. Here’s How.

By: Siyuan Feng, Ben Burchfiel, Toffee Albina, and Russ Tedrake

Toyota Research Institute
Sep 14, 2023


TRI is unveiling a new approach that allows a robot to acquire new dexterous behaviors from demonstration. We’re going to walk through why this is a critical new capability, what advancements have made this possible, and where we are going next.

Several dexterous example behaviors that were learned via our approach.

Introduction

The mission of TRI’s Robotics research division is to develop new technology for next-generation dexterous robots to amplify people and improve their lives. Doing this requires robots to operate alongside people in unstructured natural settings. Whether it’s helping someone with arthritis prepare dinner, assisting an injured person out of a chair, or aiding workers on a job site, future robots will have an enormous positive impact on the lives of people around the world.

These next-generation robots must be flexible, adaptable, and general. Because no two homes, no two job sites, and no two people are the same, our robots must become far more versatile than they are today. Currently, robots are meticulously programmed to accomplish tasks, with humans explicitly anticipating edge cases and instructing the robot how to recover from mistakes. This approach has seen tremendous success in controlled settings; it underpins modern factory and warehouse automation. However, creating robots this way requires careful modeling of the robot’s environment, careful scoping of the robot’s behavior, and careful anticipation of situations it will need to respond to, all ahead of time. This can’t scale to the complexity required for future, more capable, robots operating in the wild.

As a result, robotics has seen an explosion of interest in robots that use AI and Machine Learning. Many of these approaches fuse aspects of traditional robotics with AI to great effect, but there’s something big missing: robots are still bad at affecting the physical world around them. We now have robots that can converse with people, but can’t open a bag of chips, and robots that can do backflips, but can’t tie a shoe. Robots are getting smarter, but they can still only interact with the world in simple ways.

To address this, TRI has developed a capability, powered by recent advancements in generative AI, that enables teaching robots new manipulation abilities in a single afternoon. Using the same robot, same code, and same setup, we’ve taught over 60 different dexterous behaviors like peeling vegetables, using hand mixers, preparing snacks, and flipping pancakes.

We’re now building a large, diverse curriculum of behaviors to not only push the limits of what’s possible today but also to lay the foundation for a general-purpose robot capable of flexible, adaptable dexterous manipulation.

We’re excited to give you a look behind the scenes at what makes this possible.

How Teaching Works

To teach new behaviors, a human operator teleoperates the robot through demonstrations of the desired task. This usually requires an hour or two of teaching, which generally equates to anywhere from a couple dozen to a hundred or so demonstrations.

Our bimanual teleoperation teaching setup.

How Learning Works

Once a set of demonstrations has been collected for a particular behavior, the robot learns to perform that behavior autonomously. The core of our process is a generative AI technique called Diffusion, which has recently taken the field of image creation by storm (e.g., DALL-E 2, Stable Diffusion). Recently, TRI and our university partners in Professor Song’s lab adapted this technique into a method called Diffusion Policy, which directly generates robot behaviors. Instead of generating images conditioned on natural language, we use diffusion to generate robot actions conditioned on sensor observations and, optionally, natural language.

In this simple block-pushing example, the robot’s action space is a 2D position, making it easy to visualize the behavior being “diffused” by the learned policy. At each timestep, the policy starts with a random trajectory for the finger, which is then diffused into a coherent plan that the robot executes. This process repeats multiple times per second.
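The diffusion process described above can be sketched in a few lines. This is a toy illustration, not TRI's actual policy: `predicted_noise` stands in for the learned noise-prediction network, and the fixed `target` trajectory stands in for what the network would infer from sensor observations.

```python
import numpy as np

# Toy setup: "diffuse" a 2D fingertip trajectory of H steps from pure
# noise toward a coherent plan. All names and values are illustrative.
H, D = 16, 2          # horizon (timesteps) and action dimension
N_STEPS = 50          # number of denoising iterations

rng = np.random.default_rng(0)
# Stand-in for the plan the learned model would infer from observations.
target = np.stack([np.linspace(0, 1, H), np.zeros(H)], axis=1)

def predicted_noise(actions, t):
    # Stand-in for the learned noise predictor eps_theta(actions, t, obs);
    # here it simply points back toward the target trajectory.
    return actions - target

def sample_trajectory():
    # Start from a random trajectory (Gaussian noise over all H steps)...
    actions = rng.normal(size=(H, D))
    for t in range(N_STEPS):
        # ...and iteratively denoise it into a coherent plan.
        actions = actions - 0.1 * predicted_noise(actions, t)
    return actions

plan = sample_trajectory()
print(np.abs(plan - target).max())  # the sampled plan lands near the target
```

In the real system, the whole sampling loop reruns multiple times per second, each time conditioned on fresh sensor observations.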

Using diffusion to generate robot behavior provides three key benefits over previous approaches:

  • Applicability to multi-modal demonstrations. This means human demonstrators can teach behaviors naturally and not worry about confusing the robot.
  • Suitability for high-dimensional action spaces. This makes it possible for the robot to plan forward in time, which helps avoid myopic, inconsistent, or erratic behavior.
  • Stable and reliable training. This means it’s possible to train robots at scale and have confidence they will work, without laborious hand-tuning or hunting for golden checkpoints.

Multi-Modal Behaviors

Most real-world tasks can be solved in many different ways. When picking up a cup, for example, a person might grab it from the top, the side, or even the bottom. This phenomenon, behavioral multimodality, has historically been very difficult for behavior learning methods to cope with, despite its ubiquity in normal human behavior.

Consider the simple illustrative case of a T-shaped block sitting on a table that the robot must push to a target location. The robot can move the block by sliding and must move around the block to reach different sides of the T; it cannot fly over the block. This task involves inherent multimodality — it’s generally reasonable to go around the block either to the left or to the right — with two modes of equally correct action. The solution is that instead of predicting a single action, we learn a distribution over actions. Diffusion Policy is able to learn these distributions in a more stable and robust way and captures this multi-modality much better than previous approaches.

Sampled behavior on the Push-T domain, Diffusion Policy vs. prior approaches. Please see our Diffusion Policy writeup for more technical detail.

Being able to handle multimodal demonstrations has proven critical in our success teaching complex dexterous behaviors, where this type of multimodality is endemic, and it also enables our robots to easily learn from multiple teachers as we scale up data collection.
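A toy numeric example (not TRI's training code) shows why multimodality breaks deterministic regression: a policy trained with mean-squared error converges to the average of the demonstrated modes, which in the Push-T case means driving straight into the block, while a policy that models the action distribution samples one valid mode at a time.

```python
import numpy as np

# Toy multimodal demonstrations: go around an obstacle at x = 0 either
# to the left (-1) or the right (+1), with equal probability.
rng = np.random.default_rng(1)
demos = rng.choice([-1.0, 1.0], size=1000)

# A deterministic MSE regressor converges to the mean of the modes:
# roughly 0.0, i.e. straight into the obstacle -- an invalid action.
mse_prediction = demos.mean()

# A generative policy that models the full distribution instead samples
# one of the demonstrated modes, which is always a valid action.
sampled_action = rng.choice(demos)

print(round(float(mse_prediction), 2), sampled_action)
```

Diffusion Policy plays the role of the generative sampler here, but over whole high-dimensional trajectories rather than a single scalar.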

High-Dimensional Actions

Diffusion is naturally well suited to high-dimensional output spaces. Generating images, for example, requires predicting hundreds of thousands of individual pixels. For robotics, this is a key advantage that allows diffusion-based behavior models to gracefully scale to complex robots with multiple limbs. It also provides the critical ability to predict intended trajectories of actions instead of single timesteps. Recent work (Diffusion Policy, ACT) has shown that trajectory prediction is often a key design feature for learning robust policies that perform well.
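The trajectory-prediction idea is often run in a receding-horizon loop: predict a chunk of future actions, execute the first portion, then re-plan. The sketch below is our illustration of that pattern, with made-up horizon lengths and a stand-in for the learned policy.

```python
import numpy as np

# Receding-horizon execution with trajectory prediction (illustrative
# constants; the real policy and re-planning rate differ).
HORIZON = 16   # actions predicted per inference call
EXECUTE = 8    # actions actually executed before re-planning

def predict_trajectory(observation):
    # Stand-in for the learned policy: returns a HORIZON x 7 chunk of
    # end-effector pose commands (position + quaternion, say).
    return np.tile(observation, (HORIZON, 1))

def control_loop(observation, n_ticks=24):
    executed, buffer = [], []
    for _ in range(n_ticks):
        if not buffer:
            # Re-plan: predict a whole trajectory, keep the first EXECUTE steps.
            buffer = list(predict_trajectory(observation)[:EXECUTE])
        executed.append(buffer.pop(0))
        # (In a real system, `observation` is refreshed from sensors here.)
    return executed

actions = control_loop(np.zeros(7))
print(len(actions))  # 24 actions executed across 3 policy inferences
```

Predicting a chunk and committing to part of it is what keeps the behavior temporally consistent instead of jittering between modes every timestep.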

Stable Training

Diffusion Policy is also embarrassingly simple to train; new behaviors can be taught without requiring numerous costly and laborious real-world evaluations to hunt for the best-performing checkpoints and hyperparameters. Unlike computer vision or natural language applications, AI-based closed-loop systems cannot be accurately evaluated with offline metrics; they must be evaluated in a closed-loop setting, which, in robotics, generally requires evaluation on physical hardware. This bottleneck makes any learning pipeline that requires extensive tuning or hyperparameter optimization impractical. Because Diffusion Policy works out of the box so consistently, it allows us to bypass this difficulty and has been a key enabler of scale for us.

Our Platform

Teleoperation

Because we teach our robots via human demonstration, a good teleoperation interface is critical for teaching challenging dexterous behaviors. Our approach to robot learning is agnostic to the choice of teleoperation device, and we have used a variety of low-cost interfaces such as joysticks. For more dexterous behaviors, we teach via bimanual haptic devices with position-position coupling between the teleoperation device and the robot. Position-position coupling means that the input device sends its measured pose as commands to the robot, and the robot tracks these pose commands using torque-based Operational Space Control. The robot’s pose-tracking error is then converted to a force and sent back to the input device for the teacher to feel. This allows teachers to close the feedback loop with the robot through force, and has been critical for many of the most difficult skills we have taught.
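The position-position coupling loop can be sketched as follows. This is our simplified interpretation of the scheme described above, not TRI's controller: the gain value and function names are illustrative, and the robot's torque-based tracking is abstracted away.

```python
import numpy as np

# Minimal sketch of position-position coupling with force feedback.
# The feedback gain is an illustrative value, not a tuned parameter.
K_FEEDBACK = 50.0  # maps robot tracking error (m) to device force (N)

def teleop_tick(device_pose, robot_pose):
    # 1. The input device's measured pose becomes the robot's pose command.
    pose_command = device_pose
    # 2. The robot tracks the command with torque-based control (abstracted
    #    here); pushing against the world leaves a residual tracking error.
    tracking_error = pose_command - robot_pose
    # 3. That error is converted to a force rendered on the input device,
    #    so the teacher literally feels what the robot is pushing against.
    feedback_force = K_FEEDBACK * tracking_error
    return pose_command, feedback_force

# If the robot is blocked 2 cm short of the commanded pose along x...
cmd, force = teleop_tick(np.array([0.50, 0.0, 0.3]),
                         np.array([0.48, 0.0, 0.3]))
print(force)  # ...the teacher feels about 1 N pushing back along x
```

The key property is that contact forces the robot cannot see visually show up directly in the teacher's hands, closing the loop through force.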

Providing force feedback is particularly important when a robot is manipulating an object with both arms. An illustrative example of this is operating a device that requires actuation, such as a manual hand mixer, which was impossible to reliably demonstrate without this feedback.

In this example, a human teacher attempted 10 egg-beating demonstrations. With haptic force feedback, the operator succeeded every time. Without this feedback, they failed every time. Pictured (right) is the breakdown of failure modes encountered without force.

When the robot holds a tool with both arms, it creates a closed kinematic chain. For any given configuration of the robot and tool, there is a large range of possible internal forces that are unobservable visually. Certain force configurations, such as pulling the grippers apart, are inherently unstable and make it likely the robot’s grasp will slip. If human demonstrators do not have access to haptic feedback, they won’t be able to sense or teach proper control of force. We find that haptic feedback is critical for improving demonstration success rate when teaching force-sensitive behaviors, particularly ones requiring coupling of both arms.

Here, a human demonstrator is attempting to move a cracker up and down using both grippers — without breaking it. It’s easy with haptic feedback (left) and extremely difficult without (right). With haptic feedback, the demonstrator can easily feel the forces caused by improper coordination between the two grippers, and adjust accordingly. As a result, the operator doesn’t break the cracker until they want to (at the end of the video).

Tactile Sensing

Anyone who has attempted to tie a shoe while wearing gloves has experienced how important the sense of touch is for people; when performing dexterous tasks, being able to feel what is happening provides additional information that is critical for success. We believe robots are no different and also benefit from a sense of touch. To this end, we employ TRI Soft-Bubble sensors on many of our platforms. These sensors consist of an internal camera observing an inflated deformable outer membrane. They go beyond measuring sparse force signals and allow the robot to perceive spatially dense information about contact patterns, geometry, slip, and force.

The robot loads dishes into a rack, aided by information from a tactile bubble sensor.

While sensors of this type have been more popular in recent years, making good use of the information they provide has been a challenge. Diffusion provides a natural way for robots to use the full richness these visuotactile sensors afford — we condition on these signals as an additional input — which allows us to apply them to arbitrary dexterous tasks.
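Mechanically, "conditioning on these signals as an additional input" can be as simple as embedding each sensor stream and concatenating the results into one observation vector for the diffusion model. The sketch below is a guess at that wiring with toy dimensions; the real encoders are learned networks.

```python
import numpy as np

# Illustrative fusion of vision and tactile conditioning signals.
def embed(x, dim):
    # Stand-in for a learned encoder (e.g. a small CNN per image stream);
    # a fixed random projection is used here just to produce a vector.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(dim, x.size))
    return W @ x.ravel()

camera_image = np.zeros((64, 64))   # wrist-camera frame (toy size)
bubble_image = np.zeros((32, 32))   # Soft-Bubble internal-camera frame (toy size)

# One conditioning vector combining both modalities; the diffusion policy
# treats it exactly like any other observation input.
observation = np.concatenate([embed(camera_image, 128),
                              embed(bubble_image, 64)])
print(observation.shape)  # (192,)
```

Because the policy is agnostic to what the conditioning vector contains, adding a new sensor does not require changing the learning algorithm itself.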

Book page turning: The task is to flip the recipe book to the salad page. Notice the policy accidentally flipped too many pages backward and has to recover from that. Also, notice the subtle slippage detected by the policy on the red page.
Plate singulation: The task is to singulate a plate from a stack. Notice the subtle bumps that signal the top plate becoming unseated from the stack.
Flipping pancake: The task is to scoop a pancake that’s partially dangling from the griddle. The robot has to first pull it back fully onto the griddle. Notice the subtle motion required to regulate friction between the pancake and the griddle when pulling the pancake back onto the griddle.

Our early experiments in this direction have been extremely promising. We’re finding that in many cases, adding contact sensing drastically improves a robot’s ability to perform tasks with interesting contact phases.

A real-world performance comparison between tactile-enabled and vision-only learned policies.

Safe and Performant Control

A critically important, and often overlooked, component of a high-performance robot is its mid-level control layer. In our case, both the learned policies and human demonstrators issue position and orientation commands for the robot’s gripper at 10Hz. These commands are then upsampled and converted into 1kHz joint-level torque commands by the mid-level controller. Crucially, this mid-level controller has built-in safety features that safeguard the robot and prevent it from executing potentially dangerous higher-level commands (either from a learned policy or from a human teacher).
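The rate conversion in that pipeline can be illustrated with simple linear interpolation between successive policy commands. The actual mid-level controller is far more sophisticated (it solves a constrained optimization at the joint level); this sketch only shows the 10 Hz to 1 kHz upsampling step.

```python
import numpy as np

# Upsampling 10 Hz pose commands to a 1 kHz control rate (illustrative).
POLICY_HZ, CONTROL_HZ = 10, 1000
TICKS_PER_COMMAND = CONTROL_HZ // POLICY_HZ  # 100 control ticks per command

def upsample(prev_cmd, next_cmd):
    # Linearly interpolate position commands between two successive
    # 10 Hz policy outputs, yielding one setpoint per 1 kHz control tick.
    alphas = np.linspace(0, 1, TICKS_PER_COMMAND, endpoint=False)
    return [(1 - a) * prev_cmd + a * next_cmd for a in alphas]

setpoints = upsample(np.array([0.0, 0.0, 0.0]),
                     np.array([0.0, 0.0, 0.1]))
print(len(setpoints))  # 100 smooth setpoints per policy command
```

Each interpolated setpoint is then turned into joint torques by the controller, which is also where the safety constraints described next are enforced.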

Here, the teacher intentionally commands an unsafe behavior that would result in a collision — which the robot’s control layer correctly prevents.

Our approach is rooted in Operational Space Control and is formulated as a constrained optimization problem over joint-level commands. The objective is to track high-level commands provided by the demonstrator or the learned policy while obeying physics and other safety constraints such as collision avoidance. This implementation leverages the Drake Systems Framework, which enables rigorous analysis and testing — allowing us to be confident in this key piece of our system. We’re planning to open-source this implementation in the future and view it as one of the key drivers of our success to date.
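A heavily simplified stand-in for that optimization: track a desired end-effector acceleration subject to joint torque limits. A real implementation solves a constrained QP (with collision-avoidance and other constraints, e.g. via Drake); here we least-squares-solve the unconstrained tracking problem and clamp torques, which conveys the structure but not the guarantees. The Jacobian values and limits are made up.

```python
import numpy as np

# Crude sketch of torque-limited operational space control.
rng = np.random.default_rng(0)
J = rng.normal(size=(6, 7))   # end-effector Jacobian (illustrative values)
M = np.eye(7)                 # joint-space mass matrix (identity for the sketch)
TAU_LIMIT = 5.0               # per-joint torque limit (illustrative)

def osc_torques(desired_ee_accel):
    # Unconstrained tracking objective: joint accelerations qdd such that
    # J @ qdd approximates the desired end-effector acceleration.
    qdd, *_ = np.linalg.lstsq(J, desired_ee_accel, rcond=None)
    tau = M @ qdd             # inverse dynamics (gravity/Coriolis omitted)
    # Enforce the safety constraint by clamping; a QP would respect the
    # limit exactly while re-optimizing tracking, rather than clipping.
    return np.clip(tau, -TAU_LIMIT, TAU_LIMIT)

tau = osc_torques(np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0]))
print(np.all(np.abs(tau) <= TAU_LIMIT))  # torques respect the limit
```

Formulating the controller as an optimization is what lets safety constraints override any higher-level command, whether it comes from a learned policy or a human teacher.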

A solid mid-level controller is truly the foundation of a high-quality behavior-learning pipeline. It not only enables mission-critical features like impedance control and haptic feedback but also provides invaluable safety protections for the overall system and allows teachers to push the robot to its physical limits without worrying about damaging it.

Where Are We Now

We’re entering a remarkable new era of robotics. What used to take expert roboticists weeks of development time, anyone can now teach in an afternoon with the expectation of success at the end of the day. It is finally becoming possible to teach dexterous behaviors featuring complex interactions with a single pipeline that just works — without tuning.

Programming this behavior a few years ago (top) took us months. The taught version (bottom) took us a day.

There is still more to do. Currently, when we teach a robot a new skill, it is brittle. Skills will work well in circumstances that are similar to those used in teaching, but the robot will struggle when they differ. In practice, the most common failure modes we observe are:

  • States where no recovery has been demonstrated. This can be the result of demonstrations that are too clean.
  • Significant changes in camera viewpoint or background.
  • Test-time manipulands that were not encountered during training.
  • Distractor objects, for example, significant clutter that was not present during training.

So what’s the solution? Much like a person becomes increasingly capable and adaptable when learning from a lifetime of experience, we believe that a highly diverse behavioral curriculum is the key to creating more flexible and general robots. To this end, TRI is investing heavily in building a powerful behavioral curriculum of embodied data on both a fleet of physical robots and in our powerful Drake simulation suite.

To date, we have taught more than 60 behaviors to our fleet of dexterous manipulation robots and are in the process of increasing velocity on this effort with a target of 200 taught behaviors by year’s end. These behaviors span a wide range of manipulation scenarios, from tool use, to deformable object manipulation, to careful bimanual coordination, and more.

Here are just some of the behaviors in our rapidly growing behavior curriculum, which we recently showcased here.

What’s Next

We’ve seen early signs of success when training policies from a diverse skill dataset. In one instance, we taught the robot to empty a mug of ice into the sink in relatively uncluttered scenes. When evaluated with heavy clutter, the policy trained from just this data failed almost instantly. We trained a second version of this skill jointly with 15 other tasks and conditioned the robot on a language description of desired behavior. Despite having access to identical demonstrations for pouring ice, the multi-task skill succeeds in the heavily cluttered setting where the single-task version failed catastrophically.

A representative training example (left), multi-skill evaluation rollout (center), and single-skill evaluation rollout (right).

We expect, however, that physical robot data alone will not be sufficient to create truly general dexterous robots. To fill this gap, we’re leaning hard on our powerful simulation expertise using Drake — which is unique in its ability to accurately model detailed physics interactions, between both rigid and soft objects, that are critical for precise and complex behaviors.

Realistic teleoperated dexterous manipulation in a Drake simulation.

As this effort progresses, one of the surest signs of success will be zero-shot dexterous behavior generation and in-context learning. Existing Large Language Models possess the powerful ability to compose concepts in novel ways and learn from single examples. In the past year, we’ve seen this enable robots to generalize semantically (for example, pick and place with novel objects). The next big milestone is the creation of equivalently powerful Large Behavior Models that fuse this semantic capability with a high level of physical intelligence and creativity. These models will be critical for general-purpose robots that are able to richly engage with the world around them and spontaneously create new dexterous behaviors when needed.

If you have any questions or would like to hear from us directly on this project, please join our LinkedIn Live Q&A session on October 4th from 1:00–1:30 pm ET / 10:00–10:30 am PT. Sign up for the event on TRI’s LinkedIn page.

Acknowledgments

None of this work would have been possible without the hard work of Eric Cousineau, Naveen Kuppuswamy, and Paarth Shah; and our other collaborators Alex Alspach, Rares Ambrus, Max Bajracharya, Andrew Beaulieu, Aditya Bhat, Ishaan Chandratreya, Cheng Chi, Rick Cory, Sam Creasey, Hongkai Dai, Richard Denitto, Zach Fang, Adrien Gaidon, Grant Gould, Kunimatsu Hashimoto, Brendan Hathaway, Allison Henry, Phoebe Horgan, Jenna Hormann, Yutaro Ishida, Thomas Kollar, Dale McConachie, Ian McMahon, Calder Phillips-Grafflin, Gordon Richardson, Charlie Richter, Taro Takahashi, Pavel Tokmakov, Jarod Wilson, Tristan Whiting, and Blake Wulfe.
