A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

Large Behavior Models Team, Toyota Research Institute

Authors and contributions

First authors1: Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, Naveen Kuppuswamy, Kuan-Hui Lee, Katherine Liu, Dale McConachie, Ian McMahon, Haruki Nishimura, Calder Phillips-Grafflin, Charles Richter, Paarth Shah, Krishnan Srinivasan, Blake Wulfe, Chen Xu, Mengchao Zhang.

Second authors2: Alex Alspach, Maya Angeles, Kushal Arora, Vitor Campagnolo Guizilini, Alejandro Castro, Dian Chen, Ting-Sheng Chu, Sam Creasey, Sean Curtis, Richard Denitto, Emma Dixon, Eric Dusel, Matthew Ferreira, Aimee Goncalves, Grant Gould, Damrong Guoy, Swati Gupta, Xuchen Han, Kyle Hatch, Brendan Hathaway, Allison Henry, Hillel Hochsztein, Phoebe Horgan, Shun Iwase, Donovon Jackson, Siddharth Karamcheti, Sedrick Keh, Joseph Masterjohn, Jean Mercat, Patrick Miller, Paul Mitiguy, Tony Nguyen, Jeremy Nimmer, Yuki Noguchi, Reko Ong, Aykut Onol, Owen Pfannenstiehl, Richard Poyner, Leticia Priebe Mendes Rocha, Gordon Richardson, Christopher Rodriguez, Derick Seale, Michael Sherman, Mariah Smith-Jones, David Tago, Pavel Tokmakov, Matthew Tran, Basile Van Hoorick, Igor Vasiljevic, Sergey Zakharov, Mark Zolotas.

Last authors3: Rares Ambrus, Kerri Fetzer-Borelli, Ben Burchfiel, Hadas Kress-Gazit, Siyuan Feng, Stacie Ford, Russ Tedrake.

All authors are sorted alphabetically within each group.
1 Primary contributors who made substantial contributions to the work (policy architecture and training, evaluation, simulation).
2 Assisted with the work (developing infrastructure, data collection, paper edits and feedback).
3 Led the project and are responsible for strategic decisions (method, benchmark, paper writing and presentation).

General-purpose robots promise a future where household assistance is ubiquitous and aging in place is supported by reliable, intelligent help. These robots will unlock human potential by enabling people to shape and interact with the physical world in transformative new ways. At the core of this transformation are Large Behavior Models (LBMs) — embodied AI systems that take in robot sensor data and output actions. LBMs are pretrained on large, diverse manipulation datasets and offer the key to realizing robust, general-purpose robotic intelligence.

Yet despite their growing popularity, we still know surprisingly little about the nuances of what today’s LBMs actually offer. This uncertainty stems from the difficulty of conducting rigorous, large-scale evaluations in real-world robotics. As a result, progress in algorithm and dataset design is often guided by intuition rather than evidence, hampering progress. Our work aims to change that.

Autonomous evaluation rollouts from two finetuned LBMs performing long-horizon behaviors: (left) coring and cutting an apple, and (right) installing a bike rotor.
(Both videos are playing at 1x speed.)

Result Highlights

We trained a series of diffusion-based LBMs on almost 1,700 hours of robot data and conducted 1,800 real-world evaluation rollouts and over 47,000 simulation rollouts to rigorously study their capabilities.

We found that LBMs:

  • Deliver consistent performance improvements relative to from-scratch policies;
  • Enable new tasks to be learned with 3-5× less data in challenging settings requiring robustness to a variety of environmental factors;
  • Improve steadily as pretraining data increases.

Even with just a few hundred diverse hours of pretraining data, and only a few hundred demonstrations per behavior, performance improved meaningfully: pretraining provides consistent uplifts at earlier-than-expected scales. There is not yet an internet's worth of robot data, but the benefits appear far before that scale, a promising sign for enabling virtuous cycles of data acquisition and bootstrapped performance.

Simulation finetuning scaling law plot

As we add more data to our pretraining mixture by including additional tasks, aggregate LBM performance improves smoothly.

Our evaluation suite includes several novel and highly challenging long-horizon real-world tasks; when finetuned and evaluated on these tasks, LBMs benefit from pretraining despite these behaviors being highly distinct from the pretraining tasks.

Achieved task progression

Achieved task progression on novel long-horizon real-world tasks.

Side-by-side comparison of models setting up a breakfast table: (left) single-task baseline, and (right) LBM.
(Both videos are playing at 1x speed.)

LBMs — Architecture and Data

LBM architecture diagram

The LBM architecture is instantiated as a diffusion transformer which predicts robot actions.

Our LBMs are scaled multitask diffusion policies with multimodal ViT vision-language encoders and a transformer denoising head conditioned on encoded observations via AdaLN. These models consume wrist- and scene-camera images, robot proprioception, and language prompts, and predict action chunks of 16 timesteps (1.6 seconds).
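Concretely, AdaLN conditioning replaces LayerNorm's fixed affine parameters with a scale and shift regressed from the conditioning signal. Below is a minimal numpy sketch of that mechanism; the dimensions and projection matrices (`w_scale`, `w_shift`) are illustrative stand-ins, not the model's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance
    # (no learned affine; AdaLN supplies the scale/shift instead).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, cond, w_scale, w_shift):
    # Scale and shift are regressed from the conditioning vector
    # (here standing in for the encoded observations), one per channel.
    gamma = cond @ w_scale   # (D,)
    beta = cond @ w_shift    # (D,)
    return gamma * layer_norm(x) + beta

T, D, C = 16, 8, 4                    # chunk length, feature dim, cond dim
x = rng.normal(size=(T, D))           # noisy action-token features
cond = rng.normal(size=(C,))          # observation embedding
w_scale = rng.normal(size=(C, D))
w_shift = rng.normal(size=(C, D))
out = adaln(x, cond, w_scale, w_shift)
```

In the actual architecture the conditioning vector would come from the vision-language encoders, and a fresh scale/shift pair modulates each transformer block of the denoising head.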

We train our LBMs on a mixture of 468 hours of internally collected bimanual robot teleoperation data, 45 hours of simulation-collected teleoperation data, 32 hours of Universal Manipulation Interface (UMI) data, and roughly 1,150 hours of internet data curated from the Open X-Embodiment dataset. While the proportion of simulation data is small, its inclusion in our pretraining mixture ensures that we can evaluate the same LBM checkpoint in both sim and real.
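Tallying the sources above shows where the "almost 1,700 hours" figure comes from and how heavily each source weighs in the mixture (hours as reported above; the dictionary keys are just labels):

```python
# Hours per data source, as reported in the text.
hours = {
    "TRI bimanual teleop": 468,
    "simulation teleop": 45,
    "UMI": 32,
    "Open X-Embodiment (curated)": 1150,
}
total = sum(hours.values())                      # 1,695 hours in all
shares = {k: v / total for k, v in hours.items()}
```

Note that the curated internet data dominates by hours, while the simulation slice is under 3% of the mixture, yet it is what makes sim-and-real evaluation of a single checkpoint possible.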

Evaluation — Simulation, Real Robots, and
Careful Protocols

Our LBMs are evaluated on physical and Drake-simulated bimanual stations employing Franka Panda FR3 arms and up to six cameras — up to two on each wrist, and two static scene cameras.

Evaluation setup and tasks

We evaluate our LBM models on a bimanual platform across a variety of tasks and environmental conditions in both simulation and in the real world.

We evaluate our models on both seen tasks (present in the pretraining data) and unseen tasks (on which we finetune our pretrained models). Our evaluation suite consists of 16 simulated seen-during-pretraining tasks, 3 real-world seen-during-pretraining tasks, 5 previously unseen long-horizon simulated tasks, and 5 complex previously unseen long-horizon real-world tasks. Each model was tested via 50 rollouts per real-world task and 200 rollouts per simulation task. This enables a high level of statistical rigor in our analysis, with the pretrained models evaluated on 4,200 rollouts across 29 tasks.

We carefully control initial conditions to be consistent in both the real-world and simulation, and conduct blind A/B-style testing in the real world with statistical significance computed via a sequential hypothesis testing framework.
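We do not reproduce the exact sequential procedure here, but Wald's sequential probability ratio test (SPRT) illustrates the general idea behind sequential hypothesis testing: each rollout updates a log-likelihood ratio, and the evaluation stops early once the accumulated evidence crosses a decision threshold. The hypotheses and error rates in this sketch are illustrative placeholders, not our actual protocol.

```python
from math import log

def sprt(outcomes, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for a Bernoulli success rate: H0: p = p0 vs
    H1: p = p1 (with p1 > p0). Returns the decision and how many
    rollouts were consumed before stopping."""
    upper = log((1 - beta) / alpha)   # crossing -> accept H1
    lower = log(beta / (1 - alpha))   # crossing -> accept H0
    llr = 0.0
    for n, success in enumerate(outcomes, start=1):
        llr += log(p1 / p0) if success else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "inconclusive", len(outcomes)

# A policy that succeeds every time against H0: p = 0.5 vs H1: p = 0.8.
decision, n_used = sprt([1] * 20, p0=0.5, p1=0.8)
```

On a run of 20 straight successes with these settings, the test accepts H1 after only 7 rollouts rather than consuming the full budget, which is exactly the appeal of sequential testing when real-robot rollouts are expensive.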

Many of the effects we observe were only measurable with larger-than-standard sample sizes and careful statistical testing that is non-standard for empirical robotics. It's easy for noise due to experimental variation to dwarf the effects being measured, and many robotics papers may be measuring statistical noise due to insufficient statistical power. The plot below shows the size of Clopper-Pearson confidence intervals at varying numbers of rollouts and true success rates. With 50 rollouts (for example, 10 rollouts each on 5 behaviors), the resulting confidence interval width is generally 20% to 30% absolute success rate, making all but the largest effects impossible to reliably measure.
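The Clopper-Pearson interval needs nothing more than the binomial CDF, so the widths quoted above are easy to reproduce. The sketch below (pure standard library, illustrative rather than our evaluation code) finds the exact interval by bisection on the binomial tail probabilities:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial
    success rate, located by bisection on the tail probabilities."""
    def bisect(f, lo=0.0, hi=1.0, iters=100):
        # f is monotonically increasing on [0, 1]; find its root.
        for _ in range(iters):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: p with P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper limit: p with P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

lo, hi = clopper_pearson(25, 50)   # 25 successes in 50 rollouts
```

For 25 successes in 50 rollouts this gives an interval of roughly (0.36, 0.64), about 29 percentage points wide, consistent with the 20% to 30% range quoted above.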

Confidence interval width plot

We show confidence intervals as a function of number of evaluation rollouts.

Final Thoughts

One of our main takeaways is that finetuned performance smoothly improves with increasing pretraining data. At the data scales we examined, we saw no evidence of performance discontinuities or sharp inflection points; AI scaling appears alive and well in robotics.

We did experience mixed results with non-finetuned pretrained LBMs. Encouragingly, we found that a single network is able to learn many tasks simultaneously, but we did not observe consistent outperformance of from-scratch single-task training without finetuning. We suspect this is partially due to limited language steerability in our model. In internal testing, we've seen promising early signs that larger VLA prototypes overcome some of this difficulty, but more work is required to rigorously examine this effect in higher-language-capacity models.

Our findings largely support the recent surge in popularity of LBM-style robot foundation models, adding to evidence that large-scale pretraining on diverse robot data is a viable path towards more capable robots, though with a few points of caution. Notably, subtle design choices like data normalization can have large effects on performance, often dominating architectural or algorithmic changes. It's important that these design choices are carefully isolated to avoid conflating the source of performance changes.

More Resources

BibTeX Citation

@misc{lbm2025,
  title={A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation}, 
  author={Large Behavior Models Team, Toyota Research Institute},
  year={2025},
  eprint={2507.05331},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2507.05331}, 
}