τ₀-WM: A Unified Video-Action World Model for Robotic Manipulation

Summary

τ₀-WM is a 5-billion-parameter robotic foundation model from AgiBot that unifies policy learning, video prediction, and action evaluation within a single future-predictive framework built on a shared video diffusion backbone. Trained on 27,300 hours of real-robot teleoperation, UMI-style demonstrations, and egocentric interaction videos, it enables robots to simultaneously generate executable actions and anticipate their future visual consequences before physical execution.

Key Contributions

Unified video-action architecture: a single diffusion model jointly generates future video frames and action sequences, enabling tight coupling between prediction and control
Large-scale training corpus: 27.3K hours of real-robot and egocentric video spanning diverse embodiments and tasks (one of the largest disclosed robot training sets)
Action evaluation via world model rollouts: the model can assess candidate actions by simulating their outcomes before committing, acting as an internal critic
Open release: a 5B-parameter open model from AgiBot Finch, making a large-scale robot foundation model publicly accessible
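
The action-evaluation idea above (simulate candidate actions, then commit to the best one) can be illustrated with a minimal sketch. This is not τ₀-WM's implementation or API. Everything here is a hypothetical stand-in: `toy_world_model` replaces the learned video diffusion predictor with simple 2-D point dynamics, `score_rollout` replaces whatever critic the model uses with a distance-to-goal score, and `select_action_sequence` is an illustrative name for the ranking step.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]    # toy 2-D position (assumption, not the model's state)
Action = Tuple[float, float]   # toy 2-D displacement (assumption)


def toy_world_model(state: State, actions: Sequence[Action]) -> List[State]:
    """Hypothetical stand-in for a learned predictor.

    Returns the predicted future state after each action. A real world model
    would predict future video frames instead of 2-D points.
    """
    x, y = state
    trajectory: List[State] = []
    for dx, dy in actions:
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory


def score_rollout(trajectory: List[State], goal: State) -> float:
    """Score a predicted rollout; higher is better.

    Uses the negative squared distance between the final predicted state and
    the goal, an assumed scoring rule chosen only for illustration.
    """
    fx, fy = trajectory[-1]
    return -((fx - goal[0]) ** 2 + (fy - goal[1]) ** 2)


def select_action_sequence(
    state: State,
    candidates: Sequence[Sequence[Action]],
    goal: State,
    world_model: Callable[[State, Sequence[Action]], List[State]] = toy_world_model,
) -> Tuple[int, float]:
    """Roll out every candidate in the world model and return (best index, best score)."""
    scores = [score_rollout(world_model(state, c), goal) for c in candidates]
    # max over indices returns the first candidate in case of a tie.
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return best_idx, scores[best_idx]


if __name__ == "__main__":
    start: State = (0.0, 0.0)
    goal: State = (1.0, 1.0)
    candidates = [
        [(1.0, 0.0), (1.0, 0.0)],  # moves right twice
        [(0.0, 1.0), (0.0, 1.0)],  # moves up twice
        [(1.0, 1.0), (0.0, 0.0)],  # moves diagonally, then stays
    ]
    best_idx, best_score = select_action_sequence(start, candidates, goal)
    print(f"best candidate: {best_idx}, score: {best_score}")
```

The design point is that no candidate is executed on the robot during ranking: each one is evaluated purely on its predicted outcome, and only the top-scoring sequence would be sent for execution.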

Significance

τ₀-WM is a major open-access release from AgiBot, a leading Chinese humanoid robotics company. It indicates that unifying video generation and action prediction at scale can produce emergent robot generalisation capabilities, and its 27,300-hour training corpus is among the largest disclosed for robotic foundation models.
