Summary
This position paper challenges the dominant framing that generalist robot intelligence is primarily a policy-scaling problem. The authors, from ETH Zürich, Tübingen, IIT, TU Darmstadt, and Huawei Noah’s Ark Lab, argue that current stacks combining vision-language-action (VLA) models with world models lack four interfaces: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress from video and language.
Key Contributions
- Articulates the “grounded supervision gap”: even with strong VLAs and world models, learning from the world’s abundant unstructured data remains unsolved
- Proposes four missing interface categories with concrete research directions for each
- Surveys recent progress in cross-embodiment datasets, human-video learning, world models, and reward modelling through this lens
- Frames robotics’ bottleneck as a data-interface engineering problem, not only a model-scaling problem
Significance
The paper sets out a compelling research agenda from leading academic and industrial robotics groups, reframing where the field should invest next. It was broadly cited as a north-star position paper within weeks of publication.