Robots Need More than VLA and World Models

Summary

This position paper challenges the dominant framing that generalist robot intelligence is primarily a policy-scaling problem. The authors, from ETH Zürich, Tübingen, IIT, TU Darmstadt, and Huawei Noah’s Ark Lab, argue that current stacks combining vision-language-action (VLA) models with world models are missing four interfaces: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress from video and language.

Key Contributions

Articulates the “grounded supervision gap”: even with strong VLAs and world models, learning from the world’s abundant unstructured data remains unsolved
Proposes four missing interface categories with concrete research directions for each
Surveys recent progress in cross-embodiment datasets, human-video learning, world models, and reward modelling through this lens
Frames robotics’ bottleneck as a data-interface engineering problem, not only a model-scaling problem

Significance

The paper sets out a research agenda from leading academic robotics groups that reframes where the field should invest next. It was broadly cited as a north-star position paper within weeks of publication.
