Summary

This position paper challenges the dominant framing that generalist robot intelligence is primarily a policy-scaling problem. The authors, from ETH Zürich, Tübingen, IIT, TU Darmstadt, and Huawei Noah’s Ark Lab, argue that current stacks combining vision-language-action (VLA) models with world models lack four interfaces: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress from video and language.

Key Contributions

  • Articulates the “grounded supervision gap”: even with strong VLAs and world models, learning from the world’s abundant unstructured data remains unsolved
  • Proposes four missing interface categories with concrete research directions for each
  • Surveys recent progress in cross-embodiment datasets, human-video learning, world models, and reward modelling through this lens
  • Frames robotics’ bottleneck as a data-interface engineering problem, not only a model-scaling problem

Significance

The paper sets out a compelling research agenda, from leading academic and industry robotics groups, that reframes where the field should invest next. It was broadly cited as a north-star position paper within weeks of publication.