Summary
This survey argues that future progress in VLA models will depend less on architectural innovation and more on co-designing high-fidelity data engines with structured evaluation protocols. Organized around three pillars — datasets, benchmarks, and data engines — it systematically analyzes the data infrastructure underlying embodied learning and identifies the critical bottlenecks that limit real-world deployment.
Key Contributions
- Systematic review of VLA datasets covering scale, diversity, embodiment coverage, and annotation quality
- Analysis of benchmark design principles, identifying gaps in current evaluation protocols
- Survey of data engine pipelines including simulation, human teleoperation, and automated data collection
- Argument for data infrastructure co-design as the primary driver of next-generation VLA advances
Significance
By reframing VLA progress as a data problem rather than a model problem, this survey provides a critical roadmap for the community — highlighting where investment in data pipelines and evaluation benchmarks will yield the highest returns for embodied AI.