Summary
This IJCAI 2026 survey provides a unified view of how human video data can be leveraged to train vision-language-action (VLA) models at scale, sidestepping the high cost of collecting robot demonstrations. It groups existing approaches into four classes according to the action-related information they extract from video: latent action representations, predictive world models, explicit 2D visual supervision, and explicit 3D structural cues.
Key Contributions
- Taxonomizes human-centric VLA learning into four distinct paradigms with a common conceptual framework
- Reviews the landscape of datasets, methods, and embodiment transfer strategies derived from human video
- Covers the challenges of the embodiment gap, action grounding, and distributional shift in human-to-robot transfer
- Identifies open problems in scaling human video datasets for diverse manipulation tasks
Significance
Because collecting robot demonstrations remains costly and embodiment-specific, this survey is a key reference for a community shifting toward scalable VLA pre-training on abundant, internet-scale human video data.