From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

Summary

This IJCAI 2026 survey provides a unified view of how human video data can be used to train vision-language-action (VLA) models at scale, sidestepping the high cost of collecting robot demonstrations. It groups existing approaches into four classes according to the action-related information they extract from video: latent action representations, predictive world models, explicit 2D visual supervision, and explicit 3D structural cues.

Key Contributions

Taxonomizes human-centric VLA learning into four distinct paradigms with a common conceptual framework
Reviews the landscape of datasets, methods, and embodiment transfer strategies derived from human video
Covers the challenges of the embodiment gap, action grounding, and distribution shift in human-to-robot transfer
Identifies open problems in scaling human video datasets for diverse manipulation tasks

Significance

Because collecting robot demonstrations remains costly and embodiment-specific, this survey is a key reference for a community that is shifting toward scalable VLA pre-training on abundant internet-scale human video data.
