Summary
This survey examines the growing role of world models in robotic manipulation through three organizing questions: what future representation is predicted (pixels, latents, object states), how prediction is connected to action (implicit vs. explicit coupling), and when prediction is used in the robot-learning pipeline (pretraining, data augmentation, planning, evaluation). It provides a taxonomy of current approaches and highlights open challenges.
Key Contributions
- Unified three-axis taxonomy: representation predicted, coupling to action, and stage of use in the pipeline
- Comprehensive review of world-model-based data augmentation, model-based RL, and test-time planning for manipulation
- Analysis of evaluation protocols and benchmarks across simulation and real-robot settings
- Identification of key gaps: physical consistency, long-horizon coherence, and real-world calibration
Significance
As world models proliferate across robotics research, this survey gives practitioners a structured map of the design space and a basis for selecting approaches suited to their manipulation tasks.