Summary
This paper systematically compares Vision-Language-Action (VLA) models against World Action Models (WAMs) under visual and language perturbations on LIBERO-Plus and RoboTwin 2.0-Plus benchmarks. WAMs, which leverage video-data-pretrained world models for future-state prediction prior to action, consistently achieve stronger robustness: LingBot-VA reaches 74.2% on RoboTwin 2.0-Plus and Cosmos-Policy achieves 82.2% on LIBERO-Plus.
Key Contributions
- LIBERO-Plus and RoboTwin 2.0-Plus: augmented benchmarks with systematic visual (lighting, backgrounds, object appearance) and language perturbations for robustness evaluation
- Head-to-head comparison of VLA models vs. WAMs across perturbation conditions at scale
- Identification of specific failure modes of VLAs under distribution shift that WAMs mitigate via world-model pretraining
Significance
Provides the first large-scale empirical evidence that world model pretraining improves generalization robustness for robot policies, lending support for WAM-style architectures over pure VLAs in deployment settings.