Chain of World: World Model Thinking in Latent Motion

Summary

CoWVLA introduces the “Chain of World” paradigm, which unifies world-model temporal reasoning with a disentangled latent motion representation for VLA pretraining. A pretrained video VAE factorizes video segments into structure and motion latents; the model then learns to infer a continuous latent motion chain from the instruction and the initial frame, and to predict the terminal frame as a world-model goal. During co-fine-tuning, these latent dynamics are aligned with discrete action prediction through a unified autoregressive decoder that jointly models sparse keyframes and actions.
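
The pretraining stage described above can be illustrated with a small NumPy sketch. It is not CoWVLA's implementation: `toy_factorize` is a hypothetical stand-in for the pretrained video VAE (temporal mean as "structure", frame differences as "motion"), and the mean-squared-error terms and the `terminal_weight` parameter are assumptions chosen only to show how a motion-chain term and a terminal-frame term might combine into one objective.

```python
import numpy as np


def toy_factorize(video):
    """Hypothetical stand-in for the video VAE (NOT the real model).

    video: array of shape (T, H, W).
    Returns a "structure" latent (temporal mean, shape (H, W)) and a
    "motion" chain (flattened frame differences, shape (T-1, H*W)).
    """
    structure = video.mean(axis=0)
    motion = np.diff(video, axis=0).reshape(video.shape[0] - 1, -1)
    return structure, motion


def pretrain_loss(pred_motion, target_motion,
                  pred_terminal, target_terminal,
                  terminal_weight=1.0):
    """Assumed objective: motion-chain MSE plus weighted terminal-frame MSE."""
    motion_term = np.mean((pred_motion - target_motion) ** 2)
    terminal_term = np.mean((pred_terminal - target_terminal) ** 2)
    return float(motion_term + terminal_weight * terminal_term)


# Toy segment: 4 frames of 2x3 "pixels" with constant frame-to-frame change.
video = np.arange(24, dtype=float).reshape(4, 2, 3)
structure, motion = toy_factorize(video)
terminal_frame = video[-1]

# A perfect prediction gives zero loss; an offset motion chain is penalised.
perfect = pretrain_loss(motion, motion, terminal_frame, terminal_frame)
offset = pretrain_loss(motion + 1.0, motion, terminal_frame, terminal_frame)
```

In the paper's setting the predicted motion chain and terminal frame would come from the VLA conditioned on the instruction and initial frame; here they are plain arrays so the loss structure stays visible.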

Key Contributions

Chain of World paradigm: continuous latent motion chains as intermediate world-model reasoning steps for VLAs
Pretrained video VAE for disentangled structure and motion latent factorization
Pre-training objective: latent motion chain inference + terminal frame prediction from instruction and initial frame
Co-fine-tuning: joint keyframe and action modeling in a unified autoregressive decoder
Accepted at CVPR 2026; outperforms existing world-model and latent-action approaches on robotic simulation benchmarks
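
The co-fine-tuning contribution above, joint keyframe and action modeling in one autoregressive decoder, can be pictured as a single interleaved token sequence. The sketch below is an assumption-laden illustration: the ordering (each keyframe followed by its action chunk), the modality tags, the treatment of keyframes as discrete tokens, and the helper names `interleave` and `next_token_pairs` are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (modality tag, token id); the tagging scheme is assumed


def interleave(keyframe_tokens: List[List[int]],
               action_tokens: List[List[int]]) -> List[Token]:
    """Flatten keyframes and action chunks into one sequence.

    Assumed layout: frame_0, actions_0, frame_1, actions_1, ...
    """
    if len(keyframe_tokens) != len(action_tokens):
        raise ValueError("need one action chunk per keyframe")
    sequence: List[Token] = []
    for frame, actions in zip(keyframe_tokens, action_tokens):
        sequence.extend(("frame", tok) for tok in frame)
        sequence.extend(("action", tok) for tok in actions)
    return sequence


def next_token_pairs(sequence: List[Token]) -> List[Tuple[Token, Token]]:
    """Pair each token with its successor, as in next-token prediction."""
    return list(zip(sequence[:-1], sequence[1:]))


# Two sparse keyframes (2 tokens each), each followed by a 1-token action chunk.
seq = interleave([[1, 2], [3, 4]], [[10], [11]])
pairs = next_token_pairs(seq)
```

Because both modalities share one sequence, a single decoder trained on `pairs` would predict frame tokens and action tokens with the same next-token objective, which is the sense of "unified" sketched here.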

Significance

CoWVLA preserves the rich temporal knowledge of world models while avoiding the high computational cost of pixel-space generation, making world-model-based VLA pretraining practical and efficient.
