Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

Summary

This paper addresses a critical failure mode of vision-language-action (VLA) models: they assume the scene stays static during task execution, so they fail when objects move. The authors propose AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. The world model is small and trained on manipulation video. It forecasts future patch tokens in the VLA’s feature space, conditioned on per-token velocity and acceleration derived from optical flow, so the VLA acts on the predicted future state rather than the current one.
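The summary does not say how the per-token velocity and acceleration are computed from optical flow. As a minimal sketch, assume two consecutive dense flow fields are average-pooled over non-overlapping patches (one patch per token) and differenced in time. The function names, the patch size, and the time step `dt` below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pool_flow_to_tokens(flow, patch):
    """Average a dense flow field of shape (H, W, 2) over patch x patch cells.

    Returns an array of shape (num_tokens, 2), one mean flow vector per patch.
    Rows and columns that do not fill a whole patch are cropped.
    """
    h, w, _ = flow.shape
    gh, gw = h // patch, w // patch
    cropped = flow[: gh * patch, : gw * patch]
    cells = cropped.reshape(gh, patch, gw, patch, 2)
    return cells.mean(axis=(1, 3)).reshape(gh * gw, 2)

def per_token_motion(flow_prev, flow_curr, patch=16, dt=1.0):
    """Estimate per-token velocity and acceleration by finite differences.

    Assumption: flow is in pixels per frame and dt is the frame interval,
    so velocity = flow / dt and acceleration = change in velocity / dt.
    """
    v_prev = pool_flow_to_tokens(flow_prev, patch) / dt
    v_curr = pool_flow_to_tokens(flow_curr, patch) / dt
    accel = (v_curr - v_prev) / dt
    return v_curr, accel

# Usage: a 32x32 image with 16x16 patches gives 4 tokens.
flow_prev = np.ones((32, 32, 2))
flow_curr = 3.0 * np.ones((32, 32, 2))
velocity, accel = per_token_motion(flow_prev, flow_curr, patch=16, dt=1.0)
```

Average pooling is only one way to align flow with patch tokens; a learned or confidence-weighted pooling would slot into the same place.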

Key Contributions

Identification and formalization of the static-scene assumption as a failure mode of existing VLA models
AHEAD: a lightweight wrapper that predicts future states with a latent-space world model and requires no training of the VLA itself
Per-token optical-flow conditioning for motion-aware forecasting of future features
Significant performance improvements on dynamic manipulation tasks without retraining the underlying VLA
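The predict-then-act data flow described above can be sketched as follows. This is a hypothetical illustration of the wrapper pattern only: `predict_then_act`, `toy_world_model`, and `toy_policy` are invented stand-ins, and the real world model and VLA interfaces are not described in this summary.

```python
import numpy as np

def predict_then_act(tokens, velocity, accel, world_model, frozen_policy, horizon):
    """Forecast future tokens, then let the unchanged policy act on them.

    tokens:   (N, D) current patch tokens in the policy's feature space
    velocity: (N, 2) per-token velocity
    accel:    (N, 2) per-token acceleration
    """
    motion = np.concatenate([velocity, accel], axis=-1)   # (N, 4) conditioning
    future_tokens = world_model(tokens, motion, horizon)  # (N, D) forecast
    return frozen_policy(future_tokens)                   # policy is not modified

# Placeholder components, for illustration only.
def toy_world_model(tokens, motion, horizon):
    # Shifts each token by horizon * speed; a real model would be learned.
    speed = np.linalg.norm(motion[:, :2], axis=-1, keepdims=True)
    return tokens + horizon * speed

def toy_policy(tokens):
    # Stands in for a frozen VLA head: maps tokens to an action vector.
    return tokens.mean(axis=0)

tokens = np.zeros((3, 2))
velocity = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]])
accel = np.zeros((3, 2))
action = predict_then_act(tokens, velocity, accel, toy_world_model, toy_policy, horizon=2)
```

The point of the pattern is that only the tokens fed to the policy change; the policy's weights and call signature stay as they are.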

Significance

AHEAD opens a practical path for using frozen VLA models in dynamic environments where objects move, a fundamental requirement for real-world deployment that has been largely overlooked.
