Summary

This paper addresses a critical failure mode of vision-language-action (VLA) models: they assume the scene stays static during task execution and therefore fail when objects move. The authors propose AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA’s feature space, conditioned on per-token velocity and acceleration derived from optical flow, so that the VLA can act on predicted future states.
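The predict-then-act control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `encode`, `forecast`, `act`, and `horizon`, and the single-call forecast interface, are all assumptions made for the example.

```python
from typing import Callable

import numpy as np

Tokens = np.ndarray  # assumed shape: (num_tokens, feature_dim)


def predict_then_act(
    frame: np.ndarray,
    velocity: np.ndarray,
    accel: np.ndarray,
    encode: Callable[[np.ndarray], Tokens],
    forecast: Callable[[Tokens, np.ndarray, np.ndarray, int], Tokens],
    act: Callable[[Tokens], np.ndarray],
    horizon: int,
) -> np.ndarray:
    """Hypothetical wrapper: forecast future tokens, then let the frozen policy act.

    encode   -- stand-in for the frozen VLA's visual encoder (frame -> patch tokens)
    forecast -- stand-in for the small latent world model
    act      -- stand-in for the frozen VLA's action head
    """
    # 1. Encode the current observation into the VLA's feature space.
    tokens_now = encode(frame)
    # 2. Forecast patch tokens `horizon` steps ahead, conditioned on per-token motion.
    tokens_future = forecast(tokens_now, velocity, accel, horizon)
    # 3. The frozen policy acts on the predicted tokens; its weights are untouched.
    return act(tokens_future)
```

Only the `forecast` component would need training in this arrangement; `encode` and `act` are called as-is, which is what keeps the VLA frozen.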

Key Contributions

  • Identification and formalization of the static-scene assumption failure mode in existing VLA models
  • AHEAD: a lightweight wrapper that requires no VLA retraining and uses a latent-space world model for future state prediction
  • Per-token optical-flow conditioning for motion-aware future feature forecasting
  • Reported significant performance improvements on dynamic manipulation tasks without retraining the underlying VLA
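As an illustration of the per-token optical-flow conditioning listed above, one plausible way to obtain per-token velocity and acceleration is to average a dense flow field over each patch and take finite differences across consecutive flow fields. The sketch below assumes this scheme; the function names, the `patch` and `dt` parameters, and the choice of average pooling are hypothetical and may differ from what the paper does. The flow fields themselves are assumed to come from any dense optical-flow estimator.

```python
import numpy as np


def pool_flow_to_tokens(flow: np.ndarray, patch: int) -> np.ndarray:
    """Average a dense flow field (H, W, 2) over non-overlapping patches.

    Returns an array of shape (num_tokens, 2): one mean displacement per patch token.
    """
    h, w, _ = flow.shape
    gh, gw = h // patch, w // patch
    # Crop to a multiple of the patch size, then split into (gh, gw) patch blocks.
    blocks = flow[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, 2)
    return blocks.mean(axis=(1, 3)).reshape(gh * gw, 2)


def token_motion(flow_prev: np.ndarray, flow_curr: np.ndarray, patch: int, dt: float = 1.0):
    """Per-token velocity and acceleration from two consecutive flow fields.

    flow_prev -- flow between frames t-2 and t-1
    flow_curr -- flow between frames t-1 and t
    dt        -- assumed time between frames
    """
    v_prev = pool_flow_to_tokens(flow_prev, patch) / dt
    v_curr = pool_flow_to_tokens(flow_curr, patch) / dt
    # Backward finite difference of velocity.
    accel = (v_curr - v_prev) / dt
    return v_curr, accel
```

Averaging within a patch aligns the motion signal with the patch-token grid, so each token gets one velocity vector and one acceleration vector.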

Significance

AHEAD opens a practical path for deploying frozen VLA models in dynamic environments where objects move, a fundamental real-world requirement that has been largely overlooked.