Summary
Anticipation-VLA tackles the compounding-error problem in long-horizon robotic tasks by introducing an Anticipation Model that adaptively and recursively generates future visual subgoals as intermediate planning targets. The hierarchical system pairs a fine-tuned Unified Multimodal Model for high-level subgoal generation with a goal-conditioned VLA policy for low-level action execution, continuously adapting subgoals as the task unfolds.
Key Contributions
- Anticipation Model that recursively generates adaptive subgoal images, recalibrating predictions in response to evolving scene dynamics
- Hierarchical VLA architecture decoupling high-level visual planning from low-level motor control
- Demonstrated effectiveness in both simulated and real-world robotic manipulation benchmarks
Significance
Addresses the fundamental long-horizon limitation of flat VLA architectures by adding structured visual foresight, showing that adaptive subgoal generation is essential for reliable long-horizon policy execution.