Summary

AtomVLA is presented as the first subtask-aware vision-language-action (VLA) framework paired with a scalable offline post-training pipeline. It targets the “instruction grounding gap” in VLA models: the absence of explicit intermediate guidance, which leads to compounding errors in long-horizon tasks. To close this gap, it decomposes tasks into atomic subtasks and uses predictive latent world models to guide post-training.
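
To see why compounding errors matter in long-horizon tasks, consider a simplified model that is an assumption for illustration and not taken from the paper: a task consists of n sequential steps, each succeeding independently with probability p, so the whole task succeeds with probability p^n. The function name task_success_rate is hypothetical.

```python
def task_success_rate(per_step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


if __name__ == "__main__":
    # Show how a fixed per-step reliability decays as the horizon grows.
    for n in (1, 5, 10, 20):
        print(n, round(task_success_rate(0.95, n), 3))
```

Under this toy model, a policy that is 95% reliable per step completes a 10-step task only about 60% of the time, which is the kind of decay that intermediate subtask guidance is meant to counter.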

Key Contributions

  • Subtask decomposition approach that bridges the instruction grounding gap between high-level language commands and low-level actions
  • Scalable offline post-training pipeline that leverages predictive latent world models to generate intermediate supervision
  • Reduced compounding errors in long-horizon, multi-step manipulation tasks
  • Demonstrated improvements on standard benchmarks without requiring online environment interaction during post-training
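
One way to picture how a latent world model could supply intermediate supervision offline is sketched below. Everything in it is a toy assumption rather than the paper's implementation: latent states are short lists of floats, predict_next_latent stands in for a learned world model using simple additive dynamics, and subtask_reward and best_action are hypothetical names. The sketch scores logged candidate actions by how close the predicted next latent lands to a subtask's goal latent, with no environment interaction.

```python
import math
from typing import List, Sequence

Latent = List[float]


def predict_next_latent(latent: Sequence[float], action: Sequence[float]) -> Latent:
    """Toy stand-in for a learned latent world model (assumed additive dynamics)."""
    return [z + a for z, a in zip(latent, action)]


def subtask_reward(predicted: Sequence[float], subgoal: Sequence[float]) -> float:
    """Negative Euclidean distance to the subtask's goal latent (higher is better)."""
    return -math.dist(predicted, subgoal)


def best_action(
    latent: Sequence[float],
    candidates: Sequence[Sequence[float]],
    subgoal: Sequence[float],
) -> Sequence[float]:
    """Pick the logged candidate whose predicted outcome is closest to the subgoal."""
    return max(
        candidates,
        key=lambda a: subtask_reward(predict_next_latent(latent, a), subgoal),
    )


if __name__ == "__main__":
    current = [0.0, 0.0]
    subgoal = [1.0, 0.0]  # hypothetical goal latent for one atomic subtask
    candidates = [[0.5, 0.5], [0.9, 0.1], [-1.0, 0.0]]  # hypothetical logged actions
    print(best_action(current, candidates, subgoal))
```

Because the score comes from the world model's prediction rather than from executing the action, a signal of this kind can be computed entirely from logged data.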

Significance

The work shows that scalable offline post-training with structured subtask supervision can substantially improve VLA performance on complex tasks, without the cost and complexity of online reinforcement learning (RL).