Summary

AHA-WAM resolves the tension between long-horizon world modeling and high-frequency action execution with a dual DiT architecture. A low-frequency world DiT maintains a rolling key-value memory over past observations and exposes reusable layerwise latent context; a high-frequency action DiT executes short action chunks by querying this context through layerwise joint attention. Horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR) let the action expert exploit long-horizon world context while staying responsive to the real-time state, without rerunning the world DiT. Without any robot-data pretraining, AHA-WAM achieves 92.80% on RoboTwin and 78.3% on four real-world tasks while running closed-loop at 24.17 Hz, 4.59× faster than Fast-WAM.
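
To make "layerwise joint attention" concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the paper's implementation: the function names (`joint_attention`, `action_forward`) are hypothetical, there are no learned projections, and the cached world context is represented simply as a list of (key, value) pairs per layer. The only idea it illustrates is that, at layer i, action tokens attend jointly over themselves and over the world context cached for layer i.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_attention(queries, action_kv, world_kv):
    # Each action query attends over action tokens AND cached world tokens.
    keys = [k for k, _ in action_kv] + [k for k, _ in world_kv]
    values = [v for _, v in action_kv] + [v for _, v in world_kv]
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def action_forward(action_tokens, world_context_per_layer):
    # Assumption: one cached world context per action layer.
    x = action_tokens
    for world_kv in world_context_per_layer:
        # Toy simplification: tokens act as their own keys and values.
        self_kv = [(t, t) for t in x]
        x = joint_attention(x, self_kv, world_kv)
    return x

# Two layers, each with a single cached (key, value) world token.
world_context = [
    [([1.0, 0.0], [1.0, 0.0])],
    [([0.0, 1.0], [0.0, 1.0])],
]
actions = action_forward([[0.5, 0.5], [0.2, 0.8]], world_context)
```

Because the world context enters only as cached keys and values, the action stack can be rerun on fresh action tokens without recomputing that context, which is the property the summary attributes to the dual DiT design.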

Key Contributions

  • Dual DiT architecture decoupling world planning (low-frequency) from action execution (high-frequency)
  • Observation-Guided Video-Context Routing (OVCR) for responsive real-time use of long-horizon context
  • Horizon-adaptive offset training enabling asynchronous execution with correct temporal alignment
  • 92.80% on RoboTwin, 24.17 Hz closed-loop, 4.59× speedup over Fast-WAM; no robot-data pretraining
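
The decoupling of a low-frequency planner from a high-frequency executor can be sketched as a dual-rate control loop. The code below is illustrative only: `WorldModel`, `ActionExpert`, `run`, and the `world_every` parameter are hypothetical names, the "context" is a toy running mean, and the way the offset is used (trusting stale context less) is a stand-in chosen for this sketch, not the paper's horizon-adaptive offset training or OVCR. What it does show is the control flow: the world model refreshes its rolling memory only every few steps, while the action expert runs every step on the cached context plus the current observation and the step offset since the last refresh.

```python
class WorldModel:
    """Stand-in for the low-frequency world DiT (hypothetical interface)."""

    def __init__(self, memory_size=4):
        self.memory = []              # rolling memory over past observations
        self.memory_size = memory_size
        self.calls = 0

    def refresh(self, observation):
        self.calls += 1
        self.memory = (self.memory + [observation])[-self.memory_size:]
        # Toy context: mean of the remembered observations.
        return sum(self.memory) / len(self.memory)


class ActionExpert:
    """Stand-in for the high-frequency action DiT (hypothetical interface)."""

    def __init__(self):
        self.calls = 0

    def act(self, observation, context, offset):
        self.calls += 1
        # Toy rule (assumption): the staler the context, the more weight
        # goes to the current observation.
        trust = 1.0 / (1 + offset)
        return trust * context + (1 - trust) * observation


def run(observations, world_every=4):
    world, expert = WorldModel(), ActionExpert()
    context, context_step, actions = None, 0, []
    for step, obs in enumerate(observations):
        if step % world_every == 0:   # slow path: refresh cached context
            context, context_step = world.refresh(obs), step
        offset = step - context_step  # steps since the context was computed
        actions.append(expert.act(obs, context, offset))
    return actions, world.calls, expert.calls


actions, world_calls, expert_calls = run([float(i) for i in range(8)])
```

In this run the world model is called twice and the action expert eight times, so the expensive component runs at a quarter of the control rate while every step still sees the latest observation.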

Significance

AHA-WAM is the first WAM to combine state-of-the-art accuracy on a challenging dual-arm manipulation benchmark with real-time control speeds, demonstrating that long-horizon reasoning and reactive execution are not mutually exclusive.