Summary

OneWM-VLA challenges the assumption that world-model-augmented vision-language-action (VLA) models need high visual bandwidth per frame. It shows that compressing each camera view to a single semantic token via Adaptive Attention Pooling is sufficient for strong long-horizon performance. The resulting latent stream and the action trajectory are co-produced under a unified flow-matching objective, eliminating the need for a separate world-model decoder.
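
To make the one-token-per-view idea concrete, here is a minimal sketch of generic single-query attention pooling in NumPy. It is not the paper's Adaptive Attention Pooling, whose details this summary does not give. The function name `attention_pool`, the projections `w_k` and `w_v`, the query vector, and the sizes (196 patches, 64 dimensions) are all illustrative assumptions, and the weights are random rather than learned.

```python
import numpy as np

def attention_pool(patch_tokens, query, w_k, w_v):
    """Pool (num_patches, dim) patch tokens into a single (dim,) token.

    Generic single-query attention pooling; an illustrative assumption,
    not the paper's exact Adaptive Attention Pooling module.
    """
    keys = patch_tokens @ w_k                           # (num_patches, dim)
    values = patch_tokens @ w_v                         # (num_patches, dim)
    scores = keys @ query / np.sqrt(query.shape[0])     # (num_patches,)
    scores = scores - scores.max()                      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                   # softmax over patches
    return weights @ values, weights                    # one token per view

# Hypothetical sizes: one camera view encoded as 196 patch tokens of width 64.
rng = np.random.default_rng(0)
num_patches, dim = 196, 64
patch_tokens = rng.normal(size=(num_patches, dim))
query = rng.normal(size=dim)                            # stands in for a learned query
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

token, weights = attention_pool(patch_tokens, query, w_k, w_v)
print(token.shape)
```

In this sketch the whole view is summarized by `token`, so whatever consumes the latent stream sees one vector per camera view instead of the full patch grid.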

Key Contributions

  • Adaptive Attention Pooling that compresses each frame view to a single semantic token without compromising long-horizon performance
  • Unified flow-matching objective producing latent world-model rollouts and action trajectories jointly
  • Empirical demonstration that per-frame visual bandwidth can be reduced to one token per camera view in world-model-augmented VLA policy learning
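
The joint objective in the second contribution can be sketched as a single flow-matching loss over a concatenated target. The code below assumes the common linear-path (rectified-flow style) formulation, in which a noise sample is interpolated toward the target and the model regresses the constant velocity between them. The paper's exact probability path, loss weighting, and conditioning are not stated in this summary, so `joint_flow_matching_loss`, `velocity_fn`, and all tensor shapes are hypothetical.

```python
import numpy as np

def joint_flow_matching_loss(velocity_fn, latents, actions, rng):
    """One-sample flow-matching loss over latents and actions together.

    Assumes a linear interpolation path x_t = (1 - t) * x0 + t * x1,
    whose target velocity is x1 - x0. This is an illustrative choice,
    not a formulation confirmed by the source.
    """
    # A single target vector holds both the latent rollout and the action chunk.
    x1 = np.concatenate([latents.ravel(), actions.ravel()])
    x0 = rng.normal(size=x1.shape)          # noise sample
    t = rng.uniform()                       # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1           # point on the path
    target_velocity = x1 - x0
    predicted = velocity_fn(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))

# Hypothetical shapes: 4 future frames, 2 camera views, one 8-dim token per view,
# and a 4-step chunk of 7-dim actions.
rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 2, 8))
actions = rng.normal(size=(4, 7))

# Placeholder model that always predicts zero velocity.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = joint_flow_matching_loss(zero_model, latents, actions, rng)
print(loss)
```

Because latents and actions share one target vector and one loss, a single velocity network covers both outputs, which is one way to read the claim that no separate world-model decoder is needed.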

Significance

By cutting the visual input of a VLA's world-model component to a single token per camera view, OneWM-VLA points toward more computationally efficient long-horizon robot planning without sacrificing task performance.