One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Summary

OneWM-VLA challenges the assumption that world-model-augmented vision-language-action (VLA) policies need high visual bandwidth per frame. It shows that compressing each camera view to a single semantic token via Adaptive Attention Pooling is sufficient for strong long-horizon performance. The resulting latent stream and the action trajectory are co-produced under a unified flow-matching objective, eliminating the need for a separate world-model decoder.
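To make the one-token compression concrete, the sketch below pools a grid of patch tokens into a single vector using one learned query. It is a generic single-query attention pooling, not the paper's Adaptive Attention Pooling, whose exact design is not given in this summary. The function name `attention_pool`, the projection matrices `w_k` and `w_v`, and the sizes (196 patches, 64 dimensions) are all assumptions chosen for illustration.

```python
import numpy as np

def attention_pool(patch_tokens, query, w_k, w_v):
    """Pool (N, D) patch tokens into one token with a single learned query.

    Hypothetical sketch: the query scores every patch, a softmax turns the
    scores into weights, and the output is the weighted sum of the values.
    """
    keys = patch_tokens @ w_k                         # (N, D)
    values = patch_tokens @ w_v                       # (N, D)
    scores = keys @ query / np.sqrt(query.shape[0])   # (N,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                 # softmax over patches
    return weights @ values, weights                  # (D,), (N,)

# Toy input: one camera view encoded as 196 patch tokens of width 64
# (both sizes are assumptions, not values from the paper).
rng = np.random.default_rng(0)
N, D = 196, 64
patch_tokens = rng.normal(size=(N, D))
query = rng.normal(size=D)                  # stands in for a learned query
w_k = rng.normal(size=(D, D)) / np.sqrt(D)  # stands in for learned projections
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

frame_token, weights = attention_pool(patch_tokens, query, w_k, w_v)
print(frame_token.shape, weights.shape)
```

In this sketch the downstream model would see `frame_token` only, so the per-view cost in the world-model stream no longer depends on the number of patches.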

Key Contributions

Adaptive Attention Pooling, which compresses each camera view to a single semantic token without compromising long-horizon performance
A unified flow-matching objective that jointly produces latent world-model rollouts and action trajectories
An empirical demonstration that per-frame visual bandwidth can be reduced to one token in world-model-augmented VLA policy learning
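The joint objective can be illustrated with a minimal flow-matching loss over a single vector that concatenates future latent tokens and the action chunk. This is a sketch under assumptions, not the paper's formulation: it uses a straight-line (rectified-flow style) interpolation path, equal weighting of latent and action terms, and a toy linear velocity predictor. The names `joint_flow_matching_loss` and `toy_velocity`, and all tensor sizes, are hypothetical.

```python
import numpy as np

def joint_flow_matching_loss(velocity_fn, latents, actions, rng):
    """Flow-matching loss over concatenated latent tokens and actions.

    latents: (B, T, D) one token per future frame (assumed layout)
    actions: (B, H, A) action chunk (assumed layout)
    """
    B = latents.shape[0]
    # Treat latents and actions as one target sample x1.
    x1 = np.concatenate([latents.reshape(B, -1), actions.reshape(B, -1)], axis=1)
    x0 = rng.normal(size=x1.shape)        # noise sample
    t = rng.uniform(size=(B, 1))          # one time per batch element
    x_t = (1.0 - t) * x0 + t * x1         # straight-line interpolation
    target = x1 - x0                      # velocity of that path
    pred = velocity_fn(x_t, t)
    return np.mean((pred - target) ** 2)

# Toy data: 4 samples, 8 future frames with one 16-d token each,
# and a 10-step chunk of 7-d actions (all sizes are assumptions).
rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 16))
actions = rng.normal(size=(4, 10, 7))
dim = 8 * 16 + 10 * 7
w = rng.normal(size=(dim + 1, dim)) * 0.01   # untrained toy parameters

def toy_velocity(x_t, t):
    # Linear stand-in for the policy network; input is [x_t, t].
    return np.concatenate([x_t, t], axis=1) @ w

loss = joint_flow_matching_loss(toy_velocity, latents, actions, rng)
print(float(loss))
```

Because a single predictor outputs the velocity for both parts of the vector, the latent rollout and the action trajectory come from the same sampling process, which is the sense in which no separate world-model decoder is needed in this sketch.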

Significance

By drastically reducing the visual bandwidth consumed by the world-model component of a VLA, OneWM-VLA opens the door to more computationally efficient long-horizon robot planning without sacrificing task performance.
