Summary
Existing WAMs achieve strong generalization through photorealistic future prediction but incur high inference latency, making real-time robot deployment impractical. Efficient-WAM treats future video prediction as a compact guidance signal rather than a visual fidelity target. It uses a video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. The 1B-parameter model reduces per-chunk latency to ~100 ms, a 30× speedup over existing WAMs, while maintaining competitive manipulation accuracy. The key insight is that visibly coarse futures still provide sufficient guidance for strong action generation.
Key Contributions
- Compact video expert via knowledge transfer from WAN-2.2-5B to a 1B model
- Token-sparse video latents reducing computation without hurting action guidance
- Asymmetric video-action denoising: fewer steps for video, more for actions
- 30× wall-clock speedup (~100 ms per chunk) with competitive task accuracy
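The second and third contributions can be illustrated with a toy sketch. It is not the paper's implementation: the strided token selection, the Euler sampler, the mean-pooled guidance, and every name (`sparsify_tokens`, `euler_denoise`, `asymmetric_sample`, `keep_ratio`) are assumptions chosen for illustration. Each latent token is reduced to a single float so the example stays dependency-free. It only shows where the savings could come from: fewer video tokens, and fewer denoiser calls on the video branch than on the action branch.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def sparsify_tokens(tokens: Sequence[float], keep_ratio: float) -> Vector:
    """Keep an evenly strided subset of video latent tokens (assumed scheme)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, round(len(tokens) * keep_ratio))
    if num_keep == 1:
        return [tokens[0]]
    stride = (len(tokens) - 1) / (num_keep - 1)
    return [tokens[round(i * stride)] for i in range(num_keep)]


def euler_denoise(
    x: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) over t in [0, 1) with Euler steps."""
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity(x, step * dt)  # one denoiser call per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def asymmetric_sample(
    video_noise: Vector,
    action_noise: Vector,
    video_velocity: Callable[[Vector, float], Vector],
    action_velocity: Callable[[Vector, float, float], Vector],
    video_steps: int = 2,
    action_steps: int = 10,
) -> Tuple[Vector, Vector]:
    """Spend few steps on video, then more steps on actions guided by it."""
    if video_steps >= action_steps:
        raise ValueError("asymmetric sampling expects video_steps < action_steps")
    # Coarse future: stop early, since fidelity is not the goal.
    coarse_video = euler_denoise(video_noise, video_velocity, video_steps)
    # Assumed guidance signal: mean of the coarse video tokens.
    guidance = sum(coarse_video) / len(coarse_video)
    actions = euler_denoise(
        action_noise,
        lambda a, t: action_velocity(a, t, guidance),
        action_steps,
    )
    return coarse_video, actions
```

With 16 toy tokens and `keep_ratio=0.25`, the video branch processes 4 tokens for 2 steps while the action branch runs 10 steps, so the video denoiser is called 2 times instead of 10. The real step counts and sparsity ratio are not given in this summary.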
Significance
Efficient-WAM provides the first empirical evidence that photorealistic video fidelity is not required for WAM benefits: coarse imagination is sufficient. This substantially lowers the computational barrier to deploying WAMs on real hardware.