Summary
Being-H0.7 proposes that world modeling for robotics should operate in a compact, action-oriented latent space rather than in pixel space, avoiding the cost of image-then-act pipelines. It inserts learnable latent queries between the perception and action tokens as an explicit reasoning interface. The queries are trained with a future-informed dual-branch design: a posterior branch supervises the latent space during training, and a lightweight prior branch is used at deployment.
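To make the token ordering concrete, here is a minimal sketch of a sequence in which latent queries sit between the perception (context) tokens and the action tokens. It assumes a standard causal attention mask, which the summary does not confirm; the helper names `build_sequence` and `causal_mask` and the placeholder token strings are hypothetical and exist only for illustration.

```python
def build_sequence(context, latent_queries, actions):
    """Concatenate segments in the order context -> latent queries -> actions.

    Returns the flat token list and the index range of each segment.
    (Illustrative layout only; not the paper's actual tokenizer.)
    """
    tokens = list(context) + list(latent_queries) + list(actions)
    n_c, n_q = len(context), len(latent_queries)
    spans = {
        "context": range(0, n_c),
        "latent": range(n_c, n_c + n_q),
        "action": range(n_c + n_q, len(tokens)),
    }
    return tokens, spans


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i).

    Assumption: plain causal attention over the whole sequence.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


# Placeholder tokens: three perception tokens, two latent queries, three actions.
tokens, spans = build_sequence(
    ["img0", "img1", "text0"], ["q0", "q1"], ["a0", "a1", "a2"]
)
mask = causal_mask(len(tokens))

# Under this mask every action token can read every latent query,
# while no latent query can read an action token.
actions_see_latents = all(mask[i][j] for i in spans["action"] for j in spans["latent"])
latents_see_actions = any(mask[i][j] for i in spans["latent"] for j in spans["action"])
```

Under that assumption, the latent queries act as a bottleneck that the action tokens read from, which is one way to interpret the "explicit reasoning interface" described above.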
Key Contributions
- Latent query interface between multimodal context and action tokens as a compact world-model reasoning slot
- Dual-branch training: posterior branch (future-informed) supervises the prior branch (current context only) via hidden-state alignment
- Lightweight regularization preventing latent collapse during large-scale egocentric video pretraining
- Zero-shot generalization to diverse robot tasks after pretraining on large-scale egocentric human video
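The dual-branch objective and the anti-collapse term can be sketched as follows. This is an illustration under stated assumptions, not the paper's loss: the summary only says "hidden-state alignment" and "lightweight regularization", so the mean-squared-error alignment, the variance hinge (borrowed from VICReg-style regularizers), and the names `mse`, `variance_penalty`, `training_loss`, and `reg_weight` are all hypothetical choices. Hidden states are plain nested lists of shape batch x dim.

```python
import math


def mse(a, b):
    """Mean squared error between two batch x dim nested lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def variance_penalty(latents, target_std=1.0, eps=1e-8):
    """Hinge on the per-dimension standard deviation across the batch.

    Dimensions whose std falls below target_std are penalised, which
    discourages all latents from collapsing to one point.
    (Assumed regularizer; the source does not specify its form.)
    """
    n, dim = len(latents), len(latents[0])
    penalty = 0.0
    for d in range(dim):
        column = [row[d] for row in latents]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        penalty += max(0.0, target_std - math.sqrt(var + eps))
    return penalty / dim


def training_loss(prior_h, posterior_h, reg_weight=0.1):
    """Align prior hidden states to posterior ones, plus the collapse penalty.

    The posterior (future-informed) states are treated as a fixed target;
    in an autodiff framework this would correspond to a stop-gradient.
    """
    return mse(prior_h, posterior_h) + reg_weight * variance_penalty(prior_h)
```

In this sketch only the prior branch would be needed at deployment, since the posterior states appear solely as a training target.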
Significance
Being-H0.7 reframes embodied world modeling away from pixel-level video prediction and toward action-oriented latent states, offering a scalable, inference-efficient alternative that directly benefits downstream policy learning.