Summary

Being-H0.7 proposes that world modeling for robotics should operate in a compact, action-oriented latent space rather than in pixel space, avoiding the cost of image-then-act pipelines. It inserts learnable latent queries between the perception and action tokens as an explicit reasoning interface. Training uses a future-informed dual-branch design: a posterior branch, which sees future information, supervises the latent space, while a lightweight prior branch, which sees only the current context, is the one used at deployment.
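To make the layout concrete, here is a minimal sketch of how the two branches might arrange their token sequences. It uses string labels in place of embeddings. The helper `build_sequence`, the `BranchInput` container, the token names, and in particular the placement of the future tokens directly before the latent queries are all assumptions for illustration; the summary does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BranchInput:
    tokens: List[str]      # token labels only; a real model would hold embeddings
    latent_slice: slice    # position of the latent queries in the sequence


def build_sequence(
    context: List[str],
    num_latents: int,
    actions: List[str],
    future: Optional[List[str]] = None,
) -> BranchInput:
    """Lay out [context | (future) | latent queries | actions].

    Passing `future` gives the posterior-style (training-only) input;
    omitting it gives the prior-style (deployment) input.
    """
    prefix = list(context) + (list(future) if future is not None else [])
    latents = [f"<latent_{i}>" for i in range(num_latents)]
    start = len(prefix)
    return BranchInput(
        tokens=prefix + latents + list(actions),
        latent_slice=slice(start, start + num_latents),
    )


# Hypothetical token labels, for illustration only.
context = ["img_0", "img_1", "instr"]
actions = ["act_0", "act_1"]

prior = build_sequence(context, 4, actions)
posterior = build_sequence(context, 4, actions, future=["fut_0", "fut_1"])
```

In this sketch the latent queries occupy the same number of slots in both branches, which is what would allow their hidden states to be compared position by position.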

Key Contributions

  • Latent query interface between multimodal context and action tokens as a compact world-model reasoning slot
  • Dual-branch training: posterior branch (future-informed) supervises the prior branch (current context only) via hidden-state alignment
  • Lightweight regularization preventing latent collapse during large-scale egocentric video pretraining
  • Zero-shot generalization to diverse robot tasks after pretraining on large-scale egocentric human video
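The alignment and anti-collapse ideas in the list above can be sketched as a toy loss in plain Python. This is an illustration under stated assumptions, not the paper's objective: mean squared error stands in for the hidden-state alignment metric, a variance hinge (in the style of VICReg) stands in for the unspecified "lightweight regularization", and `reg_weight`, `target_std`, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def alignment_loss(prior_h: List[Vector], posterior_h: List[Vector]) -> float:
    """Mean squared error between prior and posterior latent hidden states.

    Assumption: the posterior states act as a fixed target (in a real
    framework they would be detached from the gradient graph).
    """
    total, count = 0.0, 0
    for p, q in zip(prior_h, posterior_h):
        for a, b in zip(p, q):
            total += (a - b) ** 2
            count += 1
    return total / count


def variance_penalty(latents: List[Vector], target_std: float = 1.0,
                     eps: float = 1e-4) -> float:
    """Hinge on the per-dimension standard deviation across latent vectors.

    The penalty is positive when a dimension's spread falls below
    `target_std`, which discourages all latents collapsing to one point.
    """
    n, dim = len(latents), len(latents[0])
    penalty = 0.0
    for d in range(dim):
        mean = sum(v[d] for v in latents) / n
        var = sum((v[d] - mean) ** 2 for v in latents) / n
        penalty += max(0.0, target_std - math.sqrt(var + eps))
    return penalty / dim


def total_loss(prior_h: List[Vector], posterior_h: List[Vector],
               reg_weight: float = 0.1) -> float:
    """Alignment term plus a weighted anti-collapse term on the prior latents."""
    return alignment_loss(prior_h, posterior_h) + reg_weight * variance_penalty(prior_h)
```

Without a term like the variance penalty, the alignment loss alone can be driven to zero by both branches emitting the same constant vector, which is the collapse the regularization is meant to prevent.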

Significance

Being-H0.7 reframes embodied world modeling away from pixel-level video prediction and toward action-oriented latent states, offering a scalable, inference-efficient alternative that directly benefits downstream policy learning.