Being-H0.7: A Latent World-Action Model from Egocentric Videos

Summary

Being-H0.7 proposes that world modeling for robotics should operate in a compact action-oriented latent space rather than pixel space, avoiding the cost of image-then-act pipelines. It inserts learnable latent queries between perception and action tokens as an explicit reasoning interface, trained via a future-informed dual-branch design where a posterior branch supervises the latent space during training and a lightweight prior branch is used at deployment.

Key Contributions

Latent query interface between multimodal context and action tokens as a compact world-model reasoning slot
Dual-branch training: posterior branch (future-informed) supervises the prior branch (current context only) via hidden-state alignment
Lightweight regularization preventing latent collapse during large-scale egocentric video pretraining
Zero-shot generalization to diverse robot tasks after pretraining on large-scale egocentric human video

Significance

Reframes embodied world modeling away from pixel-level video prediction toward action-oriented latent states, offering a scalable and inference-efficient alternative that directly benefits downstream policy learning.

Embodied Robotics Research

Explorer

Being-H0.7: A Latent World-Action Model from Egocentric Videos

Summary

Key Contributions

Significance

Links

Graph View

Table of Contents

Backlinks