Summary
HEX is a state-centric vision-language-action (VLA) framework for coordinated whole-body manipulation on full-sized bipedal humanoid robots. It introduces a humanoid-aligned universal state representation that supports scalable learning across heterogeneous embodiments, and a Mixture-of-Experts Unified Proprioceptive Predictor that models whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. Lightweight history tokens summarize past observations, providing temporal context without re-encoding earlier images.
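One way to picture a humanoid-aligned universal state is a fixed layout of body-part slots that every robot maps into, with a mask marking which entries are actually present. The sketch below is only an illustration under stated assumptions: the slot names, slot sizes, and the `to_universal_state` helper are hypothetical and are not taken from HEX.

```python
# Hypothetical canonical layout: slot names and sizes are illustrative
# assumptions, not the layout used by HEX.
CANONICAL_SLOTS = {
    "left_arm": 7,
    "right_arm": 7,
    "waist": 3,
    "left_leg": 6,
    "right_leg": 6,
}


def to_universal_state(robot_state):
    """Map a per-embodiment state dict onto a fixed-length vector and mask.

    Body parts the robot lacks (or joints it lacks within a part) are
    zero-padded and marked 0 in the mask, so robots with different bodies
    share one input shape.
    """
    vector, mask = [], []
    for slot, size in CANONICAL_SLOTS.items():
        values = list(robot_state.get(slot, []))
        if len(values) > size:
            raise ValueError(f"{slot} has {len(values)} values, slot holds {size}")
        pad = size - len(values)
        vector.extend(values + [0.0] * pad)
        mask.extend([1] * len(values) + [0] * pad)
    return vector, mask


# A fixed-base dual-arm robot with 6-DoF arms and no waist or legs.
dual_arm = {"left_arm": [0.1] * 6, "right_arm": [0.2] * 6}
vec, mask = to_universal_state(dual_arm)
print(len(vec), sum(mask))  # 29 12
```

In this sketch a dual-arm robot and a full humanoid produce vectors of the same length, and a downstream model could use the mask to ignore padded entries.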
Key Contributions
- Humanoid-aligned universal state representation enabling cross-embodiment scalability
- Mixture-of-Experts Unified Proprioceptive Predictor for whole-body coordination modeling
- Lightweight history-token mechanism for efficient temporal context during inference
- State-of-the-art performance on real-world humanoid manipulation tasks, especially in fast-reaction and long-horizon scenarios
Significance
HEX addresses a gap in VLA research: most existing approaches treat robot body parts independently, whereas HEX enables coordinated whole-body humanoid control, a key requirement for practical humanoid deployment.