Summary
HALO introduces Embodied Multimodal Chain-of-Thought (EM-CoT) reasoning into vision-language-action (VLA) models via a three-stage pipeline: textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. It targets the failures of existing VLAs in long-horizon and out-of-distribution scenarios, which it attributes to their lack of explicit reasoning and world-state anticipation.
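The ordering of the three stages can be illustrated with a minimal Python sketch. It is not HALO's implementation: the names `EMCoTTrace`, `em_cot_step`, and the three toy stage functions are hypothetical, and plain lists of floats stand in for images and action vectors. It only shows how each stage consumes the outputs of the stages before it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EMCoTTrace:
    """Intermediate outputs of one EM-CoT pass (hypothetical container)."""
    text_reasoning: str
    visual_subgoal: List[float]  # stand-in for a predicted subgoal image
    action: List[float]          # stand-in for a low-level action vector


def em_cot_step(
    instruction: str,
    observation: List[float],
    reason: Callable[[str, List[float]], str],
    predict_subgoal: Callable[[str, List[float]], List[float]],
    predict_action: Callable[[str, List[float], List[float]], List[float]],
) -> EMCoTTrace:
    # Stage 1: textual task reasoning from the instruction and current view.
    text = reason(instruction, observation)
    # Stage 2: visual subgoal prediction, conditioned on the text reasoning.
    subgoal = predict_subgoal(text, observation)
    # Stage 3: action prediction, conditioned on both earlier stages.
    action = predict_action(text, observation, subgoal)
    return EMCoTTrace(text, subgoal, action)


# Toy stand-ins so the sketch runs; a trained model would replace these.
def toy_reason(instruction: str, obs: List[float]) -> str:
    return f"plan: {instruction}"


def toy_subgoal(text: str, obs: List[float]) -> List[float]:
    return [x + 1.0 for x in obs]


def toy_action(text: str, obs: List[float], subgoal: List[float]) -> List[float]:
    return [g - o for g, o in zip(subgoal, obs)]


trace = em_cot_step("pick up the cup", [0.0, 0.5],
                    toy_reason, toy_subgoal, toy_action)
```

In this sketch the stages are separate callables only for readability; the summary describes all three as being performed within one model.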
Key Contributions
- Unified EM-CoT framework that sequentially performs text reasoning → visual subgoal generation → action prediction in one model
- Mixture-of-Transformers (MoT) architecture with specialized experts for semantic reasoning, visual foresight, and action prediction, connected via shared self-attention
- Experimental results showing significant improvements on long-horizon tasks over VLAs that lack explicit reasoning mechanisms
- Decoupled expert design retains scalability while enabling rich cross-modal interaction
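The idea of specialized experts joined by shared self-attention can be sketched in a few lines of pure Python. This is a toy illustration under strong assumptions, not HALO's architecture: attention uses identity query/key/value projections, each expert's feed-forward block is reduced to a single scalar (`expert_scale`, a hypothetical parameter), and the modality names are placeholders. It shows only that tokens of every modality attend to one another, while the post-attention transform is chosen per modality.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shared_self_attention(tokens: List[Vector]) -> List[Vector]:
    """Each token attends to all tokens, regardless of modality.

    Assumption: identity Q/K/V projections, single head.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


def mot_layer(tokens: List[Vector], modalities: List[str],
              expert_scale: Dict[str, float]) -> List[Vector]:
    """One toy layer: shared attention, then a per-modality expert.

    Assumption: a scalar per expert stands in for a full feed-forward block.
    """
    attended = shared_self_attention(tokens)
    return [[t + expert_scale[m] * a for t, a in zip(tok, att)]
            for tok, att, m in zip(tokens, attended, modalities)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = ["text", "image", "action"]
expert_scale = {"text": 1.0, "image": 0.5, "action": 0.0}
out = mot_layer(tokens, modalities, expert_scale)

# Changing only the image token also changes the text token's output,
# because attention is shared across modalities.
out_changed = mot_layer([[1.0, 0.0], [0.0, 5.0], [1.0, 1.0]],
                        modalities, expert_scale)
```

Keeping the experts separate outside of attention is what lets each one be sized or trained independently, while the shared attention step is the only place where modalities exchange information in this sketch.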
Significance
HALO provides a principled, human-like reasoning framework for VLAs that unifies textual and visual chain-of-thought, filling a key gap in the field’s ability to handle complex sequential manipulation tasks.