Summary

H-WM proposes a two-level world model that integrates symbolic task planning with visual dynamics to address the limitations of single-modal world models in long-horizon manipulation. A high-level logical world model predicts state transitions in logical/symbolic space, providing structured task decomposition and long-horizon robustness. A low-level visual world model then grounds these logical states in visual observations, enabling precise execution via VLA control policies.

Key Contributions

  • Hierarchical World Model (H-WM): joint prediction of logical and visual state transitions across two levels
  • High-level logical world model for symbolic long-horizon planning and robust task decomposition
  • Low-level visual world model for grounding symbolic states in visual observations and guiding VLA policies
  • Hierarchical intermediate outputs that mitigate error accumulation across extended task sequences
  • Demonstrated generality across multiple VLA control policy architectures

Significance

H-WM bridges the gap between classical TAMP and learned world models, enabling robust long-horizon robot control without requiring explicit symbolic state definitions at deployment time.