H-WM: Robotic Task and Motion Planning Guided by Hierarchical World Model

Summary

H-WM proposes a two-level world model that couples symbolic task planning with visual dynamics prediction, addressing the limitations of single-modality world models in long-horizon manipulation. A high-level logical world model predicts state transitions in symbolic space, which gives structured task decomposition and robustness over long horizons. A low-level visual world model then grounds these logical states in visual observations, and those observations guide vision-language-action (VLA) control policies during execution.
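The two-level flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the logical world model is replaced by hand-written STRIPS-style rules (H-WM learns its transitions), the visual world model is replaced by a function that returns a text placeholder instead of a predicted image, and the names `Action`, `logical_step`, `visual_subgoal`, and `rollout` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

# A logical state is modelled as the set of predicates that currently hold.
State = FrozenSet[str]


@dataclass(frozen=True)
class Action:
    """Hypothetical symbolic action with STRIPS-style effects (an assumption)."""
    name: str
    preconditions: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def logical_step(state: State, action: Action) -> State:
    """Stand-in for the high-level logical world model: predict the next symbolic state."""
    if not action.preconditions <= state:
        raise ValueError(f"preconditions of {action.name} do not hold")
    return (state - action.delete) | action.add


def visual_subgoal(state: State) -> str:
    """Stand-in for the low-level visual world model.

    A real model would predict an image of the scene; this returns a text tag.
    """
    return "image<" + ", ".join(sorted(state)) + ">"


def rollout(initial: State, plan: List[Action],
            policy: Callable[[str], None]) -> List[Tuple[str, str]]:
    """Alternate logical prediction and visual grounding, one subgoal per step."""
    state = initial
    trace = []
    for action in plan:
        state = logical_step(state, action)   # predict the next logical state
        goal_image = visual_subgoal(state)    # ground it as a visual subgoal
        policy(goal_image)                    # hand the subgoal to the control policy
        trace.append((action.name, goal_image))
    return trace


# Toy pick-and-place task with two symbolic steps.
initial = frozenset({"hand_empty", "on_table(block)"})
pick = Action(
    name="pick(block)",
    preconditions=frozenset({"hand_empty", "on_table(block)"}),
    add=frozenset({"holding(block)"}),
    delete=frozenset({"hand_empty", "on_table(block)"}),
)
place = Action(
    name="place(block, plate)",
    preconditions=frozenset({"holding(block)"}),
    add=frozenset({"on(block, plate)", "hand_empty"}),
    delete=frozenset({"holding(block)"}),
)

executed: List[str] = []  # records what the placeholder policy was asked to reach
trace = rollout(initial, [pick, place], executed.append)
for step_name, goal in trace:
    print(step_name, "->", goal)
```

The point of the sketch is the interface between the levels: each symbolic step yields one intermediate visual subgoal, so the policy is conditioned on a short-horizon target rather than on the full task.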

Key Contributions

Hierarchical World Model (H-WM): joint prediction of logical and visual state transitions across two levels
High-level logical world model for symbolic long-horizon planning and robust task decomposition
Low-level visual world model for grounding symbolic states in visual observations and guiding VLA policies
Hierarchical intermediate outputs that mitigate error accumulation across extended task sequences
Demonstrated generality across multiple VLA control policy architectures

Significance

H-WM bridges the gap between classical task and motion planning (TAMP) and learned world models, enabling robust long-horizon robot control without requiring explicit symbolic state definitions at deployment time.
