Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Summary

Mask World Model (MWM) replaces RGB pixel prediction with semantic mask prediction in a video diffusion architecture, imposing a geometric information bottleneck that forces the model to capture essential physical dynamics and contact relations while discarding irrelevant visual distractions such as dynamic backgrounds and illumination changes. The mask-prediction world model is integrated with a diffusion-based policy head for end-to-end control.

Key Contributions

Semantic mask prediction as the world model target instead of RGB pixels, enforcing a geometry-focused information bottleneck
Integration of mask-based world model with a diffusion policy head for end-to-end robot control
Demonstrated robustness gains over standard pixel-prediction world models under visual distractor conditions

Significance

Offers a lightweight yet principled approach to making robot world models robust to visual noise by changing what the model is asked to predict, improving generalization without requiring larger architectures or more data.

Embodied Robotics Research

Explorer

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Summary

Key Contributions

Significance

Links

Graph View

Table of Contents

Backlinks