GeoSem-WAM: Geometry- and Semantic-Aware World Action Models

Summary

GeoSem-WAM proposes a structured world modeling framework that enhances World Action Model (WAM) latent representations through explicit geometric and semantic supervision. Standard WAMs rely on RGB future prediction, which provides limited structural and spatial understanding. The paper also investigates whether WAM effectiveness comes from explicit future imagination at inference or from representation learning induced by predictive pretraining.

Key Contributions

Geometric supervision stream augmenting WAM latent representations with 3D structural understanding of manipulation scenes
Semantic supervision stream injecting task-relevant semantic priors to support richer action grounding
Analysis showing that WAM's primary advantage lies in the robust latent representations learned through predictive pretraining rather than in generating future observations at test time, a key insight for WAM design
Improved performance over RGB-only WAM baselines on embodied manipulation benchmarks
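
As a rough illustration of how the two supervision streams above might be combined with the standard RGB objective, the following minimal sketch adds a geometric term and a semantic term to an RGB future-prediction loss. This summary does not specify the paper's actual targets or weighting, so everything here is an assumption: depth values stand in for the geometric target, per-element class labels stand in for the semantic target, and the names `geosem_loss`, `w_geo`, and `w_sem` are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length flat lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one prediction against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]


def geosem_loss(rgb_pred, rgb_true, depth_pred, depth_true,
                sem_logits, sem_labels, w_geo=1.0, w_sem=1.0):
    """Hypothetical combined objective: RGB + weighted geometry + weighted semantics.

    Assumptions (not taken from the paper): depth is the geometric target,
    class labels are the semantic target, and the terms are summed linearly.
    """
    rgb_term = mse(rgb_pred, rgb_true)        # standard RGB future prediction
    geo_term = mse(depth_pred, depth_true)    # geometric supervision stream
    sem_term = sum(
        cross_entropy(logits, label)
        for logits, label in zip(sem_logits, sem_labels)
    ) / len(sem_labels)                       # semantic supervision stream
    return rgb_term + w_geo * geo_term + w_sem * sem_term
```

Setting `w_geo=0.0` and `w_sem=0.0` in this sketch reduces it to the RGB-only objective, which is the kind of baseline the comparison above refers to.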

Significance

GeoSem-WAM provides both a practical improvement (geometric and semantic supervision improve WAM representations) and a theoretical clarification (representation learning, rather than inference-time imagination, drives WAM gains). The latter bears directly on how world action models should be designed and evaluated.
