RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Summary

RepWAM addresses a core limitation of existing World Action Models (WAMs) that inherit reconstruction-oriented video tokenizers from pretrained video generation models: pixel reconstruction provides limited guidance for learning instruction-following dynamics. RepWAM first trains a representation visual-action tokenizer that maps visual inputs into semantically aligned visual tokens and latent action tokens. It then pretrains the WAM, conditioned on language instructions, to jointly model future visual states and the latent actions connecting them, and finally adapts the model to real robot trajectories.
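To make the data flow concrete, the sketch below shows one way a training sequence could be laid out: instruction tokens first, then each frame's visual tokens followed by the latent action tokens that lead to the next frame. This is an illustrative assumption, not the paper's confirmed layout. The summary does not say whether tokens are discrete, how many tokens a frame or action uses, or in what order they are arranged; `Token`, `build_sequence`, and all integer codes here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str   # "text", "visual", or "action" (hypothetical labels)
    ident: int  # hypothetical discrete code index


def build_sequence(
    instruction_ids: List[int],
    frame_tokens: List[List[int]],
    action_tokens: List[List[int]],
) -> List[Token]:
    """Lay out: instruction, v_0, a_0, v_1, a_1, ..., v_T.

    Assumes exactly one group of latent action tokens per pair of
    consecutive frames (an assumption, not stated in the source).
    """
    if len(action_tokens) != len(frame_tokens) - 1:
        raise ValueError("need one latent action group per consecutive frame pair")
    seq = [Token("text", i) for i in instruction_ids]
    for t, frame in enumerate(frame_tokens):
        seq.extend(Token("visual", i) for i in frame)
        # The last frame has no outgoing latent action.
        if t < len(action_tokens):
            seq.extend(Token("action", i) for i in action_tokens[t])
    return seq


# Toy example: 2 instruction tokens, 3 frames of 2 tokens, 2 latent actions.
seq = build_sequence([7, 8], [[1, 2], [3, 4], [5, 6]], [[10], [11]])
kinds = [tok.kind for tok in seq]
```

Under this layout, a model trained autoregressively on the sequence would predict both the next visual state and the latent action that connects consecutive states, which is the coupling the summary describes.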

Key Contributions

Representation visual-action tokenizer: produces semantically rich visual tokens aligned with latent action tokens, replacing reconstruction-focused tokenization
Joint visual-action pretraining objective: the WAM learns to predict future states and intermediate actions simultaneously, tightening the world-prediction/control coupling
Ablations demonstrating the advantage of semantic visual-action tokenization over reconstruction-oriented alternatives at both pretraining and fine-tuning stages
Strong performance across real-world manipulation tasks and simulation benchmarks
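The joint visual-action pretraining objective listed above can be illustrated with a minimal sketch, assuming discrete tokens and a weighted sum of two cross-entropy terms. The source states only that future states and intermediate actions are predicted jointly; the loss form, the `action_weight` parameter, and the function names below are assumptions made for illustration.

```python
import math
from typing import Sequence


def mean_nll(preds: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Average negative log-likelihood of the target index under each
    predicted probability distribution."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal length")
    return sum(-math.log(p[t]) for p, t in zip(preds, targets)) / len(targets)


def joint_loss(
    visual_preds: Sequence[Sequence[float]],
    visual_targets: Sequence[int],
    action_preds: Sequence[Sequence[float]],
    action_targets: Sequence[int],
    action_weight: float = 1.0,  # hypothetical balancing coefficient
) -> float:
    """Visual-token loss plus weighted latent-action-token loss."""
    visual_term = mean_nll(visual_preds, visual_targets)
    action_term = mean_nll(action_preds, action_targets)
    return visual_term + action_weight * action_term


# Toy check: uniform predictions over 4 visual codes and 2 action codes.
loss = joint_loss(
    visual_preds=[[0.25] * 4, [0.25] * 4],
    visual_targets=[0, 3],
    action_preds=[[0.5, 0.5]],
    action_targets=[1],
)
```

With uniform predictions the two terms reduce to log 4 and log 2, so the toy loss equals log 8. Optimizing a single sum means gradients from both the visual and the action predictions update the same model, which is one plausible reading of how joint training tightens the coupling between world prediction and control.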

Significance

RepWAM directly targets tokenizer design, the lowest-level architectural choice in a WAM, and shows that semantically aligned tokenization delivers outsized gains. This suggests that representation quality at the input stage is a critical and often overlooked factor in world action model design.
