Summary
RoboAlign-R1 addresses the mismatch between standard pixel-reconstruction training objectives and what actually matters for robot decision-making in video world models. It constructs RobotWorldBench (10,000 annotated video-instruction pairs), trains a multimodal teacher judge (RoboAlign-Judge) for six-dimensional video evaluation, and distills it into a lightweight student reward model used for GRPO-based RL post-training of world models.
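As a rough illustration of the GRPO step, the sketch below computes group-relative advantages in the standard GRPO way: rewards for a group of samples drawn for the same input are normalized by the group mean and standard deviation. The reward values are made up, the function name is hypothetical, and the use of the population standard deviation is an assumption; the source does not give the exact formulation used in RoboAlign-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are ranked against their siblings rather than against a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical student-reward scores for four candidate videos of one instruction.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
```

Videos scored above the group mean receive positive advantages and are reinforced; those below the mean are penalized. A group with identical rewards yields all-zero advantages and therefore no gradient signal.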
Key Contributions
- RobotWorldBench: 10,000 annotated video-instruction pairs from four robot data sources for fine-grained world model evaluation
- RoboAlign-Judge: multimodal teacher that scores generated videos along six dimensions, including instruction following, manipulation success, and physical plausibility
- GRPO-based RL post-training using the distilled student reward model to align world model outputs with decision-relevant quality
- Sliding-window re-encoding strategy to stabilize long-horizon autoregressive rollouts and reduce error accumulation
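The source does not spell out how sliding-window re-encoding works. One plausible reading, sketched below, is that each autoregressive step re-encodes only the most recent frames instead of reusing latents carried over from earlier steps. The function `rollout`, its parameters, and the toy `encode`, `predict_next`, and `decode` callables are all hypothetical stand-ins, not the paper's components.

```python
from collections import deque


def rollout(initial_frames, encode, predict_next, decode, steps, window=4):
    """Autoregressive rollout that re-encodes the last `window` frames every step."""
    # Keep only the most recent `window` frames; older ones fall out automatically.
    frames = deque(initial_frames[-window:], maxlen=window)
    generated = []
    for _ in range(steps):
        # Fresh latents are computed from frames, not reused from earlier steps.
        latents = [encode(f) for f in frames]
        next_frame = decode(predict_next(latents))
        generated.append(next_frame)
        frames.append(next_frame)
    return generated


# Toy stand-ins (assumptions): frames are floats, the "model" extends a linear trend.
encode = lambda frame: frame * 2
decode = lambda latent: latent / 2
predict_next = lambda latents: latents[-1] + 2

out = rollout([0.0, 1.0], encode, predict_next, decode, steps=3, window=2)
```

Bounding the context to a fixed window keeps the cost per step constant. Under this reading, re-encoding from frames limits how far stale latent state can propagate through a long rollout.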
Significance
Establishes a reward-alignment pipeline for robot video world models that bridges the gap between low-level reconstruction objectives and the high-level task-success metrics needed for planning.