Summary

RoboAlign-R1 addresses the mismatch between standard pixel-reconstruction training objectives and what actually matters for robot decision-making in video world models. It constructs RobotWorldBench (10,000 annotated video-instruction pairs), trains a multimodal teacher judge (RoboAlign-Judge) for six-dimensional video evaluation, and distills it into a lightweight student reward model used for GRPO-based RL post-training of world models.
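To make the GRPO step concrete, the sketch below shows the group-relative advantage computation that GRPO is generally built on: several rollouts are sampled for the same instruction, each is scored by the reward model, and each score is normalized against its own group. The function name, the `eps` constant, the use of a population standard deviation, and the example scores are illustrative assumptions, not details taken from RoboAlign-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same instruction.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical student-reward scores for four rollouts of one instruction.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
```

Rollouts scored above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. Because the baseline comes from the group itself, no separate value network is needed.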

Key Contributions

  • RobotWorldBench: 10,000 annotated video-instruction pairs drawn from four robot data sources, for fine-grained world model evaluation
  • RoboAlign-Judge: multimodal teacher judge that scores generated videos on six dimensions covering instruction following, manipulation success, and physical plausibility
  • GRPO-based RL post-training using the distilled student reward model to align world model outputs with decision-relevant quality
  • Sliding-window re-encoding strategy to stabilize long-horizon autoregressive rollouts and reduce error accumulation
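One plausible reading of the sliding-window re-encoding idea is sketched below: at each autoregressive step, only the most recent `window` frames are encoded afresh and used as context, rather than carrying forward an ever-growing latent history. This is a toy illustration under stated assumptions, not the paper's implementation. The `rollout` function, its parameters, and the numeric stand-ins for frames, encoder, and predictor are all hypothetical.

```python
from collections import deque


def rollout(initial_frames, steps, window, encode, predict):
    """Autoregressive rollout that re-encodes only the last `window` frames each step."""
    frames = list(initial_frames)
    recent = deque(frames[-window:], maxlen=window)  # oldest frame drops out automatically
    for _ in range(steps):
        context = encode(list(recent))  # fresh encoding of the current window
        next_frame = predict(context)
        frames.append(next_frame)
        recent.append(next_frame)
    return frames


# Toy stand-ins (hypothetical): frames are numbers, the "encoding" is their mean.
def toy_encode(window_frames):
    return sum(window_frames) / len(window_frames)


def toy_predict(context):
    return context + 1.0


out = rollout([0.0, 1.0, 2.0], steps=2, window=2, encode=toy_encode, predict=toy_predict)
```

In a real world model, `encode` and `predict` would be the video encoder and the generative backbone. The design point is that the context length stays bounded, so errors in early generated frames eventually leave the window.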

Significance

Establishes a principled reward-alignment pipeline for robot video world models, bridging the gap between low-level reconstruction objectives and high-level task success metrics needed for planning.