Summary
ABot-PhysWorld is a 14B Diffusion Transformer world model trained on 3 million manipulation clips annotated with physics metadata. A DPO-based post-training framework with decoupled discriminators suppresses physically implausible behaviors (object penetration, anti-gravity motion) while preserving visual quality. A parallel context block injects spatial action signals for cross-embodiment control. The authors also introduce EZSbench, the first training-independent zero-shot embodied benchmark.
Key Contributions
- 14B DiT world model trained on 3M physics-annotated robot manipulation clips
- DPO-based post-training with decoupled discriminators to enforce physical plausibility
- Parallel context block for precise spatial action injection enabling cross-embodiment control
- EZSbench: first training-independent zero-shot embodied benchmark, spanning real and synthetic robot-task-scene combinations
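The summary does not give the post-training objective, so the following is only a minimal sketch of the standard DPO loss for a single preference pair, not the authors' formulation. It assumes scalar log-likelihoods are available for a preferred and a rejected sample under both the trained model and a frozen reference model. The function name `dpo_loss`, its parameter names, and the default `beta` are hypothetical, and how the decoupled discriminators select or score the pairs is not modeled here.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All names and the default beta are illustrative assumptions,
    not taken from the ABot-PhysWorld paper.
    """
    # Margin: how much more the policy favors the preferred sample
    # over the rejected one, relative to the frozen reference model.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Example: the policy already prefers the "plausible" sample more than
# the reference does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a physics-alignment setting, one would presumably treat a physically plausible clip as the preferred sample and a clip showing penetration or anti-gravity motion as the rejected one; that pairing is an assumption about usage, not a detail confirmed by the source.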
Significance
ABot-PhysWorld surpasses Veo 3.1 and Sora v2 Pro on physical plausibility and trajectory consistency benchmarks, establishing a new state of the art for physics-aligned robot world models. The authors present DPO-based enforcement of physical laws as a novel training paradigm for embodied world models.