Summary
OSCAR is an action-conditioned video world model that follows commanded actions precisely, generalizes across robot embodiments, and enables virtual policy evaluation. It targets three barriers to using video world models for this purpose: limited scenario diversity in existing robot datasets, imprecise action following in existing video generators, and poor cross-embodiment generalization.
Key Contributions
- Large-scale standardized data pipeline that curates, filters, and deduplicates robot and egocentric human datasets into a clean joint-training corpus spanning diverse tasks, scenarios, and embodiments
- 2D kinematic skeleton rendering as a unified conditioning representation, allowing the same conditioning approach to work for robot arms and human hands alike
- World model fine-tuned from Cosmos-Predict2.5-2B, whose virtual policy rollouts correlate strongly with real-world evaluation outcomes
- Dataset available on Hugging Face (zywu2115/OSCAR_human)
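To make the skeleton-conditioning idea concrete, here is a minimal sketch of how 3D joint positions might be projected into a 2D skeleton image. It is not OSCAR's renderer: the pinhole camera model, the function names (`project_point`, `draw_line`, `render_skeleton`), the single-channel canvas, and the toy three-joint chain are all assumptions made for illustration, since the summary does not describe line thickness, colors, or camera handling.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_point(p: Point3D, fx: float, fy: float, cx: float, cy: float) -> Pixel:
    """Project a camera-frame 3D point to pixel coordinates (assumed pinhole model)."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (round(fx * x / z + cx), round(fy * y / z + cy))


def draw_line(canvas: List[List[int]], a: Pixel, b: Pixel, value: int = 1) -> None:
    """Rasterize a one-pixel-wide segment with Bresenham's algorithm, clipping to the canvas."""
    (x0, y0), (x1, y1) = a, b
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(canvas) and 0 <= x0 < len(canvas[0]):
            canvas[y0][x0] = value
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_skeleton(
    joints: Sequence[Point3D],
    bones: Sequence[Tuple[int, int]],
    width: int,
    height: int,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[int]]:
    """Return a height x width binary image with one line per bone (pair of joint indices)."""
    canvas = [[0] * width for _ in range(height)]
    pixels = [project_point(j, fx, fy, cx, cy) for j in joints]
    for parent, child in bones:
        draw_line(canvas, pixels[parent], pixels[child])
    return canvas


# Hypothetical three-joint chain; the same call would serve a robot arm or a hand,
# because both reduce to a list of joints plus a list of bones.
joints = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.1, 0.1, 1.0)]
bones = [(0, 1), (1, 2)]
canvas = render_skeleton(joints, bones, width=16, height=16, fx=50.0, fy=50.0, cx=8.0, cy=8.0)
```

The point of the representation is visible in the signature: nothing in `render_skeleton` depends on the embodiment, only on the kinematic tree that is passed in.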
Significance
OSCAR demonstrates that virtual evaluation in a learned world model can reliably substitute for physical evaluation across embodiments, paving the way for robot policy benchmarking and iteration loops that run entirely inside the world model.
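One way to check how closely virtual evaluation tracks physical evaluation is to correlate per-policy success rates from the two settings. The summary does not say which statistic the authors use, so the sketch below assumes Pearson and Spearman correlation. The function names and the success rates are placeholders invented for illustration, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder success rates for four hypothetical policies (NOT from the paper).
virtual_success = [0.2, 0.5, 0.4, 0.9]
real_success = [0.1, 0.6, 0.3, 0.8]

r = pearson(virtual_success, real_success)
rho = spearman(virtual_success, real_success)
```

Rank correlation is often the more relevant of the two for benchmarking, because it asks whether the world model orders policies the same way physical trials do, even if the absolute success rates differ.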