Summary
ROVE is a Reinforcement Learning framework with Optimistic Value Estimation for post-training humanoid vision-language-action (VLA) models using imperfect human interventions. It builds a human-in-the-loop data collection pipeline that supports whole-body and dexterous-hand intervention, and introduces a state-value learning recipe that combines robot rollouts, human intervention trajectories, and human experience videos to produce robust advantage signals even from suboptimal demonstrations.
Key Contributions
- Human-in-the-loop pipeline for humanoid manipulation supporting whole-body and dexterous-hand interventions
- Optimistic Value Estimation (OVE) to extract reliable advantage estimates from mixed-quality human trajectories
- State-value learning that fuses robot rollouts, human interventions, and experience videos for a richer reward signal
- Validated on real-world humanoid manipulation tasks, including novel objects and long-horizon sequences
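
This summary does not say how OVE is computed, so the following is only an illustrative sketch of what an "optimistic" state-value estimate can look like, not the paper's method. It swaps in expectile regression (the technique used in Implicit Q-Learning) as a stand-in: with `tau > 0.5`, the fitted value leans toward the better outcomes seen from a state, so only the stronger trajectories receive positive advantages. The function name `expectile`, the `tau` values, and the return numbers are all hypothetical.

```python
import numpy as np


def expectile(returns, tau, iters=200):
    """Estimate the tau-expectile of `returns` by fixed-point iteration.

    tau = 0.5 gives the ordinary mean; tau > 0.5 gives an optimistic
    estimate that weights high returns more heavily. (Illustrative
    stand-in for an optimistic value target, not ROVE's actual recipe.)
    """
    returns = np.asarray(returns, dtype=float)
    v = float(returns.mean())
    for _ in range(iters):
        # Samples above the current estimate get weight tau, the rest 1 - tau.
        w = np.where(returns > v, tau, 1.0 - tau)
        v = float(np.sum(w * returns) / np.sum(w))
    return v


# Hypothetical returns-to-go from one state across mixed-quality trajectories.
returns = [0.2, 0.3, 0.9, 1.0]

v_mean = expectile(returns, tau=0.5)  # plain average of all trajectories
v_opt = expectile(returns, tau=0.9)   # optimistic estimate

# Advantage of each trajectory relative to the optimistic baseline.
advantages = np.asarray(returns) - v_opt

print("mean value:", v_mean)
print("optimistic value:", v_opt)
print("advantages:", advantages)
```

In this toy setup the optimistic estimate sits above the plain mean, so the two weaker trajectories get negative advantages and only the two stronger ones are reinforced.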
Significance
Enables humanoid VLAs to leverage imperfect human corrections, a practically abundant signal, through RL. This avoids the distribution collapse that standard imitation learning causes when trained on suboptimal interventions.