Summary

MARCH bridges model-based and model-free reinforcement learning (RL) for safety-critical humanoid locomotion on sparse footholds (beams, stepping stones), where small errors cause catastrophic failure. Its three-stage pipeline (1) generates a safe reference trajectory from simplified dynamics models, (2) trains a privileged teacher policy guided by a Control Lyapunov Function (CLF) reward built around this reference, and (3) distills the teacher into a vision-based student policy. Evaluated on a Unitree G1 humanoid robot, the approach produces stable, precise footstep placement across challenging terrains where pure model-free RL fails to converge.
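
The distillation stage of the pipeline can be illustrated with a small sketch. The summary does not give MARCH's distillation objective, so everything here is an assumption made for illustration: a linear student, a mean-squared-error imitation loss against the teacher's actions, and plain gradient descent. The names distillation_loss and distill_step are hypothetical.

```python
import numpy as np

def distillation_loss(student_actions, teacher_actions):
    """Mean over the batch of the squared action error (assumed imitation loss)."""
    diff = np.asarray(student_actions) - np.asarray(teacher_actions)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

def distill_step(W_student, observations, teacher_actions, lr=0.1):
    """One gradient-descent step for a hypothetical linear student a = obs @ W.

    observations:    (N, d_obs) features available to the student
    teacher_actions: (N, d_act) actions produced by the privileged teacher
    """
    pred = observations @ W_student
    # Gradient of (1/N) * sum_i ||obs_i @ W - a_i||^2 with respect to W.
    grad = 2.0 * observations.T @ (pred - teacher_actions) / observations.shape[0]
    return W_student - lr * grad
```

In practice the student would be a neural network consuming visual input and the teacher's actions would come from rollouts; the linear form only shows the supervised structure of the step, in which the student regresses onto actions from a teacher that sees more than it does.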

Key Contributions

  • Combines model-based safety guarantees (a CLF reward built around a simplified-model reference) with the robustness of model-free RL
  • Introduces a privileged teacher policy that learns in a structured way from ground-truth state before vision-based distillation
  • Uses the CLF reward to provide dense, safety-consistent feedback without manual reward engineering
  • Demonstrates deployment on a Unitree G1 for sparse-foothold locomotion tasks
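
To make the idea of a CLF reward concrete, here is a minimal sketch of how such a reward could be computed. The summary does not state the reward used in MARCH, so the form below is an assumption: a quadratic Lyapunov candidate V(e) = e^T P e on the tracking error e relative to the reference, a finite-difference estimate of its time derivative, and a penalty whenever the decay condition dV/dt + lambda * V <= 0 is violated. The function names, the weights, and the matrix P are all hypothetical.

```python
import numpy as np

def clf_value(error, P):
    """Quadratic Lyapunov candidate V(e) = e^T P e (P assumed positive definite)."""
    e = np.asarray(error, dtype=float)
    return float(e @ P @ e)

def clf_reward(error_prev, error_next, P, dt,
               decay_rate=1.0, w_value=1.0, w_decay=1.0):
    """Dense reward from a CLF-style decay condition (illustrative form only).

    error_prev, error_next: tracking error w.r.t. the reference at
    consecutive control steps; dt is the step duration in seconds.
    """
    v_prev = clf_value(error_prev, P)
    v_next = clf_value(error_next, P)
    v_dot = (v_next - v_prev) / dt  # finite-difference estimate of dV/dt
    # Positive only when V fails to decay at least at rate decay_rate.
    violation = max(0.0, v_dot + decay_rate * v_prev)
    return -w_value * v_next - w_decay * violation
```

Because both terms are defined at every control step, the signal is dense: a step that shrinks the tracking error quickly enough is penalized only for the remaining error, while a step that drifts away from the reference also pays the decay-violation penalty.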

Significance

MARCH shows that safety-critical locomotion on sparse terrain, a major barrier to humanoid deployment, becomes tractable when simplified-model references structure the RL reward, bypassing the need for careful manual reward shaping.