Summary

FlashSAC is a scalable off-policy RL algorithm built on Soft Actor-Critic that sharply reduces gradient update frequency while compensating with larger models and higher data throughput, enabling stable training in high-dimensional robot control settings. It explicitly bounds weight, feature, and gradient norms to prevent critic error accumulation under the broader state-action distributions sampled during off-policy learning.

Key Contributions

  • Reduced gradient update schedule compensated by larger network capacity and increased replay throughput, improving training efficiency
  • Explicit norm bounding (weight, feature, gradient) to maintain critic stability under diverse off-policy data
  • Evaluated on 60+ tasks across 10 simulators, consistently outperforming PPO and strong off-policy baselines
  • Reduces humanoid locomotion sim-to-real training time from hours to minutes; largest gains on dexterous manipulation

Significance

Demonstrates that off-policy methods can match or exceed on-policy stability with appropriate regularization, reopening the efficiency advantages of SAC-class algorithms for challenging high-dimensional robot control tasks.