Summary

LifeLong-RFT addresses catastrophic forgetting in vision-language-action (VLA) models, which arises when supervised fine-tuning (SFT) is applied to new downstream tasks. The proposed Reinforcement Fine-Tuning strategy requires neither online environmental feedback nor pre-trained reward models. Instead, it performs chunking-level on-policy RL guided by a Multi-Dimensional Process Reward (MDPR), which quantifies the heterogeneous contributions of intermediate action chunks along three dimensions to facilitate stable policy optimization.
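The core of MDPR is scoring each intermediate action chunk along several quality dimensions and collapsing those scores into one scalar process reward per chunk. The summary does not specify the dimension names or the aggregation rule, so the sketch below assumes a simple convex combination of three hypothetical dimension scores:

```python
import numpy as np

def mdpr(dim_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine per-dimension quality scores for each action chunk into one
    scalar process reward per chunk.

    dim_scores: shape (num_chunks, 3) -- one score per chunk for each of the
        three dimensions (dimension semantics are hypothetical here; the
        summary does not enumerate them).
    weights: shape (3,) -- relative importance of each dimension (assumed;
        the actual aggregation in the paper may differ).
    """
    weights = weights / weights.sum()   # normalize to a convex combination
    return dim_scores @ weights         # shape (num_chunks,)

# Three chunks, each scored along three hypothetical dimensions.
scores = np.array([[0.9, 0.5, 0.7],
                   [0.2, 0.8, 0.6],
                   [0.6, 0.6, 0.6]])
rewards = mdpr(scores, np.array([1.0, 1.0, 1.0]))  # equal weights -> per-chunk mean
```

With equal weights this reduces to the per-chunk mean score; unequal weights would let one dimension dominate the process reward.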

Key Contributions

  • LifeLong-RFT: RFT strategy that requires no online environmental feedback or external reward models
  • Multi-Dimensional Process Reward (MDPR) for evaluating action chunks along multiple quality dimensions
  • Chunking-level on-policy RL adapted to the autoregressive action-chunk structure of VLAs
  • Mitigates catastrophic forgetting while substantially reducing task-specific data requirements vs SFT
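To make the chunking-level on-policy RL contribution concrete: because the VLA emits action chunks autoregressively, the policy-gradient update can weight each sampled chunk's log-probability by its (baseline-subtracted) process reward rather than by a single episode return. The paper's exact objective is not given in this summary, so the following is an illustrative REINFORCE-style sketch, not the actual LifeLong-RFT update:

```python
import numpy as np

def chunk_policy_gradient_loss(chunk_logprobs: np.ndarray,
                               chunk_rewards: np.ndarray) -> float:
    """Chunk-level on-policy surrogate loss (illustrative assumption).

    chunk_logprobs: log pi(chunk_t | context) for each sampled chunk.
    chunk_rewards:  per-chunk process rewards, e.g. MDPR outputs.

    Each chunk's log-probability is weighted by its advantage (here a
    simple mean baseline); minimizing the negated sum increases the
    likelihood of above-average chunks.
    """
    advantages = chunk_rewards - chunk_rewards.mean()  # mean baseline
    return float(-(advantages * chunk_logprobs).sum())

# Two sampled chunks: the first is rewarded, the second is not.
loss = chunk_policy_gradient_loss(np.array([-0.5, -1.0]),
                                  np.array([1.0, 0.0]))
```

The per-chunk advantage is what gives the process reward its leverage: a good chunk inside an overall-mediocre trajectory still receives positive credit, which is the stabilizing effect the summary attributes to MDPR.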

Significance

LifeLong-RFT addresses a critical practical challenge for deploying VLAs in dynamic real-world settings: adapting to new tasks without regressing on previously learned skills, and doing so without incurring the cost of online data collection or external reward engineering.