Summary
FlowPRO is a reward-free, offline reinforced fine-tuning framework for flow-matching vision-language-action (VLA) models. It introduces RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the continuous-action flow-matching head of these models. RPRO combines a contrastive optimizer with an explicit proximal regularizer that prevents reward hacking.
Key Contributions
- RPRO objective: extends preference optimization to the continuous trajectory space of flow-matching VLAs, avoiding the discrete-token assumptions of language-model DPO
- Proximal regularizer: anchors the absolute magnitude of the implicit reward, eliminating the reward-hacking failure mode of plain Flow-DPO
- Reward-free formulation: no ground-truth reward labels needed; operates purely on preference pairs derived from offline data
- From Tencent Robotics X, Futian Laboratory, and Tsinghua University
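
The RPRO and proximal-regularizer bullets above can be illustrated with a small numerical sketch. This summary does not give the actual RPRO formula, so everything below is an assumption made for illustration: the implicit reward is taken, in the style of diffusion/flow DPO, as the reduction in flow-matching error relative to a frozen reference policy; the contrastive term is the usual logistic preference loss; and the proximal term is a plain squared penalty on each implicit reward. The names `fm_error`, `preference_loss_sketch`, `beta`, and `lam` are hypothetical.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)


def fm_error(v_pred, a, eps):
    # Per-sample flow-matching error. Assumes the linear interpolation path
    # x_t = (1 - t) * eps + t * a, whose target velocity is (a - eps).
    u = a - eps
    return np.mean((v_pred - u) ** 2, axis=tuple(range(1, a.ndim)))


def preference_loss_sketch(err_pol_w, err_ref_w, err_pol_l, err_ref_l,
                           beta=1.0, lam=0.1):
    # Assumed implicit reward: how much the policy lowers the flow-matching
    # error relative to the reference, for preferred (w) and rejected (l) actions.
    r_w = -(err_pol_w - err_ref_w)
    r_l = -(err_pol_l - err_ref_l)
    # Contrastive term: depends only on the reward margin r_w - r_l.
    contrastive = -log_sigmoid(beta * (r_w - r_l))
    # Assumed proximal term: penalizes the absolute magnitude of each reward,
    # so the margin cannot grow by letting both errors drift from the reference.
    proximal = lam * (r_w ** 2 + r_l ** 2)
    return float(np.mean(contrastive + proximal)), r_w, r_l


ref = np.array([1.0])  # reference error on both actions of one pair
# Policy A stays near the reference: small margin, small reward magnitudes.
near = (np.array([0.8]), ref, np.array([1.2]), ref)
# Policy B degrades both actions but degrades the rejected one more.
drift = (np.array([3.0]), ref, np.array([5.0]), ref)

loss_near_plain, _, _ = preference_loss_sketch(*near, lam=0.0)
loss_drift_plain, _, _ = preference_loss_sketch(*drift, lam=0.0)
loss_near_prox, _, _ = preference_loss_sketch(*near, lam=0.1)
loss_drift_prox, _, _ = preference_loss_sketch(*drift, lam=0.1)
```

With `lam=0.0` the drifted policy B attains the lower loss even though its flow-matching error on the preferred action has tripled, which is one way a margin-only objective can be exploited. With `lam=0.1` the ordering flips and policy A is preferred. The squared penalty is only a stand-in for whatever proximal term RPRO actually uses.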
Significance
Addresses a key gap in VLA post-training: how to perform preference-based reinforcement fine-tuning when the action head is a continuous flow-matching model rather than a discrete token predictor. This extends RL-based improvement to the growing class of diffusion- and flow-based VLAs.