FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

Summary

FlowPRO is a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. It introduces RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the continuous-action flow-matching head of VLA models, combining a contrastive optimizer with an explicit proximal regularizer that prevents reward-hacking.
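To make the combination of a contrastive term and a proximal term concrete, here is a minimal NumPy sketch. It is not the RPRO objective from the paper. It assumes a Flow-DPO-style implicit reward (how much better the policy's predicted velocity fits the flow-matching target than a frozen reference does) and a simple squared penalty on that reward as a stand-in for the proximal regularizer. The names `implicit_reward`, `sketch_loss`, `beta`, and `prox_weight` are hypothetical and chosen for illustration only.

```python
import numpy as np


def implicit_reward(v_policy, v_ref, v_target, beta):
    """Assumed Flow-DPO-style implicit reward per sample.

    Positive when the policy's velocity prediction is closer to the
    flow-matching target than the frozen reference's prediction.
    """
    err_policy = np.sum((v_policy - v_target) ** 2, axis=-1)
    err_ref = np.sum((v_ref - v_target) ** 2, axis=-1)
    return -beta * (err_policy - err_ref)


def sketch_loss(r_win, r_lose, prox_weight):
    """Contrastive preference term plus an illustrative proximal term.

    r_win / r_lose: implicit rewards of preferred / rejected action chunks.
    The squared penalty is an assumption made for this sketch, not the
    paper's regularizer.
    """
    margin = r_win - r_lose
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    contrastive = np.logaddexp(0.0, -margin)
    # Penalize the absolute size of each implicit reward, not just the gap.
    proximal = prox_weight * (r_win ** 2 + r_lose ** 2)
    return np.mean(contrastive + proximal)
```

The contrastive term depends only on the margin `r_win - r_lose`, so the reward pairs `(1, 0)` and `(-4, -5)` receive the same loss when `prox_weight` is zero, even though in the second pair the policy fits both samples worse than the reference. With a nonzero `prox_weight`, the second pair is penalized more heavily, which is the kind of drift a term on absolute reward magnitude is meant to discourage.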

Key Contributions

RPRO objective: extends preference optimization to the continuous trajectory space of flow-matching VLAs, avoiding the discrete-token assumptions of language-model DPO
Proximal regularizer: anchors the absolute magnitude of the implicit reward, eliminating the reward-hacking failure mode of plain Flow-DPO
Reward-free formulation: requires no ground-truth reward labels and operates purely on preference pairs derived from offline data
From Tencent Robotics X, Futian Laboratory, and Tsinghua University

Significance

FlowPRO addresses a key gap in VLA post-training: how to perform preference-based reinforced fine-tuning when the action head is a continuous flow-matching model rather than a discrete token predictor. This extends RL-based improvement to the growing class of diffusion- and flow-based VLAs.
