AI Summary • Published on Dec 3, 2025
Traditional fine-tuning methods for Vision-Language-Action (VLA) models in robotic manipulation often treat long-horizon actions as monolithic linguistic sequences. This monolithic treatment leads to coarse credit assignment and unstable training, because action trajectories naturally consist of causally chained stages that vary in difficulty and demand precise execution. Optimizing an entire trajectory at once obscures the contribution of individual actions within a complex sequence, hurting performance on tasks that require high precision or multi-stage coordination.
The authors propose Stage-Aware Reinforcement (StARe), a plug-in module that decomposes long-horizon action trajectories into semantically meaningful, progressive stages. StARe employs a stage separator, which detects stage transitions from end-effector translation and orientation, and a stage calculator, which computes stage-wise costs and dense intra-stage rewards. These rewards provide fine-grained, interpretable feedback and mitigate the sparse-reward problem of traditional reinforcement learning. StARe is integrated into two fine-tuning algorithms: Stage-Aware Trajectory-Wise Preference Optimization (StA-TPO) for offline stage-level preference alignment, and Stage-Aware Proximal Policy Optimization (StA-PPO) for online intra-stage interaction. To combine them, the authors introduce the Imitation → Preference → Interaction (IPI) pipeline, which fine-tunes a VLA model in sequence: Supervised Fine-Tuning (SFT) on expert demonstrations, then StA-TPO for offline preference alignment, and finally StA-PPO for online robustness through exploration.
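To make the stage separator and the dense intra-stage reward concrete, the sketch below shows one plausible realization: a trajectory is split wherever the per-step end-effector translation or orientation change exceeds a threshold, and each step inside a stage is rewarded by its progress toward that stage's goal pose. The function names, thresholds, and progress-based reward shape are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def segment_stages(ee_positions, ee_orientations,
                   trans_thresh=0.02, rot_thresh=0.1):
    """Split a trajectory into stages at frames where the per-step change in
    end-effector translation or orientation exceeds a threshold (assumed heuristic).

    ee_positions:    (T, 3) array of end-effector xyz positions.
    ee_orientations: (T, 3) array of end-effector Euler angles in radians.
    Returns a list of (start, end) index pairs, one per stage.
    """
    boundaries = [0]
    for t in range(1, len(ee_positions)):
        d_trans = np.linalg.norm(ee_positions[t] - ee_positions[t - 1])
        d_rot = np.linalg.norm(ee_orientations[t] - ee_orientations[t - 1])
        # A sudden jump in translation or orientation is taken as a stage transition.
        if d_trans > trans_thresh or d_rot > rot_thresh:
            boundaries.append(t)
    boundaries.append(len(ee_positions))
    return list(zip(boundaries[:-1], boundaries[1:]))


def intra_stage_rewards(ee_positions, stages, stage_goals):
    """Dense per-step reward inside each stage: the reduction in distance to that
    stage's goal pose between consecutive steps (positive = progress)."""
    rewards = np.zeros(len(ee_positions))
    for (start, end), goal in zip(stages, stage_goals):
        dists = np.linalg.norm(ee_positions[start:end] - goal, axis=1)
        rewards[start + 1:end] = dists[:-1] - dists[1:]
    return rewards


# Minimal usage on a synthetic 20-step trajectory.
T = 20
positions = np.cumsum(np.random.uniform(-0.01, 0.01, size=(T, 3)), axis=0)
orientations = np.zeros((T, 3))
stages = segment_stages(positions, orientations)
goals = [positions[end - 1] for _, end in stages]  # last pose of each stage as its goal
dense_r = intra_stage_rewards(positions, stages, goals)
```

One consequence of this dense, progress-based shaping is that every action receives feedback tied to its own stage, which is what lets credit assignment stay localized instead of being spread over the whole trajectory.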
Experiments were conducted on two robotic manipulation benchmarks: SimplerEnv with a WidowX arm and ManiSkill3 with a Franka robot, covering canonical single-object tasks and contact-rich manipulation. The IPI framework achieved state-of-the-art success rates of 98.0% on SimplerEnv-WidowX and 96.4% on ManiSkill3, improvements of +5.4 and +25.9 percentage points over previous state-of-the-art methods, respectively. Ablation studies further showed that StARe's stage-aware guidance is especially critical for precision-demanding stages, such as the 'Place' stage in stacking or the 'Upright' stage in peg lifting, where removing it caused performance drops exceeding 20%.
The introduction of Stage-Aware Reinforcement (StARe) and the IPI fine-tuning pipeline demonstrates that a stage-centric perspective and fine-grained credit assignment are crucial missing components in current VLA reinforcement learning for robotics. This approach leads to more stable optimization, faster convergence, and significantly enhanced performance and robustness for long-horizon and complex manipulation tasks. The research opens promising avenues for developing more scalable, interpretable, and reliable robotic learning systems by explicitly addressing the sequential and multi-stage nature of robotic actions.