
HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length
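The imbalance described above, and the idea of reducing the weight of negative-advantage updates, can be illustrated with a minimal sketch. It assumes the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) and a hypothetical constant factor `neg_weight` applied to negative advantages. The function names and the value 0.5 are illustrative assumptions, not the paper's actual weighting scheme, and the length-normalization change is not shown because the excerpt is truncated before describing it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def downweight_negative(advantages, neg_weight=0.5):
    """Scale negative advantages by neg_weight (hypothetical); keep the rest."""
    return [a if a >= 0 else neg_weight * a for a in advantages]


# Sparse binary reward: only one of four sampled responses is verified correct.
rewards = [1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
scaled = downweight_negative(adv, neg_weight=0.5)

num_negative = sum(1 for a in adv if a < 0)
num_positive = sum(1 for a in adv if a > 0)
print(num_negative, num_positive)  # negative-advantage responses outnumber positive ones
```

With a single success in the group, three of the four responses receive negative advantages, which is the early-training imbalance the abstract refers to; the scaling step shrinks only those negative terms.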

Paper · arXiv

cs.LG

Authors: Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang
Published: 2026-05-28
Categories: cs.LG, cs.AI

via arXiv · 2605.30201