How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = \sigma^*(\langle \theta^*, x\rangle)$ and $x \sim \mathcal{N}(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $\theta^*$ from reward-weighted samples and then fits the readout layer by weigh

Paper · arXiv

stat.ML

Authors: Rei Higuchi, Ryotaro Kawata, Akifumi Wachi, Shokichi Takakura, Kohei Miyaguchi + 1 more
Published: 2026-05-23
Categories: stat.ML, cs.LG

via arXiv · 2605.24749