In-Context Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly

Paper · arXiv

cs.LG

Authors: Zhenyu Sun, Zheng Xu, Ermin Wei
Published: 2026-05-28
Categories: cs.LG, cs.AI

via arXiv · 2605.30323