
TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

This paper investigates abstention learning in large language models (LLMs), specifically using a ternary reward, which incentivizes truthfulness in LLMs. It extends that idea by moving from a ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. The work focuses on abstention learning rather than on improving truthfulness, serving as an exploration into
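
To make the setting above concrete, here is a minimal sketch of a ternary reward combined with GRPO-style group-relative advantages. It is not the paper's TIAR rule, which the abstract does not specify. The reward values (+1 correct, -1 incorrect), the function names, and the `abstain_reward` parameter are all assumptions chosen for illustration; the sketch only shows how changing the reward assigned to abstention shifts the advantage an abstaining sample receives within its group.

```python
from statistics import mean, pstdev


def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Map a sampled answer's outcome to a scalar reward.

    Assumed values for illustration: +1 for correct, -1 for incorrect,
    and a configurable reward for abstaining.
    """
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return abstain_reward
    if outcome == "incorrect":
        return -1.0
    raise ValueError(f"unknown outcome: {outcome!r}")


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_for_group(outcomes: list[str], abstain_reward: float) -> list[float]:
    """Score every sample in a group, then compute its relative advantage."""
    rewards = [ternary_reward(o, abstain_reward) for o in outcomes]
    return group_relative_advantages(rewards)


if __name__ == "__main__":
    # One hypothetical group of four sampled responses to the same prompt.
    group = ["correct", "abstain", "incorrect", "incorrect"]
    for value in (0.0, 0.5):
        advs = advantages_for_group(group, abstain_reward=value)
        print(value, [round(a, 3) for a in advs])
```

In this sketch the abstention reward is a fixed argument. A dynamic scheme would compute it per group or per training step from some signal, but which signal TIAR uses is not stated in the text above.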

Paper · arXiv

cs.CL

Authors: Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh + 2 more
Published: 2026-05-25
Categories: cs.CL, cs.AI, cs.LG


via arXiv · 2605.2585