Trading Human Curation for Synthetic Augmentation in RLVR

The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and hum

Paper · arXiv

Authors: Akshansh, Leonardo Rosa Rodrigues, Michael Korostelev, Youssef Hassan, Mark E. Whiting
Published: 2026-06-02
Categories: cs.LG, cs.AI

via arXiv · 2606.038