Synthesize and Reward: Reinforcement Learning for Multi-Step Tool Use in Live Environments

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Contex

Paper · arXiv

cs.CL

Authors: Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara + 2 more
Published: 2026-06-02
Categories: cs.CL, cs.AI, cs.LG

via arXiv · 2606.03892