Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy's tool-grounded answer rate climbs from $3.8\%$ to $9.6\%$ over 250 steps, then collapses to $0\%$ within a single 50-step window -- a \emph{peak-then-collapse} pattern replicated across four seeds. Across seven reward designs, we find four recurring failure mo

Paper · arXiv

cs.CL

Authors: Tianda Sun, Dimitar Kazakov
Published: 2026-05-25

via arXiv · 2605.26037