AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-

Paper · arXiv

Authors: Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen + 14 more
Published: 2026-06-03
Categories: cs.AI, cs.LG

via arXiv · 2606.0508