Benchmark Everything Everywhere All at Once

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous

Paper · arXiv

cs.AI

Authors: Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang + 3 more
Published: 2026-06-04

via arXiv · 2606.06462