Knowledge Index of Noah's Ark

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness throug

Paper · arXiv

cs.AI

Authors: Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi + 22 more
Published: 2026-06-03

via arXiv · 2606.05104