MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic

Paper · arXiv

cs.AI

Authors: Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu + 6 more
Published: 2026-05-28

via arXiv · 2605.30288