You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over

Paper · arXiv

cs.CL

Authors: Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei
Published: 2026-06-04
Categories: cs.CL, cs.AI, cs.LG

via arXiv · 2606.06467