Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption is mismatched with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with

Paper · arXiv

cs.LG

Authors: Seyed Arshan Dalili, Mehrdad Mahdavi
Published: 2026-06-04
Categories: cs.LG, cs.AI

via arXiv · 2606.06333