
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that these constraints are overly restrictive: redundancy in pretrained transformers is not confined to contiguous regions, nor is it evenly distributed between Attention and FeedForward outputs, implying that different st

Paper · arXiv

cs.CL

Authors: Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca
Published: 2026-06-01
Categories: cs.CL, cs.AI


via arXiv · 2606.02559