Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbali

Paper · arXiv

cs.CL

Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech
Published: 2026-05-25
Categories: cs.CL, cs.AI

via arXiv · 2605.26045