Consistency Training Can Entrench Misalignment

Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 "model organisms": open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We fi

Paper · arXiv

Authors: David Demitri Africa, Arathi Mani
Published: 2026-06-02
Categories: cs.CL, cs.AI

via arXiv · 2606.0381