machine learning model training problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation
Discovered by embedding cosine similarity (sentence-transformers MiniLM, 384-dim).