
PowLU: An Activation Function for Stable Pre-Training of LLMs

In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximate quadratic
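The quadratic behavior described in the abstract can be checked numerically. The sketch below is an illustration only, not code from the paper: it assumes a scalar SwiGLU with the projection weights omitted and with the gate and value inputs both set to the same number `x`, so the output is `silu(x) * x`. The helper names (`swiglu_scalar`, `quadratic_ratio`, `exceeds_fp16`) are hypothetical. The constant 65504 is the largest finite IEEE 754 half-precision value, used here as a stand-in for a low-precision training format.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def silu(x: float) -> float:
    """Swish/SiLU with beta = 1, i.e. x * sigmoid(x), in a numerically stable form."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu_scalar(gate: float, value: float) -> float:
    """Scalar SwiGLU sketch: silu(gate) * value (assumption: projections omitted)."""
    return silu(gate) * value


def quadratic_ratio(x: float) -> float:
    """Ratio of swiglu_scalar(x, x) to x**2; algebraically this is sigmoid(x)."""
    return swiglu_scalar(x, x) / (x * x)


def exceeds_fp16(x: float) -> bool:
    """True if swiglu_scalar(x, x) is larger in magnitude than FP16_MAX."""
    return abs(swiglu_scalar(x, x)) > FP16_MAX


if __name__ == "__main__":
    # Print the output, its ratio to x**2, and whether it passes the FP16 limit.
    for x in (1.0, 4.0, 16.0, 256.0):
        print(x, swiglu_scalar(x, x), quadratic_ratio(x), exceeds_fp16(x))
```

Under these assumptions the ratio to $x^2$ equals the sigmoid of `x`, so it approaches 1 as `x` grows, and an input of 256 already produces an output beyond the half-precision range. How the instability shows up in an actual model depends on the weights and the number format in use.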

Paper · arXiv

cs.CL

Authors: Peijie Jiang, Yuqi Feng, Cunyin Peng, Qian Zhao, Jia Liu + 3 more
Published: 2026-05-25
Categories: cs.CL, cs.LG


via arXiv · 2605.25704