Optimizer design matched to neural network layer symmetries
Original: Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Source: arxiv.org
Who: arXiv preprint by Tim Tsz-Kit Lau and Weijie Su, both of the University of Pennsylvania.
What's new: This paper argues that the most popular AI training algorithms ignore a fundamental property of neural networks, symmetry, and proposes a principled fix. The authors introduce a design rule they call the "symmetry-compatible principle": every block of weights in a model should be updated by an optimizer that respects the geometric symmetries of that block. This sounds abstract, but the practical result is a complete, matched set of updaters, one for each major type of weight matrix in modern language models.
How it works: Standard optimizers like AdamW treat every number in a weight matrix as independent, ignoring structure. The authors instead classify each weight type (embeddings, LM heads, SwiGLU MLP projections, and MoE routers) by the mathematical symmetries it possesses, then derive update rules that cannot violate those symmetries. The result is a layerwise stack in which each major weight class gets its own geometrically appropriate updater, including novel variants with names like "row-norm" and "left-spectral" updates.
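To make "respecting a symmetry" concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that compatibility means equivariance: transforming the gradient and then applying the update should give the same result as applying the update and then transforming it. The three update functions and the helper names are illustrative assumptions. In particular, `row_norm_update` and `spectral_update` are generic row normalization and SVD-based orthogonalization, and may differ from the paper's "row-norm" and "left-spectral" rules.

```python
import numpy as np

def sign_update(g):
    # Elementwise update (a stand-in for per-coordinate methods; assumption).
    return np.sign(g)

def row_norm_update(g, eps=1e-12):
    # Scale every row of the gradient to unit Euclidean norm.
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)

def spectral_update(g):
    # Replace the gradient by its orthogonal polar factor U V^T.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def random_orthogonal(n, rng):
    # Orthogonal matrix from the QR factorization of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def equivariance_gap(update, g, p, q):
    # Zero (up to rounding) when update(p g q) == p update(g) q.
    return float(np.linalg.norm(update(p @ g @ q) - p @ update(g) @ q))

rng = np.random.default_rng(0)
g = rng.standard_normal((6, 4))   # hypothetical gradient of a 6x4 weight block
p = random_orthogonal(6, rng)     # rotation acting on the rows (left side)
q = random_orthogonal(4, rng)     # rotation acting on the columns (right side)
eye6 = np.eye(6)

# Right rotations only: row norms are preserved, elementwise signs are not.
gap_sign = equivariance_gap(sign_update, g, eye6, q)
gap_row = equivariance_gap(row_norm_update, g, eye6, q)
# Rotations on both sides: the polar factor transforms along with the gradient.
gap_spec = equivariance_gap(spectral_update, g, p, q)

print("sign gap:", gap_sign)
print("row-norm gap:", gap_row)
print("spectral gap:", gap_spec)
```

In this sketch the elementwise update leaves a clearly nonzero gap, while the two structured updates match to floating-point precision. Which symmetry group applies to which weight class, and therefore which update belongs to which layer, is determined in the paper and is not reproduced here.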
The numbers: The authors run experiments on several model architectures styled after real public models — including -0.6B, 1B, and -1B-7B style architectures. Symmetry-compatible updates consistently reduce the final compared to AdamW across all tested configurations, and in some cases also improve training stability.
Why it matters: Most AI research improves what models are trained on or how large they are. This work improves the training algorithm itself, which sits underneath every model regardless of size or data. A better optimizer that costs no extra compute at inference time is a free performance gain, and this paper offers a theoretically grounded reason — not just empirical luck — for why the gains appear.