Better starting scale for Muon


Muon applies Newton-Schulz (NS) iteration to a matrix with spectral norm (its largest singular value) at most 1, to get a matrix whose singular values are all ~1.

Your starting matrix $G$'s spectral norm might be greater than 1, so Muon normalizes it to $X = G / \|G\|_F$. This uses the Frobenius norm as an upper bound on the spectral norm: $\sigma_1(G) \leq \|G\|_F = \sqrt{\sum_i \sigma_i(G)^2}$, so $\sigma_1(X) = \sigma_1(G) / \|G\|_F \leq 1$.

If the bound $\sigma_1(G) \leq \sqrt{\sum_i \sigma_i(G)^2}$ is loose, then $X$'s singular values are small, and the NS iteration struggles to bring them back up. This happens for the high-dimensional matrices used in ML.
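
For reference, a minimal sketch of that standard starting point, with the quintic coefficients from Keller Jordan's reference Muon implementation (dtype handling omitted):

```python
import torch

def ns_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration, roughly as in the Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # Frobenius norm guarantees sigma_1(X) <= 1,
                               # but can leave every singular value << 1
    flip = G.shape[0] > G.shape[1]
    if flip:
        X = X.T                # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X
```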

The bound is tighter for higher-order norms: $\sigma_1(G) \leq \sqrt[8]{\sum_i \sigma_i(G)^8} \leq \sqrt{\sum_i \sigma_i(G)^2}$.

Muon already calculates $(X^T X)^2$, and $\|(X^T X)^2\|_F = \sqrt{\sum_i \sigma_i(X)^8}$. Its fourth root is the spectral norm estimate we want.

So our new starting point is $X = G / \|(G^T G)^2\|_F^{1/4}$.
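
Dropped into the sketch above, the rescale reuses the $(X X^T)^2$ that the quintic computes anyway (it has the same nonzero singular values as $(X^T X)^2$), so the extra cost is one Frobenius norm plus a few pointwise rescales. A sketch; the structure is mine, not the Muon codebase's:

```python
import torch

def ns_orthogonalize_rescaled(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # coarse Frobenius bound first, to avoid overflow
    flip = G.shape[0] > G.shape[1]
    if flip:
        X = X.T
    for step in range(steps):
        A = X @ X.T
        A2 = A @ A             # (X X^T)^2, needed by the quintic anyway
        if step == 0:
            # ||(X X^T)^2||_F^(1/4) = (sum_i sigma_i(X)^8)^(1/8) >= sigma_1(X),
            # a much tighter upper bound than the Frobenius norm.
            s = A2.norm().pow(0.25) + 1e-7
            X, A, A2 = X / s, A / s**2, A2 / s**4
        X = a * X + (b * A + c * A2) @ X
    return X.T if flip else X
```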

For random matrices of practical sizes, this $X$ is 9x larger than the original $X$. But with a Polar Express baseline, the final accuracy of NS doesn't improve at all. Meh. The next step would be to test Polar Express with $l$ raised 9x, but it's unlikely the existing $l$ is optimal, so the baseline would be unfair. There is no convenient comparison, so I won't continue.

ROOT claims to solve $l, u$ selection, but their paper doesn't describe any specifics, and the AdaNewton code is mysteriously missing from their GitHub repository.

Density-weighted $L^2$ makes more sense than $L^\infty$ as an optimization target for the NS coefficients, and would turn $l, u$ selection into a distribution measurement. But Leloy notes that $L^\infty$ simplifies the theory because it makes the greedy coefficients optimal, and that there are diminishing returns to coefficient accuracy.

Side note

There's also the Hölder identity $\sigma_1(G) \leq \sqrt{\|G\|_1 \|G\|_\infty}$. It's worse than $\|G\|_F$ on low-rank matrices and better on diagonal matrices. Your matrices are low-rank, not diagonal.
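
Here $\|G\|_1$ and $\|G\|_\infty$ are the induced norms, i.e. the max absolute column sum and max absolute row sum, so the bound costs one pass over the matrix:

```python
import torch

def holder_bound(G: torch.Tensor) -> torch.Tensor:
    # sigma_1(G) <= sqrt(||G||_1 * ||G||_inf), with ||.||_1 the max absolute
    # column sum and ||.||_inf the max absolute row sum (induced norms).
    col = G.abs().sum(dim=0).max()
    row = G.abs().sum(dim=1).max()
    return (col * row).sqrt()
```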

Nobody uses power iteration to estimate the spectral norm, and I'm sure everyone has already considered it. $(X^T X)^2$ is a bit of free power iteration.
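
For comparison, explicit power iteration would look like this minimal sketch (it approaches $\sigma_1$ from below, with accuracy set by the iteration count):

```python
import torch

def power_iteration_sigma1(G: torch.Tensor, iters: int = 10) -> torch.Tensor:
    # Power iteration on G^T G; the estimate ||G v|| converges to sigma_1(G)
    # from below as v aligns with the top right singular vector.
    v = torch.randn(G.shape[1], dtype=G.dtype, device=G.device)
    for _ in range(iters):
        v = G.T @ (G @ v)
        v = v / v.norm()
    return (G @ v).norm()
```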

Leloy: “power iteration before NS typically results in the top singular value > 1 (or even >> 1 sometimes, if I’m unlucky), destabilizing the NS”

With density-weighted $L^2$ coefficients, NS would tolerate incoming singular values above 1, but the power iteration still needs to be accurate enough to confine them to a small range. Maybe it's not. It depends on how much the momentum changes between steps in the top direction.

Float16

You can run NS in float16.

Calculating $(G^T G)^2$ will typically overflow, so you have to normalize $G$ to a reasonable scale beforehand; this costs a pointwise scaling operation. On random matrices with unit-std entries, the empirical maximum value of $(G^T G)^2$ was fit as $2.35 \times (m+n)^{1.12} \times n^{0.79}$. So the standard deviation of each element of $G$ should be $16c / (n(n+m))^{0.25}$, where $m$ is the smaller dimension and $c$ is a safety factor. $c = 1/16$ seemed to work.
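
A sketch of that prescaling, with $c$ exposed as a safety factor (the function name is mine):

```python
import torch

def prescale_for_float16(G: torch.Tensor, c: float = 1 / 16) -> torch.Tensor:
    # Scale G so the entries of (G^T G)^2 stay inside float16 range:
    # target element std = 16c / (n (n + m))^0.25, m = smaller dimension.
    m, n = sorted(G.shape)
    target_std = 16 * c / (n * (n + m)) ** 0.25
    return (G * (target_std / (G.std() + 1e-20))).to(torch.float16)
```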

I don’t know if the Hölder bound is better than the Frobenius norm here.

The 3 extra mantissa bits of float16 over bfloat16 give an 8x accuracy improvement on heavy-tailed matrices, zero improvement on Gaussian matrices, and a negative improvement on ill-conditioned matrices.

Etc

Thanks to Leloy for answering questions.

Github with code

To discuss this article: EleutherAI discord.