Better starting scale for Muon


Muon applies Newton-Schulz (NS) iteration to a matrix with spectral norm (its largest singular value) at most 1, to get a matrix whose singular values are all ~1.

Your starting matrix $G$'s spectral norm might be greater than 1, so Muon normalizes it to $X = G / \|G\|_F$. This uses the Frobenius norm as an upper bound on the spectral norm: $\sigma_1(G) \leq \|G\|_F = \sqrt{\sum_i \sigma_i(G)^2}$, so $\sigma_1(X) = \sigma_1(G) / \|G\|_F \leq 1$.

If the bound $\sigma_1(G) \leq \sqrt{\sum_i \sigma_i(G)^2}$ is loose, then $X$'s singular values are small, and the NS iteration struggles to bring them back up. This happens for the high-dimensional matrices used in ML.
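
For reference, a minimal sketch of that standard starting point, with the quintic coefficients from Keller Jordan's reference Muon implementation (dtype handling omitted):

```python
import torch

def ns_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration, roughly as in the Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # Frobenius norm guarantees sigma_1(X) <= 1,
                               # but can leave every singular value << 1
    flip = G.shape[0] > G.shape[1]
    if flip:
        X = X.T                # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X
```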

The bound is tighter for higher-order norms: $\sigma_1(G) \leq \sqrt[8]{\sum_i \sigma_i(G)^8} \leq \sqrt{\sum_i \sigma_i(G)^2}$.

Muon already calculates $(X^T X)^2$, and $\|(X^T X)^2\|_F = \sqrt{\sum_i \sigma_i(X)^8}$. Its fourth root is the spectral norm estimate we want.

So our new starting point is $X = G / \|(G^T G)^2\|_F^{1/4}$.
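
Dropped into the sketch above, the rescale reuses the $(X X^T)^2$ that the quintic computes anyway (it has the same nonzero singular values as $(X^T X)^2$), so the extra cost is one Frobenius norm plus a few pointwise rescales. A sketch; the structure is mine, not the Muon codebase's:

```python
import torch

def ns_orthogonalize_rescaled(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # coarse Frobenius bound first, to avoid overflow
    flip = G.shape[0] > G.shape[1]
    if flip:
        X = X.T
    for step in range(steps):
        A = X @ X.T
        A2 = A @ A             # (X X^T)^2, needed by the quintic anyway
        if step == 0:
            # ||(X X^T)^2||_F^(1/4) = (sum_i sigma_i(X)^8)^(1/8) >= sigma_1(X),
            # a much tighter upper bound than the Frobenius norm.
            s = A2.norm().pow(0.25) + 1e-7
            X, A, A2 = X / s, A / s**2, A2 / s**4
        X = a * X + (b * A + c * A2) @ X
    return X.T if flip else X
```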

For random matrices of practical sizes, this $X$ is 9x larger than the original $X$. But with a Polar Express baseline, the final accuracy of NS doesn't improve at all. Meh. The next step would be to test Polar Express with $l$ raised 9x, but it's unlikely the existing $l$ is optimal, so the baseline would be unfair. There is no convenient comparison, so I won't continue.

ROOT claims to solve $l, u$ selection, but their paper doesn't describe any specifics, and the AdaNewton code is mysteriously missing from their GitHub repository.

Density-weighted $L^2$ makes more sense than $L^\infty$ as an optimization target for the NS coefficients, and would turn $l, u$ selection into a distribution measurement. But Leloy notes that $L^\infty$ simplifies the theory because it makes the greedy coefficients optimal, and that there are diminishing returns to coefficient accuracy.

Side note

There's also the Hölder identity $\sigma_1(G) \leq \sqrt{\|G\|_1 \|G\|_\infty}$. It's worse than $\|G\|_F$ on low-rank matrices and better on diagonal matrices. Your matrices are low-rank, not diagonal.
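
Here $\|G\|_1$ and $\|G\|_\infty$ are the induced norms, i.e. the max absolute column sum and max absolute row sum, so the bound costs one pass over the matrix:

```python
import torch

def holder_bound(G: torch.Tensor) -> torch.Tensor:
    # sigma_1(G) <= sqrt(||G||_1 * ||G||_inf), with ||.||_1 the max absolute
    # column sum and ||.||_inf the max absolute row sum (induced norms).
    col = G.abs().sum(dim=0).max()
    row = G.abs().sum(dim=1).max()
    return (col * row).sqrt()
```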

Nobody uses power iteration to estimate the spectral norm, and I'm sure everyone has already considered it. $(X^T X)^2$ is a bit of free power iteration.
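
For comparison, explicit power iteration would look like this minimal sketch (it approaches $\sigma_1$ from below, with accuracy set by the iteration count):

```python
import torch

def power_iteration_sigma1(G: torch.Tensor, iters: int = 10) -> torch.Tensor:
    # Power iteration on G^T G; the estimate ||G v|| converges to sigma_1(G)
    # from below as v aligns with the top right singular vector.
    v = torch.randn(G.shape[1], dtype=G.dtype, device=G.device)
    for _ in range(iters):
        v = G.T @ (G @ v)
        v = v / v.norm()
    return (G @ v).norm()
```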

Leloy: “power iteration before NS typically results in the top singular value > 1 (or even >> 1 sometimes, if I’m unlucky), destabilizing the NS”

With density-weighted $L^2$ coefficients, NS would tolerate incoming singular values above 1, but the power iteration still needs to be accurate enough to confine them to a small range. Maybe it's not. It depends on how much the momentum changes between steps in the top direction.

Float16

You can run NS in float16.

Calculating $(G^T G)^2$ will typically overflow, so you have to normalize $G$ to a reasonable scale beforehand; this costs a pointwise scaling operation. On random matrices with unit-std entries, the empirical maximum value of $(G^T G)^2$ was fit as $2.35 \times (m+n)^{1.12} \times n^{0.79}$. So the standard deviation of each element of $G$ should be $16c / (n(n+m))^{0.25}$, where $m$ is the smaller dimension and $c$ is a safety factor. $c = 1/16$ seemed to work.
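
A sketch of that prescaling, with $c$ exposed as a safety factor (the function name is mine):

```python
import torch

def prescale_for_float16(G: torch.Tensor, c: float = 1 / 16) -> torch.Tensor:
    # Scale G so the entries of (G^T G)^2 stay inside float16 range:
    # target element std = 16c / (n (n + m))^0.25, m = smaller dimension.
    m, n = sorted(G.shape)
    target_std = 16 * c / (n * (n + m)) ** 0.25
    return (G * (target_std / (G.std() + 1e-20))).to(torch.float16)
```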

I don’t know if the Hölder bound is better than the Frobenius norm here.

The 3 extra mantissa bits of float16 over bfloat16 give an 8x accuracy improvement on heavy-tailed matrices, zero improvement on Gaussian matrices, and a negative improvement on ill-conditioned matrices.

Etc

Thanks to Leloy for answering questions.

Github with code

To discuss this article: EleutherAI discord.