Better starting scale for Muon
Muon applies Newton-Schulz (NS) iteration to a matrix with spectral norm (its largest singular value) at most 1, to get a matrix whose singular values are all ~1.
Your starting matrix $G$'s spectral norm might be greater than 1, so Muon modifies it to $X_0 = G / \|G\|_F$. This uses the Frobenius norm as an upper bound for the spectral norm: $\|G\|_2 \le \|G\|_F$, so $\|X_0\|_2 \le 1$.
If the bound is loose, then $X_0$'s singular values are small, and the NS iteration struggles to bring them back up. This happens for the high-dimensional matrices used in ML.
The bound is tighter for higher-order Schatten norms: $\|G\|_2 \le \|G\|_{S_4} \le \|G\|_F$, where $\|G\|_{S_p} = (\sum_i \sigma_i^p)^{1/p}$.
Muon already calculates $GG^\top$ and $(GG^\top)^2$, and $\operatorname{tr}\big((GG^\top)^2\big) = \|GG^\top\|_F^2 = \sum_i \sigma_i^4$. Its fourth root is the spectral norm estimate we want.
So our new starting point is $X_0 = G / \|GG^\top\|_F^{1/2}$.
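A minimal numpy sketch of the two normalizations (the sizes and the resulting ratio are illustrative, not the note's benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 1024))

# Frobenius bound:  ||G||_F          = (sum sigma_i^2)^(1/2)
fro = np.linalg.norm(G)
# Schatten-4 bound: ||G G^T||_F^(1/2) = (sum sigma_i^4)^(1/4)
s4 = np.linalg.norm(G @ G.T) ** 0.5

sigma_max = np.linalg.norm(G, 2)  # true spectral norm
# Both are valid upper bounds; the Schatten-4 one is much tighter,
# so G / s4 starts NS with singular values fro / s4 times larger:
assert sigma_max <= s4 <= fro
print(fro / s4)
```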
For random matrices of practical sizes, this is 9x larger than the original $X_0$. But with a Polar Express baseline, the final accuracy of NS doesn’t improve at all. Meh. The next step is to test a 9x larger lower bound $\ell$ in Polar Express, but it’s unlikely the existing $\ell$ is optimal, so the baseline would be unfair. There is no convenient comparison so I won’t continue.
ROOT claims to solve the selection of the lower bound $\ell$, but their paper doesn’t describe any specifics and the AdaNewton code is mysteriously missing from their GitHub repository.
A density-weighted error makes more sense as an optimization target for the NS coefficients than the worst-case error over $[\ell, 1]$, and would turn the $\ell$ selection into a distribution measurement. But Leloy notes that the worst-case target simplifies the theory because it makes the greedy coefficients optimal, and that there are diminishing returns for coefficient accuracy.
Side note
There’s also the Hölder bound $\|G\|_2 \le \sqrt{\|G\|_1 \, \|G\|_\infty}$ (the induced 1- and $\infty$-norms: max absolute column sum and max absolute row sum). It’s worse than $\|GG^\top\|_F^{1/2}$ on low-rank matrices and better on diagonal matrices. Your matrices are low-rank, not diagonal.
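A quick check of that behavior, reading the Hölder bound as $\|G\|_2 \le \sqrt{\|G\|_1 \|G\|_\infty}$ with the induced 1- and $\infty$-norms (the matrices here are toy examples):

```python
import numpy as np

def holder_bound(G):
    # ||G||_2 <= sqrt(||G||_1 * ||G||_inf): max abs column sum times max abs row sum
    return np.sqrt(np.abs(G).sum(axis=0).max() * np.abs(G).sum(axis=1).max())

def schatten4_bound(G):
    # ||G G^T||_F^(1/2) = (sum sigma_i^4)^(1/4)
    return np.linalg.norm(G @ G.T) ** 0.5

rng = np.random.default_rng(0)
n = 512
low_rank = np.outer(rng.standard_normal(n), rng.standard_normal(n))  # rank 1
diagonal = np.diag(rng.standard_normal(n))

# Overestimation factor of each bound (1.0 would be exact):
for name, M in [("low-rank", low_rank), ("diagonal", diagonal)]:
    sigma_max = np.linalg.norm(M, 2)
    print(name, holder_bound(M) / sigma_max, schatten4_bound(M) / sigma_max)
```

On the rank-1 matrix the Schatten-4 bound is exact while Hölder is loose; on the diagonal matrix it's the other way around.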
Nobody uses power iteration to estimate the spectral norm, and I’m sure everyone has already considered it. Computing $GG^\top$ is a bit of free power iteration.
Leloy: “power iteration before NS typically results in the top singular value > 1 (or even >> 1 sometimes, if I’m unlucky), destabilizing the NS”
With density-weighted coefficients, NS would tolerate incoming singular values above 1, but the power iteration still needs to be accurate enough to confine them to a small range. Maybe it’s not. It depends on how much momentum changes between steps in the top direction.
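For concreteness, the destabilization mechanism in Leloy's quote: power iteration's estimate is always a lower bound on $\|G\|_2$, so dividing by it leaves the top singular value at or above 1 (a toy sketch, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 1024))

# A few steps of power iteration on G^T G. For any unit vector v,
# ||G v|| <= sigma_max, so the estimate can only undershoot.
v = rng.standard_normal(1024)
for _ in range(3):
    v = G.T @ (G @ v)
    v /= np.linalg.norm(v)
sigma_est = np.linalg.norm(G @ v)

sigma_max = np.linalg.norm(G, 2)
# Normalizing by sigma_est therefore leaves the top singular value >= 1:
assert sigma_est <= sigma_max
print(sigma_max / sigma_est)
```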
Float16
You can run NS in float16.
Calculating $GG^\top$ will typically overflow, so you have to normalize $G$ to a reasonable scale beforehand - this costs a pointwise scaling operation. On random matrices with unit-std entries, the empirical maximum entry of $GG^\top$ was fit as roughly linear in $n$. So the standard deviation of each element of $G$ should be about $\sqrt{65504 / (c \, n)}$, where 65504 is the float16 maximum, $n$ is the smaller dimension, and $c$ is a safety factor. A small constant $c$ seemed to work.
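A sketch of the pre-scaling. The safety factor and the linear-in-$n$ model of the max entry are my stand-ins for the note's empirical fit, not its actual numbers:

```python
import numpy as np

FP16_MAX = 65504.0  # largest finite float16 value

def prescale_for_fp16(G, safety=16.0):
    """Rescale G so that G @ G.T stays within float16 range.

    Model (assumption): for entries of std s, the largest entry of
    G @ G.T is roughly s^2 * n, so target s = sqrt(FP16_MAX / (safety * n)).
    """
    n = min(G.shape)  # the smaller dimension, as in the note
    target_std = np.sqrt(FP16_MAX / (safety * n))
    scale = target_std / G.std()
    return (G * scale).astype(np.float16), scale

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 1024)) * 100.0  # arbitrary incoming scale
G16, scale = prescale_for_fp16(G)
A = G16 @ G16.T
assert np.isfinite(A.astype(np.float32)).all()  # no overflow
```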
I don’t know if the Hölder bound is better than the Frobenius norm here.
The 3 extra mantissa bits of float16 over bfloat16 give an 8x accuracy improvement for heavy-tailed matrices, zero improvement for Gaussian matrices, and negative improvement on ill-conditioned matrices.
Etc
Thanks to Leloy for answering questions.
To discuss this article: EleutherAI discord.