Fixes for Nesterov momentum in ML
This article points out errors in everyone’s Nesterov momentum algorithms for ML. If you don’t use Nesterov momentum, it won’t affect you.
$m_{t+1} = \beta_t m_t + g_t$
$\theta_{t+1} = \theta_t - \eta_t m_{t+1}$

$m$ is a momentum buffer. $\beta$ is a decay scalar. $\theta$ are the model parameters. $g_t = \nabla L(\theta_t)$ is a gradient. $\eta$ is the learning rate. This form is standard among practitioners. (Older papers move $\eta$ elsewhere, e.g. inside the momentum buffer.)
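In code, that form is a buffer decay plus a parameter step. A minimal sketch (the function name and the momentum_buffers dict are mine, not any library's):

```python
import torch

@torch.no_grad()
def momentum_sgd_step(params, momentum_buffers, lr, beta):
    """Plain (non-Nesterov) momentum in the practitioner-standard form."""
    for p in params:
        if p.grad is None:
            continue
        buf = momentum_buffers.setdefault(p, torch.zeros_like(p))
        buf.mul_(beta).add_(p.grad)   # m_{t+1} = beta * m_t + g_t
        p.sub_(buf, alpha=lr)         # theta_{t+1} = theta_t - lr * m_{t+1}
```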
Sutskever introduced Nesterov momentum to Machine Learning, calculating the gradient at a lookahead position:

$m_{t+1} = \beta_t m_t + \nabla L(\theta_t - \eta_t \beta_t m_t)$
$\theta_{t+1} = \theta_t - \eta_t m_{t+1}$
Nesterov momentum is optimal in convex optimization in a narrow sense: it attains the best possible worst-case rate for first-order methods on smooth convex problems. Sometimes it helps in ML.
Bengio substituted $\tilde\theta_t = \theta_t - \eta_t \beta_t m_t$, baking in the lookahead:

$m_{t+1} = \beta_t m_t + \nabla L(\tilde\theta_t)$
$\tilde\theta_{t+1} = \tilde\theta_t - \eta_t \nabla L(\tilde\theta_t) - \eta_{t+1} \beta_{t+1} m_{t+1}$
The lookahead momentum is multiplied by $\eta_{t+1}$ and $\beta_{t+1}$ - the learning rate and momentum decay for the next step, not this step!
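To make the fix concrete, here is a sketch of the Bengio-form step with scheduled hyperparameters. The commented-out line is the common variant, which reuses this step's $\eta_t$ and $\beta_t$ for the lookahead term; the names and buffer bookkeeping are mine:

```python
import torch

@torch.no_grad()
def nesterov_sgd_step(params, momentum_buffers, lr_t, lr_next, beta_t, beta_next):
    """Bengio-form Nesterov SGD with time-varying lr/momentum (a sketch).

    Assumes the parameters hold the lookahead values and .grad was computed there.
    """
    for p in params:
        if p.grad is None:
            continue
        buf = momentum_buffers.setdefault(p, torch.zeros_like(p))
        buf.mul_(beta_t).add_(p.grad)              # m_{t+1} = beta_t * m_t + g_t
        p.sub_(p.grad, alpha=lr_t)                 # - lr_t * g_t
        p.sub_(buf, alpha=lr_next * beta_next)     # - lr_{t+1} * beta_{t+1} * m_{t+1}
        # common variant: p.sub_(buf, alpha=lr_t * beta_t)  (this step's values: the bug)
```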
You’ll need to retrieve $\eta_{t+1}$ and $\beta_{t+1}$ from your schedulers, but PyTorch’s learning rate schedulers have an off-by-one issue that makes this unsolvable. The standard training loop counterbalances it with another off-by-one error, reversing the order of the optimizer step and the scheduler step. Loading a checkpointed scheduler requires yet another off-by-one adjustment. You should roll your own random-access scheduler and not use torch’s scheduler API.
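A minimal sketch of the roll-your-own approach: make the schedule a pure function of the step index, so $\eta_t$ and $\eta_{t+1}$ (and the momentum schedule, if any) can be queried for any step with no scheduler state to save, load, or keep in sync. The cosine-with-warmup shape and all names here are illustrative assumptions:

```python
import math

def lr_at(step: int, base_lr: float = 3e-4, warmup: int = 1_000, total: int = 100_000) -> float:
    """Random-access learning-rate schedule: a pure function of the step index."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

step = 5_000
eta_t, eta_next = lr_at(step), lr_at(step + 1)   # this step and next step, no off-by-one games

# Applied directly, no torch scheduler object involved:
# for group in optimizer.param_groups:
#     group["lr"] = eta_t
```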
Adam
LaProp can use Nesterov momentum directly, because it applies momentum after dividing by the second-moment denominator, so its momentum buffer already lives in update space. I don’t know anyone using LaProp except Lucas Nestler.
Adam requires more work. Do not naively apply Bengio’s version inside the numerator, as NAdam does. That is wrong because it neglects the interaction with Adam’s denominator: when the denominator changes between steps, as it does during loss spikes or with low $\beta_2$, the lookahead and retraction movements do not cancel.
The non-Nesterov Adam update is $\theta_{t+1} = \theta_t - \eta_t \, \hat m_{t+1} / (\sqrt{\hat v_{t+1}} + \epsilon)$, where $\hat m$ and $\hat v$ are the bias-corrected first and second moments. Write the whole update direction as $u_{t+1} = \hat m_{t+1} / (\sqrt{\hat v_{t+1}} + \epsilon)$, so that $\theta_{t+1} = \theta_t - \eta_t u_{t+1}$.
For correct Nesterov acceleration, we calculate the gradient at a lookahead position:

$g_t = \nabla L(\theta_t - \eta_t \beta_t u_t)$

where $u_t$ is last step’s whole update direction and $\beta_t$ is a Nesterov decay. With lookahead parameters $\tilde\theta_t = \theta_t - \eta_t \beta_t u_t$, the update becomes:

$\tilde\theta_{t+1} = \tilde\theta_t + \eta_t \beta_t u_t - \eta_t u_{t+1} - \eta_{t+1} \beta_{t+1} u_{t+1}$
You can recalculate $u_t$ using the optimizer’s buffers before updating them with the gradient.
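Here is a minimal sketch of that whole step for plain torch.optim.Adam (no weight decay, no AMSGrad). The function names, the nes_t/nes_next Nesterov decay, and the bookkeeping are my assumptions, not PyTorch's API; it assumes the parameters currently hold the lookahead values and that .grad was computed there:

```python
import torch

@torch.no_grad()
def _adam_direction(p, group, state):
    """Recompute Adam's update direction u = m_hat / (sqrt(v_hat) + eps) from its buffers."""
    if "exp_avg" not in state:
        return torch.zeros_like(p)          # first step: no history yet, lookahead is zero
    beta1, beta2 = group["betas"]
    step = state["step"]
    step = int(step.item()) if torch.is_tensor(step) else int(step)
    m_hat = state["exp_avg"] / (1 - beta1 ** step)
    v_hat = state["exp_avg_sq"] / (1 - beta2 ** step)
    return m_hat / (v_hat.sqrt() + group["eps"])


@torch.no_grad()
def nesterov_adam_step(optimizer, lr_t, lr_next, nes_t, nes_next):
    """Black-box Nesterov on top of plain Adam (a sketch, not an official API).

    Assumes the model's parameters hold the lookahead parameters and .grad was
    computed at them; nes_t / nes_next are the Nesterov decay for this step and
    the next (a free hyperparameter, not Adam's beta1).
    """
    entries = [(p, g) for g in optimizer.param_groups
               for p in g["params"] if p.grad is not None]

    # u_t: last step's update, recomputed from the buffers before this step's gradient touches them.
    u_prev = {p: _adam_direction(p, g, optimizer.state[p]) for p, g in entries}

    # theta~_{t+1} = theta~_t + lr_t*nes_t*u_t - lr_t*u_{t+1} - lr_{t+1}*nes_{t+1}*u_{t+1}
    for p, g in entries:
        p.add_(u_prev[p], alpha=lr_t * nes_t)         # retract the old lookahead

    for g in optimizer.param_groups:
        g["lr"] = lr_t
    optimizer.step()                                   # applies -lr_t * u_{t+1}

    for p, g in entries:
        u_next = _adam_direction(p, g, optimizer.state[p])
        p.sub_(u_next, alpha=lr_next * nes_next)       # new lookahead, next step's lr and decay
```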
This derivation handled Adam as a black box, without depending on its internal momentum behavior. So this formula can add Nesterov acceleration to any optimizer. There is a gap in this theory - the optimal Nesterov decay $\beta$ is not specified, and it lacks a connection to $\beta_1$ inside Adam.
This formula is Eq (7) in Bengio’s paper. What’s old is new again.
Muon
Muon has the same issue and the same solution, except that last step’s update is too expensive to recalculate (it would mean re-running the Newton-Schulz orthogonalization). Either store last step’s update (memory intensive!), give up on Nesterov momentum, or hope that Muon’s existing Nesterov-like momentum is fine.
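If you do pay the memory cost, the bookkeeping is the same as in the Adam case: store this step's orthogonalized update and reuse it as $u_t$ for next step's lookahead. A rough sketch, not tied to any particular Muon implementation (orthogonalize() stands in for the expensive Newton-Schulz step, and all names are mine):

```python
import torch

@torch.no_grad()
def muon_like_step(p, state, orthogonalize, lr_t, lr_next, beta_t, beta_next, mu):
    """Black-box Nesterov on top of a Muon-like optimizer, storing last step's update."""
    buf = state.setdefault("momentum", torch.zeros_like(p))
    u_prev = state.get("last_update")             # u_t, saved from the previous step
    if u_prev is not None:
        p.add_(u_prev, alpha=lr_t * beta_t)       # retract the old lookahead
    buf.mul_(mu).add_(p.grad)                     # the optimizer's own internal momentum
    u = orthogonalize(buf)                        # u_{t+1}, this step's update direction
    p.sub_(u, alpha=lr_t)                         # - lr_t * u_{t+1}
    p.sub_(u, alpha=lr_next * beta_next)          # new lookahead with next step's lr/decay
    state["last_update"] = u.clone()              # the memory cost mentioned above
```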
Weight decay
Apply weight decay to the original parameters, not the lookahead parameters.
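One reasonable reading of that, in code, when the stored tensors are the lookahead parameters: reconstruct $\theta_t = \tilde\theta_t + \eta_t \beta_t u_t$ and compute the (decoupled) decay from it. A sketch under those assumptions, with names carried over from the sketches above:

```python
import torch

@torch.no_grad()
def apply_weight_decay(p, u_prev, lr_t, beta_t, wd):
    """Decoupled weight decay computed from the original parameters, while p stores the lookahead."""
    if u_prev is None:                        # no lookahead yet (first step)
        p.mul_(1 - lr_t * wd)
        return
    theta = p + lr_t * beta_t * u_prev        # reconstruct theta_t from theta~_t
    p.sub_(theta, alpha=lr_t * wd)            # theta~ <- theta~ - lr_t * wd * theta_t
```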
Inference
Remember that lookahead and normal parameters are different. You can run inference with the normal parameters either by explicitly undoing the lookahead or by setting your schedules so that the two end up equal. It’s unclear whether the lookahead or the normal parameters are better.
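Undoing the lookahead is the same algebra run backwards: $\theta_t = \tilde\theta_t + \eta_t \beta_t u_t$. A sketch, reusing the per-parameter u_prev bookkeeping from the sketches above; the "make them equal" option amounts to scheduling $\eta_t \beta_t$ to zero by the final step, so the offset vanishes:

```python
import torch

@torch.no_grad()
def to_normal_parameters(params, u_prev, lr_t, beta_t):
    """Convert stored lookahead parameters back to the normal parameters for inference."""
    for p in params:
        u = u_prev.get(p)
        if u is not None:
            p.add_(u, alpha=lr_t * beta_t)    # theta_t = theta~_t + lr_t * beta_t * u_t
```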
Experiments
If this were a paper, this is where I would follow the time-honored tradition of sabotaging the baselines to show how important this correction is. I am confident I could insert a subtle optimizer flaw that wouldn’t be spotted, even if fully described. Regular folks would be excited by the large improvement numbers and experts would ignore them.
The optimizer I use is very different from Adam and I don’t want to re-implement Adam just to produce some charts.
Nesterov acceleration is only sometimes positive for model training. You could empirically find the optimal Nesterov-ness by testing different lookahead distances, an obvious idea re-discovered by many papers. I haven’t figured out what causes Nesterov to hurt or help.
Credits
Thanks to Alex Birch for doing my work so that I can write blog posts instead.
To discuss this article: EleutherAI discord or kevin+3qm@anlatan.ai.