seniorGradient Descent
What is adaptive momentum in optimizers like Adam?
Updated May 16, 2026
Short answer
Adaptive momentum combines momentum with adaptive learning rates per parameter.
Deep explanation
Adam maintains exponentially decaying averages of gradients (momentum) and squared gradients (variance). This allows per-parameter adaptive step sizes and faster convergence in deep learning models.
Real-world example
Training transformers like GPT models efficiently.
Common mistakes
- Assuming Adam always generalizes better than SGD.
Follow-up questions
- Why bias correction is needed?
- What is AdamW?