What is adaptive momentum in optimizers like Adam?

Updated May 16, 2026

Short answer

Adaptive momentum combines momentum with adaptive learning rates per parameter.

Deep explanation

Adam maintains exponentially decaying averages of gradients (momentum) and squared gradients (variance). This allows per-parameter adaptive step sizes and faster convergence in deep learning models.

Real-world example

Training transformers like GPT models efficiently.

Common mistakes

  • Assuming Adam always generalizes better than SGD.

Follow-up questions

  • Why bias correction is needed?
  • What is AdamW?

More Gradient Descent interview questions

View all →