What are sharp vs. flat minima in gradient descent?

Updated May 16, 2026

Short answer

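Flat minima are wide, low-loss regions where small parameter changes barely affect the loss; sharp minima are narrow regions where the loss rises steeply under the same small changes.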

Deep explanation

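Flat minima tend to generalize better because small perturbations in the parameters do not significantly increase the loss, so a small mismatch between the training and test loss surfaces costs little. At a sharp minimum the same perturbation raises the loss steeply, which is why sharp minima are often associated with overfitting.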
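The perturbation argument above can be checked numerically. The following is a minimal sketch, not a standard API: `sharp_loss`, `flat_loss`, and `sharpness` are hypothetical names, and the two quadratic bowls are toy stand-ins for real loss surfaces. Both bowls have the same minimum value at the same point, but the average loss increase under a fixed-size random perturbation differs by the ratio of their curvatures.

```python
import numpy as np

def sharp_loss(w):
    # Toy high-curvature bowl (assumption: stands in for a sharp minimum).
    return 50.0 * np.sum(w ** 2)

def flat_loss(w):
    # Toy low-curvature bowl (assumption: stands in for a flat minimum).
    return 0.5 * np.sum(w ** 2)

def sharpness(loss_fn, w_star, radius=0.1, n_samples=1000, seed=0):
    """Average loss increase when w_star is moved by `radius` in random directions."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w_star)
    # Draw random directions and rescale each one to unit length.
    directions = rng.normal(size=(n_samples, w_star.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    increases = [loss_fn(w_star + radius * d) - base for d in directions]
    return float(np.mean(increases))

w_star = np.zeros(10)  # both toy losses reach their minimum (0.0) here
sharp_score = sharpness(sharp_loss, w_star)
flat_score = sharpness(flat_loss, w_star)
print("sharp minimum:", sharp_score)
print("flat minimum: ", flat_score)
```

Both minima have identical loss, so the loss value alone cannot tell them apart; only the perturbation test does. On a real model the same idea would apply to the trained weights, though the result depends on the chosen radius and on how the parameters are scaled.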

Real-world example

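Deep learning models often generalize better when trained with noise, for example the gradient noise of small-batch SGD, which tends to steer optimization away from sharp minima and toward flatter ones.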
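One idealized way to see why noise can help is to treat it as averaging the loss over small random parameter shifts. The sketch below makes that assumption explicit and is not a model of any particular training setup: `loss` and `smoothed_loss` are hypothetical names, and the 1-D loss is a toy with a narrow basin at x = -1 (slightly lower) and a wide basin at x = 2.

```python
import numpy as np

def loss(x):
    # Toy 1-D loss with two basins (assumption, for illustration only).
    sharp = 50.0 * (x + 1.0) ** 2        # narrow basin, minimum value 0.0 at x = -1
    flat = 0.5 * (x - 2.0) ** 2 + 0.05   # wide basin, minimum value 0.05 at x = 2
    return np.minimum(sharp, flat)

def smoothed_loss(xs, sigma=0.3, n_nodes=801):
    """Loss averaged over Gaussian parameter noise with standard deviation sigma."""
    # Fixed grid of standardized noise values with normalized Gaussian weights,
    # so the result is deterministic.
    z = np.linspace(-4.0, 4.0, n_nodes)
    weights = np.exp(-0.5 * z ** 2)
    weights /= weights.sum()
    perturbed = xs[:, None] + sigma * z[None, :]
    return loss(perturbed) @ weights

xs = np.linspace(-3.0, 4.0, 1401)
x_raw = xs[np.argmin(loss(xs))]              # minimizer of the plain loss
x_noisy = xs[np.argmin(smoothed_loss(xs))]   # minimizer of the noise-averaged loss
print("plain loss minimizer:         ", x_raw)
print("noise-averaged loss minimizer:", x_noisy)
```

The plain loss prefers the narrow basin because its value is slightly lower, while the noise-averaged loss prefers the wide basin because nearby points there are also low. Real SGD noise is not Gaussian parameter noise, so this is an intuition aid rather than a description of actual training dynamics.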

Common mistakes

  • Assuming the lowest training loss always means the best model.

Follow-up questions

  • Why do flat minima generalize better?
  • How can you encourage flat minima?
