Senior · Gradient Descent
What are sharp vs. flat minima in gradient descent?
Updated May 16, 2026
Short answer
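Flat minima are wide, low-loss regions where the loss changes little as the parameters move; sharp minima are narrow regions where small parameter changes raise the loss quickly.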
Deep explanation
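Flat minima tend to generalize better because small perturbations of the parameters do not significantly increase the loss, so a modest shift between the training and test loss surfaces costs little. Sharp minima are sensitive to such perturbations and are often associated with overfitting. Sharpness is usually described through the curvature of the loss at the minimum: high curvature means sharp, low curvature means flat.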
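The sensitivity argument above can be checked numerically. The following is a minimal sketch on a toy 1D loss, not a real training objective: the function loss, its two basins, and the helpers curvature and worst_case_increase are all hypothetical names chosen for illustration. Both minima reach the same loss of 0, so only their width differs.

```python
import numpy as np

def loss(w):
    """Hypothetical toy loss with two equally deep minima (both reach 0)."""
    sharp = 50.0 * (w + 1.0) ** 2  # narrow basin at w = -1, curvature 100
    flat = 0.5 * (w - 1.0) ** 2    # wide basin at w = +1, curvature 1
    return np.minimum(sharp, flat)

def curvature(f, w, h=1e-3):
    """Central finite-difference estimate of the second derivative at w."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / h**2

def worst_case_increase(f, w, eps=0.1):
    """Largest loss increase when w is shifted by +eps or -eps."""
    return max(f(w + eps), f(w - eps)) - f(w)

W_SHARP, W_FLAT = -1.0, 1.0  # locations of the two minima (assumed, by construction)

sharp_curv = curvature(loss, W_SHARP)
flat_curv = curvature(loss, W_FLAT)
sharp_rise = worst_case_increase(loss, W_SHARP)
flat_rise = worst_case_increase(loss, W_FLAT)

print(f"sharp minimum: curvature ~ {sharp_curv:.2f}, loss rise ~ {sharp_rise:.4f}")
print(f"flat minimum:  curvature ~ {flat_curv:.2f}, loss rise ~ {flat_rise:.4f}")
```

The same parameter shift of 0.1 costs about a hundred times more loss at the sharp minimum than at the flat one, even though both minima have identical loss. In real networks the second derivative becomes a Hessian, and its largest eigenvalue is one commonly used sharpness measure.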
Real-world example
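Deep learning models often generalize better when trained with noise, for example the gradient noise that comes from small mini-batches in SGD, which tends to steer training away from sharp minima.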
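One common intuition for why noise helps, stated here as an assumption rather than something the text above derives: if Gaussian noise is added to the parameters, the optimizer effectively sees the loss averaged over that noise. The sketch below estimates this noise-averaged loss at two minima of a hypothetical toy loss; the names loss and noise_averaged_loss, the noise scale, and the sample count are all illustrative choices.

```python
import numpy as np

def loss(w):
    """Hypothetical toy loss with two equally deep minima (both reach 0)."""
    sharp = 50.0 * (w + 1.0) ** 2  # narrow basin at w = -1
    flat = 0.5 * (w - 1.0) ** 2    # wide basin at w = +1
    return np.minimum(sharp, flat)

def noise_averaged_loss(f, w, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[f(w + eps)] with eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(loc=0.0, scale=sigma, size=n_samples)
    return float(np.mean(f(w + eps)))

SIGMA = 0.05  # assumed parameter-noise scale, chosen for illustration

noisy_sharp = noise_averaged_loss(loss, -1.0, SIGMA)
noisy_flat = noise_averaged_loss(loss, 1.0, SIGMA)

print(f"noise-averaged loss at sharp minimum: {noisy_sharp:.5f}")
print(f"noise-averaged loss at flat minimum:  {noisy_flat:.5f}")
```

Without noise the two minima are indistinguishable, since both have loss 0. Under noise the sharp minimum looks much worse, so an optimizer that experiences noise has a reason to prefer the flat one. This is a simplified picture: mini-batch gradient noise in SGD is not identical to Gaussian parameter noise.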
Common mistakes
- Assuming the lowest training loss always means the best model.
Follow-up questions
- Why do flat minima generalize better?
- How can you encourage flat minima?