What is the vanishing gradient problem in Deep Learning?
Updated May 16, 2026
Short answer
The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing deep neural networks from learning effectively.
Deep explanation
During backpropagation, gradients are propagated backward through multiple layers using the chain rule. If activation derivatives are very small, repeated multiplication causes gradients to shrink exponentially as they move toward earlier layers. As a result, early layers learn extremely slowly or stop learning entirely. This issue is common with sigmoid and tanh activations because their derivatives saturate near zero. Modern architectures mitigate this problem using ReLU activations, residual connections, batch normalization, and improved weight initialization techniques like He initialization.
Real-world example
Early deep image recognition systems struggled to train networks deeper than a few layers because gradients vanished before reaching lower layers.
Common mistakes
- Using sigmoid activations in many hidden layers without normalization or residual connections.
Follow-up questions
- Why does ReLU reduce vanishing gradients?
- What is exploding gradient?
- How do residual connections help?