What is the vanishing gradient problem in Deep Learning?

Updated May 16, 2026

Short answer

The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing deep neural networks from learning effectively.

Deep explanation

During backpropagation, gradients are propagated backward through multiple layers using the chain rule. If activation derivatives are very small, repeated multiplication causes gradients to shrink exponentially as they move toward earlier layers. As a result, early layers learn extremely slowly or stop learning entirely. This issue is common with sigmoid and tanh activations because their derivatives saturate near zero. Modern architectures mitigate this problem using ReLU activations, residual connections, batch normalization, and improved weight initialization techniques like He initialization.

Real-world example

Early deep image recognition systems struggled to train networks deeper than a few layers because gradients vanished before reaching lower layers.

Common mistakes

Using sigmoid activations in many hidden layers without normalization or residual connections.

Follow-up questions

Why does ReLU reduce vanishing gradients?
What is exploding gradient?
How do residual connections help?

Short answer

Deep explanation

Real-world example

Common mistakes

Follow-up questions

More Deep Learning interview questions