What is Layer Normalization and why is it preferred over Batch Normalization in Transformers?
Updated May 16, 2026
Short answer
Layer Normalization normalizes activations across the feature dimension for each token independently. It is widely used in Transformers because its statistics do not depend on batch size or sequence length, which keeps training stable in sequence models.
Deep explanation
Layer Normalization is a normalization technique designed to stabilize deep neural network training, especially in sequence models like Transformers.
Unlike BatchNorm, which normalizes each feature across the batch dimension, LayerNorm normalizes each sample across the feature dimension.
Formula: LN(x) = (x - μ) / √(σ² + ε) * γ + β
Where:
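- μ and σ² are the mean and variance of x over the feature dimension, computed per sample (per token in a Transformer), not across the batch.
- γ and β are learned per-feature scale and shift parameters, and ε is a small constant added for numerical stability.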
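To make the formula concrete, here is a minimal sketch in plain Python that applies it to a single token's feature vector. The function name `layer_norm`, the default `eps=1e-5`, and the use of the population (biased) variance are assumptions made for illustration; they are not taken from the text above.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector x (a list of floats) over its features."""
    n = len(x)
    mean = sum(x) / n
    # Population (biased) variance over the features; an assumption of this sketch.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Scale by gamma and shift by beta, feature by feature.
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# One token with four features; gamma = 1 and beta = 0 leave the normalized values unchanged.
token = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)
```

With γ = 1 and β = 0, the output has a mean very close to 0 and a variance very close to 1. No other sample in the batch is involved in the computation.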
Why it is important:
- Transformers process variable-length sequences.
- Batch sizes can be small or dynamic.
- BatchNorm becomes unstable in such settings.…
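The batch dependence behind these points can be illustrated with a small sketch. It compares a per-sample normalization (LayerNorm style) with a per-feature normalization over the batch (a simplified training-mode BatchNorm, without running statistics or learned parameters). The helper names `_normalize`, `layer_norm_batch`, and `batch_norm_batch` are hypothetical and exist only for this example.

```python
import math

def _normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm_batch(batch):
    # Each row (one sample or token) is normalized over its own features.
    return [_normalize(row) for row in batch]

def batch_norm_batch(batch):
    # Simplified BatchNorm: each feature column is normalized over the batch.
    columns = [_normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [10.0, -5.0, 7.0]]

# Does the output for `sample` stay the same when its batch neighbor changes?
ln_same = layer_norm_batch(batch_a)[0] == layer_norm_batch(batch_b)[0]
bn_same = batch_norm_batch(batch_a)[0] == batch_norm_batch(batch_b)[0]
print(ln_same, bn_same)  # prints: True False
```

In this sketch the LayerNorm output for `sample` is identical in both batches, while the BatchNorm output changes with the other row. With small or varying batches the BatchNorm statistics therefore fluctuate from step to step.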