What is Layer Normalization and why is it preferred over Batch Normalization in Transformers?

Updated May 16, 2026

Short answer

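Layer Normalization normalizes each token's activations across the feature dimension, independently of other tokens and of other samples in the batch. Transformers use it because its statistics do not depend on batch size or sequence length, which keeps training stable for sequence models.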

Deep explanation

Layer Normalization is a normalization technique designed to stabilize deep neural network training, especially in sequence models like Transformers.

Unlike BatchNorm, which normalizes each feature across the batch dimension, LayerNorm normalizes each sample across its feature dimension.

Formula: LN(x) = (x - μ) / √(σ² + ε) * γ + β

Where:

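μ and σ² are the mean and variance of x over the feature dimension, computed per sample (per token in a Transformer), not across the batch.
γ and β are learnable scale and shift parameters, and ε is a small constant added for numerical stability.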
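The formula above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a production implementation: the function name `layer_norm`, the default `eps=1e-5`, and the example vector are assumptions chosen for the illustration, and real models would normally rely on a framework's built-in layer.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector x, then scale and shift it."""
    d = len(x)
    mu = sum(x) / d                               # mean over the features
    var = sum((v - mu) ** 2 for v in x) / d       # (biased) variance over the features
    inv_std = 1.0 / math.sqrt(var + eps)          # 1 / sqrt(sigma^2 + eps)
    return [(v - mu) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

# Hypothetical 4-dimensional token; gamma=1 and beta=0 leave the normalized values unchanged.
token = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)
print(out)
```

With γ = 1 and β = 0 the output has approximately zero mean and unit variance over the features, regardless of what any other token or sample contains.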

Why it is important:

Transformers process variable-length sequences.
Batch sizes can be small or dynamic.
BatchNorm becomes unstable in such settings.…
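The points above can be illustrated with a small comparison, again in plain Python. It is a sketch under simplifying assumptions: the helper names are hypothetical, the batch-norm version uses only training-mode batch statistics (no running averages, no γ or β), and the data is made up.

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def layer_norm_batch(batch):
    # Each row (one sample or token) is normalized over its own features.
    return [normalize(row) for row in batch]

def batch_norm_batch(batch):
    # Each feature column is normalized over the samples in the batch.
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]     # same sample, different batch mates
batch_b = [sample, [10.0, -5.0, 7.0]]

ln_a, ln_b = layer_norm_batch(batch_a)[0], layer_norm_batch(batch_b)[0]
bn_a, bn_b = batch_norm_batch(batch_a)[0], batch_norm_batch(batch_b)[0]

print(ln_a == ln_b)   # True: LayerNorm output ignores the rest of the batch
print(bn_a == bn_b)   # False: BatchNorm output depends on the other samples

# With a batch of one, every feature equals its batch mean, so the signal vanishes.
print(batch_norm_batch([sample])[0])   # [0.0, 0.0, 0.0]
```

The same sample gets identical LayerNorm outputs in both batches, while its BatchNorm output changes with its batch mates and collapses to zeros when the batch size is one.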


