What is normalization in deep vision networks (BatchNorm vs LayerNorm)?
Updated May 15, 2026
Short answer
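BatchNorm normalizes each channel using the mean and variance computed across the batch (and, in convolutional layers, across spatial positions); LayerNorm normalizes each sample using statistics computed across its own features.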
Deep explanation
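BatchNorm estimates a mean and variance from each mini-batch during training, which makes it sensitive to batch size, and it switches to running averages of those statistics at inference. LayerNorm computes its statistics per sample, so it behaves the same in training and inference regardless of batch size, and it is widely used in transformers. Both stabilize training, but they respond differently to distribution shift: BatchNorm's stored running statistics can stop matching the data seen at inference, whereas LayerNorm recomputes its statistics from each input.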
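The difference comes down to which axes the mean and variance are taken over. The following is a minimal NumPy sketch, not a framework implementation: it assumes an (N, C, H, W) activation tensor, leaves out the learnable scale and shift, and leaves out BatchNorm's running averages. The function names and tensor sizes are illustrative choices.

```python
import numpy as np

EPS = 1e-5  # small constant for numerical stability (illustrative value)

def batch_norm(x):
    # x has shape (N, C, H, W). One mean/variance per channel,
    # computed over the batch axis and both spatial axes.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + EPS)

def layer_norm(x):
    # One mean/variance per sample, computed over that sample's
    # own C, H, and W axes. The batch axis is never reduced.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + EPS)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))

bn_full = batch_norm(x)
ln_full = layer_norm(x)

# Normalize the first sample on its own and compare with its
# result when it was normalized as part of the full batch.
bn_alone = batch_norm(x[:1])
ln_alone = layer_norm(x[:1])

ln_same = np.allclose(ln_full[:1], ln_alone)
bn_same = np.allclose(bn_full[:1], bn_alone)
print("LayerNorm unchanged by batch:", ln_same)
print("BatchNorm unchanged by batch:", bn_same)
```

In this sketch the LayerNorm result for a sample does not depend on what else is in the batch, while the BatchNorm result does, which is the source of BatchNorm's batch-size sensitivity.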
Real-world example
Vision Transformers (ViT) use LayerNorm inside each transformer block, normalizing every token embedding across its feature dimension.
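As a rough illustration, here is how per-token normalization might look on a ViT-shaped tensor. This is a NumPy sketch under stated assumptions: the function name `token_layer_norm`, the `eps` value, and the tensor sizes (2 images, 197 tokens, 768-dimensional embeddings) are chosen for the example and are not taken from any particular codebase.

```python
import numpy as np

def token_layer_norm(tokens, gamma, beta, eps=1e-6):
    # tokens has shape (batch, num_tokens, embed_dim).
    # Statistics are computed separately for every token,
    # over the embedding axis only.
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normalized = (tokens - mean) / np.sqrt(var + eps)
    # gamma and beta are the learnable per-feature scale and shift.
    return gamma * normalized + beta

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 197, 768))  # illustrative sizes
gamma = np.ones(768)   # scale initialized to 1
beta = np.zeros(768)   # shift initialized to 0

out = token_layer_norm(tokens, gamma, beta)
print(out.shape)
```

Because each token is normalized independently, the result does not change with the batch size or with the number of tokens in the sequence.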
Common mistakes
- Using BatchNorm when training with very small batches, where the per-batch mean and variance estimates are too noisy to be reliable.
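The small-batch problem can be made concrete by measuring how much a batch mean fluctuates from one batch to the next. The sketch below is a deliberate simplification: it assumes a single scalar activation per sample drawn from a unit-variance normal distribution, whereas convolutional BatchNorm also pools over spatial positions. The helper name `batch_mean_spread` and the trial count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_spread(batch_size, trials=2000):
    # Draw `trials` independent batches of a unit-variance activation
    # and return the standard deviation of the per-batch means.
    batches = rng.normal(loc=0.0, scale=1.0, size=(trials, batch_size))
    return batches.mean(axis=1).std()

spread_small = batch_mean_spread(2)
spread_large = batch_mean_spread(64)
print("batch size 2: ", spread_small)
print("batch size 64:", spread_large)
```

The spread of the batch mean shrinks roughly as one over the square root of the batch size, so a batch of 2 gives a far noisier estimate than a batch of 64, and that noise feeds directly into the normalized activations.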
Follow-up questions
- When is LayerNorm preferred?
- Why does BatchNorm sometimes fail at inference?