What is Gradient Checkpointing and how does it reduce memory usage in Deep Learning?
Updated May 16, 2026
Short answer
Gradient checkpointing reduces GPU memory usage during training by storing only a subset of intermediate activations during the forward pass and recomputing the rest during backpropagation.
Deep explanation
Training deep neural networks requires storing intermediate activations for backpropagation. As models grow larger (especially Transformers and LLMs), activation memory becomes a major bottleneck.
Gradient checkpointing solves this by trading compute for memory.
Core idea: Instead of storing all activations during the forward pass:
- Store only selected “checkpoint” activations.
- Recompute the missing activations from the nearest checkpoint during the backward pass.
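The core idea can be sketched on a toy model. The example below is a minimal illustration and not how any framework implements it: the model is a chain of scalar layers y = tanh(w * x), the loss is simply the final output, and the names layer, layer_grad, backward_full, and backward_checkpointed are hypothetical helpers introduced here. It compares a baseline that keeps every layer input with a version that keeps only every few inputs and recomputes the rest.

```python
import math

def layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)

def layer_grad(w, x, upstream):
    """Backprop through one layer given its input x; returns (dL/dx, dL/dw)."""
    y = math.tanh(w * x)
    local = 1.0 - y * y  # derivative of tanh evaluated at w * x
    return upstream * local * w, upstream * local * x

def backward_full(weights, x0):
    """Baseline: keep every layer input, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad_x, grad_w = 1.0, [0.0] * len(weights)  # loss = final output
    for i in reversed(range(len(weights))):
        grad_x, grad_w[i] = layer_grad(weights[i], inputs[i], grad_x)
    return grad_w, len(inputs)

def backward_checkpointed(weights, x0, every):
    """Keep only every `every`-th layer input; recompute the rest on demand."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x  # the only activations kept after the forward pass
        x = layer(w, x)
    grad_x, grad_w = 1.0, [0.0] * len(weights)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(weights))
        # Recompute this segment's layer inputs from its checkpoint.
        seg_inputs = []
        x = checkpoints[start]
        for i in range(start, end):
            seg_inputs.append(x)
            x = layer(weights[i], x)
        # Backpropagate through the segment, then discard seg_inputs.
        for i in reversed(range(start, end)):
            grad_x, grad_w[i] = layer_grad(weights[i], seg_inputs[i - start], grad_x)
    return grad_w, len(checkpoints)

weights = [0.5 + 0.1 * i for i in range(8)]
g_full, stored_full = backward_full(weights, 0.3)
g_ckpt, stored_ckpt = backward_checkpointed(weights, 0.3, every=4)
print("activations kept after forward:", stored_full, "vs", stored_ckpt)
print("gradients match:", all(math.isclose(a, b) for a, b in zip(g_full, g_ckpt)))
```

Both versions produce the same gradients because the recomputed activations come from the same operations applied to the same checkpointed inputs; only the number of values held between the forward and backward passes differs.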
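Memory-compute tradeoff:
- Lower memory usage: only the checkpoints, plus the activations of the segment currently being backpropagated, are held at once.
- Higher computation cost: the discarded activations are computed a second time, which adds up to about one extra forward pass per training step.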
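To make the tradeoff concrete, the following back-of-the-envelope sketch counts activations for a chain of equally sized layers with evenly spaced checkpoints. The function checkpoint_cost and its cost model are simplifying assumptions made here for illustration (uniform activation size, one segment recomputed at a time); real models have uneven layer sizes and framework overheads.

```python
import math

def checkpoint_cost(num_layers, segment_len):
    """Rough activation counts for evenly spaced checkpoints on a layer chain.

    Assumes every activation has the same size and that only one segment
    is recomputed and held at a time during the backward pass.
    """
    num_checkpoints = math.ceil(num_layers / segment_len)
    peak_stored = num_checkpoints + segment_len  # checkpoints + one live segment
    recomputed_layers = num_layers               # upper bound: each layer re-run once
    return peak_stored, recomputed_layers

num_layers = 64
for segment_len in (1, 4, 8, 16, 64):
    peak, extra = checkpoint_cost(num_layers, segment_len)
    print(f"segment_len={segment_len:2d}  peak activations~{peak:2d}  recomputed layers<={extra}")

# Under this model the peak is smallest near segment_len = sqrt(num_layers).
best = min(range(1, num_layers + 1), key=lambda k: checkpoint_cost(num_layers, k)[0])
print("best segment length:", best)
```

Under these assumptions, spacing checkpoints roughly every sqrt(n) layers brings peak activation storage down from about n to about 2 * sqrt(n), while the recomputation stays bounded by one additional forward pass.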
How it works:
1. Split the model into segments.
2.…
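The segmentation step named above might look like the following sketch. It treats the model as a plain list of callables and splits it into contiguous, near-equal chunks; split_into_segments and run_layers are hypothetical helpers introduced for illustration, not part of any library, and the sketch covers only the splitting step.

```python
def split_into_segments(layers, num_segments):
    """Split a list of layers into `num_segments` contiguous, near-equal chunks."""
    if not 1 <= num_segments <= len(layers):
        raise ValueError("num_segments must be between 1 and len(layers)")
    base, extra = divmod(len(layers), num_segments)
    segments, start = [], 0
    for s in range(num_segments):
        size = base + (1 if s < extra else 0)  # first `extra` segments get one more layer
        segments.append(layers[start:start + size])
        start += size
    return segments

def run_layers(layers, x):
    """Apply a sequence of layers in order."""
    for fn in layers:
        x = fn(x)
    return x

# Ten toy layers standing in for the blocks of a real model.
layers = [lambda x, a=i: 0.5 * x + a for i in range(10)]
segments = split_into_segments(layers, 3)
print("segment sizes:", [len(s) for s in segments])

# Running the segments back to back is the same computation as the full model.
x = 1.0
for seg in segments:
    x = run_layers(seg, x)
print("same output as unsegmented model:", x == run_layers(layers, 1.0))
```

Segmenting does not change what the model computes; it only defines the boundaries at which activations could be kept as checkpoints.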