What is Gradient Checkpointing and how does it reduce memory usage in Deep Learning?

Updated May 16, 2026

Short answer

Gradient checkpointing reduces GPU memory usage during training by storing only a subset of intermediate activations during the forward pass and recomputing the rest during backpropagation.

Deep explanation

Training deep neural networks requires storing intermediate activations for backpropagation. As models grow larger (especially Transformers and LLMs), activation memory becomes a major bottleneck.

Gradient checkpointing solves this by trading compute for memory.

Core idea: Instead of storing all activations during the forward pass:

  • Store only selected “checkpoint” activations.
  • Recompute the missing activations from the nearest checkpoint during the backward pass.
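
The core idea can be illustrated with a toy example in plain Python. This is a minimal sketch, not how any framework implements it: the model is assumed to be a chain of scalar layers y = tanh(w * x), and the names `forward_layer`, `backward_layer`, `grad_full_storage`, `grad_checkpointed`, and `segment_size` are hypothetical, chosen only for illustration. It compares a backward pass that keeps every layer input with one that keeps only one checkpoint per segment and recomputes the rest.

```python
import math


def forward_layer(w, x):
    # Toy layer (an assumption for this sketch): y = tanh(w * x)
    return math.tanh(w * x)


def backward_layer(w, x, grad_out):
    # Chain rule: d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full_storage(weights, x0):
    """Standard backprop: keep the input of every layer."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0  # gradient of the output with respect to itself
    for i in reversed(range(len(weights))):
        grad = backward_layer(weights[i], inputs[i], grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x0, segment_size):
    """Keep only the input of each segment; recompute the rest on demand."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment_size == 0:
            checkpoints[i] = x  # checkpoint: input of a segment's first layer
        x = forward_layer(w, x)

    peak = len(checkpoints)
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment_size, n)
        # Recompute the layer inputs inside this segment from its checkpoint.
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(forward_layer(weights[i], inputs[-1]))
        # Values held now: all checkpoints plus the recomputed inputs.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad = backward_layer(weights[i], inputs[i - start], grad)
    return grad, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = grad_full_storage(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment_size=4)
print(g_full, g_ckpt)            # the two gradients agree
print(stored_full, stored_ckpt)  # 16 7
```

With 16 layers and segments of 4, the checkpointed version never holds more than 7 values (4 checkpoints plus 3 recomputed inputs), compared with 16 for full storage, and both produce the same gradient.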

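Memory-compute tradeoff:

  • Lower memory usage: only the checkpoints plus the activations of one segment are held at any time.
  • Higher computation cost: non-checkpointed activations are computed a second time, which adds up to roughly one extra forward pass.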
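
To put rough numbers on the tradeoff, the sketch below uses a simplified cost model. It assumes every layer's activation has the same size and that one checkpoint is kept per segment; the function name `peak_stored_activations` is hypothetical. Under those assumptions, a segment size near the square root of the layer count minimizes peak activation storage.

```python
import math


def peak_stored_activations(num_layers, segment_size):
    # Simplified model (an assumption, not a framework measurement):
    # one checkpoint per segment, plus the activations recomputed inside
    # the single segment currently being backpropagated.
    num_checkpoints = math.ceil(num_layers / segment_size)
    return num_checkpoints + segment_size - 1


num_layers = 100
# Try every segment size and keep the one with the smallest peak.
best_size = min(range(1, num_layers + 1),
                key=lambda s: peak_stored_activations(num_layers, s))

print(best_size)                                        # 10
print(peak_stored_activations(num_layers, best_size))   # 19
print(peak_stored_activations(num_layers, 1))           # 100 (every layer is a checkpoint)
```

In this model, 100 layers need about 19 stored activations instead of 100. Real savings depend on per-layer activation sizes and on where the checkpoints are placed.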

How it works:

  1. Split the model into segments.

2.…


