junior · Apache Spark
Explain Spark's Caching/Persistence mechanism.
Updated May 5, 2026
Short answer
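Caching keeps the computed contents of an RDD or DataFrame in memory (or on disk, depending on the storage level) so that later actions reuse them instead of recomputing the whole lineage.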
Deep explanation
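Use cache() for the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames/Datasets) and persist() to choose a level explicitly, such as MEMORY_ONLY, MEMORY_AND_DISK, or DISK_ONLY. Both calls are lazy: the data is stored only when the first action runs on it.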
Real-world example
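Iterative machine learning algorithms, where the same training dataset is scanned on every iteration (for example, 100 times). Caching it once avoids re-reading and recomputing it on each pass.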
Common mistakes
- Caching every DataFrame, which leads to memory pressure and eviction of useful data.
Follow-up questions
- How do you remove a dataset from the cache?