Explain Spark's Caching/Persistence mechanism.

Updated May 5, 2026

Short answer

Caching stores the computed partitions of an RDD or DataFrame in memory (or on disk) so that later actions reuse them instead of recomputing the full lineage.

Deep explanation

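Use cache() for the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) and persist() to choose a level explicitly (for example MEMORY_ONLY, MEMORY_AND_DISK, or DISK_ONLY). Both calls are lazy: nothing is stored until the first action runs on the dataset.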

Real-world example

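Iterative machine learning algorithms, such as gradient descent, may scan the same dataset 100 times. Caching the dataset after the first pass avoids re-reading and re-parsing the input on every iteration.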

Common mistakes

  • Caching every DataFrame, which leads to memory pressure and eviction of useful data.

Follow-up questions

  • How do you remove a dataset from the cache?
