What is Hadoop Distributed Cache and how does it work?

Updated May 16, 2026

Short answer

Distributed Cache is a mechanism for copying small, read-only files to the nodes that run a MapReduce job's tasks, so each task can read them from local disk.

Deep explanation

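Hadoop Distributed Cache lets a job ship job-specific, read-only files (lookup tables, JARs, config files) to the nodes that run its tasks before those tasks start. The files are first placed in a shared file system such as HDFS. On YARN, the ResourceManager does not move the files itself: each NodeManager copies (localizes) them to its own local disk, once per node rather than once per task, and the tasks on that node read the local copy. This avoids repeated remote reads during task execution, which helps most in map-side joins and reference-data lookups.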

Real-world example

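Joining a large clickstream dataset with a small country-code lookup table. The lookup file is distributed to every node that runs a task, each mapper loads it into memory once, and click records are enriched in the map phase without shuffling the large dataset.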
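The join described above can be sketched in plain Java. This is a minimal illustration of the pattern, not a Hadoop job: the class name, the "code,name" lookup format, and the "userId,countryCode,url" click format are all assumptions made for the example. In a real MapReduce job the driver would register the file with Job.addCacheFile(URI), the lookup would be loaded in Mapper.setup() from the localized file, and the per-record join would run in Mapper.map().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map-side join; no Hadoop classes are used here.
// All names and record formats are hypothetical.
class CountryJoinSketch {

    // Plays the role of Mapper.setup(): build the in-memory table once per task.
    // Assumed line format: "code,name". Lines without a comma are skipped.
    static Map<String, String> loadLookup(List<String> lines) {
        Map<String, String> countries = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                countries.put(parts[0].trim(), parts[1].trim());
            }
        }
        return countries;
    }

    // Plays the role of Mapper.map(): enrich one click record from the table.
    // Assumed record format: "userId,countryCode,url".
    // Returns null for malformed records so the caller can skip them.
    static String joinRecord(String click, Map<String, String> countries) {
        String[] fields = click.split(",", 3);
        if (fields.length < 3) {
            return null;
        }
        String countryName = countries.getOrDefault(fields[1].trim(), "UNKNOWN");
        return fields[0] + "," + countryName + "," + fields[2];
    }

    public static void main(String[] args) {
        // Stand-in for the lines of the localized lookup file.
        Map<String, String> countries =
                loadLookup(Arrays.asList("US,United States", "IN,India"));

        // Stand-in for the large clickstream input split.
        for (String click : Arrays.asList("u1,US,/home", "u2,IN,/cart", "u3,ZZ,/help")) {
            System.out.println(joinRecord(click, countries));
        }
    }
}
```

The point of the design is that the lookup table is parsed once per task and then consulted per record from memory, so the large dataset never has to be shuffled to meet the small one.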

Common mistakes

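  • Using Distributed Cache for large datasets instead of small reference files. Every node that runs a task must copy the file, and mappers usually load it into memory, so a large file slows task startup and can exhaust the task heap.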
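One way to guard against this mistake is to check the file size in the driver before deciding how to join. The sketch below is only an illustration: the class name, the 64 MB limit, and the strategy labels are assumptions, and a sensible limit depends on the task heap size in a given cluster.

```java
// Hypothetical driver-side guard; the limit is an assumed value, not a Hadoop default.
class CacheSizeGuard {

    // Assumed upper bound for a file that each mapper will hold in memory.
    static final long MAX_CACHE_BYTES = 64L * 1024 * 1024;

    // True when the file size is non-negative and within the given limit.
    static boolean fitsInCache(long fileBytes, long limitBytes) {
        return fileBytes >= 0 && fileBytes <= limitBytes;
    }

    // Picks a join strategy label based on the lookup file size.
    static String chooseJoinStrategy(long lookupBytes) {
        return fitsInCache(lookupBytes, MAX_CACHE_BYTES)
                ? "map-side join via Distributed Cache"
                : "reduce-side join";
    }

    public static void main(String[] args) {
        System.out.println(chooseJoinStrategy(2L * 1024 * 1024));          // small lookup file
        System.out.println(chooseJoinStrategy(10L * 1024 * 1024 * 1024));  // 10 GB dataset
    }
}
```

In a real driver the size would come from the file system metadata of the lookup file rather than a literal.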

Follow-up questions

  • What is the difference between Distributed Cache and broadcast join in Spark?
  • What file types are best suited for Distributed Cache?
