What is Hadoop Distributed Cache and how does it work?

Updated May 16, 2026

Short answer

Distributed Cache is a mechanism to distribute read-only files to all nodes for local access during MapReduce jobs.

Deep explanation

Hadoop Distributed Cache allows job-specific files (like lookup tables, JARs, config files) to be cached on all nodes before execution. The ResourceManager distributes these files, and NodeManagers store them locally. This reduces network I/O during task execution and improves performance significantly in join-heavy operations or reference-data lookups.

Real-world example

Joining a large clickstream dataset with a small country-code lookup table distributed to all nodes.

Common mistakes

Using Distributed Cache for large datasets instead of small reference files.

Follow-up questions

What is the difference between Distributed Cache and broadcast join in Spark?
What file types are best suited for Distributed Cache?

Short answer

Deep explanation

Real-world example

Common mistakes

Follow-up questions

More Hadoop interview questions