Mid | Hadoop
What is Hadoop Distributed Cache and how does it work?
Updated May 16, 2026
Short answer
Distributed Cache is a Hadoop mechanism that copies small, read-only files to the local disk of every node that runs a job's tasks, so MapReduce tasks can read them locally instead of over the network.
Deep explanation
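Hadoop Distributed Cache lets a job ship job-specific files (lookup tables, JARs, config files) to the nodes where its tasks run, before those tasks start. The client registers files that already sit on a shared filesystem such as HDFS. On YARN, the NodeManager on each node that hosts a task downloads ("localizes") the files to local disk, typically once per node rather than once per task, and the tasks on that node read the local copy. The ResourceManager schedules containers but does not move the files itself. Because tasks read the data from local disk instead of fetching it remotely during execution, network I/O drops, which helps most in map-side joins and reference-data lookups.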
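The localize-once, read-many behavior can be illustrated with a toy model. This is only a sketch in plain Java, not Hadoop's implementation: the class NodeLocalCacheSketch, its methods, and the file name country_codes.csv are hypothetical, and temporary directories stand in for the shared filesystem and a node's local disk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of per-node localization (hypothetical, not Hadoop code).
class NodeLocalCacheSketch {
    private final Path localDir; // stand-in for a node's local cache directory
    private final Map<Path, Path> localized = new ConcurrentHashMap<>();
    private final AtomicInteger copies = new AtomicInteger();

    NodeLocalCacheSketch(Path localDir) {
        this.localDir = localDir;
    }

    // Returns the node-local copy; copies from the shared location only on the first request.
    Path localize(Path shared) {
        return localized.computeIfAbsent(shared, src -> {
            try {
                Path target = localDir.resolve(src.getFileName());
                Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
                copies.incrementAndGet();
                return target;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Number of copies actually made from the shared location.
    int copyCount() {
        return copies.get();
    }

    public static void main(String[] args) throws IOException {
        Path sharedDir = Files.createTempDirectory("shared-fs");
        Path nodeDir = Files.createTempDirectory("node-local");
        Path lookup = sharedDir.resolve("country_codes.csv");
        Files.write(lookup, "US,United States\n".getBytes(StandardCharsets.UTF_8));

        NodeLocalCacheSketch node = new NodeLocalCacheSketch(nodeDir);
        Path first = node.localize(lookup);  // first task on the node triggers the copy
        Path second = node.localize(lookup); // a later task reuses the local file
        System.out.println(first.equals(second) + " copies=" + node.copyCount());
    }
}
```

In this model the second request returns the same local path without another copy, which is the property that keeps repeated tasks on one node from re-fetching the file.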
Real-world example
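Joining a large clickstream dataset with a small country-code lookup table: the table is cached on every node that runs a task, each mapper loads it into memory once, and click records are enriched map-side without shuffling the large dataset.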
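A minimal sketch of that map-side join follows, written in plain Java so it runs without a cluster. The record layouts ("code,name" for the lookup file and "userId,url,countryCode" for clicks), the class name, and the method names are assumptions for illustration. In a real MapReduce job, the driver would typically register the file with Job.addCacheFile(URI), and the mapper's setup() would find it through context.getCacheFiles() before reading it the way loadCountryCodes does here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map-side join; no Hadoop classes are used.
class MapSideJoinSketch {

    // Stand-in for Mapper.setup(): parse the small lookup file once per task.
    // Each line is assumed to look like "US,United States".
    static Map<String, String> loadCountryCodes(BufferedReader reader) throws IOException {
        Map<String, String> codes = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                codes.put(parts[0].trim(), parts[1].trim());
            }
        }
        return codes;
    }

    // Stand-in for Mapper.map(): enrich one click record via an in-memory lookup.
    // A record is assumed to look like "userId,url,countryCode".
    static String joinRecord(String click, Map<String, String> codes) {
        String[] fields = click.split(",");
        String code = fields[fields.length - 1].trim();
        String country = codes.getOrDefault(code, "UNKNOWN");
        return click + "," + country;
    }

    public static void main(String[] args) throws IOException {
        // In a real job this content would come from the locally cached file.
        String lookupFile = "US,United States\nDE,Germany\n";
        Map<String, String> codes =
                loadCountryCodes(new BufferedReader(new StringReader(lookupFile)));

        List<String> clicks = Arrays.asList("u1,/home,US", "u2,/cart,DE", "u3,/help,ZZ");
        for (String click : clicks) {
            System.out.println(joinRecord(click, codes));
        }
    }
}
```

Loading the table once and looking up each record in a HashMap is what makes this pattern cheap, and it is also why the cached file has to be small enough to fit in each task's memory.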
Common mistakes
- Using Distributed Cache for large datasets instead of small reference files.
Follow-up questions
- What is the difference between Distributed Cache and broadcast join in Spark?
- What file types are best suited for Distributed Cache?