mid · Apache Spark
What is Broadcast Join and when should you use it?
Updated May 5, 2026
Short answer
A Broadcast Join sends a full copy of a small DataFrame to every executor, so each executor joins its own partitions of the larger DataFrame locally and the larger DataFrame is never shuffled across the network.
Deep explanation
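It is most effective when one table is small enough to fit into the memory of every executor. Spark picks a broadcast join automatically when a table's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default).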
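To make the mechanism concrete, here is a minimal pure-Python sketch of a broadcast hash join. It is an illustration, not Spark's implementation: the function name broadcast_hash_join, the list-of-dicts row format, and the sample data are all assumptions made for this example. The small side is turned into one hash table (the part that would be copied to every executor), and each partition of the large side is probed in place.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large side against one shared lookup table."""
    # Build phase: hash the small side once. In a cluster, this table is
    # what would be shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: each partition is scanned where it already lives, so no
    # rows of the large side move between partitions (no shuffle).
    joined_partitions = []
    for partition in large_partitions:
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


if __name__ == "__main__":
    # Hypothetical sample data: two partitions of transactions, one small lookup table.
    transactions = [
        [{"product_id": "p1", "amount": 10}],
        [{"product_id": "p2", "amount": 5}, {"product_id": "p9", "amount": 1}],
    ]
    categories = [
        {"product_id": "p1", "category": "books"},
        {"product_id": "p2", "category": "toys"},
    ]
    for part in broadcast_hash_join(transactions, categories, "product_id"):
        print(part)
```

The cost of this approach is the memory for the lookup table on every worker, which is why the small side has to stay small.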
Real-world example
Enriching streaming transaction data with a static list of product categories.
Common mistakes
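- Broadcasting a table that is too large. The table is first collected on the driver and then copied to every executor, so it can cause an out-of-memory (OOM) error on either.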
Follow-up questions
- How do you change the threshold?