Apache Spark (mid-level)
Explain the concept of Shuffle and how to minimize it.
Updated May 5, 2026
Short answer
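A shuffle redistributes data across partitions, and usually across executors, so that all records with the same key end up in the same place. It is expensive because it involves disk I/O, serialization, and network transfer. Minimize it by broadcasting small tables in joins and by aggregating before the shuffle, for example with `reduceByKey` instead of `groupByKey`.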
Deep explanation
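Wide transformations such as `groupByKey`, `reduceByKey`, and `join` trigger shuffles because each output partition depends on records from many input partitions. During a shuffle, each 'map'-side task writes its output to local disk, bucketed by target partition, and each 'reduce'-side task fetches its bucket from every map task over the network. A shuffle also marks a stage boundary: the reduce side cannot start until the map side has finished writing.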
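The map-side write and reduce-side fetch can be sketched in plain Python. This is a toy model, not Spark's implementation: the function name `shuffle`, the in-memory "buckets", and the CRC32-based partitioner are all assumptions made for illustration (Spark uses its own partitioners and writes shuffle files to disk).

```python
import zlib
from collections import defaultdict


def stable_partition(key, num_reducers):
    # Hypothetical partitioner: a deterministic hash of the key, modulo the
    # number of reduce partitions. Stands in for a hash partitioner.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(map_partitions, num_reducers):
    """Model a shuffle over a list of input partitions of (key, value) pairs."""
    # Map side: each input partition writes one bucket per reduce partition.
    map_outputs = []
    for records in map_partitions:
        buckets = defaultdict(list)
        for key, value in records:
            buckets[stable_partition(key, num_reducers)].append((key, value))
        map_outputs.append(buckets)

    # Reduce side: reducer r fetches bucket r from every map output.
    reduce_partitions = []
    for r in range(num_reducers):
        fetched = []
        for buckets in map_outputs:
            fetched.extend(buckets.get(r, []))
        reduce_partitions.append(fetched)
    return reduce_partitions


if __name__ == "__main__":
    inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
    for i, part in enumerate(shuffle(inputs, num_reducers=2)):
        print(i, part)
```

Every reducer reads from every map output, which is why the cost grows with both the data volume and the number of partitions on each side.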
Real-world example
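Consider joining a massive 'Sales' table with a tiny 'Countries' table. A regular shuffle join repartitions both tables by the join key. Broadcasting 'Countries' instead sends a full copy of it to every executor, so each 'Sales' partition is joined in place and the massive shuffle of 'Sales' is avoided.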
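A minimal sketch of why the broadcast helps, again as a plain-Python model rather than Spark code. The function name `broadcast_join` and the sample rows are hypothetical. In PySpark itself the hint is `broadcast()` from `pyspark.sql.functions`, and Spark may choose a broadcast join on its own when the small side is below `spark.sql.autoBroadcastJoinThreshold`.

```python
def broadcast_join(sales_partitions, countries):
    """Inner-join each sales partition against a local copy of the small table.

    sales_partitions: list of partitions, each a list of (country_code, amount)
    countries: iterable of (country_code, country_name), assumed small enough
               to fit in memory on every executor
    """
    # The "broadcast": one full copy of the small table, available to all tasks.
    lookup = dict(countries)

    joined = []
    for partition in sales_partitions:
        # Rows are joined where they already live; no sales row changes partition.
        joined.append([
            (code, amount, lookup[code])
            for code, amount in partition
            if code in lookup  # inner-join semantics: drop unmatched rows
        ])
    return joined


if __name__ == "__main__":
    sales = [[("US", 10), ("FR", 5)], [("US", 7), ("XX", 1)]]
    countries = [("US", "United States"), ("FR", "France")]
    print(broadcast_join(sales, countries))
```

The output keeps the same number of partitions as the input, in the same order, which is the point: the large side is never moved.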
Common mistakes
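- Using `groupByKey` where `reduceByKey` would do. `groupByKey` sends every value across the network, whereas `reduceByKey` aggregates locally within each partition before shuffling, so far less data is transferred.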
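The difference can be made concrete by counting how many records each approach would send through the shuffle. This is a rough model under stated assumptions: the two function names are hypothetical, and record counts stand in for bytes transferred.

```python
def shuffled_records_group_by_key(map_partitions):
    # groupByKey-style: every (key, value) record crosses the shuffle as-is.
    return sum(len(records) for records in map_partitions)


def shuffled_records_reduce_by_key(map_partitions, combine):
    # reduceByKey-style: each map task first combines values per key, so it
    # ships at most one record per distinct key in its partition.
    total = 0
    for records in map_partitions:
        local = {}
        for key, value in records:
            local[key] = combine(local[key], value) if key in local else value
        total += len(local)
    return total


if __name__ == "__main__":
    partitions = [
        [("a", 1), ("a", 2), ("b", 3)],
        [("a", 4), ("a", 5), ("a", 6)],
    ]
    print(shuffled_records_group_by_key(partitions))
    print(shuffled_records_reduce_by_key(partitions, lambda x, y: x + y))
```

The saving depends on how often keys repeat within a partition: with many repeats the local combine removes most of the traffic, and with all-distinct keys it removes none.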
Follow-up questions
- What is the default value of spark.sql.shuffle.partitions?