juniorApache Spark
What are RDDs and their key characteristics?
Updated May 5, 2026
Short answer
RDDs (Resilient Distributed Datasets) are immutable, distributed collections of objects.
Deep explanation
Key characteristics include: In-memory computation, Immutability (cannot be changed once created), Fault-tolerance (rebuilds missing partitions using lineage), and Partitioning (divided across the cluster).
Real-world example
Processing unstructured log files where you need fine-grained control over Java/Scala objects.
Common mistakes
- Overusing RDDs instead of DataFrames, which lack the Catalyst Optimizer's benefits.
Follow-up questions
- What is Lineage?