What are RDDs and their key characteristics?

Updated May 5, 2026

Short answer

RDDs (Resilient Distributed Datasets) are immutable, distributed collections of objects.

Deep explanation

Key characteristics include: In-memory computation, Immutability (cannot be changed once created), Fault-tolerance (rebuilds missing partitions using lineage), and Partitioning (divided across the cluster).

Real-world example

Processing unstructured log files where you need fine-grained control over Java/Scala objects.

Common mistakes

  • Overusing RDDs instead of DataFrames, which lack the Catalyst Optimizer's benefits.

Follow-up questions

  • What is Lineage?

More Apache Spark interview questions

View all →