Apache Spark Part 2 | RDD Internals

Gobalakrishnan Viswanathan
5 min read · Aug 8, 2020

In continuation of Apache Spark | A Processing Friend, here I am again with the second part. Before continuing, I suggest going through Part 1 to understand the basics of Apache Spark.

Apache Spark RDD (Image Credits: Dataflair)

RDDs (Resilient Distributed Datasets) are the fundamental building blocks of Apache Spark. Spark stores data in the RDD format.

Why RDD:

The growth of the current technology world is something we can't estimate. Today we are in the Artificial Intelligence era; tomorrow it will be something more advanced. For these technologies, the amount of data that needs to be stored and processed is huge. Companies like Google process petabytes of data every day, and the algorithms that crunch data at that scale require distributed computing. Since traditional MapReduce programs follow a disk-based approach, reading and writing petabytes of data from the file system is a painful job, and the MapReduce method becomes impractical for real-world machine learning solutions. So these algorithms cannot rely on the old, slower MapReduce programs anymore.
This problem is solved by in-memory data processing, which is exactly what RDD provides.

What is RDD:

RDD (Photo credits: edureka!)

RDD is the primary data structure of Apache Spark. RDDs are fault-tolerant, and their data is distributed as partitions across multiple computers in the network. In case of a failure, the lost partitions can be rebuilt quickly from the remaining data and the RDD's lineage. RDDs do not depend on hard disks or other secondary storage for processing; they work primarily in RAM.

RDD is not a Distributed File System; instead, it is a Distributed Memory System.

Data can be loaded into Apache Spark from almost any source, such as Hadoop (HDFS), HBase, Hive, SQL databases, S3, and many more. The collected data is then brought into an RDD. An RDD is not bound to any data type; it can hold structured, unstructured, or semi-structured data.
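
To make this concrete, here is a minimal PySpark sketch of loading data into RDDs from a Python collection and from a file. The app name and the path "logs.txt" are placeholders I made up for this example; any local, hdfs:// or s3a:// URI would work the same way with textFile.

from pyspark import SparkContext

# A local SparkContext for the sketch; on a cluster the master would be YARN/standalone.
sc = SparkContext("local[*]", "rdd-sources-demo")

# From an in-memory Python collection -- structured or not, the RDD does not care.
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# From a text file; the same API accepts local paths, hdfs:// and s3a:// URIs.
# "logs.txt" is a placeholder path.
lines_rdd = sc.textFile("logs.txt")

print(numbers_rdd.count())   # 5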

Features of RDD:

  • In-Memory Computation:
    This is the idea that is driving groundbreaking progress in cloud computing. Keeping data in memory increases processing speed compared with the disk-based HDFS approach.
  • Lazy Evaluation:
    The name itself says that Spark is lazy to execute. It evaluates something only when it is required: transformations are not executed right away, but only when we trigger an action on them (see the first sketch after this list). This attitude of Spark RDDs reduces the amount of data that needs to be kept in memory.
  • Fault Tolerance:
    RDDs are fault-tolerant. Any data lost from an RDD partition can be recovered by re-applying the same transformations to the last available parent data (its lineage).
  • RDDs are Immutable:
    Once data is stored in an RDD, it becomes immutable. RDDs provide only READ access. The only way to get modified data is to apply a transformation on the RDD and store the result as a new RDD.
  • Partitioning:
    By default, Spark has a configuration that decides how many partitions the data is divided into, and this configuration can be overridden (see the second sketch after this list). Because the data sits in partitions, it can be read and processed quickly using Spark's parallel processing.
  • Persistence:
    RDDs are totally reusable. We can apply a number of transformations to an RDD and persist the final result as a new RDD, which can then be reused later in the program. This avoids re-applying the same hectic, complex transformations again, which would be time-consuming.
  • The Coarse-Grained Approach:
    Transformations like map, filter, flatMap, etc. produce new RDDs. A fine-grained approach would update each element in the RDD individually, which is potentially costlier because every single update needs to be tracked, so Spark uses a coarse-grained approach that applies the transformation to the entire dataset at once, avoiding that storage and processing complexity.
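
To illustrate lazy evaluation and immutability, here is a small PySpark sketch; the data and variable names are made up for illustration, and it reuses the SparkContext pattern from the earlier snippet.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

numbers = sc.parallelize(range(10))               # an immutable RDD

# Transformations only record lineage; nothing has executed yet.
evens   = numbers.filter(lambda x: x % 2 == 0)    # new RDD, numbers is untouched
squares = evens.map(lambda x: x * x)              # another new RDD

# The action is what triggers the whole chain of work.
print(squares.collect())                          # [0, 4, 16, 36, 64]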
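
And a companion sketch for partitioning and persistence; the partition count and storage level here are illustrative choices, not recommendations.

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

# Override the default partitioning: ask for 4 partitions explicitly.
data = sc.parallelize(range(1000), numSlices=4)
print(data.getNumPartitions())                    # 4

# An "expensive" chain of transformations we want to reuse later.
cleaned = data.filter(lambda x: x % 3 == 0).map(lambda x: x * 2)

# Keep the computed partitions in memory so later actions reuse them.
cleaned.persist(StorageLevel.MEMORY_ONLY)

print(cleaned.count())   # first action: computes the chain and caches the result
print(cleaned.sum())     # second action: served from the cached partitions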

Creating RDD:

  • In our Spark program, we create an RDD to hold our data from sources like HDFS, local file storage, etc. The data is stored in RAM as partitions, and these partitions are what make parallel processing possible. Then there is something called executors that comes in.
  • Executors are the real workers that perform the transformations on the partitions we have. Each executor has a configurable amount of RAM, and the number of executors can also be defined in the configuration based on the system's capacity. The RAM given to an executor is not fully usable for data: some of it is reserved (for communication between the OS and YARN, JVM overhead, and so on). A small configuration sketch follows below.
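
Here is what that idea might look like in a PySpark job. The memory, core, and instance values are made-up examples, not recommendations, and spark.executor.instances only takes effect on resource managers such as YARN; the HDFS path is a placeholder.

from pyspark import SparkConf, SparkContext

# Illustrative executor settings -- real values depend on your cluster.
conf = (SparkConf()
        .setAppName("rdd-creation-demo")
        .set("spark.executor.memory", "2g")      # RAM per executor (example value)
        .set("spark.executor.cores", "2")        # cores per executor (example value)
        .set("spark.executor.instances", "3"))   # number of executors (example value)

sc = SparkContext.getOrCreate(conf)

# The file is read into an RDD whose partitions live in executor memory.
# "hdfs:///data/logs.txt" is a placeholder path.
log_lines = sc.textFile("hdfs:///data/logs.txt", minPartitions=4)
print(log_lines.getNumPartitions())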

RDD Principle (Image Credits to greatlearning.in)

Let's take the picture above and try to understand how RDD works.

  • In our Spark program, we create an RDD named logLinesRDD. The green boxes here represent the sources from which Spark gets the data. The data is divided into four partitions, shown as the blue boxes. Now we have the data in our RDD as partitions stored in RAM, and the next step would naturally be processing that data.
  • What generally happens in the processing step is that some transformation function is executed on each partition in parallel, and the output of each is stored in the partitions of a new RDD, because RDDs are immutable! In the image above, filter is one of the transformation functions available in Spark for filtering data based on conditions. The Spark program filters the lines based on the conditions and stores the result in a new RDD named errorsRDD.
  • When we write the program, reading the data from the sources and applying transformation functions does not actually start any processing, due to Spark's lazy evaluation. Spark will not begin execution until we call an action such as collect. Once it is called, Spark analyses everything required to return the desired output: it starts reading data from the source, runs the transformations, and writes back to the sink if the program says so (see the sketch below).
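
Here is how that picture might look in PySpark code. This is a sketch, not the exact program behind the diagram; the path "logs.txt" and the "ERROR" keyword are assumptions.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# logLinesRDD: the raw log lines, partitioned across the cluster.
logLinesRDD = sc.textFile("logs.txt")

# errorsRDD: a brand-new RDD built by the filter transformation
# (logLinesRDD itself is never modified -- RDDs are immutable).
errorsRDD = logLinesRDD.filter(lambda line: "ERROR" in line)

# Nothing has run so far; collect() is the action that triggers the pipeline.
errors = errorsRDD.collect()
print(errors[:5])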

That's it about RDD for now. A practical session on RDD will come in a separate post, which I will link here as well. Meet you all there. Ta ta. Have a great time!!!
