In this Apache Spark tutorial, we will understand what a DAG is in Apache Spark, what the DAG scheduler is, why Spark needs a directed acyclic graph, how a DAG is created in Spark, and how it helps achieve fault tolerance. We will also learn how the DAG works with RDDs and the advantages of the DAG in Spark that set Apache Spark apart from Hadoop MapReduce.
2. Introduction to DAG in Apache Spark
A DAG is a finite directed graph with no directed cycles: there are finitely many vertices and edges, and each edge is directed from one vertex to another so that the edges can be arranged in a sequence in which every edge points from earlier to later. The DAG model is a strict generalization of the MapReduce model, and DAG-based execution can do better global optimization than systems like MapReduce. The benefit of the DAG becomes clear in more complex jobs.

The Apache Spark UI allows the user to dive into a stage and expand the detail of any stage; in the stage view, the details of all RDDs belonging to that stage are shown. The scheduler splits the Spark job into stages based on the various transformations applied. Each stage is comprised of tasks, based on the partitions of the RDD, which perform the same computation in parallel. Here, "graph" refers to the navigation of the computation, while "directed" and "acyclic" refer to how it is carried out.
3. Need of Directed Acyclic Graph in Spark
The limitations of MapReduce in Hadoop became a key reason to introduce the DAG in Spark. Computation in MapReduce is carried out in three steps:
- The data is read from HDFS
- Map and Reduce operations are applied
- The computed result is written back to HDFS.
Each MapReduce job is independent of the others, and Hadoop has no idea which MapReduce job comes next. For iterative algorithms, reading and writing the intermediate result between two MapReduce jobs is often unnecessary, so stable storage (HDFS) and disk bandwidth are wasted.

In a multi-step pipeline, every job is blocked from starting until the previous job completes. As a result, a complex computation can take a long time even with a small data volume.
In Spark, by contrast, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed. This lets the execution plan be optimized, e.g. to minimize shuffling data around, whereas in MapReduce such optimization must be done manually by tuning each MapReduce step.
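The contrast above can be sketched in plain Scala (no Spark required): the eager version materializes a full intermediate collection after each step, like MapReduce writing to HDFS between jobs, while an iterator fuses the steps into one pass, analogous to how Spark pipelines consecutive stages of a DAG.

```scala
val input = (1 to 5).toList

// MapReduce-style: each step materializes its full intermediate result
val intermediate = input.map(_ * 2)          // fully built, like writing to HDFS
val eagerResult  = intermediate.filter(_ > 4)

// DAG-style: the steps are fused into one pass; no intermediate collection is built
val pipelined = input.iterator.map(_ * 2).filter(_ > 4).toList

assert(eagerResult == pipelined)   // same answer, different execution plan
```

Both strategies compute the same result; the difference is purely in how (and how often) intermediate data is materialized.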
4. Creation of DAG in Spark
When an action is called on a Spark RDD, at a high level a DAG is created and submitted to the DAG scheduler.
- The DAG scheduler divides operators into stages of tasks. A stage contains tasks based on the partitions of the input data. The DAG scheduler pipelines operators together; for example, consecutive map operators are scheduled in a single stage.
- The stages are passed on to the task scheduler, which launches tasks through the cluster manager. The task scheduler is unaware of the dependencies between stages.
- The worker nodes execute the tasks.
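The laziness behind this flow can be illustrated with a hypothetical plain-Scala sketch (the `Plan` class below is an illustration, not a Spark API): transformations merely record an operation in the plan, and only the action walks the recorded chain and executes it.

```scala
// Hypothetical sketch of lazy DAG construction: transformations record
// operations; the action executes the whole recorded chain.
case class Plan[A](ops: List[List[A] => List[A]]) {
  def transform(op: List[A] => List[A]): Plan[A] =
    Plan(ops :+ op)                                        // lazy: just record the op
  def collect(data: List[A]): List[A] =
    ops.foldLeft(data)((d, op) => op(d))                   // action: run the chain
}

val plan = Plan[Int](Nil)
  .transform(_.map(_ + 1))          // recorded, not yet run
  .transform(_.filter(_ % 2 == 0))  // recorded, not yet run

assert(plan.ops.length == 2)                      // nothing has executed yet
assert(plan.collect(List(1, 2, 3)) == List(2, 4)) // the action triggers execution
```

In real Spark, the recorded plan is the DAG itself, and calling an action such as collect() or count() is what hands it to the DAG scheduler.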
At a higher level, two types of RDD transformations can be applied: narrow transformations (e.g. map(), filter()) and wide transformations (e.g. reduceByKey()). A narrow transformation does not require data to be shuffled across partitions, so consecutive narrow transformations are grouped into a single stage. A wide transformation shuffles the data and therefore results in a stage boundary.
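The distinction can be sketched in plain Scala by modeling partitions as nested lists: a narrow transformation runs inside each partition independently, while a wide, reduceByKey-style transformation must first regroup records across partitions (simulated here with flatten + groupBy standing in for the shuffle).

```scala
// Two partitions of key-value pairs; note "a" and "b" appear in both
val partitions = List(List(("a", 1), ("b", 1)), List(("a", 2), ("b", 3)))

// Narrow: map is applied per partition, no data movement between partitions
val mapped = partitions.map(_.map { case (k, v) => (k, v * 10) })

// Wide: values for the same key live in different partitions, so a
// shuffle (here: flatten + groupBy) is required before reducing
val reduced = mapped.flatten
  .groupBy(_._1)
  .map { case (k, kvs) => (k, kvs.map(_._2).sum) }

assert(reduced == Map("a" -> 30, "b" -> 40))
```

The narrow step above would stay inside one Spark stage; the shuffle in the wide step is exactly where the DAG scheduler would cut a stage boundary.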
Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with each parent. For example, if we call val b = a.map() on an RDD, the RDD b keeps a reference to its parent RDD a; that's lineage.
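A minimal sketch of that parent pointer, using a hypothetical `Dataset` class (not Spark's) in plain Scala: each derived dataset keeps a reference to its parent and the function that produced it, so it can be recomputed on demand.

```scala
// Hypothetical lineage sketch: a dataset knows how to compute itself
// and which parent it was derived from.
class Dataset[A](val compute: () => List[A], val parent: Option[Dataset[_]]) {
  def map[B](f: A => B): Dataset[B] =
    new Dataset(() => compute().map(f), Some(this))  // child points back at parent
}

val a = new Dataset(() => List(1, 2, 3), None)
val b = a.map(_ * 2)               // b keeps a reference to its parent a

assert(b.parent.contains(a))       // the lineage pointer
assert(b.compute() == List(2, 4, 6))
```

This is the information Spark needs for recovery: given the parent and the transformation, any piece of b can be rebuilt from a.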
5. How Fault Tolerance is Achieved through DAG
An RDD is split into partitions, and each node operates on a partition at any point in time. A series of Scala functions executes on each partition of the RDD; these operations are composed together, and the Spark execution engine views them as a DAG (Directed Acyclic Graph). Suppose a node crashes in the middle of operation O3, which depends on operation O2, which in turn depends on O1. The cluster manager finds out the node is dead and assigns another node to continue processing. This node operates on the same partition of the RDD and replays the series of operations it has to execute (O1 -> O2 -> O3), so there is no data loss.
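The recovery described above can be sketched in plain Scala: the lineage is the composition O1 -> O2 -> O3, and when one partition's result is lost, it is rebuilt by replaying that same composition on the source partition (the specific functions below are illustrative stand-ins).

```scala
// Illustrative operations standing in for O1, O2, O3
val o1: Int => Int = _ + 1
val o2: Int => Int = _ * 2
val o3: Int => Int = _ - 3
val lineage = o1 andThen o2 andThen o3   // the recorded chain O1 -> O2 -> O3

val partitions = Map(0 -> List(1, 2), 1 -> List(3, 4))
var results = partitions.map { case (id, p) => id -> p.map(lineage) }

// Partition 1's result is lost when its node crashes...
results -= 1

// ...and is recovered on another node by replaying the lineage
results += 1 -> partitions(1).map(lineage)

assert(results(1) == List(5, 7))   // (3+1)*2-3 = 5, (4+1)*2-3 = 7
```

Only the lost partition is recomputed; partitions that survived the crash are left untouched.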
6. Working of DAG optimizer in Spark
The DAG in Apache Spark is optimized by rearranging and combining operators wherever possible. For example, suppose we submit a Spark job that has a map() operation followed by a filter() operation. The optimizer can rearrange the order of these operators, since filtering first reduces the number of records that undergo the map operation. (This reordering is only valid when the filter's predicate does not depend on the map's output; for DataFrames, it is Spark SQL's Catalyst optimizer that performs such predicate pushdown.)
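A plain-Scala sketch of why the reordering pays off: counting calls to an "expensive" map shows that filtering first halves the work, while the final result is unchanged (under the assumption, noted above, that the two orderings are equivalent for this predicate).

```scala
var mapCalls = 0
val expensiveMap: Int => Int = { x => mapCalls += 1; x * 100 }
val data = (1 to 10).toList

// Unoptimized order: map touches all 10 records, then filter keeps 5
val r1 = data.map(expensiveMap).filter(_ % 200 == 0)
val callsUnoptimized = mapCalls

// Optimized order: filter first, so map only touches 5 records
mapCalls = 0
val r2 = data.filter(_ % 2 == 0).map(expensiveMap)
val callsOptimized = mapCalls

assert(r1 == r2)                                   // same result either way
assert(callsUnoptimized == 10 && callsOptimized == 5)
```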
7. Advantages of DAG in Spark
- A lost RDD can be recovered using the Directed Acyclic Graph.
- MapReduce has just two operators, map and reduce, whereas a DAG has multiple levels of operators, so executing a SQL-style query through a DAG is more flexible.
- The DAG helps achieve fault tolerance, so lost data can be recovered.
- It can do better global optimization than a system like Hadoop MapReduce.