Quote from the Apache Spark documentation:
"Apache Spark™ is a unified analytics engine for large-scale data processing."
Spark's popularity is growing rapidly in the big data world; it is one of the key technologies for distributed big data processing. Before diving into Spark, one should know a little about Hadoop, because that makes the purpose of Apache Spark much easier to understand. I will assume you have some knowledge of Hadoop's batch-processing model.
The main concerns in distributed computing are failure (crashes of a subset of machines) and latency (caused by transferring data between nodes over the network). Hadoop's main disadvantage in data processing is that intermediate results are written to and fetched back from disk between batch operations, which increases processing time. Spark, in contrast, aggressively reduces network traffic and performs operations in memory, which increases efficiency.
Inside Apache Spark
Every Spark program starts with a driver program; this is analogous to the main function in other programming languages. The purpose of the driver program is to execute operations in parallel on a cluster. A Resilient Distributed Dataset (RDD) is an immutable collection of elements, partitioned across the cluster so that operations can be performed on it in parallel.
For now, don't focus on clusters and partitions. One can say (for now) that an RDD is somewhat similar to a List in Scala. There are two types of operations on an RDD: transformations and actions.
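As a rough analogy, and using no Spark at all, the operations an RDD supports look much like those on a plain Scala List. The log lines below are made-up sample data, used only for illustration:

```scala
object ListAnalogy {
  // A plain Scala List standing in for an RDD of log lines (hypothetical sample data)
  val lines = List("INFO start", "ERROR disk full", "INFO done", "ERROR timeout")

  // map and filter correspond to RDD transformations: they produce a new collection
  val errors = lines.filter(_.contains("ERROR"))

  // size corresponds to an RDD action such as count: it returns a plain value
  val errorCount = errors.size // 2
}
```

The key difference, covered next, is that List operations run immediately, while RDD transformations do not.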
Transformations are lazy, meaning they do not perform any work until they are told to. Transformations are higher-order functions that create a new dataset, such as map, filter, and flatMap. Actions such as reduce, collect, and count return a final result to the driver program. Only when an action is called are all the pending transformations actually performed; otherwise no work is done.
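This laziness can be mimicked in plain Scala with an Iterator, whose map is also lazy. This is only an analogy, not Spark itself, but it shows the same pattern: building the pipeline does nothing, and only the terminal "action" triggers the computation:

```scala
object LazinessDemo {
  var evaluations = 0 // counts how many times the mapping function actually runs

  // Building the pipeline performs no work yet, just like an RDD transformation
  val pipeline = Iterator(1, 2, 3, 4, 5).map { n => evaluations += 1; n * 2 }

  val before = evaluations // still 0: nothing has been computed

  // sum plays the role of an action: it forces the whole pipeline to run
  val total = pipeline.sum // 30

  val after = evaluations // now 5: the function ran once per element
}
```

In Spark the same idea holds at cluster scale: calling count() or collect() on an RDD is what triggers the chain of transformations behind it.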
Let's now create a simple Spark program, and then walk through how it is executed, to understand the internal structure of a Spark application.
val conf = new SparkConf().setAppName(appName).setMaster(master)
val spark = new SparkContext(conf)
val dataRDD = spark.textFile("log.file")
val logWithErrors = dataRDD.filter(_.contains("ERROR")).persist()
val count = logWithErrors.count()
val take10Log = logWithErrors.take(10)
Let me explain the above program. The first two lines create a SparkContext, which is the execution environment. The third line creates an RDD by reading from a text file, and the next line applies a transformation. Note the persist function: it keeps the result of the filter in memory (or on disk if needed) so it can be reused by later operations. The last two lines are actions.
Now let's visualize how these are executed.
The SparkContext creates the RDD, sets up the transformations and actions, and sends them to the worker nodes. The Spark program connects to the cluster manager (e.g. YARN) to acquire worker nodes. The executors are responsible for running the tasks and returning the computed results to the driver program, while the cluster manager (e.g. YARN) handles scheduling and managing the nodes in the cluster.
As I am also a beginner, here is a takeaway from my experience of how and where I started learning Spark. In my opinion, you should learn Spark from an institute or organization that provides Spark courses, because if you are a complete beginner these will act as a mentor, provide you a pathway, and teach things in sequence for better understanding.
I found the Coursera course https://www.coursera.org/learn/scala-spark-big-data by Heather Miller great for beginners. It helped me a lot, taking me from zero to where I am now.