Reduction operations combine a list of elements to produce a single combined result. Operations like aggregate iterate over the elements to produce one combined element.
Let's go through the different types of reduction operations in Apache Spark.
Fold vs Fold Left vs Aggregate
foldLeft is not recommended because it cannot be parallelized, whereas fold can be. foldLeft is inherently sequential: each step depends on the accumulator produced by the previous step. For example, in
elements.foldLeft(0)(_ + _)
the function has the shape (B, A) => B, so the accumulator type can differ from the element type. Because there is no way to combine two partial accumulators of type B, the computation cannot in general be split across partitions.
fold, on the other hand, restricts the accumulator to the same type as the RDD's elements. This is why fold can be parallelized: each partition can be folded independently, and the partial results can then be combined with the same operation, without waiting on a single sequential accumulator.
def fold[A1 >: A](z: A1)(op: (A1, A1) => A1): A1
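To make the contrast concrete, here is a small sketch using plain Scala collections (which share these signatures with Spark's RDDs); the object name FoldDemo is just for illustration:

```scala
object FoldDemo {
  def main(args: Array[String]): Unit = {
    val elements = List(1, 2, 3, 4)

    // foldLeft: the accumulator type (String) differs from the element
    // type (Int), so the traversal must happen sequentially, left to right.
    val asString = elements.foldLeft("")((acc, x) => acc + x)
    println(asString) // 1234

    // fold: accumulator and elements share one type, so partitions could be
    // folded independently and the partial sums merged with the same op.
    val sum = elements.fold(0)(_ + _)
    println(sum) // 10
  }
}
```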
But fold has a limitation: what if we want a different result type, say for counting the occurrences of errors among the elements? Does the code below compile?
rdd.fold(0)((count, value) => if (isError(value)) count + 1 else count)
No, the above will not compile, because fold's signature tells us the types must be the same, and here we have two different types: Int (count) and String (value).
The aggregate function revisits this problem and, in effect, combines the two operations' signatures (foldLeft and fold) into one function:
def aggregate[B](z: B)(f: (B, A) => B, g: (B, B) => B): B
Solving the above problem with aggregate:
rdd.aggregate(0)((count, value) => if (isError(value)) count + 1 else count, _ + _)
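The split into f (fold within a partition) and g (merge across partitions) is exactly what makes aggregate parallelizable. The sketch below models this with plain Scala lists standing in for partitions; parAggregate and isError are hypothetical helpers, not Spark API:

```scala
object AggregateDemo {
  // A toy model of aggregate's two-function shape: fold each "partition"
  // with f, then merge the per-partition results with g.
  def parAggregate[A, B](partitions: List[List[A]])(z: B)(f: (B, A) => B, g: (B, B) => B): B =
    partitions
      .map(part => part.foldLeft(z)(f)) // each partition folds independently
      .foldLeft(z)(g)                   // combine the partial results

  def isError(value: String): Boolean = value.startsWith("ERROR")

  def main(args: Array[String]): Unit = {
    val partitions = List(
      List("ok", "ERROR: disk full", "ok"),
      List("ERROR: network", "ok")
    )
    val errorCount = parAggregate(partitions)(0)(
      (count, value) => if (isError(value)) count + 1 else count, // f: (Int, String) => Int
      _ + _                                                      // g: (Int, Int) => Int
    )
    println(errorCount) // 2
  }
}
```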
Pair RDDs work with (key, value) pairs. They are very useful because they let us perform operations per key, and most of the time we need our results as (key, value) pairs.
A Pair RDD is simply an RDD of Scala tuples, and we can create one with the map transformation. For example,
lines.map(x => (x.split(" ")(0), x))
Spark provides a set of transformations that operate on the values of each key. These include groupByKey(), reduceByKey(fn), join, and mapValues(fn).
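These per-key transformations behave much like the corresponding operations on plain Scala collections, which we can run locally to see the semantics; a sketch, not Spark code:

```scala
object PairOpsDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List(("a", 1), ("b", 2), ("a", 3))

    // groupByKey analogue: gather all values that share a key
    val grouped: Map[String, List[Int]] =
      pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

    // reduceByKey analogue: combine the values per key with a function
    val reduced: Map[String, Int] = grouped.map { case (k, vs) => (k, vs.sum) }
    assert(reduced == Map("a" -> 4, "b" -> 2))

    // mapValues analogue: transform values while leaving keys untouched
    val doubled: Map[String, Int] = reduced.map { case (k, v) => (k, v * 2) }
    println(doubled("a")) // 8
  }
}
```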
Let's take a simple example to understand Pair RDDs.
val data = sc.parallelize(List("key1", "key2", "key3", "key1", "key4", "key2"))
//data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD at parallelize at <console>:25
val pairRDD = data.map((_, 1)).reduceByKey(_ + _)
//pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:25
val result = pairRDD.collect()
//result: Array[(String, Int)] = Array((key3,1), (key4,1), (key1,2), (key2,2))
The above returns the results grouped by key, along with the number of appearances of each. If you noticed the ShuffledRDD above: what is a shuffle?
Apache Spark distributes data across the cluster in partitions. A shuffle is the process of rearranging data across those partitions, whether within a JVM, between JVMs, or over the network. Shuffles mostly happen when two RDDs are joined or when a single RDD is grouped by key. The idea is to reduce shuffling as much as possible, because the more shuffling, the less efficient the execution.
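One reason reduceByKey produces less shuffle traffic than grouping first is that it combines values locally within each partition before any data moves. The sketch below models partitions as plain Scala lists to show the effect; it is an illustration of the idea, not Spark's implementation:

```scala
object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    // Two "partitions" holding 5 records in total.
    val partitions = List(
      List(("key1", 1), ("key2", 1), ("key1", 1)),
      List(("key2", 1), ("key1", 1))
    )

    // Map-side combine: each partition reduces its own keys first,
    // so fewer records need to cross the network.
    val locallyCombined = partitions.map(
      _.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toList
    )
    println(locallyCombined.map(_.size).sum) // 4 records shuffled instead of 5

    // Final merge after the shuffle.
    val result = locallyCombined.flatten
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
    println(result.toList.sortBy(_._1)) // List((key1,3), (key2,2))
  }
}
```

With more records per key, the savings grow: local combining sends at most one record per key per partition, regardless of how many raw values each partition holds.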
Also Read: Apache Spark Basics 1