Apache Spark Basics 2

Reduction operations combine a list of elements to produce a single combined result. Operations like fold, reduce, groupByKey, and aggregate iterate over the elements to produce a single combined element.

Let's go through the different types of reduction operations in Apache Spark.

Fold vs Fold Left vs Aggregate

foldLeft is not recommended because it cannot be parallelized, whereas the fold function can. foldLeft is inherently sequential: each step depends on the previously accumulated value, as in elements.foldLeft(0)(_ + _). Furthermore, because foldLeft's accumulator type may differ from the element type, there is no operation to merge partial results from different chunks, so forcing it to run in parallel is sometimes impossible.

On the other hand, fold requires the accumulator to have the same type as the RDD's elements. This is why fold can be parallelized: each partition can be folded independently and the partial results combined with the same operation, so it does not have to wait for previously aggregated values (and thus does not execute sequentially):

def fold[A1 >: A](z: A1)(op: (A1, A1) => A1): A1
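
As a minimal sketch of the difference (assuming a live SparkContext named sc):

// Sequential: every step needs the previous accumulator, so foldLeft
// cannot be split across partitions.
val total = List(1, 2, 3, 4).foldLeft(0)(_ + _)
//total: Int = 10

// Parallel: op takes two values of the same type, so Spark can fold each
// partition independently and then fold the partial results together.
val rdd = sc.parallelize(List(1, 2, 3, 4))
val sum = rdd.fold(0)(_ + _)
//sum: Int = 10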

But fold has a limitation: what if we want a different result type, say counting how many error lines occur among the elements? Does the code below compile?

rdd.fold(0)((count, value) => if (isError(value)) count + 1 else count)

No, the above will not compile, because fold's signature tells us the types must be the same. Here we have two different types: Int (count) and String (value).

The aggregate function revisits this problem and combines the two operations' signatures (foldLeft and fold) into one function:

def aggregate[B](z: B)(f: (B, A) => B, g: (B, B) => B): B

Solving the above problem with aggregate:

rdd.aggregate(0)((count, value) => if (isError(value)) count + 1 else count, _ + _)
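
Putting it together as a minimal, self-contained sketch (isError is a hypothetical helper defined here just for the example; assuming a live SparkContext named sc):

val lines = sc.parallelize(List("ERROR: disk full", "INFO: all good", "ERROR: timeout"))

// Hypothetical helper for this example.
def isError(line: String): Boolean = line.startsWith("ERROR")

// seqOp folds each String into the Int accumulator within a partition;
// combOp (_ + _) then merges the per-partition counts.
val errorCount = lines.aggregate(0)(
  (count, line) => if (isError(line)) count + 1 else count,
  _ + _
)
//errorCount: Int = 2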

Pair RDD

A Pair RDD is for working with data (key, value)-wise. Pair RDDs are very useful because they allow us to perform operations on each key, and most of the time we need results as (key, value) pairs.

A Pair RDD is simply an RDD of Scala tuples, and we can create one using the map transformation. For example,

lines.map(x => (x.split(" ")(0), x))

Spark provides a set of transformations that operate on the values of individual keys. These operations include groupByKey(), reduceByKey(fn), joins, and mapValues(fn).
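
As a quick, hypothetical sketch of mapValues and join (again assuming a live SparkContext named sc):

val ages   = sc.parallelize(List(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(List(("alice", "Paris"), ("bob", "Berlin")))

// mapValues transforms only the values, leaving the keys untouched.
val older = ages.mapValues(_ + 1).collect()
//e.g. Array((alice,31), (bob,26))

// join pairs up the values that share a key across the two RDDs.
val joined = ages.join(cities).collect()
//e.g. Array((alice,(30,Paris)), (bob,(25,Berlin)))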

Let's take a simple example to understand Pair RDDs.

val data = sc.parallelize(List("key1","key2", "key3", "key1", "key4", "key2"))
//org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:25
val pairRDD = data.map((_,1)).reduceByKey(_ + _)
//pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:25
val result = pairRDD.collect()
//result: Array[(String, Int)] = Array((key3,1), (key4,1), (key1,2), (key2,2))

The above returns the results grouped by key along with the number of times each key appears. If you noticed the ShuffledRDD in the output above: what is a shuffle?

Shuffle

Apache Spark distributes data over the cluster in partitions. A shuffle is the process of rearranging that data within a JVM, between JVMs, or across the network. Shuffles mostly happen when two RDDs are joined or when a single RDD is grouped. The idea is to reduce shuffling as much as possible, because the more shuffling there is, the less efficient the execution.
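
For example, a common rule of thumb is to prefer reduceByKey over groupByKey for aggregations, because reduceByKey pre-aggregates within each partition before the shuffle, so far less data crosses the network. A minimal sketch (assuming a live SparkContext named sc):

val words = sc.parallelize(List("a", "b", "a", "c", "b", "a")).map((_, 1))

// groupByKey ships every single (word, 1) pair across the network,
// then sums the grouped values.
val counts1 = words.groupByKey().mapValues(_.sum)

// reduceByKey combines values inside each partition first (map-side
// combine), shuffling only one partial count per key per partition.
val counts2 = words.reduceByKey(_ + _)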

Also Read: Apache Spark Basics 1
