Apache Spark (Dataframe and Dataset)

As to target wider range of audience in “Big Data” spark introduced Data Frame API for users. RDD API is also elegant, it reduces thousand line of code to dozens however, RDD API are now consider as low level API because it requires developer to know much about internal working of apache spark in order to work efficiently and it is also required for developer to optimise the code.

Furthermore, Data frame API are considered more expressive then RDD API. In addition, it takes less code then RDD API and developer do not have to worry much about the optimization. Optimization will be taken care by apache spark.

Data frame are structured data as in Hive, relational database, or data frame in R/Python. Data frame are the collection of data in form of tables.

Data frame can be created directly by loading JSON file or RDD can be converted to Data frame.

Method 1:

// Create the DataFrame
val df = context.read.json("path/to/data.json")

Method 2:

val rdd = sc.parallelize(
  Seq(1, "Customer 1"),
    (2, "Customer 2"),
    (3, "Customer 3")
val df = spark.createDataFrame(rdd).toDF("id", "customer")

    |id |   customer|
    |1  |Customer 1 |
    |2  |Customer 2 |
    |3  |Customer 3 |


There are lot to learn and work about data frame. We can use SQL to fetch the data simply by creating table using context.table("customer") But there is a disadvantage of using Data frame. The one and most serious disadvantage of using Data frame is that it is type safe, thus when we wanted to get the result, we have to manually type cast to Scala collections. Therefore leaving us on sudden crash of application due to incorrect type casting for example

val names = df.select("name").as[String].collect()

With dataset, we can still take advantage of catalyst’s optimizer as well as leverage Tungsten’s memory encoding. The good thing about dataset API is that it is type safe, meaning it will inform about any mistyping at compile time. To create we can create case class and then if we are using JSON file then we can map to Customer case class.

val dt = context.read.json("path/to/data.json").as[Customer]

Furthermore, If we wanted to fetched the single data from Customer case class then by simply using map function we can extract with specific data type like.

Dataset<String> names = customers.map((Customer c) -> c.name, Encoders.STRING));


There are lot to cover in Dataframe, Dataset and how spark works internal when using those two API. But I like to mention about Catalyst Optimizer and Tungsten because this topics are worth understanding.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: