Evolution of Apache Spark
Journey of Spark in 1.x series
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Spark 1.0
● State of Big data
● Change in ecosystem
● Dawn of structured data
● Working with structured sources
● Dawn of custom memory management
● Evolution of Libraries
Spark 1.0
● Released in May 2014 [1]
● First production-ready, backward compatible release
● Contains
○ Spark batch
○ Spark streaming
○ Shark
○ MLlib and GraphX
● Developed over 4 years
● Positioned as a better Hadoop
State of Big data Industry
● Map/Reduce was the way to do big data processing
● HDFS was the primary source of data
● Tools like Sqoop were developed to move data into HDFS, which acted as the single source of truth
● All data was assumed to be unstructured by default, with structure laid on top of it
● Hive and Pig were the popular ways to process structured and semi-structured data on top of Map/Reduce
Spark 1.0 Ideas
● The RDD abstraction supported Map/Reduce style programming
● HDFS was the primary supported source, with memory as the speedup layer
● Spark Streaming was viewed as faster batch processing rather than true streaming
● To support Hive, Shark was created to generate RDD code rather than Map/Reduce jobs
Changes from 2014
● The big data industry has gone through many radical changes in thinking over the last two years
● Some of those changes started in Spark; others were influenced by other frameworks
● These changes are important for understanding why Spark 2.0 abstractions are radically different from Spark 1.0
● Many of these were discussed in earlier meetups; links to the videos are in the references
Dawn of Structured Data
Usage of Big data in 2014
● Most people used higher level tools like Hive and Pig to process data rather than raw Map/Reduce
● Most data resided in RDBMS databases, and users ETL'd data from MySQL to Hive in order to query it
● So a lot of use cases involved analysing structured data, despite the big data world's default assumption of unstructured data
● Huge amounts of time were consumed by ETL and non-optimized Hive workflows
Spark with Structured Data in 1.2
● Spark recognised the market's need for structured data support and started evolving the platform for that use case
● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, which carried a schema
● But this approach was not clean
● Also, even though InputFormats existed for reading structured data, there was no direct Spark API to read it
DataSource API in Spark 1.3
● First unified API to read from structured and semi-structured sources
● Can read from RDBMS and NoSQL databases like MongoDB, Cassandra etc.
● An advanced API, like InputFormat, which gives the source a lot of control to optimize data locality
● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem
● For more info, refer to the Anatomy of DataSource API talk [2]
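The shape of that unified read pattern can be sketched in plain Scala. This is a conceptual mock, not real Spark code: the names `RelationProvider`, `read` etc. are simplified stand-ins for the actual DataSource API traits.

```scala
// Each source implements one common provider interface, so callers use a
// single entry point regardless of where the data lives.
trait RelationProvider {
  def load(options: Map[String, String]): Seq[Map[String, Any]]
}

object JsonProvider extends RelationProvider {
  def load(options: Map[String, String]) =
    Seq(Map("source" -> options("path"), "format" -> "json"))
}

object JdbcProvider extends RelationProvider {
  def load(options: Map[String, String]) =
    Seq(Map("source" -> options("url"), "format" -> "jdbc"))
}

// One unified read call, dispatching on the format name
def read(format: String, options: Map[String, String]): Seq[Map[String, Any]] = {
  val provider = format match {
    case "json" => JsonProvider
    case "jdbc" => JdbcProvider
    case other  => sys.error(s"unknown format: $other")
  }
  provider.load(options)
}

val rows = read("jdbc", Map("url" -> "jdbc:mysql://host/db"))
```

The point is the indirection: adding a new source means adding a provider, while every caller keeps using the same `read` entry point.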
DataFrame abstraction in Spark
● Spark understood that modifying the RDD abstraction was not good enough
● Many frameworks like Hive and Pig tried, and failed, to map querying efficiently onto Map/Reduce
● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline than the RDD one
● For more info, refer to the Anatomy of DataFrame API talk [3]
Evolution of In-Memory Processing
In-memory in Spark 1.0
● Spark was the first open source big data framework to embrace in-memory computing
● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than other Hadoop ecosystem projects
● The first implementation of in-memory computing followed a typical cache approach of keeping serialized Java bytes
● This proved to be challenging later
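The "cache serialized Java bytes" approach can be illustrated with plain JVM serialization. A minimal sketch, assuming a hypothetical `Record` type; the real implementation used Spark's own storage levels, not this exact code:

```scala
// Instead of keeping many small long-lived objects (which stress the GC),
// the cached value is one opaque byte array; the cost is CPU work to
// deserialize on every access.
import java.io._

case class Record(id: Int, name: String)

def serialize(r: Record): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bos)
  out.writeObject(r)
  out.close()
  bos.toByteArray
}

def deserialize(bytes: Array[Byte]): Record = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  in.readObject().asInstanceOf[Record]
}

val cached: Array[Byte] = serialize(Record(1, "spark")) // one blob for the GC
val restored = deserialize(cached)                      // pay CPU on access
```

This trade (fewer live objects, more CPU per access) is exactly what the next slides describe as inadequate once heaps grew large.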
Challenges of in memory in Java
● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model
● Java memory management is tuned for short-lived objects, and complete control of memory is given to the JVM
● But as big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate
● Also, as the Java heap grew to cache more data, GC pauses started to kill performance
Custom memory management
● Apache Flink was the first big data system to implement custom memory management on the JVM
● Flink follows a DataFrame-like API with a custom memory model
● The custom, non-GC-based memory model proved to be highly successful
● Observing this trend in the community, Spark promptly adopted the same approach in Spark 1.4
Tungsten in Spark 1.4
● Spark released the first version of custom memory management in version 1.4
● Initially it supported only DataFrames, as they fit the custom memory model
● Custom memory management greatly improved Spark's behaviour at higher VM sizes, with fewer GC pauses
● It solved OOM issues that plagued earlier versions of Spark
● For more info, refer to the Anatomy of In-memory Management in Spark talk [4]
DSLs for data processing
RDD and Map/Reduce APIs
● The RDD API of Spark follows a functional programming paradigm similar to Map/Reduce
● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization
● The Java Map/Reduce API follows the same pattern, but is less elegant than the Scala one
● Both are hard to optimise compared to Pig/Hive
● So we saw a steady increase in custom DSLs in the Hadoop world
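The "opaque function" problem above can be made concrete in a few lines of Scala. This is a toy illustration (the `Expr` type is hypothetical, not Spark's Catalyst), but the contrast is the real one:

```scala
// To the engine, a closure is a black box: it can only be called, never inspected.
val opaque: Int => Int = x => x + 1

// A DSL expression, by contrast, is plain data the engine can pattern-match
// on and rewrite before execution.
sealed trait Expr
case class Col(name: String)     extends Expr
case class Lit(v: Int)           extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// Example rewrite: fold additions of two literals into one literal
def optimize(e: Expr): Expr = e match {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
  case Add(l, r)           => Add(optimize(l), optimize(r))
  case other               => other
}

val expr      = Add(Col("price"), Add(Lit(2), Lit(3)))
val optimized = optimize(expr) // Add(Col("price"), Lit(5))
```

An optimizer can do nothing with `opaque`, but it can constant-fold, reorder, or push down pieces of `expr`; this is the advantage Hive, Pig, and later the DataFrame DSL exploit.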
Need for DSLs in Hadoop
● DSLs like Pig or Hive are much easier to understand than the Java API
● Less error prone, and they help you be very specific
● Easily optimised, as a DSL focuses only on what to do, not how to do it
● As Java Map/Reduce mixes the what with the how, it's hard to optimize compared to Hive and Pig
● So more and more people preferred these DSLs over platform-level APIs
Challenges of DSLs in Hadoop
● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs
● DSLs often lack the flexibility of a complete programming language
● The Hive/Pig DSLs do not share a single abstraction, so you cannot mix them
● DSLs are powerful for optimization but soon become limited in terms of functionality
Scala as a language to host DSLs
● Scala is one of the first languages to embrace DSLs as first-class citizens
● Scala features like implicits, higher-order functions, structural types etc. make it easy to build DSLs and integrate them with the language
● This allows any Scala library to define a DSL and harness the full power of the language
● Many libraries outside big data define their own DSLs, e.g. Slick, Akka-http, sbt
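A tiny sketch of how implicits host a DSL. This is a toy, not any real library's API: an implicit class adds query-like operators to plain strings, so DSL code reads declaratively while remaining ordinary, compiler-checked Scala:

```scala
object QueryDsl {
  // The DSL produces plain data (an expression the engine could optimize)
  case class Condition(column: String, op: String, value: Any)

  // Implicit class: pimps query operators onto String column names
  implicit class ColumnOps(val column: String) {
    def ===(v: Any): Condition = Condition(column, "=", v)
    def >(v: Any): Condition   = Condition(column, ">", v)
  }
}

import QueryDsl._

// Reads like a query language, but is just Scala
val older = "age" > 21
val named = "name" === "spark"
```

Because the DSL is embedded in the host language, it mixes freely with normal Scala code (functions, collections, libraries), which is exactly what the standalone Hive/Pig DSLs could not do.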
DF DSL and Spark SQL DSL
● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DataFrame and Spark SQL DSL code over raw RDD code
● Whenever we write this DSL, all features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive
● Other frameworks like Flink and Beam follow the same ideas with Scala, Java 8 etc.
● You can easily mix and match the DSL with the RDD API
Dataset DSL in Spark 1.6
● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5
● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make the DSL an important pillar of Spark
● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land
● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured-data mindset
Evolution of Libraries
Evolution of libraries vs frameworks
● Spark was one of the first big data frameworks to build a platform rather than a collection of frameworks
● A single abstraction results in multiple libraries, not multiple frameworks
● All these libraries benefit from improvements in the runtime
● This allowed Spark to build a large ecosystem in very little time
● To understand the meaning of platform, refer to the Introduction to Flink talk [5]
Data exchange between Libraries
● As more and more libraries were added to Spark, having a common way to exchange data became important
● Initially libraries used RDD as the data exchange format, but soon discovered some limitations
● Limitations of RDD as a data exchange format:
○ No defined schema; each library needs its own domain objects
○ Too low level
○ Custom serialization is hard to integrate
DataFrame as data exchange format
● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark
● DataFrames have a schema and can be easily passed around between libraries
● DataFrame is a higher-level abstraction compared to RDD
● As DataFrames are serialized using platform-specific code generation, all libraries follow the same serialization
● Dataset will carry the same advantages
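Why a schema helps data exchange can be shown with a minimal sketch. The `Frame`/`Row`/`Schema` types below are simplified stand-ins for Spark's actual classes, not real Spark code:

```scala
// A shared rows-plus-schema shape: any library can read fields by name,
// instead of each library defining its own domain classes.
case class Schema(fields: Seq[String])
case class Row(values: Seq[Any])

case class Frame(schema: Schema, rows: Seq[Row]) {
  // Column access by name, resolved through the schema
  def column(name: String): Seq[Any] = {
    val i = schema.fields.indexOf(name)
    require(i >= 0, s"no such column: $name")
    rows.map(_.values(i))
  }
}

// An ML library and a SQL library can both consume this frame
// without knowing anything about each other's types
val df = Frame(
  Schema(Seq("id", "score")),
  Seq(Row(Seq(1, 0.9)), Row(Seq(2, 0.4)))
)
val scores = df.column("score")
```

With an RDD, each library would instead need its own domain objects and serializers; the schema is what makes the exchange format library-neutral.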
Learnings from Spark 1.x
● Structured/semi-structured data is a first-class citizen of big data processing systems
● Custom memory management and code-generated serialization give the best performance on the JVM
● DataFrame/Dataset are the new abstraction layers for building next generation big data processing systems
● DSLs are the way forward over Map/Reduce-like APIs
● High-level structured abstractions let libraries coexist happily on a platform
References
1. http://spark.apache.org/news/spark-1-0-0-released.html
2. https://www.youtube.com/watch?v=ckX6fT3kYG0
3. https://www.youtube.com/watch?v=iKOGBr-kOks
4. https://www.youtube.com/watch?v=7nIMpD5TyNs
5. https://www.youtube.com/watch?v=jErEhxP8LYQ

Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Recently uploaded

Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
Nohoax Kanont
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPathCommunity
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
Priyanka Aash
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
Fwdays
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Alliance
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
 
Top 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdfTop 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdf
Marrie Morris
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
Low Hong Chuan
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
webbyacad software
 
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Alliance
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
AMol NAik
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
ZachWylie3
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Badri_Bady
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
 

Recently uploaded (20)

Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
 
Top 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdfTop 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdf
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
 
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
 

Evolution of Apache Spark

  • 1. Evolution of Apache Spark Journey of Spark in 1.x series
  • 2. ● Madhukara Phatak ● Technical Lead at Tellius ● Consultant and Trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Spark 1.0 ● State of Big data ● Change in ecosystem ● Dawn of structured data ● Working with structured sources ● Dawn of custom memory management ● Evolution of Libraries
  • 4. Spark 1.0 ● Released in May 2014 [1] ● First production-ready, backward compatible release ● Contains ○ Spark batch ○ Spark streaming ○ Shark ○ MLlib and GraphX ● Developed over 4 years ● A better Hadoop
  • 5. State of Big data Industry ● Map/Reduce was the way to do big data processing ● HDFS was the primary source of data ● Tools like Sqoop were developed for moving data into HDFS, and HDFS acted as the single source of truth ● All data was by default assumed to be unstructured, and structure was laid on top of it ● Hive and Pig were popular ways to do structured and semi-structured data processing on top of Map/Reduce
  • 6. Spark 1.0 Ideas ● The RDD abstraction supported Map/Reduce-style programming ● The primary source supported was HDFS, with memory as the speedup layer ● Spark Streaming was viewed as faster batch processing rather than as streaming ● To support Hive, Shark was created to generate RDD code rather than Map/Reduce
  • 7. Changes from 2014 ● The big data industry has gone through many radical changes in thinking in the last two years ● Some of those changes started in Spark, and some others were influenced by other frameworks ● These changes are important to understand why the Spark 2.0 abstractions are radically different from Spark 1.0 ● Many of these were already discussed in earlier meetups; links to the videos are in the references
  • 9. Usage of Big data in 2014 ● Most people were using higher-level tools like Hive and Pig to process data rather than using Map/Reduce ● Most of the data was residing in RDBMS databases, and users ETL'd data from MySQL to Hive to query it ● So a lot of use cases were analysing structured data, rather than the unstructured data assumed by default in the big data world ● A huge amount of time was consumed by ETL and non-optimized workflows from Hive
  • 10. Spark with Structured Data in 1.2 ● Spark recognised the need for structured data in the market and started to evolve the platform to support that use case ● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, which carried the schema ● But this approach was not clean ● Also, even though there were InputFormats to read structured data, there was no direct API to read it from Spark
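A minimal sketch of what the 1.2-era SchemaRDD looked like in use (the file path and column names are hypothetical; APIs as documented for that release):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("schemardd-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// jsonFile inferred a schema and returned a SchemaRDD (an RDD[Row] plus schema)
val people = sqlContext.jsonFile("people.json") // hypothetical input file
people.registerTempTable("people")

// Querying went through SQL, but the result was still just an RDD of Rows,
// which is why the abstraction felt bolted onto RDD rather than clean
val adults = sqlContext.sql("SELECT name FROM people WHERE age > 18")
adults.collect().foreach(println)
```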
  • 11. DataSource API in Spark 1.3 ● The first API to provide a unified way to read from structured and semi-structured sources ● Can read from RDBMSs and NoSQL databases like MongoDB, Cassandra etc. ● An advanced API, like InputFormat, which gives the source a lot of control to optimize data locality ● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem ● For more info refer to the Anatomy of DataSource API talk [2]
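As a sketch, reading a JDBC table through the 1.3-era DataSource API looked roughly like this (the connection details are hypothetical, and the exact `load` signature varied across 1.x releases before settling into the `read` builder):

```scala
// Unified load: the first argument names the data source implementation,
// and the options map is passed through to that source
val orders = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://localhost:3306/sales", // hypothetical database
  "dbtable" -> "orders"))

// The same entry point reads semi-structured sources such as JSON
val events = sqlContext.load("events.json", "json")
```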
  • 12. DataFrame abstraction in Spark ● Spark understood that modifying the RDD abstraction was not good enough ● Many frameworks like Hive and Pig tried, and failed, to map querying efficiently onto Map/Reduce ● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline than that of RDD ● For more info refer to the Anatomy of DataFrame API talk [3]
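The difference in optimizability can be sketched as follows: the RDD version passes opaque functions the engine cannot inspect, while the DataFrame version declares intent the optimizer can see (assuming a hypothetical `employees` DataFrame whose first two columns are `dept` and `salary`):

```scala
import org.apache.spark.sql.functions._

// RDD style: the lambdas are black boxes to the engine
val totalsRdd = employees.rdd
  .map(row => (row.getString(0), row.getDouble(1)))
  .reduceByKey(_ + _)

// DataFrame DSL: the plan is visible, so it goes through the optimized pipeline
val totalsDf = employees.groupBy("dept").agg(sum("salary"))
```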
  • 13. Evolution of In-Memory Processing
  • 14. In memory in Spark 1.0 ● Spark was the first open source big data framework to embrace in-memory computing ● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than all other Hadoop ecosystem projects ● The first implementation of in-memory computing followed the typical cache approach of keeping serialized Java bytes ● This proved to be challenging later
  • 15. Challenges of in-memory in Java ● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model ● Java memory is tuned for short-lived objects, and complete control of memory is given to the JVM ● When big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate ● Also, as the Java heap grew to cache more data, GC pauses started to kill performance
  • 16. Custom memory management ● Apache Flink was the first big data system to implement custom memory management in Java ● Flink follows a DataFrame-like API with a custom memory model ● The custom memory model, with its non-GC-based approach, proved to be highly successful ● Observing this trend in the community, Spark adopted the same approach in Spark 1.4
  • 17. Tungsten in Spark 1.4 ● Spark released the first version of custom memory management in the 1.4 release ● It only supported DataFrames, as they need the custom memory model ● Custom memory management greatly improved the use of Spark at larger VM sizes, with fewer GC pauses ● Solved OOM issues which plagued earlier versions of Spark ● For more info refer to the Anatomy of In-Memory Management in Spark talk [4]
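In the 1.x series Tungsten's managed memory could be toggled via configuration; as a sketch (the flag names are from the 1.x-era docs and changed between releases, so check the docs for the version you run):

```scala
import org.apache.spark.SparkConf

// Sketch: toggling Tungsten's unsafe, off-heap-style memory for DataFrames.
// The knob was `spark.sql.unsafe.enabled` in 1.4 and was renamed to
// `spark.sql.tungsten.enabled` in 1.5, where it defaulted to true.
val conf = new SparkConf()
  .setAppName("tungsten-sketch")
  .set("spark.sql.tungsten.enabled", "true")
```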
  • 18. DSLs for data processing
  • 19. RDD and Map/Reduce API ● The RDD API of Spark follows a functional programming paradigm similar to Map/Reduce ● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization ● The Map/Reduce API of Java follows the same patterns, but less elegantly than the Scala ones ● Hard to optimise compared to Pig/Hive ● So we saw a steady increase in custom DSLs in the Hadoop world
  • 20. Need for DSLs in Hadoop ● DSLs like Pig or Hive are much easier to understand compared to the Java API ● Less error prone, and they help you be very specific ● Can be easily optimised, as a DSL only focuses on what to do, not how to do it ● As Java Map/Reduce mixes the what with the how, it's hard to optimize compared to Hive and Pig ● So more and more people preferred these DSLs over platform-level APIs
  • 21. Challenges of DSLs in Hadoop ● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs ● DSLs often lack the flexibility of a complete programming language ● The Hive/Pig DSLs don't define a single shared abstraction, so you cannot mix them ● DSLs are powerful for optimization, but soon become limited in terms of functionality
  • 22. Scala as a language to host DSLs ● Scala is one of the first languages to embrace DSLs as first class citizens ● Scala features like implicits, higher order functions, structural types etc. make it easy to build DSLs and integrate them with the language ● This allows any library on Scala to integrate a DSL and harness the full power of the language ● Many libraries outside big data define their own DSLs. Ex: Slick, Akka-http, sbt
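As a standalone illustration (plain Scala, not Spark code), an implicit class is enough to host a small DSL:

```scala
object DurationDsl {
  final case class Duration(millis: Long)

  // The implicit class extends Int with DSL-style methods,
  // so the call sites read like a language feature
  implicit class IntDurationOps(private val n: Int) extends AnyVal {
    def seconds: Duration = Duration(n * 1000L)
    def minutes: Duration = Duration(n * 60L * 1000L)
  }
}

import DurationDsl._
val timeout = 5.seconds
val window  = 2.minutes
```

This is the same mechanism libraries like Slick and sbt use to blend their DSLs into ordinary Scala code.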
  • 23. DF DSL and Spark SQL DSL ● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DF and Spark SQL DSL over Spark RDD code ● Whenever we write this DSL, all the features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive ● Other frameworks like Flink and Beam follow the same ideas on Scala, Java 8 etc. ● You can easily mix and match the DSL with the RDD API
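Mixing the two can be sketched like this (assuming a hypothetical `orders` DataFrame with an `amount` column, and a hypothetical row-level `enrich` function):

```scala
import org.apache.spark.sql.functions.col

// Relational part stays in the optimized DataFrame pipeline
val highValue = orders.filter(col("amount") > 1000)

// Drop to the RDD API for an arbitrary transformation the DSL cannot express
val enriched = highValue.rdd.map(row => enrich(row)) // enrich is hypothetical

// Come back to the DSL world when done
val enrichedDf = sqlContext.createDataFrame(enriched, highValue.schema)
```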
  • 24. Dataset DSL in Spark 1.6 ● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5 ● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make it an important pillar of Spark ● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land ● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured mindset
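A sketch of the 1.6-era Dataset API, which keeps typed lambdas while still running through encoders and the optimized runtime (the case class and data are illustrative):

```scala
case class Person(name: String, age: Int)

import sqlContext.implicits._

// toDS() builds a Dataset[Person] backed by Tungsten encoders
val people = Seq(Person("Ann", 32), Person("Bob", 17)).toDS()

// Typed, compile-checked lambdas instead of untyped Column expressions
val adults = people.filter(_.age > 18)
adults.show()
```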
  • 26. Evolution of libraries vs frameworks ● Spark is one of the first big data frameworks to build a platform rather than a collection of frameworks ● A single abstraction results in multiple libraries, not multiple frameworks ● All these libraries get the benefits of improvements in the runtime ● This let Spark build a large ecosystem in very little time ● To understand the meaning of platform, refer to the Introduction to Flink talk [5]
  • 27. Data exchange between libraries ● As more and more libraries were added to Spark, having a common way to exchange data became important ● Initially, libraries started using RDD as the data exchange format, but soon discovered some limitations ● The limitations of RDD as a data exchange format are: ○ No defined schema; each library needs to come up with its own domain objects ○ Too low level ○ Custom serialization is hard to integrate
  • 28. DataFrame as data exchange format ● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark ● A DataFrame has a schema and can be easily passed around between libraries ● DataFrame is a higher-level abstraction compared to RDD ● As DataFrames are serialized using platform-level code generation, all libraries follow the same serialization ● Dataset will carry the same advantages
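For example, spark.ml transformers consume and produce DataFrames, so the output of one library can feed another without conversion (the `docs` DataFrame with a `text` column is assumed):

```scala
import org.apache.spark.ml.feature.Tokenizer

// A spark.ml stage: DataFrame in, DataFrame out
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val tokenized = tokenizer.transform(docs)

// The result is still a DataFrame, usable by Spark SQL or other libraries
tokenized.select("words").show()
```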
  • 29. Learnings from Spark 1.x ● Structured/semi-structured data is the first class citizen of big data processing systems ● Custom memory management and code-generated serialization give the best performance on the JVM ● DataFrame/Dataset are the new abstraction layers on which to build next generation big data processing systems ● DSLs are the way forward over Map/Reduce-like APIs ● Having high-level structured abstractions lets libraries coexist happily on a platform
  • 30. References 1. http://spark.apache.org/news/spark-1-0-0-released.html 2. https://www.youtube.com/watch?v=ckX6fT3kYG0 3. https://www.youtube.com/watch?v=iKOGBr-kOks 4. https://www.youtube.com/watch?v=7nIMpD5TyNs 5. https://www.youtube.com/watch?v=jErEhxP8LYQ