Evolution of Apache Spark
Journey of Spark in 1.x series
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Spark 1.0
● State of Big data
● Change in ecosystem
● Dawn of structured data
● Working with structured sources
● Dawn of custom memory management
● Evolution of Libraries
Spark 1.0
● Released in May 2014 [1]
● First production-ready, backward compatible release
● Contains
○ Spark batch
○ Spark streaming
○ Shark
○ MLlib and GraphX
● Developed over 4 years
● Positioned as a better Hadoop
State of Big data Industry
● Map/Reduce was the way to do big data processing
● HDFS was the primary source of data
● Tools like Sqoop were developed to move data into HDFS, which acted as the single source of truth
● All data was assumed to be unstructured by default, with structure laid on top of it
● Hive and Pig were the popular ways to process structured and semi-structured data on top of Map/Reduce
Spark 1.0 Ideas
● The RDD abstraction supported Map/Reduce style programming
● HDFS was the primary supported source, with memory as the speedup layer
● Spark Streaming was viewed as faster batch processing rather than true streaming
● To support Hive, Shark was created to generate RDD code rather than Map/Reduce jobs
Changes from 2014
● The big data industry has gone through many radical changes in thinking over the last two years
● Some of those changes started in Spark; others were influenced by other frameworks
● These changes are important for understanding why Spark 2.0 abstractions are radically different from Spark 1.0
● Many of these were discussed in earlier meetups; links to the videos are in the references
Dawn of Structured Data
Usage of Big data in 2014
● Most people used higher level tools like Hive and Pig to process data rather than raw Map/Reduce
● Most data resided in RDBMS databases, and users ETL'd data from MySQL to Hive in order to query it
● So a lot of use cases involved analysing structured data, despite the big data world's default assumption of unstructured data
● Huge amounts of time were consumed by ETL and non-optimized Hive workflows
Spark with Structured Data in 1.2
● Spark recognised the market's need for structured data support and started evolving the platform for that use case
● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, which carried a schema
● But this approach was not clean
● Also, even though InputFormats existed for reading structured data, there was no direct Spark API to read it
DataSource API in Spark 1.3
● First unified API to read from structured and semi-structured sources
● Can read from RDBMS and NoSQL databases like MongoDB, Cassandra etc.
● An advanced API, like InputFormat, which gives the source a lot of control to optimize data locality
● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem
● For more info, refer to the Anatomy of DataSource API talk [2]
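The shape of that unified read pattern can be sketched in plain Scala. This is a conceptual mock, not real Spark code: the names `RelationProvider`, `read` etc. are simplified stand-ins for the actual DataSource API traits.

```scala
// Each source implements one common provider interface, so callers use a
// single entry point regardless of where the data lives.
trait RelationProvider {
  def load(options: Map[String, String]): Seq[Map[String, Any]]
}

object JsonProvider extends RelationProvider {
  def load(options: Map[String, String]) =
    Seq(Map("source" -> options("path"), "format" -> "json"))
}

object JdbcProvider extends RelationProvider {
  def load(options: Map[String, String]) =
    Seq(Map("source" -> options("url"), "format" -> "jdbc"))
}

// One unified read call, dispatching on the format name
def read(format: String, options: Map[String, String]): Seq[Map[String, Any]] = {
  val provider = format match {
    case "json" => JsonProvider
    case "jdbc" => JdbcProvider
    case other  => sys.error(s"unknown format: $other")
  }
  provider.load(options)
}

val rows = read("jdbc", Map("url" -> "jdbc:mysql://host/db"))
```

The point is the indirection: adding a new source means adding a provider, while every caller keeps using the same `read` entry point.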
DataFrame abstraction in Spark
● Spark understood that modifying the RDD abstraction was not good enough
● Many frameworks like Hive and Pig tried, and failed, to map querying efficiently onto Map/Reduce
● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline than the RDD one
● For more info, refer to the Anatomy of DataFrame API talk [3]
Evolution of In-Memory Processing
In-memory in Spark 1.0
● Spark was the first open source big data framework to embrace in-memory computing
● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than other Hadoop ecosystem projects
● The first implementation of in-memory computing followed a typical cache approach of keeping serialized Java bytes
● This proved to be challenging later
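The "cache serialized Java bytes" approach can be illustrated with plain JVM serialization. A minimal sketch, assuming a hypothetical `Record` type; the real implementation used Spark's own storage levels, not this exact code:

```scala
// Instead of keeping many small long-lived objects (which stress the GC),
// the cached value is one opaque byte array; the cost is CPU work to
// deserialize on every access.
import java.io._

case class Record(id: Int, name: String)

def serialize(r: Record): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bos)
  out.writeObject(r)
  out.close()
  bos.toByteArray
}

def deserialize(bytes: Array[Byte]): Record = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  in.readObject().asInstanceOf[Record]
}

val cached: Array[Byte] = serialize(Record(1, "spark")) // one blob for the GC
val restored = deserialize(cached)                      // pay CPU on access
```

This trade (fewer live objects, more CPU per access) is exactly what the next slides describe as inadequate once heaps grew large.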
Challenges of in memory in Java
● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model
● Java memory management is tuned for short-lived objects, and complete control of memory is given to the JVM
● But as big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate
● Also, as the Java heap grew to cache more data, GC pauses started to kill performance
Custom memory management
● Apache Flink was the first big data system to implement custom memory management on the JVM
● Flink follows a DataFrame-like API with a custom memory model
● The custom, non-GC-based memory model proved to be highly successful
● Observing this trend in the community, Spark promptly adopted the same approach in Spark 1.4
Tungsten in Spark 1.4
● Spark released the first version of custom memory management in version 1.4
● Initially it supported only DataFrames, as they fit the custom memory model
● Custom memory management greatly improved Spark's behaviour at higher VM sizes, with fewer GC pauses
● It solved OOM issues that plagued earlier versions of Spark
● For more info, refer to the Anatomy of In-memory Management in Spark talk [4]
DSLs for data processing
RDD and Map/Reduce APIs
● The RDD API of Spark follows a functional programming paradigm similar to Map/Reduce
● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization
● The Java Map/Reduce API follows the same pattern, but is less elegant than the Scala one
● Both are hard to optimise compared to Pig/Hive
● So we saw a steady increase in custom DSLs in the Hadoop world
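The "opaque function" problem above can be made concrete in a few lines of Scala. This is a toy illustration (the `Expr` type is hypothetical, not Spark's Catalyst), but the contrast is the real one:

```scala
// To the engine, a closure is a black box: it can only be called, never inspected.
val opaque: Int => Int = x => x + 1

// A DSL expression, by contrast, is plain data the engine can pattern-match
// on and rewrite before execution.
sealed trait Expr
case class Col(name: String)     extends Expr
case class Lit(v: Int)           extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// Example rewrite: fold additions of two literals into one literal
def optimize(e: Expr): Expr = e match {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
  case Add(l, r)           => Add(optimize(l), optimize(r))
  case other               => other
}

val expr      = Add(Col("price"), Add(Lit(2), Lit(3)))
val optimized = optimize(expr) // Add(Col("price"), Lit(5))
```

An optimizer can do nothing with `opaque`, but it can constant-fold, reorder, or push down pieces of `expr`; this is the advantage Hive, Pig, and later the DataFrame DSL exploit.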
Need for DSLs in Hadoop
● DSLs like Pig or Hive are much easier to understand than the Java API
● Less error prone, and they help you be very specific
● Easily optimised, as a DSL focuses only on what to do, not how to do it
● As Java Map/Reduce mixes the what with the how, it's hard to optimize compared to Hive and Pig
● So more and more people preferred these DSLs over platform-level APIs
Challenges of DSLs in Hadoop
● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs
● DSLs often lack the flexibility of a complete programming language
● The Hive/Pig DSLs do not share a single abstraction, so you cannot mix them
● DSLs are powerful for optimization but soon become limited in terms of functionality
Scala as a language to host DSLs
● Scala is one of the first languages to embrace DSLs as first-class citizens
● Scala features like implicits, higher-order functions, structural types etc. make it easy to build DSLs and integrate them with the language
● This allows any Scala library to define a DSL and harness the full power of the language
● Many libraries outside big data define their own DSLs, e.g. Slick, Akka-http, sbt
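A tiny sketch of how implicits host a DSL. This is a toy, not any real library's API: an implicit class adds query-like operators to plain strings, so DSL code reads declaratively while remaining ordinary, compiler-checked Scala:

```scala
object QueryDsl {
  // The DSL produces plain data (an expression the engine could optimize)
  case class Condition(column: String, op: String, value: Any)

  // Implicit class: pimps query operators onto String column names
  implicit class ColumnOps(val column: String) {
    def ===(v: Any): Condition = Condition(column, "=", v)
    def >(v: Any): Condition   = Condition(column, ">", v)
  }
}

import QueryDsl._

// Reads like a query language, but is just Scala
val older = "age" > 21
val named = "name" === "spark"
```

Because the DSL is embedded in the host language, it mixes freely with normal Scala code (functions, collections, libraries), which is exactly what the standalone Hive/Pig DSLs could not do.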
DF DSL and Spark SQL DSL
● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DataFrame and Spark SQL DSL code over raw RDD code
● Whenever we write this DSL, all features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive
● Other frameworks like Flink and Beam follow the same ideas with Scala, Java 8 etc.
● You can easily mix and match the DSL with the RDD API
Dataset DSL in Spark 1.6
● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5
● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make the DSL an important pillar of Spark
● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land
● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured-data mindset
Evolution of Libraries
Evolution of libraries vs frameworks
● Spark was one of the first big data frameworks to build a platform rather than a collection of frameworks
● A single abstraction results in multiple libraries, not multiple frameworks
● All these libraries benefit from improvements in the runtime
● This allowed Spark to build a large ecosystem in very little time
● To understand the meaning of platform, refer to the Introduction to Flink talk [5]
Data exchange between Libraries
● As more and more libraries were added to Spark, having a common way to exchange data became important
● Initially libraries used RDD as the data exchange format, but soon discovered some limitations
● Limitations of RDD as a data exchange format:
○ No defined schema; each library needs its own domain objects
○ Too low level
○ Custom serialization is hard to integrate
DataFrame as data exchange format
● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark
● DataFrames have a schema and can be easily passed around between libraries
● DataFrame is a higher-level abstraction compared to RDD
● As DataFrames are serialized using platform-specific code generation, all libraries follow the same serialization
● Dataset will carry the same advantages
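Why a schema helps data exchange can be shown with a minimal sketch. The `Frame`/`Row`/`Schema` types below are simplified stand-ins for Spark's actual classes, not real Spark code:

```scala
// A shared rows-plus-schema shape: any library can read fields by name,
// instead of each library defining its own domain classes.
case class Schema(fields: Seq[String])
case class Row(values: Seq[Any])

case class Frame(schema: Schema, rows: Seq[Row]) {
  // Column access by name, resolved through the schema
  def column(name: String): Seq[Any] = {
    val i = schema.fields.indexOf(name)
    require(i >= 0, s"no such column: $name")
    rows.map(_.values(i))
  }
}

// An ML library and a SQL library can both consume this frame
// without knowing anything about each other's types
val df = Frame(
  Schema(Seq("id", "score")),
  Seq(Row(Seq(1, 0.9)), Row(Seq(2, 0.4)))
)
val scores = df.column("score")
```

With an RDD, each library would instead need its own domain objects and serializers; the schema is what makes the exchange format library-neutral.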
Learnings from Spark 1.x
● Structured/semi-structured data is a first-class citizen of big data processing systems
● Custom memory management and code-generated serialization give the best performance on the JVM
● DataFrame/Dataset are the new abstraction layers for building next generation big data processing systems
● DSLs are the way forward over Map/Reduce-like APIs
● High-level structured abstractions let libraries coexist happily on a platform
References
1. http://spark.apache.org/news/spark-1-0-0-released.html
2. https://www.youtube.com/watch?v=ckX6fT3kYG0
3. https://www.youtube.com/watch?v=iKOGBr-kOks
4. https://www.youtube.com/watch?v=7nIMpD5TyNs
5. https://www.youtube.com/watch?v=jErEhxP8LYQ

Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Recently uploaded

Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
Nohoax Kanont
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPathCommunity
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
Priyanka Aash
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
Fwdays
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Alliance
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
 
Top 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdfTop 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdf
Marrie Morris
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
Low Hong Chuan
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
webbyacad software
 
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Alliance
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
AMol NAik
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
ZachWylie3
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Badri_Bady
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
 

Recently uploaded (20)

Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
 
Top 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdfTop 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdf
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
 
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
 

Evolution of Apache Spark

  • 1. Evolution of Apache Spark Journey of Spark in 1.x series
  • 2. ● Madhukara Phatak ● Technical Lead at Tellius ● Consultant and Trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Spark 1.0 ● State of Big data ● Change in ecosystem ● Dawn of structured data ● Working with structured sources ● Dawn of custom memory management ● Evolution of Libraries
  • 4. Spark 1.0 ● Released in May 2014 [1] ● First production-ready, backward compatible release ● Contains ○ Spark batch ○ Spark streaming ○ Shark ○ MLlib and GraphX ● Developed over 4 years ● A better Hadoop
  • 5. State of Big data Industry ● Map/Reduce was the way to do big data processing ● HDFS was the primary source of data ● Tools like Sqoop were developed for moving data into HDFS, and HDFS acted as the single source of truth ● All data was by default assumed to be unstructured, and structure was laid on top of it ● Hive and Pig were popular ways to do structured and semi-structured data processing on top of Map/Reduce
  • 6. Spark 1.0 Ideas ● The RDD abstraction supported Map/Reduce-style programming ● The primary source supported was HDFS, with memory as the speedup layer ● Spark Streaming was viewed as faster batch processing rather than as streaming ● To support Hive, Shark was created to generate RDD code rather than Map/Reduce
  • 7. Changes from 2014 ● The big data industry has gone through many radical changes in thinking in the last two years ● Some of those changes started in Spark, and some others were influenced by other frameworks ● These changes are important to understand why the Spark 2.0 abstractions are radically different from Spark 1.0 ● Many of these were already discussed in earlier meetups; links to the videos are in the references
  • 9. Usage of Big data in 2014 ● Most people were using higher-level tools like Hive and Pig to process data rather than using Map/Reduce ● Most of the data was residing in RDBMS databases, and users ETL'd data from MySQL to Hive to query it ● So a lot of use cases were analysing structured data, rather than the unstructured data assumed by default in the big data world ● A huge amount of time was consumed by ETL and non-optimized workflows from Hive
  • 10. Spark with Structured Data in 1.2 ● Spark recognised the need for structured data in the market and started to evolve the platform to support that use case ● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, which carried the schema ● But this approach was not clean ● Also, even though there were InputFormats to read structured data, there was no direct API to read it from Spark
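A minimal sketch of what the 1.2-era SchemaRDD looked like in use (the file path and column names are hypothetical; APIs as documented for that release):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("schemardd-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// jsonFile inferred a schema and returned a SchemaRDD (an RDD[Row] plus schema)
val people = sqlContext.jsonFile("people.json") // hypothetical input file
people.registerTempTable("people")

// Querying went through SQL, but the result was still just an RDD of Rows,
// which is why the abstraction felt bolted onto RDD rather than clean
val adults = sqlContext.sql("SELECT name FROM people WHERE age > 18")
adults.collect().foreach(println)
```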
  • 11. DataSource API in Spark 1.3 ● The first API to provide a unified way to read from structured and semi-structured sources ● Can read from RDBMSs and NoSQL databases like MongoDB, Cassandra etc. ● An advanced API, like InputFormat, which gives the source a lot of control to optimize data locality ● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem ● For more info refer to the Anatomy of DataSource API talk [2]
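As a sketch, reading a JDBC table through the 1.3-era DataSource API looked roughly like this (the connection details are hypothetical, and the exact `load` signature varied across 1.x releases before settling into the `read` builder):

```scala
// Unified load: the first argument names the data source implementation,
// and the options map is passed through to that source
val orders = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://localhost:3306/sales", // hypothetical database
  "dbtable" -> "orders"))

// The same entry point reads semi-structured sources such as JSON
val events = sqlContext.load("events.json", "json")
```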
  • 12. DataFrame abstraction in Spark ● Spark understood that modifying the RDD abstraction was not good enough ● Many frameworks like Hive and Pig tried, and failed, to map querying efficiently onto Map/Reduce ● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline than that of RDD ● For more info refer to the Anatomy of DataFrame API talk [3]
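The difference in optimizability can be sketched as follows: the RDD version passes opaque functions the engine cannot inspect, while the DataFrame version declares intent the optimizer can see (assuming a hypothetical `employees` DataFrame whose first two columns are `dept` and `salary`):

```scala
import org.apache.spark.sql.functions._

// RDD style: the lambdas are black boxes to the engine
val totalsRdd = employees.rdd
  .map(row => (row.getString(0), row.getDouble(1)))
  .reduceByKey(_ + _)

// DataFrame DSL: the plan is visible, so it goes through the optimized pipeline
val totalsDf = employees.groupBy("dept").agg(sum("salary"))
```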
  • 13. Evolution of In-Memory Processing
  • 14. In memory in Spark 1.0 ● Spark was the first open source big data framework to embrace in-memory computing ● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than all other Hadoop ecosystem projects ● The first implementation of in-memory computing followed the typical cache approach of keeping serialized Java bytes ● This proved to be challenging later
  • 15. Challenges of in-memory in Java ● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model ● Java memory is tuned for short-lived objects, and complete control of memory is given to the JVM ● When big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate ● Also, as the Java heap grew to cache more data, GC pauses started to kill performance
  • 16. Custom memory management ● Apache Flink was the first big data system to implement custom memory management in Java ● Flink follows a DataFrame-like API with a custom memory model ● The custom memory model, with its non-GC-based approach, proved to be highly successful ● Observing this trend in the community, Spark adopted the same approach in Spark 1.4
  • 17. Tungsten in Spark 1.4 ● Spark released the first version of custom memory management in the 1.4 release ● It only supported DataFrames, as they need the custom memory model ● Custom memory management greatly improved the use of Spark at larger VM sizes, with fewer GC pauses ● Solved OOM issues which plagued earlier versions of Spark ● For more info refer to the Anatomy of In-Memory Management in Spark talk [4]
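In the 1.x series Tungsten's managed memory could be toggled via configuration; as a sketch (the flag names are from the 1.x-era docs and changed between releases, so check the docs for the version you run):

```scala
import org.apache.spark.SparkConf

// Sketch: toggling Tungsten's unsafe, off-heap-style memory for DataFrames.
// The knob was `spark.sql.unsafe.enabled` in 1.4 and was renamed to
// `spark.sql.tungsten.enabled` in 1.5, where it defaulted to true.
val conf = new SparkConf()
  .setAppName("tungsten-sketch")
  .set("spark.sql.tungsten.enabled", "true")
```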
  • 18. DSLs for data processing
  • 19. RDD and Map/Reduce API ● The RDD API of Spark follows a functional programming paradigm similar to Map/Reduce ● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization ● The Map/Reduce API of Java follows the same patterns, but less elegantly than the Scala ones ● Hard to optimise compared to Pig/Hive ● So we saw a steady increase in custom DSLs in the Hadoop world
  • 20. Need for DSLs in Hadoop ● DSLs like Pig or Hive are much easier to understand compared to the Java API ● Less error prone, and they help you be very specific ● Can be easily optimised, as a DSL only focuses on what to do, not how to do it ● As Java Map/Reduce mixes the what with the how, it's hard to optimize compared to Hive and Pig ● So more and more people preferred these DSLs over platform-level APIs
  • 21. Challenges of DSLs in Hadoop ● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs ● DSLs often lack the flexibility of a complete programming language ● The Hive/Pig DSLs don't define a single shared abstraction, so you cannot mix them ● DSLs are powerful for optimization, but soon become limited in terms of functionality
  • 22. Scala as a language to host DSLs ● Scala is one of the first languages to embrace DSLs as first class citizens ● Scala features like implicits, higher order functions, structural types etc. make it easy to build DSLs and integrate them with the language ● This allows any library on Scala to integrate a DSL and harness the full power of the language ● Many libraries outside big data define their own DSLs. Ex: Slick, Akka-http, sbt
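As a standalone illustration (plain Scala, not Spark code), an implicit class is enough to host a small DSL:

```scala
object DurationDsl {
  final case class Duration(millis: Long)

  // The implicit class extends Int with DSL-style methods,
  // so the call sites read like a language feature
  implicit class IntDurationOps(private val n: Int) extends AnyVal {
    def seconds: Duration = Duration(n * 1000L)
    def minutes: Duration = Duration(n * 60L * 1000L)
  }
}

import DurationDsl._
val timeout = 5.seconds
val window  = 2.minutes
```

This is the same mechanism libraries like Slick and sbt use to blend their DSLs into ordinary Scala code.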
  • 23. DF DSL and Spark SQL DSL ● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DF and Spark SQL DSL over Spark RDD code ● Whenever we write this DSL, all the features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive ● Other frameworks like Flink and Beam follow the same ideas on Scala, Java 8 etc. ● You can easily mix and match the DSL with the RDD API
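Mixing the two can be sketched like this (assuming a hypothetical `orders` DataFrame with an `amount` column, and a hypothetical row-level `enrich` function):

```scala
import org.apache.spark.sql.functions.col

// Relational part stays in the optimized DataFrame pipeline
val highValue = orders.filter(col("amount") > 1000)

// Drop to the RDD API for an arbitrary transformation the DSL cannot express
val enriched = highValue.rdd.map(row => enrich(row)) // enrich is hypothetical

// Come back to the DSL world when done
val enrichedDf = sqlContext.createDataFrame(enriched, highValue.schema)
```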
  • 24. Dataset DSL in Spark 1.6 ● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5 ● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make it an important pillar of Spark ● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land ● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured mindset
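A sketch of the 1.6-era Dataset API, which keeps typed lambdas while still running through encoders and the optimized runtime (the case class and data are illustrative):

```scala
case class Person(name: String, age: Int)

import sqlContext.implicits._

// toDS() builds a Dataset[Person] backed by Tungsten encoders
val people = Seq(Person("Ann", 32), Person("Bob", 17)).toDS()

// Typed, compile-checked lambdas instead of untyped Column expressions
val adults = people.filter(_.age > 18)
adults.show()
```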
  • 26. Evolution of libraries vs frameworks ● Spark is one of the first big data frameworks to build a platform rather than a collection of frameworks ● A single abstraction results in multiple libraries, not multiple frameworks ● All these libraries get the benefits of improvements in the runtime ● This let Spark build a large ecosystem in very little time ● To understand the meaning of platform, refer to the Introduction to Flink talk [5]
  • 27. Data exchange between libraries ● As more and more libraries were added to Spark, having a common way to exchange data became important ● Initially, libraries started using RDD as the data exchange format, but soon discovered some limitations ● The limitations of RDD as a data exchange format are: ○ No defined schema; each library needs to come up with its own domain objects ○ Too low level ○ Custom serialization is hard to integrate
  • 28. DataFrame as data exchange format ● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark ● A DataFrame has a schema and can be easily passed around between libraries ● DataFrame is a higher-level abstraction compared to RDD ● As DataFrames are serialized using platform-level code generation, all libraries follow the same serialization ● Dataset will carry the same advantages
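For example, spark.ml transformers consume and produce DataFrames, so the output of one library can feed another without conversion (the `docs` DataFrame with a `text` column is assumed):

```scala
import org.apache.spark.ml.feature.Tokenizer

// A spark.ml stage: DataFrame in, DataFrame out
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val tokenized = tokenizer.transform(docs)

// The result is still a DataFrame, usable by Spark SQL or other libraries
tokenized.select("words").show()
```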
  • 29. Learnings from Spark 1.x ● Structured/semi-structured data is the first class citizen of big data processing systems ● Custom memory management and code-generated serialization give the best performance on the JVM ● DataFrame/Dataset are the new abstraction layers on which to build next generation big data processing systems ● DSLs are the way forward over Map/Reduce-like APIs ● Having high-level structured abstractions lets libraries coexist happily on a platform
  • 30. References 1. http://spark.apache.org/news/spark-1-0-0-released.html 2. https://www.youtube.com/watch?v=ckX6fT3kYG0 3. https://www.youtube.com/watch?v=iKOGBr-kOks 4. https://www.youtube.com/watch?v=7nIMpD5TyNs 5. https://www.youtube.com/watch?v=jErEhxP8LYQ