Migration Story
About me
Roman Chukh
❏ 11+ years of experience
❏ Java / PHP / Ruby / etc.
❏ ~1 year with Apache Spark
❏ Interested in
❏ Data Storage / Data Flow
❏ Monitoring
❏ Provisioning Tools
❏ Why Spark?
❏ Our Migration to Spark
❏ Issues
❏ … and solutions
❏ … or workarounds
❏ … or at least the lessons learnt
Why Spark?
[Spark is a] fast and general-purpose cluster computing platform for large-scale data processing
Why Spark?
Why Spark?
Active Development
Why Spark?
Community Growth
Why Spark?
Real-World Usage
Largest cluster: 8000 nodes (Tencent)
Largest single job: 1 PB
Top streaming intake: 1 TB / hour
Why Spark?
Real-World Usage
Migrating to Spark
Cluster Manager
Worker Node
Worker Node
Migrating To Spark
Before We Start
Migrating To Spark
The Product
❏ Cloud-based analytics application
❏ Won the Big Data Startup Challenge
❏ In-house computation engine
Migrating To Spark
❏ More data
❏ More granular data
❏ Support various data backends
❏ Support Machine Learning algorithms
Migrating To Spark
Use Cases
❏ Supplement the graph database used to store/query big dimensions
❏ Supplement the RDBMS for querying high volumes of data
❏ Represent the existing computation graph as a flow of Spark-based operations
Migrating To Spark
Star Schema
[Diagram: star schema with Dimension and Metric tables]
Issue #1
Low-Level API
Issue #1: Low-Level API
“Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
Issue #1: Low-Level API
RDD: Resilient Distributed Dataset
❏ Immutable
❏ Statically typed: RDD<MyClass>
❏ Fault-Tolerant: Automatically rebuilt on failure
❏ Lazily evaluated
Issue #1: Low-Level API
Example workflow
Read File
Get line length
Sum lengths
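The code shown on the original slide was not captured in this transcript. As a rough stand-in, the three steps above (read file, get line length, sum lengths) can be sketched in plain Python as a map followed by a reduce. This is not Spark code: `total_line_length` and the sample text are hypothetical, and an in-memory `StringIO` stands in for the file. In PySpark the same shape is usually written as a `textFile(...)`, `map(...)`, `reduce(...)` chain.

```python
import io
from functools import reduce


def total_line_length(lines):
    """Map each line to its length, then reduce the lengths to one sum."""
    lengths = map(len, lines)  # lazy: nothing is computed until reduce pulls values
    return reduce(lambda a, b: a + b, lengths, 0)


# Hypothetical sample input standing in for a file on disk.
sample_file = io.StringIO("spark\nmigration\nstory\n")

# Step 1: read the file line by line (newline stripped).
lines = (line.rstrip("\n") for line in sample_file)

# Steps 2 and 3: get line lengths and sum them.
total = total_line_length(lines)
print(total)
```

Like an RDD pipeline, the `map` step here is lazy and only runs when the final reduce asks for values.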
Issue #1: Low-Level API
RDD: Example
Issue #1: Low-Level API
RDD: Issues
❏ Functional transformations (e.g. map/reduce) are not as intuitive
❏ Manual memory management
❏ High (dev) maintenance cost
Issue #1: Low-Level API
DataFrame: Overview
❏ (Semi-) Structured data
❏ Columnar Storage
❏ Graph mutation
❏ Code generation
❏ "on" by default in 1.5+
❏ "always on" in latest master
Issue #1: Low-Level API
DataFrame: Example
Issue #1: Low-Level API
DataFrame vs RDD
Issue #1: Low-Level API
DataFrame: Graph Mutation
Issue #1: Low-Level API
Lessons Learnt
❏ Be aware of the new features
❏ … especially why they were introduced
❏ Low-Level API != Better Performance
Issue #2
“The fastest way to process big data is to never read it”
Issue #2: DataSource Predicates
Use Cases
FROM Table
WHERE x > 0
Spark Flow
x > 0
Issue #2: DataSource Predicates
Use Cases
FROM Table
WHERE x > 0
AND y < 10
Spark Flow
x > 0
y < 10
Issue #2: DataSource Predicates
Use Cases
FROM Table
WHERE x > 0
OR y < 10
Spark Flow
x > 0
y < 10
Issue #2: DataSource Predicates
… is at a very early stage
❏ Only simple predicates: <, <=, >, >=, =
❏ Only ‘AND’ predicate groups (no OR support)
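The two limitations above (only simple comparisons, only AND groups) can be modelled with a small sketch. This is a simplified illustration, not Spark's actual planner logic: the tuple-based predicate representation and the `split_for_pushdown` helper are assumptions made for this example.

```python
# Comparison operators the slide lists as supported for pushdown.
SIMPLE_OPS = {"<", "<=", ">", ">=", "="}


def conjuncts(pred):
    """Flatten nested top-level ANDs into a list of predicates."""
    if pred[0] == "and":
        return conjuncts(pred[1]) + conjuncts(pred[2])
    return [pred]


def split_for_pushdown(pred):
    """Split a filter into (pushed to the data source, evaluated after reading)."""
    pushed, residual = [], []
    for p in conjuncts(pred):
        if p[0] == "cmp" and p[2] in SIMPLE_OPS:
            pushed.append(p)
        else:
            residual.append(p)  # e.g. an OR group stays with the engine
    return pushed, residual


# Hypothetical predicates matching the three use cases.
x_gt_0 = ("cmp", "x", ">", 0)
y_lt_10 = ("cmp", "y", "<", 10)

print(split_for_pushdown(x_gt_0))                    # WHERE x > 0
print(split_for_pushdown(("and", x_gt_0, y_lt_10)))  # WHERE x > 0 AND y < 10
print(split_for_pushdown(("or", x_gt_0, y_lt_10)))   # WHERE x > 0 OR y < 10
```

In this model the OR case pushes nothing down, so every row has to be read before the filter is applied.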
Issue #2: DataSource Predicates
… is buggy
❏ Parquet < 1.7
❏ PARQUET-136 - NPE if all column values are null
❏ Parquet 1.7
❏ PARQUET-251 - Possible incorrect results for String/Decimal/Binary columns
Apache Parquet
Issue #2: DataSource Predicates
Lessons Learnt
❏ Know your data format / data storage features
❏ ... and issues
❏ It's hard to check predicate pushdown behavior
❏ SPARK-11390: Pushdown information
❏ Simple aggregation operations are not supported
❏ Check out the talk “The Pushdown of Everything”
Issue #3
Spark SQL
Issue #3: Spark (sort of) SQL
Missing Functionality
❏ Window functions (e.g. row_number)
❏ Introduced for HiveContext in 1.4
❏ Introduced for SparkContext in 1.5
❏ Subquery (e.g. not exists) support is still missing
❏ Can sometimes be replaced with left semi join
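To make the left semi join replacement concrete, here is a plain-Python model of the join semantics rather than Spark SQL itself. A left semi join keeps each left row that has at least one match on the right, which is what an EXISTS subquery expresses. The complementary filter for NOT EXISTS (often called an anti join) is included as an assumption of this sketch, and the table and column names are hypothetical.

```python
def left_semi_join(left, right, left_key, right_key):
    """Keep left rows with at least one match in right (EXISTS semantics)."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]


def left_anti_join(left, right, left_key, right_key):
    """Keep left rows with no match in right (NOT EXISTS semantics)."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] not in right_keys]


# Hypothetical data: customer 1 has two orders, customer 2 has none.
customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
orders = [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": 3}]

with_orders = left_semi_join(customers, orders, "id", "customer_id")
without_orders = left_anti_join(customers, orders, "id", "customer_id")
print([c["id"] for c in with_orders])
print([c["id"] for c in without_orders])
```

Unlike an inner join, the semi join returns customer 1 only once even though two orders match, which is why it can stand in for EXISTS.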
Issue #3: Spark (sort of) SQL
Lessons Learnt
❏ Know your use-case
❏ Spark SQL is still quite young
❏ SQL grammar is incomplete
❏ … but actively extended
Issue #4
Round Trips
Issue #4: Round Trips
Data Processing
Internal API
Process / Filter
Issue #4: Round Trips
Resolving Dimensions
Get ID for the ‘Year 2015’
key = ‘2015’
Issue #4: Round Trips
Resolving Dimensions
Get IDs of all past months of the current year
WHERE parent = 2015
and level = month
Dim. id of ‘2015’
key = ‘2015’
Issue #4: Round Trips
Resolving Dimensions
Get IDs of all past months of the current year
AND their siblings from the previous year
parent = 2015
level = month
Dim. id of ‘2015’
key = ‘2015’
sibling_id = sibling_id - 1
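The round-trip pattern in the dimension lookups above can be illustrated with a small sketch. It first resolves the year's ID and then queries its months in a second request, and compares that with one combined request. Everything here is an assumption for illustration: the `dim` table, its columns (`dim_key`, `dim_level`, `parent_id`) and the sample rows are hypothetical, and an in-memory SQLite database stands in for the real backend, not for Spark.

```python
import sqlite3


def build_dimensions():
    """Create a tiny, hypothetical dimension table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dim (id INTEGER PRIMARY KEY, dim_key TEXT, "
        "dim_level TEXT, parent_id INTEGER)"
    )
    rows = [
        (1, "2014", "year", None),
        (2, "2015", "year", None),
        (101, "2014-01", "month", 1),
        (201, "2015-01", "month", 2),
        (202, "2015-02", "month", 2),
    ]
    conn.executemany("INSERT INTO dim VALUES (?, ?, ?, ?)", rows)
    return conn


def months_with_round_trips(conn, year_key):
    """Two requests: resolve the year's ID first, then ask for its months."""
    (year_id,) = conn.execute(
        "SELECT id FROM dim WHERE dim_key = ?", (year_key,)
    ).fetchone()
    cur = conn.execute(
        "SELECT id FROM dim WHERE parent_id = ? AND dim_level = 'month' ORDER BY id",
        (year_id,),
    )
    return [row[0] for row in cur]


def months_single_request(conn, year_key):
    """One request: the year lookup is folded into the month query as a join."""
    cur = conn.execute(
        "SELECT m.id FROM dim AS m JOIN dim AS y ON m.parent_id = y.id "
        "WHERE y.dim_key = ? AND m.dim_level = 'month' ORDER BY m.id",
        (year_key,),
    )
    return [row[0] for row in cur]


conn = build_dimensions()
print(months_with_round_trips(conn, "2015"))
print(months_single_request(conn, "2015"))
```

Both functions return the same IDs. The difference is the number of requests, which matters when each request pays a fixed scheduling cost.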
Issue #4: Round Trips
Lessons Learnt
❏ Spark is better suited for a single complex request
❏ … though not too complex yet
❏ Invest time in architecture analysis and data flow
❏ It might be better to replace a higher-level API
Issue #5
Out of Memory
“RAM's cheap, but not that cheap”
Issue #5: OOM
❏ Receive request
❏ Select / Filter / Process data (on Spark)
❏ Collect results
❏ … Out Of Memory
Issue #5: OOM
Workaround: Requirements
❏ Same data as before
❏ Same external API
Issue #5: OOM
Workaround: Before
❏ Result holds ~ 1M objects
❏ (Average) Object size 928 bytes
❏ Result size ~880 MB
Issue #5: OOM
Workaround: After
❏ Result holds ~ 1M objects
❏ (Average) Object size 272 bytes
❏ Result size ~261 MB
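The result sizes quoted for the two object layouts (928 bytes and 272 bytes per object) follow from object count times average object size. A quick check, assuming exactly 1,000,000 objects and 1 MB = 1024 × 1024 bytes (both are assumptions; the slides only give rounded figures, so the computed values land near, not exactly on, ~880 MB and ~261 MB):

```python
MIB = 1024 * 1024  # bytes per MiB (assumed unit behind the slide's "MB")


def result_size_mib(object_count, avg_object_bytes):
    """Approximate size of a collected result held on the heap, in MiB."""
    return object_count * avg_object_bytes / MIB


OBJECT_COUNT = 1_000_000  # "~1M objects" taken literally for the estimate

before = result_size_mib(OBJECT_COUNT, 928)
after = result_size_mib(OBJECT_COUNT, 272)
saving = 1 - after / before

print(f"before: {before:.0f} MiB, after: {after:.0f} MiB, saving: {saving:.0%}")
```

Under these assumptions the slimmer object layout cuts the collected result to under a third of its original size without changing the data or the external API.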
Issue #5: OOM
Lessons Learnt
❏ Invest (more) time in data structures
❏ Some Java performance tips:
❏ Know your serializer
❏ E.g. Kryo (v2.2.1) prepares an object for deserialization by calling its default constructor.
Instead Of
“The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends”
Any questions?

Protect YugabyteDB with Hashicorp Vault.pdf
Gwenn Etourneau
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdfAC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
VEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering SolutionsVEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering Solutions
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
Kiran Kumar Manigam

Recently uploaded (20)

Sea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy ResourcesSea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy Resources
Modified O-RAN 5G Edge Reference Architecture using RNN
Modified O-RAN 5G Edge Reference Architecture using RNNModified O-RAN 5G Edge Reference Architecture using RNN
Modified O-RAN 5G Edge Reference Architecture using RNN
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdfMachine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
System Analysis and Design in a changing world 5th edition
System Analysis and Design in a changing world 5th editionSystem Analysis and Design in a changing world 5th edition
System Analysis and Design in a changing world 5th edition
Aiml ppt pdf.pdf on music recommendation system
Aiml ppt pdf.pdf on music recommendation systemAiml ppt pdf.pdf on music recommendation system
Aiml ppt pdf.pdf on music recommendation system
07 - Method Statement for Plastering Works.pdf
07 - Method Statement for Plastering Works.pdf07 - Method Statement for Plastering Works.pdf
07 - Method Statement for Plastering Works.pdf
03 - Method Statement for block masonry.pdf
03 - Method Statement for block masonry.pdf03 - Method Statement for block masonry.pdf
03 - Method Statement for block masonry.pdf
sensor networks unit wise 4 ppt units ppt
sensor networks unit wise 4  ppt units pptsensor networks unit wise 4  ppt units ppt
sensor networks unit wise 4 ppt units ppt
02 - Method Statement for Concrete pouring.docx
02 - Method Statement for Concrete pouring.docx02 - Method Statement for Concrete pouring.docx
02 - Method Statement for Concrete pouring.docx
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
software engineering software engineering
software engineering software engineeringsoftware engineering software engineering
software engineering software engineering
Youtube Transcript Sumariser- application of API
Youtube Transcript Sumariser- application of APIYoutube Transcript Sumariser- application of API
Youtube Transcript Sumariser- application of API
Protect YugabyteDB with Hashicorp Vault.pdf
Protect YugabyteDB with Hashicorp Vault.pdfProtect YugabyteDB with Hashicorp Vault.pdf
Protect YugabyteDB with Hashicorp Vault.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdfAC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
VEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering SolutionsVEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering Solutions
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...

Spark - Migration Story

  • 2. About me Roman Chukh ❏ 11+ years of experience ❏ Java / PHP / Ruby / etc. ❏ ~1 year with Apache Spark ❏ Interested in: Data Storage / Data Flow, Monitoring, Provisioning Tools
  • 3. Agenda ❏ Why Spark? ❏ Our Migration to Spark ❏ Issues ❏ … and solutions ❏ … or workarounds ❏ … or at least the lessons learnt
  • 5. “[Spark is a] Fast and general-purpose cluster computing platform for large-scale data processing”
  • 7. Why Spark? Active Development
  • 8. Why Spark? Community Growth
  • 9. Why Spark? Real-World Usage Source: wendell/6
  • 10. Why Spark? Real-World Usage ❏ Largest cluster: 8000 nodes (Tencent) ❏ Largest single job: 1 PB (Databricks) ❏ Top streaming intake: 1 TB / hour
  • 12. Migrating To Spark Before We Start [Diagram labels: Application (SparkContext); Cluster Manager; two Worker Nodes, each with two Executors running a Task]
  • 13. Migrating To Spark The Product ❏ Cloud-based analytics application ❏ Won the Big Data Startup Challenge ❏ In-house computation engine
  • 14. Migrating To Spark Reasons ❏ More data ❏ More granular data ❏ Support various data backends ❏ Support Machine Learning algorithms
  • 15. Migrating To Spark Use Cases ❏ supplement Graph database used to store/query big dimensions ❏ supplement RDBMS for querying of high volumes of data ❏ represent existing computation graph as flow of Spark-based operations
  • 16. Migrating To Spark Star Schema [Diagram labels: Dimension, Dimension, Metric; Process / Filter Dimension (twice); Filter Metric; Data Processing ...; Result]
  • 19. Issue #1: Low-Level API RDD “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
  • 20. Issue #1: Low-Level API RDD: Resilient Distributed Dataset ❏ Immutable ❏ Statically typed: RDD<MyClass> ❏ Fault-Tolerant: Automatically rebuilt on failure ❏ Lazily evaluated
  • 21. Issue #1: Low-Level API Example workflow: Read File line-by-line → Get line length → Sum lengths → Result
  • 22. Issue #1: Low-Level API RDD: Example ❏ lines.txt: some / lines / for / test
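
The example workflow (read lines, take each line's length, sum the lengths) has the map/reduce shape that the RDD slides refer to. Below is a minimal sketch of that shape in plain Python. It is not the Spark RDD API and not the code from the original slide; `total_length` is a name made up for illustration, and the sample data assumes that lines.txt holds the four words one per line.

```python
from functools import reduce


def total_length(lines):
    """Map every line to its length, then reduce the lengths to a single sum."""
    lengths = map(len, lines)                      # "Get line length"
    return reduce(lambda a, b: a + b, lengths, 0)  # "Sum lengths"


# Assumed contents of lines.txt, one word per line.
sample_lines = ["some", "lines", "for", "test"]
print(total_length(sample_lines))
```

Both steps are lazy until `reduce` consumes the mapped values, which loosely mirrors the "lazily evaluated" property listed for RDDs.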
  • 23. Issue #1: Low-Level API RDD: Issues ❏ Functional transformations (e.g. map/reduce) are not as intuitive ❏ Manual memory management ❏ High (dev) maintenance cost
  • 24. Issue #1: Low-Level API DataFrame: Overview ❏ (Semi-) Structured data ❏ Columnar Storage ❏ Graph mutation ❏ Code generation ❏ "on" by default in 1.5+ ❏ "always on" in latest master
  • 25. Issue #1: Low-Level API DataFrame: Example lines.json {"line":"some"} {"line":"lines"} {"line":"for"} {"line":"test"}
  • 26. Issue #1: Low-Level API DataFrame vs RDD
  • 27. Issue #1: Low-Level API DataFrame: Graph Mutation
  • 28. Issue #1: Low-Level API Lessons Learnt ❏ Be aware of the new features ❏ … especially why they were introduced ❏ Low-Level API != Better Performance
  • 30. “The fastest way to process big data is to never read it”
  • 31. Issue #2: DataSource Predicates Use Cases ❏ SQL: SELECT * FROM Table WHERE x > 0 ❏ [Diagram labels: Spark Flow; RDBMS; WHERE x > 0; Result]
  • 32. Issue #2: DataSource Predicates Use Cases ❏ SQL: SELECT * FROM Table WHERE x > 0 AND y < 10 ❏ [Diagram labels: Spark Flow; RDBMS; WHERE x > 0; AND; WHERE y < 10; Result]
  • 33. Issue #2: DataSource Predicates Use Cases ❏ SQL: SELECT * FROM Table WHERE x > 0 OR y < 10 ❏ [Diagram labels: Spark Flow; RDBMS; WHERE x > 0; OR; WHERE y < 10; Result]
  • 35. Issue #2: DataSource Predicates JDBC … is at a very early stage ❏ Only simple predicates: <, <=, >, >=, = ❏ Only ‘AND’ predicate groups (no OR support)
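
The JDBC restriction described here (only simple comparisons, only AND groups) can be modelled as a split of the filter into a part the data source can evaluate and a part that stays in the engine. The sketch below is a toy model in plain Python, not Spark's implementation; the tuple encoding of predicates and the name `split_pushdown` are hypothetical.

```python
# Comparison operators the slide lists as supported for pushdown.
SIMPLE_OPS = {"<", "<=", ">", ">=", "="}


def split_pushdown(predicate):
    """Return (pushed, residual) lists of sub-predicates.

    A predicate is either a comparison such as ("x", ">", 0)
    or a group such as ("AND", left, right) / ("OR", left, right).
    """
    head = predicate[0]
    if head == "AND":
        # Both sides of an AND can be decided independently.
        left_pushed, left_rest = split_pushdown(predicate[1])
        right_pushed, right_rest = split_pushdown(predicate[2])
        return left_pushed + right_pushed, left_rest + right_rest
    if head == "OR":
        # No OR support: the whole group is evaluated after the read.
        return [], [predicate]
    _column, op, _value = predicate
    if op in SIMPLE_OPS:
        return [predicate], []
    return [], [predicate]


and_query = ("AND", ("x", ">", 0), ("y", "<", 10))
or_query = ("OR", ("x", ">", 0), ("y", "<", 10))
print(split_pushdown(and_query))
print(split_pushdown(or_query))
```

In this model the AND query is filtered entirely at the source, while the OR query forces every row to be read before filtering, which is the cost the "never read it" quote warns about.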
  • 36. Issue #2: DataSource Predicates Apache Parquet … is buggy ❏ Parquet < 1.7: PARQUET-136 (NPE if all column values are null) ❏ Parquet 1.7: PARQUET-251 (possible incorrect results for String/Decimal/Binary columns)
  • 37. Issue #2: DataSource Predicates Lessons Learnt ❏ Know your data format / data storage features ❏ ... and issues ❏ It's hard to check predicate pushdown behavior (see SPARK-11390: Pushdown information) ❏ Simple aggregation operations are not supported ❏ Check out the talk “The Pushdown of Everything”
  • 39. Issue #3: Spark (sort of) SQL Missing Functionality ❏ Window functions (e.g. row_number) ❏ Introduced for HiveContext in 1.4 ❏ Introduced for SQLContext in 1.5 ❏ Subquery (e.g. not exists) support is still missing ❏ Can sometimes be replaced with left semi join
  • 40. Issue #3: Spark (sort of) SQL Lessons Learnt ❏ Know your use-case ❏ Spark SQL is still quite young ❏ SQL grammar is incomplete ❏ … but actively extended
  • 42. Issue #4: Round Trips Background [Diagram labels: Metric; Filter Metric; Dimension; Process / Filter Dimension; Dimension ids; Internal API; Data Processing ...; Result]
  • 44. Issue #4: Round Trips Resolving Dimensions ❏ Get ID for the ‘Year 2015’ ❏ [Diagram labels: Dimension; WHERE key = ‘2015’; Result]
  • 45. Issue #4: Round Trips Resolving Dimensions ❏ Get IDs of all past months of the current year ❏ [Diagram labels: Dimension; WHERE key = ‘2015’; Dim. id of ‘2015’; WHERE parent = 2015 and level = month; Result]
  • 46. Issue #4: Round Trips Resolving Dimensions ❏ Get IDs of all past months of the current year AND their siblings from the previous year ❏ [Diagram labels: Dimension; WHERE key = ‘2015’; Dim. id of ‘2015’; WHERE parent = 2015 and level = month; Jan, Feb, …; WHERE sibling_id = sibling_id - 1; Result]
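
The round-trip problem can be seen in miniature by resolving the months of a year in two dependent queries versus one combined query. The sketch below uses an in-memory SQLite database purely as a stand-in for the data store; it is not Spark code, and the `dimension` table layout, its rows, and both function names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimension (id INTEGER PRIMARY KEY, key TEXT, level TEXT, parent INTEGER)"
)
conn.executemany(
    "INSERT INTO dimension (id, key, level, parent) VALUES (?, ?, ?, ?)",
    [
        (1, "2015", "year", None),
        (2, "Jan", "month", 1),
        (3, "Feb", "month", 1),
        (4, "2014", "year", None),
        (5, "Jan", "month", 4),
    ],
)


def month_ids_two_trips(conn, year_key):
    """Two dependent requests: the second cannot start before the first returns."""
    (year_id,) = conn.execute(
        "SELECT id FROM dimension WHERE key = ? AND level = 'year'", (year_key,)
    ).fetchone()
    rows = conn.execute(
        "SELECT id FROM dimension WHERE parent = ? AND level = 'month' ORDER BY id",
        (year_id,),
    )
    return [row[0] for row in rows]


def month_ids_one_trip(conn, year_key):
    """One request: the year lookup is folded into the month query as a join."""
    rows = conn.execute(
        "SELECT m.id FROM dimension AS m "
        "JOIN dimension AS y ON m.parent = y.id "
        "WHERE y.key = ? AND y.level = 'year' AND m.level = 'month' "
        "ORDER BY m.id",
        (year_key,),
    )
    return [row[0] for row in rows]


print(month_ids_two_trips(conn, "2015"))
print(month_ids_one_trip(conn, "2015"))
```

Each extra dependent lookup adds a full request latency, so a chain such as year, then months, then siblings grows linearly in round trips unless it is expressed as one request.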
  • 47. Issue #4: Round Trips Lessons Learnt ❏ Spark is better suited for a single complex request ❏ … though not too complex yet ❏ Invest time in architecture analysis and data flow ❏ It might be better to replace a higher-level API
  • 48. Issue #5 Out of Memory
  • 49. “RAM's cheap, but not that cheap”
  • 50. Issue #5: OOM Background ❏ Receive request ❏ Select / Filter / Process data (on Spark) ❏ Collect results ❏ … Out Of Memory
  • 51. Issue #5: OOM Workaround: Requirements ❏ Same data as before ❏ Same external API
  • 52. Issue #5: OOM Workaround: Before ❏ Result holds ~ 1M objects ❏ (Average) Object size 928 bytes ❏ Result size ~880 MB
  • 53. Issue #5: OOM Workaround: After ❏ Result holds ~ 1M objects ❏ (Average) Object size 272 bytes ❏ Result size ~261 MB
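
The before and after result sizes follow from object count times average object size. The sketch below is a quick arithmetic check, assuming exactly 1,000,000 objects (the slides say ~1M) and 1 MB = 1024 × 1024 bytes; `result_size_mb` is a hypothetical helper name.

```python
def result_size_mb(object_count, avg_object_bytes):
    """Approximate in-memory size of a result set in MB (1 MB = 1024 * 1024 bytes)."""
    return object_count * avg_object_bytes / (1024 * 1024)


before = result_size_mb(1_000_000, 928)  # average object size before the workaround
after = result_size_mb(1_000_000, 272)   # average object size after the workaround
print(f"before: {before:.0f} MB, after: {after:.0f} MB")
print(f"reduction: {1 - after / before:.0%}")
```

With these assumptions the estimates land within a few MB of the slide figures; the gap is consistent with the object count being only approximately one million.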
  • 54. Issue #5: OOM Lessons Learnt ❏ Invest (more) time in data structures ❏ Some Java performance tips ❏ Know your serializer ❏ E.g. Kryo (v2.2.1) prepares an object for deserialization by calling its default constructor
  • 56. “The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends”
  • 58. Resources ❏ dataframes-52776940 ❏ zaharia-keynote ❏ databricks-cofounder-patrick-wendell/6 ❏ dataframes-52776940 ❏ zaharia-keynote ❏ everything-to-ram-and-run-it-from-there ❏ t_there_is_a_highway_to_hell_and_only