Notes Sharing
Richard Kuo, Professional-Technical Architect,
Domain 2.0 Architecture & Planning
Agenda
• Big Data
• Overview of Spark
• Main Concepts
– RDD
– Transformations
– Programming Model
• Observation
01/06/15 Creative Common, BY, SA, NC 2
What is Apache Spark?
• Fast and general cluster computing system, interoperable with Hadoop.
• Improves efficiency through:
– In-memory computing primitives
– General computational graph
• Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell
Big Data: Hadoop Ecosystem
Distributed Computing
Comparison with Hadoop
Hadoop | Spark
MapReduce framework | Generalized computation
Data is usually on disk (HDFS) | Data is on disk or in memory
Not ideal for iterative workloads | Data can be cached in memory, great for iterative workloads
Batch processing | Real-time streaming or batch
Spark also offers:
• Up to 10x faster when data is on disk
• Up to 100x faster when data is in memory
• 2-5x less code to write
• Supports Scala, Java and Python
• Code re-use across modules
• Interactive shell for ad-hoc exploration
• Library support: GraphX, Machine Learning, SQL, R, Streaming, …
Compare to Hadoop:
System performance degrades gracefully with less RAM
[Chart: execution time (s) by % of working set in cache]
Cache disabled: 69
25%: 58
50%: 41
75%: 30
Fully cached: 12
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Diagram: Your application (SparkContext) -> local threads, or cluster manager -> workers, each running a Spark executor -> HDFS or other storage]
Spark Architecture
[Diagram: client, cluster manager (Spark Standalone | Mesos | YARN), nodes]
Key Concept: RDDs
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
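The split between transformations and actions can be illustrated with a minimal sketch in plain Python. It does not use Spark; `ToyRDD`, `read_log`, and the sample log lines are hypothetical names invented for this illustration. The sketch assumes the usual behaviour that transformations only record what to do, and that work happens when an action is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that returns the records
        self._steps = tuple(steps)   # recorded transformations, applied in order

    # Transformations: record the step and return a new dataset; no data is read.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        data = iter(self._source())
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the recorded steps and return a result to the caller.
    def count(self):
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())


reads = []  # records every time the source is actually read

def read_log():
    reads.append("read")
    return ["ERROR disk full", "INFO started", "ERROR timeout"]

errors = ToyRDD(read_log).filter(lambda s: s.startswith("ERROR")).map(str.upper)
reads_before_action = len(reads)   # transformations alone do not touch the source
n_errors = errors.count()          # the action triggers the read
```

Building `errors` reads nothing; only `count()` (or `collect()`) pulls data through the recorded steps.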
Write programs in terms of operations on distributed datasets
Fault Recovery
RDDs track the series of transformations used to build them (their lineage), so lost data can be recomputed from its source instead of being replicated across the wire.
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Python
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Diagram: HDFS File -> filter (func = startsWith(…)) -> Filtered RDD -> map (func = split(...)) -> Mapped RDD]
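The lineage idea can be sketched in plain Python without Spark. `LineageDataset` and the sample records are hypothetical and exist only for this illustration. The sketch assumes each derived dataset remembers its parent and the function that produced it, so a lost partition is rebuilt by replaying filter and then map on the matching source partition.

```python
class LineageDataset:
    """Hypothetical toy model of lineage tracking (not Spark's API)."""

    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function: parent partition -> this partition
        self.source = source        # root only: list of partitions in durable storage
        self.cache = {}             # partition index -> computed records

    def filter(self, pred):
        return LineageDataset(self, lambda part: [x for x in part if pred(x)])

    def map(self, fn):
        return LineageDataset(self, lambda part: [fn(x) for x in part])

    def partition(self, i):
        """Return partition i, recomputing it from the lineage if it is not cached."""
        if self.parent is None:
            return list(self.source[i])
        if i not in self.cache:
            self.cache[i] = self.transform(self.parent.partition(i))
        return self.cache[i]


hdfs_file = LineageDataset(source=[
    ["ERROR\tdb\tdisk full", "INFO\tweb\tstarted"],
    ["ERROR\tweb\ttimeout"],
])
filtered = hdfs_file.filter(lambda s: s.startswith("ERROR"))
mapped = filtered.map(lambda s: s.split("\t")[2])

before = [mapped.partition(i) for i in range(2)]

# Simulate losing the node that held partition 1 of both derived datasets.
filtered.cache.pop(1)
mapped.cache.pop(1)
recovered = mapped.partition(1)  # rebuilt from the source via filter, then map
```

Only the lost partition is recomputed; partition 0 stays in the cache and is not touched.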
Language Support
Standalone Programs
• Python, Scala, & Java
Interactive Shells
• Python & Scala
Performance
• Java & Scala are faster due to static typing
• …but Python is often fine
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• Or can run locally
DEMO
Transformation
Spark Streaming
Spark Streaming: Word Count
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Helper from the Spark examples package; it only adjusts log verbosity
    StreamingExamples.setStreamingLogLevels()

    // Create Spark context: a streaming context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create, map, reduce: read lines from the socket, split them into words,
    // and count each word within the batch
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // Output: print the first few counts of each batch
    wordCounts.print()

    // Start the computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()
  }
}
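The per-batch logic of the word count (flatMap, map to pairs, reduceByKey) can be sketched in plain Python without Spark. `word_counts` and the sample `batches` are hypothetical names for this illustration. The sketch assumes that each batch is counted independently and that no state carries over between batches.

```python
from collections import Counter

def word_counts(batch_lines):
    """Count words in one batch of lines."""
    # flatMap: split every line into words and flatten into one stream
    words = (word for line in batch_lines for word in line.split(" "))
    # map to (word, 1) and reduceByKey(_ + _) collapse into a single Counter
    return Counter(words)

# Each inner list stands for the lines received during one batch interval.
batches = [
    ["spark streaming spark"],
    ["hello spark", "hello world"],
]
results = [dict(word_counts(batch)) for batch in batches]
```

Here `results[0]` holds the counts for the first batch only; the second batch starts again from zero.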
Analytics
Conclusion
• Spark offers a rich API that makes data analytics fast: less code to write and faster to run.
• Achieves up to 100x speedups in real applications.
• Growing community.
Observations:
• More data, of more kinds, is generated faster and needs to be analyzed in real time.
• All* products are data products.
• More complicated analytic algorithms are applied to commercial products and services.
• Not all data analysis requires the same accuracy.
• Expectations for service delivery are increasing.
01/06/15 Creative Common, BY, SA, NC 24

More Related Content

What's hot

Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
Fernando Rodriguez
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
Akhil Das
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
sparkInstructor
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
Thu Hiền
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
jlacefie
 
Internals
InternalsInternals
Internals
Sandeep Purohit
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
caidezhi655
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
Tudor Lapusan
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 

What's hot (20)

Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Internals
InternalsInternals
Internals
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Viewers also liked

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit
 
Machine learning Mindmap
Machine learning MindmapMachine learning Mindmap
Machine learning Mindmap
Yee Jie NG
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
Gao Yunzhong
 
Visual book summaries
Visual book summariesVisual book summaries
Visual book summaries
chrisvdberge
 
ProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technologyProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technology
Scottish Library & Information Council (SLIC), CILIP in Scotland (CILIPS)
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
Demet Aksoy
 
Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
TheMilestoneBrand
 
Julius Caesar - Summary
Julius Caesar - SummaryJulius Caesar - Summary
Julius Caesar - Summary
Maximoff
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
DataWorks Summit
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
Edureka!
 
Great Executive Summaries
Great Executive SummariesGreat Executive Summaries
Great Executive Summaries
Andy Forbes
 
The Lean Startup - Visual Summary
The Lean Startup - Visual SummaryThe Lean Startup - Visual Summary
The Lean Startup - Visual Summary
Brett Suddreth
 
Inside Apple
Inside AppleInside Apple

Viewers also liked (14)

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Machine learning Mindmap
Machine learning MindmapMachine learning Mindmap
Machine learning Mindmap
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Visual book summaries
Visual book summariesVisual book summaries
Visual book summaries
 
ProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technologyProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technology
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
 
Julius Caesar - Summary
Julius Caesar - SummaryJulius Caesar - Summary
Julius Caesar - Summary
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Great Executive Summaries
Great Executive SummariesGreat Executive Summaries
Great Executive Summaries
 
The Lean Startup - Visual Summary
The Lean Startup - Visual SummaryThe Lean Startup - Visual Summary
The Lean Startup - Visual Summary
 
Inside Apple
Inside AppleInside Apple
Inside Apple
 

Similar to Spark Study Notes

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
Rahul Kumar
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 

Similar to Spark Study Notes (20)

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Spark core
Spark coreSpark core
Spark core
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 

More from Richard Kuo

Machine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural NetworkMachine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural Network
Richard Kuo
 
View Orchestration from Model Driven Engineering Prospective
View Orchestration from Model Driven Engineering ProspectiveView Orchestration from Model Driven Engineering Prospective
View Orchestration from Model Driven Engineering Prospective
Richard Kuo
 
Telecom Infra Project study notes
Telecom Infra Project study notesTelecom Infra Project study notes
Telecom Infra Project study notes
Richard Kuo
 
5g, gpu and fpga
5g, gpu and fpga5g, gpu and fpga
5g, gpu and fpga
Richard Kuo
 
Learning
Learning Learning
Learning
Richard Kuo
 
Kubernetes20151017a
Kubernetes20151017aKubernetes20151017a
Kubernetes20151017a
Richard Kuo
 
IaaS with Chef
IaaS with ChefIaaS with Chef
IaaS with Chef
Richard Kuo
 
Ontology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpediaOntology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpedia
Richard Kuo
 
SDN and NFV
SDN and NFVSDN and NFV
SDN and NFV
Richard Kuo
 
Graph Database
Graph DatabaseGraph Database
Graph Database
Richard Kuo
 
UML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201aUML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201a
Richard Kuo
 
Open v switch20150410b
Open v switch20150410bOpen v switch20150410b
Open v switch20150410b
Richard Kuo
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
Richard Kuo
 
Git studynotes
Git studynotesGit studynotes
Git studynotes
Richard Kuo
 
Cloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibmCloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibm
Richard Kuo
 

More from Richard Kuo (15)

Machine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural NetworkMachine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural Network
 
View Orchestration from Model Driven Engineering Prospective
View Orchestration from Model Driven Engineering ProspectiveView Orchestration from Model Driven Engineering Prospective
View Orchestration from Model Driven Engineering Prospective
 
Telecom Infra Project study notes
Telecom Infra Project study notesTelecom Infra Project study notes
Telecom Infra Project study notes
 
5g, gpu and fpga
5g, gpu and fpga5g, gpu and fpga
5g, gpu and fpga
 
Learning
Learning Learning
Learning
 
Kubernetes20151017a
Kubernetes20151017aKubernetes20151017a
Kubernetes20151017a
 
IaaS with Chef
IaaS with ChefIaaS with Chef
IaaS with Chef
 
Ontology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpediaOntology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpedia
 
SDN and NFV
SDN and NFVSDN and NFV
SDN and NFV
 
Graph Database
Graph DatabaseGraph Database
Graph Database
 
UML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201aUML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201a
 
Open v switch20150410b
Open v switch20150410bOpen v switch20150410b
Open v switch20150410b
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
 
Git studynotes
Git studynotesGit studynotes
Git studynotes
 
Cloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibmCloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibm
 

Recently uploaded

TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc
 
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
DianaGray10
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
OnBoard
 
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
Stephanie Beckett
 
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Quentin Reul
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
webbyacad software
 
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
Peter Caitens
 
The Path to General-Purpose Robots - Coatue
Spark Study Notes

  • 1. Notes Sharing Richard Kuo, Professional-Technical Architect, Domain 2.0 Architecture & Planning
  • 2. Agenda • Big Data • Overview of Spark • Main Concepts – RDD – Transformations – Programming Model • Observation 01/06/15 Creative Common, BY, SA, NC 2
  • 3. What is Apache Spark? • Fast and general cluster computing system, interoperable with Hadoop. • Improves efficiency through: – In-memory computing primitives – General computational graph • Improves usability through: – Rich APIs in Scala, Java, Python – Interactive shell
  • 4. Big Data: Hadoop Ecosystem
  • 6. Comparison with Hadoop • Model: Hadoop is a MapReduce framework; Spark supports generalized computation. • Data location: Hadoop data is usually on disk (HDFS); Spark data can be on disk or in memory. • Iterative work: Hadoop is not ideal for it; Spark can cache data in memory, which suits iterative work well. • Processing: Hadoop is batch; Spark handles real-time streaming or batch. • Spark only: – Up to 10x faster when data is on disk – Up to 100x faster when data is in memory – 2-5x less code to write – Supports Scala, Java and Python – Code reuse across modules – Interactive shell for ad-hoc exploration – Library support: GraphX, Machine Learning, SQL, R, Streaming, …
  • 8. Compare to Hadoop:
  • 9. System performance degrades gracefully with less RAM • Chart: execution time (s) by % of working set in cache – Cache disabled: 69 – 25%: 58 – 50%: 41 – 75%: 30 – Fully cached: 12
  • 10. Software Components • Spark runs as a library in your program (1 instance per app) • Runs tasks locally or on a cluster – Mesos, YARN or standalone mode • Accesses storage systems via the Hadoop InputFormat API – Can use HBase, HDFS, S3, … • Diagram labels: Your application, SparkContext, Local threads, Cluster manager, Worker (Spark executor) x2, HDFS or other storage
  • 11. Spark Architecture • [Spark Standalone | Mesos | YARN] • Diagram labels: Node, Client
  • 12. Key Concept: RDDs • Write programs in terms of operations on distributed datasets • Resilient Distributed Datasets – Collections of objects spread across a cluster, stored in RAM or on disk – Built through parallel transformations – Automatically rebuilt on failure • Operations – Transformations (e.g. map, filter, groupBy) – Actions (e.g. count, collect, save)
  • 13. Fault Recovery • RDDs track the series of transformations used to build them (their lineage) so that lost data can be recomputed, with no need to replicate data across the network.
      Scala:
        val lines = sc.textFile(...)
        lines.filter(x => x.contains("ERROR")).count()
      Python:
        msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
      Lineage diagram: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
  • 14. Language Support • Standalone programs: Python, Scala and Java • Interactive shells: Python and Scala • Performance: Java and Scala are faster due to static typing, but Python is often fine
      Python:
        lines = sc.textFile(...)
        lines.filter(lambda s: "ERROR" in s).count()
      Scala:
        val lines = sc.textFile(...)
        lines.filter(x => x.contains("ERROR")).count()
      Java:
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("ERROR"); }
        }).count();
  • 15. Interactive Shell • The fastest way to learn Spark • Available in Python and Scala • Runs as an application on an existing Spark cluster… • Or can run locally
  • 18. Spark Streaming
  • 19. Spark Streaming
  • 20. Spark Streaming: Word Count
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._
      import org.apache.spark.storage.StorageLevel
      object NetworkWordCount {
        def main(args: Array[String]): Unit = {
          if (args.length < 2) {
            System.err.println("Usage: NetworkWordCount <hostname> <port>")
            System.exit(1)
          }
          // StreamingExamples is a helper in Spark's examples package
          // (org.apache.spark.examples.streaming); it only adjusts log levels.
          StreamingExamples.setStreamingLogLevels()
          // Create the context with a 1 second batch size
          val sparkConf = new SparkConf().setAppName("NetworkWordCount")
          val ssc = new StreamingContext(sparkConf, Seconds(1))
          // Create the input stream, then map and reduce
          val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
          val words = lines.flatMap(_.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
          // Output
          wordCounts.print()
          // Start
          ssc.start()
          ssc.awaitTermination()
        }
      }
  • 22. Conclusion • Spark offers a rich API that makes data analytics fast: less code to write and faster to run. • Achieves up to 100x speedups in real applications. • Growing community.
  • 23. Observations: • More data, of more kinds, is generated faster and needs to be analyzed in real time. • All* products are data products. • More complicated analytic algorithms are being applied to commercial products and services. • Not all data analysis requires the same accuracy. • Expectations for service delivery are increasing.
  • 24. Reference: • AMPLab at UC Berkeley • Databricks • UC BerkeleyX – CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015, 5 weeks – CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks • Spark Summit 2014 Training • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing • An Architecture for Fast and General Data Processing on Large Clusters • Richard’s Study Notes – Self Study AMPCamp – Hortonworks HDP 2.2 Study
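
The ideas on slide 12 (lazy transformations versus actions) and slide 13 (rebuilding lost data from lineage) can be tried out without a cluster. The sketch below is a toy, single-process model in plain Python. `ToyRDD`, `read_log`, `lose_cache` and the sample log lines are made up for illustration and are not Spark's API; a real RDD is partitioned across workers and is recomputed per lost partition.

```python
from typing import Any, Callable, Iterable, List


class ToyRDD:
    """Toy single-process model of an RDD; NOT Spark's API."""

    def __init__(self, compute: Callable[[], Iterable[Any]], lineage: List[str]):
        self._compute = compute   # recipe that rebuilds the data on demand
        self.lineage = lineage    # human-readable record of the steps so far
        self._cached = None       # materialized copy, set only by persist()

    def _iterate(self) -> Iterable[Any]:
        # Serve the in-memory copy when present, otherwise replay the recipe.
        return self._cached if self._cached is not None else self._compute()

    # Transformations: return a new ToyRDD and do no work yet.
    def filter(self, pred: Callable[[Any], bool], label: str) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)),
                      self.lineage + [label])

    def map(self, fn: Callable[[Any], Any], label: str) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()),
                      self.lineage + [label])

    # Actions: force evaluation.
    def collect(self) -> List[Any]:
        return list(self._iterate())

    def count(self) -> int:
        return len(self.collect())

    def persist(self) -> "ToyRDD":
        # Eagerly materialize here for simplicity (an assumption of this toy).
        self._cached = list(self._compute())
        return self

    def lose_cache(self) -> None:
        # Simulates losing the in-memory copy, e.g. after a worker failure.
        self._cached = None


reads = {"count": 0}  # how many times the "file" was actually read


def read_log() -> List[str]:
    reads["count"] += 1
    return ["ERROR\tdisk\tfull", "INFO\tapp\tstarted", "ERROR\tnet\ttimeout"]


lines = ToyRDD(read_log, ["textFile"])
msgs = (lines.filter(lambda s: s.startswith("ERROR"), "filter(startswith ERROR)")
             .map(lambda s: s.split("\t")[2], "map(split on tab, field 2)"))

reads_before_action = reads["count"]  # transformations alone read nothing
first = msgs.persist().collect()      # one read, result kept in memory
msgs.lose_cache()                     # simulate the failure
rebuilt = msgs.collect()              # replayed from the recorded lineage
print(msgs.lineage, first, rebuilt, reads["count"])
```

The point of the design is that only the recipe is kept, not a second copy of the data: after `lose_cache()` the same result comes back at the cost of one more read of the source.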

Editor's Notes

  1. MPI (Message Passing Interface)
  2. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
  3. Gracefully
  4. The barrier to entry for working with the Spark API is minimal.
  5. (word, 1L), then reduceByKey(_ + _)
  6. From http://spark.apache.org/docs/latest/streaming-programming-guide.html. Usage: NetworkWordCount <hostname> <port>. To run this on your local machine, first run a Netcat server with `$ nc -lk 9999`, then run the example with `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`.
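
The word count on slide 20 applies the same three steps (flatMap, map to pairs, reduceByKey) to every one-second batch of lines. A minimal plain-Python sketch of that per-batch logic follows, with no Spark or socket involved. The function name `word_count` and the two sample batches are hypothetical, chosen only to show the shape of the computation.

```python
from typing import Dict, Iterable, List


def word_count(batch: Iterable[str]) -> Dict[str, int]:
    """Count words in one batch of lines, mirroring the slide's pipeline."""
    # flatMap(_.split(" ")): one stream of words from many lines
    words = (word for line in batch for word in line.split(" "))
    # map(x => (x, 1)): pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): add up the counts that share a key
    counts: Dict[str, int] = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts


# Two made-up batches standing in for two batch intervals of socket input.
batches: List[List[str]] = [
    ["to be or not to be"],
    ["be quick", "be kind"],
]

# Each batch is counted independently, as in the streaming example.
per_batch = [word_count(batch) for batch in batches]
for counts in per_batch:
    print(counts)
```

Counts do not carry over between batches here, which matches the slide's example: it prints a fresh count for each batch rather than a running total.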