Introduction to Spark 
Wisely Chen (aka thegiive) 
Sr. Engineer at Yahoo
Agenda 
• What is Spark? ( Easy ) 
• Spark Concept ( Middle ) 
• Break : 10min 
• Spark EcoSystem ( Easy ) 
• Spark Future ( Middle ) 
• Q&A
Who am I? 
• Wisely Chen ( thegiive@gmail.com ) 
• Sr. Engineer on the Yahoo! Taiwan data team
• Loves to promote open source tech 
• Hadoop Summit 2013 San Jose 
• Jenkins Conf 2013 Palo Alto 
• Spark Summit 2014 San Francisco 
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013,
PHPConf 2012, RubyConf 2012
Taiwan Data Team 
[Diagram: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning]
OCF.tw's talk about "Introduction to spark"
Forecast 
Recommendation
HADOOP
Opinion from Cloudera 
• The leading candidate for “successor to 
MapReduce” today is Apache Spark 
• No vendor — no new project — is likely to catch 
up. Chasing Spark would be a waste of time, 
and would delay availability of real-time analytic 
and processing services for no good reason.
• From http://0rz.tw/y3OfM
What is Spark 
• From UC Berkeley AMP Lab 
• Most active big data open source project since
Hadoop
Community
Where is Spark?
[Diagram: Hadoop 2.0 stack: MapReduce, Storm, HBase, and others on YARN, with HDFS underneath]
Hadoop Architecture
• SQL: Hive
• Computing Engine: MapReduce
• Resource Management: YARN
• Storage: HDFS
Hadoop vs Spark
• SQL: Hive (Hadoop) vs Shark/SparkSQL (Spark)
• Computing engine: MapReduce (Hadoop) vs Spark
• Shared by both: YARN, HDFS
Spark vs Hadoop 
• Spark runs on YARN, Mesos, or in standalone mode
• Spark's main concept is based on MapReduce
• Spark can read from 
• HDFS: data locality 
• HBase 
• Cassandra
More than MapReduce 
• Shark : Hive
• GraphX : Pregel
• MLlib : Mahout
• Streaming : Storm
• Spark Core : MapReduce
• HDFS
• Resource Management System (YARN, Mesos)
Why Spark?
Of all the martial arts under heaven, no strength is unbreakable; only speed cannot be beaten.
Running time (s), MR vs Spark
• Logistic regression: MR 76, Spark 3
• KMeans: MR 106, Spark 33
• PageRank: MR 171, Spark 23
3X~25X faster than the MapReduce framework
From Matei's paper: http://0rz.tw/VVqgP
What is Spark 
• Apache Spark™ is a very fast and general 
engine for large-scale data processing
Language Support 
• Python 
• Java 
• Scala
Python Word Count 
# spark is an existing SparkContext
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Access data via 
Spark API 
Process via Python
Why is Spark so fast?
Most machine learning 
algorithms need iterative computing
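PageRank
[Figure: PageRank on a four-node graph (a, b, c, d). Every rank starts at 1.0; each iteration turns the current ranks into a temporary result that becomes the next ranks. After the 1st iteration the ranks are 1.85 (a), 1.0, 0.58, 0.58; after the 2nd iteration they are 1.31 (a), 1.72, 0.39, 0.58.]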
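The iterative pattern can be sketched in plain Python (not Spark code). The edge list below is an assumption: the transcript does not give the graph, so the edges were chosen so that the common update rank = 0.15 + 0.85 * (sum of contributions) gives the numbers shown on the slide. The function names are hypothetical.

```python
# Hypothetical link graph: node -> nodes it links to.
# Assumption: these edges are not in the slides; they were picked so the
# update reproduces the slide's values.
links = {
    "a": ["b"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["a", "c"],
}


def pagerank_step(ranks, links, damping=0.85):
    """One iteration: every node splits its rank evenly among its out-links."""
    contribs = {node: 0.0 for node in links}
    for node, targets in links.items():
        share = ranks[node] / len(targets)
        for target in targets:
            contribs[target] += share
    return {node: (1 - damping) + damping * total
            for node, total in contribs.items()}


def pagerank(links, iterations):
    """Run a fixed number of iterations, starting every rank at 1.0."""
    ranks = {node: 1.0 for node in links}
    for _ in range(iterations):
        # The result of one pass is the input of the next pass.
        ranks = pagerank_step(ranks, links)
    return ranks


if __name__ == "__main__":
    for i in (1, 2):
        result = pagerank(links, i)
        print(i, {n: round(r, 3) for n, r in sorted(result.items())})
```

Each pass needs the full output of the previous pass, which is why the place where the intermediate ranks are stored matters so much for running time.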
HDFS is 100x slower than memory 
MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> Iter N
Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> Iter N
PageRank algorithm on 1 billion URL records
• First iteration (HDFS) takes 200 sec
• 2nd iteration (mem) takes 7.4 sec
• 3rd iteration (mem) takes 7.7 sec
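A rough back-of-the-envelope sketch of what these timings imply for a longer job. Two assumptions are made here that the slides do not state: every later in-memory iteration costs about 7.4 sec, and a run without in-memory reuse would pay the 200 sec cost on every iteration. The function names are hypothetical.

```python
FIRST_ITER_S = 200.0   # first iteration, input read from HDFS (from the slide)
CACHED_ITER_S = 7.4    # later iteration served from memory (from the slide)


def total_seconds(iterations, first_iter_s, later_iter_s):
    """Total time when the first iteration and later iterations cost differently."""
    if iterations < 1:
        return 0.0
    return first_iter_s + (iterations - 1) * later_iter_s


def compare(iterations):
    """Return (in-memory total, reread-every-time total) in seconds."""
    in_memory = total_seconds(iterations, FIRST_ITER_S, CACHED_ITER_S)
    # Assumption: without reuse, each iteration costs as much as the first one.
    reread_each_time = total_seconds(iterations, FIRST_ITER_S, FIRST_ITER_S)
    return in_memory, reread_each_time


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, compare(n))
```

Under these assumptions the gap widens with every extra iteration, because only the first pass pays the slow read.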
Spark Concept
Map Reduce 
Shuffle
DAG Engine
RDD 
• Resilient Distributed Dataset 
• Collections of objects spread across a cluster, 
stored in RAM or on Disk 
• Built through parallel transformations
Fault Tolerance 
Of all the martial arts under heaven, no strength is unbreakable; only speed cannot be beaten.
RDD 
val a = sc.textFile("hdfs://....")                   // RDD a
val b = a.filter( line => line.contains("Spark") )   // RDD b (transformation)
val c = b.count()                                    // Value c (action)
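The split between transformations and actions can be illustrated with a toy, single-machine stand-in written in plain Python. MiniRDD and read_lines are hypothetical names made up for this sketch and are not part of the Spark API; the point is only that a transformation records work while an action runs it.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source   # zero-argument callable that yields records
        self._steps = steps     # recorded filter predicates, not yet run

    def filter(self, predicate):
        # Transformation: record the step and return a new MiniRDD.
        return MiniRDD(self._source, self._steps + (predicate,))

    def count(self):
        # Action: run every recorded step over the source and return a value.
        return sum(1 for rec in self._source()
                   if all(p(rec) for p in self._steps))


reads = []  # one entry is appended each time the source is actually read


def read_lines():
    reads.append(1)
    return iter(["Spark is fast", "Hadoop MapReduce", "Spark on YARN"])


a = MiniRDD(read_lines)                      # nothing is read yet
b = a.filter(lambda line: "Spark" in line)   # still nothing is read
reads_before_action = len(reads)
c = b.count()                                # the action triggers the read
print(reads_before_action, len(reads), c)
```

Deferring the work until an action is called is what lets an engine see the whole chain of steps before running any of them.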
Log mining 
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") )
           .count()
val apache = err.filter( t => t.contains("APACHE") )   // renamed: a is already defined above
           .count()
[Diagram: the Driver sends a Task to each of three Workers]
Log mining
[Diagram: each of the three Workers loads one block of the input (Block1, Block2, Block3) as its part of RDD a]
Log mining
[Diagram: each Worker filters its block (Block1, Block2, Block3) into its part of RDD err]
Log mining
[Diagram: after err.cache() and err.count(), each Worker keeps its part of RDD err in memory (Cache1, Cache2, Cache3)]
Log mining
[Diagram: the MYSQL filter (RDD m) runs on each Worker against the cached data (Cache1, Cache2, Cache3)]
Log mining
[Diagram: the APACHE filter (labeled RDD a on the slide) runs on each Worker against the cached data (Cache1, Cache2, Cache3)]
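The effect of err.cache() in the log mining walkthrough can be imitated on one machine with plain Python. CachedDataset, load_errors, and the sample log lines are all hypothetical and only model the behavior: the first action builds the filtered data, and later actions reuse it instead of rereading the input.

```python
class CachedDataset:
    """Toy model of cache() (hypothetical helper, not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the records
        self._use_cache = False
        self._cached = None
        self.computations = 0     # how many times the records were rebuilt

    def cache(self):
        # Like err.cache(): only marks the dataset; nothing is computed yet.
        self._use_cache = True
        return self

    def _records(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        records = list(self._compute())
        if self._use_cache:
            self._cached = records
        return records

    def count(self, predicate=lambda rec: True):
        return sum(1 for rec in self._records() if predicate(rec))


# Made-up sample input standing in for the HDFS log file.
LOG = [
    "2014 ERROR MYSQL connection lost",
    "2014 INFO APACHE started",
    "2014 ERROR APACHE timeout",
    "2013 ERROR MYSQL slow query",
]


def load_errors():
    return (line for line in LOG if "ERROR" in line and "2014" in line)


err = CachedDataset(load_errors).cache()
total = err.count()                            # first action fills the cache
mysql = err.count(lambda t: "MYSQL" in t)      # served from the cache
apache = err.count(lambda t: "APACHE" in t)    # served from the cache
print(total, mysql, apache, err.computations)
```

Without the cache() call, the same three counts would rebuild the filtered records three times in this model.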
RDD Cache
• 1st iteration (no cache): takes the same time
• With cache: takes 7 sec
RDD Cache 
• Data locality 
• Cache 
Self join on 5 billion records
• A big shuffle takes 20 min
• After cache, takes only 265 ms
Scala Word Count 
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Step by Step 
• file.flatMap(line => line.split(" ")) => (aaa,bb,cc)
• .map(word => (word, 1)) => ((aaa,1),(bb,1)..) 
• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
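The three steps above can be traced on a tiny input with plain Python lists, without Spark. The two sample lines are made up for illustration; the comments name the Spark operation that each step imitates.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input standing in for the lines of the HDFS file.
lines = ["aaa bb aaa", "bb cc aaa"]

# flatMap: split each line, then flatten everything into one list of words.
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: group the pairs by word, then fold each group's values with +.
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)
counts = {word: reduce(lambda a, b: a + b, ones)
          for word, ones in groups.items()}

print(words)
print(pairs)
print(counts)
```

In Spark the grouping in the last step is where records move between machines, since all pairs with the same key have to meet in one place.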
Java Wordcount 
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// mapToPair is the Spark 1.0+ name; releases before 1.0 used map here
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Java vs Scala 
• Scala : file.flatMap(line => line.split(" ")) 
• Java version : 
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
Python 
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Highly Recommend 
• Scala : Latest API feature, Stable 
• Python 
• very familiar language 
• Native Lib: NumPy, SciPy
How to use it? 
• 1. go to https://spark.apache.org/ 
• 2. Download and unzip it 
• 3. ./sbin/start-all.sh or ./bin/spark-shell
DEMO
EcoSystem/Future
Hadoop EcoSystem
Spark ECOSystem 
• SparkSQL : Hive
• GraphX : Pregel
• MLlib : Mahout
• Streaming : Storm
• Spark Core : MapReduce
• HDFS
• Resource Management System (YARN, Mesos)
Unified Platform
Detail 
Streaming BI ETL 
Spark 
SparkSQL 
MLlib 
Hive HDFS Cassandra RDBMS
Complexity
Performance
Write once, Run use case
[Diagram: Spark as one platform for BI (SparkSQL), Streaming (SparkStreaming), and Machine Learning (MLlib)]
Spark bridges people
together
Data Analyst 
Data Engineer Data Scientist
Bridge people together 
• Scala : Engineer 
• Java : Engineer 
• Python : Data Scientist , Engineer 
• R : Data Scientist , Data Analyst 
• SQL : Data Analyst
Yahoo EC team 
[Diagram: Data Platform with Filtered Data (HDFS), Data Mart (Oracle), ML Model (Spark), and BI Report (MSTR); inputs: Traffic Data, Transaction Data; Shark]
Data Analyst
350 TB data
• Select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01
• = Machine Learning
• http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr
• https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900
Data Scientist 
http://goo.gl/q5CAx8 
http://research.janelia.org/zebrafish/
[Diagram: Spark connects SQL (Data Analyst), Cloud Computing (Data Engineer), and Machine Learning (Data Scientist)]
Databricks Cloud 
DEMO
Instant BI Report 
http://youtu.be/dJQ5lV5Tldw?t=30m30s
Background Knowledge 
• Real-time tweet data is stored in a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF
model
• SparkSQL selects tweets and filters them by the TF-IDF
model
• Generate a live BI report
Code 
val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01
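What a similarity function like this might compute can be sketched in plain Python. The slides do not say how similarity is defined, so cosine similarity between TF-IDF vectors is an assumption, as are the smoothing formula, the TfIdfModel class, and the sample texts; none of this is the MLlib API.

```python
import math
from collections import Counter


class TfIdfModel:
    """Tiny TF-IDF model (hypothetical stand-in for the slide's TFIDF class)."""

    def __init__(self):
        self.idf = {}
        self.default_idf = 0.0

    def train(self, documents):
        n = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc.lower().split()))
        # Smoothed inverse document frequency, always positive.
        self.idf = {term: math.log((1 + n) / (1 + df)) + 1
                    for term, df in doc_freq.items()}
        self.default_idf = math.log(1 + n) + 1   # weight for unseen terms

    def _vector(self, text):
        tf = Counter(text.lower().split())
        return {term: count * self.idf.get(term, self.default_idf)
                for term, count in tf.items()}

    def similarity(self, text, query):
        """Cosine similarity between the TF-IDF vectors of text and query."""
        a, b = self._vector(text), self._vector(query)
        dot = sum(weight * b[term] for term, weight in a.items() if term in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up training corpus and tweets, standing in for the wiki and tweet tables.
wiki = ["fifa organizes the world cup", "spark is a data processing engine"]
tweets = ["watching the fifa final tonight", "learning spark this weekend"]

model = TfIdfModel()
model.train(wiki)
search = "fifa"
# Same shape as the SQL filter: keep tweets whose similarity exceeds 0.01.
matches = [t for t in tweets if model.similarity(t, search) > 0.01]
print(matches)
```

Registering such a function with the SQL engine is what lets an analyst use a trained model from an ordinary query.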
DEMO 
http://youtu.be/dJQ5lV5Tldw?t=39m30s
Q & A

More Related Content

What's hot

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Matthias Niehoff
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
Vince Gonzalez
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
Gera Shegalov
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
Charles Givre
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
Modern Data Stack France
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
Jacek Lewandowski
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
Sujee Maniyam
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
Jon Haddad
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
Cheng Min Chi
 
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014
Konrad Malawski
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
Alexey Zinoviev
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
MapR Technologies
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
 
Open Source Logging and Metric Tools
Open Source Logging and Metric ToolsOpen Source Logging and Metric Tools
Open Source Logging and Metric Tools
Phase2
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
Fernando Rodriguez
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
Nate Murray
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBase
Nag Arvind Gudiseva
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

What's hot (20)

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Open Source Logging and Metric Tools
Open Source Logging and Metric ToolsOpen Source Logging and Metric Tools
Open Source Logging and Metric Tools
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBase
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 

Viewers also liked

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
Rahul Kumar
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
pumaranikar
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 

Viewers also liked (7)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 

Similar to OCF.tw's talk about "Introduction to spark"

Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
Radu Chilom
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
Paco Nathan
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
Gert Drapers
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
cdmaxime
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
Thu Hiền
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Li Ming Tsai
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 

Similar to OCF.tw's talk about "Introduction to spark" (20)

Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 

Recently uploaded

FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptxFIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Alliance
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Badri_Bady
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
Razin Mustafiz
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
 
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
Alison B. Lowndes
 
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
Sara Kroft
 
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partesExchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
jorgelebrato
 
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
OCF.tw's talk about "Introduction to spark"

  • 1. Introduction to Spark Wisely Chen (aka thegiive) Sr. Engineer at Yahoo
  • 2. Agenda • What is Spark? ( Easy ) • Spark Concept ( Middle ) • Break : 10min • Spark EcoSystem ( Easy ) • Spark Future ( Middle ) • Q&A
  • 3. Who am I? • Wisely Chen ( thegiive@gmail.com ) • Sr. Engineer in Yahoo![Taiwan] data team • Loves to promote open source tech • Hadoop Summit 2013 San Jose • Jenkins Conf 2013 Palo Alto • Spark Summit 2014 San Francisco • Coscup 2006, 2012, 2013 , OSDC 2007, Webconf 2013, Coscup 2012, PHPConf 2012 , RubyConf 2012
  • 4. Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning
  • 8. Opinion from Cloudera • The leading candidate for “successor to MapReduce” today is Apache Spark • No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. • From http://0rz.tw/y3OfM
  • 9. What is Spark • From UC Berkeley AMP Lab • Most active big data open source project since Hadoop
  • 13. YARN HDFS MapReduce Hadoop 2.0 Storm HBase Others
  • 14. Hadoop Architecture Hive MapReduce YARN HDFS SQL Computing Engine Resource Management Storage
  • 15. Hadoop vs Spark Hive Shark/SparkSQL YARN HDFS MapReduce Spark
  • 16. Spark vs Hadoop • Spark runs on YARN, Mesos, or in Standalone mode • Spark’s main concept is based on MapReduce • Spark can read from • HDFS: data locality • HBase • Cassandra
  • 17. More than MapReduce Shark: Hive GraphX: Pregel MLlib: Mahout Spark Core : MapReduce HDFS Streaming: Storm Resource Management System(Yarn, Mesos)
  • 20. [Chart: running time in seconds, MR vs Spark, for Logistic regression, KMeans, and PageRank] 3X~25X faster than the MapReduce framework. From Matei’s paper: http://0rz.tw/VVqgP
  • 21. What is Spark • Apache Spark™ is a very fast and general engine for large-scale data processing
  • 22. Language Support • Python • Java • Scala
  • 23. Python Word Count • file = spark.textFile("hdfs://...") • counts = file.flatMap(lambda line: line.split(" ")) • .map(lambda word: (word, 1)) • .reduceByKey(lambda a, b: a + b) • counts.saveAsTextFile("hdfs://...") Access data via Spark API Process via Python
  • 24. What is Spark • Apache Spark™ is a very fast and general engine for large-scale data processing
  • 25. Why is Spark so fast?
  • 26. Most machine learning algorithms need iterative computing
  • 27. PageRank [Figure: a graph of four pages (a, b, c, d), each starting with rank 1.0; each iteration (1st Iter, 2nd Iter, 3rd Iter) turns the current Rank into a Tmp Result that becomes the next Rank]
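
The iterative pattern on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the link graph, the damping factor of 0.85, and the function name `pagerank_step` are hypothetical and are not taken from the talk. Each pass consumes the ranks produced by the previous pass, which is why the intermediate result is worth keeping in memory.

```python
def pagerank_step(links, ranks, damping=0.85):
    """Run one PageRank iteration and return the new ranks."""
    # Every page splits its current rank evenly among the pages it links to.
    contribs = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = ranks[page] / len(outgoing)
        for target in outgoing:
            contribs[target] += share
    # The new rank mixes a constant base with the received contributions.
    return {page: (1 - damping) + damping * contribs[page] for page in links}


# Hypothetical four-page graph; every page has at least one outgoing link.
links = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["a"]}

# All ranks start at 1.0, as on the slide.
ranks = {page: 1.0 for page in links}

# Each iteration reads the previous iteration's result (the "Tmp Result").
for _ in range(3):
    ranks = pagerank_step(links, ranks)

print(ranks)
```

In this toy graph page a receives links from every other page, so it ends up with the highest rank.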
  • 28. HDFS is 100x slower than memory. MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> Iter N. Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> Iter N.
  • 29. PageRank algorithm on 1 billion URL records: first iteration (HDFS) takes 200 sec; 2nd iteration (mem) takes 7.4 sec; 3rd iteration (mem) takes 7.7 sec
  • 34. RDD • Resilient Distributed Dataset • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations
  • 36. RDD • val a = sc.textFile("hdfs://....") (RDD a) • Transformation: val b = a.filter( line => line.contains("Spark") ) (RDD b) • Action: val c = b.count() (Value c)
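
The split between a transformation and an action can be imitated in plain Python. The class `LazyDataset` and the helper `load_lines` below are hypothetical stand-ins, not part of the Spark API: the sketch only shows that `filter` records work to be done, while `count` is what actually reads the data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers filters, runs them only on an action."""

    def __init__(self, source, predicates=()):
        self._source = source                 # callable that returns the records
        self._predicates = tuple(predicates)  # filters recorded so far

    def filter(self, predicate):
        # Transformation: describe a new dataset, read nothing yet.
        return LazyDataset(self._source, self._predicates + (predicate,))

    def count(self):
        # Action: read the source now and apply every recorded filter.
        return sum(
            1
            for record in self._source()
            if all(keep(record) for keep in self._predicates)
        )


reads = []  # records every time the source is actually read


def load_lines():
    reads.append("read")
    return ["Spark is fast", "Hadoop MapReduce", "Spark caches RDDs"]


a = LazyDataset(load_lines)                 # like sc.textFile(...)
b = a.filter(lambda line: "Spark" in line)  # transformation, still lazy
reads_before_action = len(reads)            # nothing has been read so far
c = b.count()                               # action, triggers the read

print(reads_before_action, c, len(reads))
```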
  • 37. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; Worker (Task); Worker (Task); Worker (Task)
  • 38. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; Worker (RDD a, Block1); Worker (RDD a, Block3); Worker (RDD a, Block2)
  • 39. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; three Workers, each with RDD err over a block (Block1, Block2, Block3)
  • 40. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; three Workers, each with RDD err over a block (Block1, Block2, Block3)
  • 41. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; three Workers, each with RDD err held in a cache (Cache1, Cache2, Cache3)
  • 42. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; three Workers, each with RDD m computed from a cache (Cache1, Cache2, Cache3)
  • 43. Log mining
      val a = sc.textFile("hdfs://aaa.com/a.txt")
      val err = a.filter( t => t.contains("ERROR") )
                 .filter( t => t.contains("2014") )
      err.cache()
      err.count()
      val m = err.filter( t => t.contains("MYSQL") ).count()
      val a = err.filter( t => t.contains("APACHE") ).count()
      Diagram: Driver; three Workers, each with RDD a computed from a cache (Cache1, Cache2, Cache3)
  • 44. RDD Cache: 1st iteration (no cache) takes the same time; with cache, takes 7 sec
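
The effect of `err.cache()` in the log mining slides can be imitated in plain Python. The class `Cached`, the helper `load_errors`, and the sample log lines are hypothetical and only stand in for the idea: the expensive load and filter run once, and later queries reuse the result kept in memory.

```python
load_calls = []  # records every time the expensive load actually runs


def load_errors():
    """Pretend to read a log from slow storage and keep the 2014 ERROR lines."""
    load_calls.append("load")
    log = [
        "2014 ERROR MYSQL timeout",
        "2014 INFO APACHE started",
        "2014 ERROR APACHE crashed",
        "2013 ERROR MYSQL down",
    ]
    return [line for line in log if "ERROR" in line and "2014" in line]


class Cached:
    """Compute a result on first use, then serve it from memory."""

    def __init__(self, compute):
        self._compute = compute
        self._data = None

    def get(self):
        if self._data is None:
            self._data = self._compute()  # first access pays the full cost
        return self._data                 # later accesses are memory reads


err = Cached(load_errors)

# Two separate queries over the same cached data, as in the slides.
mysql_count = sum(1 for line in err.get() if "MYSQL" in line)
apache_count = sum(1 for line in err.get() if "APACHE" in line)

print(mysql_count, apache_count, len(load_calls))
```

Only the first query pays for the load; the second one reads the filtered lines straight from memory.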
  • 45. RDD Cache • Data locality • Cache • Self join on 5 billion records: a big shuffle takes 20 min; after cache, takes only 265 ms
  • 46. Scala Word Count • val file = spark.textFile("hdfs://...") • val counts = file.flatMap(line => line.split(" ")) • .map(word => (word, 1)) • .reduceByKey(_ + _) • counts.saveAsTextFile("hdfs://...")
  • 47. Step by Step • file.flatMap(line => line.split(" ")) => (aaa,bb,cc) • .map(word => (word, 1)) => ((aaa,1),(bb,1)..) • .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
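
The three steps can be traced on a tiny in-memory input with plain Python. The helpers `flat_map` and `reduce_by_key` and the sample lines are hypothetical; they only mirror what the Spark operations of the same name do to the data, without any cluster.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of all pairs that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["aaa bb cc", "aaa bb", "aaa"]  # hypothetical input lines

# Step 1: split every line into words and flatten.
words = flat_map(lambda line: line.split(" "), lines)

# Step 2: turn each word into a (word, 1) pair.
pairs = [(word, 1) for word in words]

# Step 3: add up the 1s for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(counts)  # {'aaa': 3, 'bb': 2, 'cc': 1}
```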
  • 48. Java Wordcount • JavaRDD<String> file = spark.textFile("hdfs://..."); • JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { • public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } • }); • JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() { • public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } • }); • JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { • public Integer call(Integer a, Integer b) { return a + b; } • }); • counts.saveAsTextFile("hdfs://...");
  • 49. Java vs Scala • Scala : file.flatMap(line => line.split(" ")) • Java version : • JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { • public Iterable<String> call(String s) { • return Arrays.asList(s.split(" ")); } • });
  • 50. Python • file = spark.textFile("hdfs://...") • counts = file.flatMap(lambda line: line.split(" ")) • .map(lambda word: (word, 1)) • .reduceByKey(lambda a, b: a + b) • counts.saveAsTextFile("hdfs://...")
  • 51. Highly Recommended • Scala: latest API features, stable • Python • very familiar language • Native libs: NumPy, SciPy
  • 52. How to use it? • 1. go to https://spark.apache.org/ • 2. Download and unzip it • 3. ./sbin/start-all.sh or ./bin/spark-shell
  • 53. DEMO
  • 58. Spark EcoSystem SparkSQL: Hive GraphX: Pregel MLlib: Mahout Spark Core : MapReduce HDFS Streaming: Storm Resource Management System(Yarn, Mesos)
  • 60. Detail Streaming BI ETL Spark SparkSQL MLlib Hive HDFS Cassandra RDBMS
  • 63. Write once, Run use case
  • 64. BI (SparkSQL) Streaming (SparkStreaming) Machine Learning (MLlib) Spark
  • 66. Data Analyst Data Engineer Data Scientist
  • 67. Bridge people together • Scala : Engineer • Java : Engineer • Python : Data Scientist , Engineer • R : Data Scientist , Data Analyst • SQL : Data Analyst
  • 68. Yahoo EC team Data Platform: Traffic Data; Transaction Data; Filtered Data (HDFS); Shark; Data Mart (Oracle); ML Model (Spark); BI Report (MSTR)
  • 70. Data Analyst • 350 TB data • Select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01 • Machine Learning • http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900
  • 71. Data Scientist http://goo.gl/q5CAx8 http://research.janelia.org/zebrafish/
  • 72. SQL (Data Analyst) Cloud Computing (Data Engineer) Machine Learning (Data Scientist) Spark
  • 74. BI (SparkSQL) Streaming (SparkStreaming) Machine Learning (MLlib) Spark
  • 75. Instant BI Report http://youtu.be/dJQ5lV5Tldw?t=30m30s
  • 76. BI (SparkSQL) Streaming (SparkStreaming) Machine Learning (MLlib) Spark
  • 77. Background Knowledge • Real-time tweet data is stored in a SQL database • Spark MLlib uses Wikipedia data to train a TF-IDF model • SparkSQL selects tweets and filters them with the TF-IDF model • A live BI report is generated
  • 78. Code • val wiki = sql("select text from wiki") • val model = new TFIDF() • model.train(wiki) • registerFunction("similarity" , model.similarity _ ) • select tweet from tweet where similarity(tweet, "$search") > 0.01
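
What a `similarity` function backed by a TF-IDF model might compute can be sketched in plain Python. Everything here is an assumption made for illustration: the class `TfIdfModel`, the smoothed IDF formula, the use of cosine similarity, and the sample texts are not taken from the talk or from MLlib. The final list comprehension plays the role of the SQL `where similarity(...) > 0.01` filter.

```python
import math
from collections import Counter


class TfIdfModel:
    """Toy TF-IDF model: learn term weights from a corpus, compare two texts."""

    def train(self, documents):
        total = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc.lower().split()))
        # Smoothed inverse document frequency: rarer terms weigh more.
        self.idf = {
            term: math.log((1 + total) / (1 + freq)) + 1
            for term, freq in doc_freq.items()
        }
        # Weight for terms never seen in training (document frequency 0).
        self.default_idf = math.log(1 + total) + 1
        return self

    def _vector(self, text):
        term_freq = Counter(text.lower().split())
        return {
            term: count * self.idf.get(term, self.default_idf)
            for term, count in term_freq.items()
        }

    def similarity(self, left, right):
        """Cosine similarity between the TF-IDF vectors of two texts."""
        u, v = self._vector(left), self._vector(right)
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical training corpus standing in for the Wikipedia text.
wiki = [
    "FIFA organises the football world cup",
    "Spark is a cluster computing engine",
    "football is played worldwide",
]
model = TfIdfModel().train(wiki)

# Hypothetical tweets standing in for the tweet table.
tweets = ["watching the FIFA final tonight", "learning Spark this weekend"]
matches = [t for t in tweets if model.similarity(t, "FIFA") > 0.01]

print(matches)
```

Registering such a function as a SQL UDF, as the slide does, lets an analyst apply the model with an ordinary `where` clause.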
  • 80. Q & A