Introduction to Spark 
Wisely Chen (aka thegiive) 
Sr. Engineer at Yahoo
Agenda
• What is Spark? ( Easy )
• Spark Concept ( Middle ) 
• Break : 10min 
• Spark EcoSystem ( Easy ) 
• Spark Future ( Middle ) 
• Q&A
Who am I? 
• Wisely Chen
• Sr. Engineer on the Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Spark Summit 2014 San Francisco
• COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
[Diagram: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning]
Opinion from Cloudera 
• The leading candidate for “successor to 
MapReduce” today is Apache Spark 
• No vendor — no new project — is likely to catch 
up. Chasing Spark would be a waste of time, 
and would delay availability of real-time analytic 
and processing services for no good reason.
• From
What is Spark 
• From UC Berkeley AMP Lab 
• The most active open source big data project since Hadoop
Where is Spark?
Hadoop 2.0
[Diagram components: HDFS, YARN, MapReduce, Storm, HBase, Others]
Hadoop Architecture
• SQL : Hive
• Computing Engine : MapReduce
• Resource Management : YARN
• Storage : HDFS
Hadoop vs Spark
• SQL : Hive vs Shark/SparkSQL
• Computing Engine : MapReduce vs Spark
• Shared by both : YARN, HDFS
Spark vs Hadoop 
• Spark runs on YARN, on Mesos, or in standalone mode
• Spark's main concept is based on MapReduce
• Spark can read from
• HDFS: data locality 
• HBase 
• Cassandra
More than MapReduce
• Shark : Hive
• GraphX : Pregel
• MLlib : Mahout
• Spark Core : MapReduce
• Resource Management System (YARN, Mesos)
Why Spark?
3X~25X faster than the MapReduce framework
From Matei's paper:
[Chart: running time (s), MR vs Spark, for logistic regression, KMeans, and PageRank]
What is Spark 
• Apache Spark™ is a very fast and general 
engine for large-scale data processing
Language Support 
• Python 
• Java 
• Scala
Python Word Count
file = spark.textFile("hdfs://...")   # spark: a SparkContext (assumed)
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")
• Access data via the Spark API
• Process via Python
Why is Spark so fast?
Most machine learning 
algorithms need iterative computing
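[Figure: PageRank on a small graph with nodes a, b, c, d, shown over the 1st, 2nd, and 3rd iterations. Each iteration produces a temporary result that feeds the next one; the rank of a is 1.0, then 1.85, then 1.31.]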
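HDFS is 100x slower than memory
• MapReduce : Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> Iter N
• Spark : Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> Iter N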
PageRank algorithm on 1 billion URL records
• First iteration (HDFS) : takes 200 sec
• 2nd iteration (mem) : takes 7.4 sec
• 3rd iteration (mem) : takes 7.7 sec
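The PageRank iterations above can be sketched in plain Python. This is a minimal sketch and not Spark code. The four-node link graph is an assumption, chosen because it reproduces the ranks shown on the PageRank slide, and the 0.15 + 0.85 * sum update is the form used in Spark's example PageRank program. The point to notice is that every iteration reads the ranks produced by the previous one. Spark keeps that intermediate result in memory, while MapReduce writes it back to HDFS.

```python
# Hypothetical link graph: page -> pages it links to (an assumption, not from the slides).
links = {
    "a": ["b"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["a", "c"],
}


def pagerank_step(ranks):
    """Run one iteration: every page splits its rank evenly across its out-links."""
    contribs = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            contribs[target] += share
    # Damped update with the usual 0.15 / 0.85 constants.
    return {page: 0.15 + 0.85 * total for page, total in contribs.items()}


def pagerank(iterations):
    """Start every page at rank 1.0 and apply the given number of iterations."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks)  # the intermediate result feeds the next iteration
    return ranks


if __name__ == "__main__":
    for n in range(3):
        print(n, {page: round(rank, 2) for page, rank in sorted(pagerank(n).items())})
```

Because `ranks` is an ordinary in-memory dictionary here, each step costs the same. The 200 sec versus 7 sec gap on the slide comes from where the input and intermediate data live, which this sketch does not model.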
Spark Concept
Map Reduce 
DAG Engine
RDD
• Resilient Distributed Dataset 
• Collections of objects spread across a cluster, 
stored in RAM or on Disk 
• Built through parallel transformations
Fault Tolerance 
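RDD
val a = sc.textFile("hdfs://....")                    // RDD a
val b = a.filter( line => line.contains("Spark") )    // RDD b, built by a transformation
val c = b.count()                                     // Value c, returned by an action
• Transformation : RDD a -> RDD b
• Action : RDD b -> Value c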
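The split between transformations and actions, and why it helps fault tolerance, can be sketched with a toy class in plain Python. `MiniRDD` is a hypothetical stand-in written for this example, not the Spark API. A transformation only records a step in the lineage, and an action replays the lineage over the source. If a partition is lost, the same replay can rebuild it.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical class, not the Spark API)."""

    def __init__(self, source, steps=()):
        self.source = source  # zero-argument function that re-reads the input
        self.steps = steps    # lineage: the filter predicates recorded so far

    def filter(self, predicate):
        # Transformation: record the step and return a new MiniRDD; no data is read.
        return MiniRDD(self.source, self.steps + (predicate,))

    def count(self):
        # Action: replay the whole lineage over the source and return a value.
        rows = self.source()
        for predicate in self.steps:
            rows = [row for row in rows if predicate(row)]
        return len(rows)


reads = []  # one entry is appended every time the "file" is read


def read_lines():
    """Made-up input standing in for sc.textFile("hdfs://...")."""
    reads.append(1)
    return ["Spark is fast", "Hadoop MapReduce", "Spark on YARN"]


a = MiniRDD(read_lines)
b = a.filter(lambda line: "Spark" in line)  # transformation: nothing is read yet
c = b.count()                               # action: the read and the filter run now
```

Without a cache, every action replays the lineage from the source, which is why the log mining example that follows calls `err.cache()` before counting.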
Log mining
val a = sc.textFile("hdfs://")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )
err.cache()
err.count()
val m = err.filter( t => t.contains("MYSQL") )
           .count()
val ap = err.filter( t => t.contains("APACHE") )   // renamed: a is already defined
            .count()
[Diagram: the Driver sends a Task to each of three Workers]
Log mining, step by step (same code on every slide)
• Each Worker reads its part of RDD a from HDFS (Block1, Block2, Block3)
• The two filters turn RDD a into RDD err on each Worker
• err.cache() keeps RDD err in Worker memory (Cache1, Cache2, Cache3)
• The MYSQL count (RDD m) runs on the cached RDD err
• The APACHE count runs on the cached RDD err as well
RDD Cache
• 1st iteration (no cache) : takes the same time
• With cache : takes 7 sec
RDD Cache
• Data locality
• Cache
Self join on 5 billion records
• A big shuffle : takes 20 min
• After cache : takes only 265 ms
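The effect of the cache can be sketched in plain Python. This is a toy illustration under stated assumptions: `CachedDataset` and `slow_load` are hypothetical names, and the short sleep stands in for reading and filtering a large file. The expensive computation runs once, and later queries reuse the in-memory copy.

```python
import time


def slow_load():
    """Stand-in for reading and filtering a large log file (hypothetical workload)."""
    time.sleep(0.05)
    lines = ["ERROR 2014 MYSQL", "INFO 2014 APACHE", "ERROR 2014 APACHE"]
    return [line for line in lines if "ERROR" in line]


class CachedDataset:
    """Toy illustration of cache(): compute once, then reuse the in-memory rows."""

    def __init__(self, compute):
        self.compute = compute
        self.cached = None
        self.computations = 0  # how many times the expensive step actually ran

    def rows(self):
        if self.cached is None:
            self.cached = self.compute()
            self.computations += 1
        return self.cached

    def count(self, keyword):
        return sum(1 for row in self.rows() if keyword in row)


err = CachedDataset(slow_load)
mysql = err.count("MYSQL")    # first query pays for the load
apache = err.count("APACHE")  # second query is served from memory
```

In this sketch `err.computations` stays at 1 after both queries. That is the same reason the later counts in the log mining example are fast once `err` has been cached.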
Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Step by Step
• file.flatMap(line => line.split(" "))   =>  (aaa, bb, cc, ...)
• .map(word => (word, 1))                 =>  ((aaa,1), (bb,1), ...)
• .reduceByKey(_ + _)                     =>  ((aaa,123), (bb,23), ...)
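The three steps can be followed on a tiny input with plain Python, no cluster needed. This is a sketch of what each step produces, not of how Spark executes it, and the input lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

lines = ["aaa bb cc", "aaa bb", "aaa"]  # made-up input lines

# flatMap: split every line, then flatten the pieces into one list of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same key.
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
counts = dict(totals)

print(words)
print(pairs)
print(counts)
```

In Spark the `reduceByKey` step also moves pairs with the same key to the same machine, which this single-process sketch leaves out.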
Java Wordcount
// Spark 1.x Java API; spark is assumed to be a JavaSparkContext
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Java vs Scala
• Scala : file.flatMap(line => line.split(" "))
• Java version :
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
Python
file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")
Highly Recommend 
• Scala : Latest API feature, Stable 
• Python 
• very familiar language 
• Native Lib: NumPy, SciPy
How to use it? 
• 1. go to 
• 2. Download and unzip it 
• 3. ./sbin/ or ./bin/spark-shell
EcoSystem / Future
Hadoop EcoSystem
Spark EcoSystem
• SparkSQL : Hive
• GraphX : Pregel
• MLlib : Mahout
• Spark Core : MapReduce
• Resource Management System (YARN, Mesos)
Unified Platform
• Use cases : Streaming, BI, ETL
• Engine : Spark, SparkSQL, MLlib
• Data sources : Hive, HDFS, Cassandra, RDBMS
Write once, Run use case
Spark bridges people
• Data Analyst
• Data Engineer
• Data Scientist
Bridge people together 
• Scala : Engineer 
• Java : Engineer 
• Python : Data Scientist , Engineer 
• R : Data Scientist , Data Analyst 
• SQL : Data Analyst
Yahoo EC team
• Data Platform : Traffic Data, Transaction Data
• Filtered Data (HDFS)
• Data Mart (Oracle)
• ML Model (Spark)
• BI Report (MSTR)
Data Analyst
Data Analyst 
350 TB data 
• select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01
Data Scientist
Spark
• SQL (Data Analyst)
• Cloud Computing (Data Engineer)
• Machine Learning (Data Scientist)
Databricks Cloud 
Instant BI Report
Background Knowledge
• Real-time tweet data is stored in a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF model
• SparkSQL selects tweets and filters them with the TF-IDF model
• The result is a live BI report
Code
val wiki = sql("select text from wiki")
val model = new TFIDF()   // TFIDF is the model class as named on the slide
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01
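What a `similarity` function built on TF-IDF might compute can be sketched in plain Python. This is a hedged illustration only: `TinyTfIdf` is a hypothetical class written for this example, not an MLlib class and not the code behind the demo. It assumes whitespace tokenization, a smoothed IDF, and cosine similarity, and the documents and tweets are made up.

```python
import math
from collections import Counter


class TinyTfIdf:
    """Hypothetical stand-in for the slide's TFIDF model (not an MLlib class)."""

    def train(self, documents):
        tokenized = [doc.lower().split() for doc in documents]
        self.n_docs = len(tokenized)
        # Document frequency: in how many documents each term appears.
        self.doc_freq = Counter(term for doc in tokenized for term in set(doc))

    def vector(self, text):
        counts = Counter(text.lower().split())
        # Term frequency times a smoothed IDF, so unseen terms do not divide by zero.
        return {
            term: tf * math.log((1 + self.n_docs) / (1 + self.doc_freq[term]))
            for term, tf in counts.items()
        }

    def similarity(self, left, right):
        # Cosine similarity between the two TF-IDF vectors.
        u, v = self.vector(left), self.vector(right)
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


wiki = [
    "FIFA organises the football World Cup",
    "Spark is a data processing engine",
    "football is played with a ball",
]
model = TinyTfIdf()
model.train(wiki)

tweets = ["watching the FIFA final tonight", "learning Spark this weekend"]
matches = [tweet for tweet in tweets if model.similarity(tweet, "FIFA") > 0.01]
```

The list comprehension plays the role of the SQL `where similarity(tweet, "$search") > 0.01` clause. In the deck that filter runs inside SparkSQL through the registered function.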
Q & A
