From Hadoop to Spark
1/2
Dr. Fabio Fumarola
Contents
• Aggregates and Clusters
• Scatter Gather and MapReduce
• MapReduce
• Why Spark?
• Spark:
– Example, task and stages
– Docker Example
– Scala and Anonymous Functions
• Next Topics in 2/2
2
Aggregates and Clusters
• Aggregate-oriented databases change the rules for
data storage (CAP)
• Running on a cluster also changes the computation
model
• When you store data on a cluster you can process
data in parallel
3
Database vs Client Processing
• With a centralized database data can be processed
on the database server or on the client machine
• Running on the client:
– Pros: flexibility and programming languages
– Cons: data transfer from the server to the client
• Running on the server:
– Pros: Data locality
– Cons: programming languages and debugging
4
Cluster and Computation
• We can spread the computation across the cluster
• However, we have to reduce the amount of data
transferred over the network
• We need computation locality
• That is, we process the data on the same node where it
is stored
5
Scatter-Gather
6
Use a Scatter-Gather that broadcasts a message to multiple recipients and
re-aggregates the responses back into a single message.
(Enterprise Integration Patterns, 2003)
Map-Reduce
• It is a way to organize processing by taking
advantage of clusters
• It gained prominence with Google’s MapReduce
framework (Dean and Ghemawat 2004)
• It was then implemented in the Hadoop framework
7
http://research.google.com/archive/mapreduce.html
https://hadoop.apache.org/
Programming Model: MapReduce
• We have a huge text document
• Count the number of times each distinct word
appears in the file
• Sample applications
– Analyze web server logs to find popular URLs
– Term statistics for search
8
Word Count
• Assumption: the input file is too large for memory,
but all <word, count> pairs fit in memory
• On a single machine we can compute the pairs with a shell
pipeline such as
– tr -s '[:space:]' '\n' < file.txt | sort | uniq -c
• Encyclopedia Britannica, 11th Edition, Volume 4, Part
3 (http://www.gutenberg.org/files/19699/19699.zip)
9
Word Count Steps
tr -s '[:space:]' '\n' < file.txt | sort | uniq -c
•Map
• Scan input file record-at-a-time
• Extract keys from each record
•Group by key
• Sort and shuffle
•Reduce
• Aggregate, summarize, filter or transform
• Write the results
10
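To make these logical steps concrete, here is a minimal single-machine sketch of the same pipeline in plain Scala collections (the file name and tokenization rule are illustrative assumptions, not from the slides); MapReduce distributes exactly these three steps across machines:

import scala.io.Source

object WordCount {
  def main(args: Array[String]): Unit = {
    // Map: scan the input file record-at-a-time and emit (word, 1) pairs
    val pairs = Source.fromFile("file.txt").getLines()
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .toList

    // Group by key: gather all pairs that share the same word
    val grouped = pairs.groupBy(_._1)

    // Reduce: sum the counts per word and write the results
    grouped.foreach { case (word, ps) => println(s"${ps.map(_._2).sum}\t$word") }
  }
}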
MapReduce: Logical Steps
• Map
• Group by Key
• Reduce
11
Map Phase
12
Group and Reduce Phase
13
Partition and shuffling
MapReduce: Word Counting
14
Word Count with MapReduce
15
Example: Language Model
• Count the number of times each 5-word sequence
occurs in a large corpus of documents
• Map
– Extract <5-word sequence, count> pairs from each
document
• Reduce
– Combine the counts
16
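As a minimal single-machine sketch of this map/reduce logic (assuming the corpus is already tokenized; the function name is ours):

def fiveGramCounts(tokens: Seq[String]): Map[Seq[String], Int] =
  tokens.sliding(5)              // Map: emit each 5-word sequence
    .filter(_.length == 5)       // drop a trailing partial window
    .foldLeft(Map.empty[Seq[String], Int]) { (counts, gram) =>
      // Reduce: combine the counts per 5-gram
      counts.updated(gram, counts.getOrElse(gram, 0) + 1)
    }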
MapReduce: Physical Execution
17
Physical Execution: Concerns
• Each mapper's intermediate results for a given key are sent
to a single reducer:
– This is the only step that requires communication over
the network
• Thus the Partition and Shuffle phases are critical
18
Partition and Shuffle
Reducing the cost of these steps dramatically reduces the
overall computation time:
•The Partitioner determines which partition a given
(key, value) pair will go to.
•The default partitioner computes a hash value for the
key and assigns the partition based on this result.
•The Shuffler moves map outputs to the reducers.
19
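As a sketch, the default rule can be written as follows (the helper name is ours; this mirrors hash-based partitioners such as Spark's HashPartitioner):

// A (key, value) pair goes to partition hash(key) mod numPartitions,
// adjusted so the result is never negative
def defaultPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}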
MapReduce: Features
• Partitioning the input data
• Scheduling the program’s execution across a set of
machines
• Performing the group by key step
• Handling node failures
• Managing required inter-machine communication
20
Hadoop
MapReduce
21
Hadoop
• An Open-Source software for distributed storage of large
dataset on commodity hardware
• Provides a programming model/framework for processing
large dataset in parallel
22
[Diagram: input splits flow through parallel Map tasks into Reduce tasks that produce the output.]
Hadoop: Architecture
23
Distributed File System
• Data is kept in “chunks” spread across machines
• Each chunk is replicated on different machines
(persistence and availability)
24
Distributed File System
• Chunk Servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated 3 times
– Try to keep replicas on different machines
• Master Node
– Name Node in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
25
Hadoop’s Limitations
26
Limitations of Map Reduce
• Slow due to replication, serialization, and disk IO
• Inefficient for:
– Iterative algorithms (Machine Learning, Graphs & Network Analysis)
– Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)
27
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS, so every Map/Reduce pass pays a full disk round-trip.]
Solutions?
• Leverage memory:
– Load data into memory
– Replace disks with SSDs
28
Apache Spark
• A big data analytics cluster-computing framework
written in Scala.
• Originally open-sourced by the AMPLab at UC Berkeley
• Provides in-memory analytics based on RDD
• Highly compatible with Hadoop Storage API
– Can run on top of a Hadoop cluster
• Developers can write programs in multiple
programming languages
29
Spark architecture
30
[Diagram: the Spark Driver (master) coordinates through a Cluster Manager; Spark Workers, each with an in-memory cache, run next to the HDFS Datanodes so that tasks process the blocks stored locally.]
Hadoop Data Flow
31
[Diagram: in Hadoop, every iteration is bracketed by an HDFS read and an HDFS write.]
Spark Data Flow
32
[Diagram: in Spark, the input is read from HDFS once, and subsequent iterations work on in-memory data.]
Spark is not tied to the two-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly
Canonical example: logistic regression in Hadoop vs. Spark.
Spark Programming Model
33
The user (developer) writes the driver program:

val sc = new SparkContext(...)
val rdd = sc.textFile("hdfs://...")
rdd.filter(...)
rdd.cache()
rdd.count()
rdd.map(...)

The Driver Program's SparkContext connects to the Cluster Manager, which assigns Worker Nodes backed by HDFS Datanodes; each Worker Node runs an Executor that holds a cache and executes Tasks.
Spark Programming Model
34
The user (developer) writes the same driver program; each textFile or transformation call yields an RDD (Resilient Distributed Dataset):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
RDD
• Programming interface: the programmer can perform three
types of operations
35
Transformations
•Create a new dataset from an existing one.
•Lazy in nature: they are executed only when some action is performed.
•Examples: map(func), filter(func), distinct()

Actions
•Return a value to the driver program, or export data to a storage system, after performing a computation.
•Examples: count(), reduce(func), collect(), take()

Persistence
•For caching datasets in memory for future operations.
•Option to store on disk, in RAM, or mixed (Storage Level).
•Examples: persist(), cache()
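A short sketch tying the three kinds of operations together (it assumes a SparkContext named sc, as in the earlier slides, and an illustrative HDFS path):

val lines  = sc.textFile("hdfs://.../events.log")  // transformation: lazy, nothing runs yet
val errors = lines.filter(_.contains("ERROR"))     // transformation: still lazy
errors.cache()                                     // persistence: keep the RDD in memory once computed
val n = errors.count()                             // action: triggers the actual computation
errors.take(5).foreach(println)                    // action: served from the cached RDD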
How Spark works
• RDD: Parallel collection with partitions
• User applications create RDDs, transform them, and run
actions.
• This results in a DAG (Directed Acyclic Graph) of
operators.
• DAG is compiled into stages
• Each stage is executed as a set of Tasks (one Task per
Partition).
36
Example
37
sc.textFile("/wiki/pagecounts") // RDD[String]
Operators so far: textFile
Example
38
sc.textFile("/wiki/pagecounts")  // RDD[String]
  .map(line => line.split("\t")) // RDD[Array[String]]
Operators so far: textFile → map
Example
39
sc.textFile("/wiki/pagecounts")  // RDD[String]
  .map(line => line.split("\t")) // RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))  // RDD[(String, Int)]
Operators so far: textFile → map → map
Example
40
sc.textFile("/wiki/pagecounts")  // RDD[String]
  .map(line => line.split("\t")) // RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _)            // RDD[(String, Int)]
Operators so far: textFile → map → map → reduceByKey
Example
41
sc.textFile("/wiki/pagecounts")  // RDD[String]
  .map(line => line.split("\t")) // RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)         // RDD[(String, Int)] with 3 partitions
  .collect()                     // Array[(String, Int)] on the driver
The second argument to reduceByKey (here 3) sets the number of partitions, and hence reduce tasks, of the result.
Execution Plan
Stages are sequences of RDD operations that don't have a
shuffle in between
42
Stage 1: textFile → map → map
Stage 2: reduceByKey → collect
Execution Plan
43
Stage 1 (textFile → map → map):
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2 (reduceByKey → collect):
1. Read shuffle data
2. Final reduce
3. Send result to driver
program
Stage Execution
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
All of this happens internally (you don't need to do anything)
44
Spark Executor (Slaves)
45
[Diagram: each executor core (Core 1, Core 2, Core 3) repeatedly runs the loop Fetch Input → Execute Task → Write Output, so several tasks proceed in parallel on one worker.]
Summary of Components
• Task: The fundamental unit of execution in Spark
• Stage: Set of Tasks that run in parallel
• DAG: Logical Graph of RDD operations
• RDD: Parallel dataset with partitions
46
Start the docker container
From
•https://github.com/sequenceiq/docker-spark
docker pull sequenceiq/spark:1.3.0
docker run -i -t -h sandbox sequenceiq/spark:1.3.0-ubuntu /etc/bootstrap.sh bash
•Run the spark shell using yarn or local
spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 2
47
Separate Container Master/Worker
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
•These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
48
Running the example and Shell
• To Run an example
$ run-example SparkPi 10
• We can start a spark shell via
– spark-shell --master local[n]
• The --master option specifies the master URL for a
distributed cluster
• Example applications are also provided in Python
– spark-submit examples/src/main/python/pi.py 10
49
Scala Base Course - Start
50
Scala vs Java vs Python
• Spark was originally written in Scala, which allows
concise function syntax and interactive use
• Java API added for standalone applications
• Python API added more recently along with an
interactive shell.
51
Why Scala?
• High-level language for the JVM
– Object oriented + functional programming
• Statically typed
– Type Inference
• Interoperates with Java
– Can use any Java Class
– Can be called from Java code
52
Quick Tour of Scala
53
Laziness
• For variables we can define lazy vals, which are evaluated on
first use
lazy val x = 10 * 10 * 10 * 10 //long computation
• For methods we can define call by value and call by name for
the parameters
def square(x: Double) // call by value
def square(x: => Double) // call by name
• This changes when the parameters are evaluated
54
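A small sketch of the difference; the expensive() helper is ours, purely for illustration:

def twiceByValue(x: Double): Double = x + x   // call by value: x is evaluated once, before the call
def twiceByName(x: => Double): Double = x + x // call by name: x is re-evaluated at each use

def expensive(): Double = { println("computing..."); scala.util.Random.nextDouble() }

twiceByValue(expensive()) // prints "computing..." once
twiceByName(expensive())  // prints "computing..." twice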
Anonymous functions
55
scala> val square = (x: Int) => x * x
square: Int => Int = <function1>
We define an anonymous function from Int to Int.
Here square is a val of type Function1, which is equivalent to the method
scala> def square(x: Int) = x * x
square: (x: Int)Int
Anonymous Functions
(x: Int) => x * x
This is syntactic sugar for
new Function1[Int, Int] {
def apply(x: Int): Int = x * x
}
56
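Either form is applied the same way, since f(3) desugars to f.apply(3):

val f = new Function1[Int, Int] { def apply(x: Int): Int = x * x }
println(f(3))       // 9
println(f.apply(3)) // 9, the same call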
Currying
Converting a function with multiple arguments into a function
with a single argument that returns another function.
def gen(f: Int => Int)(x: Int) = f(x)
def identity(x: Int) = gen(i => i)(x)
def square(x: Int) = gen(i => i * i)(x)
def cube(x: Int) = gen(i => i * i * i)(x)
57
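The payoff is partial application: applying gen to only its first argument list yields a function that still expects x. For example:

val squareFn = gen(i => i * i) _ // Int => Int
println(squareFn(5))             // 25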
Anonymous Functions
//Explicit type declaration
val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y)
//The compiler expects 2 ints so x and y types are inferred
val call2 = doWithOneAndTwo((x, y) => x + y)
//Even more concise syntax
val call3 = doWithOneAndTwo(_ + _)
58
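The slide leaves doWithOneAndTwo undefined; a hypothetical definition consistent with these calls would be:

// Hypothetical helper, not shown on the slide: applies f to the fixed arguments 1 and 2
def doWithOneAndTwo(f: (Int, Int) => Int): Int = f(1, 2)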
Returning multiple variables
def swap(x:String, y:String) = (y, x)
val (a,b) = swap("hello","world")
println(a, b)
59
Higher-Order Functions
Methods that take functions as parameters
val list = (1 to 4).toList
list.foreach( x => println(x))
list.foreach(println)
list.map(x => x + 2)
list.map(_ + 2)
list.filter(x => x % 2 == 1)
list.filter(_ % 2 == 1)
list.reduce((x,y) => x + y)
list.reduce(_ + _)
60
Function Methods on Collections
61
http://www.scala-lang.org/api/2.11.6/index.html#scala.collection.Seq
Scala Base Course - End
http://scalatutorials.com/
62
Next Topics
• Spark Shell
– Scala
– Python
• Shark Shell
• Data Frames
• Spark Streaming
• Code Examples: Processing and Machine Learning
63
Editor's Notes

1. Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. Currently there are 2 types of RDDs:
- Parallelized collections: created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU.
- Hadoop datasets: distributed datasets created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.), using SparkContext's textFile method. The default in this case is 1 slice per file block.
2. Transformations: map, for example, takes an RDD as input, passes each element through a function, and returns a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify the RDD to be cached in memory, in which case Spark will try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the storage level for persisting an RDD; cache is just shorthand for the default storage level, which is MEMORY_ONLY.
- MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
- MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk and read them from there when they're needed.
- MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially with a fast serializer, but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
- DISK_ONLY: store the RDD partitions only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
Which storage level is best? A few things to consider: keep as much in memory as possible; do not spill to disk unless the datasets are expensive to recompute; use replication only if you want fault tolerance.
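A small sketch of the persistence API described in these notes (the path and variable names are illustrative):

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("hdfs://.../events.log")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val hot = events.filter(_.contains("ERROR")).cache()

// Memory-expensive datasets can spill to disk instead of being recomputed
val big = events.map(_.split("\t")).persist(StorageLevel.MEMORY_AND_DISK)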