From Hadoop to Spark
Dr. Fabio Fumarola
• Aggregate and Cluster
• Scatter Gather and MapReduce
• MapReduce
• Why Spark?
• Spark:
– Example, task and stages
– Docker Example
– Scala and Anonymous Functions
• Next Topics in 2/2
Aggregates and Clusters
• Aggregate-oriented databases change the rules for
data storage (CAP)
• But running on a cluster also changes computation
• When you store data on a cluster you can process
data in parallel
Database vs Client Processing
• With a centralized database data can be processed
on the database server or on the client machine
• Running on the client:
– Pros: flexibility and programming languages
– Cons: data transfer from the server to the client
• Running on the server:
– Pros: Data locality
– Cons: programming languages and debugging
Cluster and Computation
• We can spread the computation across the cluster
• However, we have to reduce the amount of data
transferred over the network
• We need computation locality
• That is, process the data on the same node where it is stored
Use a Scatter-Gather that broadcasts a message to multiple recipients and
re-aggregates the responses back into a single message.
• It is a way to organize processing by taking
advantage of clusters
• It gained prominence with Google’s MapReduce
framework (Dean and Ghemawat 2004)
• It was then implemented in the Hadoop framework
Programming Model: MapReduce
• We have a huge text document
• Count the number of times each distinct word
appears in the file
• Sample applications
– Analyze web server logs to find popular URLs
– Term statistics for search
Word Count
• Assumption: the input file is too large for memory,
but all <word, count> pairs fit in memory
• We can compute the pairs by
– tr -s '[:space:]' '\n' < file.txt | sort | uniq -c
• Encyclopedia Britannica, 11th Edition, Volume 4, Part
3 (
Word Count Steps
tr -s '[:space:]' '\n' < file.txt | sort | uniq -c
• Scan input file record-at-a-time
• Extract keys from each record
• Group by key
• Sort and shuffle
• Aggregate, summarize, filter or transform
• Write the results
MapReduce: Logical Steps
• Map
• Group by Key
• Reduce
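The three logical steps can be sketched on ordinary Scala collections. This is only a single-machine illustration, not Hadoop code; the names WordCountSketch, mapPhase, groupByKey and reducePhase are made up for this example.

```scala
object WordCountSketch {
  // Map: emit a (word, 1) pair for every word in one record (line).
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Group by key: collect all the counts emitted for the same word.
  def groupByKey(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce: combine the values of one key into a single count.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def wordCount(lines: Seq[String]): Map[String, Int] =
    groupByKey(lines.flatMap(mapPhase)).map { case (w, cs) => reducePhase(w, cs) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog", "The fox")
    wordCount(lines).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a real cluster the map and reduce functions run on different machines and the grouping is done by the framework; here everything happens in one process.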
Map Phase
Group and Reduce Phase
Partition and shuffling
MapReduce: Word Counting
Word Count with MapReduce
Example: Language Model
• Count the number of times each 5-word sequence
occurs in a large corpus of documents
• Map
– Extract <5-word sequence, count> pairs from each document
• Reduce
– Combine the counts
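A possible single-machine sketch of this language-model example, assuming words are separated by whitespace. FiveGramSketch and its methods are illustrative names, not part of any framework.

```scala
object FiveGramSketch {
  // Map: extract (5-word sequence, 1) pairs from one document.
  def mapPhase(document: String): Seq[(String, Int)] =
    document.split("\\s+").filter(_.nonEmpty).toSeq
      .sliding(5)
      .filter(_.size == 5) // drop the short window produced by documents under 5 words
      .map(words => (words.mkString(" "), 1))
      .toSeq

  // Reduce: combine the counts of each sequence.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (gram, ps) => (gram, ps.map(_._2).sum) }

  def count(documents: Seq[String]): Map[String, Int] =
    reducePhase(documents.flatMap(mapPhase))

  def main(args: Array[String]): Unit =
    count(Seq("a b c d e f", "a b c d e")).foreach(println)
}
```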
MapReduce: Physical Execution
Physical Execution: Concerns
• Mapper intermediate results are sent to a single reducer
– This is the only step that requires communication over
the network
• Thus the Partition and Shuffle phases are critical
Partition and Shuffle
Reducing the cost of these steps dramatically reduces
the time cost of the computation:
• The Partitioner determines which partition a given
(key, value) pair will go to.
• The default partitioner computes a hash value for the
key and assigns the partition based on this result.
• The Shuffler moves map outputs to the reducers.
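The default behaviour described above can be approximated as follows. This is a sketch of the idea (hash of the key modulo the number of partitions), not Hadoop's actual partitioner class; partitionFor and shuffle are hypothetical helpers.

```scala
object HashPartitionSketch {
  // Map a key to one of numPartitions partitions using its hash code.
  // The remainder can be negative for negative hash codes, so it is shifted back.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Shuffle: route every (key, value) pair to the partition of its key,
  // so that all the values of one key end up at the same reducer.
  def shuffle[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (key, _) => partitionFor(key, numPartitions) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("fox", 1), ("dog", 1), ("fox", 1), ("the", 1))
    shuffle(pairs, 3).foreach(println)
  }
}
```

Because the partition depends only on the key, a skewed key distribution produces unevenly loaded reducers, which is one reason this phase is critical.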
MapReduce: Features
• Partitioning the input data
• Scheduling the program’s execution across a set of machines
• Performing the group by key step
• Handling node failures
• Managing required inter-machine communication
• Hadoop is open-source software for distributed storage of large
datasets on commodity hardware
• Provides a programming model/framework for processing
large datasets in parallel
Input Output
Hadoop: Architecture
Distributed File System
• Data is kept in “chunks” spread across machines
• Each chunk is replicated on different machines
(Persistence and Availability)
Distributed File System
• Chunk Servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated 3 times
– Try to keep replicas on different machines
• Master Node
– Name Node in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
Hadoop’s Limitations
Limitations of Map Reduce
• Slow due to replication, serialization, and disk IO
• Inefficient for:
– Iterative algorithms (Machine Learning, Graphs & Network Analysis)
– Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)
[Figure: Input, iter. 1, iter. 2, ...]
• Leverage memory:
– Load data into memory
– Replace disks with SSDs
Apache Spark
• A big data analytics cluster-computing framework
written in Scala.
• Originally open sourced by the AMPLab at UC Berkeley
• Provides in-memory analytics based on RDDs
• Highly compatible with the Hadoop Storage API
– Can run on top of a Hadoop cluster
• Developers can write programs using multiple
programming languages
Spark architecture
[Figure: Spark Driver (Master) and Cluster Manager on top of several Datanodes, each holding a Cache and a Block]
Hadoop Data Flow
[Figure: iter. 1, iter. 2, ...]
Spark Data Flow
[Figure: iter. 1, iter. 2, ...]
Not tied to the two-stage MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
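The three steps above can be illustrated without Spark, using a plain Scala collection as a stand-in for the cached dataset. This is only an analogy under stated assumptions: WorkingSet is a hypothetical class, the log lines are made-up sample data, and the scans counter exists just to show that the input is read once.

```scala
class WorkingSet(input: Seq[String]) {
  var scans = 0 // how many times the full input has been scanned

  // 1. Extract a working set and 2. cache it: the filtered lines are
  // materialized in memory the first time they are needed.
  lazy val errors: Vector[String] = {
    scans += 1
    input.filter(_.startsWith("ERROR")).toVector
  }

  // 3. Query it repeatedly: every query reuses the in-memory working set.
  def countContaining(word: String): Int = errors.count(_.contains(word))
}

object WorkingSetSketch {
  def main(args: Array[String]): Unit = {
    val ws = new WorkingSet(Seq(
      "INFO start", "ERROR disk full", "WARN slow", "ERROR timeout", "ERROR disk full"))
    println(ws.countContaining("disk"))
    println(ws.countContaining("timeout"))
    println(s"input scanned ${ws.scans} time(s)")
  }
}
```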
Logistic regression in Hadoop and Spark
Spark Programming Model
sc = new SparkContext
Driver Program
[Figure: the Driver Program sends Tasks to two Worker Nodes; each Worker Node runs an Executor with a Cache]
Spark Programming Model
sc = new SparkContext
Driver Program
RDD (Resilient Distributed Dataset)
• Immutable Data structure
• In-memory (explicitly)
• Fault Tolerant
• Parallel Data Structure
• Controlled partitioning to
optimize data placement
• Can be manipulated using rich set
of operators.
• Programming interface: the programmer can perform 3 types of operations
Transformations
• Create a new dataset from
an existing one.
• Lazy in nature: they are
executed only when some
action is performed.
• Examples:
• map(func)
• filter(func)
• distinct()
Actions
• Return a value to the driver
program or export data to a
storage system after
performing a computation.
• Examples:
• count()
• reduce(func)
• collect()
• take(n)
Persistence
• For caching datasets in memory
for future operations.
• Option to store on disk, in
RAM, or mixed (Storage Level).
• persist()
• cache()
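The difference between lazy transformations and actions can be imitated with Scala collection views. This is an analogy, not the RDD API: a view records map and filter without running them, and forcing it with toList plays the role of an action. LazyTransformationSketch and its counter are illustrative.

```scala
object LazyTransformationSketch {
  // Returns (elements processed before the action, result, elements processed after).
  def run(): (Int, List[Int], Int) = {
    var evaluated = 0 // how many elements the map function has processed
    val numbers = (1 to 10).toList

    // "Transformations": the view only records the pipeline, nothing runs yet.
    val pipeline = numbers.view
      .map { x => evaluated += 1; x * x }
      .filter(_ % 2 == 0)
    val before = evaluated

    // "Action": forcing the view executes the whole pipeline.
    val result = pipeline.toList
    (before, result, evaluated)
  }

  def main(args: Array[String]): Unit = println(run())
}
```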
How Spark works
• RDD: Parallel collection with partitions
• User applications create RDDs, transform them, and
run actions.
• This results in a DAG (Directed Acyclic Graph) of
operators
• The DAG is compiled into stages
• Each stage is executed as a series of Tasks (one Task
for each Partition).
sc.textFile("/wiki/pagecounts")    // RDD[String]
  .map(line => line.split("\t"))   // RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))    // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)           // RDD[(String, Int)]
  .collect()                       // Array[(String, Int)]
Execution Plan
Stages are sequences of RDDs that don’t have a shuffle in between
[Figure: textFile, map, map in Stage 1, followed by Stage 2]
Execution Plan
Stage 1
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2
1. Read shuffle data
2. Final reduce
3. Send result to driver
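The execution plan can be mimicked on local collections to make the stage boundary visible. This is a simplified simulation, not Spark internals: input partitions are plain sequences of tab-separated lines, and stage1Task, shuffle and stage2Task are hypothetical names.

```scala
object TwoStageSketch {
  // Stage 1, one task per input partition: apply both maps, then a partial reduce.
  def stage1Task(lines: Seq[String]): Map[String, Int] =
    lines
      .map(line => line.split("\t"))
      .map(r => (r(0), r(1).toInt))
      .groupBy(_._1)
      .map { case (page, pairs) => (page, pairs.map(_._2).sum) }

  // Shuffle: send each partial result to the reduce partition chosen by the key's hash.
  def shuffle(partials: Seq[Map[String, Int]], numReducers: Int): Map[Int, Seq[(String, Int)]] =
    partials.flatMap(_.toSeq)
      .groupBy { case (page, _) => math.abs(page.hashCode % numReducers) }

  // Stage 2, one task per reduce partition: final reduce.
  def stage2Task(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (page, ps) => (page, ps.map(_._2).sum) }

  def run(inputPartitions: Seq[Seq[String]], numReducers: Int): Map[String, Int] = {
    val partials = inputPartitions.map(stage1Task)
    val shuffled = shuffle(partials, numReducers)
    // Send result to driver: merge the outputs of all stage 2 tasks.
    shuffled.values.flatMap(stage2Task).toMap
  }

  def main(args: Array[String]): Unit =
    println(run(Seq(Seq("a\t1", "b\t2"), Seq("a\t3")), numReducers = 3))
}
```

The partial reduce in stage 1 shrinks the data before it crosses the network, which is why it is started before the shuffle data is written.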
Stage Execution
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
And all this happens internally (you don't need to do anything)
Spark Executor (Slaves)
[Figure: tasks scheduled on Core 1, Core 2 and Core 3; each task goes through Fetch Input, Execute Task, Write Output]
Summary of Components
• Task: The fundamental unit of execution in Spark
• Stage: Set of Tasks that run in parallel
• DAG: Logical Graph of RDD operations
• RDD: Parallel dataset with partitions
Start the docker container
docker pull sequenceiq/spark:1.3.0
docker run -i -t -h sandbox sequenceiq/spark:1.3.0-ubuntu
/etc/ bash
• Run the Spark shell using YARN or local mode
spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 2
Separate Container Master/Worker
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
• These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
Running the example and Shell
• To run an example
$ run-example SparkPi 10
• We can start a Spark shell via
– spark-shell --master local[n]
• The --master option specifies the master URL for a
distributed cluster
• Example applications are also provided in Python
– spark-submit example/src/main/python/ 10
Scala Base Course - Start
Scala vs Java vs Python
• Spark was originally written in Scala, which allows
concise function syntax and interactive use
• Java API added for standalone applications
• Python API added more recently along with an
interactive shell.
Why Scala?
• High-level language for the JVM
– Object oriented + functional programming
• Statically typed
– Type Inference
• Interoperates with Java
– Can use any Java Class
– Can be called from Java code
Quick Tour of Scala
• For variables we can define a lazy val, which is evaluated when
first accessed
lazy val x = 10 * 10 * 10 * 10 // long computation
• For methods we can define call by value and call by name for
the parameters
def square(x: Double): Double = x * x // call by value
def square(x: => Double): Double = x * x // call by name
• This changes when the parameters are evaluated
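A small sketch that makes the evaluation rules observable by counting how often an argument expression runs. The names EvaluationSketch, expensive, squareByValue and squareByName are made up for this example (the two square variants are renamed so they can coexist in one scope).

```scala
object EvaluationSketch {
  // Returns how many times the expensive expression was evaluated for
  // (call by value, call by name, a lazy val accessed twice).
  def countEvaluations(): (Int, Int, Int) = {
    var n = 0
    def expensive(): Double = { n += 1; 3.0 }

    // Call by value: the argument is evaluated once, before the body runs.
    def squareByValue(x: Double): Double = x * x
    // Call by name: the argument expression is re-evaluated at every use of x.
    def squareByName(x: => Double): Double = x * x

    val r1 = squareByValue(expensive())
    val byValue = n

    n = 0
    val r2 = squareByName(expensive())
    val byName = n

    n = 0
    lazy val cached = expensive() // not evaluated until first access
    val r3 = cached + cached      // evaluated on first access, then reused
    val lazyCount = n

    assert(r1 == 9.0 && r2 == 9.0 && r3 == 6.0)
    (byValue, byName, lazyCount)
  }

  def main(args: Array[String]): Unit = println(countEvaluations())
}
```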
Anonymous functions
scala> val square = (x: Int) => x * x
square: Int => Int = <function1>
We define an anonymous function from Int to Int
Here square is a val of type Function1, which is equivalent to
scala> def square(x: Int) = x * x
square: (x: Int)Int
Anonymous Functions
(x: Int) => x * x
This is syntactic sugar for
new Function1[Int, Int] {
  def apply(x: Int): Int = x * x
}
Currying
Converting a function with multiple arguments into a function
with a single argument that returns another function.
def gen(f: Int => Int)(x: Int) = f(x)
def identity(x: Int) = gen(i => i)(x)
def square(x: Int) = gen(i => i * i)(x)
def cube(x: Int) = gen(i => i * i * i)(x)
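One way to see what the two parameter lists of gen buy us is to supply only the first one. The sketch below reuses gen from the slide; the val names and the add function are illustrative additions.

```scala
object CurryingSketch {
  // Curried definition from the slide: two parameter lists.
  def gen(f: Int => Int)(x: Int): Int = f(x)

  // Supplying only the first argument list yields a function Int => Int.
  val square: Int => Int = gen(i => i * i)
  val cube: Int => Int = gen(i => i * i * i)

  // An ordinary two-argument function value can be curried with .curried.
  val add: (Int, Int) => Int = (a, b) => a + b
  val addCurried: Int => Int => Int = add.curried
  val addFive: Int => Int = addCurried(5)

  def main(args: Array[String]): Unit = {
    println(square(4))
    println(cube(3))
    println(addFive(2))
  }
}
```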
Anonymous Functions
//Explicit type declaration
val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y)
//The compiler expects 2 ints so x and y types are inferred
val call2 = doWithOneAndTwo((x, y) => x + y)
//Even more concise syntax
val call3 = doWithOneAndTwo(_ + _)
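The three calls above rely on a method doWithOneAndTwo that is not shown. A plausible definition, assumed here for illustration, applies the given function to the constants 1 and 2:

```scala
object DoWithOneAndTwoSketch {
  // Assumed definition (not shown on the slide): apply f to the constants 1 and 2.
  def doWithOneAndTwo(f: (Int, Int) => Int): Int = f(1, 2)

  // Because the parameter type is (Int, Int) => Int, the compiler can infer
  // the argument types of the lambda and of the placeholder syntax.
  val sum: Int = doWithOneAndTwo(_ + _)
  val product: Int = doWithOneAndTwo((x, y) => x * y)

  def main(args: Array[String]): Unit = {
    println(sum)
    println(product)
  }
}
```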
Returning multiple variables
def swap(x:String, y:String) = (y, x)
val (a,b) = swap("hello","world")
println(a, b)
Higher-Order Functions
Methods that take functions as parameters
val list = (1 to 4).toList
list.foreach(x => println(x))
list.foreach(println)
list.map(x => x + 2)
list.map(_ + 2)
list.filter(x => x % 2 == 1)
list.filter(_ % 2 == 1)
list.reduce((x,y) => x + y)
list.reduce(_ + _)
Function Methods on Collections
Scala Base Course - End
Next Topics
• Spark Shell
– Scala
– Python
• Shark Shell
• Data Frames
• Spark Streaming
• Code Examples: Processing and Machine Learning

Editor's Notes

  1. Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. There are currently 2 types of RDDs. Parallelized collections: created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU. Hadoop datasets: these distributed datasets are created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.). They are created using SparkContext’s textFile method. The default number of slices in this case is 1 slice per file block.
  2. Transformations: like map, a transformation takes an RDD as input, passes each element to a function, and returns a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify that the RDD should be cached in memory. Spark will then try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the Storage Level for persisting an RDD. Cache is just shorthand for the default storage level, which is MEMORY_ONLY.
     MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
     MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
     MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
     MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
     DISK_ONLY: Store the RDD partitions only on disk.
     MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.
     Which storage level is best? A few things to consider: try to keep as much as possible in memory; try not to spill to disk unless your computed datasets are expensive to recompute; use replication only if you want fast fault recovery.