SlideShare a Scribd company logo
Introduction to MapReduce
Mohamed Baddar
Senior Data Engineer
Contents
Computation Models for Distributed Computing
MPI
MapReduce
Why MapReduce?
How MapReduce works
Simple example
References
2
Distributed Computing
Why ?
Booming of big data generation (social media , e-commerce , banks , etc …)
Big data and machine learning , data mining AI became like bread and butter : better results
comes from analyzing bigger set of data.
How it works ?
Data-Partitioning : divide data into multiple tasks , each implementing the same procedure
(computations) at specific phase on its data segment
Task-Partitioning : assign different tasks to different computation units
Hardware for distributed Computing
Multiple processors (Multi-core processors) 3
Metrics
How to judge computational model suitability ?
Simplicity : level of developer experience
Scalability : adding more computational node , increase throughput / response time
fault tolerance : support recovering computed results when node is down
Maintainability : How easy fix bugs , add features
Cost : need for special hardware (Multi-core processors , large RAM , infiniband or can use
common ethernet cluster of commodity machines)
No one size fits all
sometimes it is better to use hybrid computational models 4
MPI (Message Passing Interface)
● Workload is divided among different
processes (each process may have
multiple threads)
● Communication is via Message passing
● Data exchange is via shared memory
(Physical / Virtual)
● Pros
○ Flexibility : programmer can customize message and communication between nodes
○ Speed : rely on sharing data via memory
Source :
https://computing.llnl.gov/tutorials/mpi/
5
MapReduce
Objective : Design scalable parallel programming
framework to be deployed on
large cluster of commodity machines
Data divided into splits , each processed
by map functions , whose output are
processed by reduce functions.
Originated and first practical
implementation in Google Inc. 2004
MapReduce implementations
Apache Hadoop (Computation)
6
MapReduce Execution (1)
# Mapper (M=3)
#Reducers (R=2)
MapReduce functions
map(K1,V1) list(K2,V2)
reduce(K2,list(V2)) list(V2)
7
MapReduce - Execution (2)
Platform :
Nodes communicating over ethernet network over TCP/IP
Two main type of processes :
Master : orchestrates the work
Worker : process data
Units of work :
Jobs :A MapReduce job is a unit of work that the client wants to be performed: it consists of the
input data, the MapReduce program, and configuration.
task : can be map task (process input to intermediate data), reduce task (process intermediate
data to output). Job is divided into several map / reduce tasks. 8
MapReduce Execution (3)
1. A copy of the master process is created
2. Input data is divided into M splits , each of 16 to 64 MB (user configured)
3. M map tasks are created and given unique IDs , each parses key/value pairs
of each split ,start processing , output is written into a memory buffer.
4. Map output is partitioned to R partitions. When buffers are full , they are
spilled into local hard disks , and master is notified by saved buffered
locations. All records with the same key are put in same partition.
Note : Map output stored in local worker file system , not distributed file system , as it is intermediate
and to avoid complexity.
5. Shuffling : when a reduce receives a notification from master that one of 9
MapReduce Execution (4)
6. When the reduce worker receives all its intermediate output , it sorts them by
key (sorting is need as reduce task may have several keys). (1)
7. When sorting finished , the reduce worker iterates over each key , passing the
key and list of values to the reduce function.
8. The output of reduce function is appended to the file corresponding to this
reduce worker.
9. For each HDFS block of the output of reduce task , one block is stored locally
in the reduce worker and the other 2 (assuming replication factor of 3) is
replicated on two other off-rack node for reliability.
■ Notes : 10
Master responsibilities
Find idle nodes (workers) to assign
map and reduce tasks.
monitor each task status
(idle , in-progress, finished).
Keep track of locations of R map
intermediate output on each map
worker machine.
Keep record of worker IDs and
other info (CPU , memory , disk size)
Continuously push information about
intermediate map output to reduce 11
Fault tolerance (1)
Objective : handle machine failures gracefully ,
i.e. programmer don’t need to handle
it or be aware of details.
Two type of failures :
Master failure
Worker failure
Two main activities
Failure detection
Recover lost (computed) data with least 12
Fault tolerance (2)
Worker failure
Detection : Timeout for master ping , mark worker as failed.
Remove worker from list of available worker.
For all map tasks assigned to that worker :
mark these tasks as idle
these tasks will be eligible for re-scheduling on other workers
map tasks are re-executed as output is stored in local file system in failed machine
all reduce workers are notified with re-execution so they can get intermediate data they
haven’t yet.
No need to re-execute reduce tasks as their output is stored in distributed file system and
13
Semantics in the Presence of Failures
Deterministic and Nondeterministic Functions
Deterministic functions always return the same result any time they are called with a specific set
of input values.
Nondeterministic functions may return different results each time they are called with a specific
set of input values.
If map and reduce function are “deterministic” , distributed implementation of
mapreduce framework must produce the same output of a non-faulting
sequential execution of the program.
Several copies of the same map/reduce task might run on different nodes for
sake of reliability and fault tolerance 14
Semantics in the Presence of Failures (2)
Mapper always write their output to tmp files (atomic commits).
When a map task finishes :
Renames the tmp file to final output.
Sends message to master informing it with the filename.
If another copy of the same map finished before , master ignores it , else store the filename.
Reducers do the same , and if multiple copies of the same reduce task finished ,
MapReduce framework rely on the atomic rename of the file system.
If map task are non-deterministic , and multiple copies of the map task run on
different machines , weak semantic condition can happen : two reducers read15
Semantics in the Presence of Failures (3)
16
● Workers #1 and #2 run the
same copy of map task M1
● Reducer task R1 reads its
input for M1 from worker #1
● Reducer task R2 reads its
input for M1 from worker #2 ,
as worker#1 has failed by the
time R2 has started.
● If M1’s function is deterministic
, we have complete
consistency.
● If M1’s function is not
deterministic , R1 and R2 may
receive different results from
M1.
Task granularity
Load balancing: fine grained is better , faster machines tend to take more tasks than slower machined
over time , leads to less overall job execution time.
Failure recovery : less time to re-execute failed tasks.
Very fine-grained tasks may not be desirable : overhead to manage and too much data shuffling
(consume valuable bandwidth).
Optimal Granularity : Split size = HDFS block size (128 MB by default)
HDFS block is guaranteed to be in the same node.
We need to maximize work done by one mapper locally
if split size < block : not fully utilizing possibility of local data processing
if split size > block : may need data transfer to make map function complete
17
Data locality
Network bandwidth is a valuable resource.
We assume a rack server hardware.
MapReduce scheduler works as follows :
1. Try to assign map task to the node where the
corresponding split block(s) reside , if it is free
assign , else go to step 2
2. try to find a free nod in the same rack to assign
the map task , if can’t find a free off-rack node
to assign.
● More complex implementation uses network cost
model.
18
Backup tasks
Stragglers : a set machines that run a set
of assigned tasks (MapReduce) very slow.
Slow running can be due to many reasons;
bad disk , slow network , low speed CPU.
Other tasks scheduled on stragglers cause
more
load and longer execution time.
Solution Mechanism:
When MapReduce job is close to finish
for all the “in-progress” tasks , issue backup tasks. 19
Refinements
Partitioning function :
Partition the output of mapping tasks into R partitions (each for a reduce task).
Good function should try as possible to make the partitions equal.
Default : hash(key) mod R
Works fine usually
problem arises when specific keys have many records
than the others.
Need design custom hash functions or change the key.
Combiner function
Reduce size of map intermediate output.
20
Refinements (2)
Skipping bad record
Bug in third party library that can’t be fixed , causes code crash at specific records
Terminating a job running for hours / days more expensive than sacrificing small percentage of
accuracy (If context allows , for ex. statistical analysis of large data).
How MapReduce handle that ?
1. Each worker process installs a signal handler that catches segmentation violations ,bus
errors and other possible fatal errors.
2. Before a map / reduce task runs , the MapReduce library stores the key value in global
variable.
3. When a map / reduce task function code generates a signal , the worker sends UDP 21
References
1. MapReduce: simplified data processing on large clusters
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
2. Hadoop Definitive guide , Ch 3
3. https://dgleich.wordpress.com/2012/10/05/why-mapreduce-is-successful-its-the-io/
4. http://www.infoworld.com/article/2616904/business-intelligence/mapreduce.html
5. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
6. Hadoop Definitive Guide - Chapter 1&2
7. http://research.google.com/archive/mapreduce-osdi04-slides/
8. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
22

More Related Content

What's hot

Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
VNIT-ACM Student Chapter
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
Abhishek Singh
 
MapReduce
MapReduceMapReduce
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
Hadoop User Group
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Manuel Correa
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
Subhas Kumar Ghosh
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
Bhupesh Chawda
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
Andrea Iacono
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
Francisco Pérez-Sorrosal
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
Cleverence Kombe
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
Brendan Tierney
 
MapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersMapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large Clusters
Abolfazl Asudeh
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
Hassan A-j
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
Nicola Cadenelli
 
E031201032036
E031201032036E031201032036
E031201032036
ijceronline
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
Adrian Florea
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
kazuma_sato
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
Subhas Kumar Ghosh
 

What's hot (20)

Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
MapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersMapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large Clusters
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
E031201032036
E031201032036E031201032036
E031201032036
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 

Viewers also liked

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
Jyotirmoy Dey
 
Algorithms for Cloud Computing
Algorithms for Cloud ComputingAlgorithms for Cloud Computing
Algorithms for Cloud Computing
Sławomir Zborowski
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MG
Pradeep MG
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
Krishna Sangeeth KS
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
Jeff Hammerbacher
 
Pyshark in Network Packet analysis
Pyshark in Network Packet analysisPyshark in Network Packet analysis
Pyshark in Network Packet analysis
Rengaraj D
 
Cyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_FahadCyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_Fahad
aliuet
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
soujavajug
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Lorenzo Alberton
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Frane Bandov
 
An Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBAn Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDB
Rainforest QA
 
Ipv6
Ipv6Ipv6
IPV6 ppt
IPV6 pptIPV6 ppt
IPV6 ppt
justdoitkhan
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 
Cyber terrorism
Cyber terrorismCyber terrorism
Cyber terrorism
shaympariyar
 
cyber terrorism
cyber terrorismcyber terrorism
cyber terrorism
Accenture
 
ipv6 ppt
ipv6 pptipv6 ppt
ipv6 ppt
Shiva Kumar
 

Viewers also liked (20)

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Algorithms for Cloud Computing
Algorithms for Cloud ComputingAlgorithms for Cloud Computing
Algorithms for Cloud Computing
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MG
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Pyshark in Network Packet analysis
Pyshark in Network Packet analysisPyshark in Network Packet analysis
Pyshark in Network Packet analysis
 
Cyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_FahadCyber terrorism by_Ali_Fahad
Cyber terrorism by_Ali_Fahad
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
An Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDBAn Introduction to Map/Reduce with MongoDB
An Introduction to Map/Reduce with MongoDB
 
Ipv6
Ipv6Ipv6
Ipv6
 
IPV6 ppt
IPV6 pptIPV6 ppt
IPV6 ppt
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Cyber terrorism
Cyber terrorismCyber terrorism
Cyber terrorism
 
cyber terrorism
cyber terrorismcyber terrorism
cyber terrorism
 
ipv6 ppt
ipv6 pptipv6 ppt
ipv6 ppt
 

Similar to Introduction to map reduce

mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
ShimoFcis
 
Map reduce
Map reduceMap reduce
Map reduce
대호 김
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
softwarequery
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
lmphuong06
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
sreehari orienit
 
Hadoop
HadoopHadoop
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
Soumee Maschatak
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
ijcsit
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
IRJET Journal
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
MapReduce
MapReduceMapReduce
MapReduce
Surinder Kaur
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentation
Vu Thi Trang
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
Newvewm
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
Sitamarhi Institute of Technology
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
Sitamarhi Institute of Technology
 
Hadoop
HadoopHadoop
MapReduce
MapReduceMapReduce
MapReduce
ahmedelmorsy89
 

Similar to Introduction to map reduce (20)

mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Map reduce
Map reduceMap reduce
Map reduce
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentation
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
MapReduce
MapReduceMapReduce
MapReduce
 

Recently uploaded

Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
AltanAtabarut
 
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Alireza Kamrani
 
Audits Of Complaints Against the PPD Report_2022.pdf
Audits Of Complaints Against the PPD Report_2022.pdfAudits Of Complaints Against the PPD Report_2022.pdf
Audits Of Complaints Against the PPD Report_2022.pdf
evwcarr
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
PromptCloud
 
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
Ladislau5
 
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
Milind Agarwal
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
RuchiRathor2
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
deepikakumaridk25
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
District 11 Solutions
 
Field Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdfField Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdf
hritikbui
 
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdfThe Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
Riya Sen
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
SamanArshad11
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
wojakmodern
 
CT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptxCT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptx
RejoJohn2
 
Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
talha2khan2k
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
AltanAtabarut
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
Sathiakumar Chandr
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
amazenolmedojeruel
 
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di MalaysiaBimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
aznidajailani
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
rightmanforbloodline
 

Recently uploaded (20)

Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
 
Audits Of Complaints Against the PPD Report_2022.pdf
Audits Of Complaints Against the PPD Report_2022.pdfAudits Of Complaints Against the PPD Report_2022.pdf
Audits Of Complaints Against the PPD Report_2022.pdf
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
 
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
 
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
 
Field Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdfField Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdf
 
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdfThe Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
 
CT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptxCT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptx
 
Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
 
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di MalaysiaBimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 

Introduction to map reduce

  • 1. Introduction to MapReduce Mohamed Baddar Senior Data Engineer
  • 2. Contents Computation Models for Distributed Computing MPI MapReduce Why MapReduce? How MapReduce works Simple example References 2
  • 3. Distributed Computing Why ? Booming of big data generation (social media , e-commerce , banks , etc …) Big data and machine learning , data mining AI became like bread and butter : better results comes from analyzing bigger set of data. How it works ? Data-Partitioning : divide data into multiple tasks , each implementing the same procedure (computations) at specific phase on its data segment Task-Partitioning : assign different tasks to different computation units Hardware for distributed Computing Multiple processors (Multi-core processors) 3
  • 4. Metrics How to judge computational model suitability ? Simplicity : level of developer experience Scalability : adding more computational node , increase throughput / response time fault tolerance : support recovering computed results when node is down Maintainability : How easy fix bugs , add features Cost : need for special hardware (Multi-core processors , large RAM , infiniband or can use common ethernet cluster of commodity machines) No one size fits all sometimes it is better to use hybrid computational models 4
  • 5. MPI (Message Passing Interface) ● Workload is divided among different processes (each process may have multiple threads) ● Communication is via Message passing ● Data exchange is via shared memory (Physical / Virtual) ● Pros ○ Flexibility : programmer can customize message and communication between nodes ○ Speed : rely on sharing data via memory Source : https://computing.llnl.gov/tutorials/mpi/ 5
  • 6. MapReduce Objective : Design scalable parallel programming framework to be deployed on large cluster of commodity machines Data divided into splits , each processed by map functions , whose output are processed by reduce functions. Originated and first practical implementation in Google Inc. 2004 MapReduce implementations Apache Hadoop (Computation) 6
  • 7. MapReduce Execution (1) # Mapper (M=3) #Reducers (R=2) MapReduce functions map(K1,V1) list(K2,V2) reduce(K2,list(V2)) list(V2) 7
  • 8. MapReduce - Execution (2) Platform : Nodes communicating over ethernet network over TCP/IP Two main type of processes : Master : orchestrates the work Worker : process data Units of work : Jobs :A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration. task : can be map task (process input to intermediate data), reduce task (process intermediate data to output). Job is divided into several map / reduce tasks. 8
  • 9. MapReduce Execution (3) 1. A copy of the master process is created 2. Input data is divided into M splits , each of 16 to 64 MB (user configured) 3. M map tasks are created and given unique IDs , each parses key/value pairs of each split ,start processing , output is written into a memory buffer. 4. Map output is partitioned to R partitions. When buffers are full , they are spilled into local hard disks , and master is notified by saved buffered locations. All records with the same key are put in same partition. Note : Map output stored in local worker file system , not distributed file system , as it is intermediate and to avoid complexity. 5. Shuffling : when a reduce receives a notification from master that one of 9
  • 10. MapReduce Execution (4) 6. When the reduce worker receives all its intermediate output , it sorts them by key (sorting is need as reduce task may have several keys). (1) 7. When sorting finished , the reduce worker iterates over each key , passing the key and list of values to the reduce function. 8. The output of reduce function is appended to the file corresponding to this reduce worker. 9. For each HDFS block of the output of reduce task , one block is stored locally in the reduce worker and the other 2 (assuming replication factor of 3) is replicated on two other off-rack node for reliability. ■ Notes : 10
  • 11. Master responsibilities Find idle nodes (workers) to assign map and reduce tasks. monitor each task status (idle , in-progress, finished). Keep track of locations of R map intermediate output on each map worker machine. Keep record of worker IDs and other info (CPU , memory , disk size) Continuously push information about intermediate map output to reduce 11
  • 12. Fault tolerance (1) Objective : handle machine failures gracefully , i.e. programmer don’t need to handle it or be aware of details. Two type of failures : Master failure Worker failure Two main activities Failure detection Recover lost (computed) data with least 12
  • 13. Fault tolerance (2) Worker failure Detection : Timeout for master ping , mark worker as failed. Remove worker from list of available worker. For all map tasks assigned to that worker : mark these tasks as idle these tasks will be eligible for re-scheduling on other workers map tasks are re-executed as output is stored in local file system in failed machine all reduce workers are notified with re-execution so they can get intermediate data they haven’t yet. No need to re-execute reduce tasks as their output is stored in distributed file system and 13
  • 14. Semantics in the Presence of Failures Deterministic and Nondeterministic Functions Deterministic functions always return the same result any time they are called with a specific set of input values. Nondeterministic functions may return different results each time they are called with a specific set of input values. If map and reduce function are “deterministic” , distributed implementation of mapreduce framework must produce the same output of a non-faulting sequential execution of the program. Several copies of the same map/reduce task might run on different nodes for sake of reliability and fault tolerance 14
  • 15. Semantics in the Presence of Failures (2) Mapper always write their output to tmp files (atomic commits). When a map task finishes : Renames the tmp file to final output. Sends message to master informing it with the filename. If another copy of the same map finished before , master ignores it , else store the filename. Reducers do the same , and if multiple copies of the same reduce task finished , MapReduce framework rely on the atomic rename of the file system. If map task are non-deterministic , and multiple copies of the map task run on different machines , weak semantic condition can happen : two reducers read15
  • 16. Semantics in the Presence of Failures (3) 16 ● Workers #1 and #2 run the same copy of map task M1 ● Reducer task R1 reads its input for M1 from worker #1 ● Reducer task R2 reads its input for M1 from worker #2 , as worker#1 has failed by the time R2 has started. ● If M1’s function is deterministic , we have complete consistency. ● If M1’s function is not deterministic , R1 and R2 may receive different results from M1.
  • 17. Task granularity Load balancing: fine grained is better , faster machines tend to take more tasks than slower machined over time , leads to less overall job execution time. Failure recovery : less time to re-execute failed tasks. Very fine-grained tasks may not be desirable : overhead to manage and too much data shuffling (consume valuable bandwidth). Optimal Granularity : Split size = HDFS block size (128 MB by default) HDFS block is guaranteed to be in the same node. We need to maximize work done by one mapper locally if split size < block : not fully utilizing possibility of local data processing if split size > block : may need data transfer to make map function complete 17
  • 18. Data locality Network bandwidth is a valuable resource. We assume a rack server hardware. MapReduce scheduler works as follows : 1. Try to assign map task to the node where the corresponding split block(s) reside , if it is free assign , else go to step 2 2. try to find a free nod in the same rack to assign the map task , if can’t find a free off-rack node to assign. ● More complex implementation uses network cost model. 18
  • 19. Backup tasks Stragglers : a set machines that run a set of assigned tasks (MapReduce) very slow. Slow running can be due to many reasons; bad disk , slow network , low speed CPU. Other tasks scheduled on stragglers cause more load and longer execution time. Solution Mechanism: When MapReduce job is close to finish for all the “in-progress” tasks , issue backup tasks. 19
  • 20. Refinements Partitioning function : Partition the output of mapping tasks into R partitions (each for a reduce task). Good function should try as possible to make the partitions equal. Default : hash(key) mod R Works fine usually problem arises when specific keys have many records than the others. Need design custom hash functions or change the key. Combiner function Reduce size of map intermediate output. 20
  • 21. Refinements (2) Skipping bad record Bug in third party library that can’t be fixed , causes code crash at specific records Terminating a job running for hours / days more expensive than sacrificing small percentage of accuracy (If context allows , for ex. statistical analysis of large data). How MapReduce handle that ? 1. Each worker process installs a signal handler that catches segmentation violations ,bus errors and other possible fatal errors. 2. Before a map / reduce task runs , the MapReduce library stores the key value in global variable. 3. When a map / reduce task function code generates a signal , the worker sends UDP 21
  • 22. References 1. MapReduce: simplified data processing on large clusters http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf 2. Hadoop Definitive guide , Ch 3 3. https://dgleich.wordpress.com/2012/10/05/why-mapreduce-is-successful-its-the-io/ 4. http://www.infoworld.com/article/2616904/business-intelligence/mapreduce.html 5. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf 6. Hadoop Definitive Guide - Chapter 1&2 7. http://research.google.com/archive/mapreduce-osdi04-slides/ 8. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf 22

Editor's Notes

  1. image source http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
  2. source : google paper
  3. In google paper , sorting is mentioned as reducer responsibility , however in Hadoop Definitive Guide , sorting is mentioned as mapper responsibility and reducer is responsible for merging sorting intermediate output.