The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.
An Introduction to MapReduce
1. An Introduction to MapReduce
Presented by Frane Bandov
at the Operating Complex IT-Systems seminar
Berlin, 1/26/2010
2. Outline
• Introduction
• Google MapReduce
– Idea
– Overview
– Fault Tolerance
– GFS: Google File System
– Job Example
• Alternative Implementations
• Reception and Criticism
• Trends and Future Development
• Conclusion
4. Introduction – Problem
Sometimes we have to deal with huge amounts of data
[Bar chart, 0–250 TBytes: stored data volumes for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
5. Introduction – Problem
The data needs to be processed, but how?
Can't process all of this data on one machine
Distribute the processing to many machines
6. Introduction – Approach
Distributed computing is the solution
“Let's write our own distributed computing software as a solution to our problem”
Checklist:
• design protocols
• design data structures
• write the code
• assure failure tolerance
But: development takes a long time, it is expensive (cost-benefit ratio?), and we end up building complex software for simple computations.
8. Google MapReduce – Idea
A framework for distributed computing
Don't care about protocols, failure tolerance, etc.
Just write your simple computation
9. Google MapReduce – Idea
MapReduce Paradigm
Map: apply a function to all elements of a list
square x = x * x;
map square [1, 2, 3, 4, 5];
→ [1, 4, 9, 16, 25]
Reduce: combine all elements of a list
reduce (+) [1, 2, 3, 4, 5];
→ 15
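To make the paradigm concrete beyond pseudo-code, here is a minimal Python sketch of the same two operations, using the built-in map and functools.reduce; the names square and add simply mirror the slide and are illustrative:

from functools import reduce
from operator import add

def square(x):
    return x * x

numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every element of a list
squares = list(map(square, numbers))    # [1, 4, 9, 16, 25]

# Reduce: combine all elements of a list with a binary operation
total = reduce(add, numbers)            # 15

print(squares, total)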
10. Google MapReduce – Idea
Basic functioning
Input → Map → Reduce → Output
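What the single arrow between Map and Reduce hides is a grouping step: the framework collects all intermediate values that share a key and hands each group to one reduce call. Below is a toy single-machine simulation in Python, with word count as the illustrative job; simulate_mapreduce and the two job functions are hypothetical names invented for this sketch, not part of any framework:

from collections import defaultdict
from itertools import chain

def simulate_mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: each input record yields zero or more (key, value) pairs
    pairs = chain.from_iterable(map_fn(record) for record in inputs)
    # Shuffle phase: the framework groups all values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reduce call per key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_words(line):                  # map function of the word-count job
    for word in line.split():
        yield word, 1

def sum_counts(word, counts):           # reduce function of the word-count job
    return sum(counts)

print(simulate_mapreduce(count_words, sum_counts, ["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}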
12. MapReduce – Fault Tolerance
• Workers are periodically pinged by the master
• No answer over a certain time → the worker is considered failed
Mapper fails:
– Reset its map task to idle so it is re-executed
– Even if the task was already completed, its intermediate files are inaccessible, so it must be redone
– Notify reducers where to get the new intermediate files
Reducer fails:
– Reset its task to idle
13. MapReduce – Fault Tolerance
Master fails:
– The master periodically writes checkpoints
– In case of failure, the MapReduce operation is aborted
– The operation can be restarted from the last checkpoint
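The failure handling on the last two slides can be sketched roughly as follows; the class, the timeout value, and the task-state bookkeeping are assumptions made for illustration, not Google's actual code:

import time

PING_TIMEOUT = 10.0   # assumed: seconds of silence before a worker counts as failed

class Master:
    def __init__(self, workers):
        self.last_seen = {w: time.time() for w in workers}   # last ping reply per worker
        self.tasks = {}   # task_id -> {"kind": "map" or "reduce", "state": ..., "worker": ...}

    def on_ping_reply(self, worker):
        self.last_seen[worker] = time.time()

    def check_workers(self):
        now = time.time()
        for worker, seen in self.last_seen.items():
            if now - seen > PING_TIMEOUT:           # no answer over a certain time
                self.handle_worker_failure(worker)

    def handle_worker_failure(self, failed):
        for task in self.tasks.values():
            if task["worker"] != failed:
                continue
            if task["kind"] == "map":
                # Reset even completed map tasks: their intermediate files sit on the
                # failed machine's local disk and are no longer reachable; reducers are
                # then told where the re-executed output will live.
                task["state"], task["worker"] = "idle", None
            elif task["state"] != "completed":
                # Completed reduce output already sits in the shared file system,
                # so only in-progress reduce tasks are reset.
                task["state"], task["worker"] = "idle", None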
14. Google MapReduce – GFS
Google File System
• In-house distributed file system at Google
• Stores all input and output files
• Stores files…
– divided into 64 MB blocks
– on at least 3 different machines
• Machines running GFS also run MapReduce
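A quick back-of-the-envelope illustration of what those two numbers mean for storage; the 1 TB input size is an arbitrary example:

BLOCK_SIZE = 64 * 1024**2                  # 64 MB blocks, as on the slide
REPLICAS = 3                               # each block on at least 3 machines

file_size = 1 * 1024**4                    # example: a 1 TB input file
blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
print(blocks, blocks * REPLICAS)           # 16384 blocks, 49152 stored block copies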
20. Alternative Implementations
Apache Hadoop
• Open-source implementation in Java
• Jobs can be written in C++, Java, Python, etc.
• Used by Yahoo!, Facebook, Amazon and others
• Most commonly used implementation
• HDFS as an open-source implementation of GFS
• Can also use Amazon S3, HTTP(S) or FTP
• Extensions: Hive, Pig, HBase
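As an illustration of writing jobs in a language other than Java, here is the earlier word-count example in the form Hadoop Streaming expects, i.e. two executables that communicate with the framework via stdin/stdout and tab-separated key-value lines. The file names are the user's choice, and the launch command (roughly: hadoop jar .../hadoop-streaming*.jar -input ... -output ... -mapper mapper.py -reducer reducer.py) depends on the Hadoop version:

# mapper.py -- reads raw text lines, emits one "word<TAB>1" pair per word
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- receives the pairs sorted by key, sums the counts per word
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")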
21. Alternative Implementations
Mars
MapReduce implementation for NVIDIA GPUs using the CUDA framework
MapReduce-Cell
Implementation for the Cell multi-core processor
Qizmt
MySpace’s implementation of MapReduce in C#
22. Alternative Implementations
There are many other open- and closed-source implementations of MapReduce!
24. Reception and Criticism
• Yahoo!: Hadoop on a 10,000-server cluster
• Facebook analyses its daily logs (25 TB) on a 1,000-server cluster
• Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
• IBM and Google: support for university courses in distributed programming
• UC Berkeley announced it will teach freshmen MapReduce programming
26. Reception and Criticism
• Criticism mainly by the RDBMS experts DeWitt and Stonebraker
• MapReduce
– is a step backwards in database access
– is a poor implementation
– is not novel
– is missing features that are routinely provided by modern DBMSs
– is incompatible with DBMS tools
27. Reception and Criticism
Response to criticism
MapReduce is not an RDBMS
It is well suited to processing and structuring huge amounts of unstructured data
MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
29. Trends and Future Development
Trend of utilizing MapReduce/Hadoop as a parallel database
• Hive: query language for Hadoop
• HBase: column-oriented distributed database (modeled after Google's BigTable)
• Map-Reduce-Merge: adding a merge step to the paradigm allows implementing features of relational algebra
30. Trends and Future Development
Trend to use the MapReduce paradigm to better utilize multi-core CPUs
• Qt Concurrent
– Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
• Mars
• MapReduce-Cell
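For the single-machine, multi-core variant of the idea, Python's standard library is enough for a sketch; this uses multiprocessing rather than any of the libraries named on the slide, purely for illustration:

from functools import reduce
from multiprocessing import Pool
from operator import add

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool() as pool:                              # one worker process per CPU core by default
        squares = pool.map(square, range(1_000_000))  # map phase runs in parallel
    total = reduce(add, squares)                      # reduce phase, sequential here
    print(total)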
32. Conclusion
MapReduce
provides an easy solution for processing large amounts of data
brings a paradigm shift in programming
changed the world: it made data processing more efficient and cheaper, and it is the foundation of many other approaches and solutions