HDFS:
Hadoop Distributed
File System
CIS 612
Sunnie Chung
Introduction
• What is Big Data?
– Bulk amount of data
– Unstructured
• Many applications need to handle huge amounts of data (on the order of 500+ TB per day)
• If a regular machine needs to transmit 1 TB of data through 4 channels, it takes about 43 minutes.
• What about 500 TB?
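To make the arithmetic concrete: assuming transfer time scales linearly with data size, the 43 minutes quoted for 1 TB implies roughly two weeks for 500 TB. A minimal sketch (the class and method names are illustrative, not from any Hadoop API):

```java
// Back-of-the-envelope: if 1 TB over 4 channels takes 43 minutes,
// how long does 500 TB take, assuming linear scaling?
public class TransferTime {
    // minutes to move `terabytes` TB when each TB takes `minutesPerTb` minutes
    static long minutesFor(long terabytes, long minutesPerTb) {
        return terabytes * minutesPerTb;
    }

    public static void main(String[] args) {
        long minutes = minutesFor(500, 43); // 21,500 minutes
        System.out.println(minutes + " min = " + (minutes / 60) + " h = "
                + (minutes / 60 / 24) + "+ days"); // ~358 h, ~14 days
    }
}
```

This is the motivation for distributing both storage and computation: moving that much data through one machine is the bottleneck.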
Hadoop
What is Hadoop?
• Framework for large-scale data processing
• Inspired by Google’s Architecture:
– Google File System (GFS) and MapReduce
• Open-source Apache project
– Nutch search engine project
– Apache Incubator
• Written in Java and shell scripts
Hadoop Distributed File System (HDFS)
• Storage unit of Hadoop
• Relies on principles of Distributed File System.
• HDFS have a Master-Slave architecture
• Main Components:
– Name Node : Master
– Data Node : Slave
• 3+ replicas for each block
• Default Block Size : 128MB
Hadoop Distributed File System (HDFS)
• Hadoop Distributed File System (HDFS)
– Runs entirely in userspace
– The file system is dynamically distributed across multiple
computers
– Allows for nodes to be added or removed easily
– Highly scalable in a horizontal fashion
• Hadoop Development Platform
– Uses a MapReduce model for working with data
– Users can program in Java, C++, and other languages
Why should I use Hadoop?
• Fault-tolerant hardware is expensive
• Hadoop designed to run on commodity hardware
• Automatically handles data replication and deals with
node failure
• Does all the hard work so you can focus on processing
data
HDFS: Key Features
• Highly Fault Tolerant:
Automatic Failure Recovery System
• High aggregate throughput for streaming large files
• Supports replication and locality features
• Designed for systems with very large files (sizes in TB) that are few in number.
• Provides streaming access to file system data; especially good for write-once, read-many files (for example, log files).
Hadoop Distributed File System (HDFS)
• Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices
– Uses off-the-shelf hardware
• Rapid Elasticity
– Need more capacity, just assign some more nodes
– Scalable
– Can add or remove nodes with little effort or
reconfiguration
• Resistant to failure
– Individual node failure does not disrupt the system
Who uses Hadoop?
What features does Hadoop offer?
• API and implementation for working with
MapReduce
• Infrastructure
– Job configuration and efficient scheduling
– Web-based monitoring of cluster stats
– Handles failures in computation and data nodes
– Distributed File System optimized for huge amounts of
data
When should you choose Hadoop?
• Need to process a lot of unstructured data
• Processing needs are easily run in parallel
• Batch jobs are acceptable
• Access to lots of cheap commodity machines
When should you avoid Hadoop?
• Intense calculations with little or no data
• Processing cannot easily run in parallel
• Data is not self-contained
• Need interactive results
Hadoop Examples
• Hadoop would be a good choice for:
– Indexing log files
– Sorting vast amounts of data
– Image analysis
– Search engine optimization
– Analytics
• Hadoop would be a poor choice for:
– Calculating Pi to 1,000,000 digits
– Calculating Fibonacci sequences
– A general RDBMS replacement
Hadoop Distributed File System (HDFS)
• How does Hadoop work?
– Runs on top of multiple commodity systems
– A Hadoop cluster is composed of nodes
• One Master Node
• Many Slave Nodes
– Multiple nodes are used for storing data & processing
data
– System abstracts the underlying hardware to
users/software
How HDFS works: Split Data
• Data copied into HDFS is split into blocks
• Typical HDFS block size is 128 MB
– (vs. 4 KB on typical UNIX file systems)
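The split arithmetic is just ceiling division: a file needs ceil(fileSize / blockSize) blocks. A small sketch (class and method names are illustrative, not part of the HDFS API; note that HDFS does not pad the last block to full size):

```java
public class HdfsBlocks {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default

    // number of HDFS blocks needed for a file of `fileSize` bytes
    static long numBlocks(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(numBlocks(oneGb, BLOCK_SIZE));     // 8
        System.out.println(numBlocks(oneGb + 1, BLOCK_SIZE)); // 9: one extra byte adds a (tiny) block
    }
}
```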
How HDFS works: Replication
• Each block is replicated to multiple machines
• This allows for node failure without data loss
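The replication idea can be sketched with a toy round-robin placement policy. This is a simulation of the concept only: real HDFS uses a rack-aware placement policy, and all names here are illustrative.

```java
import java.util.*;

public class ToyReplication {
    // place `replicas` copies of each block on distinct nodes, round-robin
    static Map<Integer, List<Integer>> place(int numBlocks, int numNodes, int replicas) {
        Map<Integer, List<Integer>> placement = new LinkedHashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                nodes.add((b + r) % numNodes); // next nodes in sequence
            }
            placement.put(b, nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // 3 blocks, 3 nodes, 2 replicas each: any single node failure loses no block
        System.out.println(place(3, 3, 2)); // {0=[0, 1], 1=[1, 2], 2=[2, 0]}
    }
}
```

With every block on at least two nodes, losing any one node still leaves a live copy of each block, which is exactly the property the slide describes.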
[Diagram: three Data Nodes, each holding two of blocks #1-#3; every block is stored on two nodes]
HDFS Architecture
Hadoop Distributed File System (HDFS)
• HDFS consists of data blocks
– Files are divided into data blocks
– Default block size is 64 MB (in Hadoop 1.x; 128 MB in later versions)
– Default replication of blocks is 3
– Blocks are spread out over Data Nodes
• HDFS is a multi-node system
• Name Node (Master)
– Single point of failure
• Data Node (Slave)
– Failure tolerant (data replication)
Hadoop Architecture Overview
[Diagram: the Client submits to the Job Tracker, which coordinates Task Trackers; a Name Node manages many Data Nodes]
Hadoop Components: Job Tracker
[Diagram: the Job Tracker highlighted between the Client and the Task Trackers, Name Node, and Data Nodes]
• Only one Job Tracker per cluster
• Receives job requests submitted by the client
• Schedules and monitors jobs on task trackers
Hadoop Components: Name Node
[Diagram: the Name Node highlighted within the cluster of Client, Job Tracker, Task Trackers, and Data Nodes]
• One active Name Node per cluster
• Manages the file system namespace and metadata
• Single point of failure: a good place to spend money on hardware
Name Node
• Master of HDFS
• Maintains and manages the blocks stored on the Data Nodes
• Runs on a high-reliability machine (can even use RAID)
• Expensive hardware
• Stores NO data; just holds metadata!
• Secondary Name Node:
– Periodically reads the file system state from the Name Node's RAM and stores it to hard disk.
• Active & Passive Name Nodes from Gen2 Hadoop
Hadoop Components: Task Tracker
[Diagram: the Task Trackers highlighted within the cluster of Client, Job Tracker, Name Node, and Data Nodes]
• There are typically a lot of task trackers
• Responsible for executing operations
• Reads blocks of data from data nodes
Hadoop Components: Data Node
[Diagram: the Data Nodes highlighted within the cluster of Client, Job Tracker, Task Trackers, and Name Node]
• There are typically a lot of data nodes
• Data nodes manage data blocks and serve them to clients
• Data is replicated, so failure is not a problem
Data Nodes
• Slaves in HDFS
• Provide data storage
• Deployed on independent machines
• Responsible for serving read/write requests from the Client
• The data processing is done on the Data Nodes
HDFS Architecture
Hadoop Modes of Operation
Hadoop supports three modes of operation:
• Standalone
• Pseudo-Distributed
• Fully-Distributed
HDFS Operation
HDFS Operation
• Client makes a Write request to the Name Node
• Name Node responds with information about the available Data Nodes and where the data is to be written.
• Client writes the data to the addressed Data Node.
• Replicas of all blocks are created automatically by the data pipeline.
• If a write fails, the Data Node notifies the Client, which gets a new location to write to.
• If the write completes successfully, an acknowledgement is given to the Client.
• Writes are non-posted in Hadoop (the Client waits for the acknowledgement).
HDFS: File Write
HDFS: File Read
Hadoop: Hadoop Stack
• Hadoop Development Platform
– User-written code runs on the system
– The system appears to the user as a single entity
– The user does not need to worry about the distributed system
– Many systems can run on top of Hadoop
• Allows further abstraction from the system
Hadoop: Hive & HBase
• Hive and HBase are layers on top of Hadoop
• HBase and Hive are applications
• They provide an interface to data on HDFS
• Other programs or applications may use Hive or HBase as an intermediate layer
[Diagram: HBase and ZooKeeper layered on top of Hadoop]
Hadoop: Hive
• Hive
– Data warehousing application
– SQL like commands (HiveQL)
– Not a traditional relational database
– Scales horizontally with ease
– Supports massive amounts of data*
* As of 2010, Facebook had more than 15 PB of information stored in Hive and imported 60 TB each day
Hadoop: HBase
• HBase
– No SQL-like language
• Uses a custom Java API for working with data
– Modeled after Google's BigTable
– Random read/write operations allowed
– Multiple concurrent read/write operations allowed
35
Hadoop
Hadoop MapReduce
Hadoop has its own implementation of MapReduce (Hadoop 1.0.4)
API: http://hadoop.apache.org/docs/r1.0.4/api/
Tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Custom serialization data types (Writable/Comparable):
Text vs. String
LongWritable vs. long
IntWritable vs. int
DoubleWritable vs. double
Structure of a Hadoop Mapper (WordCount)
Structure of a Hadoop Reducer (WordCount)
Hadoop MapReduce
• Working with Hadoop:
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
• A quick overview of Hadoop commands:
bin/start-all.sh
bin/stop-all.sh
bin/hadoop fs -put localSourcePath hdfsDestinationPath
bin/hadoop fs -get hdfsSourcePath localDestinationPath
bin/hadoop fs -rmr folderToDelete
bin/hadoop job -kill job_id
• Running a Hadoop MR program:
bin/hadoop jar jarFileName.jar programToRun parm1 parm2 …
SS Chung CIS 612 Lecture Notes 39
Useful Application Sites
[1] Hadoop Eclipse Plugin. http://wiki.apache.org/hadoop/EclipsePlugIn
[2] 10gen. MongoDB. http://www.mongodb.org/
[3] Apache Cassandra. http://cassandra.apache.org/
[4] Apache Hadoop. http://hadoop.apache.org/
[5] Apache HBase. http://hbase.apache.org/
[6] Apache Hive. http://hive.apache.org/
[7] Apache Pig. http://pig.apache.org/
[8] Apache ZooKeeper. http://zookeeper.apache.org/
How MapReduce Works in Hadoop
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a
MapReduce job
[Diagram: lifecycle of a MapReduce job over time: input splits feed map waves 1-2, followed by reduce waves 1-2]
Hadoop MR Job Interface: InputFormat
• The Hadoop MapReduce framework spawns one map task for each InputSplit.
• InputSplit: the input file is split into InputSplits (logical splits, usually one per block, not physically split chunks) by InputFormat::getInputSplits().
• The number of maps is usually driven by the total number of blocks (InputSplits) of the input files: with a 128 MB block size, a 10 TB file is configured with about 82,000 maps.
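The ~82,000 figure follows directly from one-map-per-block arithmetic (class and method names are illustrative):

```java
public class MapCount {
    // one map task per block: ceiling division of input size by block size
    static long numMaps(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024;
        long block = 128L * 1024 * 1024;
        System.out.println(numMaps(tenTb, block)); // 81920, i.e. ~82,000 maps
    }
}
```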
Hadoop MR Job Interface: map()
• The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair (line_num, line_string) in the InputSplit for that task.
• Output pairs are collected with calls to OutputCollector.collect(WritableComparable, Writable).
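For WordCount, each map() call receives one line and emits a (word, 1) pair per token. A stand-alone sketch of that logic without the Hadoop types (the real signature uses Text/IntWritable and collects via OutputCollector; this simulation returns a list instead):

```java
import java.util.*;

public class WordCountMap {
    // simulate map(): one input line in, a list of (word, 1) pairs out
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1)); // stands in for OutputCollector.collect(word, one)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("the quick brown fox the"));
        // [the=1, quick=1, brown=1, fox=1, the=1]
    }
}
```

Note that the mapper emits one pair per occurrence; the counting happens later, in the combiner or reducer.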
Hadoop MR Job Interface: combiner()
• An optional combiner, set via JobConf.setCombinerClass(Class), performs local aggregation of the intermediate outputs of the mapper.
Hadoop MR Job Interface: Partitioner
• The Partitioner controls the partitioning of the keys of the intermediate map outputs.
• The key (or a subset of the key) is used to derive the partition, typically by a hash function.
• The total number of partitions is the same as the number of reducers.
• HashPartitioner is the default Partitioner for the job's reduce tasks.
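The hash-partitioning rule fits in a few lines. This mirrors the standard formula (hash of the key, masked non-negative, modulo the number of reduces); the class name here is illustrative, not Hadoop's:

```java
public class ToyHashPartitioner {
    // partition = (hash & Integer.MAX_VALUE) % numReduceTasks
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // the same key always lands on the same reducer,
        // so all values for one key meet in one reduce task
        System.out.println(getPartition("hadoop", 4) == getPartition("hadoop", 4)); // true
    }
}
```

The determinism is the point: every mapper, on every node, sends a given key to the same reduce partition.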
Hadoop MR Job Interface: reducer()
• The Reducer has 3 primary phases:
1. Shuffle
2. Sort
3. Reduce
Hadoop MR Job Interface: reducer()
• Shuffle: the input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.
• Sort: the framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage.
• The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged.
Hadoop MR Job Interface: reducer()
• Reduce: the framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) for each (key, list of values) pair in the grouped inputs.
• The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
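End to end, the sort/group step and the per-key reduce call can be simulated with a sorted map. This is a sketch of the semantics only, not the Hadoop runtime; the class name is illustrative:

```java
import java.util.*;

public class WordCountReduce {
    // simulate sort + group + reduce: sum the values for each key
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> mapOutputs) {
        SortedMap<String, Integer> counts = new TreeMap<>(); // keys end up grouped and sorted
        for (Map.Entry<String, Integer> pair : mapOutputs) {
            // stands in for reduce(key, values): sum the Iterator of values
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("the", 1), Map.entry("cat", 1), Map.entry("the", 1));
        System.out.println(reduce(pairs)); // {cat=1, the=2}
    }
}
```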
MR Job Parameters
• Map Parameters
io.sort.mb
• Shuffle/Reduce Parameters
io.sort.factor
mapred.inmem.merge.threshold
mapred.job.shuffle.merge.percent
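In Hadoop 1.x these parameters are set in the job configuration or in mapred-site.xml. The values below are illustrative placeholders, not recommendations; good settings depend on the job and cluster, as the tuning slides later discuss.

```xml
<!-- mapred-site.xml (Hadoop 1.x), illustrative values only -->
<configuration>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>   <!-- map-side sort buffer size, in MB -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>100</value>   <!-- number of sorted streams merged at once -->
  </property>
  <property>
    <name>mapred.inmem.merge.threshold</name>
    <value>1000</value>  <!-- map outputs accumulated before an in-memory merge -->
  </property>
  <property>
    <name>mapred.job.shuffle.merge.percent</name>
    <value>0.66</value>  <!-- shuffle buffer usage that triggers a merge -->
  </property>
</configuration>
```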
Components in a Hadoop MR Workflow
Next few slides are from: http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
Job Submission
Initialization
Scheduling
Execution
Map Task
Sort Buffer
Reduce Tasks
Quick Overview of Other Topics
• Dealing with failures
• Hadoop Distributed FileSystem (HDFS)
• Optimizing a MapReduce job
Dealing with Failures and Slow Tasks
• What to do when a task fails?
– Try again (retries possible because of idempotence)
– Try again somewhere else
– Report failure
• What about slow tasks: stragglers
– Run another version of the same task in parallel. Take
results from the one that finishes first
– What are the pros and cons of this approach?
Fault tolerance is of high priority in the MapReduce framework.
[Diagram: lifecycle of a MapReduce job over time: input splits feed map waves 1-2, followed by reduce waves 1-2]
How are the number of splits, number of map and reduce
tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
• 190+ configuration parameters in Hadoop
• Set manually, or defaults are used
Image source: http://www.jaso.co.kr/265
Hadoop Job Configuration Parameters
Tuning Hadoop Job Conf. Parameters
• Do their settings impact performance?
• What are ways to set these parameters?
– Defaults -- are they good enough?
– Best practices -- the best setting can depend on data, job, and
cluster properties
– Automatic setting
Experimental Setting
• Hadoop cluster on 1 master + 16 workers
• Each node:
– 2GHz AMD processor, 1.8GB RAM, 30GB local disk
– Relatively ill-provisioned!
– Xen VM running Debian Linux
– Max 4 concurrent maps & 2 reduces
• Maximum map wave size = 16x4 = 64
• Maximum reduce wave size = 16x2 = 32
• Not all users can run large Hadoop clusters:
– Can Hadoop be made competitive in the 10-25 node, multi GB
to TB data size range?
Parameters Varied in Experiments
• Varying number of reduce tasks, number of concurrent sorted
streams for merging, and fraction of map-side sort buffer
devoted to metadata storage
Hadoop 50GB TeraSort
Hadoop 50GB TeraSort
• Varying number of reduce tasks for different values
of the fraction of map-side sort buffer devoted to
metadata storage (with io.sort.factor = 500)
Hadoop 50GB TeraSort
• Varying number of reduce tasks for different values of
io.sort.factor (io.sort.record.percent = 0.05, default)
• 1D projection for
io.sort.factor=500
Hadoop 75GB TeraSort
Automatic Optimization? (Not yet in Hadoop)
[Diagram: two schedules compared; with the number of reduces increased to 9, the job runs three reduce waves instead of two, overlapping shuffle with the map waves]
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
 
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di MalaysiaBimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
 
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdfThe Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
 
Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
 
Annex K RBF's The World Game pdf document
Annex K RBF's The World Game pdf documentAnnex K RBF's The World Game pdf document
Annex K RBF's The World Game pdf document
 
future-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-managementfuture-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-management
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
 

hadoop distributed file systems complete information

• Can be built out of commodity hardware; HDFS does not need
highly expensive storage devices
– Uses off-the-shelf hardware
• Rapid Elasticity
– Need more capacity? Just assign more nodes
– Scalable: nodes can be added or removed with little effort
or reconfiguration
• Resistant to Failure
– Individual node failure does not disrupt the system
8
What features does Hadoop offer?
• API and implementation for working with MapReduce
• Infrastructure
– Job configuration and efficient scheduling
– Web-based monitoring of cluster stats
– Handles failures in computation and data nodes
– Distributed file system optimized for huge amounts of data
10

When should you choose Hadoop?
• You need to process a lot of unstructured data
• Processing needs are easily run in parallel
• Batch jobs are acceptable
• You have access to lots of cheap commodity machines
11

When should you avoid Hadoop?
• Intense calculations with little or no data
• Processing cannot easily run in parallel
• Data is not self-contained
• You need interactive results
12

Hadoop Examples
• Hadoop would be a good choice for:
– Indexing log files
– Sorting vast amounts of data
– Image analysis
– Search engine optimization
– Analytics
• Hadoop would be a poor choice for:
– Calculating pi to 1,000,000 digits
– Calculating Fibonacci sequences
– A general RDBMS replacement
13

Hadoop Distributed File System (HDFS)
• How does Hadoop work?
– Runs on top of multiple commodity systems
– A Hadoop cluster is composed of nodes
• One Master Node
• Many Slave Nodes
– Multiple nodes are used for storing and processing data
– The system abstracts the underlying hardware from users/software
14

How HDFS works: Split Data
• Data copied into HDFS is split into blocks
• Typical HDFS block size is 128 MB
– (vs. 4 KB on UNIX file systems)
15

How HDFS works: Replication
• Each block is replicated to multiple machines
• This allows for node failure without data loss
[Diagram: Data Nodes 1-3 each store two of Blocks #1-#3, so every
block exists on two nodes]
16
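The replication idea above can be sketched as a toy block-placement routine. This is purely illustrative (round-robin placement over hypothetical node names); real HDFS placement is rack-aware and more sophisticated.

```python
# Toy sketch of HDFS-style block replication (illustrative only; real
# HDFS placement is rack-aware and more sophisticated).

def place_blocks(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(num_blocks=3, data_nodes=nodes, replication=3)
for block, replicas in placement.items():
    print(f"Block #{block + 1}: {replicas}")

# Losing any single node still leaves at least 2 copies of every block.
for dead in nodes:
    assert all(len([n for n in reps if n != dead]) >= 2
               for reps in placement.values())
```

The final loop is the point of the slide: with every block on multiple machines, any single node failure leaves the data intact.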
Hadoop Distributed File System (HDFS)
• HDFS consists of data blocks
– Files are divided into data blocks
– Default block size is 64 MB in Hadoop 1.x (128 MB in later versions)
– Default replication of blocks is 3
– Blocks are spread out over Data Nodes
• HDFS is a multi-node system
– Name Node (Master): single point of failure
– Data Node (Slave): failure tolerant (data replication)
18

Hadoop Architecture Overview
[Diagram: a Client connects to the Job Tracker and the Name Node;
the Job Tracker manages Task Trackers, and the Name Node manages
Data Nodes]
19

Hadoop Components: Job Tracker
• Only one Job Tracker per cluster
• Receives job requests submitted by the client
• Schedules and monitors jobs on Task Trackers
20

Hadoop Components: Name Node
• One active Name Node per cluster
• Manages the file system namespace and metadata
• Single point of failure: a good place to spend money on hardware
21

Name Node
• Master of HDFS
• Maintains and manages data on the Data Nodes
• High-reliability machine (can even use RAID)
• Expensive hardware
• Stores NO data; just holds metadata!
• Secondary Name Node:
– Reads the file system state from the Name Node's RAM and
stores it to disk periodically
• Active/Passive Name Nodes from Gen2 Hadoop
22

Hadoop Components: Task Tracker
• There are typically many Task Trackers
• Responsible for executing operations
• Reads blocks of data from Data Nodes
23

Hadoop Components: Data Node
• There are typically many Data Nodes
• Data Nodes manage data blocks and serve them to clients
• Data is replicated, so failure is not a problem
24

Data Nodes
• Slaves in HDFS
• Provide data storage
• Deployed on independent machines
• Responsible for serving read/write requests from clients
• Data processing is done on the Data Nodes
25

Hadoop Modes of Operation
Hadoop supports three modes of operation:
• Standalone
• Pseudo-Distributed
• Fully-Distributed
27
HDFS Operation
• The client makes a Write request to the Name Node
• The Name Node responds with information about the available
Data Nodes and where the data is to be written
• The client writes the data to the addressed Data Node
• Replicas of all blocks are created automatically by the data pipeline
• If a write fails, the Data Node notifies the client, which obtains a
new location to write to
• If the write completes successfully, an acknowledgement is sent
to the client
• Hadoop uses non-posted (acknowledged) writes
29
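The write flow above can be simulated in a few lines. All names here are hypothetical; this is a conceptual sketch of the request/pipeline/acknowledgement sequence, not the actual HDFS client protocol.

```python
# Minimal simulation of the acknowledged ("non-posted") write flow.
# Illustrative only; not the real HDFS client protocol.

def name_node_pick_targets(data_nodes, replication=3):
    """The Name Node answers a write request with target Data Nodes."""
    live = [n for n, up in data_nodes.items() if up]
    if len(live) < replication:
        raise RuntimeError("not enough live Data Nodes")
    return live[:replication]

def write_block(block, targets, data_nodes, storage):
    """Client writes to the first node; the pipeline forwards replicas.
    Returns True (an ack) only if every replica was written."""
    for node in targets:                  # pipeline: dn -> dn -> dn
        if not data_nodes[node]:
            return False                  # failure reported back to client
        storage.setdefault(node, []).append(block)
    return True                           # acknowledgement to the client

data_nodes = {"dn1": True, "dn2": True, "dn3": True, "dn4": True}
storage = {}
targets = name_node_pick_targets(data_nodes)
ack = write_block("block-1", targets, data_nodes, storage)
print("ack:", ack, "targets:", targets)

# A failed target means no ack; the client would then ask the
# Name Node for a new set of targets and retry.
data_nodes["dn2"] = False
ack2 = write_block("block-2", ["dn1", "dn2", "dn3"], data_nodes, storage)
print("ack:", ack2)
```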
Hadoop: Hadoop Stack
• Hadoop Development Platform
– User-written code runs on the system
– The system appears to the user as a single entity
– The user does not need to worry about the distributed system
– Many systems can run on top of Hadoop
• Allows further abstraction from the system
32

Hadoop: Hive and HBase
• Hive and HBase are layers on top of Hadoop
• HBase and Hive are applications
• They provide an interface to data on the HDFS
• Other programs or applications may use Hive or HBase as an
intermediate layer
33

Hadoop: Hive
• Hive
– Data warehousing application
– SQL-like commands (HiveQL)
– Not a traditional relational database
– Scales horizontally with ease
– Supports massive amounts of data*
* Facebook had more than 15 PB of information stored in Hive and
imported 60 TB each day (as of 2010)
34

Hadoop: HBase
• HBase
– No SQL-like language
• Uses a custom Java API for working with data
– Modeled after Google's BigTable
– Random read/write operations allowed
– Multiple concurrent read/write operations allowed
35

Hadoop MapReduce
• Hadoop has its own implementation of MapReduce
• Hadoop 1.0.4 API: http://hadoop.apache.org/docs/r1.0.4/api/
• Tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
• Custom serialization data types (Writable/Comparable):
– Text vs. String
– LongWritable vs. long
– IntWritable vs. int
– DoubleWritable vs. double
36

Structure of a Hadoop Mapper (WordCount)
[Slide shows the WordCount mapper code]
37

Structure of a Hadoop Reducer (WordCount)
[Slide shows the WordCount reducer code]
38

Hadoop MapReduce: Working with Hadoop
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
A quick overview of Hadoop commands:
bin/start-all.sh
bin/stop-all.sh
bin/hadoop fs -put localSourcePath hdfsDestinationPath
bin/hadoop fs -get hdfsSourcePath localDestinationPath
bin/hadoop fs -rmr folderToDelete
bin/hadoop job -kill job_id
Running a Hadoop MR program:
bin/hadoop jar jarFileName.jar programToRun parm1 parm2…
SS Chung CIS 612 Lecture Notes 39

Useful Application Sites
[1] http://wiki.apache.org/hadoop/EclipsePlugIn
[2] 10gen. MongoDB. http://www.mongodb.org/
[3] Apache Cassandra. http://cassandra.apache.org/
[4] Apache Hadoop. http://hadoop.apache.org/
[5] Apache HBase. http://hbase.apache.org/
[6] Apache Hive. http://hive.apache.org/
[7] Apache Pig. http://pig.apache.org/
[8] Apache ZooKeeper. http://zookeeper.apache.org/
40
How MapReduce Works in Hadoop
41

Lifecycle of a MapReduce Job
[Slides show a program with a Map function and a Reduce function
being run as a MapReduce job]
42-43

Lifecycle of a MapReduce Job
[Timeline: input splits feed Map Wave 1 and Map Wave 2, followed by
Reduce Wave 1 and Reduce Wave 2]
44
Hadoop MR Job Interface: Input Format
• The Hadoop MapReduce framework spawns one map task for each
InputSplit
• InputSplit: the input file is split into InputSplits (logical splits,
usually one block each, not physically split chunks) via
InputFormat::getInputSplits()
• The number of maps is usually driven by the total number of blocks
(InputSplits) of the input files:
with a 128 MB block size, a 10 TB input file is configured with
about 82,000 maps
45
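The map-task arithmetic above can be checked directly:

```python
# One map task per block (InputSplit): with 128 MB blocks, a 10 TB
# input yields 81,920 splits, i.e. roughly the 82,000 maps quoted above.
import math

block_size = 128 * 1024**2      # 128 MB
file_size = 10 * 1024**4        # 10 TB

num_maps = math.ceil(file_size / block_size)
print(num_maps)                 # 81920
```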
  • 46. Hadoop MR Job Interface: map() • The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair (line_num, line_string ) in the InputSplit for that task. • Output pairs are collected with calls to OutputCollector.collect(WritableComparable,Writable).
Hadoop MR Job Interface: combiner()
• An optional combiner, set via JobConf.setCombinerClass(Class),
• performs local aggregation of the intermediate outputs of the mapper
47
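Local aggregation is easy to see in miniature: collapse one mapper's (word, 1) pairs into (word, partial_count) pairs before they cross the network. Illustrative sketch only.

```python
# What a combiner buys you: fewer intermediate pairs leave the mapper.
from collections import Counter

def combine(pairs):
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

mapper_output = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]
combined = combine(mapper_output)
print(combined)    # 4 pairs collapsed to 3, with "big" pre-summed
```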
Hadoop MR Job Interface: Partitioner()
• The Partitioner controls the partitioning of the keys of the
intermediate map outputs
• The key (or a subset of the key) is used to derive the partition,
typically by a hash function
• The total number of partitions is the same as the number of reducers
• HashPartitioner is the default Partitioner for a job's reduce tasks
48
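The hash-partitioning idea is one line: the key's hash, modulo the number of reducers, picks the partition. This sketches the idea behind HashPartitioner; note Python's hash() differs from Java's hashCode().

```python
# Hash partitioning: equal keys always land on the same reducer,
# and partitions stay in range [0, num_reducers).

def partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 4
keys = ["hadoop", "big", "data", "hdfs"]
parts = {k: partition(k, num_reducers) for k in keys}
print(parts)

assert all(0 <= p < num_reducers for p in parts.values())
# Deterministic within a run: the same key maps to the same partition.
assert partition("hadoop", num_reducers) == partition("hadoop", num_reducers)
```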
Hadoop MR Job Interface: reducer()
• The Reducer has 3 primary phases:
1. Shuffle
2. Sort
3. Reduce
49

Hadoop MR Job Interface: reducer()
• Shuffle: input to the Reducer is the sorted output of the mappers.
In this phase the framework fetches the relevant partition of the
output of all the mappers via HTTP.
• Sort: the framework groups Reducer inputs by key in this stage
(since different mappers may have output the same key).
• The shuffle and sort phases occur simultaneously; while map
outputs are being fetched, they are merged.
50
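Shuffle and sort in miniature: gather one reducer's partition from every mapper's output, then group the values by key. Conceptual sketch only (real Hadoop fetches over HTTP and merges sorted runs).

```python
# Gather each mapper's contribution to this reducer's partition,
# then group values by key, sorted: this is the reducer's input.
from collections import defaultdict

mapper_outputs = [                   # this reducer's partition, per mapper
    [("big", 2), ("data", 1)],
    [("big", 1), ("hadoop", 1)],
]

grouped = defaultdict(list)
for part in mapper_outputs:          # "fetch" each mapper's partition
    for key, value in part:
        grouped[key].append(value)

reducer_input = sorted(grouped.items())   # group-and-sort by key
print(reducer_input)
```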
Hadoop MR Job Interface: reducer()
• Reduce: the framework then calls the
reduce(WritableComparable, Iterator, OutputCollector, Reporter)
method for each (key, list of values) pair in the grouped inputs.
• The output of the reduce task is typically written to the FileSystem
via OutputCollector.collect(WritableComparable, Writable).
51
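Completing the word-count sketch, the reduce step sums each key's value list and "collects" the final (word, total) pairs. Again a conceptual stand-in for the Java interface described above.

```python
# Conceptual reducer: called once per (key, list-of-values) pair,
# emits the final (word, total) through a collector callback.

def reduce_word_count(key, values, collect):
    collect((key, sum(values)))

grouped = [("big", [2, 1]), ("data", [1]), ("hadoop", [1])]
results = []
for key, values in grouped:
    reduce_word_count(key, values, results.append)

print(results)
```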
MR Job Parameters
• Map parameters:
– io.sort.mb
• Shuffle/Reduce parameters:
– io.sort.factor
– mapred.inmem.merge.threshold
– mapred.job.shuffle.merge.percent
52

Components in a Hadoop MR Workflow
The next few slides are from:
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
53

Quick Overview of Other Topics
• Dealing with failures
• Hadoop Distributed File System (HDFS)
• Optimizing a MapReduce job
61
Dealing with Failures and Slow Tasks
• What to do when a task fails?
– Try again (retries are possible because of idempotence)
– Try again somewhere else
– Report failure
• What about slow tasks (stragglers)?
– Run another copy of the same task in parallel and take the
results from the one that finishes first
– What are the pros and cons of this approach?
• Fault tolerance is a high priority in the MapReduce framework
62
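The straggler strategy (often called speculative execution) can be sketched with two threads: run duplicate copies of an idempotent task and take whichever finishes first. Illustrative only; Hadoop's scheduler decides when to launch the backup attempt.

```python
# Speculative execution in miniature: two attempts at the same
# idempotent task; the job takes the first result that arrives.
import queue
import threading
import time

def task(delay, result_q):
    time.sleep(delay)            # simulate a slow vs. a fast attempt
    result_q.put(delay)          # idempotent: both produce the same answer

results = queue.Queue()
slow = threading.Thread(target=task, args=(0.5, results))
fast = threading.Thread(target=task, args=(0.05, results))
slow.start(); fast.start()

winner = results.get()           # block until the first attempt finishes
print("first finisher's delay:", winner)
slow.join(); fast.join()
```

The trade-off the slide asks about: the backup attempt hides stragglers but burns extra cluster capacity on duplicate work.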
Lifecycle of a MapReduce Job
[Timeline: input splits feed Map Waves 1-2, then Reduce Waves 1-2]
How are the number of splits, the number of map and reduce tasks,
memory allocation to tasks, etc., determined?
64

Job Configuration Parameters
• 190+ configuration parameters in Hadoop
• Set manually, or defaults are used
65

Hadoop Job Configuration Parameters
Image source: http://www.jaso.co.kr/265
66

Tuning Hadoop Job Conf. Parameters
• Do their settings impact performance?
• What are ways to set these parameters?
– Defaults: are they good enough?
– Best practices: the best setting can depend on data, job, and
cluster properties
– Automatic setting
67

Experimental Setting
• Hadoop cluster of 1 master + 16 workers
• Each node:
– 2 GHz AMD processor, 1.8 GB RAM, 30 GB local disk
– Relatively ill-provisioned!
– Xen VM running Debian Linux
– Max 4 concurrent maps and 2 reduces
• Maximum map wave size = 16 x 4 = 64
• Maximum reduce wave size = 16 x 2 = 32
• Not all users can run large Hadoop clusters:
– Can Hadoop be made competitive in the 10-25 node,
multi-GB to TB data size range?
68

Parameters Varied in Experiments
69

Hadoop 50GB TeraSort
• Varying the number of reduce tasks, the number of concurrent
sorted streams for merging, and the fraction of the map-side sort
buffer devoted to metadata storage
70

Hadoop 50GB TeraSort
• Varying the number of reduce tasks for different values of the
fraction of the map-side sort buffer devoted to metadata storage
(with io.sort.factor = 500)
71

Hadoop 50GB TeraSort
• Varying the number of reduce tasks for different values of
io.sort.factor (io.sort.record.percent = 0.05, default)
72

Hadoop 75GB TeraSort
• 1D projection for io.sort.factor = 500
73

Automatic Optimization? (Not yet in Hadoop)
[Timelines comparing a plan with Map Waves 1-3 and Reduce Waves 1-2
against one with Reduce Waves 1-3]
What if the number of reduces is increased to 9?
74