Apache Hadoop Big Data Technology

Er. Jay Nagar (Technology Researcher)
+91-9601957620

What is Apache Hadoop?
 Open source software framework designed for
storage and processing of large scale data on
clusters of commodity hardware
 Created by Doug Cutting and Mike Cafarella in
2005.
 Cutting named the program after his son’s toy
elephant.

Uses for Hadoop
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis

The Hadoop Ecosystem
• Hadoop Common: contains libraries and other modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for
large-scale data processing

How much data?
 Facebook
   500 TB per day
 Yahoo
   Over 170 PB
 eBay
   Over 6 PB
 Getting the data to the processors becomes the
bottleneck

• Hadoop:
• an open-source software framework that supports data-intensive
distributed applications, licensed under the Apache v2 license.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or
rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data

Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster hold most of the
computational power and storage of the system
• Main nodes run a TaskTracker to accept and respond
to MapReduce tasks, and a DataNode to store the
needed blocks as close to the computation as possible
• A central control node runs the NameNode to keep
track of HDFS directories & files, and the JobTracker
to dispatch compute tasks to TaskTrackers

• Hadoop Distributed Filesystem
• Tailored to needs of MapReduce
• Targeted towards many reads of filestreams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB by default in Hadoop 1.x)
• Location awareness of DataNodes in network
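
The block size and replication figures above translate into simple arithmetic. A minimal sketch, assuming the 64 MB block size and 3x replication quoted on this slide; the function name and the 1000 MB file are made up for illustration:

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies as much disk as the data it holds,
    # so raw usage is file size times replication, not blocks times 64 MB.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb

# Hypothetical 1000 MB file: 16 blocks, 3000 MB of raw storage.
blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)
```

Replication multiplies the raw storage cost by three, which is the price paid for not needing RAID on the worker nodes.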

• Hadoop is in use at most organizations that
handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc…
• Some examples of scale:
o Yahoo!’s Search Webmap runs on a 10,000-core
Linux cluster and powers Yahoo! Web search
o Facebook’s Hadoop cluster hosts 100+ PB of
data (July, 2012) & is growing at ½ PB/day
(Nov, 2012)

NameNode:
• Stores metadata for the files, like the directory
structure of a typical FS.
• The server holding the NameNode instance is quite
crucial, as there is only one.
• Keeps a transaction log of file additions, deletions, etc.
Only metadata changes are logged, not whole blocks or
file streams.
• Handles creation of more replica blocks when
necessary after a DataNode failure

DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
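
The placement rule on this slide (two replicas in the local rack, one elsewhere) can be sketched as below. This is only an illustration of the rule as stated here; the topology, node names, and function are hypothetical, and the actual placement policy may differ between Hadoop releases.

```python
import random

# Hypothetical topology: rack name -> DataNode names.
TOPOLOGY = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5"],
    "rack3": ["dn6"],
}

def place_replicas(local_rack, topology, rng):
    """Pick three DataNodes: two in local_rack, one in another rack."""
    local = rng.sample(topology[local_rack], 2)
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    remote = rng.choice(topology[remote_rack])
    return local + [remote]

replicas = place_replicas("rack1", TOPOLOGY, random.Random(0))
print(replicas)
```

Keeping one replica on a different rack means a whole-rack failure does not lose the block.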

Hadoop’s Architecture: MapReduce Engine

MapReduce Engine:
• JobTracker & TaskTracker
• The JobTracker splits a job into smaller
tasks (“Map”) and sends them to the TaskTracker
process on each node
• Each TaskTracker reports job progress back to the
JobTracker node, sends data (“Reduce”), or
requests new tasks
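
One way to picture the dispatch step is a scheduler that prefers nodes already holding the data ("move computation rather than data"). The sketch below is a simplification under that assumption, not the JobTracker's actual algorithm; all names and the slot counts are hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of free task slots (mutated as slots fill)
    """
    assignment = {}
    for block, holders in block_locations.items():
        # Prefer a data-local node; otherwise fall back to any free node.
        candidates = [n for n in holders if free_slots.get(n, 0) > 0]
        if not candidates:
            candidates = [n for n, s in free_slots.items() if s > 0]
        if not candidates:
            break  # no free slots left; remaining tasks would have to wait
        node = candidates[0]
        free_slots[node] -= 1
        assignment[block] = node
    return assignment

blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
slots = {"n1": 2, "n2": 1, "n3": 1}
plan = assign_tasks(blocks, slots)
print(plan)
```

Here `b3` cannot run on `n1` because its slots are used up, so it falls back to a node that must fetch the block over the network.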

HDFS Basic Concepts
 HDFS is a file system written in Java, based on
Google’s GFS (Google File System)
 Provides redundant storage for massive amounts
of data

HDFS Basic Concepts
 HDFS works best with a smaller number of large
files
   Millions as opposed to billions of files
   Typically 100 MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files and
not random reads

How are Files Stored
 Files are split into blocks
 Blocks are distributed across many machines at load
time
   Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
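
The bookkeeping described above can be sketched with two dictionaries. This is a toy model, not HDFS code: the block naming scheme, the round-robin placement, and the tiny block size are all assumptions made for illustration.

```python
from itertools import cycle

def store_file(name, data, block_size, datanodes, replication=3):
    """Split data into blocks and record replica locations, NameNode-style."""
    namespace = {name: []}   # file name -> ordered block ids
    locations = {}           # block id -> DataNodes holding a replica
    nodes = cycle(datanodes)
    for i in range(0, len(data), block_size):
        block_id = f"{name}#blk{i // block_size}"
        namespace[name].append(block_id)
        # Round-robin placement for simplicity; real placement is rack-aware.
        locations[block_id] = [next(nodes) for _ in range(replication)]
    return namespace, locations

namespace, locations = store_file(
    "log.txt", b"x" * 10, block_size=4,
    datanodes=["dn1", "dn2", "dn3", "dn4"],
)
print(namespace)
print(locations)
```

The first dictionary answers "which blocks make up this file", the second answers "where is each block stored".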

Data Replication
 Default replication is 3-fold

MapReduce
Distributing computation across nodes

MapReduce Overview
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
   Map
   Reduce
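
The two phases can be imitated in a single process. The sketch below is plain Python, not the Hadoop API; the records and function names are made up, and the in-memory grouping stands in for what the framework does between the phases.

```python
from collections import defaultdict

records = ["1950,22", "1950,31", "1951,18", "1951,27", "1950,12"]  # made-up data

def map_fn(line):
    """Map: turn one input record into (year, temperature)."""
    year, temp = line.split(",")
    yield year, int(temp)

def reduce_fn(year, temps):
    """Reduce: collapse all temperatures for one year into the maximum."""
    yield year, max(temps)

# Map phase; the grouping by key stands in for the framework's shuffle.
intermediate = defaultdict(list)
for line in records:
    for key, value in map_fn(line):
        intermediate[key].append(value)

# Reduce phase: one call per unique key.
result = dict(
    pair for key in sorted(intermediate) for pair in reduce_fn(key, intermediate[key])
)
print(result)
```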

MapReduce Features
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for programmers to
use

The Mapper
 Reads data as key/value pairs
 The key is often discarded
 Outputs zero or more key/value pairs
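
A word-count mapper illustrates these points. This is a plain Python sketch rather than a Hadoop Mapper class; it assumes text input where the key is the position of the line in the file, which is why the key can be ignored.

```python
def word_count_mapper(offset, line):
    """Emit (word, 1) for each word; the input key (offset) is ignored."""
    for word in line.lower().split():
        yield word, 1

pairs = list(word_count_mapper(0, "the quick the"))
empty = list(word_count_mapper(14, ""))  # an empty line yields zero pairs
print(pairs, empty)
```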

Shuffle and Sort
 Output from the mapper is sorted by key
 All values with the same key are guaranteed to
go to the same machine
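
The guarantee that equal keys meet on the same machine usually comes from partitioning by a hash of the key. A minimal sketch, assuming a hash partitioner; `zlib.crc32` is used here only because it is stable across Python runs, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Map a key to a reducer index; the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair to one reducer, then sort each by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]

mapper_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
print(reducer_inputs)
```

Both `("a", 1)` pairs come from different mappers but end up in the same reducer's input.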

The Reducer
 Called once for each unique key
 Gets a list of all values associated with a key as
input
 The reducer outputs zero or more final key/value
pairs
 Usually just one output per input key
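
Because the input arrives sorted by key, calling the reducer once per unique key amounts to grouping consecutive pairs. A plain Python sketch of that idea (not the Hadoop Reducer class; the helper names are hypothetical):

```python
from itertools import groupby
from operator import itemgetter

def sum_reducer(key, values):
    """Emit one (key, total) pair for the key."""
    yield key, sum(values)

def run_reducer(sorted_pairs, reducer):
    """Call reducer once per unique key over key-sorted (key, value) pairs."""
    out = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        out.extend(reducer(key, (value for _, value in group)))
    return out

sorted_pairs = [("quick", 1), ("the", 1), ("the", 1)]
totals = run_reducer(sorted_pairs, sum_reducer)
print(totals)
```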

[Figure: MapReduce data flow. Four Mappers each emit intermediates to a Partitioner; shuffling routes the partitioned intermediates to three Reducers.]

Overview
 NameNode
   Holds the metadata for HDFS
 Secondary NameNode
   Performs housekeeping functions for the NameNode
 DataNode
   Stores the actual HDFS data blocks
 JobTracker
   Manages MapReduce jobs
 TaskTracker
   Monitors individual Map and Reduce tasks

The NameNode
 Stores the HDFS file system information in an
fsimage file
 Updates to the file system (add/remove blocks)
do not change the fsimage file
   They are instead written to a log file
 When starting, the NameNode loads the fsimage
file and then applies the changes in the log file
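
The startup sequence is a snapshot-plus-replay pattern. The sketch below models it with plain Python structures; the record layout and operation names are assumptions for illustration and do not reflect the real fsimage or log file formats.

```python
def load_namespace(fsimage, edit_log):
    """Rebuild the in-memory namespace: start from fsimage, replay the log."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for op, path, blocks in edit_log:
        if op == "add":
            namespace[path] = list(blocks)
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

fsimage = {"/a.txt": ["blk1"], "/b.txt": ["blk2", "blk3"]}
edit_log = [("add", "/c.txt", ["blk4"]), ("delete", "/a.txt", [])]
namespace = load_namespace(fsimage, edit_log)
print(namespace)
```

The snapshot itself is never modified, so the longer the log grows, the longer a restart takes.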

The Secondary NameNode
 NOT a backup for the NameNode
 Periodically reads the log file and applies the
changes to the fsimage file, bringing it up to date
 Allows the NameNode to restart faster when
required
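
Why this speeds up a restart: folding the log into the image leaves the resulting state unchanged but empties the log that must be replayed. A toy sketch of that checkpoint step, with made-up structures and function names rather than the real file formats:

```python
def apply_edits(image, edits):
    """Return a new image with the logged operations applied in order."""
    result = dict(image)
    for op, path in edits:
        if op == "add":
            result[path] = True
        elif op == "delete":
            result.pop(path, None)
    return result

def checkpoint(fsimage, edit_log):
    """Fold the edit log into a new fsimage and start a fresh, empty log."""
    return apply_edits(fsimage, edit_log), []

fsimage = {"/a.txt": True}
edit_log = [("add", "/b.txt"), ("delete", "/a.txt"), ("add", "/c.txt")]

before = apply_edits(fsimage, edit_log)         # restart without a checkpoint
new_image, new_log = checkpoint(fsimage, edit_log)
after = apply_edits(new_image, new_log)         # restart after a checkpoint
print(before == after, len(edit_log), len(new_log))
```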

JobTracker and TaskTracker
 JobTracker
   Determines the execution plan for the job
   Assigns individual tasks
 TaskTracker
   Keeps track of the performance of an individual
mapper or reducer

Hadoop Ecosystem
Other available tools

Why do these tools exist?
 MapReduce is very powerful, but can be awkward
to master
 These tools allow programmers who are familiar
with other programming styles to take advantage
of the power of MapReduce

Other Tools
 Hive
   Hadoop processing with SQL
 Pig
   Hadoop processing with scripting
 Cascading
   Pipe and Filter processing model
 HBase
   Database model built on top of Hadoop
 Flume
   Designed for large scale data movement
