Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. The core of Hadoop consists of HDFS for storage and MapReduce for processing. Hadoop has been extended with additional projects, including YARN for job scheduling and resource management, Pig for dataflow scripting, Hive for SQL-like queries, HBase for column-oriented storage, ZooKeeper for coordination, and Ambari for provisioning and managing Hadoop clusters. Hadoop provides a scalable and cost-effective solution for storing and analyzing massive amounts of data.
1. Apache Hadoop
Core & Ecosystem
UOIT - Faculty of Business and IT
Hamzeh Khazaei
hkh@yorku.ca
November 16, 2015
2. Agenda
• Data management system
• Conventional Data
• Big Data
• Hadoop
• File System
• Computation Paradigm
• YARN
• Subprojects
• NoSQL Datastores
• A Real Project
3. Database management system
• Relational database management system
• Structured data
• SQL
• Standard interface
• Vertical scalability
• High-end servers
6. Questions?
1. Who faced these challenges first?
2. When were they confronted with these challenges?
3. What was their solution?
4. What are the opportunities?
5. What is the role of cloud here?
6. What is next?
7. It is all about:
“How to store and process big data with reasonable cost and time?”
8. Definitions
• Apache Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale out from a single server to thousands of machines, with a very high degree of fault tolerance. (IBM)
• Apache Hadoop is a Java-based open-source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. (Hortonworks)
• Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. (SAS)
10. History, Google vs Hadoop
                      Google              Hadoop
Develop group         Google              Apache
Sponsor               Google              Yahoo, Amazon
File system           GFS (2003)          HDFS (2005)
Programming model     MapReduce (2004)    Hadoop MapReduce (2005)
Storage system        BigTable (2006)     HBase (2010)
Search engine         Google              Nutch
11. Uses for Hadoop
● Data-intensive text processing
● Assembly of large genomes
● Graph mining
● Machine learning and data mining
● Large-scale social network analysis
● Log analytics
● Health informatics
● Smart cities
12. Hadoop Core
Hadoop Common      Contains libraries and other modules
HDFS               Hadoop Distributed File System
Hadoop YARN        Yet Another Resource Negotiator
Hadoop MapReduce   A programming model for large-scale data processing
14. Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 200 PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move to where data resides
• Provides very high aggregate bandwidth
15. HDFS - Specifications
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 64MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
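For example, with the typical 64 MB block size, a 200 MB file is stored as three full 64 MB blocks plus one 8 MB block, and each of those four blocks is replicated to multiple DataNodes independently.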
16. HDFS - Architecture 1
• Master/Slave architecture
• NameNode
• Metadata server
• File location (file name -> the DataNodes)
• File attributes (atime/ctime/mtime, size, number of replicas)
• DataNode
• Manages the storage attached to the node it runs on
• Client
• Producers and consumers of data
18. HDFS I/O
• A typical read from a client involves:
a) Contact the NameNode to determine where the actual data is stored
b) NameNode replies with block identifiers and locations (i.e., which DataNode)
c) Contact the DataNode to fetch data
• A typical write from a client involves:
a) Contact the NameNode to update the namespace and verify permissions
b) NameNode allocates a new block on a suitable DataNode
c) The client directly streams to the selected DataNode
d) Currently, HDFS files are immutable
• Data is never moved through the NameNode; hence, the NameNode is not a bottleneck
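As a hedged sketch of this read path (not from the deck; the file path and the 4 KB buffer are illustrative), the HDFS Java client exposes it through org.apache.hadoop.fs.FileSystem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // open() asks the NameNode for block locations; the returned stream
    // then fetches the bytes directly from DataNodes, never via the NameNode
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    }
  }
}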
20. HDFS Replication
• By default, HDFS stores 3 separate copies of each block
• This ensures reliability, availability and performance
• Replication policy
• Spread replicas across different racks
• Robust against cluster node failures
• Robust against rack failures
• Block replication benefits MapReduce
• Scheduling decisions can take replicas into account
• Exploit better data locality
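In the same hedged spirit, the replication factor is per-file metadata, so it can be inspected and changed through the same FileSystem API (the path is again illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/data/sample.txt");  // illustrative path
    // The NameNode records the target replication factor per file
    short current = fs.getFileStatus(p).getReplication();
    System.out.println("current replication: " + current);
    // Request the default of 3 replicas; re-replication happens in the background
    fs.setReplication(p, (short) 3);
  }
}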
23. MapReduce Overview
● A method for distributing computation across multiple nodes
● Each node processes the data that is stored at that node
● Consists of two main phases
◦ Map
◦ Reduce
24. Now, Technically, What is MapReduce?
• MapReduce is a programming model for efficient distributed computing
• It works like a Unix pipeline
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
• Streaming through data, reducing seeks
• Pipelining
• A good fit for a lot of applications
• Log processing
• Web index building
25. MapReduce in 41 words.
Goal: count the number of books in the library.
• Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual counts.
26. Word Count Example
• Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1
• Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program
• Defines this job
• Submits job to cluster
28. Word Count Mapper
// (imports from java.io, java.util and org.apache.hadoop.{io,mapred} omitted on the slide)
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
29. Word Count Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
30. Word Count Main
public static void main(String[] args) throws Exception {
  // classic org.apache.hadoop.mapred API, matching the Map and Reduce
  // classes above; assumes they are nested in a WordCount driver class
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
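Packaged into a jar, the job would typically be launched from the command line with something like hadoop jar wordcount.jar WordCount <input-dir> <output-dir>; the jar and class names here are illustrative, assuming the three classes above are nested in a WordCount driver class.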
31. Execution Framework
• MapReduce program, a.k.a. a job:
• Code of mappers and reducers
• Code for combiners and partitioners (optional)
• Configuration parameters
• All packaged together
• A MapReduce job is submitted to the cluster
• The framework takes care of everything else
• Next, we will delve into the details
32. Scheduling
• Each job is broken into tasks
• Map tasks work on fractions of the input dataset, as defined by the underlying distributed filesystem
• Reduce tasks work on intermediate inputs and write back to the distributed filesystem
• The number of tasks may exceed the number of available machines in a cluster
• The scheduler takes care of maintaining something similar to a queue of pending tasks to be assigned to machines with available resources
• Jobs to be executed in a cluster require scheduling as well
• Different users may submit jobs
• Jobs may be of various complexity
• Fairness is generally a requirement
33. Anatomy of a Hadoop Cluster
• NameNode
• Holds the metadata for HDFS
• Secondary NameNode
• Performs housekeeping functions for the NameNode
• DataNode
• Stores the actual HDFS data blocks
• JobTracker
• Manages MapReduce jobs
• TaskTracker
• Monitors individual Map and Reduce tasks
37. YARN
• Yet Another Resource Negotiator
• YARN Application Resource Negotiator (recursive acronym)
• Remedies the scalability shortcomings of “classic” MapReduce: a single JobTracker per cluster limits clusters to about 4,000 nodes.
• Is more of a general-purpose framework, of which classic MapReduce is one application.
• Classic MapReduce also uses inflexible slots on nodes (each slot runs Map or Reduce tasks, not both), which causes underutilization of the cluster.
38. YARN = Hadoop 2.0 = MRv2
• The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker (i.e., resource management and job scheduling/monitoring) into separate daemons.
• The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
39. Hadoop Common
• Hadoop Common refers to the collection of common utilities and libraries that support other Hadoop modules.
• It is an essential part or module of the Apache Hadoop Framework, along with HDFS, Hadoop YARN and Hadoop MapReduce.
• Like all other modules, Hadoop Common assumes that hardware failures are common and that they should be automatically handled in software by the Hadoop Framework.
42. Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
• Easy to plug in Java functions
43. Example
• Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
46. HBase
• Open-source implementation of Google’s Bigtable
• Row/column store
• Billions of rows, millions of columns
• Column-oriented - nulls are free
• Online processing
• Master/Slave architecture
• Based on HDFS
48. HBase Query
• Retrieve a cell
• Cell cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
• RowResult row = table.getRow("enclosure1");
• Scan through a range of rows
• Scanner s = table.getScanner(new String[] { "animal:type" });
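The calls above come from a very early HBase client API; with the current client API the same cell lookup reads roughly as follows (a sketch, not from the deck: the table name "zoo" and the default configuration are assumptions, while the row key and animal:type column come from the slide):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellLookup {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();      // reads hbase-site.xml
         Table table = conn.getTable(TableName.valueOf("zoo"))) {     // "zoo" is illustrative
      Get get = new Get(Bytes.toBytes("enclosure1"));                 // row key from the slide
      get.addColumn(Bytes.toBytes("animal"), Bytes.toBytes("type"));  // family:qualifier
      Result row = table.get(get);
      byte[] type = row.getValue(Bytes.toBytes("animal"), Bytes.toBytes("type"));
      System.out.println(Bytes.toString(type));
    }
  }
}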
49. Hive
• Developed at Facebook
• Used for the majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains list of table schemas
• SQL-like query language (HiveQL)
• Can call Hadoop Streaming scripts from HiveQL
• Supports table partitioning, clustering, complex data types, some optimizations
50. Hive DDL
CREATE TABLE page_views (viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
STORED AS SEQUENCEFILE;
● Partitioning breaks the table into separate files for each (dt, country) pair
○ Ex: /hive/page_view/dt=2015-06-08,country=USA
     /hive/page_view/dt=2015-06-08,country=CA
51. A Simple Query
• Find all page views coming from xyz.com in March:
SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2015-03-01'
  AND page_views.dt <= '2015-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';
• Hive only reads partitions “2015-03-*” instead of scanning the entire table
52. Mahout
• A scalable machine learning and data mining library
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many open-source ML libraries either:
• Lack community
• Lack documentation and examples
• Lack scalability
• Or are research-oriented
53. Apache Ambari
• Provision a Hadoop cluster
• Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
• Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop cluster
• Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
• Monitor a Hadoop cluster
• Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.
• Ambari leverages the Ambari Metrics System for metrics collection.
• Ambari leverages the Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
55. Others
• ZooKeeper
• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
• Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
• Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
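As a hedged illustration of Sqoop's role (the connection string, table and directory names are made up here), a single command such as sqoop import --connect jdbc:mysql://dbhost/shop --table orders --target-dir /data/orders launches a MapReduce job that copies the rows of the relational table into files under the given HDFS directory.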