Apache Hadoop
Core & Ecosystem
UOIT - Faculty of Business and IT
Hamzeh Khazaei
November 16, 2015
Agenda
• Data management system
• Conventional Data
• Big Data
• Hadoop
• File System
• Computation Paradigm
• YARN
• Subprojects
• NoSQL Datastores
• A Real Project
Database management system
• Relational database
• Structured data
• Standard interface
• Vertical Scalability
• High-end servers
Big Data
1. Who faced these challenges first?
2. When were they confronted with these challenges?
3. What was their solution?
4. What are the opportunities?
5. What is the role of cloud here?
6. What is next?
It is all about:
“How to store and process big data
with reasonable cost and time?”
• Apache Hadoop is an open-source software project that enables distributed
processing of large data sets across clusters of commodity servers. It is designed
to scale out from a single server to thousands of machines, with a very high
degree of fault tolerance. (IBM)
• Apache Hadoop is a Java-based open-source framework for distributed storage
and processing of large sets of data on commodity hardware. Hadoop enables
businesses to quickly gain insight from massive amounts of structured and
unstructured data. (Hortonworks)
• Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs. (SAS)
Who Uses Hadoop?
History, Google vs Hadoop
                     Google             Hadoop
Development Group    Google             Apache
Sponsor              Google             Yahoo, Amazon
File System          GFS (2003)         HDFS (2005)
Programming Model    MapReduce (2004)   Hadoop MapReduce (2005)
Storage System       BigTable (2006)    HBase (2010)
Search Engine        Google             Nutch
Uses for Hadoop
● Data-intensive text processing
● Assembly of large genomes
● Graph mining
● Machine learning and data mining
● Large scale social network analysis
● Log analytics
● Health Informatics
● Smart Cities
Hadoop Core
• Hadoop Common: contains libraries and other modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 200PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move
to where data resides
• Provides very high aggregate bandwidth
HDFS - Specifications
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 64 MB block size (the Hadoop 1.x default; Hadoop 2.x defaults to 128 MB)
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
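The block and replication figures above translate into simple arithmetic. The following is a minimal sketch, assuming the 64 MB block size and 3-fold replication quoted in these slides; the class `BlockMath` and the 1000 MB file are hypothetical and this is not the HDFS client API.

```java
// Sketch only: models HDFS block arithmetic, not the real HDFS client API.
final class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster; the last block is not padded.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1000 * MB;                    // hypothetical 1000 MB file
        long blocks = blockCount(fileSize, 64 * MB);  // 64 MB blocks, as in the slides
        System.out.println("blocks = " + blocks);                                 // blocks = 16
        System.out.println("block replicas = " + blocks * 3);                     // block replicas = 48
        System.out.println("raw MB stored = " + rawStorageBytes(fileSize, 3) / MB); // raw MB stored = 3000
    }
}
```

A larger block size means fewer blocks per file, and therefore less metadata for the NameNode to track.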
HDFS - Architecture 1
• Master/Slave Architecture
• NameNode
• Metadata Server
• File location (file name -> the DataNode)
• File attributes (atime/ctime/mtime, size, number of replicas)
• DataNode
• Manages the storage attached to the node it runs on
• Client
• Producers and consumers of data
HDFS - Architecture 2
• A typical read from a client involves:
a) Contact the NameNode to determine where the actual data is stored
b) NameNode replies with block identifiers and locations (i.e., which DataNode)
c) Contact the DataNode to fetch data
• A typical write from a client involves:
a) Contact the NameNode to update the namespace and verify permissions
b) NameNode allocates a new block on a suitable DataNode
c) The client directly streams to the selected DataNode
d) Currently, HDFS files are immutable
• Data is never moved through the NameNode. Hence, there is no bottleneck
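The read path can be illustrated with a toy in-memory model. This is only a sketch: `ToyHdfs` and its fields are hypothetical names, blocks are plain strings, and the write method places every replica itself rather than streaming to DataNodes. The point it shows is that the NameNode maps hold metadata only, while block contents come from a DataNode.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the HDFS read path; all names here are hypothetical.
final class ToyHdfs {
    // NameNode side (metadata only): file -> ordered block ids, block id -> DataNodes.
    final Map<String, List<String>> fileToBlocks = new HashMap<>();
    final Map<String, List<String>> blockToNodes = new HashMap<>();
    // DataNode side: node name -> (block id -> block contents).
    final Map<String, Map<String, String>> dataNodes = new HashMap<>();

    // Simplified write: registers metadata and stores each chunk on every listed node.
    void write(String file, List<String> chunks, List<String> nodes) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            String id = file + "#" + i;
            ids.add(id);
            blockToNodes.put(id, new ArrayList<>(nodes));
            for (String node : nodes) {
                dataNodes.computeIfAbsent(node, k -> new HashMap<>()).put(id, chunks.get(i));
            }
        }
        fileToBlocks.put(file, ids);
    }

    String read(String file) {
        StringBuilder contents = new StringBuilder();
        for (String id : fileToBlocks.get(file)) {        // steps a and b: ask the NameNode
            String node = blockToNodes.get(id).get(0);    // pick the first listed replica
            contents.append(dataNodes.get(node).get(id)); // step c: fetch from the DataNode
        }
        return contents.toString();
    }
}
```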
Data Replication
• Default replication is 3-fold
HDFS Replication
• By default, HDFS stores 3 separate copies of each block
• This ensures reliability, availability and performance
• Replication policy
• Spread replicas across different racks
• Robust against cluster node failures
• Robust against rack failures
• Block replication benefits MapReduce
• Scheduling decisions can take replicas into account
• Exploit better data locality
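To make "spread replicas across different racks" concrete, here is a simplified sketch that takes one node per rack in turn. It is an illustration under that assumption only and is not the actual HDFS placement policy; `RackPlacement` and the rack and node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified rack-spreading placement; not the real HDFS block placement policy.
final class RackPlacement {
    // racks: rack name -> nodes on that rack (iteration order decides tie-breaks).
    // Returns up to `replicas` nodes, visiting racks round-robin so that the
    // copies land on as many different racks as possible.
    static List<String> place(Map<String, List<String>> racks, int replicas) {
        List<Iterator<String>> cursors = new ArrayList<>();
        for (List<String> nodes : racks.values()) {
            cursors.add(nodes.iterator());
        }
        List<String> chosen = new ArrayList<>();
        boolean progress = true;
        while (chosen.size() < replicas && progress) {
            progress = false;
            for (Iterator<String> cursor : cursors) {
                if (chosen.size() == replicas) {
                    break;
                }
                if (cursor.hasNext()) {
                    chosen.add(cursor.next());
                    progress = true;
                }
            }
        }
        return chosen;
    }
}
```

With two racks of two nodes each and three replicas, the copies end up on both racks, so losing a single node or a whole rack still leaves at least one copy.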
User Interface
• Commands for HDFS User:
• hadoop dfs -mkdir /foodir
• hadoop dfs -cat /foodir/myfile.txt
• hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator
• hadoop dfsadmin -report
• hadoop dfsadmin -decommission datanodename
• Web Interface
• http://host:port/dfshealth.jsp
MapReduce Overview
● A method for distributing computation across multiple nodes
● Each node processes the data that is stored at that node
● Consists of two main phases
◦ Map
◦ Reduce
Now, Technically, What is MapReduce?
• MapReduce is a programming model for efficient
distributed computing
• It works like a Unix pipeline
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
• Streaming through data, reducing seeks
• Pipelining
• A good fit for a lot of applications
• Log processing
• Web index building
MapReduce in 41 words.
Goal: count the number of books in the library.
• Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual counts.
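The library analogy can be sketched with a thread pool: each worker counts one shelf (map), then the partial counts are added up (reduce). This is an illustrative sketch using plain `java.util.concurrent`, not Hadoop; `LibraryCount` and the cap of four threads are arbitrary assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Counting books shelf by shelf in parallel; a stand-in for map and reduce.
final class LibraryCount {
    static int countBooks(List<List<String>> shelves) {
        int workers = Math.max(1, Math.min(4, shelves.size()));
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Map: one counting task per shelf, run by whichever worker is free.
            List<Future<Integer>> partialCounts = new ArrayList<>();
            for (List<String> shelf : shelves) {
                Callable<Integer> countShelf = shelf::size;
                partialCounts.add(pool.submit(countShelf));
            }
            // Reduce: add up the individual counts.
            int total = 0;
            for (Future<Integer> partial : partialCounts) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("counting failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```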
Word Count Example
• Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1
• Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program
• Defines this job
• Submits job to cluster
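The mapper and reducer contracts listed above can be traced end to end in plain Java, without a cluster. The sketch below is an in-memory illustration of the map, shuffle and sort, and reduce stages; `WordCountFlow` and its method names are hypothetical and none of this is Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory trace of the word count data flow; not Hadoop code.
final class WordCountFlow {
    // Map: emit one (word, 1) pair per token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle and sort: group the values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the cat", "the dog"))); // {cat=1, dog=1, the=2}
    }
}
```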
Word Count Data Flow
Word Count Mapper
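// Mapper (old org.apache.hadoop.mapred API): emits (word, 1) for every token.
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // map() must be an instance method to implement the Mapper interface.
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}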
Word Count Reducer
// Reducer (old org.apache.hadoop.mapred API): sums the counts for each word.
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add each partial count for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
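Word Count Main
// Note: Job belongs to the newer org.apache.hadoop.mapreduce API; Map and Reduce
// must extend that API's Mapper and Reducer classes to be used with this driver.
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}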
Execution Framework
• MapReduce program, a.k.a. a job:
• Code of mappers and reducers
• Code for combiners and partitioners (optional)
• Configuration parameters
• All packaged together
• A MapReduce job is submitted to the cluster
• The framework takes care of everything else
• Next, we will delve into the details
Scheduling
• Each job is broken into tasks
• Map tasks work on fractions of the input dataset, as defined by the
underlying distributed filesystem
• Reduce tasks work on intermediate inputs and write back to the distributed
filesystem
• The number of tasks may exceed the number of available
machines in a cluster
• The scheduler maintains something similar to a queue of
pending tasks to be assigned to machines with available resources
• Jobs to be executed in a cluster require scheduling as well
• Different users may submit jobs
• Jobs may be of various complexity
• Fairness is generally a requirement
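The idea of a queue of pending tasks handed to machines with free capacity can be sketched as follows. This is a deliberately simple first-in, first-out model with a fixed number of slots per machine; `ToyScheduler` is hypothetical and ignores data locality, fairness between users, and task failures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// FIFO toy scheduler: tasks wait in a queue until a machine has a free slot.
final class ToyScheduler {
    // Returns one map per "wave": machine -> tasks it runs in that wave.
    static List<Map<String, List<String>>> schedule(
            List<String> tasks, List<String> machines, int slotsPerMachine) {
        if (machines.isEmpty() || slotsPerMachine < 1) {
            throw new IllegalArgumentException("need at least one machine and one slot");
        }
        Deque<String> pending = new ArrayDeque<>(tasks);
        List<Map<String, List<String>>> waves = new ArrayList<>();
        while (!pending.isEmpty()) {
            Map<String, List<String>> wave = new LinkedHashMap<>();
            for (String machine : machines) {
                List<String> assigned = new ArrayList<>();
                while (assigned.size() < slotsPerMachine && !pending.isEmpty()) {
                    assigned.add(pending.poll());
                }
                if (!assigned.isEmpty()) {
                    wave.put(machine, assigned);
                }
            }
            waves.add(wave);
        }
        return waves;
    }
}
```

Five tasks on two single-slot machines need three waves, which is the case where the number of tasks exceeds the number of machines.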
Anatomy of a Hadoop Cluster
• NameNode
• Holds the metadata for the HDFS
• Secondary NameNode
• Performs housekeeping functions for the NameNode
• DataNode
• Stores the actual HDFS data blocks
• JobTracker
• Manages MapReduce jobs
• TaskTracker
• Monitors individual Map and Reduce tasks
Hadoop 1.0 vs Hadoop 2.0
Hadoop 2.0 with YARN
YARN
• Yet Another Resource Negotiator
• YARN Application Resource Negotiator (recursive acronym)
• Remedies the scalability shortcomings of “classic” MapReduce: a single
JobTracker per cluster limited clusters to about 4000 nodes.
• Is more of a general-purpose framework, of which classic
MapReduce is one application.
• Replaces the inflexible slots of classic MapReduce, where each slot on a node
ran either Map or Reduce tasks but not both, which caused underutilization of
the cluster.
YARN = Hadoop 2.0 = MRv2
• The fundamental idea of MRv2 is to split up the two major functionalities
of the JobTracker (i.e., resource management and job scheduling/monitoring)
into separate daemons.
• The idea is to have a global ResourceManager (RM) and a per-application
ApplicationMaster (AM).
Hadoop Common
• Hadoop Common refers to the collection of common
utilities and libraries that support other Hadoop modules.
• It is an essential part or module of the Apache Hadoop
Framework, along with the HDFS, Hadoop YARN and
Hadoop MapReduce.
• Like all other modules, Hadoop Common assumes that
hardware failures are common and that these should be
automatically handled in software by the Hadoop Framework.
Hadoop Ecosystem
Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
• Easy to plug in Java functions
Example
• Suppose you have user data in one file and website data in another, and you
need to find the top 5 most visited pages by users aged 18-25
[Figures: the same task written in MapReduce and in Pig Latin]
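The logic of this example (filter users by age, join with page visits, group by page, order by count, keep the top 5) can be sketched in plain Java on in-memory data. This is neither the MapReduce nor the Pig Latin version from the slides; `TopPages`, the record layout, and the sample names used to exercise it are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory version of "top N pages visited by users aged 18-25".
final class TopPages {
    // userAges: user name -> age; visits: {userName, pageUrl} pairs.
    static List<String> topPages(Map<String, Integer> userAges, List<String[]> visits, int n) {
        // Filter + join: keep only visits whose user is known and aged 18 to 25.
        // Group: count the surviving visits per page.
        Map<String, Long> visitsPerPage = visits.stream()
            .filter(visit -> {
                Integer age = userAges.get(visit[0]);
                return age != null && age >= 18 && age <= 25;
            })
            .collect(Collectors.groupingBy(visit -> visit[1], Collectors.counting()));

        // Order: most visited first, ties broken by URL so the result is deterministic.
        Comparator<Map.Entry<String, Long>> byCountDescThenUrl = (a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };

        // Limit: keep the first n pages.
        return visitsPerPage.entrySet().stream()
            .sorted(byCountDescThenUrl)
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Each stage corresponds to one relational operator of the kind Pig provides, which is why the Pig Latin version of this task is so much shorter than hand-written MapReduce.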
HBase
• Open source implementation of Google’s Bigtable
• Row/column store
• Billions of rows/millions of columns
• Column-oriented - nulls are free
• Online processing
• Master/Slave architecture
• Based on HDFS
HBase Schema
HBase Query
• Retrieve a cell
• Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
• RowResult = table.getRow("enclosure1");
• Scan through a range of rows
• Scanner s = table.getScanner(new String[] { "animal:type" });
Hive
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains list of table schemas
• SQL-like query language (HiveQL)
• Can call Hadoop Streaming scripts from HiveQL
• Supports table partitioning, clustering, complex data
types, some optimizations
Hive DDL
CREATE TABLE page_views (viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
STORED AS SEQUENCEFILE;
● Partitioning breaks the table into separate files for each (dt,
country) pair
○ Ex: /hive/page_view/dt=2015-06-08,country=USA
○ /hive/page_view/dt=2015-06-08,country=CA
A Simple Query
• Find all page views coming from a given referrer in March:
SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2015-03-01'
AND page_views.dt <= '2015-03-31'
AND page_views.referrer_url like '';
• Hive only reads partitions “2015-03-*” instead
of scanning the entire table
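The partition pruning described for this query can be sketched as a filter over partition directory names. This is a toy illustration, not how Hive is implemented; `PartitionPruning` is hypothetical, and the directory strings follow the `dt=...,country=...` form shown in the DDL example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy partition pruning: choose which partition directories a date-range query must read.
final class PartitionPruning {
    // Keeps directories whose dt value lies in [from, to]. Dates in yyyy-MM-dd
    // form compare correctly as plain strings.
    static List<String> prune(List<String> partitionDirs, String from, String to) {
        List<String> kept = new ArrayList<>();
        for (String dir : partitionDirs) {
            int start = dir.indexOf("dt=");
            if (start < 0 || dir.length() < start + 13) {
                continue; // no dt value in this path
            }
            String dt = dir.substring(start + 3, start + 13); // yyyy-MM-dd is 10 characters
            if (dt.compareTo(from) >= 0 && dt.compareTo(to) <= 0) {
                kept.add(dir);
            }
        }
        return kept;
    }
}
```

Only the directories that survive the filter are scanned, so the cost of the query depends on the size of the March partitions rather than on the size of the whole table.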
Mahout
• A scalable machine learning and data mining library
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many Open Source ML libraries either:
• Lack Community
• Lack Documentation and Examples
• Lack Scalability
• Or are research-oriented
Apache Ambari
• Provision a Hadoop Cluster
• Ambari provides a step-by-step wizard for installing Hadoop services across any number of
hosts.
• Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop Cluster
• Ambari provides central management for starting, stopping, and reconfiguring Hadoop
services across the entire cluster.
• Monitor a Hadoop Cluster
• Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
• Ambari leverages Ambari Metrics System for metrics collection.
• Ambari leverages Ambari Alert Framework for system alerting and will notify you when
your attention is needed (e.g., a node goes down, remaining disk space is low, etc).
Others
• ZooKeeper
• ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services. All of these kinds of services are used in some form or
another by distributed applications.
• Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
• Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases.

Introduction to Hadoop

  • 1. Apache Hadoop Core & Ecosystem UOIT - Faculty of Business and IT Hamzeh Khazaei November 16, 2015
  • 2. Agenda • Data management system • Conventional Data • Big Data • Hadoop • File System • Computation Paradigm • YARN • Subprojects • NoSQL Datastores • A Real Project 2
  • 3. Database management system • Relational database management system • Structured data • SQL • Standard interface • Vertical Scalability • High-end servers 3
  • 6. Questions? 1. Who faced these challenges first? 2. When were they confronted with these challenges? 3. What was their solution? 4. What are the opportunities? 5. What is the role of cloud here? 6. What is next? 6
  • 7. 7 It is all about: “How to store and process big data with reasonable cost and time?”
  • 8. Definitions • Apache Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale out from a single server to thousands of machines, with a very high degree of fault tolerance. (IBM) • Apache Hadoop is a Java-based open-source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. (Hortonworks) • Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. (SAS) 8
  • 10. History, Google vs Hadoop 10 Development Group Google Apache Sponsor Google Yahoo, Amazon File System GFS (2003) HDFS (2005) Programming Model MapReduce (2004) Hadoop MapReduce (2005) Storage System BigTable (2006) HBase (2010) Search Engine Google Nutch
  • 11. ● Data-intensive text processing ● Assembly of large genomes ● Graph mining ● Machine learning and data mining ● Large scale social network analysis ● Log analytics ● Health Informatics ● Smart Cities Uses for Hadoop 11
  • 12. Hadoop Common Contains Libraries and other modules HDFS Hadoop Distributed File System Hadoop YARN Yet Another Resource Negotiator Hadoop MapReduce A programming model for large scale data processing Hadoop Core 12
  • 14. Goals of HDFS • Very Large Distributed File System • 10K nodes, 100 million files, 200PB • Assumes Commodity Hardware • Files are replicated to handle hardware failure • Detect failures and recover from them • Optimized for Batch Processing • Data locations exposed so that computations can move to where data resides • Provides very high aggregate bandwidth 14
  • 15. HDFS - Specifications • Single Namespace for entire cluster • Data Coherency • Write-once-read-many access model • Client can only append to existing files • Files are broken up into blocks • Typically 64MB block size • Each block replicated on multiple DataNodes • Intelligent Client • Client can find location of blocks • Client accesses data directly from DataNode 15
  • 16. HDFS - Architecture 1 • Master/Slave Architecture • NameNode • Metadata Server • File location (file name -> the DataNode) • File attributes (atime/ctime/mtime, size, number of replicas) • DataNode • Manages the storage attached to the node it runs on • Client • Producers and consumers of data 16
  • 18. HDFS I/O • A typical read from a client involves: a) Contact the NameNode to determine where the actual data is stored b) NameNode replies with block identifiers and locations (i.e., which DataNode) c) Contact the DataNode to fetch data • A typical write from a client involves: a) Contact the NameNode to update the namespace and verify permissions b) NameNode allocates a new block on a suitable DataNode c) The client directly streams to the selected DataNode d) Currently, HDFS files are immutable • Data is never moved through the NameNode Hence, there is no bottleneck 18
  • 19. • Default replication is 3-fold Data Replication 19
  • 20. HDFS Replication • By default, HDFS stores 3 separate copies of each block • This ensures reliability, availability and performance • Replication policy • Spread replicas across different racks • Robust against cluster node failures • Robust against rack failures • Block replication benefits MapReduce • Scheduling decisions can take replicas into account • Exploit better data locality 20
  • 21. User Interface • Commands for HDFS User: • hadoop dfs -mkdir /foodir • hadoop dfs -cat /foodir/myfile.txt • hadoop dfs -rm /foodir/myfile.txt • Commands for HDFS Administrator • hadoop dfsadmin -report • hadoop dfsadmin -decommission datanodename • Web Interface • http://host:port/dfshealth.jsp 21
  • 23. ● A method for distributing computation across multiple nodes ● Each node processes the data that is stored at that node ● Consists of two main phases ◦ Map ◦ Reduce MapReduce Overview 23
  • 24. Now, Technically, What is MapReduce? • MapReduce is a programming model for efficient distributed computing • It works like a Unix pipeline • cat input | grep | sort | uniq -c | cat > output • Input | Map | Shuffle & Sort | Reduce | Output • Efficiency from • Streaming through data, reducing seeks • Pipelining • A good fit for a lot of applications • Log processing • Web index building 24
  • 25. MapReduce in 41 words. Goal: count the number of books in the library. • Map: You count up shelf #1, I count up shelf #2. (The more people we get, the faster this part goes) • Reduce: We all get together and add up our individual counts. 25
  • 26. Word Count Example • Mapper • Input: value: lines of text of input • Output: key: word, value: 1 • Reducer • Input: key: word, value: set of counts • Output: key: word, value: sum • Launching program • Defines this job • Submits job to cluster 26
  • 27. Word Count Data Flow 27
  • 28. Word Count Mapper public static class Map extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> { private static final IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word,one); } } } 28
  • 29. Word Count Reducer public static class Reduce extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } 29
  • 30. Word Count Main public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } 30
  • 31. Execution Framework • MapReduce program, a.k.a. a job: • Code of mappers and reducers • Code for combiners and partitioners (optional) • Configuration parameters • All packaged together • A MapReduce job is submitted to the cluster • The framework takes care of everything else • Next, we will delve into the details 31
  • 32. Scheduling • Each Job is broken into tasks • Map tasks work on fractions of the input dataset, as defined by the underlying distributed filesystem • Reduce tasks work on intermediate inputs and write back to the distributed filesystem • The number of tasks may exceed the number of available machines in a cluster • The scheduler takes care of maintaining something similar to a queue of pending tasks to be assigned to machines with available resources • Jobs to be executed in a cluster requires scheduling as well • Different users may submit jobs • Jobs may be of various complexity • Fairness is generally a requirement 32
  • 33. • NameNode • Holds the metadata for the HDFS • Secondary NameNode • Performs housekeeping functions for the NameNode • DataNode • Stores the actual HDFS data blocks • JobTracker • Manages MapReduce jobs • TaskTracker • Monitors individual Map and Reduce tasks Anatomy of a Hadoop Cluster 33
  • 35. Hadoop 1.0 vs Hadoop 2.0 35
  • 36. Hadoop 2.0 with YARN 36
  • 37. YARN • Yet Another Resource Negotiator • YARN Application Resource Negotiator (recursive acronym) • Remedies the scalability shortcomings of “classic” MapReduce: a single JobTracker per cluster limited clusters to about 4000 nodes. • Is more of a general-purpose framework, of which classic MapReduce is one application. • Replaces the inflexible slots of classic MapReduce, where each slot on a node ran either Map or Reduce tasks but not both, which caused underutilization of the cluster. 37
  • 38. YARN = Hadoop 2.0 = MRv2 • The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker (ie, resource management and job scheduling/monitoring) into separate daemons. • The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). 38
  • 39. Hadoop Common • Hadoop Common refers to the collection of common utilities and libraries that support other Hadoop modules. • It is an essential part or module of the Apache Hadoop Framework, along with the HDFS, Hadoop YARN and Hadoop MapReduce. • Like all other modules, Hadoop Common assumes that hardware failures are common and that these should be automatically handled in software by the Hadoop Framework. 39
  • 42. Pig • Started at Yahoo! Research • Now runs about 30% of Yahoo!’s jobs • Features • Expresses sequences of MapReduce jobs • Data model: nested “bags” of items • Provides relational (SQL) operators (JOIN, GROUP BY, etc.) • Easy to plug in Java functions 42
  • 43. Example • Suppose you have user data in a file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25 43
  • 46. HBase • Open source implementation of Google’s Bigtable • Row/column store • Billions of rows/millions on columns • Column-oriented - nulls are free • Online processing • Master/Slave architecture • Based on HDFS 46
  • 48. HBase Query • Retrieve a cell • Cell = table.getRow(“enclosure1”).getColumn(“animal:type”).getValue(); • Retrieve a row • RowResult = table.getRow( “enclosure1” ); • Scan through a range of rows • Scanner s = table.getScanner( new String[] { “animal:type” } ); 48
  • 49. Hive • Developed at Facebook • Used for majority of Facebook jobs • “Relational database” built on Hadoop • Maintains list of table schemas • SQL-like query language (HiveQL) • Can call Hadoop Streaming scripts from HiveQL • Supports table partitioning, clustering, complex data types, some optimizations 49
  • 50. Hive DDL 50 CREATE TABLE page_views (viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'User IP address') COMMENT 'This is the page view table' PARTITIONED BY (dt STRING, country STRING) STORED AS SEQUENCEFILE; ● Partitioning breaks table into separate files for each (dt, country) pair ○ Ex: /hive/page_view/dt=2015-06-08,country=USA /hive/page_view/dt=2015-06-08,country=CA
  • 51. A Simple Query • Find all page views coming from in March: • Hive only reads partitions “2015-03-*” instead of scanning entire table 51 SELECT page_views.* FROM page_views WHERE >= '2015-03-01' AND <= '2015-03-31' AND page_views.referrer_url like '';
  • 52. Mahout 52 • A Scalable machine learning and data mining library. • Apache Software Foundation project • Create scalable machine learning libraries • Why Mahout? Many Open Source ML libraries either: • Lack Community • Lack Documentation and Examples • Lack Scalability • Or are research-oriented
  • 53. Apache Ambari • Provision a Hadoop Cluster • Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts. • Ambari handles configuration of Hadoop services for the cluster. • Manage a Hadoop Cluster • Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster. • Monitor a Hadoop Cluster • Ambari provides a dashboard for monitoring health and status of the Hadoop cluster. • Ambari leverages Ambari Metrics System for metrics collection. • Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc). 53
  • 55. Others • Zookeeper • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. • Oozie • Oozie is a workflow scheduler system to manage Apache Hadoop jobs. • Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. • Sqoop • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. 55