Apache Hadoop
Core & Ecosystem
UOIT - Faculty of Business and IT
Hamzeh Khazaei
hkh@yorku.ca
November 16, 2015
Agenda
• Data management system
• Conventional Data
• Big Data
• Hadoop
• File System
• Computation Paradigm
• YARN
• Subprojects
• NoSQL Datastores
• A Real Project
2
Database management system
• Relational database management system
• Structured data
• SQL
• Standard interface
• Vertical Scalability
• High-end servers
3
Big Data
4
5
Questions?
1. Who faced these challenges first?
2. When were they confronted with these challenges?
3. What was their solution?
4. What are the opportunities?
5. What is the role of cloud here?
6. What is next?
6
7
It is all about:
“How to store and process big data
with reasonable cost and time?”
Definitions
• Apache Hadoop is an open-source software project that enables distributed
processing of large data sets across clusters of commodity servers. It is designed
to scale out from a single server to thousands of machines, with a very high
degree of fault tolerance. (IBM)
• Apache Hadoop is a Java-based open-source framework for distributed storage
and processing of large sets of data on commodity hardware. Hadoop enables
businesses to quickly gain insight from massive amounts of structured and
unstructured data. (Hortonworks)
• Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs. (SAS)
8
Who Uses Hadoop?
9
History, Google vs Hadoop
10
Developer group: Google | Apache
Sponsor: Google | Yahoo, Amazon
File system: GFS (2003) | HDFS (2005)
Programming model: MapReduce (2004) | Hadoop MapReduce (2005)
Storage system: BigTable (2006) | HBase (2010)
Search engine: Google | Nutch
Uses for Hadoop
● Data-intensive text processing
● Assembly of large genomes
● Graph mining
● Machine learning and data mining
● Large-scale social network analysis
● Log analytics
● Health informatics
● Smart cities
11
Hadoop Core
• Hadoop Common: contains libraries and other shared modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
12
13
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 200PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move
to where data resides
• Provides very high aggregate bandwidth
14
HDFS - Specifications
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 64MB block size (128MB by default in Hadoop 2)
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
15
HDFS - Architecture 1
• Master/Slave Architecture
• NameNode
• Metadata Server
• File location (file name -> DataNodes holding its blocks)
• File attributes (atime/ctime/mtime, size, number of replicas)
• DataNode
• Manages the storage attached to the node it runs on
• Client
• Producers and consumers of data
16
HDFS - Architecture 2
17
HDFS I/O
• A typical read from a client involves:
a) Contact the NameNode to determine where the actual data is stored
b) NameNode replies with block identifiers and locations (i.e., which
DataNode)
c) Contact the DataNode to fetch data
• A typical write from a client involves:
a) Contact the NameNode to update the namespace and verify permissions
b) NameNode allocates a new block on a suitable DataNode
c) The client directly streams to the selected DataNode
d) Currently, HDFS files are immutable
• Data never moves through the NameNode, so the NameNode does not become a
bottleneck
18
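To make this read path concrete, here is a minimal, hedged sketch (not taken from the slides) of reading an HDFS file through the standard FileSystem client API; the NameNode lookup and the direct DataNode streaming described above happen inside the returned stream, and the path comes from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);           // resolves to HDFS when fs.defaultFS points at the cluster
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file's blocks to stdout
    }
  }
}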
Data Replication
• Default replication is 3-fold
19
HDFS Replication
• By default, HDFS stores 3 separate copies of each
block
• This ensures reliability, availability and performance
• Replication policy
• Spread replicas across different racks
• Robust against cluster node failures
• Robust against rack failures
• Block replication benefits MapReduce
• Scheduling decisions can take replicas into account
• Exploit better data locality
20
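As a practical aside, the default replication factor is set cluster-wide with the dfs.replication property in hdfs-site.xml, and the replication of an existing file can be changed from the command line with hadoop fs -setrep.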
User Interface
• Commands for HDFS User:
• hadoop dfs -mkdir /foodir
• hadoop dfs -cat /foodir/myfile.txt
• hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator:
• hadoop dfsadmin -report
• hadoop dfsadmin -decommission datanodename
• Web Interface
• http://host:port/dfshealth.jsp
21
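Note that hadoop dfs is the legacy form of these commands; current releases use hdfs dfs (or hadoop fs) instead, e.g. hdfs dfs -mkdir /foodir.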
22
MapReduce Overview
● A method for distributing computation across multiple nodes
● Each node processes the data that is stored at that node
● Consists of two main phases
◦ Map
◦ Reduce
23
Now, Technically, What is MapReduce?
• MapReduce is a programming model for efficient
distributed computing
• It works like a Unix pipeline
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
• Streaming through data, reducing seeks
• Pipelining
• A good fit for a lot of applications
• Log processing
• Web index building
24
MapReduce in 41 words.
Goal: count the number of books in the library.
• Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual counts.
25
Word Count Example
• Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1
• Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program
• Defines this job
• Submits job to cluster
26
Word Count Data Flow
27
Word Count Mapper
// Uses the original org.apache.hadoop.mapred API, as shown on the slide.
// Required imports: java.io.IOException, java.util.StringTokenizer, the Hadoop io types
// (IntWritable, LongWritable, Text) and mapred types (MapReduceBase, Mapper, OutputCollector, Reporter).
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // map() is an instance method (not static); it emits (word, 1) for every token in the line
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
28
Word Count Reducer
// Same org.apache.hadoop.mapred API; also needs java.util.Iterator and
// org.apache.hadoop.mapred.Reducer on the import list.
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // reduce() is an instance method (not static); it sums the counts emitted for each word
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
29
Word Count Main
// Driver code. Note that this driver uses the newer org.apache.hadoop.mapreduce API
// (Configuration, Job); the Mapper and Reducer on the previous slides are written against
// the older mapred API and would need to be ported to this API to run with this driver.
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
30
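Assuming the mapper, reducer, and driver are packaged into a jar (wordcount.jar and a WordCount driver class are hypothetical names), the job would typically be launched with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir>.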
Execution Framework
• MapReduce program, a.k.a. a job:
• Code of mappers and reducers
• Code for combiners and partitioners (optional)
• Configuration parameters
• All packaged together
• A MapReduce job is submitted to the cluster
• The framework takes care of everything else
• Next, we will delve into the details
31
Scheduling
• Each Job is broken into tasks
• Map tasks work on fractions of the input dataset, as defined by the
underlying distributed filesystem
• Reduce tasks work on intermediate inputs and write back to the distributed
filesystem
• The number of tasks may exceed the number of available
machines in a cluster
• The scheduler takes care of maintaining something similar to a queue of
pending tasks to be assigned to machines with available resources
• Jobs to be executed in a cluster require scheduling as well
• Different users may submit jobs
• Jobs may be of various complexity
• Fairness is generally a requirement
32
Anatomy of a Hadoop Cluster
• NameNode
• Holds the metadata for the HDFS
• Secondary NameNode
• Performs housekeeping functions for the NameNode
• DataNode
• Stores the actual HDFS data blocks
• JobTracker
• Manages MapReduce jobs
• TaskTracker
• Monitors individual Map and Reduce tasks
33
34
Hadoop 1.0 vs Hadoop 2.0
35
Hadoop 2.0 with YARN
36
YARN
• Yet Another Resource Negotiator
• YARN Application Resource Negotiator
(Recursive Acronym)
• Remedies the scalability shortcomings of “classic” MapReduce, where a single
JobTracker per cluster limits the cluster to roughly 4,000 nodes
• Is a more general-purpose framework, of which classic MapReduce is just one
application
• Classic MapReduce also uses inflexible per-node slots (each slot runs either Map
or Reduce tasks, not both), which causes cluster underutilization
37
YARN = Hadoop 2.0 = MRv2
• The fundamental idea of MRv2 is to split up the two major functionalities of the
JobTracker (i.e., resource management and job scheduling/monitoring) into
separate daemons.
• The idea is to have a global ResourceManager (RM) and a per-application
ApplicationMaster (AM).
38
Hadoop Common
• Hadoop Common refers to the collection of common
utilities and libraries that support other Hadoop modules.
• It is an essential part or module of the Apache Hadoop
Framework, along with the HDFS, Hadoop YARN and
Hadoop MapReduce.
• Like all other modules, Hadoop Common assumes that
hardware failures are common and that these should be
automatically handled in software by the Hadoop
Framework.
39
40
Hadoop Ecosystem
41
Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY,
etc.)
• Easy to plug in Java functions
42
Example
• Suppose you have user data in one file and website data in another, and you
need to find the top 5 most visited pages by users aged 18-25
43
in MapReduce
44
in Pig Latin
45
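Slides 44 and 45 show the same query as code screenshots, which did not survive the export. As a hedged reconstruction, the Pig Latin version (the canonical form of this example from the Pig documentation; file names and aliases are illustrative) is roughly:

Users = LOAD 'users' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites';

Each line is a single relational transformation, and Pig compiles the whole script into a sequence of MapReduce jobs; the equivalent hand-written MapReduce program shown on slide 44 is far longer, which is the contrast the two slides make.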
HBase
• Open source implementation of Google’s
Bigtable
• Row/column store
• Billions of rows / millions of columns
• Column-oriented - nulls are free
• Online processing
• Master/Slave architecture
• Based on HDFS
46
HBase Schema
47
HBase Query
• Retrieve a cell
• Cell cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
• RowResult row = table.getRow("enclosure1");
• Scan through a range of rows
• Scanner s = table.getScanner(new String[] { "animal:type" });
48
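The calls above use a long-deprecated HBase client API. Purely as a hedged sketch (not from the slides), the same lookups against the current org.apache.hadoop.hbase.client API would look roughly like this; the table name "zoo" is hypothetical, and the animal:type column follows the schema slide:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQueries {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("zoo"))) {  // "zoo" is a hypothetical table name
      // Retrieve one cell: row "enclosure1", column family "animal", qualifier "type"
      Result row = table.get(new Get(Bytes.toBytes("enclosure1")));
      byte[] type = row.getValue(Bytes.toBytes("animal"), Bytes.toBytes("type"));
      System.out.println(Bytes.toString(type));
      // Scan a range of rows, reading only the animal:type column
      Scan scan = new Scan().addColumn(Bytes.toBytes("animal"), Bytes.toBytes("type"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}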
Hive
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains list of table schemas
• SQL-like query language (HiveQL)
• Can call Hadoop Streaming scripts from HiveQL
• Supports table partitioning, clustering, complex data
types, some optimizations
49
Hive DDL
50
CREATE TABLE page_views (viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
STORED AS SEQUENCEFILE;
● Partitioning breaks table into separate files for each (dt,
country) pair
○ Ex: /hive/page_view/dt=2015-06-08,country=USA
/hive/page_view/dt=2015-06-08,country=CA
A Simple Query
• Find all page views coming from xyz.com in
March:
• Hive only reads the partitions matching "2015-03-*" instead of scanning the
entire table, because dt is a partition column
51
SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2015-03-01'
AND page_views.dt <= '2015-03-31'
AND page_views.referrer_url LIKE '%xyz.com';
Mahout
52
• A Scalable machine learning and data mining
library.
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many Open Source ML libraries either:
• Lack Community
• Lack Documentation and Examples
• Lack Scalability
• Or are research-oriented
Apache Ambari
• Provision a Hadoop Cluster
• Ambari provides a step-by-step wizard for installing Hadoop services across any number of
hosts.
• Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop Cluster
• Ambari provides central management for starting, stopping, and reconfiguring Hadoop
services across the entire cluster.
• Monitor a Hadoop Cluster
• Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
• Ambari leverages Ambari Metrics System for metrics collection.
• Ambari leverages Ambari Alert Framework for system alerting and will notify you when
your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
53
54
Others
• Zookeeper
• ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services. All of these kinds of services are used in some form or
another by distributed applications.
• Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
• Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases.
55
56
Hortonworks
57
58
