Hadoop Eco System
• Why Big Data?
• Ingredients of Big Data Eco System
• Working with Map Reduce
• Phases of MR
• Hive
• Use case
• Conclusion
• Data is CHANGING and getting MESSY
• Previously structured, now largely unstructured
• Non-uniform
• Many distributed contributors to the data
• Mobile, PDA, Tablet, sensors.
• Domains: Financial, Healthcare, Social Media
Why Big Data?

• MapReduce - A programming model for processing big data in parallel on clusters, using map and reduce steps.
• HDFS - The distributed file system used by Hadoop.
• Hive - A SQL-based query engine for non-Java programmers.
• Pig - A data flow language and execution environment for exploring very large datasets.
Ingredients of Eco System
• HBase - A distributed, column-oriented database.
• ZooKeeper - A distributed, highly available coordination service.
• Sqoop - A tool for efficiently moving data between
relational databases and HDFS.
Ingredients cont.
• Protocols used: RPC/HTTP for communication between commodity machines.
• Runs in pseudo-distributed (single-node) mode or on a cluster
• Components- Daemons
• NameNode
• DataNode
• JobTracker
• TaskTracker
Hadoop Internals
• Map  Function which maps for each of
the data available
• Reduce  Function which is used for
aggregation or reduction
Working with Map Reduce

• f = Σ_{n=0}^{10} n(n-1)/2
• Map: for each n from 0 to 10, compute map(n) = n(n-1)/2
• Reduce: Σ([values]) is the aggregation/reduction function
Each map call is independent of the others, so the work can run in parallel
MR as a function
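The sum above can be sketched in plain Python to show how the map and reduce steps separate. This is only an illustration using the built-in `map` and `functools.reduce`, not Hadoop code; the names `mapper` and `reducer` are chosen here for the example.

```python
from functools import reduce

def mapper(n):
    """Map step: compute the single term n(n-1)/2 for one input n."""
    return n * (n - 1) // 2  # n(n-1) is always even, so integer division is exact

def reducer(acc, value):
    """Reduce step: aggregate the mapped values by summing them."""
    return acc + value

inputs = range(0, 11)               # n = 0 .. 10
mapped = list(map(mapper, inputs))  # terms are independent, so this step could be parallelized
total = reduce(reducer, mapped, 0)  # aggregation over all mapped values
print(total)  # 165
```

Only the reduce step needs to see all values; the map step could be split across any number of workers.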
MR as representation
• Map: <K1, V1> → list of <K2, V2>
• list(V2): the list of values collected for key K2
• Reduce: <K2, list(V2)> ~ <K3, V3>
• ~ denotes the reduction operation
• Reduced output with specific keys and values
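The key/value representation can be made concrete with a small word-count sketch. It is a plain-Python illustration, not the Hadoop API: `map_fn`, `reduce_fn`, `run`, and the sample lines are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """<K1, V1> = (line offset, line text) -> zero or more <K2, V2> = (word, 1)."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """<K2, list(V2)> -> <K3, V3> = (word, total count)."""
    return word, sum(counts)

def run(lines: Iterable[str]) -> Dict[str, int]:
    grouped: Dict[str, List[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)  # collect the list of V2 values for each K2
    # Call the reducer once per key, in sorted key order
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))

result = run(["big data big cluster", "data on hdfs"])
print(result)
```

Here K1 is an integer offset, K2 and K3 are words, and V2 and V3 are counts; in general the three key/value types need not match.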
• Data on HDFS
• Input partitioning: FileSplit, InputSplit
• Map
• Partition (map output is assigned to reducers)
• Shuffle
• Sort
• Reduce
• Aggregated data on HDFS
Phases of MR
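The phases can be traced end to end with a small in-memory simulation. This is a hedged sketch in plain Python, not how Hadoop implements them: the function names, the CRC32-based partitioner, and the sample records are assumptions chosen so that the example is deterministic.

```python
import zlib
from collections import defaultdict

def input_splits(records, split_size):
    """Input partitioning: cut the input into fixed-size splits, one per map task."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]

def map_task(split):
    """Map: emit (word, 1) for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]

def partition(key, num_reducers):
    """Partition: pick the reducer for a key (CRC32 is an assumption, for determinism)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(map_outputs, num_reducers):
    """Shuffle: route each pair to its reducer. Sort: order each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]

def reduce_task(sorted_groups):
    """Reduce: aggregate the value list of each key."""
    return [(key, sum(values)) for key, values in sorted_groups]

records = ["a b a", "c b", "a c c"]
splits = input_splits(records, 2)                  # two splits, so two map tasks
map_outputs = [map_task(s) for s in splits]
reducer_inputs = shuffle_and_sort(map_outputs, 2)  # two reduce tasks
outputs = [reduce_task(r) for r in reducer_inputs]
merged = dict(pair for part in outputs for pair in part)
print(merged)
```

Every occurrence of a key lands in the same reducer, which is what lets each reduce task aggregate independently of the others.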
Phases of MR depicted

Data flow in MR
MapReduce data flow with multiple reduce tasks
Shuffle and Sort phase
• Architecture
HDFS Hadoop Distributed File System
HDFS- Client Read

HDFS- Client Write
• List all the files and directories in HDFS
• $ hadoop fs -lsr
• Put a file into HDFS
• $ hadoop fs -put <from path> <to path>
• Get files from HDFS
• $ hadoop fs -get <from path>
• Run a jar file
• $ hadoop jar <jarfile> <className> <input path> <output path>
HDFS - cli
• Job Configuration
• Key files core-site.xml, mapred-
• Specific job configuration can be
provided in the code
Map Reduce cont.
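The idea that settings come from site files and can be overridden per job can be sketched as follows. This is a plain-Python illustration of the layering only, not Hadoop's own configuration loader. The XML layout mirrors Hadoop's `<configuration>`/`<property>` files, the property names follow Hadoop 2 naming (older releases use different names), and the host, port, and values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical site file content; the values are illustrative only.
SITE_XML = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
  <property><name>mapreduce.job.reduces</name><value>1</value></property>
</configuration>"""

def load_site_config(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

def job_config(site, overrides):
    """Layer job-specific settings on top of the site-wide ones."""
    merged = dict(site)       # copy so the site defaults stay untouched
    merged.update(overrides)  # settings provided in code win
    return merged

site = load_site_config(SITE_XML)
conf = job_config(site, {"mapreduce.job.reduces": "4"})
print(conf)
```

The job sees four reducers while the site default stays at one, and settings that the job does not override fall through unchanged.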
MR job in action

• Job Scheduling
• Fair scheduler
• Capacity scheduler
Job Scheduling
• Each submitted job is placed in a job pool
• Supports preemption
• If no pools are created and only one job is running, that job runs on its own with the whole cluster
Fair Scheduler
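The pool idea can be illustrated with a much simplified allocation sketch. It assumes equal-weight pools and an even split of slots, first across pools and then across the jobs inside each pool; preemption, weights, and minimum shares are not modeled. The function `fair_shares` and the pool and job names are hypothetical.

```python
def fair_shares(total_slots, pools):
    """Split slots evenly across non-empty pools, then evenly across each pool's jobs.

    pools maps a pool name to the list of job names it holds.
    Returns a dict mapping job name -> number of slots.
    """
    active = {name: jobs for name, jobs in pools.items() if jobs}
    shares = {}
    if not active:
        return shares
    pool_names = sorted(active)
    base, extra = divmod(total_slots, len(pool_names))
    for i, name in enumerate(pool_names):
        pool_slots = base + (1 if i < extra else 0)  # spread the remainder over the first pools
        jobs = active[name]
        job_base, job_extra = divmod(pool_slots, len(jobs))
        for j, job in enumerate(jobs):
            shares[job] = job_base + (1 if j < job_extra else 0)
    return shares

single = fair_shares(10, {"default": ["job1"]})
shared = fair_shares(10, {"alice": ["a1", "a2"], "bob": ["b1"]})
print(single)  # {'job1': 10}
print(shared)  # {'a1': 3, 'a2': 2, 'b1': 5}
```

A single job gets every slot, while two pools split the cluster in half regardless of how many jobs each one holds.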
• Supports Multi user scheduling
• Depends on the clusters, number of
queues and hierarchical way jobs are
• One queue may be child of another
• Enforces fair scheduling within each job
Capacity scheduler
Map reduce Input Formats


• Map-side join
• A join between large inputs that is performed before the data reaches the map function
• Reduce-side join
• Input datasets don’t have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle.
MR Joins
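A reduce-side join can be sketched in plain Python: each mapper tags its records with their source, the shuffle groups both datasets by the join key, and the reducer pairs them up. The datasets, the tags `"C"` and `"O"`, and the function names are assumptions made for this example, and the result is an inner join.

```python
from collections import defaultdict

def map_customers(record):
    """Tag customer records so the reducer can tell the two sources apart."""
    custid, name = record
    return custid, ("C", name)

def map_orders(record):
    """Tag order records with a different marker."""
    custid, amount = record
    return custid, ("O", amount)

def reduce_join(custid, tagged_values):
    """Pair every customer name with every order amount for one key."""
    names = [value for tag, value in tagged_values if tag == "C"]
    amounts = [value for tag, value in tagged_values if tag == "O"]
    return [(custid, name, amount) for name in names for amount in amounts]

customers = [(1, "Asha"), (2, "Ben")]
orders = [(1, 250), (2, 40), (1, 75), (3, 10)]

# Shuffle: both datasets are grouped by the join key.
shuffled = defaultdict(list)
for record in customers:
    key, value = map_customers(record)
    shuffled[key].append(value)
for record in orders:
    key, value = map_orders(record)
    shuffled[key].append(value)

joined = []
for custid in sorted(shuffled):
    joined.extend(reduce_join(custid, shuffled[custid]))
print(joined)  # [(1, 'Asha', 250), (1, 'Asha', 75), (2, 'Ben', 40)]
```

Every record from both inputs passes through the shuffle, which is the cost noted above. The order for customer 3 is dropped because no customer record shares its key.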
• Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to query data stored in HDFS
• Developed at Facebook and later contributed to the Apache open source project
• Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster
• Unpack the tar.gz archive
• % tar xzf hive-x.y.z-dev.tar.gz
• Put Hive on your path for convenience
• % export HIVE_INSTALL=/home/tom/hive-x.y.z-
• % export PATH=$PATH:$HIVE_INSTALL/bin
• Launch the Hive shell and list the tables
• hive> SHOW TABLES;
Hive Infrastructure
Hive Modules

Hive Data Types
• Creating table
• CREATE TABLE rank_customer(custid STRING,
• Load data
• LOAD DATA LOCAL INPATH 'input/dir/customerrank.dat' OVERWRITE INTO TABLE rank_customer;
• Check the data in the warehouse directory
• $ ls /user/hive/warehouse/rank_customer/
• SELECT c.custid, c.score, c.location FROM
rank_customer c ORDER BY c.custid ASC,
c.location ASC, c.score DESC;
Commands cont.
• hive> CREATE DATABASE financials WITH
DBPROPERTIES ('creator' = 'MGP', 'date' =
• hive> DROP DATABASE IF EXISTS financials;
• hive> ALTER DATABASE financials SET
DBPROPERTIES ('edited-by' = 'Joe Dba');
• hive> DROP TABLE IF EXISTS employees;
• hive> ALTER TABLE log_messages RENAME TO
DDL Commands

• Determine the rank of each customer based on customer id and locality. The highest scorer gets the highest rank.
• Input Output
Use case
• Custom Writable
Using Map Reduce
• CustomWritable methods overridden
CustomWritable cont.
Driver code

Mapper Code
Partitioner Code
Sort Comparator class
Reducer Code
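The MapReduce pieces named above (composite key, partitioner, sort comparator, reducer) follow the secondary-sort pattern. Their Java code is not reproduced in this text, so the following is only a plain-Python stand-in for the same idea, with hypothetical sample rows: partition by (custid, location), sort by score descending, then number the rows in the reducer. Ties share a rank, as with SQL `RANK()`.

```python
from itertools import groupby

def rank_customers(rows):
    """rows are (custid, score, location) tuples.

    Returns (custid, score, location, rank) tuples, ranked within each
    (custid, location) group by score, highest first.
    """
    # Composite sort key: group fields first, then score descending.
    ordered = sorted(rows, key=lambda r: (r[0], r[2], -r[1]))
    ranked = []
    # Grouping plays the role of the partitioner and grouping comparator.
    for _, group in groupby(ordered, key=lambda r: (r[0], r[2])):
        previous_score, rank = None, 0
        for position, (custid, score, location) in enumerate(group, start=1):
            if score != previous_score:  # a new score starts a new rank
                rank, previous_score = position, score
            ranked.append((custid, score, location, rank))
    return ranked

rows = [
    ("c1", 70, "pune"),
    ("c1", 90, "pune"),
    ("c1", 90, "pune"),
    ("c2", 55, "delhi"),
    ("c1", 60, "delhi"),
]
ranked = rank_customers(rows)
for row in ranked:
    print(row)
```

Because the rows arrive at the reducer already sorted, ranking needs only a single pass over each group and no in-memory sort of its own.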

• hive> SELECT custid, score, location,
rank() OVER (PARTITION BY custid, location ORDER BY score DESC) AS myrank
FROM rank_customer;
Hive Query
Hive results
• The Hadoop ecosystem is designed mainly for large numbers of large files.
• It is less suitable for large numbers of small files.
• Achieving the parallelism on the huge
• Mapping and Reducing are the key and
core functions to achieve parallelism.
• The Hadoop ecosystem works efficiently with commodity hardware.
• Distributed hardware can be efficiently utilized.
• Hadoop MapReduce code is typically written in Java.
• Hive makes Hadoop accessible to SQL programmers, though Java MR jobs still run internally.
Conclusion cont.

