The document discusses the Hadoop ecosystem and its key components. It describes how MapReduce processes large datasets in parallel across clusters of commodity hardware. The major components are HDFS for storage, Hive for SQL-like queries, and other tools like HBase, Zookeeper, and Sqoop. MapReduce jobs execute in phases (map, shuffle, sort, and reduce) to process large amounts of data efficiently in a distributed manner. The ecosystem lets users solve big data problems by breaking work into parallelizable tasks and processing data where it resides.
This presentation helps students learn about big data, Hadoop, HDFS, MapReduce, and the architecture of HDFS.
Spark is a big data processing tool written in Scala that runs on the Java Virtual Machine (JVM). It can be up to 100 times faster than Hadoop for iterative jobs because it keeps intermediate data in memory rather than writing it to disk. Spark uses Resilient Distributed Datasets (RDDs), which keep data in memory across transformations and actions and maintain lineage graphs that allow recovery from failures.
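A minimal PySpark sketch of those ideas, assuming a local pyspark installation; the data, app name, and numbers are illustrative:

```python
# A lazy pipeline of transformations followed by actions; cache() is
# what keeps the intermediate data in memory across jobs.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

nums = sc.parallelize(range(1, 1001))          # base RDD
squares = nums.map(lambda x: x * x)            # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)   # transformation (lazy)

# Keep the partitions in memory so iterative reuse avoids recomputing
# them from the lineage graph.
evens.cache()

print(evens.count())  # action: triggers the first computation
print(evens.sum())    # action: served from the in-memory cache
sc.stop()
```

If a cached partition is lost, Spark uses the lineage (parallelize, then map, then filter) to recompute just that partition.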
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was created in 2005 and is designed to reliably handle large volumes of data and complex computations in a distributed fashion. The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing data in parallel across large clusters of computers. It is widely adopted by companies handling big data like Yahoo, Facebook, Amazon and Netflix.
Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you understand MapReduce in detail. Topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab
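As a taste of topic 10, here is a pure-Python sketch of computing the max temperature per year in map/reduce style; the (year, temperature) line format is an assumption for illustration, not the dataset used in the hands-on:

```python
from collections import defaultdict

# Illustrative records in an assumed "year,temperature" format.
records = ["1949,111", "1949,78", "1950,22", "1950,-11", "1950,0"]

# Map: parse each record into a (year, temperature) pair.
mapped = []
for line in records:
    year, temp = line.split(",")
    mapped.append((year, int(temp)))

# Shuffle: group all temperatures by year.
by_year = defaultdict(list)
for year, temp in mapped:
    by_year[year].append(temp)

# Reduce: keep the maximum temperature per year.
for year, temps in sorted(by_year.items()):
    print(year, max(temps))  # 1949 111, then 1950 22
```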
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
I originally gave this presentation as an internal briefing at SDSC, based on my experience working with Spark to solve scientific problems.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for scalable, fault-tolerant storage and MapReduce for parallel processing, and how these two core components together provide scalable, cost-effective distributed computing on commodity hardware.
Hadoop Streaming allows any executable or script to be used as a MapReduce job. It works by launching the executable or script as a separate process and communicating with it via stdin and stdout. The executable or script receives key-value pairs in a predefined format and outputs new key-value pairs that are collected. Hadoop Streaming uses PipeMapper and PipeReducer to adapt the external processes to the MapReduce framework. It provides a simple way to run MapReduce jobs without writing Java code.
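As a sketch of that protocol, here is a minimal Streaming word-count pair in Python; the script names are my own, and tab-separated key-value pairs are the default Streaming format:

```python
#!/usr/bin/env python3
# mapper.py - reads raw text from stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - Hadoop delivers mapper output sorted by key, so equal
# keys arrive as contiguous runs that can be summed in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

Because the protocol is just pipes, the scripts can be tested locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py` before submitting them with the hadoop-streaming JAR (whose path varies by distribution).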
MapReduce is a programming model for processing large datasets in parallel. It works by breaking the dataset into independent chunks that are processed by the map function, then grouping the map outputs into partitions to be processed by the reduce function. Hadoop provides fault tolerance by having the JobTracker monitor the TaskTrackers and restart failed tasks. MapReduce programs can be written in languages other than Java using Hadoop Streaming.
Hive provides an SQL-like interface to query data stored in Hadoop's HDFS distributed file system and processed using MapReduce. It allows users without MapReduce programming experience to write queries that Hive then compiles into a series of MapReduce jobs. The document discusses Hive's components, data model, query planning and optimization techniques, and performance compared to other frameworks like Pig.
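For flavor, a hedged sketch of driving such a query from Python via the hive CLI's -e flag; the page_views table is hypothetical:

```python
import subprocess

# Hive compiles this SQL-like query into one or more MapReduce jobs
# that run over the table's files in HDFS.
query = """
SELECT ip, COUNT(*) AS hits
FROM page_views
GROUP BY ip
ORDER BY hits DESC
LIMIT 10
"""

subprocess.run(["hive", "-e", query], check=True)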
In this presentation I have explained the basic differences between the Hadoop architecture 1 and Hadoop architecture 2. I referred to websites while preparing it.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. MapReduce allows massively parallel processing of large datasets by breaking jobs into smaller tasks that run in parallel on multiple machines. HDFS stores very large files across machines in a distributed file system, providing fault tolerance.
What makes a big-data platform 'cloud-optimized'? Here's our (Qubole's) shot at it. @Cloud-Asia 2014.
Hadoop is a system for processing large amounts of data using MapReduce and HDFS. HDFS is the storage component that splits files into blocks and stores multiple copies for reliability. MapReduce is the processing framework where mappers process key-value pairs in parallel and reducers aggregate the outputs. While Hadoop can process huge datasets, other systems like Pig, Hive, HBase, Accumulo, Avro, ZooKeeper, and Flume provide additional functionality for tasks like SQL queries, real-time processing, coordination, serialization, and data aggregation.
Spark is a general-purpose cluster computing framework that provides high-level APIs and is faster than Hadoop for iterative jobs and interactive queries. It leverages cached data in cluster memory across nodes for faster performance. Spark supports various higher-level tools including SQL, machine learning, graph processing, and streaming.
Detailed documentation of the complex MapReduce execution architecture, with explanations of the terminology and of how a MapReduce JAR file is executed.
This document provides an introduction and overview of Apache Spark. It discusses why Spark is useful, describes some Spark basics including Resilient Distributed Datasets (RDDs) and DataFrames, and gives a quick tour of Spark Core, SQL, and Streaming functionality. It also provides some tips for using Spark and describes how to set up Spark locally. The presenter is introduced as a data engineer who uses Spark to load data from Kafka streams into Redshift and Cassandra. Ways to learn more about Spark are suggested at the end.
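As a small taste of the DataFrame side of that tour, here is a hedged PySpark sketch (local installation assumed; the data and view name are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 7)],
    ["user", "events"],
)

# The same aggregation two ways: DataFrame API, then Spark SQL.
df.groupBy("user").sum("events").show()

df.createOrReplaceTempView("activity")
spark.sql("SELECT user, SUM(events) AS total FROM activity GROUP BY user").show()

spark.stop()
```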
This document provides an overview of big data and Hadoop. It discusses the scale of big data, noting that Facebook handles 180PB per year and Twitter handles 1.2 million tweets per second. It also covers the volume, variety, and velocity challenges of big data. Hadoop and MapReduce are introduced as the leading solutions for distributed storage and processing of big data using a scale-out architecture. Key ideas of Hadoop include storing large data across multiple machines in HDFS and processing that data in parallel using MapReduce jobs.
This document summarizes Hadoop MapReduce, including its goals of distribution and reliability. It describes the roles of mappers, reducers, and other system components like the JobTracker and TaskTracker. Mappers process input splits in parallel and generate intermediate key-value pairs. Reducers sort and group the outputs by key before processing each unique key. The JobTracker coordinates jobs while the TaskTracker manages tasks on each node.
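That flow can be simulated in a few lines of plain Python; this toy sketch (illustrative data, no Hadoop) mirrors the map, sort/group, and reduce steps the summary describes:

```python
from itertools import groupby
from operator import itemgetter

splits = ["the cat sat", "the dog sat"]

# Map: each "mapper" turns its input split into (key, value) pairs.
intermediate = [(w, 1) for split in splits for w in split.split()]

# Shuffle/sort: sort by key so each unique key's values are contiguous.
intermediate.sort(key=itemgetter(0))

# Reduce: process each unique key with its grouped values.
for word, pairs in groupby(intermediate, key=itemgetter(0)):
    print(word, sum(v for _, v in pairs))  # cat 1, dog 1, sat 2, the 2
```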
Pyshark is a wrapper around the tshark command-line utility that can capture live network packets or read them from a capture file. Pyshark is useful for parsing capture data for analysis.
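A minimal pyshark sketch of both modes; the capture file name and interface are placeholders, and live capture requires tshark plus sufficient privileges:

```python
import pyshark

# Parse packets from an existing capture file...
cap = pyshark.FileCapture("capture.pcap")
for pkt in cap:
    print(pkt.highest_layer, pkt.length)
cap.close()

# ...or sniff a handful of packets from a live interface.
live = pyshark.LiveCapture(interface="eth0")
for pkt in live.sniff_continuously(packet_count=5):
    print(pkt)
```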
In this presentation I provide in-depth information about how MapReduce works. It contains many details about the execution steps, fault tolerance, and master/worker responsibilities.
This document provides an overview of Hadoop MapReduce scheduling algorithms. It discusses several commonly used algorithms like FIFO, fair scheduling, and capacity scheduler. It also introduces more advanced algorithms such as LATE, SAMR, ESAMR, locality-aware scheduling, and center-of-gravity scheduling that aim to improve metrics like fairness, throughput, response time, and resource utilization. The document concludes by listing references for further reading on MapReduce scheduling techniques.
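As a deliberately simplified illustration of how the first two differ, here is a toy Python sketch; the job names, task counts, and slot count are made up, and real schedulers also handle priorities, preemption, and redistribution of unused shares:

```python
jobs = {"job1": 8, "job2": 4, "job3": 4}  # pending tasks, in submission order
slots = 8

# FIFO: the earliest-submitted job takes every free slot first.
fifo, free = {}, slots
for job, tasks in jobs.items():
    fifo[job] = min(tasks, free)
    free -= fifo[job]
print("FIFO:", fifo)  # {'job1': 8, 'job2': 0, 'job3': 0}

# Fair: every running job gets an equal share of the slots.
share = slots // len(jobs)
fair = {job: min(tasks, share) for job, tasks in jobs.items()}
print("Fair:", fair)  # {'job1': 2, 'job2': 2, 'job3': 2}
```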
The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.
This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
This Hadoop MapReduce tutorial will unravel MapReduce Programming, MapReduce Commands, MapReduce Fundamentals, Driver Class, Mapper Class, Reducer Class, Job Tracker & Task Tracker. At the end, you'll have a strong knowledge of Hadoop MapReduce basics. PPT Agenda: ✓ Introduction to BIG Data & Hadoop ✓ What is MapReduce? ✓ MapReduce Data Flows ✓ MapReduce Programming ---------- What is MapReduce? MapReduce is a programming framework for distributed processing of large datasets via commodity computing clusters. It is based on the principle of parallel data processing: data is broken into smaller blocks and processed in parallel rather than as a single block, which makes the solution faster and more scalable. MapReduce programs are typically written in Java. ---------- What are the MapReduce components? 1. Combiner: performs local aggregation of each mapper's output based on your desired filters (for example, pre-aggregating data by day, week, month, or year) before the data is sent on for the reduce phase. 2. Job Tracker: schedules the job and allocates the work across multiple servers. 3. Task Tracker: executes the assigned tasks on each server. 4. Reducer: aggregates the intermediate outputs from across the multiple servers into the desired result. ---------- Applications of MapReduce 1. Data Mining 2. Document Indexing 3. Business Intelligence 4. Predictive Modelling 5. Hypothesis Testing ---------- Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in BIG Data & Hadoop featuring real-time projects, 24/7 lifetime support & 100% placement assistance. Email: sales@skillspeed.com Website: https://www.skillspeed.com
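Since the combiner is essentially a local reducer, a related pattern worth seeing in code is in-mapper combining; this Streaming-style Python sketch is an illustration of mine, not code from the tutorial:

```python
#!/usr/bin/env python3
# A mapper that aggregates counts locally before emitting anything,
# shrinking the intermediate data a combiner would otherwise handle.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    counts.update(line.split())

for word, count in counts.items():
    print(f"{word}\t{count}")
```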
Hadoop MapReduce is an open-source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes, while the framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases: the map phase, where data is processed into key-value pairs, and the reduce phase, where the outputs of the map phase are aggregated. It provides an easy programming model for developers writing distributed applications for large-scale processing of structured and unstructured data.
In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications. YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers core MapReduce concepts like HDFS and the MapReduce programming model. The document includes several code examples and screenshots related to Hadoop and MapReduce.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
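The dataset-join task mentioned above is commonly done as a reduce-side join; here is a toy single-process Python illustration (made-up data) of the tag, shuffle, and pair-up steps:

```python
from collections import defaultdict

users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "mug")]

# Map: tag each record with its source so the reducer can tell them apart.
mapped = [(k, ("user", v)) for k, v in users]
mapped += [(k, ("order", v)) for k, v in orders]

# Shuffle: group tagged values by join key.
groups = defaultdict(list)
for k, tagged in mapped:
    groups[k].append(tagged)

# Reduce: pair every user name with every order sharing the same key.
for k, tagged in sorted(groups.items()):
    names = [v for tag, v in tagged if tag == "user"]
    items = [v for tag, v in tagged if tag == "order"]
    for name in names:
        for item in items:
            print(k, name, item)  # u1 alice book, u1 alice pen, u2 bob mug
```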
Miss Goodheart created a PowerPoint to test out the Slideshare tool, which was introduced to her by Sharon Tonner on January 20th, 2011.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
Big Data Characteristics. Contents: Explosion in Quantity of Data; Importance of Big Data; Usage Example in Big Data; Challenges in Big Data; Hadoop Ecosystem.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. The core of Hadoop consists of HDFS for storage and MapReduce for processing. Hadoop has been expanded with additional projects including YARN for job scheduling and resource management, Pig for high-level dataflow scripts, Hive for SQL-like queries, HBase for column-oriented storage, Zookeeper for coordination, and Ambari for provisioning and managing Hadoop clusters. Hadoop provides scalable and cost-effective solutions for storing and analyzing massive amounts of data.