1
Hadoop Eco System
• Why Big Data?
• Ingredients of Big Data Eco System
• Working with Map Reduce
• Phases of MR
• HDFS
• Hive
• Use case
• Conclusion
Agenda
2
• Big Data is NOT just about size; it is about
how important the data is within a large
chunk
• Data is changing and getting messy
• Previously structured, now largely unstructured
• Non-uniform
• Many distributed contributors to the data
• Mobile, PDA, tablet, sensors
• Domains: financial, healthcare, social media
Why Big Data?
3
Glimpse
4
• Map Reduce – a technique for solving big data problems by
applying map and reduce functions on clusters
• HDFS – the distributed file system used by Hadoop
• Hive – an SQL-based query engine for non-Java programmers
• Pig – a data flow language and execution environment for
exploring very large datasets
Ingredients of Eco System
5
• HBase – a distributed, column-oriented database.
• ZooKeeper – a distributed, highly available coordination
service.
• Sqoop – a tool for efficiently moving data between
relational databases and HDFS.
Ingredients cont.
6
• Protocols used – RPC/HTTP for intercommunication
between commodity hardware nodes
• Runs on a pseudo-distributed node or on clusters
• Components – daemons:
• NameNode
• DataNode
• JobTracker
• TaskTracker
Hadoop Internals
7
• Map → a function applied to each element of
the available data
• Reduce → a function used for
aggregation or reduction
Working with Map Reduce
8
• f = Σ {n=0 .. 10} n(n-1)/2
• Map: ∀ n from 0 to 10,
compute n(n-1)/2
• Reduce: Σ([values]) is the
aggregation/reduction function
Each term is independent, hence parallelism can be achieved
MR as a function
9
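The function above can be sketched directly as a map step followed by a reduce step. This is plain Python for illustration, not Hadoop API code:

```python
# A minimal map/reduce sketch of the slide's function
# f = sum over n = 0..10 of n(n-1)/2, split into the two phases.
from functools import reduce

# map phase: compute one independent term per n (these could run in parallel)
mapped = [n * (n - 1) // 2 for n in range(11)]

# reduce phase: aggregate the mapped values into the final result
total = reduce(lambda a, b: a + b, mapped, 0)

print(total)  # 165
```

Because each mapped term depends only on its own n, the map phase can be distributed across cluster nodes; only the reduce step needs to see all the values.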
MR as representation
10
• Map <K1, V1> → <K2, V2>
• V2 – the list of values for key K2
• Reduce <K2, V2> → <K3, V3>
• where ~ denotes the reduction operation
• Reduced output with specific keys and values
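The <K1, V1> → <K2, V2> → <K3, V3> flow can be illustrated with the classic word-count example, again as a plain-Python sketch rather than Hadoop's Java API:

```python
# Word count expressed in the <K, V> representation of the slide.
from collections import defaultdict

def map_phase(doc_id, text):
    # <K1=doc_id, V1=text>  ->  list of <K2=word, V2=1>
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # group values by key, producing K2 -> list of V2
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # <K2, list(V2)>  ->  <K3=word, V3=total count>
    return (word, sum(counts))

pairs = map_phase(1, "to be or not to be")
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```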
• Data on HDFS
• Input partition – FileSplit, InputSplit
• Map
• Shuffle
• Sort
• Partition
• Reduce
• Aggregated data on HDFS
Phases of MR
11
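The partition phase listed above decides which reducer receives each key. Hadoop's default HashPartitioner takes the key's hashCode modulo the number of reducers; a toy stable-hash version of the same idea:

```python
def partition(key, num_reducers):
    # Stable toy string hash; Hadoop's default HashPartitioner uses
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_reducers

# Every occurrence of the same key lands on the same reducer,
# which is what makes per-key aggregation in the reduce phase possible.
print(partition("hadoop", 4))
```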
Phases of MR depicted
12
Data flow in MR
13
MapReduce data flow with multiple reduce tasks
Shuffle and Sort phase
14
• Architecture
HDFS – Hadoop Distributed File System
15
HDFS- Client Read
16
HDFS- Client Write
17
• List all files and directories in HDFS
• $ hadoop fs -lsr
• Put a file into HDFS
• $ hadoop fs -put <local path> <HDFS path>
• Get a file from HDFS
• $ hadoop fs -get <HDFS path> <local path>
• Run a jar file
• $ hadoop jar <jarfile> <className> <input
path> <output path>
HDFS - cli
18
• Job configuration
• Key files: core-site.xml,
mapred-site.xml
• Job-specific configuration can be
provided in the code
Map Reduce cont.
19
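As a minimal sketch of those two key files (property names from the classic Hadoop 1 configuration; the host/port values are placeholders, not taken from the original slides):

```xml
<!-- core-site.xml: where the HDFS namenode lives (placeholder host/port) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: where the JobTracker lives (placeholder host/port) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```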
MR job in action
20
• Job Scheduling
• Fair scheduler
• Capacity scheduler
Job Scheduling
21
• A job is planned and placed in a job pool
• Supports preemption
• If no pools are created and only one job is
available, the job runs as is
Fair Scheduler
22
• Supports multi-user scheduling
• Depends on the cluster, the number of
queues, and the hierarchical way jobs are
scheduled
• One queue may be a child of another
queue
• Enforces fair scheduling within each job
pool
Capacity scheduler
23
Map reduce Input Formats
24
• Map-side join
• Works for large inputs by performing the join
before the data reaches the map function
• Reduce-side join
• The input datasets don't have to be structured in
any particular way, but it is less efficient because
both datasets have to go through the Map
Reduce shuffle
MR Joins
25
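The reduce-side join can be sketched in plain Python: each side's mapper emits (join key, tagged record), the shuffle groups both sides by key, and the reducer crosses the two sides. The table and field names here are made up for illustration:

```python
# Reduce-side join sketch: tag records by source, group by join key, cross.
from collections import defaultdict

customers = [("c1", "Alice"), ("c2", "Bob")]                 # (custid, name)
orders    = [("c1", "book"), ("c1", "pen"), ("c2", "ink")]   # (custid, item)

# map phase: tag each record so the reducer knows which side it came from
pairs = [(k, ("C", v)) for k, v in customers] + \
        [(k, ("O", v)) for k, v in orders]

# shuffle: group tagged records by the join key
groups = defaultdict(list)
for key, tagged in pairs:
    groups[key].append(tagged)

# reduce phase: for each key, pair every customer record with every order
joined = []
for key, tagged in groups.items():
    names = [v for tag, v in tagged if tag == "C"]
    items = [v for tag, v in tagged if tag == "O"]
    joined += [(key, n, i) for n in names for i in items]

print(sorted(joined))
# [('c1', 'Alice', 'book'), ('c1', 'Alice', 'pen'), ('c2', 'Bob', 'ink')]
```

Both full datasets travel through the shuffle here, which is exactly why the slide calls the reduce-side join the less efficient option.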
• Hive was created to make it possible for
analysts with strong SQL skills (but meager
Java programming skills) to query big data
• Developed at Facebook, and later became
part of the Apache open source projects
• Hive runs on your workstation and converts
your SQL query into a series of Map
Reduce jobs for execution on a Hadoop
cluster
HIVE
26
• Unpack the tarball
• % tar xzf hive-x.y.z-dev.tar.gz
• Set handy environment variables
• % export HIVE_INSTALL=/home/tom/hive-x.y.z-
dev
• % export PATH=$PATH:$HIVE_INSTALL/bin
• Launch the Hive shell
• hive> SHOW TABLES;
Hive Infrastructure
27
Hive Modules
28
Hive Data Types
29
• Create a table
• CREATE TABLE rank_customer(custid STRING,
score STRING, location STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• Load data
• LOAD DATA LOCAL INPATH
'input/dir/customerrank.dat' OVERWRITE INTO
TABLE rank_customer;
• Check the data in the warehouse
• $ ls /user/hive/warehouse/rank_customer/
Commands
30
• SELECT query
• SELECT c.custid, c.score, c.location FROM
rank_customer c ORDER BY c.custid ASC,
c.location ASC, c.score DESC;
Commands cont.
31
• hive> CREATE DATABASE financials WITH
DBPROPERTIES ('creator' = 'MGP', 'date' =
'2014-10-03');
• hive> DROP DATABASE IF EXISTS financials;
• hive> ALTER DATABASE financials SET
DBPROPERTIES ('edited-by' = 'Joe Dba');
• hive> DROP TABLE IF EXISTS employees;
• hive> ALTER TABLE log_messages RENAME TO
logmsgs;
Hive – DDL Commands
32
• Determine the rank of each customer
based on his id and the locality he
belongs to; the highest scorer gains the
higher rank
• Input / Output
Use case
33
• Custom Writable
Using Map Reduce
34
• CustomWritable methods overridden
CustomWritable cont.
35
Driver code
36
Mapper Code
37
Partitioner Code
38
Sort Comparator class
39
Reducer Code
40
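The Mapper, Partitioner, Sort Comparator and Reducer slides above showed Java code as screenshots, which are not preserved here. As a hedged stand-in, this plain-Python sketch simulates the same ranking job with illustrative data: group scores by the composite key (custid, location), sort each group's scores descending, and assign ranks:

```python
# Simulated ranking job: (custid, score, location) records -> per-key ranks.
from collections import defaultdict

records = [  # illustrative input, not data from the original slides
    ("c1", 70, "NY"), ("c1", 90, "NY"), ("c2", 50, "LA"), ("c2", 80, "LA"),
]

# "map" + "partition": group records by the composite key (custid, location)
groups = defaultdict(list)
for custid, score, location in records:
    groups[(custid, location)].append(score)

# "sort comparator" + "reduce": sort scores descending and assign ranks
ranked = {}
for key, scores in groups.items():
    ranked[key] = [(s, pos + 1) for pos, s in enumerate(sorted(scores, reverse=True))]

print(ranked[("c1", "NY")])  # [(90, 1), (70, 2)]
```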
• -- Obtain the ranking on the basis of
location and customer id, as per the
requirement
• hive> SELECT custid, score, location, rank()
OVER (PARTITION BY custid, location ORDER BY
score DESC)
AS myrank
FROM rank_customer;
Hive Query
41
Hive results
42
• The Hadoop eco system is designed mainly
for large amounts of data in large files
• It is not well suited to a large number of
small files
• It achieves parallelism over huge
data
• Mapping and reducing are the key and
core functions for achieving parallelism
Conclusion
43
• The Hadoop eco system works efficiently on
commodity hardware
• Distributed hardware can be efficiently
utilized
• Hadoop map reduce code is written
in Java
• Hive gives SQL programmers an accessible
interface, though internally Java MR
jobs run
Conclusion cont.
44
• Hadoop: The Definitive Guide, Third
Edition by Tom White
• Programming Hive by Edward Capriolo,
Dean Wampler, and Jason Rutherglen
• http://hadoop.apache.org/
• http://hive.apache.org/
References
45
46
THANK YOU
Q&A
PRADEEP M G
1. Big Data - Introduction(what is bigdata).pdf
AmanCSE050
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
Michael Ming Lei
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Andrew Brust
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
Microsoft TechNet - Belgium and Luxembourg
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
yaevents
 
Scheduling scheme for hadoop clusters
Scheduling scheme for hadoop clustersScheduling scheme for hadoop clusters
Scheduling scheme for hadoop clusters
Amjith Singh
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
jani shaik
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 
What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!
MongoDB
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
Andrew Brust
 

Similar to Hadoop_EcoSystem_Pradeep_MG (20)

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
Scheduling scheme for hadoop clusters
Scheduling scheme for hadoop clustersScheduling scheme for hadoop clusters
Scheduling scheme for hadoop clusters
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 

Hadoop_EcoSystem_Pradeep_MG

  • 2. • Why Big Data? • Ingredients of Big Data Eco System • Working with Map Reduce • Phases of MR • HDFS • Hive • Use case • Conclusion Agenda 2
  • 3. • Big Data is NOT JUST ABOUT SIZE; it's ABOUT HOW IMPORTANT THE DATA IS in a large chunk • Data is CHANGING and getting MESSY • Previously structured, but now unstructured • Non-uniform • Many distributed contributors to the data • Mobile, PDA, Tablet, sensors • Domains: Financial, Healthcare, Social Media Why Big Data!! 3
  • 5. • MapReduce – a programming model for solving big data problems by mapping and reducing over clusters • HDFS – the distributed file system used by Hadoop • Hive – an SQL-based query engine for non-Java programmers • Pig – a data flow language and execution environment for exploring very large datasets Ingredients of Eco System 5
  • 6. • HBase – a distributed, column-oriented database • ZooKeeper – a distributed, highly available coordination service • Sqoop – a tool for efficiently moving data between relational databases and HDFS Ingredients cont. 6
  • 7. • Protocols used – RPC/HTTP for communication between commodity hardware nodes • Runs on a pseudo-distributed node or on clusters • Components – daemons • NameNode • DataNode • JobTracker • TaskTracker Hadoop Internals 7
  • 8. • Map → a function applied to each element of the input data • Reduce → a function used for aggregation or reduction Working with Map Reduce 8
  • 9. • f = Σ_{n=0..10} n(n−1)/2 • map = for each n from 0 to 10, emit n(n−1)/2 • reduce = Σ([values]), the aggregation/reduction function • Hence we can achieve parallelism MR as a function 9
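The formula on this slide can be checked with a short sketch. This is plain Python for illustration only (the deck's real MR jobs are written in Java, as a later slide notes): each n is mapped independently to n(n−1)/2, which is why the map step parallelizes, and the reduce step sums the mapped values.

```python
from functools import reduce

def mapper(n):
    # map: each n is transformed independently, so this step can run in parallel
    return n * (n - 1) // 2

# one mapped output per input value n = 0..10
mapped = [mapper(n) for n in range(11)]

# reduce: aggregate all mapped values into a single sum
total = reduce(lambda a, b: a + b, mapped, 0)
print(total)  # 165
```

Because addition is associative, the reduce step could itself be split across machines and combined, which is the property MapReduce exploits.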
  • 10. MR as representation 10 • Map <K1, V1> → <K2, V2> • V2 – list of values for key K2 • Reduce <K2, V2> ~ <K3, V3> • ~ – the reduction operation • Reduced output with specific keys and values
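A minimal sketch of this key/value contract (illustrative Python, not the Hadoop API): the map turns <K1, V1> records into <K2, V2> pairs, the framework groups the V2 values by K2, and the reduce collapses each group to <K3, V3> — here the reduction operation ~ is summation, giving a word count.

```python
from collections import defaultdict

def map_phase(records):
    # Map <K1, V1> -> <K2, V2>: K1 is a line offset, V1 the line text;
    # emit one (word, 1) pair per word
    for _, line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # the framework groups V2 values by key K2: <K2, [V2, ...]>
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce <K2, [V2]> -> <K3, V3>: the reduction operation ~ here is sum()
    return {k: sum(vs) for k, vs in groups.items()}

records = [(0, "big data big"), (1, "data lake")]
print(reduce_phase(shuffle(map_phase(records))))
# prints {'big': 2, 'data': 2, 'lake': 1}
```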
  • 11. • Data on HDFS • Input partition – FileSplit, InputSplit • Map • Shuffle • Sort • Partition • Reducer • Aggregated Data on HDFS Phases of MR 11
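The phases listed above can be simulated end to end in a toy script (Python for illustration; the names `run_job`, `input_splits` and friends are made up for this sketch, and `partition_for` only mimics the spirit of Hadoop's default hash partitioner):

```python
from collections import defaultdict

NUM_REDUCERS = 2

def input_splits(text, lines_per_split=2):
    # input partition: carve the input into fixed-size splits, one per map task
    lines = text.splitlines()
    return [lines[i:i + lines_per_split] for i in range(0, len(lines), lines_per_split)]

def run_map(split):
    # map phase: emit a (word, 1) pair for every word in the split
    for line in split:
        for word in line.split():
            yield word, 1

def partition_for(key):
    # partition: decide which reducer receives this key (hash partitioning)
    return hash(key) % NUM_REDUCERS

def run_job(text):
    # shuffle: route each map output pair to its reducer's partition,
    # grouping values by key as they arrive
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in input_splits(text):
        for k, v in run_map(split):
            partitions[partition_for(k)][k].append(v)
    # sort within each partition, then reduce (sum the grouped values)
    result = {}
    for part in partitions:
        for k in sorted(part):
            result[k] = sum(part[k])
    return result

print(run_job("hdfs hive\nhive pig\nhdfs hdfs"))
```

Every word ends up counted exactly once because the partitioner sends all pairs for a given key to the same reducer.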
  • 12. Phases of MR depicted 12
  • 13. Data flow in MR 13 MapReduce data flow with multiple reduce tasks
  • 14. Shuffle and Sort phase 14
  • 15. • Architecture HDFS Hadoop Distributed File System 15
  • 18. • List all the files and directories in the HDFS • $hadoop fs -lsr • Put a file to HDFS • $hadoop fs -put <from path> <to path> • Get files from HDFS • $hadoop fs -get <from path> • To run a jar file • $hadoop jar <jarfile> <className> <input path> <output path> HDFS - cli 18
  • 19. • Job Configuration • Key files: core-site.xml, mapred-site.xml • Specific job configuration can be provided in the code Map Reduce cont. 19
  • 20. MR job in action 20
  • 21. • Job Scheduling • Fair scheduler • Capacity scheduler Job Scheduling 21
  • 22. • Job is planned and placed in the job pool • Supports preemption • If no pools created and only one job available, the job runs as is Fair Scheduler 22
  • 23. • Supports Multi user scheduling • Depends on the clusters, number of queues and hierarchical way jobs are scheduled • One queue may be child of another queue • Enforces fair scheduling within each job pool Capacity scheduler 23
  • 24. MapReduce Input Formats 24
  • 25. • Map-Side Join • a join of large inputs that works by performing the join before the data reaches the map function • Reduce-Side Join • the input datasets don't have to be structured in any particular way, but it is less efficient as both datasets have to go through the MapReduce shuffle MR Joins 25
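A reduce-side join can be sketched as follows (illustrative Python, not Hadoop code; the tags "C" and "O" and all table contents are invented for the example): each dataset's mapper emits pairs keyed by the join key and tagged with its source, the shuffle brings all records for one key to the same reducer, and the reducer cross-matches the tagged groups.

```python
from collections import defaultdict

# two "datasets" keyed by customer id, tagged so the reducer can tell them apart
customers = [("c1", "Alice"), ("c2", "Bob")]
orders = [("c1", "book"), ("c1", "pen"), ("c2", "lamp")]

def map_customers(rows):
    # tag each customer record with "C"
    for cid, name in rows:
        yield cid, ("C", name)

def map_orders(rows):
    # tag each order record with "O"
    for cid, item in rows:
        yield cid, ("O", item)

def reduce_join(pairs):
    # shuffle: group every tagged value under its join key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # reduce: cross-match customer names against order items per key
    joined = []
    for cid, values in groups.items():
        names = [v for tag, v in values if tag == "C"]
        items = [v for tag, v in values if tag == "O"]
        for name in names:
            for item in items:
                joined.append((cid, name, item))
    return joined

pairs = list(map_customers(customers)) + list(map_orders(orders))
print(sorted(reduce_join(pairs)))
# [('c1', 'Alice', 'book'), ('c1', 'Alice', 'pen'), ('c2', 'Bob', 'lamp')]
```

The cost the slide mentions is visible here: both datasets travel through the grouping (shuffle) step, whereas a map-side join would load the smaller dataset into memory before mapping.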
  • 26. • Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to query big data • Created by developers at Facebook and later made part of the Apache open source projects • Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster HIVE 26
  • 27. • Unzip the gz file • % tar xzf hive-x.y.z-dev.tar.gz • Set handy environment variables • % export HIVE_INSTALL=/home/tom/hive-x.y.z-dev • % export PATH=$PATH:$HIVE_INSTALL/bin • Launch the Hive shell • hive> SHOW TABLES; Hive Infrastructure 27
  • 30. • Creating a table • CREATE TABLE rank_customer(custid STRING, score STRING, location STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; • Load data • LOAD DATA LOCAL INPATH 'input/dir/customerrank.dat' OVERWRITE INTO TABLE rank_customer; • Check data in the warehouse • $ls /user/hive/warehouse/rank_customer/ Commands 30
  • 31. • SELECT QUERY • SELECT c.custid, c.score, c.location FROM rank_customer c ORDER BY c.custid ASC, c.location ASC, c.score DESC; Commands cont. 31
  • 32. • hive> CREATE DATABASE financials WITH DBPROPERTIES ('creator' = 'MGP', 'date' = '2014-10-03'); • hive> DROP DATABASE IF EXISTS financials; • hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba'); • hive> DROP TABLE IF EXISTS employees; • hive> ALTER TABLE log_messages RENAME TO logmsgs; Hive- DDL Commands 32
  • 33. • Determine the rank of each customer based on customer id and locality; the highest scorer gains the highest rank • Input Output Use case 33
  • 34. • Custom Writable Using Map Reduce 34
  • 35. • CustomWritable methods overridden CustomWritable cont. 35
  • 41. • -- FOR OBTAINING THE RANKING ON THE BASIS OF LOCATION AND CUSTOMER ID AS PER THE REQUIREMENT • hive> SELECT custid, score, location, rank() over(PARTITION BY custid, location ORDER BY score DESC) AS myrank FROM rank_customer; Hive Query 41
  • 43. • The Hadoop eco system is designed mainly for large files of data • It is not well suited to a very large number of small files • Achieves parallelism on huge data • Mapping and reducing are the key and core functions used to achieve parallelism Conclusion 43
  • 44. • The Hadoop eco system works efficiently with commodity hardware • Distributed hardware can be utilized efficiently • Hadoop MapReduce code is written in Java • Hive gives SQL programmers a familiar interface, though internally Java MR jobs run Conclusion cont. 44
  • 45. • Hadoop: The Definitive Guide, Third Edition by Tom White • Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen • http://hadoop.apache.org/ • http://hive.apache.org/ References 45