Hadoop
Ecosystem
Ran Silberman, December 2014
What types of systems exist in the ecosystem?
● Systems that are based on MapReduce
● Systems that replace MapReduce
● Complementary databases
● Utilities
● See complete list here
Systems based
on MapReduce
Hive
● An Apache project
● SQL-like syntax (HiveQL) for querying HDFS or other
large data stores
● Each SQL statement is translated to one or more
MapReduce jobs (in some cases none)
● Supports pluggable Mappers, Reducers and SerDes
(Serializer/Deserializer)
● Pro: convenient for analytics people who use SQL
Hive Architecture
Hive Usage
Start a Hive shell:
$ hive
Create a Hive table:
hive> CREATE TABLE tikal (id BIGINT, name STRING, startdate TIMESTAMP, email
STRING);
Show all tables:
hive> SHOW TABLES;
Add a new column to the table:
hive> ALTER TABLE tikal ADD COLUMNS (description STRING);
Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hduser/tikal_users' OVERWRITE INTO TABLE tikal;
Query employees that have worked for more than a year:
hive> SELECT name FROM tikal WHERE (unix_timestamp() - unix_timestamp(startdate))
> 365 * 24 * 60 * 60;
Pig
● An Apache project
● A programming language (Pig Latin) that is compiled into
one or more MapReduce jobs
● Supports User Defined Functions (UDFs)
● Pro: more convenient to write than raw MapReduce
Pig Usage
Start a Pig shell (grunt is the Pig Latin shell prompt):
$ pig
grunt>
Load an HDFS data file:
grunt> employees = LOAD 'hdfs://hostname:54310/home/hduser/tikal_users'
as (id,name,startdate,email,description);
Dump the data to console:
grunt> DUMP employees;
Query employees that have worked for more than a year (assuming
startdate holds a Unix timestamp in seconds):
grunt> employees_more_than_1_year = FILTER employees BY
(ToUnixTime(CurrentTime()) - (long)startdate) > 365L * 24 * 60 * 60;
grunt> DUMP employees_more_than_1_year;
Store the query result to a new file:
grunt> STORE employees_more_than_1_year INTO
'/home/hduser/employees_more_than_1_year';
Cascading
● An infrastructure with an API that compiles to one or
more MapReduce jobs
● Provides a graphical view of the MapReduce job workflow
● Provides ways to tweak settings and improve the
performance of a workflow
● Pros:
o Hides the MapReduce API and chains jobs together
o Graphical view and performance tuning
MapReduce workflow
● The MapReduce framework operates exclusively on
Key/Value pairs
● There are three phases in the workflow:
o map
o combine
o reduce
(input) <k1, v1> =>
map => <k2, v2> =>
combine => <k2, v2> =>
reduce => <k3, v3> (output)
WordCount in MapReduce Java API
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
WordCount in MapReduce Java API (cont.)
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
WordCount in MapReduce Java API (cont.)
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
MapReduce workflow example.
Let’s consider two text files:
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file01
Hello World Bye World
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Mapper code
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
Mapper output
For two files there will be two mappers.
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
Set Combiner
We defined a combiner in the code:
job.setCombinerClass(IntSumReducer.class);
Combiner output
Output of each map is passed through the local combiner
for local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Reducer code
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
Reducer output
The reducer sums up the values
The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The Cascading core components
● Tap (Data resource)
o Source (Data input)
o Sink (Data output)
● Pipe (data stream)
● Filter (Data operation)
● Flow (assembly of Taps and Pipes)
WordCount in Cascading
Flow visualization: a source Tap (Document Collection) feeds the
pipes (Tokenize, Count) into a sink Tap (Word Count)
WordCount in Cascading (cont.)
// define source and sink Taps.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );
// For each input Tuple
// parse out each word into a new Tuple with the field name "word"
// regular expressions are optional in Cascading
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );
WordCount in Cascading (cont.)
// For every Tuple group
// count the number of occurrences of "word" and store result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
Diagram of Cascading Flow
Scalding
● An extension to Cascading
● The programming language is Scala instead of Java
● Well suited to functional programming paradigms in data
applications
● Pro: code can be very compact!
WordCount in Scalding
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TypedPipe.from(TextLine(args("input")))
.flatMap { line => line.split("""\s+""") }
.groupBy { word => word }
.size
.write(TypedTsv(args("output")))
}
Summingbird
● An open source project from Twitter
● An API that compiles to Scalding jobs and to Storm
topologies
● Can be written in Java or Scala
● Pro: lets you write one codebase that runs on both
Hadoop and Storm, which is exactly what a Lambda
Architecture calls for
WordCount in Summingbird
def wordCount[P <: Platform[P]]
(source: Producer[P, String], store: P#Store[String, Long]) =
source.flatMap { sentence =>
toWords(sentence).map(_ -> 1L)
}.sumByKey(store)
Systems that
replace MapReduce
Spark
● An Apache project
● Replaces MapReduce with its own engine that works
much faster without compromising consistency
● Architecture is based not on MapReduce but on two
concepts: RDD (Resilient Distributed Dataset) and DAG
(Directed Acyclic Graph)
● Pros:
o Works much faster than MapReduce (see the sketch
below)
o Fast growing community
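To give a rough sense of the RDD API, here is WordCount in Spark's Java API. This is a minimal sketch, assuming the Spark 2.x Java API (where flatMap returns an Iterator); input and output paths are passed as arguments.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("word count");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> lines = sc.textFile(args[0]); // RDD backed by an HDFS file
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b); // the whole DAG runs without intermediate HDFS writes
    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}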
Impala
● Open source from Cloudera
● Used for interactive queries with SQL syntax
● Replaces MapReduce with its own Impala server
● Pro: much faster response times for SQL over HDFS
than Hive or Pig (see the JDBC sketch below)
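Impala speaks the HiveServer2 wire protocol, so one way to query it from Java is through the standard Hive JDBC driver. A minimal sketch, assuming an unsecured cluster; the host name and table are hypothetical, and 21050 is Impala's default HiveServer2 port.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // impala-host is hypothetical; auth=noSasl assumes no Kerberos
    String url = "jdbc:hive2://impala-host:21050/;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT name FROM tikal LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // rows stream back from the Impala daemons
      }
    }
  }
}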
Impala benchmark
Note: in this benchmark Impala runs over Parquet!
Impala replaces MapReduce
Impala architecture
● The Impala architecture was inspired by Google Dremel
● MapReduce is great for functional programming, but not
efficient for SQL
● Impala replaces MapReduce with a distributed query
engine that is optimized for fast queries
Dremel architecture
Dremel: Interactive Analysis of Web-Scale Datasets
Impala architecture
Presto, Drill, Tez
● Several more alternatives:
o Presto by Facebook
o Apache Drill pushed by MapR
o Apache Tez pushed by Hortonworks
● All are alternatives to Impala and do more or less the
same: provide faster response times for queries over
HDFS
● Each of the above claims very fast results
● Be careful with the benchmarks they publish: to get better
results they use columnar or indexed formats rather than
sequential files in HDFS (e.g., ORC files, Parquet, HBase)
Complementary
Databases
HBase
● An Apache project
● NoSQL cluster database that can grow linearly
● Can store billions of rows × millions of columns
● Storage is based on HDFS
● Integrates with MapReduce as a source and sink
● Pros:
o Strongly consistent reads/writes
o Good for high-speed counter aggregations (see the
sketch below)
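A minimal sketch of the two Pros above, assuming the HBase 1.x Java client; the table and column names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseCounterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("page_views"))) {
      // strongly consistent write
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"), Bytes.toBytes("/home"));
      table.put(put);
      // atomic, high-speed counter increment
      long views = table.incrementColumnValue(
          Bytes.toBytes("row1"), Bytes.toBytes("cf"), Bytes.toBytes("views"), 1L);
      System.out.println("views = " + views);
    }
  }
}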
Parquet
● An Apache (incubator) project, initiated by Twitter &
Cloudera
● Columnar file format: data is written one column at a time
● Integrated with the Hadoop ecosystem (MapReduce, Hive)
● Supports Avro, Thrift and Protocol Buffers
● Pro: keeps I/O to a minimum by reading from disk only
the data required for the query
Columnar format (Parquet)
Advantages of Columnar formats
● Better compression as data is more homogeneous.
● I/O will be reduced as we can efficiently scan only a
subset of the columns while reading the data.
● When storing data of the same type in each column,
we can use encodings better suited to the modern
processors’ pipeline by making instruction branching
more predictable.
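To make the "one column at a time" idea concrete, here is a minimal write sketch using the parquet-avro bindings, assuming the AvroParquetWriter builder API; the schema and output path are illustrative.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path(args[0]))
                 .withSchema(schema)
                 .build()) {
      GenericRecord user = new GenericData.Record(schema);
      user.put("id", 1L);
      user.put("name", "Ran");
      writer.write(user); // rows are buffered and flushed to disk column by column
    }
  }
}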
Utilities
Flume
● Originated at Cloudera, now an Apache project
● Used to collect files from distributed systems and send
them to a central repository
● Designed for integration with HDFS but can write to
other file systems
● Supports listening on TCP and UDP sockets
● Main use case: collecting distributed logs into HDFS
(see the client sketch below)
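A minimal client-side sketch: sending one log event to a Flume agent over RPC. The agent host and port are hypothetical and assume the agent exposes an Avro source there; the agent's sink would then forward the event to HDFS.
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
public class FlumeLogSender {
  public static void main(String[] args) throws Exception {
    // flume-agent-host/41414 are hypothetical
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414);
    try {
      Event event = EventBuilder.withBody("app started", StandardCharsets.UTF_8);
      client.append(event); // delivered to the agent, then routed onward by its sink
    } finally {
      client.close();
    }
  }
}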
Avro
● An Apache project
● Data serialization by schema
● Supports rich data structures, defined in a JSON schema
● Supports schema evolution
● Integrated with the Hadoop I/O API
● Similar to Thrift and Protocol Buffers (see the sketch
below)
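A minimal sketch of schema-driven serialization with Avro's generic API; the Employee schema below is illustrative.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
public class AvroExample {
  public static void main(String[] args) throws Exception {
    // the schema is plain JSON; adding a field with a default is a compatible evolution
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    GenericRecord employee = new GenericData.Record(schema);
    employee.put("id", 1L);
    employee.put("name", "Ran");
    // serialize to compact binary, guided by the schema
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(employee, encoder);
    encoder.flush();
    System.out.println("serialized " + out.size() + " bytes");
  }
}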
Oozie
● An Apache project
● Workflow Scheduler for Hadoop jobs
● Very close integration with the Hadoop API (see the
client sketch below)
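One way to drive that integration from code is the Oozie Java client. A minimal submit sketch; the server URL and the HDFS workflow path are hypothetical, and a real workflow usually needs more properties than shown here.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
public class OozieSubmit {
  public static void main(String[] args) throws Exception {
    // oozie-host and the workflow path are hypothetical
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/hduser/wordcount-wf");
    String jobId = oozie.run(conf);            // submit and start the workflow
    WorkflowJob job = oozie.getJobInfo(jobId); // poll its status
    System.out.println(jobId + " -> " + job.getStatus());
  }
}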
Mesos
● Apache project
● Cluster manager that abstracts resources
● Integrated with Hadoop to allocate resources
● Scalable to 10,000 nodes
● Supports physical machines, VMs and Docker
● Multi-resource scheduler (memory, CPU, disk, ports)
● Web UI for viewing cluster status