BI Seminar project:
High-level languages
for Big Data Analytics
Janani Chakkaradhari
Jose Luis Lopez Pino
Outline
1. Introduction
1.1 The Map Reduce programming model
1.2 Hadoop
2. High Level Languages
2.1 Pig Latin
2.2 Hive
2.3 JAQL
2.4 Other Languages
3. Comparison of HLLs
3.1 Syntax Comparison
3.2 Performance
3.3 Query Compilation
3.4 JOIN Implementation
4. Future Work
4.1 Machine Learning
4.2 Interactive Queries
5. Conclusion
Introduction
The MapReduce model
• Introduced in 2004 by Google
• This model allows programmers without any experience in
parallel coding to write highly scalable programs and
hence process voluminous data sets.
• This high level of scalability is reached thanks to the
decomposition of the problem into a large number of tasks.
• The Map function produces a set of key/value pairs, taking a
single key/value pair as input.
• The Reduce function takes a key and a set of values related to this
key as input and it might also produce a set of values, but
commonly it emits only one or zero values as output
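The two functions above can be sketched in plain Python with a word count. This is only a single-process illustration of the model, not Hadoop code; the names `map_fn`, `reduce_fn` and `run_mapreduce` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: one (document name, text) pair in, a list of (word, 1) pairs out."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: one key and all of its values in, a single total out."""
    return [sum(values)]

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every input pair is processed independently.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle: group intermediate values by key.
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Hypothetical input: two tiny documents.
documents = [("doc1", "big data big clusters"), ("doc2", "data analytics")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

In a real cluster the map calls and the reduce calls would run as separate tasks on different machines; here they run in one loop to keep the sketch short.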
The MapReduce model
• Advantages:
• Scalability
• Handles failures and balances the load across the system
• Pitfalls
• Complicated to code some tasks.
• Some tasks are very expensive.
• Code is difficult to debug.
• Absence of schema and indexes.
• A lot of bandwidth might be consumed.
Hadoop
• An Apache Software foundation open source
project
• Hadoop = HDFS + MapReduce
• DFS – partitions data and stores it on separate
machines
• HDFS – stores large files, runs on clusters of
commodity hardware, typically with 64 MB per
block
• The file system and MapReduce are co-designed
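As a rough arithmetic sketch of what the 64 MB block size means for a large file, the helper below counts the blocks a file would be split into. The function names are hypothetical, and the sketch ignores replication and assumes the last block only holds the remainder of the file.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file; the last block may be partly filled."""
    return math.ceil(file_size_mb / block_size_mb)

def block_sizes(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Size in MB of each block, in order."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

one_gb_blocks = block_count(1024)  # a 1 GB file
sizes_200mb = block_sizes(200)     # a 200 MB file
```

Each block can then be handed to a map task running on a machine that stores it.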
Hadoop
• No separate storage network and processing network
• Moving compute to the data node
High level languages
High level languages
• Two different types
• Created specifically for this model.
• Already existing languages
• Languages present in the comparison
• Pig Latin
• HiveQL
• Jaql
• Interesting languages
• Meteor
• DryadLINQ
Pig Latin
• Executed over Hadoop.
• Procedural language.
• High level operations similar to those that we can find in SQL
• Some interesting operators:
• FOREACH to process a transformation over every tuple of the set.
To make it possible to parallelise this operation, the transformation
of one row should not depend on any other row.
• COGROUP to group related tuples of multiple datasets. It is
similar to the first step of a join.
• LOAD to load the input data and its structure and STORE to save
data in a file.
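A small Python sketch of why a FOREACH-style operation parallelises well: each tuple is transformed on its own, so the input can be split into partitions and processed separately. The helper names and the sample relation are assumptions for illustration, not Pig internals.

```python
def foreach(tuples, generate):
    """Apply `generate` to every tuple; no tuple looks at any other."""
    return [generate(t) for t in tuples]

def foreach_partitioned(tuples, generate, partitions):
    """Same operation, run partition by partition as separate workers might."""
    chunks = [tuples[i::partitions] for i in range(partitions)]
    results = []
    for chunk in chunks:
        results.extend(foreach(chunk, generate))
    return results

def double_count(t):
    """Hypothetical per-tuple transformation: double the second field."""
    return (t[0], t[1] * 2)

visits = [("alice", 3), ("bob", 5), ("carol", 8)]

sequential = foreach(visits, double_count)
partitioned = foreach_partitioned(visits, double_count, partitions=2)
```

Both runs produce the same set of tuples; only the order may differ, because the partitions are processed independently.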
Pig Latin
• Goal: to reduce development time.
• Nested data model.
• User-defined functions
• Analytic queries over text files (no need to load the data first)
• Procedural language -> control over the execution plan
• The user can speed up performance.
• It makes the work of the query optimiser easier.
• Unlike SQL.
HiveQL
• Open-source data warehouse (DW) solution built on top of Hadoop
• Queries look similar to SQL, with some extensions
• Complex column types: map, array and struct as data types
• It stores its metadata in an RDBMS
HiveQL
• The Metastore acts as the system catalog for
Hive
• It stores all the information about the
tables, their partitions, their schemas, etc.
• Without the system catalog it is not possible to
impose a structure on Hadoop files.
• Facebook uses MySQL to store this metadata.
Reason: this information has to be
served quickly to the compiler
JAQL
• What is Jaql?
• Declarative scripting programming language.
• Used over Hadoop’s MapReduce framework
• Included in IBM’s InfoSphere BigInsights and Cognos Consumer
Insight products.
• Developed after Pig and Hive.
• More scalable.
• More flexible
• More reusable.
• Data model
• Simple: similar to JSON.
• Values as trees.
• No references.
• Textual representation very similar.
• Flexible
• Handle semistructured documents.
• But also structured records validated against a schema.
JAQL
• Control over the evaluation plan.
• The programmer can work at different levels of abstraction
using Jaql's syntax:
• Full definition of the execution plan.
• Use of hints to indicate to the optimizer some evaluation
features.
• This feature is present in most of the database engines that use SQL
as query language.
• Declarative programming, without any control over the flow.
Other languages: Meteor
• Stratosphere stack
• Pact:
• Programming model
• It extends MapReduce with new second-order functions
• Cross: Cartesian product
• CoGroup: group all the records with the same key and process them.
• Match: similar to CoGroup but pairs with the same key could be processed
separately.
• Sopremo:
• Semantically rich operator model
• Extensible
• Meteor: query language
• Optimization
• Meteor code
• Logical plan using Sopremo operators -> Optimized
• Pact final program -> Physically optimized
Other languages: DryadLINQ
• Code embedded in .NET programming languages
• Operators
• Almost all the operators available in LINQ.
• Some specific operators for parallel programming.
• Developers can include their own implementations.
• DryadLINQ code is translated to a Dryad plan
• Optimization
• Pipeline operations
• Remove redundancy
• Push aggregations
• Reduce network traffic
Comparison HLLs
Comparing HLLs
• Different design motivations
• Developers' preferences
• Write concise code -> Expressiveness
• Efficiency -> Performance
• Criteria that impact performance
• Join implementation
• Query processing
• Some others are not included
• Language paradigm
• Scalability
Comparison Criteria
• Expressive power
• Performance
• Query Compilation
• JOIN Implementation
Expressive power
• Three categories by Robert Stewart:
• Relational complete
• SQL equivalent (aggregate functions)
• Turing complete
• Conditional branching
• Indefinite iterations by means of recursion
• Emulation of infinite memory model
Expressive power
• But this does not mean that SQL, Pig Latin
and HiveQL are the same.
• HiveQL
• Is inspired by SQL but it does not support the full
repertoire included in the SQL-92 specification
• Includes features notably inspired by MySQL and
MapReduce that are not part of SQL.
• Pig Latin
• It is not inspired by SQL.
• For instance, it does not have an OVER clause
SQL Vs. HiveQL (2009)
Feature | SQL | HiveQL
Transactions | Yes | No
Indexes | Yes | No
Create table as select | Not SQL-92 | Yes
Subqueries | In any clause; correlated or not | Only in FROM clause; only noncorrelated
Views | Yes | Not materialized
Extension with map/reduce scripts | No | Yes
Query Processing
Query Processing
• In order to make a good comparison we need
a basic understanding of how these high-level
query languages work.
• How is the abstract user representation of the
query or the script converted to MapReduce
jobs?
Query Processing – Pig Latin
• The goal of writing a Pig Latin script is to produce
equivalent MapReduce jobs that can be executed in the
Hadoop environment
• The parser first checks for syntactic errors
Query Processing - Hive
• It gets the Hive SQL string from the client
• The parser phase converts it into a parse tree
representation
• The logical query plan generator converts it into a
logical query representation. It prunes columns
early and pushes predicates closer to the
table.
• The logical plan is converted to a physical plan and
then to MapReduce jobs.
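The effect of pruning columns early and pushing predicates closer to the table can be sketched in Python by comparing two hand-written plans for the same query. This is not Hive's optimizer; the tables, column names and the nested-loop `join` helper are assumptions made for the example.

```python
def join(left, right, left_key, right_key):
    """Nested-loop equijoin; returns joined rows and the number of pairs compared."""
    compared = 0
    out = []
    for l in left:
        for r in right:
            compared += 1
            if l[left_key] == r[right_key]:
                out.append({**l, **r})
    return out, compared

# Hypothetical tables.
users = [{"uid": i, "country": "ES" if i % 2 else "IN", "bio": "long text"} for i in range(6)]
orders = [{"order_uid": i % 6, "amount": 10 * i} for i in range(12)]

# Naive plan: join everything, then filter and project.
naive_rows, naive_cost = join(users, orders, "uid", "order_uid")
naive = [{"uid": r["uid"], "amount": r["amount"]} for r in naive_rows if r["country"] == "ES"]

# Optimised plan: apply the predicate and drop unused columns before the join.
es_users = [{"uid": u["uid"]} for u in users if u["country"] == "ES"]
slim_orders = [{"order_uid": o["order_uid"], "amount": o["amount"]} for o in orders]
pushed_rows, pushed_cost = join(es_users, slim_orders, "uid", "order_uid")
pushed = [{"uid": r["uid"], "amount": r["amount"]} for r in pushed_rows]
```

Both plans return the same rows, but the second one feeds fewer and narrower rows into the join, which is the point of the two rewrites.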
Query Processing - JAQL
• JAQL includes two higher-order functions,
mapReduceFn and mapAggregate
• The rewriter engine generates calls to mapReduceFn or
mapAggregate
QP - Summary
• Each of these languages has its own methods
• All support syntax checking, usually done by the
compiler
• Pig currently misses out on optimized storage
structures like indexes and column groups
• HiveQL provides more optimizations
• It prunes the buckets that are not needed
• Predicate push-down
• Query rewriting (projection push-down) is future
work for JAQL
JOIN Implementation
JOIN in Pig Latin
• Pig Latin supports inner joins, equijoins and outer
joins. The JOIN operator performs an inner join
by default.
• A join can also be achieved with a COGROUP
operation followed by FLATTEN
• JOIN creates a flat set of output records, while
COGROUP creates a nested set of output records
• GROUP – when only one relation is involved
• COGROUP – when multiple relations are involved
• FLATTEN – (a, {(b,c), (d,e)}) -> (a, b, c) and (a, d, e)
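The relationship between COGROUP, FLATTEN and JOIN can be sketched in Python. The `cogroup` and `flatten` helpers below only imitate the behaviour described above on in-memory lists; their names, signatures and the sample relations are assumptions, not Pig's implementation.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key_left, key_right):
    """Group two relations by key: key -> (bag of left tuples, bag of right tuples)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_left(row)][0].append(row)
    for row in right:
        groups[key_right(row)][1].append(row)
    return dict(groups)

def flatten(grouped):
    """Un-nest the bags: emit one flat tuple per pair of left and right tuples."""
    flat = []
    for key in grouped:
        left_bag, right_bag = grouped[key]
        for l, r in product(left_bag, right_bag):
            flat.append(l + r)
    return flat

# Hypothetical relations.
owners = [("alice", "cat"), ("bob", "dog"), ("carol", "fish")]
sounds = [("cat", "meow"), ("dog", "woof"), ("dog", "growl")]

nested = cogroup(owners, sounds, key_left=lambda r: r[1], key_right=lambda r: r[0])
joined = flatten(nested)
```

`nested` keeps one entry per key with two bags, while `joined` is the flat result an inner join would give; the "fish" group disappears because one of its bags is empty.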
JOIN in Pig Latin
• Fragment Replicate joins
• Trivial case, only possible if one of the two relations is
small enough to fit into memory
• The join is performed in the map phase
• Skewed Joins
• For data that is not evenly distributed
• Basically computes a histogram of the key space and
uses this data to allocate reducers for a given key
• The join is performed in the reduce phase
• Merge Joins
• Only possible if the relations are already sorted
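A minimal Python sketch of the fragment-replicate idea: the small relation is copied to every map task and held in memory, and each task joins it against its own fragment of the large relation. The function name, the relations and the way fragments are simulated with a loop are all assumptions for illustration.

```python
from collections import defaultdict

def fragment_replicate_join(large_fragment, small_relation, key_large, key_small):
    """Join one fragment of the large relation against an in-memory copy of the small one."""
    # Build phase: every mapper holds the whole small relation in a hash table.
    lookup = defaultdict(list)
    for row in small_relation:
        lookup[key_small(row)].append(row)
    # Probe phase: stream the fragment; no shuffle or reduce step is needed.
    return [row + match
            for row in large_fragment
            for match in lookup.get(key_large(row), [])]

countries = [("ES", "Spain"), ("IN", "India")]  # small relation
clicks_fragments = [                            # large relation, split across mappers
    [("u1", "ES"), ("u2", "IN")],
    [("u3", "ES"), ("u4", "FR")],
]

joined = []
for fragment in clicks_fragments:  # each iteration stands in for one map task
    joined.extend(fragment_replicate_join(fragment, countries,
                                          key_large=lambda r: r[1],
                                          key_small=lambda r: r[0]))
```

Because every task already has all possible matches in memory, the output of the map phase is the final join result.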
JOIN in Pig Latin
• The choice of join strategy can be specified by the user
JOIN in Hive
• Normal map-reduce Join
• Mapper sends all rows with the same key to a
single reducer
• Reducer does the join
• SELECT t1.a1 as c1, t2.b1 as c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
• Map side Joins
• Small tables are replicated to all the mappers
and joined with the other tables
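The normal map-reduce join can be sketched in Python for the query shown above (t1.a2 = t2.b2, selecting t1.a1 and t2.b1). The three helper functions and the sample rows are assumptions that imitate the phases in a single process; they are not Hive code.

```python
from collections import defaultdict
from itertools import product

def map_phase(t1_rows, t2_rows):
    """Tag every row with its source table and emit it under its join key."""
    for row in t1_rows:
        yield row["a2"], ("t1", row)
    for row in t2_rows:
        yield row["b2"], ("t2", row)

def shuffle(pairs):
    """Deliver all values that share a key to the same reducer."""
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)
    return by_key

def reduce_phase(by_key):
    """For each key, pair every t1 row with every t2 row."""
    output = []
    for key, tagged in by_key.items():
        left = [row for tag, row in tagged if tag == "t1"]
        right = [row for tag, row in tagged if tag == "t2"]
        for l, r in product(left, right):
            output.append({"c1": l["a1"], "c2": r["b1"]})
    return output

# Hypothetical table contents.
t1 = [{"a1": "x", "a2": 1}, {"a1": "y", "a2": 2}]
t2 = [{"b1": "p", "b2": 1}, {"b1": "q", "b2": 1}, {"b1": "r", "b2": 3}]
result = reduce_phase(shuffle(map_phase(t1, t2)))
```

Unlike the map-side variant, every row of both tables travels through the shuffle, which is why this strategy costs more network traffic.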
JOIN in JAQL
• Currently JAQL supports equijoin
• The join expression supports equijoin of 2 or
more inputs. All of the options for inner and
outer joins are also supported
joinedRefs = join w in wroteAbout, p in products
             where w.product == p.name
             into { w.author, p.* };
JOIN - Summary
• Both Pig and Hive can perform the join in the
map phase instead of the reduce phase
• For skewed distributions of data, JAQL's join
performance falls short of that of the other two
languages
Performance
Benchmarks
• Pig Mix is a set of queries to test
performance. This set checks scalability
and latency
• Hive’s benchmark is mainly based on the queries
specified by Pavlo et al. (a selection task, an
aggregation task and a join task)
• Pig Latin and HiveQL implementations of the
TPC-H queries
Performance - Summary
• The paper describes Scale up, Scale out and
runtime
• For skewed data, Pig and Hive seem to handle it
more effectively than the JAQL runtime
• Pig and Hive are better than JAQL at exploiting an
increase in cluster size
• Pig and Hive allow the user to explicitly specify
the number of reducer tasks
• This feature has significant influence on the
performance
Machine Learning
• What page will the visitor next visit?
• Twitter has extended Pig’s support of ML by
placing learning algorithms in Pig Storage
functions
• Hive – machine learning is treated as user-defined aggregate functions (UDAFs)
• Ricardo, a proposed new data analytics platform,
combines the functionality of R and
Jaql.
Interactive queries
• One of the main problems of MapReduce and all the languages built on top
of this framework (Pig, Hive, etc.) is latency.
• As a complement of those technologies, some new frameworks that
allow programmers to query large datasets in an interactive manner
have been developed
• Dremel by Google
• The open source project Apache Drill.
• How to reduce the latency?
• Store the information as nested columns.
• Query execution based on a multi-level tree architecture.
• Balance the load by means of a query dispatcher.
• Not too many details of the query language
• It is based on SQL
• It includes the usual operations (selection, projection, etc.)
• SQL-like language features: user-defined functions and nested subqueries
• The characteristic that distinguishes this language is that it operates on
nested tables as inputs and outputs.
Conclusions
• The MapReduce programming model has big pitfalls.
• Each programming language tries to solve some of these
disadvantages in a different way.
• No single language beats all the other options.
• Comparison
• Jaql is the most expressive of the three.
• JAQL performs worse than Hive and Pig
• HiveQL and Pig Latin support map-phase JOIN.
• HiveQL uses more advanced optimization techniques for query
processing
• New technologies to solve those problems:
• Languages: Dremel and Apache Drill
• Libraries: Mahout
Thank you very much!

More Related Content

What's hot

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
Jakir Hossain
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
datamantra
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
FARUK BERKSÖZ
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
Jeff Hammerbacher
 
Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
Zahra Eskandari
 
try
trytry
Java High Level Stream API
Java High Level Stream APIJava High Level Stream API
Java High Level Stream API
Apache Apex
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
Romain Jacotin
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
Muhammad Shahid
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
punesparkmeetup
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
Prashant Gupta
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Frane Bandov
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
Dylan Wan
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
George Ang
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 

What's hot (20)

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
 
try
trytry
try
 
Java High Level Stream API
Java High Level Stream APIJava High Level Stream API
Java High Level Stream API
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 

Viewers also liked

Big data ppt
Big  data pptBig  data ppt
Big data ppt
Nasrin Hussain
 
Master thesis byambajargal
Master thesis byambajargalMaster thesis byambajargal
Master thesis byambajargal
kumank
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
Skillwise Group
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OS
Jane Man
 
Hadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical BasicsHadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical Basics
Zitao Liu
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
Alexey Grigorev
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
npinto
 
MapReduce
MapReduceMapReduce
MapReduce
Abe Arredondo
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data Technologies
Nicolas Morales
 
IBM-Why Big Data?
IBM-Why Big Data?IBM-Why Big Data?
IBM-Why Big Data?
Kun Le
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Process Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at ExpediaProcess Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at Expedia
huguk
 
Big data
Big dataBig data
Big data
hsn99
 
Programming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwaresProgramming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwares
Nisarg Amin
 
Computer languages
Computer languagesComputer languages
Computer languages
BESOR ACADEMY
 
4 introduction to programming structure
4 introduction to programming structure4 introduction to programming structure
4 introduction to programming structure
Rheigh Henley Calderon
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
Anshumali Singh
 
Programming languages
Programming languagesProgramming languages
Programming languages
Archana Maharjan
 
High Level Languages (Imperative, Object Orientated, Declarative)
High Level Languages (Imperative, Object Orientated, Declarative)High Level Languages (Imperative, Object Orientated, Declarative)
High Level Languages (Imperative, Object Orientated, Declarative)
Project Student
 

Viewers also liked (20)

Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Master thesis byambajargal
Master thesis byambajargalMaster thesis byambajargal
Master thesis byambajargal
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OS
 
Hadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical BasicsHadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical Basics
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
MapReduce
MapReduceMapReduce
MapReduce
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data Technologies
 
IBM-Why Big Data?
IBM-Why Big Data?IBM-Why Big Data?
IBM-Why Big Data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Process Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at ExpediaProcess Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at Expedia
 
Big data
Big dataBig data
Big data
 
Programming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwaresProgramming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwares
 
Computer languages
Computer languagesComputer languages
Computer languages
 
4 introduction to programming structure
4 introduction to programming structure4 introduction to programming structure
4 introduction to programming structure
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
 
Programming languages
Programming languagesProgramming languages
Programming languages
 
High Level Languages (Imperative, Object Orientated, Declarative)
High Level Languages (Imperative, Object Orientated, Declarative)High Level Languages (Imperative, Object Orientated, Declarative)
High Level Languages (Imperative, Object Orientated, Declarative)
 

Similar to High-level languages for Big Data Analytics (Presentation)

Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
Emanuel Calvo
 
A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analytics
KrishnaVeni451953
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
AdarshaDhakal
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Pig Experience
Pig ExperiencePig Experience
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
NetajiGandi1
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
hadooparchbook
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
bhuvankumar3877
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
Prashant Gupta
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Flink Forward
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Map reduce advantages over parallel databases
Map reduce advantages over parallel databases Map reduce advantages over parallel databases
Map reduce advantages over parallel databases
Ahmad El Tawil
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
Ahmad El Tawil
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
Csaba Toth
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
ssusere05ec21
 
Project Progress
Project ProgressProject Progress
Project Progress
sunnysomchok
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
BikalAdhikari4
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 

Similar to High-level languages for Big Data Analytics (Presentation) (20)

Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analytics
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Map reduce advantages over parallel databases
Map reduce advantages over parallel databases Map reduce advantages over parallel databases
Map reduce advantages over parallel databases
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Project Progress
Project ProgressProject Progress
Project Progress
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 

More from Jose Luis Lopez Pino

Lessons learnt from applying PyData to GetYourGuide marketing
Lessons learnt from applying PyData to GetYourGuide marketingLessons learnt from applying PyData to GetYourGuide marketing
Lessons learnt from applying PyData to GetYourGuide marketing
Jose Luis Lopez Pino
 
BDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesBDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the masses
Jose Luis Lopez Pino
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R
Jose Luis Lopez Pino
 
Metadata in Business Intelligence
Metadata in Business IntelligenceMetadata in Business Intelligence
Metadata in Business Intelligence
Jose Luis Lopez Pino
 
Scheduling and sharing resources in Data Clusters
Scheduling and sharing resources in Data ClustersScheduling and sharing resources in Data Clusters
Scheduling and sharing resources in Data Clusters
Jose Luis Lopez Pino
 
Distributed streaming k means
Distributed streaming k meansDistributed streaming k means
Distributed streaming k means
Jose Luis Lopez Pino
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
High-level languages for Big Data Analytics (Presentation)

  • 1. BI Seminar project: High-level languages for Big Data Analytics Janani Chakkaradhari Jose Luis Lopez Pino
  • 2. Outline 1. Introduction 1.1 The Map Reduce programming model 1.2 Hadoop 2. High Level Languages 2.1 Pig Latin 2.2 Hive 2.3 JAQL 2.4 Other Languages 3. Comparison of HLLs 3.1 Syntax Comparison 3.2 Performance 3.3 Query Compilation 3.4 JOIN Implementation 4. Future Work 4.1 Machine Learning 4.2 Interactive Queries 5. Conclusion
  • 4. The MapReduce model • Introduced in 2004 by Google. • This model allows programmers without any experience in parallel coding to write highly scalable programs and hence process voluminous data sets. • This high level of scalability is reached by decomposing the problem into a large number of tasks. • The Map function takes a single key/value pair as input and produces a set of key/value pairs. • The Reduce function takes a key and the set of values related to this key as input; it may also produce a set of values, but commonly it emits only one value or none.
  • 5. The MapReduce model • Advantages: • Scalability • Handles failures and balances the system • Pitfalls: • Some tasks are complicated to code. • Some tasks are very expensive. • The code is difficult to debug. • Absence of schema and indexes. • A lot of bandwidth might be consumed.
  • 6. Hadoop • An Apache Software Foundation open source project • Hadoop = HDFS + MapReduce • DFS: partitions the data and stores it on separate machines • HDFS: stores large files, runs on clusters of commodity hardware, typically with 64 MB per block • The file system and MapReduce are co-designed
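The Map and Reduce signatures described on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's API: `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical names, and the in-memory dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: one key/value pair in, a list of key/value pairs out."""
    # key is a document id (unused here); value is the document text.
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: a key and all its values in, zero or more values out."""
    return [sum(values)]


def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle step: collect every intermediate value under its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Every key is reduced independently, which is what makes the
    # model easy to spread over many machines.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


docs = [("d1", "pig hive jaql"), ("d2", "pig pig hive")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

A real framework would run the map calls and the per-key reduce calls on different nodes; the sketch keeps only the data flow.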
  • 7. Hadoop • No separate storage network and processing network • Moving compute to the data node
  • 9. High level languages • Two different types • Created specifically for this model. • Already existing languages • Languages present in the comparison • Pig Latin • HiveQL • Jaql • Interesting languages • Meteor • DryadLINQ
  • 10. Pig Latin • Executed over Hadoop. • Procedural language. • High-level operations similar to those found in SQL. • Some interesting operators: • FOREACH applies a transformation to every tuple of the set. To make it possible to parallelise this operation, the transformation of one tuple should not depend on any other tuple. • COGROUP groups related tuples of multiple datasets. It is similar to the first step of a join. • LOAD loads the input data and its structure, and STORE saves data to a file.
  • 11. Pig Latin • Goal: to reduce development time. • Nested data model. • User-defined functions. • Analytic queries over text files (no need to load the data first). • Procedural language -> control over the execution plan • The user can speed up performance. • It makes the work of the query optimiser easier. • Unlike SQL.
  • 12. HiveQL • Open-source DW solution built on top of Hadoop • The queries look similar to SQL, with some extensions • Complex column types: map, array and struct • It stores the metadata in an RDBMS
  • 13. HiveQL • The Metastore acts as the system catalog for Hive • It stores all the information about the tables, their partitions, the schema, etc. • Without the system catalog it is not possible to impose a structure on Hadoop files. • Facebook uses MySQL to store this metadata, because this information has to be served quickly to the compiler
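The semantics of FOREACH and COGROUP can be illustrated with a small Python sketch. It is not Pig code and does not use any Pig API; `foreach`, `cogroup` and the sample relations are hypothetical names chosen for the example.

```python
from collections import defaultdict


def foreach(relation, transform):
    # Each tuple is transformed on its own, so the work can be
    # split across machines in any way.
    return [transform(t) for t in relation]


def cogroup(left, right, left_key, right_key):
    # For every key, collect the matching tuples of each input in
    # its own bag. The result is nested, not flat.
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return [(key, bags[0], bags[1]) for key, bags in sorted(groups.items())]


users = [(1, "ana"), (2, "bo")]
visits = [(1, "/home"), (1, "/cart"), (3, "/faq")]

shouted = foreach(users, lambda t: (t[0], t[1].upper()))
grouped = cogroup(users, visits, lambda t: t[0], lambda t: t[0])
```

Keys that appear in only one input still produce a group, with an empty bag on the other side; this is why COGROUP is only the first step of a join.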
  • 14. JAQL • What is Jaql? • Declarative scripting programming language. • Used over Hadoop’s MapReduce framework • Included in IBM’s InfoSphere BigInsights and Cognos Consumer Insight products. • Developed after Pig and Hive. • More scalable. • More flexible • More reusable. • Data model • Simple: similar to JSON. • Values as trees. • No references. • Textual representation very similar. • Flexible • Handle semistructured documents. • But also structured records validated against a schema.
  • 15. JAQL • Control over the evaluation plan. • The programmer can work at different levels of abstraction using Jaql's syntax: • Full definition of the execution plan. • Use of hints to indicate to the optimizer some evaluation features. • This feature is present in most of the database engines that use SQL as query language. • Declarative programming, without any control over the flow.
  • 16. Other languages: Meteor • Stratosphere stack • Pact: • Programming model • It extends MapReduce with new second-order functions • Cross: Cartesian product • CoGroup: groups all the records with the same key and processes them. • Match: similar to CoGroup, but pairs with the same key can be processed separately. • Sopremo: • Semantically rich operator model • Extensible • Meteor: query language • Optimization • Meteor code • Logical plan using Sopremo operators -> Optimized • Pact final program -> Physically optimized
  • 17. Other languages: DryadLINQ • Code embedded in .NET programming languages • Operators • Almost all the operators available in LINQ. • Some specific operators for parallel programming. • Developers can include their own implementations. • DryadLINQ code is translated to a Dryad plan • Optimization • Pipeline operations • Remove redundancy • Push aggregations • Reduce network traffic
  • 19. Comparing HLLs • Different design motivations • Developers' preferences • Write concise code -> Expressiveness • Efficiency -> Performance • Criteria that impact performance • Join implementation • Query processing • Some others are not included • Language paradigm • Scalability
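The difference between the three second-order functions named on this slide is mainly how often the user function is called and with what arguments. The Python sketch below is an assumption-laden illustration, not the Stratosphere API: `cross`, `match`, `cogroup` and the helper `_group` are hypothetical, and records are plain `(key, value)` pairs.

```python
from collections import defaultdict
from itertools import product


def _group(records):
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def cross(left, right, udf):
    # User function is called once per pair of the Cartesian product.
    return [udf(l, r) for l, r in product(left, right)]


def match(left, right, udf):
    # User function is called once per pair of values sharing a key,
    # so every pair can be processed separately.
    lg, rg = _group(left), _group(right)
    return [udf(k, l, r)
            for k in sorted(lg.keys() & rg.keys())
            for l in lg[k] for r in rg[k]]


def cogroup(left, right, udf):
    # User function is called once per key with all values of both inputs.
    lg, rg = _group(left), _group(right)
    return [udf(k, lg.get(k, []), rg.get(k, []))
            for k in sorted(lg.keys() | rg.keys())]


a = [(1, "x"), (1, "y"), (2, "z")]
b = [(1, "p"), (3, "q")]
pairs = cross(a, b, lambda l, r: (l, r))
matched = match(a, b, lambda k, l, r: (k, l, r))
grouped = cogroup(a, b, lambda k, ls, rs: (k, ls, rs))
```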
  • 20. Comparison Criteria • Expressive power • Performance • Query Compilation • JOIN Implementation
  • 21. Expressive power • Three categories by Robert Stewart: • Relational complete • SQL equivalent (aggregate functions) • Turing complete • Conditional branching • Indefinite iterations by means of recursion • Emulation of infinite memory model
  • 25. Expressive power • But this does not mean that SQL, Pig Latin and HiveQL are the same. • HiveQL • Is inspired by SQL but does not support the full repertoire included in the SQL-92 specification • Includes features notably inspired by MySQL and MapReduce that are not part of SQL. • Pig Latin • It is not inspired by SQL. • For instance, it does not have an OVER clause
  • 26. SQL vs. HiveQL (2009) • Transactions: SQL yes; HiveQL no. • Indexes: SQL yes; HiveQL no. • Create table as select: SQL not in SQL-92; HiveQL yes. • Subqueries: SQL in any clause, correlated or not; HiveQL only in the FROM clause, only noncorrelated. • Views: SQL yes; HiveQL not materialized. • Extension with map/reduce scripts: SQL no; HiveQL yes.
  • 28. Query Processing • To make a good comparison, we need a basic understanding of how these high-level query languages work. • How is the abstract user representation of the query or script converted to MapReduce jobs?
  • 29. Query Processing – Pig Latin • The goal of writing a Pig Latin script is to produce equivalent MapReduce jobs that can be executed in the Hadoop environment • The parser first checks for syntactic errors
  • 30. Query Processing – Pig Latin
  • 31. Query Processing – Pig Latin
  • 32. Query Processing - Hive • It receives the HiveQL string from the client • The parser phase converts it into a parse tree representation • The logical query plan generator converts it into a logical query representation; it prunes columns early and pushes predicates closer to the table. • The logical plan is converted to a physical plan and then to MapReduce jobs.
  • 33. Query Processing - JAQL • JAQL includes two higher-order functions, mapReduceFn and mapAggregate • The rewriter engine generates calls to mapReduceFn or mapAggregate
  • 34. QP - Summary • Each of these languages has its own methods • All support syntax checking, usually done by the compiler • Pig currently lacks optimized storage structures such as indexes and column groups • HiveQL provides more optimizations • It prunes the buckets that are not needed • Predicate push-down • Query rewriting (projection push-down) is future work for JAQL
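The effect of pruning columns early and pushing predicates closer to the table can be shown with a toy scan operator in Python. This is a sketch under simple assumptions, not Hive's optimizer: `scan`, `naive_plan`, `optimized_plan` and the `users` table are hypothetical, and the number of cells handed to the next operator is used as a rough stand-in for data movement.

```python
def scan(table, columns=None, predicate=None):
    """Read rows, optionally filtering and pruning columns at the source."""
    for row in table:
        if predicate is not None and not predicate(row):
            continue
        kept = row if columns is None else {c: row[c] for c in columns}
        yield dict(kept)


users = [
    {"name": "ana", "country": "ES", "age": 31},
    {"name": "bo", "country": "DE", "age": 25},
    {"name": "cy", "country": "ES", "age": 47},
]


def is_spanish(row):
    return row["country"] == "ES"


def naive_plan(table):
    # Scan everything, then filter and project in later operators.
    rows = list(scan(table))
    cells_passed_on = sum(len(r) for r in rows)
    result = [{"name": r["name"]} for r in rows if is_spanish(r)]
    return result, cells_passed_on


def optimized_plan(table):
    # Filter and prune inside the scan, next to the table.
    rows = list(scan(table, columns=["name"], predicate=is_spanish))
    cells_passed_on = sum(len(r) for r in rows)
    return rows, cells_passed_on


naive_result, naive_cells = naive_plan(users)
fast_result, fast_cells = optimized_plan(users)
```

Both plans return the same rows; the optimized one simply hands far fewer values to the rest of the pipeline.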
  • 36. JOIN in Pig Latin • Pig Latin supports inner join, equijoin and outer join. By default, the JOIN operator performs an inner join. • A join can also be achieved by a COGROUP operation followed by FLATTEN • JOIN creates a flat set of output records, while COGROUP creates a nested set of output records • GROUP: when only one relation is involved • COGROUP: when multiple relations are involved • FLATTEN: (a, {(b,c), (d,e)}) -> (a, b, c) and (a, d, e)
  • 37. JOIN in Pig Latin • Fragment-replicate joins • Trivial case, only possible if one of the two relations is small enough to fit into memory • The join is performed in the map phase • Skewed joins • For data that is not evenly distributed • Computes a histogram of the key space and uses it to allocate reducers for a given key • The join is performed in the reduce phase • Merge joins • Only possible if the relations are already sorted
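The FLATTEN example on this slide, and the way a nested COGROUP result becomes a flat inner join, can be sketched in Python. The functions `flatten` and `join_via_cogroup` are hypothetical helpers written for this illustration; the input mimics a COGROUP output of the form (key, left bag, right bag).

```python
from itertools import product


def flatten(record):
    """(a, {(b, c), (d, e)}) -> (a, b, c) and (a, d, e)."""
    head, bag = record
    return [(head,) + tuple(inner) for inner in bag]


def join_via_cogroup(cogrouped):
    # Un-nest both bags of every group. A group with an empty bag
    # yields no rows, which gives inner-join behaviour.
    return [l + r
            for _key, left_bag, right_bag in cogrouped
            for l, r in product(left_bag, right_bag)]


flat = flatten(("a", [("b", "c"), ("d", "e")]))

cogrouped = [
    (1, [(1, "ana")], [(1, "/home"), (1, "/cart")]),
    (2, [(2, "bo")], []),
]
joined = join_via_cogroup(cogrouped)
```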
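Two of the strategies listed above can be sketched compactly in Python. This is an illustration of the general techniques, not Pig's implementation: `fragment_replicate_join` and `merge_join` are hypothetical names, and relations are lists of `(key, value)` pairs.

```python
def fragment_replicate_join(big_fragment, small_relation):
    # The small relation fits in memory, so each mapper builds a hash
    # table from it and probes it with its fragment of the big relation.
    lookup = {}
    for key, value in small_relation:
        lookup.setdefault(key, []).append(value)
    return [(key, big_value, small_value)
            for key, big_value in big_fragment
            for small_value in lookup.get(key, [])]


def merge_join(left, right):
    # Both inputs must already be sorted by key.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side and pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
replicated = fragment_replicate_join(left, right)
merged = merge_join(left, right)
```

Neither variant needs a shuffle: the first trades memory for it, the second relies on the existing sort order.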
  • 38. JOIN in Pig Latin • The choice of join strategy can be specified by the user
  • 39. JOIN in Hive • Normal map-reduce Join • Mapper sends all rows with the same key to a single reducer • Reducer does the join • SELECT t1.a1 as c1, t2.b1 as c2 FROM t1 JOIN t2 ON (t1.a2 = t2.b2); • Map side Joins • small tables are replicated in all the mappers and joined with other tables
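The normal map-reduce join for the query on this slide can be sketched in Python as follows. This is a single-process illustration, not Hive's generated plan: the function names and the tagging scheme are assumptions made for the example.

```python
from collections import defaultdict


def map_t1(row):
    # Emit the join key, and tag the value with its source table.
    return row["a2"], ("t1", row["a1"])


def map_t2(row):
    return row["b2"], ("t2", row["b1"])


def reduce_join(key, tagged_values):
    # All rows with the same key arrive at one reducer, which pairs them.
    left = [v for tag, v in tagged_values if tag == "t1"]
    right = [v for tag, v in tagged_values if tag == "t2"]
    return [{"c1": a1, "c2": b1} for a1 in left for b1 in right]


def reduce_side_join(t1, t2):
    shuffled = defaultdict(list)
    for row in t1:
        key, value = map_t1(row)
        shuffled[key].append(value)
    for row in t2:
        key, value = map_t2(row)
        shuffled[key].append(value)
    result = []
    for key in sorted(shuffled):
        result.extend(reduce_join(key, shuffled[key]))
    return result


t1 = [{"a1": "x", "a2": 1}, {"a1": "y", "a2": 2}]
t2 = [{"b1": "p", "b2": 1}, {"b1": "q", "b2": 1}, {"b1": "r", "b2": 3}]
rows = reduce_side_join(t1, t2)
```

In a map-side join the shuffle disappears, because every mapper already holds a full copy of the small table.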
  • 40. JOIN in JAQL • Currently JAQL supports equijoin • The join expression supports equijoin of 2 or more inputs. All of the options for inner and outer joins are also supported: joinedRefs = join w in wroteAbout, p in products where w.product == p.name into { w.author, p.* };
  • 41. JOIN - Summary • Both Pig and Hive can perform the join in the map phase instead of the reduce phase • For skewed data distributions, the join performance of JAQL does not match that of the other two languages
  • 43. Benchmarks • Pig Mix is a set of queries for testing performance. This set checks scalability and latency • Hive's benchmark is mainly based on the queries specified by Pavlo et al. (a selection task, an aggregation task and a join task) • Pig Latin and HiveQL implementations of the TPC-H queries
  • 44. Performance - Summary • The paper describes scale-up, scale-out and runtime • For skewed data, Pig and Hive seem to handle it more effectively than JAQL • Pig and Hive make better use of an increase in cluster size than JAQL • Pig and Hive allow the user to explicitly specify the number of reduce tasks • This feature has a significant influence on performance
  • 46. Machine Learning • What page will the visitor visit next? • Twitter has extended Pig's support for ML by placing learning algorithms in Pig storage functions • Hive: machine learning is handled through UDAFs (user-defined aggregate functions) • Ricardo, a new data analytics platform, has been proposed; it combines the functionality of R and Jaql.
  • 47. Interactive queries • One of the main problems of MapReduce and all the languages built on top of this framework (Pig, Hive, etc.) is latency. • As a complement to those technologies, some new frameworks have been developed that allow programmers to query large datasets interactively • Dremel by Google • The open source project Apache Drill. • How to reduce the latency? • Store the information as nested columns. • Query execution based on a multi-level tree architecture. • Balance the load by means of a query dispatcher. • Few details of the query language are available • It is based on SQL • It includes the usual operations (selection, projection, etc.) • SQL-like language features: user-defined functions and nested subqueries • The characteristic that distinguishes this language is that it operates with nested tables as inputs and outputs.
  • 48. Conclusions • The MapReduce programming model has big pitfalls. • Each programming language tries to solve some of these disadvantages in a different way. • No single language beats all the other options. • Comparison • Jaql is more expressive. • JAQL performs worse than Hive and Pig • HiveQL and Pig Latin support map-phase JOIN. • HiveQL uses more advanced optimization techniques for query processing • New technologies to solve those problems: • Languages: Dremel and Apache Drill • Libraries: Mahout
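The idea of a multi-level execution tree can be sketched in Python for a simple average query. This is a toy illustration, not Dremel or Apache Drill code: `leaf`, `combine` and `serving_tree` are hypothetical names, and the fan-out of 2 is an arbitrary assumption.

```python
def leaf(partition):
    # A leaf server scans its own partition and returns a partial aggregate.
    return (sum(partition), len(partition))


def combine(partials):
    # An intermediate server merges the partial aggregates of its children.
    return (sum(s for s, _ in partials), sum(c for _, c in partials))


def serving_tree(partitions, fanout=2):
    # Assumes at least one partition and fanout >= 2.
    level = [leaf(p) for p in partitions]
    while len(level) > 1:
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


partitions = [[1, 2], [3], [4, 5, 6], [7]]
total, count = serving_tree(partitions)
average = total / count
```

Only small partial results travel up the tree, so the root never has to see the raw rows.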

Editor's Notes

  1. 80% of execution time is spent executing at most 20% of the code. The aim is to provide an abstract data querying interface that removes the burden of the MR implementation from the programmer. The question is whether or not programs pay a performance penalty for opting for these more abstract languages. Reference: Loop Recognition in C++/Java/Go/Scala, Robert Hundt, Google, 2011. Are there any optimization techniques and, if so, when and where are they applied?
  2. In Pig, the GROUP operator is translated into LOCAL REARRANGE, GLOBAL REARRANGE and PACKAGE in the physical plan. Rearranging means either hashing or sorting by key. The combination of local and global rearranges ensures that tuples with the same group key are moved to the same machine.
  3. TPC-H is a benchmark for relational decision-support systems. The ideal theoretical outcome would be that there is no increase in the computation time (T) for a given job. Scalability and fault tolerance. A document is a good match to a query if the document model is likely to generate the query.
  4. JAQL includes two higher-order functions, mapReduceFn and mapAggregate, to execute map reduce and aggregate operations respectively. The rewriter engine generates calls to mapReduceFn or mapAggregate by identifying the parts of the script and moving them to the map, reduce and aggregate function parameters. Based on a set of rules, the rewriter converts them into an Expr tree. Finally, it checks for the presence of algebraic aggregates; if there are any, it invokes the mrAggregate function. In other words, it can complete the task with a single map reduce job.
  5. JAQL's physical transparency is an added-value feature because it allows the user to add new run-time operators without affecting JAQL's internals.
  8. In this case, the big relation is distributed across Hadoop nodes and the smaller relation is replicated on each node. Here the entire join operation is performed in the map phase. In general, the data in a data warehouse is not evenly distributed and is prone to skew. Pig handles this condition by employing a skewed join. The basic idea is to compute a histogram of the key space and use it to allocate reducers for a given key. Currently Pig allows a skewed join of only two tables. The join is performed in the reduce phase.
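The histogram idea behind the skewed join can be sketched in Python. The allocation rule below (one reducer per `rows_per_reducer` rows of a key) is an assumption made for the illustration and is not taken from Pig; `allocate_reducers` is a hypothetical name.

```python
import math
from collections import Counter


def allocate_reducers(keys, rows_per_reducer):
    # Histogram of the key space: how many rows carry each key.
    histogram = Counter(keys)
    # Frequent keys get several reducers, rare keys get one.
    return {key: max(1, math.ceil(count / rows_per_reducer))
            for key, count in histogram.items()}


sampled_keys = ["a"] * 10 + ["b"] * 2 + ["c"]
plan = allocate_reducers(sampled_keys, rows_per_reducer=4)
```

With this plan the rows of the hot key "a" are spread over three reducers instead of overloading a single one.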
  11. By combining multi-level execution trees and a columnar data layout, Dremel is capable of running aggregation queries over trillion-row tables in seconds. Dremel uses a novel query execution engine based on aggregator trees to run near-real-time, interactive and ad hoc queries, which MapReduce cannot do; Pig and Hive are not real-time either. Dremel is what the future of Hive (and not MapReduce, as I mentioned before) should be. Hive currently provides an SQL-like interface to run MapReduce jobs. It has very high latency, and so is not practical for ad hoc data analysis. Dremel provides a very fast SQL-like interface to the data by using a different technique than MapReduce.