EMC Corporation All rights reserved
• Introduction
• Hive
• Impala
• SparkSQL
• HBase + Phoenix
• Drill
• Networking & Pizza
• How many developers?
• How many BI/SQL developers?
• How many business analysts/salespeople?
• How many have used Hadoop?
• How many have used SQL on Hadoop?
• Hadoop is an open-source framework for large-scale
data storage and processing.
• Application Workgroup in EMC
– Focused on
•Big data development/infrastructure
•Application modernization
• Fahim Kundi
– 10+ years of experience in EDW and big data
• Haden Pareira
– Data engineer with 5+ years of Hadoop experience
• Muhammad Ali
– Data engineer with 2+ years of Hadoop experience
• HDFS is a file system – it’s all files
• MapReduce requires strong programming skills
• It’s so difficult
• SQL is well known in the analytics community
• Faster and easier data insights
• Allows SQL/BI developers to retain their expertise
and create value out of big data
• Cloudera – Impala
• Hortonworks – Hive/Tez
• Pivotal – HAWQ … now HDB
• MapR – Drill
• IBM – Big SQL
Hive and HAWQ
By Fahim Kundi
• Hive Introduction
• How Hive Works
• Apache Tez
• Hive with Tez vs. MapReduce
• ORC and Parquet Format
• HAWQ Introduction
• Query Optimizer
• PXF
• Apache Hive provides a high-level query language
and data warehouse features built on top of Hadoop.
• It was initially developed by Facebook and made
open source in 2008.
• SQL-like query language called HiveQL (HQL).
• Partitioning and bucketing for faster queries.
• Integration with visualization tools.
• Hive supports all the common primitive data
types, such as INT, BINARY and BOOLEAN.
• In addition, analysts can combine primitive
data types to form complex data types, such
as structs, maps and arrays.
• The tables in Hive are similar to tables in a relational
database.
• Databases are composed of tables, which are made up
of partitions.
• Data can be accessed via a simple query language, and
Hive supports overwriting or appending data.
• Internally, Hive queries are converted into MapReduce
or Tez jobs.
• Within a particular database, data in the tables is
serialized and each table has a corresponding Hadoop
Distributed File System (HDFS) directory.
• Each table can be subdivided into partitions that
determine how data is distributed within subdirectories
of the table directory.
• Data within partitions can be further broken down into
buckets.
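The directory layout described above can be sketched in a few lines of Python. This is an illustration only: the warehouse path, the database and table names, and the CRC32 hash are assumptions made for the example, and Hive uses its own hash function to assign rows to buckets.

```python
import zlib


def partition_dir(warehouse, database, table, partition_spec):
    """Build the directory that would hold one partition's files.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one "column=value" subdirectory under the table directory.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([warehouse, f"{database}.db", table, *parts])


def bucket_for(value, num_buckets):
    """Pick a bucket for a value of the bucketing column.

    CRC32 is a stand-in hash for illustration, not the function Hive uses.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table "orders" in database "sales", partitioned by year and month.
path = partition_dir("/user/hive/warehouse", "sales", "orders",
                     [("year", 2016), ("month", 5)])
bucket = bucket_for("customer-42", 8)
print(path)
print(bucket)
```

A query that filters on year and month only needs to read files under the matching subdirectory, which is where the speed-up from partitioning comes from.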
• Apache Tez is a distributed execution framework
targeted at data-processing applications on Hadoop.
• Tez was developed by Hortonworks and is built on top of
YARN (the resource management framework for Hadoop).
• Tez generalizes MapReduce into a more powerful
framework: it builds a dataflow graph for each job
submitted by the user.
• The Tez API has the following components –
– DAG (Directed Acyclic Graph) – defines the overall job.
One DAG object corresponds to one job
– Vertex – defines the user logic along with the resources
and the environment needed to execute the user logic.
One Vertex corresponds to one step in the job
– Edge – defines the connection between producer and
consumer vertices.
• Tez is not meant directly for end-users – in fact it
enables developers to build end-user applications with
much better performance and flexibility.
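The DAG, vertex and edge concepts can be illustrated with a toy job runner in plain Python. This is not the Tez API (which is Java); the vertex names and the run_dag helper are hypothetical, and the sketch only shows how a job defined as a graph runs each step after its producers.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def scan_orders(inputs):
    # Source vertex: no producers, emits (customer, amount) rows.
    return [("c1", 10), ("c2", 5), ("c1", 7)]


def filter_large(inputs):
    return [row for row in inputs["scan_orders"] if row[1] >= 7]


def sum_by_customer(inputs):
    totals = {}
    for customer, amount in inputs["filter_large"]:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


# Vertices: one step of user logic each.
vertices = {
    "scan_orders": scan_orders,
    "filter_large": filter_large,
    "sum_by_customer": sum_by_customer,
}

# Edges: consumer vertex -> set of producer vertices it reads from.
edges = {
    "scan_orders": set(),
    "filter_large": {"scan_orders"},
    "sum_by_customer": {"filter_large"},
}


def run_dag(vertices, edges):
    """Run every vertex after all of its producers have finished."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = {producer: results[producer] for producer in edges[name]}
        results[name] = vertices[name](inputs)
    return results


results = run_dag(vertices, edges)
print(results["sum_by_customer"])
```

Unlike a chain of separate MapReduce jobs, the whole graph is one job, so intermediate results do not have to be written out between steps.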
• ORC (Optimized Row Columnar) is a columnar file format designed
for Hadoop workloads.
• ORC files were developed to massively speed up Apache Hive and
improve the storage efficiency of data stored in Apache Hadoop.
The format is optimized for large streaming reads.
• ORC features:
– Columnar format for complex data types
– Built into Hive from 0.11
– Support for Pig and MapReduce via HCatalog
– Two levels of compression
• Lightweight, type-specific
• General
– Built-in indexes
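The two levels of compression can be sketched as follows, assuming run-length encoding as the lightweight, type-specific step and zlib as the general-purpose step. The text encoding of the runs is made up for the example and is not the ORC on-disk format.

```python
import zlib


def run_length_encode(values):
    """Lightweight, type-specific step: collapse repeated values into runs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


# A column with long runs of repeated integers, as is common in sorted data.
column = [7] * 1000 + [8] * 500 + [7] * 250

raw = ",".join(map(str, column)).encode("ascii")
runs = run_length_encode(column)
encoded = ",".join(f"{value}x{count}" for value, count in runs).encode("ascii")

# General step: a generic codec applied to the already encoded bytes.
compressed = zlib.compress(encoded)

print(len(raw), len(encoded), len(compressed))
```

On an input this small the generic codec adds little; its benefit shows on larger, less regular encoded streams.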
• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem, regardless of
the choice of data processing framework, data model
or programming language.
• Parquet features:
– Columnar file format
– Supports nested data structures
– Accessible by Hive, Spark, Pig, Drill and MapReduce
– Read/write in HDFS or the local file system
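What "columnar" means in practice can be shown with a small sketch that pivots row records into per-column lists. The records and column names are hypothetical, and real Parquet files add encoding, compression and metadata on top of this idea.

```python
# Row-oriented records, as an application would typically produce them.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "CA", "amount": 2.5},
    {"id": 3, "country": "US", "amount": 7.5},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])
print(columns["country"], total_amount)
```

Storing each column contiguously is what lets an analytical query read only the columns it needs.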
• Major considerations for choosing ORC over Parquet
– Many of the performance improvements provided by the Stinger
initiative depend on features of the ORC format, including a
block-level index for each column. This leads to potentially more
efficient I/O, allowing Hive to skip reading entire blocks of data if it
determines that the predicate values are not present there.
– The Cost Based Optimizer can also consider the column-level
metadata present in ORC files in order to generate the most
efficient graph.
– In Hive, ACID transactions are only possible when using ORC as the file
format.
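Block skipping with a per-column index can be sketched as below, assuming the index stores the minimum and maximum value of each block. The block size and function names are illustrative and do not reflect ORC's actual index structure.

```python
def build_index(values, block_size):
    """Split a column into blocks and record (min, max, block) for each."""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_equal(index, target):
    """Return matching values and the number of blocks actually read."""
    hits = []
    blocks_read = 0
    for low, high, block in index:
        if low <= target <= high:  # otherwise the block cannot match: skip it
            blocks_read += 1
            hits.extend(value for value in block if value == target)
    return hits, blocks_read


column = list(range(100))        # sorted data makes min/max ranges narrow
index = build_index(column, 10)  # 10 blocks of 10 values each

hits, blocks_read = scan_equal(index, 42)
print(hits, blocks_read, len(index))
```

The benefit depends on how the data is ordered: if every block spans the full value range, no block can be skipped.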
• HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS for
its storage layer.
• HAWQ's query processing evolved from the Greenplum Database query planner
and does not rely on MapReduce under the hood.
• HAWQ reads data from and writes data to HDFS natively.
• It also has an extension framework (PXF) that allows it to interact with data
held by other services and formats (HBase, Hive, Avro, etc.) that also
reside in HDFS.
• HAWQ provides all major features found in Greenplum
– SQL Completeness: 2003 Extensions
– JDBC Compliant
– Robust Query Optimizer
– Row or Column-Oriented Table Storage
– Parallel Loading and Unloading
– Distributions
– Multi-level Partitioning
– High speed data redistribution
– Views
– External Tables
– Compression
– Resource Management
– Security
– Authentication
– Management and Monitoring
[Figure: HAWQ architecture. Components shown: HAWQ Master (Parser, Query Optimizer), Local Storage, HAWQ Standby, Secondary NameNode, and Segment Hosts (Query Executor, Segment …, Local Temp Storage)]
[Figure: example query plan with Gather Motion, Redistribute Motion and Broadcast Motion operators above Seq Scan nodes]
• Turns a SQL query into an execution plan
• Cost-based optimizer
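The three motion operators named in the plan can be sketched as simple data movements between segments. This is a toy model under stated assumptions: rows are tuples, segments are Python lists, and Python's built-in hash on integer keys stands in for the engine's distribution hash.

```python
def redistribute(rows, key_index, num_segments):
    """Redistribute Motion: send each row to the segment chosen by its key,
    so rows with equal keys end up on the same segment."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash(row[key_index]) % num_segments].append(row)
    return segments


def broadcast(rows, num_segments):
    """Broadcast Motion: give every segment a full copy of the rows."""
    return [list(rows) for _ in range(num_segments)]


def gather(segments):
    """Gather Motion: collect the rows from all segments in one place."""
    return [row for segment in segments for row in segment]


# Hypothetical (customer_id, item) rows with integer keys.
orders = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

by_customer = redistribute(orders, key_index=0, num_segments=2)
copies = broadcast([(1, "gold"), (2, "silver")], num_segments=2)
collected = gather(by_customer)
print(by_customer)
```

A cost-based optimizer would typically broadcast a small table and redistribute a large one, since broadcasting copies every row to every segment.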
• PXF is a fast, extensible framework connecting HAWQ to an
HDFS data store of choice that exposes a parallel API
– An advanced version of external tables
– Enables combining HAWQ data and Hadoop data in a single query
– Supports connectors for HDFS, HBase and Hive
– Provides an extensible framework API to enable custom connector
development for any data source
[Figure: PXF (Xtension Framework) connecting HAWQ to HDFS, HBase and Hive]
Impala
By Muhammad Ali
• Interactive Query on top of Hadoop
• ANSI-92 SQL Standard
• Native MPP query engine
• Written in C++
• Native to Hadoop
– Blends with the eco system
– Security
– Hive MetaStore / HCatalog
– Query existing HDFS data
• Not as fault-tolerant as MapReduce
– (or Hive or SparkSQL or …)
– If a single node fails during a query, the whole query fails
– But if it’s 20x faster, you can rerun and still finish faster ;)
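The rerun argument can be checked with a little arithmetic. A rough sketch, assuming each attempt fails independently with the same probability and a failed attempt costs a full run; the runtimes and the failure probability are hypothetical numbers chosen for illustration.

```python
def expected_runtime(single_run_time, failure_probability):
    """Expected total time when a failed query is rerun from scratch.

    The number of attempts follows a geometric distribution, so the expected
    number of attempts is 1 / (1 - p). Charging a full run for every attempt
    makes this an upper bound on the expected time.
    """
    return single_run_time / (1.0 - failure_probability)


fault_tolerant_time = 200.0            # hypothetical: engine that always completes
fast_time = fault_tolerant_time / 20   # the "20x faster" engine: 10 seconds

# Even if half of all attempts failed, the fast engine would finish sooner.
fast_expected = expected_runtime(fast_time, 0.5)
print(fast_expected, fault_tolerant_time)
```

The trade-off shifts for very long queries on large clusters, where the chance of a node failing during a single run is higher.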
• Query execution times (small to medium size)
• Parquet Format
– Compression
• High Concurrency – kills the competitors
• Partitioning
• Query Optimizer (Compute Statistics!)
• Kudu: distributed columnar storage manager
• Performance of Parquet
– Great for analytical queries
• Mutability of HBase
– Supports UPDATE/DELETE unlike Parquet
• One common storage to rule them all!
– (not exactly!)
• IoT use cases
– High velocity data
– Same data read for analytical queries in near real time
• Predictive Modeling
– Large datasets updated frequently
– Retraining models
• Time-series applications
– Kudu offers compound keys/hash based partitioning
– Avoids hot spotting
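Why hash partitioning on a compound key avoids hot spotting for time-series writes can be sketched as follows. The tablet count, the key layout and the CRC32 hash are assumptions for illustration and are not Kudu's implementation.

```python
import zlib
from collections import Counter


def range_tablet(timestamp, tablet_span):
    """Range partitioning on time: one tablet per span of timestamps."""
    return timestamp // tablet_span


def hash_tablet(device_id, timestamp, num_tablets):
    """Hash partitioning on the compound key (device_id, timestamp).

    CRC32 is a stand-in hash chosen for the example.
    """
    key = f"{device_id}|{timestamp}".encode("utf-8")
    return zlib.crc32(key) % num_tablets


# 1000 writes arriving "now": 50 hypothetical sensors, consecutive timestamps.
writes = [(f"sensor-{i % 50}", 1_000_000 + i) for i in range(1000)]

range_load = Counter(range_tablet(ts, 86_400) for _, ts in writes)
hash_load = Counter(hash_tablet(dev, ts, 4) for dev, ts in writes)

print(dict(range_load))  # every write lands in the newest range
print(dict(hash_load))   # writes are spread over several tablets
```

Range partitioning on time still helps scans over a time window, so the two schemes are often weighed against each other or combined.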
• Apache Spark: general-purpose distributed computing system
– Multiple language support (Java, Scala, Python, and R)
– Fault tolerant, data distribution, in-memory caching etc.
– Resilient distributed datasets
• Operations
– Transformations (define new RDDs)
– Actions (return value)
• No nonsense
– Up to 100x faster than MapReduce for in-memory workloads, according to the Spark project
– Disk is used only when it cannot be avoided
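The split between transformations and actions can be illustrated with a toy class in plain Python. LazyDataset is hypothetical and is not the Spark API; it only shows that transformations describe a new dataset while an action triggers the computation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset description, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    # Action: run the recorded steps and return a value.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


even_squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
ran_before_action = list(calls)  # still empty: nothing has been computed

result = even_squares.collect()
print(ran_before_action, result)
```

Deferring the work until an action is what lets an engine plan the whole chain at once and keep intermediate data in memory.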
• Structured Data Processing
– Commonly known to us as tables
• Integrated into Spark programming model
• Unified Data Access
• Scalability
• Support for HiveQL
• Cache it!
• Two APIs
– DataFrames
• Data organized into named columns
• Similar to Tables
• Can be constructed from structured data files, Hive, external DBs
– Datasets
• Experimental interface
• Strongly typed, with the benefits of the Spark SQL execution engine
• Can be constructed from regular JVM objects
SQL On Hadoop

Understanding Automated Testing Tools for Web Applications.pdf
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Benjamin Bischoff
B.Sc. Computer Science Department PPT 2024
B.Sc. Computer Science Department PPT 2024B.Sc. Computer Science Department PPT 2024
B.Sc. Computer Science Department PPT 2024
03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
John Gallagher
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery SolutionBDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Andre Hora
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf

Recently uploaded (20)

vSAN_Tutorial_Presentation with important topics
vSAN_Tutorial_Presentation with important  topicsvSAN_Tutorial_Presentation with important  topics
vSAN_Tutorial_Presentation with important topics
Literals - A Machine Independent Feature
Literals - A Machine Independent FeatureLiterals - A Machine Independent Feature
Literals - A Machine Independent Feature
Empowering Businesses with Intelligent Software Solutions - Grawlix
Empowering Businesses with Intelligent Software Solutions - GrawlixEmpowering Businesses with Intelligent Software Solutions - Grawlix
Empowering Businesses with Intelligent Software Solutions - Grawlix
06. Ruby Array & Hash - Ruby Core Teaching
06. Ruby Array & Hash - Ruby Core Teaching06. Ruby Array & Hash - Ruby Core Teaching
06. Ruby Array & Hash - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdfWaze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Applitools Autonomous 2.0 Sneak Peek.pdf
Applitools Autonomous 2.0 Sneak Peek.pdfApplitools Autonomous 2.0 Sneak Peek.pdf
Applitools Autonomous 2.0 Sneak Peek.pdf
4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
UW Cert degree offer diploma
UW Cert degree offer diploma UW Cert degree offer diploma
UW Cert degree offer diploma
Unlocking the Future of Artificial Intelligence
Unlocking the Future of Artificial IntelligenceUnlocking the Future of Artificial Intelligence
Unlocking the Future of Artificial Intelligence
Understanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdfUnderstanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdf
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
B.Sc. Computer Science Department PPT 2024
B.Sc. Computer Science Department PPT 2024B.Sc. Computer Science Department PPT 2024
B.Sc. Computer Science Department PPT 2024
03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery SolutionBDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf

SQL On Hadoop

  • 1. EMC Corporation All rights reserved SQL ON HADOOP
  • 2. AGENDA • Introduction • Hive • HAWQ • Impala • SparkSQL • HBase + Phoenix • Drill • Networking & Pizza
  • 3. INTRODUCTION: A SURVEY • How many developers?
  • 4. INTRODUCTION: A SURVEY • How many BI/SQL developers?
  • 5. INTRODUCTION: A SURVEY • How many business analysts/sales?
  • 6. INTRODUCTION: A SURVEY • How many have used Hadoop?
  • 7. INTRODUCTION: A SURVEY • How many have used SQL on Hadoop?
  • 8. WHAT IS HADOOP • Hadoop is an open source framework for large-scale data storage and processing.
  • 9. ABOUT THE HOSTS • Application Workgroup in EMC – Focused on: big data development/infrastructure, application modernization, DevOps
  • 10. ABOUT THE HOSTS: APPLICATION WORKGROUP IN EMC • Fahim Kundi – 10+ years of experience in EDW and big data • Haden Pareira – Data engineer with 5+ years of Hadoop experience • Muhammad Ali – Data engineer with 2+ years of Hadoop experience
  • 11. WHAT IS HADOOP
  • 12. WHAT IS HADOOP • HDFS is a file system – it’s all files • MapReduce requires strong programming skills • It’s so difficult
  • 13. SQL ON HADOOP • SQL is well known in the analytics community • Faster and easier data insights • Allows SQL/BI developers to retain their expertise and create value out of big data
  • 14. SQL ON HADOOP • Cloudera – Impala • Hortonworks – Hive/Tez • Pivotal – HAWQ … now HDB • MapR – Drill • IBM – Big SQL
  • 15. HIVE
  • 16. Hive and HAWQ, by Fahim Kundi
  • 17. CONTENTS • Hive Introduction • How Hive Works • Apache Tez • Hive with Tez vs. MapReduce • ORC and Parquet Format • HAWQ Introduction • Query Optimizer • PXF
  • 18. HIVE INTRODUCTION (1) • Apache Hive provides a high-level query language and data warehouse features built on top of Hadoop. • It was initially developed by Facebook and made open source in 2008. • SQL-like query language called HiveQL (HQL). • Partitioning and bucketing for faster query processing. • Integration with visualization tools such as Tableau.
  • 19. HIVE INTRODUCTION (2) • Hive supports the common primitive data types, such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING and TIMESTAMP. • In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
  • 20. HOW HIVE WORKS (1) • Tables in Hive are similar to tables in a relational database. • Databases are made up of tables, which are in turn made up of partitions. • Data can be accessed via a simple query language, and Hive supports overwriting or appending data. • Internally, Hive queries are converted into MapReduce or Tez jobs.
  • 21. HOW HIVE WORKS (2) • Within a particular database, data in the tables is serialized, and each table has a corresponding Hadoop Distributed File System (HDFS) directory. • Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. • Data within partitions can be further broken down into buckets.
  • 22. APACHE TEZ (1) • Apache Tez is a distributed execution framework targeted at data-processing applications on Hadoop. • Tez was developed by Hortonworks and is built on top of YARN (the resource management framework for Hadoop). • Tez generalizes MapReduce into a more powerful framework by representing a user’s job as a dataflow graph. (Example)
  • 23. APACHE TEZ (2) • The Tez API has the following components: – DAG (Directed Acyclic Graph) – defines the overall job. One DAG object corresponds to one job. – Vertex – defines the user logic along with the resources and the environment needed to execute it. One Vertex corresponds to one step in the job. – Edge – defines the connection between producer and consumer vertices. • Tez is not meant directly for end users; rather, it enables developers to build end-user applications with much better performance and flexibility.
  • 24. EXAMPLE OF HIVE WITH TEZ VS MAPREDUCE
  • 25. ORC FILE • ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads. • ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads. • ORC features: – Columnar format for complex data types – Built into Hive from 0.11 – Support for Pig and MapReduce via HCatalog – Two levels of compression: lightweight type-specific and general – Built-in indexes
  • 26. ORC FILE LAYOUT
  • 27. PARQUET • Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. • Parquet features: – Columnar file format – Supports nested data structures – Accessible by Hive, Spark, Pig, Drill, MR – R/W in HDFS or local file system
  • 28. PARQUET FILE LAYOUT
  • 29. ORC VS PARQUET • Major considerations for choosing ORC over Parquet: – Many of the performance improvements provided by the Stinger initiative depend on features of the ORC format, including a block-level index for each column. This leads to potentially more efficient I/O, allowing Hive to skip reading entire blocks of data if it determines that predicate values are not present there. – The Cost Based Optimizer can consider column-level metadata present in ORC files in order to generate the most efficient graph. – ACID transactions are only possible when using ORC as the file format.
  • 30. FILE SIZE COMPARISON
  • 31. HAWQ INTRODUCTION • HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS as its storage layer. • HAWQ’s query planner evolved from the Greenplum Database; it handles query processing without relying on MapReduce under the hood. • HAWQ reads data from and writes data to HDFS natively. • It also has an extension framework (PXF) that allows it to interact with data held in other services and formats (HBase, Hive, Avro, etc.) that also reside in HDFS.
  • 32. HAWQ FEATURES • HAWQ provides all major features found in Greenplum Database: – SQL completeness: 2003 extensions – JDBC compliant – Robust query optimizer – Row- or column-oriented table storage – Parallel loading and unloading – Distributions – Multi-level partitioning – High-speed data redistribution – Views – External tables – Compression – Resource management – Security – Authentication – Management and monitoring
  • 33. HAWQ ARCHITECTURE [Diagram: a HAWQ Master (parser, query optimizer, PXF, local storage) and a HAWQ Standby Master; segment hosts, each with a query executor, PXF, one or more segments, local temp storage and an HDFS DataNode, linked by the interconnect; HDFS NameNode and Secondary NameNode]
  • 34. HAWQ PARALLEL QUERY OPTIMIZER • Turns a SQL query into an execution plan • Cost-based optimizer [Diagram: example plan tree with Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, Broadcast Motion and Hash nodes over Seq Scans on lineitem, orders, customer and nation]
  • 35. PIVOTAL EXTENSION FRAMEWORK (PXF) • PXF is a fast, extensible framework connecting HAWQ to an HDFS data store of choice that exposes a parallel API – An advanced version of external tables – Enables combining HAWQ data and Hadoop data in a single query – Supports connectors for HDFS, HBase and Hive – Provides an extensible framework API to enable custom connector development for any data source [Diagram: the extension framework connecting to HDFS, HBase and Hive]
  • 36. Muhammad Ali (image courtesy of Cloudera)
  • 37. IMPALA OVERVIEW • Interactive queries on top of Hadoop • ANSI-92 SQL standard • Native MPP query engine • Written in C++
  • 38. IMPALA OVERVIEW • Native to Hadoop – Blends with the ecosystem – Security – Hive MetaStore / HCatalog – Query existing HDFS data • Not as fault-tolerant as MapReduce (or Hive or SparkSQL or …) – If a single node fails during a query, the whole query fails – But if it’s 20x faster, you can rerun and still finish faster ;)
  • 39. IMPALA ARCHITECTURE (image courtesy of Cloudera)
  • 40. IMPALA: WHERE IT SHINES • Query execution times (small to medium size) • Parquet format – Compression • High concurrency – kills the competitors • Partitioning • Query optimizer (Compute Statistics!)
  • 41. IMPALA DEMO
  • 42. WHAT THE HELL IS KUDU! • Distributed columnar storage manager • Performance of Parquet – Great for analytical queries • Mutability of HBase – Supports UPDATE/DELETE, unlike Parquet • One common storage to rule them all! – (not exactly!)
  • 43. WHERE DO YOU POSITION KUDU?
  • 44. KUDU USE CASES • IoT use cases – High-velocity data – Same data read for analytical queries in near real time • Predictive modeling – Large datasets updated frequently – Retraining models • Time-series applications – Kudu offers compound keys and hash-based partitioning – Avoids hot spotting
  • 45. IMPALA DEMO
  • 46. SPARK
  • 47. 2 MIN INTRO TO SPARK • General-purpose distributed computing system – Multiple language support (Java, Scala, Python, and R) – Fault tolerance, data distribution, in-memory caching, etc. • RDD – Resilient Distributed Dataset • Operations – Transformations (define new RDDs) – Actions (return a value) • No nonsense – Up to 100x faster than MapReduce – Disk used only when it can’t be avoided
  • 48. 2 MIN INTRO TO SPARK (image courtesy: Sachin Parmar)
  • 49. SPARKSQL
  • 50. SPARKSQL • Structured data processing – Commonly known to us as tables • Integrated into the Spark programming model • Unified data access • Scalability • Support for HiveQL • Cache it!
  • 51. SPARKSQL • Two APIs – DataFrames: data organized into named columns; similar to tables; can be constructed from structured data files, Hive, or external DBs – Datasets: experimental interface; strongly typed, with the SQL execution engine; can be constructed from regular JVM objects
  • 52. SPARKSQL ARCHITECTURE
  • 53. DEMO SPARKSQL ON HADOOP
  • 54. EMC Corporation. All rights reserved.
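
Slides 20 and 21 describe how a Hive table maps to an HDFS directory, partitions to sub-directories, and buckets to a further split within each partition. The Python sketch below only illustrates that layout. The warehouse path, the database and table names, and the CRC32 hash are assumptions made for the example; Hive's real bucketing hash depends on the column type and the Hive version.

```python
import zlib

# Hypothetical warehouse root; real deployments configure their own location.
WAREHOUSE = "/user/hive/warehouse"


def partition_dir(database, table, partition_values):
    """Build the directory for one partition: one key=value level per partition column."""
    parts = "/".join(f"{key}={value}" for key, value in partition_values)
    return f"{WAREHOUSE}/{database}.db/{table}/{parts}"


def bucket_for(key, num_buckets):
    """Pick a bucket by hashing the bucketing column (CRC32 is a stand-in, not Hive's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucket_rows(rows, key_column, num_buckets):
    """Group rows into buckets so rows with equal keys always land together."""
    buckets = {i: [] for i in range(num_buckets)}
    for row in rows:
        buckets[bucket_for(row[key_column], num_buckets)].append(row)
    return buckets


rows = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "US"},
    {"user_id": 1, "country": "US"},
]
print(partition_dir("sales", "orders", [("year", 2016), ("country", "US")]))
print({b: len(rs) for b, rs in bucket_rows(rows, "user_id", 4).items()})
```

A query that filters on a partition column only needs to read the matching sub-directory, and rows with the same bucketing key always fall into the same bucket, which is what makes partitioning and bucketing useful for faster query processing.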

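Slide 23 lists the three building blocks of the Tez API: a DAG for the whole job, vertices for the processing steps, and edges connecting producers to consumers. Tez itself is a Java framework; the Python sketch below does not use the Tez API, and its class, method and vertex names are hypothetical. It only illustrates the data structure and how a valid execution order follows from the edges.

```python
from collections import defaultdict, deque


class Dag:
    """Toy job graph: vertices are named steps, edges point from producer to consumer."""

    def __init__(self, name):
        self.name = name
        self.vertices = []
        self.consumers = defaultdict(list)

    def add_vertex(self, vertex):
        self.vertices.append(vertex)
        return self

    def add_edge(self, producer, consumer):
        self.consumers[producer].append(consumer)
        return self

    def execution_order(self):
        """Topological order (Kahn's algorithm): a step runs after all of its producers."""
        pending = {v: 0 for v in self.vertices}
        for targets in self.consumers.values():
            for target in targets:
                pending[target] += 1
        ready = deque(v for v in self.vertices if pending[v] == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for target in self.consumers[vertex]:
                pending[target] -= 1
                if pending[target] == 0:
                    ready.append(target)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle, so it is not a DAG")
        return order


dag = (
    Dag("join_and_aggregate")
    .add_vertex("scan_orders")
    .add_vertex("scan_customers")
    .add_vertex("join")
    .add_vertex("aggregate")
    .add_edge("scan_orders", "join")
    .add_edge("scan_customers", "join")
    .add_edge("join", "aggregate")
)
print(dag.execution_order())
```

Unlike a fixed map-then-reduce pipeline, a graph like this can chain several steps in one job, which is the sense in which slide 22 says Tez generalizes MapReduce.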
Editor's Notes

  1. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are many use cases that need near-real-time query performance. There are also several workloads, such as machine learning, that do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
  2. Compared with the RCFile format, for example, the ORC file format has many advantages, such as: a single file as the output of each task, which reduces the NameNode's load; Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union); light-weight indexes stored within the file, which allow skipping row groups that don't pass predicate filtering; block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; and concurrent reads of the same file using separate RecordReaders. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
  3. Advantages of columnar storage: it limits I/O by loading only the columns that are needed, and it saves space because a columnar layout compresses better.
  4. Converts SQL into a physical execution plan. Cost-based optimization looks for the most efficient plan. The physical plan contains scans, joins, sorts, aggregations, etc. Global planning avoids sub-optimal ‘SQL pushing’ to segments. Directly inserts motion nodes for inter-segment communication and for efficient non-local join processing.
  5. Thank you
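
Editor's note 2 mentions run-length encoding for integer columns and dictionary encoding for string columns. The Python sketch below shows the general idea behind both encodings on a single column of values. It is a simplification that does not reproduce ORC's actual on-disk format, and the function names and sample data are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(values):
    """Store each distinct string once and replace the column with small integer ids."""
    dictionary = []
    ids_by_value = {}
    encoded = []
    for value in values:
        if value not in ids_by_value:
            ids_by_value[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids_by_value[value])
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Look each id up in the dictionary to recover the original strings."""
    return [dictionary[i] for i in encoded]


quantities = [5, 5, 5, 5, 2, 2, 9]
countries = ["US", "US", "DE", "US", "DE", "DE", "US"]

print(run_length_encode(quantities))   # [[5, 4], [2, 2], [9, 1]]
print(dictionary_encode(countries))    # (['US', 'DE'], [0, 0, 1, 0, 1, 1, 0])
```

Both encodings work well on columnar data because values from the same column sit next to each other and tend to repeat, which is one reason a columnar layout compresses better than a row layout.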