This document provides an agenda and overview for a presentation on SQL on Hadoop. The presentation will cover various SQL on Hadoop technologies including Hive, HAWQ, Impala, SparkSQL, HBase with Phoenix, and Drill. It will also include an introduction, surveys to collect information from attendees, and discussions on networking and food. The hosts will provide background on their experience with big data and Hadoop.
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd (IBM Analytics)
Originally Published on Oct 27, 2014
An overview of IBM's audited Hadoop-DS comparing IBM Big SQL, Cloudera Impala and Hortonworks Hive for performance and SQL compatibility. For more information, visit: http://www-01.ibm.com/software/data/infosphere/hadoop/
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Demand for cloud is through the roof. Cloud is turbo charging the Enterprise IT landscape with agility and flexibility. And now, discussions of cloud architecture dominate Enterprise IT. Cloud is enabling many ephemeral on-demand use cases which is a game changing opportunity for analytic workloads. But all of this comes with the challenges of running enterprise workloads in the cloud securely and with ease.
In this session, we will take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) to run enterprise workloads in the cloud and will go through a live demo of how the latest from Cloudbreak enables enterprises to easily and securely run Apache Hadoop. This includes deep-dive discussion on Ambari Blueprints, recipes, custom images, and enabling Kerberos -- which are all key capabilities for Enterprise deployments.
Speakers
Jeff Sposetti, VP Product Management, Hortonworks
Attila Kanto, Principal Engineer, Hortonworks
Spark is a fast and general engine for large-scale data processing. It provides APIs in Java, Scala, and Python and an interactive shell. Spark applications operate on resilient distributed datasets (RDDs) that can be cached in memory for faster performance. RDDs are immutable and fault-tolerant via lineage graphs. Transformations create new RDDs from existing ones while actions return values to the driver program. Spark's execution model involves a driver program that coordinates tasks on executor machines. RDD caching and lineage graphs allow Spark to efficiently run jobs across clusters.
PASS Summit - SQL Server 2017 Deep Dive (Travis Wright)
Deep dive into SQL Server 2017 covering SQL Server on Linux, containers, HA improvements, SQL graph, machine learning, python, adaptive query processing, and much much more.
Operationalizing Data Science Using Cloud Foundry (VMware Tanzu)
The document discusses how operationalizing machine learning models through continuous deployment and monitoring is important to realize business value but often overlooked, and describes how Alpine Data's Chorus platform in combination with Pivotal's Big Data Suite and Cloud Foundry can provide a turn-key solution for operationalizing models by deploying scalable scoring engines that can consume models exported in the PFA format. The platform aims to make it simple to deploy both individual models and complex scoring flows represented as PFA documents to ensure models have maximum impact on the business.
Apache Falcon - Simplifying Managing Data Jobs on Hadoop (DataWorks Summit)
Apache Falcon is a data management platform on Hadoop that provides a holistic way to declaratively define and manage data pipelines and workflows. It allows users to specify feeds, processes, and clusters to orchestrate the flow of data across Hadoop clusters. Falcon handles scheduling, dependency management, replication, and data governance. The architecture uses Oozie to schedule workflows and notifications are sent through JMS. Case studies demonstrate how Falcon can be used for multi-cluster failover and distributed processing across data centers.
Kudu is an open source storage layer developed by Cloudera that provides low latency queries on large datasets. It uses a columnar storage format for fast scans and an embedded B-tree index for fast random access. Kudu tables are partitioned into tablets that are distributed and replicated across a cluster. The Raft consensus algorithm ensures consistency during replication. Kudu is suitable for applications requiring real-time analytics on streaming data and time-series queries across large datasets.
Big SQL Competitive Summary - Vendor Landscape (Nicolas Morales)
IBM's Big SQL is their SQL for Hadoop product that allows users to run SQL queries on Hadoop data. It uses the Hive metastore to catalog table definitions and shares data logic with Hive. Big SQL is architected for high performance with a massively parallel processing (MPP) runtime and runs directly on the Hadoop cluster with no proprietary storage formats required. The document compares Big SQL to other SQL on Hadoop solutions and outlines its performance and architectural advantages.
The document discusses Seagate's plans to integrate hard disk drives (HDDs) with flash storage, systems, services, and consumer devices to deliver unique hybrid solutions for customers. It notes Seagate's annual revenue, employees, manufacturing plants, and design centers. It also discusses Seagate exploring the use of big data analytics and Hadoop across various potential use cases and outlines Seagate's high-level plans for Hadoop implementation.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics (DataWorks Summit)
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics - Apache Spark's in-memory capabilities catapulted it to become the premier processing framework for Hadoop. Apache Ignite and Alluxio, both high-performance, integrated and distributed in-memory platforms, take Apache Spark to the next level by providing an even more powerful, faster and more scalable platform for the most demanding data processing and analytic environments.
Speaker
Irfan Elahi, Consultant, Deloitte
This document discusses the challenges of implementing SQL on Hadoop. It begins by explaining why SQL is useful for Hadoop, as it provides a familiar syntax and separates querying logic from implementation. However, Hadoop's architecture presents challenges for matching the functionality of a traditional data warehouse. Key challenges discussed include random data placement in HDFS, limitations on indexing due to this random placement, difficulties performing joins without data colocation, and limitations of existing "indexing" approaches in systems like Hive. The document explores approaches some systems are taking to address these issues.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations either having jumped on the cloud bandwagon or having started planning their expansion into the ecosystem, it is imperative for us to explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms and the nature of Big Data with high seasonality of workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To be able to implement effective solutions for Big Data in the cloud it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritties of implementing Big Data in the cloud and the various options therein. Big Data + Cloud is definitely a deadly combination.
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021 (Sandesh Rao)
The document discusses Oracle Machine Learning (OML) services on Oracle Autonomous Database. It provides an overview of the OML services REST API, which allows storing and deploying machine learning models. It enables scoring of models using REST endpoints for application integration. The API supports classification/regression of ONNX models from libraries like Scikit-learn and TensorFlow. It also provides cognitive text capabilities like topic discovery, keywords, sentiment analysis and text summarization.
Introducing Kudu, Big Data Warehousing Meetup (Caserta)
Not just an SQL interface or file system, Kudu - the new, updating column store for Hadoop, is changing the storage landscape. It's easy to operate and makes new data immediately available for analytics or operations.
At the Caserta Concepts Big Data Warehousing Meetup, our guests from Cloudera outlined the functionality of Kudu and talked about why it will become an integral component in big data warehousing on Hadoop.
To learn more about what Caserta Concepts has to offer, visit http://casertaconcepts.com/
eProseed Oracle Open World 2016 debrief - Oracle 12.2.0.1 Database (Marco Gralike)
The document provides an overview of new features in Oracle Database 12.2 including multitenant improvements like application containers and proxy PDBs, in-memory database enhancements, new JSON functions and dataview, and Oracle Exadata Express. It also briefly mentions big data integrations and notes that documentation is available online for Exadata Express and new JSON and database features.
This document provides an overview and examples of MapReduce (M/R), Pig, and Hive. It introduces M/R concepts like mapping, reducing, and joins. It demonstrates a simple word count M/R job. Pig and Hive allow writing M/R jobs using a higher-level language - Pig Latin and HiveQL respectively. Examples show averaging stock prices using Pig and joining datasets in Hive. M/R, Pig, and Hive scripts run as Hadoop jobs on HDFS data.
MapReduce is a programming model for processing large datasets in a distributed manner. It works by splitting the data into independent chunks which are processed by the map tasks in parallel. The outputs are shuffled and sorted before being input to the reduce tasks. Common implementations of MapReduce include Apache Hadoop which runs MapReduce jobs on top of the Hadoop Distributed File System (HDFS).
The document describes Hadoop MapReduce and its key concepts. It discusses how MapReduce allows for parallel processing of large datasets across clusters of computers using a simple programming model. It provides details on the MapReduce architecture, including the JobTracker master and TaskTracker slaves. It also gives examples of common MapReduce algorithms and patterns like counting, sorting, joins and iterative processing.
This talk was held at the 12th meeting on July 22 2014 by Romeo Kienzler.
After giving a short contextual overview of SQL for Hadoop projects in the ecosystem (Hive, Impala, Presto, Cascading Lingual, ...), we will hear about the latest SQL for Hadoop features in Big SQL. Big SQL delivers some exciting capabilities, including low latency and high performance queries, while maintaining backward compatibility with Hive and HCatalog. This is achieved by an optimizer and a dedicated execution framework, which will be covered in detail. Finally, a demo of Big SQL v3.0 on a cluster in the Silicon Valley Lab (SVL) will be shown.
The document provides an overview of the Hadoop Distributed File System (HDFS). It describes key HDFS concepts including its design goals, block and rack awareness, file write and read processes, checkpointing, and safe mode operation. HDFS allows for reliable storage of very large files across commodity hardware and provides high throughput access to application data.
PXF is a unified access framework that provides a uniform SQL interface for heterogeneous data sources on HDFS. It exploits parallelism to efficiently access data across various storage formats and data sources. PXF uses a pluggable architecture with built-in connectors that allow it to access data in HDFS files, Hive tables, HBase tables, and other data sources. It provides a common developer view and allows writing queries against external data using various profile definitions and plugins.
This certificate of appreciation was presented to Shivram Mani from the Apache Software Foundation for serving as a mentor during Google Summer of Code 2016 from April 22 to August 23, 2016. Jason Titus, VP of Engineering, recognized Shivram Mani's contributions as a mentor during the summer program.
Some slides about the Map/Reduce programming model (for academic purposes), adapting some examples from the book Map/Reduce Design Patterns.
Special thanks to the following sources:
-http://shop.oreilly.com/product/0636920025122.do
-http://mapreducepatterns.com/index.php?title=Main_Page
-http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
This document summarizes a presentation about managing Apache HAWQ, an open source massively parallel processing (MPP) database, using Apache Ambari. It discusses how Ambari integrates with HAWQ for installation, configuration, topology recommendations, high availability, alerts and more. Challenges in the integration are addressed as HAWQ is not part of the Hortonworks Data Platform stack. The presentation recommends future work for Ambari like supporting automated HAWQ upgrades and enabling dynamic configuration reloads without requiring a service restart.
1. HCatalog is a table and storage management layer for Hadoop that provides a relational view of data in HDFS and abstracts data formats and locations from users.
2. Previously, HAWQ accessed Hive tables through PXF using external tables, but this required specifying the schema, location, and format which was error prone and wouldn't detect metadata changes.
3. The new integration retrieves metadata from HCatalog and parses it into in-memory catalog tables to provide dynamic access to Hive tables from HAWQ without needing to specify schemas.
Zeppelin Interpreters
PSQL (to become JDBC in 0.6.x)
Geode
SpringXD
Apache Ambari
Zeppelin Service
Geode, HAWQ and Spring XD services
Webpage Embedder View
HAWQ is a massively parallel, distributed SQL query engine that runs as a Hadoop service. It provides two-way integration with HDFS, Hive, and HBase. HAWQ supports SQL transactions through commands like BEGIN, COMMIT, and ROLLBACK. External tables in HAWQ can be used to query data stored in HDFS files, Hive tables, and HBase tables.
This document provides an overview of Hadoop HDFS/MapReduce architecture, hardware requirements, installation and configuration process, monitoring options, and key components like the Namenode. It discusses configuring the Namenode, JobTracker, DataNodes, and TaskTrackers. Hardware requirements for the NameNode/JobTracker and DataNode/TaskTracker are specified. Installation can be done via tar file download or using a prebuilt rpm. Configuration involves editing XML configuration files and starting required services. Monitoring can be done via web scraping, Ganglia, or Cacti. The Namenode writes edits to RAM and write-ahead log, with checkpoints to the filesystem. High availability is experimental in YARN and Hadoop
HAWQ: a massively parallel processing SQL engine in Hadoop (BigData Research)
HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of MPP database and Hadoop, it inherits the merits from both parties. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on UDP protocol, transaction management, fault tolerance, read optimized storage, the extensible framework for supporting various popular Hadoop based data stores and formats, and various optimization choices we considered to enhance the query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported 35x-45x faster than the original Hive.
The document discusses the new features in Pivotal HD 1.1, including improved high availability for HAWQ and Namenode, new UDF and diagnostic tools for HAWQ, upgraded Apache Hadoop components to version 2.0.5 and 2.0.6, improved Hive, HBase, and Oozie, Kerberos support for security, and new tools like the Unified Storage Service, Data Loader, and Command Center for easier administration.
Pivotal is a trusted partner for IT innovation and transformation. From the technology, to the people, to the way people interact with technology, Pivotal is transforming how the world builds software.
At Strata NYC 2015, Pivotal, announced it will Supercharge the Hadoop Ecosystem by contributing the HAWQ advanced SQL on Hadoop analytics and MADlib machine learning technologies to The Apache Software Foundation.
The document summarizes the journey of HAWQ and MADlib from being proprietary Pivotal technologies to becoming Apache open source projects. It provides an overview of HAWQ, including its key features like SQL compliance, performance advantages over other SQL-on-Hadoop systems, and flexible deployment options. It also summarizes MADlib, describing its machine learning functions and advantages of scalable in-database machine learning. Both projects are now available on open source platforms like Hadoop and aim to advance SQL and machine learning on big data through open collaboration.
Massively Parallel Processing with Procedural Python - Pivotal HAWQ (InMobi Technology)
The document discusses massively parallel processing using procedural Python. It describes EMC Corporation and its subsidiaries which provide data storage, virtualization, security, and other software solutions. It also discusses Pivotal's open source contributions and the architecture of its HAWQ database which allows Python user-defined functions to perform parallel operations across clusters.
Apache Tez - A unifying Framework for Hadoop Data Processing (DataWorks Summit)
This document provides an overview of Apache Tez, a framework for building data processing applications on Hadoop YARN. It describes how Tez allows applications to define complex data flows as directed acyclic graphs (DAGs) and handles distributed execution, fault tolerance, and resource management. Tez has improved the performance of Apache Hive and Pig by an order of magnitude by enabling more flexible DAG definitions and runtime optimizations. It also supports integration with other data processing engines like Spark, Storm and interactive SQL queries. The document outlines how Tez works and provides guidance on how developers can contribute to the open source project.
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor... (Data Con LA)
Apache Tez is a library to build data processing engines in Hadoop/YARN. It takes care of many common building blocks like scheduling, fault tolerance, speculation, security etc. so that the engine can focus on its core features. E.g. Apache Hive can focus on SQL optimization. There has been rapid adoption in projects like Hive, Pig, Flink, Cascading, Scalding and commercial products like Datameer and Syncsort. We will provide a brief overview of Tez and then look at new features for job monitoring in the Tez UI and performance debugging tools for Tez applications. Finally we will explore upcoming features like hybrid scheduling that open up new areas of performance and functionality.
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka (Edureka!)
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (MLconf)
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of 'Hadoop‘
Hadoop Analytics Tools
The document discusses EMC's strategy for Hadoop storage. It describes the Hadoop distributed file system (HDFS) and its architecture. It then outlines different approaches for integrating HDFS with storage solutions, including using integrated Hadoop distributions, HDFS storage array interfaces, and HDFS storage virtualization software. It also discusses analytics appliances and provides examples of EMC's data lake capabilities.
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins... (EMC)
Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how to manage it.
After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up and manage the day-to-day challenges of a large Hadoop deployment.
Objective 3: Have a view of the tools necessary to solve the challenges of managing a large Hadoop cluster.
This document discusses building applications on Hadoop and introduces the Kite SDK. It provides an overview of Hadoop and its components like HDFS and MapReduce. It then discusses that while Hadoop is powerful and flexible, it can be complex and low-level, making application development challenging. The Kite SDK aims to address this by providing higher-level APIs and abstractions to simplify common use cases and allow developers to focus on business logic rather than infrastructure details. It includes modules for data, ETL processing with Morphlines, and tools for working with datasets and jobs. The SDK is open source and supports modular adoption.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics... (VMware Tanzu)
SpringOne Platform 2016
Speaker: Ian Fyfe; Director, Product Marketing, Hortonworks
Apache Hadoop is the most powerful and popular platform for ingesting, storing and processing enormous amounts of “big data”. However, due to its original roots as a batch processing system, doing interactive business analytics with Hadoop has historically suffered from slow response times, or forced business analysts to extract data summaries out of Hadoop into separate data marts. This talk will discuss the different options for implementing speed-of-thought business analytics and machine learning tools directly on top of Hadoop including Apache Hive on Tez, Apache Hive on LLAP, Apache HAWQ and Apache MADlib.
This is a presentation on Apache Hadoop technology. It may help beginners learn Hadoop terminology, and it contains pictures that describe how the technology works. I hope it will be helpful for beginners.
Thank you.
This presentation is about Apache Hadoop technology and is aimed at beginners. It introduces some Hadoop terminology and includes diagrams that show how the technology works.
Thank you.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Speaker: Alan Gates, Co-Founder, Hortonworks
Hive 3 New Horizons, DataWorks Summit Melbourne, February 2019 (alanfgates)
Hive 3 new SQL features including LLAP, workload management, SQL over Kafka and JDBC data sources, integration with Spark via Hive Warehouse Connector, ACID 2, and constraints and default values
The document is a presentation about using Hadoop for analytic workloads. It discusses how Hadoop has traditionally been used for batch processing but can now also be used for interactive queries and business intelligence workloads using tools like Impala, Parquet, and HDFS. It summarizes performance tests showing Impala can outperform MapReduce for queries and scales linearly with additional nodes. The presentation argues Hadoop provides an effective solution for certain data warehouse workloads while maintaining flexibility, ease of scaling, and cost effectiveness.
This document discusses the Stinger initiative to improve the performance of Apache Hive. Stinger aims to speed up Hive queries by 100x, scale queries from terabytes to petabytes of data, and expand SQL support. Key developments include optimizing Hive to run on Apache Tez, the vectorized query execution engine, cost-based optimization using Optiq, and performance improvements from the ORC file format. The goals of Stinger Phase 3 are to deliver interactive query performance for Hive by integrating these technologies.
Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed for solving the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.
3. EMC Corporation All rights reserved
• How many developers?
INTRODUCTION
A SURVEY
4. EMC Corporation All rights reserved
• How many BI/SQL developers?
INTRODUCTION
A SURVEY
5. EMC Corporation All rights reserved
• How many business analysts or sales people?
INTRODUCTION
A SURVEY
6. EMC Corporation All rights reserved
• How many have used Hadoop?
INTRODUCTION
A SURVEY
7. EMC Corporation All rights reserved
• How many have used SQL on Hadoop?
INTRODUCTION
A SURVEY
8. EMC Corporation All rights reserved
• Hadoop is an open source framework for large-scale data storage & processing.
WHAT IS HADOOP
9. EMC Corporation All rights reserved
• Application Workgroup in EMC
– Focused on
•Big data development/infrastructure
•Application modernization
•DevOps
ABOUT THE HOSTS
10. EMC Corporation All rights reserved
• Fahim Kundi
– 10+ years experience in EDW and big data
• Haden Pareira
– Data engineer with 5+ years of Hadoop experience
• Muhammad Ali
– Data engineer 2+ years with Hadoop
ABOUT THE HOSTS
APPLICATION WORKGROUP IN EMC
12. EMC Corporation All rights reserved
• HDFS is a file system – it’s all files
• MapReduce requires strong programming skills
• It’s so difficult
WHAT IS HADOOP
13. EMC Corporation All rights reserved
• SQL is well known in analytics community
• Faster and easier data insights
• Allows SQL/BI developer to retain their expertise
and create value out of big data
SQL ON HADOOP
14. EMC Corporation All rights reserved
• Cloudera – Impala
• Hortonworks – Hive/Tez
• Pivotal – HAWQ … now HDB
• MapR – Drill
• IBM – Big SQL
SQL ON HADOOP
17. EMC Corporation All rights reserved
CONTENTS
• Hive Introduction
• How Hive Works
• Apache Tez
• Hive with Tez Vs Mapreduce
• ORC and Parquet Format
• HAWQ Introduction
• Query Optimizer
• PxF
18. EMC Corporation All rights reserved
HIVE INTRODUCTION (1)
• Apache Hive provides a high-level query language and data warehouse features built on top of Hadoop.
• It was initially developed at Facebook and made open source in 2008.
• SQL-like query language called HQL.
• Partitioning and bucketing for faster query processing (see the sketch below).
• Integration with visualization tools like Tableau.
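As a rough illustration of partitioning and bucketing (a hedged sketch; the table and column names below are invented, not from the deck):
-- A Hive table partitioned by date and bucketed on user_id.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- A query that filters on the partition column reads only that partition's files.
SELECT count(*) FROM page_views WHERE dt = '2016-01-01';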
19. EMC Corporation All rights reserved
HIVE INTRODUCTION (2)
• Hive supports all the common primitive data types such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING, TIMESTAMP, etc.
• In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays (example below).
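A small, hypothetical HQL sketch of mixing primitive and complex types (column names are illustrative only):
-- A table combining primitives with struct, array and map columns.
CREATE TABLE customers (
  id      INT,
  name    STRING,
  address STRUCT<street:STRING, city:STRING, zip:STRING>,
  phones  ARRAY<STRING>,
  prefs   MAP<STRING, STRING>
);

-- Nested fields are addressed with dot, index and key syntax.
SELECT name, address.city, phones[0], prefs['language'] FROM customers;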
20. EMC Corporation All rights reserved
HOW HIVE WORKS (1)
• The tables in Hive are similar to tables in a relational
database.
• Databases are comprised of tables, which are made up
of partitions.
• Data can be accessed via a simple query language and
Hive supports overwriting or appending data.
• Hive queries are internally converted into MapReduce or Tez jobs.
21. EMC Corporation All rights reserved
HOW HIVE WORKS (2)
• Within a particular database, data in the tables is
serialized and each table has a corresponding Hadoop
Distributed File System (HDFS) directory.
• Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory.
• Data within partitions can be further broken down into
buckets.
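For the hypothetical page_views table sketched earlier, the resulting HDFS layout would look roughly like this (warehouse path and bucket file names depend on configuration):
-- /user/hive/warehouse/page_views/dt=2016-01-01/000000_0   <- bucket file
-- /user/hive/warehouse/page_views/dt=2016-01-01/000001_0
-- /user/hive/warehouse/page_views/dt=2016-01-02/000000_0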
22. EMC Corporation All rights reserved
APACHE TEZ (1)
• Apache Tez is a new distributed execution framework targeted at data-processing applications on Hadoop.
• Tez is developed by Hortonworks and built on top of YARN (the resource management framework for Hadoop).
• Tez generalizes MapReduce into a more powerful framework by modeling each user job as a dataflow graph. (Example)
23. EMC Corporation All rights reserved
APACHE TEZ (2)
• The Tez API has the following components –
– DAG (Directed Acyclic Graph) – defines the overall job.
One DAG object corresponds to one job
– Vertex – defines the user logic along with the resources
and the environment needed to execute the user logic.
One Vertex corresponds to one step in the job
– Edge – defines the connection between producer and
consumer vertices.
• Tez is not meant directly for end-users – in fact it
enables developers to build end-user applications with
much better performance and flexibility.
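Because Tez is consumed through engines such as Hive rather than directly, the simplest way to see it in action is from a Hive session (a hedged sketch; page_views is the hypothetical table from the earlier example):
-- Switch the Hive execution engine to Tez for this session and inspect the DAG-based plan.
SET hive.execution.engine=tez;

EXPLAIN
SELECT dt, count(*) FROM page_views GROUP BY dt;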
25. EMC Corporation All rights reserved
ORC FILE
• ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads.
• ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads.
• ORC features:
– Columnar format for complex data types
– Built into Hive from 0.11
– Support for Pig and MapReduce via HCatalog
– Two levels of compression
• Lightweight, type-specific
• General
– Built-in indexes
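A minimal DDL sketch (table name and property values are illustrative, not from the deck) showing a Hive table stored as ORC with general-purpose compression layered on ORC's lightweight type-specific encodings:
CREATE TABLE sales_orc (
  order_id BIGINT,
  amount   DECIMAL(10,2),
  country  STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');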
27. EMC Corporation All rights reserved
PARQUET
• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem, regardless of
the choice of data processing framework, data model
or programming language.
• Parquet features:
– Columnar file format
– Supports nested data structures
– Accessible from Hive, Spark, Pig, Drill and MapReduce
– Read/write in HDFS or the local file system
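The equivalent hedged sketch for Parquet; once written this way, the same files can be read from Hive, Impala, Spark, Pig or MapReduce without conversion:
CREATE TABLE sales_parquet (
  order_id BIGINT,
  amount   DECIMAL(10,2),
  country  STRING
)
STORED AS PARQUET;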
29. EMC Corporation All rights reserved
ORC VS PARQUET
• Major considerations for choosing ORC over Parquet
– Many of the performance improvements provided in the Stinger
initiative are dependent on features of the ORC format including
block level index for each column. This leads to potentially more
efficient I/O allowing Hive to skip reading entire blocks of data if it
determines predicate values are not present there.
– Also the Cost Based Optimizer has the ability to consider column
level metadata present in ORC files in order to generate the most
efficient graph.
– ACID transactions are only possible when using ORC as the file
format.
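A hedged sketch of the ACID point above: in this generation of Hive, transactional tables had to be stored as ORC, bucketed, and flagged as transactional (with the transaction manager enabled on the cluster). Names below are invented:
CREATE TABLE accounts (
  id      INT,
  balance DECIMAL(12,2)
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE accounts SET balance = balance + 100 WHERE id = 42;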
31. EMC Corporation All rights reserved
HAWQ INTRODUCTION
• HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS for its storage layer.
• HAWQ evolved from the Greenplum Database query engine to handle query processing and does not rely on MapReduce under the hood.
• HAWQ reads data from and writes data to HDFS natively.
• It also has extensions (PXF) that allow it to interact with data contained in other services (HBase, Hive, Avro, etc.) that also reside in HDFS.
32. EMC Corporation All rights reserved
HAWQ FEATURES
• HAWQ provides all major features found in Greenplum
database
– SQL Completeness: 2003 Extensions
– JDBC Compliant
– Robust Query Optimizer
– Row or Column-Oriented Table Storage
– Parallel Loading and Unloading
– Distributions
– Multi-level Partitioning
– High speed data redistribution
– Views
– External Tables
– Compression
– Resource Management
– Security
– Authentication
– Management and Monitoring
33. EMC Corporation All rights reserved
HAWQ ARCHITECTURE
(Architecture diagram: a HAWQ Master, with its parser and query optimizer, and a HAWQ Standby Master sit alongside the HDFS NameNode and Secondary NameNode. They communicate over the interconnect with multiple Segment Hosts, each running query executors, PXF and one or more segments with local temp storage, co-located with HDFS DataNodes.)
34. EMC Corporation All rights reserved
HAWQ PARALLEL QUERY OPTIMIZER
(Example plan tree for a multi-way join: a Gather Motion feeds a Sort and HashAggregate over HashJoins of Seq Scans on lineitem, orders, customer and nation, with Redistribute and Broadcast Motions moving rows between segments.)
• Turns a SQL query into an execution plan
• Cost-based optimizer
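A hedged sketch of how such a plan surfaces in practice; the abbreviated output below is illustrative and mirrors the plan shape on this slide rather than exact HAWQ output:
EXPLAIN
SELECT c.c_name, sum(l.l_extendedprice)
FROM customer c
JOIN orders o   ON o.o_custkey  = c.c_custkey
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
GROUP BY c.c_name;

-- Gather Motion
--   -> HashAggregate
--        -> Hash Join
--             -> Redistribute Motion / Broadcast Motion
--             -> Seq Scan on lineitem, orders, customer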
35. EMC Corporation All rights reserved
PIVOTAL EXTENSION FRAMEWORK (PXF)
• PXF is a fast, extensible framework connecting HAWQ to an HDFS data store of choice, exposing a parallel API
– An advanced version of external tables
– Enables combining HAWQ data and Hadoop data in a single query
– Supports connectors for HDFS, HBase and Hive
– Provides an extensible framework API to enable custom connector development for any data source
(Diagram: the PXF extension framework sits between HAWQ and HDFS, HBase and Hive.)
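A hedged sketch of a PXF external table (host, port, path and columns are all hypothetical): an HDFS directory of delimited text files is exposed to HAWQ and can then be joined with native HAWQ tables in a single query:
CREATE EXTERNAL TABLE ext_sales (
  order_id BIGINT,
  amount   NUMERIC,
  country  TEXT
)
LOCATION ('pxf://namenode:51200/data/sales?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

SELECT country, sum(amount) FROM ext_sales GROUP BY country;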
37. EMC Corporation. All rights reserved.
• Interactive Query on top of Hadoop
• ANSI-92 SQL Standard
• Native MPP query engine
• Written in C++
IMPALA
OVERVIEW
38. EMC Corporation. All rights reserved.
• Native to Hadoop
– Blends with the ecosystem
– Security
– Hive MetaStore / HCatalog
– Query existing HDFS data
• Not as fault-tolerant as MapReduce
– (or Hive or SparkSQL or …)
– If a single node fails during a query, the whole query fails
– But if it’s 20x faster, you can rerun and still finish faster ;)
IMPALA
OVERVIEW
39. EMC Corporation. All rights reserved.
IMPALA ARCHITECTURE
Image courtesy of Cloudera
40. EMC Corporation. All rights reserved.
• Query execution times (small to medium size)
• Parquet Format
– Compression
• High Concurrency – kills the competitors
• Partitioning
• Query Optimizer (Compute Statistics!)
IMPALA
WHERE IT SHINES
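A hedged Impala sketch of the last two points (names are invented): a partitioned Parquet table plus COMPUTE STATS so the query optimizer has table and column statistics to plan with:
CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

COMPUTE STATS events;

-- Partition pruning: only the matching partition directories are scanned.
SELECT count(*) FROM events WHERE event_date = '2016-06-01';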
42. EMC Corporation. All rights reserved.
• Distributed columnar storage manager
• Performance of Parquet
– Great for analytical queries
• Mutability of HBase
– Supports UPDATE/DELETE unlike Parquet
• One common storage to rule them all!
– (not exactly!)
WHAT THE HELL IS KUDU!
44. EMC Corporation. All rights reserved.
• IoT use cases
– High velocity data
– Same data read for analytical queries near real time
• Predictive Modeling
– Large datasets updated frequently
– Retraining models
• Time-series applications
– Kudu offers compound keys/hash based partitioning
– Avoids hot spotting
KUDU USE CASES
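A hedged sketch of a Kudu table managed through Impala SQL (exact DDL details shifted across early Impala/Kudu releases; names here are invented): hash partitioning on the key spreads time-series writes to avoid hot-spotting, and rows can be updated in place, unlike Parquet:
CREATE TABLE sensor_readings (
  sensor_id BIGINT,
  ts        BIGINT,
  reading   DOUBLE,
  PRIMARY KEY (sensor_id, ts)
)
PARTITION BY HASH (sensor_id) PARTITIONS 16
STORED AS KUDU;

UPDATE sensor_readings SET reading = 0 WHERE sensor_id = 7 AND ts = 1463000000;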
47. EMC Corporation. All rights reserved.
2 MIN INTRO TO SPARK
• General Purpose Distributed Computing System
– Multiple language support (Java, Scala, Python, and R)
– Fault tolerant, data distribution, in-memory caching etc.
• RDD
– Resilient distributed datasets
• Operations
– Transformations (define new RDDs)
– Actions (return value)
• No nonsense
– 100x faster than MapReduce
– Disk used only when it can't be avoided
48. EMC Corporation. All rights reserved.
2 MIN INTRO TO SPARK
Image Courtesy: Sachin Parmar
http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer?
50. EMC Corporation. All rights reserved.
SPARKSQL
• Structured Data Processing
– Commonly known to us as tables
• Integrated into Spark programming model
• Unified Data Access
• Scalability
• Support for HiveQL
• Cache it!
51. EMC Corporation. All rights reserved.
SPARKSQL
• Two APIs
– DataFrames
• Data organized into named columns
• Similar to Tables
• Can be constructed from structured data files, Hive, external DBs
– DataSets
• Experimental interface
• Strongly typed & SQL execution engine
• Can be constructed from regular JVM objects
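Because Spark SQL understands HiveQL and can talk to the Hive metastore, the same hypothetical table from the Hive slides can be queried, and pinned in memory, straight from the spark-sql shell (a hedged sketch):
CACHE TABLE page_views;   -- keep the table in memory for repeated queries

SELECT dt, count(DISTINCT user_id)
FROM page_views
GROUP BY dt;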
Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time query processing. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
Compared with RCFile format, for example, ORC file format has many advantages such as:
a single file as the output of each task, which reduces the NameNode's load
Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
light-weight indexes stored within the file that skip row groups that don't pass predicate filtering
block-mode compression based on data type
run-length encoding for integer columns and dictionary encoding for string columns
concurrent reads of the same file using separate RecordReaders
Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
Advantages of Columnar Storage:
Limits I/O by loading only the columns that are needed.
Saves space, as columnar layouts compress better.
Converts SQL into a physical execution plan
Cost-based optimization looks for the most efficient plan
Physical plan contains scans, joins, sorts, aggregations, etc.
Global planning avoids sub-optimal ‘SQL pushing’ to segments
Directly inserts motion nodes for inter-segment communication
Directly inserts motion nodes for efficient non-local join processing