Big SQL Competitive Summary - Vendor Landscape

© 2014 IBM Corporation
IBM Big SQL – Vendor Landscape
<<Speaker Name Here>>
<<For questions about this presentation, contact Glen Sheffield, shef@ca.ibm.com>>

Why SQL for Hadoop?
Lower cost DW, exploration and discovery, data lake or reservoir
Open up Hadoop data to familiar SQL tools such as Cognos
“This trend is expected to continue as Hadoop vendors continue to improve so-called SQL-on-Hadoop offerings. These
technologies from vendors such as Cloudera, Hortonworks and Actian allow data scientists and less sophisticated business
analysts and business users to query data in Hadoop using ANSI SQL-like tools (some SQL-on-Hadoop offerings have
more complete SQL functionality than others) and developers to build Hadoop-based data-driven applications”

What is IBM Big SQL?
IBM’s SQL for Hadoop
– Opens Hadoop data to a wider audience
– Familiar, widely known syntax
Complements the Data Warehouse
– Exploratory analytics
– Sandbox
– Data Lake
Included in BigInsights
Use familiar SQL tools
– Like Cognos

IBM Big SQL is Architected for Performance
Architected from the ground up for low latency and high throughput
MapReduce replaced with a modern MPP architecture
– Compiler and runtime are native code (not java)
– Big SQL worker daemons live directly on cluster
• Continuously running (no startup latency)
• Processing happens locally at the data
– Message passing allows data to flow directly
between nodes
Operations occur in memory with the ability
to spill to disk
– Supports joins, aggregations, and sorts larger than
available RAM
[Figure: architecture diagram. A SQL-based application connects through the IBM data server client to the Big SQL engine (SQL MPP run-time), which reads data sources in CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats, all within InfoSphere BigInsights.]
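The idea of operating in memory while spilling to disk can be illustrated with a generic external merge sort. This is a minimal sketch of the general technique, not Big SQL's implementation; the function name `external_sort` and the `max_in_memory` limit are hypothetical.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory=4):
    """Sort integers while buffering at most max_in_memory of them at a time.

    Hypothetical illustration: each full buffer is sorted in memory and
    spilled to a temporary file, then the sorted runs are merged lazily.
    """
    run_paths = []
    chunk = []

    def spill(rows):
        # Write one sorted run to disk, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for row in sorted(rows):
                f.write(f"{row}\n")
        run_paths.append(path)

    for value in values:
        chunk.append(value)
        if len(chunk) >= max_in_memory:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    def read_run(path):
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge streams the runs, holding one value per run at a time.
        # The result is collected into a list here only to keep the demo short.
        return list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)


sorted_values = external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3)
```

With `max_in_memory=3`, the seven inputs are written as three sorted runs and merged, so the input never has to fit in the buffer at once.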

IBM Big SQL Embraces Open Source HDFS file formats
Big SQL applies SQL to your existing Hadoop data
– No proprietary storage format
– Only vendor to support Parquet and ORC
A table is simply a view on your Hadoop data
– All data is Hadoop data
– In files in HDFS
– SEQ, RC, delimited, Parquet …
Table definitions shared with Hive
– The Hive Metastore catalogs table definitions
– Reading/writing data logic is shared
with Hive
– Definitions can be shared across the
Hadoop ecosystem
Data stored in Hive immediately query-able
[Figure: Hive, Pig, Sqoop, and Big SQL all reach the shared Hive Metastore on the Hadoop cluster through the Hive APIs.]
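The notion that a table is simply a view on files, with definitions held in a shared catalog, can be sketched as follows. This is a toy model under stated assumptions: the `metastore` dictionary, `register_table`, and `scan` are hypothetical names and do not correspond to the Hive Metastore API.

```python
import csv
import os
import shutil
import tempfile

# Hypothetical stand-in for a shared catalog: table name -> location and schema.
metastore = {}


def register_table(name, path, columns):
    """Record where a table's file lives and how to interpret its fields."""
    metastore[name] = {"path": path, "columns": columns}


def scan(name):
    """Apply the cataloged schema to the raw file at read time."""
    entry = metastore[name]
    with open(entry["path"], newline="") as f:
        for raw in csv.reader(f):
            yield {col: cast(val) for (col, cast), val in zip(entry["columns"], raw)}


# The data is just a delimited file; no load or conversion step is needed.
data_dir = tempfile.mkdtemp()
sales_path = os.path.join(data_dir, "sales.csv")
with open(sales_path, "w", newline="") as f:
    csv.writer(f).writerows([["east", "10"], ["west", "25"], ["east", "5"]])

register_table("sales", sales_path, [("region", str), ("amount", int)])

# Any reader that consults the same catalog sees the same table.
total_east = sum(r["amount"] for r in scan("sales") if r["region"] == "east")
row_count = sum(1 for _ in scan("sales"))

shutil.rmtree(data_dir)
```

The file is written once and only described in the catalog, which is the sense in which definitions can be shared by several tools without copying the data.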

Different Approaches to SQL on Hadoop
(Approaches are ordered from better to worse.)
Just the Query Engine
• MPP SQL query engine
• Uses Hive metadata
• Runs on Hadoop cluster
• Operates directly on HDFS files
• Best integration with
Hadoop
• Native HDFS file
formats
Full RDBMS on Hadoop
• Complete database, including
storage layer and query engine
• Uses proprietary metadata
• Runs on Hadoop cluster
• Proprietary formats
• Adds database
complexities to
Hadoop
Submit a remote query
• RDBMS sends request to Hadoop
• Result returned to RDBMS
• Network may impact performance
• Ability to push down work varies
• Requires front-end
database
• Performance depends
on network and
RDBMS load
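Why push-down matters for the remote-query approach can be shown by counting the rows that cross the network. This is a simplified simulation with made-up data; `hadoop_rows`, `remote_scan`, and `is_east` are hypothetical names, not any vendor's API.

```python
# Made-up data set standing in for rows stored on the Hadoop side.
hadoop_rows = [
    {"id": i, "region": "east" if i % 4 == 0 else "west"} for i in range(1000)
]


def remote_scan(predicate=None):
    """Return the rows that would be shipped from Hadoop to the front-end RDBMS."""
    if predicate is None:
        return list(hadoop_rows)
    return [row for row in hadoop_rows if predicate(row)]


def is_east(row):
    return row["region"] == "east"


# Without push-down: every row is transferred, then filtered in the RDBMS.
transferred_all = remote_scan()
result_without_pushdown = [row for row in transferred_all if is_east(row)]

# With push-down: the filter runs on the Hadoop side, so fewer rows move.
transferred_filtered = remote_scan(is_east)
result_with_pushdown = transferred_filtered
```

Both paths return the same 250 rows, but the first moves all 1000 rows across the network and the second moves only the matches.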

What’s the Difference?
1. Just the Query Engine (IBM Big SQL)
• Query engine, meta data, storage layers are all separate
• Leverages Hadoop ecosystem
• Stack: SQL Query Engine (Big SQL), Hive Metadata Catalog (Hadoop), Standard HDFS files (Hadoop)
2. Complete RDBMS on Hadoop (e.g. Pivotal HAWQ, Actian, Vertica)
• Query engine, meta data, storage layer are all glued together in a proprietary bundle running on Hadoop
• Stack: SQL Query Engine, Proprietary Database Catalog (Meta Data), Proprietary Database Storage (tables), on the Hadoop cluster
3. Remote Query (e.g. Oracle, Teradata)
• Requires a completely separate front-end RDBMS system
• Stack: RDBMS in front of Hadoop
• IBM Big SQL has an architectural advantage over other RDBMS vendors
• Big SQL is an MPP query engine running natively on Hadoop
• IBM Big SQL uses the open source HDFS file system, not a proprietary RDBMS storage layer

SQL for Hadoop Vendor Landscape
Columns: Approach, SQL Support, Packaging, Comments, Big SQL Advantage
IBM Big SQL
– Approach: MPP query engine based on DB2 runs native on Hadoop
– SQL Support: Very Good, SQL 2003/2011
– Packaging: Included with IBM InfoSphere BigInsights
– Comments: New MPP query engine in BigInsights 3.0
Oracle Big Data SQL
– Approach: Remote query from Oracle Exadata via external tables
– SQL Support: Good, based on Oracle
– Packaging: Available only with Oracle Big Data Appliance
– Comments: Announced July 15, 2014 for GA 3Q 2014
– Big SQL Advantage: Architecture, Performance
Cloudera (Impala)
– Approach: MPP query engine built by Cloudera runs native on Hadoop
– SQL Support: Poor, subset of Hive 0.9
– Packaging: Included with Cloudera Enterprise; available for MapR and Amazon EMR
– Comments: Perception leader
– Big SQL Advantage: SQL support, Memory usage, Concurrency
Hortonworks (Hive / Stinger)
– Approach: Enhance Hive, replace MapReduce with Tez, runs native on Hadoop
– SQL Support: Better with Hive 0.13 but sub-query restrictions
– Packaging: Hive 0.13 included with Hortonworks; Hive 0.12 included in Cloudera, BigInsights, MapR
– Comments: Microsoft, Teradata, SAP, HP are partners
– Big SQL Advantage: SQL support, Performance
Pivotal HD (HAWQ)
– Approach: Port Greenplum database to Hadoop including storage layer
– SQL Support: Good, based on Greenplum
– Packaging: Optional feature of Pivotal HD
– Comments: Funded by EMC and GE Capital
– Big SQL Advantage: Architecture, Elastic Scalability
Teradata (SQL-H)
– Approach: Remote query from Teradata by treating Hive tables as a view
– SQL Support: Good, based on Teradata
– Packaging: Optional feature of Teradata 14.1, 15; included with Teradata Distribution for Hadoop (TDH)
– Comments: QueryGrid coming in 3Q will provide filtering and push-down in Hadoop via Hive 0.13
– Big SQL Advantage: Architecture, Performance
Microsoft (Polybase)
– Approach: Remote query from PDW using external tables
– SQL Support: Good, based on Microsoft
– Packaging: Available only with Microsoft PDW V2
– Comments: Limited to Microsoft PDW customers
– Big SQL Advantage: Architecture, Performance
Presto
– Approach: MPP query engine built by Facebook runs native on Hadoop
– SQL Support: Good
– Packaging: Open source download
– Comments: New, unproven
– Big SQL Advantage: Commercial support

IBM Big SQL Advantages
Native Hadoop architecture
– Designed for Hadoop Ecosystem
– Uses native Hadoop data types
– Elastic scalability
Leading performance
– Better value for money
ANSI compliant SQL – broad language support
– Minimize re-coding retains tools investments
Automatic memory management
– Other vendors may require that queries be “hand-optimized”
Security – row, column level, field masking
Rich analytic and aggregation functions

Cloudera Impala Overview
Interactive SQL for Hadoop
Responses in seconds
Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
Purpose-built for low-latency queries
Separate runtime from MapReduce
Designed as part of the Hadoop ecosystem
Open Source
Apache-licensed

Is Cloudera Impala Open Source?
Yes
– Impala code is available for download on GitHub under the Apache license
– Impala is available for Amazon EMR and MapR Hadoop distributions
No
– Customers assume that the open source community contributes to Impala
– But Impala is not an Apache Software Foundation open source project
– Cloudera is the only contributor (developer) to Impala code
IBM is investing decades of SQL research and development in IBM
Big SQL
– Impala has no advantage over IBM Big SQL in terms of code contribution or product
development

IBM Big SQL Compared to Cloudera Impala
Better SQL Support for end user tools and queries
– SQL 2003/2011+ (Impala only supports a subset of SQL-92)
– For example, Impala doesn’t support subqueries in the WHERE clause, nested subqueries,
windowed aggregates, common table expressions, ROLLUP, or CUBE
– Translates to better end-user tool usage and performance
Guaranteed execution for complex queries for reliability and ease of use
– Big SQL has Automatic Memory Management
– Unlike Impala, Big SQL does not require joined tables to fit in the aggregate memory
of the data nodes, a limitation that can cause queries to run out of memory and fail
Fine grained row and column access control for enhanced security
– Easy development for multi-tenancy applications
– Does not require the use of views
– Impala requires use of views which increases complexity
More Features
– Federation
– Stored procedures
– More Scalar functions
– More Aggregate functions
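The SQL constructs named above (a subquery in the WHERE clause, a common table expression, and a windowed aggregate) look like this in practice. The sketch uses Python's built-in `sqlite3` purely as a convenient stand-in engine, not Big SQL or Impala; the `orders` table and its rows are made up, and the window function assumes SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        (1, 'acme', 100), (2, 'acme', 300), (3, 'bolt', 50), (4, 'bolt', 150);
""")

# Subquery in the WHERE clause: orders above the overall average amount.
above_avg = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# Common table expression: per-customer totals, then a filter on those totals.
big_customers = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
    )
    SELECT customer FROM totals WHERE total > 250
""").fetchall()

# Windowed aggregate: running total per customer, ordered by order id.
running = conn.execute("""
    SELECT id, SUM(amount) OVER (PARTITION BY customer ORDER BY id) AS running_total
    FROM orders
    ORDER BY id
""").fetchall()

conn.close()
```

Reporting tools commonly generate queries of these shapes, which is why gaps in SQL coverage tend to surface as tool incompatibilities.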

Hortonworks Stinger Initiative Improved Apache Hive

IBM Big SQL Extends Value of Apache Hive
Big SQL can immediately query data already stored in Hive
– Big SQL uses Hive Metadata and extends value of Hive
– BigInsights includes both Hive and Big SQL
Better SQL Support for end user tools and queries
– IBM Big SQL runs all 99 TPC-DS queries unmodified
– Hive 0.12 runs only 43 TPC-DS queries unmodified (Hive 0.13 testing pending)
– Hive still has many subquery restrictions and doesn’t support non-equi-joins, etc.
– Translates to better end-user tool usage and performance
Faster Performance
– Up to 41x faster than Hive 0.12 on TPC-H like benchmark*
– Hive 0.13 benchmarks pending
Fine grained row and column access control for enhanced security
More Features
– Federation
– Stored procedures
– More Scalar functions
– More Aggregate functions
TPC-DS Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
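A non-equi-join is a join whose condition is something other than equality, such as a range match. The sketch below shows one using Python's built-in `sqlite3` as a stand-in engine (not Big SQL or Hive); the `sales` and `bands` tables are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER, amount INTEGER);
    CREATE TABLE bands (label TEXT, low INTEGER, high INTEGER);
    INSERT INTO sales VALUES (1, 5), (2, 50), (3, 500);
    INSERT INTO bands VALUES
        ('small', 0, 9), ('medium', 10, 99), ('large', 100, 999);
""")

# The join condition is a range test (BETWEEN), not an equality.
banded = conn.execute("""
    SELECT s.id, b.label
    FROM sales AS s
    JOIN bands AS b ON s.amount BETWEEN b.low AND b.high
    ORDER BY s.id
""").fetchall()

conn.close()
```

An engine limited to equi-joins would need the query rewritten, for example as a cross join followed by a filter, which is usually far more expensive.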

Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries
Big SQL is up to 41x faster than Hive 0.12
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the 1TB Classic BI
Workload in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H
Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are
performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3.
Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a
production environment. Results as of April 22, 2014

How many times Faster is Big SQL than Hive 0.12?
Max speedup of 74x
Avg speedup of 20x
Queries sorted by speed up ratio (worst to best)
* Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the 1TB Modern BI Workload
in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard,
running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries
are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC
Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not
be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment.
Results as of April 22, 2014
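The arithmetic behind per-query speedup figures of this kind can be sketched as follows. The timings are invented for illustration and are not the benchmark results; the sketch also assumes that the average is the mean of per-query ratios, which the slide does not state.

```python
# Invented per-query elapsed times in seconds (NOT the benchmark data).
baseline_seconds = [100.0, 90.0, 400.0]   # hypothetical slower engine
candidate_seconds = [50.0, 30.0, 40.0]    # hypothetical faster engine

# Speedup per query = baseline time / candidate time, sorted worst to best.
ratios = sorted(b / c for b, c in zip(baseline_seconds, candidate_seconds))

max_speedup = max(ratios)
# Assumption: "average speedup" is the arithmetic mean of per-query ratios.
avg_speedup = sum(ratios) / len(ratios)

# A different summary: ratio of total elapsed times for the whole run.
total_speedup = sum(baseline_seconds) / sum(candidate_seconds)
```

The mean of per-query ratios (5.0 here) and the ratio of total run times (about 4.9 here) generally differ, so it helps to know which definition a published average uses.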

Oracle Big Data SQL Requires Exadata and Big Data Appliance
Oracle Big Data SQL will allow an
Oracle Exadata user to issue a
remote SQL query to Oracle's Big
Data Appliance and return the
result
Not Yet Available
IBM Big SQL, part of BigInsights,
gives customers the ability to issue
SQL queries directly to Hadoop
IBM Big SQL is available today
[Figure: Oracle Big Data SQL (Oracle Exadata plus Oracle Big Data Appliance), not yet available, compared with IBM Big SQL on Hadoop, available today.]

Oracle Big Data SQL – Exadata to Big Data Appliance
Oracle Big Data SQL = Remote query from Exadata to Oracle Big Data Appliance
Oracle Exadata
(Oracle 12c Database)
Oracle Big Data
Appliance
(Cloudera Hadoop)
SQL query issued
Results returned

Oracle Big Data SQL – Not What it Seems
Oracle's solution requires Oracle
12c Database to submit the query
to Hadoop
– Oracle Big Data SQL requires the
Oracle 12c database on Exadata as the
front-end system to issue the query to
Hadoop
Oracle's solution requires both
Oracle Exadata and the Oracle Big
Data Appliance
– Likely to be of interest only to
customers willing to buy into the
complete Oracle stack
IBM Big SQL queries Hadoop
directly
– No need for a front-end relational
database
IBM Big SQL runs on any
commodity x86 or Power Linux
hardware
– No need for proprietary, lock-in Big
Data Appliance

IBM Big SQL Compared to Oracle Big Data SQL
IBM Big SQL offers faster performance
– Hadoop data does not need to be transferred across a network to an Oracle
database for additional processing (joins, aggregations, etc)
IBM Big SQL offers “Direct Connect”
– Front-end query tools such as Cognos can connect directly to Hadoop with IBM
Big SQL, and more efficiently, by not having to connect first to the Oracle
database
IBM Big SQL offers true MPP on Hadoop
– Big SQL deploys directly on the physical Hadoop cluster
– Big SQL accesses Hadoop data natively for reading and writing
IBM Big SQL offers query federation
– Big SQL can federate queries across DB2, Netezza, Teradata, and even Oracle
– Integrating existing relational database data with Hadoop
IBM Big SQL is available today
– Oracle Big Data SQL is not even available in Beta yet

Pivotal HD HAWQ
Or is it
The complexity of Hadoop
married to the complexity of
the Greenplum Database?

Pivotal HD – HAWQ is based on Greenplum Database
HAWQ is based on the Greenplum database with modifications that enable
data to be stored on HDFS
– Or HDFS compatible file systems (such as EMC ISILON OneFS)
The data is still stored in Greenplum relational tables, in a proprietary format
(.GDB) on HDFS, separate from Hadoop data
– Readable only through the Greenplum interface*
HAWQ SQL access to Hadoop data (including HBase) is done via the
Greenplum Database External Table feature
– Part of what is now called PXF – Pivotal Extension Framework.
HAWQ uses its own internal proprietary metadata
– Does not use Apache Hadoop Hive Metadata Catalog (HCatalog)
*HAWQ recently added support for Parquet files

IBM Big SQL Advantages Compared to Pivotal HAWQ
Architecture designed for Hadoop Ecosystem
– Big SQL uses meta data and standard files on HDFS
– HAWQ uses its own metadata and database tables
Elastic Scalability
– With Big SQL, nodes can be added or removed from cluster on-line
– HAWQ requires complex, disruptive, off-line MPP database re-distribution
Federation
– DB2, Netezza, Oracle, Teradata
Fine grained row and column access control
Packaging
– Big SQL included in BigInsights
– HAWQ is extra cost license for Pivotal HD

Teradata “Unified Data Architecture”
Hadoop purpose is for data
capture and transformation
Aster Data used for discovery
and exploratory analytics
SQL-H is used to query Hadoop
using SQL from either Aster
Data or Teradata
SQL-H issues a remote query to
Hadoop and brings the data
back to the RDBMS for
processing

Teradata SQL-H – Remote Query to Hadoop

IBM Big SQL Compared to Teradata SQL-H
Big SQL operates and processes the data directly on Hadoop
– Does not need to move data to relational database for analytics
Big SQL does not require separate licensing or complexity of an MPP
relational database like Teradata
– SQL-H requires deployment of the Aster or Teradata relational database
– Big SQL provides direct SQL access to Hadoop as part of BigInsights
Big SQL supports SQL access to HBase
– SQL-H does not support HBase
SQL-H depends on network latency/speed and Teradata system
configuration for performance
– Teradata system may already be overburdened
– Teradata system likely not sized for additional processing of Hadoop data
Aster Data and Teradata are complex
– Relational MPP engine with partitioning, indexing, tuning requirements
– Aster has operational limitations including having to rebuild indexes on load, vacuum
operations to reclaim space, managing freespace, etc.

Summary
SQL on Hadoop presents huge opportunity
– Data Lake, Data Warehouse offload, Sandbox and exploration
IBM Big SQL Architecture is designed for Hadoop Ecosystem
– MPP query engine runs natively on Hadoop
Big SQL provides many advantages over Cloudera Impala
– SQL support, memory management, more features
Big SQL also has advantages over Hive
– SQL support, performance
– But Big SQL works on top of Hive, extending value of Hive
Big SQL designed for Hadoop, queries Hadoop data directly
– Neither Oracle, Teradata, nor Microsoft has a similar solution
– Many SQL Hadoop solutions submit remote queries to Hadoop
– Other SQL solutions are ports of database, including storage layer

Backup: Vendor Landscape
Cloudera has developed an MPP SQL engine for Hadoop called Impala, which is
now also offered by MapR and Amazon EMR (Elastic Map Reduce)
Hortonworks is enhancing the breadth of SQL coverage, and performance of Hive,
under the project name Stinger
Pivotal HAWQ is based on the Greenplum MPP database, and provides similar
MPP database capability on Hadoop as part of the Pivotal HD (Hadoop) product
Teradata offers SQL-H which enables a Teradata end-user to issue a remote
query to Hadoop and bring the data back into Teradata for analysis or integration
with Teradata data. Teradata Query Grid (3Q 2014) will enable SQL-H result sets
to be filtered on Hadoop prior to being returned to Teradata.
Oracle announced Oracle Big Data SQL which enables an Oracle Exadata end
user to issue remote queries to the Oracle Big Data Appliance and bring back a
filtered result set to Oracle database for analysis or integration with Oracle data.
GA 3Q 2014.
Microsoft’s Polybase enables their PDW appliance to issue remote SQL queries to
Hortonworks Hadoop on Windows or Microsoft HDInsight
Actian has consolidated ParAccel and Vectorwise technologies, ported to Hadoop
and released it as Actian Analytics Platform Hadoop SQL Edition.
HP Vertica has released Dragline which can run on a MapR cluster and share
storage with Hadoop

Backup: Vendor Landscape
Facebook has developed Presto, an open source SQL engine for Hadoop
designed for interactive query analysis on large data sets
Apache Drill is an open source project based on Google Dremel to provide a
distributed SQL execution engine for interactive, low latency queries. Currently in
Beta
Spark SQL is a new SQL engine being developed from the ground up for Spark by
Databricks, and replaces the previous Shark project. Currently in Alpha.
Splice Machine claims to be a fully ACID-compliant RDBMS on Hadoop. It took
the Apache Derby Java relational database and removed its storage layer,
replacing it with the Apache HBase NoSQL database. The company then modified
the planner, optimiser and executor inside Derby to take advantage of HBase's
distributed architecture. Currently in Beta.
Citus DB is built on PostgreSQL and runs on Hadoop nodes
JethroData is an index-based SQL engine for Hadoop that automatically indexes data
as it is written into Hadoop
InfiniDB has recently announced InfiniDB for Apache Hadoop, which is positioned
for analytics and claims “if you know MySQL, you know InfiniDB”
