© Hortonworks Inc. 2011- 2017. All rights reserved | 1
Counting Rows
Hive HBase Druid
3.77 288.9 0.76
Ad-hoc analytics
Random look-ups
Aggregations, drill-downs
Big Data Processing Engines: Which one do I use?
Ashish Narasimham, Solutions Engineer @ Cloudera
Processing Engines Overview
What’s Hive good at?
⬢ Jack of all trades
⬢ Key component of the real-time database
⬢ Familiar interface for analysts – unified SQL
⬢ Can perform joins, filtering, aggregations
⬢ Read structured (CSV) or semi-structured (JSON) data
Hive interface to: HBase/Phoenix, Druid, JDBC, Files
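The kind of query described above (a join, a filter and an aggregation over structured CSV and semi-structured JSON input) can be sketched as follows. This is only an illustration: Python's built-in sqlite3 serves as a locally runnable stand-in for Hive, and the table names, columns and sample rows are hypothetical, not taken from the talk.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample inputs: one structured (CSV), one semi-structured (JSON lines).
CUSTOMERS_CSV = "customer_id,region\n1,East\n2,West\n"
ORDERS_JSON = [
    '{"customer_id": 1, "amount": 20.0}',
    '{"customer_id": 1, "amount": 5.0}',
    '{"customer_id": 2, "amount": 7.5}',
]

conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

# Load the CSV rows.
for row in csv.DictReader(io.StringIO(CUSTOMERS_CSV)):
    conn.execute(
        "INSERT INTO customers VALUES (?, ?)",
        (int(row["customer_id"]), row["region"]),
    )

# Load the JSON rows.
for line in ORDERS_JSON:
    rec = json.loads(line)
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)",
        (rec["customer_id"], rec["amount"]),
    )

# Join + filter + aggregation, written in plain SQL.
QUERY = """
SELECT c.region, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.amount > 6
GROUP BY c.region
ORDER BY c.region
"""
totals = conn.execute(QUERY).fetchall()
print(totals)
```

The SQL text is the part an analyst would carry over; how the tables are declared and loaded differs in Hive.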
HDP3: EDW analyst pipeline
Tableau, BI systems
Query Result Cache
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
Workload management
• Allows more queries to be run in parallel
• Reduce resource starvation in large clusters
• Also: Active/Passive HA
Materialized view, Surrogate key, Constraints
• More “tools” for optimizer to use
• More “tools” for DBAs to tune/optimize
• Invisible tuning of DB from users’ perspective
ACID v2 & ACID on default
• ACID v2 is as fast as regular tables
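The idea behind a query result cache (repeated queries are answered from stored results instead of being re-executed) can be sketched as below. This is a toy model under stated assumptions, not how Hive implements its cache: the class `ResultCache`, the `slow_count` function and the whitespace-only normalization are all hypothetical.

```python
class ResultCache:
    """Toy query result cache: repeat queries are answered without re-execution."""

    def __init__(self, execute):
        self._execute = execute   # function that actually runs a query
        self._results = {}        # normalized query text -> cached result
        self.executions = 0       # how many times the engine really ran

    @staticmethod
    def _normalize(query):
        # Naive normalization: collapse runs of whitespace only.
        return " ".join(query.split())

    def run(self, query):
        key = self._normalize(query)
        if key not in self._results:
            self.executions += 1
            self._results[key] = self._execute(query)
        return self._results[key]

    def invalidate(self):
        """Drop cached results, e.g. after the underlying data changes."""
        self._results.clear()


def slow_count(query):
    # Hypothetical stand-in for an expensive full-table scan.
    return 200_000_000


cache = ResultCache(slow_count)
first = cache.run("SELECT COUNT(*) FROM sales")
second = cache.run("SELECT  COUNT(*)  FROM sales")  # served from the cache
print(first, second, cache.executions)
```

Only the first call reaches the engine; the second differs only in whitespace and is answered from the cache.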
Ad-hoc, analytics
Look-ups, updates
Aggregations, drill-downs
What Are Apache HBase and Phoenix?
Flexible Schema
Millisecond Latency
SQL and NoSQL Interfaces
Store and Process Petabytes of Data
Scale out on Commodity Servers
Integrated with YARN
100% Open Source
[Diagram: HBase RegionServers 1 through N on YARN (Data Operating System), backed by HDFS (Permanent Data Storage). Callouts: Flexible Schema; Extreme Low Latency; Directly Integrated with Hadoop; SQL and NoSQL Interfaces]
Kinds of Apps Built with HBase
Write Heavy, Low-Latency, Search / Indexing, Messaging, Audit / Log Archive, Advertising, Data Cubes, Time Series, Sensor / Device
Key HBase Features
High Availability
• Data is stored on multiple nodes and HBase coordinates failover.
• Data stays available if nodes fail.
Strong Consistency
• HBase doesn’t sacrifice consistency for scale.
• Improve quality by avoiding difficult-to-detect bugs.
Deep Hadoop Integration
• Add deep insight to your apps through seamless integration with Hadoop tools like Hive.
Multi Datacenter
• Replicate data between 2 or more datacenters.
• Keeps data safe and available through datacenter outages.
Data Storage – Relational vs. HBase
Relational Data Base
        Column1          Column2   Column3   Column4
Row1    f – t5, a – t1   null      null      d – t4
Row2    null             b – t1    null      null
Row3    null             null      null      e – t4
Row4    c – t3           null      g – t5    null
HFile
Column1: f – t5, a – t1, c – t3
Column2: b – t1
Column3: g – t5
Column4: d – t4, e – t4
HBase data is located by cell coordinates consisting of row key, column family name, column qualifier and timestamp.
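The cell-coordinate model can be sketched in a few lines. This is a minimal illustration, not the HBase client API: the class `SparseTable` and the column family name `cf` are hypothetical, and the cell values are the ones shown on the slide. Only cells that exist are stored, so the nulls of the relational layout take no space, and a cell may hold several timestamped versions.

```python
from typing import Dict, Optional, Tuple

# (row key, column family, column qualifier, timestamp) -> value
CellKey = Tuple[str, str, str, int]


class SparseTable:
    """Toy stand-in for HBase-style storage: only existing cells are stored."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, str] = {}

    def put(self, row: str, family: str, qualifier: str,
            timestamp: int, value: str) -> None:
        self._cells[(row, family, qualifier, timestamp)] = value

    def get(self, row: str, family: str, qualifier: str) -> Optional[str]:
        """Return the newest version of a cell, or None if it was never written."""
        versions = [
            (ts, value)
            for (r, f, q, ts), value in self._cells.items()
            if (r, f, q) == (row, family, qualifier)
        ]
        return max(versions)[1] if versions else None

    def cell_count(self) -> int:
        return len(self._cells)


table = SparseTable()
for row, qualifier, ts, value in [
    ("Row1", "Column1", 1, "a"), ("Row1", "Column1", 5, "f"),
    ("Row1", "Column4", 4, "d"), ("Row2", "Column2", 1, "b"),
    ("Row3", "Column4", 4, "e"), ("Row4", "Column1", 3, "c"),
    ("Row4", "Column3", 5, "g"),
]:
    table.put(row, "cf", qualifier, ts, value)  # "cf" is a hypothetical family

print(table.get("Row1", "cf", "Column1"))  # newest of the two versions
print(table.get("Row2", "cf", "Column1"))  # cell that was never written
print(table.cell_count())
```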
Druid is for real-time, providing aggregations and fast access
• Streaming ingestion capability
• Data freshness – analyze events as they occur
• Fast response time (ideally < 1 sec query time)
• Arbitrary slicing and dicing
• Multi-tenancy – 1000s of concurrent users
• Scalability and availability
• Rich real-time visualization with Superset
Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
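One reason aggregations can come back quickly is that events can be pre-aggregated per time bucket and dimension as they are ingested. The sketch below illustrates that idea in plain Python; it is not Druid's ingestion API, and the function `rollup`, the event layout (epoch seconds, airport code, passenger count) and the sample values are hypothetical.

```python
from collections import defaultdict


def rollup(events, granularity_seconds=3600):
    """Pre-aggregate raw events per (time bucket, dimension value)."""
    summary = defaultdict(lambda: {"count": 0, "total": 0})
    for ts, airport, passengers in events:
        bucket = ts - ts % granularity_seconds  # truncate to the bucket start
        row = summary[(bucket, airport)]
        row["count"] += 1
        row["total"] += passengers
    return dict(summary)


# Hypothetical departure events: (epoch seconds, airport, passengers).
events = [
    (0, "ATL", 120),
    (1800, "ATL", 90),
    (1900, "JFK", 150),
    (3600, "ATL", 100),
]
rolled = rollup(events)

# A later query touches the small pre-aggregated rows, not the raw events.
atl_departures = sum(
    row["count"] for (bucket, airport), row in rolled.items() if airport == "ATL"
)
print(rolled)
print(atl_departures)
```

Four raw events collapse into three stored rows here; the saving grows with the number of events per bucket.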
Who is Using Druid
http://druid.io/druid-powered.html
Druid
cubing
Here’s how Druid usually fits into your architecture
Data sources: streaming data source (Kafka, etc.); jobs, batch processes, scheduled tasks
Storage: Druid (real-time ingest, batch ingest); HDFS
Query engine: Hive, with Druid-backed Hive tables (predicate pushdown) and HDFS-backed Hive tables
Visualization: Superset; Tableau, Qlik, Excel (query Hive/Druid via ODBC)
And one more honorable mention
Ad-hoc analytics
Random look-ups
Aggregations, drill-downs
Complex ETL, ML
What is Apache Spark?
MLlib (Machine Learning)
Classification, Regression
• Support vector
• Logistic regression
Collaborative Filtering
Clustering
• K-means
Optimization
• Stochastic Gradient Descent
Spark Streaming
Scalable
• High-throughput, fault-tolerant stream processing of live data streams (micro-batches)
Data Ingest Sources
• Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets
Reuse Spark APIs
• Complex algorithms expressed with high-level functions like map, reduce, join and window
Data Persistence
• Processed data can be pushed out to file systems, databases and live dashboards
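The micro-batch idea can be shown without Spark: a live stream is cut into fixed-length intervals and each interval is processed as a small batch with ordinary map and reduce functions. The sketch below is plain Python, not the Spark Streaming API; the function names, the two-second interval and the sample events are hypothetical.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def micro_batches(events, batch_seconds):
    """Group (timestamp_seconds, value) events into fixed-length batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[ts // batch_seconds].append(value)
    # Intervals that received no events produce no batch in this sketch.
    return [batches[key] for key in sorted(batches)]


def process(batch):
    """Map each value, then reduce the batch to a single number."""
    return reduce(add, map(lambda v: v * 2, batch))


# Hypothetical stream of (arrival time in seconds, reading).
events = [(0, 1), (1, 2), (2, 3), (5, 4), (9, 5)]
results = [process(batch) for batch in micro_batches(events, batch_seconds=2)]
print(results)
```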
Spark SQL
Structured Data Processing
• Programming abstraction called DataFrames
• Distributed SQL query engine
Infer Schema
• Automatically infer schema of a JSON dataset and load it as a DataFrame.
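What inferring a schema from JSON involves can be sketched in plain Python: scan the records, note the type seen for each field and reconcile conflicts. This is an illustration only, not Spark's implementation; the function `infer_schema` and its widening rules (int and float become float, any other conflict becomes str, nulls are skipped) are this sketch's own assumptions.

```python
import json


def infer_schema(json_lines):
    """Infer a {field: type name} mapping from JSON records, one per line."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            if value is None:
                continue  # a null says nothing about the field's type
            seen = type(value).__name__
            previous = schema.get(field)
            if previous is None or previous == seen:
                schema[field] = seen
            elif {previous, seen} == {"int", "float"}:
                schema[field] = "float"  # widen mixed numerics
            else:
                schema[field] = "str"    # fall back to string on conflict
    return schema


# Hypothetical records with a mixed numeric field and a null field.
records = [
    '{"id": 1, "name": "a", "score": 1}',
    '{"id": 2, "name": "b", "score": 2.5, "tag": null}',
]
schema = infer_schema(records)
print(schema)
```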
[Diagram: Spark stack with layers Applications, Spark Core Engine, Resource Management and Storage; libraries: Scala, Java, Python, MLlib (Machine learning), Spark SQL*, Spark Streaming*]
Benefits of Apache Spark
• Performance
– Deliver high-performance, large-scale data processing and analysis by leveraging in-memory computing
• Ease of Use
– Easy-to-use APIs for operating on large datasets
– Operators for transforming data
– DataFrames provide support for manipulating structured and semi-structured data
• Efficiency
– Enhanced developer productivity through prepackaged libraries that can be combined in the same application
• SQL queries
• Streaming data
• Machine learning
• Graph processing
* Tech Preview
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (HiveServer+Tez, LLAP Daemons, MetaStore, Hive Meta), connected through HWC]
Isolate Spark and Hive Catalogs/Tables
Leverage connector for Spark <-> Hive (HWC)
Which one should I choose?
Use Case Analysis - Each engine has its niche
HBase: ultra-low latency random access (key-based lookup); large-volume OLTP; updates; deletes
Hive: ACID, real-time database, EDW; unified SQL interface, JDBC; reporting, batch; joins, large aggregates, ad-hoc
Druid: low-latency OLAP, concurrent queries; aggregations, drilldowns; time-series; real-time ingestion
Spark: complex ETL; ML model training; SparkSQL; Spark Streaming
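The niches above can be turned into a simple lookup that suggests which engine or engines cover a set of requirements. This is a rough sketch for illustration: the tag strings are shortened from the slide, and the scoring rule (most matching tags wins, ties are all returned) is an assumption, not a recommendation method from the talk.

```python
# Niche tags per engine, shortened from the use-case slide.
ENGINE_NICHES = {
    "HBase": {"key-based lookup", "large-volume OLTP", "updates", "deletes"},
    "Hive": {"ACID", "EDW", "unified SQL", "reporting", "batch",
             "joins", "large aggregates", "ad-hoc"},
    "Druid": {"low-latency OLAP", "concurrent queries", "aggregations",
              "drilldowns", "time-series", "real-time ingestion"},
    "Spark": {"complex ETL", "ML model training", "SparkSQL",
              "Spark Streaming"},
}


def suggest_engines(needs):
    """Return the engine(s) whose niche matches the most requested tags."""
    wanted = set(needs)
    scores = {engine: len(tags & wanted) for engine, tags in ENGINE_NICHES.items()}
    best = max(scores.values())
    if best == 0:
        return []  # nothing on the slide matches these needs
    return sorted(engine for engine, score in scores.items() if score == best)


print(suggest_engines(["updates", "key-based lookup"]))
print(suggest_engines(["aggregations", "time-series"]))
print(suggest_engines(["joins", "complex ETL"]))
```

A tie, as in the last call, is a hint that the workload spans more than one engine.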
Which use cases make sense?
⬢ HBase – operational data store, lots of changing data
– Financial transaction data
– Frequent customer updates
– CDC
⬢ Druid – analytics across dataset, sums and other aggregations
– Analyzing number of cars being produced by region
– Number of flights departing from a certain airport
⬢ Hive – large queries across tons of data
– Fact-dimension join across billions of rows, e.g. joining loyalty data to a day’s retail transactions for insights into spending
⬢ Spark – predictive modeling, complex ETL (ELT) jobs
– Building a predictive maintenance model for infrastructure that a transportation company owns
Performance Analysis
Performance Analysis - Setup
⬢ Caching disabled
⬢ Query types
– Simple count
– Select with a where
– Join
– Update
– An aggregation (e.g. Sum)
8 cores, 16GB RAM, 8 nodes
30GB, 200MM rows
[Chart: Comparing Hive, HBase and Druid. Series: HBase/Phoenix, Hive, Druid. Query types: Select with filter; Count(*); Aggregation with filter; Select with join and filter; Update with filter. Value axis from 0.00 to 16.00. Data labels in extraction order: 1.35, 15.00, 15.00, 15.00, 1.52, 8.71, 4.72, 8.66, 9.16, 9.75, 0.34, 0.71, 1.72, 0.00, 0.00]
Data Load Times
Engine Load Time
Hive <1 hr
HBase 4+ hrs
Druid 2 hrs
⬢ Issues with HBase – sequential, serial
Space Considerations
⬢ You may get better storage from HBase with different compression
Engine Size on Disk with Replication
Hive – ORC w/ Zlib 28.4GB
HBase – Snappy compression 89.5GB
Druid 31.5GB
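To put the sizes in proportion, the short calculation below compares each engine's on-disk footprint with the dataset size. It assumes the 30GB figure from the setup slide is the raw dataset size; the replication factor is not stated in the slides, so the ratios are left as reported (with replication) rather than divided out.

```python
RAW_GB = 30.0  # assumption: the "30GB" on the setup slide is the raw dataset size

# Size on disk with replication, as reported on the slide.
SIZE_ON_DISK_GB = {
    "Hive (ORC w/ Zlib)": 28.4,
    "HBase (Snappy)": 89.5,
    "Druid": 31.5,
}

# Footprint relative to the raw dataset, and relative to the smallest footprint.
ratio_to_raw = {name: size / RAW_GB for name, size in SIZE_ON_DISK_GB.items()}
smallest = min(SIZE_ON_DISK_GB.values())
ratio_to_smallest = {name: size / smallest for name, size in SIZE_ON_DISK_GB.items()}

for name in SIZE_ON_DISK_GB:
    print(f"{name}: {ratio_to_raw[name]:.2f}x raw, "
          f"{ratio_to_smallest[name]:.2f}x smallest")
```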
Unified SQL
Hive as the Single Interface
Hive interface to: HBase/Phoenix, Druid, JDBC, Files
Hive Query Delegation by Calcite
• Filter (time), group by, order by: Calcite rewrites to Druid query fragment
• Complex joins, etc. would be computed in Hive
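The split between what is delegated to Druid and what stays in Hive can be pictured with a toy planner. This is a hedged sketch, not Calcite's API or its actual rewrite rules: the operator names, the `PUSHABLE` set and the rule that pushdown stops at the first non-pushable operator above the scan are assumptions made for illustration.

```python
# Hypothetical operator names; the slide lists time filter, group by and
# order by as the operations rewritten into a Druid query fragment.
PUSHABLE = {"time_filter", "group_by", "order_by"}


def split_plan(operators):
    """Split a plan, ordered from the table scan upward, into two parts.

    Operators are pushed into the Druid fragment until the first one that
    cannot be pushed; that operator and everything above it stay in Hive.
    """
    pushed = []
    for index, op in enumerate(operators):
        if op not in PUSHABLE:
            return pushed, operators[index:]
        pushed.append(op)
    return pushed, []


druid_fragment, hive_side = split_plan(
    ["time_filter", "group_by", "join", "order_by"]
)
print(druid_fragment)  # work delegated to Druid
print(hive_side)       # work computed in Hive
```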
BI on Hadoop: Different tools for different use cases
Cold
• File / RAW storage
• Unknown questions
• Latency is not an issue
• Non-structured / Data Mining / Data Science
Warm (LLAP)
• Structured data
• Data cleansed / enriched
• Questions are known but not answers
• Concepts and data regularly updated
Hot (Druid)
• Streaming / low latency
• Pre-aggregation to answer specific questions
• Known questions and answers
• Operational dashboards
What conclusions can we draw?
⬢ The use case dictates the tool. This is seen in the numbers
– Druid is extremely fast for aggregations
– HBase is great with lookups and OLTP-style updates on fast-moving data
– Hive is used a lot for analytics on large quantities of data, where the query isn’t known beforehand
– Spark has great libraries for ML and is customizable for complex ETL
⬢ Use case sprawl – watch for this (no one engine does it all)
⬢ Unified SQL – the tools complement each other in the larger enterprise architecture
Further Reading
⬢ Use case discussion on the engines
– https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/
⬢ Performance analysis
– https://community.hortonworks.com/articles/232317/big-data-processing-engines-the-technical-series-p.html
– https://community.hortonworks.com/articles/233083/big-data-processing-engines-the-technical-series-p-1.html

Data Storytelling Final Project for MBA 635
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
 
Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
 

Big data processing engines, Atlanta Meetup 4/30

  • 1. © Hortonworks Inc. 2011- 2017. All rights reserved | 1 3.77 288.9 0.76
  • 2. Counting Rows
  • 3. 3.77 288.9 0.76
  • 4. Hive: 3.77, HBase: 288.9, Druid: 0.76
  • 5. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 6. Big Data Processing Engines – Which one do I use? Ashish Narasimham, Solutions Engineer @ Cloudera
  • 7. Processing Engines Overview
  • 8. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 9. What’s Hive good at?
      ⬢ Jack of all trades
      ⬢ Key component of the real-time database
      ⬢ Familiar interface for analysts – unified SQL
      ⬢ Can perform joins, filtering, aggregations
      ⬢ Reads structured (CSV) or semi-structured (JSON) data
      Diagram: Hive interface over HBase/Phoenix, Druid, JDBC, Files
  • 10. HDP3: EDW analyst pipeline (Tableau, BI systems)
      ⬢ Query Result Cache
        – Results return from HDFS/cache directly
        – Reduce load from repetitive queries
      ⬢ Workload management
        – Allows more queries to be run in parallel
        – Reduce resource starvation in large clusters
        – Also: Active/Passive HA
      ⬢ Materialized view, Surrogate key, Constraints
        – More “tools” for optimizer to use
        – More “tools” for DBAs to tune/optimize
        – Invisible tuning of DB from users’ perspective
      ⬢ ACID v2 & ACID on by default
        – ACID v2 is as fast as regular tables
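The result-cache bullets on this slide (repeated queries are answered from cache instead of being recomputed) can be illustrated with a toy sketch. This is not how Hive implements its cache; `ResultCache`, the `execute` callback, and the whitespace-only normalization are all hypothetical names and simplifications chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple


class ResultCache:
    """Toy query result cache: the same query text is executed only once."""

    def __init__(self, execute: Callable[[str], List[Row]]) -> None:
        self._execute = execute  # hypothetical call into the query engine
        self._results: Dict[str, List[Row]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(sql: str) -> str:
        # Simplification: treat queries that differ only in whitespace as equal.
        return " ".join(sql.split())

    def query(self, sql: str) -> List[Row]:
        key = self._normalize(sql)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        rows = self._execute(sql)
        self._results[key] = rows
        return rows

    def invalidate(self) -> None:
        # A real cache has to drop entries when the underlying tables change.
        self._results.clear()
```

Running the same count query twice through `ResultCache.query` reaches the backend once; the second call is served from memory, which is the "reduce load from repetitive queries" effect the slide describes.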
  • 11. Ad-hoc analytics; Look-ups, updates; Aggregations, drill-downs
  • 12. What Are Apache HBase and Phoenix?
      ⬢ Flexible Schema
      ⬢ Millisecond Latency
      ⬢ SQL and NoSQL Interfaces
      ⬢ Store and Process Petabytes of Data
      ⬢ Scale out on Commodity Servers
      ⬢ Integrated with YARN
      ⬢ 100% Open Source
      Diagram: YARN (Data Operating System); HBase RegionServers 1 … N on top of HDFS (Permanent Data Storage)
      Callouts: Flexible Schema; Extreme Low Latency; Directly Integrated with Hadoop; SQL and NoSQL Interfaces
  • 13. Kinds of Apps Built with HBase: Write Heavy; Low-Latency; Search / Indexing; Messaging; Audit / Log Archive; Advertising; Data Cubes; Time Series; Sensor / Device
  • 14. Key HBase Features
      ⬢ High Availability
        – Data is stored on multiple nodes and HBase coordinates failover.
        – Data stays available if nodes fail.
      ⬢ Strong Consistency
        – HBase doesn’t sacrifice consistency for scale.
        – Improve quality by avoiding difficult-to-detect bugs.
      ⬢ Deep Hadoop Integration
        – Add deep insight to your apps through seamless integration with Hadoop tools like Hive.
      ⬢ Multi Datacenter
        – Replicate data between 2 or more datacenters.
        – Keeps data safe and available through datacenter outages.
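  • 15. Data Storage – Relational vs. HBase
      Relational database (every row stores every column, including nulls; value – timestamp):
        Row1: Column1 = f – t5, a – t1; Column2 = null; Column3 = null; Column4 = d – t4
        Row2: Column1 = null; Column2 = b – t1; Column3 = null; Column4 = null
        Row3: Column1 = null; Column2 = null; Column3 = null; Column4 = e – t4
        Row4: Column1 = c – t3; Column2 = null; Column3 = g – t5; Column4 = null
      HBase (HFile; only populated cells are stored):
        Column1: f – t5, a – t1, c – t3
        Column2: b – t1
        Column3: g – t5
        Column4: d – t4, e – t4
      Data is located by cell coordinates consisting of row key, column family name, column qualifier and timestamp.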
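The cell-coordinate idea on this slide (row key, column family, column qualifier, timestamp) can be sketched as a small in-memory structure. This is only an illustration of the addressing scheme, not HBase's storage format or client API; `SparseCellStore`, the family name `cf`, and the integer timestamps are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# (row key, column family, column qualifier) -> {timestamp: value}
CellKey = Tuple[str, str, str]


class SparseCellStore:
    """Stores only populated cells; absent cells take no space at all."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, Dict[int, str]] = defaultdict(dict)

    def put(self, row: str, family: str, qualifier: str,
            timestamp: int, value: str) -> None:
        self._cells[(row, family, qualifier)][timestamp] = value

    def get(self, row: str, family: str, qualifier: str,
            timestamp: Optional[int] = None) -> Optional[str]:
        versions = self._cells.get((row, family, qualifier))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]  # newest version wins
        # Otherwise return the newest version written at or before `timestamp`.
        eligible = [ts for ts in versions if ts <= timestamp]
        return versions[max(eligible)] if eligible else None

    def cell_count(self) -> int:
        return sum(len(versions) for versions in self._cells.values())


store = SparseCellStore()
# Values loosely follow the slide's example; "cf" is a made-up column family.
store.put("Row1", "cf", "Column1", 1, "a")
store.put("Row1", "cf", "Column1", 5, "f")
store.put("Row1", "cf", "Column4", 4, "d")
store.put("Row2", "cf", "Column2", 1, "b")
store.put("Row3", "cf", "Column4", 4, "e")
store.put("Row4", "cf", "Column1", 3, "c")
store.put("Row4", "cf", "Column3", 5, "g")
```

Seven cell versions are stored for a grid that would hold sixteen slots in the relational layout, and `Row1`/`Column1` keeps both versions `a` (t1) and `f` (t5) addressable by timestamp.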
  • 16. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 17. Druid is for real-time, providing aggregations and fast access
      ⬢ Streaming ingestion capability
      ⬢ Data freshness – analyze events as they occur
      ⬢ Fast response time (ideally < 1 sec query time)
      ⬢ Arbitrary slicing and dicing
      ⬢ Multi-tenancy – 1000s of concurrent users
      ⬢ Scalability and availability
      ⬢ Rich real-time visualization with Superset
      Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
  • 18. Who is Using Druid http://druid.io/druid-powered.html
  • 19. Druid cubing: here’s how Druid usually fits into your architecture
      Data sources: streaming data source (Kafka, etc.); jobs, batch processes, scheduled tasks
      Storage: Druid (real-time ingest, batch ingest); HDFS
      Query engine: Hive (Druid-backed Hive tables with predicate pushdown; HDFS-backed Hive tables)
      Visualization: Superset; Tableau, Qlik, Excel (query Hive/Druid via ODBC)
  • 20. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 21. And one more honorable mention
  • 22. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs; Complex ETL, ML
  • 23. What is Apache Spark?
      ⬢ MLlib (Machine Learning)
        – Classification, Regression: support vector, logistic regression
        – Collaborative Filtering
        – Clustering: K-means
        – Optimization: Stochastic Gradient Descent
      ⬢ Spark Streaming
        – Scalable: high-throughput, fault-tolerant stream processing of live data streams (micro-batches)
        – Data Ingest Sources: Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets
        – Reuse Spark APIs: complex algorithms expressed with high-level functions like map, reduce, join and window
        – Data Persistence: processed data can be pushed out to file systems, databases and live dashboards
      ⬢ Spark SQL
        – Structured Data Processing: programming abstraction called DataFrames; distributed SQL query engine
        – Infer Schema: automatically infer the schema of a JSON dataset and load it as a DataFrame
      Diagram: Applications on the Spark Core Engine (Scala, Java, Python libraries; MLlib (Machine learning), Spark SQL*, Spark Streaming*) over Resource Management and Storage
  • 24. Benefits of Apache Spark
      ⬢ Performance
        – Deliver high-performance, large-scale data processing and analysis by leveraging in-memory computing
      ⬢ Ease of Use
        – Easy-to-use APIs for operating on large datasets
        – Operators for transforming data
        – DataFrames provide support for manipulating structured and semi-structured data
      ⬢ Efficiency
        – Enhanced developer productivity through prepackaged libraries that can be combined in the same application: SQL queries, streaming data, machine learning, graph processing
      Diagram: Applications on the Spark Core Engine (Scala, Java, Python libraries; MLlib (Machine learning), Spark SQL*, Spark Streaming*) over Resource Management and Storage (* Tech Preview)
  • 25. Isolate Spark and Hive catalogs/tables; leverage the connector (HWC) for Spark <-> Hive
      Diagram, Spark side: Driver, Executors, Spark Meta
      Diagram, Hive side: MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta
  • 26. Which one should I choose?
  • 27. Use Case Analysis – Each engine has its niche
      HBase: ultra-low latency random access (key-based lookup); large-volume OLTP; updates; deletes
      Hive: ACID, real-time database, EDW; unified SQL interface, JDBC; reporting, batch; joins, large aggregates, ad-hoc
      Druid: low-latency OLAP, concurrent queries; aggregations, drilldowns; time-series; real-time ingestion
      Spark: complex ETL; ML model training; SparkSQL; Spark Streaming
  • 28. Which use cases make sense?
      ⬢ HBase – operational data store, lots of changing data
        – Financial transaction data
        – Frequent customer updates
        – CDC
      ⬢ Druid – analytics across dataset, sums and other aggregations
        – Analyzing number of cars being produced by region
        – Number of flights departing from a certain airport
      ⬢ Hive – large queries across tons of data
        – Fact-dimension join across billions of rows, e.g. joining loyalty data to a day’s retail transactions for insights into spending
      ⬢ Spark – predictive modeling, complex ETL (ELT) jobs
        – Building a predictive maintenance model for infrastructure that a transportation company owns
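The use-case matrix above can be read as a lookup from workload needs to engines. The sketch below encodes a subset of the listed strengths as plain strings; `ENGINE_STRENGTHS` and `engines_for` are hypothetical names, and the exact-match rule is a simplification rather than a real selection method.

```python
from typing import Dict, List

# Strengths copied from the use-case slides; the wording is abbreviated.
ENGINE_STRENGTHS: Dict[str, List[str]] = {
    "HBase": ["key-based lookup", "large-volume OLTP", "updates", "deletes"],
    "Hive": ["EDW", "reporting", "batch", "joins", "large aggregates", "ad-hoc"],
    "Druid": ["low-latency OLAP", "aggregations", "drilldowns",
              "time-series", "real-time ingestion"],
    "Spark": ["complex ETL", "ML model training", "SparkSQL", "Spark Streaming"],
}


def engines_for(needs: List[str]) -> Dict[str, List[str]]:
    """Return, per engine, which of the requested needs it lists as a strength."""
    wanted = {need.lower() for need in needs}
    matches: Dict[str, List[str]] = {}
    for engine, strengths in ENGINE_STRENGTHS.items():
        hit = [s for s in strengths if s.lower() in wanted]
        if hit:
            matches[engine] = hit
    return matches
```

A workload that needs both `updates` and `aggregations` comes back split across HBase and Druid, which is a small picture of why a single engine rarely covers every requirement.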
  • 29. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs; Complex ETL, ML
  • 30. Performance Analysis
  • 31. Performance Analysis – Setup
      ⬢ Caching disabled
      ⬢ Query types
        – Simple count
        – Select with a where
        – Join
        – Update
        – An aggregation (e.g. Sum)
      Cluster: 8 nodes, 8 cores, 16GB RAM
      Data: 30GB, 200MM rows
  • 32. Comparing Hive, HBase and Druid (bar chart, axis 0.00 to 16.00)
      Query types, in order: Select with filter; Count(*); Aggregation with filter; Select with join and filter; Update with filter
      HBase/Phoenix: 1.35, 15.00, 15.00, 15.00, 1.52
      Hive: 8.71, 4.72, 8.66, 9.16, 9.75
      Druid: 0.34, 0.71, 1.72, 0.00, 0.00
  • 33. Data Load Times
      Engine: Load Time
      Hive: <1 hr
      HBase: 4+ hrs
      Druid: 2 hrs
      ⬢ Issues with HBase – sequential, serial
  • 34. Space Considerations
      Engine: Size on Disk with Replication
      Hive – ORC w/ Zlib: 28.4GB
      HBase – Snappy compression: 89.5GB
      Druid: 31.5GB
      ⬢ You may get better storage from HBase with different compression
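To put the on-disk sizes in proportion, the sketch below divides each by the 30GB dataset size from the setup slide. It assumes that 30GB is the raw size loaded into every engine, which the slides do not state explicitly; `RAW_GB` and `footprint_ratios` are illustrative names.

```python
from typing import Dict

RAW_GB = 30.0  # assumption: raw dataset size taken from the setup slide

# Size on disk with replication, as listed on the slide.
SIZE_ON_DISK_GB: Dict[str, float] = {
    "Hive (ORC, Zlib)": 28.4,
    "HBase (Snappy)": 89.5,
    "Druid": 31.5,
}


def footprint_ratios(sizes: Dict[str, float], raw_gb: float) -> Dict[str, float]:
    """On-disk size expressed as a multiple of the raw dataset size."""
    return {engine: round(size / raw_gb, 2) for engine, size in sizes.items()}


ratios = footprint_ratios(SIZE_ON_DISK_GB, RAW_GB)
for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:.2f}x the raw size")
```

Under that assumption Hive and Druid land close to 1x the raw size even with replication included, while HBase is close to 3x. Because the figures include replication, these are footprint ratios rather than pure compression ratios.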
  • 35. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 36. Unified SQL
  • 37. Hive as the Single Interface
      Diagram: Hive interface over HBase/Phoenix, Druid, JDBC, Files
  • 38. Hive Query Delegation by Calcite
      ⬢ Calcite rewrites the time filter, group by and order by into a Druid query fragment
      ⬢ Complex joins, etc. would be computed in Hive
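As a rough picture of what such a delegated fragment could look like, the sketch below builds a Druid native `groupBy` query as a Python dictionary: the time filter becomes `intervals`, the GROUP BY columns become `dimensions`, and the ORDER BY becomes a `limitSpec`. This is a hand-written approximation, not the output Calcite actually generates; the function name, the `flights` datasource and its columns are hypothetical.

```python
import json
from typing import Any, Dict, List

# Hypothetical Hive query the fragment stands in for:
#   SELECT airport, SUM(departures) AS total
#   FROM flights
#   WHERE `__time` >= '2019-01-01' AND `__time` < '2019-02-01'
#   GROUP BY airport ORDER BY total DESC LIMIT 10


def to_druid_group_by(datasource: str, interval: str, dimensions: List[str],
                      sum_column: str, limit: int) -> Dict[str, Any]:
    """Build a Druid native groupBy query covering filter, group by and order by."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "intervals": [interval],          # time filter
        "granularity": "all",
        "dimensions": dimensions,         # GROUP BY columns
        "aggregations": [
            {"type": "longSum", "name": "total", "fieldName": sum_column}
        ],
        "limitSpec": {                    # ORDER BY ... LIMIT
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": "total", "direction": "descending"}],
        },
    }


query = to_druid_group_by("flights", "2019-01-01/2019-02-01",
                          ["airport"], "departures", 10)
print(json.dumps(query, indent=2))
```

Anything the fragment cannot express, such as a join against another table, would stay in the Hive part of the plan and consume the rows this query returns.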
  • 39. BI on Hadoop: Different tools for different use cases
      ⬢ Cold
        – File / RAW storage
        – Unknown questions
        – Latency is not an issue
        – Non-structured / Data Mining / Data Science
      ⬢ Warm (LLAP)
        – Structured data
        – Data cleansed / enriched
        – Questions are known but not answers
        – Concepts and data regularly updated
      ⬢ Hot (Druid)
        – Streaming / low latency
        – Pre-aggregation to answer specific questions
        – Known questions and answers
        – Operational dashboards
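The "pre-aggregation to answer specific questions" point for the hot tier can be sketched as an ingest-time rollup: raw events are collapsed into one summary row per time bucket and dimension value, so later queries read the small summary instead of the raw events. This is a plain-Python illustration of the idea, not Druid's ingestion code; the event shape (time, region, cars produced) borrows the cars-by-region example from the use-case slide, and `rollup_hourly` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Event = Tuple[datetime, str, int]  # (event time, region, cars produced)
Summary = Dict[Tuple[datetime, str], Dict[str, int]]


def rollup_hourly(events: Iterable[Event]) -> Summary:
    """Collapse raw events into one row per (hour, region) with count and sum."""
    summary: Summary = defaultdict(lambda: {"count": 0, "sum": 0})
    for ts, region, cars in events:
        bucket = ts.replace(minute=0, second=0, microsecond=0)  # truncate to hour
        row = summary[(bucket, region)]
        row["count"] += 1
        row["sum"] += cars
    return dict(summary)


events = [
    (datetime(2019, 4, 30, 9, 5), "south", 3),
    (datetime(2019, 4, 30, 9, 40), "south", 4),
    (datetime(2019, 4, 30, 9, 59), "north", 1),
    (datetime(2019, 4, 30, 10, 1), "south", 5),
]
hourly = rollup_hourly(events)
```

Four raw events become three summary rows here. The saving grows with event volume, at the cost of only being able to answer questions at hourly, per-region granularity or coarser.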
  • 40. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 41. What conclusions can we draw?
      ⬢ The use case dictates the tool. This is seen in the numbers
        – Druid is extremely fast for aggregations
        – HBase is great with lookups and OLTP-style updates on fast-moving data
        – Hive is used a lot for analytics on large quantities of data, where the query isn’t known beforehand
        – Spark has great libraries for ML and is customizable for complex ETL
      ⬢ Use case sprawl – watch for this (no one engine does it all)
      ⬢ Unified SQL – the tools complement each other in the larger enterprise architecture
  • 42. Further Reading
      ⬢ Use case discussion on the engines
        – https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/
      ⬢ Performance analysis
        – https://community.hortonworks.com/articles/232317/big-data-processing-engines-the-technical-series-p.html
        – https://community.hortonworks.com/articles/233083/big-data-processing-engines-the-technical-series-p-1.html