Big data processing engines, Atlanta Meetup 4/30

© Hortonworks Inc. 2011- 2017. All rights reserved | 1
3.77 288.9 0.76

Counting Rows

3.77 288.9 0.76

3.77 288.9 0.76
Hive HBase Druid

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs

© Hortonworks Inc. 2011 – 2017
Big Data Processing Engines –
Which one do I use?
Ashish Narasimham, Solutions Engineer @ Cloudera

Processing Engines Overview

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs

What’s Hive good at?
⬢ Jack of all trades
⬢ Key component of the real-time
database
⬢ Familiar interface for analysts – unified
SQL
⬢ Can perform joins, filtering,
aggregations
⬢ Read structured (CSV) or semi-
structured (JSON) data
HiveInterface
HBase/Phoenix
Druid
JDBC
Files

HDP3: EDW analyst pipeline
Tableau
BI systems
Materialized
view
Surrogate
key
Constraints
Query
Result
Cache
Workload
management
ACID v2
&
ACID on
default
• Results return
from HDFS/cache
directly
• Reduce load from
repetitive queries
• Allows more
queries to be run
in parallel
• Reduce resource
starvation in large
clusters
• Also:
Active/Passive HA
• More “tools” for
optimizer to use
• More ”tools” for
DBAs to
tune/optimize
• Invisible tuning of
DB from users’
perspective
• ACID v2 is as fast
as regular tables

Ad-hoc,
analytics
Look-ups,
updates
Aggregations,
drill-downs

What Are Apache HBase and Phoenix?
Flexible Schema
Millisecond Latency
SQL and NoSQL Interfaces
Store and Process Petabytes of Data
Scale out on Commodity Servers
Integrated with YARN
100% Open Source
YARN : Data Operating System
HBase
RegionServer
1 ° ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° ° N
HDFS
(Permanent Data Storage)
HBase
RegionServer
HBase
RegionServer
Flexible Schema
Extreme Low Latency
Directly Integrated with Hadoop
What Are Apache HBase and Phoenix?
Flexible Schema
Millisecond Latency
Store and Process Petabytes of Data
Scale out on Commodity Servers
Integrated with YARN
100% Open Source
YARN : Data Operating System
HBase
RegionServer
1 ° ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° ° N
HDFS
(Permanent Data Storage)
HBase
RegionServer
HBase
RegionServer
Flexible Schema
Extreme Low Latency
Directly Integrated with Hadoop

Kinds of Apps Built with HBase
Write Heavy Low-Latency
Search /
Indexing
Messaging
Audit /
Log Archive AdvertisingData Cubes
Time Series
Sensor /
Device

Key HBase Features
Page 14
High Availability
• Data is stored on multiple
nodes and HBase coordinates
failover.
• Data stays available if nodes
fail.
Strong Consistency
• HBase doesn’t sacrifice
consistency for scale.
• Improve quality by avoiding
difficult-to-detect bugs.
Deep Hadoop Integration
• Add deep insight to your apps
through seamless integration
with Hadoop tools like Hive
Multi Datacenter
• Replicate data between 2 or
more datacenters.
• Keeps data safe and available
through datacenter outages.

Data Storage – Relational vs. HBase
Column1 Column2 Column3 Column4
Row1 f - t5
a – t1
null null d – t4
Row2 null b – t1 null null
Row3 null null null e – t4
Row4 c – t3 null g – t5 null
Relational Data Base
f – t5
a – t1
C – t3
B – t1 g – t5 d – t4
e – t4
HBase Data is located by cell coordinates consisting of row key,
column family name, column qualifier and timestamp
Column1 Column2 Column3 Column4
HFile

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs

Druid is for real-time, providing aggregations and fast
access
 Streaming ingestion capability
 Data Freshness – analyze events as they occur
 Fast response time (ideally < 1sec query time)
 Arbitrary slicing and dicing
 Multi-tenancy – 1000s of concurrent users
 Scalability and Availability
 Rich real-time visualization with Superset
Superset
Druid is a distributed, real-time, column-oriented datastore
designed to quickly ingest and index large amounts of data and
make it available for real-time query.

Who is Using Druid
http://druid.io/druid-powered.html

Druid
cubing
Here’s how Druid usually fits into your architecture
Streaming
data source
(Kafka, etc.) Real-
time
ingest
Druid
Jobs, batch
processes,
scheduled
tasks
HDFS Hive
Superset
VisualizationQuery engineStorageData sources
Druid-backed
Hive tables,
predicate
pushdown
HDFS-backed
Hive tables
Tableau,
Qlik,
Excel
Query
Hive/Druid via
ODBC
Batch
ingest

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs

And one more honorable mention

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs
Complex ETL, ML

What is Apache Spark?
Classification Regression
• Support vector
• logistic regression
Collaborative Filtering
Clustering
• K-means
Optimization
• Stochastic Gradient
Descent
ML lib (Machine Learning)
Scalable
• High-throughput, fault-
tolerant stream processing
of live data streams
(micro-batches)
Data Ingest Sources
• Kafka, Flume, Twitter,
ZeroMQ, Kinesis or TCP
sockets
Reuse Spark APIs
• Complex algorithms
expressed with high-level
functions like map, reduce,
join and window
Data Persistence
• Processed data can be
pushed out to file systems,
databases and live
dashboards
Spark Streaming
Structured Data Processing
• Programming abstraction
called DataFrames
• Distributed SQL query
engine
Infer Schema
• Automatically infer scheme
of a JSON dataset and
load it as a DataFrame.
Spark SQL
Resource Management
Storage
Applications
Spark Core Engine
Scala
Java
Python
libraries
MLlib
(Machine
learning)
Spark
SQL*
Spark
Streaming*

Benefits of Apache Spark
• Performance
– Deliver high performance large scale data processing and analysis by
leveraging in memory computing
• Ease of Use
– Easy to use APIs for operating on large datasets
– Operators for transforming data
– DataFrames provides support for manipulating structured and semi-
structured data
• Efficiency
– Enhanced developer productivity through prepackaged libraries that can be
combined in the same application
• SQL queries
• Streaming data
• Machine learning
• Graph processing
Resource Management
Storage
Applications
Spark Core Engine
Scala
Java
Python
libraries
MLlib
(Machine
learning)
Spark
SQL*
Spark
Streaming*
* Tech Preview

Driver
MetaStore
HiveServer+Tez
LLAP DaemonsExecutors
Spark
Meta
Hive
Meta
Executors LLAP Daemons
Isolate Spark and Hive Catalogs/Tables
Leverage connector for Spark <-> Hive
HWC
HWC

Which one should I choose?

Use Case Analysis - Each engine has its niche
HBase Hive Druid Spark
Ultra-low latency
Random access
(key-based
lookup)
ACID, real-time
database, EDW
Low-latency
OLAP, concurrent
queries
Complex ETL
Large-volume
OLTP
Unified SQL
interface, JDBC
Aggregations,
drilldowns
ML model
training
Updates Reporting, batch Time-series SparkSQL
Deletes Joins, large
aggregates, ad-
hoc
Real-time
ingestion
Spark
Streaming

Which use cases make sense?
⬢ HBase – operational data store, lots of changing data
– Financial transaction data
– Frequent customer updates
– CDC
⬢ Druid – analytics across dataset, sums and other aggregations
– Analyzing number of cars being produced by region
– Number of flights departing from a certain airport
⬢ Hive – large queries across tons of data
– Fact-dimension join across billions of rows, e.g. joining loyalty data to a day’s retail
transactions for insights into spending
⬢ Spark – predictive modeling, complex ETL (ELT) jobs
– Building a predictive maintenance model for infrastructure that a transportation company
owns

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs
Complex ETL, ML

Performance Analysis

Performance Analysis - Setup
⬢ Caching disabled
⬢ Query types
– Simple count
– Select with a where
– Join
– Update
– An aggregation (e.g. Sum)
8 cores
16GB RAM
8 nodes
30GB,
200MM rows

1.35
15.00 15.00 15.00
1.52
8.71
4.72
8.66
9.16
9.75
0.34 0.71
1.72
0.00 0.00
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
Select with filter Count(*) Aggregation with
filter
Select with join and
filter
Update with filter
Comparing Hive, HBase and Druid
HBase/Phoenix Hive Druid

Data Load Times
Engine Load Time
Hive <1 hr
HBase 4+ hrs
Druid 2 hrs
⬢ Issues with HBase – sequential, serial

Space Considerations
⬢ You may get better storage from HBase with different compression
Engine Size on Disk with Replication
Hive – ORC w/ Zlib 28.4GB
HBase – Snappy compression 89.5GB
Druid 31.5GB

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs

Unified SQL

Hive as the Single Interface
HiveInterface
HBase/Phoenix
Druid
JDBC
Files

Hive Query Delegation by Calcite
filter time
group by
order by
Calcite rewrites to
Druid query fragment
Complex joins,
etc would be
computed here

BI on Hadoop : Different tools for different use cases
 File / RAW storage
 Unknown questions
 Latency is not a issue
 Non structured / Data Mining /
Data Science
 Structured Data
 Data cleansed / Enriched
 Questions are known but not answers
 Concepts and data regularly updated
 Streaming / low latency
 Pre-aggregation to answer specific
questions
 Known Questions and answers
 Operational dashboards
LLAP
Druid
Cold Warm
Hot

Ad-hoc
analytics
Random
look-ups
Aggregations
drill-downs

What conclusions can we draw?
⬢ The use case dictates the tool. This is seen in the numbers
– Druid is extremely fast for aggregations
– HBase is great with lookups and OLTP-style updates on fast-moving data
– Hive is used a lot for analytics on large quantities of data, where the query isn’t known
beforehand
– Spark has great libraries for ML and is customizable for complex ETL
⬢ Use case sprawl – watch for this (no one engine does it all)
⬢ Unified SQL – the tools complement each other in the larger enterprise
architecture

Further Reading
⬢ Use case discussion on the engines
– https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/
⬢ Performance analysis
– https://community.hortonworks.com/articles/232317/big-data-processing-engines-the-
technical-series-p.html
– https://community.hortonworks.com/articles/233083/big-data-processing-engines-the-
technical-series-p-1.html

Big data processing engines, Atlanta Meetup 4/30

Related slideshows

More Related Content

What's hot

What's hot (19)

Similar to Big data processing engines, Atlanta Meetup 4/30

Similar to Big data processing engines, Atlanta Meetup 4/30 (20)

Recently uploaded

Recently uploaded (20)

Big data processing engines, Atlanta Meetup 4/30