Interactive Analytics at Scale in Apache Hive Using Druid

Jesús Camacho Rodríguez
DataWorks Summit Sydney
September 21, 2017

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Motivation
▪ BI/OLAP applications that require interactive visualization of complex data streams
– Real-time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics
▪ Querying event data interactively at large scale poses multiple challenges
Interactive analytics on event data

Interactive analytics on event data
▪ Queries always contain date columns as grouping keys
▪ Most queries filter by time dimension
▪ Queries use very few columns (fewer than 10)
▪ Very selective filter conditions (hundreds/thousands of rows out of billions)
Query workload characteristics

Druid overview
▪ Development started in 2011; open-sourced in late 2012
▪ Initial use case: interactive ad analytics
▪ 150+ contributors
▪ Main features
– Column-oriented distributed data store
– Batch and real-time ingestion
– Scalable to petabytes of data
– Sub-second response for arbitrary time-based slice-and-dice
• Data partitioned by time dimension
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
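Automatic data summarization can be pictured as a pre-aggregation at ingestion time: rows that share the same truncated timestamp and dimension values collapse into one row with aggregated metrics. The Python sketch below only illustrates that idea. The function name `rollup`, the hourly granularity, the sum aggregator, and the sample events are assumptions for illustration, not Druid's ingestion code.

```python
from collections import defaultdict
from datetime import datetime

def rollup(events, dimensions, metrics):
    """Collapse events that share the same hour and dimension values.

    Simplified illustration: metrics are summed and timestamps are
    truncated to the hour (both choices are assumptions of this sketch).
    """
    totals = defaultdict(lambda: [0] * len(metrics))
    for event in events:
        hour = event["__time"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[d] for d in dimensions)
        for i, metric in enumerate(metrics):
            totals[key][i] += event[metric]
    return {key: dict(zip(metrics, sums)) for key, sums in totals.items()}

# Hypothetical events: the first two fall into the same hour.
events = [
    {"__time": datetime(2011, 1, 1, 1, 5), "page": "Justin", "c_added": 1800},
    {"__time": datetime(2011, 1, 1, 1, 40), "page": "Justin", "c_added": 200},
    {"__time": datetime(2011, 1, 1, 2, 10), "page": "Justin", "c_added": 50},
]
summary = rollup(events, dimensions=["page"], metrics=["c_added"])
```

Three input events become two stored rows here, which is where the storage and query-time savings come from.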
Most events per day: 30 billion events/day (Metamarkets)
Most computed metrics: 1 billion metrics/min (Jolata)
Largest cluster: 200 nodes (Metamarkets)
Largest hourly ingestion: 2 TB per hour (Netflix)

Druid architecture
Dashboards, BI tools

Persistent storage
▪ Data in Druid is stored in segment files
▪ Partitioned by time, which supports fast time-based slice-and-dice
▪ Ideally, each segment file is smaller than 1 GB
▪ If files are too large, smaller time partitions are needed
Time → Segment 1: Monday | Segment 2: Tuesday | Segment 3: Wednesday | Segment 4: Thursday | Segment 5: Friday
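Time partitioning helps because a query with a time filter only has to touch the segments whose interval overlaps the requested one. A minimal Python sketch of that pruning step follows. The helper `overlapping_segments`, the concrete dates, and the tuple layout are hypothetical and only mirror the Monday-to-Friday picture above.

```python
from datetime import datetime, timedelta

def overlapping_segments(segments, start, end):
    """Return names of segments whose [start, end) interval intersects the query interval."""
    return [name for name, seg_start, seg_end in segments
            if seg_start < end and start < seg_end]

# Hypothetical week of daily segments, Monday 2017-09-18 through Friday 2017-09-22.
monday = datetime(2017, 9, 18)
names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
segments = [(name, monday + timedelta(days=i), monday + timedelta(days=i + 1))
            for i, name in enumerate(names)]

# A query from Tuesday noon up to (not including) Thursday midnight.
hit = overlapping_segments(segments, datetime(2017, 9, 19, 12), datetime(2017, 9, 21))
```

Only two of the five segments need to be scanned for this query. The other three are skipped without reading any data.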

Segment data structures
▪ Within a segment
– Timestamp column
– Dimension columns
– Metric columns
– Indexes to facilitate fast lookup and aggregation
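To see why per-dimension indexes speed up filtering and aggregation, consider an inverted index that maps each dimension value to the rows containing it. The Python sketch below is a simplification: plain Python sets stand in for the compressed bitmaps a real segment would use, and the column contents are hypothetical sample data.

```python
from collections import defaultdict

def build_inverted_index(column):
    """Map each distinct dimension value to the set of row ids containing it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return index

# Hypothetical segment with two dimension columns and one metric column.
page = ["Justin", "Justin", "Ke$ha", "Ke$ha", "Miley"]
user = ["Boxer", "Reach", "Xeno", "Helz", "Ashu"]
c_added = [1800, 2912, 1953, 3194, 2232]

page_index = build_inverted_index(page)
user_index = build_inverted_index(user)

# Filter page = 'Ke$ha' OR user = 'Boxer' by combining row-id sets,
# then aggregate only the matching rows of the metric column.
rows = page_index["Ke$ha"] | user_index["Boxer"]
total = sum(c_added[r] for r in rows)
```

The filter is resolved on the indexes alone, and the metric column is read only for the matching rows.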

Querying
▪ HTTP REST API
▪ Queries and results expressed in JSON
▪ Multiple query types
– Time boundary
– Segment metadata
– Timeseries
– TopN
– GroupBy
– Select
{
  "queryType": "groupBy",
  "dataSource": "product_sales_index",
  "granularity": "all",
  "dimensions": [ "product_id" ],
  "aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Important to use the appropriate query type → impact on query performance
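As a rough illustration of the REST interface, the Python sketch below builds an HTTP POST request carrying a groupBy query, without sending it. The broker address `http://localhost:8082/druid/v2/` and the helper name `build_group_by_request` are assumptions for this sketch. The query body follows Druid's native groupBy format, with a `dimensions` list and a `"type": "default"` limit spec.

```python
import json
import urllib.request

def build_group_by_request(broker_url, data_source, dimension, metric, interval, limit):
    """Build (but do not send) an HTTP POST carrying a Druid groupBy query."""
    query = {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimensions": [dimension],
        "aggregations": [{"type": "doubleSum", "name": "s", "fieldName": metric}],
        "limitSpec": {
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": "s", "direction": "descending"}],
        },
        "intervals": [interval],
    }
    return urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed broker address; adjust host and port to the actual deployment.
request = build_group_by_request(
    "http://localhost:8082/druid/v2/",
    "product_sales_index", "product_id", "sales",
    "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000", 10,
)
# urllib.request.urlopen(request) would submit the query and return JSON results.
```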

Druid + Apache Hive
▪ Integration brings benefits both to Druid and Apache Hive
– Indexing complex query results in Druid using Hive
– Introducing a SQL interface on top of Druid
– Being able to execute complex operations on Druid data
– Efficient execution of OLAP queries in Hive

Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources

Druid data sources in Hive
▪ The user needs to provide information about Druid data sources to Hive
▪ Two different options depending on requirements
– Register Druid data sources in Hive (CREATE EXTERNAL TABLE)
• Data is already stored in Druid
– Create Druid data sources from Hive (CREATE TABLE AS SELECT)
• Data is stored in Hive
• User may want to pre-process the data before storing it in Druid
▪ INSERT INTO / INSERT OVERWRITE operations are also supported

Registering Druid data sources
▪ Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
– druid_table_1: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler classname
– wikiticker: Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query

Creating Druid data sources
▪ Use CREATE TABLE AS SELECT (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
– druid_table_1: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler classname
– wikiticker: Druid data source name
– DAY: Druid segment granularity

▪ Use CREATE TABLE AS SELECT (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
Timestamp: __time | Dimensions: page, user | Metrics: c_added, c_removed
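The type-based inference can be sketched in a few lines of Python. This is an assumed rule of thumb rather than the storage handler's actual code: the `__time` column is the timestamp, string-like columns become dimensions, and numeric columns become metrics. The function name `infer_druid_roles` and the Hive types given to the `src` columns are hypothetical.

```python
def infer_druid_roles(hive_schema):
    """Classify Hive columns into Druid timestamp, dimensions, and metrics.

    Assumed rule of thumb: the timestamp column is `__time`, string-like
    columns become dimensions, and numeric columns become metrics.
    """
    string_types = {"string", "varchar", "char"}
    numeric_types = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}
    roles = {"timestamp": None, "dimensions": [], "metrics": []}
    for name, hive_type in hive_schema:
        if name == "__time":
            roles["timestamp"] = name
        elif hive_type in string_types:
            roles["dimensions"].append(name)
        elif hive_type in numeric_types:
            roles["metrics"].append(name)
        else:
            raise ValueError(f"no assumed mapping for {name}: {hive_type}")
    return roles

# Column types below are assumptions; the slides do not state them.
roles = infer_druid_roles([
    ("__time", "timestamp"), ("page", "string"), ("user", "string"),
    ("c_added", "int"), ("c_removed", "int"),
])
```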

▪ File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Original CTAS physical plan: Table Scan → Select → File Sink
CTAS query results:
__time                page    user   c_added  c_removed
2011-01-01T01:05:00Z  Justin  Boxer  1800     25
2011-01-02T19:00:00Z  Justin  Reach  2912     42
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
2011-01-02T18:00:00Z  Miley   Ashu   2232     34

Rewritten CTAS physical plan: Table Scan → Select → Reduce → File Sink
Truncate timestamp to day granularity (adds the __time_granularity column)
CTAS query results:
__time                page    user   c_added  c_removed  __time_granularity
2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z

Rewritten CTAS physical plan: Table Scan → Select → Reduce → File Sink
CTAS query results, grouped by segment:
Segment 2011-01-01
2011-01-01T01:05:00Z  Justin  Boxer  1800  25   2011-01-01T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953  17   2011-01-01T00:00:00Z
Segment 2011-01-02
2011-01-02T19:00:00Z  Justin  Reach  2912  42   2011-01-02T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194  170  2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232  34   2011-01-02T00:00:00Z
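The effect of the rewritten plan can be reproduced in a small Python sketch: truncate each row's timestamp to day granularity, then group rows by the truncated value so that each group maps to one segment. The helper names `truncate_to_day` and `partition_by_day` are hypothetical. The rows are the sample rows from the slides.

```python
from collections import defaultdict
from datetime import datetime, timezone

ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]

def truncate_to_day(iso_timestamp):
    """Return the ISO timestamp truncated to midnight UTC of the same day."""
    ts = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ts.replace(hour=0, minute=0, second=0).strftime("%Y-%m-%dT%H:%M:%SZ")

def partition_by_day(rows):
    """Group rows by their truncated timestamp, one group per day segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[truncate_to_day(row[0])].append(row)
    return dict(segments)

segments = partition_by_day(ROWS)
```

In this sketch the grouping plays the role of the Reduce stage: every row with the same truncated timestamp ends up in the same group, so each group can be written out as one segment.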

Agenda
Introduction
Querying Druid data sources

▪ Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
▪ Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
▪ Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
▪ It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed

Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
▪ Top 10 users who added the most characters from the beginning of 2010 until the end of 2011
Apache Hive - SQL query
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink

▪ Possible to express filters on the time dimension using SQL standard functions (here, EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011)
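One way to picture how such a time filter maps onto Druid is as a translation from a year range into an ISO-8601 interval for the query's `intervals` field. The Python sketch below shows only this one case. The function name `year_range_to_interval` is hypothetical, and the actual translation in Hive and Calcite covers many more functions and cases.

```python
def year_range_to_interval(first_year, last_year):
    """Translate `EXTRACT(year FROM __time) BETWEEN a AND b` into an ISO-8601 interval.

    The interval is half-open: it starts at the beginning of `first_year`
    and ends at the beginning of the year after `last_year`.
    """
    if first_year > last_year:
        raise ValueError("empty year range")
    return f"{first_year:04d}-01-01T00:00:00.000/{last_year + 1:04d}-01-01T00:00:00.000"

interval = year_range_to_interval(2010, 2011)
```

Because BETWEEN is inclusive on both ends, the end of the interval is the start of 2012, not the start of 2011.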

▪ Initially:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive

▪ Rewriting rules push computation into Druid
– Need to check that an operator meets some pre-conditions before pushing it to Druid
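The push-down idea can be sketched as a loop that walks the operators above the Druid scan and folds each one into the Druid query while its pre-condition holds. This Python sketch is a toy model, not the Calcite rules themselves: the `PUSHABLE` set, the boolean pre-condition flag, and both function names are hypothetical simplifications.

```python
# Operators that this toy rule set is willing to push into the Druid query.
# The set and the boolean pre-condition flag are hypothetical simplifications.
PUSHABLE = {"Filter", "Project", "Aggregate", "Sort Limit"}

def push_down(operators_above_scan):
    """Split operators into those folded into the Druid query and those kept in Hive.

    Operators are listed bottom-up, starting right above the Druid scan.
    Pushing stops at the first operator that fails its pre-condition, so the
    remaining operators still run in Hive and the query always executes.
    """
    pushed = []
    for position, (name, supported) in enumerate(operators_above_scan):
        if name not in PUSHABLE or not supported:
            return pushed, [op for op, _ in operators_above_scan[position:]]
        pushed.append(name)
    return pushed, []

def druid_query_type(pushed):
    """Pick a Druid query type for the pushed operators (simplified)."""
    return "groupBy" if "Aggregate" in pushed else "select"

plan = [("Filter", True), ("Project", True), ("Aggregate", True),
        ("Sort Limit", True), ("Sink", False)]
pushed, in_hive = push_down(plan)
```

If the Filter failed its pre-condition, nothing would be pushed, Druid would only serve a select query, and Hive would execute the remaining operators.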


▪ After the rewriting rules are applied, the Druid query type changes from select to groupBy

Physical plan transformation
Query logical plan (Apache Hive): Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
Query physical plan: Table Scan → Select → File Sink
▪ Table Scan uses Druid Input Format and carries the Druid JSON query (groupBy):
{
  "queryType": "groupBy",
  "dataSource": "users_index",
  "granularity": "all",
  "dimensions": [ "user" ],
  "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}

Druid input format
▪ Submits query to Druid and generates records out of the query results
▪ Current version
– Timeseries, TopN, and GroupBy queries are not partitioned
– Select queries: realtime and historical nodes are contacted directly
[Figure: for Timeseries, TopN, and GroupBy queries, one node runs a single Table Scan with its record reader; for Select queries, several nodes each run multiple Table Scans, each with its own record reader.]

Agenda
Introduction
Implementation and experimental evaluation

Implementation
▪ Available from Apache Hive 2.2.0
– Relies on Druid 0.9.2 and Apache Calcite 1.10.0
– Registering, creating, overwriting, and deleting Druid data sources
– Querying Druid from Hive
▪ GA in HDP-2.6.3

Experimental evaluation
▪ Star Schema Benchmark (SSB)
– Based on TPC-H
– Measures the performance of database products in support of classical data warehousing applications
– Four different groups of queries
• What-if scenarios
• Drill down
• Better understand trends
Q3.2
SELECT
  c_city, s_city, d_year, sum(lo_revenue) AS revenue
FROM
  customer, lineorder, supplier, dates
WHERE
  lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_nation = 'UNITED STATES'
  and s_nation = 'UNITED STATES'
  and d_year >= 1992 and d_year <= 1997
GROUP BY
  c_city, s_city, d_year
ORDER BY
  d_year asc, revenue desc;

[Chart: SSB 1TB Scale with Hive over 10 Druid nodes (denormalized schema). Y-axis: response time (ms), 0 to 3000. Series: Average, Min, Max.]

Agenda
Introduction
Road ahead

Road ahead
▪ Tighten integration between Druid and Apache Hive/Apache Calcite
– Recognize more functions → Push more computation to Druid
– Support complex column types
– Close the gap between semantics of different systems
• Time zone handling

Road ahead
▪ Broader perspective
– Materialized views support in Apache Hive 3.0
• Data stored in Apache Hive
• Create materialized view in Druid
– Denormalized star schema
• Automatic input query rewriting over the materialized view
Materialized view definition:
CREATE MATERIALIZED VIEW ssb_mv ENABLE REWRITE
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT *
FROM
  customer, dates, lineorder, ssb_part, supplier
WHERE
  lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and lo_custkey = c_custkey;
Input query (select list and remaining clauses elided on the slide):
SELECT …
FROM
  customer, lineorder, supplier, dates
WHERE
  lo_custkey = c_custkey
  and lo_orderdate = d_datekey …
GROUP BY …
ORDER BY …
Automatic rewriting produces a query over the materialized view:
SELECT …
FROM
  ssb_mv
WHERE …
GROUP BY …
ORDER BY …
→ Query can be completely executed by Druid

Acknowledgments
▪ Apache Hive, Apache Calcite and Druid communities
– Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter Shanklin, and many others

Thank You
@ApacheHive | @ApacheCalcite | @druidio
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
https://hortonworks.com/blog/sub-second-analytics-hive-druid/
https://hortonworks.com/blog/connect-tableau-druid-hive/
