Interactive Analytics at Scale
in Apache Hive using Druid
Jesús Camacho Rodríguez
DataWorks Summit Sydney
September 21, 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Interactive analytics on event data
▪ BI/OLAP applications that require interactive visualization of complex data streams
– Real-time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics
▪ Querying event data interactively at large scale poses multiple challenges
Interactive analytics on event data
Query workload characteristics
▪ Queries always contain date columns as grouping keys
▪ Most queries filter by time dimension
▪ Queries use very few columns (fewer than 10)
▪ Very selective filter conditions (hundreds/thousands of rows out of billions)
Druid overview
▪ Development started in 2011, open-sourced in late 2012
▪ Initial use case: interactive ad analytics
▪ 150+ contributors
▪ Main features
– Column-oriented distributed data store
– Batch and real-time ingestion
– Scalable to petabytes of data
– Sub-second response for arbitrary time-based slice-and-dice
• Data partitioned by time dimension
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
Most events per day: 30 billion events / day
Most computed metrics: 1 billion metrics / min
Largest cluster: 200 nodes
Largest hourly ingestion: 2 TB per hour
Druid architecture
[Figure: Druid architecture diagram; surviving label: Dashboards, BI tools]
Persistent storage
▪ Data in Druid is stored in segment files
▪ Partitioned by time, supports fast time-based slice-and-dice
▪ Ideally, segment files are each smaller than 1 GB
▪ If files are large, smaller time partitions are needed
[Figure: data partitioned into Segment 1 through Segment 5]
Segment data structures
▪ Within a segment
– Timestamp column
– Dimension columns
– Metric columns
– Indexes to facilitate fast lookup and aggregation
▪ Queries and results expressed in JSON
▪ Multiple query types
– Time boundary
– Segment metadata
– Timeseries
– TopN
– GroupBy
– Select
    {
      "queryType": "groupBy",
      "dataSource": "product_sales_index",
      "granularity": "all",
      "dimensions": [ "product_id" ],
      "aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ],
      "limitSpec": {
        "type": "default",
        "limit": 10,
        "columns": [ { "dimension": "s", "direction": "descending" } ]
      },
      "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
    }
Important to use the appropriate query type → impact on query performance
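To make the shape of such a query concrete, the following is a minimal Python sketch that assembles a groupBy query like the one on this slide as a plain dictionary and serializes it to JSON. The helper name `build_group_by_query` is hypothetical. The `dimensions` list and the `"type": "default"` limit spec reflect our understanding of Druid's native query format rather than anything stated on the slide.

```python
import json


def build_group_by_query(data_source, dimension, metric, alias, limit, interval):
    """Assemble a Druid native groupBy query as a plain dict (illustrative helper)."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        # Assumption: groupBy takes a list of dimensions.
        "dimensions": [dimension],
        "aggregations": [
            {"type": "doubleSum", "name": alias, "fieldName": metric}
        ],
        "limitSpec": {
            # Assumption: the default limit spec is selected explicitly.
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": alias, "direction": "descending"}],
        },
        "intervals": [interval],
    }


query = build_group_by_query(
    data_source="product_sales_index",
    dimension="product_id",
    metric="sales",
    alias="s",
    limit=10,
    interval="2010-01-01T00:00:00.000/2012-01-01T00:00:00.000",
)
print(json.dumps(query, indent=2))
```

The resulting document would typically be posted to a Druid broker; the endpoint is deployment-specific and is left out here.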
Druid + Apache Hive
▪ Integration brings benefits to both Druid and Apache Hive
– Indexing complex query results in Druid using Hive
– Introducing a SQL interface on top of Druid
– Being able to execute complex operations on Druid data
– Efficient execution of OLAP queries in Hive
Interactive Analytics at Scale in Hive using Druid
Registering and creating Druid data sources
Druid data sources in Hive
▪ User needs to provide Druid data source information to Hive
▪ Two different options depending on requirements
– Register Druid data sources in Hive (CREATE EXTERNAL TABLE)
• Data is already stored in Druid
– Create Druid data sources from Hive (CREATE TABLE AS SELECT)
• Data is stored in Hive
• User may want to pre-process the data before storing it in Druid
▪ INSERT INTO / INSERT OVERWRITE operations also supported
Druid data sources in Hive
Registering Druid data sources
▪ Simple CREATE EXTERNAL TABLE statement
    CREATE EXTERNAL TABLE druid_table_1                            -- Hive table name
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'   -- Hive storage handler classname
    TBLPROPERTIES ("druid.datasource" = "wikiticker");             -- Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
Druid data sources in Hive
Creating Druid data sources
    CREATE TABLE druid_table_1                                     -- Hive table name
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'   -- Hive storage handler classname
    TBLPROPERTIES ("druid.datasource" = "wikiticker",              -- Druid data source name
                   "druid.segment.granularity" = "DAY")            -- Druid segment granularity
    AS
    SELECT __time, page, `user`, c_added, c_removed
    FROM src;
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
    Timestamp:  __time
    Dimensions: page, user
    Metrics:    c_added, c_removed
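The type-driven inference can be illustrated with a small Python sketch. The mapping used here (the `__time` timestamp column becomes the Druid timestamp, string-like columns become dimensions, numeric columns become metrics) is an assumption consistent with the example columns, not a statement of Hive's exact rules. The function name `infer_druid_role` and the Hive schema for `src` are hypothetical.

```python
# Assumed mapping (not stated on the slides): the `__time` timestamp column is
# the Druid timestamp, string-like columns are dimensions, numeric columns are metrics.
STRING_TYPES = {"string", "varchar", "char"}
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def infer_druid_role(column_name, hive_type):
    """Return 'timestamp', 'dimension', or 'metric' for one Hive column."""
    # Drop any precision/length suffix such as decimal(10,2) or varchar(64).
    base_type = hive_type.lower().split("(")[0].strip()
    if column_name == "__time" and base_type == "timestamp":
        return "timestamp"
    if base_type in STRING_TYPES:
        return "dimension"
    if base_type in NUMERIC_TYPES:
        return "metric"
    raise ValueError(f"no Druid role assumed for {column_name}: {hive_type}")


# Hypothetical Hive schema for the `src` table used in the CTAS statement.
SRC_SCHEMA = [
    ("__time", "timestamp"),
    ("page", "string"),
    ("user", "string"),
    ("c_added", "int"),
    ("c_removed", "int"),
]

roles = {name: infer_druid_role(name, hive_type) for name, hive_type in SRC_SCHEMA}
print(roles)
```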
Druid data sources in Hive
Creating Druid data sources
▪ File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Original CTAS physical plan (operators shown): Table Scan, File Sink
CTAS query results:
    __time                page    user   c_added  c_removed
    2011-01-01T01:05:00Z  Justin  Boxer  1800     25
    2011-01-02T19:00:00Z  Justin  Reach  2912     42
    2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
    2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
    2011-01-02T18:00:00Z  Miley   Ashu   2232     34
Rewritten CTAS physical plan (operators shown): Table Scan, truncate timestamp to day granularity, File Sink
CTAS query results:
    __time                page    user   c_added  c_removed  __time_granularity
    2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
    2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
    2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
    2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
    2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z
CTAS query results, grouped by __time_granularity into segments:
Segment 2011-01-01
    2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
    2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
Segment 2011-01-02
    2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
    2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
    2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z
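The partitioning step of the rewritten CTAS plan can be sketched in a few lines of Python: each row's timestamp is truncated to day granularity, and rows are then grouped by the truncated value, one group per segment. This is only an illustration of the idea using the example rows from the slides. The function names are hypothetical and this is not how Hive's File Sink operator is implemented.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Example rows from the slides: (__time, page, user, c_added, c_removed).
ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]


def truncate_to_day(timestamp):
    """Floor an ISO-8601 UTC timestamp string to the start of its day."""
    parsed = datetime.strptime(timestamp, TS_FORMAT)
    floored = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored.strftime(TS_FORMAT)


def partition_by_day(rows):
    """Group rows by their truncated timestamp, one bucket per day segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[truncate_to_day(row[0])].append(row)
    return dict(segments)


segments = partition_by_day(ROWS)
for granularity in sorted(segments):
    print("Segment", granularity[:10], "->", len(segments[granularity]), "rows")
```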
Interactive Analytics at Scale in Hive using Druid
Registering and creating Druid data sources
Querying Druid data sources
Querying Druid data sources
▪ Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
▪ Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
▪ Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
▪ It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed
Druid query recognition (powered by Apache Calcite)
▪ Top 10 users that have added the most characters, from the beginning of 2010 until the end of 2011
Apache Hive - SQL query
    SELECT `user`, sum(`c_added`) AS s
    FROM druid_table_1
    WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
    GROUP BY `user`
    ORDER BY s DESC
    LIMIT 10;
Query logical plan
[Figure: logical plan; surviving operator labels: Druid Scan, Sort Limit]
▪ Possible to express filters on the time dimension using SQL standard functions
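As an illustration of how a time filter written with standard SQL functions relates to a Druid interval, the sketch below converts the inclusive year range from `EXTRACT(year FROM __time) BETWEEN 2010 AND 2011` into the interval string that appears in the Druid JSON query in this deck. The function name `year_range_to_interval` is hypothetical, and the real translation is performed by the Calcite-based rewriting rather than by code like this.

```python
def year_range_to_interval(first_year, last_year):
    """Convert an inclusive year range into a start/end interval string.

    BETWEEN is inclusive on both ends, so the interval ends at the start
    of the year after `last_year`.
    """
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    start = f"{first_year:04d}-01-01T00:00:00.000"
    end = f"{last_year + 1:04d}-01-01T00:00:00.000"
    return f"{start}/{end}"


interval = year_range_to_interval(2010, 2011)
print(interval)
```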
▪ Initially:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive
▪ Rewriting rules push computation into Druid
– Need to check that operator meets some pre-conditions before pushing it to Druid
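The pushdown idea can be sketched as follows: walk the plan bottom-up, push each operator into the Druid query while it satisfies its pre-condition, and leave everything from the first failing operator upward to Hive, so the query still executes. This is a simplified illustration, not Hive's or Calcite's actual rule set. The operator names, the `SUPPORTED` set, and the `split_plan` helper are all hypothetical.

```python
def split_plan(operators, can_push):
    """Split a bottom-up operator list into (pushed to Druid, executed in Hive).

    Pushing stops at the first operator that fails its pre-condition; that
    operator and everything above it stay in Hive.
    """
    pushed = []
    for index, operator in enumerate(operators):
        if not can_push(operator):
            return pushed, list(operators[index:])
        pushed.append(operator)
    return pushed, []


# Hypothetical pre-condition: only these operator kinds can run in Druid.
SUPPORTED = {"Scan", "Filter", "Project", "Aggregate", "Sort Limit"}


def is_supported(operator):
    return operator in SUPPORTED


full_plan = ["Scan", "Filter", "Aggregate", "Sort Limit"]
partial_plan = ["Scan", "Filter", "Join", "Aggregate"]

print(split_plan(full_plan, is_supported))
print(split_plan(partial_plan, is_supported))
```

In the first plan every operator is pushed and Hive only reads the results. In the second, the hypothetical Join blocks further pushdown, so the Aggregate above it also runs in Hive.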
Physical plan transformation
Query logical plan
[Figure: logical plan; surviving operator labels: Druid Scan, Sort Limit]
Query physical plan
[Figure: physical plan; surviving operator labels: Table Scan, File Sink. Table Scan uses Druid Input Format]
Druid JSON query
    {
      "queryType": "groupBy",
      "dataSource": "users_index",
      "granularity": "all",
      "dimensions": [ "user" ],
      "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
      "limitSpec": {
        "type": "default",
        "limit": 10,
        "columns": [ { "dimension": "s", "direction": "descending" } ]
      },
      "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
    }
Druid input format
▪ Submits query to Druid and generates records out of the query results
▪ Current version
– Timeseries, TopN, and GroupBy queries are not partitioned
– Select queries: realtime and historical nodes are contacted directly
[Figure: Table Scan record readers for Timeseries, TopN, and GroupBy queries, and per-node Table Scan record readers for Select queries]
Interactive Analytics at Scale in Hive using Druid
Registering and creating Druid data sources
Querying Druid data sources
Implementation and experimental evaluation
▪ Available from Apache Hive 2.2.0
– Relies on Druid 0.9.2 and Apache Calcite 1.10.0
– Registering, creating, overwriting, and deleting Druid data sources
– Querying Druid from Hive
▪ GA in HDP-2.6.3
Experimental evaluation
▪ Star Schema Benchmark (SSB)
– Based on TPC-H
– Measures the performance of database products in support of classical data warehousing applications
– Four different groups of queries
• What-if scenarios
• Drill down
• Better understand trends
Q 3.2
    select c_city, s_city, d_year, sum(lo_revenue)
    from customer, lineorder, supplier, dates
    where lo_custkey = c_custkey
      and lo_suppkey = s_suppkey
      and lo_orderdate = d_datekey
      and c_nation = 'UNITED STATES'
      and s_nation = 'UNITED STATES'
      and d_year >= 1992 and d_year <= 1997
    group by c_city, s_city, d_year
    order by d_year asc, lo_revenue desc;
Experimental evaluation
SSB 1TB Scale with Hive over 10 Druid nodes (denormalized schema)
[Figure: results chart with Average, Min, and Max series]
Interactive Analytics at Scale in Hive using Druid
Registering and creating Druid data sources
Querying Druid data sources
Experimental evaluation
Road ahead
Road ahead
▪ Tighten integration between Druid and Apache Hive/Apache Calcite
– Recognize more functions → push more computation to Druid
– Support complex column types
– Close the gap between semantics of different systems
• Time zone handling
Road ahead
▪ Broader perspective
– Materialized views support in Apache Hive 3.0
• Data stored in Apache Hive
• Create materialized view in Druid
– Denormalized star schema
• Automatic input query rewriting over the materialized view
Input query
    select c_city, s_city, d_year, sum(lo_revenue)
    from customer, lineorder, supplier, dates
    where lo_custkey = c_custkey
      and lo_suppkey = s_suppkey
      and lo_orderdate = d_datekey
      and c_nation = 'UNITED STATES'
      and s_nation = 'UNITED STATES'
      and d_year >= 1992 and d_year <= 1997
    group by c_city, s_city, d_year
    order by d_year asc, lo_revenue desc;
Materialized view
    ...
    from customer, dates, lineorder, ssb_part, supplier
    where lo_orderdate = d_datekey
      and lo_partkey = p_partkey
      and lo_suppkey = s_suppkey
      and lo_custkey = c_custkey;
Rewritten query (can be completely executed by Druid)
    select c_city, s_city, d_year, sum(lo_revenue)
    from ...
    where c_nation = 'UNITED STATES'
      and s_nation = 'UNITED STATES'
      and d_year >= 1992 and d_year <= 1997
    group by c_city, s_city, d_year
    order by d_year asc, lo_revenue desc;
▪ Apache Hive, Apache Calcite and Druid communities
– Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter Shanklin, and many others
Thank You
@ApacheHive | @ApacheCalcite | @druidio

More Related Content

What's hot

Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
DataWorks Summit/Hadoop Summit
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
DataWorks Summit/Hadoop Summit
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
DataWorks Summit/Hadoop Summit
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
DataWorks Summit
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
DataWorks Summit
Apache Hadoop Crash Course - HS16SJ
Apache Hadoop Crash Course - HS16SJApache Hadoop Crash Course - HS16SJ
Apache Hadoop Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
Modernise your EDW - Data Lake
Modernise your EDW - Data LakeModernise your EDW - Data Lake
Modernise your EDW - Data Lake
DataWorks Summit/Hadoop Summit
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
DataWorks Summit
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Mats Johansson
Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the Details
DataWorks Summit/Hadoop Summit
Design a Dataflow in 7 minutes with Apache NiFi/HDF
Design a Dataflow in 7 minutes with Apache NiFi/HDFDesign a Dataflow in 7 minutes with Apache NiFi/HDF
Design a Dataflow in 7 minutes with Apache NiFi/HDF

What's hot (20)

Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
Apache Hadoop Crash Course - HS16SJ
Apache Hadoop Crash Course - HS16SJApache Hadoop Crash Course - HS16SJ
Apache Hadoop Crash Course - HS16SJ
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
Modernise your EDW - Data Lake
Modernise your EDW - Data LakeModernise your EDW - Data Lake
Modernise your EDW - Data Lake
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the Details
Design a Dataflow in 7 minutes with Apache NiFi/HDF
Design a Dataflow in 7 minutes with Apache NiFi/HDFDesign a Dataflow in 7 minutes with Apache NiFi/HDF
Design a Dataflow in 7 minutes with Apache NiFi/HDF

Similar to Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
Aaron Brooks
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
Slim Bouguerra
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
Raúl Marín
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Abdelkrim Hadjidj
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
DataWorks Summit
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Future of Data Meetup
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Artem Ervits
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
Julien Le Dem
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
DataWorks Summit/Hadoop Summit
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
MapR Technologies
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...

Similar to Interactive Analytics at Scale in Apache Hive Using Druid (20)

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...

Interactive Analytics at Scale in Apache Hive Using Druid

  • 1. Interactive Analytics at Scale in Apache Hive using Druid. Jesús Camacho Rodríguez, DataWorks Summit Sydney, September 21, 2017
  • 2. Interactive analytics on event data: Motivation
      ▪ BI/OLAP applications that require interactive visualization of complex data streams
        – Real-time bidding events
        – User activity streams
        – Voice call logs
        – Network traffic flows
        – Firewall events
        – Application performance metrics
      ▪ Querying event data interactively at large scale poses multiple challenges
  • 3. Interactive analytics on event data: Query workload characteristics
      ▪ Queries always contain date columns as grouping keys
      ▪ Most queries filter by time dimension
      ▪ Queries use very few columns (fewer than 10)
      ▪ Very selective filter conditions (hundreds/thousands of rows out of billions)
  • 4. Druid overview
      ▪ Development started in 2011, open-sourced in late 2012
      ▪ Initial use case: interactive ad analytics
      ▪ 150+ contributors
      ▪ Main features
        – Column-oriented distributed data store
        – Batch and real-time ingestion
        – Scalable to petabytes of data
        – Sub-second response for arbitrary time-based slice-and-dice
          • Data partitioned by time dimension
          • Automatic data summarization
          • Approximate algorithms (HyperLogLog, theta)
      ▪ Most events per day: 30 billion events/day (Metamarkets)
      ▪ Most computed metrics: 1 billion metrics/min (Jolata)
      ▪ Largest cluster: 200 nodes (Metamarkets)
      ▪ Largest hourly ingestion: 2TB per hour (Netflix)
  • 5. Druid architecture (diagram; visible labels: Dashboards, BI tools)
  • 6. Persistent storage
      ▪ Data in Druid is stored in segment files
      ▪ Partitioned by time, supports fast time-based slice-and-dice
      ▪ Ideally, segment files are each smaller than 1GB
      ▪ If files are large, smaller time partitions are needed
      Timeline example: Segment 1: Monday; Segment 2: Tuesday; Segment 3: Wednesday; Segment 4: Thursday; Segment 5: Friday
  • 7. Segment data structures
      ▪ Within a segment
        – Timestamp column
        – Dimension columns
        – Metric columns
        – Indexes to facilitate fast lookup and aggregation
  • 8. Querying
      ▪ HTTP REST API
      ▪ Queries and results expressed in JSON
      ▪ Multiple query types
        – Time boundary
        – Segment metadata
        – Timeseries
        – TopN
        – GroupBy
        – Select
      ▪ It is important to use an adequate query type, because the choice affects query performance
      Example query:
        {
          "queryType": "groupBy",
          "dataSource": "product_sales_index",
          "granularity": "all",
          "dimension": "product_id",
          "aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ],
          "limitSpec": { "limit": 10, "columns": [ { "dimension": "s", "direction": "descending" } ] },
          "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
        }
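A minimal sketch of how a client might assemble the query shown on slide 8 and wrap it in an HTTP POST request. The broker URL, the helper names `build_group_by_query` and `make_request`, and the choice of Python are assumptions made for illustration; the slides only state that queries and results travel as JSON over an HTTP REST API. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical broker address (assumption, not taken from the slides).
BROKER_URL = "http://localhost:8082/druid/v2/"


def build_group_by_query(data_source, dimension, metric, limit, interval):
    """Build a groupBy query with the same shape as the one on the slide."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimension": dimension,
        "aggregations": [
            {"type": "doubleSum", "name": "s", "fieldName": metric}
        ],
        "limitSpec": {
            "limit": limit,
            "columns": [{"dimension": "s", "direction": "descending"}],
        },
        "intervals": [interval],
    }


def make_request(query, url=BROKER_URL):
    """Wrap the query in a POST request with a JSON body (not sent here)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = build_group_by_query(
    "product_sales_index",
    "product_id",
    "sales",
    10,
    "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000",
)
request = make_request(query)
```

Passing `request` to `urllib.request.urlopen` would submit it to a running broker; the response body would then be the JSON result set.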
  • 9. Druid + Apache Hive
      ▪ Integration brings benefits both to Druid and Apache Hive
        – Indexing complex query results in Druid using Hive
        – Introducing a SQL interface on top of Druid
        – Being able to execute complex operations on Druid data
        – Efficient execution of OLAP queries in Hive
  • 10. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
  • 11. Druid data sources in Hive
      ▪ User needs to provide Druid data source information to Hive
      ▪ Two different options depending on requirements
        – Register Druid data sources in Hive (CREATE EXTERNAL TABLE)
          • Data is already stored in Druid
        – Create Druid data sources from Hive (CREATE TABLE AS SELECT)
          • Data is stored in Hive
          • User may want to pre-process the data before storing it in Druid
      ▪ INSERT INTO / INSERT OVERWRITE operations also supported
  • 12. Druid data sources in Hive: Registering Druid data sources
      ▪ Simple CREATE EXTERNAL TABLE statement
          CREATE EXTERNAL TABLE druid_table_1
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker");
        – druid_table_1: Hive table name
        – org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
        – wikiticker: Druid data source name
      ⇢ Broker node endpoint specified as a Hive configuration parameter
      ⇢ Automatic Druid data schema discovery: segment metadata query
  • 13. Druid data sources in Hive: Creating Druid data sources
      ▪ Use CREATE TABLE AS SELECT (CTAS) statement
          CREATE TABLE druid_table_1
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
          AS
          SELECT __time, page, user, c_added, c_removed
          FROM src;
        – druid_table_1: Hive table name
        – org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
        – wikiticker: Druid data source name
        – DAY: Druid segment granularity
  • 14. Druid data sources in Hive: Creating Druid data sources
      ▪ Use CREATE TABLE AS SELECT (CTAS) statement
          CREATE TABLE druid_table_1
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
          AS
          SELECT __time, page, user, c_added, c_removed
          FROM src;
        – Timestamp: __time
        – Dimensions: page, user
        – Metrics: c_added, c_removed
      ⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
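The type-driven inference on slide 14 can be pictured with a small sketch. The mapping used here (the `__time` column becomes the Druid timestamp, string-like columns become dimensions, numeric columns become metrics) is an assumption for illustration, as are the Hive column types given to the `src` columns; the slides only say that the inferred role depends on the Hive column type.

```python
# Assumed type groups; the storage handler's real rules may differ.
STRING_TYPES = {"string", "varchar", "char"}
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def classify_columns(schema):
    """Assign each (name, hive_type) pair a Druid role based on its type."""
    roles = {"timestamp": [], "dimensions": [], "metrics": []}
    for name, hive_type in schema:
        kind = hive_type.lower()
        if name == "__time":
            if kind != "timestamp":
                raise ValueError("__time must be a timestamp column")
            roles["timestamp"].append(name)
        elif kind in STRING_TYPES:
            roles["dimensions"].append(name)
        elif kind in NUMERIC_TYPES:
            roles["metrics"].append(name)
        else:
            raise ValueError(f"unsupported Hive type for {name}: {hive_type}")
    return roles


# Hypothetical schema for the `src` table used in the CTAS statement.
src_schema = [
    ("__time", "timestamp"),
    ("page", "string"),
    ("user", "string"),
    ("c_added", "bigint"),
    ("c_removed", "bigint"),
]
roles = classify_columns(src_schema)
```

Under these assumed types the result matches the annotations on the slide: one timestamp column, two dimensions, two metrics.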
  • 15. Druid data sources in Hive: Creating Druid data sources
      ▪ File Sink operator uses Druid output format
        – Creates segment files and registers them in Druid
        – Data needs to be partitioned by time granularity
          • Granularity specified as configuration parameter
      Original CTAS physical plan (operators): Table Scan, Select, File Sink
      CTAS query results:
        __time                page    user   c_added  c_removed
        2011-01-01T01:05:00Z  Justin  Boxer  1800     25
        2011-01-02T19:00:00Z  Justin  Reach  2912     42
        2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
        2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
        2011-01-02T18:00:00Z  Miley   Ashu   2232     34
  • 16. Druid data sources in Hive: Creating Druid data sources
      ▪ File Sink operator uses Druid output format
        – Creates segment files and registers them in Druid
        – Data needs to be partitioned by time granularity
          • Granularity specified as configuration parameter
      Rewritten CTAS physical plan (operators): Table Scan, Select, Reduce, File Sink; the timestamp is truncated to day granularity
      CTAS query results:
        __time                page    user   c_added  c_removed  __time_granularity
        2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
        2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
        2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
        2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
        2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z
  • 17. Druid data sources in Hive: Creating Druid data sources
      ▪ File Sink operator uses Druid output format
        – Creates segment files and registers them in Druid
        – Data needs to be partitioned by time granularity
          • Granularity specified as configuration parameter
      Rewritten CTAS physical plan (operators): Table Scan, Select, Reduce, File Sink
      CTAS query results, grouped by segment:
        Segment 2011-01-01
          2011-01-01T01:05:00Z  Justin  Boxer  1800  25   2011-01-01T00:00:00Z
          2011-01-01T11:00:00Z  Ke$ha   Xeno   1953  17   2011-01-01T00:00:00Z
        Segment 2011-01-02
          2011-01-02T19:00:00Z  Justin  Reach  2912  42   2011-01-02T00:00:00Z
          2011-01-02T13:00:00Z  Ke$ha   Helz   3194  170  2011-01-02T00:00:00Z
          2011-01-02T18:00:00Z  Miley   Ashu   2232  34   2011-01-02T00:00:00Z
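The partitioning step on slides 15 to 17 can be sketched as follows: truncate each row's timestamp to day granularity, then group rows by the truncated value so that each group corresponds to one segment. The rows are the sample CTAS results from the slides; the function names and the in-memory grouping are illustrative assumptions, not Hive's actual Reduce and File Sink implementation.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Sample CTAS query results from the slides.
ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_day(timestamp):
    """Return the timestamp floored to midnight UTC (day granularity)."""
    parsed = datetime.strptime(timestamp, TIME_FORMAT).replace(tzinfo=timezone.utc)
    floored = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored.strftime(TIME_FORMAT)


def partition_by_day(rows):
    """Group rows by their truncated timestamp; one group per segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[truncate_to_day(row[0])].append(row)
    return dict(segments)


segments = partition_by_day(ROWS)
```

With the sample rows this yields two groups, one for 2011-01-01 with two rows and one for 2011-01-02 with three rows, matching the two segments shown on slide 17.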
  • 18. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
      ▪ Querying Druid data sources
  • 19. Querying Druid data sources
      ▪ Automatic rewriting when query is expressed over Druid table
        – Powered by Apache Calcite
        – Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
      ▪ Translate (sub)plan of operators into valid Druid JSON query
        – Druid query is encapsulated within Hive TableScan operator
      ▪ Hive TableScan uses Druid input format
        – Submits query to Druid and generates records out of the query results
      ▪ It might not be possible to push all computation to Druid
        – Our contract is that the query should always be executed
  • 20. Druid query recognition (powered by Apache Calcite)
      Apache Hive SQL query:
          SELECT `user`, sum(`c_added`) AS s
          FROM druid_table_1
          WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
          GROUP BY `user`
          ORDER BY s DESC
          LIMIT 10;
      ▪ Top 10 users that added the most characters from the beginning of 2010 until the end of 2011
      Query logical plan (operators): Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink
  • 21. Druid query recognition (powered by Apache Calcite)
      Same SQL query and logical plan as slide 20
      ▪ Possible to express filters on time dimension using SQL standard functions
  • 22. Druid query recognition (powered by Apache Calcite)
      Same SQL query and logical plan as slide 20; diagram labels: Apache Hive, Druid query (select)
      ▪ Initially:
        – Scan is executed in Druid (select query)
        – Rest of the query is executed in Hive
  • 23. Druid query recognition (powered by Apache Calcite)
      Same SQL query and logical plan as slide 20; diagram labels: Apache Hive, Druid query (select), Rewriting rule
      ▪ Rewriting rules push computation into Druid
        – Need to check that the operator meets some preconditions before pushing it to Druid
  • 25. Druid query recognition (powered by Apache Calcite)
      Same SQL query and logical plan as slide 20; diagram labels: Apache Hive, Druid query (groupBy), Rewriting rule
      ▪ Rewriting rules push computation into Druid
        – Need to check that the operator meets some preconditions before pushing it to Druid
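One way to picture the rule-based pushdown is a loop that walks the logical plan upward from the Druid scan and absorbs each operator into the Druid query until one fails its precondition check; everything above that point stays in Hive, so the query can still be executed. This is a simplified sketch under stated assumptions: the real rules are Apache Calcite planner rules, and the names `push_down`, `druid_query_type`, and the precondition callbacks are hypothetical. Only the select and groupBy cases from the slides are modeled.

```python
def push_down(plan, can_push):
    """Split a bottom-up operator list into (pushed to Druid, kept in Hive).

    `plan` lists the operators above the Druid scan, lowest first.
    `can_push` is a precondition check for a single operator.
    """
    pushed = []
    for index, operator in enumerate(plan):
        if not can_push(operator):
            # Stop at the first operator that cannot be pushed;
            # it and everything above it run in Hive.
            return pushed, plan[index:]
        pushed.append(operator)
    return pushed, []


def druid_query_type(pushed):
    """Pick a Druid query type for the pushed operators (simplified)."""
    return "groupBy" if "Aggregate" in pushed else "select"


PLAN = ["Filter", "Project", "Aggregate", "SortLimit"]

# Case 1: every operator meets its preconditions.
all_pushed, hive_rest = push_down(PLAN, lambda op: True)

# Case 2: the aggregate cannot be pushed, so it and the sort stay in Hive.
partly_pushed, partly_rest = push_down(PLAN, lambda op: op != "Aggregate")
```

In the first case the whole plan collapses into a single groupBy query, as on slide 25. In the second, Druid serves a select query and Hive executes the aggregate and the sort.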
  • 27. Physical plan transformation
      Query logical plan (operators): Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink; diagram labels: Apache Hive, Druid query (groupBy)
      Query physical plan (operators): Table Scan, Select, File Sink
      ▪ Table Scan uses Druid Input Format
      Druid JSON query:
        {
          "queryType": "groupBy",
          "dataSource": "users_index",
          "granularity": "all",
          "dimension": "user",
          "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
          "limitSpec": { "limit": 10, "columns": [ { "dimension": "s", "direction": "descending" } ] },
          "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
        }
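The filter `EXTRACT(year FROM __time) BETWEEN 2010 AND 2011` ends up as the `intervals` entry of the JSON query. A minimal sketch of that translation follows, assuming the interval is half-open (start inclusive, end exclusive), so the end bound is the first instant of the year after the last one requested. The function name `years_to_interval` is hypothetical.

```python
def years_to_interval(first_year, last_year):
    """Translate an inclusive year range into an ISO-8601 interval string."""
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    start = f"{first_year:04d}-01-01T00:00:00.000"
    # The end is exclusive, so it is the start of the following year.
    end = f"{last_year + 1:04d}-01-01T00:00:00.000"
    return f"{start}/{end}"


interval = years_to_interval(2010, 2011)
```

For the query on the slides this produces the same interval string that appears in the Druid JSON query.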
  • 28. Druid input format
      ▪ Submits query to Druid and generates records out of the query results
      ▪ Current version
        – Timeseries, TopN, and GroupBy queries are not partitioned
        – Select queries: realtime and historical nodes are contacted directly
      Diagram: Table Scan record readers per node, shown for Timeseries/TopN/GroupBy queries and for Select queries
  • 29. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
      ▪ Querying Druid data sources
      ▪ Implementation and experimental evaluation
  • 30. Implementation
      ▪ Available from Apache Hive 2.2.0
        – Relies on Druid 0.9.2 and Apache Calcite 1.10.0
        – Registering, creating, overwriting, and deleting Druid data sources
        – Querying Druid from Hive
      ▪ GA in HDP-2.6.3
  • 31. Experimental evaluation
      ▪ Star Schema Benchmark (SSB)
        – Based on TPC-H
        – Measures the performance of database products in support of classical data warehousing applications
        – Four different groups of queries
          • What-if scenarios
          • Drill down
          • Better understand trends
      Q 3.2:
          SELECT c_city, s_city, d_year, sum(lo_revenue)
          FROM customer, lineorder, supplier, dates
          WHERE lo_custkey = c_custkey
            and lo_suppkey = s_suppkey
            and lo_orderdate = d_datekey
            and c_nation = 'UNITED STATES'
            and s_nation = 'UNITED STATES'
            and d_year >= 1992 and d_year <= 1997
          GROUP BY c_city, s_city, d_year
          ORDER BY d_year asc, lo_revenue desc;
  • 32. Experimental evaluation. [Chart: response time (ms), axis from 0 to 3000, for SSB at 1TB scale with Hive over 10 Druid nodes (denormalized schema); series: Average, Min, Max.]
  • 33. Agenda: Interactive Analytics at Scale in Hive using Druid. Introduction; Registering and creating Druid data sources; Querying Druid data sources; Experimental evaluation; Road ahead.
  • 34. Road ahead. Tighten integration between Druid and Apache Hive/Apache Calcite: recognize more functions. Push more computation to Druid: support complex column types; close the gap between semantics of different systems (for example, time zone handling).
  • 35. Road ahead. Broader perspective: materialized views support in Apache Hive 3.0. Data is stored in Apache Hive; a materialized view is created in Druid (denormalized star schema); input queries are automatically rewritten over the materialized view. Materialized view definition: CREATE MATERIALIZED VIEW ssb_mv ENABLE REWRITE STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT * FROM customer, dates, lineorder, ssb_part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND lo_custkey = c_custkey; Input query: SELECT c_city, s_city, d_year, sum(lo_revenue) AS revenue FROM customer, lineorder, supplier, dates WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_orderdate = d_datekey AND c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES' AND d_year >= 1992 AND d_year <= 1997 GROUP BY c_city, s_city, d_year ORDER BY d_year ASC, revenue DESC; After automatic rewriting: SELECT c_city, s_city, d_year, sum(lo_revenue) AS revenue FROM ssb_mv WHERE c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES' AND d_year >= 1992 AND d_year <= 1997 GROUP BY c_city, s_city, d_year ORDER BY d_year ASC, revenue DESC; The rewritten query can be completely executed by Druid.
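The equivalence that the rewriting on slide 35 relies on can be checked on toy data. The sketch below uses Python's built-in `sqlite3` as a stand-in, under several stated assumptions: SQLite has no materialized views, so `ssb_mv` is a plain denormalized table; the rewriting is done by hand rather than by Hive and Calcite; the `ssb_part` table is omitted for brevity; and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (c_custkey INTEGER, c_city TEXT, c_nation TEXT);
CREATE TABLE supplier  (s_suppkey INTEGER, s_city TEXT, s_nation TEXT);
CREATE TABLE dates     (d_datekey INTEGER, d_year INTEGER);
CREATE TABLE lineorder (lo_custkey INTEGER, lo_suppkey INTEGER,
                        lo_orderdate INTEGER, lo_revenue INTEGER);

-- Invented sample rows.
INSERT INTO customer VALUES (1, 'UNITED ST0', 'UNITED STATES'),
                            (2, 'CANADA   1', 'CANADA');
INSERT INTO supplier VALUES (1, 'UNITED ST5', 'UNITED STATES');
INSERT INTO dates VALUES (19930101, 1993), (19980101, 1998);
INSERT INTO lineorder VALUES (1, 1, 19930101, 100), (1, 1, 19930101, 50),
                             (2, 1, 19930101, 70),  (1, 1, 19980101, 30);

-- Stand-in for the materialized view: a denormalized copy of the star join.
CREATE TABLE ssb_mv AS
SELECT * FROM customer, dates, lineorder, supplier
WHERE lo_orderdate = d_datekey
  AND lo_suppkey = s_suppkey
  AND lo_custkey = c_custkey;
""")

# Filters, grouping, and ordering shared by both forms of the query.
TAIL = """
  c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES'
  AND d_year >= 1992 AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, revenue DESC
"""

original = """
SELECT c_city, s_city, d_year, sum(lo_revenue) AS revenue
FROM customer, lineorder, supplier, dates
WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey
  AND lo_orderdate = d_datekey AND
""" + TAIL

# Hand-written rewrite: the joins disappear because ssb_mv already holds them.
rewritten = """
SELECT c_city, s_city, d_year, sum(lo_revenue) AS revenue
FROM ssb_mv
WHERE
""" + TAIL

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows == rewritten_rows)
```

Because the view already contains the joined rows, the rewritten query needs only filters and an aggregation, which is the shape of query that can be pushed to Druid as a whole.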
  • 36. Acknowledgments. Apache Hive, Apache Calcite, and Druid communities: Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter Shanklin, and many others.
  • 37. Thank You @ApacheHive | @ApacheCalcite | @druidio

Editor's Notes

  1. Fact table approx 6B rows
  2. - Add more info about materialized views?
  3. - Add more info about materialized views?