Apache Drill is a scalable SQL query engine for analysis of large-scale datasets across various data sources like HDFS, HBase, Hive and others. It allows for ad-hoc analysis of datasets without requiring knowledge of the schema beforehand. Drill uses a distributed architecture with query coordinators and workers to process queries in parallel. It supports various interfaces like JDBC, ODBC and a web console for running SQL queries on different data sources.
Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle newer formats like JSON, Parquet and ORC alongside the usual CSV, TSV, XML and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by being able to handle schema changes on the fly, enabling a whole new world of self-service and data agility never seen before.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Apache Drill (http://incubator.apache.org/drill/) is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel technology. It is designed to scale to thousands of servers and to process petabytes of data in seconds. Since its inception in mid 2012, Apache Drill has gained widespread interest in the community, attracting hundreds of interested individuals and companies. In the talk we discuss how Apache Drill enables ad-hoc interactive queries at scale, walking through typical use cases and delving into Drill's architecture, data flow and query languages, as well as the data sources supported.
Apache Drill is a new Apache incubator project. Its goal is to provide a distributed system for interactive analysis of large-scale datasets. Inspired by Google's Dremel technology, it aims to process trillions of records in seconds. We will cover the goals of Apache Drill, its use cases and how it relates to Hadoop, MongoDB and other large-scale distributed systems. We'll also talk about details of the architecture, points of extensibility, data flow and our first query languages (DrQL and SQL).
This document discusses Apache Drill, an open source SQL query engine for analyzing data in non-relational data stores like JSON, CSV, and Hadoop data formats. It provides an overview of Drill's key features such as its ability to query diverse data sources with a simple SQL interface without requiring schemas, its SQL-on-Everything model, high performance through columnar storage and execution, and its ability to scale from a single machine to large clusters. The document also demonstrates how to install Drill, configure data sources, and run queries against sample Yelp data to analyze reviews, users, and businesses.
This document provides an overview of Apache Drill and how it enables ad-hoc querying and analysis of structured and unstructured data stored in Hadoop. Some key points:
1) Apache Drill allows for schema-free SQL queries against data in HDFS, HBase, Hive and other data sources, empowering self-service data exploration and "zero-day" analytics.
2) Drill's queries can handle complex, nested data through features like automatic schema discovery, repeated value support, and SQL extensions.
3) Examples show how Drill provides a familiar SQL interface and tooling to analyze JSON, text and other file formats to gain insights from large volumes of real-time data.
The document provides an introduction to Apache Drill, an open source SQL query engine for analysis of large-scale datasets across Hadoop, NoSQL and cloud storage systems. It discusses Tomer Shiran's role in Apache Drill, provides an agenda for the talk, describes the need for interactive analysis of big data and how existing solutions are limited. It then outlines Apache Drill's architecture, key features like full SQL support, optional schemas and support for nested data formats.
Working with Delimited Data in Apache Drill 1.6.0 (Vince Gonzalez)
This presentation is a tutorial on using Apache Drill 1.6.0 to query delimited data, such as in the CSV or TSV formats. This was presented in a workshop format, and I'm available to present this to your team as well.
The tutorial covers typical steps taken on the way to using Drill to make delimited data visible to BI tools, such as Qlik Sense, which I use for the visualizations in the slides.
MapR provides professional support for Apache Drill, please contact me if you're interested in learning more!
Introduction to Apache HBase, MapR Tables and Security (MapR Technologies)
This talk will focus on two key aspects of applications that use the HBase APIs. The first part provides a basic overview of how HBase works, followed by an introduction to the HBase APIs with a simple example. The second part extends what we've learned to secure an HBase application running on MapR's industry-leading Hadoop.
Keys Botzum is a Senior Principal Technologist with MapR Technologies. He has over 15 years of experience in large scale distributed system design. At MapR his primary responsibility is working with customers as a consultant, but he also teaches classes, contributes to documentation, and works with MapR engineering. Previously he was a Senior Technical Staff Member with IBM and a respected author of many articles on WebSphere Application Server as well as a book. He holds a Masters degree in Computer Science from Stanford University and a B.S. in Applied Mathematics/Computer Science from Carnegie Mellon University.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (Charles Givre)
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self-describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper around MapReduce and can scale to clusters of up to 10k nodes.
Understanding the Value and Architecture of Apache Drill (DataWorks Summit)
This document summarizes Apache Drill, an open source SQL query engine for interactive analysis of large-scale datasets. It was inspired by Google's Dremel and allows for interactive, ad-hoc queries across data sources using standard SQL. The key features highlighted are its support for nested data, optional schemas, extensibility points, and full ANSI SQL 2003 compatibility. An overview of Drill's architecture is provided, including its use of distributed Drillbit processes and a coordinator node.
Drill into Drill – How Providing Flexibility and Performance is Possible (MapR Technologies)
Learn how Drill achieves high performance with flexibility and ease of use. Includes: first-read planning and statistics; flexible code generation depending on workload; code optimization and planning techniques; dynamic schema subsets; advanced memory use and moving between Java and C; and making static typing appear dynamic through any-time and multi-phase planning.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses challenges in big data by providing reliability, scalability, and fault tolerance. Hadoop allows distributed processing of large datasets across clusters using MapReduce and can scale from single servers to thousands of machines, each offering local computation and storage. It is widely used for applications such as log analysis, data warehousing, and web indexing.
Ted Dunning presents information on Drill and Spark SQL. Drill is a query engine that operates on batches of rows in a pipelined and optimistic manner, while Spark SQL provides SQL capabilities on top of Spark's RDD abstraction. The document discusses the key differences in their approaches to optimization, execution, and security. It also explores opportunities for unification by allowing Drill and Spark to work together on the same data.
Join our experts Neeraja Rentachintala, Sr. Director of Product Management and Aman Sinha, Lead Software Engineer and host Sameer Nori in a discussion about putting Apache Drill into production.
The document compares and contrasts the SAS and Spark frameworks. It provides an overview of their programming models, with SAS using data steps and procedures while Spark uses Scala and distributed datasets. Examples are shown of common tasks like loading data, sorting, grouping, and regression in both SAS Proc SQL and Spark SQL. Spark MLlib is described as Spark's machine learning library, in contrast to SAS Stats. Finally, Spark Streaming is demonstrated for loading and querying streaming data from Kafka. The key takeaways recommend trying Spark for large data, distributed computing, better control of code, open source licensing, or leveraging Hadoop data.
NoSQL HBase schema design and SQL with Apache Drill (Carol McDonald)
The document provides an overview of HBase, including:
- HBase is a column-oriented NoSQL database modeled after Google's Bigtable. It is designed to handle large volumes of sparse data across clusters in a distributed fashion.
- Data in HBase is stored in tables containing rows, column families, columns, and versions. Tables are partitioned into regions distributed across region servers. The HMaster manages the cluster and Zookeeper coordinates operations.
- Common operations on HBase include put (insert/update), get, scan, and delete. The meta table stored in Zookeeper maps rows to their regions. This allows clients to efficiently access data in HBase's distributed architecture.
The document provides an overview of the Apache Hadoop ecosystem. It describes Hadoop as a distributed, scalable storage and computation system based on Google's architecture. The ecosystem includes many related projects that interact, such as YARN, HDFS, Impala, Avro, Crunch, and HBase. These projects innovate independently but work together, with Hadoop serving as a flexible data platform at the core.
Hadoop in Practice (SDN Conference, Dec 2014), by Marcel Krcah
You sit on a big pile of data and want to know how to leverage it in your company? Interested in use-cases, examples and practical demos about the full Hadoop stack? Looking for big-data inspiration?
In this talk we will cover:
- Use-cases how implementing a Hadoop stack in TheNewMotion drastically helped us, software engineers, with our everyday challenges. And how Hadoop enables our management team, marketing and operations to become more data-driven.
- Practical introduction into our data warehouse, analytical and visualization stack: Apache Pig, Impala, Hue, Apache Spark, IPython notebook and Angular with D3.js.
- Easy deployment of the Hadoop stack to the cloud.
- Hermes - our homegrown command-line tool which helps us automate data-related tasks.
- Examples of exciting machine learning challenges that we are currently tackling.
- Hadoop with Azure and Microsoft stack.
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella on the Nutch project, in response to the growing amounts of data and computational needs at Google and other companies.
- The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing. It also includes utilities like Hadoop Common for file system access and other basic functionality.
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
This document provides an introduction to Hadoop, including its ecosystem, architecture, key components like HDFS and MapReduce, characteristics, and popular flavors. Hadoop is an open source framework that efficiently processes large volumes of data across clusters of commodity hardware. It consists of HDFS for storage and MapReduce as a programming model for distributed processing. A Hadoop cluster typically has a single namenode and multiple datanodes. Many large companies use Hadoop to analyze massive datasets.
Content presented at a talk on Aug. 29th. The purpose is to inform a fairly technical audience on the primary tenets of Big Data and the Hadoop stack. Also includes a walk-through of Hadoop and parts of the Hadoop stack, i.e. Pig, Hive, HBase.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
This document provides an overview of Apache Hadoop, including its architecture, components, and applications. Hadoop is an open-source framework for distributed storage and processing of large datasets. It uses Hadoop Distributed File System (HDFS) for storage and MapReduce for processing. HDFS stores data across clusters of nodes and replicates files for fault tolerance. MapReduce allows parallel processing of large datasets using a map and reduce workflow. The document also discusses Hadoop interfaces, Oracle connectors, and resources for further information.
This document discusses MySQL and Hadoop. It provides an overview of Hadoop, Cloudera Distribution of Hadoop (CDH), MapReduce, Hive, Impala, and how MySQL can interact with Hadoop using Sqoop. Key use cases for Hadoop include recommendation engines, log processing, and machine learning. The document also compares MySQL and Hadoop in terms of data capacity, query languages, and support.
The document discusses Hadoop and big data technologies. It begins with an introduction to big data concepts and the various Hadoop components like HDFS, MapReduce, YARN, Hive, Pig and Mahout. It then explains how big data is different from traditional data warehousing through the concept of schema-on-read. Finally, it provides recommendations on tools for working with big data technologies locally and in the cloud, as well as sources of inspiration like sandbox environments, Apache projects and GitHub.
Overview of big data & hadoop version 1 - Tony Nguyen (Thanh Nguyen)
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices and technologists today. However, many people still aren't sure how Big Data and an existing data warehouse can be married to turn that promise into value. This presentation provides an overview of Big Data technology and how Big Data can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
Overview of Big data, Hadoop and Microsoft BI - version1 (Thanh Nguyen)
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
Big Data Analytics with Hadoop, MongoDB and SQL Server (Mark Kromer)
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of petabytes of data. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Many large companies use Hadoop for applications such as log analysis, web indexing, and data mining of large datasets.
The document introduces the Windows Azure HDInsight Service, which provides a managed Hadoop service on Windows Azure. It discusses big data and Hadoop, describes the components included in HDInsight like HDFS, MapReduce, Pig and Hive. It provides examples of using Pig, Hive and Sqoop with HDInsight and explains how HDInsight is administered through the management portal.
Big Data Hoopla Simplified - TDWI Memphis 2014 (Rajan Kanitkar)
The document provides an overview and quick reference guide to big data concepts including Hadoop, MapReduce, HDFS, YARN, Spark, Storm, Hive, Pig, HBase and NoSQL databases. It discusses the evolution of Hadoop from versions 1 to 2, and new frameworks like Tez and YARN that allow different types of processing beyond MapReduce. The document also summarizes common big data challenges around skills, integration and analytics.
This presentation provides an overview of big data concepts and Hadoop technologies. It discusses what big data is and why it is important for businesses to gain insights from massive data. The key Hadoop technologies explained include HDFS for distributed storage, MapReduce for distributed processing, and various tools that run on top of Hadoop like Hive, Pig, HBase, HCatalog, ZooKeeper and Sqoop. Popular Hadoop SQL databases like Impala, Presto and Stinger are also compared in terms of their performance and capabilities. The document discusses options for deploying Hadoop on-premise or in the cloud and how to integrate Microsoft BI tools with Hadoop for big data analytics.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... (Cloudera, Inc.)
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
The document discusses big data and Hadoop. It describes the three V's of big data - variety, volume, and velocity. It also discusses Hadoop components like HDFS, MapReduce, Pig, Hive, and YARN. Hadoop is a framework for storing and processing large datasets in a distributed computing environment. It allows for the ability to store and use all types of data at scale using commodity hardware.
How AI is Revolutionizing Data Collection.pdf (PromptCloud)
Artificial Intelligence (AI) is transforming the landscape of data collection, making it more efficient, accurate, and insightful than ever before. With AI, businesses can automate the extraction of vast amounts of data from diverse sources, analyze patterns in real-time, and gain deeper insights with minimal human intervention. This revolution in data collection enables companies to make faster, data-driven decisions, enhance their competitive edge, and unlock new opportunities for growth.
AI-powered tools can handle complex and dynamic web content, adapt to changes in website structures, and even understand the context of data through natural language processing. This means that data collection is not only faster but also more precise, reducing the time and effort required for manual data extraction. Furthermore, AI can process unstructured data, such as social media posts and customer reviews, providing valuable insights into customer sentiment and market trends.
Embrace the future of data collection with AI and stay ahead of the curve. Learn more about how PromptCloud’s AI-driven web scraping solutions can transform your data strategy. https://www.promptcloud.com/contact/
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ... (weiwchu)
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, it automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly-available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling cost.
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Annex K RBF's The World Game pdf document (Steven McGee)
Signals & Telemetry Annex K for RBF's The World Game / Trade Federations / USPTO 13/573,002 Heart Beacon Cycle Time - Space Time Chain meters, metrics, standards. Adaptive Procedural template framework structured data derived from DoD / NATO's system of systems engineering tech framework
Combined supervised and unsupervised neural networks for pulse shape discrimi... (Samuel Jackson)
Our methodology for pulse shape discrimination is split into two steps. First, we learn a model to discriminate between pulses using "clean" low-rate examples, removing pile-up and saturated events. In addition to traditional tail-sum discrimination, we investigate three different approaches to discriminating between γ-pulses and fast and thermal neutrons: clustering the pulses directly using Gaussian Mixture Modelling (GMM); using variational autoencoders to learn a representation of the pulses and then clustering the learned representation (VAE+GMM); and using density ratio estimation to discriminate between a mixed (γ + neutron) and a pure (γ only) source with a multi-layer perceptron (MLP) as a supervised learning problem.
Secondly, we aim to classify and recover pile-up events in the < 150 ns regime by training a single unified multi-label MLP. To frame the problem as a multi-label supervised learning method, we first simulate pile-up events with known components. Then, using the simulated data and combining it with single event data, we train a final multi-label MLP to output a binary code indicating both how many and which type of events are present within an event window.
Introduction to Data Science
1.1 What is Data Science, importance of data science,
1.2 Big data and data Science, the current Scenario,
1.3 Industry Perspective Types of Data: Structured vs. Unstructured Data,
1.4 Quantitative vs. Categorical Data,
1.5 Big Data vs. Little Data, Data science process
1.6 Role of Data Scientist
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
4. At some point of time...
The Hadoop cluster has a lot of data useful for ad-hoc analysis.
It is hard to perform data exploration in batch mode (“data lake”, “schema on read”); lots of iterative tasks.
Servers have more RAM, SSD drives...
8. Apache Drill
Scalable query engine
Querying different data sources - both schema-based and schema-free
JDBC / Mongo / File System / Hive / HBase
Text files / Parquet / Sequence files / MapR-DB
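A minimal sketch of what this looks like in practice - Drill can query a raw file through a storage plugin with plain SQL, no table definition required. The file path and field names below are hypothetical, not taken from the slides:

```sql
-- Query a JSON file directly via the dfs storage plugin.
-- Path and field names are illustrative assumptions only.
SELECT t.name, t.stars
FROM dfs.`/data/yelp/business.json` t
WHERE t.stars >= 4
LIMIT 10;
```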
9. Integration with existing BI tools
Apache Drill comes with a JDBC/ODBC driver.
Support for many data sources and formats, plus responsiveness, makes it a good candidate as a backend for Business Intelligence tools.
11. Architecture highlights
A cluster of nodes on which the drillbit service is installed.
A drillbit is responsible for receiving queries, generating plans and executing them.
ZooKeeper is used to maintain cluster membership.
Clients can connect to any node (or via ZooKeeper) and submit queries.
12. Architecture highlights (cont.)
Schema can be discovered at runtime - no need to know the schema before executing the query.
Storage plugins - can access custom databases.
A distributed cache is used to share metadata, plans and statistics (Infinispan in-memory key-value data store).
15. Hive → Drill Migration?
Apache Drill is a good candidate for a fast SQL solution over Hadoop.
When deployed alongside Hive it adds ad-hoc query capabilities.
Can use the Hive Metastore.
Can use Hive UDFs.
16. Hive → Drill
Data types ~ match those in Hive (although DECIMAL is still in alpha)
Analytical functions ~ like in Hive (but still not 100% implemented; e.g. a moving average such as AVG(x) OVER (ORDER BY time ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING))
Support for Hive UDFs (but the JAR needs to be uploaded onto every host)
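The AVG(x) OVER fragment above can be completed into a full query. A sketch, with hypothetical table and column names (pageviews, ts, pv):

```sql
-- 5-row centered moving average of page views over time.
-- Table and column names are assumptions for illustration.
SELECT ts,
       pv,
       AVG(pv) OVER (ORDER BY ts
                     ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS pv_moving_avg
FROM pageviews;
```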
23. Embarrassingly simple performance test...
...just to put some numbers in the presentation ;)
Hadoop cluster:
3 nodes: data node / node manager / Apache Drill
2 nodes: 16 GB RAM, 2 CPUs x 2 cores
1 node: 10 GB RAM, 2 CPUs x 2 cores
+1 node: name node / resource manager / Hive server
24. Hive MR vs. Drill
Wikipedia pageview counts (columns: project, article, page views, bytes):
en A1_road_in_London 1 35107
en A1_steak_sauce 1 13905
en A1_volleyball_league_(Greece) 1 17636
en A1chieve 1 6558
en A2%20road 1 7402
25. Hive schema
create table wiki_pagecounts(
  prj string,
  page string,
  pv int,
  bytes bigint
) partitioned by (ts string)
row format delimited fields terminated by ' ';
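Once the table exists in the Hive Metastore, Drill can query it through its Hive storage plugin with no separate schema definition. A sketch, assuming the plugin is registered under the conventional name hive and that partition values follow a YYYY-MM-DD-HH pattern (both are assumptions, not from the slides):

```sql
-- Query the Hive-managed table from Drill via the Hive storage plugin.
-- Plugin name and partition value are assumptions; adjust to your setup.
SELECT prj, SUM(pv) AS total_pv
FROM hive.wiki_pagecounts
WHERE ts = '2016-01-01-00'
GROUP BY prj
ORDER BY total_pv DESC;
```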
26. Timing: Hive (MR) vs. Drill
Q1 - simple count per partition (group by)
Q2 - top page within hour/language (row_number)
Q3 - mobile share (group by, case stmt)
Q4 - top pages with pct pv (join, group by, row_number)
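As a rough sketch of what Q1 and Q2 might look like against the wiki_pagecounts schema above (these are illustrative reconstructions, not the exact benchmark queries):

```sql
-- Q1: simple count per partition (group by)
SELECT ts, COUNT(*) AS cnt
FROM wiki_pagecounts
GROUP BY ts;

-- Q2: top page within each hour/language (row_number)
SELECT ts, prj, page, pv
FROM (
  SELECT ts, prj, page, pv,
         ROW_NUMBER() OVER (PARTITION BY ts, prj ORDER BY pv DESC) AS rn
  FROM wiki_pagecounts
) ranked
WHERE rn = 1;
```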
27. Integration with YARN?
Currently (Drill 1.5) not supported.
There is a ticket for this: DRILL-142.
It would make deployment much easier and resource management more efficient.
28. Kerberos?
Currently (Drill 1.5) Drill doesn’t support Kerberos when accessing HDFS.
A ticket is open: DRILL-3584.
Without it, it may be challenging to fit Drill into an existing secured Hadoop environment.