Apache’s Answer to Low Latency
Interactive Query for Big Data
February 13, 2013
Agenda
• Apache Drill overview
• Key features
• Status and progress
• Discuss potential use cases and cooperation
Big Data Workloads
• ETL
• Data mining
• Blob store
• Lightweight OLTP on large datasets
• Index and model generation
• Web crawling
• Stream processing
• Clustering, anomaly detection and classification
• Interactive analysis
Interactive Queries and Hadoop
Compile SQL to MapReduce
SQL based analytics
Impala
Real-time interactive queries
Real-time interactive queries
Emerging Technologies
Common Solutions
Export MapReduce results to RDBMS and query the RDBMS
External tables in an MPP database

Example Problem
• Jane works as an analyst at an e-commerce company
• How does she figure out good targeting segments for the next marketing campaign?
• She has some ideas and lots of data
– User profiles
– Transaction information
– Access logs
Solving the Problem with Traditional Systems
• Use an RDBMS
– ETL the data from MongoDB and Hadoop into the RDBMS
• MongoDB data must be flattened, schematized, filtered and aggregated
• Hadoop data must be filtered and aggregated
– Query the data using any SQL-based tool
• Use MapReduce
– ETL the data from Oracle and MongoDB into Hadoop
– Work with the MapReduce team to generate the desired analyses
• Use Hive
– ETL the data from Oracle and MongoDB into Hadoop
• MongoDB data must be flattened and schematized
– But HiveQL is limited, queries take too long and BI tool support is limited
WWGD
Distributed File System | NoSQL | Interactive analysis | Batch processing
GFS | BigTable | Dremel | MapReduce
HDFS | HBase | ??? | Hadoop MapReduce
Build Apache Drill to provide a true open source
solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard SQL
• Fast
– Low latency queries
– Columnar execution
• Inspired by Google Dremel/BigQuery
– Complement native interfaces and MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested/hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
Interactive queries (data analyst, reporting): 100 ms-20 min, Apache Drill
Data mining, modeling, large ETL: 20 min-20 hr, MapReduce, Hive, Pig

How Does It Work?
• Drillbits run on each node, designed to maximize data locality
• Processing is done outside MapReduce paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization, scheduling, and execution are distributed
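The two ideas on this slide, that any node can accept a query and that work is placed next to the data, can be illustrated with a toy planner. This is a minimal sketch and not Drill's actual scheduler: the block map, the node names and the `plan_scan` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical block map: which nodes hold a replica of each data block.
BLOCK_LOCATIONS = {
    "block-0": ["node-a", "node-b"],
    "block-1": ["node-b", "node-c"],
    "block-2": ["node-c", "node-a"],
    "block-3": ["node-a", "node-c"],
}


def plan_scan(block_locations, entry_node):
    """Assign each block to the least-loaded node that holds a local replica.

    entry_node is whichever node received the query. Any node can play this
    role; it only drives planning for this one query.
    """
    load = defaultdict(int)
    assignment = {}
    for block in sorted(block_locations):
        replicas = block_locations[block]
        # Prefer the replica holder with the fewest fragments so far;
        # ties are broken by node name to keep the plan deterministic.
        node = min(replicas, key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return {"entry_node": entry_node, "assignment": assignment}


plan = plan_scan(BLOCK_LOCATIONS, entry_node="node-b")
for block, node in plan["assignment"].items():
    print(block, "->", node)
```

Every block is read by a node that already stores it, and the resulting assignment does not depend on which node received the query.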
Key Features
• Full SQL (ANSI SQL:2003)
• Nested data
• Schema is optional
• Flexible and extensible architecture
Full SQL (ANSI SQL:2003)
• Drill supports standard ANSI SQL:2003
– Correlated subqueries, analytic functions, …
– SQL-like is not enough
• Use any SQL-based tool with Apache Drill
– Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
– Standard ODBC and JDBC drivers
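To make "correlated subqueries" concrete, here is a small example of one. It runs against SQLite only because SQLite ships with Python; it says nothing about Drill's own behavior, and the `orders` table and its rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('jane', 10), ('jane', 50), ('bob', 20), ('bob', 5);
    """
)

# Correlated subquery: the inner query refers to o.customer from the outer
# query, so it is evaluated per outer row. It keeps each customer's largest order.
rows = conn.execute(
    """
    SELECT o.customer, o.amount
    FROM orders AS o
    WHERE o.amount = (
        SELECT MAX(i.amount) FROM orders AS i WHERE i.customer = o.customer
    )
    ORDER BY o.customer
    """
).fetchall()

print(rows)
```

Engines that are only "SQL-like" often reject this kind of query, which is the point of the "SQL-like is not enough" bullet.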
Nested Data
• Nested data is becoming prevalent
– JSON, BSON, XML, Protocol Buffers, Avro, etc.
– The data source may or may not be aware
• MongoDB supports nested data natively
• A single HBase value could be a JSON document (compound nested type)
– Google Dremel’s innovation was efficient columnar storage and querying of nested data
• Flattening nested data is error-prone and often impossible
– Think about repeated and optional fields at every level…
• Apache Drill supports nested data
– Extensions to ANSI SQL:2003
Avro
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
JSON
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
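The claim that flattening is error-prone can be seen with the Homer record itself. The sketch below uses a deliberately naive, hypothetical `flatten` helper (not anything from Drill) that produces one relational row per child, and shows two ways the result misleads.

```python
import json

record = json.loads(
    """
    {
      "name": "Homer",
      "gender": "Male",
      "followers": 100,
      "children": [{"name": "Bart"}, {"name": "Lisa"}]
    }
    """
)


def flatten(rec):
    """Naively flatten one record into relational rows, one per child."""
    base = {k: v for k, v in rec.items() if k != "children"}
    children = rec.get("children", [])
    if not children:
        # A missing list and an empty list collapse to the same row.
        return [dict(base, child_name=None)]
    return [dict(base, child_name=c["name"]) for c in children]


rows = flatten(record)

# Problem 1: parent fields are repeated per child, so aggregates double count.
total_followers = sum(r["followers"] for r in rows)
print(len(rows), total_followers)

# Problem 2: "no children field" and "empty children list" become identical.
print(flatten({"name": "Ned"}) == flatten({"name": "Ned", "children": []}))
```

With a second repeated field at another level the row count multiplies again, which is why the slide calls flattening "often impossible" in practice.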

Schema is Optional
• Many data sources do not have rigid schemas
– Schemas change rapidly
– Each record may have a different schema
• Sparse and wide rows in HBase and Cassandra, MongoDB
• Apache Drill supports querying against unknown schemas
– Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it automatically
– System of record may already have schema information
• Why manage it in a separate system?
– No need to manage schema evolution
Row Key | CF contents | CF anchor
"com.cnn.www" | contents:html = "<html>…" | = "" = "CNN"
"com.foxnews.www" | contents:html = "<html>…" | = "Fox News"
… | … | …
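One way to picture "let the system discover it automatically" is to scan the records and take the union of the fields that appear. This is a minimal sketch under that assumption, not Drill's mechanism; `discover_schema` is a hypothetical helper, and the `anchor:example-*` column names are placeholders because the slide does not show the real qualifiers.

```python
def discover_schema(records):
    """Return {field: {"types": set of type names, "optional": bool}}."""
    seen = {}
    for rec in records:
        for field, value in rec.items():
            info = seen.setdefault(field, {"types": set(), "count": 0})
            info["types"].add(type(value).__name__)
            info["count"] += 1
    total = len(records)
    return {
        field: {"types": info["types"], "optional": info["count"] < total}
        for field, info in seen.items()
    }


# Sparse, wide rows: each row carries a different set of columns.
rows = [
    {"row_key": "com.cnn.www", "contents:html": "<html>...", "anchor:example-1": "CNN"},
    {"row_key": "com.foxnews.www", "contents:html": "<html>...", "anchor:example-2": "Fox News"},
]

schema = discover_schema(rows)
for field in sorted(schema):
    print(field, schema[field])
```

Columns present in every row come out as required, while the per-row anchor columns come out as optional, without anyone declaring a schema up front.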
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
– Implement a custom scanner to support a new data source or file format
• Query languages
– SQL:2003 is the primary language
– Implement a custom Parser to support a Domain Specific Language
– UDFs and UDTFs
• Optimizers
– Drill will have a cost-based optimizer
– Clear surrounding APIs support easy optimizer exploration
• Operators
– Custom operators can be implemented
• Special operators for Mahout (k-means) being designed
– Operator push-down to data source (RDBMS)
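The "implement a custom scanner" extension point can be sketched as a registry that maps a format name to a function yielding records. Everything here is hypothetical, including `register_scanner`, `SCANNERS` and the format names; the slides do not describe Drill's real plug-in API.

```python
import csv
import io
import json

# Hypothetical registry: format name -> scanner function.
SCANNERS = {}


def register_scanner(fmt):
    """Decorator that registers a scanner for the given format name."""
    def wrap(fn):
        SCANNERS[fmt] = fn
        return fn
    return wrap


@register_scanner("json-lines")
def scan_json_lines(text):
    # One JSON document per non-empty line.
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


@register_scanner("csv")
def scan_csv(text):
    # The first line is the header; every value is read as a string.
    yield from csv.DictReader(io.StringIO(text))


def scan(fmt, text):
    """Look up the scanner for a format and return its record iterator."""
    return SCANNERS[fmt](text)


print(list(scan("json-lines", '{"name": "Homer"}\n{"name": "Bart"}\n')))
print(list(scan("csv", "name,followers\nHomer,100\n")))
```

Supporting a new file format then means adding one registered function, while the rest of the engine only sees an iterator of records.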
How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of Cloudera
• Faster than Hive on some queries
• SQL-like query language
Questions
• Open Source ‘Lite’
• Doesn’t support RDBMS or other NoSQLs (beyond Hadoop/HBase)
• Early row materialization increases footprint and reduces performance
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• Compound APIs restrict optimizer progression
• SQL-like (not SQL)
Many important features are “coming soon”. Architectural foundation is constrained. No community development.
Status: In Progress
• Heavy active development by multiple organizations
• Available
– Logical plan syntax and interpreter
– Reference interpreter
• In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
• Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
– OpenDremel team joined Apache Drill
• Anticipated schedule:
– Prototype: Q1
– Alpha: Q2
– Beta: Q3

Why Apache Drill Will Be Successful
Resources
• Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
Community
• Development done in the open
• Active contributors from multiple companies
• Rapidly growing
Architecture
• Full SQL
• New data support
• Extensible APIs
• Full Columnar Execution
• Beyond Hadoop
Questions?
• What problems can Drill solve for you?
• Where does it fit in the organization?
• Which data sources and BI tools are important
to you?
Let’s Talk!
• @ted_dunning @ApacheDrill
• Slides at
• See also

Why Not Leverage MapReduce?
• Scheduling Model
– Coarse resource model reduces hardware utilization
– Acquisition of resources typically takes hundreds of milliseconds to seconds
• Barriers
– Map completion required before shuffle/reduce commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the next one can start
• Persistence and Recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between execution phases
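The cost of barriers is easiest to see on a query that needs only a few rows. The sketch below contrasts a staged plan, where each step fully materializes its output before the next one starts, with a pipelined plan built from Python generators. It is an analogy for the execution models and not code from MapReduce or Drill; all function names are hypothetical.

```python
def scan(rows, log):
    """Yield rows one at a time, recording each row that is actually read."""
    for row in rows:
        log.append(row)
        yield row


def keep_even(rows):
    for row in rows:
        if row % 2 == 0:
            yield row


def limit(rows, n):
    """Stop pulling from upstream as soon as n rows have been produced."""
    if n <= 0:
        return
    for count, row in enumerate(rows, start=1):
        yield row
        if count >= n:
            return


data = list(range(10))

# Barrier style: every stage finishes completely before the next one begins.
barrier_log = []
stage1 = list(scan(data, barrier_log))
stage2 = list(keep_even(stage1))
barrier_result = list(limit(stage2, 1))

# Pipelined style: rows flow through all operators, and LIMIT ends the scan early.
pipelined_log = []
pipelined_result = list(limit(keep_even(scan(data, pipelined_log)), 1))

print(barrier_result, len(barrier_log))
print(pipelined_result, len(pipelined_log))
```

Both plans return the same answer, but the staged plan reads all ten input rows while the pipelined plan stops after the first match.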

  • 1. Apache’s Answer to Low Latency Interactive Query for Big Data February 13, 2013
  • 2. Agenda • Apache Drill overview • Key features • Status and progress • Discuss potential use cases and cooperation
  • 3. Big Data Workloads • ETL • Data mining • Blob store • Lightweight OLTP on large datasets • Index and model generation • Web crawling • Stream processing • Clustering, anomaly detection and classification • Interactive analysis
  • 4. Interactive Queries and Hadoop Compile SQL to MapReduce SQL based analytics Impala Real-time interactive queries Real-time interactive queries Emerging Technologies Common Solutions Export MapReduce results to RDBMS and query the RDBMS External tables in an MPP database
  • 5. Example Problem • Jane works as an analyst at an e- commerce company • How does she figure out good targeting segments for the next marketing campaign? • She has some ideas and lots of data User profiles Transaction information Access logs
  • 6. Solving the Problem with Traditional Systems • Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool • Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses • Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
  • 7. WWGD Distributed File System NoSQL Interactive analysis Batch processing GFS BigTable Dremel MapReduce HDFS HBase ??? Hadoop MapReduce Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 8. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery – Complement native interfaces and MapReduce/Hive/Pig • Open – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested/hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL Interactive queries Data analyst Reporting 100 ms-20 min Data mining Modeling Large ETL 20 min-20 hr MapReduce Hive Pig Apache Drill
  • 9. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce paradigm (but possibly within YARN) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed SELECT * FROM oracle.transactions, mongo.users, LIMIT 1
  • 10. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • 11. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers Drill Worker Drill Worker Driver Client Drillbit SQL Query Parser Query Planner Drillbits Drill ODBC Driver Tableau MicroStrategy Excel SAP Crystal Reports
  • 12. Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, etc. – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel’s innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } { "name": "Homer", "gender": "Male", "followers": 100, "children": [ {"name": "Bart"}, {"name": "Lisa"} ] } JSON Avro
  • 13. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution Row Key CF contents CF anchor "com.cnn.www" contents:html = "<html>…" = "" = "CNN" "com.foxnews.www" contents:html = "<html>…" = "Fox News" … … …
  • 14. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • 15. How Does Impala Fit In? Impala Strengths • Beta currently available • Easy install and setup on top of Cloudera • Faster than Hive on some queries • SQL-like query language Questions • Open Source ‘Lite’ • Doesn’t support RDBMS or other NoSQLs (beyond Hadoop/HBase) • Early row materialization increases footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • 16. Status: In Progress • Heavy active development by multiple organizations • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • 17. Why Apache Drill Will Be Successful Resources • Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho Community • Development done in the open • Active contributors from multiple companies • Rapidly growing Architecture • Full SQL • New data support • Extensible APIs • Full Columnar Execution • Beyond Hadoop
  • 18. Questions? • What problems can Drill solve for you? • Where does it fit in the organization? • Which data sources and BI tools are important to you?
  • 19. Let’s Talk! • @ted_dunning @ApacheDrill • Slides at • See also resources/drill
  • 21. Why Not Leverage MapReduce? • Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes 100’s of millis to seconds • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase

