NoSQL and SQL Work Side-by-Side
to Tackle Real-time Big Data Needs
Allen Day
MapR Technologies
• Allen Day
– Principal Data Scientist @ MapR
– Human Genomics / Bioinformatics
(PhD, UCLA School of Medicine)
• @allenday
• I’m assuming that the typical attendee:
– is a software developer
– is interested in and familiar with open source
– is familiar with Hadoop, relational DBs
– has heard of or has used some NoSQL technology
Big Data Workloads
• Offline
– Model creation & clustering & indexing
– Web Crawling
– Batch reporting
• Online
– Lightweight OLTP
– Classification & anomaly detection
– Stream processing
– Interactive reporting
What is NoSQL? Why use it?
• Traditional storage (relational DBs) is unable to
accommodate the increasing volume (#) and variety of data
– Culprits: sensors, event logs, electronic payments
• Solution: stay responsive by relaxing ACID storage
– Denormalize (#)
– Loosen schema (variety), loosen consistency
• This is the essence of NoSQL
NoSQL Impact on Business Processes
• Traditional business intelligence (BI) tech stack
assumes relational DB storage
– Company decisions depend on this (reports, charts)
• Data collected in NoSQL stores aren’t in a relational DB
– Data volume/variety is still increasing
– Tech and methods are still in flux
• Decoupled data storage and decision support
– BI can’t access freshest, largest data sets
– Very high opportunity cost to business
Ideal Solution Features
• Scalable & Reliable
– Distributed replicated storage
– Distributed parallel processing
• BI application support
– Ad-hoc, interactive queries
– Real-time responsiveness
• Flexible
– Handles rapid storage and schema evolution
– Handles new analytics methods and functions
[Figure: component stack labeled Hadoop FS; Map/Reduce, YARN; SQL Interface; Extensible for NoSQL, Advanced Analytics]
From Ideals to Possibilities
• Migrate NoSQL data/processing to SQL
– High cost to marshal NoSQL data to SQL storage
– SQL systems lack advanced analytics capabilities
• Migrate SQL data to NoSQL
– Breaks compatibility for BI-dependent functions, e.g.
financial reporting
– Limited support for relational operations (joins)
• high latency
– NoSQL tech is still in flux (continuity)
• Other Approaches?
– Yes. First let’s consider a SQL/NoSQL use case
Interactive Queries & Hadoop
Example Problem: Marketing Campaign
• Jane is an analyst at an
e-commerce company
• How does she figure
out good targeting
segments for the next
marketing campaign?
• She has some ideas…
…and lots of data
Traditional System Solution 1: RDBMS
• ETL the data from
MongoDB and Hadoop
into the RDBMS
– MongoDB data must be
flattened, schematized,
filtered and aggregated
– Hadoop data must be
filtered and aggregated
• Query the data using
any SQL-based tool
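The ETL path on this slide can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the documents, the field names (`customer`, `orders`, `amount`) and the `etl` helper are all hypothetical, and Python's built-in `sqlite3` stands in for the RDBMS.

```python
import sqlite3

# Hypothetical nested documents, shaped like records exported from MongoDB.
docs = [
    {"customer": "jane", "country": "US",
     "orders": [{"sku": "A", "amount": 30.0}, {"sku": "B", "amount": 12.5}]},
    {"customer": "raj", "country": "IN",
     "orders": [{"sku": "A", "amount": 30.0}]},
    {"customer": "li", "country": "US", "orders": []},  # no orders: yields no rows
]

def flatten(doc):
    """Flatten: emit one fixed-shape row per nested order."""
    for order in doc.get("orders", []):
        yield (doc["customer"], doc["country"], order["sku"], order["amount"])

def etl(documents, min_amount=0.0):
    """Flatten, filter, and load rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, country TEXT, sku TEXT, amount REAL)"
    )
    rows = [r for d in documents for r in flatten(d) if r[3] >= min_amount]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    return conn

conn = etl(docs, min_amount=10.0)
# Aggregate with plain SQL, as any SQL-based tool would.
totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Note that the schema is fixed by hand in `CREATE TABLE`, and the customer with no orders disappears from the relational copy, which is the kind of marshaling cost the slide refers to.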
Traditional System Solution 2: Hadoop
• ETL the data from
Oracle and MongoDB
into Hadoop
– MongoDB data must be
flattened and
• Work with the
MapReduce team to
write custom code to
generate the desired
Traditional System Solution 3: Hive
• ETL the data from
Oracle and MongoDB
into Hadoop
– MongoDB data must be
flattened and
• But HiveQL queries are
slow and BI tool
support is limited
– Marshaling/Coding
What Would Google Do?
         File System   NoSQL store   Interactive analysis   Batch processing
Google   GFS           BigTable      Dremel                 MapReduce
Hadoop   HDFS          HBase         ???
Build Apache Drill to provide a true open source
solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard SQL
• Fast
– Low latency queries
– Complement native interfaces and
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
[Figure: query latency spectrum. Interactive queries (data analyst): 100 ms-20 min, the range Apache Drill targets. Data mining and large ETL: 20 min-20 hr.]
How Does It Work?
[Figure: architecture diagram with components labeled SQL Query, Query Planner, Drill Client, Drill ODBC Driver]
• Drillbits run on each node, designed to
maximize data locality
• Processing is done outside MapReduce
paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization,
scheduling, and execution are distributed
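The data-locality idea in these bullets can be illustrated with a small sketch. It is not Drill's actual scheduler: the function `assign_fragments`, the node names and the block map are hypothetical, and the policy (least-loaded node that holds a replica, else least-loaded node overall) is an assumption chosen for simplicity.

```python
from collections import defaultdict

def assign_fragments(block_locations, nodes):
    """Assign each data block to a worker node, preferring nodes with a replica.

    block_locations: dict of block id -> list of nodes holding a replica.
    nodes: list of worker nodes (one hypothetical Drillbit per node).
    """
    load = {n: 0 for n in nodes}
    plan = defaultdict(list)
    for block in sorted(block_locations):
        # Nodes that both run a worker and store this block locally.
        local = [n for n in block_locations[block] if n in load]
        candidates = local or nodes  # fall back to a remote read
        target = min(candidates, key=lambda n: (load[n], n))
        plan[target].append(block)
        load[target] += 1
    return dict(plan)

blocks = {
    "b1": ["node1", "node2"],
    "b2": ["node1", "node3"],
    "b3": ["node2", "node3"],
    "b4": ["node9"],  # replica only on a node that runs no worker
}
plan = assign_fragments(blocks, ["node1", "node2", "node3"])
print(plan)
```

Because the assignment depends only on the block map and the node list, any node could compute the same plan, which loosely mirrors the point that a query can be fed to any Drillbit.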
Apache Drill: Key Features
• Full ANSI SQL:2003 support
– Use any SQL-based tool
• Nested data support
– Flattening is error-prone and often impossible
• Schema-less data source support
– Schema can change rapidly and may be record-specific
• Extensible
– DSLs, UDFs
– Custom operators (e.g. k-means clustering)
– Well-documented data source & file format APIs
How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of
• Faster than Hive on some queries
• SQL-like query language
• Open Source ‘Lite’
• Lacks RDBMS support
• Lacks NoSQL support beyond
• Early row materialization
increases footprint and reduces
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• SQL-like (not SQL)
Many important features are “coming soon”.
Architectural foundation is constrained. No community development.
Drill Status: Alpha Available July
• Heavy active development by multiple organizations
– Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho
• Available
– Logical plan syntax and interpreter
– Reference interpreter
• In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and
various file formats
• Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
• Beta: Q3
Why Apache Drill Will Be Successful
• Contributors have
strong backgrounds
from companies like
Oracle, IBM Netezza,
Informatica, Clustrix
and Pentaho
• Development done in
the open
• Active contributors
from multiple organizations
• Rapidly growing
• Full SQL
• New data support
• Extensible APIs
• Full Columnar
• Beyond Hadoop
Bottom Line: Apache Drill enables NoSQL and SQL
to work side-by-side to tackle real-time Big Data needs
No sql and sql - open analytics summit
Full SQL (ANSI SQL:2003)
• Drill supports SQL (ANSI SQL:2003 standard)
– Correlated subqueries, analytic functions, …
– SQL-like is not enough
• Use any SQL-based tool with Apache Drill
– Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
– Standard ODBC and JDBC drivers
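As an illustration of why "SQL-like is not enough", here is what a correlated subquery looks like in practice. The sketch runs against Python's built-in `sqlite3` purely as a stand-in engine; the `purchases` table and its rows are hypothetical, and nothing here exercises Drill itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", 10.0), ("ann", 50.0), ("bob", 20.0),
     ("bob", 20.0), ("cy", 5.0), ("cy", 95.0)],
)

# Correlated subquery: the inner query refers to the outer row (p.customer),
# so it is evaluated per customer. It keeps purchases that exceed that
# customer's own average purchase amount.
query = """
SELECT p.customer, p.amount
FROM purchases AS p
WHERE p.amount > (
    SELECT AVG(q.amount) FROM purchases AS q WHERE q.customer = p.customer
)
ORDER BY p.customer
"""
above_average = conn.execute(query).fetchall()
print(above_average)
```

BI tools routinely generate queries of this shape, so an engine that accepts only a SQL-like subset tends to break them.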
Nested Data
• Nested data is becoming prevalent
– JSON, BSON, XML, Protocol Buffers, Avro, etc.
– The data source may or may not be aware
• MongoDB supports nested data natively
• A single HBase value could be a JSON document
(compound nested type)
– Google Dremel’s innovation was efficient columnar
storage and querying of nested data
• Flattening nested data is error-prone and often impossible
– Think about repeated and optional fields at every
• Apache Drill supports nested data
– Extensions to ANSI SQL:2003
enum Gender {
  MALE, FEMALE  // symbols assumed; omitted in the slide text
}
record User {
  string name;
  Gender gender;
  long followers;
}

{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
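The claim that flattening is error-prone can be made concrete with the Homer record. The sketch below is only illustrative: the second record (`ned`) and the `flatten` helper are hypothetical, added to show an optional field that is absent.

```python
homer = {
    "name": "Homer",
    "gender": "Male",
    "followers": 100,
    "children": [{"name": "Bart"}, {"name": "Lisa"}],
}
# Hypothetical second record in which the optional "children" field is absent.
ned = {"name": "Ned", "gender": "Male", "followers": 40}

def flatten(user):
    """Naive flattening: one row per child, parent fields copied onto each row."""
    children = user.get("children", [])
    return [(user["name"], user["followers"], c["name"]) for c in children]

rows = flatten(homer) + flatten(ned)

# Pitfall 1: the repeated field duplicates parent values, so sums double count.
flat_total = sum(followers for _, followers, _ in rows)
# Pitfall 2: a record without the optional field produces no rows at all.
flat_names = {name for name, _, _ in rows}

# Working on the nested records directly avoids both problems.
nested_total = sum(u["followers"] for u in (homer, ned))
child_counts = {u["name"]: len(u.get("children", [])) for u in (homer, ned)}
print(flat_total, nested_total, flat_names, child_counts)
```

With several repeated or optional fields at different nesting levels, these two effects compound, which is why querying the nested form directly is attractive.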
Schema is Optional
• Many data sources do not have rigid schemas
– Schemas change rapidly
– Each record may have a different schema, may be sparse/wide
• Apache Drill supports querying against unknown schemas
– Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it
– System of record may already have schema information
– No need to manage schema evolution
Row Key             CF contents                 CF anchor
"com.cnn.www"       contents:html = "<html>…"   = "" = "CNN"
"com.foxnews.www"   contents:html = "<html>…"   = "Fox News"
…                   …                           …
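Letting the system discover a schema can be sketched as a single pass over the records. This is an assumption-laden illustration rather than Drill's mechanism: `discover_schema` is a hypothetical helper, and the sample rows (including the `anchor:a`, `anchor:b` and `hits` columns) are invented, loosely modeled on the column-family table above.

```python
def discover_schema(records):
    """Infer field name -> set of observed type names, plus the optional fields.

    A field is reported as optional when at least one record lacks it.
    """
    types = {}
    counts = {}
    for rec in records:
        for field, value in rec.items():
            types.setdefault(field, set()).add(type(value).__name__)
            counts[field] = counts.get(field, 0) + 1
    optional = {f for f, c in counts.items() if c < len(records)}
    return types, optional

# Hypothetical sparse rows: each record carries a different set of columns.
rows = [
    {"row_key": "com.cnn.www", "contents:html": "<html>...", "anchor:a": "CNN"},
    {"row_key": "com.foxnews.www", "contents:html": "<html>...", "anchor:b": "Fox News"},
    {"row_key": "com.example.www", "hits": 3},
]
schema, optional = discover_schema(rows)
print(schema)
print(sorted(optional))
```

Only `row_key` appears in every record here; everything else is sparse, so a rigid up-front schema would be either very wide or constantly out of date.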
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
– Implement a custom scanner to support a new source/format
• Query languages
– SQL:2003 is the primary language
– Implement a custom Parser to support a Domain Specific Language
– UDFs
• Optimizers
– Drill will have a cost-based optimizer
– Clear surrounding APIs support easy optimizer exploration
• Operators
– Custom operators can be implemented (e.g. k-Means clustering)
– Operator push-down to data source (RDBMS)
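The extensibility points listed above can be pictured as a plug-in registry. The sketch below is not Drill's API: `SCANNERS`, the `scanner` decorator and `scan` are hypothetical names, used only to show how a new source or file format could be added without touching the engine.

```python
import csv
import io
import json

SCANNERS = {}  # format name -> scanner function (hypothetical registry)

def scanner(fmt):
    """Decorator that registers a scanner function for a file format."""
    def register(fn):
        SCANNERS[fmt] = fn
        return fn
    return register

@scanner("json")
def scan_json_lines(text):
    """Yield one record (dict) per non-empty line of JSON."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

@scanner("csv")
def scan_csv(text):
    """Yield one record per CSV row, using the header row for field names."""
    yield from csv.DictReader(io.StringIO(text))

def scan(fmt, text):
    """Dispatch to the registered scanner for the given format."""
    if fmt not in SCANNERS:
        raise ValueError(f"no scanner registered for format {fmt!r}")
    return list(SCANNERS[fmt](text))

json_rows = scan("json", '{"name": "Bart"}\n{"name": "Lisa"}\n')
csv_rows = scan("csv", "name,age\nBart,10\n")
print(json_rows, csv_rows)
```

Supporting another format in this sketch means writing one decorated function; the dispatch code in `scan` stays unchanged.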

Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final


NoSQL and SQL - Open Analytics Summit

  • 1. NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs Allen Day MapR Technologies
  • 2. Me • Allen Day – Principal Data Scientist @ MapR – Human Genomics / Bioinformatics (PhD, UCLA School of Medicine) • @allenday
  • 3. You • I’m assuming that the typical attendee: – is a software developer – is interested and familiar with open source – is familiar with Hadoop, relational DBs – has heard of or has used some NoSQL technology
  • 4. Big Data Workloads • Offline – ETL – Model creation & clustering & indexing – Web Crawling – Batch reporting • Online – Lightweight OLTP – Classification & anomaly detection – Stream processing – Interactive reporting SQL
  • 5. What is NoSQL? Why use it? • Traditional storage (relational DBs) is unable to accommodate the increasing # and variety of observations – Culprits: sensors, event logs, electronic payments • Solution: stay responsive by relaxing ACID storage requirements – Denormalize (#) – Loosen schema (variety), loosen consistency • This is the essence of NoSQL
  • 6. NoSQL Impact on Business Processes • Traditional business intelligence (BI) tech stack assumes relational DB storage – Company decisions depend on this (reports, charts) • NoSQL collected data aren’t in relational DB – Data volume/variety is still increasing – Tech and methods are still in flux • Decoupled data storage and decision support systems – BI can’t access freshest, largest data sets – Very high opportunity cost to business
  • 7. Ideal Solution Features • Scalable & Reliable – Distributed replicated storage – Distributed parallel processing • BI application support – Ad-hoc, interactive queries – Real-time responsiveness • Flexible – Handles rapid storage and schema evolution – Handles new analytics methods and functions Hadoop FS Map/Reduce, YARN{ SQL Interface{ Extensible for NoSQL, Advanced Analytics{
  • 8. From Ideals to Possibilities • Migrate NoSQL data/processing to SQL – High cost to marshal NoSQL data to SQL storage – SQL systems lack advanced analytics capabilities • Migrate SQL data to NoSQL – Breaks compatibility for BI-dependent functions, e.g. financial reporting – Limited support for relational operations (joins) • high latency – NoSQL tech is still in flux (continuity) • Other Approaches? – Yes. First let’s consider a SQL/NoSQL use case
  • 9. Impala Interactive Queries & Hadoop low-latency
  • 10. Example Problem: Marketing Campaign • Jane is an analyst at an e-commerce company • How does she figure out good targeting segments for the next marketing campaign? • She has some ideas… …and lots of data User profiles Transaction information Access logs
  • 11. Traditional System Solution 1: RDBMS • ETL the data from MongoDB and Hadoop into the RDBMS – MongoDB data must be flattened, schematized, filtered and aggregated – Hadoop data must be filtered and aggregated • Query the data using any SQL-based tool User profiles Access logs Transaction information
  • 12. Traditional System Solution 2: Hadoop • ETL the data from Oracle and MongoDB into Hadoop – MongoDB data must be flattened and schematized • Work with the MapReduce team to write custom code to generate the desired analyses User profiles Access logs Transaction information
  • 13. Traditional System Solution 3: Hive • ETL the data from Oracle and MongoDB into Hadoop – MongoDB data must be flattened and schematized • But HiveQL queries are slow and BI tool support is limited – Marshaling/Coding User profiles Access logs Transaction information
  • 14. What Would Google Do? • Distributed file system: GFS (Google), HDFS (open source) • NoSQL: BigTable (Google), HBase (open source) • Interactive analysis: Dremel (Google), ??? (open source) • Batch processing: MapReduce (Google), Hadoop MapReduce (open source) • Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 15. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Complement native interfaces and MapReduce/Hive/Pig • Open – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL • Workload spectrum: interactive queries, data analyst, reporting (100 ms-20 min): Apache Drill; data mining, modeling, large ETL (20 min-20 hr): MapReduce, Hive, Pig
  • 16. How Does It Work? • Diagram labels: Tableau, MicroStrategy, Crystal Reports, Drill ODBC Driver, Driver, Drill Client, Drillbit (Coordinator) with SQL Query Parser and Query Planner, Drillbit (Executor) × 3 • Example query: SELECT * FROM oracle.transactions, mongo.users LIMIT 1
  • 17. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce paradigm (but possibly within YARN) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed • Example query: SELECT * FROM oracle.transactions, mongo.users LIMIT 1
  • 18. Apache Drill: Key Features • Full ANSI SQL:2003 support – Use any SQL-based tool • Nested data support – Flattening is error-prone and often impossible • Schema-less data source support – Schema can change rapidly and may be record-specific • Extensible – DSLs, UDFs – Custom operators (e.g. k-means clustering) – Well-documented data source & file format APIs
  • 19. How Does Impala Fit In? Impala Strengths • Beta currently available • Easy install and setup on top of Cloudera • Faster than Hive on some queries • SQL-like query language Questions • Open Source ‘Lite’ • Lacks RDBMS support • Lacks NoSQL support beyond HBase • Early row materialization increases footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • 20. Drill Status: Alpha Available July • Heavy active development by multiple organizations – Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe • Beta: Q3
  • 21. Why Apache Drill Will Be Successful Resources • Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho Community • Development done in the open • Active contributors from multiple companies • Rapidly growing Architecture • Full SQL • New data support • Extensible APIs • Full Columnar Execution • Beyond Hadoop Bottom Line: Apache Drill enables NoSQL and SQL to Work Side-by-Side to Tackle Real-time Big Data Needs
  • 22. Me • Allen Day – Principal Data Scientist @ MapR • @allenday
  • 25. Full SQL (ANSI SQL:2003) • Drill supports SQL (ANSI SQL:2003 standard) – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers • Diagram labels: Drill Worker, Drill Worker, Driver, Client, Drillbit, SQL Query Parser, Query Planner, Drillbits, Drill ODBC Driver, Tableau, MicroStrategy, Excel, SAP Crystal Reports
  • 26. Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, etc. – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel’s innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 • Avro example: enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } • JSON example: { "name": "Homer", "gender": "Male", "followers": 100, "children": [ {"name": "Bart"}, {"name": "Lisa"} ] }
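To make the "flattening is error-prone" point on slide 26 concrete, here is a minimal Python sketch. It is not Drill code and not the speaker's method; the `flatten` helper and the `child_name` column are hypothetical names chosen for illustration. It takes the slide's JSON record (with its syntax repaired so that it parses) and flattens it the naive way, one output row per child, which copies the parent's fields onto every row.

```python
import json

# The slide's JSON example, repaired so it parses (quoted keys, missing comma added).
record = json.loads("""
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [{"name": "Bart"}, {"name": "Lisa"}]
}
""")


def flatten(user):
    """Naive flattening (hypothetical helper): one row per child,
    with the parent's fields copied onto each row."""
    # Users with no children still need one row, so substitute a placeholder.
    children = user.get("children") or [None]
    return [
        {
            "name": user["name"],
            "gender": user["gender"],
            "followers": user["followers"],
            "child_name": child["name"] if child else None,
        }
        for child in children
    ]


rows = flatten(record)

# Summing a parent-level field over the flat rows counts Homer once per child.
naive_total = sum(row["followers"] for row in rows)
print(len(rows), naive_total)
```

The repeated field turns one user into two rows, so a plain sum over `followers` reports 200 instead of 100. The optional field needs a placeholder row so that childless users do not disappear. With repeated and optional fields at several levels these workarounds compound, which is the situation the slide describes.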
  • 27. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema, may be sparse/wide • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information – No need to manage schema evolution • Example table (Row Key | CF contents | CF anchor): "com.cnn.www" | contents:html = "<html>…" | = "", = "CNN"; "com.foxnews.www" | contents:html = "<html>…" | = "Fox News"; … | … | …
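As a rough illustration of what "let the system discover it automatically" on slide 27 could mean, the Python sketch below infers a column set from records that do not share a schema and then projects columns over them. This is an assumption-laden toy, not how Drill is implemented: `discover_schema`, `project`, the `anchor:site-a.example` and `anchor:site-b.example` qualifiers, and the `hits` column are all hypothetical, since the slide's table elides the real anchor qualifiers.

```python
def discover_schema(records):
    """Return {column: sorted type names seen} across all records (hypothetical helper)."""
    schema = {}
    for rec in records:
        for col, val in rec.items():
            schema.setdefault(col, set()).add(type(val).__name__)
    return {col: sorted(types) for col, types in schema.items()}


def project(records, columns):
    """Select columns from every record, yielding None where a sparse row lacks one."""
    return [tuple(rec.get(col) for col in columns) for rec in records]


# Two sparse, wide rows in the spirit of the slide's table.
# The anchor qualifiers and the "hits" column are made up for illustration.
records = [
    {
        "row_key": "com.cnn.www",
        "contents:html": "<html>...",
        "anchor:site-a.example": "CNN",
    },
    {
        "row_key": "com.foxnews.www",
        "contents:html": "<html>...",
        "anchor:site-b.example": "Fox News",
        "hits": 3,
    },
]

schema = discover_schema(records)
result = project(records, ["row_key", "hits"])
print(schema)
print(result)
```

Each record contributes whatever columns it has, and a column missing from a row is read as `None` rather than treated as an error. That keeps sparse rows queryable without declaring a schema up front, and a new column appearing in later records needs no migration step.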
  • 28. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new source/format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented (e.g. k-Means clustering) – Operator push-down to data source (RDBMS)

Editor's Notes

  1. Emphasize previous experience in my applied domain BFX, difficulty of processing queries effectively (stratified experiments of high-dimensional genomic data).
  2. I’m assuming that the typical attendee of this talk is a software developer familiar with and interested in open source technologies, is already familiar with Hadoop and relational databases, and has heard of or may have some hands-on experience working with some NoSQL technologies.
  3. Note correspondences between offline operation and its online counterpart
  4. Call detail records, as we’ve been hearing about in the news around PRISM recently
  5. Hive: compile to MR; Aster: external tables in MPP; Oracle/MySQL: export MR results to RDBMS; Drill, Impala, CitusDB: real-time