SlideShare a Scribd company logo
Apache’s Answer to Low Latency
Interactive Query for Big Data
February 13, 2013
Agenda
• Apache Drill overview
• Key features
• Status and progress
• Discuss potential use cases and cooperation
Big Data Workloads
• ETL
• Data mining
• Blob store
• Lightweight OLTP on large datasets
• Index and model generation
• Web crawling
• Stream processing
• Clustering, anomaly detection and classification
• Interactive analysis
Interactive Queries and Hadoop
Compile SQL to
MapReduce
SQL based
analytics
Impala
Real-time
interactive queries
Real-time
interactive queries
Emerging Technologies
Common Solutions
Export MapReduce
results to RDBMS and
query the RDBMS
External tables in an
MPP database

Recommended for you

Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake

Timely data in a data warehouse is a challenge many of us face, often with there being no straightforward solution. Using a combination of batch and streaming data pipelines you can leverage the Delta Lake format to provide an enterprise data warehouse at a near real-time frequency. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupling this with structured streaming, you can achieve a low latency data warehouse. In this talk, we’ll talk about how to use Delta Lake to improve the latency of ingestion and storage of your data warehouse tables. We’ll also talk about how you can use spark streaming to build the aggregations and tables that drive your data warehouse.

*apache spark

 *big data

 *
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction

The document discusses compaction in RocksDB, an embedded key-value storage engine. It describes the two compaction styles in RocksDB: level style compaction and universal style compaction. Level style compaction stores data in multiple levels and performs compactions by merging files from lower to higher levels. Universal style compaction keeps all files in level 0 and performs compactions by merging adjacent files in time order. The document provides details on the compaction process and configuration options for both styles.

databaseflashsoftware
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse

HBase as the NoSQL database of choice in the Hadoop ecosystem has already been proven itself in scale and in many mission critical workloads in hundreds of companies. Phoenix as the SQL layer on top of HBase, has been increasingly becoming the tool of choice as the perfect complementary for HBase. Phoenix is now being used more and more for super low latency querying and fast analytics across a large number of users in production deployments. In this talk, we will cover what makes Phoenix attractive among current and prospective HBase users, like SQL support, JDBC, data modeling, secondary indexing, UDFs, and also go over recent improvements like Query Server, ODBC drivers, ACID transactions, Spark integration, etc. We will conclude by looking into items in the pipeline and how Phoenix and HBase interacts with other engines like Hive and Spark.

phoenixapache phoenixhbase
Example Problem
• Jane works as an
analyst at an e-
commerce company
• How does she figure
out good targeting
segments for the next
marketing campaign?
• She has some ideas
and lots of data
User
profiles
Transaction
information
Access
logs
Solving the Problem with Traditional Systems
• Use an RDBMS
– ETL the data from MongoDB and Hadoop into the RDBMS
• MongoDB data must be flattened, schematized, filtered and aggregated
• Hadoop data must be filtered and aggregated
– Query the data using any SQL-based tool
• Use MapReduce
– ETL the data from Oracle and MongoDB into Hadoop
– Work with the MapReduce team to generate the desired analyses
• Use Hive
– ETL the data from Oracle and MongoDB into Hadoop
• MongoDB data must be flattened and schematized
– But HiveQL is limited, queries take too long and BI tool support is
limited
WWGD
Distributed
File System
NoSQL
Interactive
analysis
Batch
processing
GFS BigTable Dremel MapReduce
HDFS HBase ???
Hadoop
MapReduce
Build Apache Drill to provide a true open source
solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard SQL
• Fast
– Low latency queries
– Columnar execution
• Inspired by Google Dremel/BigQuery
– Complement native interfaces and
MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested/hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
Interactive queries
Data analyst
Reporting
100 ms-20 min
Data mining
Modeling
Large ETL
20 min-20 hr
MapReduce
Hive
Pig
Apache Drill

Recommended for you

Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...

Talk given at RedisConf 17 on June 1, 2017 by Daniel Hochman. A video will be published by the conference organizers. Abstract: Built-in GEO commands in Redis provide a solid foundation for location-based applications. The scale of Lyft requires a completely different approach to the problem. Learn how to push beyond your constraints to build a highly available, high throughput, horizontally scalable Redis architecture. The techniques presented in this case study are broadly applicable to scaling any type of application powered by Redis. The talk will cover data modeling, open-source solutions, reliability engineering, and Lyft platform.

databasemonitoringenvoy
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB

MongoDB is an open-source, document-oriented database that provides high performance and horizontal scalability. It uses a document-model where data is organized in flexible, JSON-like documents rather than rigidly defined rows and tables. Documents can contain multiple types of nested objects and arrays. MongoDB is best suited for applications that need to store large amounts of unstructured or semi-structured data and benefit from horizontal scalability and high performance.

mongodbnosqldatabase
How Does It Work?
• Drillbits run on each node, designed to
maximize data locality
• Processing is done outside MapReduce
paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization,
scheduling, and execution are distributed
SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1
Key Features
• Full SQL (ANSI SQL:2003)
• Nested data
• Schema is optional
• Flexible and extensible architecture
Full SQL (ANSI SQL:2003)
• Drill supports standard ANSI SQL:2003
– Correlated subqueries, analytic functions, …
– SQL-like is not enough
• Use any SQL-based tool with Apache Drill
– Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
– Standard ODBC and JDBC drivers
Drill%Worker
Drill%Worker
Driver
Client
Drillbit
SQL%Query%
Parser
Query%
Planner
Drillbits
Drill%ODBC%
Driver
Tableau
MicroStrategy
Excel
SAP%Crystal%
Reports
Nested Data
• Nested data is becoming prevalent
– JSON, BSON, XML, Protocol Buffers, Avro, etc.
– The data source may or may not be aware
• MongoDB supports nested data natively
• A single HBase value could be a JSON document
(compound nested type)
– Google Dremel’s innovation was efficient columnar
storage and querying of nested data
• Flattening nested data is error-prone and often
impossible
– Think about repeated and optional fields at every
level…
• Apache Drill supports nested data
– Extensions to ANSI SQL:2003
enum Gender {
MALE, FEMALE
}
record User {
string name;
Gender gender;
long followers;
}
{
"name": "Homer",
"gender": "Male",
"followers": 100
children: [
{name: "Bart"},
{name: "Lisa”}
]
}
JSON
Avro

Recommended for you

From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.

This presentation: * covers basics of caching and popular cache types * explains evolution from simple cache to distributed, and from distributed to IMDG * not describes usage of NoSQL solutions for caching * is not intended for products comparison or for promotion of Hazelcast as the best solution

replicated cachecachingin-memory data grid
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive

Apache Hive is a rapidly evolving project which continues to enjoy great adoption in big data ecosystem. Although, Hive started primarily as batch ingestion and reporting tool, community is hard at work in improving it along many different dimensions and use cases. This talk will provide an overview of latest and greatest features and optimizations which have landed in project over last year. Materialized view, micro managed tables and workload management are some noteworthy features. I will deep dive into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel but have existed for years in other DB systems, implementing them on Hive poses some unique challenges and results in lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in near future. Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks

data warehousing and operational data storesapache hiveashutosh chauhan
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow

This document discusses Apache Arrow, an open-source library that enables fast and efficient data interchange and processing. It summarizes the growth of Arrow and its ecosystem, including new features like the Arrow C++ query engine and Arrow Rust DataFusion. It also highlights how enterprises are using Arrow to solve challenges around data interoperability, access speed, query performance, and embeddable analytics. Case studies describe how companies like Microsoft, Google Cloud, Snowflake, and Meta leverage Arrow in their products and platforms. The presenter promotes Voltron Data's enterprise subscription and upcoming conference to support business use of Apache Arrow.

apache arrowbig dataanalytics
Schema is Optional
• Many data sources do not have rigid schemas
– Schemas change rapidly
– Each record may have a different schema
• Sparse and wide rows in HBase and Cassandra, MongoDB
• Apache Drill supports querying against unknown schemas
– Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it automatically
– System of record may already have schema information
• Why manage it in a separate system?
– No need to manage schema evolution
Row Key CF contents CF anchor
"com.cnn.www" contents:html = "<html>…" anchor:my.look.ca = "CNN.com"
anchor:cnnsi.com = "CNN"
"com.foxnews.www" contents:html = "<html>…" anchor:en.wikipedia.org = "Fox News"
… … …
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
– Implement a custom scanner to support a new data source or file format
• Query languages
– SQL:2003 is the primary language
– Implement a custom Parser to support a Domain Specific Language
– UDFs and UDTFs
• Optimizers
– Drill will have a cost-based optimizer
– Clear surrounding APIs support easy optimizer exploration
• Operators
– Custom operators can be implemented
• Special operators for Mahout (k-means) being designed
– Operator push-down to data source (RDBMS)
How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of
Cloudera
• Faster than Hive on some queries
• SQL-like query language
Questions
• Open Source ‘Lite’
• Doesn’t support RDBMS or other
NoSQLs (beyond Hadoop/HBase)
• Early row materialization increases
footprint and reduces performance
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• Compound APIs restrict optimizer
progression
• SQL-like (not SQL)
Many important features are “coming soon”. Architectural foundation is constrained. No
community development.
Status: In Progress
• Heavy active development by multiple organizations
• Available
– Logical plan syntax and interpreter
– Reference interpreter
• In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
• Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
– OpenDremel team joined Apache Drill
• Anticipated schedule:
– Prototype: Q1
– Alpha: Q2
– Beta: Q3

Recommended for you

Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx

This document provides an overview and introduction to MongoDB. It discusses how new types of applications, data, volumes, development methods and architectures necessitated new database technologies like NoSQL. It then defines MongoDB and describes its features, including using documents to store data, dynamic schemas, querying capabilities, indexing, auto-sharding for scalability, replication for availability, and using memory for performance. Use cases are presented for companies like Foursquare and Craigslist that have migrated large volumes of data and traffic to MongoDB to gain benefits like flexibility, scalability, availability and ease of use over traditional relational database systems.

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison

Igor Anishchenko Odessa Java TechTalks Lohika - May, 2012 Let's take a step back and compare data serialization formats, of which there are plenty. What are the key differences between Apache Thrift, Google Protocol Buffers and Apache Avro. Which is "The Best"? Truth of the matter is, they are all very good and each has its own strong points. Hence, the answer is as much of a personal choice, as well as understanding of the historical context for each, and correctly identifying your own, individual requirements.

HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice

HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.

dataworks summithbasehadoop summit
Why Apache Drill Will Be Successful
Resources
• Contributors have strong
backgrounds from
companies like Oracle,
IBM Netezza, Informatica,
Clustrix and Pentaho
Community
• Development done in the
open
• Active contributors from
multiple companies
• Rapidly growing
Architecture
• Full SQL
• New data support
• Extensible APIs
• Full Columnar Execution
• Beyond Hadoop
Questions?
• What problems can Drill solve for you?
• Where does it fit in the organization?
• Which data sources and BI tools are important
to you?
Let’s Talk!
• tdunning@maprtech.com
• tdunning@apache.org
• @ted_dunning @ApacheDrill
• Slides at http://bit.ly/YxZq8X
• See also
http://www.mapr.com/support/community-
resources/drill
APPENDIX

Recommended for you

Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop

As seen at our meetup on 2017 Feb 21. https://www.meetup.com/futureofdata-budapest/events/236853376/ Author: Marton Elek, Hortonworks

hortonworkshadoop
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview

Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.

apache calciteapache arrowapache orc
MongoDB
MongoDBMongoDB
MongoDB

This document provides an overview and introduction to MongoDB, an open-source, high-performance NoSQL database. It outlines MongoDB's features like document-oriented storage, replication, sharding, and CRUD operations. It also discusses MongoDB's data model, comparisons to relational databases, and common use cases. The document concludes that MongoDB is well-suited for applications like content management, inventory management, game development, social media storage, and sensor data databases due to its flexible schema, distributed deployment, and low latency.

Why Not Leverage MapReduce?
• Scheduling Model
– Coarse resource model reduces hardware utilization
– Acquisition of resources typically takes 100’s of millis to seconds
• Barriers
– Map completion required before shuffle/reduce
commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the next one
can start
• Persistence and Recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between execution
phase

More Related Content

What's hot

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Adam Kawa
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
HostedbyConfluent
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Databricks
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
enissoz
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Daniel Hochman
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Mike Dirolf
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
Taras Matyashovsky
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
JWORKS powered by Ordina
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Igor Anishchenko
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop
Gergely Devenyi
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation
 
MongoDB
MongoDBMongoDB
MongoDB
nikhil2807
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 

What's hot (20)

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
MongoDB
MongoDBMongoDB
MongoDB
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 

Viewers also liked

Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
inovex GmbH
 
Oracle OpenWorld - Getting started with MySQL Cluster
Oracle OpenWorld - Getting started with MySQL ClusterOracle OpenWorld - Getting started with MySQL Cluster
Oracle OpenWorld - Getting started with MySQL Cluster
Benedita Paúl Vasconcelos
 
Big Data mit Apache Hadoop
Big Data mit Apache HadoopBig Data mit Apache Hadoop
Big Data mit Apache Hadoop
Alexander Alten
 
Webinar: Kennzahlen in der Produktion - gewusst wie!
Webinar: Kennzahlen in der Produktion - gewusst wie!Webinar: Kennzahlen in der Produktion - gewusst wie!
Webinar: Kennzahlen in der Produktion - gewusst wie!
Trebing & Himstedt Prozeßautomation GmbH & Co. KG
 
MapReduce & Apache Hadoop
MapReduce & Apache HadoopMapReduce & Apache Hadoop
MapReduce & Apache Hadoop
Oliver Fischer
 
Social Media: 10 Shit Storm Tips - How to survive a shit storm - Paula Hannemann
Social Media: 10 Shit Storm Tips - How to survive a shit storm - Paula HannemannSocial Media: 10 Shit Storm Tips - How to survive a shit storm - Paula Hannemann
Social Media: 10 Shit Storm Tips - How to survive a shit storm - Paula Hannemann
Paula Peters
 
Eval Apache Storm vs. Spark Streaming - German
Eval Apache Storm vs. Spark Streaming - GermanEval Apache Storm vs. Spark Streaming - German
Eval Apache Storm vs. Spark Streaming - German
Erik Schmiegelow
 

Viewers also liked (7)

Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
 
Oracle OpenWorld - Getting started with MySQL Cluster
Oracle OpenWorld - Getting started with MySQL ClusterOracle OpenWorld - Getting started with MySQL Cluster
Oracle OpenWorld - Getting started with MySQL Cluster
 
Big Data mit Apache Hadoop
Big Data mit Apache HadoopBig Data mit Apache Hadoop
Big Data mit Apache Hadoop
 
Webinar: Kennzahlen in der Produktion - gewusst wie!
Webinar: Kennzahlen in der Produktion - gewusst wie!Webinar: Kennzahlen in der Produktion - gewusst wie!
Webinar: Kennzahlen in der Produktion - gewusst wie!
 
MapReduce & Apache Hadoop
MapReduce & Apache HadoopMapReduce & Apache Hadoop
MapReduce & Apache Hadoop
 
Social Media: 10 Shit Storm Tips - How to survive a shit storm - Paula Hannemann
Social Media: 10 Shit Storm Tips - How to survive a shit storm - Paula HannemannSocial Media: 10 Shit Storm Tips - How to survive a shit storm - Paula Hannemann
Social Media: 10 Shit Storm Tips - How to survive a shit storm - Paula Hannemann
 
Eval Apache Storm vs. Spark Streaming - German
Eval Apache Storm vs. Spark Streaming - GermanEval Apache Storm vs. Spark Streaming - German
Eval Apache Storm vs. Spark Streaming - German
 

Similar to Apache drill

No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
Open Analytics
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Bigdatapump
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
Institute of Contemporary Sciences
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Perficient, Inc.
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
Zohar Elkayam
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
Neeraja Rentachintala
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
Some corner at the Laboratory
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
ssuserd3a367
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
MapR Technologies
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Andrew Brust
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
InSemble
 

Similar to Apache drill (20)

No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
MapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
MapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
MapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
MapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
MapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
ScyllaDB
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024
The Digital Insurer
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
Linda Zhang
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1
FellyciaHikmahwarani
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
ScyllaDB
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Earley Information Science
 
5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx
SATYENDRA100
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
Enterprise Wired
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
ScyllaDB
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
Mark Billinghurst
 
AC Atlassian Coimbatore Session Slides( 22/06/2024)
AC Atlassian Coimbatore Session Slides( 22/06/2024)AC Atlassian Coimbatore Session Slides( 22/06/2024)
AC Atlassian Coimbatore Session Slides( 22/06/2024)
apoorva2579
 

Recently uploaded (20)

Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
 
5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
AC Atlassian Coimbatore Session Slides( 22/06/2024)
AC Atlassian Coimbatore Session Slides( 22/06/2024)AC Atlassian Coimbatore Session Slides( 22/06/2024)
AC Atlassian Coimbatore Session Slides( 22/06/2024)
 

Apache drill

  • 1. Apache’s Answer to Low Latency Interactive Query for Big Data February 13, 2013
  • 2. Agenda • Apache Drill overview • Key features • Status and progress • Discuss potential use cases and cooperation
  • 3. Big Data Workloads • ETL • Data mining • Blob store • Lightweight OLTP on large datasets • Index and model generation • Web crawling • Stream processing • Clustering, anomaly detection and classification • Interactive analysis
  • 4. Interactive Queries and Hadoop Compile SQL to MapReduce SQL based analytics Impala Real-time interactive queries Real-time interactive queries Emerging Technologies Common Solutions Export MapReduce results to RDBMS and query the RDBMS External tables in an MPP database
  • 5. Example Problem • Jane works as an analyst at an e- commerce company • How does she figure out good targeting segments for the next marketing campaign? • She has some ideas and lots of data User profiles Transaction information Access logs
  • 6. Solving the Problem with Traditional Systems • Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool • Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses • Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
  • 7. WWGD Distributed File System NoSQL Interactive analysis Batch processing GFS BigTable Dremel MapReduce HDFS HBase ??? Hadoop MapReduce Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 8. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery – Complement native interfaces and MapReduce/Hive/Pig • Open – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested/hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL Interactive queries Data analyst Reporting 100 ms-20 min Data mining Modeling Large ETL 20 min-20 hr MapReduce Hive Pig Apache Drill
  • 9. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce paradigm (but possibly within YARN) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
  • 10. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • 11. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers Drill%Worker Drill%Worker Driver Client Drillbit SQL%Query% Parser Query% Planner Drillbits Drill%ODBC% Driver Tableau MicroStrategy Excel SAP%Crystal% Reports
  • 12. Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, etc. – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel’s innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } { "name": "Homer", "gender": "Male", "followers": 100 children: [ {name: "Bart"}, {name: "Lisa”} ] } JSON Avro
  • 13. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution Row Key CF contents CF anchor "com.cnn.www" contents:html = "<html>…" anchor:my.look.ca = "CNN.com" anchor:cnnsi.com = "CNN" "com.foxnews.www" contents:html = "<html>…" anchor:en.wikipedia.org = "Fox News" … … …
  • 14. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • 15. How Does Impala Fit In? Impala Strengths • Beta currently available • Easy install and setup on top of Cloudera • Faster than Hive on some queries • SQL-like query language Questions • Open Source ‘Lite’ • Doesn’t support RDBMS or other NoSQLs (beyond Hadoop/HBase) • Early row materialization increases footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • 16. Status: In Progress • Heavy active development by multiple organizations • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • 17. Why Apache Drill Will Be Successful Resources • Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho Community • Development done in the open • Active contributors from multiple companies • Rapidly growing Architecture • Full SQL • New data support • Extensible APIs • Full Columnar Execution • Beyond Hadoop
  • 18. Questions? • What problems can Drill solve for you? • Where does it fit in the organization? • Which data sources and BI tools are important to you?
  • 19. Let’s Talk! • tdunning@maprtech.com • tdunning@apache.org • @ted_dunning @ApacheDrill • Slides at http://bit.ly/YxZq8X • See also http://www.mapr.com/support/community- resources/drill
  • 21. Why Not Leverage MapReduce? • Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes 100’s of millis to seconds • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase

Editor's Notes

  1. With the recent explosion of everything related to Hadoop, it is no surprise that new projects/implementations related to the Hadoop ecosystem keep appearing. There have been quite a few initiatives that provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google&apos;s Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce or stream processing frameworks, such as S4 or Storm. It rather fills the existing void – real-time interactive processing of large data sets.------------------------------Technical DetailSimilar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers – nested schema-based data model. Drill is planning to extend this data model by adding additional schema-based implementations, for example, Apache Avro and schema-less data models such asJSON and BSON. In addition to a single data structure, Drill is also planning to support “baby joins” – joins to the small, loadable in memory, data structures.