© 2014 IBM Corporation
Challenges of Building a First
Class SQL-on-Hadoop Engine
Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM
Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
Agenda
► Why and what is Big SQL 3.0?
• Not a sales pitch, I promise!
► Overview of the challenges
► How we solved (some of) them
• Architecture and interaction with Hadoop
• Query rewrite
• Query optimization
► Future challenges
The Perfect Storm
► Increased business interest in SQL on Hadoop to improve the pace and efficiency of adopting Hadoop
► SQL engines on Hadoop are moving away from MapReduce towards MPP architectures
► SQL users expect the same level of language expressiveness, features, and (to some degree) performance as RDBMSs
► IBM has decades of experience and assets in building SQL engines… why not leverage them?
The Result? Big SQL 3.0
► MapReduce replaced with a modern MPP shared-nothing architecture
► Architected from the ground up for low latency and high throughput
► Same SQL expressiveness as traditional RDBMSs, which allows application portability
► Rich enterprise capabilities…

Big SQL 3.0 At a Glance
► Application Portability & Integration
• Data shared with the Hadoop ecosystem
• Comprehensive file formats supported
• Superior enablement of IBM software
► Performance
• Powerful SQL query rewriter
• Cost-based optimizer
• Optimized for concurrent-user throughput
• Result sets not constrained by available memory
► Federation (see the sketch below)
• Distributed requests to multiple data sources within a single SQL statement
• Main data sources supported: DB2, Teradata, Oracle, Netezza
► Enterprise Capabilities
• Advanced security / auditing
• Resource and workload management
• Self-tuning memory management
• Comprehensive monitoring
► Rich SQL
• Comprehensive SQL support
• IBM’s SQL PL compatibility
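For federation, a hedged sketch of what a single cross-source statement can look like, in the style of classic DB2 federation nicknames (the server, table, and column names here are hypothetical, and the exact DDL depends on how the remote server and wrapper are configured):

  -- A nickname maps a remote Oracle table into the local catalog
  create nickname ora_orders for orasrv."SALES"."ORDERS";

  -- A single statement can then join Hadoop data with the remote source
  select c.cust_id, sum(o.total_price)
  from clicks c, ora_orders o      -- clicks: a hypothetical local Hadoop table
  where c.cust_id = o.cust_id
  group by c.cust_id;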
How did we do it?
► Big SQL is derived from an existing IBM shared-nothing RDBMS
• A very mature MPP architecture
• Already understands distributed joins and optimization
► Behavior is sufficiently different that
it is considered a separate product
• Certain SQL constructs are disabled
• Traditional data warehouse partitioning
is unavailable
• New SQL constructs introduced
► On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but…
[Diagram: Traditional distributed RDBMS architecture – data spread across multiple database partitions]
Challenges for a traditional RDBMS on Hadoop
► Data placement
• Traditional databases expect to have full control over data placement
• Data placement plays an important role in performance (e.g. co-located joins)
• Hadoop’s randomly scattered data goes against the grain of this
► Reading and writing Hadoop files (see the sketch after this list)
• Normally an RDBMS has its own storage format
• The format is highly optimized to minimize the cost of moving data into memory
• Hadoop has a practically unbounded number of storage formats, all with different capabilities
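For example, the engine must serve both of the following tables through very different I/O paths (hypothetical tables; the CREATE HADOOP TABLE syntax follows the example shown later in this deck, and the Parquet keyword varies by version):

  create hadoop table logs_text (ip varchar(15), url varchar(200), hits int)
  row format delimited fields terminated by '|'
  stored as textfile;

  create hadoop table logs_parquet (ip varchar(15), url varchar(200), hits int)
  stored as parquetfile;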
Challenges for a traditional RDBMS on Hadoop
► Query optimization
• Statistics on Hadoop are a relatively new concept
• They are frequently not available
• The database optimizer can exploit statistics not traditionally available in Hive
• Hive-style partitioning (grouping data into different files/directories) is a new concept (see the sketch after this list)
► Resource management
• A database server almost always runs in isolation
• In Hadoop, the nodes must be shared with many other tasks
– Data nodes
– MR task tracker and tasks
– HBase region servers, etc.
• We needed to learn to play nice with others
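A minimal sketch of Hive-style partitioning (hypothetical table and columns): each value of the partitioning column becomes its own HDFS directory, which the optimizer must learn to treat as a partition-elimination opportunity:

  create hadoop table sales (
    txn_id bigint,
    amount decimal(10,2)
  )
  partitioned by (sale_date varchar(10))
  stored as textfile;

  -- Data lands in per-partition directories, e.g.:
  --   /warehouse/sales/sale_date=2014-01-01/...
  --   /warehouse/sales/sale_date=2014-01-02/...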

Architecture Overview
[Diagram: cluster topology. A management node hosts the Big SQL master node, the Big SQL scheduler, the DDL and UDF fenced-mode processes, the database service, and the Hive metastore / Hive server. Each compute node hosts a Big SQL worker node with Java I/O and native I/O fenced-mode processes, a UDF fenced-mode process, and temp data, co-located with the HDFS data node (and its HDFS data), the MR task tracker, and other services.]
*FMP = fenced-mode process
Big SQL Scheduler
► The Scheduler is the main RDBMS↔Hadoop service interface
• Interfaces with the Hive metastore for table metadata
• Acts like the MapReduce job tracker for Big SQL
– Big SQL provides query predicates for the scheduler to perform partition elimination (see the sketch below)
– Determines splits for each “table” involved in the query
– Schedules splits on available Big SQL nodes (favoring scheduling local to the data)
– Serves work (splits) to the I/O engines
– Coordinates “commits” after INSERTs
► The Scheduler allows the database engine to be largely unaware of the Hadoop world
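For instance, reusing the partitioned sales table sketched earlier (hypothetical), a predicate on the partitioning column lets the scheduler skip entire partition directories when generating splits:

  select sum(amount)
  from sales
  where sale_date = '2014-01-01';
  -- Only splits under .../sales/sale_date=2014-01-01/ are scheduled and read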
[Diagram: the Scheduler runs on the management node alongside the Big SQL master node, the DDL and UDF fenced-mode processes, the database service, and the Hive metastore; it hands splits to the worker nodes, whose Java and native I/O fenced-mode processes read from the co-located HDFS data node]
I/O Fenced-Mode Processes
► Native I/O FMP
• The high-speed interface for a limited number of common file formats
► Java I/O FMP
• Handles all other formats via the standard Hadoop/Hive APIs
► Both perform multi-threaded direct I/O on local data
► The database engine had to be taught storage format capabilities
• The projection list is pushed into the I/O format
• Predicates are pushed as close to the data as possible (into the storage format, if possible)
• Predicates that cannot be pushed down are evaluated within the database engine
► The database engine is only aware of which nodes need to read
• The Scheduler directs the readers to their portion of the work
[Diagram: on each compute node, the Big SQL worker’s Java I/O and native I/O fenced-mode processes read HDFS data directly from the local data node, alongside the MR task tracker and other services, with work directed by the Scheduler on the management node]
Query Compilation
There is a lot involved in SQL compilation:
► Parsing
• Catch syntax errors
• Generate an internal representation of the query
► Semantic checking
• Determine if the query makes sense
• Incorporate view definitions
• Add logic for constraint checking
► Query optimization
• Modify the query to improve performance (query rewrite)
• Choose the most efficient “access plan”
► Pushdown analysis
• Federation “optimization”
► Threaded code generation
• Generate efficient “executable” code

Query Rewrite
► Why is query rewrite important?
• There are many ways to express the same query
• Query generators often produce suboptimal queries and don’t permit “hand optimization”
• Complex queries often result in redundancy, especially with views
• For large data volumes, optimal access plans are more crucial, as the penalty for poor planning is greater
Original query (correlated subquery):

  select sum(l_extendedprice) / 7.0 as avg_yearly
  from tpcd.lineitem, tpcd.part
  where p_partkey = l_partkey
    and p_brand = 'Brand#23'
    and p_container = 'MED BOX'
    and l_quantity < (select 0.2 * avg(l_quantity)
                      from tpcd.lineitem
                      where l_partkey = p_partkey);

Rewritten query (decorrelated using a windowed aggregate):

  select sum(l_extendedprice) / 7.0 as avg_yearly
  from temp (l_quantity, avgquantity, l_extendedprice) as
    (select l_quantity,
            avg(l_quantity) over (partition by l_partkey) as avgquantity,
            l_extendedprice
     from tpcd.lineitem, tpcd.part
     where p_partkey = l_partkey
       and p_brand = 'Brand#23'
       and p_container = 'MED BOX')
  where l_quantity < 0.2 * avgquantity

• Query correlation eliminated
• The LINEITEM table is accessed only once
• Execution time cut in half!
Query Rewrite
► Most existing query rewrite rules remain unchanged
• 140+ existing query rewrites are leveraged
• Almost none are impacted by “the Hadoop world”
► There were, however, a few modifications required…
Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
• Can produce more efficiently decorrelated subqueries and joins
• Used to prove uniqueness of joined rows (“early-out” join; see the example below)
► Very few Hadoop data sources support the concept of an index
► In the Hive metastore, all columns are implicitly nullable
► Big SQL introduces advisory constraints and nullability indicators
• The user can specify whether or not constraints can be “trusted” for query rewrites

  create hadoop table users
  (
    id         int not null primary key,  -- nullability indicators and primary key constraint
    office_id  int null,
    fname      varchar(30) not null,
    lname      varchar(30) not null,
    salary     timestamp(3) null,
    constraint fk_ofc foreign key (office_id)
      references office (office_id)       -- advisory foreign key constraint
  )
  row format delimited
  fields terminated by '|'
  stored as textfile;
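A hedged illustration of why this matters (hypothetical query and office columns, against the users table above): if the primary key on users.id can be trusted, the optimizer knows each probe matches at most one row, enabling an “early-out” join and cleaner decorrelation:

  select o.office_id, u.fname, u.lname
  from office o, users u
  where u.id = o.manager_id;
  -- The trusted primary key on users.id proves at most one match per office row,
  -- so the join can stop probing after the first hit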
Query Pushdown
► Pushdown moves processing down as close to the data as possible
• Projection pushdown – retrieve only the necessary columns
• Selection pushdown – push search criteria down
► Big SQL understands the capabilities of the readers and storage formats involved
• As much as possible is pushed down
• Residual processing is done in the server
• The optimizer costs queries based upon how much can be pushed down
For example, in the plan for the rewritten query below, both search predicates are pushed to the reader as external “sarg” (searchable argument) predicates:

  3) External Sarg Predicate,
     Comparison Operator: Equal (=)
     Subquery Input Required: No
     Filter Factor: 0.04
     Predicate Text:
     --------------
     (Q1.P_BRAND = 'Brand#23')

  4) External Sarg Predicate,
     Comparison Operator: Equal (=)
     Subquery Input Required: No
     Filter Factor: 0.025
     Predicate Text:
     --------------
     (Q1.P_CONTAINER = 'MED BOX')

  select sum(l_extendedprice) / 7.0 as avg_yearly
  from temp (l_quantity, avgquantity, l_extendedprice) as
    (select l_quantity,
            avg(l_quantity) over (partition by l_partkey) as avgquantity,
            l_extendedprice
     from tpcd.lineitem, tpcd.part
     where p_partkey = l_partkey
       and p_brand = 'Brand#23'
       and p_container = 'MED BOX')
  where l_quantity < 0.2 * avgquantity

Statistics
► Big SQL utilizes Hive statistics collection, with some extensions (see the ANALYZE sketch below):
• Additional support for column groups, histograms, and frequent values
• Automatic determination of partitions that require statistics collection vs. explicit collection
• Partitioned tables: added table-level versions of NDV, min, max, null count, and average column length
• Hive catalogs as well as the database engine catalogs are populated
• We are restructuring the relevant code for submission back to Hive
► Capability for statistics fabrication if no stats are available at compile time
► Statistics collected:
• Table statistics: cardinality (row count), number of files, total file size
• Column statistics: minimum and maximum value (all types), cardinality (non-nulls), distribution (number of distinct values, NDV), number of null values, average length of the column value (all types), histogram (number of buckets configurable), most frequent values (MFV, number configurable)
• Column group statistics
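In stock Hive, the equivalent collection is driven by ANALYZE TABLE; a hedged sketch against the users table from earlier (the exact syntax, especially for column statistics, varies by Hive version):

  analyze table users compute statistics;
  -- table-level stats: row count, number of files, total size

  analyze table users compute statistics for columns id, office_id, salary;
  -- column-level stats: NDV, null count, min/max, average length, ...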
Costing Model
► Few extensions were required to the cost model
► The TBSCAN operator cost model was extended to evaluate the cost of reading from Hadoop
► New elements taken into account: number of files, size of files, number of partitions, number of nodes
► The optimizer now knows in which subset of nodes the data resides → better costing!
[Example access plan (abridged): an HSJOIN combines a parallel TBSCAN of TPCH5TB_PARQ.ORDERS (7.5e+09 rows) with an NLJOIN over grouped and filtered scans of TPCH5TB_PARQ.CUSTOMER; each operator carries its estimated cardinality and cumulative cost, and the table queues (BTQ, LTQ, DTQ) mark where data moves between nodes and subagents]
New Access Plans
► Data is not hash-partitioned on a particular column (aka “scatter” partitioned)
► A new parallel join strategy was introduced
► We can access a Hadoop table as:
• “Scattered” partitioned: each node accesses only its local data
• Replicated: accesses local and remote data
– The optimizer could also use a broadcast table queue
– The HDFS shared file system provides the replication
Parallel Join Strategies
Replicated vs. broadcast join
► All tables are “scatter” partitioned
► Join predicate: STORE.STOREKEY = DAILY_SALES.STOREKEY
► Replicate the smaller table to the partitions of the larger table using either:
• A broadcast table queue
• A replicated HDFS scan
► A table queue represents communication between nodes or subagents
[Diagram: STORE is delivered to every node via a broadcast table queue or a replicated scan, then joined against the local scan of DAILY_SALES]
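A hedged sketch of the query shape this strategy targets (the STORE/DAILY_SALES tables and join predicate come from the slide above; the column list is hypothetical):

  select s.storekey, sum(d.sales_amount) as total_sales
  from store s, daily_sales d
  where s.storekey = d.storekey
  group by s.storekey;
  -- STORE (small) is broadcast or replicated; DAILY_SALES (large) is scanned locally on each node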

Parallel Join Strategies
Repartitioned join
► All tables are “scatter” partitioned
► Join predicate: DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
► Both tables are large
► Too expensive to broadcast or replicate either one
► Repartition both tables on the join columns
► Use a directed table queue (DTQ)
[Diagram: DAILY_FORECAST and DAILY_SALES are each scanned and repartitioned through directed table queues so that matching STOREKEY values arrive on the same node for the join]
Future Challenges
► The challenges never end!
• That’s what makes this job fun!
• The Hadoop ecosystem continues to expand
• New storage techniques, indexing techniques, etc.
► Here are a few areas we’re exploring…
Future Challenges
► Dynamic split allocation
• React to competing workloads
• If one node is slow, hand the work you would have given it to another node
► More pushdown!
• Currently we push projection/selection down
• Should we push more advanced operations? Aggregation? Joins?
► Join co-location
• Perform co-located joins when tables are partitioned on the same join key
► Explicit MapReduce-style parallelism (“SQL MR”)
• Expand SQL to explicitly perform partitioned operations
Queries?
(Optimized, of course)
Try Big SQL 3.0 Beta on the cloud!
https://bigsql.imdemocloud.com/
Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM
Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri

Recommended for you

Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010

The document summarizes HBase use at Facebook, including its development and future work. HBase is used for incremental updates to data warehouses, high frequency analytics, and write-intensive workloads. Development includes Hive integration, master high availability, and random read optimizations. Future work focuses on coprocessors, intelligent load balancing, and cluster performance.

clouderahadoop worldapache hadoop
Advanced Data Modeling and Bitmap Indexes
Advanced Data Modeling and Bitmap IndexesAdvanced Data Modeling and Bitmap Indexes
Advanced Data Modeling and Bitmap Indexes

Matt Stump presents for the DataStax Cassandra South Bay Users group on advanced data modeling and bitmap indexes.

matt stumpmattdatastax
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011

In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.

mrv2mapreducehadoop

More Related Content

What's hot

Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
Swiss Big Data User Group
 
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
George Joseph
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon
 
Hadoop DB
Hadoop DBHadoop DB
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
Scott Leberknight
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
larsgeorge
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
Cloudera, Inc.
 
HadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endHadoopDB a major step towards a dead end
HadoopDB a major step towards a dead end
thkoch
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
Wangda Tan
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
Gwen (Chen) Shapira
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
phanleson
 
HBase internals
HBase internalsHBase internals
HBase internals
Matteo Bertozzi
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in Alibaba
DataWorks Summit
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
Cloudera, Inc.
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
Guang Xu
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
Cloudera, Inc.
 

What's hot (19)

Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
 
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Hadoop DB
Hadoop DBHadoop DB
Hadoop DB
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
 
HadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endHadoopDB a major step towards a dead end
HadoopDB a major step towards a dead end
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
HBase internals
HBase internalsHBase internals
HBase internals
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in Alibaba
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
 

Viewers also liked

Advanced Data Modeling and Bitmap Indexes
Advanced Data Modeling and Bitmap IndexesAdvanced Data Modeling and Bitmap Indexes
Advanced Data Modeling and Bitmap Indexes
DataStax Academy
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
Salil Navgire
 
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Michael Arnold
 
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, HadoopData-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
DATAVERSITY
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
Milind Bhandarkar
 
Monitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, NagiosMonitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, Nagios
Pradeep Kumar
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
Praveen Kumar Donta
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Hortonworks
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
Humza Naseer
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Cloudera, Inc.
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
Hortonworks
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Hortonworks
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
Caserta
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
OReillyStrata
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 

Viewers also liked (18)

Advanced Data Modeling and Bitmap Indexes
Advanced Data Modeling and Bitmap IndexesAdvanced Data Modeling and Bitmap Indexes
Advanced Data Modeling and Bitmap Indexes
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
 
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
 
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, HadoopData-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Monitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, NagiosMonitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, Nagios
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to Challenges of Implementing an Advanced SQL Engine on Hadoop

SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Mark Rittman
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
Lucas Jellema
 
Webinar: Migrating from RDBMS to MongoDB
Webinar: Migrating from RDBMS to MongoDBWebinar: Migrating from RDBMS to MongoDB
Webinar: Migrating from RDBMS to MongoDB
MongoDB
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Arseny Chernov
 
NoSQL
NoSQLNoSQL
NoSQL
dbulic
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
Amazon Web Services
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf
AmanCSE050
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Mark Rittman
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Kognitio
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
MongoDB
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
Serendio Inc.
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
Adi Challa
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
MongoDB
 

Similar to Challenges of Implementing an Advanced SQL Engine on Hadoop (20)

SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
 
Webinar: Migrating from RDBMS to MongoDB
Webinar: Migrating from RDBMS to MongoDBWebinar: Migrating from RDBMS to MongoDB
Webinar: Migrating from RDBMS to MongoDB
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
NoSQL
NoSQLNoSQL
NoSQL
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course

Challenges of Implementing an Advanced SQL Engine on Hadoop

  • 1. © 2014 IBM Corporation Challenges of Building a First Class SQL-on-Hadoop Engine Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
  • 2. Agenda ► Why and what is Big SQL 3.0? • Not a sales pitch, I promise! ► Overview of the challenges ► How we solved (some of) them • Architecture and interaction with Hadoop • Query rewrite • Query optimization ► Future challenges
  • 3. The Perfect Storm
► Increased business interest in SQL on Hadoop to improve the pace and efficiency of adopting Hadoop
► SQL engines on Hadoop moving away from MR towards MPP architectures
► SQL users expect the same level of language expressiveness, features and (somewhat) performance as RDBMSs
► IBM has decades of experience and assets in building SQL engines… Why not leverage it?
  • 4. The Result? Big SQL 3.0
► MapReduce replaced with a modern MPP shared-nothing architecture
► Architected from the ground up for low latency and high throughput
► Same SQL expressiveness as relational DBMSs, which allows application portability
► Rich enterprise capabilities…
  • 5. Big SQL 3.0 At a Glance
Application Portability & Integration
• Data shared with Hadoop ecosystem
• Comprehensive file formats supported
• Superior enablement of IBM Software
Performance
• Powerful SQL query rewriter
• Cost based optimizer
• Optimized for concurrent user throughput performance
• Result sets not constrained by existing memory
Federation
• Distributed requests to multiple data sources within a single SQL statement
• Main data sources supported: DB2, Teradata, Oracle, Netezza
Enterprise Capabilities
• Advanced security / auditing
• Resource and workload management
• Self-tuning memory management
• Comprehensive monitoring
Rich SQL
• Comprehensive SQL support
• IBM's SQL PL compatibility
  • 6. How did we do it?
► Big SQL is derived from an existing IBM shared-nothing RDBMS
• A very mature MPP architecture
• Already understands distributed joins and optimization
► Behavior is sufficiently different that it is considered a separate product
• Certain SQL constructs are disabled
• Traditional data warehouse partitioning is unavailable
• New SQL constructs introduced
► On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but…
[Diagram: traditional distributed RDBMS architecture with multiple database partitions]
  • 7. Challenges for a traditional RDBMS on Hadoop
► Data placement
• Traditional databases expect to have full control over data placement
• Data placement plays an important role in performance (e.g. co-located joins)
• Hadoop's randomly scattered data goes against the grain of this
► Reading and writing Hadoop files
• Normally an RDBMS has its own storage format
• The format is highly optimized to minimize the cost of moving data into memory
• Hadoop has a practically unbounded number of storage formats, all with different capabilities
  • 8. Challenges for a traditional RDBMS on Hadoop
► Query optimization
• Statistics on Hadoop are a relatively new concept
• They are frequently not available
• The database optimizer can use statistics not traditionally available in Hive
• Hive-style partitioning (grouping data into different files/directories) is a new concept
► Resource management
• A database server almost always runs in isolation
• In Hadoop the nodes must be shared with many other tasks
– Data nodes
– MR task tracker and tasks
– HBase region servers, etc.
• We needed to learn to play nice with others
  • 9. Architecture Overview
[Diagram] A management node hosts the Big SQL master node, the Big SQL scheduler, DDL and UDF fenced-mode processes, the database service, the Hive metastore and the Hive server. Each compute node runs a Big SQL worker node with native and Java I/O FMPs and a UDF FMP, alongside the HDFS data node, MR task tracker and other services, with HDFS data and temp data on local disk.
*FMP = fenced-mode process
  • 10. Big SQL Scheduler
► The Scheduler is the main RDBMS↔Hadoop service interface
• Interfaces with the Hive metastore for table metadata
• Acts like the MapReduce job tracker for Big SQL
– Big SQL provides query predicates for the scheduler to perform partition elimination
– Determines splits for each “table” involved in the query
– Schedules splits on available Big SQL nodes (favoring scheduling locally to the data)
– Serves work (splits) to the I/O engines
– Coordinates “commits” after INSERTs
► The Scheduler allows the database engine to be largely unaware of the Hadoop world
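To make partition elimination concrete, here is a minimal sketch (table and column names are hypothetical, and the DDL follows the Hive-style syntax shown on slide 15) of a partitioned table and a query whose predicate the scheduler can use to skip entire partitions:

    -- Hypothetical table, partitioned Hive-style by sale_date:
    -- each distinct sale_date value maps to its own HDFS directory.
    create hadoop table sales (
        id     bigint,
        amount double
    )
    partitioned by (sale_date string)
    row format delimited fields terminated by '|'
    stored as textfile;

    -- The scheduler receives the predicate below and only generates
    -- splits for the matching sale_date=... directories; files in
    -- all other partitions are never read.
    select sum(amount)
    from sales
    where sale_date between '2014-01-01' and '2014-01-31';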
  • 11. I/O Fenced-Mode Processes
► Native I/O FMP
• The high-speed interface for a limited number of common file formats
► Java I/O FMP
• Handles all other formats via standard Hadoop/Hive APIs
► Both perform multi-threaded direct I/O on local data
► The database engine had to be taught storage format capabilities
• The projection list is pushed into the I/O format
• Predicates are pushed as close to the data as possible (into the storage format, if possible)
• Predicates that cannot be pushed down are evaluated within the database engine
► The database engine is only aware of which nodes need to read
• The Scheduler directs the readers to their portion of the work
  • 12. Query Compilation
There is a lot involved in SQL compilation:
► Parsing
• Catch syntax errors
• Generate an internal representation of the query
► Semantic checking
• Determine if the query makes sense
• Incorporate view definitions
• Add logic for constraint checking
► Query optimization
• Modify the query to improve performance (query rewrite)
• Choose the most efficient “access plan”
► Pushdown analysis
• Federation “optimization”
► Threaded code generation
• Generate efficient “executable” code
  • 13. Query Rewrite
► Why is query rewrite important?
• There are many ways to express the same query
• Query generators often produce suboptimal queries and don’t permit “hand optimization”
• Complex queries often result in redundancy, especially with views
• For large data volumes, optimal access plans are more crucial, as the penalty for poor planning is greater
Before:
    select sum(l_extendedprice) / 7.0 avg_yearly
    from tpcd.lineitem, tpcd.part
    where p_partkey = l_partkey
      and p_brand = 'Brand#23'
      and p_container = 'MED BOX'
      and l_quantity < (select 0.2 * avg(l_quantity)
                        from tpcd.lineitem
                        where l_partkey = p_partkey);
After:
    select sum(l_extendedprice) / 7.0 as avg_yearly
    from (select l_quantity,
                 avg(l_quantity) over (partition by l_partkey) as avgquantity,
                 l_extendedprice
          from tpcd.lineitem, tpcd.part
          where p_partkey = l_partkey
            and p_brand = 'Brand#23'
            and p_container = 'MED BOX'
         ) as temp (l_quantity, avgquantity, l_extendedprice)
    where l_quantity < 0.2 * avgquantity;
• Query correlation eliminated
• lineitem table accessed only once
• Execution time cut in half!
  • 14. Query Rewrite
► Most existing query rewrite rules remain unchanged
• 140+ existing query rewrites are leveraged
• Almost none are impacted by “the Hadoop world”
► There were, however, a few modifications required…
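As one hypothetical illustration (not from the slides) of the kind of long-standing rewrite that carries over unchanged, consider predicate transitive closure over two made-up tables t1 and t2:

    -- Original query: the filter is written only against t1.
    select *
    from t1, t2
    where t1.a = t2.a
      and t1.a > 100;

    -- After rewrite: the equality join predicate lets the rewriter
    -- infer t2.a > 100 as well, so both scans can be filtered early.
    select *
    from t1, t2
    where t1.a = t2.a
      and t1.a > 100
      and t2.a > 100;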
  • 15. Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
• Can produce more efficiently decorrelated subqueries and joins
• Used to prove uniqueness of joined rows (“early-out” join)
► Very few Hadoop data sources support the concept of an index
► In the Hive metastore all columns are implicitly nullable
► Big SQL introduces advisory constraints and nullability indicators
• The user can specify whether or not constraints can be “trusted” for query rewrites
    create hadoop table users (
      id        int not null primary key,
      office_id int null,
      fname     varchar(30) not null,
      lname     varchar(30) not null,
      salary    timestamp(3) null,
      constraint fk_ofc foreign key (office_id)
        references office (office_id)
    )
    row format delimited fields terminated by '|'
    stored as textfile;
(The null / not null indicators and the constraints above are the advisory elements.)
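A minimal sketch of how an advisory constraint might be added after the fact, assuming DB2-style informational-constraint syntax (the constraint name is hypothetical; check the Big SQL reference for the exact clauses). The constraint is never checked against the data, but the rewriter is allowed to trust it:

    -- Hypothetical advisory foreign key: not checked on load,
    -- but usable by query rewrite (e.g. to prove an early-out join).
    alter table users
      add constraint fk_users_office
      foreign key (office_id) references office (office_id)
      not enforced
      enable query optimization;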
  • 16. Query Pushdown
► Pushdown moves processing down as close to the data as possible
• Projection pushdown – retrieve only the necessary columns
• Selection pushdown – push search criteria
► Big SQL understands the capabilities of the readers and storage formats involved
• As much as possible is pushed down
• Residual processing is done in the server
• The optimizer costs queries based upon how much can be pushed down
Pushed predicates from the access plan of the rewritten query on slide 13:
    3) External Sarg Predicate, Comparison Operator: Equal (=)
       Subquery Input Required: No
       Filter Factor: 0.04
       Predicate Text: (Q1.P_BRAND = 'Brand#23')
    4) External Sarg Predicate, Comparison Operator: Equal (=)
       Subquery Input Required: No
       Filter Factor: 0.025
       Predicate Text: (Q1.P_CONTAINER = 'MED BOX')
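A small hedged example of what does and does not push down; whether a given predicate actually reaches the reader depends on the capabilities of the storage format involved:

    -- Projection pushdown: only l_quantity and l_extendedprice are
    -- requested from the reader; all other columns are skipped.
    select l_extendedprice
    from tpcd.lineitem
    where l_quantity < 10                  -- simple comparison: a good
                                           -- pushdown candidate
      and mod(int(l_quantity), 7) = 0;     -- expression over a column:
                                           -- likely evaluated as a
                                           -- residual in the engine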
  • 17. Statistics
► Big SQL utilizes Hive statistics collection with some extensions:
• Additional support for column groups, histograms and frequent values
• Automatic determination of partitions that require statistics collection vs. explicit
• Partitioned tables: added table-level versions of NDV, min, max, null count, average column length
• Hive catalogs as well as database engine catalogs are populated
• We are restructuring the relevant code for submission back to Hive
► Capability for statistics fabrication if no stats are available at compile time
Table statistics:
• Cardinality (count)
• Number of files
• Total file size
Column statistics:
• Minimum value (all types)
• Maximum value (all types)
• Cardinality (non-nulls)
• Distribution (number of distinct values, NDV)
• Number of null values
• Average length of the column value (all types)
• Histogram – number of buckets configurable
• Frequent values (MFV) – number configurable
Column group statistics
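Collection itself is driven by a statement along these lines, a sketch assuming the Hive-style ANALYZE syntax that Big SQL extends (exact options may differ by release):

    -- Gather table-level stats plus per-column NDV, min/max, null
    -- counts, histograms and frequent values for selected columns.
    analyze table tpcd.lineitem
      compute statistics
      for columns l_partkey, l_quantity, l_extendedprice;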
  • 18. Costing Model
► Few extensions required to the cost model
► TBSCAN operator cost model extended to evaluate the cost of reading from Hadoop
► New elements taken into account: # of files, size of files, # of partitions, # of nodes
► The optimizer now knows in which subset of nodes the data resides → better costing!
[Example access plan tree with per-operator cost estimates: HSJOIN over BTQ and NLJOIN branches, with GRPBY, FILTER and LTQ/DTQ table-queue operators over TBSCANs of the TPCH5TB_PARQ tables ORDERS and CUSTOMER]
  • 19. New Access Plans
► Data is not hash-partitioned on particular columns (aka “scatter partitioned”), so a new parallel join strategy was introduced
► We can access a Hadoop table as:
• “Scatter” partitioned: only accesses data local to the node
• Replicated: accesses local and remote data
– The optimizer could also use a broadcast table queue
– The HDFS shared file system provides replication
  • 20. Parallel Join Strategies – Replicated vs. Broadcast Join
► All tables are “scatter” partitioned
► Join predicate: STORE.STOREKEY = DAILY_SALES.STOREKEY
► Replicate the smaller table to the partitions of the larger table using:
• A broadcast table queue, or
• A replicated HDFS scan
► A table queue represents communication between nodes or subagents
[Diagram: JOIN over a Daily Sales SCAN and a Store SCAN, fed either through a broadcast TQ or a replicated SCAN]
  • 21. Parallel Join Strategies – Repartitioned Join
► All tables are “scatter” partitioned
► Join predicate: DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
► Both tables are large; too expensive to broadcast or replicate either
► Repartition both tables on the join columns
► Use directed table queues (DTQs)
[Diagram: JOIN over a Daily Forecast SCAN and a Daily Sales SCAN, each feeding a directed TQ]
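To make the choice between the two strategies concrete, here is a hedged sketch using the slides’ tables (the amount column is hypothetical):

    -- Small-to-large join: the optimizer can broadcast or replicate
    -- STORE to every partition of DAILY_SALES (slide 20).
    select s.storekey, sum(d.amount)
    from store s, daily_sales d
    where s.storekey = d.storekey
    group by s.storekey;

    -- Large-to-large join: broadcasting either side is too expensive,
    -- so both inputs are repartitioned (hashed) on the join column and
    -- shipped through directed table queues (slide 21).
    select f.storekey, sum(d.amount)
    from daily_forecast f, daily_sales d
    where f.storekey = d.storekey
    group by f.storekey;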
  • 22. Future Challenges ► The challenges never end! • That’s what makes this job fun! • The Hadoop ecosystem continues to expand • New storage techniques, indexing techniques, etc. ► Here are a few areas we’re exploring….
  • 23. Future Challenges
► Dynamic split allocation
• React to competing workloads
• If one node is slow, hand the work you would have given it to another node
► More pushdown!
• Currently we push projection/selection down
• Should we push more advanced operations? Aggregation? Joins?
► Join co-location
• Perform co-located joins when tables are partitioned on the same join key
► Explicit MapReduce-style parallelism (“SQL MR”)
• Expand SQL to explicitly perform partitioned operations
  • 24. Queries? (Optimized, of course) Try Big SQL 3.0 Beta on the cloud! https://bigsql.imdemocloud.com/ Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri

Editor's Notes

  1. Rewriting a given SQL query into a semantically equivalent form that may be processed more efficiently
  2. MFVs and histograms obtain better selectivity estimates for range predicates over data that is non-uniformly distributed. Stats are stored in the Hive metastore for the stats Hive currently supports, and in our internal catalog tables for all others. Min/max are kept in Hive only for a subset of types, and the average length of column values only for strings. Column and table stats are collected together. Next: automatic stats collection.