Google Map/Reduce paper (2004)
Cutting & Cafarella create Hadoop (2005)
Facebook creates Hive (2007)*
Google Dremel paper (2010)
Apache Drill proposal (August 2012)
Cloudera announces Impala (October 2012)
Hortonworks' Stinger (February 2013)
* Hive => "SQL on Hadoop"
Write SQL queries
Translate into Map/Reduce job(s)
Convenient & easy
High-latency (batch processing)
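To make the "Translate into Map/Reduce job(s)" step concrete, here is a minimal Python sketch of how a GROUP BY count could be expressed as a map phase, a shuffle, and a reduce phase. It is only an illustration of the idea, not Hive's actual planner; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows, key_column):
    """Emit one (key, 1) pair per input row, as a mapper would."""
    for row in rows:
        yield row[key_column], 1


def shuffle(pairs):
    """Group emitted values by key (the framework's shuffle/sort step)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def group_by_count(rows, key_column):
    """Roughly: select <key_column>, count(*) ... group by <key_column>."""
    return reduce_phase(shuffle(map_phase(rows, key_column)))


# Hypothetical sample rows, loosely shaped like customer_demographics.
rows = [
    {"cd_demo_sk": 1, "cd_credit_rating": "Good"},
    {"cd_demo_sk": 2, "cd_credit_rating": "High Risk"},
    {"cd_demo_sk": 3, "cd_credit_rating": "Good"},
    {"cd_demo_sk": 4, "cd_credit_rating": "Unknown"},
]
counts = group_by_count(rows, "cd_credit_rating")
print(counts)
```

In a real cluster each phase is scheduled as separate tasks with data written between them, which is where the batch-style latency comes from.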
What is Impala?
In-memory, distributed SQL query engine (no Map/Reduce)
Native code (C++)
Runs on the HDFS data nodes
Why Impala?
Interactive data analysis
Low-latency response
(roughly 4-100x faster than Hive)
Deploy on existing Hadoop clusters
Why Impala? (cont'd)
Data stored in HDFS avoids...
...duplicate storage
...data transformation
...moving data
Overview
impalad daemon runs on HDFS nodes
Queries run on "relevant" nodes
Supports common HDFS file formats
statestored daemon tracks cluster membership
Uses Hive metastore (for database metadata)
Overview (cont'd)
Does not use Map/Reduce
Not fault tolerant!
(a query fails if any part of it fails on any node)
Submit queries via Hue/Beeswax
Thrift API, CLI, ODBC, JDBC (future)
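The "not fault tolerant" point can be sketched in a few lines of Python: a coordinator runs one piece of the query per node and abandons the whole query as soon as any piece fails. This is a toy model under assumed names (run_query, QueryFailed, and the fragment callables are all hypothetical), not Impala's actual coordinator code.

```python
class QueryFailed(Exception):
    """Raised when any fragment of a query fails."""


def run_query(fragments_by_node):
    """Run one fragment per node and combine the results.

    fragments_by_node maps a node name to a zero-argument callable that
    returns a list of rows. There is no retry and no re-scheduling on
    another node: the first failure fails the whole query.
    """
    results = []
    for node, fragment in fragments_by_node.items():
        try:
            results.extend(fragment())
        except Exception as exc:
            raise QueryFailed(f"fragment on {node} failed: {exc}") from exc
    return results


def healthy():
    return [("a", 1)]


def broken():
    raise IOError("disk error")


print(run_query({"node1": healthy, "node2": healthy}))
```

Map/Reduce, by contrast, re-runs failed tasks, which is part of the trade-off between the two approaches.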
SQL Support
(subset of Hive QL)
ORDER BY (w/ LIMIT)
JOIN (equi-join only, subject to memory limits)
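Why an equi-join would be "subject to memory" is easier to see with a sketch. The Python below assumes a hash join, where one input (the build side) is loaded into an in-memory table and the other is streamed past it. The slides do not say which join algorithm Impala uses, so treat the algorithm, the function name, and the max_build_rows budget as assumptions. Column names follow the queries in this deck; the row values are made up.

```python
def hash_equi_join(build_rows, probe_rows, build_key, probe_key, max_build_rows):
    """Join two lists of dict rows on build_key == probe_key.

    The whole build side must fit in memory, modelled here by a simple
    row-count budget (max_build_rows, a hypothetical parameter).
    """
    if len(build_rows) > max_build_rows:
        raise MemoryError("build side does not fit in the memory budget")

    # Build phase: index every build row by its join key.
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(row[build_key], []).append(row)

    # Probe phase: look each probe row up by equality on the key.
    joined = []
    for row in probe_rows:
        for match in hash_table.get(row[probe_key], []):
            joined.append({**row, **match})
    return joined


customers = [
    {"c_customer_sk": 1, "c_current_addr_sk": 10},
    {"c_customer_sk": 2, "c_current_addr_sk": 11},
    {"c_customer_sk": 3, "c_current_addr_sk": 99},  # no matching address
]
addresses = [
    {"ca_address_sk": 10, "ca_zip": "20191"},
    {"ca_address_sk": 11, "ca_zip": "20194"},
]
joined = hash_equi_join(
    addresses, customers, "ca_address_sk", "c_current_addr_sk", max_build_rows=1000
)
print(joined)
```

The dictionary lookup only works for equality comparisons, which is one reason a design like this would support equi-joins only.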
HBase Queries
Maps HBase tables via Hive metastore mapping
HBase scan translations:
Row key predicates => start/stop row
Non-row key predicates => SingleColumnValueFilter
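The two scan translations can be sketched as a small Python function. It relies on the HBase convention that a scan's start row is inclusive and its stop row is exclusive. Everything else is an assumption made for illustration: the (column, operator, value) predicate tuples, the function name, and the dictionary standing in for a scan. It is not Impala's planner code, and the filter is recorded as a plain tuple rather than a real SingleColumnValueFilter object.

```python
def translate_predicates(predicates, row_key_column="key"):
    """Turn (column, op, value) predicates into a scan description.

    Values are bytes. Row key predicates narrow the start/stop row;
    all other predicates become SingleColumnValueFilter entries.
    """
    scan = {"start_row": None, "stop_row": None, "filters": []}
    for column, op, value in predicates:
        if column != row_key_column:
            scan["filters"].append(("SingleColumnValueFilter", column, op, value))
            continue

        # value + b"\x00" is the smallest key strictly greater than value.
        if op == "=":
            lo, hi = value, value + b"\x00"
        elif op == ">=":
            lo, hi = value, None
        elif op == ">":
            lo, hi = value + b"\x00", None
        elif op == "<":
            lo, hi = None, value
        elif op == "<=":
            lo, hi = None, value + b"\x00"
        else:
            raise ValueError(f"unsupported row key operator: {op}")

        # Keep the tightest bounds seen so far.
        if lo is not None and (scan["start_row"] is None or lo > scan["start_row"]):
            scan["start_row"] = lo
        if hi is not None and (scan["stop_row"] is None or hi < scan["stop_row"]):
            scan["stop_row"] = hi
    return scan


scan = translate_predicates([
    ("key", ">=", b"a"),
    ("key", "<", b"m"),
    ("cf:city", "=", b"Reston"),  # hypothetical non-row-key column
])
print(scan)
```

Row key bounds let the scan skip whole ranges of the table, while column filters still have to examine every row inside the range.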
(Very) Unscientific Benchmarks
9 queries, run in Impala Demo VM
MacBook Pro Retina, mid 2012
Intel i7 2.6GHz quad-core processor
4GB for VM (VMware 5)
No other load on system during queries
Pseudo-cluster + Impala daemons
Benchmarks (cont'd)
Impala vs. Hive performance
"TPC-DS" sample dataset
(from simple projection queries to multiple joins, aggregation, multiple predicates, and order by)
Query "A"
from customer c
limit 50;
Query "B"
from customer c
   join customer_address ca
on c.c_current_addr_sk = ca.ca_address_sk
limit 50;
Query "C"
from customer c
   join customer_address ca
on c.c_current_addr_sk = ca.ca_address_sk
where lower(c.c_last_name) like 'smi%'
limit 50;
Query "D"
select distinct cd_credit_rating
from customer_demographics;
Query "E"
from customer_demographics
group by cd_credit_rating;
Query "F"
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   lower(c.c_last_name) like 'smi%' and
   cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 50;
Query "G"
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   ca.ca_zip in ('20191', '20194') and
   cd.cd_credit_rating in ('Unknown', 'High Risk');
Query "H"
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   ca.ca_zip in ('20191', '20194') and
   cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 100;
Query "TPC-DS"
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
from store_sales
join date_dim
   on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item
   on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics
   on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join store
   on (store_sales.ss_store_sk = store.s_store_sk)
where
  cd_gender = 'M' and
  cd_marital_status = 'S' and
  cd_education_status = 'College' and
  d_year = 2002 and
  s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by
order by
limit 100;
Query    Hive (sec)  # M/R jobs  Impala (sec)  x Hive perf.
A        12.4        1           0.21          59
B        30.9        1           0.37          84
C        29.6        1           0.33          91
D        22.8        1           0.60          38
E        22.5        1           0.52          44
F        66.4        2           1.56          43
G        83.0        3           1.33          62
H        66.1        2           1.50          44
TPC-DS   248.3       6           3.05          82
(remember, unscientific...)
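The "x Hive perf." column is just the Hive time divided by the Impala time. The short Python sketch below recomputes it from the timings in the table above. The variable and function names are chosen here for illustration.

```python
# (Hive seconds, Impala seconds, speedup as reported on the slide)
results = {
    "A": (12.4, 0.21, 59),
    "B": (30.9, 0.37, 84),
    "C": (29.6, 0.33, 91),
    "D": (22.8, 0.60, 38),
    "E": (22.5, 0.52, 44),
    "F": (66.4, 1.56, 43),
    "G": (83.0, 1.33, 62),
    "H": (66.1, 1.50, 44),
    "TPC-DS": (248.3, 3.05, 82),
}


def speedup(hive_sec, impala_sec):
    """How many times faster Impala was than Hive for one query."""
    return hive_sec / impala_sec


recomputed = {
    name: round(speedup(hive, impala))
    for name, (hive, impala, _reported) in results.items()
}

for name, (hive, impala, reported) in results.items():
    print(f"{name:7} recomputed {recomputed[name]:3}x  slide {reported:3}x")
```

For queries C, E, and TPC-DS the recomputed figure comes out one lower than the slide's, most likely because the timings shown are themselves rounded.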
Cloudera Impala
Two daemons
impalad on each HDFS data node
statestored - metadata
Thrift APIs
Query execution
Query coordination
Query planning
[Diagram: an impalad node with Query Planner, Query Coordinator, and Query Executor, alongside the HDFS DataNode and HBase RegionServer]
Queries performed in-memory
Intermediate data never hits disk!
Data streamed to clients
Execution engine:
runtime code generation
intrinsics for optimization
Cluster membership
Metadata handling
(scheduled for GA release)
Not a SPOF
(single point of failure)
Shares Hive metastore
Daemons cache metadata
Push to cluster via statestored
(scheduled for GA release)
Create tables in Hive
(then REFRESH impalad)
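The "create tables in Hive, then REFRESH" workflow follows from the daemons caching metadata. The Python toy model below shows the effect: a daemon's cached table list goes stale when a table is created in the metastore and only catches up after an explicit refresh. Both classes are hypothetical stand-ins, not real Hive or Impala APIs.

```python
class MetastoreStub:
    """Stand-in for the shared Hive metastore (hypothetical)."""

    def __init__(self):
        self._tables = set()

    def create_table(self, name):
        self._tables.add(name)

    def list_tables(self):
        return set(self._tables)


class DaemonCatalog:
    """Stand-in for one daemon's cached copy of the metadata (hypothetical)."""

    def __init__(self, metastore):
        self._metastore = metastore
        self._cached = metastore.list_tables()

    def has_table(self, name):
        # Answers from the cache only; never asks the metastore.
        return name in self._cached

    def refresh(self):
        # Reload the cache, as a REFRESH would.
        self._cached = self._metastore.list_tables()


metastore = MetastoreStub()
daemon = DaemonCatalog(metastore)

metastore.create_table("customer")       # created in Hive
before = daemon.has_table("customer")    # daemon's cache is stale
daemon.refresh()
after = daemon.has_table("customer")     # visible after the refresh
print(before, after)
```

Pushing metadata changes to the cluster via statestored, listed above as scheduled for GA, would remove the manual refresh step.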
Next up - how queries work...
[Diagram: Client, Statestore, and Hive Metastore, with three impalad nodes; each node has a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase RegionServer]
Read directly from disk
Short-circuit reads
Bypass HDFS DataNode
(avoids overhead of HDFS API)
[Diagram: impalad (Query Planner, Query Coordinator, Query Executor) reading from disk through the local filesystem]
Current Limitations
(as of beta version 0.6)
No join order optimization
No custom file formats or SerDes or UDFs
Limit required when using ORDER BY
Joins limited by memory of single node
(at GA, aggregate memory of cluster)
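The slides do not explain why ORDER BY needs a LIMIT. One plausible reading, offered here as an assumption, is that with a LIMIT the engine only has to keep the best N rows in memory instead of sorting the full result. The Python sketch below shows that bounded top-N idea using the standard library's heapq.nsmallest. The function name and the sample rows are hypothetical.

```python
import heapq


def order_by_with_limit(rows, sort_key, limit):
    """Return the first `limit` rows in ascending sort_key order.

    heapq.nsmallest never holds more than `limit` candidate rows at a
    time, so memory use is bounded by the LIMIT, not by the input size.
    """
    return heapq.nsmallest(limit, rows, key=sort_key)


# Hypothetical (last name, customer id) rows arriving in arbitrary order.
rows = [
    ("smith", 3),
    ("adams", 7),
    ("smiley", 1),
    ("baker", 9),
    ("jones", 4),
]
top_two = order_by_with_limit(rows, sort_key=lambda row: row[0], limit=2)
print(top_two)
```

Because rows is consumed as a plain iterable, the same function works on a stream of rows that is too large to hold at once.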
Current Limitations (cont'd)
(as of beta version 0.6)
No advanced data structures
(arrays, maps, json, etc.)
No DDL (do in Hive)
Limited file formats (text, sequence
w/ snappy/gzip compression)
Future - GA & beyond...
Structured types (structs, arrays, maps, json, etc.)
DDL support
Additional file formats &
compression support
Columnar format
Metadata distribution (via statestore)
Join optimization
(e.g. cost-based)
Comparing Impala to Dremel
"Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google."
- Google Dremel paper (2010)
Comparing Impala to Dremel (cont'd)
Impala = Dremel features circa 2010 + join
support, assuming columnar data format
(but, Google doesn't stand still...)
Dremel is in production and mature
Basis for Google's BigQuery
Comparing Impala to Hive
Hive uses Map/Reduce -> high latency
Impala is an in-memory, low-latency query engine
Sacrifices fault tolerance for performance
Comparing Impala to Others
Stinger:
Improve Hive performance (e.g. optimize execution plan)
Support for analytics (e.g. OVER clause, window functions)
TEZ framework to optimize execution
Columnar file format
Apache Drill:
Based on Dremel
In very early stages...
In-memory, distributed SQL query engine
Integrates into existing HDFS
Not Map/Reduce
Focus is on performance (native code)
Google Dremel -
Apache Drill -
TPC-DS dataset -
Stinger Initiative -
Cloudera Impala resources
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
Cloudera Impala