Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers SQL support, performance gains ranging from 3-4x on I/O-bound workloads up to 90x over MapReduce, and the flexibility to query existing Hadoop data without migrating or duplicating it. The Impala 2.0 release adds window functions, subqueries, and the ability to spill joins and aggregations to disk when memory is exhausted.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
1. 1
Cloudera Impala
LV Big Data Monthly Meetup #1
November 5th 2014
Maxime Dumas
Systems Engineer
2. Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
2
3. What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
3
4. What This Talk Isn’t About
• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms
4
7. cloud·e·ra im·pal·a
7
/kloudˈi(ə)rə imˈpalə/
noun
a modern, open source, MPP SQL query engine
for Apache Hadoop.
“Cloudera Impala provides fast, ad hoc SQL query
capability for Apache Hadoop, complementing
traditional MapReduce batch processing.”
8. Impala adoption
8
Vendor support by component (and founder):
Component (and Founder) | Cloudera | MapR | Amazon | IBM | Pivotal | Hortonworks
Impala (Cloudera) | ✔ | ✔ | ✔ | X | X | X
Hue (Cloudera) | ✔ | ✔ | X | X | X | ✔
Sentry (Cloudera) | ✔ | ✔ | X | ✔ | ✔ | X
Flume (Cloudera) | ✔ | ✔ | X | ✔ | ✔ | ✔
Parquet (Cloudera/Twitter) | ✔ | ✔ | X | ✔ | ✔ | X
Sqoop (Cloudera) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔
Ambari (Hortonworks) | X | X | X | X | ✔ | ✔
Knox (Hortonworks) | X | X | X | X | X | ✔
Tez (Hortonworks) | X | X | X | X | X | ✔
Drill (MapR) | X | ✔ | X | X | X | X
9. 9
The Apache Hadoop Ecosystem
Quick and dirty, for context.
11. Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
11
12. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
12
14. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
14
15. Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
15
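A minimal HiveQL sketch (the table and column names here are hypothetical): a query like this is compiled into one or more MapReduce jobs rather than executed directly.

-- Hive compiles this GROUP BY into a MapReduce job behind the scenes
SELECT visit_date, COUNT(*) AS page_views
FROM web_logs
GROUP BY visit_date;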
21. Cloudera Impala
• Interactive query on Hadoop
• think seconds, not minutes
• ANSI-92 standard SQL
• compatible with HiveQL
• Native MPP query engine
• built for low-latency queries
• HDFS and HBase storage
21
22. Cloudera Impala – Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching (example below)
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce
22
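As a hedged illustration of the HDFS caching point above (the table and cache pool names are hypothetical), a hot table can be pinned into the HDFS in-memory cache:

-- pin a frequently queried table into an HDFS cache pool (names illustrative)
ALTER TABLE web_logs SET CACHED IN 'impala_cache_pool';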
23. Benefits of Impala
Unlocks BI/analytics on Hadoop
• Interactive SQL in seconds
• Highly concurrent to handle 100s of users
Native Hadoop flexibility
• No data migration, conversion, or duplication required
• Query existing Hadoop data
• Run multiple frameworks on the same data at the same time
• Supports Parquet for best-of-breed columnar performance (example below)
Native MPP query engine designed into Hadoop:
• Unified Hadoop storage
• Unified Hadoop metadata (uses Hive and HCatalog)
• Unified Hadoop security
• Fine-grained role-based access controls with Sentry
Apache-licensed open source
Proven in Production
23
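As a sketch of the Parquet point above (table names are hypothetical), an existing text-format table can be converted with a single CREATE TABLE AS SELECT:

-- create a Parquet-format copy of a text table (names illustrative)
CREATE TABLE logs_parquet STORED AS PARQUET AS SELECT * FROM logs_text;
-- subsequent queries scan the columnar Parquet data
SELECT referrer, COUNT(*) FROM logs_parquet GROUP BY referrer;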
24. Cloudera Impala – Architecture
• Impala Daemon
• runs on every node
• handles client requests
• handles query planning & execution
• State Store Daemon
• provides name service
• metadata distribution
• used for finding data
24
26. Impala Query Execution
26
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
[Diagram: a SQL app connects over ODBC to an impalad; each of three impalad nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase, with the Hive Metastore, HDFS NameNode, and Statestore serving metadata.]
27. Impala Query Execution
27
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
[Diagram: the same three-node layout, with intermediate results streaming between the Query Executors and final query results streaming back to the SQL app.]
28. Cloudera Impala – Results
• Allows for fast iteration/discovery
• How much faster?
• 3-4x faster on I/O bound workloads
• up to 45x faster on multi-MR queries
• up to 90x faster when data is in the in-memory cache
28
29. Latest SQL Performance
Single User vs 10 User Response Time: Impala Times Faster (lower is better; time in seconds)

Engine | Single User | 10 Users | vs Impala (single / 10 users)
Impala | 5 | 11 | (baseline)
Spark SQL | 25 | 120 | 5.0x / 10.6x
Presto | 37 | 302 | 7.4x / 27.4x
Hive-on-Tez | 77 | 202 | 15.4x / 18.3x
Independent validation by IBM Research SQL-on-Hadoop VLDB paper:
“Impala’s database architecture provides significant performance gains”
29
30. Previous Milestones
Impala 1.0 (GA) – Spring 2013
Impala 1.1 (Security) – Summer 2013
Impala 1.2 (Usability) – Fall 2013
Impala 1.3 (Resource Management) – Spring 2014
Impala 1.4 (Extensibility) – Summer 2014
Impala 2.0 (SQL) – Fall 2014
[Timeline: progression toward Analytic Database Capabilities]
30
31. Cloudera Impala 2.0
Window Functions
“Aggregate function applied to a partition of the result set” (SQL 2003)
Ex:
sum(population) OVER (PARTITION BY city)
rank() OVER (PARTITION BY state ORDER BY population)
We’ve implemented most of the spec
• PARTITION BY, ORDER BY
• WINDOW
• PRECEDING, FOLLOWING
• ROWS
• Any number of analytic functions in one query
31
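A slightly fuller sketch of the analytic syntax above (table and column names are hypothetical):

-- rank cities within each state and compare each to the state total
SELECT city, state, population,
       rank() OVER (PARTITION BY state ORDER BY population DESC) AS state_rank,
       sum(population) OVER (PARTITION BY state) AS state_population
FROM cities;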
32. Cloudera Impala 2.0
Subqueries
A query that is part of another query. Ex:
select col from t1
where col in
(select c2 from t2)
Support:
• Correlated and uncorrelated subqueries.
• IN, NOT IN, EXISTS, NOT EXISTS
32
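Hedged examples of both subquery flavors (table and column names are hypothetical):

-- uncorrelated: the subquery is independent of the outer query
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders);

-- correlated: the subquery references the outer row
SELECT name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o
              WHERE o.customer_id = c.id AND o.total > 1000);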
33. Cloudera Impala 2.0
Spill to disk joins & aggregations
• Previously, if a query ran out of memory, Impala would abort it
• This meant some big joins (fact table to fact table) could never run.
• All operators that accumulate memory can now spill to disk if necessary.
• Order by (Impala 1.4)
• Join/Agg (Impala 2.0)
• Analytic Functions (Impala 2.0)
• Transparent to existing workloads
33
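A sketch of how spilling surfaces to users (MEM_LIMIT is a real Impala query option; the limit value and table names are illustrative):

-- cap per-node query memory; before 2.0, exceeding it aborted the query
SET MEM_LIMIT=2g;
-- a fact-to-fact join can now spill hash-join partitions to disk
SELECT s.store_id, SUM(r.amount)
FROM sales s JOIN returns r ON s.txn_id = r.txn_id
GROUP BY s.store_id;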
34. Cloudera Impala 2.1 +
34
• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
• MERGE statement – enables merging updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SET
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
• UDTFs (user-defined table functions) – for more advanced user functions and extensibility
• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of Impala's existing performance gains
• Parquet enhancements – continued performance gains including index pages
• Amazon S3 integration
39. 39
Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
Editor's Notes
Similar to the Red Hat model.
Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
Furthermore, for projects that carry the Apache License, openness does not always guarantee freedom from lock-in to a single support provider. For example, Drill, Knox, Tez, and Falcon are all open source, and all shipped by a single vendor – what’s a better example of “lock-in” than that?
We’re going to breeze through these really quick, just to show how Search plugs in later…
Lose a server, no problem. Lose a rack, no problem.
We’re going to breeze through these really quick, just to show how Search plugs in later…
More & Faster Value from Big Data
Provides an interactive BI/Analytics experience on Hadoop
Previously BI/Analytics was impractical due to the batch orientation of MapReduce
Enables more users to gain value from organizational data assets (SQL/BI users)
Makes more data available for analysis (raw data, multi-structured data, historical data)
Removes delays from data migration
Into specialized analytical DBMSs
Into proprietary file formats that happen to be stored in HDFS
Into transient in-memory stores
Flexibility
Query across existing data in Hadoop
HDFS and HBase
Access data immediately and directly in its native format
Select best-fit file formats
Use raw data formats when unsure of access patterns (text files, RCFiles, LZO)
Increase performance with optimized file formats when access patterns are known (Parquet, Avro)
Run multiple frameworks on the same data at the same time
All file formats are compatible across the entire Hadoop ecosystem – i.e. MapReduce, Pig, Hive, Impala, etc. on the same data at the same time
Cost Efficiency
Reduce movement, duplicate storage & compute
Data movement: no time or resource penalty for migrating data into specialized systems or formats
Duplicate storage: no need to duplicate data across systems or within the same system in different file formats
Compute: use the same compute resources as the rest of the Hadoop system –
You don’t need a separate set of nodes to run interactive query vs. batch processing (MapReduce)
You don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions
10% to 1% the cost of analytic DBMSs
Less than $1,000/TB
Full Fidelity Analysis
No loss of fidelity from aggregations or conforming to fixed schemas
If the attribute exists in the raw data, you can query against it
These run continuously, always ready. In C/C++ for the most part.
Impala 1.0
~SQL-92 (minus correlated sub-queries)
Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
Enterprise-readiness (authentication, ODBC/JDBC drivers, etc)
Service-level resource isolation with other Hadoop frameworks
Impala 1.1
Fine-grained, role-based authorization via Apache Sentry
Auditing (Impala 1.1.1 and CM 4.7+)
Impala 1.2
Custom language extensibility (UDFs, UDAFs)
Cost-based join-order optimization
On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
Impala 1.3 / CDH 5.0 (also has version for CDH 4.x)
Resource management
RANGE windows are not supported.
Range windows let you specify a range based on the current row’s value (as opposed to ROWS, which is ordinal).
Example:
sum(c) OVER (ORDER BY year RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)
Error: “RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW."
No UDA support
Not all aggregate functions are supported (ndv, etc)
Looking at both for 2.1.
All subqueries are rewritten as joins.
No “Independent evaluation”
We’ve added additional join types to support this:
LEFT/RIGHT ANTI-JOIN
RIGHT SEMI-JOIN
NULL AWARE LEFT ANTI JOIN
Subqueries are only supported in the WHERE clause.
Impala can’t always prove that a subquery returns one row:
select col limit 1 works
select min(col) works
select min(col) where x = 1 group by x doesn’t
Can manually add a limit 1 to the subquery (sketched below).
See docs for more details
These should all have error messages explaining why
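A sketch of that limit-1 workaround (table and column names are hypothetical):

-- rejected: Impala can't prove the grouped subquery yields one row
select col from t1 where col = (select min(c2) from t2 where x = 1 group by x);
-- accepted: the explicit limit makes the single-row guarantee visible
select col from t1 where col = (select min(c2) from t2 where x = 1 group by x limit 1);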
We implemented the common use cases.
Impala hash partitions the input to the operator, spilling partitions as necessary
When all the input is partitioned, Impala processes the partitions that are still in memory (did not spill)
Impala then processes the spilled partitions 1 by 1, repartitioning if necessary.
Impala tries to minimize the number of spilled bytes.
Peak memory usage occurs when the first spill happens.
It stays high until all the non-spilled partitions are handled,
then drops as the spilled partitions are processed 1 by 1.
We’re going to breeze through these really quick, just to show how Search plugs in later…