Cloudera Impala 
LV Big Data Monthly Meetup #1 
November 5th 2014 
Maxime Dumas 
Systems Engineer
Thirty Seconds About Max 
• Systems Engineer 
• aka Sales Engineer 
• SoCal, AZ, NV 
• former coder of PHP 
• teaches meditation + yoga 
• from Montreal, Canada 
What Does Cloudera Do? 
• product 
• distribution of Hadoop components, Apache licensed 
• enterprise tooling 
• support 
• training 
• services (aka consulting) 
• community 
What This Talk Isn’t About 
• deploying 
• Puppet, Chef, Ansible, homegrown scripts, intern labor 
• sizing & tuning 
• depends heavily on data and workload 
• coding 
• unless you count XML or CSV or SQL 
• algorithms 
What is Cloudera Impala? 
Image: IFCAR (public domain)
cloud·e·ra im·pal·a 
/kloudˈi(ə)rə imˈpalə/ 
a modern, open source, MPP SQL query engine 
for Apache Hadoop. 
“Cloudera Impala provides fast, ad hoc SQL query 
capability for Apache Hadoop, complementing 
traditional MapReduce batch processing.”
Impala adoption 
Vendor support by component (and founder)
Component (Founder)         Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)           ✔         ✔     ✔       X    X        X
Hue (Cloudera)              ✔         ✔     X       X    X        ✔
Sentry (Cloudera)           ✔         ✔     X       ✔    ✔        X
Flume (Cloudera)            ✔         ✔     X       ✔    ✔        ✔
Parquet (Cloudera/Twitter)  ✔         ✔     X       ✔    ✔        X
Sqoop (Cloudera)            ✔         ✔     ✔       ✔    ✔        ✔
Ambari (Hortonworks)        X         X     X       X    ✔        ✔
Knox (Hortonworks)          X         X     X       X    X        ✔
Tez (Hortonworks)           X         X     X       X    X        ✔
Drill (MapR)                X         ✔     X       X    X        X
The Apache Hadoop Ecosystem 
Quick and dirty, for context.
©2014 Cloudera, Inc. All rights reserved.
Why Hadoop? 
• Scalability 
• Scales simply by adding nodes 
• Local processing to avoid network bottlenecks 
• Efficiency 
• Cost efficiency (<$1k/TB) on commodity hardware 
• Unified storage, metadata, security (no duplication or synchronization) 
• Flexibility 
• All kinds of data (blobs, documents, records, etc.) 
• In all forms (structured, semi-structured, unstructured) 
• Store anything now, then analyze what you need later
Why “Ecosystem”? 
• In the beginning, just Hadoop 
• MapReduce 
• Today, dozens of interrelated components 
• I/O 
• Processing 
• Specialty Applications 
• Configuration 
• Workflow 
HDFS 
• Distributed, highly fault-tolerant filesystem 
• Optimized for large streaming access to data 
• Based on Google File System 
Lots of Commodity Machines 
Image: Yahoo! Hadoop cluster [OSCON ’07]
MapReduce (MR) 
• Programming paradigm 
• Batch oriented, not realtime 
• Works well with distributed computing 
• Lots of Java, but other languages supported 
• Based on Google’s MapReduce paper 
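The paradigm described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["impala runs on hadoop", "hive runs on hadoop"]))
```

On a real cluster the map and reduce calls would run in parallel on many machines, with the shuffle moving data over the network between them.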
Apache Hive 
• Abstraction of Hadoop’s Java API 
• HiveQL “compiles” down to MR 
• a “SQL-like” language 
• Eases analysis using MapReduce 
Apache Hive Metastore 
• Maps HDFS files to DB-like resources 
• Databases 
• Tables 
• Column/field names, data types 
• Roles/users 
• InputFormat/OutputFormat 
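As a rough mental model, the metastore can be pictured as a catalog that maps table names to files and schemas. The sketch below is a toy in-memory version under that assumption; ToyMetastore, Table, Column, and the HDFS path are hypothetical names for illustration and are not the Hive metastore API. The two format class names are shown only as example values.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    database: str
    name: str
    hdfs_location: str   # directory in HDFS that holds the table's files
    input_format: str    # how rows are read from those files
    output_format: str   # how rows are written back
    columns: list = field(default_factory=list)


class ToyMetastore:
    """Hypothetical catalog: (database, table) -> Table description."""

    def __init__(self):
        self._tables = {}

    def create_table(self, table):
        self._tables[(table.database, table.name)] = table

    def describe(self, database, name):
        """Return column names and types, similar in spirit to DESCRIBE."""
        table = self._tables[(database, name)]
        return [(column.name, column.data_type) for column in table.columns]

    def location(self, database, name):
        """Return where the table's data lives in HDFS."""
        return self._tables[(database, name)].hdfs_location


metastore = ToyMetastore()
metastore.create_table(Table(
    database="default",
    name="cities",
    hdfs_location="/user/hive/warehouse/cities",  # made-up path
    input_format="org.apache.hadoop.mapred.TextInputFormat",
    output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    columns=[Column("city", "string"), Column("population", "bigint")],
))
```

Because the catalog is separate from the data files, more than one engine can read the same table definitions.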
But wait… 
Cloudera Impala 
Familiar interface, but more powerful.
Cloudera Impala 
• Interactive query on Hadoop 
• think seconds, not minutes 
• Standard SQL (ANSI SQL-92) 
• compatible with HiveQL 
• Native MPP query engine 
• built for low-latency queries 
• HDFS and HBase storage 
Cloudera Impala – Design Choices 
• Native daemons, with the execution engine written in C++ 
• No MapReduce, and no JVM in the query execution engine (the planner frontend is Java) 
• Saturate disks on reads 
• Uses in-memory HDFS caching 
• Re-uses Hive metastore 
• Not as fault-tolerant as MapReduce 
Benefits of Impala 
Unlocks BI/analytics on Hadoop 
• Interactive SQL in seconds 
• Highly concurrent to handle 100s of users 
Native Hadoop flexibility 
• No data migration, conversion, or duplication required 
• Query existing Hadoop data 
• Run multiple frameworks on the same data at the same time 
• Supports Parquet for best-of-breed columnar performance 
Native MPP query engine designed into Hadoop: 
• Unified Hadoop storage 
• Unified Hadoop metadata (uses Hive and HCatalog) 
• Unified Hadoop security 
• Fine-grained role-based access controls with Sentry 
Apache-licensed open source 
Proven in 
Cloudera Impala – Architecture 
• Impala Daemon 
• runs on every node 
• handles client requests 
• handles query planning & execution 
• State Store Daemon 
• provides name service 
• metadata distribution 
• used for finding data 
Impala Query Execution 
1) Request arrives via ODBC/JDBC/Hue/Shell 
[Diagram: a SQL app sends a SQL request to one of three impalad nodes, each containing a Query Planner, Query Coordinator, and Query Executor; the HDFS NN and Statestore are shown alongside.]
Impala Query Execution 
2) Planner turns request into collections of plan fragments 
3) Coordinator initiates execution on impalad(s) local to data 
Impala Query Execution 
4) Intermediate results are streamed between impalad(s) 
5) Query results are streamed back to client 
[Diagram: three impalad nodes (Query Planner, Query Coordinator, Query Executor), SQL app, HDFS NN, and Statestore; query results flow back to the SQL app.]
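Steps 1 to 5 can be pictured with a small single-process simulation. This is a sketch under simplifying assumptions, not Impala's actual planner or RPC layer: the node names, the LOCAL_DATA layout, the population figures, and the functions plan, execute_fragment, and coordinate are all made up for illustration. The query being imitated is a per-state SUM with GROUP BY.

```python
from collections import Counter

# Hypothetical cluster: node name -> rows of (state, population) stored on that node.
# The numbers are made-up sample values.
LOCAL_DATA = {
    "node1": [("NV", 600_000), ("CA", 3_900_000)],
    "node2": [("NV", 250_000), ("AZ", 1_600_000)],
    "node3": [("CA", 1_400_000)],
}


def plan(nodes_with_data):
    """Step 2: turn the request into one plan fragment per node that holds data."""
    return [{"node": node, "op": "partial_sum_by_state"} for node in nodes_with_data]


def execute_fragment(fragment):
    """Step 3: an executor scans only its local rows and pre-aggregates them."""
    partial = Counter()
    for state, population in LOCAL_DATA[fragment["node"]]:
        partial[state] += population
    return partial


def coordinate():
    """Steps 3 to 5: start fragments, merge their partial results, return final rows."""
    fragments = plan(sorted(LOCAL_DATA))
    total = Counter()
    for fragment in fragments:
        # Counter.update adds counts, so partial sums from each node accumulate.
        total.update(execute_fragment(fragment))
    return sorted(total.items())


if __name__ == "__main__":
    for state, population in coordinate():
        print(state, population)
```

The point of the design is that each executor reads only data local to its node, so only small partial results cross the network.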
Cloudera Impala – Results 
• Allows for fast iteration/discovery 
• How much faster? 
• 3-4x faster on I/O bound workloads 
• up to 45x faster on multi-MR queries 
• up to 90x faster on in-memory cache 
Latest SQL Performance 
[Chart: Single User vs 10 User Response Time / Impala Times Faster. Engines: Impala, Spark SQL, Presto, Hive-on-Tez. Axis: time in seconds (lower bars = better). Bar labels in extraction order: Single User 5; 10 Users 11; Single User 25; 10 Users 120; 10 Users 302; 10 Users 202; Single User 37; Single User 77.]
Independent validation by IBM Research SQL-on-Hadoop VLDB paper: 
“Impala’s database architecture provides significant performance gains” 
Previous Milestones 
Impala 1.0 
Impala 1.1 
Impala 1.2 
Impala 1.3 
Impala 1.4 
Impala 2.0 
Analytic Database 
Cloudera Impala 2.0 
Window Functions 
“Aggregate function applied to a partition of the result set” (SQL 2003) 
sum(population) OVER (PARTITION BY city) 
rank() OVER (PARTITION BY state ORDER BY population) 
We’ve implemented most of the spec 
• Any number of analytic functions in one query 
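To see what these two expressions return, here is a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in, since a live Impala cluster cannot be assumed here; it needs SQLite 3.25 or newer for window functions. The cities table and its rows are made up, and the sum is partitioned by state rather than by city so the totals are visible. The OVER clauses follow standard SQL, so the same query shape is expected to work in Impala 2.0, though that is an assumption rather than something verified here.

```python
import sqlite3

# In-memory stand-in database with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (state TEXT, city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("NV", "A", 300), ("NV", "B", 100), ("CA", "C", 500), ("CA", "D", 200)],
)


def ranked_cities(connection):
    """Return each city with its state's total population and its rank within the state."""
    query = """
        SELECT state, city, population,
               sum(population) OVER (PARTITION BY state) AS state_total,
               rank() OVER (PARTITION BY state ORDER BY population DESC) AS pop_rank
        FROM cities
        ORDER BY state, pop_rank
    """
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    for row in ranked_cities(conn):
        print(row)
```

Unlike GROUP BY, the window functions keep one output row per input row and attach the aggregate to each of them.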
Cloudera Impala 2.0 
Subqueries 
A query that is part of another query. Example: 
select col from t1 
where col in 
(select c2 from t2) 
• Correlated and uncorrelated subqueries. 
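The difference between the two kinds of subquery can be shown with a runnable sketch. Python's built-in sqlite3 module is used as a stand-in for Impala; the tables t1 and t2 reuse the names from the slide, while their contents and the correlated query are made up for illustration.

```python
import sqlite3


def build_db():
    """Create an in-memory database with two small made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (col INTEGER)")
    conn.execute("CREATE TABLE t2 (c2 INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,), (4,)])
    conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (4,), (5,)])
    return conn


# Uncorrelated: the inner query does not reference the outer row,
# so it could be evaluated once and reused.
UNCORRELATED = "SELECT col FROM t1 WHERE col IN (SELECT c2 FROM t2) ORDER BY col"

# Correlated: the inner query references t1.col from the outer row,
# so its result depends on which outer row is being tested.
CORRELATED = """
    SELECT col FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.c2 = t1.col + 1)
    ORDER BY col
"""


def run(conn, sql):
    """Execute a single-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    db = build_db()
    print(run(db, UNCORRELATED))
    print(run(db, CORRELATED))
```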
Cloudera Impala 2.0 
Spill to disk joins & aggregations 
• Previously, if a query ran out of memory, Impala would abort it 
• This meant some big joins (fact table to fact table) could never run. 
• All operators that accumulate memory can now spill to disk if necessary: 
• Order by (Impala 1.4) 
• Join/Agg (Impala 2.0) 
• Analytic Functions (Impala 2.0) 
• Transparent to existing workloads 
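The general idea of a spilling aggregation can be sketched as follows. This is a simplified illustration of the technique, not Impala's implementation: the memory limit is modeled as a maximum number of groups, keys are assumed to be strings, and spilling_sum, max_groups_in_memory, and num_partitions are names invented for this example.

```python
import json
import tempfile
from collections import defaultdict


def spilling_sum(rows, max_groups_in_memory, num_partitions=4):
    """Sum values per string key, spilling partial sums to disk when the table grows too large.

    Returns (result_dict, spilled_flag).
    """
    in_memory = defaultdict(int)
    spill_files = [tempfile.TemporaryFile(mode="w+") for _ in range(num_partitions)]
    spilled = False

    def spill():
        # Write each partial sum to the partition file chosen by the key's hash,
        # then free the in-memory table.
        for key, partial in in_memory.items():
            target = spill_files[hash(key) % num_partitions]
            target.write(json.dumps([key, partial]) + "\n")
        in_memory.clear()

    for key, value in rows:
        in_memory[key] += value
        if len(in_memory) > max_groups_in_memory:
            spill()
            spilled = True

    if not spilled:
        result = dict(in_memory)  # everything fit in memory, no disk I/O needed
    else:
        spill()  # flush the remainder so all partial sums are on disk
        result = {}
        # Each key always lands in the same partition, so partitions can be
        # finished one at a time. A real engine would stream each partition's
        # output instead of collecting it into one dictionary.
        for spill_file in spill_files:
            spill_file.seek(0)
            partition = defaultdict(int)
            for line in spill_file:
                key, partial = json.loads(line)
                partition[key] += partial
            result.update(partition)

    for spill_file in spill_files:
        spill_file.close()
    return result, spilled


if __name__ == "__main__":
    sample = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    print(spilling_sum(sample, max_groups_in_memory=1))
    print(spilling_sum(sample, max_groups_in_memory=10))
```

The caller gets the same answer either way, which is what "transparent to existing workloads" suggests: spilling trades speed for the ability to finish the query.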
Cloudera Impala 2.1 + 
• Nested data – enables queries on complex nested structures including maps, structs, 
and arrays (early 2015) 
• MERGE statement – enables merging updates into existing tables 
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS 
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase 
• UDTFs (user-defined table functions) – for more advanced user functions and 
• Intra-node parallelized aggregations and joins – to provide even faster joins and 
aggregations on top of the performance gains of Impala 
• Parquet enhancements – continued performance gains including index pages 
• Amazon S3 integration
Quick Demo 
Hold onto something, folks.
Apache-licensed open source 
• Download: 
• Email: 
• Join: 
Cloudera Live 
Free, Interactive Tutorials at 
Try It Out
Special thanks: 
Preferably related to the talk… or not.
Thank You! 
Maxime Dumas 
We’re hiring.

Mastering MicroStation DGN: How to Integrate CAD and GISMastering MicroStation DGN: How to Integrate CAD and GIS
Mastering MicroStation DGN: How to Integrate CAD and GIS
Safe Software
Unlocking value with event-driven architecture by Confluent
Unlocking value with event-driven architecture by ConfluentUnlocking value with event-driven architecture by Confluent
Unlocking value with event-driven architecture by Confluent
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Benjamin Bischoff
Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()
Gene Gotimer
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
Andre Hora
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
David D. Scott
03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdfWaze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Ben Ramedani
Literals - A Machine Independent Feature
Literals - A Machine Independent FeatureLiterals - A Machine Independent Feature
Literals - A Machine Independent Feature
09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf
CrushFTP PC Software - WhizNews
CrushFTP PC Software - WhizNewsCrushFTP PC Software - WhizNews
CrushFTP PC Software - WhizNews
Eman Nisar

Recently uploaded (20)

OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...
OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...
OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...
New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Understanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdfUnderstanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdf
01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
Three available editions of Windows Servers crucial to your organization’s op...
Three available editions of Windows Servers crucial to your organization’s op...Three available editions of Windows Servers crucial to your organization’s op...
Three available editions of Windows Servers crucial to your organization’s op...
Mastering MicroStation DGN: How to Integrate CAD and GIS
Mastering MicroStation DGN: How to Integrate CAD and GISMastering MicroStation DGN: How to Integrate CAD and GIS
Mastering MicroStation DGN: How to Integrate CAD and GIS
Unlocking value with event-driven architecture by Confluent
Unlocking value with event-driven architecture by ConfluentUnlocking value with event-driven architecture by Confluent
Unlocking value with event-driven architecture by Confluent
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching03. Ruby Variables & Regex - Ruby Core Teaching
03. Ruby Variables & Regex - Ruby Core Teaching
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdfWaze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Literals - A Machine Independent Feature
Literals - A Machine Independent FeatureLiterals - A Machine Independent Feature
Literals - A Machine Independent Feature
09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf
CrushFTP PC Software - WhizNews
CrushFTP PC Software - WhizNewsCrushFTP PC Software - WhizNews
CrushFTP PC Software - WhizNews

Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

  • 1. Cloudera Impala LV Big Data Monthly Meetup #1 November 5th 2014 Maxime Dumas Systems Engineer
  • 2. Thirty Seconds About Max • Systems Engineer • aka Sales Engineer • SoCal, AZ, NV • former coder of PHP • teaches meditation + yoga • from Montreal, Canada 2
  • 3. What Does Cloudera Do? • product • distribution of Hadoop components, Apache licensed • enterprise tooling • support • training • services (aka consulting) • community 3
  • 4. What This Talk Isn’t About • deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor • sizing & tuning • depends heavily on data and workload • coding • unless you count XML or CSV or SQL • algorithms 4
  • 5. What is Cloudera Impala? 5
  • 7. cloud·e·ra im·pal·a 7 /kloudˈi(ə)rə imˈpalə/ noun a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
  • 8. Impala adoption: vendor support by component (and founder)
      Component (Founder)         Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
      Impala (Cloudera)           ✔         ✔     ✔       X    X        X
      Hue (Cloudera)              ✔         ✔     X       X    X        ✔
      Sentry (Cloudera)           ✔         ✔     X       ✔    ✔        X
      Flume (Cloudera)            ✔         ✔     X       ✔    ✔        ✔
      Parquet (Cloudera/Twitter)  ✔         ✔     X       ✔    ✔        X
      Sqoop (Cloudera)            ✔         ✔     ✔       ✔    ✔        ✔
      Ambari (Hortonworks)        X         X     X       X    ✔        ✔
      Knox (Hortonworks)          X         X     X       X    X        ✔
      Tez (Hortonworks)           X         X     X       X    X        ✔
      Drill (MapR)                X         ✔     X       X    X        X
  • 9. 9 The Apache Hadoop Ecosystem Quick and dirty, for context.
  • 10. ©2014 Cloudera, Inc. All rights reserved. Why Hadoop? • Scalability • Simply scales just by adding nodes • Local processing to avoid network bottlenecks • Efficiency • Cost efficiency (<$1k/TB) on commodity hardware • Unified storage, metadata, security (no duplication or synchronization) • Flexibility • All kinds of data (blobs, documents, records, etc) • In all forms (structured, semi-structured, unstructured) • Store anything then later analyze what you need
  • 11. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow 11
  • 12. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • 12
  • 13. Lots of Commodity Machines 13 Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
  • 14. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • 14
  • 15. Apache Hive • Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR • a “SQL-like” language • Eases analysis using MapReduce 15
  • 16. Apache Hive Metastore • Maps HDFS files to DB-like resources • Databases • Tables • Column/field names, data types • Roles/users • InputFormat/OutputFormat 16
  • 18. But wait… WHY DO WE NEED THIS? 18
  • 19. 19
  • 20. 20 Cloudera Impala Familiar interface, but more powerful.
  • 21. Cloudera Impala • Interactive query on Hadoop • think seconds, not minutes • ANSI-92 standard SQL • compatible with HiveQL • Native MPP query engine • built for low-latency queries • HDFS and HBase storage 21
  • 22. Cloudera Impala – Design Choices • Native daemons, written in C/C++ • No JVM, no MapReduce • Saturate disks on reads • Uses in-memory HDFS caching • Re-uses Hive metastore • Not as fault-tolerant as MapReduce 22
  • 23. Benefits of Impala Unlocks BI/analytics on Hadoop • Interactive SQL in seconds • Highly concurrent to handle 100s of users Native Hadoop flexibility • No data migration, conversion, or duplication required • Query existing Hadoop data • Run multiple frameworks on the same data at the same time • Supports Parquet for best-of-breed columnar performance Native MPP query engine designed into Hadoop: • Unified Hadoop storage • Unified Hadoop metadata (uses Hive and HCatalog) • Unified Hadoop security • Fine-grained role-based access controls with Sentry Apache-licensed open source Proven in Production 23
  • 24. Cloudera Impala – Architecture • Impala Daemon • runs on every node • handles client requests • handles query planning & execution • State Store Daemon • provides name service • metadata distribution • used for finding data 24
  • 25. Impala Query Execution 1) Request arrives via ODBC/JDBC/HUE/Shell [Diagram: a SQL app sends a SQL request over ODBC to one of three nodes, each running a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase, alongside the Hive Metastore, HDFS NN, and Statestore]
  • 26. Impala Query Execution 2) Planner turns request into collections of plan fragments 3) Coordinator initiates execution on impalad(s) local to data [Diagram: same cluster layout as the previous slide]
  • 27. Impala Query Execution 4) Intermediate results are streamed between impalad(s) 5) Query results are streamed back to client [Diagram: same cluster layout, with query results returned to the SQL app]
  • 28. Cloudera Impala – Results • Allows for fast iteration/discovery • How much faster? • 3-4x faster on I/O bound workloads • up to 45x faster on multi-MR queries • up to 90x faster on in-memory cache 28
  • 29. Latest SQL Performance: Single User vs 10 User Response Time / Impala Times Faster (time in seconds; lower bars = better)
      Engine       Single User   10 Users
      Impala       5             11
      Spark SQL    25 (5.0x)     120 (10.6x)
      Presto       37 (7.4x)     302 (27.4x)
      Hive-on-Tez  77 (15.4x)    202 (18.3x)
      Independent validation by IBM Research SQL-on-Hadoop VLDB paper: “Impala’s database architecture provides significant performance gains”
  • 30. Previous Milestones (Analytic Database Capabilities) • Impala 1.0 (GA): Spring 2013 • Impala 1.1 (Security): Summer 2013 • Impala 1.2 (Usability): Fall 2013 • Impala 1.3 (Resource Management): Spring 2014 • Impala 1.4 (Extensibility): Summer 2014 • Impala 2.0 (SQL): Fall 2014
  • 31. Cloudera Impala 2.0 Window Functions “Aggregate function applied to a partition of the result set” (SQL 2003) Ex: sum(population) OVER (PARTITION BY city) • rank() OVER (PARTITION BY state ORDER BY population) We’ve implemented most of the spec • PARTITION BY, ORDER BY • WINDOW • PRECEDING, FOLLOWING • ROWS • Any number of analytic functions in one query
  • 32. Cloudera Impala 2.0 Subqueries A query that is part of another query. Ex: select col from t1 where col in (select c2 from t2) Support: • Correlated and uncorrelated subqueries. • IN, NOT IN, EXISTS, NOT EXISTS 32
  • 33. Cloudera Impala 2.0 Spill to disk joins & aggregations • Previously, if a query ran out of memory, Impala would abort it • This meant some big joins (fact table to fact table) could never run • All operators that accumulate memory can now spill to disk if necessary • Order by (Impala 1.4) • Join/Agg (Impala 2.0) • Analytic Functions (Impala 2.0) • Transparent to existing workloads
  • 34. Cloudera Impala 2.1 + • Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015) • MERGE statement – enables merging updates into existing tables • Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS • SQL SET operators – MINUS, INTERSECT • Apache HBase CRUD – allows use of Impala for inserts and updates into HBase • UDTFs (user-defined table functions) – for more advanced user functions and extensibility • Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala • Parquet enhancements – continued performance gains including index pages • Amazon S3 integration
  • 35. 35 Quick Demo Hold onto something, folks.
  • 36. Apache-licensed open source • Download: • Email: • Join: Cloudera Live Free, Interactive Tutorials at ©2014 Cloudera, Inc. All rights reserved. Try It Out
  • 37. Special thanks: LAS VEGAS BIG DATA 37
  • 38. 38 Questions? Preferably related to the talk… or not.
  • 39. 39 Thank You! Maxime Dumas We’re hiring.
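
The window-function examples on slide 31 can be sketched in plain Python to show what "an aggregate function applied to a partition of the result set" means. This only illustrates the semantics and is not how Impala implements them; the sample rows and the helper names `sum_over_partition` and `rank_over_partition` are assumptions made up for this example.

```python
from collections import defaultdict

def sum_over_partition(rows, partition_key, value_key):
    """Emulate sum(value) OVER (PARTITION BY partition_key).

    Unlike GROUP BY, every input row is kept; each row is annotated
    with the total of the partition it belongs to.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[row[partition_key]] += row[value_key]
    return [dict(row, partition_sum=totals[row[partition_key]]) for row in rows]

def rank_over_partition(rows, partition_key, order_key):
    """Emulate rank() OVER (PARTITION BY partition_key ORDER BY order_key).

    Ties share a rank and leave a gap afterwards (1, 1, 3, ...).
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    ranked = []
    for key in sorted(partitions):
        ordered = sorted(partitions[key], key=lambda r: r[order_key])
        rank, previous = 0, object()  # sentinel that never equals a real value
        for position, row in enumerate(ordered, start=1):
            if row[order_key] != previous:
                rank, previous = position, row[order_key]
            ranked.append(dict(row, rank=rank))
    return ranked

# Hypothetical sample data: one row per district.
rows = [
    {"state": "NV", "city": "Las Vegas", "population": 300},
    {"state": "NV", "city": "Las Vegas", "population": 200},
    {"state": "NV", "city": "Reno", "population": 200},
    {"state": "AZ", "city": "Phoenix", "population": 500},
]

for row in sum_over_partition(rows, "city", "population"):
    print(row["city"], row["population"], row["partition_sum"])
for row in rank_over_partition(rows, "state", "population"):
    print(row["state"], row["city"], row["population"], row["rank"])
```

The point of the sketch is that a window function keeps one output row per input row, which is what separates it from a plain GROUP BY aggregate.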

Editor's Notes

  1. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation,
  2. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation,
  3. Furthermore, for projects that carry the Apache License, open-ness does not always guarantee freedom from lock-in to a single support provider. For example, Drill, Knox, Tez, and Falcon are all open source, and all shipped by a single vendor – what’s a better example of “lock-in” than that?
  4. We’re going to breeze through these really quick, just to show how Search plugs in later…
  5. Lose a server, no problem. Lose a rack, no problem.
  6. We’re going to breeze through these really quick, just to show how Search plugs in later…
  7. More & Faster Value from Big Data
     • Provides an interactive BI/Analytics experience on Hadoop
     • Previously BI/Analytics was impractical due to the batch orientation of MapReduce
     • Enables more users to gain value from organizational data assets (SQL/BI users)
     • Makes more data available for analysis (raw data, multi-structured data, historical data)
     • Removes delays from data migration: into specialized analytical DBMSs, into proprietary file formats that happen to be stored in HDFS, into transient in-memory stores
     Flexibility
     • Query across existing data in Hadoop HDFS and HBase
     • Access data immediately and directly in its native format
     • Select best-fit file formats
     • Use raw data formats when unsure of access patterns (text files, RCFiles, LZO)
     • Increase performance with optimized file formats when access patterns are known (Parquet, Avro)
     • Run multiple frameworks on the same data at the same time
     • All file formats are compatible across the entire Hadoop ecosystem, i.e. MapReduce, Pig, Hive, Impala, etc.
     Cost Efficiency
     • Reduce movement, duplicate storage & compute
     • Data movement: no time or resource penalty for migrating data into specialized systems or formats
     • Duplicate storage: no need to duplicate data across systems or within the same system in different file formats
     • Compute: use the same compute resources as the rest of the Hadoop system
     • You don’t need a separate set of nodes to run interactive query vs. batch processing (MapReduce)
     • You don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions
     • 10% to 1% the cost of an analytic DBMS
     • Less than $1,000/TB
     Full Fidelity Analysis
     • No loss of fidelity from aggregations or conforming to fixed schemas
     • If the attribute exists in the raw data, you can query against it
  8. These run continuously, always ready. In C/C++ for the most part.
  9. Impala 1.0: ~SQL-92 (minus correlated sub-queries); native Hadoop file formats (Parquet, Avro, text, Sequence, …); enterprise-readiness (authentication, ODBC/JDBC drivers, etc); service-level resource isolation with other Hadoop frameworks. Impala 1.1: fine-grained, role-based authorization via Apache Sentry; auditing (Impala 1.1.1 and CM 4.7+). Impala 1.2: custom language extensibility (UDFs, UDAFs); cost-based join-order optimization; on-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility. Impala 1.3 / CDH 5.0 (also has version for CDH 4.x): resource management.
  10. RANGE windows are not supported. RANGE windows let you specify a range based on the current row’s value (as opposed to ROWS, which is the ordinal position). Example: sum(c) OVER (ORDER BY year RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING). Error: “RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.” No UDA support. Not all aggregate functions are supported (ndv, etc). Looking at both for 2.1.
  11. All subqueries are rewritten as joins; there is no “independent evaluation”. We’ve added additional join types to support this: LEFT/RIGHT ANTI-JOIN, RIGHT SEMI-JOIN, NULL AWARE LEFT ANTI JOIN. Subqueries are only supported in the WHERE clause. Impala can’t reason about whether a subquery returns one row in all cases: “select col limit 1” works; “select min(col)” works; “select min(col) group by x where x = 1” doesn’t. You can manually add a limit 1 to the subquery. See the docs for more details. These should all have error messages explaining why. We implemented the common use cases.
  12. Impala hash-partitions the input to the operator, spilling partitions as necessary. When all the input is partitioned, Impala processes the partitions that are still in memory (did not spill). Impala then processes the spilled partitions one by one, repartitioning if necessary. Impala tries to minimize the number of spilled bytes. Peak memory usage occurs when the first spill happens, stays high until all the non-spilled partitions have been handled, and then drops as the spilled partitions are handled one by one.
  13. We’re going to breeze through these really quick, just to show how Search plugs in later…
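
The spilling strategy described in note 12 can be sketched as a hash-partitioned aggregation that writes partitions to temporary files once a memory budget is exceeded. This is a toy illustration under stated assumptions, not Impala's operator: `spilling_count`, `num_partitions`, and `max_rows_in_memory` are hypothetical names, the budget counts buffered rows rather than bytes, the largest in-memory partition is always chosen as the spill victim, and the repartitioning step for spilled partitions that still do not fit is left out.

```python
import pickle
import tempfile
from collections import Counter

def spilling_count(keys, num_partitions=4, max_rows_in_memory=6):
    """Count occurrences of each key while buffering a bounded number of rows."""
    in_memory = {p: [] for p in range(num_partitions)}  # partition id -> buffered keys
    spilled = {}                                        # partition id -> temp file
    buffered = 0
    stats = {"spilled_partitions": 0, "spilled_rows": 0}

    # Phase 1: hash-partition the input, spilling partitions as necessary.
    for key in keys:
        part = hash(key) % num_partitions
        if part in spilled:
            # This partition already lives on disk: append the row there.
            pickle.dump(key, spilled[part])
            stats["spilled_rows"] += 1
            continue
        in_memory[part].append(key)
        buffered += 1
        if buffered > max_rows_in_memory:
            # Over budget: move the largest in-memory partition to disk.
            victim = max(in_memory, key=lambda p: len(in_memory[p]))
            spill_file = tempfile.TemporaryFile()
            for row in in_memory.pop(victim):
                pickle.dump(row, spill_file)
                buffered -= 1
                stats["spilled_rows"] += 1
            spilled[victim] = spill_file
            stats["spilled_partitions"] += 1

    result = Counter()
    # Phase 2: aggregate the partitions that never spilled.
    for rows in in_memory.values():
        result.update(rows)
    # Phase 3: read the spilled partitions back one by one.
    for spill_file in spilled.values():
        spill_file.seek(0)
        partition_counts = Counter()
        while True:
            try:
                partition_counts[pickle.load(spill_file)] += 1
            except EOFError:
                break
        spill_file.close()
        result.update(partition_counts)
    return dict(result), stats

if __name__ == "__main__":
    sample_keys = [i % 5 for i in range(40)]  # hypothetical input
    counts, stats = spilling_count(sample_keys)
    print(counts, stats)
```

Because every row for a given key hashes to the same partition, each spilled partition can be aggregated on its own when it is read back, which is why the operator never needs the whole input in memory at once.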