Real-time Big Data Application Architecture Workshop
Etu Principal Consultant 陳昭宇
Oct 8, 2014
Workshop Goal 
Let’s talk about the 3Vs of Big Data.
Hadoop is good for Volume and Variety.
But how about Velocity?
That is why we are sitting here.
Target Audience 
• CTO 
• Architect 
• Software/Application Developer 
• IT
Background Knowledge 
• Linux operating system
• Basic Hadoop ecosystem knowledge 
• Basic knowledge of SQL 
• Java or Python programming experience


Terminology
• Hadoop: Open source big data platform
• HDFS: Hadoop Distributed Filesystem
• MapReduce: Parallel computing framework on top of HDFS
• HBase: NoSQL database on top of Hadoop 
• Impala: MPP SQL query engine on top of Hadoop 
• Spark: In-memory cluster computing engine 
• Hive: SQL to MapReduce translator 
• Hive Metastore: Database that stores table schema 
• Hive QL: A SQL subset
Agenda
• What is Hadoop, and what’s wrong with Hadoop for real-time workloads?
• What is Impala? 
• Hands-on Impala 
• What is Spark? 
• Hands-on Spark 
• Spark and Impala work together 
• Q & A
What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
Core Hadoop system components:
• HDFS: A fault-tolerant, scalable clustered storage
• MapReduce: A distributed computing framework
Flexible for storing and mining any type of data
• Ask questions across structured and unstructured data
• Schema-less
Processing Complex Big Data
• Scale-out architecture divides workloads across nodes.
• Flexible file system eliminates ETL bottlenecks.
Scales Economically
• Deploy on commodity hardware
• Open sourced platform
Limitations of MapReduce 
• Batch oriented 
• High latency 
• Doesn’t fit all cases 
• Only for developers

Pig and Hive 
• MapReduce (MR) is hard and only for developers
• High-level abstractions that convert declarative syntax to MR
– SQL: Hive
– Dataflow language: Pig
• Built on top of MapReduce
Goals
• General-purpose SQL engine:
– Works for both analytics and transactional/single-row workloads. 
– Supports queries that take from milliseconds to hours. 
• Runs directly within Hadoop and: 
– Reads widely used Hadoop file formats. 
– Runs on same nodes that run Hadoop processes. 
• High performance: 
– C++ instead of Java 
– Runtime code generation 
– Completely new execution engine – No MapReduce
What is Impala 
• General-purpose SQL engine 
• Real-time queries in Apache Hadoop 
• Beta version released in Oct. 2012
• GA since Apr. 2013 
• Apache licensed 
• Latest release v1.4.2
Impala Overview 
• Distributed service in the cluster: one Impala daemon (impalad) on each data node
• No SPOF (single point of failure)
• User submits a query via ODBC/JDBC, CLI, or HUE to any of the daemons.
• Query is distributed to all nodes with data locality.
• Uses Hive’s metadata interfaces and connects to the Hive metastore.
• Supported file formats: 
– Uncompressed/lzo-compressed text files 
– Sequence files and RCFile with snappy/gzip, Avro 
– Parquet columnar format

Impala’s SQL 
• High compatibility with HiveQL 
• SQL support: 
– Essential SQL-92, minus correlated subqueries
– INSERT INTO … SELECT …
– Only equi-joins; no non-equi-joins, no cross products
– ORDER BY requires LIMIT (not required after 1.4.2)
– Limited DDL support 
– SQL-style authorization via Apache Sentry 
– UDFs and UDAFs are supported
Impala’s SQL limitations 
• No custom file formats or SerDes
• Nothing beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json)
• Joins: broadcast joins and partitioned hash joins are supported (the smaller table has to fit in the aggregate memory of all executing nodes)
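The two join strategies named on this slide can be sketched in plain Python. This is only an illustration of the idea under simplifying assumptions, not Impala's implementation: the row layout, the `NUM_NODES` constant, and all function names are made up for the example.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def hash_join(build_rows, probe_rows):
    """Join two lists of (key, value) rows with an in-memory hash table."""
    table = defaultdict(list)
    for key, value in build_rows:           # build phase: must fit in memory
        table[key].append(value)
    return [(key, probe_value, build_value)
            for key, probe_value in probe_rows   # probe phase: stream the big side
            for build_value in table.get(key, [])]


def broadcast_join(small, big_partitions):
    """Every node receives a full copy of the small table."""
    out = []
    for part in big_partitions:              # one partition of the big table per node
        out.extend(hash_join(small, part))
    return sorted(out)


def partitioned_join(left, right, num_nodes=NUM_NODES):
    """Both sides are hash-partitioned on the join key, then joined per node."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for row in left:
        left_parts[hash(row[0]) % num_nodes].append(row)
    for row in right:
        right_parts[hash(row[0]) % num_nodes].append(row)
    out = []
    for node in range(num_nodes):
        out.extend(hash_join(right_parts[node], left_parts[node]))
    return sorted(out)
```

In both variants the build side lives in memory, which is why the smaller table has to fit in the memory available across the executing nodes.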
Work with HBase 
• Functionality highlights:
– Support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES (…)
– Predicates on rowkey columns are mapped into start/stop rows
– Predicates on other columns are mapped into SingleColumnValueFilters
• But the mapping between HBase tables and metastore tables is patterned after Hive:
– All data is stored as scalars and in ASCII.
– The rowkey needs to be mapped into a single string column.
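As a rough illustration of how rowkey predicates can become start/stop rows, the plain-Python sketch below turns a few comparison predicates into an inclusive start row and an exclusive stop row, then simulates the range scan. The predicate encoding and the `scan_bounds` and `scan` helpers are hypothetical; they are not Impala or HBase APIs.

```python
def scan_bounds(predicates):
    """Turn rowkey predicates into (start_row, stop_row) for a range scan.

    predicates: list of (op, value) with op in {"=", ">=", "<"}.
    start_row is inclusive and stop_row is exclusive; None means unbounded.
    """
    start, stop = None, None
    for op, value in predicates:
        if op == "=":
            lo, hi = value, value + "\x00"   # smallest key that sorts after `value`
        elif op == ">=":
            lo, hi = value, None
        elif op == "<":
            lo, hi = None, value
        else:
            raise ValueError(f"unsupported operator: {op}")
        if lo is not None and (start is None or lo > start):
            start = lo                       # keep the tightest lower bound
        if hi is not None and (stop is None or hi < stop):
            stop = hi                        # keep the tightest upper bound
    return start, stop


def scan(rows, start, stop):
    """Simulate a range scan over (rowkey, value) rows sorted by rowkey."""
    return [(k, v) for k, v in sorted(rows)
            if (start is None or k >= start) and (stop is None or k < stop)]
```

Because the rowkey is a single string column, only predicates on that column can narrow the scan range this way; predicates on other columns have to be evaluated as filters on each scanned row.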
HBase in Roadmap 
• Full support for UPDATE and DELETE. 
• Storage of structured data to minimize storage and access overhead.
• Composite rowkey encoding mapped into an arbitrary 
number of table columns.

Impala’s Architecture
(Diagram: a SQL client connects via ODBC to the cluster. Each node runs a Query Planner, Query Coordinator, and Query Executor next to the HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.)
1. Request arrives via ODBC/JDBC/Beeswax/Shell.
2. Planner turns the request into collections of plan fragments.
3. Coordinator initiates execution on impalad(s) local to data.
4. Intermediate results are streamed between impalad(s).
5. Query results are streamed back to the client.
Metadata Handling 
• Impala metadata 
– Hive’s metastore: Logical metadata (table definitions, columns, 
CREATE TABLE parameters) 
– HDFS NameNode: Directory contents and block replica locations 
– HDFS DataNode: Block replicas’ volume ids
• Caches metadata: No synchronous metastore API calls 
during query execution 
• Impala instances read metadata from metastore at startup. 
• Catalog Service relays metadata when you run DDL or 
update metadata on one of the impalads.

Metadata Handling – Cont. 
• REFRESH [<tbl>]: Reloads metadata on all impalads (if you 
added new files via Hive) 
• INVALIDATE METADATA: Reloads metadata for all tables
Comparing Impala to Dremel 
• What is Dremel? 
– Columnar storage for data with nested structures 
– Distributed scalable aggregation on top of that 
• Columnar storage in Hadoop: Parquet 
– Store data in appropriate native/binary types 
– Can also store nested structures similar to Dremel’s ColumnIO 
• Distributed aggregation: Impala 
• Impala plus Parquet: a superset of the published version of Dremel (the published Dremel does not support joins)
Comparing Impala to Hive 
• Hive: MapReduce as an execution engine 
– High latency, low throughput queries 
– Fault-tolerance model based on MapReduce’s on-disk checkpointing; materializes all intermediate results
– Java runtime allows for easy late-binding of functionality: file formats 
and UDFs 
– Extensive layering imposes high runtime overhead 
• Impala: 
– Direct, process-to-process data exchange 
– No fault tolerance 
– An execution engine designed for low runtime overhead
Impala and Hive 
Shares Everything Client-Facing
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• GUI
But Built for Different Purposes
• Hive: Runs on MapReduce and is ideal for batch processing
• Impala: Native MPP query engine, ideal for interactive SQL
(Diagram: both share resource management, data ingestion, and the data store (HDFS and HBase; TEXT, RCFILE, PARQUET, AVRO, etc.). Hive provides SQL syntax on top of the MapReduce compute framework; Impala provides SQL syntax plus its own compute framework.)


Typical Use Cases 
• Data Warehouse Offload 
• Ad-hoc Analytics 
• Provide SQL interoperability to HBase
Hands-on Impala 
• Query a file on HDFS with Impala 
• Query a table on HBase with Impala
What is Spark? 
• MapReduce Review… 
• Apache Spark… 
• How Spark Works… 
• Fault Tolerance and Performance… 
• Examples… 
• Spark & More…
MapReduce: Good 
The Good: 
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API

MapReduce: Bad 
The Bad: 
• Optimized for disk IO
– Does not leverage memory
– Iterative algorithms go through the disk IO path again and again
• Primitive API
– Developers have to build on a very simple abstraction
– Key/Value in/out
– Even basic things like a join require extensive code
• A common result is many files that need to be combined appropriately
Apache Spark 
• Originally developed in 2009 at UC Berkeley’s AMP Lab.
• Fully open sourced in 2010; now at the Apache Software Foundation.
Spark: Easy and Fast Big Data 
• Easy to Develop
– Rich APIs in Java, Scala, Python
– Interactive Shell
• 2-5x less code
• Fast to Run
– General execution graph
– In-memory store
How Spark Works – SparkContext 
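(Diagram: the driver program holds the SparkContext and talks to the Cluster Master. Each Spark Worker runs an Executor with a Cache and executes Tasks next to an HDFS Data Node.)
Driver program:
sc = new SparkContext
rdd = sc.textFile("hdfs://..")
rdd.filter(…)
rdd.cache(…)
rdd.count(…)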


How Spark Works – RDD 
RDD (Resilient Distributed Dataset)
Storage Types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, …
• Fault tolerant
• Controlled partitioning to optimize data placement
• Manipulated by using a rich set of operators
• Partitions of data
• Dependency between partitions
RDD
• Stands for Resilient Distributed Dataset
• Spark revolves around RDDs
• Fault-tolerant, read-only collection of elements that can be operated on in parallel
• Cached in memory
• Read-only, partitioned collection of records (the slide’s example shows 3 partitions)
• Supports only coarse-grained operations
– e.g. map and group-by transformations, reduce actions
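To make "read-only, partitioned, and fault tolerant" more concrete, here is a toy model in plain Python. It is a sketch of the idea only and assumes nothing about Spark's internals: `MiniRDD` and its methods are hypothetical names. Each dataset remembers its parent and the function that derives it, so a lost partition can be rebuilt from that lineage instead of from a replica.

```python
class MiniRDD:
    """A toy, read-only, partitioned dataset that remembers its lineage."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions   # only set for the root dataset
        self._parent = parent       # dependency on the parent dataset
        self._fn = fn               # per-partition function applied to the parent
        self._cache = {}            # partition index -> computed records

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, f):
        # Coarse-grained: the same function is applied to every record.
        return MiniRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return MiniRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        # Action: triggers the actual computation of every partition.
        return [x for i in range(self.num_partitions) for x in self.compute(i)]
```

Transformations such as `map` and `filter` only record what to do; nothing is computed until `collect` is called, and calling it again after `lose_partition` recomputes just the missing piece.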
RDD Operations

RDD Operations - Expressive 
• Transformations 
– Create a new RDD from an existing one:
• map, filter, distinct, union, sample, groupByKey, join, etc.
• Actions
– Return a value after running a computation:
• collect, count, first, takeSample, foreach, reduce, etc.
• Reference 
Word Count on Spark
sparkContext.textFile("hdfs://…")        // RDD[String]
  .flatMap(line => line.split("\\s+"))   // RDD[String]
  .map(word => (word, 1))                // RDD[(String, Int)]
  .reduceByKey((a, b) => a + b)          // RDD[(String, Int)]
  .collect()                             // Array[(String, Int)]
Pipeline: textFile → flatMap → map → reduceByKey → collect
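The same word-count pipeline can be traced in plain Python without a cluster. This is a minimal sketch, not Spark code: `flat_map`, `reduce_by_key`, and `word_count` are hypothetical helpers that mimic the operators step by step. Splitting a line yields a list of words per line, so those lists have to be flattened into one collection of words before each word is paired with 1.

```python
from itertools import chain


def flat_map(f, records):
    """Apply f to each record and flatten the resulting lists into one list."""
    return list(chain.from_iterable(f(r) for r in records))


def reduce_by_key(f, pairs):
    """Combine all values that share a key using the binary function f."""
    acc = {}
    for key, value in pairs:
        acc[key] = f(acc[key], value) if key in acc else value
    return acc


def word_count(lines):
    words = flat_map(lambda line: line.split(), lines)   # lines -> words
    pairs = [(word, 1) for word in words]                # word -> (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)      # sum the 1s per word
```

For example, `word_count(["a b a", "b c"])` gives the counts a: 2, b: 2, c: 1.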
• Parallel Operations 
map reduce sample 
filter count take 
groupBy fold first
sort reduceByKey partitionBy 
union groupByKey mapWith 
join cogroup pipe 
leftOuterJoin cross save 
rightOuterJoin zip ….
Stages
textFile map map reduceByKey collect
Stage 1 → Stage 2, forming a DAG (Directed Acyclic Graph).
Each stage is executed as a series of Tasks (one Task for each partition).

Tasks
Task is the fundamental unit of execution in Spark.
Each Task: Fetch Input (HDFS/RDD) → Execute Task → Write Output (HDFS/RDD/intermediate shuffle output)
(Diagram: Core 1 and Core 2 each run a sequence of such tasks.)
Spark Summary 
• SparkContext 
• Resilient Distributed Dataset 
• Parallel Operations 
• Shared Variables 
– Broadcast Variables – read-only 
– Accumulators
Comparison
                | MapReduce                | Impala               | Spark
Storage         | HDFS                     | HDFS/HBase           | HDFS
Scheduler       | MapReduce Job            | Query Plan           | Computation Graph
I/O             | Disk                     | In-memory with cache | In-memory, cache and shared data
Fault Tolerance | Duplication and Disk I/O | No Fault Tolerance   | Hash partition and auto reconstruction
Iterative       | Bad                      | Bad                  | Good
Shared data     | No                       | No                   | Yes
Streaming       | No                       | No                   | Yes
Hands-on Spark 
• Spark Shell 
• Word Count

Spark Streaming 
• Takes the concept of RDDs and extends it to DStreams
– Fault-tolerant like RDDs 
– Transformable like RDDs 
• Adds new “rolling window” operations 
– Rolling averages, etc.. 
• But keeps everything else! 
– Regular Spark code works in Spark Streaming 
– Can still access HDFS data, etc. 
• Example use cases: 
– “On-the-fly” ETL as data is ingested into Hadoop/HDFS 
– Detecting anomalous behavior and triggering alerts
– Continuous reporting of summary metrics for incoming data
How Streaming Works
Micro-batching for on the fly ETL
Window-based Transformation
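Micro-batching and window-based transformations can be illustrated in plain Python. The sketch below is not the Spark Streaming API: `micro_batches` and `windowed_average` are hypothetical helpers, and the events are assumed to be (timestamp in seconds, numeric value) pairs. Events are first cut into fixed-length batches, and a rolling average is then computed over the most recent batches.

```python
from collections import deque


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    if not batches:
        return []
    # Emit empty batches too: the stream keeps ticking when no data arrives.
    return [batches.get(i, []) for i in range(min(batches), max(batches) + 1)]


def windowed_average(batches, window_length):
    """Rolling average over the last `window_length` batches."""
    window = deque(maxlen=window_length)   # old batches fall out automatically
    averages = []
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        averages.append(sum(values) / len(values) if values else None)
    return averages
```

Each batch is handled like a small static dataset, which is why ordinary batch-style code can be reused on a stream.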

Spark SQL 
• Spark SQL is one of Spark’s components.
– Executes SQL on Spark 
– Builds SchemaRDD 
• Optimizes execution plan 
• Uses existing Hive metastores, 
SerDes, and UDFs.
Unified Data Access 
• Ability to load and query data from a variety of sources.
• SchemaRDDs provide a single interface that efficiently works with structured data, including Hive tables, Parquet files, and JSON.
Query and join different data sources:
sqlCtx.jsonFile("s3n://...").registerAsTable("json")
schema_rdd = sqlCtx.sql("""
SELECT *
FROM hiveTable
JOIN json ...""")
Hands-on Spark 
• Parse/transform log on the fly with Spark-Streaming 
• Aggregate with Spark SQL (Top N) 
• Output from Spark to HDFS
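The parse-then-aggregate flow of this exercise can be rehearsed in plain Python before moving to Spark. This is a sketch under stated assumptions: the log format ("ip method url status", space separated) and the helpers `parse_line` and `top_n_urls` are hypothetical and are not taken from the workshop material.

```python
from collections import Counter


def parse_line(line):
    """Parse a hypothetical 'ip method url status' log line; None if malformed."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    ip, method, url, status = parts
    return {"ip": ip, "method": method, "url": url, "status": int(status)}


def top_n_urls(lines, n):
    """Count successful requests per URL and return the n most frequent."""
    records = (parse_line(line) for line in lines)
    counts = Counter(r["url"] for r in records if r and r["status"] == 200)
    return counts.most_common(n)
```

In SQL terms, the aggregation step corresponds to a GROUP BY on the URL with a count, ordered descending and limited to N rows.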
Spark & Impala work together 
Data Stream
- Clickstream
- Machine Data
- Log
- Network Traffic
- Etc.
On-the-fly Processing
- ETL, transformation, filter
- Pattern Matching & Alert
Real-time Analytics
- Machine Learning (recommendation, clustering, ...)
- Iterative Algorithms
Near Real-time Query
- Ad-hoc query
- Reporting
Long-term data store
- Batch process
- Offline analytics
- Historical Mining

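Etu Makes Hadoop Easier
A fully automated, high-performance, easy-to-manage software appliance: automated bare-metal deployment, performance optimization, full-cluster management
The only local Hadoop professional services
Mainstream x86 commodity servers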
ESA Software Stack 
Cloudera Manager 
Etu Manager 
CentOS operating system (64-bit)
HA management
Etu value-added modules
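Etu’s Position and Value in the Hadoop Ecosystem
Etu shields customers from the complexity of deploying and operating the Hadoop platform:
• Launch services quickly and capture the market
• Applications and data are an enterprise’s core value
• Allocate resources according to core value to build a competitive advantage
Etu Services: Etu Support, Etu Professional Services, Etu Consulting, Etu Training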
Question and Discussion 
Thank you 
318, Rueiguang Rd., Taipei 114, Taiwan 
T: +886 2 7720 1888 
F: +886 2 8798 6069

Etu Solution Day 2014 Track-D: Mastering Impala and Spark

  • 1. Mastering Impala and Spark: Real-time Big Data Application Architecture Workshop. Etu Principal Consultant 陳昭宇, Oct 8, 2014
  • 2. Workshop Goal Let’s talk about the 3Vs in Big Data. Hadoop is good for Volume and Variety But… How about Velocity ??? This is why we are sitting here ….
  • 3. Target Audience • CTO • Architect • Software/Application Developer • IT
  • 4. Background Knowledge • Linux operating system • Basic Hadoop ecosystem knowledge • Basic knowledge of SQL • Java or Python programming experience
  • 5. Terminology • Hadoop: Open source big data platform • HDFS: Hadoop Distributed Filesystem • MapReduce: Parallel computing framework on top of HDFS • HBase: NoSQL database on top of Hadoop • Impala: MPP SQL query engine on top of Hadoop • Spark: In-memory cluster computing engine • Hive: SQL to MapReduce translator • Hive Metastore: Database that stores table schema • Hive QL: A SQL subset
  • 6. Agenda • What is Hadoop and what’s wrong with Hadoop in real-time? • What is Impala? • Hands-on Impala • What is Spark? • Hands-on Spark • Spark and Impala work together • Q & A
  • 7. What is Hadoop? Apache Hadoop is an open source platform for data storage and processing that is: Distributed, Fault tolerant, Scalable. CORE HADOOP SYSTEM COMPONENTS: HDFS (a fault-tolerant, scalable clustered storage) and MapReduce (a distributed computing framework). Flexible for storing and mining any type of data • Ask questions across structured and unstructured data • Schema-less. Processing Complex Big Data • Scale-out architecture divides workloads across nodes. • Flexible file system eliminates ETL bottlenecks. Scales Economically • Deploy on commodity hardware • Open sourced platform
  • 8. Limitations of MapReduce • Batch oriented • High latency • Doesn’t fit all cases • Only for developers
  • 9. Pig and Hive • MapReduce (MR) is hard and only for developers • High-level abstractions that convert declarative syntax to MR – SQL: Hive – Dataflow language: Pig • Built on top of MapReduce
  • 10. Goals • General-purpose SQL engine: – Works for both analytics and transactional/single-row workloads. – Supports queries that take from milliseconds to hours. • Runs directly within Hadoop and: – Reads widely used Hadoop file formats. – Runs on same nodes that run Hadoop processes. • High performance: – C++ instead of Java – Runtime code generation – Completely new execution engine – No MapReduce
  • 11. What is Impala • General-purpose SQL engine • Real-time queries in Apache Hadoop • Beta version released since Oct. 2012 • GA since Apr. 2013 • Apache licensed • Latest release v1.4.2
  • 12. Impala Overview • Distributed service in cluster: One impala daemon on each data node • No SPOF • User submits query via ODBC/JDBC, CLI, or HUE to any of the daemons. • Query is distributed to all nodes with data locality. • Uses Hive’s metadata interfaces and connects to Hive metastore. • Supported file formats: – Uncompressed/lzo-compressed text files – Sequence files and RCFile with snappy/gzip, Avro – Parquet columnar format
  • 13. Impala’s SQL • High compatibility with HiveQL • SQL support: – Essential SQL-92, minus correlated subqueries – INSERT INTO … SELECT … – Only equi-joins; no non-equi-joins, no cross products – Order By requires Limit (not required after 1.4.2) – Limited DDL support – SQL-style authorization via Apache Sentry – UDFs and UDAFs are supported
  • 14. Impala’s SQL limitations • No custom file formats or SerDes • Nothing beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json) • Joins: broadcast joins and partitioned hash joins are supported (the smaller table has to fit in the aggregate memory of all executing nodes)
  • 15. Work with HBase • Functionality highlights: – Support for SELECT, INSERT INTO…SELECT…, and INSERT INTO … VALUES (…) – Predicates on rowkey columns are mapped into start/stop rows – Predicates on other columns are mapped into SingleColumnValueFilters • BUT mapping of HBase tables and metastore table patterned after Hive: – All data is stored as scalars and in ASCII. – The rowkey needs to be mapped into a single string column.
  • 16. HBase in Roadmap • Full support for UPDATE and DELETE. • Storage of structured data to minimize storage and access overhead. • Composite rowkey encoding mapped into an arbitrary number of table columns.
  • 17. Impala’s Architecture (diagram: SQL client via ODBC; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, Statestore) 1. Request arrives via ODBC/JDBC/Beeswax/Shell.
  • 18. Impala’s Architecture (same diagram) 2. Planner turns the request into collections of plan fragments. 3. Coordinator initiates execution on impalad(s) local to data.
  • 19. Impala’s Architecture (same diagram) 4. Intermediate results are streamed between impalad(s). 5. Query results are streamed back to the client.
  • 20. Metadata Handling • Impala metadata – Hive’s metastore: Logical metadata (table definitions, columns, CREATE TABLE parameters) – HDFS NameNode: Directory contents and block replica locations – HDFS DataNode: Block replicas’ volume ids • Caches metadata: No synchronous metastore API calls during query execution • Impala instances read metadata from metastore at startup. • Catalog Service relays metadata when you run DDL or update metadata on one of the impalads.
  • 21. Metadata Handling – Cont. • REFRESH [<tbl>]: Reloads metadata on all impalads (if you added new files via Hive) • INVALIDATE METADATA: Reloads metadata for all tables
  • 22. Comparing Impala to Dremel • What is Dremel? – Columnar storage for data with nested structures – Distributed scalable aggregation on top of that • Columnar storage in Hadoop: Parquet – Store data in appropriate native/binary types – Can also store nested structures similar to Dremel’s ColumnIO • Distributed aggregation: Impala • Impala plus Parquet: a superset of the published version of Dremel (the published Dremel does not support joins)
  • 23. Comparing Impala to Hive • Hive: MapReduce as an execution engine – High latency, low throughput queries – Fault-tolerance model based on MapReduce’s on-disk check pointing; materializes all intermediate results – Java runtime allows for easy late-binding of functionality: file formats and UDFs – Extensive layering imposes high runtime overhead • Impala: – Direct, process-to-process data exchange – No fault tolerance – An execution engine designed for low runtime overhead
  • 24. Impala and Hive Shares Everything Client-Facing •Metadata (table definitions) •ODBC/JDBC drivers •SQL syntax (Hive SQL) •Flexible file formats •Machine pool •GUI Resource Management Data Store But Built for Different Purposes •Hive: Runs on MapReduce and ideal for batch processing •Impala: Native MPP query engine ideal for interactive SQL Data Ingestion HDFS HBase TEXT, RCFILE, PARQUET,AVRO, ETC. RECORDS Hive SQL Syntax MapReduce Compute framework Impala SQL syntax + compute framework
  • 25. Typical Use Cases • Data Warehouse Offload • Ad-hoc Analytics • Provide SQL interoperability to HBase
  • 26. Hands-on Impala • Query a file on HDFS with Impala • Query a table on HBase with Impala
  • 27. What is Spark? • MapReduce Review… • Apache Spark… • How Spark Works… • Fault Tolerance and Performance… • Examples… • Spark & More…
  • 28. MapReduce: Good The Good: •Built in fault tolerance •Optimized IO path •Scalable •Developer focuses on Map/Reduce, not infrastructure •Simple? API
  • 29. MapReduce: Bad The Bad: • Optimized for disk IO – Does not leverage memory – Iterative algorithms go through the disk IO path again and again • Primitive API – Developers have to build on a very simple abstraction – Key/Value in/out – Even basic things like a join require extensive code • A common result is many files that need to be combined appropriately
  • 30. Apache Spark • Originally developed in 2009 in UC Berkeley’s AMP Lab. • Fully open sourced in 2010 – now at Apache Software Foundation.
  • 31. Spark: Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive Shell • 2-5x less code • Fast to Run – General execution graph – In-memory store
  • 32. How Spark Works – SparkContext (diagram: SparkContext in the driver, Cluster Master, Spark Workers each with an Executor, Cache, and Tasks next to an HDFS Data Node) sc = new SparkContext; rdd = sc.textFile("hdfs://.."); rdd.filter(…); rdd.cache(…); rdd.count(…)
  • 33. How Spark Works – RDD RDD (Resilient Distributed Dataset) sc = new SparkContext; rdd = sc.textFile("hdfs://.."); rdd.filter(…); rdd.cache(…); rdd.count(…) Storage Types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, … • Fault tolerant • Controlled partitioning to optimize data placement • Manipulated by using a rich set of operators • Partitions of data • Dependency between partitions
  • 34. RDD • Stands for Resilient Distributed Dataset • Spark revolves around RDDs • Fault-tolerant, read-only collection of elements that can be operated on in parallel • Cached in memory Reference: k.pdf
  • 35. RDD • Read-only, partitioned collection of records (diagram: 3 partitions, D1 D2 D3) • Supports only coarse-grained operations – e.g. map and group-by transformations, reduce actions
  • 37. RDD Operations - Expressive • Transformations – Create a new RDD from an existing one: • map, filter, distinct, union, sample, groupByKey, join, etc. • Actions – Return a value after running a computation: • collect, count, first, takeSample, foreach, reduce, etc. • Reference –
  • 38. Word Count on Spark sparkContext.textFile("hdfs://…") RDD[String] textFile
  • 39. Word Count on Spark sparkContext.textFile("hdfs://…") .flatMap(line => line.split("\\s+")) RDD[String] RDD[String] textFile flatMap
  • 40. Word Count on Spark sparkContext.textFile("hdfs://…") .flatMap(line => line.split("\\s+")) .map(word => (word, 1)) RDD[String] RDD[String] RDD[(String, Int)] textFile flatMap map
  • 41. Word Count on Spark sparkContext.textFile("hdfs://…") .flatMap(line => line.split("\\s+")) .map(word => (word, 1)) .reduceByKey((a, b) => a+b) RDD[String] RDD[String] RDD[(String, Int)] RDD[(String, Int)] textFile flatMap map reduceByKey
  • 42. Word Count on Spark sparkContext.textFile("hdfs://…") .flatMap(line => line.split("\\s+")) .map(word => (word, 1)) .reduceByKey((a, b) => a+b) .collect() RDD[String] RDD[String] RDD[(String, Int)] RDD[(String, Int)] Array[(String, Int)] textFile flatMap map reduceByKey collect
  • 43. Actions • Parallel Operations map reduce sample filter count take groupBy fold first sort reduceByKey partitionBy union groupByKey mapWith join cogroup pipe leftOuterJoin cross save rightOuterJoin zip ….
  • 44. Stages textFile map map reduceByKey collect Stage 1 Stage 2 DAG (Directed Acyclic Graph) Each stage is executed as Stage 1 Stage 2 a series of Task (one Task for each partition).
  • 45. Tasks Task is the fundamental unit of execution in Spark Fetch Input Execute Task Write Output HDFS /RDD HDFS/RDD/Intermediate shuffle output Core 1 Fetch Input Execute Task Write Output Fetch Input Execute Task Write Output Fetch Input Execute Task Write Output Core 2
  • 46. Spark Summary • SparkContext • Resilient Distributed Dataset • Parallel Operations • Shared Variables – Broadcast Variables – read-only – Accumulators
  • 47. Comparison MapReduce Impala Spark Storage HDFS HDFS/HBase HDFS Scheduler MapReduce Job Query Plan Computation Graph I/O Disk In-memory with cache In-memory, cache and shared data Fault Tolerance Duplication and Disk I/O No Fault Tolerance Hash partition and auto reconstruction Iterative Bad Bad Good Shared data No No Yes Streaming No No Yes
  • 48. Hands-on Spark • Spark Shell • Word Count
  • 49. Spark Streaming • Takes the concept of RDDs and extends it to DStreams – Fault-tolerant like RDDs – Transformable like RDDs • Adds new “rolling window” operations – Rolling averages, etc. • But keeps everything else! – Regular Spark code works in Spark Streaming – Can still access HDFS data, etc. • Example use cases: – “On-the-fly” ETL as data is ingested into Hadoop/HDFS – Detecting anomalous behavior and triggering alerts – Continuous reporting of summary metrics for incoming data
  • 51. Micro-batching for on the fly ETL
  • 53. Spark SQL • Spark SQL is one of Spark’s components. – Executes SQL on Spark – Builds SchemaRDD • Optimizes execution plan • Uses existing Hive metastores, SerDes, and UDFs.
  • 54. Unified Data Access • Ability to load and query data from a variety of sources. • SchemaRDDs provides a single interface that efficiently works with structured data, including Hive tables, parquet files, and JSON. sqlCtx.jsonFile("s3n://...") .registerAsTable("json") schema_rdd = sqlCtx.sql(""" SELECT * FROM hiveTable JOIN json ...""") Query and join different data sources
  • 55. Hands-on Spark • Parse/transform log on the fly with Spark-Streaming • Aggregate with Spark SQL (Top N) • Output from Spark to HDFS
  • 56. Spark & Impala work together (diagram: data streams feed Spark-Streaming on each node; every node runs Spark and Impala next to DN and RS) Data Stream - Clickstream - Machine Data - Log - Network Traffic - Etc. On-the-fly Processing - ETL, transformation, filter - Pattern Matching & Alert Real-time Analytics - Machine Learning (recommendation, clustering, ...) - Iterative Algorithms Near Real-time Query - Ad-hoc query - Reporting Long-term data store - Batch process - Offline analytics - Historical Mining
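  • 57. Etu Makes Hadoop Easier. A fully automated, high-performance, easy-to-manage software appliance: automated bare-metal deployment, performance optimization, full-cluster management. The only local Hadoop professional services. Mainstream x86 commodity servers.
  • 58. ESA Software Stack: Cloudera Manager, Etu Manager (security management, performance optimization, configuration sync, network management, monitoring and alerting, package management), CentOS operating system (64-bit), HA management, Rack Awareness, Etu value-added modules
  • 59. Etu’s Position and Value in the Hadoop Ecosystem. The Etu software appliance shields customers from the complexity of deploying and operating the Hadoop platform, making deployment and tuning, operations and management, and application development easier. • Launch services quickly and capture the market • Applications and data are an enterprise’s core value • Allocate resources according to core value to build a competitive advantage. Etu Services: Etu Support, Etu Professional Services, Etu Consulting, Etu Training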
  • 60. Question and Discussion Thank you 318, Rueiguang Rd., Taipei 114, Taiwan T: +886 2 7720 1888 F: +886 2 8798 6069