Mastering Impala and Spark
A Workshop on Real-time Big Data Application Architecture
Etu Chief Consultant 陳昭宇 (James Chen)
Oct 8, 2014
Workshop Goal
Let's talk about the 3Vs of Big Data. Hadoop is good for Volume and Variety.
But how about Velocity?
This is why we are sitting here.
Target Audience 
• CTO 
• Architect 
• Software/Application Developer 
• IT
Background Knowledge 
• Linux operating system 
• Basic Hadoop ecosystem knowledge 
• Basic knowledge of SQL 
• Java or Python programming experience

Terminology 
• Hadoop: Open source big data platform 
• HDFS: Hadoop Distributed Filesystem 
• MapReduce: Parallel computing framework on top of HDFS 
• HBase: NoSQL database on top of Hadoop 
• Impala: MPP SQL query engine on top of Hadoop 
• Spark: In-memory cluster computing engine 
• Hive: SQL to MapReduce translator 
• Hive Metastore: Database that stores table schema 
• Hive QL: A SQL subset
Agenda 
• What is Hadoop and what’s wrong with Hadoop in real-time? 
• What is Impala? 
• Hands-on Impala 
• What is Spark? 
• Hands-on Spark 
• Spark and Impala work together 
• Q & A
What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable

CORE HADOOP SYSTEM COMPONENTS
• HDFS: a fault-tolerant, scalable clustered storage
• MapReduce: a distributed computing framework

Flexible for storing and mining any type of data
• Ask questions across structured and unstructured data
• Schema-less

Processing Complex Big Data
• Scale-out architecture divides workloads across nodes
• Flexible file system eliminates ETL bottlenecks

Scales Economically
• Deploy on commodity hardware
• Open source platform
Limitations of MapReduce 
• Batch oriented 
• High latency 
• Doesn’t fit all cases 
• Only for developers

Pig and Hive 
• MR is hard and only for developers 
• High-level abstractions convert declarative syntax into MR: 
– SQL – Hive 
– Dataflow language – Pig 
• Built on top of MapReduce (see the HiveQL sketch below)
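
As a quick illustration of that abstraction, here is a hedged sketch (the table name raw_lines and its single line column are hypothetical): the word count that takes dozens of lines of MapReduce Java code becomes one HiveQL statement, which Hive compiles into MapReduce jobs behind the scenes.

-- word count in HiveQL; Hive turns this into MapReduce jobs
SELECT word, COUNT(*) AS cnt
FROM (
  SELECT explode(split(line, '\\s+')) AS word   -- split each line on whitespace
  FROM raw_lines
) words
GROUP BY word
ORDER BY cnt DESC;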
Goals 
• General-purpose SQL engine: 
– Works for both analytics and transactional/single-row workloads. 
– Supports queries that take from milliseconds to hours. 
• Runs directly within Hadoop and: 
– Reads widely used Hadoop file formats. 
– Runs on same nodes that run Hadoop processes. 
• High performance: 
– C++ instead of Java 
– Runtime code generation 
– Completely new execution engine – No MapReduce
What is Impala 
• General-purpose SQL engine 
• Real-time queries in Apache Hadoop 
• Beta released in Oct. 2012 
• GA since Apr. 2013 
• Apache licensed 
• Latest release v1.4.2
Impala Overview 
• Distributed service in the cluster: one impalad daemon on each data node 
• No SPOF 
• User submits a query via ODBC/JDBC, CLI, or HUE to any of the daemons 
• The query is distributed to all nodes with data locality 
• Uses Hive's metadata interfaces and connects to the Hive metastore 
• Supported file formats: 
– Uncompressed/LZO-compressed text files 
– SequenceFiles and RCFile with snappy/gzip, Avro 
– Parquet columnar format

Impala’s SQL 
• High compatibility with HiveQL 
• SQL support: 
– Essential SQL-92, minus correlated subqueries 
– INSERT INTO … SELECT … 
– Only equi-joins; no non-equi-joins, no cross products 
– Order By requires Limit (not required after 1.4.2) 
– Limited DDL support 
– SQL-style authorization via Apache Sentry 
– UDFs and UDAFs are supported
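
A hedged sketch of what that support looks like in practice, using hypothetical orders/customers/region_summary tables: an equi-join with aggregation, ORDER BY with LIMIT, and INSERT INTO … SELECT.

-- equi-join plus aggregation; per the slide, ORDER BY also needs a LIMIT before 1.4.x
SELECT c.region, COUNT(*) AS order_cnt
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2014-01-01'
GROUP BY c.region
ORDER BY order_cnt DESC
LIMIT 10;

-- write query results into another table
INSERT INTO TABLE region_summary
SELECT c.region, COUNT(*)
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;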
Impala's SQL limitations 
• No custom file formats or SerDes 
• No beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, xpath, json) 
• Only broadcast joins and partitioned hash joins are supported (the smaller table has to fit in the aggregate memory of all executing nodes)
Work with HBase 
• Functionality highlights: 
– Support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES (…) 
– Predicates on rowkey columns are mapped into start/stop rows 
– Predicates on other columns are mapped into SingleColumnValueFilters 
• BUT the mapping between HBase tables and metastore tables is patterned after Hive (see the DDL sketch below): 
– All data is stored as scalars and in ASCII 
– The rowkey needs to be mapped into a single string column
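
A hedged sketch of such a mapping, with a hypothetical users table and info column family. In this Impala generation the STORED BY DDL has to be run in Hive; once the table exists in the metastore, Impala can query it directly.

-- run in Hive: map an existing HBase table into the metastore
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,          -- maps to the HBase rowkey (a single string column)
  name   STRING,          -- maps to info:name
  age    STRING           -- maps to info:age (stored as an ASCII scalar)
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "users");

-- then query it from Impala; the rowkey predicate becomes an HBase start/stop row
SELECT name, age FROM hbase_users WHERE rowkey = 'user001';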
HBase in Roadmap 
• Full support for UPDATE and DELETE. 
• Storage of structured data to minimize storage and access 
overhead. 
• Composite rowkey encoding mapped into an arbitrary 
number of table columns.

Impala's Architecture 
[Diagram: every data node runs a Query Planner, Query Coordinator, and Query Executor alongside the HDFS DataNode and HBase; shared services are the Hive Metastore, HDFS NameNode, and Statestore; a SQL client connects via ODBC.]
1. Request arrives via ODBC/JDBC/Beeswax/Shell.
Impala's Architecture 
[Same diagram as above.]
2. The planner turns the request into collections of plan fragments. 
3. The coordinator initiates execution on impalad(s) local to the data.
Impala's Architecture 
[Same diagram as above.]
4. Intermediate results are streamed between impalad(s). 
5. Query results are streamed back to the client.
Metadata Handling 
• Impala metadata 
– Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters) 
– HDFS NameNode: directory contents and block replica locations 
– HDFS DataNode: block replicas' volume ids 
• Caches metadata: no synchronous metastore API calls during query execution 
• Impala instances read metadata from the metastore at startup 
• The Catalog Service relays metadata when you run DDL or update metadata on one of the impalads.

Metadata Handling – Cont. 
• REFRESH [<tbl>]: Reloads that table's metadata on all impalads (if you added new files via Hive) 
• INVALIDATE METADATA: Reloads metadata for all tables 
(Examples below.)
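
For example, with a hypothetical web_logs table:

-- new data files were added to an existing table from outside Impala (e.g. via Hive)
REFRESH web_logs;

-- tables were created or dropped outside Impala: reload metadata for all tables
INVALIDATE METADATA;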
Comparing Impala to Dremel 
• What is Dremel? 
– Columnar storage for data with nested structures 
– Distributed scalable aggregation on top of that 
• Columnar storage in Hadoop: Parquet 
– Store data in appropriate native/binary types 
– Can also store nested structures similar to Dremel’s ColumnIO 
• Distributed aggregation: Impala 
• Impala plus Parquet: a superset of the published version of Dremel (which does not support joins)
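
As a hedged sketch (hypothetical table names), an existing text-format table can be rewritten into Parquet from Impala itself with CREATE TABLE … AS SELECT, after which scans read only the referenced columns:

CREATE TABLE web_logs_parquet STORED AS PARQUET
AS SELECT * FROM web_logs_text;

SELECT status, COUNT(*) FROM web_logs_parquet GROUP BY status;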
Comparing Impala to Hive 
• Hive: MapReduce as an execution engine 
– High latency, low throughput queries 
– Fault-tolerance model based on MapReduce’s on-disk check pointing; 
materializes all intermediate results 
– Java runtime allows for easy late-binding of functionality: file formats 
and UDFs 
– Extensive layering imposes high runtime overhead 
• Impala: 
– Direct, process-to-process data exchange 
– No fault tolerance 
– An execution engine designed for low runtime overhead
Impala and Hive 
Share everything client-facing: 
• Metadata (table definitions) 
• ODBC/JDBC drivers 
• SQL syntax (Hive SQL) 
• Flexible file formats (TEXT, RCFILE, PARQUET, AVRO, etc.) 
• Machine pool, resource management, and data store (HDFS, HBase) 
• GUI 
But built for different purposes: 
• Hive: SQL syntax on top of MapReduce as the compute framework; ideal for batch processing and data ingestion 
• Impala: SQL syntax plus its own native MPP compute framework; ideal for interactive SQL

Typical Use Cases 
• Data Warehouse Offload 
• Ad-hoc Analytics 
• Provide SQL interoperability to HBase
Hands-on Impala 
• Query a file on HDFS with Impala 
• Query a table on HBase with Impala
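
A possible walk-through of the two exercises, as a hedged sketch (paths, table names, and columns are hypothetical; run the statements from impala-shell or HUE):

-- 1) Query a file on HDFS: expose an existing tab-delimited file as an external table
CREATE EXTERNAL TABLE web_logs (
  ts     STRING,
  ip     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/workshop/web_logs';

SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;

-- 2) Query a table on HBase: reuse the hbase_users mapping created earlier via Hive
SELECT name, age FROM hbase_users WHERE rowkey = 'user001';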
What is Spark? 
• MapReduce Review… 
• Apache Spark… 
• How Spark Works… 
• Fault Tolerance and Performance… 
• Examples… 
• Spark & More…
MapReduce: Good 
The Good: 
• Built-in fault tolerance 
• Optimized IO path 
• Scalable 
• Developer focuses on Map/Reduce, not infrastructure 
• Simple? API

MapReduce: Bad 
The Bad: 
• Optimized for disk IO 
– Does not leverage memory 
– Iterative algorithms go through the disk IO path again and again 
• Primitive API 
– Developers have to build on a very simple abstraction 
– Key/Value in/out 
– Even basic things like join require extensive code 
• A common result is many output files that have to be combined appropriately
Apache Spark 
• Originally developed in 
2009 in UC Berkeley’s AMP 
Lab. 
• Fully open sourced in 2010 
– now at Apache Software 
Foundation.
Spark: Easy and Fast Big Data 
• Easy to Develop 
– Rich APIs in Java, Scala, 
Python 
– Interactive Shell 
• 2-5x less code 
• Fast to Run 
– General execution 
graph 
– In-memory store
How Spark Works – SparkContext 
[Diagram: the driver's SparkContext talks to the Cluster Master; each Spark Worker runs an Executor with a cache and tasks, co-located with an HDFS Data Node.]
sc = new SparkContext 
rdd = sc.textFile("hdfs://...") 
rdd.filter(...) 
rdd.cache() 
rdd.count() 
rdd.map(...)
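
A minimal, runnable Scala version of the pseudo-code above (a sketch; the HDFS path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  def main(args: Array[String]): Unit = {
    // the SparkContext connects to the cluster master and acquires executors on the workers
    val sc = new SparkContext(new SparkConf().setAppName("SparkContextSketch"))

    val lines  = sc.textFile("hdfs:///user/workshop/input.txt")   // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()                                   // keep the filtered RDD in executor memory
    println(errors.count())                          // first action: triggers the computation
    println(errors.map(_.length).reduce(_ max _))    // second action: reuses the cached data

    sc.stop()
  }
}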

How Spark Works – RDD 
RDD (Resilient Distributed Dataset) 
sc = new SparkContext 
rdd = sc.textFile("hdfs://...") 
rdd.filter(...) 
rdd.cache() 
rdd.count() 
rdd.map(...) 
Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY 
• Fault tolerant 
• Controlled partitioning to optimize data placement 
• Manipulated using a rich set of operators 
• Partitions of data 
• Dependencies between partitions
RDD 
• Stands for Resilient Distributed Datasets 
• Spark revolves around RDDs 
• Fault-tolerant, read-only collection of elements that can be operated on in parallel 
• Can be cached in memory 
Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD 
• Read-only, partitioned collection of records 
[Diagram: an RDD made up of three partitions D1, D2, D3] 
• Supports only coarse-grained operations 
– e.g. map and group-by transformations, reduce actions 
[Diagram: each transformation produces a new set of partitions; an action collapses them into a value]
RDD Operations

RDD Operations - Expressive 
• Transformations 
– Create a new RDD from an existing one: 
• map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc. 
• Actions 
– Return a value after running a computation: 
• collect, count, first, takeSample, foreach, etc. 
• Reference 
– http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
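
A hedged Scala sketch of the distinction: transformations are lazy and only describe new RDDs, while actions trigger execution and bring a result back to the driver.

import org.apache.spark.{SparkConf, SparkContext}

object RddOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddOpsSketch"))

    val nums = sc.parallelize(1 to 100, 4)        // an RDD with 4 partitions

    // transformations: lazy, each returns a new RDD
    val evens   = nums.filter(_ % 2 == 0)
    val squared = evens.map(n => n * n)

    // actions: trigger execution and return a value to the driver
    println(squared.count())                      // 50
    println(squared.take(3).mkString(", "))       // 4, 16, 36
    println(squared.reduce(_ + _))

    sc.stop()
  }
}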
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) RDD[String] 
textFile
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(“s”)) 
RDD[String] 
RDD[List[String]] 
textFile map
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(“s”)) 
.map(word => (word, 1)) 
RDD[String] 
RDD[List[String]] 
RDD[(String, Int)] 
textFile map map

Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(“s”)) 
.map(word => (word, 1)) 
.reduceByKey((a, b) => a+b) 
RDD[String] 
RDD[List[String]] 
RDD[(String, Int)] 
RDD[(String, Int)] 
textFile map map reduceByKey 
map
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(“s”)) 
.map(word => (word, 1)) 
.reduceByKey((a, b) => a+b) 
.collect() 
RDD[String] 
RDD[List[String]] 
RDD[(String, Int)] 
RDD[(String, Int)] 
Array[(String, Int)] 
textFile map map reduceByKey 
map 
collect
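
Putting the chain together as a runnable Scala program (a sketch with a hypothetical input path; where the slides show a schematic map/split, a runnable version uses flatMap with a whitespace regex so each line becomes individual words):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val counts = sc.textFile("hdfs:///user/workshop/input.txt")   // hypothetical path
      .flatMap(line => line.split("\\s+"))                        // split each line into words
      .map(word => (word, 1))                                     // pair each word with a count of 1
      .reduceByKey(_ + _)                                         // sum the counts per word

    counts.collect().foreach { case (word, n) => println(s"$word\t$n") }
    sc.stop()
  }
}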
Actions 
• Parallel operations: 
map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, ...
Stages 
textFile → map → map → reduceByKey → collect 
[Diagram: the lineage above split into Stage 1 and Stage 2 at the shuffle boundary] 
The job forms a DAG (Directed Acyclic Graph); each stage is executed as a series of tasks (one task for each partition).

Tasks 
A task is the fundamental unit of execution in Spark. 
[Diagram: on each core, a task fetches its input (from HDFS or an RDD), executes, and writes its output (to HDFS, an RDD, or intermediate shuffle output).]
Spark Summary 
• SparkContext 
• Resilient Distributed Dataset 
• Parallel Operations 
• Shared Variables 
– Broadcast Variables – read-only 
– Accumulators
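
A hedged Scala sketch of the two kinds of shared variables (the lookup map and file paths are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariablesSketch"))

    // broadcast variable: read-only lookup table shipped once to every executor
    val countryNames = sc.broadcast(Map("TW" -> "Taiwan", "JP" -> "Japan"))

    // accumulator: workers add to it, only the driver reads the total
    // (updates inside transformations are best-effort; use for counters/debugging)
    val badRecords = sc.accumulator(0L, "bad records")

    val codes = sc.textFile("hdfs:///user/workshop/country_codes.txt")   // hypothetical input
    val named = codes.map { code =>
      countryNames.value.getOrElse(code, { badRecords += 1L; "unknown" })
    }
    named.saveAsTextFile("hdfs:///user/workshop/country_names")

    println(s"bad records: ${badRecords.value}")
    sc.stop()
  }
}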
Comparison 
                  MapReduce                  Impala                    Spark
Storage           HDFS                       HDFS/HBase                HDFS
Scheduler         MapReduce job              Query plan                Computation graph
I/O               Disk                       In-memory with cache      In-memory, cache and shared data
Fault tolerance   Duplication and disk I/O   No fault tolerance        Hash partition and auto reconstruction
Iterative         Bad                        Bad                       Good
Shared data       No                         No                        Yes
Streaming         No                         No                        Yes
Hands-on Spark 
• Spark Shell 
• Word Count

Spark Streaming 
• Takes the concept of RDDs and extends it to DStreams 
– Fault-tolerant like RDDs 
– Transformable like RDDs 
• Adds new “rolling window” operations 
– Rolling averages, etc.. 
• But keeps everything else! 
– Regular Spark code works in Spark Streaming 
– Can still access HDFS data, etc. 
• Example use cases: 
– “On-the-fly” ETL as data is ingested into Hadoop/HDFS 
– Detecting anomalous behavior and trigger alerts 
– Continuous reporting of summary metrics for incoming data
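
A hedged Scala sketch of a rolling-window computation against the Spark 1.x streaming API (the socket source, port, and checkpoint path are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWindowSketch")
    val ssc  = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")         // needed for window operations

    // hypothetical source: log lines arriving on a TCP socket
    val lines  = ssc.socketTextStream("localhost", 9999)
    val errors = lines.filter(_.contains("ERROR"))

    // rolling count of errors over the last 60 seconds, recomputed every 10 seconds
    val errorCounts = errors.countByWindow(Seconds(60), Seconds(10))
    errorCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}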
How Streaming Works
Micro-batching for on the fly ETL
Window-based Transformation

Spark SQL 
• Spark SQL is one of Spark’s 
components. 
– Executes SQL on Spark 
– Builds SchemaRDD 
• Optimizes execution plan 
• Uses existing Hive metastores, 
SerDes, and UDFs.
Unified Data Access 
• Ability to load and query 
data from a variety of 
sources. 
• SchemaRDDs provides a 
single interface that 
efficiently works with 
structured data, including 
Hive tables, parquet files, 
and JSON. 
sqlCtx.jsonFile("s3n://...") 
.registerAsTable("json") 
schema_rdd = sqlCtx.sql(""" 
SELECT * 
FROM hiveTable 
JOIN json ...""") 
Query and join different data sources
Hands-on Spark 
• Parse/transform log on the fly with Spark-Streaming 
• Aggregate with Spark SQL (Top N) 
• Output from Spark to HDFS
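
One possible shape for this exercise, as a hedged sketch against the Spark 1.1-era API (the log format, host/port, and output path are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// hypothetical log line: "2014-10-08T12:00:00 GET /index.html 200"
case class LogLine(ts: String, method: String, path: String, status: Int)

object StreamingTopN {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("StreamingTopN"), Seconds(10))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD (Spark 1.x)

    // parse/transform the log on the fly
    val lines  = ssc.socketTextStream("localhost", 9999)
    val parsed = lines.map(_.split(" "))
                      .filter(_.length >= 4)
                      .map(f => LogLine(f(0), f(1), f(2), f(3).toInt))

    // for every micro-batch: register a table, aggregate with Spark SQL, write Top N to HDFS
    parsed.foreachRDD { rdd =>
      rdd.registerTempTable("access_log")
      val counts = sqlContext.sql("SELECT path, COUNT(*) AS hits FROM access_log GROUP BY path")
      val topN   = counts.map(r => (r.getLong(1), r.getString(0))).top(10)
      ssc.sparkContext
        .parallelize(topN.map { case (hits, path) => s"$path\t$hits" }, 1)
        .saveAsTextFile("hdfs:///user/workshop/topn/" + System.currentTimeMillis)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}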
Spark & Impala work together 
[Diagram: multiple data streams feed Spark Streaming on each node; Spark and Impala run alongside the HDFS DataNode (DN) and HBase RegionServer (RS) on every node of the same cluster.]
• Data streams: click stream, machine data, logs, network traffic, etc. 
• On-the-fly processing (Spark Streaming): ETL, transformation, filtering, pattern matching and alerts 
• Real-time analytics (Spark): machine learning (recommendation, clustering, ...), iterative algorithms 
• Near real-time query (Impala): ad-hoc query, reporting 
• Long-term data store: batch processing, offline analytics, historical mining

Etu Makes Hadoop Easier 
A fully automated, high-performance, easy-to-manage software appliance 
• Automated bare-metal deployment 
• Performance optimization 
• Whole-cluster management 
• The only local Hadoop professional services 
• Mainstream x86 commodity servers
ESA Software Stack 
• Cloudera Manager 
• Etu Manager (Etu value-added modules): security management, performance optimization, configuration sync, network management, monitoring and alerting, package management, HA management, rack awareness 
• CentOS operating system (64-bit)
Etu's Position and Value in the Hadoop Ecosystem 
[Diagram: the enterprise focuses on the application and data layers (talent recruiting, team building, application development, data mining, architecture design), while Etu covers the platform layer (deployment, tuning, operations, management).]
Etu shields the complexity of deploying and operating the Hadoop platform, so customers can: 
• Launch services quickly and capture the market 
• Focus on applications and data, which are the enterprise's real core value 
• Allocate resources according to that core value and build competitive advantage 
Etu offerings: Software Appliance, Etu Support, Etu Professional Services (Etu Consulting, Etu Training)
Question and Discussion 
Thank you 
318, Rueiguang Rd., Taipei 114, Taiwan 
T: +886 2 7720 1888 
F: +886 2 8798 6069 
www.etusolution.com

@Call @Girls in Tiruppur 🤷‍♂️ XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class ...
Mona Rathore
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
Severalnines
 
NYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdfNYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdf
AUGNYC
 
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Asher Sterkin
 
Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...
Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...
Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
ENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentationENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentation
sofiafernandezon
 
@Call @Girls in Tirunelveli 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...
 @Call @Girls in Tirunelveli 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla... @Call @Girls in Tirunelveli 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...
@Call @Girls in Tirunelveli 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...
JoyaBansal
 
Kolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Kolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeKolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Kolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Misti Soneji
 
Java SE 17 Study Guide for Certification - Chapter 01
Java SE 17 Study Guide for Certification - Chapter 01Java SE 17 Study Guide for Certification - Chapter 01
Java SE 17 Study Guide for Certification - Chapter 01
williamrobertherman
 
Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available
Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai AvailableDombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available
Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available
cristine510
 
@Call @Girls in Aligarh 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A...
 @Call @Girls in Aligarh 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A... @Call @Girls in Aligarh 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A...
@Call @Girls in Aligarh 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A...
msriya3
 
dachnug51 - HCLs evolution of the employee experience platform.pdf
dachnug51 - HCLs evolution of the employee experience platform.pdfdachnug51 - HCLs evolution of the employee experience platform.pdf
dachnug51 - HCLs evolution of the employee experience platform.pdf
DNUG e.V.
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
Philip Schwarz
 
Chennai @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet
Chennai @Call @Girls 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real MeetChennai @Call @Girls 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet
Chennai @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet
lovelykumarilk789
 
@Call @Girls in Surat 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Surat Avaulable
 @Call @Girls in Surat 🐱‍🐉  XXXXXXXXXX 🐱‍🐉  Best High Class Surat Avaulable @Call @Girls in Surat 🐱‍🐉  XXXXXXXXXX 🐱‍🐉  Best High Class Surat Avaulable
@Call @Girls in Surat 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Surat Avaulable
DiyaSharma6551
 
dachnug51 - HCL Domino Roadmap .pdf
dachnug51 - HCL Domino Roadmap      .pdfdachnug51 - HCL Domino Roadmap      .pdf
dachnug51 - HCL Domino Roadmap .pdf
DNUG e.V.
 
Migrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS CloudMigrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS Cloud
Ortus Solutions, Corp
 

Recently uploaded (20)

Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)
 
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptxAddressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
 
@Call @Girls in Saharanpur 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Clas...
 @Call @Girls in Saharanpur 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Clas... @Call @Girls in Saharanpur 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Clas...
@Call @Girls in Saharanpur 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Clas...
 
@Call @Girls in Tiruppur 🤷‍♂️ XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class ...
 @Call @Girls in Tiruppur 🤷‍♂️  XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class ... @Call @Girls in Tiruppur 🤷‍♂️  XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class ...
@Call @Girls in Tiruppur 🤷‍♂️ XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class ...
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
 
NYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdfNYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdf
 
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
 
Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...
Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...
Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...
 
ENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentationENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentation
 
@Call @Girls in Tirunelveli 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...
 @Call @Girls in Tirunelveli 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla... @Call @Girls in Tirunelveli 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...
@Call @Girls in Tirunelveli 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...
 
Kolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Kolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeKolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Kolkata @ℂall @Girls ꧁❤ 000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
Java SE 17 Study Guide for Certification - Chapter 01
Java SE 17 Study Guide for Certification - Chapter 01Java SE 17 Study Guide for Certification - Chapter 01
Java SE 17 Study Guide for Certification - Chapter 01
 
Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available
Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai AvailableDombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available
Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available
 
@Call @Girls in Aligarh 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A...
 @Call @Girls in Aligarh 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A... @Call @Girls in Aligarh 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A...
@Call @Girls in Aligarh 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Class A...
 
dachnug51 - HCLs evolution of the employee experience platform.pdf
dachnug51 - HCLs evolution of the employee experience platform.pdfdachnug51 - HCLs evolution of the employee experience platform.pdf
dachnug51 - HCLs evolution of the employee experience platform.pdf
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
 
Chennai @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet
Chennai @Call @Girls 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real MeetChennai @Call @Girls 🐱‍🐉  XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet
Chennai @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet
 
@Call @Girls in Surat 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Surat Avaulable
 @Call @Girls in Surat 🐱‍🐉  XXXXXXXXXX 🐱‍🐉  Best High Class Surat Avaulable @Call @Girls in Surat 🐱‍🐉  XXXXXXXXXX 🐱‍🐉  Best High Class Surat Avaulable
@Call @Girls in Surat 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Surat Avaulable
 
dachnug51 - HCL Domino Roadmap .pdf
dachnug51 - HCL Domino Roadmap      .pdfdachnug51 - HCL Domino Roadmap      .pdf
dachnug51 - HCL Domino Roadmap .pdf
 
Migrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS CloudMigrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS Cloud
 

Etu Solution Day 2014 Track-D: 掌握Impala和Spark

  • 1. 掌握Impala和Spark Real-time Big Data即時應用架構研習 Etu首席顧問 陳昭宇 Oct 8, 2014
  • 2. Workshop Goal Let’s talk about the 3Vs in Big Data. Hadoop is good for Volume and Variety But… How about Velocity ??? This is why we are sitting here ….
  • 3. Target Audience • CTO • Architect • Software/Application Developer • IT
  • 4. Background Knowledge • Linux operation system • Basic Hadoop ecosystem knowledge • Basic knowledge of SQL • Java or Python programming experience
  • 5. Terminology • Hadoop: Open source big data platform • HDFS: Hadoop Distributed Filesystem • MapReduce: Parallel computing framework on top of HDFS • HBase: NoSQL database on top of Hadoop • Impala: MPP SQL query engine on top of Hadoop • Spark: In-memory cluster computing engine • Hive: SQL to MapReduce translator • Hive Metastore: Database that stores table schema • Hive QL: A SQL subset
  • 6. Agenda • What is Hadoop and what’s wrong with Hadoop in real-time? • What is Impala? • Hands-on Impala • What is Spark? • Hands-on Spark • Spark and Impala work together • Q & A
  • 7. What is Hadoop? Apache Hadoop is an open source platform for data storage and processing that is: distributed, fault tolerant, scalable. Core Hadoop system components: HDFS – a fault-tolerant, scalable clustered storage; MapReduce – a distributed computing framework. • Ask questions across structured and unstructured data • Schema-less • Scale-out architecture divides workloads across nodes • Flexible file system eliminates ETL bottlenecks • Flexible for storing and mining any type of data • Processing complex Big Data • Scales economically: deploy on commodity hardware, open sourced platform
  • 8. Limitations of MapReduce • Batch oriented • High latency • Doesn’t fit all cases • Only for developers
  • 9. Pig and Hive • MR is hard and only for developers • High-level abstractions that convert declarative syntax to MR – SQL: Hive – Dataflow language: Pig • Built on top of MapReduce
  • 10. Goals • General-purpose SQL engine: – Works for both analytics and transactional/single-row workloads. – Supports queries that take from milliseconds to hours. • Runs directly within Hadoop and: – Reads widely used Hadoop file formats. – Runs on same nodes that run Hadoop processes. • High performance: – C++ instead of Java – Runtime code generation – Completely new execution engine – No MapReduce
  • 11. What is Impala • General-purpose SQL engine • Real-time queries in Apache Hadoop • Beta released in Oct. 2012 • GA since Apr. 2013 • Apache licensed • Latest release v1.4.2
  • 12. Impala Overview • Distributed service in cluster: One impala daemon on each data node • No SPOF • User submits query via ODBC/JDBC, CLI, or HUE to any of the daemons. • Query is distributed to all nodes with data locality. • Uses Hive’s metadata interfaces and connects to Hive metastore. • Supported file formats: – Uncompressed/lzo-compressed text files – Sequence files and RCFile with snappy/gzip, Avro – Parquet columnar format
  • 13. Impala’s SQL • High compatibility with HiveQL • SQL support: – Essential SQL-92, minus correlated subqueries – INSERT INTO … SELECT … – Only equi-joins; no non-equi-joins, no cross products – Order By requires Limit (not required after 1.4.2) – Limited DDL support – SQL-style authorization via Apache Sentry – UDFs and UDAFs are supported
  • 14. Impala’s SQL limitations • No custom file formats or SerDes • No beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, xpath, json) • Broadcast joins and partitioned hash joins supported (the smaller table has to fit in the aggregate memory of all executing nodes)
  • 15. Work with HBase • Functionality highlights: – Support for SELECT, INSERT INTO…SELECT…, and INSERT INTO … VALUES (…) – Predicates on rowkey columns are mapped into start/stop rows – Predicates on other columns are mapped into SingleColumnValueFilters • BUT mapping of HBase tables and metastore table patterned after Hive: – All data is stored as scalars and in ASCII. – The rowkey needs to be mapped into a single string column.
  • 16. HBase in Roadmap • Full support for UPDATE and DELETE. • Storage of structured data to minimize storage and access overhead. • Composite rowkey encoding mapped into an arbitrary number of table columns.
  • 17. Impala’s Architecture [diagram: each data node runs an impalad (Query Planner, Query Coordinator, Query Executor) next to the HDFS DataNode and HBase; shared services: Hive Metastore, HDFS NameNode, Statestore; SQL client connects via ODBC] 1. Request arrives via ODBC/JDBC/Beeswax/Shell.
  • 18. Impala’s Architecture 2. Planner turns request into collections of plan fragments. 3. Coordinator initiates execution on impalad(s) local to data.
  • 19. Impala’s Architecture 4. Intermediate results are streamed between impalad(s). 5. Query results are streamed back to client.
  • 20. Metadata Handling • Impala metadata – Hive’s metastore: Logical metadata (table definitions, columns, CREATE TABLE parameters) – HDFS NameNode: Directory contents and block replica locations – HDFS DataNode: Block replicas’ volume ids • Caches metadata: No synchronous metastore API calls during query execution • Impala instances read metadata from metastore at startup. • Catalog Service relays metadata when you run DDL or update metadata on one of the impalads.
  • 21. Metadata Handling – Cont. • REFRESH [<tbl>]: Reloads metadata on all impalads (if you added new files via Hive) • INVALIDATE METADATA: Reloads metadata for all tables
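The REFRESH / INVALIDATE METADATA statements above can be issued from any client; as a minimal sketch (not part of the original deck), the Scala snippet below sends them over JDBC using the HiveServer2 protocol that Impala speaks on port 21050. The host impalad-host and the table name web_logs are illustrative assumptions, and the Hive JDBC driver is assumed to be on the classpath.

  import java.sql.DriverManager

  object ImpalaMetadataRefresh {
    def main(args: Array[String]): Unit = {
      // Hypothetical impalad endpoint; any impalad in the cluster will do.
      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl")
      val stmt = conn.createStatement()
      stmt.execute("REFRESH web_logs")        // reload metadata for one table after adding files via Hive
      stmt.execute("INVALIDATE METADATA")     // heavier: marks metadata for all tables to be reloaded
      stmt.close(); conn.close()
    }
  }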
  • 22. Comparing Impala to Dremel • What is Dremel? – Columnar storage for data with nested structures – Distributed scalable aggregation on top of that • Columnar storage in Hadoop: Parquet – Store data in appropriate native/binary types – Can also store nested structures similar to Dremel’s ColumnIO • Distributed aggregation: Impala • Impala plus Parquet: A superset of the published version of Dremel (does not support joins)
  • 23. Comparing Impala to Hive • Hive: MapReduce as an execution engine – High latency, low throughput queries – Fault-tolerance model based on MapReduce’s on-disk checkpointing; materializes all intermediate results – Java runtime allows for easy late-binding of functionality: file formats and UDFs – Extensive layering imposes high runtime overhead • Impala: – Direct, process-to-process data exchange – No fault tolerance – An execution engine designed for low runtime overhead
  • 24. Impala and Hive Share Everything Client-Facing • Metadata (table definitions) • ODBC/JDBC drivers • SQL syntax (Hive SQL) • Flexible file formats (TEXT, RCFILE, PARQUET, AVRO, etc.) • Machine pool • GUI • Resource management • Data store and ingestion (HDFS, HBase) But Built for Different Purposes • Hive: SQL syntax on the MapReduce compute framework, ideal for batch processing • Impala: SQL syntax plus its own native MPP compute framework, ideal for interactive SQL
  • 25. Typical Use Cases • Data Warehouse Offload • Ad-hoc Analytics • Provide SQL interoperability to HBase
  • 26. Hands-on Impala • Query a file on HDFS with Impala • Query a table on HBase with Impala
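The deck does not fix a client for the hands-on part, so the sketch below (an assumption, not the workshop’s actual lab code) shows both exercises through JDBC from Scala: exposing a delimited file already on HDFS as an external table and querying it, then querying an HBase-backed table that is assumed to have been mapped through the Hive metastore. Host, schema, paths, and table names are all placeholders.

  import java.sql.DriverManager

  object HandsOnImpala {
    def main(args: Array[String]): Unit = {
      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl")
      val stmt = conn.createStatement()

      // 1) Query a file on HDFS: wrap it in an external table first (schema/path are hypothetical).
      stmt.execute(
        """CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
          |  ts STRING, ip STRING, url STRING, bytes BIGINT)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          |LOCATION '/user/etu/access_logs'""".stripMargin)
      val rs1 = stmt.executeQuery(
        "SELECT url, COUNT(*) AS hits FROM access_logs GROUP BY url ORDER BY hits DESC LIMIT 10")
      while (rs1.next()) println(rs1.getString(1) + "\t" + rs1.getLong(2))

      // 2) Query a table on HBase: assumes an HBase-mapped table already registered in the metastore.
      val rs2 = stmt.executeQuery("SELECT * FROM hbase_user_profile WHERE rowkey = 'user123'")
      while (rs2.next()) println(rs2.getString(1))

      conn.close()
    }
  }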
  • 27. What is Spark? • MapReduce Review… • Apache Spark… • How Spark Works… • Fault Tolerance and Performance… • Examples… • Spark & More…
  • 28. MapReduce: Good The Good: •Built in fault tolerance •Optimized IO path •Scalable •Developer focuses on Map/Reduce, not infrastructure •Simple? API
  • 29. MapReduce: Bad The Bad: • Optimized for disk IO – Does not leverage memory – Iterative algorithms go through the disk IO path again and again • Primitive API – Developers have to build on a very simple abstraction – Key/Value in/out – Even basic things like join require extensive code • A common result is many files that need to be combined appropriately
  • 30. Apache Spark • Originally developed in 2009 in UC Berkeley’s AMP Lab. • Fully open sourced in 2010 – now at Apache Software Foundation.
  • 31. Spark: Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive Shell • 2-5x less code • Fast to Run – General execution graph – In-memory store
  • 32. How Spark Works – SparkContext [diagram: the driver’s SparkContext talks to the Cluster Master; each Spark Worker runs an Executor with a cache, executing tasks against the local Data Node / HDFS] Driver code: sc = new SparkContext; rdd = sc.textFile(“hdfs://..”); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)
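The driver-side pseudo-code on this slide maps onto the ordinary Spark API; here is a minimal, self-contained Scala sketch of it (master URL, path, and filter predicate are placeholders, not from the deck):

  import org.apache.spark.{SparkConf, SparkContext}

  object HowSparkWorks {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("how-spark-works").setMaster("local[2]")
      val sc = new SparkContext(conf)

      val rdd = sc.textFile("hdfs:///tmp/input.txt")      // lazily defines the RDD
      val errors = rdd.filter(_.contains("ERROR"))        // transformation, still lazy
      errors.cache()                                      // keep these partitions in executor memory
      println(errors.count())                             // first action triggers the computation
      println(errors.map(_.length).reduce(_ + _))         // second action reuses the cached data

      sc.stop()
    }
  }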
  • 33. How Spark Works – RDD (Resilient Distributed Dataset) • Partitions of data with dependencies between partitions • Fault tolerant • Controlled partitioning to optimize data placement • Manipulated using a rich set of operators • Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY
  • 34. RDD • Stands for Resilient Distributed Datasets • Spark revolves around RDDs • Fault-tolerant, read-only collection of elements that can be operated on in parallel • Cached in memory Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 35. RDD • Read-only, partitioned collection of records (e.g. partitions D1, D2, D3) • Supports only coarse-grained operations – e.g. map and group-by transformations, reduce actions [diagram: each transformation yields a new partitioned dataset D1, D2, D3; an action collapses it to a value]
  • 37. RDD Operations - Expressive • Transformations – Create a new RDD from an existing one: • map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc… • Actions – Return a value after running a computation: • collect, count, first, takeSample, foreach, etc… • Reference – http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
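To make the lazy-transformation / eager-action split concrete, the lines below can be typed into spark-shell (where sc is predefined); the sample numbers are arbitrary:

  // Transformations only build up lineage; nothing runs until an action is called.
  val nums     = sc.parallelize(Seq(1, 2, 3, 4, 5, 4, 3))
  val doubled  = nums.map(_ * 2)            // transformation
  val distinct = doubled.distinct()         // transformation
  val big      = distinct.filter(_ > 4)     // transformation

  println(big.count())                      // action: executes the whole lineage
  println(big.collect().mkString(", "))     // action: materializes results on the driver
  big.take(2).foreach(println)              // action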
  • 38. Word Count on Spark sparkContext.textFile(“hdfs://…”) RDD[String] textFile
  • 39. Word Count on Spark sparkContext.textFile(“hdfs://…”) .map(line => line.split(“\s”)) RDD[String] RDD[List[String]] textFile map
  • 40. Word Count on Spark sparkContext.textFile(“hdfs://…”) .map(line => line.split(“\s”)) .map(word => (word, 1)) RDD[String] RDD[List[String]] RDD[(String, Int)] textFile map map
  • 41. Word Count on Spark sparkContext.textFile(“hdfs://…”) .map(line => line.split(“\s”)) .map(word => (word, 1)) .reduceByKey((a, b) => a+b) RDD[String] RDD[List[String]] RDD[(String, Int)] RDD[(String, Int)] textFile map map reduceByKey
  • 42. Word Count on Spark sparkContext.textFile(“hdfs://…”) .map(line => line.split(“\s”)) .map(word => (word, 1)) .reduceByKey((a, b) => a+b) .collect() RDD[String] RDD[List[String]] RDD[(String, Int)] RDD[(String, Int)] Array[(String, Int)] textFile map map reduceByKey collect
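As written on the slides the pipeline is schematic; a compilable version needs flatMap so the split words form a flat RDD[String] before pairing. The sketch below (paths and app name are placeholders) also adds the SparkContext._ import that pre-1.3 Spark needs for reduceByKey:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (Spark 1.x)

  object WordCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("word-count"))

      val counts = sc.textFile("hdfs:///tmp/input")   // RDD[String], one element per line
        .flatMap(_.split("\\s+"))                     // RDD[String], one element per word
        .map(word => (word, 1))                       // RDD[(String, Int)]
        .reduceByKey(_ + _)                           // RDD[(String, Int)], summed per word

      counts.collect().foreach { case (w, n) => println(s"$w\t$n") } // Array[(String, Int)] on the driver
      sc.stop()
    }
  }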
  • 43. Actions • Parallel Operations map reduce sample filter count take groupBy fold first sort reduceByKey partitionBy union groupByKey mapWith join cogroup pipe leftOuterJoin cross save rightOuterJoin zip ….
  • 44. Stages – the lineage textFile → map → map → reduceByKey → collect forms a DAG (Directed Acyclic Graph) that is split into Stage 1 and Stage 2; each stage is executed as a series of tasks (one task for each partition).
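The stage split is easy to see from the shell: toDebugString prints the lineage, and the shuffle introduced by reduceByKey is where Stage 1 ends and Stage 2 begins (the path below is a placeholder):

  // In spark-shell
  val counts = sc.textFile("hdfs:///tmp/input")
    .flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
  println(counts.toDebugString)   // the ShuffledRDD in the printed lineage marks the stage boundary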
  • 45. Tasks – the task is the fundamental unit of execution in Spark. [diagram: on each core, a task fetches input (from HDFS or an RDD), executes, and writes output (to HDFS, an RDD, or intermediate shuffle output)]
  • 46. Spark Summary • SparkContext • Resilient Distributed Dataset • Parallel Operations • Shared Variables – Broadcast Variables – read-only – Accumulators
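A quick illustration of the two kinds of shared variables, again for spark-shell (the lookup map and sample codes are made up):

  val lookup = sc.broadcast(Map("200" -> "OK", "404" -> "Not Found")) // read-only, shipped once per executor
  val misses = sc.accumulator(0)                                      // counter, aggregated back on the driver

  val codes = sc.parallelize(Seq("200", "404", "500"))
  val described = codes.map { c =>
    if (!lookup.value.contains(c)) misses += 1
    (c, lookup.value.getOrElse(c, "unknown"))
  }
  described.collect().foreach(println)
  println("codes without a description: " + misses.value)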
  • 47. Comparison (MapReduce / Impala / Spark) • Storage: HDFS / HDFS and HBase / HDFS • Scheduler: MapReduce job / query plan / computation graph • I/O: disk / in-memory with cache / in-memory with cache and shared data • Fault tolerance: duplication and disk I/O / no fault tolerance / hash partitioning and automatic reconstruction • Iterative: bad / bad / good • Shared data: no / no / yes • Streaming: no / no / yes
  • 48. Hands-on Spark • Spark Shell • Word Count
  • 49. Spark Streaming • Takes the concept of RDDs and extends it to DStreams – Fault-tolerant like RDDs – Transformable like RDDs • Adds new “rolling window” operations – Rolling averages, etc. • But keeps everything else! – Regular Spark code works in Spark Streaming – Can still access HDFS data, etc. • Example use cases: – “On-the-fly” ETL as data is ingested into Hadoop/HDFS – Detecting anomalous behavior and triggering alerts – Continuous reporting of summary metrics for incoming data
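A minimal DStream sketch with a rolling window, assuming the Spark 1.x streaming API of the deck; the socket source, batch interval, and window sizes are arbitrary choices for illustration:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions (Spark 1.x)

  object StreamingWordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(5))        // 5-second micro-batches
      ssc.checkpoint("hdfs:///tmp/checkpoints")               // commonly enabled for windowed/stateful streams

      val lines = ssc.socketTextStream("localhost", 9999)     // stand-in for the real ingest source
      val windowed = lines.flatMap(_.split("\\s+"))
        .map(w => (w, 1))
        .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(5)) // rolling 60s counts

      windowed.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }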
  • 51. Micro-batching for on-the-fly ETL
  • 53. Spark SQL • Spark SQL is one of Spark’s components. – Executes SQL on Spark – Builds SchemaRDD • Optimizes execution plan • Uses existing Hive metastores, SerDes, and UDFs.
  • 54. Unified Data Access • Ability to load and query data from a variety of sources. • SchemaRDDs provide a single interface that efficiently works with structured data, including Hive tables, Parquet files, and JSON. sqlCtx.jsonFile("s3n://...").registerAsTable("json") schema_rdd = sqlCtx.sql("""SELECT * FROM hiveTable JOIN json ...""") Query and join different data sources
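The snippet on the slide is Python-flavoured; a Scala equivalent against the Spark 1.1-era API looks roughly like the sketch below. It sticks to a plain SQLContext (joining actual Hive tables, as on the slide, would use HiveContext instead), and the JSON path, table name, and query are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object UnifiedDataAccess {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("unified-data-access").setMaster("local[2]"))
      val sqlContext = new SQLContext(sc)

      val events = sqlContext.jsonFile("hdfs:///data/events.json")   // schema inferred from the JSON
      events.registerTempTable("events")

      val top = sqlContext.sql(
        "SELECT url, COUNT(*) AS hits FROM events GROUP BY url ORDER BY hits DESC LIMIT 10")
      top.collect().foreach(println)

      sc.stop()
    }
  }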
  • 55. Hands-on Spark • Parse/transform log on the fly with Spark-Streaming • Aggregate with Spark SQL (Top N) • Output from Spark to HDFS
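One way the three hands-on steps fit together, sketched against the pre-1.3 API (SchemaRDD, createSchemaRDD, registerTempTable); the LogLine layout, socket source, and output path are assumptions rather than the workshop’s actual lab code:

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // One parsed log record; the two-field layout is an assumption about the lab data.
  case class LogLine(ip: String, url: String)

  object LogTopN {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(
        new SparkConf().setAppName("log-top-n").setMaster("local[2]"), Seconds(10))
      val sqlContext = new SQLContext(ssc.sparkContext)
      import sqlContext.createSchemaRDD                    // implicit RDD[LogLine] -> SchemaRDD (Spark 1.x)

      // Parse/transform the log on the fly.
      val lines  = ssc.socketTextStream("localhost", 9999) // stand-in for the real log stream
      val parsed = lines.map(_.split(" ")).filter(_.length >= 2).map(f => LogLine(f(0), f(1)))

      // Aggregate each batch with Spark SQL (top N) and write the result to HDFS.
      parsed.foreachRDD { (rdd, time) =>
        rdd.registerTempTable("logs")
        val topN = sqlContext.sql(
          "SELECT url, COUNT(*) AS hits FROM logs GROUP BY url ORDER BY hits DESC LIMIT 10")
        topN.map(_.mkString("\t")).saveAsTextFile("hdfs:///tmp/top-urls/" + time.milliseconds)
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }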
  • 56. Spark & Impala work together [diagram: data streams (clickstream, machine data, logs, network traffic, etc.) feed Spark Streaming, Spark, and Impala co-located on each node (HDFS DataNode / HBase RegionServer)] • On-the-fly processing – ETL, transformation, filtering, pattern matching & alerts • Real-time analytics – machine learning (recommendation, clustering…), iterative algorithms • Near real-time query – ad-hoc query, reporting • Long-term data store – batch processing, offline analytics, historical mining
  • 57. Etu makes Hadoop easier – a fully automated, high-performance, easy-to-manage software appliance: automated bare-metal deployment, performance optimization, full-cluster management, the only local Hadoop professional services, mainstream x86 commodity servers
  • 58. ESA Software Stack – Cloudera Manager plus Etu Manager: security management, performance optimization, configuration synchronization, network management, monitoring and alerting, package management, HA management, rack awareness, and Etu value-added modules, on a CentOS operating system (64-bit)
  • 59. Etu’s position and value in the Hadoop ecosystem – Etu services cover talent recruitment, team building, application development, data architecture, mining and design, and make the Hadoop platform easy to deploy, tune, operate, and manage, shielding customers from that complexity. • Launch services quickly and capture the market • Applications and data are the enterprise’s core value • Allocate resources around that core value to build competitive advantage. Offerings: Software Appliance, Etu Support, Etu Professional Services, Etu Consulting, Etu Training
  • 60. Question and Discussion Thank you 318, Rueiguang Rd., Taipei 114, Taiwan T: +886 2 7720 1888 F: +886 2 8798 6069 www.etusolution.com