Alex Zeltov
Solutions Engineer
@azeltov
http://tiny.cc/sparkmeetup
Intro to Big Data Analytics using
Apache Spark & Zeppelin
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
In this workshop
• Introduction to HDP and Spark
• Spark Programming: Scala, Python, R
- Core Spark: working with RDDs
- Spark SQL: structured data access
- Spark MLlib: predictive analytics
- Spark Streaming: real-time data processing
• Conclusion and Further Reading, Q/A
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark Background
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Spark?
 Apache Open Source Project - originally developed at AMPLab
(University of California Berkeley)
 Data Processing Engine - focused on in-memory distributed
computing use-cases
 API - Scala, Python, Java and R
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Ecosystem
[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX libraries built on top of Spark Core]
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Spark?
 Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)
 In-memory computation model – Fast!
– Effective for iterative computations and ML
 Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Generality
• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
Runs Everywhere:
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse
data sources including HDFS, Cassandra, HBase, S3, and WASB.
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Emerging Spark Patterns
 Spark as query federation engine
• Bring data from multiple sources to join/query in Spark
 Use multiple Spark libraries together
• Common to see Core, MLlib & SQL used together
 Use Spark with various Hadoop ecosystem projects
• Use Spark & Hive together
• Spark & HBase together
• Spark & SOLR, etc...
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
More Data Source APIs
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Hadoop?
Apache Hadoop is an open-source software framework written in
Java for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part, the Hadoop
Distributed File System (HDFS); a processing part (MapReduce);
and YARN, the resource manager that allocates resources and
schedules applications.
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Access patterns enabled by YARN
YARN: Data Operating System
[Diagram: batch, interactive, and real-time applications running side by side on
YARN over HDFS (Hadoop Distributed File System)]
Batch
Needs to happen, but with no timeframe limitations
Interactive
Needs to happen at human time
Real-Time
Needs to happen at machine-execution time
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Spark on YARN?
 Utilize existing HDP cluster infrastructure
 Resource management
– share Spark workloads with other workloads like Pig, Hive, etc.
 Scheduling and queues
[Diagram: the client launches the Spark Driver; a Spark Application Master runs
in a YARN container and manages Spark Executors, each in its own YARN container
running tasks]
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why HDFS?
Fault Tolerant Distributed Storage
• Divides files into large blocks and distributes three copies of each block randomly across the cluster
• Processing exploits data locality
• Not just storage but computation
[Diagram: a logical file is split into blocks 1–4; each block is stored three
times across the cluster]
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.4
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: MapReduce, Apache Hive, Apache Pig,
Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV engines
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
GOVERNANCE: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
OPERATIONS: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
SECURITY: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption
Deployment Choice: Linux, Windows, On-premises, Cloud
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Commitment to Spark
Hortonworks is focused on making Apache
Spark enterprise ready so you can depend
on it for mission critical applications
YARN: Data Operating System
[Diagram: the HDP stack, with Pig (script), Solr (search), Hive/HCatalog (SQL),
HBase/Accumulo (NoSQL), Storm (stream), Spark (in-memory), and other ISV engines
running on Tez over YARN, alongside security, governance & integration, and
operations frameworks]
1. YARN enables Spark to co-exist with other engines
Spark is "YARN Ready," so its memory- and CPU-intensive apps can work with
predictable performance alongside other engines, all on the same set(s) of data.
2. Extend Spark with enterprise capabilities
Ensure Spark can be managed, secured, and governed via a single set of
frameworks to ensure consistency. Ensure reliability and quality of service of
Spark alongside other engines.
3. Actively collaborate within the open community
As with everything we do at Hortonworks, we work entirely within the open
community across Spark and all related projects to improve this key Hadoop
technology.
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Interacting with Spark
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Interacting with Spark
• Spark’s interactive REPL shell (in Python or Scala)
• Web-based Notebooks:
• Zeppelin: A web-based notebook that enables interactive data
analytics.
• Jupyter: Evolved from the IPython Project
• SparkNotebook: forked from the scala-notebook
• RStudio: for SparkR; Zeppelin support coming soon
https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin
• A web-based notebook that enables interactive data
analytics.
• Multiple language backend
• A multi-purpose notebook: one place for all your needs
 Data Ingestion
 Data Discovery
 Data Analytics
 Data Visualization
 Collaboration
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Zeppelin – Multiple language backends
Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Zeppelin – Dependency Management
• Load libraries recursively from Maven repository
• Load libraries from local filesystem
%dep
// add a Maven repository
z.addRepo("RepoName").url("RepoURL")
// add an artifact from the local filesystem
z.load("/path/to.jar")
// add an artifact from a Maven repository, excluding its transitive dependencies
z.load("groupId:artifactId:version").excludeAll()
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Community Plugins
• 100+ connectors
http://spark-packages.org/
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark Basics
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How Does Spark Work?
• RDD
• Your data is loaded in parallel into structured collections
• Actions
• Manipulate the state of the working model by forming new RDDs and
performing calculations upon them
• Persistence
• Long-term storage of an RDD’s state
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
RDD - Resilient Distributed Dataset
 Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on
in parallel
 Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– User can control the number of partitions
– More partitions => more parallelism
 Resilient
– Recover from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs
 May be persisted in memory for efficient reuse across parallel operations (caching)
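A minimal sketch of controlling partitioning (values and variable names are
illustrative):
// create an RDD with an explicit partition count
val rdd = sc.parallelize(1 to 100, 4)
rdd.partitions.size // => 4 (more partitions => more parallelism)
rdd.cache() // persist in memory for efficient reuse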
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
RDD – Resilient Distributed Dataset
[Diagram: an RDD of 25 items split into 5 partitions, distributed across 3
workers, each running a Spark executor]
The programmer specifies the number of partitions for an RDD (a default value is
used if unspecified); more partitions = more parallelism.
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is computed only when an action runs on it
• Persist (cache) RDDs in memory or on disk
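A minimal sketch of lazy evaluation (values are illustrative):
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2) // transformation: nothing computed yet
val total = doubled.reduce(_ + _) // action: triggers the computation => 110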
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Example RDD Transformations
map(func)
filter(func)
distinct()
• Each creates a new RDD from an existing one
• The new RDD is not computed until an action is performed (lazy)
• Each element in the source RDD is passed to the target function and the
result forms a new RDD
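A short sketch of these transformations chained together (the data is
illustrative):
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))
val upper = words.map(_.toUpperCase) // one output element per input element
val noH = upper.filter(!_.startsWith("H")) // keep elements passing the predicate
noH.distinct().collect() // duplicates removed => Array(SPARK)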
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Example Action Operations
count()
reduce(func)
collect()
take(n)
• Either:
• Returns a value to the driver program
• Exports state to external system
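A quick sketch of each action on a small RDD (values are illustrative):
val nums = sc.parallelize(1 to 5)
nums.count() // => 5
nums.reduce(_ + _) // => 15
nums.collect() // => Array(1, 2, 3, 4, 5), pulls all elements to the driver
nums.take(2) // => Array(1, 2)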
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Example Persistence Operations
persist() -- takes options
cache() -- only one option: in-memory
• Stores RDD values
• in memory (what doesn’t fit is recalculated when necessary)
• replication is an option for in-memory storage
• on disk
• blended (memory and disk)
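A minimal sketch, assuming a file at a hypothetical HDFS path:
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("hdfs:///tmp/sample.txt") // hypothetical path
logs.cache() // shorthand for persist(StorageLevel.MEMORY_ONLY)
// or pick an explicit level: replicated in-memory, or blended memory+disk
// logs.persist(StorageLevel.MEMORY_ONLY_2)
// logs.persist(StorageLevel.MEMORY_AND_DISK)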
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Applications
Are a definition in code of
• RDD creation
• Actions
• Persistence
Result in the creation of a DAG (Directed Acyclic Graph) [workflow]
• Each DAG is compiled into stages
• Each Stage is executed as a series of Tasks
• Each Task operates in parallel on assigned partitions
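A sketch of a job that compiles into two stages (the path is hypothetical):
narrow transformations fuse into one stage, and reduceByKey introduces a shuffle
boundary that starts a second stage.
val counts = sc.textFile("hdfs:///tmp/events.txt") // hypothetical path
  .flatMap(_.split(" ")) // stage 1
  .map(word => (word, 1)) // still stage 1 (no shuffle yet)
  .reduceByKey(_ + _) // shuffle boundary => stage 2
counts.count() // action: triggers execution of the whole DAG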
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Context
What is it?
 Main entry point for Spark functionality
 Represents a connection to a Spark cluster
 Represented as sc in your code
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Context
• A Spark program first creates a SparkContext object
• Tells Spark how and where to access a cluster
• Use SparkContext to create RDDs
• SparkContext, SQLContext, ZeppelinContext:
• are automatically created and exposed as the variables 'sc', 'sqlContext' and 'z',
respectively, in both the Scala and Python environments in Zeppelin
• IPython and standalone programs must use a constructor to create a new SparkContext
Note: the Scala and Python environments share the same SparkContext, SQLContext, and
ZeppelinContext instances.
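A minimal sketch of the constructor route for a standalone program (the app name
and master setting are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)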
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
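A couple of natural follow-ups on the same RDD (the output path is hypothetical):
//Count the non-empty rows:
println(fltr.count())
//Peek at the first row:
println(fltr.first())
//Write the cleaned rows back to HDFS:
fltr.saveAsTextFile("hdfs:///tmp/littlelog-clean")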
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions as can be seen
in the code:
flatMap( line => line.split(" ") )
In line => line.split(" "), the name to the left of => (line) is the argument to
the function; the expression to the right is the body of the function.
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )
flatMap((line:String) => line.split(" "))
flatMap(_.split(" "))
In the first form the argument's type is inferred; in the second it is declared
explicitly ((line:String)). In the third form no argument is declared at all: the
placeholder _ stands in for it. Each _ in the body allows exactly one use of one
argument; _ essentially means ‘whatever you pass me’.
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
And Finally – the Formal ‘def’
def myFunc(line:String): Array[String]={
return line.split(",")
}
//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)
Here line:String is the argument to the function, Array[String] is its return
type, and line.split(",") is the body of the function.
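Once named, the function can be passed anywhere an anonymous one is expected,
for example (the data is illustrative):
val parts = sc.parallelize(Seq("a,b", "c,d")).flatMap(myFunc)
parts.collect() // => Array(a, b, c, d)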
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Lab Spark RDD – Philly Crime Dataset
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL Overview
 Spark module for structured data processing (e.g. DB tables, JSON files)
 Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
 Same execution engine for all three
 Spark SQL interfaces provide more information about both the structure of the data and the
computation being performed than the basic Spark RDD API
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
 Conceptually equivalent to a table in relational DB or data frame in R/Python
 API available in Scala, Java, Python, and R
 Richer optimizations (significantly faster than RDDs)
 Distributed collection of data organized into named columns
 Underneath is an RDD
 The Catalyst optimizer is used under the hood
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
[Diagram: Spark SQL builds a DataFrame (with an RDD underneath) of rows and named
columns from sources such as Hive, Avro, CSV, and text]
Created from Various Sources
 DataFrames from HIVE:
– Reading and writing HIVE tables,
including ORC
 DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro
 DataFrames from existing RDDs
– with the toDF() function
Data is described as a DataFrame
with rows, columns, and a schema
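A minimal toDF() sketch, assuming a simple illustrative case class:
case class Person(name: String, age: Int) // illustrative schema
import sqlContext.implicits._ // brings toDF() into scope
val peopleDF = sc.parallelize(Seq(Person("Ada", 36), Person("Alan", 41))).toDF()
peopleDF.printSchema()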
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Working with a DataFrame
val df = sqlContext.jsonFile("/tmp/people.json")
df.show()
df.printSchema()
df.select ("First Name").show()
df.select("First Name","Age").show()
df.filter(df("age")>40).show()
df.groupBy("age").count().show()
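And a hedged sketch of writing the DataFrame back out with the Spark 1.4+ writer
API (the output paths are hypothetical):
df.write.format("parquet").save("/tmp/people_parquet") // hypothetical path
df.filter(df("age") > 40).write.format("json").save("/tmp/people_over40")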
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL Examples
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SQL Context and Hive Context
SQLContext
 Entry point into all the functionality in Spark SQL
 All you need is a SparkContext
val sqlContext = new SQLContext(sc)
HiveContext
 Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
val hc = new HiveContext(sc)
Use HiveContext when your data resides in Hive.
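A minimal HiveContext sketch, assuming a Hive table named people already exists:
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.sql("SELECT name, age FROM people WHERE age > 40").show()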
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrame Example
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
Reading Data From Table
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 8|
| IAD| TPA| 19|
| IND| BWI| 8|
| IND| BWI| -4|
| IND| BWI| 34|
+------+----+--------+
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrame Example
df.select("Origin", "Dest", "DepDelay”).filter($"DepDelay" > 15).show(5)
Using DataFrame API to Filter Data (show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SQL Example
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5").show
Using SQL to Query and Filter Data (again, show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
RDD vs. DataFrame
49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
RDDs vs. DataFrames
RDD:
 Lower-level API (more control)
 Lots of existing code & users
 Compile-time type-safety
DataFrame:
 Higher-level API (faster development)
 Faster sorting, hashing, and serialization
 More opportunities for automatic optimization
 Lower memory pressure
50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Frames are Intuitive
Find the average age by department, given this table:
dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
[Screenshots: an RDD example and the equivalent, much shorter DataFrame example]
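The original screenshots are not reproduced here; below is a sketch of both
approaches, assuming rdd holds (dept, name, age) tuples and df is the
corresponding DataFrame (both names are illustrative):
// RDD API: carry (sum, count) per department, then divide
val avgByDeptRdd = rdd.map { case (dept, name, age) => (dept, (age, 1)) }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }
// DataFrame API: declarative, and Catalyst optimizes it
import org.apache.spark.sql.functions.avg
val avgByDeptDf = df.groupBy("dept").agg(avg("age"))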
51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL Optimizations
 Spark SQL uses an underlying optimization engine (Catalyst)
– Catalyst can perform intelligent optimization since it understands the schema
 Unlike the RDD API, Spark SQL does not materialize all the columns, only those that are needed
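You can inspect Catalyst's work directly: explain(true) prints the parsed,
analyzed, optimized, and physical plans (df as in the earlier examples):
df.filter(df("age") > 40).select("age").explain(true)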
52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Lab DataFrames – Federated Spark SQL
53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Community Connection
54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
community.hortonworks.com
55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
community.hortonworks.com
56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HCC DS, Analytics, and Spark Related Questions Sample
Thank you!
community.hortonworks.com