The document provides an outline for the Spark Camp @ Strata CA tutorial. The morning session will cover introductions and getting started with Spark, an introduction to MLlib, and exercises on working with Spark on a cluster and notebooks. The afternoon session will cover Spark SQL, visualizations, Spark streaming, building Scala applications, and GraphX examples. The tutorial will be led by several instructors from Databricks and include hands-on coding exercises.
You've seen the basic two-stage example Spark programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
In this talk from the 2015 Spark Summit East, the lead developer of Spark Streaming, @tathadas, discusses the state of Spark Streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data applications are being written. It is being rapidly adopted by companies across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detection, etc. These companies are adopting Spark Streaming mainly because: – its simple, declarative, batch-like API makes large-scale stream processing accessible to non-scientists; – its unified API and single processing engine (the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases – batch, interactive, and stream processing; – its stronger, exactly-once semantics make it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing including Alternating Least Squares Machine Learning/Collaborative Filtering and PageRank Graph Processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
Strata NYC 2015: What's new in Spark Streaming (Databricks)
Spark Streaming allows processing of live data streams at scale. Recent improvements include:
1) Enhanced fault tolerance through a write-ahead log and replay of unprocessed data on failure.
2) Dynamic backpressure to automatically adjust ingestion rates and ensure stability.
3) Visualization tools for debugging and monitoring streaming jobs.
4) Support for streaming machine learning algorithms and integration with other Spark components.
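To make points (1) and (2) concrete, here is a minimal sketch (not taken from the talk) of a StreamingContext with the write-ahead log and backpressure settings turned on; the app name, batch interval, socket source, and checkpoint path are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumed configuration values; only the two "set" lines relate to points (1) and (2)
val conf = new SparkConf()
  .setAppName("ResilientStreamingSketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // (1) write-ahead log so unprocessed data can be replayed on failure
  .set("spark.streaming.backpressure.enabled", "true")           // (2) dynamic backpressure on the ingestion rate

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoints")   // the WAL and recovered state need a checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)  // assumed source
lines.count().print()

ssc.start()
ssc.awaitTermination()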
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset, built on top of DataFrames, that leverages all the work in Project Tungsten for more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999]; adaptive query execution [SPARK-9850]; and unified memory management that consolidates cache and execution memory [SPARK-10000].
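As a rough illustration (not from the webcast) of what the type-safe Dataset API looks like in 1.6, assuming a SQLContext named sqlContext and a people.json file:

case class Person(name: String, age: Long)

import sqlContext.implicits._                   // brings in the encoders needed by .as[Person]

val df = sqlContext.read.json("people.json")    // untyped DataFrame
val people = df.as[Person]                      // typed Dataset[Person], checked at compile time

val adults = people.filter(_.age >= 18)         // a plain Scala lambda instead of a column expression
adults.show()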
A really really fast introduction to PySpark - lightning fast cluster computi... (Holden Karau)
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory wordcount example which comes with every big-data tutorial, as well as discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL & Spark ML) from Python. While Spark is available in a variety of languages, this workshop will be focused on using Spark and Python together.
Building a modern Application with DataFrames (Spark Summit)
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
Introduction to Apache Spark Developer Training (Cloudera, Inc.)
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn what Apache Spark is and how it compares to Hadoop MapReduce, how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs), who is best suited to attend the course and what prior knowledge you should have, and the benefits of building Spark applications as part of an enterprise data hub.
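As a small, hedged sketch of the filter/map/reduce/save operations mentioned above (the file paths are placeholders):

val lines = sc.textFile("access.log")             // load a text file as an RDD of lines
val errors = lines.filter(_.contains("ERROR"))    // filter: keep only error lines
val lengths = errors.map(_.length)                // map: transform each line to its length
val totalChars = lengths.reduce(_ + _)            // reduce: aggregate with an action
errors.saveAsTextFile("errors-out")               // save the filtered RDD back to storage
println(s"total characters in error lines: $totalChars")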
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark, which makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources including HDFS, Hive, JSON, and S3.
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure... (Databricks)
A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.
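A hedged sketch of creating DataFrames from a few of the sources the talk mentions; the table name, JDBC URL, file paths, and the external spark-avro package are assumptions, not material from the talk:

import org.apache.spark.sql.functions.desc

val fromHive = sqlContext.table("warehouse.events")   // an existing Hive table

val fromJdbc = sqlContext.read                        // an RDBMS table over JDBC
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/sales")
  .option("dbtable", "orders")
  .load()

val fromAvro = sqlContext.read                        // structured files via the external spark-avro package
  .format("com.databricks.spark.avro")
  .load("events.avro")

// DataFrame operations then compose like the example programs in the talk:
fromHive.groupBy("country").count().orderBy(desc("count")).show(10)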
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa... (Databricks)
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold talks about the underlying techniques used to achieve high performance sorting using Spark and Scala, among them sun.misc.Unsafe, exploiting cache locality, and high-level resource pipelining.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
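A minimal sketch of that pattern (file names and columns are assumptions): load JSON and Parquet into DataFrames, register them as temporary tables, and then query them either with SQL or with the equivalent DataFrame functions:

import org.apache.spark.sql.functions.desc

val logs  = sqlContext.read.json("logs.json")
val users = sqlContext.read.parquet("users.parquet")

logs.registerTempTable("logs")      // pre-2.0 name; later versions use createOrReplaceTempView
users.registerTempTable("users")

val topUsers = sqlContext.sql("""
  SELECT u.name, COUNT(*) AS hits
  FROM logs l JOIN users u ON l.user_id = u.id
  GROUP BY u.name
  ORDER BY hits DESC
  LIMIT 10
""")

// the same query expressed with DataFrame functions instead of SQL
val topUsersDF = logs.join(users, logs("user_id") === users("id"))
  .groupBy(users("name")).count()
  .orderBy(desc("count")).limit(10)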
Simplifying Big Data Analytics with Apache Spark (Databricks)
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
ETL with SPARK - First Spark London meetup (Rafal Kwasny)
The document discusses how Spark can be used to supercharge ETL workflows by running them faster and with less code compared to traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered like optimizing for JVM issues, avoiding full GC pauses, and tips for deployment on EC2. Future improvements to Spark like SQL support and Java 8 are also mentioned.
Introduction to Stateful Stream Processing with Apache Flink (Konstantinos Kloudas)
Kostas Kloudas presented on stateful stream processing with Apache Flink. He discussed how Flink handles state management, fault tolerance, and time semantics to allow for continuous and accurate processing of streaming data. Flink embeds local state with keyed streams, takes consistent snapshots of distributed state, and uses watermarks to process events in event time to produce correct results even for out-of-order data. This allows Flink to provide a robust stream processing engine that scales to large deployments.
The document discusses large-scale stream processing in the Hadoop ecosystem. It provides examples of real-time stream processing use cases for computing player statistics and analyzing telco network data. It then summarizes several open source stream processing frameworks, including Apache Storm, Samza, Kafka Streams, Spark, Flink, and Apex. Key aspects like programming models, fault tolerance methods, and performance are compared for each framework. The document concludes with recommendations for further innovation in areas like dynamic scaling and batch integration.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
The document provides an overview of Apache Spark and Hadoop ecosystem tools on Amazon EMR including Spark, Hive on Tez, and Presto. It discusses building data lakes with Amazon EMR and S3, running jobs and security options, and customer use cases. The demo shows Zeppelin and Hue interfaces. Examples are given of Netflix using Presto on EMR with a 25PB dataset and FINRA saving 60% costs by moving to HBase on EMR.
This document provides an overview of Apache Spark, including:
- Apache Spark is a next generation data processing engine for Hadoop that allows for fast in-memory processing of huge distributed and heterogeneous datasets.
- Spark offers tools for data science and components for data products and can be used for tasks like machine learning, graph processing, and streaming data analysis.
- Spark improves on MapReduce by being faster, allowing parallel processing, and supporting interactive queries. It works on both standalone clusters and Hadoop clusters.
Apache Spark, the Next Generation Cluster Computing (Gerger)
This document provides a 3 sentence summary of the key points:
Apache Spark is an open source cluster computing framework that is faster than Hadoop MapReduce by running computations in memory through RDDs, DataFrames and Datasets. It provides high-level APIs for batch, streaming and interactive queries along with libraries for machine learning. Spark's performance is improved through techniques like Catalyst query optimization, Tungsten in-memory columnar formats, and whole stage code generation.
In this presentation, Glassbeam Principal Architect Mohammad Guller gives an overview of Spark, and discusses why people are replacing Hadoop MapReduce with Spark for batch and stream processing jobs. He also covers areas where Spark really shines and presents a few real-world Spark scenarios. In addition, he reviews some misconceptions about Spark.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changesCounts)
Microservices, Containers, and Machine Learning (Paco Nathan)
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
This document discusses Spark, an open-source cluster computing framework. It provides a brief history of Spark, describing how it generalized MapReduce to support more types of applications. Spark allows for batch, interactive, and real-time processing within a single framework using Resilient Distributed Datasets (RDDs) and a logical plan represented as a directed acyclic graph (DAG). The document also discusses how Spark can be used for applications like machine learning via MLlib, graph processing with GraphX, and streaming data with Spark Streaming.
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
Spark Application Carousel: Highlights of Several Applications Built with Spark (Databricks)
This talk from 2015 Spark Summit East covers 3 applications built with Apache Spark:
1. Web Logs Analysis: Basic Data Pipeline - Spark & Spark SQL
2. Wikipedia Dataset Analysis: Machine Learning
3. Facebook API: Graph Algorithms
Spark is an in-memory cluster computing framework that provides high performance for large-scale data processing. It excels over Hadoop by keeping data in memory as RDDs (Resilient Distributed Datasets) for faster processing. The document provides an overview of Spark architecture including its core-based execution model compared to Hadoop's JVM-based model. It also demonstrates Spark's programming model using RDD transformations and actions through an example of log mining, showing how jobs are lazily evaluated and distributed across the cluster.
Slides for a presentation I gave for the Machine Learning with Spark Tokyo meetup.
Introduction to Spark, H2O, SparklingWater and live demos of GBM and DL.
A lecture on Apache Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL / DataFrames, and MLlib.
This document discusses several ways to extend Apache Spark, including defining custom data sources and UDFs (user-defined functions), customizing the Spark shell, UI, and adding new DDL commands. It provides code examples for customizing the Spark shell to print a custom welcome message, customizing the driver UI to add new tabs, and adding a new "PRINTME" DDL command to execute user-defined code. The document concludes by covering general principles for extending Spark such as inheriting from existing classes and supplying custom jars.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future for data property type accumulators, which may be coming to Spark in a future version.
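As a hedged sketch (not from the talk) of the accumulator-based debugging pattern described above, using the Spark 2.x API and an assumed CSV input:

val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.textFile("events.csv").flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 3) Some((fields(0), fields(1), fields(2)))
  else { badRecords.add(1); None }          // count the bad record and drop it
}

parsed.count()                              // run an action first; the value below is only meaningful afterwards
println(s"records that failed to parse: ${badRecords.value}")

As the paragraph above notes, retries and partial recomputes can inflate such counts, so treat the number as a debugging signal rather than an exact metric.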
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers, the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Debugging Apache Spark - Scala & Python super happy fun times 2017 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk will cover a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, along with a few examples.
Spark ML for custom models - FOSDEM HPC 2017 (Holden Karau)
- Spark ML pipelines involve estimators that are trained on datasets to produce immutable transformers.
- A transformer must define transformSchema() to validate the input schema, transform() to do the work, and copy() for cloning.
- Configurable transformers take parameters like inputCol and outputCol to allow configuration for meta algorithms.
- Estimators are similar but fit() returns a model instead of directly transforming.
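A rough Scala sketch of those pieces, assuming the Spark 2.x ml API; the class and column names are invented for illustration and are not Holden's actual example:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.length
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class StringLengthTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("strlen"))

  // configurable params so the transformer can be wired into meta algorithms
  final val inputCol  = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(v: String): this.type  = set(inputCol, v)
  def setOutputCol(v: String): this.type = set(outputCol, v)

  // validate the input schema and describe the output schema
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains($(inputCol)), s"missing column ${$(inputCol)}")
    StructType(schema.fields :+ StructField($(outputCol), IntegerType, nullable = false))
  }

  // do the actual work: append a column with the string length of the input column
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), length(dataset($(inputCol))))

  override def copy(extra: ParamMap): StringLengthTransformer = defaultCopy(extra)

  // usage sketch: new StringLengthTransformer().setInputCol("text").setOutputCol("text_len").transform(df)
}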
The document is a presentation about Apache Spark, which is described as a fast and general engine for large-scale data processing. It discusses what Spark is, its core concepts like RDDs, and the Spark ecosystem which includes tools like Spark Streaming, Spark SQL, MLlib, and GraphX. Examples of using Spark for tasks like mining DNA, geodata, and text are also presented.
Apache Spark is an open source Big Data analytical framework. It introduces the concept of RDDs (Resilient Distributed Datasets) which allow parallel operations on large datasets. The document discusses starting Spark, Spark applications, transformations and actions on RDDs, RDD creation in Scala and Python, and examples including word count. It also covers flatMap vs map, custom methods, and assignments involving transformations on lists.
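For the flatMap-vs-map point, a tiny illustrative sketch:

val lines = sc.parallelize(Seq("to be", "or not to be"))

val mapped = lines.map(_.split(" "))       // RDD[Array[String]]: one array per line
val flat   = lines.flatMap(_.split(" "))   // RDD[String]: the arrays are flattened into individual words

println(mapped.count())   // 2  (two arrays)
println(flat.count())     // 6  (six words)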
Similar to Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at New Relic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found that the architectural principles and design choices underlying APM were not a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
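A hedged sketch of what the stage-level scheduling API described above looks like in Spark 3.1; the resource amounts, file path, and GPU discovery script are illustrative assumptions, not details from the talk:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Stage 1: ETL runs with the default (CPU-oriented) resource profile
val etl = sc.textFile("hdfs:///training/raw").filter(_.nonEmpty).map(_.split(","))

// Stage 2: request GPU executors just for the deep-learning stage
val execReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("16g")
  .resource("gpu", 1, "/opt/spark/getGpus.sh")   // discovery script is an assumption
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val gpuProfile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

val scored = etl.withResources(gpuProfile).mapPartitions { rows =>
  // inference or training code that expects a GPU would run here
  rows.map(_.length)
}
scored.count()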
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated on multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
The Challenge of Interpretability in Generative AI Models.pdf (Sara Kroft)
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore the performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottlenecks, making a fix, and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their applications fast and understanding how things work under the hood.
Finetuning GenAI For Hacking and Defending (Priyanka Aash)
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
Demystifying Neural Networks And Building Cybersecurity Applications (Priyanka Aash)
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Keynote : Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
1. Spark Camp @ Strata CA
Intro to Apache Spark with Hands-on Tutorials
Wed Feb 18, 2015 9:00am–5:00pm
download slides:
training.databricks.com/workshop/sparkcamp.pdf
Licensed under a Creative Commons Attribution-NonCommercial-
NoDerivatives 4.0 International License
2. Tutorial Outline:
morning:
Welcome + Getting Started
Ex 1: Pre-Flight Check
How Spark Runs on a Cluster
A Brief History
Ex 3: WC, Joins
DBC Essentials
How to “Think Notebooks”
Ex 4: Workflow exercise
Tour of Spark API
Spark @ Strata

afternoon:
Intro to MLlib
Songs Demo
Spark SQL
Visualizations
Ex 7: SQL + Visualizations
Spark Streaming
Tw Streaming / Kinesis Demo
Building a Scala JAR
Deploying Apps
Ex 8: GraphX examples
Case Studies
Further Resources / Q&A
5. Everyone will receive a username/password for one
of the Databricks Cloud shards. Use your laptop and
browser to login there.
We find that cloud-based notebooks are a simple way
to get started using Apache Spark – as the motto
“Making Big Data Simple” states.
Please create and run a variety of notebooks on your
account throughout the tutorial. These accounts will
remain open long enough for you to export your work.
See the product page or FAQ for more details, or
contact Databricks to register for a trial account.
Getting Started: Step 1
15.
Now let’s get started with the coding exercise!
We’ll define an initial Spark app in three lines
of code:
Getting Started: Coding Exercise
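The exercise itself isn't reproduced in these slides, but a three-line starter app typically looks something like this (the file path is an assumption):

val lines = sc.textFile("README.md")               // 1. load a text file as an RDD
val sparkLines = lines.filter(_.contains("Spark")) // 2. transformation: keep lines mentioning Spark
println(sparkLines.count())                        // 3. action: count them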
16. If you’re new to this Scala thing and want to
spend a few minutes on the basics…
Scala Crash Course
Holden Karau
lintool.github.io/SparkTutorial/
slides/day1_Scala_crash_course.pdf
Getting Started: Bonus!
17.
Getting Started: Extra Bonus!!
See also the /learning_spark_book
for all of its code examples in notebooks:
18. How Spark runs
on a Cluster
[Cluster diagram: a Driver program coordinating three Workers; each Worker holds an input block (blocks 1-3) and a cached partition (caches 1-3)]
19.
Clone and run /_SparkCamp/01.log_example
in your folder:
Spark Deconstructed: Log Mining Example
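The notebook's contents aren't shown here, but the classic log-mining pattern it walks through looks roughly like this (sketched in Scala; the snippet on the next slide is from the Python version of the same example, and the file layout is assumed):

val lines = sc.textFile("error_log.txt")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(1))   // keep just the message field
messages.cache()                              // keep the filtered data in memory

println(messages.filter(_.contains("mysql")).count())  // the first action materializes the cache
println(messages.filter(_.contains("php")).count())    // later queries reuse the cached data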
21. Spark Deconstructed: Log Mining Example
x = messages.filter(lambda x: x.find("mysql") > -1)
x.toDebugString()

(2) PythonRDD[772] at RDD at PythonRDD.scala:43 []
 |  PythonRDD[219] at RDD at PythonRDD.scala:43 []
 |  error_log.txt MappedRDD[218] at NativeMethodAccessorImpl.java:-2 []
 |  error_log.txt HadoopRDD[217] at NativeMethodAccessorImpl.java:-2 []
Note that we can examine the operator graph
for a transformed RDD, for example:
37.
A Brief History: Functional Programming for Big Data
circa late 1990s:
explosive growth of e-commerce and machine data
implied that workloads could not fit on a single
computer anymore…
notable firms led the shift to horizontal scale-out
on clusters of commodity hardware, especially
for machine learning use cases at scale
38.
A Brief History: Functional Programming for Big Data
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes an Apache top-level project
39.
circa 2002:
mitigate risk of large distributed workloads lost
due to disk failures on commodity hardware…
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
A Brief History: MapReduce
40. A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm
circa 2004 – Google
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
research.google.com/archive/mapreduce.html
circa 2006 – Apache
Hadoop, originating from the Nutch Project
Doug Cutting
research.yahoo.com/files/cutting.pdf
circa 2008 – Yahoo
web scale search indexing
Hadoop Summit, HUG, etc.
developer.yahoo.com/hadoop/
circa 2009 – Amazon AWS
Elastic MapReduce
Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
aws.amazon.com/elasticmapreduce/
40
41. Open Discussion:
Enumerate several changes in data center
technologies since 2002…
A Brief History: MapReduce
41
43. MapReduce use cases showed two major
limitations:
1. difficulty of programming directly in MR
2. performance bottlenecks, or batch not
fitting the use cases
In short, MR doesn’t compose well for large
applications
Therefore, people built specialized systems as
workarounds…
A Brief History: MapReduce
43
44. 44
MR doesn’t compose well for large applications,
and so specialized systems emerged as workarounds
MapReduce: general batch processing
Specialized systems (iterative, interactive, streaming, graph, etc.):
Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
A Brief History: MapReduce
45. Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
45
A Brief History: Spark
46. 46
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury,
Michael Franklin, Scott Shenker, Ion Stoica
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
circa 2010:
a unified engine for enterprise data workflows,
based on commodity hardware a decade later…
A Brief History: Spark
47. A Brief History: Spark
Unlike the various specialized systems, Spark’s
goal was to generalize MapReduce to support
new apps within the same engine
Two reasonably small additions are enough to
express the previous models:
• fast data sharing
• general DAGs
This allows for an approach which is more
efficient for the engine, and much simpler
for the end users
47
49. Some key points about Spark:
• handles batch, interactive, and real-time
within a single framework
• native integration with Java, Python, Scala
• programming at a higher level of abstraction
• more general: map/reduce is just one set
of supported constructs
A Brief History: Spark
49
50. • generalized patterns
unified engine for many use cases
• lazy evaluation of the lineage graph
reduces wait states, better pipelining
• generational differences in hardware
off-heap use of large memory spaces
• functional programming / ease of use
reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
A Brief History: Key distinctions for Spark vs. MapReduce
50
56. Coding Exercise: WordCount
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");


void reduce (String word, Iterator group):
  int count = 0;

  for each pc in group:
    count += Int(pc);

  emit(word, String(count));
Definition:
count how often each word appears
in a collection of text documents
This simple program provides a good test case
for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and
numeric values
• isn’t many steps away from search indexing
• serves as a “Hello World” for Big Data apps
A distributed computing framework that can run
WordCount efficiently in parallel at scale
can likely handle much larger and more interesting
compute problems
56
57. WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
57
Coding Exercise: WordCount
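As a rough sketch of the “3 lines” version in PySpark (the input path is illustrative):
lines = sc.textFile("/mnt/paco/intro/README.md")   # hypothetical input path
counts = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.take(10))   # action: returns a sample of (word, count) pairs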
58. 58
Clone and run /_SparkCamp/02.wc_example
in your folder:
Coding Exercise: WordCount
59. 59
Clone and run /_SparkCamp/03.join_example
in your folder:
Coding Exercise: Join
62. DBC Essentials
[diagram: a typical data workflow circa 2010 – ETL of data into the cluster/cloud, data prep, feature engineering, train/test sets, learners and parameters, model evaluation and optimization, then scoring against production data, with results feeding visualizations, reports, decisions, and feedback]
63. 63
DBC Essentials: What is Databricks Cloud?
Databricks Platform
Databricks Workspace
Also see FAQ for more details…
64. 64
DBC Essentials: What is Databricks Cloud?
key concepts
Shard: an instance of Databricks Workspace
Cluster: a Spark cluster (multiple per shard)
Notebook: a list of markdown, executable commands, and results
Dashboard: a flexible space to create operational visualizations
Also see FAQ for more details…
65. 65
DBC Essentials: Notebooks
• Series of commands (think shell++)
• Each notebook has a language type,
chosen at notebook creation:
• Python + SQL
• Scala + SQL
• SQL only
• Command output captured in notebook
• Commands can be…
• edited, reordered, rerun, exported,
cloned, imported, etc.
66. 66
DBC Essentials: Clusters
• Open source Spark clusters hosted in the cloud
• Access the Spark UI
• Attach and Detach notebooks to clusters
NB: our training shards use 7 GB cluster
configurations
67. 67
DBC Essentials: Team, State, Collaboration, Elastic Resources
[diagram: team members log in from their browsers to a shard in the cloud; notebooks hold state and can be attached to or detached from Spark clusters, and imported/exported as local copies]
68. 68
DBC Essentials: Team, State, Collaboration, Elastic Resources
Excellent collaboration properties, based
on the use of:
• comments
• cloning
• decoupled state of notebooks vs.
clusters
• relative independence of code blocks
within a notebook
70. How to “think” in terms of leveraging notebooks,
based on Computational Thinking:
70
Think Notebooks:
“The way we depict
space has a great
deal to do with how
we behave in it.”
– David Hockney
71. 71
“The impact of computing extends far beyond
science… affecting all aspects of our lives.
To flourish in today's world, everyone needs
computational thinking.” – CMU
Computing now ranks alongside the proverbial
Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU
http://www.cs.cmu.edu/~CompThink/
Exploring Computational Thinking @ Google
https://www.google.com/edu/computational-thinking/
Think Notebooks: Computational Thinking
72. 72
Computational Thinking provides a structured
way of conceptualizing the problem…
In effect, you are developing notes for yourself
and your team
These in turn can become the basis for team
processes, software requirements, etc.
In other words, conceptualize how to leverage
computing resources at scale to build high-ROI
apps for Big Data
Think Notebooks: Computational Thinking
73. 73
The general approach, in four parts:
• Decomposition: decompose a complex
problem into smaller solvable problems
• Pattern Recognition: identify when a
known approach can be leveraged
• Abstraction: abstract from those patterns
into generalizations as strategies
• Algorithm Design: articulate strategies as
algorithms, i.e. as general recipes for how to
handle complex problems
Think Notebooks: Computational Thinking
74. How to “think” in terms of leveraging notebooks,
by the numbers:
1. create a new notebook
2. copy the assignment description as markdown
3. split it into separate code cells
4. for each step, write your code under the
markdown
5. run each step and verify your results
74
Think Notebooks:
75. Let’s assemble the pieces of the previous few
code examples, using two files:
/mnt/paco/intro/CHANGES.txt
/mnt/paco/intro/README.md
1. create RDDs to filter each line for the
keyword Spark
2. perform a WordCount on each, i.e., so the
results are (K,V) pairs of (keyword, count)
3. join the two RDDs
4. how many instances of Spark are there in
each file?
75
Coding Exercises: Workflow assignment
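One possible sketch of this workflow in PySpark (an outline only, not the notebook's reference solution):
changes = sc.textFile("/mnt/paco/intro/CHANGES.txt").filter(lambda l: "Spark" in l)
readme = sc.textFile("/mnt/paco/intro/README.md").filter(lambda l: "Spark" in l)

def wc(rdd):
    # (keyword, count) pairs for one filtered file
    return rdd.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

joined = wc(changes).join(wc(readme))   # (word, (count_in_CHANGES, count_in_README))
print(joined.lookup("Spark"))           # how many instances of "Spark" in each file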
76. Tour of Spark API
[diagram: a Driver Program holding a SparkContext talks to a Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a cache and its tasks]
77. The essentials of the Spark API in both Scala
and Python…
/_SparkCamp/05.scala_api
/_SparkCamp/05.python_api

Let’s start with the basic concepts, which are
covered in much more detail in the docs:
spark.apache.org/docs/latest/scala-programming-guide.html
Spark Essentials:
77
78. The first thing that a Spark program does is create
a SparkContext object, which tells Spark how
to access a cluster
In the shell for either Scala or Python, this is
the sc variable, which is created automatically
Other programs must use a constructor to
instantiate a new SparkContext
The SparkContext is then used to create
other variables
Spark Essentials: SparkContext
78
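For example, a standalone app might construct its own context along these lines (a sketch; the app name and master setting are placeholders):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")   # placeholder values
sc = SparkContext(conf=conf)   # in the shell or a notebook, sc already exists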
80. The master parameter for a SparkContext
determines which cluster to use
Spark Essentials: Master
local: run Spark locally with one worker thread (no parallelism)
local[K]: run Spark locally with K worker threads (ideally set to # cores)
spark://HOST:PORT: connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT: connect to a Mesos cluster; PORT depends on config (5050 by default)
80
82. [diagram: Driver Program/SparkContext, Cluster Manager, and Worker Nodes with Executors, caches, and tasks, as on the earlier cluster overview slide]
The driver performs the following:
1. connects to a cluster manager to allocate
resources across applications
2. acquires executors on cluster nodes –
these processes run compute tasks and cache data
3. sends app code to the executors
4. sends tasks for the executors to run
Spark Essentials: Clusters
82
83. Resilient Distributed Datasets (RDD) are the
primary abstraction in Spark – a fault-tolerant
collection of elements that can be operated on
in parallel
There are currently two types:
• parallelized collections – take an existing Scala
collection and run functions on it in parallel
• Hadoop datasets – run functions on each record
of a file in Hadoop distributed file system or any
other storage system supported by Hadoop
Spark Essentials: RDD
83
84. • two types of operations on RDDs:
transformations and actions
• transformations are lazy
(not computed immediately)
• the transformed RDD gets recomputed
when an action is run on it (default)
• however, an RDD can be persisted into
storage in memory or disk
Spark Essentials: RDD
84
85. Scala:
val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970]
Spark Essentials: RDD
Python:
data = [1, 2, 3, 4, 5]
data
Out[2]: [1, 2, 3, 4, 5]

distData = sc.parallelize(data)
distData
Out[3]: ParallelCollectionRDD[24864] at parallelize at PythonRDD.scala:364
85
86. Spark can create RDDs from any file stored in HDFS
or other storage systems supported by Hadoop, e.g.,
local file system, Amazon S3, Hypertable, HBase, etc.
Spark supports text files, SequenceFiles, and any
other Hadoop InputFormat, and can also take a
directory or a glob (e.g. /data/201404*)
Spark Essentials: RDD
[diagram: a chain of RDDs linked by transformations, with an action at the end producing a value]
86
87. Scala:
val distFile = sqlContext.table("readme")
distFile: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24971] at RDD at SchemaRDD.scala:108
Spark Essentials: RDD
Python:
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile
Out[11]: PythonRDD[24920] at RDD at PythonRDD.scala:43
87
88. Transformations create a new dataset from
an existing one
All transformations in Spark are lazy: they
do not compute their results right away –
instead they remember the transformations
applied to some base dataset
This design lets Spark:
• optimize the required calculations
• recover from lost data partitions
Spark Essentials: Transformations
88
89. Spark Essentials: Transformations
map(func): return a new distributed dataset formed by passing each element of the source through a function func
filter(func): return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
sample(withReplacement, fraction, seed): sample a fraction fraction of the data, with or without replacement, using a given random number generator seed
union(otherDataset): return a new dataset that contains the union of the elements in the source dataset and the argument
distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset
89
90. Spark Essentials: Transformations
groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith
cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
90
91. Scala:
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
Spark Essentials: Transformations
Python:
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
distFile is a collection of lines
91
93. Spark Essentials: Transformations
closures
Scala:
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
Python:
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
93
looking at the output, how would you
compare results for map() vs. flatMap() ?
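For intuition, a tiny sketch of the difference (with made-up data rather than the readme table):
rdd = sc.parallelize(["a b", "c"])
rdd.map(lambda x: x.split(" ")).collect()       # [['a', 'b'], ['c']]  ... one list per input line
rdd.flatMap(lambda x: x.split(" ")).collect()   # ['a', 'b', 'c']      ... outputs flattened together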
94. Spark Essentials: Actions
reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one), and should also be commutative and associative so that it can be computed correctly in parallel
collect(): return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data
count(): return the number of elements in the dataset
first(): return the first element of the dataset – similar to take(1)
take(n): return an array with the first n elements of the dataset – currently not executed in parallel, instead the driver program computes all the elements
takeSample(withReplacement, num, [seed]): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
94
95. Spark Essentials: Actions
saveAsTextFile(path): write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path): write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)
countByKey(): only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key
foreach(func): run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems
95
96. Scala:
val f = sqlContext.table("readme").map(_(0).asInstanceOf[String])
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect.foreach(println)
Spark Essentials: Actions
Python:
from operator import add
f = sqlContext.table("readme").map(lambda x: x[0])
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()
96
97. Spark can persist (or cache) a dataset in
memory across operations
spark.apache.org/docs/latest/programming-guide.html#rdd-
persistence
Each node stores in memory any slices of it
that it computes and reuses them in other
actions on that dataset – often making future
actions more than 10x faster
The cache is fault-tolerant: if any partition
of an RDD is lost, it will automatically be
recomputed using the transformations that
originally created it
Spark Essentials: Persistence
97
98. Spark Essentials: Persistence
MEMORY_ONLY: store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK: store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental): store RDD in serialized format in Tachyon.
98
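For example, a sketch of requesting an explicit storage level instead of the default cache() (the path is illustrative):
from pyspark import StorageLevel

lines = sc.textFile("/mnt/paco/intro/README.md")   # hypothetical input path
lines.persist(StorageLevel.MEMORY_AND_DISK)        # spill partitions to disk rather than recompute
lines.count()   # first action materializes and persists the partitions
lines.count()   # later actions reuse the persisted partitions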
99. Scala:
val f = sqlContext.table("readme").map(_(0).asInstanceOf[String])
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
words.reduceByKey(_ + _).collect.foreach(println)
Spark Essentials: Persistence
Python:
from operator import add
f = sqlContext.table("readme").map(lambda x: x[0])
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()
99
100. Broadcast variables let the programmer keep a
read-only variable cached on each machine
rather than shipping a copy of it with tasks
For example, to give every node a copy of
a large input dataset efficiently
Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms
to reduce communication cost
Spark Essentials: Broadcast Variables
100
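A minimal sketch in PySpark (the lookup table is made up for illustration):
lookup = sc.broadcast({"us": "United States", "fr": "France"})   # shipped once per machine

codes = sc.parallelize(["us", "fr", "us"])
codes.map(lambda c: lookup.value.get(c, "unknown")).collect()
# ['United States', 'France', 'United States']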
102. Accumulators are variables that can only be
“added” to through an associative operation
Used to implement counters and sums,
efficiently in parallel
Spark natively supports accumulators of
numeric value types and standard mutable
collections, and programmers can extend
for new types
Only the driver program can read an
accumulator’s value, not the tasks
Spark Essentials: Accumulators
102
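A minimal sketch in PySpark (illustrative data; note that only the driver reads the final value):
bad_records = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)   # tasks can only add to the accumulator
        return 0

sc.parallelize(["1", "2", "oops", "4"]).map(parse).sum()
print(bad_records.value)   # read back on the driver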
105. For a deep-dive about broadcast variables
and accumulator usage in Spark, see also:
Advanced Spark Features
Matei Zaharia, Jun 2012
ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-
zaharia-amp-camp-2012-advanced-spark.pdf
Spark Essentials: Broadcast Variables and Accumulators
105
107. Spark Essentials: API Details
For more details about the Scala API:
spark.apache.org/docs/latest/api/scala/
index.html#org.apache.spark.package
For more details about the Python API:
spark.apache.org/docs/latest/api/python/
107
109. Keynote: New Directions for Spark in 2015
Fri Feb 20 9:15am-9:25am
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/39547
As the Apache Spark userbase grows, the developer community is working
to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the
enterprise and major improvements in its performance, scalability and
standard libraries. In 2015, we want to make Spark accessible to a wider
set of users, through new high-level APIs for data science: machine learning
pipelines, data frames, and R language bindings. In addition, we are defining
extension points to let Spark grow as a platform, making it easy to plug in
data sources, algorithms, and external packages. Like all work on Spark,
these APIs are designed to plug seamlessly into Spark applications, giving
users a unified platform for streaming, batch and interactive data processing.
Matei Zaharia – started the Spark project
at UC Berkeley, currently CTO of Databricks,
Spark VP at Apache, and an assistant professor
at MIT
110. Spark Camp: Ask Us Anything
Fri, Feb 20 2:20pm-3:00pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/40701
Join the Spark team for an informal question and
answer session. Several of the Spark committers,
trainers, etc., from Databricks will be on hand to
field a wide range of detailed questions.
Even if you don’t have a specific question, join
in to hear what others are asking!
111. Databricks Spark Talks @Strata + Hadoop World
Thu Feb 19 10:40am-11:20am
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Lessons from Running Large Scale Spark Workloads
Reynold Xin, Matei Zaharia
Thu Feb 19 4:00pm–4:40pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38518
Spark Streaming - The State of the Union, and Beyond
Tathagata Das
112. Databricks Spark Talks @Strata + Hadoop World
Fri Feb 20 11:30am-12:10pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Tuning and Debugging in Apache Spark
Patrick Wendell
Fri Feb 20 4:00pm–4:40pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38391
Everyday I’m Shuffling - Tips for Writing Better Spark Programs
Vida Ha, Holden Karau
113. Spark Developer Certification
Fri Feb 20, 2015 10:40am-12:40pm
• go.databricks.com/spark-certified-developer
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
114. • 40 multiple-choice questions, 90 minutes
• mostly structured as choices among code blocks
• expect some Python, Java, Scala, SQL
• understand theory of operation
• identify best practices
• recognize code that is more parallel, less
memory constrained
Overall, you need to write Spark apps in practice
Developer Certification: Overview
114
119. Spark SQL blurs the lines between RDDs and relational tables
spark.apache.org/docs/latest/sql-programming-guide.html

intermix SQL commands to query external data,
along with complex analytics, in a single app:
• allows SQL extensions based on MLlib
• provides the “heavy lifting” for ETL in DBC
Spark SQL: Manipulating Structured Data Using Spark
Michael Armbrust, Reynold Xin (2014-03-24)
databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
119
Spark SQL: Data Workflows
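For instance, a sketch of mixing SQL with RDD-style analytics in one app, using the Spark 1.x SchemaRDD style shown elsewhere in this deck (the table and columns are hypothetical):
purchases = sqlContext.sql("SELECT user, amount FROM events WHERE action = 'purchase'")
totals = purchases.map(lambda row: (row.user, row.amount)).reduceByKey(lambda a, b: a + b)
totals.take(5)   # per-user spend totals (unordered)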
120. Parquet is a columnar format, supported
by many different Big Data frameworks
http://parquet.io/
Spark SQL supports read/write of parquet files,
automatically preserving the schema of the
original data (HUGE benefits)
Modifying the previous example…
120
Spark SQL: Data Workflows – Parquet
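A sketch of that round trip with the Spark 1.x SchemaRDD API (the output path is illustrative):
readme = sqlContext.table("readme")                   # SchemaRDD from the earlier examples
readme.saveAsParquetFile("/tmp/readme.parquet")       # schema is preserved in the Parquet files
again = sqlContext.parquetFile("/tmp/readme.parquet")
again.registerTempTable("readme_parquet")             # query it again with SQL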
128. 128
The display() command:
• programmatic access to visualizations
• pass a SchemaRDD to print as an HTML table
• pass a Scala list to print as an HTML table
• call without arguments to display matplotlib
figures
Visualization: Using display()
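A sketch of typical usage inside a Databricks notebook (the query and table name are hypothetical):
top_words = sqlContext.sql("SELECT word, count FROM wordcounts ORDER BY count DESC LIMIT 10")
display(top_words)   # renders the SchemaRDD as an HTML table in the notebook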
129. 129
The displayHTML() command:
• render any arbitrary HTML/JavaScript
• include JavaScript libraries (advanced feature)
• paste in D3 examples to get a sense for this…
Visualization: Using displayHTML()
130. 130
Clone the entire folder /_SparkCamp/Viz D3
into your folder and run its notebooks:
Demo: D3 Visualization
131. 131
Clone and run /_SparkCamp/07.sql_visualization
in your folder:
Coding Exercise: SQL + Visualization
134. Let’s consider the top-level requirements for
a streaming framework:
• clusters scalable to 100’s of nodes
• low-latency, in the range of seconds
(meets 90% of use case needs)
• efficient recovery from failures
(which is a hard problem in CS)
• integrates with batch: many co’s run the
same business logic both online+offline
Spark Streaming: Requirements
134
135. Therefore, run a streaming computation as:
a series of very small, deterministic batch jobs
• Chop up the live stream into
batches of X seconds
• Spark treats each batch of
data as RDDs and processes
them using RDD operations
• Finally, the processed results
of the RDD operations are
returned in batches
Spark Streaming: Requirements
135
136. Therefore, run a streaming computation as:
a series of very small, deterministic batch jobs
• Batch sizes as low as ½ sec,
latency of about 1 sec
• Potential for combining
batch processing and
streaming processing in
the same system
Spark Streaming: Requirements
136
137. Data can be ingested from many sources:
Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc.
Results can be pushed out to filesystems,
databases, live dashboards, etc.
Spark’s built-in machine learning algorithms and
graph processing algorithms can be applied to
data streams
Spark Streaming: Integration
137
138. 2012
project started
2013
alpha release (Spark 0.7)
2014
graduated (Spark 0.9)
Spark Streaming: Timeline
Discretized Streams: A Fault-Tolerant Model
for Scalable Stream Processing
Matei Zaharia, Tathagata Das, Haoyuan Li,
Timothy Hunter, Scott Shenker, Ion Stoica
Berkeley EECS (2012-12-14)
www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
project lead:
Tathagata Das @tathadas
138
139. Typical kinds of applications:
• datacenter operations
• web app funnel metrics
• ad optimization
• anti-fraud
• telecom
• video analytics
• various telematics
and much much more!
Spark Streaming: Use Cases
139
141. import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()
ssc.awaitTermination()
Quiz: name the bits and pieces…
141
147. 147
1. extract text from the tweet:
   https://twitter.com/andy_bf/status/16222269370011648
   → "Ceci n'est pas un tweet"
2. sequence text as bigrams:
   tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
3. convert bigrams into numbers:
   seq.map(_.hashCode()) → (2178, 3230, 3174, …)
4. index into sparse tf vector:
   seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
5. increment feature count:
   Vector.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …])
Demo: Twitter Streaming Language Classifier
From tweets to ML features,
approximated as sparse vectors:
149. SBT is the Simple Build Tool for Scala:
www.scala-sbt.org/
This is included with the Spark download, and
does not need to be installed separately.
Similar to Maven; however, it provides for
incremental compilation and an interactive shell,
among other innovations.
The SBT project uses StackOverflow for Q&A,
which is a good resource for further study:
stackoverflow.com/tags/sbt
Spark in Production: Build: SBT
149
150. Spark in Production: Build: SBT
clean: delete all generated files (in the target directory)
package: create a JAR file
run: run the JAR (or main class, if named)
compile: compile the main sources (in src/main/scala and src/main/java directories)
test: compile and run all tests
console: launch a Scala interpreter
help: display detailed help for specified commands
150
151. builds:
• build/run a JAR using Java + Maven
• SBT primer
• build/run a JAR using Scala + SBT
Spark in Production: Build: Scala
151
152. The following sequence shows how to build
a JAR file from a Scala app, using SBT
• First, this requires the “source” download,
not the “binary”
• Change into the SPARK_HOME directory
• Then run the following commands…
Spark in Production: Build: Scala
152
153. # Scala source + SBT build script on following slides

cd simple-app

../sbt/sbt -Dsbt.ivy.home=../sbt/ivy package

../spark/bin/spark-submit \
  --class "SimpleApp" \
  --master local[*] \
  target/scala-2.10/simple-project_2.10-1.0.jar
Spark in Production: Build: Scala
153
154. /*** SimpleApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", "SPARK_HOME",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()

    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Spark in Production: Build: Scala
154
155. name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Spark in Production: Build: Scala
155
157. Spark in Production: Databricks Cloud
157
Databricks Platform
Databricks Workspace
Arguably, one of the simplest ways to deploy Apache Spark
is to use Databricks Cloud for cloud-based notebooks
159. Apache Mesos, from which Apache Spark
originated…
Running Spark on Mesos
spark.apache.org/docs/latest/running-on-mesos.html
Run Apache Spark on Apache Mesos
tutorial based on Mesosphere + Google Cloud
ceteri.blogspot.com/2014/09/spark-atop-mesos-on-google-cloud.html
Getting Started Running Apache Spark on Apache Mesos
O’Reilly Media webcast
oreilly.com/pub/e/2986
Spark in Production: Mesos
159
161. MapR Technologies provides support for running
Spark on the MapR distros:
mapr.com/products/apache-spark
slideshare.net/MapRTechnologies/map-r-
databricks-webinar-4x3
Spark in Production: MapR
161
162. Hortonworks provides support for running
Spark on HDP:
spark.apache.org/docs/latest/hadoop-third-party-
distributions.html
hortonworks.com/blog/announcing-hdp-2-1-tech-
preview-component-apache-spark/
Spark in Production: HDP
162
163. Running Spark on Amazon AWS EC2:
blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/
Installing-Apache-Spark-on-an-Amazon-EMR-Cluster
Spark in Production: EC2
163
164. Spark in MapReduce (SIMR) – quick way
for Hadoop MR1 users to deploy Spark:
databricks.github.io/simr/
spark-summit.org/talk/reddy-simr-let-your-
spark-jobs-simmer-inside-hadoop-clusters/
• Spark runs on Hadoop clusters without
any install or required admin rights
• SIMR launches a Hadoop job that only
contains mappers, includes Scala+Spark
./simr jar_file main_class parameters
[--outdir=] [--slots=N] [--unique]
Spark in Production: SIMR
164
165. review UI features
spark.apache.org/docs/latest/monitoring.html
http://<master>:8080/
http://<master>:50070/
• verify: is my job still running?
• drill-down into workers and stages
• examine stdout and stderr
• discuss how to diagnose / troubleshoot
Spark in Production: Monitor
165
170. 170
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-
bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-
graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/
graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-
analytics-with-graphx.html
GraphX: Further Reading…
172. 172
GraphX: Example – routing problems
[diagram: a small graph with nodes 0–3 connected by edges with costs 1, 1, 2, 3, 4]
What is the cost to reach node 0 from any other
node in the graph? This is a common use case for
graph algorithms, e.g., Dijkstra
173. 173
Clone and run /_SparkCamp/08.graphx
in your folder:
GraphX: Coding Exercise
175. Case Studies: Apache Spark, DBC, etc.
Additional details about production deployments
for Apache Spark can be found at:
https://cwiki.apache.org/confluence/display/
SPARK/Powered+By+Spark
https://databricks.com/blog/category/company/
partners
http://go.databricks.com/customer-case-studies
175
176. Case Studies: Automatic Labs
176
Spark Plugs Into Your Car
Rob Ferguson
spark-summit.org/east/2015/talk/spark-plugs-into-your-car
finance.yahoo.com/news/automatic-labs-turns-databricks-
cloud-140000785.html
Automatic creates personalized driving habit dashboards
• wanted to use Spark while minimizing investment in DevOps
• provides data access to non-technical analysts via SQL
• replaced Redshift and disparate ML tools with single platform
• leveraged built-in visualization capabilities in notebooks to
generate dashboards easily and quickly
• used MLlib on Spark for needed functionality out of the box
177. Spark at Twitter: Evaluation & Lessons Learnt
Sriram Krishnan
slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter
• Spark can be more interactive, efficient than MR
• support for iterative algorithms and caching
• more generic than traditional MapReduce
• Why is Spark faster than Hadoop MapReduce?
• fewer I/O synchronization barriers
• less expensive shuffle
• the more complex the DAG, the greater the
performance improvement
177
Case Studies: Twitter
178. Pearson uses Spark Streaming for next
generation adaptive learning platform
Dibyendu Bhattacharya
databricks.com/blog/2014/12/08/pearson-
uses-spark-streaming-for-next-generation-
adaptive-learning-platform.html
178
• Kafka + Spark + Cassandra + Blur, on AWS on a YARN
cluster
• single platform/common API was a key reason to replace
Storm with Spark Streaming
• custom Kafka Consumer for Spark Streaming, using Low
Level Kafka Consumer APIs
• handles: Kafka node failures, receiver failures, leader
changes, committed offset in ZK, tunable data rate
throughput
Case Studies: Pearson
179. Unlocking Your Hadoop Data with Apache Spark and CDH5
Denny Lee
slideshare.net/Concur/unlocking-your-hadoop-data-
with-apache-spark-and-cdh5
179
• leading provider of spend management solutions and
services
• delivers recommendations based on business users’ travel
and expenses – “to help deliver the perfect trip”
• use of traditional BI tools with Spark SQL allowed analysts
to make sense of the data without becoming programmers
• needed the ability to transition quickly between Machine
Learning (MLLib), Graph (GraphX), and SQL usage
• needed to deliver recommendations in real-time
Case Studies: Concur
180. Stratio Streaming: a new approach to Spark Streaming
David Morales, Oscar Mendez
spark-summit.org/2014/talk/stratio-streaming-a-
new-approach-to-spark-streaming
180
• Stratio Streaming is the union of a real-time messaging bus
with a complex event processing engine atop Spark
Streaming
• allows the creation of streams and queries on the fly
• paired with Siddhi CEP engine and Apache Kafka
• added global features to the engine such as auditing and
statistics
Case Studies: Stratio
181. Collaborative Filtering with Spark
Chris Johnson
slideshare.net/MrChrisJohnson/collaborative-filtering-with-
spark
• collab filter (ALS) for music recommendation
• Hadoop suffers from I/O overhead
• show a progression of code rewrites, converting a
Hadoop-based app into efficient use of Spark
181
Case Studies: Spotify
182. Guavus Embeds Apache Spark
into its Operational Intelligence Platform
Deployed at the World’s Largest Telcos
Eric Carr
databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-
into-its-operational-intelligence-platform-deployed-at-the-
worlds-largest-telcos.html
182
• 4 of 5 top mobile network operators, 3 of 5 top Internet
backbone providers, 80% MSOs in NorAm
• analyzing 50% of US mobile data traffic, +2.5 PB/day
• latency is critical for resolving operational issues before
they cascade: 2.5 MM transactions per second
• “analyze first” not “store first ask questions later”
Case Studies: Guavus
183. Case Studies: Radius Intelligence
183
From Hadoop to Spark in 4 months, Lessons Learned
Alexis Roos
http://youtu.be/o3-lokUFqvA
• building a full SMB index took 12+ hours using
Hadoop and Cascading
• pipeline was difficult to modify/enhance
• Spark increased pipeline performance 10x
• interactive shell and notebooks enabled data scientists
to experiment and develop code faster
• PMs and business development staff can use SQL to
query large data sets
186. Further Resources: Spark Packages
186
Looking for other libraries and features? There
are a variety of third-party packages available at:
http://spark-packages.org/
187. Further Resources: DBC Feedback
187
Other feedback, suggestions, etc.?
http://feedback.databricks.com/
189. confs:
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
QCon SP
São Paulo, Brazil, Mar 23-27
qconsp.com
Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com
Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015
GOTO Chicago
Chicago, May 11-14
gotocon.com/chicago-2015
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
191. books:
Fast Data Processing
with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/
9781782167068.do
Spark in Action
Chris Fregly
Manning (2015)
sparkinaction.com/
Learning Spark
Holden Karau,
Andy Konwinski,
Matei Zaharia
O’Reilly (2015)
shop.oreilly.com/product/
0636920028512.do
192. About Databricks
• Founded by the creators of Spark in 2013
• Largest organization contributing to Spark
• End-to-end hosted service, Databricks Cloud
• http://databricks.com/