Building Scalable
Data Pipelines
Evan Chan
Who am I
Distinguished Engineer, Tuplejump
@evanfchan
User and contributor to Spark
since 0.9
Co-creator and maintainer of
Spark Job Server
Tuplejump - Big Data Dev Partners
Instant Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)
Fast Data, not

Big Data
How Fast do you
Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?
What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning
Example: Real-time
trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot”
data in stream processor
Combine with recent and historical top K
feature vectors in database
Update database recent feature vectors
Serve to users
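The trend detection steps above can be sketched in plain Java. This is only an illustration under stated assumptions: the class TrendSketch, the exponential decay factor, and the use of raw counts as "feature vectors" are all hypothetical and are not taken from the talk, and the in-memory map stands in for the database.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: blend one micro-batch of "hot" events into recent scores,
// then pick the top K asset IDs. Not the speaker's implementation.
final class TrendSketch {

    // Decay the recent scores, then add 1.0 for every event in the new batch.
    static Map<String, Double> update(Map<String, Double> recent, List<String> batch, double decay) {
        Map<String, Double> next = new HashMap<>();
        recent.forEach((id, score) -> next.put(id, score * decay));
        for (String id : batch) {
            next.merge(id, 1.0, Double::sum);
        }
        return next;
    }

    // Highest score first; ties are broken by asset ID so the result is deterministic.
    static List<String> topK(Map<String, Double> scores, int k) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> recent = new HashMap<>();
        recent.put("asset-a", 4.0);
        recent.put("asset-b", 1.0);
        List<String> batch = Arrays.asList("asset-b", "asset-b", "asset-b", "asset-c");
        Map<String, Double> next = update(recent, batch, 0.5);
        System.out.println(topK(next, 2));
    }
}
```

In a real pipeline the update would run once per 1-5 second batch inside the stream processor, and the resulting scores would be written back to the database that serves users.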
Example 2: Smart City
Smart City Streaming Data
City buses - regular telemetry (position,
velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies
Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by
311 and 911 events?
How long will it take the next bus to
get here?
Where is the closest bus to where I am?
Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't
need sweeping
How does traffic affect waiting times in public
transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit
routes be placed?
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations
as possible with as few components as possible
Real-time: “NoETL”
Community: best of breed OSS projects with
huge adoption and commercial support
Message Queue
Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in cases of failure
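A minimal sketch of why a log-backed queue gives these properties, in plain Java. It assumes a single in-memory append-only list and one offset per consumer; the class LogQueueSketch and its methods are hypothetical and are not Kafka's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a log-backed message queue.
final class LogQueueSketch {
    private final List<String> log = new ArrayList<>();            // append-only event log
    private final Map<String, Integer> offsets = new HashMap<>();  // next unread position per consumer

    void publish(String event) {
        log.add(event);
    }

    // Return the events this consumer has not read yet and advance its offset.
    List<String> poll(String consumer) {
        int from = offsets.getOrDefault(consumer, 0);
        List<String> unread = new ArrayList<>(log.subList(from, log.size()));
        offsets.put(consumer, log.size());
        return unread;
    }

    // Rewind a consumer, for example after a failure, so events are delivered again.
    void seek(String consumer, int offset) {
        if (offset < 0 || offset > log.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        offsets.put(consumer, offset);
    }

    public static void main(String[] args) {
        LogQueueSketch queue = new LogQueueSketch();
        for (String e : Arrays.asList("e1", "e2", "e3")) {
            queue.publish(e);
        }
        System.out.println(queue.poll("dashboard"));  // first consumer reads everything so far
        System.out.println(queue.poll("ml"));         // a consumer added later still sees all events
        queue.seek("dashboard", 1);                   // replay from offset 1
        System.out.println(queue.poll("dashboard"));
    }
}
```

Because events stay in the log after being read, adding another consumer or replaying after a failure only requires a different offset, not a second copy of the data.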
Message Queues
help distribute data
[Diagram: Input 1, Input 2, Input 3, Input 4; key ranges A-F, G-M, N-S, T-Z; four Processing stages]
Intro to Apache Kafka
Kafka is a distributed publish subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project
in 2012
On being HARD
Many Big Data projects are open source
implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka
actually isn't a clone of an existing closed
source product
The same codebase being used for years at LinkedIn
answers the questions:
Does it scale?
Is it robust?
Ad Hoc ETL
Decoupled ETL
Avro Schemas And Schema Registry
Keys and values in Kafka can be Strings
or byte arrays
Avro is a serialization format used
extensively with Kafka and Big Data
A Schema Registry (a separate service alongside Kafka) keeps
track of Avro schemas
Verifies that the correct schemas are being used
Consumer Groups
Commit Logs
Kafka Resources
Official docs - https://
Design section is really good read
Includes schema registry
Stream Processing
Types of Stream Processors
Event by Event: Apache Storm,
Apache Flink, Intel GearPump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow
Apache Storm and Flink
Transform one message at a time
Very low latency
State and more complex analytics difficult
Akka and Gearpump
Actor to actor messaging. Local state.
Used for extreme low latency (ad networks, etc)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don’t always need distribution!
Spark Streaming
Data processed as stream of micro batches
Higher latency (seconds), higher
throughput, more complex analysis / ML possible
Same programming model as batch
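The micro-batch idea can be illustrated without Spark. The sketch below is an assumption-laden toy, not Spark Streaming itself: MicroBatchSketch and its methods are hypothetical, events carry non-negative millisecond timestamps, and each fixed interval is handed to the same wordCount function that a batch job would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of micro-batching: the batch logic is reused unchanged per interval.
final class MicroBatchSketch {

    // The "batch" logic: count words in a collection of lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Group events into fixed intervals keyed by interval start, then run wordCount on each group.
    // Assumes timestamps are non-negative and the two arrays have the same length.
    static SortedMap<Long, Map<String, Integer>> stream(long[] timestampsMs, String[] lines, long intervalMs) {
        SortedMap<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long start = timestampsMs[i] / intervalMs * intervalMs;
            batches.computeIfAbsent(start, k -> new ArrayList<>()).add(lines[i]);
        }
        SortedMap<Long, Map<String, Integer>> results = new TreeMap<>();
        batches.forEach((start, batch) -> results.put(start, wordCount(batch)));
        return results;
    }

    public static void main(String[] args) {
        long[] ts = {100L, 900L, 1200L};
        String[] lines = {"a b", "a", "b b"};
        System.out.println(stream(ts, lines, 1000L));
    }
}
```

The interval length is the latency floor: results for an interval only appear after the interval closes, which is the trade-off against event-by-event processors.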
Why Spark?
// Word count in Spark (Scala); sc is the SparkContext
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
// The same word count as a Hadoop MapReduce job (Java)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the deprecated new Job(conf, name) constructor
        Job job = Job.getInstance(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Spark Production Deployments
Explosion of Specialized Systems
Spark and Berkeley AMP Lab
Benefits of Unified Libraries
Optimizations can be shared between libraries
Core: Project Tungsten
MLlib: Shared statistics libraries
Spark Streaming: GC and memory management
Mix and match modules
Easily go from DataFrames (SQL) to
MLlib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> // df is a placeholder: the DataFrame expression was lost from the slide text
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
Spark Worker Failure
Rebuild RDD Partitions on Worker
from Lineage
Spark SQL & DataFrames
DataFrames & Catalyst Optimizer
Catalyst Optimizations
Column and partition pruning
(Column filters)
Predicate pushdowns (Row filters)
Spark SQL Data Sources API
Enables custom data sources to participate in
SparkSQL = DataFrames + Catalyst
Production Impls
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic)
Spark Streaming
Streaming Sources
Basic: Files, Akka actors, queues of RDDs, Socket
Advanced: Kafka, Kinesis, Flume, Twitter firehose
DStreams = micro-batches
Streaming Fault Tolerance
Incoming data is replicated to 1
other node
Write Ahead Log for sources that
support ACKs
Checkpointing for recovery if Driver fails
Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets

Spark MLlib & GraphX
Spark MLlib Common Algos
Classifiers: DecisionTree, RandomForest
Clustering: K-Means, Streaming K-Means
Collaborative Filtering: Alternating Least Squares (ALS)
Spark Text Processing Algos
TF/IDF
LDA
Word2Vec
*Pro-Tip: Use Stanford CoreNLP!
Spark ML Pipelines
Modeled after scikit-learn
Spark GraphX
PageRank: Top Influencers
Connected Components: Measure of clusters
Triangle Counting: Measure of cluster density
Handling State
What Kind of State?
Non-persistent / in-memory:
concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML Models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or
on disk
Spark Streaming: updateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
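The contract behind updateStateByKey can be shown with plain Java collections. This is a sketch of the semantics only, not Spark's implementation: StateByKeySketch is a hypothetical class, the state is an ordinary map, and the update function receives the new values for a key plus the previous state, and removes the key by returning an empty Optional.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical sketch of per-key state carried from one micro-batch to the next.
final class StateByKeySketch {

    static <K, V, S> Map<K, S> updateStateByKey(
            Map<K, S> state,
            Map<K, List<V>> batch,
            BiFunction<List<V>, Optional<S>, Optional<S>> update) {
        // Every key with existing state or new values is visited, even if the batch has nothing for it.
        Set<K> keys = new HashSet<>(state.keySet());
        keys.addAll(batch.keySet());

        Map<K, S> next = new HashMap<>();
        for (K key : keys) {
            List<V> values = batch.getOrDefault(key, Collections.<V>emptyList());
            Optional<S> previous = Optional.ofNullable(state.get(key));
            // An empty result drops the key from the state.
            update.apply(values, previous).ifPresent(s -> next.put(key, s));
        }
        return next;
    }

    public static void main(String[] args) {
        // Running count of viewers per video; the names are illustrative only.
        BiFunction<List<Integer>, Optional<Integer>, Optional<Integer>> sum =
                (values, old) -> Optional.of(
                        old.orElse(0) + values.stream().mapToInt(Integer::intValue).sum());

        Map<String, Integer> state = new HashMap<>();
        state.put("video-1", 2);
        Map<String, List<Integer>> batch = new HashMap<>();
        batch.put("video-1", Arrays.asList(1, 1));
        batch.put("video-2", Arrays.asList(1));

        System.out.println(updateStateByKey(state, batch, sum));
    }
}
```

Since the state map is rebuilt for every batch and never mutated, it fits the immutable RDD model, and a snapshot of it is all that recovery needs.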
• Massively Scalable
• High Performance
• Always On
• Masterless
Scale
Apache Cassandra
• Scales Linearly to as many nodes as
you need
• Scales whenever you need
Performance
Apache Cassandra
• It’s Fast
• Built to sustain massive data insertion
rates in irregular pattern spikes
Fault Tolerance & Availability
Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add
themselves to the cluster
• DataStax drivers automatically discover
new nodes
Architecture
Apache Cassandra
• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data
structure can evolve seamlessly over time
To download:
^ Highly recommended for local
testing/cluster setup
Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
Clustering 1 Clustering 2 Clustering 3
Partition 1
Partition 2
Partition 3
City Bus Data
Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single
bus for a range of time
Can also query most recent location + speed
of all buses (slower)
1020 s 1010 s 1000 s
Bus A speed, GPS
Bus B
Bus C
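The bus model above can be mimicked with nested maps to see why one query is cheap and the other is slower. This is a toy, not Cassandra or its driver API: BusTableSketch is hypothetical, a String stands in for the bus UUID, timestamps are in seconds, and only speed is stored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of primary key = (bus id, timestamp):
// the outer map is the partition key, the inner sorted map is the clustering key.
final class BusTableSketch {
    private final Map<String, NavigableMap<Long, Double>> partitions = new HashMap<>();

    // Writing the same (busId, timestamp) twice overwrites the row, so retries are harmless.
    void write(String busId, long timestamp, double speed) {
        partitions.computeIfAbsent(busId, id -> new TreeMap<>()).put(timestamp, speed);
    }

    // Easy query: one partition, range scan over the clustering key (both bounds inclusive, from <= to).
    SortedMap<Long, Double> speeds(String busId, long from, long to) {
        NavigableMap<Long, Double> rows = partitions.get(busId);
        if (rows == null) {
            return new TreeMap<>();
        }
        return new TreeMap<>(rows.subMap(from, true, to, true));
    }

    // Slower query: the latest row of every bus has to touch every partition.
    Map<String, Double> latestSpeeds() {
        Map<String, Double> latest = new TreeMap<>();
        partitions.forEach((busId, rows) -> latest.put(busId, rows.lastEntry().getValue()));
        return latest;
    }

    public static void main(String[] args) {
        BusTableSketch table = new BusTableSketch();
        table.write("Bus A", 1000L, 30.0);
        table.write("Bus A", 1010L, 32.0);
        table.write("Bus A", 1020L, 35.0);
        table.write("Bus B", 1010L, 20.0);
        System.out.println(table.speeds("Bus A", 1005L, 1020L));
        System.out.println(table.latestSpeeds());
    }
}
```

The single-bus range query reads one sorted partition, while the all-buses query fans out across partitions, which matches the "slower" note on the slide.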
Using Cassandra for
Short Term Storage
Idea is to store and read small values
Idempotent writes + huge write
capacity = ideal for streaming ingestion
For example, store last few (latest +
last N) snapshots of buses, taxi
locations, recent traffic info
But Mommy! What about
longer term data?
I need to read lots
of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from
large quantities of data
- Storing scored/classified data
from models
- OLAP / Data Warehousing
Can Cassandra
Handle Batch?
Cassandra tables are much better at
lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage
and analysis
But are files compatible with streaming?
Lambda is Hard
and Expensive
Very high TCO - Many moving parts - KV store,
real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data,
network hops, translating domain objects
Reconcile queries against two different places
NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream
Can Cassandra do
batch and ad-hoc?
Yes, it can be competitive with Hadoop actually…
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra - analytics database
Store your data using Protobuf / Avro / etc.
Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x
faster than regular Cassandra tables
Very fine grained filtering for sub-second
concurrent queries
Easy BI and ad-hoc analysis via Spark SQL /
DataFrames (JDBC etc.)
Uses Cassandra for robust, proven storage
Combining FiloDB
+ Cassandra
Regular Cassandra tables for highly concurrent,
aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLlib / building models
Data storage for classified / predicted /
scored data
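[Diagram: Message Queue, Events, Spark Streaming; Cassandra: short term storage, K-V; FiloDB: events, ad-hoc, batch; Spark: ad hoc, SQL, ML; dashboards, maps]
[Diagram: Streaming Models; FiloDB: Long term event storage; Spark; Learned]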
FiloDB + Cassandra
Robust, peer to peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event
storage & ad hoc querying
Use as a source to build detailed models
Thank you!

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
