Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Building Scalable Data Pipelines
Evan Chan

Who am I
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9
Co-creator and maintainer of Spark Job Server

Tuplejump - Big Data Dev Partners

Instant Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)

How Fast do you Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?

What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning

Common Components
[Diagram: Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users]

Example: Real-time trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot” data in stream processor
Combine with recent and historical top K feature vectors in database
Update recent feature vectors in database
Serve to users
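
The loop on this slide (fold each small batch into stored state, then serve the current top K) can be sketched in plain Java. This is only an illustration under simplifying assumptions: plain per-asset counts stand in for the feature vectors, an in-memory map stands in for the database, and the class name TrendTracker and its methods are hypothetical, not part of any library mentioned in the talk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps running counts per asset ID and reports the top K.
final class TrendTracker {
    // Stand-in for the "recent" state that the slide keeps in a database.
    private final Map<String, Long> recentCounts = new HashMap<>();

    // Fold one micro-batch (the asset ID seen in each event) into the state.
    void applyBatch(List<String> assetIds) {
        for (String id : assetIds) {
            recentCounts.merge(id, 1L, Long::sum);
        }
    }

    // Top K asset IDs by count; ties are broken alphabetically for determinism.
    List<String> topK(int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(recentCounts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

In a real pipeline applyBatch would run once per streaming batch and topK would back the serving layer.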

Smart City Streaming Data
City buses - regular telemetry (position, velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies

Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by 311 and 911 events?
How long will it take the next bus to get here?
Where is the closest bus to where I am?

Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't need sweeping
How does traffic affect waiting times in public transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit routes be placed?

[Architecture diagram. Sources: 311, 911, Buses, Metro. Components: Message Queue, Stream Processing Layer, Event storage, Short term telemetry, Models. Outputs: Ad-Hoc, Dashboard.]

The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations as possible with as few components as possible
Real-time: “NoETL”
Community: best of breed OSS projects with huge adoption and commercial support

[Diagram: Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users]

Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in case of failure

Message Queues help distribute data
[Diagram: Input 1 through Input 4 feed a message queue split by key range (A-F, G-M, N-S, T-Z), with one Processing consumer per range]
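
The key ranges in this diagram (A-F, G-M, N-S, T-Z) suggest routing each event to a partition by its key, so that every Processing node sees only its own slice. A minimal sketch of that idea follows; the class RangePartitioner is hypothetical and keys are assumed to start with an ASCII letter. Note that Kafka's default partitioner hashes the key rather than using letter ranges, so this mirrors the slide, not Kafka's behavior.

```java
// Hypothetical range partitioner matching the four key ranges on the slide.
final class RangePartitioner {
    // Inclusive upper bound of each range: A-F, G-M, N-S, T-Z.
    private static final char[] UPPER_BOUNDS = {'F', 'M', 'S', 'Z'};

    // Returns the partition index (0 to 3) for a key, based on its first letter.
    static int partitionFor(String key) {
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must be non-empty");
        }
        char first = Character.toUpperCase(key.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("key must start with a letter A-Z: " + key);
        }
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (first <= UPPER_BOUNDS[i]) {
                return i;
            }
        }
        throw new IllegalStateException("unreachable: 'Z' is the last bound");
    }
}
```

Because the same key always maps to the same partition, all events for one key are handled by the same consumer, which keeps per-key state local.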

Intro to Apache Kafka
Kafka is a distributed publish-subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
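
To make the commit log idea concrete, here is a toy in-memory append-only log in plain Java. It is a sketch of the concept only, not Kafka's API: the class CommitLog and its methods are hypothetical. Each record gets an offset when appended, and a consumer that remembers its last offset can replay from there after a failure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy append-only log: records are never modified, only appended.
final class CommitLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its offset (its position in the log).
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns a copy of every record from the given offset to the end of the log.
    List<String> readFrom(long offset) {
        if (offset < 0 || offset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return Collections.unmodifiableList(
                new ArrayList<>(records.subList((int) offset, records.size())));
    }
}
```

Since reading does not remove anything, several independent consumers can each read from their own offset, and any of them can re-read old records.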

On being HARD
Many Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product
The same codebase being used for years at LinkedIn answers the questions:
Does it scale?
Is it robust?

Avro Schemas And Schema Registry
Kafka itself stores keys and values as byte arrays; serializers handle Strings and other types
Avro is a serialization format used extensively with Kafka and Big Data
The Schema Registry (a separate Confluent component, not part of Apache Kafka itself) keeps track of Avro schemas
It verifies that the correct schemas are being used
Kafka Resources
Official docs - https://kafka.apache.org/documentation.html
The Design section is a really good read
http://www.confluent.io/product
Includes schema registry

Types of Stream Processors
Event by Event: Apache Storm, Apache Flink, Intel Gearpump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow

Apache Storm and Flink
Transform one message at a time
Very low latency
State and more complex analytics difficult

Akka and Gearpump
Actor to actor messaging. Local state.
Used for extreme low latency (ad networks, etc)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don’t always need distribution!

Spark Streaming
Data processed as a stream of micro-batches
Higher latency (seconds), higher throughput, more complex analysis / ML possible
Same programming model as batch
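
The micro-batch model can be illustrated without Spark: events are grouped into fixed-length intervals, and each group is then processed as a small batch. The sketch below is plain Java and only mimics the grouping step. The class MicroBatcher is hypothetical, and the timestamps are assumed to be arrival times in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that slices a stream of timestamps into fixed-size batches.
final class MicroBatcher {
    // Groups event times (ms) by the start time of the batch interval they fall into.
    static Map<Long, List<Long>> batch(List<Long> eventTimesMs, long batchIntervalMs) {
        if (batchIntervalMs <= 0) {
            throw new IllegalArgumentException("batchIntervalMs must be positive");
        }
        Map<Long, List<Long>> batches = new TreeMap<>(); // sorted by batch start time
        for (long t : eventTimesMs) {
            long batchStart = Math.floorDiv(t, batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }
}
```

A larger interval means fewer, bigger batches (more throughput per batch, more latency), which is the trade-off the slide describes.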

Why Spark?
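// Word count in Spark (Scala); `spark` here is the SparkContext
val file = spark.textFile("hdfs://...")

file.flatMap(line => line.split(" "))   // split each line into words
    .map(word => (word, 1))             // pair each word with a count of 1
    .reduceByKey(_ + _)                 // sum the counts per word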

// Word count as a Hadoop MapReduce job (Java), for comparison
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in every input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

Explosion of Specialized Systems

Benefits of Unified Libraries
Optimizations can be shared between libraries
Core: Project Tungsten
MLlib: Shared statistics libraries
Spark Streaming: GC and memory management

Mix and match modules
Easily go from DataFrames (SQL) to MLlib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")

Spark Worker Failure
Rebuild RDD Partitions on Worker from Lineage

DataFrames & Catalyst Optimizer

Catalyst Optimizations
Column and partition pruning (Column filters)
Predicate pushdowns (Row filters)

Spark SQL Data Sources API
Enables custom data sources to participate in Spark SQL = DataFrames + Catalyst
Production implementations
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic.co)

Streaming Sources
Basic: Files, Akka actors, queues of RDDs, sockets
Advanced: Kafka, Kinesis, Flume, Twitter firehose

Streaming Fault Tolerance
Incoming data is replicated to 1 other node
Write Ahead Log for sources that support ACKs
Checkpointing for recovery if the Driver fails

Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets

Spark MLlib Common Algos
Classifiers: DecisionTree, RandomForest
Clustering: K-Means, Streaming K-Means
Collaborative Filtering: Alternating Least Squares (ALS)

Spark Text Processing Algos
TF-IDF
LDA
Word2Vec
*Pro-Tip: Use Stanford CoreNLP!

Spark ML Pipelines
Modeled after scikit-learn

Spark GraphX
PageRank: top influencers
Connected Components: measure of clusters
Triangle Counting: measure of cluster density

What Kind of State?
Non-persistent / in-memory: concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML Models, predictions, scored data

Spark RDDs
Immutable, cache in memory and/or on disk
Spark Streaming: updateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery

• Massively Scalable
• High Performance
• Always On
• Masterless

Scale
Apache Cassandra
• Scales linearly to as many nodes as you need
• Scales whenever you need

Performance
Apache Cassandra
• It’s Fast
• Built to sustain massive data insertion rates in irregular pattern spikes

Fault Tolerance & Availability
Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes

Architecture
Apache Cassandra
• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data structure can evolve seamlessly over time

To download:
https://cassandra.apache.org/download/
https://github.com/pcmanus/ccm
^ Highly recommended for local testing/cluster setup

Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns

              Clustering 1   Clustering 2   Clustering 3
Partition 1
Partition 2
Partition 3

City Bus Data Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single bus for a range of time
Can also query most recent location + speed of all buses (slower)

         1020 s       1010 s       1000 s
Bus A    speed, GPS
Bus B
Bus C
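
The access pattern behind this model can be mimicked with nested maps: the outer map plays the role of the partition key (bus ID) and the inner sorted map plays the role of the clustering key (timestamp). This is a toy in-memory sketch, not a Cassandra client. The class BusTelemetryTable is hypothetical, and only speed is stored to keep it short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a table with primary key (bus ID, timestamp).
final class BusTelemetryTable {
    // Partition key -> rows sorted by clustering key (timestamp in seconds).
    private final Map<String, NavigableMap<Long, Double>> partitions = new HashMap<>();

    void write(String busId, long timestampSec, double speed) {
        partitions.computeIfAbsent(busId, k -> new TreeMap<>()).put(timestampSec, speed);
    }

    // Fast path: a range scan by timestamp inside a single partition.
    NavigableMap<Long, Double> speeds(String busId, long fromSec, long toSec) {
        NavigableMap<Long, Double> rows = partitions.get(busId);
        if (rows == null) {
            return new TreeMap<Long, Double>();
        }
        return new TreeMap<Long, Double>(rows.subMap(fromSec, true, toSec, true));
    }

    // Slow path: the latest row of every bus, which touches every partition.
    Map<String, Double> latestSpeeds() {
        Map<String, Double> latest = new HashMap<>();
        for (Map.Entry<String, NavigableMap<Long, Double>> e : partitions.entrySet()) {
            latest.put(e.getKey(), e.getValue().lastEntry().getValue());
        }
        return latest;
    }
}
```

The single-bus range query reads one sorted map, while the all-buses query has to visit every partition, which is why the slide marks it as slower.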

Using Cassandra for Short Term Storage
The idea is to store and read small values
Idempotent writes + huge write capacity = ideal for streaming ingestion
For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
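
Why idempotent writes suit streaming ingestion: if the stream processor replays a batch after a failure, rewriting the same (bus, timestamp) row must leave the stored state unchanged. A small sketch of that property in plain Java follows. The class RecentSnapshots is hypothetical, and the "keep only the last N" trimming is done by hand here purely for illustration; it is not a claim about how Cassandra itself retains data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy store that keeps the latest N snapshots per bus, keyed by timestamp.
final class RecentSnapshots {
    private final int maxPerBus;
    private final Map<String, TreeMap<Long, String>> byBus = new HashMap<>();

    RecentSnapshots(int maxPerBus) {
        if (maxPerBus < 1) {
            throw new IllegalArgumentException("maxPerBus must be >= 1");
        }
        this.maxPerBus = maxPerBus;
    }

    // Upsert: writing the same (busId, timestamp) again overwrites the row,
    // so replaying a batch does not create duplicates.
    void upsert(String busId, long timestampSec, String snapshot) {
        TreeMap<Long, String> rows = byBus.computeIfAbsent(busId, k -> new TreeMap<>());
        rows.put(timestampSec, snapshot);
        while (rows.size() > maxPerBus) {
            rows.pollFirstEntry(); // drop the oldest snapshot
        }
    }

    int count(String busId) {
        TreeMap<Long, String> rows = byBus.get(busId);
        return rows == null ? 0 : rows.size();
    }

    String latest(String busId) {
        TreeMap<Long, String> rows = byBus.get(busId);
        return rows == null ? null : rows.lastEntry().getValue();
    }
}
```

With an append-style store the replay would double-count events; with keyed upserts the replay is harmless.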

But Mommy! What about longer term data?

I need to read lots of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from large quantities of data
- Storing scored/classified data from models
- OLAP / Data Warehousing

Can Cassandra Handle Batch?
Cassandra tables are much better at lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage and analysis
But are files compatible with streaming?

Lambda is Hard and Expensive
Very high TCO - Many moving parts - KV store, real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data, network hops, translating domain objects
Reconcile queries against two different places

NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream

Can Cassandra do batch and ad-hoc?
Yes, it can actually be competitive with Hadoop…
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.

Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables
Very fine grained filtering for sub-second concurrent queries
Easy BI and ad-hoc analysis via Spark SQL / DataFrames (JDBC etc.)
Uses Cassandra for robust, proven storage

Combining FiloDB + Cassandra
Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLlib / building models
Data storage for classified / predicted / scored data

[Architecture diagram. Components: Events, Message Queue, Spark Streaming, Cassandra (short term storage, K-V), FiloDB (events, ad-hoc, batch), Spark (ad hoc, SQL, ML). Outputs: dashboards, maps.]

[Architecture diagram. Components: Events, Message Queue, Spark Streaming, Models, Cassandra, FiloDB (long term event storage), Spark, Learned Data.]

FiloDB + Cassandra
Robust, peer to peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event storage & ad hoc querying
Use as a source to build detailed models

Thank you!
@evanfchan
http://tuplejump.com
