SlideShare a Scribd company logo
Building Scalable
Data Pipelines
Evan Chan
Who am I
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark
since 0.9
Co-creator and maintainer of
Spark Job Server
TupleJump - Big Data Dev Partners 3
Instant
Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)
Fast Data, not

Big Data
How Fast do you
Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?
What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning
Common
Components
Message
Queue
Events
Stream
Processing
Layer
State /
Database
Happy
Users
Example: Real-time
trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot”
data in stream processor
Combine with recent and historical top K
feature vectors in database
Update database recent feature vectors
Serve to users
Example 2: Smart
Cities
Smart City
Streaming Data
City buses - regular telemetry (position,
velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart
cards
311 info
911 info - new emergencies
Citizens want to
know…
Where and for how long can I park my
car?
Are transportation options affected by
311 and 911 events?
How long will it take the next bus to
get here?
Where is the closest bus to where I am?
Cities want to
know…
How can I maximize parking revenue?
More granular updates to parking spots that don't
need sweeping
How does traffic affect waiting times in public
transit, and revenue?
Patterns in subway train times - is a breakdown
coming?
Population movement - where should new transit
routes be placed?
Message
Queue
Stream
Processing
Layer
Event
storage
Ad-
Hoc
311
911
Buses
Metro
Short term
telemetry
Models
Dashboard
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations
as possible with as few components as
possible
Real-time: “NoETL”
Community: best of breed OSS projects with
huge adoption and commercial support
Message Queue
Message
Queue
Events
Stream
Processing
Layer
State /
Database
Happy
Users
Why a message
queue?
Centralized publish-subscribe of
events
Need more processing? Add another
consumer
Buffer traffic spikes
Replay events in cases of failure
Message Queues
help distribute data
A-F
G-M
N-S
T-Z
Input 1
Input 2
Input3
Input4
Processing
Processing
Processing
Processing
Intro to Apache
Kafka
Kafka is a distributed publish subscribe
system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project
in 2012
On being HARD
Many Big Data projects are open source
implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka
actually isn't a clone of an existing closed
source product
The same codebase being used for years at LinkedIn
answers the questions:
Does it scale?
Is it robust?
Ad Hoc ETL
Decoupled ETL
Avro Schemas And Schema Registry
Keys and values in Kafka can be Strings
or byte arrays
Avro is a serialization format used
extensively with Kafka and Big Data
Kafka uses a Schema Registry to keep
track of Avro schemas
Verifies that the correct schemas are being used
Consumer Groups
Commit Logs
Kafka Resources
Official docs - https://
kafka.apache.org/
documentation.html
Design section is really good read
http://www.confluent.io/product
Includes schema registry
Stream Processing
Message
Queue
Events
Stream
Processing
Layer
State /
Database
Happy
Users
Types of Stream
Processors
Event by Event: Apache Storm,
Apache Flink, Intel GearPump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow
Apache Storm and
Flink
Transform one message at a time
Very low latency
State and more complex analytics difficult
Akka and
Gearpump
Actor to actor messaging. Local state.
Used for extreme low latency (ad networks, etc)
Dynamically reconfigurable topology
Configurable fault tolerance and failure
recovery
Cluster or local mode - you don’t always need
distribution!
Spark Streaming
Data processed as stream of micro batches
Higher latency (seconds), higher
throughput, more complex analysis / ML
possible
Same programming model as batch
Why Spark?
file = spark.textFile("hdfs://...")
 
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
1 package org.myorg;
2
3 import java.io.IOException;
4 import java.util.*;
5
6 import org.apache.hadoop.fs.Path;
7 import org.apache.hadoop.conf.*;
8 import org.apache.hadoop.io.*;
9 import org.apache.hadoop.mapreduce.*;
10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
14
15 public class WordCount {
16
17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
18 private final static IntWritable one = new IntWritable(1);
19 private Text word = new Text();
20
21 public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
22 String line = value.toString();
23 StringTokenizer tokenizer = new StringTokenizer(line);
24 while (tokenizer.hasMoreTokens()) {
25 word.set(tokenizer.nextToken());
26 context.write(word, one);
27 }
28 }
29 }
30
31 public static class Reduce extends Reducer<Text, IntWritable, Text,
IntWritable> {
32
33 public void reduce(Text key, Iterable<IntWritable> values, Context context)
34 throws IOException, InterruptedException {
35 int sum = 0;
36 for (IntWritable val : values) {
37 sum += val.get();
38 }
39 context.write(key, new IntWritable(sum));
40 }
41 }
42
43 public static void main(String[] args) throws Exception {
44 Configuration conf = new Configuration();
45
46 Job job = new Job(conf, "wordcount");
47
48 job.setOutputKeyClass(Text.class);
49 job.setOutputValueClass(IntWritable.class);
50
51 job.setMapperClass(Map.class);
52 job.setReducerClass(Reduce.class);
53
54 job.setInputFormatClass(TextInputFormat.class);
55 job.setOutputFormatClass(TextOutputFormat.class);
56
57 FileInputFormat.addInputPath(job, new Path(args[0]));
58 FileOutputFormat.setOutputPath(job, new Path(args[1]));
59
60 job.waitForCompletion(true);
61 }
62
63 }
Spark Production Deployments
Explosion of Specialized Systems
Spark and Berkeley AMP Lab
Benefits of Unified Libraries
Optimizations can be shared between libraries
Core
Project Tungsten
MLlib
Shared statistics libraries
Spark Streaming
GC and memory management
Mix and match
modules
Easily go from DataFrames (SQL) to
MLLib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
Spark Worker Failure
Rebuild RDD Partitions on Worker
from Lineage
Spark SQL & DataFrames
DataFrames & Catalyst Optimizer
Catalyst Optimizations
Column and partition pruning
(Column filters)
Predicate pushdowns (Row filters)
Spark SQL Data Sources API
Enables custom data sources to participate in
SparkSQL = DataFrames + Catalyst
Production Impls
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic.co)
Spark Streaming
Streaming Sources
Basic: Files, Akka actors, queues of RDDs,
Socket
Advanced
Kafka
Kinesis
Flume
Twitter firehose
DStreams = micro-batches
Streaming Fault Tolerance
Incoming data is replicated to 1
other node
Write Ahead Log for sources that
support ACKs
Checkpointing for recovery if Driver
fails
Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets

Spark MLlib & GraphX
Spark MLlib Common Algos
Classifiers
DecisionTree, RandomForest
Clustering
K-Means, Streaming K-Means
Collaborative Filtering
Alternating Least Squares (ALS)
Spark Text Processing Algos
TF/IDF
LDA
Word2Vec
*Pro-Tip:
Use Stanford CoreNLP!
Spark ML Pipelines
Modeled after scikit-learn
Spark GraphX
PageRank
Top Influencers
Connected Components
Measure of clusters
Triangle Counting
Measure of cluster density
Handling State
Message
Queue
Events
Stream
Processing
Layer
State /
Database
Happy
Users
What Kind of State?
Non-persistent / in-memory:
concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate
storage
ML Models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or
on disk
Spark Streaming: UpdateStateByKey
IndexedRDD - can update bits of
data
Snapshotting for recovery
•Massively Scalable
• High Performance
• Always On
• Masterless
Scale
Apache Cassandra
• Scales Linearly to as many nodes as
you need
• Scales whenever you need
Performance
Apache Cassandra
• It’s Fast
• Built to sustain massive data insertion
rates in irregular pattern spikes
Fault
Tolerance
&
Availability
Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add
themselves to the cluster
• DataStax drivers automatically discover
new nodes
Architecture
Apache Cassandra
• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data
structure can evolve seamlessly over
time
To download:
https://cassandra.apache.org/
download/
https://github.com/pcmanus/ccm
^ Highly recommended for local
testing/cluster setup
Cassandra Data
Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
Clustering 1 Clustering 2 Clustering 3
Partition 1
Partition 2
Partition 3
City Bus Data
Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single
bus for a range of time
Can also query most recent location + speed
of all buses (slower)
1020 s 1010 s 1000 s
Bus A speed, GPS
Bus B
Bus C
Using Cassandra for
Short Term Storage
Idea is store and read small values
Idempotent writes + huge write
capacity = ideal for streaming
ingestion
For example, store last few (latest +
last N) snapshots of buses, taxi
locations, recent traffic info
But Mommy! What about
longer term data?
I need to read lots
of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from
large quantities of data
- Storing scored/classified data
from models
- OLAP / Data Warehousing
Can Cassandra
Handle Batch?
Cassandra tables are much better at
lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage
and analysis
But are files compatible with streaming?
Lambda
Architecture
Lambda is Hard
and Expensive
Very high TCO - Many moving parts - KV store,
real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data,
network hops, translating domain objects
Reconcile queries against two different places
NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream
Can Cassandra do
batch and ad-hoc?
Yes, it can be competitive with Hadoop
actually….
If you know how to be creative with storing your
data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.
Introduction to
FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x
faster than regular Cassandra tables
Very fine grained filtering for sub-second
concurrent queries
Easy BI and ad-hoc analysis via Spark SQL/
Dataframes (JDBC etc.)
Uses Cassandra for robust, proven storage
Combining FiloDB
+ Cassandra
Regular Cassandra tables for highly concurrent,
aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event
storage
Ad hoc / SQL / BI
Data source for MLLib / building models
Data storage for classified / predicted /
scored data
Message
Queue
Events
Spark
Streaming
Short term
storage, K-V
Adhoc,
SQL, ML
Cassandra
FiloDB: Events,
ad-hoc, batch
Spark
Dashboa
rds,
maps
Message
Queue
Events
Spark
Streaming Models
Cassandra
FiloDB: Long term event storage
Spark Learned
Data
FiloDB + Cassandra
Robust, peer to peer, proven storage
platform
Use for short term snapshots, dashboards
Use for efficient long term event
storage & ad hoc querying
Use as a source to build detailed
models
Thank you!
@evanfchan
http://tuplejump.com

More Related Content

What's hot

Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQ
George Teo
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
Databricks
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
Kishore Gopalakrishna
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Databricks
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
Xiang Fu
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
Databricks
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
Russell Jurney
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Enterprise Knowledge Graph
Enterprise Knowledge GraphEnterprise Knowledge Graph
Enterprise Knowledge Graph
Benjamin Raethlein
 
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
NAVER D2
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
Data Con LA
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxData
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
confluent
 

What's hot (20)

Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQ
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Enterprise Knowledge Graph
Enterprise Knowledge GraphEnterprise Knowledge Graph
Enterprise Knowledge Graph
 
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 

Viewers also liked

Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
Amazon Web Services
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson
 
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
Mathieu Bastian
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
Growth Intelligence
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
DataWorks Summit
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs
 

Viewers also liked (8)

Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 

Similar to Building Scalable Data Pipelines - 2016 DataPalooza Seattle

The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
gagravarr
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
Timothy Spann
 
Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
Bartosz Jankiewicz
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
Adam Muise
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
Slim Baltagi
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
Slim Baltagi
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
Timothy Spann
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
Jenn Rawlins
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Thomas Weise
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
Stephan Ewen
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
Robert Metzger
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platform
confluent
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 

Similar to Building Scalable Data Pipelines - 2016 DataPalooza Seattle (20)

The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platform
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 

More from Evan Chan

Time-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 TalkTime-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 Talk
Evan Chan
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
Evan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
Evan Chan
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
Evan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Evan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
Evan Chan
 

More from Evan Chan (17)

Time-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 TalkTime-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 Talk
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 

Recently uploaded

PBL _PPT _final year for engineerin student
PBL _PPT _final  year for engineerin studentPBL _PPT _final  year for engineerin student
PBL _PPT _final year for engineerin student
nikitalohar549
 
FARM POND AND Percolation POND BY agri studentsPPT.pptx
FARM POND AND Percolation POND BY agri studentsPPT.pptxFARM POND AND Percolation POND BY agri studentsPPT.pptx
FARM POND AND Percolation POND BY agri studentsPPT.pptx
hasleyradha2109
 
Modified O-RAN 5G Edge Reference Architecture using RNN
Modified O-RAN 5G Edge Reference Architecture using RNNModified O-RAN 5G Edge Reference Architecture using RNN
Modified O-RAN 5G Edge Reference Architecture using RNN
ijwmn
 
VEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering SolutionsVEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering Solutions
VijayEngineeringandM1
 
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdfMachine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Dr. Shivashankar
 
HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.
HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.
HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.
NoeAranel
 
The Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offerThe Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offer
ekyhonz
 
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdfR18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
bibej11828
 
Database management system module -3 bcs403
Database management system module -3 bcs403Database management system module -3 bcs403
Database management system module -3 bcs403
Tharani4825
 
Sea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy ResourcesSea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy Resources
21h16charis
 
Human_assault project using jetson nano new
Human_assault project using jetson nano newHuman_assault project using jetson nano new
Human_assault project using jetson nano new
frostflash010
 
Future Networking v Energy Limits ICTON 2024 Bari Italy
Future Networking v Energy Limits ICTON 2024 Bari ItalyFuture Networking v Energy Limits ICTON 2024 Bari Italy
Future Networking v Energy Limits ICTON 2024 Bari Italy
University of Hertfordshire
 
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
Kiran Kumar Manigam
 
Cyber security detailed ppt and understand
Cyber security detailed ppt and understandCyber security detailed ppt and understand
Cyber security detailed ppt and understand
docpain605501
 
Driving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxx
Driving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxxDriving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxx
Driving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxx
Tamara Johnson
 
UNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -ManufactUNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -Manufact
Mr.C.Dineshbabu
 
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
gerogepatton
 
Bell Crank Lever.pptxDesign of Bell Crank Lever
Bell Crank Lever.pptxDesign of Bell Crank LeverBell Crank Lever.pptxDesign of Bell Crank Lever
Bell Crank Lever.pptxDesign of Bell Crank Lever
ssuser110cda
 
Predicting damage in notched functionally graded materials plates thr...
Predicting  damage  in  notched  functionally  graded  materials  plates  thr...Predicting  damage  in  notched  functionally  graded  materials  plates  thr...
Predicting damage in notched functionally graded materials plates thr...
Barhm Mohamad
 
software engineering software engineering
software engineering software engineeringsoftware engineering software engineering
software engineering software engineering
PrabhuB33
 

Recently uploaded (20)

PBL _PPT _final year for engineerin student
PBL _PPT _final  year for engineerin studentPBL _PPT _final  year for engineerin student
PBL _PPT _final year for engineerin student
 
FARM POND AND Percolation POND BY agri studentsPPT.pptx
FARM POND AND Percolation POND BY agri studentsPPT.pptxFARM POND AND Percolation POND BY agri studentsPPT.pptx
FARM POND AND Percolation POND BY agri studentsPPT.pptx
 
Modified O-RAN 5G Edge Reference Architecture using RNN
Modified O-RAN 5G Edge Reference Architecture using RNNModified O-RAN 5G Edge Reference Architecture using RNN
Modified O-RAN 5G Edge Reference Architecture using RNN
 
VEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering SolutionsVEMC: Trusted Leader in Engineering Solutions
VEMC: Trusted Leader in Engineering Solutions
 
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdfMachine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
 
HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.
HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.
HSE-BMS-009 COSHH & MSDS.pptHSE-BMS-009 COSHH & MSDS.pptSE-BMS-009 COSHH & MSDS.
 
The Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offerThe Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offer
 
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdfR18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
 
Database management system module -3 bcs403
Database management system module -3 bcs403Database management system module -3 bcs403
Database management system module -3 bcs403
 
Sea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy ResourcesSea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy Resources
 
Human_assault project using jetson nano new
Human_assault project using jetson nano newHuman_assault project using jetson nano new
Human_assault project using jetson nano new
 
Future Networking v Energy Limits ICTON 2024 Bari Italy
Future Networking v Energy Limits ICTON 2024 Bari ItalyFuture Networking v Energy Limits ICTON 2024 Bari Italy
Future Networking v Energy Limits ICTON 2024 Bari Italy
 
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
RAILWAYS, a vital part of our infrastructure, play a crucial role in ensuring...
 
Cyber security detailed ppt and understand
Cyber security detailed ppt and understandCyber security detailed ppt and understand
Cyber security detailed ppt and understand
 
Driving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxx
Driving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxxDriving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxx
Driving Safety.pptxxxxxxxxxxxxxxxxxxxxxxxxxx
 
UNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -ManufactUNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -Manufact
 
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
The Transformation Risk-Benefit Model of Artificial Intelligence: Balancing R...
 
Bell Crank Lever.pptxDesign of Bell Crank Lever
Bell Crank Lever.pptxDesign of Bell Crank LeverBell Crank Lever.pptxDesign of Bell Crank Lever
Bell Crank Lever.pptxDesign of Bell Crank Lever
 
Predicting damage in notched functionally graded materials plates thr...
Predicting  damage  in  notched  functionally  graded  materials  plates  thr...Predicting  damage  in  notched  functionally  graded  materials  plates  thr...
Predicting damage in notched functionally graded materials plates thr...
 
software engineering software engineering
software engineering software engineeringsoftware engineering software engineering
software engineering software engineering
 

Building Scalable Data Pipelines - 2016 DataPalooza Seattle

  • 2. Who am I Distinguished Engineer, Tuplejump @evanfchan http://velvia.github.io User and contributor to Spark since 0.9 Co-creator and maintainer of Spark Job Server
  • 3. TupleJump - Big Data Dev Partners 3
  • 4. Instant Gratification I want insights now I want to act on news right away I want stuff personalized for me (?)
  • 6. How Fast do you Need to Act? Financial trading - milliseconds Dashboards - seconds to minutes BI / Reports - hours to days?
  • 7. What’s Your App? Concurrent video viewers Anomaly detection Clickstream analysis Live geospatial maps Real-time trend detection & learning
  • 9. Example: Real-time trend detection Events: time, OS, location, asset/product ID Analyze 1-5 second batches of new “hot” data in stream processor Combine with recent and historical top K feature vectors in database Update database recent feature vectors Serve to users
  • 11. Smart City Streaming Data City buses - regular telemetry (position, velocity, timestamp) Street sweepers - regular telemetry Transactions from rail, subway, buses, smart cards 311 info 911 info - new emergencies
  • 12. Citizens want to know… Where and for how long can I park my car? Are transportation options affected by 311 and 911 events? How long will it take the next bus to get here? Where is the closest bus to where I am?
  • 13. Cities want to know… How can I maximize parking revenue? More granular updates to parking spots that don't need sweeping How does traffic affect waiting times in public transit, and revenue? Patterns in subway train times - is a breakdown coming? Population movement - where should new transit routes be placed?
  • 15. The HARD Principle Highly Available, Resilient, Distributed Flexibility - do as many transformations as possible with as few components as possible Real-time: “NoETL” Community: best of breed OSS projects with huge adoption and commercial support
  • 18. Why a message queue? Centralized publish-subscribe of events Need more processing? Add another consumer Buffer traffic spikes Replay events in cases of failure
  • 19. Message Queues help distribute data A-F G-M N-S T-Z Input 1 Input 2 Input3 Input4 Processing Processing Processing Processing
  • 20. Intro to Apache Kafka Kafka is a distributed publish subscribe system It uses a commit log to track changes Kafka was originally created at LinkedIn Open sourced in 2011 Graduated to a top-level Apache project in 2012
  • 21. On being HARD Many Big Data projects are open source implementations of closed source products Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product The same codebase being used for years at LinkedIn answers the questions: Does it scale? Is it robust?
  • 24. Avro Schemas And Schema Registry Keys and values in Kafka can be Strings or byte arrays Avro is a serialization format used extensively with Kafka and Big Data Kafka uses a Schema Registry to keep track of Avro schemas Verifies that the correct schemas are being used
  • 27. Kafka Resources Official docs - https:// kafka.apache.org/ documentation.html Design section is really good read http://www.confluent.io/product Includes schema registry
  • 30. Types of Stream Processors Event by Event: Apache Storm, Apache Flink, Intel GearPump, Akka Micro-batch: Apache Spark Hybrid? Google Dataflow
  • 31. Apache Storm and Flink Transform one message at a time Very low latency State and more complex analytics difficult
  • 32. Akka and Gearpump Actor to actor messaging. Local state. Used for extreme low latency (ad networks, etc) Dynamically reconfigurable topology Configurable fault tolerance and failure recovery Cluster or local mode - you don’t always need distribution!
  • 33. Spark Streaming Data processed as stream of micro batches Higher latency (seconds), higher throughput, more complex analysis / ML possible Same programming model as batch
  • 34. Why Spark? file = spark.textFile("hdfs://...")   file.flatMap(line => line.split(" "))     .map(word => (word, 1))     .reduceByKey(_ + _) 1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 }
  • 38. Benefits of Unified Libraries Optimizations can be shared between libraries Core Project Tungsten MLlib Shared statistics libraries Spark Streaming GC and memory management
  • 39. Mix and match modules Easily go from DataFrames (SQL) to MLLib / statistics, for example: scala> import org.apache.spark.mllib.stat.Statistics scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble) numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848 scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble) numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848 scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
  • 40. Spark Worker Failure Rebuild RDD Partitions on Worker from Lineage
  • 41. Spark SQL & DataFrames
  • 43. Catalyst Optimizations Column and partition pruning (Column filters) Predicate pushdowns (Row filters)
  • 44. Spark SQL Data Sources API Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst Production Impls spark-csv (Databricks) spark-avro (Databricks) spark-cassandra-connector (DataStax) elasticsearch-hadoop (Elastic.co)
  • 46. Streaming Sources Basic: Files, Akka actors, queues of RDDs, Socket Advanced Kafka Kinesis Flume Twitter firehose
  • 48. Streaming Fault Tolerance Incoming data is replicated to 1 other node Write Ahead Log for sources that support ACKs Checkpointing for recovery if Driver fails
  • 49. Direct Kafka Streaming: KafkaRDD No single Receiver Parallelizable No Write Ahead Log Kafka *is* the Write Ahead Log! KafkaRDD stores Kafka offsets KafkaRDD partitions recover from offsets

  • 50. Spark MLlib & GraphX
  • 51. Spark MLlib Common Algos Classifiers DecisionTree, RandomForest Clustering K-Means, Streaming K-Means Collaborative Filtering Alternating Least Squares (ALS)
  • 52. Spark Text Processing Algos TF/IDF LDA Word2Vec *Pro-Tip: Use Stanford CoreNLP!
  • 53. Spark ML Pipelines Modeled after scikit-learn
  • 54. Spark GraphX PageRank Top Influencers Connected Components Measure of clusters Triangle Counting Measure of cluster density
  • 57. What Kind of State? Non-persistent / in-memory: concurrent viewers Short term: latest trends Longer term: raw event & aggregate storage ML Models, predictions, scored data
  • 58. Spark RDDs Immutable, cache in memory and/or on disk Spark Streaming: UpdateStateByKey IndexedRDD - can update bits of data Snapshotting for recovery
  • 59. •Massively Scalable • High Performance • Always On • Masterless
  • 60. Scale Apache Cassandra • Scales Linearly to as many nodes as you need • Scales whenever you need
  • 61. Performance Apache Cassandra • It’s Fast • Built to sustain massive data insertion rates in irregular pattern spikes
  • 62. Fault Tolerance & Availability Apache Cassandra • Automatic Replication • Multi Datacenter • Decentralized - no single point of failure • Survive regional outages • New nodes automatically add themselves to the cluster • DataStax drivers automatically discover new nodes
  • 63. Architecture Apache Cassandra • Distributed, Masterless Ring Architecture • Network Topology Aware • Flexible, Schemaless - your data structure can evolve seamlessly over time
  • 65. Cassandra Data Modeling Primary key = (partition keys, clustering keys) Fast queries = fetch single partition Range scans by clustering key Must model for query patterns Clustering 1 Clustering 2 Clustering 3 Partition 1 Partition 2 Partition 3
  • 66. City Bus Data Modeling Example Primary key = (Bus UUID, timestamp) Easy queries: location and speed of single bus for a range of time Can also query most recent location + speed of all buses (slower) 1020 s 1010 s 1000 s Bus A speed, GPS Bus B Bus C
  • 67. Using Cassandra for Short Term Storage Idea is store and read small values Idempotent writes + huge write capacity = ideal for streaming ingestion For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
  • 68. But Mommy! What about longer term data?
  • 69. I need to read lots of data, fast!! - Ad hoc analytics of events - More specialized / geospatial - Building ML models from large quantities of data - Storing scored/classified data from models - OLAP / Data Warehousing
  • 70. Can Cassandra Handle Batch? Cassandra tables are much better at lots of small reads than big data scans You CAN store data efficiently in C* Files seem easier for long term storage and analysis But are files compatible with streaming?
  • 72. Lambda is Hard and Expensive Very high TCO - Many moving parts - KV store, real time, batch Lots of monitoring, operations, headache Running similar code in two places Lower performance - lots of shuffling data, network hops, translating domain objects Reconcile queries against two different places
  • 73. NoLambda A unified system Real-time processing and reprocessing No ETLs Fault tolerance Everything is a stream
  • 74. Can Cassandra do batch and ad-hoc? Yes, it can be competitive with Hadoop actually…. If you know how to be creative with storing your data! Tuplejump/SnackFS - HDFS for Cassandra github.com/tuplejump/FiloDB - analytics database Store your data using Protobuf / Avro / etc.
  • 75. Introduction to FiloDB Efficient columnar storage - 5-10x better Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables Very fine grained filtering for sub-second concurrent queries Easy BI and ad-hoc analysis via Spark SQL/ Dataframes (JDBC etc.) Uses Cassandra for robust, proven storage
  • 76. Combining FiloDB + Cassandra Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards) FiloDB + C* + Spark for efficient long term event storage Ad hoc / SQL / BI Data source for MLLib / building models Data storage for classified / predicted / scored data
  • 77. Message Queue Events Spark Streaming Short term storage, K-V Adhoc, SQL, ML Cassandra FiloDB: Events, ad-hoc, batch Spark Dashboa rds, maps
  • 79. FiloDB + Cassandra Robust, peer to peer, proven storage platform Use for short term snapshots, dashboards Use for efficient long term event storage & ad hoc querying Use as a source to build detailed models