An API that gets out of your way
It’s so easy, we’ve embedded a bunch of examples right
here. Copy some of these requests into your terminal and
check out what happens.
With wrappers in Ruby, PHP, Python and more, you can
get started in minutes. Learn More ➤
As complexity grew…
Then we had a ProblemFactory
Started out with
We had a problem, so we thought to use …
As data volume grew…
Database scalability is a complicated topic…
Started out with
Had to make sure it was web scale
Distributed transactions
Change Data Capture
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture
Flink Forward - San Francisco 2022
Jeff Chao
Staff Engineer / Tech Lead for Change Data Capture Infrastructure at Stripe
Agenda
1 CDC at Stripe
Change Data Capture (CDC) is widely used at Stripe to capture data changes from databases without critically impacting database reliability and scalability. CDC powers many critical financial use cases at Stripe such as the Stripe Dashboard, Stripe Search, Sigma, and Financial Reporting.
2 Aggregating Change Events
Change Event Streams are ubiquitous across Stripe given the vast number of applications and employees generating datasets worldwide. Change Event Streams are independent from one another, which leads to the typical challenges in distributed systems. One of the major use cases revolves around aggregating individual change events of a database transaction to support Stripe’s payments infrastructure.
3 How it Started, How it Ended
From idea to production: things may seem straightforward at first, but the details matter. We detail our journey of how we leveraged Flink for Change Data Capture at Stripe in order to uphold the highest data quality standards. Freshness, Coverage, and Correctness SLOs are paramount to the success of platforms and applications running on top of our CDC infrastructure.
Billing
Capital
Checkout
Connect
Invoicing
Corporate Card
Climate
Atlas
Radar
Sigma
Payouts
Payments
Terminal
Treasury
Issuing
Revenue Recognition
Payment Links
Tax
Identity
Elements
Data Pipeline
Financial Connections
CDC at Stripe
30% Remote
23 Countries
> 8000 Employees
Strict SLOs
Freshness
Coverage
Correctness
Building a Platform
Abstract Away Internals: Abstract away database internals such as sharding topology, and ensure a datastore-agnostic transport.
Interoperable: Build a high-leverage platform that makes Change Events interoperable with other systems within the organization.
Operational Excellence: Keep toil minimal as we scale the number of datasets, ensure clean separation between infrastructure and user issues, create great operator experiences, reduce control plane and data plane blast radius, and maintain good operator tooling, developer experience, and processes.
Aggregating Change Events
Why?
Product teams working with payments data use transactions
Arbitrary number of tables in a database transaction
They should be able to get transactions back out from the CDC path
They shouldn’t have to become stream processing experts
Architecture
[Diagram: Vitess, Mongo, Debezium, Kafka, Flink, Kafka; components are labeled Platform or User]
What is a Change Event?
{
  "ts_utc": 1659375300000,
  "attributes": { ... },
  "data": [
    {
      "operation": "CREATE",
      "source": { ... },
      "transaction": { ... },
      "key": "some-unique-constraint",
      "before": null,
      "after": { ... },
      "attributes": { ... }
    }
  ]
}
Stream: charges
// ChangeEvent#transaction
{
  "id": "transaction-id",
  "global_position": 1,
  "source_position": 1
}
Change Events Can Come From Anywhere
Stream: charges
{
  "data": [
    {"source": { ... }}
  ]
}
Stream: audits
{
  "data": [
    {"source": { ... }}
  ]
}
Stream: disputes
{
  "data": [
    {"source": { ... }}
  ]
}
Databases Have Transactions
BEGIN
INSERT INTO charges
UPDATE audits ...
COMMIT
What is a Transaction Metadata Event?
// BEGIN Marker
{
  "id": "transaction-id",
  "ts_utc": 1659375300000,
  "marker": "BEGIN",
  "total_events": null,
  "per_source_event_counts": null
}
// COMMIT Marker
{
  "id": "transaction-id",
  "ts_utc": 1659375300000,
  "marker": "COMMIT",
  "total_events": 3,
  "per_source_event_counts": [{ ... }]
}
// per_source_event_counts
[
  {
    "source": "keyspace.table1",
    "total_events": 1
  },
  {
    "source": "keyspace.table2",
    "total_events": 1
  }
]
Putting It All Together
-- 4 events
BEGIN                -- Transaction Metadata Event
INSERT INTO charges  -- Change Event
UPDATE audits ...    -- Change Event
COMMIT               -- Transaction Metadata Event
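The event accounting for one transaction can be sketched in plain Scala. This is a hypothetical model, not Stripe's code: the names `TxnAccountingSketch`, `SourceCount`, `CommitMarker`, `isConsistent`, and `eventsOnTheWire` are made up for illustration. It assumes that `total_events` counts only Change Events (not the BEGIN and COMMIT markers) and equals the sum of `per_source_event_counts`; the slides do not state either point.

```scala
object TxnAccountingSketch {
  // Hypothetical mirror of one entry in per_source_event_counts.
  final case class SourceCount(source: String, totalEvents: Int)

  // Hypothetical mirror of a COMMIT Transaction Metadata Event.
  final case class CommitMarker(id: String, totalEvents: Int, perSource: List[SourceCount])

  // Assumption: total_events is the sum of the per-source counts.
  def isConsistent(commit: CommitMarker): Boolean =
    commit.perSource.map(_.totalEvents).sum == commit.totalEvents

  // Events emitted for one transaction: BEGIN + Change Events + COMMIT.
  def eventsOnTheWire(commit: CommitMarker): Int = commit.totalEvents + 2

  // One INSERT and one UPDATE, as in the charges/audits transaction.
  val example: CommitMarker = CommitMarker(
    id = "transaction-id",
    totalEvents = 2,
    perSource = List(SourceCount("keyspace.table1", 1), SourceCount("keyspace.table2", 1))
  )
}
```

Under these assumptions, `TxnAccountingSketch.eventsOnTheWire(TxnAccountingSketch.example)` gives the four events counted on the slide.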
What is an Aggregated Change Event?
{
  "ts_utc": 1659375300000,
  "data": [
    {
      "operation": "CREATE",
      "transaction": { "id": "txn1" },
      "before": null,
      "after": { ... }
    },
    {
      "operation": "UPDATE",
      "transaction": { "id": "txn1" },
      "before": { ... },
      "after": { ... }
    }
  ]
}
● One transaction with two events having the same transaction ID.
● Events may arrive from an arbitrary number of tables.
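As a rough illustration of what the aggregate represents, the sketch below groups Change Events by transaction ID in plain Scala. It is a batch model under stated assumptions, not the streaming job: the types `ChangeEvent` and `AggregatedChangeEvent` are trimmed-down, hypothetical stand-ins, and ordering by `globalPosition` is an assumption based on the `global_position` field.

```scala
object AggregationSketch {
  // Hypothetical, trimmed-down Change Event: only the fields used here.
  final case class ChangeEvent(txnId: String, globalPosition: Int, source: String, operation: String)

  // Hypothetical shape of an Aggregated Change Event.
  final case class AggregatedChangeEvent(txnId: String, data: List[ChangeEvent])

  // Group events by transaction ID and order each group by global position.
  def aggregate(events: List[ChangeEvent]): Map[String, AggregatedChangeEvent] =
    events.groupBy(_.txnId).map { case (txnId, group) =>
      txnId -> AggregatedChangeEvent(txnId, group.sortBy(_.globalPosition))
    }

  // Events from three tables, arriving out of order.
  val input: List[ChangeEvent] = List(
    ChangeEvent("txn1", 2, "audits", "UPDATE"),
    ChangeEvent("txn2", 1, "disputes", "CREATE"),
    ChangeEvent("txn1", 1, "charges", "CREATE")
  )

  val result: Map[String, AggregatedChangeEvent] = aggregate(input)
}
```

A batch `groupBy` sees all events at once. The streaming job cannot, which is why the rest of the talk is about deciding when a group is complete.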
Flink Job Graph
[Diagram: Transaction Metadata Event Stream (one) and Change Event Streams (many; one per table) feed a Flat map, then a Windowed Aggregation, which emits the Aggregated Change Event Stream and a Side Output]
Multiple Sources
Join
Union
Connect
Join
Joins elements of the same key within the same window.
● Produces pairwise elements
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time]
Output: (Event 1, BEGIN), (Event 1, COMMIT), (Event 2, BEGIN), (Event 2, COMMIT), (Event 3, BEGIN), (Event 3, COMMIT)
Union
Unions multiple streams of the same type into a single stream.
● Requires streams of the same type
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time]
(No output; won’t compile because streams are of different types)
Connect
Unions multiple streams, potentially of different types.
● Similar to Unions
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time; the output is a single stream carrying both Change Events and BEGIN/COMMIT markers, unpaired]
What Do We Need?
Support for streams of different types
Support for flexible stream combination semantics
Don’t need pairwise outputs
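The difference between pairwise join output and a connected stream can be modeled on plain lists. This is only a sketch of the semantics, not Flink code: `joinPairs` and `connect` are hypothetical helpers, a single key and window are assumed, and arrival-time interleaving is ignored.

```scala
object CombineSketch {
  sealed trait Marker
  case object Begin extends Marker
  case object Commit extends Marker

  final case class ChangeEvent(name: String)

  // Join-style semantics: every change event is paired with every marker
  // that shares its key and window (modeled here as one key, one window).
  def joinPairs(events: List[ChangeEvent], markers: List[Marker]): List[(ChangeEvent, Marker)] =
    for (e <- events; m <- markers) yield (e, m)

  // Connect-style semantics: both inputs end up in one stream of mixed type,
  // with no pairing. Either stands in for "one of two types".
  def connect(events: List[ChangeEvent], markers: List[Marker]): List[Either[ChangeEvent, Marker]] = {
    val left: List[Either[ChangeEvent, Marker]] = events.map(e => Left(e))
    val right: List[Either[ChangeEvent, Marker]] = markers.map(m => Right(m))
    left ++ right
  }

  val events: List[ChangeEvent] =
    List(ChangeEvent("Event 1"), ChangeEvent("Event 2"), ChangeEvent("Event 3"))
  val markers: List[Marker] = List(Begin, Commit)
}
```

With three events and two markers, the join model produces six pairs while the connect model produces five unpaired elements, which matches the "don't need pairwise outputs" requirement.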
Flink Job Definition
val mainStream =
  transactionMetadataEventStream // uid and name omitted.
    .connect(changeEventStream)  // Union different types.
Connected Streams
Either
Custom
Either
Wraps an event containing one of two types, either from left or right stream.
● Out-of-box
● No concept of keys
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time; each output element has exactly one side set]
Either.left = <event>
Either.right = null
Either.left = null
Either.right = <event>
…
Custom
Wraps an event containing one of two types, either from left or right stream, and a common key among both events.
● Small and simple code addition
● Need to extract keys
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time; each output element has a key and exactly one side set]
WrappedEvent.key = txn-1
WrappedEvent.left = <event>
WrappedEvent.right = null
WrappedEvent.key = txn-1
WrappedEvent.left = null
WrappedEvent.right = <event>
…
What Do We Need?
Wrap elements of a connected stream
Be able to identify keys to support aggregations later
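A wrapper of this kind might look like the following plain-Scala sketch. The field layout is an assumption: the slides only show `key`, `left`, and `right`. Here `left` holds the Transaction Metadata Event because that stream is the first input to `connect`, and `Option` replaces the `null` shown on the slide. `wrapMetadata`, `wrapChange`, and the trimmed-down event types are hypothetical.

```scala
object WrappedEventSketch {
  // Hypothetical, trimmed-down inputs; only the transaction ID matters here.
  final case class ChangeEvent(txnId: String, table: String)
  final case class TransactionMetadataEvent(id: String, marker: String)

  // Like Either, plus a key shared by both sides. Exactly one side is set.
  final case class WrappedEvent(
      key: String,
      left: Option[TransactionMetadataEvent],
      right: Option[ChangeEvent]
  )

  // Key extraction for each input type.
  def wrapMetadata(m: TransactionMetadataEvent): WrappedEvent =
    WrappedEvent(m.id, Some(m), None)

  def wrapChange(c: ChangeEvent): WrappedEvent =
    WrappedEvent(c.txnId, None, Some(c))

  val wrapped: List[WrappedEvent] = List(
    wrapMetadata(TransactionMetadataEvent("txn-1", "BEGIN")),
    wrapChange(ChangeEvent("txn-1", "charges")),
    wrapChange(ChangeEvent("txn-2", "disputes")),
    wrapMetadata(TransactionMetadataEvent("txn-1", "COMMIT"))
  )

  // Stand-in for keyBy(_.key): events of one transaction end up together.
  val byKey: Map[String, List[WrappedEvent]] = wrapped.groupBy(_.key)
}
```

Because both event types now expose the same `key`, a single keyed operator can see the markers and the Change Events of one transaction side by side.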
Flink Job Definition
val mainStream =
  transactionMetadataEventStream       // uid and name omitted.
    .connect(changeEventStream)        // Union different types.
    .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
    .keyBy(_.key)                      // Group events with the same transaction ID.
Aggregation Characteristics
Arbitrary number of Change Event Streams
One Transaction Metadata Event Stream
Change Events must have the same transaction IDs
Handle late arriving or duplicate Change Events and Transaction Metadata Events
Don’t result in infinite state growth
Windowing
Tumbling
Sliding
Session
Tumbling Windows
Assigns elements to windows of a fixed size.
● Windows don’t overlap
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time]
● Late-arriving events? Add delay.
● Large delay? Trade-off: Freshness vs Correctness.
● Not quite right…
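Why fixed-size windows are "not quite right" can be shown with a small plain-Scala model. This is a sketch with hypothetical names (`Event`, `windowStart`, `windowsPerTxn`) and made-up, non-negative timestamps. It only demonstrates that a transaction whose events straddle a window boundary is split across two windows.

```scala
object TumblingSketch {
  // Hypothetical event: a transaction ID and an event-time timestamp in ms.
  final case class Event(txnId: String, tsMs: Long)

  // Start of the fixed-size window that a (non-negative) timestamp falls into.
  def windowStart(tsMs: Long, sizeMs: Long): Long = tsMs - (tsMs % sizeMs)

  // Window starts touched by each transaction.
  def windowsPerTxn(events: List[Event], sizeMs: Long): Map[String, Set[Long]] =
    events.groupBy(_.txnId).map { case (txnId, es) =>
      txnId -> es.map(e => windowStart(e.tsMs, sizeMs)).toSet
    }

  // txn-1 straddles the boundary at 10000 ms; txn-2 does not.
  val events: List[Event] = List(
    Event("txn-1", 9800L), Event("txn-1", 10100L),
    Event("txn-2", 10200L), Event("txn-2", 10900L)
  )

  val split: Map[String, Set[Long]] = windowsPerTxn(events, sizeMs = 10000L)
}
```

Here `txn-1` lands in two windows, so neither window holds the complete transaction. Making the window larger only moves the boundary.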
Sliding Windows
Assigns elements to windows of a fixed size, but with a slide interval.
● Almost like a tumbling window, but with windows overlapping
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time]
● Late-arriving events? Same as tumbling windows.
● Slide interval? Explosion of windows
● Not quite right…
Session Windows
Assigns elements that are seen relatively close to each other.
● Arbitrarily-sized windows; no fixed start and end
● Windows don’t overlap
● Windows close based on a defined gap of inactivity
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time]
● Session gap too small? Incomplete aggregates
● Session gap too big? Trade-off: Freshness vs Correctness
● Not quite right…
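The session-gap trade-off can be illustrated with a small plain-Scala model. This is a sketch, not Flink's session window assigner: `sessions` is a hypothetical helper that splits timestamps wherever the gap to the previous one exceeds `gapMs`, and the timestamps are made up.

```scala
object SessionSketch {
  // Split timestamps into sessions: a new session starts whenever the gap
  // to the previous timestamp exceeds gapMs.
  def sessions(timestampsMs: List[Long], gapMs: Long): List[List[Long]] = {
    // Sessions are built newest-first, then put back in chronological order.
    val newestFirst = timestampsMs.sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, ts) if ts - current.head <= gapMs =>
        (ts :: current) :: done
      case (acc, ts) =>
        List(ts) :: acc
    }
    newestFirst.map(_.reverse).reverse
  }

  // Events of a single transaction; the last one arrives 700 ms after the rest.
  val txnTimestamps: List[Long] = List(1000L, 1100L, 1200L, 1900L)

  val smallGap: List[List[Long]] = sessions(txnTimestamps, gapMs = 500L)
  val largeGap: List[List[Long]] = sessions(txnTimestamps, gapMs = 1000L)
}
```

With a 500 ms gap the transaction is split into two sessions (an incomplete aggregate). With a 1000 ms gap it stays whole, but every window then waits at least that long before closing, which costs freshness.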
Global Windows
Assigns elements to a single window.
● Only a single window per key
● Window never closes
[Timeline diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) over time]
● Outputs never get evaluated and materialized
● Needs more…
Global Windows + Custom Stateful Trigger
Assign elements to a Global Window and add a custom stateful trigger.
● Flexibly define open/close conditions for non-overlapping windows
● Reasonably handle late-arriving events
● Avoid infinite state growth and reduce likelihood of incomplete aggregates
What Makes an Aggregation Complete?
BEGIN transaction marker seen
COMMIT transaction marker seen
All Change Events of the transaction seen
All Change Events are globally and locally ordered
Custom Stateful Trigger: TransactionBoundaryTrigger
if transaction metadata event:
    if begin transaction marker:
        update begin marker state
    else:
        update commit marker state
        update bitmap state using commit marker’s total event count
    set timeout state and register event time timer
else:
    update bitmap state with change event’s global position
    set timeout state and register event time timer
if should trigger(begin, commit, total events):
    clear window
    TriggerResult.FIRE_AND_PURGE
else:
    TriggerResult.CONTINUE
Reference
// ChangeEvent#transaction
{
  "id": "transaction-id",
  "global_position": 1,
  "source_position": 1
}
// TransactionMetadataEvent
{
  "id": "transaction-id",
  "ts_utc": 1659375300000,
  "marker": "COMMIT",
  "total_events": 3,
  "per_source_event_counts": [{ ... }]
}
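The trigger's completeness check can be modeled in plain Scala without Flink. This is a sketch under stated assumptions, not the actual `TransactionBoundaryTrigger`: the state layout, `onElement`, and `run` are hypothetical, a `Set[Int]` stands in for the bitmap, global positions are assumed to run from 1 to `total_events`, and the timeout timer is left out.

```scala
object TriggerSketch {
  sealed trait TriggerResult
  case object Continue extends TriggerResult
  case object FireAndPurge extends TriggerResult

  sealed trait Wrapped
  case object BeginMarker extends Wrapped
  final case class CommitMarker(totalEvents: Int) extends Wrapped
  final case class Change(globalPosition: Int) extends Wrapped

  // Per-transaction state kept by the trigger (hypothetical layout).
  final case class State(
      beginSeen: Boolean = false,
      expected: Option[Int] = None, // known once COMMIT arrives
      seen: Set[Int] = Set.empty    // stands in for the bitmap
  ) {
    // Complete when both markers and every expected position were seen.
    def complete: Boolean =
      beginSeen && expected.exists(n => (1 to n).forall(seen.contains))
  }

  // One trigger step: update state, then decide whether to fire.
  def onElement(state: State, element: Wrapped): (State, TriggerResult) = {
    val next = element match {
      case BeginMarker     => state.copy(beginSeen = true)
      case CommitMarker(n) => state.copy(expected = Some(n))
      case Change(pos)     => state.copy(seen = state.seen + pos)
    }
    if (next.complete) (State(), FireAndPurge) else (next, Continue)
  }

  // Feed a sequence of elements and collect the trigger decisions.
  def run(elements: List[Wrapped]): List[TriggerResult] = {
    val (_, results) = elements.foldLeft((State(), List.empty[TriggerResult])) {
      case ((state, acc), element) =>
        val (next, result) = onElement(state, element)
        (next, acc :+ result)
    }
    results
  }
}
```

Tracking positions in a set makes duplicates harmless and lets events arrive in any order. The window fires only once BEGIN, COMMIT, and every expected position have been seen.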
Flink Job Definition
val mainStream =
  transactionMetadataEventStream                  // uid and name omitted.
    .connect(changeEventStream)                   // Union different types.
    .flatMap(new WrappedEventFunction)            // Like Either type, but with extra fields.
    .keyBy(_.key)                                 // Group events with the same transaction ID.
    .window(GlobalWindows.create)
    .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
    .process(new KeyedProcessor(...))
Flink Job Definition
val mainStream =
  transactionMetadataEventStream                  // uid and name omitted.
    .connect(changeEventStream)                   // Union different types.
    .flatMap(new WrappedEventFunction)            // Like Either type, but with extra fields.
    .keyBy(_.key)                                 // Group events with the same transaction ID.
    .window(GlobalWindows.create)
    .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
    .process(new KeyedProcessor(...))
mainStream                                        // Side output to DLQ.
  .getSideOutput(...)
  .addSink(...)
mainStream                                        // Output aggregated change events.
  .addSink(...)
How it Started, How it Ended
From Idea to Production
State
Platform
Coverage
State
Observations
Infinite keys due to continuous stream of new transactions
Using a Global Window; possible windows not closing properly
No trigger timeouts firing
No watermarks being generated
[Diagram: Source Sub Tasks for charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events; some sub tasks are marked Idle]
Fix
Fixed an upstream issue where transaction IDs were getting mixed up
Reduce parallelism on Source Sub Tasks for all streams
Make sure parallelism ≤ ∑ Topic Partitions
Generally, check with SplitEnumerator classes
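The link between over-provisioned parallelism and missing watermarks can be sketched in plain Scala. This is a simplified model with hypothetical helpers (`assign`, `idleSubTasks`, `combinedWatermark`): round-robin assignment is an assumption (real split enumerators may assign differently), and a sub task without data is modeled as reporting no watermark.

```scala
object ParallelismSketch {
  // Round-robin assignment of partitions to source sub tasks (an assumption;
  // real enumerators may assign splits differently).
  def assign(partitions: Int, parallelism: Int): Map[Int, List[Int]] =
    (0 until parallelism).map { subTask =>
      subTask -> (0 until partitions).filter(_ % parallelism == subTask).toList
    }.toMap

  // Sub tasks that received no partition never see data.
  def idleSubTasks(partitions: Int, parallelism: Int): List[Int] =
    assign(partitions, parallelism).collect { case (subTask, Nil) => subTask }.toList.sorted

  // A sub task with no data reports no watermark (None). Downstream event time
  // is the minimum over all inputs, so one silent input holds everything back.
  def combinedWatermark(perSubTask: List[Option[Long]]): Option[Long] =
    if (perSubTask.isEmpty || perSubTask.exists(_.isEmpty)) None
    else Some(perSubTask.flatten.min)
}
```

For a topic with 2 partitions read at parallelism 4, sub tasks 2 and 3 stay idle. In this model the combined watermark never advances, so event time timers registered by the trigger never fire.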
Observations
State size still growing, but slower
Event time timers firing, sometimes
Watermarks are being generated, but not for all sub tasks
New Observations
[Diagram: Source Sub Tasks for charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events; one source is marked as a low volume stream]
Possible Fix
Switch from event time to processing time
Less precise
Could cause premature trigger firing, resulting in incomplete aggregates
Actual Fix
Add idleness property on sources
Can still use event time
More precise
Not perfect; can still result in incomplete aggregates in edge cases
That’s the reality of streaming
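The effect of an idleness setting can be sketched as a plain-Scala model. In Flink this kind of behavior is configured on the watermark strategy (for example `WatermarkStrategy#withIdleness`); the slides do not show the exact configuration used. The `Source` type, `combinedWatermark`, and all timestamps below are hypothetical.

```scala
object IdlenessSketch {
  // Hypothetical view of one source sub task.
  final case class Source(name: String, watermarkMs: Long, lastEventAtMs: Long)

  // Combined watermark: minimum over sources that are not idle. A source is
  // treated as idle when it has been silent for longer than idleTimeoutMs.
  def combinedWatermark(sources: List[Source], nowMs: Long, idleTimeoutMs: Long): Option[Long] = {
    val active = sources.filter(s => nowMs - s.lastEventAtMs <= idleTimeoutMs)
    if (active.isEmpty) None else Some(active.map(_.watermarkMs).min)
  }

  val sources: List[Source] = List(
    Source("charges", watermarkMs = 9000L, lastEventAtMs = 9500L),
    Source("audits", watermarkMs = 8800L, lastEventAtMs = 9400L),
    Source("disputes", watermarkMs = 1000L, lastEventAtMs = 1200L) // low volume
  )

  // An effectively infinite timeout models "no idleness configured".
  val withoutIdleness: Option[Long] =
    combinedWatermark(sources, nowMs = 10000L, idleTimeoutMs = Long.MaxValue)
  val withIdleness: Option[Long] =
    combinedWatermark(sources, nowMs = 10000L, idleTimeoutMs = 5000L)
}
```

Without idleness the low volume stream pins the watermark at 1000. With a 5000 ms timeout it is excluded and the watermark advances to 8800. If that stream later emits an older event, the event is late, which is one way the edge-case incomplete aggregates can arise.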
Platform
Observations
Don’t want to redeploy every time a new dataset (Kafka Topic) is added
Blows away Freshness SLO’s error budget
Poor developer onboarding experience
Fix
Instead of Kafka Topic List Subscriber, use Regex Subscriber
Subscribe to all topics (for a keyspace) by default
Control plane (external) service produces an event to Broadcast Stream
On broadcast element, use Broadcast State to keep onboarded datasets in state
On element, check Broadcast State and filter for onboarded datasets
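The onboarding flow above can be modeled in plain Scala as a fold over interleaved control and data inputs. This is a sketch of the idea, not the Flink broadcast API: `Onboard`, `Record`, `process`, and the `keyspace\..*` topic pattern are hypothetical, and a `Set[String]` stands in for Broadcast State.

```scala
object OnboardingSketch {
  sealed trait Input
  // Control-plane message: a dataset (topic) has been onboarded.
  final case class Onboard(topic: String) extends Input
  // Data-plane record read from a topic matched by the regex subscription.
  final case class Record(topic: String, payload: String) extends Input

  // Hypothetical topic naming scheme; the real pattern is not given in the slides.
  val subscription = "keyspace\\..*".r

  // Walk the interleaved inputs: control messages update the onboarded set
  // (the stand-in for Broadcast State), records pass only if onboarded.
  def process(inputs: List[Input]): List[Record] = {
    val start = (Set.empty[String], List.empty[Record])
    val (_, out) = inputs.foldLeft(start) {
      case ((onboarded, emitted), Onboard(topic)) =>
        (onboarded + topic, emitted)
      case ((onboarded, emitted), r: Record) =>
        val subscribed = subscription.pattern.matcher(r.topic).matches()
        if (subscribed && onboarded.contains(r.topic)) (onboarded, emitted :+ r)
        else (onboarded, emitted)
    }
    out
  }
}
```

Records that arrive before their topic is onboarded are dropped in this model. Onboarding a new dataset only requires a control message, not a redeploy.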
Coverage
Observations
Incomplete aggregates still happening, but not frequently
Kafka by default is at-least-once delivery
Many independent streams operating at different speeds
Fix
Move incomplete aggregate measurement out of the Flink Job and into a system downstream
New system needs to dedupe events… for all time?
Storage will be expensive. Trade-off between confidence and cost-efficiency: KV store or bloom filter
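The bloom filter side of that trade-off can be illustrated with a minimal plain-Scala sketch. It is not the downstream system from the talk: `Bloom`, `dedupe`, and the sizes are hypothetical, a `Set[Int]` stands in for the bit array, and the hashing uses `scala.util.hashing.MurmurHash3` with different seeds.

```scala
import scala.util.hashing.MurmurHash3

object DedupeSketch {
  // Minimal Bloom filter: k seeded hashes over m bit positions.
  final case class Bloom(m: Int, k: Int, bits: Set[Int] = Set.empty) {
    private def indexes(key: String): Seq[Int] =
      (0 until k).map(seed => Math.floorMod(MurmurHash3.stringHash(key, seed), m))

    def add(key: String): Bloom = copy(bits = bits ++ indexes(key))

    // false means "definitely new"; true means "probably seen before".
    def mightContain(key: String): Boolean = indexes(key).forall(bits.contains)
  }

  // Drop events whose key the filter has probably seen already.
  def dedupe(keys: List[String], filter: Bloom): List[String] = {
    val (_, kept) = keys.foldLeft((filter, List.empty[String])) {
      case ((f, out), key) =>
        if (f.mightContain(key)) (f, out) else (f.add(key), out :+ key)
    }
    kept
  }
}
```

A filter like this needs a fixed amount of memory regardless of how many keys it has seen, but a false positive drops a genuinely new event. An exact KV store avoids that at a higher storage cost, which is the confidence versus cost-efficiency trade-off named on the slide.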
Wrap Up
Aggregating Change Events is relatively straightforward, but the details matter
Change Data Capture (CDC) is widely used at Stripe to improve database reliability and scalability
Flink is a critical component in Stripe’s CDC infrastructure that allows us to work with financial streaming data with high data quality guarantees
Thank you!
103
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture

More Related Content

What's hot

Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
Araf Karsh Hamid
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022
Kai Wähner
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture

  • 2. An API that gets out of your way. It’s so easy, we’ve embedded a bunch of examples right here. Copy some of these requests into your terminal and check out what happens. With wrappers in Ruby, PHP, Python and more, you can get started in minutes.
  • 3. We had a problem, so we thought to use … (Started out with; As complexity grew…; Then we had a ProblemFactory)
  • 4. Database scalability is a complicated topic… (Started out with; As data volume grew…; Had to make sure it was web scale; Distributed transactions; Change Data Capture)
  • 6. Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture. Flink Forward - San Francisco 2022. Jeff Chao, Staff Engineer / Tech Lead for Change Data Capture Infrastructure at Stripe.
  • 7. Agenda: 1. CDC at Stripe; 2. Aggregating Change Events; 3. How it Started, How it Ended. Change Data Capture (CDC) is widely used at Stripe to capture data changes from databases without critically impacting database reliability and scalability. CDC powers many critical financial use cases at Stripe such as the Stripe Dashboard, Stripe Search, Sigma, and Financial Reporting. From idea to production, things may seem straightforward at first, but the details matter. We detail our journey of how we leveraged Flink for Change Data Capture at Stripe in order to uphold the highest data quality standards. Freshness, Coverage, and Correctness SLOs are paramount to the success of platforms and applications running on top of our CDC infrastructure. Change Event Streams are ubiquitous across Stripe given the vast number of applications and employees generating datasets worldwide. Change Event Streams are independent from one another, which leads to the typical challenges in distributed systems. One of the major use cases revolves around aggregating individual change events of a database transaction to support Stripe’s payments infrastructure.
  • 8. Agenda: 1. CDC at Stripe; 2. Aggregating Change Events; 3. How it Started, How it Ended.
  • 15. CDC at Stripe: Building a Platform. Abstract Away Internals: abstract away database internals such as sharding topology and ensure a datastore-agnostic transport. Interoperable: build a high-leverage platform that makes working with Change Events interoperable with other systems within the organization. Operational Excellence: minimal toil as we scale the number of datasets; clean separation between infrastructure and user issues; great operator experiences; reduced control plane and data plane blast radius; good operator tooling, developer experience, and processes.
  • 16. Agenda: 1. CDC at Stripe; 2. Aggregating Change Events; 3. How it Started, How it Ended.
  • 17. Aggregating Change Events: Why? Product teams working with payments data use transactions. A database transaction can touch an arbitrary number of tables. Teams should be able to get transactions back out from the CDC path. They shouldn’t have to become stream processing experts.
  • 19-26. What is a Change Event? Example from stream "charges": { "ts_utc": 1659375300000, "attributes": { ... }, "data": [ { "operation": "CREATE", "source": { ... }, "transaction": { ... }, "key": "some-unique-constraint", "before": null, "after": { ... }, "attributes": { ... } } ] }. The "transaction" field expands to { "id": "transaction-id", "global_position": 1, "source_position": 1 }.
  • 27. Change Events Can Come From Anywhere. Streams charges, audits, and disputes each carry events of the form { "data": [ { "source": { ... } } ] }.
  • 28. Databases Have Transactions: BEGIN; INSERT INTO charges; UPDATE audits ...; COMMIT
  • 29-31. What is a Transaction Metadata Event? BEGIN marker: { "id": "transaction-id", "ts_utc": 1659375300000, "marker": "BEGIN", "total_events": null, "per_source_event_counts": null }. COMMIT marker: { "id": "transaction-id", "ts_utc": 1659375300000, "marker": "COMMIT", "total_events": 3, "per_source_event_counts": [{ ... }] }. The "per_source_event_counts" field expands to [ { "source": "keyspace.table1", "total_events": 1 }, { "source": "keyspace.table2", "total_events": 1 } ].
  • 32. Putting It All Together (4 events): BEGIN -- Transaction Metadata Event; INSERT INTO charges -- Change Event; UPDATE audits ... -- Change Event; COMMIT -- Transaction Metadata Event
  • 33-34. What is an Aggregated Change Event? { "ts_utc": 1659375300000, "data": [ { "operation": "CREATE", "transaction": { "id": "txn1" }, "before": null, "after": { ... } }, { "operation": "UPDATE", "transaction": { "id": "txn1" }, "before": { ... }, "after": { ... } } ] } ● One transaction with two events having the same transaction ID. ● Events may arrive from an arbitrary number of tables.
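The aggregated shape above can be illustrated with a small in-memory sketch. This is not Stripe's implementation: the case classes, their fields, and the `aggregate` helper are hypothetical names chosen to mirror the slide's JSON, and the sketch ignores time, ordering guarantees, and state.

```scala
// Hypothetical, simplified model of the slide's JSON; all names are assumptions.
object AggregatedChangeEventSketch {
  final case class ChangeEvent(operation: String, transactionId: String, source: String)
  final case class AggregatedChangeEvent(transactionId: String, data: List[ChangeEvent])

  // Group change events from any number of tables by transaction ID.
  // Events keep their arrival order inside each aggregate.
  def aggregate(events: List[ChangeEvent]): List[AggregatedChangeEvent] =
    events
      .groupBy(_.transactionId)
      .toList
      .sortBy { case (id, _) => id }
      .map { case (id, evs) => AggregatedChangeEvent(id, evs) }
}
```

A batch grouping like this only works when all events are already present; the rest of the talk is about getting the same result on unbounded, independent streams.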
  • 35. Flink Job Graph: Transaction Metadata Event Stream (one) and Change Event Streams (many; one per table) → Flat map → Windowed Aggregation → Aggregated Change Event Stream, plus a Side Output.
  • 37. Join: joins elements of the same key within the same window. ● Produces pairwise elements. Diagram (over time): Change Events are Event 1, Event 2, Event 3; Transaction Metadata Events are BEGIN, COMMIT, BEGIN, COMMIT; output pairs are (Event 1, BEGIN), (Event 1, COMMIT), (Event 2, BEGIN), (Event 2, COMMIT), (Event 3, BEGIN), (Event 3, COMMIT).
  • 38. Union: unions multiple streams of the same type into a single stream. ● Requires streams of the same type. Diagram (over time): Change Events are Event 1, Event 2, Event 3; Transaction Metadata Events are BEGIN, COMMIT, BEGIN, COMMIT; no output, because it won’t compile when the streams are of different types.
  • 39. Connect: unions multiple streams, potentially of different types. ● Similar to Unions. Diagram (over time): Change Events are Event 1, Event 2, Event 3; Transaction Metadata Events are BEGIN, COMMIT, BEGIN, COMMIT; output is a single stream: Event 1, BEGIN, Event 2, COMMIT, Event 3, BEGIN, COMMIT.
  • 40. What Do We Need? Support for streams of different types; support for flexible stream combination semantics; no need for pairwise outputs.
  • 41. Flink Job Definition:
        val mainStream = transactionMetadataEventStream // uid and name omitted.
          .connect(changeEventStream) // Union different types.
  • 44. Either: wraps an event containing one of two types, either from the left or the right stream. ● Out-of-box ● No concept of keys. Diagram (over time): each input event becomes one wrapped output, with either Either.left set and Either.right = null, or Either.left = null and Either.right set.
  • 45. Custom: wraps an event containing one of two types, either from the left or the right stream, and a common key among both events. ● Small and simple code addition ● Need to extract keys. Diagram (over time): each wrapped output carries the key, e.g. WrappedEvent.key = txn-1 with WrappedEvent.left set and WrappedEvent.right = null, or WrappedEvent.key = txn-1 with WrappedEvent.left = null and WrappedEvent.right set.
  • 46. What Do We Need? Wrap elements of a connected stream; be able to identify keys to support aggregations later.
  • 47. Flink Job Definition:
        val mainStream = transactionMetadataEventStream // uid and name omitted.
          .connect(changeEventStream) // Union different types.
          .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
          .keyBy(_.key) // Group events with the same transaction ID.
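The wrap-then-key step can be sketched without Flink on in-memory lists. Everything below is an assumption for illustration: the event case classes are simplified, `Option` stands in for the null fields shown on the slide, and treating the Transaction Metadata Event as the left side simply follows the order of the `connect` call.

```scala
// Plain-Scala stand-in for connect + flatMap(WrappedEventFunction) + keyBy(_.key).
// Names and field choices are hypothetical; the deck does not show WrappedEventFunction.
object WrappedEventSketch {
  final case class TransactionMetadataEvent(id: String, marker: String)
  final case class ChangeEvent(transactionId: String, operation: String)

  // Like Either, but with a common key. Exactly one of left/right is defined.
  final case class WrappedEvent(
      key: String,
      left: Option[TransactionMetadataEvent],
      right: Option[ChangeEvent])

  def wrapLeft(e: TransactionMetadataEvent): WrappedEvent =
    WrappedEvent(e.id, Some(e), None)

  def wrapRight(e: ChangeEvent): WrappedEvent =
    WrappedEvent(e.transactionId, None, Some(e))

  // Wrap both inputs into one type, then group by the extracted transaction ID.
  def wrapAndKey(
      meta: List[TransactionMetadataEvent],
      changes: List[ChangeEvent]): Map[String, List[WrappedEvent]] =
    (meta.map(wrapLeft) ++ changes.map(wrapRight)).groupBy(_.key)
}
```

The point of the extra `key` field is that downstream operators can partition by transaction ID without inspecting which side of the wrapper is populated.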
  • 49. Aggregation Characteristics: arbitrary number of Change Event Streams; one Transaction Metadata Event Stream; Change Events must have the same transaction IDs; handle late-arriving or duplicate Change Events and Transaction Metadata Events; don’t result in infinite state growth.
  • 51-56. Tumbling Windows: assigns elements to windows of a fixed size. ● Windows don’t overlap ● Late-arriving events? Add delay. ● Large delay? Trade-off: Freshness vs Correctness. ● Not quite right… Diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) on a time axis.
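To see why fixed-size windows are "not quite right" here, it helps to compute which window each event of a transaction lands in. The sketch below is a simplified model (no window offset, non-negative timestamps); the function names are made up for this example.

```scala
// Simplified tumbling-window assignment; names are hypothetical.
object TumblingWindowSketch {
  // Start of the fixed-size window that contains the timestamp.
  def windowStart(tsMillis: Long, sizeMillis: Long): Long =
    tsMillis - (tsMillis % sizeMillis)

  // True when every event of one transaction falls into the same window.
  def sameWindow(timestamps: Seq[Long], sizeMillis: Long): Boolean =
    timestamps.map(windowStart(_, sizeMillis)).distinct.size <= 1
}
```

With a window size of 100, events at 95 and 105 belong to the same transaction but land in different windows, so neither window sees the complete transaction. Making the window larger or adding delay reduces how often this happens at the cost of freshness.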
  • 57-58. Sliding Windows: assigns elements to windows of a fixed size, but with a slide interval. ● Almost like a tumbling window, but with windows overlapping ● Late-arriving events? Same as tumbling windows. ● Slide interval? Explosion of windows ● Not quite right… Diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) on a time axis.
  • 59-66. Session Windows: assigns elements that are seen relatively close to each other. ● Arbitrarily-sized windows; no fixed start and end ● Windows don’t overlap ● Windows close based on a defined gap of inactivity ● Session gap too small? Incomplete aggregates ● Session gap too big? Trade-off: Freshness vs Correctness ● Not quite right… Diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) on a time axis.
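The session-gap trade-off can be made concrete with a small sketch that splits the timestamps seen for one key into sessions. It is a simplified model that assumes the timestamps are already sorted; `sessions` is a hypothetical helper, not a Flink API.

```scala
// Simplified session-window model for a single key; names are hypothetical.
object SessionWindowSketch {
  // A new session starts when the gap to the previous element is >= gapMillis.
  def sessions(sortedTs: List[Long], gapMillis: Long): List[List[Long]] =
    sortedTs
      .foldLeft(List.empty[List[Long]]) {
        case (current :: done, ts) if ts - current.head < gapMillis =>
          (ts :: current) :: done
        case (acc, ts) =>
          List(ts) :: acc
      }
      .map(_.reverse)
      .reverse
}
```

If the events of one transaction arrive at 0, 5, and 30 and the gap is 10, the last event opens a second session and both aggregates are incomplete. A gap of 60 keeps them together but delays every result by at least the gap.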
  • 67-68. Global Windows: assigns elements to a single window. ● Only a single window per key ● Window never closes ● Outputs never get evaluated and materialized ● Needs more… Diagram: Change Events (Event 1, Event 2, Event 3) and Transaction Metadata Events (BEGIN, COMMIT, BEGIN, COMMIT) on a time axis.
  • 69. Global Windows + Custom Stateful Trigger: assign elements to a Global Window and add a custom stateful trigger. ● Flexibly define open/close conditions for non-overlapping windows ● Reasonably handle late-arriving events ● Avoid infinite state growth and reduce likelihood of incomplete aggregates
  • 70. What Makes an Aggregation Complete? BEGIN transaction marker seen; COMMIT transaction marker seen; all Change Events of the transaction seen; all Change Events are globally and locally ordered.
  • 71. Custom Stateful Trigger: TransactionBoundaryTrigger
        if transaction metadata event:
          if begin transaction marker:
            update begin marker state
          else:
            update commit marker state
            update bitmap state using commit marker’s total event count
          set timeout state and register event time timer
        else:
          update bitmap state with change event’s global position
          set timeout state and register event time timer
        if should trigger(begin, commit, total events):
          clear window
          TriggerResult.FIRE_AND_PURGE
        else:
          TriggerResult.CONTINUE
    Reference, ChangeEvent#transaction: { "id": "transaction-id", "global_position": 1, "source_position": 1 }
    Reference, TransactionMetadataEvent: { "id": "transaction-id", "ts_utc": 1659375300000, "marker": "COMMIT", "total_events": 3, "per_source_event_counts": [{ ... }] }
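The trigger's completeness check can be modeled as a small pure state machine. This is a sketch under stated assumptions, not the actual TransactionBoundaryTrigger: it assumes `total_events` counts change events only, that `global_position` is 1-based within the transaction, and it uses a `Set[Int]` in place of the bitmap. Timeouts and event-time timers are left out entirely.

```scala
// Plain-Scala model of the per-transaction trigger state; all names are hypothetical.
object TransactionBoundarySketch {
  sealed trait Input
  // kind is "BEGIN" or "COMMIT"; totalEvents is only set on COMMIT.
  final case class Marker(kind: String, totalEvents: Option[Int]) extends Input
  // Assumed to be the 1-based position of the change event in its transaction.
  final case class Change(globalPosition: Int) extends Input

  sealed trait Result
  case object Continue extends Result
  case object FireAndPurge extends Result

  // A Set[Int] stands in for the bitmap state on the slide.
  final case class State(
      beginSeen: Boolean = false,
      expected: Option[Int] = None,
      positions: Set[Int] = Set.empty)

  // Complete when BEGIN and COMMIT were seen and every position 1..n is present.
  def isComplete(s: State): Boolean =
    s.beginSeen && s.expected.exists(n => s.positions == (1 to n).toSet)

  def onElement(s: State, in: Input): (State, Result) = {
    val next = in match {
      case Marker("BEGIN", _) => s.copy(beginSeen = true)
      case Marker(_, total)   => s.copy(expected = total) // treated as COMMIT
      case Change(pos)        => s.copy(positions = s.positions + pos)
    }
    // Firing also clears the state, mirroring "clear window" + FIRE_AND_PURGE.
    if (isComplete(next)) (State(), FireAndPurge) else (next, Continue)
  }
}
```

Because positions are stored as a set, a duplicate change event that arrives before completion does not change the outcome. A duplicate that arrives after the purge starts a fresh state that can never complete, which is the case the real trigger has to bound with a timeout.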
  • 72. Flink Job Definition:
        val mainStream = transactionMetadataEventStream // uid and name omitted.
          .connect(changeEventStream) // Union different types.
          .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
          .keyBy(_.key) // Group events with the same transaction ID.
          .window(GlobalWindows.create)
          .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
          .process(new KeyedProcessor(...))
  • 74. Flink Job Definition:
        val mainStream = transactionMetadataEventStream // uid and name omitted.
          .connect(changeEventStream) // Union different types.
          .flatMap(new WrappedEventFunction) // Like Either type, but with extra fields.
          .keyBy(_.key) // Group events with the same transaction ID.
          .window(GlobalWindows.create)
          .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
          .process(new KeyedProcessor(...))
        mainStream // Side output to DLQ.
          .getSideOutput(...)
          .addSink(...)
        mainStream // Output aggregated change events.
          .addSink(...)
  • 75. Agenda: 1. CDC at Stripe; 2. Aggregating Change Events; 3. How it Started, How it Ended.
  • 76. How it Started, How it Ended: From Idea to Production. Coverage, Platform, State.
  • 77. State
  • 79. How It Started How It Ended
  • 80. Observations: infinite keys due to continuous stream of new transactions; using a Global Window, possibly with windows not closing properly; no trigger timeouts firing; no watermarks being generated.
  • 81. Observations: Idle Sub Tasks. Diagram: Source Sub Tasks for charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events.
  • 82. Fix: fixed an upstream issue where transaction IDs were getting mixed up; reduce parallelism on Source Sub Tasks for all streams; make sure parallelism ≤ ∑ Topic Partitions; generally, check with SplitEnumerator classes.
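The parallelism rule above is simple arithmetic, sketched here with the partition counts from the preceding slide. The helper name is hypothetical, and the result is a lower bound: it assumes each partition is read by exactly one sub task and that partitions are spread as evenly as possible, which the actual split assignment may not guarantee.

```scala
// Lower bound on idle source sub tasks for a given parallelism; names are hypothetical.
object IdleSubtaskSketch {
  def minIdleSubtasks(parallelism: Int, partitionsPerTopic: Map[String, Int]): Int =
    math.max(0, parallelism - partitionsPerTopic.values.sum)
}
```

For charges (2), audits (1), and disputes (1), a source parallelism of 8 leaves at least 4 sub tasks with nothing to read, while a parallelism of 4 or less leaves none by this count. A sub task that never receives data also never advances its watermark, which matches the "no watermarks being generated" observation.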
  • 84. How It Started How It Ended
  • 85. Observations: state size still growing, but slower; event time timers firing, sometimes; watermarks are being generated, but not for all sub tasks.
  • 86. New Observations: low volume stream. Diagram: Source Sub Tasks for charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events.
  • 87. Possible Fix: switch from event time to processing time. Less precise; could cause premature trigger firing, resulting in incomplete aggregates.
  • 88. Actual Fix: add idleness property on sources. Can still use event time; more precise. Not perfect; can still result in incomplete aggregates in edge cases. That’s the reality of streaming.
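Why idleness helps can be shown with a minimal model of how a downstream watermark is derived from its inputs: it is the minimum over the inputs that are not marked idle. The types and function below are illustrative assumptions. In Flink, idleness is commonly configured through `WatermarkStrategy.withIdleness`, although the slides do not say which mechanism was used.

```scala
// Minimal model of watermark combination with idle inputs; names are hypothetical.
object WatermarkSketch {
  final case class SubtaskWatermark(watermark: Long, idle: Boolean)

  // Combined watermark = minimum over non-idle inputs.
  // None means no active input exists, so the watermark cannot advance.
  def combined(inputs: Seq[SubtaskWatermark]): Option[Long] = {
    val active = inputs.filterNot(_.idle).map(_.watermark)
    if (active.isEmpty) None else Some(active.min)
  }
}
```

A low-volume input that has emitted nothing holds the minimum at its initial value, so event time timers never fire. Once that input is marked idle it is excluded from the minimum and the remaining inputs drive the watermark forward. If the idle input later delivers an old event, that event can be late, which is one way incomplete aggregates still occur.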
  • 91. How It Started How It Ended
  • 92. Observations: don’t want to redeploy every time a new dataset (Kafka Topic) is added; blows away Freshness SLO’s error budget; poor developer onboarding experience.
  • 93. Fix: instead of Kafka Topic List Subscriber, use Regex Subscriber; subscribe to all topics (for a keyspace) by default; a control plane (external) service produces an event to a Broadcast Stream; on broadcast element, use Broadcast State to keep onboarded datasets in state; on element, check Broadcast State and filter for onboarded datasets.
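The broadcast-state filter can be sketched as two pure functions that mirror the two callbacks of a broadcast process function. The event types and function names are assumptions for illustration, and a plain `Set[String]` stands in for Flink's Broadcast State.

```scala
// Plain-Scala model of filtering by onboarded datasets; names are hypothetical.
object BroadcastFilterSketch {
  // Control-plane event announcing that a dataset is onboarded.
  final case class DatasetOnboarded(dataset: String)
  final case class ChangeEvent(dataset: String, key: String)

  // Analogue of processBroadcastElement: update the onboarded set.
  def onBroadcast(state: Set[String], ev: DatasetOnboarded): Set[String] =
    state + ev.dataset

  // Analogue of processElement: forward only events of onboarded datasets.
  def onElement(state: Set[String], ev: ChangeEvent): Option[ChangeEvent] =
    if (state.contains(ev.dataset)) Some(ev) else None
}
```

Because the job already subscribes to every topic that matches the pattern, onboarding a dataset becomes a state update instead of a redeploy.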
  • 96. How It Started How It Ended
  • 97. Observations: incomplete aggregates still happening, but not frequently; Kafka by default is at-least-once delivery; many independent streams operating at different speeds.
  • 98. Fix: move incomplete aggregate measurement out of the Flink Job and into a system downstream. New system needs to dedupe events… for all time? Storage will be expensive. Trade-off between confidence and cost-efficiency: KV store or bloom filter.
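The cost-efficient end of that trade-off can be illustrated with a toy Bloom filter used for deduplication. This is a generic sketch, not the downstream system described in the talk: the sizes, the hashing scheme, and the `firstSeen` helper are all assumptions. A Bloom filter never misses a key it has seen, but it can report a new key as a duplicate, which is the confidence cost compared with an exact KV store.

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter for probable-duplicate detection; parameters are illustrative.
object BloomDedupeSketch {
  final class BloomFilter(numBits: Int, numHashes: Int) {
    private val bits = new java.util.BitSet(numBits)

    // One bit index per hash function, derived by varying the seed.
    private def indexes(key: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        val h = MurmurHash3.stringHash(key, seed)
        ((h % numBits) + numBits) % numBits // keep the index non-negative
      }

    def mightContain(key: String): Boolean = indexes(key).forall(i => bits.get(i))
    def add(key: String): Unit = indexes(key).foreach(i => bits.set(i))
  }

  // True the first time a key is seen; false for keys that are probably duplicates.
  def firstSeen(filter: BloomFilter, key: String): Boolean =
    if (filter.mightContain(key)) false
    else { filter.add(key); true }
}
```

A key such as a transaction ID combined with a global position could serve as the dedupe key, though the deck does not specify what the downstream system keys on.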
  • 100. How It Started How It Ended
  • 101. Agenda: 1. CDC at Stripe; 2. Aggregating Change Events; 3. How it Started, How it Ended.
  • 102. Wrap Up: Change Data Capture (CDC) is widely used at Stripe to improve database reliability and scalability. Aggregating Change Events is relatively straightforward, but the details matter. Flink is a critical component in Stripe’s CDC infrastructure that allows us to work with financial streaming data with high data quality guarantees.
  • 103. Thank you!

Editor's Notes

  1. What is Stripe? Who is it for?
  2. At what scale? $640B annual in payment volume. Challenging…
  3. Many products, many apps and services, many datasets.
  4. Across many databases of different types. Mongo, MySQL. Multi-region, databases have many shards which are split as volume grows.
  5. Watermarks are per partition, not per key. Perhaps note an upstream issue; nonetheless, it could have surfaced by testing out late events.
  6. Watermark = minimum across parallel sub tasks
  7. Keys can go to the same partition, one key could be late, another could not. Watermark will progress. Timeout will fire - incomplete aggregate. Late key comes in and is treated as incomplete aggregate again.
  8. Connect with broadcast stream. processElement -> check broadcast state; processBroadcastElement -> update state
  9. Union or join. Streams are independent and any one stream can have a duplicate. A duplicate will result in an incomplete aggregate for that key, unless all streams have the same number of duplicates for that key, which is unlikely. Imagine an aggregate was just completed for a key. Then a duplicate arrives and the event sits in state until it times out.