Flink-powered stream processing platform at Pinterest
Rainie Li
Software engineer@Pinterest
Kanchi Masalia
Software engineer@Pinterest

Agenda
1. Introduction
2. Challenges & Use cases
3. Platform missions & Frameworks
4. Ongoing Work
5. Q&A

Confidential | © Pinterest
Streaming use cases on Xenon platform
[Chart labels: OKR promised, OKR delivered, ~2x, over ~3x scale]

Why Real Time Stream Processing
● Ads real-time spend and reporting - Calculate spend against budget limits in near real time
to quickly adjust budget pacing and update advertisers with more timely reporting results
● Fast User Signals - Make user content signals available quickly after content creation and use
these signals in ML pipelines for a personalized and fresh user experience
● Real-time Trust & Safety - Reduce levels of unsafe content as close to content creation time as possible
● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement
metrics to Creators so they can refine their content with minimal feedback delay
● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users
by updating product metadata in near real time
● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup,
verification, and evaluation

Existing Issues
● Fragmented technologies
○ Self-managed Kafka Streams jobs (Ads Infra)
○ Overwatch platform for small batch Spark jobs (Ads Data,
Measurement)
● Lack of developer support
● Availability & scalability issues

Who are we?
● We are a team of engineers, SREs, PM and EM that builds the
stateful stream data processing platform called Xenon at Pinterest.
● We support around 100 engineers who build and operate 100+ Flink
applications.
● We run (near) real-time applications at 300M messages per
second and process 150TB of data per second.
● We have enabled 10+ top-level company KRs in the past 3 years.

Xenon platform Mission
● Stability: reliably host all deployed Flink-based stream processing
applications
● Dev Velocity: quickly productionize new use cases / features to
meet business and product needs
● Cloud Efficiency: efficiently operate infrastructure and strive for best
practices

Xenon - Pinterest stream processing platform
[Architecture diagram]
Layers:
● The Developer APIs
● The Resource Management & Job Execution Layer
● The Deployment Stack
Components:
● Cluster Management (YARN)
● NRTG
● Common Libraries and Connectors
● Flink SQL
● Job State Management (Checkpoints, Backups, Restores, Edits)
● Security / Auth (PII/FGAC)
● Job Health & Diagnosis (Dr. Squirrel)
● CI/CD
● Hermez
● Job Management Service

PinStats Analytic Use case
“Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”

Creator Content Use cases
● Fast user signals: Make user content signals available quickly after content creation
● Safety: Reduce levels of unsafe content as close to content creation time as possible
[Diagram labels: Content Creation, Audience Targeting, Content Understanding, Quality, Interests & Annotations, Embeddings, Performance]

Ads real-time spend and reporting
● Calculate spend against budget limits in real time to quickly adjust budget and update advertisers with more timely results

Xenon platform Mission No. 1 - Stability
● Xenon Stability Strategy
● Job Deployment Framework - Hermez and Job Submission Service
● Job Management Service - runtime monitoring for Pinterest stateful streaming
applications, with automatic failover to a different AZ

Xenon Job Deployment Framework
[Diagram: Repo, Jenkins, Artifactory, S3, Hermez, Job Submission Service, YARN Clusters; flow steps numbered 1 to 8]

Xenon Jobs / Hermez workloads
154
Production Xenon use cases
>90
179
Deployments every day

Highlights
Stability and Tier 1 support
● Enhanced JSS State Machine
● Supported job level dedicated S3 buckets
User experience
● Hermez supported most recent checkpoint deployment
● Hermez supported kill job and distributed shell
● Enriched savepoint information on Hermez
● Track daily & monthly deployment success rate
Metrics
● Job submission latency

Xenon Job Management Service
Monitoring
● Job Status
● Critical metrics (QPS)
● Checkpointing health
● Job/task health
● Notify users
Auto Recovery
Auto recover failed jobs from:
● Last completed checkpoint
● Most recent savepoint
● Fresh State
AZ Failure Resilience
● Auto failover jobs to backup clusters in different AZs when primary cluster/AZ goes down
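
The recovery order and AZ failover listed on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the JMS implementation: `JobSnapshot`, `choose_restore_source`, and `choose_cluster` are hypothetical names, and trying the restore sources strictly in the listed order is an assumption read off the slide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass
class JobSnapshot:
    """Hypothetical view of the restore points known for a failed job."""
    last_completed_checkpoint: Optional[str] = None  # e.g. a checkpoint path
    most_recent_savepoint: Optional[str] = None      # e.g. a savepoint path


def choose_restore_source(job: JobSnapshot) -> Tuple[str, Optional[str]]:
    """Pick a restore point: checkpoint first, then savepoint, then fresh state."""
    if job.last_completed_checkpoint:
        return ("checkpoint", job.last_completed_checkpoint)
    if job.most_recent_savepoint:
        return ("savepoint", job.most_recent_savepoint)
    return ("fresh", None)


def choose_cluster(primary: str, backups: Sequence[str],
                   healthy: Set[str]) -> Optional[str]:
    """Stay on the primary cluster if healthy, else use the first healthy backup."""
    if primary in healthy:
        return primary
    for cluster in backups:
        if cluster in healthy:
            return cluster
    return None  # no healthy cluster available; a human would need to step in
```

For example, a job with no checkpoint but one savepoint would be restored from that savepoint, and a job whose primary AZ is down would be resubmitted to the first healthy backup cluster.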

JMS Architecture
[Diagram labels: user, Hermez, JSS, Xenon JMS, Monitoring, Auto Recovery, Deployment, Failover, Statsboard, ZK Clusters, Flink API, YARN Clusters in AZ-a, AZ-b, and AZ-c]

● >90 jobs under management
● 10X faster recovery time
● >7 jobs get recovered every week

Xenon platform Mission No. 2 - Developer Velocity
● Near Real Time Galaxy - Pinterest stateful streaming application job
development framework
● CICD - Pinterest stateful streaming application change rollout flow
● Dr. Squirrel - Pinterest self-served streaming application
troubleshooting portal
● Working model - New Use Case Onboarding Process

NRTG
Definition:
● Pinterest stateful streaming application job development framework
History:
● Galaxy: a high-level managed execution platform for producing and
consuming signals (e.g. entity features) about Pinterest entities (such
as pins, boards, users).
● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow
API used in batch and extends it to streaming applications.

VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill)
● User code focuses only on Business logic. ✅
● Tune Flink operators using configs. ✅
● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧
[Diagram labels: Xenon Flink Application, Code, Config]
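
The split between business-logic code and operator tuning via config might look something like the sketch below. It does not use the NRTG API, which this deck does not show: `JOB_CONFIG`, `build_plan`, `run_locally`, and the `enrich`/`score` steps are all hypothetical, chosen only to illustrate keeping tuning knobs out of user code.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Dict[str, object]
Step = Tuple[str, Callable[[Event], Event]]

# Hypothetical config: operator tuning lives here, not in user code.
JOB_CONFIG: Dict[str, Dict[str, int]] = {
    "enrich": {"parallelism": 8},
    "score": {"parallelism": 16},
}


def enrich(event: Event) -> Event:
    """Business logic only: mark the event as enriched."""
    return {**event, "enriched": True}


def score(event: Event) -> Event:
    """Business logic only: a toy score derived from the text length."""
    return {**event, "score": len(str(event.get("text", "")))}


def build_plan(steps: List[Step], config: Dict[str, Dict[str, int]]) -> List[dict]:
    """Pair each business-logic step with its tuning settings from config."""
    return [
        {"name": name, "fn": fn,
         "parallelism": config.get(name, {}).get("parallelism", 1)}
        for name, fn in steps
    ]


def run_locally(plan: List[dict], events: Iterable[Event]) -> List[Event]:
    """Apply the steps in order on one thread; parallelism is ignored here."""
    out = list(events)
    for step in plan:
        out = [step["fn"](e) for e in out]
    return out
```

Changing a parallelism value then means editing `JOB_CONFIG` only; the `enrich` and `score` functions stay untouched.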

Xenon CICD framework - big picture
● Bring the CICD practice from stateless online services to the stateful streaming world
● Leverage the same CICD infrastructure
● Customize the CICD pipeline for validating and deploying Flink-based stream
applications
● Achieve the goal of safely rolling out Xenon user / platform changes with minimal
human effort involved in validation

Xenon CICD pipelines - details
● auto-triggered based on cron rule and availability of new artifacts
● stability checks
○ job submission success
○ no restart-loop
○ savepoint generation success
○ ACA metrics validation
○ auto-recovery from TM/JM failure
● Prod deploy: decider-controlled, safe operations on prod job during
business hours
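
The stability checks and the prod deploy gate above could be combined along these lines. This is a sketch under assumptions, not the actual pipeline: `StagingObservation`, `MAX_RESTARTS`, `failed_checks`, and `may_deploy_to_prod` are hypothetical, and reading "decider-controlled ... during business hours" as two boolean gates is an interpretation of the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StagingObservation:
    """Hypothetical signals collected from a validation run of a new artifact."""
    submitted: bool                    # job submission success
    restart_count: int                 # used to detect a restart loop
    savepoint_created: bool            # savepoint generation success
    metrics_within_bounds: bool        # stand-in for ACA metrics validation
    recovered_after_tm_jm_kill: bool   # auto-recovery from TM/JM failure


MAX_RESTARTS = 3  # assumed threshold for calling it a restart loop


def failed_checks(obs: StagingObservation) -> List[str]:
    """Return the names of the stability checks that did not pass."""
    failures = []
    if not obs.submitted:
        failures.append("job submission")
    if obs.restart_count > MAX_RESTARTS:
        failures.append("restart-loop")
    if not obs.savepoint_created:
        failures.append("savepoint generation")
    if not obs.metrics_within_bounds:
        failures.append("metrics validation")
    if not obs.recovered_after_tm_jm_kill:
        failures.append("auto-recovery")
    return failures


def may_deploy_to_prod(obs: StagingObservation, decider_enabled: bool,
                       in_business_hours: bool) -> bool:
    """Allow a prod deploy only if every check passed and both gates are open."""
    return decider_enabled and in_business_hours and not failed_checks(obs)
```

Returning the list of failed checks, rather than a single boolean, keeps the reason for a blocked rollout visible to whoever looks at the pipeline run.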

Xenon CICD framework - User Interface
Xenon CICD Pipeline UI
● Pipeline execution history
● Pipeline operation: disable / enable / trigger
● Links to Pipeline YAML and Spinnaker
Spinnaker UI
● Pipeline parameters
● Pipeline execution status
● Details about each Stage

Job Debugging tool - Dr. Squirrel
Definition:
● One-stop shop for Flink job troubleshooting
Features:
● Surface suspicious stats to Xenon users instead of users searching for them
○ GC, CPU, memory, backpressure, exceptions, bad config...
● Provide instructions on top of suspicious stats
Goal:
● Cut down troubleshooting time, lower the required Flink internal knowledge for
troubleshooting, increase the dev velocity
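
The idea of surfacing suspicious stats together with instructions can be sketched as a small rule table. This is not Dr. Squirrel's code: the `Rule` type, the metric names, the thresholds, and the instruction texts are all illustrative assumptions.

```python
from typing import Callable, Dict, List, NamedTuple

Stats = Dict[str, float]


class Rule(NamedTuple):
    name: str
    is_suspicious: Callable[[Stats], bool]
    instruction: str


# Hypothetical rules; metric names and thresholds are placeholders.
RULES: List[Rule] = [
    Rule("gc", lambda s: s.get("gc_time_ratio", 0.0) > 0.2,
         "High GC time: review heap size and state size."),
    Rule("cpu", lambda s: s.get("cpu_utilization", 0.0) > 0.9,
         "CPU is saturated: consider raising parallelism."),
    Rule("backpressure", lambda s: s.get("backpressured_ratio", 0.0) > 0.5,
         "Backpressure detected: inspect the slowest downstream operator."),
]


def diagnose(stats: Stats) -> List[str]:
    """Return one 'name: instruction' line per rule that flags the stats."""
    return [f"{r.name}: {r.instruction}" for r in RULES if r.is_suspicious(stats)]
```

A user then reads a short list of findings instead of searching dashboards, for example `diagnose({"gc_time_ratio": 0.3})` yields only the GC entry.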

Working model - New Use Case Onboarding Process
● Xenon team provides managed bootstrap of new use case:
○ best practices in terms of choosing framework and deciding job graph
○ Dev environment setup
○ a buildable and deployable skeleton project (bazel, java, test, configs)
○ Hermez workloads creation
○ CICD pipeline
○ YARN queue
○ dashboard / alerts with default settings
● Xenon developers write and test business logic code
● Support auto-generation of NRTG- and Flink SQL-based projects
Outcome: reduce the onboarding time by 3+ weeks

Xenon platform Mission No. 3 - Cloud efficiency (ongoing)
● Auto Scaling - Auto tuning & auto scaling up/down of Flink applications
● Cluster upgrade - Automatic job migration during platform upgrade
● Resource Optimization - Load balance Xenon clusters
● Evaluate k8s

Auto Scaling
● Service to dynamically adjust job parallelism based on metrics: Kafka lag, CPU utilization, and
backpressure.
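
A parallelism decision driven by the three metrics named above might look like the sketch below. The thresholds, the double/halve policy, and the names `JobMetrics` and `target_parallelism` are assumptions made for illustration; the deck does not describe the actual scaling policy.

```python
from dataclasses import dataclass


@dataclass
class JobMetrics:
    kafka_lag: int          # messages the job is behind
    cpu_utilization: float  # 0.0 to 1.0
    backpressured: bool     # whether any operator reports backpressure


def target_parallelism(current: int, m: JobMetrics, *, min_p: int = 1,
                       max_p: int = 256, lag_threshold: int = 1_000_000,
                       cpu_high: float = 0.8, cpu_low: float = 0.3) -> int:
    """Propose a new parallelism: double when overloaded, halve when idle."""
    overloaded = (m.kafka_lag > lag_threshold or m.backpressured
                  or m.cpu_utilization > cpu_high)
    idle = (m.kafka_lag == 0 and not m.backpressured
            and m.cpu_utilization < cpu_low)
    if overloaded:
        proposed = current * 2
    elif idle:
        proposed = current // 2
    else:
        proposed = current
    # Clamp to the allowed range so the job never scales to zero or unbounded.
    return max(min_p, min(max_p, proposed))
```

Checking the overloaded case before the idle case means a job with lag or backpressure is never scaled down, even if its CPU usage happens to be low.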

Questions?
Anumol Sebastian
Chenqi Liu
Hannah Chen
Divye Kapoor
Kanchi Masalia
Lu Niu
Rainie Li
Teja Thotapalli
Nishant More
Samuel Bahr
Heng Zhang
Kevin Browne
Sergii Marchenko
Ashish Jhaveri
Dinesh Kumar Sekar
Chen Qin
Shaowen Wang
YOU?!
