Flink-powered stream processing platform at Pinterest
Rainie Li
Software engineer@Pinterest
Kanchi Masalia
Software engineer@Pinterest
1. Introduction
2. Challenges & Use cases
3. Platform missions & Frameworks
4. Ongoing Work
5. Q&A
Streaming use cases on Xenon platform
Why Real Time Stream Processing
● Ads real-time spend and reporting - Calculate spend against budget limits in near real time
to quickly adjust budget pacing and update advertisers with more timely reporting results
● Fast User Signals - Make user content signals available quickly after content creation and use
these signals in ML pipelines for a personalized and fresh user experience
● Realtime Trust & Safety - Reduce levels of unsafe content as close to content creation time
● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement
metrics to Creators so they can refine their content with minimal feedback delay
● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users
by updating product metadata in near real time
● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup,
verification, and evaluation
Existing Issues
● Fragmented technologies
○ Self-managed Kafka Streams jobs (Ads Infra)
○ Overwatch platform for small batch Spark jobs (Ads Data,
● Lack of developer support
● Availability & scalability issues
Who are we?
● We are a team of engineers, SREs, PM and EM that builds the
stateful stream data processing platform called Xenon at Pinterest.
● We help around 100 engineers build and operate 100+ Flink
● We run (near) real-time applications at 300M messages per
second and process 150TB of data per second.
● We have enabled 10+ top level company KRs in the past 3 years.
Xenon platform Mission
● Stability: reliably host all deployed Flink-based stream processing
● Dev Velocity: quickly productionize new use cases / features to
meet business and product needs
● Cloud Efficiency: efficiently operate infras and strive for best
Xenon - Pinterest stream processing platform
[Architecture diagram. Layers: The Developer APIs; The Resource Management & Job Execution Layer; The Deployment Stack. Other labels, partly truncated in extraction: Libraries and; Flink SQL; Job State; Restores, Edits); Security /; Job Health & (Dr. Squirrel); CI/CD; Hermez]
PinStats Analytic use case
“Overall, users … cited that currently
they have difficulties monitoring content
performance due to a lack of real-time
data being available, which they find
Creator Content
Use cases
● Fast user signals: Make user content signals available quickly after content creation
● Safety: Reduce levels of unsafe content as close to content creation time
● Ads real-time spend and reporting: Calculate spend against budget limits in real time to quickly adjust budget and update advertisers with more timely reporting results
[Diagram labels, truncated in extraction: Content Creation; Interests &]
Xenon platform Mission No.1 - stability
● Xenon Stability Strategy
● Job Deployment Framework - Hermez and Job Submission service
● Job Management Service - Pinterest stateful streaming application
runtime monitoring and automatic failover to a different AZ.
Xenon Job Deployment Framework
[Diagram labels: Repo; Jenkins; Job Submission]
[Stat labels, figures not recoverable: Xenon Jobs / Hermez workloads; Production Xenon use cases; Deployments everyday]
Stability and Tier 1 support
● Enhanced JSS State Machine
● Supported job level dedicated S3 buckets
User experience
● Hermez supported most recent checkpoint deployment
● Hermez supported kill job and distributed shell
● Enriched savepoint information on Hermez
● Track daily & monthly deployment success rate
● Job submission latency
Xenon Job Management Service
● Job Status
● Critical metrics (QPS)
● Checkpointing health
● Job/task health
● Notify users
Auto Recovery
Auto recover failed jobs
● Last completed
● Most recent savepoint
● Fresh State
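The fallback order in the list above can be sketched as follows. This is a minimal illustration, not the JMS implementation: it assumes the truncated first bullet refers to the last completed checkpoint, and `JobSnapshot` and `choose_restore_source` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class JobSnapshot:
    """Hypothetical view of the state a failed job could restart from."""
    # Assumption: "Last completed" on the slide means the last completed checkpoint.
    last_completed_checkpoint: Optional[str] = None
    most_recent_savepoint: Optional[str] = None


def choose_restore_source(job: JobSnapshot) -> Tuple[str, Optional[str]]:
    """Pick a restore source in the order the slide lists them."""
    if job.last_completed_checkpoint:
        return ("checkpoint", job.last_completed_checkpoint)
    if job.most_recent_savepoint:
        return ("savepoint", job.most_recent_savepoint)
    # Nothing to restore from: restart with fresh (empty) state.
    return ("fresh", None)
```

Trying the newest state first keeps reprocessing small; falling back to fresh state trades data completeness for getting the job running again.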
AZ Failure
Automatically fail over jobs to backup clusters in different AZs when the primary cluster/AZ goes down
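A rough sketch of the AZ failover decision, under stated assumptions: the `Cluster` type, the `healthy` flag, and `pick_failover_cluster` are hypothetical, and how JMS actually detects cluster or AZ failure is not described in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of a YARN cluster that can host a job."""
    name: str
    az: str
    healthy: bool


def pick_failover_cluster(primary: Cluster, backups: List[Cluster]) -> Optional[Cluster]:
    """Return the cluster the job should run on, or None if none is usable."""
    if primary.healthy:
        return primary  # no failover needed
    for candidate in backups:
        # Only fail over to a healthy cluster in a different AZ than the primary.
        if candidate.healthy and candidate.az != primary.az:
            return candidate
    return None
```

Skipping backups in the primary's AZ matters because an AZ-wide outage would take those clusters down as well.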
JMS Architecture
[Architecture diagram. Labels: Xenon JMS; ZK Clusters; Auto Recovery; Yarn Clusters; Flink API]
[Stat labels, figures not recoverable: Jobs under management; Faster recovery time; Jobs get recovered every week]
Xenon platform Mission No. 2 - Developer Velocity
● Near Real Time Galaxy - Pinterest stateful streaming application Job
development framework
● CICD - Pinterest stateful streaming application change rollout flow
● Dr.Squirrel - Pinterest self-served streaming application
troubleshooting portal
● Working model - New Use Case Onboarding Process
● Pinterest stateful streaming application Job development framework
● Galaxy: a high-level managed execution platform for producing and
consuming signals (e.g. entity features) about Pinterest entities (such
as pins, boards, users).
● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow
API used in batch and extends it to streaming applications.
NRTG components (khaki boxes below)
VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill)
● User code focuses only on Business logic. ✅
● Tune Flink operators using configs. ✅
● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧
Code Config
Xenon CICD framework - big picture
● Bring the CICD practice from stateless online services to stateful streaming world
● Leverage the same CICD infrastructure
● Customize the CICD pipeline for validating and deploying Flink-based stream
● Achieve the goal of safely rolling out Xenon user / platform changes with minimal
human effort involved in validation
Xenon CICD pipelines - details
● auto-triggered based on cron rule and availability of new artifacts
● stability checks
○ job submission success
○ no restart-loop
○ savepoint generation success
○ ACA metrics validation
○ auto-recovery from TM/JM failure
● Prod deploy: decider-controlled, safe operations on prod job during
business hours
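One way to picture how these checks could gate a prod deploy is sketched below. The check names mirror the list above, but the function names, the decider flag as a plain boolean, and the Monday to Friday 09:00-17:00 window are all assumptions for illustration, not the actual pipeline logic.

```python
from datetime import datetime
from typing import Dict

# Names mirror the stability checks listed on the slide.
REQUIRED_CHECKS = (
    "job_submission_success",
    "no_restart_loop",
    "savepoint_generation_success",
    "aca_metrics_validation",
    "auto_recovery_from_tm_jm_failure",
)


def stability_passed(results: Dict[str, bool]) -> bool:
    """Every required check must be present and True; a missing check fails."""
    return all(results.get(name, False) for name in REQUIRED_CHECKS)


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Assumed window: Monday to Friday, start_hour up to (not including) end_hour."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def may_deploy_to_prod(results: Dict[str, bool], decider_enabled: bool, now: datetime) -> bool:
    """Deploy only if the decider allows it, checks passed, and it is business hours."""
    return decider_enabled and stability_passed(results) and in_business_hours(now)
```

Treating a missing check result as a failure keeps the gate conservative when a validation stage is skipped or crashes.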
Xenon CICD Pipeline UI
● Pipeline execution history
● Pipeline operation: disable / enable /
● Links to Pipeline YAML and Spinnaker
Spinnaker UI
● Pipeline parameters
● Pipeline execution status
● Details about each Stage
Xenon CICD framework - User Interface
Job Debugging tool - Dr. Squirrel
● One-stop shop for Flink job troubleshooting
● Surface suspicious stats to Xenon users instead of users searching for them
○ GC, CPU, memory, backpressure, exceptions, bad config...
● Provide instructions on top of suspicious stats
● Cut down troubleshooting time, lower the Flink internals knowledge required for
troubleshooting, and increase dev velocity
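The idea of surfacing suspicious stats together with instructions can be sketched as a small rule table. The metric names, thresholds, and instruction texts below are illustrative assumptions, not Dr. Squirrel's actual rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a predicate over job metrics plus advice for the user."""
    name: str
    is_suspicious: Callable[[Dict[str, float]], bool]
    instruction: str


# Thresholds and metric names are assumptions chosen for illustration only.
RULES = [
    Rule("gc", lambda m: m.get("gc_time_ratio", 0.0) > 0.2,
         "Check heap size and object churn."),
    Rule("backpressure", lambda m: m.get("backpressure_ratio", 0.0) > 0.5,
         "Find the slowest downstream operator."),
    Rule("exceptions", lambda m: m.get("exceptions_per_min", 0.0) > 0,
         "Review the latest exception stack traces."),
]


def diagnose(metrics: Dict[str, float]) -> List[str]:
    """Return one 'name: instruction' line per rule that fires."""
    return [f"{r.name}: {r.instruction}" for r in RULES if r.is_suspicious(metrics)]
```

For example, `diagnose({"gc_time_ratio": 0.35})` reports only the GC finding, so the user sees the problem and the next step without browsing raw metrics.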
Dr. Squirrel UI
Architecture - Part 1
Architecture - Part 2
Working model - New Use Case Onboarding Process
● Xenon team provides managed bootstrap of new use case:
○ best practices in terms of choosing framework and deciding job graph
○ Dev environment setup
○ a buildable and deployable skeleton project (bazel, java, test, configs)
○ Hermez workloads creation
○ CICD pipeline
○ YARN queue
○ dashboard / alerts with default settings
● Xenon developers write and test business logic code
● Support auto-generation of NRTG- and Flink SQL-based projects
Outcome: reduce the onboarding time by 3+ weeks
Xenon platform Mission No. 3 - Cloud efficiency (ongoing)
● Auto Scaling - Auto tuning & auto scaling up/down Flink applications
● Cluster upgrade - Automatic job migration during platform upgrade
● Resource Optimization - Load balance Xenon clusters
● Evaluate k8s
Pinterest Auto Scaling
● Service to dynamically adjust job parallelism based on metrics - Kafka lag, CPU utilization and
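A minimal sketch of a metric-driven scaling decision, using only the two metrics the slide names (Kafka lag and CPU utilization). The thresholds, the doubling/halving policy, and the function name are assumptions; the slides do not describe the actual algorithm.

```python
def target_parallelism(
    current: int,
    kafka_lag: int,
    cpu_utilization: float,
    *,
    min_parallelism: int = 1,
    max_parallelism: int = 128,
    lag_high: int = 100_000,   # assumed lag threshold (messages)
    cpu_high: float = 0.8,     # assumed scale-up CPU threshold
    cpu_low: float = 0.3,      # assumed scale-down CPU threshold
) -> int:
    """Propose a new job parallelism from Kafka lag and CPU utilization."""
    if kafka_lag > lag_high or cpu_utilization > cpu_high:
        proposed = current * 2   # falling behind or saturated: scale up
    elif kafka_lag == 0 and cpu_utilization < cpu_low:
        proposed = current // 2  # caught up and idle: scale down
    else:
        proposed = current       # within the comfortable band: no change
    # Clamp to the allowed range.
    return max(min_parallelism, min(max_parallelism, proposed))
```

Scaling down only when lag is zero and CPU is low avoids shrinking a job that is still catching up.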
Anumol Sebastian
Chenqi Liu
Hannah Chen
Divye Kapoor
Kanchi Masalia
Lu Niu
Rainie Li
Teja Thotapalli
Nishant More
Samuel Bahr
Heng Zhang
Kevin Browne
Sergii Marchenko
Ashish Jhaveri
Dinesh Kumar Sekar
Chen Qin
Shaowen Wang
YOU?!
Q & A
Thank you

FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas

Flink powered stream processing platform at Pinterest

  • 2. Flink-powered stream processing platform at Pinterest Rainie Li Software engineer@Pinterest Kanchi Masalia Software engineer@Pinterest
  • 3. Agenda 1. Introduction 2. Challenges & Use cases 3. Platform missions & Frameworks 4. Ongoing Work 5. Q&A
  • 6. Confidential | © Pinterest Streaming use cases on Xenon platform OKR promised OKR delivered ~2x over ~3x scale
  • 7. Why Real Time Stream Processing ● Ads real-time spend and reporting - Calculate spend against budget limits in near real time to quickly adjust budget pacing and update advertisers with more timely reporting results ● Fast User Signals - Make user content signals available quickly after content creation and use these signals in ML pipelines for a personalized and fresh user experience ● Realtime Trust & Safety - Reduce levels of unsafe content as close to content creation time as possible ● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement metrics to Creators so they can refine their content with minimal feedback delay ● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users by updating product metadata in near real time ● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup, verification, and evaluation
  • 8. Existing Issues ● Fragmented technologies ○ Self-managed Kafka Streams jobs (Ads Infra) ○ Overwatch platform for small batch Spark jobs (Ads Data, Measurement) ● Lack of developer support ● Availability & scalability issues
  • 9. Who are we? ● We are a team of engineers, SREs, a PM, and an EM that builds the stateful stream data processing platform called Xenon at Pinterest. ● We support around 100 engineers who build and operate 100+ Flink applications. ● We run (near) real-time applications at 300M messages per second and process 150TB of data per second. ● We have enabled 10+ top-level company KRs in the past 3 years.
  • 10. Xenon platform Mission ● Stability: reliably host all deployed Flink-based stream processing applications ● Dev Velocity: quickly productionize new use cases / features to meet business and product needs ● Cloud Efficiency: efficiently operate infrastructure and strive for best practices
  • 11. Xenon - Pinterest stream processing platform Cluster Management (YARN) NRTG Common Libraries and Connectors Flink SQL The Resource Management & Job Execution Layer The Developer APIs Job State Management (Checkpoints, Backups, Restores, Edits) Security / Auth (PII/FGAC) Job Health & Diagnosis (Dr. Squirrel) CI/CD Hermez The Deployment Stack Job Management Service
  • 12. PinStats Analytic Use case “Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”
  • 13. Creator Content Use cases Fast user signals: Make user content signals available quickly after content creation Safety: Reduce levels of unsafe content as close to content creation time as possible Content Creation Audience Targeting Content Understanding Quality Interests & Annotations Embeddings Performance
  • 14. Ads real-time spend and reporting Calculate spend against budget limits in real time to quickly adjust budget and update advertisers with more timely results
  • 15. Xenon platform Mission No. 1 - Stability ● Xenon Stability Strategy ● Job Deployment Framework - Hermez and Job Submission Service ● Job Management Service - runtime monitoring of Pinterest stateful streaming applications and automatic failover to a different AZ.
  • 17. Xenon Jobs / Hermez workloads 154 Production Xenon use cases >90 179 Deployments everyday
  • 18. Highlights Stability and Tier 1 support ● Enhanced JSS State Machine ● Supported job level dedicated S3 buckets User experience ● Hermez supported most recent checkpoint deployment ● Hermez supported kill job and distributed shell ● Enriched savepoint information on Hermez ● Track daily & monthly deployment success rate Metrics ● Job submission latency
  • 19. Xenon Job Management Service Monitoring ● Job Status ● Critical metrics (QPS) ● Checkpointing health ● Job/task health ● Notify users Auto Recovery Auto recover failed jobs from: ● Last completed checkpoint ● Most recent savepoint ● Fresh State AZ Failure Resilience Auto failover jobs to backup clusters in different AZs when primary cluster/AZ goes down
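The auto-recovery options listed on this slide can be read as a fallback chain. The sketch below is only an illustration of that idea: the slide names the three restore sources but does not state a priority order, so the ordering here (checkpoint, then savepoint, then fresh state) is an assumption, and `JobSnapshots` and `choose_restore_source` are hypothetical names rather than part of Xenon JMS.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JobSnapshots:
    """Hypothetical view of the state snapshots known for one job."""
    last_completed_checkpoint: Optional[str] = None
    most_recent_savepoint: Optional[str] = None


def choose_restore_source(snap: JobSnapshots) -> Tuple[str, Optional[str]]:
    """Pick where a failed job restarts from.

    Assumed order: last completed checkpoint, then most recent
    savepoint, then fresh state when neither snapshot exists.
    """
    if snap.last_completed_checkpoint:
        return ("checkpoint", snap.last_completed_checkpoint)
    if snap.most_recent_savepoint:
        return ("savepoint", snap.most_recent_savepoint)
    return ("fresh", None)
```

A caller would pass the returned path, if any, to whatever component resubmits the job; that resubmission step is outside this sketch.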
  • 20. Xenon JMS Statsboard ZK Clusters Hermez JSS Auto Recovery Monitoring Deployment Yarn Clusters AZ-a Yarn Clusters AZ-b Yarn Clusters AZ-c Failover JMS Architecture Flink API user
  • 21. Jobs under management Faster recovery time >90 Jobs get recovered every week 10X >7
  • 22. Xenon platform Mission No. 2 - Developer Velocity ● Near Real Time Galaxy - Pinterest stateful streaming application job development framework ● CICD - Pinterest stateful streaming application change rollout flow ● Dr. Squirrel - Pinterest self-served streaming application troubleshooting portal ● Working model - New Use Case Onboarding Process
  • 23. NRTG Definition: ● Pinterest stateful streaming application job development framework History: ● Galaxy: a high-level managed execution platform for producing and consuming signals (e.g. entity features) about Pinterest entities (such as pins, boards, users). ● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow API used in batch and extends it to streaming applications.
  • 25. VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill) ● User code focuses only on Business logic. ✅ ● Tune flink operators using configs. ✅ ● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧 Xenon Flink Application Code Config
  • 26. Xenon CICD framework - big picture ● Bring the CICD practice from stateless online services to the stateful streaming world ● Leverage the same CICD infrastructure ● Customize the CICD pipeline for validating and deploying Flink-based streaming applications ● Achieve the goal of safely rolling out Xenon user / platform changes with minimal human effort involved in validation
  • 28. Xenon CICD pipelines - details ● Auto-triggered based on cron rule and availability of new artifacts ● Stability checks ○ job submission success ○ no restart-loop ○ savepoint generation success ○ ACA metrics validation ○ auto-recovery from TM/JM failure ● Prod deploy: decider-controlled, safe operations on prod job during business hours
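One way to picture the stability checks on this slide is as a gate that must pass before the prod deploy stage runs. This is a minimal sketch under that assumption, not the actual pipeline: `run_stability_gate` is a hypothetical helper, and each check is represented by a plain callable rather than a real probe against a running job.

```python
from typing import Callable, Dict, List, Tuple


def run_stability_gate(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every named check and report which ones failed.

    Returns (passed, failed_names); passed is True only when no check
    failed. All checks run, so the report lists every failure at once.
    """
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)


# Check names mirror the slide; the lambdas stand in for real probes.
example_checks = {
    "job_submission_success": lambda: True,
    "no_restart_loop": lambda: True,
    "savepoint_generation_success": lambda: False,
    "aca_metrics_validation": lambda: True,
    "auto_recovery_from_tm_jm_failure": lambda: True,
}
passed, failed = run_stability_gate(example_checks)
```

Running all checks instead of stopping at the first failure is a design choice for this sketch: it gives the job owner one complete report per pipeline run.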
  • 29. Xenon CICD framework - User Interface Xenon CICD Pipeline UI ● Pipeline execution history ● Pipeline operation: disable / enable / trigger ● Links to Pipeline YAML and Spinnaker Spinnaker UI ● Pipeline parameters ● Pipeline execution status ● Details about each Stage
  • 30. Job Debugging tool - Dr. Squirrel Definition: ● One-stop shop for Flink job troubleshooting Features: ● Surface suspicious stats to Xenon users instead of users searching for them ○ GC, CPU, memory, backpressure, exceptions, bad config... ● Provide instructions on top of suspicious stats Goal: ● Cut down troubleshooting time, lower the required Flink internal knowledge for troubleshooting, increase the dev velocity
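The "surface suspicious stats and attach instructions" idea can be sketched as a small rule table. Everything specific below is an assumption made for illustration: the metric names, the thresholds, and the advice strings are hypothetical and are not taken from Dr. Squirrel.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: metric name -> (threshold, instruction shown to the user).
RULES: Dict[str, Tuple[float, str]] = {
    "gc_time_ratio": (0.2, "Check heap sizing and object churn in user code."),
    "cpu_utilization": (0.9, "Consider raising parallelism or profiling hot operators."),
    "backpressure_ratio": (0.5, "Look for a slow downstream operator or sink."),
}


def suspicious_stats(stats: Dict[str, float]) -> List[str]:
    """Return one finding per metric that exceeds its threshold.

    Metrics missing from `stats` are skipped rather than treated as healthy
    or unhealthy.
    """
    findings = []
    for name, (threshold, advice) in RULES.items():
        value = stats.get(name)
        if value is not None and value > threshold:
            findings.append(f"{name}={value} exceeds {threshold}: {advice}")
    return findings
```

Keeping the rules in data rather than in code means a new check is one more table entry, which fits the stated goal of lowering the Flink knowledge needed to troubleshoot.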
  • 34. Working model - New Use Case Onboarding Process ● Xenon team provides managed bootstrap of new use case: ○ best practices in terms of choosing framework and deciding job graph ○ Dev environment setup ○ a buildable and deployable skeleton project (bazel, java, test, configs) ○ Hermez workloads creation ○ CICD pipeline ○ YARN queue ○ dashboard / alerts with default settings ● Xenon developers write and test business logic code ● Support auto-generation of NRTG- and Flink SQL-based projects Outcome: reduce the onboarding time by 3+ weeks
  • 35. Xenon platform Mission No. 3 - Cloud efficiency (ongoing) ● Auto Scaling - Auto tuning & auto scaling up/down of Flink applications ● Cluster upgrade - Automatic job migration during platform upgrade ● Resource Optimization - Load balance Xenon clusters ● Evaluate k8s
  • 36. Auto Scaling ● Service to dynamically adjust job parallelism based on metrics: Kafka lag, CPU utilization, and backpressure.
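A scaling decision driven by the three metrics named on this slide could look roughly like the sketch below. The slide does not describe the actual policy, so the doubling/halving rule, every threshold, and the names `JobMetrics` and `target_parallelism` are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class JobMetrics:
    kafka_lag: int           # messages the consumer is behind
    cpu_utilization: float   # fraction between 0.0 and 1.0
    backpressured: bool      # whether any operator reports backpressure


def target_parallelism(current: int, m: JobMetrics, *,
                       max_lag: int = 100_000,
                       high_cpu: float = 0.8,
                       low_cpu: float = 0.3,
                       min_p: int = 1,
                       max_p: int = 256) -> int:
    """Suggest a new parallelism (all thresholds are hypothetical).

    Scale up when any signal shows the job falling behind; scale down
    only when the job is caught up and mostly idle; otherwise hold.
    """
    if m.kafka_lag > max_lag or m.backpressured or m.cpu_utilization > high_cpu:
        return min(current * 2, max_p)
    if m.kafka_lag == 0 and m.cpu_utilization < low_cpu:
        return max(current // 2, min_p)
    return current
```

The asymmetry is deliberate in this sketch: any single overload signal triggers a scale-up, while a scale-down needs both zero lag and low CPU, which avoids shrinking a job that is only momentarily quiet.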
  • 37. Questions? Anumol Sebastian Chenqi Liu Hannah Chen Divye Kapoor Kanchi Masalia Lu Niu Rainie Li Teja Thotapalli Nishant More Samuel Bahr Heng Zhang Kevin Browne Sergii Marchenko Ashish Jhaveri Dinesh Kumar Sekar Chen Qin Shaowen Wang YOU?!
  • 38. Q & A