Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433 million users. Stream processing allows us to unlock value from real-time data for pinners. At Pinterest, we adopted Flink as our unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboard critical use cases to the platform. Pinterest has supported 90+ near-real-time streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
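The two properties named in the abstract above (lists of values per user key, and iteration in a well-defined key order) can be illustrated with a small in-memory sketch. This is not the FLIP-220 API: the class `SortedMultiMap` and its methods `add`, `items`, and `pop_up_to` are hypothetical names chosen for illustration, and nothing here involves RocksDB or Flink state.

```python
import bisect
from collections import defaultdict


class SortedMultiMap:
    """Hypothetical in-memory sketch; NOT Flink's BinarySortedMultiMapState."""

    def __init__(self):
        self._keys = []                    # user keys, kept in sorted order
        self._values = defaultdict(list)   # user key -> list of values

    def add(self, key, value):
        # Register the key in sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key].append(value)

    def items(self):
        # Iterate entries in ascending key order.
        for key in self._keys:
            yield key, list(self._values[key])

    def pop_up_to(self, limit):
        """Remove and return all entries with key <= limit, in key order."""
        cut = bisect.bisect_right(self._keys, limit)
        ready, self._keys = self._keys[:cut], self._keys[cut:]
        return [(key, self._values.pop(key)) for key in ready]
```

With timestamps as keys, `pop_up_to(watermark)` would return buffered events in event-time order, which is the stream-sorting use case the abstract mentions.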
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
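To make the bin-packing idea in the abstract above concrete, here is a minimal sketch that assigns partition keys to writer tasks with a greedy heuristic (heaviest key first, always into the currently least-loaded writer). The abstract does not say which algorithm the shuffling stage uses, so the heuristic, the function name `assign_keys_to_writers`, and the notion of a per-key "weight" are all assumptions made for illustration.

```python
import heapq


def assign_keys_to_writers(key_weights, num_writers):
    """Greedy sketch: map each key to one writer, balancing total weight.

    key_weights: dict of key -> estimated traffic weight (assumed input).
    Returns a dict of key -> writer index.
    """
    # Min-heap of (current load, writer index); the lightest writer is on top.
    heap = [(0, writer) for writer in range(num_writers)]
    heapq.heapify(heap)

    assignment = {}
    # Place heavy keys first; ties are broken by key for determinism.
    for key, weight in sorted(key_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, writer = heapq.heappop(heap)
        assignment[key] = writer
        heapq.heappush(heap, (load + weight, writer))
    return assignment
```

Because every key is routed to exactly one writer, each writer keeps files open only for its own keys, which is the effect the abstract describes as reducing the number of concurrent files per task.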
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have the potential to cause serious impact to customer applications and user experience. Automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low-latency data pipelines in production using Flink, we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them to the incoming stream of events to detect anomalies! In this session we will dive deep into the design as well as discuss pitfalls and learnings from developing our real-time platform that leverages Debezium, Kafka, Flink, ElastiCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits.
- An introduction to the five levels of the operator maturity model.
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs.
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container.
- Enhancements we're making in:
  - Versioning/Upgradeability/Stability
  - Security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator.
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Producer Performance Tuning for Apache KafkaJiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced; later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Anderson
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
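The buffering and request-batching behavior mentioned in the abstract above can be pictured with a deliberately simplified sketch. This is not the Async Sink (FLIP-171) API: `BatchingBuffer`, `send_batch`, and `max_batch_size` are hypothetical names, the sends are synchronous, and retries, backpressure, and delivery guarantees are left out entirely.

```python
class BatchingBuffer:
    """Hypothetical sketch of size-based batching; NOT Flink's Async Sink API."""

    def __init__(self, send_batch, max_batch_size):
        self._send_batch = send_batch          # callable taking a list of records
        self._max_batch_size = max_batch_size  # flush threshold (assumed knob)
        self._buffer = []

    def write(self, record):
        # Buffer the record and flush once a full batch has accumulated.
        self._buffer.append(record)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        # Hand the buffered records to the destination as one request.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)
```

A caller would typically also invoke `flush()` at shutdown or on a checkpoint-like boundary so that a partially filled batch is not left behind.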
This talk gives a brief introduction to Apache Kafka and describes its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, how to upgrade to newer versions, and how to migrate to the new Kafka Producer and Consumer APIs. We will also talk about the best practices involved in running a producer/consumer.
In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase the open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...HostedbyConfluent
How do Kafka Streams and ksqlDB reason about time, how does it affect my application, and how do I take advantage of it? In this talk, we explore the "time engine" of Kafka Streams and ksqlDB and answer important questions about how you can work with time. What is the difference between sliding, time, and session windows and how do they relate to time? What timestamps are computed for result records? What temporal semantics are offered in joins? And why does the suppress() operator not emit data? Besides answering those questions, we will share tips and tricks on how you can "bend" time to your needs and when mixing event-time and processing-time semantics makes sense. Six months ago, the question "What's the time? …and Why?" was asked and partly answered at Kafka Summit in San Francisco, focusing on writing data, data storage and retention, as well as consuming data. In this talk, we continue our journey and delve into data stream processing with Kafka Streams and ksqlDB, which both offer rich time semantics. At the end of the talk, you will be well prepared to process past, present, and future data with Kafka Streams and ksqlDB.
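As a rough picture of how fixed-size time windows and session windows differ, the sketch below assigns plain integer timestamps to windows in Python. It does not use or reproduce the Kafka Streams or ksqlDB APIs: the function names are hypothetical, and treating a difference exactly equal to the gap as "same session" is a choice made for this sketch only.

```python
def tumbling_window(timestamp, size):
    """Return the [start, end) fixed-size window containing the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity.

    Returns a list of (first_timestamp, last_timestamp) pairs.
    """
    windows = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][1] <= gap:
            windows[-1][1] = ts          # still active: extend the current session
        else:
            windows.append([ts, ts])     # inactivity exceeded: open a new session
    return [tuple(window) for window in windows]
```

The fixed-size window depends only on the record's own timestamp, while a session's bounds depend on the neighboring records, which is why session windows can grow or merge as more data arrives.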
Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional.
In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause.
Visit www.confluent.io for more information.
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options, we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableHostedbyConfluent
"The shift from batch processing to real-time processing of data is accelerating. Building real-time data applications is a necessity for many businesses as customers expect data to be always up-to-date and their apps to react to changes as they happen. However building and productizing real-time applications is often a complex and lengthy process due to limited serverless options to build such apps.
The introduction of AWS Lambda was a watershed moment in the world of cloud computing. It allowed developers to fire up “fully-managed” computer programs while paying only for the time the program ran. Serverless compute comes with three big advantages: improved scalability, reduced cost, and increased flexibility. We’re bringing this same powerful paradigm to real-time data processing with Flink in Confluent Cloud. Using this model, users can focus on writing business logic instead of managing nodes and other infrastructure.
Attendees will learn the benefits of serverless and see how it fits into the context of stream processing. We’ll then kick off a demo where we’ll focus on a real world production use case that uses Flink jobs to power an application with extremely low latency."
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
Gregory Fee presented on Lyft's use of streaming technologies like Kafka and Flink. Lyft uses streaming for real-time tasks like traffic updates and fraud detection. Previously they used Kinesis and Spark/Hive but are moving to Kafka and Flink for better scalability and developer experience. Lyft's Dryft platform provides consistent feature generation for machine learning using Flink SQL to process streaming and batch data. Dryft programs can backfill historical data and process real-time streams.
IaaS provides on-demand, self-service access to computing resources like servers and storage. PaaS automates the deployment of applications on top of IaaS and handles scaling. SaaS delivers applications to users through a thin client like a web browser. iPaaS facilitates integration between SaaS, PaaS, IaaS, and on-premise systems through a cloud-based platform. Popular IaaS include OpenStack and VMware vSphere, PaaS include Cloud Foundry and OpenShift, while Salesforce and Office 365 are examples of SaaS.
Accelerating Digital Transformation: It's About Digital EnablementJoshua Gossett
Digital Transformation is a strategy that industries have been embracing over the past several years. Efforts are maturing but organizations are continuing to struggle to capture new digital value and reflect it on the bottom line. Digital Transformation efforts for most legacy companies are struggling, as they are looked on as a Technology problem.
Any "Transformational" strategy must address all the stakeholders involved as well as have a focus on delivering value to these stakeholders at multiple levels. Success can and has been delivered through the creation of Digital Transformation Enablement Programs that address the multiple stakeholder dimensions (people, process, and technology) and ultimately lead to digital being just how we do business.
In this discussion I will specifically outline the steps that we have leveraged to deliver Digital Transformation Enablement and as a byproduct change the way people work, how they approach problems with the application of technologies, and ultimately drive new value for their organization and customers.
Hewlett Packard Enterprise | Stormrunner load | Game ChangerJeffrey Nunn
This document summarizes the key features and capabilities of HPE StormRunner Load, a cloud-based load and performance testing solution. Some of the main points covered include:
- StormRunner Load allows for simple, scalable, and smart load testing of web and mobile applications directly in the cloud.
- Tests can be created and run within 10 minutes to test from 1 to over 1 million virtual users from real-world cloud locations.
- The solution provides real-time results, analytics, and problem isolation capabilities.
- Additional features include support for multiple protocols, integrations with monitoring and DevOps tools, and collaboration functionality.
This document summarizes the experience and qualifications of Deepak Kumar Singh. He has over 3 years of experience in automation testing using tools like QTP and Selenium, and database development using SQL Server and Progress4GL. He has led project teams and worked on test automation, test case design, defect management, and status reporting. Deepak is proficient in programming languages like Java, SQL, and PL/SQL. He has expertise in various testing methodologies and tools.
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
Aggregation-based features account for a quarter of the several thousand features used by the ML-based decisioning system of the Risk team at Uber. We observed several repetitive, cumbersome steps needed to onboard a feature, every single time. Therefore, to accelerate developer velocity and to enable feature engineering at scale, we decided to develop a generic Spark-based infrastructure that simplifies the process to no more than a simple spec file containing a parameterized query, along with some metadata on where the feature should be aggregated and stored.
In the presentation, we will describe the architecture of the final solution, highlighting some of the advanced capabilities like backfill support and self-healing for correctness. We will showcase how, using data stored in Hive and using Spark, we developed a highly scalable solution to carry out feature aggregation in an incremental way. By dividing data aggregation responsibility across the real-time access layer and the batch computation components, we ensured that only entities whose values actually changed are dispersed to our real-time access store (Cassandra). We will share how we did data modeling in Cassandra using its native capabilities such as counters, and how we worked around some of the limitations of Cassandra. We will also cover the details of the access service and how we stitch different types of features together, and how, based on our data model, we were able to ensure that all the features for an entity with the same aggregation window are fetched via a single query. Finally, we will cover some of the details on how these incrementally aggregated features have enabled shorter turnaround times for the models using such features.
This document provides a summary of Netflix's architecture and use of open source software. It discusses:
- Why Netflix open sources software, including gathering feedback, collaboration, and improving retention and recruiting
- Popular Netflix open source projects like Eureka, Ribbon, and Hystrix that are widely used in cloud architectures
- Netflix's microservices architecture and emphasis on automation, high availability, and continuous delivery
- How Netflix ensures operational visibility and security at scale through open source tools like Turbine, Atlas, and Security Monkey
- Getting started resources for understanding and running Netflix's technologies like ZeroToCloud and ZeroToDocker workshops
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020)
https://www.linkedin.com/in/erenavsarogullari/
https://www.linkedin.com/in/pavelhardak/
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use cases such as Data Ingestion, Preparation (Cleaning, Transformation & Publishing) and Discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases that carry our own patches on top of the Spark OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions against a single repo and dealing with large numbers of customers, each of which can execute their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important for catching performance regressions by running the standard TPC-H & TPC-DS queries against both Spark versions and monitoring the runtime behavior of the Spark driver and executors before production. At the deployment phase, we also follow a progressive roll-out plan backed by Feature Toggles, which enable or disable new Spark features at runtime. As part of our development lifecycle, Feature Toggles help with various use cases such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions in the build pipeline, and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark applications. In addition, the operation-level runtime behavior of executed Spark queries is important for debugging and troubleshooting. The upcoming Spark release is going to introduce a new SQL REST API exposing operation-level runtime metrics of executed queries, and we transform them into queryable Hive tables in order to track operation-level runtime behavior per executed query. In light of this, the session covers the Spark feature development lifecycle at Workday, walking step by step through the custom Spark upgrade model, the benchmark and monitoring pipeline, and the Spark runtime metrics pipeline, along with the patterns and technologies used.
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
Workday uses Apache Spark as the foundational technology for its Prism Analytics product. It has developed a custom Spark upgrade model to handle upgrading Spark across its multi-tenant environment. Workday also collects runtime metrics on Spark SQL queries using a custom metrics pipeline and REST API. Future plans include upgrading to Spark 3.x and improving multi-tenancy support through a "Multiverse" deployment model.
Softjourn is a software engineering company located in Ukraine that offers dedicated development teams, software as a service, and application development services. They have over 400 completed projects, a low employee turnover rate, and focus on strong communication. Their teams include engineers with masters degrees who have on average 5-7 years of experience. Softjourn aims to build trust with clients through clear communication and a collaborative approach to problem solving.
Scaling up uber's real time data analyticsXiang Fu
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies have helped Uber scale and enabled SQL to power realtime decision making for city ops, data scientists, data analysts and engineers.
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...confluent
Microservices, events, containers, and orchestrators are dominating our vernacular today. As operations teams adapt to support these technologies in production, cloud-native platforms like Pivotal Cloud Foundry and Kubernetes have quickly risen to serve as force multipliers of automation, productivity and value.
Apache Kafka® is providing developers a critically important component as they build and modernize applications to cloud-native architecture.
This talk will explore:
• Why cloud-native platforms and why run Apache Kafka on Kubernetes?
• What kind of workloads are best suited for this combination?
• Tips to determine the path forward for legacy monoliths in your application portfolio
• Demo: Running Apache Kafka as a Streaming Platform on Kubernetes
This document provides a summary of Ahmed El Mawaziny's experience and skills. It includes details about his roles as a Senior Software Architect, Technology Team Lead, and Senior Software Engineer. It lists the programming languages, frameworks, databases, cloud platforms, and other tools he has experience with. It also summarizes several software projects he has worked on, including for the Saudi Ministry of Commerce, the Egyptian Electricity Holding Company, UniCare medical insurance, and others.
The document discusses Kubernetes and cloud native application design. It begins by defining cloud native as structuring teams and technology around automation and microservices packaged as containers orchestrated by platforms like Kubernetes. It then covers common Kubernetes resources like pods, services, deployments and Kubernetes design patterns like sidecars, init containers and immutable configuration. The document advocates principles for container-based applications including single concern, self-containment and image immutability. It also recommends techniques like using volumes for persistent data and logging to standard output/error.
The differing ways to monitor and instrumentJonah Kowall
FullStack London July 15th, 2016
Monitoring is complicated, and in most organizations consists of far too many tools owned by many teams. Each of these tools looks at a single component myopically, collecting the metrics and logs emitted by devices and software. Increasingly, modern companies are creating their own instrumentation, but there is a large base of generic instrumentation of software. Fixing monitoring issues requires people, process, and technology. In this talk we will cover many common issues seen in the real world, for example decisions on what should be monitored or collected from a technology and a business perspective. This requires process and coordination.
We will investigate which instrumentation is most scalable and effective across languages. This includes the commonly used APIs and possibilities to capture data from common languages like Java, .NET and PHP, but we’ll also go into methods that work with Python, Node.js, and golang. We will cover browser and mobile instrumentation techniques: how are these done, which APIs are being used, and what open source tools and frameworks can be leveraged? Most importantly, we will cover how to coordinate and communicate requirements across your organization.
Attendees of this session will walk away with a clear understanding of:
What is instrumentation, and what do I instrument, collect, and store?
An understanding of overhead and how this can be accomplished on common software stacks.
How to work with application owners to collect business data.
How correlation works in custom open source or packaged monitoring tools.
The document outlines 19 potential project titles for a Cisco summer internship in 2011. The projects cover a wide range of topics including network performance testing, automation, monitoring, management, and security tools.
Nayeem Shaik has experience as an Associate Software Engineer at Accenture and as a Developer Intern at Soft4u Technologies. He has technical skills in Python, Java, PHP, databases, and frameworks. At Accenture, his responsibilities included developing and supporting order management, payment, and integration services for a telecom client using IBM Integration Bus and Oracle. He optimized code for better performance and conducted code reviews. His education includes a Bachelor's degree in Computer Science Engineering. In his personal projects, he has developed systems for image classification, hieroglyph detection, and project management.
Patterns and Pains of Migrating Legacy Applications to KubernetesJosef Adersberger
Running applications on Kubernetes can provide a lot of benefits: more dev speed, lower ops costs, and a higher elasticity & resiliency in production. Kubernetes is the place to be for cloud native apps. But what to do if you’ve no shiny new cloud native apps but a whole bunch of JEE legacy systems? No chance to leverage the advantages of Kubernetes? Yes you can!
We’re facing the challenge of migrating hundreds of JEE legacy applications of a German blue chip company onto a Kubernetes cluster within one year.
The talk will be about the lessons we've learned - the best practices and pitfalls we've discovered along our way.
Similar to Flink powered stream processing platform at Pinterest (20)
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users easily interact with the Pulsar cluster via Flink SQL. We would like to go through the design and implementation of the SQL connector, covering the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Processing Semantically-Ordered Streams in Financial Services - Flink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Batch Processing at Scale with Flink & Iceberg - Flink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to open up the hood on our driven and innovative Open Source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Extending Flink SQL for stream processing use cases - Flink Forward
1. For streaming data, Flink SQL uses STREAMs for append-only queries and CHANGELOGs for upsert queries instead of tables.
2. Stateless queries on streaming data, such as projections and filters, result in new STREAMs or CHANGELOGs.
3. Stateful queries, such as aggregations, produce STREAMs or CHANGELOGs depending on whether they are windowed or not. Join queries between streaming sources also result in STREAM outputs.
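The difference between the two output kinds can be sketched in plain Python. This is a toy simulation, not Flink SQL: the function names and the sample events are illustrative assumptions, and the `+I` / `-U` / `+U` tags borrow Flink's changelog row-kind notation. A non-windowed count has to revise earlier results, so it yields a changelog; a tumbling-window count emits each result once, so it yields an append-only stream.

```python
from collections import defaultdict


def running_count(events):
    """Non-windowed aggregation: each input row revises a previous result,
    so the output is a changelog (+I insert, -U retract, +U update)."""
    counts = defaultdict(int)
    changelog = []
    for key, _ts in events:
        if counts[key] > 0:
            changelog.append(("-U", key, counts[key]))
            counts[key] += 1
            changelog.append(("+U", key, counts[key]))
        else:
            counts[key] = 1
            changelog.append(("+I", key, 1))
    return changelog


def tumbling_count(events, size):
    """Windowed aggregation: one final row per (window, key), never revised,
    so the output is an append-only stream."""
    windows = defaultdict(int)
    for key, ts in events:
        window_start = ts - ts % size
        windows[(window_start, key)] += 1
    return [(start, key, n) for (start, key), n in sorted(windows.items())]


# Hypothetical (key, timestamp) events used only for illustration.
events = [("a", 1), ("b", 2), ("a", 3), ("a", 12)]
```

Calling `running_count(events)` produces retraction pairs for key `a`, while `tumbling_count(events, 10)` produces one row per window and key.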
Using Queryable State for Fun and Profit - Flink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
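One way to picture a trailing-window buffer that avoids copying each sample into every overlapping window is a fixed ring of per-minute slots. The sketch below is an assumption about the general technique, not the implementation described in the talk; the class name `TrailingWindow` and its methods are hypothetical.

```python
class TrailingWindow:
    """Keeps the last `size` one-minute samples in a fixed ring buffer.

    Each sample is stored once, whereas overlapping sliding windows would
    hold a copy of it in every window that covers it.
    """

    def __init__(self, size=90):
        self.size = size
        self.minutes = [None] * size  # minute index currently held by each slot
        self.values = [0.0] * size

    def add(self, minute, value):
        slot = minute % self.size
        current = self.minutes[slot]
        if current is not None and minute < current:
            return  # too late: the slot already holds a newer minute
        if current != minute:
            # Slot is empty or holds an expired minute: reuse it.
            self.minutes[slot] = minute
            self.values[slot] = 0.0
        self.values[slot] += value

    def total(self, now_minute):
        """Sum of the samples in the trailing window ending at now_minute."""
        oldest = now_minute - self.size + 1
        return sum(
            v
            for m, v in zip(self.minutes, self.values)
            if m is not None and oldest <= m <= now_minute
        )
```

Memory stays constant at `size` slots per key, and a query only scans those slots instead of merging many window panes.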
by
Ron Crocker
Changelog Stream Processing with Apache Flink - Flink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Large Scale Real Time Fraudulent Web Behavior Detection - Flink Forward
Flink Forward San Francisco 2022.
Neuro-ID analyzes web behavior at a large scale to determine visitors' intent on web pages, specifically in the online lending industry. When users interact with an online loan application, our software analyzes their behavior to determine if the applicant may be potentially fraudulent. Lenders can then request, in real time, various scores describing the applicant's intentions and use them to make decisions during the application flow. Flink gives our product the ability to observe behavior in a stateful manner. As an applicant interacts with an online loan application, a Flink application is used to compare earlier actions to later actions. This processing in Flink can determine the applicant's intent throughout the process of the application.
by
Jeff Niemann & Randy Hanak
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... - Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Building Reliable Lakehouses with Apache Flink and Delta Lake - Flink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
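The idempotent-write idea can be illustrated with a toy in-memory table that remembers the highest checkpoint it has committed per application. This is a simplified sketch under stated assumptions, not the Flink/Delta Connector's API: `ToyTable`, `commit`, and the `app_id` / `checkpoint_id` parameters are hypothetical names chosen for the example.

```python
class ToyTable:
    """In-memory stand-in for a transactional table with idempotent commits."""

    def __init__(self):
        self.rows = []
        self.last_committed = {}  # app_id -> highest checkpoint id committed

    def commit(self, app_id, checkpoint_id, rows):
        """Append rows unless this checkpoint was already committed.

        Returns True when the rows were applied, False when the commit was
        recognized as a replay and skipped.
        """
        if checkpoint_id <= self.last_committed.get(app_id, -1):
            return False  # replay after a restart: nothing is duplicated
        self.rows.extend(rows)
        self.last_committed[app_id] = checkpoint_id
        return True
```

If a pipeline restarts from checkpoint 1 and tries to commit the same batch again, the second `commit` call is a no-op, so the table content matches what a failure-free run would have written.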
by
Scott Sandre & Denny Lee
How to build a streaming Lakehouse with Flink, Kafka, and Hudi - Flink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
12. PinStats Analytic
Use case
“Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”
13. Creator Content
Use cases
Fast user signals: Make user content signals available quickly after content creation
Safety: Reduce levels of unsafe content as close to content creation time as possible
Content Creation
Audience Targeting
Content Understanding
Quality
Interests & Annotations
Embeddings
Performance
17. Xenon Jobs / Hermez workloads
154
Production Xenon use cases
>90
179
Deployments every day
18. Highlights
Stability and Tier 1 support
● Enhanced JSS State Machine
● Supported job level dedicated S3 buckets
User experience
● Hermez supported most recent checkpoint deployment
● Hermez supported kill job and distributed shell
● Enriched savepoint information on Hermez
● Track daily & monthly deployment success rate
Metrics
● Job submission latency
19. Xenon Job Management Service
Monitoring
● Job Status
● Critical metrics (QPS)
● Checkpointing health
● Job/task health
● Notify users
Auto Recovery
Auto recover failed jobs from:
● Last completed checkpoint
● Most recent savepoint
● Fresh State
AZ Failure Resilience
Auto failover jobs to backup clusters in different AZs when primary cluster/AZ goes down
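The recovery and failover behavior on this slide can be sketched as two small selection functions. This is a minimal illustration, assuming the three recovery sources are tried in the order listed, which the slide does not state explicitly. `JobState`, `choose_restore_path`, and `choose_cluster` are hypothetical names, not part of the Xenon Job Management Service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobState:
    # Hypothetical storage paths; None means no such snapshot exists.
    last_checkpoint: Optional[str] = None
    last_savepoint: Optional[str] = None


def choose_restore_path(state):
    """Pick a restore source, treating the slide's order as a preference order."""
    if state.last_checkpoint:
        return ("checkpoint", state.last_checkpoint)
    if state.last_savepoint:
        return ("savepoint", state.last_savepoint)
    return ("fresh", None)


def choose_cluster(primary, backups, healthy):
    """Prefer the primary cluster; otherwise fall back to the first healthy backup."""
    for cluster in [primary, *backups]:
        if cluster in healthy:
            return cluster
    return None  # no healthy cluster available
```

A job with no checkpoint but an existing savepoint would restart from the savepoint, and a job whose primary cluster is missing from the healthy set would be placed on the first healthy backup.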
25. VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill)
● User code focuses only on Business logic. ✅
● Tune Flink operators using configs. ✅
● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧
Xenon
Flink
Application
Code
Config
37. Questions?
Anumol Sebastian
Chenqi Liu
Hannah Chen
Divye Kapoor
Kanchi Masalia
Lu Niu
Rainie Li
Teja Thotapalli
Nishant More
Samuel Bahr
Heng Zhang
Kevin Browne
Sergii Marchenko
Ashish Jhaveri
Dinesh Kumar Sekar
Chen Qin
Shaowen Wang
YOU?!