The document discusses how businesses can adopt a streaming-first approach using stream processing tools such as Apache Kafka and KSQL. It argues that traditional databases are ill-suited to analyzing real-time streaming data and processing events. KSQL is presented as an open source tool that lets users write SQL queries against streaming data in Kafka, supporting features such as continuous queries, stream-table joins, and streaming materialized views. The document closes with a demo of using KSQL for real-time anomaly detection on streaming web user data.
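To make that concrete, here is a minimal sketch of the kind of continuous query the demo describes, submitted through the ksqlDB Java client (which postdates the original KSQL release covered here); the stream, column names, and threshold are all hypothetical:

```java
// Hypothetical anomaly-detection rule: flag any user producing more than
// 100 page views in a 30-second window. The query runs continuously as new
// events arrive, maintaining a streaming materialized view.
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class AnomalyDetection {
    public static void main(String[] args) {
        Client client = Client.create(
                ClientOptions.create().setHost("localhost").setPort(8088));

        client.executeStatement(
                "CREATE TABLE possible_anomalies AS "
              + "  SELECT userid, COUNT(*) AS view_count "
              + "  FROM pageviews "
              + "  WINDOW TUMBLING (SIZE 30 SECONDS) "
              + "  GROUP BY userid "
              + "  HAVING COUNT(*) > 100 "
              + "  EMIT CHANGES;").join();

        client.close();
    }
}
```

The derived table updates continuously as new page views arrive, which is what makes it a streaming materialized view rather than a one-shot query.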
Neha Narkhede talks about LinkedIn's experience moving from batch-oriented ETL to real-time streams using Apache Kafka, and how the goal of acting as a real-time platform for event data drove Kafka's design and implementation. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn while supporting thousands of engineers.
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights into their business but also to power smart apps that must react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern, yet simple, data architecture for analyzing both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to processing data streams coming from Kafka.
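As an illustration only (host, topic, table, and credential names are assumptions), a SingleStore pipeline that continuously ingests a Kafka topic can be created over SingleStore's MySQL-compatible JDBC interface:

```java
// Sketch: create and start a SingleStore pipeline that pulls a Kafka topic
// straight into a table, so the data is queryable alongside history.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KafkaPipeline {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://singlestore-host:3306/analytics", "user", "pass");
             Statement stmt = conn.createStatement()) {

            // SingleStore Pipelines ingest from Kafka without a separate ETL job.
            stmt.execute(
                "CREATE PIPELINE clicks_pipeline AS "
              + "LOAD DATA KAFKA 'kafka-broker:9092/clickstream' "
              + "INTO TABLE clicks "
              + "FIELDS TERMINATED BY ','");
            stmt.execute("START PIPELINE clicks_pipeline");
        }
    }
}
```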
Speaker: Ben Stopford, Technologist, Office of the CTO, Confluent. Are events the new API? Event-driven systems provide some unique properties, particularly for microservice architectures, as they can be used both for notification and for state transfer. This lets systems support a broad range of use cases that cross geographies, clouds and devices. In this talk we will look at what event-driven systems are, how they provide a unique contract for services to communicate and share data, and how stream processing tools can be used to simplify the interaction between different services, be they closely coupled or largely disconnected. Ben is a technologist working in the Office of the CTO at Confluent Inc (the company behind Apache Kafka®). He has worked on a wide range of projects, from implementing the latest version of Kafka’s replication protocol to developing strategies for streaming applications. Before Confluent, Ben led the design and build of a company-wide data platform for a large investment bank. His earlier career spanned a variety of projects at ThoughtWorks and UK-based enterprise companies. He is the author of the book “Designing Event-Driven Systems” (O’Reilly, 2018). Watch the recording: https://videos.confluent.io/watch/8MLuNHnE3uSZPgstdzSk4Q?.
The document provides an overview of leveraging mainframe data for modern analytics using Attunity Replicate and Confluent streaming platform powered by Apache Kafka. It discusses the history of mainframes and data migration, how Attunity enables real-time data migration from mainframes, the Confluent streaming platform for building applications using data streams, and how Attunity and Confluent can be combined to modernize analytics using mainframe data streams. Use cases discussed include query offloading and cross-system customer data integration.
For many industries, the need to group related events based on a period of activity or inactivity is key. Advertising businesses and content producers are just two examples of where session windows can be used to better understand user behavior. While such sessionization has been possible in Apache Kafka up to this point, implementing it has been rather complex and required low-level APIs. In the most recent release of Kafka, however, new capabilities make session windows much easier to implement. In this online talk, we’ll introduce the concept of a session window, discuss common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.
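For a sense of how much simpler this has become, here is a minimal sketch using the current Kafka Streams DSL, counting click events per user session with a 30-minute inactivity gap; the topic name, serdes, and broker address are assumptions:

```java
// Sessionization sketch: group clicks by user and close each session after
// 30 minutes of inactivity; count() yields events per user session.
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SessionWindows;

public class Sessionization {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> clicks = builder.stream(
                "clicks", Consumed.with(Serdes.String(), Serdes.String()));

        clicks.groupByKey()
              .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30)))
              .count()
              .toStream()
              .foreach((windowedUser, count) ->
                      System.out.printf("user=%s session-events=%d%n",
                              windowedUser.key(), count));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sessionization-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```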
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen first hand the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
This document provides an overview of KSQL, an open source streaming SQL engine for Apache Kafka. It describes the core concepts of Kafka and KSQL, including how KSQL can be used for streaming ETL, anomaly detection, real-time monitoring, and data transformation. It also discusses how KSQL fits into a streaming platform and can be run in both client-server and standalone modes.
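A hedged sketch of the streaming-ETL pattern the deck describes, again using the later ksqlDB Java client: a persistent query that continuously filters and casts one stream into a cleansed derivative (all stream and column names invented):

```java
// Streaming ETL sketch: derive a cleansed stream from raw orders by casting
// types and dropping test records. The query runs until explicitly terminated.
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class StreamingEtl {
    public static void main(String[] args) {
        Client client = Client.create(
                ClientOptions.create().setHost("localhost").setPort(8088));

        client.executeStatement(
                "CREATE STREAM orders_clean AS "
              + "  SELECT orderid, customerid, CAST(amount AS DOUBLE) AS amount "
              + "  FROM orders_raw "
              + "  WHERE source <> 'test' "
              + "  EMIT CHANGES;").join();

        client.close();
    }
}
```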
Today, many companies that have lots of data are still struggling to derive value from machine learning (ML) and data science investments. Why? Accessing the data may be difficult. Or maybe it’s poorly labeled. Or vital context is missing. Or there are questions around data integrity. Or standing up an ML service can be cumbersome and complex. At Nuuly, we offer an innovative clothing rental subscription model and are continually evolving our ML solutions to gain insight into the behaviors of our unique customer base as well as provide personalized services. In this session, I’ll share how we used event streaming with Apache Kafka® and Confluent Cloud to address many of the challenges that may be keeping your organization from maximizing the business value of machine learning and data science. First, you’ll see how we ensure that every customer interaction and its business context is collected. Next, I’ll explain how we can replay entire interaction histories using Kafka as a transport layer, a persistence layer, and a business application processing layer. Order management, inventory management, logistics, subscription management – all of it integrates with Kafka as the common backbone. These data streams enable Nuuly to rapidly prototype and deploy dynamic ML models to support various domains, including pricing, recommendations, product similarity, and warehouse optimization. Join us and learn how Kafka can help machine learning and data science initiatives that are not yet delivering their full potential.
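The replay pattern rests on Kafka retaining the event log. A minimal sketch, with hypothetical topic and group names, of a consumer rewinding to the start of a topic to re-read an entire interaction history:

```java
// Replay sketch: join the group, rewind all assigned partitions to offset 0,
// then stream the full history into downstream training or feature pipelines.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayHistory {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ml-replay");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-interactions"));
            consumer.poll(Duration.ofSeconds(1));            // obtain partition assignment
            consumer.seekToBeginning(consumer.assignment()); // rewind to the start

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Feed historical events into model training / feature pipelines.
                    System.out.printf("%s -> %s%n", r.key(), r.value());
                }
            }
        }
    }
}
```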
StreamSets can process data using Apache Spark in three ways: 1) The Spark Evaluator stage allows user-provided Spark code to run on each batch of records in a pipeline and return results or errors. 2) A Cluster Pipeline can leverage Apache Spark's Direct Kafka DStream to partition data from Kafka across worker pipelines on a cluster. 3) A Spark Executor can kick off a Spark application when an event is received, allowing tasks like model updating to run on streaming data using Spark.
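To illustrate the first mode, here is the general shape of user-provided Spark code operating on one batch of records and separating results from errors. This is a generic Spark sketch in local mode, not StreamSets' actual SparkTransformer interface:

```java
// Illustrative only: records that parse become results; the rest are routed
// as error records, mirroring the Spark Evaluator's results-or-errors contract.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BatchTransform {

    private static boolean isNumeric(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("batch-transform").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // One "batch" of records, as a pipeline stage would hand them over.
            JavaRDD<String> batch = sc.parallelize(Arrays.asList("42", "7", "oops"));

            JavaRDD<String> results = batch.filter(BatchTransform::isNumeric)
                                           .map(s -> "parsed:" + Integer.parseInt(s));
            JavaRDD<String> errors = batch.filter(s -> !isNumeric(s));

            results.collect().forEach(System.out::println);
            errors.collect().forEach(e -> System.err.println("error record: " + e));
        }
    }
}
```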
This document discusses the need for a stream registry to manage streaming data. It summarizes the evolution of a company's use of streaming from an initial problem in 2015 to developing a microservices architecture in 2016 and introducing a stream registry in 2017 to allow for self-service streams across multiple regions, clouds, and companies. The stream registry provides functions like stream registration, discovery, health monitoring, and throttling to help democratize access to streams.
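None of the following types come from a real library; they are a hypothetical sketch of the self-service surface such a registry might expose, covering the registration, discovery, and health functions listed above:

```java
// Hypothetical stream-registry API: teams register streams with ownership
// and placement metadata, discover them by domain, and check health/throttling.
import java.util.List;
import java.util.Optional;

public interface StreamRegistry {

    /** Register (or update) a stream with its schema, owner, and region/cloud placement. */
    void register(StreamDescriptor descriptor);

    /** Discover streams by domain so teams can find data without asking around. */
    List<StreamDescriptor> discover(String domain);

    /** Health of the underlying topic(s): availability, consumer lag, throttling status. */
    Optional<StreamHealth> health(String streamName);

    record StreamDescriptor(String name, String domain, String owner,
                            String schemaRef, List<String> regions) {}

    record StreamHealth(boolean available, long consumerLag, boolean throttled) {}
}
```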
LinkedIn uses Apache Kafka extensively to power various data pipelines and platforms. Some key uses of Kafka include: 1) Moving data between systems for monitoring, metrics, search indexing, and more. 2) Powering the Pinot real-time analytics query engine which handles billions of documents and queries per day. 3) Enabling replication and partitioning for the Espresso NoSQL data store using a Kafka-based approach. 4) Streaming data processing using Samza to handle workflows like user profile evaluation. Samza is used for both stateless and stateful stream processing at LinkedIn.
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and the high cardinality of unique accounts. Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Because real-time transactions are constantly changing, it is impossible to pre-compute the analytics as a fixed time series. We overcame the challenge by creating a real-time key-value store inside Pinot that can sustain half a million QPS across all financial transactions. We will talk about the details of our solution and the interesting technical challenges we faced.
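For readers unfamiliar with the model, a toy illustration of double-entry bookkeeping (all names and amounts invented): every transaction posts entries that sum to zero, an invariant that can be enforced at write time:

```java
// Toy double-entry ledger: each transaction carries balanced debit/credit
// entries, so the ledger as a whole always sums to zero.
import java.util.List;

public class DoubleEntry {
    record Entry(String account, long amountCents) {} // positive = debit, negative = credit

    record Transaction(String id, List<Entry> entries) {
        Transaction { // compact constructor enforces the balance invariant
            long sum = entries.stream().mapToLong(Entry::amountCents).sum();
            if (sum != 0) throw new IllegalArgumentException("unbalanced by " + sum);
        }
    }

    public static void main(String[] args) {
        // A $25.00 charge: debit the merchant's receivable, credit cash-in-transit.
        Transaction txn = new Transaction("txn_123", List.of(
                new Entry("merchant_receivable", 2500),
                new Entry("cash_in_transit", -2500)));
        System.out.println(txn);
    }
}
```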
Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing. In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Kafka Streams and Apache Flink.
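As one concrete example of tackling the exactly-once challenge in a Kafka-based Kappa pipeline, Kafka Streams exposes its transactional guarantee as a single configuration switch. A minimal sketch with placeholder topic and broker names:

```java
// Exactly-once sketch: transactions make the consume-process-produce cycle
// atomic, so the trivial pass-through topology below neither loses nor
// duplicates records on failure.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOncePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kappa-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                  StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events").to("events-processed"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```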
If a real-time dashboard takes 5 minutes to refresh, it’s not real-time. With data lakes increasingly holding massive amounts of unprocessed data, delivering low-latency analytics is not for the faint-hearted. Learn how to stream massive volumes of data from Kafka, volumes that used to be impossible to handle, to serve real-time applications using lake-scale-optimized approaches to storage and indexing.
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
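To show the framework's contract, here is a minimal sketch of a custom source task: Kafka Connect repeatedly calls poll() and writes the returned records to Kafka, handling offsets, scaling, and fault tolerance itself. The "external system" here is a stub, and a full connector would also supply a SourceConnector class:

```java
// Minimal Kafka Connect source task: poll an external system and return
// SourceRecords; the framework persists them (and the offsets) to Kafka.
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class StubSourceTask extends SourceTask {
    private String topic;

    @Override
    public void start(Map<String, String> config) {
        topic = config.get("topic"); // connector config is plain string key/values
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // pretend to wait on the external system
        Map<String, ?> partition = Collections.singletonMap("source", "stub");
        Map<String, ?> offset = Collections.singletonMap("position", System.currentTimeMillis());
        return List.of(new SourceRecord(partition, offset, topic,
                Schema.STRING_SCHEMA, "hello from the external system"));
    }

    @Override
    public void stop() {}

    @Override
    public String version() { return "0.1"; }
}
```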
Speaker: Neil Avery, Technologist, Office of the CTO, Confluent. Stream processing is now at the forefront of many company strategies. Over the last couple of years we have seen streaming use cases explode and spread across the landscape of any modern business. Use cases including digital transformation, IoT, real-time risk, payments, microservices and machine learning are all built on the fundamental requirement for fast data at scale. Apache Kafka® has long been the streaming platform of choice; its origins as dumb pipes for big data have long since been left behind. Stream processing beckons as the vehicle for driving those streams, and with it comes a world of real-time semantics around windowing, joining, correctness, elasticity, and accessibility. ‘The current state of stream processing’ walks through the origins of stream processing and applicable use cases, then dives into the challenges currently facing the world of stream processing as it drives the next data revolution. Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of Apache Kafka. He has over 20 years of experience working on distributed computing, messaging and stream processing. He has built or redesigned commercial messaging platforms and distributed caching products, as well as developing large-scale bespoke systems for tier-1 banks. After a period at ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In 2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to joining Confluent he was the CTO at a fintech consultancy. Watch the recording: https://videos.confluent.io/watch/rmU6GHrd4EKFaZrRhdTE3s?.
Apache Kafka is critical to PayPal's analytics platform, handling a stream of over 20 billion events per day across 300 partitions. To democratize access to analytics data, PayPal built a Connect platform that leverages Kafka to process and send data in real time to the tools of customers' choice. The platform scales to process over 40 billion events daily using a reactive architecture, with Akka and the Alpakka Kafka connector used to consume and publish events within Akka Streams. Challenges include throughput being limited by partition counts and issues that required tuning for optimal performance.
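A hedged sketch of the reactive consumption pattern described, using the Alpakka Kafka connector's Java DSL; the topic, group, and system names are assumptions:

```java
// Backpressured consumption sketch: an Akka Stream reads a Kafka topic and
// pushes events onward; backpressure propagates from the sink up to the
// Kafka poll loop, which is what keeps high-volume fan-out stable.
import akka.actor.ActorSystem;
import akka.kafka.ConsumerSettings;
import akka.kafka.Subscriptions;
import akka.kafka.javadsl.Consumer;
import akka.stream.javadsl.Sink;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReactiveConsumer {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("connect-platform");

        ConsumerSettings<String, String> settings =
                ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
                        .withBootstrapServers("localhost:9092")
                        .withGroupId("connect-fanout");

        Consumer.plainSource(settings, Subscriptions.topics("analytics-events"))
                .map(record -> record.value())
                .runWith(Sink.foreach(System.out::println), system);
    }
}
```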
This document discusses using event streams as the system of record for data, rather than traditional databases. It argues that streams can serve as the single source of truth for data, providing benefits like data lineage, auditing, and integrity. It also describes how healthcare company Liaison uses a streaming platform from MapR to power their data integration platform, gaining the advantages of streams while meeting various compliance requirements.