The document discusses how businesses can adopt a streaming-first approach using stream processing tools such as Apache Kafka and KSQL. It argues that traditional databases are ill-suited to analyzing real-time streaming data and processing events. KSQL is presented as an open source tool that lets users write SQL queries against streaming data in Kafka, supporting features such as continuous queries, stream-table joins, and streaming materialized views. The document closes with a demo of using KSQL for real-time anomaly detection on streaming web user data.
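To make that concrete, here is a minimal sketch of the kind of continuous query the demo describes, submitted through the ksqlDB Java client (which postdates the original KSQL release covered here); the stream, column names, and threshold are all hypothetical:

```java
// Hypothetical anomaly-detection rule: flag any user producing more than
// 100 page views in a 30-second window. The query runs continuously as new
// events arrive, maintaining a streaming materialized view.
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class AnomalyDetection {
    public static void main(String[] args) {
        Client client = Client.create(
                ClientOptions.create().setHost("localhost").setPort(8088));

        client.executeStatement(
                "CREATE TABLE possible_anomalies AS "
              + "  SELECT userid, COUNT(*) AS view_count "
              + "  FROM pageviews "
              + "  WINDOW TUMBLING (SIZE 30 SECONDS) "
              + "  GROUP BY userid "
              + "  HAVING COUNT(*) > 100 "
              + "  EMIT CHANGES;").join();

        client.close();
    }
}
```

The derived table updates continuously as new page views arrive, which is what makes it a streaming materialized view rather than a one-shot query.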
Neha Narkhede talks about LinkedIn's experience moving from batch-oriented ETL to real-time streams using Apache Kafka, and how the goal of acting as a real-time platform for event data drove Kafka's design and implementation. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn while supporting thousands of engineers.
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights into their business but also to power smart apps that must react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern, yet simple, data architecture for analyzing both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to processing data streams coming from Kafka.
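As an illustration only (host, topic, table, and credential names are assumptions), a SingleStore pipeline that continuously ingests a Kafka topic can be created over SingleStore's MySQL-compatible JDBC interface:

```java
// Sketch: create and start a SingleStore pipeline that pulls a Kafka topic
// straight into a table, so the data is queryable alongside history.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KafkaPipeline {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://singlestore-host:3306/analytics", "user", "pass");
             Statement stmt = conn.createStatement()) {

            // SingleStore Pipelines ingest from Kafka without a separate ETL job.
            stmt.execute(
                "CREATE PIPELINE clicks_pipeline AS "
              + "LOAD DATA KAFKA 'kafka-broker:9092/clickstream' "
              + "INTO TABLE clicks "
              + "FIELDS TERMINATED BY ','");
            stmt.execute("START PIPELINE clicks_pipeline");
        }
    }
}
```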
Speaker: Ben Stopford, Technologist, Office of the CTO, Confluent. Are events the new API? Event-driven systems provide some unique properties, particularly for microservice architectures, as they can be used both for notification and for state transfer. This lets systems support a broad range of use cases that cross geographies, clouds and devices. In this talk we will look at what event-driven systems are, how they provide a unique contract for services to communicate and share data, and how stream processing tools can be used to simplify the interaction between different services, be they closely coupled or largely disconnected. Ben is a technologist working in the Office of the CTO at Confluent Inc (the company behind Apache Kafka®). He has worked on a wide range of projects, from implementing the latest version of Kafka’s replication protocol to developing strategies for streaming applications. Before Confluent, Ben led the design and build of a company-wide data platform for a large investment bank. His earlier career spanned a variety of projects at ThoughtWorks and UK-based enterprise companies. He is the author of the book “Designing Event-Driven Systems” (O’Reilly, 2018). Watch the recording: https://videos.confluent.io/watch/8MLuNHnE3uSZPgstdzSk4Q?.
The document provides an overview of leveraging mainframe data for modern analytics using Attunity Replicate and Confluent streaming platform powered by Apache Kafka. It discusses the history of mainframes and data migration, how Attunity enables real-time data migration from mainframes, the Confluent streaming platform for building applications using data streams, and how Attunity and Confluent can be combined to modernize analytics using mainframe data streams. Use cases discussed include query offloading and cross-system customer data integration.
For many industries, the need to group related events based on a period of activity or inactivity is key. Advertising businesses and content producers are just two examples of where session windows can be used to better understand user behavior. While such sessionization has been possible in Apache Kafka up to this point, implementing it has been rather complex and required low-level APIs. In the most recent release of Kafka, however, new capabilities make session windows much easier to implement. In this online talk, we’ll introduce the concept of a session window, discuss common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.
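For a sense of how much simpler this has become, here is a minimal sketch using the current Kafka Streams DSL, counting click events per user session with a 30-minute inactivity gap; the topic name, serdes, and broker address are assumptions:

```java
// Sessionization sketch: group clicks by user and close each session after
// 30 minutes of inactivity; count() yields events per user session.
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SessionWindows;

public class Sessionization {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> clicks = builder.stream(
                "clicks", Consumed.with(Serdes.String(), Serdes.String()));

        clicks.groupByKey()
              .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30)))
              .count()
              .toStream()
              .foreach((windowedUser, count) ->
                      System.out.printf("user=%s session-events=%d%n",
                              windowedUser.key(), count));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sessionization-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```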
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen first hand the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
This document provides an overview of KSQL, an open source streaming SQL engine for Apache Kafka. It describes the core concepts of Kafka and KSQL, including how KSQL can be used for streaming ETL, anomaly detection, real-time monitoring, and data transformation. It also discusses how KSQL fits into a streaming platform and can be run in both client-server and standalone modes.
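A hedged sketch of the streaming-ETL pattern the deck describes, again using the later ksqlDB Java client: a persistent query that continuously filters and casts one stream into a cleansed derivative (all stream and column names invented):

```java
// Streaming ETL sketch: derive a cleansed stream from raw orders by casting
// types and dropping test records. The query runs until explicitly terminated.
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class StreamingEtl {
    public static void main(String[] args) {
        Client client = Client.create(
                ClientOptions.create().setHost("localhost").setPort(8088));

        client.executeStatement(
                "CREATE STREAM orders_clean AS "
              + "  SELECT orderid, customerid, CAST(amount AS DOUBLE) AS amount "
              + "  FROM orders_raw "
              + "  WHERE source <> 'test' "
              + "  EMIT CHANGES;").join();

        client.close();
    }
}
```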
Today, many companies that have lots of data are still struggling to derive value from machine learning (ML) and data science investments. Why? Accessing the data may be difficult. Or maybe it’s poorly labeled. Or vital context is missing. Or there are questions around data integrity. Or standing up an ML service can be cumbersome and complex. At Nuuly, we offer an innovative clothing rental subscription model and are continually evolving our ML solutions to gain insight into the behaviors of our unique customer base as well as provide personalized services. In this session, I’ll share how we used event streaming with Apache Kafka® and Confluent Cloud to address many of the challenges that may be keeping your organization from maximizing the business value of machine learning and data science. First, you’ll see how we ensure that every customer interaction and its business context is collected. Next, I’ll explain how we can replay entire interaction histories using Kafka as a transport layer, a persistence layer, and a business application processing layer. Order management, inventory management, logistics, subscription management – all of it integrates with Kafka as the common backbone. These data streams enable Nuuly to rapidly prototype and deploy dynamic ML models to support various domains, including pricing, recommendations, product similarity, and warehouse optimization. Join us and learn how Kafka can help machine learning and data science initiatives that are not yet delivering their full potential.
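The replay pattern rests on Kafka retaining the event log. A minimal sketch, with hypothetical topic and group names, of a consumer rewinding to the start of a topic to re-read an entire interaction history:

```java
// Replay sketch: join the group, rewind all assigned partitions to offset 0,
// then stream the full history into downstream training or feature pipelines.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayHistory {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ml-replay");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-interactions"));
            consumer.poll(Duration.ofSeconds(1));            // obtain partition assignment
            consumer.seekToBeginning(consumer.assignment()); // rewind to the start

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Feed historical events into model training / feature pipelines.
                    System.out.printf("%s -> %s%n", r.key(), r.value());
                }
            }
        }
    }
}
```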
StreamSets can process data using Apache Spark in three ways: 1) The Spark Evaluator stage allows user-provided Spark code to run on each batch of records in a pipeline and return results or errors. 2) A Cluster Pipeline can leverage Apache Spark's Direct Kafka DStream to partition data from Kafka across worker pipelines on a cluster. 3) A Spark Executor can kick off a Spark application when an event is received, allowing tasks like model updating to run on streaming data using Spark.
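To illustrate the first mode, here is the general shape of user-provided Spark code operating on one batch of records and separating results from errors. This is a generic Spark sketch in local mode, not StreamSets' actual SparkTransformer interface:

```java
// Illustrative only: records that parse become results; the rest are routed
// as error records, mirroring the Spark Evaluator's results-or-errors contract.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BatchTransform {

    private static boolean isNumeric(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("batch-transform").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // One "batch" of records, as a pipeline stage would hand them over.
            JavaRDD<String> batch = sc.parallelize(Arrays.asList("42", "7", "oops"));

            JavaRDD<String> results = batch.filter(BatchTransform::isNumeric)
                                           .map(s -> "parsed:" + Integer.parseInt(s));
            JavaRDD<String> errors = batch.filter(s -> !isNumeric(s));

            results.collect().forEach(System.out::println);
            errors.collect().forEach(e -> System.err.println("error record: " + e));
        }
    }
}
```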
This document discusses the need for a stream registry to manage streaming data. It summarizes the evolution of a company's use of streaming from an initial problem in 2015 to developing a microservices architecture in 2016 and introducing a stream registry in 2017 to allow for self-service streams across multiple regions, clouds, and companies. The stream registry provides functions like stream registration, discovery, health monitoring, and throttling to help democratize access to streams.
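None of the following types come from a real library; they are a hypothetical sketch of the self-service surface such a registry might expose, covering the registration, discovery, and health functions listed above:

```java
// Hypothetical stream-registry API: teams register streams with ownership
// and placement metadata, discover them by domain, and check health/throttling.
import java.util.List;
import java.util.Optional;

public interface StreamRegistry {

    /** Register (or update) a stream with its schema, owner, and region/cloud placement. */
    void register(StreamDescriptor descriptor);

    /** Discover streams by domain so teams can find data without asking around. */
    List<StreamDescriptor> discover(String domain);

    /** Health of the underlying topic(s): availability, consumer lag, throttling status. */
    Optional<StreamHealth> health(String streamName);

    record StreamDescriptor(String name, String domain, String owner,
                            String schemaRef, List<String> regions) {}

    record StreamHealth(boolean available, long consumerLag, boolean throttled) {}
}
```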
LinkedIn uses Apache Kafka extensively to power various data pipelines and platforms. Some key uses of Kafka include: 1) Moving data between systems for monitoring, metrics, search indexing, and more. 2) Powering the Pinot real-time analytics query engine which handles billions of documents and queries per day. 3) Enabling replication and partitioning for the Espresso NoSQL data store using a Kafka-based approach. 4) Streaming data processing using Samza to handle workflows like user profile evaluation. Samza is used for both stateless and stateful stream processing at LinkedIn.
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and the high cardinality of unique accounts. Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Because real-time transactions are constantly changing, it is impossible to pre-compute the analytics as a fixed time series. We overcame the challenge by creating a real-time key-value store inside Pinot that can sustain half a million QPS across all financial transactions. We will talk about the details of our solution and the interesting technical challenges we faced.
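For readers unfamiliar with the model, a toy illustration of double-entry bookkeeping (all names and amounts invented): every transaction posts entries that sum to zero, an invariant that can be enforced at write time:

```java
// Toy double-entry ledger: each transaction carries balanced debit/credit
// entries, so the ledger as a whole always sums to zero.
import java.util.List;

public class DoubleEntry {
    record Entry(String account, long amountCents) {} // positive = debit, negative = credit

    record Transaction(String id, List<Entry> entries) {
        Transaction { // compact constructor enforces the balance invariant
            long sum = entries.stream().mapToLong(Entry::amountCents).sum();
            if (sum != 0) throw new IllegalArgumentException("unbalanced by " + sum);
        }
    }

    public static void main(String[] args) {
        // A $25.00 charge: debit the merchant's receivable, credit cash-in-transit.
        Transaction txn = new Transaction("txn_123", List.of(
                new Entry("merchant_receivable", 2500),
                new Entry("cash_in_transit", -2500)));
        System.out.println(txn);
    }
}
```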
Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing. In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Kafka Streams and Apache Flink.
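As one concrete example of tackling the exactly-once challenge in a Kafka-based Kappa pipeline, Kafka Streams exposes its transactional guarantee as a single configuration switch. A minimal sketch with placeholder topic and broker names:

```java
// Exactly-once sketch: transactions make the consume-process-produce cycle
// atomic, so the trivial pass-through topology below neither loses nor
// duplicates records on failure.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOncePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kappa-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                  StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events").to("events-processed"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```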
If a real-time dashboard takes 5 minutes to refresh, it’s not real-time. With data lakes increasingly holding massive amounts of unprocessed data, delivering low-latency analytics is not for the faint-hearted. Learn how to stream massive volumes of data from Kafka, volumes that used to be impossible to handle, to serve real-time applications using lake-scale-optimized approaches to storage and indexing.
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
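To show the framework's contract, here is a minimal sketch of a custom source task: Kafka Connect repeatedly calls poll() and writes the returned records to Kafka, handling offsets, scaling, and fault tolerance itself. The "external system" here is a stub, and a full connector would also supply a SourceConnector class:

```java
// Minimal Kafka Connect source task: poll an external system and return
// SourceRecords; the framework persists them (and the offsets) to Kafka.
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class StubSourceTask extends SourceTask {
    private String topic;

    @Override
    public void start(Map<String, String> config) {
        topic = config.get("topic"); // connector config is plain string key/values
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // pretend to wait on the external system
        Map<String, ?> partition = Collections.singletonMap("source", "stub");
        Map<String, ?> offset = Collections.singletonMap("position", System.currentTimeMillis());
        return List.of(new SourceRecord(partition, offset, topic,
                Schema.STRING_SCHEMA, "hello from the external system"));
    }

    @Override
    public void stop() {}

    @Override
    public String version() { return "0.1"; }
}
```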
Speaker: Neil Avery, Technologist, Office of the CTO, Confluent. Stream processing is now at the forefront of many company strategies. Over the last couple of years we have seen streaming use cases explode and spread across the landscape of any modern business. Use cases including digital transformation, IoT, real-time risk, payments, microservices and machine learning are all built on the fundamental requirement for fast data at scale. Apache Kafka® has long been the streaming platform of choice; its origins as dumb pipes for big data have long since been left behind. Stream processing beckons as the vehicle for driving those streams, and with it comes a world of real-time semantics around windowing, joining, correctness, elasticity, and accessibility. ‘The current state of stream processing’ walks through the origins of stream processing and applicable use cases, then dives into the challenges currently facing the world of stream processing as it drives the next data revolution. Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of Apache Kafka. He has over 20 years of experience working on distributed computing, messaging and stream processing. He has built or redesigned commercial messaging platforms and distributed caching products, as well as developing large-scale bespoke systems for tier-1 banks. After a period at ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In 2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to joining Confluent he was the CTO at a fintech consultancy. Watch the recording: https://videos.confluent.io/watch/rmU6GHrd4EKFaZrRhdTE3s?.
Apache Kafka is critical to PayPal's analytics platform, handling a stream of over 20 billion events per day across 300 partitions. To democratize access to analytics data, PayPal built a Connect platform that leverages Kafka to process and send data in real time to the tools of customers' choice. The platform scales to process over 40 billion events daily using a reactive architecture, with Akka and the Alpakka Kafka connector used to consume and publish events within Akka Streams. Challenges include throughput being limited by partition counts and issues that required tuning for optimal performance.
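A hedged sketch of the reactive consumption pattern described, using the Alpakka Kafka connector's Java DSL; the topic, group, and system names are assumptions:

```java
// Backpressured consumption sketch: an Akka Stream reads a Kafka topic and
// pushes events onward; backpressure propagates from the sink up to the
// Kafka poll loop, which is what keeps high-volume fan-out stable.
import akka.actor.ActorSystem;
import akka.kafka.ConsumerSettings;
import akka.kafka.Subscriptions;
import akka.kafka.javadsl.Consumer;
import akka.stream.javadsl.Sink;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReactiveConsumer {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("connect-platform");

        ConsumerSettings<String, String> settings =
                ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
                        .withBootstrapServers("localhost:9092")
                        .withGroupId("connect-fanout");

        Consumer.plainSource(settings, Subscriptions.topics("analytics-events"))
                .map(record -> record.value())
                .runWith(Sink.foreach(System.out::println), system);
    }
}
```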
This document discusses using event streams as the system of record for data, rather than traditional databases. It argues that streams can serve as the single source of truth for data, providing benefits like data lineage, auditing, and integrity. It also describes how healthcare company Liaison uses a streaming platform from MapR to power their data integration platform, gaining the advantages of streams while meeting various compliance requirements.