Stream processing using Kafka

Himani Arora & Prabhat Kashyap
Software Consultant
@_himaniarora @pk_official

Who are we?
Himani Arora
@_himaniarora
Software Consultant @ Knoldus Software LLP
Contributed to Apache Kafka, Jupyter,
Apache CarbonData, Lightbend Lagom, etc.
Currently learning Apache Kafka
Prabhat Kashyap
@pk_official
Software Consultant @ Knoldus Software LLP
Contributed to Apache Kafka, Apache
CarbonData, and Lightbend Templates
Currently learning Apache Kafka

Agenda
● What is Stream processing
● Paradigms of programming
● Stream Processing with Kafka
● What are Kafka Streams
● Inside Kafka Streams
● Demonstration of stream processing using Kafka Streams
● Overview of Kafka Connect
● Demo with Kafka Connect

What is stream processing?
● Real-time processing of data
● Does not treat data as static tables or files
● Data has to be processed fast, so that a firm can react to
changing business conditions in real time. This is required
for trading, fraud detection, system monitoring, and many
other examples.
● A “too late architecture” cannot realize these use cases.

3 PARADIGMS OF PROGRAMMING
● REQUEST/RESPONSE
● BATCH SYSTEMS
● STREAM PROCESSING

STREAM PROCESSING with KAFKA
2 APPROACHES:
● DO IT YOURSELF (DIY!) STREAM PROCESSING
● STREAM PROCESSING FRAMEWORK

DIY STREAM PROCESSING
Major Challenges:
● FAULT TOLERANCE
● PARTITIONING AND SCALABILITY
● TIME
● STATE
● REPROCESSING

STREAM PROCESSING FRAMEWORK
Several stream processing frameworks are already available:
SPARK
STORM
SAMZA
FLINK ETC...

KAFKA STREAMS : ANOTHER WAY OF STREAM PROCESSING

Let’s start with Kafka Streams. But wait... what is Kafka?

Hello! Apache Kafka
● Apache Kafka is an open source project under the Apache License
2.0.
● Apache Kafka was originally developed by LinkedIn.
● On 23 October 2012, Apache Kafka graduated from the Apache Incubator
to a top-level project.
● Components of Apache Kafka
○ Producer
○ Consumer
○ Broker
○ Topic
○ Data
○ Parallelism

What is Kafka Streams
● It is the Streams API of Apache Kafka, available through a Java library.
● Kafka Streams is built on top of functionality provided by Kafka.
● It is, by deliberate design, tightly integrated with Apache Kafka.
● It can be used to build highly scalable, elastic, fault-tolerant, distributed
applications and microservices.
● Kafka Streams API allows you to create real-time applications.
● It is the easiest yet the most powerful technology to process data stored
in Kafka.

If we look closer
● A key motivation of the Kafka Streams API is to bring stream processing out of
the Big Data niche into the world of mainstream application development.
● Using the Kafka Streams API you can implement standard Java applications to
solve your stream processing needs.
● Your applications are fully elastic: you can run one or more instances of your
application.
● The Kafka Streams API takes a lightweight, integrative approach: “Build
applications, not infrastructure!”
● Deployment-wise, you are free to choose any technology that can deploy Java
applications.

Capabilities of Kafka Streams
● Powerful
○ Makes your applications highly scalable, elastic, distributed, and
fault-tolerant
○ Stateful and stateless processing
○ Event-time processing with windowing, joins, aggregations
● Lightweight
○ Low barrier to entry
○ No processing cluster required
○ No external dependencies other than Apache Kafka

Capabilities of Kafka Streams
● Real-time
○ Millisecond processing latency
○ Record-at-a-time processing (no micro-batching)
○ Seamlessly handles late-arriving and out-of-order data
○ High throughput
● Fully integrated
○ 100% compatible with Apache Kafka 0.10.2 and 0.10.1
○ Easy to integrate into existing applications and microservices
○ Runs everywhere: on-premises, public clouds, private clouds, containers, etc.
○ Integrates with databases through continuous change data capture (CDC) performed by
Kafka Connect

Key concepts of Kafka Streams
● Stateful Stream Processing
● KStream
● KTable
● Time
● Aggregations
● Joins
● Windowing

● Stateful Stream Processing
– Some stream processing applications don’t require state – they
are stateless.
– In practice, however, most applications require state – they are
stateful.
– The state must be managed in a fault-tolerant manner.
– An application is stateful whenever, for example, it needs to join,
aggregate, or window its input data.

● KStream
– A KStream is an abstraction of a record stream.
– Each data record represents a self-contained datum in the
unbounded data set.
– Using the table analogy, data records in a record stream are
always interpreted as an “INSERT”.
– Let’s imagine the following two data records are being sent to
the stream:
("alice", 1) --> ("alice", 3)

● KTable
– A KTable is an abstraction of a changelog stream.
– Each data record represents an update.
– Using the table analogy, data records in a changelog stream are
always interpreted as an “UPDATE”.
– Let’s imagine the following two data records are being sent to
the stream:
("alice", 1) --> ("alice", 3)
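The two interpretations give different answers for the same two records. The sketch below is plain Python, not the Kafka Streams API, and the function names are hypothetical. It sums the values per key, once reading the records as a record stream and once as a changelog stream.

```python
def sum_as_record_stream(records):
    """Every record is an INSERT: all values for a key are kept and summed."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


def sum_as_changelog(records):
    """Every record is an UPDATE: a later value replaces the earlier one."""
    latest = {}
    for key, value in records:
        latest[key] = value
    # Only one value survives per key, so the per-key sum is that value.
    return latest


records = [("alice", 1), ("alice", 3)]
print(sum_as_record_stream(records))  # {'alice': 4}
print(sum_as_changelog(records))      # {'alice': 3}
```

Read as a record stream, alice ends up with 4 (1 + 3). Read as a changelog stream, the second record overwrites the first and alice ends up with 3.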

● Time
– A critical aspect in stream processing is the notion of time.
– Kafka Streams supports the following notions of time:
● Event Time
● Processing Time
● Ingestion Time
– Kafka Streams assigns a timestamp to every data record via
so-called timestamp extractors.
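To make the three notions concrete, here is a minimal plain-Python sketch of what a timestamp extractor decides. It does not use the Kafka Streams API. The `Record` class, its field names, and the three functions are assumptions made up for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: dict               # payload; here it embeds when the event occurred
    broker_timestamp_ms: int  # hypothetical field: when the broker stored the record


def event_time(record):
    """Event time: when the event happened, read from the payload itself."""
    return record.value["occurred_at_ms"]


def ingestion_time(record):
    """Ingestion time: when the record was stored by the broker."""
    return record.broker_timestamp_ms


def processing_time(record):
    """Processing time: the wall clock at the moment the record is processed."""
    return int(time.time() * 1000)


record = Record("alice", {"occurred_at_ms": 1_000, "page": "home"}, broker_timestamp_ms=1_250)
print(event_time(record), ingestion_time(record))  # 1000 1250
```

Which function is used as the extractor determines how windows and joins see the same record.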

● Aggregations
– An aggregation operation takes one input stream or table, and
yields a new table.
– It is done by combining multiple input records into a single
output record.
– In the Kafka Streams DSL, an input stream of an aggregation
operation can be a KStream or a KTable, but the output
stream will always be a KTable.
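A count per key is the simplest aggregation. The sketch below is plain Python with hypothetical names, not the Kafka Streams DSL. It shows how a stream of input records collapses into a table with one row per key, and how every change to that table can itself be seen as an update record.

```python
def count_by_key(stream):
    """Aggregate a record stream into a table of counts, one row per key."""
    table = {}       # current state of the result table
    changelog = []   # every update made to the table, in order
    for key, _value in stream:
        table[key] = table.get(key, 0) + 1
        changelog.append((key, table[key]))
    return table, changelog


clicks = [("alice", "home"), ("bob", "cart"), ("alice", "checkout")]
table, changelog = count_by_key(clicks)
print(table)      # {'alice': 2, 'bob': 1}
print(changelog)  # [('alice', 1), ('bob', 1), ('alice', 2)]
```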

● Joins
– A join operation merges two input streams and/or tables based
on the keys of their data records, and yields a new
stream/table.
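As an illustration, the following plain-Python sketch joins a stream with a table on the record key. It is not the Kafka Streams API. The function name and the sample data are made up, and the table is treated as a fixed snapshot for simplicity.

```python
def join_stream_with_table(stream, table, joiner):
    """Inner join: emit a combined record for every stream record whose key is in the table."""
    joined = []
    for key, value in stream:
        if key in table:
            joined.append((key, joiner(value, table[key])))
    return joined


orders = [("alice", "book"), ("carol", "pen"), ("bob", "lamp")]
regions = {"alice": "EU", "bob": "US"}

enriched = join_stream_with_table(orders, regions, lambda item, region: f"{item}@{region}")
print(enriched)  # [('alice', 'book@EU'), ('bob', 'lamp@US')]
```

The record for carol is dropped because the table has no matching key.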

● Windowing
– Windowing lets you control how to group records that have the same
key for stateful operations such as aggregations or joins into
so-called windows.
– Windows are tracked per record key.
– When working with windows, you can specify a retention period for
the window.
– This retention period controls how long Kafka Streams will wait for
out-of-order or late-arriving data records for a given window.
– If a record arrives after the retention period of a window has passed,
the record is discarded and will not be processed in that window.
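The rules above can be sketched in plain Python as a windowed count with fixed-size, non-overlapping windows. This is not the Kafka Streams API. The expiry rule used here (a window expires once the highest timestamp seen so far reaches window start plus retention) is a simplifying assumption, and all names are hypothetical.

```python
from collections import defaultdict


def windowed_count(records, size_ms, retention_ms):
    """Count records per (key, window); discard records whose window has expired."""
    counts = defaultdict(int)  # (key, window_start) -> count
    dropped = []
    stream_time = 0            # highest timestamp seen so far
    for key, ts in records:
        stream_time = max(stream_time, ts)
        window_start = ts - ts % size_ms
        if window_start + retention_ms <= stream_time:
            dropped.append((key, ts))  # arrived after the retention period
            continue
        counts[(key, window_start)] += 1
    return dict(counts), dropped


records = [
    ("alice", 1_000), ("alice", 4_000), ("bob", 6_000),
    ("alice", 2_000),   # out of order, but its window is still retained
    ("alice", 20_000),
    ("alice", 3_000),   # too late: window [0, 5000) has expired
]
counts, dropped = windowed_count(records, size_ms=5_000, retention_ms=10_000)
print(counts)   # {('alice', 0): 3, ('bob', 5000): 1, ('alice', 20000): 1}
print(dropped)  # [('alice', 3000)]
```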

Stream Partitions and Tasks
● Each stream partition is a totally ordered sequence of data records and
maps to a Kafka topic partition.
● A data record in the stream maps to a Kafka message from that topic.
● The keys of data records determine the partitioning of data in both Kafka
and Kafka Streams, i.e., how data is routed to specific partitions within
topics.
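Key-based routing can be sketched as follows. This is plain Python. `zlib.crc32` is only a stand-in hash chosen for illustration and is not the hash Kafka's own partitioner uses. The function names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition; equal keys always map to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, num_partitions):
    """Append each record to the partition chosen by its key, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [("alice", 1), ("bob", 2), ("alice", 3)]
partitions = route(records, num_partitions=3)
for number, content in enumerate(partitions):
    print(number, content)
```

Because both records for alice land in the same partition, their relative order is preserved for whoever processes that partition.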

Threading Model
● Kafka Streams allows the user to configure the number of threads that
the library can use to parallelize processing within an application
instance.
● Each thread can execute one or more stream tasks with their processor
topologies independently.
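A rough picture of how stream tasks might be spread over the configured threads is shown below. It is plain Python, and the round-robin assignment is an assumption made for illustration, not a description of the library's actual assignment logic.

```python
def assign_tasks(num_tasks, num_threads):
    """Spread task ids over threads round-robin (illustrative assumption)."""
    assignment = {thread: [] for thread in range(num_threads)}
    for task in range(num_tasks):
        assignment[task % num_threads].append(task)
    return assignment


print(assign_tasks(num_tasks=5, num_threads=2))  # {0: [0, 2, 4], 1: [1, 3]}
```

With more threads than tasks, some threads would simply stay idle in this sketch.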

State
● Kafka Streams provides so-called state stores.
● State can be used by stream processing applications to store and query
data, which is an important capability when implementing stateful
operations.

Backpressure
● Kafka Streams does not use a backpressure mechanism because it
does not need one.
● It uses a depth-first processing strategy.
● Each record consumed from Kafka will go through the whole processor
(sub-)topology for processing and for (possibly) being written back to
Kafka before the next record will be processed.
● No records are being buffered in-memory between two connected
stream processors.
● Kafka Streams leverages Kafka’s consumer client behind the scenes.
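The depth-first strategy can be sketched in plain Python as a loop that pushes one record through every processor before fetching the next one. This is not the Kafka Streams API, and the processor names and functions are made up for illustration.

```python
def run_depth_first(source, processors, sink):
    """Push each record through the whole topology before taking the next record."""
    trace = []
    for record in source:              # the next record is fetched only when needed
        value = record
        for name, fn in processors:    # no buffering between connected processors
            value = fn(value)
            trace.append((name, value))
        sink.append(value)             # written out before the next record starts
        trace.append(("sink", value))
    return trace


sink = []
processors = [("double", lambda v: v * 2), ("plus_one", lambda v: v + 1)]
trace = run_depth_first(iter([1, 2]), processors, sink)
print(trace)
# [('double', 2), ('plus_one', 3), ('sink', 3), ('double', 4), ('plus_one', 5), ('sink', 5)]
```

The trace shows record 1 reaching the sink before record 2 enters the first processor, which is why no queue can build up between processors.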

HOW TO GET DATA IN AND OUT OF KAFKA?

Kafka Connect
● So-called Sources import data into Kafka, and Sinks export data from
Kafka.
● An implementation of a Source or Sink is a Connector. Users deploy
connectors to enable data flows on Kafka.
● All Kafka Connect sources and sinks map to partitioned streams of
records.
● This is a generalization of Kafka’s concept of topic partitions: a stream
refers to the complete set of records that are split into independent
infinite sequences of records

CONFIGURING CONNECTORS
● Connector configurations are key-value mappings.
● For standalone mode these are defined in a properties file and
passed to the Connect process on the command line.
● In distributed mode, they will be included in the JSON payload
sent over the REST API for the request that creates the connector.

CONFIGURING CONNECTORS
A few settings are common to all connectors:
● name - Unique name for the connector. Attempting to register again
with the same name will fail.
● connector.class - The Java class for the connector
● tasks.max - The maximum number of tasks that should be created for
this connector. The connector may create fewer tasks if it cannot
achieve this level of parallelism.
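The sketch below renders one connector configuration in both forms: a properties file for standalone mode and a JSON payload for distributed mode. It is plain Python. The connector name, file path, and topic are placeholder values, and the two helper functions are hypothetical. The example uses the file source connector that ships with Kafka.

```python
import json

connector_config = {
    "name": "local-file-source",   # placeholder name
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",      # placeholder path
    "topic": "connect-demo",       # placeholder topic
}


def to_properties(config):
    """Standalone mode: key=value lines for a properties file."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


def to_rest_payload(config):
    """Distributed mode: JSON body for the request that creates the connector."""
    settings = dict(config)
    name = settings.pop("name")
    return json.dumps({"name": name, "config": settings}, indent=2)


print(to_properties(connector_config))
print(to_rest_payload(connector_config))
```

In distributed mode the JSON body would be sent in the create-connector request of the Connect REST API. The host and port depend on the deployment.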

REFERENCES
● https://www.slideshare.net/ConfluentInc/demystifying-stream-processing-with-apache-kafka-69228952
● https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
● http://docs.confluent.io/3.2.0/streams/index.html
● http://docs.confluent.io/3.2.0/connect/index.html
