#PulsarSummit Asia 2020
Structured Data Stream with Pulsar
Shivji Kumar Jha
Who am I?
https://www.linkedin.com/in/shivjijha/
https://twitter.com/ShivjiJha
Catalogue
• Background: Apache Pulsar
• Background: Schema
• Why Schema
• Introducing Pulsar Schema
• Learnings
• Q&A
Background: Apache Pulsar
Pulsar: cloud-native, distributed messaging and streaming platform
Highlights:
1. Modular design
2. Horizontally scalable
3. Low latency with durability
4. Multi-tenancy
5. Geo Replication
Background: Schema
Background - schema : serialization
Definitions
1. Imagine you have to send an employee record over the network.
2. You can't write the in-memory object to the wire as is.
3. An employee encoder converts the employee record to a stream of bytes.
4. Formally, this is encoding / serialization.
5. Send the bytes over the network.
https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
Background - schema : de-serialization
Definitions
1. When reading from the network, turn the stream of bytes back into an employee record.
2. A decoder converts the bytes to an employee instance.
3. Formally, this is decoding / de-serialization.
https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
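The encode and decode steps above can be sketched with the Java standard library alone. This is a minimal illustration, not the code from the talk: the `Employee` class, its `name` and `id` fields, and the `EmployeeCodec` helper are all hypothetical names chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical encoder/decoder pair for the employee example.
final class EmployeeCodec {
    // Encoding / serialization: employee record -> bytes.
    static byte[] encode(Employee employee) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(employee.name); // 2-byte length prefix, then the characters
            out.writeInt(employee.id);   // 4 bytes, big-endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decoding / de-serialization: bytes -> employee record.
    // Fields must be read in exactly the order they were written.
    static Employee decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String name = in.readUTF();
            int id = in.readInt();
            return new Employee(name, id);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = encode(new Employee("Asha", 42));
        Employee copy = decode(wire);
        System.out.println(copy.name + " " + copy.id + " (" + wire.length + " bytes)");
    }
}

// Hypothetical domain object; the fields are assumptions for illustration.
final class Employee {
    final String name;
    final int id;

    Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }
}
```

Note that the field order and types live only in this code. That is the implicit schema the later slides argue should be made explicit.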
Background - schema : Schema?
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. Encoding can be done with the native serialization of a programming language. Examples:
a. Java serialization
b. Python’s pickle
c. Ruby’s Marshal
2. That locks you into one programming language - oops!
3. Maybe JSON or XML, as used by web APIs?
a. too verbose
b. stores keys over and over
c. no way to fix types; readers guess types by looking at the data. Yuck!
4. Need to save space with each data instance.
5. Also, people stuff in random types which other people don't understand.
a. Document well?
6. OK, let's agree on some protocols and document well what is allowed and what is NOT.
7. Well, that is what Avro, Protobuf, Thrift etc. are!
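To make the "too verbose" and "storing keys over and over" points concrete, the sketch below encodes the same record once as JSON text and once as a schema-agreed binary layout, then compares sizes. The record, the layout (length-prefixed string followed by an int) and the class name `EncodingSizeDemo` are assumptions for illustration. The layout is not the Avro, Protobuf or Thrift wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EncodingSizeDemo {
    // Self-describing text: the keys "name" and "id" travel with every record.
    // Assumes the name needs no JSON escaping.
    static byte[] asJson(String name, int id) {
        String json = "{\"name\":\"" + name + "\",\"id\":" + id + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Schema-based binary: both sides agree out of band on
    // "string name, then int id", so only the values are written.
    static byte[] asBinary(String name, int id) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + nameBytes.length + Integer.BYTES);
        buffer.putInt(nameBytes.length).put(nameBytes).putInt(id);
        return buffer.array();
    }

    public static void main(String[] args) {
        System.out.println("JSON bytes:   " + asJson("Asha", 42).length);
        System.out.println("binary bytes: " + asBinary("Asha", 42).length);
    }
}
```

The gap grows with the number of fields, because every JSON record repeats every key while the binary form carries values only.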
Background - schema : Evolution
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. The schema is defined and documented. Great!
2. Someone wants to quickly add a new data type.
a. How does the decoder know which schema to use: old or new?
b. Among all schemas, how does the decoder know that two are connected?
i. That is schema versioning for you!
3. Avro, Protobuf, JSON Schema, Thrift etc. support schema evolution with versioning.
4. The sender (producer) and the reader (consumer) can use different versions of the schema at the same time.
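A toy sketch of how a producer and a consumer can sit on different schema versions at the same time: the reader resolves the writer's fields by name, fills in defaults for fields the writer did not know about, and ignores fields it does not know itself. `ToySchema`, the field names and the string-only values are assumptions made for brevity. Real formats carry types and have richer resolution rules.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SchemaEvolutionDemo {
    // Toy schema: a version, ordered field names, and defaults for optional fields.
    static final class ToySchema {
        final int version;
        final List<String> fields;
        final Map<String, String> defaults;

        ToySchema(int version, List<String> fields, Map<String, String> defaults) {
            this.version = version;
            this.fields = fields;
            this.defaults = defaults;
        }
    }

    // Writer side: emit values in the order given by the writer's schema.
    static List<String> write(ToySchema writer, Map<String, String> record) {
        List<String> values = new ArrayList<>();
        for (String field : writer.fields) {
            values.add(record.get(field));
        }
        return values;
    }

    // Reader side: map what was written onto the reader's schema by field name.
    static Map<String, String> read(ToySchema writer, ToySchema reader, List<String> values) {
        Map<String, String> written = new HashMap<>();
        for (int i = 0; i < writer.fields.size(); i++) {
            written.put(writer.fields.get(i), values.get(i));
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (String field : reader.fields) {
            // Unknown to the writer: fall back to the reader's default.
            record.put(field, written.containsKey(field) ? written.get(field) : reader.defaults.get(field));
        }
        return record;
    }

    static final ToySchema V1 = new ToySchema(1, List.of("name", "id"), Map.of());
    static final ToySchema V2 =
            new ToySchema(2, List.of("name", "id", "department"), Map.of("department", "unknown"));

    public static void main(String[] args) {
        // Old producer, new consumer: the missing field gets its default.
        List<String> oldMessage = write(V1, Map.of("name", "Asha", "id", "42"));
        System.out.println(read(V1, V2, oldMessage));

        // New producer, old consumer: the extra field is ignored.
        List<String> newMessage = write(V2, Map.of("name", "Ravi", "id", "7", "department", "infra"));
        System.out.println(read(V2, V1, newMessage));
    }
}
```

The reader needs to know which schema the writer used, which is exactly the versioning question raised in item 2.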
Background - schema : Avro
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. Encode data with a schema.
2. Ship the schema to the consumer(?)
a. or keep the schema in a central place, keyed by schemaId.
b. Ship the schemaId with the binary message.
3. While decoding:
a. Get the schemaId from the beginning of the message (always Long?)
b. Fetch the schema by schemaId from the central schema store.
4. Decode using the schema and the binary data together.
5. Example: the schema tells the decoder that the next bytes encode an int (Avro writes ints as variable-length zig-zag values rather than a fixed 4 bytes).
Why Schema?
Schema : no schema?
[Diagram labels: APACHE PULSAR, BYTES, BYTES]
1. Your data in the Pulsar store is plain binary (0s and 1s).
2. Pulsar supports several schema types for encoding and decoding.
3. You can encode data using a schema.
4. You can decode data given the schema and the binary data.
1. Schema or no schema?
2. How do you encode / decode the bytes of Pulsar data?
3. If you don’t have a schema, your schema is implicit in your app code!
Schema : no schema?
• Add custom fields for UI etc.
• Different attributes depending on the kind of event.
• Obviously easy for schemaless, but it still needs care!
https://martinfowler.com/articles/schemaless/#non-uniform-types
Introducing Pulsar Schema
Introducing Pulsar Schema : bytes
[Slide image: domain object; byte schema serialized with Java]
Introducing Pulsar Schema : String
[Slide image: producer and consumer]
Introducing Pulsar Schema : All Primitive types
Introducing Pulsar Schema : Structs (JSON schema)
[Slide image: domain object; producer with JSON schema serialization]
Introducing Pulsar Schema : Structs (AVRO schema)
Application “knows” which types go to which topic.
Pulsar Schema : Schema Store
(Client side)
1. In the previous examples, the schema was held in the producer and consumer objects.
2. This is the client-side approach to schema storage.
Problems:
1. Client responsible for:
a. “serializing” data objects (user instance) into bytes
b. “de-serializing” bytes to data object (user instance)
c. “knowing” which types go to which topic.
2. With consumers spread across several microservices, “knowing” and “evolving” the schema is challenging!
Pulsar Schema : Schema Store
(Server side)
Solution:
1. Store schema on a central server.
2. When producing, upload schema to central server.
3. Add schemaId (Long) to message.
4. When consuming, fetch schema with schemaId.
5. Schema management server manages evolution (versioning).
Pulsar has a built-in schema registry service!
Pulsar Schema : Schema Registry
1. The entity managed by the schema registry service is schemaInfo.
2. Each schemaInfo stored with a topic has a version.
3. schemaVersion tracks the schema changes happening within a topic.
4. Messages produced with a schemaInfo are tagged with its schemaVersion.
5. A consumer can use the schemaVersion to fetch the schemaInfo, then decode the message with that schemaInfo.
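The relationship between topic, schema and version can be sketched with an in-memory registry. This is a toy model under stated assumptions, not Pulsar's implementation: `ToySchemaRegistry`, its `register` and `lookup` methods, and the plain-string schema definitions are hypothetical, and versions are simply list positions starting at 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToySchemaRegistry {
    // topic -> schema definitions; the list index is the schema version.
    private final Map<String, List<String>> schemasByTopic = new HashMap<>();

    // Producer side: register a schema for a topic and get its version back.
    // Registering an identical definition again returns the existing version.
    long register(String topic, String schemaDefinition) {
        List<String> versions = schemasByTopic.computeIfAbsent(topic, t -> new ArrayList<>());
        int existing = versions.indexOf(schemaDefinition);
        if (existing >= 0) {
            return existing;
        }
        versions.add(schemaDefinition);
        return versions.size() - 1;
    }

    // Consumer side: resolve the version tagged on a message back to its schema.
    String lookup(String topic, long version) {
        List<String> versions = schemasByTopic.get(topic);
        if (versions == null || version < 0 || version >= versions.size()) {
            throw new IllegalArgumentException("no schema version " + version + " for topic " + topic);
        }
        return versions.get((int) version);
    }

    public static void main(String[] args) {
        ToySchemaRegistry registry = new ToySchemaRegistry();
        long first = registry.register("employees", "{name, id}");
        long second = registry.register("employees", "{name, id, department}");
        System.out.println("v" + first + " -> " + registry.lookup("employees", first));
        System.out.println("v" + second + " -> " + registry.lookup("employees", second));
    }
}
```

Because versions are scoped to a topic, two topics can evolve their schemas independently.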
Schema payload structure:
SchemaType schemaType;
Boolean isDeleted;
Long timestamp;
String user;
byte[] data;
HashMap<String, String> props;
Schemas can be managed with admin CLI commands and REST APIs.
Learnings
1. Struct schemas (JSON, Avro, Protobuf) model domain objects well.
2. Use the byte schema only if really needed.
3. We have been using Avro schemas with Pulsar in production for over a year:
a. JSON schema is too verbose.
b. Protobuf is awesome, but is still being adopted by sources / sinks.
c. Avro saves space per message compared with JSON schema.
d. Avro is very well adopted among sources / sinks.
1. It is always a good idea to think hard and set compatibility on the namespace.
2. Decide on compatibility depending on the use case and expected evolution.
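What a compatibility setting checks can be sketched with a simplified rule: a reader schema can read data from a writer schema if every reader field is either present in the writer schema or has a default. The sketch below applies that rule in both directions. It ignores field types and is an assumed simplification for illustration, not the checker that Pulsar or Avro actually runs. All names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

final class CompatibilityCheckDemo {
    // Toy schema: field name -> optional default value. Types are ignored here.
    static Optional<String> required() {
        return Optional.empty();
    }

    static Optional<String> withDefault(String value) {
        return Optional.of(value);
    }

    // A reader can decode the writer's data if every reader field is either
    // written by the writer or has a default to fall back on.
    static boolean canRead(Map<String, Optional<String>> reader, Map<String, Optional<String>> writer) {
        for (Map.Entry<String, Optional<String>> field : reader.entrySet()) {
            boolean missingFromWriter = !writer.containsKey(field.getKey());
            if (missingFromWriter && !field.getValue().isPresent()) {
                return false;
            }
        }
        return true;
    }

    // Backward: consumers on the new schema can read data written with the old one.
    static boolean isBackwardCompatible(Map<String, Optional<String>> newSchema,
                                        Map<String, Optional<String>> oldSchema) {
        return canRead(newSchema, oldSchema);
    }

    // Forward: consumers on the old schema can read data written with the new one.
    static boolean isForwardCompatible(Map<String, Optional<String>> newSchema,
                                       Map<String, Optional<String>> oldSchema) {
        return canRead(oldSchema, newSchema);
    }

    public static void main(String[] args) {
        Map<String, Optional<String>> v1 = Map.of("name", required(), "id", required());
        Map<String, Optional<String>> addsDefaulted =
                Map.of("name", required(), "id", required(), "department", withDefault("unknown"));
        Map<String, Optional<String>> addsRequired =
                Map.of("name", required(), "id", required(), "department", required());

        System.out.println("add field with default, backward: " + isBackwardCompatible(addsDefaulted, v1));
        System.out.println("add required field, backward:     " + isBackwardCompatible(addsRequired, v1));
        System.out.println("add required field, forward:      " + isForwardCompatible(addsRequired, v1));
    }
}
```

Which direction matters depends on whether consumers or producers are upgraded first, which is why the choice is tied to the use case and the expected evolution.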
1. Let ordering requirements decide what goes on which topic.
a. One domain => one topic!
b. Use AUTO_CONSUME as the schema type on the consumer.
c. Set schema-autoupdate-strategy = NONE.
2. The schema management process that works for us is:
a. Keep a GitHub repo with schemas.
b. Use code reviews to review schema changes.
c. Generate POJOs from Avro schemas using a Maven plugin (Java).
d. Add the POJO library as a dependency to microservices to import domain objects.
References
1. Pulsar docs: https://pulsar.apache.org/docs/en/schema-get-started/
2. Schema auto update strategy: https://pulsar.apache.org/docs/en/pulsar-admin/#set-schema-autoupdate-strategy
3. Schema Evolution in Avro, Thrift, Protobuf: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
4. Topic design per domain: https://www.confluent.io/blog/put-several-event-types-kafka-topic/
5. Schema Compatibility Design: https://docs.confluent.io/platform/current/schema-registry/avro.html#compatibility-types
Staying Connected:
https://twitter.com/ShivjiJha
https://www.linkedin.com/in/shivjijha/
Q & A

More Related Content

What's hot

Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
datamantra
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
Knoldus Inc.
 
Query Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache FlinkQuery Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache Flink
StreamNative
 
Elastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarElastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache Pulsar
StreamNative
 
A Unified Platform for Real-time Storage and Processing
A Unified Platform for Real-time Storage and ProcessingA Unified Platform for Real-time Storage and Processing
A Unified Platform for Real-time Storage and Processing
StreamNative
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
Erik Onnen
 
Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021
Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021
Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021
StreamNative
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
JinfengHuang3
 
Kafka Summit SF 2017 - Kafka and the Polyglot Programmer
Kafka Summit SF 2017 - Kafka and the Polyglot ProgrammerKafka Summit SF 2017 - Kafka and the Polyglot Programmer
Kafka Summit SF 2017 - Kafka and the Polyglot Programmer
confluent
 
Apache Pulsar and Github
Apache Pulsar and GithubApache Pulsar and Github
Apache Pulsar and Github
StreamNative
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless Evolution
StreamNative
 
Integrating Apache Pulsar with Big Data Ecosystem
Integrating Apache Pulsar with Big Data EcosystemIntegrating Apache Pulsar with Big Data Ecosystem
Integrating Apache Pulsar with Big Data Ecosystem
StreamNative
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Spark streaming and Kafka
Spark streaming and KafkaSpark streaming and Kafka
Spark streaming and Kafka
Iraj Hedayati
 
Getting Pulsar Spinning_Addison Higham
Getting Pulsar Spinning_Addison HighamGetting Pulsar Spinning_Addison Higham
Getting Pulsar Spinning_Addison Higham
StreamNative
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
Karthik Ramasamy
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
Can Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, Confluent
Can Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, ConfluentCan Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, Confluent
Can Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, Confluent
HostedbyConfluent
 

What's hot (20)

Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
 
Query Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache FlinkQuery Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache Flink
 
Elastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarElastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache Pulsar
 
A Unified Platform for Real-time Storage and Processing
A Unified Platform for Real-time Storage and ProcessingA Unified Platform for Real-time Storage and Processing
A Unified Platform for Real-time Storage and Processing
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
 
Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021
Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021
Pushing Pulsar Performance to the Limits - Pulsar Summit NA 2021
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
 
Kafka Summit SF 2017 - Kafka and the Polyglot Programmer
Kafka Summit SF 2017 - Kafka and the Polyglot ProgrammerKafka Summit SF 2017 - Kafka and the Polyglot Programmer
Kafka Summit SF 2017 - Kafka and the Polyglot Programmer
 
Apache Pulsar and Github
Apache Pulsar and GithubApache Pulsar and Github
Apache Pulsar and Github
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless Evolution
 
Integrating Apache Pulsar with Big Data Ecosystem
Integrating Apache Pulsar with Big Data EcosystemIntegrating Apache Pulsar with Big Data Ecosystem
Integrating Apache Pulsar with Big Data Ecosystem
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Spark streaming and Kafka
Spark streaming and KafkaSpark streaming and Kafka
Spark streaming and Kafka
 
Getting Pulsar Spinning_Addison Higham
Getting Pulsar Spinning_Addison HighamGetting Pulsar Spinning_Addison Higham
Getting Pulsar Spinning_Addison Higham
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Can Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, Confluent
Can Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, ConfluentCan Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, Confluent
Can Apache Kafka Replace a Database? – The 2021 Update | Kai Waehner, Confluent
 

Similar to Pulsar Summit Asia - Structured Data Stream with Apache Pulsar

Python web conference 2022 apache pulsar development 101 with python (f li-...
Python web conference 2022   apache pulsar development 101 with python (f li-...Python web conference 2022   apache pulsar development 101 with python (f li-...
Python web conference 2022 apache pulsar development 101 with python (f li-...
Timothy Spann
 
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Timothy Spann
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
DataWorks Summit
 
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
HostedbyConfluent
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
HostedbyConfluent
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Guido Schmutz
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
Timothy Spann
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
shareddatamsft
 
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analyticsEUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics
EUBra BIGSEA
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Jean-Baptiste Onofré
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Big Data Spain
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
Antonio García-Domínguez
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
DataWorks Summit
 
Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache Beam
J On The Beach
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
Timothy Spann
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
inside-BigData.com
 

Similar to Pulsar Summit Asia - Structured Data Stream with Apache Pulsar (20)

Python web conference 2022 apache pulsar development 101 with python (f li-...
Python web conference 2022   apache pulsar development 101 with python (f li-...Python web conference 2022   apache pulsar development 101 with python (f li-...
Python web conference 2022 apache pulsar development 101 with python (f li-...
 
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analyticsEUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
 
Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache Beam
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 

More from Shivji Kumar Jha

Batch to near-realtime: inspired by a real production incident
Batch to near-realtime: inspired by a real production incidentBatch to near-realtime: inspired by a real production incident
Batch to near-realtime: inspired by a real production incident
Shivji Kumar Jha
 
Navigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern DatabasesNavigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern Databases
Shivji Kumar Jha
 
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutesDruid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
Shivji Kumar Jha
 
osi-oss-dbs.pptx
osi-oss-dbs.pptxosi-oss-dbs.pptx
osi-oss-dbs.pptx
Shivji Kumar Jha
 
pulsar-platformatory-meetup-2.pptx
pulsar-platformatory-meetup-2.pptxpulsar-platformatory-meetup-2.pptx
pulsar-platformatory-meetup-2.pptx
Shivji Kumar Jha
 
Pulsar Summit Asia 2022 - Streaming wars and How Apache Pulsar is acing the b...
Pulsar Summit Asia 2022 - Streaming wars and How Apache Pulsar is acing the b...Pulsar Summit Asia 2022 - Streaming wars and How Apache Pulsar is acing the b...
Pulsar Summit Asia 2022 - Streaming wars and How Apache Pulsar is acing the b...
Shivji Kumar Jha
 
Pulsar Summit Asia 2022 - Keeping on top of hybrid cloud usage with Pulsar
Pulsar Summit Asia 2022 - Keeping on top of hybrid cloud usage with PulsarPulsar Summit Asia 2022 - Keeping on top of hybrid cloud usage with Pulsar
Pulsar Summit Asia 2022 - Keeping on top of hybrid cloud usage with Pulsar
Shivji Kumar Jha
 
Pulsar summit asia 2021: Designing Pulsar for Isolation
Pulsar summit asia 2021: Designing Pulsar for IsolationPulsar summit asia 2021: Designing Pulsar for Isolation
Pulsar summit asia 2021: Designing Pulsar for Isolation
Shivji Kumar Jha
 
Event sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreEvent sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event Store
Shivji Kumar Jha
 
Pulsar Summit Asia - Running a secure pulsar cluster
Pulsar Summit Asia -  Running a secure pulsar clusterPulsar Summit Asia -  Running a secure pulsar cluster
Pulsar Summit Asia - Running a secure pulsar cluster
Shivji Kumar Jha
 
FOSSASIA 2015: MySQL Group Replication
FOSSASIA 2015: MySQL Group ReplicationFOSSASIA 2015: MySQL Group Replication
FOSSASIA 2015: MySQL Group Replication
Shivji Kumar Jha
 
MySQL High Availability with Replication New Features
MySQL High Availability with Replication New FeaturesMySQL High Availability with Replication New Features
MySQL High Availability with Replication New Features
Shivji Kumar Jha
 
MySQL Developer Day conference: MySQL Replication and Scalability
MySQL Developer Day conference: MySQL Replication and ScalabilityMySQL Developer Day conference: MySQL Replication and Scalability
MySQL Developer Day conference: MySQL Replication and Scalability
Shivji Kumar Jha
 
MySQL User Camp: MySQL Cluster
MySQL User Camp: MySQL ClusterMySQL User Camp: MySQL Cluster
MySQL User Camp: MySQL Cluster
Shivji Kumar Jha
 
MySQL User Camp: GTIDs
MySQL User Camp: GTIDsMySQL User Camp: GTIDs
MySQL User Camp: GTIDs
Shivji Kumar Jha
 
Open source India - MySQL Labs: Multi-Source Replication
Open source India - MySQL Labs: Multi-Source ReplicationOpen source India - MySQL Labs: Multi-Source Replication
Open source India - MySQL Labs: Multi-Source Replication
Shivji Kumar Jha
 
MySQL User Camp: Multi-threaded Slaves
Pulsar Summit Asia - Structured Data Stream with Apache Pulsar

  • 1. #PulsarSummit Asia 2020 Structured Data Stream with Pulsar Shivji Kumar Jha
  • 2. Who am I? https://www.linkedin.com/in/shivjijha/ https://twitter.com/ShivjiJha
  • 3. Catalogue
      • Background: Apache Pulsar
      • Background: Schema
      • Why Schema
      • Introducing Pulsar Schema
      • Learnings
      • Q&A
  • 7. Background: Apache Pulsar
      Pulsar: cloud-native, distributed messaging and streaming platform
      Highlights:
      1. Modular design
      2. Horizontally scalable
      3. Low latency with durability
      4. Multi-tenancy
      5. Geo-replication
  • 9. Background - schema: serialization
      Definitions
      1. Imagine you have to send an employee record over the network.
      2. You can't write the in-memory record to the wire as is.
      3. An employee encoder converts the employee record to a stream of bytes.
      4. Formally, this is encoding / serialization.
      5. Send the bytes over the network.
      https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
  • 10. Background - schema: de-serialization
      Definitions
      1. When reading from the network, turn the stream of bytes back into an employee record.
      2. A decoder converts the bytes to an employee instance.
      3. Formally, this is decoding / de-serialization.
      https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
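The encode and decode steps on slides 9 and 10 can be sketched in plain Java. This is a minimal illustration, not code from the talk: the `Employee` fields and the `EmployeeCodec` helper are hypothetical names, and the byte layout is simply whatever `DataOutputStream` produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical domain object; the field names are assumptions for illustration.
final class Employee {
    final int id;
    final String name;

    Employee(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class EmployeeCodec {
    // Encoding (serialization): record -> bytes, written in a fixed field order.
    static byte[] encode(Employee e) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(e.id);    // always 4 bytes, big-endian
            out.writeUTF(e.name);  // 2-byte length prefix, then modified UTF-8
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Decoding (de-serialization): bytes -> record, read in the same field order.
    static Employee decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int id = in.readInt();
            String name = in.readUTF();
            return new Employee(id, name);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Note that the field order and types live only in this code. That implicit agreement between encoder and decoder is exactly what a schema makes explicit.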
  • 11. Background - schema: Schema?
      https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
      1. Encoding can be done with the native serialization of a programming language. Examples:
         a. Java serialization
         b. Python's pickle
         c. Ruby's Marshal
      2. That locks you into one programming language - oops!
      3. Maybe JSON or XML would work, as with web APIs?
         a. too verbose
         b. keys are stored over and over
         c. no way to fix types; you guess types by looking at the data. Yuck!
      4. Need to save space with each data instance.
      5. Also, people stuff in random types which other people don't understand.
         a. Document well?
      6. OK, let's agree on some protocols and document clearly what's allowed and what's NOT.
      7. Well, that is what Avro, Protobuf, Thrift etc. are!
  • 12. Background - schema: Evolution
      https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
      1. The schema is defined and documented. Great!
      2. Someone wants to quickly add a new data type.
         a. How does the decoder know which schema to use: old or new?
         b. Among all schemas, how does the decoder know that two are connected?
            i. That is schema versioning for you!
      3. Avro, Protobuf, JSON schema, Thrift etc. support schema evolution with versioning.
      4. The sender (producer) and the reader (consumer) can use different versions of the schema at the same time.
  • 13. Background - schema: Avro
      https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
      1. Encode data with a schema.
      2. Ship the schema to the consumer(?)
         a. or keep the schema in a central place, keyed by schemaId.
         b. Ship the schemaId with the binary message.
      3. While decoding:
         a. Get the schemaId from the beginning of the message (always Long?)
         b. Fetch the schema by schemaId from the central schema store.
      4. Decode using the schema and the binary data together.
      5. Example: the schema tells the decoder to expect 4 bytes and convert them to an int.
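The schemaId flow on slide 13 can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `ToySchemaStore` and `FramedMessage` are hypothetical names, the schemaId is taken to be an 8-byte long at the start of the message, and the payload is left as opaque bytes rather than real Avro.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in for a central schema store; not a real registry client.
final class ToySchemaStore {
    private final Map<Long, String> schemasById = new HashMap<>();
    private long nextId = 1L;

    // Stores the schema definition and returns the id assigned to it.
    long register(String schemaDefinition) {
        long id = nextId++;
        schemasById.put(id, schemaDefinition);
        return id;
    }

    // Looks a schema up by id, failing loudly for unknown ids.
    String fetch(long schemaId) {
        String schema = schemasById.get(schemaId);
        if (schema == null) {
            throw new IllegalArgumentException("unknown schemaId " + schemaId);
        }
        return schema;
    }
}

final class FramedMessage {
    // Producer side: 8-byte schemaId followed by the encoded payload.
    static byte[] frame(long schemaId, byte[] payload) {
        return ByteBuffer.allocate(Long.BYTES + payload.length)
                .putLong(schemaId)
                .put(payload)
                .array();
    }

    // Consumer side: the schemaId sits at the beginning of the message.
    static long schemaIdOf(byte[] message) {
        return ByteBuffer.wrap(message).getLong();
    }

    // Consumer side: everything after the first 8 bytes is the payload.
    static byte[] payloadOf(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        buf.position(Long.BYTES);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return payload;
    }
}
```

A consumer would call `schemaIdOf`, fetch the schema from the store, and only then interpret the payload, for example reading 4 bytes as an int when the schema says so.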
  • 16. Schema: no schema?
      (Diagram labels: APACHE PULSAR, BYTES, BYTES)
      1. Your data in the Pulsar store is plain binary (0s and 1s).
      2. Pulsar supports several schema types for encoding and decoding.
      3. You can encode data using a schema.
      4. You can decode data given the schema and the binary data.
  • 17. Schema: no schema?
      (Diagram labels: APACHE PULSAR, BYTES, BYTES)
      1. Schema or no schema?
      2. How do you encode / decode the bytes of Pulsar data?
      3. If you don't have a schema, your schema is implicit in your app code!
  • 19. Schema: no schema? https://martinfowler.com/articles/schemaless/#non-uniform-types
  • 20. Schema: no schema?
      • Add custom fields for the UI etc.
      • Different attributes depending on the kind of event.
      • Obviously easy when schemaless, but it still needs care!
      https://martinfowler.com/articles/schemaless/#non-uniform-types
  • 23. Introducing Pulsar Schema: bytes. Domain object; byte schema serialized with Java.
  • 24. Introducing Pulsar Schema: String. Producer and consumer.
  • 25. Introducing Pulsar Schema: all primitive types.
  • 28. Introducing Pulsar Schema: structs (JSON schema). Domain object; producer with JSON schema serialization.
  • 31. Introducing Pulsar Schema: structs (Avro schema). The application "knows" which types go to which topic.
  • 32. Pulsar Schema: Schema Store (client side)
      1. In the previous examples, the schema was stored in the producer and consumer objects.
      2. This is the client-side schema storage approach.
  • 33. Pulsar Schema: Schema Store (client side)
      Problems:
      1. The client is responsible for:
         a. "serializing" data objects (user instance) into bytes
         b. "de-serializing" bytes into data objects (user instance)
         c. "knowing" which types go to which topic.
      2. With consumers spread across several micro-services, "knowing" and "evolving" the schema is challenging!
  • 34. Pulsar Schema: Schema Store (server side)
      Solution:
      1. Store the schema on a central server.
      2. When producing, upload the schema to the central server.
      3. Add the schemaId (Long) to the message.
      4. When consuming, fetch the schema with the schemaId.
      5. The schema management server manages evolution (versioning).
      Pulsar has a built-in schema registry service!
  • 35. Pulsar Schema: Schema Registry
      1. The entity for the schema registry service is schemaInfo.
  • 37. Pulsar Schema: Schema Registry
      1. Each schemaInfo stored with a topic has a version.
      2. SchemaVersion manages the schema changes happening within a topic.
      3. Messages produced with a schemaInfo are tagged with its version.
      4. A consumer can use the schemaVersion to fetch the schemaInfo, then decode the message with that schemaInfo.
  • 38. Pulsar Schema: Schema Registry
      Schema payload structure:
         SchemaType schemaType;
         Boolean isDeleted;
         Long timestamp;
         String user;
         byte[] data;
         HashMap<String, String> props;
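As a rough sketch, the payload fields listed on slide 38 could be modelled as an immutable Java value class. The class name `SchemaPayload` and the enum `SchemaKind` are hypothetical and do not reproduce Pulsar's own classes. The enum members are an illustrative subset, and the defensive copies are a design choice of this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative subset of schema types; the enum name and members are assumptions.
enum SchemaKind { BYTES, STRING, JSON, AVRO, PROTOBUF }

// Mirrors the fields listed on the slide; not Pulsar's actual class.
final class SchemaPayload {
    final SchemaKind schemaType;
    final boolean isDeleted;
    final long timestamp;             // creation time of the schema entry
    final String user;                // who registered the schema
    private final byte[] data;        // the schema definition itself, as bytes
    final Map<String, String> props;  // free-form properties

    SchemaPayload(SchemaKind schemaType, boolean isDeleted, long timestamp,
                  String user, byte[] data, Map<String, String> props) {
        this.schemaType = schemaType;
        this.isDeleted = isDeleted;
        this.timestamp = timestamp;
        this.user = user;
        // Defensive copies keep the stored payload independent of the caller's objects.
        this.data = data.clone();
        this.props = Collections.unmodifiableMap(new HashMap<>(props));
    }

    // Returns a copy so callers cannot mutate the stored schema bytes.
    byte[] data() {
        return data.clone();
    }
}
```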
  • 39. Pulsar Schema: Schema Registry. Admin CLI commands and REST APIs are available to manage schemas.
  • 41. Learnings
      1. Struct schemas (JSON, Avro, Protobuf) model domain objects well.
      2. Use the byte schema only if really needed.
      3. We have been using Avro schemas with Pulsar in production for over a year:
         a. JSON schema is too verbose.
         b. Protobuf is awesome, but still being adopted by sources / sinks.
         c. Avro saves space per message compared with JSON schema.
         d. Avro is very well adopted among sources / sinks.
  • 42. Learnings
      1. It is always a good idea to think hard about compatibility and set it on the namespace.
      2. Decide on the compatibility setting based on the use case and the expected evolution.
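To make the compatibility decision on slide 42 concrete, here is a deliberately simplified check written for this transcript. It is not Pulsar's or Avro's actual compatibility checker: `FieldSpec` and `ToyCompatibility` are hypothetical names, a schema is reduced to a map of field name to type, and type promotions are ignored.

```java
import java.util.Map;

// Hypothetical, minimal description of one schema field.
final class FieldSpec {
    final String type;
    final boolean hasDefault;

    FieldSpec(String type, boolean hasDefault) {
        this.type = type;
        this.hasDefault = hasDefault;
    }
}

final class ToyCompatibility {
    // BACKWARD: a consumer on the new schema can read data written with the old schema.
    static boolean isBackwardCompatible(Map<String, FieldSpec> oldSchema,
                                        Map<String, FieldSpec> newSchema) {
        for (Map.Entry<String, FieldSpec> entry : newSchema.entrySet()) {
            FieldSpec oldField = oldSchema.get(entry.getKey());
            FieldSpec newField = entry.getValue();
            if (oldField == null) {
                // Old data lacks this field, so the reader needs a default for it.
                if (!newField.hasDefault) {
                    return false;
                }
            } else if (!oldField.type.equals(newField.type)) {
                // Type changed; this sketch treats any change as incompatible.
                return false;
            }
        }
        return true;
    }

    // FORWARD: a consumer on the old schema can read data written with the new schema.
    static boolean isForwardCompatible(Map<String, FieldSpec> oldSchema,
                                       Map<String, FieldSpec> newSchema) {
        // Same rule with the reader and writer roles swapped.
        return isBackwardCompatible(newSchema, oldSchema);
    }
}
```

Under these rules, adding a field with a default passes in both directions, while adding a field without a default breaks backward compatibility only. That asymmetry is the kind of trade-off to weigh when choosing a setting for a namespace.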
  • 43. Learnings
      1. Let ordering requirements decide what goes on which topic.
         a. One domain => one topic!
         b. Use AUTO_CONSUME as the consumer schema type.
         c. schema-autoupdate-strategy = NONE.
      2. The schema management process that works for us is:
         a. Keep a GitHub repo with the schemas.
         b. Use code reviews to review schema changes.
         c. Generate POJOs from Avro using a Maven plugin (Java).
         d. Add the POJO library as a dependency to micro-services to import domain objects.
  • 44. References
      1. Pulsar docs: https://pulsar.apache.org/docs/en/schema-get-started/
      2. Schema auto update strategy: https://pulsar.apache.org/docs/en/pulsar-admin/#set-schema-autoupdate-strategy
      3. Schema Evolution in Avro, Thrift, Protobuf: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
      4. Topic design per domain: https://www.confluent.io/blog/put-several-event-types-kafka-topic/
      5. Schema Compatibility Design: https://docs.confluent.io/platform/current/schema-registry/avro.html#compatibility-types