From the course: Complete Guide to Apache Kafka for Beginners

Apache Kafka in five minutes

Hi, this is Stephane from Conduktor, and welcome to this lecture, in which I'm going to introduce Kafka to you. Let's first look at the challenges companies face around data integration. A company will have a source system, for example a database, and at some point another part of the company will want to take that data and put it into another system, a target system. So the data has to move from a source system to a target system, and at first it's very simple: someone writes some code that takes the data, extracts it, transforms it, and loads it.

After a while, your company evolves and has many source systems and also many target systems, and now your data integration challenges get a lot more complicated, because all your source systems must send data to all your target systems to share information. As you can see, we end up with a lot of integrations. With the previous architecture, if you have 4 source systems and 6 target systems, you're going to have to write 24 integrations to make it work. And each integration comes with difficulties. There's the protocol, because technologies change: maybe the data is transferred over TCP, HTTP, REST, FTP, or JDBC. There's the data format, that is, how the data is parsed: is it binary, CSV, JSON, Avro, Protobuf, et cetera? There's the schema and its evolution: what happens if the data changes shape over time in your source or your target systems? And on top of that, each source system takes on an increased load from all the connections and requests to extract the data.

So how do we solve this problem? We bring in some decoupling using Apache Kafka. We still have our source systems and our target systems, but in the middle will sit Apache Kafka. What happens now? The source systems are responsible for sending data into Apache Kafka; this is called producing. So now Apache Kafka holds a data stream of all the data from all your source systems. And if your target systems ever need to receive data from your source systems, they tap into the data in Apache Kafka, because Kafka is meant to receive and send data. Your target systems are now consuming from Apache Kafka, and everything looks a little bit better and a little bit more scalable (we'll see a small code sketch of this in a moment).

So if we go back to the same example, what could your source systems be? They could be website events, pricing data, financial transactions, or user interactions, and all of these things create data streams, meaning data created in real time, which is sent to Apache Kafka. Your target systems could be databases, analytics systems, email systems, and audit systems. This is the kind of architecture we will implement.
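To make the producing side a bit more concrete, here is a minimal sketch in Java of what a source system could look like, using the official kafka-clients library. The broker address (localhost:9092) and the "website-events" topic name are just assumptions for illustration; any broker and topic of your own would work the same way.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WebsiteEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed: a Kafka broker running locally on the default port.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Produce one website event to the (hypothetical) "website-events" topic.
            producer.send(new ProducerRecord<>(
                    "website-events", "user-42", "{\"action\":\"page_view\"}"));
            // send() is asynchronous; flush() makes sure the record actually
            // leaves before the program exits.
            producer.flush();
        }
    }
}
```

Note that send() doesn't block; the producer batches records in the background, which is part of why Kafka can sustain such high throughput.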
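And on the consuming side, a target system such as an analytics service could look roughly like this sketch. Again, the group id "analytics-service" and the topic name are made up for the example, and a real service would add error handling on top.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AnalyticsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed: the same local broker as in the producer sketch.
        props.put("bootstrap.servers", "localhost:9092");
        // Hypothetical consumer group; Kafka tracks this group's read position.
        props.put("group.id", "analytics-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Start from the beginning of the topic the first time this group runs.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("website-events"));
            while (true) {
                // Each poll returns whatever new records arrived since the last one.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n",
                            record.key(), record.value());
                }
            }
        }
    }
}
```

The key point of the architecture shows up here: the consumer knows nothing about the source systems, only about the Kafka topic, which is exactly the decoupling we were after.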
Now, why is Apache Kafka so good? Kafka was created by LinkedIn as an open source project, and as you should know, LinkedIn is a huge corporation. It's now mainly maintained by big corporations such as Confluent, IBM, Cloudera, LinkedIn, and so on. It's distributed, has a resilient architecture, and is fault tolerant, which means you can upgrade Kafka and do Kafka maintenance without taking the whole system down. Kafka is also very good because it has horizontal scalability: you can add brokers over time to your Kafka cluster, and you can scale to hundreds of brokers. Kafka also scales to huge message throughput, so you can have millions of messages per second; this is the case at Twitter.

Also, it's really high performance, with really low latency, sometimes less than 10 milliseconds, which is why we call Apache Kafka a real-time system. Kafka also has really wide adoption across the world, and if you're watching this video, that means that, you know, Kafka is being widely adopted. Over 2,000 firms are using Kafka publicly, and 80% of the Fortune 100 are using Apache Kafka. Big names using Kafka include LinkedIn, Airbnb, Netflix, Uber, and Walmart. But you don't need to be a mega corporation to use Apache Kafka.

Now on to the use cases. How is Apache Kafka used? It's used as a messaging system and an activity tracking system, and it's used to gather metrics from many different locations and to gather application logs; those were among the first use cases for Kafka. More recently, it's used for stream processing, and we'll see how to do that using the Streams API, for example. It's used to decouple system dependencies and microservices, it has integrations with big data technologies such as Spark, Flink, Storm, and Hadoop, and as I said, it's also used for microservices pub/sub.

Some more concrete examples of how Kafka is being used: Netflix is using Apache Kafka to apply recommendations in real time while you're watching TV shows. Uber is using Kafka to gather user, taxi, and trip data in real time, to compute and forecast demand, and to compute your pricing in real time. And LinkedIn uses Kafka to prevent spam and to collect user interactions, in order to make better connection recommendations in real time. In all of that, Kafka is only used as the transportation mechanism, which allows huge data flows in your company.

So by now, you should know what Kafka is, how it's used, and why and how it came to be. That's it for this lecture; I hope you liked it, and I will see you in the next lecture.
