Kafka Architecture
What is Kafka?
Kafka is an open-source, distributed event streaming platform that streams data in real time.
It collects, processes, stores, and integrates data at scale.
What is an event?
An event is any action or change that is recorded by a software application. Examples on an e-commerce website are order creation, order payment, and order delivery.
An event is used to trigger some other activity. The event state, or message, is normally represented in a structured format such as JSON or Protocol Buffers.
Kafka Message Format
A Kafka message (also called a record) consists of several components: an optional key, a value (the actual payload), a timestamp, and optional headers.
The actual message format is more complex and includes additional fields for internal use. Kafka uses a binary protocol for efficiency, so messages are serialized before being sent to Kafka and deserialized after being received.
You can transfer data using JSON or Protobuf. Because Protobuf is a compact binary format, it can reduce message size (by around 50% in some cases), which in turn lowers the latency of publishing and reading data.
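To make the serialize/deserialize step concrete, here is a minimal sketch using JSON. The Record class and the helper names are hypothetical and only mirror the components described above; this is not Kafka's real wire format.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Simplified stand-in for a Kafka record (not the real binary format)."""
    key: Optional[str]
    value: dict
    timestamp_ms: int
    headers: dict = field(default_factory=dict)


def serialize_value(value: dict) -> bytes:
    # Compact JSON encoded as UTF-8 bytes, as a producer would do before sending.
    return json.dumps(value, separators=(",", ":")).encode("utf-8")


def deserialize_value(payload: bytes) -> dict:
    # The reverse step, performed on the consumer side.
    return json.loads(payload.decode("utf-8"))


record = Record(
    key="order-1001",
    value={"event": "order_created", "order_id": 1001, "amount": 49.99},
    timestamp_ms=1_700_000_000_000,
)
payload = serialize_value(record.value)
restored = deserialize_value(payload)
```

Swapping JSON for Protobuf would change only the two helper functions; the rest of the flow stays the same.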
Kafka Topic
Producers publish messages to a topic, and each message lands on a particular partition of that topic. Think of a topic as a table in a database, and a Kafka message (event) as a row of that table. You can create different topics to hold different kinds of events.
A topic is a log of events. Logs are easy to understand because they are simple data structures. They are append-only files, which means that when you publish a new message, it goes to the end of the file. Once an event has happened, you can't undo it.
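The append-only idea can be sketched in a few lines. TopicLog is a hypothetical in-memory class used only for illustration; a real Kafka log lives on disk and is split into segments.

```python
class TopicLog:
    """Minimal in-memory append-only log (illustration only)."""

    def __init__(self):
        self._events = []

    def append(self, event) -> int:
        # New events always go to the end; the returned offset is their position.
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset: int):
        # Reading never changes or removes anything from the log.
        return self._events[offset]

    def __len__(self) -> int:
        return len(self._events)


log = TopicLog()
first = log.append("order_created")
second = log.append("order_paid")
third = log.append("order_delivered")
```

There is deliberately no update or delete method: existing entries are never modified, only read.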
Partitioning
Topic: "order-placed"
|-- Partition 0: [Message0] [Message1] [Message2] ...
|-- Partition 1: [Message0] [Message1] [Message2] ...
|-- Partition 2: [Message0] [Message1] [Message2] ...
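How a message ends up in one of these partitions can be sketched as hashing the message key and taking the result modulo the partition count. Kafka's default partitioner uses a murmur2 hash; zlib.crc32 is used here purely as a stand-in, and choose_partition is a hypothetical helper name.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    # Stand-in hash (CRC32); the real default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Messages with the same key always map to the same partition,
# which is what preserves per-key ordering.
p_first = choose_partition("customer-42", 3)
p_second = choose_partition("customer-42", 3)
partitions_used = {choose_partition(f"customer-{i}", 3) for i in range(100)}
```

Ordering is only guaranteed within a partition, so events that must stay in order (for example, all events for one order) should share a key.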
Kafka Producer
A Kafka producer is a client application that publishes (writes) messages to Kafka topics.
Producers create message objects, which include the topic name, optional partition number, optional key, and the value (actual message content). The message is serialized before sending.
Producers can be configured to require acknowledgments from brokers to ensure message delivery. In case of failures, producers can automatically retry sending messages.
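The acknowledge-and-retry behavior can be sketched as follows. FlakyBroker and send_with_retries are hypothetical names, and the broker is a stub that fails a fixed number of times; a real producer client handles this internally through its configuration.

```python
class FlakyBroker:
    """Hypothetical broker stub that fails the first `failures` send attempts."""

    def __init__(self, failures: int):
        self.failures = failures
        self.stored = []

    def send(self, message: bytes) -> bool:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated network failure")
        self.stored.append(message)
        return True  # acknowledgment back to the producer


def send_with_retries(broker, message: bytes, retries: int = 3) -> int:
    """Return the number of attempts used; re-raise if every attempt fails."""
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            broker.send(message)
            return attempt
        except ConnectionError:
            if attempt == retries + 1:
                raise


broker = FlakyBroker(failures=2)
attempts = send_with_retries(broker, b"order-placed", retries=3)
```

Note that a retry after a lost acknowledgment can write the same message twice, which is the problem idempotent producers address.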
Kafka Broker
Kafka Cluster
|-- Broker 1 (Controller)
|   |-- Topic A, Partition 0 (Leader)
|   |-- Topic B, Partition 1 (Follower)
|-- Broker 2
|   |-- Topic A, Partition 1 (Leader)
|   |-- Topic B, Partition 0 (Follower)
|-- Broker 3
    |-- Topic A, Partition 0 (Follower)
    |-- Topic B, Partition 1 (Leader)
Kafka Consumer
A Kafka consumer is a client application that reads (consumes) messages from Kafka topics.
Consumers read data from specified topics and are typically part of a consumer group, which allows messages from a topic to be processed in parallel. A group can also consist of only one consumer.
Consumer Group
Within a consumer group, each consumer is exclusively assigned one or more partitions of a topic to read from.
When consumers are added to or removed from a group, Kafka automatically reassigns partitions among the current members (a rebalance).
A consumer group ensures that each partition is read by only one consumer in the group. This allows processing to scale.
If a group member goes away, the remaining members redistribute its partitions among themselves and continue consuming.
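A rebalance can be sketched as recomputing the partition-to-consumer mapping whenever membership changes. The round-robin strategy and the assign_partitions helper below are illustrative assumptions; Kafka ships several assignment strategies and this is not a copy of any of them.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]

# Three consumers in the group.
before = assign_partitions(partitions, ["c1", "c2", "c3"])

# Consumer c3 leaves; the remaining members take over its partitions.
after = assign_partitions(partitions, ["c1", "c2"])
```

With six partitions, at most six consumers in the group can do useful work; any extra consumers would sit idle.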
Message Offset
A consumer keeps track of its position in each partition using a numerical offset, which identifies the last message it has read. Consumers commit their offsets to Kafka, so after a crash or restart they can resume from where they left off.
Kafka stores committed offsets in an internal topic (__consumer_offsets). Once a consumer has received a message and committed its offset, that message is not processed again by the same consumer group.
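The resume-after-crash behavior can be sketched with a plain dictionary standing in for the committed offsets. The run_consumer function and its crash_after parameter are hypothetical, added only to simulate a failure.

```python
# (group, partition) -> offset of the next message to read
committed_offsets = {}


def run_consumer(group, partition, log, processed, crash_after=None):
    """Process messages starting from the committed offset, committing after each one."""
    offset = committed_offsets.get((group, partition), 0)
    handled = 0
    while offset < len(log):
        if crash_after is not None and handled == crash_after:
            return  # simulate a crash before the next message is processed
        processed.append(log[offset])
        offset += 1
        committed_offsets[(group, partition)] = offset  # commit the next offset
        handled += 1


log = ["m0", "m1", "m2", "m3", "m4"]
processed = []

run_consumer("billing", 0, log, processed, crash_after=2)  # handles m0, m1, then crashes
run_consumer("billing", 0, log, processed)                 # restarts and resumes at m2
```

Committing after processing, as done here, means a crash between the two steps would replay one message; committing before processing would risk skipping one instead.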
Message polling
Consumers typically use a polling model to fetch messages from Kafka in batches.
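A polling loop can be sketched like this. The poll function is a hypothetical stand-in that reads from a Python list; its max_records parameter plays a role similar to the consumer's max.poll.records setting.

```python
def poll(log, position, max_records=500):
    """Return up to max_records messages starting at the given position."""
    return log[position:position + max_records]


log = [f"event-{i}" for i in range(7)]
position = 0
batches = []

while True:
    batch = poll(log, position, max_records=3)
    if not batch:
        break  # a real consumer would keep polling and wait for new data
    batches.append(batch)
    position += len(batch)
```

Fetching in batches amortizes the network round trip over many messages instead of paying it once per message.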
Kafka Message Retention
You can configure how long a message is retained on a topic. Different topics can have different retention times.
Retaining data in Kafka for a long time does not significantly affect its performance.
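Time-based retention can be sketched as dropping everything older than a cutoff. The apply_retention helper is hypothetical and works per message for simplicity; Kafka itself deletes whole log segments, and the per-topic setting is retention.ms.

```python
def apply_retention(messages, retention_ms, now_ms):
    """Keep only messages whose timestamp is within the retention window.

    messages: list of (timestamp_ms, value) tuples, oldest first.
    """
    cutoff = now_ms - retention_ms
    return [(ts, value) for ts, value in messages if ts >= cutoff]


messages = [(1_000, "old"), (6_000, "recent"), (9_000, "newest")]
kept = apply_retention(messages, retention_ms=5_000, now_ms=10_000)
```

Retention is independent of consumption: a message is removed when it expires, whether or not any consumer has read it.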
How to ensure that Kafka messages are consumed in order?
Replication
Each partition has one broker acting as its leader, and the partition is replicated to follower brokers. A common best practice is to set the replication factor to 3.
If the broker hosting a partition leader fails, the Kafka controller promotes one of the followers to leader.
Suppose you have a topic with 9 partitions, and each partition is replicated to 2 followers in addition to its leader. That gives 3 x 9 = 27 partition copies in total (9 leaders and 18 follower replicas) spread across the brokers.
P0 -> 3 copies
P1 -> 3 copies
P2 -> 3 copies
…
P8 -> 3 copies
Total copies = 27
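The arithmetic and the failover step can be sketched together. The broker names and both helper functions are hypothetical, and the placement rule is a simple rotation; a real controller also restricts the choice of new leader to in-sync replicas.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition on `replication_factor` brokers; the first one is the leader."""
    count = len(brokers)
    return {
        partition: [brokers[(partition + r) % count] for r in range(replication_factor)]
        for partition in range(num_partitions)
    }


def elect_leader(replicas, failed_brokers):
    """Pick the first replica whose broker is still alive, or None if all are down."""
    for broker in replicas:
        if broker not in failed_brokers:
            return broker
    return None


layout = assign_replicas(num_partitions=9, brokers=["b1", "b2", "b3"], replication_factor=3)
total_copies = sum(len(replicas) for replicas in layout.values())

# Partition 0 is led by b1; if b1 fails, a follower takes over.
leader_before = elect_leader(layout[0], failed_brokers=set())
leader_after = elect_leader(layout[0], failed_brokers={"b1"})
```

With a replication factor of 3, a partition stays available as long as at least one of its three brokers is alive.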
Kafka Controller
How to ensure same message is not consumed twice?
Kafka assigns the partitions of a topic to the consumers in a consumer group so that each partition is consumed by exactly one consumer in the group. As a result, under normal operation a message is read by only a single consumer in the consumer group.
Since the messages stored in the individual partitions of a topic are different, two consumers in the same group never read the same message, which avoids the same message being consumed multiple times on the consumer side. A message can still be redelivered if a consumer crashes before committing its offset, so processing should tolerate occasional duplicates.
ZooKeeper
ZooKeeper plays a crucial role in managing and coordinating a Kafka cluster.
ZooKeeper facilitates the election of the controller broker and of partition leaders.
How to avoid duplicates in Kafka
Idempotent Producers
An idempotent producer is a Kafka producer that prevents duplicate messages from being written to the broker, even if the producer retries sending a message after a network failure. This gives exactly-once delivery from the producer to each partition within a producer session.
Set enable.idempotence=true in the producer configuration.
If your application needs to maintain message ordering and prevent duplication, you can enable idempotency for your Apache Kafka producer. An idempotent producer has a unique producer ID and attaches a sequence number to each message, allowing the broker to ensure, on a per-partition basis, that it commits messages in order and without duplicates.
The producer must receive acknowledgments from Kafka. Idempotence requires acks=all; with acks=1 only the partition leader acknowledges the write, and the idempotence guarantee does not apply.
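The broker-side check can be sketched as follows. PartitionLog is a hypothetical class that tracks only the last accepted sequence number per producer ID; the real broker logic is more involved, but the idea of rejecting an already-seen sequence number is the same.

```python
class PartitionLog:
    """Partition that deduplicates writes by (producer ID, sequence number)."""

    def __init__(self):
        self.messages = []
        self._last_sequence = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id: int, sequence: int, value: str) -> bool:
        last = self._last_sequence.get(producer_id, -1)
        if sequence <= last:
            return False  # duplicate caused by a retry: acknowledge but do not store
        if sequence != last + 1:
            raise ValueError("out-of-order sequence number")
        self.messages.append(value)
        self._last_sequence[producer_id] = sequence
        return True


partition = PartitionLog()
first_write = partition.append(producer_id=7, sequence=0, value="order_created")
retry_write = partition.append(producer_id=7, sequence=0, value="order_created")  # retry
next_write = partition.append(producer_id=7, sequence=1, value="order_paid")
```

The retried message is acknowledged without being stored a second time, so the log contains each message exactly once and in order.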
Thanks for reading. Please DM or email techlead.ps@gmail.com with any questions.