Now that we’ve discussed the main concepts Kafka relies on, we can dive into its components and their roles in the system.
A message is the unit of data that flows through the system. Its typical life cycle is as follows:
- Created by a producer
- Published to a topic by the producer
- Stored in the topic
- Retrieved and processed by consumers
- Removed from the topic in accordance with the retention policy
A message is uniquely identified across the cluster by its topic name, partition id and offset.
Internally, a message consists of a header section and a body.
The header section contains a list of headers whose values are represented as byte arrays.
The body of the message has a key and a value, both of which are opaque byte arrays. This is deliberate, because no particular choice of encoding is likely to be right for all users.
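To make the structure concrete, here is a minimal sketch of a message as an in-memory record. It is only an illustration: the class name `Message`, the field names and the sample topic `orders` are my own assumptions, and this is not Kafka's on-disk or wire format. The point is that the key, the value and the header values are plain bytes, so the encoding (UTF-8 JSON here) is entirely the application's choice.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Message:
    """Illustrative model of a stored message (hypothetical, not Kafka's format)."""
    topic: str
    partition: int
    offset: int
    key: bytes      # opaque bytes: Kafka does not interpret them
    value: bytes    # opaque bytes: Kafka does not interpret them
    headers: Tuple[Tuple[str, bytes], ...] = ()  # header values are byte arrays

    def coordinates(self) -> Tuple[str, int, int]:
        # (topic, partition, offset) identifies the message across the cluster.
        return (self.topic, self.partition, self.offset)


# The application decides how to encode; UTF-8 text is assumed here.
msg = Message(
    topic="orders",
    partition=0,
    offset=42,
    key="customer-17".encode("utf-8"),
    value='{"total": 99.5}'.encode("utf-8"),
    headers=(("trace-id", b"abc123"),),
)

print(msg.coordinates())          # ('orders', 0, 42)
print(msg.value.decode("utf-8"))  # {"total": 99.5}
```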
A broker is a node that stores published messages.
To provide fault tolerance, it’s common to run many brokers as a cluster and to replicate each partition’s data to multiple brokers.
A cluster is a distributed system composed of many brokers.
A topic is a logical set to which messages are published. A topic can also be treated as a category for messages.
In reality, a topic is composed of partitions, and that is where messages are stored.
A partition is a building block of a topic. A topic may be composed of one or many partitions.
A partition is where messages published to a topic are ultimately stored. Each message in a partition is uniquely identified by an offset, a 64-bit integer derived from a per-partition atomic counter.
Internally, a partition is a subdirectory in the filesystem of a particular broker that stores published messages in a set of files. Each file contains many messages.
A write to Kafka is an append operation to the active file (or to a newly created file if the active file exceeds a predefined size).
A read from Kafka is done using a given offset: first the file containing that offset is found, then the file-specific offset is calculated, and the message is delivered back to the consumer.
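The write and read paths described above can be sketched with a toy in-memory log. This is a simplification under stated assumptions: the names `Segment` and `PartitionLog` are hypothetical, each "file" is a Python list, and a segment rolls over after a fixed number of messages rather than a size in bytes. What it does show is the idea itself: append to the active segment, and on read locate the segment by its base offset, then compute the position inside it.

```python
import bisect
from typing import List


class Segment:
    """One 'file' of the partition; base_offset is the offset of its first message."""

    def __init__(self, base_offset: int) -> None:
        self.base_offset = base_offset
        self.messages: List[bytes] = []


class PartitionLog:
    """Toy append-only log split into segments (hypothetical sketch)."""

    def __init__(self, max_segment_messages: int = 3) -> None:
        # Assumption: roll over by message count to keep the sketch short.
        self.max_segment_messages = max_segment_messages
        self.segments: List[Segment] = [Segment(base_offset=0)]
        self.next_offset = 0  # per-partition counter that produces offsets

    def append(self, payload: bytes) -> int:
        active = self.segments[-1]
        if len(active.messages) >= self.max_segment_messages:
            # The active segment is full: create a new one and make it active.
            active = Segment(base_offset=self.next_offset)
            self.segments.append(active)
        active.messages.append(payload)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset: int) -> bytes:
        if not 0 <= offset < self.next_offset:
            raise IndexError(f"offset {offset} is out of range")
        # Step 1: find the segment whose base offset is the largest one <= offset.
        bases = [s.base_offset for s in self.segments]
        segment = self.segments[bisect.bisect_right(bases, offset) - 1]
        # Step 2: translate the partition offset into a segment-local position.
        return segment.messages[offset - segment.base_offset]


log = PartitionLog(max_segment_messages=3)
offsets = [log.append(f"m{i}".encode("utf-8")) for i in range(7)]
print(offsets)            # [0, 1, 2, 3, 4, 5, 6]
print(len(log.segments))  # 3
print(log.read(4))        # b'm4'
```

Because segments are ordered by base offset, the lookup is a binary search over a small list rather than a scan of every message.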
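ZooKeeper is a coordination service that keeps cluster metadata, including which broker stores a specific partition of a specific topic.
This metadata allows consumers to figure out which broker they should ask for the data of a specific partition. Note that this describes ZooKeeper-based deployments: newer clients obtain this metadata from the brokers themselves, and recent Kafka versions can run without ZooKeeper (KRaft mode).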
Replication is a technique that makes Kafka fault-tolerant in case of a node (i.e. broker) failure.
Kafka data replication is built on single-leader replication: one broker is chosen as the leader for a specific partition, and its changes are replicated to the copies of the partition that reside on other brokers (which are treated as followers).
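Here is a minimal sketch of single-leader replication for one partition. It rests on several assumptions of mine: the class names are hypothetical, the leader copies each write to its followers synchronously, and a new leader is picked as the lowest surviving broker id. Real Kafka differs in the mechanics (followers fetch from the leader, and the new leader is elected among in-sync replicas), so treat this only as an illustration of why a replica on another broker keeps the data available.

```python
from typing import Dict, List


class Replica:
    """A copy of the partition living on one broker."""

    def __init__(self, broker_id: int) -> None:
        self.broker_id = broker_id
        self.log: List[bytes] = []


class ReplicatedPartition:
    """Toy single-leader replication (hypothetical sketch)."""

    def __init__(self, broker_ids: List[int]) -> None:
        self.replicas: Dict[int, Replica] = {b: Replica(b) for b in broker_ids}
        self.leader_id = broker_ids[0]

    def write(self, payload: bytes) -> int:
        # All writes go through the leader.
        leader = self.replicas[self.leader_id]
        leader.log.append(payload)
        offset = len(leader.log) - 1
        # Assumption: the change is copied to every follower synchronously.
        for broker_id, replica in self.replicas.items():
            if broker_id != self.leader_id:
                replica.log.append(payload)
        return offset

    def read(self, offset: int) -> bytes:
        return self.replicas[self.leader_id].log[offset]

    def fail_broker(self, broker_id: int) -> None:
        del self.replicas[broker_id]
        if not self.replicas:
            raise RuntimeError("no replicas left: the partition is unavailable")
        if broker_id == self.leader_id:
            # Hypothetical election rule: the lowest surviving broker id wins.
            self.leader_id = min(self.replicas)


partition = ReplicatedPartition(broker_ids=[1, 2, 3])
partition.write(b"a")
partition.write(b"b")
partition.fail_broker(1)   # the leader goes down
print(partition.leader_id) # 2
print(partition.read(1))   # b'b'
```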
A retention policy defines when messages are removed. It can be defined by time or by size.
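The two kinds of limits can be sketched as a small function. The names `StoredMessage` and `apply_retention` and their parameters are hypothetical, and the sketch drops individual messages for brevity, whereas Kafka removes whole log files (segments) at a time. The logic is the same in spirit: anything older than the time limit goes, and the oldest data is dropped until the total size fits the size limit.

```python
from typing import List, NamedTuple, Optional


class StoredMessage(NamedTuple):
    offset: int
    timestamp: float  # seconds
    size: int         # bytes


def apply_retention(
    messages: List[StoredMessage],
    now: float,
    max_age_seconds: Optional[float] = None,
    max_total_bytes: Optional[int] = None,
) -> List[StoredMessage]:
    """Return the messages that survive the given limits (hypothetical sketch)."""
    kept = list(messages)
    if max_age_seconds is not None:
        # Time-based retention: drop everything older than the limit.
        kept = [m for m in kept if now - m.timestamp <= max_age_seconds]
    if max_total_bytes is not None:
        # Size-based retention: drop the oldest messages until the total fits.
        total = sum(m.size for m in kept)
        while kept and total > max_total_bytes:
            total -= kept[0].size
            kept = kept[1:]
    return kept


# Five messages of 100 bytes each, written at t = 0, 10, 20, 30, 40.
stored = [StoredMessage(offset=i, timestamp=10.0 * i, size=100) for i in range(5)]

by_time = apply_retention(stored, now=45.0, max_age_seconds=30.0)
by_size = apply_retention(stored, now=45.0, max_total_bytes=250)
print([m.offset for m in by_time])  # [2, 3, 4]
print([m.offset for m in by_size])  # [3, 4]
```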
A producer is a component that publishes messages to a topic.
A consumer is a component that reads messages from topics.
A topic may contain multiple partitions, so to increase the consumption rate we can have multiple consumers read its partitions in parallel and independently. All we need to do is specify the same consumer group for those consumers.
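To illustrate how a consumer group spreads the work, here is a sketch that distributes partitions across the group members round-robin. The function `assign_partitions` and the consumer names are hypothetical, and round-robin is only one possible strategy (Kafka's assignment strategy is configurable). The property to notice is that within a group each partition ends up with exactly one consumer, so adding consumers beyond the number of partitions leaves some of them idle.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin assignment of partitions to group members (hypothetical sketch)."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        # Each partition goes to exactly one member of the group.
        assignment[members[index % len(members)]].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}
```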
Main components big picture
The overall basic architecture can be depicted as follows:
As the next step, I’ll show the APIs that Confluent.Kafka provides, with examples in C#. Stay tuned!
If you have any questions, feel free to ask! I really appreciate your feedback :)