Kafka Basic Terminology and Examples from Real World

Submitted by heartin on Sun, 01/29/2017 - 16:35

Apache Kafka is a publish-subscribe based distributed messaging system.

Kafka Basic Terminology

Following is a list of important terminology related to Apache Kafka.

Broker
- Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
- Brokers are stateless and does not maintain a record of what is consumed by whom.
  - The message state of any consumed message is maintained within the message consumer.
Topics
- Kafka maintains feeds of messages in categories. A category or feed name to which messages are published, is called a topic.
- Kafka topics are created on a Kafka broker acting as a Kafka server.
Partitions
- For each topic, the Kafka cluster maintains a partitioned log.
- Each partition is a commit log, i.e. an ordered, immutable sequence of messages that is continually appended to.
- The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
- Benefits of partitions
  - Partitions allow log to spread across multiple servers. A partition should fit into a single server, but a topic can have partitions in different machines.
  - Partitions act as the unit of parallelism.
  - Each partition is replicated across a configurable number of servers for fault tolerance.
    - Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader.
  - Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Producers
- Producers are processes that publish messages to Kafka. Producers publish data to the topics of their choice.
- The producer is responsible for choosing which message to assign to which partition within the topic. This can be done in a round-robin fashion or it can be done according to some semantic partition function.
Consumers
- Consumers are processes that subscribe to topics and process the feed of published messages.
- The message state of any consumed message is maintained within the message consumer.
Consumer Group
- Consumers can label themselves with a consumer group name.
- Consumer group is a consumer abstraction provided by Kafka that generalizes both traditional messaging models of queuing and publish-subscribe.
- Each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
  - If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
  - If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
Offset
- Offset is the position of the consumer in the log and is retained on a per-consumer basis.
- It is the the only metadata retained on a per-consumer basis.
- Offset is controlled by the consumer.
- Normally a consumer will advance its offset linearly as it reads messages, but as the position is controlled by the consumer, it can consume messages in any order it likes or reset offset and read already processed messages.

At a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers.

Examples of Producers and Consumers from real world

Kafka can be considered as a broker to which producers send messages and consumers consume messages.

Here are few examples of producers and consumers from real world.

Frontend web applications generating logs
Proxies generating web analytics logs
Adapters generating transformation logs
Services generating invocation trace logs

Consumers

Offline consumers that are storing messages in Hadoop or traditional data warehouse for offline analysis
Near real-time consumers that are storing them in any NoSQL datastore, such as HBase or Cassandra, for near real-time analytics.
Real-time consumers, such as Spark or Storm, that filter messages in-memory and trigger alert events for related groups