Introduction to Kafka

Submitted by heartin on Sun, 01/29/2017 - 16:33

Apache Kafka is a publish-subscribe based distributed messaging system.

Kafka is open source, distributed, partitioned, and replicated.

From an architecture perspective, Kafka is closer to traditional messaging systems such as ActiveMQ or RabbitMQ.

However, from a Big Data and Hadoop perspective, Kafka can be compared with Scribe or Flume, as it is useful for processing activity stream data.

The following are some definitions from different sources:

Wikipedia: Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.

kafka.apache.org: Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

kafka.apache.org/documentation.html: Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Important characteristics of Kafka

Messages are persisted on disk as well as replicated within the cluster to prevent data loss.
- The Kafka cluster retains all published messages irrespective of whether the messages have been consumed, and this log retention behaviour including retention period is configurable.
Kafka's performance is effectively constant with respect to data size, and hence retaining lots of data is not a problem.
- Apache Kafka is designed with O(1) disk structures that provide constant-time performance even with very large volumes of stored messages, on the order of terabytes.
Kafka is designed to provide high throughput: it can handle hundreds of MBs of reads and writes per second from a large number of clients.
Kafka is distributed
- Apache Kafka supports message partitioning over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- A Kafka cluster can grow elastically and transparently without any downtime.
The Apache Kafka system supports easy integration of clients from different platforms such as Java, .NET, PHP, Ruby, and Python.
Kafka supports real-time parallel processing
- Messages produced by the producer threads should be immediately visible to consumer threads.
- Kafka also supports parallel data loading into Hadoop systems.
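
The partitioning, retention, and per-consumer reading behaviour described above can be illustrated with a toy in-memory model. This is only a sketch and does not use the Kafka client API: `ToyTopic`, `send`, and `poll` are hypothetical names, the CRC32 hash is a stand-in for whatever partitioner a real producer uses, and disk persistence and replication are left out entirely. It shows that messages with the same key land in one partition in order, and that reading a message does not remove it from the log.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for a partitioned topic (not the Kafka API)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list; a list index plays the role of an offset.
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer group tracks its own next offset per partition.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def send(self, key, value):
        # A stable hash of the key picks the partition (assumption: CRC32 as a stand-in),
        # so one key always maps to the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))  # append at the end of the log
        return partition, len(self.partitions[partition]) - 1

    def poll(self, group):
        # Return everything this group has not read yet, then advance its offsets.
        # Nothing is deleted: other groups can still read the same messages.
        records = []
        for p, log in enumerate(self.partitions):
            start = self.offsets[group][p]
            records.extend(log[start:])
            self.offsets[group][p] = len(log)
        return records


topic = ToyTopic(num_partitions=3)
for i in range(3):
    topic.send("user-1", f"event-{i}")
topic.send("user-2", "login")

first = topic.poll("analytics")    # reads all four messages
second = topic.poll("analytics")   # nothing new for this group
replay = topic.poll("audit")       # a different group still sees everything
```

Because consumption only moves a per-group offset, the cost of a read does not depend on how much older data the log still holds, which is the intuition behind retaining messages regardless of whether they have been consumed.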

Kafka common use cases

Let us go through some of the use cases of Kafka to understand it better.

Log aggregation
- Kafka provides a clean abstraction of log or event data as a stream of messages.
- Kafka removes any dependency on file details.
- Kafka supports multiple data sources and distributed data consumption.
Stream processing
- Kafka can be used for stream processing, where collected data is processed in multiple stages.
  - For example, raw data consumed from topics can be enriched or transformed into new Kafka topics for further consumption.
Commit logs
- Kafka can be used to represent external commit logs for any large-scale distributed system.
- Logs replicated across a Kafka cluster help failed nodes recover their state.
Click stream tracking
- Kafka can capture user click stream data such as page views, searches, and so on as real-time publish-subscribe feeds.
- This data is published to central topics, with one topic per activity type, as the volume of data is very high.
Messaging
- Compared with many popular message brokers, Kafka offers better throughput along with built-in partitioning, replication, and fault tolerance.
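
The multi-stage stream processing and per-activity topics mentioned in the use cases above can be sketched with plain Python lists standing in for topics. This is an illustrative model rather than Kafka client code: the topic names, the `enrich` function, and the event fields are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical topics: each one is modelled as an append-only list of messages.
topics = defaultdict(list)

# Stage 0: raw click stream events arrive on a single raw topic.
topics["raw-events"].extend([
    {"user": "u1", "type": "page_view", "url": "/home"},
    {"user": "u2", "type": "search", "query": "kafka"},
    {"user": "u1", "type": "page_view", "url": "/docs"},
])


def enrich(event, position):
    """Return a copy of the event with extra fields added (illustrative enrichment)."""
    enriched = dict(event)
    enriched["source_offset"] = position  # position of the event in the raw topic
    enriched["is_search"] = event["type"] == "search"
    return enriched


# Stage 1: consume the raw topic and publish enriched events to a new topic.
for position, event in enumerate(topics["raw-events"]):
    topics["enriched-events"].append(enrich(event, position))

# Stage 2: route enriched events to one topic per activity type.
for event in topics["enriched-events"]:
    topics["activity." + event["type"]].append(event)

page_views = topics["activity.page_view"]
searches = topics["activity.search"]
```

The raw topic is left untouched by the later stages, so additional consumers could be attached to it without affecting the enrichment or routing steps.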