Exploring Apache Kafka: Key Concepts and Practical Examples

Apache Kafka has become a core tool for managing real-time data pipelines and streaming applications. Below is an overview of its key components and some example commands to help you get started with Kafka.

Kafka Concepts Overview

  • Broker / Cluster: Kafka brokers are the servers that store and manage data. Multiple brokers work together to form a Kafka cluster, which ensures data redundancy, load balancing, and fault tolerance.
  • Producer: Producers are responsible for sending data (messages) to Kafka topics. Producers can configure parameters such as message keys and timestamps for additional control.
  • Consumer: Consumers read data from Kafka topics. A consumer group allows multiple consumers to work together to read data in parallel for scalability.
  • Message: A Kafka message contains a key, value, and optional timestamp. The key is typically used for partitioning data, and the value contains the actual data payload.
  • Topic: Topics are the primary way to organize messages in Kafka. Producers write data to topics, and consumers read data from topics.
  • Partitions: Each topic is split into partitions, enabling parallel processing. Partitions also allow data distribution across multiple brokers.
  • Offsets: Each message in a partition has an offset, which is a unique identifier. Offsets track message positions within partitions, allowing consumers to know where to start or resume reading.
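
To make these concepts concrete, here is a minimal Python sketch, assuming the kafka-python package is installed and a broker is reachable at kafka:9092 (for example, from a container on the same Docker network as the Compose setup below). The topic name test-topic matches the commands used later in this article:

# Minimal producer/consumer sketch (assumes: pip install kafka-python,
# and a broker reachable at kafka:9092, e.g. from the jupyter container defined below).
from kafka import KafkaProducer, KafkaConsumer

# Producer: the key drives partitioning (same key -> same partition), the value is the payload.
producer = KafkaProducer(bootstrap_servers="kafka:9092")
producer.send("test-topic", key=b"sensor-1", value=b"temperature=21.5")
producer.flush()

# Consumer: consumers in the same group split the topic's partitions between them.
consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="kafka:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",  # start from the oldest available offset if no position is committed
    consumer_timeout_ms=5000,      # stop iterating after 5 s without new messages
)
for msg in consumer:
    # Each record carries the partition and offset that identify its position.
    print(msg.topic, msg.partition, msg.offset, msg.key, msg.value)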


Setting up Kafka with Docker Compose

To run Kafka locally, here is a sample docker-compose.yml. In addition to Kafka and Zookeeper, it defines a Spark master and two workers, a Kafka UI, and a Jupyter service that will be used in later experiments:

version: "3.8"

services:
  master:
    image: custom-spark-3.5:latest #docker.io/bitnami/spark:3.5
    container_name: spark-master
    hostname: master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
      - SPARK_LOG_CONF=/opt/bitnami/spark/conf/log4j2.properties
      - PYTHONPATH=/opt/bitnami/spark/jobs/app:/opt/bitnami/spark/jobs/app/
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - ./scripts:/opt/spark/scripts
      - ./data:/opt/spark/data

  worker1:
    image: custom-spark-3.5:latest #docker.io/bitnami/spark:3.5
    container_name: spark-worker-1
    hostname: worker1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
      - SPARK_LOG_CONF=/opt/bitnami/spark/conf/log4j2.properties
    volumes:
      - ./scripts:/opt/spark/scripts
      - ./data:/opt/spark/data
    depends_on:
      - master

  worker2:
    image: custom-spark-3.5:latest #docker.io/bitnami/spark:3.5
    container_name: spark-worker-2
    hostname: worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
      - SPARK_LOG_CONF=/opt/bitnami/spark/conf/log4j2.properties
    volumes:
      - ./scripts:/opt/spark/scripts
      - ./data:/opt/spark/data
    depends_on:
      - master

  zookeeper:
    image: bitnami/zookeeper:latest
    ports:
      - 2181:2181
    volumes:
      - zookeeper-volume:/bitnami/zookeeper 
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes


  kafka:
    image: bitnami/kafka:latest
    container_name: kafka
    restart: on-failure
    ports:
      - 9092:9092
    environment:
      - KAFKA_CFG_BROKER_ID=1
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_CFG_NUM_PARTITIONS=3
      - ALLOW_PLAINTEXT_LISTENER=yes
    volumes:
      - kafka-volume:/var/lib/kafka/data:z  # Persistent volume for Kafka data
    depends_on:
      - zookeeper

  kafka-ui:
    image: provectuslabs/kafka-ui
    container_name: kafka-ui
    depends_on:
      - kafka
      - zookeeper
    ports:
      - "8889:8080"
    restart: always
    environment:
      - KAFKA_CLUSTERS_0_NAME=kafka-ui
      - KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092
      - KAFKA_CLUSTERS_0_ZOOKEEPER=zookeeper:2181


  jupyter:
    image: jupyter/datascience-notebook:python-3.8.4
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks:rw
      - ./data/:/home/jovyan/work/data:rw
      - ./py/:/home/jovyan/work/py:rw
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    command: "start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''"

volumes:
  kafka-volume:
    external: false
  zookeeper-volume:
    external: false        

To bring up the Kafka environment:

docker-compose up -d        
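
To check that all services came up (service names as defined in the Compose file above):

docker-compose ps

The Kafka UI is then available at http://localhost:8889 and JupyterLab at http://localhost:8888, matching the port mappings above.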

Basic Kafka Commands

Here are a few useful commands to help you get hands-on with Kafka.

For the experiments below, after starting the Docker environment, open a separate Linux shell and access the Kafka container in interactive mode with the following command:

docker exec -it kafka bash


1. Creating a Topic

Create a topic named test-topic with three partitions and a replication factor of 1:

docker exec -it kafka bash

/opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 \
--create \
--topic test-topic \
--replication-factor 1 \
--partitions 3        
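
To confirm the topic exists and see how its partitions are assigned, you can use the same kafka-topics.sh script with the standard --list and --describe options:

/opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 --list

/opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 \
--describe \
--topic test-topic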

2. Producing a Message to a Topic

Send messages to test-topic:

docker exec -it kafka bash

/opt/bitnami/kafka/bin/kafka-console-producer.sh --bootstrap-server kafka:9092 \
--topic test-topic        
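
Once the prompt appears, each line you type is sent as one message; press Ctrl+C to exit. To send keyed messages from the console, the producer's parse.key and key.separator properties can be used, for example:

/opt/bitnami/kafka/bin/kafka-console-producer.sh --bootstrap-server kafka:9092 \
--topic test-topic \
--property parse.key=true \
--property key.separator=:

With these properties, a line such as sensor-1:temperature=21.5 is split into the key sensor-1 and the value temperature=21.5, so all messages with the same key land in the same partition.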

3. Consuming Messages from the Start

To read messages from the beginning of the topic:

docker exec -it kafka bash

/opt/bitnami/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
--topic test-topic \
--from-beginning
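
To also print each record's key and timestamp, and stop after a fixed number of records, the console consumer accepts the print.key and print.timestamp properties and the --max-messages option:

/opt/bitnami/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
--topic test-topic \
--from-beginning \
--property print.key=true \
--property print.timestamp=true \
--max-messages 10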
        

4. Consuming Messages from the End

To read only new messages added to the topic (omit --from-beginning so the consumer starts at the end of the topic):

docker exec -it kafka bash

/opt/bitnami/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
--topic test-topic         
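
To scale out reading, start the consumer with a group id (an illustrative name, demo-group) and then inspect the group's committed offsets and lag with kafka-consumer-groups.sh, which ships in the same bin directory:

/opt/bitnami/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
--topic test-topic \
--group demo-group

/opt/bitnami/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe \
--group demo-group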

Kafka Use Cases

Kafka is highly versatile and used in scenarios such as:

  • Real-time Data Streaming: Stream data from applications or IoT devices for real-time analytics.
  • Event Sourcing: Track changes in state, often used in financial services.
  • Data Pipeline: Move data between systems or applications, ensuring reliable data flow.

Kafka on the Cloud: AWS, Azure, and GCP

Each cloud provider offers managed Kafka services:

  • AWS (MSK): Amazon Managed Streaming for Apache Kafka (MSK) simplifies Kafka deployment, scaling, and security management.
  • Azure (Event Hubs): Azure Event Hubs provides Kafka-compatible endpoints, allowing you to leverage Kafka without managing the infrastructure.
  • GCP (Pub/Sub Lite): Google’s Pub/Sub Lite supports Kafka-like functionality with lower costs for high-throughput workloads.

In my next articles, I will integrate this environment with PySpark and run several tests. Stay tuned!



