ELI25: Apache Kafka Quick Notes for Interviews

#architecture #dataengineering #distributedsystems #interview
Hayden Cordeiro

Kafka was originally built by LinkedIn and later made open source under the Apache Software Foundation.

Goals

  • High throughput
  • Scalable
  • Reliable
  • Fault-tolerant
  • Pub/Sub architecture

Use Cases

  • Logging
  • Messaging
  • Data replication
  • Middleware logic

The Architecture

At a high level, the flow looks like this:

Producers -----> Brokers (Topics/Partitions) -----> Consumer Groups (Consumers)

1. Producers

These are client applications that produce (write) data to the system.
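
For illustration, here is a minimal producer sketch using the official Java client (kafka-clients). The broker address localhost:9092 and the topic name orders are placeholder assumptions, not details from these notes.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Write one message to the (assumed) "orders" topic
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"item\":\"book\"}"));
            producer.flush(); // ensure it leaves the client buffer before we exit
        }
    }
}
```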

2. Brokers

Brokers are daemons (background processes) that run on hardware or virtual machines.

  • Cluster: Multiple brokers running together form a Kafka Cluster.
  • Storage: Their primary job is to take messages from producers and store them on disk.
  • Retention: Brokers have a defined retention policy (time-based or size-based). Once the limit is reached, old messages are deleted to make room for new ones.
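
To make "cluster" concrete, here is a small sketch that lists the brokers a client can see, using the Java AdminClient (the broker address is an assumption, as before):

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // any one reachable broker

        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() reports every broker currently in the cluster
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("Broker %d at %s:%d%n", broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```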

3. Topics

Topics are logical collections of messages (e.g., orders, customers, clicks).

  • You can have as many topics as you want.
  • Topics are split into Partitions.

4. Partitions

A topic can have one or more partitions. These partitions are distributed across the brokers.

Example Scenarios:

  • Scenario A: 1 Topic (Orders), 2 Brokers, 2 Partitions.

    • Broker 1: Holds Orders-Partition-0
    • Broker 2: Holds Orders-Partition-1
  • Scenario B: 1 Topic, 3 Brokers, 2 Partitions.

    • Broker 1: Holds Orders-Partition-0
    • Broker 2: Holds Orders-Partition-1
    • Broker 3: Unused for this topic.
  • Scenario C: 1 Topic, 1 Broker, 2 Partitions.

    • Broker 1: Holds both Orders-Partition-0 and Orders-Partition-1
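
Scenario A could be set up like this with the Java AdminClient (topic name and broker address are the same assumptions as in the producer sketch); Kafka itself decides which broker gets which partition:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders" with 2 partitions (replication factor 1 for now;
            // replication is covered further down)
            NewTopic orders = new NewTopic("orders", 2, (short) 1);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```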

Important Note: Every message is written to a specific partition of a TOPIC. Each message gets an Offset, a sequential ID that is unique within its partition. Kafka guarantees ordering only within a partition, not across the entire topic.
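
The send() call makes the partition and offset visible. Continuing inside the try-block of the producer sketch above (the key "customer-7" is an invented example): messages with the same key always hash to the same partition, which is how per-key ordering is achieved.

```java
// (needs: import org.apache.kafka.clients.producer.RecordMetadata;)
ProducerRecord<String, String> record =
        new ProducerRecord<>("orders", "customer-7", "{\"item\":\"pen\"}");
RecordMetadata meta = producer.send(record).get(); // .get() blocks until the broker responds
System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
```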


Consumer Groups

Consumers are the applications reading data from Kafka.

  • They sit in logical groupings called Consumer Groups.
  • A consumer can read from one or more partitions, but within a group each partition is assigned to exactly one consumer.

The Scaling Rule:
If a group has more Consumers than Partitions, the extra consumers sit idle.

  • Example: 2 Partitions, 3 Consumers ---> Consumer 1 reads Partition 0, Consumer 2 reads Partition 1, Consumer 3 sits idle.

Offset Management:
Consumers track the last message they read. If a consumer crashes, the group rebalances: another consumer picks up the partition and resumes from the last committed Offset (the sequential ID mentioned earlier).
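
A minimal consumer sketch with the Java client: group.id is what ties instances into a consumer group, and offsets are committed manually after processing (names and address are assumptions, as before).

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed
        props.put("group.id", "order-processors");         // instances sharing this id share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");          // we commit offsets ourselves below
        props.put("auto.offset.reset", "earliest");        // where to start if no offset is committed yet

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("p%d@%d: %s%n", r.partition(), r.offset(), r.value());
                }
                // Commit after processing: after a crash and rebalance, the
                // replacement consumer resumes from the last committed offset.
                consumer.commitSync();
            }
        }
    }
}
```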


Handling Failure (Fault Tolerance)

Understanding the "happy path" is great, but interviews often focus on failure.

Consumer Failure:
Straightforward. The new consumer looks up the last committed offset and continues reading.

Broker Failure:
Since messages are stored on disk inside the broker, what happens if the broker dies? Do we lose data?
No. This is where the Replication Factor comes in.

Kafka replicates partitions across different brokers.

  • Example: Orders Topic, 2 Brokers, Replication Factor of 2.
    • Broker 1: Holds Partition 0 (Leader), Partition 1 (Replica)
    • Broker 2: Holds Partition 1 (Leader), Partition 0 (Replica)

If Broker 2 crashes, the cluster controller promotes Broker 1's replica of Partition 1 to Leader, so the partition stays available.
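
Reusing the AdminClient from the earlier sketch, the replicated version of this example is just a different replication factor at creation time:

```java
// Every partition now gets a Leader on one broker and a follower replica
// on another (topic name and setup carried over from the earlier sketch).
NewTopic orders = new NewTopic("orders", 2, (short) 2);
admin.createTopics(List.of(orders)).all().get();
```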

When does replication happen?
Almost immediately: as soon as a message is written to the Leader partition, the follower replicas copy it (strictly speaking, they fetch it; see Callout 2 below).

Producer Acknowledgement (ACKS)

The producer sends a message to the leader partition. The leader writes it, replicates it, and sends an acknowledgment (ACK) back. You can configure how strict this is:

  • acks=0 (Fire and Forget): Producer sends data and doesn't wait for a response. Fastest, but risk of data loss.
  • acks=1 (Leader Ack): Producer waits for the Leader to confirm it wrote the message.
  • acks=all (Strongest): Producer waits for the Leader AND all in-sync replicas to confirm. Safest, but highest latency.
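
In the Java client these are ordinary producer properties. A sketch of the strict end of the spectrum, added to the Properties from the producer example (the extra settings are common companions, not requirements from these notes):

```java
props.put("acks", "all");                // wait for the Leader and all in-sync replicas
props.put("enable.idempotence", "true"); // dedupe retries so "safest" doesn't mean "duplicated"
props.put("retries", "2147483647");      // keep retrying transient broker failures
```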

Callouts

1. Retention is NOT "One-in-One-out"

A common misconception is that Kafka deletes messages individually (e.g., as soon as a new message arrives, an old one is deleted).

  • Reality: Kafka deletes entire Log Segments.
  • How it works: Kafka writes messages to a file called a "segment." When that file gets full (e.g., 1GB), it closes it and starts a new one. A background process checks the closed segments. If a segment is older than the retention period (e.g., 7 days), the entire file is deleted.
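
Both knobs are per-topic configs. A sketch that sets them at creation time, reusing the AdminClient from earlier (the 1GB / 7 days values just mirror the examples above):

```java
// (needs: import java.util.Map;)
NewTopic orders = new NewTopic("orders", 2, (short) 2).configs(Map.of(
        "segment.bytes", "1073741824",  // roll a new segment file at ~1GB
        "retention.ms", "604800000"));  // delete whole segments older than 7 days
admin.createTopics(List.of(orders)).all().get();
```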

2. Replication is PULL, not PUSH

How do followers stay in sync with the leader?

  • Reality: The Leader does not "push" data to followers. Followers PULL (fetch) data from the Leader.
  • The "Magic" of the Fetch Request: When a follower sends a FetchRequest, it does two things:
    1. Asks for new data: "Give me everything starting from Offset 101."
    2. Confirms old data: By asking for 101, it implicitly tells the Leader, "I have successfully written everything up to Offset 100."
  • The ISR List: The Leader uses this fetch request to update its In-Sync Replica (ISR) list. Once all replicas in the ISR list have fetched the message, the Leader advances the "High Watermark" and sends the ACK to the producer (if acks=all).
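
The ISR list is visible from the client side: describing a topic with the AdminClient shows each partition's leader, replicas, and current ISR (topic name and broker address assumed as before).

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc =
                    admin.describeTopics(List.of("orders")).all().get().get("orders");
            for (TopicPartitionInfo p : desc.partitions()) {
                // isr() shrinks when a replica falls behind or its broker dies,
                // and grows again once the replica catches back up
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```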