Apache Kafka 4.x: The Event Streaming Era Without ZooKeeper
Posted on: 4/26/2026 7:15:51 AM
Table of contents
- 1. KRaft Mode: Goodbye ZooKeeper
- 2. Share Groups: Kafka Gets Queues
- 3. Next-Gen Consumer Rebalance Protocol (KIP-848)
- 4. Eligible Leader Replicas — No More Data Loss During Leader Election
- 5. Kafka Streams 4.2: Dead Letter Queue & Server-Side Rebalance
- 6. Integrating Kafka 4.x with .NET
- 7. Complete Event-Driven Architecture with Kafka 4.x
- 8. Migrating from ZooKeeper to KRaft
- 9. Kafka 4.x vs Other Messaging Systems
- 10. Best Practices for Kafka in Production
- Conclusion
- References
Apache Kafka has just reached the biggest milestone in its history: the complete removal of ZooKeeper. With version 4.0 (March 2025) and 4.2 (February 2026), Kafka doesn't just simplify its architecture — it introduces game-changing features for building event-driven systems. This article dives deep into KRaft mode, Share Groups (Queues), the next-generation Consumer Rebalance Protocol, and how to integrate Kafka 4.x with .NET.
1. KRaft Mode: Goodbye ZooKeeper
For over a decade, ZooKeeper was an indispensable component of every Kafka deployment. It handled metadata management, partition leader election, and cluster configuration storage. But ZooKeeper was also the root cause of countless operational headaches: another distributed system to manage, a partition count ceiling (~200K), and a bottleneck during metadata changes.
KRaft (Kafka Raft) is the solution: Kafka manages its own metadata through the Raft consensus protocol, built directly into the broker. No separate ZooKeeper cluster required. No additional system to operate.
graph LR
subgraph "Kafka < 4.0 (ZooKeeper)"
P1[Producer] --> B1[Broker 1]
P1 --> B2[Broker 2]
P1 --> B3[Broker 3]
B1 --> ZK[ZooKeeper Ensemble]
B2 --> ZK
B3 --> ZK
ZK --> ZK1[ZK Node 1]
ZK --> ZK2[ZK Node 2]
ZK --> ZK3[ZK Node 3]
C1[Consumer] --> B1
C1 --> B2
C1 --> B3
end
style ZK fill:#ff9800,stroke:#e65100,color:#fff
style ZK1 fill:#ffe0b2,stroke:#ff9800,color:#333
style ZK2 fill:#ffe0b2,stroke:#ff9800,color:#333
style ZK3 fill:#ffe0b2,stroke:#ff9800,color:#333
style B1 fill:#e94560,stroke:#fff,color:#fff
style B2 fill:#e94560,stroke:#fff,color:#fff
style B3 fill:#e94560,stroke:#fff,color:#fff
style P1 fill:#2c3e50,stroke:#fff,color:#fff
style C1 fill:#2c3e50,stroke:#fff,color:#fff
Kafka architecture before 4.0: requires a separate ZooKeeper cluster
graph LR
subgraph "Kafka 4.x (KRaft)"
P2[Producer] --> KB1[Broker 1]
P2 --> KB2[Broker 2]
P2 --> KB3[Broker 3]
KB1 ---|Raft Consensus| KB2
KB2 ---|Raft Consensus| KB3
KB3 ---|Raft Consensus| KB1
C2[Consumer] --> KB1
C2 --> KB2
C2 --> KB3
end
style KB1 fill:#4CAF50,stroke:#fff,color:#fff
style KB2 fill:#4CAF50,stroke:#fff,color:#fff
style KB3 fill:#4CAF50,stroke:#fff,color:#fff
style P2 fill:#2c3e50,stroke:#fff,color:#fff
style C2 fill:#2c3e50,stroke:#fff,color:#fff
Kafka 4.x architecture: KRaft built-in, no ZooKeeper needed
Core KRaft improvements
| Criteria | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Max partitions per cluster | ~200,000 | Millions |
| Cluster startup time | Minutes (depends on ZK) | Seconds |
| Components to operate | Kafka + ZooKeeper | Kafka only |
| Controller failover | Slow (ZK session timeout) | Fast (Raft leader election) |
| Metadata propagation | Async via ZK watchers | Event-driven via metadata log |
| Security model | 2 separate ACL systems | Unified single system |
KRaft Controller vs Broker
In KRaft mode, you can run nodes as controller-only, broker-only, or combined. For large production clusters, it's recommended to separate controller nodes (typically 3 or 5) from broker nodes to ensure metadata management isn't affected by message processing traffic.
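As a concrete example, a dedicated controller node in a three-node quorum might be configured along these lines (node IDs and host names are illustrative):
# server.properties for a controller-only node (illustrative values)
process.roles=controller
node.id=1
controller.quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093
listeners=CONTROLLER://controller-1:9093
controller.listener.names=CONTROLLER
A combined node would set process.roles=broker,controller instead.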
2. Share Groups: Kafka Gets Queues
Before Kafka 4.x, each partition could only be consumed by exactly one consumer within a consumer group. This guaranteed ordering but created a bottleneck: if you have 10 partitions but need 20 workers, you'd have to re-partition the topic — an expensive and disruptive operation.
Share Groups (KIP-932) completely solve this by allowing multiple consumers to read from the same partition with per-record acknowledgement — just like a traditional message queue (RabbitMQ, SQS) but running on Kafka infrastructure.
graph TD
T[Topic: order-processing] --> P1[Partition 0]
T --> P2[Partition 1]
T --> P3[Partition 2]
subgraph "Traditional Consumer Group"
P1 --> CG1[Consumer A]
P2 --> CG2[Consumer B]
P3 --> CG3[Consumer C]
end
subgraph "Share Group (Kafka 4.x)"
P1 -.-> SG1[Worker 1]
P1 -.-> SG2[Worker 2]
P2 -.-> SG1
P2 -.-> SG3[Worker 3]
P3 -.-> SG2
P3 -.-> SG3
end
style T fill:#e94560,stroke:#fff,color:#fff
style P1 fill:#f8f9fa,stroke:#e94560,color:#333
style P2 fill:#f8f9fa,stroke:#e94560,color:#333
style P3 fill:#f8f9fa,stroke:#e94560,color:#333
style CG1 fill:#2c3e50,stroke:#fff,color:#fff
style CG2 fill:#2c3e50,stroke:#fff,color:#fff
style CG3 fill:#2c3e50,stroke:#fff,color:#fff
style SG1 fill:#4CAF50,stroke:#fff,color:#fff
style SG2 fill:#4CAF50,stroke:#fff,color:#fff
style SG3 fill:#4CAF50,stroke:#fff,color:#fff
Traditional Consumer Group vs Share Group: workers can share partitions
Acknowledgement types in Share Groups
Kafka 4.2 adds a RENEW type alongside the original KIP-932 acknowledgement types (ACCEPT, RELEASE, REJECT), allowing consumers to extend processing time for long-running tasks. The full set, as used in the sketch below:
- ACCEPT: Record processed successfully; mark it complete
- RELEASE: Record processing failed transiently; return it to the Share Group for another delivery attempt
- REJECT: Record cannot be processed (a poison pill); do not redeliver it
- RENEW (new in 4.2): Extend processing time for long-running tasks like ML inference or video transcoding
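A minimal sketch of a Share Group worker using the Java client's KafkaShareConsumer (KIP-932). The topic, group id, and process() helper are illustrative, the RENEW path is omitted, and the exact API surface may vary by client version:
// Sketch: Share Group worker (Java client, KIP-932)
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

var props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-cluster:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-workers"); // the share group id
try (var consumer = new KafkaShareConsumer<String, String>(
        props, new StringDeserializer(), new StringDeserializer())) {
    consumer.subscribe(List.of("order-processing"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            try {
                process(record); // illustrative business logic
                consumer.acknowledge(record, AcknowledgeType.ACCEPT);
            } catch (Exception e) {
                // transient failure: hand the record back for another delivery attempt
                consumer.acknowledge(record, AcknowledgeType.RELEASE);
            }
        }
        consumer.commitSync(); // flush acknowledgements to the broker
    }
}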
Share Groups don't replace Consumer Groups
Share Groups are designed for point-to-point messaging use cases where throughput matters more than ordering. If your application requires in-order processing within a partition (e.g., event sourcing, CDC), continue using traditional Consumer Groups.
3. Next-Gen Consumer Rebalance Protocol (KIP-848)
One of the biggest pain points in operating Kafka was the stop-the-world rebalance. Whenever a consumer joined or left a group, every consumer in the group had to stop processing, revoke partitions, and wait for reassignment. With large consumer groups (hundreds of instances), this could take minutes — meaning minutes of zero message processing.
KIP-848 introduces a new rebalance protocol, enabled by default in Kafka 4.0:
| Feature | Eager Rebalance (old) | KIP-848 Protocol (new) |
|---|---|---|
| When consumer joins | All revoke, full reassignment | Only move necessary partitions |
| When consumer leaves | Stop-the-world | Incremental, no disruption |
| Assignment logic | Client-side (group leader) | Server-side (broker decides) |
| Rebalance time | Proportional to consumer count | Nearly constant (O(1)) |
| Downtime when scaling | Significant | Near zero |
To enable on the client side, simply set:
group.protocol=consumer
4. Eligible Leader Replicas — No More Data Loss During Leader Election
KIP-966 introduces Eligible Leader Replicas (ELR) — a subset of ISR (In-Sync Replicas) guaranteed to have complete data up to the high-watermark. When a partition leader fails, Kafka will only elect a new leader from ELRs, preventing an under-replicated replica from becoming leader and causing data loss.
sequenceDiagram
participant P as Producer
participant L as Leader Broker
participant R1 as Replica 1 (ELR)
participant R2 as Replica 2 (Non-ELR)
P->>L: Produce message (offset 100)
L->>R1: Replicate (offset 100) ✓
L->>R2: Replicate (offset 95) — lag
Note over L: High-watermark = 100
Note over R1: Caught up → ELR ✓
Note over R2: Lagging → NOT ELR ✗
L--xL: Leader crashes!
Note over R1,R2: Leader election
R1->>R1: Elected as new leader (ELR, no data loss)
R2--xR2: Rejected (not in ELR, would lose offsets 96-100)
ELR ensures only fully-synced replicas can become leader
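Whether ELR is active is controlled through Kafka's feature-flag mechanism. On a cluster where it is not yet enabled, the upgrade command presumably looks like the following; the feature name is taken from KIP-966 and should be verified with describe first:
# assumption: feature name per KIP-966, confirm with `describe` before upgrading
bin/kafka-features.sh --bootstrap-server kafka-cluster:9092 upgrade \
  --feature eligible.leader.replicas.version=1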
5. Kafka Streams 4.2: Dead Letter Queue & Server-Side Rebalance
Kafka Streams in version 4.2 received significant production-ready improvements:
Built-in Dead Letter Queue (DLQ)
Before 4.2, when a record caused an error in a Kafka Streams topology, you had only 2 options: skip the record or crash the application. Now, exception handlers can redirect failed records to a Dead Letter Queue — a separate topic for later analysis and reprocessing.
// Kafka Streams 4.2 — DLQ in exception handler
Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_EXCEPTION_HANDLER_CLASS_CONFIG,
    DeadLetterQueueExceptionHandler.class);
props.put("dead.letter.queue.topic", "order-processing-dlq");
// pass props to the KafkaStreams(topology, props) constructor
Server-Side Rebalance Protocol (GA)
Kafka Streams now uses broker-side task assignment instead of client-side, significantly reducing complexity and rebalance time when scaling stream processing applications.
Anchored Wall-Clock Punctuation
This feature allows scheduling punctuation at fixed wall-clock times (e.g., the top of every hour or the start of each day) rather than only at fixed intervals. It is useful for calendar-aligned aggregation tasks.
6. Integrating Kafka 4.x with .NET
The official Confluent.Kafka library (version 2.14.0) provides high-level producer and consumer clients that are fully compatible with Kafka 4.x. Below are production-ready patterns.
Producer with Idempotent Delivery
using System.Text;
using System.Text.Json;
using Confluent.Kafka;
var config = new ProducerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    EnableIdempotence = true,  // broker de-duplicates retries, preserving per-partition order
    Acks = Acks.All,           // required for idempotent delivery
    MessageSendMaxRetries = 3,
    LingerMs = 5,              // small batching window for throughput
    BatchSize = 65536
};
using var producer = new ProducerBuilder<string, string>(config)
.SetErrorHandler((_, e) =>
Console.Error.WriteLine($"Kafka error: {e.Reason}"))
.Build();
var result = await producer.ProduceAsync("order-events",
new Message<string, string>
{
Key = orderId,
Value = JsonSerializer.Serialize(orderEvent),
Headers = new Headers
{
{ "correlation-id", Encoding.UTF8.GetBytes(correlationId) },
{ "event-type", Encoding.UTF8.GetBytes("OrderCreated") }
}
});
Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");
Consumer with Manual Offset Storage
var config = new ConsumerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    GroupId = "order-processor",
    GroupProtocol = GroupProtocol.Consumer, // KIP-848 new protocol
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = true,        // background-commit only the offsets we store explicitly
    EnableAutoOffsetStore = false,  // StoreOffset() is called only after successful processing
    MaxPollIntervalMs = 300000
};
using var consumer = new ConsumerBuilder<string, string>(config)
.SetPartitionsAssignedHandler((c, partitions) =>
Console.WriteLine($"Assigned: {string.Join(", ", partitions)}"))
.Build();
consumer.Subscribe("order-events");
while (!cancellationToken.IsCancellationRequested)
{
var result = consumer.Consume(cancellationToken);
try
{
await ProcessOrderEvent(result.Message);
consumer.StoreOffset(result);
}
catch (Exception ex)
{
logger.LogError(ex, "Failed to process {Key}", result.Message.Key);
await PublishToDlq(result.Message);
consumer.StoreOffset(result);
}
}
Integration with .NET Aspire
The Aspire.Confluent.Kafka package provides dependency injection, health checks, and automatic telemetry:
// Program.cs — .NET Aspire host
var builder = DistributedApplication.CreateBuilder(args);
var kafka = builder.AddKafka("messaging") // the Aspire Kafka container runs in KRaft mode by default
    .WithDataVolume();
var orderService = builder.AddProject<Projects.OrderService>()
.WithReference(kafka);
// OrderService — DI registration
builder.AddKafkaProducer<string, OrderEvent>("messaging");
builder.AddKafkaConsumer<string, OrderEvent>("messaging", settings =>
{
settings.Config.GroupId = "order-processor";
    settings.Config.GroupProtocol = GroupProtocol.Consumer;
});
7. Complete Event-Driven Architecture with Kafka 4.x
Kafka 4.x with Share Groups enables building more flexible event-driven architectures, combining both pub/sub and point-to-point messaging on a single platform:
graph TD
API[API Gateway] --> CMD[Command Service]
API --> QRY[Query Service]
CMD --> KT1["Topic: domain-events<br/>Consumer Group — ordered"]
CMD --> KT2["Topic: task-queue<br/>Share Group — parallel"]
KT1 --> ES[Event Store Service]
KT1 --> PROJ[Projection Service]
KT1 --> NOTIFY[Notification Service]
KT2 --> W1[Worker 1]
KT2 --> W2[Worker 2]
KT2 --> W3[Worker 3]
ES --> DB1[(Event Store)]
PROJ --> DB2[(Read DB)]
QRY --> DB2
W1 --> EXT[External APIs]
W2 --> EXT
W3 --> EXT
style API fill:#2c3e50,stroke:#fff,color:#fff
style CMD fill:#e94560,stroke:#fff,color:#fff
style QRY fill:#e94560,stroke:#fff,color:#fff
style KT1 fill:#4CAF50,stroke:#fff,color:#fff
style KT2 fill:#ff9800,stroke:#fff,color:#fff
style ES fill:#f8f9fa,stroke:#e94560,color:#333
style PROJ fill:#f8f9fa,stroke:#e94560,color:#333
style NOTIFY fill:#f8f9fa,stroke:#e94560,color:#333
style W1 fill:#f8f9fa,stroke:#ff9800,color:#333
style W2 fill:#f8f9fa,stroke:#ff9800,color:#333
style W3 fill:#f8f9fa,stroke:#ff9800,color:#333
style DB1 fill:#2c3e50,stroke:#fff,color:#fff
style DB2 fill:#2c3e50,stroke:#fff,color:#fff
style EXT fill:#9e9e9e,stroke:#fff,color:#fff
Architecture combining Consumer Groups (ordered events) and Share Groups (parallel tasks)
When to use Consumer Group vs Share Group?
- Consumer Group: Domain events, CDC streams, event sourcing — where processing order matters
- Share Group: Email sending, image processing, report generation, API calls — where flexible worker scaling is needed
8. Migrating from ZooKeeper to KRaft
If you're running a Kafka cluster with ZooKeeper, the migration to KRaft involves 3 main steps:
- Prepare: Run kafka-features.sh describe to check feature levels (and make sure the cluster is on a bridge release; see the note below).
- Migrate: Use kafka-metadata.sh snapshot to transfer metadata from ZooKeeper to KRaft format. Start the KRaft controllers, switch each broker to KRaft mode with rolling restarts, and verify metadata consistency between ZK and KRaft.
- Finalize: Remove the zookeeper.connect configuration from broker configs and decommission the ZooKeeper ensemble. Monitor cluster health for 48-72 hours before removing the ZK nodes.
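Before and after each step, the cluster's feature levels can be sanity-checked with the stock tooling:
bin/kafka-features.sh --bootstrap-server kafka-cluster:9092 describe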
Important migration note
Kafka 4.0+ no longer supports ZooKeeper mode. If your cluster is running ZooKeeper, you must migrate to KRaft before upgrading to 4.x. Perform the migration on version 3.7 or 3.8, then upgrade to 4.x.
9. Kafka 4.x vs Other Messaging Systems
| Criteria | Kafka 4.x | RabbitMQ | AWS SQS/SNS |
|---|---|---|---|
| Model | Pub/Sub + Queue (Share Groups) | Queue + Pub/Sub (Exchange) | Queue (SQS) + Pub/Sub (SNS) |
| Throughput | Millions msg/s | ~50K msg/s | Nearly unlimited (managed) |
| Message retention | Configurable (days/weeks/forever) | Until consumed | 14 days (SQS) |
| Ordering | Per-partition guaranteed | Per-queue FIFO | FIFO queue or best-effort |
| Consumer scaling | Consumer Group + Share Group | Competing consumers | Auto-scaling consumers |
| Stream processing | Kafka Streams, ksqlDB | Not built-in | Requires Lambda/Kinesis |
| Operations | Self-managed or Confluent Cloud | Self-managed or CloudAMQP | Fully managed |
| Cost | Infra cost (or Confluent pricing) | Infra cost | Pay-per-request |
10. Best Practices for Kafka in Production
Sizing & Performance
- Partition count: Start with partitions = expected consumers × 2. Adding partitions is easy, removing them is not
- Replication factor: Always set at least 3 for production. Combine min.insync.replicas=2 with acks=all
- Batch size: Increase batch.size (64KB-256KB) and linger.ms (5-20ms) for better throughput
- Compression: Use compression.type=zstd for best compression ratio or lz4 for lowest latency
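Taken together, these recommendations map onto broker and producer settings roughly as follows (the values are illustrative starting points, not tuned numbers):
# broker / topic level
default.replication.factor=3
min.insync.replicas=2
# producer level
acks=all
batch.size=131072
linger.ms=10
compression.type=zstd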
Monitoring essentials
- Consumer lag: The most important metric — if lag grows continuously, consumers can't keep up with producers
- Under-replicated partitions: Indicates broker I/O or network issues
- Request latency (p99): Track produce and fetch latency at the 99th percentile
- Controller active count: In KRaft, there must always be exactly 1 active controller
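Consumer lag can be checked ad hoc with the stock CLI (the group name is illustrative); the LAG column shows, per partition, how far the group's committed offset trails the log end offset:
bin/kafka-consumer-groups.sh --bootstrap-server kafka-cluster:9092 --describe --group order-processor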
Security
- Use SASL/SCRAM or mTLS for authentication
- Enable TLS encryption for all inter-broker and client-broker connections
- Configure granular ACLs per-topic, per-consumer-group
- With KRaft, you only need to manage a single ACL system (no more separate ZooKeeper ACLs)
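For example, granting a service read access to one topic and one consumer group might look like this (principal and resource names are illustrative):
bin/kafka-acls.sh --bootstrap-server kafka-cluster:9092 --add \
  --allow-principal User:order-service \
  --operation Read \
  --topic order-events --group order-processor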
Conclusion
Apache Kafka 4.x marks the biggest transformation in the history of the world's most popular event streaming platform. Removing ZooKeeper doesn't just simplify operations — it unlocks scaling to millions of partitions. Share Groups fill Kafka's biggest gap — the ability to work as a message queue — eliminating the need for many systems to run both Kafka and RabbitMQ side by side. With the increasingly mature .NET ecosystem through Confluent.Kafka and .NET Aspire, now is the ideal time to adopt Kafka in your event-driven architecture.
References
- Apache Kafka 4.0.0 Release Announcement — kafka.apache.org
- Apache Kafka 4.2.0 Release Announcement — kafka.apache.org
- Apache Kafka 4.0: Default KRaft, Queues, Faster Rebalances — confluent.io
- .NET Client for Apache Kafka — Confluent Documentation
- Aspire.Confluent.Kafka — NuGet
- Kafka 4.0: KRaft, Queues, Better Rebalance — SoftwareMill