Apache Kafka 4.x: The Event Streaming Era Without ZooKeeper

Posted on: 4/26/2026 7:15:51 AM

Apache Kafka has just reached the biggest milestone in its history: the complete removal of ZooKeeper. With versions 4.0 (March 2025) and 4.2 (February 2026), Kafka doesn't just simplify its architecture — it introduces game-changing features for building event-driven systems. This article dives deep into KRaft mode, Share Groups (Queues), the next-generation Consumer Rebalance Protocol, and how to integrate Kafka 4.x with .NET.

  • 80%+ of Fortune 100 companies use Kafka
  • 7M+ Kafka instances worldwide
  • 4.2 is the latest version (02/2026)
  • 0 lines of ZooKeeper code remaining

1. KRaft Mode: Goodbye ZooKeeper

For over a decade, ZooKeeper was an indispensable component of every Kafka deployment. It handled metadata management, partition leader election, and cluster configuration storage. But ZooKeeper was also the root cause of countless operational headaches: another distributed system to manage, a partition count ceiling (~200K), and a bottleneck during metadata changes.

KRaft (Kafka Raft) is the solution: Kafka manages its own metadata through the Raft consensus protocol, built directly into the broker. No separate ZooKeeper cluster required. No additional system to operate.
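Standing up a KRaft node is a two-step operation: format the metadata log directory with a cluster ID, then start the server. A minimal single-node sketch, assuming a server.properties that already defines process.roles and the controller quorum (paths and exact flags vary slightly between releases):

# Generate a cluster ID and format the metadata/log directories
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/server.properties

# Start the node (no ZooKeeper process anywhere)
bin/kafka-server-start.sh config/server.properties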

graph LR
    subgraph "Kafka < 4.0 (ZooKeeper)"
        P1[Producer] --> B1[Broker 1]
        P1 --> B2[Broker 2]
        P1 --> B3[Broker 3]
        B1 --> ZK[ZooKeeper Ensemble]
        B2 --> ZK
        B3 --> ZK
        ZK --> ZK1[ZK Node 1]
        ZK --> ZK2[ZK Node 2]
        ZK --> ZK3[ZK Node 3]
        C1[Consumer] --> B1
        C1 --> B2
        C1 --> B3
    end

    style ZK fill:#ff9800,stroke:#e65100,color:#fff
    style ZK1 fill:#ffe0b2,stroke:#ff9800,color:#333
    style ZK2 fill:#ffe0b2,stroke:#ff9800,color:#333
    style ZK3 fill:#ffe0b2,stroke:#ff9800,color:#333
    style B1 fill:#e94560,stroke:#fff,color:#fff
    style B2 fill:#e94560,stroke:#fff,color:#fff
    style B3 fill:#e94560,stroke:#fff,color:#fff
    style P1 fill:#2c3e50,stroke:#fff,color:#fff
    style C1 fill:#2c3e50,stroke:#fff,color:#fff

Kafka architecture before 4.0: requires a separate ZooKeeper cluster

graph LR
    subgraph "Kafka 4.x (KRaft)"
        P2[Producer] --> KB1[Broker 1]
        P2 --> KB2[Broker 2]
        P2 --> KB3[Broker 3]
        KB1 ---|Raft Consensus| KB2
        KB2 ---|Raft Consensus| KB3
        KB3 ---|Raft Consensus| KB1
        C2[Consumer] --> KB1
        C2 --> KB2
        C2 --> KB3
    end

    style KB1 fill:#4CAF50,stroke:#fff,color:#fff
    style KB2 fill:#4CAF50,stroke:#fff,color:#fff
    style KB3 fill:#4CAF50,stroke:#fff,color:#fff
    style P2 fill:#2c3e50,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff

Kafka 4.x architecture: KRaft built-in, no ZooKeeper needed

Core KRaft improvements

Criteria | ZooKeeper Mode | KRaft Mode
--- | --- | ---
Max partitions per cluster | ~200,000 | Millions
Cluster startup time | Minutes (depends on ZK) | Seconds
Components to operate | Kafka + ZooKeeper | Kafka only
Controller failover | Slow (ZK session timeout) | Fast (Raft leader election)
Metadata propagation | Async via ZK watchers | Event-driven via metadata log
Security model | Two separate ACL systems | Unified single system

KRaft Controller vs Broker

In KRaft mode, you can run nodes as controller-only, broker-only, or combined. For large production clusters, it's recommended to separate controller nodes (typically 3 or 5) from broker nodes to ensure metadata management isn't affected by message processing traffic.
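Role selection is a per-node setting in server.properties. A rough sketch, where the node IDs, host names, and ports are illustrative and the same controller.quorum.voters list is shared by every node:

# Dedicated controller node
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl-1:9093,2@ctrl-2:9093,3@ctrl-3:9093
listeners=CONTROLLER://ctrl-1:9093
controller.listener.names=CONTROLLER

# Broker-only node
process.roles=broker
node.id=101
controller.quorum.voters=1@ctrl-1:9093,2@ctrl-2:9093,3@ctrl-3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://broker-101:9092

# Combined node (dev/test only): process.roles=broker,controller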

2. Share Groups: Kafka Gets Queues

Before Kafka 4.x, each partition could only be consumed by exactly one consumer within a consumer group. This guaranteed ordering but created a bottleneck: if you have 10 partitions but need 20 workers, you'd have to re-partition the topic — an expensive and disruptive operation.

Share Groups (KIP-932) completely solve this by allowing multiple consumers to read from the same partition with per-record acknowledgement — just like a traditional message queue (RabbitMQ, SQS) but running on Kafka infrastructure.

graph TD
    T[Topic: order-processing] --> P1[Partition 0]
    T --> P2[Partition 1]
    T --> P3[Partition 2]

    subgraph "Traditional Consumer Group"
        P1 --> CG1[Consumer A]
        P2 --> CG2[Consumer B]
        P3 --> CG3[Consumer C]
    end

    subgraph "Share Group (Kafka 4.x)"
        P1 -.-> SG1[Worker 1]
        P1 -.-> SG2[Worker 2]
        P2 -.-> SG1
        P2 -.-> SG3[Worker 3]
        P3 -.-> SG2
        P3 -.-> SG3
    end

    style T fill:#e94560,stroke:#fff,color:#fff
    style P1 fill:#f8f9fa,stroke:#e94560,color:#333
    style P2 fill:#f8f9fa,stroke:#e94560,color:#333
    style P3 fill:#f8f9fa,stroke:#e94560,color:#333
    style CG1 fill:#2c3e50,stroke:#fff,color:#fff
    style CG2 fill:#2c3e50,stroke:#fff,color:#fff
    style CG3 fill:#2c3e50,stroke:#fff,color:#fff
    style SG1 fill:#4CAF50,stroke:#fff,color:#fff
    style SG2 fill:#4CAF50,stroke:#fff,color:#fff
    style SG3 fill:#4CAF50,stroke:#fff,color:#fff

Traditional Consumer Group vs Share Group: workers can share partitions

Acknowledgement types in Share Groups

Kafka 4.2 adds the RENEW type alongside the existing ACCEPT, RELEASE, and REJECT acknowledgements, allowing consumers to extend processing time for long-running tasks:

  • ACCEPT: Record processed successfully, mark as complete
  • RELEASE: Record processing failed, return it to the Share Group for redelivery to another worker
  • REJECT: Record is unprocessable (e.g., a poison-pill message), do not redeliver it
  • RENEW (new in 4.2): Extend processing time for long-running tasks like ML inference or video transcoding

Share Groups don't replace Consumer Groups

Share Groups are designed for point-to-point messaging use cases where throughput matters more than ordering. If your application requires in-order processing within a partition (e.g., event sourcing, CDC), continue using traditional Consumer Groups.

3. Next-Gen Consumer Rebalance Protocol (KIP-848)

One of the biggest pain points in operating Kafka was the stop-the-world rebalance. Whenever a consumer joined or left a group, every consumer in the group had to stop processing, revoke partitions, and wait for reassignment. With large consumer groups (hundreds of instances), this could take minutes — meaning minutes of zero message processing.

KIP-848 introduces a new rebalance protocol, enabled by default in Kafka 4.0:

Feature | Eager Rebalance (old) | KIP-848 Protocol (new)
--- | --- | ---
When consumer joins | All revoke, full reassignment | Only move necessary partitions
When consumer leaves | Stop-the-world | Incremental, no disruption
Assignment logic | Client-side (group leader) | Server-side (broker decides)
Rebalance time | Proportional to consumer count | Nearly constant (O(1))
Downtime when scaling | Significant | Near zero

To enable on the client side, simply set:

group.protocol=consumer

4. Eligible Leader Replicas — No More Data Loss During Leader Election

KIP-966 introduces Eligible Leader Replicas (ELR) — a subset of ISR (In-Sync Replicas) guaranteed to have complete data up to the high-watermark. When a partition leader fails, Kafka will only elect a new leader from ELRs, preventing an under-replicated replica from becoming leader and causing data loss.

sequenceDiagram
    participant P as Producer
    participant L as Leader Broker
    participant R1 as Replica 1 (ELR)
    participant R2 as Replica 2 (Non-ELR)

    P->>L: Produce message (offset 100)
    L->>R1: Replicate (offset 100) ✓
    L->>R2: Replicate (offset 95) — lag
    Note over L: High-watermark = 100
    Note over R1: Caught up → ELR ✓
    Note over R2: Lagging → NOT ELR ✗
    L--xL: Leader crashes!
    Note over R1,R2: Leader election
    R1->>R1: Elected as new leader (ELR, no data loss)
    R2--xR2: Rejected (not in ELR, would lose offsets 96-100)

ELR ensures only fully-synced replicas can become leader

5. Kafka Streams 4.2: Dead Letter Queue & Server-Side Rebalance

Kafka Streams in version 4.2 received significant production-ready improvements:

Built-in Dead Letter Queue (DLQ)

Before 4.2, when a record caused an error in a Kafka Streams topology, you had only 2 options: skip the record or crash the application. Now, exception handlers can redirect failed records to a Dead Letter Queue — a separate topic for later analysis and reprocessing.

// Kafka Streams 4.2 — route failed records to a DLQ via the processing exception handler
Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_EXCEPTION_HANDLER_CLASS_CONFIG,
    DeadLetterQueueExceptionHandler.class);
props.put("dead.letter.queue.topic", "order-processing-dlq");
// props is then passed to new KafkaStreams(topology, props)

Server-Side Rebalance Protocol (GA)

Kafka Streams now uses broker-side task assignment instead of client-side, significantly reducing complexity and rebalance time when scaling stream processing applications.

Anchored Wall-Clock Punctuation

A new feature that allows scheduling punctuation at fixed wall-clock times (e.g., top of every hour, start of each day) rather than interval-based only. Useful for calendar-aligned aggregation tasks.

6. Integrating Kafka 4.x with .NET

The official Confluent.Kafka library (version 2.14.0) provides high-level producer and consumer clients that are fully compatible with Kafka 4.x. Below are production-ready patterns.

Producer with Idempotent Delivery

using System.Text;
using System.Text.Json;
using Confluent.Kafka;

var config = new ProducerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    EnableIdempotence = true,
    Acks = Acks.All,
    MessageSendMaxRetries = 3,
    LingerMs = 5,
    BatchSize = 65536
};

using var producer = new ProducerBuilder<string, string>(config)
    .SetErrorHandler((_, e) =>
        Console.Error.WriteLine($"Kafka error: {e.Reason}"))
    .Build();

// orderId, orderEvent, and correlationId are assumed to come from the surrounding application code
var result = await producer.ProduceAsync("order-events",
    new Message<string, string>
    {
        Key = orderId,
        Value = JsonSerializer.Serialize(orderEvent),
        Headers = new Headers
        {
            { "correlation-id", Encoding.UTF8.GetBytes(correlationId) },
            { "event-type", Encoding.UTF8.GetBytes("OrderCreated") }
        }
    });

Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");

Consumer with Manual Offset

var config = new ConsumerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    GroupId = "order-processor",
    GroupProtocol = "consumer",  // KIP-848 new protocol
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = false,
    MaxPollIntervalMs = 300000
};

using var consumer = new ConsumerBuilder<string, string>(config)
    .SetPartitionsAssignedHandler((c, partitions) =>
        Console.WriteLine($"Assigned: {string.Join(", ", partitions)}"))
    .Build();

consumer.Subscribe("order-events");

while (!cancellationToken.IsCancellationRequested)
{
    var result = consumer.Consume(cancellationToken);
    try
    {
        await ProcessOrderEvent(result.Message);
        consumer.StoreOffset(result);
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Failed to process {Key}", result.Message.Key);
        await PublishToDlq(result.Message);
        consumer.StoreOffset(result);
    }
}

Integration with .NET Aspire

On the app host side, the Aspire.Hosting.Kafka package models the Kafka resource, while the Aspire.Confluent.Kafka client integration provides dependency injection, health checks, and automatic telemetry:

// Program.cs — .NET Aspire host
var builder = DistributedApplication.CreateBuilder(args);

var kafka = builder.AddKafka("messaging")   // the Aspire Kafka container runs in KRaft mode
    .WithDataVolume();

var orderService = builder.AddProject<Projects.OrderService>("order-service")
    .WithReference(kafka);

// OrderService — DI registration
builder.AddKafkaProducer<string, OrderEvent>("messaging");
builder.AddKafkaConsumer<string, OrderEvent>("messaging", settings =>
{
    settings.Config.GroupId = "order-processor";
    settings.Config.GroupProtocol = GroupProtocol.Consumer;
});

7. Complete Event-Driven Architecture with Kafka 4.x

Kafka 4.x with Share Groups enables building more flexible event-driven architectures, combining both pub/sub and point-to-point messaging on a single platform:

graph TD
    API[API Gateway] --> CMD[Command Service]
    API --> QRY[Query Service]

    CMD --> KT1[Topic: domain-events<br/>Consumer Group — ordered]
    CMD --> KT2[Topic: task-queue<br/>Share Group — parallel]
    KT1 --> ES[Event Store Service]
    KT1 --> PROJ[Projection Service]
    KT1 --> NOTIFY[Notification Service]
    KT2 --> W1[Worker 1]
    KT2 --> W2[Worker 2]
    KT2 --> W3[Worker 3]
    ES --> DB1[(Event Store)]
    PROJ --> DB2[(Read DB)]
    QRY --> DB2
    W1 --> EXT[External APIs]
    W2 --> EXT
    W3 --> EXT

    style API fill:#2c3e50,stroke:#fff,color:#fff
    style CMD fill:#e94560,stroke:#fff,color:#fff
    style QRY fill:#e94560,stroke:#fff,color:#fff
    style KT1 fill:#4CAF50,stroke:#fff,color:#fff
    style KT2 fill:#ff9800,stroke:#fff,color:#fff
    style ES fill:#f8f9fa,stroke:#e94560,color:#333
    style PROJ fill:#f8f9fa,stroke:#e94560,color:#333
    style NOTIFY fill:#f8f9fa,stroke:#e94560,color:#333
    style W1 fill:#f8f9fa,stroke:#ff9800,color:#333
    style W2 fill:#f8f9fa,stroke:#ff9800,color:#333
    style W3 fill:#f8f9fa,stroke:#ff9800,color:#333
    style DB1 fill:#2c3e50,stroke:#fff,color:#fff
    style DB2 fill:#2c3e50,stroke:#fff,color:#fff
    style EXT fill:#9e9e9e,stroke:#fff,color:#fff

Architecture combining Consumer Groups (ordered events) and Share Groups (parallel tasks)

When to use Consumer Group vs Share Group?

  • Consumer Group: Domain events, CDC streams, event sourcing — where processing order matters
  • Share Group: Email sending, image processing, report generation, API calls — where flexible worker scaling is needed

8. Migrating from ZooKeeper to KRaft

If you're running a Kafka cluster with ZooKeeper, the migration to KRaft involves 3 main steps:

Step 1: Preparation
Upgrade all brokers to a 3.x bridge release that supports both modes (Kafka 3.9, the final release with ZooKeeper support, is recommended). Ensure all clients are compatible. Back up ZooKeeper metadata and configuration. Run kafka-features.sh describe to check feature levels.
Step 2: Migration
Provision the KRaft controller quorum with zookeeper.metadata.migration.enable=true so the controllers copy metadata from ZooKeeper into the KRaft metadata log (a controller configuration sketch follows after these steps). Switch each broker to KRaft mode with rolling restarts. Verify metadata consistency between ZooKeeper and KRaft.
Step 3: Completion
After all brokers are running in KRaft mode, take the controllers out of migration mode and remove the zookeeper.connect and migration settings from broker and controller configs. Monitor cluster health for 48-72 hours, then decommission the ZooKeeper ensemble.
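A sketch of the migration-time configuration on a dedicated KRaft controller, assuming a three-node quorum; node IDs, host names, and the ZooKeeper connection string are illustrative:

# KRaft controller during ZooKeeper-to-KRaft migration (Step 2)
process.roles=controller
node.id=3000
controller.quorum.voters=3000@ctrl-1:9093,3001@ctrl-2:9093,3002@ctrl-3:9093
listeners=CONTROLLER://ctrl-1:9093
controller.listener.names=CONTROLLER

# Migration-specific settings (removed again in Step 3)
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181/kafka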

Important migration note

Kafka 4.0+ no longer supports ZooKeeper mode. If your cluster is running ZooKeeper, you must migrate to KRaft before upgrading to 4.x. Perform the migration on a 3.x bridge release (Kafka 3.9 is recommended), then upgrade to 4.x.

9. Kafka 4.x vs Other Messaging Systems

Criteria | Kafka 4.x | RabbitMQ | AWS SQS/SNS
--- | --- | --- | ---
Model | Pub/Sub + Queue (Share Groups) | Queue + Pub/Sub (Exchange) | Queue (SQS) + Pub/Sub (SNS)
Throughput | Millions msg/s | ~50K msg/s | Nearly unlimited (managed)
Message retention | Configurable (days/weeks/forever) | Until consumed | 14 days (SQS)
Ordering | Per-partition guaranteed | Per-queue FIFO | FIFO queue or best-effort
Consumer scaling | Consumer Group + Share Group | Competing consumers | Auto-scaling consumers
Stream processing | Kafka Streams, ksqlDB | Not built-in | Requires Lambda/Kinesis
Operations | Self-managed or Confluent Cloud | Self-managed or CloudAMQP | Fully managed
Cost | Infra cost (or Confluent pricing) | Infra cost | Pay-per-request

10. Best Practices for Kafka in Production

Sizing & Performance

  • Partition count: Start with partitions = expected consumers × 2. Adding partitions is easy, removing them is not
  • Replication factor: Always set at least 3 for production. Combine min.insync.replicas=2 with acks=all
  • Batch size: Increase batch.size (64KB-256KB) and linger.ms (5-20ms) for better throughput
  • Compression: Use compression.type=zstd for the best compression ratio or lz4 for the lowest latency (a producer config sketch with these settings follows this list)
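To make the numbers above concrete, here is a sketch of a throughput-oriented ProducerConfig with Confluent.Kafka; the broker address and values are illustrative starting points, and min.insync.replicas=2 is a topic/broker-side setting that is not shown here.

using Confluent.Kafka;

// Throughput-oriented producer settings (illustrative values, tune against your own workload)
var tunedConfig = new ProducerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    Acks = Acks.All,                         // pairs with min.insync.replicas=2 on the topic
    EnableIdempotence = true,
    CompressionType = CompressionType.Zstd,  // or CompressionType.Lz4 for the lowest latency
    BatchSize = 131072,                      // 128 KB batches
    LingerMs = 10                            // wait up to 10 ms to fill a batch
};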

Monitoring essentials

  • Consumer lag: The most important metric — if lag grows continuously, consumers can't keep up with producers (a minimal lag probe sketch follows this list)
  • Under-replicated partitions: Indicates broker I/O or network issues
  • Request latency (p99): Track produce and fetch latency at the 99th percentile
  • Controller active count: In KRaft, there must always be exactly 1 active controller
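Beyond exporting broker and client metrics to your monitoring stack, consumer lag can also be spot-checked from application code. A minimal sketch with Confluent.Kafka, assuming the consumer already has a partition assignment; the TotalLag helper is hypothetical:

using System;
using Confluent.Kafka;

// Sums lag (high watermark minus committed offset) across the consumer's current assignment
static long TotalLag(IConsumer<string, string> consumer, TimeSpan timeout)
{
    long lag = 0;
    foreach (var committed in consumer.Committed(consumer.Assignment, timeout))
    {
        // High watermark is the offset the next produced record will receive
        var watermarks = consumer.QueryWatermarkOffsets(committed.TopicPartition, timeout);
        if (committed.Offset != Offset.Unset)
            lag += watermarks.High.Value - committed.Offset.Value;
    }
    return lag;
}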

Security

  • Use SASL/SCRAM or mTLS for authentication (a client-side configuration sketch follows this list)
  • Enable TLS encryption for all inter-broker and client-broker connections
  • Configure granular ACLs per-topic, per-consumer-group
  • With KRaft, you only need to manage a single ACL system (no more separate ZooKeeper ACLs)
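A sketch of the client side of SASL/SCRAM over TLS with Confluent.Kafka; the port, username, environment variable, and certificate path are placeholders, and broker-side listener plus ACL setup happens separately.

using System;
using Confluent.Kafka;

// Shared security settings, reusable by producers, consumers, and admin clients
var secureConfig = new ClientConfig
{
    BootstrapServers = "kafka-cluster:9093",
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.ScramSha512,
    SaslUsername = "order-service",
    SaslPassword = Environment.GetEnvironmentVariable("KAFKA_PASSWORD"),
    SslCaLocation = "/etc/kafka/certs/ca.pem"
};

// ProducerConfig and ConsumerConfig can be built on top of the shared ClientConfig
var secureProducerConfig = new ProducerConfig(secureConfig) { EnableIdempotence = true };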

Conclusion

Apache Kafka 4.x marks the biggest transformation in the history of the world's most popular event streaming platform. Removing ZooKeeper doesn't just simplify operations — it unlocks scaling to millions of partitions. Share Groups fill Kafka's biggest gap — the ability to work as a message queue — eliminating the need for many systems to run both Kafka and RabbitMQ side by side. With the increasingly mature .NET ecosystem through Confluent.Kafka and .NET Aspire, now is the ideal time to adopt Kafka in your event-driven architecture.

References