Apache Kafka 4.x: The Event Streaming Era Without ZooKeeper
Posted on: 4/26/2026 7:15:51 AM
Table of contents
- 1. KRaft Mode: Goodbye ZooKeeper
- 2. Share Groups: Kafka Gets Queues
- 3. Next-Gen Consumer Rebalance Protocol (KIP-848)
- 4. Eligible Leader Replicas — No More Data Loss During Leader Election
- 5. Kafka Streams 4.2: Dead Letter Queue & Server-Side Rebalance
- 6. Integrating Kafka 4.x with .NET
- 7. Complete Event-Driven Architecture with Kafka 4.x
- 8. Migrating from ZooKeeper to KRaft
- 9. Kafka 4.x vs Other Messaging Systems
- 10. Best Practices for Kafka in Production
- Conclusion
- References
Apache Kafka has just reached the biggest milestone in its history: the complete removal of ZooKeeper. With version 4.0 (March 2025) and 4.2 (February 2026), Kafka doesn't just simplify its architecture — it introduces game-changing features for building event-driven systems. This article dives deep into KRaft mode, Share Groups (Queues), the next-generation Consumer Rebalance Protocol, and how to integrate Kafka 4.x with .NET.
1. KRaft Mode: Goodbye ZooKeeper
For over a decade, ZooKeeper was an indispensable component of every Kafka deployment. It handled metadata management, partition leader election, and cluster configuration storage. But ZooKeeper was also the root cause of countless operational headaches: another distributed system to manage, a partition count ceiling (~200K), and a bottleneck during metadata changes.
KRaft (Kafka Raft) is the solution: Kafka manages its own metadata through the Raft consensus protocol, built directly into the broker. No separate ZooKeeper cluster required. No additional system to operate.
graph LR
subgraph "Kafka < 4.0 (ZooKeeper)"
P1[Producer] --> B1[Broker 1]
P1 --> B2[Broker 2]
P1 --> B3[Broker 3]
B1 --> ZK[ZooKeeper Ensemble]
B2 --> ZK
B3 --> ZK
ZK --> ZK1[ZK Node 1]
ZK --> ZK2[ZK Node 2]
ZK --> ZK3[ZK Node 3]
C1[Consumer] --> B1
C1 --> B2
C1 --> B3
end
style ZK fill:#ff9800,stroke:#e65100,color:#fff
style ZK1 fill:#ffe0b2,stroke:#ff9800,color:#333
style ZK2 fill:#ffe0b2,stroke:#ff9800,color:#333
style ZK3 fill:#ffe0b2,stroke:#ff9800,color:#333
style B1 fill:#e94560,stroke:#fff,color:#fff
style B2 fill:#e94560,stroke:#fff,color:#fff
style B3 fill:#e94560,stroke:#fff,color:#fff
style P1 fill:#2c3e50,stroke:#fff,color:#fff
style C1 fill:#2c3e50,stroke:#fff,color:#fff
Kafka architecture before 4.0: requires a separate ZooKeeper cluster
graph LR
subgraph "Kafka 4.x (KRaft)"
P2[Producer] --> KB1[Broker 1]
P2 --> KB2[Broker 2]
P2 --> KB3[Broker 3]
KB1 ---|Raft Consensus| KB2
KB2 ---|Raft Consensus| KB3
KB3 ---|Raft Consensus| KB1
C2[Consumer] --> KB1
C2 --> KB2
C2 --> KB3
end
style KB1 fill:#4CAF50,stroke:#fff,color:#fff
style KB2 fill:#4CAF50,stroke:#fff,color:#fff
style KB3 fill:#4CAF50,stroke:#fff,color:#fff
style P2 fill:#2c3e50,stroke:#fff,color:#fff
style C2 fill:#2c3e50,stroke:#fff,color:#fff
Kafka 4.x architecture: KRaft built-in, no ZooKeeper needed
Core KRaft improvements
| Criteria | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Max partitions per cluster | ~200,000 | Millions |
| Cluster startup time | Minutes (depends on ZK) | Seconds |
| Components to operate | Kafka + ZooKeeper | Kafka only |
| Controller failover | Slow (ZK session timeout) | Fast (Raft leader election) |
| Metadata propagation | Async via ZK watchers | Event-driven via metadata log |
| Security model | 2 separate ACL systems | Unified single system |
KRaft Controller vs Broker
In KRaft mode, you can run nodes as controller-only, broker-only, or combined. For large production clusters, it's recommended to separate controller nodes (typically 3 or 5) from broker nodes to ensure metadata management isn't affected by message processing traffic.
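As a concrete example, a dedicated controller node in a three-node quorum might be configured along these lines (node IDs and host names are illustrative):
# server.properties for a controller-only node (illustrative values)
process.roles=controller
node.id=1
controller.quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093
listeners=CONTROLLER://controller-1:9093
controller.listener.names=CONTROLLER
A combined node would set process.roles=broker,controller instead.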
2. Share Groups: Kafka Gets Queues
Before Kafka 4.x, each partition could only be consumed by exactly one consumer within a consumer group. This guaranteed ordering but created a bottleneck: if you have 10 partitions but need 20 workers, you'd have to re-partition the topic — an expensive and disruptive operation.
Share Groups (KIP-932) completely solve this by allowing multiple consumers to read from the same partition with per-record acknowledgement — just like a traditional message queue (RabbitMQ, SQS) but running on Kafka infrastructure.
graph TD
T[Topic: order-processing] --> P1[Partition 0]
T --> P2[Partition 1]
T --> P3[Partition 2]
subgraph "Traditional Consumer Group"
P1 --> CG1[Consumer A]
P2 --> CG2[Consumer B]
P3 --> CG3[Consumer C]
end
subgraph "Share Group (Kafka 4.x)"
P1 -.-> SG1[Worker 1]
P1 -.-> SG2[Worker 2]
P2 -.-> SG1
P2 -.-> SG3[Worker 3]
P3 -.-> SG2
P3 -.-> SG3
end
style T fill:#e94560,stroke:#fff,color:#fff
style P1 fill:#f8f9fa,stroke:#e94560,color:#333
style P2 fill:#f8f9fa,stroke:#e94560,color:#333
style P3 fill:#f8f9fa,stroke:#e94560,color:#333
style CG1 fill:#2c3e50,stroke:#fff,color:#fff
style CG2 fill:#2c3e50,stroke:#fff,color:#fff
style CG3 fill:#2c3e50,stroke:#fff,color:#fff
style SG1 fill:#4CAF50,stroke:#fff,color:#fff
style SG2 fill:#4CAF50,stroke:#fff,color:#fff
style SG3 fill:#4CAF50,stroke:#fff,color:#fff
Traditional Consumer Group vs Share Group: workers can share partitions
Acknowledgement types in Share Groups
Kafka 4.2 adds a RENEW type alongside the original KIP-932 acknowledgement types (ACCEPT, RELEASE, REJECT), allowing consumers to extend processing time for long-running tasks. The full set, as used in the sketch below:
- ACCEPT: Record processed successfully; mark it complete
- RELEASE: Record processing failed transiently; return it to the Share Group for another delivery attempt
- REJECT: Record cannot be processed (a poison pill); do not redeliver it
- RENEW (new in 4.2): Extend processing time for long-running tasks like ML inference or video transcoding
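A minimal sketch of a Share Group worker using the Java client's KafkaShareConsumer (KIP-932). The topic, group id, and process() helper are illustrative, the RENEW path is omitted, and the exact API surface may vary by client version:
// Sketch: Share Group worker (Java client, KIP-932)
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

var props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-cluster:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-workers"); // the share group id
try (var consumer = new KafkaShareConsumer<String, String>(
        props, new StringDeserializer(), new StringDeserializer())) {
    consumer.subscribe(List.of("order-processing"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            try {
                process(record); // illustrative business logic
                consumer.acknowledge(record, AcknowledgeType.ACCEPT);
            } catch (Exception e) {
                // transient failure: hand the record back for another delivery attempt
                consumer.acknowledge(record, AcknowledgeType.RELEASE);
            }
        }
        consumer.commitSync(); // flush acknowledgements to the broker
    }
}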
Share Groups don't replace Consumer Groups
Share Groups are designed for point-to-point messaging use cases where throughput matters more than ordering. If your application requires in-order processing within a partition (e.g., event sourcing, CDC), continue using traditional Consumer Groups.
3. Next-Gen Consumer Rebalance Protocol (KIP-848)
One of the biggest pain points in operating Kafka was the stop-the-world rebalance. Whenever a consumer joined or left a group, every consumer in the group had to stop processing, revoke partitions, and wait for reassignment. With large consumer groups (hundreds of instances), this could take minutes — meaning minutes of zero message processing.
KIP-848 introduces a new rebalance protocol, enabled by default in Kafka 4.0:
| Feature | Eager Rebalance (old) | KIP-848 Protocol (new) |
|---|---|---|
| When consumer joins | All revoke, full reassignment | Only move necessary partitions |
| When consumer leaves | Stop-the-world | Incremental, no disruption |
| Assignment logic | Client-side (group leader) | Server-side (broker decides) |
| Rebalance time | Proportional to consumer count | Nearly constant (O(1)) |
| Downtime when scaling | Significant | Near zero |
To enable on the client side, simply set:
group.protocol=consumer
4. Eligible Leader Replicas — No More Data Loss During Leader Election
KIP-966 introduces Eligible Leader Replicas (ELR) — a subset of ISR (In-Sync Replicas) guaranteed to have complete data up to the high-watermark. When a partition leader fails, Kafka will only elect a new leader from ELRs, preventing an under-replicated replica from becoming leader and causing data loss.
sequenceDiagram
participant P as Producer
participant L as Leader Broker
participant R1 as Replica 1 (ELR)
participant R2 as Replica 2 (Non-ELR)
P->>L: Produce message (offset 100)
L->>R1: Replicate (offset 100) ✓
L->>R2: Replicate (offset 95) — lag
Note over L: High-watermark = 100
Note over R1: Caught up → ELR ✓
Note over R2: Lagging → NOT ELR ✗
L--xL: Leader crashes!
Note over R1,R2: Leader election
R1->>R1: Elected as new leader (ELR, no data loss)
R2--xR2: Rejected (not in ELR, would lose offsets 96-100)
ELR ensures only fully-synced replicas can become leader
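Whether ELR is active is controlled through Kafka's feature-flag mechanism. On a cluster where it is not yet enabled, the upgrade command presumably looks like the following; the feature name is taken from KIP-966 and should be verified with describe first:
# assumption: feature name per KIP-966, confirm with `describe` before upgrading
bin/kafka-features.sh --bootstrap-server kafka-cluster:9092 upgrade \
  --feature eligible.leader.replicas.version=1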
5. Kafka Streams 4.2: Dead Letter Queue & Server-Side Rebalance
Kafka Streams in version 4.2 received significant production-ready improvements:
Built-in Dead Letter Queue (DLQ)
Before 4.2, when a record caused an error in a Kafka Streams topology, you had only 2 options: skip the record or crash the application. Now, exception handlers can redirect failed records to a Dead Letter Queue — a separate topic for later analysis and reprocessing.
// Kafka Streams 4.2 — DLQ in exception handler
Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_EXCEPTION_HANDLER_CLASS_CONFIG,
    DeadLetterQueueExceptionHandler.class);
props.put("dead.letter.queue.topic", "order-processing-dlq");
// pass props to the KafkaStreams(topology, props) constructor
Server-Side Rebalance Protocol (GA)
Kafka Streams now uses broker-side task assignment instead of client-side, significantly reducing complexity and rebalance time when scaling stream processing applications.
Anchored Wall-Clock Punctuation
This feature allows scheduling punctuation at fixed wall-clock times (e.g., the top of every hour or the start of each day) rather than only at fixed intervals. It is useful for calendar-aligned aggregation tasks.
6. Integrating Kafka 4.x with .NET
The official Confluent.Kafka library (version 2.14.0) provides high-level producer and consumer clients that are fully compatible with Kafka 4.x. Below are production-ready patterns.
Producer with Idempotent Delivery
using System.Text;
using System.Text.Json;
using Confluent.Kafka;
var config = new ProducerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    EnableIdempotence = true,  // broker de-duplicates retries, preserving per-partition order
    Acks = Acks.All,           // required for idempotent delivery
    MessageSendMaxRetries = 3,
    LingerMs = 5,              // small batching window for throughput
    BatchSize = 65536
};
using var producer = new ProducerBuilder<string, string>(config)
.SetErrorHandler((_, e) =>
Console.Error.WriteLine($"Kafka error: {e.Reason}"))
.Build();
var result = await producer.ProduceAsync("order-events",
new Message<string, string>
{
Key = orderId,
Value = JsonSerializer.Serialize(orderEvent),
Headers = new Headers
{
{ "correlation-id", Encoding.UTF8.GetBytes(correlationId) },
{ "event-type", Encoding.UTF8.GetBytes("OrderCreated") }
}
});
Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");
Consumer with Manual Offset Storage
var config = new ConsumerConfig
{
    BootstrapServers = "kafka-cluster:9092",
    GroupId = "order-processor",
    GroupProtocol = GroupProtocol.Consumer, // KIP-848 new protocol
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = true,        // background-commit only the offsets we store explicitly
    EnableAutoOffsetStore = false,  // StoreOffset() is called only after successful processing
    MaxPollIntervalMs = 300000
};
using var consumer = new ConsumerBuilder<string, string>(config)
.SetPartitionsAssignedHandler((c, partitions) =>
Console.WriteLine($"Assigned: {string.Join(", ", partitions)}"))
.Build();
consumer.Subscribe("order-events");
while (!cancellationToken.IsCancellationRequested)
{
var result = consumer.Consume(cancellationToken);
try
{
await ProcessOrderEvent(result.Message);
consumer.StoreOffset(result);
}
catch (Exception ex)
{
logger.LogError(ex, "Failed to process {Key}", result.Message.Key);
await PublishToDlq(result.Message);
consumer.StoreOffset(result);
}
}
Integration with .NET Aspire
The Aspire.Confluent.Kafka package provides dependency injection, health checks, and automatic telemetry:
// Program.cs — .NET Aspire host
var builder = DistributedApplication.CreateBuilder(args);
var kafka = builder.AddKafka("messaging") // the Aspire Kafka container runs in KRaft mode by default
    .WithDataVolume();
var orderService = builder.AddProject<Projects.OrderService>()
.WithReference(kafka);
// OrderService — DI registration
builder.AddKafkaProducer<string, OrderEvent>("messaging");
builder.AddKafkaConsumer<string, OrderEvent>("messaging", settings =>
{
settings.Config.GroupId = "order-processor";
    settings.Config.GroupProtocol = GroupProtocol.Consumer;
});
7. Complete Event-Driven Architecture with Kafka 4.x
Kafka 4.x with Share Groups enables building more flexible event-driven architectures, combining both pub/sub and point-to-point messaging on a single platform:
graph TD
API[API Gateway] --> CMD[Command Service]
API --> QRY[Query Service]
CMD --> KT1["Topic: domain-events<br/>Consumer Group — ordered"]
CMD --> KT2["Topic: task-queue<br/>Share Group — parallel"]
KT1 --> ES[Event Store Service]
KT1 --> PROJ[Projection Service]
KT1 --> NOTIFY[Notification Service]
KT2 --> W1[Worker 1]
KT2 --> W2[Worker 2]
KT2 --> W3[Worker 3]
ES --> DB1[(Event Store)]
PROJ --> DB2[(Read DB)]
QRY --> DB2
W1 --> EXT[External APIs]
W2 --> EXT
W3 --> EXT
style API fill:#2c3e50,stroke:#fff,color:#fff
style CMD fill:#e94560,stroke:#fff,color:#fff
style QRY fill:#e94560,stroke:#fff,color:#fff
style KT1 fill:#4CAF50,stroke:#fff,color:#fff
style KT2 fill:#ff9800,stroke:#fff,color:#fff
style ES fill:#f8f9fa,stroke:#e94560,color:#333
style PROJ fill:#f8f9fa,stroke:#e94560,color:#333
style NOTIFY fill:#f8f9fa,stroke:#e94560,color:#333
style W1 fill:#f8f9fa,stroke:#ff9800,color:#333
style W2 fill:#f8f9fa,stroke:#ff9800,color:#333
style W3 fill:#f8f9fa,stroke:#ff9800,color:#333
style DB1 fill:#2c3e50,stroke:#fff,color:#fff
style DB2 fill:#2c3e50,stroke:#fff,color:#fff
style EXT fill:#9e9e9e,stroke:#fff,color:#fff
Architecture combining Consumer Groups (ordered events) and Share Groups (parallel tasks)
When to use Consumer Group vs Share Group?
- Consumer Group: Domain events, CDC streams, event sourcing — where processing order matters
- Share Group: Email sending, image processing, report generation, API calls — where flexible worker scaling is needed
8. Migrating from ZooKeeper to KRaft
If you're running a Kafka cluster with ZooKeeper, the migration to KRaft involves 3 main steps:
- Prepare: Run kafka-features.sh describe to check feature levels (and make sure the cluster is on a bridge release; see the note below).
- Migrate: Use kafka-metadata.sh snapshot to transfer metadata from ZooKeeper to KRaft format. Start the KRaft controllers, switch each broker to KRaft mode with rolling restarts, and verify metadata consistency between ZK and KRaft.
- Finalize: Remove the zookeeper.connect configuration from broker configs and decommission the ZooKeeper ensemble. Monitor cluster health for 48-72 hours before removing the ZK nodes.
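Before and after each step, the cluster's feature levels can be sanity-checked with the stock tooling:
bin/kafka-features.sh --bootstrap-server kafka-cluster:9092 describe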
Important migration note
Kafka 4.0+ no longer supports ZooKeeper mode. If your cluster is running ZooKeeper, you must migrate to KRaft before upgrading to 4.x. Perform the migration on version 3.7 or 3.8, then upgrade to 4.x.
9. Kafka 4.x vs Other Messaging Systems
| Criteria | Kafka 4.x | RabbitMQ | AWS SQS/SNS |
|---|---|---|---|
| Model | Pub/Sub + Queue (Share Groups) | Queue + Pub/Sub (Exchange) | Queue (SQS) + Pub/Sub (SNS) |
| Throughput | Millions msg/s | ~50K msg/s | Nearly unlimited (managed) |
| Message retention | Configurable (days/weeks/forever) | Until consumed | 14 days (SQS) |
| Ordering | Per-partition guaranteed | Per-queue FIFO | FIFO queue or best-effort |
| Consumer scaling | Consumer Group + Share Group | Competing consumers | Auto-scaling consumers |
| Stream processing | Kafka Streams, ksqlDB | Not built-in | Requires Lambda/Kinesis |
| Operations | Self-managed or Confluent Cloud | Self-managed or CloudAMQP | Fully managed |
| Cost | Infra cost (or Confluent pricing) | Infra cost | Pay-per-request |
10. Best Practices for Kafka in Production
Sizing & Performance
- Partition count: Start with partitions = expected consumers × 2. Adding partitions is easy, removing them is not
- Replication factor: Always set at least 3 for production. Combine min.insync.replicas=2 with acks=all
- Batch size: Increase batch.size (64KB-256KB) and linger.ms (5-20ms) for better throughput
- Compression: Use compression.type=zstd for best compression ratio or lz4 for lowest latency
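Taken together, these recommendations map onto broker and producer settings roughly as follows (the values are illustrative starting points, not tuned numbers):
# broker / topic level
default.replication.factor=3
min.insync.replicas=2
# producer level
acks=all
batch.size=131072
linger.ms=10
compression.type=zstd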
Monitoring essentials
- Consumer lag: The most important metric — if lag grows continuously, consumers can't keep up with producers
- Under-replicated partitions: Indicates broker I/O or network issues
- Request latency (p99): Track produce and fetch latency at the 99th percentile
- Controller active count: In KRaft, there must always be exactly 1 active controller
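Consumer lag can be checked ad hoc with the stock CLI (the group name is illustrative); the LAG column shows, per partition, how far the group's committed offset trails the log end offset:
bin/kafka-consumer-groups.sh --bootstrap-server kafka-cluster:9092 --describe --group order-processor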
Security
- Use SASL/SCRAM or mTLS for authentication
- Enable TLS encryption for all inter-broker and client-broker connections
- Configure granular ACLs per-topic, per-consumer-group
- With KRaft, you only need to manage a single ACL system (no more separate ZooKeeper ACLs)
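For example, granting a service read access to one topic and one consumer group might look like this (principal and resource names are illustrative):
bin/kafka-acls.sh --bootstrap-server kafka-cluster:9092 --add \
  --allow-principal User:order-service \
  --operation Read \
  --topic order-events --group order-processor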
Conclusion
Apache Kafka 4.x marks the biggest transformation in the history of the world's most popular event streaming platform. Removing ZooKeeper doesn't just simplify operations — it unlocks scaling to millions of partitions. Share Groups fill Kafka's biggest gap — the ability to work as a message queue — eliminating the need for many systems to run both Kafka and RabbitMQ side by side. With the increasingly mature .NET ecosystem through Confluent.Kafka and .NET Aspire, now is the ideal time to adopt Kafka in your event-driven architecture.
References
- Apache Kafka 4.0.0 Release Announcement — kafka.apache.org
- Apache Kafka 4.2.0 Release Announcement — kafka.apache.org
- Apache Kafka 4.0: Default KRaft, Queues, Faster Rebalances — confluent.io
- .NET Client for Apache Kafka — Confluent Documentation
- Aspire.Confluent.Kafka — NuGet
- Kafka 4.0: KRaft, Queues, Better Rebalance — SoftwareMill