
Design an Analytics Events Pipeline (Kafka + ClickHouse)

How to build an event analytics pipeline in .NET: SDK ingestion, Kafka streaming, schema evolution, ClickHouse for OLAP, and the lambda architecture trade-offs.

Table of contents
  1. When does a dedicated analytics pipeline pay off?
  2. What numbers should I budget for?
  3. What does the architecture look like?
  4. What is the .NET 10 wiring for ingestion?
  5. How does the realtime aggregator work?
  6. What scale-out path does this support?
  7. What failure modes does this introduce?
  8. When is the dedicated pipeline overkill?
  9. Where should you go from here?

The analytics events pipeline is where every queue, stream, and storage idea in the series merges into one architecture. Apps emit "button clicked" events; the dashboard shows daily active users an hour later. Between those two endpoints sit Kafka, schema evolution, materialised views, and the cost trade-offs that decide how much you spend on data. This chapter designs the .NET version.

When does a dedicated analytics pipeline pay off?

Three signals.

Volume exceeds operational tools. Application Insights or Mixpanel gets expensive past roughly 100M events/month; at that scale a self-hosted pipeline pays for itself.

Many downstream consumers. Dashboard, data warehouse, autocomplete ranking, fraud detection, ML feature store - all of them want the same event stream. A central Kafka decouples producers from consumers.

Replay matters. After a bug fix in the analytics job, you need to recompute last month's data. A stream that retains 7 days of events makes that trivial; a queue that deletes messages on consume does not.

If you have under 1M events/day and one dashboard, Application Insights or PostHog Cloud is cheaper than building.

What numbers should I budget for?

Events / day                500M
Avg event size              1 KB raw (~200 bytes after Zstd compression)
Storage / year (raw)        500M * 365 * 1 KB ≈ 180 TB
Storage / year (compressed) ~36 TB
Storage / year (ClickHouse) ~10 TB after columnar encoding + dictionaries
Peak ingestion rate         500M / ~100K s per day ≈ 5K/s average, x5 peak ≈ 25K events/s
Kafka brokers needed        3 (replication factor 3)
ClickHouse nodes needed     3-6 small VMs

The compression ratio (5x from wire compression, roughly 18x end-to-end after ClickHouse's columnar encoding) is what makes this affordable; without it, the same workload would need an order of magnitude more storage.
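As a sanity check, the table's arithmetic fits in a few lines of C#. The constants (500M events/day, 1 KB average payload, 5x peak-to-average) are the assumptions above, not measurements.

// Back-of-envelope capacity estimate; constants mirror the assumptions in the table.
const double eventsPerDay      = 500_000_000;
const double rawBytesPerEvent  = 1_000;    // ~1 KB average payload
const double zstdBytesPerEvent = 200;      // ~5x wire compression
const double peakFactor        = 5;        // peak-to-average ratio

double rawTbPerYear  = eventsPerDay * 365 * rawBytesPerEvent  / 1e12;  // ≈ 182 TB
double zstdTbPerYear = eventsPerDay * 365 * zstdBytesPerEvent / 1e12;  // ≈ 36 TB
double avgPerSecond  = eventsPerDay / 86_400;                          // ≈ 5.8K events/s
double peakPerSecond = avgPerSecond * peakFactor;                      // ≈ 29K (the table rounds a day to 100K s, giving 25K)

Console.WriteLine($"raw {rawTbPerYear:F0} TB/yr, compressed {zstdTbPerYear:F0} TB/yr, peak {peakPerSecond:F0} events/s");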

What does the architecture look like?

flowchart LR
    Apps[Mobile / Web SDK] -->|batch HTTP| Ingest[ASP.NET Core Ingest]
    Ingest -->|partition by user| K[(Kafka)]
    K --> RT[Realtime aggregator<br/>Flink / .NET worker]
    K --> Loader[Batch loader]
    Loader --> CH[(ClickHouse)]
    RT --> CH
    Apps2[Dashboards] --> CH
    K --> Lake[(Data lake / S3)]

Five tiers. Ingest accepts batched events from clients, validates schema, publishes to Kafka. Realtime aggregator computes 5-minute rollups. Batch loader writes raw events into ClickHouse. The data lake holds an immutable copy for replay. Dashboards query ClickHouse.
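The batch-loader tier can be surprisingly small. A minimal sketch, assuming ClickHouse's HTTP interface on port 8123 and a pre-created events_raw table whose columns match the serialised property names (both assumptions, not part of the design above):

using System.Text;
using System.Text.Json;

// Minimal batch loader: stream newline-delimited JSON rows into ClickHouse over HTTP.
// Assumes ClickHouse on localhost:8123 and an events_raw table matching the JSON shape.
public class ClickHouseBatchLoader
{
    private readonly HttpClient _http = new() { BaseAddress = new Uri("http://localhost:8123") };

    public async Task LoadAsync(IReadOnlyList<AnalyticsEvent> events, CancellationToken ct)
    {
        // JSONEachRow format: one JSON object per line, column names taken from the keys.
        var rows = string.Join('\n', events.Select(e => JsonSerializer.Serialize(e)));
        var query = Uri.EscapeDataString("INSERT INTO events_raw FORMAT JSONEachRow");

        var response = await _http.PostAsync($"/?query={query}",
            new StringContent(rows, Encoding.UTF8, "text/plain"), ct);
        response.EnsureSuccessStatusCode();   // let the caller retry the whole batch on failure
    }
}

In practice the loader sits behind a Kafka consumer and flushes every few seconds or every few thousand rows, because ClickHouse prefers fewer, larger inserts over many small ones.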

What is the .NET 10 wiring for ingestion?

using System.Text.Json;
using Confluent.Kafka;

public record AnalyticsEvent(
    string EventName, Guid UserId, DateTimeOffset Timestamp,
    Dictionary<string, object> Properties, int SchemaVersion);

app.MapPost("/v1/events", async (AnalyticsEvent[] batch, IKafkaProducer producer,
                                   IValidator validator) =>
{
    if (batch.Length > 1000) return Results.BadRequest("Batch too large.");

    var publishTasks = new List<Task>();
    foreach (var ev in batch)
    {
        if (!validator.Validate(ev)) continue;  // log and skip invalid events
        publishTasks.Add(producer.ProduceAsync("events.raw",
            key: ev.UserId.ToString(),
            value: JsonSerializer.Serialize(ev)));
    }
    await Task.WhenAll(publishTasks);  // let the client batch; awaiting each send defeats LingerMs
    return Results.Accepted();
})
.RequireRateLimiting("ingest")
.AllowAnonymous();   // SDK uses an API key, not user auth

// Producer setup with batching for throughput
public interface IKafkaProducer
{
    Task ProduceAsync(string topic, string key, string value);
}

public class KafkaProducer : IKafkaProducer
{
    private readonly IProducer<string, string> _producer;
    public KafkaProducer(IConfiguration config)
    {
        var conf = new ProducerConfig
        {
            BootstrapServers = config["Kafka:Brokers"],
            CompressionType = CompressionType.Zstd,
            LingerMs = 50,                  // batch up to 50 ms
            BatchSize = 100_000,
            Acks = Acks.All
        };
        _producer = new ProducerBuilder<string, string>(conf).Build();
    }
    public Task ProduceAsync(string topic, string key, string value)
        => _producer.ProduceAsync(topic, new() { Key = key, Value = value });
}

Three details. The user ID partition key keeps events from one user in order on one Kafka partition, which is useful for funnel analysis. Zstd matches or beats gzip's compression ratio at a fraction of the CPU cost. Acks.All ensures the write is replicated to all in-sync replicas before the producer gets an acknowledgement.

How does the realtime aggregator work?

flowchart LR
    K[(Kafka events.raw)] --> Worker[.NET worker<br/>5 min window]
    Worker --> RT[(events.rollup_5min)]
    RT --> CH[(ClickHouse<br/>materialised view)]
    Worker --> CH
    CH --> Dashboard[Grafana]

A .NET background worker consumes from Kafka, accumulates events in 5-minute windows, computes counters per user/event-type, and flushes the aggregates to a ClickHouse table. ClickHouse's materialised views recompute hourly and daily aggregates incrementally. Dashboards query the rollup tables, not raw events.
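A sketch of that worker, using the Confluent.Kafka consumer. IRollupSink is a hypothetical abstraction over the ClickHouse rollup-table writer, and the tumbling window is simplified to an in-memory dictionary flushed every five minutes.

using System.Text.Json;
using Confluent.Kafka;

// Hypothetical sink that writes one finished window to the rollup table.
public interface IRollupSink
{
    Task WriteAsync(DateTimeOffset windowStart,
                    IReadOnlyDictionary<(string EventName, Guid UserId), long> counts,
                    CancellationToken ct);
}

// Simplified 5-minute tumbling-window aggregator.
public class RollupWorker(IRollupSink sink) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "kafka:9092",
            GroupId = "rollup-5min",
            AutoOffsetReset = AutoOffsetReset.Earliest,
            EnableAutoCommit = false            // commit only after a window is flushed
        };
        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe("events.raw");

        var window = new Dictionary<(string EventName, Guid UserId), long>();
        var windowStart = DateTimeOffset.UtcNow;

        while (!stoppingToken.IsCancellationRequested)
        {
            var result = consumer.Consume(TimeSpan.FromSeconds(1));
            if (result is not null)
            {
                var ev = JsonSerializer.Deserialize<AnalyticsEvent>(result.Message.Value)!;
                var key = (ev.EventName, ev.UserId);
                window[key] = window.GetValueOrDefault(key) + 1;
            }

            if (DateTimeOffset.UtcNow - windowStart >= TimeSpan.FromMinutes(5))
            {
                await sink.WriteAsync(windowStart, window, stoppingToken);  // flush rollups first
                consumer.Commit();                                          // then commit offsets
                window.Clear();
                windowStart = DateTimeOffset.UtcNow;
            }
        }
    }
}

Flushing before committing means a crash replays at most one window's worth of events, which only over-counts if the sink write is not idempotent; a real implementation would key the rollup rows by window start so replays overwrite rather than double-count.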

What scale-out path does this support?

Within this design, Kafka scales by adding partitions and consumer instances, and ClickHouse scales by adding shards. For >10M events/sec, you need dedicated Kafka clusters per region plus cross-region replication; that scale is rare and warrants a data team.

What failure modes does this introduce?

When is the dedicated pipeline overkill?

Below ~5M events/day, hosted analytics tools win on cost and ops. Mixpanel, Amplitude, PostHog Cloud, Application Insights all give you most of the same dashboards without the data engineering cost. Build the dedicated pipeline at the volume where the bills justify the team.

Where should you go from here?

You have completed the case studies. Next: how to answer system design interviews.

Frequently asked questions

Why Kafka over a regular message queue?
Three reasons: throughput (>1M events/sec on modest hardware), durable replay (re-run aggregations after a bug fix), and many-consumers semantics (the same stream feeds the dashboard, the data warehouse, and the autocomplete updater). Regular queues from chapter 6 are for task queues; analytics is the canonical Kafka use case.
ClickHouse, BigQuery, or Snowflake?
ClickHouse self-hosted is cheapest at scale and gives sub-second OLAP queries; great when you have a dedicated data team. BigQuery and Snowflake are managed - higher cost but no operational burden. For a .NET shop without a dedicated data team, BigQuery via the GCP integration is often the right starting point. ClickHouse is the upgrade when bills get scary.
How do I handle schema changes without breaking the pipeline?
Three rules: (1) all events have a schema_version field; (2) consumers tolerate unknown fields; (3) deletions wait for one full retention period. Use Protobuf or Avro for stricter typing if the team is large enough; JSON is fine for small teams who can coordinate via PR review.
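Rule 2 falls out of System.Text.Json almost for free if the consumer's contract captures unrecognised properties instead of rejecting them. A sketch, with the snake_case field names as an assumption rather than part of the spec above:

using System.Text.Json;
using System.Text.Json.Serialization;

// Consumer-side contract: known fields are typed, anything the producer added later
// lands in Extra instead of breaking deserialisation.
public class AnalyticsEventV1
{
    [JsonPropertyName("schema_version")] public int SchemaVersion { get; set; } = 1;
    [JsonPropertyName("event_name")]     public string EventName { get; set; } = "";
    [JsonPropertyName("user_id")]        public Guid UserId { get; set; }
    [JsonPropertyName("timestamp")]      public DateTimeOffset Timestamp { get; set; }

    [JsonExtensionData]                  // unknown fields survive a round-trip
    public Dictionary<string, JsonElement>? Extra { get; set; }
}

Because the extension data round-trips, a consumer still on schema version 1 can re-publish an event produced at version 2 without silently dropping the newer fields.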
Lambda architecture - is it still worth it?
Less than it used to be. Original Lambda had a fast 'speed layer' (real-time approximate) plus a slow 'batch layer' (correct, latent). Modern streaming engines (Flink, ClickHouse materialised views) collapse both into one. Use lambda only if your batch and stream technologies are different vendors and you need the audit trail of the batch run.