ClickHouse 2026 — Sub-second OLAP Architecture with SharedMergeTree, Parallel Replicas, and Storage-Compute Separation for Petabyte Analytics

Posted on: 4/17/2026 3:10:28 AM

Table of contents

  1. Why ClickHouse dominates production analytics in 2026
     - Three decisive questions before touching ClickHouse
  2. Evolution — from Yandex 2016 to cloud-native 2026
  3. Core architecture — column store, vectorized execution, and MergeTree
     3.1 Parts, granules, marks — three concepts to internalize
         - ORDER BY design principle
     3.2 Vectorized execution — why 1 billion rows/second
  4. SharedMergeTree — storage-compute separation, properly
  5. Parallel Replicas — linear query scale-out
     - Parallel Replicas and consistency
  6. Projection, Materialized View, and Refreshable MV — three tools to accelerate queries
     6.1 Projection — a second index tier right inside the part
     6.2 Materialized View — real-time pre-aggregation with AggregatingMergeTree
  7. Tiered storage, TTL, and data lifecycle management
  8. Ingestion patterns — async insert, Kafka Engine, native protocol
     8.1 Client-side batch insert
     8.2 Async insert — batching on the server
     8.3 Kafka Engine and Refreshable MV — an internal pipeline
  9. Query tuning — sparse index, skip index, and common anti-patterns
     9.1 EXPLAIN indexes — always the first step
     9.2 Skip indexes — add them when ORDER BY isn't enough
     9.3 Four anti-patterns that kill clusters fastest
         - Things to never do
  10. Compared with the 2026 OLAP world — StarRocks, Doris, Druid, Pinot, DuckDB
  11. Three typical production use cases — how to organize tables
     11.1 Observability — logs, metrics, traces
     11.2 Product analytics — user events and funnels
     11.3 Real-time user-facing dashboards
  12. Production checklist — don't forget these before go-live
     - Before opening the cluster to real traffic
  13. Closing — ClickHouse is a tool, not a silver bullet
  14. References

1. Why ClickHouse dominates production analytics in 2026

In the decade since Yandex open-sourced ClickHouse in mid-2016, it has remained one of the most rapidly evolving analytical-database projects in the open-source ecosystem — averaging one release a month — and the OLAP engine most often referenced in engineering posts from Cloudflare, GitHub, Uber, eBay, Spotify, Tinybird, PostHog, Plausible, Sentry, and Highlight. By the 25.x series in 2026, ClickHouse is no longer just the pick for "teams tired of Hadoop". It has become the default for three very different use-case tiers: observability (logs, metrics, traces), product analytics (Mixpanel- or Amplitude-style event tracking), and real-time user-facing dashboards with sub-second latency over data added seconds ago.

This article is an in-depth handbook for architects and senior engineers — not just explaining "what ClickHouse is" but going straight into the hardest decisions when rolling a cluster into production at a scale of hundreds of billions of rows per day: designing sparse primary keys, choosing between classical replicated MergeTree and the new SharedMergeTree, when to enable Parallel Replicas, how to organize Projections and Materialized Views, tiered-storage strategy with S3, defending against "too many parts", and the new playing field between ClickHouse Cloud and successors like StarRocks 3, Apache Doris 2.x, Druid, Apache Pinot, or embedded DuckDB.

  • 1B rows/second scanned on a mid-sized cluster
  • 100× average compression on text logs vs raw
  • ~50 ms p50 latency on trillion-row analytical queries
  • 25.x — the 2026 release line with stable SharedMergeTree

Three decisive questions before touching ClickHouse

  • Is your workload write-heavy rather than update-heavy (append-only or near-append)?
  • Are queries really column aggregates/filters over a time range rather than single-row key lookups?
  • Can you accept eventual consistency on the order of seconds for aggregated views?

If all three are "yes", ClickHouse is almost always the right choice. If any is "no", consider Postgres, TiDB, Snowflake, or stick with a transactional architecture.

2. Evolution — from Yandex 2016 to cloud-native 2026

Knowing the stages the project went through explains why SharedMergeTree emerged, why Parallel Replicas is the key answer for heavy-scan workloads, and why analytics is no longer a privilege of traditional data warehouses.

2016 — Yandex open-sources ClickHouse
Internal project for Yandex.Metrica begun in 2009 goes public. The very first release already had columnar storage, vectorized execution, LZ4/ZSTD compression, and the MergeTree engine.
2019 — ReplicatedMergeTree + ZooKeeper
Replication model backed by ZooKeeper for metadata. Pushed ClickHouse from "one huge single node" into many-sharded clusters — the default for nearly a decade.
2021 — ClickHouse Keeper replaces ZooKeeper
Rewritten in C++ with Raft and ZooKeeper-wire compatible. Removes JVM overhead, simplifies operations, and can run embedded in the same binary.
2022 — ClickHouse Cloud GA
A storage-compute-separated model on S3/GCS/ABS. For the first time ClickHouse could auto-scale compute without rebalancing shards — the groundwork for SharedMergeTree.
2023 — SharedMergeTree preview
A new engine where replicas no longer own parts locally. Every replica reads from object storage, coordinator is thin, no per-part ZooKeeper needed. Officially GA in the Cloud by mid-2024.
2024 — Parallel Replicas stable
One query gets split across multiple replicas of the same shard, instead of one replica owning the whole shard. Query time on a 10-node cluster drops linearly with the number of replicas.
Q4 2025 — ClickHouse 25.x LTS
Consolidates Lightweight UPDATE (patch parts), projection vectorized reads, cross-node query cache, a native JSON variant type, and Iceberg/Delta Lake native reads as local-like tables.
2026 — Standardizing on the open ecosystem
ClickHouse 25.x stabilizes integrations with Apache Iceberg catalogs (REST + AWS Glue), Delta Lake v3, and Hudi. Altinity, Tinybird, Double.cloud, and ClickHouse Cloud become the four major managed providers. The engineering tooling (dbt-clickhouse, embedded chDB, clickhouse-local) matures.

3. Core architecture — column store, vectorized execution, and MergeTree

Unlike traditional OLTP systems (Postgres, MySQL, SQL Server) that lay data out by row, ClickHouse stores each column in its own file with a mark file as a sparse index. When a query reads only 3 of 200 columns, disk sees just 1.5% of the data — the root of every performance advantage. The second pillar is vectorized execution: instead of processing per-tuple, ClickHouse reads blocks (65,536 rows by default) and applies SIMD operators to the whole block. Finally, MergeTree — a storage engine inspired by the LSM-trees of Bigtable/Cassandra but streamlined for analytical writes.

graph TB
    INSERT["INSERT INTO events VALUES ..."] --> PART_NEW["New part: part_1_1_0"]
    PART_NEW --> DISK["Disk / Object Storage"]
    DISK --> MERGE{"Background merge"}
    MERGE --> PART_L1["part_1_20_1 (level 1)"]
    PART_L1 --> MERGE2{"Merge again"}
    MERGE2 --> PART_L2["part_1_100_2 (level 2)"]
    PART_L2 --> MERGE3["... until max size reached"]
    SELECT["SELECT ... FROM events WHERE ..."] --> PRIMARY_IDX["Sparse primary index (mark)"]
    PRIMARY_IDX --> SKIP_IDX["Skip indexes (minmax, bloom, set)"]
    SKIP_IDX --> READ_PARTS["Read only the required parts + granules"]
    READ_PARTS --> VEC["Vectorized execution over 65K-row blocks"]
    VEC --> AGGR["Aggregate / Join / Window"]
    AGGR --> OUTPUT["Client"]
MergeTree write path and vectorized read path in the same table

3.1 Parts, granules, marks — three concepts to internalize

Each INSERT creates a new part on disk containing every column plus index and metadata files. The part name looks like all_<min_block>_<max_block>_<level>. Many small parts are a friend of write throughput but a foe of reads — every query must open every part, read its primary index, and merge results. Background merges compact many small parts into larger ones, applying engine-specific merge logic (Replacing, Collapsing, Aggregating) if the table declares it.

Inside a part, data is split into granules (8,192 rows by default). Each granule has an entry in each column's mark file — this is the sparse primary index. ClickHouse doesn't index every row (like Postgres's B-trees) — it only records the leading value of each granule, so the index stays small enough to live entirely in RAM even for trillion-row tables.
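The part and granule layout can be inspected directly — system.parts exposes one row per active part, including its mark (granule) count. A read-only inspection query, assuming a table named events:

```sql
-- One row per active part: rows, marks (≈ granules), bytes, and merge level
SELECT
    name,
    rows,
    marks,                              -- ≈ rows / index_granularity
    formatReadableSize(bytes_on_disk) AS size,
    level                               -- merge depth: 0 = freshly inserted
FROM system.parts
WHERE table = 'events' AND active
ORDER BY modification_time DESC
LIMIT 10;
```

Watching marks and level here is also the quickest way to verify that background merges are keeping up with your insert rate.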

CREATE TABLE events
(
    user_id      UInt64,
    event_time   DateTime64(3, 'UTC'),
    event_name   LowCardinality(String),
    project_id   UInt32,
    properties   Map(String, String),
    ip           IPv6,
    country_code FixedString(2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (project_id, event_name, event_time)
SETTINGS index_granularity = 8192;

Three things worth noting:

  • PARTITION BY month. Partitioning isn't an index — it controls the physical folder layout on disk, making drop/TTL by time block dirt cheap. Avoid tiny partitions (hourly or daily on low-volume tables) — they spawn too many parts.
  • ORDER BY is the primary key. Column order matters: the more selective the prefix is for your common queries, the more aggressive the skipping. Common rule: lower-cardinality columns first, time columns last.
  • LowCardinality, FixedString, Map — compact data types that deliver good compression ratios and better CPU cache friendliness than raw String.

ORDER BY design principle

Ask: "If I could only choose 2-3 columns to filter most of my queries on, which would they be?" — those are your ORDER BY candidates. Don't try to cram every filterable column there; a sparse index isn't a B-tree, only the prefix actually skips data. For the rest, use skip indexes (minmax, bloom filter, set, ngrambf) or projections.
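To make the prefix rule concrete, here is a hypothetical schema showing two ORDER BY choices for the same query pattern (table and column names are illustrative):

```sql
-- Typical query: WHERE tenant_id = ? AND ts BETWEEN ? AND ?

-- Good: the selective, low-cardinality column leads, time comes last.
-- The sparse index jumps straight to one tenant's contiguous granule range.
CREATE TABLE metrics_good (tenant_id UInt32, ts DateTime, value Float64)
ENGINE = MergeTree
ORDER BY (tenant_id, ts);

-- Bad: with time first, every tenant's rows are interleaved by timestamp,
-- so a single-tenant filter still touches granules across the whole range.
CREATE TABLE metrics_bad (tenant_id UInt32, ts DateTime, value Float64)
ENGINE = MergeTree
ORDER BY (ts, tenant_id);
```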

3.2 Vectorized execution — why 1 billion rows/second

For each block (65,536 rows) the engine runs a chain of processors (source, filter, expression, aggregating, merge, sink). Each processor takes a block, transforms it in place, and pushes it to the next — much like an Apache Arrow pipeline. Most operators (comparison, arithmetic, hash, count) compile down to SIMD with AVX2/AVX-512 when the runtime detects the CPU flags. That's why a 64-core server can scan tens of GB/s when uncompressed data sits hot in the page cache.

4. SharedMergeTree — Storage-compute separation, properly

The classic ReplicatedMergeTree model has a problem every operator has faced: when adding a new replica, that replica must fetch every part from its peers — for a 50 TB table that can take days. Worse, each part operation (merge, mutation) requires coordination through ZooKeeper/Keeper, and the number of znodes grows linearly with the part count, making it easy to hit ensemble limits.

SharedMergeTree fixes this at the root by fully separating storage from compute. All parts live on object storage (S3, GCS, Azure Blob, MinIO). Every replica reads and writes into the same set of parts. Metadata sits in a thin store (Keeper or a distributed metadata service), only holding small fragments like "list of active parts" and "merge locks" rather than every file chunk.

graph TB
    subgraph Compute["Compute layer (stateless)"]
        R1["Replica 1"]
        R2["Replica 2"]
        R3["Replica 3"]
    end
    subgraph Meta["Metadata (Keeper)"]
        META["Active parts, merge locks"]
    end
    subgraph Storage["Object storage"]
        S3["S3 / GCS / ABS"]
    end
    R1 --- META
    R2 --- META
    R3 --- META
    R1 --- S3
    R2 --- S3
    R3 --- S3
    CLIENT["Client / Load balancer"] --> R1
    CLIENT --> R2
    CLIENT --> R3
SharedMergeTree — stateless replicas, data on object storage
Characteristic        | ReplicatedMergeTree                    | SharedMergeTree
Where parts live      | Local disk on each replica             | Shared object storage
Adding a new replica  | Must fetch every part (hours to days)  | Ready almost immediately
Keeper load           | Proportional to part count             | Thin metadata only
Storage cost          | Multiplied by replica count            | A single copy on S3
Hot read latency      | Local NVMe — lowest                    | Needs SSD/RAM cache
Best use case         | Self-hosted bare-metal with NVMe       | Cloud-native, auto-scale, multi-tenant

SharedMergeTree's main trade-off is cold reads — if neither the page cache nor the local disk cache holds a part, every read goes to S3 with 10-50 ms latency. 25.x clusters mitigate this with a persistent local filesystem cache, configurable in GB with an LRU-style eviction policy. For dashboard workloads, keeping hot recent data in cache is enough; only queries into deep history touch S3 — which covers the vast majority of real use cases.
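The cache behavior can be exercised per query and inspected via system tables. A small sketch (the events table is this article's example schema; exact system.filesystem_cache columns vary by version):

```sql
-- Per-query control over the local cache for object-storage reads
SELECT count()
FROM events
WHERE event_time >= now() - INTERVAL 1 HOUR
SETTINGS enable_filesystem_cache = 1;

-- How much is currently held in the local filesystem cache
SELECT formatReadableSize(sum(size)) AS cached_bytes
FROM system.filesystem_cache;
```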

5. Parallel Replicas — linear query scale-out

In the traditional sharded model, a query SELECT count(*) FROM events WHERE event_time > now() - INTERVAL 1 DAY is pushed to every shard, but each shard uses only one replica. With 10 shards and 3 replicas each, 20 machines effectively sit idle. Parallel Replicas changes the game: the query is chunked by sets of granules, and each replica within a shard takes an independent portion. Time drops nearly linearly with the number of participating replicas.

-- Enable parallel replicas for the session
SET allow_experimental_parallel_reading_from_replicas = 2;
SET max_parallel_replicas = 6;
SET parallel_replicas_for_non_replicated_merge_tree = 1;

-- Query on a trillion-row table
SELECT
    toStartOfHour(event_time) AS hour,
    event_name,
    count() AS cnt
FROM events_distributed
WHERE event_time BETWEEN now() - INTERVAL 7 DAY AND now()
GROUP BY hour, event_name
ORDER BY hour DESC, cnt DESC
LIMIT 100;

A few conditions are required for Parallel Replicas to pay off: the table is large enough that range scans are worth splitting, inter-replica network has high bandwidth, and a coordinator aggregates intermediate results. For small queries (reading a few million rows), coordination overhead exceeds the win — ClickHouse 25.x auto-falls back to the old model when estimated granules are below a threshold (parallel_replicas_min_number_of_granules_to_enable).

Parallel Replicas and consistency

Because replicas can be a few seconds out of sync, Parallel Replicas results can differ slightly — on the order of the replication lag — from what a fully caught-up replica would return. That's acceptable for analytics, but if you're running financial reports that must be exact, use a single replica or set max_replica_delay_for_distributed_queries appropriately.
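One way to bound that staleness, sketched with standard session settings (the 5-second threshold is illustrative):

```sql
-- Exclude replicas lagging more than 5 s from distributed reads
SET max_replica_delay_for_distributed_queries = 5;

-- For must-be-exact reports: fail the query instead of silently
-- falling back to stale replicas when none are within the threshold
SET fallback_to_stale_replicas_for_distributed_queries = 0;
```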

6. Projection, Materialized View, and Refreshable MV — three tools to accelerate queries

When the ORDER BY primary key can't serve every query, ClickHouse offers three very different tools. Many teams use all three in the same cluster — understanding the trade-offs picks the right one:

Tool              | Essence                                                                         | When to use                                                             | Downsides
Projection        | A "view" stored alongside the original part, with its own ORDER BY or GROUP BY  | Accelerating queries that filter on columns other than the primary key  | Higher write cost; not cross-partition
Materialized View | Triggers on every INSERT into the source, writes into a target table            | Real-time pre-aggregation (roll up by minute, hour, day)                | Strict pipeline logic; backfilling is hard
Refreshable MV    | Runs a SELECT on a schedule, overwriting the target table                       | Heavy reports, complex joins, no real-time needs                        | Non-incremental; has latency

6.1 Projection — a second index tier right inside the part

ALTER TABLE events ADD PROJECTION by_country (
    SELECT
        country_code,
        toStartOfHour(event_time) AS hour,
        count() AS cnt,
        uniqState(user_id) AS users
    GROUP BY country_code, hour
);

ALTER TABLE events MATERIALIZE PROJECTION by_country;

When SELECT country_code, count() FROM events WHERE country_code = 'VN' GROUP BY country_code runs, the optimizer answers it from the by_country projection instead of reading the base table. Users don't need to know the projection exists — a great property for a data team that wants to optimize without forcing devs to rewrite SQL.

6.2 Materialized View — real-time pre-aggregation with AggregatingMergeTree

CREATE TABLE events_hourly
(
    project_id UInt32,
    hour       DateTime,
    event_name LowCardinality(String),
    cnt        AggregateFunction(count),
    users      AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (project_id, hour, event_name);

CREATE MATERIALIZED VIEW events_hourly_mv
TO events_hourly
AS SELECT
    project_id,
    toStartOfHour(event_time) AS hour,
    event_name,
    countState() AS cnt,
    uniqState(user_id) AS users
FROM events
GROUP BY project_id, hour, event_name;

-- Aggregated query is extremely cheap via the roll-up table
SELECT
    event_name,
    countMerge(cnt) AS total,
    uniqMerge(users) AS uu
FROM events_hourly
WHERE project_id = 42
  AND hour >= now() - INTERVAL 7 DAY
GROUP BY event_name
ORDER BY total DESC
LIMIT 20;

AggregatingMergeTree combined with the *State/*Merge function family is ClickHouse's strongest weapon. uniqState doesn't keep a real user list — it stores a compact cardinality sketch (HyperLogLog-style, tens of KB at most, with error well under 1%). Merging those states is mathematically sound. Similarly, quantileTDigestState uses t-digest, topKState uses a Filtered Space-Saving sketch, and argMinState/argMaxState support pivoting — all are composable mathematical sketches.
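To make the composability concrete, a hypothetical latency roll-up using t-digest states — the requests source table, latency_ms column, and latency_hourly table are illustrative, not part of the article's schema:

```sql
CREATE TABLE latency_hourly
(
    service LowCardinality(String),
    hour    DateTime,
    p_state AggregateFunction(quantileTDigest, Float32)
)
ENGINE = AggregatingMergeTree
ORDER BY (service, hour);

-- Fold raw request latencies into one t-digest per service-hour
INSERT INTO latency_hourly
SELECT service, toStartOfHour(ts), quantileTDigestState(toFloat32(latency_ms))
FROM requests
GROUP BY service, toStartOfHour(ts);

-- p99 over any window = a merge of the hourly sketches,
-- with no need to re-scan the raw requests
SELECT service, quantileTDigestMerge(0.99)(p_state) AS p99_ms
FROM latency_hourly
WHERE hour >= now() - INTERVAL 30 DAY
GROUP BY service;
```

Note that the quantile level (0.99) is supplied only at merge time — the stored state is level-agnostic, so the same roll-up serves p50, p95, and p99.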

7. Tiered storage, TTL, and data lifecycle management

A typical production log table grows 1-10 TB per day. Nobody keeps everything forever, but manual deletion is dangerous. ClickHouse provides TTL and tiered storage right inside the DDL — each part can auto-migrate between disks by age, auto-roll-up, or auto-delete.

CREATE TABLE logs
(
    ts        DateTime,
    level     LowCardinality(String),
    service   LowCardinality(String),
    message   String,
    trace_id  String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, ts)
TTL
    ts + INTERVAL 7 DAY TO VOLUME 'warm',
    ts + INTERVAL 30 DAY TO VOLUME 'cold',
    ts + INTERVAL 180 DAY DELETE,
    ts + INTERVAL 1 DAY GROUP BY service, toStartOfHour(ts)
        SET message = any(message), trace_id = any(trace_id)
SETTINGS storage_policy = 'hot_warm_cold';

The policy above runs log aggregation after one day, moves parts to the warm volume (cheaper SSD) after 7 days, to cold (S3) after 30 days, and deletes after 6 months. All handled by background threads, with no impact on queries. The storage policy is declared in config.xml:

<storage_configuration>
    <disks>
        <hot><path>/var/lib/clickhouse/hot/</path></hot>
        <warm><path>/var/lib/clickhouse/warm/</path></warm>
        <cold>
            <type>s3</type>
            <endpoint>https://s3.ap-southeast-1.amazonaws.com/my-ch-cold/</endpoint>
            <access_key_id>...</access_key_id>
            <secret_access_key>...</secret_access_key>
            <metadata_path>/var/lib/clickhouse/disks/cold/</metadata_path>
        </cold>
    </disks>
    <policies>
        <hot_warm_cold>
            <volumes>
                <hot><disk>hot</disk></hot>
                <warm><disk>warm</disk></warm>
                <cold><disk>cold</disk></cold>
            </volumes>
        </hot_warm_cold>
    </policies>
</storage_configuration>

8. Ingestion patterns — async insert, Kafka Engine, native protocol

The most common new-user mistake is to INSERT one event per request. Each request births a part — "too many parts" within minutes. Three correct ingestion patterns in 2026 production:

8.1 Client-side batch insert

Simplest: batch at the application layer into 10,000-200,000 rows and insert once. Use the native protocol (port 9000) instead of HTTP for maximum throughput. Mature client libraries support bulk: .NET (ClickHouse.Client, ClickHouse.Driver), Node.js (@clickhouse/client), Go (clickhouse-go/v2).

8.2 Async insert — batching on the server

-- Enable server-side batching for the session
SET async_insert = 1;
SET wait_for_async_insert = 1;

-- Clients then send ordinary single-row INSERTs; the server
-- batches them within a window (default 200 ms / 1 MB)
INSERT INTO events VALUES (...);

Async insert is gold when thousands of microservices fire INSERTs without client-side batching control. ClickHouse buffers in RAM and flushes at time/size thresholds. Pair with wait_for_async_insert = 0 (fire-and-forget) to sharply reduce producer latency — at the risk of losing a few batches if the server crashes before flush. Fine for logs, not ideal for financial data.

8.3 Kafka Engine and Refreshable MV — an internal pipeline

CREATE TABLE events_kafka
(
    raw String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'broker-1:9092,broker-2:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'ch_consumer',
    kafka_format = 'JSONAsString',
    kafka_max_block_size = 65536;

CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT
    JSONExtractUInt(raw, 'user_id')          AS user_id,
    parseDateTime64BestEffort(JSONExtractString(raw, 'ts')) AS event_time,
    JSONExtractString(raw, 'event_name')     AS event_name,
    ...
FROM events_kafka;

The 25.x Kafka Engine supports the next-generation KIP-848 consumer-group protocol with incremental (cooperative) rebalancing, shrinking rebalance pauses from seconds to hundreds of milliseconds — valuable for clusters scaling daily. Some teams externalize the Kafka→ClickHouse tier with Vector or Redpanda Connect to decouple the consumer lifecycle from ClickHouse, trading an extra component to operate for that flexibility.

9. Query tuning — sparse index, skip index, and common anti-patterns

9.1 EXPLAIN indexes — always the first step

EXPLAIN indexes = 1
SELECT count() FROM events
WHERE project_id = 42
  AND event_name = 'page_view'
  AND event_time >= now() - INTERVAL 1 DAY;

-- Output (abridged):
-- MergeTreeThread
--   Indexes:
--     PrimaryKey
--       Keys: project_id, event_name, event_time
--       Condition: and((project_id = 42), (event_name = 'page_view'))
--       Parts: 42/1200
--       Granules: 1834/2400000

"Parts: 42/1200" means ClickHouse read only 42 of 1,200 parts — the rest skipped by partition pruning. "Granules: 1834/2400000" shows the sparse primary index eliminated 99.92% of the data. If these ratios are high (nearly full-table read), it's a sign to revisit ORDER BY or add skip indexes.

9.2 Skip indexes — add them when ORDER BY isn't enough

ALTER TABLE events
    ADD INDEX idx_country country_code TYPE set(50) GRANULARITY 4,
    ADD INDEX idx_props_vip properties['vip'] TYPE bloom_filter(0.01) GRANULARITY 4,
    ADD INDEX idx_ip_minmax ip TYPE minmax GRANULARITY 8;

ALTER TABLE events MATERIALIZE INDEX idx_country;

Four skip-index types in common use: minmax (numbers, timestamps), set(N) (low-to-mid cardinality columns), bloom_filter (high-cardinality columns with equality lookups), ngrambf/tokenbf (LIKE search). Skip indexes don't replace the primary key — they're the final tuning step. Always put the most selective column in ORDER BY first.

9.3 Four anti-patterns that kill clusters fastest

Things to never do

  • INSERTing one row at a time over HTTP. Always batch or turn on async_insert.
  • Hourly (or finer) partitioning on large tables. A month of partitions yields 720 directories, each with a part + part index — merges can't keep up.
  • ALTER DELETE/UPDATE on big tables the way OLTP does. Mutations in ClickHouse rewrite parts — a 50 TB table means rewriting 50 TB. Use Lightweight DELETE (stable in 25.x) or the Replacing/Collapsing engines for deletes/updates.
  • SELECT * in a dashboard tool. Every unused column is extra data read. Always list columns.

10. Compared with the 2026 OLAP world — StarRocks, Doris, Druid, Pinot, DuckDB

Engine               | Model                                            | Strengths                                                   | Weaknesses vs ClickHouse                             | Best fit
StarRocks 3.x        | MPP with powerful materialized views, native Iceberg | Faster complex joins, pipeline executor                 | Smaller community, less available as SaaS            | Multi-source data warehouse, many big-table joins
Apache Doris 2.x     | Palo-fork MPP, MySQL-wire compatible             | Familiar administration for MySQL/OLTP teams                | Thin international community, slower feature parity  | Teams on MySQL wanting real-time reporting
Apache Druid         | Time-series OLAP with multi-role data servers    | Extremely strong streaming ingestion, refined time filters  | Complex ops (many services), weaker SQL support      | Time series, session data, short clickstreams
Apache Pinot         | User-facing analytics, star-tree index           | Stable p99 latency for simple queries                       | Less flexible for ad-hoc queries                     | End-user dashboards, high QPS
DuckDB 1.x           | In-process embedded OLAP                         | Zero ops, runs locally, native Parquet                      | No cluster, no concurrent writes                     | Local analytics, data science, edge
Snowflake / BigQuery | Managed cloud DWH, storage-compute separated     | Zero ops, the most complete SQL support                     | High per-query cost, no sub-second ingestion         | Enterprise BI, periodic reports, non-realtime

There's no "total victory" here. Rule of thumb: for append-only + filter + aggregate workloads with high throughput and sub-second latency, ClickHouse is right ~90% of the time. If you must join 10 big tables often, consider StarRocks. If the analysis is just inside a single process, DuckDB is much more convenient. Picking the right tool up front saves six months of pain, more than any tuning session ever could.

11. Three typical production use cases — how to organize tables

11.1 Observability — logs, metrics, traces

The OpenTelemetry + ClickHouse model is becoming the default for mid-sized companies in place of Elasticsearch + Prometheus + Jaeger. Log tables use LowCardinality for service/level, String for message (ZSTD typically compresses it 20-40×), and Array/Map for labels. Trace tables use trace_id String plus bloom-filter skip indexes for id lookups. The OpenTelemetry Collector ships a clickhouse exporter implementing this model, and open-source products such as SigNoz, HyperDX, and Highlight.io (plus chDB-based tools) are built on top of it.
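A minimal trace-span table in the spirit just described — the schema is illustrative, not the OpenTelemetry exporter's exact DDL:

```sql
CREATE TABLE spans
(
    ts          DateTime64(9),
    trace_id    String,
    span_id     String,
    service     LowCardinality(String),
    name        LowCardinality(String),
    duration_ns UInt64,
    attributes  Map(LowCardinality(String), String),
    INDEX idx_trace trace_id TYPE bloom_filter(0.001) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(ts)
ORDER BY (service, ts)
TTL toDateTime(ts) + INTERVAL 30 DAY DELETE;

-- Point lookup by trace id: the bloom filter skips nearly every granule,
-- even though trace_id is not in the primary key
SELECT * FROM spans WHERE trace_id = 'abc123' ORDER BY ts;
```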

11.2 Product analytics — user events and funnels

PostHog, Plausible, Tinybird, June, Datafold, and Mixpanel are all built on ClickHouse. The main events table holds every event with a Map for properties. Funnel, retention, and A/B test analyses use specialized functions like windowFunnel, retention, and sequenceMatch — extremely hard to match on Postgres or MySQL at the same performance.

-- A 3-step funnel: view -> add_to_cart -> purchase within 1 hour
SELECT
    level,
    count() AS users
FROM (
    SELECT
        user_id,
        windowFunnel(3600)(
            event_time,
            event_name = 'view',
            event_name = 'add_to_cart',
            event_name = 'purchase'
        ) AS level
    FROM events
    WHERE event_time >= today() - 7
    GROUP BY user_id
)
GROUP BY level
ORDER BY level DESC;

11.3 Real-time user-facing dashboards

Financial SaaS, game analytics, marketing attribution — products that show dashboards to end users, each user seeing their own chart, with sub-second latency even over trillion-row tables. The common template: partition by day, ORDER BY (tenant_id, metric_time) so per-tenant queries are prefix scans, materialized views roll up to minute/hour, and dashboards hit the MV rather than raw data. With a 25.x cluster + Parallel Replicas, you can hit sub-300 ms p99 on 30-day aggregate queries.
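That template can be sketched as DDL — table and column names here are illustrative:

```sql
-- Raw metrics: per-tenant queries become cheap prefix scans
CREATE TABLE metrics
(
    tenant_id   UInt32,
    metric_time DateTime,
    metric_name LowCardinality(String),
    value       Float64
)
ENGINE = MergeTree
PARTITION BY toDate(metric_time)
ORDER BY (tenant_id, metric_name, metric_time);

-- Minute-level roll-up target that dashboards actually query
CREATE TABLE metrics_1m
(
    tenant_id   UInt32,
    metric_name LowCardinality(String),
    minute      DateTime,
    total       AggregateFunction(sum, Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY (tenant_id, metric_name, minute);

-- MV keeps the roll-up current on every insert into metrics
CREATE MATERIALIZED VIEW metrics_1m_mv TO metrics_1m
AS SELECT
    tenant_id,
    metric_name,
    toStartOfMinute(metric_time) AS minute,
    sumState(value) AS total
FROM metrics
GROUP BY tenant_id, metric_name, minute;
```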

12. Production checklist — Don't forget these before go-live

Before opening the cluster to real traffic

  • Design the primary key around queries, not around "id". OLTP primary keys are not OLAP primary keys.
  • Configure memory limits: max_memory_usage, max_server_memory_usage, max_bytes_before_external_group_by so big queries spill to disk instead of OOM-killing the server.
  • Enable query log and query_thread_log — you can't tune without logs.
  • Backups: use native BACKUP/RESTORE 25.x to S3, daily schedule, 30 retained copies.
  • Run ClickHouse Keeper ensembles of 3 or 5 nodes, never a single Keeper.
  • TLS + user quotas + row policies: ClickHouse ships full RBAC, row-level, and column-level policies — use them.
  • Per-user quotas (bytes read, queries/minute) prevent one heavy team from taking down the cluster.
  • Minimum metrics: parts per partition, merge queue, replication lag, distributed delay, memory per query — export to Prometheus via the prometheus_endpoint.
  • Upgrade strategy: one minor version at a time, tested on staging with replayed production queries; don't make big jumps.
  • ClickHouse Cloud or self-hosted? — compute TCO at 100 TB; managed is typically 3-5× more expensive than bare-metal self-hosting, but frees a small team. With fewer than three DBAs, managed is almost always the right call.
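The memory-limit and quota items in the checklist can be sketched as SQL — the numbers and names are illustrative, not recommendations:

```sql
-- Let heavy GROUP BYs spill to disk instead of OOM-killing the server
SET max_memory_usage = 20000000000;                     -- ~20 GB per query
SET max_bytes_before_external_group_by = 10000000000;   -- spill past ~10 GB

-- Per-user quota: cap read volume and query rate so one team
-- cannot take down the cluster (user name is hypothetical)
CREATE QUOTA dash_quota
    FOR INTERVAL 1 MINUTE MAX queries = 120, read_bytes = 50000000000
    TO dashboard_user;
```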

13. Closing — ClickHouse is a tool, not a silver bullet

ClickHouse has traveled a path very few open-source projects manage: from "one Russian company's internal tool" to the industry's standard analytics infrastructure, competing head-to-head with Snowflake and BigQuery where real-time matters. In 2026, with stable SharedMergeTree, Parallel Replicas, polished Lightweight DELETE, and built-in Iceberg/Delta integrations, ClickHouse is pushing itself into "full data platform" territory, not just "fast analytics DB".

But every tool has limits. ClickHouse isn't OLTP, doesn't do multi-table ACID, doesn't do distributed transactions. Used for the right role — append-heavy, aggregate-heavy, time-range filters — its performance and cost/performance are nearly unmatched. Force it to replace Postgres for an order system and you'll have sleepless nights. A good architect knows where to use OLTP vs OLAP vs cache vs search — and knows how to connect them via CDC, Kafka, or correct dual-writes. ClickHouse is a powerful piece in that toolkit, but only a piece.

If you're choosing an OLAP engine for your team in 2026, start with a benchmark on real workloads — not TPC-H to check a box — and weigh long-term operating costs, not just monthly pricing. The earlier you have real data, the firmer the decision.

14. References