ClickHouse: The Real-time Analytics Engine That Queries Billions of Rows in Milliseconds

Posted on: 4/21/2026 6:08:32 AM

Table of contents

What is ClickHouse and why is it "unreasonably fast"?
1. Why is column-oriented faster for analytics?
Inside the engine: MergeTree and the secrets behind its speed
1. Sparse Primary Index — less is more
2. Vectorized Query Execution
  1. Key engines in the MergeTree family
Notable 2025–2026 updates
ClickHouse vs. other analytics solutions
1. When to pick ClickHouse
2. When NOT to use ClickHouse
Integration architecture: ClickHouse in a real system
Performance tuning: 7 golden rules
1. 7 rules for tuning ClickHouse
ClickHouse Cloud and ClickPipes: analytics without ops
1. AI features and MCP Server
Self-hosting ClickHouse: from single node to cluster
1. Sizing guide for a single node
2. clickhousectl — the new CLI (beta, v26.3)
Real-world benchmarks: ClickHouse in production
Conclusion: ClickHouse — not just a database, but an analytics platform
1. Getting started with ClickHouse
2. References

When your system emits millions of events per second — click tracking, IoT sensors, transaction logs, observability data — the question is no longer "where do we store it?" but "how do we query it in milliseconds?". ClickHouse is the answer chosen by thousands of companies, from Cloudflare, Uber, and eBay to Spotify.

$400MSeries D — January 2026

600Krows/second ingest (Goldsky benchmark)

26.3Latest LTS release (April 2026)

100+Data source connectors (ClickPipes)

What is ClickHouse and why is it "unreasonably fast"?

ClickHouse is an open-source column-oriented OLAP database (Online Analytical Processing), originally developed by Yandex since 2016. Unlike PostgreSQL or MySQL (row-oriented — which read entire rows per query), ClickHouse stores data column by column, letting it read only the columns each query actually needs.

graph LR
    subgraph Row-Oriented["Row-Oriented (PostgreSQL)"]
        R1["Row 1: id | name | city | amount"]
        R2["Row 2: id | name | city | amount"]
        R3["Row 3: id | name | city | amount"]
    end
    subgraph Column-Oriented["Column-Oriented (ClickHouse)"]
        C1["Column: id -> 1, 2, 3, ..."]
        C2["Column: name -> Alice, Bob, ..."]
        C3["Column: city -> NYC, LA, ..."]
        C4["Column: amount -> 100, 200, ..."]
    end
    Q["SELECT SUM(amount)
WHERE city = 'NYC'"] --> Column-Oriented
    style Q fill:#e94560,stroke:#fff,color:#fff
    style R1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style R2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style R3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#aaa
    style C3 fill:#e94560,stroke:#fff,color:#fff
    style C4 fill:#e94560,stroke:#fff,color:#fff

Column-oriented storage reads only 2 columns (city + amount) instead of entire rows — a massive I/O reduction

Why is column-oriented faster for analytics?

Imagine a table with 100 columns where a query only needs 3 — ClickHouse reads just 3% of the data from disk. On top of that, data in the same column shares the same type, which compresses extremely well (typically 10:1 to 40:1 ratios), reducing I/O even further.

Inside the engine: MergeTree and the secrets behind its speed

The heart of ClickHouse is the MergeTree engine family — a storage mechanism designed specifically for analytical workloads with write-heavy patterns.

graph TD
    A["Client INSERT
batch rows"] --> B["In-memory Buffer"]
    B --> C["Write Part
(sorted, compressed)"]
    C --> D["Background Merge"]
    D --> E["Merged Parts
(optimized, deduped)"]
    E --> F["Query Engine
vectorized execution"]

    G["Primary Index
(sparse, in-memory)"] --> F
    H["Data Skipping Index
(minmax, set, bloom)"] --> F

    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#16213e,stroke:#fff,color:#fff
    style F fill:#e94560,stroke:#fff,color:#fff
    style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style H fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

MergeTree architecture: INSERT -> buffer -> sorted parts -> background merge -> query

Sparse Primary Index — less is more

Unlike PostgreSQL's B-Tree index (one entry per row), ClickHouse uses a sparse index — storing only the first value of each granule (8,192 rows by default). For a 1-billion-row table, the primary index holds only ~120K entries — small enough to live in RAM and be scanned almost instantly.

Vectorized Query Execution

ClickHouse processes data in blocks (batches of column values), taking full advantage of SIMD instructions (Single Instruction Multiple Data) on modern CPUs. A comparison like WHERE amount > 1000 can evaluate 8-16 values per CPU cycle instead of one at a time.

Key engines in the MergeTree family

Engine	Description	Use case
MergeTree	Base engine — sorted storage + merges	General analytics
ReplacingMergeTree	Automatic dedup by sort key	CDC, upsert pattern
AggregatingMergeTree	Pre-aggregates at merge time	Materialized views, rollups
SummingMergeTree	Automatic SUM on numeric columns	Counters, metrics
CollapsingMergeTree	Collapses rows based on a sign column	State tracking, mutable data

Notable 2025–2026 updates

ClickHouse ships improvements constantly. Here are the biggest changes in the past year you should know about.

July 2025 — SharedCatalog

A centralized metadata architecture for the cloud, making DDL operations handle high concurrency easily, with near-instant service spin-up. Supports both the native format and open standards like Apache Iceberg and Delta Lake.

October 2025 — Full-text search redesign

The inverted index was fully redesigned in v25.10, now handling datasets larger than RAM. Vector similarity search reached GA with binary quantization that dramatically reduces memory footprint.

January 2026 — Series D $400M + Langfuse acquisition

ClickHouse closed a $400M Series D led by Dragoneer and acquired Langfuse — the open-source LLM observability platform. A clear signal that ClickHouse is doubling down on AI infrastructure.

February 2026 — ClickHouse 26.2

Time-based block flushing for streaming data, production-ready text index and QBit data types, plus RIGHT/FULL JOINs 3.2x faster.

April 2026 — ClickHouse 26.3 LTS

Materialized CTEs (experimental), WebAssembly UDFs, the naturalSortKey function, the ALP compression codec for floating-point data, and asynchronous inserts enabled by default.

ClickHouse vs. other analytics solutions

There is no silver bullet for every analytics workload. Below is how ClickHouse stacks up against common alternatives on practical criteria.

Criterion	ClickHouse	Apache Druid	TimescaleDB	BigQuery
Data model	Column-oriented	Column-oriented	Row-oriented (PG ext.)	Column-oriented
Ingest speed	Millions rows/s	High (via Kafka)	Moderate	Streaming (limited)
Query latency	Sub-second on billions of rows	Sub-second	Seconds to minutes	Seconds (cold start)
SQL support	Nearly full ANSI SQL	SQL-like (limited JOINs)	Full PostgreSQL SQL	Full SQL
JOINs	Good (improving continuously)	Limited	Good	Good
Self-hosted	Yes, free	Yes, more complex	Yes (PG extension)	No, GCP only
Managed cloud	Yes, ClickHouse Cloud	Yes, Imply Cloud	Yes, Timescale Cloud	Native
Best for	Log analytics, BI, observability	Real-time dashboards	Time-series / IoT	Large ad-hoc analytics

When to pick ClickHouse

Choose ClickHouse when you need: (1) analytical queries over large datasets with low latency, (2) high ingest volume (millions of events per second), (3) standard SQL plus a rich connector ecosystem, (4) the option to self-host for free or a flexible managed cloud. It's especially well-suited for log analytics, observability, product analytics, and real-time reporting.

When NOT to use ClickHouse

ClickHouse is not an OLTP database. If your workload is primarily single-row INSERT/UPDATE/DELETE with ACID transactions (e.g., e-commerce order processing), keep PostgreSQL/EF Core for that and use ClickHouse as the analytics layer behind it.

Integration architecture: ClickHouse in a real system

In practice, ClickHouse rarely stands alone — it lives inside a larger data pipeline. The pattern below is the most common one.

graph LR
    A["Application
(.NET / Vue.js)"] -->|events| B["Message Queue
(Kafka / RabbitMQ)"]
    B --> C["ClickHouse
Ingest"]
    C --> D["ClickHouse
MergeTree Tables"]
    D --> E["Materialized Views
(pre-aggregated)"]
    E --> F["Grafana / BI Tools
Dashboard"]
    D --> G["API Layer
(.NET Core)"]
    G --> H["Vue.js
Frontend"]

    I["PostgreSQL
OLTP"] -->|CDC via ClickPipes| D

    style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C fill:#e94560,stroke:#fff,color:#fff
    style D fill:#16213e,stroke:#fff,color:#fff
    style E fill:#2c3e50,stroke:#fff,color:#fff
    style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

Common pattern: OLTP (PostgreSQL) -> CDC -> ClickHouse -> Materialized Views -> Dashboard/API

Pattern 1: Event Analytics Pipeline

The application pushes events (page views, clicks, purchases...) through a message queue. ClickHouse ingests directly from Kafka using the built-in Kafka engine or ClickPipes (the managed connector).

-- Raw events table
CREATE TABLE events (
    event_id UUID DEFAULT generateUUIDv4(),
    event_type LowCardinality(String),
    user_id UInt64,
    session_id String,
    page_url String,
    properties String, -- JSON string
    created_at DateTime64(3) DEFAULT now64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, user_id, created_at)
TTL created_at + INTERVAL 90 DAY;

-- Materialized View that aggregates automatically
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (event_type, hour)
AS SELECT
    event_type,
    toStartOfHour(created_at) AS hour,
    count() AS event_count,
    uniqExact(user_id) AS unique_users
FROM events
GROUP BY event_type, hour;

Tip: LowCardinality — the secret weapon

LowCardinality(String) is a ClickHouse-specific optimization. For columns with few distinct values (country, event_type, status...), it automatically applies dictionary encoding — cutting memory and speeding up filters by up to 10x.

Pattern 2: CDC from PostgreSQL

With ClickPipes (or Debezium + Kafka), you can replicate data from PostgreSQL to ClickHouse in near real time. ClickHouse in 2025 added MySQL CDC, MongoDB CDC, and improved Postgres CDC inside ClickPipes.

-- ReplacingMergeTree for CDC (upsert pattern)
CREATE TABLE orders_replica (
    order_id UInt64,
    customer_id UInt64,
    total_amount Decimal(18,2),
    status LowCardinality(String),
    updated_at DateTime64(3),
    _version UInt64  -- CDC version for dedup
) ENGINE = ReplacingMergeTree(_version)
ORDER BY order_id;

-- Use FINAL to guarantee deduplication
SELECT
    status,
    count() AS order_count,
    sum(total_amount) AS revenue
FROM orders_replica FINAL
WHERE updated_at >= today() - 30
GROUP BY status;

Pattern 3: Connecting from .NET Core

ClickHouse ships an official .NET client (ClickHouse.Client) with a familiar ADO.NET interface, bulk insert, and async operations.

// NuGet: ClickHouse.Client
using ClickHouse.Client.ADO;
using ClickHouse.Client.Copy;

// Simple query via ADO.NET
await using var connection = new ClickHouseConnection(
    "Host=localhost;Port=8123;Database=analytics");
await connection.OpenAsync();

await using var cmd = connection.CreateCommand();
cmd.CommandText = @"
    SELECT event_type, count() AS cnt, uniqExact(user_id) AS users
    FROM events
    WHERE created_at >= @from
    GROUP BY event_type
    ORDER BY cnt DESC
    LIMIT 20";
cmd.Parameters.AddWithValue("@from", DateTime.UtcNow.AddDays(-7));

await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    Console.WriteLine($"{reader["event_type"]}: {reader["cnt"]} events, {reader["users"]} users");
}

// High-throughput bulk insert
using var bulkCopy = new ClickHouseBulkCopy(connection)
{
    DestinationTableName = "events",
    BatchSize = 100_000
};

var rows = GenerateEvents(); // IEnumerable<object[]>
await bulkCopy.InitAsync();
await bulkCopy.WriteToServerAsync(rows);

Performance tuning: 7 golden rules

ClickHouse is fast by default, but to get the most out of it you need to follow a few design principles.

7 rules for tuning ClickHouse

1. Pick the right ORDER BY: the primary key determines physical storage order. Put the most frequently filtered columns first (e.g., ORDER BY (tenant_id, event_type, created_at)).

2. Use PARTITION BY sensibly: partition by month (toYYYYMM) is typical. Too many partitions (e.g., per-day on a small table) adds overhead. Rule of thumb: each partition should hold at least 1–10 million rows.

3. Batch your INSERTs: never insert one row at a time. Batch at least 1,000 rows, ideally 10K–100K per call. Since v26.3, asynchronous inserts are enabled by default and buffer automatically.

4. LowCardinality for low-distinct columns: country, status, event_type — any column with fewer than ~10K distinct values should use LowCardinality(String).

5. Materialized views for repeated queries: instead of scanning the raw table every time, build a materialized view that pre-aggregates. Queries over the view run 10x–1000x faster.

6. TTL for lifecycle management: set TTL to auto-delete or move old data to cheaper storage: TTL created_at + INTERVAL 90 DAY DELETE or TTL created_at + INTERVAL 30 DAY TO VOLUME 'cold'.

7. Avoid SELECT *: always specify only the columns you need. Column-oriented means "read only selected columns" — but SELECT * forces ClickHouse to read everything.

ClickHouse Cloud and ClickPipes: analytics without ops

If you'd rather not manage a cluster yourself, ClickHouse Cloud offers a managed service with a long list of enterprise features:

20xreplicas (horizontal scaling GA)

6xfaster backups (2025)

3cloud providers (AWS, GCP, Azure)

HIPAAcompliance ready

ClickPipes is managed data ingestion — connecting directly to Kafka, PostgreSQL (CDC), MySQL (CDC), MongoDB (CDC), Amazon Kinesis, S3, GCS, Azure Blob Storage, and more, without any ETL code of your own.

graph LR
    A["PostgreSQL
(CDC)"] --> CP["ClickPipes"]
    B["Apache Kafka"] --> CP
    C["Amazon S3"] --> CP
    D["MongoDB
(CDC)"] --> CP
    CP --> CH["ClickHouse Cloud"]
    CH --> G["Grafana"]
    CH --> S["Superset"]
    CH --> API["Your .NET API"]
    CH --> MB["Metabase"]

    style CP fill:#e94560,stroke:#fff,color:#fff
    style CH fill:#16213e,stroke:#fff,color:#fff
    style A fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style S fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style API fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style MB fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

ClickPipes: managed ingestion from many sources into ClickHouse Cloud

AI features and MCP Server

One thing worth noting is that ClickHouse now ships an MCP Server (Model Context Protocol) that lets AI agents query ClickHouse directly. Together with the Langfuse acquisition, ClickHouse is turning into a foundational platform for AI observability — tracking LLM tokens, latency, cost, and quality metrics at scale.

-- Example: LLM observability table on ClickHouse
CREATE TABLE llm_traces (
    trace_id UUID,
    model LowCardinality(String),
    prompt_tokens UInt32,
    completion_tokens UInt32,
    latency_ms UInt32,
    cost_usd Decimal(10,6),
    status LowCardinality(String),
    created_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (model, created_at);

-- Dashboard query: LLM cost by model over 7 days
SELECT
    model,
    count() AS requests,
    avg(latency_ms) AS avg_latency,
    sum(cost_usd) AS total_cost,
    sum(prompt_tokens + completion_tokens) AS total_tokens
FROM llm_traces
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY model
ORDER BY total_cost DESC;

Self-hosting ClickHouse: from single node to cluster

ClickHouse is completely free to self-host. Here's the quickest way to get going with Docker.

# docker-compose.yml — single-node ClickHouse
version: '3.8'
services:
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # Native TCP
      - "9009:9009"   # Inter-server replication
    volumes:
      - ch_data:/var/lib/clickhouse
      - ch_logs:/var/log/clickhouse-server
    environment:
      CLICKHOUSE_DB: analytics
      CLICKHOUSE_USER: admin
      CLICKHOUSE_PASSWORD: your_secure_password
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    deploy:
      resources:
        limits:
          memory: 4G

volumes:
  ch_data:
  ch_logs:

Sizing guide for a single node

RAM: ClickHouse uses roughly 50% of RAM for query processing. 8GB of RAM comfortably handles several hundred million rows. Disk: with a 10:1 compression ratio, 100GB of raw data takes up ~10GB on disk. Prefer SSD/NVMe — ClickHouse takes great advantage of fast random reads. CPU: more cores is always better — ClickHouse parallelizes queries across every available core.

clickhousectl — the new CLI (beta, v26.3)

ClickHouse just released clickhousectl — an official CLI that supports both local installs and cloud deployments. It was designed "with AI agents in mind" — OAuth support, permissioned API keys, and output formats that play well with agentic workflows.

# Install clickhousectl
curl -fsSL https://clickhouse.com/install-ctl | bash

# Manage a local instance
clickhousectl install --version 26.3
clickhousectl start
clickhousectl status

# Connect to the cloud
clickhousectl cloud login
clickhousectl cloud query "SELECT count() FROM events"

Real-world benchmarks: ClickHouse in production

600,000 rows/second Goldsky reached this number after switching from Kafka/Avro to direct ClickHouse reads with Apache Arrow — 12x faster than the old pipeline (~50K rows/s)

A few notable production case studies:

Company	Use case	Scale	Result
Cloudflare	DNS analytics, HTTP logs	Tens of PB	Sub-second queries on petabytes
Uber	Real-time pricing, logging	Billions of events/day	P99 latency under 1s
Goldsky	Blockchain data backfill	Billions of rows	12x throughput improvement
GitLab	Product analytics	Hundreds of TB	90% query-cost reduction vs BigQuery
Spotify	A/B testing analytics	Trillions of rows	Interactive queries on experiment data

Conclusion: ClickHouse — not just a database, but an analytics platform

Between the $400M Series D, the Langfuse acquisition, and the steady cadence of 26.x improvements, ClickHouse is evolving from "column-oriented database" into a full analytics platform — covering ingestion (ClickPipes), storage (MergeTree + data lake integrations), querying (vectorized + materialized views), and AI infrastructure (MCP Server + LLM observability).

If your system is bottlenecked at the analytics layer — slow dashboards, timeouts on big tables, or runaway BigQuery/Redshift bills — ClickHouse is the most compelling option to try in 2026. Between free self-hosting and a flexible managed cloud, the barrier to entry is essentially zero.

Getting started with ClickHouse

Try it right now at play.clickhouse.com — a free playground with no sign-up required. Or use the docker-compose file above to spin up a local instance in 2 minutes. The official docs at clickhouse.com/docs have step-by-step tutorials for every common use case.

References

#ClickHouse #Real-time Analytics #system design #Database #.NET Core #OLAP #Data Engineering

# ClickHouse: The Real-time Analytics Engine That Queries Billions of Rows in Milliseconds

When your system emits millions of events per second — click tracking, IoT sensors, transaction logs, observability data — the question is no longer *"where do we store it?"* but *"how do we query it in milliseconds?"*. ClickHouse is the answer chosen by thousands of companies, from Cloudflare, Uber, and eBay to Spotify.

$400MSeries D — January 2026

600Krows/second ingest (Goldsky benchmark)

26.3Latest LTS release (April 2026)

100+Data source connectors (ClickPipes)

## What is ClickHouse and why is it "unreasonably fast"?

ClickHouse is an open-source **column-oriented OLAP database** (Online Analytical Processing), originally developed by Yandex since 2016. Unlike PostgreSQL or MySQL (row-oriented — which read entire rows per query), ClickHouse stores data column by column, letting it read only the columns each query actually needs.

```
graph LR
    subgraph Row-Oriented["Row-Oriented (PostgreSQL)"]
        R1["Row 1: id | name | city | amount"]
        R2["Row 2: id | name | city | amount"]
        R3["Row 3: id | name | city | amount"]
    end
    subgraph Column-Oriented["Column-Oriented (ClickHouse)"]
        C1["Column: id -> 1, 2, 3, ..."]
        C2["Column: name -> Alice, Bob, ..."]
        C3["Column: city -> NYC, LA, ..."]
        C4["Column: amount -> 100, 200, ..."]
    end
    Q["SELECT SUM(amount)  
WHERE city = 'NYC'"] --> Column-Oriented
    style Q fill:#e94560,stroke:#fff,color:#fff
    style R1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style R2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style R3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#aaa
    style C3 fill:#e94560,stroke:#fff,color:#fff
    style C4 fill:#e94560,stroke:#fff,color:#fff

```
Column-oriented storage reads only 2 columns (city + amount) instead of entire rows — a massive I/O reduction

#### Why is column-oriented faster for analytics?

Imagine a table with 100 columns where a query only needs 3 — ClickHouse reads just 3% of the data from disk. On top of that, data in the same column shares the same type, which **compresses extremely well** (typically 10:1 to 40:1 ratios), reducing I/O even further.

## Inside the engine: MergeTree and the secrets behind its speed

The heart of ClickHouse is the **MergeTree engine family** — a storage mechanism designed specifically for analytical workloads with write-heavy patterns.

```
graph TD
    A["Client INSERT  
batch rows"] --> B["In-memory Buffer"]
    B --> C["Write Part  
(sorted, compressed)"]
    C --> D["Background Merge"]
    D --> E["Merged Parts  
(optimized, deduped)"]
    E --> F["Query Engine  
vectorized execution"]

G["Primary Index  
(sparse, in-memory)"] --> F
    H["Data Skipping Index  
(minmax, set, bloom)"] --> F

style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#16213e,stroke:#fff,color:#fff
    style F fill:#e94560,stroke:#fff,color:#fff
    style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style H fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

```
MergeTree architecture: INSERT -> buffer -> sorted parts -> background merge -> query

### Sparse Primary Index — less is more

Unlike PostgreSQL's B-Tree index (one entry per row), ClickHouse uses a **sparse index** — storing only the first value of each *granule* (8,192 rows by default). For a 1-billion-row table, the primary index holds only ~120K entries — small enough to live in RAM and be scanned almost instantly.

### Vectorized Query Execution

ClickHouse processes data in **blocks** (batches of column values), taking full advantage of **SIMD instructions** (Single Instruction Multiple Data) on modern CPUs. A comparison like `WHERE amount > 1000` can evaluate 8-16 values per CPU cycle instead of one at a time.

#### Key engines in the MergeTree family

| Engine | Description | Use case |
| --- | --- | --- |
| **MergeTree** | Base engine — sorted storage + merges | General analytics |
| **ReplacingMergeTree** | Automatic dedup by sort key | CDC, upsert pattern |
| **AggregatingMergeTree** | Pre-aggregates at merge time | Materialized views, rollups |
| **SummingMergeTree** | Automatic SUM on numeric columns | Counters, metrics |
| **CollapsingMergeTree** | Collapses rows based on a sign column | State tracking, mutable data |

## Notable 2025–2026 updates

ClickHouse ships improvements constantly. Here are the biggest changes in the past year you should know about.

July 2025 — SharedCatalog

A **centralized metadata architecture** for the cloud, making DDL operations handle high concurrency easily, with near-instant service spin-up. Supports both the native format and open standards like **Apache Iceberg** and **Delta Lake**.

October 2025 — Full-text search redesign

The inverted index was fully redesigned in v25.10, now handling datasets **larger than RAM**. Vector similarity search reached GA with binary quantization that dramatically reduces memory footprint.

January 2026 — Series D $400M + Langfuse acquisition

ClickHouse closed a $400M Series D led by Dragoneer and acquired **Langfuse** — the open-source LLM observability platform. A clear signal that ClickHouse is doubling down on **AI infrastructure**.

February 2026 — ClickHouse 26.2

**Time-based block flushing** for streaming data, production-ready **text index** and QBit data types, plus **RIGHT/FULL JOINs 3.2x faster**.

April 2026 — ClickHouse 26.3 LTS

**Materialized CTEs** (experimental), **WebAssembly UDFs**, the `naturalSortKey` function, the **ALP** compression codec for floating-point data, and **asynchronous inserts** enabled by default.

## ClickHouse vs. other analytics solutions

There is no silver bullet for every analytics workload. Below is how ClickHouse stacks up against common alternatives on practical criteria.

| Criterion | ClickHouse | Apache Druid | TimescaleDB | BigQuery |
| --- | --- | --- | --- | --- |
| **Data model** | Column-oriented | Column-oriented | Row-oriented (PG ext.) | Column-oriented |
| **Ingest speed** | Millions rows/s | High (via Kafka) | Moderate | Streaming (limited) |
| **Query latency** | Sub-second on billions of rows | Sub-second | Seconds to minutes | Seconds (cold start) |
| **SQL support** | Nearly full ANSI SQL | SQL-like (limited JOINs) | Full PostgreSQL SQL | Full SQL |
| **JOINs** | Good (improving continuously) | Limited | Good | Good |
| **Self-hosted** | Yes, free | Yes, more complex | Yes (PG extension) | No, GCP only |
| **Managed cloud** | Yes, ClickHouse Cloud | Yes, Imply Cloud | Yes, Timescale Cloud | Native |
| **Best for** | Log analytics, BI, observability | Real-time dashboards | Time-series / IoT | Large ad-hoc analytics |

#### When to pick ClickHouse

Choose ClickHouse when you need: (1) analytical queries over large datasets with low latency, (2) high ingest volume (millions of events per second), (3) standard SQL plus a rich connector ecosystem, (4) the option to self-host for free or a flexible managed cloud. It's especially well-suited for **log analytics**, **observability**, **product analytics**, and **real-time reporting**.

#### When NOT to use ClickHouse

ClickHouse is **not** an OLTP database. If your workload is primarily single-row INSERT/UPDATE/DELETE with ACID transactions (e.g., e-commerce order processing), keep PostgreSQL/EF Core for that and use ClickHouse as the **analytics layer** behind it.

## Integration architecture: ClickHouse in a real system

In practice, ClickHouse rarely stands alone — it lives inside a larger data pipeline. The pattern below is the most common one.

```
graph LR
    A["Application  
(.NET / Vue.js)"] -->|events| B["Message Queue  
(Kafka / RabbitMQ)"]
    B --> C["ClickHouse  
Ingest"]
    C --> D["ClickHouse  
MergeTree Tables"]
    D --> E["Materialized Views  
(pre-aggregated)"]
    E --> F["Grafana / BI Tools  
Dashboard"]
    D --> G["API Layer  
(.NET Core)"]
    G --> H["Vue.js  
Frontend"]

I["PostgreSQL  
OLTP"] -->|CDC via ClickPipes| D

style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C fill:#e94560,stroke:#fff,color:#fff
    style D fill:#16213e,stroke:#fff,color:#fff
    style E fill:#2c3e50,stroke:#fff,color:#fff
    style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

```
Common pattern: OLTP (PostgreSQL) -> CDC -> ClickHouse -> Materialized Views -> Dashboard/API

### Pattern 1: Event Analytics Pipeline

The application pushes events (page views, clicks, purchases...) through a message queue. ClickHouse ingests directly from Kafka using the built-in **Kafka engine** or **ClickPipes** (the managed connector).

```sql
-- Raw events table
CREATE TABLE events (
    event_id UUID DEFAULT generateUUIDv4(),
    event_type LowCardinality(String),
    user_id UInt64,
    session_id String,
    page_url String,
    properties String, -- JSON string
    created_at DateTime64(3) DEFAULT now64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, user_id, created_at)
TTL created_at + INTERVAL 90 DAY;

-- Materialized View that aggregates automatically
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (event_type, hour)
AS SELECT
    event_type,
    toStartOfHour(created_at) AS hour,
    count() AS event_count,
    uniqExact(user_id) AS unique_users
FROM events
GROUP BY event_type, hour;
```

#### Tip: LowCardinality — the secret weapon

`LowCardinality(String)` is a ClickHouse-specific optimization. For columns with few distinct values (country, event_type, status...), it automatically applies **dictionary encoding** — cutting memory and speeding up filters by up to **10x**.

### Pattern 2: CDC from PostgreSQL

With ClickPipes (or Debezium + Kafka), you can replicate data from PostgreSQL to ClickHouse in near real time. ClickHouse in 2025 added **MySQL CDC**, **MongoDB CDC**, and improved **Postgres CDC** inside ClickPipes.

```sql
-- ReplacingMergeTree for CDC (upsert pattern)
CREATE TABLE orders_replica (
    order_id UInt64,
    customer_id UInt64,
    total_amount Decimal(18,2),
    status LowCardinality(String),
    updated_at DateTime64(3),
    _version UInt64  -- CDC version for dedup
) ENGINE = ReplacingMergeTree(_version)
ORDER BY order_id;

-- Use FINAL to guarantee deduplication
SELECT
    status,
    count() AS order_count,
    sum(total_amount) AS revenue
FROM orders_replica FINAL
WHERE updated_at >= today() - 30
GROUP BY status;
```

### Pattern 3: Connecting from .NET Core

ClickHouse ships an official **.NET client** (`ClickHouse.Client`) with a familiar ADO.NET interface, bulk insert, and async operations.

```csharp
// NuGet: ClickHouse.Client
using ClickHouse.Client.ADO;
using ClickHouse.Client.Copy;

// Simple query via ADO.NET
await using var connection = new ClickHouseConnection(
    "Host=localhost;Port=8123;Database=analytics");
await connection.OpenAsync();

await using var cmd = connection.CreateCommand();
cmd.CommandText = @"
    SELECT event_type, count() AS cnt, uniqExact(user_id) AS users
    FROM events
    WHERE created_at >= @from
    GROUP BY event_type
    ORDER BY cnt DESC
    LIMIT 20";
cmd.Parameters.AddWithValue("@from", DateTime.UtcNow.AddDays(-7));

await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    Console.WriteLine($"{reader["event_type"]}: {reader["cnt"]} events, {reader["users"]} users");
}

// High-throughput bulk insert
using var bulkCopy = new ClickHouseBulkCopy(connection)
{
    DestinationTableName = "events",
    BatchSize = 100_000
};

var rows = GenerateEvents(); // IEnumerable<object[]>
await bulkCopy.InitAsync();
await bulkCopy.WriteToServerAsync(rows);
```

## Performance tuning: 7 golden rules

ClickHouse is fast by default, but to get the most out of it you need to follow a few design principles.

#### 7 rules for tuning ClickHouse

**1. Pick the right ORDER BY:** the primary key determines physical storage order. Put the most frequently filtered columns first (e.g., `ORDER BY (tenant_id, event_type, created_at)`).

**2. Use PARTITION BY sensibly:** partition by month (`toYYYYMM`) is typical. Too many partitions (e.g., per-day on a small table) adds overhead. Rule of thumb: each partition should hold at least 1–10 million rows.

**3. Batch your INSERTs:** never insert one row at a time. Batch at least 1,000 rows, ideally 10K–100K per call. Since v26.3, **asynchronous inserts** are enabled by default and buffer automatically.

**4. LowCardinality for low-distinct columns:** country, status, event_type — any column with fewer than ~10K distinct values should use `LowCardinality(String)`.

**5. Materialized views for repeated queries:** instead of scanning the raw table every time, build a materialized view that pre-aggregates. Queries over the view run 10x–1000x faster.

**6. TTL for lifecycle management:** set TTL to auto-delete or move old data to cheaper storage: `TTL created_at + INTERVAL 90 DAY DELETE` or `TTL created_at + INTERVAL 30 DAY TO VOLUME 'cold'`.

**7. Avoid SELECT *:** always specify only the columns you need. Column-oriented means "read only selected columns" — but `SELECT *` forces ClickHouse to read everything.

## ClickHouse Cloud and ClickPipes: analytics without ops

If you'd rather not manage a cluster yourself, **ClickHouse Cloud** offers a managed service with a long list of enterprise features:

20xreplicas (horizontal scaling GA)

6xfaster backups (2025)

3cloud providers (AWS, GCP, Azure)

HIPAAcompliance ready

**ClickPipes** is managed data ingestion — connecting directly to Kafka, PostgreSQL (CDC), MySQL (CDC), MongoDB (CDC), Amazon Kinesis, S3, GCS, Azure Blob Storage, and more, without any ETL code of your own.

```
graph LR
    A["PostgreSQL  
(CDC)"] --> CP["ClickPipes"]
    B["Apache Kafka"] --> CP
    C["Amazon S3"] --> CP
    D["MongoDB  
(CDC)"] --> CP
    CP --> CH["ClickHouse Cloud"]
    CH --> G["Grafana"]
    CH --> S["Superset"]
    CH --> API["Your .NET API"]
    CH --> MB["Metabase"]

style CP fill:#e94560,stroke:#fff,color:#fff
    style CH fill:#16213e,stroke:#fff,color:#fff
    style A fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style S fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style API fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style MB fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

```
ClickPipes: managed ingestion from many sources into ClickHouse Cloud

### AI features and MCP Server

One thing worth noting is that ClickHouse now ships an **MCP Server** (Model Context Protocol) that lets AI agents query ClickHouse directly. Together with the **Langfuse** acquisition, ClickHouse is turning into a foundational platform for **AI observability** — tracking LLM tokens, latency, cost, and quality metrics at scale.

```sql
-- Example: LLM observability table on ClickHouse
CREATE TABLE llm_traces (
    trace_id UUID,
    model LowCardinality(String),
    prompt_tokens UInt32,
    completion_tokens UInt32,
    latency_ms UInt32,
    cost_usd Decimal(10,6),
    status LowCardinality(String),
    created_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (model, created_at);

-- Dashboard query: LLM cost by model over 7 days
SELECT
    model,
    count() AS requests,
    avg(latency_ms) AS avg_latency,
    sum(cost_usd) AS total_cost,
    sum(prompt_tokens + completion_tokens) AS total_tokens
FROM llm_traces
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY model
ORDER BY total_cost DESC;
```

## Self-hosting ClickHouse: from single node to cluster

ClickHouse is completely free to self-host. Here's the quickest way to get going with Docker.

```yaml
# docker-compose.yml — single-node ClickHouse
version: '3.8'
services:
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # Native TCP
      - "9009:9009"   # Inter-server replication
    volumes:
      - ch_data:/var/lib/clickhouse
      - ch_logs:/var/log/clickhouse-server
    environment:
      CLICKHOUSE_DB: analytics
      CLICKHOUSE_USER: admin
      CLICKHOUSE_PASSWORD: your_secure_password
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    deploy:
      resources:
        limits:
          memory: 4G

volumes:
  ch_data:
  ch_logs:
```

#### Sizing guide for a single node

**RAM:** ClickHouse uses roughly 50% of RAM for query processing. 8GB of RAM comfortably handles several hundred million rows. **Disk:** with a 10:1 compression ratio, 100GB of raw data takes up ~10GB on disk. Prefer **SSD/NVMe** — ClickHouse takes great advantage of fast random reads. **CPU:** more cores is always better — ClickHouse parallelizes queries across every available core.

### clickhousectl — the new CLI (beta, v26.3)

ClickHouse just released **clickhousectl** — an official CLI that supports both local installs and cloud deployments. It was designed "with AI agents in mind" — OAuth support, permissioned API keys, and output formats that play well with agentic workflows.

```bash
# Install clickhousectl
curl -fsSL https://clickhouse.com/install-ctl | bash

# Manage a local instance
clickhousectl install --version 26.3
clickhousectl start
clickhousectl status

# Connect to the cloud
clickhousectl cloud login
clickhousectl cloud query "SELECT count() FROM events"
```

## Real-world benchmarks: ClickHouse in production

600,000 rows/second Goldsky reached this number after switching from Kafka/Avro to direct ClickHouse reads with Apache Arrow — 12x faster than the old pipeline (~50K rows/s)

A few notable production case studies:

| Company | Use case | Scale | Result |
| --- | --- | --- | --- |
| **Cloudflare** | DNS analytics, HTTP logs | Tens of PB | Sub-second queries on petabytes |
| **Uber** | Real-time pricing, logging | Billions of events/day | P99 latency under 1s |
| **Goldsky** | Blockchain data backfill | Billions of rows | 12x throughput improvement |
| **GitLab** | Product analytics | Hundreds of TB | 90% query-cost reduction vs BigQuery |
| **Spotify** | A/B testing analytics | Trillions of rows | Interactive queries on experiment data |

## Conclusion: ClickHouse — not just a database, but an analytics platform

Between the $400M Series D, the Langfuse acquisition, and the steady cadence of 26.x improvements, ClickHouse is evolving from "column-oriented database" into a full **analytics platform** — covering ingestion (ClickPipes), storage (MergeTree + data lake integrations), querying (vectorized + materialized views), and AI infrastructure (MCP Server + LLM observability).

#### Getting started with ClickHouse

Try it right now at **play.clickhouse.com** — a free playground with no sign-up required. Or use the docker-compose file above to spin up a local instance in 2 minutes. The official docs at **clickhouse.com/docs** have step-by-step tutorials for every common use case.

### References

- [ClickHouse April 2026 Newsletter](https://clickhouse.com/blog/202604-newsletter)
- [ClickHouse 2025 Roundup — What's New](https://clickhouse.com/blog/clickhouse-2025-roundup)
- [ClickHouse raises $400M Series D, acquires Langfuse](https://clickhouse.com/blog/clickhouse-raises-400-million-series-d-acquires-langfuse-launches-postgres)
- [ClickHouse Official Documentation](https://clickhouse.com/docs)
- [How to choose a database for real-time analytics in 2026](https://clickhouse.com/resources/engineering/how-to-choose-a-database-for-real-time-analytics-in-2026)
- [ClickHouse GitHub Repository](https://github.com/ClickHouse/ClickHouse)

Saga Pattern: Managing Distributed Transactions in Microservices

Vue 3.6 Vapor Mode — Goodbye Virtual DOM, Solid.js-level Performance

Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.