ClickHouse: The Real-time Analytics Engine That Queries Billions of Rows in Milliseconds

Posted on: 4/21/2026 6:08:32 AM

When your system emits millions of events per second — click tracking, IoT sensors, transaction logs, observability data — the question is no longer "where do we store it?" but "how do we query it in milliseconds?". ClickHouse is the answer chosen by thousands of companies, from Cloudflare, Uber, and eBay to Spotify.

$400MSeries D — January 2026
600Krows/second ingest (Goldsky benchmark)
26.3Latest LTS release (April 2026)
100+Data source connectors (ClickPipes)

What is ClickHouse and why is it "unreasonably fast"?

ClickHouse is an open-source column-oriented OLAP database (Online Analytical Processing), originally developed by Yandex since 2016. Unlike PostgreSQL or MySQL (row-oriented — which read entire rows per query), ClickHouse stores data column by column, letting it read only the columns each query actually needs.

graph LR
    subgraph Row-Oriented["Row-Oriented (PostgreSQL)"]
        R1["Row 1: id | name | city | amount"]
        R2["Row 2: id | name | city | amount"]
        R3["Row 3: id | name | city | amount"]
    end
    subgraph Column-Oriented["Column-Oriented (ClickHouse)"]
        C1["Column: id -> 1, 2, 3, ..."]
        C2["Column: name -> Alice, Bob, ..."]
        C3["Column: city -> NYC, LA, ..."]
        C4["Column: amount -> 100, 200, ..."]
    end
    Q["SELECT SUM(amount)
WHERE city = 'NYC'"] --> Column-Oriented style Q fill:#e94560,stroke:#fff,color:#fff style R1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style R2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style R3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50 style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#aaa style C3 fill:#e94560,stroke:#fff,color:#fff style C4 fill:#e94560,stroke:#fff,color:#fff

Column-oriented storage reads only 2 columns (city + amount) instead of entire rows — a massive I/O reduction

Why is column-oriented faster for analytics?

Imagine a table with 100 columns where a query only needs 3 — ClickHouse reads just 3% of the data from disk. On top of that, data in the same column shares the same type, which compresses extremely well (typically 10:1 to 40:1 ratios), reducing I/O even further.

Inside the engine: MergeTree and the secrets behind its speed

The heart of ClickHouse is the MergeTree engine family — a storage mechanism designed specifically for analytical workloads with write-heavy patterns.

graph TD
    A["Client INSERT
batch rows"] --> B["In-memory Buffer"] B --> C["Write Part
(sorted, compressed)"] C --> D["Background Merge"] D --> E["Merged Parts
(optimized, deduped)"] E --> F["Query Engine
vectorized execution"] G["Primary Index
(sparse, in-memory)"] --> F H["Data Skipping Index
(minmax, set, bloom)"] --> F style A fill:#e94560,stroke:#fff,color:#fff style B fill:#f8f9fa,stroke:#e94560,color:#2c3e50 style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50 style D fill:#2c3e50,stroke:#fff,color:#fff style E fill:#16213e,stroke:#fff,color:#fff style F fill:#e94560,stroke:#fff,color:#fff style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50 style H fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

MergeTree architecture: INSERT -> buffer -> sorted parts -> background merge -> query

Sparse Primary Index — less is more

Unlike PostgreSQL's B-Tree index (one entry per row), ClickHouse uses a sparse index — storing only the first value of each granule (8,192 rows by default). For a 1-billion-row table, the primary index holds only ~120K entries — small enough to live in RAM and be scanned almost instantly.

Vectorized Query Execution

ClickHouse processes data in blocks (batches of column values), taking full advantage of SIMD instructions (Single Instruction Multiple Data) on modern CPUs. A comparison like WHERE amount > 1000 can evaluate 8-16 values per CPU cycle instead of one at a time.

Key engines in the MergeTree family

EngineDescriptionUse case
MergeTreeBase engine — sorted storage + mergesGeneral analytics
ReplacingMergeTreeAutomatic dedup by sort keyCDC, upsert pattern
AggregatingMergeTreePre-aggregates at merge timeMaterialized views, rollups
SummingMergeTreeAutomatic SUM on numeric columnsCounters, metrics
CollapsingMergeTreeCollapses rows based on a sign columnState tracking, mutable data

Notable 2025–2026 updates

ClickHouse ships improvements constantly. Here are the biggest changes in the past year you should know about.

July 2025 — SharedCatalog
A centralized metadata architecture for the cloud, making DDL operations handle high concurrency easily, with near-instant service spin-up. Supports both the native format and open standards like Apache Iceberg and Delta Lake.
October 2025 — Full-text search redesign
The inverted index was fully redesigned in v25.10, now handling datasets larger than RAM. Vector similarity search reached GA with binary quantization that dramatically reduces memory footprint.
January 2026 — Series D $400M + Langfuse acquisition
ClickHouse closed a $400M Series D led by Dragoneer and acquired Langfuse — the open-source LLM observability platform. A clear signal that ClickHouse is doubling down on AI infrastructure.
February 2026 — ClickHouse 26.2
Time-based block flushing for streaming data, production-ready text index and QBit data types, plus RIGHT/FULL JOINs 3.2x faster.
April 2026 — ClickHouse 26.3 LTS
Materialized CTEs (experimental), WebAssembly UDFs, the naturalSortKey function, the ALP compression codec for floating-point data, and asynchronous inserts enabled by default.

ClickHouse vs. other analytics solutions

There is no silver bullet for every analytics workload. Below is how ClickHouse stacks up against common alternatives on practical criteria.

Criterion ClickHouse Apache Druid TimescaleDB BigQuery
Data model Column-oriented Column-oriented Row-oriented (PG ext.) Column-oriented
Ingest speed Millions rows/s High (via Kafka) Moderate Streaming (limited)
Query latency Sub-second on billions of rows Sub-second Seconds to minutes Seconds (cold start)
SQL support Nearly full ANSI SQL SQL-like (limited JOINs) Full PostgreSQL SQL Full SQL
JOINs Good (improving continuously) Limited Good Good
Self-hosted Yes, free Yes, more complex Yes (PG extension) No, GCP only
Managed cloud Yes, ClickHouse Cloud Yes, Imply Cloud Yes, Timescale Cloud Native
Best for Log analytics, BI, observability Real-time dashboards Time-series / IoT Large ad-hoc analytics

When to pick ClickHouse

Choose ClickHouse when you need: (1) analytical queries over large datasets with low latency, (2) high ingest volume (millions of events per second), (3) standard SQL plus a rich connector ecosystem, (4) the option to self-host for free or a flexible managed cloud. It's especially well-suited for log analytics, observability, product analytics, and real-time reporting.

When NOT to use ClickHouse

ClickHouse is not an OLTP database. If your workload is primarily single-row INSERT/UPDATE/DELETE with ACID transactions (e.g., e-commerce order processing), keep PostgreSQL/EF Core for that and use ClickHouse as the analytics layer behind it.

Integration architecture: ClickHouse in a real system

In practice, ClickHouse rarely stands alone — it lives inside a larger data pipeline. The pattern below is the most common one.

graph LR
    A["Application
(.NET / Vue.js)"] -->|events| B["Message Queue
(Kafka / RabbitMQ)"] B --> C["ClickHouse
Ingest"] C --> D["ClickHouse
MergeTree Tables"] D --> E["Materialized Views
(pre-aggregated)"] E --> F["Grafana / BI Tools
Dashboard"] D --> G["API Layer
(.NET Core)"] G --> H["Vue.js
Frontend"] I["PostgreSQL
OLTP"] -->|CDC via ClickPipes| D style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50 style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style C fill:#e94560,stroke:#fff,color:#fff style D fill:#16213e,stroke:#fff,color:#fff style E fill:#2c3e50,stroke:#fff,color:#fff style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50 style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50 style H fill:#f8f9fa,stroke:#e94560,color:#2c3e50 style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

Common pattern: OLTP (PostgreSQL) -> CDC -> ClickHouse -> Materialized Views -> Dashboard/API

Pattern 1: Event Analytics Pipeline

The application pushes events (page views, clicks, purchases...) through a message queue. ClickHouse ingests directly from Kafka using the built-in Kafka engine or ClickPipes (the managed connector).

-- Raw events table
CREATE TABLE events (
    event_id UUID DEFAULT generateUUIDv4(),
    event_type LowCardinality(String),
    user_id UInt64,
    session_id String,
    page_url String,
    properties String, -- JSON string
    created_at DateTime64(3) DEFAULT now64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, user_id, created_at)
TTL created_at + INTERVAL 90 DAY;

-- Materialized View that aggregates automatically
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (event_type, hour)
AS SELECT
    event_type,
    toStartOfHour(created_at) AS hour,
    count() AS event_count,
    uniqExact(user_id) AS unique_users
FROM events
GROUP BY event_type, hour;

Tip: LowCardinality — the secret weapon

LowCardinality(String) is a ClickHouse-specific optimization. For columns with few distinct values (country, event_type, status...), it automatically applies dictionary encoding — cutting memory and speeding up filters by up to 10x.

Pattern 2: CDC from PostgreSQL

With ClickPipes (or Debezium + Kafka), you can replicate data from PostgreSQL to ClickHouse in near real time. ClickHouse in 2025 added MySQL CDC, MongoDB CDC, and improved Postgres CDC inside ClickPipes.

-- ReplacingMergeTree for CDC (upsert pattern)
CREATE TABLE orders_replica (
    order_id UInt64,
    customer_id UInt64,
    total_amount Decimal(18,2),
    status LowCardinality(String),
    updated_at DateTime64(3),
    _version UInt64  -- CDC version for dedup
) ENGINE = ReplacingMergeTree(_version)
ORDER BY order_id;

-- Use FINAL to guarantee deduplication
SELECT
    status,
    count() AS order_count,
    sum(total_amount) AS revenue
FROM orders_replica FINAL
WHERE updated_at >= today() - 30
GROUP BY status;

Pattern 3: Connecting from .NET Core

ClickHouse ships an official .NET client (ClickHouse.Client) with a familiar ADO.NET interface, bulk insert, and async operations.

// NuGet: ClickHouse.Client
using ClickHouse.Client.ADO;
using ClickHouse.Client.Copy;

// Simple query via ADO.NET
await using var connection = new ClickHouseConnection(
    "Host=localhost;Port=8123;Database=analytics");
await connection.OpenAsync();

await using var cmd = connection.CreateCommand();
cmd.CommandText = @"
    SELECT event_type, count() AS cnt, uniqExact(user_id) AS users
    FROM events
    WHERE created_at >= @from
    GROUP BY event_type
    ORDER BY cnt DESC
    LIMIT 20";
cmd.Parameters.AddWithValue("@from", DateTime.UtcNow.AddDays(-7));

await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    Console.WriteLine($"{reader["event_type"]}: {reader["cnt"]} events, {reader["users"]} users");
}

// High-throughput bulk insert
using var bulkCopy = new ClickHouseBulkCopy(connection)
{
    DestinationTableName = "events",
    BatchSize = 100_000
};

var rows = GenerateEvents(); // IEnumerable<object[]>
await bulkCopy.InitAsync();
await bulkCopy.WriteToServerAsync(rows);

Performance tuning: 7 golden rules

ClickHouse is fast by default, but to get the most out of it you need to follow a few design principles.

7 rules for tuning ClickHouse

1. Pick the right ORDER BY: the primary key determines physical storage order. Put the most frequently filtered columns first (e.g., ORDER BY (tenant_id, event_type, created_at)).

2. Use PARTITION BY sensibly: partition by month (toYYYYMM) is typical. Too many partitions (e.g., per-day on a small table) adds overhead. Rule of thumb: each partition should hold at least 1–10 million rows.

3. Batch your INSERTs: never insert one row at a time. Batch at least 1,000 rows, ideally 10K–100K per call. Since v26.3, asynchronous inserts are enabled by default and buffer automatically.

4. LowCardinality for low-distinct columns: country, status, event_type — any column with fewer than ~10K distinct values should use LowCardinality(String).

5. Materialized views for repeated queries: instead of scanning the raw table every time, build a materialized view that pre-aggregates. Queries over the view run 10x–1000x faster.

6. TTL for lifecycle management: set TTL to auto-delete or move old data to cheaper storage: TTL created_at + INTERVAL 90 DAY DELETE or TTL created_at + INTERVAL 30 DAY TO VOLUME 'cold'.

7. Avoid SELECT *: always specify only the columns you need. Column-oriented means "read only selected columns" — but SELECT * forces ClickHouse to read everything.

ClickHouse Cloud and ClickPipes: analytics without ops

If you'd rather not manage a cluster yourself, ClickHouse Cloud offers a managed service with a long list of enterprise features:

20xreplicas (horizontal scaling GA)
6xfaster backups (2025)
3cloud providers (AWS, GCP, Azure)
HIPAAcompliance ready

ClickPipes is managed data ingestion — connecting directly to Kafka, PostgreSQL (CDC), MySQL (CDC), MongoDB (CDC), Amazon Kinesis, S3, GCS, Azure Blob Storage, and more, without any ETL code of your own.

graph LR
    A["PostgreSQL
(CDC)"] --> CP["ClickPipes"] B["Apache Kafka"] --> CP C["Amazon S3"] --> CP D["MongoDB
(CDC)"] --> CP CP --> CH["ClickHouse Cloud"] CH --> G["Grafana"] CH --> S["Superset"] CH --> API["Your .NET API"] CH --> MB["Metabase"] style CP fill:#e94560,stroke:#fff,color:#fff style CH fill:#16213e,stroke:#fff,color:#fff style A fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50 style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50 style S fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50 style API fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50 style MB fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

ClickPipes: managed ingestion from many sources into ClickHouse Cloud

AI features and MCP Server

One thing worth noting is that ClickHouse now ships an MCP Server (Model Context Protocol) that lets AI agents query ClickHouse directly. Together with the Langfuse acquisition, ClickHouse is turning into a foundational platform for AI observability — tracking LLM tokens, latency, cost, and quality metrics at scale.

-- Example: LLM observability table on ClickHouse
CREATE TABLE llm_traces (
    trace_id UUID,
    model LowCardinality(String),
    prompt_tokens UInt32,
    completion_tokens UInt32,
    latency_ms UInt32,
    cost_usd Decimal(10,6),
    status LowCardinality(String),
    created_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (model, created_at);

-- Dashboard query: LLM cost by model over 7 days
SELECT
    model,
    count() AS requests,
    avg(latency_ms) AS avg_latency,
    sum(cost_usd) AS total_cost,
    sum(prompt_tokens + completion_tokens) AS total_tokens
FROM llm_traces
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY model
ORDER BY total_cost DESC;

Self-hosting ClickHouse: from single node to cluster

ClickHouse is completely free to self-host. Here's the quickest way to get going with Docker.

# docker-compose.yml — single-node ClickHouse
version: '3.8'
services:
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # Native TCP
      - "9009:9009"   # Inter-server replication
    volumes:
      - ch_data:/var/lib/clickhouse
      - ch_logs:/var/log/clickhouse-server
    environment:
      CLICKHOUSE_DB: analytics
      CLICKHOUSE_USER: admin
      CLICKHOUSE_PASSWORD: your_secure_password
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    deploy:
      resources:
        limits:
          memory: 4G

volumes:
  ch_data:
  ch_logs:

Sizing guide for a single node

RAM: ClickHouse uses roughly 50% of RAM for query processing. 8GB of RAM comfortably handles several hundred million rows. Disk: with a 10:1 compression ratio, 100GB of raw data takes up ~10GB on disk. Prefer SSD/NVMe — ClickHouse takes great advantage of fast random reads. CPU: more cores is always better — ClickHouse parallelizes queries across every available core.

clickhousectl — the new CLI (beta, v26.3)

ClickHouse just released clickhousectl — an official CLI that supports both local installs and cloud deployments. It was designed "with AI agents in mind" — OAuth support, permissioned API keys, and output formats that play well with agentic workflows.

# Install clickhousectl
curl -fsSL https://clickhouse.com/install-ctl | bash

# Manage a local instance
clickhousectl install --version 26.3
clickhousectl start
clickhousectl status

# Connect to the cloud
clickhousectl cloud login
clickhousectl cloud query "SELECT count() FROM events"

Real-world benchmarks: ClickHouse in production

600,000 rows/second Goldsky reached this number after switching from Kafka/Avro to direct ClickHouse reads with Apache Arrow — 12x faster than the old pipeline (~50K rows/s)

A few notable production case studies:

CompanyUse caseScaleResult
CloudflareDNS analytics, HTTP logsTens of PBSub-second queries on petabytes
UberReal-time pricing, loggingBillions of events/dayP99 latency under 1s
GoldskyBlockchain data backfillBillions of rows12x throughput improvement
GitLabProduct analyticsHundreds of TB90% query-cost reduction vs BigQuery
SpotifyA/B testing analyticsTrillions of rowsInteractive queries on experiment data

Conclusion: ClickHouse — not just a database, but an analytics platform

Between the $400M Series D, the Langfuse acquisition, and the steady cadence of 26.x improvements, ClickHouse is evolving from "column-oriented database" into a full analytics platform — covering ingestion (ClickPipes), storage (MergeTree + data lake integrations), querying (vectorized + materialized views), and AI infrastructure (MCP Server + LLM observability).

If your system is bottlenecked at the analytics layer — slow dashboards, timeouts on big tables, or runaway BigQuery/Redshift bills — ClickHouse is the most compelling option to try in 2026. Between free self-hosting and a flexible managed cloud, the barrier to entry is essentially zero.

Getting started with ClickHouse

Try it right now at play.clickhouse.com — a free playground with no sign-up required. Or use the docker-compose file above to spin up a local instance in 2 minutes. The official docs at clickhouse.com/docs have step-by-step tutorials for every common use case.

References