ClickHouse: The Real-time Analytics Engine That Queries Billions of Rows in Milliseconds
Posted on: 4/21/2026 6:08:32 AM
Table of contents
- What is ClickHouse and why is it "unreasonably fast"?
- Inside the engine: MergeTree and the secrets behind its speed
- Notable 2025–2026 updates
- ClickHouse vs. other analytics solutions
- Integration architecture: ClickHouse in a real system
- Performance tuning: 7 golden rules
- ClickHouse Cloud and ClickPipes: analytics without ops
- Self-hosting ClickHouse: from single node to cluster
- Real-world benchmarks: ClickHouse in production
- Conclusion: ClickHouse — not just a database, but an analytics platform
When your system emits millions of events per second — click tracking, IoT sensors, transaction logs, observability data — the question is no longer "where do we store it?" but "how do we query it in milliseconds?". ClickHouse is the answer chosen by thousands of companies, from Cloudflare, Uber, and eBay to Spotify.
What is ClickHouse and why is it "unreasonably fast"?
ClickHouse is an open-source column-oriented OLAP database (Online Analytical Processing), originally developed by Yandex since 2016. Unlike PostgreSQL or MySQL (row-oriented — which read entire rows per query), ClickHouse stores data column by column, letting it read only the columns each query actually needs.
graph LR
subgraph Row-Oriented["Row-Oriented (PostgreSQL)"]
R1["Row 1: id | name | city | amount"]
R2["Row 2: id | name | city | amount"]
R3["Row 3: id | name | city | amount"]
end
subgraph Column-Oriented["Column-Oriented (ClickHouse)"]
C1["Column: id -> 1, 2, 3, ..."]
C2["Column: name -> Alice, Bob, ..."]
C3["Column: city -> NYC, LA, ..."]
C4["Column: amount -> 100, 200, ..."]
end
Q["SELECT SUM(amount)
WHERE city = 'NYC'"] --> Column-Oriented
style Q fill:#e94560,stroke:#fff,color:#fff
style R1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style R2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style R3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C2 fill:#f8f9fa,stroke:#e0e0e0,color:#aaa
style C3 fill:#e94560,stroke:#fff,color:#fff
style C4 fill:#e94560,stroke:#fff,color:#fff
Column-oriented storage reads only 2 columns (city + amount) instead of entire rows — a massive I/O reduction
Why is column-oriented faster for analytics?
Imagine a table with 100 columns where a query only needs 3 — ClickHouse reads just 3% of the data from disk. On top of that, data in the same column shares the same type, which compresses extremely well (typically 10:1 to 40:1 ratios), reducing I/O even further.
Inside the engine: MergeTree and the secrets behind its speed
The heart of ClickHouse is the MergeTree engine family — a storage mechanism designed specifically for analytical workloads with write-heavy patterns.
graph TD
A["Client INSERT
batch rows"] --> B["In-memory Buffer"]
B --> C["Write Part
(sorted, compressed)"]
C --> D["Background Merge"]
D --> E["Merged Parts
(optimized, deduped)"]
E --> F["Query Engine
vectorized execution"]
G["Primary Index
(sparse, in-memory)"] --> F
H["Data Skipping Index
(minmax, set, bloom)"] --> F
style A fill:#e94560,stroke:#fff,color:#fff
style B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D fill:#2c3e50,stroke:#fff,color:#fff
style E fill:#16213e,stroke:#fff,color:#fff
style F fill:#e94560,stroke:#fff,color:#fff
style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
style H fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
MergeTree architecture: INSERT -> buffer -> sorted parts -> background merge -> query
Sparse Primary Index — less is more
Unlike PostgreSQL's B-Tree index (one entry per row), ClickHouse uses a sparse index — storing only the first value of each granule (8,192 rows by default). For a 1-billion-row table, the primary index holds only ~120K entries — small enough to live in RAM and be scanned almost instantly.
Vectorized Query Execution
ClickHouse processes data in blocks (batches of column values), taking full advantage of SIMD instructions (Single Instruction Multiple Data) on modern CPUs. A comparison like WHERE amount > 1000 can evaluate 8-16 values per CPU cycle instead of one at a time.
Key engines in the MergeTree family
| Engine | Description | Use case |
|---|---|---|
| MergeTree | Base engine — sorted storage + merges | General analytics |
| ReplacingMergeTree | Automatic dedup by sort key | CDC, upsert pattern |
| AggregatingMergeTree | Pre-aggregates at merge time | Materialized views, rollups |
| SummingMergeTree | Automatic SUM on numeric columns | Counters, metrics |
| CollapsingMergeTree | Collapses rows based on a sign column | State tracking, mutable data |
Notable 2025–2026 updates
ClickHouse ships improvements constantly. Here are the biggest changes in the past year you should know about.
naturalSortKey function, the ALP compression codec for floating-point data, and asynchronous inserts enabled by default.
ClickHouse vs. other analytics solutions
There is no silver bullet for every analytics workload. Below is how ClickHouse stacks up against common alternatives on practical criteria.
| Criterion | ClickHouse | Apache Druid | TimescaleDB | BigQuery |
|---|---|---|---|---|
| Data model | Column-oriented | Column-oriented | Row-oriented (PG ext.) | Column-oriented |
| Ingest speed | Millions rows/s | High (via Kafka) | Moderate | Streaming (limited) |
| Query latency | Sub-second on billions of rows | Sub-second | Seconds to minutes | Seconds (cold start) |
| SQL support | Nearly full ANSI SQL | SQL-like (limited JOINs) | Full PostgreSQL SQL | Full SQL |
| JOINs | Good (improving continuously) | Limited | Good | Good |
| Self-hosted | Yes, free | Yes, more complex | Yes (PG extension) | No, GCP only |
| Managed cloud | Yes, ClickHouse Cloud | Yes, Imply Cloud | Yes, Timescale Cloud | Native |
| Best for | Log analytics, BI, observability | Real-time dashboards | Time-series / IoT | Large ad-hoc analytics |
When to pick ClickHouse
Choose ClickHouse when you need: (1) analytical queries over large datasets with low latency, (2) high ingest volume (millions of events per second), (3) standard SQL plus a rich connector ecosystem, (4) the option to self-host for free or a flexible managed cloud. It's especially well-suited for log analytics, observability, product analytics, and real-time reporting.
When NOT to use ClickHouse
ClickHouse is not an OLTP database. If your workload is primarily single-row INSERT/UPDATE/DELETE with ACID transactions (e.g., e-commerce order processing), keep PostgreSQL/EF Core for that and use ClickHouse as the analytics layer behind it.
Integration architecture: ClickHouse in a real system
In practice, ClickHouse rarely stands alone — it lives inside a larger data pipeline. The pattern below is the most common one.
graph LR
A["Application
(.NET / Vue.js)"] -->|events| B["Message Queue
(Kafka / RabbitMQ)"]
B --> C["ClickHouse
Ingest"]
C --> D["ClickHouse
MergeTree Tables"]
D --> E["Materialized Views
(pre-aggregated)"]
E --> F["Grafana / BI Tools
Dashboard"]
D --> G["API Layer
(.NET Core)"]
G --> H["Vue.js
Frontend"]
I["PostgreSQL
OLTP"] -->|CDC via ClickPipes| D
style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style C fill:#e94560,stroke:#fff,color:#fff
style D fill:#16213e,stroke:#fff,color:#fff
style E fill:#2c3e50,stroke:#fff,color:#fff
style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style H fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
Common pattern: OLTP (PostgreSQL) -> CDC -> ClickHouse -> Materialized Views -> Dashboard/API
Pattern 1: Event Analytics Pipeline
The application pushes events (page views, clicks, purchases...) through a message queue. ClickHouse ingests directly from Kafka using the built-in Kafka engine or ClickPipes (the managed connector).
-- Raw events table
CREATE TABLE events (
event_id UUID DEFAULT generateUUIDv4(),
event_type LowCardinality(String),
user_id UInt64,
session_id String,
page_url String,
properties String, -- JSON string
created_at DateTime64(3) DEFAULT now64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, user_id, created_at)
TTL created_at + INTERVAL 90 DAY;
-- Materialized View that aggregates automatically
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (event_type, hour)
AS SELECT
event_type,
toStartOfHour(created_at) AS hour,
count() AS event_count,
uniqExact(user_id) AS unique_users
FROM events
GROUP BY event_type, hour;
Tip: LowCardinality — the secret weapon
LowCardinality(String) is a ClickHouse-specific optimization. For columns with few distinct values (country, event_type, status...), it automatically applies dictionary encoding — cutting memory and speeding up filters by up to 10x.
Pattern 2: CDC from PostgreSQL
With ClickPipes (or Debezium + Kafka), you can replicate data from PostgreSQL to ClickHouse in near real time. ClickHouse in 2025 added MySQL CDC, MongoDB CDC, and improved Postgres CDC inside ClickPipes.
-- ReplacingMergeTree for CDC (upsert pattern)
CREATE TABLE orders_replica (
order_id UInt64,
customer_id UInt64,
total_amount Decimal(18,2),
status LowCardinality(String),
updated_at DateTime64(3),
_version UInt64 -- CDC version for dedup
) ENGINE = ReplacingMergeTree(_version)
ORDER BY order_id;
-- Use FINAL to guarantee deduplication
SELECT
status,
count() AS order_count,
sum(total_amount) AS revenue
FROM orders_replica FINAL
WHERE updated_at >= today() - 30
GROUP BY status;
Pattern 3: Connecting from .NET Core
ClickHouse ships an official .NET client (ClickHouse.Client) with a familiar ADO.NET interface, bulk insert, and async operations.
// NuGet: ClickHouse.Client
using ClickHouse.Client.ADO;
using ClickHouse.Client.Copy;
// Simple query via ADO.NET
await using var connection = new ClickHouseConnection(
"Host=localhost;Port=8123;Database=analytics");
await connection.OpenAsync();
await using var cmd = connection.CreateCommand();
cmd.CommandText = @"
SELECT event_type, count() AS cnt, uniqExact(user_id) AS users
FROM events
WHERE created_at >= @from
GROUP BY event_type
ORDER BY cnt DESC
LIMIT 20";
cmd.Parameters.AddWithValue("@from", DateTime.UtcNow.AddDays(-7));
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
Console.WriteLine($"{reader["event_type"]}: {reader["cnt"]} events, {reader["users"]} users");
}
// High-throughput bulk insert
using var bulkCopy = new ClickHouseBulkCopy(connection)
{
DestinationTableName = "events",
BatchSize = 100_000
};
var rows = GenerateEvents(); // IEnumerable<object[]>
await bulkCopy.InitAsync();
await bulkCopy.WriteToServerAsync(rows);
Performance tuning: 7 golden rules
ClickHouse is fast by default, but to get the most out of it you need to follow a few design principles.
7 rules for tuning ClickHouse
1. Pick the right ORDER BY: the primary key determines physical storage order. Put the most frequently filtered columns first (e.g., ORDER BY (tenant_id, event_type, created_at)).
2. Use PARTITION BY sensibly: partition by month (toYYYYMM) is typical. Too many partitions (e.g., per-day on a small table) adds overhead. Rule of thumb: each partition should hold at least 1–10 million rows.
3. Batch your INSERTs: never insert one row at a time. Batch at least 1,000 rows, ideally 10K–100K per call. Since v26.3, asynchronous inserts are enabled by default and buffer automatically.
4. LowCardinality for low-distinct columns: country, status, event_type — any column with fewer than ~10K distinct values should use LowCardinality(String).
5. Materialized views for repeated queries: instead of scanning the raw table every time, build a materialized view that pre-aggregates. Queries over the view run 10x–1000x faster.
6. TTL for lifecycle management: set TTL to auto-delete or move old data to cheaper storage: TTL created_at + INTERVAL 90 DAY DELETE or TTL created_at + INTERVAL 30 DAY TO VOLUME 'cold'.
7. Avoid SELECT *: always specify only the columns you need. Column-oriented means "read only selected columns" — but SELECT * forces ClickHouse to read everything.
ClickHouse Cloud and ClickPipes: analytics without ops
If you'd rather not manage a cluster yourself, ClickHouse Cloud offers a managed service with a long list of enterprise features:
ClickPipes is managed data ingestion — connecting directly to Kafka, PostgreSQL (CDC), MySQL (CDC), MongoDB (CDC), Amazon Kinesis, S3, GCS, Azure Blob Storage, and more, without any ETL code of your own.
graph LR
A["PostgreSQL
(CDC)"] --> CP["ClickPipes"]
B["Apache Kafka"] --> CP
C["Amazon S3"] --> CP
D["MongoDB
(CDC)"] --> CP
CP --> CH["ClickHouse Cloud"]
CH --> G["Grafana"]
CH --> S["Superset"]
CH --> API["Your .NET API"]
CH --> MB["Metabase"]
style CP fill:#e94560,stroke:#fff,color:#fff
style CH fill:#16213e,stroke:#fff,color:#fff
style A fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style G fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
style S fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
style API fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
style MB fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
ClickPipes: managed ingestion from many sources into ClickHouse Cloud
AI features and MCP Server
One thing worth noting is that ClickHouse now ships an MCP Server (Model Context Protocol) that lets AI agents query ClickHouse directly. Together with the Langfuse acquisition, ClickHouse is turning into a foundational platform for AI observability — tracking LLM tokens, latency, cost, and quality metrics at scale.
-- Example: LLM observability table on ClickHouse
CREATE TABLE llm_traces (
trace_id UUID,
model LowCardinality(String),
prompt_tokens UInt32,
completion_tokens UInt32,
latency_ms UInt32,
cost_usd Decimal(10,6),
status LowCardinality(String),
created_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (model, created_at);
-- Dashboard query: LLM cost by model over 7 days
SELECT
model,
count() AS requests,
avg(latency_ms) AS avg_latency,
sum(cost_usd) AS total_cost,
sum(prompt_tokens + completion_tokens) AS total_tokens
FROM llm_traces
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY model
ORDER BY total_cost DESC;
Self-hosting ClickHouse: from single node to cluster
ClickHouse is completely free to self-host. Here's the quickest way to get going with Docker.
# docker-compose.yml — single-node ClickHouse
version: '3.8'
services:
clickhouse:
image: clickhouse/clickhouse-server:26.3
ports:
- "8123:8123" # HTTP interface
- "9000:9000" # Native TCP
- "9009:9009" # Inter-server replication
volumes:
- ch_data:/var/lib/clickhouse
- ch_logs:/var/log/clickhouse-server
environment:
CLICKHOUSE_DB: analytics
CLICKHOUSE_USER: admin
CLICKHOUSE_PASSWORD: your_secure_password
ulimits:
nofile:
soft: 262144
hard: 262144
deploy:
resources:
limits:
memory: 4G
volumes:
ch_data:
ch_logs:
Sizing guide for a single node
RAM: ClickHouse uses roughly 50% of RAM for query processing. 8GB of RAM comfortably handles several hundred million rows. Disk: with a 10:1 compression ratio, 100GB of raw data takes up ~10GB on disk. Prefer SSD/NVMe — ClickHouse takes great advantage of fast random reads. CPU: more cores is always better — ClickHouse parallelizes queries across every available core.
clickhousectl — the new CLI (beta, v26.3)
ClickHouse just released clickhousectl — an official CLI that supports both local installs and cloud deployments. It was designed "with AI agents in mind" — OAuth support, permissioned API keys, and output formats that play well with agentic workflows.
# Install clickhousectl
curl -fsSL https://clickhouse.com/install-ctl | bash
# Manage a local instance
clickhousectl install --version 26.3
clickhousectl start
clickhousectl status
# Connect to the cloud
clickhousectl cloud login
clickhousectl cloud query "SELECT count() FROM events"
Real-world benchmarks: ClickHouse in production
A few notable production case studies:
| Company | Use case | Scale | Result |
|---|---|---|---|
| Cloudflare | DNS analytics, HTTP logs | Tens of PB | Sub-second queries on petabytes |
| Uber | Real-time pricing, logging | Billions of events/day | P99 latency under 1s |
| Goldsky | Blockchain data backfill | Billions of rows | 12x throughput improvement |
| GitLab | Product analytics | Hundreds of TB | 90% query-cost reduction vs BigQuery |
| Spotify | A/B testing analytics | Trillions of rows | Interactive queries on experiment data |
Conclusion: ClickHouse — not just a database, but an analytics platform
Between the $400M Series D, the Langfuse acquisition, and the steady cadence of 26.x improvements, ClickHouse is evolving from "column-oriented database" into a full analytics platform — covering ingestion (ClickPipes), storage (MergeTree + data lake integrations), querying (vectorized + materialized views), and AI infrastructure (MCP Server + LLM observability).
If your system is bottlenecked at the analytics layer — slow dashboards, timeouts on big tables, or runaway BigQuery/Redshift bills — ClickHouse is the most compelling option to try in 2026. Between free self-hosting and a flexible managed cloud, the barrier to entry is essentially zero.
Getting started with ClickHouse
Try it right now at play.clickhouse.com — a free playground with no sign-up required. Or use the docker-compose file above to spin up a local instance in 2 minutes. The official docs at clickhouse.com/docs have step-by-step tutorials for every common use case.
References
Saga Pattern: Managing Distributed Transactions in Microservices
Vue 3.6 Vapor Mode — Goodbye Virtual DOM, Solid.js-level Performance
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.