DuckDB 2026: Embedded OLAP Database — Data Analytics Without a Server

Posted on: 4/21/2026 11:15:04 AM

Table of contents

What is DuckDB — "SQLite for Analytics"
Columnar-Vectorized Architecture — Why DuckDB Is So Fast
1. Columnar Storage
2. Vectorized Execution
  1. DuckDB Query Execution Pipeline
  2. Why is vectorized better than tuple-at-a-time?
DuckLake 1.0 — Lakehouse Without Spark
1. Data Inlining — Optimizing Small Tables
DuckDB-WASM — Analytics Directly in the Browser
1. DuckDB-WASM Limitations
MotherDuck — Serverless DuckDB in the Cloud
1. MotherDuck Dual Execution Flow
DuckDB vs ClickHouse — When to Use Which
1. DuckDB + ClickHouse: The Optimal Combo
Integrating DuckDB in Practice
Extensions — Extending DuckDB Without Limits
1. Lance Extension — DuckDB for AI/ML Workloads
Benchmarks — DuckDB on Real Hardware
When NOT to Use DuckDB
Roadmap — Where DuckDB Is Heading
Conclusion

v1.5.2 Latest version (04/2026)

258× Speedup with DuckLake metadata queries

5-20ms DuckDB-WASM browser latency

0 Servers to install

What is DuckDB — "SQLite for Analytics"

DuckDB is an embedded OLAP database that runs entirely within your application process without requiring any server. If SQLite is the go-to embedded database for OLTP (transactional workloads), DuckDB is its analytical counterpart — processing complex analytical queries on large datasets with remarkable speed.

No daemons to install, no ports to configure, no connection pools to manage. Just import the DuckDB library into your project (Python, Node.js, Rust, Go, Java, C++...), create a database file or run in-memory — and you have a SQL engine as powerful as ClickHouse for single-node workloads.

graph TB
    subgraph "Traditional OLAP (ClickHouse, BigQuery)"
        C[Client App] -->|Network| S[Database Server]
        S --> D[(Disk Storage)]
        S --> W1[Worker Node 1]
        S --> W2[Worker Node 2]
    end
    subgraph "DuckDB — Embedded OLAP"
        A[Application Process] --> E[DuckDB Engine]
        E --> F[(Local File / Memory)]
    end
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style S fill:#e94560,stroke:#fff,color:#fff
    style D fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style W1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style W2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style A fill:#4CAF50,stroke:#fff,color:#fff
    style E fill:#4CAF50,stroke:#fff,color:#fff
    style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

Traditional OLAP requires a separate server. DuckDB runs embedded directly within the application process.

Columnar-Vectorized Architecture — Why DuckDB Is So Fast

DuckDB combines two critical techniques to achieve high performance for analytical queries:

Columnar Storage

Instead of storing data row-by-row (row-oriented) like PostgreSQL or MySQL, DuckDB stores data by column (column-oriented). When you run SELECT AVG(price) FROM orders WHERE year = 2026, the engine only reads the price and year columns — completely skipping all other columns. For a table with 50 columns, this can reduce I/O by up to 96%.

Data within the same column often shares similar values (e.g., a country column repeating "VN" millions of times), enabling significantly better compression compared to row-based storage — typically achieving 5-10× better compression ratios.

Vectorized Execution

DuckDB doesn't process one row at a time (tuple-at-a-time like PostgreSQL), nor does it compile queries to native code (like HyPer). Instead, DuckDB processes data in "vectors" — batches of approximately 2048 values at a time. This approach maximizes CPU cache locality and leverages SIMD instructions on modern processors.

DuckDB Query Execution Pipeline

SQL Query
    ↓
Parser → AST (Abstract Syntax Tree)
    ↓
Binder → Resolve table names, columns, data types
    ↓
Optimizer → Predicate pushdown, join reordering, filter optimization
    ↓
Physical Planner → Select join algorithms, scan strategies
    ↓
Vectorized Executor → Process batches of 2048 values/vector
    ↓
Result (Arrow format / materialized)

Why is vectorized better than tuple-at-a-time?

When processing 1 row at a time, the CPU pays function call overhead for each row. With 100 million rows, that's 100 million function calls. Vectorized execution reduces this to ~50,000 calls (100M / 2048). Combined with cache-friendly columnar layout, DuckDB can be 10-100× faster than PostgreSQL for analytical queries on the same hardware.

DuckLake 1.0 — Lakehouse Without Spark

DuckLake, officially released as production-ready in April 2026, is a new lakehouse format that stores metadata in database catalogs (PostgreSQL, SQLite, or DuckDB itself) rather than scattered metadata files like Apache Iceberg or Delta Lake.

graph LR
    subgraph "Iceberg / Delta Lake"
        MF[Metadata Files
JSON + Avro] --> PQ1[Parquet Files]
        MF --> PQ2[Parquet Files]
        MF --> PQ3[Parquet Files]
    end
    subgraph "DuckLake"
        DB[(Metadata DB
PostgreSQL / SQLite)] --> P1[Parquet Files]
        DB --> P2[Parquet Files]
        DB --> P3[Parquet Files]
    end
    style MF fill:#ff9800,stroke:#fff,color:#fff
    style DB fill:#4CAF50,stroke:#fff,color:#fff
    style PQ1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PQ2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PQ3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Iceberg/Delta use file-based metadata. DuckLake uses a database catalog — significantly faster for metadata queries.

The metadata-in-database architecture delivers clear benefits:

Feature	DuckLake	Apache Iceberg	Delta Lake
Metadata storage	Database catalog (PG, SQLite)	File-based (JSON + Avro)	File-based (JSON log)
COUNT(*) performance	Metadata-only, 8-258× faster	Requires scanning manifest files	Requires scanning delta log
Sorted tables	Native support	Via sort order config	Z-ordering
Data inlining (≤10 rows)	Stored directly in catalog	Not supported	Not supported
Bucket partitioning	Built-in	Transform-based	Liquid clustering
Deletion vectors	Iceberg-compatible	V2 format	DV-based
Setup complexity	Low — just DuckDB + catalog DB	High — needs Spark/Trino/Flink	Medium — needs Spark/Databricks

Data Inlining — Optimizing Small Tables

When a table has ≤10 rows, DuckLake stores data directly in the metadata catalog instead of creating a separate Parquet file. Use CHECKPOINT to flush inlined data to files when needed. This optimization is extremely useful for dimension tables or small lookup tables.

DuckDB-WASM — Analytics Directly in the Browser

DuckDB is written in C++ and has been compiled to WebAssembly, enabling a full OLAP SQL engine to run directly in the browser. No requests to a backend needed — everything is processed client-side with just 5-20ms latency.

// Initialize DuckDB-WASM in the browser
import * as duckdb from '@duckdb/duckdb-wasm';

const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

const worker = new Worker(bundle.mainWorker);
const logger = new duckdb.ConsoleLogger();
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

const conn = await db.connect();

// Query Parquet files directly from a URL
const result = await conn.query(`
  SELECT
    region,
    COUNT(*) as total_orders,
    AVG(amount) as avg_amount
  FROM 'https://data.example.com/orders.parquet'
  WHERE year = 2026
  GROUP BY region
  ORDER BY total_orders DESC
`);
console.table(result.toArray());

sequenceDiagram
    participant U as User Browser
    participant W as DuckDB-WASM
    participant S as Object Storage (S3/R2)

    U->>W: SQL Query
    W->>S: HTTP Range Request (only needed columns)
    S-->>W: Parquet column chunks
    W->>W: Vectorized execution (local)
    W-->>U: Results (5-20ms)

    Note over U,W: All processing happens client-side
No backend API needed

DuckDB-WASM queries Parquet files directly from object storage without a backend intermediary.

Real-world use cases for DuckDB-WASM:

Interactive dashboards: After initial data load from cloud, all filtering/grouping/sorting runs locally — no network roundtrips
Data exploration tools: Let users upload CSV/Parquet files and analyze them instantly in the browser
Embedded analytics: Integrate into SaaS products so users can self-serve queries without server costs
Offline-capable analytics: Combine with Service Workers to cache data, enabling analysis without network

DuckDB-WASM Limitations

WASM is constrained by browser memory (typically ~2-4GB). For larger datasets, MotherDuck uses "Dual Execution" — the query starts on the cloud engine and streams results to local DuckDB-WASM for subsequent interactions. Additionally, DuckDB-WASM runs single-threaded on some older browsers.

MotherDuck — Serverless DuckDB in the Cloud

MotherDuck is the serverless cloud service for DuckDB, offering a unique "Dual Execution" model: part of the query runs in the cloud (for heavy lifting), while the rest runs locally on the DuckDB client (for fast interaction).

In April 2026, MotherDuck launched a PostgreSQL wire-protocol endpoint, allowing any application with a Postgres driver to connect to MotherDuck without installing DuckDB:

-- Connect from any Postgres client
-- psql, DBeaver, .NET Npgsql, node-postgres...
psql "host=pg.us-east-1-aws.motherduck.com port=5432 dbname=my_db user=token password=eyJ..."

-- Run DuckDB SQL via Postgres protocol
SELECT region, SUM(revenue)
FROM sales_2026.parquet
GROUP BY region;

MotherDuck Dual Execution Flow

Client App
    ↓ SQL Query
MotherDuck Cloud Engine
    ↓ Heavy compute (joins, aggregations on TB of data)
    ↓ Stream results
Local DuckDB (or DuckDB-WASM)
    ↓ Subsequent filtering, pivoting, sorting
    ↓ Zero network roundtrip
User sees results (interactive)

DuckDB vs ClickHouse — When to Use Which

DuckDB and ClickHouse are both excellent OLAP databases but solve different problems. Understanding the differences helps you choose the right tool for each use case.

Criteria	DuckDB	ClickHouse
Deployment model	Embedded (in-process, zero config)	Client-server (requires installation & ops)
Scaling	Vertical — single node	Horizontal — shared-nothing cluster
Optimal data size	GB → a few hundred GB	TB → PB
Concurrent users	1-5 (single analyst / pipeline)	Hundreds (multi-tenant dashboards)
Ingestion pattern	Batch (read Parquet, CSV directly)	Real-time streaming + batch
Ops complexity	Zero — ships with your app	Medium → high (replication, sharding)
WASM support	Full (runs in browser)	No
SQL compliance	PostgreSQL-compatible	ClickHouse SQL (near ANSI)
Ecosystem	Python-first, data science friendly	Backend-first, infra-oriented
Cost	Free, open source (MIT)	Free (Apache 2.0) or ClickHouse Cloud

graph TD
    Q{What's your problem?}
    Q -->|Data < 500GB
1-5 analysts| D[DuckDB]
    Q -->|Data > 1TB
Real-time ingestion| CH[ClickHouse]
    Q -->|Client-side analytics
Browser dashboards| DW[DuckDB-WASM]
    Q -->|Multi-tenant SaaS
100+ concurrent users| CHC[ClickHouse Cloud]
    Q -->|Dev/staging
Data exploration| DD[DuckDB + MotherDuck]
    Q -->|Production analytics
Sub-second dashboards| CHP[ClickHouse Production]

    style Q fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style CH fill:#2c3e50,stroke:#fff,color:#fff
    style DW fill:#4CAF50,stroke:#fff,color:#fff
    style CHC fill:#2c3e50,stroke:#fff,color:#fff
    style DD fill:#4CAF50,stroke:#fff,color:#fff
    style CHP fill:#2c3e50,stroke:#fff,color:#fff

Decision tree: DuckDB for single-node analytics, ClickHouse for distributed production workloads.

DuckDB + ClickHouse: The Optimal Combo

Many teams use both: DuckDB for development, data exploration, and CI/CD testing — ClickHouse for production serving. DuckDB reads Parquet files exported by ClickHouse, and vice versa. Workflow: develop queries on DuckDB locally → validate on staging → deploy to ClickHouse production.

Integrating DuckDB in Practice

Python — Data Science Workflow

import duckdb

# In-memory connection
con = duckdb.connect()

# Query Parquet directly from S3 — no download needed
df = con.sql("""
    SELECT
        product_category,
        DATE_TRUNC('month', order_date) AS month,
        SUM(revenue) AS monthly_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM 's3://my-bucket/orders/year=2026/*.parquet'
    WHERE region = 'APAC'
    GROUP BY ALL
    ORDER BY month, monthly_revenue DESC
""").df()  # Returns Pandas DataFrame

# Or convert to Polars
pl_df = con.sql("SELECT * FROM df WHERE monthly_revenue > 100000").pl()

.NET — Embedded Analytics in ASP.NET

// NuGet: DuckDB.NET.Data
using DuckDB.NET.Data;

// Open database file (or ":memory:" for in-memory)
using var connection = new DuckDBConnection("Data Source=analytics.duckdb");
connection.Open();

using var command = connection.CreateCommand();
command.CommandText = @"
    SELECT
        region,
        COUNT(*) as total_orders,
        ROUND(AVG(amount), 2) as avg_amount
    FROM read_parquet('data/orders_2026.parquet')
    GROUP BY region
    HAVING total_orders > 1000
    ORDER BY avg_amount DESC";

using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader["region"]}: {reader["total_orders"]} orders, avg ${reader["avg_amount"]}");
}

Vue.js — Client-side Analytics Dashboard

<script setup>
import { ref, onMounted } from 'vue'
import * as duckdb from '@duckdb/duckdb-wasm'

const data = ref([])
const loading = ref(true)

onMounted(async () => {
  const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles()
  const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES)

  const worker = new Worker(bundle.mainWorker)
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker)
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker)

  const conn = await db.connect()

  // Load data once from server
  await db.registerFileURL(
    'sales.parquet',
    '/api/exports/sales_2026.parquet',
    duckdb.DuckDBDataProtocol.HTTP, false
  )

  // All subsequent queries run locally — zero latency
  const result = await conn.query(`
    SELECT month, SUM(revenue) as total
    FROM 'sales.parquet'
    GROUP BY month ORDER BY month
  `)

  data.value = result.toArray()
  loading.value = false
})
</script>

Extensions — Extending DuckDB Without Limits

DuckDB features a flexible extension system that allows adding new data types, functions, file formats, and even new SQL syntax. Extensions are loaded dynamically on demand:

-- Install and load extensions
INSTALL httpfs;   -- Read files from HTTP/S3
LOAD httpfs;

INSTALL iceberg;  -- Read Apache Iceberg tables
LOAD iceberg;

INSTALL spatial;  -- Geospatial functions (ST_Distance, ST_Within...)
LOAD spatial;

-- Query an Iceberg table directly
SELECT * FROM iceberg_scan('s3://warehouse/orders')
WHERE order_date >= '2026-01-01';

Extension	Functionality	WASM Support
httpfs	Read files from HTTP, S3, GCS, Azure Blob	Yes
iceberg	Read/write Apache Iceberg tables	Yes (new in 2026)
parquet	Read/write Parquet (built-in)	Yes
json	Read/write JSON/NDJSON	Yes
spatial	Geospatial (PostGIS-like)	Yes
lance	Vector search, full-text search for AI/ML	In development
postgres_scanner	Query PostgreSQL directly from DuckDB	No
mysql_scanner	Query MySQL directly from DuckDB	No

Lance Extension — DuckDB for AI/ML Workloads

The latest lance extension enables reading/writing Lance datasets (a columnar format optimized for ML) with vector search (lance_vector_search()), full-text search (lance_fts()), and hybrid search (lance_hybrid_search()). By combining DuckDB with Lance, you can build a complete RAG pipeline without needing a separate vector database.

Benchmarks — DuckDB on Real Hardware

Benchmarks run on an entry-level MacBook (Apple Silicon, 8GB RAM) show impressive results:

<1s ClickBench median (100M rows, 5GB)

1.63s TPC-DS SF100 query median

79 min TPC-DS SF300 (disk spill)

238ms 5M rows query (Jupyter kernel)

Notably, DuckDB handles datasets larger than RAM through its disk spill mechanism — automatically writing temporary data to disk when memory is insufficient, then reading it back when needed. Processing TPC-DS SF300 (~300GB of data) on a machine with only 8GB of RAM is clear proof of this capability.

When NOT to Use DuckDB

DuckDB is not a silver bullet. There are scenarios where it's not the best choice:

OLTP workloads: DuckDB is optimized for read-heavy, write-light workloads. If you need millions of INSERT/UPDATE per second → use PostgreSQL, MySQL, or SQL Server
High-concurrency serving: Hundreds of simultaneous users querying → ClickHouse or Druid are better suited
Real-time streaming ingestion: DuckDB has no built-in streaming engine. Need Kafka → ClickHouse/Flink pipeline
Multi-TB datasets requiring distributed processing: When data exceeds single-node capacity → ClickHouse cluster or Spark + Iceberg
Multi-writer concurrency: DuckDB uses a single-writer model, allowing only one process to write at a time

Roadmap — Where DuckDB Is Heading

2018 — Origins

Born at CWI Amsterdam (the same institution that created MonetDB). Goal: create "SQLite for analytics".

2022 — DuckDB-WASM

Compiled to WebAssembly, running in the browser. Demo paper at VLDB.

2024 — v1.0 stable

First stable release. MotherDuck launches serverless cloud platform.

2026 Q1 — DuckLake 1.0

Production-ready lakehouse format. Iceberg extension with WASM support. Lance extension for vector search.

2026 Q2 — Current (v1.5.2)

PostgreSQL wire protocol (MotherDuck). Community extensions ecosystem booming. O'Reilly DuckLake book in progress.

Conclusion

DuckDB represents a paradigm shift in how developers approach data analytics. Instead of setting up ClickHouse clusters or waiting for BigQuery query slots, you can run complex analytical queries on your laptop with remarkable performance — sub-second for 100 million rows on entry-level hardware.

With DuckLake 1.0, DuckDB is no longer just an embedded database but has become a lightweight lakehouse platform. With DuckDB-WASM, analytics running directly in the browser opens entirely new use cases for frontend developers.

Practical advice: use DuckDB for development, exploration, and embedded analytics. When you need to scale to terabytes of data with hundreds of concurrent users — ClickHouse remains the production workhorse. And don't forget: DuckDB is free, open source (MIT license), and just a pip install duckdb away.

References:

#DuckDB #OLAP #Database #Data Engineering #ClickHouse #WebAssembly #Analytics #system design

# DuckDB 2026: Embedded OLAP Database — Data Analytics Without a Server

v1.5.2 Latest version (04/2026)

258× Speedup with DuckLake metadata queries

5-20ms DuckDB-WASM browser latency

0 Servers to install

## What is DuckDB — "SQLite for Analytics"

```
graph TB
    subgraph "Traditional OLAP (ClickHouse, BigQuery)"
        C[Client App] -->|Network| S[Database Server]
        S --> D[(Disk Storage)]
        S --> W1[Worker Node 1]
        S --> W2[Worker Node 2]
    end
    subgraph "DuckDB — Embedded OLAP"
        A[Application Process] --> E[DuckDB Engine]
        E --> F[(Local File / Memory)]
    end
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style S fill:#e94560,stroke:#fff,color:#fff
    style D fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style W1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style W2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style A fill:#4CAF50,stroke:#fff,color:#fff
    style E fill:#4CAF50,stroke:#fff,color:#fff
    style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

```
Traditional OLAP requires a separate server. DuckDB runs embedded directly within the application process.

## Columnar-Vectorized Architecture — Why DuckDB Is So Fast

DuckDB combines two critical techniques to achieve high performance for analytical queries:

### Columnar Storage

Instead of storing data row-by-row (row-oriented) like PostgreSQL or MySQL, DuckDB stores data by column (column-oriented). When you run `SELECT AVG(price) FROM orders WHERE year = 2026`, the engine only reads the `price` and `year` columns — completely skipping all other columns. For a table with 50 columns, this can reduce I/O by up to 96%.

Data within the same column often shares similar values (e.g., a `country` column repeating "VN" millions of times), enabling significantly better compression compared to row-based storage — typically achieving 5-10× better compression ratios.

### Vectorized Execution

#### DuckDB Query Execution Pipeline

```
SQL Query
    ↓
Parser → AST (Abstract Syntax Tree)
    ↓
Binder → Resolve table names, columns, data types
    ↓
Optimizer → Predicate pushdown, join reordering, filter optimization
    ↓
Physical Planner → Select join algorithms, scan strategies
    ↓
Vectorized Executor → Process batches of 2048 values/vector
    ↓
Result (Arrow format / materialized)

```

#### Why is vectorized better than tuple-at-a-time?

## DuckLake 1.0 — Lakehouse Without Spark

```
graph LR
    subgraph "Iceberg / Delta Lake"
        MF[Metadata Files  
JSON + Avro] --> PQ1[Parquet Files]
        MF --> PQ2[Parquet Files]
        MF --> PQ3[Parquet Files]
    end
    subgraph "DuckLake"
        DB[(Metadata DB  
PostgreSQL / SQLite)] --> P1[Parquet Files]
        DB --> P2[Parquet Files]
        DB --> P3[Parquet Files]
    end
    style MF fill:#ff9800,stroke:#fff,color:#fff
    style DB fill:#4CAF50,stroke:#fff,color:#fff
    style PQ1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PQ2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PQ3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

```
Iceberg/Delta use file-based metadata. DuckLake uses a database catalog — significantly faster for metadata queries.

The metadata-in-database architecture delivers clear benefits:

| Feature | DuckLake | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Metadata storage | Database catalog (PG, SQLite) | File-based (JSON + Avro) | File-based (JSON log) |
| COUNT(*) performance | Metadata-only, 8-258× faster | Requires scanning manifest files | Requires scanning delta log |
| Sorted tables | Native support | Via sort order config | Z-ordering |
| Data inlining (≤10 rows) | Stored directly in catalog | Not supported | Not supported |
| Bucket partitioning | Built-in | Transform-based | Liquid clustering |
| Deletion vectors | Iceberg-compatible | V2 format | DV-based |
| Setup complexity | Low — just DuckDB + catalog DB | High — needs Spark/Trino/Flink | Medium — needs Spark/Databricks |

#### Data Inlining — Optimizing Small Tables

When a table has ≤10 rows, DuckLake stores data directly in the metadata catalog instead of creating a separate Parquet file. Use `CHECKPOINT` to flush inlined data to files when needed. This optimization is extremely useful for dimension tables or small lookup tables.

## DuckDB-WASM — Analytics Directly in the Browser

```
// Initialize DuckDB-WASM in the browser
import * as duckdb from '@duckdb/duckdb-wasm';

const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

const worker = new Worker(bundle.mainWorker);
const logger = new duckdb.ConsoleLogger();
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

const conn = await db.connect();

// Query Parquet files directly from a URL
const result = await conn.query(`
  SELECT
    region,
    COUNT(*) as total_orders,
    AVG(amount) as avg_amount
  FROM 'https://data.example.com/orders.parquet'
  WHERE year = 2026
  GROUP BY region
  ORDER BY total_orders DESC
`);
console.table(result.toArray());

```

```
sequenceDiagram
    participant U as User Browser
    participant W as DuckDB-WASM
    participant S as Object Storage (S3/R2)

U->>W: SQL Query
    W->>S: HTTP Range Request (only needed columns)
    S-->>W: Parquet column chunks
    W->>W: Vectorized execution (local)
    W-->>U: Results (5-20ms)

Note over U,W: All processing happens client-side  
No backend API needed

```
DuckDB-WASM queries Parquet files directly from object storage without a backend intermediary.

Real-world use cases for DuckDB-WASM:

- **Interactive dashboards:** After initial data load from cloud, all filtering/grouping/sorting runs locally — no network roundtrips
- **Data exploration tools:** Let users upload CSV/Parquet files and analyze them instantly in the browser
- **Embedded analytics:** Integrate into SaaS products so users can self-serve queries without server costs
- **Offline-capable analytics:** Combine with Service Workers to cache data, enabling analysis without network

#### DuckDB-WASM Limitations

## MotherDuck — Serverless DuckDB in the Cloud

In April 2026, MotherDuck launched a PostgreSQL wire-protocol endpoint, allowing any application with a Postgres driver to connect to MotherDuck without installing DuckDB:

```
-- Connect from any Postgres client
-- psql, DBeaver, .NET Npgsql, node-postgres...
psql "host=pg.us-east-1-aws.motherduck.com port=5432 dbname=my_db user=token password=eyJ..."

-- Run DuckDB SQL via Postgres protocol
SELECT region, SUM(revenue)
FROM sales_2026.parquet
GROUP BY region;
```

#### MotherDuck Dual Execution Flow

```
Client App
    ↓ SQL Query
MotherDuck Cloud Engine
    ↓ Heavy compute (joins, aggregations on TB of data)
    ↓ Stream results
Local DuckDB (or DuckDB-WASM)
    ↓ Subsequent filtering, pivoting, sorting
    ↓ Zero network roundtrip
User sees results (interactive)
```

## DuckDB vs ClickHouse — When to Use Which

DuckDB and ClickHouse are both excellent OLAP databases but solve different problems. Understanding the differences helps you choose the right tool for each use case.

| Criteria | DuckDB | ClickHouse |
| --- | --- | --- |
| Deployment model | Embedded (in-process, zero config) | Client-server (requires installation & ops) |
| Scaling | Vertical — single node | Horizontal — shared-nothing cluster |
| Optimal data size | GB → a few hundred GB | TB → PB |
| Concurrent users | 1-5 (single analyst / pipeline) | Hundreds (multi-tenant dashboards) |
| Ingestion pattern | Batch (read Parquet, CSV directly) | Real-time streaming + batch |
| Ops complexity | Zero — ships with your app | Medium → high (replication, sharding) |
| WASM support | Full (runs in browser) | No |
| SQL compliance | PostgreSQL-compatible | ClickHouse SQL (near ANSI) |
| Ecosystem | Python-first, data science friendly | Backend-first, infra-oriented |
| Cost | Free, open source (MIT) | Free (Apache 2.0) or ClickHouse Cloud |

```
graph TD
    Q{What's your problem?}
    Q -->|Data < 500GB  
1-5 analysts| D[DuckDB]
    Q -->|Data > 1TB  
Real-time ingestion| CH[ClickHouse]
    Q -->|Client-side analytics  
Browser dashboards| DW[DuckDB-WASM]
    Q -->|Multi-tenant SaaS  
100+ concurrent users| CHC[ClickHouse Cloud]
    Q -->|Dev/staging  
Data exploration| DD[DuckDB + MotherDuck]
    Q -->|Production analytics  
Sub-second dashboards| CHP[ClickHouse Production]

style Q fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style CH fill:#2c3e50,stroke:#fff,color:#fff
    style DW fill:#4CAF50,stroke:#fff,color:#fff
    style CHC fill:#2c3e50,stroke:#fff,color:#fff
    style DD fill:#4CAF50,stroke:#fff,color:#fff
    style CHP fill:#2c3e50,stroke:#fff,color:#fff

```
Decision tree: DuckDB for single-node analytics, ClickHouse for distributed production workloads.

#### DuckDB + ClickHouse: The Optimal Combo

## Integrating DuckDB in Practice

### Python — Data Science Workflow

```
import duckdb

# In-memory connection
con = duckdb.connect()

# Query Parquet directly from S3 — no download needed
df = con.sql("""
    SELECT
        product_category,
        DATE_TRUNC('month', order_date) AS month,
        SUM(revenue) AS monthly_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM 's3://my-bucket/orders/year=2026/*.parquet'
    WHERE region = 'APAC'
    GROUP BY ALL
    ORDER BY month, monthly_revenue DESC
""").df()  # Returns Pandas DataFrame

# Or convert to Polars
pl_df = con.sql("SELECT * FROM df WHERE monthly_revenue > 100000").pl()
```

### .NET — Embedded Analytics in ASP.NET

```
// NuGet: DuckDB.NET.Data
using DuckDB.NET.Data;

// Open database file (or ":memory:" for in-memory)
using var connection = new DuckDBConnection("Data Source=analytics.duckdb");
connection.Open();

using var command = connection.CreateCommand();
command.CommandText = @"
    SELECT
        region,
        COUNT(*) as total_orders,
        ROUND(AVG(amount), 2) as avg_amount
    FROM read_parquet('data/orders_2026.parquet')
    GROUP BY region
    HAVING total_orders > 1000
    ORDER BY avg_amount DESC";

using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader["region"]}: {reader["total_orders"]} orders, avg ${reader["avg_amount"]}");
}
```

### Vue.js — Client-side Analytics Dashboard

```
<script setup>
import { ref, onMounted } from 'vue'
import * as duckdb from '@duckdb/duckdb-wasm'

const data = ref([])
const loading = ref(true)

onMounted(async () => {
  const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles()
  const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES)

const worker = new Worker(bundle.mainWorker)
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker)
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker)

const conn = await db.connect()

// Load data once from server
  await db.registerFileURL(
    'sales.parquet',
    '/api/exports/sales_2026.parquet',
    duckdb.DuckDBDataProtocol.HTTP, false
  )

// All subsequent queries run locally — zero latency
  const result = await conn.query(`
    SELECT month, SUM(revenue) as total
    FROM 'sales.parquet'
    GROUP BY month ORDER BY month
  `)

data.value = result.toArray()
  loading.value = false
})
</script>
```

## Extensions — Extending DuckDB Without Limits

DuckDB features a flexible extension system that allows adding new data types, functions, file formats, and even new SQL syntax. Extensions are loaded dynamically on demand:

```
-- Install and load extensions
INSTALL httpfs;   -- Read files from HTTP/S3
LOAD httpfs;

INSTALL iceberg;  -- Read Apache Iceberg tables
LOAD iceberg;

INSTALL spatial;  -- Geospatial functions (ST_Distance, ST_Within...)
LOAD spatial;

-- Query an Iceberg table directly
SELECT * FROM iceberg_scan('s3://warehouse/orders')
WHERE order_date >= '2026-01-01';
```

| Extension | Functionality | WASM Support |
| --- | --- | --- |
| httpfs | Read files from HTTP, S3, GCS, Azure Blob | Yes |
| iceberg | Read/write Apache Iceberg tables | Yes (new in 2026) |
| parquet | Read/write Parquet (built-in) | Yes |
| json | Read/write JSON/NDJSON | Yes |
| spatial | Geospatial (PostGIS-like) | Yes |
| lance | Vector search, full-text search for AI/ML | In development |
| postgres_scanner | Query PostgreSQL directly from DuckDB | No |
| mysql_scanner | Query MySQL directly from DuckDB | No |

#### Lance Extension — DuckDB for AI/ML Workloads

The latest `lance` extension enables reading/writing Lance datasets (a columnar format optimized for ML) with vector search (`lance_vector_search()`), full-text search (`lance_fts()`), and hybrid search (`lance_hybrid_search()`). By combining DuckDB with Lance, you can build a complete RAG pipeline without needing a separate vector database.

## Benchmarks — DuckDB on Real Hardware

Benchmarks run on an entry-level MacBook (Apple Silicon, 8GB RAM) show impressive results:

<1s ClickBench median (100M rows, 5GB)

1.63s TPC-DS SF100 query median

79 min TPC-DS SF300 (disk spill)

238ms 5M rows query (Jupyter kernel)

Notably, DuckDB handles datasets larger than RAM through its **disk spill** mechanism — automatically writing temporary data to disk when memory is insufficient, then reading it back when needed. Processing TPC-DS SF300 (~300GB of data) on a machine with only 8GB of RAM is clear proof of this capability.

## When NOT to Use DuckDB

DuckDB is not a silver bullet. There are scenarios where it's not the best choice:

- **OLTP workloads:** DuckDB is optimized for read-heavy, write-light workloads. If you need millions of INSERT/UPDATE per second → use PostgreSQL, MySQL, or SQL Server
- **High-concurrency serving:** Hundreds of simultaneous users querying → ClickHouse or Druid are better suited
- **Real-time streaming ingestion:** DuckDB has no built-in streaming engine. Need Kafka → ClickHouse/Flink pipeline
- **Multi-TB datasets requiring distributed processing:** When data exceeds single-node capacity → ClickHouse cluster or Spark + Iceberg
- **Multi-writer concurrency:** DuckDB uses a single-writer model, allowing only one process to write at a time

## Roadmap — Where DuckDB Is Heading

2018 — Origins

Born at CWI Amsterdam (the same institution that created MonetDB). Goal: create "SQLite for analytics".

2022 — DuckDB-WASM

Compiled to WebAssembly, running in the browser. Demo paper at VLDB.

2024 — v1.0 stable

First stable release. MotherDuck launches serverless cloud platform.

2026 Q1 — DuckLake 1.0

Production-ready lakehouse format. Iceberg extension with WASM support. Lance extension for vector search.

2026 Q2 — Current (v1.5.2)

PostgreSQL wire protocol (MotherDuck). Community extensions ecosystem booming. O'Reilly DuckLake book in progress.

## Conclusion

**References:**

- [Why DuckDB — DuckDB Official Documentation](https://duckdb.org/why_duckdb)
- [DuckDB Ecosystem Newsletter April 2026 — MotherDuck Blog](https://motherduck.com/blog/duckdb-ecosystem-newsletter-april-2026/)
- [ClickHouse vs DuckDB: Choosing the Right OLAP Database — CloudRaft](https://www.cloudraft.io/blog/clickhouse-vs-duckdb)
- [DuckDB WASM: Analytical SQL Database in Your Browser — MotherDuck](https://motherduck.com/blog/duckdb-wasm-in-browser/)
- [Iceberg in the Browser — DuckDB Engineering Blog](https://duckdb.org/2025/12/16/iceberg-in-the-browser)
- [OLAP Databases: What's New and Best in 2026 — Tinybird](https://www.tinybird.co/blog/best-database-for-olap)

Kubernetes Cost Optimization 2026: Karpenter, Spot Instances & Right-Sizing to Cut 55% Cloud Bill

Idempotency Pattern — Designing Duplicate-Proof APIs for Distributed Systems

Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.