# DuckDB 2026: Embedded OLAP Database, When Data Analysis Needs No Server

Posted on: 4/21/2026 11:15:04 AM

Table of contents

- What Is DuckDB: "SQLite for Analytics"
- Columnar-Vectorized Architecture: Why DuckDB Is So Fast
  - Columnar Storage
  - Vectorized Execution
    - The Query Execution Pipeline in DuckDB
    - Why Is Vectorized Better Than Tuple-at-a-Time?
- DuckLake 1.0: A Lakehouse Without Spark
  - Data Inlining: An Optimization for Small Tables
- DuckDB-WASM: Analytics Directly in the Browser
  - Limitations of DuckDB-WASM
- MotherDuck: Serverless DuckDB in the Cloud
  - MotherDuck Dual Execution Flow
- DuckDB vs ClickHouse: When to Use Which
  - DuckDB + ClickHouse: The Best of Both
- Integrating DuckDB in Practice
- Extensions: Extending DuckDB Without Limits
  - Lance Extension: DuckDB for AI/ML Workloads
- Benchmark: DuckDB on Real Hardware
- When Not to Use DuckDB
- Roadmap: Where DuckDB Is Heading
- Conclusion

- **v1.5.2**: latest version (04/2026)
- **258×**: peak speedup on DuckLake metadata queries
- **5-20 ms**: DuckDB-WASM latency in the browser
- **0**: servers to install

## What Is DuckDB: "SQLite for Analytics"

DuckDB is an embedded OLAP database that runs entirely inside the application's process, with no server required. If SQLite is the embedded database for OLTP (transactional workloads), DuckDB is its counterpart for OLAP (analytical workloads): it runs complex analytical queries over large amounts of data quickly.

You don't install a daemon, configure a port, or manage a connection pool. Import the DuckDB library into your project (Python, Node.js, Rust, Go, Java, C++...), open a database file or run in-memory, and you have a SQL engine that is competitive with ClickHouse for single-node workloads.

```
graph TB
    subgraph "Traditional OLAP (ClickHouse, BigQuery)"
        C[Client App] -->|Network| S[Database Server]
        S --> D[(Disk Storage)]
        S --> W1[Worker Node 1]
        S --> W2[Worker Node 2]
    end
    subgraph "DuckDB (Embedded OLAP)"
        A[Application Process] --> E[DuckDB Engine]
        E --> F[(Local File / Memory)]
    end
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style S fill:#e94560,stroke:#fff,color:#fff
    style D fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style W1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style W2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style A fill:#4CAF50,stroke:#fff,color:#fff
    style E fill:#4CAF50,stroke:#fff,color:#fff
    style F fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
```

Traditional OLAP needs a separate server. DuckDB runs embedded inside the application process.

## Columnar-Vectorized Architecture: Why DuckDB Is So Fast

DuckDB combines two key techniques to achieve high performance on analytical queries:

### Columnar Storage

Instead of storing data by row (row-oriented) like PostgreSQL or MySQL, DuckDB stores it by column (column-oriented). When you run `SELECT AVG(price) FROM orders WHERE year = 2026`, the engine only needs to read the two columns `price` and `year` and skips all the others. For a table with 50 columns of similar width, that can cut I/O by up to 96%, since only 2 of the 50 columns are read.
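
The 96% figure is simple arithmetic, and a minimal sketch makes the assumption behind it explicit: every column is taken to be equally wide, which real tables rarely are. The function name `io_reduction` is made up for this illustration.

```python
def io_reduction(total_columns: int, columns_read: int) -> float:
    """Fraction of column data skipped, assuming all columns are equally wide."""
    if not 0 < columns_read <= total_columns:
        raise ValueError("columns_read must be between 1 and total_columns")
    return 1 - columns_read / total_columns


# The query above touches 2 of 50 columns (price and year).
print(f"{io_reduction(50, 2):.0%}")  # prints "96%"
```

With wide text columns in the skipped set the saving is larger. If `price` and `year` happen to be the widest columns, it is smaller.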

Values within one column tend to resemble each other (for example, a `country` column repeating "VN" millions of times), so columnar data compresses considerably better than row-based storage, often by a factor of 5-10×.
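
To see why repeated values compress so well, here is a small sketch of run-length encoding (RLE) in plain Python. RLE is chosen purely for illustration. It is one lightweight scheme among several that columnar engines use, and this is not DuckDB's implementation.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# A country column dominated by one value, as in the example above.
country = ["VN"] * 1_000_000 + ["US"] * 3 + ["VN"] * 2
encoded = rle_encode(country)
print(encoded)  # [('VN', 1000000), ('US', 3), ('VN', 2)]
assert rle_decode(encoded) == country
```

Just over a million strings shrink to three pairs. In a row layout the same values are interleaved with other fields, so runs like this never form.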

### Vectorized Execution

DuckDB does not process one row at a time (tuple-at-a-time, as PostgreSQL does), nor does it compile queries to native code (as HyPer does). Instead, it processes data in "vectors": batches of about 2048 values at a time. This approach makes good use of CPU cache locality and of the SIMD instructions on modern CPUs.

#### The Query Execution Pipeline in DuckDB

```
SQL Query
    ↓
Parser → AST (Abstract Syntax Tree)
    ↓
Binder → Resolve table names, columns, data types
    ↓
Optimizer → Predicate pushdown, join reordering, filter optimization
    ↓
Physical Planner → Choose join algorithms, scan strategy
    ↓
Vectorized Executor → Process batches of 2048 values/vector
    ↓
Result (Arrow format / materialized)
```

#### Why Is Vectorized Better Than Tuple-at-a-Time?

When rows are processed one at a time, the CPU pays function-call overhead for every row. With 100 million rows, that is 100 million calls. Vectorized execution cuts this to roughly 49,000 calls (100M / 2048). Combined with the cache-friendly columnar layout, this lets DuckDB run analytical queries 10-100× faster than PostgreSQL on the same hardware, depending on the query.
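
The call-count argument can be checked with a toy model. The sketch below is not DuckDB code. It only counts how often a hypothetical operator would be invoked in each style, using the 2048-value batch size quoted above.

```python
import math

VECTOR_SIZE = 2048  # batch size quoted in the article


def sum_tuple_at_a_time(values):
    """One operator call per row."""
    calls = 0
    total = 0
    for v in values:
        calls += 1
        total += v
    return total, calls


def sum_vectorized(values, vector_size=VECTOR_SIZE):
    """One operator call per batch; the inner loop runs over a whole vector."""
    calls = 0
    total = 0
    for start in range(0, len(values), vector_size):
        calls += 1
        total += sum(values[start:start + vector_size])
    return total, calls


rows = list(range(1_000_000))
print(sum_tuple_at_a_time(rows)[1])          # 1000000
print(sum_vectorized(rows)[1])               # 489
print(math.ceil(100_000_000 / VECTOR_SIZE))  # 48829
```

Both functions return the same sum, but the per-call overhead is paid 489 times instead of a million. In a real engine the inner loop over a vector is also what the compiler can turn into SIMD instructions.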

## DuckLake 1.0: A Lakehouse Without Spark

DuckLake, which reached its official production-ready release in April 2026, is a new lakehouse format that stores metadata in a database catalog (PostgreSQL, SQLite, or DuckDB itself) instead of in scattered metadata files, as Apache Iceberg and Delta Lake do.

```
graph LR
    subgraph "Iceberg / Delta Lake"
        MF["Metadata Files<br/>JSON + Avro"] --> PQ1[Parquet Files]
        MF --> PQ2[Parquet Files]
        MF --> PQ3[Parquet Files]
    end
    subgraph "DuckLake"
        DB[("Metadata DB<br/>PostgreSQL / SQLite")] --> P1[Parquet Files]
        DB --> P2[Parquet Files]
        DB --> P3[Parquet Files]
    end
    style MF fill:#ff9800,stroke:#fff,color:#fff
    style DB fill:#4CAF50,stroke:#fff,color:#fff
    style PQ1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PQ2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PQ3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style P3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
```

Iceberg and Delta use file-based metadata. DuckLake uses a database catalog, which is significantly faster for metadata queries.

The metadata-in-database architecture brings clear benefits:

| Feature | DuckLake | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Metadata storage | Database catalog (PG, SQLite) | File-based (JSON + Avro) | File-based (JSON log) |
| COUNT(*) performance | Metadata-only, 8-258× faster | Must scan manifest files | Must scan the delta log |
| Sorted tables | Native support | Via sort order config | Z-ordering |
| Data inlining (≤10 rows) | Stored directly in the catalog | Not supported | Not supported |
| Bucket partitioning | Built-in | Transform-based | Liquid clustering |
| Deletion vectors | Iceberg-compatible | V3 format (V2 uses delete files) | DV-based |
| Setup complexity | Low: only DuckDB + a catalog DB | High: needs Spark/Trino/Flink | Medium: needs Spark/Databricks |
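
The "metadata-only `COUNT(*)`" row is easier to picture with a toy catalog. The sketch below uses Python's built-in `sqlite3` module and a made-up `data_file` table. The schema, file paths, and row counts are assumptions for illustration and do not reproduce DuckLake's actual catalog layout.

```python
import sqlite3

# Hypothetical, heavily simplified catalog: one row per Parquet data file.
catalog = sqlite3.connect(":memory:")
catalog.execute("""
    CREATE TABLE data_file (
        table_name   TEXT,
        path         TEXT,
        record_count INTEGER
    )
""")
catalog.executemany(
    "INSERT INTO data_file VALUES (?, ?, ?)",
    [
        ("orders", "orders/part-0001.parquet", 1_200_000),
        ("orders", "orders/part-0002.parquet", 950_000),
        ("customers", "customers/part-0001.parquet", 40_000),
    ],
)


def count_rows(table_name: str) -> int:
    """Answer COUNT(*) from catalog statistics, without opening a data file."""
    (total,) = catalog.execute(
        "SELECT COALESCE(SUM(record_count), 0) FROM data_file WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    return total


print(count_rows("orders"))  # 2150000
```

The count comes from one indexed SQL lookup rather than from listing and parsing metadata files in object storage. A real catalog would also need to account for deleted rows, which this sketch ignores.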

#### Data Inlining: An Optimization for Small Tables

For very small amounts of data (≤10 rows, by the threshold cited here), DuckLake stores the rows directly in the metadata catalog instead of creating a separate Parquet file. Use `CHECKPOINT` to flush inlined data out to files when needed. This is a very useful optimization for small dimension tables and lookup tables.
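
The inlining rule can be sketched as a toy model. Everything below is hypothetical: `TinyLake`, `INLINE_ROW_LIMIT`, and `flush_inlined` are illustration names, and each list entry in `data_files` merely stands in for one Parquet file.

```python
INLINE_ROW_LIMIT = 10  # threshold quoted in the article


class TinyLake:
    """Toy model of data inlining; not DuckLake's implementation."""

    def __init__(self):
        self.inlined_rows = []  # rows kept inside the metadata catalog
        self.data_files = []    # each entry stands in for one Parquet file

    def insert(self, rows):
        rows = list(rows)
        if len(rows) <= INLINE_ROW_LIMIT:
            self.inlined_rows.extend(rows)  # small write: no new file
        else:
            self.data_files.append(rows)    # large write: one new file

    def flush_inlined(self):
        """Move inlined rows into a data file, as a checkpoint would."""
        if self.inlined_rows:
            self.data_files.append(self.inlined_rows)
            self.inlined_rows = []


lake = TinyLake()
lake.insert([("VN", "Vietnam"), ("US", "United States")])  # 2 rows: inlined
lake.insert([(i, i * 2) for i in range(50)])               # 50 rows: new file
print(len(lake.inlined_rows), len(lake.data_files))        # 2 1
lake.flush_inlined()
print(len(lake.inlined_rows), len(lake.data_files))        # 0 2
```

The point of the design is to avoid a trail of tiny Parquet files, each of which costs an object-storage round trip to read.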

## DuckDB-WASM: Analytics Directly in the Browser

DuckDB is written in C++ and has been compiled to WebAssembly, so the full OLAP SQL engine runs right in the browser. There is no request to a backend: everything is processed client-side, with latency of just 5-20 ms.

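```
// Initialize DuckDB-WASM in the browser
import * as duckdb from '@duckdb/duckdb-wasm';

const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

// The jsDelivr bundle is cross-origin, so start the worker through a
// same-origin Blob URL instead of passing the CDN URL to Worker directly
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], { type: 'text/javascript' })
);
const worker = new Worker(workerUrl);
const logger = new duckdb.ConsoleLogger();
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
URL.revokeObjectURL(workerUrl);

const conn = await db.connect();

// Query Parquet data directly from a URL (the host must allow CORS)
const result = await conn.query(`
  SELECT
    region,
    COUNT(*) as total_orders,
    AVG(amount) as avg_amount
  FROM 'https://data.example.com/orders.parquet'
  WHERE year = 2026
  GROUP BY region
  ORDER BY total_orders DESC
`);
// Convert Arrow rows to plain objects before printing
console.table(result.toArray().map((row) => row.toJSON()));
```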

```
sequenceDiagram
    participant U as User Browser
    participant W as DuckDB-WASM
    participant S as Object Storage (S3/R2)

    U->>W: SQL Query
    W->>S: HTTP Range Request (only the needed columns)
    S-->>W: Parquet column chunks
    W->>W: Vectorized execution (local)
    W-->>U: Result (5-20ms)

    Note over U,W: All processing happens client-side<br/>No backend API needed
```

DuckDB-WASM queries Parquet files directly from object storage, with no intermediate backend.
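
The "only the needed columns" step relies on standard HTTP `Range` headers. The sketch below shows how byte ranges could be derived from a file's column-chunk index. The `ColumnChunk` type, the offsets, and the sizes are invented for illustration. A real Parquet reader first fetches the footer to learn them.

```python
from typing import NamedTuple


class ColumnChunk(NamedTuple):
    column: str
    offset: int  # byte offset of the chunk within the file
    length: int  # compressed size in bytes


def range_requests(chunks, wanted):
    """Return (column, Range header value) for each needed column chunk."""
    return [
        (c.column, f"bytes={c.offset}-{c.offset + c.length - 1}")
        for c in chunks
        if c.column in wanted
    ]


# Hypothetical layout of one row group in orders.parquet.
chunks = [
    ColumnChunk("region", 4, 1_000),
    ColumnChunk("amount", 1_004, 8_000),
    ColumnChunk("year", 9_004, 500),
    ColumnChunk("notes", 9_504, 900_000),
]
wanted = {"region", "amount", "year"}

for column, header in range_requests(chunks, wanted):
    print(column, header)
# region bytes=4-1003
# amount bytes=1004-9003
# year bytes=9004-9503

fetched = sum(c.length for c in chunks if c.column in wanted)
total = sum(c.length for c in chunks)
print(f"{fetched} of {total} bytes")  # 9500 of 909500 bytes
```

In this made-up file the large free-text `notes` column is never downloaded, which is where most of the saving comes from.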

Some practical use cases for DuckDB-WASM:

- **Interactive dashboards:** After the initial data load from the cloud, all filtering, grouping, and sorting run locally, with no network roundtrip
- **Data exploration tools:** Let users upload CSV/Parquet files and analyze them right in the browser
- **Embedded analytics:** Integrate into a SaaS product so users can query their own data without incurring server cost
- **Offline-capable analytics:** Combine with a Service Worker to cache data so users can keep analyzing while offline

#### Limitations of DuckDB-WASM

WASM is limited by browser memory (typically around 2-4 GB). For larger datasets, MotherDuck uses "Dual Execution": the query starts on the cloud engine and streams its results to the local DuckDB-WASM instance for further interaction. In addition, DuckDB-WASM runs single-threaded in some environments, such as older browsers.

## MotherDuck: Serverless DuckDB in the Cloud

MotherDuck is a serverless cloud service for DuckDB built around a distinctive "Dual Execution" model: part of a query runs in the cloud (for the heavy lifting), and the rest runs locally on the DuckDB client (for fast interaction).

In April 2026, MotherDuck launched a PostgreSQL wire-protocol endpoint, so any application with a Postgres driver can connect to MotherDuck without installing DuckDB:

```
-- From a shell: connect with any Postgres client
-- (psql, DBeaver, .NET Npgsql, node-postgres...)
psql "host=pg.us-east-1-aws.motherduck.com port=5432 dbname=my_db user=token password=eyJ..."

-- Then run DuckDB SQL over the Postgres protocol
-- (a file path must be a quoted string, otherwise it parses as schema.table)
SELECT region, SUM(revenue)
FROM 'sales_2026.parquet'
GROUP BY region;
```

#### MotherDuck Dual Execution Flow

```
Client App
    ↓ SQL Query
MotherDuck Cloud Engine
    ↓ Heavy compute (joins, aggregations over TBs of data)
    ↓ Stream results
Local DuckDB (or DuckDB-WASM)
    ↓ Subsequent filtering, pivoting, sorting
    ↓ Zero network roundtrip
User sees results (interactive)
```
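
The flow above boils down to "aggregate remotely once, then refine locally". The sketch below imitates that split in plain Python. `DualExecutionSession` and `fake_cloud_aggregate` are hypothetical names, the data is made up, and none of this reflects how MotherDuck actually plans queries.

```python
class DualExecutionSession:
    """Toy model: one remote aggregation, then local follow-up queries."""

    def __init__(self, remote_aggregate):
        self._remote_aggregate = remote_aggregate
        self._cache = None
        self.remote_calls = 0

    def load(self):
        """Run the heavy step remotely and keep the small result locally."""
        self.remote_calls += 1
        self._cache = self._remote_aggregate()
        return self._cache

    def top_regions(self, min_revenue):
        """Filter and sort on the local copy; no network roundtrip."""
        if self._cache is None:
            self.load()
        rows = [r for r in self._cache if r["revenue"] >= min_revenue]
        return sorted(rows, key=lambda r: r["revenue"], reverse=True)


def fake_cloud_aggregate():
    """Stand-in for the cloud engine: returns a small pre-aggregated result."""
    return [
        {"region": "APAC", "revenue": 420.0},
        {"region": "EMEA", "revenue": 310.0},
        {"region": "AMER", "revenue": 560.0},
    ]


session = DualExecutionSession(fake_cloud_aggregate)
print([r["region"] for r in session.top_regions(400)])  # ['AMER', 'APAC']
print([r["region"] for r in session.top_regions(0)])    # ['AMER', 'APAC', 'EMEA']
print(session.remote_calls)                              # 1
```

Two interactive refinements cost a single remote call, which is the property the flow diagram is describing.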

## DuckDB vs ClickHouse: When to Use Which

DuckDB and ClickHouse are both excellent OLAP databases, but they solve different problems. Understanding the differences helps you pick the right tool for each use case.

| Criterion | DuckDB | ClickHouse |
| --- | --- | --- |
| Deployment model | Embedded (in-process, zero config) | Client-server (must be installed and operated) |
| Scaling | Vertical: single node | Horizontal: shared-nothing cluster |
| Optimal data size | GB to a few hundred GB | TB to PB |
| Concurrent users | 1-5 (single analyst / pipeline) | Hundreds (multi-tenant dashboards) |
| Ingestion pattern | Batch (reads Parquet and CSV directly) | Real-time streaming + batch |
| Ops complexity | Zero: ships with the app | Medium to high (replication, sharding) |
| WASM support | Full (runs in the browser) | No |
| SQL dialect | PostgreSQL-style | ClickHouse SQL (close to ANSI) |
| Ecosystem | Python-first, data science friendly | Backend-first, infra-oriented |
| Cost | Free, open source (MIT) | Free (Apache 2.0) or ClickHouse Cloud |

```
graph TD
    Q{Your problem?}
    Q -->|"Data < 500GB<br/>1-5 analysts"| D[DuckDB]
    Q -->|"Data > 1TB<br/>Real-time ingestion"| CH[ClickHouse]
    Q -->|"Client-side analytics<br/>Browser dashboards"| DW[DuckDB-WASM]
    Q -->|"Multi-tenant SaaS<br/>100+ concurrent users"| CHC[ClickHouse Cloud]
    Q -->|"Dev/staging<br/>Data exploration"| DD[DuckDB + MotherDuck]
    Q -->|"Production analytics<br/>Sub-second dashboards"| CHP[ClickHouse Production]

    style Q fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style CH fill:#2c3e50,stroke:#fff,color:#fff
    style DW fill:#4CAF50,stroke:#fff,color:#fff
    style CHC fill:#2c3e50,stroke:#fff,color:#fff
    style DD fill:#4CAF50,stroke:#fff,color:#fff
    style CHP fill:#2c3e50,stroke:#fff,color:#fff
```

Decision tree: DuckDB for single-node analytics, ClickHouse for distributed production workloads.
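
The numeric branches of the decision tree can be written down as a small function. This is only a sketch of the thresholds quoted above. The function name, the parameters, and the fallback for cases the tree does not cover are assumptions of this example, and the two qualitative branches (dev/staging and production dashboards) are left out.

```python
def recommend_engine(data_gb, concurrent_users,
                     realtime_ingestion=False, in_browser=False):
    """Map workload traits to the article's suggested engine."""
    if in_browser:
        return "DuckDB-WASM"
    if concurrent_users >= 100:
        return "ClickHouse Cloud"
    if realtime_ingestion or data_gb > 1000:
        return "ClickHouse"
    if data_gb < 500 and concurrent_users <= 5:
        return "DuckDB"
    # The tree does not cover this middle ground; this answer is an assumption.
    return "No clear winner: benchmark both"


print(recommend_engine(data_gb=200, concurrent_users=3))     # DuckDB
print(recommend_engine(data_gb=5000, concurrent_users=20))   # ClickHouse
print(recommend_engine(data_gb=50, concurrent_users=500))    # ClickHouse Cloud
print(recommend_engine(data_gb=1, concurrent_users=1, in_browser=True))  # DuckDB-WASM
```

The order of the checks encodes a priority (browser first, then concurrency, then volume) that the diagram leaves implicit, so adjust it to your own constraints.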

#### DuckDB + ClickHouse: The Best of Both

Many teams use both: DuckDB for development, data exploration, and CI/CD testing, and ClickHouse for production serving. DuckDB can read Parquet files exported by ClickHouse, and vice versa. A typical workflow: develop queries on local DuckDB, validate on staging, then deploy to production ClickHouse.

## Integrating DuckDB in Practice

### Python: Data Science Workflow

```
import duckdb

# In-memory connection
con = duckdb.connect()

# Query Parquet directly on S3, with no download step
# (needs the httpfs extension and S3 credentials to be configured)
df = con.sql("""
    SELECT
        product_category,
        DATE_TRUNC('month', order_date) AS month,
        SUM(revenue) AS monthly_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM 's3://my-bucket/orders/year=2026/*.parquet'
    WHERE region = 'APAC'
    GROUP BY ALL
    ORDER BY month, monthly_revenue DESC
""").df()  # Returns a pandas DataFrame

# Or query the DataFrame by its variable name and convert to Polars
# (requires the polars package)
pl_df = con.sql("SELECT * FROM df WHERE monthly_revenue > 100000").pl()
```

### .NET: Embedded Analytics in ASP.NET

```
// NuGet: DuckDB.NET.Data
using DuckDB.NET.Data;

// Open a database file (or ":memory:" for in-memory)
using var connection = new DuckDBConnection("Data Source=analytics.duckdb");
connection.Open();

using var command = connection.CreateCommand();
command.CommandText = @"
    SELECT
        region,
        COUNT(*) as total_orders,
        ROUND(AVG(amount), 2) as avg_amount
    FROM read_parquet('data/orders_2026.parquet')
    GROUP BY region
    HAVING total_orders > 1000
    ORDER BY avg_amount DESC";

// Stream the aggregated rows and print one line per region
using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader["region"]}: {reader["total_orders"]} orders, avg ${reader["avg_amount"]}");
}
```

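### Vue.js: Client-side Analytics Dashboard

```
<script setup>
import { ref, onMounted } from 'vue'
import * as duckdb from '@duckdb/duckdb-wasm'

const data = ref([])
const loading = ref(true)

onMounted(async () => {
  const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles()
  const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES)

  // jsDelivr is cross-origin, so start the worker via a same-origin Blob URL
  const workerUrl = URL.createObjectURL(
    new Blob([`importScripts("${bundle.mainWorker}");`], { type: 'text/javascript' })
  )
  const worker = new Worker(workerUrl)
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker)
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker)
  URL.revokeObjectURL(workerUrl)

  const conn = await db.connect()

  // Register the server export under a local name; DuckDB-WASM fetches the
  // bytes it needs over HTTP when a query reads the file
  await db.registerFileURL(
    'sales.parquet',
    new URL('/api/exports/sales_2026.parquet', window.location.origin).href,
    duckdb.DuckDBDataProtocol.HTTP, false
  )

  // The query executes in the browser; no backend query API is involved
  const result = await conn.query(`
    SELECT month, SUM(revenue) as total
    FROM 'sales.parquet'
    GROUP BY month ORDER BY month
  `)

  // Convert Arrow rows to plain objects for the template
  data.value = result.toArray().map((row) => row.toJSON())
  loading.value = false
})
</script>
```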

## Extensions: Extending DuckDB Without Limits

DuckDB has a flexible extension system that can add data types, functions, file formats, and even new SQL syntax. Extensions are loaded dynamically when needed:

```
-- Install and load extensions
INSTALL httpfs;   -- Read files over HTTP/S3
LOAD httpfs;

INSTALL iceberg;  -- Read Apache Iceberg tables
LOAD iceberg;

INSTALL spatial;  -- Geospatial functions (ST_Distance, ST_Within...)
LOAD spatial;

-- Query an Iceberg table directly
SELECT * FROM iceberg_scan('s3://warehouse/orders')
WHERE order_date >= '2026-01-01';
```

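| Extension | Purpose | WASM Support |
| --- | --- | --- |
| httpfs | Read files over HTTP(S) and S3-compatible storage such as S3 and GCS (Azure Blob is handled by the separate azure extension) | Yes |
| iceberg | Read/write Apache Iceberg tables | Yes (new in 2026) |
| parquet | Read/write Parquet (built-in) | Yes |
| json | Read/write JSON/NDJSON | Yes |
| spatial | Geospatial (PostGIS-like) | Yes |
| lance | Vector search and full-text search for AI/ML | In development |
| postgres (formerly postgres_scanner) | Query PostgreSQL directly from DuckDB | No |
| mysql (formerly mysql_scanner) | Query MySQL directly from DuckDB | No |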

#### Lance Extension: DuckDB for AI/ML Workloads

The newest extension, `lance`, reads and writes Lance datasets (a columnar format optimized for ML) and offers vector search (`lance_vector_search()`), full-text search (`lance_fts()`), and hybrid search (`lance_hybrid_search()`). By combining DuckDB with Lance, you can build a complete RAG pipeline without a separate vector database.

## Benchmark: DuckDB on Real Hardware

Benchmarks run on an entry-level MacBook (Apple Silicon, 8GB RAM) give impressive results:

- **<1 s**: ClickBench median (100M rows, 5GB)
- **1.63 s**: TPC-DS SF100 median query time
- **79 minutes**: TPC-DS SF300 (with disk spill)
- **238 ms**: 5M-row query (Jupyter kernel)

Notably, DuckDB can process datasets larger than RAM thanks to **disk spill**: it automatically writes temporary data to disk when memory runs short and reads it back when needed. TPC-DS SF300 (roughly 300GB of data) completing on a machine with only 8GB of RAM is clear evidence of this.
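
The idea behind disk spill can be shown with a classic external merge sort: sort what fits in memory, write each sorted run to a temporary file, then merge the runs while streaming. This is a teaching sketch using only the Python standard library. DuckDB's own spilling happens inside its operators and buffer manager and is far more sophisticated, and the `max_in_memory` parameter is an assumption of this example.

```python
import heapq
import tempfile


def external_sort(values, max_in_memory=1000):
    """Sort integers while holding at most max_in_memory of them in RAM."""
    runs = []    # temporary files, each holding one sorted run
    buffer = []  # the only in-memory working set

    def spill():
        buffer.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in buffer)
        run.seek(0)  # rewind so the merge step can read it back
        runs.append(run)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()  # memory budget reached: write a sorted run to disk
    if buffer:
        spill()

    try:
        # Merge the runs lazily, reading one line per run at a time.
        streams = [(int(line) for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


print(list(external_sort([5, 3, 9, 1, 7], max_in_memory=2)))  # [1, 3, 5, 7, 9]
```

With a budget of two values, the five inputs are written as three sorted runs and merged back in order. Only the budget, not the input size, bounds the memory in use.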

## When Not to Use DuckDB

DuckDB is not a silver bullet. There are scenarios where it is not the best choice:

- **OLTP workloads:** DuckDB is optimized for read-heavy, write-light use. If you need to INSERT/UPDATE millions of rows per second, use PostgreSQL, MySQL, or SQL Server
- **High-concurrency serving:** With hundreds of users querying at the same time, ClickHouse or Druid is a better fit
- **Real-time streaming ingestion:** DuckDB has no built-in streaming engine. You need a pipeline from Kafka into ClickHouse or Flink
- **Multi-TB datasets that need distributed processing:** When data outgrows a single node, use a ClickHouse cluster or Spark + Iceberg
- **Multi-writer concurrency:** DuckDB uses a single-writer model, so only one process can write to a database at a time

## Roadmap: Where DuckDB Is Heading

- **2018: The beginning.** Born at CWI Amsterdam (the birthplace of MonetDB). The goal: build a "SQLite for analytics".
- **2022: DuckDB-WASM.** Compiled to WebAssembly and running in the browser. Demo paper at VLDB.
- **2024: v1.0 stable.** The first stable release. MotherDuck launches its serverless cloud platform.
- **2026 Q1: DuckLake 1.0.** Production-ready lakehouse format. The Iceberg extension gains WASM support. Lance extension for vector search.
- **2026 Q2: Today (v1.5.2).** PostgreSQL wire protocol (MotherDuck). The community extensions ecosystem is booming. An O'Reilly DuckLake book is being written.

## Conclusion

DuckDB represents a major shift in how developers approach data analytics. Instead of setting up a ClickHouse cluster or waiting for a BigQuery query slot, you can run complex analytical queries on a laptop with impressive performance: sub-second on 100 million rows on entry-level hardware.

With DuckLake 1.0, DuckDB is no longer just an embedded database but a lightweight lakehouse platform. With DuckDB-WASM, analytics runs directly in the browser, opening up entirely new use cases for frontend developers.

Practical advice: use DuckDB for development, exploration, and embedded analytics. When you need to scale to terabytes of data with hundreds of concurrent users, ClickHouse remains the production workhorse. And don't forget: DuckDB is free, open source (MIT license), and all it takes to get started is `pip install duckdb`.


#DuckDB #OLAP #Database #Data Engineering #ClickHouse #WebAssembly #Analytics #system design

**References:**

- [Why DuckDB — DuckDB Official Documentation](https://duckdb.org/why_duckdb)
- [DuckDB Ecosystem Newsletter April 2026 — MotherDuck Blog](https://motherduck.com/blog/duckdb-ecosystem-newsletter-april-2026/)
- [ClickHouse vs DuckDB: Choosing the Right OLAP Database — CloudRaft](https://www.cloudraft.io/blog/clickhouse-vs-duckdb)
- [DuckDB WASM: Analytical SQL Database in Your Browser — MotherDuck](https://motherduck.com/blog/duckdb-wasm-in-browser/)
- [Iceberg in the Browser — DuckDB Engineering Blog](https://duckdb.org/2025/12/16/iceberg-in-the-browser)
- [OLAP Databases: What's New and Best in 2026 — Tinybird](https://www.tinybird.co/blog/best-database-for-olap)


Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.