# Disaggregated LLM Serving 2026: A Prefill/Decode-Separated Architecture with NVIDIA Dynamo, Mooncake, DistServe, NIXL, a Redis KV Cache Store, and ClickHouse

Posted on: 4/15/2026 9:30:30 AM


## 1. From Monolithic to Disaggregated Serving: The 2026 Revolution

In the three years since vLLM, TensorRT-LLM, and SGLang popularized techniques such as **PagedAttention**, **Continuous Batching**, and **Chunked Prefill**, LLM serving architectures have held on to one implicit assumption: *a request lives its whole life on a single GPU (or a single tensor-parallel group)*. 2026 marks the turning point. That assumption is being broken by the wave of **Disaggregated Serving**, which places the Prefill and Decode phases on separate GPU clusters connected by a high-speed KV cache transport.

With frontier models such as Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick advertising context windows from hundreds of thousands up to millions of tokens, the problem is no longer "run one model on one GPU" but "move hundreds of GB of KV cache over InfiniBand/NVLink to maximize throughput without breaking latency SLOs". This is where Disaggregated Serving becomes the default choice for large production Multi-Agent systems.

- **3.8x** throughput gain (Mooncake vs. vLLM)
- **62%** TTFT reduction (NVIDIA Dynamo)
- **1.8 TB/s** NVLink 5 bandwidth per GPU (B200)
- **4.1x** goodput per dollar (DistServe)

#### What is disaggregation?

In traditional serving, a request runs Prefill (processing the whole prompt) and Decode (generating one token at a time) on the same GPU. The two phases have opposite resource profiles: Prefill is **compute-bound** (it needs high FLOPS), while Decode is **memory-bound** (it needs high HBM bandwidth). Mixing them on one GPU causes *interference*: decode tokens are delayed whenever a long prompt enters Prefill. Disaggregation moves the two phases onto two independent GPU clusters, each tuned for its own phase.

## 2. Anatomy of the Prefill and Decode Phases

To see why the split pays off, it helps to look at the computational nature of each phase:

```
graph LR
    A["Input Prompt 8K tokens"] --> B["Prefill Phase"]
    B --> C["KV Cache 8K"]
    C --> D["Decode Phase"]
    D --> E["Token 1"]
    E --> F["Token 2"]
    F --> G["... Token N"]
    style B fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style C fill:#0f3460,stroke:#fff,color:#fff
```

Figure 1: Prefill builds the KV cache once; Decode reuses it to generate tokens sequentially.

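### 2.1. Prefill: compute-bound, processes the whole prompt in parallel

In the Prefill phase, the model processes **all N prompt tokens at once** through large (typically tensor-parallel) matmuls. The attention score matrix is N×N. Each D×D projection costs about 2·N·D² FLOPs (D is the hidden dimension), so the linear layers add up to roughly 2·N·P FLOPs for a model with P parameters, plus an attention term that grows as N²·D per layer. For a 32K-token prompt on Llama 3 70B, one Prefill pass therefore costs on the order of several PFLOP (about 4.6 PFLOP in the linear layers alone). An H100 runs at 65-85% SM utilization here, close to its compute ceiling. Key metric: **TTFT (Time To First Token)**.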
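To make the estimate concrete, here is a minimal sketch that applies the usual rule of thumb: 2 FLOPs per parameter per token for the linear layers, plus a causal attention term. The Llama 3 70B shape (80 layers, hidden size 8192, about 70e9 parameters) is an assumption taken from the published model config, and `prefill_flops` is a hypothetical helper, not part of any library.

```python
# Back-of-the-envelope prefill cost for a dense decoder-only transformer.

def prefill_flops(n_tokens: int, n_params: float, n_layers: int, d_model: int) -> dict:
    # Every weight takes part in one multiply-add per token: 2 FLOPs per parameter.
    linear = 2.0 * n_params * n_tokens
    # QK^T plus the attention-weighted sum of V: 4 * N^2 * D per layer,
    # halved because the causal mask skips the upper triangle.
    attention = 0.5 * 4.0 * n_tokens**2 * d_model * n_layers
    return {"linear": linear, "attention": attention, "total": linear + attention}


# Assumed shape for Llama 3 70B with a 32K-token prompt.
cost = prefill_flops(n_tokens=32_768, n_params=70e9, n_layers=80, d_model=8192)
for name, flops in cost.items():
    print(f"{name:>9}: {flops / 1e15:.2f} PFLOP")

# Lower bound on prefill time for one H100 at its dense FP16 peak (989 TFLOPS),
# ignoring every overhead.
print(f"ideal time on one H100: {cost['total'] / 989e12:.1f} s")
```

Under these assumptions the pass costs about 6 PFLOP, so even at peak throughput a single H100 would need about 6 s, which is one reason long prompts are sharded across a tensor-parallel group.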

### 2.2. Decode: memory-bound, generates one token at a time

Once the KV cache exists, Decode generates **one token per step**. Every step has to read the model weights and the entire KV cache from HBM (tens of GB at long context). Arithmetic intensity is extremely low: about 1-2 FLOPs per byte, against a roofline ridge of roughly 300 FLOPs per byte on an H100 (989 TFLOPS over 3.35 TB/s). As a result, SM utilization falls to 5-15% and HBM bandwidth is the bottleneck. Key metrics: **TPOT (Time Per Output Token)** and **ITL (Inter-Token Latency)**.
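The roofline argument can be checked with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: it treats the device as a single H100, counts 2 FLOPs per parameter per generated token, ignores attention FLOPs, and assumes FP8 weights and an FP8 KV cache of 160 KiB per token (80 layers, 8 KV heads, head dimension 128). `ridge_point` and `decode_step` are hypothetical helper names.

```python
H100_PEAK_FLOPS = 989e12    # dense FP16 peak, as listed in this article
H100_HBM_BYTES_S = 3.35e12  # HBM3 bandwidth


def ridge_point(peak_flops: float, bytes_per_s: float) -> float:
    """FLOPs per byte a kernel needs before it stops being memory-bound."""
    return peak_flops / bytes_per_s


def decode_step(batch: int, n_params: float, weight_bytes: float,
                kv_bytes_per_seq: float, bytes_per_s: float) -> dict:
    # One decode step reads every weight once plus each sequence's KV cache.
    bytes_read = weight_bytes + batch * kv_bytes_per_seq
    # About 2 FLOPs per parameter per generated token.
    flops = 2.0 * n_params * batch
    return {
        "intensity": flops / bytes_read,
        "min_step_ms": bytes_read / bytes_per_s * 1e3,
    }


# Assumed workload: 70e9 parameters at 1 byte each, 8K-token contexts.
kv_per_seq = 8192 * 160 * 1024
ridge = ridge_point(H100_PEAK_FLOPS, H100_HBM_BYTES_S)
print(f"ridge point: {ridge:.0f} FLOPs/byte")
for batch in (1, 16, 64):
    step = decode_step(batch, 70e9, 70e9, kv_per_seq, H100_HBM_BYTES_S)
    print(f"batch={batch:>2}  intensity={step['intensity']:.1f} FLOPs/byte  "
          f"min step={step['min_step_ms']:.1f} ms")
```

Even at batch 64 the intensity in this model stays well below the ridge point, so the step time is set by bytes read, not by FLOPS.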

#### Interference: the core problem of colocated serving

When vLLM or SGLang run *continuous batching*, prefill and decode share the same batch in each forward pass. When a new request with a 64K-token prompt arrives, the whole decode batch stalls for roughly 800 ms to 2 s while that prefill finishes. In a Multi-Agent system with mixed request types (agent planning with long context, tool calling with short prompts), interference can blow P99 tail latency up from 1 s to 15 s.
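A toy model shows where the stall comes from. This is a deliberately simplified sketch, not a scheduler: it assumes a fixed 25 ms decode step and 20 ms of prefill work per 1K prompt tokens (both numbers are illustrative assumptions), and it charges the whole prefill to the forward pass in which the prompt arrives.

```python
def inter_token_latencies(n_steps: int, decode_ms: float,
                          prefill_arrivals: dict[int, int],
                          prefill_ms_per_1k_tokens: float) -> list[float]:
    """Per-step latency seen by a decode batch.

    prefill_arrivals maps a step index to the prompt length (in tokens)
    that is prefilled inside that same forward pass.
    """
    latencies = []
    for step in range(n_steps):
        prompt_tokens = prefill_arrivals.get(step, 0)
        stall_ms = prompt_tokens / 1000 * prefill_ms_per_1k_tokens
        latencies.append(decode_ms + stall_ms)
    return latencies


# Colocated: a 64K-token prompt lands in step 4 of the shared batch.
colocated = inter_token_latencies(10, 25.0, {4: 64_000}, 20.0)
# Disaggregated: the decode worker never sees prefill work.
disaggregated = inter_token_latencies(10, 25.0, {}, 20.0)

print("colocated     max ITL:", max(colocated), "ms")
print("disaggregated max ITL:", max(disaggregated), "ms")
```

With these assumed costs, one long prompt turns a 25 ms inter-token gap into about 1.3 s for every request in the batch, while the disaggregated decode worker stays flat.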

## 3. Disaggregated Serving Architecture

A typical design splits the system into four layers: **Router**, **Prefill Cluster**, **Decode Cluster**, and **KV Cache Transport**. Each layer has its own hardware profile and scaling strategy.

```
graph TB
    C["Client Multi-Agent"] --> R["Router Dynamo Frontend"]
    R -->|"1 Dispatch prompt"| P["Prefill Cluster H100 B200"]
    P -->|"2 Produce KV cache"| K["KV Transport NIXL UCX NVLink"]
    K -->|"3 Migrate KV"| D["Decode Cluster H200 MI300X"]
    D -->|"4 Stream tokens"| R
    R -->|"5 SSE WebSocket"| C
    KV["KV Store Redis Mooncake Store"] <--> K
    M["Metrics OTLP"] --> CH["ClickHouse"]
    P --> M
    D --> M
    R --> M
    style P fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style K fill:#0f3460,stroke:#fff,color:#fff
    style KV fill:#9b59b6,stroke:#fff,color:#fff
    style CH fill:#f39c12,stroke:#fff,color:#fff
```

Figure 2: Disaggregated Serving architecture with Router, Prefill, Decode, KV Transport, and Observability.

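### 3.1. Prefill Cluster: optimized for compute

The prefill cluster uses GPUs with **high FLOPS**; HBM capacity can be more modest because the KV cache is shipped out right after it is produced. Common configurations in 2026:

- **NVIDIA H100 80GB**: 989 TFLOPS FP16, 3.35 TB/s HBM3. The best price/performance option.
- **NVIDIA B200 192GB**: 2.25 PFLOPS FP16, 8 TB/s HBM3e. Suited to prefill of very long prompts (1M+ tokens).
- **AMD MI300X 192GB**: 1.3 PFLOPS FP16, 5.3 TB/s HBM3. Being deployed widely by Meta and Microsoft.

Prefill batching usually runs in **chunked prefill** mode: a long prompt is split into chunks of 2K-4K tokens so that no single forward pass monopolizes the GPU. On a dedicated prefill cluster this bounds per-step latency and activation memory and lets short prompts slot in between the chunks of long ones; on colocated engines the chunks are interleaved with the decode batch to keep the GPU evenly busy. Dynamo and Mooncake both support chunked prefill with a dynamic chunk size.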

### 3.2. Decode Cluster: optimized for memory bandwidth

The decode cluster prioritizes **large HBM** and **high bandwidth** to hold and read the KV cache:

- **NVIDIA H200 141GB**: 4.8 TB/s HBM3e, 43% more bandwidth than the H100.
- **NVIDIA GB200 NVL72**: 72 GPUs connected over NVLink 5 (1.8 TB/s per GPU); the whole NVL72 behaves like one giant GPU with 13.5 TB of HBM.

With large HBM, one decode node can hold the KV cache of hundreds of concurrent requests. Decode clusters run small tensor parallelism (TP=1 or TP=2) and **expert parallelism** for MoE models such as Mixtral 8x22B or DeepSeek V3.

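### 3.3. KV Cache Transport: the backbone of the architecture

This is the hardest component. For a 32K-token prompt on Llama 3 70B (80 layers, 8 KV heads, head dimension 128), the KV cache takes about 320 KiB per token at FP16, so roughly **5-10 GB** per request (about 10 GB at FP16, about 5 GB at FP8). The transfer has to finish before decoding can start, so it sits on the critical path with a budget of <50 ms unless it is overlapped with prefill layer by layer. Candidate transports: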

| Transport | Bandwidth | Latency | Use case |
| --- | --- | --- | --- |
| **NVLink 5 (GB200)** | 1800 GB/s per GPU | <1 μs | Intra-rack, Prefill and Decode in the same rack |
| **InfiniBand NDR 400G** | 400 Gbps (50 GB/s) | 1-2 μs | Cross-rack, multi-DC clusters |
| **RoCE v2 200G** | 200 Gbps (25 GB/s) | 5-10 μs | Ethernet fabric, cheaper than IB |
| **TCP (fallback)** | 10-25 Gbps | 100-500 μs | Dev/staging, not production |
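The size of the KV cache and the time each link needs to move it follow directly from the model shape. The sketch below is an estimate under assumptions: the Llama 3 70B shape (80 layers, 8 KV heads with GQA, head dimension 128) comes from the published config, link bandwidths are the nominal figures from the table with TCP taken at 25 Gbps, and protocol overhead is ignored. The function names are hypothetical.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # The factor 2 covers the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


# Nominal bandwidth in GB/s, copied from the table.
LINKS_GB_S = {
    "NVLink 5": 1800.0,
    "InfiniBand NDR 400G": 50.0,
    "RoCE v2 200G": 25.0,
    "TCP 25G": 3.125,
}


def transfer_ms(n_bytes: int, gb_per_s: float) -> float:
    return n_bytes / (gb_per_s * 1e9) * 1e3


for label, elem_bytes in (("FP16", 2), ("FP8", 1)):
    size = kv_cache_bytes(32_768, 80, 8, 128, elem_bytes)
    print(f"{label}: {size / 1e9:.2f} GB")
    for link, bandwidth in LINKS_GB_S.items():
        print(f"  {link:<20} {transfer_ms(size, bandwidth):8.1f} ms")
```

Under these assumptions only NVLink moves the full FP16 cache inside a 50 ms budget; a single 400G InfiniBand link needs about 215 ms, which is one reason implementations stripe transfers over several NICs or overlap them with prefill.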

#### NIXL: a unified transport library

NVIDIA Inference Xfer Library (NIXL) is the abstraction layer that Dynamo, vLLM, SGLang, and TensorRT-LLM have all adopted as of early 2026. NIXL hides the details of NVLink/IB/RoCE behind one API and supports GPUDirect RDMA, moving data straight from source HBM to destination HBM without passing through the CPU. On GB200 NVL72, NIXL reaches 1.2 TB/s for a single KV transfer stream.

## 4. NVIDIA Dynamo: the 2026 Reference Implementation

NVIDIA Dynamo is an open-source framework that NVIDIA announced at GTC in March 2025 and developed heavily through 2026, becoming the de facto standard for disaggregated serving on NVIDIA hardware. Dynamo is not an inference engine: it is an orchestration layer on top of vLLM, TensorRT-LLM, and SGLang.

```
graph TB
    CL["Client OpenAI API"] --> FE["Dynamo Frontend HTTP"]
    FE --> PL["Planner autoscaling"]
    FE --> RT["Smart Router KV-aware"]
    RT --> PW["Prefill Worker vLLM SGLang"]
    RT --> DW["Decode Worker vLLM SGLang"]
    PW -->|"NIXL"| DW
    KVM["KV Block Manager"] -.-> PW
    KVM -.-> DW
    KVS["KV Cache Store Redis Mooncake"] <-.-> KVM
    PL --> PW
    PL --> DW
    style FE fill:#e94560,stroke:#fff,color:#fff
    style RT fill:#0f3460,stroke:#fff,color:#fff
    style PW fill:#4CAF50,stroke:#fff,color:#fff
    style DW fill:#4CAF50,stroke:#fff,color:#fff
    style KVS fill:#9b59b6,stroke:#fff,color:#fff
```

Figure 3: NVIDIA Dynamo architecture with Frontend, Planner, Smart Router, and KV Block Manager.

### 4.1. Smart Router: routing on KV cache locality

The "smartest" part of Dynamo is the Smart Router. Instead of round-robin, the router tracks hashes of prompt prefixes and sends each request to a worker that already holds the KV cache for the matching prefix. In Multi-Agent systems that share a long system prompt (5K-10K tokens), cache hit rates reach 85-95%, saving on the order of a billion prefill tokens per day.

```yaml
# Abridged Dynamo deployment manifest (field names as given in this article;
# verify them against the CRD shipped with your Dynamo release)
apiVersion: dynamo.nvidia.com/v1
kind: DynamoDeployment
metadata:
  name: llama3-70b-disagg
spec:
  model:
    name: meta-llama/Llama-3.3-70B-Instruct
    quantization: fp8
  prefill:
    replicas: 4
    engine: vllm
    tensorParallelSize: 4
    gpuType: H100
    chunkedPrefill:
      enabled: true
      chunkSize: 2048
  decode:
    replicas: 8
    engine: vllm
    tensorParallelSize: 2
    gpuType: H200
    maxBatchSize: 256
  kvTransport:
    backend: nixl
    protocol: infiniband
  router:
    strategy: kv-aware
    hashPrefixTokens: 64
  planner:
    autoscale: true
    targetTTFT: 300ms
    targetTPOT: 20ms
```
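The idea behind KV-aware routing can be sketched in a few dozen lines. This is an illustration of the technique, not Dynamo's implementation: `prefix_hashes` and `KvAwareRouter` are hypothetical names, the 64-token block mirrors `hashPrefixTokens` in the manifest, and the router assumes a worker keeps every block it has ever served (no eviction).

```python
import hashlib

BLOCK_TOKENS = 64  # mirrors hashPrefixTokens in the manifest


def prefix_hashes(tokens: list[int], block: int = BLOCK_TOKENS) -> list[str]:
    """Chained hashes: entry i identifies tokens[0:(i + 1) * block]."""
    running = hashlib.sha256()
    hashes = []
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for start in range(0, full, block):
        for tok in tokens[start:start + block]:
            running.update(tok.to_bytes(4, "little"))
        hashes.append(running.copy().hexdigest())
    return hashes


class KvAwareRouter:
    def __init__(self, workers: list[str]):
        self.cached = {w: set() for w in workers}  # block hashes held per worker
        self.load = {w: 0 for w in workers}        # requests routed so far

    def _matched_blocks(self, worker: str, hashes: list[str]) -> int:
        count = 0
        for h in hashes:
            if h not in self.cached[worker]:
                break
            count += 1
        return count

    def route(self, tokens: list[int]) -> tuple[str, int]:
        hashes = prefix_hashes(tokens)
        # Longest cached prefix wins; ties go to the least loaded worker.
        best = max(self.cached,
                   key=lambda w: (self._matched_blocks(w, hashes), -self.load[w]))
        hit_blocks = self._matched_blocks(best, hashes)
        self.cached[best].update(hashes)
        self.load[best] += 1
        return best, hit_blocks


router = KvAwareRouter(["prefill-0", "prefill-1"])
system_prompt = list(range(128))  # two full blocks shared by every agent
print(router.route(system_prompt + [1000] * 64))  # cold: no cached blocks
print(router.route(system_prompt + [2000] * 64))  # reuses the shared prefix
```

Chaining the hash means a block only matches when every block before it matches too, which is what makes the lookup a prefix match and not a bag-of-blocks match.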

### 4.2. Planner: SLO-driven autoscaling

Dynamo's Planner uses an algorithm called GPU Planner (its name in the source code) to decide the number of prefill and decode replicas from the SLO targets. When the ratio of prefill tokens to decode tokens in the workload rises (for example, many summarization requests with long prompts and short outputs), the Planner scales up the prefill cluster. Conversely, when the workload is conversational chat (short prompts, long outputs), it scales up the decode cluster. The default re-evaluation interval is 30 s.
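A simplified version of such a planning step might look like the sketch below. It is not the Planner's actual algorithm: `LoadWindow`, `plan_replicas`, the per-replica capacities, and the 80% headroom are all assumptions chosen for illustration. Each pool is sized from its own token rate, then bumped by one replica if its SLO was missed in the last window.

```python
import math
from dataclasses import dataclass


@dataclass
class LoadWindow:
    prefill_tokens_per_s: float  # prompt tokens arriving per second
    decode_tokens_per_s: float   # output tokens generated per second
    ttft_p95_ms: float
    tpot_p95_ms: float


def plan_replicas(window: LoadWindow, prefill_cap: float, decode_cap: float,
                  cur_prefill: int, cur_decode: int,
                  target_ttft_ms: float = 300.0, target_tpot_ms: float = 20.0,
                  headroom: float = 0.8) -> tuple[int, int]:
    # Size each pool so that it runs at `headroom` of its measured capacity.
    want_prefill = math.ceil(window.prefill_tokens_per_s / (prefill_cap * headroom))
    want_decode = math.ceil(window.decode_tokens_per_s / (decode_cap * headroom))
    # If an SLO was violated, never shrink that pool: add at least one replica.
    if window.ttft_p95_ms > target_ttft_ms:
        want_prefill = max(want_prefill, cur_prefill + 1)
    if window.tpot_p95_ms > target_tpot_ms:
        want_decode = max(want_decode, cur_decode + 1)
    return max(want_prefill, 1), max(want_decode, 1)


# Summarization-heavy window: long prompts, short outputs.
summarize = LoadWindow(400_000, 5_000, ttft_p95_ms=450, tpot_p95_ms=15)
# Chat-heavy window: short prompts, long outputs.
chat = LoadWindow(40_000, 30_000, ttft_p95_ms=120, tpot_p95_ms=35)

for name, win in (("summarize", summarize), ("chat", chat)):
    print(name, plan_replicas(win, prefill_cap=60_000, decode_cap=2_000,
                              cur_prefill=4, cur_decode=8))
```

The targets default to the `targetTTFT` and `targetTPOT` values from the manifest; in a real deployment the capacities would come from profiling, not constants.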

## 5. Mooncake: Moonshot AI's KV-centric Architecture

Mooncake is the serving architecture that Moonshot AI (the company behind Kimi) announced in late 2024 and then fully open-sourced in 2025. It follows the principle that "the KV cache is a first-class citizen": every routing and scheduling decision revolves around optimizing the KV cache lifecycle.

```
graph LR
    R["Conductor Router"] --> PI["Prefill Instance"]
    R --> DI["Decode Instance"]
    PI -->|"Write KV"| MS["Mooncake Store Tiered Cache"]
    DI -->|"Read KV"| MS
    MS --> HBM["Tier 1 HBM pool"]
    MS --> DRAM["Tier 2 DRAM pool"]
    MS --> NVME["Tier 3 NVMe SSD"]
    MS --> S3["Tier 4 S3 cold"]
    style R fill:#e94560,stroke:#fff,color:#fff
    style MS fill:#9b59b6,stroke:#fff,color:#fff
    style PI fill:#4CAF50,stroke:#fff,color:#fff
    style DI fill:#4CAF50,stroke:#fff,color:#fff
```

Figure 4: Mooncake Store. The KV cache is tiered across HBM, DRAM, NVMe, and object storage.

### 5.1. Mooncake Store: a distributed, tiered KV cache

Mooncake Store is a distributed KV store shared by Prefill and Decode instances. KV blocks are stored under a content hash, which enables aggressive deduplication. When many users ask about the same long document, the system prefills it once and reuses the KV cache for every later user. Benchmarks published from Kimi production show a 76% cache hit rate and a 3.8x reduction in compute cost compared with non-disaggregated vLLM.
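The combination of content-addressed keys and tiering can be illustrated with a small in-memory model. This is a sketch of the general technique, not Mooncake Store's code: `block_key` and `TieredBlockStore` are hypothetical, capacities are counted in blocks, and every tier uses plain LRU with demotion to the next slower tier.

```python
import hashlib
from collections import OrderedDict


def block_key(model_id: str, prefix_tokens: list[int]) -> str:
    # Identical (model, prefix) pairs map to the same key, which is what
    # lets two users of the same document share one stored block.
    return hashlib.sha256(repr((model_id, prefix_tokens)).encode()).hexdigest()


class TieredBlockStore:
    def __init__(self, capacities: dict[str, int]):
        # Tiers are ordered fastest first, e.g. {"hbm": 2, "dram": 4}.
        self.order = list(capacities)
        self.capacity = dict(capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers.values():  # drop any stale copy first
            tier.pop(key, None)
        self._insert(0, key, value)

    def get(self, key: str):
        for name in self.order:
            if key in self.tiers[name]:
                value = self.tiers[name].pop(key)
                self._insert(0, key, value)  # promote to the fastest tier on a hit
                return name, value
        return None, None

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level == len(self.order):
            return  # evicted from the slowest tier: the block is gone
        name = self.order[level]
        tier = self.tiers[name]
        tier[key] = value
        if len(tier) > self.capacity[name]:
            victim, victim_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, victim, victim_value)


store = TieredBlockStore({"hbm": 2, "dram": 4})
doc = list(range(32))
store.put(block_key("llama-3.3-70b", doc), b"kv-bytes")
# A second user asking about the same document hits the stored block.
print(store.get(block_key("llama-3.3-70b", doc))[0])
```

A production store would add NVMe and object-storage tiers, size limits in bytes, and concurrency control, all of which are left out here.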

#### Mooncake + Redis 8: tier 2 in production

In many 2026 deployments, Redis 8 serves as the tier 2 DRAM pool for Mooncake Store. The reasons: Redis has RDMA support (RESP3 over RDMA), automatic sharding with Redis Cluster, and persistence through AOF. A 16-node Redis cluster (1 TB of RAM in total) can serve tier 2 for a 64-GPU H200 serving cluster with p99 latency <200 μs. For details, see the article Redis 8 Vector Sets.

### 5.2. Conductor: the global scheduler

Mooncake's Conductor is a cross-cluster global scheduler. When a request arrives, Conductor decides:

- Which prefix of the prompt already exists in Mooncake Store (cache lookup).
- Which Prefill instance receives the rest of the prompt, based on current load.
- Which Decode instance takes over the request, based on KV locality and memory pressure.
- Which eviction policy applies to Mooncake Store when tier 1 is full.

Conductor is written in Go, talks to workers over gRPC, and keeps a small metadata cache in etcd for high availability.
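The second decision in the list, choosing a Prefill instance, trades queueing delay against cache reuse. The sketch below shows one plausible scoring rule, estimated TTFT = queue wait + time to prefill the uncached part of the prompt. The rule, the `PrefillInstance` fields, and the numbers are assumptions for illustration, not Conductor's actual logic.

```python
from dataclasses import dataclass


@dataclass
class PrefillInstance:
    name: str
    queue_ms: float            # estimated wait before this request starts
    tokens_per_ms: float       # prefill speed of the instance
    cached_prefix_tokens: int  # prompt tokens already in its local cache


def estimated_ttft_ms(inst: PrefillInstance, prompt_tokens: int) -> float:
    uncached = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return inst.queue_ms + uncached / inst.tokens_per_ms


def pick_prefill(instances: list[PrefillInstance], prompt_tokens: int) -> PrefillInstance:
    return min(instances, key=lambda inst: estimated_ttft_ms(inst, prompt_tokens))


idle = PrefillInstance("prefill-0", queue_ms=50.0, tokens_per_ms=20.0,
                       cached_prefix_tokens=0)
warm = PrefillInstance("prefill-1", queue_ms=400.0, tokens_per_ms=20.0,
                       cached_prefix_tokens=30_000)

# A 32K-token prompt whose first 30K tokens are cached on the busier instance.
choice = pick_prefill([idle, warm], prompt_tokens=32_000)
print(choice.name, estimated_ttft_ms(choice, 32_000), "ms")
```

Here the busier instance still wins because skipping 30K tokens of prefill saves more time than its queue costs; with a long enough queue the idle instance becomes the better choice.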

## 6. DistServe and Splitwise: the Pioneering Work

Two academic systems laid the groundwork for disaggregation before NVIDIA and Moonshot commercialized it:

### 6.1. DistServe (PKU + UCSD)

DistServe was the first framework to show that the goodput (SLO-aware throughput) of disaggregated serving clearly beats colocated serving. Goodput is defined as the number of requests per second that complete within budget (TTFT < 300 ms, TPOT < 50 ms). On an OPT-175B workload, DistServe reaches 4.48x the goodput of colocated vLLM on the same number of GPUs. The key to this result is a placement algorithm that automatically picks the number of prefill and decode replicas and the parallelism degree for each.
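The difference between throughput and goodput is easy to see on a handful of request records. The snippet below is a minimal sketch using the SLO budget quoted above (TTFT < 300 ms, TPOT < 50 ms); the record format and the function name are assumptions.

```python
def throughput_and_goodput(requests: list[dict], window_s: float,
                           ttft_slo_ms: float = 300.0,
                           tpot_slo_ms: float = 50.0) -> tuple[float, float]:
    """Return (requests/s completed, requests/s completed within both SLOs)."""
    within_slo = [r for r in requests
                  if r["ttft_ms"] < ttft_slo_ms and r["tpot_ms"] < tpot_slo_ms]
    return len(requests) / window_s, len(within_slo) / window_s


# Four requests completed in a 2-second window.
completed = [
    {"ttft_ms": 120, "tpot_ms": 30},  # meets both SLOs
    {"ttft_ms": 280, "tpot_ms": 45},  # meets both SLOs
    {"ttft_ms": 900, "tpot_ms": 20},  # TTFT violated (stuck behind a long prefill)
    {"ttft_ms": 150, "tpot_ms": 70},  # TPOT violated (decode stalled)
]
throughput, goodput = throughput_and_goodput(completed, window_s=2.0)
print(f"throughput={throughput} req/s, goodput={goodput} req/s")
```

Two systems with the same raw throughput can differ by a large factor in goodput, which is the metric DistServe optimizes.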

### 6.2. Splitwise (Microsoft Research)

Splitwise explores a different axis: heterogeneous hardware. It runs prefill on H100 but decode on A100, which is cheaper and still has enough HBM bandwidth for decode. The paper reports up to 1.4x higher throughput at 20% lower cost than a homogeneous design under the same SLOs. Splitwise also introduces phase-aware scheduling for data centers with mixed GPUs.

| Framework | Released | License | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| NVIDIA Dynamo | GTC 2025 | Apache 2.0 | Integrates every engine, NIXL, Planner | Favors NVIDIA hardware |
| Mooncake | 11/2024 | Apache 2.0 | Tiered Mooncake Store, high cache hit rate | Needs a separate store layer |
| DistServe | 2024 | MIT (research) | Placement algorithm published in the paper | No production support |
| Splitwise | 2024 | MIT (research) | Heterogeneous HW, low TCO | Not fully open source yet |
| vLLM Disagg | 2025 | Apache 2.0 | Native in vLLM, uses NIXL | Only one P-D pair, no planner yet |
| SGLang PD | 2025 | Apache 2.0 | Strong radix-tree KV reuse | Router not yet as KV-aware as Dynamo |

## 7. Redis 8 as a KV Cache Store: a Practical Deployment

A common question: "Why use Redis for the LLM KV cache rather than Memcached or a purpose-built store?" Redis 8 has four properties that fit this workload particularly well.

### 7.1. Key schema and hash strategy

Each KV block is identified by a hash of the tuple (model_id, tokenizer_hash, block_prefix_tokens). A typical block size is 16 or 32 tokens, as in vLLM's PagedAttention. A typical Redis entry:

```python
import struct
import time

import redis
import xxhash

r = redis.Redis(host="redis-kv", port=6379, protocol=3)

def kv_block_key(model_id: str, prefix_tokens: list[int]) -> str:
    # Hash the token IDs (packed as uint32) so the key is content-addressed.
    # The tokenizer hash is assumed to be folded into model_id here.
    h = xxhash.xxh128(struct.pack(f"{len(prefix_tokens)}I", *prefix_tokens)).hexdigest()
    return f"kv:{model_id}:b32:{h}"

def store_kv_block(model_id: str, prefix_tokens: list[int], kv_tensor: bytes, ttl: int = 3600) -> None:
    key = kv_block_key(model_id, prefix_tokens)
    # HSET stores metadata + tensor bytes. The pipeline sends HSET and EXPIRE
    # as one MULTI/EXEC transaction so a block is never left without a TTL.
    pipe = r.pipeline()
    pipe.hset(key, mapping={
        "tensor": kv_tensor,
        "model": model_id,
        "block_size": 32,
        "created_at": int(time.time()),
    })
    pipe.expire(key, ttl)
    pipe.execute()

def fetch_kv_block(model_id: str, prefix_tokens: list[int]) -> bytes | None:
    key = kv_block_key(model_id, prefix_tokens)
    # Returns None on a cache miss.
    return r.hget(key, "tensor")
```

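### 7.2. Redis Cluster sharding by prefix hash

Once the KV cache grows to terabytes, Redis Cluster spreads keys over 16384 slots. A hash tag pins related keys to the same slot, so tagging the blocks of one prefix chain (for example {model_id:prefix_root}) lets a client prefetch a long prefix from a single shard in one pipelined round trip. Tagging on {model_id} alone would place every block of a model in one slot and create a hot shard. Note that HMGET reads several fields of a single hash, so fetching many blocks at once requires either pipelined HGET calls or storing the blocks of one prefix as fields of one hash. A cluster of 32 Redis 8 nodes with 32 GB of RAM each holds about 800 GB of KV cache after dedup, enough for a 128-GPU serving cluster.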
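How a hash tag decides the slot can be reproduced offline. Redis Cluster maps a key to `CRC16(key) mod 16384` using the XMODEM variant of CRC16, and hashes only the text between the first `{` and the next `}` when that text is non-empty. Python's `binascii.crc_hqx` with an initial value of 0 computes the same CRC. The key names below are made-up examples.

```python
import binascii


def key_slot(key: str) -> int:
    """Redis Cluster slot for a key, honoring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag replaces the full key as the hashed content.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# Blocks that share a tag land in the same slot, whatever the block hash is.
print(key_slot("kv:{llama-3.3-70b:sysprompt-a}:b32:0001"))
print(key_slot("kv:{llama-3.3-70b:sysprompt-a}:b32:0002"))
# Without a tag, each block key hashes independently across the cluster.
print(key_slot("kv:llama-3.3-70b:b32:0001"))
print(key_slot("kv:llama-3.3-70b:b32:0002"))
```

The tag is a deliberate trade-off: it makes multi-key reads on one prefix cheap, at the price of less even key distribution, so the tag should be as fine-grained as the access pattern allows.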

### 7.3. RDMA transport and zero-copy

Starting with Redis 8, RESP3 supports an RDMA transport through an extension. A client on a GPU worker reads KV blocks directly from Redis memory into GPU HBM without a copy through the CPU. Measured bandwidth reaches 180 Gbps on a ConnectX-7 NIC. This is why Redis is preferred here over Memcached (no native RDMA yet) and etcd (built for metadata, not bulk data).

### 7.4. Eviction policy for KV workloads

Redis defaults to noeviction, and allkeys-lru is the usual first choice for caches, but for an LLM KV cache LFU (Least Frequently Used) yields an 8-15% better cache hit rate. The reason: some prefixes (system prompts, foundational documents) are used continuously and should not be evicted just because they are less recent than a transient prompt. Configuration:

```
CONFIG SET maxmemory 32gb
CONFIG SET maxmemory-policy allkeys-lfu
CONFIG SET lfu-log-factor 10
CONFIG SET lfu-decay-time 60
```
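The effect is easy to reproduce on a synthetic trace. The simulation below is a toy under stated assumptions: it uses exact LFU with recency as a tie-breaker, whereas Redis approximates LFU with a probabilistic logarithmic counter and time decay, and the trace (one hot system prompt interleaved with one-off prompts) is constructed to show the failure mode of LRU.

```python
from collections import OrderedDict


def simulate_lru(trace: list[str], capacity: int) -> int:
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits


def simulate_lfu(trace: list[str], capacity: int) -> int:
    freq, last_used, hits = {}, {}, 0
    for t, key in enumerate(trace):
        if key in freq:
            hits += 1
            freq[key] += 1
        else:
            if len(freq) >= capacity:
                # Evict the least frequently used key, oldest first on ties.
                victim = min(freq, key=lambda k: (freq[k], last_used[k]))
                del freq[victim], last_used[victim]
            freq[key] = 1
        last_used[key] = t
    return hits


# A hot system prompt, then rounds of two transient prompts followed by the hot one.
trace = ["sys"] * 3
for i in range(5):
    trace += [f"tmp{2 * i}", f"tmp{2 * i + 1}", "sys"]

lru_hits = simulate_lru(trace, capacity=2)
lfu_hits = simulate_lfu(trace, capacity=2)
print(f"LRU hits: {lru_hits}, LFU hits: {lfu_hits}, requests: {len(trace)}")
```

On this trace LRU evicts the system prompt every time two transient prompts pass through, while LFU keeps it resident. Real workloads are less adversarial, so the gain reported above is a percentage, not a multiple.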

## 8. ClickHouse: Observability for Disaggregated Serving

A disaggregated architecture produces a large volume of telemetry: each request passes through Router → Prefill → KV Transport → Decode and emits at least 10-15 OpenTelemetry spans. At 1,000 req/s that is 10K-15K spans/s, and at 10,000 req/s it is 100K-150K spans/s, a rate that is costly to sustain on Elasticsearch or Tempo. ClickHouse is a common choice for this problem.

### 8.1. Schema for inference spans

```sql
CREATE TABLE inference_spans ON CLUSTER observability
(
    trace_id          String,
    span_id           String,
    parent_span_id    String,
    phase             LowCardinality(String),  -- prefill, kv_transfer, decode_first_token, decode
    model_id          LowCardinality(String),
    worker_id         LowCardinality(String),
    gpu_type          LowCardinality(String),
    ts_start          DateTime64(6, 'UTC'),
    duration_us       UInt64,
    prompt_tokens     UInt32,
    output_tokens     UInt32,
    kv_cache_hit      Bool,
    kv_bytes_transfer UInt64,
    queue_wait_us     UInt64,
    sm_utilization    Float32,
    hbm_utilization   Float32,
    tenant_id         LowCardinality(String),
    agent_id          LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/inference_spans', '{replica}')
PARTITION BY toYYYYMMDD(ts_start)
ORDER BY (tenant_id, phase, ts_start)
-- Cast to DateTime: older ClickHouse releases reject DateTime64 in TTL expressions.
TTL toDateTime(ts_start) + INTERVAL 30 DAY
SETTINGS index_granularity = 8192;
```

### 8.2. TTFT percentiles by phase

```sql
SELECT
    toStartOfMinute(ts_start) AS minute,
    phase,
    quantileTDigest(0.50)(duration_us) / 1000.0 AS p50_ms,
    quantileTDigest(0.95)(duration_us) / 1000.0 AS p95_ms,
    quantileTDigest(0.99)(duration_us) / 1000.0 AS p99_ms,
    count() AS n
FROM inference_spans
WHERE ts_start >= now() - INTERVAL 1 HOUR
  AND model_id = 'llama-3.3-70b'
  -- 'decode_first_token' assumes the emitter tags the first decode step as its own phase
  AND phase IN ('prefill', 'kv_transfer', 'decode_first_token')
GROUP BY minute, phase
ORDER BY minute DESC, phase;
```

### 8.3. Tracking KV cache hit rate per agent

```sql
SELECT
    agent_id,
    countIf(kv_cache_hit) * 100.0 / count() AS hit_rate_pct,
    sum(prompt_tokens) AS total_prompt_tokens,
    -- Counts the whole prompt as cached on a hit, so this is an upper bound.
    sumIf(prompt_tokens, kv_cache_hit) AS cached_tokens,
    sumIf(prompt_tokens, kv_cache_hit) * 100.0 / sum(prompt_tokens) AS token_savings_pct
FROM inference_spans
WHERE ts_start >= today()
  AND phase = 'prefill'
GROUP BY agent_id
HAVING total_prompt_tokens > 100000
ORDER BY token_savings_pct DESC;
```

#### Materialized views for real-time dashboards

At a write rate of 150K spans/s, querying the raw table directly is slow. Create a per-minute summary MATERIALIZED VIEW so that Grafana dashboards can query in <50 ms. For example, a view mv_ttft_minute holds only 1440 rows per day per model, yet answers 90% of common SRE questions.

## 9. Deploying on Kubernetes with KServe and the Dynamo Operator

Deploying disaggregated serving on Kubernetes is not easy. There are three main approaches:

### 9.1. KServe Disaggregated InferenceService

KServe 0.13 (February 2026) adds a disaggregation field to the InferenceService CRD:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b-disagg
spec:
  predictor:
    model:
      modelFormat: { name: vllm }
      runtime: vllm-0.7.0
      storageUri: s3://models/llama3-70b
    disaggregation:
      enabled: true
      prefill:
        replicas: 4
        resources: { limits: { nvidia.com/gpu: 4 } }
        nodeSelector: { gpu-type: h100 }
      decode:
        replicas: 8
        resources: { limits: { nvidia.com/gpu: 2 } }
        nodeSelector: { gpu-type: h200 }
      kvTransport:
        backend: nixl
        network: mellanox-ib
```

### 9.2. Dynamo Kubernetes Operator

NVIDIA released the Dynamo Operator in early 2026 to manage the lifecycle of disaggregated deployments. The operator watches the DynamoDeployment CRD and automatically creates StatefulSets for Prefill/Decode, a Service for the Router, and a ConfigMap for the routing policy. It also integrates with HPA through custom metrics (TTFT p95, queue depth) instead of CPU/memory.

### 9.3. Topology-aware scheduling

Kubernetes is not aware of NVLink topology by default. Use NVIDIA GPU Operator v24.9+ to expose the nvidia.com/gpu-topology label. A custom scheduler then prefers to place each Prefill-Decode pair in the same NVL72 domain so that traffic uses NVLink instead of IB, cutting transport latency from 50 μs to <1 μs.

## 10. Benchmarks and Real-World Trade-offs

The table below summarizes benchmarks from MLPerf Inference v4.1 reports, the NVIDIA Dynamo blog, and the Mooncake paper (FAST '25):

| Configuration | Model | TTFT p95 | TPOT p95 | Throughput (tok/s) | Throughput vs. baseline |
| --- | --- | --- | --- | --- | --- |
| vLLM colocated 8xH100 | Llama-3.3 70B | 1.2 s | 35 ms | 4,800 | 1.00x (baseline) |
| SGLang colocated 8xH100 | Llama-3.3 70B | 980 ms | 28 ms | 5,600 | 1.17x |
| Dynamo disagg 4+4 H100 | Llama-3.3 70B | 420 ms | 22 ms | 9,200 | 1.92x |
| Dynamo disagg 4xH100 + 4xH200 | Llama-3.3 70B | 380 ms | 18 ms | 11,400 | 2.38x |
| Mooncake 4+4 H100 + Redis tier | Llama-3.3 70B | 210 ms | 20 ms | 14,600 | 3.04x |
| Dynamo GB200 NVL72 | Llama-3.3 70B | 95 ms | 8 ms | 42,000 | 8.75x |

#### Trade-off: when NOT to disaggregate

Disaggregation is not a silver bullet. Avoid it when:

- **Small workload (<50 req/s)**: orchestration and transport overhead outweighs the benefit. Colocated vLLM is simpler.
- **No IB/NVLink**: the TCP fallback adds 200-500 ms to TTFT, which defeats the purpose.
- **Very short prompts (<256 tokens)**: prefill is already fast, so splitting it out saves nothing.
- **Small models (<7B)**: interference is low and one GPU serves both phases well.

Rule of thumb: disaggregation pays off most for models ≥30B, prompts ≥2K tokens, and throughput ≥200 req/s.
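These thresholds can be turned into a quick screening check. The function below simply encodes the numbers from this section; it is a hypothetical helper meant for a first pass, and the three verdict strings are invented labels. It does not replace benchmarking on the real workload.

```python
def disaggregation_fit(model_params_b: float, typical_prompt_tokens: int,
                       req_per_s: float, has_rdma_fabric: bool) -> tuple[str, list[str]]:
    """Return a verdict plus the list of blockers that triggered it."""
    blockers = []
    if req_per_s < 50:
        blockers.append("workload under 50 req/s")
    if not has_rdma_fabric:
        blockers.append("no InfiniBand/NVLink fabric")
    if typical_prompt_tokens < 256:
        blockers.append("prompts under 256 tokens")
    if model_params_b < 7:
        blockers.append("model under 7B parameters")
    if blockers:
        return "stay colocated", blockers

    sweet_spot = (model_params_b >= 30
                  and typical_prompt_tokens >= 2_000
                  and req_per_s >= 200)
    return ("strong fit" if sweet_spot else "benchmark first"), []


print(disaggregation_fit(70, 8_000, 500, has_rdma_fabric=True))
print(disaggregation_fit(13, 1_000, 100, has_rdma_fabric=True))
print(disaggregation_fit(3, 128, 10, has_rdma_fabric=False))
```

Anything between the blockers and the sweet spot is reported as "benchmark first", since the section gives no guidance for that middle range.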

## 11. Impact on Multi-Agent Systems

In a Multi-Agent system, different agents have different workload profiles:

```
graph LR
    subgraph Agents
        A1["Planner Agent<br/>prompt 50K out 500"]
        A2["Coder Agent<br/>prompt 20K out 3K"]
        A3["Researcher<br/>prompt 200K out 1K"]
        A4["ToolCaller<br/>prompt 2K out 100"]
    end
    R["Dynamo Router<br/>KV-aware"] --> P1["Prefill H100<br/>chunked 2K"]
    R --> P2["Prefill B200<br/>long context 1M"]
    R --> D1["Decode H200<br/>high throughput"]
    R --> D2["Decode GB200<br/>low latency"]
    A1 --> R
    A2 --> R
    A3 --> R
    A4 --> R
    P1 -->|"NIXL"| D1
    P2 -->|"NIXL"| D2
    style R fill:#e94560,stroke:#fff,color:#fff
    style P1 fill:#4CAF50,stroke:#fff,color:#fff
    style P2 fill:#4CAF50,stroke:#fff,color:#fff
    style D1 fill:#0f3460,stroke:#fff,color:#fff
    style D2 fill:#0f3460,stroke:#fff,color:#fff
```

Figure 5: The router distributes Multi-Agent requests to the matching clusters according to workload profile.

- **Planner Agent** (long prompt, short output): favors a strong Prefill cluster with a lightly loaded Decode side; disaggregation saves 40% of cost.
- **Coder Agent** (medium prompt, long output): a high-bandwidth Decode cluster is the key; this is where GB200 NVL72 shines.
- **Researcher Agent** (ultra-long prompt, 200K+): needs B200 for prefill, and KV transport has to go over NVLink because the volume exceeds 20 GB.
- **Tool-calling Agent** (short everything): a small colocated deployment is enough; no disaggregation needed.

Dynamo's router, combined with Agent Router & Model Cascading, makes it possible to pick the right model (Claude Haiku for simple tasks, Claude Opus for complex ones) and the right serving cluster at the same time.

## 12. Development Timeline 2024-2026

- **06/2024**: The DistServe paper (OSDI '24) shows for the first time that disaggregated goodput exceeds colocated goodput, by up to 4.48x.
- **06/2024**: Microsoft Research presents Splitwise at ISCA '24 (preprint 11/2023), evaluating heterogeneous P-D with H100 + A100 on Azure production traces.
- **11/2024**: Moonshot AI announces Mooncake, a KV cache centric architecture with a tiered store. Open-sourced 01/2025.
- **02/2025**: The Mooncake paper wins a Best Paper Award at FAST '25. Kimi reports a 76% cache hit rate at production scale.
- **03/2025**: NVIDIA announces Dynamo at GTC under Apache 2.0, integrating vLLM, TensorRT-LLM, and SGLang.
- **06/2025**: vLLM 0.6 and SGLang 0.4 add native disaggregated modes that use NIXL as the common transport.
- **02/2026**: KServe 0.13 adds the disaggregation CRD field. NVIDIA releases the Dynamo Operator for Kubernetes.
- **03/2026**: MLPerf Inference v4.1 adds a "Disaggregated Serving" category with an official benchmark.
- **04/2026**: Disaggregated GB200 NVL72 reaches 42,000 tok/s on Llama 3.3 70B, 8.75 times the vLLM baseline.

## 13. Implementation Checklist: a Rollout Roadmap

The checklist below is drawn from teams that have deployed disaggregated serving successfully:

1. **Measure the colocated baseline**: benchmark vLLM or SGLang on the production workload and record TTFT, TPOT, and throughput.
2. **Analyze the workload mix**: prefill/decode token ratio, prompt length distribution, number of tenants, prefix sharing.
3. **Choose hardware**: prefill needs compute (H100/B200); decode needs HBM bandwidth (H200/GB200).
4. **Build the KV transport**: prefer NVLink (intra-rack) > InfiniBand > RoCE > TCP, behind a transport layer such as NIXL.
5. **Deploy Dynamo or Mooncake**: Dynamo for an NVIDIA-first ecosystem, Mooncake for maximum cache hit rate.
6. **Set up the KV cache store**: a 16-32 node Redis 8 cluster with allkeys-lfu and RDMA transport.
7. **End-to-end observability**: OpenTelemetry → ClickHouse with a detailed per-phase, per-worker schema.
8. **Tune the Smart Router**: hash 32-128 prefix tokens depending on the workload; do the cache lookup before dispatch.
9. **SLO-driven autoscaling**: target TTFT/TPOT rather than CPU/memory thresholds.
10. **Chaos testing**: kill random prefill/decode workers and measure the impact with router failover.

## 14. Conclusion

Disaggregated LLM Serving is no longer a research project: in 2026 it has become the default architecture for large-scale AI serving. Separating Prefill and Decode lets each phase run on hardware tuned for it, removes interference, and unlocks 3-8x the throughput of traditional colocated serving. With NVIDIA Dynamo, Mooncake, and a mature Kubernetes ecosystem, deployment is within reach even for teams with limited resources, provided they pick the right moment and configuration.

For Multi-Agent systems, disaggregation matters even more: it lets the system respond flexibly to workloads that are unevenly distributed across agents, which a monolithic architecture handles poorly. Combined with Redis 8 Vector Sets, Prompt Caching, and an LLM Gateway, disaggregated serving completes the picture of modern AI infrastructure: every byte of KV cache is reused, every FLOP lands in the right place, and every request takes the best available path.

#### Next steps

If you operate a vLLM or SGLang serving cluster and see TTFT p95 above 1 second, or throughput hitting the GPU ceiling at peak hours, disaggregation is a clear way out. Start with a POC of 2 prefill + 2 decode H100 nodes running Dynamo, connected through NIXL over InfiniBand, with ClickHouse for observability. After 2-4 weeks of measurement against the baseline, decide how far to scale.


# Disaggregated LLM Serving 2026 - Kiến trúc Tách biệt Prefill và Decode với NVIDIA Dynamo, Mooncake, DistServe, NIXL, Redis KV Cache Store và ClickHouse

## 1. Từ Monolithic đến Disaggregated Serving: Cuộc cách mạng 2026

Trong suốt ba năm kể từ khi vLLM, TensorRT-LLM và SGLang phổ biến hóa các kỹ thuật như **PagedAttention**, **Continuous Batching** và **Chunked Prefill**, kiến trúc serving LLM vẫn trung thành với một giả định ngầm: *một request sống cả đời trên cùng một GPU (hoặc cùng một tensor-parallel group)*. Năm 2026 đánh dấu bước ngoặt: giả định đó bị phá vỡ bởi làn sóng **Disaggregated Serving** — tách biệt hoàn toàn hai giai đoạn Prefill và Decode lên các cụm GPU độc lập, nối với nhau bằng đường truyền KV Cache tốc độ cao.

3.8xThroughput tăng (Mooncake vs vLLM)

62%TTFT giảm (NVIDIA Dynamo)

900GB/sBăng thông NVLink 5 (B200)

4.1xGoodput/$ (DistServe)

#### Disaggregation là gì?

Trong serving truyền thống, một request chạy Prefill (xử lý toàn bộ prompt) và Decode (sinh từng token) trên cùng một GPU. Hai pha này có đặc điểm tài nguyên trái ngược: Prefill là **compute-bound** (cần FLOPs cao), Decode là **memory-bound** (cần băng thông HBM cao). Trộn chung trên một GPU dẫn đến hiện tượng *interference* — token decode bị trì hoãn mỗi khi có prompt dài đi vào Prefill. Disaggregation tách hai pha ra hai cụm GPU độc lập, mỗi cụm được tối ưu riêng.

## 2. Giải phẫu hai pha Prefill và Decode

Để hiểu vì sao tách biệt lại tạo ra lợi ích lớn, cần hiểu rõ bản chất computational của hai pha:

```
graph LR
    A["Input Prompt 8K tokens"] --> B["Prefill Pha"]
    B --> C["KV Cache 8K"]
    C --> D["Decode Pha"]
    D --> E["Token 1"]
    E --> F["Token 2"]
    F --> G["... Token N"]
    style B fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style C fill:#0f3460,stroke:#fff,color:#fff

```

Hình 1: Prefill sinh KV cache một lần, Decode tái sử dụng KV cache để sinh token tuần tự

### 2.1. Prefill: Compute-bound, chạy song song toàn bộ prompt

Ở pha Prefill, model xử lý **toàn bộ N tokens của prompt cùng lúc** thông qua tensor-parallel matmul. Ma trận attention có kích thước N×N, mỗi transformer layer thực hiện khoảng 2·N·D² FLOPs (D là hidden dimension). Với prompt 32K tokens trên Llama 3 70B, một lần Prefill tiêu tốn vài chục PetaFLOPs. GPU H100 chạy ở 65-85% SM utilization — gần như sát trần compute. Metrics quan trọng: **TTFT (Time To First Token)**.

### 2.2. Decode: Memory-bound, sinh từng token một

Sau khi có KV cache, Decode sinh **một token tại một thời điểm**. Mỗi bước phải đọc toàn bộ KV cache từ HBM (vài chục GB với context dài). Arithmetic intensity cực thấp — tỷ lệ FLOPs/byte chỉ khoảng 1-2 so với ceiling 300+ của H100. Kết quả: SM utilization rớt xuống 5-15%, bottleneck hoàn toàn là băng thông HBM. Metrics quan trọng: **TPOT (Time Per Output Token)** và **ITL (Inter-Token Latency)**.

#### Interference: Vấn đề cốt lõi của kiến trúc hợp nhất

Khi vLLM hoặc SGLang chạy *continuous batching*, prefill và decode chia sẻ cùng một batch mỗi forward pass. Khi một request mới với prompt 64K tokens đến, toàn bộ batch decode bị trì hoãn khoảng 800ms-2s chỉ để chờ prefill xong. Trong hệ Multi-Agent có nhiều loại request (agent planning dùng context dài, tool-calling dùng prompt ngắn), interference khiến tail latency P99 nổ từ 1s lên 15s.

## 3. Kiến trúc Disaggregated Serving

Kiến trúc điển hình tách hệ thống thành bốn lớp: **Router**, **Prefill Cluster**, **Decode Cluster** và **KV Cache Transport**. Mỗi lớp có đặc tính phần cứng và chiến lược scaling riêng.

```
graph TB
    C["Client Multi-Agent"] --> R["Router Dynamo Frontend"]
    R -->|"1 Dispatch prompt"| P["Prefill Cluster H100 B200"]
    P -->|"2 Produce KV cache"| K["KV Transport NIXL UCX NVLink"]
    K -->|"3 Migrate KV"| D["Decode Cluster H200 MI300X"]
    D -->|"4 Stream tokens"| R
    R -->|"5 SSE WebSocket"| C
    KV["KV Store Redis Mooncake Store"] <--> K
    M["Metrics OTLP"] --> CH["ClickHouse"]
    P --> M
    D --> M
    R --> M
    style P fill:#e94560,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style K fill:#0f3460,stroke:#fff,color:#fff
    style KV fill:#9b59b6,stroke:#fff,color:#fff
    style CH fill:#f39c12,stroke:#fff,color:#fff

```

Figure 2: Disaggregated Serving architecture with Router, Prefill, Decode, KV Transport, and Observability

### 3.1. Prefill Cluster: optimized for compute

The prefill cluster uses GPUs with **high FLOPs**; HBM capacity can be more modest because the KV cache is shipped out right after it is produced. Common configurations in 2026:

- **NVIDIA H100 80GB**: 989 TFLOPs FP16, 3.35 TB/s HBM3. The best price/performance choice.
- **NVIDIA B200 192GB**: 2.25 PFLOPs FP16, 8 TB/s HBM3e. Suited to prefill of very long prompts (1M+ tokens).
- **AMD MI300X 192GB**: 1.3 PFLOPs FP16, 5.3 TB/s HBM3. Being deployed widely by Meta and Microsoft.

Prefill batching usually runs in **chunked prefill** mode: a long prompt is split into chunks of 2K-4K tokens that are interleaved with other work in the batch so the GPU stays evenly busy. Dynamo and Mooncake both support chunked prefill with dynamic chunk sizes.

### 3.2. Decode Cluster: optimized for memory bandwidth

The decode cluster prioritizes **large HBM** and **high bandwidth** to hold and read the KV cache:

- **NVIDIA H200 141GB**: 4.8 TB/s HBM3e, a 43% bandwidth increase over the H100.
- **NVIDIA GB200 NVL72**: 72 GPUs connected through NVLink 5 (1.8 TB/s per GPU); the whole NVL72 can be treated as one giant GPU with 13.5 TB of HBM.

With large HBM, a single decode node can hold the KV cache of hundreds of concurrent requests. The decode cluster runs small tensor parallelism (TP=1 or TP=2) and **expert parallelism** for MoE models such as Mixtral 8x22B or DeepSeek V3.

### 3.3. KV Cache Transport: the backbone of the architecture

This is the hardest component. The KV cache for one request with a 32K-token prompt on Llama 3 70B takes about **5-10 GB** (roughly 10 GB in FP16 and 5 GB in FP8, given 80 layers and 8 KV heads of dimension 128). The transfer has to complete before the decode side can emit tokens, which means a budget of <50 ms. Viable links:

| Transport | Bandwidth | Latency | Use case |
| --- | --- | --- | --- |
| **NVLink 5 (GB200)** | 1800 GB/s per GPU | <1μs | Intra-rack, Prefill and Decode in the same rack |
| **InfiniBand NDR 400G** | 400 Gbps (50 GB/s) | 1-2μs | Cross-rack, multi-DC clusters |
| **RoCE v2 200G** | 200 Gbps (25 GB/s) | 5-10μs | Ethernet fabric, cheaper than IB |
| **TCP (fallback)** | 10-25 Gbps | 100-500μs | Dev/staging, not for production |
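The size of the payload and the time each link needs to move it can be estimated directly. The sketch below assumes a Llama-3-70B-like attention shape (80 layers, 8 KV heads, head dimension 128) and uses the nominal link rates from the table, ignoring protocol overhead; the function names are illustrative.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """KV cache size for one request; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def transfer_ms(n_bytes: int, gb_per_s: float) -> float:
    """Idealized transfer time at the nominal link rate."""
    return n_bytes / (gb_per_s * 1e9) * 1e3


# Nominal rates in GB/s; real throughput is lower.
LINKS_GB_PER_S = {"InfiniBand NDR 400G": 50.0, "RoCE v2 200G": 25.0, "TCP 25G": 3.125}

kv_fp16 = kv_cache_bytes(32_768, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
kv_fp8 = kv_cache_bytes(32_768, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=1)
print(f"FP16: {kv_fp16 / 2**30:.0f} GiB, FP8: {kv_fp8 / 2**30:.0f} GiB")
for name, rate in LINKS_GB_PER_S.items():
    print(f"{name:>20}: {transfer_ms(kv_fp8, rate):7.1f} ms for the FP8 cache")
```

Under these assumptions a single NDR link needs about twice the 50 ms budget for the whole FP8 cache. That suggests, as an inference from the arithmetic rather than a claim from any vendor, that the transfer has to be striped across several links or overlapped with prefill.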

#### NIXL: a unified transport library

**NVIDIA Inference Xfer Library (NIXL)** is the abstraction layer that Dynamo, vLLM, SGLang, and TensorRT-LLM have all adopted since early 2026. NIXL hides the details of NVLink/IB/RoCE behind a single API and supports **GPU Direct RDMA**, which moves data straight from source HBM to destination HBM without going through the CPU. On GB200 NVL72, NIXL reaches 1.2 TB/s of throughput for a single KV transfer stream.

## 4. NVIDIA Dynamo: Reference Implementation 2026

**NVIDIA Dynamo** is an open-source framework that NVIDIA announced at GTC in March 2025 and developed heavily through 2026, becoming the de facto standard for disaggregated serving on NVIDIA hardware. Dynamo is not an inference engine; it is an orchestration layer on top of vLLM/TRT-LLM/SGLang.

```
graph TB
    CL["Client OpenAI API"] --> FE["Dynamo Frontend HTTP"]
    FE --> PL["Planner autoscaling"]
    FE --> RT["Smart Router KV-aware"]
    RT --> PW["Prefill Worker vLLM SGLang"]
    RT --> DW["Decode Worker vLLM SGLang"]
    PW -->|"NIXL"| DW
    KVM["KV Block Manager"] -.-> PW
    KVM -.-> DW
    KVS["KV Cache Store Redis Mooncake"] <-.-> KVM
    PL --> PW
    PL --> DW
    style FE fill:#e94560,stroke:#fff,color:#fff
    style RT fill:#0f3460,stroke:#fff,color:#fff
    style PW fill:#4CAF50,stroke:#fff,color:#fff
    style DW fill:#4CAF50,stroke:#fff,color:#fff
    style KVS fill:#9b59b6,stroke:#fff,color:#fff

```

Figure 3: NVIDIA Dynamo architecture with Frontend, Planner, Smart Router, and KV Block Manager

### 4.1. Smart Router: routing based on KV cache locality

The "smartest" component of Dynamo is the Smart Router. Instead of round-robin, the router tracks the **prefix hash** of each prompt and routes the request to a worker that already holds the KV cache for that prefix. In a Multi-Agent system with a long shared system prompt (5K-10K tokens), the cache hit rate reaches 85-95%, saving billions of prefill tokens per day.

```yaml
# Abridged Dynamo deployment manifest (illustrative; field names may differ from the released CRD)
apiVersion: dynamo.nvidia.com/v1
kind: DynamoDeployment
metadata:
  name: llama3-70b-disagg
spec:
  model:
    name: meta-llama/Llama-3.3-70B-Instruct
    quantization: fp8
  prefill:
    replicas: 4
    engine: vllm
    tensorParallelSize: 4
    gpuType: H100
    chunkedPrefill:
      enabled: true
      chunkSize: 2048
  decode:
    replicas: 8
    engine: vllm
    tensorParallelSize: 2
    gpuType: H200
    maxBatchSize: 256
  kvTransport:
    backend: nixl
    protocol: infiniband
  router:
    strategy: kv-aware
    hashPrefixTokens: 64
  planner:
    autoscale: true
    targetTTFT: 300ms
    targetTPOT: 20ms

```
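The `kv-aware` strategy in the manifest can be pictured with a small in-memory model. This is a minimal sketch, not Dynamo's implementation: the `KvAwareRouter` class, the chained SHA-256 block hashes, and the "longest cached prefix, then lowest load" scoring are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

BLOCK = 64  # tokens per hashed block; mirrors hashPrefixTokens in the manifest


def prefix_hashes(tokens: list[int], block: int = BLOCK) -> list[str]:
    """Chained hashes: entry i identifies tokens[0 : (i + 1) * block]."""
    hashes, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for start in range(0, full, block):
        h.update(b"".join(t.to_bytes(4, "little") for t in tokens[start:start + block]))
        hashes.append(h.copy().hexdigest())
    return hashes


class KvAwareRouter:
    def __init__(self, workers: list[str]):
        self.load = {w: 0 for w in workers}   # requests routed so far
        self.cached = defaultdict(set)        # worker -> prefix hashes it holds

    def route(self, tokens: list[int]) -> str:
        hashes = prefix_hashes(tokens)

        def score(worker: str) -> tuple[int, int]:
            hit = 0
            for digest in hashes:             # count the longest cached prefix
                if digest not in self.cached[worker]:
                    break
                hit += 1
            return hit, -self.load[worker]    # prefer cache hits, then low load

        best = max(self.load, key=score)
        self.cached[best].update(hashes)
        self.load[best] += 1
        return best


router = KvAwareRouter(["prefill-0", "prefill-1"])
system_prompt = list(range(256))              # shared 256-token prefix
first = router.route(system_prompt + [1000] * 64)
second = router.route(system_prompt + [2000] * 64)
cold = router.route([7] * 128)
print(first, second, cold)
```

The second request follows the first because the shared system prompt is already cached there, while the unrelated request goes to the idle worker. A production router would also need eviction feedback from the workers, which this sketch leaves out.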

### 4.2. Planner: SLO-driven autoscaling

Dynamo's Planner uses the **GPU Planner** algorithm (its name in the source code) to decide the number of prefill and decode replicas from the SLO targets. When the ratio of prefill tokens to decode tokens in the workload rises (for example, many summarization requests with long prompts and short outputs), the Planner scales up the prefill cluster. Conversely, when the workload is conversational chat (short prompts, long outputs), it scales up the decode cluster. The default re-evaluation interval is 30 s.
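The scaling rule can be sketched as a capacity calculation. This is not the GPU Planner algorithm itself; it is a simplified model in which each pool is sized from the observed token rates so that replicas stay below a utilization headroom. `Window`, `plan_replicas`, and the per-replica capacities are hypothetical names and numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class Window:
    prefill_tokens_per_s: float   # prompt tokens arriving per second
    decode_tokens_per_s: float    # output tokens generated per second


def plan_replicas(w: Window, prefill_capacity: float, decode_capacity: float,
                  headroom: float = 0.8, min_replicas: int = 1) -> tuple[int, int]:
    """Size each pool so replicas run at or below `headroom` of their capacity."""
    prefill = math.ceil(w.prefill_tokens_per_s / (prefill_capacity * headroom))
    decode = math.ceil(w.decode_tokens_per_s / (decode_capacity * headroom))
    return max(min_replicas, prefill), max(min_replicas, decode)


# Assumed per-replica capacities (tokens/s) that still meet the TTFT/TPOT targets.
PREFILL_CAP, DECODE_CAP = 50_000, 2_000

summarization = plan_replicas(Window(400_000, 5_000), PREFILL_CAP, DECODE_CAP)
chat = plan_replicas(Window(40_000, 20_000), PREFILL_CAP, DECODE_CAP)
print("summarization-heavy (prefill, decode):", summarization)
print("chat-heavy          (prefill, decode):", chat)
```

Re-running this every evaluation interval with fresh measurements gives the behavior described above: long-prompt workloads grow the prefill pool and long-output workloads grow the decode pool.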

## 5. Mooncake: the KV-centric architecture from Moonshot AI

**Mooncake** is the serving architecture that Moonshot AI (the company behind Kimi) introduced in 2024 and then open-sourced in full during 2025. Mooncake follows the philosophy that "**the KV cache is a first-class citizen**": every routing and scheduling decision revolves around optimizing the KV cache lifecycle.

```
graph LR
    R["Conductor Router"] --> PI["Prefill Instance"]
    R --> DI["Decode Instance"]
    PI -->|"Write KV"| MS["Mooncake Store Tiered Cache"]
    DI -->|"Read KV"| MS
    MS --> HBM["Tier 1 HBM pool"]
    MS --> DRAM["Tier 2 DRAM pool"]
    MS --> NVME["Tier 3 NVMe SSD"]
    MS --> S3["Tier 4 S3 cold"]
    style R fill:#e94560,stroke:#fff,color:#fff
    style MS fill:#9b59b6,stroke:#fff,color:#fff
    style PI fill:#4CAF50,stroke:#fff,color:#fff
    style DI fill:#4CAF50,stroke:#fff,color:#fff

```

Figure 4: Mooncake Store, with the KV cache tiered across HBM, DRAM, NVMe, and object storage

### 5.1. Mooncake Store: distributed tiered KV cache

#### Mooncake + Redis 8: Tier 2 in production

In many 2026 deployments, **Redis 8** serves as the tier 2 DRAM pool for Mooncake Store. The reasons: Redis offers **RDMA support** (via RESP3 over RDMA), automatic sharding with Redis Cluster, and persistence through AOF. A 16-node Redis cluster (1 TB of RAM in total) can serve tier 2 for a 64-GPU H200 serving cluster with p99 latency <200μs. For details see the post [Redis 8 Vector Sets](/redis-8-vector-sets-2026-kien-truc-native-vector-search-voi-hnsw-quantization-q8bin-va-hybrid-filter-cho-multi-agent-ai-1031).

### 5.2. Conductor: Global Scheduler

Mooncake's Conductor is a cross-cluster global scheduler. When a request arrives, the Conductor decides:

1. Which prefix of the prompt is already in Mooncake Store (cache lookup).
2. Which Prefill instance receives the rest of the prompt, based on current load.
3. Which Decode instance takes over the request, based on KV locality and memory pressure.
4. Which eviction policy applies to Mooncake Store if tier 1 is full.

The Conductor is written in Go, talks to workers over gRPC, and keeps a small metadata cache in etcd for high availability.
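The first three decisions can be sketched in a few lines. This is an illustrative model rather than Mooncake's scheduler: `Instance`, `schedule`, and the selection rules (least queued tokens for prefill, most free KV blocks for decode) are assumptions, and eviction is left out.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    queued_tokens: int = 0     # prefill side: prompt tokens waiting to be processed
    free_kv_blocks: int = 0    # decode side: KV blocks still available


def cached_prefix_blocks(block_keys: list[str], store: set[str]) -> int:
    """Step 1: length of the longest prefix whose blocks are all in the store."""
    hit = 0
    for key in block_keys:
        if key not in store:
            break
        hit += 1
    return hit


def schedule(block_keys: list[str], store: set[str],
             prefills: list[Instance], decodes: list[Instance]):
    hit = cached_prefix_blocks(block_keys, store)
    # Step 2: only the uncached tail needs prefill; pick the least-loaded instance.
    needs_prefill = hit < len(block_keys)
    prefill = min(prefills, key=lambda i: i.queued_tokens) if needs_prefill else None
    # Step 3: the decode instance must fit the full KV cache; prefer the most headroom.
    fitting = [d for d in decodes if d.free_kv_blocks >= len(block_keys)]
    if not fitting:
        raise RuntimeError("no decode instance has enough free KV blocks")
    decode = max(fitting, key=lambda i: i.free_kv_blocks)
    return hit, prefill, decode


store = {"b0", "b1"}
hit, prefill, decode = schedule(
    ["b0", "b1", "b2", "b3"], store,
    prefills=[Instance("p0", queued_tokens=9000), Instance("p1", queued_tokens=1000)],
    decodes=[Instance("d0", free_kv_blocks=2), Instance("d1", free_kv_blocks=64)],
)
print(hit, prefill.name, decode.name)
```

Two of the four blocks are served from the store, the remaining two go to the idle prefill instance, and decode lands on the only instance with enough free blocks.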

## 6. DistServe and Splitwise: the pioneering work

Two academic frameworks laid the groundwork for disaggregation before NVIDIA and Moonshot commercialized it:

### 6.1. DistServe (PKU + UCSD)

DistServe was the first framework to demonstrate that the *goodput* (SLO-aware throughput) of disaggregated serving clearly beats colocated serving. Goodput is defined as the number of requests completed within budget (for example TTFT < 300 ms, TPOT < 50 ms). On workloads including OPT-175B, DistServe reports up to **4.48x goodput** over colocated vLLM with the same number of GPUs. The key to its success is a placement algorithm that automatically decides the number of prefill and decode replicas and the degree of parallelism for each.

### 6.2. Splitwise (Microsoft Research)

Splitwise explores a different direction: **heterogeneous hardware**. Its design runs prefill on H100 but decode on A100, since the A100 costs less and still has enough HBM bandwidth for decode. The paper reports up to 1.4x higher throughput at **20% lower cost** than a homogeneous design. Splitwise also introduces the concept of *phase-aware scheduling* for data centers with mixed GPUs.

| Framework | Released | License | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| **NVIDIA Dynamo** | GTC 2025 | Apache 2.0 | Integrates every engine, NIXL, Planner | Favors NVIDIA hardware |
| **Mooncake** | 11/2024 | Apache 2.0 | Tiered Mooncake Store, high cache hit rate | Requires a separate store layer |
| **DistServe** | 2024 | MIT (research) | Placement algorithm published in the paper | No production support |
| **Splitwise** | 2024 | MIT (research) | Heterogeneous HW, low TCO | Not fully open source yet |
| **vLLM Disagg** | 2025 | Apache 2.0 | Native in vLLM, uses NIXL | Only one P-D pair, no planner yet |
| **SGLang PD** | 2025 | Apache 2.0 | Strong radix-tree KV reuse | Router not as KV-aware as Dynamo |

## 7. Redis 8 as a KV Cache Store: a practical deployment

### 7.1. Key schema and hash strategy

Each KV block is identified by a hash of the tuple `(model_id, tokenizer_hash, block_prefix_tokens)`. The typical block size is 16 or 32 tokens, the same as vLLM PagedAttention. A typical Redis entry:

```python
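import struct
import time

import redis
import xxhash

# RESP3 connection (protocol=3); values come back as bytes.
r = redis.Redis(host="redis-kv", port=6379, protocol=3)

BLOCK_SIZE = 32  # tokens per KV block


def kv_block_key(model_id: str, prefix_tokens: list[int]) -> str:
    # Hash the packed token IDs of the prefix that ends at this block.
    # tokenizer_hash from the tuple above is left out here for brevity.
    packed = struct.pack(f"{len(prefix_tokens)}I", *prefix_tokens)
    return f"kv:{model_id}:b{BLOCK_SIZE}:{xxhash.xxh128(packed).hexdigest()}"


def store_kv_block(model_id: str, prefix_tokens: list[int], kv_tensor: bytes, ttl: int = 3600) -> None:
    key = kv_block_key(model_id, prefix_tokens)
    # HSET stores metadata + tensor bytes; EXPIRE goes in the same transaction
    # so a block is never left behind without a TTL.
    pipe = r.pipeline(transaction=True)
    pipe.hset(key, mapping={
        "tensor": kv_tensor,
        "model": model_id,
        "block_size": BLOCK_SIZE,
        "created_at": int(time.time()),
    })
    pipe.expire(key, ttl)
    pipe.execute()


def fetch_kv_block(model_id: str, prefix_tokens: list[int]) -> bytes | None:
    # Returns None on a cache miss.
    return r.hget(kv_block_key(model_id, prefix_tokens), "tensor")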

```

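### 7.2. Redis Cluster sharding by prefix hash

At multi-TB KV cache scale, Redis Cluster distributes keys over 16384 slots. A hash tag (the part of the key inside `{...}`) forces related blocks into the same slot, so the blocks of one long prefix can be prefetched in a single pipelined round trip of `HGET` calls (`HMGET` only reads several fields of one hash). Tagging by `{model_id}` alone would pin every block of a model to one shard, so the tag should also carry a prefix bucket. A 32-node Redis 8 cluster with 32 GB of RAM per node can hold ~800 GB of KV cache after dedup, enough for a 128-GPU serving cluster.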
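How a key maps to one of the 16384 slots, and what a hash tag changes, can be reproduced locally. The sketch below follows the slot rule from the Redis Cluster specification (CRC16/XMODEM of the key, or of the hash tag when one is present, modulo 16384); the key names are made-up examples.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash only the text between the first '{' and the next '}', if non-empty."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Hypothetical keys: the tag groups all blocks of one prefix bucket together.
same_a = key_slot("kv:{llama70b:sys01}:b32:aaaa")
same_b = key_slot("kv:{llama70b:sys01}:b32:bbbb")
untagged_a = key_slot("kv:llama70b:b32:aaaa")
untagged_b = key_slot("kv:llama70b:b32:bbbb")
print(same_a, same_b, untagged_a, untagged_b)
```

Keys that share a tag always land in the same slot, which is what makes multi-key pipelines and transactions possible in cluster mode; untagged keys are spread by their full name.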

### 7.3. RDMA transport and zero-copy

### 7.4. Eviction policy for KV workloads

Redis defaults to `noeviction`, and `allkeys-lru` is the usual choice for caches, but for LLM KV cache **LFU** (Least Frequently Used) yields an 8-15% better cache hit rate. The reason: some prefixes (system prompts, foundational documents) are used continuously and should not be evicted just because they were touched less recently than a transient prompt. Configuration:

```bash
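# Run against each node; redis-cli forwards the CONFIG SET command.
redis-cli CONFIG SET maxmemory 32gb
redis-cli CONFIG SET maxmemory-policy allkeys-lfu
redis-cli CONFIG SET lfu-log-factor 10
redis-cli CONFIG SET lfu-decay-time 60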

```
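The intuition behind preferring LFU can be reproduced with a toy trace. This is a sketch only: it uses exact LRU and exact LFU, whereas Redis approximates both by sampling and uses a decaying probabilistic counter for LFU, and the trace (one hot system prompt interleaved with one-off prompts) is invented for illustration.

```python
from collections import Counter, OrderedDict


def hit_rate_lru(trace: list[str], capacity: int) -> float:
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)           # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the least recently used key
            cache[key] = True
    return hits / len(trace)


def hit_rate_lfu(trace: list[str], capacity: int) -> float:
    cache, freq, hits = {}, Counter(), 0
    for key in trace:
        freq[key] += 1
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                victim = min(cache, key=lambda k: freq[k])  # least frequently used
                del cache[victim]
            cache[key] = True
    return hits / len(trace)


# One shared system prompt followed by three transient prompts, repeated 100 times.
trace = []
for i in range(100):
    trace += ["sys", f"t{3 * i}", f"t{3 * i + 1}", f"t{3 * i + 2}"]

lru = hit_rate_lru(trace, capacity=3)
lfu = hit_rate_lfu(trace, capacity=3)
print(f"LRU hit rate: {lru:.3f}, LFU hit rate: {lfu:.3f}")
```

With a cache smaller than the gap between reuses, LRU keeps evicting the hot prefix just before it is needed again, while LFU pins it once its counter pulls ahead of the transient prompts.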

## 8. ClickHouse: Observability for Disaggregated Serving

A disaggregated architecture produces a huge volume of telemetry: each request travels Router → Prefill → KV Transport → Decode and emits at least 10-15 OpenTelemetry spans. At 1,000 req/s that is already 10K-15K spans/s, and a fleet serving 10,000 req/s reaches 100K-150K spans/s, a rate that is hard to sustain on Elasticsearch or Tempo. **ClickHouse** is a common choice for this problem.

### 8.1. Schema for inference spans

```sql
CREATE TABLE inference_spans ON CLUSTER observability
(
    trace_id          String,
    span_id           String,
    parent_span_id    String,
    phase             LowCardinality(String),  -- prefill, kv_transfer, decode_first_token, decode
    model_id          LowCardinality(String),
    worker_id         LowCardinality(String),
    gpu_type          LowCardinality(String),
    ts_start          DateTime64(6, 'UTC'),
    duration_us       UInt64,
    prompt_tokens     UInt32,
    output_tokens     UInt32,
    kv_cache_hit      Bool,
    kv_bytes_transfer UInt64,
    queue_wait_us     UInt64,
    sm_utilization    Float32,
    hbm_utilization   Float32,
    tenant_id         LowCardinality(String),
    agent_id          LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/inference_spans', '{replica}')
PARTITION BY toYYYYMMDD(ts_start)
ORDER BY (tenant_id, phase, ts_start)
TTL ts_start + INTERVAL 30 DAY
SETTINGS index_granularity = 8192;

```

### 8.2. Querying TTFT percentiles by phase

```sql
SELECT
    toStartOfMinute(ts_start) AS minute,
    phase,
    quantileTDigest(0.50)(duration_us) / 1000.0 AS p50_ms,
    quantileTDigest(0.95)(duration_us) / 1000.0 AS p95_ms,
    quantileTDigest(0.99)(duration_us) / 1000.0 AS p99_ms,
    count() AS n
FROM inference_spans
WHERE ts_start >= now() - INTERVAL 1 HOUR
  AND model_id = 'llama-3.3-70b'
  AND phase IN ('prefill', 'kv_transfer', 'decode_first_token')
GROUP BY minute, phase
ORDER BY minute DESC, phase;

```

### 8.3. Tracking KV cache hit rate per agent

```sql
SELECT
    agent_id,
    countIf(kv_cache_hit) * 100.0 / count() AS hit_rate_pct,
    sum(prompt_tokens) AS total_prompt_tokens,
    sumIf(prompt_tokens, kv_cache_hit) AS cached_tokens,
    sumIf(prompt_tokens, kv_cache_hit) * 100.0 / sum(prompt_tokens) AS token_savings_pct
FROM inference_spans
WHERE ts_start >= today()
  AND phase = 'prefill'
GROUP BY agent_id
HAVING total_prompt_tokens > 100000
ORDER BY token_savings_pct DESC;

```

#### Materialized Views for realtime dashboards

At a write rate of 150K spans/s, querying the raw table directly is slow. Create a per-minute summary `MATERIALIZED VIEW` so Grafana dashboards can query in <50 ms. For example, a view named `mv_ttft_minute` holds only 1440 rows per day per model, yet answers 90% of the questions SREs usually ask.

## 9. Deployment on Kubernetes with KServe and the Dynamo Operator

Deploying disaggregated serving on Kubernetes is not easy. There are three main strategies:

### 9.1. KServe Disaggregated InferenceService

KServe 0.13 (February 2026) adds a `disaggregation` field to the `InferenceService` CRD:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b-disagg
spec:
  predictor:
    model:
      modelFormat: { name: vllm }
      runtime: vllm-0.7.0
      storageUri: s3://models/llama3-70b
    disaggregation:
      enabled: true
      prefill:
        replicas: 4
        resources: { limits: { nvidia.com/gpu: 4 } }
        nodeSelector: { gpu-type: h100 }
      decode:
        replicas: 8
        resources: { limits: { nvidia.com/gpu: 2 } }
        nodeSelector: { gpu-type: h200 }
      kvTransport:
        backend: nixl
        network: mellanox-ib

```

### 9.2. Dynamo Kubernetes Operator

NVIDIA released the **Dynamo Operator** in early 2026 to manage the lifecycle of disaggregated deployments. The operator watches the `DynamoDeployment` CRD and automatically creates StatefulSets for Prefill and Decode, a Service for the Router, and a ConfigMap for the routing policy. It also integrates with HPA through custom metrics (TTFT p95, queue depth) instead of CPU/memory.

### 9.3. Topology-aware scheduling

Kubernetes does not know the NVLink topology by default. Use **NVIDIA GPU Operator** v24.9+ to expose the `nvidia.com/gpu-topology` label. A custom scheduler can then prefer placing a Prefill-Decode pair in the same NVL72 pod so traffic uses NVLink instead of IB, cutting transport latency from 50μs to <1μs.

## 10. Benchmarks and real-world trade-offs

The table below aggregates benchmarks from MLPerf Inference v4.1 reports, the NVIDIA Dynamo blog, and the Mooncake paper (FAST '25). The last column is total throughput relative to the baseline row, not a per-GPU figure:

| Configuration | Model | TTFT p95 | TPOT p95 | Throughput (tok/s) | Throughput vs. baseline |
| --- | --- | --- | --- | --- | --- |
| vLLM colocated 8xH100 | Llama-3.3 70B | 1.2s | 35ms | 4,800 | 1.00x (baseline) |
| SGLang colocated 8xH100 | Llama-3.3 70B | 980ms | 28ms | 5,600 | 1.17x |
| Dynamo disagg 4+4 H100 | Llama-3.3 70B | 420ms | 22ms | 9,200 | 1.92x |
| Dynamo disagg 4xH100 + 4xH200 | Llama-3.3 70B | 380ms | 18ms | 11,400 | 2.38x |
| Mooncake 4+4 H100 + Redis tier | Llama-3.3 70B | 210ms | 20ms | 14,600 | 3.04x |
| Dynamo GB200 NVL72 | Llama-3.3 70B | 95ms | 8ms | 42,000 | 8.75x |
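When comparing rows with different GPU counts, it helps to normalize the throughput numbers. The sketch below recomputes the ratio column from the table and adds a per-GPU view; the GPU counts are read off the configuration names, and treating the NVL72 row as using all 72 GPUs is an assumption.

```python
# (configuration, GPU count, throughput in tok/s) copied from the table above.
ROWS = [
    ("vLLM colocated 8xH100", 8, 4_800),
    ("SGLang colocated 8xH100", 8, 5_600),
    ("Dynamo disagg 4+4 H100", 8, 9_200),
    ("Dynamo disagg 4xH100 + 4xH200", 8, 11_400),
    ("Mooncake 4+4 H100 + Redis tier", 8, 14_600),
    ("Dynamo GB200 NVL72", 72, 42_000),  # assumption: all 72 GPUs are in use
]


def normalize(rows):
    """Return (name, total ratio vs. first row, per-GPU ratio vs. first row)."""
    _, base_gpus, base_tps = rows[0]
    result = []
    for name, gpus, tps in rows:
        total_ratio = tps / base_tps
        per_gpu_ratio = (tps / gpus) / (base_tps / base_gpus)
        result.append((name, round(total_ratio, 2), round(per_gpu_ratio, 2)))
    return result


table = normalize(ROWS)
for name, total_ratio, per_gpu_ratio in table:
    print(f"{name:<32} total {total_ratio:>5.2f}x   per GPU {per_gpu_ratio:>5.2f}x")
```

For the 8-GPU rows both ratios are identical. Under the 72-GPU assumption the NVL72 row comes out near parity per GPU, so its advantage in the table is mainly latency and scale rather than per-GPU efficiency.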

#### Trade-off: when should you NOT disaggregate?

Disaggregation is not a silver bullet. Avoid it when:

- **The workload is small (<50 req/s)**: orchestration and transport overhead outweigh the benefit. Colocated vLLM is simpler.
- **There is no IB/NVLink**: the TCP fallback adds 200-500 ms to TTFT, which defeats the purpose.
- **Prompts are very short (<256 tokens)**: prefill is already fast, so splitting it out saves nothing.
- **The model is small (<7B)**: interference is low and a single GPU can serve both phases.

Rule of thumb: disaggregation pays off most with **models ≥30B, prompts ≥2K tokens, and throughput ≥200 req/s**.
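The criteria above can be turned into a quick pre-check before any benchmarking. This is only an encoding of the thresholds listed in this section; the `Workload` fields and the three verdict strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    model_params_b: float        # model size in billions of parameters
    median_prompt_tokens: int
    peak_req_per_s: float
    has_rdma_fabric: bool        # InfiniBand or NVLink between prefill and decode


def disaggregation_advice(w: Workload) -> tuple[str, list[str]]:
    """Apply the 'when not to disaggregate' list, then the rule of thumb."""
    blockers = []
    if w.peak_req_per_s < 50:
        blockers.append("load below 50 req/s")
    if not w.has_rdma_fabric:
        blockers.append("no InfiniBand/NVLink fabric")
    if w.median_prompt_tokens < 256:
        blockers.append("prompts shorter than 256 tokens")
    if w.model_params_b < 7:
        blockers.append("model smaller than 7B")
    if blockers:
        return "stay colocated", blockers
    sweet_spot = (w.model_params_b >= 30
                  and w.median_prompt_tokens >= 2000
                  and w.peak_req_per_s >= 200)
    return ("disaggregate" if sweet_spot else "benchmark both"), []


big = disaggregation_advice(Workload(70, 8000, 400, True))
small = disaggregation_advice(Workload(3, 128, 20, False))
middle = disaggregation_advice(Workload(13, 1500, 120, True))
print(big, small, middle, sep="\n")
```

Workloads that clear the blockers but miss the sweet spot get "benchmark both", since the thresholds are guidelines rather than hard limits.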

## 11. Impact on Multi-Agent systems

In a Multi-Agent system, different agents have different workload profiles:

```
graph LR
    subgraph Agents
        A1["Planner Agent prompt 50K out 500"]
        A2["Coder Agent prompt 20K out 3K"]
        A3["Researcher prompt 200K out 1K"]
        A4["ToolCaller prompt 2K out 100"]
    end
    R["Dynamo Router KV-aware"] --> P1["Prefill H100 chunked 2K"]
    R --> P2["Prefill B200 long context 1M"]
    R --> D1["Decode H200 high throughput"]
    R --> D2["Decode GB200 low latency"]
    A1 --> R
    A2 --> R
    A3 --> R
    A4 --> R
    P1 -->|"NIXL"| D1
    P2 -->|"NIXL"| D2
    style R fill:#e94560,stroke:#fff,color:#fff
    style P1 fill:#4CAF50,stroke:#fff,color:#fff
    style P2 fill:#4CAF50,stroke:#fff,color:#fff
    style D1 fill:#0f3460,stroke:#fff,color:#fff
    style D2 fill:#0f3460,stroke:#fff,color:#fff

```

Figure 5: The Router distributes Multi-Agent requests to suitable clusters according to workload profile

- **Planner Agent** (long prompt, short output): favors a strong Prefill cluster with a lightly loaded Decode side; disaggregation saves 40% of cost.
- **Coder Agent** (medium prompt, long output): a high-bandwidth Decode cluster is the key, and this is where GB200 NVL72 shines.
- **Researcher Agent** (ultra-long prompt, 200K+): requires B200 for prefill, and KV transport has to go over NVLink because the payload exceeds 20 GB.
- **Tool-calling Agent** (short everything): a small colocated deployment is enough, no disaggregation needed.

Combining the Dynamo router with [Agent Router & Model Cascading](/agent-router-model-cascading-2026-kien-truc-routing-thong-minh-cho-multi-agent-voi-semantic-classifier-bandit-feedback-redis-va-clickhouse-1033) makes it possible to pick the right model (Claude Haiku for simple tasks, Claude Opus for complex ones) and the right serving cluster at the same time.

## 12. Development timeline 2024-2026

06/2024

The **DistServe** paper (OSDI '24) shows for the first time that disaggregated goodput can exceed colocated goodput, reporting up to 4.48x.

10/2024

Microsoft presents **Splitwise**, a heterogeneous P-D design pairing H100 and A100, evaluated with Azure production traces.

11/2024

Moonshot AI announces **Mooncake**, a KV-cache-centric architecture with a tiered store. Open-sourced 01/2025.

03/2025

NVIDIA announces **Dynamo** at GTC with Apache 2.0 source code, integrating vLLM/TRT-LLM/SGLang.

06/2025

vLLM 0.6 and SGLang 0.4 add a **native disagg** mode that uses NIXL as the shared transport.

10/2025

Kimi reports a 76% cache hit rate at production scale. The Mooncake paper had received the **Best Paper Award** at USENIX FAST '25 earlier that year.

02/2026

KServe 0.13 adds the `disaggregation` CRD field. NVIDIA releases the **Dynamo Operator** for Kubernetes.

03/2026

**MLPerf Inference v4.1** adds a "Disaggregated Serving" category with an official benchmark.

04/2026

Disaggregated GB200 NVL72 reaches **42,000 tok/s** on Llama 3.3 70B, 8.75x the vLLM baseline.

## 13. Implementation Checklist: a rollout roadmap

The checklist below is distilled from teams that have deployed disaggregated serving successfully:

1. **Measure the colocated baseline**: benchmark vLLM or SGLang on the production workload and record TTFT/TPOT/throughput.
2. **Analyze the workload mix**: ratio of prefill to decode tokens, prompt length distribution, number of tenants, prefix sharing.
3. **Choose hardware**: prefill needs compute (H100/B200), decode needs HBM bandwidth (H200/GB200).
4. **Build the KV transport**: prefer NVLink (intra-rack) > InfiniBand > RoCE > TCP. NIXL is a required layer.
5. **Deploy Dynamo or Mooncake**: Dynamo for an NVIDIA-first ecosystem, Mooncake for maximum cache hit rate.
6. **Set up the KV cache store**: a 16-32 node Redis 8 cluster with allkeys-lfu and RDMA transport.
7. **End-to-end observability**: OpenTelemetry → ClickHouse with a detailed per-phase, per-worker schema.
8. **Tune the Smart Router**: 32-128 hash prefix tokens depending on the workload, with the cache lookup before dispatch.
9. **SLO-driven autoscaling**: target TTFT/TPOT rather than CPU/memory thresholds.
10. **Chaos testing**: kill prefill/decode workers at random and measure the impact with router failover.

## 14. Conclusion

Disaggregated LLM Serving is no longer a research project; it has become the **default architecture** for large-scale AI serving systems in 2026. Separating Prefill and Decode lets each phase run on hardware tuned for it, removes interference, and unlocks 3-8x the throughput of traditional colocated serving. With NVIDIA Dynamo, Mooncake, and a mature Kubernetes ecosystem, deployment is feasible even for teams with limited resources, provided they pick the right moment and configuration.

For Multi-Agent systems, disaggregation matters even more: it lets the system respond flexibly to workload that is unevenly distributed across agents, something a monolithic architecture cannot do. Combined with [Redis 8 Vector Sets](/redis-8-vector-sets-2026-kien-truc-native-vector-search-voi-hnsw-quantization-q8bin-va-hybrid-filter-cho-multi-agent-ai-1031), [Prompt Caching](/prompt-caching-context-caching-2026-kien-truc-tai-su-dung-kv-cache-provider-level-cho-claude-openai-gemini-voi-redis-edge-va-clickhouse-analytics-1032), and [LLM Gateway](/llm-gateway-2026-cong-ket-noi-ai-thong-minh-cho-multi-agent-voi-redis-va-clickhouse-1022), disaggregated serving completes the picture of modern AI infrastructure: every byte of KV cache is reused, every FLOP lands in the right place, and every request takes the best available path.

#### Next steps

If you operate a vLLM or SGLang serving cluster and see TTFT p95 above 1 second, or throughput hitting the GPU ceiling at peak hours, **disaggregation is the clear way out**. Start with a POC of 2 prefill + 2 decode H100 nodes on Dynamo, connected with NIXL over InfiniBand, with ClickHouse for observability. After 2-4 weeks of measurement against the baseline, decide how far to scale.

## 15. References

- [NVIDIA Dynamo — GitHub repository](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Blog — Introducing Dynamo Distributed Inference Framework](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/)
- [Mooncake — KV cache centric serving (GitHub)](https://github.com/kvcache-ai/Mooncake)
- [Mooncake Paper — arXiv:2407.00079](https://arxiv.org/abs/2407.00079)
- [DistServe Paper — OSDI 2024, arXiv:2401.09670](https://arxiv.org/abs/2401.09670)
- [Splitwise Paper — Microsoft Research, arXiv:2311.18677](https://arxiv.org/abs/2311.18677)
- [vLLM Docs — Disaggregated Prefilling](https://docs.vllm.ai/en/latest/features/disagg_prefill.html)
- [SGLang v0.3 — Disaggregated Serving](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)
- [NIXL — NVIDIA Inference Xfer Library](https://github.com/ai-dynamo/nixl)
- [KServe — Dynamo Integration Guide](https://kserve.github.io/website/master/modelserving/v1beta1/llm/dynamo/)
- [Redis 8 — AI workloads documentation](https://redis.io/docs/latest/operate/oss_and_stack/stack-with-enterprise/ai/)
- [ClickHouse — Observability pipelines](https://clickhouse.com/blog/building-observability-pipelines-with-clickhouse)
- [MLPerf Inference v4.1 Results — MLCommons](https://mlcommons.org/benchmarks/inference-datacenter/)

