# RAG Pipeline 2026: Building a Hallucination-Free AI Architecture for Production

Posted on: 5/4/2026 10:15:39 AM

## What Is RAG and Why Does AI Need It?

**Retrieval-Augmented Generation (RAG)** is an architecture that combines *information retrieval* with *text generation* so that a large language model (LLM) answers from real data instead of "making things up" from what it memorized during training. Rather than fine-tuning the whole model on new data, which is costly and slow, RAG simply supplies the relevant context in the prompt at inference time.

- **73%** of RAG failures come from retrieval, not generation
- **90%+** of enterprise AI apps use RAG in 2026
- **10-30%** precision gain from adding a reranker
- **$0.02-0.10** average cost per query for Agentic RAG

#### The core problem RAG solves

LLMs are trained on static data with a fixed knowledge cutoff. Asked about internal company data, new products, or events after training, the model will **hallucinate**: it produces an answer that sounds plausible but is entirely wrong. RAG addresses this by retrieving real documents before generating the answer, turning the LLM from "guessing" into "reading, then answering".

## RAG Pipeline Architecture Overview

A production RAG pipeline has two main phases: **Indexing** (offline data ingestion) and **Querying** (real-time retrieval). Each phase consists of several steps that can be optimized independently.

```mermaid
flowchart TB
    subgraph Indexing["⚙️ Indexing Phase (Offline)"]
        A["📄 Documents\nPDF, Markdown, HTML, DB"] --> B["✂️ Chunking\nSemantic / Recursive"]
        B --> C["🔢 Embedding\nOpenAI / Azure / Local"]
        C --> D["💾 Vector Store\npgvector / Qdrant / Weaviate"]
        A --> E["📝 BM25 Index\nKeyword Search"]
    end

    subgraph Querying["🔍 Querying Phase (Real-time)"]
        F["👤 User Query"] --> G["🔢 Query Embedding"]
        G --> H["🔎 Hybrid Search\nVector + BM25 → RRF"]
        H --> I["🏆 Reranker\nCross-encoder Rescoring"]
        I --> J["📋 Context Assembly\nTop-K Documents"]
        J --> K["🤖 LLM Generation\nPrompt + Context → Answer"]
    end

    D --> H
    E --> H

    style Indexing fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Querying fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style A fill:#e94560,stroke:#fff,color:#fff
    style F fill:#e94560,stroke:#fff,color:#fff
    style K fill:#4CAF50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#2c3e50,stroke:#fff,color:#fff
```

RAG pipeline architecture overview with Hybrid Search and Reranking

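## Chunking: The Art of Splitting Documents

Chunking is the first and most important step of indexing. Each chunk should carry enough self-contained meaning to answer a question on its own. Chunks that are too small lose context; chunks that are too large dilute the relevance score.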

### Common Chunking Strategies

| Strategy | How it works | Pros | Cons | When to use |
| --- | --- | --- | --- | --- |
| **Fixed-size** | Split at a fixed token count (512-1024) with 20-25% overlap | Simple, fast, easy to control size | Can cut mid-sentence or mid-idea | Homogeneous data, baseline |
| **Recursive** | Split in order: heading → paragraph → sentence → token | Preserves document structure | Uneven chunk sizes | Clearly structured documents (Markdown, HTML) |
| **Semantic** | Uses embedding similarity; starts a new chunk when cosine similarity between consecutive sentences drops below a threshold | Semantically complete chunks | Slower, needs an embedding model | Long, multi-topic documents |
| **Agentic** | An LLM analyzes the text and decides chunk boundaries | Highest chunk quality | Very slow, incurs LLM cost | Complex, high-value documents |
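
As a point of comparison for the table above, here is a minimal sketch of the fixed-size strategy with overlap. It assumes the text has already been tokenized into a list; `chunk_fixed` and the sample token list are hypothetical names used only for illustration.

```python
def chunk_fixed(tokens, size=512, overlap=0.2):
    """Split a token list into windows of `size` tokens.

    Consecutive windows share roughly `overlap` (a fraction) of their tokens.
    """
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and overlap must be in [0, 1)")
    step = max(1, int(size * (1 - overlap)))  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks


# Stand-in for real tokenizer output (assumption: 1200 tokens).
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=512, overlap=0.2)
```

The window advances by `size * (1 - overlap)` tokens, so the tail of each chunk is repeated at the head of the next one. That repetition is what keeps a sentence cut at a boundary retrievable from at least one chunk.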

#### Production baseline

Most production systems use **Recursive Chunking** with a chunk size of 512-1024 tokens and 20% overlap, which is generally the best balance between quality and speed. Semantic chunking gives better results but is only worth the cost for complex, multi-topic data.

### Example: Recursive Chunking with Semantic Kernel

```csharp
using Microsoft.SemanticKernel.Text;

// Two-stage split: break the text into lines of at most 128 tokens,
// then merge lines into chunks of up to 512 tokens with a 100-token overlap (~20%).
var lines = TextChunker.SplitPlainTextLines(rawText, maxTokensPerLine: 128);
var paragraphs = TextChunker.SplitPlainTextParagraphs(lines,
    maxTokensPerParagraph: 512,
    overlapTokens: 100);

// rawText, fileName, embeddingModel, vectorStore, and DocumentChunk are
// application-specific and assumed to be defined elsewhere.
var index = 0;
foreach (var chunk in paragraphs)
{
    var embedding = await embeddingModel.GenerateEmbeddingAsync(chunk);
    await vectorStore.UpsertAsync(new DocumentChunk
    {
        Id = Guid.NewGuid().ToString(),
        Content = chunk,
        Embedding = embedding,
        Metadata = new { Source = fileName, ChunkIndex = index++ }
    });
}
```

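## Embedding: Turning Text into Vectors

An embedding model converts text into a high-dimensional numeric vector, where passages with similar meaning lie close together in the vector space. Embedding quality directly determines retrieval quality.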

| Model | Dimensions | MTEB Score | Cost | Notes |
| --- | --- | --- | --- | --- |
| **text-embedding-3-large** | 3072 | 64.6 | $0.13/1M tokens | OpenAI, most widely used |
| **text-embedding-3-small** | 1536 | 62.3 | $0.02/1M tokens | Economical, good enough for many use cases |
| **Cohere embed-v4** | 1024 | 67.3 | $0.10/1M tokens | Multimodal support |
| **BGE-M3** | 1024 | 66.1 | Free (self-hosted) | Multilingual, hybrid retrieval |
| **nomic-embed-text** | 768 | 62.4 | Free (self-hosted) | Lightweight, runs well locally |

#### Important note on embeddings

The embedding model must be **consistent** between indexing and querying. If the index was built with `text-embedding-3-large`, queries must use the same model. Changing the model means re-indexing everything, so choose carefully from the start.
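
One way to enforce this rule in code is to store the model name and vector size next to the index and reject any query embedding that does not match. The sketch below is only an illustration; `IndexManifest` and `check_query_embedding` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Metadata written once at indexing time and stored with the index."""
    model: str
    dimensions: int


def check_query_embedding(manifest, model, vector):
    """Return the vector unchanged, or raise if it cannot belong to this index."""
    if model != manifest.model:
        raise ValueError(
            f"index built with {manifest.model!r}, query embedded with {model!r}"
        )
    if len(vector) != manifest.dimensions:
        raise ValueError(
            f"expected {manifest.dimensions} dimensions, got {len(vector)}"
        )
    return vector


manifest = IndexManifest(model="text-embedding-3-large", dimensions=3072)
query_vector = check_query_embedding(manifest, "text-embedding-3-large", [0.0] * 3072)
```

Failing loudly here is cheaper than debugging silently degraded retrieval after someone swaps the query-side model.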

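## Vector Store: Storing and Querying Vectors

A vector store (or vector database) stores embeddings and performs approximate nearest neighbor (ANN) search. The choice of vector store strongly affects latency, scalability, and cost.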

| Vector Store | Type | ANN Algorithm | Filtering | Free Tier | When to choose |
| --- | --- | --- | --- | --- | --- |
| **pgvector** | PostgreSQL extension | IVFFlat, HNSW | Full SQL | Self-hosted | Already running Postgres and don't want another database |
| **Qdrant** | Dedicated | HNSW | Rich filters | 1GB cloud | Dedicated production deployment, high performance |
| **Weaviate** | Dedicated | HNSW | GraphQL-like | Sandbox | Multi-tenant, built-in hybrid search |
| **Azure AI Search** | Managed | HNSW + eKNN | OData filters | Free tier | Azure ecosystem, enterprise |
| **ChromaDB** | Embedded | HNSW | Metadata | Open source | Prototyping, local development |
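
The ANN algorithms in the table (HNSW, IVFFlat) approximate an exact nearest-neighbor scan in exchange for speed. A brute-force version is a useful baseline, for example to spot-check an ANN index's recall on a small sample. This is a pure-Python sketch with made-up vectors; `cosine` and `exact_top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_top_k(query, vectors, k=3):
    """Score every stored vector against the query and return the k best ids."""
    ranked = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 2-dimensional "embeddings" (assumption: real ones have hundreds of dimensions).
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
nearest = exact_top_k([1.0, 0.0], vectors, k=2)
```

The scan costs one similarity computation per stored vector, which is why production stores switch to an ANN index once the collection grows.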

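## Hybrid Search: Combining Vector and Keyword Search

Hybrid search is the single biggest quality improvement you can make to a naive RAG pipeline. Pure vector search misses exact keyword matches, while pure BM25 misses semantically similar text; hybrid search combines both.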

```mermaid
flowchart LR
    Q["User Query"] --> VS["🔢 Vector Search\nSemantic Similarity\nTop 50"]
    Q --> BM["📝 BM25 Search\nKeyword Match\nTop 50"]
    VS --> RRF["🔗 Reciprocal Rank Fusion\nScore = Σ 1/(k + rank_i)"]
    BM --> RRF
    RRF --> RR["🏆 Reranker\nCross-encoder\nTop 5"]
    RR --> CTX["📋 Context\n→ LLM"]

    style Q fill:#e94560,stroke:#fff,color:#fff
    style RRF fill:#2c3e50,stroke:#fff,color:#fff
    style RR fill:#4CAF50,stroke:#fff,color:#fff
    style CTX fill:#f8f9fa,stroke:#e94560,color:#2c3e50
```

Hybrid Search Pipeline: Vector + BM25 → RRF → Reranker → LLM

### Reciprocal Rank Fusion (RRF)

RRF is the most common method for combining results from multiple retrievers. The formula is simple but highly effective:

```plaintext
RRF_score(d) = Σ_i 1 / (k + rank_i(d))

Where:
- d: a document
- rank_i(d): the rank of document d in the i-th retriever
- k: smoothing constant (typically 60)
```

Example: document X ranks 3rd in vector search and 7th in BM25:

```plaintext
RRF_score(X) = 1/(60+3) + 1/(60+7) = 0.0159 + 0.0149 = 0.0308
```
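
The formula translates almost directly into code. The sketch below fuses ranked lists of document ids and reproduces the worked example for X. `rrf_fuse` and the sample rankings are hypothetical, and documents missing from a list simply contribute nothing for that retriever.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each list is ordered best-first; ranks start at 1.
    Returns (doc_id, score) pairs sorted by fused score, highest first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


vector_hits = ["A", "B", "X", "C"]                # X is ranked 3rd
bm25_hits = ["D", "A", "E", "F", "G", "H", "X"]   # X is ranked 7th
fused = rrf_fuse([vector_hits, bm25_hits])
scores = dict(fused)
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which live on different scales.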

### Reranking: Pushing Precision Further

Reranking is the step with the **highest ROI** in a RAG pipeline. After hybrid search returns the top 50 results, a cross-encoder model re-scores each document against the original query with full attention, catching relevance that embedding similarity misses. Precision typically improves by 10-30% at the cost of only 50-100 ms of extra latency.

```csharp
// Reranking with Cohere or another cross-encoder model.
// hybridSearch, rerankerClient, and RerankRequest are application-specific
// and assumed to be defined elsewhere.
var hybridResults = await hybridSearch.SearchAsync(query, topK: 50);

var rerankedResults = await rerankerClient.RerankAsync(new RerankRequest
{
    Query = query,
    Documents = hybridResults.Select(r => r.Content).ToList(),
    TopN = 5,
    Model = "rerank-v3.5"
});

// Join the top documents, most relevant first, into one context string.
var finalContext = string.Join("\n\n---\n\n",
    rerankedResults.Results
        .OrderByDescending(r => r.RelevanceScore)
        .Select(r => hybridResults[r.Index].Content));
```
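
The same two-stage shape can be sketched independently of any vendor: take the candidates, score each (query, document) pair, and keep the best few. In the sketch below `overlap_score` is a deliberately crude stand-in for a cross-encoder (an assumption made so the example runs without a model); in practice `score_fn` would call the reranker.

```python
def rerank(query, documents, score_fn, top_n=5):
    """Score every candidate against the query and return the top_n (doc, score) pairs."""
    scored = [(score_fn(query, doc), i) for i, doc in enumerate(documents)]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(documents[i], score) for score, i in scored[:top_n]]


def overlap_score(query, doc):
    """Toy relevance score: fraction of query terms that appear in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0


candidates = [
    "reset your password in settings",
    "billing and invoices",
    "how to reset a forgotten password",
]
top = rerank("reset password", candidates, overlap_score, top_n=2)
```

Keeping the scorer as a parameter makes it easy to swap the reranker or A/B test it against no reranking at all.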

## Generation: From Context to Answer

The generation step takes the retrieved context and passes it to the LLM together with the original question. Prompt engineering at this step determines the quality of the output:

```csharp
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// retrievedContext, userQuery, and chatCompletionService are assumed to be defined earlier.
// The $ prefix enables interpolation so {retrievedContext} is replaced with the retrieved text.
var systemPrompt = $"""
    You are an AI assistant that answers questions based on the provided documents.

    RULES:
    1. Answer ONLY from the information in [CONTEXT] below
    2. If the context does not contain enough information, say clearly
       "I could not find this information in the documents"
    3. Cite specific sources when answering (file name, section)
    4. Do NOT make up information beyond the context

    [CONTEXT]
    {retrievedContext}
    """;

var chatHistory = new ChatHistory();
chatHistory.AddSystemMessage(systemPrompt);
chatHistory.AddUserMessage(userQuery);

var response = await chatCompletionService.GetChatMessageContentAsync(
    chatHistory,
    new OpenAIPromptExecutionSettings { Temperature = 0.1 });
```

#### Low temperature for RAG

For RAG, set `Temperature` between 0.0 and 0.2 so the LLM stays close to the retrieved context. A higher temperature makes the model more "creative", which is exactly what we want to avoid when factual answers are needed.
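
To see why a low temperature keeps the output close to the most likely continuation, it helps to look at how temperature rescales token probabilities. The sketch below applies the standard temperature-scaled softmax to three made-up logits. How a particular provider handles a temperature of exactly 0 is an assumption here (commonly treated as greedy decoding), so the function rejects it.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)   # nearly all mass on the top token
high = softmax_with_temperature(logits, 1.5)  # mass spread across all candidates
```

At 0.1 the top token takes almost all of the probability mass, while at 1.5 it takes only a little over half. Note that low temperature only reduces sampling randomness; grounding still depends on the retrieved context and the prompt rules.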

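## Advanced RAG Patterns

Basic RAG (Naive RAG) works well for many use cases, but when requirements for accuracy and complex-query handling are higher, the following three advanced patterns stand out in 2026.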

### Corrective RAG (CRAG)

Corrective RAG adds a **quality evaluation** step after retrieval. If the retrieved documents are not relevant enough, the system corrects itself: it rewrites the query, widens the search (web search), or filters out noisy documents before passing them to the LLM.

```mermaid
flowchart TB
    A["👤 Query"] --> B["🔎 Retrieval"]
    B --> C{"📊 Relevance\nEvaluation"}
    C -->|"✅ Relevant"| D["🤖 Generate Answer"]
    C -->|"⚠️ Ambiguous"| E["✏️ Query Rewrite\n+ Re-retrieve"]
    C -->|"❌ Irrelevant"| F["🌐 Web Search\nFallback"]
    E --> C
    F --> D

    style A fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#4CAF50,stroke:#fff,color:#fff
    style F fill:#ff9800,stroke:#fff,color:#fff
```

Corrective RAG: evaluates and corrects retrieval before generation
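
The control flow in the diagram can be sketched as a small loop. Everything passed in (`retrieve`, `grade`, `rewrite`, `web_search`, `generate`) is a hypothetical callable standing in for a real component. The cap on rewrites is an added assumption so that the "Ambiguous" branch cannot loop forever.

```python
def corrective_rag(query, retrieve, grade, rewrite, web_search, generate, max_rewrites=2):
    """Retrieve, grade the result, and correct course before generating."""
    docs = retrieve(query)
    verdict = grade(query, docs)  # expected: "relevant", "ambiguous", or "irrelevant"
    rewrites = 0
    while verdict == "ambiguous" and rewrites < max_rewrites:
        query = rewrite(query)    # rewrite the query, then retrieve and grade again
        docs = retrieve(query)
        verdict = grade(query, docs)
        rewrites += 1
    if verdict != "relevant":     # irrelevant, or still ambiguous after the cap
        docs = web_search(query)
    return generate(query, docs)


# Toy stand-ins so the sketch runs end to end.
corpus = {"paid time off policy": ["Employees accrue 1.5 PTO days per month."]}

def retrieve(q):
    return corpus.get(q, [])

def grade(q, docs):
    if docs:
        return "relevant"
    return "ambiguous" if q == "pto" else "irrelevant"

def rewrite(q):
    return "paid time off policy"

def web_search(q):
    return [f"web result for: {q}"]

def generate(q, docs):
    return f"{q} -> {docs[0]}"

answer = corrective_rag("pto", retrieve, grade, rewrite, web_search, generate)
fallback = corrective_rag("weather", retrieve, grade, rewrite, web_search, generate)
```

In a real system `grade` would typically be a lightweight classifier or an LLM judge, which is where most of the added latency and cost of this pattern comes from.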

### Self-RAG

Self-RAG trains the model to emit special **reflection tokens** at each step to check its own work:

- **[Retrieve]**: "Is more retrieval needed?" (decides when to retrieve)
- **[ISREL]**: "Is this document relevant to the query?" (filters noise)
- **[ISSUP]**: "Is the answer supported by the document?" (checks grounding)
- **[ISUSE]**: "Is the answer useful?" (rates overall quality)

#### Self-RAG vs Corrective RAG

**Corrective RAG** uses an external agent or classifier for evaluation, which suits cases where the base LLM must stay unchanged. **Self-RAG** builds reflection into the model itself, giving lower latency but requiring fine-tuning. In production in 2026, Corrective RAG is more common because it is easy to deploy with any LLM.

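### Agentic RAG

Agentic RAG is the most powerful pattern: it turns the RAG pipeline into a multi-agent system that can plan, analyze the query, pick the right tools, and adjust its own workflow. It is the dominant pattern for enterprise AI in 2026.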

```mermaid
flowchart TB
    U["👤 Complex Query"] --> P["🧠 Planner Agent\nAnalyze & decompose query"]

    P --> S1["🔎 Retrieval Agent\nSearch internal docs"]
    P --> S2["🌐 Web Agent\nSearch the internet"]
    P --> S3["🗃️ SQL Agent\nQuery database"]

    S1 --> V["✅ Validator Agent\nCheck relevance\n& consistency"]
    S2 --> V
    S3 --> V

    V --> SY["📝 Synthesizer Agent\nCombine multiple sources"]
    SY --> R["💬 Final Answer\n+ Citations"]

    style U fill:#e94560,stroke:#fff,color:#fff
    style P fill:#2c3e50,stroke:#fff,color:#fff
    style V fill:#ff9800,stroke:#fff,color:#fff
    style R fill:#4CAF50,stroke:#fff,color:#fff
```

Agentic RAG: multi-agent pipeline with planning, parallel retrieval, validation, and synthesis

The four core capabilities of Agentic RAG:

- **Reflection**: the agent evaluates its own answer, then detects and fixes errors
- **Planning**: breaks a complex query into sub-tasks and plans their execution
- **Tool Use**: picks the right tool (vector search, SQL, web, API) for the context
- **Multi-agent Collaboration**: multiple specialized agents work in parallel
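
A stripped-down sketch of the planner, parallel tools, validator, and synthesizer flow is shown below. All callables are hypothetical stand-ins (in a real system each would wrap an LLM or a data source), and threads are just one possible way to run the tool calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def agentic_rag(query, plan, tools, validate, synthesize):
    """Plan sub-tasks, run the chosen tools in parallel, validate, then synthesize."""
    tasks = plan(query)  # list of (tool_name, sub_query) pairs
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(tools[name], sub_query) for name, sub_query in tasks]
        findings = [future.result() for future in futures]  # kept in plan order
    kept = [finding for finding in findings if validate(query, finding)]
    return synthesize(query, kept)


# Toy stand-ins so the sketch runs end to end.
def plan(query):
    return [("docs", query), ("sql", query)]

tools = {
    "docs": lambda q: f"docs: policy text about {q}",
    "sql": lambda q: "",  # empty result, to be dropped by the validator
}

def validate(query, finding):
    return bool(finding)

def synthesize(query, findings):
    return " | ".join(findings)

answer = agentic_rag("refund window", plan, tools, validate, synthesize)
```

Reflection is not shown here; it would typically wrap this function in a loop that re-plans when the synthesized answer fails a final check.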

## Evaluating RAG Quality: The RAGAS Framework

Measuring the quality of a RAG pipeline is a major challenge. **RAGAS** (Retrieval Augmented Generation Assessment) is the most popular evaluation framework and scores a pipeline along four axes:

| Metric | What it measures | Formula (simplified) | Target |
| --- | --- | --- | --- |
| **Faithfulness** | Is the answer consistent with the context? | Supported claims / Total claims | > 0.85 |
| **Answer Relevancy** | Is the answer relevant to the question? | Cosine similarity between the question embedding and embeddings of questions regenerated from the answer | > 0.80 |
| **Context Precision** | Is the retrieved context accurate? | Relevant chunks in top-K / K | > 0.75 |
| **Context Recall** | Was any important information missed? | Relevant info found / Total relevant info | > 0.80 |

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# eval_dataset, evaluation_llm, and embedding_model are assumed to be defined elsewhere.
result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=evaluation_llm,
    embeddings=embedding_model
)

print(result)
# Example output (illustrative values):
# {'faithfulness': 0.89, 'answer_relevancy': 0.85,
#  'context_precision': 0.78, 'context_recall': 0.82}
```
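
To make the simplified formulas in the table concrete, the sketch below computes three of them from hand-labelled booleans. This is not how RAGAS works internally (it derives such judgments with an LLM), and the function names are hypothetical. It only illustrates the ratios.

```python
def faithfulness_score(claim_supported):
    """Supported claims / total claims, given one boolean per claim in the answer."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def precision_at_k(chunk_relevant, k):
    """Relevant chunks among the top k / k, given one boolean per retrieved chunk."""
    if k <= 0:
        return 0.0
    return sum(chunk_relevant[:k]) / k


def recall_score(found, total_relevant):
    """Relevant facts found in the context / all relevant facts in the reference."""
    return found / total_relevant if total_relevant else 0.0


faith = faithfulness_score([True, True, False, True])        # 3 of 4 claims supported
prec = precision_at_k([True, False, True, True, False], 5)   # 3 of top 5 relevant
rec = recall_score(found=4, total_relevant=5)
```

A small hand-labelled set like this is also a reasonable sanity check on whatever automated judge the evaluation uses.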

## Production Checklist for a RAG Pipeline

A RAG pipeline that works well in a demo is a long way from one that works in production. The checklist below covers the factors to weigh when putting RAG into a real system:

- **Chunking & Indexing**: Choose a chunk size that fits the data type (512-1024 tokens). Store metadata (source, date, section) for filtering. Set up an incremental indexing pipeline rather than re-indexing everything on each update.
- **Retrieval Quality**: Use hybrid search (vector + BM25) rather than vector search alone. Add a reranker for higher precision. Apply metadata filtering to narrow the search scope (by date, source, category).
- **Prompt Engineering**: Write a clear system prompt: answer only from the context, and admit when the answer is unknown. Keep temperature low (0.0-0.2). Limit the amount of context passed in, since too much context adds noise.
- **Monitoring & Evaluation**: Log every query, its retrieved chunks, and the generated answer. Measure RAGAS metrics regularly. Alert when the faithfulness score drops below a threshold. A/B test chunking strategies.
- **Security & Cost**: Enforce row-level security on the vector store (multi-tenant). Rate-limit embedding API calls. Cache embeddings for repeated queries. Estimate cost across embedding, LLM, and vector DB storage.
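
As one illustration of the "cache embeddings for repeated queries" item, the sketch below keeps an in-memory cache keyed by a hash of the normalized query text. `EmbeddingCache` and `fake_embed` are hypothetical. Lowercasing and collapsing whitespace before hashing is an assumption that suits case-insensitive use and may not be appropriate for every embedding model.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache that avoids re-embedding queries it has already seen."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text):
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)  # the only paid API call
        return self._store[key]


def fake_embed(text):
    """Stand-in for a real embedding API call."""
    return [float(len(text)), 0.0]


cache = EmbeddingCache(fake_embed)
first = cache.get("How do I reset my password?")
second = cache.get("how do I  reset my password?")  # same query after normalization
```

A production version would add eviction or a TTL and would usually live in a shared store such as Redis rather than process memory.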

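#### The most common mistake

73% of RAG failures come from retrieval, not generation. When the output is wrong, inspect the retrieved documents first: the system has very likely retrieved the wrong documents or missed important information. Don't rush to blame the LLM; fix retrieval first.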

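## Conclusion

RAG is not just "stuffing context into a prompt". It is a complex system that requires tuning every step: chunking determines the quality of the raw material, hybrid search plus reranking determines retrieval accuracy, and prompt engineering determines the quality of the final answer. In 2026 the clear trend is the move from Naive RAG to **Agentic RAG**, where AI agents plan, evaluate, and correct themselves across the whole pipeline.

With Agentic RAG costing only $0.02-0.10 per query and a maturing tooling ecosystem (Semantic Kernel, LangChain, LlamaIndex), this is a good time to start building a RAG pipeline for your AI application.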
