# Agentic RAG: When RAG Meets Autonomous AI Agents

Posted on: 5/11/2026 10:17:34 AM

Table of contents

1. How Traditional RAG Works
   - 1.1. Limitations of Traditional RAG
   - Core problem
2. What Is Agentic RAG
   - Definition of Agentic RAG
3. Agentic RAG Architecture in Detail
   - 3.1. Core Components
4. Four Main Patterns in Agentic RAG
5. Building Agentic RAG with LangGraph
6. Evaluation and Monitoring in Production
   - 6.1. Three Evaluation Tiers
   - 6.2. Observability for the Agent Loop
7. Best Practices for Production
8. When to Use Agentic RAG and When Not To
9. Conclusion

Retrieval-Augmented Generation (RAG) has become a foundational technique that lets LLMs access external data instead of relying only on what they learned during training. Traditional RAG, however, is **single-pass**: it queries once, takes the results, and generates an answer. That design shows its limits on complex, multi-step questions and on questions that require reasoning across several data sources. **Agentic RAG** is the next step in that evolution: it turns the rigid RAG pipeline into an autonomous agent that can plan, retrieve iteratively, evaluate its own output, and correct mistakes until it reaches a trustworthy answer.

57% of organizations have deployed AI agents in production (2026)

3-10x token cost of Agentic RAG compared with traditional RAG

33.3% growth in hybrid retrieval, the fastest-growing area in RAG

≥0.9 faithfulness target for production Agentic RAG

## 1. How Traditional RAG Works

Traditional RAG follows a linear pipeline with three basic steps:

```
graph LR
    A["User question"] --> B["Embedding & Search"]
    B --> C["Top-K Documents"]
    C --> D["LLM + Context"]
    D --> E["Answer"]
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#4CAF50,stroke:#fff,color:#fff
```

Figure 1: The traditional RAG pipeline is one-way, with no feedback loop

1. **Embed**: Convert the question into a vector embedding.
2. **Retrieve**: Find the top-K nearest document chunks in the vector store.
3. **Generate**: Put the retrieved context into the prompt and let the LLM generate the answer.
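The three steps can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, the `CHUNKS` list stands in for a vector store, and the final LLM call is left out, so the sketch stops at the assembled prompt. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Illustrative corpus, not real data
CHUNKS = [
    "Refunds are processed within 14 days of the request.",
    "Our office is open Monday to Friday.",
    "Shipping takes 3 to 5 business days.",
]

top = retrieve("How long do refunds take?", CHUNKS, k=1)
prompt = build_prompt("How long do refunds take?", top)
print(prompt)
```

Note that nothing in this flow checks whether `top` is actually relevant: whatever the retriever returns goes straight into the prompt.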

This model works well for simple, single-hop questions such as an FAQ chatbot or internal document lookup. It shows serious weaknesses, however, in more complex situations.

### 1.1. Limitations of Traditional RAG

#### Core problem

Traditional RAG is **stateless and single-pass**. It cannot (1) assess the quality of retrieval results, (2) decide whether more information is needed, (3) decompose a complex question into sub-queries, or (4) pick the most suitable data source for each part of a question.

Specifically, traditional RAG struggles with:

- **Multi-hop questions**: Questions that require combining information from several documents. Example: "Compare the pricing strategies of company A and company B in Q1 2026" needs retrieval from at least two sources followed by synthesis.
- **Ambiguous queries**: When the question is vague, retrieval returns unrelated documents, and the system has no mechanism to notice this and try again.
- **Dynamic knowledge**: Data changes continuously, but a fixed pipeline does not know when to refresh or when to query in real time.
- **Reasoning gaps**: A correct answer may require several reasoning steps, and a single pass leaves no room for them.

## 2. What Is Agentic RAG

Agentic RAG combines the strengths of RAG with the autonomous decision-making of an AI agent. Instead of a rigid linear pipeline, it turns the LLM into an **agent that can plan, execute, evaluate, and repeat** the retrieval process until the result is trustworthy.

#### Definition of Agentic RAG

Agentic RAG is an architecture in which the LLM acts as a **decision-making agent**: it decides *when* to retrieve, *which source* to query, *how* to reformulate the question, and *whether* the result is good enough or needs another iteration. It moves RAG from a "retrieve-and-read" model to "plan-retrieve-reason-critique-refine".

The core characteristics that distinguish Agentic RAG from traditional RAG:

| Characteristic | Traditional RAG | Agentic RAG |
| --- | --- | --- |
| **Control flow** | Linear, single-pass | Conditional loop (cyclic) |
| **Decision-making** | None: always retrieve, then generate | The agent decides whether to retrieve, skip, rewrite, or stop |
| **Result evaluation** | None: answers immediately with the available context | Self-grading: evaluates relevance, hallucination, completeness |
| **Error handling** | None: poor retrieval leads to poor output | Self-corrective: rewrite the query, switch sources, retry |
| **Data sources** | Usually a single vector store | Multi-source: vector DB, SQL, API, web search, tools |
| **Cost** | 1x tokens, low latency | 3-10x tokens, higher latency |
| **Best suited for** | FAQ, single-corpus, latency-sensitive | Multi-hop, high-stakes, cross-source reasoning |

## 3. Agentic RAG Architecture in Detail

The Agentic RAG architecture organizes its components into a **directed state machine with loops** (a directed cyclic graph). Each node in the graph represents one processing step, and conditional edges choose the next step based on evaluation results.

```
graph TD
    A["Input question"] --> B["Query Analyzer"]
    B --> C{"Retrieval needed?"}
    C -->|Yes| D["Query Router"]
    C -->|No| J["Direct Answer"]
    D --> E["Vector Store"]
    D --> F["SQL Database"]
    D --> G["Web Search"]
    D --> H["API / Tools"]
    E --> I["Relevance Grader"]
    F --> I
    G --> I
    H --> I
    I --> K{"Results good enough?"}
    K -->|No| L["Query Rewriter"]
    L --> D
    K -->|Yes| M["Response Generator"]
    M --> N["Hallucination Checker"]
    N --> O{"Hallucination?"}
    O -->|Yes| L
    O -->|No| P["Answer Grader"]
    P --> Q{"Answer complete?"}
    Q -->|No| L
    Q -->|Yes| R["Final Response"]
    J --> R
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style I fill:#2c3e50,stroke:#fff,color:#fff
    style L fill:#ff9800,stroke:#fff,color:#fff
    style M fill:#2c3e50,stroke:#fff,color:#fff
    style N fill:#2c3e50,stroke:#fff,color:#fff
    style P fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#4CAF50,stroke:#fff,color:#fff
```

Figure 2: Agentic RAG architecture, a self-corrective loop over multiple retrieval sources

### 3.1. Core Components

- **Query Analyzer**: Analyzes the question to determine whether retrieval is needed or whether it can be answered directly. This is the "think before you act" step, and it avoids unnecessary retrieval for simple questions.
- **Query Router**: Routes the question to the most suitable data source. For example, questions about financial figures go to the SQL database, policy questions go to the vector store, and questions about recent news go to web search.
- **Relevance Grader**: Scores how relevant the retrieved documents are. If they are not relevant enough, it triggers query rewriting instead of forcing the LLM to generate from poor context.
- **Query Rewriter**: Rewrites the question based on the grader's feedback. It can decompose a complex question into sub-queries, add context, or change keywords.
- **Hallucination Checker**: Checks whether the answer stays grounded in the retrieved context or makes things up. A faithfulness score of ≥ 0.9 is the production target.
- **Answer Grader**: Evaluates the answer as a whole: does it actually address the original question? If something is missing, the flow goes back into the loop.

## 4. Four Main Patterns in Agentic RAG

### 4.1. Adaptive Retrieval

The agent decides **whether retrieval is needed at all** based on the complexity of the question. A simple question such as "What is Python?" is answered directly, while a question about specific data triggers retrieval. This reduces token cost and latency for cases that need no external knowledge.

```
graph LR
    A["Query"] --> B["Complexity Classifier"]
    B -->|Simple| C["Direct LLM"]
    B -->|Complex| D["Retrieval Pipeline"]
    B -->|Very complex| E["Multi-Step Retrieval"]
    C --> F["Response"]
    D --> F
    E --> F
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style F fill:#4CAF50,stroke:#fff,color:#fff
```

Figure 3: Adaptive Retrieval, where the agent picks a strategy based on complexity
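A minimal sketch of the complexity classifier, assuming a keyword heuristic purely for illustration. In practice this decision is usually made by an LLM or a trained classifier; the marker lists and the `classify_complexity` function below are hypothetical and not part of any library.

```python
# Hypothetical keyword markers; a real system would likely use an LLM classifier.
MULTI_STEP_MARKERS = ("compare", " versus ", " vs ", "and why", "cause")
DATA_MARKERS = ("revenue", "policy", "q1", "q4", "2025", "2026")

STRATEGY = {
    "direct": "Direct LLM",
    "retrieve": "Retrieval Pipeline",
    "multi_step": "Multi-Step Retrieval",
}


def classify_complexity(question: str) -> str:
    """Map a question to one of the three branches in the diagram."""
    q = question.lower()
    if any(marker in q for marker in MULTI_STEP_MARKERS):
        return "multi_step"
    if any(marker in q for marker in DATA_MARKERS):
        return "retrieve"
    return "direct"


for question in (
    "What is Python?",
    "What was company A's revenue in Q1 2026?",
    "Compare the pricing of company A and company B",
):
    print(question, "->", STRATEGY[classify_complexity(question)])
```

The ordering matters: the multi-step check runs first so that a comparison question mentioning data is not downgraded to a single retrieval.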

### 4.2. Self-Corrective RAG (CRAG)

This is arguably the most important Agentic RAG pattern. After retrieval, the agent **evaluates the quality of the results** and corrects course if needed:

1. **Retrieve**: Fetch documents from the knowledge base.
2. **Grade**: Score the relevance of each document.
3. **Decide**: If relevant, generate. If ambiguous, rewrite the query. If irrelevant, fall back to web search or another source.
4. **Validate**: Check for hallucination and completeness before returning the result.

#### CRAG in practice

Consider CRAG applied to an internal documentation system. When a user asks "What is the latest WFH policy?", the agent retrieves from the vector store. If the document is too old (more than 6 months), the grader marks it as "ambiguous", so the agent rewrites the query as "work from home policy 2026 update" and tries again. If it still finds nothing, it falls back to the company intranet API. The whole process runs automatically and is transparent to the end user.
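The retrieve, grade, and decide steps can be sketched as a plain control loop. This is a simplified illustration, not a reference implementation: the retriever, grader, rewriter, fallback, and generator are injected as callables, the toy implementations below are hypothetical stand-ins for real components, and the final validation step (hallucination and completeness checks) is omitted for brevity.

```python
from typing import Callable, Optional


def corrective_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], str],  # "relevant" | "ambiguous" | "irrelevant"
    rewrite: Callable[[str], str],
    fallback: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    max_retries: int = 3,
) -> dict:
    """Retrieve, grade, then generate, rewrite, or fall back."""
    query = question
    trace: list[str] = []
    for attempt in range(max_retries + 1):
        docs = retrieve(query)
        grades = [grade(question, d) for d in docs]
        relevant = [d for d, g in zip(docs, grades) if g == "relevant"]
        if relevant:
            trace.append("generate")
            return {"answer": generate(question, relevant), "trace": trace}
        if "ambiguous" in grades and attempt < max_retries:
            trace.append("rewrite")
            query = rewrite(query)
            continue
        break
    # Nothing usable in the primary source: try the fallback source once
    trace.append("fallback")
    docs = fallback(question)
    answer: Optional[str] = generate(question, docs) if docs else None
    return {"answer": answer, "trace": trace}


# Toy stand-ins mirroring the WFH example (illustrative data only)
KB = {
    "wfh policy": ["WFH policy (2023): two remote days per week."],
    "work from home policy 2026 update": ["WFH policy (2026): three remote days per week."],
}

result = corrective_rag(
    "wfh policy",
    retrieve=lambda q: KB.get(q, []),
    grade=lambda question, doc: "relevant" if "2026" in doc else "ambiguous",
    rewrite=lambda q: "work from home policy 2026 update",
    fallback=lambda q: [],
    generate=lambda question, docs: docs[0],
)
print(result)
```

Returning the `trace` alongside the answer keeps the decisions visible, which becomes useful once the loop is monitored in production.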

### 4.3. Multi-Step Retrieval

For multi-hop questions, the agent **decomposes the question into a chain of sub-queries**, runs retrieval sequentially, and synthesizes the results. Each retrieval step uses context from the previous step to refine the next query.

Example question: "Did company A's Q1 2026 revenue grow compared with Q4 2025, and what was the main reason?"

- **Step 1**: Retrieve "company A revenue Q1 2026" to get the specific figure.
- **Step 2**: Retrieve "company A revenue Q4 2025" to enable the comparison.
- **Step 3**: Retrieve "analysis of causes of revenue increase/decrease, company A 2026" to get the explanation.
- **Synthesize**: Combine the three results into a complete answer.
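The sketch below shows the sequential part of this pattern: each sub-query can be refined with what earlier steps found. The `FACTS` dictionary, its figures, and the `lookup` and `refine` helpers are invented for illustration only; a real system would call a retriever and let an LLM perform the refinement.

```python
from typing import Callable

# Illustrative knowledge base; the figures are made up for this example.
FACTS = {
    "revenue Q1 2026 company A": "120",
    "revenue Q4 2025 company A": "100",
    "revenue change drivers company A 2026 (increase)": "New enterprise contracts.",
}


def lookup(query: str) -> str:
    """Stand-in retriever: exact-match lookup."""
    return FACTS.get(query, "")


def refine(sub_query: str, findings: list[tuple[str, str]]) -> str:
    """Use earlier findings to sharpen the next query."""
    if "drivers" in sub_query and len(findings) >= 2:
        q1, q4 = float(findings[0][1]), float(findings[1][1])
        direction = "increase" if q1 > q4 else "decrease"
        return f"{sub_query} ({direction})"
    return sub_query


def multi_step_retrieve(
    sub_queries: list[str],
    retrieve: Callable[[str], str],
    refine_query: Callable[[str, list[tuple[str, str]]], str],
) -> list[tuple[str, str]]:
    """Run sub-queries in order, feeding accumulated findings into each refinement."""
    findings: list[tuple[str, str]] = []
    for sub_query in sub_queries:
        query = refine_query(sub_query, findings)
        findings.append((query, retrieve(query)))
    return findings


findings = multi_step_retrieve(
    [
        "revenue Q1 2026 company A",
        "revenue Q4 2025 company A",
        "revenue change drivers company A 2026",
    ],
    lookup,
    refine,
)
for query, result in findings:
    print(f"{query} -> {result}")
```

The third query only becomes answerable after the first two results are known, which is exactly what a single-pass pipeline cannot express.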

### 4.4. Router RAG

The agent uses **semantic routing** to choose the best data source for each query. Instead of querying every source and merging the results, the router picks the single most suitable source, which saves cost and reduces noise.

```
graph TD
    A["User Query"] --> B["Semantic Router"]
    B -->|Technical docs| C["Vector Store<br/>Confluence / Notion"]
    B -->|Metrics & Numbers| D["SQL Database<br/>Analytics"]
    B -->|Recent events| E["Web Search<br/>Tavily / Bing"]
    B -->|Code-related| F["Code Search<br/>GitHub API"]
    B -->|Company policy| G["Document Store<br/>SharePoint"]
    C --> H["Merge & Generate"]
    D --> H
    E --> H
    F --> H
    G --> H
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H fill:#4CAF50,stroke:#fff,color:#fff
```

Figure 4: Router RAG, intelligent routing to the appropriate data source

## 5. Building Agentic RAG with LangGraph

LangGraph is one of the most popular frameworks for building Agentic RAG in 2026. It models the whole system as a **directed cyclic graph** with state management, conditional branching, and human-in-the-loop support.

### 5.1. Defining State and Nodes

```python
from typing import List, TypedDict

from langchain_core.documents import Document
from langgraph.graph import StateGraph, END

# Assumed to be defined elsewhere: vector_store (a LangChain vector store),
# relevance_grader and hallucination_grader (LLM chains with structured output
# exposing .binary_score), query_rewriter and rag_chain (chains returning str).


class AgenticRAGState(TypedDict):
    question: str
    documents: List[Document]  # Document objects, since nodes read .page_content
    generation: str
    retry_count: int
    web_search_needed: bool


def retrieve(state: AgenticRAGState) -> AgenticRAGState:
    """Retrieve documents from the vector store."""
    question = state["question"]
    documents = vector_store.similarity_search(question, k=5)
    return {**state, "documents": documents}


def grade_documents(state: AgenticRAGState) -> AgenticRAGState:
    """Grade the relevance of each document."""
    question = state["question"]
    docs = state["documents"]

    relevant_docs = []
    for doc in docs:
        score = relevance_grader.invoke({
            "question": question,
            "document": doc.page_content
        })
        if score.binary_score == "yes":
            relevant_docs.append(doc)

    # Fewer than two relevant documents: supplement with web search
    web_search_needed = len(relevant_docs) < 2

    return {
        **state,
        "documents": relevant_docs,
        "web_search_needed": web_search_needed
    }


def rewrite_query(state: AgenticRAGState) -> AgenticRAGState:
    """Rewrite the question to improve retrieval."""
    question = state["question"]
    better_question = query_rewriter.invoke({
        "question": question,
        "feedback": "Previous retrieval returned insufficient results."
    })
    return {
        **state,
        "question": better_question,
        "retry_count": state["retry_count"] + 1
    }


def generate(state: AgenticRAGState) -> AgenticRAGState:
    """Generate an answer from the graded context."""
    docs_content = "\n\n".join(d.page_content for d in state["documents"])
    generation = rag_chain.invoke({
        "context": docs_content,
        "question": state["question"]
    })
    return {**state, "generation": generation}


def check_hallucination(state: AgenticRAGState) -> str:
    """Check for hallucination and return a routing decision."""
    score = hallucination_grader.invoke({
        "documents": state["documents"],
        "generation": state["generation"]
    })
    # "yes" means the generation is grounded in the documents
    if score.binary_score == "yes":
        return "useful"
    return "not_useful"
```

### 5.2. Building the Graph

```python
workflow = StateGraph(AgenticRAGState)

# web_search_node is assumed to be defined elsewhere; it should append
# web results to state["documents"].
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("web_search", web_search_node)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")

# After grading: supplement with web search or go straight to generation
workflow.add_conditional_edges(
    "grade_documents",
    lambda state: "web_search" if state["web_search_needed"] else "generate",
    {
        "web_search": "web_search",
        "generate": "generate"
    }
)

workflow.add_edge("web_search", "generate")

# After generation: finish if grounded, otherwise rewrite the query
workflow.add_conditional_edges(
    "generate",
    check_hallucination,
    {
        "useful": END,
        "not_useful": "rewrite_query"
    }
)

# Retry budget: stop after three rewrites instead of looping forever
workflow.add_conditional_edges(
    "rewrite_query",
    lambda state: END if state["retry_count"] >= 3 else "retrieve",
    {
        END: END,
        "retrieve": "retrieve"
    }
)

app = workflow.compile()
```

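#### Retry Budget

Always set a retry limit (for example, stop once `retry_count ≥ 3`). Without budget control, Agentic RAG can fall into an infinite loop, which is especially dangerous for questions the knowledge base genuinely cannot answer. When the budget runs out, answer honestly: "I could not find enough information to answer this question." Note that in the graph above the state still holds the last rejected generation when the budget is exhausted, so the caller should check `retry_count` and substitute the fallback message.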

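### 5.3. Integrating Multi-Source Retrieval

```python
# Tavily web search tool, intended for use inside the "search_web" node
from langchain_community.tools.tavily_search import TavilySearchResults


def route_query(state: AgenticRAGState) -> str:
    """Semantic routing based on the content of the question."""
    question = state["question"]

    # router_llm is assumed to be an LLM bound to a structured output schema
    # with a `datasource` field ("vectorstore" | "sql" | "websearch").
    classification = router_llm.invoke(
        f"""Classify this question into one category:
        - 'vectorstore': technical documentation, internal knowledge
        - 'sql': metrics, numbers, statistics, financial data
        - 'websearch': recent events, news, current information

        Question: {question}"""
    )
    return classification.datasource


# The nodes "analyze_query", "retrieve_from_vectorstore", "query_sql_database"
# and "search_web" are assumed to have been added to the workflow already.
workflow.add_conditional_edges(
    "analyze_query",
    route_query,
    {
        "vectorstore": "retrieve_from_vectorstore",
        "sql": "query_sql_database",
        "websearch": "search_web"
    }
)
```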

## 6. Evaluation and Monitoring in Production

Running Agentic RAG in production requires **three tiers of evaluation** in parallel:

### 6.1. Three Evaluation Tiers

| Tier | Tools | Metrics | Target |
| --- | --- | --- | --- |
| **Per-Query** | Ragas, DeepEval | Faithfulness, Answer Relevancy, Context Precision | ≥0.9 / ≥0.85 / ≥0.8 |
| **Trajectory** | Arize Phoenix, Langfuse | Loop steps, token usage, routing accuracy | Avg steps ≤3, cost/query ≤ budget |
| **Drift Monitoring** | Custom pipeline | Knowledge drift, embedding drift, eval drift | Weekly check against golden set |

### 6.2. Observability for the Agent Loop

Each time the agent loops through the nodes (retrieve, grade, rewrite, retrieve again, and so on), you need to trace the full trajectory in order to debug and optimize:

```python
# Import path and constructor arguments as in Langfuse Python SDK v2.
# Newer SDK versions moved the handler to langfuse.langchain; check the
# documentation for the version you have installed.
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

result = app.invoke(
    {"question": "Compare AWS and Azure costs for AI workloads",
     "retry_count": 0},
    config={"callbacks": [langfuse_handler]}
)
```

Langfuse records the full trace: every node that ran, its processing time, tokens consumed, and each routing decision. From there you can spot bottlenecks, for example a grader so strict that 80% of queries are rewritten unnecessarily.
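Once trajectories are exported from the tracing tool, a small aggregation is enough to surface that kind of bottleneck. The sketch below assumes each trajectory is simply the ordered list of node names visited for one query; that export format and the `summarize_trajectories` helper are assumptions for illustration, not a Langfuse API.

```python
def summarize_trajectories(trajectories: list[list[str]]) -> dict:
    """Compute average steps per query and the share of queries that were rewritten."""
    if not trajectories:
        raise ValueError("at least one trajectory is required")
    total = len(trajectories)
    rewritten = sum(1 for t in trajectories if "rewrite_query" in t)
    return {
        "avg_steps": sum(len(t) for t in trajectories) / total,
        "rewrite_rate": rewritten / total,
    }


# Illustrative trajectories using the node names from the graph above
trajectories = [
    ["retrieve", "grade_documents", "generate"],
    ["retrieve", "grade_documents", "generate", "rewrite_query",
     "retrieve", "grade_documents", "generate"],
]

summary = summarize_trajectories(trajectories)
print(summary)
```

A rewrite rate that stays high across many queries is a hint that the grader threshold, rather than the retriever, deserves a closer look.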

## 7. Best Practices for Production

#### Budget Control

Set a hard limit on the retry count (3-5 attempts) and on the total token budget per query. Agentic RAG burns tokens quickly when the agent loops too often. Monitor cost per query to catch anomalies early.

#### Grader Calibration

If the relevance grader is too strict, the agent rewrites constantly, which raises latency and cost. If it is too loose, it accepts poor documents and quality drops. Calibrate on a manually labeled golden dataset, targeting precision ≥0.85 and recall ≥0.80.
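Calibration comes down to comparing the grader's verdicts with human labels on the golden set. A minimal sketch follows, assuming both are lists of booleans in the same order; the sample data and the default targets passed to `meets_targets` are illustrative only.

```python
def precision_recall(predicted: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Precision and recall of the grader's 'relevant' verdicts against human labels."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_targets(precision: float, recall: float,
                  min_precision: float = 0.85, min_recall: float = 0.80) -> bool:
    """Check the calibration targets mentioned above."""
    return precision >= min_precision and recall >= min_recall


# Illustrative golden set: human labels vs. grader output for six documents
GOLDEN_LABELS = [True, True, False, True, False, True]
GRADER_OUTPUT = [True, True, True, True, False, False]

precision, recall = precision_recall(GRADER_OUTPUT, GOLDEN_LABELS)
print(precision, recall, meets_targets(precision, recall))
```

Low precision suggests the grader is too loose, while low recall suggests it is too strict and will trigger unnecessary rewrites.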

#### Fallback Strategy

When the agent runs out of budget without enough information, it must NOT make things up. Answer transparently, for example "I only found partial information...", and include the sources that were retrieved. This builds trust with users.
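Budget control and the fallback strategy fit together naturally. The sketch below assumes a hypothetical `QueryBudget` tracker that caps both retries and tokens, plus a `finalize` helper that returns the answer, a transparent fallback, or `None` to signal another loop. The limits and the source name are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

PARTIAL_ANSWER = "I only found partial information for this question."


@dataclass
class QueryBudget:
    """Hypothetical per-query budget: caps both retries and total tokens."""
    max_retries: int = 3
    max_tokens: int = 8000
    retries: int = 0
    tokens: int = 0

    def charge(self, tokens: int, is_retry: bool = False) -> None:
        """Record token usage and, if applicable, one retry."""
        self.tokens += tokens
        if is_retry:
            self.retries += 1

    @property
    def exhausted(self) -> bool:
        return self.retries >= self.max_retries or self.tokens >= self.max_tokens


def finalize(budget: QueryBudget, draft: str, grounded: bool,
             sources: list[str]) -> Optional[str]:
    """Return the answer, a transparent fallback, or None to loop again."""
    if grounded:
        return draft
    if budget.exhausted:
        listed = ", ".join(sources) if sources else "none"
        return f"{PARTIAL_ANSWER} Sources consulted: {listed}."
    return None


budget = QueryBudget(max_retries=2, max_tokens=1000)
budget.charge(400)                 # first attempt
budget.charge(400, is_retry=True)  # one retry, 800 tokens so far
still_looping = finalize(budget, "draft", grounded=False, sources=["hr-policy.pdf"])
budget.charge(300, is_retry=True)  # second retry, now over the token cap
final = finalize(budget, "draft", grounded=False, sources=["hr-policy.pdf"])
print(still_looping, final, sep="\n")
```

Keeping the check in one place means an ungrounded draft can never reach the user just because the loop ran out of attempts.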

#### Caching Layer

Cache results for identical or similar queries. A semantic cache (based on embedding similarity) can cut retrieval calls by 40-60% for production workloads. It needs a sensible TTL, though, when the underlying knowledge changes frequently.
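A minimal semantic cache with a TTL might look like the sketch below. It is an illustration under stated assumptions: the bag-of-words `_embed` function stands in for a real embedding model, the linear scan stands in for a vector index, and the `SemanticCache` class, its threshold, and its TTL values are hypothetical.

```python
import math
import time
from collections import Counter
from typing import Callable, Optional


def _embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached answer when a new question is similar enough and not expired."""

    def __init__(self, threshold: float = 0.8, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: list[tuple[Counter, str, float]] = []

    def put(self, question: str, answer: str) -> None:
        self._entries.append((_embed(question), answer, self._clock()))

    def get(self, question: str) -> Optional[str]:
        now = self._clock()
        # Drop expired entries so stale knowledge is never served
        self._entries = [e for e in self._entries if now - e[2] < self._ttl]
        q = _embed(question)
        best_score, best_answer = 0.0, None
        for embedding, answer, _ in self._entries:
            score = _cosine(q, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None


cache = SemanticCache(threshold=0.8, ttl_seconds=600)
cache.put("what is the refund policy", "Refunds take 14 days.")
hit = cache.get("what is the refund policy please")
miss = cache.get("shipping time to canada")
print(hit, miss)
```

The similarity threshold trades hit rate against the risk of serving an answer to a question that only looks similar, so it deserves the same calibration care as the grader.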

#### Human-in-the-Loop

For high-stakes domains (legal, medical, financial), add a checkpoint for human review before returning the final response. LangGraph supports `interrupt_before` and `interrupt_after` (passed when compiling the graph, together with a checkpointer) to pause the workflow while it waits for approval.

## 8. When to Use Agentic RAG and When Not To

✓ Multi-hop queries, cross-source reasoning

✓ High-stakes: legal, medical, financial

✗ Simple FAQ, single-corpus lookup

✗ Latency-sensitive (<500ms requirement)

**Use Agentic RAG when** questions are complex and need reasoning across several sources, the domain demands high accuracy (where even a 1% error rate is unacceptable), the knowledge base changes often and needs intelligent routing, or users expect comprehensive answers rather than snippets.

**Stay with traditional RAG when** you have a simple FAQ chatbot, latency under 500ms is mandatory, the token budget is tight, or the knowledge base is small and stable. Traditional RAG remains the best choice for roughly 60-70% of common use cases.

## 9. Conclusion

Agentic RAG is not a wholesale replacement for traditional RAG. It is a natural evolution for use cases that demand complex reasoning, multi-source retrieval, and high accuracy. By combining retrieval with the autonomous decision-making of an AI agent, Agentic RAG makes it possible to build AI systems that are genuinely "smart": they know when to look for more, how to judge the quality of what they found, and when to stop if they are not sure.

With frameworks such as LangGraph, LlamaIndex, and Semantic Kernel, building production-ready Agentic RAG is more accessible than ever. What matters is understanding the trade-off between cost/latency and quality, and applying the right pattern to the right use case.


# Agentic RAG — Khi RAG Gặp AI Agent Tự Chủ

Retrieval-Augmented Generation (RAG) đã trở thành kỹ thuật nền tảng giúp LLM truy cập dữ liệu bên ngoài thay vì chỉ dựa vào kiến thức huấn luyện. Tuy nhiên, RAG truyền thống hoạt động theo kiểu **single-pass** — truy vấn một lần, lấy kết quả, sinh câu trả lời — và bộc lộ nhiều hạn chế khi đối mặt với câu hỏi phức tạp, đa bước, hoặc yêu cầu suy luận qua nhiều nguồn dữ liệu. **Agentic RAG** là bước tiến hóa tiếp theo: biến pipeline RAG cứng nhắc thành một agent tự chủ có khả năng lập kế hoạch, truy xuất lặp, tự đánh giá và sửa lỗi cho đến khi đạt được câu trả lời đáng tin cậy.

57% Tổ chức đã triển khai AI Agent trong production (2026)

3-10x Chi phí token Agentic RAG so với RAG truyền thống

33.3% Tăng trưởng hybrid retrieval — nhanh nhất trong RAG

≥0.9 Faithfulness target cho production Agentic RAG

## 1. RAG Truyền Thống Hoạt Động Như Thế Nào

RAG truyền thống tuân theo một pipeline tuyến tính gồm ba bước cơ bản:

```
graph LR
    A["Câu hỏi người dùng"] --> B["Embedding & Search"]
    B --> C["Top-K Documents"]
    C --> D["LLM + Context"]
    D --> E["Câu trả lời"]
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#4CAF50,stroke:#fff,color:#fff

```

Hình 1: Pipeline RAG truyền thống — một chiều, không phản hồi

1. **Embed**: Chuyển câu hỏi thành vector embedding
2. **Retrieve**: Tìm top-K document chunks gần nhất từ vector store
3. **Generate**: Đưa context vào prompt, LLM sinh câu trả lời

### 1.1. Những Giới Hạn Của RAG Truyền Thống

#### Vấn đề cốt lõi

RAG truyền thống là **stateless và single-pass**. Nó không có khả năng: (1) đánh giá chất lượng kết quả retrieval, (2) quyết định cần tìm thêm thông tin hay không, (3) phân tách câu hỏi phức tạp thành nhiều sub-query, (4) chọn nguồn dữ liệu phù hợp nhất cho từng phần của câu hỏi.

Cụ thể, RAG truyền thống gặp khó khăn với:

- **Multi-hop questions**: Câu hỏi yêu cầu kết hợp thông tin từ nhiều document khác nhau. Ví dụ: "So sánh chiến lược pricing của công ty A và B trong Q1 2026" — cần truy xuất từ ít nhất 2 nguồn rồi tổng hợp.
- **Ambiguous queries**: Khi câu hỏi mơ hồ, retrieval trả về document không liên quan, nhưng hệ thống không có cơ chế để nhận ra và thử lại.
- **Dynamic knowledge**: Dữ liệu thay đổi liên tục, nhưng pipeline cứng không biết khi nào cần refresh hay query real-time.
- **Reasoning gaps**: Câu trả lời đúng có thể yêu cầu suy luận nhiều bước, nhưng single-pass không có không gian cho quá trình này.

## 2. Agentic RAG Là Gì

Agentic RAG kết hợp sức mạnh của RAG với khả năng ra quyết định tự chủ của AI Agent. Thay vì pipeline tuyến tính cứng nhắc, Agentic RAG biến LLM thành một **agent có khả năng lập kế hoạch, thực thi, đánh giá và lặp lại** quá trình retrieval cho đến khi đạt được kết quả đáng tin cậy.

#### Định nghĩa Agentic RAG

Agentic RAG là kiến trúc trong đó LLM hoạt động như một **decision-making agent** — tự quyết định *khi nào* cần truy xuất, *nguồn nào* để query, *cách nào* để reformulate câu hỏi, và *liệu kết quả* đã đủ tốt hay cần iterate thêm. Nó chuyển RAG từ mô hình "retrieve-and-read" sang "plan-retrieve-reason-critique-refine".

Những đặc điểm cốt lõi phân biệt Agentic RAG với RAG truyền thống:

| Đặc điểm | RAG Truyền Thống | Agentic RAG |
| --- | --- | --- |
| **Luồng xử lý** | Tuyến tính, single-pass | Vòng lặp có điều kiện (cyclic) |
| **Ra quyết định** | Không — luôn retrieve rồi generate | Agent tự quyết định: retrieve, skip, rewrite, hay dừng |
| **Đánh giá kết quả** | Không — trả lời ngay với context có sẵn | Self-grading: đánh giá relevance, hallucination, completeness |
| **Xử lý lỗi** | Không — nếu retrieval kém, output kém | Self-corrective: rewrite query, chuyển nguồn, retry |
| **Nguồn dữ liệu** | Thường 1 vector store duy nhất | Multi-source: vector DB, SQL, API, web search, tools |
| **Chi phí** | 1x token, latency thấp | 3-10x token, latency cao hơn |
| **Phù hợp** | FAQ, single-corpus, latency-sensitive | Multi-hop, high-stakes, cross-source reasoning |

## 3. Kiến Trúc Agentic RAG Chi Tiết

Kiến trúc Agentic RAG tổ chức các thành phần thành một **state machine có hướng, có vòng lặp** (directed cyclic graph). Mỗi node trong graph đại diện cho một bước xử lý, và các edge có điều kiện quyết định luồng đi tiếp theo dựa trên kết quả đánh giá.

```
graph TD
    A["Câu hỏi đầu vào"] --> B["Query Analyzer"]
    B --> C{"Cần retrieval?"}
    C -->|Có| D["Query Router"]
    C -->|Không| J["Direct Answer"]
    D --> E["Vector Store"]
    D --> F["SQL Database"]
    D --> G["Web Search"]
    D --> H["API / Tools"]
    E --> I["Relevance Grader"]
    F --> I
    G --> I
    H --> I
    I --> K{"Kết quả đủ tốt?"}
    K -->|Không| L["Query Rewriter"]
    L --> D
    K -->|Có| M["Response Generator"]
    M --> N["Hallucination Checker"]
    N --> O{"Có hallucination?"}
    O -->|Có| L
    O -->|Không| P["Answer Grader"]
    P --> Q{"Trả lời đầy đủ?"}
    Q -->|Không| L
    Q -->|Có| R["Final Response"]
    J --> R
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style I fill:#2c3e50,stroke:#fff,color:#fff
    style L fill:#ff9800,stroke:#fff,color:#fff
    style M fill:#2c3e50,stroke:#fff,color:#fff
    style N fill:#2c3e50,stroke:#fff,color:#fff
    style P fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#4CAF50,stroke:#fff,color:#fff

```

Hình 2: Kiến trúc Agentic RAG — vòng lặp self-corrective với multiple retrieval sources

### 3.1. Các Thành Phần Cốt Lõi

- **Query Analyzer**: Phân tích câu hỏi để xác định liệu cần retrieval hay có thể trả lời trực tiếp. Đây là bước "suy nghĩ trước khi hành động" — tránh retrieval không cần thiết cho câu hỏi đơn giản.
- **Query Router**: Định tuyến câu hỏi đến nguồn dữ liệu phù hợp nhất. Ví dụ: câu hỏi về số liệu tài chính → SQL database, câu hỏi về policy → vector store, câu hỏi về tin mới → web search.
- **Relevance Grader**: Đánh giá mức độ liên quan của document được truy xuất. Nếu không đủ relevant, trigger query rewriting thay vì ép LLM generate từ context kém.
- **Query Rewriter**: Viết lại câu hỏi dựa trên feedback từ grader. Có thể decompose câu hỏi phức tạp thành sub-queries, thêm context, hoặc thay đổi từ khóa.
- **Hallucination Checker**: Kiểm tra xem câu trả lời có bám sát vào context đã retrieve hay "bịa" thông tin. Faithfulness score ≥ 0.9 là target cho production.
- **Answer Grader**: Đánh giá tổng thể: câu trả lời có thực sự giải quyết câu hỏi ban đầu không? Nếu thiếu, quay lại vòng lặp.

## 4. Bốn Pattern Chính Trong Agentic RAG

### 4.1. Adaptive Retrieval

Agent tự quyết định **có cần retrieval hay không** dựa trên độ phức tạp của câu hỏi. Câu hỏi đơn giản như "Python là gì?" → trả lời trực tiếp. Câu hỏi về dữ liệu cụ thể → trigger retrieval. Điều này giảm chi phí token và latency cho các trường hợp không cần external knowledge.

```
graph LR
    A["Query"] --> B["Complexity Classifier"]
    B -->|Đơn giản| C["Direct LLM"]
    B -->|Phức tạp| D["Retrieval Pipeline"]
    B -->|Rất phức tạp| E["Multi-Step Retrieval"]
    C --> F["Response"]
    D --> F
    E --> F
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style F fill:#4CAF50,stroke:#fff,color:#fff

```

Hình 3: Adaptive Retrieval — agent chọn chiến lược dựa trên độ phức tạp

### 4.2. Self-Corrective RAG (CRAG)

Đây là pattern quan trọng nhất của Agentic RAG. Sau khi retrieval, agent **đánh giá chất lượng kết quả** và tự sửa nếu cần:

1. **Retrieve**: Lấy documents từ knowledge base
2. **Grade**: Đánh giá relevance score của từng document
3. **Decide**: Nếu relevant → generate. Nếu ambiguous → rewrite query. Nếu irrelevant → fallback sang web search hoặc nguồn khác.
4. **Validate**: Kiểm tra hallucination và completeness trước khi trả kết quả

#### CRAG trong thực tế

### 4.3. Multi-Step Retrieval

Cho câu hỏi đa bước (multi-hop), agent **phân tách thành chuỗi sub-queries**, thực hiện retrieval tuần tự, và tổng hợp kết quả. Mỗi bước retrieval sử dụng context từ bước trước để refine query tiếp theo.

Ví dụ với câu hỏi: "Doanh thu Q1 2026 của công ty A có tăng so với Q4 2025 không, và nguyên nhân chính là gì?"

- **Step 1**: Retrieve "doanh thu Q1 2026 công ty A" → Được con số cụ thể
- **Step 2**: Retrieve "doanh thu Q4 2025 công ty A" → So sánh được
- **Step 3**: Retrieve "phân tích nguyên nhân tăng/giảm doanh thu công ty A 2026" → Giải thích
- **Synthesize**: Tổng hợp 3 kết quả thành câu trả lời đầy đủ

### 4.4. Router RAG

Agent sử dụng **semantic routing** để chọn nguồn dữ liệu tối ưu cho từng query. Thay vì query tất cả các nguồn rồi merge, router chọn chính xác nguồn phù hợp nhất — tiết kiệm chi phí và giảm noise.

```
graph TD
    A["User Query"] --> B["Semantic Router"]
    B -->|Technical docs| C["Vector Store  
Confluence / Notion"]
    B -->|Metrics & Numbers| D["SQL Database  
Analytics"]
    B -->|Recent events| E["Web Search  
Tavily / Bing"]
    B -->|Code-related| F["Code Search  
GitHub API"]
    B -->|Company policy| G["Document Store  
SharePoint"]
    C --> H["Merge & Generate"]
    D --> H
    E --> H
    F --> H
    G --> H
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H fill:#4CAF50,stroke:#fff,color:#fff

```

Hình 4: Router RAG — định tuyến thông minh đến nguồn dữ liệu phù hợp

## 5. Xây Dựng Agentic RAG Với LangGraph

LangGraph là framework phổ biến nhất để xây dựng Agentic RAG trong năm 2026. Nó mô hình hóa toàn bộ hệ thống như một **directed cyclic graph** với state management, conditional branching và human-in-the-loop.

### 5.1. Định Nghĩa State và Nodes

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class AgenticRAGState(TypedDict):
    question: str
    documents: List[str]
    generation: str
    retry_count: int
    web_search_needed: bool

def retrieve(state: AgenticRAGState) -> AgenticRAGState:
    """Truy xuất documents từ vector store."""
    question = state["question"]
    documents = vector_store.similarity_search(question, k=5)
    return {**state, "documents": documents}

def grade_documents(state: AgenticRAGState) -> AgenticRAGState:
    """Đánh giá relevance của từng document."""
    question = state["question"]
    docs = state["documents"]

relevant_docs = []
    web_search_needed = False

for doc in docs:
        score = relevance_grader.invoke({
            "question": question,
            "document": doc.page_content
        })
        if score.binary_score == "yes":
            relevant_docs.append(doc)

if len(relevant_docs) < 2:
        web_search_needed = True

return {
        **state,
        "documents": relevant_docs,
        "web_search_needed": web_search_needed
    }

def rewrite_query(state: AgenticRAGState) -> AgenticRAGState:
    """Viết lại câu hỏi để cải thiện retrieval."""
    question = state["question"]
    better_question = query_rewriter.invoke({
        "question": question,
        "feedback": "Previous retrieval returned insufficient results."
    })
    return {
        **state,
        "question": better_question,
        "retry_count": state["retry_count"] + 1
    }

def generate(state: AgenticRAGState) -> AgenticRAGState:
    """Sinh câu trả lời từ context đã validate."""
    docs_content = "\n\n".join(d.page_content for d in state["documents"])
    generation = rag_chain.invoke({
        "context": docs_content,
        "question": state["question"]
    })
    return {**state, "generation": generation}

def check_hallucination(state: AgenticRAGState) -> str:
    """Kiểm tra hallucination — trả về routing decision."""
    score = hallucination_grader.invoke({
        "documents": state["documents"],
        "generation": state["generation"]
    })
    if score.binary_score == "yes":
        return "useful"
    return "not_useful"

```

### 5.2. Xây Dựng Graph

```python
workflow = StateGraph(AgenticRAGState)

workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("web_search", web_search_node)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")

workflow.add_conditional_edges(
    "grade_documents",
    lambda state: "web_search" if state["web_search_needed"] else "generate",
    {
        "web_search": "web_search",
        "generate": "generate"
    }
)

workflow.add_edge("web_search", "generate")

workflow.add_conditional_edges(
    "generate",
    check_hallucination,
    {
        "useful": END,
        "not_useful": "rewrite_query"
    }
)

workflow.add_conditional_edges(
    "rewrite_query",
    lambda state: END if state["retry_count"] >= 3 else "retrieve",
    {
        END: END,
        "retrieve": "retrieve"
    }
)

app = workflow.compile()

```

#### Retry Budget

Always cap retries (for example, stop once `retry_count ≥ 3`). Without budget control, Agentic RAG can fall into an infinite loop, which is especially dangerous for questions the knowledge base genuinely cannot answer. When the budget runs out, answer honestly: "I could not find enough information to answer this question."
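
The stop condition and the honest fallback can be sketched outside LangGraph as a plain loop. This is a minimal illustration under stated assumptions, not the graph itself: `answer_with_retry_budget`, `attempt`, `rewrite`, and `FALLBACK_ANSWER` are hypothetical names, and `attempt` stands in for the retrieve, grade, and generate steps, returning `None` when no grounded answer was produced.

```python
from typing import Callable, Optional

# Hypothetical fallback text, mirroring the honest answer suggested above.
FALLBACK_ANSWER = "I could not find enough information to answer this question."


def answer_with_retry_budget(
    question: str,
    attempt: Callable[[str], Optional[str]],
    rewrite: Callable[[str], str],
    max_retries: int = 3,
) -> dict:
    """Attempt an answer, rewriting the question until the retry budget is spent."""
    retry_count = 0
    while True:
        answer = attempt(question)
        if answer is not None:
            return {"answer": answer, "retry_count": retry_count, "exhausted": False}
        # Same order as the graph: rewrite first, then check the budget.
        question = rewrite(question)
        retry_count += 1
        if retry_count >= max_retries:
            return {"answer": FALLBACK_ANSWER, "retry_count": retry_count, "exhausted": True}
```

With `max_retries=3` this makes at most three attempts, and the caller always receives either a grounded answer or the explicit fallback rather than the last failed generation.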

### 5.3. Integrating Multi-Source Retrieval

```python
from langchain_community.tools.tavily_search import TavilySearchResults

def route_query(state: AgenticRAGState) -> str:
    """Semantic routing based on the content of the question."""
    question = state["question"]

    # router_llm is expected to return a structured object exposing `.datasource`.
    classification = router_llm.invoke(
        f"""Classify this question into one category:
        - 'vectorstore': technical documentation, internal knowledge
        - 'sql': metrics, numbers, statistics, financial data
        - 'websearch': recent events, news, current information

        Question: {question}"""
    )
    return classification.datasource

workflow.add_conditional_edges(
    "analyze_query",
    route_query,
    {
        "vectorstore": "retrieve_from_vectorstore",
        "sql": "query_sql_database",
        "websearch": "search_web"
    }
)

```
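
Because the routing label comes from an LLM, it may not match any key in the edge mapping. One possible guard is to normalize the label before routing on it. The sketch below assumes a default route of `vectorstore`, which is an assumption rather than something prescribed here; `ROUTES`, `DEFAULT_ROUTE`, `normalize_route`, and `target_node` are hypothetical names.

```python
# Same label-to-node mapping as the conditional edges above.
ROUTES = {
    "vectorstore": "retrieve_from_vectorstore",
    "sql": "query_sql_database",
    "websearch": "search_web",
}

# Assumption: unknown labels fall back to the internal knowledge base.
DEFAULT_ROUTE = "vectorstore"


def normalize_route(raw_label: str) -> str:
    """Strip whitespace and quotes, lowercase, and fall back on unknown labels."""
    label = raw_label.strip().strip("'\"").lower()
    return label if label in ROUTES else DEFAULT_ROUTE


def target_node(raw_label: str) -> str:
    """Resolve a raw classifier label to the name of the node to run next."""
    return ROUTES[normalize_route(raw_label)]
```

A router function could return `normalize_route(classification.datasource)` so that an unexpected label never reaches the graph.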

## 6. Evaluation and Monitoring in Production

Running Agentic RAG in production requires **three tiers of evaluation** operating in parallel:

### 6.1. Three Eval Tiers

| Tier | Tools | Metrics | Target |
| --- | --- | --- | --- |
| **Per-Query** | Ragas, DeepEval | Faithfulness, Answer Relevancy, Context Precision | Faithfulness ≥0.9, Answer Relevancy ≥0.85, Context Precision ≥0.8 |
| **Trajectory** | Arize Phoenix, Langfuse | Number of loop steps, token usage, routing accuracy | Avg steps ≤3, cost/query ≤budget |
| **Drift Monitoring** | Custom pipeline | Knowledge drift, embedding drift, eval drift | Weekly check vs golden set |
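
The per-query targets in the table can be enforced with a small gate. The sketch below is hedged: the thresholds come from the table, while the metric key names and the `failing_metrics` helper are hypothetical and do not reflect the output format of Ragas or DeepEval.

```python
from typing import Dict, List

# Targets taken from the Per-Query row; the key names are assumptions.
TARGETS: Dict[str, float] = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
}


def failing_metrics(scores: Dict[str, float]) -> List[str]:
    """Return the metrics that are below target; a missing metric counts as failing."""
    return [name for name, target in TARGETS.items() if scores.get(name, 0.0) < target]


if __name__ == "__main__":
    sample = {"faithfulness": 0.93, "answer_relevancy": 0.81, "context_precision": 0.80}
    print(failing_metrics(sample))
```

An empty list means the query meets every target, so the same function can act as a regression gate over a golden set.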

### 6.2. Observability for the Agent Loop

Each time the agent loops through the nodes (retrieve → grade → rewrite → retrieve again...), you need to trace the full **trajectory** in order to debug and optimize:

```python
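# Langfuse SDK v2-style import; newer SDK versions expose the handler
# from `langfuse.langchain` instead.
from langfuse.callback import CallbackHandler

# Placeholder keys; in practice load them from environment variables.
langfuse_handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

# Passing the handler as a callback traces every node the graph executes.
result = app.invoke(
    {"question": "Compare AWS vs Azure costs for AI workloads?",
     "retry_count": 0},
    config={"callbacks": [langfuse_handler]}
)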

```

Langfuse records the full trace: each node that ran, its processing time, tokens consumed, and the routing decisions. From that you can spot bottlenecks, for example a grader so strict that 80% of queries are rewritten unnecessarily.
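
Once traces are exported, a rewrite rate like the one mentioned above is simple to compute. The sketch assumes each trajectory has already been reduced to a list of visited node names; that format, and the helpers `rewrite_rate` and `node_visits`, are hypothetical and not part of the Langfuse API.

```python
from collections import Counter
from typing import List


def rewrite_rate(trajectories: List[List[str]]) -> float:
    """Fraction of queries whose trajectory visited `rewrite_query` at least once."""
    if not trajectories:
        return 0.0
    rewritten = sum(1 for steps in trajectories if "rewrite_query" in steps)
    return rewritten / len(trajectories)


def node_visits(trajectories: List[List[str]]) -> Counter:
    """Total number of visits per node across all trajectories."""
    return Counter(node for steps in trajectories for node in steps)


# Illustrative data: one query answered directly, one rewritten once.
trajectories = [
    ["retrieve", "grade_documents", "generate"],
    ["retrieve", "grade_documents", "generate", "rewrite_query",
     "retrieve", "grade_documents", "generate"],
]
print(rewrite_rate(trajectories))
print(node_visits(trajectories).most_common(3))
```

A rewrite rate that stays high over a golden set is a hint that a grader or prompt needs recalibrating, not proof of it.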

## 7. Best Practices for Production

Budget Control

Set a hard limit on the retry count (3-5 attempts) and on the total token budget per query. Agentic RAG burns tokens quickly when the agent loops too often. Monitor cost per query to catch anomalies early.
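
A per-query token budget and a simple cost check might look like the sketch below. Everything in it is illustrative: `TokenBudget`, `BudgetExceeded`, and `is_cost_anomaly` are hypothetical names, and the "three times the recent median" rule is an assumed heuristic rather than a recommended threshold.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


class BudgetExceeded(RuntimeError):
    """Raised when a query spends more tokens than its hard limit."""


@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    def charge(self, tokens: int) -> None:
        """Record token usage and fail fast once the hard limit is crossed."""
        self.used += tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)


def is_cost_anomaly(tokens: int, recent: Sequence[int], factor: float = 3.0) -> bool:
    """Flag a query whose token cost exceeds `factor` times the recent median."""
    if not recent:
        return False
    return tokens > factor * median(recent)
```

Each LLM call in the loop would call `charge()` with its token count, and the surrounding code would catch `BudgetExceeded` to return a fallback answer.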

Grader Calibration

Fallback Strategy

Caching Layer

Human-in-the-Loop

For high-stakes domains (legal, medical, financial), add a checkpoint for human review before returning the final response. LangGraph supports `interrupt_before` and `interrupt_after` to pause the workflow while waiting for approval.

## 8. When to Use and When Not to Use Agentic RAG

✓ Multi-hop queries, cross-source reasoning

✓ High-stakes: legal, medical, financial

✗ Simple FAQ, single-corpus lookup

✗ Latency-sensitive (<500ms requirement)

**Use Agentic RAG when**: questions are complex and require reasoning across multiple sources, the domain demands high accuracy (even a 1% error rate is unacceptable), the knowledge base changes frequently and needs smart routing, or users expect comprehensive answers rather than snippets.

**Stick with traditional RAG when**: you are building a simple FAQ chatbot, sub-500ms latency is mandatory, the token budget is tight, or the knowledge base is small and stable. Traditional RAG remains the best choice for 60-70% of common use cases.

## 9. Conclusion

Agentic RAG is not a wholesale replacement for traditional RAG. It is a **natural next step** for use cases that demand complex reasoning, multi-source reasoning, and high accuracy. By combining retrieval with the autonomous decision-making of an AI agent, Agentic RAG makes it possible to build systems that are genuinely "smart": they know when to look for more, how to judge the quality of results, and when to stop if they are not sure.

**References:**

- [LangChain — Build a Custom RAG Agent with LangGraph](https://docs.langchain.com/oss/python/langgraph/agentic-rag)
- [MarsDevs — Agentic RAG: The 2026 Production Guide](https://www.marsdevs.com/guides/agentic-rag-2026-guide)
- [Comparative Analysis of RAG Architectures: Pipeline, Agentic, and Knowledge Graph](https://micheallanham.substack.com/p/comparative-analysis-of-rag-architectures)
- [Agentic RAG vs Traditional RAG: Complete Architecture Comparison](https://www.paperclipped.de/en/blog/agentic-rag-vs-traditional-rag/)
- [Vellum — Agentic Workflows in 2026: Emerging Architectures](https://www.vellum.ai/blog/agentic-workflows-emerging-architectures-and-design-patterns)

