On-Device AI 2026: Running LLMs Locally with Ollama, llama.cpp, and ONNX Runtime on .NET 10

Posted on: 4/22/2026 9:17:38 PM

Table of contents

1. Why is on-device AI becoming a must-have in 2026?
2. The on-device AI stack in 2026
3. Ollama: the easiest gateway to on-device AI
4. llama.cpp: a high-performance inference engine with GGUF quantization
5. ONNX Runtime GenAI: integrating local AI into .NET 10
6. Small language models in 2026: small but mighty
   • 70–85% of frontier quality at zero marginal cost
7. Hybrid architecture: local + cloud, the best of both worlds
   • 7.1. Implementing the router in .NET 10
   • Cost-saving tip: use a local model as the classifier
8. Choosing hardware for on-device AI
   • A note on VRAM vs. RAM
9. Production patterns for on-device AI
10. Comparing Ollama, llama.cpp, and ONNX Runtime
11. Real-world use cases for on-device AI
12. Conclusion
   • Useful resources

In 2026 you no longer need to send every prompt to the cloud to get a response from an AI model. With Ollama reaching 52 million downloads per month, llama.cpp supporting quantization from 1.5-bit to 8-bit, and ONNX Runtime GenAI integrating directly into .NET 10, running LLMs locally is no longer an experiment but a genuine production strategy. This article looks closely at the architecture, tools, and deployment strategies of on-device AI for developers who want to build AI applications that do not depend on the cloud.

52M: Ollama downloads per month (Q1 2026)

4.9×: KV cache compression with TurboQuant TQ3

14B: parameters in Phi-4-reasoning, which competes with 70B models

$0: inference cost per request

1. Why is on-device AI becoming a must-have in 2026?

Cloud AI has proven its power, but three core problems are pushing developers toward local inference:

1.1. Token costs add up quickly

A typical AI application makes 10,000–50,000 requests per day. With GPT-4o priced at about $2.50 per 1M input tokens, the monthly bill can reach thousands of dollars. On-device AI removes the per-token cost entirely: you pay once for the hardware.
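To make the cost claim concrete, here is a minimal sketch of the arithmetic. The average of 1,000 input tokens per request is an assumption chosen for illustration, not a figure from this article, and output tokens (which are billed separately) are ignored.

```csharp
using System;

// Rough monthly cost of input tokens for a cloud API.
// tokensPerRequest is a hypothetical average, not a measured value.
static decimal MonthlyInputCostUsd(
    int requestsPerDay, int tokensPerRequest, decimal usdPerMillionTokens, int days = 30)
{
    decimal tokensPerMonth = (decimal)requestsPerDay * tokensPerRequest * days;
    return tokensPerMonth / 1_000_000m * usdPerMillionTokens;
}

const decimal Gpt4oInputPrice = 2.50m; // USD per 1M input tokens, as quoted above

foreach (var requestsPerDay in new[] { 10_000, 50_000 })
{
    var cost = MonthlyInputCostUsd(requestsPerDay, 1_000, Gpt4oInputPrice);
    Console.WriteLine($"{requestsPerDay} requests/day -> {cost} USD/month (input tokens only)");
}
```

Under these assumptions the input-token bill alone lands between 750 and 3,750 USD per month, which is where the "thousands of dollars" figure comes from once output tokens and longer prompts are added.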

1.2. Latency and availability

Cloud inference adds 200–500ms of network latency to every request. With local inference, latency depends only on the speed of the hardware, typically 50–150ms to the first token on a consumer GPU. More importantly, the application works fully offline and is unaffected by provider outages.

1.3. Data privacy

In healthcare, finance, and legal work, data must not leave internal infrastructure. On-device AI is the only approach that guarantees zero data egress: not a single byte leaves your servers.

When should you NOT use on-device AI?

If you need frontier-level quality (Claude Opus, GPT-5.4), long-form creative writing, or extremely complex reasoning, cloud models are still ahead. On-device AI fits best for code completion, text classification, summarization, entity extraction, internal chatbots, and RAG pipelines.

2. The on-device AI stack in 2026

The on-device AI ecosystem in 2026 has three main layers: Model Format (how the model is stored and compressed), Inference Engine (the runtime that executes it), and Application Layer (the API that integrates it into an application).

graph TB
    subgraph APP["Application Layer"]
        A1["REST API<br/>(OpenAI-compatible)"]
        A2[".NET 10 App<br/>(ONNX Runtime GenAI)"]
        A3["Python App<br/>(llama-cpp-python)"]
        A4["Desktop/Mobile App"]
    end
    subgraph ENGINE["Inference Engine"]
        E1["Ollama v0.20<br/>52M downloads/mo"]
        E2["llama.cpp<br/>(ggml backend)"]
        E3["ONNX Runtime<br/>GenAI v0.13"]
        E4["LM Studio<br/>(GUI)"]
    end
    subgraph FORMAT["Model Format & Quantization"]
        F1["GGUF<br/>1.5-bit → 8-bit"]
        F2["ONNX<br/>(INT4/INT8/FP16)"]
        F3["SafeTensors<br/>(HuggingFace)"]
    end
    subgraph MODELS["Small Language Models"]
        M1["Phi-4-reasoning<br/>14B params"]
        M2["Qwen3.5-7B/32B"]
        M3["Gemma 4<br/>2B/4B/26B/31B"]
        M4["LFM2-24B-A2B<br/>Hybrid MoE"]
        M5["Llama 3.3<br/>8B/70B"]
    end
    A1 --> E1
    A2 --> E3
    A3 --> E2
    A4 --> E4
    E1 --> F1
    E2 --> F1
    E3 --> F2
    E4 --> F1
    F1 --> MODELS
    F2 --> MODELS
    F3 --> MODELS
    style APP fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ENGINE fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style FORMAT fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style MODELS fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Figure 1: The on-device AI stack in 2026, from model format to application layer

3. Ollama: the easiest gateway to on-device AI

Ollama has become developers' default local LLM tool in 2026, with more than 169,000 GitHub stars. Its philosophy is to reduce the whole download → configure → run workflow to a single command.

3.1. Installing and running your first model

# Install Ollama (this script targets Linux; macOS and Windows installers are on ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Run Phi-4-reasoning, a strong 14B reasoning model
ollama run phi4-reasoning

# Or Qwen3.5 7B: high efficiency, runs well with 8GB of RAM
ollama run qwen3.5:7b

# Gemma 4 2B: ultra-light, for edge devices
ollama run gemma4:2b

3.2. OpenAI-compatible REST API

Ollama's killer feature is its OpenAI-compatible REST API. An application that already uses the OpenAI chat API usually only needs a different base_url, with little or no change to its logic:

// .NET 10: point the OpenAI SDK at the local Ollama server
using System.ClientModel; // ApiKeyCredential
using OpenAI;
using OpenAI.Chat;

var client = new ChatClient(
    model: "phi4-reasoning",
    credential: new ApiKeyCredential("ollama"), // dummy key
    options: new OpenAIClientOptions
    {
        Endpoint = new Uri("http://localhost:11434/v1/")
    }
);

var response = await client.CompleteChatAsync(
    new ChatMessage[]
    {
        new SystemChatMessage("You are a programming assistant specializing in .NET."),
        new UserChatMessage("Explain Dependency Injection in 3 sentences.")
    }
);

Console.WriteLine(response.Value.Content[0].Text);

Tip: multi-model routing

Ollama can keep several models loaded at the same time. You can use Phi-4-reasoning for reasoning tasks, Qwen3.5-7B for general chat, and Gemma 4 2B for classification, all through the same endpoint, with only the model field of the request changing.
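As a sketch of what "only the model field changes" means, the snippet below builds the JSON body for Ollama's OpenAI-compatible chat endpoint and picks the model by task. The task labels (`reasoning`, `classification`) and the helper names are hypothetical; the model tags are the ones used in this article.

```csharp
using System;
using System.Text.Json;

// Hypothetical task labels mapped to the Ollama model tags used in this article.
static string PickModel(string task) => task switch
{
    "reasoning"      => "phi4-reasoning",
    "classification" => "gemma4:2b",
    _                => "qwen3.5:7b", // general chat as the default
};

// Builds the JSON body for POST http://localhost:11434/v1/chat/completions.
static string BuildChatRequest(string task, string userMessage) =>
    JsonSerializer.Serialize(new
    {
        model = PickModel(task),
        messages = new[] { new { role = "user", content = userMessage } },
    });

// Same endpoint, same payload shape; only "model" differs between the two requests.
Console.WriteLine(BuildChatRequest("classification", "Is this log line an error"));
Console.WriteLine(BuildChatRequest("reasoning", "Is 221 a prime number"));
```

Keeping the mapping in one function means the rest of the application never hard-codes a model name, so swapping a model later is a one-line change.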

3.3. Modelfile: customizing a model for a specific use case

# Modelfile for a code review assistant
FROM phi4-reasoning

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
You are a senior .NET developer who specializes in code review.
When you receive code:
1. Look for latent bugs
2. Suggest performance improvements
3. Check for security vulnerabilities
Reply in Vietnamese, concisely, and get straight to the point.
"""

# Build and run the custom model
ollama create code-reviewer -f Modelfile
ollama run code-reviewer

4. llama.cpp: a high-performance inference engine with GGUF quantization

If Ollama is the friendly abstraction layer, llama.cpp is the engine underneath. Written in plain C/C++, llama.cpp is the project that turned running LLMs on a CPU from theory into practice, and it is now the most widely used inference backend for on-device AI.

4.1. GGUF: the quantization standard for local inference

GGUF (GPT-Generated Unified Format) is the model file format designed for llama.cpp, supporting quantization from 1.5-bit to 8-bit. Quantization lowers the precision of the model weights (for example from 32-bit floats to 4-bit integers), which shrinks the model and speeds up inference at a small cost in quality for the higher bit widths.

| Quantization | Bits/weight | Size of a 7B model | RAM needed | Quality (PPL) | Use case |
| --- | --- | --- | --- | --- | --- |
| Q8_0 | 8-bit | ~7.2 GB | ~9 GB | Close to FP16 | Quality first, GPU with VRAM to spare |
| Q5_K_M | 5-bit | ~4.8 GB | ~7 GB | Very good | Best quality/size balance |
| Q4_K_M | 4-bit | ~4.1 GB | ~6 GB | Good | Most popular; 8GB of RAM is just enough |
| Q3_K_M | 3-bit | ~3.3 GB | ~5 GB | Acceptable | Limited RAM, speed first |
| Q2_K | 2-bit | ~2.7 GB | ~4 GB | Noticeably degraded | Edge devices, embedded systems |
| IQ1_S | 1.5-bit | ~1.9 GB | ~3 GB | Low | Experiments, IoT |
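The sizes in the table can be sanity-checked with a back-of-the-envelope formula: parameters × bits per weight ÷ 8. The sketch below applies it with the nominal bit widths; the function name is illustrative. Real GGUF files come out larger than this lower bound because quantized blocks also store scale factors and the K-quant variants keep some tensors at higher precision.

```csharp
using System;

// Lower-bound file size in decimal gigabytes (1 GB = 1e9 bytes):
// billions of parameters x bits per weight / 8 bits per byte.
static double EstimateModelSizeGb(double paramsBillions, double bitsPerWeight)
    => paramsBillions * bitsPerWeight / 8.0;

// Nominal bit widths from the table above, applied to a 7B-parameter model.
var nominalBits = new (string Name, double Bits)[]
{
    ("Q8_0", 8.0), ("Q5_K_M", 5.0), ("Q4_K_M", 4.0),
    ("Q3_K_M", 3.0), ("Q2_K", 2.0), ("IQ1_S", 1.5),
};

foreach (var (name, bits) in nominalBits)
    Console.WriteLine($"{name,-7} >= {EstimateModelSizeGb(7, bits):F1} GB");
```

For Q4_K_M this gives a floor of 3.5 GB against the ~4.1 GB listed in the table; the gap is the per-block metadata and mixed-precision overhead.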

4.2. TurboQuant: a leap in KV cache compression (ICLR 2026)

TurboQuant (Zandieh et al., ICLR 2026) is a KV cache compression technique that is being integrated into llama.cpp. Instead of quantizing only the model weights, TurboQuant also compresses the KV cache, the working memory the model uses to keep track of the conversation context.

TQ3: 3-bit KV cache, 4.9× compression vs. FP16

TQ4: 4-bit KV cache, 3.8× compression vs. FP16

2×: double the context length in the same VRAM

In practice: with the same 8GB of VRAM you can handle twice the context, or run more requests in parallel as a batch. This is an important step forward for production workloads on consumer hardware.

graph LR
    subgraph BEFORE["Before TurboQuant"]
        B1["Model Weights<br/>Q4_K_M = 4.1GB"] --- B2["KV Cache FP16<br/>8K ctx = 2GB"]
        B2 --- B3["Total: 6.1GB<br/>Only 8K context"]
    end
    subgraph AFTER["After TurboQuant TQ3"]
        A1["Model Weights<br/>Q4_K_M = 4.1GB"] --- A2["KV Cache TQ3<br/>16K ctx = 0.8GB"]
        A2 --- A3["Total: 4.9GB<br/>16K context!"]
    end
    BEFORE -.->|"4.9× KV cache<br/>compression"| AFTER
    style BEFORE fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style AFTER fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

Figure 2: TurboQuant doubles the context length within the same amount of VRAM
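The numbers in Figure 2 follow from two facts: the KV cache grows linearly with context length, and compression divides it by a constant factor. A minimal sketch of that arithmetic, with an illustrative function name and the figure's own inputs (2 GB of FP16 cache at 8K context, 4.9× compression):

```csharp
using System;

// KV cache size scales linearly with context length and inversely with the compression ratio.
static double ScaledKvCacheGb(
    double baselineGb, int baselineContext, int targetContext, double compressionRatio)
    => baselineGb * targetContext / baselineContext / compressionRatio;

// Figure 2 inputs: 2 GB of FP16 KV cache at 8K context.
double fp16At16K = ScaledKvCacheGb(2.0, 8_192, 16_384, compressionRatio: 1.0);
double tq3At16K  = ScaledKvCacheGb(2.0, 8_192, 16_384, compressionRatio: 4.9);

Console.WriteLine($"FP16 cache at 16K context: {fp16At16K:F2} GB");
Console.WriteLine($"TQ3 cache at 16K context:  {tq3At16K:F2} GB");
Console.WriteLine($"Q4_K_M weights + TQ3 cache: {4.1 + tq3At16K:F1} GB");
```

Doubling the context would take the FP16 cache to 4 GB; at 4.9× compression the same 16K context needs about 0.82 GB, which together with 4.1 GB of weights gives the 4.9 GB total shown in the figure.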

4.3. Flash Attention 3: handling long contexts efficiently

Flash Attention 3 optimizes the attention mechanism, whose cost scales as O(n²) with context length. In llama.cpp, FA3 prevents the "performance cliff" that appears as a conversation grows, keeping inference speed stable even with contexts of 32K+ tokens.

5. ONNX Runtime GenAI: integrating local AI into .NET 10

For .NET developers, ONNX Runtime GenAI is the most direct way to run an LLM inside a C# application without an intermediate server. The Microsoft.ML.OnnxRuntimeGenAI v0.13 package provides the full generative AI loop: pre- and post-processing, inference, logits processing, KV cache management, and grammar-based tool calling.

5.1. Setup on .NET 10

# Create a new project
dotnet new console -n LocalAI.Demo
cd LocalAI.Demo

# Add the ONNX Runtime GenAI package
dotnet add package Microsoft.ML.OnnxRuntimeGenAI --version 0.13.1

# For GPU (CUDA)
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda --version 0.13.1

5.2. Running Phi-4-mini locally in C#

using Microsoft.ML.OnnxRuntimeGenAI;

// Download the model from Hugging Face: microsoft/Phi-4-mini-instruct-onnx
var modelPath = @"C:\models\phi-4-mini-instruct-onnx\cpu-int4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var systemPrompt = "You are an AI assistant specializing in .NET and C#. Answer concisely in Vietnamese.";
var userMessage = "Compare record and class in C# 13. When should each be used?";

// Phi-4-mini chat template
var fullPrompt = $"<|system|>{systemPrompt}<|end|><|user|>{userMessage}<|end|><|assistant|>";

using var tokens = tokenizer.Encode(fullPrompt);
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("temperature", 0.3);
generatorParams.SetSearchOption("top_p", 0.9);

using var generator = new Generator(model, generatorParams);
// Recent GenAI releases take the prompt via AppendTokenSequences; older ones
// used generatorParams.SetInputSequences(tokens) plus generator.ComputeLogits().
generator.AppendTokenSequences(tokens);
using var tokenizerStream = tokenizer.CreateStream();

// Stream tokens to the console as they are generated
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var newToken = tokenizerStream.Decode(
        generator.GetSequence(0)[^1]
    );
    Console.Write(newToken);
}
Console.WriteLine();

GPU vs. CPU: when do you need a GPU?

ONNX Runtime GenAI runs on the GPU when a GPU package (CUDA or DirectML) and a matching model build are in use, and falls back to the CPU otherwise. With an INT4 model, CPU inference on a modern Intel/AMD chip reaches 10–25 tokens per second, which is enough for interactive chat. A consumer GPU (RTX 4060 or better) pushes that to 40–80 tokens per second.

5.3. Integrating into an ASP.NET 10 API

// Program.cs — Register ONNX model as singleton
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton<ILocalAIService>(sp =>
{
    var modelPath = builder.Configuration["LocalAI:ModelPath"]!;
    return new OnnxLocalAIService(modelPath);
});

var app = builder.Build();

app.MapPost("/api/chat", async (
    ChatRequest request,
    ILocalAIService ai,
    CancellationToken ct) =>
{
    var response = await ai.GenerateAsync(
        request.SystemPrompt,
        request.Message,
        ct
    );
    return Results.Ok(new { response });
});

app.Run();

// OnnxLocalAIService.cs
public class OnnxLocalAIService : ILocalAIService, IDisposable
{
    private readonly Model _model;
    private readonly Tokenizer _tokenizer;
    private readonly SemaphoreSlim _semaphore = new(1, 1);

    public OnnxLocalAIService(string modelPath)
    {
        _model = new Model(modelPath);
        _tokenizer = new Tokenizer(_model);
    }

    public async Task<string> GenerateAsync(
        string systemPrompt, string userMessage, CancellationToken ct)
    {
        await _semaphore.WaitAsync(ct);
        try
        {
            var prompt = $"<|system|>{systemPrompt}<|end|>" +
                         $"<|user|>{userMessage}<|end|><|assistant|>";

            using var tokens = _tokenizer.Encode(prompt);
            using var genParams = new GeneratorParams(_model);
            genParams.SetSearchOption("max_length", 2048);
            genParams.SetSearchOption("temperature", 0.3);

            using var generator = new Generator(_model, genParams);
            // Recent GenAI releases take the prompt via AppendTokenSequences; older ones
            // used genParams.SetInputSequences(tokens) plus generator.ComputeLogits().
            generator.AppendTokenSequences(tokens);
            using var stream = _tokenizer.CreateStream();
            var result = new StringBuilder(); // needs "using System.Text;"

            while (!generator.IsDone())
            {
                // Stop promptly when the HTTP request is aborted
                ct.ThrowIfCancellationRequested();
                generator.GenerateNextToken();
                result.Append(stream.Decode(
                    generator.GetSequence(0)[^1]
                ));
            }
            return result.ToString();
        }
        finally
        {
            _semaphore.Release();
        }
    }

    public void Dispose()
    {
        _tokenizer?.Dispose();
        _model?.Dispose();
    }
}

6. Small language models in 2026: small but mighty

The on-device AI revolution is driven by a new generation of small language models (SLMs): models under 15B parameters whose benchmark scores match the 70B models of a year earlier.

| Model | Params | MMLU | Minimum RAM | Strengths |
| --- | --- | --- | --- | --- |
| Phi-4-reasoning | 14B | ~84% | 10GB (Q4) | Reasoning, math, code; competes with DeepSeek-R1-Distill-70B |
| Qwen3.5-7B | 7B | 76.8% | 6GB (Q4) | 3× faster, best performance per parameter |
| Qwen2.5-32B | 32B | 83.2% | 20GB (Q4) | Highest MMLU among open-weight models |
| Gemma 4 E2B | ~2B | ~62% | 3GB (Q4) | Ultra-light, mobile/IoT |
| LFM2-24B-A2B | 24B (MoE) | ~80% | 8GB (Q4) | Hybrid MoE, activates only 2B parameters per inference step |
| Phi-4-multimodal | 5.6B | n/a | 5GB (Q4) | Speech + vision + text in one model |

70–85% of frontier quality at zero marginal cost

Real-world benchmarks show that local inference on consumer hardware reaches 70–85% of the quality of frontier models (Claude Opus, GPT-5.4), with a marginal cost of zero per request. For a great many production use cases, that is more than enough.

7. Hybrid architecture: local + cloud, the best of both worlds

In real production systems you rarely use 100% local or 100% cloud. The optimal architecture is hybrid routing: requests are dispatched according to their complexity.

graph TB
    REQ["Incoming Request"] --> ROUTER["AI Router<br/>(Complexity Classifier)"]
    ROUTER -->|"Simple tasks<br/>Classification, Extract, QA"| LOCAL["Local LLM<br/>Phi-4 / Qwen3.5<br/>via Ollama"]
    ROUTER -->|"Medium tasks<br/>Summarization, Code Gen"| MID["Mid-tier Cloud<br/>Claude Haiku / GPT-4o-mini"]
    ROUTER -->|"Complex tasks<br/>Deep Reasoning, Creative"| CLOUD["Frontier Cloud<br/>Claude Opus / GPT-5.4"]
    LOCAL --> RESP["Response"]
    MID --> RESP
    CLOUD --> RESP
    ROUTER -->|"Offline / No network"| LOCAL
    style REQ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style ROUTER fill:#e94560,stroke:#fff,color:#fff
    style LOCAL fill:#4CAF50,stroke:#fff,color:#fff
    style MID fill:#ff9800,stroke:#fff,color:#fff
    style CLOUD fill:#2c3e50,stroke:#fff,color:#fff
    style RESP fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Figure 3: Hybrid routing, dispatching requests by complexity level

7.1. Implementing the router in .NET 10

public class AIRouter
{
    private readonly ILocalAIService _localAI;
    private readonly ICloudAIService _cloudAI;
    private readonly IComplexityClassifier _classifier;

    public AIRouter(
        ILocalAIService localAI,
        ICloudAIService cloudAI,
        IComplexityClassifier classifier)
    {
        _localAI = localAI;
        _cloudAI = cloudAI;
        _classifier = classifier;
    }

    public async Task<AIResponse> RouteAsync(
        string prompt, CancellationToken ct)
    {
        var complexity = await _classifier.ClassifyAsync(prompt, ct);

        return complexity switch
        {
            Complexity.Simple => new AIResponse(
                await _localAI.GenerateAsync(prompt, ct),
                Provider: "local-phi4",
                Cost: 0m),

            Complexity.Medium => new AIResponse(
                await _cloudAI.GenerateAsync(
                    prompt, "claude-haiku-4-5", ct),
                Provider: "cloud-haiku",
                Cost: EstimateCost(prompt, "haiku")),

            Complexity.Complex => new AIResponse(
                await _cloudAI.GenerateAsync(
                    prompt, "claude-opus-4-7", ct),
                Provider: "cloud-opus",
                Cost: EstimateCost(prompt, "opus")),

            _ => throw new ArgumentOutOfRangeException()
        };
    }
}

Cost-saving tip: use a local model as the classifier

A small local model (Gemma 4 2B) can itself act as the classifier that decides complexity. Cost: zero. Latency: ~20ms. Result: 60–80% of requests are handled locally, cutting cloud costs by 60–80%.
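One way to wire a small model in as the classifier is to ask it for a single label and parse the reply defensively. The sketch below shows only the prompt and the parsing; the prompt wording and both helper names are assumptions, the categories come from Figure 3, and in a real router the string labels would map onto the `Complexity` enum used above.

```csharp
using System;

// Prompt sent to the small local model. The wording is illustrative.
static string BuildClassifierPrompt(string userPrompt) =>
    "Classify the request below with exactly one word: SIMPLE, MEDIUM or COMPLEX.\n" +
    "SIMPLE: classification, extraction, short Q&A.\n" +
    "MEDIUM: summarization, code generation.\n" +
    "COMPLEX: deep reasoning, long creative writing.\n\n" +
    "Request: " + userPrompt;

// Maps the model's free-text reply to a label. Anything unrecognized is
// treated as "Complex" so a confused classifier never under-serves a hard request.
static string ParseComplexity(string modelReply)
{
    var reply = modelReply.Trim().ToUpperInvariant();
    if (reply.StartsWith("SIMPLE", StringComparison.Ordinal)) return "Simple";
    if (reply.StartsWith("MEDIUM", StringComparison.Ordinal)) return "Medium";
    return "Complex";
}

Console.WriteLine(BuildClassifierPrompt("Extract the invoice number from this email."));
Console.WriteLine(ParseComplexity(" simple\n"));      // Simple
Console.WriteLine(ParseComplexity("I am not sure"));  // Complex
```

Defaulting to the most capable tier trades some cost for safety; a system that prefers to save money could default to "Medium" instead.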

8. Choosing hardware for on-device AI

| Configuration | Suitable models | Speed (tokens/s) | Estimated cost |
| --- | --- | --- | --- |
| Laptop, 8GB RAM (CPU only) | Qwen3.5-7B Q4, Gemma 4 2B | 8–15 t/s | Already owned |
| Desktop, 16GB + RTX 4060 | Phi-4-reasoning Q4, Qwen3.5-7B Q5 | 30–50 t/s | ~$800 |
| Workstation, 32GB + RTX 4090 | Qwen2.5-32B Q4, Phi-4 Q8 | 50–80 t/s | ~$2,500 |
| Server, 64GB + 2× RTX 4090 | Llama 3.3-70B Q4, Qwen2.5-32B Q8 | 40–60 t/s | ~$5,000 |
| Apple M4 Pro, 24GB | Phi-4-reasoning Q5, Qwen2.5-32B Q3 | 25–45 t/s | ~$2,000 |

A note on VRAM vs. RAM

Full-speed GPU inference needs the whole model to fit in VRAM. The RTX 4060 has only 8GB of VRAM, just enough for a 7B model at Q4. If the model is larger than the available VRAM, llama.cpp offloads part of it to CPU RAM, but speed drops by 3–5×. Apple Silicon has the advantage of unified memory: nearly all of the 24GB on an M4 Pro is usable for GPU inference.
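A quick way to apply this rule before downloading a model is to add up weights, KV cache, and a safety margin and compare the sum with the card's VRAM. The sketch below is a rough check only; the 0.5 GB reserve for activations and driver overhead is an assumed value, and the function name is illustrative.

```csharp
using System;

// True when weights, KV cache and a safety margin all fit in GPU memory.
// reserveGb is a hypothetical allowance for activations and driver overhead.
static bool FitsInVram(double weightsGb, double kvCacheGb, double vramGb, double reserveGb = 0.5)
    => weightsGb + kvCacheGb + reserveGb <= vramGb;

// 7B Q4_K_M (4.1 GB) with a 2 GB FP16 KV cache on an 8 GB RTX 4060
Console.WriteLine(FitsInVram(4.1, 2.0, vramGb: 8.0)); // True

// Same card with a model whose weights alone take 9 GB: expect CPU offload
Console.WriteLine(FitsInVram(9.0, 2.0, vramGb: 8.0)); // False
```

When the check fails, the options are a lower-bit quantization, a shorter context, or accepting the slowdown from partial CPU offload.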

9. Production patterns for on-device AI

9.1. Model warm-up and health check

// Startup: warm up the model so the first real request does not pay the cold-start cost
app.Lifetime.ApplicationStarted.Register(() =>
{
    var ai = app.Services.GetRequiredService<ILocalAIService>();
    // Fire-and-forget; log only once the warm-up call has actually finished
    _ = ai.GenerateAsync("system", "ping", CancellationToken.None)
        .ContinueWith(t => app.Logger.LogInformation(
            "Local AI warm-up finished with status {Status}", t.Status));
});

// Health check endpoint
app.MapGet("/health/ai", async (ILocalAIService ai) =>
{
    try
    {
        var sw = Stopwatch.StartNew();
        await ai.GenerateAsync("system", "test",
            new CancellationTokenSource(TimeSpan.FromSeconds(10)).Token);
        return Results.Ok(new {
            status = "healthy",
            latency_ms = sw.ElapsedMilliseconds
        });
    }
    catch (Exception ex)
    {
        return Results.Json(new {
            status = "unhealthy",
            error = ex.Message
        }, statusCode: 503);
    }
});

9.2. Handling concurrent requests

LLM inference is sequential within a request. To serve several requests at once, use a request queue with bounded concurrency:

public class QueuedAIService : ILocalAIService
{
    private readonly Channel<AIWorkItem> _queue;
    private readonly ILocalAIService _inner;

    public QueuedAIService(ILocalAIService inner, int maxConcurrency = 2)
    {
        _inner = inner;
        _queue = Channel.CreateBounded<AIWorkItem>(
            new BoundedChannelOptions(100)
            {
                FullMode = BoundedChannelFullMode.Wait
            });

        for (int i = 0; i < maxConcurrency; i++)
            _ = ProcessQueueAsync();
    }

    public async Task<string> GenerateAsync(
        string system, string user, CancellationToken ct)
    {
        var tcs = new TaskCompletionSource<string>();
        await _queue.Writer.WriteAsync(
            new AIWorkItem(system, user, tcs, ct), ct);
        return await tcs.Task;
    }

    private async Task ProcessQueueAsync()
    {
        await foreach (var item in _queue.Reader.ReadAllAsync())
        {
            try
            {
                var result = await _inner.GenerateAsync(
                    item.System, item.User, item.Ct);
                item.Tcs.SetResult(result);
            }
            catch (Exception ex)
            {
                item.Tcs.SetException(ex);
            }
        }
    }
}

record AIWorkItem(
    string System, string User,
    TaskCompletionSource<string> Tcs, CancellationToken Ct);

9.3. Monitoring and metrics

// Measure performance with the .NET Metrics API (System.Diagnostics.Metrics)
var meter = new Meter("LocalAI.Inference");
var tokenCounter = meter.CreateCounter<long>("ai.tokens.generated");
var latencyHistogram = meter.CreateHistogram<double>("ai.inference.latency_ms");
var activeRequests = meter.CreateUpDownCounter<int>("ai.requests.active");

// Inside the inference method:
activeRequests.Add(1);
var sw = Stopwatch.StartNew();
try
{
    // ... inference logic ...
    tokenCounter.Add(tokenCount,
        new KeyValuePair<string, object?>("model", modelName));
    latencyHistogram.Record(sw.ElapsedMilliseconds,
        new KeyValuePair<string, object?>("model", modelName));
}
finally
{
    activeRequests.Add(-1);
}

10. Comparing Ollama, llama.cpp, and ONNX Runtime

| Criterion | Ollama | llama.cpp (direct) | ONNX Runtime GenAI |
| --- | --- | --- | --- |
| Ease of use | ⭐⭐⭐⭐⭐ One command | ⭐⭐⭐ Needs build/config | ⭐⭐⭐⭐ NuGet package |
| Performance | Good (llama.cpp wrapper) | Best (bare metal) | Good (optimized runtime) |
| Model format | GGUF | GGUF | ONNX (INT4/INT8/FP16) |
| API style | REST (OpenAI-compatible) | CLI / C API / HTTP server | C# native / C API |
| .NET integration | Via HTTP client | Via llama.cpp bindings | Native NuGet package, the best fit |
| Multi-model | ✅ Hot-swap models | ❌ One model per process | ✅ Multiple Model instances |
| GPU support | CUDA, ROCm, Metal | CUDA, ROCm, Metal, Vulkan | CUDA, DirectML, CoreML |
| Best for | Developers who want quick results | Maximum performance, customization | .NET production apps |

11. Real-world use cases for on-device AI

Internal code assistant

An IDE plugin that suggests code, refactors, and generates tests, running Phi-4-reasoning locally. No network latency, no per-token cost, and the code never leaves the developer's machine. Especially useful for projects under NDA or with a proprietary codebase.

Offline RAG pipeline

Ingest internal documents (policies, SOPs, knowledge base) into a local vector store and use Qwen3.5-7B as the generation model. Employees query the knowledge base through a chat interface, fully air-gapped.

Log analysis and anomaly detection

Stream application logs through a local LLM to detect anomaly patterns, classify error types, and suggest fixes. Gemma 4 2B at Q4 on a CPU handles 1,000+ log entries per minute.

On-premise customer support bot

In finance and healthcare, a customer support chatbot that runs entirely on internal infrastructure. Patient and account data never leaves the data center.

Automated code review

Integrated into the CI/CD pipeline: every PR is automatically passed through a local LLM to detect bugs, security issues, and coding-convention violations. Cost per review: zero.

12. Conclusion

On-device AI in 2026 is no longer about "running a demo for fun". It has become a real architectural strategy backed by a mature ecosystem: Ollama for ease of use, llama.cpp for performance, and ONNX Runtime GenAI for .NET integration. The new generation of small language models (Phi-4-reasoning, Qwen3.5, Gemma 4) has narrowed the quality gap with frontier models to just 15–30%, while the marginal cost of inference is zero.

The optimal strategy for most production systems is hybrid routing: a local model handles the 60–80% of requests that are simple, and a cloud model handles the rest, which need deeper reasoning. The result is a 60–80% reduction in AI costs, no network dependency for common tasks, and data privacy for sensitive data.

Next step: install Ollama, run ollama run phi4-reasoning, and try AI directly on your own machine: zero cloud, zero cost, full control.

Useful resources

• Ollama Official: downloads and model library
• llama.cpp GitHub: source code and documentation
• ONNX Runtime GenAI Docs: official Microsoft documentation
• Phi-4 on HuggingFace: model weights and guides
• Qwen Models: the Qwen3.5 model family


# On-Device AI 2026: Chạy LLM Cục Bộ với Ollama, llama.cpp và ONNX Runtime trên .NET 10

Năm 2026, bạn không cần gửi mỗi prompt lên cloud để nhận response từ AI nữa. Với **Ollama đạt 52 triệu lượt download/tháng**, llama.cpp hỗ trợ quantization từ 1.5-bit đến 8-bit, và ONNX Runtime GenAI tích hợp trực tiếp vào .NET 10 — chạy LLM cục bộ không còn là thí nghiệm mà đã trở thành **chiến lược production thực thụ**. Bài viết này đi sâu vào kiến trúc, công cụ và chiến lược triển khai On-Device AI cho developer muốn xây dựng ứng dụng AI không phụ thuộc cloud.

52MLượt download Ollama/tháng (Q1 2026)

4.9×Nén KV Cache với TurboQuant TQ3

14BParams — Phi-4-reasoning đọ sức 70B

$0Chi phí inference mỗi request

## 1. Tại sao On-Device AI trở thành xu hướng bắt buộc năm 2026?

Cloud AI đã chứng minh sức mạnh, nhưng ba vấn đề cốt lõi đang đẩy developer về hướng local inference:

### 1.1. Chi phí token tích lũy nhanh chóng

### 1.2. Độ trễ và khả dụng

### 1.3. Quyền riêng tư dữ liệu

Trong ngành y tế, tài chính, pháp lý — dữ liệu không được rời khỏi hạ tầng nội bộ. On-Device AI là giải pháp duy nhất đảm bảo **zero data egress**: không một byte nào rời khỏi máy chủ của bạn.

#### Khi nào KHÔNG nên dùng On-Device AI?

## 2. Kiến trúc On-Device AI Stack 2026

Hệ sinh thái On-Device AI năm 2026 gồm ba tầng chính: **Model Format** (cách lưu trữ và nén model), **Inference Engine** (runtime thực thi), và **Application Layer** (API tích hợp vào ứng dụng).

```
graph TB
    subgraph APP["Application Layer"]
        A1["REST API  
(OpenAI-compatible)"]
        A2[".NET 10 App  
(ONNX Runtime GenAI)"]
        A3["Python App  
(llama-cpp-python)"]
        A4["Desktop/Mobile App"]
    end
    subgraph ENGINE["Inference Engine"]
        E1["Ollama v0.20  
52M downloads/mo"]
        E2["llama.cpp  
(ggml backend)"]
        E3["ONNX Runtime  
GenAI v0.13"]
        E4["LM Studio  
(GUI)"]
    end
    subgraph FORMAT["Model Format & Quantization"]
        F1["GGUF  
1.5-bit → 8-bit"]
        F2["ONNX  
(INT4/INT8/FP16)"]
        F3["SafeTensors  
(HuggingFace)"]
    end
    subgraph MODELS["Small Language Models"]
        M1["Phi-4-reasoning  
14B params"]
        M2["Qwen3.5-7B/32B"]
        M3["Gemma 4  
2B/4B/26B/31B"]
        M4["LFM2-24B-A2B  
Hybrid MoE"]
        M5["Llama 3.3  
8B/70B"]
    end
    A1 --> E1
    A2 --> E3
    A3 --> E2
    A4 --> E4
    E1 --> F1
    E2 --> F1
    E3 --> F2
    E4 --> F1
    F1 --> MODELS
    F2 --> MODELS
    F3 --> MODELS
    style APP fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ENGINE fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style FORMAT fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style MODELS fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

```

Hình 1: Kiến trúc On-Device AI Stack 2026 — từ model format đến application layer

## 3. Ollama — Gateway dễ nhất vào On-Device AI

**Ollama** đã trở thành công cụ local LLM mặc định của developer năm 2026 với 169.000+ GitHub stars. Triết lý của Ollama: đơn giản hóa toàn bộ workflow download → configure → run model xuống còn một lệnh duy nhất.

### 3.1. Cài đặt và chạy model đầu tiên

```bash
# Cài đặt Ollama (Windows/macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Chạy Phi-4-reasoning — 14B model reasoning mạnh
ollama run phi4-reasoning

# Hoặc Qwen3.5 7B — hiệu suất cao, chạy tốt trên 8GB RAM
ollama run qwen3.5:7b

# Gemma 4 2B — siêu nhẹ cho edge device
ollama run gemma4:2b
```

### 3.2. OpenAI-Compatible REST API

Killer feature của Ollama là REST API tương thích hoàn toàn với OpenAI. Mọi ứng dụng đang dùng OpenAI API chỉ cần đổi `base_url` — không sửa logic gì thêm:

```csharp
// .NET 10 — dùng OpenAI SDK trỏ về Ollama local
using OpenAI;
using OpenAI.Chat;

var client = new ChatClient(
    model: "phi4-reasoning",
    credential: new ApiKeyCredential("ollama"), // dummy key
    options: new OpenAIClientOptions
    {
        Endpoint = new Uri("http://localhost:11434/v1/")
    }
);

var response = await client.CompleteChatAsync(
    new ChatMessage[]
    {
        new SystemChatMessage("Bạn là trợ lý lập trình chuyên .NET."),
        new UserChatMessage("Giải thích Dependency Injection trong 3 câu.")
    }
);

Console.WriteLine(response.Value.Content[0].Text);
```

#### Mẹo: Multi-model routing

### 3.3. Modelfile — Tùy chỉnh model cho use case cụ thể

```dockerfile
# Modelfile cho code review assistant
FROM phi4-reasoning

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
Bạn là senior .NET developer chuyên review code.
Khi nhận code, hãy:
1. Tìm bug tiềm ẩn
2. Đề xuất cải thiện performance
3. Kiểm tra security vulnerabilities
Trả lời bằng tiếng Việt, ngắn gọn, đi thẳng vào vấn đề.
"""
```

```bash
# Build và chạy custom model
ollama create code-reviewer -f Modelfile
ollama run code-reviewer
```

## 4. llama.cpp — Engine inference hiệu năng cao với GGUF Quantization

Nếu Ollama là lớp abstraction thân thiện, thì **llama.cpp** chính là engine bên dưới. Được viết bằng C/C++ thuần, llama.cpp là dự án đã biến việc chạy LLM trên CPU từ lý thuyết thành thực tế, và hiện tại là backend inference phổ biến nhất cho on-device AI.

### 4.1. GGUF Format — Tiêu chuẩn quantization cho local inference

**GGUF** (GPT-Generated Unified Format) là format file model được thiết kế riêng cho llama.cpp, hỗ trợ quantization từ 1.5-bit đến 8-bit. Quantization giảm precision của model weights (ví dụ từ 32-bit float xuống 4-bit integer), thu nhỏ kích thước model và tăng tốc inference với mức mất chất lượng rất nhỏ.

| Quantization | Bits/Weight | Kích thước 7B Model | RAM cần | Chất lượng (PPL) | Use case |
| --- | --- | --- | --- | --- | --- |
| **Q8_0** | 8-bit | ~7.2 GB | ~9 GB | Gần FP16 | Quality-first, GPU có VRAM dư |
| **Q5_K_M** | 5-bit | ~4.8 GB | ~7 GB | Rất tốt | Cân bằng tốt nhất quality/size |
| **Q4_K_M** | 4-bit | ~4.1 GB | ~6 GB | Tốt | Phổ biến nhất — 8GB RAM vừa đủ |
| **Q3_K_M** | 3-bit | ~3.3 GB | ~5 GB | Chấp nhận | RAM hạn chế, ưu tiên tốc độ |
| **Q2_K** | 2-bit | ~2.7 GB | ~4 GB | Giảm rõ | Edge device, embedded system |
| **IQ1_S** | 1.5-bit | ~1.9 GB | ~3 GB | Thấp | Thử nghiệm, IoT |

### 4.2. TurboQuant — Bước nhảy KV Cache Compression (ICLR 2026)

**TurboQuant** (Zandieh et al., ICLR 2026) là kỹ thuật nén KV cache đang được tích hợp vào llama.cpp. Thay vì chỉ quantize model weights, TurboQuant nén cả KV cache — bộ nhớ tạm mà model dùng để theo dõi ngữ cảnh conversation.

TQ33-bit KV cache — nén 4.9× so với FP16

TQ44-bit KV cache — nén 3.8× so với FP16

2×Context length gấp đôi cùng VRAM

```
graph LR
    subgraph BEFORE["Trước TurboQuant"]
        B1["Model Weights  
Q4_K_M = 4.1GB"] --- B2["KV Cache FP16  
8K ctx = 2GB"]
        B2 --- B3["Tổng: 6.1GB  
Chỉ 8K context"]
    end
    subgraph AFTER["Sau TurboQuant TQ3"]
        A1["Model Weights  
Q4_K_M = 4.1GB"] --- A2["KV Cache TQ3  
16K ctx = 0.8GB"]
        A2 --- A3["Tổng: 4.9GB  
16K context!"]
    end
    BEFORE -.->|"Nén 4.9×  
KV Cache"| AFTER
    style BEFORE fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style AFTER fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50

```

Hình 2: TurboQuant giúp gấp đôi context length với cùng lượng VRAM

### 4.3. Flash Attention 3 — Xử lý context dài hiệu quả

## 5. ONNX Runtime GenAI — Tích hợp Local AI vào .NET 10

Đối với developer .NET, **ONNX Runtime GenAI** là cầu nối trực tiếp nhất để chạy LLM trong ứng dụng C# mà không cần server trung gian. Package `Microsoft.ML.OnnxRuntimeGenAI` v0.13 cung cấp full generative AI loop: pre/post processing, inference, logits processing, KV cache management, và grammar-based tool calling.

### 5.1. Setup trên .NET 10

```bash
# Tạo project mới
dotnet new console -n LocalAI.Demo
cd LocalAI.Demo

# Thêm ONNX Runtime GenAI package
dotnet add package Microsoft.ML.OnnxRuntimeGenAI --version 0.13.1

# Cho GPU (CUDA)
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda --version 0.13.1
```

### 5.2. Chạy Phi-4-mini local trong C#

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// Download the model from Hugging Face: microsoft/Phi-4-mini-instruct-onnx
var modelPath = @"C:\models\phi-4-mini-instruct-onnx\cpu-int4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var systemPrompt = "You are an AI assistant specializing in .NET and C#. Answer concisely.";
var userMessage = "Compare record and class in C# 13. When should you use each?";

// Phi chat template: system turn, user turn, then an open assistant turn
var fullPrompt = $"<|system|>{systemPrompt}<|end|><|user|>{userMessage}<|end|><|assistant|>";

using var tokens = tokenizer.Encode(fullPrompt);
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
// temperature and top_p only take effect when sampling is enabled
generatorParams.SetSearchOption("do_sample", true);
generatorParams.SetSearchOption("temperature", 0.3);
generatorParams.SetSearchOption("top_p", 0.9);

using var generator = new Generator(model, generatorParams);
// Recent releases take the prompt on the generator; older ones used
// generatorParams.SetInputSequences(tokens) plus generator.ComputeLogits().
generator.AppendTokenSequences(tokens);
using var tokenizerStream = tokenizer.CreateStream();

// Stream tokens to the console as they are generated
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var newToken = tokenizerStream.Decode(
        generator.GetSequence(0)[^1]
    );
    Console.Write(newToken);
}
Console.WriteLine();
```

#### GPU vs CPU: When Do You Need a GPU?

### 5.3. Integrating into an ASP.NET Core 10 API

```csharp
// Program.cs — Register ONNX model as singleton
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton<ILocalAIService>(sp =>
{
    var modelPath = builder.Configuration["LocalAI:ModelPath"]!;
    return new OnnxLocalAIService(modelPath);
});

var app = builder.Build();

app.MapPost("/api/chat", async (
    ChatRequest request,
    ILocalAIService ai,
    CancellationToken ct) =>
{
    var response = await ai.GenerateAsync(
        request.SystemPrompt,
        request.Message,
        ct
    );
    return Results.Ok(new { response });
});

app.Run();
```

```csharp
// OnnxLocalAIService.cs
using System.Text;
using Microsoft.ML.OnnxRuntimeGenAI;

public class OnnxLocalAIService : ILocalAIService, IDisposable
{
    private readonly Model _model;
    private readonly Tokenizer _tokenizer;
    // One generation at a time: a single model instance is shared by all requests
    private readonly SemaphoreSlim _semaphore = new(1, 1);

    public OnnxLocalAIService(string modelPath)
    {
        _model = new Model(modelPath);
        _tokenizer = new Tokenizer(_model);
    }

    public async Task<string> GenerateAsync(
        string systemPrompt, string userMessage, CancellationToken ct)
    {
        await _semaphore.WaitAsync(ct);
        try
        {
            // Generation is synchronous and compute-bound, so keep it off the request thread
            return await Task.Run(() => Generate(systemPrompt, userMessage, ct), ct);
        }
        finally
        {
            _semaphore.Release();
        }
    }

    private string Generate(
        string systemPrompt, string userMessage, CancellationToken ct)
    {
        var prompt = $"<|system|>{systemPrompt}<|end|>" +
                     $"<|user|>{userMessage}<|end|><|assistant|>";

        using var tokens = _tokenizer.Encode(prompt);
        using var genParams = new GeneratorParams(_model);
        genParams.SetSearchOption("max_length", 2048);
        genParams.SetSearchOption("do_sample", true); // needed for temperature to apply
        genParams.SetSearchOption("temperature", 0.3);

        using var generator = new Generator(_model, genParams);
        // Recent releases: the prompt goes on the generator (no SetInputSequences/ComputeLogits)
        generator.AppendTokenSequences(tokens);
        using var stream = _tokenizer.CreateStream();
        var result = new StringBuilder();

        while (!generator.IsDone())
        {
            ct.ThrowIfCancellationRequested();
            generator.GenerateNextToken();
            result.Append(stream.Decode(
                generator.GetSequence(0)[^1]
            ));
        }
        return result.ToString();
    }

    public void Dispose()
    {
        _semaphore.Dispose();
        _tokenizer.Dispose();
        _model.Dispose();
    }
}
```
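
The two snippets above reference `ILocalAIService` and `ChatRequest` without showing them. Here is a minimal sketch of what they might look like, inferred only from how the endpoint and the service use them, so treat the shapes as assumptions. `FakeLocalAIService` is a hypothetical stand-in that lets you test the endpoint without loading a model.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Assumed request body for POST /api/chat; property names are inferred from
// request.SystemPrompt and request.Message in the endpoint.
public record ChatRequest(string SystemPrompt, string Message);

// Assumed service contract, matching the GenerateAsync signature of OnnxLocalAIService.
public interface ILocalAIService
{
    Task<string> GenerateAsync(
        string systemPrompt, string userMessage, CancellationToken ct);
}

// Hypothetical test double: returns a canned reply without touching a model.
public sealed class FakeLocalAIService : ILocalAIService
{
    private readonly string _reply;

    public FakeLocalAIService(string reply) => _reply = reply;

    public Task<string> GenerateAsync(
        string systemPrompt, string userMessage, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        return Task.FromResult(_reply);
    }
}
```

Keeping the contract this small means the rest of the application depends only on the interface, so the ONNX-backed implementation can be swapped for the fake in tests.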

## 6. Small Language Models 2026: Small but Mighty

The on-device AI revolution is driven by a new generation of **Small Language Models (SLMs)**: models, many of them under 15B parameters, that reach benchmark scores on par with last year's 70B models.

| Model | Params | MMLU | Minimum RAM | Strengths |
| --- | --- | --- | --- | --- |
| **Phi-4-reasoning** | 14B | ~84% | 10GB (Q4) | Reasoning, math, code; rivals DeepSeek-R1-Distill-70B |
| **Qwen3.5-7B** | 7B | 76.8% | 6GB (Q4) | 3× faster, best performance per parameter |
| **Qwen2.5-32B** | 32B | 83.2% | 20GB (Q4) | Among the highest MMLU scores for open-weight models |
| **Gemma 4 E2B** | ~2B | ~62% | 3GB (Q4) | Ultra-light, mobile/IoT |
| **LFM2-24B-A2B** | 24B (MoE) | ~80% | 8GB (Q4) | Hybrid MoE, only ~2B parameters active per token |
| **Phi-4-multimodal** | 5.6B | n/a | 5GB (Q4) | Speech + vision + text in one model |
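
The "Minimum RAM" column can be sanity-checked with a common rule of thumb: weight size ≈ parameters × bits per weight ÷ 8, plus some headroom for the KV cache and runtime. The sketch below is only an approximation under stated assumptions: the 4.5 bits per weight for a "Q4" scheme and the 2 GB headroom are placeholder values, and `ModelMemoryEstimator` is a hypothetical name.

```csharp
using System;

// Hypothetical rule-of-thumb estimator for dense models.
public static class ModelMemoryEstimator
{
    // Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).
    // paramsBillions * 1e9 params * bitsPerWeight / 8 bytes, divided by 1e9.
    public static double WeightsGb(double paramsBillions, double bitsPerWeight)
    {
        if (paramsBillions <= 0)
            throw new ArgumentOutOfRangeException(nameof(paramsBillions));
        if (bitsPerWeight <= 0)
            throw new ArgumentOutOfRangeException(nameof(bitsPerWeight));

        return paramsBillions * bitsPerWeight / 8.0;
    }

    // Weights plus a flat allowance for KV cache, activations and runtime overhead.
    public static double TotalGb(
        double paramsBillions, double bitsPerWeight, double headroomGb)
        => WeightsGb(paramsBillions, bitsPerWeight) + headroomGb;
}
```

Under those assumptions, `ModelMemoryEstimator.TotalGb(14, 4.5, 2)` gives about 9.9 GB and `TotalGb(32, 4.5, 2)` gives 20 GB, in line with the Phi-4-reasoning and Qwen2.5-32B rows. The rule applies to dense models only and says nothing about how MoE models are loaded.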

#### 70–85% of Frontier Quality at Zero Cost

Real-world benchmarks show local inference on consumer hardware reaching **70–85% of the quality** of frontier models (Claude Opus, GPT-5.4), at zero marginal cost per request. For a great many production use cases, that is more than enough.

## 7. Hybrid Architecture: Local + Cloud, the Best of Both Worlds

In real production systems you rarely run 100% local or 100% cloud. The optimal architecture is **Hybrid Routing**: dispatching each request according to its complexity.

```mermaid
graph TB
    REQ["Incoming Request"] --> ROUTER["AI Router<br/>(Complexity Classifier)"]
    ROUTER -->|"Simple tasks<br/>Classification, Extract, QA"| LOCAL["Local LLM<br/>Phi-4 / Qwen3.5<br/>via Ollama"]
    ROUTER -->|"Medium tasks<br/>Summarization, Code Gen"| MID["Mid-tier Cloud<br/>Claude Haiku / GPT-4o-mini"]
    ROUTER -->|"Complex tasks<br/>Deep Reasoning, Creative"| CLOUD["Frontier Cloud<br/>Claude Opus / GPT-5.4"]
    LOCAL --> RESP["Response"]
    MID --> RESP
    CLOUD --> RESP
    ROUTER -->|"Offline / No network"| LOCAL
    style REQ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style ROUTER fill:#e94560,stroke:#fff,color:#fff
    style LOCAL fill:#4CAF50,stroke:#fff,color:#fff
    style MID fill:#ff9800,stroke:#fff,color:#fff
    style CLOUD fill:#2c3e50,stroke:#fff,color:#fff
    style RESP fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
```

Figure 3: Hybrid Routing, dispatching requests by complexity level

### 7.1. Implementing the Router in .NET 10

```csharp
// Complexity, AIResponse, ICloudAIService and EstimateCost are application-defined
// and not shown in this snippet.
public class AIRouter
{
    // System prompt passed to the local model (illustrative value)
    private const string LocalSystemPrompt = "You are a concise, helpful assistant.";

    private readonly ILocalAIService _localAI;
    private readonly ICloudAIService _cloudAI;
    private readonly IComplexityClassifier _classifier;

    public AIRouter(
        ILocalAIService localAI,
        ICloudAIService cloudAI,
        IComplexityClassifier classifier)
    {
        _localAI = localAI;
        _cloudAI = cloudAI;
        _classifier = classifier;
    }

    public async Task<AIResponse> RouteAsync(
        string prompt, CancellationToken ct)
    {
        var complexity = await _classifier.ClassifyAsync(prompt, ct);

        return complexity switch
        {
            // Simple tasks stay on the local model: no per-request cost
            Complexity.Simple => new AIResponse(
                await _localAI.GenerateAsync(LocalSystemPrompt, prompt, ct),
                Provider: "local-phi4",
                Cost: 0m),

            Complexity.Medium => new AIResponse(
                await _cloudAI.GenerateAsync(
                    prompt, "claude-haiku-4-5", ct),
                Provider: "cloud-haiku",
                Cost: EstimateCost(prompt, "haiku")),

            Complexity.Complex => new AIResponse(
                await _cloudAI.GenerateAsync(
                    prompt, "claude-opus-4-7", ct),
                Provider: "cloud-opus",
                Cost: EstimateCost(prompt, "opus")),

            _ => throw new ArgumentOutOfRangeException(
                nameof(complexity), complexity, "Unknown complexity level")
        };
    }
}
```
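
The router relies on an `AIResponse` type and an `EstimateCost` helper that the snippet does not define. One possible shape is sketched below. The record layout is inferred from the `new AIResponse(text, Provider: ..., Cost: ...)` calls, while `CostEstimator`, its method names and the 4-characters-per-token heuristic are assumptions for illustration. No real provider prices are hard-coded: the caller supplies the current rate.

```csharp
using System;

// Assumed shape, inferred from how the router constructs responses.
public record AIResponse(string Text, string Provider, decimal Cost);

// Hypothetical helper that an EstimateCost method could delegate to.
public static class CostEstimator
{
    // Rough heuristic (assumption): about 4 characters per token for English text.
    private const int CharsPerToken = 4;

    public static int EstimateTokens(string prompt)
    {
        ArgumentNullException.ThrowIfNull(prompt);
        // Ceiling division so that a short prompt still counts as at least one token.
        return (prompt.Length + CharsPerToken - 1) / CharsPerToken;
    }

    // Input-side cost only; output tokens are billed separately and are not
    // known until the response arrives.
    public static decimal EstimateInputCost(
        string prompt, decimal usdPerMillionInputTokens)
        => EstimateTokens(prompt) * usdPerMillionInputTokens / 1_000_000m;
}
```

For example, a 4,000-character prompt is estimated at 1,000 tokens, which costs $0.0025 at a rate of $2.50 per million input tokens. A production estimate would use the provider's tokenizer and reported usage rather than a character count.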

#### Cost-Saving Tip: Use a Local Model as the Classifier
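
One way to act on this tip is to let the local model label each prompt before routing, so classification itself costs nothing per request. The sketch below is an illustration under assumptions: `LocalComplexityClassifier`, the `Complexity` enum layout and the system prompt wording are hypothetical, and the model call is injected as a delegate with the same shape as `ILocalAIService.GenerateAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum Complexity { Simple, Medium, Complex }

public interface IComplexityClassifier
{
    Task<Complexity> ClassifyAsync(string prompt, CancellationToken ct);
}

// Hypothetical classifier that asks a local model for a one-word label.
public sealed class LocalComplexityClassifier : IComplexityClassifier
{
    private const string SystemPrompt =
        "Classify the user's request as exactly one word: simple, medium, or complex.";

    // (systemPrompt, userMessage, ct) => model reply
    private readonly Func<string, string, CancellationToken, Task<string>> _generate;

    public LocalComplexityClassifier(
        Func<string, string, CancellationToken, Task<string>> generate)
        => _generate = generate ?? throw new ArgumentNullException(nameof(generate));

    public async Task<Complexity> ClassifyAsync(string prompt, CancellationToken ct)
    {
        var reply = await _generate(SystemPrompt, prompt, ct);
        return Parse(reply);
    }

    // Small models do not always follow instructions exactly, so parse leniently
    // and fall back to Medium when the label is not recognized.
    public static Complexity Parse(string? reply)
    {
        var text = (reply ?? string.Empty).Trim().ToLowerInvariant();
        if (text.StartsWith("simple", StringComparison.Ordinal)) return Complexity.Simple;
        if (text.StartsWith("complex", StringComparison.Ordinal)) return Complexity.Complex;
        return Complexity.Medium;
    }
}
```

It could be wired up as `new LocalComplexityClassifier(localAI.GenerateAsync)`. Falling back to `Medium` is a design choice: an unparseable answer is sent to the mid-tier rather than risking a poor local answer or paying frontier prices.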

## 8. Choosing Hardware for On-Device AI

| Configuration | Suitable models | Speed (tokens/s) | Estimated cost |
| --- | --- | --- | --- |
| **Laptop 8GB RAM** (CPU only) | Qwen3.5-7B Q4, Gemma 4 E2B | 8–15 t/s | Already owned |
| **Desktop 16GB + RTX 4060** | Phi-4-reasoning Q4, Qwen3.5-7B Q5 | 30–50 t/s | ~$800 |
| **Workstation 32GB + RTX 4090** | Qwen2.5-32B Q4, Phi-4 Q8 | 50–80 t/s | ~$2,500 |
| **Server 64GB + 2× RTX 4090** | Llama 3.3-70B Q4, Qwen2.5-32B Q8 | 40–60 t/s | ~$5,000 |
| **Apple M4 Pro 24GB** | Phi-4-reasoning Q5, Qwen2.5-32B Q3 | 25–45 t/s | ~$2,000 |

#### A Note on VRAM vs RAM

## 9. Production Patterns for On-Device AI

### 9.1. Model Warm-up and Health Check

```csharp
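// Startup: warm up the model to avoid a cold start on the first request
app.Lifetime.ApplicationStarted.Register(() =>
{
    var ai = app.Services.GetRequiredService<ILocalAIService>();
    // Run in the background so startup is not blocked; log only once the call has finished
    _ = Task.Run(async () =>
    {
        try
        {
            await ai.GenerateAsync("system", "ping", CancellationToken.None);
            app.Logger.LogInformation("Local AI model warmed up");
        }
        catch (Exception ex)
        {
            app.Logger.LogError(ex, "Local AI warm-up failed");
        }
    });
});

// Health check endpoint (requires: using System.Diagnostics;)
app.MapGet("/health/ai", async (ILocalAIService ai) =>
{
    try
    {
        var sw = Stopwatch.StartNew();
        // Fail the probe if the model does not answer within 10 seconds
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        await ai.GenerateAsync("system", "test", cts.Token);
        return Results.Ok(new {
            status = "healthy",
            latency_ms = sw.ElapsedMilliseconds
        });
    }
    catch (Exception ex)
    {
        return Results.Json(new {
            status = "unhealthy",
            error = ex.Message
        }, statusCode: 503);
    }
});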
```

### 9.2. Concurrent Request Handling

LLM inference is sequential within a request. To handle many simultaneous requests, put a **request queue** with bounded concurrency in front of the model:

```csharp
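using System.Threading.Channels;

public class QueuedAIService : ILocalAIService
{
    private readonly Channel<AIWorkItem> _queue;
    private readonly ILocalAIService _inner;

    public QueuedAIService(ILocalAIService inner, int maxConcurrency = 2)
    {
        _inner = inner;
        // At most 100 pending requests; writers wait (backpressure) when the queue is full
        _queue = Channel.CreateBounded<AIWorkItem>(
            new BoundedChannelOptions(100)
            {
                FullMode = BoundedChannelFullMode.Wait
            });

        // Start the workers that drain the queue. They only run in parallel
        // if the inner service itself allows concurrent generations.
        for (int i = 0; i < maxConcurrency; i++)
            _ = ProcessQueueAsync();
    }

    public async Task<string> GenerateAsync(
        string system, string user, CancellationToken ct)
    {
        // Run continuations asynchronously so callers never resume on a worker loop
        var tcs = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        await _queue.Writer.WriteAsync(
            new AIWorkItem(system, user, tcs, ct), ct);
        return await tcs.Task;
    }

    private async Task ProcessQueueAsync()
    {
        await foreach (var item in _queue.Reader.ReadAllAsync())
        {
            try
            {
                var result = await _inner.GenerateAsync(
                    item.System, item.User, item.Ct);
                item.Tcs.TrySetResult(result);
            }
            catch (OperationCanceledException)
            {
                // Surface cancellation to the caller as a canceled task
                item.Tcs.TrySetCanceled(item.Ct);
            }
            catch (Exception ex)
            {
                item.Tcs.TrySetException(ex);
            }
        }
    }
}

record AIWorkItem(
    string System, string User,
    TaskCompletionSource<string> Tcs, CancellationToken Ct);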
```

### 9.3. Monitoring and Metrics

```csharp
// Measure performance with the .NET Metrics API (System.Diagnostics.Metrics)
var meter = new Meter("LocalAI.Inference");
var tokenCounter = meter.CreateCounter<long>("ai.tokens.generated");
var latencyHistogram = meter.CreateHistogram<double>("ai.inference.latency_ms");
var activeRequests = meter.CreateUpDownCounter<int>("ai.requests.active");

// Inside the inference method:
activeRequests.Add(1);
var sw = Stopwatch.StartNew();
try
{
    // ... inference logic ...
    tokenCounter.Add(tokenCount,
        new KeyValuePair<string, object?>("model", modelName));
    latencyHistogram.Record(sw.ElapsedMilliseconds,
        new KeyValuePair<string, object?>("model", modelName));
}
finally
{
    activeRequests.Add(-1);
}
```

## 10. Comparing Ollama, llama.cpp, and ONNX Runtime

| Criterion | Ollama | llama.cpp (direct) | ONNX Runtime GenAI |
| --- | --- | --- | --- |
| **Ease of use** | ⭐⭐⭐⭐⭐ One command | ⭐⭐⭐ Requires build/config | ⭐⭐⭐⭐ NuGet package |
| **Performance** | Good (llama.cpp wrapper) | Best (bare metal) | Good (optimized runtime) |
| **Model format** | GGUF | GGUF | ONNX (INT4/INT8/FP16) |
| **API style** | REST (OpenAI-compatible) | CLI / C API / HTTP server | C# native / C API |
| **.NET integration** | Via HTTP client | Via llama.cpp bindings | Native NuGet package (best) |
| **Multi-model** | ✅ Hot-swap models | ❌ One model per process | ✅ Multiple Model instances |
| **GPU support** | CUDA, ROCm, Metal | CUDA, ROCm, Metal, Vulkan | CUDA, DirectML, CoreML |
| **Best for** | Developers who want a quick start | Maximum performance, custom builds | .NET production apps |
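
To make the "Via HTTP client" cell concrete, here is a minimal sketch of calling a local Ollama server from C#. It assumes a default install listening on `http://localhost:11434` and uses Ollama's `/api/generate` endpoint in non-streaming mode, where the reply is a single JSON object with a `response` field. The `OllamaClient` class and its method names are hypothetical, not part of any official SDK.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical thin wrapper over Ollama's REST API.
public sealed class OllamaClient
{
    private readonly HttpClient _http;

    // The HttpClient must have BaseAddress set, e.g. http://localhost:11434
    public OllamaClient(HttpClient http)
        => _http = http ?? throw new ArgumentNullException(nameof(http));

    // Request body for /api/generate; stream=false asks for one JSON reply.
    public static string BuildRequestJson(string model, string prompt)
        => JsonSerializer.Serialize(new { model, prompt, stream = false });

    // Pulls the generated text out of the reply's "response" field.
    public static string ExtractResponse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("response").GetString() ?? string.Empty;
    }

    public async Task<string> GenerateAsync(
        string model, string prompt, CancellationToken ct)
    {
        using var content = new StringContent(
            BuildRequestJson(model, prompt), Encoding.UTF8, "application/json");
        using var reply = await _http.PostAsync("/api/generate", content, ct);
        reply.EnsureSuccessStatusCode();
        return ExtractResponse(await reply.Content.ReadAsStringAsync(ct));
    }
}
```

A typical call would be `await client.GenerateAsync("phi4-reasoning", "Explain records in C#", ct)`, with the model already pulled locally. Splitting request building and response parsing into static methods keeps them testable without a running server.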

## 11. Real-World Use Cases for On-Device AI

**Internal Code Assistant**

**Offline RAG Pipeline**

**Log Analysis & Anomaly Detection**

Stream application logs through a local LLM to detect anomaly patterns, classify error types, and suggest fixes. Processes 1,000+ log entries per minute with Gemma 4 E2B Q4 on CPU.

**On-Premise Customer Support Bot**

Finance and healthcare: a customer support chatbot that runs entirely on internal infrastructure. Patient and account data never leave the data center.

**Automated Code Review**

Integrated into the CI/CD pipeline: every PR is automatically run through a local LLM to detect bugs, security issues, and coding-convention violations. Cost per review: zero.

## 12. Conclusion

On-device AI in 2026 is no longer a "demo for fun". It has become a genuine **architectural strategy** with a mature ecosystem: Ollama for ease of use, llama.cpp for performance, and ONNX Runtime GenAI for .NET integration. The new generation of Small Language Models (Phi-4-reasoning, Qwen3.5, Gemma 4) has narrowed the quality gap with frontier models to just 15–30%, while the marginal cost of inference is zero.

The best strategy for most production systems is **Hybrid Routing**: a local model handles the 60–80% of requests that are simple, and a cloud model handles the remainder that needs deep reasoning. The result: AI costs cut by as much as 60–80%, no network dependency for the most common tasks, and data privacy for sensitive data.

Next step: install Ollama, run `ollama run phi4-reasoning`, and experience AI running on your own machine: zero cloud, zero cost, full control.

#### Useful Resources

• [Ollama Official](https://ollama.com/): downloads and the model library  
• [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp): source code and documentation  
• [ONNX Runtime GenAI Docs](https://onnxruntime.ai/docs/genai/): official Microsoft docs  
• [Phi-4 on Hugging Face](https://huggingface.co/microsoft/phi-4): model weights and usage guide  
• [Qwen Models](https://huggingface.co/Qwen): the Qwen3.5 model family

**References:**

- [Local AI in 2026: Ollama Benchmarks, $0 Inference — DEV Community](https://dev.to/pooyagolchian/local-ai-in-2026-running-production-llms-on-your-own-hardware-with-ollama-54d0)
- [Phi-4-reasoning Technical Report — Microsoft Research](https://www.microsoft.com/en-us/research/publication/phi-4-reasoning-technical-report/)
- [TurboQuant: Extreme KV Cache Quantization — llama.cpp Discussion](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [ONNX Runtime GenAI Documentation — Microsoft](https://onnxruntime.ai/docs/genai/)
- [llama.cpp GGUF Quantization Guide 2026 — DecodesFuture](https://www.decodesfuture.com/articles/llama-cpp-gguf-quantization-guide-2026)
- [Phi Open Models — Microsoft Azure](https://azure.microsoft.com/en-us/products/phi)
- [Microsoft.ML.OnnxRuntimeGenAI NuGet Package](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntimeGenAI)
