On-Device AI 2026: Running LLMs Locally with Ollama, llama.cpp & ONNX Runtime on .NET 10

Posted on: 4/22/2026 9:17:38 PM

In 2026, you no longer need to send every prompt to the cloud to get an AI response. With Ollama reaching 52 million downloads per month, llama.cpp supporting quantization from 1.5-bit to 8-bit, and ONNX Runtime GenAI integrating directly into .NET 10 — running LLMs locally has evolved from experiment to real production strategy. This article dives deep into the architecture, tools, and deployment strategies of On-Device AI for developers building cloud-independent applications.

52M — Ollama downloads/month (Q1 2026)
4.9× — KV cache compression with TurboQuant TQ3
14B — params: Phi-4-reasoning rivals 70B models
$0 — inference cost per request

1. Why On-Device AI Became Essential in 2026

Cloud AI has proven its power, but three core issues are pushing developers toward local inference:

1.1. Token Costs Accumulate Rapidly

An average AI application makes 10,000–50,000 requests per day. At GPT-4o pricing of ~$2.50/1M input tokens, monthly costs can reach thousands of dollars. On-Device AI completely eliminates per-token costs — you only pay once for hardware.
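To make the cost claim concrete, here is a minimal back-of-envelope sketch. The request volume, tokens per request, and the ~$2.50/1M figure are the illustrative assumptions from the paragraph above, not vendor quotes, and `monthly_cloud_cost` is a hypothetical helper:

```python
# Rough monthly cloud-inference bill from per-token pricing.
# All inputs are illustrative assumptions, not vendor quotes.
def monthly_cloud_cost(requests_per_day: int,
                       tokens_per_request: int,
                       usd_per_million_tokens: float) -> float:
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# 30,000 requests/day, ~1,500 input tokens each, at ~$2.50 per 1M input tokens
cost = monthly_cloud_cost(30_000, 1_500, 2.50)
print(f"${cost:,.0f}/month")  # → $3,375/month
```

At that volume the local alternative amortizes a one-time hardware purchase in a few months, which is the trade-off the rest of the article explores.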

1.2. Latency and Availability

Cloud inference adds 200–500ms network latency per request. With local inference, latency depends solely on hardware speed — typically 50–150ms for first token on consumer GPUs. More importantly, your application works fully offline, unaffected by provider outages.

1.3. Data Privacy

In healthcare, finance, and the legal sector, data must not leave internal infrastructure. On-Device AI is the only approach that guarantees zero data egress: not a single byte leaves your server.

When NOT to use On-Device AI?

If you need frontier-level quality (Claude Opus, GPT-5.4), long creative writing, or extremely complex reasoning — cloud models still excel. On-Device AI is best suited for: code completion, text classification, summarization, entity extraction, internal chatbots, and RAG pipelines.

2. On-Device AI Stack Architecture 2026

The On-Device AI ecosystem in 2026 consists of three main layers: Model Format (how models are stored and compressed), Inference Engine (the execution runtime), and Application Layer (API integration into applications).

graph TB
    subgraph APP["Application Layer"]
        A1["REST API (OpenAI-compatible)"]
        A2[".NET 10 App (ONNX Runtime GenAI)"]
        A3["Python App (llama-cpp-python)"]
        A4["Desktop/Mobile App"]
    end
    subgraph ENGINE["Inference Engine"]
        E1["Ollama v0.20 (52M downloads/mo)"]
        E2["llama.cpp (ggml backend)"]
        E3["ONNX Runtime GenAI v0.13"]
        E4["LM Studio (GUI)"]
    end
    subgraph FORMAT["Model Format & Quantization"]
        F1["GGUF (1.5-bit → 8-bit)"]
        F2["ONNX (INT4/INT8/FP16)"]
        F3["SafeTensors (HuggingFace)"]
    end
    subgraph MODELS["Small Language Models"]
        M1["Phi-4-reasoning (14B params)"]
        M2["Qwen3.5-7B/32B"]
        M3["Gemma 4 (2B/4B/26B/31B)"]
        M4["LFM2-24B-A2B (Hybrid MoE)"]
        M5["Llama 3.3 (8B/70B)"]
    end
    A1 --> E1
    A2 --> E3
    A3 --> E2
    A4 --> E4
    E1 --> F1
    E2 --> F1
    E3 --> F2
    E4 --> F1
    F1 --> MODELS
    F2 --> MODELS
    F3 --> MODELS
    style APP fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ENGINE fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style FORMAT fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style MODELS fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Figure 1: On-Device AI Stack Architecture 2026 — from model format to application layer

3. Ollama — The Easiest Gateway to On-Device AI

Ollama has become the default local LLM tool for developers in 2026 with 169,000+ GitHub stars. Ollama's philosophy: simplify the entire download → configure → run model workflow down to a single command.

3.1. Installation and Running Your First Model

# Install Ollama (Windows/macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run Phi-4-reasoning — powerful 14B reasoning model
ollama run phi4-reasoning

# Or Qwen3.5 7B — high efficiency, runs well on 8GB RAM
ollama run qwen3.5:7b

# Gemma 4 2B — ultra-light for edge devices
ollama run gemma4:2b

3.2. OpenAI-Compatible REST API

Ollama's killer feature is a REST API that is fully compatible with the OpenAI API. Any application using the OpenAI SDK only needs to change the base_url — no additional logic changes required:

// .NET 10 — using OpenAI SDK pointing to local Ollama
using OpenAI;
using OpenAI.Chat;

var client = new ChatClient(
    model: "phi4-reasoning",
    credential: new ApiKeyCredential("ollama"), // dummy key
    options: new OpenAIClientOptions
    {
        Endpoint = new Uri("http://localhost:11434/v1/")
    }
);

var response = await client.CompleteChatAsync(
    new ChatMessage[]
    {
        new SystemChatMessage("You are a .NET programming assistant."),
        new UserChatMessage("Explain Dependency Injection in 3 sentences.")
    }
);

Console.WriteLine(response.Value.Content[0].Text);

Tip: Multi-model routing

Ollama allows loading multiple models simultaneously. You can use Phi-4-reasoning for reasoning tasks, Qwen3.5-7B for general chat, and Gemma 4 2B for classification — all through the same endpoint, with just a different model field in each request.
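A minimal sketch of that routing idea, building (not sending) the OpenAI-compatible payload for Ollama's /v1/chat/completions endpoint. The task-to-model mapping and the `build_request` helper are illustrative assumptions, not part of Ollama itself:

```python
# One local endpoint, different "model" field per task type.
# The task-to-model table is an assumption for this example.
MODEL_FOR_TASK = {
    "reasoning": "phi4-reasoning",
    "chat": "qwen3.5:7b",
    "classification": "gemma4:2b",
}

def build_request(task: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat payload for POST /v1/chat/completions."""
    return {
        "model": MODEL_FOR_TASK[task],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("classification", "Label this ticket: 'Login page times out'")
print(req["model"])  # → gemma4:2b
```

The payload would be POSTed to http://localhost:11434/v1/chat/completions with any HTTP client; only the model field changes per task.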

3.3. Modelfile — Customizing Models for Specific Use Cases

# Modelfile for a code review assistant
FROM phi4-reasoning

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
You are a senior .NET developer specializing in code review.
When receiving code:
1. Find potential bugs
2. Suggest performance improvements
3. Check for security vulnerabilities
Be concise and get straight to the point.
"""
# Build and run custom model
ollama create code-reviewer -f Modelfile
ollama run code-reviewer

4. llama.cpp — High-Performance Inference Engine with GGUF Quantization

If Ollama is the friendly abstraction layer, then llama.cpp is the engine underneath. Written in pure C/C++, llama.cpp is the project that turned running LLMs on CPUs from theory into practice, and is currently the most widely used inference backend for on-device AI.

4.1. GGUF Format — The Quantization Standard for Local Inference

GGUF (GPT-Generated Unified Format) is a model file format designed specifically for llama.cpp, supporting quantization from 1.5-bit to 8-bit. Quantization reduces the precision of model weights (e.g., from 32-bit floats to 4-bit integers), shrinking model size and speeding up inference with minimal quality loss.

Quantization | Bits/Weight | 7B Model Size | RAM Required | Quality (PPL) | Use Case
Q8_0 | 8-bit | ~7.2 GB | ~9 GB | Near FP16 | Quality-first, GPU with spare VRAM
Q5_K_M | 5-bit | ~4.8 GB | ~7 GB | Very good | Best quality/size balance
Q4_K_M | 4-bit | ~4.1 GB | ~6 GB | Good | Most popular — 8GB RAM sufficient
Q3_K_M | 3-bit | ~3.3 GB | ~5 GB | Acceptable | Limited RAM, prioritize speed
Q2_K | 2-bit | ~2.7 GB | ~4 GB | Noticeable loss | Edge devices, embedded systems
IQ1_S | 1.5-bit | ~1.9 GB | ~3 GB | Low | Experimental, IoT
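The file sizes above follow from simple arithmetic: size is roughly parameter count times effective bits per weight divided by 8. "Effective" bits include quantization scales and metadata, so Q4_K_M stores closer to ~4.65 bits per weight than exactly 4; that constant and the `model_size_gb` helper are assumptions for this sketch:

```python
# Back-of-envelope size estimate for a quantized model:
# size ≈ params × effective bits per weight / 8.
# 4.65 effective bits for Q4_K_M is an assumed ballpark, not a spec value.
def model_size_gb(params_billions: float, effective_bits: float) -> float:
    bytes_total = params_billions * 1e9 * effective_bits / 8
    return bytes_total / 1e9

# A 7B model at ~4.65 effective bits/weight lands near the ~4.1 GB Q4_K_M row
print(f"{model_size_gb(7, 4.65):.1f} GB")  # → 4.1 GB
```

The same formula explains the RAM column: add the KV cache and runtime overhead (roughly 1–2 GB) on top of the file size.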

4.2. TurboQuant — KV Cache Compression Breakthrough (ICLR 2026)

TurboQuant (Zandieh et al., ICLR 2026) is a KV cache compression technique being integrated into llama.cpp. Instead of only quantizing model weights, TurboQuant compresses the KV cache — the temporary memory models use to track conversation context.

TQ3: 3-bit KV cache — 4.9× compression vs FP16
TQ4: 4-bit KV cache — 3.8× compression vs FP16
Double the context length within the same VRAM

Practical significance: with the same 8GB VRAM, you can process twice the context length, or run more parallel batch inference requests. This is a critical advancement for production workloads on consumer hardware.

graph LR
    subgraph BEFORE["Before TurboQuant"]
        B1["Model Weights Q4_K_M = 4.1GB"] --- B2["KV Cache FP16, 8K ctx = 2GB"]
        B2 --- B3["Total: 6.1GB, only 8K context"]
    end
    subgraph AFTER["After TurboQuant TQ3"]
        A1["Model Weights Q4_K_M = 4.1GB"] --- A2["KV Cache TQ3, 16K ctx = 0.8GB"]
        A2 --- A3["Total: 4.9GB, 16K context!"]
    end
    BEFORE -.->|"4.9× KV Cache Compression"| AFTER
    style BEFORE fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style AFTER fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
Figure 2: TurboQuant doubles context length with the same VRAM budget
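The savings can be sanity-checked with the standard KV cache size formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bits per element. The layer and head counts below are illustrative for a 7B-class model with grouped-query attention, not taken from any specific model card:

```python
# KV cache size: 2 (K and V) × layers × kv_heads × head_dim × ctx × bits / 8.
# Shape constants are illustrative for a ~7B GQA model, not a real spec.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, bits_per_elem: int) -> float:
    elems = 2 * layers * kv_heads * head_dim * ctx_len
    return elems * bits_per_elem / 8 / 1e9

fp16 = kv_cache_gb(32, 8, 128, 8192, 16)   # FP16 cache at 8K context
tq3  = kv_cache_gb(32, 8, 128, 16384, 3)   # 3-bit cache at double the context
print(f"FP16 @8K: {fp16:.2f} GB, TQ3 @16K: {tq3:.2f} GB")  # → 1.07 GB vs 0.40 GB
```

Even at twice the context, the 3-bit cache uses well under half the FP16 budget, which is where the "double the context in the same VRAM" claim comes from.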

4.3. Flash Attention 3 — Efficient Long Context Processing

Flash Attention 3 optimizes the attention mechanism that traditionally scales O(n²) with context length. On llama.cpp, FA3 prevents "performance cliffs" as conversations grow longer, maintaining stable inference speed even with 32K+ token contexts.
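The O(n²) cost is easy to see: naive attention materializes an n×n score matrix per head, which Flash Attention avoids by computing attention in tiles that stay in fast on-chip memory. A quick sketch of the naive cost (32 heads and FP16 are assumed figures):

```python
# Memory for the naive n×n attention score matrix across all heads.
# Head count and FP16 element size are illustrative assumptions.
def naive_attn_scores_gb(ctx_len: int, heads: int, bytes_per_elem: int = 2) -> float:
    return ctx_len * ctx_len * heads * bytes_per_elem / 1e9

print(f"{naive_attn_scores_gb(4096, 32):.1f} GB")   # 4K context  → 1.1 GB
print(f"{naive_attn_scores_gb(32768, 32):.1f} GB")  # 32K context → 68.7 GB
```

Going from 4K to 32K context multiplies that intermediate by 64×, which is exactly the "performance cliff" tiled attention kernels are designed to flatten.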

5. ONNX Runtime GenAI — Integrating Local AI into .NET 10

For .NET developers, ONNX Runtime GenAI is the most direct bridge to running LLMs in C# applications without an intermediate server. The Microsoft.ML.OnnxRuntimeGenAI v0.13 package provides the full generative AI loop: pre/post processing, inference, logits processing, KV cache management, and grammar-based tool calling.

5.1. Setup on .NET 10

# Create new project
dotnet new console -n LocalAI.Demo
cd LocalAI.Demo

# Add ONNX Runtime GenAI package
dotnet add package Microsoft.ML.OnnxRuntimeGenAI --version 0.13.1

# For GPU (CUDA)
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda --version 0.13.1

5.2. Running Phi-4-mini Locally in C#

using Microsoft.ML.OnnxRuntimeGenAI;

// Download model from HuggingFace: microsoft/Phi-4-mini-instruct-onnx
var modelPath = @"C:\models\phi-4-mini-instruct-onnx\cpu-int4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var systemPrompt = "You are an AI assistant specializing in .NET and C#. Answer concisely.";
var userMessage = "Compare record vs class in C# 13, when to use which?";

var fullPrompt = $"<|system|>{systemPrompt}<|end|><|user|>{userMessage}<|end|><|assistant|>";

using var tokens = tokenizer.Encode(fullPrompt);
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("temperature", 0.3);
generatorParams.SetSearchOption("top_p", 0.9);
generatorParams.SetInputSequences(tokens);

using var generator = new Generator(model, generatorParams);
using var tokenizerStream = tokenizer.CreateStream();

while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    var newToken = tokenizerStream.Decode(
        generator.GetSequence(0)[^1]
    );
    Console.Write(newToken);
}
Console.WriteLine();

GPU vs CPU — When Do You Need a GPU?

ONNX Runtime automatically runs on GPU (if CUDA/DirectML is available) or falls back to CPU. With INT4 models, CPU inference on modern Intel/AMD chips achieves 10–25 tokens/second — sufficient for interactive chat. Consumer GPUs (RTX 4060+) push this to 40–80 tokens/second.
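Whether 10–25 tokens/second is "sufficient" depends on reply length. A simple latency model (first-token latency of ~150ms from the text above; `response_seconds` is a hypothetical helper) shows the difference in wall-clock feel:

```python
# Perceived latency: first-token delay plus streaming time for the reply.
# The 150ms first-token figure is taken from the surrounding text.
def response_seconds(reply_tokens: int, tokens_per_sec: float,
                     first_token_ms: float = 150.0) -> float:
    return first_token_ms / 1000 + reply_tokens / tokens_per_sec

# A ~300-token answer at CPU vs. consumer-GPU speeds
print(f"CPU 15 t/s: {response_seconds(300, 15):.1f}s")
print(f"GPU 60 t/s: {response_seconds(300, 60):.1f}s")
```

Roughly 20 seconds versus 5 seconds for the same answer: fine for streamed interactive chat on CPU, but a GPU matters once replies get long or requests queue up.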

5.3. Integration into ASP.NET 10 API

// Program.cs — Register ONNX model as singleton
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton<ILocalAIService>(sp =>
{
    var modelPath = builder.Configuration["LocalAI:ModelPath"]!;
    return new OnnxLocalAIService(modelPath);
});

var app = builder.Build();

app.MapPost("/api/chat", async (
    ChatRequest request,
    ILocalAIService ai,
    CancellationToken ct) =>
{
    var response = await ai.GenerateAsync(
        request.SystemPrompt,
        request.Message,
        ct
    );
    return Results.Ok(new { response });
});

app.Run();
// OnnxLocalAIService.cs
using System.Text;
using Microsoft.ML.OnnxRuntimeGenAI;

public class OnnxLocalAIService : ILocalAIService, IDisposable
{
    private readonly Model _model;
    private readonly Tokenizer _tokenizer;
    private readonly SemaphoreSlim _semaphore = new(1, 1);

    public OnnxLocalAIService(string modelPath)
    {
        _model = new Model(modelPath);
        _tokenizer = new Tokenizer(_model);
    }

    public async Task<string> GenerateAsync(
        string systemPrompt, string userMessage, CancellationToken ct)
    {
        await _semaphore.WaitAsync(ct);
        try
        {
            var prompt = $"<|system|>{systemPrompt}<|end|>" +
                         $"<|user|>{userMessage}<|end|><|assistant|>";

            using var tokens = _tokenizer.Encode(prompt);
            using var genParams = new GeneratorParams(_model);
            genParams.SetSearchOption("max_length", 2048);
            genParams.SetSearchOption("temperature", 0.3);
            genParams.SetInputSequences(tokens);

            using var generator = new Generator(_model, genParams);
            using var stream = _tokenizer.CreateStream();
            var result = new StringBuilder();

            while (!generator.IsDone())
            {
                ct.ThrowIfCancellationRequested();
                generator.ComputeLogits();
                generator.GenerateNextToken();
                result.Append(stream.Decode(
                    generator.GetSequence(0)[^1]
                ));
            }
            return result.ToString();
        }
        finally
        {
            _semaphore.Release();
        }
    }

    public void Dispose()
    {
        _tokenizer?.Dispose();
        _model?.Dispose();
    }
}

6. Small Language Models 2026 — Small but Mighty

The on-device AI revolution is driven by the new generation of Small Language Models (SLMs) — models under 15B parameters that achieve benchmarks comparable to last year's 70B models.

Model | Params | MMLU | Min RAM | Strength
Phi-4-reasoning | 14B | ~84% | 10GB (Q4) | Reasoning, math, code — rivals DeepSeek-R1-Distill-70B
Qwen3.5-7B | 7B | 76.8% | 6GB (Q4) | 3× faster, highest efficiency per param
Qwen2.5-32B | 32B | 83.2% | 20GB (Q4) | Highest MMLU among open-weight models
Gemma 4 E2B | ~2B | ~62% | 3GB (Q4) | Ultra-light, mobile/IoT
LFM2-24B-A2B | 24B (MoE) | ~80% | 8GB (Q4) | Hybrid MoE, activates only 2B per inference
Phi-4-multimodal | 5.6B | — | 5GB (Q4) | Speech + Vision + Text in one model

70–85% frontier quality, $0 cost

Real-world benchmarks show that local inference on consumer hardware achieves 70–85% quality compared to frontier models (Claude Opus, GPT-5.4), with zero marginal cost per request. For many production use cases — this is more than enough.

7. Hybrid Architecture: Local + Cloud — Best of Both Worlds

In real production, you rarely use 100% local or 100% cloud. The optimal architecture is Hybrid Routing — routing requests based on complexity.

graph TB
    REQ["Incoming Request"] --> ROUTER["AI Router (Complexity Classifier)"]
    ROUTER -->|"Simple tasks: Classification, Extract, QA"| LOCAL["Local LLM: Phi-4 / Qwen3.5 via Ollama"]
    ROUTER -->|"Medium tasks: Summarization, Code Gen"| MID["Mid-tier Cloud: Claude Haiku / GPT-4o-mini"]
    ROUTER -->|"Complex tasks: Deep Reasoning, Creative"| CLOUD["Frontier Cloud: Claude Opus / GPT-5.4"]
    ROUTER -->|"Offline / No network"| LOCAL
    LOCAL --> RESP["Response"]
    MID --> RESP
    CLOUD --> RESP
    style REQ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style ROUTER fill:#e94560,stroke:#fff,color:#fff
    style LOCAL fill:#4CAF50,stroke:#fff,color:#fff
    style MID fill:#ff9800,stroke:#fff,color:#fff
    style CLOUD fill:#2c3e50,stroke:#fff,color:#fff
    style RESP fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Figure 3: Hybrid Routing — routing requests by complexity level

7.1. Implementing Router in .NET 10

public class AIRouter
{
    private readonly ILocalAIService _localAI;
    private readonly ICloudAIService _cloudAI;
    private readonly IComplexityClassifier _classifier;

    public AIRouter(
        ILocalAIService localAI,
        ICloudAIService cloudAI,
        IComplexityClassifier classifier)
    {
        _localAI = localAI;
        _cloudAI = cloudAI;
        _classifier = classifier;
    }

    public async Task<AIResponse> RouteAsync(
        string prompt, CancellationToken ct)
    {
        var complexity = await _classifier.ClassifyAsync(prompt, ct);

        return complexity switch
        {
            Complexity.Simple => new AIResponse(
                await _localAI.GenerateAsync(
                    "You are a helpful assistant.", prompt, ct),
                Provider: "local-phi4",
                Cost: 0m),

            Complexity.Medium => new AIResponse(
                await _cloudAI.GenerateAsync(
                    prompt, "claude-haiku-4-5", ct),
                Provider: "cloud-haiku",
                Cost: EstimateCost(prompt, "haiku")),

            Complexity.Complex => new AIResponse(
                await _cloudAI.GenerateAsync(
                    prompt, "claude-opus-4-7", ct),
                Provider: "cloud-opus",
                Cost: EstimateCost(prompt, "opus")),

            _ => throw new ArgumentOutOfRangeException()
        };
    }
}

Cost-saving tip: Use local model as classifier

A small local model (Gemma 4 2B) can serve as the complexity classifier itself. Cost: $0. Time: ~20ms. Result: 60–80% of requests handled locally, reducing cloud costs by 60–80%.
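Before reaching for even a small model, a keyword heuristic can serve as the first routing tier. This is a minimal sketch of the classifier idea, where the keyword lists and the length threshold are assumptions, not a production ruleset:

```python
# Minimal heuristic complexity classifier, a stand-in for the small-model
# classifier described above. Keyword lists and thresholds are assumptions.
COMPLEX_HINTS = ("prove", "design an architecture", "step by step", "trade-off")
MEDIUM_HINTS = ("summarize", "refactor", "generate code", "translate")

def classify(prompt: str) -> str:
    p = prompt.lower()
    if any(h in p for h in COMPLEX_HINTS) or len(p.split()) > 200:
        return "complex"
    if any(h in p for h in MEDIUM_HINTS):
        return "medium"
    return "simple"

print(classify("Summarize this meeting transcript"))  # → medium
print(classify("What is the capital of France?"))     # → simple
```

Requests the heuristic cannot place confidently can then fall through to the Gemma 4 2B classifier, keeping the zero-cost path for the easy majority.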

8. Choosing Hardware for On-Device AI

Configuration | Suitable Models | Speed (tokens/s) | Est. Cost
Laptop, 8GB RAM (CPU only) | Qwen3.5-7B Q4, Gemma 4 2B | 8–15 t/s | Already owned
Desktop, 16GB + RTX 4060 | Phi-4-reasoning Q4, Qwen3.5-7B Q5 | 30–50 t/s | ~$800
Workstation, 32GB + RTX 4090 | Qwen2.5-32B Q4, Phi-4 Q8 | 50–80 t/s | ~$2,500
Server, 64GB + 2× RTX 4090 | Llama 3.3-70B Q4, Qwen2.5-32B Q8 | 40–60 t/s | ~$5,000
Apple M4 Pro, 24GB | Phi-4-reasoning Q5, Qwen2.5-32B Q3 | 25–45 t/s | ~$2,000

Note on VRAM vs RAM

GPU inference requires the entire model to fit in VRAM. RTX 4060 only has 8GB VRAM — just enough for 7B Q4. If the model exceeds VRAM, llama.cpp will offload part to CPU RAM, but speed drops 3–5×. Apple Silicon has the advantage of unified memory — 24GB M4 Pro can use all of it for GPU inference.
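The fit-or-offload decision can be estimated up front, in the spirit of llama.cpp's layer-offload setting (its n_gpu_layers option). The `layers_on_gpu` helper, the uniform-layer assumption, and the KV reserve figure are all simplifications for this sketch:

```python
# Rough check: how many transformer layers fit in VRAM after reserving
# room for the KV cache? Assumes uniformly sized layers (a simplification).
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  kv_reserve_gb: float = 1.5) -> int:
    per_layer = model_gb / n_layers
    usable = max(vram_gb - kv_reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer))

# 7B Q4 (~4.1 GB, 32 layers) on an 8 GB card: everything stays on the GPU
print(layers_on_gpu(4.1, 32, 8.0))   # → 32
# 32B Q4 (~20 GB, 64 layers) on the same card: most layers spill to CPU RAM
print(layers_on_gpu(20.0, 64, 8.0))  # → 20
```

In the second case roughly two-thirds of the layers run on CPU, which is where the quoted 3–5× slowdown comes from.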

9. Production Patterns for On-Device AI

9.1. Model Warm-up and Health Check

// Startup — warm up model to avoid cold start
app.Lifetime.ApplicationStarted.Register(() =>
{
    var ai = app.Services.GetRequiredService<ILocalAIService>();
    _ = ai.GenerateAsync("system", "ping", CancellationToken.None);
    app.Logger.LogInformation("Local AI model warmed up");
});

// Health check endpoint
app.MapGet("/health/ai", async (ILocalAIService ai) =>
{
    try
    {
        var sw = Stopwatch.StartNew();
        await ai.GenerateAsync("system", "test",
            new CancellationTokenSource(TimeSpan.FromSeconds(10)).Token);
        return Results.Ok(new {
            status = "healthy",
            latency_ms = sw.ElapsedMilliseconds
        });
    }
    catch (Exception ex)
    {
        return Results.Json(new {
            status = "unhealthy",
            error = ex.Message
        }, statusCode: 503);
    }
});

9.2. Concurrent Request Handling

LLM inference is sequential per request. To handle multiple concurrent requests, use a request queue with bounded concurrency:

using System.Threading.Channels;

public class QueuedAIService : ILocalAIService
{
    private readonly Channel<AIWorkItem> _queue;
    private readonly ILocalAIService _inner;

    public QueuedAIService(ILocalAIService inner, int maxConcurrency = 2)
    {
        _inner = inner;
        _queue = Channel.CreateBounded<AIWorkItem>(
            new BoundedChannelOptions(100)
            {
                FullMode = BoundedChannelFullMode.Wait
            });

        for (int i = 0; i < maxConcurrency; i++)
            _ = ProcessQueueAsync();
    }

    public async Task<string> GenerateAsync(
        string system, string user, CancellationToken ct)
    {
        var tcs = new TaskCompletionSource<string>();
        await _queue.Writer.WriteAsync(
            new AIWorkItem(system, user, tcs, ct), ct);
        return await tcs.Task;
    }

    private async Task ProcessQueueAsync()
    {
        await foreach (var item in _queue.Reader.ReadAllAsync())
        {
            try
            {
                var result = await _inner.GenerateAsync(
                    item.System, item.User, item.Ct);
                item.Tcs.SetResult(result);
            }
            catch (Exception ex)
            {
                item.Tcs.SetException(ex);
            }
        }
    }
}

record AIWorkItem(
    string System, string User,
    TaskCompletionSource<string> Tcs, CancellationToken Ct);

9.3. Monitoring and Metrics

// Measure performance with .NET Metrics API
using System.Diagnostics;
using System.Diagnostics.Metrics;

var meter = new Meter("LocalAI.Inference");
var tokenCounter = meter.CreateCounter<long>("ai.tokens.generated");
var latencyHistogram = meter.CreateHistogram<double>("ai.inference.latency_ms");
var activeRequests = meter.CreateUpDownCounter<int>("ai.requests.active");

// In the inference method:
activeRequests.Add(1);
var sw = Stopwatch.StartNew();
try
{
    // ... inference logic ...
    tokenCounter.Add(tokenCount,
        new KeyValuePair<string, object?>("model", modelName));
    latencyHistogram.Record(sw.ElapsedMilliseconds,
        new KeyValuePair<string, object?>("model", modelName));
}
finally
{
    activeRequests.Add(-1);
}

10. Comparing Ollama vs llama.cpp vs ONNX Runtime

Criteria | Ollama | llama.cpp (direct) | ONNX Runtime GenAI
Ease of use | ⭐⭐⭐⭐⭐ One command | ⭐⭐⭐ Requires build/config | ⭐⭐⭐⭐ NuGet package
Performance | Good (llama.cpp wrapper) | Best (bare metal) | Good (optimized runtime)
Model format | GGUF | GGUF | ONNX (INT4/INT8/FP16)
API style | REST (OpenAI-compatible) | CLI / C API / HTTP server | C# native / C API
.NET integration | Via HTTP client | Via llama.cpp bindings | Native NuGet — best
Multi-model | ✅ Hot-swap models | ❌ One model/process | ✅ Multiple Model instances
GPU support | CUDA, ROCm, Metal | CUDA, ROCm, Metal, Vulkan | CUDA, DirectML, CoreML
Best for | Developers wanting quick setup | Max performance, custom needs | .NET production apps

11. Real-World Use Cases for On-Device AI

Internal Code Assistant
IDE plugin for code suggestions, refactoring, and test generation — running Phi-4-reasoning locally. Zero latency, zero cost, code never leaves the developer's machine. Particularly valuable for projects under NDA or with proprietary codebases.
Offline RAG Pipeline
Ingest internal documents (policies, SOPs, knowledge base) into a local vector store, using Qwen3.5-7B as the generation model. Employees query the knowledge base through a chat interface — completely air-gapped.
Log Analysis & Anomaly Detection
Stream application logs through a local LLM to detect anomaly patterns, classify error types, and suggest fixes. Process 1000+ log entries per minute with Gemma 4 2B Q4 on CPU.
On-Premise Customer Support Bot
In finance and healthcare — customer support chatbot running entirely on internal infrastructure. Patient/account data never leaves the data center.
Automated Code Review
Integrated into CI/CD pipeline: every PR automatically runs through a local LLM to detect bugs, security issues, and coding convention violations. Cost: $0 per review.

12. Conclusion

On-Device AI in 2026 is no longer "running demos for fun" — it has become a real architectural strategy with a mature ecosystem: Ollama for ease-of-use, llama.cpp for performance, ONNX Runtime GenAI for .NET integration. The new generation of Small Language Models (Phi-4-reasoning, Qwen3.5, Gemma 4) has narrowed the quality gap with frontier models to just 15–30%, while inference cost is zero.

The optimal strategy for most production systems is Hybrid Routing: local models handle 60–80% of simple requests, cloud models for the rest requiring deep reasoning. Result: 60–80% reduction in AI costs, eliminating network dependency for common tasks, and ensuring data privacy for sensitive information.

Next step: install Ollama, run ollama run phi4-reasoning, and experience AI power right on your machine — zero cloud, zero cost, full control.

Useful Resources

Ollama Official — Download and model library
llama.cpp GitHub — Source code and documentation
ONNX Runtime GenAI Docs — Microsoft official docs
Phi-4 on HuggingFace — Model weights and guides
Qwen Models — Qwen3.5 model family
