Posted on: 4/18/2026 8:11:46 AM

# Grafana LGTM Stack: Building a Free Observability Platform for Production

Does your application run in production, yet when an incident hits you have to SSH into a server and read logs with `grep`? Or worse, do you only learn which service is slow when customers complain? The **Grafana LGTM Stack**, a free and open-source observability toolkit, addresses this by bringing **Logs, Metrics, Traces**, and **Profiles** together on a single platform.

- **100%** open-source, self-hosted, no vendor lock-in
- **65%** lower MTTR compared with traditional monitoring
- **10M+** metrics per second handled by Mimir
- **4** signals: logs, metrics, traces, profiles

## 1. What Is the LGTM Stack?

LGTM stands for the four core components developed by Grafana Labs:

| Component | Role | Comparable tools |
| --- | --- | --- |
| **Loki** | Log aggregation: collects, stores, and queries logs | Splunk, Datadog Logs |
| **Grafana** | Visualization: dashboards, alerting, explore | Datadog Dashboard, Kibana |
| **Tempo** | Distributed tracing: follows a request across services | Jaeger, Datadog APM |
| **Mimir** | Metrics storage: long-term storage for Prometheus metrics | Thanos, Cortex, Datadog Metrics |

Beyond the four main components, the stack also includes **Grafana Alloy**, a unified collector that takes over the jobs of Promtail, Grafana Agent, and a standalone OpenTelemetry Collector. It acts as the stack's "extended arm", gathering every telemetry signal from your applications.

#### Why Not the ELK Stack?

ELK (Elasticsearch + Logstash + Kibana) indexes the full content of every log line, which costs a great deal of RAM and disk. Loki indexes only **labels** (metadata) and stores the log lines compressed, which can cut storage by roughly 10-50x. For small and mid-sized systems, the LGTM stack runs comfortably on a single server with 4 CPUs and 8 GB of RAM.

## 2. Overall Architecture of the LGTM Stack

Understanding the architecture tells you where data comes from, where it goes, and which component to check when something breaks.

```
graph LR
    subgraph Applications
        A1["ASP.NET Core API"]
        A2["Vue.js Frontend"]
        A3["Background Worker"]
    end

    subgraph "Grafana Alloy (Collector)"
        C1["OTLP Receiver"]
        C2["Prometheus Scraper"]
        C3["Log Pipeline"]
    end

    subgraph "Storage Backends"
        M["Mimir<br/>Metrics"]
        L["Loki<br/>Logs"]
        T["Tempo<br/>Traces"]
    end

    G["Grafana<br/>Dashboard + Alerting"]

    A1 -->|OTLP gRPC| C1
    A2 -->|OTLP HTTP| C1
    A3 -->|OTLP gRPC| C1
    A1 -->|metrics endpoint| C2
    C1 --> M
    C1 --> T
    C2 --> M
    C3 --> L
    M --> G
    L --> G
    T --> G

    style A1 fill:#e94560,stroke:#fff,color:#fff
    style A2 fill:#e94560,stroke:#fff,color:#fff
    style A3 fill:#e94560,stroke:#fff,color:#fff
    style C1 fill:#2c3e50,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff
    style C3 fill:#2c3e50,stroke:#fff,color:#fff
    style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#4CAF50,stroke:#fff,color:#fff
```

LGTM Stack architecture: data flows from the applications through Alloy to the storage backends, and Grafana queries all of them

## 3. Grafana Alloy: the Unified Collector

Previously you had to run Promtail (for logs), Grafana Agent (for metrics), and an OpenTelemetry Collector (for traces) as separate processes. **Grafana Alloy** combines them into a single binary configured with a declarative language, the Alloy configuration syntax (formerly called River).

#### What Does Alloy Replace?

`Promtail` → Alloy `loki.*` components · `Grafana Agent` → Alloy `prometheus.*` components · `OTel Collector` → Alloy `otelcol.*` components. One process, one config, one place to debug.

Example Alloy config that receives OTLP from a .NET application and forwards it to Loki, Tempo, and Mimir:

```
// Receive telemetry over OTLP
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch to reduce network overhead
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Export metrics to Mimir
otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}

// Export logs to Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Export traces to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}
```

#### Key Takeaway

Alloy uses a **component-based** model: each block is a component with inputs and outputs, wired to other components through `forward_to` or `output`. You can insert a processor (filter, transform, sample) in the middle of a pipeline without touching the receiver or the exporter.
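
The wiring idea can be illustrated with a small Python sketch. This is only a toy model of the component pattern, not Alloy's implementation; the `Component` class and its `receive` and `forward_to` members are hypothetical names chosen for the illustration.

```python
class Component:
    """Toy pipeline stage: transform a batch, then forward it downstream."""

    def __init__(self, name, transform=lambda batch: batch):
        self.name = name
        self.transform = transform
        self.forward_to = []   # downstream components
        self.seen = []         # batches this component has emitted

    def receive(self, batch):
        out = self.transform(batch)
        self.seen.append(out)
        for target in self.forward_to:
            target.receive(out)


# Initial pipeline: receiver -> exporter
receiver = Component("receiver")
exporter = Component("exporter")
receiver.forward_to = [exporter]
receiver.receive(["GET /health", "GET /orders"])

# Insert a filter in the middle; receiver and exporter code stay unchanged,
# only the wiring between them is updated.
drop_health = Component(
    "filter", lambda batch: [line for line in batch if "/health" not in line]
)
receiver.forward_to = [drop_health]
drop_health.forward_to = [exporter]
receiver.receive(["GET /health", "GET /orders"])
```

After the second call the exporter has seen the unfiltered batch once and the filtered batch once, which mirrors how adding a processor in Alloy only changes the references between blocks.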

## 4. Loki: Cost-Efficient Log Aggregation

Loki is the heart of log collection in the LGTM Stack. Unlike Elasticsearch (full-text indexing), Loki indexes only **labels** (for example `{app="api", env="production"}`) and stores the log content compressed. This brings:

- **Roughly 10-50x cheaper storage** than Elasticsearch for the same log volume
- **Simpler operations**: no JVM heap tuning, no shard rebalancing
- **Natural integration** with Prometheus labels: the same label set for metrics and logs
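
To make the label-only index concrete, here is a minimal Python sketch of the idea. It is an illustration under simplifying assumptions, not Loki's actual storage engine: the `TinyLogStore` class and its `push` and `query` methods are hypothetical, and `zlib` stands in for Loki's chunk compression.

```python
import zlib
from collections import defaultdict


class TinyLogStore:
    """Toy model of a label-indexed log store (not Loki's real code)."""

    def __init__(self):
        # Index: (label, value) pair -> set of stream ids. Log text is never indexed.
        self.index = defaultdict(set)
        # Stream id -> list of compressed log lines.
        self.chunks = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        stream_id = tuple(sorted(labels.items()))
        for pair in stream_id:
            self.index[pair].add(stream_id)
        self.chunks[stream_id].append(zlib.compress(line.encode()))

    def query(self, selector: dict, contains: str) -> list:
        # Step 1: the small label index narrows the search to matching streams.
        candidates = [self.index.get(pair, set()) for pair in selector.items()]
        streams = set.intersection(*candidates) if candidates else set()
        # Step 2: brute-force scan of the decompressed lines in those streams only.
        hits = []
        for stream_id in sorted(streams):
            for blob in self.chunks[stream_id]:
                line = zlib.decompress(blob).decode()
                if contains in line:
                    hits.append(line)
        return hits


store = TinyLogStore()
store.push({"app": "api", "env": "production"}, "GET /orders 200")
store.push({"app": "api", "env": "production"}, "error: db timeout")
store.push({"app": "worker", "env": "production"}, "error: queue full")
result = store.query({"app": "api"}, "error")
```

The index grows with the number of distinct label sets rather than with log volume, which is why keeping label cardinality low matters so much for this design.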

### LogQL: the Log Query Language

LogQL is inspired by PromQL and combines label selectors with filter expressions:

```
# Error logs from the api service (the time range comes from the Grafana time picker or the API call)
{app="api", env="production"} |= "error" | json | status_code >= 500

# Per-second rate of failing requests by endpoint, over 5-minute windows
sum by (endpoint) (rate({app="api"} |= "HTTP" | json | status_code >= 500 [5m]))

# P99 response time computed from logs
quantile_over_time(0.99, {app="api"} | json | unwrap duration_ms [5m]) by (app)

# Pattern parser: extract fields from an unstructured line format
{app="api"} | pattern "<ip> - <method> <path> <status> <duration>ms"
  | status >= 500
```

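#### Bloom Filters in Loki 3.x

Loki 3.0 introduced experimental **Bloom filters** to speed up filter queries. Instead of scanning every chunk, Loki checks the Bloom filter first and quickly rules out chunks that cannot contain the term, which can reduce I/O considerably for "needle in a haystack" queries such as `|= "OutOfMemoryException"` on large datasets. From Loki 3.3 onward the feature was reworked to build blooms from structured metadata rather than from log line content, so check the query acceleration documentation for the version you run.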
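
The skipping idea can be sketched in a few lines of Python. This is a generic Bloom filter for illustration only, not Loki's bloom implementation; the `BloomFilter` class, the chunk names, and the token lists are hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# One filter per chunk, built from the tokens stored in that chunk.
chunks = {
    "chunk-a": ["request", "completed", "200"],
    "chunk-b": ["OutOfMemoryException", "heap", "exhausted"],
}
filters = {}
for name, tokens in chunks.items():
    bloom = BloomFilter()
    for token in tokens:
        bloom.add(token)
    filters[name] = bloom

# Only chunks whose filter answers "maybe" have to be opened and scanned.
to_scan = [
    name for name, bloom in filters.items()
    if bloom.might_contain("OutOfMemoryException")
]
```

A "no" from the filter is always correct, so skipped chunks never hide a match; a "maybe" can occasionally be wrong, which only costs an unnecessary scan.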

### Structured Metadata

Since Loki 3.0, where it is enabled by default, you can attach **structured metadata** to log entries without turning the values into labels, which avoids high cardinality. Examples: `trace_id`, `user_id`, `request_id`. You can filter on them, but they do not create additional streams.

```
# Query logs by the trace_id stored in structured metadata
{app="api"} | trace_id = "abc123def456"
```
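
On the write side, structured metadata travels next to the log line rather than inside the stream labels. The sketch below builds a request body assuming Loki's JSON push format, in which each entry in `values` may carry an optional third element holding the metadata; the helper name `build_push_payload` and the sample values are hypothetical, and the code only constructs the body without sending it.

```python
import json
import time


def build_push_payload(labels: dict, line: str, metadata: dict, ts_ns=None) -> dict:
    """Build a JSON body for Loki's /loki/api/v1/push endpoint."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [
            {
                # Low-cardinality labels identify the stream and are indexed.
                "stream": labels,
                # Each entry: [timestamp in ns as a string, log line, structured metadata].
                "values": [[str(ts_ns), line, metadata]],
            }
        ]
    }


payload = build_push_payload(
    labels={"app": "api", "env": "production"},
    line="order created",
    metadata={"trace_id": "abc123def456", "user_id": "42"},
    ts_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)
```

Because `trace_id` and `user_id` stay out of the `stream` object, every request lands in the same stream no matter how many distinct users or traces there are.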

## 5. Mimir: Large-Scale Metrics Storage

Prometheus is excellent at scraping metrics, but it has two major limitations at scale:

1. **Single-node storage**: the local TSDB does not scale horizontally
2. **Short retention**: data is usually kept for only 15-30 days because of disk limits

**Mimir** addresses both by acting as remote storage for Prometheus, with multi-tenancy and long-term retention on object storage (S3, MinIO, Azure Blob).

```
graph TD
    P1["Prometheus / Alloy"] -->|remote_write| D["Distributor"]
    D --> I1["Ingester 1"]
    D --> I2["Ingester 2"]
    D --> I3["Ingester 3"]
    I1 --> S["Object Storage<br/>S3 / MinIO / Azure Blob"]
    I2 --> S
    I3 --> S
    QF["Query Frontend"] --> Q["Querier"]
    Q --> I1
    Q --> I2
    Q --> I3
    Q --> S
    G["Grafana"] --> QF

    style D fill:#e94560,stroke:#fff,color:#fff
    style I1 fill:#2c3e50,stroke:#fff,color:#fff
    style I2 fill:#2c3e50,stroke:#fff,color:#fff
    style I3 fill:#2c3e50,stroke:#fff,color:#fff
    style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style QF fill:#4CAF50,stroke:#fff,color:#fff
    style Q fill:#4CAF50,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
    style P1 fill:#e94560,stroke:#fff,color:#fff
```

Mimir architecture: the Distributor spreads metrics across Ingesters, and data is kept long term on object storage

| Feature | Prometheus (standalone) | Mimir |
| --- | --- | --- |
| Horizontal scaling | No | Yes, sharding by tenant/series |
| Long-term retention | 15-30 days (disk) | Effectively unlimited (object storage) |
| Multi-tenant | No | Yes, data isolated between teams |
| High availability | Needs add-ons such as a Thanos sidecar | Built-in replication |
| Query performance | Degrades as data grows | Query splitting + caching |
| Storage cost | Expensive SSD | Cheap object storage |
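
The "sharding by tenant/series" and "built-in replication" rows can be pictured with a simplified hash ring. The sketch below is a teaching model and not Mimir's code: the functions `token`, `build_ring`, and `owners`, the ingester names, and the token count are all hypothetical, and a replication factor of 3 is assumed.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_per_ingester=8):
    """Return a sorted list of (token, ingester) pairs."""
    ring = []
    for name in ingesters:
        for i in range(tokens_per_ingester):
            ring.append((token(f"{name}-{i}"), name))
    return sorted(ring)


def owners(ring, tenant: str, series: str, replication_factor: int = 3):
    """Walk the ring clockwise from the series hash and pick distinct ingesters."""
    start = bisect_right(ring, (token(f"{tenant}/{series}"), ""))
    picked = []
    for offset in range(len(ring)):
        _, name = ring[(start + offset) % len(ring)]
        if name not in picked:
            picked.append(name)
        if len(picked) == replication_factor:
            break
    return picked


ring = build_ring(["ingester-1", "ingester-2", "ingester-3", "ingester-4"])
replicas = owners(ring, "team-a", 'http_requests_total{app="api"}')
```

Each series is written to three different ingesters, so losing one ingester does not lose recent samples, and adding ingesters spreads series across more nodes.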

## 6. Tempo: Distributed Tracing Without an Index

When a request passes through five services, you need to know which service is slow and where the error occurred. **Tempo** answers those questions by storing distributed traces at very low cost.

Unlike Jaeger (which typically relies on Elasticsearch or Cassandra), Tempo needs only **object storage**. It does not maintain a separate index over traces; it stores them by trace ID. To find a trace, you use:

- **TraceQL**: a query language built specifically for traces
- **Metrics-to-traces**: from a spike on a dashboard, click through to exemplar traces
- **Logs-to-traces**: from a log line carrying a trace_id, jump to Tempo to see the full trace

### TraceQL: Querying Traces Like a Database

```
// Traces with an error span in the "order-api" service
{ resource.service.name = "order-api" && status = error }

// Traces containing at least one span longer than 2 seconds
{ duration > 2s }

// Traces where an order-api span has a payment-service span among its descendants
{ resource.service.name = "order-api" } >> { resource.service.name = "payment-service" }

// Spans with specific attributes
{ span.http.status_code >= 500 && span.http.method = "POST" }
```

#### Exemplars: the Bridge Between Metrics and Traces

When an instrumented application exposes metrics, it can attach an **exemplar** to a sample: a reference trace ID recorded alongside some of the data points, which Prometheus or Mimir then store with the series. In Grafana, when P99 latency suddenly jumps, clicking an exemplar takes you straight to a concrete trace recorded during that spike. This is one of the strongest arguments for running a unified LGTM stack.
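
To see what an exemplar looks like on the wire, the sketch below parses one sample line in the OpenMetrics text format, where the exemplar follows a `#` after the sample value. The `parse_exemplar` helper, the metric name, and the trace ID are hypothetical, and the regular expression is deliberately simplified (it assumes no spaces inside label values).

```python
import re

# series, sample value, then "# {exemplar labels} exemplar value [timestamp]"
EXEMPLAR_RE = re.compile(
    r'^(?P<series>\S+)\s+(?P<value>\S+)\s+#\s+\{(?P<labels>[^}]*)\}\s+(?P<ex_value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line: str):
    """Return the exemplar attached to a sample line, or None if there is none."""
    match = EXEMPLAR_RE.match(line)
    if match is None:
        return None
    return {
        "series": match.group("series"),
        "value": float(match.group("value")),
        "exemplar_labels": dict(LABEL_RE.findall(match.group("labels"))),
        "exemplar_value": float(match.group("ex_value")),
    }


sample = (
    'http_request_duration_seconds_bucket{le="2.5"} 1027 '
    '# {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 2.31 1713427906.0'
)
exemplar = parse_exemplar(sample)
plain = parse_exemplar('http_requests_total{app="api"} 42')
```

The `trace_id` label is what Grafana uses to build the link from a point on the latency graph to the matching trace in Tempo.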

## 7. Grafana: Dashboards, Alerting, and Correlation

Grafana is the visualization layer that ties everything together. The 12.x releases bring several notable improvements:

#### Grafana 12 Highlights

**Git Sync**: manage dashboards as code with version control through Git. **Logs Drilldown** (formerly Explore Logs): automatically surfaces patterns in logs without writing a query. **Traces to Profiles**: from a slow span, drill down into a flame graph to see which function burns CPU. **Dynamic dashboards**: layouts that adjust automatically to the data.

### Correlation: the Strength of a Unified Stack

The biggest advantage of the LGTM stack is the ability to **correlate** the three signals:

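```
graph LR
    M["📊 Metrics<br/>CPU spike at 14:05"] -->|exemplar trace_id| T["🔍 Traces<br/>Slow 3.2s span in payment-service"]
    T -->|trace_id in log| L["📝 Logs<br/>TimeoutException connecting to DB"]
    L -->|label match| M

    style M fill:#e94560,stroke:#fff,color:#fff
    style T fill:#2c3e50,stroke:#fff,color:#fff
    style L fill:#4CAF50,stroke:#fff,color:#fff
```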

Correlation loop: Metrics → Traces → Logs → back to Metrics. Debug an incident in minutes instead of hours

**A typical incident debugging workflow:**

1. An alert fires for *"P99 latency > 2s"* on a Grafana dashboard
2. Click the metric panel and pick an exemplar trace ID
3. Open the trace in Tempo and see that the `db.query` span takes 2.8s
4. Follow the trace_id to Loki, which shows the log line `Connection pool exhausted, waiting 2.5s`
5. Root cause: the connection pool is too small, so raise `MaxPoolSize` and deploy the fix
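
Step 4 of the workflow can also be scripted. The sketch below builds a request URL for Loki's `query_range` HTTP endpoint that looks up all log lines carrying a given trace ID; the helper name `loki_query_url`, the base URL, and the timestamps are assumptions for illustration, and the code only constructs the URL without calling Loki.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def loki_query_url(base_url, app, trace_id, start_ns, end_ns, limit=100):
    """Build a query_range URL that filters one app's logs by trace_id."""
    logql = f'{{app="{app}"}} | trace_id = "{trace_id}"'
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


url = loki_query_url(
    base_url="http://loki:3100",
    app="api",
    trace_id="abc123def456",
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
)
# Decode the query string again to check what Loki would receive.
decoded = parse_qs(urlparse(url).query)
```

Grafana performs the same kind of lookup when you follow a trace-to-logs link, so a script like this is mainly useful for automation such as attaching logs to an incident ticket.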

## 8. Deploying the LGTM Stack with Docker Compose

Below is a starting-point Docker Compose configuration for a mid-sized system (10-50 services, ~100GB of logs per month):

```
version: "3.8"

services:
  # --- Grafana ---
  grafana:
    image: grafana/grafana:12.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_FEATURE_TOGGLES_ENABLE=traceToMetrics,traceToLogs
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    depends_on: [loki, mimir, tempo]

  # --- Loki (Log Storage) ---
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./config/loki.yaml:/etc/loki/config.yaml
      - loki-data:/loki

  # --- Mimir (Metrics Storage) ---
  mimir:
    image: grafana/mimir:2.15.0
    ports:
      - "9009:9009"
    command: -config.file=/etc/mimir/config.yaml
    volumes:
      - ./config/mimir.yaml:/etc/mimir/config.yaml
      - mimir-data:/data

  # --- Tempo (Trace Storage) ---
  tempo:
    image: grafana/tempo:2.7.0
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "3200:3200"   # Tempo query
    command: -config.file=/etc/tempo/config.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo

# --- Alloy (Collector) ---
  alloy:
    image: grafana/alloy:1.6.0
    ports:
      - "12345:12345"  # Alloy UI
      - "4327:4317"    # OTLP gRPC (applications send here)
      - "4328:4318"    # OTLP HTTP
    volumes:
      - ./config/alloy.river:/etc/alloy/config.river
    command: run /etc/alloy/config.river --server.http.listen-addr=0.0.0.0:12345

volumes:
  grafana-data:
  loki-data:
  mimir-data:
  tempo-data:
```
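
The compose file mounts `./provisioning` into Grafana but does not show what goes in it. A minimal sketch of a data source provisioning file follows, assuming it is saved as `./provisioning/datasources/lgtm.yaml` (a hypothetical file name) and that `mimir.yaml` sets the HTTP listen port to 9009 to match the port mapping. The `uid` values and the `trace_id` exemplar label are illustrative and depend on how your applications are instrumented.

```yaml
apiVersion: 1

datasources:
  - name: Mimir
    uid: mimir                            # illustrative uid
    type: prometheus
    access: proxy
    url: http://mimir:9009/prometheus     # Mimir's Prometheus-compatible API prefix
    isDefault: true
    jsonData:
      exemplarTraceIdDestinations:        # link exemplars to traces in Tempo
        - name: trace_id                  # assumed exemplar label name
          datasourceUid: tempo
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:                     # jump from a span to its logs in Loki
        datasourceUid: loki
```

With the data sources provisioned this way, the metrics-to-traces and traces-to-logs links should be available as soon as Grafana starts, without manual setup in the UI.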

#### Production notes

The configuration above suits a single node or a staging environment. For heavy production traffic (>1 TB of logs/month), run Loki and Mimir in **microservices mode**: split the distributor, ingester, and querier into separate containers and use object storage (self-hosted MinIO or S3) instead of local disk.
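
As a rough sketch of the object storage part, the fragment below points Loki at an S3-compatible bucket. It assumes a MinIO container reachable as `minio:9000`, a bucket named `loki-chunks`, and credentials supplied through environment variables (all hypothetical); environment expansion only works when Loki is started with `-config.expand-env=true`.

```yaml
common:
  path_prefix: /loki
  storage:
    s3:
      endpoint: minio:9000                      # assumed MinIO service name
      bucketnames: loki-chunks                  # assumed bucket name
      access_key_id: ${MINIO_ACCESS_KEY}        # requires -config.expand-env=true
      secret_access_key: ${MINIO_SECRET_KEY}
      s3forcepathstyle: true                    # MinIO uses path-style URLs
      insecure: true                            # plain HTTP inside the Docker network

schema_config:
  configs:
    - from: "2024-04-01"                        # illustrative start date
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```

Mimir and Tempo accept comparable S3 settings in their own configuration files, though the exact keys differ for each project.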

## 9. Integrating with an ASP.NET Core application

Sending telemetry from a .NET application to the LGTM Stack takes just two steps: install the NuGet packages and configure the exporter.

### Step 1: Install the packages

```
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
```

### Step 2: Configure Program.cs

```
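// Namespaces that provide the OpenTelemetry builder extension methods used below
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);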

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("order-api", serviceVersion: "1.0.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(o => o.SetDbStatementForText = true)
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }))
    .WithLogging(logging => logging
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }));
```
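
The endpoint is repeated three times in the code above. As an alternative sketch, the OTLP exporter also reads the standard `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_PROTOCOL` environment variables when the options are not set in code, so the values can live in the compose file instead. This assumes the explicit `o.Endpoint` and `o.Protocol` assignments are removed (leaving plain `.AddOtlpExporter()` calls) and that the application runs as a service named `order-api` from a hypothetical image.

```yaml
services:
  order-api:
    image: order-api:latest                  # hypothetical application image
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317   # Alloy's OTLP gRPC port inside the network
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
    depends_on: [alloy]
```

Keeping the endpoint in the environment lets the same build point at a different collector per environment without a code change.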

#### Grafana OpenTelemetry Distribution for .NET

Grafana provides the `Grafana.OpenTelemetry` package, a distribution that bundles the common instrumentation libraries with defaults tuned for the LGTM stack. Calling its `UseGrafana()` extension method on the OpenTelemetry builders replaces most of the manual setup shown above.

## 10. Alerting: from monitoring to action

Observability has little value if nobody is notified when something breaks. Grafana Alerting supports:

- **Unified alerting**: alert rules for metrics (PromQL), logs (LogQL), and traces
- **Multi-channel**: Slack, Discord, Telegram, PagerDuty, email, webhook
- **Silences & mute timings**: suppress alerts during maintenance windows
- **Alert grouping**: collapse 100 alerts of the same kind into a single notification

An example alert rule for error rate:

```
# Alert when the error rate stays above 5% for 5 minutes
- alert: HighErrorRate
  expr: |
    sum(rate({app="api"} |= "error" [5m])) by (app)
    /
    sum(rate({app="api"} [5m])) by (app)
    > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Abnormally high error rate for {{ $labels.app }}"
    description: "Error rate is currently {{ $value | humanizePercentage }}"
```

## 11. Sizing and real-world costs

Cost is one of the main reasons to choose the LGTM Stack. Compared with SaaS:

| Scale | LGTM self-hosted | Datadog (estimate) |
| --- | --- | --- |
| 10 services, 50 GB logs/month | 1 VM (4 CPU / 16 GB RAM), ~$40-80/month | ~$200-500/month |
| 50 services, 500 GB logs/month | 3 VMs or a K8s cluster, ~$200-400/month | ~$2,000-5,000/month |
| 200 services, 2 TB logs/month | K8s cluster + S3, ~$500-1,000/month | ~$10,000+/month |

#### Trade-offs to consider

Self-hosting costs less money but more **operational time**. If the team has only 1-2 DevOps engineers, start with the **Grafana Cloud Free tier** (10K metrics, 50 GB logs, 50 GB traces at no charge) and migrate to self-hosted once you outgrow it. Grafana Cloud runs the same LGTM stack, so migration is mostly a matter of changing endpoints.

## 12. Best practices for production

Selective labels

Tiered retention

Keep hot data (7 days) on SSD, warm data (30 days) on HDD, and cold data (1 year or more) on object storage. In Loki, `retention_period` and the `compactor` enforce how long logs are kept; moving data between tiers is usually left to the storage layer, for example object storage lifecycle rules.
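
A minimal sketch of the retention part of a Loki 3.x configuration follows. The periods and the `namespace="dev"` selector are assumptions for illustration; `delete_request_store` should name the object store you actually use (`filesystem` matches the single-node compose setup).

```yaml
compactor:
  working_directory: /loki/compactor
  retention_enabled: true            # without this, retention_period is not enforced
  delete_request_store: filesystem   # use s3 (or similar) with object storage

limits_config:
  retention_period: 720h             # default: keep logs for 30 days
  retention_stream:
    - selector: '{namespace="dev"}'  # assumed label; shorter retention for dev logs
      priority: 1
      period: 168h                   # 7 days
```

Per-stream overrides like this let noisy, low-value streams expire early while the global period covers everything else.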

Trace sampling

You do not need to store 100% of traces. Use **tail-based sampling** in Alloy: always keep traces with errors or high latency, and sample 10-20% of successful traces. This can cut Tempo storage costs by around 80% without losing the traces that matter.
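
One way to express this policy is a sketch built on Alloy's `otelcol.processor.tail_sampling` component. The component labels (`default`, `tempo`), the 2000 ms threshold, and the 10% rate are assumptions; the OTLP receiver's `output` block would need to send traces to `otelcol.processor.tail_sampling.default.input`.

```alloy
otelcol.processor.tail_sampling "default" {
  // Wait for the whole trace before deciding whether to keep it
  decision_wait = "10s"

  // Always keep traces that contain an error
  policy {
    name = "keep-errors"
    type = "status_code"

    status_code {
      status_codes = ["ERROR"]
    }
  }

  // Always keep slow traces (threshold is an assumption)
  policy {
    name = "keep-slow"
    type = "latency"

    latency {
      threshold_ms = 2000
    }
  }

  // Keep a 10% sample of everything else
  policy {
    name = "sample-rest"
    type = "probabilistic"

    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"

    tls {
      insecure = true // plain gRPC inside the Docker network
    }
  }
}
```

A trace is kept if any policy matches, so errors and slow requests survive regardless of the probabilistic sample.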

Recording rules for metrics

Pre-compute complex PromQL queries as recording rules. Instead of querying raw data on every dashboard load, Mimir evaluates the aggregated series ahead of time, so dashboards can load up to 10x faster.
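
As an illustration, the rule group below pre-computes a P99 latency series in the Prometheus rule format that the Mimir ruler accepts. The metric name `http_server_request_duration_seconds_bucket`, the `job` label, and the recorded series names are assumptions; substitute whatever your instrumentation actually emits.

```yaml
groups:
  - name: api_latency            # hypothetical group name
    interval: 1m
    rules:
      # P99 request latency per job over the last 5 minutes
      - record: job:http_server_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_server_request_duration_seconds_bucket[5m]))
          )
      # Request rate per job, reusable across dashboards and alerts
      - record: job:http_server_requests:rate5m
        expr: sum by (job) (rate(http_server_request_duration_seconds_count[5m]))
```

Dashboards then query `job:http_server_request_duration_seconds:p99_5m` directly instead of re-evaluating the histogram on every refresh.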

Dashboard as Code

Use Grafana 12 Git Sync or the `grafana/grafana` Terraform provider to manage dashboards under version control. Nobody should edit dashboards by hand in the production UI; every change goes through PR review.

## Conclusion

**References:**  
[Grafana Loki Documentation](https://grafana.com/docs/loki/latest/) · [Grafana Mimir Documentation](https://grafana.com/docs/mimir/latest/) · [Grafana Tempo Documentation](https://grafana.com/docs/tempo/latest/) · [Grafana Alloy Documentation](https://grafana.com/docs/alloy/latest/) · [Grafana 12 What's New](https://grafana.com/docs/grafana/latest/whatsnew/) · [Instrument .NET with OpenTelemetry — Grafana](https://grafana.com/docs/opentelemetry/instrument/grafana-dotnet/)
