# Prometheus + Grafana: Building a Monitoring Stack for Production

Posted on: 4/25/2026 4:32:04 PM

- **v3.x**: Prometheus (CNCF Graduated)
- **v12**: Grafana, 100+ data sources
- **Pull**: the model used to collect metrics
- **PromQL**: a powerful query language

## 1. Why Do You Need a Monitoring Stack?

Monitoring is not a "nice-to-have"; it is a **hard requirement** for any production system. Without monitoring, you only learn that something is wrong when customers complain, and by then it is too late.

**Prometheus + Grafana** is one of the most widely used monitoring combinations, running at Uber, Spotify, DigitalOcean, CERN, and thousands of other companies. Prometheus is a CNCF Graduated project and Grafana is open source (maintained by Grafana Labs); both are free to self-host and battle-tested in production with millions of time series.

#### Prometheus ≠ Grafana

**Prometheus** collects and stores metrics (time-series database + scraping engine). **Grafana** visualizes those metrics as dashboards and can manage alerting. The two tools complement each other; neither replaces the other.

## 2. Prometheus Architecture: Pull-Based Model

Unlike many monitoring tools, which are push-based, Prometheus uses a **pull model**: it actively scrapes metrics from targets (applications, servers) at a fixed interval.

```mermaid
graph LR
    subgraph Targets
        A1["ASP.NET Core App<br/>/metrics endpoint"]
        A2["Node Exporter<br/>Linux system metrics"]
        A3["SQL Server Exporter<br/>DB metrics"]
        A4["Redis Exporter<br/>Cache metrics"]
    end

    P["Prometheus Server<br/>Scrape + Store + Query"] -->|Pull every 15s| A1
    P -->|Pull every 15s| A2
    P -->|Pull every 15s| A3
    P -->|Pull every 15s| A4

    P --> AM["Alertmanager<br/>Route alerts"]
    AM --> S[Slack / Email / PagerDuty]

    P --> G["Grafana<br/>Dashboards + Explore"]

    style P fill:#e94560,stroke:#fff,color:#fff
    style G fill:#2c3e50,stroke:#fff,color:#fff
    style AM fill:#ff9800,stroke:#fff,color:#fff
    style A1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style A2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style A3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style A4 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
```

Figure 1: Prometheus architecture. It pulls metrics from targets, stores them in its TSDB, and exposes them to Grafana and Alertmanager.

Advantages of the pull model:

- **Service discovery**: Prometheus discovers new targets automatically (via Kubernetes, Consul, DNS)
- **Easier debugging**: open the `/metrics` endpoint in a browser to inspect the raw metrics
- **No agent required**: the application only needs to expose an HTTP endpoint; no separate agent has to be installed
- **Target health**: if a scrape fails, you know immediately that the target is down
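
To make the debugging point concrete, the sketch below parses a few lines of the plain text that a `/metrics` endpoint returns. The sample payload and the `ExpositionParser` helper are illustrative assumptions, not part of prometheus-net, and the parser only handles simple `name{labels} value` lines (no escaping, timestamps, or label values containing commas).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical payload in the Prometheus text exposition format.
string payload = string.Join("\n", new[]
{
    "# HELP app_http_requests_total Total HTTP requests processed",
    "# TYPE app_http_requests_total counter",
    "app_http_requests_total{method=\"GET\",endpoint=\"/api/orders\",status_code=\"200\"} 42",
    "app_active_connections 3",
});

foreach (var s in ExpositionParser.Parse(payload))
{
    var labels = string.Join(", ", s.Labels.Select(kv => kv.Key + "=" + kv.Value));
    Console.WriteLine($"{s.Name} [{labels}] = {s.Value}");
}

public record Sample(string Name, Dictionary<string, string> Labels, double Value);

public static class ExpositionParser
{
    // Parses simple sample lines and skips # HELP / # TYPE comment lines.
    public static List<Sample> Parse(string text)
    {
        var samples = new List<Sample>();
        foreach (var raw in text.Split('\n'))
        {
            var line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;

            var labels = new Dictionary<string, string>();
            string name, valuePart;
            int open = line.IndexOf('{');
            if (open >= 0)
            {
                int close = line.LastIndexOf('}');
                name = line[..open];
                var inner = line[(open + 1)..close];
                foreach (var pair in inner.Split(',', StringSplitOptions.RemoveEmptyEntries))
                {
                    var kv = pair.Split('=', 2);
                    labels[kv[0].Trim()] = kv[1].Trim().Trim('"');
                }
                valuePart = line[(close + 1)..].Trim();
            }
            else
            {
                int space = line.IndexOf(' ');
                name = line[..space];
                valuePart = line[(space + 1)..].Trim();
            }
            samples.Add(new Sample(name, labels, double.Parse(valuePart, CultureInfo.InvariantCulture)));
        }
        return samples;
    }
}
```

Because the format is line-oriented text, the same payload can be inspected with a browser or `curl`, which is what makes a misbehaving target easy to diagnose.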

## 3. The Four Metric Types in Prometheus

| Type | Description | Example | Common PromQL |
| --- | --- | --- | --- |
| **Counter** | Value only increases (resets on restart) | Total requests, total errors | `rate(http_requests_total[5m])` |
| **Gauge** | Value can go up or down freely | CPU usage, memory, queue size | `node_memory_MemFree_bytes` |
| **Histogram** | Distributes observations into buckets | Response time (P50, P95, P99) | `histogram_quantile(0.95, ...)` |
| **Summary** | Similar to a histogram, but quantiles are computed on the client | Response time (pre-calculated) | `http_request_duration_seconds{quantile="0.95"}` |

#### Histogram vs Summary

Prefer **Histogram** in most cases: its quantiles are computed on the server side, so buckets can be aggregated across instances. A Summary computes quantiles on the client, so they cannot be aggregated across instances. Prometheus 3.x also supports **Native Histograms**, which offer higher resolution and more efficient storage.
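
A small sketch shows why bucket counts aggregate while pre-computed quantiles do not. The `EstimateQuantile` helper is a simplified, hypothetical re-implementation of the linear interpolation idea behind `histogram_quantile` (it ignores `+Inf` bucket edge cases and native histograms), and the bucket bounds and counts are made-up numbers.

```csharp
using System;
using System.Linq;

// Upper bounds (le) shared by both instances, in seconds. Values are illustrative.
double[] bounds = { 0.1, 0.5, 1.0, 5.0 };
// Cumulative bucket counts, the way Prometheus exposes them.
double[] instanceA = { 80, 95, 99, 100 };
double[] instanceB = { 10, 40, 90, 100 };

// Buckets with identical bounds can simply be added across instances.
double[] merged = instanceA.Zip(instanceB, (a, b) => a + b).ToArray();

Console.WriteLine($"P95 instance A: {Quantiles.EstimateQuantile(0.95, bounds, instanceA):F3}s");
Console.WriteLine($"P95 instance B: {Quantiles.EstimateQuantile(0.95, bounds, instanceB):F3}s");
Console.WriteLine($"P95 merged:     {Quantiles.EstimateQuantile(0.95, bounds, merged):F3}s");

public static class Quantiles
{
    // Finds the bucket containing the target rank, then interpolates linearly inside it.
    public static double EstimateQuantile(double q, double[] upperBounds, double[] cumulativeCounts)
    {
        double total = cumulativeCounts[^1];
        double rank = q * total;
        for (int i = 0; i < upperBounds.Length; i++)
        {
            if (cumulativeCounts[i] >= rank)
            {
                double lowerBound = i == 0 ? 0 : upperBounds[i - 1];
                double countBelow = i == 0 ? 0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - countBelow;
                if (inBucket == 0) return upperBounds[i];
                return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
            }
        }
        return upperBounds[^1];
    }
}
```

With these numbers the per-instance P95 values are roughly 0.5 s and 3.0 s, and averaging them would give 1.75 s. The merged buckets yield about 1.36 s, which is the figure that reflects all requests; a Summary only ships the per-instance quantiles, so this merged value cannot be recovered from it.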

## 4. Integrating Prometheus with ASP.NET Core

### 4.1 Installation

```bash
dotnet add package prometheus-net.AspNetCore
```

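```csharp
// Program.cs
using Microsoft.EntityFrameworkCore; // ToListAsync
using Prometheus;                    // MapMetrics

var builder = WebApplication.CreateBuilder(args);
// AppDbContext is assumed to be registered with builder.Services elsewhere.

var app = builder.Build();

// Expose the /metrics endpoint for Prometheus to scrape
app.MapMetrics(); // → http://localhost:5000/metrics

app.MapGet("/api/orders", async (AppDbContext db) =>
{
    return await db.Orders.ToListAsync();
});

app.Run();
```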

### 4.2 Custom Metrics

```csharp
using System.Diagnostics;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Routing;
using Prometheus;

public static class AppMetrics
{
    // Counter: counts requests by method, endpoint, and status code
    public static readonly Counter HttpRequestsTotal = Metrics.CreateCounter(
        "app_http_requests_total",
        "Total HTTP requests processed",
        new CounterConfiguration
        {
            LabelNames = new[] { "method", "endpoint", "status_code" }
        });

    // Histogram: measures response time
    public static readonly Histogram RequestDuration = Metrics.CreateHistogram(
        "app_request_duration_seconds",
        "HTTP request duration in seconds",
        new HistogramConfiguration
        {
            LabelNames = new[] { "method", "endpoint" },
            Buckets = new[] { 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 }
        });

    // Gauge: number of requests currently in flight
    public static readonly Gauge ActiveConnections = Metrics.CreateGauge(
        "app_active_connections",
        "Number of active connections");

    // Gauge: queue size
    public static readonly Gauge QueueSize = Metrics.CreateGauge(
        "app_background_queue_size",
        "Number of items in background processing queue");
}

// Middleware that records metrics for every request.
// Register it in Program.cs with app.UseMiddleware<MetricsMiddleware>().
public class MetricsMiddleware
{
    private readonly RequestDelegate _next;

    public MetricsMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var method = context.Request.Method;
        var stopwatch = Stopwatch.StartNew();

        AppMetrics.ActiveConnections.Inc();
        try
        {
            await _next(context);
        }
        finally
        {
            // Label with the route template (e.g. /api/orders/{id}) instead of the raw
            // path, so IDs in the URL do not create a new time series per request.
            var endpoint = (context.GetEndpoint() as RouteEndpoint)?.RoutePattern.RawText
                ?? "unmatched";

            AppMetrics.RequestDuration
                .WithLabels(method, endpoint)
                .Observe(stopwatch.Elapsed.TotalSeconds);

            // Note: if an exception escapes, StatusCode may not be set to 500 yet.
            AppMetrics.HttpRequestsTotal
                .WithLabels(method, endpoint, context.Response.StatusCode.ToString())
                .Inc();

            // Always decrement, even when the request throws.
            AppMetrics.ActiveConnections.Dec();
        }
    }
}
```

### 4.3 Configuring Prometheus Scraping

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'aspnet-app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['order-service:5000', 'payment-service:5000']
        labels:
          environment: 'production'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

## 5. PromQL: The Query Language for Metrics

PromQL (Prometheus Query Language) is a purpose-built language for querying time-series data. These are the queries you will use most in practice:

### 5.1 Request Rate and Error Rate

```promql
# Request rate (requests/second) over the last 5 minutes
rate(app_http_requests_total[5m])

# Request rate per endpoint
sum by (endpoint) (rate(app_http_requests_total[5m]))

# Error rate (% of requests returning 5xx)
sum(rate(app_http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(app_http_requests_total[5m]))
* 100

# Availability (% of successful requests)
# The outer parentheses matter: * binds tighter than -, so without them
# the expression would compute 1 - (ratio * 100).
(1 - (
  sum(rate(app_http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(app_http_requests_total[5m]))
)) * 100
```

### 5.2 Latency Percentiles

```promql
# P50 (median) response time
histogram_quantile(0.50,
  sum by (le) (rate(app_request_duration_seconds_bucket[5m]))
)

# P95 response time
histogram_quantile(0.95,
  sum by (le) (rate(app_request_duration_seconds_bucket[5m]))
)

# P99 response time per endpoint
histogram_quantile(0.99,
  sum by (le, endpoint) (rate(app_request_duration_seconds_bucket[5m]))
)

# Average response time
sum(rate(app_request_duration_seconds_sum[5m]))
/
sum(rate(app_request_duration_seconds_count[5m]))
```

### 5.3 Resource Monitoring

```promql
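# CPU usage per pod (Kubernetes), as a percentage of one core
# container!="" skips the pod-level aggregate series so usage is not counted twice
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])
) * 100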

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```

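#### PromQL Golden Rule: rate() first, aggregate second

Always apply `rate()` BEFORE `sum()`. `rate()` detects counter resets per series, so it has to see each instance's raw counter. If you aggregate first (for example through a subquery such as `rate(sum(...)[5m:])`), a restart of one instance makes the summed value drop, the reset is misread, and the result is wrong. This is one of the most common PromQL mistakes.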
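
The effect of a counter reset can be illustrated with a small simulation. The `Increase` helper is a hypothetical, simplified stand-in for the per-series reset handling that `rate()` and `increase()` perform (no extrapolation), and the sample values are made up.

```csharp
using System;
using System.Linq;

// Two instances of the same counter, sampled at the same four timestamps.
// Instance B restarts between the 2nd and 3rd sample, so its counter drops.
double[] instanceA = { 100, 110, 120, 130 };
double[] instanceB = { 500, 510, 5, 15 };

// Correct order: handle resets per series, then aggregate.
double perSeries = CounterMath.Increase(instanceA) + CounterMath.Increase(instanceB);

// Wrong order: aggregate first, then compute the increase of the summed series.
double[] summed = instanceA.Zip(instanceB, (a, b) => a + b).ToArray();
double afterSum = CounterMath.Increase(summed);

Console.WriteLine($"increase per series, then sum: {perSeries}");
Console.WriteLine($"sum, then increase:            {afterSum}");

public static class CounterMath
{
    // Total increase over the samples. A drop in value is treated as a counter reset:
    // the counter restarted from zero, so the new value counts entirely as increase.
    public static double Increase(double[] samples)
    {
        double total = 0;
        for (int i = 1; i < samples.Length; i++)
        {
            double delta = samples[i] - samples[i - 1];
            total += delta >= 0 ? delta : samples[i];
        }
        return total;
    }
}
```

Handled per series, the two instances grew by 30 and 25, so the true increase is 55. After summing, the drop from 620 to 125 looks like a single reset and the whole 125 (which includes instance A's untouched 120) is counted again, giving 165.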

## 6. Alerting: Smart Alerts

### 6.1 Alert Rules

```yaml
# alert-rules.yml
groups:
  - name: app-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(app_http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(app_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% for 5 minutes"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(app_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2 seconds"
          description: "Current P95: {{ $value | humanizeDuration }}"

      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"

      # Memory pressure
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage > 90%"

      # Disk almost full
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Disk space < 10% on {{ $labels.mountpoint }}"
```

### 6.2 Alertmanager: Routing and Deduplicating Alerts

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      group_wait: 0s
      repeat_interval: 5m

    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'xxx'
        severity: '{{ .GroupLabels.severity }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```

#### Alerting Anti-patterns

**Avoid alert fatigue**: if the team receives more than 20 alerts a day, most of them will be ignored. Every alert must be *actionable*; if an alert arrives and nothing needs to be done, delete it. Use `for: 5m` or longer to avoid flapping (an alert that toggles on and off because of a short-lived spike).
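
How the `for` clause filters out short spikes can be sketched as a tiny state machine. `AlertSim.FiresWithin` is a hypothetical helper written for this illustration, not Prometheus code; it assumes one rule evaluation per minute, so `for: 5m` corresponds to five evaluations.

```csharp
using System;

// One entry per rule evaluation; true means the alert expression returned a result.
bool[] spike     = { false, true, true, false, false, false, false, false };
bool[] sustained = { false, true, true, true,  true,  true,  true,  true  };

Console.WriteLine($"spike fires:     {AlertSim.FiresWithin(spike, forEvaluations: 5)}");
Console.WriteLine($"sustained fires: {AlertSim.FiresWithin(sustained, forEvaluations: 5)}");

public static class AlertSim
{
    // Returns true if the condition stays active long enough to go from pending to firing.
    // forEvaluations is the `for` duration divided by the evaluation interval.
    public static bool FiresWithin(bool[] conditionPerEvaluation, int forEvaluations)
    {
        int activeSince = -1; // evaluation index at which the alert became pending
        for (int i = 0; i < conditionPerEvaluation.Length; i++)
        {
            if (!conditionPerEvaluation[i])
            {
                activeSince = -1; // condition cleared: the pending state is dropped
                continue;
            }
            if (activeSince < 0) activeSince = i;
            if (i - activeSince >= forEvaluations) return true; // held for the full duration
        }
        return false;
    }
}
```

The two-minute spike never leaves the pending state, while the sustained condition fires five evaluations after it first became true. Without a `for` clause both would have notified someone.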

## 7. Grafana Dashboards

### 7.1 RED Method Dashboard

Every service should have a dashboard built on the **RED method**, which tracks three core metrics:

| Metric | Meaning | PromQL |
| --- | --- | --- |
| **R**ate | Requests per second | `sum(rate(app_http_requests_total[5m]))` |
| **E**rrors | Error ratio (multiply by 100 for a percentage) | `sum(rate(...{status_code=~"5.."}[5m])) / sum(rate(...[5m]))` |
| **D**uration | Latency percentiles | `histogram_quantile(0.95, sum by (le) (rate(..._bucket[5m])))` |

### 7.2 USE Method for Infrastructure

Every resource (CPU, memory, disk, network) should be measured with the **USE method**:

| Metric | CPU | Memory | Disk |
| --- | --- | --- | --- |
| **U**tilization | % CPU busy | % RAM used | % disk used |
| **S**aturation | Load average / cores | Swap usage | I/O queue depth |
| **E**rrors | CPU throttling events | OOM kills | I/O errors |

## 8. Deploying on Kubernetes

```bash
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager + Node Exporter)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=securePassword \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

```mermaid
graph TB
    subgraph Kubernetes Cluster
        subgraph monitoring namespace
            P["Prometheus<br/>StatefulSet"]
            G["Grafana<br/>Deployment"]
            AM["Alertmanager<br/>StatefulSet"]
            NE["Node Exporter<br/>DaemonSet"]
            KSM["Kube-State-Metrics<br/>Deployment"]
        end

        subgraph production namespace
            subgraph Pod
                APP[ASP.NET Core App]
                APP -->|/metrics| P
            end
        end

        NE -->|system metrics| P
        KSM -->|k8s state| P
        P -->|alerts| AM
        P -->|data source| G
        AM -->|notify| EXT[Slack / PagerDuty]
    end

    style P fill:#e94560,stroke:#fff,color:#fff
    style G fill:#2c3e50,stroke:#fff,color:#fff
    style AM fill:#ff9800,stroke:#fff,color:#fff
    style APP fill:#f8f9fa,stroke:#e94560,color:#2c3e50
```

Figure 2: kube-prometheus-stack on Kubernetes, an all-in-one monitoring solution

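### 8.1 ServiceMonitor for ASP.NET Core

```yaml
# Automatically discover and scrape ASP.NET Core apps
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aspnet-apps
  namespace: monitoring
  labels:
    # With the chart's default values, Prometheus only selects ServiceMonitors
    # labelled with the Helm release name ("monitoring" in the install above).
    release: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/monitored: "true"
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```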

## 9. Recording Rules: Optimizing Performance

When a PromQL query is expensive and runs frequently (for example, a dashboard that refreshes every 10s), use **recording rules** to pre-compute the result:

```yaml
# recording-rules.yml
groups:
  - name: app-recording
    interval: 30s
    rules:
      # Pre-compute request rate per endpoint
      - record: app:http_request_rate:5m
        expr: sum by (endpoint) (rate(app_http_requests_total[5m]))

      # Pre-compute error rate
      - record: app:http_error_rate:5m
        expr: |
          sum(rate(app_http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(app_http_requests_total[5m]))

      # Pre-compute P95 latency
      - record: app:http_latency_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(app_request_duration_seconds_bucket[5m]))
          )

      # Pre-compute P99 latency per endpoint
      - record: app:http_latency_p99_by_endpoint:5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, endpoint) (rate(app_request_duration_seconds_bucket[5m]))
          )
```

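#### Naming Convention for Recording Rules

The standard format is `level:metric:operations`: *level* is the aggregation level (the labels that remain in the output), *metric* is the metric name, and *operations* lists the operations applied, as in `instance_path:requests:rate5m`. The rules above use a simplified variant: in `app:http_request_rate:5m`, *app* is the aggregation level, *http_request_rate* is the metric, and *5m* is the window. Consistent names let the team understand what a series means without reading the original PromQL.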

## 10. Best Practices for Production

### 10.1 Metric Naming

- Prefix metrics with the application name: `orderservice_requests_total` instead of `requests_total`
- Put the unit in the name (`_seconds`, `_bytes`) and end counters with `_total`
- Do not use high-cardinality labels (user_id, request_id); every distinct value creates a new time series and can drive Prometheus out of memory
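
The cardinality warning is easy to quantify: the worst-case number of series for one metric is the product of the distinct values of each label. The sketch below uses hypothetical label counts and a helper named `SeriesEstimate` that exists only for this illustration.

```csharp
using System;
using System.Linq;

// Hypothetical distinct-value counts: method, endpoint, status_code.
long bounded = SeriesEstimate.MaxSeries(new[] { 5, 20, 8 });
// Same metric with an additional user_id label that has 100,000 distinct values.
long withUserId = SeriesEstimate.MaxSeries(new[] { 5, 20, 8, 100_000 });

Console.WriteLine($"bounded labels: {bounded} series");
Console.WriteLine($"with user_id:   {withUserId} series");

public static class SeriesEstimate
{
    // Upper bound on time series for one metric: product of distinct values per label.
    public static long MaxSeries(int[] distinctValuesPerLabel) =>
        distinctValuesPerLabel.Aggregate(1L, (acc, n) => acc * n);
}
```

Under these assumptions the bounded labels cap the metric at 800 series, while adding `user_id` raises the ceiling to 80,000,000. For a histogram the figure is multiplied again by the number of buckets.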

### 10.2 Storage and Retention

- **Local storage**: 15-30 days of retention is enough for most use cases
- **Long-term storage**: use Thanos or Cortex if you need to keep metrics for more than 30 days
- Estimate: ~1-2 bytes/sample × samples/s × retention (in seconds); plan storage accordingly
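
As a worked example of that estimate, the sketch below plugs in an assumed workload of 500,000 active series scraped every 15 seconds and kept for 30 days, using the pessimistic 2 bytes per sample. The workload numbers and the `StorageEstimate` helper are illustrative assumptions.

```csharp
using System;

// Assumed workload: 500,000 active series, one sample per series every 15 seconds.
double samplesPerSecond = 500_000 / 15.0;
double bytes = StorageEstimate.Bytes(samplesPerSecond, bytesPerSample: 2, retentionDays: 30);

Console.WriteLine($"{bytes / 1e9:F1} GB");

public static class StorageEstimate
{
    // needed disk = retention (seconds) x ingested samples per second x bytes per sample
    public static double Bytes(double samplesPerSecond, double bytesPerSample, int retentionDays) =>
        retentionDays * 86_400.0 * samplesPerSecond * bytesPerSample;
}
```

This comes to roughly 173 GB, so a workload of that size would not fit the 50Gi volume requested in the Helm install earlier; either the retention or the volume size would need to change.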

### 10.3 High Availability

- Run 2 Prometheus instances that scrape the same targets, then deduplicate in Thanos or Grafana Cloud
- Run Alertmanager in cluster mode (3 instances) so notifications keep flowing if one instance fails and are not sent twice
- Grafana is stateless when its state is stored in an external database such as PostgreSQL, so it scales horizontally with little effort

| Component | Replicas (Production) | Suggested resources |
| --- | --- | --- |
| Prometheus | 2 (HA pair) | 2 CPU, 8GB RAM, 50GB SSD |
| Alertmanager | 3 (cluster) | 0.5 CPU, 256MB RAM |
| Grafana | 2+ | 1 CPU, 1GB RAM |
| Node Exporter | 1 per node (DaemonSet) | 0.1 CPU, 64MB RAM |

## Conclusion

Prometheus + Grafana is more than a monitoring tool; it is an **observability foundation** for the whole system. Start by exposing `/metrics` in ASP.NET Core, gradually add custom metrics following the RED method, set up meaningful alerting rules (actionable, not spammy), and build dashboards that help the team spot problems as quickly as possible.

With `kube-prometheus-stack` on Kubernetes, you can have a full monitoring setup in a few minutes. The hard part is not the installation; it is choosing the right metrics to track and writing alert rules that do not cause alert fatigue.
