OpenTelemetry: The Observability Standard for Distributed Systems

Posted on: 4/21/2026 3:10:12 AM

Table of contents

1. What is observability and why does it matter?
  1. Monitoring vs. Observability
2. The three pillars: Traces, Metrics, Logs
3. OpenTelemetry architecture
  1. Main components
4. OpenTelemetry Collector: the heart of the system
  1. Two deployment modes
  2. Collector configuration (YAML)
    1. The memory limiter is mandatory
5. Integrating OpenTelemetry with .NET
6. Smart sampling strategies
  1. Head-based vs. tail-based sampling
7. Building a complete observability stack
  1. Docker Compose for local development
  2. Comparison: self-hosted vs. managed service
8. Best practices for production in 2026
Conclusion

CNCF #2: second most active project after Kubernetes

40+ languages and frameworks supported

100+ vendors with built-in integration

3 pillars: Traces, Metrics, Logs

As distributed systems grow more complex (microservices calling one another, message queues in between, layers of caching stacked on top), the question "where did the error happen?" becomes extremely hard to answer. You cannot debug production with a breakpoint. You need observability, and OpenTelemetry is becoming the de facto industry standard for it.

1. What is observability and why does it matter?

Observability is the ability to understand a system's internal state purely from the signals it emits, without having to change code or interfere with the main execution flow each time a new question arises. Unlike traditional monitoring, which tracks metrics you already knew to watch, observability lets you answer questions you never thought to ask in advance.

Monitoring vs. Observability

Monitoring answers: "What is CPU usage right now?" or "Has requests/s crossed the threshold?"
Observability answers: "Why did a request from user X in the AP region take 3 seconds instead of 200ms, and which service is the bottleneck?"

In a monolith you can open a single log file and follow one thread. But once a request travels through API Gateway → Auth Service → Order Service → Payment → Notification, each service has its own logs, its own time zone, and its own format, so you need a way to correlate all of them.

2. The three pillars: Traces, Metrics, Logs

graph TD
    A[Telemetry Data] --> B[Traces]
    A --> C[Metrics]
    A --> D[Logs]
    B --> B1["Distributed Tracing<br/>Follows the request flow"]
    B --> B2["Spans<br/>Timed units of work"]
    B --> B3["Context Propagation<br/>W3C TraceContext"]
    C --> C1["Counters<br/>Cumulative counts"]
    C --> C2["Gauges<br/>Instantaneous values"]
    C --> C3["Histograms<br/>Statistical distributions"]
    D --> D1["Structured Logs<br/>Key-value pairs"]
    D --> D2["Correlation<br/>Attaches TraceId/SpanId"]
    D --> D3["Severity Levels<br/>Info/Warn/Error"]

    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

The three telemetry pillars in OpenTelemetry

Traces: following a request's journey

A trace represents the entire journey of one request through a distributed system. Each trace consists of multiple spans. A span is the smallest unit of work, with a name, start and end timestamps, and attributes describing its context.

sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant Auth as Auth Service
    participant Order as Order Service
    participant DB as Database
    participant Cache as Redis Cache

    Client->>Gateway: POST /orders (TraceId: abc123)
    Gateway->>Auth: Verify Token (SpanId: s1)
    Auth-->>Gateway: 200 OK (2ms)
    Gateway->>Order: Create Order (SpanId: s2)
    Order->>Cache: Check inventory (SpanId: s3)
    Cache-->>Order: Cache HIT (0.5ms)
    Order->>DB: INSERT order (SpanId: s4)
    DB-->>Order: OK (15ms)
    Order-->>Gateway: 201 Created (18ms)
    Gateway-->>Client: 201 Created (22ms)

A distributed trace across multiple services: each request arrow corresponds to one span

Each span contains:

TraceId: a unique ID for the whole trace (propagated via the HTTP header traceparent)
SpanId: the ID of the current span
ParentSpanId: the parent-child link between spans
Attributes: key-value pairs such as http.request.method=POST, db.system=postgresql
Events: things that happen during the span (for example "cache miss" or "retry attempt")
Status: OK, ERROR, or UNSET
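
To make the traceparent header concrete, here is a minimal sketch that parses one with the built-in System.Diagnostics.ActivityContext.TryParse. The header value is the sample from the W3C Trace Context specification, not output from a real system.

```csharp
using System;
using System.Diagnostics;

// Sample value in the W3C Trace Context layout: version-traceid-parentid-flags.
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

// TryParse validates the header and extracts the IDs; the tracestate argument is optional.
bool parsed = ActivityContext.TryParse(traceparent, null, out ActivityContext context);

if (parsed)
{
    Console.WriteLine($"TraceId:       {context.TraceId}");
    Console.WriteLine($"Parent SpanId: {context.SpanId}");
    Console.WriteLine($"Sampled:       {context.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)}");
}
else
{
    Console.WriteLine("Invalid traceparent header");
}
```

A receiving service would normally let its instrumentation do this parsing; the sketch only shows which span fields travel in the header.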

Metrics: measuring performance with numbers

Metrics are numeric measurements recorded over time. OpenTelemetry supports three main kinds of metric:

Type	Description	Example	When to use
Counter	Cumulative value, only increases	Total requests, total bytes sent	Counting events over time
Gauge	Instantaneous value, goes up and down	CPU usage, active connections, queue length	Measuring current state
Histogram	Statistical distribution	Request latency (p50, p95, p99)	Analyzing the distribution of values
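
The three kinds of metric in the table map onto the instruments in System.Diagnostics.Metrics. The sketch below creates one of each and uses a MeterListener as a stand-in for the OTel SDK so the raw measurements are visible; the meter and instrument names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, chosen for illustration only.
using var meter = new Meter("Demo.Metrics");
var requests = meter.CreateCounter<long>("requests.total");
var latency = meter.CreateHistogram<double>("request.duration", unit: "ms");
int queueLength = 7;
meter.CreateObservableGauge<int>("queue.length", () => queueLength);

// The listener plays the role of the SDK: it receives every raw measurement.
var seen = new Dictionary<string, double>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Demo.Metrics") l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((inst, value, tags, state) => seen[inst.Name] = value);
listener.SetMeasurementEventCallback<double>((inst, value, tags, state) => seen[inst.Name] = value);
listener.SetMeasurementEventCallback<int>((inst, value, tags, state) => seen[inst.Name] = value);
listener.Start();

requests.Add(1);                         // counter: only ever increases
latency.Record(12.5);                    // histogram: one observation per request
listener.RecordObservableInstruments();  // gauge: polled on demand

foreach (var pair in seen) Console.WriteLine($"{pair.Key} = {pair.Value}");
```

In a real service the OpenTelemetry SDK subscribes in place of the listener and handles aggregation and export.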

Logs: events with context

Logs in OpenTelemetry are not just plain text: they are structured logs with the TraceId and SpanId attached automatically. As a result, when you look at an ERROR log entry, you can jump straight to the corresponding trace and see the request's whole journey.

{
  "timestamp": "2026-04-21T10:15:30Z",
  "severity": "ERROR",
  "body": "Payment processing failed",
  "attributes": {
    "order.id": "ORD-98765",
    "payment.provider": "stripe",
    "error.type": "timeout"
  },
  "traceId": "abc123def456...",
  "spanId": "span789..."
}
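
As a rough illustration of where the traceId and spanId in such an entry come from, the sketch below starts an Activity and copies the ambient trace context into a JSON log line. The source name Demo.Payments is hypothetical, and the ActivityListener stands in for the OTel SDK, which would normally do this enrichment for ILogger automatically.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.Json;

// Hypothetical source name. Without a listener (or the OTel SDK), StartActivity returns null.
var source = new ActivitySource("Demo.Payments");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Payments",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

string logLine;
using (var activity = source.StartActivity("ProcessPayment"))
{
    // Copy the ambient trace context into the structured log entry.
    var entry = new Dictionary<string, object?>
    {
        ["severity"] = "ERROR",
        ["body"] = "Payment processing failed",
        ["traceId"] = Activity.Current?.TraceId.ToString(),
        ["spanId"] = Activity.Current?.SpanId.ToString()
    };
    logLine = JsonSerializer.Serialize(entry);
}

Console.WriteLine(logLine);
```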

Correlating logs, traces, and metrics

The real power lies in correlating the three signals: a metric shows p99 latency spiking → filter traces with duration > 2s → find the slowest span → read that span's logs to understand the root cause. This is a workflow that traditional, siloed monitoring cannot support.

3. OpenTelemetry architecture

OpenTelemetry is not a product. It is a framework and toolkit made up of several cooperating components:

graph LR
    subgraph Application
        A1[Your Code] --> SDK[OTel SDK]
        A2[Auto-Instrumentation] --> SDK
        A3[Library Instrumentation] --> SDK
    end

    SDK -->|OTLP| C[OTel Collector]

    subgraph Collector
        C --> R[Receivers]
        R --> P[Processors]
        P --> E[Exporters]
    end

    E --> G[Grafana/Tempo]
    E --> J[Jaeger]
    E --> PR[Prometheus]
    E --> AZ[Azure Monitor]
    E --> DD[Datadog]

    style SDK fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style P fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style J fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PR fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style AZ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style DD fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

OpenTelemetry architecture overview: from application to backend

Main components

API: the standard interface for producing telemetry. Library authors use the API to instrument code without depending on a specific implementation
SDK: the implementation of the API, responsible for collecting, processing, and exporting data
Auto-Instrumentation: collects telemetry automatically from popular frameworks (ASP.NET Core, HttpClient, EF Core...) without code changes
OTLP (OpenTelemetry Protocol): the standard, vendor-neutral transport protocol, supporting both gRPC and HTTP/protobuf
Semantic Conventions: standard naming rules for attributes, ensuring that http.request.method means the same thing in every language

4. OpenTelemetry Collector: the heart of the system

The Collector is an intermediary that receives, processes, and forwards telemetry data. It acts as a smart proxy between applications and backends, separating instrumentation logic from delivery logic.

Two deployment modes

graph TD
    subgraph Agent Mode
        App1[App 1] --> CA[Collector Agent]
        App2[App 2] --> CA
        CA -->|Forward| CG
    end

    subgraph Gateway Mode
        CA2[Agent 1] --> CG[Collector Gateway]
        CA3[Agent 2] --> CG
        CG --> Backend[Observability Backend]
    end

    style CA fill:#e94560,stroke:#fff,color:#fff
    style CG fill:#2c3e50,stroke:#fff,color:#fff
    style Backend fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Agent mode (sidecar) vs. Gateway mode (centralized)

Characteristic	Agent Mode	Gateway Mode
Deployment	Sidecar / DaemonSet next to the app	Centralized standalone service
Advantages	Low latency, local processing	Central management, complex sampling
Drawbacks	Consumes resources on every node	Single point of failure unless run in HA
Best for	Kubernetes, edge computing	Multi-cluster, cross-region

Collector configuration (YAML)

The Collector is configured as a pipeline: Receivers → Processors → Exporters. Here is a production-oriented example:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      cert_file: /certs/client.crt
      key_file: /certs/client.key
  # Tempo stores traces only, so logs need their own backend.
  # Assumes a Loki version with native OTLP ingestion under /otlp.
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]

The memory limiter is mandatory

In production, always place the memory_limiter processor BEFORE every other processor. Otherwise, during a traffic spike the Collector can run out of memory (OOM) and lose all telemetry data held in its buffers. Set limit_mib to roughly 70-80% of the RAM available to the container.

5. Integrating OpenTelemetry with .NET

.NET has a particular advantage: the telemetry APIs are already built into the platform (ILogger, System.Diagnostics.Metrics, ActivitySource). The OpenTelemetry .NET SDK only needs to hook into these APIs and export the data, so you do not have to change the way you write code.

graph LR
    subgraph ".NET built-in APIs"
        IL["ILogger<T>"]
        ME["Meter / Counter"]
        AS["ActivitySource / Activity"]
    end

    subgraph "OTel .NET SDK"
        IL --> LP[Log Provider]
        ME --> MP[Meter Provider]
        AS --> TP[Tracer Provider]
    end

    LP --> EX[OTLP Exporter]
    MP --> EX
    TP --> EX

    EX --> COL[Collector]

    style IL fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ME fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style AS fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LP fill:#e94560,stroke:#fff,color:#fff
    style MP fill:#e94560,stroke:#fff,color:#fff
    style TP fill:#e94560,stroke:#fff,color:#fff
    style EX fill:#2c3e50,stroke:#fff,color:#fff
    style COL fill:#2c3e50,stroke:#fff,color:#fff

.NET uses its native APIs; the OTel SDK only handles export

Installing the NuGet packages

dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

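Configuration in Program.cs

using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // Resource attributes identify this service in every signal it emits.
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "OrderService",
            serviceVersion: "1.0.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        // Register the custom ActivitySource; unregistered sources are ignored.
        .AddSource("OrderService")
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Skip health-check requests to cut noise.
            opts.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            opts.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(opts =>
        {
            // Records SQL text on spans: review it for sensitive data. Option
            // names can differ between versions of this package.
            opts.SetDbStatementForText = true;
            opts.RecordException = true;
        })
        .AddOtlpExporter(opts =>
        {
            opts.Endpoint = new Uri("http://otel-collector:4317");
            opts.Protocol = OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        // Register the custom Meter for the same reason.
        .AddMeter("OrderService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        // Parameterless overload: the endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT
        // (default http://localhost:4317), so set it to reach the Collector.
        .AddOtlpExporter())
    .WithLogging(logging => logging
        .AddOtlpExporter());

var app = builder.Build();
app.Run();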

Custom instrumentation: tracing business logic

Auto-instrumentation captures HTTP, database, and gRPC calls. To trace business logic (order processing, pricing, inventory checks), you need to add spans manually:

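using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderService
{
    // The SDK only collects from sources and meters registered by name,
    // via AddSource("OrderService") and AddMeter("OrderService").
    private static readonly ActivitySource Source = new("OrderService");
    private static readonly Meter Meter = new("OrderService");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders.created");

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        // StartActivity returns null when nothing is listening, hence the ?. calls.
        using var activity = Source.StartActivity("CreateOrder");
        activity?.SetTag("order.customer_id", request.CustomerId);
        activity?.SetTag("order.items_count", request.Items.Count);

        try
        {
            // Validate inventory
            using (Source.StartActivity("ValidateInventory"))
            {
                await ValidateInventoryAsync(request.Items);
            }

            // Calculate pricing
            decimal total;
            using (var pricingSpan = Source.StartActivity("CalculatePricing"))
            {
                total = await CalculatePricingAsync(request.Items);
                // Cast to double: decimal is not a standard attribute value type.
                pricingSpan?.SetTag("order.total", (double)total);
            }

            // Process payment
            using (Source.StartActivity("ProcessPayment"))
            {
                await ProcessPaymentAsync(request.CustomerId, total);
            }

            // Low-cardinality tag only: region, not customer ID.
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("region", request.Region));

            activity?.SetStatus(ActivityStatusCode.Ok);
            return new Order { Id = Guid.NewGuid(), Total = total };
        }
        catch (Exception ex)
        {
            // Mark the span as failed so error-based sampling policies can match it.
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}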

.NET Aspire: OTel built in

If you use .NET Aspire, OpenTelemetry is already configured in the ServiceDefaults project. Calling builder.ConfigureOpenTelemetry() (which the template's builder.AddServiceDefaults() does for you) enables tracing, metrics, and logging. The Aspire Dashboard also displays all telemetry in the dev environment without installing Grafana or Jaeger.

6. Smart sampling strategies

At large scale, collecting 100% of traces is usually impractical: storage and network costs become very high. Sampling reduces volume while keeping the data that matters.

Head-based vs. tail-based sampling

graph TD
    subgraph "Head-based Sampling"
        H1[Request arrives] --> H2{Decide immediately}
        H2 -->|Sample| H3[Collect trace]
        H2 -->|Drop| H4[Discard entirely]
    end

    subgraph "Tail-based Sampling"
        T1[Request arrives] --> T2[Collect ALL spans]
        T2 --> T3[Trace completes]
        T3 --> T4{Evaluate the whole trace}
        T4 -->|Error/Slow| T5[Keep]
        T4 -->|Normal| T6[Apply ratio sampling]
    end

    style H1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H2 fill:#e94560,stroke:#fff,color:#fff
    style T1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T4 fill:#2c3e50,stroke:#fff,color:#fff

Head-based sampling decides up front; tail-based sampling evaluates after the trace completes

Criterion	Head-based	Tail-based
Decision point	As soon as the request starts	After the trace completes
Advantages	Simple, low overhead	Keeps errors and slow requests
Drawbacks	Can miss error traces	Collector needs enough RAM to buffer
Best for	Very high traffic, limited budget	Production that needs precise debugging
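
A head-based decision can be sketched as a deterministic function of the trace ID, so every service that sees the same trace reaches the same keep/drop verdict. This is an illustrative approach under that assumption, not the exact algorithm the OTel SDK samplers use; ShouldSample is a hypothetical helper.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Keep a trace when the numeric value of its ID falls in the lowest `ratio` share of the range.
static bool ShouldSample(string traceIdHex, double ratio)
{
    if (ratio >= 1.0) return true;
    if (ratio <= 0.0) return false;

    // Interpret the last 8 bytes (16 hex characters) of the trace ID as an unsigned number.
    ulong value = ulong.Parse(traceIdHex.AsSpan(16, 16), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    return value < ratio * ulong.MaxValue;
}

// Simulate 10,000 requests with random trace IDs and a 10% sampling ratio.
int kept = 0;
const int total = 10_000;
for (int i = 0; i < total; i++)
{
    string traceId = ActivityTraceId.CreateRandom().ToHexString();
    if (ShouldSample(traceId, 0.10)) kept++;
}

Console.WriteLine($"Kept {kept} of {total} traces (roughly 10% expected)");
```

Because the decision is made before anything is known about the outcome, a failing request has the same 10% chance of being kept as a healthy one, which is the drawback listed in the table.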

A combined strategy commonly used in production:

# Tail-sampling configuration on the Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep traces that contain errors
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep requests slower than 1 second
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep traces from critical endpoints
      - name: keep-critical-paths
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/orders"]
      # 5% of the remaining normal traces
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
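
The way these policies combine (a trace is kept if any policy matches) can be sketched in a few lines. The span tuple shape, the KeepTrace helper, and the hash-style 5% bucket are illustrative assumptions, not the Collector's internal implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static bool KeepTrace(
    string traceIdHex,
    IReadOnlyList<(string Route, double StartMs, double EndMs, bool IsError)> spans)
{
    // keep-errors: any span with ERROR status keeps the whole trace.
    if (spans.Any(s => s.IsError)) return true;

    // keep-slow: trace duration is the latest end minus the earliest start.
    double durationMs = spans.Max(s => s.EndMs) - spans.Min(s => s.StartMs);
    if (durationMs > 1000) return true;

    // keep-critical-paths: any span on one of the listed routes.
    string[] criticalRoutes = { "/api/payments", "/api/orders" };
    if (spans.Any(s => criticalRoutes.Contains(s.Route))) return true;

    // baseline: deterministic 5% bucket derived from the trace ID.
    ulong bucket = Convert.ToUInt64(traceIdHex.Substring(16, 16), 16) % 100;
    return bucket < 5;
}

var normal = new List<(string Route, double StartMs, double EndMs, bool IsError)>
{
    ("/api/catalog", 0.0, 20.0, false)
};
var slow = new List<(string Route, double StartMs, double EndMs, bool IsError)>
{
    ("/api/catalog", 0.0, 400.0, false),
    ("/api/catalog", 400.0, 1500.0, false)
};

// The trace ID below ends in 0x32 (50), which falls outside the 5% baseline bucket.
Console.WriteLine(KeepTrace("00000000000000000000000000000032", normal));
Console.WriteLine(KeepTrace("00000000000000000000000000000032", slow));
```

The function needs every span of a trace at once, which is why the Collector buffers spans for decision_wait and why all spans of one trace must reach the same Collector instance.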

7. Building a complete observability stack

A popular, cost-effective stack for production (fully self-hostable):

graph LR
    App[Applications] -->|OTLP| Col[OTel Collector]
    Col -->|Traces| Tempo[Grafana Tempo]
    Col -->|Metrics| Prom[Prometheus]
    Col -->|Logs| Loki[Grafana Loki]

    Tempo --> Graf[Grafana Dashboard]
    Prom --> Graf
    Loki --> Graf

    Graf --> Alert[Alertmanager]
    Alert --> PD[PagerDuty/Slack]

    style Col fill:#e94560,stroke:#fff,color:#fff
    style Graf fill:#2c3e50,stroke:#fff,color:#fff
    style Tempo fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Prom fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Loki fill:#f8f9fa,stroke:#e94560,color:#2c3e50

An open-source observability stack: OTel + the Grafana ecosystem

Component	Tool	Role	Cost
Traces	Grafana Tempo	Stores traces, lookup by TraceId	Free (self-hosted)
Metrics	Prometheus	Collects and queries metrics (PromQL)	Free (self-hosted)
Logs	Grafana Loki	Log aggregation with label-based indexing	Free (self-hosted)
Visualization	Grafana	Dashboards, alerting, explore	Free (self-hosted)
Alerting	Alertmanager	Routes alerts → Slack, PagerDuty, Email	Free

Docker Compose for local development

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
      - "8889:8889"   # Prometheus metrics
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Anonymous admin access is for local development only.
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml

Comparison: self-hosted vs. managed service

Criterion	Self-hosted (Grafana Stack)	Managed (Datadog/New Relic)	Hybrid (Grafana Cloud Free)
Cost	Infrastructure only (servers/storage)	$15-25/host/month	Free tier: 50GB logs, 10K metrics
Setup	Requires DevOps experience	5-minute setup	15-minute setup
Scaling	You manage HA and retention	Automatic	Free tier has limits
Vendor lock-in	None (OTel standard)	High (proprietary features)	Low (OTel compatible)
Best for	Large teams with an infra budget	Startups, small teams	Side projects, MVPs

8. Best practices for production in 2026

Semantic conventions: standard naming

One of the biggest benefits of OpenTelemetry is its Semantic Conventions, a set of standard naming rules for attributes. When every service follows the same conventions, you can query across services consistently:

Domain	Attribute	Meaning
HTTP	`http.request.method`	GET, POST, PUT...
HTTP	`http.response.status_code`	200, 404, 500...
HTTP	`url.path`	/api/orders
Database	`db.system`	postgresql, redis, mssql
Database	`db.operation.name`	SELECT, INSERT, findOne
Messaging	`messaging.system`	kafka, rabbitmq, servicebus
Messaging	`messaging.destination.name`	orders-queue, events-topic

Key principles

1. Reduce cardinality

High attribute cardinality (for example, attaching user.id to every metric) makes the number of Prometheus time series explode. Attach high-cardinality attributes only to traces (which are cheaper to store), and keep metrics to low-cardinality labels such as region, status_code, and endpoint.
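
The explosion is simple multiplication: the worst-case number of time series for one metric is the product of the distinct values of each label. The cardinalities below are made-up numbers for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Worst-case series count for one metric = product of distinct values per label.
static long SeriesCount(IReadOnlyDictionary<string, long> labelCardinality) =>
    labelCardinality.Values.Aggregate(1L, (acc, n) => acc * n);

// Illustrative cardinalities, not measurements from a real system.
var lowCardinality = new Dictionary<string, long>
{
    ["region"] = 5,
    ["status_code"] = 6,
    ["endpoint"] = 40
};
var withUserId = new Dictionary<string, long>(lowCardinality) { ["user.id"] = 100_000 };

Console.WriteLine($"Low-cardinality labels: {SeriesCount(lowCardinality)} series");
Console.WriteLine($"After adding user.id:   {SeriesCount(withUserId)} series");
```

With these numbers, three low-cardinality labels give 5 × 6 × 40 = 1,200 series, and one user.id label with 100,000 values raises the ceiling to 120,000,000.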

2. Filter health checks and noise

Drop traces from the /health, /ready, and /metrics endpoints. They generate a huge volume of traces with no debugging value. Configure the filter at the SDK level (rather than in the Collector) to save network bandwidth.

3. Secure telemetry data

Telemetry can contain PII (emails, tokens, query parameters). Use the redaction processor in the Collector to mask or drop sensitive attributes before export. Always use TLS for OTLP endpoints in production.
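
The idea behind redaction can be sketched independently of the Collector: drop attributes whose keys are known to be sensitive, and mask sensitive patterns inside the values that remain. The Redact helper, the blocked key list, and the email pattern are illustrative assumptions, not the redaction processor's configuration or code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static Dictionary<string, string> Redact(IReadOnlyDictionary<string, string> attributes)
{
    // Hypothetical block list and pattern, chosen for illustration only.
    var blockedKeys = new HashSet<string> { "http.request.header.authorization", "user.email" };
    var emailPattern = new Regex(@"[\w.+-]+@[\w-]+(\.[\w-]+)+");

    return attributes
        .Where(kv => !blockedKeys.Contains(kv.Key))                    // drop sensitive keys entirely
        .ToDictionary(kv => kv.Key,
                      kv => emailPattern.Replace(kv.Value, "***"));    // mask emails inside values
}

var spanAttributes = new Dictionary<string, string>
{
    ["http.request.method"] = "POST",
    ["url.query"] = "email=alice@example.com&page=2",
    ["user.email"] = "alice@example.com"
};

var redacted = Redact(spanAttributes);
foreach (var pair in redacted) Console.WriteLine($"{pair.Key} = {pair.Value}");
```

Doing this centrally in the Collector, rather than in each service, means one policy covers every language and team.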

Suggested rollout roadmap

Phase 1: Foundation (Weeks 1-2)

Basic setup: add the OTel SDK and auto-instrumentation to all services. Deploy the Collector in Agent mode. Connect to the Grafana Cloud free tier or a local Jaeger to see your first traces.

Phase 2: Enrichment (Weeks 3-4)

Add context: custom spans for important business logic. Apply semantic conventions. Create custom metrics (orders/sec, payment success rate). Structured logging with TraceId correlation.

Phase 3: Scale (Weeks 5-6)

Optimize for production: configure tail-based sampling. Tune the batch processor and memory limiter. Set up Grafana dashboards for RED metrics (Rate, Errors, Duration). Define alert rules for SLOs/SLIs.

Phase 4: Production-grade (Weeks 7-8)

Hardening: HA for the Collector (2+ replicas). TLS everywhere. PII redaction. Retention policies. Team training and runbooks for incident response based on observability data.

Conclusion

OpenTelemetry is not just a library: it is the industry standard for observability. As the second most active CNCF project (after Kubernetes), with support from more than 100 vendors and native integration with .NET, adopting OpenTelemetry is no longer a question of "should we?" but of "where do we start?".

Key points to remember:

Start with traces: they deliver value fastest when debugging distributed systems
Auto-instrumentation first, manual instrumentation later: do not try to cover everything from day one
Treat the Collector as mandatory: avoid exporting directly from the app to the backend in production
Tail-based sampling, when the Collector is sized correctly, keeps the errors and slow requests that head-based sampling would drop at random
Semantic conventions make cross-service queries consistent, so invest time in standardizing early

With a free stack (OTel Collector + Grafana + Tempo + Prometheus + Loki), you can build a production-grade observability system with no license cost. All it takes is the time to set it up and operate it properly.


# OpenTelemetry — Tiêu chuẩn Observability cho hệ thống phân tán

CNCF #2 Dự án active nhất sau Kubernetes

40+ Ngôn ngữ & framework được hỗ trợ

100+ Vendor tích hợp sẵn

3 Trụ cột: Traces, Metrics, Logs

Khi hệ thống phân tán ngày càng phức tạp — microservices gọi chéo nhau, message queue xen giữa, cache layer chồng chất — câu hỏi *"lỗi xảy ra ở đâu?"* trở nên cực kỳ khó trả lời. Bạn không thể debug production bằng breakpoint. Bạn cần **observability** — và OpenTelemetry đang trở thành tiêu chuẩn duy nhất mà toàn ngành công nhận.

## 1. Observability là gì và tại sao quan trọng?

Observability (khả năng quan sát) là năng lực hiểu trạng thái bên trong của hệ thống chỉ thông qua các tín hiệu đầu ra — mà **không cần thay đổi code** hay can thiệp vào luồng chạy chính. Khác với monitoring truyền thống (theo dõi các chỉ số đã biết trước), observability cho phép bạn trả lời cả những câu hỏi *chưa từng đặt ra*.

#### Monitoring vs. Observability

**Monitoring** trả lời: "CPU đang bao nhiêu %?" hay "request/s có đạt ngưỡng không?"  
**Observability** trả lời: "Tại sao request từ user X ở region AP mất 3 giây thay vì 200ms, và service nào gây ra bottleneck?"

## 2. Ba trụ cột: Traces, Metrics, Logs

```
graph TD
    A[Telemetry Data] --> B[Traces]
    A --> C[Metrics]
    A --> D[Logs]
    B --> B1["Distributed Tracing  
Theo dõi request flow"]
    B --> B2["Spans  
Đơn vị thời gian"]
    B --> B3["Context Propagation  
W3C TraceContext"]
    C --> C1["Counters  
Đếm tích lũy"]
    C --> C2["Gauges  
Giá trị tức thời"]
    C --> C3["Histograms  
Phân phối thống kê"]
    D --> D1["Structured Logs  
Key-value pairs"]
    D --> D2["Correlation  
Gắn TraceId/SpanId"]
    D --> D3["Severity Levels  
Info/Warn/Error"]

style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

```
Ba trụ cột telemetry trong OpenTelemetry

### Traces — Theo dõi hành trình request

Một **trace** đại diện cho toàn bộ hành trình của một request xuyên qua hệ thống phân tán. Mỗi trace gồm nhiều **span** — đơn vị công việc nhỏ nhất có tên, thời gian bắt đầu/kết thúc, và các attribute mô tả ngữ cảnh.

```
sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant Auth as Auth Service
    participant Order as Order Service
    participant DB as Database
    participant Cache as Redis Cache

Client->>Gateway: POST /orders (TraceId: abc123)
    Gateway->>Auth: Verify Token (SpanId: s1)
    Auth-->>Gateway: 200 OK (2ms)
    Gateway->>Order: Create Order (SpanId: s2)
    Order->>Cache: Check inventory (SpanId: s3)
    Cache-->>Order: Cache HIT (0.5ms)
    Order->>DB: INSERT order (SpanId: s4)
    DB-->>Order: OK (15ms)
    Order-->>Gateway: 201 Created (18ms)
    Gateway-->>Client: 201 Created (22ms)

```
Distributed trace qua nhiều service — mỗi mũi tên là một span

Mỗi span chứa:

- **TraceId**: ID duy nhất cho toàn bộ trace (được propagate qua HTTP header `traceparent`)
- **SpanId**: ID của span hiện tại
- **ParentSpanId**: Liên kết cha-con giữa các span
- **Attributes**: Key-value pairs như `http.method=POST`, `db.system=postgresql`
- **Events**: Các sự kiện xảy ra trong span (ví dụ: "cache miss", "retry attempt")
- **Status**: OK, ERROR, hoặc UNSET

### Metrics — Đo lường hiệu năng bằng con số

Metrics là các phép đo số lượng theo thời gian. OpenTelemetry hỗ trợ ba loại metric chính:

| Loại | Mô tả | Ví dụ | Khi nào dùng |
| --- | --- | --- | --- |
| **Counter** | Giá trị tích lũy, chỉ tăng | Tổng số request, tổng bytes gửi | Đếm sự kiện qua thời gian |
| **Gauge** | Giá trị tức thời, tăng/giảm | CPU usage, active connections, queue length | Đo trạng thái hiện tại |
| **Histogram** | Phân phối thống kê | Request latency (p50, p95, p99) | Phân tích phân phối giá trị |

### Logs — Sự kiện có ngữ cảnh

Logs trong OpenTelemetry không chỉ là text thuần — chúng là **structured logs** với TraceId và SpanId được gắn tự động. Nhờ đó, khi nhìn vào một log entry ERROR, bạn có thể nhảy thẳng tới trace tương ứng để xem toàn bộ hành trình request.

```
{
  "timestamp": "2026-04-21T10:15:30Z",
  "severity": "ERROR",
  "body": "Payment processing failed",
  "attributes": {
    "order.id": "ORD-98765",
    "payment.provider": "stripe",
    "error.type": "timeout"
  },
  "traceId": "abc123def456...",
  "spanId": "span789..."
}
```

#### Tương quan Logs-Traces-Metrics

Sức mạnh thực sự nằm ở việc **tương quan** ba tín hiệu: khi metric cho thấy latency p99 tăng đột biến → filter traces có duration > 2s → tìm span chậm nhất → đọc log của span đó để hiểu root cause. Đây là workflow mà monitoring truyền thống không thể làm được.

## 3. Kiến trúc OpenTelemetry

OpenTelemetry không phải một sản phẩm — nó là một **framework và bộ công cụ** gồm nhiều thành phần phối hợp với nhau:

```
graph LR
    subgraph Application
        A1[Your Code] --> SDK[OTel SDK]
        A2[Auto-Instrumentation] --> SDK
        A3[Library Instrumentation] --> SDK
    end

SDK -->|OTLP| C[OTel Collector]

subgraph Collector
        C --> R[Receivers]
        R --> P[Processors]
        P --> E[Exporters]
    end

E --> G[Grafana/Tempo]
    E --> J[Jaeger]
    E --> PR[Prometheus]
    E --> AZ[Azure Monitor]
    E --> DD[Datadog]

style SDK fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style P fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style J fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PR fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style AZ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style DD fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

```
Kiến trúc tổng quan OpenTelemetry — từ application đến backend

### Các thành phần chính

- **API**: Interface chuẩn để tạo telemetry — library authors sử dụng API để instrument code mà không phụ thuộc vào implementation cụ thể
- **SDK**: Implementation của API, xử lý việc thu thập, xử lý và export dữ liệu
- **Auto-Instrumentation**: Tự động thu thập telemetry từ các framework phổ biến (ASP.NET Core, HttpClient, EF Core...) mà không cần sửa code
- **OTLP (OpenTelemetry Protocol)**: Giao thức truyền tải chuẩn, vendor-neutral, hỗ trợ cả gRPC và HTTP/protobuf
- **Semantic Conventions**: Quy ước đặt tên chuẩn cho attributes — đảm bảo `http.request.method` có cùng ý nghĩa bất kể ngôn ngữ nào

## 4. OpenTelemetry Collector — Trái tim hệ thống

### Hai chế độ triển khai

```
graph TD
    subgraph Agent Mode
        App1[App 1] --> CA[Collector Agent]
        App2[App 2] --> CA
        CA -->|Forward| CG
    end

subgraph Gateway Mode
        CA2[Agent 1] --> CG[Collector Gateway]
        CA3[Agent 2] --> CG
        CG --> Backend[Observability Backend]
    end

style CA fill:#e94560,stroke:#fff,color:#fff
    style CG fill:#2c3e50,stroke:#fff,color:#fff
    style Backend fill:#f8f9fa,stroke:#e94560,color:#2c3e50

```
Agent mode (sidecar) vs. Gateway mode (centralized)

| Đặc điểm | Agent Mode | Gateway Mode |
| --- | --- | --- |
| **Triển khai** | Sidecar / DaemonSet cạnh app | Standalone service tập trung |
| **Ưu điểm** | Latency thấp, xử lý cục bộ | Quản lý tập trung, sampling phức tạp |
| **Nhược điểm** | Tốn resource trên mỗi node | Single point of failure nếu không HA |
| **Phù hợp** | Kubernetes, edge computing | Multi-cluster, cross-region |

### Cấu hình Collector (YAML)

Collector cấu hình theo pipeline: **Receivers → Processors → Exporters**. Đây là ví dụ production-ready:

```
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      cert_file: /certs/client.crt
      key_file: /certs/client.key
  # Tempo only stores traces, so logs go to Loki over OTLP/HTTP (Loki 3.x and later).
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Order matters: memory_limiter first, batch last.
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
```

#### The Memory Limiter is mandatory

In production, **always** place the `memory_limiter` processor BEFORE the other processors in each pipeline. Otherwise, during a traffic spike the collector can run out of memory, get killed, and lose all telemetry data held in its buffers. Set `limit_mib` to roughly 70-80% of the RAM available to the container.
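As a worked example of that sizing rule, the sketch below assumes a container with a 1024 MiB memory limit (an assumption for illustration, not a recommendation). It shows the same budget expressed in absolute values and, as an alternative, in percentages, which avoids recalculating when the container limit changes.

```yaml
processors:
  # Option 1: absolute values for a container limited to 1024 MiB.
  memory_limiter:
    check_interval: 1s
    limit_mib: 768          # hard limit: 75% of 1024 MiB
    spike_limit_mib: 192    # soft limit = 768 - 192 = 576 MiB

  # Option 2: percentages of the memory available to the process.
  memory_limiter/percentage:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20
```

Only one of the two would be referenced in a pipeline. Once the soft limit is crossed the processor starts refusing data, which lets receivers push back on senders before the process is at risk of being killed.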

## 5. Integrating OpenTelemetry with .NET

.NET has a particular advantage: the telemetry APIs are built into the framework (`ILogger`, `System.Diagnostics.Metrics`, `ActivitySource`). The OpenTelemetry .NET SDK only needs to hook into these APIs and export the data, so you do not have to change how you write code.

```
graph LR
    subgraph ".NET Framework APIs"
        IL["ILogger<T>"]
        ME["Meter / Counter"]
        AS["ActivitySource / Activity"]
    end

    subgraph "OTel .NET SDK"
        IL --> LP[Log Provider]
        ME --> MP[Meter Provider]
        AS --> TP[Tracer Provider]
    end

    LP --> EX[OTLP Exporter]
    MP --> EX
    TP --> EX

    EX --> COL[Collector]

    style IL fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ME fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style AS fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LP fill:#e94560,stroke:#fff,color:#fff
    style MP fill:#e94560,stroke:#fff,color:#fff
    style TP fill:#e94560,stroke:#fff,color:#fff
    style EX fill:#2c3e50,stroke:#fff,color:#fff
    style COL fill:#2c3e50,stroke:#fff,color:#fff
```
.NET uses its native APIs; the OTel SDK only handles export

### Installing the NuGet packages

```
dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
```

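### Configuration in Program.cs

```
using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// One Collector endpoint shared by all three signals (OTLP over gRPC).
var otlpEndpoint = new Uri("http://otel-collector:4317");

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "OrderService",
            serviceVersion: "1.0.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        // Register the custom ActivitySource; spans from unregistered sources are never recorded.
        .AddSource("OrderService")
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Skip health-check requests to keep trace volume down.
            opts.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            opts.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(opts =>
        {
            // Captures SQL text on spans; review it for sensitive data before enabling.
            opts.SetDbStatementForText = true;
            opts.RecordException = true;
        })
        .AddOtlpExporter(opts =>
        {
            opts.Endpoint = otlpEndpoint;
            opts.Protocol = OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        // Register the custom Meter so that its instruments are exported.
        .AddMeter("OrderService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(opts => opts.Endpoint = otlpEndpoint))
    .WithLogging(logging => logging
        .AddOtlpExporter(opts => opts.Endpoint = otlpEndpoint));

var app = builder.Build();
app.Run();
```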
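If you would rather not hard-code the Collector address, the OTLP exporter can also pick it up from the standard OpenTelemetry environment variables. The fragment below is a sketch of how that might look in a Compose file; the service name `order-service` and its image are hypothetical. Values set explicitly in code take precedence, so this only has an effect when `opts.Endpoint` is not assigned in `Program.cs`.

```yaml
services:
  order-service:
    image: order-service:dev   # hypothetical image name
    environment:
      # Standard OTLP exporter variables defined by the OpenTelemetry specification.
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

Keeping the endpoint in the environment lets the same build run unchanged against a local Collector, a staging gateway, or production.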

### Custom Instrumentation: Business Logic Tracing

Auto-instrumentation covers HTTP, database, and gRPC calls. To trace **business logic** (order processing, price calculation, stock checks), you need to add spans manually:

```
using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderService
{
    // Both names must be registered with AddSource("OrderService") and
    // AddMeter("OrderService") when configuring OpenTelemetry.
    private static readonly ActivitySource Source = new("OrderService");
    private static readonly Meter Meter = new("OrderService");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders.created");

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        // StartActivity returns null when nothing listens to this source, hence the ?. calls.
        using var activity = Source.StartActivity("CreateOrder");
        activity?.SetTag("order.customer_id", request.CustomerId);
        activity?.SetTag("order.items_count", request.Items.Count);

        try
        {
            // Validate inventory
            using (Source.StartActivity("ValidateInventory"))
            {
                await ValidateInventoryAsync(request.Items);
            }

            // Calculate pricing
            decimal total;
            using (var pricingSpan = Source.StartActivity("CalculatePricing"))
            {
                total = await CalculatePricingAsync(request.Items);
                pricingSpan?.SetTag("order.total", total);
            }

            // Process payment
            using (Source.StartActivity("ProcessPayment"))
            {
                await ProcessPaymentAsync(request.CustomerId, total);
            }

            // Low-cardinality tag only: region, not customer id.
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("region", request.Region));

            activity?.SetStatus(ActivityStatusCode.Ok);
            return new Order { Id = Guid.NewGuid(), Total = total };
        }
        catch (Exception ex)
        {
            // Mark the span as failed so error-based sampling policies can keep the trace.
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

#### .NET Aspire: OTel out of the box

If you are using **.NET Aspire**, OpenTelemetry is already configured in the `ServiceDefaults` project. Calling `builder.ConfigureOpenTelemetry()` (which the template's `builder.AddServiceDefaults()` does for you) is enough for tracing, metrics, and logging to work automatically. The Aspire Dashboard also displays all telemetry in the dev environment without installing Grafana or Jaeger.

## 6. Smart sampling strategies

At large scale, collecting 100% of traces is usually not feasible: storage and network costs become very high. Sampling reduces the volume while still keeping the important data.

### Head-based vs. Tail-based Sampling

```
graph TD
    subgraph "Head-based Sampling"
        H1[Request arrives] --> H2{Decide immediately}
        H2 -->|Sample| H3[Collect trace]
        H2 -->|Drop| H4[Discard entirely]
    end

    subgraph "Tail-based Sampling"
        T1[Request arrives] --> T2[Collect ALL spans]
        T2 --> T3[Trace completes]
        T3 --> T4{Evaluate the whole trace}
        T4 -->|Error/Slow| T5[Keep]
        T4 -->|Normal| T6[Apply ratio sampling]
    end

    style H1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H2 fill:#e94560,stroke:#fff,color:#fff
    style T1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T4 fill:#2c3e50,stroke:#fff,color:#fff
```
Head-based sampling decides up front; tail-based sampling evaluates after the trace has completed

| Criterion | Head-based | Tail-based |
| --- | --- | --- |
| **Decision point** | As soon as the request starts | After the trace completes |
| **Pros** | Simple, low overhead | Can keep every error and slow request |
| **Cons** | Can miss error traces | The collector needs enough RAM to buffer traces |
| **Best for** | Very high traffic, limited budget | Production systems that need precise debugging |
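For comparison with the tail-based setup, head-based sampling is typically configured in the SDK rather than in the Collector. One way to do that, sketched below, is through the sampler environment variables defined by the OpenTelemetry specification. The service name and image are hypothetical, and support for these variables differs between SDKs and versions, so treat this as an assumption to verify against your SDK's documentation.

```yaml
services:
  order-service:
    image: order-service:dev   # hypothetical image name
    environment:
      # Follow the parent's sampling decision; for new traces keep about 10%.
      - OTEL_TRACES_SAMPLER=parentbased_traceidratio
      - OTEL_TRACES_SAMPLER_ARG=0.1
```

The parent-based variant matters in a distributed system: downstream services honor the decision made at the entry point, so a trace is either recorded end to end or not at all.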

A common combined strategy in production:

```
# Tail-sampling configuration on the Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep traces that contain an error
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep requests slower than 1 second
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep traces from critical endpoints
      - name: keep-critical-paths
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/orders"]
      # Baseline: 5% of all traces (a trace is kept if any policy matches)
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```
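One operational caveat: tail sampling can only judge a trace if all of its spans reach the same Collector instance. When more than one sampling Collector runs, a common approach is to put a first tier in front that routes by trace ID using the `loadbalancing` exporter. The fragment below is a sketch of that first tier; the DNS name is hypothetical, and the `otlp` receiver and `memory_limiter` processor are assumed to be defined as in the earlier Collector example.

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID"   # all spans of a trace go to the same backend collector
    protocol:
      otlp:
        tls:
          insecure: true     # assumption: trusted cluster network
    resolver:
      dns:
        # Hypothetical headless service that resolves to every sampling collector.
        hostname: otel-sampling.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

The second tier then runs the `tail_sampling` processor shown above, knowing it sees complete traces.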

## 7. Building a complete observability stack

A popular and cost-effective stack for production (it can be fully self-hosted):

```
graph TD
    Col[OTel Collector] --> Tempo
    Col --> Prom
    Col --> Loki

    Tempo --> Graf[Grafana Dashboard]
    Prom --> Graf
    Loki --> Graf

    Graf --> Alert[Alertmanager]
    Alert --> PD[PagerDuty/Slack]

    style Col fill:#e94560,stroke:#fff,color:#fff
    style Graf fill:#2c3e50,stroke:#fff,color:#fff
    style Tempo fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Prom fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Loki fill:#f8f9fa,stroke:#e94560,color:#2c3e50
```
Open-source observability stack: OTel + the Grafana ecosystem

| Component | Tool | Role | Cost |
| --- | --- | --- | --- |
| **Traces** | Grafana Tempo | Stores traces, lookup by TraceId | Free (self-host) |
| **Metrics** | Prometheus | Collects and queries metrics (PromQL) | Free (self-host) |
| **Logs** | Grafana Loki | Log aggregation with label-based indexing | Free (self-host) |
| **Visualization** | Grafana | Dashboards, alerting, Explore | Free (self-host) |
| **Alerting** | Alertmanager | Routes alerts to Slack, PagerDuty, email | Free |

### Docker Compose for local development

```
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
      - "8889:8889"   # Prometheus metrics
    volumes:
      # The contrib image reads its config from /etc/otelcol-contrib/config.yaml by default.
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Anonymous admin access is for local development only.
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml
```
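The Compose file mounts `prometheus.yml` and `grafana-datasources.yaml` without showing them. A minimal sketch of both files might look like the following, assuming the service names and ports from the Compose file above; the job name, scrape interval, and data source names are illustrative choices. The two YAML documents are shown in one block and belong in separate files.

```yaml
# prometheus.yml: scrape the metrics that the Collector exposes on port 8889
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
---
# grafana-datasources.yaml: provision the three backends as Grafana data sources
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

With these in place, Grafana starts with all three data sources available, so traces, metrics, and logs can be explored without manual setup.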

### Comparison: Self-host vs. Managed Service

| Criterion | Self-host (Grafana Stack) | Managed (Datadog/New Relic) | Hybrid (Grafana Cloud Free) |
| --- | --- | --- | --- |
| **Cost** | Infrastructure only (servers/storage) | $15-25/host/month | Free tier: 50GB logs, 10K metrics |
| **Setup** | Requires DevOps experience | About 5 minutes | About 15 minutes |
| **Scaling** | You manage HA and retention | Automatic | Limited on the free tier |
| **Vendor lock-in** | None (OTel standard) | High (proprietary features) | Low (OTel compatible) |
| **Best for** | Large teams with an infra budget | Startups, small teams | Side projects, MVPs |

## 8. Best Practices for Production in 2026

### Semantic Conventions: standard naming

One of the biggest benefits of OpenTelemetry is its **Semantic Conventions**, a standard set of naming rules for attributes. When every service follows the same conventions, you can query across services consistently:

| Domain | Attribute | Meaning |
| --- | --- | --- |
| **HTTP** | `http.request.method` | GET, POST, PUT... |
| **HTTP** | `http.response.status_code` | 200, 404, 500... |
| **HTTP** | `url.path` | /api/orders |
| **Database** | `db.system.name` | postgresql, redis, microsoft.sql_server |
| **Database** | `db.operation.name` | SELECT, INSERT, findOne |
| **Messaging** | `messaging.system` | kafka, rabbitmq, servicebus |
| **Messaging** | `messaging.destination.name` | orders-queue, events-topic |

### Key principles

#### 1. Reduce cardinality

High attribute cardinality (for example, attaching `user.id` to every metric) makes the number of time series in Prometheus explode, because every distinct label combination becomes its own series. Attach high-cardinality attributes only to **traces** (where they are cheaper to store), and limit metrics to low-cardinality labels such as `region`, `status_code`, and `endpoint`.

#### 2. Filter health checks and noise

Drop traces from the `/health`, `/ready`, and `/metrics` endpoints. They generate a huge volume of traces with no debugging value. Configure the filter at the SDK level (rather than in the collector) to save network bandwidth.
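Where an SDK-level filter is not possible (for example, a third-party service you cannot change), a Collector-side filter can act as a fallback. The sketch below uses the `filter` processor from the contrib distribution; the processor name `filter/drop-health` is arbitrary, and it assumes spans carry the `url.path` attribute, which depends on the instrumentation version in use.

```yaml
processors:
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        # A span is dropped when any of these conditions is true.
        - 'attributes["url.path"] == "/health"'
        - 'attributes["url.path"] == "/ready"'
        - 'attributes["url.path"] == "/metrics"'
```

This saves storage but not network traffic between the app and the Collector, which is why filtering in the SDK remains the first choice.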

#### 3. Secure telemetry data

Telemetry can contain PII (emails, tokens, query parameters). Use the `redaction` processor in the collector to mask or drop sensitive attributes before export. Always use TLS for OTLP endpoints in production.
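A minimal sketch of that idea for traces: the `redaction` processor masks attribute values that look like email addresses, and an `attributes` processor deletes one known-sensitive key outright. The regular expression and the choice of `http.request.header.authorization` are illustrative assumptions; a real deployment needs its own list.

```yaml
processors:
  redaction:
    allow_all_keys: true   # keep every attribute key, only mask matching values
    blocked_values:
      # Simplified email pattern, for illustration only.
      - '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    summary: info

  attributes/drop-secrets:
    actions:
      - key: http.request.header.authorization
        action: delete
```

Both processors would be listed in the pipeline after `memory_limiter` and before `batch`, so nothing sensitive is batched for export.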

### Suggested rollout roadmap

Phase 1: Foundation (Weeks 1-2)

**Basic setup:** Add the OTel SDK and auto-instrumentation to all services. Deploy the Collector in Agent mode. Connect to the Grafana Cloud free tier or a local Jaeger instance to see your first traces.

Phase 2: Enrichment (Weeks 3-4)

**Add context:** Add custom spans for important business logic. Apply semantic conventions. Create custom metrics (orders/sec, payment success rate). Use structured logging with TraceId correlation.

Phase 3: Scale (Weeks 5-6)

**Optimize for production:** Configure tail-based sampling. Tune the batch processor and memory limiter. Set up Grafana dashboards for RED metrics (Rate, Errors, Duration). Define alert rules for SLOs/SLIs.

Phase 4: Production-grade (Weeks 7-8)

**Hardening:** HA for the Collector (2+ replicas). TLS everywhere. PII redaction. Retention policies. Team training and incident-response runbooks based on observability data.
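As an illustration of the alert rules mentioned in Phase 3, here is a sketch of a Prometheus rule on the error component of RED. The metric and label names are assumptions: they presume the ASP.NET Core request-duration histogram exported through the Collector's `prometheus` exporter with the `myapp` namespace, so check the actual names in your Prometheus instance first. The 5% threshold and 10-minute window are placeholders, not recommendations.

```yaml
groups:
  - name: red-metrics
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(myapp_http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
            /
          sum(rate(myapp_http_server_request_duration_seconds_count[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are failing"
```

The same pattern extends to Duration by alerting on a `histogram_quantile` over the `_bucket` series.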

## Conclusion

OpenTelemetry is not just a library; it is an **industry standard** for observability. As the second most active CNCF project (after Kubernetes), with support from more than 100 vendors and native integration with .NET, adopting OpenTelemetry is no longer a question of *"should we?"* but of *"where do we start?"*.

Key takeaways:

- **Start with traces**: they deliver value fastest when debugging distributed systems
- **Auto-instrumentation first**, manual instrumentation later: do not try to cover everything from day one
- **The Collector is mandatory**: do not export directly from the app to the backend in production
- **Tail-based sampling** helps ensure you do not miss errors or slow requests
- **Semantic conventions** keep cross-service queries consistent, so invest in standardizing early


