OpenTelemetry 2026 for .NET 10 - Observability Architecture for Microservices with OTLP, Collector, Tail Sampling, Tempo, Loki, Prometheus, and Grafana

Posted on: 4/16/2026 11:22:49 AM

Table of contents

1. OpenTelemetry 2026: the open observability standard for .NET 10
2. A timeline of OpenTelemetry
3. The OpenTelemetry data model: the most important part
4. OpenTelemetry architecture on .NET 10
  4.1. Basic setup in Program.cs
    Four habits worth keeping
  4.2. Manual instrumentation when you need it
    Cardinality warning
5. OTLP: the protocol underneath every exporter
6. OpenTelemetry Collector: the backbone of the pipeline
  6.1. Collector pipeline structure
7. Sampling: the most important technique for controlling cost
8. Correlating logs ↔ traces ↔ metrics
9. Auto-instrumentation: when to use it and when not to
  Recommendations for .NET 10
10. Semantic Conventions: write once, dashboards everywhere
11. Backend: LGTM stack or a vendor, and how to choose
  Practical advice
12. Integration with .NET Aspire 9.5 and OpenTelemetry by default
13. The fourth signal: Continuous Profiling
14. Production patterns: what to do and what to avoid
15. Cost and overhead benchmark
16. Migration roadmap from Application Insights / Serilog + ELK
17. Conclusion: OpenTelemetry is a foundational skill for backend engineers in 2026
18. References

1. OpenTelemetry 2026: the open observability standard for .NET 10

Every time a request passes through five or seven services in a microservice system, the on-call engineer has to answer three very concrete questions: where is it slow, where does it fail, and why. For the past ten years those questions were answered with three separate families of tools: log aggregators (ELK, Splunk, Graylog), time-series metrics (Prometheus, Graphite, InfluxDB), and tracing systems (Jaeger, Zipkin). Each family has its own vocabulary and its own instrumentation, and none of them understands the others. A span in Jaeger does not know which log lines belong to it, and an alert in Prometheus has no path back to the trace that triggered it.

OpenTelemetry (OTEL) is the community's answer to that mess: a vendor-neutral set of standards, a unified API, a reference SDK for more than 15 languages, a common transport protocol (OTLP), and, most importantly, a data model in which logs, metrics, and traces are linked through trace_id, span_id, resource, and scope. By early 2026 OTEL is the de facto standard for observability: the major vendors (Datadog, New Relic, Dynatrace, Honeycomb, Azure Monitor, Google Cloud Trace) all accept OTLP natively, and the major backend frameworks emit OTEL telemetry out of the box.

On the .NET side the story is especially clean, because OpenTelemetry is not a bolted-on SDK. It is a natural extension of System.Diagnostics.Activity (tracing), System.Diagnostics.Metrics.Meter (metrics), and Microsoft.Extensions.Logging (logs), which already ship with the platform. With .NET 10 LTS, Microsoft has brought auto-instrumentation for most workloads (ASP.NET Core, HttpClient, EF Core, gRPC, Kafka, Redis client, Azure SDK) to GA, added exemplars for metrics, and integrated an OTLP exporter over HTTP/Protobuf. This article dissects OTEL 2026 from the data model to a production deployment architecture, focusing on how a .NET 10 microservices system uses it to answer the three opening questions.

1.37: the stable OpenTelemetry Specification release as of early 2026; API, SDK, and Protocol are frozen

15+: languages with a stable SDK: .NET, Java, Go, Python, Node.js, Rust, PHP, Ruby, Swift...

~3%: average CPU overhead with full instrumentation plus the OTLP exporter and a batch processor

3 pillars: logs, metrics, and traces unified through Resource + TraceContext

2. A timeline of OpenTelemetry

OpenTelemetry did not appear out of thin air. It is the result of merging two projects that competed through 2017–2019 (OpenTracing under the CNCF and OpenCensus from Google), plus lessons learned from the observability vendors of the previous decade. Knowing this history helps answer "why does the data model look like this" when you design a system.

2010: Google Dapper

Google publishes the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure". It inspired every modern tracing system: the concepts of trace, span, sampling rate, and context propagation through RPC headers.

2012–2016: Zipkin and Jaeger appear

Twitter open-sources Zipkin (2012), Uber open-sources Jaeger (2016). Both were modeled on Dapper, and both shipped their own mutually incompatible instrumentation SDKs, which created the first lock-in.

2016: Prometheus stable

Pull-based metrics with a /metrics endpoint, multi-dimensional labels, and PromQL. It became the de facto metrics standard, but its data model is entirely different from tracing.

2017: OpenTracing (CNCF)

Ben Sigelman (ex-Google, Dapper co-author) leads the OpenTracing spec: a vendor-neutral tracing API for which vendors implement adapters. The problem: it specified only the API, with no reference SDK and no instrumentation libraries.

2018: OpenCensus (Google)

Google releases OpenCensus: API + SDK + exporters for traces and metrics. It competed head-on with OpenTracing. The community split, and every vendor had to support both.

2019-05: OpenTelemetry merger

OpenTracing and OpenCensus announce at KubeCon that they are merging into OpenTelemetry, which starts as a CNCF Sandbox project. The goal: one API, one SDK, one protocol, and no more split.

2021-02: OTLP 1.0 + Traces SDK GA

The OTLP protocol (OpenTelemetry Protocol) becomes stable, supporting gRPC and HTTP/Protobuf. The traces SDKs for the major languages reach GA.

2022: Metrics SDK GA

OTEL's own metrics model: Counter, UpDownCounter, Histogram, Gauge (synchronous and asynchronous instruments). Unlike Prometheus, it has an explicit notion of delta vs cumulative temporality.

2023: Logs SDK GA + Exemplars

The third piece: logs. Along with it come exemplars, links from a metric data point to a specific trace_id + span_id, which deliver on the promise of "three unified pillars".

2024: Profiling signal

The fourth signal officially enters the spec: continuous profiling (CPU, memory, goroutine/thread). It connects to tracing through span links. Backends such as Grafana Pyroscope and Datadog Continuous Profiler are compatible.

2025-Q3: Semantic Conventions 1.30

The standard attribute sets for each domain (HTTP, DB, messaging, FaaS, gen_ai) are stable. This marks OTEL maturing to the point where "copy the attribute and it matches every backend".

2026-Q1: OTEL 1.37 + .NET 10 LTS

OpenTelemetry .NET 1.13, bundled by default in .NET Aspire 9.5/10. Auto-instrumentation for ASP.NET Core, EF Core, HttpClient, gRPC, MassTransit, and StackExchange.Redis is GA. Exemplars are enabled automatically.

3. The OpenTelemetry data model: the most important part

Many teams roll out OTEL without understanding the data model and end up with good-looking dashboards that cannot answer business questions. Before writing code, you need four foundational concepts:

3.1. Resource: the identity of the process emitting signals

A Resource is the set of attributes describing who is emitting telemetry: service name, service version, deployment environment, host name, container id, pod name, region, cloud provider. A Resource is attached to the process, not to a span; a different resource means a different process. This is the key to grouping the logs, metrics, and traces of one service: filter the backend by service.name and you see every signal from that service.

Semantic Conventions for Resource

Must set: service.name (required), service.version, service.instance.id, deployment.environment.name (prod/staging/dev), host.name, os.type. Should set: container.id, k8s.pod.name, k8s.namespace.name, cloud.region. Backends such as Grafana, Datadog, and New Relic use exactly these attributes to auto-correlate.

3.2. Scope: the unit of instrumentation

InstrumentationScope identifies which library emitted a signal. For example, HttpClient spans come from the scope System.Net.Http and EF Core spans come from the scope Microsoft.EntityFrameworkCore. When you need to debug "where did that span come from", filter by scope.

3.3. Trace Context: the thread that runs through everything

Each request receives a trace_id (16 bytes), generated at the edge of the first service. Each unit of work inside it has a span_id (8 bytes). Context is propagated through the HTTP headers traceparent/tracestate (W3C Trace Context), gRPC metadata, Kafka headers, and SQL comments (sqlcommenter). That is how a single trace covers the whole call chain across services.
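To make the header format concrete, here is a minimal sketch that reads and writes a W3C traceparent value using only the BCL type System.Diagnostics.ActivityContext. The class name TraceParentDemo and its two helpers are illustrative assumptions, not part of any SDK; in a real service the ASP.NET Core and HttpClient instrumentation handle propagation for you.

```csharp
using System;
using System.Diagnostics;

public static class TraceParentDemo
{
    // Parses a W3C traceparent header into (trace_id, span_id, sampled flag).
    public static (string TraceId, string SpanId, bool Sampled) Parse(string traceparent)
    {
        if (!ActivityContext.TryParse(traceparent, null, out ActivityContext ctx))
            throw new FormatException($"Not a valid traceparent: {traceparent}");

        bool sampled = (ctx.TraceFlags & ActivityTraceFlags.Recorded) != 0;
        return (ctx.TraceId.ToHexString(), ctx.SpanId.ToHexString(), sampled);
    }

    // Formats a context as version 00 of the header: 00-<trace_id>-<span_id>-<flags>.
    // Only the sampled bit is carried over in this sketch.
    public static string Format(ActivityContext ctx)
    {
        string flags = (ctx.TraceFlags & ActivityTraceFlags.Recorded) != 0 ? "01" : "00";
        return $"00-{ctx.TraceId.ToHexString()}-{ctx.SpanId.ToHexString()}-{flags}";
    }
}
```

The 32 hex characters of the trace id are the 16 bytes mentioned above, and the 16 hex characters of the span id are the 8 bytes.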

3.4. Signals and temporality

OTEL draws a sharp line between three signals:

Traces: a tree of spans, nested or linked. Characterized by trace_id, span_id, parent_span_id, kind (server/client/producer/consumer/internal), status (ok/error), events, and links.
Metrics: time series with an instrument type. Each data point has start_time, time, value, attributes, and exemplars. Temporality is either cumulative (the Prometheus default) or delta (the default for StatsD and vendor agents).
Logs: records with body, severity, attributes, and (importantly) trace_id + span_id when emitted inside an active span.

graph TB
    subgraph Resource["Resource (service.name=orders, env=prod)"]
        subgraph Scope1["Scope: OrdersApi.Controllers"]
            T1["Span: POST /orders<br/>trace_id=abc, span_id=s1"]
            T2["Span: OrdersService.Create<br/>parent=s1, span_id=s2"]
        end
        subgraph Scope2["Scope: EntityFrameworkCore"]
            T3["Span: INSERT orders<br/>parent=s2, span_id=s3"]
        end
        subgraph Scope3["Scope: Microsoft.Extensions.Logging"]
            L1["Log: Order created id=42<br/>trace_id=abc, span_id=s2"]
        end
        subgraph Scope4["Scope: Orders.Metrics"]
            M1["Counter: orders_created_total<br/>exemplar: trace_id=abc"]
            M2["Histogram: order_value_usd"]
        end
    end

Logs, traces, and metrics all attach to the Resource; an exemplar links a metric back to a specific trace

4. OpenTelemetry architecture on .NET 10

On .NET, OTEL is not a separate, parallel API but a set of adapters wrapping the runtime's native APIs. Specifically:

System.Diagnostics.Activity is the native representation of a span. When you call AddSource("X"), the OTEL SDK subscribes to every ActivitySource named "X" and converts each Activity into an OTEL span.
System.Diagnostics.Metrics.Meter is the native instrument API. AddMeter("X") subscribes to it and turns data points into OTLP metrics.
Microsoft.Extensions.Logging.ILogger gets an OpenTelemetryLoggerProvider attached, so every logger.LogInformation(...) emits a LogRecord that carries trace context when it runs inside an Activity.

In other words: if your codebase uses the BCL APIs correctly, you do not have to rewrite instrumentation when you switch vendors. You only change the exporter.
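The point that spans are a runtime feature can be seen without any OpenTelemetry package at all. The sketch below subscribes to an ActivitySource with a plain ActivityListener, which is roughly the role the OTEL SDK plays after AddSource. The source name Demo.Orders and the class NativeActivityDemo are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class NativeActivityDemo
{
    private static readonly ActivitySource Source = new("Demo.Orders");

    // Runs one unit of work and returns the names of the activities the listener saw stop.
    public static List<string> Run()
    {
        var seen = new List<string>();

        using var listener = new ActivityListener
        {
            // Subscribe only to our source, the same idea as AddSource("Demo.Orders").
            ShouldListenTo = src => src.Name == "Demo.Orders",
            // Record everything; without a sampling decision StartActivity returns null.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity => seen.Add(activity.OperationName)
        };
        ActivitySource.AddActivityListener(listener);

        using (Activity? activity = Source.StartActivity("CreateOrder"))
        {
            activity?.SetTag("order.channel", "web");
        }

        return seen;
    }
}
```

Calling NativeActivityDemo.Run() yields a list containing "CreateOrder". Swapping the listener for an exporter is the only part a vendor change would touch.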

4.1. Basic setup in Program.cs

using OpenTelemetry;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

const string serviceName = "orders-api";
const string serviceVersion = "2.4.1";

// One Collector endpoint shared by all three signals (OTLP over gRPC).
var otlpEndpoint = new Uri("http://otel-collector:4317");

builder.Services.AddOpenTelemetry()
    // Resource: configured once here so traces, metrics, and logs all carry it.
    .ConfigureResource(rb => rb
        .AddService(serviceName, serviceVersion: serviceVersion,
                    serviceInstanceId: Environment.MachineName)
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment.name"] = builder.Environment.EnvironmentName,
            ["cloud.region"] = builder.Configuration["Cloud:Region"] ?? "unknown"
        })
        .AddEnvironmentVariableDetector()
        .AddContainerDetector())
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation(o =>
        {
            o.RecordException = true;
            // Keep health probes out of the trace pipeline.
            o.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
        })
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation(o => o.SetDbStatementForText = true)
        .AddGrpcClientInstrumentation()
        .AddSource("Orders.*")
        // Follow the parent's decision; sample 10% of new root traces.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.1)))
        .AddOtlpExporter(o => o.Endpoint = otlpEndpoint))
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("Orders.*")
        .AddOtlpExporter(o => o.Endpoint = otlpEndpoint))
    .WithLogging(l => l
        .AddOtlpExporter(o => o.Endpoint = otlpEndpoint), o =>
        {
            o.IncludeFormattedMessage = true;
            o.IncludeScopes = true;
            o.ParseStateValues = true;
        });

var app = builder.Build();

A few important details that are easy to miss:

Four habits worth keeping

Filter health checks out of traces: the health endpoint is called every 10 s per pod, and without a filter the trace store fills with noise that can take 40–60% of the volume.
SetDbStatementForText: enable it to record the (parameterized) SQL statement in the db.statement attribute. Weigh the PII risk; sqlcommenter is a possible alternative.
ParentBasedSampler: honors the sampling decision of the calling service. If the gateway has chosen to sample, downstream services must respect that, otherwise the trace "breaks" in the middle.
AddRuntimeInstrumentation: enables GC metrics, ThreadPool starvation signals, and loaded-assembly counts. A runtime dashboard then needs only a single query.

4.2. Manual instrumentation when you need it

Auto-instrumentation covers about 80% of cases; the remaining 20% is business logic. A typical pattern:

using System.Diagnostics;
using System.Diagnostics.Metrics;

// Order, CreateOrderRequest, and IOrderRepository are application types assumed to exist.
public sealed class OrdersService
{
    // Names match the "Orders.*" wildcard registered with AddSource / AddMeter.
    private static readonly ActivitySource Source = new("Orders.Core", "2.4.1");
    private static readonly Meter Meter = new("Orders.Core", "2.4.1");
    private static readonly Counter<long> CreatedCounter =
        Meter.CreateCounter<long>("orders.created",
            unit: "{order}", description: "Orders created");
    private static readonly Histogram<double> ValueHistogram =
        Meter.CreateHistogram<double>("orders.value",
            unit: "USD", description: "Order value distribution");

    private readonly IOrderRepository _repo;

    public OrdersService(IOrderRepository repo) => _repo = repo;

    public async Task<Order> CreateAsync(CreateOrderRequest req, CancellationToken ct)
    {
        // StartActivity returns null when nobody is listening, hence the ?. calls below.
        using var activity = Source.StartActivity("OrdersService.Create",
            ActivityKind.Internal);
        activity?.SetTag("order.channel", req.Channel);
        activity?.SetTag("order.items_count", req.Items.Count);

        try
        {
            var order = await _repo.InsertAsync(req, ct);
            // High-cardinality values belong on the span, not on metrics.
            activity?.SetTag("order.id", order.Id);
            activity?.SetStatus(ActivityStatusCode.Ok);

            CreatedCounter.Add(1,
                new KeyValuePair<string, object?>("channel", req.Channel),
                new KeyValuePair<string, object?>("country", req.Country));
            ValueHistogram.Record(order.Total,
                new KeyValuePair<string, object?>("currency", order.Currency));

            return order;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.AddException(ex);
            throw;
        }
    }
}

Cardinality warning

Never put high-cardinality tags on a Counter or Histogram: order.id, user.id, request_id, trace_id. Every unique value creates a new time series, and a Prometheus backend can fall over once cardinality passes a few million. High-cardinality attributes belong on spans (stored in the trace backend), not on metrics.
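The arithmetic behind this warning is a plain product: the worst-case number of series for one instrument is the product of the distinct values of each attribute. A small sketch, with made-up attribute counts (5 channels, 50 countries, 1,000,000 user ids) chosen only for illustration:

```csharp
public static class Cardinality
{
    // Worst-case time series for one instrument:
    // the product of the number of distinct values per attribute.
    public static long SeriesCount(params long[] distinctValuesPerAttribute)
    {
        long total = 1;
        foreach (long n in distinctValuesPerAttribute)
        {
            total = checked(total * n); // throws on overflow instead of wrapping
        }
        return total;
    }
}
```

With channel and country only, Cardinality.SeriesCount(5, 50) is 250 series. Adding a user id dimension, Cardinality.SeriesCount(5, 50, 1_000_000) is 250,000,000, which is why that attribute goes on the span instead.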

5. OTLP: the protocol underneath every exporter

OTLP (OpenTelemetry Protocol) is the reason OTEL is genuinely vendor-neutral. It specifies the payload format (Protobuf) and the transports (gRPC, HTTP/Protobuf, HTTP/JSON). Every SDK exports OTLP, every Collector receives OTLP, and every major backend parses OTLP natively or through an adapter.

Transport	Default port	Advantages	Limitations
gRPC (HTTP/2)	4317	Fast, streaming, persistent connection, low overhead	Does not pass easily through HTTP/1.1 proxies; needs ALPN
HTTP/Protobuf	4318	Works through ordinary load balancers and proxies; easy to inspect in Wireshark	One request per batch; slightly heavier
HTTP/JSON	4318 (path `/v1/traces`)	Browser-friendly (RUM); easy to debug with curl	Volume grows 3–5x compared with Protobuf

Recommendation for .NET backend services: gRPC on 4317 inside the cluster (for speed), and HTTP/Protobuf on 4318 for workloads behind proxies that do not support HTTP/2. For browser RUM (OpenTelemetry JS), gRPC is not available from the browser, so use OTLP over HTTP (typically HTTP/JSON) with CORS enabled on the receiver.

6. OpenTelemetry Collector: the backbone of the pipeline

This is the most important component, and many teams skip it because they think "exporting straight to the vendor is good enough". In production you want a Collector between the app and the backend for six reasons:

Buffering when the backend is flaky: the app does not need to hold a large in-memory queue, because the Collector can queue to disk.
Batching + compression: cuts network cost by 5–10x.
Tail-based sampling: sampling based on the whole trace after it completes (keep error traces, drop normal ones).
Redaction / PII scrubbing: masks emails, phone numbers, and tokens before data leaves your network.
Fan-out: sends simultaneously to Tempo, Datadog, and ELK so you can compare backends or migrate gradually.
Resource enrichment: adds k8s metadata, cloud metadata, and the git SHA from pod labels.
graph LR
    A1[".NET App SDK"] -- OTLP gRPC --> B["Collector Agent<br/>(DaemonSet/Sidecar)"]
    A2["Node.js App SDK"] -- OTLP gRPC --> B
    A3["Java App SDK"] -- OTLP gRPC --> B
    B -- OTLP --> C["Collector Gateway<br/>(StatefulSet, HA)"]
    C -- Prom Remote Write --> D[(Prometheus / Mimir)]
    C -- OTLP --> E[(Tempo)]
    C -- Loki HTTP --> F[(Loki)]
    C -- OTLP --> G[(Datadog / New Relic)]
    H[Grafana] --> D
    H --> E
    H --> F

Agent + gateway pattern: the agent on each node collects, the gateway handles fan-out

6.1. Collector pipeline structure

Each pipeline has three parts: receivers (take signals in), processors (transform them), and exporters (send them out). Here is an example gateway configuration that feeds both the Grafana stack and Datadog:

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20

  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 10000

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: random-1pct
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

  transform/scrub:
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+", "REDACTED_EMAIL")

  resource:
    attributes:
      - key: k8s.cluster.name
        value: prod-asia-southeast
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  datadog:
    api: { key: ${env:DD_API_KEY}, site: datadoghq.com }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, resource, batch]
      exporters: [otlp/tempo, datadog]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite, datadog]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, transform/scrub, resource, batch]
      exporters: [loki, datadog]

7. Sampling: the most important technique for controlling cost

Production observability without a sampling strategy burns budget for no reason. Take an API at 10k RPS where every request is fully traced, with 15 spans per trace on average and about 1 KB of attributes per span: that is 150 MB/second, or roughly 13 TB/day. Storage cost at a tier-1 vendor (Datadog, New Relic) can reach four to six figures in dollars per month for a single service. OTEL supports three sampling modes, each with its own place.
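The volume figure is simple multiplication, and it is worth redoing with your own numbers. A minimal sketch, assuming decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes); the class and method names are illustrative:

```csharp
public static class TraceVolume
{
    // Raw span payload per second: requests/s x spans per trace x bytes per span.
    public static double BytesPerSecond(double requestsPerSecond, double spansPerTrace, double bytesPerSpan)
        => requestsPerSecond * spansPerTrace * bytesPerSpan;

    // Converts a sustained byte rate into terabytes per day (86,400 seconds).
    public static double TerabytesPerDay(double bytesPerSecond)
        => bytesPerSecond * 86_400 / 1e12;
}
```

For the example in the text, TraceVolume.BytesPerSecond(10_000, 15, 1_000) gives 150,000,000 bytes per second (150 MB/s), and TraceVolume.TerabytesPerDay of that rate is 12.96, which rounds to the 13 TB/day quoted above.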

7.1. Head-based sampling (parent-based)

The sample/drop decision is made at the first edge, based on a hash of the trace_id. Downstream services respect the sampled flag in traceparent. Pros: cheap, with no buffering needed. Cons: it cannot know whether the request will fail or be slow, so it cannot prioritize those traces.
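The key property of a trace-id-based decision is that it is deterministic: any service that sees the same trace_id and applies the same ratio reaches the same verdict. The sketch below illustrates that idea only; it is not the exact algorithm used by TraceIdRatioBasedSampler, and the class RatioSampler is hypothetical.

```csharp
using System;
using System.Buffers.Binary;
using System.Diagnostics;

public static class RatioSampler
{
    // Deterministic head-based decision derived from the trace id alone.
    public static bool ShouldSample(ActivityTraceId traceId, double ratio)
    {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;

        Span<byte> bytes = stackalloc byte[16];
        traceId.CopyTo(bytes);

        // Treat the last 8 bytes as an unsigned integer and map it onto [0, 1].
        ulong low = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8));
        double position = low / (double)ulong.MaxValue;

        return position < ratio;
    }
}
```

Because trace ids are random, roughly the requested fraction of traces falls below the threshold, and no coordination between services is required.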

7.2. Tail-based sampling (in the Collector)

The Collector buffers all spans of a trace (for up to decision_wait) and then decides based on policies. The configuration above keeps 100% of traces with errors, 100% of traces with latency > 500 ms, and 1% of normal traces. This is a reasonable balance for production: high signal when investigating, low volume when things are quiet.
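The decision logic of those three policies can be sketched in a few lines. This is an illustration of the "keep if any policy matches" idea, not the Collector's implementation: SpanSummary and TailPolicy are hypothetical types, the latency check uses the longest span as a stand-in for total trace duration, and the random draw is passed in so the function stays deterministic.

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal view of a finished span: did it fail, and how long did it take.
public sealed record SpanSummary(bool IsError, double DurationMs);

public static class TailPolicy
{
    // Keeps a completed trace if it has an error, is slow, or wins the baseline lottery.
    // randomDraw is expected to be uniform in [0, 1).
    public static bool Keep(IReadOnlyList<SpanSummary> trace, double randomDraw,
        double latencyThresholdMs = 500, double baselinePercent = 1)
    {
        if (trace.Any(s => s.IsError)) return true;                       // policy: errors
        if (trace.Count > 0 &&
            trace.Max(s => s.DurationMs) > latencyThresholdMs) return true; // policy: slow
        return randomDraw * 100 < baselinePercent;                         // policy: random-1pct
    }
}
```

The decision can only be made once all spans have arrived, which is why the Collector has to hold traces in memory for decision_wait.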

7.3. Probabilistic + rate limiting

When you need a guaranteed upper bound on volume (for example, a vendor contract priced per spans/second), combine a rate limiter in the Collector with probabilistic sampling. The policy reads: "at least X% of normal traces, but never more than Y spans/second".

A combined sampling strategy for a microservices system

Gateway / edge: head-based at 100% (so every request has a trace_id, which makes log correlation easy). Collector gateway: tail-based with the policies errors + slow + 1–5% random. Batch jobs / cron: head-based at 100% (low volume, and debugging matters). Health check / metrics endpoints: drop 100% at the instrumentation filter.

8. Correlating logs ↔ traces ↔ metrics

The three pillars only matter when people can jump between them. OTEL does this through three bridges:

8.1. Trace context in logs

When ILogger.LogInformation is called while an Activity is active, the OpenTelemetry logger automatically attaches TraceId, SpanId, and TraceFlags to the LogRecord. Once the logs land in Loki, you can query {service="orders-api"} |= "trace_id=abc123" and see every log line in that trace.

8.2. Exemplars in metrics

An exemplar is a sample point attached to a histogram data point, carrying trace_id, span_id, value, and timestamp. For example, the histogram http.server.request.duration has a data point in the 500 ms–1 s bucket with the exemplar trace_id=abc, which means "I saw a request this slow, and here is the exact trace". Grafana renders exemplars as dots on the chart, and clicking one opens the trace in Tempo. With .NET 10, exemplars are enabled by default with no extra configuration; on SDK versions where they are off, SetExemplarFilter(ExemplarFilterType.TraceBased) on the meter provider builder turns them on.

8.3. Span events and span links

A span event is a point-in-time log inside a span (timestamp + attributes). A span link connects two separate traces. With Kafka, for example, the producer's send belongs to one trace and the consumer's processing to another; a link keeps both traces intact while recording that they are related.
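In .NET a span link is an ActivityLink passed when the activity starts. A minimal sketch, assuming the producer's context has already been extracted from the message headers; the source name Demo.Messaging, the span name, and the LinkDemo class are illustrative:

```csharp
using System.Diagnostics;

public static class LinkDemo
{
    private static readonly ActivitySource Source = new("Demo.Messaging");

    // Starts a consumer activity that is linked to, rather than parented by,
    // the producer's context. Returns null when no listener samples the source.
    public static Activity? StartConsumer(ActivityContext producerContext)
    {
        var links = new[] { new ActivityLink(producerContext) };

        // parentContext: default means "no explicit parent": with no ambient
        // Activity.Current this starts a new trace.
        return Source.StartActivity(
            "orders.created process",
            ActivityKind.Consumer,
            parentContext: default,
            tags: null,
            links: links);
    }
}
```

The consumer span gets its own trace_id, while its Links collection still points at the producer's trace_id and span_id, so a trace UI can offer a jump between the two.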

sequenceDiagram
    participant U as User
    participant G as Grafana
    participant P as Prometheus/Mimir
    participant T as Tempo
    participant L as Loki
    U->>G: "API slow at 10:32"
    G->>P: PromQL histogram_quantile p95
    P-->>G: Chart + exemplar trace_id=abc
    U->>G: Click exemplar
    G->>T: GET /api/traces/abc
    T-->>G: Full span tree
    U->>G: Click "Logs for span s2"
    G->>L: {trace_id="abc", span_id="s2"}
    L-->>G: Log lines correlated

A typical debugging journey: metric → exemplar → trace → log

9. Auto-instrumentation: when to use it and when not to

OTEL has three tiers of automation:

Library instrumentation (NuGet packages): OpenTelemetry.Instrumentation.AspNetCore, OpenTelemetry.Instrumentation.Http, OpenTelemetry.Instrumentation.EntityFrameworkCore... You add one line of code and the library emits spans automatically. This is the standard path for .NET in production.
Zero-code auto-instrumentation (OpenTelemetry.AutoInstrumentation): attaches to the process through the CLR profiler API, with no code changes. Suited to legacy apps or binaries you cannot modify.
eBPF-based (Grafana Beyla, Pixie): instruments at the kernel level with no in-process overhead. Trade-off: less detailed attributes, and a dependency on the kernel version.

Recommendations for .NET 10

Default to library instrumentation through NuGet (tier 1): safe, rich in attributes, and compatible with Aspire. Move to tier 2 (zero-code) for legacy .NET Framework apps or when you cannot touch the code. Use tier 3 (eBPF) only when you need overhead below 0.5% CPU and the team has kernel-tuning skills.

10. Semantic Conventions: write once, dashboards everywhere

Semantic Conventions (SemConv) are the main reason the prebuilt Grafana/Datadog dashboards "just work" after you turn on OTEL. Instead of every team naming an attribute httpStatus, http_status_code, or statusCode, SemConv settles on http.response.status_code (int). Backends build their dashboards on these standard names.

Domain	Standard attribute keys	Example
HTTP server	`http.request.method`, `http.route`, `http.response.status_code`, `url.scheme`, `url.path`, `server.address`	POST /orders/{id}, 201
Database	`db.system.name`, `db.namespace`, `db.operation.name`, `db.query.text`	mssql, AnhTu, SELECT, "SELECT * FROM Post WHERE Id=@id"
Messaging	`messaging.system`, `messaging.destination.name`, `messaging.operation.type`	kafka, orders.created, publish
gRPC	`rpc.system`, `rpc.service`, `rpc.method`, `rpc.grpc.status_code`	grpc, orders.v1.OrdersApi, CreateOrder, 0
FaaS	`faas.name`, `faas.version`, `faas.trigger`, `faas.invoked_provider`	process-order, 2.4.1, http, aws
GenAI	`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`	openai, gpt-5, 1523

If your code emits spans with non-standard keys (orderItemsCount), you can still search them, but you lose the default dashboards, the default alerts, and cross-service analysis. The golden rule: when an attribute has a SemConv equivalent, always use the SemConv name; give business-specific attributes a namespace prefix (orders.items_count).

11. Backend: LGTM stack or a vendor, and how to choose

OTEL frees you from lock-in, so choosing a backend becomes a question of economics and operational load. The common options:

Option	Components	Cost model	Good fit when
LGTM self-hosted	Loki + Grafana + Tempo + Mimir (Grafana Labs)	Infrastructure cost + operations staff	High volume (> 10 TB/month), the team has ops capacity, and you want control over the data
Grafana Cloud	Managed LGTM	Per GB of metrics + traces + logs ingested	Small team that does not want to operate it, medium volume
Datadog / New Relic / Dynatrace	Full APM + RUM + synthetics + profiler	Host-based + custom metrics + ingestion	Enterprise, needs end-to-end APM + AI-assisted RCA, comfortable budget
Azure Monitor / Google Cloud Observability	Application Insights + Cloud Trace + Cloud Logging	Per GB + feature tier	All-in on one cloud, wants deep integration with cloud services
Honeycomb / Lightstep	Event-based trace analytics	Per events/month	Needs high-cardinality slice-and-dice and very fast production debugging

Practical advice

Startup/SME with fewer than 50 services: the Grafana Cloud Pro tier is enough; switch when it outgrows the budget. 50–500 services: self-hosted LGTM on Kubernetes with 1–2 platform engineers. Regulated enterprise (banking, insurance, health): Datadog or Dynatrace for SLA + compliance, plus a self-hosted backup for long retention. In every case, export from the same Collector so you can switch at any time.

12. Integration with .NET Aspire 9.5 and OpenTelemetry by default

.NET Aspire, the orchestration stack for cloud-native .NET, has used OpenTelemetry by default since 8.0. In Aspire 9.5 (early 2026), every newly generated project already includes ServiceDefaults with the OTEL configuration, so there is no code to copy and paste. You only need to call:

var builder = WebApplication.CreateBuilder(args);
builder.AddServiceDefaults(); // Aspire: OTEL + health check + service discovery + resilience

Inside AddServiceDefaults, Aspire configures:

OTEL SDK with full auto-instrumentation (AspNetCore, HttpClient, EF Core, gRPC, Runtime)
OTLP exporter reading the environment variables OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS (standard in the OTEL spec)
Resource detection for container, k8s, and host
Service discovery through ServiceDiscovery instead of plain DNS
Resilience (Polly v8) for HttpClient by default

The local Aspire dashboard ships a mini Jaeger/Prometheus that renders the OTLP data received from the child projects. A developer runs aspire run and sees end-to-end traces across the Vue frontend, the .NET BFF, the Order service, and the Payment service without installing anything. This is a big change in developer experience.

13. The fourth signal: Continuous Profiling

Profiling is the newest signal (it entered the spec in 2024). It answers the question: during the slow window at 10:32, which function actually burned the CPU time? Instead of attaching a profiler by hand after an incident, OTEL profiling runs continuously at low overhead (< 1% CPU) and stores data in pprof format.

Grafana Pyroscope: open-source backend, pprof-compatible, integrated with Grafana Explore.
.NET: dotnet-monitor or Parca Agent (eBPF) can produce the OTEL profile signal.
Flame graphs connect directly to spans through span links, so you can jump from a slow span into a flame graph and see which function is using the CPU.

14. Production patterns: what to do and what to avoid

14.1. Deployment topology

For Kubernetes, the recommended pattern is a two-tier Agent + Gateway. The agent runs as a DaemonSet on each node (or as a sidecar per pod) and handles receiving, k8s metadata enrichment, and batching. The gateway runs as an HA StatefulSet and handles tail sampling and exporter fan-out. The reason for the split: tail sampling needs to see the whole trace, so spans must be aggregated at a central tier; metadata enrichment needs read access to the k8s API, which only the agent requires.

14.2. Failure modes and retry

The Collector can crash and the backend can be slow. The .NET SDK uses BatchExportProcessor with a default maximum queue of 2048 items. When the queue is full, new spans are dropped silently. In production:

Set OTEL_BSP_MAX_QUEUE_SIZE=8192 for traces and OTEL_METRIC_EXPORT_INTERVAL=30000 for metrics on hot services.
Enable the file_storage extension in the Collector for an on-disk queue, so that losing the backend connection for 10 minutes does not lose data.
Alert on "collector queue > 80%": it is an early warning that the backend is slowing down.

14.3. Multi-tenant / multi-environment

If one Collector serves several environments (dev/staging/prod) or several teams, use the routing processor to split pipelines by deployment.environment.name or service.namespace. Do not use a single backend for every environment: dev volume will eat into prod retention.

14.4. Security and PII

OTLP over mTLS is mandatory when crossing a cluster boundary.
Use the transform processor to scrub emails, phone numbers, tokens, and JWTs from log bodies and attributes before data leaves your network.
For GenAI spans (gen_ai.*), consider redacting gen_ai.prompt and gen_ai.completion if they contain customer data.

Common anti-patterns

JSON-in-JSON logs: logger.LogInformation("Order: {@order}", order) with a 10 KB order. The log backend indexes every field, cardinality explodes, and cost goes up 20x.
Counters keyed by user_id: the time series explode and Prometheus runs out of memory.
Dynamic span names: activity?.DisplayName = $"Process {order.Id}". Every span becomes unique and the trace backend cannot group them into a chart.
No Resource: sending OTLP without setting service.name. The backend shows "unknown_service" for everything.
Sampling independently in each service: without ParentBased, traces break apart midway.
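The fix for the dynamic span name anti-pattern is to keep the operation name constant and move the varying value into an attribute. A minimal sketch; SpanNaming and the name ProcessOrder are illustrative:

```csharp
using System.Diagnostics;

public static class SpanNaming
{
    // Constant, low-cardinality span name; the per-order value becomes an attribute.
    // Returns null when no listener samples the source.
    public static Activity? StartProcessOrder(ActivitySource source, long orderId)
    {
        Activity? activity = source.StartActivity("ProcessOrder");
        activity?.SetTag("order.id", orderId);
        return activity;
    }
}
```

The trace backend can now aggregate every ProcessOrder span into one latency chart, and a single order is still findable by filtering on order.id.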

15. Cost and overhead benchmark

An internal benchmark on .NET 10 with a CRUD API backed by Redis + PostgreSQL, at 5k RPS and a baseline p99 of 45 ms:

Configuration	CPU overhead	Additional memory	p99 latency	Notes
No OTEL	n/a	n/a	45 ms	Baseline
Trace only, 10% sampling, BatchExporter	+1.2%	+18 MB	46 ms	Fine for production
Trace 100% + Metrics + Logs	+3.8%	+42 MB	48 ms	Overhead still acceptable
Plus profiling (Parca Agent eBPF)	+0.3% (measured on the node)	+65 MB agent	48 ms	eBPF does not touch app memory
Log every request (Info) + trace 100%	+7.1%	+120 MB	53 ms	Lower the log level or sample logs

The lesson: turn everything on but sample correctly, and the cost is visible but acceptable. Logging every request at Debug/Information is the largest source of overhead; the LoggerMessage source generator and the right log level reduce it significantly.
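For reference, this is roughly what the LoggerMessage source generator looks like in use. The class name OrderLog, the event ids, and the message templates are made up for illustration; the attribute and the partial-method pattern come from Microsoft.Extensions.Logging.

```csharp
using Microsoft.Extensions.Logging;

public static partial class OrderLog
{
    // The generator emits the method body at compile time: it checks IsEnabled first
    // and avoids boxing and template parsing on every call.
    [LoggerMessage(EventId = 1001, Level = LogLevel.Information,
        Message = "Order {OrderId} created via {Channel}")]
    public static partial void OrderCreated(ILogger logger, long orderId, string channel);

    // Debug-level detail costs almost nothing when the level is disabled.
    [LoggerMessage(EventId = 1002, Level = LogLevel.Debug,
        Message = "Order {OrderId} validation took {ElapsedMs} ms")]
    public static partial void OrderValidated(ILogger logger, long orderId, double elapsedMs);
}
```

A call site then reads OrderLog.OrderCreated(logger, order.Id, "web"), and the placeholders become structured attributes on the exported LogRecord.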

16. Migration roadmap from Application Insights / Serilog + ELK

Most legacy .NET codebases use one of two stacks: Application Insights (Azure) or Serilog + ELK. Migrating to OTEL is painless if you follow four steps:

Step 1: Run side by side (weeks 1–2)

Add the OpenTelemetry SDK next to the old logger. Both emit. The OTLP exporter sends to the Collector, and the Collector exports to Application Insights through the azuremonitor exporter. The old dashboards do not change, and you have OTLP ready.

Step 2: Standardize attributes (weeks 3–4)

Review the code and move custom attributes to SemConv. Replace logger.LogInformation("UserId {userId}", id) with Activity.Current?.SetTag("user.id", id) on the span plus a structured log attribute enduser.id (for example through logger.BeginScope with a key/value dictionary, which is exported when IncludeScopes is on). The new dashboards start out standards-compliant.

Step 3: Turn on the new backend (weeks 5–6)

Deploy Tempo + Mimir + Loki, or subscribe to Grafana Cloud/Datadog. The Collector fans out to both the old Application Insights and the new backend. The team gradually gets familiar with the new query languages, and alerts are rewritten in PromQL/TraceQL/LogQL.

Step 4: Cut over from the old stack (weeks 7–8)

Remove the Serilog Elasticsearch sink and the Application Insights SDK. Keep only OpenTelemetry. The Collector becomes the single control point for telemetry, which keeps configuration in one place.

17. Conclusion: OpenTelemetry is a foundational skill for backend engineers in 2026

Ten years ago, picking the wrong observability stack meant rewriting hundreds of thousands of lines of instrumentation when you changed vendors. With OTEL in 2026, instrumentation is shared, the backend is a choice, and the boundaries between log, metric, trace, and profile are blurred enough that an on-call engineer can move between them in a few clicks. For .NET 10 in particular, the fact that OTEL builds on the Activity and Meter types already in the BCL means adoption costs almost nothing: familiar APIs, Aspire enabled out of the box, and auto-instrumentation covering most use cases.

What I want to leave you with is this: observability is not dashboards. It is the discipline of asking the right questions and building the data to answer them. OTEL is the best tool for that discipline right now, but a tool only pays off when the team has the habit of writing spans, defining meaningful metrics, and logging at the right level. Start with one service, one simple Collector pipeline, and a single dashboard that answers the three opening questions. Once that feels natural, scaling out to the whole system is an engineering matter and no longer a matter of mindset.

18. References

#OpenTelemetry #OTEL #Observability #Distributed Tracing #Tracing #Metrics #Logs #Profiling #OTLP #W3C Trace Context #Span #Exemplar #Resource #Semantic Conventions #ActivitySource #System.Diagnostics.Activity #Meter #ILogger #Auto Instrumentation #eBPF #Grafana Beyla #Parca #Pyroscope #OpenTelemetry Collector #Tail Sampling #Head Sampling #Batch Processor #Memory Limiter #.NET 10 #.NET Core #ASP.NET Core 10 #.NET Aspire #.NET Aspire 9.5 #Entity Framework Core #HttpClient #gRPC #Grafana #Tempo #Loki #Mimir #Prometheus #PromQL #TraceQL #LogQL #Datadog #New Relic #Dynatrace #Honeycomb #Application Insights #Azure Monitor #Serilog #Jaeger #Zipkin #Agent Gateway Pattern #Microservices #system design #Cloud Native #SRE #DevOps #OpenTracing #OpenCensus #CNCF

# OpenTelemetry 2026 for .NET 10 - Microservices Observability Architecture with OTLP, Collector, Tail Sampling, Tempo, Loki, Prometheus and Grafana

## 1. OpenTelemetry 2026: the open observability standard for .NET 10

Every time a request crosses five to seven services in a microservice system, the operations engineer has to answer three very specific questions: *where is it slow*, *where does it fail*, and *why*. Over the past ten years those answers came from three separate families of tools: log aggregators (ELK, Splunk, Graylog), time-series metrics (Prometheus, Graphite, InfluxDB), and tracing systems (Jaeger, Zipkin). Each family has its own language and its own instrumentation, and none understands the others. A span in Jaeger does not know which logs belong to it, and a Prometheus alert has no path back to the trace that triggered it.

**OpenTelemetry** (OTEL) is the community's answer to that mess: a vendor-neutral set of standards, a unified API, reference SDKs for more than 15 languages, a common transport protocol (OTLP), and most importantly a data model in which logs, metrics, and traces are linked through `trace_id`, `span_id`, `resource`, and `scope`. By early 2026 OTEL is the *de facto* observability standard: the major vendors (Datadog, New Relic, Dynatrace, Honeycomb, Azure Monitor, Google Cloud Trace) all accept OTLP natively, and the major backend frameworks emit OTEL out of the box.

On the .NET side the story is especially clean, because OpenTelemetry is not a bolted-on SDK. It *is* the natural extension of `System.Diagnostics.Activity` (tracing), `System.Diagnostics.Metrics.Meter` (metrics), and `Microsoft.Extensions.Logging` (logs), all of which already live in the BCL. With **.NET 10 LTS**, Microsoft has brought auto-instrumentation for most workloads (ASP.NET Core, HttpClient, EF Core, gRPC, Kafka, Redis client, Azure SDK) to GA, added exemplars for metrics, and built in an OTLP exporter over HTTP/Protobuf. This article dissects OTEL 2026 from the data model up to a production deployment architecture, focusing on how a .NET 10 microservices system uses it to answer the three opening questions.

- **1.37**: the stable OpenTelemetry Specification release as of early 2026; API, SDK, and Protocol are frozen.
- **15+**: languages with a stable SDK: .NET, Java, Go, Python, Node.js, Rust, PHP, Ruby, Swift...
- **~3%**: average CPU overhead with full instrumentation and the OTLP exporter behind a batch processor.
- **3 pillars**: logs, metrics, and traces unified through Resource and TraceContext.

## 2. A chronicle of OpenTelemetry

2010: Google Dapper

Google publishes the paper *Dapper, a Large-Scale Distributed Systems Tracing Infrastructure*. It inspired every modern tracing system: the concepts of `trace`, `span`, `sampling rate`, and context propagation through RPC headers.

2012–2016: Zipkin and Jaeger appear

2016: Prometheus stable

Pull-based metrics with a `/metrics` endpoint, multi-dimensional labels/tags, and PromQL. It became the de facto metrics standard, but its data model is entirely different from tracing.

2017: OpenTracing (CNCF)

2018: OpenCensus (Google)

Google releases OpenCensus: API + SDK + exporters for traces and metrics. It competed head-on with OpenTracing. The community split, and every vendor had to support both.

2019-05: OpenTelemetry merger

OpenTracing and OpenCensus announce at KubeCon that they are merging into OpenTelemetry, a CNCF Sandbox project. The goal: one API, one SDK, one protocol, no more split.

2021-02: OTLP 1.0 + Traces SDK GA

The OTLP (OpenTelemetry Protocol) wire format becomes stable, supporting gRPC and HTTP/Protobuf. Trace SDKs for the major languages reach GA.

2022: Metrics SDK GA

OTEL's own metrics model: Counter, UpDownCounter, Histogram, Gauge (async and sync). It differs from Prometheus in having the notion of *delta* versus *cumulative* temporality.

2023: Logs SDK GA + Exemplars

2024: Profiling Signal

2025-Q3: Semantic Conventions 1.30

The standard attribute sets for each domain (HTTP, DB, messaging, FaaS, gen_ai) stabilize. This marks OTEL maturing to the point where "copy the attribute and it matches every backend".

2026-Q1: OTEL 1.37 + .NET 10 LTS

## 3. The OpenTelemetry data model: the most important part

### 3.1. Resource: the identity of the process emitting signals

A `Resource` is the set of attributes describing *who* is emitting a signal: service name, service version, deployment environment, host name, container id, pod name, region, cloud provider. A Resource is attached to the *process*, not to a span. A different resource means a different process. This is the key to joining the logs, metrics, and traces of one service: the backend filters on `service.name` and sees every signal from that service.

#### Semantic Conventions for Resource

Must set: `service.name` (required), `service.version`, `service.instance.id`, `deployment.environment.name` (prod/staging/dev), `host.name`, `os.type`. Should set: `container.id`, `k8s.pod.name`, `k8s.namespace.name`, `cloud.region`. Grafana, Datadog, and New Relic backends all use exactly these attributes to auto-correlate.

### 3.2. Scope: the unit of instrumentation

`InstrumentationScope` identifies which library emitted a signal. For example, HttpClient spans come from the scope `System.Net.Http`, and EF Core spans come from the scope `Microsoft.EntityFrameworkCore`. When you are debugging "where did that span come from", you filter by scope.

### 3.3. Trace Context: the thread that runs through everything

Each request receives a `trace_id` (16 bytes), generated at the edge of the first service. Each unit of work inside it has a `span_id` (8 bytes). The context is propagated through the HTTP headers `traceparent`/`tracestate` (W3C Trace Context), through gRPC metadata, through Kafka headers, and through SQL comments (sqlcommenter). That is how a single trace spans the whole chain of cross-service calls.
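
To make the header format concrete, here is a minimal sketch that parses a `traceparent` value with the BCL type `ActivityContext`. The header value is a made-up example in the W3C layout (`version-traceid-spanid-flags`); nothing here comes from the article's services.

```csharp
using System;
using System.Diagnostics;

// Hypothetical header value: version "00", 32-hex trace id, 16-hex span id, flags "01".
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

bool parsed = ActivityContext.TryParse(traceparent, traceState: null, out ActivityContext ctx);

if (parsed)
{
    // Two hex characters encode one byte, so 32 hex chars = 16 bytes and 16 hex chars = 8 bytes.
    int traceIdBytes = ctx.TraceId.ToHexString().Length / 2;
    int spanIdBytes = ctx.SpanId.ToHexString().Length / 2;
    bool sampled = ctx.TraceFlags.HasFlag(ActivityTraceFlags.Recorded);

    Console.WriteLine($"trace_id={ctx.TraceId} ({traceIdBytes} bytes)");
    Console.WriteLine($"span_id={ctx.SpanId} ({spanIdBytes} bytes)");
    Console.WriteLine($"sampled={sampled}");
}
```

The last field carries the sampling flag, which is what a downstream service reads when it honors the parent's sampling decision.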

### 3.4. Signals and Temporality

OTEL draws a sharp line between three signals:

- **Traces**: a tree of spans, nested or linked. Characterized by `trace_id`, `span_id`, `parent_span_id`, `kind` (server/client/producer/consumer/internal), `status` (ok/error), `events`, and `links`.
- **Metrics**: time series with an instrument type. Each data point has `start_time`, `time`, `value`, `attributes`, and `exemplars`. Temporality is either *cumulative* (the Prometheus default) or *delta* (the default for StatsD and vendor agents).
- **Logs**: records with `body`, `severity`, `attributes`, and (importantly) `trace_id` + `span_id` when emitted inside an active span.
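
The difference between the two temporalities is just arithmetic. The sketch below, using made-up counter readings, converts a cumulative series into the equivalent delta series: each delta point is the change since the previous report.

```csharp
using System;

// Hypothetical cumulative counter readings, one per export interval.
long[] cumulative = { 5, 12, 12, 20 };

long[] delta = new long[cumulative.Length];
long previous = 0;
for (int i = 0; i < cumulative.Length; i++)
{
    delta[i] = cumulative[i] - previous; // change since the previous report
    previous = cumulative[i];
}

Console.WriteLine(string.Join(", ", delta)); // prints "5, 7, 0, 8"
```

Summing the delta points gives back the last cumulative value (5 + 7 + 0 + 8 = 20), which is why a Collector can convert between the two as long as it sees every point.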

```
graph TB
    subgraph Resource["Resource (service.name=orders, env=prod)"]
        subgraph Scope1["Scope: OrdersApi.Controllers"]
            T1["Span: POST /orders<br/>trace_id=abc, span_id=s1"]
            T2["Span: OrdersService.Create<br/>parent=s1, span_id=s2"]
        end
        subgraph Scope2["Scope: EntityFrameworkCore"]
            T3["Span: INSERT orders<br/>parent=s2, span_id=s3"]
        end
        subgraph Scope3["Scope: Microsoft.Extensions.Logging"]
            L1["Log: Order created id=42<br/>trace_id=abc, span_id=s2"]
        end
        subgraph Scope4["Scope: Orders.Metrics"]
            M1["Counter: orders_created_total<br/>exemplar: trace_id=abc"]
            M2["Histogram: order_value_usd"]
        end
    end
```

Logs, traces, and metrics all attach to the same Resource; an exemplar links a metric back to a specific trace

## 4. OpenTelemetry architecture on .NET 10

On .NET, OTEL is not a separate namespace but a set of adapters wrapping the runtime's native APIs. Specifically:

- `System.Diagnostics.Activity` is the native representation of a span. When you call `AddSource("X")`, the OTEL SDK subscribes to every `ActivitySource` named "X" and converts each Activity into an OTEL Span.
- `System.Diagnostics.Metrics.Meter` is the native instrument. `AddMeter("X")` subscribes so that data points are converted into OTLP Metrics.
- `Microsoft.Extensions.Logging.ILogger` gets an `OpenTelemetryLoggerProvider` attached, so every `logger.LogInformation(...)` emits a LogRecord carrying the trace context when it runs inside an Activity.

In other words, if your codebase uses the BCL APIs correctly, you *do not need* to rewrite instrumentation when you switch vendors. You only change the exporter.
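
The subscription model described above can be illustrated with BCL types alone. This is a rough sketch of the mechanism, not the SDK's implementation: an `ActivityListener` plays the role the OTEL SDK plays after `AddSource(...)`. The source name `Demo.Orders` and the span name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

var stopped = new List<string>();
var source = new ActivitySource("Demo.Orders"); // hypothetical source name

using var listener = new ActivityListener
{
    // Subscribe only to the source we care about, similar in spirit to AddSource("Demo.Orders").
    ShouldListenTo = s => s.Name == "Demo.Orders",
    // Record everything; a real SDK would consult its sampler here.
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllDataAndRecorded,
    // Called synchronously when an Activity ends; an exporter would convert it to a span here.
    ActivityStopped = a =>
    {
        stopped.Add(a.DisplayName);
        Console.WriteLine($"{a.DisplayName} trace={a.TraceId} span={a.SpanId}");
    },
};
ActivitySource.AddActivityListener(listener);

using (var activity = source.StartActivity("PlaceOrder"))
{
    activity?.SetTag("order.channel", "web");
}
```

Without any listener, `StartActivity` returns `null`, which is why the article's code always uses `activity?.`: instrumentation stays nearly free when nobody is subscribed.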

### 4.1. Basic setup in Program.cs

```
using OpenTelemetry;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

const string serviceName = "orders-api";
const string serviceVersion = "2.4.1";

builder.Services.AddOpenTelemetry()
    // Resource: describe the process once; traces, metrics, and logs all inherit it.
    .ConfigureResource(rb => rb
        .AddService(serviceName, serviceVersion: serviceVersion,
                    serviceInstanceId: Environment.MachineName)
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment.name"] = builder.Environment.EnvironmentName,
            ["cloud.region"] = builder.Configuration["Cloud:Region"] ?? "unknown"
        })
        .AddEnvironmentVariableDetector()
        .AddContainerDetector())
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation(o =>
        {
            o.RecordException = true;
            // Keep health probes out of the trace backend.
            o.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
        })
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation(o => o.SetDbStatementForText = true)
        .AddGrpcClientInstrumentation()
        .AddSource("Orders.*")
        // Honor the parent's decision; sample 10% of new root traces.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.1)))
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")))
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("Orders.*")
        // No endpoint set here: falls back to OTEL_EXPORTER_OTLP_ENDPOINT or the exporter default.
        .AddOtlpExporter())
    .WithLogging(
        l => l.AddOtlpExporter(),
        o =>
        {
            o.IncludeFormattedMessage = true;
            o.IncludeScopes = true;
            o.ParseStateValues = true;
        });

var app = builder.Build();
```

A few important details that often go unnoticed:

#### Four habits worth having

1. **Filter health checks out of traces**: a health endpoint is called every 10s per pod; without a filter, this noise can take up 40–60% of the trace backend's volume.
2. **SetDbStatementForText**: turn it on to store the (parameterized) SQL text in the `db.statement` attribute (named `db.query.text` in newer Semantic Conventions). Weigh the PII risk; sqlcommenter is an alternative.
3. **ParentBasedSampler**: honors the sampling decision of the calling service. If the gateway chose to sample, downstream services must respect it; otherwise the trace "breaks" in the middle.
4. **AddRuntimeInstrumentation**: enables GC metrics, ThreadPool starvation, and loaded-assembly counts. A runtime dashboard then needs only a single query.

### 4.2. Manual instrumentation when needed

Auto-instrumentation covers roughly 80% of cases; the remaining 20% is business logic. A typical pattern:

```
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class OrdersService
{
    // Named Source (not Activity) so it does not shadow System.Diagnostics.Activity.
    private static readonly ActivitySource Source = new("Orders.Core", "2.4.1");
    private static readonly Meter Meter = new("Orders.Core", "2.4.1");
    private static readonly Counter<long> CreatedCounter =
        Meter.CreateCounter<long>("orders.created",
            unit: "{order}", description: "Orders created");
    private static readonly Histogram<double> ValueHistogram =
        Meter.CreateHistogram<double>("orders.value",
            unit: "USD", description: "Order value distribution");

    // IOrdersRepository is a placeholder for the repository behind _repo.
    private readonly IOrdersRepository _repo;

    public OrdersService(IOrdersRepository repo) => _repo = repo;

    public async Task<Order> CreateAsync(CreateOrderRequest req, CancellationToken ct)
    {
        // Returns null when nothing listens to "Orders.Core", hence the ?. calls below.
        using var activity = Source.StartActivity("OrdersService.Create",
            ActivityKind.Internal);
        activity?.SetTag("order.channel", req.Channel);
        activity?.SetTag("order.items_count", req.Items.Count);

        try
        {
            var order = await _repo.InsertAsync(req, ct);
            // High-cardinality values such as order.id belong on the span, not on metrics.
            activity?.SetTag("order.id", order.Id);
            activity?.SetStatus(ActivityStatusCode.Ok);

            CreatedCounter.Add(1,
                new KeyValuePair<string, object?>("channel", req.Channel),
                new KeyValuePair<string, object?>("country", req.Country));
            // Cast in case Total is a decimal; Histogram<double> needs a double.
            ValueHistogram.Record((double)order.Total,
                new KeyValuePair<string, object?>("currency", order.Currency));

            return order;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.AddException(ex);
            throw;
        }
    }
}
```

#### Cardinality warning

Never put high-cardinality tags on a Counter or Histogram: `order.id`, `user.id`, `request_id`, `trace_id`. Every unique value creates a new time series, and a Prometheus backend can fall over once cardinality passes a few million. High-cardinality attributes belong on spans (stored in the trace backend), not on metrics.
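
The blow-up is multiplicative, which a few lines of arithmetic make clear. The numbers below are assumptions chosen for illustration (4 channels, 50 countries, one million users), not measurements from the article.

```csharp
using System;
using System.Linq;

// Worst case: one time series per combination of tag values.
long SeriesCount(params int[] distinctValuesPerTag) =>
    distinctValuesPerTag.Aggregate(1L, (acc, n) => acc * n);

long safe = SeriesCount(4, 50);                  // channel x country
long exploded = SeriesCount(4, 50, 1_000_000);   // channel x country x user.id

Console.WriteLine($"channel x country:           {safe} series");
Console.WriteLine($"channel x country x user.id: {exploded} series");
```

With these assumed values the counter goes from 200 series to 200,000,000 by adding a single tag, and that is for one instrument in one service.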

## 5. OTLP: the protocol underneath every exporter

| Transport | Default port | Strengths | Limitations |
| --- | --- | --- | --- |
| gRPC (HTTP/2) | 4317 | Fast, streaming, persistent connection, low overhead | Does not pass easily through HTTP/1.1 proxies; needs ALPN |
| HTTP/Protobuf | 4318 | Works through ordinary load balancers/proxies; easy to debug in Wireshark | Handshake per batch; slightly heavier |
| HTTP/JSON | 4318 (path `/v1/traces`) | Browser-friendly (RUM); easy to debug with curl | Volume grows 3–5x compared with Protobuf |

## 6. OpenTelemetry Collector: the backbone of the pipeline

1. **Buffering when the backend is flaky**: the app does not need a large in-memory queue, because the Collector has a disk queue.
2. **Batching + compression**: cuts network cost by a factor of 5–10.
3. **Tail-based sampling**: sampling based on the whole trace after it completes (keep error traces, drop ordinary ones).
4. **Redaction / PII scrubbing**: mask emails, phone numbers, and tokens before data leaves your network.
5. **Fan-out**: send to Tempo, Datadog, and ELK at the same time to compare vendors or migrate gradually.
6. **Resource enrichment**: add k8s metadata, cloud metadata, and the git SHA from pod labels.

```
graph LR
    A1[".NET App SDK"] -- OTLP gRPC --> B["Collector Agent<br/>(DaemonSet/Sidecar)"]
    A2["Node.js App SDK"] -- OTLP gRPC --> B
    A3["Java App SDK"] -- OTLP gRPC --> B
    B -- OTLP --> C["Collector Gateway<br/>(StatefulSet, HA)"]
    C -- Prom Remote Write --> D[(Prometheus / Mimir)]
    C -- OTLP --> E[(Tempo)]
    C -- Loki HTTP --> F[(Loki)]
    C -- OTLP --> G[(Datadog / New Relic)]
    H[Grafana] --> D
    H --> E
    H --> F

```

The agent + gateway pattern: an agent on each node handles collection, the gateway handles fan-out

### 6.1. Structure of a Collector pipeline

Each pipeline has three parts: **receivers** (take signals in), **processors** (transform them), and **exporters** (send them out). Here is an example gateway configuration that splits traffic between the Grafana stack and Datadog:

```
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  # Protects the Collector itself from running out of memory; keep it first in every pipeline.
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20

  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 10000

  # Keep every error trace, every slow trace, and 1% of the rest.
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: random-1pct
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

  # Mask email addresses in log bodies before export.
  transform/scrub:
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+", "REDACTED_EMAIL")

  resource:
    attributes:
      - key: k8s.cluster.name
        value: prod-asia-southeast
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  # Note: recent collector-contrib releases deprecate this exporter; Loki 3.x can ingest OTLP directly.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  datadog:
    api: { key: ${env:DD_API_KEY}, site: datadoghq.com }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, resource, batch]
      exporters: [otlp/tempo, datadog]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite, datadog]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, transform/scrub, resource, batch]
      exporters: [loki, datadog]
```

## 7. Sampling: the most important technique for controlling cost

### 7.1. Head-based sampling (Parent-based)

The sample/drop decision is made at the first edge, based on a hash of the `trace_id`. Downstream services respect the flag in `traceparent`. Pros: cheap, no buffering. Cons: at decision time you do not yet know whether the request will fail or be slow, so you cannot prioritize those.
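
A deterministic ratio decision can be sketched in a few lines. This is an illustration of the idea, assuming the first 8 bytes of the trace id are uniformly random; it is not the exact algorithm used by `TraceIdRatioBasedSampler`, and the helper name `ShouldSample` is hypothetical.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: the same trace_id always yields the same decision,
// so every service that applies the same ratio agrees without coordination.
static bool ShouldSample(ActivityTraceId traceId, double ratio)
{
    Span<byte> bytes = stackalloc byte[16];
    traceId.CopyTo(bytes);
    ulong value = BitConverter.ToUInt64(bytes.Slice(0, 8)); // treat 8 bytes as a uniform number
    return value < ratio * ulong.MaxValue;                  // keep the lowest `ratio` fraction
}

int kept = 0;
for (int i = 0; i < 10_000; i++)
{
    if (ShouldSample(ActivityTraceId.CreateRandom(), 0.10)) kept++;
}
Console.WriteLine($"kept {kept} of 10000 traces at ratio 0.10");
```

Because the decision is a pure function of the trace id, it costs no memory and no buffering, which is exactly the trade-off described above: cheap, but blind to how the request turns out.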

### 7.2. Tail-based sampling (in the Collector)

The Collector buffers all spans of a trace (for up to `decision_wait`) and then decides according to its policies. The configuration above keeps 100% of traces with errors, 100% of traces with latency > 500ms, and 1% of ordinary traces. This is a sensible balance for production: high signal when you investigate, low volume when things are quiet.
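
The three policies combine as a logical OR: a trace is kept if any policy matches. The sketch below mirrors that decision in plain C#; it is an illustration of the policy logic only, and the function name `KeepTrace` and its parameters are hypothetical, not the Collector's API.

```csharp
using System;

// Hypothetical decision function mirroring the errors / slow / random-1pct policies.
// `roll` is a uniform random number in [0, 100) drawn once per trace.
static bool KeepTrace(bool hasErrorSpan, TimeSpan traceDuration, double roll, double randomPercent = 1.0)
{
    bool slow = traceDuration > TimeSpan.FromMilliseconds(500);
    bool lucky = roll < randomPercent;
    return hasErrorSpan || slow || lucky;
}

var rng = new Random(42);
Console.WriteLine(KeepTrace(true, TimeSpan.FromMilliseconds(20), rng.NextDouble() * 100));   // error trace: always kept
Console.WriteLine(KeepTrace(false, TimeSpan.FromMilliseconds(900), rng.NextDouble() * 100)); // slow trace: always kept
Console.WriteLine(KeepTrace(false, TimeSpan.FromMilliseconds(20), roll: 50.0));              // ordinary trace, unlucky roll: dropped
```

The decision needs the whole trace, which is why the Collector must hold spans in memory for `decision_wait` before it can answer.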

### 7.3. Probabilistic + Rate Limiting

#### A combined sampling strategy for a microservices system

Gateway / edge: head-based 100% (so every request has a `trace_id`, which makes log correlation easy). Collector gateway: tail-based with policies for errors + slow + 1–5% random. Batch jobs / cron: head-based 100% (low volume, and debugging them matters). Health check / metrics endpoints: drop 100% right at the instrumentation filter.

## 8. Correlation: logs ↔ traces ↔ metrics

The three pillars only matter when a person can jump between them. OTEL makes that possible through three bridges:

### 8.1. Trace context in logs

When `ILogger.LogInformation` is called while an Activity is active, the OpenTelemetry logger automatically attaches `TraceId`, `SpanId`, and `TraceFlags` to the LogRecord. Once the logs land in Loki, you query `{service="orders-api"} |= "trace_id=abc123"` and see every log line in that trace.

### 8.2. Exemplars in metrics

An exemplar is a sample point attached to a histogram data point, containing `trace_id`, `span_id`, the value, and a timestamp. For example, the histogram `http.server.request.duration` has a data point in the 500ms–1s bucket with an exemplar trace_id=abc, meaning "I saw one request this slow, and here is the specific trace". Grafana renders exemplars as dots on the chart, and clicking one opens Tempo. With .NET 10, exemplars are described as enabled by default; earlier OpenTelemetry .NET releases required opting in through an exemplar filter such as `SetExemplarFilter(ExemplarFilterType.TraceBased)`, so confirm the behavior on your SDK version.

### 8.3. Span events and span links

```
sequenceDiagram
    participant U as User
    participant G as Grafana
    participant P as Prometheus/Mimir
    participant T as Tempo
    participant L as Loki
    U->>G: "API slow at 10:32"
    G->>P: PromQL histogram_quantile p95
    P-->>G: Chart + exemplar trace_id=abc
    U->>G: Click exemplar
    G->>T: GET /api/traces/abc
    T-->>G: Full span tree
    U->>G: Click "Logs for span s2"
    G->>L: {trace_id="abc", span_id="s2"}
    L-->>G: Log lines correlated

```

A typical debugging journey: metric → exemplar → trace → log

## 9. Auto-instrumentation: when to use it and when not to

OTEL offers three tiers of automation:

1. **Library instrumentation** (NuGet packages): `OpenTelemetry.Instrumentation.AspNetCore`, `OpenTelemetry.Instrumentation.Http`, `OpenTelemetry.Instrumentation.EntityFrameworkCore`... You add one line of code and the library emits spans automatically. This is the standard path for .NET in production.
2. **Zero-code auto-instrumentation** (`OpenTelemetry.AutoInstrumentation`): attaches to the process through the CLR profiler API, with no code changes. A good fit for legacy apps or binaries you cannot modify.
3. **eBPF-based** (Grafana Beyla, Pixie): instruments at the kernel level, with no SDK running inside the app. Trade-off: fewer rich attributes, and a dependency on the kernel version.

#### Recommendations for .NET 10

## 10. Semantic Conventions: write once, dashboards everywhere

Semantic Conventions (SemConv) are the main reason the prebuilt Grafana/Datadog dashboards "just work" once you turn on OTEL. Instead of each team naming an attribute `httpStatus`, `http_status_code`, or `statusCode`, SemConv settles on `http.response.status_code` (int). Backends build their dashboards on these standard names.

| Domain | Standard attribute keys | Example |
| --- | --- | --- |
| HTTP server | `http.request.method`, `http.route`, `http.response.status_code`, `url.scheme`, `url.path`, `server.address` | POST /orders/{id}, 201 |
| Database | `db.system.name`, `db.namespace`, `db.operation.name`, `db.query.text` | mssql, AnhTu, SELECT, "SELECT * FROM Post WHERE Id=@id" |
| Messaging | `messaging.system`, `messaging.destination.name`, `messaging.operation.type` | kafka, orders.created, send |
| gRPC | `rpc.system`, `rpc.service`, `rpc.method`, `rpc.grpc.status_code` | grpc, orders.v1.OrdersApi, CreateOrder, 0 |
| FaaS | `faas.name`, `faas.version`, `faas.trigger`, `faas.invoked_provider` | process-order, 2.4.1, http, aws |
| GenAI | `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens` | openai, gpt-5, 1523 |

If your code emits spans with a non-standard key (`orderItemsCount`), you can still search for it, but you lose the default dashboards, the default alerts, and cross-service analysis. The golden rule: *an attribute that has a SemConv equivalent always uses the SemConv name*; business-specific attributes get a namespace prefix (`orders.items_count`).

## 11. Backend: LGTM stack or a vendor, and how to choose

OTEL frees you from lock-in, so choosing a backend becomes a question of economics and operational load. Five common options:

| Option | Components | Cost model | Good fit when |
| --- | --- | --- | --- |
| **LGTM self-hosted** | Loki + Grafana + Tempo + Mimir (Grafana Labs) | Infrastructure cost + operations staff | Large volume (> 10TB/month), the team has ops capacity, and you want control over the data |
| **Grafana Cloud** | Managed LGTM | Per GB of metrics + traces + logs ingested | Small team that does not want to operate it, medium volume |
| **Datadog / New Relic / Dynatrace** | Full APM + RUM + synthetics + profiler | Host-based + custom metrics + ingestion | Enterprise, needs end-to-end APM + AI-assisted RCA, comfortable budget |
| **Azure Monitor / Google Cloud Observability** | Application Insights + Cloud Trace + Cloud Logging | Per GB + feature tier | All-in on one cloud and wanting deep integration with its services |
| **Honeycomb / Lightstep** | Event-based trace analytics | Per events/month | Needs high-cardinality slice-and-dice and very fast production debugging |

#### Practical advice

## 12. Integration with .NET Aspire 9.5 and OpenTelemetry by default

.NET Aspire, the orchestration stack for cloud-native .NET, has used OpenTelemetry as its default since 8.0. With Aspire 9.5 in early 2026, every newly generated project already ships a `ServiceDefaults` project with the OTEL configuration, so there is no code to copy and paste. You only need to call:

```
var builder = WebApplication.CreateBuilder(args);
builder.AddServiceDefaults(); // Aspire: OTEL + health check + service discovery + resilience
```

Inside `AddServiceDefaults`, Aspire configures:

- The OTEL SDK with full auto-instrumentation (AspNetCore, HttpClient, EF Core, gRPC, Runtime)
- An OTLP exporter that reads the environment variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` (as defined in the OTEL spec)
- Resource detection for container, k8s, and host
- Service discovery integrated through `ServiceDiscovery` instead of DNS
- Resilience (Polly v8) for HttpClient by default

The local Aspire dashboard includes a mini Jaeger/Prometheus that renders the OTLP it receives from the child projects. A developer runs `aspire run` and sees end-to-end traces across the Vue frontend, the .NET BFF, the Order service, and the Payment service without installing anything. This is a big change in developer experience.

## 13. The fourth signal: Continuous Profiling

- **Grafana Pyroscope**: open-source backend, pprof-compatible, integrated with Grafana Explore.
- **.NET**: `dotnet-monitor` or `Parca Agent` (eBPF) can produce the OTEL Profile Signal.
- Flame graphs connect directly to spans through span links, so you can jump from a slow span into a flame graph and see which function is burning CPU.

## 14. Production patterns: what to do and what to avoid

### 14.1. Deployment topology

### 14.2. Failure modes and retry

The Collector can crash, and the backend can slow down. The .NET SDK uses `BatchExportProcessor` with a default maximum queue of 2048 spans/metrics. When the queue is full, new spans are dropped silently. In production:

- Set `OTEL_BSP_MAX_QUEUE_SIZE=8192` for traces and `OTEL_METRIC_EXPORT_INTERVAL=30000` for metrics on hot services.
- Enable the `file_storage` extension on the Collector for an on-disk queue, so that losing the backend connection for 10 minutes does not lose data.
- Add an alert for "collector queue > 80%"; it is an early warning that the backend is slowing down.
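
The "drop silently when full" behavior is easy to reproduce with a bounded queue. The sketch below is an analogy built on `System.Threading.Channels`, not the SDK's actual buffer; the capacity of 3 and the item names are made up. The point it shows is that the producer is never blocked and never told, so only a counter reveals the loss.

```csharp
using System;
using System.Threading.Channels;

int dropped = 0;

// Bounded queue that discards the incoming item when full, as an analogy for a full export queue.
var queue = Channel.CreateBounded<string>(
    new BoundedChannelOptions(capacity: 3) { FullMode = BoundedChannelFullMode.DropWrite },
    itemDropped: _ => dropped++);

for (int i = 0; i < 5; i++)
{
    // TryWrite still succeeds from the caller's point of view; the drop is only visible via the callback.
    queue.Writer.TryWrite($"span-{i}");
}

Console.WriteLine($"queued={queue.Reader.Count}, dropped={dropped}"); // queued=3, dropped=2
```

This is why a queue-size setting and a queue-fill alert go together: raising the limit buys time, and the alert tells you when that time is being used up.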

### 14.3. Multi-tenant / multi-environment

If one Collector serves several environments (dev/staging/prod) or several teams, use the `routing` processor to split pipelines by `deployment.environment.name` or `service.namespace`. Do not use a single backend for every environment, because dev volume will eat into prod's retention.

### 14.4. Security and PII

- OTLP over mTLS is mandatory when crossing a cluster boundary.
- Use the `transform` processor to scrub emails, phone numbers, tokens, and JWTs from log bodies and attributes before data leaves your network.
- For GenAI spans (`gen_ai.*`), consider redacting `gen_ai.prompt` and `gen_ai.completion` if they contain customer data.
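
The same email pattern used in the Collector's `transform/scrub` processor earlier can be applied in-process when you want to scrub before the data even leaves the service. This is a minimal sketch; the helper name `ScrubEmails` and the sample message are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as the replace_pattern statement in the Collector config above.
var emailPattern = new Regex(@"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+", RegexOptions.Compiled);

string ScrubEmails(string body) => emailPattern.Replace(body, "REDACTED_EMAIL");

string scrubbed = ScrubEmails("contact alice@example.com now");
Console.WriteLine(scrubbed); // contact REDACTED_EMAIL now
```

Scrubbing in the Collector is usually preferable because one rule covers every service, but an in-process pass is a reasonable second layer for fields you already know are sensitive.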

#### Common anti-patterns

1. **JSON-in-JSON logs**: `logger.LogInformation("Order: {@order}", order)` with a 10KB `order`. The log backend indexes every field, cardinality explodes, and cost goes up 20x.
2. **Counters keyed by user_id**: time series explode and Prometheus runs out of memory.
3. **Dynamic span names**: `activity?.DisplayName = $"Process {order.Id}"`. Every span becomes unique, and the trace backend cannot group them into a chart.
4. **No Resource**: sending OTLP without setting `service.name`. The backend shows "unknown_service" for everything.
5. **Independent sampling in each service**: without ParentBased, traces fall apart partway through.

## 15. Cost and overhead benchmark

An internal benchmark on .NET 10 with a CRUD API backed by Redis + PostgreSQL, at 5k RPS and a baseline p99 of 45ms:

| Configuration | CPU overhead | Extra memory | p99 latency | Notes |
| --- | --- | --- | --- | --- |
| No OTEL | n/a | n/a | 45 ms | Baseline |
| Trace only, 10% sampling, BatchExporter | +1.2% | +18 MB | 46 ms | Fine for production |
| Trace 100% + Metrics + Logs | +3.8% | +42 MB | 48 ms | Overhead still acceptable |
| Plus Profiling (Parca Agent eBPF) | +0.3% (measured at the node) | +65 MB agent | 48 ms | eBPF does not touch app memory |
| Log every request (Info) + trace 100% | +7.1% | +120 MB | 53 ms | Lower the log level or sample logs |

The lesson: turn everything on but sample correctly, and the cost is visible yet acceptable. Logging every request at Debug/Information is the largest source of overhead; using the `LoggerMessage` source generator and the right log level reduces it considerably.

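## 16. Migration roadmap from Application Insights / Serilog + ELK

Most legacy .NET codebases use one of two stacks: Application Insights (Azure) or Serilog + ELK. Migrating to OTEL is painless if you follow four steps:

Step 1: Run in parallel (weeks 1–2)

Add the OpenTelemetry SDK alongside the old logger. Both emit at the same time. The OTLP exporter sends to the Collector, and the Collector exports to Application Insights through the `azuremonitor` exporter. The old dashboards stay unchanged, and you now have OTLP ready.

Step 2: Standardize attributes (weeks 3–4)

Review the code and move custom attributes to SemConv. Replace `logger.LogInformation("UserId {userId}", id)` with `Activity.Current?.SetTag("user.id", id)` and `log.WithAttribute("enduser.id", id)` (pseudocode; with `ILogger`, a scope created by `logger.BeginScope(...)` plays this role). The new dashboards start to follow the standard.

Step 3: Turn on the new backend (weeks 5–6)

Deploy Tempo + Mimir + Loki, or pay for Grafana Cloud/Datadog. The Collector fans out to both the old Application Insights and the new backend. The team gets familiar with the new query languages, and alerts are rewritten in PromQL/TraceQL/LogQL.

Step 4: Cut over from the old stack (weeks 7–8)

Remove the Serilog Elasticsearch sink and the Application Insights SDK. Keep only OpenTelemetry. The Collector becomes the single choke point all telemetry flows through, which makes it easy to configure.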

## 17. Conclusion: OpenTelemetry is a foundational skill for backend engineers in 2026

Ten years ago, choosing the wrong observability stack meant rewriting hundreds of thousands of lines of instrumentation when you switched vendors. With OTEL in 2026, instrumentation is shared, the backend is a choice, and the boundaries between logs, metrics, traces, and profiles have blurred enough that an operations engineer can move between them in a few clicks. For .NET 10 in particular, the fact that OTEL builds on the `Activity` and `Meter` types already in the BCL means adoption costs you almost nothing: the API is familiar, Aspire enables it out of the box, and auto-instrumentation covers most use cases.

What I want you to take away from this article is that **observability is not a dashboard**. It is the discipline of asking the right questions and building the data to answer them. OTEL is the best tool for that discipline right now, but the tool only pays off when the team has the habit of writing spans, defining meaningful metrics, and logging at the right level. Start with one service, one simple Collector pipeline, and one dashboard that answers the three opening questions. Once that is second nature, scaling out to the whole system is an engineering matter, no longer a matter of mindset.

## 18. References

- [OpenTelemetry Documentation (official)](https://opentelemetry.io/docs/)
- [OpenTelemetry Specification](https://opentelemetry.io/docs/specs/otel/)
- [Semantic Conventions 1.30](https://opentelemetry.io/docs/specs/semconv/)
- [opentelemetry-dotnet on GitHub](https://github.com/open-telemetry/opentelemetry-dotnet)
- [Microsoft Learn — Observability with OpenTelemetry](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/observability-with-otel)
- [.NET Aspire — Telemetry fundamentals](https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/telemetry)
- [OpenTelemetry Collector — Configuration](https://opentelemetry.io/docs/collector/configuration/)
- [Grafana Tempo Documentation](https://grafana.com/docs/tempo/latest/)
- [Grafana Loki Documentation](https://grafana.com/docs/loki/latest/)
- [Prometheus — Metric and label naming](https://prometheus.io/docs/practices/naming/)
- [Google Dapper Paper (2010)](https://research.google/pubs/pub36356/)
