OpenTelemetry — The Observability Standard for Distributed Systems

Posted on: 4/21/2026 3:10:12 AM

CNCF #2 Most active project after Kubernetes
40+ Languages & frameworks supported
100+ Vendors with native integration
3 Pillars: Traces, Metrics, Logs

As distributed systems grow more complex — microservices calling each other, message queues in between, layered caches — the question "where did this fail?" becomes extremely hard to answer. You can't debug production with breakpoints. You need observability, and OpenTelemetry is becoming the one standard the whole industry agrees on.

1. What is observability and why does it matter?

Observability is the ability to understand a system's internal state purely from its output signals — without changing code or disrupting the main execution path. Unlike traditional monitoring (which tracks pre-known metrics), observability lets you answer questions you never asked ahead of time.

Monitoring vs. observability

Monitoring answers: "What's the CPU at?" or "Are requests/s over the threshold?"
Observability answers: "Why does a request from user X in the APAC region take 3 seconds instead of 200ms, and which service is the bottleneck?"

In a monolith, you can open a single log file and trace through a thread. But when a request moves through API Gateway → Auth Service → Order Service → Payment → Notification, each service has its own logs, timezones, and formats — you need a way to correlate them all.

2. The three pillars: Traces, Metrics, Logs

graph TD
    A[Telemetry Data] --> B[Traces]
    A --> C[Metrics]
    A --> D[Logs]
    B --> B1["Distributed Tracing<br/>Track request flow"]
    B --> B2["Spans<br/>Unit of time"]
    B --> B3["Context Propagation<br/>W3C TraceContext"]
    C --> C1["Counters<br/>Cumulative counts"]
    C --> C2["Gauges<br/>Instantaneous values"]
    C --> C3["Histograms<br/>Statistical distribution"]
    D --> D1["Structured Logs<br/>Key-value pairs"]
    D --> D2["Correlation<br/>Attach TraceId/SpanId"]
    D --> D3["Severity Levels<br/>Info/Warn/Error"]

    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

The three telemetry pillars of OpenTelemetry

Traces — follow the request's journey

A trace represents a request's entire journey across a distributed system. Each trace consists of many spans — the smallest units of work, each with a name, start/end time, and attributes that describe the context.

sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant Auth as Auth Service
    participant Order as Order Service
    participant DB as Database
    participant Cache as Redis Cache

    Client->>Gateway: POST /orders (TraceId: abc123)
    Gateway->>Auth: Verify Token (SpanId: s1)
    Auth-->>Gateway: 200 OK (2ms)
    Gateway->>Order: Create Order (SpanId: s2)
    Order->>Cache: Check inventory (SpanId: s3)
    Cache-->>Order: Cache HIT (0.5ms)
    Order->>DB: INSERT order (SpanId: s4)
    DB-->>Order: OK (15ms)
    Order-->>Gateway: 201 Created (18ms)
    Gateway-->>Client: 201 Created (22ms)

Distributed trace across multiple services: each request arrow opens a span, and the matching response closes it

Each span contains:

  • TraceId: unique ID for the whole trace (propagated via the HTTP traceparent header)
  • SpanId: ID of the current span
  • ParentSpanId: parent-child linkage between spans
  • Attributes: key-value pairs such as http.request.method=POST, db.system=postgresql
  • Events: things that happened inside the span (e.g., "cache miss", "retry attempt")
  • Status: OK, ERROR, or UNSET
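
To make the TraceId / SpanId / ParentSpanId linkage concrete, here is a minimal Python sketch of the W3C traceparent header layout (version, trace ID, parent span ID, flags, joined by hyphens). The helper names parse_traceparent and child_traceparent are illustrative and not part of any OpenTelemetry SDK; real SDKs perform this propagation for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context: version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

The receiving service parses the header, records the incoming span ID as its ParentSpanId, and keeps the same TraceId for every span it creates.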

Metrics — measure performance with numbers

Metrics are numerical measurements aggregated over time. The three instrument types you will use most often in OpenTelemetry are:

| Type | Description | Example | Use when |
|---|---|---|---|
| Counter | Cumulative, only increases | Total requests, total bytes sent | Counting events over time |
| Gauge | Instantaneous, goes up/down | CPU usage, active connections, queue length | Measuring current state |
| Histogram | Statistical distribution | Request latency (p50, p95, p99) | Analyzing value distributions |
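
The behavioral difference between the three types can be sketched in a few lines of Python. The classes below are simplified stand-ins written for this article, not the OpenTelemetry metrics API, and the bucket bounds are arbitrary assumptions.

```python
import bisect


class Counter:
    """Monotonic: only non-negative increments are accepted."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Holds the most recent observation; earlier values are overwritten."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""

    def __init__(self, bounds: list) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last bucket is +Inf

    def record(self, value: float) -> None:
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        target = q * sum(self.counts)
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.bounds[index] if index < len(self.bounds) else float("inf")
        return float("inf")
```

Because a histogram stores only bucket counts, percentiles such as p95 are estimates bounded by the bucket edges, which is why bucket boundaries should be chosen around the latencies you care about.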

Logs — events with context

Logs in OpenTelemetry are not just text — they're structured logs with TraceId and SpanId attached automatically. That way, when you see an ERROR log entry, you can jump straight to the corresponding trace to see the whole request journey.

{
  "timestamp": "2026-04-21T10:15:30Z",
  "severity": "ERROR",
  "body": "Payment processing failed",
  "attributes": {
    "order.id": "ORD-98765",
    "payment.provider": "stripe",
    "error.type": "timeout"
  },
  "traceId": "abc123def456...",
  "spanId": "span789..."
}

Correlating Logs-Traces-Metrics

The real power is correlation: when a metric shows p99 latency spiking → filter traces with duration > 2s → find the slowest span → read that span's logs to understand root cause. Traditional monitoring just can't do that.
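
The drill-down described above amounts to a join on shared IDs. Here is a minimal Python sketch over in-memory records; the field names (trace_id, span_id, parent_span_id, duration_ms) and the function name are assumptions for illustration, and a real backend would run the equivalent queries for you. As a simplification, it ranks child spans by total duration rather than self time.

```python
from typing import Iterable


def slowest_span_logs(spans: Iterable[dict], logs: Iterable[dict], min_trace_ms: float) -> list:
    """Walk trace -> span -> logs for the slowest trace above the threshold."""
    spans = list(spans)
    # 1. Root spans (no parent) carry the end-to-end duration of each trace.
    slow_roots = [
        s for s in spans
        if s["parent_span_id"] is None and s["duration_ms"] > min_trace_ms
    ]
    if not slow_roots:
        return []
    worst_trace = max(slow_roots, key=lambda s: s["duration_ms"])["trace_id"]
    # 2. Among that trace's child spans, pick the one that took longest.
    children = [
        s for s in spans
        if s["trace_id"] == worst_trace and s["parent_span_id"] is not None
    ]
    if not children:
        return []
    culprit = max(children, key=lambda s: s["duration_ms"])
    # 3. Logs are joined purely on the shared trace and span IDs.
    return [
        entry for entry in logs
        if entry["trace_id"] == worst_trace and entry["span_id"] == culprit["span_id"]
    ]
```

Step 3 only works when every log record carries the IDs of the span that emitted it, which is exactly what the structured log format above provides.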

3. OpenTelemetry architecture

OpenTelemetry is not a product — it's a framework and toolkit composed of multiple components working together:

graph LR
    subgraph Application
        A1[Your Code] --> SDK[OTel SDK]
        A2[Auto-Instrumentation] --> SDK
        A3[Library Instrumentation] --> SDK
    end

    SDK -->|OTLP| C[OTel Collector]

    subgraph Collector
        C --> R[Receivers]
        R --> P[Processors]
        P --> E[Exporters]
    end

    E --> G[Grafana/Tempo]
    E --> J[Jaeger]
    E --> PR[Prometheus]
    E --> AZ[Azure Monitor]
    E --> DD[Datadog]

    style SDK fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style P fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style J fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PR fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style AZ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style DD fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

OpenTelemetry overview — from application to backend

Core components

  • API: standard interface for creating telemetry — library authors instrument code against it without depending on a specific implementation
  • SDK: implementation of the API, responsible for collecting, processing, and exporting data
  • Auto-Instrumentation: automatically captures telemetry from popular frameworks (ASP.NET Core, HttpClient, EF Core...) with no code changes
  • OTLP (OpenTelemetry Protocol): standard vendor-neutral transport supporting both gRPC and HTTP/protobuf
  • Semantic Conventions: standardized attribute names — http.request.method means the same thing in every language

4. OpenTelemetry Collector — heart of the system

The Collector is the middleman that receives, processes, and forwards telemetry. It acts as a smart proxy between apps and backends, decoupling instrumentation logic from delivery logic.

Two deployment modes

graph TD
    subgraph Agent Mode
        App1[App 1] --> CA[Collector Agent]
        App2[App 2] --> CA
        CA -->|Forward| CG
    end

    subgraph Gateway Mode
        CA2[Agent 1] --> CG[Collector Gateway]
        CA3[Agent 2] --> CG
        CG --> Backend[Observability Backend]
    end

    style CA fill:#e94560,stroke:#fff,color:#fff
    style CG fill:#2c3e50,stroke:#fff,color:#fff
    style Backend fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Agent mode (sidecar) vs. Gateway mode (centralized)

| Attribute | Agent Mode | Gateway Mode |
|---|---|---|
| Deployment | Sidecar / DaemonSet next to the app | Standalone centralized service |
| Pros | Low latency, local processing | Central management, complex sampling |
| Cons | Uses resources on every node | Single point of failure if not HA |
| Fits | Kubernetes, edge computing | Multi-cluster, cross-region |

Collector configuration (YAML)

The Collector is configured as a pipeline: Receivers → Processors → Exporters. Here's a representative example to adapt for production (the tail_sampling processor ships in the Collector contrib distribution):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      cert_file: /certs/client.crt
      key_file: /certs/client.key
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp
  # Tempo stores traces only, so logs go to Loki.
  # Assumes a Loki version with native OTLP ingestion on /otlp.
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
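
The two batch settings in this configuration (send_batch_size and timeout) interact in a simple way: a batch is sent when it is full or when it has waited long enough, whichever comes first. The Python sketch below illustrates that rule only; the Batcher class is hypothetical and the real processor has more options and different internals.

```python
from typing import Callable, Optional


class Batcher:
    """Flush when the batch is full or when the oldest item has waited too long."""

    def __init__(self, send_batch_size: int, timeout_s: float,
                 export: Callable[[list], None]) -> None:
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export  # callable that receives one list of items
        self.items: list = []
        self.first_item_at: Optional[float] = None

    def add(self, item, now: float) -> None:
        if not self.items:
            self.first_item_at = now  # the timeout clock starts with the first item
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            self.flush()

    def tick(self, now: float) -> None:
        """Call periodically; sends a partial batch once the timeout has passed."""
        if self.items and now - self.first_item_at >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        batch, self.items = self.items, []
        self.first_item_at = None
        self.export(batch)
```

A larger batch size reduces the number of export calls under load, while the timeout bounds how stale telemetry can get when traffic is light.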

Memory Limiter is mandatory

In production, always place the memory_limiter processor BEFORE other processors. Otherwise, a traffic spike can OOM the collector and drop all buffered telemetry. Configure limit_mib at around 70–80% of the container's available RAM.
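
As a worked example of that sizing rule, the helper below derives memory_limiter values from a container memory limit. The 0.75 default sits in the middle of the 70–80% range above; the 0.25 spike fraction is an assumption that mirrors the 128/512 ratio in the sample configuration, and the function name is hypothetical.

```python
def memory_limiter_settings(container_mib: int, limit_fraction: float = 0.75,
                            spike_fraction: float = 0.25) -> dict:
    """Derive limit_mib and spike_limit_mib from the container memory limit."""
    if not 0 < limit_fraction < 1:
        raise ValueError("limit_fraction must be between 0 and 1")
    if not 0 < spike_fraction < 1:
        raise ValueError("spike_fraction must be between 0 and 1")
    limit_mib = int(container_mib * limit_fraction)
    # Headroom reserved for bursts between two memory checks.
    spike_limit_mib = int(limit_mib * spike_fraction)
    return {
        "check_interval": "1s",
        "limit_mib": limit_mib,
        "spike_limit_mib": spike_limit_mib,
    }
```

For a container limited to 2048 MiB this yields limit_mib: 1536 and spike_limit_mib: 384.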

5. Integrating OpenTelemetry with .NET

.NET has an unusual advantage: telemetry APIs are already built into the platform (ILogger, System.Diagnostics.Metrics, ActivitySource). The OpenTelemetry .NET SDK hooks into these APIs and exports what they produce, so existing instrumentation code does not need to be rewritten.

graph LR
    subgraph ".NET Framework APIs"
        IL["ILogger<T>"]
        ME["Meter / Counter"]
        AS["ActivitySource / Activity"]
    end

    subgraph "OTel .NET SDK"
        IL --> LP[Log Provider]
        ME --> MP[Meter Provider]
        AS --> TP[Tracer Provider]
    end

    LP --> EX[OTLP Exporter]
    MP --> EX
    TP --> EX

    EX --> COL[Collector]

    style IL fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ME fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style AS fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LP fill:#e94560,stroke:#fff,color:#fff
    style MP fill:#e94560,stroke:#fff,color:#fff
    style TP fill:#e94560,stroke:#fff,color:#fff
    style EX fill:#2c3e50,stroke:#fff,color:#fff
    style COL fill:#2c3e50,stroke:#fff,color:#fff

.NET uses its native APIs; the OTel SDK only handles export

Install NuGet packages

dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

Configuration in Program.cs

using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // Resource attributes identify this service in every signal.
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "OrderService",
            serviceVersion: "1.0.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        // Register the custom ActivitySource; unregistered sources are not exported.
        .AddSource("OrderService")
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Skip health-check requests to avoid noisy traces.
            opts.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            opts.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(opts =>
        {
            opts.SetDbStatementForText = true;
            opts.RecordException = true;
        })
        .AddOtlpExporter(opts =>
        {
            opts.Endpoint = new Uri("http://otel-collector:4317");
            opts.Protocol = OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        // Register the custom Meter so orders.created is exported.
        .AddMeter("OrderService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter())
    .WithLogging(logging => logging
        .AddOtlpExporter());

var app = builder.Build();
app.Run();

Custom instrumentation — tracing business logic

Auto-instrumentation covers HTTP, DB, gRPC. To trace business logic (order processing, pricing calculation, inventory check), you need to add spans manually:

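using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderService
{
    // The SDK only exports from sources and meters registered by name,
    // e.g. AddSource("OrderService") and AddMeter("OrderService") in Program.cs.
    private static readonly ActivitySource Source = new("OrderService");
    private static readonly Meter Meter = new("OrderService");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders.created");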

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        using var activity = Source.StartActivity("CreateOrder");
        activity?.SetTag("order.customer_id", request.CustomerId);
        activity?.SetTag("order.items_count", request.Items.Count);

        // Validate inventory
        using (Source.StartActivity("ValidateInventory"))
        {
            await ValidateInventoryAsync(request.Items);
        }

        // Calculate pricing
        decimal total;
        using (var pricingSpan = Source.StartActivity("CalculatePricing"))
        {
            total = await CalculatePricingAsync(request.Items);
            // Cast to double so the backend receives a numeric attribute.
            pricingSpan?.SetTag("order.total", (double)total);
        }

        // Process payment
        using (Source.StartActivity("ProcessPayment"))
        {
            await ProcessPaymentAsync(request.CustomerId, total);
        }

        OrdersCreated.Add(1,
            new KeyValuePair<string, object?>("region", request.Region));

        activity?.SetStatus(ActivityStatusCode.Ok);
        return new Order { Id = Guid.NewGuid(), Total = total };
    }
}

.NET Aspire — OTel included

If you're using .NET Aspire, OpenTelemetry is already wired up in the ServiceDefaults project: calling builder.ConfigureOpenTelemetry() enables tracing, metrics, and logging. The Aspire Dashboard also shows all telemetry locally during development, without Grafana or Jaeger.

6. Smart sampling strategies

At scale, collecting 100% of traces is rarely affordable: storage and network costs grow with every request. Sampling reduces volume while keeping the important data.

Head-based vs. tail-based sampling

graph TD
    subgraph "Head-based Sampling"
        H1[Request arrives] --> H2{Decide up front}
        H2 -->|Sample| H3[Collect trace]
        H2 -->|Drop| H4[Discard entirely]
    end

    subgraph "Tail-based Sampling"
        T1[Request arrives] --> T2[Collect ALL spans]
        T2 --> T3[Trace finishes]
        T3 --> T4{Evaluate the whole trace}
        T4 -->|Error/Slow| T5[Keep]
        T4 -->|Normal| T6[Apply ratio sampling]
    end

    style H1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H2 fill:#e94560,stroke:#fff,color:#fff
    style T1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T4 fill:#2c3e50,stroke:#fff,color:#fff

Head-based decides at the start; tail-based decides after the trace completes

| Criterion | Head-based | Tail-based |
|---|---|---|
| Decision time | As the request starts | After the trace completes |
| Pros | Simple, low overhead | Keeps every error and slow request |
| Cons | Can miss error traces | Needs a collector with enough RAM to buffer |
| Fits | Very high traffic, limited budget | Production needing precise debugging |
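
Head-based sampling is cheap because the decision needs nothing but the trace ID. The Python sketch below shows one common way to do it: treat part of the trace ID as a uniform number and compare it with the sampling ratio. It illustrates the idea only and is not the exact algorithm of any particular SDK sampler; head_sample is a hypothetical name.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at request start, using only the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Interpret the low 8 bytes of the trace ID as a uniform 64-bit number.
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * (1 << 64)
```

Because the decision is a pure function of the trace ID, every service that sees the same trace reaches the same verdict, so traces are kept or dropped whole. The cost is visible in the signature: nothing about errors or latency is available yet.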

A common production combination:

# tail-sampling configuration on the Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep requests slower than 1 second
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep traces from critical endpoints
      - name: keep-critical-paths
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/orders"]
      # 5% of normal traces
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
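
To show how these four policies combine, here is a minimal Python sketch of the decision for one completed trace. It assumes the policies are OR-ed (a trace is kept if any policy votes to keep it), and the span field names (status, start_ms, end_ms, attributes) and the keep_trace function are hypothetical; the Collector's own implementation is more involved.

```python
import random
from typing import Callable

CRITICAL_ROUTES = {"/api/payments", "/api/orders"}


def keep_trace(spans: list, baseline_pct: float = 5.0,
               rng: Callable[[], float] = random.random) -> bool:
    """Return True if any policy votes to keep the completed trace."""
    # keep-errors: any span finished with ERROR status
    if any(s.get("status") == "ERROR" for s in spans):
        return True
    # keep-slow: end-to-end duration above 1000 ms
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    if end - start > 1000:
        return True
    # keep-critical-paths: any span on a critical route
    if any(s.get("attributes", {}).get("http.route") in CRITICAL_ROUTES for s in spans):
        return True
    # baseline: a random percentage of everything else
    return rng() * 100 < baseline_pct
```

The function needs every span of the trace as input, which is why the Collector must buffer spans for decision_wait before it can decide.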

7. Building a complete observability stack

A popular, cost-effective production stack (fully self-hostable):

graph LR
    App[Applications] -->|OTLP| Col[OTel Collector]
    Col -->|Traces| Tempo[Grafana Tempo]
    Col -->|Metrics| Prom[Prometheus]
    Col -->|Logs| Loki[Grafana Loki]

    Tempo --> Graf[Grafana Dashboard]
    Prom --> Graf
    Loki --> Graf

    Graf --> Alert[Alertmanager]
    Alert --> PD[PagerDuty/Slack]

    style Col fill:#e94560,stroke:#fff,color:#fff
    style Graf fill:#2c3e50,stroke:#fff,color:#fff
    style Tempo fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Prom fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Loki fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Open-source observability stack: OTel + Grafana ecosystem

| Pillar | Tool | Role | Cost |
|---|---|---|---|
| Traces | Grafana Tempo | Trace storage, lookup by TraceId | Free (self-host) |
| Metrics | Prometheus | Collection and querying (PromQL) | Free (self-host) |
| Logs | Grafana Loki | Log aggregation with label-based indexing | Free (self-host) |
| Visualization | Grafana | Dashboards, alerting, explore | Free (self-host) |
| Alerting | Alertmanager | Routes alerts → Slack, PagerDuty, Email | Free |

Docker Compose for local development

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
      - "8889:8889"   # Prometheus metrics
    volumes:
      # The contrib image reads its config from /etc/otelcol-contrib/
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml

Self-host vs. managed service

| Criterion | Self-host (Grafana Stack) | Managed (Datadog/New Relic) | Hybrid (Grafana Cloud Free) |
|---|---|---|---|
| Cost | Infra only (servers/storage) | $15-25/host/month | Free tier: 50GB logs, 10K metrics |
| Setup | Needs DevOps experience | 5-minute setup | 15-minute setup |
| Scaling | Manage your own HA and retention | Automatic | Free tier has limits |
| Vendor lock-in | None (OTel is standard) | High (proprietary features) | Low (OTel-compatible) |
| Fits | Larger teams with infra budget | Startups, small teams | Side projects, MVPs |

8. Production best practices for 2026

Semantic Conventions — standardized naming

One of OpenTelemetry's greatest benefits is Semantic Conventions — standardized attribute names. When every service uses the same convention, cross-service queries are consistent:

| Domain | Attribute | Meaning |
|---|---|---|
| HTTP | http.request.method | GET, POST, PUT, ... |
| HTTP | http.response.status_code | 200, 404, 500, ... |
| HTTP | url.path | /api/orders |
| Database | db.system | postgresql, redis, mssql |
| Database | db.operation.name | SELECT, INSERT, findOne |
| Messaging | messaging.system | kafka, rabbitmq, azure_servicebus |
| Messaging | messaging.destination.name | orders-queue, events-topic |
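
When services were instrumented at different times, they often emit a mix of older and current attribute names. The Python sketch below shows one way to normalize them before querying. The mapping lists three renames from earlier convention versions as an illustrative assumption; check the semantic conventions changelog for the authoritative list. The function name is hypothetical.

```python
# Older attribute name -> current semantic-convention name (illustrative subset).
LEGACY_TO_CURRENT = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "db.operation": "db.operation.name",
}


def normalize_attributes(attrs: dict) -> dict:
    """Rename legacy keys; if both spellings exist, the current one wins."""
    normalized = {}
    for key, value in attrs.items():
        new_key = LEGACY_TO_CURRENT.get(key, key)
        if key != new_key and new_key in normalized:
            continue  # a value under the current name is already present
        normalized[new_key] = value
    return normalized
```

In practice a rename like this usually lives in a Collector processor rather than in application code, so that every backend sees one consistent vocabulary.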

Key principles

1. Keep cardinality low

High-cardinality attributes (e.g., attaching user.id to every metric) will blow up the number of Prometheus time series. Only attach high-cardinality attributes to traces (storage is cheaper); metrics should stick to low-cardinality labels like region, status_code, endpoint.
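
The arithmetic behind that warning is a simple product: in the worst case, one metric produces a time series for every combination of label values. A small Python sketch, with made-up label counts as assumptions:

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series for one metric."""
    return prod(label_cardinalities.values())


low = {"region": 5, "status_code": 6, "endpoint": 20}
high = dict(low, **{"user.id": 100_000})
```

Here series_count(low) is 600, while adding user.id raises the worst case to 60,000,000 series for the same metric.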

2. Filter health checks and noise

Drop traces from /health, /ready, /metrics endpoints. They create huge trace volume with zero debug value. Filter at the SDK level (not the collector) to save network.

3. Protect telemetry data

Telemetry can contain PII (email, tokens, query params). Use a redaction processor in the collector to mask/drop sensitive attributes before export. Always use TLS for OTLP endpoints in production.
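
The redaction step can be pictured as a filter over each span's or log's attributes. The Python sketch below drops attributes whose keys are on a blocklist and masks email addresses inside remaining string values. The blocklist entries, the regular expression, and the redact function are assumptions for illustration; in a real deployment this logic belongs in a Collector processor.

```python
import re

# Deliberately simple email pattern; real PII detection needs more care.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKED_KEYS = {"http.request.header.authorization", "user.email", "password"}


def redact(attrs: dict) -> dict:
    """Drop blocked keys and mask email addresses in string values."""
    clean = {}
    for key, value in attrs.items():
        if key.lower() in BLOCKED_KEYS:
            continue  # drop the attribute entirely
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```

Dropping by key is more reliable than pattern matching on values, so an allowlist of known-safe attributes is often the stricter choice for sensitive systems.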

A phased rollout can follow this timeline:

Phase 1: Foundation (weeks 1–2)
Basic setup: add the OTel SDK + auto-instrumentation to every service. Deploy the Collector in Agent mode. Connect to Grafana Cloud Free tier or local Jaeger to see your first traces.

Phase 2: Enrichment (weeks 3–4)
Add context: custom spans for important business logic. Apply semantic conventions. Create custom metrics (orders/sec, payment success rate). Structured logging with TraceId correlation.

Phase 3: Scale (weeks 5–6)
Production tuning: configure tail-based sampling. Tune batch processor and memory limiter. Set up Grafana dashboards for RED metrics (Rate, Errors, Duration). Create alert rules for SLO/SLI.

Phase 4: Production-grade (weeks 7–8)
Hardening: HA for the Collector (2+ replicas). TLS everywhere. PII redaction. Retention policies. Team training and runbooks for incident response based on observability data.

Conclusion

OpenTelemetry isn't just a library — it's the industry standard for observability. As the second most active CNCF project (after Kubernetes), with support from 100+ vendors and native .NET integration, adopting it is no longer a "should we?" question but a "where do we start?" one.

Key takeaways:

  • Start with traces — they deliver the fastest debug value in distributed systems
  • Auto-instrumentation first, manual later — don't try to cover everything from day one
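  • Put a Collector between apps and backends in production: exporting directly couples every service to one backend and removes the place to batch, sample, and redact
  • Tail-based sampling keeps the error and slow traces that head-based sampling can miss, as long as the Collector can buffer each trace until it completes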
  • Semantic conventions enable consistent cross-service queries — invest in standardization early

With a free stack (OTel Collector + Grafana + Tempo + Prometheus + Loki), you can build a production-grade observability system with no license costs — you only need time to set it up and operate it well.
