OpenTelemetry — The Observability Standard for Distributed Systems

Posted on: 4/21/2026 3:10:12 AM

Table of Contents
1. What is observability and why does it matter?
   1. Monitoring vs. observability
2. The three pillars: Traces, Metrics, Logs
3. OpenTelemetry architecture
   1. Core components
4. OpenTelemetry Collector — heart of the system
   1. Two deployment modes
   2. Collector configuration (YAML)
      1. Memory Limiter is mandatory
5. Integrating OpenTelemetry with .NET
6. Smart sampling strategies
   1. Head-based vs. tail-based sampling
7. Building a complete observability stack
   1. Docker Compose for local development
   2. Self-host vs. managed service
8. Production best practices for 2026
Conclusion

CNCF #2 Most active project after Kubernetes

40+ Languages & frameworks supported

100+ Vendors with native integration

3 Pillars: Traces, Metrics, Logs

As distributed systems grow more complex — microservices calling each other, message queues in between, layered caches — the question "where did this fail?" becomes extremely hard to answer. You can't debug production with breakpoints. You need observability, and OpenTelemetry is becoming the one standard the whole industry agrees on.

1. What is observability and why does it matter?

Observability is the ability to understand a system's internal state purely from its output signals — without changing code or disrupting the main execution path. Unlike traditional monitoring (which tracks pre-known metrics), observability lets you answer questions you never asked ahead of time.

Monitoring vs. observability

Monitoring answers: "What's the CPU at?" or "Are requests/s over the threshold?"
Observability answers: "Why does a request from user X in the APAC region take 3 seconds instead of 200ms, and which service is the bottleneck?"

In a monolith, you can open a single log file and trace through a thread. But when a request moves through API Gateway → Auth Service → Order Service → Payment → Notification, each service has its own logs, timezones, and formats — you need a way to correlate them all.

2. The three pillars: Traces, Metrics, Logs

graph TD
    A[Telemetry Data] --> B[Traces]
    A --> C[Metrics]
    A --> D[Logs]
    B --> B1["Distributed Tracing
Track request flow"]
    B --> B2["Spans
Unit of work"]
    B --> B3["Context Propagation
W3C TraceContext"]
    C --> C1["Counters
Cumulative counts"]
    C --> C2["Gauges
Instantaneous values"]
    C --> C3["Histograms
Statistical distribution"]
    D --> D1["Structured Logs
Key-value pairs"]
    D --> D2["Correlation
Attach TraceId/SpanId"]
    D --> D3["Severity Levels
Info/Warn/Error"]

    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

The three telemetry pillars of OpenTelemetry

Traces — follow the request's journey

A trace represents a request's entire journey across a distributed system. Each trace consists of many spans — the smallest units of work, each with a name, start/end time, and attributes that describe the context.

sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant Auth as Auth Service
    participant Order as Order Service
    participant DB as Database
    participant Cache as Redis Cache

    Client->>Gateway: POST /orders (TraceId: abc123)
    Gateway->>Auth: Verify Token (SpanId: s1)
    Auth-->>Gateway: 200 OK (2ms)
    Gateway->>Order: Create Order (SpanId: s2)
    Order->>Cache: Check inventory (SpanId: s3)
    Cache-->>Order: Cache HIT (0.5ms)
    Order->>DB: INSERT order (SpanId: s4)
    DB-->>Order: OK (15ms)
    Order-->>Gateway: 201 Created (18ms)
    Gateway-->>Client: 201 Created (22ms)

Distributed trace across multiple services: each call (solid arrow) is recorded as a span

Each span contains:

TraceId: unique ID for the whole trace (propagated via the HTTP traceparent header)
SpanId: ID of the current span
ParentSpanId: parent-child linkage between spans
Attributes: key-value pairs such as http.request.method=POST, db.system=postgresql
Events: things that happened inside the span (e.g., "cache miss", "retry attempt")
Status: OK, ERROR, or UNSET
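
To make the TraceId propagation above concrete, here is a minimal Python sketch of reading and writing the W3C `traceparent` header. It only handles the version `00` layout (`version-traceid-parentid-flags`), and the names `SpanContext`, `parse_traceparent`, and `format_traceparent` are made up for this illustration; a real SDK does this for you.

```python
import re
from dataclasses import dataclass
from typing import Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class SpanContext:
    trace_id: str        # 32 hex chars, shared by every span in the trace
    parent_span_id: str  # 16 hex chars, the caller's span
    sampled: bool        # bit 0 of the flags byte


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the propagated context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero ids.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a service sends on its outgoing call."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A service that receives the header keeps the `trace_id`, records the incoming span id as its ParentSpanId, and sends a new header carrying its own span id downstream.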

Metrics — measure performance with numbers

Metrics are numerical measurements over time. The three metric types you will use most often in OpenTelemetry are:

Type	Description	Example	Use when
Counter	Cumulative, only increases	Total requests, total bytes sent	Counting events over time
Gauge	Instantaneous, goes up/down	CPU usage, active connections, queue length	Measuring current state
Histogram	Statistical distribution	Request latency (p50, p95, p99)	Analyzing value distributions
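
The difference between the three types is easier to see in code. The following is a toy Python sketch, not the OpenTelemetry API: the class names and the `quantile_upper_bound` helper are assumptions for illustration. It shows why a counter rejects negative values, why a gauge only remembers the last value, and how a histogram answers percentile questions from bucket counts alone.

```python
import bisect
from typing import List


class Counter:
    """Monotonic sum: add() rejects negative increments."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Last observed value; it may go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; upper bounds are inclusive."""

    def __init__(self, bounds: List[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = overflow bucket
        self.total = 0

    def record(self, value: float) -> None:
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        rank = q * self.total
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if count > 0 and seen >= rank:
                return self.bounds[index] if index < len(self.bounds) else float("inf")
        return float("inf")
```

Because only bucket counts are stored, a percentile read from a histogram is an estimate bounded by the bucket edges, so bucket boundaries are worth choosing around your latency targets.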

Logs — events with context

Logs in OpenTelemetry are not just text — they're structured logs with TraceId and SpanId attached automatically. That way, when you see an ERROR log entry, you can jump straight to the corresponding trace to see the whole request journey.

{
  "timestamp": "2026-04-21T10:15:30Z",
  "severity": "ERROR",
  "body": "Payment processing failed",
  "attributes": {
    "order.id": "ORD-98765",
    "payment.provider": "stripe",
    "error.type": "timeout"
  },
  "traceId": "abc123def456...",
  "spanId": "span789..."
}
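
How do the ids get onto the log record "automatically"? One way to picture it, sketched here with Python's standard `logging` module: the active span's ids live in a context variable, and a filter copies them onto every record. The `current_span` variable, `TraceContextFilter`, and `JsonFormatter` are hypothetical names for this sketch; an OpenTelemetry SDK provides its own equivalent.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active span's ids; a real SDK tracks this for you.
current_span = contextvars.ContextVar("current_span", default=("", ""))


class TraceContextFilter(logging.Filter):
    """Copies the active trace and span ids onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_span.get()
        return True  # never drops a record, only enriches it


class JsonFormatter(logging.Formatter):
    """Renders the record as a structured log line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "severity": record.levelname,
            "body": record.getMessage(),
            "traceId": record.trace_id,
            "spanId": record.span_id,
        })
```

Attach the filter and formatter to a handler once, and application code keeps calling `logger.error(...)` as before.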

Correlating Logs-Traces-Metrics

The real power is correlation: when a metric shows p99 latency spiking → filter traces with duration > 2s → find the slowest span → read that span's logs to understand root cause. Traditional monitoring just can't do that.
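
The drill-down described above can be sketched over in-memory data. This is an illustration in Python, not a query against any real backend: `Span`, `LogRecord`, and the three helper functions are invented for the example, and "slowest span" is simplified to the longest non-root span, which in a deeply nested trace may be an intermediate span rather than a leaf.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


@dataclass
class LogRecord:
    trace_id: str
    span_id: str
    severity: str
    body: str


def slow_traces(spans: List[Span], threshold_ms: float) -> List[str]:
    """Trace ids whose root span (no parent) exceeded the threshold."""
    return [s.trace_id for s in spans
            if s.parent_id is None and s.duration_ms > threshold_ms]


def slowest_child(spans: List[Span], trace_id: str) -> Optional[Span]:
    """The non-root span that took longest inside the given trace."""
    children = [s for s in spans
                if s.trace_id == trace_id and s.parent_id is not None]
    return max(children, key=lambda s: s.duration_ms, default=None)


def logs_for(logs: List[LogRecord], span: Span) -> List[str]:
    """Log bodies emitted while that span was active."""
    return [entry.body for entry in logs
            if entry.trace_id == span.trace_id and entry.span_id == span.span_id]
```

Each step only works because all three signals share the same TraceId and SpanId, which is the point of emitting them through one SDK.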

3. OpenTelemetry architecture

OpenTelemetry is not a product — it's a framework and toolkit composed of multiple components working together:

graph LR
    subgraph Application
        A1[Your Code] --> SDK[OTel SDK]
        A2[Auto-Instrumentation] --> SDK
        A3[Library Instrumentation] --> SDK
    end

    SDK -->|OTLP| C[OTel Collector]

    subgraph Collector
        C --> R[Receivers]
        R --> P[Processors]
        P --> E[Exporters]
    end

    E --> G[Grafana/Tempo]
    E --> J[Jaeger]
    E --> PR[Prometheus]
    E --> AZ[Azure Monitor]
    E --> DD[Datadog]

    style SDK fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style P fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style J fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PR fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style AZ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style DD fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

OpenTelemetry overview — from application to backend

Core components

API: standard interface for creating telemetry — library authors instrument code against it without depending on a specific implementation
SDK: implementation of the API, responsible for collecting, processing, and exporting data
Auto-Instrumentation: automatically captures telemetry from popular frameworks (ASP.NET Core, HttpClient, EF Core...) with no code changes
OTLP (OpenTelemetry Protocol): standard vendor-neutral transport supporting both gRPC and HTTP/protobuf
Semantic Conventions: standardized attribute names — http.request.method means the same thing in every language

4. OpenTelemetry Collector — heart of the system

The Collector is the middleman that receives, processes, and forwards telemetry. It acts as a smart proxy between apps and backends, decoupling instrumentation logic from delivery logic.

Two deployment modes

graph TD
    subgraph Agent Mode
        App1[App 1] --> CA[Collector Agent]
        App2[App 2] --> CA
        CA -->|Forward| CG
    end

    subgraph Gateway Mode
        CA2[Agent 1] --> CG[Collector Gateway]
        CA3[Agent 2] --> CG
        CG --> Backend[Observability Backend]
    end

    style CA fill:#e94560,stroke:#fff,color:#fff
    style CG fill:#2c3e50,stroke:#fff,color:#fff
    style Backend fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Agent mode (sidecar) vs. Gateway mode (centralized)

Attribute	Agent Mode	Gateway Mode
Deployment	Sidecar / DaemonSet next to the app	Standalone centralized service
Pros	Low latency, local processing	Central management, complex sampling
Cons	Uses resources on every node	Single point of failure if not HA
Fits	Kubernetes, edge computing	Multi-cluster, cross-region

Collector configuration (YAML)

The Collector is configured as a pipeline: Receivers → Processors → Exporters. Here's an example to use as a starting point for production:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      cert_file: /certs/client.crt
      key_file: /certs/client.key
  # Assumption: a Loki version with native OTLP ingestion on its HTTP port
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Order matters: memory_limiter first, batch last
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Tempo stores traces only, so logs go to Loki
      exporters: [otlphttp/loki]
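
The `send_batch_size` and `timeout` settings of the batch processor interact in a simple way: a batch is sent when it is full or when it is too old, whichever comes first. Here is a rough Python sketch of that rule. `BatchProcessor` and its methods are invented for illustration, time is passed in explicitly to keep the example deterministic, and the real processor's timer handling differs in detail.

```python
from typing import Callable, List


class BatchProcessor:
    """Buffers items and flushes when the batch is full or too old."""

    def __init__(self, export: Callable[[List[str]], None],
                 send_batch_size: int, timeout_s: float) -> None:
        self.export = export
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.buffer: List[str] = []
        self.first_item_at = 0.0

    def add(self, item: str, now: float) -> None:
        if not self.buffer:
            self.first_item_at = now  # the timeout counts from the first buffered item
        self.buffer.append(item)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()

    def tick(self, now: float) -> None:
        """Call periodically; sends a partial batch once the timeout has passed."""
        if self.buffer and now - self.first_item_at >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        batch, self.buffer = self.buffer, []
        self.export(batch)
```

Larger batches mean fewer, more efficient export calls; the timeout caps how long a quiet service waits before its telemetry shows up.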

Memory Limiter is mandatory

In production, always put memory_limiter first in each pipeline's processors list, ahead of every other processor. Otherwise, a traffic spike can get the Collector killed for running out of memory, losing all buffered telemetry. Set limit_mib to around 70–80% of the container's memory limit.
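
As a quick worked example of that sizing rule, the Python sketch below derives the values from a container's memory limit. The function name is made up, the 75% figure is the midpoint of the range above, and the 20% spike share is an assumption (the example config above uses 128 of 512, which is 25%). The Collector begins refusing data once usage passes the soft limit, which is `limit_mib` minus `spike_limit_mib`.

```python
def memory_limiter_settings(container_mib: int,
                            fraction: float = 0.75,
                            spike_fraction: float = 0.20) -> dict:
    """Derive memory_limiter values from the container's memory limit."""
    limit_mib = int(container_mib * fraction)
    spike_limit_mib = int(limit_mib * spike_fraction)
    return {
        "limit_mib": limit_mib,                          # hard limit
        "spike_limit_mib": spike_limit_mib,              # headroom for bursts between checks
        "soft_limit_mib": limit_mib - spike_limit_mib,   # data is refused above this
    }
```

For a 1 GiB container this gives a 768 MiB hard limit and a soft limit of 615 MiB, leaving the remaining quarter of the container for the Go runtime and everything outside the processor's accounting.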

5. Integrating OpenTelemetry with .NET

.NET has an unusual advantage: its telemetry APIs are built into the platform (ILogger, System.Diagnostics.Metrics, ActivitySource). The OpenTelemetry .NET SDK subscribes to these APIs and exports what they produce, so code that already uses them needs no rewrite.

graph LR
    subgraph ".NET Framework APIs"
        IL["ILogger<T>"]
        ME["Meter / Counter"]
        AS["ActivitySource / Activity"]
    end

    subgraph "OTel .NET SDK"
        IL --> LP[Log Provider]
        ME --> MP[Meter Provider]
        AS --> TP[Tracer Provider]
    end

    LP --> EX[OTLP Exporter]
    MP --> EX
    TP --> EX

    EX --> COL[Collector]

    style IL fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ME fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style AS fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LP fill:#e94560,stroke:#fff,color:#fff
    style MP fill:#e94560,stroke:#fff,color:#fff
    style TP fill:#e94560,stroke:#fff,color:#fff
    style EX fill:#2c3e50,stroke:#fff,color:#fff
    style COL fill:#2c3e50,stroke:#fff,color:#fff

.NET uses its native APIs; the OTel SDK only handles export

Install NuGet packages

dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

Configuration in Program.cs

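using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// One Collector endpoint for all three signals (OTLP over gRPC).
// Without an explicit endpoint, AddOtlpExporter() falls back to localhost.
var otlpEndpoint = new Uri("http://otel-collector:4317");

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "OrderService",
            serviceVersion: "1.0.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        // Register the custom ActivitySource; spans from unregistered sources are not collected.
        .AddSource("OrderService")
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Skip health-check requests at the SDK so they never leave the process.
            opts.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            opts.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(opts =>
        {
            // Puts SQL text on spans; review it for sensitive data before enabling.
            opts.SetDbStatementForText = true;
            opts.RecordException = true;
        })
        .AddOtlpExporter(opts =>
        {
            opts.Endpoint = otlpEndpoint;
            opts.Protocol = OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        // Register the custom Meter so its instruments are exported.
        .AddMeter("OrderService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(opts => opts.Endpoint = otlpEndpoint))
    .WithLogging(logging => logging
        .AddOtlpExporter(opts => opts.Endpoint = otlpEndpoint));

var app = builder.Build();
app.Run();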

Custom instrumentation — tracing business logic

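Auto-instrumentation covers HTTP, DB, gRPC. To trace business logic (order processing, pricing calculation, inventory check), add spans manually and register the ActivitySource and Meter names with the SDK (AddSource, AddMeter); otherwise that data is never exported: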

using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderService
{
    // Register both names with the SDK: AddSource("OrderService") and AddMeter("OrderService").
    private static readonly ActivitySource Source = new("OrderService");
    private static readonly Meter Meter = new("OrderService");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders.created");

    // ValidateInventoryAsync, CalculatePricingAsync and ProcessPaymentAsync
    // are the service's own methods (omitted here).
    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        // StartActivity returns null when nothing is listening, hence the ?. calls.
        using var activity = Source.StartActivity("CreateOrder");
        activity?.SetTag("order.customer_id", request.CustomerId);
        activity?.SetTag("order.items_count", request.Items.Count);

        try
        {
            // Validate inventory
            using (Source.StartActivity("ValidateInventory"))
            {
                await ValidateInventoryAsync(request.Items);
            }

            // Calculate pricing
            decimal total;
            using (var pricingSpan = Source.StartActivity("CalculatePricing"))
            {
                total = await CalculatePricingAsync(request.Items);
                pricingSpan?.SetTag("order.total", total);
            }

            // Process payment
            using (Source.StartActivity("ProcessPayment"))
            {
                await ProcessPaymentAsync(request.CustomerId, total);
            }

            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("region", request.Region));

            activity?.SetStatus(ActivityStatusCode.Ok);
            return new Order { Id = Guid.NewGuid(), Total = total };
        }
        catch (Exception ex)
        {
            // Mark the span as failed so error-based sampling and dashboards can find it.
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}

.NET Aspire — OTel included

If you're using .NET Aspire, OpenTelemetry is already wired up in the ServiceDefaults project. Just call builder.ConfigureOpenTelemetry() — tracing, metrics, and logging just work. The Aspire Dashboard even shows all telemetry locally in dev without Grafana/Jaeger.

6. Smart sampling strategies

At scale, collecting 100% of traces is rarely affordable: storage and network costs grow with every request. Sampling reduces volume while keeping the important data.

Head-based vs. tail-based sampling

graph TD
    subgraph "Head-based Sampling"
        H1[Request arrives] --> H2{Decide up front}
        H2 -->|Sample| H3[Collect trace]
        H2 -->|Drop| H4[Discard entirely]
    end

    subgraph "Tail-based Sampling"
        T1[Request arrives] --> T2[Collect ALL spans]
        T2 --> T3[Trace finishes]
        T3 --> T4{Evaluate the whole trace}
        T4 -->|Error/Slow| T5[Keep]
        T4 -->|Normal| T6[Apply ratio sampling]
    end

    style H1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H2 fill:#e94560,stroke:#fff,color:#fff
    style T1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T4 fill:#2c3e50,stroke:#fff,color:#fff

Head-based decides at the start; tail-based decides after the trace completes

Criterion	Head-based	Tail-based
Decision time	As the request starts	After the trace completes
Pros	Simple, low overhead	Can keep the error and slow traces that finish within decision_wait
Cons	Can miss error traces	Collector needs enough RAM to buffer; all spans of a trace must reach the same Collector instance
Fits	Very high traffic, limited budget	Production needing precise debugging

A common production combination:

# tail-sampling configuration on the Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep requests slower than 1 second
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep traces from critical endpoints
      - name: keep-critical-paths
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/orders"]
      # 5% of normal traces
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
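
The two decision styles, and the way the policies above combine, can be sketched in a few lines of Python. This is a simplified model rather than the Collector's implementation: `head_sample`, `FinishedTrace`, and `tail_sample` are invented names, and the trace-id hashing is only one plausible way to get a deterministic ratio.

```python
from dataclasses import dataclass
from typing import List


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide from the trace id alone, before any span exists."""
    # Deriving the decision from the id keeps it identical in every service.
    bucket = int(trace_id[-8:], 16) / 0x100000000  # value in [0, 1)
    return bucket < ratio


@dataclass
class FinishedTrace:
    trace_id: str
    has_error: bool
    duration_ms: float
    route: str


def tail_sample(trace: FinishedTrace, slow_ms: float,
                critical_routes: List[str], baseline_ratio: float) -> bool:
    """Tail-based: keep the trace if any policy matches (policies are OR-ed)."""
    if trace.has_error:
        return True                      # keep-errors
    if trace.duration_ms > slow_ms:
        return True                      # keep-slow
    if trace.route in critical_routes:
        return True                      # keep-critical-paths
    return head_sample(trace.trace_id, baseline_ratio)  # baseline share of the rest
```

Note that `tail_sample` needs the finished trace as input, which is exactly why the Collector has to buffer spans for `decision_wait` before it can decide.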

7. Building a complete observability stack

A popular, cost-effective production stack (fully self-hostable):

graph LR
    App[Applications] -->|OTLP| Col[OTel Collector]
    Col -->|Traces| Tempo[Grafana Tempo]
    Col -->|Metrics| Prom[Prometheus]
    Col -->|Logs| Loki[Grafana Loki]

    Tempo --> Graf[Grafana Dashboard]
    Prom --> Graf
    Loki --> Graf

    Graf --> Alert[Alertmanager]
    Alert --> PD[PagerDuty/Slack]

    style Col fill:#e94560,stroke:#fff,color:#fff
    style Graf fill:#2c3e50,stroke:#fff,color:#fff
    style Tempo fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Prom fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Loki fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Open-source observability stack: OTel + Grafana ecosystem

Pillar	Tool	Role	Cost
Traces	Grafana Tempo	Trace storage, lookup by TraceId	Free (self-host)
Metrics	Prometheus	Collection and querying (PromQL)	Free (self-host)
Logs	Grafana Loki	Log aggregation with label-based indexing	Free (self-host)
Visualization	Grafana	Dashboards, alerting, explore	Free (self-host)
Alerting	Alertmanager	Routes alerts → Slack, PagerDuty, Email	Free

Docker Compose for local development

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
      - "8889:8889"   # Prometheus metrics
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml   # default config path of the contrib image

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Local development only: never expose anonymous Admin access in production
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml

Self-host vs. managed service

Criterion	Self-host (Grafana Stack)	Managed (Datadog/New Relic)	Hybrid (Grafana Cloud Free)
Cost	Infra only (servers/storage)	$15-25/host/month	Free tier: 50GB logs, 10K metrics
Setup	Needs DevOps experience	5-minute setup	15-minute setup
Scaling	Manage your own HA and retention	Automatic	Free tier has limits
Vendor lock-in	None (OTel is standard)	High (proprietary features)	Low (OTel-compatible)
Fits	Larger teams with infra budget	Startups, small teams	Side projects, MVPs

8. Production best practices for 2026

Semantic Conventions — standardized naming

One of OpenTelemetry's greatest benefits is Semantic Conventions — standardized attribute names. When every service uses the same convention, cross-service queries are consistent:

Domain	Attribute	Meaning
HTTP	`http.request.method`	GET, POST, PUT, ...
HTTP	`http.response.status_code`	200, 404, 500, ...
HTTP	`url.path`	/api/orders
Database	`db.system`	postgresql, redis, mssql
Database	`db.operation.name`	SELECT, INSERT, findOne
Messaging	`messaging.system`	kafka, rabbitmq, azure_servicebus
Messaging	`messaging.destination.name`	orders-queue, events-topic

Key principles

1. Keep cardinality low

High-cardinality attributes (e.g., attaching user.id to every metric) will blow up the number of Prometheus time series. Only attach high-cardinality attributes to traces (storage is cheaper); metrics should stick to low-cardinality labels like region, status_code, endpoint.
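
A little arithmetic shows why. In the worst case, one metric produces a time series for every combination of label values, so the count is the product of each label's distinct values. The Python helper below is a sketch with invented numbers, not measurements from a real system.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-duration metric.
low_cardinality = {"region": 5, "status_code": 6, "endpoint": 40}
with_user_id = {**low_cardinality, "user_id": 100_000}
```

With these numbers the low-cardinality set tops out at 1,200 series, while adding `user_id` raises the ceiling to 120 million for the same metric.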

2. Filter health checks and noise

Drop traces from /health, /ready, /metrics endpoints. They create huge trace volume with zero debug value. Filter at the SDK level (not the collector) to save network.

3. Protect telemetry data

Telemetry can contain PII (email, tokens, query params). Use a redaction processor in the collector to mask/drop sensitive attributes before export. Always use TLS for OTLP endpoints in production.
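
What a redaction step does can be sketched in Python as follows. This is not the Collector processor itself: the blocked key list and the e-mail pattern are assumptions chosen for the example, and a real rule set needs to be built from your own data.

```python
import re
from typing import Dict

# Hypothetical deny-list; extend it to match the attributes your services emit.
BLOCKED_KEYS = {"authorization", "password", "set-cookie"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(attributes: Dict[str, str]) -> Dict[str, str]:
    """Drop blocked keys and mask e-mail addresses in the remaining values."""
    cleaned = {}
    for key, value in attributes.items():
        if key.lower() in BLOCKED_KEYS:
            continue  # drop the attribute entirely
        cleaned[key] = EMAIL.sub("[REDACTED]", value)
    return cleaned
```

Dropping by key is the safer of the two techniques; pattern-based masking only catches the formats you thought of in advance.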

Recommended rollout plan

Phase 1 — Foundation (weeks 1–2)

Basic setup: add the OTel SDK + auto-instrumentation to every service. Deploy the Collector in Agent mode. Connect to Grafana Cloud Free tier or local Jaeger to see your first traces.

Phase 2 — Enrichment (weeks 3–4)

Add context: custom spans for important business logic. Apply semantic conventions. Create custom metrics (orders/sec, payment success rate). Structured logging with TraceId correlation.

Phase 3 — Scale (weeks 5–6)

Production tuning: configure tail-based sampling. Tune batch processor and memory limiter. Set up Grafana dashboards for RED metrics (Rate, Errors, Duration). Create alert rules for SLO/SLI.

Phase 4 — Production-grade (weeks 7–8)

Hardening: HA for the Collector (2+ replicas). TLS everywhere. PII redaction. Retention policies. Team training and runbooks for incident response based on observability data.
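
The RED metrics named in Phase 3 are simple to define, which is part of their appeal. Here is a minimal Python sketch over one time window of request records. The `Request` type and `red_metrics` function are invented for illustration, errors are assumed to mean 5xx responses, and the p95 uses the nearest-rank method; a dashboard would normally compute these from histogram metrics instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float
    status_code: int


def red_metrics(requests: List[Request], window_s: float) -> dict:
    """Rate (req/s), Errors (share of 5xx) and Duration (p95) for one window."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status_code >= 500)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integers.
    rank = max(1, -(-95 * len(durations) // 100))
    return {
        "rate": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }
```

Alert rules for an SLO then reduce to thresholds on these three numbers per service.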

Conclusion

OpenTelemetry isn't just a library — it's the industry standard for observability. As the second most active CNCF project (after Kubernetes), with support from 100+ vendors and native .NET integration, adopting it is no longer a "should we?" question but a "where do we start?" one.

Key takeaways:

Start with traces — they deliver the fastest debug value in distributed systems
Auto-instrumentation first, manual later — don't try to cover everything from day one
The Collector belongs in every production setup: avoid exporting directly from app to backend
Tail-based sampling keeps the error and slow traces that head-based sampling would drop at random
Semantic conventions enable consistent cross-service queries — invest in standardization early

With a free stack (OTel Collector + Grafana + Tempo + Prometheus + Loki), you can build a production-grade observability system with no license costs — you only need time to set it up and operate it well.

#OpenTelemetry #Observability #Distributed Tracing #.NET #Grafana #Prometheus #system design #Microservices

# OpenTelemetry — The Observability Standard for Distributed Systems

CNCF #2 Most active project after Kubernetes

40+ Languages & frameworks supported

100+ Vendors with native integration

3 Pillars: Traces, Metrics, Logs

As distributed systems grow more complex — microservices calling each other, message queues in between, layered caches — the question *"where did this fail?"* becomes extremely hard to answer. You can't debug production with breakpoints. You need **observability**, and OpenTelemetry is becoming the one standard the whole industry agrees on.

## 1. What is observability and why does it matter?

Observability is the ability to understand a system's internal state purely from its output signals — **without changing code** or disrupting the main execution path. Unlike traditional monitoring (which tracks pre-known metrics), observability lets you answer questions you *never asked ahead of time*.

#### Monitoring vs. observability

**Monitoring** answers: "What's the CPU at?" or "Are requests/s over the threshold?"  
**Observability** answers: "Why does a request from user X in the APAC region take 3 seconds instead of 200ms, and which service is the bottleneck?"

## 2. The three pillars: Traces, Metrics, Logs

```
graph TD
    A[Telemetry Data] --> B[Traces]
    A --> C[Metrics]
    A --> D[Logs]
    B --> B1["Distributed Tracing  
Track request flow"]
    B --> B2["Spans  
Unit of time"]
    B --> B3["Context Propagation  
W3C TraceContext"]
    C --> C1["Counters  
Cumulative counts"]
    C --> C2["Gauges  
Instantaneous values"]
    C --> C3["Histograms  
Statistical distribution"]
    D --> D1["Structured Logs  
Key-value pairs"]
    D --> D2["Correlation  
Attach TraceId/SpanId"]
    D --> D3["Severity Levels  
Info/Warn/Error"]

style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50

```
The three telemetry pillars of OpenTelemetry

### Traces — follow the request's journey

A **trace** represents a request's entire journey across a distributed system. Each trace consists of many **spans** — the smallest units of work, each with a name, start/end time, and attributes that describe the context.

```
sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant Auth as Auth Service
    participant Order as Order Service
    participant DB as Database
    participant Cache as Redis Cache

Client->>Gateway: POST /orders (TraceId: abc123)
    Gateway->>Auth: Verify Token (SpanId: s1)
    Auth-->>Gateway: 200 OK (2ms)
    Gateway->>Order: Create Order (SpanId: s2)
    Order->>Cache: Check inventory (SpanId: s3)
    Cache-->>Order: Cache HIT (0.5ms)
    Order->>DB: INSERT order (SpanId: s4)
    DB-->>Order: OK (15ms)
    Order-->>Gateway: 201 Created (18ms)
    Gateway-->>Client: 201 Created (22ms)

```
Distributed trace across multiple services — each arrow is a span

Each span contains:

- **TraceId**: unique ID for the whole trace (propagated via the HTTP `traceparent` header)
- **SpanId**: ID of the current span
- **ParentSpanId**: parent-child linkage between spans
- **Attributes**: key-value pairs such as `http.method=POST`, `db.system=postgresql`
- **Events**: things that happened inside the span (e.g., "cache miss", "retry attempt")
- **Status**: OK, ERROR, or UNSET

### Metrics — measure performance with numbers

Metrics are numerical measurements over time. OpenTelemetry supports three core metric types:

| Type | Description | Example | Use when |
| --- | --- | --- | --- |
| **Counter** | Cumulative, only increases | Total requests, total bytes sent | Counting events over time |
| **Gauge** | Instantaneous, goes up/down | CPU usage, active connections, queue length | Measuring current state |
| **Histogram** | Statistical distribution | Request latency (p50, p95, p99) | Analyzing value distributions |

### Logs — events with context

Logs in OpenTelemetry are not just text — they're **structured logs** with TraceId and SpanId attached automatically. That way, when you see an ERROR log entry, you can jump straight to the corresponding trace to see the whole request journey.

```
{
  "timestamp": "2026-04-21T10:15:30Z",
  "severity": "ERROR",
  "body": "Payment processing failed",
  "attributes": {
    "order.id": "ORD-98765",
    "payment.provider": "stripe",
    "error.type": "timeout"
  },
  "traceId": "abc123def456...",
  "spanId": "span789..."
}
```

#### Correlating Logs-Traces-Metrics

The real power is **correlation**: when a metric shows p99 latency spiking → filter traces with duration > 2s → find the slowest span → read that span's logs to understand root cause. Traditional monitoring just can't do that.

## 3. OpenTelemetry architecture

OpenTelemetry is not a product — it's a **framework and toolkit** composed of multiple components working together:

```
graph LR
    subgraph Application
        A1[Your Code] --> SDK[OTel SDK]
        A2[Auto-Instrumentation] --> SDK
        A3[Library Instrumentation] --> SDK
    end

SDK -->|OTLP| C[OTel Collector]

subgraph Collector
        C --> R[Receivers]
        R --> P[Processors]
        P --> E[Exporters]
    end

E --> G[Grafana/Tempo]
    E --> J[Jaeger]
    E --> PR[Prometheus]
    E --> AZ[Azure Monitor]
    E --> DD[Datadog]

style SDK fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style R fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style P fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style J fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style PR fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style AZ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style DD fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

```
OpenTelemetry overview — from application to backend

### Core components

- **API**: standard interface for creating telemetry — library authors instrument code against it without depending on a specific implementation
- **SDK**: implementation of the API, responsible for collecting, processing, and exporting data
- **Auto-Instrumentation**: automatically captures telemetry from popular frameworks (ASP.NET Core, HttpClient, EF Core...) with no code changes
- **OTLP (OpenTelemetry Protocol)**: standard vendor-neutral transport supporting both gRPC and HTTP/protobuf
- **Semantic Conventions**: standardized attribute names — `http.request.method` means the same thing in every language

## 4. OpenTelemetry Collector — heart of the system

The Collector is the middleman that receives, processes, and forwards telemetry. It acts as a smart proxy between apps and backends, decoupling instrumentation logic from delivery logic.

### Two deployment modes

```
graph TD
    subgraph Agent Mode
        App1[App 1] --> CA[Collector Agent]
        App2[App 2] --> CA
        CA -->|Forward| CG
    end

subgraph Gateway Mode
        CA2[Agent 1] --> CG[Collector Gateway]
        CA3[Agent 2] --> CG
        CG --> Backend[Observability Backend]
    end

style CA fill:#e94560,stroke:#fff,color:#fff
    style CG fill:#2c3e50,stroke:#fff,color:#fff
    style Backend fill:#f8f9fa,stroke:#e94560,color:#2c3e50

```
Agent mode (sidecar) vs. Gateway mode (centralized)

| Attribute | Agent Mode | Gateway Mode |
| --- | --- | --- |
| **Deployment** | Sidecar / DaemonSet next to the app | Standalone centralized service |
| **Pros** | Low latency, local processing | Central management, complex sampling |
| **Cons** | Uses resources on every node | Single point of failure if not HA |
| **Fits** | Kubernetes, edge computing | Multi-cluster, cross-region |

### Collector configuration (YAML)

The Collector is configured as a pipeline: **Receivers → Processors → Exporters**. Here's a production-ready example:

```
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      cert_file: /certs/client.crt
      key_file: /certs/client.key
  # Tempo stores traces only, so logs go to Loki's native OTLP endpoint
  # (switch to https and add TLS settings in production)
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
```

#### Memory Limiter is mandatory

In production, **always** place the `memory_limiter` processor BEFORE other processors. Otherwise, a traffic spike can OOM the collector and drop all buffered telemetry. Configure `limit_mib` at around 70–80% of the container's available RAM.
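
If the container size changes between environments, the percentage-based settings of `memory_limiter` may be easier to maintain than fixed MiB values. The fragment below is a sketch of the 70–80% guidance above; the exact numbers are assumptions, and percentage limits rely on the Collector detecting the container's memory limit (Linux cgroups).

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Hard limit as a share of the memory available to the Collector
    limit_percentage: 75
    # Headroom for bursts; data is refused once usage passes (limit - spike)
    spike_limit_percentage: 20
```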

## 5. Integrating OpenTelemetry with .NET

.NET has an unusual advantage: its telemetry APIs are already built into the platform (`ILogger`, `System.Diagnostics.Metrics`, `ActivitySource`). The OpenTelemetry .NET SDK subscribes to these APIs and exports what they emit, so code that already uses them needs no rewrite.

```
graph LR
    subgraph ".NET Framework APIs"
        IL["ILogger<T>"]
        ME["Meter / Counter"]
        AS["ActivitySource / Activity"]
    end

    subgraph "OTel .NET SDK"
        IL --> LP[Log Provider]
        ME --> MP[Meter Provider]
        AS --> TP[Tracer Provider]
    end

    LP --> EX[OTLP Exporter]
    MP --> EX
    TP --> EX

    EX --> COL[Collector]

    style IL fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style ME fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style AS fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LP fill:#e94560,stroke:#fff,color:#fff
    style MP fill:#e94560,stroke:#fff,color:#fff
    style TP fill:#e94560,stroke:#fff,color:#fff
    style EX fill:#2c3e50,stroke:#fff,color:#fff
    style COL fill:#2c3e50,stroke:#fff,color:#fff
```
.NET uses its native APIs; the OTel SDK only handles export
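
To make the "hook" idea concrete, here is a small sketch that uses only `System.Diagnostics`, with no OpenTelemetry package at all. It attaches an `ActivityListener` by hand, which is roughly what the SDK does for each source you register. The source name `Demo.Orders` and the span names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// "Demo.Orders" is a made-up source name for this sketch
var source = new ActivitySource("Demo.Orders");
var stopped = new List<string>();

// 1) No listener yet: StartActivity returns null and nothing is recorded
var orphan = source.StartActivity("NoListener");
var orphanState = orphan is null ? "null" : "created";
Console.WriteLine($"without listener: {orphanState}");

// 2) Attach a listener that accepts every activity from our source
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Orders",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = a => stopped.Add(a.OperationName)
};
ActivitySource.AddActivityListener(listener);

// 3) Same API call, but now an Activity (span) is created and reported on stop
using (var span = source.StartActivity("WithListener"))
{
    span?.SetTag("order.items_count", 3);
}

Console.WriteLine($"with listener: {string.Join(", ", stopped)}");
```

The first call yields `null` because nobody is listening, which is also why application code uses the `activity?.` null-conditional pattern everywhere.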

### Install NuGet packages

```
dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
```

### Configuration in Program.cs

```
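using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Shared OTLP settings so traces, metrics, and logs all reach the same Collector
Action<OtlpExporterOptions> configureOtlp = opts =>
{
    opts.Endpoint = new Uri("http://otel-collector:4317");
    opts.Protocol = OtlpExportProtocol.Grpc;
};

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "OrderService",
            serviceVersion: "1.0.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        // Register the custom ActivitySource; spans from unregistered sources are never created
        .AddSource("OrderService")
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Skip health-check requests
            opts.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            opts.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(opts =>
        {
            opts.SetDbStatementForText = true;
            opts.RecordException = true;
        })
        .AddOtlpExporter(configureOtlp))
    .WithMetrics(metrics => metrics
        // Register the custom Meter used for business metrics
        .AddMeter("OrderService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(configureOtlp))
    .WithLogging(logging => logging
        .AddOtlpExporter(configureOtlp));

var app = builder.Build();
app.Run();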
```

### Custom instrumentation — tracing business logic

Auto-instrumentation covers HTTP, DB, gRPC. To trace **business logic** (order processing, pricing calculation, inventory check), you need to add spans manually:

```
using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderService
{
    // Both names must be registered in Program.cs via AddSource("OrderService")
    // and AddMeter("OrderService"); otherwise StartActivity returns null.
    private static readonly ActivitySource Source = new("OrderService");
    private static readonly Meter Meter = new("OrderService");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders.created");

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        // Parent span for the whole operation
        using var activity = Source.StartActivity("CreateOrder");
        activity?.SetTag("order.customer_id", request.CustomerId);
        activity?.SetTag("order.items_count", request.Items.Count);

        try
        {
            // Validate inventory
            using (Source.StartActivity("ValidateInventory"))
            {
                await ValidateInventoryAsync(request.Items);
            }

            // Calculate pricing
            decimal total;
            using (var pricingSpan = Source.StartActivity("CalculatePricing"))
            {
                total = await CalculatePricingAsync(request.Items);
                // OTLP attributes have no decimal type, so record the total as a double
                pricingSpan?.SetTag("order.total", (double)total);
            }

            // Process payment
            using (Source.StartActivity("ProcessPayment"))
            {
                await ProcessPaymentAsync(request.CustomerId, total);
            }

            // Metric tags stay low-cardinality (region only)
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("region", request.Region));

            activity?.SetStatus(ActivityStatusCode.Ok);
            return new Order { Id = Guid.NewGuid(), Total = total };
        }
        catch (Exception ex)
        {
            // Mark the span as failed so error-based sampling policies can match it
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

#### .NET Aspire — OTel included

If you're using **.NET Aspire**, OpenTelemetry is already wired up in the `ServiceDefaults` project. Call `builder.AddServiceDefaults()`, which invokes `ConfigureOpenTelemetry()`, and tracing, metrics, and logging are configured for you. The Aspire Dashboard also shows all telemetry locally during development, without Grafana or Jaeger.

## 6. Smart sampling strategies

At scale, collecting 100% of traces is rarely affordable: storage and network costs grow with every request. Sampling reduces volume while keeping the important data.

### Head-based vs. tail-based sampling

```
graph TD
    subgraph "Head-based Sampling"
        H1[Request arrives] --> H2{Decide up front}
        H2 -->|Sample| H3[Collect trace]
        H2 -->|Drop| H4[Discard entirely]
    end

    subgraph "Tail-based Sampling"
        T1[Request arrives] --> T2[Collect ALL spans]
        T2 --> T3[Trace finishes]
        T3 --> T4{Evaluate the whole trace}
        T4 -->|Error/Slow| T5[Keep]
        T4 -->|Normal| T6[Apply ratio sampling]
    end

    style H1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H2 fill:#e94560,stroke:#fff,color:#fff
    style T1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T4 fill:#2c3e50,stroke:#fff,color:#fff
```
Head-based decides at the start; tail-based decides after the trace completes

| Criterion | Head-based | Tail-based |
| --- | --- | --- |
| **Decision time** | As the request starts | After the trace completes |
| **Pros** | Simple, low overhead | Keeps every error and slow request |
| **Cons** | Can miss error traces | Needs a collector with enough RAM to buffer |
| **Fits** | Very high traffic, limited budget | Production needing precise debugging |
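
For the head-based column, the decision is made inside the SDK when the root span starts. A minimal sketch for a .NET service, assuming the packages installed earlier and an illustrative 10% ratio, could look like this:

```csharp
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        // Root spans: keep roughly 10% (illustrative ratio), chosen from the trace ID.
        // Child spans: follow the parent's decision so traces stay complete.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddOtlpExporter());

var app = builder.Build();
app.Run();
```

Because the decision happens before the request's outcome is known, an error in a dropped trace is lost, which is the trade-off the table describes.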

A common production combination:

```
# tail-sampling configuration on the Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep requests slower than 1 second
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep traces from critical endpoints
      - name: keep-critical-paths
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/orders"]
      # 5% of normal traces
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```
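
One caveat with more than one Collector replica: a tail-sampling decision needs all spans of a trace on the same instance. A common approach is a first tier that routes by trace ID using the `loadbalancing` exporter from the contrib distribution. The fragment below is a sketch; the DNS name is hypothetical, and receivers and processors are assumed to be defined as in the main configuration.

```yaml
# First-tier Collector: routes spans so each trace lands on one sampling Collector
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true      # sketch only; use TLS in production
    resolver:
      dns:
        hostname: otel-sampling.observability.svc.cluster.local   # hypothetical headless service
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]       # assumed to be defined as in the main config
      exporters: [loadbalancing]
```

The second tier then runs the `tail_sampling` processor shown above.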

## 7. Building a complete observability stack

A popular, cost-effective production stack (fully self-hostable):

```
graph TD
    Col[OTel Collector] --> Tempo[Grafana Tempo]
    Col --> Prom[Prometheus]
    Col --> Loki[Grafana Loki]

    Tempo --> Graf[Grafana Dashboard]
    Prom --> Graf
    Loki --> Graf

    Graf --> Alert[Alertmanager]
    Alert --> PD[PagerDuty/Slack]

    style Col fill:#e94560,stroke:#fff,color:#fff
    style Graf fill:#2c3e50,stroke:#fff,color:#fff
    style Tempo fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Prom fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style Loki fill:#f8f9fa,stroke:#e94560,color:#2c3e50
```
Open-source observability stack: OTel + Grafana ecosystem

| Pillar | Tool | Role | Cost |
| --- | --- | --- | --- |
| **Traces** | Grafana Tempo | Trace storage, lookup by TraceId | Free (self-host) |
| **Metrics** | Prometheus | Collection and querying (PromQL) | Free (self-host) |
| **Logs** | Grafana Loki | Log aggregation with label-based indexing | Free (self-host) |
| **Visualization** | Grafana | Dashboards, alerting, explore | Free (self-host) |
| **Alerting** | Alertmanager | Routes alerts → Slack, PagerDuty, Email | Free |

### Docker Compose for local development

```
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
      - "8889:8889"   # Prometheus metrics
    volumes:
      # The contrib image reads its config from /etc/otelcol-contrib/
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Anonymous admin access is for local development only
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml
```
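
The Compose file mounts a `prometheus.yml` that is not shown above. A minimal version, assuming the Collector exposes its Prometheus exporter on port 8889 as configured earlier, might be:

```yaml
# prometheus.yml: scrape the metrics the Collector exposes
global:
  scrape_interval: 15s          # illustrative interval

scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]   # Compose service name and exporter port
```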

### Self-host vs. managed service

| Criterion | Self-host (Grafana Stack) | Managed (Datadog/New Relic) | Hybrid (Grafana Cloud Free) |
| --- | --- | --- | --- |
| **Cost** | Infra only (servers/storage) | $15-25/host/month | Free tier: 50GB logs, 10K metrics |
| **Setup** | Needs DevOps experience | 5-minute setup | 15-minute setup |
| **Scaling** | Manage your own HA and retention | Automatic | Free tier has limits |
| **Vendor lock-in** | None (OTel is standard) | High (proprietary features) | Low (OTel-compatible) |
| **Fits** | Larger teams with infra budget | Startups, small teams | Side projects, MVPs |

## 8. Production best practices for 2026

### Semantic Conventions — standardized naming

One of OpenTelemetry's greatest benefits is **Semantic Conventions** — standardized attribute names. When every service uses the same convention, cross-service queries are consistent:

| Domain | Attribute | Meaning |
| --- | --- | --- |
| **HTTP** | `http.request.method` | GET, POST, PUT, ... |
| **HTTP** | `http.response.status_code` | 200, 404, 500, ... |
| **HTTP** | `url.path` | /api/orders |
| **Database** | `db.system.name` | postgresql, redis, microsoft.sql_server |
| **Database** | `db.operation.name` | SELECT, INSERT, findOne |
| **Messaging** | `messaging.system` | kafka, rabbitmq, servicebus |
| **Messaging** | `messaging.destination.name` | orders-queue, events-topic |

### Key principles

#### 1. Keep cardinality low

High-cardinality attributes (e.g., attaching `user.id` to every metric) will blow up the number of Prometheus time series. Only attach high-cardinality attributes to **traces** (storage is cheaper); metrics should stick to low-cardinality labels like `region`, `status_code`, `endpoint`.
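
To see why, it helps to do the arithmetic. The sketch below multiplies the number of distinct values per label to get the worst-case series count for one metric. All numbers are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative numbers only: distinct values per label on one counter
long[] lowCardinalityLabels = { 5, 6, 40 };   // region, status_code, endpoint
long distinctUsers = 100_000;                 // user.id

// Worst case: one time series per combination of label values
long SeriesUpperBound(IEnumerable<long> distinctValuesPerLabel) =>
    distinctValuesPerLabel.Aggregate(1L, (acc, n) => acc * n);

long baseline = SeriesUpperBound(lowCardinalityLabels);
long withUserId = SeriesUpperBound(lowCardinalityLabels.Append(distinctUsers));

Console.WriteLine($"labels only:  {baseline} series");
Console.WriteLine($"with user.id: {withUserId} series");
```

With these assumed numbers the upper bound goes from 1200 series to 120000000 for a single metric, which is why `user.id` belongs on spans rather than on metric labels.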

#### 2. Filter health checks and noise

Drop traces from `/health`, `/ready`, `/metrics` endpoints. They create huge trace volume with zero debug value. Filter at the SDK level (not the collector) to save network.

#### 3. Protect telemetry data

Telemetry can contain PII (emails, tokens, query parameters). Use the `redaction` processor in the Collector to mask or drop sensitive attributes before export. Always use TLS for OTLP endpoints in production.
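
A sketch of what that scrubbing step might look like with the contrib `attributes` and `redaction` processors follows. The attribute names and regular expressions are examples only and would need tuning for real data.

```yaml
processors:
  # Remove or hash attributes by exact key
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email            # hypothetical attribute name
        action: hash
  # Mask attribute values that match sensitive patterns
  redaction:
    allow_all_keys: true
    blocked_values:
      - "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"   # email addresses
      - "4[0-9]{12}(?:[0-9]{3})?"                             # Visa-style card numbers
    summary: info
```

Both processors would sit after `memory_limiter` and before `batch` in each pipeline, for example `processors: [memory_limiter, attributes/scrub, redaction, batch]`.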

### Recommended rollout plan

#### Phase 1: Foundation (weeks 1–2)

**Basic setup:** add the OTel SDK + auto-instrumentation to every service. Deploy the Collector in Agent mode. Connect to Grafana Cloud Free tier or local Jaeger to see your first traces.

#### Phase 2: Enrichment (weeks 3–4)

**Add context:** custom spans for important business logic. Apply semantic conventions. Create custom metrics (orders/sec, payment success rate). Structured logging with TraceId correlation.

#### Phase 3: Scale (weeks 5–6)

**Production tuning:** configure tail-based sampling. Tune batch processor and memory limiter. Set up Grafana dashboards for RED metrics (Rate, Errors, Duration). Create alert rules for SLO/SLI.

#### Phase 4: Production-grade (weeks 7–8)

**Hardening:** HA for the Collector (2+ replicas). TLS everywhere. PII redaction. Retention policies. Team training and runbooks for incident response based on observability data.

## Conclusion

OpenTelemetry isn't just a library — it's the **industry standard** for observability. As the second most active CNCF project (after Kubernetes), with support from 100+ vendors and native .NET integration, adopting it is no longer a "should we?" question but a "where do we start?" one.

Key takeaways:

- **Start with traces** — they deliver the fastest debug value in distributed systems
- **Auto-instrumentation first**, manual later — don't try to cover everything from day one
- **The Collector is mandatory** — never export directly from app to backend in production
- **Tail-based sampling** keeps the error and slow traces that head-based sampling would drop at random
- **Semantic conventions** enable consistent cross-service queries — invest in standardization early

### References

- [OpenTelemetry — What is OpenTelemetry?](https://opentelemetry.io/docs/what-is-opentelemetry/)
- [.NET Observability with OpenTelemetry — Microsoft Learn](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/observability-with-otel)
- [OpenTelemetry eBPF Instrumentation 2026 Goals](https://opentelemetry.io/blog/2026/obi-goals/)
- [Can OpenTelemetry Save Observability in 2026? — The New Stack](https://thenewstack.io/can-opentelemetry-save-observability-in-2026/)
- [Grafana Tempo Documentation](https://grafana.com/docs/tempo/latest/)
