OpenTelemetry — The Observability Standard for Distributed Systems
Posted on: 4/21/2026 3:10:12 AM
Table of contents
- 1. What is observability and why does it matter?
- 2. The three pillars: Traces, Metrics, Logs
- 3. OpenTelemetry architecture
- 4. OpenTelemetry Collector — heart of the system
- 5. Integrating OpenTelemetry with .NET
- 6. Smart sampling strategies
- 7. Building a complete observability stack
- 8. Production best practices for 2026
- Conclusion
As distributed systems grow more complex — microservices calling each other, message queues in between, layered caches — the question "where did this fail?" becomes extremely hard to answer. You can't debug production with breakpoints. You need observability, and OpenTelemetry is becoming the one standard the whole industry agrees on.
1. What is observability and why does it matter?
Observability is the ability to understand a system's internal state purely from its output signals — without changing code or disrupting the main execution path. Unlike traditional monitoring (which tracks pre-known metrics), observability lets you answer questions you never asked ahead of time.
Monitoring vs. observability
Monitoring answers: "What's the CPU at?" or "Are requests/s over the threshold?"
Observability answers: "Why does a request from user X in the APAC region take 3 seconds instead of 200ms, and which service is the bottleneck?"
In a monolith, you can open a single log file and trace through a thread. But when a request moves through API Gateway → Auth Service → Order Service → Payment → Notification, each service has its own logs, timezones, and formats — you need a way to correlate them all.
2. The three pillars: Traces, Metrics, Logs
graph TD
A[Telemetry Data] --> B[Traces]
A --> C[Metrics]
A --> D[Logs]
B --> B1["Distributed Tracing<br/>Track request flow"]
B --> B2["Spans<br/>Unit of work"]
B --> B3["Context Propagation<br/>W3C TraceContext"]
C --> C1["Counters<br/>Cumulative counts"]
C --> C2["Gauges<br/>Instantaneous values"]
C --> C3["Histograms<br/>Statistical distribution"]
D --> D1["Structured Logs<br/>Key-value pairs"]
D --> D2["Correlation<br/>Attach TraceId/SpanId"]
D --> D3["Severity Levels<br/>Info/Warn/Error"]
style A fill:#e94560,stroke:#fff,color:#fff
style B fill:#2c3e50,stroke:#fff,color:#fff
style C fill:#2c3e50,stroke:#fff,color:#fff
style D fill:#2c3e50,stroke:#fff,color:#fff
style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
The three telemetry pillars of OpenTelemetry
Traces — follow the request's journey
A trace represents a request's entire journey across a distributed system. Each trace consists of many spans — the smallest units of work, each with a name, start/end time, and attributes that describe the context.
sequenceDiagram
participant Client
participant Gateway as API Gateway
participant Auth as Auth Service
participant Order as Order Service
participant DB as Database
participant Cache as Redis Cache
Client->>Gateway: POST /orders (TraceId: abc123)
Gateway->>Auth: Verify Token (SpanId: s1)
Auth-->>Gateway: 200 OK (2ms)
Gateway->>Order: Create Order (SpanId: s2)
Order->>Cache: Check inventory (SpanId: s3)
Cache-->>Order: Cache HIT (0.5ms)
Order->>DB: INSERT order (SpanId: s4)
DB-->>Order: OK (15ms)
Order-->>Gateway: 201 Created (18ms)
Gateway-->>Client: 201 Created (22ms)
Distributed trace across multiple services: each request arrow opens a span, and the dashed reply marks where it ends
Each span contains:
- TraceId: unique ID for the whole trace (propagated via the HTTP traceparent header)
- SpanId: ID of the current span
- ParentSpanId: parent-child linkage between spans
- Attributes: key-value pairs such as http.request.method=POST, db.system=postgresql
- Events: things that happened inside the span (e.g., "cache miss", "retry attempt")
- Status: OK, ERROR, or UNSET
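To make the propagation step concrete, here is a minimal Python sketch of how a receiving service might read the W3C traceparent header and recover the TraceId and parent SpanId. The names `parse_traceparent` and `SpanContext` are made up for this illustration; in practice the OpenTelemetry SDK does this for you, and the sketch only handles the four-field layout defined for version 00.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str        # shared by every span in the trace
    parent_span_id: str  # becomes ParentSpanId of the span created here
    sampled: bool        # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the propagated context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec reserves version ff and forbids all-zero IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A service that gets `None` back would typically start a fresh trace rather than reject the request.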
Metrics — measure performance with numbers
Metrics are numerical measurements over time. The three metric types you will use most often are listed below (the API also defines an UpDownCounter, a counter that is allowed to decrease):
| Type | Description | Example | Use when |
|---|---|---|---|
| Counter | Cumulative, only increases | Total requests, total bytes sent | Counting events over time |
| Gauge | Instantaneous, goes up/down | CPU usage, active connections, queue length | Measuring current state |
| Histogram | Statistical distribution | Request latency (p50, p95, p99) | Analyzing value distributions |
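The difference between the three types is easiest to see side by side. The following Python sketch is a toy model, not the OpenTelemetry metrics API: it keeps raw latency samples and computes nearest-rank percentiles, whereas a real histogram instrument aggregates into buckets. The sample latencies are invented.

```python
import math
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


requests_total = 0               # counter: only ever goes up
active_connections = 0           # gauge: current value, moves both ways
latencies_ms: List[float] = []   # histogram input: one sample per request

for latency in [12, 15, 11, 14, 13, 240, 16, 12, 18, 900]:
    active_connections += 1      # request starts
    requests_total += 1
    latencies_ms.append(latency)
    active_connections -= 1      # request ends

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

With these samples the median stays low while the mean is pulled up by two outliers, which is why latency is tracked as a distribution rather than an average.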
Logs — events with context
Logs in OpenTelemetry are not just text — they're structured logs with TraceId and SpanId attached automatically. That way, when you see an ERROR log entry, you can jump straight to the corresponding trace to see the whole request journey.
{
  "timestamp": "2026-04-21T10:15:30Z",
  "severity": "ERROR",
  "body": "Payment processing failed",
  "attributes": {
    "order.id": "ORD-98765",
    "payment.provider": "stripe",
    "error.type": "timeout"
  },
  "traceId": "abc123def456...",
  "spanId": "span789..."
}
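As a rough illustration of how the IDs end up on every log line, here is a Python sketch using only the standard library. The context variables `current_trace_id` and `current_span_id` and the `render` helper are hypothetical stand-ins; an OpenTelemetry SDK tracks the active span and enriches log records on its own.

```python
import contextvars
import json
import logging
from typing import Dict

# Hypothetical holders for the active trace context; a real SDK manages these.
current_trace_id = contextvars.ContextVar("trace_id", default="")
current_span_id = contextvars.ContextVar("span_id", default="")


class JsonTraceFormatter(logging.Formatter):
    """Render a record as JSON and stamp it with the active trace and span IDs."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "severity": record.levelname,
            "body": record.getMessage(),
            "attributes": getattr(record, "attributes", {}),
            "traceId": current_trace_id.get(),
            "spanId": current_span_id.get(),
        }
        return json.dumps(entry)


def render(level: int, message: str, attributes: Dict[str, str]) -> str:
    """Build one record and format it without touching global logging config."""
    record = logging.LogRecord("payments", level, "app.py", 0, message, None, None)
    record.attributes = attributes
    return JsonTraceFormatter().format(record)


# Simulate being inside a span, then emit an error log.
current_trace_id.set("abc123def456")
current_span_id.set("span789")
line = render(logging.ERROR, "Payment processing failed", {"order.id": "ORD-98765"})
```

The application code never passes the IDs explicitly; the formatter picks them up from the ambient context, which is the property that makes log-to-trace navigation possible.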
Correlating Logs-Traces-Metrics
The real power is correlation: when a metric shows p99 latency spiking → filter traces with duration > 2s → find the slowest span → read that span's logs to understand root cause. Traditional monitoring just can't do that.
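The workflow above can be sketched as a small join over in-memory data. This Python example is illustrative only: the `Span` and `LogLine` shapes, the sample data, and the rule "slowest child span by total duration" are all assumptions, and a real backend would run the equivalent query for you.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float


@dataclass
class LogLine:
    trace_id: str
    span_id: str
    body: str


def explain_slow_traces(
    spans: List[Span], logs: List[LogLine], threshold_ms: float
) -> Dict[str, List[str]]:
    """Map each slow trace ID to the log bodies of its slowest child span."""
    findings: Dict[str, List[str]] = {}
    for root in spans:
        if root.parent_id is not None or root.duration_ms <= threshold_ms:
            continue  # only consider roots of traces over the threshold
        children = [
            s for s in spans
            if s.trace_id == root.trace_id and s.parent_id is not None
        ]
        if not children:
            continue
        slowest = max(children, key=lambda s: s.duration_ms)
        findings[root.trace_id] = [
            line.body
            for line in logs
            if (line.trace_id, line.span_id) == (slowest.trace_id, slowest.span_id)
        ]
    return findings


spans = [
    Span("t1", "s0", None, "POST /orders", 3000),
    Span("t1", "s1", "s0", "Verify Token", 2),
    Span("t1", "s2", "s0", "ProcessPayment", 2900),
    Span("t2", "s0", None, "POST /orders", 22),
    Span("t2", "s1", "s0", "INSERT order", 15),
]
logs = [
    LogLine("t1", "s1", "token accepted"),
    LogLine("t1", "s2", "Payment processing failed: timeout"),
    LogLine("t2", "s1", "order inserted"),
]
report = explain_slow_traces(spans, logs, threshold_ms=2000)
```

The join only works because every signal carries the same trace and span IDs; without them, the slow request and the timeout log would be two unrelated facts.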
3. OpenTelemetry architecture
OpenTelemetry is not a product — it's a framework and toolkit composed of multiple components working together:
graph LR
subgraph Application
A1[Your Code] --> SDK[OTel SDK]
A2[Auto-Instrumentation] --> SDK
A3[Library Instrumentation] --> SDK
end
SDK -->|OTLP| C[OTel Collector]
subgraph Collector
C --> R[Receivers]
R --> P[Processors]
P --> E[Exporters]
end
E --> G[Grafana/Tempo]
E --> J[Jaeger]
E --> PR[Prometheus]
E --> AZ[Azure Monitor]
E --> DD[Datadog]
style SDK fill:#e94560,stroke:#fff,color:#fff
style C fill:#2c3e50,stroke:#fff,color:#fff
style R fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style P fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style G fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style J fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style PR fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style AZ fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style DD fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
OpenTelemetry overview — from application to backend
Core components
- API: standard interface for creating telemetry — library authors instrument code against it without depending on a specific implementation
- SDK: implementation of the API, responsible for collecting, processing, and exporting data
- Auto-Instrumentation: automatically captures telemetry from popular frameworks (ASP.NET Core, HttpClient, EF Core...) with no code changes
- OTLP (OpenTelemetry Protocol): standard vendor-neutral transport supporting both gRPC and HTTP/protobuf
- Semantic Conventions: standardized attribute names, so http.request.method means the same thing in every language
4. OpenTelemetry Collector — heart of the system
The Collector is the middleman that receives, processes, and forwards telemetry. It acts as a smart proxy between apps and backends, decoupling instrumentation logic from delivery logic.
Two deployment modes
graph TD
subgraph Agent Mode
App1[App 1] --> CA[Collector Agent]
App2[App 2] --> CA
CA -->|Forward| CG
end
subgraph Gateway Mode
CA2[Agent 1] --> CG[Collector Gateway]
CA3[Agent 2] --> CG
CG --> Backend[Observability Backend]
end
style CA fill:#e94560,stroke:#fff,color:#fff
style CG fill:#2c3e50,stroke:#fff,color:#fff
style Backend fill:#f8f9fa,stroke:#e94560,color:#2c3e50
Agent mode (sidecar) vs. Gateway mode (centralized)
| Attribute | Agent Mode | Gateway Mode |
|---|---|---|
| Deployment | Sidecar / DaemonSet next to the app | Standalone centralized service |
| Pros | Low latency, local processing | Central management, complex sampling |
| Cons | Uses resources on every node | Single point of failure if not HA |
| Fits | Kubernetes, edge computing | Multi-cluster, cross-region |
Collector configuration (YAML)
The Collector is configured as a pipeline: Receivers → Processors → Exporters. Here's a production-ready example:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      cert_file: /certs/client.crt
      key_file: /certs/client.key
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp
  # Tempo stores traces only, so logs go to Loki's OTLP endpoint (host and port assumed)
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
Memory Limiter is mandatory
In production, always place the memory_limiter processor BEFORE other processors. Otherwise, a traffic spike can OOM the collector and drop all buffered telemetry. Configure limit_mib at around 70–80% of the container's available RAM.
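Both rules are easy to check mechanically, for example in a CI step that lints the Collector config. The Python sketch below assumes the pipelines have already been loaded into a plain dictionary of processor lists; the function names are invented for this example, and 0.75 is simply the midpoint of the 70-80% range suggested above.

```python
from typing import Dict, List


def pipelines_missing_limiter_first(pipelines: Dict[str, List[str]]) -> List[str]:
    """Return the names of pipelines whose first processor is not memory_limiter."""
    return [
        name
        for name, processors in pipelines.items()
        # startswith also accepts named instances such as memory_limiter/traces
        if not processors or not processors[0].startswith("memory_limiter")
    ]


def suggested_limit_mib(container_ram_mib: int, fraction: float = 0.75) -> int:
    """Pick limit_mib as a fraction of the container's memory."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(container_ram_mib * fraction)


pipelines = {
    "traces": ["memory_limiter", "tail_sampling", "batch"],
    "metrics": ["batch", "memory_limiter"],  # wrong order
    "logs": [],                              # no limiter at all
}
offenders = pipelines_missing_limiter_first(pipelines)
limit = suggested_limit_mib(1024)
```

Here `offenders` flags the metrics and logs pipelines, and a 1 GiB container yields a limit of 768 MiB.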
5. Integrating OpenTelemetry with .NET
.NET has an unusual advantage: telemetry APIs are already baked into the framework (ILogger, System.Diagnostics.Metrics, ActivitySource). The OpenTelemetry .NET SDK hooks into these APIs and exports what they produce, so existing instrumentation code does not need to be rewritten.
graph LR
subgraph ".NET Framework APIs"
IL["ILogger<T>"]
ME["Meter / Counter"]
AS["ActivitySource / Activity"]
end
subgraph "OTel .NET SDK"
IL --> LP[Log Provider]
ME --> MP[Meter Provider]
AS --> TP[Tracer Provider]
end
LP --> EX[OTLP Exporter]
MP --> EX
TP --> EX
EX --> COL[Collector]
style IL fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style ME fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style AS fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style LP fill:#e94560,stroke:#fff,color:#fff
style MP fill:#e94560,stroke:#fff,color:#fff
style TP fill:#e94560,stroke:#fff,color:#fff
style EX fill:#2c3e50,stroke:#fff,color:#fff
style COL fill:#2c3e50,stroke:#fff,color:#fff
.NET uses its native APIs; the OTel SDK only handles export
Install NuGet packages
dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
Configuration in Program.cs
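using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "OrderService",
            serviceVersion: "1.0.0",
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        // Register the custom ActivitySource, or its spans are never collected
        .AddSource("OrderService")
        .AddAspNetCoreInstrumentation(opts =>
        {
            // Skip health probes so they do not flood the trace backend
            opts.Filter = ctx => !ctx.Request.Path.StartsWithSegments("/health");
            opts.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(opts =>
        {
            // Captures SQL text; review for sensitive literals before enabling
            opts.SetDbStatementForText = true;
            opts.RecordException = true;
        })
        .AddOtlpExporter(opts =>
        {
            opts.Endpoint = new Uri("http://otel-collector:4317");
            opts.Protocol = OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        // Register the custom Meter so orders.created is exported
        .AddMeter("OrderService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        // With no options, the exporter targets http://localhost:4317
        // unless OTEL_EXPORTER_OTLP_ENDPOINT is set
        .AddOtlpExporter())
    .WithLogging(logging => logging
        .AddOtlpExporter());

var app = builder.Build();
app.Run();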
Custom instrumentation — tracing business logic
Auto-instrumentation covers HTTP, DB, gRPC. To trace business logic (order processing, pricing calculation, inventory check), you need to add spans manually:
using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderService
{
    // Register both names at startup with AddSource("OrderService") and
    // AddMeter("OrderService"); unregistered sources and meters are not exported.
    private static readonly ActivitySource Source = new("OrderService");
    private static readonly Meter Meter = new("OrderService");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders.created");

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        // StartActivity returns null when nothing is listening, hence the ?. calls
        using var activity = Source.StartActivity("CreateOrder");
        activity?.SetTag("order.customer_id", request.CustomerId);
        activity?.SetTag("order.items_count", request.Items.Count);

        try
        {
            // Validate inventory
            using (Source.StartActivity("ValidateInventory"))
            {
                await ValidateInventoryAsync(request.Items);
            }

            // Calculate pricing
            decimal total;
            using (var pricingSpan = Source.StartActivity("CalculatePricing"))
            {
                total = await CalculatePricingAsync(request.Items);
                pricingSpan?.SetTag("order.total", total);
            }

            // Process payment
            using (Source.StartActivity("ProcessPayment"))
            {
                await ProcessPaymentAsync(request.CustomerId, total);
            }

            // region is low-cardinality, so it is safe as a metric dimension
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("region", request.Region));

            activity?.SetStatus(ActivityStatusCode.Ok);
            return new Order { Id = Guid.NewGuid(), Total = total };
        }
        catch (Exception ex)
        {
            // Mark the span as failed so error-based sampling and alerts can see it
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
.NET Aspire — OTel included
If you're using .NET Aspire, OpenTelemetry is already wired up in the ServiceDefaults project: builder.AddServiceDefaults() calls ConfigureOpenTelemetry(), which sets up tracing, metrics, and logging. The Aspire Dashboard also shows all telemetry locally in dev without Grafana or Jaeger.
6. Smart sampling strategies
At scale, collecting 100% of traces is rarely affordable: storage and network costs grow with every request. Sampling reduces volume while keeping the important data.
Head-based vs. tail-based sampling
graph TD
subgraph "Head-based Sampling"
H1[Request arrives] --> H2{Decide up front}
H2 -->|Sample| H3[Collect trace]
H2 -->|Drop| H4[Discard entirely]
end
subgraph "Tail-based Sampling"
T1[Request arrives] --> T2[Collect ALL spans]
T2 --> T3[Trace finishes]
T3 --> T4{Evaluate the whole trace}
T4 -->|Error/Slow| T5[Keep]
T4 -->|Normal| T6[Apply ratio sampling]
end
style H1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style H2 fill:#e94560,stroke:#fff,color:#fff
style T1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style T4 fill:#2c3e50,stroke:#fff,color:#fff
Head-based decides at the start; tail-based decides after the trace completes
| Criterion | Head-based | Tail-based |
|---|---|---|
| Decision time | As the request starts | After the trace completes |
| Pros | Simple, low overhead | Keeps every error and slow request |
| Cons | Can miss error traces | Needs a collector with enough RAM to buffer |
| Fits | Very high traffic, limited budget | Production needing precise debugging |
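A head-based decision can be made from the trace ID alone, which is what keeps it cheap and lets every service reach the same verdict without coordination. The Python sketch below follows the general idea behind ratio-based samplers (compare part of the trace ID against a threshold); `should_sample` is a name chosen for this example, and the exact bits and rounding used by a given SDK may differ.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based decision: keep the trace if its low 64 bits fall under ratio * 2^64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # The decision depends only on the ID, never on how the request turns out.
    return low_64_bits < round(ratio * (1 << 64))
```

Because the outcome of the request plays no part, a failing request whose ID lands above the threshold is dropped like any other, which is the "can miss error traces" drawback from the table.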
A common production combination:
# tail-sampling configuration on the Collector
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Keep requests slower than 1 second
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep traces from critical endpoints
      - name: keep-critical-paths
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/orders"]
      # 5% of normal traces
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
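To show what these four policies amount to once a trace has finished, here is a simplified Python model. It is not the Collector's implementation: `FinishedTrace` and `keep_trace` are invented names, the policies are checked in order and the first match wins (for a keep/drop verdict that is equivalent to "any policy matches"), and the probabilistic policy is replaced by a deterministic bucket on the trace ID so the example is reproducible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedTrace:
    trace_id: str       # 32 hex characters
    duration_ms: float
    has_error: bool     # any span ended with status ERROR
    routes: List[str]   # http.route values seen on the trace's spans


CRITICAL_ROUTES = {"/api/payments", "/api/orders"}


def keep_trace(trace: FinishedTrace, baseline_percentage: float = 5.0) -> str:
    """Return the name of the first matching policy, or "" when the trace is dropped."""
    if trace.has_error:
        return "keep-errors"
    if trace.duration_ms > 1000:
        return "keep-slow"
    if CRITICAL_ROUTES.intersection(trace.routes):
        return "keep-critical-paths"
    # Deterministic stand-in for the probabilistic policy: 10,000 buckets by trace ID.
    bucket = int(trace.trace_id, 16) % 10_000
    if bucket < baseline_percentage * 100:
        return "baseline"
    return ""
```

The function needs the whole trace as input, which is why the Collector has to buffer spans for `decision_wait` before it can decide.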
7. Building a complete observability stack
A popular, cost-effective production stack (fully self-hostable):
graph LR
App[Applications] -->|OTLP| Col[OTel Collector]
Col -->|Traces| Tempo[Grafana Tempo]
Col -->|Metrics| Prom[Prometheus]
Col -->|Logs| Loki[Grafana Loki]
Tempo --> Graf[Grafana Dashboard]
Prom --> Graf
Loki --> Graf
Graf --> Alert[Alertmanager]
Alert --> PD[PagerDuty/Slack]
style Col fill:#e94560,stroke:#fff,color:#fff
style Graf fill:#2c3e50,stroke:#fff,color:#fff
style Tempo fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style Prom fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style Loki fill:#f8f9fa,stroke:#e94560,color:#2c3e50
Open-source observability stack: OTel + Grafana ecosystem
| Pillar | Tool | Role | Cost |
|---|---|---|---|
| Traces | Grafana Tempo | Trace storage, lookup by TraceId | Free (self-host) |
| Metrics | Prometheus | Collection and querying (PromQL) | Free (self-host) |
| Logs | Grafana Loki | Log aggregation with label-based indexing | Free (self-host) |
| Visualization | Grafana | Dashboards, alerting, explore | Free (self-host) |
| Alerting | Alertmanager | Routes alerts → Slack, PagerDuty, Email | Free |
Docker Compose for local development
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
      - "8889:8889"   # Prometheus metrics
    volumes:
      # The contrib image reads /etc/otelcol-contrib/config.yaml by default
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Anonymous admin access is for local development only
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml
Self-host vs. managed service
| Criterion | Self-host (Grafana Stack) | Managed (Datadog/New Relic) | Hybrid (Grafana Cloud Free) |
|---|---|---|---|
| Cost | Infra only (servers/storage) | $15-25/host/month | Free tier: 50GB logs, 10K metrics |
| Setup | Needs DevOps experience | 5-minute setup | 15-minute setup |
| Scaling | Manage your own HA and retention | Automatic | Free tier has limits |
| Vendor lock-in | None (OTel is standard) | High (proprietary features) | Low (OTel-compatible) |
| Fits | Larger teams with infra budget | Startups, small teams | Side projects, MVPs |
8. Production best practices for 2026
Semantic Conventions — standardized naming
One of OpenTelemetry's greatest benefits is Semantic Conventions — standardized attribute names. When every service uses the same convention, cross-service queries are consistent:
| Domain | Attribute | Example values |
|---|---|---|
| HTTP | http.request.method | GET, POST, PUT, ... |
| HTTP | http.response.status_code | 200, 404, 500, ... |
| HTTP | url.path | /api/orders |
| Database | db.system | postgresql, redis, mssql |
| Database | db.operation.name | SELECT, INSERT, findOne |
| Messaging | messaging.system | kafka, rabbitmq, servicebus |
| Messaging | messaging.destination.name | orders-queue, events-topic |
Key principles
1. Keep cardinality low
High-cardinality attributes (e.g., attaching user.id to every metric) will blow up the number of Prometheus time series. Only attach high-cardinality attributes to traces (storage is cheaper); metrics should stick to low-cardinality labels like region, status_code, endpoint.
2. Filter health checks and noise
Drop traces from /health, /ready, /metrics endpoints. They generate a large share of trace volume and have almost no debugging value. Filter at the SDK level (not the collector) so these spans never cross the network.
3. Protect telemetry data
Telemetry can contain PII (email, tokens, query params). Use a redaction processor in the collector to mask/drop sensitive attributes before export. Always use TLS for OTLP endpoints in production.
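The three principles above can each be reduced to a few lines of logic. The Python sketch below is illustrative only: the deny-list keys, the email pattern, and the function names are assumptions for this example, and in a real deployment the redaction step would normally live in a Collector processor rather than in application code.

```python
import math
import re
from typing import Dict

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Hypothetical deny-list; adapt it to the attributes your services emit.
SENSITIVE_KEYS = {"http.request.header.authorization", "user.email"}
NOISE_PATHS = ("/health", "/ready", "/metrics")


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Principle 1: worst-case time series for one metric is the product of label cardinalities."""
    return math.prod(label_cardinalities.values())


def is_noise(path: str) -> bool:
    """Principle 2: true for probe and scrape endpoints that should not produce traces."""
    return any(path == p or path.startswith(p + "/") for p in NOISE_PATHS)


def redact(attributes: Dict[str, str]) -> Dict[str, str]:
    """Principle 3: mask deny-listed keys outright and scrub emails from other values."""
    cleaned = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        else:
            cleaned[key] = EMAIL.sub("[EMAIL]", value)
    return cleaned


low = series_count({"region": 5, "status_code": 6, "endpoint": 50})
high = series_count({"region": 5, "status_code": 6, "endpoint": 50, "user.id": 100_000})
```

With these made-up cardinalities, adding `user.id` takes one metric from 1,500 potential series to 150 million, which is the blow-up the first principle warns about.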
Conclusion
OpenTelemetry isn't just a library — it's the industry standard for observability. As the second most active CNCF project (after Kubernetes), with support from 100+ vendors and native .NET integration, adopting it is no longer a "should we?" question but a "where do we start?" one.
Key takeaways:
- Start with traces — they deliver the fastest debug value in distributed systems
- Auto-instrumentation first, manual later — don't try to cover everything from day one
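- Make the Collector your default in production: exporting straight from app to backend ties every service to one vendor and leaves no central place for sampling, batching, and redaction
- Tail-based sampling keeps the error and slow traces that head-based sampling would drop, provided all spans of a trace reach the same Collector instance within decision_wait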
- Semantic conventions enable consistent cross-service queries — invest in standardization early
With a free stack (OTel Collector + Grafana + Tempo + Prometheus + Loki), you can build a production-grade observability system with no license costs — you only need time to set it up and operate it well.