
OpenTelemetry in .NET: Traces, Metrics, Logs

How to wire OpenTelemetry into ASP.NET Core for distributed traces, metrics, and structured logs - the single instrumentation pipeline that powers your alerts.

Table of contents
  1. When does observability become non-negotiable?
  2. What numbers should I budget for the observability tier?
  3. What does the minimal observability pipeline look like?
  4. What is the .NET 10 wiring for OpenTelemetry?
  5. What custom traces and metrics should I add?
  6. How do I turn observability into useful alerts?
  7. What failure modes does observability itself introduce?
  8. When is observability investment premature?
  9. Where should you go from here?

The first time you have to debug a production issue without observability, you understand why every chapter in this series quotes a metric. Without traces, metrics, and logs you are guessing. With them, the post-mortem writes itself. This chapter wires OpenTelemetry into ASP.NET Core in the shape that has become the 2026 standard.

When does observability become non-negotiable?

Three signals.

The service is in production. Even one customer means an outage costs reputation. Without metrics you have no alert; without traces you have no fix; without logs you have no explanation.

The service has more than one instance. Logging to a local file on each box stops working the moment requests are spread across replicas. You need a central pipeline.

The service depends on other services. A slow checkout that turns out to be a slow third-party payment provider takes hours to diagnose without distributed traces and minutes with them.

If your code is a one-off script that runs once and you read the console, you do not need this chapter. Anything else, you do.

What numbers should I budget for the observability tier?

Signal     Volume per req     Cost driver           Storage cost
Metrics    ~constant          unique label values   cheap
Traces     1 trace            sampling rate         expensive
Logs       ~5-20 lines        line size + count     medium

For a service at 10K QPS:

- Metrics: volume stays roughly flat as traffic grows; the bill scales with unique label values, not QPS.
- Traces: 1 per request is 36M traces/hour unsampled; at 1% head-based sampling, 100 traces/s.
- Logs: 5-20 lines per request is 50K-200K log lines/s - which is why the next rule exists.

Never log at INFO level on the hot path. WARN and above for steady state; DEBUG only when investigating.
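That policy in Serilog, as a sketch - the MyService.Checkout namespace is hypothetical, and the full wiring block appears later in the chapter:

// Steady state: WARN and above. Flip one namespace to DEBUG only while investigating.
builder.Host.UseSerilog((ctx, lc) => lc
    .MinimumLevel.Warning()
    .MinimumLevel.Override("MyService.Checkout",          // hypothetical namespace
        Serilog.Events.LogEventLevel.Debug)
    .WriteTo.OpenTelemetry(o => o.Endpoint = "http://collector:4317"));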

What does the minimal observability pipeline look like?

flowchart LR
    App[ASP.NET Core] -->|OTel SDK| Collector[OTel Collector]
    Collector -->|metrics| Prom[(Prometheus)]
    Collector -->|traces| Jaeger[(Jaeger / Tempo)]
    Collector -->|logs| Loki[(Loki / OpenSearch)]
    Prom --> Grafana
    Jaeger --> Grafana
    Loki --> Grafana
    Grafana --> Alert[Alerts -> PagerDuty]

The application emits via OTel; the Collector demultiplexes by signal type into purpose-built backends; Grafana is the unified dashboard. The same pipeline supports Datadog, Honeycomb, Azure Monitor - swap the backend without touching app code.

What is the .NET 10 wiring for OpenTelemetry?

One block in Program.cs:

using System.Reflection;   // Assembly, for the service version lookup below
using Serilog;             // UseSerilog host extension

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService(
        serviceName: builder.Environment.ApplicationName,
        serviceVersion: Assembly.GetExecutingAssembly().GetName().Version?.ToString()))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("MyService.*")
        .SetSampler(new TraceIdRatioBasedSampler(0.01))   // 1% head-based
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()                      // GC, thread pool
        .AddProcessInstrumentation()
        .AddMeter("MyService.*")
        .AddOtlpExporter());

// Logging through Serilog with OTel sink
builder.Host.UseSerilog((ctx, lc) => lc
    .ReadFrom.Configuration(ctx.Configuration)
    .Enrich.FromLogContext()
    .WriteTo.Console()
    .WriteTo.OpenTelemetry(o => o.Endpoint = "http://collector:4317"));

Three details. First, the stock ASP.NET Core instrumentation gives you the full RED set (rate, errors, duration) per endpoint without a line of custom code. Second, the EF Core instrumentation captures the SQL text - sanitise PII out of it before exporting. Third, 1% head-based sampling is the most cost-effective default; combine it with tail sampling on errors so failed requests keep full traces.
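Two of those knobs in code - a sketch, assuming your version of the EF Core instrumentation package exposes the SetDbStatementForText option:

builder.Services.AddOpenTelemetry()
    .WithTracing(t => t
        .AddEntityFrameworkCoreInstrumentation(o =>
            o.SetDbStatementForText = false)          // keep raw SQL (and any PII in it) out of spans
        .SetSampler(new ParentBasedSampler(           // honour the caller's sampling decision,
            new TraceIdRatioBasedSampler(0.01))));    // otherwise sample 1% of new root traces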

What custom traces and metrics should I add?

Three types worth the effort:

// Custom span around a business operation
// (requires: using System.Diagnostics; using System.Diagnostics.Metrics;)
// The source name must match the AddSource("MyService.*") pattern above.
private static readonly ActivitySource ActivitySource = new("MyService.Checkout");

using var activity = ActivitySource.StartActivity("checkout.process");
activity?.SetTag("order.id", orderId);
activity?.SetTag("order.amount", amount);
// ... business code

// Custom metric for a business KPI
// The meter name must match the AddMeter("MyService.*") pattern above.
private static readonly Meter Meter = new("MyService.Business");
private static readonly Counter<long> OrdersPlaced =
    Meter.CreateCounter<long>("orders_placed_total");

OrdersPlaced.Add(1, new KeyValuePair<string, object?>("payment_method", "stripe"));

// Structured log with correlation (log is the injected ILogger<T>)
log.LogInformation(
    "Order {OrderId} placed by user {UserId} for {Amount:0.00}",
    orderId, userId, amount);

The ActivitySource ties the custom span into the trace tree automatically; the Meter exposes a counter that Prometheus scrapes. The structured log preserves the correlation ID injected by ASP.NET Core - Grafana can jump from a trace to its logs.
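For latency-style business KPIs, a histogram follows the same pattern. A sketch reusing the Meter above; the name checkout_duration_ms is illustrative, not from this chapter:

// A duration histogram alongside the counter above.
private static readonly Histogram<double> CheckoutDuration =
    Meter.CreateHistogram<double>("checkout_duration_ms", unit: "ms");

var sw = Stopwatch.StartNew();
// ... business code
CheckoutDuration.Record(sw.Elapsed.TotalMilliseconds,
    new KeyValuePair<string, object?>("payment_method", "stripe"));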

How do I turn observability into useful alerts?

Three rules.

One: alert on user-visible symptoms, not internal causes. "5xx rate above 0.5% for 5 minutes" is a good alert. "Garbage collection time above 1%" is not - it might be fine, depending on the workload. The SRE literature calls this symptom-based alerting, and it is the right framing. The signals from the caching layer - hit rate, misses per second - are the right kind: the user feels them as latency.

Two: alert on SLO burn rate, not raw thresholds. A 1% error rate is fine if your SLO is 99% over 30 days; it is a fire if your SLO is 99.99%. Compute the burn rate against the SLO and alert when you are spending error budget too fast (the arithmetic is sketched after rule three).

Three: every alert must have a runbook. If the on-call engineer cannot answer "what do I do?", the alert is noise. Pair every alert with a Confluence page or wiki link explaining diagnosis and fix.
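The burn-rate arithmetic from rule two, as a sketch - the SLO and error rate here are hypothetical numbers, not from this chapter:

// Burn rate = observed error rate / allowed error rate.
double slo = 0.999;                                  // 99.9% over a 30-day window
double errorBudget = 1.0 - slo;                      // 0.1% of requests may fail
double observedErrorRate = 0.01;                     // 1% of requests currently failing
double burnRate = observedErrorRate / errorBudget;   // 10x
// At 10x burn the 30-day budget is gone in 3 days: page immediately.
// At exactly 1x you would spend the whole budget over the window: warn, don't page.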

What failure modes does observability itself introduce?

Three, all self-inflicted. Label cardinality: the cost table above names unique label values as the metrics cost driver, and a single user-ID label on a counter multiplies the series count by the size of your user base. Hot-path logging: an INFO line per request at 10K QPS is exactly the log volume the budget section warns about. Unsampled traces: exporting every trace turns the most expensive signal into your largest storage bill, which is why the wiring above sets a 1% sampler from day one.

When is observability investment premature?

When the service is single-instance, low-traffic, and not in production. A local dev box reading console logs is fine. Add OpenTelemetry the day you deploy to staging - not before, not after. Wiring it into a service that already serves users is harder than wiring it from day one. The QPS estimate from chapter 2 tells you when "production" is real enough to require this.

Where should you go from here?

Next chapter: rate limiting in .NET - the second ops building block. Together with observability, rate limiting is what keeps your service alive when something upstream goes wrong. After that, the case-study chapters compose every block from foundations through ops into real, complete designs.

Frequently asked questions

Three signals - which one matters most?
Metrics for alerting (rate, latency p99, error count - cheap to aggregate, fast to query). Traces for debugging (where did this request slow down). Logs for context (the exact exception, the input that broke). Most teams over-invest in logs and under-invest in metrics; the right ratio is 'metrics first, traces second, logs as supporting evidence'.
OpenTelemetry vs Serilog vs Application Insights?
Not vs - layered. OpenTelemetry is the protocol and instrumentation. Serilog is a logging library that emits via OTel. Application Insights is a backend that ingests OTel data. The choice is mostly 'which backend' (App Insights, Datadog, Honeycomb, Grafana Cloud, self-hosted Jaeger+Prometheus). Code stays the same; OTel decouples instrumentation from vendor.
What metrics should I emit by default?
RED for every endpoint: Rate (req/s), Errors (req/s with status >=500), Duration (latency histogram). Plus saturation signals: CPU, memory, GC pause time, thread pool queue length. The .NET instrumentation packages emit all of this for free. Custom metrics are for business KPIs (orders/min, signups/hour) - rarely needed for technical alerting.
How much does observability cost in production?
Mostly storage. Traces are expensive at scale - 1 trace per request at 10K QPS is 25M traces/hour. Sampling cuts that 100x with little loss; 1% head-based sampling plus error sampling at 100% is the standard. Metrics are cheap (aggregated by the time they hit storage). Logs are middle - structured, sampled at the source for chatty paths.