
Circuit Breaker, Retry, Timeout: Polly in ASP.NET Core

How to wire Polly and Microsoft.Extensions.Http.Resilience into a .NET service: retry with exponential backoff, circuit breaker, timeout, and bulkhead isolation.

Table of contents
  1. When does adding resilience code start paying back?
  2. What numbers should I budget for resilience?
  3. What does the layered pipeline look like?
  4. What is the .NET 10 wiring with Microsoft.Extensions.Http.Resilience?
  5. How do circuit breakers actually decide to open?
  6. What failure modes does Polly itself introduce?
  7. When should you skip resilience handlers?
  8. Where should you go from here?

The first time a downstream service goes slow and your own service stops responding because every thread is blocked waiting for it, you have met the case Polly was designed for. This chapter wires the four core resilience patterns - timeout, retry, circuit breaker, bulkhead - into ASP.NET Core in a way that survives the unanticipated outage.

When does adding resilience code start paying back?

Three signals.

The service calls another service synchronously. HTTP clients, gRPC clients, third-party APIs. Every one of these can fail independently, and your service should keep responding to the requests that don't depend on it.

The dependency has variable latency. A slow third-party often hurts more than a fast failure. The thread pool fills with waiters, new requests queue up, and the whole service tips over. Timeouts + bulkheads protect against this.

You have user-facing latency requirements. A p99 budget of 500 ms means a downstream that takes 30 seconds to fail must time out faster than that, retry once, then fall back. Polly orchestrates all of that with one pipeline.

If the only network call is to a single in-cluster Postgres, resilience is the connection-pool tuning, not Polly. Save Polly for the call out of your network.

What numbers should I budget for resilience?

Pattern              Default settings                 Effect
Timeout              500 ms - 5 s per attempt         caps single call
Retry                3 attempts, 200 ms-1.5 s backoff handles transient
Circuit breaker      open after 5/10 failures, 30 s   stops cascading
Bulkhead             10-20 concurrent calls           caps blast radius
Total budget         attempts x timeout + backoff     upper bound

The total wall-clock budget for a Polly pipeline is roughly the attempt timeout times the number of attempts (retries + 1), plus the backoff delays between them. A 500 ms attempt timeout with 3 retries means up to four attempts (~2 s) plus roughly 1.4 s of exponential backoff, so about 3.5 seconds in the worst case; bound your service-level deadline accordingly.
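
To make the arithmetic concrete, here is a minimal standalone sketch using the Polly v8 pipeline API (the HTTP-handler wiring in a later section is the recommended shape for HttpClient; FetchAsync and cancellationToken are placeholders, not from this chapter):

using Polly;
using Polly.Retry;

// Strategies added first are outermost: the total timeout wraps retry, retry wraps the per-attempt timeout.
var pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(3.5))           // total budget: attempts plus backoff
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(200),      // 200, 400, 800 ms before jitter
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    .AddTimeout(TimeSpan.FromMilliseconds(500))      // caps each individual attempt
    .Build();

var response = await pipeline.ExecuteAsync(async ct => await FetchAsync(ct), cancellationToken);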

What does the layered pipeline look like?

flowchart LR
    Req[HTTP request] --> Timeout[Timeout 5 s]
    Timeout --> Retry[Retry 3x exp backoff]
    Retry --> Breaker[Circuit breaker]
    Breaker --> Bulkhead[Bulkhead 20 concurrent]
    Bulkhead --> Downstream[Downstream service]
    Downstream --> Out[Response]

Outer to inner: Timeout (the total request budget; a per-attempt timeout sits further in, right before the call), Retry (handles transient faults), Circuit Breaker (stops calling a dead dependency), Bulkhead (limits concurrent calls), then the actual call. Each layer is independent and configured per dependency.

What is the .NET 10 wiring with Microsoft.Extensions.Http.Resilience?

Modern .NET ships a standard resilience pipeline as a one-line extension. Wire it once per HttpClient:

// Program.cs
using Polly; // DelayBackoffType lives in the Polly namespace

builder.Services.AddHttpClient<IPaymentClient, PaymentClient>(c =>
{
    c.BaseAddress = new Uri(builder.Configuration["Payment:BaseUrl"]!);
    c.Timeout = TimeSpan.FromSeconds(15); // keep above TotalRequestTimeout so the handler's timeout fires first
})
.AddStandardResilienceHandler(opt =>
{
    // Explicit values for this client - the handler ships sensible defaults, so override only what you need.
    opt.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);
    opt.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
    opt.Retry.MaxRetryAttempts = 3;
    opt.Retry.BackoffType = DelayBackoffType.Exponential;
    opt.Retry.UseJitter = true;
    opt.CircuitBreaker.FailureRatio = 0.5;
    opt.CircuitBreaker.MinimumThroughput = 10;
    opt.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
    opt.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
});

// Usage - no Polly code in the calling site.
public class CheckoutService(IPaymentClient payment)
{
    public Task<PaymentResult> ChargeAsync(Order order, CancellationToken ct)
        => payment.ChargeAsync(order.Id, order.Amount, ct);  // resilience is invisible
}

Three details. The standard handler bundles all four patterns with sensible defaults. Configure per-client, never globally - a recommendations service and a payment service deserve different budgets. The handler emits OpenTelemetry metrics so you can see retries and breaks in Grafana (chapter 13).
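
A rough sketch of exporting those metrics, assuming the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.Prometheus.AspNetCore packages are referenced (swap the exporter for whatever feeds your Grafana):

// Program.cs - Polly v8 publishes its metrics through a System.Diagnostics.Metrics meter named "Polly".
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("Polly")              // retry attempts, circuit breaker state changes, timeouts
        .AddPrometheusExporter());      // then app.MapPrometheusScrapingEndpoint() exposes /metrics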

How do circuit breakers actually decide to open?

Three states with thresholds:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failures exceed threshold
    Open --> HalfOpen: break duration elapsed
    HalfOpen --> Closed: probe succeeds
    HalfOpen --> Open: probe fails

The thresholds matter. The settings above open the breaker on "more than 50% failures over a 30-second window, with at least 10 requests sampled". Lower the failure ratio and the breaker opens on small hiccups; raise it and it reacts too slowly. Values like these work for most services; tune by metrics, not guesses.
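
The same thresholds expressed against the standalone Polly v8 API, with callbacks to surface the state transitions (a sketch; Console.WriteLine stands in for your logger):

using Polly;
using Polly.CircuitBreaker;

var breaker = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,                          // open above 50% failures...
        MinimumThroughput = 10,                      // ...once at least 10 calls were sampled
        SamplingDuration = TimeSpan.FromSeconds(30), // rolling window the ratio is measured over
        BreakDuration = TimeSpan.FromSeconds(30),    // stay Open this long before probing
        OnOpened = args =>
        {
            Console.WriteLine($"breaker opened for {args.BreakDuration}");
            return default;                          // completed ValueTask
        },
        OnHalfOpened = _ => { Console.WriteLine("half-open: letting a probe through"); return default; },
        OnClosed = _ => { Console.WriteLine("closed: dependency healthy again"); return default; }
    })
    .Build();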

What failure modes does Polly itself introduce?

Mostly the ones you configure into it. Retries multiply load: three retries across a fleet of callers turns one failed request into four hits on a dependency that is already struggling, which is why backoff and jitter are not optional. Stacked timeouts blow the caller's budget: an attempt timeout, retry delays, and a total timeout tuned independently can add up past your own p99, so budget the pipeline as a whole. And a breaker on a critical dependency turns a hiccup into a hard outage for every caller (see the FAQ below). None of this argues against Polly; it argues for per-dependency tuning and for watching the metrics the handler emits.

When should you skip resilience handlers?

Three cases.

One, an internal trusted dependency with synchronous replication and the same failure domain - your own Postgres in the same VPC. The connection pool and the database provider's built-in retry handle transient errors; adding Polly on top is duplicate code (see the sketch after this list).

Two, best-effort fire-and-forget through a queue (chapter 6). The queue already retries; Polly on the publish side is overkill.

Three, already broken architectures. If the call you are wrapping is the wrong shape - a synchronous loop where async messaging would work - resilience masks the design problem instead of fixing it.
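
For case one, the database layer usually covers transient faults on its own. A minimal sketch of what that looks like if the service uses EF Core with the Npgsql provider (OrdersDbContext and the "Orders" connection-string name are placeholders):

// Program.cs - transient-fault retry lives in the database provider, so no Polly wrapper is needed.
builder.Services.AddDbContext<OrdersDbContext>(options =>
    options.UseNpgsql(
        builder.Configuration.GetConnectionString("Orders"),
        npgsql => npgsql.EnableRetryOnFailure()));  // provider-level retry for transient errors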

Where should you go from here?

Next chapter: the saga pattern - reliability at the multi-step business workflow level, where retry and circuit breakers on individual calls are not enough. The saga ties together outbox (chapter 10) and resilience handlers into one consistent shape.

Frequently asked questions

Retry first or circuit breaker first?
Both, but in the right order. The standard pipeline is total timeout → retry → circuit breaker → attempt timeout → request. The attempt timeout caps a single try; retry schedules the next one; the circuit breaker stops further attempts when the dependency is clearly down. Reversed (circuit breaker outside retry), the breaker only sees one aggregated failure after the whole retry sequence gives up, so it reacts slowly while the retries keep hammering a dependency that is already down.
What does exponential backoff actually buy me?
It avoids the thundering-herd. If 1000 clients all retry a failed call after 100 ms, you get 1000 simultaneous retries that hit the still-recovering dependency at the same time and knock it down again. Exponential backoff (100 ms, 200 ms, 400 ms, 800 ms) plus jitter (a random 50% to 150% of the delay) spreads the retries across time.
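For intuition, a tiny sketch of the delay sequence (illustrative only; Polly's own jitter algorithm differs in detail):

var baseDelay = TimeSpan.FromMilliseconds(100);
for (var attempt = 0; attempt < 4; attempt++)
{
    var exponential = baseDelay * Math.Pow(2, attempt);               // 100, 200, 400, 800 ms
    var jittered = exponential * (0.5 + Random.Shared.NextDouble());  // 50% to 150% of the delay
    Console.WriteLine($"retry {attempt + 1}: wait ~{jittered.TotalMilliseconds:F0} ms");
}

Polly's UseJitter option applies this kind of randomization for you.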
When does a circuit breaker hurt instead of help?
When the dependency is critical and degrading gracefully is worse than failing fast. A circuit breaker on the auth service will lock out every user the moment the service has a hiccup; better to retry hard. The breaker is right for non-critical dependencies (recommendations, analytics, third-party feeds) where 'no answer' is preferable to '5-second hang'.
How is this different from a load balancer's health check?
Different layer. The load balancer removes a downed instance from the pool. The circuit breaker stops calls to a dependency from a single client. Both are needed: the LB keeps healthy instances behind one IP, the breaker keeps your service responsive when all instances are slow. They compose.