Circuit Breaker, Retry, Timeout: Polly in ASP.NET Core
How to wire Polly and Microsoft.Extensions.Http.Resilience into a .NET service: retry with exponential backoff, circuit breaker, timeout, and bulkhead isolation.
Table of contents
- When does adding resilience code start paying back?
- What numbers should I budget for resilience?
- What does the layered pipeline look like?
- What is the .NET 10 wiring with Microsoft.Extensions.Http.Resilience?
- How do circuit breakers actually decide to open?
- What failure modes does Polly itself introduce?
- When should you skip resilience handlers?
- Where should you go from here?
The first time a downstream service goes slow and your own service stops responding because every thread is blocked waiting for it, you have met the case Polly was designed for. This chapter wires the four core resilience patterns - timeout, retry, circuit breaker, bulkhead - into ASP.NET Core in a way that survives the unanticipated outage.
When does adding resilience code start paying back?
Three signals.
The service calls another service synchronously. HTTP clients, gRPC clients, third-party APIs. Every one of these can fail independently and your service should keep responding for the rest of the requests.
The dependency has variable latency. A slow third-party often hurts more than a fast failure. The thread pool fills with waiters, new requests queue up, and the whole service tips over. Timeouts + bulkheads protect against this.
You have user-facing latency requirements. A p99 budget of 500 ms means a downstream that takes 30 seconds to fail must time out faster than that, retry once, then fall back. Polly orchestrates all of that with one pipeline.
If the only network call is to a single in-cluster Postgres, resilience is the connection-pool tuning, not Polly. Save Polly for the call out of your network.
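For contrast, a minimal sketch of what that pool-level tuning looks like with Npgsql (the key names are Npgsql's; the values are illustrative, not recommendations):
// Connection-pool resilience for an in-cluster Postgres lives in the
// connection string, not in a Polly pipeline.
var connectionString =
    "Host=db;Database=app;Username=app;Password=...;" +
    "Maximum Pool Size=20;" +   // bulkhead equivalent: caps concurrent connections
    "Timeout=15;" +             // seconds to wait for a pooled connection
    "Command Timeout=30";       // seconds per command before it is cancelled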
What numbers should I budget for resilience?
| Pattern | Suggested settings | Effect |
| --- | --- | --- |
| Timeout | 500 ms - 5 s per attempt | caps a single call |
| Retry | 3 attempts, 200 ms - 1.5 s backoff | handles transient failures |
| Circuit breaker | opens at >50% failures over 30 s (min 10 requests) | stops cascading failures |
| Bulkhead | 10-20 concurrent calls | caps the blast radius |
| Total budget | (retries + 1) x attempt timeout, plus backoff | upper bound on wall-clock time |
The total wall-clock budget for a Polly pipeline is roughly the per-attempt timeout times the number of attempts (retries plus the original call), plus the backoff delays between them. A 500 ms timeout with 3 retries can take around 3.4 seconds in the worst case; bound your service-level deadline accordingly.
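To make that arithmetic concrete, a back-of-the-envelope check for the numbers above (the backoff values are illustrative; real delays vary with jitter):
// Worst case for the numbers above: 3 retries = 4 attempts.
// Each attempt can burn the full 500 ms timeout; backoff delays
// (200 ms base, doubling, jitter omitted) sit between attempts.
const double attemptMs = 500;
double worstCaseMs = 4 * attemptMs + (200 + 400 + 800); // = 3,400 ms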
What does the layered pipeline look like?
flowchart LR
Req[HTTP request] --> Timeout[Timeout 5 s]
Timeout --> Retry[Retry 3x exp backoff]
Retry --> Breaker[Circuit breaker]
Breaker --> Bulkhead[Bulkhead 20 concurrent]
Bulkhead --> Downstream[Downstream service]
Downstream --> Out[Response]
Outer to inner: Timeout (caps the whole exchange; the standard handler pairs it with a per-attempt timeout inside the retry layer), Retry (handles transient failures), Circuit breaker (stops calling a dead dependency), Bulkhead (limits concurrent calls), then the actual call. Each layer is independent and configured per dependency.
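The same layering can be built by hand with Polly v8's ResiliencePipelineBuilder when a pipeline is needed outside HttpClientFactory. A sketch, assuming the Polly and Polly.RateLimiting packages; the endpoint URL is a placeholder:
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Strategies execute in the order they are added: first added is outermost.
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(5))                    // caps the whole exchange
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        BackoffType = DelayBackoffType.Exponential,
        Delay = TimeSpan.FromMilliseconds(200),
        UseJitter = true,
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        MinimumThroughput = 10,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(30),
    })
    .AddConcurrencyLimiter(permitLimit: 20, queueLimit: 0)  // bulkhead
    .Build();

// Any delegate can be executed through the pipeline.
using var http = new HttpClient();
await pipeline.ExecuteAsync(
    async ct => await http.GetAsync("https://downstream.example/health", ct),
    CancellationToken.None);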
What is the .NET 10 wiring with Microsoft.Extensions.Http.Resilience?
Modern .NET ships a standard resilience pipeline as a one-line extension via the Microsoft.Extensions.Http.Resilience package (built on Polly v8). Wire it once per HttpClient:
// Program.cs
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>(c =>
{
c.BaseAddress = new Uri(builder.Configuration["Payment:BaseUrl"]!);
c.Timeout = TimeSpan.FromSeconds(10);
})
.AddStandardResilienceHandler(opt =>
{
// Explicit values for clarity; override only what you need to change.
opt.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);
opt.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
opt.Retry.MaxRetryAttempts = 3;
opt.Retry.BackoffType = DelayBackoffType.Exponential;
opt.Retry.UseJitter = true;
opt.CircuitBreaker.FailureRatio = 0.5;
opt.CircuitBreaker.MinimumThroughput = 10;
opt.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
opt.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
});
// Usage - no Polly code at the call site.
public class CheckoutService(IPaymentClient payment)
{
public Task<PaymentResult> ChargeAsync(Order order, CancellationToken ct)
=> payment.ChargeAsync(order.Id, order.Amount, ct); // resilience is invisible
}
Three details. The standard handler bundles all four patterns with sensible defaults (the bulkhead role is covered by its built-in rate limiter). Configure per client, never globally - a recommendations service and a payment service deserve different budgets, as the sketch below shows. The handler emits OpenTelemetry metrics so you can see retries and breaks in Grafana (chapter 13).
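"Per client" in practice means a second registration with its own budget. IRecommendationsClient and RecommendationsClient are hypothetical names; the point is that each AddHttpClient gets its own handler:
// A non-critical dependency gets a tighter, cheaper budget than payments.
builder.Services.AddHttpClient<IRecommendationsClient, RecommendationsClient>(c =>
{
    c.BaseAddress = new Uri(builder.Configuration["Recommendations:BaseUrl"]!);
})
.AddStandardResilienceHandler(opt =>
{
    opt.AttemptTimeout.Timeout = TimeSpan.FromMilliseconds(500); // fail fast
    opt.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(2);
    opt.Retry.MaxRetryAttempts = 1;  // best-effort: one retry, then give up
});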
How do circuit breakers actually decide to open?
Three states with thresholds:
stateDiagram-v2
[*] --> Closed
Closed --> Open: failures exceed threshold
Open --> HalfOpen: break duration elapsed
HalfOpen --> Closed: probe succeeds
HalfOpen --> Open: probe fails
- Closed - normal traffic, counting failures.
- Open - all requests fail fast for the break duration; no calls reach the dependency.
- Half-open - one probe request goes through; success closes the breaker, failure reopens.
The thresholds matter. The configuration above reads "open on more than 50% failures over a 30-second window, once at least 10 requests have been sampled". Lower the threshold and the breaker opens on small hiccups; raise it and it reacts too slowly. These values are a sound starting point for most services; tune by metrics, not guesses.
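The state transitions are observable. A sketch of the transition callbacks on Polly v8's CircuitBreakerStrategyOptions, with console logging standing in for your real alerting:
using Polly.CircuitBreaker;

var breaker = new CircuitBreakerStrategyOptions
{
    FailureRatio = 0.5,
    MinimumThroughput = 10,
    SamplingDuration = TimeSpan.FromSeconds(30),
    BreakDuration = TimeSpan.FromSeconds(30),
    OnOpened = args =>
    {
        // Closed/HalfOpen -> Open: page someone, or at least emit a metric.
        Console.WriteLine($"Breaker OPEN for {args.BreakDuration}.");
        return ValueTask.CompletedTask;
    },
    OnHalfOpened = _ =>
    {
        Console.WriteLine("Breaker HALF-OPEN: probe in flight.");
        return ValueTask.CompletedTask;
    },
    OnClosed = _ =>
    {
        Console.WriteLine("Breaker CLOSED: dependency recovered.");
        return ValueTask.CompletedTask;
    },
};
The standard handler's opt.CircuitBreaker exposes the same callbacks, so the logging can live next to the wiring from earlier.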
What failure modes does Polly itself introduce?
- Retry storm - retries hit the dependency while it is recovering, knocking it back down. Mitigation: jitter, exponential backoff, and a small MaxRetryAttempts (3 is plenty).
- Retry on non-idempotent calls - a POST /charge-card retried twice charges twice. Mitigation: only retry idempotent operations (see the sketch after this list), and pair every retry with the idempotency key from chapter 10.
- Hidden circuit breaker - the breaker is open and you do not know. Mitigation: alert on circuit_breaker_state == open; emit logs on state transitions.
- Bulkhead too tight - the concurrency cap is below normal traffic, throttling healthy requests. Mitigation: size the bulkhead at 2x peak concurrent calls under normal load.
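Restricting retries to idempotent verbs can be enforced in the handler itself. A sketch against the standard handler, assuming the package's HttpClientResiliencePredicates.IsTransient for the transient check; when a transport exception leaves no response to inspect, the predicate conservatively declines the retry:
using Microsoft.Extensions.Http.Resilience;

builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddStandardResilienceHandler(opt =>
    {
        opt.Retry.ShouldHandle = args =>
        {
            // Only retry verbs that are safe to repeat. On a transport
            // exception Result is null, so no retry - conservative by design.
            var method = args.Outcome.Result?.RequestMessage?.Method;
            bool idempotent = method == HttpMethod.Get
                           || method == HttpMethod.Head
                           || method == HttpMethod.Put
                           || method == HttpMethod.Delete;

            return ValueTask.FromResult(
                idempotent && HttpClientResiliencePredicates.IsTransient(args.Outcome));
        };
    });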
When should you skip resilience handlers?
Three cases.
One, internal trusted dependency with synchronous replication and the same failure domain - your own Postgres in the same VPC. The connection-pool retry handles transient errors; adding Polly is duplicate code.
Two, best-effort fire-and-forget through a queue (chapter 6). The queue already retries; Polly on the publish side is overkill.
Three, already broken architectures. If the call you are wrapping is the wrong shape - a synchronous loop where async messaging would work - resilience masks the design problem instead of fixing it.
Where should you go from here?
Next chapter: the saga pattern - reliability at the multi-step business workflow level, where retry and circuit breakers on individual calls are not enough. The saga ties together outbox (chapter 10) and resilience handlers into one consistent shape.