Resilience is not a luxury anymore. If your API calls external services, touches a database, or depends on any network boundary, you will face transient failures. Over the years I have watched production systems slow down, choke, or fail entirely because engineers assumed happy paths. Polly is the toolkit I always reach for when I need predictable, controlled resilience in .NET.
This post is not a basic tutorial. It is a collection of patterns that have saved my teams in real outages, plus the mistakes I have watched other teams make around retries, timeouts, and circuit breakers.
The mindset shift: resilience is about bounding failure
When you start using Polly professionally, the goal is not to eliminate failure. The goal is to make failures predictable and bounded so they cannot cascade across services or take down the system. Polly gives you guardrails in a world full of unreliable networks.
Where most teams go wrong with retries
The number one mistake I see is naive retries. A basic retry loop without delay or jitter is a self-inflicted DDoS during outages. When your downstream system is slow, retry storms make everything worse.
A retry policy that actually works
services.AddHttpClient("payments")
.AddPolicyHandler(Policy
.Handle<HttpRequestException>()
.OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(200 * attempt),
onRetry: (outcome, sleep, attempt, context) =>
{
// logging, metrics
}));
Key points:
- Retry only on transient errors
- Use backoff and jitter (see the sketch after this list)
- Never retry client errors like 400 or 404
- Never retry long-running operations
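As a sketch of what backoff with jitter can look like (the exponential multiplier and jitter range are illustrative numbers, not tuned values, and Random.Shared assumes .NET 6 or later):

var retryWithJitter = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(
            Math.Pow(2, attempt) * 100          // 200 ms, 400 ms, 800 ms
            + Random.Shared.Next(0, 100)));     // 0-100 ms of jitter

The jitter is the part people skip: clients that failed at the same instant would otherwise retry at the same instant, recreating the original spike.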
Timeouts: the most underrated resilience mechanism
I would argue that timeout policies are more important than retries. Without timeouts, your threads or connections will hang during a downstream outage, causing your entire API to slow down or freeze.
var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(2));
One of my teams once suffered a cascading production failure because a third party slowed from 200 ms response time to 30 seconds. Without timeouts, our worker pool was exhausted and we started dropping internal requests. A two-second timeout would have prevented the entire incident.
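One detail that bites people: Polly's default timeout strategy is optimistic, which relies on the executed delegate honoring the CancellationToken it is handed. A quick sketch of both variants, with the two-second value purely illustrative:

// Optimistic (default): cancels through the CancellationToken passed into the delegate.
var optimisticTimeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(2));

// Pessimistic: stops waiting even if the delegate ignores cancellation.
// The abandoned work keeps running in the background, so use this sparingly.
var pessimisticTimeout = Policy.TimeoutAsync(
    TimeSpan.FromSeconds(2),
    TimeoutStrategy.Pessimistic);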
Circuit breakers: protecting your system during real outages
Circuit breakers prevent repeated calls to an unhealthy dependency. Once tripped, calls fail fast for the break duration; the breaker then moves to half-open and lets a trial call through to check whether the downstream has recovered.
var breaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromSeconds(10));
Good circuit breaker design requires tuning. If the break duration is too short, you hammer the dependency again too soon. If it is too long, you artificially delay recovery.
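For that kind of tuning, Polly also offers an advanced circuit breaker that trips on a failure rate over a sampling window rather than a raw count of consecutive exceptions. A sketch with illustrative thresholds:

// Break when at least 50% of calls fail across a 30-second window,
// but only once that window has seen a minimum of 20 calls.
var advancedBreaker = Policy
    .Handle<Exception>()
    .AdvancedCircuitBreakerAsync(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromSeconds(30),
        minimumThroughput: 20,
        durationOfBreak: TimeSpan.FromSeconds(15));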
Bulkhead isolation: the pattern that saved us more than once
Bulkheads limit the number of concurrent calls to a dependency. If the dependency stalls, the bulkhead protects the rest of the system by isolating the damage.
I have seen APIs stay fully available during downstream outages because bulkheads prevented thread starvation.
var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 20,
    maxQueuingActions: 40);
Think of this as a circuit breaker for concurrency.
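When both the parallelization slots and the queue are full, the bulkhead throws BulkheadRejectedException, so it is worth deciding up front what the caller should see. A sketch from an async call site; CallDependencyAsync and the 503 mapping are placeholders for your own code:

var bulkheadWithRejection = Policy.BulkheadAsync(
    20,   // maxParallelization
    40,   // maxQueuingActions
    context =>
    {
        // onBulkheadRejected: emit a metric so rejections show up on dashboards
        return Task.CompletedTask;
    });

try
{
    await bulkheadWithRejection.ExecuteAsync(() => CallDependencyAsync());
}
catch (BulkheadRejectedException)
{
    // Fail fast and explicitly, e.g. translate to a 503 Service Unavailable
}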
Fallbacks: the last safety net
Fallbacks let you define graceful degradation when everything else has failed. For example, return cached data, defaults, or a partial response.
A real example from a past payment platform:
- If the fraud-check service was down, the fallback used the last known safe score
- Customers could still check out, but with reduced risk rules
var fallback = Policy<HttpResponseMessage>
    .Handle<Exception>()
    .FallbackAsync(new HttpResponseMessage(HttpStatusCode.OK)
    {
        Content = new StringContent("{\"fraudScore\": 0}")
    });
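One caution: if a fallback fires silently, you can run in degraded mode for hours without anyone noticing. A sketch of the same policy with an activation callback, the logging itself left as a comment:

var fallbackWithLogging = Policy<HttpResponseMessage>
    .Handle<Exception>()
    .FallbackAsync(
        new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"fraudScore\": 0}")
        },
        (outcome, context) =>
        {
            // onFallbackAsync: log and emit a metric so degraded mode is visible
            return Task.CompletedTask;
        });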
The real power: policy composition
The magic of Polly is how policies combine. A well-designed resilience chain might look like:
services.AddHttpClient("orders")
.AddPolicyHandler(timeout)
.AddPolicyHandler(bulkhead)
.AddPolicyHandler(retry)
.AddPolicyHandler(circuitBreaker);
Order matters. In production I typically apply policies in this sequence, outermost first (see the PolicyWrap sketch after this list):
- Timeout
- Bulkhead
- Circuit breaker
- Retry (carefully)
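If you want that ordering to be explicit rather than implied by registration order, a PolicyWrap lists policies outermost first. A sketch that assumes all four policies were built as IAsyncPolicy<HttpResponseMessage> (for example Policy.TimeoutAsync<HttpResponseMessage>(...) rather than the non-generic form), with httpClient and the URL as placeholders:

// Outermost first: the overall timeout bounds everything, retry sits closest to the call.
var pipeline = Policy.WrapAsync(timeout, bulkhead, circuitBreaker, retry);

var response = await pipeline.ExecuteAsync(
    ct => httpClient.GetAsync("/api/orders", ct),
    CancellationToken.None);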
Chaos testing your resilience policies
You cannot rely on theory. You must test your resilience strategies under stress. Some chaos scenarios I regularly use:
- Downstream timeouts
- Slow responses
- Random 500s
- DNS failures
- Rate limits
Only after real chaos testing do you learn if your retries are too aggressive or your breaker settings too optimistic.
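You do not need a full chaos platform to get started; a small DelegatingHandler that injects faults into a named HttpClient in a test or staging environment already exercises several of these scenarios. A minimal sketch, with the injection rates and the environment gating left as assumptions for you to adapt:

// Randomly injects latency and 500s so resilience policies actually get exercised.
public sealed class ChaosHandler : DelegatingHandler
{
    private readonly double _faultRate;

    public ChaosHandler(double faultRate) => _faultRate = faultRate;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Random.Shared.NextDouble() < _faultRate)
        {
            if (Random.Shared.NextDouble() < 0.5)
                await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);       // slow response
            else
                return new HttpResponseMessage(HttpStatusCode.InternalServerError); // random 500
        }

        return await base.SendAsync(request, cancellationToken);
    }
}

// Registered only outside production, e.g.:
// services.AddHttpClient("orders").AddHttpMessageHandler(() => new ChaosHandler(0.2));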
Closing thoughts
A resilient API is not about sprinkling Polly policies randomly across your codebase. It is about understanding where failure originates, how it propagates, and how to shape its boundaries. Polly remains one of the most powerful tools in the .NET ecosystem for that purpose.
If you design policies intentionally, measure their impact, and test under chaos conditions, you can operate APIs with confidence even when dependencies are unreliable.