Resilience is not a luxury anymore. If your API calls external services, touches a database, or depends on any network boundary, you will face transient failures. Over the years I have watched production systems slow down, choke, or fail entirely because engineers assumed happy paths. Polly is the toolkit I always reach for when I need predictable, controlled resilience in .NET.
This post is not a basic tutorial. It is a collection of patterns that have saved my teams in real outages, plus the mistakes I have watched other teams make around retries, timeouts, and circuit breakers.
The mindset shift: resilience is about bounding failure
When you start using Polly professionally, the goal is not to eliminate failure. The goal is to make failures predictable and bounded so they cannot cascade across services or take down the system. Polly gives you guardrails in a world full of unreliable networks.
Where most teams go wrong with retries
The number one mistake I see is naive retries. A basic retry loop without delay or jitter is a self-inflicted DDoS during outages. When your downstream system is slow, retry storms make everything worse.
A retry policy that actually works
services.AddHttpClient("payments")
.AddPolicyHandler(Policy
.Handle<HttpRequestException>()
.OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(200 * attempt),
onRetry: (outcome, sleep, attempt, context) =>
{
// logging, metrics
}));
Key points:
- Retry only on transient errors
- Use backoff and jitter (see the sketch after this list)
- Never retry client errors like 400 or 404
- Never retry long-running operations
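As a sketch of what backoff with jitter can look like (the exponential multiplier and jitter range are illustrative numbers, not tuned values, and Random.Shared assumes .NET 6 or later):

var retryWithJitter = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(
            Math.Pow(2, attempt) * 100          // 200 ms, 400 ms, 800 ms
            + Random.Shared.Next(0, 100)));     // 0-100 ms of jitter

The jitter is the part people skip: clients that failed at the same instant would otherwise retry at the same instant, recreating the original spike.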
Timeouts: the most underrated resilience mechanism
I would argue that timeout policies are more important than retries. Without timeouts, your threads or connections will hang during a downstream outage, causing your entire API to slow down or freeze.
var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(2));
One of my teams once suffered a cascading production failure because a third party slowed from 200 ms response time to 30 seconds. Without timeouts, our worker pool was exhausted and we started dropping internal requests. A two-second timeout would have prevented the entire incident.
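One detail that bites people: Polly's default timeout strategy is optimistic, which relies on the executed delegate honoring the CancellationToken it is handed. A quick sketch of both variants, with the two-second value purely illustrative:

// Optimistic (default): cancels through the CancellationToken passed into the delegate.
var optimisticTimeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(2));

// Pessimistic: stops waiting even if the delegate ignores cancellation.
// The abandoned work keeps running in the background, so use this sparingly.
var pessimisticTimeout = Policy.TimeoutAsync(
    TimeSpan.FromSeconds(2),
    TimeoutStrategy.Pessimistic);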
Circuit breakers: protecting your system during real outages
Circuit breakers prevent repeated calls to an unhealthy dependency. Once tripped, calls fail fast for the break duration; the breaker then moves to half-open and lets a trial call through to check whether the downstream has recovered.
var breaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromSeconds(10));
Good circuit breaker design requires tuning. If the break duration is too short, you hammer the dependency again too soon. If it is too long, you artificially delay recovery.
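For that kind of tuning, Polly also offers an advanced circuit breaker that trips on a failure rate over a sampling window rather than a raw count of consecutive exceptions. A sketch with illustrative thresholds:

// Break when at least 50% of calls fail across a 30-second window,
// but only once that window has seen a minimum of 20 calls.
var advancedBreaker = Policy
    .Handle<Exception>()
    .AdvancedCircuitBreakerAsync(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromSeconds(30),
        minimumThroughput: 20,
        durationOfBreak: TimeSpan.FromSeconds(15));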
Bulkhead isolation: the pattern that saved us more than once
Bulkheads limit the number of concurrent calls to a dependency. If the dependency stalls, the bulkhead protects the rest of the system by isolating the damage.
I have seen APIs stay fully available during downstream outages because bulkheads prevented thread starvation.
var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 20,
    maxQueuingActions: 40);
Think of this as a circuit breaker for concurrency.
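When both the parallelization slots and the queue are full, the bulkhead throws BulkheadRejectedException, so it is worth deciding up front what the caller should see. A sketch from an async call site; CallDependencyAsync and the 503 mapping are placeholders for your own code:

var bulkheadWithRejection = Policy.BulkheadAsync(
    20,   // maxParallelization
    40,   // maxQueuingActions
    context =>
    {
        // onBulkheadRejected: emit a metric so rejections show up on dashboards
        return Task.CompletedTask;
    });

try
{
    await bulkheadWithRejection.ExecuteAsync(() => CallDependencyAsync());
}
catch (BulkheadRejectedException)
{
    // Fail fast and explicitly, e.g. translate to a 503 Service Unavailable
}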
Fallbacks: the last safety net
Fallbacks let you define graceful degradation when everything else has failed. For example, return cached data, defaults, or a partial response.
A real example from a past payment platform:
- If the fraud-check service was down, the fallback used the last known safe score
- Customers could still check out, but with reduced risk rules
var fallback = Policy<HttpResponseMessage>
    .Handle<Exception>()
    .FallbackAsync(new HttpResponseMessage(HttpStatusCode.OK)
    {
        Content = new StringContent("{\"fraudScore\": 0}")
    });
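One caution: if a fallback fires silently, you can run in degraded mode for hours without anyone noticing. A sketch of the same policy with an activation callback, the logging itself left as a comment:

var fallbackWithLogging = Policy<HttpResponseMessage>
    .Handle<Exception>()
    .FallbackAsync(
        new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"fraudScore\": 0}")
        },
        (outcome, context) =>
        {
            // onFallbackAsync: log and emit a metric so degraded mode is visible
            return Task.CompletedTask;
        });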
The real power: policy composition
The magic of Polly is how policies combine. A well-designed resilience chain might look like:
services.AddHttpClient("orders")
.AddPolicyHandler(timeout)
.AddPolicyHandler(bulkhead)
.AddPolicyHandler(retry)
.AddPolicyHandler(circuitBreaker);
Order matters. In production I typically apply policies in this sequence, outermost first (see the PolicyWrap sketch after this list):
- Timeout
- Bulkhead
- Circuit breaker
- Retry (carefully)
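If you want that ordering to be explicit rather than implied by registration order, a PolicyWrap lists policies outermost first. A sketch that assumes all four policies were built as IAsyncPolicy<HttpResponseMessage> (for example Policy.TimeoutAsync<HttpResponseMessage>(...) rather than the non-generic form), with httpClient and the URL as placeholders:

// Outermost first: the overall timeout bounds everything, retry sits closest to the call.
var pipeline = Policy.WrapAsync(timeout, bulkhead, circuitBreaker, retry);

var response = await pipeline.ExecuteAsync(
    ct => httpClient.GetAsync("/api/orders", ct),
    CancellationToken.None);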
Chaos testing your resilience policies
You cannot rely on theory. You must test your resilience strategies under stress. Some chaos scenarios I regularly use:
- Downstream timeouts
- Slow responses
- Random 500s
- DNS failures
- Rate limits
Only after real chaos testing do you learn if your retries are too aggressive or your breaker settings too optimistic.
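You do not need a full chaos platform to get started; a small DelegatingHandler that injects faults into a named HttpClient in a test or staging environment already exercises several of these scenarios. A minimal sketch, with the injection rates and the environment gating left as assumptions for you to adapt:

// Randomly injects latency and 500s so resilience policies actually get exercised.
public sealed class ChaosHandler : DelegatingHandler
{
    private readonly double _faultRate;

    public ChaosHandler(double faultRate) => _faultRate = faultRate;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Random.Shared.NextDouble() < _faultRate)
        {
            if (Random.Shared.NextDouble() < 0.5)
                await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);       // slow response
            else
                return new HttpResponseMessage(HttpStatusCode.InternalServerError); // random 500
        }

        return await base.SendAsync(request, cancellationToken);
    }
}

// Registered only outside production, e.g.:
// services.AddHttpClient("orders").AddHttpMessageHandler(() => new ChaosHandler(0.2));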
Closing thoughts
A resilient API is not about sprinkling Polly policies randomly across your codebase. It is about understanding where failure originates, how it propagates, and how to shape its boundaries. Polly remains one of the most powerful tools in the .NET ecosystem for that purpose.
If you design policies intentionally, measure their impact, and test under chaos conditions, you can operate APIs with confidence even when dependencies are unreliable.