Why is resilience important?
“The oak fought the wind and was broken, the willow bent when it must and survived.”
Resilience is crucial because it helps mitigate the impact of unexpected issues, ensuring your application stays functional and delivers a smooth user experience—even when things go sideways.
Take this real-world example: we had a commonly-used request to a third-party service that, on occasion, would start timing out. Normally, a few timeouts here and there were no big deal—our system would handle them gracefully, allowing users to retry as needed. But then, things escalated: every single request to this service started timing out. Suddenly, what used to be quick, millisecond responses turned into painfully long delays. Our server got backed up with a flood of stalled requests, and before we knew it, the entire system was grinding to a crawl.
In hindsight, we figured out that the root cause lay with the third-party service, which was beyond our control. But we had no fallback and no resiliency built in: no circuit breakers, no retry limits, nothing to shield us from a dependency we had been taking for granted for far too long. Our application was an oak, not a willow.
Needless to say, resiliency became a top priority. Thankfully, the open-source library Polly helped us to quickly implement a solution, turning those third-party failures from moments of doom into distant memories.
And for those who’ve noticed that asking users to redo actions after transient errors isn’t ideal for the user experience: a nice side benefit of this approach is that it fixes that as well.
Resiliency strategies
"Systems fail, but resilient systems ensure that users don’t feel it."
Imagine you are sitting in your car, the engine running, waiting for your wife to come out of the house. You honk the horn to let her know you are ready. Much like a request sent to an external API, you are now waiting on a response: for her to come over. There are several strategies you could employ if things take a while:
Set a timeout.
If your wife is taking too long putting on her shoes, it might be wise to turn off the engine (after all, we're all responsible for a better environment). Likewise, if a request takes too long, you might want to forget about it altogether and default to a fallback. In synchronous situations, you don’t want your users to wait too long or have requests pile up on your server.
Retry transient failures.
If you’re still waiting on your wife, maybe honking a second time will help move things along—perhaps she didn’t hear you the first time (though, personally, I have to admit that’s not a risk I’m willing to take). Similarly, some requests might fail due to temporary network glitches or brief server-side errors. Rather than letting the operation fail, consider retrying the request. This way, the user might only experience a slight delay, doesn’t need to retry manually, and still receives the intended result.
Pause communication during remote service outages.
If, after several minutes (and honks), you’re still waiting for your wife, perhaps she’s focused on her hair instead of her shoes—and it might be wiser to go back inside and have a coffee; no amount of honking will speed things up. In the context of resiliency, this is known as a circuit breaker: if a remote service is temporarily unavailable, it may be wise to halt all communication temporarily and resume once the service is back online. This prevents the execution of unnecessary, failing requests, saving resources and reducing alerts and error logs.
Establish fallback actions.
If you've finished your coffee and your wife still isn’t downstairs, it might be time to go help do her hair, though, to be fair, this might not be the best analogy (and it probably wouldn’t make things better...). In resiliency, however, it’s often wise to design fallback actions to execute when primary operations fail. Sometimes things just won’t work, and you’ve got to ensure your application can handle that gracefully.
Microsoft 💓 Polly
For years, an open-source project named Polly was the go-to solution for working these strategies into an application. After teaming up with the Polly community, Microsoft built its .NET 8 resiliency libraries on top of Polly v8. The Microsoft.Extensions.Http.Resilience package provides a concise, HTTP-focused layer atop the Polly library, enhancing it with features like telemetry, dependency injection support, and options-based configuration.
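To make the four strategies from the car analogy concrete, here is a rough sketch of what composing them by hand with plain Polly v8 might look like. The class name, the numbers, and the choice of a 503 response as fallback are assumptions for illustration only, not something the package prescribes.

```csharp
using System;
using System.Net;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;
using Polly.Fallback;
using Polly.Retry;
using Polly.Timeout;

// Hypothetical helper: a hand-rolled pipeline with all four strategies.
public static class HandRolledPipeline
{
    // Failures considered transient in this sketch (an assumption, tune as needed).
    private static PredicateBuilder<HttpResponseMessage> Transient() =>
        new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(r => (int)r.StatusCode >= 500
                || r.StatusCode == HttpStatusCode.RequestTimeout
                || r.StatusCode == HttpStatusCode.TooManyRequests);

    public static ResiliencePipeline<HttpResponseMessage> Build() =>
        new ResiliencePipelineBuilder<HttpResponseMessage>()
            // Strategies added first sit on the outside of the pipeline.
            // Fallback: go help with her hair (here: return a placeholder response).
            .AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
            {
                ShouldHandle = Transient().Handle<BrokenCircuitException>(),
                FallbackAction = _ => Outcome.FromResultAsValueTask(
                    new HttpResponseMessage(HttpStatusCode.ServiceUnavailable))
            })
            // Retry: honk again, waiting a bit longer each time.
            .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                ShouldHandle = Transient(),
                MaxRetryAttempts = 3,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
                Delay = TimeSpan.FromSeconds(2)
            })
            // Circuit breaker: stop honking and have a coffee.
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
            {
                ShouldHandle = Transient(),
                FailureRatio = 0.1,
                MinimumThroughput = 100,
                SamplingDuration = TimeSpan.FromSeconds(30),
                BreakDuration = TimeSpan.FromSeconds(5)
            })
            // Timeout: turn off the engine if a single attempt takes too long.
            .AddTimeout(TimeSpan.FromSeconds(10))
            .Build();
}
```

A call would then go through the pipeline, along the lines of `await pipeline.ExecuteAsync(async ct => await httpClient.GetAsync(url, ct), cancellationToken)`, where `httpClient` and `url` are whatever your application already uses.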
The basics
The way it works is that you configure resiliency handlers on your HTTP clients, so that requests made by these clients are handled in a resilient way, be it with retries, timeouts, circuit breakers, fallbacks, or any combination thereof. After installing the Microsoft.Extensions.Http.Resilience package, this can be as simple as:
services.AddHttpClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient")
    .AddStandardResilienceHandler();
The library exposes two standard resiliency handlers (each combining several of the resiliency strategies above, and then some) and a resilience handler that allows full configuration.
The standard resiliency handler - AddStandardResilienceHandler
The example above adds the standard resiliency handler to the HTTP client, which brings a number of resiliency strategies out of the box. Without further configuration, requests made through this client get the following defaults (on top of a rate limiter that allows 1,000 concurrent requests):
Total timeout strategy: 30s
Retry strategy: 3 retries, exponential backoff with jitter, base delay 2s
Circuit breaker strategy: failure ratio 10%, minimum throughput 100, sampling duration 30s, break duration 5s
Attempt timeout: 10s
The retry and circuit breaker strategies both handle a set of specific HTTP status codes and exceptions:
HTTP 500 and above (Server errors)
HTTP 408 (Request timeout)
HTTP 429 (Too many requests)
HttpRequestException
TimeoutRejectedException
Now, what does this mean exactly?
It means that if a single attempt takes longer than 10 seconds, or fails with one of the status codes or exceptions above, the request is retried after a delay. Exponential with jitter means that each retry waits longer than the previous one (exponential), with some randomness mixed in (jitter). The jitter keeps clients from retrying in lockstep, which smooths out spikes in the calls hitting the third party.
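To get a feel for exponential backoff with jitter, here is a minimal sketch of how such delays could be computed. It is a deliberately simplified illustration: the `BackoffSketch` class and the plus or minus 25% jitter range are my own assumptions, and Polly's actual jitter calculation is more elaborate.

```csharp
using System;

// Hypothetical helper, not part of Polly or the Microsoft package.
public static class BackoffSketch
{
    // retryAttempt is zero-based: 0 is the first retry.
    public static TimeSpan DelayFor(int retryAttempt, TimeSpan baseDelay, Random random)
    {
        // Exponential part: the base delay doubles with every retry (2s, 4s, 8s, ...).
        double exponentialMs = baseDelay.TotalMilliseconds * Math.Pow(2, retryAttempt);

        // Jitter part: scale by a random factor in [0.75, 1.25) so that
        // clients that failed together do not retry at the same instant.
        double jitterFactor = 0.75 + random.NextDouble() * 0.5;

        return TimeSpan.FromMilliseconds(exponentialMs * jitterFactor);
    }
}
```

With a 2-second base delay, the three retries in this sketch would wait roughly 2, 4, and 8 seconds, each nudged up or down by at most a quarter.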
The request will be retried up to 3 times, as long as the 30-second total timeout has not elapsed. If at least 100 requests pass through within a 30-second sampling window and 10% or more of them fail, the circuit breaker opens and no requests are made for 5 seconds. While the circuit is open, requests are not sent to the third party and instead fail immediately with a BrokenCircuitException.
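On the calling side, the exceptions that can still escape the handler are worth catching, so your own code decides what the user sees. Here is a sketch, assuming a hypothetical `ThirdPartyGateway` wrapper class and assuming that returning null is an acceptable "no data right now" signal in your application:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;
using Polly.Timeout;

// Hypothetical wrapper; only the client name comes from the registration above.
public sealed class ThirdPartyGateway
{
    private readonly IHttpClientFactory _factory;

    public ThirdPartyGateway(IHttpClientFactory factory) => _factory = factory;

    public async Task<string?> TryGetAsync(Uri requestUri, CancellationToken ct = default)
    {
        var client = _factory.CreateClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient");
        try
        {
            // Retries, timeouts and the circuit breaker all happen inside this call.
            using var response = await client.GetAsync(requestUri, ct);
            return response.IsSuccessStatusCode
                ? await response.Content.ReadAsStringAsync(ct)
                : null; // still failing after the retries were used up
        }
        catch (BrokenCircuitException)
        {
            return null; // circuit is open: the request was never sent
        }
        catch (TimeoutRejectedException)
        {
            return null; // an attempt or the total timeout ran out
        }
        catch (HttpRequestException)
        {
            return null; // network-level failure that survived the retries
        }
    }
}
```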
Customization
Customization can be done in two ways. Suppose you want to set the number of retries or the attempt timeout to something other than the default. One way is to configure the options directly.
services
    .AddHttpClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient")
    .AddStandardResilienceHandler(options =>
    {
        options.Retry.MaxRetryAttempts = 5;
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);
    });
Another possibility is to separate the configuration into a settings JSON file. This provides the benefit of dynamic reloading, which is enabled by default on the standard handler and allows the configuration to change at runtime without redeployment.
appsettings.json
{
  "RetryOptions": {
    "Retry": {
      "MaxRetryAttempts": 5
    }
  }
}
C#
// Bind the "RetryOptions" section of appsettings.json to the standard handler's options.
var retryOptionsSection = builder.Configuration.GetSection("RetryOptions");
services
    .AddHttpClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient")
    .AddStandardResilienceHandler()
    .Configure(retryOptionsSection);
There are many settings that can be configured for each of these strategies. For a complete guide, please refer to the sources listed below.
The standard hedging handler - AddStandardHedgingHandler
services.AddHttpClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient")
    .AddStandardHedgingHandler();
This handler is similar to the standard resiliency handler, but with a different retry mechanism. New in Polly v8, the hedging strategy issues additional concurrent requests when the first one is slow, with the aim of improving latency. The default settings are mostly the same as those of the standard resiliency handler, but with a hedging strategy in place of the retry strategy:
Hedging: minimum attempts 1, maximum attempts 10, delay 2s
What does this mean exactly?
It means that if a request is sent and takes longer than 2 seconds, a new request is sent alongside it. Multiple requests can therefore be in flight simultaneously, and the quickest valid response is used as the result. For example, if the first request hangs for 6 seconds and a second request finishes successfully in 0.1 seconds, the call succeeds after roughly 2.1 seconds (the 2-second hedging delay plus 0.1 seconds), instead of the 6 seconds it would take without hedging.
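The mechanics are easier to see in code. What follows is a bare-bones sketch of the hedging idea using nothing but tasks; it is not how Polly implements it, and the `HedgingSketch` class and its parameters are made up for this illustration. It treats any exception as a failed attempt and any returned value as a valid response.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper illustrating hedging; not Polly's implementation.
public static class HedgingSketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<int, CancellationToken, Task<T>> attempt,
        TimeSpan hedgingDelay,
        int maxAttempts,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var running = new List<Task<T>>();
        var started = 0;
        try
        {
            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();

                // Start the next attempt: on the first pass, after the hedging
                // delay elapsed, or right after an earlier attempt failed.
                if (started < maxAttempts)
                {
                    running.Add(attempt(started++, cts.Token));
                }

                // Only arm the hedging timer while more attempts are allowed.
                var timer = Task.Delay(
                    started < maxAttempts ? hedgingDelay : Timeout.InfiniteTimeSpan,
                    cts.Token);

                var finished = await Task.WhenAny(running.Cast<Task>().Append(timer));
                if (finished == timer) continue; // too slow: hedge with another attempt

                var done = (Task<T>)finished;
                running.Remove(done);
                if (done.IsCompletedSuccessfully) return done.Result; // quickest success wins

                // Out of attempts and nothing left in flight: surface the last failure.
                if (running.Count == 0 && started >= maxAttempts) return await done;
            }
        }
        finally
        {
            cts.Cancel(); // abandon whatever is still in flight
        }
    }
}
```

Feeding it a first attempt that hangs for 6 seconds, a second attempt that takes 0.1 seconds, and a 2-second hedging delay reproduces the scenario above: the second attempt's result comes back a little over 2 seconds in, and the first attempt is cancelled.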
Hedging can also be executed against multiple endpoints (A/B testing comes to mind). That's a bit out of scope for this blog post, but refer to the sources below for more information.
Note: Hedging is typically used for GET requests, as idempotency is a key requirement. Since GET requests are idempotent (they don’t change server state), hedging works well in those cases. However, for other HTTP methods (such as POST, PUT, or DELETE), idempotency cannot always be guaranteed, so use hedging with caution in those scenarios.
The custom resilience pipeline - AddResilienceHandler
It is possible to roll your own handler for those situations where you need full control over which strategies are used.
services
    .AddHttpClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient")
    .AddResilienceHandler("custom-pipeline", builder =>
    {
        builder
            .AddRetry(new HttpRetryStrategyOptions())
            .AddTimeout(new HttpTimeoutStrategyOptions());
    });
The code above registers an HTTP client with a custom handler that has a retry and a timeout strategy.
Again, the options of the individual strategies can be configured as needed.
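As an example of what that could look like, here is a sketch of a custom pipeline with a few options filled in. The extension method name and every number in it are arbitrary choices for illustration, not recommendations.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

// Hypothetical registration helper.
public static class CustomPipelineRegistration
{
    public static IServiceCollection AddThirdPartyClient(this IServiceCollection services)
    {
        services
            .AddHttpClient("ThirdPartyThatShouldContactAllPhiToFixItsIssuesHttpClient")
            .AddResilienceHandler("custom-pipeline", builder =>
            {
                // Order matters: strategies added first wrap the ones added later.
                builder.AddRetry(new HttpRetryStrategyOptions
                {
                    MaxRetryAttempts = 2,
                    BackoffType = DelayBackoffType.Exponential,
                    UseJitter = true,
                    Delay = TimeSpan.FromMilliseconds(500)
                });

                builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
                {
                    FailureRatio = 0.5,
                    MinimumThroughput = 20,
                    SamplingDuration = TimeSpan.FromSeconds(10),
                    BreakDuration = TimeSpan.FromSeconds(15)
                });

                // Innermost strategy, so this timeout applies to each attempt.
                builder.AddTimeout(TimeSpan.FromSeconds(3));
            });

        return services;
    }
}
```

Because the timeout is added last, it sits closest to the actual HTTP call and limits every individual attempt; adding it before the retry would instead cap the whole sequence of attempts.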
Sources
There is much more to be said about Microsoft's HTTP resiliency package. Should you ever be in need, these are some of the sources that helped me out:
Summary
By leveraging Microsoft’s Resilience package, you can seamlessly integrate robust resilience strategies into your .NET 8 applications, making them more stable, responsive, and user-friendly. This approach empowers developers to preemptively handle failures, so instead of spending time explaining why something went wrong, you can confidently demonstrate that everything is running smoothly—even when the world outside is on fire. Embracing these resilience practices doesn’t just create stronger applications; it builds trust with your users and lets you focus on innovation instead of firefighting. For any developer aiming to deliver reliability and peace of mind, Microsoft’s Resilience package is an essential tool in the .NET 8 ecosystem.