Last week we checked out the circuit-breaker. This week, we'll examine the retry pattern. The retry pattern exists to handle transient failures when calling a service or network resource. Requests temporarily fail for many reasons; examples include a flaky network connection, a site reloading after a deployment, or data that hasn't propagated to all instances yet.
In the following examples, I'm going to base failures on this intentionally simplified client/service request model. A successful request:
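Since the model is intentionally simple, here's roughly what it might look like in code; `call_service`, `TransientError`, and `PermanentError` are illustrative stand-ins rather than a real client library:

```python
class TransientError(Exception):
    """A failure that might clear up on its own (timeout, propagation lag)."""

class PermanentError(Exception):
    """A failure that retrying won't fix (for example, invalid credentials)."""

def call_service(payload):
    """Stand-in for a real network call: returns a response or raises one of the above."""
    return {"status": 200, "echo": payload}

# A successful request: the client sends a payload, the service responds.
response = call_service({"order_id": 42})
```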
The retry pattern encompasses more than the idea of just retrying: it also covers different ways to categorize issues and to make choices about whether and how to retry.
If the problem is a rare error, retry the request immediately:
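A minimal sketch of that choice, reusing the illustrative `call_service` and `TransientError` from the model above:

```python
def request_with_immediate_retry(payload, attempts=2):
    """Retry right away, a small fixed number of times, for rare transient errors."""
    last_error = None
    for _ in range(attempts):
        try:
            return call_service(payload)
        except TransientError as err:
            last_error = err  # a rare blip: trying again immediately is cheap
    raise last_error
```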
If the problem is a more common error, retry the request after some delay:
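A similar sketch with a fixed wait between attempts (the delay value is arbitrary):

```python
import time

def request_with_delayed_retry(payload, attempts=3, delay_seconds=0.5):
    """Wait between attempts when the error is common enough that an
    immediate retry would likely hit the same condition."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call_service(payload)
        except TransientError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # give the transient condition time to clear
    raise last_error
```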
If the issue indicates that the failure isn't transient (for example, invalid credentials, which are unlikely to succeed on subsequent requests), cancel the request:
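Sketched with the illustrative `PermanentError` standing in for that kind of failure:

```python
def request_with_fast_fail(payload, attempts=3):
    """Retry transient errors, but give up immediately on ones that won't improve."""
    last_error = None
    for _ in range(attempts):
        try:
            return call_service(payload)
        except PermanentError:
            raise  # e.g. invalid credentials: retrying won't help, surface it now
        except TransientError as err:
            last_error = err
    raise last_error
```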
Sometimes the client may think that there is an error even though the operation has successfully completed. If the operation isn't idempotent, a retry duplicates the request, and any state change can be applied twice:
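One common mitigation, though not the only one, is a client-generated idempotency key; the field name and the assumption that the service deduplicates on it are hypothetical here:

```python
import uuid

def place_order_with_retries(order, attempts=3):
    """Attach a client-generated key so a request that actually succeeded
    the first time isn't applied twice when the client retries it."""
    idempotency_key = str(uuid.uuid4())  # assumes the service deduplicates on this
    last_error = None
    for _ in range(attempts):
        try:
            return call_service({"order": order, "idempotency_key": idempotency_key})
        except TransientError as err:
            last_error = err  # safe to resend: the service can spot the duplicate
    raise last_error
```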
Regardless of the type of issue, track retries to ensure they don't continue endlessly. Implementing a circuit-breaker in this situation can help limit the impact of a retry storm on a failed or recovering service.
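A rough sketch of combining the two, where `breaker` is any object exposing a hypothetical `allow()` / `record_success()` / `record_failure()` interface rather than a specific library:

```python
def request_with_breaker(payload, breaker, attempts=3):
    """Cap the number of retries and consult a circuit breaker before each
    attempt so a failing or recovering service isn't hit by a retry storm."""
    last_error = None
    for _ in range(attempts):
        if not breaker.allow():
            raise TransientError("circuit open; skipping the call entirely")
        try:
            response = call_service(payload)
            breaker.record_success()
            return response
        except TransientError as err:
            breaker.record_failure()
            last_error = err
    raise last_error
```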
When operating a service, it's important to track metrics around retries. Metrics can tell you a lot about your customers' experience with your service. It might be the case that the service needs to be scaled out or up. Additionally, even if the problem is on the client's end, if customers are repeatedly experiencing degraded service, that experience is going to reflect on your service.
I managed a service that used Apache Traffic Server at one point. Apache Traffic Server did not handle Expect: request headers as specified by the HTTP/1.1 spec. Requests of 1 KB and larger from applications built with the cURL library had 1-second delays unless customers explicitly turned off the Expect: header. Compounding the issue, retries due to not receiving a response within the expected 50 ms response time led to retry storms. When I saw this pattern in my logs, I wrote up documentation and set up tools that would catch this issue so that I could reach out to customers proactively.
When using a third-party service or external tool, make sure that you're not layering retry logic. Layering retries at different levels or cascading retries can lead to increased latency where failing fast would have been the preferred operation.
Many SDKs include retry configurations. Generally, there will be a maximum number of retries and a delay factor, where the delay before each subsequent retry is an exponential, incremental, or randomized interval.
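As an illustration of those delay strategies (the base delay and growth rates here are arbitrary numbers, not any particular SDK's defaults):

```python
import random

def backoff_delay(attempt, base=0.2, strategy="exponential"):
    """Compute how long to wait before retry number `attempt` (0-based)."""
    if strategy == "exponential":
        return base * (2 ** attempt)              # 0.2s, 0.4s, 0.8s, ...
    if strategy == "incremental":
        return base * (attempt + 1)               # 0.2s, 0.4s, 0.6s, ...
    # randomized: pick a delay anywhere up to the exponential bound ("jitter")
    return random.uniform(0, base * (2 ** attempt))
```

An SDK will typically also cap the computed delay at some maximum so retries don't end up waiting minutes.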
Separate the transient issues in logging so that they are warnings rather than errors. While they should be monitored, paging someone for something that self-resolves is a recipe for alert fatigue, especially when it pages folks during off-hours. There is nothing so frustrating as losing sleep over something that fixed itself.
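For example, with Python's standard logging module, the transient case can stay a warning while the non-transient case remains an error (again using the illustrative exceptions from the earlier sketches):

```python
import logging

logger = logging.getLogger("checkout")
payload = {"order_id": 42}

try:
    response = call_service(payload)
except TransientError as err:
    logger.warning("transient failure, will retry: %s", err)  # monitor, don't page
except PermanentError as err:
    logger.error("permanent failure, giving up: %s", err)     # this one should alert
    raise
```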
Top comments (7)
Nice article!
Before you go and implement this pattern, note that YOU MIGHT REALLY REALLY WANT TO LOOK FOR AN ALREADY EXISTING SOLUTION!
The recipe comes in with an MQ (a message queue, like RabbitMQ) and a task queue.
Existing Task Queues:
I read this more as if the client is retrying the requests. A message queue would (to the client) silently retry. Waiting for the response could potentially take a long time.
If the client is implementing this pattern it can keep the user informed of what's going on, instead of having the users wait potentially a few seconds for the request to finish. This also has the benefit of leaving the user experience to the client (long response times in general being detrimental to UX).
I prefer message queues for deferred work where the client doesn't really care about getting an immediate result.
It always depends... whether the user does or does NOT care about the result.
Let's make this more concrete:
Say a user places an order: he enters his credit card and pays, so he really cares about the response.
Placing an order would require a transaction across multiple tables (to deduct the quota of the product and create an order at the same time).
But but but, if you use such a transaction, your system will be doomed to be non-scalable, as this might cause a deadlock, a bad one actually!
So yeah, although the user cares about the response, you will really need an MQ to avoid deadlocks and return the result when it comes, even if that's a few seconds later.
There are definitely some subtleties in different patterns. One of the reasons I wanted to share a little bit more about patterns is I think too often folks assume shared understanding. Patterns are not algorithms. They are ways we talk about the architecture we are building.
It's really hard to keep descriptions of a single pattern specific and clear, but the example you're giving with financial transactions is why I brought up idempotency. I could have gone into greater detail there, but it's a delicate balance of info loading and providing enough information at appropriate times. Transactions are definitely not idempotent in nature. If you keep applying a -$5 charge for coffee against an account, you'll quickly rack up a lot of credit charges. So it's a balance between retries, fast fails, and complete rollbacks.
I'm curious though; why bring up the message queue pattern here? Retry to me is a lower level pattern. How do you communicate to a message queue? What compromises do you make when sending events into the message queue?
You're absolutely right, patterns are just like a way of communication.
The retry pattern is meant for resilience.
Task queues (like Celery) already have the implementation of the retry pattern, and task queues require MQs (or brokers) to do their job.
So, to achieve a greater level of resilience, both a task queue and an MQ are necessary.
Actually, this pattern goes with a couple of other design decisions (like microservices and event-driven architecture).
Because of that, the biggest compromise when using such an architecture is that your program would often need a complete rewrite.
Say for the previous example (of using a transaction to place an order), you would instead create an event "order_created".
This will trigger other services like shippingService & chargingService to do the next steps.
Suppose the chargingService fails in the middle; it will retry (because the "charging_credit_card" message won't leave the queue for a couple of retries). If it ultimately fails, it will trigger another event to roll back across multiple services.
Another useful article! Thanks!
Nice article!
What do you mean by a rare error and a common error? Can you provide some examples?