As part of my work developing the Booster Framework, I was in charge of creating heavy load tests to ensure that its auto-scaling feature works and that data integrity is preserved (under the AWS provider).
The problem
After many hours of work, load testing Booster at 3,000 requests per second was working pretty well. However, sometimes we saw this problem:
- The final counter of processed events had to reach 60,000 for the test to be considered successful, but sometimes it got stuck at a number very close to the right one (for example, 59,489).
- We didn't see any error anywhere.
- Suddenly, after 5 minutes or so, the right number appeared.
This was unacceptable.
Searching for the root cause
After a lot of investigation, we saw that 0.00001% of all the DynamoDB calls (it didn't matter if they were a “query”, “get” or “put”) were taking 4 or 5 minutes to succeed, while the usual time for them was between 5 and 20 milliseconds.
– "Aha! That's the problem!", I thought.
Yes, but why were those calls taking that long?
The AWS SDK has a built-in retry mechanism. When you do a DynamoDB call, for example, it will automatically redo the request if it gets an error that’s considered “retriable” (like a temporary network error).
It keeps doing retries using an exponential back-off algorithm, meaning that it will wait 50ms before doing the first retry. If it fails again, it will wait twice that time, 100ms, to do the next retry. If it fails again, it will wait 200ms, then 400ms, then 800ms, then 1.6 seconds, etc.
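To make that schedule concrete, here is a minimal sketch of how such a delay can be computed (the 50ms base matches the default the SDK uses for DynamoDB; the real SDK may also apply jitter, so take this only as an illustration):

// Back-off delay (in milliseconds) before retry number `retryCount` (0-based):
// 50, 100, 200, 400, 800, 1600, ...
const backoffDelayMs = (retryCount: number, baseMs = 50): number =>
  baseMs * 2 ** retryCount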
– "That's why calls take so long!!!", I told to myself.
– "Nein, nein, nein!!!", the facts replied to me
It turned out that only 10 retries are attempted by default, and their back-off waits (50ms + 100ms + ... + 25.6 seconds) add up to only about 51 seconds in total.
Where did the rest of the time (more than 4 minutes) come from?
To the point: it turned out that those 0.00001% of DynamoDB calls were made with an HTTP socket in a bad state (because these things happen randomly), and the requests never reached DynamoDB.
The AWS SDK has a default timeout of 2 minutes, so the call was stuck for that amount of time before returning an error (a "socket timeout error").
This means that, in order for the first retry to occur, we need to wait 2 minutes. Then, that retry may also find the socket in a bad state (it is being reused by thousands of requests), so we wait 2 more minutes.
This explains why a simple DynamoDB call occasionally took 4 or 5 minutes: two consecutive 2-minute socket timeouts, plus the back-off waits between retries, add up to just over 4 minutes.
Solution
Surprisingly, the solution was simply to reduce the timeout from 2 minutes to, for example, 5 seconds. This way we fail much earlier, the AWS SDK can do the next retry (with a new socket connection) sooner, and the chances of succeeding are much higher.
This is the code to do this in the TypeScript AWS SDK (v2):
import { DynamoDB } from 'aws-sdk'

// Keep the default 10 retries, but fail each attempt after 5 seconds
// instead of the default 2 minutes, so a bad socket is abandoned quickly.
const dynamoDB: DynamoDB.DocumentClient = new DynamoDB.DocumentClient({
  maxRetries: 10,
  httpOptions: {
    timeout: 5000, // milliseconds
  },
})
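If you are on the newer modular v3 SDK, the equivalent configuration looks roughly like the sketch below (to the best of my knowledge; double-check the exact option names against the current docs, and note that maxAttempts counts the initial attempt too):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { NodeHttpHandler } from '@smithy/node-http-handler'

// Roughly equivalent settings for AWS SDK v3: cap each attempt at 5 seconds
// and allow up to 10 attempts in total.
const client = new DynamoDBClient({
  maxAttempts: 10,
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 5000,
    requestTimeout: 5000,
  }),
})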
Lessons learned
- a) Networking can fail randomly for no apparent reason, even when there are no bugs, simply because of network conditions.
- b) Timeouts are extremely important. Every request should have a timeout; otherwise it can get stuck because of point a) and there is no way to stop it.
- c) Retries are a must for any request. Because of points a) and b), you need a retry policy. Not having one is simply a bug, because any network request is expected to fail occasionally for no apparent reason. Skipping the retry policy is like writing a switch with no default clause and missing a case condition. (Points b) and c) are combined in the sketch after this list.)
- d) Load tests are the most useful tool for finding the kinds of errors you would never think of.
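To make points b) and c) concrete, here is a minimal sketch (plain TypeScript, not Booster or AWS SDK code; the names and defaults are illustrative) of wrapping any asynchronous call with a timeout and exponential back-off retries:

// Wrap any async call with a per-attempt timeout (lesson b)
// and exponential back-off retries (lesson c).
async function callWithTimeoutAndRetries<T>(
  doCall: () => Promise<T>,
  { timeoutMs = 5000, maxRetries = 10, baseDelayMs = 50 } = {}
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      // Fail fast if this attempt takes longer than `timeoutMs`
      return await Promise.race([
        doCall(),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error('Timed out')), timeoutMs)
        ),
      ])
    } catch (error) {
      lastError = error
      // Wait 50ms, 100ms, 200ms, ... before the next attempt
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt)
      )
    }
  }
  throw lastError
}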
Extra tip!
If you find yourself facing similar abnormally long waits when doing any AWS SDK call, try setting the "maxRetries" parameter to 0 and running again. This will expose the error that is causing the retries, so you will be able to act upon it.
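For example, with the same v2 DocumentClient as above (the client name here is just illustrative):

// Disable retries temporarily so the underlying error surfaces immediately
const debugDynamoDB = new DynamoDB.DocumentClient({
  maxRetries: 0,
  httpOptions: {
    timeout: 5000,
  },
})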
Of course, don't forget to enable the retries again after the error is fixed!!