I had a client get an unexpectedly high AWS bill whose root cause was an AWS ECS service stuck in a restart loop. Although I’ve encountered these before, I decided to dig a little deeper and write up a series of blog posts on the subject.
In the previous post, I described in more detail what an ECS Restart Loop is and what it looks like, for people who haven’t encountered one or haven’t spent much time looking into them.
Now I’m going to talk about AWS ECS features that can help deal with restart loops: service throttling and circuit breakers.
Deployment Circuit Breakers
Deployment circuit breakers were introduced to ECS in late 2020. These work reasonably well if your infinite restarts are caused by a broken deployment, and can even roll back your service to the previous working state.
This is what the ECS events look like with a failed deployment:
It's worth pointing out that the failure threshold (a minimum of 10 failures) can take a while to reach, particularly on small services. How long depends partly on how long a single health check takes to fail, but in extreme cases, particularly for services with a very small number of tasks, it can take up to an hour for the circuit breaker to kick in.
Deployment circuit breakers solve the most common case and are easy to turn on, so this is a great starting point. If you're using ECS and you haven't looked into circuit breakers, I recommend you do so now.
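If you want to flip them on for an existing service without touching your infrastructure-as-code, here's a minimal sketch using boto3; the cluster and service names are placeholders for your own:

```python
import boto3

ecs = boto3.client("ecs")

# Enable the deployment circuit breaker, with automatic rollback, on an
# existing service. Future deployments that keep failing will be stopped
# and rolled back to the last working task definition.
ecs.update_service(
    cluster="my-cluster",    # placeholder: your cluster name
    service="my-service",    # placeholder: your service name
    deploymentConfiguration={
        "deploymentCircuitBreaker": {
            "enable": True,
            "rollback": True,
        }
    },
)
```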
There are limitations:
- Not all infinite restart problems happen during deployment
- It's still possible to create an ECS service with deployment circuit breakers disabled
- The failure threshold is not configurable
Throttling
ECS service throttle logic can help in the specific scenario where a service's tasks fail to reach the RUNNING state: when ECS tries to start a task and it cannot be started, it throttles the rate at which it re-attempts to start the container.
The limitations here are important:
- Only employed if a task goes from PENDING to STOPPED
- Most commonly triggered when a task doesn't have the necessary resources, or the Docker image can't be pulled
This can be helpful outside of a deployment context, but in most scenarios where I've encountered ECS service restart loops, it would not have helped.
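If you're trying to work out whether you're in that narrow scenario, one approach is to look at why recently stopped tasks were stopped. A rough boto3 sketch (cluster and service names are placeholders); a stopCode of TaskFailedToStart corresponds to the PENDING → STOPPED case above:

```python
import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"    # placeholder: your cluster name
service = "my-service"    # placeholder: your service name

# Find recently stopped tasks for the service and print why they stopped.
stopped = ecs.list_tasks(
    cluster=cluster, serviceName=service, desiredStatus="STOPPED"
)["taskArns"]

if stopped:
    for task in ecs.describe_tasks(cluster=cluster, tasks=stopped)["tasks"]:
        # stopCode is TaskFailedToStart for tasks that never reached RUNNING
        # (e.g. image pull failures or insufficient resources); stoppedReason
        # carries the human-readable detail.
        print(task["taskArn"], task.get("stopCode"), task.get("stoppedReason"))
```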
Recommendations
Firstly, if you're not already using circuit breakers, you probably should be. Unless you've already evaluated them and decided they aren't right for your environment, I'd recommend enabling them by default. If you have organizational defaults in a tool like Terraform modules, CDK, or Pulumi, make enabling circuit breakers your default.
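As an illustration, here's roughly what that default could look like in CDK (Python, CDK v2). The VPC, cluster, and container image are stand-ins; the relevant part is passing circuit_breaker so every service your shared constructs create gets one:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from constructs import Construct


class WebServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder networking and cluster; in a shared construct these
        # would usually be looked up or passed in.
        vpc = ec2.Vpc(self, "Vpc", max_azs=2)
        cluster = ecs.Cluster(self, "Cluster", vpc=vpc)

        task_def = ecs.FargateTaskDefinition(self, "TaskDef")
        task_def.add_container(
            "app",
            image=ecs.ContainerImage.from_registry("public.ecr.aws/nginx/nginx:latest"),
        )

        # The organizational default: every service gets a deployment
        # circuit breaker with automatic rollback.
        ecs.FargateService(
            self,
            "Service",
            cluster=cluster,
            task_definition=task_def,
            circuit_breaker=ecs.DeploymentCircuitBreaker(rollback=True),
        )


app = App()
WebServiceStack(app, "WebServiceStack")
app.synth()
```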
Secondly, the place where I've seen this the most is during deployment. If you're automating deployments, consider making sure that ECS considers the deployment successful and complete before your automation finishes. That might mean the automation needs to wait for the service deployment to complete and the service to be healthy, which can take time, but it's better than doing a deployment and assuming everything's OK until you discover later that it's not.
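For example, a deploy script could block on boto3's built-in services_stable waiter and then confirm the rollout state, rather than exiting as soon as the new task definition is registered. A rough sketch, with the names and task definition ARN as placeholders:

```python
import boto3

ecs = boto3.client("ecs")
cluster, service = "my-cluster", "my-service"      # placeholders
deployed_task_def = "my-task-def-arn"              # placeholder: the task definition you just deployed

# Block until ECS reports the service stable (running count matches desired
# count and the deployments have settled). This can take several minutes.
ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])

# "Stable" also covers a successful rollback, so confirm the primary
# deployment is actually running the task definition you just pushed.
svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
primary = next(d for d in svc["deployments"] if d["status"] == "PRIMARY")

if primary["taskDefinition"] != deployed_task_def or primary.get("rolloutState") != "COMPLETED":
    raise SystemExit(f"Deployment did not complete: {primary.get('rolloutStateReason')}")
```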
Is this enough?
So if ECS has features to deal with restart loops, is that enough? Is the problem solved? No.
Circuit breakers and deployment automation will help with failed deployments, but that's not the only reason an ECS service might become unhealthy.
They're also something you have to put in place, a precaution you have to deliberately take. In lots of organizations, you might have many different workloads built by different teams. If a team spins up their first ECS service and doesn’t turn on circuit breakers, will you catch that? Do you have the right policies in place to make sure that can’t happen? Or do you have monitoring to catch it when it does?
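One low-tech way to answer that question is a periodic audit: walk every cluster and flag services that don't have the circuit breaker enabled. A rough boto3 sketch:

```python
import boto3

ecs = boto3.client("ecs")

# Flag every service, in every cluster, that has no deployment circuit breaker.
for cluster_arn in ecs.list_clusters()["clusterArns"]:
    pages = ecs.get_paginator("list_services").paginate(
        cluster=cluster_arn, PaginationConfig={"PageSize": 10}
    )
    for page in pages:
        if not page["serviceArns"]:
            continue
        # describe_services accepts at most 10 services per call.
        for svc in ecs.describe_services(
            cluster=cluster_arn, services=page["serviceArns"]
        )["services"]:
            breaker = svc.get("deploymentConfiguration", {}).get("deploymentCircuitBreaker", {})
            if not breaker.get("enable"):
                print(f"{cluster_arn}: {svc['serviceName']} has no circuit breaker")
```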
That’s the topic of the next post: Monitoring AWS for ECS Restart Loops.