Table of Contents
- The Main Compute Primitives for AWS
- Control Planes vs Data Planes
- How Each Primitive Handles Concurrency During a Control Plane Outage
- The Takeaway
- References
The Main Compute Primitives for AWS
AWS (Amazon Web Services) has three compute primitives on which almost all of its other compute offerings are built:
- Virtual Machines (i.e. EC2, Elastic Compute Cloud)
- Containers (e.g. Fargate, used with either ECS, Elastic Container Service, or EKS, Elastic Kubernetes Service)
- Functions (i.e. Lambda)
Each comes with its own set of tradeoffs, but there is one subtle tradeoff that only manifests during certain AWS outages.
To understand that tradeoff, we first need to understand the concepts of “control planes” and “data planes” of AWS services.
Control Planes vs Data Planes
Every AWS compute service is separated into two logical components: a control plane and a data plane.
The data plane is responsible for actually running the hardware and software powering the compute. Think of the physical server running a virtual machine, for example.
The control plane is responsible for making changes to the data plane. If you want to add a new virtual machine to the data plane, you have to make a request to the control plane for it to do so on your behalf.
Static Stability
Services are designed this way in part to be more fault tolerant. If an outage occurs in the control plane, the data plane will continue working without issue.
And in general, outages in control planes are more common than outages in data planes.
This leads to the concept of “static stability”: as long as your workload doesn’t depend on a control plane, it will remain stable during most AWS outages.
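To make the split concrete, here is a minimal toy model in Python. It is only a sketch: `ToyComputeService`, `launch_instance`, and `serve_request` are hypothetical names invented for illustration and are not AWS APIs. It shows the one property that matters here: requests served by capacity that is already running keep working while the control plane is down, but any change to that capacity does not.

```python
class ControlPlaneUnavailable(Exception):
    """Raised when a change is requested while the control plane is down."""


class ToyComputeService:
    """Toy model of a compute service split into a control plane and a data plane."""

    def __init__(self, running_instances: int) -> None:
        self.running_instances = running_instances  # data plane state
        self.control_plane_healthy = True

    def launch_instance(self) -> None:
        """Control plane operation: changes what the data plane runs."""
        if not self.control_plane_healthy:
            raise ControlPlaneUnavailable("cannot launch instances right now")
        self.running_instances += 1

    def serve_request(self) -> str:
        """Data plane operation: handled by instances that are already running."""
        if self.running_instances == 0:
            raise RuntimeError("no running instances")
        return "ok"


service = ToyComputeService(running_instances=2)
service.control_plane_healthy = False  # simulate a control plane outage

# Existing capacity keeps serving: this is the statically stable part.
status = service.serve_request()

# Anything that needs a change to the data plane fails until the outage ends.
try:
    service.launch_instance()
    launch_succeeded = True
except ControlPlaneUnavailable:
    launch_succeeded = False

print(status, launch_succeeded, service.running_instances)
```

A workload is statically stable in this model if everything it does during the outage goes through `serve_request` and nothing goes through `launch_instance`.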
How Each Primitive Handles Concurrency During a Control Plane Outage
But your existing workloads being stable during an AWS outage might not be enough. What if there’s a surge in load that they need to respond to? Will they be able to scale up to meet that demand?
Specifically, there’s the question of the maximum concurrency a workload can support during an AWS service control plane outage.
The answer to this question (perhaps surprisingly) depends on the compute primitive involved.
EC2
Under normal conditions, a workload facing increased concurrency can scale out to handle it (e.g. an ASG, Auto Scaling group, can launch more virtual machines).
However, during an outage of the EC2 control plane this isn’t possible, because launching new instances requires requests to the control plane.
This means that during the outage, the maximum concurrency a workload can support is fixed and cannot be increased. Any requests exceeding this limit will fail.
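The arithmetic is worth spelling out. The sketch below is a simplified model, not a description of how Auto Scaling actually works: the function names are hypothetical, the numbers (4 instances, 50 concurrent requests per instance, a surge to 350) are made up, and it assumes scale-out is instantaneous when the control plane is healthy.

```python
import math


def instances_needed(concurrent_requests: int, per_instance_limit: int) -> int:
    """Smallest fleet size that can hold the given concurrency."""
    return math.ceil(concurrent_requests / per_instance_limit)


def handle_surge(
    concurrent_requests: int,
    current_instances: int,
    per_instance_limit: int,
    control_plane_healthy: bool,
) -> tuple[int, int]:
    """Return (served, failed) for a surge hitting a fleet of virtual machines."""
    if control_plane_healthy:
        # Scaling out means asking the control plane to launch more instances.
        needed = instances_needed(concurrent_requests, per_instance_limit)
        current_instances = max(current_instances, needed)
    # During an outage the fleet size, and therefore capacity, stays fixed.
    capacity = current_instances * per_instance_limit
    served = min(concurrent_requests, capacity)
    return served, concurrent_requests - served


normal = handle_surge(350, 4, 50, control_plane_healthy=True)
outage = handle_surge(350, 4, 50, control_plane_healthy=False)
print(normal)  # (350, 0): the fleet grows to 7 instances
print(outage)  # (200, 150): stuck at 4 instances, 150 requests fail
```

The only difference between the two calls is whether the control plane can be reached, which is exactly what decides whether the extra 150 requests are served.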
Fargate
Fargate behaves similarly to EC2, as starting new tasks (or pods, on EKS) requires a request to the ECS or EKS control plane.
So during a control plane outage, any requests exceeding the fixed maximum concurrency of the workload will fail.
Lambda
Lambda is the odd one out.
Under normal conditions, when concurrency increases, Lambda starts new execution environments to handle the load.
But the subtlety here is that this behavior is part of the Lambda data plane, NOT the control plane.
This means that during an outage of the Lambda control plane, a workload can still handle essentially arbitrary concurrency of requests (only limited by your account’s quota on concurrent executions).
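A comparable sketch for Lambda looks different in one important way: the health of the control plane does not appear in it at all. As before, this is a simplified model with hypothetical names. It ignores cold start latency and any limits on how quickly concurrency can ramp up, and the quota of 1000 is an assumed value for illustration; check the actual concurrent executions quota on your own account.

```python
def lambda_surge(
    concurrent_requests: int,
    warm_environments: int,
    account_concurrency_quota: int,
) -> tuple[int, int, int]:
    """Return (environments, served, throttled) for a surge hitting a function.

    There is deliberately no control plane parameter: in this model new
    execution environments are created by the data plane on demand.
    """
    # Grow to match demand, but never beyond the account's concurrency quota.
    wanted = max(warm_environments, concurrent_requests)
    environments = min(wanted, account_concurrency_quota)
    served = min(concurrent_requests, environments)
    return environments, served, concurrent_requests - served


ASSUMED_QUOTA = 1000  # illustrative value, not read from any real account

within_quota = lambda_surge(350, warm_environments=10, account_concurrency_quota=ASSUMED_QUOTA)
over_quota = lambda_surge(1500, warm_environments=10, account_concurrency_quota=ASSUMED_QUOTA)
print(within_quota)  # (350, 350, 0)
print(over_quota)    # (1000, 1000, 500): only the quota limits concurrency
```

In this model the only ceiling on concurrency is the quota, which is a limit you know ahead of an outage and can plan around.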
The Takeaway
If you need to handle arbitrary concurrency while the control plane of a service is impaired, Lambda provides the best tradeoff.
Fargate or EC2 (or any more managed service built on top of them, e.g. Elastic Beanstalk) cannot scale beyond the capacity that was already running when the outage began.
References
- AWS whitepaper on fault isolation boundaries that defines “static stability” and references the lack of ability of EC2-based workloads to autoscale during control plane outages
- Lambda whitepaper that says scaling occurs at the level of the data plane