As of September 2021, Step Functions natively integrates with the AWS SDK, expanding the number of supported AWS services from 17 to over 200 and AWS API actions from 46 to over 9,000. As a result, Step Functions can interact with any service Lambda integrates with (through the SDK). This opens up a whole new ecosystem of use cases where Step Functions can replace Lambda.
Is it worth replacing Lambda with Step Functions?
The answer depends on the criterion used to compare the two services:
- 🚀 Performance - which has the lowest response latency?
- 💰 Cost - which is the most cost-effective?
- 💻 Developer experience - which is the easiest / fastest / most pleasant to set up and maintain?
- 🔒 Security - which is the most secure?
TL;DR This article focuses on performance and shows that Lambda executes on average 3x faster than Step Functions for a specific use case. However, Step Functions' smallest response time overall is the same as Lambda's.
To get these results, I created a lambda and a state machine that perform the same tasks (querying items from and storing items in DynamoDB), then invoked each of them 1000 times through API Gateway, synchronously and one request after the other.
NB: Since I'm using Step Functions in express mode - whose pricing is based on Lambda's - I'm also comparing services' cost.
1. Setting up the architecture
Choosing a recurring lambda template to substitute
For this experiment to be meaningful, I wanted to replicate a realistic lambda template, one that I use in my projects. In my CQRS experience, I have repeatedly used Lambda to react to HTTP requests through API Gateway and store new events in DynamoDB. As an example, in a bike rental app, I would have used a lambda to react to a "/rent-bike" POST request, in order to query the bike's last event, check that it exists, check that the bike is not already rented, and save a new event.
The code would have looked like this:
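// Assumed import: the http-errors package exposes createHttpError.NotFound / .Conflict.
import createHttpError from 'http-errors';
// EventStore and RentEvent are the project's DynamoDB helpers, defined elsewhere.

// `body` is expected to be already parsed into an object (e.g. by a middleware).
const handler = async ({ body }) => {
  const { id } = body;

  // QUERY EVENTS: fetch only the most recent event for this bike
  const {
    Items: [lastEvent],
  } = await EventStore.query(id, {
    consistent: true,
    reverse: true,
    limit: 1,
  });

  // CHECK IF EVENTS EXIST
  if (!lastEvent) {
    throw new createHttpError.NotFound();
  }

  // CHECK FOR CONFLICTS: the bike is already rented
  if (lastEvent.type === 'Rented') {
    throw new createHttpError.Conflict();
  }

  // SAVE A NEW EVENT
  await RentEvent.put({
    id,
    version: lastEvent.version + 1,
  });
};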
Replicating the lambda behaviour in a state machine
Before the release of the AWS SDK integration, Step Functions could only operate on a single DynamoDB item at a time (getItem, putItem, deleteItem, updateItem). Since the release, it can also query multiple items through the AWS SDK. I was therefore able to replicate the behaviour of the "rentBike" lambda with Step Functions, this time for returning the bike, in a "returnBike" state machine.
Once the bike-rental architecture was set up, it was time to compare the performance of the lambda and the state machine.
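To give an idea of what such a state machine can look like, here is a minimal sketch of an Amazon States Language definition, written as a JavaScript object. It is not the definition used in the experiment: the table name, the `id` / `type` / `version` attribute names and the `'Returned'` event type are assumptions made for illustration, and the version increment done by the lambda is left out to keep the sketch short.

```javascript
// Hypothetical table name, for illustration only.
const EVENTS_TABLE = 'bike-events';

const returnBikeDefinition = {
  StartAt: 'QueryLastEvent',
  States: {
    // AWS SDK integration: arn:aws:states:::aws-sdk:<service>:<apiAction>
    QueryLastEvent: {
      Type: 'Task',
      Resource: 'arn:aws:states:::aws-sdk:dynamodb:query',
      Parameters: {
        TableName: EVENTS_TABLE,
        KeyConditionExpression: 'id = :id',
        // Keys ending in ".$" are resolved from the state input.
        ExpressionAttributeValues: { ':id': { 'S.$': '$.id' } },
        ConsistentRead: true,
        ScanIndexForward: false, // newest event first
        Limit: 1,
      },
      ResultPath: '$.query',
      Next: 'CheckLastEvent',
    },
    CheckLastEvent: {
      Type: 'Choice',
      Choices: [
        // No event at all: the bike does not exist.
        { Variable: '$.query.Count', NumericEquals: 0, Next: 'NotFound' },
        // Assumed conflict rule: the bike has already been returned.
        {
          Variable: '$.query.Items[0].type.S',
          StringEquals: 'Returned',
          Next: 'Conflict',
        },
      ],
      Default: 'SaveReturnedEvent',
    },
    NotFound: { Type: 'Fail', Error: 'NotFound' },
    Conflict: { Type: 'Fail', Error: 'Conflict' },
    SaveReturnedEvent: {
      Type: 'Task',
      Resource: 'arn:aws:states:::aws-sdk:dynamodb:putItem',
      Parameters: {
        TableName: EVENTS_TABLE,
        Item: {
          id: { 'S.$': '$.id' },
          type: { S: 'Returned' },
          // The "version + 1" attribute of the lambda is omitted in this sketch.
        },
      },
      End: true,
    },
  },
};
```

The structure mirrors the lambda step by step: one query task, one choice state for the two checks, and one put task.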
2. Comparing services performance
Querying both resources
I used a Postman collection to call the "/rent-bike" (Lambda) and "/return-bike" (Step Functions) endpoints 1000 times each, synchronously and one request after the other.
The following CloudWatch widget displays the average integration latency (~65 requests on each resource per minute) of the "rentBike" lambda (in blue) and the "returnBike" state machine (in orange).
On average, API Gateway integrates 3x faster with Lambda than with Step Functions. 🤯
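For readers who would rather script this than use Postman, the same "one request after the other" pattern can be sketched in a few lines of JavaScript. This is not the collection used for the experiment: `runSequentially` is a hypothetical helper, the URL in the usage comment is a placeholder, and the durations it records are client-side timings rather than the API Gateway integration latency metric.

```javascript
// Calls `request` n times, waiting for each call to finish before starting
// the next one, and returns the duration of every call in milliseconds.
async function runSequentially(n, request) {
  const durations = [];
  for (let i = 0; i < n; i += 1) {
    const start = performance.now();
    await request(i);
    durations.push(performance.now() - start);
  }
  return durations;
}

// Usage sketch (placeholder URL, assumes a runtime with a global fetch):
// const durations = await runSequentially(1000, () =>
//   fetch('https://example.com/rent-bike', {
//     method: 'POST',
//     body: JSON.stringify({ id: 'bike-1' }),
//   })
// );
```

Awaiting inside the loop is what makes the calls successive: no request starts before the previous one has returned.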
Splitting the integration latency
To get a better understanding of the performance, I split the integration latency as follows:
1) API Gateway integration & network latency
There is a 10ms difference between Lambda and Step Functions, which might be due to the difference between the API Gateway Lambda proxy integration and the AWS integration.
One way to close the gap could thus be to use an HTTP API, in which the API Gateway integrations with Lambda and Step Functions are the same, rather than a REST API.
2) Execution duration
The bulk of the performance difference (40ms) between Lambda and Step Functions lies in the resources' execution time.
On the one hand, it could be explained by the memory allocated to each service: each lambda is allocated 1 GB of RAM by default, whereas a state machine is allocated 64 MB.
On the other hand, I used X-Ray to deep dive into execution and response times. The following graphs show the execution duration distributions of the 1000 queries made to both services (⚠️ x-axis scales are different).
Apart from the single query that underwent a cold start, the Lambda distribution shows that executions behave homogeneously overall: 80% of the Lambda executions fall within 20ms. In contrast, the different peaks in the Step Functions distribution suggest that the state machine benefits from multiple optimisations that occur more sporadically.
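Figures such as "80% of executions" or a p95 can be recomputed from the raw durations. The sketch below shows one way to do it, assuming the durations are available as an array of milliseconds. Both helper names are hypothetical, and the nearest-rank percentile used here is an assumption: CloudWatch and X-Ray may compute percentiles differently.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of samples (between 0 and 1) falling inside [low, high].
function shareWithin(samples, low, high) {
  if (samples.length === 0) throw new Error('no samples');
  const inside = samples.filter((s) => s >= low && s <= high).length;
  return inside / samples.length;
}
```

With the 1000 durations of each service, `percentile(durations, 95)` gives a p95 and `shareWithin(durations, low, high)` gives the fraction of executions inside a chosen window.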
However, when all Step Functions optimisations are gathered, Step Functions performs as well as Lambda: the best response time for both services overall is 24ms!
Conclusion
Service | Fastest execution | Slowest execution | Integration latency p95 | API Gateway integration + network latency p95 | Execution duration p95
---|---|---|---|---|---
Lambda | 24.0 ms | 745 ms (second slowest is 183 ms) | 38.8 ms | 15.0 ms | 23.7 ms
Step Functions | 24.0 ms | 252 ms | 129.5 ms | 43.1 ms | 86.3 ms
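As a sanity check on the table, the split of the integration latency can be redone by hand: subtracting the execution duration from the integration latency should give roughly the API Gateway integration + network latency. The sketch below does this with the p95 values above. Percentiles are not strictly additive, so this is only an approximation, which explains the 0.1 ms gaps with the table.

```javascript
// p95 values copied from the table, in milliseconds.
const p95 = {
  lambda: { integration: 38.8, execution: 23.7 },
  stepFunctions: { integration: 129.5, execution: 86.3 },
};

// Rounds to one decimal to hide floating-point noise.
const round1 = (x) => Math.round(x * 10) / 10;

// Approximate API Gateway integration + network latency.
const overhead = (service) => round1(service.integration - service.execution);

// How many times slower Step Functions is than Lambda at p95.
const ratio = round1(p95.stepFunctions.integration / p95.lambda.integration);

console.log(overhead(p95.lambda)); // 15.1 (table: 15.0)
console.log(overhead(p95.stepFunctions)); // 43.2 (table: 43.1)
console.log(ratio); // 3.3
```

The p95 ratio of about 3.3 is consistent with the "3x faster on average" observed on the CloudWatch widget.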
Performance-wise, Lambda remains a better solution for this use case because, from my understanding:
- It benefits from better internal optimisations,
- It has a better integration with API Gateway (on a REST API),
- More resources are allocated to it.
However, Step Functions is promising:
- Its fastest invocation matches Lambda's,
- It also benefits from internal optimisations, although not comparable to Lambda's.
Finally, 1000 successive invocations are not representative of how a real application behaves. I'm curious what Step Functions optimisations a more realistic load test would reveal...