Roy R.

The bane of AWS Lambda Coldstarts

Coldstarts of AWS Lambdas are the bane of my existence. Are they for you? I often find that the coldstart delay degrades the user experience, and while coldstarts seem like they should be very avoidable, since they should never be exposed to end users, in practice they end up being a little difficult to work around.

When The User Encounters A Coldstart

Here is the kind of coldstart encounter you may run into. You make an API call for some data, it involves a lambda, and it's fast. Like blazing fast: 12 ms, 15 ms, great response times. But every so often, the API call hangs... ...and hangs, and hangs, for seconds, not milliseconds. Because it only happens infrequently, it's really hard to debug, but when it does happen it's a pretty bad user experience.
That's it. You've encountered the coldstart problem. Your lambda interaction that is usually blazing fast is sometimes really slow, but only infrequently. The trouble is that those infrequent slow responses are slow enough to really degrade the user experience.

When you get a hot lambda instance it's like a good cup of fresh coffee: blazing fast, taking just a few milliseconds to perk you right up.
hot start lambda is like fresh coffee

When it's cold, it's like yesterday's coffee: you take a sip and grimace, and the caffeine is going to take a while to kick in.
cold start lambda is like day-old coffee

So what can we do?

A Few Avoidance Techniques

I am going to start with a few hacks that have worked for me, because the real answer is going to be annoying: it requires a mindset change in how we deal with data (precomputation, which I will get to later). So here are a few good hacks to tide us both over.

  • Don't put synchronous lambdas at the end of your API Gateway routes (in other words, if you expect to wait on a response or to have to manipulate the response data, make it an async lambda)
    • when writing input or "commands" through an API endpoint, try to immediately emit an event to EventBridge and return a {'success':'OK'} response instead of doing everything all at once (this is generally a CQRS approach; a sketch follows this list)
    • when reading or querying, try to read directly from the database without a lambda, turn on caching in API Gateway, or precompute the result set separately
  • Warming: If you have an event-driven architecture, you can set up an event scheduler that fires an event to warm up a lambda on whatever schedule you want, waking it up at appropriate times (e.g. 7 am and 8 am, before people come online for the day), or use someone else's warmer library (a warm-up short-circuit is sketched after this list)
  • Increase the memory of a lambda function (because in the background that causes AWS to increase the CPU allocation of the function); e.g. doubling the provisioned memory will probably increase the CPU class that AWS gives it, and because you're charged by resources × time, faster invocations with more memory may not actually cost more in the end (your mileage may vary)
  • A Temporary Stopgap: Provisioned Concurrency: It's an option to set the provisioned concurrency to 1 so that there's always a "warm" instance, but I don't recommend it and only include it here for completeness. It doesn't really solve the problem anyway (when enough load occurs, a new instance will spin up and you'll still have to incur the cost of a coldstart)
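
To make the CQRS-style "emit an event and return immediately" idea from the first bullet concrete, here is a minimal sketch in Python. The bus name, source, and detail type are hypothetical placeholders, not anything your stack already has:

```python
import json
import boto3

# EventBridge client created once, in the global scope, so warm invocations reuse it
events = boto3.client("events")

def handler(event, context):
    """Command endpoint: hand the work off to EventBridge and return right away."""
    body = json.loads(event.get("body") or "{}")

    # Hypothetical bus and detail-type names -- substitute your own
    events.put_events(
        Entries=[
            {
                "EventBusName": "orders-bus",
                "Source": "api.commands",
                "DetailType": "OrderSubmitted",
                "Detail": json.dumps(body),
            }
        ]
    )

    # Respond immediately; a separate consumer lambda does the slow work out of band
    return {
        "statusCode": 202,
        "body": json.dumps({"success": "OK"}),
    }
```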
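
And for the warming bullet, the handler can short-circuit scheduled warm-up invocations so they stay cheap. This sketch assumes the scheduled rule is configured to send a payload like {"warmup": true}; the key name is just an illustration:

```python
def handler(event, context):
    # If this invocation came from the scheduled warm-up rule, bail out early
    # before doing any real work (assumes the rule's input is {"warmup": true}).
    if isinstance(event, dict) and event.get("warmup"):
        return {"warmed": True}

    # ... normal request handling below ...
    return {"statusCode": 200, "body": "real work happens here"}
```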

AWS's Recommendations for Coldstarts

When reading about coldstarts and how to deal with them, you will typically encounter recommendations like these:

  • AWS will recommend ways to make coldstarts less slow
  • AWS will recommend not putting a lambda directly at the end of a user-facing interaction, because the coldstart will kill the user experience

What doesn't happen often enough is a recommendation for what to actually do instead of sticking lambdas at the end of the request, so that you don't suffer from coldstarts at all.

An onion model for AWS Lambda and serverless interaction times: a glimpse of a model (via Lumigo) for determining where slowness can come from; the problem of coldstarts generally shows up in the orange band.

Any recommendation to decrease the runtime of the lambda itself, or of the allocation phase where coldstarts occur, is only going to get you so far. It's useful when you have already written lambda code and just want it to run faster, but the deeper solution brings us to precomputation...

The Deeper, Harder Precomputation Solution

The really deep solution is to design systems with cold starts in mind by relying on precomputation. In this model, instead of calculating and manipulating data on the fly inside a lambda when an API call requests it, you put the data into DynamoDB in a triggered manner (whether triggered by writes, by a time-to-live, or on a schedule) using a precomputation lambda, and then API calls always pull straight from that precomputed data set. So let's say there is a latest_statistics DynamoDB table: the API calls and API Gateway would pull directly from that table, fast and simply, with no computation at all, and separately, out of band, a lambda would update what data gets into latest_statistics.

Precomputation is a major change from a "pull, manipulate, respond" way of developing, so we may need helpful techniques like the avoidance techniques above to help us along the way.
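
Here is a rough sketch of both halves of that model, assuming the latest_statistics table from the example above; the key schema, field names, and aggregation function are made up for illustration:

```python
import json
import boto3

# Created once per container and reused across warm invocations
dynamodb = boto3.resource("dynamodb")
stats_table = dynamodb.Table("latest_statistics")

def precompute_handler(event, context):
    """Runs out of band (on a schedule, or off a write trigger) and refreshes
    the precomputed row. The aggregation itself is a placeholder."""
    stats = compute_latest_stats()
    stats_table.put_item(Item={"pk": "latest", **stats})

def read_handler(event, context):
    """The user-facing path: a single key lookup, no computation at all."""
    item = stats_table.get_item(Key={"pk": "latest"}).get("Item", {})
    return {"statusCode": 200, "body": json.dumps(item, default=str)}

def compute_latest_stats():
    # Placeholder for whatever aggregation your system actually needs
    return {"orders_today": 0, "active_users": 0}
```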

Techniques for Faster Hot Lambda Invocations

This isn't applicable to coldstarts, but here are a few techniques for speedier hot lambda invocations, to round things out for good measure.

  • Cache variables in the global scope of the instance, so that the hot lambda instance can keep them in memory: things like API clients and unchanging dependencies
  • Cache a response if a request happens close enough in time; in other words, put a JSON/data response into memory along with a timestamp, and if little enough time has elapsed that you are comfortable, just return the in-memory data without reaching out to the database again (both of these are sketched after this list)
  • Predictive, 2-stage approach: Return a quick-and-dirty response fast, and then have the slower results collected and displayed once an updated timestamp or a specific command id becomes available
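
A small sketch combining the first two items above: a module-level client that survives across warm invocations, plus an in-memory response cache with a tolerance window. The table name and TTL are illustrative, not prescriptive:

```python
import json
import time
import boto3

# Created once per container, reused on every warm invocation
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("latest_statistics")  # hypothetical table name

_cache = {"data": None, "fetched_at": 0.0}
CACHE_TTL_SECONDS = 30  # how stale you are comfortable being

def handler(event, context):
    now = time.time()

    # Serve from memory if the last fetch is recent enough
    if _cache["data"] is not None and now - _cache["fetched_at"] < CACHE_TTL_SECONDS:
        return {"statusCode": 200, "body": json.dumps(_cache["data"], default=str)}

    # Otherwise reach out to the database and refresh the in-memory copy
    item = table.get_item(Key={"pk": "latest"}).get("Item", {})
    _cache["data"] = item
    _cache["fetched_at"] = now
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```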

Overall

From the 1,000-foot view, there are two parts to mitigating the subtle effects of coldstarts. The first is an approach change to ensure that users never touch lambdas that will stall during a coldstart: for example, if you have to get a user profile, instead of manipulating the data in custom ways in lambda compute, have API Gateway pull directly from DynamoDB. The second is deferring results, so that the user-facing lambda can be fast and the slow compute work can happen behind the scenes.

Additional resources

A resource for more coldstart avoidance techniques and countermeasures: https://www.simform.com/blog/lambda-cold-starts/

Serverless warmer library: https://github.com/juanjoDiaz/serverless-plugin-warmup

Coldstart Stats Daily By Language/Environment:
https://maxday.github.io/lambda-perf/

An option to watch is LLRT, the Low Latency Runtime: https://github.com/awslabs/llrt which is in the early stages, but is a lightweight JavaScript runtime aimed at much faster startup than the standard Node.js runtime.

If connecting to an RDS database, you can consider using an RDS proxy for connection pooling, though it is an additional expense: https://docs.aws.amazon.com/lambda/latest/dg/configuration-database.html
