
Vadym Kazulkin for AWS Community Builders


AWS SnapStart - Part 17 Impact of the snapshot tiered cache on the cold starts with Java 21

Introduction

In the course of this blog series, we measured cold starts in very different scenarios, mostly with SnapStart enabled. Now let's explore one more SnapStart detail called "tiered caching" of the microVM snapshot. This mechanism was briefly mentioned in the article announcing SnapStart. Here is the sentence: "With SnapStart, when a customer publishes a function version, the Lambda service initializes the function’s code. It takes an encrypted snapshot of the initialized execution environment, and persists the snapshot in a tiered cache for low latency access." So what might this tiered cache be?
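For context, the snapshot (and therefore the tiered cache entry) is only created when a new version of a SnapStart-enabled function is published. Purely as an illustration of that sentence, and not as the deployment mechanism used in this series, here is a minimal AWS SDK for Java v2 sketch that enables SnapStart for published versions and then publishes a version, which triggers initialization, snapshot creation and caching. The function name is taken from the sample application; everything else is an assumption.

```java
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.GetFunctionConfigurationRequest;
import software.amazon.awssdk.services.lambda.model.PublishVersionRequest;
import software.amazon.awssdk.services.lambda.model.SnapStart;
import software.amazon.awssdk.services.lambda.model.SnapStartApplyOn;

public class EnableSnapStart {

    public static void main(String[] args) {
        // Function name taken from the example application of this series;
        // adjust it to your own deployment.
        String functionName = "GetProductByIdWithPureJava21Lambda";

        try (LambdaClient lambda = LambdaClient.create()) {
            // Turn SnapStart on for published versions of the function.
            lambda.updateFunctionConfiguration(b -> b
                    .functionName(functionName)
                    .snapStart(SnapStart.builder()
                            .applyOn(SnapStartApplyOn.PUBLISHED_VERSIONS)
                            .build()));

            // Wait until the configuration update has been applied.
            lambda.waiter().waitUntilFunctionUpdated(
                    GetFunctionConfigurationRequest.builder()
                            .functionName(functionName)
                            .build());

            // Publishing a version triggers init, snapshot creation and
            // persistence of the snapshot in the tiered cache.
            String version = lambda.publishVersion(PublishVersionRequest.builder()
                    .functionName(functionName)
                    .build()).version();

            System.out.println("Published SnapStart-enabled version " + version);
        }
    }
}
```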

Tiered cache of the snapshot

In our experiment we'll re-use the application introduced in part 9. Let's take the GetProductByIdWithPureJava21Lambda function with SnapStart enabled (but without priming) and 1024 MB of memory, and measure the cold start of exactly one invocation. Let's assume it's the very first execution after a new version of the Lambda function has been published. My result was 2270.28 ms. That cold start is still quite high; without SnapStart the result would typically be in the range of 3100 to 3600 ms.
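As a side note on how such numbers can be collected: the REPORT log lines of SnapStart cold starts contain a "Restore Duration" entry that warm invocations do not have. The following sketch is based on my own assumptions rather than the exact measurement setup of this series; it uses the AWS SDK for Java v2 to run a CloudWatch Logs Insights query that treats restore duration plus handler duration as the cold start time and aggregates percentiles.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsResponse;
import software.amazon.awssdk.services.cloudwatchlogs.model.QueryStatus;
import software.amazon.awssdk.services.cloudwatchlogs.model.StartQueryRequest;

public class ColdStartQuery {

    public static void main(String[] args) throws InterruptedException {
        // Log group of the function from the example application (assumed naming).
        String logGroup = "/aws/lambda/GetProductByIdWithPureJava21Lambda";

        // REPORT lines of SnapStart cold starts contain a "Restore Duration";
        // warm invocations don't, so filtering on it isolates the cold starts.
        // Treating restore + handler duration as "the cold start" is my assumption.
        String query = """
                filter @type = "REPORT"
                | parse @message /Restore Duration: (?<restoreMs>[0-9.]+) ms/
                | filter ispresent(restoreMs)
                | stats count(*) as coldStarts,
                        pct(restoreMs + @duration, 50) as p50,
                        pct(restoreMs + @duration, 90) as p90,
                        pct(restoreMs + @duration, 99) as p99
                """;

        try (CloudWatchLogsClient logs = CloudWatchLogsClient.create()) {
            String queryId = logs.startQuery(StartQueryRequest.builder()
                    .logGroupName(logGroup)
                    .startTime(Instant.now().minus(3, ChronoUnit.HOURS).getEpochSecond())
                    .endTime(Instant.now().getEpochSecond())
                    .queryString(query)
                    .build()).queryId();

            // Logs Insights queries run asynchronously; poll until complete.
            GetQueryResultsResponse results;
            do {
                Thread.sleep(1000);
                results = logs.getQueryResults(b -> b.queryId(queryId));
            } while (results.status() == QueryStatus.RUNNING
                    || results.status() == QueryStatus.SCHEDULED);

            results.results().forEach(row ->
                    row.forEach(f -> System.out.println(f.field() + " = " + f.value())));
        }
    }
}
```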

But what happens to the subsequent cold start times with SnapStart enabled? Let's see how the percentiles change as the number of cold starts grows.

| date and time | number of cold starts | p50 (ms) | p75 (ms) | p90 (ms) | p99 (ms) | p99.9 (ms) | max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8.3. 18:15 | 1 | 2270.28 | 2270.28 | 2270.28 | 2270.28 | 2270.28 | 2270.28 |
| 8.3. 18:26 | 4 | 2078.54 | 2196.68 | 2270.28 | 2270.28 | 2270.28 | 2270.28 |
| 8.3. 18:38 | 9 | 2131.58 | 2210.69 | 2340.34 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 18:51 | 14 | 1880.21 | 2131.58 | 2270.28 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:05 | 20 | 1792.05 | 2015.11 | 2196.68 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:20 | 34 | 1706.08 | 1856.04 | 2131.58 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:32 | 49 | 1662.7 | 1792.05 | 2168.88 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:44 | 66 | 1642.87 | 1709.27 | 2078.54 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 19:44 | 76 | 1640.13 | 1703.17 | 2064.59 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 20:10 | 85 | 1640.13 | 1700.9 | 2015.11 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 20:20 | 98 | 1642.74 | 1703.17 | 1880.21 | 2340.34 | 2340.34 | 2340.34 |
| 8.3. 20:30 | 109 | 1639.75 | 1691.35 | 1865.41 | 2269.10 | 2338.17 | 2340.34 |
| 8.3. 20:41 | 120 | 1633.21 | 1679.56 | 1854.25 | 2269.10 | 2338.17 | 2340.34 |
| 8.3. 20:52 | 129 | 1629.95 | 1676.21 | 1854.25 | 2269.10 | 2338.17 | 2340.34 |

So, what we observe is that the cold start times decrease the more cold starts we experience. After about 50 cold starts the improvement becomes less and less visible for p50, and after about 100 cold starts for p90. This is the tiered cache of the microVM snapshot in action. The effect of the tiered cache depends on the percentile and is significant (up to roughly 600 ms).

If you are interested in the deep details of how Lambda SnapStart (currently available only for the Java runtime) is implemented, and in particular how the microVM (the whole execution environment) snapshot and its tiered caching work under the hood, I recommend the talk AWS Lambda Under the Hood by Mike Danilov. There is also a detailed summary of his talk here and additional resources here.

Of course, the next question was: what happens if we don't invoke the Lambda function for a while and then execute it later? Will the cold start increase? Let's check.

Let's stop invoking the Lambda function and then invoke it 30 minutes later at 21:22: the cold start was 1674.06 ms. Stopping again and invoking at 21:52 gave a cold start of 1702.17 ms, and doing the same at 23:00 gave 1735.06 ms. So the cold start got slightly bigger, but we don't observe the worst values from the first executions. Then I stopped invoking the Lambda for 8 hours and executed it 15,000 times the next morning, running into 16 cold starts with p50 at 1669.07 ms and p90 and above at 2019.88 ms. So the tiered caching effect was still there after so many hours, and the p90 and higher values were not as high as during the first invocations.
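How the invocations are driven does not really matter for this experiment; any load generator will do. Just as an illustration, here is a small Java sketch that calls the function's API Gateway endpoint in a loop. The URL and path are placeholders, not the actual endpoint of the sample application; the cold starts are afterwards identified from the CloudWatch logs (for example with the Logs Insights query sketched above).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class InvocationDriver {

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; replace with the API Gateway URL of your own deployment.
        URI endpoint = URI.create(
                "https://example.execute-api.eu-central-1.amazonaws.com/prod/products/1");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(endpoint).GET().build();

        // Fire a fixed number of sequential invocations; a subset of them will
        // land on fresh execution environments and therefore be cold starts.
        for (int i = 0; i < 15_000; i++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                System.err.println("Invocation " + i + " failed: " + response.statusCode());
            }
        }
    }
}
```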

To complete the test of the snapshot tiered caching, I ran the same experiments with the GetProductByIdWithPureJava21LambdaAndPriming function, which uses SnapStart plus DynamoDB request invocation priming on top. The results are summarized in the table below.

| date and time | number of cold starts | p50 (ms) | p75 (ms) | p90 (ms) | p99 (ms) | p99.9 (ms) | max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8.3. 18:15 | 1 | 1189.55 | 1189.55 | 1189.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 18:26 | 4 | 1046.09 | 1166.55 | 1189.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 18:38 | 9 | 801.74 | 1046.09 | 1189.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 18:51 | 14 | 763.37 | 808.96 | 1166.55 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:05 | 23 | 730.28 | 801.74 | 1046.09 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:20 | 32 | 720.01 | 796.2 | 941.29 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:32 | 47 | 700 | 758.39 | 903.36 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:44 | 58 | 692.52 | 749.01 | 831.72 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 19:44 | 68 | 684 | 748.61 | 831.72 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 20:10 | 80 | 679.44 | 731.52 | 801.74 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 20:20 | 91 | 688.25 | 748.61 | 799.25 | 1189.55 | 1189.55 | 1189.55 |
| 8.3. 20:30 | 100 | 689.34 | 748.22 | 799.24 | 1166.16 | 1188.52 | 1189.55 |
| 8.3. 20:41 | 110 | 679.76 | 744.49 | 799.24 | 1166.16 | 1188.52 | 1189.55 |
| 8.3. 20:52 | 122 | 679.08 | 744.49 | 799.24 | 1166.16 | 1188.52 | 1188.52 |

So, what we observe is the same as without priming: the cold start time decreases the more cold starts we experience. After about 50 cold starts the improvement becomes less and less visible; after about 80 cold starts the effect becomes negligible for p50, and after about 90 cold starts for p90. The effect of the tiered cache depends on the percentile and is significant (up to roughly 500 ms).

Of course, I had the same question: what happens if I don't invoke the Lambda function for a while and then execute it later? Let's check.

Let's stop invoking the Lambda function and then invoke it 30 minutes later at 21:22: the cold start was 746.63 ms. Stopping again and invoking at 21:52 gave a cold start of 617.7 ms, and doing the same at 23:00 gave 673.5 ms. Then I stopped invoking the Lambda for 8 hours and executed it 15,000 times the next morning, running into 17 cold starts with p50 at 723.99 ms and p90 at 894.05 ms. So the tiered caching effect was still there after so many hours as well, and the p90 and higher values were not as high as in the first invocations.
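As a reminder of what "DynamoDB request invocation priming" means here: a representative DynamoDB request is executed in the CRaC beforeCheckpoint hook, so that the SDK, its HTTP client and the (de)serialization classes are already loaded and initialized when the snapshot is taken. The sketch below is my own minimal illustration of that idea with the org.crac API and the AWS SDK for Java v2; the actual implementation in part 9 of this series may differ, and the table name and key are placeholders.

```java
import java.util.Map;

import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;

public class ProductDao implements Resource {

    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    public ProductDao() {
        // Register this object so Lambda calls beforeCheckpoint before snapshotting.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Priming: issue a representative request during snapshot creation so that
        // the involved classes are already loaded and initialized at restore time.
        // Table and key names are placeholders.
        dynamoDb.getItem(GetItemRequest.builder()
                .tableName("ProductsTable")
                .key(Map.of("id", AttributeValue.builder().s("0").build()))
                .build());
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        // Nothing to do after restore in this sketch.
    }
}
```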

Conclusion

In this article we saw the microVM snapshot tiered cache in action for SnapStart-enabled Lambda functions (with and without priming) on the Java 21 runtime. The conclusion is quite obvious: don't stop after enabling SnapStart and measuring only one or a couple of cold starts. Yes, the first cold starts take longer, but they get better with the number of invocations of the same Lambda function version, and they seem to stay at a good level independently of whether you invoked your Lambda for a while or not. I expect exactly the same or a very similar effect with Java 17, as it is not about the Java version itself but about AWS's implementation of the microVM snapshot tiered cache.
