AWS Lambda SnapStart - Part 10 Troubleshooting errors and timeouts of the Init and Restore phases

#aws #java #serverless #troubleshooting

Introduction

In the previous parts of our article series about AWS Lambda SnapStart, we measured Lambda cold start times with various Java runtime versions and frameworks. Now let's talk about how to troubleshoot errors and timeouts in the Init and Restore phases of a SnapStart-enabled Lambda function.

Let's imagine we have enabled SnapStart for our function and, during deployment, received the following or a similar CloudFormation error message.

The following resource(s) failed to create: [GetProductByIdWithPrimingFunctionVersion5b3d011e02, GetProductByIdFunctionVersion5b3d011e

This will also leave the CloudFormation stack in the "UPDATE_ROLLBACK_IN_PROGRESS" state.

Previously, Lambda reported errors and timeouts to CloudWatch Logs during the Invoke phase only, so it was difficult to figure out the cause of an error like this one.

Troubleshooting Errors and Timeouts of SnapStart-enabled Lambda function Init and Restore Phase

Since November 8, 2023, AWS Lambda has made it easier to troubleshoot errors and timeouts in the Init and Restore phases.

Let's explore what this means with a short example and intentionally produce an error in the beforeCheckpoint method of the CRaC Lambda hook.

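    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> context) throws Exception {
        // Priming: load a product before the snapshot is taken.
        Optional<Product> optionalProduct = productDao.getProduct("0");
        // No product with id "0" exists, so get() throws NoSuchElementException.
        Product product = optionalProduct.get();
    }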

We do some priming here, and as a product with id 0 doesn't exist, we'll run into the

java.util.NoSuchElementException: No value present
        at java.base/java.util.Optional.get()

error. Because beforeCheckpoint for a SnapStart-enabled Lambda function is invoked before the snapshot is taken, the recent troubleshooting improvement means the error should be published to CloudWatch Logs. And indeed it is.

Now it's at least clear what happens during the "Init phase", so the error in this phase can be identified more easily.
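The failure itself can be reproduced outside of Lambda. The following is only a minimal sketch: `InMemoryProductDao` and `Product` are hypothetical stand-ins for the `productDao` and `Product` used in the hook (the real classes are not shown in this article). `primeStrict` mirrors the hook body, while `primeLenient` is one possible way to run the same lookup without letting a missing item fail the checkpoint.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

public class PrimingSketch {

    // Hypothetical product type; the article's Product class is not shown.
    public static final class Product {
        private final String id;

        public Product(String id) {
            this.id = id;
        }

        public String getId() {
            return id;
        }
    }

    // Hypothetical in-memory stand-in for the article's productDao.
    public static final class InMemoryProductDao {
        private final Map<String, Product> products = Map.of("1", new Product("1"));

        public Optional<Product> getProduct(String id) {
            return Optional.ofNullable(products.get(id));
        }
    }

    // Mirrors the hook body: Optional.get() throws when no value is present.
    public static Product primeStrict(InMemoryProductDao dao, String id) {
        return dao.getProduct(id).get();
    }

    // Lenient variant: the lookup still runs, but an absent product is tolerated.
    public static boolean primeLenient(InMemoryProductDao dao, String id) {
        return dao.getProduct(id).isPresent();
    }

    public static void main(String[] args) {
        InMemoryProductDao dao = new InMemoryProductDao();
        try {
            primeStrict(dao, "0");
        } catch (NoSuchElementException e) {
            System.out.println("strict priming failed: " + e.getMessage());
        }
        System.out.println("lenient priming found product: " + primeLenient(dao, "0"));
    }
}
```

Whether a priming failure should abort the deployment or be tolerated is a design decision; the lenient variant is only appropriate if the lookup is there purely to warm up code paths.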

For the sake of completeness, let's provoke the same error during the Restore phase:

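    @Override
    public void afterRestore(org.crac.Context<? extends Resource> context) throws Exception {
        // Same lookup as before, now executed after the snapshot is restored.
        Optional<Product> optionalProduct = productDao.getProduct("0");
        // No product with id "0" exists, so get() throws NoSuchElementException.
        Product product = optionalProduct.get();
    }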

It doesn't make much sense to do such priming in the Restore phase; this code only serves to provoke an error. In this case, too, the error appears in CloudWatch Logs and is easy to understand and fix.
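To see how the two hooks relate to the phases, here is a small local simulation. It is a sketch under stated assumptions, not how Lambda is implemented: the `Hooks` interface is a hypothetical stand-in for `org.crac.Resource` (with the context parameter omitted) so that the example compiles without the CRaC dependency, and `simulate` merely runs the hooks and a handler in lifecycle order and records which phase failed. In a real function, the class would implement `org.crac.Resource` and register itself with `Core.getGlobalContext().register(this)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class LifecycleSketch {

    // Hypothetical stand-in for org.crac.Resource (context parameter omitted).
    public interface Hooks {
        void beforeCheckpoint() throws Exception;

        void afterRestore() throws Exception;
    }

    // Runs the hooks and the handler in lifecycle order and returns the recorded events.
    public static List<String> simulate(Hooks hooks, Runnable handler) {
        List<String> events = new ArrayList<>();
        try {
            hooks.beforeCheckpoint(); // runs at the end of Init, before the snapshot
            events.add("INIT ok");
        } catch (Exception e) {
            events.add("INIT failed: " + e);
            return events; // simulation stops: nothing to restore or invoke
        }
        try {
            hooks.afterRestore(); // runs after the environment is restored
            events.add("RESTORE ok");
        } catch (Exception e) {
            events.add("RESTORE failed: " + e);
            return events; // simulation stops: the handler is not reached
        }
        handler.run();
        events.add("INVOKE ok");
        return events;
    }

    public static void main(String[] args) {
        // Hooks whose afterRestore fails in the same way as the example above.
        Hooks failingRestore = new Hooks() {
            @Override
            public void beforeCheckpoint() {
                // nothing to prime in this sketch
            }

            @Override
            public void afterRestore() {
                Optional.empty().get(); // throws NoSuchElementException
            }
        };
        simulate(failingRestore, () -> { }).forEach(System.out::println);
    }
}
```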

Of course, there are much more complex errors, such as internal failures while taking or restoring the snapshot, which can only be fixed by the corresponding AWS team. But providing them with the exact error message in the support case will definitely help resolve the issue much more quickly.

Conclusion

In this article, we successfully demonstrated the recent improvement: Lambda now automatically captures and sends logs about each phase of the Lambda execution environment lifecycle to CloudWatch Logs. This includes the Init phase, in which Lambda initializes the Lambda runtime and the static code outside the function handler; the Restore phase, in which Lambda restores the execution environment from a snapshot for SnapStart-enabled functions; and the Invoke phase, in which Lambda executes the code in your function handler. This improvement enables us to troubleshoot errors and timeouts in the Init and Restore phases of a Lambda function much more effectively.