
Pubudu Jayawardana for AWS Community Builders

Posted on • Originally published at Medium

Implement custom retry logic with SQS & Lambda - Part II - using EventBridge Scheduler

In a previous blog post, I discussed how we can build a custom retry mechanism for Lambda with SQS. There, we used the SQS delayed messages feature to set the next retry time.

If you haven't already read that, you can find it here:
https://dev.to/aws-builders/implement-custom-retry-logic-with-sqs-lambda-part-i-using-sqs-delayed-messages-gpa

One limitation of that approach was that messages can only be delayed for up to 900 seconds (15 minutes), so the next retry must happen within 15 minutes.

In this blog post, I am going to discuss how we can address this limitation by replacing delayed messages with scheduled messages using EventBridge Scheduler.

Architecture

(Architecture diagram)

How it works

Let's assume you have a message with the structure below. Here, data includes the payload your application needs to process, and metadata is used to identify the message and carries any information the infrastructure uses to process it.

{
   "metadata":{
      "message_id":"22200000-0749-2113-8e00-000a2b111191"
   },
   "data":{
      // message payload needs to be processed.
   }
}
  1. The Source Queue is the entry point where this message is added to be processed by the Message Processor Lambda Function.

  2. The Message Processor Lambda Function first validates the message schema to verify that the metadata and data elements are present and that the message ID is included in the metadata as a UUID (a validation sketch follows this list).

  3. If this schema validation fails, the message is sent directly to the Final DLQ without retrying, because simply retrying will not fix the issue.

  4. If the message is valid, let's assume the Message Processor Lambda function needs to call an external system to successfully process the message, but this external system has some issues, so the message processing fails and the function raises a runtime exception.

  5. Then the message is sent back to the Source Queue.

  6. The Source Queue has a Dead Letter Queue (namely the Intermediate DLQ) configured, with the maximum receive count set to 1. So, when the message fails once, it is sent directly to the Intermediate DLQ.

  7. This Intermediate DLQ has a Lambda function (Failed Message Processor Lambda Function) listening to it.

  8. So, this Failed Message Processor Lambda function starts processing the message.

  9. The Failed Message Processor Lambda function has several responsibilities.

  10. First, it increments the retry attempt count by one if it already exists in the message metadata; if not, it sets the count to 1 and adds it to the metadata.

  11. Then, it checks whether the number of retry attempts exceeds the pre-defined limit. Here, I have used a maximum of 5 retry attempts.

  12. If the retry attempts exceed the maximum, the message is moved to the Final DLQ with an Error Type and Error Details as follows:

    Message attributes in Final DLQ

  13. If not, it calculates the next retry time based on the logic you provide. Here, I used a simple backoff: the retry attempt count is multiplied by 60 seconds and added to the current time to determine the next retry time.

    from datetime import datetime, timedelta

    def _calculate_next_retry_time(message):
        retry_attempt = message["metadata"]["retry_attempt"]
        next_retry_time = datetime.now() + timedelta(seconds=(60 * retry_attempt))
        # Seconds are zeroed out because EventBridge Scheduler only offers minute precision
        message["metadata"]["next_retry_time"] = next_retry_time.strftime(
            "%Y-%m-%dT%H:%M:00.0Z"
        )

        return next_retry_time, message

    
  14. Then, a one-time schedule is created in Amazon EventBridge Scheduler for the calculated time (see the sketch after this list).

    Please Note: With EventBridge Scheduler, we can only schedule with minute precision, not down to the exact second.

  15. Once the schedule is created, at the given time it will call the SQS SendMessage API to queue the message with the given payload.

  16. The message that is sent to the Source Queue will look like this:

    {
       "metadata":{
          "message_id":"22200000-0749-2113-8e00-000a2b111191",
          "retry_attempt":1,
          "next_retry_time":"2023-10-13T22:52:00.0Z"
       },
       "data":{
          "location_name":"Amsterdam",
          "location_id":12345
       }
    }
    
  17. Once the message is available to process in the Source Queue, this cycle will continue until the message is processed successfully or sent to the Final DLQ.
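
To make steps 2 and 3 more concrete, below is a minimal sketch of the kind of schema check the Message Processor Lambda function could perform. This is not the exact code from the repository; the helper name and the UUID pattern check are assumptions for illustration:

import re

# Basic UUID format check - the actual validation in the repository may differ
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def _is_valid_message(message):
    metadata = message.get("metadata", {})
    return (
        "data" in message
        and isinstance(metadata.get("message_id"), str)
        and UUID_PATTERN.match(metadata["message_id"]) is not None
    )

Similarly, here is a minimal sketch of how the Failed Message Processor Lambda function could implement steps 10 to 15 using boto3's EventBridge Scheduler client. The environment variable names, the schedule naming convention, and the error attribute values are assumptions for illustration, not the repository's actual code:

import json
import os
from datetime import datetime, timedelta

import boto3

# Assumed environment variables - the actual stack may name these differently
SOURCE_QUEUE_ARN = os.environ["SOURCE_QUEUE_ARN"]
FINAL_DLQ_URL = os.environ["FINAL_DLQ_URL"]
SCHEDULER_ROLE_ARN = os.environ["SCHEDULER_ROLE_ARN"]  # role Scheduler assumes to send to SQS
MAX_RETRY_ATTEMPTS = 5

scheduler = boto3.client("scheduler")
sqs = boto3.client("sqs")


def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["body"])

        # Step 10: increment the retry attempt count, or set it to 1 on the first failure
        retry_attempt = message["metadata"].get("retry_attempt", 0) + 1
        message["metadata"]["retry_attempt"] = retry_attempt

        # Steps 11-12: after too many attempts, move the message to the Final DLQ
        if retry_attempt > MAX_RETRY_ATTEMPTS:
            sqs.send_message(
                QueueUrl=FINAL_DLQ_URL,
                MessageBody=json.dumps(message),
                MessageAttributes={
                    # Attribute names and values here are illustrative
                    "ErrorType": {"DataType": "String", "StringValue": "MaxRetriesExceeded"},
                    "ErrorDetails": {"DataType": "String", "StringValue": "Retry attempts exhausted"},
                },
            )
            continue

        # Step 13: simple backoff - 60 seconds multiplied by the retry attempt count
        next_retry_time = datetime.now() + timedelta(seconds=60 * retry_attempt)
        message["metadata"]["next_retry_time"] = next_retry_time.strftime("%Y-%m-%dT%H:%M:00.0Z")

        # Steps 14-15: one-time schedule that sends the message back to the Source Queue.
        # at(...) expressions have minute precision, so seconds are zeroed out.
        scheduler.create_schedule(
            Name=f"retry-{message['metadata']['message_id']}-{retry_attempt}",
            ScheduleExpression=f"at({next_retry_time.strftime('%Y-%m-%dT%H:%M:00')})",
            FlexibleTimeWindow={"Mode": "OFF"},
            ActionAfterCompletion="DELETE",  # auto-delete the schedule after it fires
            Target={
                "Arn": SOURCE_QUEUE_ARN,
                "RoleArn": SCHEDULER_ROLE_ARN,
                "Input": json.dumps(message),
            },
        )

The ActionAfterCompletion="DELETE" flag used here corresponds to the automatic schedule deletion feature mentioned at the end of this post.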

Try this yourself

I have created a public GitHub repository in case you want to set up this project in your own AWS environment.

Clone the repository at: https://github.com/pubudusj/retry-with-sqs-and-scheduler

Set up:

This is implemented with CDK and Python, so you need to have Python and the AWS and CDK CLIs installed.
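
For reference, here is a minimal CDK (Python) sketch of how the Source Queue can be wired to the Intermediate DLQ with a maximum receive count of 1, as described in step 6. The construct names are illustrative and this is not the actual stack from the repository:

from aws_cdk import Stack, aws_sqs as sqs
from constructs import Construct


class RetrySketchStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Queue that receives messages which failed processing in the Source Queue
        intermediate_dlq = sqs.Queue(self, "IntermediateDlq")

        # Source Queue with max_receive_count=1, so a single failed receive
        # moves the message straight to the Intermediate DLQ
        source_queue = sqs.Queue(
            self,
            "SourceQueue",
            dead_letter_queue=sqs.DeadLetterQueue(
                max_receive_count=1,
                queue=intermediate_dlq,
            ),
        )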

Test:

Once the stack is deployed, send the message below to the Source Queue:

{
   "metadata":{
      "message_id":"22200000-0749-2113-8e00-000a2b111191"
   },
   "data":{
      "location_name":"Amsterdam",
      "location_id":12345
   }
}
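Alternatively, a short boto3 snippet like the one below can be used to send the same test message. The queue URL is a placeholder that you need to replace with your own Source Queue URL:

import json

import boto3

sqs = boto3.client("sqs")

message = {
    "metadata": {"message_id": "22200000-0749-2113-8e00-000a2b111191"},
    "data": {"location_name": "Amsterdam", "location_id": 12345},
}

sqs.send_message(
    QueueUrl="https://sqs.<region>.amazonaws.com/<account-id>/<source-queue-name>",  # placeholder
    MessageBody=json.dumps(message),
)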

Result

If you check the log groups of the Message Processor and Failed Message Processor Lambdas, you can see records of the message payload with a different retry attempt count and next retry time for each retry attempt.

Also, after a retry attempt, if you check EventBridge Scheduler, you can see a schedule created with the payload to be sent to SQS and the retry time calculated in the Failed Message Processor Lambda.

Also, after 5 unsuccessful retries, you can see that the message is available in the Final DLQ with the error attributes.

Advantages of this approach

The main advantage of this approach is that you can schedule the message much further into the future. Even though such a long delay might not be required in practice, an EventBridge schedule's trigger time can be set to any point in the future.

Also, unlike the previous approach, you have more visibility into when the message will be retried, based on the schedule created in EventBridge Scheduler.

Further, AWS has recently announced a feature to automatically delete EventBridge schedules once they are triggered. This is great because you no longer need to keep track of the schedules and delete them manually or programmatically within your application after they fire.
