One of the most interesting architectures nowadays is event-driven. It consists of an event producer and an action. So, imagine I uploaded an Excel sheet to S3 and I want a set of actions performed against the records in that file. Pretty simple: event-driven architecture will solve it.
But what makes this flow somewhat hard to maintain is that once it starts, it keeps moving. Now imagine you have 500k+ records in that Excel file and an unhandled error hits record number 200,000. What would you do? Reload the file and waste all the computing power that already processed 199,999 records? Sounds expensive and frustrating.
Luckily, there are a number of ways to handle this issue. Allow me to present “a way” to solve it.
Does it seem simple and repetitive? Yes, it is.
Let’s assume you uploaded an Excel file to S3, and you want all the rows to be cleaned of extra spaces, have some data adjustments applied, and the records inserted into a database.
So let me walk you through the record lifecycle, and then explain why it was designed this way.
1- The file gets uploaded, and the S3 event triggers the Data Generator function.
2- The Data Generator function passes all the records into Queue 1. Once the delivery delay has passed (and, on retries, the visibility timeout), Function 1 starts pulling batches from the queue.
3- If a record hits an unhandled error, the whole batch ends up in DLQ 1 (once the redrive policy's maximum receive count is exhausted). Why is that? Because the function pulled a whole batch, not a single record, so the batch stays in Queue 1 until Lambda tells the queue it is done and the messages can be deleted. Otherwise, the batch is delivered to Queue 2 and deleted from Queue 1 after Function 1 returns success to Queue 1 (see the sketch after this list).
4- The same scenario plays out with Queue 2, Function 2, and DLQ 2.
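To make step 3 concrete, here is a minimal Python sketch of what Function 1 could look like. The `QUEUE_2_URL` environment variable, the message group name, and the `clean` step are assumptions for illustration only (the article does not show its code); the point is the batch semantics: an unhandled exception leaves the whole batch in Queue 1 until the redrive policy moves it to DLQ 1, while a normal return lets Lambda delete the batch.

```python
import json
import os

import boto3

sqs = boto3.client("sqs")

# Hypothetical environment variable holding the URL of Queue 2 (FIFO).
QUEUE_2_URL = os.environ["QUEUE_2_URL"]


def clean(record: dict) -> dict:
    """Strip extra spaces from every string field -- the 'cleaning' step."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def handler(event, context):
    """Function 1: consume a batch from Queue 1, clean it, forward it to Queue 2."""
    entries = []
    for message in event["Records"]:
        payload = json.loads(message["body"])  # raises on malformed JSON -> whole batch retried
        cleaned = clean(payload)
        entries.append({
            "Id": str(len(entries)),                        # unique within this invocation
            "MessageBody": json.dumps(cleaned),
            "MessageGroupId": "excel-import",               # required for FIFO queues
            "MessageDeduplicationId": message["messageId"],
        })

    # SendMessageBatch accepts at most 10 entries per call.
    for start in range(0, len(entries), 10):
        sqs.send_message_batch(QueueUrl=QUEUE_2_URL, Entries=entries[start:start + 10])

    # Returning without raising tells Lambda the batch succeeded, so the messages
    # are deleted from Queue 1. Any unhandled exception above leaves them in
    # Queue 1 until the redrive policy sends them to DLQ 1.
```

If you enable partial batch responses (ReportBatchItemFailures) on the event source mapping, you can report only the broken records instead of failing the whole batch; the flow described above assumes the simpler whole-batch behaviour.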
Now let me tell you why this scenario achieves the goal.
1- Event-Driven:
As you can see, no action is performed unless the trigger requirements are satisfied. Uploading the file to S3 is the flow generator; without it, the whole flow stays idle.
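To show what that trigger looks like in practice, here is a minimal sketch of the Data Generator, assuming (purely for illustration) a `QUEUE_1_URL` environment variable and a sheet exported as CSV; the article itself does not show its code.

```python
import csv
import io
import json
import os
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Hypothetical environment variable holding the URL of Queue 1 (FIFO).
QUEUE_1_URL = os.environ["QUEUE_1_URL"]


def handler(event, context):
    """Data Generator: fired by the S3 ObjectCreated event, pushes every row into Queue 1."""
    for s3_record in event["Records"]:
        bucket = s3_record["s3"]["bucket"]["name"]
        key = unquote_plus(s3_record["s3"]["object"]["key"])  # S3 event keys are URL-encoded

        # For simplicity this sketch reads the whole object and assumes CSV; a real
        # Excel file needs a library such as openpyxl, and a 500k-row file would be
        # better streamed in chunks.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        entries = []
        for i, row in enumerate(csv.DictReader(io.StringIO(body))):
            entries.append({
                "Id": str(len(entries)),
                "MessageBody": json.dumps(row),
                "MessageGroupId": key,                   # required for FIFO queues
                "MessageDeduplicationId": f"{key}-{i}",  # row-level deduplication
            })
            if len(entries) == 10:  # SendMessageBatch max is 10 messages per call
                sqs.send_message_batch(QueueUrl=QUEUE_1_URL, Entries=entries)
                entries = []
        if entries:
            sqs.send_message_batch(QueueUrl=QUEUE_1_URL, Entries=entries)
```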
2- Lower batch to retry:
SQS FIFO (First In, First Out) queues accept up to 300 API calls per second per action, or around 3,000 messages per second when messages are sent in batches of up to 10. Several factors shape the batches your consumer actually sees (the interval at which the Data Generator sends messages, the visibility timeout, the delivery delay, and the event source mapping's batch size). Handling a few hundred messages at a time is far more manageable than handling the whole file.
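The event source mapping's batch size is the one knob you set explicitly when wiring the queue to the Lambda. A small boto3 sketch with placeholder names (note that Lambda caps the batch at 10 messages for FIFO queues; standard queues allow larger batches plus a batching window):

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder ARN and function name, purely for illustration.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:queue-1.fifo",
    FunctionName="function-1",
    BatchSize=10,  # maximum allowed for a FIFO event source
)
```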
3- DLQ is the saviour:
Because of the DLQ, the unhandled records can be isolated and dealt with separately. Some records might contain special characters, a different format, or even empty strings, and that can toast all the processed data if it is not held in a safe place (like a cache, or already inserted into the DB). Once the records reach the DLQ, you have the freedom to do whatever you want with that batch of records: reprocess it if, for example, you exceeded the Lambda concurrency limit or hit that unhandled error.
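As one hypothetical example of that freedom, the sketch below redrives a batch from DLQ 1 back into Queue 1 once you have fixed the code or the data. The queue URLs are placeholders, and SQS also offers a built-in dead-letter queue redrive feature you could reach for instead.

```python
import boto3

sqs = boto3.client("sqs")

# Placeholder queue URLs.
DLQ_1_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dlq-1.fifo"
QUEUE_1_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/queue-1.fifo"


def redrive_once() -> int:
    """Move up to one batch of records from DLQ 1 back into Queue 1."""
    response = sqs.receive_message(
        QueueUrl=DLQ_1_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=5,
    )
    messages = response.get("Messages", [])
    for message in messages:
        body = message["Body"]
        # ...inspect or fix the record here (strip special characters,
        #    fill empty strings, etc.) before re-sending it...
        sqs.send_message(
            QueueUrl=QUEUE_1_URL,
            MessageBody=body,
            MessageGroupId="redrive",                     # required for FIFO
            MessageDeduplicationId=message["MessageId"],
        )
        sqs.delete_message(QueueUrl=DLQ_1_URL, ReceiptHandle=message["ReceiptHandle"])
    return len(messages)


# Keep pulling until the DLQ is drained.
while redrive_once():
    pass
```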
4- Price wise? Justifiable:
When we look at the price of 1 million requests ($0.50), it is nothing compared with the benefits we get from it, because this is the price we pay to isolate the records that clearly show us the gaps and code mistakes in how we handle our data. Without it, you will pay far more for Lambda on unnecessary reruns, and A LOT MORE for CloudWatch PutLogEvents calls, which, by the way, get expensive if you don't use them wisely.
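As a rough, purely illustrative back-of-envelope using that $0.50-per-million figure and the 500k-record file from earlier (the real request count depends on batching, polling, and retries):

```python
records = 500_000
batch_size = 10                   # SendMessageBatch / receive batch maximum
price_per_million = 0.50          # USD per 1M FIFO requests (figure above)

sends = 2 * records / batch_size  # each record passes through Queue 1 and Queue 2
receives_and_deletes = sends      # very rough allowance for the consumer side
total_requests = sends + receives_and_deletes

print(f"~{total_requests:,.0f} requests ≈ ${total_requests / 1_000_000 * price_per_million:.2f}")
# ~200,000 requests ≈ $0.10 -- small next to rerunning 199,999 records through Lambda.
```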
To conclude this short article, I highly encourage anyone who works with Serverless to invest more in this part of the architecture, because the money and time you spend handling these small yet expensive errors are totally worth it. I will bring a part 2 to show you how to make it more dynamic with EventBridge. Until then, be safe.