ETL stands for “Extract, Transform, Load”: the process of pulling data out of a source, reshaping it, and storing it somewhere else using some computing unit.
Since AWS Lambda is a compute service that is cheap to use and versatile in what you can achieve or combine it with, it looks like an appealing option for this kind of job. But is it?
1- What do ETL jobs need to finish successfully?
Let’s assume you have an Excel sheet with a huge number of rows and a respectable number of columns, and you need to clean, transform, and store those rows in a specific order or shape. You need certain components to achieve this task.
AWS Lambda provides these by connecting to storage services like S3 and Glacier. You can use EventBridge to create a series of actions that fit the custom schemas you’ve configured. DynamoDB gives you a NoSQL option with pretty fast connectivity. SQS queues the data into the function, and unprocessed data is caught in the DLQ for later adjustment.
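To make that concrete, here is a minimal sketch of what such a Lambda handler could look like when triggered by SQS and writing into DynamoDB. The table name and the shape of the rows are assumptions for illustration, not from the article; the boto3 client is created inside the handler only so the module stays importable outside AWS.

```python
import json


def transform(row):
    """Example transform step: normalize keys and strip stray whitespace."""
    return {
        k.strip().lower(): v.strip() if isinstance(v, str) else v
        for k, v in row.items()
    }


def handler(event, context):
    """SQS-triggered handler: each record body is assumed to be one JSON row."""
    import boto3  # created here so the module imports without AWS credentials

    table = boto3.resource("dynamodb").Table("etl-output")  # hypothetical table
    for record in event["Records"]:
        row = json.loads(record["body"])
        table.put_item(Item=transform(row))
```

In a real deployment you would wire the SQS queue to the function as an event source mapping, and the clean/transform logic in `transform` would carry whatever reshaping your sheet actually needs.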
So we can say that the underlying architecture and connectivity are there for Lambda to do the job.
2- Can only Lambda do the job?
Of course not. AWS has another service called Glue, a Data Integration service built for exactly this type of operation, with its own set of features and options. But it cannot match Lambda when it comes to connectivity: Lambda has more options. Yes, Lambda is a compute service and Glue is declared a Data Integration service. But what we’re doing here is comparing the use case itself, not the limitations.
What can be done in Lambda can be done on any EC2 family; you just need to find the right family for the job. But keep in mind that EC2 is customer managed: the instance is yours to take care of.
One small comment here: if your ETL job is CPU-intensive, do not use the T family, since its CPU is burstable rather than sustained.
Another point I want to clarify: with what Lambda can utilize, it can recover the job when the code hits an issue such as an unexpected data type, like a difference in datetime format. Lambda can redirect those rows to a DLQ, which works as a net that catches the uncommon data rows. This connectivity isn’t as clearly available in other services, which might force you to rerun the whole file after fixing the code manually. Trust me, I’ve been there.
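The “net” pattern above can be sketched in plain Python: wrap the per-row parsing in a try/except and divert failures (for example, a row with an unexpected datetime format) to a dead-letter destination instead of failing the whole batch. The column name `date` and the expected format are assumptions; `send_to_dlq` stands in for whatever actually pushes to your DLQ (e.g. an SQS `send_message` call).

```python
from datetime import datetime


def parse_row(row):
    """Assumes a 'date' column in YYYY-MM-DD form; raises ValueError otherwise."""
    row["date"] = datetime.strptime(row["date"], "%Y-%m-%d").date().isoformat()
    return row


def process(rows, send_to_dlq):
    """Process rows; divert bad ones to the DLQ instead of failing the batch."""
    processed = []
    for row in rows:
        try:
            processed.append(parse_row(dict(row)))
        except (ValueError, KeyError) as exc:
            # The uncommon row lands in the net for later manual adjustment.
            send_to_dlq({"row": row, "error": str(exc)})
    return processed
```

The important design choice is that one malformed row costs you one DLQ message, not a rerun of the entire file.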
3- Advantages vs Disadvantages:
Let’s start with the disadvantages:
Time: a single invocation cannot run for more than 15 minutes. A large job needs more than that, so you have to split the file, run many functions, and maintain the integrity of the data passed between them.
Power: what you can allocate is very limited, and it is hard to know the right amount of memory, which in Lambda also determines the CPU allocation.
Misconfiguration: since it’s a compute service, your code itself might be the misconfiguration. Using the wrong function in your code can lead to issues that affect memory allocation and create a bottleneck.
Let’s lift the spirit a bit:
Less configuration: since it’s a managed service, you don’t need to think about the availability of the service, or whether another process will use the computing power your job needs. It’s all taken care of.
Ready runtimes: if you need Python, just select it and all the needed tools and services are ready to use. As simple as that.
Sidekick services: you can use a lot of services to help you achieve the goal easily, thanks to the SDK.
So, the question is: does it work for ETL?
The answer is: it depends.
It depends on what you’re doing, what the expected outcome is, and how you’re utilizing it. It can be very costly if you use a lot of surrounding services, or if your code takes a long time to run. All these factors can influence your decision.
There is no perfect service for ETL; there are services that can do the job. Do you know what kind of job needs to be accomplished? Can you picture the process? Then you can answer the question.