We can create serverless iteration solutions like pagination with Step Functions. This way, we won't face the Lambda functions' timeout limit and can process many items without interruptions.
1. The scenario
Bob just got the task to write a logic that backfills some fields in a 3rd party application. The solution should get secret agent data from the database, then transform and upload them to the 3rd party application.
Bob decided to create a Lambda function to run the script. He allocated it a large memory and added the 15 minutes maximum timeout. We should address a restriction: the 3rd party API has a rate limit cap and can only accept two requests per second. Considering the number of agents Bob's company stores in the database, he believes 15 minutes should be enough for the migration to complete.
But much to his surprise, the function timed out after processing approximately two-thirds of the database items. Even though he logged how many agents the logic had processed so far, he wasn't sure if the application implemented the logic correctly on all items.
Because of the timeout interruption, Bob had to start the Lambda function again to process the remaining chunk of data.
2. A solution
Step Functions can be a good solution in such situations. We can set up a state machine easier than deploying an ECS container or provisioning an EC2 instance with all the installation hassle.
2.1. Pre-requisites
This article won't discuss how to
- create a state machine in Step Functions
- do data flow manipulation
- use CDK or other Infrastructure as Code method to create the state machine
- create Lambda functions and integrate them with Step Functions.
I'll leave some links at the bottom of the page that describe these operations in detail.
2.2. Setup
The state machine will orchestrate five Lambda functions. Each function will simulate one stage of the workflow.
get-agents
The get-agents
function will simulate the database call where the application initially fetches the items from:
const arr = [{
name: 'Jack Reacher'
}, {
name: 'Lorraine Broughton'
}, {
name: 'Harry Hart'
}, {
name: 'Ethan Hunt'
}, {
name: 'Jason Bourne'
}, {
name: 'James Bond'
}];
const limit = 2;
exports.handler = async (event) => {
const page = event.pageNumber;
const skip = page * limit;
const agents = arr.slice(skip, skip + limit);
return {
pageNumber: page,
numberOfProcessedAgents: agents.length,
agents,
};
};
The array of agents will play the role of the database. The limit
variable reflects the number of 3rd party calls per second we are allowed to make before the API throttles the requests.
The function returns an object with the agents
to be processed in the current invocation round. The pageNumber
property shows which chunk of two the application is currently processing.
The numberOfProcessingAgents
is the number of agents we are feeding into the next stage of the state machine. This number will always be 2
at most and correspond to the number of API calls the 3rd party application allows.
transform-data
There is not much to see here. We do something internally with the data if necessary:
exports.handler = async (event) => {
const transformedAgent = {
name: event.name,
age: 30
};
return transformedAgent;
};
The function here adds the age
property to the agent name.
call-third-party
This function simulates the 3rd party call and as such, implements a Promise:
exports.handler = async (event) => {
const { name, age } = event;
const response = await Promise.resolve({
name,
age,
id: `${name}_${age}`
});
return response;
};
retry
This function is small but is an essential part of the workflow. It calculates the upcoming page number by incrementing the original pageNumber
value. Step Functions then triggers get-agents
again with the new value.
exports.handler = async (event) => {
const originalPageNumber = event.pageNumber;
const newPageNumber = originalPageNumber + 1;
return {
pageNumber: newPageNumber,
};
};
get-agents
will then get the next chunk of two agents from the array (database) based on the newly calculated pageNumber
.
base-case-handler
This function will run when no more agents are left to process. It can do anything relevant to the use case. In this example, it simply logs Success
to the console.
2.3. The state machine
We can create state machines (relatively) quickly in the AWS Console with the Workflow Studio. It's a drag-and-drop tool that provides Action
and Flow
pieces. We can use these building blocks to generate the state machine.
I think Workflow Studio is one of the best tools in the AWS Console. (It would be even better if they displayed the toasts only once. It's a pain to close them every time I edit the state machine in Workflow Studio.)
The simple recursion state machine in Workflow Studio looks like this:
One of the advantages of using Workflow Studio is that it automatically generates the state machine definition in Amazon State Language (ASL). It can be a huge JSON file, which is a pain to create by hand.
I will add the JSON that Workflow Studio generated at the end of this article for anyone who enjoys reading it.
3. Key points
Let's go over some key points that make the state machine work.
3.1. Input to get-agents
The state machine will start with pageNumber: 0
:
{
"pageNumber": 0
}
The get-agents
function will receive a different pageNumber
input at each invocation. This way, the function's logic can calculate which chunk of items it needs to get from the database.
3.2. The returned object from get-agents
We already discussed it above. One critical point is that we carry the current pageNumber
value through the state machine because we want to increment it in the retry
function after the main logic has run.
3.3. Choice state
The Choice
state separates the base case from the recursive case.
The base case defines when the recursion should end. It will finish when we have no more agents left to process (numberOfProcessedAgents === 0
). In this case, the state machine will run the success handler. If we don't define the base case, we'll get an infinite loop, and we don't want that.
The recursive case is the path that runs the main logic.
3.4. Map state
The Map
state will process the current chunk of agents simultaneously. Step Functions will call transform-data
and call-third-party
separately for each agent in the input array. In this case, two function instances will run in parallel by each processing one agent.
This state has some input and output manipulation.
InputPath
Map
will need an array of elements as input.
We can refer to the current agents' chunk array with $.agents
, which extracts the agents
array from the state input. (This is the same as the get-agent
function's return value.)
ResultPath
To make the recursion/pagination work, we will push the original pageNumber
value to the next stage.
We can do so by using the ResultPath
filter.
By default, the result of the state will be the output. In this case, this would be the resolved values of the call-third-party
function combined back to an array.
But we also want to forward the pageNumber
to the next stage.
So the output of the entire Map
state will contain a result
object, which has the transformed agents. The original Map
state input (the return value of get-agents
with the pageNumber
, numberOfProcessedAgents
, and agents
properties) will also be in the output. This way, the retry
Lambda will know about the current page and can calculate the entry point for the next iteration.
3.5. Wait state
The Wait
state will wait (surprise) one second before continuing with the next iteration cycle.
This step simulates the rate-limiting behavior of the 3rd party application. We can process two items per second, so we will wait one second before processing the next chunk of two agents.
The Wait
step will only run after Map
has finished processing all items.
3.6. Retry
We already discussed how this function works. It simply increments pageNumber
and calls get-agents
again with
{
"pageNumber": 1
}
after the first iteration. The function will keep adding 1
to pageNumber
until the logic invokes the success handler, which means that the state machine has processed all agent data.
4. Considerations
Step Functions can be more complex than what this solution presents.
The application shown above is only a skeleton and presents the basic pattern. This article's purpose is to create a working application that we can use to build a production-ready workflow. The solution is not suitable for implementation in production in its current format. Let's see some points why.
4.1. Only the happy path
Step Functions have built-in retry logic in case a Lambda function fails. We can see these settings in the Retry
section of the JSON. We should work on this part.
I only show the success path here, and the solution doesn't handle errors. What happens when an API call to the 3rd party throws? We should manage these cases.
4.2. More data manipulation
We can add more data manipulation to the state machine. I use InputPath
and ResultPath
here but Step Functions provides more ways to modify the input and output data.
The documentation has some examples, and the Data Flow simulator in the Step Functions console is also a tool that is worth using.
4.3. Stubbed side effects
I simulate the database and 3rd party calls in this exercise. It means that the state machine executes relatively quickly. The total execution time will be way longer with actual data in a production environment. It is particularly true when we have to process thousands of database items.
A state machine can run for up to one year (standard workflow), while Lambda functions will time out after 15 minutes. Hopefully, we don't have to sit and wait a year until the logic completes. But we can get into a situation when the total execution time exceeds the maximum Lambda timeout.
4.4. Cost factor
AWS charges for standard state machine workflows based on the number of state transitions. They also provide a free tier, so this exercise didn't cost me anything.
But we can receive a large bill at the end of the month if a busy application runs this workflow multiple times an hour. I would implement this logic in a scenario where the application needs to run, for example, as a cron job once a day or week.
5. Summary
Step Functions is a great alternative to Lambda functions when we need to create an application. State machines don't have the Lambda function timeout limit and are also suitable for implementing recursive logic like pagination.
6. State machine definition
I'll insert the state machine definition JSON for hardcore serverless fans. Workflow Studio generated this ASL file from the state machine I created for this exercise.
{
"Comment": "Agents",
"StartAt": "Get Agents",
"States": {
"Get Agents": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.Payload",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:get-agents:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Next": "Choice"
},
"Choice": {
"Type": "Choice",
"Choices": [
{
"Not": {
"Variable": "$.numberOfProcessedAgents",
"NumericEquals": 0
},
"Next": "Map"
}
],
"Default": "Base case handler"
},
"Map": {
"Type": "Map",
"Iterator": {
"StartAt": "Transform data",
"States": {
"Transform data": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.Payload",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:transform-data:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Next": "Call 3rd party"
},
"Call 3rd party": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.Payload",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:call-third-party:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"End": true
}
}
},
"Next": "Wait",
"InputPath": "$.agents",
"ResultPath": "$.result"
},
"Wait": {
"Type": "Wait",
"Seconds": 1,
"Next": "Retry"
},
"Retry": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.Payload",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:retry:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Next": "Get Agents"
},
"Base case handler": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.Payload",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:base-case-handler:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Next": "Success"
},
"Success": {
"Type": "Succeed"
}
}
}
6. Further reading
Getting started with AWS Step Functions - Start from the beginning with Step Functions
Getting started with Lambda - Same with Lambda functions
Amazon State Language - Documentation on ASL
Input and Output Processing in Step Functions - Data flow manipulation
Top comments (0)