Arpad Toth for AWS Community Builders

Posted on Nov 17, 2022 • Originally published at arpadt.com

Building recursive logic with Step Functions

#aws #serverless

We can create serverless iteration solutions like pagination with Step Functions. This way, we won't face the Lambda functions' timeout limit and can process many items without interruptions.

1. The scenario

Bob just got the task of writing logic that backfills some fields in a 3rd party application. The solution should get secret agent data from the database, then transform the records and upload them to the 3rd party application.

Bob decided to create a Lambda function to run the script. He allocated a generous amount of memory to it and set the maximum timeout of 15 minutes. There is one restriction to address: the 3rd party API is rate limited and accepts only two requests per second. Considering the number of agents Bob's company stores in the database, he believed 15 minutes would be enough for the migration to complete.

But much to his surprise, the function timed out after processing approximately two-thirds of the database items. Even though he logged how many agents the logic had processed so far, he wasn't sure that the logic had been applied correctly to every item.

Because of the timeout interruption, Bob had to start the Lambda function again to process the remaining chunk of data.

2. A solution

Step Functions can be a good solution in such situations. Setting up a state machine is easier than deploying an ECS container or provisioning an EC2 instance with all the installation hassle.

2.1. Pre-requisites

This article won't discuss how to

create a state machine in Step Functions
do data flow manipulation
use CDK or another Infrastructure as Code method to create the state machine
create Lambda functions and integrate them with Step Functions.

I'll leave some links at the bottom of the page that describe these operations in detail.

2.2. Setup

The state machine will orchestrate five Lambda functions. Each function will simulate one stage of the workflow.

get-agents

The get-agents function simulates the database call that fetches the items to process:

const arr = [{
    name: 'Jack Reacher'
}, {
    name: 'Lorraine Broughton'
}, {
    name: 'Harry Hart'
}, {
    name: 'Ethan Hunt'
}, {
    name: 'Jason Bourne'
}, {
    name: 'James Bond'
}];

const limit = 2;

exports.handler = async (event) => {
  const page = event.pageNumber;
  const skip = page * limit;

  const agents = arr.slice(skip, skip + limit);

  return {
    pageNumber: page,
    numberOfProcessedAgents: agents.length,
    agents,
  };
};

The array of agents will play the role of the database. The limit variable reflects the number of 3rd party calls per second we are allowed to make before the API throttles the requests.

The function returns an object with the agents to be processed in the current invocation round. The pageNumber property shows which chunk of two the application is currently processing.

The numberOfProcessedAgents property is the number of agents we are feeding into the next stage of the state machine. This number will always be 2 at most, which corresponds to the number of API calls per second the 3rd party application allows.
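To see the paging arithmetic in isolation, the following sketch applies the same slicing logic to pages 0 through 3. The getPage helper and the array of plain names are made up for illustration and are not part of the deployed function.

```javascript
// Illustration only: the same slicing logic as get-agents, written as a
// plain synchronous function so it can be run locally with Node.js.
const agentNames = ['Jack Reacher', 'Lorraine Broughton', 'Harry Hart',
  'Ethan Hunt', 'Jason Bourne', 'James Bond'];
const pageSize = 2;

// getPage is a hypothetical helper, not part of the deployed function.
function getPage(pageNumber) {
  const skip = pageNumber * pageSize;
  const agents = agentNames.slice(skip, skip + pageSize);
  return { pageNumber, numberOfProcessedAgents: agents.length, agents };
}

for (let page = 0; page <= 3; page++) {
  const { numberOfProcessedAgents, agents } = getPage(page);
  console.log(page, numberOfProcessedAgents, agents);
}
// Pages 0, 1 and 2 each contain two names. Page 3 is empty, so
// numberOfProcessedAgents is 0 there.
```

The empty fourth page is what later tells the state machine to stop.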

transform-data

There is not much to see here. We do something internally with the data if necessary:

exports.handler = async (event) => {
  const transformedAgent = {
    name: event.name,
    age: 30
  };

  return transformedAgent;
};

The function here adds an age property to the agent object.

call-third-party

This function simulates the 3rd party call and, as such, awaits a resolved Promise:

exports.handler = async (event) => {
  const { name, age } = event;

  const response = await Promise.resolve({
    name,
    age,
    id: `${name}_${age}`
  });
  return response;
};

retry

This function is small but is an essential part of the workflow. It calculates the upcoming page number by incrementing the original pageNumber value. Step Functions then triggers get-agents again with the new value.

exports.handler = async (event) => {
  const originalPageNumber = event.pageNumber;
  const newPageNumber = originalPageNumber + 1;

  return {
    pageNumber: newPageNumber,
  };
};

get-agents will then get the next chunk of two agents from the array (database) based on the newly calculated pageNumber.

base-case-handler

This function will run when no more agents are left to process. It can do anything relevant to the use case. In this example, it simply logs Success to the console.

2.3. The state machine

We can create state machines (relatively) quickly in the AWS Console with the Workflow Studio. It's a drag-and-drop tool that provides Action and Flow pieces. We can use these building blocks to generate the state machine.

I think Workflow Studio is one of the best tools in the AWS Console. (It would be even better if they displayed the toasts only once. It's a pain to close them every time I edit the state machine in Workflow Studio.)

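The simple recursion state machine in Workflow Studio has the following flow: Get Agents runs first, followed by a Choice state. While agents remain, the flow continues with Map (Transform data, then Call 3rd party for each agent), Wait and Retry, which loops back to Get Agents. When no agents are left, Choice routes to Base case handler and then to Success.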

One of the advantages of using Workflow Studio is that it automatically generates the state machine definition in Amazon States Language (ASL). It can be a huge JSON file, which is a pain to create by hand.

I will add the JSON that Workflow Studio generated at the end of this article for anyone who enjoys reading it.

3. Key points

Let's go over some key points that make the state machine work.

3.1. Input to get-agents

The state machine will start with pageNumber: 0:

{
  "pageNumber": 0
}

The get-agents function will receive a different pageNumber input at each invocation. This way, the function's logic can calculate which chunk of items it needs to get from the database.

3.2. The returned object from get-agents

We already discussed it above. One critical point is that we carry the current pageNumber value through the state machine because we want to increment it in the retry function after the main logic has run.

3.3. Choice state

The Choice state separates the base case from the recursive case.

The base case defines when the recursion should end. It will finish when we have no more agents left to process (numberOfProcessedAgents === 0). In this case, the state machine will run the success handler. If we don't define the base case, we'll get an infinite loop, and we don't want that.

The recursive case is the path that runs the main logic.
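Expressed as ordinary code, the branching rule is roughly the function below. The nextState name is made up for illustration, and the returned strings are the state names from the definition in section 6.

```javascript
// Illustration only: the branching rule of the Choice state as a function.
function nextState(getAgentsOutput) {
  // Recursive case: there is at least one agent in the current chunk.
  if (getAgentsOutput.numberOfProcessedAgents !== 0) {
    return 'Map';
  }
  // Base case: an empty chunk means every page has been processed.
  return 'Base case handler';
}

console.log(nextState({ pageNumber: 0, numberOfProcessedAgents: 2 })); // Map
console.log(nextState({ pageNumber: 3, numberOfProcessedAgents: 0 })); // Base case handler
```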

3.4. Map state

The Map state will process the agents in the current chunk concurrently. Step Functions will call transform-data and call-third-party separately for each agent in the input array. In this case, two function instances will run in parallel, each processing one agent.
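As a rough local analogy, and not how Step Functions is implemented, the Map state behaves like Promise.all over the chunk. In the sketch below, transformData and callThirdParty are stand-ins that mirror the two Lambda handlers, and runMap is a made-up name.

```javascript
// Illustration only: a local approximation of what the Map state does with
// one chunk. transformData and callThirdParty mirror the two Lambda handlers.
const transformData = async (agent) => ({ name: agent.name, age: 30 });
const callThirdParty = async ({ name, age }) => ({ name, age, id: `${name}_${age}` });

// Each agent flows through both steps independently. Promise.all collects
// the results into an array in the same order as the input.
async function runMap(agents) {
  return Promise.all(
    agents.map(async (agent) => callThirdParty(await transformData(agent)))
  );
}

runMap([{ name: 'Harry Hart' }, { name: 'Ethan Hunt' }]).then(console.log);
```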

This state has some input and output manipulation.

InputPath

Map will need an array of elements as input.

We can refer to the current chunk of agents with $.agents, which extracts the agents array from the state input. (The state input is the same as the get-agents function's return value.)

ResultPath

To make the recursion/pagination work, we will push the original pageNumber value to the next stage.

We can do so by using the ResultPath filter.

By default, the result of the state will be the output. In this case, this would be the resolved values of the call-third-party function collected into an array.

But we also want to forward the pageNumber to the next stage.

So the output of the entire Map state will contain a result object, which has the transformed agents. The original Map state input (the return value of get-agents with the pageNumber, numberOfProcessedAgents, and agents properties) will also be in the output. This way, the retry Lambda will know about the current page and can calculate the entry point for the next iteration.
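The effect of the two filters can be sketched in plain JavaScript. The applyMapPaths function and its runIterations callback are hypothetical names used only to illustrate how the input and the result end up in one object. This is not a Step Functions API.

```javascript
// Illustration only: how the Map state's input and result are combined when
// InputPath is "$.agents" and ResultPath is "$.result".
function applyMapPaths(stateInput, runIterations) {
  const mapInput = stateInput.agents;      // InputPath: "$.agents"
  const result = runIterations(mapInput);  // array of iteration outputs
  return { ...stateInput, result };        // ResultPath: "$.result"
}

const stateInput = {
  pageNumber: 0,
  numberOfProcessedAgents: 2,
  agents: [{ name: 'Jack Reacher' }, { name: 'Lorraine Broughton' }],
};

// A synchronous stand-in for the transform-data and call-third-party steps.
const output = applyMapPaths(stateInput, (agents) =>
  agents.map(({ name }) => ({ name, age: 30, id: `${name}_30` })));

console.log(output.pageNumber);    // 0
console.log(output.result.length); // 2
```

Because pageNumber survives next to result, the retry function can still read it.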

3.5. Wait state

The Wait state will wait (surprise) one second before continuing with the next iteration cycle.

This step keeps the workflow within the rate limit of the 3rd party application. The API accepts two requests per second, so we will wait one second before processing the next chunk of two agents.

The Wait step will only run after Map has finished processing all items.

3.6. Retry

We already discussed how this function works. It simply increments pageNumber and calls get-agents again with

{
  "pageNumber": 1
}

after the first iteration. The function will keep adding 1 to pageNumber until the logic invokes the success handler, which means that the state machine has processed all agent data.
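Putting the pieces together, the whole loop can be simulated locally with plain functions in place of the Lambda functions and states. Everything in this sketch (runStateMachine, processAgent, the waitMs parameter) is a stand-in for illustration. The real orchestration is done by Step Functions.

```javascript
// Illustration only: the whole loop, simulated locally with plain functions
// in place of the Lambda functions and states.
const db = ['Jack Reacher', 'Lorraine Broughton', 'Harry Hart',
  'Ethan Hunt', 'Jason Bourne', 'James Bond'].map((name) => ({ name }));
const limit = 2;

const getAgents = ({ pageNumber }) => {
  const agents = db.slice(pageNumber * limit, pageNumber * limit + limit);
  return { pageNumber, numberOfProcessedAgents: agents.length, agents };
};
const processAgent = ({ name }) => ({ name, age: 30, id: `${name}_30` });
const retry = ({ pageNumber }) => ({ pageNumber: pageNumber + 1 });
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runStateMachine(input, waitMs = 1000) {
  const uploaded = [];
  let state = getAgents(input);
  while (state.numberOfProcessedAgents !== 0) {        // Choice
    uploaded.push(...state.agents.map(processAgent));  // Map
    await wait(waitMs);                                // Wait
    state = getAgents(retry(state));                   // Retry, then Get Agents
  }
  return { uploaded, lastPageNumber: state.pageNumber }; // base case
}

runStateMachine({ pageNumber: 0 }, 10).then(({ uploaded, lastPageNumber }) => {
  console.log(uploaded.length, lastPageNumber); // 6 3
});
```

With six agents and a page size of two, the loop runs three times and stops when page 3 comes back empty.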

4. Considerations

Step Functions can be more complex than what this solution presents.

The application shown above is only a skeleton and presents the basic pattern. This article's purpose is to create a working starting point that we can build on to get a production-ready workflow. The solution is not suitable for production in its current form. Let's look at some of the reasons why.

4.1. Only the happy path

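Step Functions has built-in retry logic in case a Lambda function fails. We can see these settings in the Retry sections of the JSON. The generated settings only list Lambda service errors (Lambda.ServiceException, Lambda.AWSLambdaException and Lambda.SdkClientException), so we should work on this part.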

I only show the success path here, and the solution doesn't handle errors. What happens when an API call to the 3rd party throws? We should manage these cases.
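One possible starting point is to throw a named error from call-third-party so that a Retry or Catch rule has something specific to match in its ErrorEquals list. The sketch below is an assumption-heavy illustration: ThrottlingError is a hypothetical error class, upload is a hypothetical stand-in for the real API client, and the statusCode property on its response is assumed.

```javascript
// Hypothetical error type. Its name is what a Retry or Catch rule in the
// state machine definition could list in ErrorEquals.
class ThrottlingError extends Error {
  constructor(message) {
    super(message);
    this.name = 'ThrottlingError';
  }
}

// upload is a hypothetical stand-in for the real 3rd party client. It is
// assumed to resolve to an object with a numeric statusCode.
async function callThirdParty(agent, upload) {
  const response = await upload(agent);

  if (response.statusCode === 429) {
    // The API rejected the call because of the rate limit.
    throw new ThrottlingError(`Rate limit hit for ${agent.name}`);
  }
  if (response.statusCode >= 400) {
    // Any other failure surfaces as a generic error.
    throw new Error(`Upload failed with status ${response.statusCode}`);
  }

  return { ...agent, id: `${agent.name}_${agent.age}` };
}
```

Separating throttling from other failures would let the state machine retry the first kind with a backoff and route the second kind to a failure handler.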

4.2. More data manipulation

We can add more data manipulation to the state machine. I use InputPath and ResultPath here but Step Functions provides more ways to modify the input and output data.

The documentation has some examples, and the Data Flow simulator in the Step Functions console is also a tool that is worth using.

4.3. Stubbed side effects

I simulate the database and 3rd party calls in this exercise. It means that the state machine executes relatively quickly. The total execution time will be way longer with actual data in a production environment. It is particularly true when we have to process thousands of database items.

A state machine can run for up to one year (standard workflow), while Lambda functions will time out after 15 minutes. Hopefully, we don't have to sit and wait a year until the logic completes. But we can get into a situation when the total execution time exceeds the maximum Lambda timeout.

4.4. Cost factor

AWS charges for standard state machine workflows based on the number of state transitions. They also provide a free tier, so this exercise didn't cost me anything.

But we can receive a large bill at the end of the month if a busy application runs this workflow multiple times an hour. I would implement this logic in a scenario where the application needs to run, for example, as a cron job once a day or week.
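To get a feel for the scale, we can count how many states one execution enters and how long the Wait states alone add up to. The sketch below derives the counts from the state machine in this article. The estimateRun function is made up for illustration, and the way AWS counts billable state transitions may differ from this simple tally, so treat the numbers as a rough estimate only.

```javascript
// Illustration only: a rough tally of state entries and waiting time for
// one execution of the state machine described in this article.
function estimateRun(totalAgents, pageSize = 2, waitSeconds = 1) {
  const pages = Math.ceil(totalAgents / pageSize);

  // Per non-empty page: Get Agents, Choice, Map, Wait, Retry.
  const loopStates = pages * 5;
  // Inside Map: Transform data and Call 3rd party run once per agent.
  const iteratorStates = totalAgents * 2;
  // Final pass: Get Agents, Choice, Base case handler, Success.
  const finalStates = 4;

  return {
    pages,
    stateEntries: loopStates + iteratorStates + finalStates,
    minimumWaitSeconds: pages * waitSeconds,
  };
}

console.log(estimateRun(6));
// { pages: 3, stateEntries: 31, minimumWaitSeconds: 3 }
console.log(estimateRun(10000));
// { pages: 5000, stateEntries: 45004, minimumWaitSeconds: 5000 }
```

For 10,000 agents, the Wait states alone add up to 5,000 seconds (about 83 minutes), well beyond the 15-minute Lambda timeout.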

5. Summary

Step Functions is a great alternative to a single long-running Lambda function when a job might take longer than 15 minutes. Standard state machines are not bound by the Lambda function timeout limit and are also suitable for implementing recursive logic like pagination.

6. State machine definition

I'll insert the state machine definition JSON for hardcore serverless fans. Workflow Studio generated this ASL file from the state machine I created for this exercise.

{
  "Comment": "Agents",
  "StartAt": "Get Agents",
  "States": {
    "Get Agents": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:get-agents:$LATEST"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "Choice"
    },
    "Choice": {
      "Type": "Choice",
      "Choices": [
        {
          "Not": {
            "Variable": "$.numberOfProcessedAgents",
            "NumericEquals": 0
          },
          "Next": "Map"
        }
      ],
      "Default": "Base case handler"
    },
    "Map": {
      "Type": "Map",
      "Iterator": {
        "StartAt": "Transform data",
        "States": {
          "Transform data": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "OutputPath": "$.Payload",
            "Parameters": {
              "Payload.$": "$",
              "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:transform-data:$LATEST"
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "Lambda.ServiceException",
                  "Lambda.AWSLambdaException",
                  "Lambda.SdkClientException"
                ],
                "IntervalSeconds": 2,
                "MaxAttempts": 6,
                "BackoffRate": 2
              }
            ],
            "Next": "Call 3rd party"
          },
          "Call 3rd party": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "OutputPath": "$.Payload",
            "Parameters": {
              "Payload.$": "$",
              "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:call-third-party:$LATEST"
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "Lambda.ServiceException",
                  "Lambda.AWSLambdaException",
                  "Lambda.SdkClientException"
                ],
                "IntervalSeconds": 2,
                "MaxAttempts": 6,
                "BackoffRate": 2
              }
            ],
            "End": true
          }
        }
      },
      "Next": "Wait",
      "InputPath": "$.agents",
      "ResultPath": "$.result"
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 1,
      "Next": "Retry"
    },
    "Retry": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:retry:$LATEST"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "Get Agents"
    },
    "Base case handler": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:base-case-handler:$LATEST"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "Success"
    },
    "Success": {
      "Type": "Succeed"
    }
  }
}

7. Further reading

Getting started with AWS Step Functions - Start from the beginning with Step Functions

Getting started with Lambda - Same with Lambda functions

Amazon States Language - Documentation on ASL

Input and Output Processing in Step Functions - Data flow manipulation
