In this post you will learn some of the best patterns/tricks I have learned during my time creating Step Functions workflows, while working with clients at Accenture.
Table of contents
- Make use of the Step Functions Workflow Studio
- Utilize the service integrations
- Use .waitForTaskToken
- Make use of the inbuilt retries
- Utilize Heartbeats to fail fast
- Define a Catch Handler
- Further reading
Make use of the Step Functions Workflow Studio
Since its introduction in June, Step Functions Workflow Studio has proven its value to me several times. As it's a low-code editor with the most common configurations already baked in, the effort of writing and designing a workflow in Amazon States Language has plummeted. The minutes and hours spent handwriting workflows of 100+ lines are finally over, which is something you should not ignore when working with Step Functions.
As visualized below, the option to create a workflow from scratch in Workflow Studio directly is already present:
As is the option to edit pre-existing workflows directly in Step Functions:
Utilize the service integrations
While AWS offers some 17 "optimized" service integrations (for the definitive list see here), which provide custom options for integrating with specific services, AWS has also released an option to call the APIs of nearly all AWS services directly, as described in this article. This allows you to scrap some of the utility Lambdas that only exist to add a missing piece of functionality to a service, and to use Step Functions instead.
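As a rough sketch of what such a direct API call can look like, the state below starts a Glue crawler through the AWS SDK integration instead of a utility Lambda. It is written as a JavaScript object and serialized to Amazon States Language. The state name and the crawler name `my-crawler` are placeholders I made up for illustration.

```javascript
// Hypothetical state: starts a Glue crawler through the AWS SDK integration,
// replacing a utility Lambda that would only forward this one API call.
const startCrawlerState = {
  Type: "Task",
  // Resource pattern: arn:aws:states:::aws-sdk:<service>:<apiAction>
  Resource: "arn:aws:states:::aws-sdk:glue:startCrawler",
  Parameters: {
    // "my-crawler" is a placeholder crawler name
    Name: "my-crawler"
  },
  End: true
};

// Amazon States Language is JSON, so the object can be serialized as-is.
const definitionFragment = JSON.stringify({ "Start Crawler": startCrawlerState }, null, 2);
console.log(definitionFragment);
```

Note that the execution role of the state machine still needs permission for the API being called, here `glue:StartCrawler`.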
Use .waitForTaskToken
By using .waitForTaskToken, you can pause the workflow at a task until that task, for example a Lambda function, reports its result back to Step Functions together with the task token.
Be aware that you need to pass the task token in the payload for the Lambda yourself, as Step Functions does not inject it automatically.
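A minimal sketch of how the token could be passed along, assuming a Lambda task: the token is read from the context object via `$$.Task.Token` and placed in the payload. The field name `MyTaskToken` matches the handler code in this post, while the function name `my-function` is a placeholder.

```javascript
// Sketch of a callback task, written as a JavaScript object.
const waitForCallbackState = {
  Type: "Task",
  // The .waitForTaskToken suffix makes the task wait for a callback.
  Resource: "arn:aws:states:::lambda:invoke.waitForTaskToken",
  Parameters: {
    FunctionName: "my-function", // placeholder function name
    Payload: {
      // Keys ending in ".$" are resolved as paths;
      // "$$" addresses the context object that holds the task token.
      "MyTaskToken.$": "$$.Task.Token",
      // Forward the state input alongside the token.
      "input.$": "$"
    }
  },
  End: true
};

console.log(JSON.stringify(waitForCallbackState, null, 2));
```

Inside the Lambda, the token is then available as `event.MyTaskToken`.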
Example
This example shows how to send the task token for success/failure back to Step Functions via AWS JS SDK v3.
import {
  SFNClient,
  SendTaskFailureCommand,
  SendTaskSuccessCommand
} from "@aws-sdk/client-sfn";

const client = new SFNClient();

// Report success; the API expects `output` as a JSON string.
async function success(taskToken, input = {}) {
  const stepFunctionsCommand = new SendTaskSuccessCommand({
    taskToken,
    output: JSON.stringify(input)
  });
  await client.send(stepFunctionsCommand);
}

// Report failure; `cause` and `error` both have to be strings.
async function failure(taskToken, cause, error) {
  const stepFunctionsCommand = new SendTaskFailureCommand({
    taskToken,
    cause,
    error
  });
  await client.send(stepFunctionsCommand);
}

export async function main(event) {
  try {
    await success(event.MyTaskToken);
  } catch (error) {
    console.error(error);
    // $metadata is only present on errors thrown by the AWS SDK.
    const {
      requestId,
      cfId,
      extendedRequestId
    } = error.$metadata ?? {};
    await failure(
      event.MyTaskToken,
      JSON.stringify({ requestId, cfId, extendedRequestId }),
      error.name
    );
  }
}
Utilize Heartbeats to fail fast
As Step Functions executions can run for up to a year (at least for Standard workflows), it is imperative to avoid stuck executions. One way of doing this when integrating with Lambda is the heartbeat API together with the HeartbeatSeconds field. It lets developers specify a maximum interval within which a heartbeat has to be sent back to Step Functions. Failure to meet this deadline fails the task with a States.Timeout error.
Example
The first snippet below specifies a task with a maximum heartbeat interval of 10 minutes. The second one shows how to send a heartbeat back with the AWS JS SDK v3.
{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "HeartbeatSeconds": 600
}
import {
  SFNClient,
  SendTaskHeartbeatCommand
} from "@aws-sdk/client-sfn";

const client = new SFNClient();

async function heartbeat(taskToken) {
  const stepFunctionsCommand = new SendTaskHeartbeatCommand({
    taskToken
  });
  await client.send(stepFunctionsCommand);
}

async function main(event) {
  await heartbeat(event.MyTaskToken);
  // some expensive calculation
  await heartbeat(event.MyTaskToken);
}
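If the work takes longer than one heartbeat interval, sending a heartbeat only before and after it is not enough. One possible approach, sketched below, is a small wrapper that keeps sending heartbeats on a timer while the work is running. `withHeartbeat` is a hypothetical helper of mine, not part of the AWS SDK; the function that sends the heartbeat is passed in, so the `heartbeat` function from above can be reused.

```javascript
// Hypothetical helper: runs `work` while calling `sendHeartbeat`
// every `intervalMs` milliseconds, and stops the timer afterwards.
async function withHeartbeat(sendHeartbeat, intervalMs, work) {
  const timer = setInterval(() => {
    // A failed heartbeat should not crash the handler; log it and carry on.
    Promise.resolve()
      .then(sendHeartbeat)
      .catch((err) => console.error("heartbeat failed", err));
  }, intervalMs);
  try {
    return await work();
  } finally {
    // Stop sending heartbeats whether the work succeeded or threw.
    clearInterval(timer);
  }
}

// Possible usage inside the handler (one heartbeat per minute):
// await withHeartbeat(() => heartbeat(event.MyTaskToken), 60_000, expensiveCalculation);
```

The timer only fires while the event loop is free, so this fits asynchronous work such as waiting on other APIs. A long synchronous calculation would block the heartbeats.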
Define a Catch Handler
Rationale
To quote Werner Vogels, Amazon CTO:
everything fails, all the time
Be prepared for the wildly different and sometimes unexpected errors the AWS APIs can throw by catching them, just like you would in a Lambda function (if you don't do that, we are having a whole different conversation).
Example
In this pretty simple example, the catch block is used on a Lambda task. It works on other tasks as well. This example uses a catch-all error code; for other error codes see here.
As it is usually a good idea to get notified when an error occurs, this example publishes to an SNS topic, which may have an email subscription. I'd recommend using this only for really critical errors, as you may otherwise miss important errors in your then-cluttered inbox.
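A minimal sketch of such a catch handler, assuming a Lambda task and an SNS topic, could look like this. The state names, the function name and the topic ARN are placeholders, and the states are written as a JavaScript object that serializes to Amazon States Language.

```javascript
// Sketch: catch every error on a Lambda task and publish it to SNS.
const states = {
  "Invoke Lambda Task": {
    Type: "Task",
    Resource: "arn:aws:states:::lambda:invoke",
    Parameters: { FunctionName: "my-function", "Payload.$": "$" }, // placeholder name
    Catch: [
      {
        // States.ALL is the catch-all error name.
        ErrorEquals: ["States.ALL"],
        // Keep the original input and attach the error details under $.error.
        ResultPath: "$.error",
        Next: "Notify Failure"
      }
    ],
    End: true
  },
  "Notify Failure": {
    Type: "Task",
    Resource: "arn:aws:states:::sns:publish",
    Parameters: {
      // Placeholder topic ARN
      TopicArn: "arn:aws:sns:eu-central-1:123456789012:workflow-errors",
      // Serialize the caught error object into the message text.
      "Message.$": "States.JsonToString($.error)"
    },
    Next: "Failed"
  },
  // End the execution as failed after the notification was sent.
  Failed: { Type: "Fail" }
};

console.log(JSON.stringify(states, null, 2));
```

Ending in a Fail state keeps the execution marked as failed, so the error stays visible in the console in addition to the notification.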
Make use of the inbuilt retries
Instead of catching errors and then terminating the flow, you might as well use retries to try to recover from an error or to wait for a desired state. You can use this, for example, to wait for results from APIs like those of a Glue Crawler, which need repeated polling and potentially exponential backoff.
Example
In this example, a retry on specific Lambda exceptions is shown. The IntervalSeconds parameter defines the initial delay that has to pass before the first retry is attempted. The BackoffRate parameter specifies the multiplier that is applied to the delay after each unsuccessful attempt. Step Functions will retry after 2, 4, 8, 16, 32, and 64 seconds, limited by the MaxAttempts parameter of 6.
{
  "Invoke Lambda Task": {
    // [..]
    "Retry": [
      {
        "ErrorEquals": [
          "Lambda.ServiceException",
          "Lambda.AWSLambdaException",
          "Lambda.SdkClientException"
        ],
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2
      }
    ]
    // [..]
  }
}
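To double-check the retry schedule for a given configuration, the delays can be computed with a few lines of JavaScript. `retryDelays` is a hypothetical helper for illustration only. It assumes a plain exponential backoff without any cap or jitter, and its defaults mirror the Amazon States Language defaults (1 second, 3 attempts, rate 2.0).

```javascript
// Computes the wait time in seconds before each retry attempt.
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2.0 } = {}) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    // The first retry waits IntervalSeconds; each later one is multiplied by BackoffRate.
    delays.push(IntervalSeconds * BackoffRate ** attempt);
  }
  return delays;
}

const delays = retryDelays({ IntervalSeconds: 2, MaxAttempts: 6, BackoffRate: 2 });
console.log(delays); // [ 2, 4, 8, 16, 32, 64 ]
console.log(delays.reduce((sum, d) => sum + d, 0)); // 126
```

With the configuration above, the task therefore spends a little over two minutes waiting in total before the error is finally propagated.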
Further reading
A nice paper by López et al. from 2018 compares the various orchestration platforms of the different hyperscalers:
DOI - 10.1109/UCC-Companion.2018.00049
ArXiv - arXiv:1807.11248
Be aware, though, that the services have developed over the three years since then, so it is left as an exercise for the reader to recognize how far Step Functions has come.
Header image by Gabriel Santos Fotografia via Pexels