Atsushi Suzuki

Mastering ECS Task Scheduling: Effective Strategies to Reduce Costs

Introduction

As we moved our service deployment from Lambda to ECS on Fargate, we aimed to minimize costs outside of the production environment. We implemented several cost-saving measures:

  • Configuring vCPU and memory to the minimum viable settings (0.25 vCPU, 0.5 GB)
  • Utilizing Fargate Spot for cost efficiency

These settings were the extent of what we could optimize on the ECS side. However, considering that costs accrue per hour based on vCPU and memory usage, we developed a system using Lambda and EventBridge to stop ECS tasks outside of business hours.

I have detailed the Lambda function code and EventBridge Terraform configuration below, so you can easily replicate and utilize them to reduce costs.

AWS Fargate Pricing

Lambda Function

The integration with EventBridge allows for scheduled execution, making Lambda an ideal environment for stopping and starting ECS tasks.

Considering the Single Responsibility Principle, I initially thought about separating the stop and start processes into different functions. However, since the code does not change frequently and to keep management simple, I decided to consolidate them into a single function.

In some cases, tasks need to be stopped or started manually rather than on a schedule, so the function takes an environment parameter that selects between the scheduled and manual target lists, in addition to the action parameter that chooses between stopping and starting.
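For example, EventBridge invokes the function with a payload like the following for a scheduled stop (the same event shape is used for the manual test described later in this article):

{
  "action": "stop",
  "environment": "scheduled"
}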

Below is the complete code; you can copy and paste it as-is, changing only the ECS cluster and service names specified in clusters_services_scheduled and clusters_services_manual.



import boto3
from botocore.exceptions import ClientError

def update_service(cluster_name, service_name, desired_count):
    client = boto3.client('ecs')
    application_autoscaling_client = boto3.client('application-autoscaling')

    # Control the minimum number of tasks in the AutoScaling policy
    scalable_targets = application_autoscaling_client.describe_scalable_targets(
        ServiceNamespace='ecs',
        ResourceIds=[f'service/{cluster_name}/{service_name}'],
        ScalableDimension='ecs:service:DesiredCount'
    )['ScalableTargets']

    for scalable_target in scalable_targets:
        application_autoscaling_client.register_scalable_target(
            ServiceNamespace='ecs',
            ResourceId=f'service/{cluster_name}/{service_name}',
            ScalableDimension='ecs:service:DesiredCount',
            MinCapacity=0 if desired_count == 0 else 1,
            MaxCapacity=scalable_target['MaxCapacity']
        )

    # Update the service
    service_update_result = client.update_service(
        cluster=cluster_name,
        service=service_name,
        desiredCount=desired_count
    )
    print(service_update_result)

def lambda_handler(event, context):
    try:
        client = boto3.client('ecs')
        elbv2_client = boto3.client('elbv2')

        action = event.get('action')  # 'stop' or 'start'
        environment = event.get('environment')  # 'scheduled' or 'manual'

        clusters_services_scheduled = [
            ('example-cluster', 'example-service-stg')
        ]

        clusters_services_manual = [
            ('example-cluster', 'example-service-dev')
        ]

        if environment == 'scheduled':
            clusters_services = clusters_services_scheduled
        elif environment == 'manual':
            clusters_services = clusters_services_manual
        else:
            raise ValueError("Invalid environment specified")

        if action not in ('stop', 'start'):
            raise ValueError("Invalid action specified")

        desired_count = 0 if action == 'stop' else 1

        for cluster_name, service_name in clusters_services:
            update_service(cluster_name, service_name, desired_count)

        if action == 'start':
            for cluster_name, service_name in clusters_services:
                # Retrieve new task IDs and register them with the target group
                tasks = client.list_tasks(
                    cluster=cluster_name,
                    serviceName=service_name
                )['taskArns']

                task_descriptions = client.describe_tasks(
                    cluster=cluster_name,
                    tasks=tasks
                )['tasks']

                # Retrieve target group information associated with the service
                load_balancers = client.describe_services(
                    cluster=cluster_name,
                    services=[service_name]
                )['services'][0]['loadBalancers']

                for load_balancer in load_balancers:
                    target_group_arn = load_balancer['targetGroupArn']

                    for task in task_descriptions:
                        task_id = task['taskArn'].split('/')[-1]
                        elbv2_client.register_targets(
                            TargetGroupArn=target_group_arn,
                            Targets=[{'Id': task_id}]
                        )

    except ClientError as e:
        print(f"Exception: {e}")
    except ValueError as e:
        print(f"Exception: {e}")



Code Explanation

Task Count Control in Services

The task count of ECS services is updated within the update_service function:



    service_update_result = client.update_service(
        cluster=cluster_name,
        service=service_name,
        desiredCount=desired_count
    )



This call uses the update_service method to set the desiredCount for the specified cluster and service. desiredCount is the number of tasks the service should keep running, so setting it to 0 stops the service's tasks and setting it back to 1 starts them again.
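If you want to confirm the update took effect, the counts can be read back with describe_services. This is an optional check, not part of the function above; the cluster and service names follow the examples used earlier:

import boto3

ecs = boto3.client('ecs')

# Read back the desired and running task counts after calling update_service
service = ecs.describe_services(
    cluster='example-cluster',
    services=['example-service-stg']
)['services'][0]

print(service['desiredCount'], service['runningCount'])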

Scaling Settings Control

The update_service function adjusts the number of tasks in an ECS service and sets the MinCapacity of the Auto Scaling policy to either 0 or 1:



def update_service(cluster_name, service_name, desired_count):
    client = boto3.client('ecs')
    application_autoscaling_client = boto3.client('application-autoscaling')
    ...



Even if the service's desired count is set to 0, Application Auto Scaling will scale it back up as long as MinCapacity remains at 1, so lowering MinCapacity is essential whenever Auto Scaling is configured.
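To see what the current scaling settings look like for a service, you can query describe_scalable_targets directly; a small standalone sketch using the example names from above:

import boto3

autoscaling = boto3.client('application-autoscaling')

# List the scalable targets registered for the service and print their capacity range
targets = autoscaling.describe_scalable_targets(
    ServiceNamespace='ecs',
    ResourceIds=['service/example-cluster/example-service-stg'],
    ScalableDimension='ecs:service:DesiredCount'
)['ScalableTargets']

for target in targets:
    print(target['MinCapacity'], target['MaxCapacity'])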

Task Launch and Target Group Registration

If the ECS service is fronted by an ELB (ALB), newly launched tasks need to be registered with the ALB's target group, so the function retrieves the new task IDs and registers them:



if action == 'start':
    for cluster_name, service_name in clusters_services:
        # Retrieve new task IDs and register them with the target group
        tasks = client.list_tasks(cluster=cluster_name, serviceName=service_name)['taskArns']
        ...



Even while its tasks are stopped, the ECS service remains associated with the target group, so the target group ARN can still be retrieved as follows:



load_balancers = client.describe_services(
    cluster=cluster_name,
    services=[service_name]
)['services'][0]['loadBalancers']

for load_balancer in load_balancers:
    target_group_arn = load_balancer['targetGroupArn']


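One caveat: immediately after update_service, list_tasks may not return the new tasks yet, because provisioning takes a while. A possible refinement, not part of the function above, is to wait for the service to stabilize before looking up the tasks, for example with the boto3 services_stable waiter:

import boto3

ecs = boto3.client('ecs')

# Wait until the service reaches a steady state (running count matches desired count)
waiter = ecs.get_waiter('services_stable')
waiter.wait(
    cluster='example-cluster',
    services=['example-service-stg']
)

# At this point list_tasks should return the newly started task ARNs
tasks = ecs.list_tasks(
    cluster='example-cluster',
    serviceName='example-service-stg'
)['taskArns']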

Terraform Code Example

The configuration for deploying the Lambda function via a ZIP file is as follows:



resource "aws_lambda_function" "ecs_task_scheduler" {
  function_name    = "ecs-task-scheduler"
  s3_bucket        = var.s3_bucket_lambda_functions_storage_bucket
  s3_key           = "ecs-task-scheduler.zip"
  handler          = "app.lambda_handler"
  runtime          = "python3.12"
  role             = var.iam_role_ecs_task_scheduler_lambda_exec_role_arn
  timeout          = 300 # 5 minutes
}



Place the Dockerfile and build.sh in the same directory as app.py. Run ./build.sh to create the ecs-task-scheduler.zip. Upload this ZIP file to the designated S3 bucket and execute terraform apply to complete the deployment.



FROM public.ecr.aws/lambda/python:3.12

# Install Python dependencies
COPY requirements.txt /var/task/
RUN pip install -r /var/task/requirements.txt --target /var/task

# Copy the Lambda function code
COPY app.py /var/task/

# Set the working directory
WORKDIR /var/task

# Set the CMD to your handler
CMD ["app.lambda_handler"]




#!/bin/bash

# Build the Docker image
docker build -t ecs-task-scheduler-build .

# Create a container from the image
container_id=$(docker create ecs-task-scheduler-build)

# Copy the contents of the container to a local directory
docker cp $container_id:/var/task ./package

# Clean up
docker rm $container_id

# Zip the contents of the local directory
cd package
zip -r ../ecs-task-scheduler.zip .
cd ..

# Clean up
rm -rf package


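For reference, the upload and deployment steps might look like this from the command line (the bucket name is a placeholder for whatever var.s3_bucket_lambda_functions_storage_bucket points to):

# Upload the packaged function to the S3 bucket referenced by the Terraform configuration
aws s3 cp ecs-task-scheduler.zip s3://<lambda-functions-bucket>/ecs-task-scheduler.zip

# Create or update the Lambda function
terraform apply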

Manual Execution Method

To manually execute the Lambda function, use the Test tab in the Lambda console. Paste the JSON request below into the test event body and press the Test button to run the function.



{
  "action": "start",
  "environment": "manual"
}


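The same invocation can also be done from the AWS CLI, assuming the function name ecs-task-scheduler defined in the Terraform above:

aws lambda invoke \
  --function-name ecs-task-scheduler \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action": "start", "environment": "manual"}' \
  response.json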


EventBridge

Use cron expressions in EventBridge to schedule automatic execution of the Lambda function on specific days and times. This lets you stop ECS tasks during weekday nights and throughout the weekend.

Weekday Schedule (22:00 - 5:00 Stop)

  • Stop: Weekdays at 22:00 JST (13:00 UTC)
  • Start: Weekdays at 5:00 JST (20:00 UTC the previous day)

Weekend Schedule (All Day Stop)

  • Stop: Saturday at 00:00 JST (Friday 15:00 UTC)
  • Start: Monday at 5:00 JST (Sunday 20:00 UTC)

Below is an example of Terraform code to set up these schedules. Define the EventBridge rules and set the Lambda function as their target. Note that EventBridge evaluates cron expressions in UTC, so both the time and the day-of-week fields shift whenever the UTC time falls on the previous day; adjust them for your own region.
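As a quick reference, the JST times above map to UTC cron fields as follows (note how the day of week shifts whenever the UTC time falls on the previous day):

22:00 JST Mon-Fri -> 13:00 UTC Mon-Fri -> cron(0 13 ? * MON-FRI *)
05:00 JST Mon-Fri -> 20:00 UTC Sun-Thu -> cron(0 20 ? * SUN-THU *)
00:00 JST Sat     -> 15:00 UTC Fri     -> cron(0 15 ? * FRI *)
05:00 JST Mon     -> 20:00 UTC Sun     -> cron(0 20 ? * SUN *)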

Weekday Schedule



# Weekday stop schedule (weekdays at 22:00 JST)
resource "aws_cloudwatch_event_rule" "ecs_weekday_stop_tasks_schedule" {
  name                = "ECSWeekdayStopTasksSchedule"
  description         = "Schedule to stop ECS tasks on weekdays at 22:00 JST"
  schedule_expression = "cron(0 13 ? * MON-FRI *)" # Weekdays at 22:00 JST (13:00 UTC)
}

resource "aws_cloudwatch_event_target" "ecs_weekday_stop_tasks_target" {
  rule      = aws_cloudwatch_event_rule.ecs_weekday_stop_tasks_schedule.name
  target_id = "ecsTaskSchedulerWeekdayStop"
  arn       = var.lambda_function_ecs_task_scheduler_arn

  input = jsonencode({
    action      = "stop"
    environment = "scheduled"
  })
}

resource "aws_lambda_permission" "ecs_weekday_stop_tasks_allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvokeLambdaWeekdayStop"
  action        = "lambda:InvokeFunction"
  function_name = var.lambda_function_ecs_task_scheduler_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ecs_weekday_stop_tasks_schedule.arn
}

# Weekday start schedule (weekdays at 5:00 JST)
resource "aws_cloudwatch_event_rule" "ecs_weekday_start_tasks_schedule" {
  name                = "ECSWeekdayStartTasksSchedule"
  description         = "Schedule to start ECS tasks on weekdays at 5:00 JST"
  schedule_expression = "cron(0 20 ? * SUN-THU *)" # Weekdays at 5:00 JST (20:00 UTC the previous day)
}

resource "aws_cloudwatch_event_target" "ecs_weekday_start_tasks_target" {
  rule      = aws_cloudwatch_event_rule.ecs_weekday_start_tasks_schedule.name
  target_id = "ecsTaskSchedulerWeekdayStart"
  arn       = var.lambda_function_ecs_task_scheduler_arn

  input = jsonencode({
    action      = "start"
    environment = "scheduled"
  })
}

resource "aws_lambda_permission" "ecs_weekday_start_tasks_allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvokeLambdaWeekdayStart"
  action        = "lambda:InvokeFunction"
  function_name = var.lambda_function_ecs_task_scheduler_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ecs_weekday_start_tasks_schedule.arn
}





Weekend Schedule




# Saturday stop schedule (at 0:00 JST)
resource "aws_cloudwatch_event_rule" "ecs_weekend_stop_tasks_schedule" {
  name                = "ECSWeekendStopTasksSchedule"
  description         = "Schedule to stop ECS tasks on Saturday at 00:00 JST"
  schedule_expression = "cron(0 15 ? * FRI *)" # Saturday at 0:00 JST (Friday 15:00 UTC)
}

resource "aws_cloudwatch_event_target" "ecs_weekend_stop_tasks_target" {
  rule      = aws_cloudwatch_event_rule.ecs_weekend_stop_tasks_schedule.name
  target_id = "ecsTaskSchedulerWeekendStop"
  arn       = var.lambda_function_ecs_task_scheduler_arn

  input = jsonencode({
    action      = "stop"
    environment = "scheduled"
  })
}

resource "aws_lambda_permission" "ecs_weekend_stop_tasks_allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvokeLambdaWeekendStop"
  action        = "lambda:InvokeFunction"
  function_name = var.lambda_function_ecs_task_scheduler_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ecs_weekend_stop_tasks_schedule.arn
}

# Monday start schedule (at 5:00 JST)
resource "aws_cloudwatch_event_rule" "ecs_weekend_start_tasks_schedule" {
  name                = "ECSWeekendStartTasksSchedule"
  description         = "Schedule to start ECS tasks on Monday at 05:00 JST"
  schedule_expression = "cron(0 20 ? * SUN *)" # Monday at 5:00 JST (Sunday 20:00 UTC)
}

resource "aws_cloudwatch_event_target" "ecs_weekend_start_tasks_target" {
  rule      = aws_cloudwatch_event_rule.ecs_weekend_start_tasks_schedule.name
  target_id = "ecsTaskSchedulerWeekendStart"
  arn       = var.lambda_function_ecs_task_scheduler_arn

  input = jsonencode({
    action      = "start"
    environment = "scheduled"
  })
}

resource "aws_lambda_permission" "ecs_weekend_start_tasks_allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvokeLambdaWeekendStart"
  action        = "lambda:InvokeFunction"
  function_name = var.lambda_function_ecs_task_scheduler_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ecs_weekend_start_tasks_schedule.arn
}





IAM Role

Here's an example of Terraform code defining the IAM policy attached to the execution role of the Lambda function that schedules ECS tasks. It grants the function permission to call the ECS and related AWS service APIs it needs.

  • ECS Service Management: Update services, list tasks, and fetch detailed information about tasks and services.
  • Auto Scaling Management: Register and deregister scalable targets, retrieve information about scalable targets.
  • Elastic Load Balancing (ELB) Management: Register and deregister targets, fetch detailed information about target groups and listeners.
  • Log Management: Create log groups and streams, post log events.


resource "aws_iam_policy" "ecs_task_scheduler_policy" {
name = "ecs-task-scheduler-policy"
description = "Policy for ECS task scheduler Lambda function"

policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Effect = "Allow",
Action = [
"ecs:UpdateService",
"ecs:ListTasks",
"ecs:DescribeTasks",
"ecs:DescribeServices"
],
Resource = ""
},
{
Effect = "Allow",
Action = [
"application-autoscaling:RegisterScalableTarget",
"application-autoscaling:DeregisterScalableTarget",
"application-autoscaling:DescribeScalableTargets"
],
Resource = ""
},
{
Effect = "Allow",
Action = [
"elasticloadbalancing:RegisterTargets",
"elasticloadbalancing:DeregisterTargets",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:DescribeListeners",
"elasticloadbalancing:DescribeRules"
],
Resource = ""
},
{
Effect = "Allow",
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
Resource = "arn:aws:logs:::"
}
]
})
}

resource "aws_iam_role_policy_attachment" "ecs_task_scheduler_policy_attach" {
role = aws_iam_role.ecs_task_scheduler_lambda_exec_role.name
policy_arn = aws_iam_policy.ecs_task_scheduler_policy.arn
}

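The policy attachment above references an execution role (aws_iam_role.ecs_task_scheduler_lambda_exec_role) that is not shown in this article. A minimal sketch of such a role, with an assumed name, could look like this:

resource "aws_iam_role" "ecs_task_scheduler_lambda_exec_role" {
  name = "ecs-task-scheduler-lambda-exec-role"

  # Allow the Lambda service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect    = "Allow",
        Principal = { Service = "lambda.amazonaws.com" },
        Action    = "sts:AssumeRole"
      }
    ]
  })
}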




Conclusion

As we continue to aim for further cost reductions, the next step will be to introduce Compute Savings Plans.

Compute Savings Plans Pricing

Top comments (5)

Cristian Măgherușan-Stanciu @magheru_san

This is cool and all, I'm just curious why move from Lambda to Fargate.

Lambda should be much more cost effective since it just runs when invoked, and needs no such tricks.

Earlier today I did an estimation for one of my customers and Lambda should reduce their production costs by 10x, from $4k to about $400, and staging costs by 300x, from $1.5k to only $5/month.

I have a feeling that your start/stop Lambda may have just as much if not more code than the entire application you're trying to orchestrate :-)

Atsushi Suzuki

Thank you for your comment. The main reason for our transition from Lambda to Fargate is based on the specific requirements of our service. Our service demands high performance and real-time processing, and we were facing issues with Lambda's cold start delays. While Fargate tends to be more expensive, we manage costs by using Fargate Spot and stopping tasks during idle periods.

Kishor Kumbhar

Great post !!

May I know how it differs from scheduling with the native option available in ASG scaling?

Atsushi Suzuki

Thank you for your appreciation!

The method described in the article differs from ASG scaling scheduling primarily because it targets ECS services directly and uses AWS Lambda for more granular control. While Auto Scaling Groups (ASG) scheduling allows you to define scaling actions based on time schedules, it operates at the EC2 level, impacting all services running on those instances.

In contrast, the approach using Lambda and EventBridge provides specific control over ECS tasks, allowing us to start or stop individual services as needed without affecting others.

Kishor Kumbhar

Understood,

So this solution will scale in the minimum number of tasks running in the service to zero, resulting in graceful termination of the tasks and eventually the underlying EC2 resources, is that right?

If you are running stateful services, yeah, this makes sense. But for stateless services, scaling policies with ASG scheduling work the same, IMO.