Benoit COUETIL 💫 for Zenika


🦊 GitLab CI: Deploy a Majestic Single Server Runner on AWS

Initial thoughts

In GitLab CI: The Majestic Single Server Runner, we found that a single server runner outperforms a Kubernetes cluster with equivalent node specifications up to approximately 200 jobs requested simultaneously, which is beyond the typical daily load of most software teams. Equally important, with 40 queued jobs or fewer, the single server runner is twice as fast. This scenario is common, even on the busiest days, for most teams.

This article will help you deploy this no-compromise runner on AWS at a reasonable price, thanks to multiple optimizations. Much of it applies to any cloud, public or private.

The deployment is automated and optimized as much as possible:

  • Infrastructure is provisioned with Terraform
  • A spot instance is used
  • EC2 is stopped at night and on weekends
  • The EC2 boot script (re)installs everything and registers the runner with GitLab
  • The runner is tagged with a few interesting EC2 characteristics

1. The right EC2 instance at the right price

An AWS spot instance is a cost-effective option that allows you to leverage spare EC2 capacity at a discounted price. By choosing spot instances, you can significantly reduce your Amazon EC2 costs. Since our deployment is automated and downtime is not critical, opting for spot instances is an optimal choice for cost optimization.

To fully utilize the capabilities of a single server runner while keeping costs reasonable, it is essential to select an EC2 instance with a local NVMe SSD disk. These instances are identified by the 'd' in their name, indicating that they are disk-optimized.

When choosing an EC2 instance, the following conditions should be considered:

  • The instance should have the 'd' letter to indicate NVMe local disk support.
  • It should be available in our usual region.
  • The CPU specifications should match our usage requirements. For CI/CD of Java/JavaScript applications, about 1 core per parallel job is a good rule of thumb. We choose here 16 vCPUs for 20 parallel jobs.
  • The spot price should be reasonable.

For the purpose of this article, we have selected the r5d.4xlarge instance type. At the time of writing, the spot price for this instance in us-east-1 is approximately $370/month. It might seem high to you.

But when compared to the monthly cost of our development team, this price is relatively low. We can further optimize costs by automatically stopping the EC2 instance outside of working hours using daily CloudWatch executions. Since it is a local disk instance, the state will be lost every day, but we have nothing to lose except some cache, which can be warmed up with a scheduled pipeline every morning.
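The morning cache warm-up can be wired as a scheduled pipeline job. Here is a minimal sketch for a Maven project; the job name, image, and cache paths are illustrative assumptions to adapt to your stack:

```yaml
# .gitlab-ci.yml (hypothetical warm-up job; names and paths to adapt)
warm-cache:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"  # run only from the pipeline schedule
  image: maven:3-eclipse-temurin-17          # example image for a Java project
  variables:
    MAVEN_OPTS: "-Dmaven.repo.local=$CI_PROJECT_DIR/.m2/repository"
  script:
    - mvn --batch-mode dependency:go-offline # pre-download dependencies into the cache
  cache:
    key: maven-repo
    paths:
      - .m2/repository/
```

Pair this job with a pipeline schedule set shortly after the instance's start time, so the first real jobs of the day find a warm cache.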

Let's calculate the cost: $0.5045/hour x 12 open hours per day x 21 open days per month = $127/month. This brings the cost well below the already acceptable full-month price. To put it into perspective, this represents an 85% discount compared to running the same instance full-time on-demand ($841/month).
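As a quick check, the same arithmetic in a shell one-liner (using the spot price quoted above):

```shell
# Spot price for r5d.4xlarge in us-east-1 at the time of writing (check current prices)
HOURLY=0.5045
# 12 open hours per day, 21 open days per month
awk -v h="$HOURLY" 'BEGIN { printf "monthly cost: $%.0f\n", h * 12 * 21 }'
# monthly cost: $127
```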


2. Scripting the GitLab runner installation and configuration

To streamline the process of deploying the EC2 instance, we will create a script that can be used as the user_data to bootstrap the server anytime it (re)boots. This script will handle the installation of Docker, the GitLab Runner, and the configuration required to connect to the GitLab instance.

The script is designed to handle reboots and stop/start actions, which may result in the deletion of local disk data on the NVMe EC2 instance.
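The disk-selection logic used in the script can be previewed on sample `lsblk` output; device names and sizes below are made up for illustration:

```shell
# Sample `lsblk -b --output=NAME,SIZE` output: a small root disk plus two large
# ephemeral NVMe disks (values illustrative)
SAMPLE="NAME          SIZE
nvme0n1   8589934592
nvme1n1 300000000000
nvme2n1 300000000000"

# Same filter as in the script: NVMe devices bigger than ~100 GB, sorted by name
NVME_DISK_LIST=$(echo "$SAMPLE" | grep "^nvme" | awk '{if($2>100000000000)print$1}' | sort)

# First disk for the GitLab custom cache, last one for Docker data;
# with a single large disk, both variables point to the same device
echo "GitLab cache disk: $(echo "$NVME_DISK_LIST" | head -n 1)"
echo "Docker data disk:  $(echo "$NVME_DISK_LIST" | tail -n 1)"
# GitLab cache disk: nvme1n1
# Docker data disk:  nvme2n1
```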

Make sure to modify the following variables at the start of the script according to your specific requirements:

aws-ec2-init-nvme-and-gitlab-runner.sh

#!/bin/bash
#
### Script to initialize a GitLab runner on an existing AWS EC2 instance with NVME disk(s)
#
# - script is not interactive (can be run as user_data)
# - will reboot at the end to perform NVME mounting
# - first NVME disk will be used for GitLab custom cache
# - last NVME disk will be used for Docker data (if only one NVME, the same will be used without problem)
# - robust: on each reboot and stop/start, disks are mounted again (but data may be lost if stop and then start after a few minutes)
# - runner is tagged with multiple instance data (public dns, IP, instance type...)
# - works with a single spot instance
# - should work even with multiple ones in a fleet, with same user_data (not tested for now)
#
# /!\ There is no prerequisite, except these needed variables:
MAINTAINER=zenika
RUNNER_NAME="majestic-runner"
GITLAB_URL=https://gitlab.com/
GITLAB_TOKEN=XXXX

# prepare docker (re)install
sudo apt-get -y install apt-transport-https ca-certificates curl gnupg lsb-release sysstat
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update # needed to use the docker.list

# install gitlab runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get -y install gitlab-runner

# create NVME initializer script
cat <<EOF >/home/ubuntu/nvme-initializer.sh
#!/bin/bash
#
# To be run on each fresh start, since NVME disks are ephemeral
# so first start, start after stop, but not on reboot
# inspired by https://stackoverflow.com/questions/45167717/mounting-a-nvme-disk-on-aws-ec2
#

date | tee -a /home/ubuntu/nvme-initializer.log

### Handle NVME disks

# get NVME disks bigger than 100 GB (a small root disk may also be present, depending on the server type)
NVME_DISK_LIST=\$(lsblk -b --output=NAME,SIZE | grep "^nvme" | awk '{if(\$2>100000000000)print\$1}' | sort)
echo "NVME disks are: \$NVME_DISK_LIST" | tee -a /home/ubuntu/nvme-initializer.log

# there may be 1 or 2 NVME disks, then we split (or not) the mounts between GitLab custom cache and Docker data
export NVME_GITLAB=\$(echo "\$NVME_DISK_LIST" | head -n 1)
export NVME_DOCKER=\$(echo "\$NVME_DISK_LIST" | tail -n 1)
echo "NVME_GITLAB=\$NVME_GITLAB and NVME_DOCKER=\$NVME_DOCKER" | tee -a /home/ubuntu/nvme-initializer.log

# format disks if not already formatted
sudo mkfs -t xfs /dev/\$NVME_GITLAB | tee -a /home/ubuntu/nvme-initializer.log || echo "\$NVME_GITLAB already formatted" # this may already be done
sudo mkfs -t xfs /dev/\$NVME_DOCKER | tee -a /home/ubuntu/nvme-initializer.log || echo "\$NVME_DOCKER already formatted" # disk may be the same, then already formatted by the previous command

# mount on /gitlab/ and /var/lib/docker/
sudo mkdir -p /gitlab
sudo mount /dev/\$NVME_GITLAB /gitlab | tee -a /home/ubuntu/nvme-initializer.log
sudo mkdir -p /gitlab/custom-cache
sudo mkdir -p /var/lib/docker
sudo mount /dev/\$NVME_DOCKER /var/lib/docker | tee -a /home/ubuntu/nvme-initializer.log

### reinstall Docker (which data may have been wiped out)

# docker (re)install
sudo apt-get -y reinstall docker-ce docker-ce-cli containerd.io docker-compose-plugin | tee -a /home/ubuntu/nvme-initializer.log

echo "NVME initialization successful" | tee -a /home/ubuntu/nvme-initializer.log

EOF

# set NVME initializer script as startup script
sudo tee /etc/systemd/system/nvme-initializer.service >/dev/null <<EOS

[Unit]
Description=NVME Initializer
After=network.target

[Service]
ExecStart=/home/ubuntu/nvme-initializer.sh
Type=oneshot
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

EOS

sudo chmod 744 /home/ubuntu/nvme-initializer.sh
sudo chmod 664 /etc/systemd/system/nvme-initializer.service
sudo systemctl daemon-reload
sudo systemctl enable nvme-initializer.service

sudo systemctl start nvme-initializer.service
sudo systemctl status nvme-initializer.service

# tail -f /var/log/syslog

### Runner registration at the end, so that the runner appearing on the GitLab side confirms the whole process completed

echo "gitlab-runner ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers

RUNNER_VERSION_DETAILS=$(sudo gitlab-runner --version)
### Example
# Version:      15.10.1
# Git revision: dcfb4b66
# Git branch:   15-10-stable
# GO version:   go1.19.6
# Built:        2023-03-29T13:01:22+0000
# OS/Arch:      linux/amd64

RUNNER_VERSION=$(echo "$RUNNER_VERSION_DETAILS" | grep -oP 'Version:\s+\K[\d\.]+')
RUNNER_VERSION_DATE=$(echo "$RUNNER_VERSION_DETAILS" | grep -oP 'Built:\s+\K.+')
RUNNER_OS_ARCH=$(echo "$RUNNER_VERSION_DETAILS" | grep -oP 'OS/Arch:\s+\K.+')

# 169.254.169.254 IP is always the same whatever the instance
# EC2 IP and hostname will change on AWS if VM is restarted but may not be elsewhere
RUNNER_TAGS="$MAINTAINER,$RUNNER_VERSION,$RUNNER_VERSION_DATE,$RUNNER_OS_ARCH,$(curl --silent http://169.254.169.254/latest/meta-data/instance-type),$(curl --silent http://169.254.169.254/latest/meta-data/instance-life-cycle),$(curl --silent http://169.254.169.254/latest/meta-data/public-ipv4),$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)" && echo $RUNNER_TAGS

# to start as paused (only if on-demand ec2): --paused
sudo gitlab-runner register --name "$RUNNER_NAME" --url "$GITLAB_URL" --registration-token "$GITLAB_TOKEN" --executor "docker" --docker-image "ubuntu:20.04" --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" --docker-volumes "/gitlab/custom-cache/:/host/" --run-untagged=true --custom_build_dir-enabled=true --tag-list "$RUNNER_TAGS" --docker-privileged --docker-pull-policy "if-not-present" --non-interactive

# replace "concurrent = 1" with "concurrent = 20"
sudo sed -i '/^concurrent /s/=.*$/= 20/' /etc/gitlab-runner/config.toml
# replace "check_interval = 0" with "check_interval = 2"
sudo sed -i '/^check_interval /s/=.*$/= 2/' /etc/gitlab-runner/config.toml
### from https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4036#note_1083142570
# replace the "/cache" technical volume with one mounted on disk, to avoid cache failures when several jobs run in parallel
# this could also have been a mounted Docker volume: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/1151#note_1019634818 but that is not faster when there are 2 different NVMe disks (gitlab + docker)
sudo sed -i 's#"/cache"#"/gitlab/cache:/cache"#' /etc/gitlab-runner/config.toml
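These final `sed` substitutions can be sanity-checked locally against a minimal sample of the `config.toml` that `gitlab-runner register` generates:

```shell
# Minimal sample of the relevant lines written by `gitlab-runner register`
cat > /tmp/config-sample.toml <<'EOF'
concurrent = 1
check_interval = 0
EOF

# Same substitutions as in the bootstrap script
sed -i '/^concurrent /s/=.*$/= 20/' /tmp/config-sample.toml
sed -i '/^check_interval /s/=.*$/= 2/' /tmp/config-sample.toml

cat /tmp/config-sample.toml
# concurrent = 20
# check_interval = 2
```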


3. Deploying the auto-stopping architecture with Terraform

To quickly deploy the architecture, we will be using Terraform. With Terraform, we can automate the deployment process and have our infrastructure up and running in minutes.

Before we proceed, please ensure that you have an existing VPC created as a prerequisite. You can refer to the examples provided in the official GitHub repo for guidance on creating the VPC.
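The configuration below references `module.vpc`. If you do not have one yet, here is a minimal sketch of such a VPC, assuming the community `terraform-aws-modules/vpc/aws` module; the name and CIDR ranges are illustrative:

```hcl
# Minimal VPC for the runner, assuming the terraform-aws-modules/vpc community module
# (name and CIDR ranges are illustrative)
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "gitlab-runner-vpc"
  cidr = "10.0.0.0/16"

  azs            = ["us-east-1a"]  # must match the instance availability_zone below
  public_subnets = ["10.0.1.0/24"]

  # public subnet with a public IP, so the runner reaches gitlab.com without a NAT gateway
  map_public_ip_on_launch = true
}
```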

Here is the gitlab-runner.tf file that contains the Terraform configuration:

################################################################################
# Gitlab Runner EC2 Spot instance (with security group)
################################################################################

resource "aws_security_group" "in-ssh-out-all" {
  name   = "in-ssh-out-all"
  vpc_id = module.vpc.vpc_id
  ingress {
    cidr_blocks = [
      "0.0.0.0/0"
    ]
    from_port = 22
    to_port   = 22
    protocol  = "tcp"
  } // Terraform removes the default rule
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_spot_instance_request" "gitlab-runner" {

  ami = "ami-04ab94c703fb30101" # us-east-1, Canonical, Ubuntu, 22.04 LTS, amd64 jammy build on 2024-01-26. Choose here: https://cloud-images.ubuntu.com/locator/ec2/

  instance_type = "r5d.4xlarge"

  key_name = "my-key" # create a key and put it here if you want to connect to your EC2 in SSH

  availability_zone      = "us-east-1a"             # sadly only one possible for now
  subnet_id              = module.vpc.public_subnets[0] # sadly only one possible for now
  vpc_security_group_ids = [aws_security_group.in-ssh-out-all.id]

  user_data = file("aws-ec2-init-nvme-and-gitlab-runner.sh")

  valid_until          = "2030-01-01T00:00:00Z"
  wait_for_fulfillment = true

  tags = merge(
    local.tags,
    {
      Scheduled = "working-hours"
    }
  )

}

# Stop runner nightly and start it daily on working days
# from https://github.com/popovserhii/terraform-aws-lambda-scheduler

module "runner-stop-nightly" {
  source      = "popovserhii/lambda-scheduler/aws"
  name        = "stop-runner"
  aws_regions = ["us-east-1"]

  cloudwatch_schedule_expression = "cron(0 20 ? * MON-SUN *)"
  schedule_action                = "stop"

  spot_schedule             = true
  ec2_schedule              = false
  rds_schedule              = false
  autoscaling_schedule      = false
  cloudwatch_alarm_schedule = false

  resource_tags = [
    {
      Key   = "Scheduled"
      Value = "working-hours"
    }
  ]
}

module "runner-start-daily" {
  source      = "popovserhii/lambda-scheduler/aws"
  name        = "start-runner"
  aws_regions = ["us-east-1"]

  cloudwatch_schedule_expression = "cron(0 08 ? * MON-FRI *)"
  schedule_action                = "start"

  spot_schedule             = true
  ec2_schedule              = false
  rds_schedule              = false
  autoscaling_schedule      = false
  cloudwatch_alarm_schedule = false

  resource_tags = [
    {
      Key   = "Scheduled"
      Value = "working-hours"
    }
  ]
}

The runner starts at 08:00, Monday to Friday, and stops at 20:00 every day (note the stop cron runs MON-SUN while the start cron runs MON-FRI). Feel free to change these schedules according to your requirements.

Once you have created and adapted the configuration, follow these steps:

  1. Run terraform init to initialize the Terraform configuration.
  2. Run terraform apply to apply the configuration and deploy the infrastructure.

With these commands, Terraform will handle the deployment process, and your autonomous architecture will be up and running in no time.


Illustrations generated locally by DiffusionBee using FLUX.1-schnell model

Further reading

This article was enhanced with the assistance of an AI language model to ensure clarity and accuracy in the content, as English is not my native language.
