Friedrich Kurz

GitLab CI/CD Runner Clean-up with Pre-build Scripts

Introduction

GitLab CI/CD has a powerful but somewhat under-documented pre-build script feature that allows us to execute custom logic before builds are run on a GitLab runner. We’ll have a look at how to utilize the pre-build script to automate Docker system clean-up on GitLab runners.

GitLab runners and the problem of automated clean-up

Imagine we work on a project using GitLab CI/CD with a set of demanding criteria such as

  • high degree of automation,
  • need for high runner uptime,
  • fast build execution, and
  • frequent use of Docker.

Particularly because there is no downtime available for clean-up and maintenance, we may eventually run into the well-known problem of GitLab runners running out of disk space.

Unfortunately, finding a general solution for GitLab runner clean-up is not particularly easy, as indicated by the existence of this, to date, unresolved issue on the GitLab runner issue tracker. For instance, if we simply clean up all Docker resources after each build, we likely won't run out of disk space. However, our build times would be much higher because Docker could no longer leverage its build cache.

Meanwhile, GitLab's documentation recommends running the clear-docker-cache script once a week via cron as a workaround. The cron approach is fairly simple and slows down our builds less frequently. On the flip side, however, we now have to provide our runners with sufficient disk space for a full week (or whatever interval the cron job runs on), which might be excessive and is hard to guess correctly.
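For reference, that workaround could look like the following crontab entry; a minimal sketch, assuming a package-based gitlab-runner installation that places the script at /usr/share/gitlab-runner/clear-docker-cache (verify the path on your runner):

# Run GitLab's Docker clean-up script every Sunday at 04:00.
0 4 * * 0 /usr/share/gitlab-runner/clear-docker-cache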


As noted in the unresolved issue that I mentioned earlier, GitLab's suggested way of managing disk space also has at least two more problems:

  1. it only addresses images, and
  2. it is indiscriminate, meaning that it may end up cleaning up frequently used images as well.

This is problematic for a couple of reasons. First of all, other cached Docker resources—like volumes and containers—are not targeted by the clean-up script. Moreover, builds may slow down because frequently used images get cleaned up and have to be rebuilt. Finally, since the script is run by cron, there is also the intricate problem of race conditions between the clean-up script and build jobs: because cron runs asynchronously with job execution, the clean-up may inadvertently break pipelines by removing images that some jobs still depend on. ^[For example if we build our images and push them to our registry in separate jobs.] Running the clean-up script only once a week mitigates this, but developers will have to keep in mind that pipelines may fail once in a while due to missing images.
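To make that race condition concrete, here is a hypothetical two-stage pipeline sketch (the image and registry names are made up) in which a cron-based clean-up running between the two jobs would remove the image that the push job expects to find in the runner's local Docker cache:

# Hypothetical .gitlab-ci.yml: build and push happen in separate jobs.
# If a cron-based clean-up prunes images between the two jobs, the push job
# fails because my-image:latest no longer exists on the runner.
stages:
  - build
  - push

build-image:
  stage: build
  script:
    - docker build -t my-image:latest .

push-image:
  stage: push
  script:
    - docker tag my-image:latest registry.example.com/my-group/my-image:latest
    - docker push registry.example.com/my-group/my-image:latest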

Determining Docker disk usage and cleaning up Docker cache

So, what to do? First of all, let's look at what tools are available to determine Docker disk usage as well as trigger clean-up of resources that occupy disk space.

To get an indication of the current disk usage of Docker, we can run docker system df. Here's an example output from my local machine:



$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          2         2         216.6MB   0B (0%)
Containers      2         2         84.83kB   0B (0%)
Local Volumes   5         5         506.7MB   0B (0%)
Build Cache     0         0         0B        0B



As we can see, Docker helpfully lists the disk space taken up by each of its resource classes.

With some light BASH acrobatics, we can work this into a test that tells us whether Docker's disk space usage is above a given limit (see is_docker_disk_space_usage_above_limit.sh).



#!/bin/bash
# --
# Test if Docker daemon disk space usage is above a given limit in bytes.
#
# Example 1: Docker disk space usage is above limit 
#
#    $ source is_docker_disk_space_usage_above_limit.sh
#    $ is_docker_disk_space_usage_above_limit 1
#    Docker disk space usage is above limit (actual: 1050000005B, limit: 1B)
#    $ printf $?
#    0
#
# Example 2: Docker disk space usage is below or equal to limit
#
#    $ source is_docker_disk_space_usage_above_limit.sh
#    $ is_docker_disk_space_usage_above_limit 1000000000000
#    Docker disk space usage is below or equal to limit (actual: 1050000005B, limit: 1000000000000B)
#    $ printf $?
#    1
# 
# --
iec_string_to_bytes() {
  local iec_string=$1
  # \s is not supported in POSIX extended regular expressions, so use a character class.
  local iec_format_pattern='([0-9.]+)[[:space:]]*([kMGTP]?B)'

  if ! [[ "${iec_string}" =~ ${iec_format_pattern} ]]; then
    printf "Input string has invalid format (received: \"%s\", expected: \"%s\")." "$1" "${iec_format_pattern}"
    return 1
  fi 

  local number_value="${BASH_REMATCH[1]}"
  local iec_unit="${BASH_REMATCH[2]}"
  local factor=""

  case "${iec_unit}" in 
    B) factor=1;;
    kB) factor=1000;;
    MB) factor=1000000;;
    GB) factor=1000000000;;
    TB) factor=1000000000000;;
    PB) factor=1000000000000000;;
  esac

  # We use scale=0 here to drop the (redundant) decimal points.
  # This only works with division so we divide by one.
  printf "scale=0;%s * %s/1\n" "${number_value}" "${factor}" \
    | bc 
}

calculate_docker_total_disk_space_usage() {
  local bc_expression="0"
  local disk_space_used="$(docker system df --format='{{.Size}}' | tr '\n' ' ')"

  # Rely on word splitting to iterate over the individual size values.
  # shellcheck disable=SC2086
  for disk_space_used_by_resource in ${disk_space_used}; do
    disk_space_used_by_resource_bytes="$(iec_string_to_bytes "${disk_space_used_by_resource}")"
    bc_expression="${bc_expression} + ${disk_space_used_by_resource_bytes}"
  done

  printf "%s\n" "${bc_expression}" \
    | bc
}

is_docker_disk_space_usage_above_limit() {
  local disk_space_limit=$1
  local docker_disk_space_used="$(calculate_docker_total_disk_space_usage)"
  local docker_disk_space_usage_is_above_limit="$(printf '%s > %s\n' "${docker_disk_space_used}" "${disk_space_limit}" | bc -l)"

  # Note that bc returns 1 if the comparison is true and 0 otherwise.
  if [ "${docker_disk_space_usage_is_above_limit}" -eq 1 ]; then 
    printf "Docker disk space usage is above limit (actual: %sB, limit: %sB)" "${docker_disk_space_used}" "${disk_space_limit}"
    return 0
  fi 

  printf "Docker disk space usage is below or equal to limit (actual: %sB, limit: %sB)" "${docker_disk_space_used}" "${disk_space_limit}"
  return 1
}




Schematically, we can run this script in BASH as follows to trigger our clean-up logic:



# Check if docker disk space usage is above a given limit and run clean-up logic if it is
if is_docker_disk_space_usage_above_limit "${docker_disk_space_usage_limit}"; then
  # run clean up
fi



If Docker uses too much disk space, we may then proceed to remove unused resources via the Docker CLI. The simplest way to do this is to run docker system prune -af --volumes, which cleans up all unused images, stopped containers, unused networks, the build cache, and unused volumes.
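A minimal sketch of that approach:

# Remove all unused images, stopped containers, unused networks, the build cache, and unused volumes.
docker system prune -af --volumes

# Optionally verify that the space has been reclaimed.
docker system df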

In case we need more elaborate clean-up logic, the Docker CLI also has individual prune commands for images, containers, networks, and volumes, all of which support filters. E.g. docker image prune can be used to only clean up images that are older than 24 hours by running



docker image prune -a --force --filter "until=24h"



Runner hooks to the rescue

Now that we have a way to find out whether Docker is running out of disk space and a way to trigger clean-up, we can think about when to run our logic. Per the requirements of our GitLab CI/CD set-up—as stated earlier—we want to run

  • custom clean up logic,
  • pre-emptively (to avoid runner failures due to lack of disk space), and
  • in sync with pipeline execution (to prevent randomly failing pipelines).

Luckily, GitLab CI/CD provides a couple of script hooks that let us execute code at various stages of pipeline execution: specifically, pre-clone, post-clone, pre-build, and post-build (see The [[runners]] section in Advanced Configuration).

Both the pre-build and the post-build hooks make sense in our scenario as both

  • run synchronously with pipeline jobs (either before or after a job),
  • allow us to clean up resources pre-emptively (either before the current job or before the next one), and
  • provide a mechanism to define custom clean-up logic.

We choose the pre-build hook here.

To register our pre-build script, we have to configure our GitLab runners using their configuration file like so:



# /etc/gitlab-runner/config.toml
# ...
[[runners]]
  # ...
  pre_build_script = '''
    # execute clean-up script
  '''



Having added the pre_build_script property, our GitLab runners will now execute our clean-up script before each job.

This is unfortunately not a perfect, general solution either—as will be discussed later in Prerequisites and limitations of the pre-build script approach—but let's look at how to implement the pre-build clean-up technique first.

A quick test drive

Installing Docker and gitlab-runner

To test our setup, we will install and configure GitLab runner on an AWS EC2 instance. (Obviously, any other similar cloud infrastructure-as-a-service solution would work too.) GitLab runner binaries are available for multiple platforms; we pick an Ubuntu 20.04 machine here to use the Linux binaries.

After logging into our EC2 instance



ssh -i "${key_pair_pem_file}" "${ec2_user}@${ec2_instance_address}"



we first want to install Docker, e.g. using the official convenience install script:



curl -fsSL https://get.docker.com -o /home/ubuntu/get-docker.sh
sudo sh /home/ubuntu/get-docker.sh



⚠️ Note that using the convenience script is not recommended for production environments, nor is it generally a good idea to execute a downloaded script file with sudo. But since we trust the source here and are only running a test, it's not a big deal.

Verify that the Docker installation succeeded by running e.g. docker --version:



$ docker --version
Docker version 20.10.16, build aa7e414



Now let's execute the script shown below to install gitlab-runner. ^[The GitLab runner installation script is also available from Settings > CI/CD > Runners > Specific runners of a GitLab project for reference.]



sh <<EOF
  # Download the binary for your system
  sudo curl -L --output /usr/local/bin/gitlab-runner https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64

  # Give it permission to execute
  sudo chmod +x /usr/local/bin/gitlab-runner

  # Create a GitLab Runner user
  sudo useradd --comment 'GitLab Runner' --create-home gitlab-runner --shell /bin/bash

  # Install and run as a service
  sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
  sudo gitlab-runner start
EOF



And then let's register our runner with GitLab by running



sudo gitlab-runner register 



The gitlab-runner tool will guide us through the process and ask for some configuration details. We have to enter the URL of our GitLab instance (e.g. https://gitlab.com for the public GitLab) and a registration token (that can be copied from Settings > CI/CD > Runners > Specific runners). We moreover pick the Docker executor (docker) since it is—at least in my experience—the most common one as well as the docker:20.10.16 image to be able to run Docker builds within our pipeline jobs. ^[Full disclosure, I tried the SSH executor for simplicity's sake but was not able to make it work due to connection problems.] Also, when prompted for tags, we enter gl-cl to be able to run our pipeline jobs on exactly this machine.
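For reference, the same registration can also be done non-interactively by passing the answers as flags; a sketch with placeholder values for the registration token and description:

sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com" \
  --registration-token "<registration-token>" \
  --executor "docker" \
  --docker-image "docker:20.10.16" \
  --tag-list "gl-cl" \
  --description "pre-build-clean-up-test-runner"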

At the end of the registration process, we should see the following confirmation that our runner has been registered.

Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

As described in Advanced configuration, the configuration file is stored in /etc/gitlab-runner/config.toml on Unix systems. We will add the runners.pre_build_script and runners.docker.volumes properties shown below.



# /etc/gitlab-runner/config.toml
# ...
[[runners]]
  # ...
  pre_build_script = '''
    sh $CLEAN_UP_SCRIPT     
  '''
  # ...
  [runners.docker]
    # ...
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]



The pre_build_script property uses a little trick: it simply executes the script whose file path is stored in CLEAN_UP_SCRIPT, which we will later add as a GitLab CI/CD variable of type file. By doing that, we can change and test our clean-up script without having to connect to our runner. The volumes property mounts the host machine's Docker socket into our build containers so that we clean up resources on the host machine rather than only in the container (Docker in Docker via Docker socket binding).

Configuring our GitLab CI/CD pipeline

For this to work, the file variable CLEAN_UP_SCRIPT has to be defined in the Settings > CI/CD > Variables section of our project, as shown below. Let's add the following clean-up script for now.



set -eo pipefail
apk update 
apk upgrade 
apk add bash curl
curl https://gist.githubusercontent.com/fkurz/d84e5117d31c2b37a69a2951561b846e/raw/a39d6adb1aaede5df2fc54c1882618bcea9f01e0/is_docker_disk_space_usage_above_limit.sh > /tmp/is_docker_disk_space_above_limit.sh
bash <<EOF || printf "\nClean-up failed."
  source /tmp/is_docker_disk_space_above_limit.sh
  if is_docker_disk_space_usage_above_limit 2000000000; then
    printf "\nRunning clean up...\n"
    docker system prune -af --volumes
  else 
    printf "\nSkipping clean up...\n"
  fi 
EOF



Make sure to select File as Type when defining the variable.

(Screenshot: the CLEAN_UP_SCRIPT variable defined with type "File" under Settings > CI/CD > Variables.)

Note that we install bash and curl in the pre-build script for simplicity's sake, which means both tools are installed before every job that is processed on this runner. In a real scenario, we'd naturally want to provide a custom image that already has all the required tools installed to speed up the pre-build script's execution.
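A minimal sketch of such an image, assuming we base it on the docker:20.10.16 image we registered the runner with (the image name and registry are up to you):

# Dockerfile: the Docker CLI image plus the tools our pre-build script needs,
# so bash and curl don't have to be installed before every job.
FROM docker:20.10.16
RUN apk add --no-cache bash curl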

Now let's add a sample .gitlab-ci.yml to our project which builds a large (one gigabyte) image and will therefore eventually trigger clean-up. (The code is available on GitLab.)



stages:
  - build

build-job:
  stage: build
  tags:
    - gl-cl
  script:
    - echo "Generating random nonsense..."
    - ./scripts/generate-random-nonsense.sh
    - echo "Building random nonsense image..."
    - ./scripts/build-random-nonsense-image.sh



To limit our pipelines to our new runner, we use tag selectors and pick our previously registered runner via the gl-cl tag. Now we may finally run our pipeline a couple of times to see the effect of our pre-build script. Depending on the runner's disk size, we should see a couple of jobs without clean-up, eventually followed by a run whose log contains something similar to this:

Docker disk space usage is above limit (actual: 2361821460B, limit: 2000000000B)
Running clean up...
Deleted Containers:
9a126aead174a15a4f76f2cb5744e36aff30741cc6ab0ac5044837aaee946496
2fa51dc3e7f5b1e3bec63a635c266552f9b02eb74015a9683f7cbf13418a12eb

Deleted Images:
untagged: registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-febb2a09
untagged: registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper@sha256:edc1bf6ab9e1c7048d054b270f79919eabcbb9cf052b3e5d6f29c886c842bfed
deleted: sha256:c20c992e5d83348903a6f8d18b4005ed1db893c4f97a61e1cd7a8a06c2989c40
deleted: sha256:873201b44549097dfa61fa4ee55e5efe6e8a41bbc3db9c6c6a9bfad4cb18b4ea
untagged: random-nonsense-image-1653227274:latest
deleted: sha256:67fde47d8b24ee105be2ea3d5f04d6cd0982d9db2f1c934b3f5b3675eb7a626f
deleted: sha256:1a310f85590c46c1e885278d1cab269f07033fefdab8f581f06046787cd6156e
untagged: alpine:latest
untagged: alpine@sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454
untagged: random-nonsense-image-1653226909:latest
deleted: sha256:b5923f3fb6dd2446d18d75d5fbdb4d35e5fca888bd88aef8174821c0edfcb87f
deleted: sha256:59150b0202d2d5f75ec54634b4d8b208572cbeec9c5519a9566d2e2e6f2c13f3
deleted: sha256:0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b

Total reclaimed space: 2.059GB

This output indicates that our pre-build script was executed once Docker's disk space usage exceeded 2GB and that clean-up was triggered successfully (freeing, in this case, roughly 2GB of disk space).

Prerequisites and limitations of the pre-build script approach

It's probably easy to see that our pre-build script approach fulfills the requirements we laid out for it: it

  • runs synchronously before pipeline jobs,
  • pre-emptively cleans up unused resources, and
  • allows us to provide custom clean-up logic.

Nonetheless, there are still a few limitations left to consider.

First of all, bash must be available during pre-build script execution if we want to use the is_docker_disk_space_usage_above_limit.sh script because it uses some BASHisms. Moreover, since we use the Docker executor, we need some kind of runner image that has the Docker CLI installed (such as the official Docker base image we used in our test earlier). Providing a custom base image for our pipelines, as discussed above, takes care of both points and reduces their severity, but it's still something that has to be addressed.

Another thing to keep in mind is that Docker's reported disk space usage is only an approximation of (more precisely, a lower bound on) the machine's overall disk space usage. Consequently, we have to find a good value for the limit that triggers our clean-up logic so that the machine doesn't run out of disk space anyway. For example, on a runner with a 30GB disk where the operating system and runner installation already occupy around 10GB, the limit needs to sit well below the remaining 20GB to leave headroom for whatever a job creates before the next clean-up check runs.

Also, it may still be tricky to pick the right clean-up logic. For instance, if we just run docker system prune -af --volumes as in our test, we may delete images that are required by subsequent jobs in more complex pipelines. Excluding certain images from clean-up—for instance, those built in the last 24 hours—may alleviate this particular problem. However, more complicated pipelines will likely need more elaborate clean-up logic.
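As a sketch, a slightly more conservative clean-up could combine the individual prune commands with filters instead of pruning the whole system:

# Hypothetical, more selective clean-up: only remove resources that are
# older than 24 hours instead of wiping everything.
docker container prune --force --filter "until=24h"
docker image prune -a --force --filter "until=24h"
docker network prune --force --filter "until=24h"
# Volume prune does not support the "until" filter, so either prune volumes
# unconditionally or skip this step if volumes must be preserved.
docker volume prune --force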

Lastly, there are still edge cases where our runners will run out of disk space even with the pre-build clean-up script approach. For instance, if the limit is set too high, a runner might still run out of disk space because a single job produces more data than the remaining free space can hold.

Summary

As we've seen, we can use GitLab CI/CD's pre-build script hook to clean up GitLab runners in sync with job execution, pre-emptively to avoid breaking pipelines, and with custom clean-up logic. That being said, the pre-build script clean-up approach is not perfect, because it cannot avoid all situations where a runner runs out of disk space. Nonetheless, I think it is still a more elegant way to handle clean-up of GitLab runners than maintenance downtimes or the cron job approach.
