The Problem
Over the past few months, I have been tracking down a bug caused by the PTY package that the Buildkite agent depends on. The following are the symptoms:
Symptom 1 - operation not permitted for git checkout:
2024-05-20 11:15:52 PDT $ git clone -v -- ssh...
...
...
2024-05-20 11:15:52 PDT 🚨 Error: Failed to checkout plugin docker: error running "/usr/bin/git checkout -f e9efccb": error starting pty: fork/exec /usr/bin/git: operation not permitted
2024-05-20 11:15:52 PDT Running global pre-exit hook
Symptom 2 - operation not permitted for git clean:
2024-05-21 09:28:12 PDT ⚠️ Warning: Checkout failed! cleaning repository post-checkout: error running "/usr/bin/git clean -ffxdq": error starting pty: fork/exec /usr/bin/git: operation not permitted (Attempt 1/3 Retrying in 2s)
Symptom 3 - operation not permitted fork/exec bash:
2024-05-20 10:37:40 PDT 🚨 Error: error running "/bin/bash -e -c \"trap 'kill -- $$' INT TERM QUIT; if [ -n \\\"$(git diff HEAD~1 --exit-code 'cruise/ai_platform/search')\\\" ]; then\\n cat <<- ...\"": error starting pty: fork/exec /bin/bash: operation not permitted
The Investigation
After further investigation, I made the following observations:
- It doesn't happen consistently for a single step. If a step fails once, it is likely to pass on the next attempt.
- It happens at low frequency. At Cruise, we run over 100k build steps on a daily basis, and the error occurs fewer than 10 times a day.
- It happens across different pipelines; combined with the numbers above, that is a failure rate of roughly 0.01%.
- A git operation can fail even after a previous git operation succeeded (e.g. git clone in an earlier hook within the same step worked without the permission issue).
- It has nothing to do with the +x flag on the binary being executed; both git and bash have the proper permission bits set.
- The machine state and installed packages are not mutated by the CI job in any malicious way. This is guaranteed by how we run CI infrastructure at Cruise: each time a CI job is created, we spin up a brand new ephemeral VM to handle it, and the VM is terminated right after the job finishes. This guarantees the machine is not messed up.
(If you are interested in learning how we design ephemeral CI infrastructure, please comment below :D )
Given all the observations above, I came to believe the bug is not specific to any build step or any pipeline. Hence the bug likely lives somewhere at the VM level. The next thing on my radar was the Buildkite agent binary itself.
The Buildkite Agent
I upgrade the Buildkite agent from time to time, and I had recently upgraded it from v3.36.1 to v3.59.0. Our logs showed that the bug started appearing after the v3.59.0 upgrade, so it had to lie somewhere in the diff between the two versions. I was getting close to the truth.
Buildkite Support is very helpful
After reporting the problem to Buildkite support and working with their support engineers, we decided to compare the strace output between agent versions v3.36.1 and v3.59.0. There is one specific line that Buildkite support pointed out as interesting:
ioctl(0, TIOCSCTTY, 1) = -1 EPERM (Operation not permitted)
According to the Buildkite support:
By default the agent runs the command process inside a pseudo-terminal, which leads to this particular ioctl call. A possible workaround would be to disable PTY mode (buildkite-agent start --no-pty or BUILDKITE_NO_PTY=1).
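To make this more concrete, here is a minimal sketch (my own illustration, not the agent's actual code) of running a command under a pseudo-terminal with github.com/creack/pty, the same package the agent uses. The child process acquires the pty as its controlling terminal, which is where the TIOCSCTTY ioctl in the strace output comes from.

```go
package main

import (
	"io"
	"log"
	"os"
	"os/exec"

	"github.com/creack/pty"
)

func main() {
	cmd := exec.Command("git", "status")

	// pty.Start runs the command with the pty slave as its stdin/stdout/stderr
	// and as its controlling terminal. An EPERM during that setup comes back
	// from Start and, wrapped by the agent, reads like
	// "error starting pty: fork/exec /usr/bin/git: operation not permitted".
	ptmx, err := pty.Start(cmd)
	if err != nil {
		log.Fatalf("error starting pty: %v", err)
	}
	defer ptmx.Close()

	// Stream the command's output from the pty master back to our stdout.
	io.Copy(os.Stdout, ptmx)
	cmd.Wait()
}
```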
Then I tested turning off the PTY. Unfortunately, this is not a backward-compatible change we can easily adopt at the infrastructure level, because workloads that depend on the PTY would break, e.g. docker run -ti.
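To see why, here is a small hypothetical check (my illustration using golang.org/x/term, not anything from the agent) showing what TTY-dependent workloads effectively test for. With PTY mode disabled, the step's stdin is a plain pipe rather than a terminal, so commands like docker run -ti refuse to attach.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/term"
)

func main() {
	// With the agent's PTY mode enabled, the job's stdin is a pty slave and
	// this reports true; with --no-pty it is a plain pipe and reports false.
	if term.IsTerminal(int(os.Stdin.Fd())) {
		fmt.Println("stdin is a TTY: interactive workloads can run")
	} else {
		fmt.Println("stdin is not a TTY: `docker run -ti` style commands will fail")
	}
}
```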
So the remaining option was to fix the PTY itself. More digging showed that the github.com/creack/pty package is what the Buildkite agent depends on to set up the PTY. Agent v3.36.1 used pty v1.1.12, while Agent v3.59.0 used v1.1.20. The latest release is v1.1.21, which is used in Agent v3.60.0 and later, and its release notes mention a recent fix for a race condition on Linux.
Testing Buildkite Agent v3.60.0
The immediate next step was to validate the fix in the PTY package by upgrading the Buildkite agent to v3.60.0. Although the upgrade helped, it did not fully resolve the issue. As shown in the graph below, the frequency dropped from ~10 times per day to <1 time per day, but the error still happens.
Finding the breaking change
The question then became which version of the PTY package introduced the bug. To find the breaking change, I did a binary search over the PTY package versions. We knew v1.1.12 works without the issue and v1.1.21 does not, so the bug had to lie in that range, which spans about 10 releases. I tested v1.1.17 and v1.1.19 and could not reproduce the problem with either, so we concluded that the breaking change landed in v1.1.20.
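If you want to probe a given pty version yourself, a stress harness along these lines is one way to hunt for the intermittent EPERM (a hypothetical sketch, not the exact test setup we used; in practice we validated each version against real CI traffic):

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"

	"github.com/creack/pty"
)

func main() {
	const runs = 10000
	failures := 0

	for i := 0; i < runs; i++ {
		cmd := exec.Command("/bin/true")
		ptmx, err := pty.Start(cmd)
		if err != nil {
			// Count the intermittent EPERM we saw in production.
			if errors.Is(err, syscall.EPERM) {
				failures++
			}
			continue
		}
		cmd.Wait()
		ptmx.Close()
	}

	fmt.Printf("%d/%d pty starts failed with EPERM\n", failures, runs)
}
```

Swapping the pty version pinned in go.mod between runs is how one would bisect the releases.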
What we did was simply patch the OSS Buildkite agent's go.mod to downgrade the PTY package to v1.1.19. We also noticed that every Buildkite agent v3.58.0+ uses PTY v1.1.20+, so if you are running into the same issue, downgrading the PTY package is likely what you need.
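For reference, the patch is essentially a couple of lines in the agent's go.mod. A sketch, with the go directive and every other dependency omitted for brevity:

```
module github.com/buildkite/agent/v3

// Pin the PTY package back to the last known-good release.
require github.com/creack/pty v1.1.19

// Force v1.1.19 even if another dependency asks for a newer version.
replace github.com/creack/pty => github.com/creack/pty v1.1.19
```

After editing, go mod tidy and a rebuild of the agent binary pick up the downgraded package.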
Cheers
Updates
2024-05-23: Buildkite has already fixed the issue in the main branch.