DEV Community

Cover image for How to debug an infinite loop in Node.js production code
Jakub Drobník for Apify

Posted on • Originally published at blog.apify.com

5

How to debug an infinite loop in Node.js production code

Debugging infinite loops in Node.js code locally is easy — just launch the debugger, break the execution, see where your code is stuck, fix and you’re done. However, in production systems, this becomes much more complicated.

Node.js doesn’t have any out-of-the-box tool to break a running program and print its stack trace. So, when your production code suddenly peaks at 100% CPU, it’s tricky to find out where exactly it got stuck. Neither -prof nor -cpu-prof (native Node.js profiling tools provided with the V8 debugger) helped since the infinite loop in application code was caused non-deterministically.

At Apify, we had this type of problem in a production application running inside a stateless Kubernetes (K8s) container. The application is a simple express.js based web server. This article describes the solution that worked for us. Hopefully, it can also help you.

TL;DR — We used a script based on this GitHub gist, which attaches the GNU debugger (GDB) to Node.js processes to print the leaking code’s stack trace. We had to run the script with K8s’ livenessProbe check to get the stack trace and save it to a persistent volume.

Using the GDB debugger in the app container

As a Node.js developer with a basic knowledge of V8 and the underlying C++ code, you probably haven’t used GDB for debugging your Node.js applications. You probably have no use for it most of the time but in this specific case, GDB proved to be extremely useful.

GDB allows you to attach the debugger to a running Node.js process and set up a breakpoint in C++ where the infinite loop occurs. This place in V8 is called the stack guard and we got the idea to use it from this GitHub gist (it includes an explanation of the whole script if you need to know more).

With some basic knowledge of GDB and V8’s stack guard, you can reproduce the steps that cause the infinite loop and print the stack trace of your app’s code where it occurs. The code below attaches a breakpoint to the stack guard and prints the stack trace.

gdb -batch \
-ex "b v8::internal::Runtime_StackGuard" \
-ex "p 'v8::Isolate::GetCurrent'()" \
-ex "p 'v8::Isolate::TerminateExecution'(\$1)" \
-ex "c" \
-ex "p printf(\"FATAL ERROR: Application is not responding and will be killed\")" \
-ex "p 'v8::internal::Runtime_DebugTrace'(0, 0, (void *)(\$1))" \
attach ${PID}

You can easily test it by running a simple Docker container with GDB installed. First, run an infinite loop, then run the GDB command.

Below are the steps for testing it in your local terminal using Docker.

docker run -it --cap-add=SYS_PTRACE node:12 /bin/bash
apt-get update
apt-get install gdb
echo "function myLoop() { while(true); } myLoop(); " > my_loop.js
node my_loop.js &
PID=$(pidof node)
gdb -batch \
-ex "b v8::internal::Runtime_StackGuard" \
-ex "p 'v8::Isolate::GetCurrent'()" \
-ex "p 'v8::Isolate::TerminateExecution'(\$1)" \
-ex "c" \
-ex "p printf(\"FATAL ERROR: Application is not responding and will be killed\")" \
-ex "p 'v8::internal::Runtime_DebugTrace'(0, 0, (void *)(\$1))" \
attach ${PID}

After running these commands, your terminal should display myLoop function’s stack trace.

The printed stack trace in terminal

Update K8s deployment to use the GDB script

Now you know how to get the infinite loop’s stack trace, you can use it in the production container. First, add GDB to your Docker container. In this case, update the Dockerfile using the commands used in the test.

apt-get update
apt-get install gdb

Below is the Dockerfile for this scenario.

FROM node:12
WORKDIR /app
COPY ..
RUN apt-get update
RUN apt-get -y --allow-unauthenticated install gdb
RUN npm install && \
npm run build
EXPOSE 3002
CMD [ "npm", "start" ]
view raw Dockerfile hosted with ❤ by GitHub

Now you have GDB installed in your Docker container, you need to ensure that the GDB command will be attached in case of an infinite loop. As mentioned above, our loop was caused non-deterministically, so we used the liveness probe command to find it.

In our case, we had a basic HTTP liveness probe check set up. It checks the /health-check path every 5 seconds, allowing 3 failed attempts.

livenessProbe:
httpGet:
path: /health-check
port: 3002
failureThreshold: 3
initialDelaySeconds: 30
periodSeconds: 5
timeoutSeconds: 5

If this probe fails a 4th time, the K8s scheduler pronounces the container as dead and replaces it in the pool. This place in the container’s runtime where the container is pronounced as dead is the place where the GDB command will need to run.

You want to preserve the loop-causing behavior; however, if the health check fails, the GDB script should run and save the infinite loop’s stack trace into a specific file. The bash script below does exactly that.

HTTP_TIMEOUT_SECS=3 # timeout for the HTTP GET
URL="http://localhost:3002/health-check"
FAIL_COUNT=0
HEALTH_CHECK_INTERVAL_SECS=5
UNHEALTHY_THRESHOLD=3
while true; do
curl --fail --max-time ${HTTP_TIMEOUT_SECS} "${URL}" > /dev/null
CODE=$?
if [ $CODE -ne 0 ]; then
FAIL_COUNT=$((FAIL_COUNT+1))
echo "Health check failed (error: ${CODE}, failCount: ${FAIL_COUNT})"
else
echo "Health check passed"
FAIL_COUNT=0
exit 0
fi
if [ ${FAIL_COUNT} -ge ${UNHEALTHY_THRESHOLD} ]; then
echo "Health check failed, reach unhealthy threshold error: ${CODE}, try to print stack trace with gdb."
PID=$(pidof node)
echo "$PID"
# print stack trace from the Node.js process
gdb -batch \
-ex "b v8::internal::Runtime_StackGuard" \
-ex "p 'v8::Isolate::GetCurrent'()" \
-ex "p 'v8::Isolate::TerminateExecution'(\$1)" \
-ex "c" \
-ex "p printf(\"FATAL ERROR: Application is not responding and will be killed\")" \
-ex "p 'v8::internal::Runtime_DebugTrace'(0, 0, (void *)(\$1))" \
-p "${PID}" &> "/debugger/fatal-error-stack-$(date "+%Y.%m.%d-%H.%M.%S").txt"
sleep 1
kill "${PID}" | tee
exit $CODE
fi
sleep ${HEALTH_CHECK_INTERVAL_SECS}
done

This saves the script as liveness_probe.sh into your app’s root directory. You can see that the bash script does exactly the same as the HTTP liveness probe. However, if the health check fails 4 times, it runs the GDB command and prints the stack trace.

To use this script in our app, we needed to edit the liveness probe in the K8s deployment specification as shown below.

livenessProbe:
exec:
command:
- /bin/bash
- /app/liveness_probe.sh
failureThreshold: 1
initialDelaySeconds: 30
periodSeconds: 40
timeoutSeconds: 40

This ensures our health check script runs every 40 seconds, which is enough time to run HTTP probe 4 times every 5 seconds. But be careful: since we’re using a debugger here, we need to allow processes using process trace with the SYS_PTRACE flag.

We can do this using securityContext in K8s deployment.

securityContext:
capabilities:
add:
- SYS_PTRACE

Saving the stack trace file to a persistent volume

Once you are able to track and print the loop into a specific file, you need to ensure that the file will not be deleted after the restart. The application runs as stateless, so after the container restarts, you lose all the data in memory and storage.

To attach a persistent volume to your K8s pod, you can follow these steps. The attachable volume is a little different on each K8s-managed cluster. Our app uses the AWS Elastic Kubernetes Service (EKS), which is easily compatible with the Elastic File System (EFS).

You can do a very basic setup of EFS by running the command below.

aws efs create-file-system

From the output, you will need the FileSystemId property for further use. To attach EFS as a persistent volume to your EKS cluster, launch the Amazon EFS CSI Driver. After installing it, let your application know about it by creating a StorageClass K8s resource.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: myClass
provisioner: efs.csi.aws.com

Next, create a persistent volume and persistent volume claim.
Note: Use FileSystemId as volumeHandle.

apiVersion: v1
kind: PersistentVolume
metadata:
name: myVolume
spec:
capacity:
storage: 5Gi
volumeMode: Filesystem
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: myClass
csi:
driver: efs.csi.aws.com
volumeHandle: fs-1234567
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: myClaim
spec:
accessModes:
- ReadWriteMany
storageClassName: myClass
resources:
requests:
storage: 5Gi

Finally, mount the persistent volume claim to the deployment.

volumes:
- name: myDeploymentVolume
persistentVolumeClaim:
claimName: myClaim
...
volumeMounts:
- name: myDeploymentVolume
mountPath: /debugger

When the persistent volume is set up, use SSH to connect it to one of the app’s containers. The files containing stack traces will be in the debugger folder.

Conclusion

To summarize, our app had a non-deterministic infinite loop, which occurred only on production. We identified it by attaching the GNU debugger to the app’s Node.js processes, which allowed us to print the leaking code’s stack trace. We then ran Kubernetes’ livenessProbe check to get the stack trace and save it to a persistent volume.

In our case, the infinite loop was caused by a third-party package.

We hope you will find this article useful if you encounter an infinite loop in your Node.js application.

Additionally, we added a sidecar container into k8s cluster to sync stack trace files directly to AWS S3 bucket. If you are interested in how we did it, let us know in the comments, and we will describe it in a future blog post.

Image of Wix Studio

2025: Your year to build apps that sell

Dive into hands-on resources and actionable strategies designed to help you build and sell apps on the Wix App Market.

Get started

Top comments (2)

Collapse
 
vbarzana profile image
Victor A. Barzana

This is a great article, who would have thought you would need to step into V8 to debug this issue and the workaround you came up with is very smart :) Thanks for the post, this helps a lot of people with the same problem.

Collapse
 
drobnikj profile image
Jakub Drobník

Thanks! I glad to hear that it helps. :)