A big incident within my Docker Swarm and the solution I developed after it occurred.
Paul Knulst in Programming • Oct 31, 2021 • 6 min read
The Incident
This is a lesson I learned the hard way a few days ago. At the moment, I am working for a software company as a consultant. During my remote work for different companies, I log all my invested time in my personal time tracking software, which works really well. The best thing: the time tracking software ran on my local machine in a Docker environment. How cool is that!
Over the last weeks, I was working with Docker Swarm, setting up my personal Swarm environment. After everything was finished, I decided to move my time tracking software into the cluster. Because it was already a Docker container, I could easily back up the MySQL data and put everything into my Swarm. It ran smoothly. Everything was working!
After some days I decided to create another Docker container within my Swarm, so I wrote the docker-compose.yml and deployed it. I kept working with my Swarm, and many services were added (GitLab, mail server, Portainer, etc.).
And then, one day, after I finished my work and wanted to log my working hours, the time tracking software was not available.
So I started to investigate: What is going on? Why can't I access the site? I checked the Swarm: every service was deployed correctly and running. After that, I checked the logs… Within the software logs, I saw a message saying that the DB wasn't there. But it should be there!?
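For context, those two checks look roughly like this in a Swarm (the service name is a placeholder):

# List all services and their replica counts
docker service ls

# Tail the last lines of a single service's logs
docker service logs --tail 100 timetracking_app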
I exec'd into my MySQL container. And then I saw it: NO DB ANYMORE! What was going on? With docker stack ps timetracking I checked every container. After some research, I saw that one container had been killed and another one deployed. Normally that's not a real problem. BUT THEN… I saw that it was deployed to another node in my cluster. This was bad because I knew that there was no volume on that node.
I was happy because I thought I could easily change it back. I opened the docker-compose.yml and added a constraint so that the service would always be deployed on the same node:
deploy:
  placement:
    constraints:
      - node.hostname == *****
I knew this would pin the deployment to a specific node within my cluster. Now the database wouldn't get lost, and the service should pick up the old volume again!
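To verify that the service really ends up on the intended node, something like this should do (the stack name timetracking is the one from the docker stack ps command above; the format string is just one way to print it):

# Show which node each task of the stack is scheduled on
docker stack ps timetracking --format '{{.Name}} -> {{.Node}}'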
AND THEN, AFTER A RESTART OF THE CONTAINER, THE WHOLE VOLUME WAS DESTROYED. IT COULD NOT BE OPENED ANYMORE.
Luckily, I had created a database backup some days earlier while moving from my local Docker environment into the Swarm. I was happy that I did not lose a whole month of working hours…
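In case someone runs into the same situation: restoring such a dump into a fresh MySQL/MariaDB container can look roughly like this (container name and dump path are placeholders; the root password comes from the same environment variable the container was started with):

# Pipe the SQL dump back into the mysql client inside the container
docker exec -i my-mysql-container sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' < /root/backups/timetracking-backup.sql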
The Backup Script
After this incident, the only thing I wanted to do was develop a simple backup functionality for my Docker environment. At first, I exec'd into every container and saved the DB manually, just to be sure. The next step was really easy: I searched for Docker backup functionality and found some really interesting articles and tutorials about saving all Docker volumes to AWS with encryption and…
This was too much for my needs because I wanted a really simple solution: just copy the databases from every Docker container into a safe place. So I decided to create a small script that does exactly this and have it executed every day.
After some hours of Bash scripting research, I developed a simple file called full-db-backup.sh, which I will explain line by line below:
#!/bin/bash
containers=$(docker ps | grep 'mysql\|maria' | awk '{print $NF}')

for container in $containers
do
    containerStringParts=$(echo $container | tr "." "\n")

    for single in $containerStringParts
    do
        simpleName=$single
        break 1
    done

    timestamp=$(date +%Y-%m-%d_%H-%M-%S)
    docker exec $container sh -c 'exec mysqldump --all-databases -uroot -p"$MYSQL_ROOT_PASSWORD"' > /root/backups/$simpleName-$timestamp.sql
done
Line 1:
Adding *#!/bin/bash* as the first line of your script tells the OS to invoke the specified shell to execute the commands that follow in the script.
Line 2:
A variable called containers is created and filled with the output of the command within $( ). This command lists all running Docker containers and pipes the result (through |) to grep, which filters them by two patterns (mysql OR maria). The result is then piped to awk, which only prints the last column of each remaining row: the container name. The variable therefore ends up holding a whitespace-separated list of container names.
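To illustrate what this pipeline does without touching a real Docker host, here are two made-up docker ps rows; only the last column of the matching row, the container name, survives:

printf 'abc123  mysql:8.0  ...  timetracking_db.1.xyz\ndef456  nginx:latest  ...  proxy.1.abc\n' \
  | grep 'mysql\|maria' \
  | awk '{print $NF}'
# prints: timetracking_db.1.xyz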
Line 4–5:
Starts a for loop to iterate through every container name.
Line 6:
Creates a list of name parts by splitting the container name at every dot (.). Docker Swarm container names are normally created like this: service-name.replica.some-hash
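A quick way to see what this split produces for a made-up Swarm container name:

echo "timetracking_db.1.abc123xyz" | tr "." "\n"
# timetracking_db
# 1
# abc123xyz

The inner loop below just keeps the first of these parts; cut -d. -f1 would achieve the same in one step.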
Line 8–9:
Starts a for loop to iterate through every part of the string array
Line 10–11:
Saves the FIRST value (the service name) in the variable simpleName and then breaks the loop.
Line 12:
Closes the inner for loop.
Line 14:
Saves the current timestamp in a variable.
Line 15:
Connects to the Docker container named $container (one entry of $containers) and executes mysqldump to save every database. As password it uses an environment variable ($MYSQL_ROOT_PASSWORD) which is set for nearly every MySQL/MariaDB container in my environment. The dump is then saved to /root/backups with a file name built from the container's simpleName and the current timestamp.
Line 16: Closes the outer for loop.
Executing The Backup
I tested the script, and it works as expected: every run creates a new file containing a backup of all databases.
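A manual test run looks roughly like this (assuming the script lives in /root/cronjobs, as in the cronjob below; the resulting file name is just an example):

mkdir -p /root/backups            # the script writes its dumps here
chmod +x /root/cronjobs/full-db-backup.sh
/root/cronjobs/full-db-backup.sh
ls /root/backups
# e.g. timetracking_db-2021-10-31_13-37-00.sql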
But I still had to execute the script at a regular interval so that changes within the databases would be saved too.
I did not want to make it too complicated, so I decided to use a simple cronjob on every Docker Swarm node. It is not the most elegant solution, but it works, and that is all I want at the moment.
I used crontab to create a new cronjob. With the following command, I opened the cronjob list on my machine:
crontab -e
I added a cronjob that runs every day at 1 AM and simply executes my script, which creates the backups within my root folder:
0 1 * * * /bin/sh /root/cronjobs/full-db-backup.sh
Finally, I copied the script to every node in my cluster and added the same cronjob on each of them.
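Copying the script and registering the cronjob can be scripted as well; a minimal sketch, assuming SSH access as root and made-up node hostnames:

for node in node1 node2 node3; do   # node names are placeholders
  ssh root@$node 'mkdir -p /root/cronjobs /root/backups'
  scp /root/cronjobs/full-db-backup.sh root@$node:/root/cronjobs/
  # re-install the cron entry idempotently: drop any old line, then append the current one
  ssh root@$node '(crontab -l 2>/dev/null | grep -v full-db-backup.sh; echo "0 1 * * * /bin/sh /root/cronjobs/full-db-backup.sh") | crontab -'
done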
Closing Notes
I know this is not the best solution, but this fast approach is enough at the moment. Now I can start to learn something about Ansible and how I can automate this script within a playbook. Also, I will create something that uploads my dumps to Amazon and encrypts them before uploading. But now I have time for all of this *because I'm SAFE*!
Still, I hope you find this article helpful! If you also have an interesting backup strategy to share, feel free to comment here. Happy backupping!
This article was published on my blog at https://www.paulsblog.dev/everybody-needs-backups-a-lesson-learned-the-hard-way/
Feel free to connect with me on my personal blog, Medium, LinkedIn, and Twitter.
Photo by Christian Erfurt / Unsplash