In this article, we’ll look at how a single mistaken command caused GitLab to lose six hours of data from their website. We’ll see what happened, how they fixed it, and what they learned from it. For GitLab and its users, it was a terrible day.
GitLab is one of the most popular platforms for hosting and collaborating on code. But on January 31, 2017, GitLab hit one of its greatest nightmares: a team member accidentally wiped the production database, erasing about six hours’ worth of data from GitLab.com.
The Problem: Too Much Data
The problem started around 6 pm UTC, when GitLab noticed spammers creating large numbers of snippets (small pieces of shareable code) on GitLab.com, putting the database under heavy, unstable load. GitLab began blocking the spammers by IP address and removing their accounts and snippets.
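Blocking an abusive address at the host firewall is usually a one-liner. A generic illustration, using a documentation placeholder IP rather than anything from the incident:

```bash
# Drop all traffic from an abusive source address (illustrative IP only).
sudo iptables -I INPUT -s 203.0.113.42 -j DROP
```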
Around 9 pm UTC things got worse: database writes began to lag and lock up, eventually taking the website down. GitLab also found that one user was effectively using a project as a CDN, with 47,000 IP addresses signing in through the same account, which added even more load on the database. That account was removed as well.
Around 10 pm UTC, GitLab was paged because replication to the secondary database, which they relied on in case the primary ever failed, had fallen behind and stopped. There was simply too much data being written for the secondary to keep up. GitLab decided to fix the secondary by wiping its data directory and starting replication again from scratch.
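For context, re-seeding a PostgreSQL standby usually means clearing its data directory and pulling a fresh base copy from the primary with pg_basebackup. Below is a minimal sketch, not GitLab’s actual runbook: the primary hostname, the replication user, and the service commands are assumptions for illustration.

```bash
# Run on the SECONDARY only. "db1.example.com" and "gitlab_replicator"
# are made-up placeholders for the primary host and replication role.
sudo gitlab-ctl stop postgresql

# The step at the center of this incident: clearing the standby's data dir.
sudo rm -rf /var/opt/gitlab/postgresql/data/*

# Pull a fresh base backup from the primary, streaming WAL while copying;
# -R writes recovery settings so the node comes back up as a standby.
sudo -u gitlab-psql pg_basebackup \
  -h db1.example.com -U gitlab_replicator \
  -D /var/opt/gitlab/postgresql/data \
  -X stream -P -R

sudo gitlab-ctl start postgresql
```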
The Mistake: The Right Command on the Wrong Server
But the re-seeding kept failing with errors. GitLab tried adjusting replication settings on the primary database, but after those changes PostgreSQL refused to start at all, complaining that the configured connection limit required more semaphores than the operating system allowed.
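For readers unfamiliar with this failure mode: PostgreSQL allocates kernel semaphores based on max_connections, so setting the limit too high can keep the server from starting at all. A rough way to inspect the relevant values (the paths follow the omnibus layout used elsewhere in this article; the idea, not the exact numbers, is the point):

```bash
# Show the connection and replication settings currently configured.
grep -E '^(max_connections|max_wal_senders)' \
  /var/opt/gitlab/postgresql/data/postgresql.conf

# Show the kernel's semaphore limits (SEMMSL SEMMNS SEMOPM SEMMNI).
cat /proc/sys/kernel/sem

# If PostgreSQL refuses to start with a semaphore error, lowering
# max_connections (or raising the kernel limits) and restarting is the
# usual fix; GitLab's postmortem describes reducing max_connections.
```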
Around 11 pm UTC, one of the engineers (team-member-1) suspected the re-seed was failing because the data directory already existed (even though it was empty) on the secondary database. He decided to delete the directory’s contents using rm -rf /var/opt/gitlab/postgresql/data/*.
But he made one big mistake: he ran the command on the primary database server instead of the secondary. The command wiped out the production database, leaving GitLab.com with essentially no data at all.
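One safeguard that has become common after incidents like this (not something the article claims GitLab had at the time) is wrapping destructive commands in a script that refuses to run on the wrong host. A minimal sketch; the hostname and confirmation flow are assumptions for illustration:

```bash
#!/bin/bash
# Hypothetical guard: only allow wiping the data directory on the secondary.
# "db2.example.com" is an assumed hostname used purely for illustration.
set -euo pipefail

EXPECTED_HOST="db2.example.com"
DATA_DIR="/var/opt/gitlab/postgresql/data"

if [[ "$(hostname -f)" != "$EXPECTED_HOST" ]]; then
  echo "Refusing to run: this is $(hostname -f), not $EXPECTED_HOST" >&2
  exit 1
fi

echo "About to delete everything under $DATA_DIR on $EXPECTED_HOST."
read -r -p "Type the hostname to confirm: " answer
[[ "$answer" == "$EXPECTED_HOST" ]] || { echo "Aborted."; exit 1; }

rm -rf "${DATA_DIR:?}"/*
```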
The Solution: Use an Old Backup
As soon as team-member-1 realized what he had done, he told his teammates and GitLab.com was taken offline. The team started hunting for backups to restore the data.
They found that they had several backup mechanisms in place, but none of them actually worked (a sketch of the kind of routine check that catches this early follows the list):
Disk snapshots were not enabled
The S3 backups were nowhere to be found
The scheduled database dumps were out of date
The replication process was broken
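As a small illustration of the routine check mentioned above, a script like the following, run on a schedule, would flag an empty backup bucket long before anyone needs a restore. The bucket name is a made-up placeholder:

```bash
#!/bin/bash
# Hypothetical scheduled check: fail loudly if the backup bucket is empty.
# The bucket name "gitlab-db-backups" is an illustrative placeholder.
set -euo pipefail

COUNT=$(aws s3 ls s3://gitlab-db-backups/ --recursive | wc -l)

if [[ "$COUNT" -eq 0 ]]; then
  echo "ALERT: no objects found in the backup bucket" >&2
  exit 1
fi
echo "OK: $COUNT backup objects present"
```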
The only backup that worked was one team-member-1 had made by hand about six hours before the incident. It contained most of GitLab.com’s data, but not the issues, merge requests, users, comments, snippets, and so on that people had created or changed during those six hours.
GitLab decided to restore from this backup to get GitLab.com working again as soon as possible. They also asked users to help recover lost data by sending in screenshots or local copies of their recent work.
The restore took a long time and involved many steps (a generic sketch of the first two follows the list):
Copying the backup onto a new database server
Pointing GitLab at the new database server
Checking and repairing the restored data
Starting the GitLab services and testing that everything worked
Keeping users informed about what was going on
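Illustrative only: below is one generic way to carry out the first two steps for a pg_dump-format backup. Hostnames, file names, and database names are assumptions, and GitLab’s actual restore involved copying data between servers over many hours rather than these exact commands.

```bash
# 1. Put the backup on the new database server.
scp gitlabhq_production.dump db-new.example.com:/tmp/

# 2. Restore it into a freshly created database on that server.
ssh db-new.example.com '
  createdb gitlabhq_production &&
  pg_restore --jobs=4 --dbname=gitlabhq_production /tmp/gitlabhq_production.dump
'

# 3. Point GitLab at the new server, e.g. in /etc/gitlab/gitlab.rb
#    (the exact keys depend on the installation), then reconfigure:
#      gitlab_rails['db_host'] = "db-new.example.com"
#      sudo gitlab-ctl reconfigure
```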
GitLab.com finally came back online around 6:14 pm UTC on February 1st, more than 18 hours after the incident began.
The Lesson: Learn from Mistakes
GitLab investigated the incident thoroughly and published a detailed postmortem blog post about it. They identified why the problem happened and what made it worse, including:
Human error: team-member-1 deleted the data directory on the wrong server
Lack of verification: none of the backup methods were tested or monitored
Lack of documentation: there was no clear procedure for restoring from backups
Lack of communication: there was no good way to coordinate during the incident
Lack of sleep: team-member-1 was working late at night and was tired
They also made a list of improvements to prevent such problems from happening again (a sketch of automated backup verification follows the list), including:
Enabling disk snapshots and verifying the S3 backups
Improving backup documentation and testing procedures
Adding alerts and monitoring for backup failures
Adding role-based access control and audit logging for database servers
Providing training and support for PostgreSQL replication
Building a blameless culture and a process for learning from mistakes
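As a concrete example of the “test your backups” items above, a scheduled job can restore the newest dump into a scratch database and run a sanity query, failing loudly if anything looks wrong. Everything here (paths, database and table names) is an assumption for illustration, not GitLab’s actual tooling:

```bash
#!/bin/bash
# Hypothetical nightly job: prove the newest dump actually restores.
# Paths, database and table names are illustrative assumptions.
set -euo pipefail

LATEST="$(ls -1t /var/backups/postgres/*.dump | head -n 1)"
SCRATCH_DB="backup_verify_$(date +%Y%m%d)"

createdb "$SCRATCH_DB"
pg_restore --jobs=2 --dbname="$SCRATCH_DB" "$LATEST"

# Minimal sanity query: an empty users table means the backup is useless.
ROWS="$(psql -At -d "$SCRATCH_DB" -c 'SELECT count(*) FROM users;')"
dropdb "$SCRATCH_DB"

if [[ "$ROWS" -eq 0 ]]; then
  echo "Backup verification FAILED: restored users table is empty" >&2
  exit 1
fi
echo "Backup verification OK: $ROWS users restored from $LATEST"
```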
The incident was very bad for GitLab and its users. It showed how much they needed working, regularly tested backups and a clear, documented procedure for restoring them.
GitLab was open and honest about the problem, sharing what they found and learned with everyone. They apologized to their users and offered compensation for the data loss. In return they received a great deal of feedback and support from their community, who appreciated the transparency and the effort.
Conclusion
GitLab made a serious mistake and lost data, but they recovered and learned from it. Their outage is a reminder for everyone who works with data and databases to be careful, thoughtful, and prepared. Always double-check your commands, test your backups, document your procedures, communicate with your teammates, and learn from your mistakes.
Top comments (10)
Please don't use clickbait headlines. Phrasing an incident from 6 years ago in the present tense in the title is unfair to readers, not to mention GitLab.
Exactly. I was shocked when I saw the title and wondered why I hadn't heard about this anywhere else. Then I saw the date in the second paragraph.
@kanani_nirav, please consider changing the title of your article because it is misleading, and you don't want to get a reputation as a clickbaiter.
@squidbe Title has been updated. Thanks for the feedback!
It's better, but you're still using present tense which strongly implies that this is an event that just occurred rather than an event that took place 6.5 years ago. If you want to avoid any clickbait implications, try using past tense. E.g., "When GitLab Lost Their Entire Prod DB: Lessons Learned", or something like that that shows the reader you're not referencing a current event.
This all happens when:
The lesson learned should be: cut corners and wait for the consequences to happen.
The same applies to the 'Azure case' mentioned below.
We had a similar issue, but nobody on our end was at fault; it was the Azure SQL team who hadn't tested their backup methods.
Azure released a new database tier called Hyperscale, advertised as the best form of highly available solution, so we bought it and started using it. The only problem was that its backups could not be restored to any other tier/edition. Restoring the database to a different tier/edition/server required a manual backup and restore, which generally took about 4 hours because the database was 100 GB in size. The restore would then take another 4 hours.
One fine day the server went down. Luckily a read-only replica was available, so I started taking a manual backup.
Unfortunately, nobody on the Azure support team knew how to restart the primary host, since it wasn't a standalone server but a set of services running on top of multiple servers. And this was a Friday, so the support staff were mostly in weekend-getaway mode and no senior engineer was available. I was on the phone with them for an hour and demanded that they restart whatever needed restarting to fix it. Support staff pointed me to some hot-backup articles, but those methods weren't compatible with Hyperscale. After a couple of hours of arguing, somebody got hold of a product manager, who managed to guide the support staff through restarting a couple of hosts. The server did start. It took a total of 5 hours to get the database up and running.
And the manual backup and restore was still in progress; the next day I saw that the restore had taken more than 9 hours.
Since this happened almost at the end of the day, the data loss was very low, as most of the people who used the system had already called it a day. But we still got some significantly red faces from the people who had worked late in the office.
I wrote up and submitted feedback, but nobody contacted me, as if it were something that just happened over a weekend and wasn't important.
We switched to another database tier.
My friend also made the same mistake, but not on a big system/website, just one with about 70k users, and we lost about 20 hours of data. And they never learned anything from it...
which tells more about his/her professional integrity than about the importance of this db
this is worse:
https://gitlab.pasteur.fr/rpellari/bayesianem-exosome/-/issues/?sort=created_date&state=opened&first_page_size=100
🤣🤣🤣😂