How it began
One day a few months back, the QA engineer messaged me on Slack saying that he couldn't log into the web application. Naturally, I tried to log in with his credentials and was able to. I thought he had forgotten his password, so I sent it to him and he said that he could log in now. I didn't think anything of it.
Two hours later, right about log-off time, I got an email from a client saying they weren't able to log in. I dismissed it, thinking they had forgotten their password, and intended to get back to them first thing the next day. Then a mobile developer on my team said the same thing.
So I got to investigating. I went to the website and tried to log in, and I couldn't. The page would reload when I hit Enter without showing any errors.
I quickly started debugging when a colleague mentioned it could be an issue with the database connection. We had recently moved from a single database instance to a database cluster, and assumed this might be causing an issue or that one instance was taking on too much load. Since this had been the biggest and most recent change, we focused on it. However, the database console looked fine and didn't show any extra load on any specific instance.
This issue only happened on production, so it was safe to assume it was not related to any code changes. At this point, we started getting more and more complaints from clients, so I decided to do something dangerous: debug on production.
I connected to the server using Cyberduck, navigated to the login view file and logged something like "logging in". To my surprise, when I hit save, the file didn't get saved. Cyberduck showed a vague error I can't remember and didn't understand at the time.
After a couple more hours of debugging, we realized that the server had reached 100% disk usage. That day, I learned two useful Unix commands: du and df. From the man page:
The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument.
The df utility displays statistics about the amount of free disk space on the specified filesystem or on the filesystem of which file is a part.
This meant one thing: we had to upgrade the disk size. Thankfully my colleague figured out how to do that with no downtime.
Crisis was averted. People were able to log in.
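For readers who haven't used du and df before, here is a minimal sketch of how the two can be combined to answer "how full is the disk, and what is filling it?". The function names disk_used_percent and largest_entries are made up for this illustration, and these are not necessarily the exact invocations we ran at the time.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the usage percentage (without the % sign) of the filesystem
# that holds the path given as $1. `df -P` forces the one-line-per-
# filesystem POSIX output format, so the fifth column is the capacity.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the N largest entries directly under directory $1, sizes in
# kilobytes, largest last. N defaults to 5.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -n | tail -n "${2:-5}"
}

# Example calls (the second one can be slow on a large directory):
#   disk_used_percent /
#   largest_entries /var 5
```

Starting at a high-level directory and repeating largest_entries on the biggest entry it reports is usually enough to walk down to the culprit.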
The end... Not
Believe it or not, due to the immense workload we had at the time, no further action was taken to monitor the server's disk space or dig deeper into why this happened. So, somewhat unsurprisingly, two months later the server reached 100% capacity again!
We were better prepared, quickly identified the issue and upgraded the disk size. This time around, I took the time to dig into why this happened, since we hadn't uploaded nearly enough files in the previous two months to justify filling up around 90 gigabytes.
Again, I used the du and df commands to pinpoint the directory that was eating up the disk space:
$ du -sh /var/*
...
170.3G /var/mail
...
Imagine that: the mail directory was taking up 170 gigabytes, almost 80% of the entire server's disk space! Further digging showed that the culprit was cron. We had several cron jobs running, and cron emails each job's output to the user who owns the crontab; those emails get stored in /var/mail. This was stated clearly in the crontab file, as shown below, but one particular cron job was producing a lot of junk output that managed to fill up the directory quite quickly.
$ crontab -l
...
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
...
Now what?
The plan of action was to first stop further emails, then to delete the existing emails to free up the server.
$ crontab -e
# Disable cron emails (keep this comment on its own line)
MAILTO=""
$ sudo rm /var/mail/ubuntu
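One caveat with turning mail off entirely is that real errors from cron jobs go silent too. As an alternative, here is a minimal sketch of a wrapper that keeps only the tail of a job's output in a log file. This is not what we deployed; the function name run_capped, the log path and the line limit are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration only.
#
# run_capped LOGFILE MAX_LINES COMMAND [ARGS...]
# Runs COMMAND, keeps only the last MAX_LINES lines of its combined
# stdout/stderr, appends them to LOGFILE after a one-line header, and
# returns COMMAND's exit status.
run_capped() {
    log_file=$1
    max_lines=$2
    shift 2
    tmp_out=$(mktemp) || return 1
    "$@" >"$tmp_out" 2>&1
    status=$?
    {
        printf '%s exit=%s: %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$status" "$*"
        tail -n "$max_lines" "$tmp_out"
    } >>"$log_file"
    rm -f "$tmp_out"
    return "$status"
}
```

A cron entry would then source this file and call something like run_capped /var/log/myjob.log 50 /path/to/job (both paths are placeholders). The log file still grows over time, so it would need rotating, but a single runaway job can no longer dump unbounded output into it.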
Smarter and wiser, we decided to set up a monitoring service to catch this particular issue in case it happens again. The service of choice was Monit, and it was surprisingly easy to start using. It creates a dashboard that lets us easily visualize all the numbers we need, from disk space to CPU usage to memory, and sends email alerts on custom events. This great article is very helpful in setting up Monit on an Ubuntu server.
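To make the idea concrete, the kind of check a monitoring tool performs for disk space can be sketched in a few lines of shell. This is a stand-in for illustration, not Monit's configuration or implementation; the function name check_disk and the 80% threshold in the example are assumptions.

```shell
#!/bin/sh
# Hypothetical stand-in for a monitoring rule, for illustration only.
#
# check_disk PATH THRESHOLD
# Returns 0 if the filesystem holding PATH is below THRESHOLD percent
# used; otherwise prints a warning to stderr and returns 1.
check_disk() {
    used=$(df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is at ${used}% (threshold ${2}%)" >&2
        return 1
    fi
    return 0
}

# Example call: warn once the root filesystem is 80% full or more.
#   check_disk / 80
```

A dedicated tool adds what this sketch lacks: scheduling, alert delivery, and a history of the numbers over time.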
And the rest is history. We didn't face an issue with disk space again. So far.
Thanks for reading! Until next time 👋
Cover photo by Taylor Vick on Unsplash
Top comments (21)
I'm sorry to write this, but we're in 2020 and you're still on a monolithic architecture? If you sandboxed every service in small, resources-monitored (and limited) instances (containers, for example), your headache would have been way shorter.
I know, I know: business' first. But it's up to you to stand up and talk about tech debt.
Side note: I don't remember the last time I used a (S)FTP client to access to a server... brrrr...
Of course implementing microservices could never cause its own headaches 🙄
Obviously running microservices requires experience and good knowledge to avoid pitfalls, but in my personal experience, probably because I got more experienced over the years, I tend to consider small (in code and responsibility) services a great way to make things simple and able to be understood and developed by both experienced people and newcomers in the company.
It really takes you back, huh? 😂
As you said, startups in particular often need to put their business needs first to accelerate growth and market share. Plus, not every application needs to use microservices.
In my experience every company puts business in front of everything - and most of them put marketing in front of business ("the real heroes"). Just because I worked a long time for startups, old companies and lately consulting ones (i.e. more or less short projects), I can tell you you have more power than you might think, when you adopt standards that are simply different than monolithic approaches, without meaning more work (once you get practiced with them). It's years now that I work with autoscaling solutions, microservices or simply containerised services on a single machine, because in my opinion simple services allow you for great flexibility and replaceability over great control. Which apparently was what you lacked in this occasion :)
I appreciate you sharing your experience.
I’m skeptical about any power I have at the moment, but I will try to be a better advocate for better and more modern tech. I have much to learn!
Thanks for sharing. I like the graphics you added, it helps me to keep engaged and continue reading
Interesting. Why was a simple rm the answer for the mails, instead of filtering and keeping important messages (e.g. business-logic-related ones), even as an archive?
Isn't it a DevOps person's job to know what they're doing? (I have the feeling there is no dedicated DevOps/sysop person working there...)
I totally liked the conclusion of setting up a monitoring system, because 99% of companies lack monitoring or an understanding of this kind of thing. (It is super common to respond to balancing or scaling problems by just adding more CPU/memory, instead of investigating why they even need that amount.)
Hi there, you're right on us not having a dedicated devOps person as it is a small startup.
The amount of emails was huge and after going through a number of them, I noticed one cron job's response was adding backslashes, the number of which increased exponentially with each email. It reached a point where the entire screen was just filled with backslashes. The cron job's output wasn't incorrect, it was just being parsed incorrectly.
This Stack Overflow answer explains the root cause better:
Why is json_encode adding backslashes?
I've been using json_encode for a long time, and I've not had any problems so far. Now I'm working with a upload script and I try to return some JSON data after file upload. I have the following code: …
A bit misleading.... got excited over simple error. This is merely disk usage... server usage depends on several factors (which does include disk usage)... but I was thinking you hit a bandwidth limit due to excess customers and had to add load balancers or wrote custom provisioning scripts to create new cloud instances/turn off based on bandwidth usage.
Either way, this wasn’t even an organic thing because of customers—the issue was crontab so your server upgrades were kinda pointless when you coulda simply done the logging right i.e. handled it better from the start. In fact, it’s wrong to have a cron job that logs so much info. Ideally, you’d want to log only abnormal behaviour and normal behaviour must be consolidated into aggregated statistics instead. Anyway, glad you were able to find the culprit in the end.
P.S. It may be ideal to switch from ftp to ssh.
Hello Saifur.
Exactly, the point of this story is that the first time around we didn't bother debugging closely to understand the issue and just placed a bandaid on it, which led to it happening again. I find FTP faster and easier for me to edit and upload files and that's why I used it in this case, after all it was a high pressure situation.
Thanks for reading.
Hi 👋! I'll be quiet about any of my observations that mirror what's already noted here :) One single character to add that may help you out: du -sh *. That * behaves like the * in ls and most others - it'll give you itemized, summarized, human-readable usage for all of the contents of $PWD. You can start from / and work your way down.
Bonus: du -sh * | sort -h will organize the list - but I wouldn't recommend running that at / with 100% du. It could take some time. There are some other pipes that will limit the results to the x heaviest items, too (for busy dirs).
Oh another one for clearing du, if you're logrotating and don't need the archives:
find /var/log -type f -name "*.gz" -exec rm -f {} \;
I typed that on mobile, please don't copy paste lol.
Finds all the rotated .gz logs and nukes em - those add up to A LOT (esp if you're running mod_security lol)
Hello Kevin! Thanks for that lol.
Yes, du / took forever. I remember experimenting with a few different flags and parameters, but I'm not sure which combination I ended up using to figure out the root cause. Thanks a lot for your suggestions, I'll keep them in mind 👍
Nice sharing Doaa. Monit is a good choice. I am using LXC to overcome this situation where email used 80% of available disk space. I have separate LXC containers for email, app, and database. Each container is created with a profile fixing hard drive, RAM, memory.
I had some of the same reactions that others had regarding your approach and infrastructure setup. However, after taking a step back, I think at this point in your business (early startup it sounds like) and your team's expertise, you're doing just fine. We can't (and shouldn't?) all start off running kubernetes against microservices on the blockchain. I think you're doing great!
Thanks for sharing your postmortem and a transparent view into your work!
Thanks a lot Garett.
We can't all work using the latest tech. We're doing the best we can and slowly moving towards a more stable system and better infrastructure, but it's understandably a laborious process.
I appreciate your response :)
Trying out something different 😁 thanks for reading
Good post!
Great article