We started with Django 1.4 and Ubuntu 14.04 in 2015 at doxper.com.
We get around 600 requests per minute on average, which we have been handling with two t2.medium instances. Even then we were sometimes getting 504s and 502s and the service would go down; it needed a machine reboot to come back, because Apache could not even restart after a worker got killed by the kernel.
We were running 3 Apache workers even on a t2.medium.
Service usage (taskcutter):
1. Internal use: around 30 employees post and get data from this service.
2. Server-to-server GET API calls: on average about three times the number of prescriptions we get daily.
3. Server-to-server POST API calls: probably five times the number of prescriptions we get.
Problems faced: Whenever any two or all three of these usages went above average, Apache workers got killed by the kernel. Sometimes all 3 workers on a machine got killed and the ALB threw that machine out; occasionally both machines dropped out of the ALB and our service went down.
The only way to get the service up again was to reboot the machines from the AWS console.
We tried debugging the system by checking syslog to find out why the service went down; our hypothesis was that one or more APIs were heavy enough to bring the service down.
Why it was happening
We had been using Ubuntu 14.04, which is almost 6 years old, and that Ubuntu version ships an even older mod_wsgi, almost 9 years old by now.
mod_wsgi is a WSGI (Web Server Gateway Interface) module for Apache; it embeds a Python interpreter inside the Apache processes to serve HTTP requests.
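To make that concrete, WSGI is just a calling convention between the web server and the Python application: the server invokes a single callable for every request. A minimal illustration of that interface (not our actual app; in a Django project this callable is the application object in the project's wsgi.py):

```python
# Minimal WSGI application, only to illustrate the interface that both
# mod_wsgi and Gunicorn serve. Django generates an equivalent callable
# in the project's wsgi.py.
def application(environ, start_response):
    # environ: dict describing the request (path, method, headers, ...)
    # start_response: callback for sending the status line and headers
    body = b"Hello from WSGI"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```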
This older version of mod_wsgi and Apache probably had these problems. Upgrading mod_wsgi was not easy because we had to compile a newer mod_wsgi against newer Apache headers. Or we could upgrade Ubuntu and then mod_wsgi, which was a chicken-and-egg problem (should we upgrade Ubuntu first or mod_wsgi?).
Solution
Since we were upgrading Ubuntu anyway, we discussed why not use Gunicorn, which is the most commonly used server with Python projects nowadays.
So we started from scratch with Ubuntu 18.04 and installed nginx and Gunicorn to serve clients.
When we started the nginx and Gunicorn setup in production we got a lot of 504 responses at peak hours, because of its better back-pressure handling. What nginx does is aggressively stop taking more requests when Gunicorn workers are busy and return 504 (Gateway Timeout), and we were also noticing Gunicorn workers getting killed due to page faults.
We started logging memory and CPU usage to confirm the cause. What we noticed was that memory usage increased slowly, and when it reached 94-96% some Gunicorn worker got killed and memory usage suddenly dropped to something like 60-65%, which is clearly a memory leak pattern.
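The logging itself was nothing fancy. A minimal sketch of that kind of check, assuming psutil is installed (the interval, log path, and format here are illustrative, not our exact script):

```python
# Append system memory and CPU usage to a log file once a minute.
# Assumes `pip install psutil`; interval and path are illustrative.
import time
import psutil

while True:
    mem = psutil.virtual_memory()           # system-wide memory stats
    cpu = psutil.cpu_percent(interval=1)    # CPU % sampled over 1 second
    line = "%s mem=%.1f%% cpu=%.1f%%" % (
        time.strftime("%Y-%m-%d %H:%M:%S"), mem.percent, cpu)
    with open("/var/log/resource_usage.log", "a") as f:
        f.write(line + "\n")
    time.sleep(60)
```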
How we handled the memory leak
We started recycling Gunicorn workers with --max-requests 1000 to reclaim the leaked memory, and we also fine-tuned the machine to not overcommit memory (as we do not have any swap on AWS instances).
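For reference, the same flag can also live in a Gunicorn config file, which is plain Python. A sketch under our assumptions: only max_requests = 1000 is the value from above; the bind address, worker count, jitter, and module name are illustrative. The overcommit part is a kernel setting (the vm.overcommit_memory sysctl), not a Gunicorn one, so it is not shown here.

```python
# gunicorn_conf.py -- start with: gunicorn -c gunicorn_conf.py myproject.wsgi
# Only max_requests = 1000 is the value described above; the rest is illustrative.
bind = "127.0.0.1:8000"      # nginx proxies requests to this address
workers = 3                  # number of worker processes
max_requests = 1000          # recycle a worker after it has served 1000 requests
max_requests_jitter = 50     # stagger restarts so workers do not recycle together
```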
That was all; we are good now, no more downtime, and yes, we have downsized our machines to t2.small.
Note: The memory leak may be because of Django, Gunicorn, or any other Python package we are using; debugging memory leaks in Python is quite difficult.
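If we ever dig into it, the standard-library tracemalloc module is one way to at least find leak suspects. A minimal sketch, not something we ran in production:

```python
# Use the standard-library tracemalloc to see which source lines hold
# the most allocated memory. Illustrative only; we did not run this in production.
import tracemalloc

tracemalloc.start()

# ... exercise the suspected code path here (e.g. call the heavy API view) ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    # Top 10 source lines by memory still allocated at this point
    print(stat)
```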