At my workplace we use a Django-based backend and frequently need to perform data analysis on the data in the Django models. This post describes a basic setup that lets data analysts work with that data in a Jupyter notebook.
Note: This can be done in much better ways using docker, jupyterlab, separate machines, etc. This post only outlines the setup that was done within the constraints and timeline available.
Tools Used
shell_plus - Django shell with all models already imported, available inside a Jupyter notebook, and more.
supervisor - To run the notebook server continuously in the background.
postgresql - Database. A read-only user lets analysts work on the latest production db without the risk of changing it.
cgroups - To make sure that the notebook only consumes a limited amount of CPU and memory.
Use django_extensions
Install django_extensions as an app in your django project. Instructions for installation and usage are available on https://django-extensions.readthedocs.io/en/latest/.
Once done, you will be able to run and test the notebook via
./manage.py shell_plus --notebook
Set up a read-only user for postgres
Help taken from https://stackoverflow.com/a/42044878/1660759 and https://stackoverflow.com/a/54193832/1660759
CREATE ROLE Read_Only_User WITH LOGIN PASSWORD 'Test1234' NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION VALID UNTIL 'infinity';
Statements after this need to be run while connected to YourDatabaseName
GRANT CONNECT ON DATABASE YourDatabaseName TO Read_Only_User;
GRANT USAGE ON SCHEMA public TO Read_Only_User;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO Read_Only_User;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO Read_Only_User;
-- So that new tables in this database are automatically accessible to this user
-- (applies only to tables created by the role that runs this statement)
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO Read_Only_User;
-- For functions
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO Read_Only_User;
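If the same grants are needed for more than one database or role, the statements above can be generated instead of edited by hand. The following is only a sketch: `readonly_grant_statements` is a hypothetical helper (not part of the original setup), and it assumes the role, database and schema names are plain unquoted identifiers.

```python
import re

# Hypothetical helper for illustration; the names below are assumptions.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def readonly_grant_statements(role, database, schema="public"):
    """Return the read-only GRANT statements from this post as strings."""
    for name in (role, database, schema):
        # The names are interpolated into SQL text, so refuse anything
        # that is not a plain identifier.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    return [
        f"GRANT CONNECT ON DATABASE {database} TO {role};",
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role};",
        f"GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {role};",
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} GRANT SELECT ON TABLES TO {role};",
        f"GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA {schema} TO {role};",
    ]


# Print the statements so they can be pasted into psql while connected
# to the target database.
for statement in readonly_grant_statements("Read_Only_User", "YourDatabaseName"):
    print(statement)
```

The CREATE ROLE statement is left out on purpose, since it carries the password and is run only once.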
Set up a separate settings file
[Or manage this through environment variables. If that is what you do, you already know how to do that for this example.]
Create a new settings file, import the settings from the main settings file, and replace the database settings with the new read-only user.
settings_database_readonly.py
from .settings import *
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': '<db_name>',
        'USER': '<readonly_username>',
        'PASSWORD': '<readonly_user_password>',
        'HOST': '<db_address>',
        'PORT': '<db_port>',
    }
}
NOTEBOOK_ARGUMENTS = [
    '--port', '9999',
]
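For the environment-variable route mentioned earlier, a minimal sketch could look like the following. The `READONLY_DB_*` variable names, the `readonly_database_settings` function and the fallback host and port are all assumptions made for this illustration; they are not part of the original setup.

```python
import os


def readonly_database_settings(env=None):
    """Build a Django DATABASES dict for the read-only user.

    The READONLY_DB_* variable names are hypothetical; pick whatever
    naming your deployment already uses.
    """
    if env is None:
        env = os.environ
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            # Required values: a missing variable raises KeyError early.
            "NAME": env["READONLY_DB_NAME"],
            "USER": env["READONLY_DB_USER"],
            "PASSWORD": env["READONLY_DB_PASSWORD"],
            # Optional values with assumed defaults.
            "HOST": env.get("READONLY_DB_HOST", "localhost"),
            "PORT": env.get("READONLY_DB_PORT", "5432"),
        }
    }


# In settings_database_readonly.py this would follow `from .settings import *`:
# DATABASES = readonly_database_settings()
```

Keeping the lookup in a function makes it possible to check the result with a plain dict before pointing it at the real environment.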
Limit CPU and RAM usage of the notebook's process
To limit RAM and CPU usage I explored cpulimit and nice, but cpulimit does not limit memory and nice does not give granular control over limits. After some research I settled on cgroups.
There are very few posts about cgroups out there, and I wanted to use a config file rather than just run some commands. Given the documentation available, this decision turned out to be harder to follow through than expected.
Comparison between cpulimit, nice and cgroups - https://scoutapm.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups
Guide for using cgroups with a config file - https://www.paranoids.at/cgroup-ubuntu-18-04-howto/
apt install cgroup-tools
Copy the conf file from the examples
cp /usr/share/doc/cgroup-tools/examples/cgred.conf /etc/
/etc/cgconfig.conf - limits usage to 10% CPU and 1G of memory
group notebooks {
    cpu {
        cpu.cfs_quota_us = 10000;
    }
    memory {
        memory.limit_in_bytes = 1024m;
    }
}
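The two numbers in this config can be derived rather than guessed. The sketch below assumes the kernel default of 100000 microseconds for cpu.cfs_period_us (the config above does not set it) and treats the `m` suffix as mebibytes; the function names are made up for this example.

```python
# Assumed kernel default for cpu.cfs_period_us (not set in the config above).
DEFAULT_CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_percent, period_us=DEFAULT_CFS_PERIOD_US):
    """CPU time in microseconds allowed per period for a share of one core."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return round(period_us * cpu_percent / 100)


def mebibytes_to_bytes(mebibytes):
    """Convert a value such as 1024 (as in `1024m`) to bytes."""
    return mebibytes * 1024 * 1024


print(cfs_quota_us(10))          # 10000
print(mebibytes_to_bytes(1024))  # 1073741824
```

Note that the quota is relative to a single core, so 10% here means a tenth of one core, not a tenth of the whole machine.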
/etc/cgrules.conf - This rule assigns every process matched by the first column (notebooks, under the <user> header) to the notebooks cgroup, which limits it to 10% CPU and 1G of memory.
#<user> <controllers> <destination>
notebooks cpu,memory notebooks
For testing, use these commands:
/usr/sbin/cgconfigparser -l /etc/cgconfig.conf
/usr/sbin/cgrulesengd -vvv
Supervisor
Set up supervisor.
Use the following configuration in supervisor.conf for running the notebook using cgroup
[program:readonly_shell_notebook]
environment=DJANGO_SETTINGS_MODULE="dashboard.settings_database_readonly"
command=cgexec -g cpu,memory:notebooks /home/videoken-engage/.virtualenvs/django_project/bin/python ../manage.py shell_plus --notebook
directory=<directory of django project>
stdout_logfile=<path_to_log_file>
stderr_logfile=<path_to_log_file>
The cgexec command is used to run any process in a given cgroup.
Because the cgroups are created from configuration files, the tasks file is sometimes not given the proper permissions. I was getting the following error while running the cgexec command:
Cgroups error: cgroup change of group failed
This fixed the problem:
chown <my_ubuntu_username> /sys/fs/cgroup/memory/notebooks/tasks
chown <my_ubuntu_username> /sys/fs/cgroup/cpu/notebooks/tasks
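Once the notebook is running, one way to check that the process really landed in the notebooks cgroup is to read /proc/self/cgroup from a notebook cell. The sketch below assumes cgroup v1 (which the /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu paths above suggest), where each line has the form hierarchy-ID:controller-list:cgroup-path. The helper names and the sample text are illustrative only.

```python
def parse_proc_cgroup(text):
    """Map each cgroup v1 controller to the cgroup path of the process."""
    membership = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _, controllers, path = parts
        # One hierarchy can carry several controllers, e.g. "cpu,cpuacct".
        for controller in controllers.split(","):
            if controller:
                membership[controller] = path
    return membership


def in_cgroup(text, group, controllers=("cpu", "memory")):
    """True if every listed controller places the process in `group`."""
    membership = parse_proc_cgroup(text)
    expected = "/" + group.strip("/")
    return all(membership.get(name) == expected for name in controllers)


# Made-up sample of what /proc/self/cgroup might contain under cgexec.
sample = "5:cpu,cpuacct:/notebooks\n4:memory:/notebooks\n1:name=systemd:/user.slice\n"
print(in_cgroup(sample, "notebooks"))  # True
```

In a notebook cell, passing `open("/proc/self/cgroup").read()` instead of `sample` checks the running kernel process itself.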
PS: This was my first use of cgroups after the initial reading I did to understand Docker internals. I might be overlooking some concepts; any input on that is especially welcome.