DEV Community

sanchitsharma

Data Analysis Setup for Django Data on a Single Machine

At my workplace we use a Django-based backend and frequently need to run data analysis on the data in the Django models. This post describes a basic setup that lets data analysts work with that data in a Jupyter notebook.

Note: This can be done much better using Docker, JupyterLab, separate machines, etc.; there is no limit. This post only outlines the setup that was feasible within the constraints and timeline available.

Tools Used

shell_plus - Django shell from django_extensions with all models already imported; it can also run inside a Jupyter notebook.

supervisor - to keep the notebook server running continuously in the background.

postgresql - the database. A read-only user lets analysts work on the latest production DB without the risk of changing it.

cgroups - to make sure the notebook consumes only a limited amount of CPU and memory.

Use django_extensions

Install django_extensions as an app in your django project. Instructions for installation and usage are available on https://django-extensions.readthedocs.io/en/latest/.

Once done you will be able to run and test the notebook via

./manage.py shell_plus --notebook

Set up a read-only user for postgres

Help taken from https://stackoverflow.com/a/42044878/1660759 and https://stackoverflow.com/a/54193832/1660759

CREATE ROLE Read_Only_User WITH LOGIN PASSWORD 'Test1234' NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION VALID UNTIL 'infinity';

Statements after this need to be run while connected to YourDatabaseName

GRANT CONNECT ON DATABASE YourDatabaseName TO Read_Only_User;
GRANT USAGE ON SCHEMA public TO Read_Only_User;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO Read_Only_User;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO Read_Only_User;
-- So that tables created later in this schema (by the role running this statement) are automatically readable by this user
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO Read_Only_User;
-- For functions
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO Read_Only_User;
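If you need the same grants for more than one role or database, a small helper can render the statements so they can be reviewed before being pasted into psql. This is only a sketch: the function name readonly_grants and the identifier check are my own illustration, not part of any library, and it only builds strings without connecting to the database.

```python
import re

# Only plain, unquoted SQL identifiers are accepted (an assumption made to
# keep the generated statements safe to paste).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def readonly_grants(role, database, schema="public"):
    """Return the read-only GRANT statements from this section as strings."""
    for name in (role, database, schema):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    return [
        f"GRANT CONNECT ON DATABASE {database} TO {role};",
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role};",
        f"GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {role};",
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} GRANT SELECT ON TABLES TO {role};",
        f"GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA {schema} TO {role};",
    ]


if __name__ == "__main__":
    # Print the statements for a hypothetical role and database.
    print("\n".join(readonly_grants("Read_Only_User", "YourDatabaseName")))
```

The CREATE ROLE statement is left out on purpose, since it carries the password and is run only once.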

Set up a separate settings file

[Or manage this through environment variables. If that is what you already do, you know how to adapt this example 👌]

Create a new settings file, import everything from the main settings file, and replace the database settings with the new read-only user.

settings_database_readonly.py

from .settings import *

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': '<db_name>',
        'USER': '<readonly_username>',
        'PASSWORD': '<readonly_user_password>',
        'HOST': '<db_address>',
        'PORT': '<db_port>',
    }
}

NOTEBOOK_ARGUMENTS = [
    '--port', '9999',
]
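For the environment-variable route mentioned above, a minimal sketch could look like the following. The variable names (READONLY_DB_NAME and so on) and the helper readonly_database_settings are hypothetical, chosen only for this example; the localhost and 5432 defaults are also assumptions.

```python
import os


def readonly_database_settings(environ=os.environ):
    """Build the DATABASES dict for the read-only user from environment variables.

    The variable names are illustrative; use whatever your deployment defines.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            # Required values: a missing variable raises KeyError at startup.
            "NAME": environ["READONLY_DB_NAME"],
            "USER": environ["READONLY_DB_USER"],
            "PASSWORD": environ["READONLY_DB_PASSWORD"],
            # Optional values with assumed defaults.
            "HOST": environ.get("READONLY_DB_HOST", "localhost"),
            "PORT": environ.get("READONLY_DB_PORT", "5432"),
        }
    }
```

In settings_database_readonly.py this would be used as DATABASES = readonly_database_settings(), which keeps the password out of the repository.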

Limit the notebook process's CPU and RAM usage

For limiting RAM and CPU usage I explored cpulimit and nice, but cpulimit does not limit memory and nice does not give granular control over limits. After some research I settled on cgroups.

There are very few posts about cgroups out there, and I wanted to use a config file rather than just run a handful of commands. That decision turned out to be harder to follow through than expected, given the documentation available.

Comparison between cpulimit, nice and cgroups - https://scoutapm.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups

Guide for using cgroups with a config file - https://www.paranoids.at/cgroup-ubuntu-18-04-howto/

apt install cgroup-tools

Copy conf file from examples

cp /usr/share/doc/cgroup-tools/examples/cgred.conf /etc/

/etc/cgconfig.conf - limits the group to 10% CPU usage (a quota of 10000 µs per default 100000 µs scheduler period, i.e. 10% of one core) and 1G of memory

group notebooks {
     cpu {
         cpu.cfs_quota_us = 10000;
     }
     memory {
         memory.limit_in_bytes = 1024m;
     }
}
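To pick other limits, the arithmetic behind these two values can be sketched as below. It assumes the kernel's default cpu.cfs_period_us of 100000 µs (the config above does not set the period) and that the k/m/g suffixes of memory.limit_in_bytes are powers of 1024. The function names are my own.

```python
def cfs_quota_us(cpu_percent, period_us=100_000):
    """Return the cpu.cfs_quota_us value for a percentage of one CPU core."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    # The quota is the run time (in microseconds) allowed per scheduler period.
    return period_us * cpu_percent // 100


def memory_limit_bytes(value):
    """Convert a limit such as '1024m' into a number of bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    # No suffix: the value is already in bytes.
    return int(value)


if __name__ == "__main__":
    print(cfs_quota_us(10))             # value used in the config above
    print(memory_limit_bytes("1024m"))  # 1 GiB expressed in bytes
```

For example, allowing 25% of a core would mean cpu.cfs_quota_us = 25000 with the same period; values above 100 allow more than one core.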

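/etc/cgrules.conf - The first column of a rule matches a user (or a group when prefixed with @), so this rule places every process owned by the user notebooks into the notebooks cgroup, limiting it to 10% CPU and 1G of memory.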

#<user>    <controllers>           <destination>
notebooks       cpu,memory              notebooks

For testing, use these commands:

/usr/sbin/cgconfigparser -l /etc/cgconfig.conf
/usr/sbin/cgrulesengd -vvv

Supervisor

Set up supervisor.

Use the following configuration in supervisor.conf for running the notebook using cgroup

[program:readonly_shell_notebook]
environment=DJANGO_SETTINGS_MODULE="dashboard.settings_database_readonly"
command=cgexec -g cpu,memory:notebooks /home/videoken-engage/.virtualenvs/django_project/bin/python ../manage.py shell_plus --notebook
directory=<directory of django project>
stdout_logfile=<path_to_log_file>
stderr_logfile=<path_to_log_file>

The cgexec command runs a process in a given cgroup.
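One way to check from a notebook cell that the process really ended up in the cgroup is to read /proc/self/cgroup. The sketch below assumes cgroup v1 (the layout used in this post), where each line has the form hierarchy-id:controllers:path; the helper names are hypothetical.

```python
def parse_proc_cgroup(text):
    """Map each cgroup v1 controller name to the cgroup path of the process."""
    paths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line looks like "4:memory:/notebooks" or "3:cpu,cpuacct:/notebooks".
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            if controller:  # cgroup v2 lines ("0::/path") have no controller name
                paths[controller] = path
    return paths


def in_cgroup(text, group, controllers=("cpu", "memory")):
    """Return True if every listed controller places the process in `group`."""
    paths = parse_proc_cgroup(text)
    return all(paths.get(name) == "/" + group for name in controllers)
```

In a notebook cell this could be called as in_cgroup(open("/proc/self/cgroup").read(), "notebooks"). Child processes inherit the cgroup of their parent, so kernels started by the notebook server should report the same group.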

Because the cgroups are created from configuration files, the tasks files are sometimes not given the right permissions. I was getting the following error when running the cgexec command:

Cgroups error: cgroup change of group failed

This fixed the problem:

chown <my_ubuntu_username> /sys/fs/cgroup/memory/notebooks/tasks
chown <my_ubuntu_username> /sys/fs/cgroup/cpu/notebooks/tasks

PS: This was my first use of cgroups after the initial reading I did to understand Docker internals. I might be overlooking some concepts; any input on that is especially welcome 👐.
