Azad Kshitij

Data Science: Project Structure

Why should you use a project structure?

Structure is a matter of personal preference, but at the end of the day all that matters is that you are comfortable moving around the project and writing code.

When we think about data science or analysis, we often think it is just about the results: charts, figures, insights, or visualizations. While these end products are generally the main event, it's easy to focus on making the product look nice and forget about code quality, reusability, and the ability to collaborate with others, and that is where project structure matters.

That being said, there is no single correct way to structure a project, but you should at least have some structure that you can follow across all your projects for the sake of standardization. You might be wondering why you should use a project structure at all. Here are some reasons:

Happy team members

Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else.

A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don't necessarily have to read 100% of the code before knowing where to look for very specific things.
Well-organized code tends to be self-documenting in that the organization itself provides context for your code without much overhead. People will thank you for this because they can:

  • Collaborate more easily with you on this analysis
  • Learn from your analysis about the process and the domain
  • Feel confident in the conclusions at which the analysis arrives

Happy future you

Ever tried to reproduce an analysis that you did a few months ago, or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py, new_make_figures01.py, or try_figre.py to get things done. You start questioning your existence; that has happened to all of us at some point. Here are some questions we ask when we open an old data science project:

  • Where did I store the data files?
  • Where did the plots get saved?
  • Where did I save the ML model?

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work.

Getting Started

As PEP 8 puts it: "Consistency within a project is more important. Consistency within one module or function is the most important... However, know when to be inconsistent -- sometimes style guide recommendations just aren't applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!"

Requirements

  • Python 3.5+

Starting a new project

Starting a new project is as easy as running a small setup script from the command line:

  • Create a folder for the project.
  • Copy and paste the setup.py script below into it.
  • Run it with `python setup.py`.
# setup.py
import os

# Folders that make up the project skeleton.
folders = [
    "data",
    "data/processed",
    "data/raw",
    "data/external",
    "data/transformed",
    "docs",
    "models",
    "notebooks",
    "references",
    "reports",
    "reports/figures",
    "src",
    "src/data",
    "src/features",
    "src/models",
    "src/visualization",
]


# Create each folder, and add an empty .gitkeep file so the empty
# directories can still be committed to version control.
for folder in folders:
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, ".gitkeep"), "w") as f:
        pass

# Placeholder source files and the requirements file.
files = [
    "src/__init__.py",
    "src/data/make_dataset.py",
    "src/features/build_features.py",
    "src/models/train_model.py",
    "src/models/predict_model.py",
    "src/visualization/visualize.py",
    "requirements.txt"
]

for file in files:
    with open(file, "w") as f:
        pass

Directory structure

├── data
│   ├── external       <- Data from third party sources.
│   ├── transformed    <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    ├── data           <- Scripts to download or generate data
    │   └── make_dataset.py
    │
    ├── features       <- Scripts to turn raw data into features for modeling
    │   └── build_features.py
    │
    ├── models         <- Scripts to train models and then use trained models to make
    │   │                 predictions
    │   ├── predict_model.py
    │   └── train_model.py
    │
    └── visualization  <- Scripts to create exploratory and results oriented visualizations
        └── visualize.py

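Because src/__init__.py makes src a regular Python package, the scripts in it can be imported from notebooks or other code. Below is a minimal sketch of a notebook preamble, assuming the notebook lives in the notebooks folder and the project root is one level up (alternatively, make the project pip installable as the tree comment suggests):

# Notebook preamble: make the project root importable so that `src` can be found.
# Assumes the notebook runs from the notebooks/ directory created by setup.py.
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent       # one level up from notebooks/
sys.path.append(str(PROJECT_ROOT))

from src.data import make_dataset        # modules created (empty) by setup.py
from src.features import build_features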

Opinions

This is by no means the best project structure, but it is a starting point that you can iterate on to make your own. Some of the opinions below are about workflows, and some are about tools that make life easier.

Data is immutable

Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure, but anyone should be able to reproduce the final products with only the code in src and the data in data/raw. Also, if data is immutable, it doesn't need source control in the same way that code does. Therefore, by default, the data folder is included in the .gitignore file. If you have a small amount of data that rarely changes, you may want to include it in the repository.
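For example, a pipeline step in src/data/make_dataset.py might only ever read from data/raw and write its output to data/processed, leaving the raw file untouched. Here is a minimal sketch using only the standard library; the file names and the cleaning rule are hypothetical:

# src/data/make_dataset.py (sketch): read raw data, write the cleaned version
# to data/processed, and never modify anything in data/raw.
import csv
from pathlib import Path

RAW_PATH = Path("data/raw/survey.csv")                    # hypothetical raw file
PROCESSED_PATH = Path("data/processed/survey_clean.csv")

def make_dataset():
    with RAW_PATH.open(newline="") as raw_file:
        reader = csv.DictReader(raw_file)
        fieldnames = reader.fieldnames
        # Example cleaning rule: drop rows with a missing "age" value.
        rows = [row for row in reader if row.get("age")]

    PROCESSED_PATH.parent.mkdir(parents=True, exist_ok=True)
    with PROCESSED_PATH.open("w", newline="") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    make_dataset()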

Notebooks are for exploration and experimentation

Notebooks are very effective for exploratory data analysis. However, they can be less effective for reproducing an analysis. We should subdivide the notebooks folder: for example, notebooks/exploratory contains initial explorations, whereas notebooks/reports holds more polished work that can be exported as HTML to the reports directory.
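If you adopt this split, you can extend the setup.py scaffold to create these subfolders as well; a small sketch:

# Optional addition to setup.py: subdivide the notebooks folder.
import os

for folder in ["notebooks/exploratory", "notebooks/reports"]:
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, ".gitkeep"), "w") as f:
        pass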

References

  1. https://www.youtube.com/watch?v=MaIfDPuSlw8&t=264s
  2. https://drivendata.github.io/cookiecutter-data-science/
  3. https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview
  4. https://github.com/Azure/Azure-TDSP-ProjectTemplate
