Waylon Walker

Posted on Jan 20, 2021 • Originally published at waylonwalker.com

Minimal Kedro Pipeline

#python #kedro #datascience

How small can a minimal kedro pipeline that is ready to package be? I made one in 4 files that you can pip install. It totals only 35 lines of Python: 8 in setup.py and 27 in mini_kedro_pipeline.py.

Minimal Kedro Pipeline

I have everything for this post hosted in this GitHub repo; you can fork it, clone it, or just follow along.

Installation

pip install git+https://github.com/WaylonWalker/mini-kedro-pipeline
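After installing, a quick way to confirm the module is importable is to look it up on the import path. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not something shipped with the repo.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None


# The module name matches the py_modules entry in setup.py.
if is_installed("mini_kedro_pipeline"):
    print("mini_kedro_pipeline is installed")
else:
    print("mini_kedro_pipeline was not found; check the pip install output")
```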

Caveats

This repo represents the minimal amount of structure needed to build a kedro pipeline that can be shared across projects. It's installable, and it drops right into your hooks.py or run.py modules. It is not a runnable pipeline. At this point, I think the config loader requires a logging config file.

This is a sharable pipeline that can be used across many different projects.

Usage

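# hooks.py
from typing import Dict

from kedro.framework.hooks import hook_impl
from kedro.pipeline import Pipeline

# the installed module is named mini_kedro_pipeline (see py_modules in setup.py)
import mini_kedro_pipeline as mkp


class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        """Register the project's pipeline.

        Returns:
            A mapping from a pipeline name to a ``Pipeline`` object.

        """

        return {"__default__": Pipeline([]), "mkp": mkp.pipeline}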

Implementation

This builds on another post that I made about creating the minimal Python package. I am not sure it should be called a package, since it's a single module, but what do you call it after you build it and host it on PyPI?

Minimal Python Package

What does it take to create an installable Python package that can be hosted on PyPI? What is the minimal Python package?

Directory structure

.
├── .gitignore
├── README.md
├── setup.py
└── mini_kedro_pipeline.py

setup.py

This is a very minimal setup.py. This is enough to get you started with a package that you can share across your team. In practice, there is a bit more that you might want to include as your project grows.

from setuptools import setup

setup(
    name="MiniKedroPipeline",
    version="0.1.0",
    py_modules=["mini_kedro_pipeline"],
    install_requires=["kedro"],
)

mini_kedro_pipeline.py

The mini kedro pipeline looks like any set of nodes in your project. Many projects separate nodes and functions; I prefer to keep them close together. The default recommendation is to have a create_pipelines function that returns the pipeline, whereas this module builds the pipeline once, at import time.

This pattern creates a singleton: if you were to reference the same pipeline in multiple places within the same running interpreter and modify one of them, you would run into issues. I don't foresee myself running into this, but maybe as more features become available I will change my mind.
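As an aside before the module itself, the singleton concern can be illustrated without kedro. In this sketch a plain list of step names stands in for the pipeline object, and `shared_steps` and `create_steps` are hypothetical names chosen for illustration; none of this is code from the repo.

```python
# Plain-Python stand-in: a list of step names plays the role of the
# pipeline object, so this runs without kedro installed.

# Module-level object: every importer receives this same list.
shared_steps = ["create_raw_data", "create_mult_data"]


def create_steps():
    """Hypothetical factory: builds a fresh list on every call."""
    return ["create_raw_data", "create_mult_data"]


# Two "projects" in one interpreter reference the module-level object.
project_a = shared_steps
project_b = shared_steps
project_a.append("extra_step")  # project A modifies it in place
print(project_b)  # project B sees the extra step as well

# With the factory, each caller gets an independent object.
fresh_a = create_steps()
fresh_b = create_steps()
fresh_a.append("extra_step")
print(fresh_b)  # still only the two original steps
```

A factory function costs one extra call in hooks.py, but it gives each caller its own object. The module below keeps the simpler module-level approach.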

"""
An example of a minimal kedro pipeline project
"""
from kedro.pipeline import Pipeline, node

__version__ = "0.1.0"
__author__ = "Waylon S. Walker"

nodes = []


def create_data():
    "creates a dictionary of sample data"
    return {"beans": range(10)}


nodes.append(node(create_data, None, "raw_data", name="create_raw_data"))


def mult_data(data):
    "multiplies each record of each item by 100"
    return {item: [i * 100 for i in data[item]] for item in data}


nodes.append(node(mult_data, "raw_data", "mult_data", name="create_mult_data"))

pipeline = Pipeline(nodes)
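Because the nodes wrap ordinary functions, you can check what they compute by calling them directly, without a kedro project or runner. The two functions below are copied from the module above; chaining them by hand is only a rough stand-in for what a runner would do with the "raw_data" and "mult_data" datasets.

```python
def create_data():
    "creates a dictionary of sample data"
    return {"beans": range(10)}


def mult_data(data):
    "multiplies each record of each item by 100"
    return {item: [i * 100 for i in data[item]] for item in data}


# Pass the output of the first function to the second, in the same order
# the pipeline wires them together.
raw_data = create_data()
result = mult_data(raw_data)
print(result)
# {'beans': [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]}
```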

Share your pipelines

Go forth and share your pipelines across projects. Let me know, do you share pipelines or catalogs across projects?
