Designing a "Router" for kedro

#data #datascience #python #kedro

nodes_global

I released a router-like plugin for kedro back in April 2020. This was not the first design. The idea actually came from one of the QB folks who taught me kedro nearly a year before. We were assembling our pipelines with something called nodes_global. It worked fairly well, but it had some issues that came from being set as a global variable.

But...

One thing in particular that it did not lend itself well to was creating a packageable pipeline that I could pip install and append to any of my existing pipelines. This is something I am still trying to work out, and maybe I don't need it. I think I have it working for our internal pipelines and it seems like the way to go, but we don't necessarily end up using it.

Also...

With this pattern, all of the nodes needed to be importable by the module containing nodes_global. I find that this becomes a big hurdle for people building new pipelines who are coming from Jupyter, and it can be most infuriating when their nodes aren't getting run after they added them.

If you are a bit unsure about what kedro is, make sure to check out my what-is-kedro article.

@node(inputs='a_raw_cars', outputs='b_int_cars')

I set off to design something that was flask-like. Around November I had something working. You could simply start creating functions and decorate them, just like with flask. I even had it set up to auto-name the nodes with names like create_b_int_cars.
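To make the decorator idea concrete, here is a minimal sketch of how a flask-like registry could work. It is not the code from that early prototype: `SketchNode`, `REGISTRY`, and the naming rule are assumptions made for illustration, and a small dataclass stands in for kedro's node object so the example runs without kedro installed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class SketchNode:
    """Hypothetical stand-in for a kedro node (function plus dataset names)."""

    func: Callable
    inputs: Union[str, List[str]]
    outputs: Union[str, List[str]]
    name: str


# Hypothetical module-level registry, similar in spirit to flask's route table.
REGISTRY: List[SketchNode] = []


def node(
    inputs: Union[str, List[str]],
    outputs: Union[str, List[str]],
    name: Optional[str] = None,
):
    """Register the decorated function as a node and return it unchanged."""

    def decorator(func: Callable) -> Callable:
        # Auto-name from the first output: 'b_int_cars' -> 'create_b_int_cars'.
        first_output = outputs if isinstance(outputs, str) else outputs[0]
        REGISTRY.append(
            SketchNode(func, inputs, outputs, name or f"create_{first_output}")
        )
        return func

    return decorator


@node(inputs="a_raw_cars", outputs="b_int_cars")
def clean_cars(cars: List[str]) -> List[str]:
    return [car.strip().lower() for car in cars]
```

The decorated function stays an ordinary function, and the registry only fills up when the module holding the decorated functions is imported.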

But....

This did not lend itself well to pulling in functions from a library or dynamically creating nodes. I didn't realize how few of the nodes in my real pipelines have a 1:1 relationship between the node and the function. Most examples work this way, but for some reason when I step into a project we end up pulling a lot of functions out of existing libraries, or dynamically creating many datasets from a list of options.
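As a rough illustration of why a decorator is awkward here, the sketch below builds several nodes from one reused function and a list of options. Everything in it is hypothetical: `SketchNode` stands in for kedro's node object, `drop_blanks` plays the role of a function that lives in a shared library, and the `REGIONS` list and dataset names are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SketchNode:
    """Hypothetical stand-in for a kedro node (function plus dataset names)."""

    func: Callable
    inputs: str
    outputs: str
    name: str


def drop_blanks(rows: List[str]) -> List[str]:
    """Pretend this lives in a shared library, not in the pipeline module."""
    return [row for row in rows if row]


# Hypothetical list of options; each one gets its own dataset and node.
REGIONS = ["north", "south", "east"]

# One function, many nodes: there is no single place to hang a decorator.
nodes: List[SketchNode] = [
    SketchNode(
        func=drop_blanks,
        inputs=f"raw_{region}_sales",
        outputs=f"int_{region}_sales",
        name=f"create_int_{region}_sales",
    )
    for region in REGIONS
]
```

Because the function is defined once and wrapped three times, plain node objects collected in a list fit this case more naturally than a decorator on the definition.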

pytest inspired

simplicity

The final design ended up being suggested by a colleague of mine who is not using kedro, but is a brilliant Python dev. The idea was to walk through the project like pytest does, looking for modules and variables whose names match a certain pattern (node or pipeline).

I have been using this since April and am loving it. It has changed very little since the first release. When I create a new module, it automatically becomes a new pipeline in my pipelines dict, and all of the variables with the name node get scraped up and put into a single pipeline.
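The discovery idea can be sketched with nothing but the standard library. This is not find-kedro's implementation, and its real matching rules are described in its readme. The `discover` function, the `*_nodes.py` file pattern, the "variable name contains node" rule, and the `__default__` key are all assumptions chosen to keep the example small, and plain strings stand in for node objects.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def discover(root: Path, file_pattern: str = "*_nodes.py") -> Dict[str, List[Any]]:
    """Map each matching module name to the node-like values found in it."""
    pipelines: Dict[str, List[Any]] = {}
    for path in sorted(root.rglob(file_pattern)):
        # Import the file as a module without needing it on sys.path.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        found: List[Any] = []
        for var_name, value in vars(module).items():
            if var_name.startswith("_") or "node" not in var_name.lower():
                continue
            # Skip imported helpers such as a `node` factory function.
            if inspect.isroutine(value) or inspect.ismodule(value):
                continue
            # A list contributes all of its items; anything else counts as one node.
            found.extend(value if isinstance(value, list) else [value])
        if found:
            pipelines[path.stem] = found

    # Gather everything into one combined pipeline as well.
    pipelines["__default__"] = [n for group in pipelines.values() for n in group]
    return pipelines


# Small demonstration on a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "cars_nodes.py").write_text("nodes = ['clean_cars', 'join_cars']\n")
    (project / "sales_nodes.py").write_text("sales_node = 'sum_sales'\n")
    (project / "helpers.py").write_text("nodes = ['ignored']\n")
    found_pipelines = discover(project)
```

In the demonstration, `helpers.py` is ignored because its file name does not match the pattern, even though it defines a `nodes` variable.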

Beginner Friendly

Just like with pytest, you just start hacking in modules ending with _nodes.py that have nodes in them, and they appear in your final pipeline.

How to use it

The readme has some great examples.

Install it

pip install find-kedro

Enable it

Enable it by changing one line in your run.py

run.py

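from typing import Dict

from kedro.context import KedroContext
from kedro.pipeline import Pipeline
from find_kedro import find_kedro

class ProjectContext(KedroContext):
    def _get_pipelines(self) -> Dict[str, Pipeline]:
        # find_kedro collects the nodes, so none of them need to be imported here
        return find_kedro()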

Or, if you're using the new hooks.py method, register it there. Again, there is no need to import all of your nodes.

hooks.py

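from typing import Dict

from kedro.framework.hooks import hook_impl
from kedro.pipeline import Pipeline
from find_kedro import find_kedro

class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        """Register the project's pipeline.
        Returns:
            A mapping from a pipeline name to a ``Pipeline`` object.
        """

        return find_kedro()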

Use it

Check out the readme for more examples, but this is the one that I use and recommend most often. This method helps keep nodes close to the functions that are designed for them.

_my_nodes.py_

# my-proj/pipelines/data_engineering/pipeline
from typing import Dict

import pandas as pd
from kedro.pipeline import node

nodes = []

def split_data(df: pd.DataFrame, ratio: float) -> Dict[str, pd.DataFrame]:
    ...

# Register the node right next to the function it wraps
nodes.append(
    node(
        split_data,
        ["example_iris_data", "params:example_test_data_ratio"],
        dict(
            train_x="example_train_x",
            train_y="example_train_y",
            test_x="example_test_x",
            test_y="example_test_y",
        ),
    )
)

Ready to start using kedro

If you still have not tried out kedro, it's easier than you think. Check out create-new-kedro-project to get a project started in just a few minutes.

I have been writing short snippets about my mentality breaking into the tech/data industry in my newsletter, 👉 check it out and let's get the conversation started.
