A `for`-loop to stop writing.

#python #sideprojects #opensource

Let's make life a whole lot simpler.

When you're working in a notebook you've probably written a for-loop like the one below.

import numpy as np 

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# This is the for-loop everybody keeps writing! 
for size in [10, 15, 20, 25, 30]:
    # At every turn in the loop we add a number to the list.
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

We're doing a simulation here, but it might as well be a grid-search. We're looping over settings in order to collect data in our list.

Pandas

We can expand this loop to get our data into a pandas dataframe.

import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

sizes = [10, 15, 20, 25, 30]
for size in sizes:
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

# At the end we put everything in a DataFrame. Neeto! 
pd.DataFrame({"size": sizes, "result": data})

So far, so good. But will this pattern work for larger grids?

It gets bad.

Let's see what happens when we add more elements we'd like to loop over.

import numpy as np 

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        data.append(birthday_experiment(class_size=size, n_sim=n_sim))

We now need two loops, and that has a consequence: how do we link each result to the size and n_sim settings that produced it when we cast this into a dataframe? You could do something like this:

import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        result = birthday_experiment(class_size=size, n_sim=n_sim)
        row = [size, n_sim, result]
        data.append(row)

# More manual labor. Kind of error prone.
df = pd.DataFrame(data, columns=["size", "n_sim", "result"])

But suddenly we're spending a lot of effort on maintaining a for-loop.

Been here before?

I've noticed that this for-loop keeps getting re-written in a lot of notebooks. You find it in simulations, but also in lots of grid-searches. It's a source of complexity, especially when our nested loops increase in size. So I figured I'd write a small package that can make all this easier.

Decorators

Let's make three minor changes to the code.

import numpy as np 
from memo import memlist

data = []

@memlist(data=data)
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

for size in [5, 10, 20, 30]:
    for n_sim in [1000, 1_000_000]:
        birthday_experiment(class_size=size, n_sim=n_sim)

We've changed three things.

We've added the memlist decorator from the memo package to our original function. This lets us configure a place to relay our stats into. Note that the decorator receives an empty list as input. It's this data list that will receive new data.
We've changed our function to output a dictionary. This way we can attach names to our output and we're able to support functions that output more than one number.
The for-loops now only run the function and don't handle any state any more.

If you were to run this code, the data variable would now contain the following information:

[
    {"class_size": 5, "n_sim": 1000, "est_proba": 0.024},
    {"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178},
    {"class_size": 10, "n_sim": 1000, "est_proba": 0.104},
    {"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062},
    {"class_size": 20, "n_sim": 1000, "est_proba": 0.415},
    {"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571},
    {"class_size": 30, "n_sim": 1000, "est_proba": 0.703},
    {"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033},
]
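
As a sanity check on the estimates above, the exact probability can be computed directly. This is a minimal sketch that assumes birthdays are uniform over 365 days, the same assumption the simulation makes. The name exact_birthday_proba is made up for this illustration and is not part of memo.

```python
from math import prod


def exact_birthday_proba(class_size, n_days=365):
    """Exact chance that at least two people in the class share a birthday."""
    # Chance that all birthdays differ: (365/365) * (364/365) * (363/365) * ...
    p_all_distinct = prod((n_days - k) / n_days for k in range(class_size))
    return 1 - p_all_distinct


for size in [5, 10, 20, 30]:
    print(size, round(exact_birthday_proba(size), 4))
```

The runs with a million simulations should land close to these exact values, while the runs with a thousand simulations are noticeably noisier.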

What's nice about a list of dictionaries is that pandas can parse it directly, without the need for you to worry about column names.

pd.DataFrame(data)
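
To make the mechanism less magical, here is a rough sketch of how a decorator in this style could work. It is not memo's actual implementation. The names collect_into and add_and_multiply are hypothetical, and the sketch assumes the function is always called with keyword arguments so that every input has a name.

```python
import functools


def collect_into(data):
    """Hypothetical stand-in for a memlist-style decorator (not memo's code)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            # Keyword arguments only, so each input comes with a column name.
            result = func(**kwargs)
            # Merge the inputs and the returned dictionary into a single row.
            data.append({**kwargs, **result})
            return result
        return wrapper
    return decorator


rows = []


@collect_into(rows)
def add_and_multiply(a, b):
    return {"sum": a + b, "product": a * b}


for a in [1, 2]:
    for b in [10, 20]:
        add_and_multiply(a=a, b=b)

print(rows[0])
```

The point of the design is that the loop body no longer touches any state: the decorator owns the bookkeeping, and the function only has to return a dictionary.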

Let's do more.

This pattern is nice, but we're still dealing with for-loops. So let's fix that and add some extra features.

import numpy as np 
from memo import memlist, memfile, grid, time_taken

data = []

@memfile(filepath="results.jsonl")
@memlist(data=data)
@time_taken()
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

setting_grid = grid(class_size=[5, 10, 20, 30], n_sim=[1000, 1_000_000])
for settings in setting_grid:
    birthday_experiment(**settings)

Pay attention to the following changes.

We've got two mem-decorators now. One decorator passes the stats to a list while the other one appends the results to a file ("results.jsonl" to be exact).
We've also added a decorator called time_taken which will make sure that we also log how long the function took to complete.
We've used the grid function to generate a grid of settings on our behalf. It gives us a generator of settings that can be passed directly to our function. This way we need one (and only one) for-loop, even if we're working on large grids. You can even configure it to show a neat little progress bar!
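
The idea behind a settings grid can be sketched with the standard library alone. The settings_grid helper below is hypothetical and only illustrates the concept; memo's own grid may behave differently, for example in the order it yields settings.

```python
import itertools


def settings_grid(**options):
    """Hypothetical helper: yield one dict per combination of the options."""
    names = list(options)
    # itertools.product walks every combination; the last option varies fastest.
    for values in itertools.product(*options.values()):
        yield dict(zip(names, values))


for settings in settings_grid(class_size=[5, 10], n_sim=[100, 1000]):
    # Each settings dict can be unpacked straight into a function call.
    print(settings)
```

Adding a third option to the call adds a dimension to the grid without adding another level of nesting to the loop.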

If you were to inspect the "results.jsonl" file it would look like this:

{"class_size": 5, "n_sim": 1000, "est_proba": 0.024, "time_taken": 0.0004899501800537109}
{"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178, "time_taken": 0.19407916069030762}
{"class_size": 10, "n_sim": 1000, "est_proba": 0.104, "time_taken": 0.000598907470703125}
{"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062, "time_taken": 0.3751380443572998}
{"class_size": 20, "n_sim": 1000, "est_proba": 0.415, "time_taken": 0.0009679794311523438}
{"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571, "time_taken": 0.7928380966186523}
{"class_size": 30, "n_sim": 1000, "est_proba": 0.703, "time_taken": 0.0018239021301269531}
{"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033, "time_taken": 1.1375510692596436}
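
Each line in that file is a standalone JSON object, which makes the format easy to append to and easy to read back. Here is a small sketch of both directions using only the standard library. The helpers append_jsonl and read_jsonl are made up for this example, and the file is written to a temporary folder so nothing on your disk gets overwritten.

```python
import json
import os
import tempfile


def append_jsonl(path, row):
    """Append one dictionary to the file as a single line of JSON."""
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
append_jsonl(path, {"class_size": 5, "n_sim": 1000, "est_proba": 0.024})
append_jsonl(path, {"class_size": 10, "n_sim": 1000, "est_proba": 0.104})

rows = read_jsonl(path)
print(len(rows), rows[0]["class_size"])
```

Because the file is only ever appended to, results from earlier rows survive if a long run crashes halfway. If you prefer a dataframe, pandas can read this format with pd.read_json(path, lines=True).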

When is this useful?

The goal for memo is to make it easier to stop worrying about that one for-loop that we all write. We should just collect stats instead. Note that you can use the decorators from this package to send information to files and lists, but also to callback functions or to a central logging service as the payload of a POST request.

I've found it useful in many projects. The main example for me is running benchmarks with scikit-learn. I do a lot of NLP, and many of my components can't be serialized with Python's pickle, which means that I cannot use the standard GridSearchCV from scikit-learn. You can even combine it with Ray to gather statistics from computations running in parallel. It also plays very nicely with HiPlot if you're interested in visualising your statistics.

If this tool sounds useful, feel free to install it via:

pip install memo

If you'd like to understand more of the details, check out the GitHub repo and the documentation page. There's also a full tutorial on calmcode.io in case you're interested.