Pandas offers many convenient utilities for handling tabular data. The pandas DataFrame is versatile, but some operations can't be done in a single command. I'll offer some advice on how you can extend DataFrame to better suit your workflow. Whenever I say DataFrame, I'm referring to the class in the pandas library.
Suppose you had a collection of students, and you'd like to select all the students that have an "A" in Ms. Frizzle's class.
Here's one way to do this in pandas:
(students.query("grade == 'A'")
         .query("teacher == 'Frizzle'"))
But what if we had a custom method?
students.select_by_grade_and_teacher("A", "Frizzle")
Note: by enclosing the operations in parentheses, you can chain several pandas operations on multiple lines, since they all return DataFrames. This is called method-chaining.
What Extending Means
All I want to do is add custom methods to the standard pandas DataFrame. Custom methods are useful because they:
- ♻️ simplify repeatable, multi-line logic
- ⛓ chain with built-in methods in DataFrame
So extending just means adding methods to DataFrame; these methods can return another DataFrame, or anything else.
Note: We could write a plain function instead, but calling it interrupts the method-chaining workflow, which tends to produce more lines of code.
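To make that note concrete, here is a minimal sketch of the function-based alternative. The students data, its column names, and the standalone select_by_grade_and_teacher function are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
students = pd.DataFrame({
    "name": ["Arnold", "Wanda", "Ralphie", "Keesha"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Ratchet"],
})


def select_by_grade_and_teacher(df, grade, teacher):
    """Plain-function version of the filter."""
    return df.query("grade == @grade").query("teacher == @teacher")


# Calling the function interrupts the chain: the call wraps the frame,
# and any further steps have to be attached to its result.
selected = select_by_grade_and_teacher(students, "A", "Frizzle")
result = selected.sort_values("name").reset_index(drop=True)

# pandas' built-in `pipe` can slot the function into a chain, although
# the function still won't show up in code-completion on the DataFrame.
piped = (students
         .pipe(select_by_grade_and_teacher, "A", "Frizzle")
         .sort_values("name")
         .reset_index(drop=True))
```

Both versions give the same rows; the difference is only in how the call reads and how easy it is to discover.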
How to Extend DataFrame
Since we'd like to add new methods to DataFrame without losing the old ones, we should consider subclassing DataFrame. That is, defining a new class that inherits from the DataFrame class.
This comes with a few caveats. We need to make sure:
- 🏠 it keeps all of the methods in DataFrame
- 🤝 when a custom method is called, it returns an instance of our class (i.e. the subclass), and not a plain DataFrame
The second point ensures we can continue to use custom methods for method-chaining.
The following class does just that:
import pandas as pd


class ExtendedDataFrame(pd.DataFrame):

    def __init__(self, *args, **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super(ExtendedDataFrame, self).__init__(*args, **kwargs)

    # this property makes it so our methods return an instance
    # of ExtendedDataFrame, instead of a regular DataFrame
    @property
    def _constructor(self):
        return ExtendedDataFrame

    # now define a custom method!
    # note that `self` is a DataFrame
    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))
See it in action in this repl.it
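As a quick check of the second caveat, the sketch below compares a subclass that skips _constructor with one that overrides it. The class names PlainSubclass and ChainableFrame and the sample data are hypothetical; _constructor is the same pandas hook used in the class above.

```python
import pandas as pd

# Hypothetical sample data for illustration.
data = {
    "name": ["Arnold", "Wanda", "Ralphie"],
    "grade": ["A", "B", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle"],
}


class PlainSubclass(pd.DataFrame):
    """Subclass that does NOT override _constructor."""


class ChainableFrame(pd.DataFrame):
    """Subclass that overrides _constructor, as in the class above."""

    @property
    def _constructor(self):
        return ChainableFrame

    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))


# Without _constructor, built-in methods hand back a regular DataFrame,
# so any custom methods would be lost after the first chained call.
plain_result = PlainSubclass(data).query("grade == 'A'")
print(type(plain_result).__name__)

# With _constructor, the subclass survives built-in calls, so custom
# methods can appear anywhere in the chain.
chained = (ChainableFrame(data)
           .sort_values("name")
           .select_by_grade_and_teacher("A", "Frizzle")
           .reset_index(drop=True))
print(type(chained).__name__)
```

The custom method can be called after sort_values only because sort_values returned the subclass rather than a regular DataFrame.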
Note: The pandas documentation has a page on extensions, but it's quite advanced, and includes many other topics
When to Extend
This is a useful pattern for data exploration, especially in Jupyter Notebooks, or any environment that has code-completion. You can use it to:
- ⏲ shorten highly-repeated tasks
- 👩🏽‍🎓 package a utility with a specific dataset to share with your team, especially if they're not pandas experts like you
- 📊 construct methods that make plots outside of the standard functionality
Final Thoughts
I do not recommend using this pattern in production, since it's not officially endorsed by pandas. Furthermore, you run the risk of conflicting with the current/future DataFrame API.
I'll leave you with an extended DataFrame that I often use. It has some more advanced features, and all of the methods begin with my initials, pj_ (so they're easier to find in code-completion).
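As a rough sketch of the prefix convention, the class below puts a single custom method behind a pj_ prefix. The class name PrefixedFrame and the pj_missing_counts method are made up for illustration and are not taken from the extended DataFrame mentioned above.

```python
import pandas as pd


class PrefixedFrame(pd.DataFrame):
    """Hypothetical subclass whose custom methods share a `pj_` prefix."""

    @property
    def _constructor(self):
        return PrefixedFrame

    def pj_missing_counts(self):
        # Number of missing values in each column, returned as a Series.
        return self.isna().sum()


df = PrefixedFrame({
    "grade": ["A", None, "B"],
    "teacher": ["Frizzle", "Frizzle", None],
})

# Typing `df.pj_` in an editor with code-completion narrows the list to
# the custom methods; the same filter can be done by hand with dir().
custom_methods = [name for name in dir(df) if name.startswith("pj_")]
missing = df.pj_missing_counts()
```

A distinctive prefix also lowers the chance that a custom method name collides with a current or future DataFrame method.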
Feel free to ask any questions, and let me know if you find something interesting!