Why?
Why am I writing the billionth article on Pandas data munging? Aren't the other articles enough? Maybe! If you've already read a few of those and feel like you know a lot about Pandas, your time might be better spent moving on to some of these materials:
- Python for Data Analysis by Pandas BDFL Wes McKinney
- Enhancing Performance by the Pandas dev team
- Introduction to Dask for parallel computing in Python
- The Python Data Science Handbook by Jake VanderPlas
Getting started
Quick note: throughout this article I'm going to refer to features and observations. When you start working with multidimensional problems it's an important distinction, but in this article a feature is the same as a spreadsheet column, and an observation is the same as a spreadsheet row.
import numpy as np
import pandas as pd
We're going to work with some random data, which you can construct directly. Pandas DataFrames are constructed from Pandas Series objects, which in turn use numpy objects and operations for many of their behaviors.
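As a quick illustration of that relationship (the values below are purely for demonstration):
# A Series wraps a numpy array and adds labels; a DataFrame is a dict-like collection of Series
s = pd.Series([0.5, 1.5, 2.5], name="example")
s.to_numpy()                   # the underlying numpy array
pd.DataFrame({"example": s})   # a one-column DataFrame built from that Series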
# Create a DataFrame with some randomized columns
df = pd.DataFrame({
"str_feature": [np.random.choice(["dog", "cat", "snake"]) for _ in range(10000)],
"int_feature": np.arange(10000)
})
The .head method can grab the top n rows, five by default.
df.head()
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
df.head(10)
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
5 | snake | 5 |
6 | snake | 6 |
7 | dog | 7 |
8 | cat | 8 |
9 | dog | 9 |
Of course in some circumstances we don't want to take the first 5 or 10. We can take a pseudo-random sample of the data using .sample. Note that .sample defaults to 1 observation.
df.sample(5)
| | str_feature | int_feature |
---|---|---|
2476 | cat | 2476 |
2793 | dog | 2793 |
3120 | snake | 3120 |
9338 | snake | 9338 |
3593 | dog | 3593 |
We can get the underlying types by accessing the .dtypes attribute.
df.dtypes
str_feature object
int_feature int32
dtype: object
It should be stressed that object features are less memory efficient than int features. A discussion of the basic dtypes is included in the official getting started guide.
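One quick way to see the difference for the frame above is .memory_usage; casting the object column to a categorical dtype is one common remedy (shown here as a sketch, not a required step):
# Per-column memory usage; deep=True accounts for the Python string objects
df.memory_usage(deep=True)
# The same column stored as a categorical dtype is usually much smaller
df["str_feature"].astype("category").memory_usage(deep=True)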
Descriptive statistics can be reported using .describe -- note that only numerical features will be summarized.
df.describe()
| | int_feature |
---|---|
count | 10000.00000 |
mean | 4999.50000 |
std | 2886.89568 |
min | 0.00000 |
25% | 2499.75000 |
50% | 4999.50000 |
75% | 7499.25000 |
max | 9999.00000 |
The descriptive statistics in this report have their own methods, for example here's std:
df.std()
int_feature 2886.89568
dtype: float64
pandas-profiling is a new-ish package that creates much more detailed reports from a DataFrame. You can check it out here.
.size returns the number of observations multiplied by the number of features.
df.size
20000
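If you want the observation and feature counts separately rather than their product, .shape returns them as a tuple:
df.shape
(10000, 2)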
.info returns metadata about the DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
str_feature 10000 non-null object
int_feature 10000 non-null int32
dtypes: int32(1), object(1)
memory usage: 78.2+ KB
Selection and Assignment
We tend to access data from the DataFrame using the following patterns:
- df[condition]
- df.iloc & df.loc
- df.at & df.iat
# First pattern
df[df['str_feature'] == 'snake'].head()
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
5 | snake | 5 |
6 | snake | 6 |
14 | snake | 14 |
18 | snake | 18 |
# Second pattern
df.loc[:, "int_feature"]
0 0
1 1
2 2
3 3
4 4
...
9995 9995
9996 9996
9997 9997
9998 9998
9999 9999
Name: int_feature, Length: 10000, dtype: int32
Note that here the lone : means to access everything on that axis, the same way my_list[:] tells a plain Python list to access everything.
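.loc can also take a row slice and a list of columns at once; note that label slices are inclusive of both endpoints. A small sketch using the frame above:
# Label-based selection: rows labeled 0 through 4 (inclusive), both columns
df.loc[0:4, ["str_feature", "int_feature"]]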
.iloc behaves similarly, but selects by integer position rather than by label.
df.iloc[0:5]
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
df.iloc[[0,5]]
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
5 | snake | 5 |
.at and .iat are used similarly to .loc and .iloc, but they can only access a single value at a time.
df.at[0:5, 'str_feature'] will give a ValueError because of the 0:5 slice. df.at[0] will give a TypeError because no column is specified.
df.at[0, 'str_feature']
'snake'
There is a discussion on SO about how you can get some performance gains by accepting the limitations of .at.
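If you want to see the difference yourself, here's a rough sketch with the stdlib timeit module (exact numbers will vary by machine and Pandas version):
import timeit
# .at uses a scalar fast path; .loc goes through the general indexing machinery
timeit.timeit(lambda: df.loc[123, "str_feature"], number=10000)
timeit.timeit(lambda: df.at[123, "str_feature"], number=10000)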
There's one more method I want to cover for accessing data from our DataFrame:
df.int_feature.head()
0 0
1 1
2 2
3 3
4 4
Name: int_feature, dtype: int32
This works fine until:
- Your feature is named my super feature and you try to access df.my super feature
- You forget that the reverse operation is illegal.
That's right, you cannot do this:
df.new_feature = df[df['int_feature'] == 1]
c:\users\cdl\documents\personalproject\common_pandas_blog\.venv\lib\site-packages\ipykernel_launcher.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
"""Entry point for launching an IPython kernel.
Trying to do so will trigger a warning, at least in newer versions of Pandas. We can confirm that the feature was not added to the DataFrame:
df.head()
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
Instead, let's add a new feature like so:
df['new_feature'] = df.loc[:, 'str_feature'] == 'snake'
Now we can see that the new feature was added properly:
df.head()
| | str_feature | int_feature | new_feature |
---|---|---|---|
0 | snake | 0 | True |
1 | dog | 1 | False |
2 | cat | 2 | False |
3 | cat | 3 | False |
4 | dog | 4 | False |
Manipulation
Let's perform some common tasks to manipulate our data. Namely:
- Filter values
- Handle missing or invalid values
- SQL-like join two DataFrames
- SQL-like groupby
1. Filtering values
To filter values, we can simply overwrite our df variable like so:
# We're cat people now.
df = df[df.loc[:, 'str_feature'] == 'cat']
df.head()
| | str_feature | int_feature | new_feature |
---|---|---|---|
2 | cat | 2 | False |
3 | cat | 3 | False |
8 | cat | 8 | False |
10 | cat | 10 | False |
11 | cat | 11 | False |
Since we've just filtered our DataFrame down to only the cat rows, our snakes-only new_feature will be full of False values and str_feature will be all cats. Let's filter our columns.
df.drop(["new_feature", "str_feature"], axis=1, inplace=True)
df.head()
| | int_feature |
---|---|
2 | 2 |
3 | 3 |
8 | 8 |
10 | 10 |
11 | 11 |
As discussed here, the axis argument tells pandas to drop on the basis of the features axis, and the inplace argument makes this equivalent to df = df.drop(...).
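If you find axis numbers hard to remember, newer Pandas versions also accept a columns keyword that does the same thing. A tiny sketch on a throwaway frame (the names are made up for illustration):
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
demo.drop(columns=["b", "c"])    # same result as demo.drop(["b", "c"], axis=1)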
As a final note, don't make the mistake of assuming that any of these methods will automatically free up memory. Instead, use tools like Python del and pandas I/O features like low_memory or memory_map.
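For example, explicitly deleting references to intermediate frames you no longer need (a minimal sketch; the gc.collect call is usually optional):
import gc
intermediate = df.copy()    # stand-in for some large temporary result
del intermediate            # drop the reference once you're done with it
gc.collect()                # optionally nudge the garbage collector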
2. Handle missing or invalid values
Another common data munging task is to handle cases where the data is invalid in some way. Let's make a new, messy DataFrame:
# Create a DataFrame with some randomized columns
df = pd.DataFrame({
"str_feature": [np.random.choice(["dog", "cat", "snake"]) for _ in range(10000)],
"int_feature": np.arange(10000),
"messy_feature": [np.random.choice([1, None, np.nan, False, True, -1]) for _ in range(10000)]
})
The simplest thing we could do is clean our data by filtering out all observations with np.nan, using .isna:
# Note the rarely used ~ negation operator below
clean_df = df[~df['messy_feature'].isna()]
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
4134 | cat | 4134 | -1 |
2127 | snake | 2127 | -1 |
7014 | dog | 7014 | False |
922 | dog | 922 | False |
5451 | dog | 5451 | True |
5059 | cat | 5059 | -1 |
1289 | snake | 1289 | False |
5016 | dog | 5016 | -1 |
9765 | snake | 9765 | True |
7094 | cat | 7094 | 1 |
As you can see, this method will clear just the NaN values. That's great, but what if we only want values that are truthy in Python? df[df['messy_feature']] won't work, because it's a ValueError to filter using a column with NaN values. Instead, we have to filter out the NaN values first, then check for truthiness. Just like a chain of and conditions in plain Python, the conditions are checked left to right.
clean_df = df[~df['messy_feature'].isna() & df['messy_feature']]
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
1049 | snake | 1049 | 1 |
4769 | snake | 4769 | -1 |
4924 | cat | 4924 | -1 |
6907 | snake | 6907 | -1 |
5914 | dog | 5914 | True |
6291 | cat | 6291 | 1 |
5872 | dog | 5872 | 1 |
1794 | cat | 1794 | 1 |
8987 | cat | 8987 | 1 |
3848 | dog | 3848 | True |
Imputing values is easy too:
clean_df = df.copy()
clean_df['messy_feature'].fillna(clean_df['str_feature'], inplace=True)
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
8063 | snake | 8063 | snake |
3901 | cat | 3901 | cat |
3702 | dog | 3702 | -1 |
906 | dog | 906 | 1 |
8039 | dog | 8039 | dog |
2433 | dog | 2433 | -1 |
4996 | snake | 4996 | False |
3015 | snake | 3015 | snake |
8307 | cat | 8307 | 1 |
1904 | cat | 1904 | cat |
And we can go beyond simple .fillna:
clean_df = df.copy()
clean_df['messy_feature'] = clean_df.apply(
lambda row: row['messy_feature'] if row['messy_feature'] == -1 else row['str_feature'],
axis=1
)
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
9217 | snake | 9217 | snake |
5018 | snake | 5018 | snake |
6383 | dog | 6383 | dog |
1297 | cat | 1297 | -1 |
9701 | dog | 9701 | dog |
5192 | cat | 5192 | cat |
4018 | dog | 4018 | dog |
7619 | snake | 7619 | snake |
9890 | dog | 9890 | dog |
5102 | snake | 5102 | snake |
Just be careful with .apply as it has the potential to significantly degrade performance. See: enhancing performance in the Pandas docs.
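One common way to avoid the row-wise .apply above is a vectorized np.where, which expresses the same rule (keep -1, otherwise fall back to str_feature):
clean_df = df.copy()
clean_df['messy_feature'] = np.where(
    clean_df['messy_feature'] == -1,    # condition, evaluated element-wise
    clean_df['messy_feature'],          # value used where the condition is True
    clean_df['str_feature'],            # value used where it is False
)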
3. SQL-like joins
I would definitely recommend that you learn about SQL joins and then just map that knowledge to Pandas joins. The arguments like on, how, left, and right all make sense if you just keep in mind how the joins would work in SQL.
merged_df = df.copy().join(df.copy(), how='left', lsuffix="_l", rsuffix="_r")
merged_df
| | str_feature_l | int_feature_l | messy_feature_l | str_feature_r | int_feature_r | messy_feature_r |
---|---|---|---|---|---|---|
0 | snake | 0 | False | snake | 0 | False |
1 | cat | 1 | NaN | cat | 1 | NaN |
2 | snake | 2 | False | snake | 2 | False |
3 | dog | 3 | False | dog | 3 | False |
4 | snake | 4 | None | snake | 4 | None |
... | ... | ... | ... | ... | ... | ... |
9995 | snake | 9995 | False | snake | 9995 | False |
9996 | dog | 9996 | True | dog | 9996 | True |
9997 | snake | 9997 | NaN | snake | 9997 | NaN |
9998 | dog | 9998 | NaN | dog | 9998 | NaN |
9999 | dog | 9999 | 1 | dog | 9999 | 1 |
10000 rows × 6 columns
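The .join above matches on the index; pd.merge is the more general, SQL-flavored entry point and joins on an explicit key column. Here's a sketch with a lookup table invented purely for illustration:
# A made-up lookup table keyed on str_feature
legs = pd.DataFrame({"str_feature": ["dog", "cat", "snake"], "legs": [4, 4, 0]})
# LEFT JOIN on the shared column, just like SQL
pd.merge(df, legs, on="str_feature", how="left").head()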
4. SQL-like groupby
Like the SQL JOIN advice, I strongly recommend learning about SQL GROUP BY and letting that be the basis of your Pandas .groupby
knowledge. Here's a simple GROUP BY SUM in Pandas
df.groupby('str_feature').sum()
| | int_feature |
---|---|
str_feature | |
cat | 16476822 |
dog | 16729175 |
snake | 16789003 |
Note that .groupby must be followed by a method that aggregates the grouping. Let's see what happens if we neglect that part...
df.groupby('str_feature')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A6F91D0>
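Nothing is computed until you aggregate that lazy object, either with a single method like .sum above or with .agg for several statistics at once:
# Aggregate the grouped object explicitly
df.groupby('str_feature')['int_feature'].agg(['count', 'mean', 'max'])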
.groupby is particularly powerful when combined with the .apply method...but my earlier warning about the performance of .apply still holds.
# Simple sum/size calculation
df.groupby(['str_feature']).apply(lambda x: x.sum()/x.size)
| | int_feature | messy_feature |
---|---|---|
str_feature | | |
cat | 2485.191855 | 0.088688 |
dog | 2509.627213 | 0.086709 |
snake | 2504.326223 | 0.089499 |
Moving Windows
Often we want to calculate rolling values (like rolling sums):
df['rolling'] = df.rolling(3).sum()
df.head(10)
| | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|
0 | snake | 0 | False | NaN |
1 | cat | 1 | NaN | NaN |
2 | snake | 2 | False | 3.0 |
3 | dog | 3 | False | 6.0 |
4 | snake | 4 | None | 9.0 |
5 | snake | 5 | NaN | 12.0 |
6 | snake | 6 | None | 15.0 |
7 | cat | 7 | -1 | 18.0 |
8 | cat | 8 | NaN | 21.0 |
9 | snake | 9 | NaN | 24.0 |
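The same idea works per column, and the min_periods argument controls what happens in the first, incomplete windows:
# Rolling mean of one column; min_periods=1 avoids NaN in the first two windows
rolling_mean = df['int_feature'].rolling(window=3, min_periods=1).mean()
rolling_mean.head()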
Indexing by Time
A related task is to work with datetimes, which we accomplish using pd.to_datetime. There are helper methods like .astype for casting to datetime types, as well as .set_index if we want to use a datetime column as our index.
import time
df['Datetime'] = pd.to_datetime([time.time() - (86400*x) for x in range(df.shape[0])], unit='s')
# Round to the nearest second
# see https://stackoverflow.com/a/13788301/10553976
df['Datetime'] = np.round(df['Datetime'].astype(np.int64), -9).astype('datetime64[ns]')
# Set as index
df = df.set_index('Datetime')
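With the DatetimeIndex in place, label slicing by date strings and time-based resampling become available. A small sketch (the dates simply follow the synthetic timestamps generated above):
# All observations from a single month, via partial string indexing
df.loc['2019-10'].head()
# Weekly means of the integer feature (sort first, since our index runs backwards in time)
df['int_feature'].sort_index().resample('W').mean().head()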
I/O
One of the most attractive features of Pandas is the robust I/O capabilities. The whole list is here but let's take a look at two important ones, CSV and SQL.
CSV
Writing a CSV in Pandas couldn't be easier:
df.to_csv('./data.csv')
Reading it back in is just as simple:
df = pd.read_csv('./data.csv')
But wait: let's look at our df now...
df.head()
| | Datetime | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|---|
0 | 2019-10-08 21:51:50 | snake | 0 | False | NaN |
1 | 2019-10-07 21:51:50 | cat | 1 | NaN | NaN |
2 | 2019-10-06 21:51:50 | snake | 2 | False | 3.0 |
3 | 2019-10-05 21:51:50 | dog | 3 | False | 6.0 |
4 | 2019-10-04 21:51:50 | snake | 4 | NaN | 9.0 |
Pandas can't infer that any particular column is the index, so .read_csv created a DataFrame with a simple index instead of using the DatetimeIndex we set before writing to a file. Of course we could always use .set_index after the fact, but in resource-constrained scenarios we want control over how large the created DataFrame will be. Searching for a solution, we come to the .read_csv documentation, and the power of this tool immediately becomes apparent:
- Read CSVs from URL (!!!!!), path, or file-like
- Specify the exact window in a CSV file to read from with kwargs like header, index_col, usecols, nrows, skiprows, and skipfooter
- Manage NaN with kwargs like na_filter and na_values
- sep for handling tsv and other delimiter situations
- Kwargs to help save memory like low_memory, memory_map, and dtype
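Here's a sketch combining a few of those kwargs; the specific values are illustrative, not required:
# Read only two columns and the first 1000 observations, parsing timestamps on the way in
preview = pd.read_csv(
    './data.csv',
    usecols=['Datetime', 'int_feature'],
    nrows=1000,
    parse_dates=['Datetime'],
)
preview.head()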
Armed with tons of new knowledge about how to read csv files with Pandas, we can confidently read our data.csv file with the DatetimeIndex specified.
df = pd.read_csv('./data.csv', index_col='Datetime')
df.head()
| | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|
Datetime | | | | |
2019-10-08 21:51:50 | snake | 0 | False | NaN |
2019-10-07 21:51:50 | cat | 1 | NaN | NaN |
2019-10-06 21:51:50 | snake | 2 | False | 3.0 |
2019-10-05 21:51:50 | dog | 3 | False | 6.0 |
2019-10-04 21:51:50 | snake | 4 | NaN | 9.0 |
Just the way it originally was! Let's clean that file up:
import os
os.remove('./data.csv')
Note: if you aren't using Pandas features for data manipulation, consider using the stdlib csv library for I/O instead of Pandas. Pandas objects create some overhead in memory, and Pandas itself is a nontrivial dependency.
SQL
If you want to read and write data that lives in a SQL database, Pandas provides the pd.read_sql and .to_sql methods to interface with databases. In this example we'll use the stdlib sqlite3 DB-API module, but Pandas can integrate with any kind of DB that is supported by sqlalchemy. Also note that we're going to use a try...finally block to ensure that .close is always called.
import sqlite3
# DB operations
try:
# Connect to db
conn = sqlite3.connect('data.db')
# Write dataframe to table 'df'
df.to_sql('df', conn, if_exists='replace')
# Read a SELECT * query of the same table, specifying index column
sqldf = pd.read_sql('SELECT * FROM df', conn, index_col='Datetime')
# Ensure the connection closes
finally:
conn.close()
# Cleanup file
os.remove('data.db')
sqldf.head()
| | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|
Datetime | | | | |
2019-10-08 21:51:50 | snake | 0 | False | NaN |
2019-10-07 21:51:50 | cat | 1 | None | NaN |
2019-10-06 21:51:50 | snake | 2 | False | 3.0 |
2019-10-05 21:51:50 | dog | 3 | False | 6.0 |
2019-10-04 21:51:50 | snake | 4 | None | 9.0 |
Once again our data has made a round trip from memory to disk and back again. Well done Pandas and well done to you, dear reader!
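If you have SQLAlchemy installed, the same two calls accept an engine instead of a raw connection, which is how you would talk to Postgres, MySQL, and friends. A sketch (the connection string is illustrative; swap in your own database URL):
from sqlalchemy import create_engine
# Same round trip, this time through an engine
engine = create_engine('sqlite:///data.db')
df.to_sql('df', engine, if_exists='replace')
sqldf = pd.read_sql('SELECT * FROM df', engine, index_col='Datetime')
# Cleanup
engine.dispose()
os.remove('data.db')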
Wrapping Up
In this article you've learned how Pandas implements some basic data munging tasks, specifically:
- Extracting metadata and descriptive statistics
- Filtering, imputing, joining and grouping data
- Windowed functions (like rolling sum)
- Using Datetime data and indexing by time
- Reading and writing data to files and databases
Happy wrangling!
Image by Cimberley on Pixabay