Why?
Why am I writing the billionth article on Pandas data munging? Aren't the other articles enough? Maybe! If you've already read a few of those and feel like you know a lot about Pandas, your time might be better spent moving on to some of these materials:
- Python for Data Analysis by Pandas BDFL Wes McKinney
- Enhancing Performance by the Pandas dev team
- Introduction to Dask for parallel computing in Python
- The Python Data Science Handbook by Jake VanderPlas
Getting started
Quick note: throughout this article I'm going to refer to features and observations. When you start working with multidimensional problems it's an important distinction, but in this article a feature is the same as a spreadsheet column, and an observation is the same as a spreadsheet row.
import numpy as np
import pandas as pd
We're going to work with some random data, which you can construct directly. Pandas DataFrames are constructed from Pandas Series objects, which in turn use numpy objects and operations for many of their behaviors.
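As a quick illustration of that relationship (the values below are purely for demonstration):
# A Series wraps a numpy array and adds labels; a DataFrame is a dict-like collection of Series
s = pd.Series([0.5, 1.5, 2.5], name="example")
s.to_numpy()                   # the underlying numpy array
pd.DataFrame({"example": s})   # a one-column DataFrame built from that Series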
# Create a DataFrame with some randomized columns
df = pd.DataFrame({
"str_feature": [np.random.choice(["dog", "cat", "snake"]) for _ in range(10000)],
"int_feature": np.arange(10000)
})
The .head method can grab the top n rows, five by default.
df.head()
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
df.head(10)
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
5 | snake | 5 |
6 | snake | 6 |
7 | dog | 7 |
8 | cat | 8 |
9 | dog | 9 |
Of course in some circumstances we don't want to take the first 5 or 10. We can take a pseudo-random sample of the data using .sample. Note that .sample defaults to 1 observation.
df.sample(5)
| | str_feature | int_feature |
---|---|---|
2476 | cat | 2476 |
2793 | dog | 2793 |
3120 | snake | 3120 |
9338 | snake | 9338 |
3593 | dog | 3593 |
We can get the underlying types by accessing the .dtypes attribute.
df.dtypes
str_feature object
int_feature int32
dtype: object
It should be stressed that object features are less memory efficient than int features. A discussion of the basic dtypes is included in the official getting started guide.
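One quick way to see the difference for the frame above is .memory_usage; casting the object column to a categorical dtype is one common remedy (shown here as a sketch, not a required step):
# Per-column memory usage; deep=True accounts for the Python string objects
df.memory_usage(deep=True)
# The same column stored as a categorical dtype is usually much smaller
df["str_feature"].astype("category").memory_usage(deep=True)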
Descriptive statistics can be reported using .describe -- note that only numerical features will be summarized.
df.describe()
| | int_feature |
---|---|
count | 10000.00000 |
mean | 4999.50000 |
std | 2886.89568 |
min | 0.00000 |
25% | 2499.75000 |
50% | 4999.50000 |
75% | 7499.25000 |
max | 9999.00000 |
The descriptive statistics in this report have their own methods, for example here's std:
df.std()
int_feature 2886.89568
dtype: float64
pandas-profiling is a new-ish package that creates much more detailed reports from a DataFrame. You can check it out here.
.size returns the number of observations multiplied by the number of features.
df.size
20000
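If you want the observation and feature counts separately rather than their product, .shape returns them as a tuple:
df.shape
(10000, 2)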
.info returns metadata about the DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
str_feature 10000 non-null object
int_feature 10000 non-null int32
dtypes: int32(1), object(1)
memory usage: 78.2+ KB
Selection and Assignment
We tend to access data from the DataFrame using the following patterns:
- df[condition]
- df.iloc & df.loc
- df.at & df.iat
# First pattern
df[df['str_feature'] == 'snake'].head()
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
5 | snake | 5 |
6 | snake | 6 |
14 | snake | 14 |
18 | snake | 18 |
# Second pattern
df.loc[:, "int_feature"]
0 0
1 1
2 2
3 3
4 4
...
9995 9995
9996 9996
9997 9997
9998 9998
9999 9999
Name: int_feature, Length: 10000, dtype: int32
Note that here the lone : means to access everything on that axis, the same way my_list[:] tells a plain Python list to access everything.
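.loc can also take a row slice and a list of columns at once; note that label slices are inclusive of both endpoints. A small sketch using the frame above:
# Label-based selection: rows labeled 0 through 4 (inclusive), both columns
df.loc[0:4, ["str_feature", "int_feature"]]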
.iloc behaves similarly, but selects by integer position rather than by label.
df.iloc[0:5]
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
df.iloc[[0,5]]
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
5 | snake | 5 |
.at and .iat are used similarly to .loc and .iloc, but they can only access a single value at a time.
df.at[0:5, 'str_feature'] will give a ValueError because of the 0:5 slice. df.at[0] will give a TypeError because no column is specified.
df.at[0, 'str_feature']
'snake'
There is a discussion on SO about how you can get some performance gains by accepting the limitations of .at.
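If you want to see the difference yourself, here's a rough sketch with the stdlib timeit module (exact numbers will vary by machine and Pandas version):
import timeit
# .at uses a scalar fast path; .loc goes through the general indexing machinery
timeit.timeit(lambda: df.loc[123, "str_feature"], number=10000)
timeit.timeit(lambda: df.at[123, "str_feature"], number=10000)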
There's one more method I want to cover for accessing data from our DataFrame:
df.int_feature.head()
0 0
1 1
2 2
3 3
4 4
Name: int_feature, dtype: int32
This works fine until:
- Your feature is named my super feature and you try to access df.my super feature
- You forget that the reverse operation is illegal.
That's right, you cannot do this:
df.new_feature = df[df['int_feature'] == 1]
c:\users\cdl\documents\personalproject\common_pandas_blog\.venv\lib\site-packages\ipykernel_launcher.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
"""Entry point for launching an IPython kernel.
Trying to do so will trigger a warning, at least in newer versions of Pandas. We can confirm that the feature was not added to the DataFrame:
df.head()
| | str_feature | int_feature |
---|---|---|
0 | snake | 0 |
1 | dog | 1 |
2 | cat | 2 |
3 | cat | 3 |
4 | dog | 4 |
Instead, let's add a new feature like so:
df['new_feature'] = df.loc[:, 'str_feature'] == 'snake'
Now we can see that the new feature was added properly:
df.head()
| | str_feature | int_feature | new_feature |
---|---|---|---|
0 | snake | 0 | True |
1 | dog | 1 | False |
2 | cat | 2 | False |
3 | cat | 3 | False |
4 | dog | 4 | False |
Manipulation
Let's perform some common tasks to manipulate our data. Namely:
- Filter values
- Handle missing or invalid values
- SQL-like join two DataFrames
- SQL-like groupby
1. Filtering values
To filter values, we can simply overwrite our df variable like so:
# We're cat people now.
df = df[df.loc[:, 'str_feature'] == 'cat']
df.head()
| | str_feature | int_feature | new_feature |
---|---|---|---|
2 | cat | 2 | False |
3 | cat | 3 | False |
8 | cat | 8 | False |
10 | cat | 10 | False |
11 | cat | 11 | False |
Since we've just filtered our DataFrame down to only the cat rows, our snakes-only new_feature will be full of False values and str_feature will be all cats. Let's filter our columns.
df.drop(["new_feature", "str_feature"], axis=1, inplace=True)
df.head()
| | int_feature |
---|---|
2 | 2 |
3 | 3 |
8 | 8 |
10 | 10 |
11 | 11 |
As discussed here, the axis argument tells pandas to drop on the basis of the features axis, and the inplace argument makes this equivalent to df = df.drop(...).
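If you find axis numbers hard to remember, newer Pandas versions also accept a columns keyword that does the same thing. A tiny sketch on a throwaway frame (the names are made up for illustration):
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
demo.drop(columns=["b", "c"])    # same result as demo.drop(["b", "c"], axis=1)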
As a final note, don't make the mistake of assuming that any of these methods will automatically free up memory. Instead, use tools like Python del and pandas I/O features like low_memory or memory_map.
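For example, explicitly deleting references to intermediate frames you no longer need (a minimal sketch; the gc.collect call is usually optional):
import gc
intermediate = df.copy()    # stand-in for some large temporary result
del intermediate            # drop the reference once you're done with it
gc.collect()                # optionally nudge the garbage collector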
2. Handle missing or invalid values
Another common data munging task is to handle cases where the data is invalid in some way. Let's make a new, messy DataFrame:
# Create a DataFrame with some randomized columns
df = pd.DataFrame({
"str_feature": [np.random.choice(["dog", "cat", "snake"]) for _ in range(10000)],
"int_feature": np.arange(10000),
"messy_feature": [np.random.choice([1, None, np.nan, False, True, -1]) for _ in range(10000)]
})
The simplest thing we could do is clean our data by filtering out all observations with np.nan, using .isna:
# Note the rarely used ~ negation operator below
clean_df = df[~df['messy_feature'].isna()]
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
4134 | cat | 4134 | -1 |
2127 | snake | 2127 | -1 |
7014 | dog | 7014 | False |
922 | dog | 922 | False |
5451 | dog | 5451 | True |
5059 | cat | 5059 | -1 |
1289 | snake | 1289 | False |
5016 | dog | 5016 | -1 |
9765 | snake | 9765 | True |
7094 | cat | 7094 | 1 |
As you can see, this method will clear just the NaN values. That's great, but what if we only want values that are truthy in Python? df[df['messy_feature']] won't work, because it's a ValueError to filter using a column with NaN values. Instead, we have to filter out the NaN values first, then check for truthiness. Just like a chain of and conditions in plain Python, the conditions are checked left to right.
clean_df = df[~df['messy_feature'].isna() & df['messy_feature']]
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
1049 | snake | 1049 | 1 |
4769 | snake | 4769 | -1 |
4924 | cat | 4924 | -1 |
6907 | snake | 6907 | -1 |
5914 | dog | 5914 | True |
6291 | cat | 6291 | 1 |
5872 | dog | 5872 | 1 |
1794 | cat | 1794 | 1 |
8987 | cat | 8987 | 1 |
3848 | dog | 3848 | True |
Imputing values is easy too:
clean_df = df.copy()
clean_df['messy_feature'].fillna(clean_df['str_feature'], inplace=True)
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
8063 | snake | 8063 | snake |
3901 | cat | 3901 | cat |
3702 | dog | 3702 | -1 |
906 | dog | 906 | 1 |
8039 | dog | 8039 | dog |
2433 | dog | 2433 | -1 |
4996 | snake | 4996 | False |
3015 | snake | 3015 | snake |
8307 | cat | 8307 | 1 |
1904 | cat | 1904 | cat |
And we can go beyond simple .fillna:
clean_df = df.copy()
clean_df['messy_feature'] = clean_df.apply(
lambda row: row['messy_feature'] if row['messy_feature'] == -1 else row['str_feature'],
axis=1
)
clean_df.sample(10)
| | str_feature | int_feature | messy_feature |
---|---|---|---|
9217 | snake | 9217 | snake |
5018 | snake | 5018 | snake |
6383 | dog | 6383 | dog |
1297 | cat | 1297 | -1 |
9701 | dog | 9701 | dog |
5192 | cat | 5192 | cat |
4018 | dog | 4018 | dog |
7619 | snake | 7619 | snake |
9890 | dog | 9890 | dog |
5102 | snake | 5102 | snake |
Just be careful with .apply as it has the potential to significantly degrade performance. See: enhancing performance in the Pandas docs.
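One common way to avoid the row-wise .apply above is a vectorized np.where, which expresses the same rule (keep -1, otherwise fall back to str_feature):
clean_df = df.copy()
clean_df['messy_feature'] = np.where(
    clean_df['messy_feature'] == -1,    # condition, evaluated element-wise
    clean_df['messy_feature'],          # value used where the condition is True
    clean_df['str_feature'],            # value used where it is False
)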
3. SQL-like joins
I would definitely recommend that you learn about SQL joins and then just map that knowledge to Pandas joins. The arguments like on, how, left, and right all make sense if you just keep in mind how the joins would work in SQL.
merged_df = df.copy().join(df.copy(), how='left', lsuffix="_l", rsuffix="_r")
merged_df
| | str_feature_l | int_feature_l | messy_feature_l | str_feature_r | int_feature_r | messy_feature_r |
---|---|---|---|---|---|---|
0 | snake | 0 | False | snake | 0 | False |
1 | cat | 1 | NaN | cat | 1 | NaN |
2 | snake | 2 | False | snake | 2 | False |
3 | dog | 3 | False | dog | 3 | False |
4 | snake | 4 | None | snake | 4 | None |
... | ... | ... | ... | ... | ... | ... |
9995 | snake | 9995 | False | snake | 9995 | False |
9996 | dog | 9996 | True | dog | 9996 | True |
9997 | snake | 9997 | NaN | snake | 9997 | NaN |
9998 | dog | 9998 | NaN | dog | 9998 | NaN |
9999 | dog | 9999 | 1 | dog | 9999 | 1 |
10000 rows × 6 columns
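The .join above matches on the index; pd.merge is the more general, SQL-flavored entry point and joins on an explicit key column. Here's a sketch with a lookup table invented purely for illustration:
# A made-up lookup table keyed on str_feature
legs = pd.DataFrame({"str_feature": ["dog", "cat", "snake"], "legs": [4, 4, 0]})
# LEFT JOIN on the shared column, just like SQL
pd.merge(df, legs, on="str_feature", how="left").head()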
4. SQL-like groupby
Like the SQL JOIN advice, I strongly recommend learning about SQL GROUP BY and letting that be the basis of your Pandas .groupby
knowledge. Here's a simple GROUP BY SUM in Pandas
df.groupby('str_feature').sum()
| | int_feature |
---|---|
str_feature | |
cat | 16476822 |
dog | 16729175 |
snake | 16789003 |
Note that .groupby must be followed by a method that aggregates the grouping. Let's see what happens if we neglect that part...
df.groupby('str_feature')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A6F91D0>
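Nothing is computed until you aggregate that lazy object, either with a single method like .sum above or with .agg for several statistics at once:
# Aggregate the grouped object explicitly
df.groupby('str_feature')['int_feature'].agg(['count', 'mean', 'max'])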
.groupby is particularly powerful when combined with the .apply method...but my earlier warning about the performance of .apply still holds.
# Simple sum/size calculation
df.groupby(['str_feature']).apply(lambda x: x.sum()/x.size)
| | int_feature | messy_feature |
---|---|---|
str_feature | | |
cat | 2485.191855 | 0.088688 |
dog | 2509.627213 | 0.086709 |
snake | 2504.326223 | 0.089499 |
Moving Windows
Often we want to calculate rolling values (like rolling sums):
df['rolling'] = df.rolling(3).sum()
df.head(10)
| | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|
0 | snake | 0 | False | NaN |
1 | cat | 1 | NaN | NaN |
2 | snake | 2 | False | 3.0 |
3 | dog | 3 | False | 6.0 |
4 | snake | 4 | None | 9.0 |
5 | snake | 5 | NaN | 12.0 |
6 | snake | 6 | None | 15.0 |
7 | cat | 7 | -1 | 18.0 |
8 | cat | 8 | NaN | 21.0 |
9 | snake | 9 | NaN | 24.0 |
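The same idea works per column, and the min_periods argument controls what happens in the first, incomplete windows:
# Rolling mean of one column; min_periods=1 avoids NaN in the first two windows
rolling_mean = df['int_feature'].rolling(window=3, min_periods=1).mean()
rolling_mean.head()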
Indexing by Time
A related task is to work with datetimes, which we accomplish using pd.to_datetime. There are helper methods like .astype for casting to datetime types, as well as .set_index if we want to use a datetime column as our index.
import time
df['Datetime'] = pd.to_datetime([time.time() - (86400*x) for x in range(df.shape[0])], unit='s')
# Round to the nearest second
# see https://stackoverflow.com/a/13788301/10553976
df['Datetime'] = np.round(df['Datetime'].astype(np.int64), -9).astype('datetime64[ns]')
# Set as index
df = df.set_index('Datetime')
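With the DatetimeIndex in place, label slicing by date strings and time-based resampling become available. A small sketch (the dates simply follow the synthetic timestamps generated above):
# All observations from a single month, via partial string indexing
df.loc['2019-10'].head()
# Weekly means of the integer feature (sort first, since our index runs backwards in time)
df['int_feature'].sort_index().resample('W').mean().head()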
I/O
One of the most attractive features of Pandas is the robust I/O capabilities. The whole list is here but let's take a look at two important ones, CSV and SQL.
CSV
Writing a CSV in Pandas couldn't be easier:
df.to_csv('./data.csv')
Reading it back in is just as simple:
df = pd.read_csv('./data.csv')
But wait: let's look at our df now...
df.head()
| | Datetime | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|---|
0 | 2019-10-08 21:51:50 | snake | 0 | False | NaN |
1 | 2019-10-07 21:51:50 | cat | 1 | NaN | NaN |
2 | 2019-10-06 21:51:50 | snake | 2 | False | 3.0 |
3 | 2019-10-05 21:51:50 | dog | 3 | False | 6.0 |
4 | 2019-10-04 21:51:50 | snake | 4 | NaN | 9.0 |
Pandas can't infer that any particular column is the index, so .read_csv created a DataFrame with a simple index instead of using the DatetimeIndex we set before writing to a file. Of course we could always use .set_index after the fact, but in resource-constrained scenarios we want control over how large the created DataFrame will be. Searching for a solution, we come to the .read_csv documentation, and the power of this tool immediately becomes apparent:
- Read CSVs from URL (!!!!!), path, or file-like
- Specify the exact window in a CSV file to read from with kwargs like header, index_col, usecols, nrows, skiprows, and skipfooter
- Manage NaN with kwargs like na_filter and na_values
- sep for handling tsv and other delimiter situations
- Kwargs to help save memory like low_memory, memory_map, and dtype
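Here's a sketch combining a few of those kwargs; the specific values are illustrative, not required:
# Read only two columns and the first 1000 observations, parsing timestamps on the way in
preview = pd.read_csv(
    './data.csv',
    usecols=['Datetime', 'int_feature'],
    nrows=1000,
    parse_dates=['Datetime'],
)
preview.head()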
Armed with tons of new knowledge about how to read csv files with Pandas, we can confidently read our data.csv file with the DatetimeIndex specified.
df = pd.read_csv('./data.csv', index_col='Datetime')
df.head()
| | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|
Datetime | | | | |
2019-10-08 21:51:50 | snake | 0 | False | NaN |
2019-10-07 21:51:50 | cat | 1 | NaN | NaN |
2019-10-06 21:51:50 | snake | 2 | False | 3.0 |
2019-10-05 21:51:50 | dog | 3 | False | 6.0 |
2019-10-04 21:51:50 | snake | 4 | NaN | 9.0 |
Just the way it originally was! Let's clean that file up:
import os
os.remove('./data.csv')
Note: if you aren't using Pandas features for data manipulation, consider using the stdlib csv library for I/O instead of Pandas. Pandas objects create some overhead in memory, and Pandas itself is a nontrivial dependency.
SQL
If you want to read and write data that lives in a SQL database, Pandas provides the pd.read_sql and .to_sql methods to interface with databases. In this example we'll use the stdlib sqlite3 DB-API module, but Pandas can integrate with any kind of DB that is supported by sqlalchemy. Also note that we're going to use a try...finally block to ensure that .close is always called.
import sqlite3
# DB operations
try:
# Connect to db
conn = sqlite3.connect('data.db')
# Write dataframe to table 'df'
df.to_sql('df', conn, if_exists='replace')
# Read a SELECT * query of the same table, specifying index column
sqldf = pd.read_sql('SELECT * FROM df', conn, index_col='Datetime')
# Ensure the connection closes
finally:
conn.close()
# Cleanup file
os.remove('data.db')
sqldf.head()
| | str_feature | int_feature | messy_feature | rolling |
---|---|---|---|---|
Datetime | | | | |
2019-10-08 21:51:50 | snake | 0 | False | NaN |
2019-10-07 21:51:50 | cat | 1 | None | NaN |
2019-10-06 21:51:50 | snake | 2 | False | 3.0 |
2019-10-05 21:51:50 | dog | 3 | False | 6.0 |
2019-10-04 21:51:50 | snake | 4 | None | 9.0 |
Once again our data has made a round trip from memory to disk and back again. Well done Pandas and well done to you, dear reader!
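If you have SQLAlchemy installed, the same two calls accept an engine instead of a raw connection, which is how you would talk to Postgres, MySQL, and friends. A sketch (the connection string is illustrative; swap in your own database URL):
from sqlalchemy import create_engine
# Same round trip, this time through an engine
engine = create_engine('sqlite:///data.db')
df.to_sql('df', engine, if_exists='replace')
sqldf = pd.read_sql('SELECT * FROM df', engine, index_col='Datetime')
# Cleanup
engine.dispose()
os.remove('data.db')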
Wrapping Up
In this article you've learned how Pandas implements some basic data munging tasks, specifically:
- Extracting metadata and descriptive statistics
- Filtering, imputing, joining and grouping data
- Windowed functions (like rolling sum)
- Using Datetime data and indexing by time
- Reading and writing data to files and databases
Happy wrangling!
Image by Cimberley on Pixabay