Cezar Peixeiro

Posted on Feb 2, 2020 • Edited on Oct 8, 2021

Pandas 1.0.0 - jan/2020 - What's new?

Pandas is a famous Python library for data wrangling, and in January 2020 its development team made a major release! Yes, big changes were made in this lib, which made it jump from version 0.25 to 1.0.0.

I just read the official release docs and wrote down below the important changes that were made:

IMPROVEMENTS

1. rolling.apply() and expanding.apply() can use the Numba engine!

Numba is a JIT compiler project that translates a subset of Python code into optimized machine code using the LLVM compiler library. In other words, Numba makes Python code run faster. (You can use Numba in your own code by importing it and applying it as a decorator. Read more in the official docs.)

The official release says:
"Using the Numba engine can yield significant performance gains if the apply function can operate on numpy arrays and the data set is larger (1 million rows or greater)"

Until now, running a custom function through DataFrame.rolling.apply() or DataFrame.expanding.apply() had a huge processing cost. Now we can pass engine="numba" to the apply method (together with raw=True, which the Numba engine requires) and get a performance increase like this:

import numpy as np
import pandas as pd  # pandas 1.0.0 version

data = pd.Series(range(1_000_000))
roll = data.rolling(10)

def f(x):
    return np.sum(x) + 5

# Running in Jupyter Notebook
# Run the first time, compilation time will affect performance
In [4]: %%timeit -r 1 -n 1 
        roll.apply(f, engine='numba', raw=True)
Out [4]: 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# Function is cached and performance will improve
In [5]: %%timeit 
        roll.apply(f, engine='numba', raw=True)
Out [5]: 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %%timeit 
        roll.apply(f, engine='cython', raw=True)
Out [6]: 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
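The timings above only use rolling. A small sketch of the same idea with expanding.apply() could look like the following. It assumes Numba may or may not be installed, so the Numba run is wrapped in a try/except and the comparison is skipped when the package is missing.

```python
import numpy as np
import pandas as pd


def f(x):
    return np.sum(x) + 5


data = pd.Series(range(10), dtype="float64")

# Default engine (Cython); raw=True passes each window as a NumPy array
expanding_default = data.expanding().apply(f, raw=True)
print(expanding_default.tolist())

try:
    # Optional: only works when a compatible numba version is installed
    expanding_numba = data.expanding().apply(f, engine="numba", raw=True)
except ImportError:
    expanding_numba = None

if expanding_numba is not None:
    # Both engines should produce the same values; only the speed differs
    print(np.allclose(expanding_default, expanding_numba))
```

On a series this small the Numba engine brings no benefit; the point is only that switching engines should not change the result.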

DataFrame.to_markdown()

This new method is just an easy way to output dataframes in markdown format (it relies on the optional tabulate package). To illustrate this, I created a simple dataframe below and applied the method.

df = pd.DataFrame(data={"col_1": ["a", "b"], "col_2": ["c", "d"]})

print(df.to_markdown())
>>> |    | col_1 | col_2 |
    |---:|:------|:------|
    |  0 | a     | c     |
    |  1 | b     | d     |

EXPERIMENTAL NEW FEATURES

These changes are optional to use and are available for community evaluation. Beyond that, some experimental features are really cool and useful, and can make some work and learning easier. Let's see:

pandas.NA value

When dealing with missing values in our dataframes, it is really common to use numpy.nan to represent them. Pandas now has its own missing value, pandas.NA. Creating a pandas Series with a None value and the nullable Int64 dtype in pandas 1.0.0, we'll see something like this:

import pandas as pd #pandas 1.0.0 version
s = pd.Series([1, 2, None], dtype="Int64")

print(s)

>>> 0       1
    1       2
    2    <NA>    
    Length: 3, dtype: Int64

and if we print the type of s[2], we'll get the following answer:

print(type(s[2]))
>>> <class 'pandas._libs.missing.NAType'>
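Because pandas.NA is a single shared object, a missing entry can be detected either by identity or with the usual isna helpers. A short sketch, reusing the same Series as above:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

print(s[2] is pd.NA)      # identity check against the pd.NA singleton -> True
print(pd.isna(s[2]))      # scalar check -> True
print(s.isna().tolist())  # element-wise check -> [False, False, True]
print(s.sum())            # missing values are skipped by default -> 3
```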

Comparisons

With the comparison operators ==, >, >=, < and <=, comparing numpy.nan with a number always returns False, while != returns True.
With the new pandas.NA, the missing value is propagated by all the comparison operators, so every one of these comparisons returns <NA>.
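A quick way to check the comparison behaviour described above (the number 1 is an arbitrary choice for the sketch):

```python
import numpy as np
import pandas as pd

# numpy.nan: every comparison is False, except != which is True
print(np.nan == 1)  # False
print(np.nan > 1)   # False
print(np.nan != 1)  # True

# pandas.NA: the missing value propagates through every comparison
print(pd.NA == 1)   # <NA>
print(pd.NA > 1)    # <NA>
print(pd.NA != 1)   # <NA>
```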

Logical

The logical operators & and | are not supported with numpy.nan (they raise a TypeError).
With pandas.NA the result follows three-valued logic: the answer is <NA> unless the other operand is enough to decide it:

import pandas as pd #pandas 1.0.0 version

print(pd.NA & False)
>>> False

print(pd.NA & True)
>>> <NA>

print(pd.NA | False)
>>> <NA>

print(pd.NA | True)
>>> True
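One practical consequence, shown here as a hedged sketch: numpy.nan rejects & outright, while pandas.NA has no truth value of its own, so it cannot be used directly as a condition. Testing for missingness explicitly with pd.isna avoids the problem. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# numpy.nan is a plain float, so & is not defined for it
try:
    np.nan & True
except TypeError as exc:
    print("numpy.nan:", exc)

# pd.NA is neither True nor False, so bool() refuses to convert it
try:
    bool(pd.NA)
except TypeError as exc:
    print("pd.NA:", exc)

# Test for missingness explicitly instead of relying on truthiness
value = pd.NA
is_missing = pd.isna(value)
print(is_missing)  # True
```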

String and Boolean Data Types

When checking the dtype of a pandas dataframe column that holds strings, the usual result is the object dtype. But an object column can hold more than one data type, which makes the analysis confusing. Now a dedicated string dtype is defined, and a string column can hold only strings (and missing values):

import pandas as pd #pandas 1.0.0 version 
s = pd.Series(['abc', None, 'def'], dtype="string")

print(s)
>>> 0     abc
    1    <NA>
    2     def
    Length: 3, dtype: string
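To make the contrast with object columns concrete, here is a small sketch. The mixed column is built with an explicit dtype="object" so the example does not depend on how a given pandas version infers dtypes.

```python
import pandas as pd

# An object column happily mixes strings, numbers and None
mixed = pd.Series(["abc", 1, None], dtype="object")
print(mixed.dtype)

# A string column keeps text only, with <NA> for missing entries
st = pd.Series(["abc", None, "def"], dtype="string")
print(st.dtype)

# The .str accessor works as usual and propagates missing values
upper = st.str.upper()
lengths = st.str.len()
print(upper.tolist())
print(lengths.dtype)  # numeric results come back as nullable Int64
```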

Unlike strings, a bool dtype already existed, but columns with boolean values didn't support missing values. This inconvenience was solved with the new boolean dtype:

import pandas as pd #pandas 1.0.0 version 
s = pd.Series([True, False, None], dtype="boolean")

print(s)
>>> 0     True
    1    False
    2     <NA>
    Length: 3, dtype: boolean
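A boolean column combines with & and | using the same rules shown earlier for pandas.NA. The sketch below also uses fillna to turn the column into a mask without gaps; treating missing as False is just one possible choice, made here for illustration.

```python
import pandas as pd

flags = pd.Series([True, False, None], dtype="boolean")

# The missing entry only stays missing when the result is undecided
and_false = flags & False
or_true = flags | True
and_true = flags & True
print(and_false.tolist())  # [False, False, False]
print(or_true.tolist())    # [True, True, True]
print(and_true)            # last entry stays <NA>

# Decide explicitly how missing values count before using it as a mask
mask = flags.fillna(False)
print(mask.tolist())       # [True, False, False]
```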

Well, that's all! Other improvements were made to some existing functions, and I really encourage you to take a look at the official release notes!

DEV Community
