Cezar Peixeiro

Pandas 1.0.0 - jan/2020 - What's new?

Pandas is a popular Python library for data wrangling, and in January 2020 its development team shipped a major release! YES, big changes were made to the library, enough to take it from version 0.25 to 1.0.0.

I just read the official release docs and wrote up the most important changes below:

IMPROVEMENTS

1. rolling.apply() and expanding.apply() can use the Numba engine!

Numba is a JIT compiler that translates a subset of Python and NumPy code into optimized machine code using the LLVM compiler library. In other words, Numba makes Python code run faster. (You can use Numba in your own code by importing it and applying it as a decorator. Read more in the official docs.)

The official release says:
"Using the Numba engine can yield significant performance gains if the apply function can operate on numpy arrays and the data set is larger (1 million rows or greater)"

Until now, DataFrame.rolling().apply() and DataFrame.expanding().apply() were expensive on large data. Now we can pass engine="numba" (together with raw=True, which the Numba engine requires) to the apply method and get a performance boost like this:

import numpy as np
import pandas as pd #pandas 1.0.0 version

data = pd.Series(range(1_000_000))
roll = data.rolling(10)

def f(x):
    return np.sum(x) + 5

# Running in Jupyter Notebook
# Run the first time, compilation time will affect performance
In [4]: %%timeit -r 1 -n 1 
        roll.apply(f, engine='numba', raw=True)
Out [4]: 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# Function is cached and performance will improve
In [5]: %%timeit 
        roll.apply(f, engine='numba', raw=True)
Out [5]: 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %%timeit 
        roll.apply(f, engine='cython', raw=True)
Out [6]: 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

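If you want to check what f actually receives, here is a minimal sketch of my own (not from the release notes). It assumes the default engine, so Numba doesn't need to be installed, and it uses a much smaller series than the benchmark. The names applied, builtin and same are only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series(range(1_000))  # far smaller than the 1,000,000 rows benchmarked above
roll = data.rolling(10)

def f(x):
    # With raw=True, x is a plain numpy.ndarray holding one window of values
    return np.sum(x) + 5

applied = roll.apply(f, raw=True)  # default engine, no Numba required
builtin = roll.sum() + 5           # same computation with the built-in aggregation

# The first 9 windows are incomplete and give NaN in both results
same = np.allclose(applied.dropna(), builtin.dropna())
print(same)
```

When a built-in aggregation like roll.sum() exists, it is usually the simpler choice; apply with a custom function is for logic that has no built-in equivalent.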

Read more HERE.

DataFrame.to_markdown()

This new method is an easy way to output a dataframe in Markdown format (it relies on the optional tabulate package). To illustrate, I created a simple dataframe below and applied the method.

df = pd.DataFrame(data={"col_1": ["a", "b"], "col_2": ["c", "d"]})

print(df.to_markdown())
>>> |    | col_1 | col_2 |
    |---:|:------|:------|
    |  0 | a     | c     |
    |  1 | b     | d     |

EXPERIMENTAL NEW FEATURES

These features are optional and are made available for community evaluation, so their behavior may still change. Even so, some of them are really cool and useful, and can make both work and learning easier. Let's see:

pandas.NA value

When dealing with missing values in our dataframes, it is really common to use numpy.nan to represent them. Pandas now has its own missing value indicator, pandas.NA. Creating a pandas Series that contains None with the nullable Int64 dtype in pandas 1.0.0, we'll see something like this:

import pandas as pd #pandas 1.0.0 version
s = pd.Series([1, 2, None], dtype="Int64")

print(s)

>>> 0       1
    1       2
    2    <NA>    
    Length: 3, dtype: Int64

and if we print the type of s[2], we'll get the following answer:

print(type(s[2]))
>>> <class 'pandas._libs.missing.NAType'>
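To see why this matters, here is a small sketch comparing the default behaviour with the nullable Int64 dtype. The variable names as_float and as_int are mine, chosen just for this illustration.

```python
import pandas as pd

# Default behaviour: None becomes NaN and the integers are upcast to float
as_float = pd.Series([1, 2, None])

# Nullable integer dtype: the values stay integers and the gap becomes pd.NA
as_int = pd.Series([1, 2, None], dtype="Int64")

print(as_float.dtype)           # float64
print(as_int.dtype)             # Int64
print(as_int.isna().tolist())   # [False, False, True]
print(as_int.sum())             # 3 (missing values are skipped by default)
print(as_int[2] is pd.NA)       # True
```

So with Int64 a single missing entry no longer turns a whole integer column into floats.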

Comparisons

  • With the comparison operators ==, >, >=, <, <=, comparing numpy.nan with a number always returns False, while != returns True.
  • With the new pandas.NA, the missing value propagates through every comparison operator, so all of these operations return <NA>.
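The two behaviours can be compared side by side with a quick sketch. The dictionaries nan_results and na_results and the flag ambiguous are names I made up for this example.

```python
import numpy as np
import pandas as pd

# numpy.nan: every comparison is False, except != which is True
nan_results = {
    "==": np.nan == 1,
    ">": np.nan > 1,
    "<": np.nan < 1,
    "!=": np.nan != 1,
}

# pandas.NA: the answer is unknown, so NA itself comes back
na_results = {
    "==": pd.NA == 1,
    ">": pd.NA > 1,
    "<": pd.NA < 1,
    "!=": pd.NA != 1,
}

print(nan_results)  # {'==': False, '>': False, '<': False, '!=': True}
print({op: value is pd.NA for op, value in na_results.items()})

# Because the result is unknown, NA cannot be used directly as a condition
try:
    bool(pd.NA)
    ambiguous = False
except TypeError:
    ambiguous = True
print(ambiguous)  # True
```

In practice this means checking for missing values with pd.isna() instead of comparing against pd.NA.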

Logical

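  • The logical operators & and | are not supported with numpy.nan (Python raises a TypeError).
  • With pandas.NA, the result follows three-valued logic: it is <NA> only when the outcome really depends on the missing value.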
import pandas as pd #pandas 1.0.0 version

print(pd.NA & False)
>>> False

print(pd.NA & True)
>>> <NA>

print(pd.NA | False)
>>> <NA>

print(pd.NA | True)
>>> True

String and Boolean Data Types

When you inspect the dtype of a pandas column that holds strings, the usual result is object. But an object column can hold more than one data type, which makes the analysis confusing. Now there is a dedicated string dtype, and a string column can hold only strings (plus missing values).

import pandas as pd #pandas 1.0.0 version 
s = pd.Series(['abc', None, 'def'], dtype="string")

print(s)
>>> 0     abc
    1    <NA>
    2     def
    Length: 3, dtype: string
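Here is a short sketch of the difference between the two dtypes. The series names (mixed, text, upper, converted) are my own and only serve the example.

```python
import pandas as pd

# object dtype accepts anything, so mixed content goes unnoticed
mixed = pd.Series(["abc", 1, None])

# string dtype: text values only, missing entries become pd.NA
text = pd.Series(["abc", None, "def"], dtype="string")

# .str methods work as usual and keep the missing value in place
upper = text.str.upper()

# An existing column of strings can be converted with astype
converted = pd.Series(["x", "y"]).astype("string")

print(mixed.dtype)            # object
print(text.dtype)             # string
print(text.isna().tolist())   # [False, True, False]
print(upper[0], upper[2])     # ABC DEF
print(converted.dtype)        # string
```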

Unlike strings, a bool dtype already existed, but boolean columns didn't support missing values. This inconvenience is solved by the new boolean dtype:

import pandas as pd #pandas 1.0.0 version 
s = pd.Series([True, False, None], dtype="boolean")

print(s)
>>> 0     True
    1    False
    2     <NA>
    Length: 3, dtype: boolean

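The logical rules shown earlier for pandas.NA also apply element by element to a boolean column. A minimal sketch, with variable names (flags, and_true, or_true, and_false, filled) that I picked for illustration:

```python
import pandas as pd

# Without the new dtype, a missing value forces the column to object dtype
plain = pd.Series([True, False, None])

# Nullable boolean dtype: True, False and pd.NA live together
flags = pd.Series([True, False, None], dtype="boolean")

and_true = flags & True     # True, False, <NA>
or_true = flags | True      # True, True, True
and_false = flags & False   # False, False, False

# Replace the missing entries when a plain yes/no answer is needed
filled = flags.fillna(False)

print(plain.dtype)          # object
print(and_true[2] is pd.NA) # True
print(or_true.tolist())
print(and_false.tolist())
print(filled.tolist())
```

Note that NA | True is True and NA & False is False, because the result does not depend on the missing value in those two cases.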

Well, that's all! Many other improvements were made to existing functions, and I really encourage you to take a look at the release notes!