DEV Community

kojix2
kojix2

Posted on • Edited on

Tried Polars in Ruby

Polars is a data frame in the Rust language using Apache Arrow Columnar Format. polars-ruby is the Ruby binding for Polars created by Andrew Kane.

Several members of the Ruby community have been deeply involved in the development of Apache Arrow.

Fast data processing with Ruby and Apache Arrow - rubykaigi2022

So while the Arrow C binding for the Ruby language is relatively well-developed, polars-df is not an Arrow C binding, but a binding to Polars implemented in Rust. magnus is used for the connection between Ruby and Rust. In fact, there is also a Ruby data frame that uses the Arrow binding, which is called RedAmber. But we are not talking about that now.

Please note that this post is incomplete and polars-df is still in the development phase, so the API is subject to change.

Documentation

RubyDoc.info

Chapter 1 Getting started in Ruby

Installation

Ruby gem

gem install polars
Enter fullscreen mode Exit fullscreen mode

From source code

git clone
https://github.com/ankane/polars-ruby
cd polars-ruby
bundle
bundle exec rake compile
bundle exec rake install
Enter fullscreen mode Exit fullscreen mode

Quick start

Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation. This example is presented in python only, as the "eager" API is not the preferred model in Rust.

require 'polars'
require 'uri'

df = Polars.read_csv(URI('https://j.mp/iriscsv'))
df.filter(Polars.col('sepal_length') > 5)
  .groupby('species')
  .agg(Polars.all.sum)
Enter fullscreen mode Exit fullscreen mode

The snippet above will output:

shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
Enter fullscreen mode Exit fullscreen mode

As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.

Lazy quick start

If we want to run this query in lazy Polars we'd write:

require 'polars'

Polars.read_csv(URI('https://j.mp/iriscsv'))
      .lazy
      .filter(Polars.col('sepal_length') > 5)
      .groupby('species')
      .agg(Polars.all.sum)
      .collect
Enter fullscreen mode Exit fullscreen mode
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
Enter fullscreen mode Exit fullscreen mode

Chapter 2 Polars cheat sheet in Ruby

Creating / reading DataFrames

Create DataFrame

df = Polars::DataFrame.new({
  nrs: [1, 2, 3, nil, 5],
  names: ["foo", "ham", "spam", "egg", nil],
  random: [0.3, 0.7, 0.1, 0.9, 0.6],
  groups: %w[A A B C B],
})
Enter fullscreen mode Exit fullscreen mode
shape: (5, 4)                           
┌──────┬───────┬────────┬────────┐      
│ nrs  ┆ names ┆ random ┆ groups │      
│ ---  ┆ ---   ┆ ---    ┆ ---    │      
│ i64  ┆ str   ┆ f64    ┆ str    │      
╞══════╪═══════╪════════╪════════╡      
│ 1    ┆ foo   ┆ 0.3    ┆ A      │      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤      
│ 2    ┆ ham   ┆ 0.7    ┆ A      │      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Read CSV

iris = Polars.read_csv(URI('https://j.mp/iriscsv'),
                     has_header: true)
Enter fullscreen mode Exit fullscreen mode
shape: (150, 5)                             
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
Enter fullscreen mode Exit fullscreen mode

Read parquet

Polars.read_parquet('file.parquet')
Enter fullscreen mode Exit fullscreen mode

Expressions

df.filter(Polars.col('nrs') < 4) # symbols not work?
  .groupby('groups')
  .agg(Polars.all.sum)
Enter fullscreen mode Exit fullscreen mode
shape: (2, 4)
┌────────┬─────┬───────┬────────┐
│ groups ┆ nrs ┆ names ┆ random │
│ ---    ┆ --- ┆ ---   ┆ ---    │
│ str    ┆ i64 ┆ str   ┆ f64    │
╞════════╪═════╪═══════╪════════╡
│ A      ┆ 3   ┆ null  ┆ 1.0    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B      ┆ 3   ┆ null  ┆ 0.1    │
└────────┴─────┴───────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Subset Observations - rows

Filter: Extract rows that meet logical criteria

df.filter(Polars.col('random') > 0.5)
df.filter(
  (Polars.col('groups') == 'B') &
  (Polars.col('random') > 0.5)
)
Enter fullscreen mode Exit fullscreen mode
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 5   ┆ null  ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Randomly select fraction of rows.

df.sample(frac: 0.5)
Enter fullscreen mode Exit fullscreen mode
# Results are random.
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
└─────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Randomly select n rows.

df.sample(n: 2)
Enter fullscreen mode Exit fullscreen mode
# Results are random.
shape: (2, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
└──────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

select first n rows.

df.head(2)
Enter fullscreen mode Exit fullscreen mode
shape: (2, 4)                                       
┌─────┬───────┬────────┬────────┐                   
│ nrs ┆ names ┆ random ┆ groups │                   
│ --- ┆ ---   ┆ ---    ┆ ---    │                   
│ i64 ┆ str   ┆ f64    ┆ str    │                   
╞═════╪═══════╪════════╪════════╡                   
│ 1   ┆ foo   ┆ 0.3    ┆ A      │                   
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤                   
│ 2   ┆ ham   ┆ 0.7    ┆ A      │                   
└─────┴───────┴────────┴────────┘  
Enter fullscreen mode Exit fullscreen mode

select last n rows.

df.tail(2)
Enter fullscreen mode Exit fullscreen mode
shape: (2, 4)                                       
┌──────┬───────┬────────┬────────┐                  
│ nrs  ┆ names ┆ random ┆ groups │                  
│ ---  ┆ ---   ┆ ---    ┆ ---    │                  
│ i64  ┆ str   ┆ f64    ┆ str    │                  
╞══════╪═══════╪════════╪════════╡                  
│ null ┆ egg   ┆ 0.9    ┆ C      │                  
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤                  
│ 5    ┆ null  ┆ 0.6    ┆ B      │                  
└──────┴───────┴────────┴────────┘                  
Enter fullscreen mode Exit fullscreen mode

Subset Observations - columns

Select multiple columns with specific names

df.select(["nrs", "names"])
Enter fullscreen mode Exit fullscreen mode
shape: (5, 2)                                       
┌──────┬───────┐                                    
│ nrs  ┆ names │
│ ---  ┆ ---   │
│ i64  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ foo   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  │
└──────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Select columns whose name matches regex

df.select(Polars.col("^n.*$"))
Enter fullscreen mode Exit fullscreen mode
shape: (5, 2)
┌──────┬───────┐
│ nrs  ┆ names │
│ ---  ┆ ---   │
│ i64  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ foo   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  │
└──────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Subsets - rows and columns

Select rows 2-4

? # Yet Range support appears to be limited
Enter fullscreen mode Exit fullscreen mode

Select columns in positions 1 and 3 (first column is 0)

?
Enter fullscreen mode Exit fullscreen mode

Select rows meeting logical condition, and only the specific columns

?
Enter fullscreen mode Exit fullscreen mode

Reshaping Data – Change layout, sorting, renaming

Append rows of DataFrames

Polars.concat([df, df2])
Enter fullscreen mode Exit fullscreen mode

Append columns of DataFrames

Polars.concat([df, df3], how: "horizontal")
Enter fullscreen mode Exit fullscreen mode

Gather columns into rows

df.melt(
  id_vars: 'nrs',
  value_vars: %w[names groups]
)
Enter fullscreen mode Exit fullscreen mode
shape: (10, 3)
┌──────┬──────────┬───────┐
│ nrs  ┆ variable ┆ value │
│ ---  ┆ ---      ┆ ---   │
│ i64  ┆ str      ┆ str   │
╞══════╪══════════╪═══════╡
│ 1    ┆ names    ┆ foo   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ names    ┆ ham   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ names    ┆ spam  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ names    ┆ egg   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ...  ┆ ...      ┆ ...   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ groups   ┆ A     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ groups   ┆ B     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ groups   ┆ C     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ groups   ┆ B     │
└──────┴──────────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Spread rows into columns

df.pivot(values: 'nrs', index: 'groups',
         columns: 'names')
Enter fullscreen mode Exit fullscreen mode
shape: (3, 6)
┌────────┬──────┬──────┬──────┬──────┬──────┐
│ groups ┆ foo  ┆ ham  ┆ spam ┆ egg  ┆ null │
│ ---    ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str    ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞════════╪══════╪══════╪══════╪══════╪══════╡
│ A      ┆ 1    ┆ 2    ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B      ┆ null ┆ null ┆ 3    ┆ null ┆ 5    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ C      ┆ null ┆ null ┆ null ┆ null ┆ null │
└────────┴──────┴──────┴──────┴──────┴──────┘
Enter fullscreen mode Exit fullscreen mode

Order rows by values of a column

# low to high
df.sort("random")
Enter fullscreen mode Exit fullscreen mode
shape: (5, 4)                                                           
┌──────┬───────┬────────┬────────┐                                      
│ nrs  ┆ names ┆ random ┆ groups │                                      
│ ---  ┆ ---   ┆ ---    ┆ ---    │                                      
│ i64  ┆ str   ┆ f64    ┆ str    │                                      
╞══════╪═══════╪════════╪════════╡                                      
│ 3    ┆ spam  ┆ 0.1    ┆ B      │                                      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤                                      
│ 1    ┆ foo   ┆ 0.3    ┆ A      │                                      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤                                      
│ 5    ┆ null  ┆ 0.6    ┆ B      │                                      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤                                      
│ 2    ┆ ham   ┆ 0.7    ┆ A      │                                      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤                                      
│ null ┆ egg   ┆ 0.9    ┆ C      │
└──────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode
# high to low
df.sort("random", reverse: true)
Enter fullscreen mode Exit fullscreen mode
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
└──────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Rename the columns of a DataFrame

df.rename({"nrs" => "idx"})
Enter fullscreen mode Exit fullscreen mode
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ idx  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Drop columns from DataFrame

df.drop(["names", "random"])
Enter fullscreen mode Exit fullscreen mode
shape: (5, 2)
┌──────┬────────┐
│ nrs  ┆ groups │
│ ---  ┆ ---    │
│ i64  ┆ str    │
╞══════╪════════╡
│ 1    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ B      │
└──────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Summarize Data

Count number of rows with each unique value of variable

df["groups"].value_counts
Enter fullscreen mode Exit fullscreen mode
shape: (3, 2)
┌────────┬────────┐
│ groups ┆ counts │
│ ---    ┆ ---    │
│ str    ┆ u32    │
╞════════╪════════╡
│ B      ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ A      ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C      ┆ 1      │
└────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Number of rows in DataFrame

df.height
# => 5
Enter fullscreen mode Exit fullscreen mode

Tuple of number of rows, number of columns in DataFrame

df.shape
# => [5, 4]
Enter fullscreen mode Exit fullscreen mode

Number of distinct values in a column

df["groups"].n_unique
# => 3
Enter fullscreen mode Exit fullscreen mode

Basic descriptive and statistics for each column

df.describe
Enter fullscreen mode Exit fullscreen mode
shape: (7, 5)                                       
┌────────────┬──────────┬───────┬──────────┬────────┐
│ describe   ┆ nrs      ┆ names ┆ random   ┆ groups │
│ ---        ┆ ---      ┆ ---   ┆ ---      ┆ ---    │
│ str        ┆ f64      ┆ str   ┆ f64      ┆ str    │
╞════════════╪══════════╪═══════╪══════════╪════════╡
│ count      ┆ 5.0      ┆ 5     ┆ 5.0      ┆ 5      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null_count ┆ 1.0      ┆ 1     ┆ 0.0      ┆ 0      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ mean       ┆ 2.75     ┆ null  ┆ 0.52     ┆ null   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ std        ┆ 1.707825 ┆ null  ┆ 0.319374 ┆ null   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ min        ┆ 1.0      ┆ egg   ┆ 0.1      ┆ A      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ max        ┆ 5.0      ┆ spam  ┆ 0.9      ┆ C      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ median     ┆ 2.5      ┆ null  ┆ 0.6      ┆ null   │
└────────────┴──────────┴───────┴──────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Aggregation functions

sum min max std median mean quantile first

df.select(
  [
    # Sum values
    Polars.sum('random').alias('sum'),
    # Minimum value
    Polars.min('random').alias('min'),
    # Maximum value
    Polars.max('random').alias('max'),
    # or
    Polars.col('random').max.alias('other_max'),
    # Standard deviation
    Polars.std('random').alias('std dev'),
    # Variance
    Polars.var('random').alias('variance'),
    # Median
    Polars.median('random').alias('median'),
    # Mean
    Polars.mean('random').alias('mean'),
    # Quantile
    Polars.quantile('random', 0.75).alias('quantile_0.75'),
    # or
    Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
    # First value
    Polars.first('random').alias('first')
  ]
)
Enter fullscreen mode Exit fullscreen mode
┌─────┬─────┬─────┬───────────┬─────┬──────┬────────────┬──────────────┬───────┐
│ sum ┆ min ┆ max ┆ other_max ┆ ... ┆ mean ┆ quantile_0 ┆ other_quanti ┆ first │
│ --- ┆ --- ┆ --- ┆ ---       ┆     ┆ ---  ┆ .75        ┆ le_0.75      ┆ ---   │
│ f64 ┆ f64 ┆ f64 ┆ f64       ┆     ┆ f64  ┆ ---        ┆ ---          ┆ f64   │
│     ┆     ┆     ┆           ┆     ┆      ┆ f64        ┆ f64          ┆       │
╞═════╪═════╪═════╪═══════════╪═════╪══════╪════════════╪══════════════╪═══════╡
│ 2.6 ┆ 0.1 ┆ 0.9 ┆ 0.9       ┆ ... ┆ 0.52 ┆ 0.7        ┆ 0.7          ┆ 0.3   │
└─────┴─────┴─────┴───────────┴─────┴──────┴────────────┴──────────────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Group Data

Group by values in column named "col", returning a GroupBy object

df.groupby("groups")
Enter fullscreen mode Exit fullscreen mode

All of the aggregation functions from above can be applied to a group as well

df.groupby(by = 'groups').agg(
  [
    # Sum values
    Polars.sum('random').alias('sum'),
    # Minimum value
    Polars.min('random').alias('min'),
    # Maximum value
    Polars.max('random').alias('max'),
    # or
    Polars.col('random').max.alias('other_max'),
    # Standard deviation
    Polars.std('random').alias('std_dev'),
    # Variance
    Polars.var('random').alias('variance'),
    # Median
    Polars.median('random').alias('median'),
    # Mean
    Polars.mean('random').alias('mean'),
    # Quantile
    Polars.quantile('random', 0.75).alias('quantile_0.75'),
    # or
    Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
    # First value
    Polars.first('random').alias('first')
  ]
)
Enter fullscreen mode Exit fullscreen mode
shape: (3, 12)
┌────────┬─────┬─────┬─────┬─────┬──────┬───────────────┬──────────────┬───────┐
│ groups ┆ sum ┆ min ┆ max ┆ ... ┆ mean ┆ quantile_0.75 ┆ other_quanti ┆ first │
│ ---    ┆ --- ┆ --- ┆ --- ┆     ┆ ---  ┆ ---           ┆ le_0.75      ┆ ---   │
│ str    ┆ f64 ┆ f64 ┆ f64 ┆     ┆ f64  ┆ f64           ┆ ---          ┆ f64   │
│        ┆     ┆     ┆     ┆     ┆      ┆               ┆ f64          ┆       │
╞════════╪═════╪═════╪═════╪═════╪══════╪═══════════════╪══════════════╪═══════╡
│ C      ┆ 0.9 ┆ 0.9 ┆ 0.9 ┆ ... ┆ 0.9  ┆ 0.9           ┆ 0.9          ┆ 0.9   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A      ┆ 1.0 ┆ 0.3 ┆ 0.7 ┆ ... ┆ 0.5  ┆ 0.7           ┆ 0.7          ┆ 0.3   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B      ┆ 0.7 ┆ 0.1 ┆ 0.6 ┆ ... ┆ 0.35 ┆ 0.6           ┆ 0.6          ┆ 0.1   │
└────────┴─────┴─────┴─────┴─────┴──────┴───────────────┴──────────────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Additional GroupBy functions

??
Enter fullscreen mode Exit fullscreen mode

Handling Missing Data

Drop rows with any column having a null value

df.drop_nulls
Enter fullscreen mode Exit fullscreen mode
shape: (3, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ spam  ┆ 0.1    ┆ B      │
└─────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Replace null values with given value

df.fill_null(42)
Enter fullscreen mode Exit fullscreen mode
shape: (5, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 42  ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ 42    ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Replace null values using forward strategy

df.fill_null(strategy: "forward")
Enter fullscreen mode Exit fullscreen mode
shape: (5, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ egg   ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Other fill strategies are "backward", "min", "max", "mean", "zero" and "one"

Replace floating point Nan values with given value

dfn = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13]
  }
)
dfn.fill_nan(99)
Enter fullscreen mode Exit fullscreen mode

Make New Columns

Add a new columns to the DataFrame

df.with_column(
  (Polars.col('random') * Polars.col('nrs'))
  .alias('product')
)
Enter fullscreen mode Exit fullscreen mode
shape: (5, 5)
┌──────┬───────┬────────┬────────┬─────────┐
│ nrs  ┆ names ┆ random ┆ groups ┆ product │
│ ---  ┆ ---   ┆ ---    ┆ ---    ┆ ---     │
│ i64  ┆ str   ┆ f64    ┆ str    ┆ f64     │
╞══════╪═══════╪════════╪════════╪═════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      ┆ 0.3     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      ┆ 1.4     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      ┆ 0.3     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      ┆ null    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      ┆ 3.0     │
└──────┴───────┴────────┴────────┴─────────┘
Enter fullscreen mode Exit fullscreen mode

Add several new columns to the DataFrame

df.with_columns(
  [
    (Polars.col('random') * Polars.col('nrs'))
    .alias('product'),
    Polars.col('names').str.lengths
    .alias('names_lengths')
  ]
)
Enter fullscreen mode Exit fullscreen mode
shape: (5, 6)
┌──────┬───────┬────────┬────────┬─────────┬───────────────┐
│ nrs  ┆ names ┆ random ┆ groups ┆ product ┆ names_lengths │
│ ---  ┆ ---   ┆ ---    ┆ ---    ┆ ---     ┆ ---           │
│ i64  ┆ str   ┆ f64    ┆ str    ┆ f64     ┆ u32           │
╞══════╪═══════╪════════╪════════╪═════════╪═══════════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      ┆ 0.3     ┆ 3             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      ┆ 1.4     ┆ 3             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      ┆ 0.3     ┆ 4             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      ┆ null    ┆ 3             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      ┆ 3.0     ┆ null          │
└──────┴───────┴────────┴────────┴─────────┴───────────────┘
Enter fullscreen mode Exit fullscreen mode

Add a column at index 0 that counts the rows

df.with_row_count
Enter fullscreen mode Exit fullscreen mode
shape: (5, 5)
┌────────┬──────┬───────┬────────┬────────┐
│ row_nr ┆ nrs  ┆ names ┆ random ┆ groups │
│ ---    ┆ ---  ┆ ---   ┆ ---    ┆ ---    │
│ u32    ┆ i64  ┆ str   ┆ f64    ┆ str    │
╞════════╪══════╪═══════╪════════╪════════╡
│ 0      ┆ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 5    ┆ null  ┆ 0.6    ┆ B      │
└────────┴──────┴───────┴────────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Rolling Functions

The following rolling functions are available

df.select(
  [
    # Rolling maximum value
    Polars.col('random')
          .rolling_max(2)
          .alias('rolling_max'),
    # Rolling mean value
    Polars.col('random')
          .rolling_mean(2)
          .alias('rolling_mean'),
    # Rolling median value
    Polars.col('random')
          .rolling_median(2, min_periods: 2)
          .alias('rolling_median'),
    # Rolling minimum value
    Polars.col('random')
          .rolling_min(2)
          .alias('rolling_min'),
    # Rolling standard deviation
    Polars.col('random')
          .rolling_std(2)
          .alias('rolling_std'),
    # Rolling sum values
    Polars.col('random')
          .rolling_sum(2)
          .alias('rolling_sum'),
    # Rolling variance
    Polars.col('random')
          .rolling_var(2)
          .alias('rolling_var'),
    # Rolling quantile
    Polars.col('random')
          .rolling_quantile(
            0.75,
            window_size: 2,
            min_periods: 2
          )
          .alias('rolling_quantile'),
    # Rolling skew
    Polars.col('random')
          .rolling_skew(2)
          .alias('rolling_skew')
    # Rolling custom function
    # Polars.col('random')
    # .rolling_apply(
    #   function = np.nanstd, window_size = 2
    # )
    # .alias('rolling_apply')
  ]
)
Enter fullscreen mode Exit fullscreen mode
shape: (5, 9)
┌───────────┬────────────┬────────────┬───────────┬─────┬───────────┬───────────┬────────────┬────────────┐
│ rolling_m ┆ rolling_me ┆ rolling_me ┆ rolling_m ┆ ... ┆ rolling_s ┆ rolling_v ┆ rolling_qu ┆ rolling_sk │
│ ax        ┆ an         ┆ dian       ┆ in        ┆     ┆ um        ┆ ar        ┆ antile     ┆ ew         │
│ ---       ┆ ---        ┆ ---        ┆ ---       ┆     ┆ ---       ┆ ---       ┆ ---        ┆ ---        │
│ f64       ┆ f64        ┆ f64        ┆ f64       ┆     ┆ f64       ┆ f64       ┆ f64        ┆ f64        │
╞═══════════╪════════════╪════════════╪═══════════╪═════╪═══════════╪═══════════╪════════════╪════════════╡
│ null      ┆ null       ┆ null       ┆ null      ┆ ... ┆ null      ┆ null      ┆ null       ┆ null       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.7       ┆ 0.5        ┆ 0.5        ┆ 0.3       ┆ ... ┆ 1.0       ┆ 0.08      ┆ 0.7        ┆ -4.3368e-1 │
│           ┆            ┆            ┆           ┆     ┆           ┆           ┆            ┆ 6          │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.7       ┆ 0.4        ┆ 0.4        ┆ 0.1       ┆ ... ┆ 0.8       ┆ 0.18      ┆ 0.7        ┆ 3.8549e-16 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.9       ┆ 0.5        ┆ 0.5        ┆ 0.1       ┆ ... ┆ 1.0       ┆ 0.32      ┆ 0.9        ┆ 0.0        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.9       ┆ 0.75       ┆ 0.75       ┆ 0.6       ┆ ... ┆ 1.5       ┆ 0.045     ┆ 0.9        ┆ 0.0        │
└───────────┴────────────┴────────────┴───────────┴─────┴───────────┴───────────┴────────────┴────────────┘
Enter fullscreen mode Exit fullscreen mode

Window Functions

Window functions allow to group by several columns simultaneously

df.select(
  [
    'names',
    'groups',
    Polars.col('random').sum.over('names')
    .alias('sum_by_names'),
    Polars.col('random').sum.over('groups')
    .alias('sum_by_groups')
  ]
)
Enter fullscreen mode Exit fullscreen mode

Combine Data Sets

df_1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df_2 = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
)
Enter fullscreen mode Exit fullscreen mode
shape: (3, 3)                           
┌─────┬─────┬─────┐                     
│ foo ┆ bar ┆ ham │                     
│ --- ┆ --- ┆ --- │                     
│ i64 ┆ f64 ┆ str │                     
╞═════╪═════╪═════╡                     
│ 1   ┆ 6.0 ┆ a   │                     
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤                     
│ 2   ┆ 7.0 ┆ b   │                     
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤                     
│ 3   ┆ 8.0 ┆ c   │                     
└─────┴─────┴─────┘
Enter fullscreen mode Exit fullscreen mode
shape: (3, 2)                           
┌───────┬─────┐                         
│ apple ┆ ham │                         
│ ---   ┆ --- │                         
│ str   ┆ str │
╞═══════╪═════╡
│ x     ┆ a   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ y     ┆ b   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ z     ┆ d   │
└───────┴─────┘
Enter fullscreen mode Exit fullscreen mode

Inner Join

Retains only rows with a match in the other set.

df_1.join(df_2, on: "ham")
df_1.join(df_2, on: "ham", how: "inner")
Enter fullscreen mode Exit fullscreen mode
shape: (2, 4)
┌─────┬─────┬─────┬───────┐                           
│ foo ┆ bar ┆ ham ┆ apple │                           
│ --- ┆ --- ┆ --- ┆ ---   │                           
│ i64 ┆ f64 ┆ str ┆ str   │                           
╞═════╪═════╪═════╪═══════╡                           
│ 1   ┆ 6.0 ┆ a   ┆ x     │                           
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤                           
│ 2   ┆ 7.0 ┆ b   ┆ y     │                           
└─────┴─────┴─────┴───────┘                           
Enter fullscreen mode Exit fullscreen mode

Left Join

Retains each row from "left" set (df).

df_1.join(df_2, on: "ham", how: "left")
Enter fullscreen mode Exit fullscreen mode
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Outer Join

Retains each row, even if no other matching row exists.

df_1.join(df_2, on: "ham", how: "outer")
Enter fullscreen mode Exit fullscreen mode
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ 7.0  ┆ b   ┆ y     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ null ┆ d   ┆ z     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘
Enter fullscreen mode Exit fullscreen mode

Anti Join

Contains all rows from df that do not have a match in other_df

df_1.join(df_2, on: "ham", how: "anti")
Enter fullscreen mode Exit fullscreen mode
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
Enter fullscreen mode Exit fullscreen mode

Top comments (0)