As most of you probably already know, InterSystems IRIS has included columnar storage in its database since approximately the end of 2022. In today's article we are going to put it to the test against the usual row storage.
Columnar Storage
What is the main characteristic of this type of storage? If we consult the official documentation, we will find a table that summarizes the main characteristics of both types of storage (by rows or by columns):
As you can see, columnar storage is designed primarily for analytical tasks in which queries target specific fields of our table, while row storage is more optimal when a large number of insert, update and delete operations are required, as well as when retrieving complete records.
If you continue reading the documentation you will see how simple it is to configure our table to be able to use columnar storage:
CREATE TABLE table (column type, column2 type2, column3 type3) WITH STORAGETYPE = COLUMNAR
With this command we define all the columns of our table with columnar storage, but we could also opt for a mixed model in which the table uses row storage while certain columns use columnar storage.
This mixed scenario can be interesting in cases where aggregation operations such as sums, averages, etc. are common on specific columns. In that case we can define which column will use columnar storage:
CREATE TABLE table (column type, column2 type2, column3 type3 WITH STORAGETYPE = COLUMNAR)
In the previous example we defined a table with row storage in which one column (column3) uses columnar storage.
Comparison
To compare the time spent by columnar storage and row storage on different queries, we have created a small exercise using a Jupyter Notebook that inserts a series of generated records into two tables: the first with row storage (Test.PurchaseOrderRow) and the second with columnar storage in two of its columns (Test.PurchaseOrderColumnar).
Test.PurchaseOrderRow
CREATE TABLE Test.PurchaseOrderRow (
Reference INTEGER,
Customer VARCHAR(225),
PaymentDate DATE,
Vat NUMERIC(10,2),
Amount NUMERIC(10,2),
Status VARCHAR(10))
Test.PurchaseOrderColumnar
CREATE TABLE Test.PurchaseOrderColumnar (
Reference INTEGER,
Customer VARCHAR(225),
PaymentDate DATE,
Vat NUMERIC(10,2),
Amount NUMERIC(10,2) WITH STORAGETYPE = COLUMNAR,
Status VARCHAR(10) WITH STORAGETYPE = COLUMNAR)
If you download the Open Exchange project and deploy it in your local Docker, you can access the Jupyter Notebook instance and review the file PerformanceTests.ipynb, which is responsible for generating the random data that we are going to store in our tables in several phases, and finally it shows us a graph with the performance of the query operations.
Let's take a quick look at our project configuration:
docker-compose.yml
version: '3.7'
services:
  # iris
  iris:
    init: true
    container_name: iris
    build:
      context: .
      dockerfile: iris/Dockerfile
    ports:
      - 52774:52773
      - 51774:1972
    volumes:
      - ./shared:/shared
    environment:
      - ISC_DATA_DIRECTORY=/shared/durable
    command: --check-caps false --ISCAgent false
  # jupyter notebook
  jupyter:
    build:
      context: .
      dockerfile: jupyter/Dockerfile
    container_name: jupyter
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - JUPYTER_ALLOW_INSECURE_WRITES=true
    volumes:
      - ./jupyter:/home/jovyan
      - ./data:/app/data
    command: "start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''"
We deploy the IRIS and Jupyter containers with Docker, initially configuring IRIS with the namespace "TEST" and the two tables required for the test.
To avoid boring you with code, you can consult the PerformanceTests.ipynb file, from which we connect to IRIS, generate the records to be inserted, and store them in IRIS.
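The data-generation step of the notebook can be sketched roughly as follows. This is a minimal, self-contained sketch, not the notebook's actual code: the field ranges, the 21% VAT rate, and the status values are assumptions chosen to match the table schema (only 'SENT' appears in the source, in the query further below). The commented-out `cursor.executemany` call assumes a DB-API connection to IRIS.

```python
import random
import datetime

# Example statuses; only 'SENT' is confirmed by the article's query
STATUSES = ["SENT", "PENDING", "CANCELLED"]

def generate_orders(n, seed=42):
    """Generate n random purchase-order rows matching the table schema."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    base = datetime.date(2023, 1, 1)
    rows = []
    for ref in range(1, n + 1):
        amount = round(rng.uniform(10, 1000), 2)
        rows.append((
            ref,                                                  # Reference INTEGER
            f"Customer {rng.randint(1, 500)}",                    # Customer VARCHAR(225)
            base + datetime.timedelta(days=rng.randint(0, 364)),  # PaymentDate DATE
            round(amount * 0.21, 2),                              # Vat NUMERIC(10,2), assuming 21% VAT
            amount,                                               # Amount NUMERIC(10,2)
            rng.choice(STATUSES),                                 # Status VARCHAR(10)
        ))
    return rows

rows = generate_orders(1000)
# The rows can then be bulk-inserted through a DB-API cursor, e.g.:
# cursor.executemany(
#     "INSERT INTO Test.PurchaseOrderColumnar "
#     "(Reference, Customer, PaymentDate, Vat, Amount, Status) VALUES (?, ?, ?, ?, ?, ?)",
#     rows)
```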
Test execution
The results have been the following (in seconds):
Inserts:
The insertions are performed in bulk using a parameterized statement:
INSERT INTO Test.PurchaseOrderColumnar (Reference, Customer, PaymentDate, Vat, Amount, Status) VALUES (?, ?, ?, ?, ?, ?)
And the time for each batch of inserts is as follows:
| Total inserts | Row storage | Mixed storage |
| --- | --- | --- |
| 1000 | 0.031733 | 0.041677 |
| 5000 | 0.159338 | 0.185252 |
| 20000 | 0.565775 | 0.642662 |
| 50000 | 1.486459 | 1.747124 |
| 100000 | 2.735016 | 3.265492 |
| 200000 | 5.395032 | 6.382278 |
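Timings like the ones above can be captured with a simple pattern around each batch. This is a sketch rather than the notebook's code; the `cursor.executemany` call in the comment assumes a DB-API connection to IRIS, so the runnable part below times a pure-Python stand-in operation instead.

```python
import time

def time_batch(operation, *args):
    """Run one batch operation and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Against IRIS the operation would be the bulk insert, e.g.:
#   _, seconds = time_batch(cursor.executemany, INSERT_SQL, rows)
# Here we time a self-contained stand-in so the snippet runs anywhere:
rows = [(i, i * 2) for i in range(200_000)]
result, seconds = time_batch(sum, (a for a, b in rows))
print(f"batch took {seconds:.6f} s")
```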
Selects:
The Select launched includes an aggregation function and a condition, both on columns with columnar storage:
SELECT AVG(Amount) FROM Test.PurchaseOrderColumnar WHERE Status = 'SENT'
| Total rows | Row storage | Mixed storage |
| --- | --- | --- |
| 1000 | 0.002039 | 0.001178 |
| 5000 | 0.00328 | 0.000647 |
| 20000 | 0.005493 | 0.001555 |
| 50000 | 0.016616 | 0.000987 |
| 100000 | 0.036112 | 0.001605 |
| 200000 | 0.070909 | 0.002738 |
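The speed-up of the mixed (columnar) layout over pure row storage follows directly from the measured times above:

```python
# Measured SELECT times from the table above, in seconds
row_times   = {1000: 0.002039, 5000: 0.00328, 20000: 0.005493,
               50000: 0.016616, 100000: 0.036112, 200000: 0.070909}
mixed_times = {1000: 0.001178, 5000: 0.000647, 20000: 0.001555,
               50000: 0.000987, 100000: 0.001605, 200000: 0.002738}

for n in row_times:
    speedup = row_times[n] / mixed_times[n]
    print(f"{n:>6} rows: columnar is {speedup:.1f}x faster")
# At 200,000 rows the ratio is roughly 26x, and it grows with row count.
```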
Conclusions
As you can see from the results obtained, the behavior is exactly what the documentation indicates. Including columns with columnar storage slightly penalized insert performance (about 18% slower in our example), while queries on those same columns dramatically improved response time (roughly 26 times faster for the largest batch).
It is undoubtedly something to take into account when planning the development of any application.