As most of you probably already know, InterSystems IRIS has included columnar storage in its database since approximately the end of 2022. In today's article we are going to put it to the test against the usual row storage.
Columnar Storage
What is the main characteristic of this type of storage? If we consult the official documentation, we will find a table that summarizes the main characteristics of both types of storage (by rows or by columns):
As you can see, columnar storage is designed primarily for analytical tasks in which queries target specific fields of our table, while row storage is more optimal when a large number of insert, update and delete operations are required, as well as when retrieving complete records.
If you continue reading the documentation you will see how simple it is to configure our table to be able to use columnar storage:
CREATE TABLE table (column type, column2 type2, column3 type3) WITH STORAGETYPE = COLUMNAR
With this command we define all the columns of our table with columnar storage, but we could also opt for a mixed model in which the table uses row storage while certain columns use columnar storage.
This mixed scenario can be interesting in cases where aggregation operations such as sums, averages, etc. are common on specific columns. In that case we can define which column will use columnar storage:
CREATE TABLE table (column type, column2 type2, column3 type3 WITH STORAGETYPE = COLUMNAR)
In the previous example we defined a table with row storage in which one column (column3) uses columnar storage.
Comparison
To compare the time spent by columnar storage and row storage on different queries, we have created a small exercise using a Jupyter Notebook that inserts a series of generated records into two tables: the first with row storage (Test.PurchaseOrderRow) and the second with columnar storage in two of its columns (Test.PurchaseOrderColumnar).
Test.PurchaseOrderRow
CREATE TABLE Test.PurchaseOrderRow (
Reference INTEGER,
Customer VARCHAR(225),
PaymentDate DATE,
Vat NUMERIC(10,2),
Amount NUMERIC(10,2),
Status VARCHAR(10))
Test.PurchaseOrderColumnar
CREATE TABLE Test.PurchaseOrderColumnar (
Reference INTEGER,
Customer VARCHAR(225),
PaymentDate DATE,
Vat NUMERIC(10,2),
Amount NUMERIC(10,2) WITH STORAGETYPE = COLUMNAR,
Status VARCHAR(10) WITH STORAGETYPE = COLUMNAR)
If you download the Open Exchange project and deploy it in your local Docker, you can access the Jupyter Notebook instance and review the file PerformanceTests.ipynb, which is responsible for generating the random data that we are going to store in our tables in several phases, and finally it shows us a graph with the performance of the query operations.
Let's take a quick look at our project configuration:
docker-compose.yml
version: '3.7'
services:
  # iris
  iris:
    init: true
    container_name: iris
    build:
      context: .
      dockerfile: iris/Dockerfile
    ports:
      - 52774:52773
      - 51774:1972
    volumes:
      - ./shared:/shared
    environment:
      - ISC_DATA_DIRECTORY=/shared/durable
    command: --check-caps false --ISCAgent false
  # jupyter notebook
  jupyter:
    build:
      context: .
      dockerfile: jupyter/Dockerfile
    container_name: jupyter
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - JUPYTER_ALLOW_INSECURE_WRITES=true
    volumes:
      - ./jupyter:/home/jovyan
      - ./data:/app/data
    command: "start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''"
We deploy the IRIS and Jupyter containers with Docker, initially configuring IRIS with the namespace "TEST" and the two tables required for the test.
To avoid boring you with code, you can consult the PerformanceTests.ipynb file, from which we connect to IRIS, generate the records to be inserted, and store them in IRIS.
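The data-generation step of the notebook can be sketched roughly as follows. This is a minimal, self-contained sketch, not the notebook's actual code: the field ranges, the 21% VAT rate, and the status values are assumptions chosen to match the table schema (only 'SENT' appears in the source, in the query further below). The commented-out `cursor.executemany` call assumes a DB-API connection to IRIS.

```python
import random
import datetime

# Example statuses; only 'SENT' is confirmed by the article's query
STATUSES = ["SENT", "PENDING", "CANCELLED"]

def generate_orders(n, seed=42):
    """Generate n random purchase-order rows matching the table schema."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    base = datetime.date(2023, 1, 1)
    rows = []
    for ref in range(1, n + 1):
        amount = round(rng.uniform(10, 1000), 2)
        rows.append((
            ref,                                                  # Reference INTEGER
            f"Customer {rng.randint(1, 500)}",                    # Customer VARCHAR(225)
            base + datetime.timedelta(days=rng.randint(0, 364)),  # PaymentDate DATE
            round(amount * 0.21, 2),                              # Vat NUMERIC(10,2), assuming 21% VAT
            amount,                                               # Amount NUMERIC(10,2)
            rng.choice(STATUSES),                                 # Status VARCHAR(10)
        ))
    return rows

rows = generate_orders(1000)
# The rows can then be bulk-inserted through a DB-API cursor, e.g.:
# cursor.executemany(
#     "INSERT INTO Test.PurchaseOrderColumnar "
#     "(Reference, Customer, PaymentDate, Vat, Amount, Status) VALUES (?, ?, ?, ?, ?, ?)",
#     rows)
```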
Test execution
The results have been the following (in seconds):
Inserts:
The insertions are performed in bulk using a parameterized statement:
INSERT INTO Test.PurchaseOrderColumnar (Reference, Customer, PaymentDate, Vat, Amount, Status) VALUES (?, ?, ?, ?, ?, ?)
And the time for each batch of inserts is as follows:
| Total inserts | Row storage | Mixed storage |
| --- | --- | --- |
| 1000 | 0.031733 | 0.041677 |
| 5000 | 0.159338 | 0.185252 |
| 20000 | 0.565775 | 0.642662 |
| 50000 | 1.486459 | 1.747124 |
| 100000 | 2.735016 | 3.265492 |
| 200000 | 5.395032 | 6.382278 |
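Timings like the ones above can be captured with a simple pattern around each batch. This is a sketch rather than the notebook's code; the `cursor.executemany` call in the comment assumes a DB-API connection to IRIS, so the runnable part below times a pure-Python stand-in operation instead.

```python
import time

def time_batch(operation, *args):
    """Run one batch operation and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Against IRIS the operation would be the bulk insert, e.g.:
#   _, seconds = time_batch(cursor.executemany, INSERT_SQL, rows)
# Here we time a self-contained stand-in so the snippet runs anywhere:
rows = [(i, i * 2) for i in range(200_000)]
result, seconds = time_batch(sum, (a for a, b in rows))
print(f"batch took {seconds:.6f} s")
```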
Selects:
The Select launched includes an aggregation function and a condition, both on columns with columnar storage:
SELECT AVG(Amount) FROM Test.PurchaseOrderColumnar WHERE Status = 'SENT'
| Total rows | Row storage | Mixed storage |
| --- | --- | --- |
| 1000 | 0.002039 | 0.001178 |
| 5000 | 0.00328 | 0.000647 |
| 20000 | 0.005493 | 0.001555 |
| 50000 | 0.016616 | 0.000987 |
| 100000 | 0.036112 | 0.001605 |
| 200000 | 0.070909 | 0.002738 |
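The speed-up of the mixed (columnar) layout over pure row storage follows directly from the measured times above:

```python
# Measured SELECT times from the table above, in seconds
row_times   = {1000: 0.002039, 5000: 0.00328, 20000: 0.005493,
               50000: 0.016616, 100000: 0.036112, 200000: 0.070909}
mixed_times = {1000: 0.001178, 5000: 0.000647, 20000: 0.001555,
               50000: 0.000987, 100000: 0.001605, 200000: 0.002738}

for n in row_times:
    speedup = row_times[n] / mixed_times[n]
    print(f"{n:>6} rows: columnar is {speedup:.1f}x faster")
# At 200,000 rows the ratio is roughly 26x, and it grows with row count.
```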
Conclusions
As you can see from the results obtained, the behavior is exactly what the documentation indicates. Including columns with columnar storage slightly penalized insert performance (about 18% slower in our example), while queries on those same columns dramatically improved response time (roughly 26 times faster for the largest batch).
It is undoubtedly something to take into account when planning the development of any application.