Introduction
Today we will see a fundamental concept when working with data and how you represent data.
We will talk about Pivot Table; representing data in columns instead of rows is often more readable.
Example
Let's say you run a coffee shop and want to know each coffee's sales over two years.
We have an OLAP database with a (straightforward) star schema with the fact table coffee_sales and two dimensions, a coffees dimension, and a time (year here) dimension.
CREATE TABLE coffee_dim (
coffee_id INTEGER PRIMARY KEY,
name text,
unit_price NUMERIC
);
CREATE TABLE date_dim (
date_id INTEGER PRIMARY KEY,
year integer
);
CREATE TABLE coffees_sales (
sales_id INTEGER PRIMARY KEY,
coffee_id INTEGER REFERENCES coffee_dim(coffee_id),
date_id INTEGER REFERENCES date_dim(date_id),
quantity INTEGER
)
If we visualize it in an ERD tool, we can see the fact table coffee_sales and the two dimensions like this :
Generate a report over the years
As we said, we want to generate a report about the sales of our coffees over the year. Something like the following data :
It is not readable, isn't it? These data are generated by a query that is more or less looking like this :
SELECT c.name name,d.year as year ,sum(cs.quantity)::INTEGER
FROM coffees_sales cs
JOIN coffees c
USING(coffee_id)
JOIN date_dim d
USING(date_id)
GROUP BY name, year
ORDER BY 1,2
Pivoting our data over the year would be much more readable here. To do that, we have several solutions.
We will dive into two of them, one that works with plain SQL and another that needs Postgres extensions.
The precedent SQL query will be the base of the solutions we will see.
Pivoting with CTE
One of the approaches we could use is CTE. Indeed if we define a CTE for each amount for each year, we can then query the different amounts and name and name the amount column with the year where comes these data.
Let's see what it looks like in a concrete example.
WITH cte_2020 as (SELECT c.name name,d.year as year ,sum(cs.quantity)::INTEGER
FROM coffees_sales cs
JOIN coffee_dim c
USING(coffee_id)
JOIN date_dim d
ON d.date_id = cs.date_id
AND YEAR=2020
GROUP BY name, year
ORDER BY 1,2),
cte_2021 as (SELECT c.name name,d.year as year ,sum(cs.quantity)::INTEGER
FROM coffees_sales cs
JOIN coffee_dim c
USING(coffee_id)
JOIN date_dim d
ON d.date_id = cs.date_id
AND YEAR=2021
GROUP BY name, year
ORDER BY 1,2)
select c1.name, c1.sum as "2020", c2.sum as "2021"
from cte_2020 c1 join cte_2021 c2
on c1.name = c2.name
You probably think it is a lot of effort to not much. Moreover, what if you want 2022 also? Here comes Crosstab!
Pivoting with Crosstab extension
You can create a pivot table in Postgres with the CROSSTAB extension.
First, you will need to import the tablefunc extension of Postgres. And then you have to use the CROSSTAB function like this :
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB(/**/) AS ct (name text, "2020" INTEGER,
"2021" INTEGER)
The query for the name, year, and number of sales will lead to the following final query :
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB($$
SELECT c.name name,d.year as year ,sum(cs.quantity)::INTEGER
FROM coffees_sales cs
JOIN coffee_dim c
USING(coffee_id)
JOIN date_dim d
USING(date_id)
GROUP BY name, year
ORDER BY 1,2
$$)
AS ct (name text, "2020" INTEGER, "2021" INTEGER)
This query will generate this data :
It is a bit clearer and readable.
So with CROSSTAB, it is a lot clearer and easier to respect DRY principles. Nevertheless, it took me some time to understand how to use CROSSTAB and pivot my data.
Whereas it is much more straightforward, thanks to CTE, you have to query your data filter and give the column the name you want.
Conclusion
Pivoting is a crucial technique to know in data analysis. First, we have seen how to pivot our data with CTEs; this method can be applied to any database but is not scalable if we have a lot of different values in the column we want to pivot.
Then we saw how to pivot our data thanks to the CROSSTAB function presented in the Postgres tablefunc EXTENSION. This method is more scalable, and it will be easier to respect the DRY principles.
Nevertheless, the learning curve is a bit more complicated with CROSSTAB. If you do not pivot data often or you do not have a lot of values in the column you want to pivot, it is not worth it. 😎
Keep in Touch
On Twitter : @yet_anotherdev
Top comments (2)
Clear and concise 👌
Hope it helps :)