Mircea Cadariu

Retrieving the latest row per group from PostgreSQL

In this post I'll show you several approaches for writing SQL queries that retrieve the latest row per group from the database. I rank them according to how performant I found them to be on my generated dataset. The main idea of using SQL (a declarative language) to retrieve data is that you don't have to provide precise instructions on how to retrieve it; you just express what you want in the form of a query, and a database component called the planner determines the best way to get the data. But as we shall see, different ways of expressing the same need can have quite an impact on performance.

To explain why the alternatives presented are slower or faster, I use visualisations of the Postgres explain plans, showing how the database processes the query internally. As you will see, some changes will make quite a difference - we will go from hundreds of ms to about 5ms.

To support the story we'll use the following scenario: we have a set of meters that regularly record readings and store them in the database. We have to retrieve the latest reading for every meter. I'm using Postgres version 16.2 (default configuration) on my laptop.

Setup

First, we create the table of meters.

create table meters
(
    id          bigserial,
    -- other columns which I omitted for simplicity
    primary key (id)
);

Then, we create the table of readings.

create table readings
(
    id           bigserial,
    meter_id     bigint,
    date         date,  
    reading      double precision,
    primary key (id),

    constraint fk__readings_meters foreign key (meter_id) references meters (id)
);

For referential integrity, the two tables are linked with a foreign key constraint.

Let's now populate our tables with some rows: we add 500 meters, all having one reading every day, for a year. For generating test data like this, you can use the generate_series function:

insert into meters select * from generate_series(1, 500) seq;

insert into readings(meter_id, date, reading)
select m.id, seq, random() from generate_series('2024-02-01'::date, '2025-02-01'::date, '1 day'::interval) seq, meters m;
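
If you want to double-check the generated volume, a quick sanity count should report roughly 183k rows in the readings table (500 meters, each with about a year of daily readings):

select count(*) from readings;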

Our tables are now populated and ready to be queried. Let's discuss the first approach.

1. Window functions: 250ms

If we use window functions to write this query, we first express that we want to partition our readings by meter ID, and then use the row_number function to assign an "order number" to every reading based on how recent it is within its partition. We then return only the rows that are at the top of every partition (row number equal to 1). Here's the query:

explain(analyse, buffers)
with readings_with_rownums as (
    select
        meters.id        as meter_id,
        readings.reading as reading,
        row_number() over (partition by meters.id order by readings.date desc) as rownum
    from
        readings
    join
        meters on meters.id = readings.meter_id
)
select
    meter_id,
    reading
from 
    readings_with_rownums
where
    readings_with_rownums.rownum = 1;

This runs in about 250ms. It's not the worst, but it does mean that users will not perceive the response as instantaneous (commonly taken to mean under 100ms). Let's try to do better.

Time to have a look at the explain plan to understand how the database retrieves our data and get some clues. As you can see in how I formulated the query above, I'm using the buffers option, which adds useful detail about how many blocks each node touches. There are several great tools for visualising explain plans; for the rest of this blog I'm going to use the one from Dalibo.

[explain plan visualisation]

I've opened the relevant explain plan nodes that I want to explore further; they are in the upper part of the image above. We can quickly discern why this is not performing very well: the query efficiency is low. It discards 99% of the rows it has read right before returning the final results. You can see that in the Sort node, where 183k rows arrive from the nodes below, but only 500 are actually returned in the final result set. Retrieving our results like this also scales poorly, being very sensitive to growth of the dataset.

Let's consider resource utilisation for a moment as well. When the database performs sorts like this, especially over and over again (as in an application used by several users concurrently) and on potentially very large tables, you will see a spike in CPU utilisation, as sorting is a CPU-bound operation. If possible, we should preserve the CPU for situations where there really is no alternative. Let's proceed then and look at other options.

2. DISTINCT ON: 250ms

For details about how this works, you can have a look at the DISTINCT clause section in the Postgres docs. I find this approach rather elegant, actually. Here's how it looks applied to our use-case:

explain (analyse, buffers)
select 
    distinct on (meters.id) 
    meters.id        as meter_id, 
    readings.reading as reading
from 
    meters 
join 
    readings on meters.id = readings.meter_id
order by 
    meters.id, 
    readings.date desc;

I didn't get a noticeable difference in how fast it runs compared with the previous approach using window functions. The explain plan looks quite similar to the one we have seen before. Have a look.

[explain plan visualisation]

We can indeed observe a difference in that it doesn't have the WindowAgg node anymore, but this didn't get us very far. It is still inefficient, reading a lot of rows and discarding the majority before returning the results. One thing I noticed in the text version of the explain plan, before visualising it, is the following detail, which is worth a closer look:

Sort Method: external merge  Disk: 3968kB

This means the sort operation is being slowed down because it is forced to spill to disk. This happens when the work_mem setting is too low for the size of the dataset: the sort can't be done entirely in memory, so Postgres has no choice but to "spill" the operation to disk. The default work_mem in Postgres is 4MB, which in this case proves to be too little.

Let's increase this setting to make sure the sort can be done in memory. Note that if you give it too much, it can become problematic: if the load increases unexpectedly, you can run out of memory and start seeing errors - it's a balancing act.

set work_mem='16MB';

We don't have to restart Postgres for this setting to take effect; for some other settings, we do.
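
To confirm the new value is active in the current session, we can simply ask for it:

show work_mem;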

We retry the query above, and we can confirm that the sort now happens in memory; the proof is that we can see the following detail in the explain plan:

Sort Method: quicksort  Memory: 14951kB

Cool! Did this make a difference though? Not really - we're still at ~250ms response time. We still have other options though.

One thing to note at this point. This experiment is on my laptop, where the CPU, memory and storage are all on one machine. But when using Amazon RDS, for example, the storage relies on another service - EBS - which is not on the same machine as your database, but separated from it by the network. In those scenarios, avoiding disk spills like the one I showed you above makes even more of a difference, because the data has to "travel" more between memory and disk. As a side note, AWS recently introduced the Optimized Reads option (https://aws.amazon.com/blogs/database/introducing-optimized-reads-for-amazon-rds-for-postgresql/), where for these operations the database instance can use a fast local NVMe SSD disk instead of a network-attached EBS volume.

3. MAX(DATE): 115ms

Time to look at another approach. How about this, using a SQL aggregate function?

explain(analyse, buffers)
with latest_reads_per_meter as (
    select
        readings.meter_id   as meter_id,
        max(date)           as reading_date
    from
        readings
    group by
        readings.meter_id
)
select
    readings.meter_id,
    readings.reading
from
    meters
join
    readings on meters.id = readings.meter_id
join
    latest_reads_per_meter lrpm on lrpm.meter_id = meters.id
    and readings.date = lrpm.reading_date;

As usual, the explain plan:

[explain plan visualisation]

Hmm, this looks a bit different from what we have seen before - in a good way! Let's understand why. It's now doing the sort much earlier, and so it discards the irrelevant rows earlier too; it doesn't "carry them over" all the way through the retrieval process. This also consumes less memory, because the intermediate results are smaller. However, does it speed up our query?

Indeed it does! Have a look at this.

Execution Time: 114.942 ms

Finally, some solid progress. We cut the runtime we started with in half. It's now very close to the limit where it can be perceived as instantaneous. But can we do better? You bet - by a whole order of magnitude, even.

4. Loose index scans: 14ms

Let's have a look at how the loose index scan works - by now you're probably wondering: when are you going to bring in the indexes?!

For the index definition, the order of the columns matters. They have to be defined exactly in the order I'm about to show you: the "grouping" column first, and then the column used for determining the "latest" row within a group.

create index idx on readings(meter_id, date);
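
As a contrast, a hypothetical index with the columns swapped (idx_wrong_order below is purely illustrative) would likely not help: with date as the leading column, the entries for a single meter are scattered across the whole index, so the planner can no longer jump straight to the newest reading of a given meter.

-- illustration only: this column order does not support "latest per meter" lookups well
create index idx_wrong_order on readings(date, meter_id);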

We can then write the query like this.

explain (analyse, buffers)
select
    meters.id        as meter_id,
    readings.reading as reading
from
    meters
cross join lateral (
    select
        meter_id, 
        reading
    from
        readings
    where
        readings.meter_id = meters.id
    order by
        date desc
    limit 1
) readings;

Execution Time: 13.814 ms

[explain plan visualisation]

This is much faster. Let's see why. For starters, you don't see the 183k rows anywhere in the explain plan. We are not sorting anything anymore as part of the query, because the index keeps our data sorted - at the expense of some insertion overhead.

But, let's push the envelope, since we're here anyway. Let's open the IO & Buffers tab of the index scan node in the above explain plan and have a look in there. Here it is:

[explain plan - IO & Buffers tab]

Let's take a step back and consider how index-based retrieval works at a high level. As a first step, the B-tree index is traversed to determine which rows have to be retrieved to satisfy the query; after this step, Postgres has to actually go ahead and fetch those rows from somewhere else, namely the table (also called the heap).

As we can see from the explain plan, this amounts to 2000 blocks read, about 16MiB (2000 * 8 kibibytes). A block (or page) is the fundamental unit by which Postgres stores data; the Postgres documentation on database page layout has the details.
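
If you want to confirm the page size of your own installation, it is exposed as a read-only setting:

show block_size;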

But wait a minute, let's try to put the 2000 blocks in some context - how many of these blocks does the table readings have in total on disk?

SELECT relpages FROM pg_class WHERE relname = 'readings';
 relpages 
----------
     1350

It seems we're reading more blocks with this index-based approach than if we simply did a sequential scan of the entire table, sorted, and discarded, like we saw above with the window functions. This trade-off is something to watch out for: the index-based approach is faster, but needs more I/O (more blocks touched) to get there. It is good to know that, for example, on Amazon Aurora and other cloud databases, you pay for every block that has to be retrieved from storage because it couldn't be found in memory. A nice cautionary tale about keeping an eye on I/O on Aurora would be this one. However, this situation is a byproduct of how "narrow" my tables are (a small number of columns) - it's an artificial setup after all. With more columns, the table would contain many more pages, so the blocks touched while traversing the B-tree would amount to comparatively far fewer.

Let's see if we can reduce the number of blocks touched. You might wonder, can we avoid the extra step (reading from the table after the index)? Exactly!

5. Loose index-only scans: 5ms

We can implement an index-only scan. To achieve this, we use the include option when creating the same index as above, like so:

create index idx_with_including on readings(meter_id, date) include (reading);

Let's retry the query and look at the relevant tabs again.

[explain plan visualisation]

[explain plan visualisation]

Two important things to observe here. First, notice the Heap Fetches: 0, which indicates that it no longer goes to the heap to get the rows - they are already in the index, so it can get them from there. Secondly, the total number of blocks is now ~1500, so 500 fewer than before. This confirms again that it doesn't read anything from the heap.

It can happen that you try this out and don't see Heap Fetches: 0, and you might wonder why, since you've set everything up correctly. This happens when the visibility map is not up to date, and Postgres has no other option but to go to the heap to get the visibility information about a row, because that is where it is stored. As the visibility map is kept up to date by the autovacuum process, it is important to regularly revisit its configuration settings to make sure it is able to keep up.
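
On a test dataset like this one, a one-off manual vacuum (just a quick sketch of a fix; in production you'd rather make sure autovacuum keeps up, as mentioned above) brings the visibility map up to date, after which the index-only scan should report Heap Fetches: 0 again:

vacuum readings;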

Let's look at the final result. How fast did we get it?

Execution Time: 5.448 ms

This is very nice. Let's now understand a bit better how exactly the B-tree index is traversed to retrieve the data. Here's a diagram. In Postgres, we have three types of nodes in a B-tree: the root node, intermediate nodes (used only for traversal), and leaf nodes (which hold pointers to tuples on the heap, plus any included data).

[B-tree traversal diagram]

For each row returned in the final result, it does 3 page reads: first the root page, then an intermediate (internal) B-tree page, and finally the leaf page, from which it collects the reading. I have marked these steps with 1, 2, 3 next to the arrows that represent descending down the tree.
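
If you'd like to verify the depth of the tree on your own data, the pageinspect extension (assuming you are able to install it) exposes the B-tree metapage; the level column tells you how many levels sit above the leaves, so a descent touches level + 1 pages:

create extension if not exists pageinspect;
select root, level from bt_metap('idx_with_including');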

Note though that index-only scans are not a panacea and should be used judiciously: including extra columns increases the overall size of the index, which might increase the time needed to traverse it for other use-cases.
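
To get a feel for that overhead in our case, we can compare the on-disk sizes of the two indexes we created earlier (assuming both are still around):

select
    pg_size_pretty(pg_relation_size('idx'))                as idx_size,
    pg_size_pretty(pg_relation_size('idx_with_including')) as idx_with_including_size;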

Conclusion

After quite the journey, we've arrived at the fastest solution - a loose index-only scan - which means it's the end of the line for this post. Let me know if you've tried it out and it improved the performance of your queries!

Note though that, as they say, there is no free lunch: every additional index gives the database more work to do on every insert, so you will have to decide if it's worth it. The results also depend on the data distribution - it's so fast here because our dataset has many readings for every meter, but things might be different in your setup. As a last remark, if you keep the indexes added for the later queries and retry the initial queries of the post, those will be faster too, though not as fast as the loose index scans.

References

The first time I heard about loose index scanning was from this answer on SO.

Here are some other great resources on the topic:

There is also a Postgres wiki entry on the topic.

There's an ongoing discussion about implementing skip scans in Postgres.

Thanks for reading!
