The Fastest Way to Get the Most Recent Row Per Group in Postgres
Jan 7, 2023 · 392 words · 2 minutes read
A common problem for backend engineers, data scientists, and data engineers is to find the most recent record for each group, or a single group. For example, the database might have a table of user login events and you want to find the latest login for each user.
Often people will write a SQL query with the ROW NUMBER
window function in a CTE or sub query , but can be faster to use SELECT DISTINCT ON
.
Setup
If you want to test this for yourself, you can follow along
First we need some data. We can generate synthetic data for testing and insert it into a table.
CREATE TABLE login_event AS
SELECT
floor(random() * (10000-0+1) + 0)::int AS user,
random() AS timestamp
FROM generate_series(1, 1000000);
Then we can benchmark the queries below by running them with EXPLAIN ANALYZE
.
The Slow Way
The often recommended, yet slow, method is to use a CTE or sub query to add a row number (ordered by the timestamp column) and then select the rows where the row number is one.
Using our login_event
table, the query would look like this:
WITH tbl AS (
SELECT
"user",
"timestamp",
row_number() over(partition by "user" order by "timestamp" desc) as rownum
FROM login_event
)
SELECT
"user",
"timestamp"
FROM tbl
WHERE rownum=1;
This executes in about 534ms on my laptop.
The Fast Way
The fastest way to select the most recent row is to use SELECT DISTINCT ON
.
To use SELECT DISTINCT ON
, provide which column(s) should be used for the DISTINCT
clause (eg user
).
Then, order the table by the DISTINCT
column and the timestamp column.
For example,
SELECT DISTINCT ON ("user")
"user",
"timestamp"
FROM login_event
ORDER BY "user", "timestamp" DESC;
Executes in about 323ms.
This is not only faster, but also less code!
The Fastest Way
Notice when we created the table, we didn’t add indexes.
Using the proper indexes can dramatically speed up the queries.
Adding indexes to the columns used in the DISTINCT
and ORDER BY
clause can improve query performance.
CREATE INDEX login_event_user_six ON login_event ("user");
CREATE INDEX login_event_user_timestamp_desc_fix ON login_event ("timestamp" DESC);
Adding these indexes changes the slow query time from 534ms to 681ms (it’s actually slower) and the fast query time from 323ms to 98ms!
Closing
The fastest way to get the most recent row per group is to use SELECT DISTINCT ON
.