The Fastest Way to Get the Most Recent Row Per Group in Postgres
Jan 7, 2023 · 364 words · 2 minutes read
A common problem for backend engineers, data scientists, and data engineers is to find the most recent record for each group, or a single group. For example, the database might have a table of user login events and you want to find the latest login for each user.
Often people will write a SQL query with the ROW NUMBER
window function in a CTE or sub query , but it is faster to use SELECT DISTINCT ON
.
Setup
If you want to test this for yourself, you can follow along
First we need some data. We can generate synthetic data for testing and insert it into a table.
CREATE TABLE login_event AS
SELECT
random() AS user,
random() AS timestamp
FROM generate_series(1, 1000000);
Then we can benchmark the queries below by running them with EXPLAIN ANALYZE
.
The Slow Way
The often recommended, yet slow, method is to use a CTE or sub query to add a row number (ordered by the timestamp column) and then select the rows where the row number is one.
Using our login_event
table, the query would look like this:
WITH tbl AS (
SELECT
user,
timestamp,
row number() over(partition by user order by timestamp desc) as rownum
FROM login_event
GROUP BY user
ORDER BY timestamp DESC
)
SELECT
user,
timestamp
FROM tbl
WHERE rownum=1;
The Fast Way
The fastest way to select the most recent row is to use SELECT DISTINCT ON
.
To use SELECT DISTINCT ON
, provide which column(s) should be used for the DISTINCT
clause (eg user
).
Then, order the table by the DISTINCT
column and the timestamp column.
For example,
SELECT DISTINCT ON (user)
user,
timestamp
FROM login_event
ORDER BY user, timestamp DESC;
This is not only faster, but also less code!
The Fastest Way
Notice when we created the table, we didn’t add indexes. Using the proper indexes can dramatically speed up the queries.
CREATE INDEX login_event_user_six ON login_event (user);
CREATE INDEX login_event_user_timestamp_desc_fix ON login_event (timestamp DESC);
Adding these indexes changes the slow query time from X to Y and the fast query time from X to Y.
Closing
The fastest way to get the most recent row per group is to use SELECT DISTINCT ON
.