The Fastest Way to Get the Most Recent Row Per Group in Postgres

Jan 7, 2023 · 392 words · 2 minutes read datapostgressql

A common problem for backend engineers, data scientists, and data engineers is to find the most recent record for each group, or a single group. For example, the database might have a table of user login events and you want to find the latest login for each user.

Often people will write a SQL query with the ROW NUMBER window function in a CTE or sub query , but can be faster to use SELECT DISTINCT ON.

Setup

If you want to test this for yourself, you can follow along

First we need some data. We can generate synthetic data for testing and insert it into a table.

CREATE TABLE login_event AS
SELECT
  floor(random() * (10000-0+1) + 0)::int AS user,
  random() AS timestamp
FROM generate_series(1, 1000000);

Then we can benchmark the queries below by running them with EXPLAIN ANALYZE.

The Slow Way

The often recommended, yet slow, method is to use a CTE or sub query to add a row number (ordered by the timestamp column) and then select the rows where the row number is one.

Using our login_event table, the query would look like this:

WITH tbl AS (
  SELECT
    "user",
    "timestamp",
    row_number() over(partition by "user" order by "timestamp" desc) as rownum
  FROM login_event
)
SELECT
  "user",
  "timestamp"
FROM tbl
WHERE rownum=1;

This executes in about 534ms on my laptop.

The Fast Way

The fastest way to select the most recent row is to use SELECT DISTINCT ON. To use SELECT DISTINCT ON, provide which column(s) should be used for the DISTINCT clause (eg user). Then, order the table by the DISTINCT column and the timestamp column. For example,

SELECT DISTINCT ON ("user")
  "user",
  "timestamp"
FROM login_event
ORDER BY "user", "timestamp" DESC;

Executes in about 323ms.

This is not only faster, but also less code!

The Fastest Way

Notice when we created the table, we didn’t add indexes. Using the proper indexes can dramatically speed up the queries. Adding indexes to the columns used in the DISTINCT and ORDER BY clause can improve query performance.

CREATE INDEX login_event_user_six ON login_event ("user");
CREATE INDEX login_event_user_timestamp_desc_fix ON login_event ("timestamp" DESC);

Adding these indexes changes the slow query time from 534ms to 681ms (it’s actually slower) and the fast query time from 323ms to 98ms!

Closing

The fastest way to get the most recent row per group is to use SELECT DISTINCT ON.