Discretizing data in Postgres using width_bucket

Mar 18, 2019 · 310 words · 2 minutes read postgres • sql

Sometimes you want to match a value to a range – for instance suppose you want to map an age to one of these age categories: 0-14, 15-24, 24-54, 55-64, 65+. If you’re working in the database you might write this using a CASE statement like this:

SELECT
  CASE
    WHEN age BETWEEN  0 AND 14 THEN 0
    WHEN age BETWEEN 15 AND 24 THEN 1
    WHEN age BETWEEN 25 AND 54 THEN 2
    WHEN age BETWEEN 55 AND 64 THEN 3
    ELSE 4
  END AS age_category

But if you’re using Postgres you have another option – the width_bucket function.

The width_bucket function has two different forms. The first argument is always a value. Next you can provide lower and upper bounds and the number of buckets you want in that range, or provide an array of thresholds.

Here’s an example using the first form:

SELECT width_bucket(x, 0, 100, 4);

The statement above would return 0 if x is in the range (-Inf, 0) (i.e. less than 0). If x is [0, 25) (i.e. at least 0 and less than 25) then the statement would return 1, if x is [25, 50) then 2, etc. If x is [100, Inf) then the returned value would be 5..

Alternatively you could use the second form to accomplish the same thing:

SELECT width_bucket(x, ARRAY[0, 25, 50, 75, 100])

Using Postgres’s width_bucket function with an array of thresholds is especially useful when you have buckets are are unevenly sized.

Here is how you would use width_bucket for the age categorization example above:

SELECT width_bucket(age, ARRAY[15, 25, 55, 65]) AS age_category;

As you can see this is less code than using a CASE statement. I’ve found this especially handy for discretizing data when I’m using Python and don’t want to import Numpy or there are multiple clients/use cases and I want to avoid multiple implementations.