Discretizing data in Postgres using width_bucket
Mar 18, 2019 · 310 words · 2 minutes read
Sometimes you want to match a value to a range – for instance suppose you want to map an age to one of these age categories: 0-14, 15-24, 24-54, 55-64, 65+. If you’re working in the database you might write this using a CASE
statement like this:
SELECT
CASE
WHEN age BETWEEN 0 AND 14 THEN 0
WHEN age BETWEEN 15 AND 24 THEN 1
WHEN age BETWEEN 25 AND 54 THEN 2
WHEN age BETWEEN 55 AND 64 THEN 3
ELSE 4
END AS age_category
But if you’re using Postgres you have another option – the width_bucket
function.
The width_bucket
function has two different forms. The first argument is always a value. Next you can provide lower and upper bounds and the number of buckets you want in that range, or provide an array of thresholds.
Here’s an example using the first form:
SELECT width_bucket(x, 0, 100, 4);
The statement above would return 0 if x
is in the range (-Inf, 0) (i.e. less than 0). If x
is [0, 25) (i.e. at least 0 and less than 25) then the statement would return 1, if x
is [25, 50) then 2, etc. If x
is [100, Inf) then the returned value would be 5..
Alternatively you could use the second form to accomplish the same thing:
SELECT width_bucket(x, ARRAY[0, 25, 50, 75, 100])
Using Postgres’s width_bucket
function with an array of thresholds is especially useful when you have buckets are are unevenly sized.
Here is how you would use width_bucket
for the age categorization example above:
SELECT width_bucket(age, ARRAY[15, 25, 55, 65]) AS age_category;
As you can see this is less code than using a CASE
statement. I’ve found this especially handy for discretizing data when I’m using Python and don’t want to import Numpy or there are multiple clients/use cases and I want to avoid multiple implementations.