Flattening JSON(b) in Postgres
Jan 6, 2022 · 1067 words · 6 minutes read
Developers like to use JSON and when they store data, often don’t normalize it. Thus, it’s not unusual to encounter JSON(b) fields in the database. For data analysts, data scientists, and machine learning engineers unnesting, or flattening, the data is often a prerequisite to subsequent analysis.
Wrangling JSON columns in the database can be challenging. Often the goal is to expand JSON arrays into new rows and unnest JSON objects into new columns. Fortunately Postgres has several built-in functions that can be composed to tackle normalizing complex JSON in the database.
Dataset
The table below illustrates a common scenario in which a table (we’ll refer to it as city
) contains a JSON or JSON(b) column (i.e., coordinates
).
+---+-------------------+-------+---------------------------------+
|idx|name |country|coordinates |
+---+-------------------+-------+---------------------------------+
|1 |Sant Julià de Lòria|AD |{"lat": 42.46372, "lng": 1.49129}|
|2 |Pas de la Casa |AD |{"lat": 42.54277, "lng": 1.73361}|
|3 |Ordino |AD |{"lat": 42.55623, "lng": 1.53319}|
|4 |les Escaldes |AD |{"lat": 42.50729, "lng": 1.53414}|
|5 |la Massana |AD |{"lat": 42.54499, "lng": 1.51483}|
|6 |Encamp |AD |{"lat": 42.53474, "lng": 1.58014}|
+---+-------------------+-------+---------------------------------+
While the data is stored as a JSON object the data analyst, data scientist, or ML engineer would prefer lat
and lng
were normalized into columns.
Fortunately we can structure JSON objects using the jsonb_to_record
function that’s built in to Postgres.
Flattening JSON objects using jsonb_to_record
If the column in question contains a JSON(b) object, you can use the build-in function jsonb_to_record
(or json_to_record
) to normalize the key-value pairs into columns.
The function takes a JSON object, or column containing JSON objects, and returns a record.
The record is expanded by matching a user-provided composite type in the AS
expression.
The composite type expression is simply the JSON object key name (which is case sensitive) and the Postgres data type to use for it.
The JSON object can contain more key-value pairs than you define in the composite type and they will be excluded. If the composite type defines a key that is missing from the JSON object, it will have a value of null
.
The Postgres documentation shows how to use json_to_record
given a JSON object:
SELECT *
FROM json_to_record('{"a":1,"b":[1,2,3],"c":"bar"}') AS
x(a int, b text, d text)
a | b | d
---+---------+---
1 | [1,2,3] |
In this example the function takes in a JSON object, {"a":1,"b":[1,2,3],"c":"bar"}
.
The user-defined composite type, x
, defines mappings for keys a
, b
, and d
.
Because the JSON object contains keys a
and b
, and the data types match those in the composite type, they are extracted as part of the record.
The composite type doesn’t define a mapping for c
, so it isn’t extracted from the object.
Additionally, the composite type defines d
, which is missing from the object and thus isn’t extracted.
The example in the documentation is nice but usually we want to unnest JSON in a table column, not a string.
Fortunately we can use the same json_to_record
and jsonb_to_record
functions on columns containing JSON objects.
Using the city
table above as an example, we can flatten the JSON object in the coordinates
column.
SELECT
city.idx,
city."name",
city.country,
coord.lat,
coord.lng
FROM
city,
jsonb_to_record(coordinates) AS coord(lat numeric, lng numeric);
+---+-------------------+-------+--------+-------+
|idx|name |country|lat |lng |
+---+-------------------+-------+--------+-------+
|1 |Sant Julià de Lòria|AD |42.46372|1.49129|
|2 |Pas de la Casa |AD |42.54277|1.73361|
|3 |Ordino |AD |42.55623|1.53319|
|4 |les Escaldes |AD |42.50729|1.53414|
|5 |la Massana |AD |42.54499|1.51483|
|6 |Encamp |AD |42.53474|1.58014|
+---+-------------------+-------+--------+-------+
In this example we use the jsonb_to_record
on the coordinates
column to flatten the object into two columns.
The composite type coord
matches the object keys and data types.
We’re then able to reference the record matched by the composite type in the SELECT
statement to return lat
and lng
as separate columns.
Unnesting JSON arrays using jsonb_to_recordset
If your data is a JSON array instead of a JSON object you can’t use jsonb_to_record
to normalize it.
Instead you want to use the related function, jsonb_to_recordset
.
The built-in functions json_to_recordset
and jsonb_to_recordset
are very similar to their json_to_record
and jsonb_to_record
counterparts.
The difference is the *_to_recordset
functions operate on JSON arrays rather than JSON objects.
To illustrate this, consider the table below (called country
).
The table is similar to the city
table above except instead of one row per city there is one row per country.
The cities
column contains a JSON array where each entry is a city, stored as a JSON object.
+---+------------+---------------------------------------------------------------+
|idx|country_name|cities |
+---+------------+---------------------------------------------------------------+
|1 |AU |[{"name": "York", "coordinates": {"lat": -31.88809, "lng": 1...|
|2 |AT |[{"name": "Neu-Guntramsdorf", "coordinates": {"lat": 48.0642...|
|3 |AR |[{"name": "Zárate", "coordinates": {"lat": -34.09814, "lng":...|
|4 |AG |[{"name": "Saint John’s", "coordinates": {"lat": 17.12096, "...|
|5 |AO |[{"name": "Saurimo", "coordinates": {"lat": -9.66078, "lng":...|
|6 |AQ |[{"name": "McMurdo Station", "coordinates": {"lat": -77.846,...|
+---+------------+---------------------------------------------------------------+
Using the jsonb_to_recordset
function on the cities
column, we can expand the JSON array into separate rows.
Like before, we provide a composite type with the key names and data types.
SELECT
idx,
country_name,
city.name,
city.coordinates
FROM country,
jsonb_to_recordset(cities) AS city(name text, coordinates jsonb);
+---+------------+---------+------------------------------------+
|idx|country_name|name |coordinates |
+---+------------+---------+------------------------------------+
|1 |AU |York |{"lat": -31.88809, "lng": 116.7678} |
|1 |AU |Yanchep |{"lat": -31.54678, "lng": 115.63171}|
|1 |AU |Yallingup|{"lat": -33.64592, "lng": 115.03514}|
|1 |AU |Wundowie |{"lat": -31.76163, "lng": 116.3799} |
|1 |AU |Wooroloo |{"lat": -31.8038, "lng": 116.31311} |
|1 |AU |Woodville|{"lat": -34.88333, "lng": 138.55} |
+---+------------+---------+------------------------------------+
Note the idx
value (the country index column) repeats for each city.
Expanding and flattening nested JSON
In the previous example, the composite type used the jsonb
type for the coordindates
key.
Thus, the query returned the coordinates as a JSONB column.
Given that our goal is to flatten the JSON data, we can build on the previous queries to return separate columns for lat
and lng
.
SELECT
idx,
country_name,
city.name,
coord.lat,
coord.lng
FROM country,
jsonb_to_recordset(cities) AS city(name text, coordinates jsonb),
jsonb_to_record(coordinates) AS coord(lat numeric, lng numeric);
+---+------------+---------+---------+---------+
|idx|country_name|name |lat |lng |
+---+------------+---------+---------+---------+
|1 |AU |York |-31.88809|116.7678 |
|1 |AU |Yanchep |-31.54678|115.63171|
|1 |AU |Yallingup|-33.64592|115.03514|
|1 |AU |Wundowie |-31.76163|116.3799 |
|1 |AU |Wooroloo |-31.8038 |116.31311|
|1 |AU |Woodville|-34.88333|138.55 |
+---+------------+---------+---------+---------+
Using both jsonb_to_recordset
and jsonb_to_record
, we’re able to flatten the JSON data so that we have one row per city, which repeated country data.
Closing
Data analysts, data scientists, and ML engineers wrangling JSON in the database can use functions built-in to Postgres to normalize the data. By composing these functions together, it’s possible to expand complex JSON into new rows and columns so they can used downstream for data exploration, data analysis, and building models.