Flattening JSON(b) in Postgres

Jan 6, 2022 · 1067 words · 6 minutes read · data, json, postgres, sql

Developers like to use JSON, and when they store data they often don’t normalize it. Thus, it’s not unusual to encounter JSON(b) fields in the database. For data analysts, data scientists, and machine learning engineers, unnesting (or flattening) the data is often a prerequisite to subsequent analysis.

Wrangling JSON columns in the database can be challenging. Often the goal is to expand JSON arrays into new rows and unnest JSON objects into new columns. Fortunately, Postgres has several built-in functions that can be composed to normalize complex JSON in the database.

Dataset

The table below illustrates a common scenario: a table (we’ll refer to it as city) contains a JSON or JSON(b) column (here, coordinates).

+---+-------------------+-------+---------------------------------+
|idx|name               |country|coordinates                      |
+---+-------------------+-------+---------------------------------+
|1  |Sant Julià de Lòria|AD     |{"lat": 42.46372, "lng": 1.49129}|
|2  |Pas de la Casa     |AD     |{"lat": 42.54277, "lng": 1.73361}|
|3  |Ordino             |AD     |{"lat": 42.55623, "lng": 1.53319}|
|4  |les Escaldes       |AD     |{"lat": 42.50729, "lng": 1.53414}|
|5  |la Massana         |AD     |{"lat": 42.54499, "lng": 1.51483}|
|6  |Encamp             |AD     |{"lat": 42.53474, "lng": 1.58014}|
+---+-------------------+-------+---------------------------------+
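
If you’d like to follow along, a minimal sketch of the table might look like the following (the column names come from the table above; the exact types and constraints are assumptions):

-- A minimal sketch of the city table; types are assumptions.
CREATE TABLE city (
  idx         integer PRIMARY KEY,
  name        text,
  country     text,
  coordinates jsonb
);

-- A couple of the rows shown above.
INSERT INTO city (idx, name, country, coordinates) VALUES
  (1, 'Sant Julià de Lòria', 'AD', '{"lat": 42.46372, "lng": 1.49129}'),
  (2, 'Pas de la Casa', 'AD', '{"lat": 42.54277, "lng": 1.73361}');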

While the data is stored as a JSON object, the data analyst, data scientist, or ML engineer would prefer lat and lng as normalized columns. Fortunately, we can restructure such JSON objects using the jsonb_to_record function built into Postgres.

Flattening JSON objects using jsonb_to_record

If the column in question contains a JSON(b) object, you can use the built-in function jsonb_to_record (or json_to_record) to normalize the key-value pairs into columns. The function takes a JSON object, or a column containing JSON objects, and returns a record. The record is expanded by matching it against a user-provided composite type in the AS expression. The composite type is simply a list of JSON object key names (which are case sensitive) and the Postgres data types to use for them. Any key-value pairs in the JSON object that aren’t defined in the composite type are excluded; conversely, if the composite type defines a key that is missing from the JSON object, its value will be null.

The Postgres documentation shows how to use json_to_record given a JSON object:

SELECT *
FROM json_to_record('{"a":1,"b":[1,2,3],"c":"bar"}') AS
  x(a int, b text, d text);
 a |    b    | d
---+---------+---
 1 | [1,2,3] |

In this example the function takes a JSON object, {"a":1,"b":[1,2,3],"c":"bar"}. The user-defined composite type, x, defines mappings for keys a, b, and d. Because the JSON object contains keys a and b, and their values can be cast to the types in the composite type, they are extracted as part of the record. The composite type doesn’t define a mapping for c, so it isn’t extracted from the object. Finally, the composite type defines d, which is missing from the object, so the d column is returned with a null value.
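
Because key matching is case sensitive, a key that isn’t all lower case needs a quoted column name in the composite type; otherwise the identifier is folded to lower case and no longer matches. A small illustration, not from the Postgres documentation:

-- "Lat" is quoted so it matches the mixed-case JSON key exactly.
SELECT *
FROM jsonb_to_record('{"Lat": 42.5, "lng": 1.5}') AS
  x("Lat" numeric, lng numeric);
 Lat  | lng
------+-----
 42.5 | 1.5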

The example in the documentation is nice, but usually we want to unnest JSON stored in a table column, not a string literal. Fortunately, we can use the same json_to_record and jsonb_to_record functions on columns containing JSON objects.

Using the city table above as an example, we can flatten the JSON object in the coordinates column.

SELECT
  city.idx,
  city."name",
  city.country,
  coord.lat,
  coord.lng
FROM
  city,
  jsonb_to_record(coordinates) AS coord(lat numeric, lng numeric);
+---+-------------------+-------+--------+-------+
|idx|name               |country|lat     |lng    |
+---+-------------------+-------+--------+-------+
|1  |Sant Julià de Lòria|AD     |42.46372|1.49129|
|2  |Pas de la Casa     |AD     |42.54277|1.73361|
|3  |Ordino             |AD     |42.55623|1.53319|
|4  |les Escaldes       |AD     |42.50729|1.53414|
|5  |la Massana         |AD     |42.54499|1.51483|
|6  |Encamp             |AD     |42.53474|1.58014|
+---+-------------------+-------+--------+-------+

In this example we use jsonb_to_record on the coordinates column to flatten each object into two columns. The composite type coord matches the object keys and data types. We’re then able to reference the record produced by the function in the SELECT list to return lat and lng as separate columns.
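
A note on the FROM clause: the comma between city and the function call is effectively a lateral join, which is why jsonb_to_record can reference the coordinates column of the preceding table. The same query can be written with an explicit CROSS JOIN LATERAL if you prefer to spell that out:

SELECT
  city.idx,
  city."name",
  city.country,
  coord.lat,
  coord.lng
FROM city
CROSS JOIN LATERAL jsonb_to_record(city.coordinates) AS coord(lat numeric, lng numeric);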

Unnesting JSON arrays using jsonb_to_recordset

If your data is a JSON array instead of a JSON object, you can’t use jsonb_to_record to normalize it. Instead, you want the related function jsonb_to_recordset.

The built-in functions json_to_recordset and jsonb_to_recordset are very similar to their json_to_record and jsonb_to_record counterparts. The difference is the *_to_recordset functions operate on JSON arrays rather than JSON objects.
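
As a quick illustration with a small made-up array (similar in spirit to the documentation example above), each element of the array becomes its own row:

SELECT *
FROM jsonb_to_recordset('[{"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}]') AS
  x(a int, b text);
 a |  b
---+-----
 1 | foo
 2 | bar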

To illustrate this, consider the table below (called country). The table is similar to the city table above except instead of one row per city there is one row per country. The cities column contains a JSON array where each entry is a city, stored as a JSON object.

+---+------------+---------------------------------------------------------------+
|idx|country_name|cities                                                         |
+---+------------+---------------------------------------------------------------+
|1  |AU          |[{"name": "York", "coordinates": {"lat": -31.88809, "lng": 1...|
|2  |AT          |[{"name": "Neu-Guntramsdorf", "coordinates": {"lat": 48.0642...|
|3  |AR          |[{"name": "Zárate", "coordinates": {"lat": -34.09814, "lng":...|
|4  |AG          |[{"name": "Saint John’s", "coordinates": {"lat": 17.12096, "...|
|5  |AO          |[{"name": "Saurimo", "coordinates": {"lat": -9.66078, "lng":...|
|6  |AQ          |[{"name": "McMurdo Station", "coordinates": {"lat": -77.846,...|
+---+------------+---------------------------------------------------------------+
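
As with the city table, here is a minimal sketch for following along (the types are assumptions and the cities array is abbreviated to two entries):

-- A minimal sketch of the country table; types and sample data are assumptions.
CREATE TABLE country (
  idx          integer PRIMARY KEY,
  country_name text,
  cities       jsonb
);

INSERT INTO country (idx, country_name, cities) VALUES
  (1, 'AU', '[{"name": "York", "coordinates": {"lat": -31.88809, "lng": 116.7678}},
              {"name": "Yanchep", "coordinates": {"lat": -31.54678, "lng": 115.63171}}]');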

Using the jsonb_to_recordset function on the cities column, we can expand the JSON array into separate rows. Like before, we provide a composite type with the key names and data types.

SELECT
  idx,
  country_name,
  city.name,
  city.coordinates
FROM country,
     jsonb_to_recordset(cities) AS city(name text, coordinates jsonb);
+---+------------+---------+------------------------------------+
|idx|country_name|name     |coordinates                         |
+---+------------+---------+------------------------------------+
|1  |AU          |York     |{"lat": -31.88809, "lng": 116.7678} |
|1  |AU          |Yanchep  |{"lat": -31.54678, "lng": 115.63171}|
|1  |AU          |Yallingup|{"lat": -33.64592, "lng": 115.03514}|
|1  |AU          |Wundowie |{"lat": -31.76163, "lng": 116.3799} |
|1  |AU          |Wooroloo |{"lat": -31.8038, "lng": 116.31311} |
|1  |AU          |Woodville|{"lat": -34.88333, "lng": 138.55}   |
+---+------------+---------+------------------------------------+

Note the idx value (the country index column) repeats for each city.
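
One way to sanity-check that fan-out is to count the expanded rows per country; the query below (a quick aside, not from the original queries) returns one row per country with the number of cities in its array:

SELECT
  idx,
  country_name,
  count(*) AS num_cities
FROM country,
     jsonb_to_recordset(cities) AS city(name text, coordinates jsonb)
GROUP BY idx, country_name;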

Expanding and flattening nested JSON

In the previous example, the composite type used the jsonb type for the coordinates key, so the query returned the coordinates as a JSONB column. Given that our goal is to flatten the JSON data, we can build on the previous queries to return separate columns for lat and lng.

SELECT
  idx,
  country_name,
  city.name,
  coord.lat,
  coord.lng
FROM country,
     jsonb_to_recordset(cities) AS city(name text, coordinates jsonb),
     jsonb_to_record(coordinates) AS coord(lat numeric, lng numeric);
+---+------------+---------+---------+---------+
|idx|country_name|name     |lat      |lng      |
+---+------------+---------+---------+---------+
|1  |AU          |York     |-31.88809|116.7678 |
|1  |AU          |Yanchep  |-31.54678|115.63171|
|1  |AU          |Yallingup|-33.64592|115.03514|
|1  |AU          |Wundowie |-31.76163|116.3799 |
|1  |AU          |Wooroloo |-31.8038 |116.31311|
|1  |AU          |Woodville|-34.88333|138.55   |
+---+------------+---------+---------+---------+

Using both jsonb_to_recordset and jsonb_to_record, we’re able to flatten the JSON data so that we have one row per city, with the country data repeated on each row.
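
If this flattened shape is something you need regularly, one option is to wrap the query in a view so downstream consumers can treat it like an ordinary table. A sketch, using a hypothetical view name:

-- Hypothetical view exposing one flattened row per city.
CREATE VIEW flat_cities AS
SELECT
  idx,
  country_name,
  city.name,
  coord.lat,
  coord.lng
FROM country,
     jsonb_to_recordset(cities) AS city(name text, coordinates jsonb),
     jsonb_to_record(coordinates) AS coord(lat numeric, lng numeric);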

Closing

Data analysts, data scientists, and ML engineers wrangling JSON in the database can use functions built into Postgres to normalize the data. By composing these functions, it’s possible to expand complex JSON into new rows and columns so the data can be used downstream for data exploration, data analysis, and building models.