JSON Data Patterns

Dec 11, 2024 · 1272 words · 6 minutes read data • json

Over the years I’ve worked with a lot of developers who use JSON. In fact, probably every developer I’ve worked with has used JSON. Despite it’s widespread use, it’s often misused or used poorly. By this I mean that developers often store data in JSON but do so in a poorly organized way that don’t translate well to more principled data engineering.

There are several common patterns for storing data in JSON. Developers often use a multiple patterns in their work, but not always the right ones.

Goals

Developers often throw data into JSON structures without much thought. The goal of thoughtful JSON structure is to make data both usable and maintainable.

This is especially true when using schemaless NoSQL databases like MongoDB, DynamoDB, and Datastore. When developers don’t think about the structure of their data, it leads to adverse consequences down the road. As the saying goes, you can either pay the cost or you’ll pay the price.

Developers should consider applying database normalization principles when structuring their data in JSON. Particularly, consider normal forms. While JSON is more flexible than traditional relational databases, we can still apply database normalization principles when structuring our data. These principles help us avoid common pitfalls and make our data more robust.

The most relevant principles come from database normal forms: - First Normal Form (1NF): Each record is unique and contains only atomic (indivisible) values - Second Normal Form (2NF): Data depends on the entire key, not just part of it - Third Normal Form (3NF): Data depends only on the key, not on other non-key fields

While we can’t always strictly follow these rules with JSON (and sometimes we shouldn’t), they provide a useful framework for thinking about how to structure our data. The patterns I describe below reflect these principles to varying degrees, each with their own trade-offs.

Index oriented

The most common I’ve seen is to store data in JSON objects with an index orientation, meaning each key represents some sort of index for the contents and the value is typically also a JSON object. This is an example of what I mean by an index oriented JSON object:

{
    "alice":
    {
        "id": 1,
        "email": "alice@ex.co"
    },
    "bob":
    {
        "id": 2,
        "email": "bob@ex.co"
    }
}

Although this is one of the most common patterns I see developers use, it violates several database normalization principles. Using this approach requires distinct indexes – using the example, there can be only 1 Alice and 1 Bob. Relying on the keys (e.g., “alice”) being distinct, which means the data structure itself is enforcing uniqueness rather than the application logic.

This obviously presents a problem if the key will ever be non-distinct. In my experience developers often overlook whether a key is distinct, even when using so called unique identifiers.

This pattern often leads to denormalization because developers tend to duplicate data across multiple objects when using index-oriented structures. For example, if Alice has multiple email addresses, developers might create multiple entries like “alice_personal” and “alice_work”, violating 2NF by making the email dependent on only part of what should be a composite key.

Column oriented

Another common pattern is to use a column orientation. This approach typically consists of a JSON object where each key represents a column, or field, and the values are arrays. An example would look like:

{
    "id":
    [
        1,
        2
    ],
    "email":
    [
        "alice@ex.co",
        "bob@ex.co"
    ],
    "name":
    [
        "alice",
        "bob"
    ]
}

Unlike the index-oriented approach, this satisfies the 1NF principle. However it introduces other problems, notably forcing developers to rely on array position to maintain relationships between values. This violates both 2NF and 3NF principles because there is an implicit reliance on the array index instead of proper keys.

From a usability standpoint this format is useful when performing a task that operates on an entire array at once, such as statistical summaries. However, it is a lot less ergonomic when needing to perform tasks that uses data from different arrays. This can also be problematic when order isn’t guaranteed (relatively rare) or arrays have different lengths (surprisingly common).

Maintaining consistency across all arrays adds significant overhead. Thus, adding or removing records becomes more cumbersome and extra care needs to be take to preserve the implicit relationships across arrays.

Records

My preferred method to store data in JSON is as records. This means we store multiple objects in an array. This pattern most closely adheres to database normalization principles while remaining flexible and practical.

Using the data from above, it would look like:

[
    {
        "id": 1,
        "email": "alice@ex.co",
        "name": "alice"
    },
    {
        "id": 2,
        "email": "bob@ex.co",
        "name": "bob"
    }
]

This pattern satisfies all three normal forms:

It maintains 1NF by storing atomic values and allowing each record to be uniquely identified
It supports 2NF by allowing proper composite keys when needed, rather than forcing artificial uniqueness
It enables 3NF by making it clear which fields depend on the key(s)

The tenants of this approach is that whenever we have represent each item as its own object and store multiple objects in an array. Whenever we have multiple values we prefer to store them as an array of objects rather than an array of raw values. This allows multiple objects to have the same keys which often is desirable because it is usually better to manually de-duplicate than rely on your data structure to do it for you.

This pattern also makes it easier to use proper composite keys (i.e. multiple fields that together uniquely identify a record). One of the worst anti-patterns I see is developers constructing composite keys as delimited strings like {“alice-1”: …}, which violates normalization principles and makes the data harder to work with.

The records pattern tends to be more maintainable over time because it maps naturally to how most programming languages and data processing tools work with data. It also makes it easier to evolve the schema over time without breaking existing code or requiring complex data migrations.

Other structures

There are other data structures that are often used in JSON, however these are the basic building block structures. Sometimes these structures are nested within each other. By understanding the basic building blocks, we can better understand the more complex structures.

Recommendations

Given the above data structures, here are some recommendations for how to store data in JSON.

First, use the records pattern when possible. This is the most maintainable and flexible approach. It may be tempting to use the column oriented pattern when you need to store multiple values for a single key such as timeseries data. However, it is better to be explicit and provide a key for each value in the array. If you see an array of values with no keys, it is likely that you should be using the records pattern.

Second, avoid storing multiple values for a single key in the same object. This violates the 1NF principle. Instead, store each value in its own object. For example, instead of storing multiple email addresses in a single object, store each email address in its own object.

Finally, avoid using delimited strings as keys. While this is a common pattern, it violates normalization principles and makes the data harder to work with. Instead, use a proper composite key. If you see a delimited string as a key, it is likely that you should be using a proper composite key.

Conclusion

JSON is a great data format for storing data. However, developers should be thoughtful about how they choose to store their data. The data structures described above are designed to make it easier to work with JSON data.