When to use a graph database

There are a lot of “intro to graph database” tutorials on the internet. While the “how” part of using a graph database has it’s place, I don’t know if enough has been said about “when”.

The answer to “when” depends on the properties of the data you are working with. In broad strokes, you should probably keep a graph database in mind if you are dealing with a significant amount of any of the following:

  • Hierarchical data

  • Connected data

  • Semi-structured data

  • Data with Polymorphic associations

Each of these data types either requires some number of extra tables or columns (or both) to deal with under the relational model. The cost of these extra tables and columns is an increase in complexity.

Terms like “connected data” or “semi-structured data” get used a lot in the NoSQL world but the definitions, where you can find one at all, have a “you’ll know it when you see it” flavour to them. “You’ll know it when you see it” can be distinctly unsatisfying when examples are hard to come by as well. Lets take a look at these one by one and get a sense of they mean generally and how to recognize them in existing relational database backed projects.

Hierarchical Data

Hierarchies show up everywhere. There are geographical hierarchies where a country has many provinces, which have many cities which have many towns. There is also the taxonomic rank, indicating the level of a taxon in the Taxonomic Hierarchy, organizational hierarchies, the North American Industry Classification system… the list goes on and on.

What it looks like in a relational database.

Usually its easy to tell if you are dealing with this type of data. In an existing database schema you may see tables with a parent_id column indicating the use of the Adjacency List pattern or left/right columns indicating the use of the Nested Sets pattern. There are many others as well.

Connected Data

Connected data is roughly synonymous with graph data. When we are talking about graph data we mean bits of data, plus information about how those bits of data are related.

What it looks like in a relational database.

Join tables are the most obvious sign that you are storing graph data. Join tables exist solely to act as the target of two one-to-many relationships, each row representing a relationship between two rows in other tables. Cases where you are storing data in the join table (such as the Rails has_many :through relationship) are even more clear, since the columns of your join table are attributes of an edge.

While one-to-many relationships also technically describe a graph, they probably are not going to make you reconsider the use of a relational database the way large numbers of many-to-many relationships might.

Semi-structured Data

Attempts to define semi-structured data seem to focus on variability; just because one piece of data has a particular set of attributes does not mean that the rest do. You can actually get an example of semi-structured data by mashing together two sets of structured (tabular) data. In this world of APIs and SOA where drawing data from multiple sources is pretty much the new normal, semi-structured data is increasingly common.

What it looks like in a relational database.

Anywhere you have columns with lots of null values in them. The columns provide the structure, but long stretches of null values suggest that this data does not really fit that structure.

semi_structured_data
An example of semi-structured data: a hypothetical products table combining books (structured data) and music (also structured data).

Polymorphic associations

Having one type data with an association that might be to related to one of two or more things, that’s what known as a polymorphic association. As an example, a photo might be related to a user or a product.

What it looks like in a relational database.

While polymorphic relations can be done in a relational database most commonly they are handled at the framework level, where the framework a foreign key and an additional “type” column to determine the correct table/row. Seeing both an something_id and something_type in the same table gives a hint that a polymorphic relationship is being used. Both Ruby on Rails and Java’s Spring Framework offer this.

So when?

These types of data are known to be an awkward fit for the relational model, in the same way that storing large quantities of perfectly tabular data would be awkward under the graph model. These are ultimately threshold problems, like the famous paradox of the heap.

1000000 grains of sand is a heap of sand

A heap of sand minus one grain is still a heap.

Your first join table or set of polymorphic relations will leave you will a perfectly reasonable database design, but just as “a heap of sand minus one grain” will eventually cross some ill defined threshold and produce something that is no longer a heap of sand, there is some number of join tables or other workarounds for the relational model that will leave you with a database that is significantly more complex than a graph database equivalent would be.

Knowing about the limits of the relational model and doing some hard thinking about how much time you are spending pressed up against those limits is really the only things that can guide your decision making.

Leave a comment