Dealing with supernodes in ArangoDB

About a year ago I wrote about data modeling in ArangoDB. The main takeaway was to avoid the temptation to create unnecessary vertices (and all the attendant edges), since traversing a high-degree vertex (a vertex with lots of edges pointing at it) is an expensive process.

Graphs are cool, and it’s easy to forget that ArangoDB is also a great document store. Treating it as such means “embedding”, to borrow a term from MongoDB, which lets you keep your traversals fast.

But while good data modeling can prevent you from creating some high-degree vertices, you will run into them eventually, and ArangoDB’s new “vertex-centric” indexes are a feature that exists for exactly those moments.

First we can try to get a sense of the impact these high-degree vertices have on a traversal. To do that I wrote a script that generates a star graph, starting with three “seed” vertices: a start, a middle, and an end.

The goal was to walk across the middle vertex to get from start to end while changing the number of vertices connected to the middle.

// the seed vertices
let seedVertices = [
  {name: 'start', _key: '1', type: "seed"},
  {name: 'middle', _key: '2', type: "seed"},
  {name: 'end', _key: '3', type: "seed"}
]
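
The generator script isn’t shown in full here, but the core idea, run in arangosh against pre-created vertices and edges collections (those names match the query further down), might look something like this; the filler naming and the exact seed wiring are my own:

// save the seeds, then wire them together: start -> middle -> end
seedVertices.forEach(function (v) { db.vertices.save(v); });
db.edges.save({_from: "vertices/1", _to: "vertices/2", type: "seed"});
db.edges.save({_from: "vertices/2", _to: "vertices/3", type: "seed"});

// crowd the middle vertex with n extra neighbours
function crowdMiddle(n) {
  for (var i = 0; i < n; i++) {
    var filler = db.vertices.save({name: "filler_" + i, type: "filler"});
    db.edges.save({_from: filler._id, _to: "vertices/2", type: "filler"});
  }
}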

With just 10 vertices, this is nowhere near deserving the name “supernode”, but it’s pretty clear why these things are called star-graphs.

[Figure: a 10-vertex star graph. A baby “supernode”.]

Next we crank up the number of vertices so we can see the effect on traversal time.

[Figure: traversal time across the middle vertex as the number of connected vertices grows]

A vertex surrounded by 100,000 to 1,000,000 other vertices is getting into full-blown supernode territory. You can see that by the time ArangoDB is sorting through a million incident edges, it takes 4.3 seconds to traverse across that middle vertex.

A “vertex-centric” index is one that includes either _to or _from plus some other edge attribute. In this case I’ve added a type attribute, which I’ll combine with _to to make my index. (Note that if I were allowing ANY as a direction, I would need a second index combining type and _from.)

[Screenshot: creating a hash index on type and _to]
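
The screenshot shows the index being created through the web UI; the arangosh equivalent should look roughly like this:

// a hash index over type and _to; including _to is what makes it vertex-centric
db.edges.ensureIndex({type: "hash", fields: ["type", "_to"]});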

ArangoDB calculates the path as you do a traversal and offers it to you. You can access the path by declaring a variable to receive it; in the query below p contains the path. If we use the ALL array comparison operator on p.edges to say that all the edges in the path should have a type of “seed”, that should be enough to get ArangoDB to use our new index.

    FOR v, e, p IN 1..2 ANY 'vertices/1' edges
      FILTER p.edges[*].type ALL == 'seed' && v.name == 'end'
        RETURN v

The selectivity score shown by explain doesn’t leave you very hopeful that Arango will use the index…

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields               Ranges
  2   hash   edges        false    false         0.00 %   [ `type`, `_to` ]    base INBOUND
  2   edge   edges        false    false        50.00 %   [ `_from`, `_to` ]   base OUTBOUND

but having our query execute in 0.2 milliseconds instead of 4.3 seconds is a pretty good indication it’s working.

[Screenshot: the full explain output for the query]

Back to modeling

For me, this little experiment has underscored the importance of good data modeling. You don’t have to worry about the number of edges on a vertex you never create. If you are conservative about what you are willing to make a vertex, and make good use of indexes, ArangoDB can gracefully handle some pretty hefty data.

With vertex-centric indexes, and other features like the new SmartGraphs, ArangoDB has gone from being a great document database with a trick up its sleeve (joins!) to being a really solid graph database, and new features are landing regularly. I’m curious to see where they go next.


Graph migrations

One of the things that is not obvious at first glance is how “broad” ArangoDB is. By combining the flexibility of the document model with the joins of the graph model, ArangoDB has become a viable replacement for Postgres or MySQL, which is exactly how I have been using it; for the last two years it’s been my only database.

One of the things that falls out of that type of usage is a need to actually change the structure of your graph. In a graph, structure comes from the edges that connect your vertices. Since both the vertices and edges are just documents, that structure is actually just more data. Changing your graph structure essentially means refactoring your data.

There are definitely patterns that appear in that refactoring, and over the last little while I have been putting the obvious ones into a library called graph_migrations. It’s a work in progress, but some useful functions are already working and could use some proper documentation.

eagerDelete

One of the first of these is what I have called eagerDelete. If you wanted to delete Bob from the graph below, Charlie and Dave would be orphaned.

[Figure: the knows_graph, where Charlie and Dave are connected only to Bob]

Deleting Bob with eagerDelete means that Bob is deleted as well as any neighbors whose only neighbor is Bob.

gm = new GraphMigration("test") //use the database named "test"
gm.eagerDelete({name: "Bob"}, "knows_graph")

[Figure: after eagerDelete, only Alice and Eve remain]

mergeVertices

Occasionally you will end up with duplicate vertices, which should be merged together. Below you can see we have an extra Charlie vertex.

[Figure: the graph with an extra CHARLIE vertex]

gm = new GraphMigration("test")
gm.mergeVertices({name: "CHARLIE"},{name: "Charlie"}, "knows_graph")

[Figure: the graph after merging the two Charlie vertices]

attributeToVertex

One of the other common transformations is needing to make a vertex out of an attribute. This process of “promoting” something to be a vertex is sometimes called reifying. Let’s say Eve and Charlie are programmers.

[Figure: the knows_graph]

Let’s add an attribute called job to both Eve and Charlie, identifying them as programmers:

[Screenshot: adding the job attribute to Eve and Charlie]
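
The screenshot shows the attribute being added through the UI; an equivalent AQL sketch (assuming the persons vertex collection that the knows_graph uses) would be:

    FOR p IN persons
      FILTER p.name IN ['Eve', 'Charlie']
      UPDATE p WITH {job: 'programmer'} IN persons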

But let’s say we decide it makes more sense for job: "programmer" to be a vertex of its own (we want to reify it). We can use the attributeToVertex function for that, but because Arango allows us to split our edge collections, and it’s good practice to do so, let’s add a new edge collection to our knows_graph to store the edges that will be created when we reify this attribute.

[Screenshot: adding the works_as edge collection to the knows_graph]
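
The same thing can be sketched in arangosh with the general-graph module; the persons collections on both ends are my assumption, and on 2.x the module path is org/arangodb/general-graph rather than @arangodb/general-graph:

// register works_as as a new edge definition on the knows_graph
var graph_module = require("@arangodb/general-graph");
var graph = graph_module._graph("knows_graph");
graph._extendEdgeDefinitions(graph_module._relation("works_as", ["persons"], ["persons"]));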

With that we can run attributeToVertex, telling it the attribute(s) to look for, the graph (knows_graph) to search and the collection to save the edges in (works_as).

gm = new GraphMigration("test")
gm.attributeToVertex({job: "programmer"}, "knows_graph", "works_as", {direction: "inbound"})

The result is this:

[Figure: Eve and Charlie connected to a new programmer vertex by works_as edges]

vertexToAttribute

Another common transformation is exactly the reverse of what we just did: folding the job: "programmer" vertex into the vertices connected to it.

gm = new GraphMigration("test")
gm.vertexToAttribute({job: "programmer"}, "knows_graph", {direction: "inbound"})

That code puts us right back to where we started, with Eve and Charlie both having a job: "programmer" attribute.

[Figure: the knows_graph, back where we started]

redirectEdges

There are times when things are just not connected the way you want. Let’s say in our knows_graph we want all the inbound edges pointing at Bob to point instead to Charlie.

[Figure: the knows_graph]

We can use redirectEdges to do exactly that.

gm = new GraphMigration("test")
gm.redirectEdges({_id: "persons/bob"}, {_id: "persons/charlie"}, "knows_graph", {direction: "inbound"})

And now Eve and Alice know Charlie.

[Figure: Eve’s and Alice’s edges now point at Charlie]

Where to go from here

As the name “graph migrations” suggests the thinking behind this was to create something similar to the Active Record Migrations library from Ruby on Rails but for graphs.

As more and more of this goes from idea to code and I get a chance to play with it, I’m less sure that a direct copy of Migrations makes sense. Graphs are actually pretty fine-grained data in the end, and maybe something more interactive makes sense. It could be that this belongs in a Foxx app, or in Arangojs, or in ArangoDB’s admin interface. It feels a little too early to tell.

Beyond providing a little documentation, the hope here is to make this work a little more visible to people who are thinking along the same lines and might be interested in contributing.

Back up your data, give it a try and tell me what you think.

Why graphs? Why now?

Buyer behavior analysis, protein-protein interactions, the human brain, fraud detection, financial analysis; if you sketch any of these out on a whiteboard, you will most likely end up with a series of circles connected by lines.

This is the simple representation we all intuitively use to map out the relationships between things. Though simple, under the name “graph” or “network graph” it is the subject of study of an entire branch of mathematics (graph theory) and the burgeoning field of Social Network Analysis (SNA).

SNA is the study of the network of relationships among a set of things, rather than the things themselves. This type of analysis is increasingly common in academia across a huge number of fields. Google Scholar gives a fairly clear indication of how quickly the term is spreading through the academic databases it crawls.

[Figure: the growth of Social Network Analysis]

The technique surfaces in many domains, used to identify which actors within a given network are “interesting”. In a biology context “interesting” actors might be the genes that are interacting the most with other genes given a certain stimulus.

[Figure: Transcriptome-Based Network Analysis Reveals a Spectrum Model of Human Macrophage Activation, Xue, Jia et al.]

In the context of epidemiology, if the actors are fish farms, the movement of fish between them forms a network which can be analysed. Aquaculture sites with the highest number of incoming and outgoing connections become “interesting”, since those movements mean a high vulnerability to infection and a high likelihood of spreading disease.

[Figure: from the paper Application of network analysis to farmed salmonid movement data from Scotland]

An “interesting” financial institution might be one whose financial ties to other institutions suggest that its failure would have a domino effect.

[Figure: from DebtRank: Too Central to Fail? Financial Networks, the FED and Systemic Risk]

While Social Network Analysis has been steadily growing, shifts in industry are underway that promise to make this type of analysis more common outside academia.

In 2001, with computers spreading everywhere, e-commerce heating up and internet usage around 500 million people, Doug Laney, an eagle-eyed analyst for MetaGroup (now Gartner), noticed a trend: data was changing. He described how data was increasing along three axes: volume, velocity and variety, which eventually became known as “the 3 V’s of Big Data”.

[Figure: a visualization of the 3 V’s, from Datasciencecentral.com]

These changing characteristics of data have touched off what is often called a “Cambrian explosion” of non-relational databases offering the speed, flexibility and horizontal scalability needed to accommodate it. These databases are collectively known as NoSQL databases.

[Figure: a non-exhaustive but representative timeline of NoSQL databases, from the first database (a graph database) till today. Can you see the explosion?]

Since the launch of Google’s Bigtable in 2005, more than 28 NoSQL databases have been launched. The majority fall into one of three main sub-categories: key/value store, document store or graph database.

Anywhere a graph database is used is an obvious place to use SNA, but the ability to build a graph atop either key/value stores or document databases means that the majority of NoSQL databases are amenable to being analysed with SNA.

There are also many relational databases struggling to contain and query graphy datasets that developers have dutifully pounded into the shape of a table. As graph databases gain traction, we will likely see some of these converted in their entirety to graphs using tools like R2G, wherever developers end up struggling with recursion, an explosion of join tables, or something like routing.

In addition to the steady pressure of the 3 V’s and the growth of SNA as an analytical, and even predictive, tool, there are many companies (Runkeeper, Yahoo, LinkedIn) whose data model is a graph. Facebook and Netflix both fall into this category, and each has released tools (GraphQL and Falcor, respectively), both pitched as alternatives to the REST architectural style most web applications are based on, to make building graph-backed applications easier.

Circling back to the original question of “why graphs?”, hopefully the answer is clearer. For anyone with an interest in data analysis, paying attention to this space gives access to powerful tools and a growing number of opportunities to apply them. For developers, understanding graphs allows better data modelling and architectural decisions.

Beyond the skills needed to design and tend to these new databases and make sense of their contents, knowledge of graphs will also increasingly be required to make sense of the world around us.

Understanding why you got turned down for a loan will require understanding a graph; so will understanding why you are or aren’t fat, and eventually who gets insurance and at what price.

[Figure: Facebook patents technology to help lenders discriminate against borrowers based on social connections]

Proper security increasingly requires thinking in graphs, as even the most “boring” of us can be a useful stepping stone on the way to compromising someone “interesting”; perhaps a client, an acquaintance, an employer, or a user of something we create.

Graphs will be used to find and eliminate key terrorists, map out criminal networks, and route you to that new vegan restaurant across town.

With talk of storage capacities actually surpassing Moore’s law, SNA growing nearly linearly, NoSQL growing, interest in graphs on the way up, and application development tools finally appearing, the answer to “why now?” is that this is only the beginning. We are all connected, and understanding how is the future.

When to use a graph database

There are a lot of “intro to graph database” tutorials on the internet. While the “how” part of using a graph database has its place, I don’t know if enough has been said about “when”.

The answer to “when” depends on the properties of the data you are working with. In broad strokes, you should probably keep a graph database in mind if you are dealing with a significant amount of any of the following:

  • Hierarchical data

  • Connected data

  • Semi-structured data

  • Data with polymorphic associations

Each of these data types either requires some number of extra tables or columns (or both) to deal with under the relational model. The cost of these extra tables and columns is an increase in complexity.

Terms like “connected data” or “semi-structured data” get used a lot in the NoSQL world, but the definitions, where you can find one at all, have a “you’ll know it when you see it” flavour to them. That can be distinctly unsatisfying when examples are hard to come by as well. Let’s take a look at these one by one and get a sense of what they mean generally and how to recognize them in existing relational-database-backed projects.

Hierarchical Data

Hierarchies show up everywhere. There are geographical hierarchies, where a country has many provinces, which have many cities, which have many towns. There is also taxonomic rank, indicating the level of a taxon in the taxonomic hierarchy, organizational hierarchies, the North American Industry Classification System… the list goes on and on.

What it looks like in a relational database.

Usually it’s easy to tell if you are dealing with this type of data. In an existing database schema you may see tables with a parent_id column, indicating the use of the Adjacency List pattern, or left/right columns, indicating the use of the Nested Sets pattern. There are many others as well.
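
For contrast, once a hierarchy is stored as vertices and edges, walking it to any depth is a single traversal. A sketch in AQL, with a hypothetical places vertex collection and contains edge collection:

    // everything under Canada, however deep the hierarchy goes
    FOR v IN 1..10 OUTBOUND 'places/canada' contains
      RETURN v.name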

Connected Data

Connected data is roughly synonymous with graph data. When we are talking about graph data we mean bits of data, plus information about how those bits of data are related.

What it looks like in a relational database.

Join tables are the most obvious sign that you are storing graph data. Join tables exist solely to act as the target of two one-to-many relationships, each row representing a relationship between two rows in other tables. Cases where you are storing data in the join table (such as the Rails has_many :through relationship) are even more clear, since the columns of your join table are attributes of an edge.
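
In a graph database the join table simply becomes an edge collection, and its extra columns become attributes of the edge. A sketch using the classic Rails physicians/patients/appointments example (all of these collection names are hypothetical):

// one row of the appointments join table (physician_id, patient_id, appointment_date)
// maps directly onto one edge document
db.appointments.save({
  _from: "physicians/123",
  _to: "patients/456",
  appointment_date: "2016-04-06"
});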

While one-to-many relationships also technically describe a graph, they probably are not going to make you reconsider the use of a relational database the way large numbers of many-to-many relationships might.

Semi-structured Data

Attempts to define semi-structured data seem to focus on variability; just because one piece of data has a particular set of attributes does not mean that the rest do. You can actually get an example of semi-structured data by mashing together two sets of structured (tabular) data. In this world of APIs and SOA where drawing data from multiple sources is pretty much the new normal, semi-structured data is increasingly common.

What it looks like in a relational database.

Anywhere you have columns with lots of null values in them. The columns provide the structure, but long stretches of null values suggest that this data does not really fit that structure.

[Figure: an example of semi-structured data: a hypothetical products table combining books (structured data) and music (also structured data)]
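
In a document store those null-heavy columns simply disappear, because each document carries only the attributes it actually has. A sketch of that hypothetical products collection:

// a book and an album side by side in one collection, with no nulls required
db.products.save({name: "Some Novel", author: "A. Writer", pages: 320});
db.products.save({name: "Some Album", artist: "A. Band", tracks: 12});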

Polymorphic associations

Having one type of data with an association that might relate it to one of two or more other things is what’s known as a polymorphic association. As an example, a photo might be related to a user or to a product.

What it looks like in a relational database.

While polymorphic relations can be done in a relational database, most commonly they are handled at the framework level, where the framework uses a foreign key plus an additional “type” column to determine the correct table/row. Seeing both a something_id and a something_type column in the same table is a hint that a polymorphic relationship is in use. Both Ruby on Rails and Java’s Spring Framework offer this.
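
In a graph database the polymorphism comes for free, because an edge’s _to can point into any collection; the collection name already embedded in the _id plays the role of the “type” column. A sketch with hypothetical collections:

// one edge collection happily points photos at users and products alike
db.tagged_in.save({_from: "photos/1", _to: "users/42"});
db.tagged_in.save({_from: "photos/2", _to: "products/7"});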

So when?

These types of data are known to be an awkward fit for the relational model, in the same way that storing large quantities of perfectly tabular data would be awkward under the graph model. These are ultimately threshold problems, like the famous paradox of the heap.

1000000 grains of sand is a heap of sand

A heap of sand minus one grain is still a heap.

Your first join table or set of polymorphic relations will leave you with a perfectly reasonable database design, but just as “a heap of sand minus one grain” will eventually cross some ill-defined threshold and produce something that is no longer a heap of sand, there is some number of join tables or other workarounds for the relational model that will leave you with a database significantly more complex than its graph database equivalent.

Knowing the limits of the relational model, and doing some hard thinking about how much time you are spending pressed up against those limits, is really the only thing that can guide your decision making.