ArangoDB and GraphQL

For a while now I’ve been wondering about what might be the minimal set of technologies that allows me to tackle the widest range of projects. The answer I’ve arrived at, for backend development at least, is GraphQL and ArangoDB.

Both of these tools expand my reach as a developer. Projects involving integrations, multiple clients and complicated data that would have been extremely difficult are now within easy reach.

But the minimal set idea is that I can enjoy this expanded range while juggling far fewer technologies than before. Tools that apply in more situations mean fewer things to learn, fewer moving parts and more depth in the learning I do.

While GraphQL and ArangoDB are both interesting technologies individually, it’s in using them together that I’ve been able to realize those benefits; one of those moments where the whole is different from the sum of its parts.

Backend Minimalism

My embrace of Javascript has definitely been part of creating that minimal set. A single language for both front and back end development has been a big part of simplifying my tech stack. Both GraphQL and ArangoDB can be used in many languages, but Javascript support is what you might describe as “first among equals” for both projects.

GraphQL can replace, and for me has replaced, server side frameworks like Rails or Django, leaving me with a handful of Javascript functions and more modular, testable code.

GraphQL also replaces ReST, freeing me from thinking about HATEOAS, bike-shedding over the vagaries of ReST, or needing pages and pages of JSON API documentation to save me from bike-shedding over the vagaries of ReST.

ArangoDB has also reduced the number of things I need to know. For a start it has removed the “need” for an ORM (no relational database, no need for Object Relational Mapping), which never really delivered on its promise to free you from knowing the underlying SQL.

More importantly it’s replaced not just NoSQL databases with a razor-thin set of capabilities like MongoDB (which stores nested documents but can’t do joins) or Neo4j (which does joins but can’t store nested documents), but also general purpose databases like MySQL or Postgres. I have one query language to learn, and one database whose quirks and characteristics I need to know.

It’s also replaced the deeply unpleasant process of relational data modeling with a seamless blend of documents and graphs that makes modeling even really ugly connected datasets anticlimactic. As a bonus, in moving the schema outside the database, GraphQL lets us enjoy all the benefits of a schema (making sure there is at least some structure I can rely on) and all the benefits of schemalessness (flexibility, ease of change).

Tools that actually reduce the number of things you need to know don’t come along very often. My goal here is to give a sense of what it looks like to use these two technologies together, and hopefully admiring the trees can let us appreciate the forest.

Show me the code

First we need some data to work with. ArangoDB’s administrative interface has some example graphs it can create, so let’s use one to explore.

[Screenshot: ArangoDB’s example graphs dialog]

If we select the “knows” graph, we get a simple graph with 5 vertices.

[Screenshot: the “knows” graph]

This graph is going to be the foundation for our little exploration.

The only really meaningful information these vertices have is a name attribute. If we want to create a GraphQL type that represents one of these objects, it would look like this:

  let Person = new GraphQLObjectType({
    name: 'Person',
    fields: () => ({
      name: {
        type: GraphQLString
      }
    })
  })

Now that we have a type that describes what a Person object looks like we can use it in a schema. This schema has a field called person which has two attributes: type, and resolve.

let schema = new GraphQLSchema({
    query: new GraphQLObjectType({
      name: 'Query',
      fields: () => ({
        person: {
          type: Person,
          resolve: () => {
            return {name: 'Mike'}
          },
        }
      })
    })
  })

The resolve is a function that will be run whenever graphql is asked to produce a person object. type describes the object that the resolve function returns, which in this case is our Person type.

To see if this all works we can write a test using Jest.

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList,
  GraphQLNonNull
} from 'graphql'

describe('returning a hardcoded object that matches a type', () => {

  let Person = new GraphQLObjectType({
    name: 'Person',
    fields: () => ({
      name: {
        type: GraphQLString
      }
    })
  })

  let schema = new GraphQLSchema({
    query: new GraphQLObjectType({
      name: 'Query',
      fields: () => ({
        person: {
          type: Person,
          resolve: () => {
            return {name: 'Mike'}
          },
        }
      })
    })
  })

  it('lets you ask for a person', async () => {

    let query = `
      query {
        person {
          name
        }
      }
    `;

    let { data } = await graphql(schema, query)
    expect(data.person).toEqual({name: 'Mike'})
  })

})

This test passes, which tells us that we got everything wired together properly and laid the foundation for talking to ArangoDB.

First we’ll use arangojs to create a db instance, and then write a function that lets us get a person using their name.

//src/database.js
import arangojs, { aql } from 'arangojs'

export const db = arangojs({
  url: `http://${process.env.ARANGODB_USER}:${process.env.ARANGODB_PASSWORD}@127.0.0.1:8529`,
  databaseName: 'knows'
})

export async function getPersonByName (name) {
  let query = aql`
      FOR person IN persons
        FILTER person.name == ${ name }
          LIMIT 1
          RETURN person
    `
  let results = await db.query(query)
  return results.next()
}

Now let’s use that function with our schema to retrieve real data from ArangoDB.

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList,
  GraphQLNonNull
} from 'graphql'
import {
  db,
  getPersonByName
} from '../src/database'

describe('queries', () => {

  it('lets you ask for a person from the database', async () => {

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        }
      })
    })

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          person: {
            args: { //person now accepts args
              name: { // the arg is called "name"
                type: new GraphQLNonNull(GraphQLString) // name is a string & mandatory
              }
            },
            type: Person,
            resolve: (root, args) => {
              return getPersonByName(args.name)
            },
          }
        })
      })
    })

    let query = `
        query {
          person(name "Eve") {
            name
          }
        }
      `

    let { data } = await graphql(schema, query)
    expect(data.person).toEqual({name: 'Eve'})
  })
})

Here we have modified our schema to accept a name argument when asking for a person. We access the name via the args object and pass it to our database function to go get the matching person from Arango.

Let’s add a new database function to get the friends of a person, given their id. What’s worth pointing out here is that we are using ArangoDB’s AQL traversal syntax: it lets us do a graph traversal across outbound edges to get the vertex on the other end of each edge.

export async function getFriends (id) {
  let query = aql`
      FOR vertex IN OUTBOUND ${id} knows
        RETURN vertex
    `
  let results = await db.query(query)
  return results.all()
}

Now that we have that function, instead of adding it to the schema we add a friends field to the Person type. In the resolve for our new friends field we use the root argument to get the id of the current person object, and then use our getFriends function to do the traversal and retrieve that person’s friends.

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        },
        friends: {
          type: new GraphQLList(Person),
          resolve(root) {
            return getFriends(root._id)
          }
        }
      })
    })

What’s interesting is that because of GraphQL’s recursive nature, this change lets us query for friends:

        query {
          person(name: "Eve") {
            name
            friends {
              name
            }
          }
        }

and also ask for friends of friends (and so on) like this:

        query {
          person(name: "Eve") {
            name
            friends {
              name
              friends {
                name
              }
            }
          }
        }

We can show that with a test.

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList,
  GraphQLNonNull
} from 'graphql'
import {
  db,
  getPersonByName,
  getFriends
} from '../src/database'

describe('queries', () => {

  it('returns friends of friends', async () => {

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        },
        friends: {
          type: new GraphQLList(Person),
          resolve(root) {
            return getFriends(root._id)
          }
        }
      })
    })

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          person: {
            args: {
              name: {
                type: new GraphQLNonNull(GraphQLString)
              }
            },
            type: Person,
            resolve: (root, args) => {
              return getPersonByName(args.name)
            },
          }
        })
      })
    })

    let query = `
        query {
          person(name: "Eve") {
            name
            friends {
              name
              friends {
                name
              }
            }
          }
        }
      `

    let result = await graphql(schema, query)
    let { friends } = result.data.person
    let foaf = [].concat(...friends.map(friend => friend.friends))
    expect([{name: 'Charlie'},{name: 'Dave'},{name: 'Bob'}]).toEqual(expect.arrayContaining(foaf))
  })

})

This test runs a query three levels deep, walking the entire graph. Because we can ask for any combination of the things our types define, we have a whole lot of flexibility with very little code. The code that’s there is just a few simple functions, modular and easy to test.

But what did we trade away to get all that? If we look at the queries that get sent to Arango with tcpdump we can see how that sausage was made.

// getPersonByName('Eve') from the person resolver in our schema 
{"query":"FOR person IN persons
  FILTER person.name == @value0
  LIMIT 1 RETURN person","bindVars":{"value0":"Eve"}}
// getFriends('persons/eve') in Person type -> returns Bob & Alice.
{"query":"FOR vertex IN OUTBOUND @value0 knows
  RETURN vertex","bindVars":{"value0":"persons/eve"}}
// now a new request for each friend:
// getFriends('persons/bob')
{"query":"FOR vertex IN OUTBOUND @value0 knows
  RETURN vertex","bindVars":{"value0":"persons/bob"}}
// getFriends('persons/alice')
{"query":"FOR vertex IN OUTBOUND @value0 knows
  RETURN vertex","bindVars":{"value0":"persons/alice"}}

What we have here is our own version of the famous N+1 problem: one query to find the starting person, then one more for every friend at every level of the query. If we were to add more people to this graph, things would get out of hand quickly.

Facebook, which has been using GraphQL in production for years, is probably even less excited about the prospect of N+1 queries battering their database than we are. So what are they doing to solve this?

Using Dataloader

Dataloader is a small library released by Facebook that solves the N+1 problem by cleverly leveraging the way promises work. To use it, we give it a batch loading function and then replace our calls to the database with calls to Dataloader’s load method in all our resolves.

What, you might ask, is a batch loading function? The dataloader documentation offers that “A batch loading function accepts an Array of keys, and returns a Promise which resolves to an Array of values.”

We can write one of those.

async function getFriendsByIDs (ids) {
  // Dataloader's contract: take an array of keys, resolve to an array of
  // values of the same length, in the same order. The outer FOR loop
  // returns one array of friends per incoming id, preserving that order.
  let query = aql`
    FOR id IN ${ ids }
      let friends = (
        FOR vertex IN OUTBOUND id knows
          RETURN vertex
      )
      RETURN friends
  `
  let response = await db.query(query)
  return response.all()
}

We can then use that in a new test.

import {
  graphql,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList
} from 'graphql'
import DataLoader from 'dataloader'
import {
  db,
  getFriendsByIDs
} from '../src/database'
import schema from '../src/schema'

describe('Using dataloader', () => {

  it('returns friends of friends', async () => {

    // this mirrors the Person type inside the imported schema; its friends
    // resolve now reads the FriendsLoader off the context object
    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        },
        friends: {
          type: new GraphQLList(Person),
          resolve(root, args, context) {
            return context.FriendsLoader.load(root._id)
          }
        }
      })
    })

    let query = `
        query {
          person(name: "Eve") {
            name
            friends {
              name
              friends {
                name
              }
            }
          }
        }
      `
    const FriendsLoader = new DataLoader(getFriendsByIDs)
    let result = await graphql(schema, query, {}, { FriendsLoader })
    let { person } = result.data
    expect(person.name).toEqual('Eve')
    expect(person.friends.length).toEqual(2)
    let names = person.friends.map(friend => friend.name)
    expect(names).toEqual(expect.arrayContaining(['Alice', 'Bob']))
  })

})

The key section of the above test is this:

    const FriendsLoader = new DataLoader(getFriendsByIDs)
    //                         schema, query, root, context
    let result = await graphql(schema, query, {}, { FriendsLoader })

The context object is passed as the fourth parameter to the graphql function which is then available as the third parameter in every resolve function. With our FriendsLoader attached to the context object, you can see us accessing it in the resolve function on the Person type.

Let’s see what effect that batch loading has on our queries.

// getPersonByName('Eve') from the person resolver in our schema 
{"query":"FOR person IN persons
  FILTER person.name == @value0
  LIMIT 1 RETURN person","bindVars":{"value0":"Eve"}}
// getFriendsByIDs(["persons/eve"]) -> returns Bob & Alice.
{"query":"FOR id IN @value0
   let friends = (
    FOR vertex IN  OUTBOUND id knows
      RETURN vertex
    )
  RETURN friends","bindVars":{"value0":["persons/eve"]}}
// getFriendsByIDs(["persons/alice","persons/bob"])
{"query":"FOR id IN @value0
   let friends = (
    FOR vertex IN  OUTBOUND id knows
      RETURN vertex
    )
  RETURN friends","bindVars":{"value0":["persons/alice","persons/bob"]}}

Now for a three level query (Eve, her friends, their friends) we are down to just one query per level, and the N+1 problem is no longer a problem.

When it’s time to serve your data to the world, express-graphql supplies a middleware that we can pass our schema and loaders to like this:

import express from 'express'
import graphqlHTTP from 'express-graphql'
import schema from './schema'
import DataLoader from 'dataloader'
import { getFriendsByIDs } from './database'

const FriendsLoader = new DataLoader(getFriendsByIDs)
const app = express()
app.use('/graphql', graphqlHTTP({ schema, context: { FriendsLoader }}))
app.listen(3000)
// http://localhost:3000/graphql is up and running!
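
One caveat worth knowing about: Dataloader caches every key it loads for the lifetime of the loader instance, so a single module-level loader like the one above will keep serving the same friends even after the underlying data changes. The usual advice is to create the loaders per request, which express-graphql supports by accepting a function instead of an options object. A sketch of that, using the same imports as above:

app.use('/graphql', graphqlHTTP(() => ({
  schema,
  // a fresh loader per request means Dataloader's cache only lives
  // as long as a single query
  context: { FriendsLoader: new DataLoader(getFriendsByIDs) }
})))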

What we just did

With just those few code examples we’ve built a backend system that provides a query-able API for clients backed by a graph database. Growing it would look like adding a few more functions and a few more types. The code stays modular and testable. Dataloader has ensured that we didn’t even pay a performance penalty for that.

A perfect combination

While geeking out on the technology is fun, it loses sight of what I think is the larger point: The design of both GraphQL and ArangoDB allow you to combine and recombine some really simple primitives to tackle anything you can think of.

With ArangoDB, it’s all just documents; whether you use them like that, or treat them as key/value pairs or a graph, is up to you. While this approach is marketed as a “multi-model” database, the term is unfortunate since it makes the database sound like it’s trying to do lots of things, instead of leveraging some fundamental similarity between these types of data. That similarity becomes the “primitive” that makes all this flexibility possible.

For GraphQL, my application is just a bunch of functions in an Abstract Syntax Tree which get combined and recombined by client queries. The parser and execution engine take care of what gets called when.

In each case what I need to understand is simple, the behaviour I can produce is complex.

I’m still honing my minimal set for front end development, but for backend development this is now how I build. These days I’m refocusing my learning to go narrow and deep and it feels good. Infinite width never felt sustainable. It’s not apparent at first, but once that burden is lifted off your shoulders you will realize how heavy it was.

Graph migrations

One of the things that is not obvious at first glance is how “broad” ArangoDB is. By combining the flexibility of the document model with the joins of the graph model, ArangoDB has become a viable replacement for Postgres or MySQL, which is exactly how I have been using it; for the last two years it’s been my only database.

One of the things that falls out of that type of usage is a need to actually change the structure of your graph. In a graph, structure comes from the edges that connect your vertices. Since both the vertices and edges are just documents, that structure is actually just more data. Changing your graph structure essentially means refactoring your data.

There are definitely patterns that appear in that refactoring, and over the last little while I have been playing with putting the obvious ones into a library called graph_migrations. It’s a work in progress, but there are some useful functions working already that could use some proper documentation.

eagerDelete

One of the first of these is what I have called eagerDelete. If you wanted to delete Bob from the graph below, Charlie and Dave would be orphaned.

[Screenshot: the “knows” graph, with Charlie and Dave connected only through Bob]

Deleting Bob with eagerDelete means that Bob is deleted as well as any neighbors whose only neighbor is Bob.

gm = new GraphMigration("test") //use the database named "test"
gm.eagerDelete({name: "Bob"}, "knows_graph")

[Image: the graph after eagerDelete: only Alice and Eve remain]

mergeVertices

Occasionally you will end up with duplicate vertices, which should be merged together. Below you can see we have an extra Charlie vertex.

[Image: the graph with a duplicate CHARLIE vertex]

gm = new GraphMigration("test")
gm.mergeVertices({name: "CHARLIE"},{name: "Charlie"}, "knows_graph")

[Image: the graph after merging the two Charlie vertices]

attributeToVertex

One of the other common transformations is needing to make a vertex out of an attribute. This process of “promoting” something to be a vertex is sometimes called reifying. Let’s say Eve and Charlie are programmers.

[Image: the “knows” graph]

Let’s add an attribute called job to both Eve and Charlie identifying them as programmers:

[Image: Eve and Charlie with job: "programmer" attributes]

But let’s say we decide that it makes more sense for job: "programmer" to be a vertex of its own (we want to reify it). We can use the attributeToVertex function for that, but because Arango allows us to split our edge collections (and it’s good practice to do that), let’s add a new edge collection to our “knows_graph” to store the edges that will be created when we reify this attribute.

[Screenshot: adding a works_as edge collection to the graph definition]

With that we can run attributeToVertex, telling it the attribute(s) to look for, the graph (knows_graph) to search and the collection to save the edges in (works_as).

gm = new GraphMigration("test")
gm.attributeToVertex({job: "programmer"}, "knows_graph", "works_as", {direction: "inbound"})

The result is this:

[Image: the graph after attributeToVertex: Eve and Charlie connected to a new programmer vertex]

vertexToAttribute

Another common transformation is exactly the reverse of what we just did; folding the job: "programmer" vertex into the vertices connected to it.

gm = new GraphMigration("test")
gm.vertexToAttribute({job: "programmer"}, "knows_graph", {direction: "inbound"})

That code puts us right back to where we started, with Eve and Charlie both having a job: "programmer" attribute.

[Image: back to the original “knows” graph]

redirectEdges

There are times when things are just not connected the way you want. Let’s say in our knows_graph we want all the inbound edges pointing at Bob to point instead to Charlie.

[Image: the “knows” graph]

We can use redirectEdges to do exactly that.

gm = new GraphMigration("test")
gm.redirectEdges({_id: "persons/bob"}, {_id: "persons/charlie"}, "knows_graph", {direction: "inbound"})

And now Eve and Alice know Charlie.

[Image: the graph with Eve’s and Alice’s edges redirected to Charlie]

Where to go from here

As the name “graph migrations” suggests the thinking behind this was to create something similar to the Active Record Migrations library from Ruby on Rails but for graphs.

As more and more of this goes from idea to code and I get a chance to play with it, I’m less sure that a direct copy of Migrations makes sense. Graphs are actually pretty fine-grained data in the end, and maybe something more interactive makes sense. It could be that this makes more sense as a Foxx app, or perhaps as part of Arangojs or ArangoDB’s admin interface. It feels a little too early to tell.

Beyond providing a little documentation, the hope here is to make this a little more visible to people who are thinking along the same lines and might be interested in contributing.

Back up your data, give it a try and tell me what you think.

Graph traversals in ArangoDB

ArangoDB’s AQL query language was created to offer a unified interface for working with key/value, document and graph data. While AQL has been easy to work with and learn, it wasn’t until the addition of AQL traversals in ArangoDB 2.8 that it really felt like it had achieved that goal.

Adding the keywords GRAPH, OUTBOUND, INBOUND and ANY suddenly made iteration using a FOR loop the central idea in the language. This one construct can now be used to iterate over everything: collections, graphs or documents:

//FOR loops for everything
FOR person IN persons //collections
  FOR friend IN OUTBOUND person GRAPH "knows_graph" //graphs
    FOR value in VALUES(friend, true) //documents
    RETURN DISTINCT value

AQL has always felt more like programming than SQL ever did, but the central role of the FOR loop gives it a clarity and simplicity that makes AQL very nice to work with. While this is a great addition to the language, it does, however, mean that there are now 4 different ways to traverse a graph in AQL, and a few things are worth pointing out about the differences between them.

AQL Traversals

There are two variations of the AQL traversal syntax: the named graph and the anonymous graph. The named graph version uses the GRAPH keyword and a string indicating the name of an existing graph. With the anonymous syntax you simply supply the edge collections:

//Passing the name of a named graph
FOR vertex IN OUTBOUND "persons/eve" GRAPH "knows_graph"
//Pass an edge collection to use an anonymous graph
FOR vertex IN OUTBOUND "persons/eve" knows

Both of these will return the same result. The traversal of the named graph uses the vertex and edge collections specified in the graph definition, while the anonymous graph uses the vertex collection names from the _to/_from attributes of each edge to determine the vertex collections.
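
The anonymous syntax will also accept several edge collections at once, which is handy once you start splitting your edges up. A quick sketch, borrowing the works_as edge collection from the graph migrations section above purely for illustration:

//Traverse the knows and works_as edge collections in one go
FOR vertex IN OUTBOUND "persons/eve" knows, works_as
  RETURN vertex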

If you want access to the edge or the entire path all you need to do is ask:

FOR vertex IN OUTBOUND "persons/eve" knows
FOR vertex, edge IN OUTBOUND "persons/eve" knows
FOR vertex, edge, path IN OUTBOUND "persons/eve" knows

The vertex, edge and path variables can be combined and filtered on to do some complex stuff. The Arango docs show a great example:

FOR v, e, p IN 1..5 OUTBOUND 'circles/A' GRAPH 'traversalGraph'
  FILTER p.edges[0].theTruth == true
  AND p.edges[1].theFalse == false
  FILTER p.vertices[1]._key == "G"
  RETURN p

Notes

Arango can end up doing a lot of work to fill in those FOR v, e, p IN variables. ArangoDB is really fast, so to show the effect these variables can have, I created the most inefficient query I could think of: a directionless traversal across a high-degree vertex with no indexes.

The basic setup looked like this except with 10000 vertices instead of 10. The test was getting from start across the middle vertex to end.

[Screenshot: the test setup: a start vertex and an end vertex connected through a single middle vertex]

What you can see is that adding those variables comes at a cost, so only declare ones you actually need.

[Chart: effects of traversal variables. Traversing a supernode with 10000 incident edges with various traversal methods. N=5. No indexes used.]
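
In practice that means writing the traversal to declare only what the query body actually uses. A sketch against the same vertices/edges collections as the test above: if you need the vertex and the edge but not the path, stop at v, e.

//v and e get used, so they are declared; p is never declared,
//which spares Arango from materializing the full paths
FOR v, e IN 2 ANY "vertices/1" edges
  FILTER v.name == "end"
    RETURN {vertex: v, via: e}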

GRAPH_* functions and TRAVERSAL

ArangoDB also has a series of “Named Operations” that feature among them a few that also do traversals. There is also a super old-school TRAVERSAL function hiding in the “Other” section. What’s interesting is how different their performance can be while still returning the same results.

I tested all of the traversal functions on the same supernode described above. These are the queries:

//AQL traversal
FOR v IN 2 ANY "vertices/1" edges
  FILTER v.name == "end"
    RETURN v

//GRAPH_NEIGHBORS
RETURN GRAPH_NEIGHBORS("db_10000", {_id: "vertices/1"}, {direction: "any", maxDepth:2, includeData: true, neighborExamples: [{name: "end"}]})

//GRAPH_TRAVERSAL
RETURN GRAPH_TRAVERSAL("db_10000", {_id:"vertices/1"}, "any", {maxDepth:2, includeData: true, filterVertices: [{name: "end"}], vertexFilterMethod: ["exclude"]})

//TRAVERSAL
RETURN TRAVERSAL(vertices, edges, {_id: "vertices/1"}, "any", {maxDepth:2, includeData: true, filterVertices: [{name: "end"}], vertexFilterMethod: ["exclude"]})

All of these returned the same vertex, just with varying levels of nesting within various arrays. Removing the nesting did not make a significant difference in the execution time.

[Chart: traversal comparison. Traversing a supernode with 10000 incident edges with various traversal methods. N=5.]

Notes

While TRAVERSAL and GRAPH_TRAVERSAL were not stellar performers here, they both have a lot to offer in terms of customizability. For ordering, depth-first searches, and custom expanders and visitors, this is the place to look. As you explore the options, I’m sure these can get much faster.

Slightly less obvious, but still worth pointing out: where AQL traversals require an id ("vertices/1000" or a document with an _id attribute), GRAPH_* functions just accept an example like {foo: "bar"} (I’ve passed in {_id: "vertices/1"} as the example just to keep things comparable). Being able to find things without needing to know a specific id, or what collection to look in, is very useful. It lets you abstract away document-level concerns like collections and operate on a higher “graph” level, so you can avoid hardcoding collections into your queries.

What it all means

The differences between these, at least superficially, similar traversals are pretty surprising. While some were faster than others, none of the options for tightening the scope of the traversal were used (edge restrictions, indexes, directionality). That tells you there is likely a lot of headroom for performance gains for all of the different methods.

The conceptual clarity that AQL traversals bring to the language as a whole is really nice, but it’s clear there is some optimization work to be done before I go and rewrite all my queries.

Where I have used the new AQL traversal syntax, I’m also going to have to check to make sure there are no unused v,e,p variables hiding in my queries. Where you need them, it looks like restricting yourself to v,e is the way to go. Generating those full paths is costly. If you use them, make sure it’s worth it.

Slowing Arango down is surprisingly instructive, but with 3.0 bringing the switch to Velocypack for JSON serialization, new indexes, and more, it looks like it’s going to get harder to do. :)


A quick tour of Arangojs

I’ve been using ArangoDB for a while now, but for most of that time I’ve been using it from Ruby. I’ve dabbled with the Guacamole library and even took a crack at writing my own, but switching to Javascript has led me to get to know Arangojs.

Given that Arangojs is talking to ArangoDB via its HTTP API, basically everything you do is asynchronous. There are a few ways of dealing with async code in Javascript, and Arangojs has been written to support basically all of them.

Arangojs’s flexibility and my inexperience with the new Javascript syntax combined to give me a bit of an awkward start, so with a little learning under my belt I thought I would write up some examples that would have saved me some time.

My most common use case is running an AQL query, so let’s use that as an example. First up, I’ve been saving my config in a separate file:

// arango_config.js
//Using auth your url would look like:
// "http://uname:passwd@127.0.0.1:8529"
module.exports = {
  "production" : {
    "databaseName": process.env.PROD_DB_NAME,
    "url": process.env.PROD_DB_HOST,
  },
  "development" : {
    "databaseName": process.env.DEVELOPMENT_DB_NAME,
    "url": process.env.DEVELOPMENT_URL
  },
  "test" : {
    "databaseName": "test",
    "url": "http://127.0.0.1:8529",
  },
}

With that I can connect to one of my existing databases like so:

var config = require('../arangodb_config')[process.env.NODE_ENV]
var db = require('arangojs')(config)

This keeps my test database nicely separated from everything else and all my db credentials in the environment and out of my project code.

Assuming that our test db has a collection called “documents” containing a single document, we can use Arangojs to go get it:

db.query('FOR doc IN documents RETURN doc', function(err, cursor) {
  cursor.all(function(err, result) {
    console.log(result)
  })
})

Which returns:

[ { foo: 'bar',
    _id: 'documents/206191358605',
    _rev: '206192931469',
    _key: '206191358605' } ]

While this is perfectly valid Javascript, it’s pretty old-school at this point, since ECMAScript 2015 is now standard in both Node.js and any browser worth having. This means we can get rid of the “function” keyword, replace it with the “fat arrow” syntax, and get the same result:

db.query('FOR doc IN documents RETURN doc', (err, cursor) => {
  cursor.all((err, result) => {
    console.log(result)
  })
})

So far so good but the callback style (and the callback-hell it brings) is definitely an anti-pattern. The widely cited antidote to this is promises:

db.query('FOR doc IN documents RETURN doc')
  .then((cursor) => { return cursor.all() })
  .then((doc) => { console.log(doc) });

This code is functionally equivalent, but operates by chaining promises together. While it’s an improvement over callback-hell, after writing a bunch of this type of code I ended up feeling like I had replaced callback hell with promise hell.

[Image: “What fresh hell is this?”]

The path back to sanity lies in the new async/await keywords (proposed for ES7 at the time, and eventually standardized in ECMAScript 2017). Inside a function marked as async, you have access to an await keyword which allows you to write code that looks synchronous but does not block the event loop.

Using the babel transpiler lets us use the new ES7 syntax right now by compiling it all down to ES5/6 equivalents. Installing with npm install -g babel and running your project with babel-node is all that you need to be able to write this:

(async () => {
    let cursor = await db.query('FOR doc IN documents RETURN doc')
    let result = await cursor.all()
    console.log(result)
})()

Once again we get the same result but without all the extra cruft that we would normally have to write.

One thing that is notably absent in these examples is the use of bound variables in our queries to avoid SQL injection (technically parameter injection since this is NoSQL).

So what does that look like?

(async () => {
    let bindvars = {foo: "bar"}
    let cursor = await db.query('FOR doc IN documents FILTER doc.foo == @foo RETURN doc', bindvars)
    let result = await cursor.all()
    console.log(result)
})()

But Arangojs lets you go further, giving you a nice aqlQuery function based on ES6 template strings:

// aqlQuery is arangojs's template-string helper:
// var aqlQuery = require('arangojs').aqlQuery
(async () => {
    let foo = "bar"
    let aql = aqlQuery`
      FOR doc IN documents
        FILTER doc.foo == ${foo}
          RETURN doc
    `
    let cursor = await db.query(aql)
    let result = await cursor.all()
    console.log(result)
})()

It’s pretty astounding how far that simple example has come. It’s hard to believe that it’s even the same language. With Javascript (the language and the community) clearly in transition, Arangojs (and likely every other JS library) is compelled to support both the callback style and promises. It’s a bit surprising to see how much leeway that gives me to write some pretty sub-optimal code.

With all the above in mind, suddenly Arangojs’s async heavy API no longer feels intimidating.

The documentation for Arangojs is simple (just a long readme file) but comprehensive and there is lots more it can do. Hopefully this little exploration will help people get started with Arangojs a little more smoothly than I did.

Extracting test data from ArangoDB’s admin interface

Test Driven Development is an important part of my development process, and ArangoDB’s speed, schema-less nature and truncate command make testing really nice.
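
To give a sense of what that looks like (a minimal sketch; the collection names are assumptions), a test suite can simply truncate its collections between tests rather than dropping and rebuilding the database:

// Sketch: wiping collections between tests with arangojs's truncate,
// which deletes every document but keeps the collection and its indexes
var db = require('arangojs')({databaseName: 'test', url: 'http://127.0.0.1:8529'})

beforeEach(async () => {
  await db.collection('persons').truncate()
  await db.collection('knows').truncate()
})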

Testing has ended up being especially important to me when it comes to AQL (Arango Query Language) queries. Just the same way that it’s easy to write a regular expression that matches more than you expect, constraining the traversal algorithm so you get what you want (and only that) can be tricky.

AQL queries that traverse a graph are often (maybe always?) sensitive to the structure of the graph. The direction of the edges (inbound/outbound) or the number of edges to cross (maxDepth) are often used to constrain a traversal. Both of these are examples of how details of your graph’s structure get built into your AQL queries. When the structure isn’t what you think, you can end up with some pretty surprising results coming back from your queries.
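
As a small illustration (a sketch against the “knows” graph from earlier): this traversal hardcodes both a direction and a depth range, and a single edge pointing the “wrong” way, or one extra hop, would quietly change what comes back.

//Both the direction (OUTBOUND) and the depth bounds (1..2) are
//assumptions about the graph's structure baked into the query
FOR vertex IN 1..2 OUTBOUND "persons/eve" knows
  RETURN vertex.name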

All of that underscores the need to test against data that you know has a specific structure. A few times now I have found myself with a bunch of existing graph data, wondering how to pick out a selected subgraph to test my AQL queries against.

ArangoDB’s web interface gets me tantalizingly close, letting me filter down to a single starting node and clicking with the spot tool to reveal its neighbors.

[Screenshot: filtering for a specific vertex in the graph]

In a few clicks I can get exactly the vertices and edges that I need to test my queries, and because I have seen it, I know the structure is correct, and has only what I need. All that is missing is a way to save what I see.

Since this seems to keep coming up for me, I’ve solved this for myself with a little hackery that I’ve turned to several times now. The first step is turning on Firefox’s dump function by entering about:config in the URL bar and searching the settings for “dump”.

[Screenshot: Firefox’s browser.dom.window.dump.enabled setting]

The dump function allows you to write to the terminal from javascript. Once that is set to true, launching Firefox from the terminal, and typing dump("foo") in the Javascript console should print “foo” in the controlling terminal.

[Screenshot: test data dumped to the terminal]

Next, since the graph viewer uses D3 for its visualization, we can dig into the DOM and print out the bits we need using dump. Pasting the following into the Javascript console will print out the edges:

var edges = document.querySelector('#graphViewerSVG').childNodes[0].childNodes[0].children;
for(var i = 0; i < edges.length; i++) {
  dump("\r\n" + JSON.stringify(edges[i].__data__._data) + "\r\n");
}

And then this will print out the vertices:

var vertices = document.querySelector('#graphViewerSVG').childNodes[0].childNodes[1].children;
for(var i = 0; i < vertices.length; i++) {
  dump("\r\n" + JSON.stringify(vertices[i].__data__._data) + "\r\n");
}

With the vertices and edges now printed to the terminal, a little copy/paste action and you can import the data into your test database before running your tests with arangojs’s import function.

myCollection.import([
  {foo: "bar"},
  {fizz: "buzz"}
])

Alternately you can upload JSON files into the collection via the web interface as well.

[Screenshot: importing JSON into a collection]

While this process has no claim on elegance, it’s been very useful for testing my AQL queries and has saved me a lot of hassle.

Data modeling with ArangoDB

Since 2009 there has been a “Cambrian Explosion” of NoSQL databases, but information on data modeling with these new data stores feels hard to come by. My weapon of choice for over a year now has been ArangoDB. While ArangoDB is pretty conscientious about having good documentation, there has been something missing for me: criteria for making modeling decisions.

Like most (all?) graph databases, ArangoDB allows you to model your data with a property graph. The building blocks of a property graph are attributes, vertices and edges. What makes data modelling with ArangoDB (and any other graph database) difficult is deciding between them.

To start with we need a little terminology. Since a blog is a well known thing, we can use a post with some comments and some tags as our test data to illustrate the idea.

Sparse vs Compact

Modeling our blog post as a “sparse” graph might look something like this:

[Diagram: the blog post modeled as a sparse graph]

At first glance it looks satisfyingly graphy: in the centre we see a green “Data Modeling” vertex which has an edge going to another vertex, “post”, indicating that “Data Modeling” is a post. Commenters, tags and comments all have connections to a vertex representing their type as well.

Looking at the data you can see we are storing lots of edges and most vertices contain only a single attribute (apart from the internal attributes ArangoDB creates: _id, _key, _rev).

//vertices
[{"_id":"vertices/26590247555","_key":"26590247555","_rev":"26590247555","title":"That's great honey","text":"Love you!"},
{"_id":"vertices/26590378627","_key":"26590378627","_rev":"26590378627","type":"comment"},
{"_id":"vertices/26590509699","_key":"26590509699","_rev":"26590509699","name":"Spammy McSpamerson","email":"spammer@fakeguccihandbags.com"},
{"_id":"vertices/26590640771","_key":"26590640771","_rev":"26590640771","title":"Brilliant","text":"Gucci handbags..."},
{"_id":"vertices/26590771843","_key":"26590771843","_rev":"26590771843","name":"arangodb"},
{"_id":"vertices/26590902915","_key":"26590902915","_rev":"26590902915","name":"modeling"},
{"_id":"vertices/26591033987","_key":"26591033987","_rev":"26591033987","name":"nosql"},
{"_id":"vertices/26591165059","_key":"26591165059","_rev":"26591165059","type":"tag"}]
 
//edges
[{"_id":"edges/26604010115","_key":"26604010115","_rev":"26604010115","_from":"vertices/26589723267","_to":"vertices/26589395587"},
{"_id":"edges/26607352451","_key":"26607352451","_rev":"26607352451","_from":"vertices/26589723267","_to":"vertices/26589854339"},
{"_id":"edges/26608204419","_key":"26608204419","_rev":"26608204419","_from":"vertices/26590640771","_to":"vertices/26590378627"},
{"_id":"edges/26609842819","_key":"26609842819","_rev":"26609842819","_from":"vertices/26590247555","_to":"vertices/26590378627"},
{"_id":"edges/26610694787","_key":"26610694787","_rev":"26610694787","_from":"vertices/26589985411","_to":"vertices/26590247555"},
{"_id":"edges/26611546755","_key":"26611546755","_rev":"26611546755","_from":"vertices/26589395587","_to":"vertices/26590247555"},
{"_id":"edges/26615020163","_key":"26615020163","_rev":"26615020163","_from":"vertices/26589985411","_to":"vertices/26590116483"},
{"_id":"edges/26618821251","_key":"26618821251","_rev":"26618821251","_from":"vertices/26590771843","_to":"vertices/26591165059"},
{"_id":"edges/26622622339","_key":"26622622339","_rev":"26622622339","_from":"vertices/26589395587","_to":"vertices/26589592195"},
{"_id":"edges/26625833603","_key":"26625833603","_rev":"26625833603","_from":"vertices/26590509699","_to":"vertices/26590640771"},
{"_id":"edges/26642741891","_key":"26642741891","_rev":"26642741891","_from":"vertices/26589395587","_to":"vertices/26590902915"},
{"_id":"edges/26645101187","_key":"26645101187","_rev":"26645101187","_from":"vertices/26589395587","_to":"vertices/26590771843"},
{"_id":"edges/26649885315","_key":"26649885315","_rev":"26649885315","_from":"vertices/26589395587","_to":"vertices/26591033987"},
{"_id":"edges/26651064963","_key":"26651064963","_rev":"26651064963","_from":"vertices/26590902915","_to":"vertices/26591165059"},
{"_id":"edges/26651785859","_key":"26651785859","_rev":"26651785859","_from":"vertices/26591033987","_to":"vertices/26591165059"},
{"_id":"edges/26652965507","_key":"26652965507","_rev":"26652965507","_from":"vertices/26590509699","_to":"vertices/26590116483"},
{"_id":"edges/26670267011","_key":"26670267011","_rev":"26670267011","_from":"vertices/26589395587","_to":"vertices/26590640771"}]

A “compact” graph on the other hand might look something like this:

{
  title: "Data modelling",
  text: "lorum ipsum...",
  author: "Mike Williamson",
  date: "2015-11-19",
  comments: [
    {
      author:"Mike's Mum",
      email:"mikes_mum@allthemums.com",
      text: "That's great honey",
    },
    {
      "author": "spammer@fakeguccihandbags.com",
      "title": "Brilliant",
      "text": "Gucci handbags...",
    }
  ],
  tags:["mongodb","modeling","nosql"]
}

Here we have taken exactly the same data and collapsed it together into a single document. While it’s a bit of a stretch to even classify this as a graph, ArangoDB’s multi-model nature largely erases the boundary between a document and a graph with a single vertex.

The two extremes above give us some tools for talking about our graph. It’s the same data either way, but clearly different choices are being made. In the sparse graph, every vertex you see could have been an attribute, but was consciously moved into a vertex of its own. The compact graph is what comes out of repeatedly choosing to add new data as an attribute rather than a vertex.

When modeling real data your decisions won’t always fall on one side or the other. So what criteria should we be using to make those decisions?

Compact by default

As a baseline you should favor a compact graph. Generally data that is displayed together should be combined into a single document.

Defaulting to compact will mean fewer edges will exist in the graph as a whole. Since each traversal across a graph will have to find, evaluate and then traverse the edges for each vertex it encounters, keeping the number of edges to a strict minimum will ensure traversals stay fast as your graph grows. Compact graphs will also mean fewer queries and traversals to get the information you need. When in doubt, embed. Resist the temptation to add vertices just because it’s a graph database, and keep it compact.

But not everything belongs together. Any attribute that contains a complex data structure (like the “comments” array or the “tags” array) deserves a little scrutiny as it might make sense as a vertex (or vertices) of its own.

Looking at our compact graph above, the array of comments, the array of tags, and maybe even the author might be better off as vertices rather than leaving them as attributes. How do we decide?

  • Will it be accessed on its own? (ie: showing comments without the post)
  • Will you be running a graph measurement (like GRAPH_BETWEENNESS) on it?
  • Will it be edited on its own?
  • Does/could the attribute have relationships of its own? (assuming you care)
  • Would/should this attribute exist without its parent vertex?
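
If an attribute passes a few of those tests, pulling it out might look something like this (a sketch only; the comments and comments_on collection names are made up for illustration):

//posts/1 keeps its own data...
{"title": "Data modelling", "text": "lorum ipsum...", "tags": ["modeling", "nosql"]}

//...each comment becomes a vertex that can be queried, measured
//and edited on its own (comments/42)
{"author": "Mike's Mum", "text": "That's great honey"}

//...and an edge preserves the connection (comments_on/7)
{"_from": "comments/42", "_to": "posts/1"}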

Removing duplicate data can also be a reason to make something a vertex, but with the cost of storage ridiculously low (and dropping) it’s a weak reason. Finding yourself updating duplicate data, however, tells you that it should have been a vertex.

Edge Direction

Once you promote a piece of data to being a vertex (or “reify” it), your next decision is which way the edge connecting it to another vertex should go. Edge direction is a powerful way to put up boundaries to contain your traversals, and while the boundary is important, the actual directions are not. Whatever you choose, it just needs to be consistent: one edge going the wrong direction is going to have your traversal returning things you don’t expect.
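
A sketch of what that consistency buys you (assuming a has_tag edge collection where edges always point from the post to the tag):

//Convention (assumed): edges always point post -> tag

//All the tags for a post: follow the edges in their natural direction
FOR tag IN OUTBOUND "posts/1" has_tag
  RETURN tag.name

//All the posts for a tag: the same edges, walked the other way
FOR post IN INBOUND "tags/nosql" has_tag
  RETURN post.title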

And another thing…

This post is the post I kept hoping to find as I worked on modeling my data with ArangoDB. It’s not complete; data modeling is a big topic and there is lots more depth to ArangoDB to explore (ie: I haven’t yet tried splitting my edges amongst multiple edge collections), but these are some of the guidelines I was hoping for when I was starting.

I would love to learn more about the criteria people are using to make those tough calls between attribute and vertex, and all those other hard modeling decisions.

If you have thoughts on this let me know!

When to use a graph database

There are a lot of “intro to graph database” tutorials on the internet. While the “how” part of using a graph database has its place, I don’t know if enough has been said about “when”.

The answer to “when” depends on the properties of the data you are working with. In broad strokes, you should probably keep a graph database in mind if you are dealing with a significant amount of any of the following:

  • Hierarchical data

  • Connected data

  • Semi-structured data

  • Data with Polymorphic associations

Each of these data types either requires some number of extra tables or columns (or both) to deal with under the relational model. The cost of these extra tables and columns is an increase in complexity.

Terms like “connected data” or “semi-structured data” get used a lot in the NoSQL world, but the definitions, where you can find one at all, have a “you’ll know it when you see it” flavour to them. “You’ll know it when you see it” can be distinctly unsatisfying when examples are hard to come by as well. Let’s take a look at these one by one and get a sense of what they mean generally, and how to recognize them in existing relational-database-backed projects.

Hierarchical Data

Hierarchies show up everywhere. There are geographical hierarchies where a country has many provinces, which have many cities which have many towns. There is also the taxonomic rank, indicating the level of a taxon in the Taxonomic Hierarchy, organizational hierarchies, the North American Industry Classification system… the list goes on and on.

What it looks like in a relational database.

Usually it’s easy to tell if you are dealing with this type of data. In an existing database schema you may see tables with a parent_id column, indicating the use of the Adjacency List pattern, or left/right columns indicating the use of the Nested Sets pattern. There are many others as well.
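
For contrast, this is roughly what querying a hierarchy looks like on the graph side (an AQL sketch; the places vertices and child_of edge collection are assumptions): the whole ancestor chain comes back from one variable-depth traversal, no recursive joins required.

//Walk up the hierarchy from a town to everything that contains it
FOR ancestor IN 1..10 OUTBOUND "places/some_town" child_of
  RETURN ancestor.name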

Connected Data

Connected data is roughly synonymous with graph data. When we are talking about graph data we mean bits of data, plus information about how those bits of data are related.

What it looks like in a relational database.

Join tables are the most obvious sign that you are storing graph data. Join tables exist solely to act as the target of two one-to-many relationships, each row representing a relationship between two rows in other tables. Cases where you are storing data in the join table (such as the Rails has_many :through relationship) are even more clear, since the columns of your join table are attributes of an edge.

While one-to-many relationships also technically describe a graph, they probably are not going to make you reconsider the use of a relational database the way large numbers of many-to-many relationships might.

Semi-structured Data

Attempts to define semi-structured data seem to focus on variability; just because one piece of data has a particular set of attributes does not mean that the rest do. You can actually get an example of semi-structured data by mashing together two sets of structured (tabular) data. In this world of APIs and SOA where drawing data from multiple sources is pretty much the new normal, semi-structured data is increasingly common.

What it looks like in a relational database.

Anywhere you have columns with lots of null values in them. The columns provide the structure, but long stretches of null values suggest that this data does not really fit that structure.

[Image: an example of semi-structured data: a hypothetical products table combining books (structured data) and music (also structured data)]

Polymorphic associations

Having one type of data with an association that might relate to one of two or more things is what’s known as a polymorphic association. As an example, a photo might be related to a user or a product.

What it looks like in a relational database.

While polymorphic relations can be done in a relational database, most commonly they are handled at the framework level, where the framework uses a foreign key and an additional “type” column to determine the correct table/row. Seeing both a something_id and a something_type column in the same table is a hint that a polymorphic relationship is being used. Both Ruby on Rails and Java’s Spring Framework offer this.
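
On the graph side this problem mostly evaporates, since an edge’s _to can point at a document in any collection; no type column is needed. A sketch (the photos, users, products and photo_of names are assumptions):

//Two edges in the same photo_of collection, pointing at
//different types of things
{"_from": "photos/77", "_to": "users/42"}
{"_from": "photos/78", "_to": "products/7"}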

So when?

These types of data are known to be an awkward fit for the relational model, in the same way that storing large quantities of perfectly tabular data would be awkward under the graph model. These are ultimately threshold problems, like the famous paradox of the heap.

1000000 grains of sand is a heap of sand

A heap of sand minus one grain is still a heap.

Your first join table or set of polymorphic relations will leave you with a perfectly reasonable database design, but just as “a heap of sand minus one grain” will eventually cross some ill-defined threshold and produce something that is no longer a heap of sand, there is some number of join tables or other workarounds for the relational model that will leave you with a database that is significantly more complex than a graph database equivalent would be.

Knowing about the limits of the relational model, and doing some hard thinking about how much time you are spending pressed up against those limits, is really the only thing that can guide your decision making.