Getting started with graph databases

I have a personal project I have been chipping away at for a little while now. I’ve been slowly adding more and more test data to it, and as I do it’s become increasingly clear that while the data itself is neat, the really interesting part is the relationships between the various entities rather than the entities themselves. This realization led me to do some reading about graph databases. O’Reilly (as usual) has an interesting book, Graph Databases, written by Ian Robinson, Jim Webber, and Emil Eifrem. It’s a good intro, but given that they hold the positions of engineer, chief scientist and CEO at the company that makes the Neo4j graph database, it’s unsurprisingly focused on Neo4j.

Unfortunately the ‘j’ part of Neo4j refers to Java, which is a whole can of worms that I would rather not open. So I set off to look for a graph database that would not force me onto the JVM, or trap me with open-core licencing (or an $18,000 per year cost for my startup), and ultimately found ArangoDB.

Licenced under Apache 2, ArangoDB (formerly AvocadoDB) is a document database, in the same vein as MongoDB. What’s interesting is that it can also do key/value stuff like Redis and graphs like Neo4j. Since it’s written in C++, I don’t have to worry about the JVM. So, let’s get started with it!

Installation is painless on Ubuntu:

wget -qO - http://www.arangodb.org/repositories/arangodb2/xUbuntu_13.10/Release.key | sudo apt-key add -
sudo sh -c "echo 'deb http://www.arangodb.org/repositories/arangodb2/xUbuntu_13.10/ /' > /etc/apt/sources.list.d/arangodb.list"
sudo apt-get update && sudo apt-get install arangodb

Since this is my first contact with graphs, I want a dataset I can use to get a feel for working with them. Fortunately the company behind ArangoDB (triAGENS) has put some sample data up on GitHub to get people started:

$> git clone https://github.com/triAGENS/ArangoDB-Data.git
Cloning into 'ArangoDB-Data'...
...
$> cd ArangoDB-Data/Graphs/IMDB
$> ./import.sh

That script loads a bunch of IMDB data into ArangoDB, which means we can start exploring with the arango shell:

$> arangosh

                                       _     
  __ _ _ __ __ _ _ __   __ _  ___  ___| |__  
 / _` | '__/ _` | '_ \ / _` |/ _ \/ __| '_ \ 
| (_| | | | (_| | | | | (_| | (_) \__ \ | | |
 \__,_|_|  \__,_|_| |_|\__, |\___/|___/_| |_|
                       |___/                 

Welcome to arangosh 2.0.2 [linux]. Copyright (c) triAGENS GmbH
Using Google V8 3.16.14 JavaScript engine, READLINE 6.2, ICU 4.8.1.1

Pretty printing values.
Connected to ArangoDB 'tcp://localhost:8529' version: 2.0.2, database: '_system', username: 'root'

use 'help' to see common examples
arangosh [_system]>

Tab completion works super well here to give a sense of what your options are, but the first thing we care about is figuring out what that import did for us. The db._collections() function shows that it created two collections (imdb_vertices and imdb_edges):

arangosh [_system]> db._collections()
[ 
  [ArangoCollection 3021163, "_aal" (type document, status loaded)], 
  [ArangoCollection 1317227, "_graphs" (type document, status loaded)], 
  [ArangoCollection 3545451, "_replication" (type document, status loaded)], 
  [ArangoCollection 137579, "_users" (type document, status loaded)], 
  [ArangoCollection 1513835, "_cluster_kickstarter_plans" (type document, status loaded)], 
  [ArangoCollection 940644715, "vertices" (type document, status loaded)], 
  [ArangoCollection 3414379, "_aqlfunctions" (type document, status loaded)], 
  [ArangoCollection 1382763, "_modules" (type document, status loaded)], 
  [ArangoCollection 3610987, "_statistics" (type document, status loaded)], 
  [ArangoCollection 1160255851, "imdb_vertices" (type document, status loaded)], 
  [ArangoCollection 940710251, "edges" (type edge, status loaded)], 
  [ArangoCollection 3479915, "_trx" (type document, status loaded)], 
  [ArangoCollection 266194196843, "imdb_edges" (type edge, status loaded)], 
  [ArangoCollection 1448299, "_routing" (type document, status loaded)] 
]
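That list mixes ArangoDB's own system collections (the ones prefixed with an underscore) with everything else, so it can help to filter it down. Here is a minimal sketch of how that might look. The helper name userCollectionNames and the stand-in objects are my own inventions for illustration; the only thing it relies on from ArangoDB is that each collection object has a name() method. It is written in old-style JavaScript so it should run both in arangosh and in Node.

```javascript
// Return the names of non-system collections, sorted alphabetically.
// `collections` is an array of objects with a name() method, which is
// what db._collections() hands back in arangosh.
function userCollectionNames(collections) {
  return collections
    .map(function (c) { return c.name(); })
    .filter(function (name) { return name.charAt(0) !== "_"; })
    .sort();
}

// Hypothetical stand-in objects so the sketch runs outside arangosh.
function fakeCollection(name) {
  return { name: function () { return name; } };
}

var sample = ["_users", "imdb_vertices", "_graphs", "imdb_edges", "edges"]
  .map(fakeCollection);

console.log(userCollectionNames(sample));
```

Inside arangosh the call would presumably be userCollectionNames(db._collections()), with no stand-ins needed.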

We can also pick random documents out of the vertices collection with the .any() function to get a sense of what’s in there.

arangosh [_system]> db.imdb_vertices.any()
{ 
  "_id" : "imdb_vertices/40233", 
  "_rev" : "6407199083", 
  "_key" : "40233", 
  "version" : 21, 
  "id" : "65952", 
  "type" : "Person", 
  "birthplace" : "", 
  "biography" : "", 
  "label" : "Jude Poyer", 
  "lastModified" : "1301901667000", 
  "name" : "Jude Poyer" 
}

If you have spent any time on the internet you will of course know that the obvious use for an IMDB graph is to calculate Bacon numbers. So let’s see if we can find Kevin in here:

arangosh [_system]> db._query('FOR Person IN imdb_vertices FILTER Person.name == "Kevin Bacon" RETURN Person').toArray()
[ 
  { 
    "_id" : "imdb_vertices/759", 
    "_rev" : "1218713963", 
    "_key" : "759", 
    "version" : 146, 
    "id" : "4724", 
    "type" : "Person", 
    "biography" : "", 
    "label" : "Kevin Bacon", 
    "lastModified" : "1299491319000", 
    "name" : "Kevin Bacon", 
    "birthplace" : "Philadelphia", 
    "profileImageUrl" : "http://cf1.imgobject.com/profiles/3e0/4bed49cf017a3c37a30003e0/kevin-bacon-profi...", 
    "birthday" : "-362451600000" 
  } 
]
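Pasting the name straight into the query string works fine in the shell, but AQL also supports bind parameters (written @name), which avoids quoting headaches once the value comes from somewhere else. Here is a rough sketch of a lookup helper. It assumes db._query accepts the bind values as a second argument, which is worth checking against the documentation for your version. The findByName helper and the fakeDb stand-in are made up for illustration.

```javascript
// Look up vertices by exact name, passing the name as a bind parameter
// (@name) instead of pasting it into the query string.
// Assumption: db._query(queryString, bindVars) is available.
function findByName(db, name) {
  var query = "FOR v IN imdb_vertices FILTER v.name == @name RETURN v";
  return db._query(query, { name: name }).toArray();
}

// Hypothetical stand-in for arangosh's `db` so the sketch runs elsewhere.
// It ignores the AQL text and only mimics the FILTER on the bound name.
var fakeDb = {
  docs: [
    { _id: "imdb_vertices/759", name: "Kevin Bacon" },
    { _id: "imdb_vertices/1088", name: "Kate Winslet" }
  ],
  _query: function (query, bindVars) {
    var hits = this.docs.filter(function (d) {
      return d.name === bindVars.name;
    });
    return { toArray: function () { return hits; } };
  }
};

console.log(findByName(fakeDb, "Kate Winslet"));
```

In arangosh you would pass the real db object instead of fakeDb.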

And let’s see if we can connect him to, say, Kate Winslet. We know that Kevin’s id is imdb_vertices/759, and a little digging shows that Kate’s id is imdb_vertices/1088. We can pass those ids, along with the imdb_vertices and imdb_edges collections, to the SHORTEST_PATH function ArangoDB supplies and have it make the link between them:

arangosh [_system]> db._query('RETURN SHORTEST_PATH(imdb_vertices, imdb_edges, "imdb_vertices/759", "imdb_vertices/1088", "any", { maxIterations: 100000})').toArray()
[ 
  [ 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/759", 
        "_rev" : "1218713963", 
        "_key" : "759", 
        "version" : 146, 
        "id" : "4724", 
        "type" : "Person", 
        "biography" : "", 
        "label" : "Kevin Bacon", 
        "lastModified" : "1299491319000", 
        "name" : "Kevin Bacon", 
        "birthplace" : "Philadelphia", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/3e0/4bed49cf017a3c37a30003e0/kevin-bacon-profi...", 
        "birthday" : "-362451600000" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/35451", 
        "_rev" : "5779626347", 
        "_key" : "35451", 
        "runtime" : 87, 
        "version" : 186, 
        "id" : "9692", 
        "genre" : "Drama", 
        "language" : "en", 
        "type" : "Movie", 
        "homepage" : "", 
        "tagline" : "", 
        "title" : "The Woodsman", 
        "label" : "The Woodsman", 
        "description" : "A pedophile returns to his hometown after 12 years in prison and attempts to sta...", 
        "imdbId" : "tt0361127", 
        "lastModified" : "1301903901000", 
        "imageUrl" : "http://cf1.imgobject.com/posters/3c1/4bc9281e017a3c57fe0103c1/the-woodsman-mid.j...", 
        "studio" : "Dash Films", 
        "releaseDate" : "1103842800000", 
        "released" : "2000-2010" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/1179", 
        "_rev" : "1274747243", 
        "_key" : "1179", 
        "version" : 90, 
        "id" : "335", 
        "type" : "Person", 
        "biography" : "", 
        "label" : "Michael Shannon", 
        "lastModified" : "1299902807000", 
        "name" : "Michael Shannon", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/01c/4c2a3dc87b9aa15e9900001c/michael-shannon-p..." 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/21077", 
        "_rev" : "3892517227", 
        "_key" : "21077", 
        "runtime" : 119, 
        "version" : 339, 
        "id" : "4148", 
        "genre" : "Drama", 
        "language" : "en", 
        "type" : "Movie", 
        "homepage" : "", 
        "tagline" : "", 
        "title" : "Revolutionary Road", 
        "label" : "Revolutionary Road", 
        "description" : "A young couple living in a Connecticut suburb during the mid-1950s struggle to c...", 
        "imdbId" : "tt0959337", 
        "trailer" : "http://www.youtube.com/watch?v=af01__Kvvr8", 
        "lastModified" : "1301907499000", 
        "imageUrl" : "http://cf1.imgobject.com/posters/627/4d4f8e275e73d617b7003627/revolutionary-road...", 
        "studio" : "BBC Films", 
        "releaseDate" : "1229641200000", 
        "released" : "2000-2010" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/1088", 
        "_rev" : "1262754155", 
        "_key" : "1088", 
        "version" : 102, 
        "id" : "204", 
        "type" : "Person", 
        "label" : "Kate Winslet", 
        "lastModified" : "1299746700000", 
        "name" : "Kate Winslet", 
        "birthplace" : "Reading, UK", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/59f/4c022d0e017a3c702d00159f/kate-winslet-prof...", 
        "biography" : "<meta charset=\"utf-8\"><span style=\"font-family: sans-serif; font-size: 18px; lin...", 
        "birthday" : "181695600000" 
      } 
    } 
  ] 
]

So what we can see here is that it takes two hops (from Kevin to Michael Shannon via “The Woodsman”, and from Michael to Kate via “Revolutionary Road”) to connect Kevin Bacon to Kate Winslet, giving her a Bacon number of 2.
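To get a feel for what a shortest-path search involves, here is a small self-contained sketch that runs a breadth-first search over just the four person-to-movie links from the output above. I have no idea how ArangoDB implements SHORTEST_PATH internally. Breadth-first search is simply the textbook approach for unweighted graphs, and the function names here are my own.

```javascript
// Person <-> movie links taken from the path in the output above.
var edges = [
  ["Kevin Bacon", "The Woodsman"],
  ["Michael Shannon", "The Woodsman"],
  ["Michael Shannon", "Revolutionary Road"],
  ["Kate Winslet", "Revolutionary Road"]
];

// Build an undirected adjacency list (like the "any" direction above).
function buildAdjacency(edgeList) {
  var adj = {};
  edgeList.forEach(function (e) {
    (adj[e[0]] = adj[e[0]] || []).push(e[1]);
    (adj[e[1]] = adj[e[1]] || []).push(e[0]);
  });
  return adj;
}

// Breadth-first search. Returns the vertices on a shortest path from
// start to goal, or null if the two are not connected.
function shortestPath(adj, start, goal) {
  var previous = {};
  previous[start] = null;
  var queue = [start];
  while (queue.length > 0) {
    var current = queue.shift();
    if (current === goal) {
      var path = [];
      for (var v = goal; v !== null; v = previous[v]) {
        path.unshift(v);
      }
      return path;
    }
    (adj[current] || []).forEach(function (next) {
      if (!Object.prototype.hasOwnProperty.call(previous, next)) {
        previous[next] = current;
        queue.push(next);
      }
    });
  }
  return null;
}

// People and movies alternate along the path, so every person-to-person
// hop covers two edges.
function baconNumber(path) {
  return (path.length - 1) / 2;
}

var adjacency = buildAdjacency(edges);
var route = shortestPath(adjacency, "Kevin Bacon", "Kate Winslet");
console.log(route.join(" -> "));
console.log("Bacon number:", baconNumber(route));
```

The path has five vertices and four edges, and dividing the edge count by two gives the same Bacon number of 2 as reading it off the query result by hand.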

For the moment that is as far as I have gotten, but I am pretty excited to explore the possibilities here. The more I think about graphs as a data model, the more they seem to be a good fit for a lot of problems that I would normally be forcing into tables. Given that I can also just do straight document storage, and the fact that they have an Object Document Mapper that works with Rails, I can tell you that ArangoDB and I will be spending a lot of time together.

6 thoughts on “Getting started with graph databases”

  1. Hey Rob,
    I am definitely still happy with it. My only real gripe so far is that I am not finding it as language neutral as “just talk to it with our HTTP API” sort of implies.

    While they have some pretty good Ruby libraries, there are a bunch of little things (like the API being a little chatty, or that doing a transaction means passing a JS function as a string to the server) that all seem to lead towards writing your app in JavaScript and putting it directly on the ArangoDB server as a Foxx app.

    This is a little disconcerting as someone who likes writing Ruby, and I find myself wondering why the handful of things I do in Ruby are worth the flurry of HTTP requests it is costing. Or wondering, if I move a few of the more complicated queries into Foxx, whether there is enough of my Ruby app left to justify its existence…

    Moving to JS/Foxx will remove a tier from my architecture, and make the queries I am doing faster… well, I’m just being obstinate in hanging on to Ruby. Of course all of this is actually a feature rather than a bug if you are already doing JS on the server.

    Their combination of document store with graph and geospatial capabilities is killer for me. Their AQL query language is actually really nice (although COLLECT INTO hurts my head) and overall Arango is fast. On top of all that, the developers have been really open to suggestions and are active on StackOverflow.

    If there is a downside, I have yet to see it.

    I do see that other DBs (like Titan) are focusing on just graph and using things like Mongo or Cassandra as a backend. Arango seems to be doing it all. Is that better? Worse? I aspire to scale to a point where I care about the consequences of that.

    As for OrientDB and Neo4J, both are built on the JVM, which I am really trying to avoid. I think if I was going to learn to love the JVM (and its resource usage) I would go with Titan with a Cassandra backend. OrientDB does not seem to do any geo-spatial stuff, and while Neo4J certainly does, I’m not a fan of their licencing.

    Another interesting project to watch is Barak Michener’s Cayley (a graph DB written in Go): https://github.com/google/cayley

    I hope that helps!

    1. Thanks for the quick reply! Yah, this totally helps. We’re currently evaluating Arango vs Neo. My biggest gripe with Neo is the crazy-expensive licensing costs, although for a single-server commercial deployment, the free version is fine. “Startup” is really expensive and “Enterprise” is absurd.

      We’re doing the bulk of our work on the backend, so mostly through the Ruby clients. I’m definitely not interested in Foxx.

      I haven’t heard of Cayley, but I love the concept; being built on top of Mongo is great for us, because we’re already using Mongo. Too bad Google just released it though; it’s way too new for us to use in production.

      How large is the graph that you’re working with? How many vertexes & edges?

      1. My project is still in development and so there is not a meaningful amount of data yet, so that won’t be much of a guide for you. I know they are working on stuff for distributed graph processing (aiming for the next release), and between that and the existing sharding you should be able to handle some pretty big graphs. How big? No idea… but I am going to try and find out. :)

  2. You should try Foxx. It sits on a co-ordinator (which is right at the DB level), so latency between your app and Foxx is pretty much the same as what it’d be writing raw queries in Ruby, and sending that data over the wire. HTTP adds minimal overhead.

    It’s a terse and pretty sweet framework that runs on V8. Don’t think of it as “needing” JavaScript for the rest of your stack; think of it as a microservice that’s designed to do one thing: serve up data to whatever back-end is in front of it. Like stored procedures, on steroids. It has a few other niceties, like AQL template strings, that make working with it really simple. An ideal pair for Ruby or any other server-side stack, IMO.

    They’ve recently released v3.0 and I’m evaluating it with a DC/OS clustered environment, which is where I’m really hoping it will shine. Scaling up to several machines is a few mouse clicks. So far, so good.
