Private communications with Signal

The internet is a global network. You would think that the fact that every message sent over it passes through many legal jurisdictions would make the need for encryption obvious and uncontroversial. Sadly that is not the case.

The need for something more than legal safeguards becomes obvious when you see a request from a Toronto home to a site (whose server is in Toronto!) leave Canadian legal jurisdiction on a path through both New York and Chicago before finally reaching its Toronto destination.

An example of a boomerang route from the project. About 25% of traffic with both a start and end point in Canada is routed this way.

Applications that deliver technical safeguards, like end-to-end encryption, offer that “something more” that protects my data beyond the border. One of those applications is Signal, a project of Open Whisper Systems.

In an offline world, privacy was the default, a product of things you didn’t say or do, and probably also a byproduct of how hard it was to move information around. As things like chatting with friends and family or reading a newspaper all moved online, those activities suddenly involved sending data in plain text over public infrastructure. Privacy became something that existed only for those who found a way to avoid the default of sending plain text. Privacy became a product of action rather than inaction.

Signal and its predecessor projects TextSecure and RedPhone are part of an effort to make privacy the default again by rolling high-end encryption into apps polished for mainstream adoption.


Signal does two main things: sending text messages and making calls. What Signal actually does for secure communications is very different from what it does for insecure ones and is worth understanding.

Text messages

When sending a text message to someone who does not have Signal, the application sends a standard SMS message. The details of what constitutes an SMS message were hashed out in 1988, long before security was a thing; consequently, a related specification notes that “SMS messages are transported without any provisions for privacy or integrity”. Importantly, though, they are transported over the telephone network.

When sending secure text messages, Signal uses your mobile data to send the message using the Textsecure protocol v2.

The distinction between the two is worth making, since coverage for data and for telephone calls can vary, as can the costs, say if you are travelling and have turned off mobile data.

The first time you send or receive a secure message with someone, behind the scenes you exchange cryptographic keys. Everything Signal does is focused on ensuring secure communication with the holder of that key. If the key for one of your contacts changes, it should be considered an event worth a quick phone call. This can happen innocently enough, say if they uninstall and then reinstall the app, but since all the other security measures are built on that key, it’s worth asking about.

After the first text message has been sent or received, Signal uses those keys to generate new keys for each individual message (described in detail here). This ensures that even if one message were decrypted, every other message would still be safe.
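
As a toy illustration of that idea only (this is NOT the actual TextSecure v2 protocol, just the general shape of a hash ratchet), each message key can be derived from a running chain key with a one-way function, so learning one message key tells you nothing about the others:

// toy sketch only, not Signal's real key derivation
const crypto = require('crypto');

function ratchet(chainKey) {
  // derive the next chain key and this message's key with HMAC,
  // using different labels so neither can be computed from the other
  const nextChainKey = crypto.createHmac('sha256', chainKey).update('chain').digest();
  const messageKey = crypto.createHmac('sha256', chainKey).update('message').digest();
  return { nextChainKey, messageKey };
}

let chainKey = crypto.randomBytes(32);    // stand-in for the shared secret
let step1 = ratchet(chainKey);            // key for message 1
let step2 = ratchet(step1.nextChainKey);  // key for message 2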


Calling follows a similar pattern; for insecure calls Signal simply launches your phone’s standard phone app, while it handles encrypted calls itself. And like secure text messages, these calls use your mobile data rather than routing through the telephone network.

Secure calls are placed using the ZRTP protocol, the details of which are hidden from the user with one important exception.
On screen when you make a secure call you will notice two words displayed. These words are part of the ZRTP protocol and are generated from the key that both parties are using to encrypt the call.


Both parties should see the same two words. If you say one and ask your contact to read the other, and they don’t match up, the keys you have agreed upon are not the same. If the keys are not the same, it suggests someone has tampered with the connection information in flight and inserted themselves into your conversation.

Verifying keys

Part of the whole key exchange process that allows users to identify each other involves pulling your contact’s public key from a central key directory server. The use of a central server means that I now have to trust that server not to maliciously give me a key for someone else. Open Whisper Systems’ Trevor Perrin addressed the problem of trusting unauthenticated keys in a talk at the NSEC security conference. It’s just a few minutes, but it’s an interesting insight into the balancing act involved in bringing private communications to the masses:

For the interested or the paranoid, Signal lets you verify a contact’s key by touching your contact’s name/number at the top of your conversation. This brings up the details for that contact, which include a “Verify identity” option. With that, and your own identity details, found under Settings (three vertical dots on the top right of the main screen) > “My Identity Key”, you can either read a key fingerprint aloud or, if you have a QR/barcode scanner, use that to verify your contact’s key.


Open Source

Establishing that there are no secret behaviours or hidden flaws somewhere in the code is critical in a world where we put a significant amount of trust in computers (and where the definition of computer is expanding to encompass everything from voting machines to Volkswagens).

Signal establishes trust by developing the code in the open so that it can be reviewed (like this review of Signal’s predecessor RedPhone by cryptographer Matthew Green). Former Google security researcher Morgan Marquis-Boire has endorsed Signal, as has Edward Snowden.

But even if you believe the experts that Signal works as advertised, it’s common for “free” apps to seem significantly less “free” once you realize what they do to turn a profit. With that in mind, another component of trust is understanding the business model behind the products you use.

When asked about the business model on a popular tech forum, Open Whisper Systems founder Moxie Marlinspike explained “in general, Open Whisper Systems is a project rather than a company, and the project’s objective is not financial profit.”

The project is funded by a combination of grants and donations from the Freedom of the Press Foundation and the Shuttleworth Foundation, among others. It is worked on by a core group of contributors led by Marlinspike and a supporting cast of volunteers.

Final thoughts

Signal does a great job of making encrypting your communications a non-event. Encrypted as they travel the network, our messages are now secure against tampering and interception, no matter whose servers/routers they pass through. The result: privacy.

The fact that applying security measures results in privacy should tell you that the oft-quoted choice between “security vs privacy” is a false one. As pointed out by Timothy Mitchener-Nissen, assuming a balance between these things only results in sacrificing increments of privacy in pursuit of the unachievable ideal of total security. The ultimate result is reducing privacy to zero. Signal is just one way to grab back one of those increments of privacy.

All of that said, my interest in privacy technologies and encryption is that of a technologist. If you are counting on technologies like Signal to protect you from anyone serious (like a nation-state), the information above is barely a beginning. I would suggest reading these best practices for Tor and the grugq’s article on signals, intelligence. Actually, anything/everything by the grugq.

A quick tour of Arangojs

I’ve been using ArangoDB for a while now, but for most of that time I’ve been using it from Ruby. I’ve dabbled with the Guacamole library and even took a crack at writing my own, but switching to Javascript has led me to get to know Arangojs.

Given that Arangojs is talking to ArangoDB via its HTTP API, basically everything you do is asynchronous. There are a few ways of dealing with async code in Javascript, and Arangojs has been written to support basically all of them.

Arangojs’s flexibility and my inexperience with the new Javascript syntax combined to give me a bit of an awkward start, so with a little learning under my belt I thought I would write up some examples that would have saved me some time.

My most common use case is running an AQL query, so let’s use that as an example. First up, I’ve been saving my config in a separate file:

// arango_config.js
// Using auth your url would look like:
// "http://uname:passwd@"
module.exports = {
  "production" : {
    "databaseName": process.env.PROD_DB_NAME,
    "url": process.env.PROD_DB_HOST,
  },
  "development" : {
    "databaseName": process.env.DEVELOPMENT_DB_NAME,
    "url": process.env.DEVELOPMENT_URL
  },
  "test" : {
    "databaseName": "test",
    "url": "",
  },
}

With that I can connect to one of my existing databases like so:

var config = require('../arangodb_config')[process.env.NODE_ENV]
var db = require('arangojs')(config)

This keeps my test database nicely separated from everything else and all my db credentials in the environment and out of my project code.

Assuming that our test db has a collection called “documents” containing a single document, we can use Arangojs to go get it:

db.query('FOR doc IN documents RETURN doc', function(err, cursor) {
  cursor.all(function(err, result) { console.log(result) })
})

Which returns:

[ { foo: 'bar',
    _id: 'documents/206191358605',
    _rev: '206192931469',
    _key: '206191358605' } ]

While this is perfectly valid Javascript, it’s pretty old-school at this point since ECMAScript 2015 is now standard in both Node.js and any browser worth having. This means we can get rid of the “function” keyword, replace it with the “fat arrow” syntax, and get the same result:

db.query('FOR doc IN documents RETURN doc', (err, cursor) => {
  cursor.all((err, result) => { console.log(result) })
})

So far so good but the callback style (and the callback-hell it brings) is definitely an anti-pattern. The widely cited antidote to this is promises:

db.query('FOR doc IN documents RETURN doc')
  .then((cursor) => { return cursor.all() })
  .then((doc) => { console.log(doc) });

This code is functionally equivalent, but it operates by chaining promises together. While that’s an improvement over callback hell, after writing a bunch of this type of code I ended up feeling like I had just replaced callback hell with promise hell.


The path back to sanity lies in ECMAScript 2016 aka ES7 and the new async/await keywords. Inside a function marked as async, you have access to an await keyword which allows you to write code that looks synchronous but does not block the event loop.

Using the babel transpiler lets us use the new ES7 syntax right now by compiling it all down to ES5/6 equivalents. Installing with npm install -g babel and running your project with babel-node is all that you need to be able to write this:

async () => {
    let cursor = await db.query('FOR doc IN documents RETURN doc')
    let result = await cursor.all()
    console.log(result)
}

Once again we get the same result but without all the extra cruft that we would normally have to write.
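
One thing I’ve glossed over is error handling. With callbacks the error shows up as the err argument; with async/await a plain try/catch does the job. A quick sketch, reusing the query from above:

async () => {
  try {
    let cursor = await db.query('FOR doc IN documents RETURN doc')
    let result = await cursor.all()
    console.log(result)
  } catch (err) {
    // a failed query or connection problem lands here instead of in an err argument
    console.error(err)
  }
}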

One thing that is notably absent in these examples is the use of bound variables in our queries to avoid SQL injection (technically parameter injection since this is NoSQL).

So what does that look like?

async () => {
    let bindvars = {foo: "bar"}
    let cursor = await db.query('FOR doc IN documents FILTER == @foo RETURN doc', bindvars)
    let result = await cursor.all()
    console.log(result)
}

But Arangojs lets you go further, giving you a nice aqlQuery function based on ES6 template strings:

async () => {
    let foo = "bar"
    let aql = aqlQuery`
      FOR doc IN documents
        FILTER == ${foo}
          RETURN doc
    `
    let cursor = await db.query(aql)
    let result = await cursor.all()
    console.log(result)
}

It’s pretty astounding how far that simple example has come. It’s hard to believe that it’s even the same language.
With Javascript (the language and the community) clearly in transition, Arangojs (and likely every other JS library) is compelled to support both the callback style and promises. It’s a bit surprising to see how much leeway that gives me to write some pretty sub-optimal code.

With all the above in mind, suddenly Arangojs’s async heavy API no longer feels intimidating.

The documentation for Arangojs is simple (just a long readme file) but comprehensive and there is lots more it can do. Hopefully this little exploration will help people get started with Arangojs a little more smoothly than I did.

Extracting test data from ArangoDB’s admin interface

Test Driven Development is an important part of my development process and
ArangoDB’s speed, schema-less nature and truncate command make testing really nice.

Testing has ended up being especially important to me when it comes to AQL (Arango Query Language) queries. Just the same way that it’s easy to write a regular expression that matches more than you expect, constraining the traversal algorithm so you get what you want (and only that) can be tricky.

AQL queries that traverse a graph are often (maybe always?) sensitive to the structure of the graph. The direction of the edges (inbound/outbound) or the number of edges to cross (maxDepth) are often used to constrain a traversal. Both of these are examples of how details of your graph’s structure get built into your AQL queries. When the structure isn’t what you think, you can end up with some pretty surprising results coming back from your queries.

All of that underscores the need to test against data that you know has a specific structure. A few times now I have found myself with a bunch of existing graph data, wondering how to pick out a selected subgraph to test my AQL queries against.

ArangoDB’s web interface gets me tantalizingly close, letting me filter down to a single starting node and clicking with the spot tool to reveal its neighbors.

Filtering for a specific vertex in the graph.

In a few clicks I can get exactly the vertices and edges that I need to test my queries, and because I have seen it, I know the structure is correct, and has only what I need. All that is missing is a way to save what I see.

Since this seems to keep coming up for me, I’ve solved this for myself with a little hackery that I’ve turned to several times now. The first step is turning on Firefox’s dump function by entering about:config in the URL bar and searching the settings for “dump”.

Firefox can dump to the terminal with browser.dom.window.dump.enabled

The dump function allows you to write to the terminal from javascript. Once that is set to true, launching Firefox from the terminal, and typing dump("foo") in the Javascript console should print “foo” in the controlling terminal.


Next, since the graph viewer uses D3 to for its visualization, we can dig into the DOM and print out the bits we need using dump. Pasting the following into the Javascript console will print out the edges:

var edges = document.querySelector('#graphViewerSVG').childNodes[0].childNodes[0].children; for(var i = 0; i < edges.length; i++) { dump("\r\n" + JSON.stringify(edges[i].__data__._data) + "\r\n"); }

And then this will print out the vertices:

var vertices = document.querySelector('#graphViewerSVG').childNodes[0].childNodes[1].children; for(var i = 0; i < vertices.length; i++) { dump("\r\n" + JSON.stringify(vertices[i].__data__._data) + "\r\n"); }

With the vertices and edges now printed to the terminal, a little copy/paste action and you can import the data into your test database before running your tests with arangojs’s import function.

[
  {foo: "bar"},
  {fizz: "buzz"}
]

Alternately you can upload JSON files into the collection via the web interface as well.

Importing JSON into a collection.

While this process has no claim on elegance, it’s been very useful for testing my AQL queries and saved me a lot of hassle.

Hello GraphQL

One of the most interesting projects to me lately has been Facebook’s GraphQL. Announced at React.conf in January, those of us that were excited by the idea have had to wait, first for the spec to be formalized and then for some actual running code.

I think the GraphQL team is on to something big (it’s positioned as an alternative to REST to give a sense of how big), and I’ve been meaning to dig in to it for a while, but it was never clear where to start. So after a little head-scratching and a little RTFM, I want to share a GraphQL hello world.

So what does that look like? Well, Facebook has released two projects: graphql-js and express-graphql. Graphql-js is the reference implementation of what is described in the spec. express-graphql is a middleware component for the express framework that lets you use graphql.

So express is going to be our starting point. First we need to create a new project using the generator:

mike@longshot:~☺  express --git -e gql_hello

create : gql_hello
create : gql_hello/package.json
create : gql_hello/app.js
create : gql_hello/.gitignore
create : gql_hello/public
create : gql_hello/routes
create : gql_hello/routes/index.js
create : gql_hello/routes/users.js
create : gql_hello/views
create : gql_hello/views/index.ejs
create : gql_hello/views/error.ejs
create : gql_hello/bin
create : gql_hello/bin/www
create : gql_hello/public/javascripts
create : gql_hello/public/images
create : gql_hello/public/stylesheets
create : gql_hello/public/stylesheets/style.css

install dependencies:
$ cd gql_hello && npm install

run the app:
$ DEBUG=gql_hello:* npm start

Let’s do as we are told and run cd gql_hello && npm install.
When that’s done we can get to the interesting stuff.
Next up is installing graphql and the middleware using the --save option so that our app’s dependencies in our package.json will be updated:

mike@longshot:~/gql_hello☺  npm install --save express-graphql graphql babel
npm WARN install Couldn't install optional dependency: Unsupported
npm WARN prefer global babel@5.8.23 should be installed with -g

I took the basic app.js file that was generated and just added the following:

app.use('/', routes);
app.use('/users', users);

// GraphQL:
var graphqlHTTP = require('express-graphql');

import {
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString
} from 'graphql';

var schema = new GraphQLSchema({
  query: new GraphQLObjectType({
    name: 'RootQueryType',
    fields: {
      hello: {
        type: GraphQLString,
        resolve() {
          return 'world';
        }
      }
    }
  })
});

//Mount the middleware on the /graphql endpoint:
app.use('/graphql', graphqlHTTP({ schema: schema , pretty: true}));
//That's it!

// catch 404 and forward to error handler
app.use(function(req, res, next) {
  var err = new Error('Not Found');
  err.status = 404;
  next(err);
});

Notice that we are passing our GraphQL schema to graphqlHTTP as well as pretty: true so that responses from the server will be pretty printed.

One other thing is that since those GraphQL libraries make extensive use of ECMAScript 6 syntax, we will need to use the Babel Transpiler to actually be able to run this thing.

If you installed Babel with npm install -g babel you can add the following to your package.json scripts section:

    "start": "babel-node ./bin/www"

Because I didn’t install it globally, I’ll just point to it in the node_modules folder:

    "start": "node_modules/babel/bin/babel-node.js ./bin/www"

With that done we can use npm start to start the app and try things out using curl:

mike@longshot:~☺  curl localhost:3000/graphql?query=%7Bhello%7D
{
  "data": {
    "hello": "world"
  }
}

Looking back at the schema we defined, we can see that our request {hello} (or %7Bhello%7D once it’s been URL encoded) caused the resolve function to be called, which returned the string “world”.

  name: 'RootQueryType',
  fields: {
    hello: {
      type: GraphQLString,
      resolve() {
        return 'world';
      }
    }
  }

This explains what they mean when you hear that GraphQL “exposes fields that are backed by arbitrary code”. What drew me to GraphQL is that it seems to be a great solution for people with graph database backed applications, but it’s only now that I realize that GraphQL is much more flexible. That string could have just as easily pulled something out of a relational database or calculated something useful. In fact this might be the only time “arbitrary code execution” is something to be happy about.
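
To see what that flexibility looks like in practice, here is a hedged sketch (my own addition, not part of the hello world above) of a field that takes an argument and computes its result with plain code:

var greetingSchema = new GraphQLSchema({
  query: new GraphQLObjectType({
    name: 'RootQueryType',
    fields: {
      greeting: {
        type: GraphQLString,
        // arguments are declared alongside the field...
        args: { name: { type: GraphQLString } },
        // ...and handed to resolve, which can run whatever code it likes
        resolve(root, args) {
          return 'Hello ' + (args.name || 'world') + '!';
        }
      }
    }
  })
});

A request for {greeting(name: "Mike")} would then come back with "Hello Mike!", without a database in sight.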

I’m super excited to explore this further and to start using it with ArangoDB. If you want to dig deeper I suggest you check out Exploring GraphQL and Creating a GraphQL server and of course read the spec.

Why graphs? Why now?

Buyer behavior analysis, protein-protein interactions, the human brain, fraud detection, financial analysis; if you sketch any of these out on a whiteboard, you will most likely end up with a series of circles connected by lines.

This simple representation we all intuitively use to map out the relationships between things. Though simple, under the name “graph” or “network graph”, it is the subject of study for an entire branch of mathematics (graph theory), and the burgeoning field of Social Network Analysis (SNA).

SNA is the study of the network of relationships among a set of things rather than the things themselves. This type of analysis is increasingly common in academia across a huge number of fields. Google Scholar gives a fairly clear indication that the term is increasingly common among the academic databases it crawls.

The growth of Social Network Analysis.

The technique surfaces in many domains, used to identify which actors within a given network are “interesting”. In a biology context “interesting” actors might be the genes that are interacting the most with other genes given a certain stimulus.

Transcriptome-Based Network Analysis Reveals a Spectrum Model of Human Macrophage Activation – Xue, Jia et al.

In the context of epidemiology, if the actors are fish farms, the movement of fish between them forms a network which can be analysed. Aquaculture sites that have the highest number of incoming and outgoing connections become “interesting” since the movements mean a high vulnerability to infection and likelihood to spread disease.

Image from the paper Application of network analysis to farmed salmonid movement data from Scotland

An “interesting” financial institution might be one whose financial ties with other institutions indicate that its failure might have a domino effect.

Figure from DebtRank: Too Central to Fail? Financial Networks, the FED and Systemic Risk

While Social Network analysis has been steadily growing, shifts in industry are underway that promise to make this type of analysis more common outside academia.

In 2001, with computers spreading everywhere, e-commerce heating up and internet usage at around 500 million users, Doug Laney, an eagle-eyed analyst at MetaGroup (now Gartner), noticed a trend: data was changing. He described how data was increasing along three axes: volume, velocity and variety, which eventually became known as “the 3 V’s of Big Data”.

An amazing visualization of the 3 V’s.

These changing characteristics of data have touched off what is often called a “Cambrian explosion” of non-relational databases that offer the speed, flexibility and horizontal scalability needed to accommodate it. These databases are collectively known as NoSQL databases.

A non-exhaustive but representative timeline of NoSQL databases. From the first database (a graph database) till today. Can you see the explosion? Explore the timeline yourself.

Since the launch of Google’s Bigtable in 2005, more than 28 NoSQL databases have been launched. The majority fall into one of three main sub-categories: key/value store, document store or graph database.

Anywhere a graph database is used is an obvious place to use SNA, but the ability to build a graph atop either key/value stores or document databases means that the majority of NoSQL databases are amenable to being analysed with SNA.

There are also many relational databases struggling to contain and query graphy datasets that developers have dutifully pounded into the shape of a table. As graph databases gain more traction, we will likely see some of these converted in their entirety to graphs using tools like R2G, wherever developers end up struggling with recursion, an explosion of join tables, or something like routing.

In addition to the steady pressure of the 3 V’s and the growth of SNA as an analytical, and even predictive, tool, there are many companies (Runkeeper, Yahoo, LinkedIn) whose data model is a graph. Facebook and Netflix both fall into this category and have each released tools, both pitched as alternatives to the REST architectural style most web applications are based on, to make building graph-backed applications easier.

Circling back to the original question of “why graphs?”, hopefully the answer is clearer. For anyone with an interest in data analysis, paying attention to this space gives access to powerful tools and a growing number of opportunities to apply them. For developers, understanding graphs allows better data modelling and architectural decisions.

Beyond the skills needed to design and tend to these new databases and make sense of their contents, knowledge of graphs will also increasingly be required to make sense of the world around us.

Understanding why you got turned down for a loan will require understanding a graph; so will understanding why you are/aren’t fat, and eventually who gets insurance and at what price.

Facebook patents technology to help lenders discriminate against borrowers based on social connections

Proper security increasingly requires thinking in graphs, as even the most “boring” of us can be a useful stepping stone on the way to compromising someone “interesting”; perhaps a client, an acquaintance, an employer, or a user of something we create.

Graphs will be used to find and eliminate key terrorists, map out criminal networks, and route you to that new vegan restaurant across town.

With talk of storage capacities actually surpassing Moore’s law, SNA growing nearly linearly, NoSQL growing, interest in graphs on the way up, and application development tools finally appearing, the answer to “why now?” is that this is only the beginning. We are all connected, and understanding how is the future.

Running a Rails app with Systemd and liking it

Systemd has, over the strident objections of many strident people, become the default init system for a surprising number of linux distributions. Though I’ve been aware of the drama, the eye-rolling, the uh.. enterprising nature of systemd, I really have only just started using it myself. All the wailing and gnashing of teeth surrounding it left me unsure what to expect.

Recently I needed to get a proof-of-concept app I built running so a client could use it on their internal network to evaluate it. Getting my Rails app to start on boot was pretty straightforward, and I’m going to be using this again, so I thought I would document it here.

First I created a “rails” user and group, and in /home/rails I installed my usual Rbenv setup. The fact that only root is allowed to listen on ports below 1024 conflicts with my plan to run my app as the “rails” user and listen on port 80. The solution is setcap:

setcap 'cap_net_bind_service=+ep' .rbenv/versions/2.2.2/bin/bundle

With that capability added, I set up my systemd unit file in /usr/lib/systemd/system/myapp.service and added the following:


ExecStart=/usr/bin/bash -lc 'bundle exec rails server -e production --bind --port 80'


The secret sauce that makes this work with rbenv is the “bash -l” in the ExecStart section. This means that bash will execute as though it were a login shell, so the .bashrc file with all the PATH exports and rbenv init stuff will be sourced before the command I give it is run. In other words, exactly what happens normally.
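
For context, only the ExecStart line survives above; a minimal sketch of what the full unit file might look like is below. The description, user, working directory and bind address are my assumptions, not what the original file contained.

[Unit]
Description=My Rails proof-of-concept app
After=network.target

[Service]
# assumed values: adjust the user, path and bind address to match your setup
User=rails
WorkingDirectory=/home/rails/myapp
ExecStart=/usr/bin/bash -lc 'bundle exec rails server -e production --bind --port 80'
Restart=on-failure

[Install]
WantedBy=multi-user.target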

From there, I just start the service like all the rest of them:

systemctl enable myapp.service
systemctl start myapp.service

This Just Works™ and got the job done, but in the process I find I am really starting to appreciate Systemd. Running daemons is complicated, and with the dropping of privileges, ordering, isolation and security options, there is a lot to get right… or wrong.

What I am liking about Systemd is that it is taking the same functions that Docker is built on, namely cgroups and namespacing, and giving you a declarative way of using them while starting your process. Doing so puts some really nice (and otherwise complicated) security features within reach of anyone willing to read a man page.

PrivateTmp=yes is a great example of this. Simply adding that to the unit file above (which you should if you use /tmp in your app) closes off a bunch of security problems because systemd “sets up a new file system namespace for the executed processes and mounts private /tmp and /var/tmp directories inside it that is not shared by processes outside of the namespace”.

Could I get the same effect as PrivateTmp=yes with unshare? With some fiddling, but Systemd makes it a zero cost option.

There is also ProtectSystem=full to mount /boot, /usr and /etc as read-only, which “ensures that any modification of the vendor supplied operating system (and optionally its configuration) is prohibited for the service”. Systemd can even handle running setcap for me, resulting in beautiful stuff like this, and there is a lot more in man systemd.exec besides.

For me I think one of the things that has become clear over the last few years is that removing “footguns” from our software is really important. All the work that is going into making the tools (like rm -rf) and languages (Rust!) we use less spectacularly dangerous is critical to raising the security bar across the industry.

The more I learn about Systemd the more it seems to be a much needed part of that.

Data modeling with ArangoDB

Since 2009 there has been a “Cambrian Explosion” of NoSQL databases, but information on data modeling with these new data stores feels hard to come by.
My weapon of choice for over a year now has been ArangoDB. While ArangoDB is pretty conscientious about having good documentation, there has been something missing for me: criteria for making modeling decisions.

Like most (all?) graph databases, ArangoDB allows you to model your data with a property graph. The building blocks of a property graph are attributes, vertices and edges. What makes data modelling with ArangoDB (and any other graph database) difficult is deciding between them.

To start with we need a little terminology. Since a blog is a well known thing, we can use a post with some comments and some tags as our test data to illustrate the idea.

Sparse vs Compact

Modeling our blog post as a “sparse” graph might look something like this:


At first glance it looks satisfyingly graphy: in the centre we see a green “Data Modeling” vertex which has an edge going to another vertex, “post”, indicating that “Data Modeling” is a post. Commenters, tags and comments all have connections to a vertex representing their type as well.

Looking at the data you can see we are storing lots of edges and most vertices contain only a single attribute (apart from the internal attributes ArangoDB creates: _id, _key, _rev).

[{"_id":"vertices/26589395587","_key":"26589395587","_rev":"26589395587","title":"Data modeling","text":"lorum ipsum...","date":"1436537253903"},
{"_id":"vertices/26589723267","_key":"26589723267","_rev":"26589723267","name":"Mike Williamson"},
{"_id":"vertices/26589985411","_key":"26589985411","_rev":"26589985411","name":"Mike's Mum","email":""},
{"_id":"vertices/26590247555","_key":"26590247555","_rev":"26590247555","title":"That's great honey","text":"Love you!"},
{"_id":"vertices/26590509699","_key":"26590509699","_rev":"26590509699","name":"Spammy McSpamerson","email":""},
{"_id":"vertices/26590640771","_key":"26590640771","_rev":"26590640771","title":"Brilliant","text":"Gucci handbags..."},


A “compact” graph on the other hand might look something like this:

{
  title:  "Data modelling",
  text: "lorum ipsum...",
  author: "Mike Williamson",
  date:   "2015-11-19",
  comments: [
    {
      author: "Mike's Mum",
      text: "That's great honey"
    },
    {
      "author": "",
      "title": "Brilliant",
      "text": "Gucci handbags..."
    }
  ]
}

Here we have taken exactly the same data and collapsed it together into a single document. While it’s a bit of a stretch to even classify this as a graph, ArangoDB’s multi-model nature largely erases the boundary between a document and a graph with a single vertex.

The two extremes above give us some tools for talking about our graph. It’s the same data either way, but clearly different choices are being made. In the sparse graph, every vertex you see could have been an attribute, but was consciously moved into a vertex of its own. The compact graph is what comes out of repeatedly choosing to add new data as an attribute rather than as a vertex.

When modeling real data your decisions don’t always fall on one side or the other. So what criteria should we be using to make those decisions?

Compact by default

As a baseline you should default to a compact graph. Generally data that is displayed together should be combined into a single document.

Defaulting to compact will mean fewer edges will exist in the graph as a whole. Since each traversal across a graph will have to find, evaluate and then traverse the edges for each vertex it encounters, keeping the number of edges to a strict minimum will ensure traversals stay fast as your graph grows.
Compact graphs will also mean fewer queries and traversals to get the information you need.

But not everything belongs together. Any attribute that contains a complex data structure (like the “comments” array or the “tags” array) deserves a little scrutiny as it might make sense as a vertex (or vertices) of its own.

Looking at our compact graph above, the array of comments, the array of tags, and maybe even the author might be better off as vertices rather than leaving them as attributes. How do we decide?

  • If you need to point an edge at it, it will need to be a vertex.
  • If it will be accessed on its own (ie: showing comments without the post), it will need to be a vertex.
  • If you are going to use certain graph measurements (like centrality) it will need to be a vertex.
  • If it’s not a value object (the values can change but the object remains the same), it will need to be a vertex.

Removing duplicate data can also be a reason, but with the cost of storage low (and dropping), it’s a weak reason.

Edge Direction

Once you promote a piece of data to being a vertex (or “reify” it) your next decision is which way the edge connecting it to another vertex should go. Edge direction is a powerful way to put up boundaries to contain your traversals, but while the boundary is important the actual directions are not. Whatever you choose, it just needs to be consistent. One edge going the wrong direction is going to have your traversal returning things you don’t expect.
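
To make that concrete, here is a hedged sketch (the database, edge collection and document names are hypothetical, and it uses the ArangoDB 3.x traversal syntax) of a traversal that only follows edges outward from a post; an edge accidentally stored in the other direction would silently drop that comment from the result:

var db = require('arangojs')({ databaseName: 'blog' }) // hypothetical database

// follow "has_comment" edges one step OUTBOUND from the post vertex
db.query(
  'FOR v IN 1..1 OUTBOUND @post has_comment RETURN v',
  { post: 'posts/data-modeling' }
).then(function(cursor) { return cursor.all() })
 .then(function(comments) { console.log(comments) })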

And another thing…

This post is the post I kept hoping to find as I worked on modeling my data with ArangoDB. It’s not complete; data modeling is a big topic and there is lots more depth to ArangoDB to explore (i.e. I haven’t yet tried splitting my edges amongst multiple edge collections), but these are some of the guidelines I was hoping for when I was starting.

I would love to learn more about the criteria people are using to make those tough calls between attribute and vertex, and all those other hard modeling decisions.

If you have thoughts on this let me know!

When to use a graph database

There are a lot of “intro to graph database” tutorials on the internet. While the “how” part of using a graph database has its place, I don’t know if enough has been said about “when”.

The answer to “when” depends on the properties of the data you are working with. In broad strokes, you should probably keep a graph database in mind if you are dealing with a significant amount of any of the following:

  • Hierarchical data

  • Connected data

  • Semi-structured data

  • Data with Polymorphic associations

Each of these data types either requires some number of extra tables or columns (or both) to deal with under the relational model. The cost of these extra tables and columns is an increase in complexity.

Terms like “connected data” or “semi-structured data” get used a lot in the NoSQL world, but the definitions, where you can find one at all, have a “you’ll know it when you see it” flavour to them. “You’ll know it when you see it” can be distinctly unsatisfying when examples are hard to come by as well. Let’s take a look at these one by one and get a sense of what they mean generally and how to recognize them in existing relational-database-backed projects.

Hierarchical Data

Hierarchies show up everywhere. There are geographical hierarchies where a country has many provinces, which have many cities which have many towns. There is also the taxonomic rank, indicating the level of a taxon in the Taxonomic Hierarchy, organizational hierarchies, the North American Industry Classification system… the list goes on and on.

What it looks like in a relational database.

Usually it’s easy to tell if you are dealing with this type of data. In an existing database schema you may see tables with a parent_id column indicating the use of the Adjacency List pattern, or left/right columns indicating the use of the Nested Sets pattern. There are many others as well.

Connected Data

Connected data is roughly synonymous with graph data. When we are talking about graph data we mean bits of data, plus information about how those bits of data are related.

What it looks like in a relational database.

Join tables are the most obvious sign that you are storing graph data. Join tables exist solely to act as the target of two one-to-many relationships, each row representing a relationship between two rows in other tables. Cases where you are storing data in the join table (such as the Rails has_many :through relationship) are even more clear; you are definitely storing connected data.

While one-to-many relationships also technically describe a graph, they probably are not going to make you reconsider the use of a relational database the way large numbers of many-to-many relationships might.

Semi-structured Data

Attempts to define semi-structured data seem to focus on variability; just because one piece of data has a particular set of attributes does not mean that the rest do. You can actually get an example of semi-structured data by mashing together two sets of structured (tabular) data. In this world of APIs and SOA where drawing data from multiple sources is pretty much the new normal, semi-structured data is increasingly common.

What it looks like in a relational database.

Anywhere you have columns with lots of null values in them. The columns provide the structure, but long stretches of null values suggest that this data does not really fit that structure.

An example of semi-structured data: a hypothetical products table combining books (structured data) and music (also structured data).
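
Since the table itself isn’t reproduced here, a quick sketch of the same idea (the attributes are invented): a book and an album stored side by side, each carrying attributes the other has no use for, which in a single products table means columns full of nulls.

// two "products" with only partially overlapping attributes
var products = [
  { sku: "B-1", type: "book", title: "Some Title", author: "Some Author", isbn: "978-0-00-000000-0", pages: 320 },
  { sku: "M-1", type: "album", title: "Some Album", artist: "Some Artist", tracks: 12, runtime_minutes: 47 }
]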

Polymorphic associations

When one type of data has an association that might relate it to one of two or more different things, that’s what’s known as a polymorphic association. As an example, think of a photo, which might be related to a user or to a product.

What it looks like in a relational database.

While polymorphic relations can be done in a relational database, most commonly they are handled at the framework level, where the framework uses a foreign key and an additional “type” column to determine the correct table/row. Seeing both a something_id and a something_type column in the same table is a hint that a polymorphic relationship is being used. Both Ruby on Rails and Java’s Spring Framework offer this.

So when?

These types of data are known to be an awkward fit for the relational model, in the same way that storing large quantities of perfectly tabular data would be awkward under the graph model. These are ultimately threshold problems, like the famous paradox of the heap.

1000000 grains of sand is a heap of sand

A heap of sand minus one grain is still a heap.

Your first join table or set of polymorphic relations will leave you with a perfectly reasonable database design, but just as “a heap of sand minus one grain” will eventually cross some ill-defined threshold and produce something that is no longer a heap of sand, there is some number of join tables or other workarounds for the relational model that will leave you with a database that is significantly more complex than a graph database equivalent would be.

Knowing about the limits of the relational model, and doing some hard thinking about how much time you are spending pressed up against those limits, is really the only thing that can guide your decision making.

Querying the Openstreetmap Dataset

While much has been written about putting data into OpenStreetMap (OSM), it doesn’t feel like much has been said about getting data out. For those familiar with GIS software, grabbing a “metro extract” is a reasonable place to start, but for developers or regular users it’s not quite as clear how to get at the data we can see is in there.

The first way to get at the data is with the Overpass API. Overpass was started by Roland Olbricht in 2008 as a way to ask for some specified subset of the OSM data.

Let’s say I was curious about the number of bike racks that could hold 8 bikes in downtown Ottawa. The first thing to know is that OSM data is XML, which means that each element (node/way/area/relation) looks something like this:

  <node id="3046036633" lat="45.4168480" lon="-75.7016922">
    <tag k="access" v="public"/>
    <tag k="amenity" v="bicycle_parking"/>
    <tag k="bicycle_parking" v="rack"/>
    <tag k="capacity" v="8"/>
  </node>

Basically any XML element may be associated with a bunch of tags, each containing a key and a value.

You specify which elements of the OSM dataset are interesting to you by creating an Overpass query in XML format or using a query language called Overpass QL. You can use either one, but I’m using XML here.

Here is a query asking for all the elements of type “node” that have both a tag with a key of “amenity” and a value of “bicycle_parking”, as well as a tag with a key of “capacity” and a value of “8”. You can also see my query includes a bbox-query element with coordinates for North, East, South, and West supplied; the two corners of a bounding box, so the search will be limited to that geographic area.

<osm-script output="json">
  <query type="node">
    <has-kv k="amenity" v="bicycle_parking"/>
    <has-kv k="capacity" v="8"/>
    <bbox-query e="-75.69105863571167" n="45.42274779392456" s="45.415714100972636" w="-75.70568203926086"/>
  </query>
  <print/>
</osm-script>

I’ve saved that query into a file named “query” and I am using cat to read the file and pass the text to curl which sends the query.

mike@longshot:~/osm☺  cat query | curl -X POST -d @-
{
  "version": 0.6,
  "generator": "Overpass API",
  "osm3s": {
    "timestamp_osm_base": "2014-08-27T18:47:01Z",
    "copyright": "The data included in this document is from The data is made available under ODbL."
  },
  "elements": [
    {
      "type": "node",
      "id": 3046036633,
      "lat": 45.4168480,
      "lon": -75.7016922,
      "tags": {
        "access": "public",
        "amenity": "bicycle_parking",
        "bicycle_parking": "rack",
        "capacity": "8"
      }
    },
    {
      "type": "node",
      "id": 3046036634,
      "lat": 45.4168354,
      "lon": -75.7017258,
      "tags": {
        "access": "public",
        "amenity": "bicycle_parking",
        "capacity": "8",
        "covered": "no"
      }
    },
    {
      "type": "node",
      "id": 3046036636,
      "lat": 45.4168223,
      "lon": -75.7017618,
      "tags": {
        "access": "public",
        "amenity": "bicycle_parking",
        "bicycle_parking": "rack",
        "capacity": "8"
      }
    }
  ]
}


This is pretty exciting, but it’s worth pointing out that the response is JSON, not the GeoJSON you will probably want for doing things with Leaflet. The author is certainly aware of it and apparently working on it, but in the meantime you will need the npm module osmtogeojson to do the conversion from what Overpass gives to what Leaflet accepts.
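
That conversion step is short; a rough sketch, assuming the Overpass response above has been saved to a file called overpass.json (the file names are just examples):

// convert Overpass API JSON into GeoJSON that Leaflet can consume
var fs = require('fs')
var osmtogeojson = require('osmtogeojson')

var overpassData = JSON.parse(fs.readFileSync('overpass.json', 'utf8'))
var geojson = osmtogeojson(overpassData)

fs.writeFileSync('bike_parking.geojson', JSON.stringify(geojson))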

So what might that get you? Well, let’s say you are trying to calculate the total amount of bike parking in downtown Ottawa. With a single API call (this time using Overpass QL, so it’s cut-and-paste friendly), we can tally up the capacity tags:

mike@longshot:~/osm☺  curl -s -g '[out:json];node["amenity"="bicycle_parking"](45.415714100972636,-75.70568203926086,45.42274779392456,-75.69105863571167);out;' | grep capacity | tr -d ',":' | sort | uniq -c
      2     capacity 10
      7     capacity 2
      6     capacity 8

Looks like more bike racks need to be tagged with “capacity”, but it’s a good start on coming up with a total.

Building on the Overpass API is the web-based Overpass Turbo. If you are a regular user trying to get some “how many of X in this area” type questions answered, this is the place to go. It’s also helpful for developers looking to work the kinks out of a query.

Using Overpass Turbo to display my edits in the Ottawa area.

It’s really simple to get started using the wizard, which helps write a query for you. With a little fooling around with the styles you can do some really interesting stuff. As an example, we can colour the bicycle parking according to its capacity so we can see which ones have a capacity tag and which ones don’t. The query ends up looking like this:

<osm-script timeout="25">
  <!-- gather results -->
  <!-- query part for: “amenity=bicycle_parking” -->
  <query type="node">
    <has-kv k="amenity" v="bicycle_parking"/>
    <bbox-query {{bbox}}/>
  </query>
  {{style:
    node[amenity=bicycle_parking]{ fill-opacity: 1; fill-color: grey; color: white;}
    node[capacity=2]{ fill-color: yellow; }
    node[capacity=8]{ fill-color: orange;}
    node[capacity=10]{fill-color: red;}
  }}
  <!-- print results -->
  <print mode="body"/>
  <recurse type="down"/>
  <print mode="skeleton" order="quadtile"/>
</osm-script>

Bike racks with no capacity attribute will be grey. You can see the result here.

While Overpass-turbo might not be as sophisticated as CartoDB, it is really approachable and surprisingly capable. Highlighting certain nodes, picking out the edits of a particular user, there are lots of interesting applications.

Being able to query the OSM data easily opens some interesting possibilities. If you are gathering data for whatever reason, you are going to run into the problems of where to store it, and how to keep it up to date. One way of dealing with both of those is to store your data in OSM.

With all the thinking that has gone into what attributes can be attached  to things like trees, bike racks, and public art, you can store a surprising amount of information in a single point. Once saved into the OSM dataset, you will always know where to find the most current version of your data, and backups are dealt with for you.

This approach  also opens the door to other people helping you keep it up to date. Asking for volunteers or running hackathons to help you update your data is pretty reasonable when it also means improving a valuable public resource, instead of just enriching the owner alone. Once the data is in OSM, the maintenance burden is easy to distribute.

When its time to revisit your question, fresh data will only ever be an Overpass query away…

Something to think about.

ArangoDB’s geo-spatial functions

I’ve been playing with ArangoDB a lot lately. As a document database it looks to be a drop-in replacement for MongoDB, but it goes further, allowing graph traversals and geo-spatial queries.

Since I have a geo-referenced data set in mind I wanted to get to know its geo-spatial functions. I found the documentation kind of unclear, so I thought I would write up my exploration here.

At the moment there are only two geo-spatial functions in Arango: WITHIN and NEAR. Let’s make some test data using the Arango shell. Run arangosh and then the following:

db._create('cities'){name: 'Ottawa', lat: 45.4215296, lng: -75.69719309999999}){name: 'Montreal', lat: 45.5086699, lng: -73.55399249999999}){name: 'São Paulo', lat: -23.5505199, lng: -46.63330939999999})

We will also need a geo-index for the functions to work. You can create one by passing in the name(s) of the fields that hold the latitude and longitude. In our case I just called them lat and lng so:

db.cities.ensureGeoIndex('lat', 'lng')

Alternately I could have done:

{name: 'Ottawa', location: [45.4215296, -75.69719309999999]})

As long as the values are of type double life is good. If you have some documents in the collection that don’t have the key(s) you specified for the index it will just ignore them.

First up is the WITHIN function. It’s pretty much what you might expect: you give it a lat/lng and a radius and it gives you the records within the area you specified. What is a little unexpected is that the radius is given in meters. So I am going to ask for the documents that are closest to the lat/lng of my favourite coffee shop (45.42890720357919, -75.68796873092651). To make the results more interesting I’ll ask for a 170000 meter radius (I know that Montreal is about 170 kilometers from Ottawa), so I should see both cities in the result set:

arangosh [_system]> db._createStatement({query: 'FOR city in WITHIN(cities, 45.42890720357919, -75.68796873092651, 170000) RETURN city'}).execute().toArray()
[
  {
    "_id" : "cities/393503132620",
    "_rev" : "393503132620",
    "_key" : "393503132620",
    "lat" : 45.4215296,
    "lng" : -75.69719309999999,
    "name" : "Ottawa"
  },
  {
    "_id" : "cities/393504967628",
    "_rev" : "393504967628",
    "_key" : "393504967628",
    "lat" : 45.5086699,
    "lng" : -73.55399249999999,
    "name" : "Montreal"
  }
]


There is also an optional “distancename” parameter which, when given, prompts Arango to add to each document the number of meters it is from your target point. We can use that like this:

arangosh [_system]> db._createStatement({query: 'FOR city in WITHIN(cities, 45.42890720357919, -75.68796873092651, 170000, "distance_from_artissimo_cafe") RETURN city'}).execute().toArray()
[
  {
    "_id" : "cities/393503132620",
    "_rev" : "393503132620",
    "_key" : "393503132620",
    "distance_from_artissimo_cafe" : 1091.4226157106734,
    "lat" : 45.4215296,
    "lng" : -75.69719309999999,
    "name" : "Ottawa"
  },
  {
    "_id" : "cities/393504967628",
    "_rev" : "393504967628",
    "_key" : "393504967628",
    "distance_from_artissimo_cafe" : 166640.3086328647,
    "lat" : 45.5086699,
    "lng" : -73.55399249999999,
    "name" : "Montreal"
  }
]

Arango’s NEAR function returns a set of documents ordered by their distance in meters from the lat/lng you provide. The number of documents in the set is controlled by the optional “limit” argument (which defaults to 100), and it accepts the same “distancename” as above. I am going to limit the result set to 3 (I only have 3 records in there anyway), and use my coffeeshop again:

arangosh [_system]> db._createStatement({query: 'FOR city in NEAR(cities, 45.42890720357919, -75.68796873092651, 3, "distance_from_artissimo_cafe") RETURN city'}).execute().toArray()
[
  {
    "_id" : "cities/393503132620",
    "_rev" : "393503132620",
    "_key" : "393503132620",
    "distance_from_artissimo_cafe" : 1091.4226157106734,
    "lat" : 45.4215296,
    "lng" : -75.69719309999999,
    "name" : "Ottawa"
  },
  {
    "_id" : "cities/393504967628",
    "_rev" : "393504967628",
    "_key" : "393504967628",
    "distance_from_artissimo_cafe" : 166640.3086328647,
    "lat" : 45.5086699,
    "lng" : -73.55399249999999,
    "name" : "Montreal"
  },
  {
    "_id" : "cities/393506343884",
    "_rev" : "393506343884",
    "_key" : "393506343884",
    "distance_from_artissimo_cafe" : 8214463.292795454,
    "lat" : -23.5505199,
    "lng" : -46.63330939999999,
    "name" : "São Paulo"
  }
]

As you can see ArangoDB’s geo-spatial functionality is sparse but certainly enough to do some interesting things. Being able to act as a graph database AND do geo-spatial queries places Arango in a really interesting position and I am hoping to see its capabilities in both those areas expand. I’ve sent a feature request for WITHIN_BOUNDS, which I think would make working with leaflet.js or Google maps really nice, since it would save me doing a bunch of calculations with the map centre and the current zoom level to figure out a radius in meters for my query. I’ll keep my fingers crossed…
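
In the meantime, those calculations aren’t too painful. A rough sketch (the coordinates are just examples): given the map centre and one corner of the visible bounds, the haversine formula gives a radius in meters that covers the view, which can then be passed to WITHIN or NEAR.

// approximate the distance in meters between two lat/lng points
function haversineMeters(lat1, lng1, lat2, lng2) {
  var R = 6371000; // mean Earth radius in meters
  var toRad = function(deg) { return deg * Math.PI / 180; };
  var dLat = toRad(lat2 - lat1);
  var dLng = toRad(lng2 - lng1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
          Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * R * Math.asin(Math.sqrt(a));
}

// radius from the map centre to the north-east corner of the visible bounds
var radius = haversineMeters(45.4289, -75.6879, 45.4420, -75.6700);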