Graph migrations

One of the things that is not obvious at first glance is how “broad” ArangoDB is. By combining the flexibility of the document model with the joins of the graph model, ArangoDB has become a viable replacement for Postgres or MySQL, which is exactly how I have been using it; for the last two years it’s been my only database.

One of the things that falls out of that type of usage is a need to actually change the structure of your graph. In a graph, structure comes from the edges that connect your vertices. Since both the vertices and edges are just documents, that structure is actually just more data. Changing your graph structure essentially means refactoring your data.

There are definitely patterns that appear in that refactoring, and over the last little while I have been playing with putting the obvious ones into a library called graph_migrations. This is a work in progress, but there are already some useful functions working and they could use some proper documentation.

eagerDelete

One of the first of these is what I have called eagerDelete. If you wanted to delete Bob from the graph below, Charlie and Dave would be orphaned.

Screenshot from 2016-04-06 10-55-54

Deleting Bob with eagerDelete means that Bob is deleted as well as any neighbors whose only neighbor is Bob.

gm = new GraphMigration("test") //use the database named "test"
gm.eagerDelete({name: "Bob"}, "knows_graph")

alice_eve

mergeVertices

Occasionally you will end up with duplicate vertices, which should be merged together. Below you can see we have an extra Charlie vertex.

extra_charlie

gm = new GraphMigration("test")
gm.mergeVertices({name: "CHARLIE"},{name: "Charlie"}, "knows_graph")

merged_charlie

attributeToVertex

One of the other common transformations is needing to make a vertex out of an attribute. This process of “promoting” something to be a vertex is sometimes called reifying. Let’s say Eve and Charlie are programmers.

knows_graph

Let’s add an attribute called job to both Eve and Charlie, identifying them as programmers:

adding_job_attr

But let’s say that we decide that it makes more sense for job: "programmer" to be a vertex on its own (we want to reify it). We can use the attributeToVertex function for that, but because Arango allows us to split our edge collections and it’s good practice to do that, let’s add a new edge collection to our “knows_graph” to store the edges that will be created when we reify this attribute.

adding_works_as

With that we can run attributeToVertex, telling it the attribute(s) to look for, the graph (knows_graph) to search and the collection to save the edges in (works_as).

gm = new GraphMigration("test")
gm.attributeToVertex({job: "programmer"}, "knows_graph", "works_as", {direction: "inbound"})

The result is this:

after_attrTo_vertex

vertexToAttribute

Another common transformation is exactly the reverse of what we just did; folding the job: "programmer" vertex into the vertices connected to it.

gm = new GraphMigration("test")
gm.vertexToAttribute({job: "programmer"}, "knows_graph", {direction: "inbound"})

That code puts us right back to where we started, with Eve and Charlie both having a job: "programmer" attribute.

knows_graph

redirectEdges

There are times when things are just not connected the way you want. Let’s say in our knows_graph we want all the inbound edges pointing at Bob to point instead to Charlie.
knows_graph

We can use redirectEdges to do exactly that.

gm = new GraphMigration("test")
gm.redirectEdges({_id: "persons/bob"}, {_id: "persons/charlie"}, "knows_graph", {direction: "inbound"})

And now Eve and Alice know Charlie.

redirected_edges

Where to go from here

As the name “graph migrations” suggests, the thinking behind this was to create something similar to the Active Record Migrations library from Ruby on Rails, but for graphs.

As more and more of this goes from idea to code and I get a chance to play with it, I’m less sure that a direct copy of Migrations makes sense. Graphs are actually pretty fine-grained data in the end and maybe something more interactive makes sense. It could be that this makes more sense as a Foxx app or perhaps part of Arangojs or ArangoDB’s admin interface. It feels a little too early to tell.

Beyond providing a little documentation, the hope here is to make this a little more visible to people who are thinking along the same lines and might be interested in contributing.

Back up your data, give it a try and tell me what you think.

That stonewalling thing

There is a meme in the current crypto “debate” that makes me cringe whenever I read it: the idea of “stonewalling”. It’s come up in the Apple vs FBI case as Forbes, the LA Times, Jacobin magazine and others all described Apple as “stonewalling” the FBI.

Wired’s recent WhatsApp story mentioned that “WhatsApp is, in practice, stonewalling the federal government” and while Foreign Policy magazine avoided the word, they captured the essence when they described WhatsApp as “a service willing to adopt technological solutions to prevent compliance with many types of court orders”.

All of these articles make it sound like Apple/WhatsApp has the data, but is unwilling to give it to the government.

�
  !��s�����|Ǧ�2}|q�h�J�,�^��=&/
                                    _,e�r%����/D@�1f��"�
                                                                ]�?c�,��y�l?��3�lF�'���ǘ��IA��O�Y�i�����ё�R��`�[�]�H���P�1'��������S����~tF\�������^��f@��<P�g�	!X���6eh�U�rN���=d@܉eQe���B�lk����\ҠcE��
�$�d&���_xor�s�-���l,v���44�E����n�[���1YL�o�ޜ�g�m�����Tx�f	܁�����å+e�LR�E1���ޅx
                                                                                              �a*�Զ\l�ϫ)4&���or�-�4���C���q��|-2[͘7 ��
��0�ǹ����+�5b!�wV����������3\n�꨻�R�,Ĝ�

\F����P�IJ<Ը$�`Q/���D�w��̣���v"|��z�g/I��@!�(�z������]ɹ3}+f1�
                                                                  ju��vw�y~#7�w��K������M\g�.uW�i
                                                                                                    TYc���I@�s�;�/��
                                                                                                                        �����s�c�ݮ���C�
                                                                                                                                         �6~�e

Blobs of encrypted text like the one above are useless to anyone but the holder of the decryption key. Where the company holds the decryption key and refuses to give it up, it seems reasonable to call that “stonewalling”.

Without the decryption key, you may be in possession of such a blob but you can’t meaningfully be described as “having” the data within it. Calls of “stonewalling” in cases like that are either grandstanding or reveal an opinion-disqualifying level of ignorance.

These accusations of stonewalling obscure what I think is the real appeal of encryption and tools such as Tor: it’s not that these technologies prevent compliance, it’s that companies can prevent the collection of certain types of data in the first place.

The authors of a recent paper called “Cryptopolitik and the Darknet” did exactly that when they crawled the darknet for data:

“In order to avoid illegal material, such as media files of child pornography or publications by terrorist organisations, only textual content was harvested automatically. Any other material was either filtered out or immediately discarded.”

Nobody would think to accuse them of stonewalling or adopting “technological solutions to prevent compliance” for finding a way to do their darknet crawl without accumulating a bunch of data that is going to bring with it complicated law enforcement dealings.

When WhatsApp wants to “avoid illegal material” while still remaining in the messaging business, they do it with end-to-end encryption.

 

Why end-to-end? In end-to-end encryption, the end users hold the decryption keys. Companies who decide to keep the keys themselves become a target of every spy agency on the planet and run the risk of becoming the next Gemalto.

That technologies and architectural choices exist which allow you to filter the data you are exposed to, and therefore your level of legal liability/obligation, feels new. Or maybe what’s new is companies’ willingness to actually implement them.

No-one is interested in obstructing investigations, just managing risk and legal exposure. “If you collect it, they will come” is becoming a common phrase among programmers and security people, and for companies who don’t want to end up holding data related to a crime, painting a giant target on themselves, dedicating resources to servicing government requests, or having awkward public relations moments, end-to-end encryption starts to look like good risk management. Doubly so when you are dealing with multiple governments.

In that context, governments pushing back against end-to-end encryption seems to indicate an existing idea that companies are somehow obligated to collect data on behalf of the government, and that using encryption to limit your collection is not OK. This is visible in the issue of the government conscripting companies to do its work, raised by the FBI’s recent use of the 1789 All Writs Act to try to force Apple to build software to hack its own phone.

With many years of enthusiastic support from companies like AT&T it’s easy to see where that idea might have come from. As the American government oscillates between attacking tech companies and asking them to do its job, and authoritarian governments and international customers look on, it’s not hard to see why many tech companies are far less enthusiastic about facilitating government aims. So far “stonewalling” seems to be a deliberately provocative framing for the “we’ve stopped collecting that data, leave us out of this” reality that end-to-end encryption creates.

Seeing that kind of inflammatory rhetoric from the FBI or Congress is one thing, but its widespread use by journalists is very disconcerting.

As cries of “stonewalling” turn to accusations of tech companies thinking they are “above the law” and now draft anti-encryption legislation, it’s probably good to remember that blob of encrypted text. It’s not that these companies are getting in the way of the FBI getting data, they are trying to get themselves out of the way by removing their own access to it.

Of all people, former NSA director Michael Hayden recently observed “America is simply more secure with unbreakable end-to-end encryption”. I never thought I would be hoping more people would listen to him.

Graph traversals in ArangoDB

ArangoDB’s AQL query language was created to offer a unified interface for working with key/value, document and graph data. While AQL has been easy to work with and learn, it wasn’t until the addition of AQL traversals in ArangoDB 2.8 that it really felt like it had achieved its goal.

Adding keywords GRAPH, OUTBOUND, INBOUND and ANY suddenly made iteration using a FOR loop the central idea in the language. This one construct can now be used to iterate over everything: collections, graphs or documents:

//FOR loops for everything
FOR person IN persons //collections
  FOR friend IN OUTBOUND person GRAPH "knows_graph" //graphs
    FOR value IN VALUES(friend, true) //documents
    RETURN DISTINCT value

AQL has always felt more like programming than SQL ever did, but the central role of the FOR loop gives a clarity and simplicity that makes AQL very nice to work with. While this is a great addition to the language, it does, however, mean that there are now 4 different ways to traverse a graph in AQL, and a few things are worth pointing out about the differences between them.

AQL Traversals

There are two variations of the AQL traversal syntax: the named graph and the anonymous graph. The named graph version uses the GRAPH keyword and a string indicating the name of an existing graph. With the anonymous syntax you simply supply the edge collections:

//Passing the name of a named graph
FOR vertex IN OUTBOUND "persons/eve" GRAPH "knows_graph"
//Pass an edge collection to use an anonymous graph
FOR vertex IN OUTBOUND "persons/eve" knows

Both of these will return the same result. The traversal of the named graph uses the vertex and edge collections specified in the graph definition, while the anonymous graph uses the vertex collection names from the _to/_from attributes of each edge to determine the vertex collections.
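
For example, a single edge document from the knows collection looks something like this (the _key and _id values here are made up for illustration); the “persons” vertex collection can be read straight off the _from and _to ids:

// a representative edge document from the "knows" edge collection
var edge = {
  _id: "knows/12345",
  _key: "12345",
  _from: "persons/eve",
  _to: "persons/alice"
}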

If you want access to the edge or the entire path all you need to do is ask:

FOR vertex IN OUTBOUND "persons/eve" knows
FOR vertex, edge IN OUTBOUND "persons/eve" knows
FOR vertex, edge, path IN OUTBOUND "persons/eve" knows

The vertex, edge and path variables can be combined and filtered on to do some complex stuff. The Arango docs show a great example:

FOR v, e, p IN 1..5 OUTBOUND 'circles/A' GRAPH 'traversalGraph'
  FILTER p.edges[0].theTruth == true
  AND p.edges[1].theFalse == false
  FILTER p.vertices[1]._key == "G"
  RETURN p

Notes

Arango can end up doing a lot of work to fill in those FOR v, e, p IN variables. ArangoDB is really fast, so to show the effect these variables can have, I created the most inefficient query I could think of: a directionless traversal across a high-degree vertex with no indexes.

The basic setup looked like this except with 10000 vertices instead of 10. The test was getting from start across the middle vertex to end.

Screenshot from 2016-04-05 10-07-04
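
For reference, here is a rough sketch (not the actual fixture code) of how a supernode like that could be built with arangojs, assuming "vertices" and "edges" collections to match the queries further down:

// build a start -- middle -- end path, then pad the middle vertex
// out to a high degree with filler vertices
async function buildSupernode(db, fillerCount = 10000) {
  const vertices = db.collection('vertices')
  const edges = db.edgeCollection('edges')

  const start = await vertices.save({name: "start"})
  const middle = await vertices.save({name: "middle"})
  const end = await vertices.save({name: "end"})

  await edges.save({}, start._id, middle._id) // start -- middle
  await edges.save({}, middle._id, end._id)   // middle -- end

  for (let i = 0; i < fillerCount; i++) {
    const filler = await vertices.save({name: "filler_" + i})
    await edges.save({}, middle._id, filler._id)
  }
}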

What you can see is that adding those variables comes at a cost, so only declare the ones you actually need.

effects_of_traversal_variables
Traversing a supernode with 10000 incident edges with various traversal methods. N=5. No indexes used.

GRAPH_* functions and TRAVERSAL

ArangoDB also has a series of “Named Operations”, a few of which also do traversals. There is also a super old-school TRAVERSAL function hiding in the “Other” section. What’s interesting is how different their performance can be while still returning the same results.

I tested all of the traversal functions on the same supernode described above. These are the queries:

//AQL traversal
FOR v IN 2 ANY "vertices/1" edges
  FILTER v.name == "end"
    RETURN v

//GRAPH_NEIGHBORS
RETURN GRAPH_NEIGHBORS("db_10000", {_id: "vertices/1"}, {direction: "any", maxDepth:2, includeData: true, neighborExamples: [{name: "end"}]})

//GRAPH_TRAVERSAL
RETURN GRAPH_TRAVERSAL("db_10000", {_id:"vertices/1"}, "any", {maxDepth:2, includeData: true, filterVertices: [{name: "end"}], vertexFilterMethod: ["exclude"]})

//TRAVERSAL
RETURN TRAVERSAL(vertices, edges, {_id: "vertices/1"}, "any", {maxDepth:2, includeData: true, filterVertices: [{name: "end"}], vertexFilterMethod: ["exclude"]})

All of these returned the same vertex, just with varying levels of nesting within various arrays. Removing the nesting did not make a significant difference in the execution time.

traversal_comparison
Traversing a supernode with 10000 incident edges with various traversal methods. N=5.

Notes

While TRAVERSAL and GRAPH_TRAVERSAL were not stellar performers here, they both have a lot to offer in terms of customizability. For ordering, depth-first searches and custom expanders and visitors, this is the place to look. As you explore those options, I’m sure these can get much faster.

Slightly less obvious but still worth pointing out is that where AQL traversals require an id ("vertices/1000" or a document with an _id attribute), GRAPH_* functions just accept an example like {foo: "bar"} (I’ve passed in {_id: "vertices/1"} as the example just to keep things comparable). Being able to find things without needing to know a specific id, or what collection to look in, is very useful. It lets you abstract away document-level concerns like collections and operate on a higher “graph” level, so you can avoid hardcoding collections into your queries.

What it all means

The differences between these (at least superficially) similar traversals are pretty surprising. While some were faster than others, none of the options for tightening the scope of the traversal were used (edge restrictions, indexes, directionality). That tells you there is likely a lot of headroom for performance gains for all of the different methods.

The conceptual clarity that AQL traversals bring to the language as a whole is really nice, but it’s clear there is some optimization work to be done before I go and rewrite all my queries.

Where I have used the new AQL traversal syntax, I’m also going to have to check to make sure there are no unused v,e,p variables hiding in my queries. Where you need to use them, it looks like restricting yourself to v,e is the way to go. Generating those full paths is costly. If you use them, make sure it’s worth it.

Slowing Arango down is surprisingly instructive, but with 3.0 bringing the switch to Velocypack for JSON serialization, new indexes, and more, it looks like it’s going to get harder to do. :)

 

Flash messages for Mapbox GL JS

I’ve been working on an application where I’m using ArangoDB’s WITHIN_RECTANGLE function to pull up documents within the current map bounds. The obvious problem there is that the current map bounds can be very very big.

Dumping the entire contents of your database every time the map moves sounded decidedly sub-optimal to me, so I decided to calculate the area within the requested bounds using Turf.js and send back an error if it’s too big.
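
Here is a minimal sketch of that check, assuming the client sends the bounds as [west, south, east, north]; the 10000 km² limit is an arbitrary number picked for illustration:

import turf from 'turf'

const MAX_AREA_KM2 = 10000

function boundsTooBig(bounds) {
  // bounds: [west, south, east, north]
  let polygon = turf.bboxPolygon(bounds)
  let areaKm2 = turf.area(polygon) / 1e6 // turf.area returns square meters
  return areaKm2 > MAX_AREA_KM2
}

// if (boundsTooBig(bounds)) send back the error instead of querying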

So far so good, but I wanted a nice way to display that error message as a notification right on the map. There are lots of ways to tackle that sort of thing, but given that this seemed very specific to the map, I thought I might take a stab at making it a mapbox-gl.js plugin.

The result is mapbox-gl-flash. Currently you would install it from github:

npm install --save mapbox-gl-flash

I’m using babel so I’ll use the ES2015 syntax and get a map going.

import mapboxgl from 'mapbox-gl'
import Flash from 'mapbox-gl-flash'

//This is mapbox's api token that it uses for its examples
mapboxgl.accessToken = 'pk.eyJ1IjoibWlrZXdpbGxpYW1zb24iLCJhIjoibzRCYUlGSSJ9.QGvlt6Opm5futGhE5i-1kw';
var map = new mapboxgl.Map({
    container: 'map', // container id
    style: 'mapbox://styles/mapbox/streets-v8', //stylesheet location
    center: [-74.50, 40], // starting position
    zoom: 9 // starting zoom
});

// And now set up flash:
map.addControl(new Flash());

This sets up an element on the map that listens for a “mapbox.setflash” event.

Next, the element that is listening has a class of .flash-message, so let’s set up a little basic styling for it:

.flash-message {
  font-family: 'Ubuntu', sans-serif;
  position: relative;
  text-align: center;
  color: #fff;
  margin: 0;
  padding: 0.5em;
  background-color: grey;
}

.flash-message.info {
  background-color: DarkSeaGreen;
}

.flash-message.warn {
  background-color: Khaki;
}

.flash-message.error {
  background-color: LightCoral;
}

With that done, let’s fire a CustomEvent and see what it does.

document.dispatchEvent(new CustomEvent('mapbox.setflash', {detail: {message: "foo"}}))

foo_message

Ruby on Rails has three different kinds of flash messages: info, warn and error. That seems pretty reasonable so I’ve implemented that here as well. We’ve already set up some basic styles for those classes above and we can apply one of those classes by adding another option to our custom event detail object:

document.dispatchEvent(new CustomEvent('mapbox.setflash', {detail: {message: "foo", info: true}}))

document.dispatchEvent(new CustomEvent('mapbox.setflash', {detail: {message: "foo", warn: true}}))

document.dispatchEvent(new CustomEvent('mapbox.setflash', {detail: {message: "foo", error: true}}))

These events add the specified class to the flash message.

flash_message_classes

One final thing that I expect is for the flash message to fade out after a specified number of seconds. This is accomplished by adding a fadeout attribute:


document.dispatchEvent(new CustomEvent('mapbox.setflash', {detail: {message: "foo", fadeout: 3}}))

Lastly you can make the message go away by firing the event again with an empty string.
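
For example:

document.dispatchEvent(new CustomEvent('mapbox.setflash', {detail: {message: ""}}))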

With a little CSS twiddling I was able to get the nice user-friendly notification I had in mind to let people know why there is no more data showing up.

flash-message

I’m pretty happy with how this turned out. Now I have a nice map specific notification that not only works in this project, but is going to be easy to add to future ones too.

Running Gephi on Ubuntu 15.10

A while ago I gave a talk at the Ottawa graph meetup about getting started doing graph data visualizations with Gephi. Ever the optimist, I invited people to install Gephi on their machines and then follow along as I walked through doing various things with the program.

java_install

What trying to get a room of 20 people to install a Java program has taught me is that the installer’s “Java is found everywhere” is not advertising; it’s a warning. I did indeed experience the power of Java, and after about ten minutes of old/broken/multiple Java versions, broken classpaths and Java 7/8 compatibility drama, I gave up and completed the rest of the talk as a demo.

All of this was long forgotten until my wife and I started a little open data project recently and needed to use Gephi to visualize the data. The Gephi install she had attempted the day of the talk was still lingering on her Ubuntu system and so it was time to actually figure out how to get it going.

The instructions for installing Gephi are pretty straight forward:

  1. Update your distribution with the last official JRE 7 or 8 packages.
  2. After the download completes, unzip and untar the file in a directory.
  3. Run it by executing the ./bin/gephi script file.

The difficulty was that after doing that, Gephi would show its splash screen and then hang as the loading bar said “Starting modules…“.

If you have ever downloaded plugins for Gephi, you will have noticed that they have an .nbm extension, which indicates that they, and (if you will pardon the pun) by extension Gephi itself, are built on top of the Netbeans IDE.
So the next question was: does Netbeans itself work?

sudo apt-get install netbeans
netbeans

Wouldn’t you know it, Netbeans also freezes while loading modules.

Installing Oracle’s version of Java was suggested and the place to get that is the Webupd8 Team’s ppa:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer oracle-java8-set-default
# The java version that got installed:
java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

That finally left us with a working version of gephi.

Gephi 0.9.1 running on Ubuntu 15.10

Installing Gephi on Arch Linux was (thankfully) drama-free, but interestingly installs OpenJDK, the very thing that seemed to be causing the problems on Ubuntu:

yaourt -S gephi
java -version
openjdk version "1.8.0_74"
OpenJDK Runtime Environment (build 1.8.0_74-b02)
OpenJDK 64-Bit Server VM (build 25.74-b02, mixed mode)

It’s a mystery to me why Gephi on Ubuntu seems to require Oracle’s Java but on Arch I can run it on OpenJDK.
With a little luck it can remain a mystery.

Using mapbox-gl and webpack together

For those who might have missed it, Mapbox has been doing some very cool work to update the age old slippy-map to brand new world of WebGL. The library they have released to do this is mapbox-gl.

Webpack is a module bundler that reads the imports of your Javascript files and creates a bundled version by walking the dependency graph. Part of its appeal is the fact that it can do “code splitting”; creating bundles for specific pages as well as bundles for code shared across pages (Of course there’s more to it). Pete Hunt gives a great overview of it here.
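
As a rough illustration (this is webpack 1 era, and page1/page2 are hypothetical entry points), a multi-entry config with a commons chunk looks something like this:

// webpack.config.js -- a sketch of code splitting with CommonsChunkPlugin
var webpack = require('webpack');

module.exports = {
  entry: {
    page1: './page1.js',
    page2: './page2.js'
  },
  output: {
    path: __dirname + '/build',
    filename: '[name].bundle.js' // one bundle per page
  },
  plugins: [
    // modules shared between the entry points get pulled out into commons.js
    new webpack.optimize.CommonsChunkPlugin('commons.js')
  ]
};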

So the big question is, what happens when you try to use these two awesome projects together?

ERROR in ./~/mapbox-gl/js/render/shaders.js
Module not found: Error: Cannot resolve module 'fs' in /home/mike/projects/usesthis/node_modules/mapbox-gl/js/render
 @ ./~/mapbox-gl/js/render/shaders.js 3:9-22

ERROR in ./~/mapbox-gl-style-spec/reference/v8.json
Module parse failed: /home/mike/projects/usesthis/node_modules/mapbox-gl-style-spec/reference/v8.json Line 2: Unexpected token :
You may need an appropriate loader to handle this file type.
| {
|   "$version": 8,
|   "$root": {
|     "version": {
 @ ./~/mapbox-gl-style-spec/reference/latest.js 1:17-37

ERROR in ./~/mapbox-gl-style-spec/reference/v8.min.json
Module parse failed: /home/mike/projects/usesthis/node_modules/mapbox-gl-style-spec/reference/v8.min.json Line 1: Unexpected token :
You may need an appropriate loader to handle this file type.

With a bunch of flailing around and a little google-fu, you also run into other fun errors like “Uncaught TypeError: fs.readFileSync is not a function” when you try to run this in your browser.

After playing around with loaders and config options before finding useful github issues, I thought I would, for the benefit of my future self, compile a simple working example so I don’t have to figure this out again.

The goal here is to get Mapbox’s most basic example up and running with webpack.

Screenshot from 2016-02-24 14-43-47
The basic Mapbox-gl example.

Let’s create a directory to work in:

mkdir webpack-mapboxgl && cd webpack-mapboxgl

To do this we will divide the code from the example into two basic files: app.js for the Javascript and index.html for the HTML.

First, here’s index.html. Note that we are removing all the Javascript and in its place we are including the bundle.js that will be generated by webpack:

<!DOCTYPE html>
<html>
<head>
    <meta charset='utf-8' />
    <title></title>
    <meta name='viewport' content='initial-scale=1,maximum-scale=1,user-scalable=no' />
    <link href='https://api.tiles.mapbox.com/mapbox-gl-js/v0.14.2/mapbox-gl.css' rel='stylesheet' />
    <style>
        body { margin:0; padding:0; }
        #map { position:absolute; top:0; bottom:0; width:100%; }
    </style>
</head>
<body>

<div id='map'></div>
<script src="bundle.js"></script>
</body>
</html>

Next, app.js:

import mapboxgl from 'mapbox-gl'

mapboxgl.accessToken = 'pk.eyJ1IjoibWlrZXdpbGxpYW1zb24iLCJhIjoibzRCYUlGSSJ9.QGvlt6Opm5futGhE5i-1kw';
var map = new mapboxgl.Map({
    container: 'map', // container id
    style: 'mapbox://styles/mapbox/streets-v8', //stylesheet location
    center: [-74.50, 40], // starting position
    zoom: 9 // starting zoom
});

No real changes, just using the new ES2015 import syntax to pull in mapboxgl.
It’s probably a good time to install webpack globally:

sudo npm install -g webpack

This is where it gets a little hairy. Obviously mapboxgl and webpack need to be installed, as well as babel and a mess of loaders and transpiler presets. That’s life in the big city, right?

I set up npm in the directory with npm init, and then the fun begins:

npm install --save-dev webworkify-webpack transform-loader json-loader babel-loader babel-preset-es2015 babel-preset-stage-0 babel-core mapbox-gl

Next is the secret sauce that knits it all together, the webpack.config.js file:

var webpack = require('webpack')
var path = require('path')

module.exports = {
  entry: './app.js',
  output: { path: __dirname, filename: 'bundle.js' },
  node: {
    console: true,
    fs: "empty"
  },
  resolve: {
    extensions: ['', '.js', '.jsx'],
    alias: {
      webworkify: 'webworkify-webpack'
    }
  },
  module: {
    loaders: [
      {
        test: /\.jsx?$/,
        loader: 'babel',
        exclude: /node_modules/,
        query: {
          presets: ['es2015', 'stage-0']
        }
      },
      {
        test: /\.json$/,
        loader: 'json-loader'
      },
      {
        test: /\.js$/,
        include: path.resolve(__dirname, 'node_modules/mapbox-gl/js/render/painter/use_program.js'),
        loader: 'transform/cacheable?brfs'
      },
      {
        test: /\.js$/,
        include: path.resolve(__dirname, 'node_modules/mapbox-gl/js/render/shaders.js'),
        loader: 'transform/cacheable?brfs'
      },
      {
        test: /\.js$/,
        include: path.resolve(__dirname, 'node_modules/webworkify/index.js'),
        loader: 'worker'
      }
    ]
  },
};

With that you should be able to run the webpack command and it will produce the bundle we referenced earlier in our HTML. Open index.html in your browser and you should have a working WebGL map.

If you want to just clone this example, I’ve put it up on Github.

gpg and signing your own Arch Linux packages

One of the first things that I wanted to install on my system after switching to Arch was ArangoDB. Sadly it wasn’t in the official repos. I’ve had mixed success installing things from the AUR and the Arango packages there didn’t improve my ratio.

Using those packages as a starting point, I did a little tinkering and got it all working the way I like. I’ve been following all the work being done on reproducible builds and why that is needed, and it seems that with all that going on, anyone dabbling with making packages should at the very least be signing them. With that as my baseline, I figured I might as well start with mine.

Of course package signing involves learning about gpg/pgp whose “ease of use” is legendary.

Before we get to package signing, a little about gpg.

gpg --list-keys
gpg -k
gpg --list-public-keys

All of these commands list the contents of ~/.gnupg/pubring.gpg.

gpg --list-secret-keys
gpg -K

Both list all keys from ~/.gnupg/secring.gpg.

The pacman package manager also has its own gpg databases which you can explore with:

gpg --homedir /etc/pacman.d/gnupg --list-keys

So the task at hand is getting my public key into the list of public keys that pacman trusts. To do that we will need to do more than just list keys; we need to reference them individually. gpg has a few ways to do that by passing an argument to one of our list-keys commands above. I’ll do a quick search through the list of keys that pacman trusts:

mike@longshot:~/projects/arangodb_pkg☺ gpg --homedir /etc/pacman.d/gnupg -k pierre
pub   rsa3072/6AC6A4C2 2011-11-18 [SC]
uid         [  full  ] Pierre Schmitz (Arch Linux Master Key) <pierre@master-key.archlinux.org>
sub   rsa1024/86872C2F 2011-11-18 [E]
sub   rsa3072/1B516B59 2011-11-18 [A]

pub   rsa2048/9741E8AC 2011-04-10 [SC]
uid         [  full  ] Pierre Schmitz <pierre@archlinux.de>
sub   rsa2048/54211796 2011-04-10 [E]

mike@longshot:~/projects/arangodb_pkg☺  gpg --homedir /etc/pacman.d/gnupg --list-keys stephane
pub   rsa2048/AB441196 2011-10-30 [SC]
uid         [ unknown] Stéphane Gaudreault <stephane@archlinux.org>
sub   rsa2048/FDC576A9 2011-10-30 [E]

If you look at the output there you can see what is called an OpenPGP short key id. We can use those to refer to individual keys, but we can also use long ids and fingerprints:

gpg --homedir /etc/pacman.d/gnupg -k --keyid-format long stephane
pub   rsa2048/EA6836E1AB441196 2011-10-30 [SC]
uid                 [ unknown] Stéphane Gaudreault <stephane@archlinux.org>
sub   rsa2048/4ABE673EFDC576A9 2011-10-30 [E]


gpg --homedir /etc/pacman.d/gnupg -k --fingerprint stephane
pub   rsa2048/AB441196 2011-10-30 [SC]
      Key fingerprint = 0B20 CA19 31F5 DA3A 70D0  F8D2 EA68 36E1 AB44 1196
uid         [ unknown] Stéphane Gaudreault <stephane@archlinux.org>
sub   rsa2048/FDC576A9 2011-10-30 [E]

So we can identify Stephane’s specific key using either the short id, the long id or the fingerprint:

gpg --homedir /etc/pacman.d/gnupg -k AB441196
gpg --homedir /etc/pacman.d/gnupg -k EA6836E1AB441196
gpg --homedir /etc/pacman.d/gnupg -k 0B20CA1931F5DA3A70D0F8D2EA6836E1AB441196

Armed with a way to identify the key I want pacman to trust, I need to do the transfer. Though not initially obvious, gpg can push and pull keys from designated key servers. The file at ~/.gnupg/gpg.conf tells me that my keyserver is keys.gnupg.net, while pacman’s file at /etc/pacman.d/gnupg/gpg.conf says it is using pool.sks-keyservers.net.

Using my key’s long id I’ll push it to my default keyserver and tell pacman to pull it and then sign it.

#send my key
gpg --send-key F77636AC51B71B99
#tell pacman to pull that key from my keyserver
sudo pacman-key --keyserver keys.gnupg.net -r F77636AC51B71B99
#sign the key it received and start trusting it
sudo pacman-key --lsign-key F77636AC51B71B99

With all that done, I should be able to sign my package with

makepkg --sign --key F77636AC51B71B99

We can also shorten that by setting the default-key option in ~/.gnupg/gpg.conf.

# If you have more than 1 secret key in your keyring, you may want to
# uncomment the following option and set your preferred keyid.

default-key F77636AC51B71B99

With my default key set I’m able to make and install with this:

mike@longshot:~/projects/arangodb_pkg☺ makepkg --sign
mike@longshot:~/projects/arangodb_pkg☺ sudo pacman -U arangodb-2.8.1-1-x86_64.pkg.tar.xz
loading packages...
resolving dependencies...
looking for conflicting packages...

Packages (1) arangodb-2.8.1-1

Total Installed Size:  146.92 MiB
Net Upgrade Size:        0.02 MiB

:: Proceed with installation? [Y/n] y
(1/1) checking keys in keyring
[##################################################] 100%
(1/1) checking package integrity
[##################################################] 100%

The ease with which I can make my own packages is a very appealing part of Arch Linux for me. Signing them was the next logical step and I’m looking forward to exploring some related topics like running my own repo, digging deeper into GPG, the Web of Trust, and reproducible builds. It’s all fun stuff, if you can only find the time.

Private communications with Signal

The internet is a global network. You would think that the fact that every message sent over it passes though many legal jurisdictions would make the need for encryption obvious and uncontroversial. Sadly that is not the case.

The need for something more than legal safeguards becomes obvious when you see a request from a Toronto home to toronto.com (whose server is in Toronto!) leave Canadian legal jurisdiction on a path through both New York and Chicago before finally reaching its Toronto destination.

boomerang_route
An example of a boomerang route from the ixmaps.ca project. About 25% of traffic with both a start and end point in Canada is routed this way.

Applications that deliver technical safeguards, like end-to-end encryption, offer that “something more” that protects my data beyond the border. One of these applications is Signal, a project of Open Whisper Systems.

In an offline world, privacy was the default, a product of things you didn’t say or do, and probably also a byproduct of how hard it was to move information around. As things like chatting with friends and family or reading a newspaper all moved online, those activities suddenly involved sending data in plain text over public infrastructure. Privacy became something that existed only for those that found a way to avoid the default of sending plain text. Privacy became a product of action rather than inaction.

Signal and its predecessor projects Textsecure and Redphone are part of an effort to make privacy the default again by rolling high end encryption into apps polished for mainstream adoption.

Screenshot_20151103-142606

Signal does two main things: sending text messages and making calls. What Signal actually does for secure communications is very different from what it does for insecure ones and is worth understanding.

Text messages

When sending a text message to someone who does not have Signal, the application sends a standard SMS message. The details of what constitutes an SMS message were hashed out in 1988, long before security was a thing, and consequently a related specification notes that “SMS messages are transported without any provisions for privacy or integrity”, but importantly they are transported over the telephone network.

When sending secure text messages, Signal uses your mobile data to send the message using the Textsecure protocol v2.

The distinction between those two is worth making since coverage for data vs telephone can vary as can the costs, say if you are travelling and turn off mobile data.

The first time you send or receive a secure message with someone, behind the scenes you exchange cryptographic keys. Everything Signal does is focused on ensuring secure communication with the holder of that key. If the key for one of your contacts changes, it should be considered an event worth a quick phone call. This can happen innocently enough, say if they uninstall and then reinstall the app, but since all the other security measures are built on that, it’s worth asking about.

After the first text message has been sent or received, Signal uses those keys to generate new keys for each individual message (described in detail here). This ensures that even if one message were to be decrypted, every other message is still safe.

Calling

Calling follows a similar pattern; for insecure calls Signal simply launches your phone’s standard phone app, while it handles encrypted calls itself. And like the secure text messages, this also uses your mobile data rather than routing through the phone network.

Secure calls are placed using the ZRTP protocol, the details of which are hidden from the user with one important exception.
On screen when you make a secure call you will notice two words displayed. These words are part of the ZRTP protocol and were generated based on the key that both parties used to encrypt the call.

zrtp_call

Both parties should see the same two words. If you say one and ask your contact to read the other, and they don’t match up, the keys you have agreed upon are not the same. If the keys are not the same, it suggests someone has tampered with the connection information in flight and inserted themselves into your conversation.

Verifying keys

Part of the whole key exchange process that allows users to identify each other involves pulling your contact’s public key from a central key directory server. The use of a central server means that I now have to trust that server not to maliciously give me a key for someone else. Open Whisper Systems’ Trevor Perrin addressed the problem of trusting unauthenticated keys in a talk at the NSEC security conference. It’s just a few minutes, but it’s an interesting insight into the balancing act involved in bringing private communications to the masses:

For the interested or the paranoid, Signal lets you verify a contact’s key by touching your contact’s name/number at the top of your conversation. This brings up the details for that contact, which includes a “Verify identity” option.

verify

With that, and your own identity details, found under Settings (three vertical dots on the top right of the main screen) > “My Identity Key”, you are able to either read a key fingerprint or, if you have a QR/Barcode scanner, use that to verify your contact’s key.

scan_options
verified

Open Source

Establishing that there are no secret behaviours or hidden flaws somewhere in the code is critical in this world where we put a significant amount of trust in computers (and where the definition of computer is expanding to encompass everything from voting machines to Volkswagens).

Signal establishes trust by developing the code in the open so that it can be reviewed (like this review of Signal’s predecessor Redphone by cryptographer Matthew Green). Former Google security researcher Morgan Marquis-Boire has endorsed Signal, as has Edward Snowden.

But even if you believe the experts that Signal works as advertised, it’s common for “free” apps to seem significantly less “free” once you realize what they do to turn a profit. With that in mind, another component of trust is understanding the business model behind the products you use.

When asked about the business model on a popular tech forum, Open Whisper Systems founder Moxie Marlinspike explained “in general, Open Whisper Systems is a project rather than a company, and the project’s objective is not financial profit.”

The project is funded by a combination of grants and donations from the Freedom of the Press Foundation and the Shuttleworth Foundation, among others. It is worked on by a core group of contributors and a supporting cast of volunteers.

Final thoughts

Signal does a great job of making encrypting your communications a non-event. Encrypted as they travel the network, our messages are now secure against tampering and interception, no matter whose servers/routers they pass through. The result: privacy.

The fact that applying security measures results in privacy should tell you that the oft-quoted choice between “security vs privacy” is a false one. As pointed out by Timothy Mitchener-Nissen, assuming a balance between these things only results in sacrificing increments of privacy in pursuit of the unachievable idea of total security. The ultimate result is reducing privacy to zero. Signal is just one way to grab back one of those increments of privacy.

All of that said, my interest in privacy technologies and encryption is an interest for me as a technologist. If you are counting on technologies like Signal to protect you from anyone serious (like a nation-state), the information above is barely a beginning. I would suggest reading this best practices for Tor and the grugq’s article on signals, intelligence. Actually, anything/everything by the grugq.

A quick tour of Arangojs

I’ve been using ArangoDB for a while now, but for most of that time I’ve been using it from Ruby. I’ve dabbled with the Guacamole library and even took a crack at writing my own, but switching to Javascript has led me to get to know Arangojs.

Given that Arangojs is talking to ArangoDB via its HTTP API, basically everything you do is asynchronous. There are a few ways of dealing with async code in Javascript, and Arangojs has been written to support basically all of them.

Arangojs’s flexibility and my inexperience with the new Javascript syntax combined to give me a bit of an awkward start, so with a little learning under my belt, I thought I would write up some examples that would have saved me some time.

My most common use case is running an AQL query, so let’s use that as an example. First up, I’ve been saving my config in a separate file:

// arangodb_config.js
//Using auth your url would look like:
// "http://uname:passwd@127.0.0.1:8529"
module.exports = {
  "production" : {
    "databaseName": process.env.PROD_DB_NAME,
    "url": process.env.PROD_DB_HOST,
  },
  "development" : {
    "databaseName": process.env.DEVELOPMENT_DB_NAME,
    "url": process.env.DEVELOPMENT_URL
  },
  "test" : {
    "databaseName": "test",
    "url": "http://127.0.0.1:8529",
  },
}

With that I can connect to one of my existing databases like so:

var config = require('../arangodb_config')[process.env.NODE_ENV]
var db = require('arangojs')(config)

This keeps my test database nicely separated from everything else and all my db credentials in the environment and out of my project code.

Assuming that our test db has a collection called “documents” containing a single document, we can use Arangojs to go get it:

db.query('FOR doc IN documents RETURN doc', function(err, cursor) {
  cursor.all(function(err, result) {
    console.log(result)
  })
})

Which returns:

[ { foo: 'bar',
    _id: 'documents/206191358605',
    _rev: '206192931469',
    _key: '206191358605' } ]

While this is perfectly valid Javascript, it’s pretty old-school at this point since ECMAScript 2015 is now standard in both Node.js and any browser worth having. This means we can get rid of the “function” keyword and replace it with the “fat arrow” syntax and get the same result:

db.query('FOR doc IN documents RETURN doc', (err, cursor) => {
  cursor.all((err, result) => {
    console.log(result)
  })
})

So far so good, but the callback style (and the callback hell it brings) is definitely an anti-pattern. The widely cited antidote to this is promises:

db.query('FOR doc IN documents RETURN doc')
  .then((cursor) => { return cursor.all() })
  .then((doc) => { console.log(doc) });

This code is functionally equivalent but operates by chaining promises together. While it’s an improvement over callback hell, after writing a bunch of this type of code I ended up feeling like I had replaced callback hell with promise hell.

what-fresh-hell-is-this

The path back to sanity lies in ECMAScript 2016 aka ES7 and the new async/await keywords. Inside a function marked as async, you have access to an await keyword which allows you to write code that looks synchronous but does not block the event loop.

Using the babel transpiler lets us use the new ES7 syntax right now by compiling it all down to ES5/6 equivalents. Installing with npm install -g babel and running your project with babel-node is all that you need to be able to write this:

(async () => {
    let cursor = await db.query('FOR doc IN documents RETURN doc')
    let result = await cursor.all()
    console.log(result)
})()

Once again we get the same result but without all the extra cruft that we would normally have to write.

One thing that is notably absent in these examples is the use of bound variables in our queries to avoid SQL injection (technically parameter injection since this is NoSQL).

So what does that look like?

(async () => {
    let bindvars = {foo: "bar"}
    let cursor = await db.query('FOR doc IN documents FILTER doc.foo == @foo RETURN doc', bindvars)
    let result = await cursor.all()
    console.log(result)
})()

But Arangojs lets you go further, giving you a nice aqlQuery function based on ES6 template strings:

(async () => {
    let foo = "bar"
    let aql = aqlQuery`
      FOR doc IN documents
        FILTER doc.foo == ${foo}
          RETURN doc
    `
    let cursor = await db.query(aql)
    let result = await cursor.all()
    console.log(result)
})()

It’s pretty astounding how far that simple example has come. It’s hard to believe that it’s even the same language.
With Javascript (the language and the community) clearly in transition, Arangojs (and likely every other JS library) is compelled to support both the callback style and promises. It’s a bit surprising to see how much leeway that gives me to write some pretty sub-optimal code.

With all the above in mind, suddenly Arangojs’s async heavy API no longer feels intimidating.

The documentation for Arangojs is simple (just a long readme file) but comprehensive and there is lots more it can do. Hopefully this little exploration will help people get started with Arangojs a little more smoothly than I did.

Extracting test data from ArangoDB’s admin interface

Test Driven Development is an important part of my development process and
ArangoDB’s speed, schema-less nature and truncate command make testing really nice.
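
As a sketch of what that looks like in a test setup (assuming a mocha-style beforeEach and a hypothetical "vertices" collection):

var db = require('arangojs')({databaseName: "test", url: "http://127.0.0.1:8529"})

beforeEach(function() {
  // truncate keeps the collection and its indexes but removes every document
  return db.collection('vertices').truncate()
})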

Testing has ended up being especially important to me when it comes to AQL (Arango Query Language) queries. Just the same way that it’s easy to write a regular expression that matches more than you expect, constraining the traversal algorithm so you get what you want (and only that) can be tricky.

AQL queries that traverse a graph are often (maybe always?) sensitive to the structure of the graph. The direction of the edges (inbound/outbound) or the number of edges to cross (maxDepth) are often used to constrain a traversal. Both of these are examples of how details of your graph’s structure get built into your AQL queries. When the structure isn’t what you think, you can end up with some pretty surprising results coming back from your queries.
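
As a sketch of the kind of structural assumptions I mean (not a query from a real project, and using the arangojs db.query setup from the previous post), both the direction and the depth range here encode beliefs about how the graph is shaped:

// only follows OUTBOUND edges, and only 1 or 2 hops out from Eve; change
// either of those and the same data can produce a very different answer
db.query(`
  FOR v IN 1..2 OUTBOUND "persons/eve" GRAPH "knows_graph"
    RETURN v.name
`).then((cursor) => cursor.all())
  .then((names) => console.log(names))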

All of that underscores the need to test against data that you know has a specific structure. A few times now I have found myself with a bunch of existing graph data, wondering how to pick out a selected subgraph to test my AQL queries against.

ArangoDB’s web interface gets me tantalizingly close, letting me filter down to a single starting node and click with the spot tool to reveal its neighbors.

Filtering for a specific vertex in the graph.

In a few clicks I can get exactly the vertices and edges that I need to test my queries, and because I have seen it, I know the structure is correct, and has only what I need. All that is missing is a way to save what I see.

Since this seems to keep coming up for me, I’ve solved this for myself with a little hackery that I’ve turned to several times now. The first step is turning on Firefox’s dump function by entering about:config in the URL bar and searching the settings for “dump”.

Firefox can dump to the terminal with browser.dom.window.dump.enabled

The dump function allows you to write to the terminal from javascript. Once that is set to true, launching Firefox from the terminal, and typing dump("foo") in the Javascript console should print “foo” in the controlling terminal.

test_data

Next, since the graph viewer uses D3 for its visualization, we can dig into the DOM and print out the bits we need using dump. Pasting the following into the Javascript console will print out the edges:

var edges = document.querySelector('#graphViewerSVG').childNodes[0].childNodes[0].children; for(var i = 0; i < edges.length; i++) { dump("\r\n" + JSON.stringify(edges[i].__data__._data) + "\r\n"); }

And then this will print out the vertices:

var vertices = document.querySelector('#graphViewerSVG').childNodes[0].childNodes[1].children; for(var i = 0; i < vertices.length; i++) { dump("\r\n" + JSON.stringify(vertices[i].__data__._data) + "\r\n"); }

With the vertices and edges now printed to the terminal, a little copy/paste action and you can import the data into your test database before running your tests with arangojs’s import function.

myCollection.import([
  {foo: "bar"},
  {fizz: "buzz"}
])

Alternately you can upload JSON files into the collection via the web interface as well.

Importing JSON into a collection.

While this process has no claim on elegance, it’s been very useful for testing my AQL queries and saved me a lot of hassle.