Data modeling with ArangoDB

Since 2009 there has been a “Cambrian Explosion” of NoSQL databases, but information on data modeling with these new data stores feels hard to come by.
My weapon of choice for over a year now has been ArangoDB. While ArangoDB is pretty conscientious about having good documentation, there has been something missing for me: criteria for making modeling decisions.

Like most (all?) graph databases, ArangoDB allows you to model your data with a property graph. The building blocks of a property graph are attributes, vertices and edges. What makes data modelling with ArangoDB (and any other graph database) difficult is deciding between them.

To start with, we need a little terminology. Since a blog is a well-known thing, we can use a post with some comments and some tags as our test data to illustrate the idea.

Sparse vs Compact

Modeling our blog post as a “sparse” graph might look something like this:

The blog post modeled as a sparse graph.

At first glance it looks satisfyingly graphy: in the centre we see a green “Data Modeling” vertex which has an edge going to another vertex “post”, indicating that “Data Modeling” is a post. Commenters, tags and comments all have connections to a vertex representing their type as well.

Looking at the data you can see we are storing lots of edges and most vertices contain only a single attribute (apart from the internal attributes ArangoDB creates: _id, _key, _rev).

//vertices
[{"_id":"vertices/26589395587","_key":"26589395587","_rev":"26589395587","title":"Data modeling","text":"lorum ipsum...","date":"1436537253903"},
{"_id":"vertices/26589592195","_key":"26589592195","_rev":"26589592195","type":"post"},
{"_id":"vertices/26589723267","_key":"26589723267","_rev":"26589723267","name":"Mike Williamson"},
{"_id":"vertices/26589854339","_key":"26589854339","_rev":"26589854339","type":"author"},
{"_id":"vertices/26589985411","_key":"26589985411","_rev":"26589985411","name":"Mike's Mum","email":"mikes_mum@allthemums.com"},
{"_id":"vertices/26590116483","_key":"26590116483","_rev":"26590116483","type":"commenter"},
{"_id":"vertices/26590247555","_key":"26590247555","_rev":"26590247555","title":"That's great honey","text":"Love you!"},
{"_id":"vertices/26590378627","_key":"26590378627","_rev":"26590378627","type":"comment"},
{"_id":"vertices/26590509699","_key":"26590509699","_rev":"26590509699","name":"Spammy McSpamerson","email":"spammer@fakeguccihandbags.com"},
{"_id":"vertices/26590640771","_key":"26590640771","_rev":"26590640771","title":"Brilliant","text":"Gucci handbags..."},
{"_id":"vertices/26590771843","_key":"26590771843","_rev":"26590771843","name":"arangodb"},
{"_id":"vertices/26590902915","_key":"26590902915","_rev":"26590902915","name":"modeling"},
{"_id":"vertices/26591033987","_key":"26591033987","_rev":"26591033987","name":"nosql"},
{"_id":"vertices/26591165059","_key":"26591165059","_rev":"26591165059","type":"tag"}]

//edges
[{"_id":"edges/26604010115","_key":"26604010115","_rev":"26604010115","_from":"vertices/26589723267","_to":"vertices/26589395587"},
{"_id":"edges/26607352451","_key":"26607352451","_rev":"26607352451","_from":"vertices/26589723267","_to":"vertices/26589854339"},
{"_id":"edges/26608204419","_key":"26608204419","_rev":"26608204419","_from":"vertices/26590640771","_to":"vertices/26590378627"},
{"_id":"edges/26609842819","_key":"26609842819","_rev":"26609842819","_from":"vertices/26590247555","_to":"vertices/26590378627"},
{"_id":"edges/26610694787","_key":"26610694787","_rev":"26610694787","_from":"vertices/26589985411","_to":"vertices/26590247555"},
{"_id":"edges/26611546755","_key":"26611546755","_rev":"26611546755","_from":"vertices/26589395587","_to":"vertices/26590247555"},
{"_id":"edges/26615020163","_key":"26615020163","_rev":"26615020163","_from":"vertices/26589985411","_to":"vertices/26590116483"},
{"_id":"edges/26618821251","_key":"26618821251","_rev":"26618821251","_from":"vertices/26590771843","_to":"vertices/26591165059"},
{"_id":"edges/26622622339","_key":"26622622339","_rev":"26622622339","_from":"vertices/26589395587","_to":"vertices/26589592195"},
{"_id":"edges/26625833603","_key":"26625833603","_rev":"26625833603","_from":"vertices/26590509699","_to":"vertices/26590640771"},
{"_id":"edges/26642741891","_key":"26642741891","_rev":"26642741891","_from":"vertices/26589395587","_to":"vertices/26590902915"},
{"_id":"edges/26645101187","_key":"26645101187","_rev":"26645101187","_from":"vertices/26589395587","_to":"vertices/26590771843"},
{"_id":"edges/26649885315","_key":"26649885315","_rev":"26649885315","_from":"vertices/26589395587","_to":"vertices/26591033987"},
{"_id":"edges/26651064963","_key":"26651064963","_rev":"26651064963","_from":"vertices/26590902915","_to":"vertices/26591165059"},
{"_id":"edges/26651785859","_key":"26651785859","_rev":"26651785859","_from":"vertices/26591033987","_to":"vertices/26591165059"},
{"_id":"edges/26652965507","_key":"26652965507","_rev":"26652965507","_from":"vertices/26590509699","_to":"vertices/26590116483"},
{"_id":"edges/26670267011","_key":"26670267011","_rev":"26670267011","_from":"vertices/26589395587","_to":"vertices/26590640771"}]

A “compact” graph on the other hand might look something like this:

{
  title:  "Data modelling",
  text: "lorum ipsum...",
  author: "Mike Williamson",
  date:   "2015-11-19",
  comments: [
    {
      author:"Mike's Mum",
      email:"mikes_mum@allthemums.com",
      text: "That's great honey",
    },
    {
      "author": "spammer@fakeguccihandbags.com",
      "title": "Brilliant",
      "text": "Gucci handbags...",
    }
  ],
  tags:["mongodb","modeling","nosql"]
}

Here we have taken exactly the same data and collapsed it together into a single document. While it’s a bit of a stretch to even classify this as a graph, ArangoDB’s multi-model nature largely erases the boundary between a document and a graph with a single vertex.

The two extremes above give us some tools for talking about our graph. It’s the same data either way, but clearly different choices are being made. In the sparse graph, every vertex you see could have been an attribute, but was consciously moved into a vertex of its own. The compact graph is what comes out of repeatedly choosing to add new data as an attribute rather than as a vertex.

When modeling real data your decisions don’t always fall on one side or the other. So what criteria should we be using to make those decisions?

Compact by default

As a baseline you should default to a compact graph. Generally data that is displayed together should be combined into a single document.

Defaulting to compact will mean fewer edges will exist in the graph as a whole. Since each traversal across a graph will have to find, evaluate and then traverse the edges for each vertex it encounters, keeping the number of edges to a strict minimum will ensure traversals stay fast as your graph grows.
Compact graphs will also mean fewer queries and traversals to get the information you need.
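
To make that concrete: with the compact model, everything needed to display the post arrives with a single document lookup. A minimal sketch in arangosh, assuming a hypothetical posts collection holding the compact document above:

// one lookup returns the post along with its comments and tags
db._query('FOR post IN posts FILTER post.title == "Data modeling" RETURN post').toArray();

// the sparse model needs a traversal instead, finding and following an edge
// for every comment, commenter, tag and type vertex before the page can be assembled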

But not everything belongs together. Any attribute that contains a complex data structure (like the “comments” array or the “tags” array) deserves a little scrutiny as it might make sense as a vertex (or vertices) of its own.

Looking at our compact graph above, the array of comments, the array of tags, and maybe even the author might be better off as vertices rather than leaving them as attributes. How do we decide?

  • If you need to point an edge at it, it will need to be a vertex.
  • If it will be accessed on its own (ie: showing comments without the post), it will need to be a vertex.
  • If you are going to use certain graph measurements (like centrality) it will need to be a vertex.
  • If it’s not a value object (i.e. its values can change but the object remains the same thing), it will need to be a vertex.

Removing duplicate data can also be a reason to promote an attribute to a vertex, but with the cost of storage low (and dropping) it’s a weak reason.
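
As a rough sketch of what that promotion looks like in arangosh (the collection names here are hypothetical, and the edge save(from, to, data) signature is the ArangoDB 2.x one): the comment becomes a document of its own and an edge records which post it belongs to, so it can be queried on its own or pointed at by other edges.

db._create("posts");
db._create("comments");
db._createEdgeCollection("post_comments");

// the post keeps the attributes that are only ever displayed with it...
var post = db.posts.save({title: "Data modeling", text: "lorum ipsum..."});

// ...each comment is promoted to a vertex of its own...
var comment = db.comments.save({name: "Mike's Mum", title: "That's great honey", text: "Love you!"});

// ...and an edge records the relationship between them
db.post_comments.save(post._id, comment._id, {});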

Edge Direction

Once you promote a piece of data to being a vertex (or “reify” it), your next decision is which way the edge connecting it to another vertex should go. Edge direction is a powerful way to put up boundaries to contain your traversals, but while the boundary is important, the actual direction is not. Whatever you choose, it just needs to be consistent. One edge going the wrong direction is going to have your traversal returning things you don’t expect.
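
A small sketch of what that boundary buys you, using the hypothetical collections from the sketch above, a made-up post id, and the newer AQL traversal syntax (so treat it as illustrative rather than version-exact):

// if every edge points post -> comment, an OUTBOUND traversal from a post
// can only ever reach that post's comments
db._query('FOR c IN 1..1 OUTBOUND "posts/123" post_comments RETURN c.title').toArray();

// a single edge saved the other way (comment -> post) would be invisible to
// this traversal, which is why picking a direction and sticking to it matters
// more than which direction you pick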

And another thing…

This post is the post I kept hoping to find as I worked on modeling my data with ArangoDB. It’s not complete; data modeling is a big topic and there is lots more depth to ArangoDB to explore (for example, I haven’t yet tried splitting my edges amongst multiple edge collections), but these are some of the guidelines I was hoping for when I was starting.

I would love to learn more about the criteria people are using to make those tough calls between attribute and vertex, and all those other hard modeling decisions.

If you have thoughts on this let me know!

When to use a graph database

There are a lot of “intro to graph database” tutorials on the internet. While the “how” part of using a graph database has its place, I don’t know if enough has been said about “when”.

The answer to “when” depends on the properties of the data you are working with. In broad strokes, you should probably keep a graph database in mind if you are dealing with a significant amount of any of the following:

  • Hierarchical data

  • Connected data

  • Semi-structured data

  • Data with Polymorphic associations

Each of these data types either requires some number of extra tables or columns (or both) to deal with under the relational model. The cost of these extra tables and columns is an increase in complexity.

Terms like “connected data” or “semi-structured data” get used a lot in the NoSQL world, but the definitions, where you can find one at all, have a “you’ll know it when you see it” flavour to them. “You’ll know it when you see it” can be distinctly unsatisfying when examples are hard to come by as well. Let’s take a look at these one by one and get a sense of what they mean generally and how to recognize them in existing relational-database-backed projects.

Hierarchical Data

Hierarchies show up everywhere. There are geographical hierarchies where a country has many provinces, which have many cities which have many towns. There is also the taxonomic rank, indicating the level of a taxon in the Taxonomic Hierarchy, organizational hierarchies, the North American Industry Classification system… the list goes on and on.

What it looks like in a relational database.

Usually it’s easy to tell if you are dealing with this type of data. In an existing database schema you may see tables with a parent_id column, indicating the use of the Adjacency List pattern, or left/right columns, indicating the use of the Nested Sets pattern. There are many others as well.

Connected Data

Connected data is roughly synonymous with graph data. When we are talking about graph data we mean bits of data, plus information about how those bits of data are related.

What it looks like in a relational database.

Join tables are the most obvious sign that you are storing graph data. Join tables exist solely to act as the target of two one-to-many relationships, each row representing a relationship between two rows in other tables. Cases where you are storing data in the join table (such as the Rails has_many :through relationship) are even more clear; you are definitely storing connected data.

While one-to-many relationships also technically describe a graph, they probably are not going to make you reconsider the use of a relational database the way large numbers of many-to-many relationships might.

Semi-structured Data

Attempts to define semi-structured data seem to focus on variability; just because one piece of data has a particular set of attributes does not mean that the rest do. You can actually get an example of semi-structured data by mashing together two sets of structured (tabular) data. In this world of APIs and SOA where drawing data from multiple sources is pretty much the new normal, semi-structured data is increasingly common.

What it looks like in a relational database.

Anywhere you have columns with lots of null values in them. The columns provide the structure, but long stretches of null values suggest that this data does not really fit that structure.

An example of semi-structured data: a hypothetical products table combining books (structured data) and music (also structured data).

Polymorphic associations

When one type of data has an association that might relate it to one of two or more other things, that’s what’s known as a polymorphic association. As an example, think of a photo, which might be related to a user or to a product.

What it looks like in a relational database.

While polymorphic relations can be done in a relational database, most commonly they are handled at the framework level, where the framework uses a foreign key and an additional “type” column to determine the correct table/row. Seeing both a something_id and a something_type column in the same table is a hint that a polymorphic relationship is being used. Both Ruby on Rails and Java’s Spring Framework offer this.

So when?

These types of data are known to be an awkward fit for the relational model, in the same way that storing large quantities of perfectly tabular data would be awkward under the graph model. These are ultimately threshold problems, like the famous paradox of the heap.

1000000 grains of sand is a heap of sand

A heap of sand minus one grain is still a heap.

Your first join table or set of polymorphic relations will leave you with a perfectly reasonable database design, but just as “a heap of sand minus one grain” will eventually cross some ill-defined threshold and produce something that is no longer a heap of sand, there is some number of join tables or other workarounds for the relational model that will leave you with a database that is significantly more complex than a graph database equivalent would be.

Knowing about the limits of the relational model, and doing some hard thinking about how much time you are spending pressed up against those limits, are really the only things that can guide your decision making.

Querying the Openstreetmap Dataset

While much has been written about putting data into OpenStreetMap (OSM), it doesn’t feel like much has been said about getting data out. For those familiar with GIS software, grabbing a “metro extract” is a reasonable place to start, but for developers or regular users it’s not quite as clear how to get at the data we can see is in there.

The first way to get at the data is with the Overpass API. Overpass was started by Roland Olbricht in 2008 as a way to ask for some specified subset of the OSM data.

Let’s say I was curious about the number of bike racks that could hold 8 bikes in downtown Ottawa. The first thing to know is that OSM data is XML, which means that each element (node/way/area/relation) looks something like this:

  <node id="3046036633" lat="45.4168480" lon="-75.7016922">
    <tag k="access" v="public"/>
    <tag k="amenity" v="bicycle_parking"/>
    <tag k="bicycle_parking" v="rack"/>
    <tag k="capacity" v="8"/>
  </node>

Basically any XML element may be associated with a bunch of tags containing keys and values.

You specify which elements of the OSM dataset are interesting to you by creating an Overpass query in XML format or using a query language called Overpass QL. You can use either one, but I’m using XML here.

Here is a query asking for all the elements of type “node” that have both a tag with a key of “amenity” and a value of “bicycle_parking”, and a tag with a key of “capacity” and a value of “8”. You can also see my query includes a bbox-query element with coordinates for north, east, south, and west: the two corners of a bounding box, so the search will be limited to that geographic area.

<osm-script output="json">
  <query type="node">
    <has-kv k="amenity" v="bicycle_parking"/>
    <has-kv k="capacity" v="8"/>
    <bbox-query e="-75.69105863571167" n="45.42274779392456" s="45.415714100972636" w="-75.70568203926086"/>
  </query>
  <print/>
</osm-script>

I’ve saved that query into a file named “query” and I am using cat to read the file and pass the text to curl which sends the query.

mike@longshot:~/osm☺  cat query | curl -X POST -d @- http://overpass-api.de/api/interpreter
{
  "version": 0.6,
  "generator": "Overpass API",
  "osm3s": {
    "timestamp_osm_base": "2014-08-27T18:47:01Z",
    "copyright": "The data included in this document is from www.openstreetmap.org. The data is made available under ODbL."
  },
  "elements": [

{
  "type": "node",
  "id": 3046036633,
  "lat": 45.4168480,
  "lon": -75.7016922,
  "tags": {
    "access": "public",
    "amenity": "bicycle_parking",
    "bicycle_parking": "rack",
    "capacity": "8"
  }
},
{
  "type": "node",
  "id": 3046036634,
  "lat": 45.4168354,
  "lon": -75.7017258,
  "tags": {
    "access": "public",
    "amenity": "bicycle_parking",
    "capacity": "8",
    "covered": "no"
  }
},
{
  "type": "node",
  "id": 3046036636,
  "lat": 45.4168223,
  "lon": -75.7017618,
  "tags": {
    "access": "public",
    "amenity": "bicycle_parking",
    "bicycle_parking": "rack",
    "capacity": "8"
  }
}

  ]
}

This is pretty exciting, but it’s worth pointing out that the response is JSON, not the GeoJSON you will probably want for doing things with Leaflet. The author is certainly aware of this and apparently working on it, but in the meantime you will need to use the npm module osmtogeojson to do the conversion from what Overpass gives to what Leaflet accepts.
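
The conversion itself is only a few lines of Node. A minimal sketch, assuming the Overpass response above has been saved to a file called response.json:

// npm install osmtogeojson
var osmtogeojson = require("osmtogeojson");

// the JSON Overpass returned above
var osmJson = require("./response.json");

// osmtogeojson accepts Overpass JSON (or an OSM XML DOM) and returns GeoJSON
var geojson = osmtogeojson(osmJson);

// ...which is exactly what Leaflet wants:
// L.geoJson(geojson).addTo(map);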

So what might that get you? Well, let’s say you are trying to calculate the total amount of bike parking in downtown Ottawa. With a single API call (this time using Overpass QL, so it’s cut-and-paste friendly), we can tally up the capacity tags:

mike@longshot:~/osm☺  curl -s -g 'http://overpass-api.de/api/interpreter?data=[out:json];node["amenity"="bicycle_parking"](45.415714100972636,-75.70568203926086,45.42274779392456,-75.69105863571167);out;' | grep capacity | tr -d ',":' | sort | uniq -c
      2     capacity 10
      7     capacity 2
      6     capacity 8

Looks like more bike racks need to be tagged with “capacity”, but it’s a good start on coming up with a total.
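
If grep and tr feel too fragile, the same tally is easy to do from the JSON itself. A small sketch in Node, again assuming the response has been saved to response.json:

var response = require("./response.json");

// sum the capacity tags of every node that has one
var total = response.elements
  .filter(function (el) { return el.tags && el.tags.capacity; })
  .reduce(function (sum, el) { return sum + parseInt(el.tags.capacity, 10); }, 0);

console.log("Total tagged bike parking capacity:", total); // 2*10 + 7*2 + 6*8 = 82 for the counts above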

Building on the Overpass API is the web-based Overpass-turbo. If you are a regular user trying to get some “how many of X in this area” type questions answered, this is the place to go. It’s also helpful for developers looking to work the kinks out of a query.

Using Overpass-Turbo to display my edits in the Ottawa area.

It’s really simple to get started using the wizard, which helps write a query for you. With a little fooling around with the styles you can do some really interesting stuff. As an example, we can colour the bicycle parking according to its capacity so we can see which ones have a capacity tag and which ones don’t. The query ends up looking like this:

<osm-script timeout="25">
  <!-- gather results -->
  <union>
    <!-- query part for: “amenity=bicycle_parking” -->
    <query type="node">
      <has-kv k="amenity" v="bicycle_parking"/>
      <bbox-query {{bbox}}/>
    </query>
    {{style:
      node[amenity=bicycle_parking]{ fill-opacity: 1; fill-color: grey;color: white;}
      node[capacity=2]{ fill-color: yellow; }
      node[capacity=8]{ fill-color: orange;}
      node[capacity=10]{fill-color: red;}
    }}
  </union>
  <print mode="body"/>
  <recurse type="down"/>
  <print mode="skeleton" order="quadtile"/>
</osm-script>

Bike racks with no capacity attribute will be grey. You can see the result here.

While Overpass-turbo might not be as sophisticated as CartoDB, it is really approachable and surprisingly capable. Highlighting certain nodes, picking out the edits of a particular user, there are lots of interesting applications.

Being able to query the OSM data easily opens some interesting possibilities. If you are gathering data for whatever reason, you are going to run into the problems of where to store it, and how to keep it up to date. One way of dealing with both of those is to store your data in OSM.

With all the thinking that has gone into what attributes can be attached  to things like trees, bike racks, and public art, you can store a surprising amount of information in a single point. Once saved into the OSM dataset, you will always know where to find the most current version of your data, and backups are dealt with for you.

This approach  also opens the door to other people helping you keep it up to date. Asking for volunteers or running hackathons to help you update your data is pretty reasonable when it also means improving a valuable public resource, instead of just enriching the owner alone. Once the data is in OSM, the maintenance burden is easy to distribute.

When it’s time to revisit your question, fresh data will only ever be an Overpass query away…

Something to think about.

ArangoDB’s geo-spatial functions

I’ve been playing with ArangoDB a lot lately. As a document database it looks to be a drop-in replacement for MongoDB, but it goes further, allowing graph traversals and geo-spatial queries.

Since I have a geo-referenced data set in mind I wanted to get to know its geo-spatial functions. I found the documentation kind of unclear, so I thought I would write up my exploration here.

At the moment there are only two geo-spatial functions in Arango: WITHIN and NEAR. Let’s make some test data using the Arango shell. Run arangosh and then the following:

db._create('cities')
db.cities.save({name: 'Ottawa', lat: 45.4215296, lng: -75.69719309999999})
db.cities.save({name: 'Montreal', lat: 45.5086699, lng: -73.55399249999999})
db.cities.save({name: 'São Paulo', lat: -23.5505199, lng: -46.63330939999999})

We will also need a geo-index for the functions to work. You can create one by passing in the name(s) of the fields that hold the latitude and longitude. In our case I just called them lat and lng so:

db.cities.ensureGeoIndex('lat', 'lng')

Alternately I could have done:

db.cities.save({name: 'Ottawa', location: [45.4215296, -75.69719309999999]})
db.cities.ensureGeoIndex('location')

As long as the values are of type double life is good. If you have some documents in the collection that don’t have the key(s) you specified for the index it will just ignore them.

First up is the WITHIN function. It’s pretty much what you might expect: you give it a lat/lng and a radius and it gives you the records within the area you specified. What is a little unexpected is that the radius is given in meters. So I am going to ask for the documents that are closest to the lat/lng of my favourite coffee shop (45.42890720357919, -75.68796873092651). To make the results more interesting I’ll ask for a 170000 meter radius (I know that Montreal is about 170 kilometers from Ottawa), so I should see those two cities in the result set:

arangosh [_system]> db._createStatement({query: 'FOR city in WITHIN(cities, 45.42890720357919, -75.68796873092651, 170000) RETURN city'}).execute().toArray()
[ 
  { 
    "_id" : "cities/393503132620", 
    "_rev" : "393503132620", 
    "_key" : "393503132620", 
    "lat" : 45.4215296, 
    "lng" : -75.69719309999999, 
    "name" : "Ottawa" 
  }, 
  { 
    "_id" : "cities/393504967628", 
    "_rev" : "393504967628", 
    "_key" : "393504967628", 
    "lat" : 45.5086699, 
    "lng" : -73.55399249999999, 
    "name" : "Montreal" 
  } 
]

There is also an optional “distancename” parameter which, when given, prompts Arango to add to each document its distance in meters from your target point. We can use that like this:

arangosh [_system]> db._createStatement({query: 'FOR city in WITHIN(cities, 45.42890720357919, -75.68796873092651, 170000, "distance_from_artissimo_cafe") RETURN city'}).execute().toArray()
[ 
  { 
    "_id" : "cities/393503132620", 
    "_rev" : "393503132620", 
    "_key" : "393503132620", 
    "distance_from_artissimo_cafe" : 1091.4226157106734, 
    "lat" : 45.4215296, 
    "lng" : -75.69719309999999, 
    "name" : "Ottawa" 
  }, 
  { 
    "_id" : "cities/393504967628", 
    "_rev" : "393504967628", 
    "_key" : "393504967628", 
    "distance_from_artissimo_cafe" : 166640.3086328647, 
    "lat" : 45.5086699, 
    "lng" : -73.55399249999999, 
    "name" : "Montreal" 
  } 
]

Arango’s NEAR function returns a set of documents ordered by their distance in meters from the lat/lng you provide. The number of documents in the set is controlled by the optional “limit” argument (which defaults to 100), and it accepts the same “distancename” as above. I am going to limit the result set to 3 (I only have 3 records in there anyway), and use my coffee shop again:

arangosh [_system]> db._createStatement({query: 'FOR city in NEAR(cities, 45.42890720357919, -75.68796873092651, 3, "distance_from_artissimo_cafe") RETURN city'}).execute().toArray()
[ 
  { 
    "_id" : "cities/393503132620", 
    "_rev" : "393503132620", 
    "_key" : "393503132620", 
    "distance_from_artissimo_cafe" : 1091.4226157106734, 
    "lat" : 45.4215296, 
    "lng" : -75.69719309999999, 
    "name" : "Ottawa" 
  }, 
  { 
    "_id" : "cities/393504967628", 
    "_rev" : "393504967628", 
    "_key" : "393504967628", 
    "distance_from_artissimo_cafe" : 166640.3086328647, 
    "lat" : 45.5086699, 
    "lng" : -73.55399249999999, 
    "name" : "Montreal" 
  }, 
  { 
    "_id" : "cities/393506343884", 
    "_rev" : "393506343884", 
    "_key" : "393506343884", 
    "distance_from_artissimo_cafe" : 8214463.292795454, 
    "lat" : -23.5505199, 
    "lng" : -46.63330939999999, 
    "name" : "São Paulo" 
  } 
]

As you can see ArangoDB’s geo-spatial functionality is sparse but certainly enough to do some interesting things. Being able to act as a graph database AND do geo-spatial queries places Arango in a really interesting position and I am hoping to see its capabilities in both those areas expand. I’ve sent a feature request for WITHIN_BOUNDS, which I think would make working with leaflet.js or Google maps really nice, since it would save me doing a bunch of calculations with the map centre and the current zoom level to figure out a radius in meters for my query. I’ll keep my fingers crossed…
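
In the meantime the workaround is not too painful: take the map’s centre and one corner of the visible bounds, work out the distance between them with the haversine formula, and hand that to WITHIN as the radius. A sketch in plain JavaScript (the Leaflet calls in the comments just show where the numbers would come from):

// distance in meters between two lat/lng points (haversine formula)
function distanceInMeters(lat1, lng1, lat2, lng2) {
  var R = 6371000; // mean Earth radius in meters
  var toRad = function (deg) { return deg * Math.PI / 180; };
  var dLat = toRad(lat2 - lat1);
  var dLng = toRad(lng2 - lng1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
          Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * R * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
}

// with Leaflet, something like:
// var centre = map.getCenter();
// var corner = map.getBounds().getNorthEast();
// var radius = distanceInMeters(centre.lat, centre.lng, corner.lat, corner.lng);
// ...and radius is what gets passed to WITHIN in the queries above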

Getting to know the Firefox developer tools

Back in 2011 things were not looking good for developer tools in Firefox. Firebug development had slowed and its lead developer took a job with Google after IBM lost interest in funding his work on the project. Mozilla was already working on moving debugging tools into Firefox core, but the new dev tools were pretty uninspiring compared to what Chrome had. I, and pretty much every other developer I know, ended up using Chrome for development work, which eventually translated into using Chrome all the time.

Well, I’ve been spending more time in Firefox lately, and am happy to see Mozilla has been rapidly closing the gap with Chrome on the dev tools front.

Firefox developer tools

One of the major frustrations with the Firefox dev tools was removed with Firefox 29’s new ability to disable the cache. Strangely there does not seem to be a way to set this permanently (it’s forgotten each time you close the dev tools!), but at least it exists.

Finally a way to disable the cache!

The challenge of the dev tools is in presenting a huge amount of information to the user in as compact a way as possible. After working with the Firefox dev tools a little it feels like the focus is less on piling in features trying to match Chrome and more on clear presentation of the essentials. “Clarity over parity” I suppose you could say. This approach is really visible if you compare the Network timings in Firefox and Chrome:

The network timings from Firefox (top) and Chrome.

I think it’s far clearer in Firefox that the total time for a request is the sum of a few different phases (DNS resolution, connecting, sending, waiting, receiving), while it’s really not clear what is going on in Chrome until you start digging.

One thing I was happy to notice was that the famous Tilt addon has been moved into core and incorporated into the dev tools.

See your page in 3D

While this might have been written off initially as a WebGL demo or a bit of a gimmick, I think it’s super useful for finding and eliminating unnecessary nesting of elements, so I am really glad to see it find a home here.

Hacker News rendered in the 3D view.

While the responsive design mode is really nice, I like to be able to debug pages live on my phone. Fortunately Mozilla has made that possible by using the adb tool that the Android SDK provides. On Ubuntu you can install it from the repos:

mike@sleepycat:~☺  sudo apt-get install android-tools-adb
Setting up android-tools-adb (4.2.2+git20130218-3ubuntu16)
...

Then you will need to enable remote debugging in your Firefox mobile settings:

The remote debugging setting in Firefox mobile.

With that done you should be able to see the mobile browser in the adb’s list of devices:

mike@sleepycat:~☺  adb devices
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
List of devices attached 
0149B33B12005018	device

To connect, run adb forward tcp:6000 tcp:6000 and then choose Tools > Web Developer > Connect… from your desktop Firefox’s menu. It will bring up a page like this:

Connecting to Firefox mobile for remote debugging

When you click connect you should see this on the screen of your mobile device:

The connection prompt on the mobile device.

While the dev tools are still missing a few features I like from the Chrome dev tools (mostly being able to get a list of unused CSS declarations), all my day to day use cases are well covered. In fact, while I rarely used network timings before, the clearer presentation of them has made me realize how much valuable information was in there.

It’s been good to dig into these tools and find so much good stuff going on. It feels like there was a bit of a rocky transition from Firebug to the new dev tools, but clearly Mozilla has found its feet. I’m looking forward to exploring further.

Getting started with graph databases

I have a personal project I have been chipping away on for a little while now. I’ve been slowly adding more and more test data to it, and as I do it’s become increasingly clear that while the data itself is neat, what is actually interesting is the relationships between the various entities and not so much the entities themselves. This realization led me to do some reading about graph databases. O’Reilly (as usual) has an interesting book on Graph Databases written by Ian Robinson, Jim Webber, and Emil Eifrem. It’s a good intro, but given that they hold the positions of engineer, chief scientist and CEO at the company that makes the Neo4j graph database, it’s unsurprisingly focused on Neo4j.

Unfortunately the ‘j’ part of Neo4j refers to Java, which is a whole can of worms that I would rather not open. So I set off to look for a graph database that would not force me onto the JVM or trap me with open-core licensing, and ultimately found ArangoDB.

Licensed under Apache 2, ArangoDB (formerly AvocadoDB) is a document database, in the same vein as MongoDB. What’s interesting is that it can also do key/value stuff like Redis and graphs like Neo4j.
Since it’s written in C++ I don’t have to worry about the JVM. So, let’s get started with it!

Installation is painless on Ubuntu:

wget -qO - http://www.arangodb.org/repositories/arangodb2/xUbuntu_13.10/Release.key | sudo apt-key add -
sudo sh -c "echo 'deb http://www.arangodb.org/repositories/arangodb2/xUbuntu_13.10/ /' > /etc/apt/sources.list.d/arangodb.list"
sudo apt-get update && sudo apt-get install arangodb

Since this is my first contact with graphs, I want a dataset I can use to get a feel for working with them. Fortunately the company behind ArangoDB (triAGENS) has put some sample data up on GitHub to get people started:

$> git clone https://github.com/triAGENS/ArangoDB-Data.git
Cloning into 'ArangoDB-Data'...
...
$> cd ArangoDB-Data/Graphs/IMDB
$> ./import.sh

That import script imports a bunch of IMDB data into ArangoDB and means that we can start exploring with the arango shell:

$> arangosh

                                       _     
  __ _ _ __ __ _ _ __   __ _  ___  ___| |__  
 / _` | '__/ _` | '_ \ / _` |/ _ \/ __| '_ \ 
| (_| | | | (_| | | | | (_| | (_) \__ \ | | |
 \__,_|_|  \__,_|_| |_|\__, |\___/|___/_| |_|
                       |___/                 

Welcome to arangosh 2.0.2 [linux]. Copyright (c) triAGENS GmbH
Using Google V8 3.16.14 JavaScript engine, READLINE 6.2, ICU 4.8.1.1

Pretty printing values.
Connected to ArangoDB 'tcp://localhost:8529' version: 2.0.2, database: '_system', username: 'root'

use 'help' to see common examples
arangosh [_system]>

Tab completion works super well here to give a sense of what your options are, but the first thing we care about is figuring out what that import did for us. You can see it created two collections (imdb_vertices and imdb_edges) with the db._collections() function:

arangosh [_system]> db._collections()
[ 
  [ArangoCollection 3021163, "_aal" (type document, status loaded)], 
  [ArangoCollection 1317227, "_graphs" (type document, status loaded)], 
  [ArangoCollection 3545451, "_replication" (type document, status loaded)], 
  [ArangoCollection 137579, "_users" (type document, status loaded)], 
  [ArangoCollection 1513835, "_cluster_kickstarter_plans" (type document, status loaded)], 
  [ArangoCollection 940644715, "vertices" (type document, status loaded)], 
  [ArangoCollection 3414379, "_aqlfunctions" (type document, status loaded)], 
  [ArangoCollection 1382763, "_modules" (type document, status loaded)], 
  [ArangoCollection 3610987, "_statistics" (type document, status loaded)], 
  [ArangoCollection 1160255851, "imdb_vertices" (type document, status loaded)], 
  [ArangoCollection 940710251, "edges" (type edge, status loaded)], 
  [ArangoCollection 3479915, "_trx" (type document, status loaded)], 
  [ArangoCollection 266194196843, "imdb_edges" (type edge, status loaded)], 
  [ArangoCollection 1448299, "_routing" (type document, status loaded)] 
]

We can also pick random documents out of the imdb_vertices collection with the .any() function to get a sense of what’s in there.

 db.imdb_vertices.any()
{ 
  "_id" : "imdb_vertices/40233", 
  "_rev" : "6407199083", 
  "_key" : "40233", 
  "version" : 21, 
  "id" : "65952", 
  "type" : "Person", 
  "birthplace" : "", 
  "biography" : "", 
  "label" : "Jude Poyer", 
  "lastModified" : "1301901667000", 
  "name" : "Jude Poyer" 
}

If you have spent any time on the internet you will of course know that the obvious use for an IMDB graph is to calculate Bacon numbers. So let’s see if we can find Kevin in here:

arangosh [_system]> db._query('FOR Person IN imdb_vertices FILTER Person.name == "Kevin Bacon" RETURN Person').toArray()
[ 
  { 
    "_id" : "imdb_vertices/759", 
    "_rev" : "1218713963", 
    "_key" : "759", 
    "version" : 146, 
    "id" : "4724", 
    "type" : "Person", 
    "biography" : "", 
    "label" : "Kevin Bacon", 
    "lastModified" : "1299491319000", 
    "name" : "Kevin Bacon", 
    "birthplace" : "Philadelphia", 
    "profileImageUrl" : "http://cf1.imgobject.com/profiles/3e0/4bed49cf017a3c37a30003e0/kevin-bacon-profi...", 
    "birthday" : "-362451600000" 
  } 
]

And let’s see if we can connect him to, say, Kate Winslet. We know that Kevin’s id is imdb_vertices/759, and a little digging shows that Kate’s is imdb_vertices/1088. We can pass those ids, along with the imdb_vertices and imdb_edges collections, to the SHORTEST_PATH function ArangoDB supplies to make the link between them:

arangosh [_system]> db._query('RETURN SHORTEST_PATH(imdb_vertices, imdb_edges, "imdb_vertices/759", "imdb_vertices/1088", "any", { maxIterations: 100000})').toArray()
[ 
  [ 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/759", 
        "_rev" : "1218713963", 
        "_key" : "759", 
        "version" : 146, 
        "id" : "4724", 
        "type" : "Person", 
        "biography" : "", 
        "label" : "Kevin Bacon", 
        "lastModified" : "1299491319000", 
        "name" : "Kevin Bacon", 
        "birthplace" : "Philadelphia", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/3e0/4bed49cf017a3c37a30003e0/kevin-bacon-profi...", 
        "birthday" : "-362451600000" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/35451", 
        "_rev" : "5779626347", 
        "_key" : "35451", 
        "runtime" : 87, 
        "version" : 186, 
        "id" : "9692", 
        "genre" : "Drama", 
        "language" : "en", 
        "type" : "Movie", 
        "homepage" : "", 
        "tagline" : "", 
        "title" : "The Woodsman", 
        "label" : "The Woodsman", 
        "description" : "A pedophile returns to his hometown after 12 years in prison and attempts to sta...", 
        "imdbId" : "tt0361127", 
        "lastModified" : "1301903901000", 
        "imageUrl" : "http://cf1.imgobject.com/posters/3c1/4bc9281e017a3c57fe0103c1/the-woodsman-mid.j...", 
        "studio" : "Dash Films", 
        "releaseDate" : "1103842800000", 
        "released" : "2000-2010" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/1179", 
        "_rev" : "1274747243", 
        "_key" : "1179", 
        "version" : 90, 
        "id" : "335", 
        "type" : "Person", 
        "biography" : "", 
        "label" : "Michael Shannon", 
        "lastModified" : "1299902807000", 
        "name" : "Michael Shannon", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/01c/4c2a3dc87b9aa15e9900001c/michael-shannon-p..." 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/21077", 
        "_rev" : "3892517227", 
        "_key" : "21077", 
        "runtime" : 119, 
        "version" : 339, 
        "id" : "4148", 
        "genre" : "Drama", 
        "language" : "en", 
        "type" : "Movie", 
        "homepage" : "", 
        "tagline" : "", 
        "title" : "Revolutionary Road", 
        "label" : "Revolutionary Road", 
        "description" : "A young couple living in a Connecticut suburb during the mid-1950s struggle to c...", 
        "imdbId" : "tt0959337", 
        "trailer" : "http://www.youtube.com/watch?v=af01__Kvvr8", 
        "lastModified" : "1301907499000", 
        "imageUrl" : "http://cf1.imgobject.com/posters/627/4d4f8e275e73d617b7003627/revolutionary-road...", 
        "studio" : "BBC Films", 
        "releaseDate" : "1229641200000", 
        "released" : "2000-2010" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/1088", 
        "_rev" : "1262754155", 
        "_key" : "1088", 
        "version" : 102, 
        "id" : "204", 
        "type" : "Person", 
        "label" : "Kate Winslet", 
        "lastModified" : "1299746700000", 
        "name" : "Kate Winslet", 
        "birthplace" : "Reading, UK", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/59f/4c022d0e017a3c702d00159f/kate-winslet-prof...", 
        "biography" : "<meta charset=\"utf-8\"><span style=\"font-family: sans-serif; font-size: 18px; lin...", 
        "birthday" : "181695600000" 
      } 
    } 
  ] 
]

So what we can see here is that it takes two hops (from Kevin to Michael Shannon via “The Woodsman”, and from Michael to Kate via “Revolutionary Road”) to connect Kevin Bacon to Kate Winslet, giving her a Bacon number of 2.
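
Since SHORTEST_PATH hands back the full list of vertices along the path (person, movie, person, movie, person), the Bacon number itself is just a little arithmetic on the path length. A sketch against the same collections (illustrative only; the exact return shape varies between ArangoDB versions):

// the path above has 5 vertices (3 people, 2 movies), so:
db._query('RETURN (LENGTH(SHORTEST_PATH(imdb_vertices, imdb_edges, "imdb_vertices/759", "imdb_vertices/1088", "any", {maxIterations: 100000})) - 1) / 2').toArray();
// => [ 2 ], Kate Winslet's Bacon number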

For the moment that is as far as I have gotten, but I am pretty excited to explore the possibilities here. The more I think about graphs as a data model, the more they seem to be a good fit for a lot of problems that I would normally be forcing into tables. Given that I can also just do straight document storage, and the fact that they have an Object Document Mapper that works with Rails, I can tell you that ArangoDB and I will be spending a lot of time together.

Why Virtualenv?

Virtualenv comes up often when learning about Python. It’s a Python library that creates a folder into which you install all the libraries your project will need. While it’s often stated that you should use it, it’s not often explained why. I recently stumbled upon a good intro that gives an example of creating an application that uses requests, and then describes the scenario where running sudo pip install --upgrade requests while working on a separate project breaks the first application.
The idea that updating a library in one project would/could break some/all of my other projects that rely on that library is bizarre and kind of terrifying. It’s nice that the solution to the problem is apparently Virtualenv, but why is this a problem to begin with?

The root of this problem seems to be Pip. If I install version 1.0 of the testing library nose, it gets placed (because I am using pyenv) in ~/.pyenv/versions/3.4.0/lib/python3.4/site-packages/. Looking in there, I can see folders for both the code and the metadata (the egg-info folder):

nose/                      nose-1.0.0-py3.4.egg-info/

If I run the pip install --upgrade command you can see the problem unfold:

mike@sleepycat:~/projects/play/python_play☺  pip install --upgrade nose
Downloading/unpacking nose from https://pypi.python.org/packages/source/n/nose/nose-1.3.1.tar.gz#md5=672398801ddf5ba745c55c6eed79c5aa
  Downloading nose-1.3.1.tar.gz (274kB): 274kB downloaded
...
Installing collected packages: nose
  Found existing installation: nose 1.0.0
    Uninstalling nose:
      Successfully uninstalled nose
  Running setup.py install for nose
...

Yup, Pip only installs a single version of a library on your system. A quick look back in the ~/.pyenv/versions/3.4.0/lib/python3.4/site-packages/ folder confirms what Pip’s insanely verbose output told us: our nose 1.0 is gone:

nose/                      nose-1.3.1-py3.4.egg-info/

This is pretty surprising for someone whose expectations have been shaped by Ruby’s package manager, Rubygems. You can see multiple versions of the same library coexisting in the interpreter’s gems folder, meaning that my old projects will still be able to use the old version while my new projects can use the newer one without carnage:

ls ~/.rbenv/versions/rbx-2.2.5/gems/gems/arel-
arel-3.0.3/ arel-4.0.1/

Returning to the reason for needing Virtualenv, at first glance it seems you need Virtualenv to protect you from Pip’s inability to handle multiple versions of a library. What’s interesting is that both Virtualenv and Pip were written by the same person, Ian Bicking: Virtualenv in 2007 and Pip in 2008. What this seems to suggest is that installing a single version is a design decision made because Pip assumes the existence/use of something like Virtualenv. This is especially true when you realize that Pip was aimed at replacing easy_install, an earlier tool which actually could install multiple versions of the same library, as Rubygems had done since 2003.

So if you have ever wondered why you need Virtualenv, it seems we have an answer. Pip has pretty much completely replaced previous package managers, and it was developed to assume Virtualenv or something similar is being used… and its assumptions essentially force you to use it.

For those of us starting out with Python, sorting out the ins and outs of the messy world of Python packaging is a pain. The old system seems to be broken, the current one using Virtualenv/Pip is hacky and far from ideal, and the future seems to be largely ignored. Fortunately the beginnings of a clean solution appear to be coming from the world of Docker, so we will have to watch that space carefully. In the meantime, I guess I will have to install Virtualenv…

The more things change

The more they stay the same. Microsoft has changed tack, from casting people who hate Internet Explorer as internet trolls (rather than every person who has ever tried to build a cross-browser website for themselves or a client) to casting it as a new browser.
Of course it’s a product of the same old Microsoft, and a quick comparison on caniuse shows it: Shadow DOM? Nope. Server-sent events? Nope. WebRTC? Nope.

Everything looks the same:

  • People are still writing shims to bring IE up to speed with the rest of the web.
  • IE is still adopting standards as slowly as possible, trying to stave off the inevitable moment where web apps compete directly with desktop apps and the OS ceases to mean anything.


I’m curious about the direction IE will take once Microsoft’s new CEO Satya Nadella has been around long enough to make his mark. Releasing Office for iPad in his first appearance as CEO is definitely a statement, as is rechristening “Windows Azure” as “Microsoft Azure” (since it’s aimed at more than just Windows…). We’ll have to wait to see what this more collaborative attitude means for IE. For the moment, the “new” IE, in spite of its rapid release cycle, looks a lot like the old IE when you compare it to Firefox and Chromium.

Changing keyboard layout options in Ubuntu 14.04

Back in 2012 I switched my caps-lock key to act as a second ESC key. This made a big impact in my Vim usage, and you can understand why when you see the keyboard vi was created with. Since I now rely on that little tweak, it was a little disconcerting to realize that the keyboard layout options I had used to switch my caps-lock were nowhere to be found in Ubuntu 14.04. It turns out that Gnome (upstream from Ubuntu) removed the settings from the system settings entirely.

Fortunately this is still accessible via the Gnome Tweak Tool.
You can install that like this:

sudo apt-get install gnome-tweak-tool

Then you can find all the old options under the “Typing” section.
The Gnome Tweak Tool’s Typing options.

It’s a little weird to have such useful stuff suddenly removed from the system settings. Hopefully they will find their way back in a future version; for the moment, my Vim crisis has been averted and that’s enough.

On working with Capybara

I’ve been writing Ruby since 2009 and while TDD as a process has long been clear to me, the morass of testing terminology has not. For a recent project I made pretty significant use of Capybara, and through it, Selenium. While it solved some problems I am not sure I could have solved any other way, it created others and on the way shed some light on some murky terminology.

I think mine was a pretty common pathway to working with Capybara: finding it via Rails and Rspec and a need to do an integration test.

The Rspec equivalent of an integration test is the Request Spec, and it’s often pointed out that its intended use is API testing, which wasn’t what I was doing. What’s held up as the user-focused complement to request specs is the feature spec, using Capybara.

The sudden appearance of “the feature” as the focus of these specs, and the brief description of “Acceptance test framework for web applications” at the top of the Capybara Github page should be our first signs that things have shifted a little.

This shift in terminology has technical implications which are not immediately obvious. The intent of Acceptance testing “is to guarantee that a customer’s requirements have been met and the system is acceptable”. Importantly, since “Acceptance tests are black box system tests”, this means testing the system from the outside via the UI.

It’s this “via the UI” part that should stand out, since it’s a far cry from the other kinds of tests common with Rails. Uncle Bob has said that testing via the UI “is always a bad idea”, and I got a taste of why pretty much right away. Let’s take a feature spec like this as an example:

    it "asks for user email" do
      visit '/'
      fill_in "user_email", with: "foo@example.com"
      click_button "user_submit"
      expect(page).to have_content "Thanks for your email!"
    end

Notice that suddenly I am stating expectations about the HTML page contents and looking for and manipulating page elements like forms and buttons.
The code for the user_submit button above would typically look like this in most Rails apps:

<%= f.submit "Submit", :class => "btn snazzy" %>

In Rails 3.0.19 that code would use the User class and the input type to create the id attribute automatically. Our click_button 'user_submit' from above finds the element by id and our test passes:

<input class="btn snazzy" id="user_submit" name="commit" type="submit" value="Submit">

In Rails 3.1.12, the same code outputs this:

<input class="btn snazzy" name="commit" type="submit" value="Submit">

There are patterns like page objects that can reduce the brittleness but testing the UI is something that only really makes sense when the system under test is a black box.

Contrary to the black box assumption of Acceptance tests, Rails integration tests access system internals like cookies, session and the assigns hash as well as asserting specific templates have been rendered. All of that is done via HTTP requests without reliance on clicking UI elements to move between the controllers under test.

Another problem comes from the use of Selenium. By default Capybara uses rack-test as its driver, but adding js: true switches to the javascript driver which defaults to Selenium:

    it "asks for user email", js: true do
      visit '/'
      fill_in "user_email", with: "foo@example.com"
      click_button "user_submit"
      expect(page).to have_content "Thanks for your email!"
    end

The unexpected consequences of this seemingly innocuous option come a month or two later when I try to run my test suite:

     Failure/Error: visit '/'
     Selenium::WebDriver::Error::WebDriverError:
       unable to obtain stable firefox connection in 60 seconds (127.0.0.1:7055)

What happened? Well, my package manager has updated my Firefox version to 26.0, and the selenium-webdriver version specified in my Gemfile.lock is 2.35.1.

Yes, with “js: true” I have made that test dependent on a particular version of Firefox, which is living outside of my version control and gets auto-updated by my package manager every six weeks.

While workarounds like bundle update selenium-webdriver or simply skipping tests tagged with js: true using rspec -t ~js:true are available, your default rspec command will always run all the tests. The need to use special options to avoid breakage is unlikely to be remembered/known by future selves/developers, so the solution seems to be keeping some sort of strict separation between your regular test suite and, minimally, any test that uses js, or ideally all Acceptance tests. I’m not sure what that might look like yet, but I’m thinking about it.

Acceptance testing differs far more than I initially appreciated from other types of testing, and like most things, when used as intended it’s pretty fabulous. The unsettling thing in all this was how easy it was to drift into Acceptance testing without explicitly meaning to. Including the Capybara DSL in integration tests certainly blurs the boundaries between two things that are already pretty easily confused. In most other places Capybara’s ostensible raison d’être seems to be largely glossed over. Matthew Robbins’ otherwise great book “Application Testing with Capybara” is not only conspicuously not called “Acceptance Testing with Capybara”, it only mentions the words “acceptance testing” twice.

Capybara is certainly nice to work with, and being able to translate a client’s “when I click here it breaks” almost verbatim into a test is amazing. I feel like I now have a better idea of how to enjoy that upside without the downside.