Querying the OpenStreetMap Dataset

While much has been written about putting data into OpenStreetMap (OSM), it doesn’t feel like much has been said about getting data out. For those familiar with GIS software, grabbing a “metro extract” is a reasonable place to start, but for developers and regular users it’s not quite as clear how to get at the data we can see is in there.

The first way to get at the data is with the Overpass API. Overpass was started by Roland Olbricht in 2008 as a way to ask for some specified subset of the OSM data.

Let’s say I was curious about the number of bike racks that can hold 8 bikes in downtown Ottawa. The first thing to know is that OSM data is XML, which means that each element (node/way/area/relation) looks something like this:

  <node id="3046036633" lat="45.4168480" lon="-75.7016922">
    <tag k="access" v="public"/>
    <tag k="amenity" v="bicycle_parking"/>
    <tag k="bicycle_parking" v="rack"/>
    <tag k="capacity" v="8"/>
  </node>

Basically, any element can have a bunch of tags, each containing a key and a value.

You specify which elements of the OSM dataset are interesting to you by creating an Overpass query in XML format or using a query language called Overpass QL. You can use either one, but I’m using XML here.

Here is a query asking for all elements of type “node” that have both a tag with a key of “amenity” and a value of “bicycle_parking”, and a tag with a key of “capacity” and a value of “8”. You can also see that my query includes a bbox-query element with north, east, south, and west coordinates supplied: the two corners of a bounding box, so the search is limited to that geographic area.

<osm-script output="json">
  <query type="node">
    <has-kv k="amenity" v="bicycle_parking"/>
    <has-kv k="capacity" v="8"/>
    <bbox-query e="-75.69105863571167" n="45.42274779392456" s="45.415714100972636" w="-75.70568203926086"/>
  </query>
  <print/>
</osm-script>

I’ve saved that query into a file named “query” and I am using cat to read the file and pass the text to curl, which sends the query:

mike@longshot:~/osm☺  cat query | curl -X POST -d @- http://overpass-api.de/api/interpreter{
  "version": 0.6,
  "generator": "Overpass API",
  "osm3s": {
    "timestamp_osm_base": "2014-08-27T18:47:01Z",
    "copyright": "The data included in this document is from www.openstreetmap.org. The data is made available under ODbL."
  },
  "elements": [

{
  "type": "node",
  "id": 3046036633,
  "lat": 45.4168480,
  "lon": -75.7016922,
  "tags": {
    "access": "public",
    "amenity": "bicycle_parking",
    "bicycle_parking": "rack",
    "capacity": "8"
  }
},
{
  "type": "node",
  "id": 3046036634,
  "lat": 45.4168354,
  "lon": -75.7017258,
  "tags": {
    "access": "public",
    "amenity": "bicycle_parking",
    "capacity": "8",
    "covered": "no"
  }
},
{
  "type": "node",
  "id": 3046036636,
  "lat": 45.4168223,
  "lon": -75.7017618,
  "tags": {
    "access": "public",
    "amenity": "bicycle_parking",
    "bicycle_parking": "rack",
    "capacity": "8"
  }
}

  ]
}

This is pretty exciting, but it’s worth pointing out that the response is plain JSON, not the GeoJSON you will probably want for doing things with Leaflet. The author is aware of the gap and apparently working on it, but in the meantime you will need the npm module osmtogeojson to convert what Overpass gives you into what Leaflet accepts.

So what might that get you? Well, let’s say you are trying to calculate the total amount of bike parking in downtown Ottawa. With a single API call (this time using Overpass QL, so it’s cut-and-paste friendly), we can tally up the capacity tags:

mike@longshot:~/osm☺  curl -s -g 'http://overpass-api.de/api/interpreter?data=[out:json];node["amenity"="bicycle_parking"](45.415714100972636,-75.70568203926086,45.42274779392456,-75.69105863571167);out;' | grep capacity | tr -d ',":' | sort | uniq -c
      2     capacity 10
      7     capacity 2
      6     capacity 8

Looks like more bike racks need to be tagged with “capacity”, but it’s a good start on coming up with a total.

Building on the Overpass API is the web-based Overpass-turbo. If you are a regular user trying to get some “how many of X in this area” type questions answered, this is the place to go. It’s also helpful for developers looking to work the kinks out of a query.

Using Overpass-Turbo to display my edits in the Ottawa area.

It’s really simple to get started using the wizard, which helps write a query for you. With a little fooling around with the styles you can do some really interesting stuff. As an example, we can colour the bicycle parking according to its capacity so we can see which ones have a capacity tag and which ones don’t. The query ends up looking like this:

<osm-script timeout="25">
  <!-- gather results -->
  <union>
    <!-- query part for: “amenity=bicycle_parking” -->
    <query type="node">
      <has-kv k="amenity" v="bicycle_parking"/>
      <bbox-query {{bbox}}/>
    </query>
    {{style:
      node[amenity=bicycle_parking]{ fill-opacity: 1; fill-color: grey;color: white;}
      node[capacity=2]{ fill-color: yellow; }
      node[capacity=8]{ fill-color: orange;}
      node[capacity=10]{fill-color: red;}
    }}
  </union>
  <print mode="body"/>
  <recurse type="down"/>
  <print mode="skeleton" order="quadtile"/>
</osm-script>

Bike racks with no capacity attribute will be grey. You can see the result here.

While Overpass-turbo might not be as sophisticated as CartoDB, it is really approachable and surprisingly capable. Highlighting certain nodes, picking out the edits of a particular user: there are lots of interesting applications.

Being able to query the OSM data easily opens some interesting possibilities. If you are gathering data for whatever reason, you are going to run into the problems of where to store it, and how to keep it up to date. One way of dealing with both of those is to store your data in OSM.

With all the thinking that has gone into what attributes can be attached to things like trees, bike racks, and public art, you can store a surprising amount of information in a single point. Once saved into the OSM dataset, you will always know where to find the most current version of your data, and backups are dealt with for you.

This approach also opens the door to other people helping you keep it up to date. Asking for volunteers or running hackathons to help you update your data is pretty reasonable when it also means improving a valuable public resource, instead of just enriching a single owner. Once the data is in OSM, the maintenance burden is easy to distribute.

When it’s time to revisit your question, fresh data will only ever be an Overpass query away…

Something to think about.

ArangoDB’s geo-spatial functions

I’ve been playing with ArangoDB a lot lately. As a document database it looks to be a drop-in replacement for MongoDB, but it goes further, allowing graph traversals and geo-spatial queries.

Since I have a geo-referenced data set in mind I wanted to get to know its geo-spatial functions. I found the documentation a little unclear so I thought I would write up my exploration here.

At the moment there are only two geo-spatial functions in Arango: WITHIN and NEAR. Let’s make some test data using the arango shell. Run arangosh and then the following:

db._create('cities')
db.cities.save({name: 'Ottawa', lat: 45.4215296, lng: -75.69719309999999})
db.cities.save({name: 'Montreal', lat: 45.5086699, lng: -73.55399249999999})
db.cities.save({name: 'São Paulo', lat: -23.5505199, lng: -46.63330939999999})

We will also need a geo-index for the functions to work. You can create one by passing in the name(s) of the fields that hold the latitude and longitude. In our case I just called them lat and lng so:

db.cities.ensureGeoIndex('lat', 'lng')

Alternately I could have done:

db.cities.save({name: 'Ottawa', location: [45.4215296, -75.69719309999999]})
db.cities.ensureGeoIndex('location')

As long as the values are of type double, life is good. If some documents in the collection don’t have the key(s) you specified for the index, they will just be ignored.

First up is the WITHIN function. It’s pretty much what you might expect: you give it a lat/lng and a radius and it gives you the documents within the area you specified. What is a little unexpected is that the radius is given in meters. So I am going to ask for the documents closest to the lat/lng of my favourite coffee shop (45.42890720357919, -75.68796873092651). To make the results more interesting I’ll ask for a 170000 meter radius (I know that Montreal is about 170 kilometers from Ottawa), so I should see both cities in the result set:

arangosh [_system]> db._createStatement({query: 'FOR city in WITHIN(cities, 45.42890720357919, -75.68796873092651, 170000) RETURN city'}).execute().toArray()
[ 
  { 
    "_id" : "cities/393503132620", 
    "_rev" : "393503132620", 
    "_key" : "393503132620", 
    "lat" : 45.4215296, 
    "lng" : -75.69719309999999, 
    "name" : "Ottawa" 
  }, 
  { 
    "_id" : "cities/393504967628", 
    "_rev" : "393504967628", 
    "_key" : "393504967628", 
    "lat" : 45.5086699, 
    "lng" : -73.55399249999999, 
    "name" : "Montreal" 
  } 
]

There is also an optional “distancename” parameter which, when given, prompts Arango to add to each document the number of meters it is from your target point. We can use that like this:

arangosh [_system]> db._createStatement({query: 'FOR city in WITHIN(cities, 45.42890720357919, -75.68796873092651, 170000, "distance_from_artissimo_cafe") RETURN city'}).execute().toArray()
[ 
  { 
    "_id" : "cities/393503132620", 
    "_rev" : "393503132620", 
    "_key" : "393503132620", 
    "distance_from_artissimo_cafe" : 1091.4226157106734, 
    "lat" : 45.4215296, 
    "lng" : -75.69719309999999, 
    "name" : "Ottawa" 
  }, 
  { 
    "_id" : "cities/393504967628", 
    "_rev" : "393504967628", 
    "_key" : "393504967628", 
    "distance_from_artissimo_cafe" : 166640.3086328647, 
    "lat" : 45.5086699, 
    "lng" : -73.55399249999999, 
    "name" : "Montreal" 
  } 
]

Arango’s NEAR function returns a set of documents ordered by their distance in meters from the lat/lng you provide. The number of documents in the set is controlled by the optional “limit” argument (which defaults to 100), and it accepts the same “distancename” argument as above. I am going to limit the result set to 3 (I only have 3 records in there anyway), and use my coffee shop again:

arangosh [_system]> db._createStatement({query: 'FOR city in NEAR(cities, 45.42890720357919, -75.68796873092651, 3, "distance_from_artissimo_cafe") RETURN city'}).execute().toArray()
[ 
  { 
    "_id" : "cities/393503132620", 
    "_rev" : "393503132620", 
    "_key" : "393503132620", 
    "distance_from_artissimo_cafe" : 1091.4226157106734, 
    "lat" : 45.4215296, 
    "lng" : -75.69719309999999, 
    "name" : "Ottawa" 
  }, 
  { 
    "_id" : "cities/393504967628", 
    "_rev" : "393504967628", 
    "_key" : "393504967628", 
    "distance_from_artissimo_cafe" : 166640.3086328647, 
    "lat" : 45.5086699, 
    "lng" : -73.55399249999999, 
    "name" : "Montreal" 
  }, 
  { 
    "_id" : "cities/393506343884", 
    "_rev" : "393506343884", 
    "_key" : "393506343884", 
    "distance_from_artissimo_cafe" : 8214463.292795454, 
    "lat" : -23.5505199, 
    "lng" : -46.63330939999999, 
    "name" : "São Paulo" 
  } 
]

As you can see, ArangoDB’s geo-spatial functionality is sparse but certainly enough to do some interesting things. Being able to act as a graph database AND do geo-spatial queries places Arango in a really interesting position, and I am hoping to see its capabilities in both those areas expand. I’ve sent a feature request for WITHIN_BOUNDS, which I think would make working with leaflet.js or Google Maps really nice, since it would save me doing a bunch of calculations with the map centre and the current zoom level to figure out a radius in meters for my query. I’ll keep my fingers crossed…
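In the meantime, the workaround looks something like this: take the centre of the visible map and one of its corners, and get the distance between them in meters with the haversine formula. This is just a sketch, not tied to any particular mapping library, using the downtown Ottawa bounding box from earlier:

# A rough haversine sketch for turning a map's bounding box into a radius in
# meters for WITHIN; the coordinates are the downtown Ottawa bbox from earlier.
EARTH_RADIUS_M = 6_371_000.0

def haversine_meters(lat1, lng1, lat2, lng2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlng = (lng2 - lng1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlng / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
end

centre    = [45.4192, -75.6984]  # middle of the visible map
ne_corner = [45.4227, -75.6911]  # north-east corner of the visible map
radius = haversine_meters(centre[0], centre[1], ne_corner[0], ne_corner[1])
# => roughly 700 meters, ready to use as the radius argument to WITHIN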

Getting to know the Firefox developer tools

Back in 2011 things were not looking good for developer tools in Firefox. Firebug development had slowed and its lead developer took a job with Google after IBM lost interest in funding his work on the project. Mozilla was already working on moving debugging tools into Firefox core, but the new dev tools were pretty uninspiring compared to what Chrome had. Pretty much every developer I know, myself included, eventually ended up using Chrome for development work, which translated into using Chrome all the time.

Well, I’ve been spending more time in Firefox lately, and am happy to see Mozilla has been rapidly closing the gap with Chrome on the dev tools front.

Firefox developer tools

One of the major frustrations with the Firefox dev tools was removed with Firefox 29’s new ability to disable the cache. Strangely there does not seem to be a way to set this permanently (it’s forgotten each time you close the dev tools!), but at least it exists.

Finally a way to disable the cache!

The challenge of the dev tools is in presenting a huge amount of information to the user in as compact a way as possible. After working with the Firefox dev tools a little it feels like the focus is less on piling in features trying to match Chrome and more on clear presentation of the essentials. “Clarity over parity” I suppose you could say. This approach is really visible if you compare the Network timings in Firefox and Chrome:

The network timings from Firefox (top) and Chrome.

I think it’s far more clear in Firefox that the total time for a request is the sum of a few different times (DNS resolution, connecting, sending, waiting, receiving), while it’s really not clear what is going on in Chrome until you start digging.

One thing I was happy to notice was that the famous Tilt addon has been moved into core and incorporated into the dev tools.

See your page in 3D

While this might have been written off initially as a WebGL demo or a bit of a gimmick, I think it’s super useful for finding and eliminating unnecessary nesting of elements, so I am really glad to see it find a home here.

Hacker News in 3D

While the responsive design mode is really nice, I like to be able to debug pages live on my phone. Fortunately Mozilla has made that possible using the adb tool that the Android SDK provides. On Ubuntu you can install it from the repos:

mike@sleepycat:~☺  sudo apt-get install android-tools-adb
Setting up android-tools-adb (4.2.2+git20130218-3ubuntu16)
...

Then you will need to enable remote debugging in your Firefox mobile settings:

Enabling remote debugging in the Firefox mobile settings

With that done you should be able to see the mobile browser in adb’s list of devices:

mike@sleepycat:~☺  adb devices
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
List of devices attached 
0149B33B12005018	device

To connect, run adb forward tcp:6000 tcp:6000 and then choose Tools > Web Developer > Connect… from your desktop Firefox’s menu. It will bring up a page like this:

Connecting to Firefox mobile for remote debugging

When you click connect you should see this on the screen of your mobile device:

The incoming connection prompt on the mobile device

While the dev tools are still missing a few features I like from the Chrome dev tools (mostly being able to get a list of unused CSS declarations), all my day-to-day use cases are well covered. In fact, while I rarely used network timings before, the clearer presentation of them has made me realize how much valuable information was in there.

It’s been good to dig into these tools and find so much good stuff going on. It feels like there was a bit of a rocky transition from Firebug to the new dev tools, but clearly Mozilla has found its feet. I’m looking forward to exploring further.

Getting started with graph databases

I have a personal project I have been chipping away on for a little while now. I’ve been slowly adding more and more test data to it, and as I do it’s become increasingly clear that while the data itself is neat, the interesting stuff is actually the relationships between the various entities and not so much the entities themselves. This realization led me to do some reading about graph databases. O’Reilly (as usual) has an interesting book on graph databases written by Ian Robinson, Jim Webber, and Emil Eifrem. It’s a good intro, but given that they hold positions of engineer, chief scientist and CEO at the company that makes the Neo4j graph database, it’s unsurprisingly focused on Neo4j.

Unfortunately the ‘j’ part of Neo4j refers to Java, which is a whole can of worms that I would rather not open. So I set off to look for a graph database that would not force me onto the JVM or trap me with open-core licencing, and ultimately found ArangoDB.

Licenced under Apache 2, ArangoDB (formerly AvocadoDB) is a document database, in the same vein as MongoDB. What’s interesting is that it can also do key/value stuff like Redis and graphs like Neo4j. Since it’s written in C++ I don’t have to worry about the JVM. So, let’s get started with it!

Installation is painless on Ubuntu:

wget -qO - http://www.arangodb.org/repositories/arangodb2/xUbuntu_13.10/Release.key | sudo apt-key add -
sudo sh -c "echo 'deb http://www.arangodb.org/repositories/arangodb2/xUbuntu_13.10/ /' > /etc/apt/sources.list.d/arangodb.list"
sudo apt-get update && sudo apt-get install arangodb

Since this is my first contact with graphs, I want a dataset I can use to get a feel for working with them. Fortunately the company behind ArangoDB (triAGENS) has put some sample data up on GitHub to get people started:

$> git clone https://github.com/triAGENS/ArangoDB-Data.git
Cloning into 'ArangoDB-Data'...
...
$> cd ArangoDB-Data/Graphs/IMDB
$> ./import.sh

That import script loads a bunch of IMDB data into ArangoDB, which means we can start exploring with the arango shell:

$> arangosh

                                       _     
  __ _ _ __ __ _ _ __   __ _  ___  ___| |__  
 / _` | '__/ _` | '_ \ / _` |/ _ \/ __| '_ \ 
| (_| | | | (_| | | | | (_| | (_) \__ \ | | |
 \__,_|_|  \__,_|_| |_|\__, |\___/|___/_| |_|
                       |___/                 

Welcome to arangosh 2.0.2 [linux]. Copyright (c) triAGENS GmbH
Using Google V8 3.16.14 JavaScript engine, READLINE 6.2, ICU 4.8.1.1

Pretty printing values.
Connected to ArangoDB 'tcp://localhost:8529' version: 2.0.2, database: '_system', username: 'root'

use 'help' to see common examples
arangosh [_system]>

Tab completion works super well here to give a sense of what your options are, but the first thing we care about is figuring out what that import did for us. You can see it created two collections (imdb_vertices and imdb_edges) with the db._collections() function:

arangosh [_system]> db._collections()
[ 
  [ArangoCollection 3021163, "_aal" (type document, status loaded)], 
  [ArangoCollection 1317227, "_graphs" (type document, status loaded)], 
  [ArangoCollection 3545451, "_replication" (type document, status loaded)], 
  [ArangoCollection 137579, "_users" (type document, status loaded)], 
  [ArangoCollection 1513835, "_cluster_kickstarter_plans" (type document, status loaded)], 
  [ArangoCollection 940644715, "vertices" (type document, status loaded)], 
  [ArangoCollection 3414379, "_aqlfunctions" (type document, status loaded)], 
  [ArangoCollection 1382763, "_modules" (type document, status loaded)], 
  [ArangoCollection 3610987, "_statistics" (type document, status loaded)], 
  [ArangoCollection 1160255851, "imdb_vertices" (type document, status loaded)], 
  [ArangoCollection 940710251, "edges" (type edge, status loaded)], 
  [ArangoCollection 3479915, "_trx" (type document, status loaded)], 
  [ArangoCollection 266194196843, "imdb_edges" (type edge, status loaded)], 
  [ArangoCollection 1448299, "_routing" (type document, status loaded)] 
]

We can also pick random documents out of the imdb_vertices collection with the .any() function to get a sense of what’s in there.

 db.imdb_vertices.any()
{ 
  "_id" : "imdb_vertices/40233", 
  "_rev" : "6407199083", 
  "_key" : "40233", 
  "version" : 21, 
  "id" : "65952", 
  "type" : "Person", 
  "birthplace" : "", 
  "biography" : "", 
  "label" : "Jude Poyer", 
  "lastModified" : "1301901667000", 
  "name" : "Jude Poyer" 
}

If you have spent any time on the internet you will of course know that the obvious use for an IMDB graph is calculating Bacon numbers. So let’s see if we can find Kevin in here:

arangosh [_system]> db._query('FOR Person IN imdb_vertices FILTER Person.name == "Kevin Bacon" RETURN Person').toArray()
[ 
  { 
    "_id" : "imdb_vertices/759", 
    "_rev" : "1218713963", 
    "_key" : "759", 
    "version" : 146, 
    "id" : "4724", 
    "type" : "Person", 
    "biography" : "", 
    "label" : "Kevin Bacon", 
    "lastModified" : "1299491319000", 
    "name" : "Kevin Bacon", 
    "birthplace" : "Philadelphia", 
    "profileImageUrl" : "http://cf1.imgobject.com/profiles/3e0/4bed49cf017a3c37a30003e0/kevin-bacon-profi...", 
    "birthday" : "-362451600000" 
  } 
]

And let’s see if we can connect him to, say, Kate Winslet. We know that Kevin’s id is imdb_vertices/759, and a little digging shows that Kate’s is imdb_vertices/1088. We can pass those ids, along with the imdb_vertices and imdb_edges collections, to the SHORTEST_PATH function ArangoDB supplies for it to make the link between them:

arangosh [_system]> db._query('RETURN SHORTEST_PATH(imdb_vertices, imdb_edges, "imdb_vertices/759", "imdb_vertices/1088", "any", { maxIterations: 100000})').toArray()
[ 
  [ 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/759", 
        "_rev" : "1218713963", 
        "_key" : "759", 
        "version" : 146, 
        "id" : "4724", 
        "type" : "Person", 
        "biography" : "", 
        "label" : "Kevin Bacon", 
        "lastModified" : "1299491319000", 
        "name" : "Kevin Bacon", 
        "birthplace" : "Philadelphia", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/3e0/4bed49cf017a3c37a30003e0/kevin-bacon-profi...", 
        "birthday" : "-362451600000" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/35451", 
        "_rev" : "5779626347", 
        "_key" : "35451", 
        "runtime" : 87, 
        "version" : 186, 
        "id" : "9692", 
        "genre" : "Drama", 
        "language" : "en", 
        "type" : "Movie", 
        "homepage" : "", 
        "tagline" : "", 
        "title" : "The Woodsman", 
        "label" : "The Woodsman", 
        "description" : "A pedophile returns to his hometown after 12 years in prison and attempts to sta...", 
        "imdbId" : "tt0361127", 
        "lastModified" : "1301903901000", 
        "imageUrl" : "http://cf1.imgobject.com/posters/3c1/4bc9281e017a3c57fe0103c1/the-woodsman-mid.j...", 
        "studio" : "Dash Films", 
        "releaseDate" : "1103842800000", 
        "released" : "2000-2010" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/1179", 
        "_rev" : "1274747243", 
        "_key" : "1179", 
        "version" : 90, 
        "id" : "335", 
        "type" : "Person", 
        "biography" : "", 
        "label" : "Michael Shannon", 
        "lastModified" : "1299902807000", 
        "name" : "Michael Shannon", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/01c/4c2a3dc87b9aa15e9900001c/michael-shannon-p..." 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/21077", 
        "_rev" : "3892517227", 
        "_key" : "21077", 
        "runtime" : 119, 
        "version" : 339, 
        "id" : "4148", 
        "genre" : "Drama", 
        "language" : "en", 
        "type" : "Movie", 
        "homepage" : "", 
        "tagline" : "", 
        "title" : "Revolutionary Road", 
        "label" : "Revolutionary Road", 
        "description" : "A young couple living in a Connecticut suburb during the mid-1950s struggle to c...", 
        "imdbId" : "tt0959337", 
        "trailer" : "http://www.youtube.com/watch?v=af01__Kvvr8", 
        "lastModified" : "1301907499000", 
        "imageUrl" : "http://cf1.imgobject.com/posters/627/4d4f8e275e73d617b7003627/revolutionary-road...", 
        "studio" : "BBC Films", 
        "releaseDate" : "1229641200000", 
        "released" : "2000-2010" 
      } 
    }, 
    { 
      "vertex" : { 
        "_id" : "imdb_vertices/1088", 
        "_rev" : "1262754155", 
        "_key" : "1088", 
        "version" : 102, 
        "id" : "204", 
        "type" : "Person", 
        "label" : "Kate Winslet", 
        "lastModified" : "1299746700000", 
        "name" : "Kate Winslet", 
        "birthplace" : "Reading, UK", 
        "profileImageUrl" : "http://cf1.imgobject.com/profiles/59f/4c022d0e017a3c702d00159f/kate-winslet-prof...", 
        "biography" : "<meta charset=\"utf-8\"><span style=\"font-family: sans-serif; font-size: 18px; lin...", 
        "birthday" : "181695600000" 
      } 
    } 
  ] 
]

So what we can see here is that it takes two hops (from Kevin to Michael Shannon via “The Woodsman“, and from Michael to Kate via “Revolutionary Road“) to connect Kevin Bacon to Kate Winslet, giving her a Bacon number of 2.

For the moment that is as far as I have gotten, but I am pretty excited to explore the possibilities here. The more I think about graphs as a data model, the more they seem to be a good fit for a lot of problems that I would normally be forcing into tables. Given that I can also just do straight document storage, and the fact that they have an Object Document Mapper that works with Rails, I can tell you that ArangoDB and I will be spending a lot of time together.

Why Virtualenv?

Virtualenv comes up often when learning about Python. It’s a Python library that creates a folder into which you install all the libraries your project will need. While it’s often stated that you should use it, it’s not often explained why. I recently stumbled upon a good intro that gives an example of creating an application that uses requests, and then gives the scenario where running sudo pip install --upgrade requests while working on a separate project breaks the first application.
The idea that updating a library in one project would/could break some/all of my other projects that rely on that library is bizarre and kind of terrifying. It’s nice that the solution to the problem is apparently Virtualenv, but why is this a problem to begin with?

The root of this problem seems to be Pip. If I install version 1.0 of the testing library nose, it gets placed (because I am using pyenv) in ~/.pyenv/versions/3.4.0/lib/python3.4/site-packages/. Looking in there, I can see folders for both the code and the metadata (the egg-info folder):

nose/                      nose-1.0.0-py3.4.egg-info/

If I run the pip install --upgrade command you can see the problem unfold:

mike@sleepycat:~/projects/play/python_play☺  pip install --upgrade nose
Downloading/unpacking nose from https://pypi.python.org/packages/source/n/nose/nose-1.3.1.tar.gz#md5=672398801ddf5ba745c55c6eed79c5aa
  Downloading nose-1.3.1.tar.gz (274kB): 274kB downloaded
...
Installing collected packages: nose
  Found existing installation: nose 1.0.0
    Uninstalling nose:
      Successfully uninstalled nose
  Running setup.py install for nose
...

Yup, Pip only installs a single version of a library on your system. A quick look back in the ~/.pyenv/versions/3.4.0/lib/python3.4/site-packages/ folder confirms what Pip’s insanely verbose output told us: our nose 1.0 is gone:

nose/                      nose-1.3.1-py3.4.egg-info/

This is pretty surprising for someone whose expectations have been shaped by Ruby’s package manager, RubyGems. You can see multiple versions of the same library coexisting in the interpreter’s gems folder, meaning that my old projects will still be able to use the old version while my new projects can use the newer one without carnage:

ls ~/.rbenv/versions/rbx-2.2.5/gems/gems/arel-
arel-3.0.3/ arel-4.0.1/

Returning to the reason for needing Virtualenv: at first glance it seems you need Virtualenv to protect you from Pip’s inability to handle multiple versions of a library. What’s interesting is that both Virtualenv and Pip were written by the same person, Ian Bicking, Virtualenv in 2007 and Pip in 2008. What this seems to suggest is that installing a single version is a design decision, made because Pip assumes the existence/use of something like Virtualenv. This is especially true when you realize that Pip was aimed at replacing easy_install, an earlier tool which actually could install multiple versions of the same library, as RubyGems had done since 2003.

So if you have ever wondered why you need Virtualenv, it seems we have an answer. Pip has pretty much completely replaced previous package managers, and it was developed to assume Virtualenv or something similar is being used… and its assumptions essentially force you to use it.

For those of us starting out with Python, sorting out the ins and outs of the messy world of Python packaging is a pain. The old system seems to be broken, the current one using Virtualenv/Pip is hacky and far from ideal, and the future seems to be largely ignored. Fortunately the beginnings of a clean solution appear to be coming from the world of Docker, so we will have to watch that space carefully. In the meantime, I guess I will have to install Virtualenv…

The more things change

The more they stay the same. Microsoft has changed its tack, from casting people who hate Internet Explorer as internet trolls (rather than every person who has ever tried to build a cross-browser website for themselves or a client) to casting it as a new browser.
Of course it’s a product of the same old Microsoft, and a quick comparison on caniuse shows it: Shadow DOM? Nope. Server-sent events? Nope. WebRTC? Nope.

Everything looks the same:

People are still writing shims to bring IE up to speed with the rest of the web.

IE is still adopting standards as slowly as possible, trying to stave off the inevitable moment where web apps compete directly with desktop apps and the OS ceases to mean anything.

IE vs. the internet

I’m curious about the direction IE will take once Microsoft’s new CEO Satya Nadella has been around long enough to make his mark. Releasing Office for iPad in his first appearance as CEO is definitely a statement, as is rechristening “Windows Azure” as “Microsoft Azure” (since it’s aimed at more than just Windows…). We’ll have to wait to see what this more collaborative attitude means for IE. For the moment, the “new” IE, in spite of its rapid release cycle, looks a lot like the old IE when you compare it to Firefox and Chromium.

Changing keyboard layout options in Ubuntu 14.04

Back in 2012 I switched my caps-lock key to act as a second ESC key. This made a big impact on my Vim usage, and you can understand why when you see the keyboard vi was created with. Since I now rely on that little tweak, it was a little disconcerting to realize that the keyboard layout options I had used to switch my caps-lock were nowhere to be found in Ubuntu 14.04. It turns out that Gnome (upstream from Ubuntu) removed the settings from the system settings entirely.

Fortunately this is still accessible via the Gnome Tweak Tool.
You can install that like this:

sudo apt-get install gnome-tweak-tool

Then you can find all the old options under the “typing” option.
gnome tweak tool

It’s a little weird to have such useful stuff suddenly removed from the system settings. Hopefully they will find their way back in a future version; for the moment, my Vim crisis has been averted and that’s enough.

On working with Capybara

I’ve been writing Ruby since 2009 and while TDD as a process has long been clear to me, the morass of testing terminology has not. For a recent project I made pretty significant use of Capybara, and through it, Selenium. While it solved some problems I am not sure I could have solved any other way, it created others and on the way shed some light on some murky terminology.

I think mine was a pretty common pathway to working with Capybara: finding it via Rails and RSpec and a need to do an integration test.

The RSpec equivalent of an integration test is the request spec, and it’s often pointed out that its intended use is API testing, which wasn’t what I was doing. What’s held up as the user-focused complement to request specs are feature specs using Capybara.

The sudden appearance of “the feature” as the focus of these specs, and the brief description of “Acceptance test framework for web applications” at the top of the Capybara GitHub page, should be our first signs that things have shifted a little.

This shift in terminology has technical implications which are not immediately obvious. The intent of Acceptance testing “is to guarantee that a customers requirements have been met and the system is acceptable“. Importantly, since “Acceptance tests are black box system tests“, this means testing the system from the outside via the UI.

It’s this “via the UI” part that should stand out, since it’s a far cry from the other kinds of tests common with Rails. Uncle Bob has said that testing via the UI “is always a bad idea” and I got a taste of why pretty much right away. Let’s take a feature spec like this as an example:

    it "asks for user email" do
      visit '/'
      fill_in "user_email", with: "foo@example.com"
      click_button "user_submit"
      expect(page).to have_content "Thanks for your email!"
    end

Notice that suddenly I am stating expectations about the HTML page contents and looking for and manipulating page elements like forms and buttons.
The code for the user_submit button above would typically look like this in most Rails apps:

<%= f.submit "Submit", :class => "btn snazzy" %>

In Rails 3.0.19 that code would use the User class and the input type to create the id attribute automatically. Our click_button 'user_submit' from above finds the element by id and our test passes:

<input class="btn snazzy" id="user_submit" name="commit" type="submit" value="Submit">

In Rails 3.1.12, the same code outputs this:

<input class="btn snazzy" name="commit" type="submit" value="Submit">

Our click_button 'user_submit' can no longer find anything, and the test breaks even though nothing a user would notice has changed. There are patterns like page objects that can reduce the brittleness, but testing via the UI is something that only really makes sense when the system under test is a black box.
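As a rough illustration (a sketch, not code from this project), a page object for the form above wraps the Capybara calls so the spec stops caring about ids and markup:

# A sketch of the page object pattern, reusing the selectors from the spec above.
class SignupPage
  include Capybara::DSL

  def visit_page
    visit '/'
    self
  end

  def sign_up(email)
    fill_in 'user_email', with: email
    click_button 'user_submit'
    self
  end

  def thanked?
    has_content? 'Thanks for your email!'
  end
end

# The spec then talks about intent instead of markup:
# expect(SignupPage.new.visit_page.sign_up("foo@example.com")).to be_thanked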

Contrary to the black box assumption of Acceptance tests, Rails integration tests access system internals like cookies, session and the assigns hash as well as asserting specific templates have been rendered. All of that is done via HTTP requests without reliance on clicking UI elements to move between the controllers under test.
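For contrast, here is a minimal sketch of what a plain Rails integration test of that same flow might look like, using Rails 3/4-era syntax; the route, template name and session key are made up for illustration:

require 'test_helper'

# test/integration/signup_test.rb -- route, template and session key are hypothetical
class SignupTest < ActionDispatch::IntegrationTest
  test "signing up stores the email" do
    post "/users", user: { email: "foo@example.com" }

    assert_response :success
    assert_template "users/thanks"                         # which template rendered
    assert_equal "foo@example.com", assigns(:user).email   # controller internals
    assert_not_nil session[:user_id]                       # session access, no UI involved
  end
end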

Another problem comes from the use of Selenium. By default Capybara uses rack-test as its driver, but adding js: true switches to the JavaScript driver, which defaults to Selenium:

    it "asks for user email", js: true do
      visit '/'
      fill_in "user_email", with: "foo@example.com"
      click_button "user_submit"
      expect(page).to have_content "Thanks for your email!"
    end

The unexpected consequences of this seemingly innocuous option come a month or two later when I try to run my test suite:

     Failure/Error: visit '/'
     Selenium::WebDriver::Error::WebDriverError:
       unable to obtain stable firefox connection in 60 seconds (127.0.0.1:7055)

What happened? Well, my package manager has updated my Firefox version to 26.0, and the selenium-webdriver version specified in my Gemfile.lock is 2.35.1.

Yes, with “js: true” I have made that test dependent on a particular version of Firefox, which is living outside of my version control and gets auto-updated by my package manager every six weeks.

While workarounds like bundle update selenium-webdriver or simply skipping tests tagged with js: true using rspec -t ~js:true are available, your default rspec command will always run all the tests. The need to use special options to avoid breakage is unlikely to be remembered/known by future selves/developers, so the solution seems to be keeping some sort of strict separation between your regular test suite and, minimally, any test that uses js, or ideally all Acceptance tests. I’m not sure what that might look like yet, but I’m thinking about it.
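One rough possibility, purely as a sketch and not something I have settled on: exclude anything tagged js: true from the default run and only include it when explicitly asked for.

# spec/spec_helper.rb -- one possible approach, not a settled solution
RSpec.configure do |config|
  # A plain `rspec` run skips the Selenium-backed specs;
  # opt in with: ACCEPTANCE=true rspec
  config.filter_run_excluding js: true unless ENV['ACCEPTANCE']
end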

Acceptance testing differs far more than I initially appreciated from other types of testing, and like most things, when used as intended it’s pretty fabulous. The unsettling thing in all this was how easy it was to drift into Acceptance testing without explicitly meaning to. Including the Capybara DSL in integration tests certainly blurs the boundaries between two things that are already pretty easily confused. In most other places Capybara’s ostensible raison d’être seems to be largely glossed over. Matthew Robbins’ otherwise great book “Application Testing with Capybara“ is not only conspicuously not called “Acceptance testing with Capybara”, it only mentions the words “Acceptance testing” twice.

Capybara is certainly nice to work with, and being able to translate a client’s “when I click here it breaks” almost verbatim into a test is amazing. I feel like I now have a better idea of how to enjoy that upside without the downside.

Web scraping with Ruby

Writing a web scraper seems to be almost a rite of passage as a programmer, and recently I found it was my turn to write one. Judging from the libraries I looked at, the term “web scraping” seems to refer to a spectrum of functionality: from web crawling on one end to processing the results on the other, and often some combination of the two.

I looked at a few libraries and all of them seemed to either do too much or not enough relative to what I had in mind, so as millions of programmers before me have done, I wrote my own.

The idea was very much oriented around extracting a set of structured information from a given page, and to that end I wrote a little DSL to get it done.
The library is called (very creatively) Skrape.

Assuming you have a page like this at the address example.com:

<html><body><h1>I am a title</h1></body></html>

You can scrape the title off the page with this:

results = Skrape::Page.new("http://example.com").extract do
  extract_title with: 'h1'
end

The results would be:

{title: "I am a title"}

The calls to “extract_*” are caught with method_missing and whatever follows the “extract_” is used as the key in the hash of results that is returned.
To deal with the inevitable edge cases that come up so often in scraping, you can also pass a proc which will be handed whatever the CSS selector found, so you can do some further processing.
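If you are curious how that trick works, here is a simplified sketch of a method_missing-based DSL along those lines (this is not the actual Skrape source; it assumes Nokogiri and open-uri under the hood):

require 'open-uri'
require 'nokogiri'

# A simplified sketch of a method_missing based DSL, not the real Skrape code.
class SketchPage
  def initialize(url)
    @doc = Nokogiri::HTML(open(url))
    @results = {}
  end

  def extract(&block)
    instance_eval(&block)  # run the DSL block against this page
    @results
  end

  # extract_title(with: 'h1') becomes results[:title] = whatever 'h1' finds
  def method_missing(name, options = {})
    return super unless name.to_s.start_with?('extract_')
    key   = name.to_s.sub('extract_', '').to_sym
    found = @doc.css(options[:with])
    @results[key] = options[:and_run] ? options[:and_run].call(found) : found.text
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?('extract_') || super
  end
end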

I’ve needed to use it for picking out the href attribute of a link:

results = Skrape::Page.new(url).extract do
  extract_link_href with: 'a', and_run: proc {|link| link.attr('href').value }
end

And also for removing problematic <br> tags from elements:

results = Skrape::Page.new(url).extract do
  extract_locations with: '.address', and_run: proc {|address| address.map{|a| a.inner_html.gsub('<br>', ', ')} }
end

While there are still some improvements I would like to make, so far I am pretty happy with the library. It feels readable and does not do too much. If you have some scraping to do, check it out. Pull requests welcome.

An intro to injection attacks

I’ve found myself explaining SQL injection attacks to people a few times lately and thought I would write up something I can just point to instead.

For illustration purposes let’s make a toy authentication system.

Let’s say you have a database with a table for all your users that looks like this:

The structure of table "users":
+-----+-----------+----------------+
| uid | username  | password       |
+-----+-----------+----------------+
|   1 | mike      | catfish        |
+-----+-----------+----------------+
|   2 | sally     | floride        |
+-----+-----------+----------------+
|   3 | akira     | pickles        |
+-----+-----------+----------------+

Let’s say a user wants to use your application, so you ask them for their username and password, and they give you ‘akira’ and ‘pickles’.

The next step in authenticating this user is to check in the database to see if we have a user with both a username of ‘akira’ and a password of ‘pickles’, which we can do with a SQL statement like this:

select * from users where username = 'akira' and password = 'pickles';

Since we know we have one row that satisfies both of the conditions we set (value in username must equal ‘akira’ and the value in password must equal ‘pickles’), if we hand that string of text to the database and ask it to execute it, we would expect the database to return the following data:

+-----+-----------+----------------+
| uid | username  | password       |
+-----+-----------+----------------+
|   3 | akira     | pickles        |
+-----+-----------+----------------+

If the database returns a row we can let the user go ahead and use our application.

Of course we need our SQL statement to work for everyone, so we can’t just write ‘akira’ in there. Let’s replace the username and password with variables (PHP style):

select * from users where username = '$username' and password = '$password';

Now if someone logs in with ‘mike’ and ‘catfish’, our application will place the value ‘mike’ in the variable $username and ‘catfish’ in the variable $password, and the PHP interpreter will substitute the variable names with the actual values to create the finished string, which looks like this:

select * from users where username = 'mike' and password = 'catfish';

This will be passed to the database, which executes the command and returns a row.

Unfortunately, mixing data supplied by the user with our pre-existing SQL commands and passing the result to the database as one single string has set the stage for some bad behaviour:

$username = "";
$password = "' or username like '%m%";

select * from users where username = '$username' and password = '$password';

Once the final string is assembled suddenly the meaning is very different:

select * from users where username = '' and password = '' or username like '%m%';

Now our database will return a row if the username AND password are both empty strings OR if any of the usernames contains the letter ‘m’. Suddenly the bar for logging into our application is a lot lower.

There are tonnes of possible variations on this, and it’s a common enough problem that it’s the stuff of jokes among programmers:

XKCD: Little bobby tables.

The root of the problem is the commingling of user-supplied data with commands intended to be executed. The user-supplied password value of “' or username like '%m%” entirely changed the meaning of our SQL command. If we had been able to make it clear that this was just a string to search the database for, we would have had the expected behaviour of comparing the string “' or username like '%m%” to the strings in our list of passwords (‘pickles’, ‘catfish’ and ‘floride’).

If you think about it like that you realize that this is not just a problem with SQL, but a problem that shows up everywhere data and commands are mixed.

Keeping these things separate is the only way to stay safe. When the data is kept separate it can be sanitized, or singled out for whatever special treatment is appropriate for the situation by libraries/drivers or whatever else.

Separating data from commands can look pretty different depending on what you are working on. Keeping data and commands separate when executing system commands in Ruby or Node.js looks different from keeping them separate using prepared statements to safely query a database with JRuby/Java. The same rules apply to NoSQL things like MongoDB as well.
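To make that concrete, here is a minimal sketch of the prepared statement approach using Ruby’s pg gem; the database name is made up, and the table is the toy one from above:

require 'pg'

# A sketch of parameterized queries with the pg gem; dbname is hypothetical.
conn = PG.connect(dbname: 'toy_auth')

username = 'akira'
password = "' or username like '%m%"  # hostile input stays plain data

# The SQL command and the values travel to the database separately, so the
# quotes in the password can never change the meaning of the command.
result = conn.exec_params(
  'select * from users where username = $1 and password = $2',
  [username, password]
)

puts result.ntuples  # => 0, because no password is literally that string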

People far smarter than I am still get caught by this stuff pretty regularly, so there is no magic bullet to solve it. If you are working with Rails, security tools like Brakeman can help catch things, but the only real solution is awareness and practice.