Category Archives: linux

Graph traversals in ArangoDB

ArangoDB’s AQL query language was created to offer a unified interface for working with key/value, document and graph data. While AQL has been easy to work with and learn, it wasn’t until the addition of AQL traversals in ArangoDB 2.8 that it really felt like it had achieved its goal.

Adding the keywords GRAPH, OUTBOUND, INBOUND and ANY suddenly made iteration using a FOR loop the central idea in the language. This one construct can now be used to iterate over everything, whether collections, graphs or documents:

//FOR loops for everything
FOR person IN persons //collections
  FOR friend IN OUTBOUND person GRAPH "knows_graph" //graphs
    FOR value IN VALUES(friend, true) //documents
      RETURN DISTINCT value

AQL has always felt more like programming than SQL ever did, but the central role of the FOR loop gives it a clarity and simplicity that make AQL very nice to work with. While this is a great addition to the language, it does, however, mean that there are now 4 different ways to traverse a graph in AQL, and a few things are worth pointing out about the differences between them.

AQL Traversals

There are two variations of the AQL traversal syntax: the named graph and the anonymous graph. The named graph version uses the GRAPH keyword and a string indicating the name of an existing graph. With the anonymous syntax you can simply supply the edge collections.

//Passing the name of a named graph
FOR vertex IN OUTBOUND "persons/eve" GRAPH "knows_graph"
//Pass an edge collection to use an anonymous graph
FOR vertex IN OUTBOUND "persons/eve" knows

Both of these will return the same result. The traversal of the named graph uses the vertex and edge collections specified in the graph definition, while the anonymous version determines the vertex collections from the _from/_to attributes of each edge.

If you want access to the edge or the entire path all you need to do is ask:

FOR vertex IN OUTBOUND "persons/eve" knows
FOR vertex, edge IN OUTBOUND "persons/eve" knows
FOR vertex, edge, path IN OUTBOUND "persons/eve" knows

The vertex, edge and path variables can be combined and filtered on to do some complex stuff. The Arango docs show a great example:

FOR v, e, p IN 1..5 OUTBOUND 'circles/A' GRAPH 'traversalGraph'
  FILTER p.edges[0].theTruth == true
  AND p.edges[1].theFalse == false
  FILTER p.vertices[1]._key == "G"
  RETURN p

Notes

Arango can end up doing a lot of work to fill in those FOR v, e, p IN variables. ArangoDB is really fast, so to show the effect these variables can have, I created the most inefficient query I could think of: a directionless traversal across a high-degree vertex with no indexes.

The basic setup looked like this, except with 10000 vertices instead of 10. The test was getting from start, across the middle vertex, to end.

[Screenshot of the test setup]

What you can see is that adding those variables comes at a cost, so only declare the ones you actually need.

[Chart: effects_of_traversal_variables]
Traversing a supernode with 10000 incident edges with various traversal methods. N=5. No indexes used.

GRAPH_* functions and TRAVERSAL

ArangoDB also has a series of “Named Operations”, a few of which also do traversals. There is also a super old-school TRAVERSAL function hiding in the “Other” section. What’s interesting is how different their performance can be while still returning the same results.

I tested all of the traversal functions on the same supernode described above. These are the queries:

//AQL traversal
FOR v IN 2 ANY "vertices/1" edges
  FILTER v.name == "end"
    RETURN v

//GRAPH_NEIGHBORS
RETURN GRAPH_NEIGHBORS("db_10000", {_id: "vertices/1"}, {direction: "any", maxDepth:2, includeData: true, neighborExamples: [{name: "end"}]})

//GRAPH_TRAVERSAL
RETURN GRAPH_TRAVERSAL("db_10000", {_id:"vertices/1"}, "any", {maxDepth:2, includeData: true, filterVertices: [{name: "end"}], vertexFilterMethod: ["exclude"]})

//TRAVERSAL
RETURN TRAVERSAL(vertices, edges, {_id: "vertices/1"}, "any", {maxDepth:2, includeData: true, filterVertices: [{name: "end"}], vertexFilterMethod: ["exclude"]})

All of these returned the same vertex, just with varying levels of nesting within various arrays. Removing the nesting did not make a significant difference in the execution time.

[Chart: traversal_comparison]
Traversing a supernode with 10000 incident edges with various traversal methods. N=5.

Notes

While TRAVERSAL and GRAPH_TRAVERSAL were not stellar performers here, they both have a lot to offer in terms of customizability. For ordering, depth-first searches, and custom expanders and visitors, this is the place to look. I’m sure these can get much faster as you explore the options.

Slightly less obvious but still worth pointing out is that where AQL traversals require an id (“vertices/1000” or a document with an _id attribute), GRAPH_* functions just accept an example like {foo: “bar”} (I’ve passed in {_id: “vertices/1”} as the example just to keep things comparable). Being able to find things without needing to know a specific id, or what collection to look in, is very useful. It lets you abstract away document-level concerns like collections and operate on a higher “graph” level, so you can avoid hardcoding collections into your queries.
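To make that concrete, here is the supernode query from above rewritten to use an example document instead of an _id. This is just a sketch against the same “db_10000” graph, and it assumes the starting vertex has a name attribute of “start” (which is not shown in the original queries):

//GRAPH_NEIGHBORS, matching the start vertex by example instead of by _id
RETURN GRAPH_NEIGHBORS("db_10000", {name: "start"}, {direction: "any", maxDepth: 2, includeData: true, neighborExamples: [{name: "end"}]})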

What it all means

The differences between these (at least superficially) similar traversals are pretty surprising. While some were faster than others, none of the options for tightening the scope of the traversal were used (edge restrictions, indexes, directionality). That tells you there is likely a lot of headroom for performance gains for all of the different methods.

The conceptual clarity that AQL traversals bring to the language as a whole is really nice, but it’s clear there is some optimization work to be done before I go and rewrite all my queries.

Where I have used the new AQL traversal syntax, I’m also going to have to check to make sure there are no unused v,e,p variables hiding in my queries. Where you need to use them, it looks like restricting yourself to v,e is the way to go. Generating those full paths is costly. If you use them, make sure it’s worth it.

Slowing Arango down is surprisingly instructive, but with 3.0 bringing the switch to Velocypack for JSON serialization, new indexes, and more, it looks like it’s going to get harder to do. :)

 

Running Gephi on Ubuntu 15.10

A while ago I gave a talk at the Ottawa graph meetup about getting started doing graph data visualizations with Gephi. Ever the optimist, I invited people to install Gephi on their machines and then follow along as I walked through doing various things with the program.

[Image: java_install]

What trying to get a room of 20 people to install a Java program has taught me is that the installer’s “Java is found everywhere” is not advertising; it’s a warning. I did indeed experience the power of Java, and after about ten minutes of old/broken/multiple Java versions, broken classpaths and Java 7/8 compatibility drama, I gave up and completed the rest of the talk as a demo.

All of this was long forgotten until my wife and I started a little open data project recently and needed to use Gephi to visualize the data. The Gephi install she had attempted the day of the talk was still lingering on her Ubuntu system and so it was time to actually figure out how to get it going.

The instructions for installing Gephi are pretty straightforward:

  1. Update your distribution with the latest official JRE 7 or 8 packages.
  2. After the download completes, unzip and untar the file in a directory.
  3. Run it by executing the ./bin/gephi script.
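On Ubuntu that works out to something like the following sketch; the archive name here is a placeholder, so substitute whatever version you actually download from gephi.org:

# Step 1: a JRE from the official repositories
sudo apt-get install openjdk-8-jre
# Steps 2 and 3: unpack the archive and run the launcher script
tar -xzf gephi-0.9.1-linux.tar.gz
cd gephi-0.9.1
./bin/gephi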

The difficulty was that after doing that, Gephi would show its splash screen and then hang as the loading bar said “Starting modules…”.

If you have ever downloaded plugins for Gephi, you will have noticed that they have a .nbm extension, which indicates that they, and (if you will pardon the pun) by extension Gephi itself, are built on top of the NetBeans IDE.
So the next question was: does NetBeans itself work?

sudo apt-get install netbeans
netbeans

Wouldn’t you know it, NetBeans also freezes while loading modules.

Installing Oracle’s version of Java was suggested, and the place to get that is the WebUpd8 Team’s PPA:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer oracle-java8-set-default
# The java version that got installed:
java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

That finally left us with a working version of Gephi.

Gephi 0.9.1 running on Ubuntu 15.10

Installing Gephi on Arch Linux was (thankfully) drama-free, but interestingly it installs OpenJDK, the very thing that seemed to be causing the problems on Ubuntu:

yaourt -S gephi
java -version
openjdk version "1.8.0_74"
OpenJDK Runtime Environment (build 1.8.0_74-b02)
OpenJDK 64-Bit Server VM (build 25.74-b02, mixed mode)

It’s a mystery to me why Gephi on Ubuntu seems to require Oracle’s Java but on Arch I can run it on OpenJDK.
With a little luck it can remain a mystery.

gpg and signing your own Arch Linux packages

One of the first things that I wanted to install on my system after switching to Arch was ArangoDB. Sadly it wasn’t in the official repos. I’ve had mixed success installing things from the AUR and the Arango packages there didn’t improve my ratio.

Using those packages as a starting point, I did a little tinkering and got it all working the way I like. I’ve been following the work being done on reproducible builds and why it is needed, and it seems that with all that going on, anyone dabbling in making packages should at the very least be signing them. With that as my baseline, I figured I might as well start with mine.

Of course package signing involves learning about gpg/pgp whose “ease of use” is legendary.

Before we get to package signing, a little about gpg.

gpg --list-keys
gpg -k
gpg --list-public-keys

All of these commands list the contents of ~/.gnupg/pubring.gpg.

gpg --list-secret-keys
gpg -K

Both list all keys from ~/.gnupg/secring.gpg.

The pacman package manager also has its own gpg databases which you can explore with:

gpg --homedir /etc/pacman.d/gnupg --list-keys

So the task at hand is getting my public key into the list of public keys that pacman trusts. To do that we will need to do more than just list keys; we need to reference them individually. gpg has a few ways to do that, by passing an argument to one of our list-keys commands above. I’ll do a quick search through the list of keys that pacman trusts:

mike@longshot:~/projects/arangodb_pkg☺ gpg --homedir /etc/pacman.d/gnupg -k pierre
pub   rsa3072/6AC6A4C2 2011-11-18 [SC]
uid         [  full  ] Pierre Schmitz (Arch Linux Master Key) <pierre@master-key.archlinux.org>
sub   rsa1024/86872C2F 2011-11-18 [E]
sub   rsa3072/1B516B59 2011-11-18 [A]

pub   rsa2048/9741E8AC 2011-04-10 [SC]
uid         [  full  ] Pierre Schmitz <pierre@archlinux.de>
sub   rsa2048/54211796 2011-04-10 [E]

mike@longshot:~/projects/arangodb_pkg☺  gpg --homedir /etc/pacman.d/gnupg --list-keys stephane
pub   rsa2048/AB441196 2011-10-30 [SC]
uid         [ unknown] Stéphane Gaudreault <stephane@archlinux.org>
sub   rsa2048/FDC576A9 2011-10-30 [E]

If you look at the output there you can see what is called an OpenPGP short key id. We can use those to refer to individual keys, but we can also use long ids and fingerprints:

gpg --homedir /etc/pacman.d/gnupg -k --keyid-format long stephane
pub   rsa2048/EA6836E1AB441196 2011-10-30 [SC]
uid                 [ unknown] Stéphane Gaudreault <stephane@archlinux.org>
sub   rsa2048/4ABE673EFDC576A9 2011-10-30 [E]


gpg --homedir /etc/pacman.d/gnupg -k --fingerprint stephane
pub   rsa2048/AB441196 2011-10-30 [SC]
      Key fingerprint = 0B20 CA19 31F5 DA3A 70D0  F8D2 EA68 36E1 AB44 1196
uid         [ unknown] Stéphane Gaudreault <stephane@archlinux.org>
sub   rsa2048/FDC576A9 2011-10-30 [E]

So we can identify Stéphane’s specific key using either the short id, the long id or the fingerprint:

gpg --homedir /etc/pacman.d/gnupg -k AB441196
gpg --homedir /etc/pacman.d/gnupg -k EA6836E1AB441196
gpg --homedir /etc/pacman.d/gnupg -k 0B20CA1931F5DA3A70D0F8D2EA6836E1AB441196

Armed with a way to identify the key I want pacman to trust, I need to do the transfer. Though not initially obvious, gpg can push and pull keys from designated key servers. The file at ~/.gnupg/gpg.conf tells me that my keyserver is keys.gnupg.net, while pacman’s file at /etc/pacman.d/gnupg/gpg.conf says it is using pool.sks-keyservers.net.
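If you want to see (or change) which server either of them uses, it comes down to a single keyserver line in the respective gpg.conf; something like this, based on the defaults described above:

# ~/.gnupg/gpg.conf (my keyring)
keyserver hkp://keys.gnupg.net

# /etc/pacman.d/gnupg/gpg.conf (pacman's keyring)
keyserver hkp://pool.sks-keyservers.net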

Using my key’s long id, I’ll push it to my default keyserver, then tell pacman to pull it and sign it.

#send my key
gpg --send-key F77636AC51B71B99
#tell pacman to pull that key from my keyserver
sudo pacman-key --keyserver keys.gnupg.net -r F77636AC51B71B99
#sign the key it received and start trusting it
sudo pacman-key --lsign-key F77636AC51B71B99

With all that done, I should be able to sign my package with:

makepkg --sign --key F77636AC51B71B99

We can also shorten that by setting the default-key option in ~/.gnupg/gpg.conf.

# If you have more than 1 secret key in your keyring, you may want to
# uncomment the following option and set your preferred keyid.

default-key F77636AC51B71B99

With my default key set, I’m able to make and install with this:

mike@longshot:~/projects/arangodb_pkg☺ makepkg --sign
mike@longshot:~/projects/arangodb_pkg☺ sudo pacman -U arangodb-2.8.1-1-x86_64.pkg.tar.xz
loading packages...
resolving dependencies...
looking for conflicting packages...

Packages (1) arangodb-2.8.1-1

Total Installed Size:  146.92 MiB
Net Upgrade Size:        0.02 MiB

:: Proceed with installation? [Y/n] y
(1/1) checking keys in keyring
[##################################################] 100%
(1/1) checking package integrity
[##################################################] 100%

The ease with which I can make my own packages is a very appealing part of Arch Linux for me. Signing them was the next logical step and I’m looking forward to exploring some related topics like running my own repo, digging deeper into GPG, the Web of Trust, and reproducible builds. It’s all fun stuff, if you can only find the time.

Hello GraphQL

One of the most interesting projects to me lately has been Facebook’s GraphQL. Since it was announced at React.conf in January, those of us who were excited by the idea have had to wait, first for the spec to be formalized and then for some actual running code.

I think the GraphQL team is on to something big (it’s positioned as an alternative to REST, to give a sense of how big), and I’ve been meaning to dig into it for a while, but it was never clear where to start. So after a little head-scratching and a little RTFM, I want to share a GraphQL hello world.

So what does that look like? Well, Facebook has released two projects: graphql-js and express-graphql. graphql-js is the reference implementation of what is described in the spec. express-graphql is a middleware component for the Express framework that lets you use GraphQL.

So Express is going to be our starting point. First we need to create a new project using the generator:

mike@longshot:~☺  express --git -e gql_hello

create : gql_hello
create : gql_hello/package.json
create : gql_hello/app.js
create : gql_hello/.gitignore
create : gql_hello/public
create : gql_hello/routes
create : gql_hello/routes/index.js
create : gql_hello/routes/users.js
create : gql_hello/views
create : gql_hello/views/index.ejs
create : gql_hello/views/error.ejs
create : gql_hello/bin
create : gql_hello/bin/www
create : gql_hello/public/javascripts
create : gql_hello/public/images
create : gql_hello/public/stylesheets
create : gql_hello/public/stylesheets/style.css

install dependencies:
$ cd gql_hello && npm install

run the app:
$ DEBUG=gql_hello:* npm start

Let’s do as we are told and run cd gql_hello && npm install.
When that’s done we can get to the interesting stuff.
Next up will be installing graphql and the middleware using the --save option so that our app’s dependencies in package.json will be updated:

mike@longshot:~/gql_hello☺  npm install --save express-graphql graphql babel
npm WARN install Couldn't install optional dependency: Unsupported
npm WARN prefer global babel@5.8.23 should be installed with -g
...

I took the basic app.js file that was generated and just added the following:

app.use('/', routes);
app.use('/users', users);

// GraphQL:
var graphqlHTTP = require('express-graphql');

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
} from 'graphql';

var schema = new GraphQLSchema({
  query: new GraphQLObjectType({
    name: 'RootQueryType',
    fields: {
      hello: {
        type: GraphQLString,
        resolve() {
          return 'world';
        }
      }
    }
  })
});

//Mount the middleware on the /graphql endpoint:
app.use('/graphql', graphqlHTTP({ schema: schema , pretty: true}));
//That's it!

// catch 404 and forward to error handler
app.use(function(req, res, next) {
  var err = new Error('Not Found');
  err.status = 404;
  next(err);
});

Notice that we are passing our GraphQL schema to graphqlHTTP, as well as pretty: true so that responses from the server will be pretty-printed.

One other thing is that since those GraphQL libraries make extensive use of ECMAScript 6 syntax, we will need to use the Babel Transpiler to actually be able to run this thing.

If you installed Babel with npm install -g babel you can add the following to your package.json scripts section:

  {
    "start": "babel-node ./bin/www"
  }

Because I didn’t install it globally, I’ll just point to it in the node_modules folder:

  {
    "start": "node_modules/babel/bin/babel-node.js ./bin/www"
  }

With that done we can use npm start to start the app and try things out using curl:

mike@longshot:~☺  curl localhost:3000/graphql?query=%7Bhello%7D
{
  "data": {
    "hello": "world"
  }
}

Looking back at the schema we defined, we can see that our request {hello} (or %7Bhello%7D once it’s been URL-encoded) caused the resolve function to be called, which returned the string “world”.

{
  name: 'RootQueryType',
  fields: {
    hello: {
      type: GraphQLString,
      resolve() {
        return 'world';
      }
    }
  }
}

This explains what they mean when you hear that GraphQL “exposes fields that are backed by arbitrary code”. What drew me to GraphQL is that it seems to be a great solution for people with graph-database-backed applications, but it’s only now that I realize that GraphQL is much more flexible. That string could just as easily have pulled something out of a relational database or calculated something useful. In fact this might be the only time “arbitrary code execution” is something to be happy about.
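To make that concrete, here is a sketch of a field whose resolve function does a lookup instead of returning a constant. The users object is just an in-memory stand-in for a real data source, and the userName field is made up for illustration; it reuses the GraphQLObjectType and GraphQLString imports from the app above:

//A toy "data source" standing in for a real database
var users = {"1": {name: "Ada"}, "2": {name: "Grace"}};

var queryType = new GraphQLObjectType({
  name: 'RootQueryType',
  fields: {
    userName: {
      type: GraphQLString,
      args: {id: {type: GraphQLString}},
      //resolve can run whatever code you like: a lookup, a database query, a calculation...
      resolve(root, args) {
        return users[args.id] ? users[args.id].name : null;
      }
    }
  }
});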

I’m super excited to explore this further and to start using it with ArangoDB. If you want to dig deeper I suggest you check out Exploring GraphQL and Creating a GraphQL server and of course read the spec.

Running a Rails app with Systemd and liking it

Systemd has, over the strident objections of many strident people, become the default init system for a surprising number of linux distributions. Though I’ve been aware of the drama, the eye-rolling, the uh.. enterprising nature of systemd, I really have only just started using it myself. All the wailing and gnashing of teeth surrounding it left me unsure what to expect.

Recently I needed to get a proof-of-concept app I built running so a client could use it on their internal network to evaluate it. Getting my Rails app to start on boot was pretty straightforward, and since I’m going to be using this again I thought I would document it here.

First I created a “rails” user and group, and in /home/rails I installed my usual rbenv setup. The fact that only root is allowed to listen on ports below 1024 conflicts with my plan to run my app as the “rails” user and listen on port 80. The solution is setcap:

setcap 'cap_net_bind_service=+ep' .rbenv/versions/2.2.2/bin/bundle

With that capability added, I set up my systemd unit file in /usr/lib/systemd/system/myapp.service and added the following:

[Unit]
Description=MyApp
Requires=network.target
Requires=arangodb.service

[Service]
Type=simple
User=rails
Group=rails
WorkingDirectory=/home/rails/myapp
ExecStart=/usr/bin/bash -lc 'bundle exec rails server -e production --bind 0.0.0.0 --port 80'
TimeoutSec=30
RestartSec=15s
Restart=always

[Install]
WantedBy=multi-user.target

The secret sauce that makes this work with rbenv is the “bash -l” in the ExecStart line. This means that bash will execute as though it were a login shell, so the .bashrc file with all the PATH exports and rbenv init stuff will be sourced before the command I give it is run. In other words, exactly what happens normally.

From there, I just start the service like all the rest of them:

systemctl enable myapp.service
systemctl start myapp.service

This Just Works™ and got the job done, but in the process I find I am really starting to appreciate Systemd. Running daemons is complicated, and with the dropping of privileges, ordering, isolation and security options, there is a lot to get right… or wrong.

What I am liking about Systemd is that it is taking the same functions that Docker is built on, namely cgroups and namespacing, and giving you a declarative way of using them while starting your process. Doing so puts some really nice (and otherwise complicated) security features within reach of anyone willing to read a man page.

PrivateTmp=yes is a great example of this. Simply adding that to the unit file above (which you should do if you call Tempfile.new in your app) closes off a bunch of security problems, because systemd “sets up a new file system namespace for the executed processes and mounts private /tmp and /var/tmp directories inside it that is not shared by processes outside of the namespace”.

Could I get the same effect as PrivateTmp=yes with unshare? With some fiddling, but Systemd makes it a zero cost option.

There is also ProtectSystem=full to mount /boot, /usr and /etc as read-only, which “ensures that any modification of the vendor supplied operating system (and optionally its configuration) is prohibited for the service”. Systemd can even handle running setcap for me, resulting in beautiful stuff like this, and there is a lot more in man systemd.exec besides.
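As a sketch, the [Service] section from the unit file above with those protections added might look like this; the AmbientCapabilities line is my assumption about how systemd can replace the manual setcap step, so check man systemd.exec for what your version supports:

[Service]
Type=simple
User=rails
Group=rails
WorkingDirectory=/home/rails/myapp
ExecStart=/usr/bin/bash -lc 'bundle exec rails server -e production --bind 0.0.0.0 --port 80'
PrivateTmp=yes
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
Restart=always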

For me I think one of the things that has become clear over the last few years is that removing “footguns” from our software is really important. All the work that is going into making the tools (like rm -rf) and languages (Rust!) we use less spectacularly dangerous is critical to raising the security bar across the industry.

The more I learn about Systemd the more it seems to be a much needed part of that.

Zero downtime Rails redeploys with Unicorn

Like any self-respecting Ruby hipster I am using Unicorn as my app server in production. Unlike most Ruby hipsters, my deployment process is pretty manual at the moment. While the high-touch manual deployment that I am currently doing is far from ideal long term, short term it’s giving me a close-up look at the care and feeding of a production Rails app. Think of it as wanting to get my hands dirty after years of being coddled by Heroku. :)

Much ado has been made of how unixy Unicorn is, and one of the ways that manifests itself is how Unicorn uses signals to allow you to talk to a running server process. What has been interesting about this has been a reintroduction to the “kill” command. It’s pretty common to know that “kill -9 1234” is a quick way to kill process 1234, but it turns out that the kill command has much more going on. The mysterious -9 option is significantly less mysterious once you know that kill can send ANY signal, and you finally look at the list of options:

mike@sleepycat:~☺  kill -l
 1) SIGHUP	 2) SIGINT	 3) SIGQUIT	 4) SIGILL	 5) SIGTRAP
 6) SIGABRT	 7) SIGBUS	 8) SIGFPE	 9) SIGKILL	10) SIGUSR1
11) SIGSEGV	12) SIGUSR2	13) SIGPIPE	14) SIGALRM	15) SIGTERM
16) SIGSTKFLT	17) SIGCHLD	18) SIGCONT	19) SIGSTOP	20) SIGTSTP
21) SIGTTIN	22) SIGTTOU	23) SIGURG	24) SIGXCPU	25) SIGXFSZ
26) SIGVTALRM	27) SIGPROF	28) SIGWINCH	29) SIGIO	30) SIGPWR
31) SIGSYS	34) SIGRTMIN	35) SIGRTMIN+1	36) SIGRTMIN+2	37) SIGRTMIN+3
38) SIGRTMIN+4	39) SIGRTMIN+5	40) SIGRTMIN+6	41) SIGRTMIN+7	42) SIGRTMIN+8
43) SIGRTMIN+9	44) SIGRTMIN+10	45) SIGRTMIN+11	46) SIGRTMIN+12	47) SIGRTMIN+13
48) SIGRTMIN+14	49) SIGRTMIN+15	50) SIGRTMAX-14	51) SIGRTMAX-13	52) SIGRTMAX-12
53) SIGRTMAX-11	54) SIGRTMAX-10	55) SIGRTMAX-9	56) SIGRTMAX-8	57) SIGRTMAX-7
58) SIGRTMAX-6	59) SIGRTMAX-5	60) SIGRTMAX-4	61) SIGRTMAX-3	62) SIGRTMAX-2
63) SIGRTMAX-1	64) SIGRTMAX	

So with that knowledge, let’s send some signals to Unicorn to get it serving up the latest version of our code. First we need its process id. We are really just interested in the process id of the master process, which we can see is 26465:

mike@sleepycat:/myapp$ ps aux | grep unicorn
503       7995   0:00 grep unicorn
503      26465   0:07 unicorn_rails master -c config/unicorn.rb --env production -D
503      26498   0:11 unicorn_rails worker[0] -c config/unicorn.rb --env production -D
503      26502   2:37 unicorn_rails worker[1] -c config/unicorn.rb --env production -D
503      26506   0:06 unicorn_rails worker[2] -c config/unicorn.rb --env production -D
503      26510   0:06 unicorn_rails worker[3] -c config/unicorn.rb --env production -D
503      26514   0:06 unicorn_rails worker[4] -c config/unicorn.rb --env production -D
503      26518   0:06 unicorn_rails worker[5] -c config/unicorn.rb --env production -D
503      26522   0:06 unicorn_rails worker[6] -c config/unicorn.rb --env production -D
503      26526   0:07 unicorn_rails worker[7] -c config/unicorn.rb --env production -D
503      26530   0:07 unicorn_rails worker[8] -c config/unicorn.rb --env production -D
503      26534   0:06 unicorn_rails worker[9] -c config/unicorn.rb --env production -D
503      26538   0:09 unicorn_rails worker[10] -c config/unicorn.rb --env production -D
503      26542   0:07 unicorn_rails worker[11] -c config/unicorn.rb --env production -D
503      26546   0:07 unicorn_rails worker[12] -c config/unicorn.rb --env production -D
503      26550   0:08 unicorn_rails worker[13] -c config/unicorn.rb --env production -D
503      26554   0:10 unicorn_rails worker[14] -c config/unicorn.rb --env production -D
503      26558   0:08 unicorn_rails worker[15] -c config/unicorn.rb --env production -D
503      26562   0:05 unicorn_rails worker[16] -c config/unicorn.rb --env production -D
503      26566   0:08 unicorn_rails worker[17] -c config/unicorn.rb --env production -D
503      26570   0:07 unicorn_rails worker[18] -c config/unicorn.rb --env production -D
503      26574   0:06 unicorn_rails worker[19] -c config/unicorn.rb --env production -D         

Since I have just pulled down some new code I want to restart the master process. I can get Unicorn to launch a new master process by sending the master process the USR2 signal. After that you can see that there is now a new master (7996) with its set of workers, and the old master (26465) with its set of workers:

mike@sleepycat:/myapp$ kill -USR2 26465
mike@sleepycat:/myapp$ ps aux | grep unicorn
503       7996  0:07 unicorn_rails master -c config/unicorn.rb --env production -D
503       8035  0:00 unicorn_rails worker[0] -c config/unicorn.rb --env production -D
503       8038  0:00 unicorn_rails worker[1] -c config/unicorn.rb --env production -D
503       8041  0:00 unicorn_rails worker[2] -c config/unicorn.rb --env production -D
503       8044  0:00 unicorn_rails worker[3] -c config/unicorn.rb --env production -D
503       8046  0:00 unicorn_rails worker[4] -c config/unicorn.rb --env production -D
503       8050  0:00 unicorn_rails worker[5] -c config/unicorn.rb --env production -D
503       8052  0:00 unicorn_rails worker[6] -c config/unicorn.rb --env production -D
503       8056  0:00 unicorn_rails worker[7] -c config/unicorn.rb --env production -D
503       8059  0:00 unicorn_rails worker[8] -c config/unicorn.rb --env production -D
503       8062  0:00 unicorn_rails worker[9] -c config/unicorn.rb --env production -D
503       8064  0:00 unicorn_rails worker[10] -c config/unicorn.rb --env production -D
503       8069  0:00 unicorn_rails worker[11] -c config/unicorn.rb --env production -D
503       8073  0:00 unicorn_rails worker[12] -c config/unicorn.rb --env production -D
503       8075  0:00 unicorn_rails worker[13] -c config/unicorn.rb --env production -D
503       8079  0:00 unicorn_rails worker[14] -c config/unicorn.rb --env production -D
503       8082  0:00 unicorn_rails worker[15] -c config/unicorn.rb --env production -D
503       8085  0:00 unicorn_rails worker[16] -c config/unicorn.rb --env production -D
503       8088  0:00 unicorn_rails worker[17] -c config/unicorn.rb --env production -D
503       8091  0:00 unicorn_rails worker[18] -c config/unicorn.rb --env production -D
503       8094  0:00 unicorn_rails worker[19] -c config/unicorn.rb --env production -D
503       8156  0:00 grep unicorn
503      26465  0:07 unicorn_rails master (old) -c config/unicorn.rb --env production -D
503      26498  0:11 unicorn_rails worker[0] -c config/unicorn.rb --env production -D
503      26502  2:37 unicorn_rails worker[1] -c config/unicorn.rb --env production -D
503      26506  0:06 unicorn_rails worker[2] -c config/unicorn.rb --env production -D
503      26510  0:06 unicorn_rails worker[3] -c config/unicorn.rb --env production -D
503      26514  0:06 unicorn_rails worker[4] -c config/unicorn.rb --env production -D
503      26518  0:06 unicorn_rails worker[5] -c config/unicorn.rb --env production -D
503      26522  0:06 unicorn_rails worker[6] -c config/unicorn.rb --env production -D
503      26526  0:07 unicorn_rails worker[7] -c config/unicorn.rb --env production -D
503      26530  0:07 unicorn_rails worker[8] -c config/unicorn.rb --env production -D
503      26534  0:06 unicorn_rails worker[9] -c config/unicorn.rb --env production -D
503      26538  0:09 unicorn_rails worker[10] -c config/unicorn.rb --env production -D
503      26542  0:07 unicorn_rails worker[11] -c config/unicorn.rb --env production -D
503      26546  0:07 unicorn_rails worker[12] -c config/unicorn.rb --env production -D
503      26550  0:08 unicorn_rails worker[13] -c config/unicorn.rb --env production -D
503      26554  0:10 unicorn_rails worker[14] -c config/unicorn.rb --env production -D
503      26558  0:08 unicorn_rails worker[15] -c config/unicorn.rb --env production -D
503      26562  0:06 unicorn_rails worker[16] -c config/unicorn.rb --env production -D
503      26566  0:08 unicorn_rails worker[17] -c config/unicorn.rb --env production -D
503      26570  0:07 unicorn_rails worker[18] -c config/unicorn.rb --env production -D
503      26574  0:06 unicorn_rails worker[19] -c config/unicorn.rb --env production -D

Now I want to shut down the old master process and its workers. I can do that with the QUIT signal:

mike@sleepycat:/myapp$ kill -QUIT 26465
mike@sleepycat:/myapp$ ps aux | grep unicorn
503       7996  0:07 unicorn_rails master -c config/unicorn.rb --env production -D
503       8035  0:00 unicorn_rails worker[0] -c config/unicorn.rb --env production -D
503       8038  0:00 unicorn_rails worker[1] -c config/unicorn.rb --env production -D
503       8041  0:00 unicorn_rails worker[2] -c config/unicorn.rb --env production -D
503       8044  0:00 unicorn_rails worker[3] -c config/unicorn.rb --env production -D
503       8046  0:00 unicorn_rails worker[4] -c config/unicorn.rb --env production -D
503       8050  0:00 unicorn_rails worker[5] -c config/unicorn.rb --env production -D
503       8052  0:00 unicorn_rails worker[6] -c config/unicorn.rb --env production -D
503       8056  0:00 unicorn_rails worker[7] -c config/unicorn.rb --env production -D
503       8059  0:00 unicorn_rails worker[8] -c config/unicorn.rb --env production -D
503       8062  0:00 unicorn_rails worker[9] -c config/unicorn.rb --env production -D
503       8064  0:00 unicorn_rails worker[10] -c config/unicorn.rb --env production -D
503       8069  0:00 unicorn_rails worker[11] -c config/unicorn.rb --env production -D
503       8073  0:00 unicorn_rails worker[12] -c config/unicorn.rb --env production -D
503       8075  0:00 unicorn_rails worker[13] -c config/unicorn.rb --env production -D
503       8079  0:00 unicorn_rails worker[14] -c config/unicorn.rb --env production -D
503       8082  0:00 unicorn_rails worker[15] -c config/unicorn.rb --env production -D
503       8085  0:00 unicorn_rails worker[16] -c config/unicorn.rb --env production -D
503       8088  0:00 unicorn_rails worker[17] -c config/unicorn.rb --env production -D
503       8091  0:00 unicorn_rails worker[18] -c config/unicorn.rb --env production -D
503       8094  0:00 unicorn_rails worker[19] -c config/unicorn.rb --env production -D
503       8161  0:00 grep unicorn

So now we have a Unicorn serving up the latest version of our code without dropping a single request. Really slick.
Ryan Bates has a screencast that takes a broader look at the subject of zero downtime deployments, automated with Capistrano (probably a more sustainable approach), but if you look closely you will see these signals lurking in the code.
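If you end up doing this by hand more than a few times, the whole dance collapses into a few lines of shell. A rough sketch, assuming your config/unicorn.rb writes its pid to tmp/pids/unicorn.pid:

#!/usr/bin/env bash
OLD_PID=$(cat tmp/pids/unicorn.pid)
kill -USR2 "$OLD_PID"   # launch a new master and workers alongside the old ones
sleep 30                # give the new workers time to finish booting
kill -QUIT "$OLD_PID"   # gracefully retire the old master and its workers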

If you are interested in digging into more Unix fundamentals (from a Rubyist’s perspective!) I would recommend checking out Jesse Storimer’s books.

Understanding Docker

Docker has generated a lot of buzz lately and seems poised to fundamentally change how apps get deployed. With the apps I work on increasingly dependent on their environment (environment variables, cron jobs and additional libraries), having a way of encapsulating my app and its environment is pretty appealing. With that in mind I’ve been playing with Docker for a little while, but I found I had a hard time building a clear picture in my head of what is actually going on.

The tutorials all feel a little magical, and a lot of the docs for the commands end up being stuff like “docker pull: pulls an image”, which is pretty unsatisfying. So while I am still just getting started with Docker, I thought I would share what I have pieced together so far and use it as an opportunity to explain this to myself as well.

The first thing to point out is that Docker is built on top of AuFS, Linux Containers (LXC), and cgroups (lots of details here). Doing some reading about those things first really helps in understanding what is going on.

While that is neat, I am a pretty visual person, so for me to feel like I have any idea of what is going on I need to see it. To do that I created my own image using the traditional debootstrap command:

☺  sudo debootstrap raring raring64
I: Retrieving InRelease
I: Failed to retrieve InRelease
I: Retrieving Release
I: Retrieving Release.gpg
....

You can see it created a folder with the following sub-folders:

ls raring64/
bin   dev  home  lib64  mnt  proc  run   selinux  sys  usr
boot  etc  lib   media  opt  root  sbin  srv      tmp  var

Then we tar up the folder and pipe it into Docker’s import command. This creates the image and prints out the hash id of the image before exiting:

☺ sudo tar -C raring64 -c . | sudo docker import - raring64
9a6984a920c9

If I dig, I can then find those folders in the Docker graph folder:

☺ sudo ls /var/lib/docker/graph/9a6984a920c9badcaed6456bfdef2f20a414b08ed09acfd9140f2124065697b2/layer
bin   dev  home  lib64	mnt  proc  run	 selinux  sys  usr
boot  etc  lib	 media	opt  root  sbin  srv	  tmp  var

I can then log into that image by asking Docker to run it interactively (-i) and give me a pseudo-TTY (-t). But notice the hostname in the root prompt you get when you get Docker to run bash (it changes each time):

☺ sudo docker run -i -t raring64 /usr/bin/env bash
WARNING: Docker detected local DNS server on resolv.conf. Using default external servers: [8.8.8.8 8.8.4.4]
root@b0472d03f134:/# exit
exit
☺ sudo docker run -i -t raring64 /usr/bin/env bash
WARNING: Docker detected local DNS server on resolv.conf. Using default external servers: [8.8.8.8 8.8.4.4]
root@76c7860cf94e:/# exit
exit

If I run some commands that change the state of that image and I want to keep the changes, I will need to use the hash we can see in the hostname to commit them back to the graph directory. So for example, I installed git (with “apt-get install git”) and afterwards committed the change:

☺ sudo docker commit 76c7860cf94e raring_and_git
c153792e04b4

Sure enough, this creates a new directory inside /var/lib/docker/graph/ that contains the difference between the original image (my raring64 image) and my new one with git:

☺ sudo ls /var/lib/docker/graph/
27cf784147099545						  9a6984a920c9badcaed6456bfdef2f20a414b08ed09acfd9140f2124065697b2  c153792e04b4a164b9eb981e0f59a82c8775cad90a7771045ba3c6daabc41f23  :tmp:
8dbd9e392a964056420e5d58ca5cc376ef18e2de93b5cc90e868a1bbc8318c1c  b750fe79269d2ec9a3c593ef05b4332b1d1a02a62b4accb2c21d589ff2f5f2dc  checksums

☺ sudo ls /var/lib/docker/graph/c153792e04b4a164b9eb981e0f59a82c8775cad90a7771045ba3c6daabc41f23/layer
dev  etc  lib  tmp  usr  var

It is the job of AuFS to take all the folders and files in the graph directory and sort of sum them into a single filesystem with all the files from raring64 + the new files that changed when I installed git. Docker can then use that filesystem as the base from which to run its namespaced process (similar to chroot).

All of this creates a pretty “git-like” experience where each hash represents a change set applied to a base set of files.

From here, building out images takes one of two forms: give yourself an interactive bash session, make your changes and then commit them; or use a Dockerfile.
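The Dockerfile route is just the scripted version of the same thing. A minimal sketch that reproduces my interactive git install on top of the raring64 image imported above might look like this (built with “docker build -t raring_and_git .”):

# Start from the image created with docker import
FROM raring64
# The same change I made interactively, now recorded as a build step
RUN apt-get update && apt-get install -y git
CMD ["/bin/bash"]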

So this feels like a solid starting point in the world of Docker, and it’s a pretty exciting world. In fact I am looking forward to deploying my next app… how often can you say that?

EuRuKo 2013

Some of the talks from EuRuKo sounded really good, but I missed most of them because of the time difference. Fortunately the live stream was saved and posted, so now I can watch them whenever!

Update: Ustream has thoughtfully decided to ignore the autoplay=false parameter that WordPress adds to all the videos to prevent them autoplaying. So rather than embedding them and having them all play at the same time every time the page loads, I am just going to link to them. Thanks Ustream.

Day 1 Part 1

Day 1 Part 2

Day 1 Part 3

Day 2 Part 1

Day 2 Part 2

Day 2 Part 3

Contributing to Musicbrainz

Hacking on code is not the only way to contribute to the Free Software / Open Source Software community. Many applications rely on external datasets to provide some or most of their functionality and a contribution there helps all the projects downstream.

Ubuntu’s default music application Rhythmbox, as well as Banshee, KDE’s Amarok and a host of smaller programs like Sound Juicer, all offer the ability to rip CDs. For this feature to be seen to “work” in the eyes of the user, the software needs to correctly identify the CD and add the appropriate album details and track names. To do this, these programs all query the Musicbrainz database, and the quality of that response essentially decides the user’s experience of the ripping process: will it be a one-click “import”, or 10 minutes of filling in track names while squinting at the CD jacket? God help you if you have tracks with accented characters and an English keyboard.

What all this means is that every contribution to Musicbrainz results in a better experience for users of ALL of these programs. When someone somewhere decides to rip their favourite obscure CD, and the software magically fills in the track names and album details, it’s a pretty happy moment. So if you have wanted to contribute to the Free/Open Source Software community but don’t have the programming chops to do it, contributing to Musicbrainz is one way to help out.

The Musicbrainz dataset is under the Creative Commons CC0 license, which places all the data in the public domain. This means that your contributions will stay open to the public, and we won’t have another CDDB/Gracenote situation where people contributed to a database that ended up charging for access. All you need to get started is to create a Musicbrainz account.

A typical contribution looks something like this. I’ll decide to rip one of my CDs and pop it in the drive. I launch Rhythmbox, which will tell me if it’s not recognized:

[Screenshot: rhythmbox_unknown_album2]

When you click the “Submit Album” button, the application will send you to the Musicbrainz website with the Table of Contents (TOC) information from the CD in the URL. Once on the site you will need to search for the artist or release to add the TOC to:

[Screenshot: musicbrainz_search]

Most of the time there will be an entry in the database for the artist, and all that needs to happen is to add the TOC that you are still carrying from page to page in the URL to one of the artist’s CDs. In cases where the search returns multiple matches, I usually explore the different options by ctrl+clicking on the results to open them in separate tabs.

[Screenshot: musicbrainz_selected]

Click through the artist’s albums until you find the one you are looking for, or add one if you need to. In this case there was one already there (and all the details, including the catalog number, matched), so my next step was to click “Attach CD TOC”. This takes the TOC you can see in the address bar and attaches it to that release.

[Screenshot: musicbrainz_attaching_toc]

You will be asked to add a note describing where the information you are providing is coming from. In this case it’s coming from the CD. Add the note and you are done. What makes contributing to Musicbrainz particularly gratifying is that the next time you put in the CD, it is recognized right away. My favourite kind of gratification: instant. You can also immediately see the results in Rhythmbox, as well as Banshee and any other application that uses Musicbrainz.

[Screenshot: rhythmbox_after]

[Screenshot: banshee_science_fiction_lookup_after]

It’s pretty great thinking that the few minutes invested in this process not only solves your immediate problem of having an unrecognised CD, but also makes software all over the Linux ecosystem just a tiny bit better. That’s a win/win situation if I’ve ever seen one.

Getting to know SQLite3

I’m finding SQLite3 super useful lately. It’s great for any kind of experimentation and a quick and painless way to persist data. There are just a few things I needed to wrap my head around to start to feel comfortable with it.

As with most things on Debian-based systems, installing is really easy:
sudo apt-get install sqlite3 libsqlite3-dev

My first real question was about datatypes. What does SQLite support? It was a bit mysterious to read that SQLite has 5 datatypes (null, integer, real (float), text, blob) but then see a MySQL-style create table statement like this work:

create table people(
  id integer primary key autoincrement,
  name varchar(30),
  age integer,
  awesomeness decimal(5,2)
);

How are varchar and decimal able to work? Worse still, why does something like this work:

create table people(
  id integer primary key autoincrement,
  name foo(30),
  age bar(100000000),
  awesomeness baz
);

As it happens SQLite maps certain terms to its internal datatypes:

If the declared type contains the string “INT” then it is assigned INTEGER affinity.

If the declared type of the column contains any of the strings “CHAR”, “CLOB”, or “TEXT” then that column has TEXT affinity. Notice that the type VARCHAR contains the string “CHAR” and is thus assigned TEXT affinity.

If the declared type for a column contains the string “BLOB” or if no type is specified then the column has affinity NONE.

If the declared type for a column contains any of the strings “REAL”, “FLOA”, or “DOUB” then the column has REAL affinity.

Otherwise, the affinity is NUMERIC.

So the foo, bar and baz columns above, being unrecognized, would have received an affinity of NUMERIC, and would try to convert whatever was inserted into them into a numeric format. You can read more about the ins and outs of type affinities in the docs, but the main thing to grasp up front is that, syntax-wise, you can usually write whatever you are comfortable with and it will probably work; just keep in mind that affinities are being set, and you will know where to look when you see something strange happening. For the most part this system of affinities does a good job of not violating your expectations, regardless of what database you are used to using.

The other thing to get is that SQLite determines the datatype from the values themselves. Anything in quotes is assumed to be a string; unquoted digits are integers, or, if they have a decimal point, a “real”; a blob is a string of hex digits prefixed with an x: x’00ff’.

So the safest/easiest thing might just be to leave the column definitions out altogether so they will all have an affinity of NONE, and let the values speak for themselves.
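A quick way to watch this happen is SQLite’s typeof() function. With no column types declared (so every column has an affinity of NONE), the stored type simply follows the value:

sqlite> create table samples(a, b, c, d);
sqlite> insert into samples values ('30', 30, 30.0, x'00ff');
sqlite> select typeof(a), typeof(b), typeof(c), typeof(d) from samples;
text|integer|real|blob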

The rest of my learning about SQLite is really a grab bag of little goodies:

Getting meta info about tables, indexes or the database itself is done with a pragma statement.
For example, if I want information about the people table:

sqlite> pragma table_info(people);
0|id|integer|0||1
1|name|foo(30)|0||0
2|age|bar(100000000)|0||0
3|awesomeness|baz|0||0

You can get that same list of info within Ruby like so (after running “gem install sqlite3”):

require 'sqlite3'
@db = SQLite3::Database.new("cats.db")
table_name = "cats"
@db.table_info(table_name)

A complete list of pragma statements can be found in the docs.

To open or create a database simply run sqlite3 with the name of the file:

mike@sleepycat:~☺ sqlite3 cats.db

And finally, if you have a file with SQL statements you would like to run on a database:

mike@sleepycat:~☺ sqlite3 cats.db < insert_all_the_cats.sql

It’s been good to get to know SQLite3 a little better. Before this I had only really come in contact with it through my Rails development work and knew it only as the in-memory test database, or the one I would use when I couldn’t be bothered to set up a “real” database. The more I look at it, the more it seems like a really powerful and useful tool.