invalid value for parameter “TimeZone”

While working on standing up a Rails app I ran into a pretty weird error that really had me scratching my head.

[mike@longshot identity-idp]$ rake db:create
PG::InvalidParameterValue: ERROR:  invalid value for parameter "TimeZone": "UTC"
: SET time zone 'UTC'
Couldn't create database for {"pool"=>5, "timeout"=>5000, "host"=>"localhost", "adapter"=>"postgresql", "encoding"=>"utf8", "database"=>"upaya_development", "port"=>5432}
rake aborted!
ActiveRecord::StatementInvalid: PG::InvalidParameterValue: ERROR:  invalid value for parameter "TimeZone": "UTC"
: SET time zone 'UTC'

PG::InvalidParameterValue: ERROR:  invalid value for parameter "TimeZone": "UTC"

Tasks: TOP => db:create
(See full trace by running task with --trace)

The output of timedatectl status looked OK, but just to be sure, I set the timezone to EDT. No difference. When I tried rake db:migrate I got a far more instructive error:

[mike@longshot identity-idp]$ rake db:migrate
rake aborted!
ArgumentError: Invalid Timezone: UTC
/home/mike/projects/identity-idp/config/environment.rb:5:in `<top (required)>'
TZInfo::InvalidTimezoneIdentifier: Expected 44 bytes reading '/usr/share/zoneinfo/UTC', but got 0 bytes
/home/mike/projects/identity-idp/config/environment.rb:5:in `<top (required)>'
TZInfo::InvalidZoneinfoFile: Expected 44 bytes reading '/usr/share/zoneinfo/UTC', but got 0 bytes
/home/mike/projects/identity-idp/config/environment.rb:5:in `<top (required)>'
Tasks: TOP => log => environment
(See full trace by running task with --trace)

/usr/share/zoneinfo/UTC is 0 bytes? A quick look shows that to be true, and the package that supplied this file is tzdata.

[mike@longshot identity-idp]$ cat /usr/share/zoneinfo/UTC 
[mike@longshot identity-idp]$ ls -l /usr/share/zoneinfo/UTC 
-rw-r--r-- 6 root root 0 Jul 26 18:01 /usr/share/zoneinfo/UTC
[mike@longshot identity-idp]$ pacman -Qo /usr/share/zoneinfo/UTC
/usr/share/zoneinfo/UTC is owned by tzdata 2017b-1

That doesn’t seem right, let’s reinstall…

[mike@longshot identity-idp]$ sudo pacman -S tzdata
warning: tzdata-2017b-1 is up to date -- reinstalling
resolving dependencies...
looking for conflicting packages...

Packages (1) tzdata-2017b-1

Total Installed Size:  1.81 MiB
Net Upgrade Size:      0.00 MiB

:: Proceed with installation? [Y/n] y
(1/1) checking keys in keyring                                               [###########################################] 100%
(1/1) checking package integrity                                             [###########################################] 100%
(1/1) loading package files                                                  [###########################################] 100%
(1/1) checking for file conflicts                                            [###########################################] 100%
(1/1) checking available disk space                                          [###########################################] 100%
:: Processing package changes...
(1/1) reinstalling tzdata                                                    [###########################################] 100%
:: Running post-transaction hooks...
(1/1) Arming ConditionNeedsUpdate...
[mike@longshot identity-idp]$ ls -l /usr/share/zoneinfo/UTC 
-rw-r--r-- 6 root root 127 Mar 24 12:38 /usr/share/zoneinfo/UTC
[mike@longshot identity-idp]$ cat /usr/share/zoneinfo/UTC 
TZif2UTCTZif2�UTC
UTC0

After that, creating and migrating worked again without problems. I’m not sure what happened there, but hopefully this will prevent people (or future me) from wasting a bunch more time on it.


Installing Ruby 2.3 on Archlinux

I’ve been running Archlinux for a few years now. I ran Ubuntu for 8 years before that and frequently ran into issues with old packages, which eventually spurred me to jump to Arch, where I get to deal with issues in new packages instead. “Pick your poison” as the saying goes.

Today I needed to get an app running that required Ruby 2.3.3 and, true to form, the poison of the day was all about the libraries installed on my system being too new to compile Ruby 2.3.

I’m a long-time user of Rbenv. It’s nice and clean, and its ruby-build plugin makes installing new versions of Ruby as easy as rbenv install 2.3.3… which is exactly what kicked off the fun.

[mike@longshot identity-idp]$ rbenv install 2.3.3
Downloading ruby-2.3.3.tar.bz2...
-> https://cache.ruby-lang.org/pub/ruby/2.3/ruby-2.3.3.tar.bz2
Installing ruby-2.3.3...
*** Error in `./miniruby': malloc(): memory corruption: 0x00007637497798d8 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x72bdd)[0x66e27048fbdd]
...
./miniruby(+0x2470b)[0x80e03b1670b]
/usr/lib/libc.so.6(__libc_start_main+0xea)[0x66e27043d4ca]
./miniruby(_start+0x2a)[0x80e03b1673a]
======= Memory map: ========
80e03af2000-80e03de0000 r-xp 00000000 00:27 154419
...
66e2715e7000-66e2715e8000 rw-p 00000000 00:00 0
763748f81000-763749780000 rw-p 00000000 00:00 0                          [stack]

BUILD FAILED (Arch Linux using ruby-build 20170726-9-g86909bf)

Inspect or clean up the working tree at /tmp/ruby-build.20170828122031.16671
Results logged to /tmp/ruby-build.20170828122031.16671.log

Last 10 log lines:
generating enc.mk
creating verconf.h
./template/encdb.h.tmpl:86:in `<main>': undefined local variable or method `encidx' for main:Object (NameError)
	from /tmp/ruby-build.20170828122031.16671/ruby-2.3.3/lib/erb.rb:864:in `eval'
	from /tmp/ruby-build.20170828122031.16671/ruby-2.3.3/lib/erb.rb:864:in `result'
	from ./tool/generic_erb.rb:38:in `<main>'
make: *** [uncommon.mk:818: encdb.h] Error 1
make: *** Waiting for unfinished jobs....
verconf.h updated
make: *** [uncommon.mk:655: enc.mk] Aborted (core dumped)

The issues here are twofold: Ruby 2.3 won’t build with GCC 7 or with OpenSSL 1.1, and Arch as it stands today has both by default.

[mike@longshot ~]$ openssl version
OpenSSL 1.1.0f  25 May 2017
[mike@longshot ~]$ gcc -v
gcc version 7.1.1 20170630 (GCC)

To solve the OpenSSL problem we need 1.0 installed (sudo pacman -S openssl-1.0, but it’s probably installed already), and we need to tell ruby-build where to find both the header files, and the openssl directory itself.

Helping compilers find header files is the job of pkg-config. On Arch the config files that do that typically live in /usr/lib/pkgconfig/, but in this case we want it to look at the pkg-config file in /usr/lib/openssl-1.0/pkgconfig before searching there. To do that we assign a colon-delimited set of paths to PKG_CONFIG_PATH.

Then we need to tell Ruby where the openssl directory is which is done via RUBY_CONFIGURE_OPTS.

[mike@longshot ~]$ PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig/:/usr/lib/pkgconfig/ RUBY_CONFIGURE_OPTS=--with-openssl-dir=/usr/lib/openssl-1.0/ rbenv install 2.3.3
Downloading ruby-2.3.3.tar.bz2...
-> https://cache.ruby-lang.org/pub/ruby/2.3/ruby-2.3.3.tar.bz2
Installing ruby-2.3.3...

BUILD FAILED (Arch Linux using ruby-build 20170726-9-g86909bf)

Inspect or clean up the working tree at /tmp/ruby-build.20170829103308.24191
Results logged to /tmp/ruby-build.20170829103308.24191.log

Last 10 log lines:
  R8: 0x0000016363058550  R9: 0x0000016362cc3dd8 R10: 0x0000016362fafe80
 R11: 0x000000000000001b R12: 0x0000000000000031 R13: 0x0000016363059a40
 R14: 0x0000000000000000 R15: 0x00000163630599a0 EFL: 0x0000000000010202

-- C level backtrace information -------------------------------------------
linking static-library libruby-static.a
ar: `u' modifier ignored since `D' is the default (see `U')
verifying static-library libruby-static.a
make: *** [uncommon.mk:655: enc.mk] Segmentation fault (core dumped)
make: *** Waiting for unfinished jobs....

With our OpenSSL errors fixed, we now get the segfault that comes from GCC 7. So we need to install an earlier gcc (sudo pacman -S gcc5) and add two more variables (CC and CXX) to specify the C and C++ compilers we want used.

[mike@longshot ~]$ CC=gcc-5 CXX=g++-5 PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig/:/usr/lib/pkgconfig/ RUBY_CONFIGURE_OPTS=--with-openssl-dir=/usr/lib/openssl-1.0/ rbenv install 2.3.3
Downloading ruby-2.3.3.tar.bz2...
-> https://cache.ruby-lang.org/pub/ruby/2.3/ruby-2.3.3.tar.bz2
Installing ruby-2.3.3...
Installed ruby-2.3.3 to /home/mike/.rbenv/versions/2.3.3

With that done, you should now have a working Ruby 2.3:

[mike@longshot ~]$ rbenv global 2.3.3
[mike@longshot ~]$ ruby -e "puts 'hello world'"
hello world

ArangoDB and GraphQL

For a while now I’ve been wondering about what might be the minimal set of technologies that allows me to tackle the widest range of projects. The answer I’ve arrived at, for backend development at least, is GraphQL and ArangoDB.

Both of these tools expand my reach as a developer. Projects involving integrations, multiple clients and complicated data that would have been extremely difficult are now within easy reach.

But the minimal set idea is that I can enjoy this expanded range while juggling far fewer technologies than before. Tools that apply in more situations mean fewer things to learn, fewer moving parts and more depth in the learning I do.

While GraphQL and ArangoDB are both interesting technologies individually, it’s in using them together that I’ve been able to realize those benefits; one of those moments where the whole is different from the sum of its parts.

Backend Minimalism

My embrace of Javascript has definitely been part of creating that minimal set. A single language for both front and back end development has been a big part of simplifying my tech stack. Both GraphQL and ArangoDB can be used in many languages, but Javascript support is what you might describe as “first among equals” for both projects.

GraphQL can replace, and for me has replaced, server side frameworks like Rails or Django, leaving me with a handful of Javascript functions and more modular, testable code.

GraphQL also replaces ReST, freeing me from thinking about HATEOAS, bike-shedding over the vagaries of ReST, or needing pages and pages of JSON API documentation to save me from bike-shedding over the vagaries of ReST.

ArangoDB has also reduced the number of things I need to know. For a start it has removed the “need” for an ORM (no relational database, no need for Object Relational Mapping), which never really delivered on its promise to free you from knowing the underlying SQL.

More importantly it’s replaced not just NoSQL databases with a razor-thin set of capabilities like Mongodb (which stores nested documents but can’t do joins) or Neo4j (which does joins but can’t store nested documents), but also general purpose databases like MySQL or Postgres. I have one query language to learn, and one database whose quirks and characteristics I need to know.

It’s also replaced the deeply unpleasant process of relational data modeling with a seamless blend of documents and graphs that make modeling even really ugly connected datasets anticlimactic. As a bonus, in moving the schema outside the database GraphQL lets us enjoy all the benefits of a schema (making sure there is at least some structure I can rely on) and all the benefits of schemalessness (flexibility, ease of change).

Tools that actually reduce the number of things you need to know don’t come along very often. My goal here is to give a sense of what it looks like to use these two technologies together, and hopefully admiring the trees can let us appreciate the forest.

Show me the code

First we need some data to work with. ArangoDB’s administrative interface has some example graphs it can create, so let’s use one to explore.

[Screenshot: the example graphs offered by ArangoDB’s administrative interface]

If we select the “knows” graph, we get a simple graph with 5 vertices.

[Screenshot: the “knows” graph]

This graph is going to be the foundation for our little exploration.

Next, the only really meaningful information these vertices have is a name attribute. If we want to create a GraphQL type that represents one of these objects, it would look like this:

  let Person = new GraphQLObjectType({
    name: 'Person',
    fields: () => ({
      name: {
        type: GraphQLString
      }
    })
  })

Now that we have a type that describes what a Person object looks like, we can use it in a schema. This schema has a field called person, which has two attributes: type and resolve.

let schema = new GraphQLSchema({
    query: new GraphQLObjectType({
      name: 'Query',
      fields: () => ({
        person: {
          type: Person,
          resolve: () => {
            return {name: 'Mike'}
          },
        }
      })
    })
  })

The resolve is a function that will be run whenever graphql is asked to produce a person object. type describes the object that the resolve function returns, which in this case is our Person type.

To see if this all works we can write a test using Jest.

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList,
  GraphQLNonNull
} from 'graphql'

describe('returning a hardcoded object that matches a type', () => {

  let Person = new GraphQLObjectType({
    name: 'Person',
    fields: () => ({
      name: {
        type: GraphQLString
      }
    })
  })

  let schema = new GraphQLSchema({
    query: new GraphQLObjectType({
      name: 'Query',
      fields: () => ({
        person: {
          type: Person,
          resolve: () => {
            return {name: 'Mike'}
          },
        }
      })
    })
  })

  it('lets you ask for a person', async () => {

    let query = `
      query {
        person {
          name
        }
      }
    `;

    let { data } = await graphql(schema, query)
    expect(data.person).toEqual({name: 'Mike'})
  })

})

This test passes which tells us that we got everything wired together properly, and the foundation laid to talk to ArangoDB.

First we’ll use arangojs and create a db instance and then a function that allows us to get a person using their name.

//src/database.js
import arangojs, { aql } from 'arangojs'

export const db = arangojs({
  url: `http://${process.env.ARANGODB_USER}:${process.env.ARANGODB_PASSWORD}@127.0.0.1:8529`,
  databaseName: 'knows'
})

export async function getPersonByName (name) {
  let query = aql`
      FOR person IN persons
        FILTER person.name == ${ name }
          LIMIT 1
          RETURN person
    `
  let results = await db.query(query)
  return results.next()
}
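As a quick sanity check (just a sketch; it assumes the example “knows” graph has been created and the connection details above are valid), calling the function directly should hand back the full ArangoDB document for that person, standard _id/_key fields and all:

import { getPersonByName } from '../src/database'

// The example graph keeps people in a "persons" collection, so the
// document ids follow the persons/<key> convention.
getPersonByName('Eve').then(person => {
  console.log(person)
  // => { _key: 'eve', _id: 'persons/eve', _rev: '...', name: 'Eve' }
})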

Now let’s use that function with our schema to retrieve real data from ArangoDB.

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList,
  GraphQLNonNull
} from 'graphql'
import {
  db,
  getPersonByName
} from '../src/database'

describe('queries', () => {

  it('lets you ask for a person from the database', async () => {

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        }
      })
    })

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          person: {
            args: { //person now accepts args
              name: { // the arg is called "name"
                type: new GraphQLNonNull(GraphQLString) // name is a string & mandatory
              }
            },
            type: Person,
            resolve: (root, args) => {
              return getPersonByName(args.name)
            },
          }
        })
      })
    })

    let query = `
        query {
          person(name: "Eve") {
            name
          }
        }
      `

    let { data } = await graphql(schema, query)
    expect(data.person).toEqual({name: 'Eve'})
  })
})

Here we have modified our schema to accept a name argument when asking for a person. We access the name via the args object and pass it to our database function to go get the matching person from Arango.

Let’s add a new database function to get the friends of a user given their id.
What’s worth pointing out here is that we are using ArangoDB’s AQL traversal syntax. It allows us to do a graph traversal across outbound edges and get the vertex on the other end of each edge.

export async function getFriends (id) {
  let query = aql`
      FOR vertex IN OUTBOUND ${id} knows
        RETURN vertex
    `
  let results = await db.query(query)
  return results.all()
}
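Calling it directly (again just a sketch, using the persons/<key> style ids from the example graph) gives back the documents for the people Eve knows:

import { getFriends } from '../src/database'

getFriends('persons/eve').then(friends => {
  // In the example graph Eve knows Alice and Bob (in no guaranteed order).
  console.log(friends.map(friend => friend.name)) // => [ 'Alice', 'Bob' ]
})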

Now that we have that function, instead of adding it to the schema, we add a field to the Person type. In the resolve for our new friends field we are going to use the root argument to get the id of the current person object and then use our getFriends function to do the traversal that retrieves that person’s friends.

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        },
        friends: {
          type: new GraphQLList(Person),
          resolve(root) {
            return getFriends(root._id)
          }
        }
      })
    })

What’s interesting is that because of GraphQL’s recursive nature, this change lets us query for friends:

        query {
          person(name: "Eve") {
            name
            friends {
              name
            }
          }
        }

and also ask for friends of friends (and so on) like this:

        query {
          person(name: "Eve") {
            name
            friends {
              name
              friends {
                name
              }
            }
          }
        }

We can show that with a test.

import {
  graphql,
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList,
  GraphQLNonNull
} from 'graphql'
import {
  db,
  getPersonByName,
  getFriends
} from '../src/database'

describe('queries', () => {

  it('returns friends of friends', async () => {

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        },
        friends: {
          type: new GraphQLList(Person),
          resolve(root) {
            return getFriends(root._id)
          }
        }
      })
    })

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          person: {
            args: {
              name: {
                type: new GraphQLNonNull(GraphQLString)
              }
            },
            type: Person,
            resolve: (root, args) => {
              return getPersonByName(args.name)
            },
          }
        })
      })
    })

    let query = `
        query {
          person(name: "Eve") {
            name
            friends {
              name
              friends {
                name
              }
            }
          }
        }
      `

    let result = await graphql(schema, query)
    let { friends } = result.data.person
    let foaf = [].concat(...friends.map(friend => friend.friends))
    expect([{name: 'Charlie'},{name: 'Dave'},{name: 'Bob'}]).toEqual(expect.arrayContaining(foaf))
  })

})

This test runs a query three levels deep, walking the entire graph. Because we can ask for any combination of the things our types define, we have a whole lot of flexibility with very little code. The code that’s there is just a few simple functions, modular and easy to test.

But what did we trade away to get all that? If we look at the queries that get sent to Arango with tcpdump we can see how that sausage was made.

// getPersonByName('Eve') from the person resolver in our schema 
{"query":"FOR person IN persons
  FILTER person.name == @value0
  LIMIT 1 RETURN person","bindVars":{"value0":"Eve"}}
// getFriends('persons/eve') in Person type -> returns Bob & Alice.
{"query":"FOR vertex IN OUTBOUND @value0 knows
  RETURN vertex","bindVars":{"value0":"persons/eve"}}
// now a new request for each friend:
// getFriends('persons/bob')
{"query":"FOR vertex IN OUTBOUND @value0 knows
  RETURN vertex","bindVars":{"value0":"persons/bob"}}
// getFriends('persons/alice')
{"query":"FOR vertex IN OUTBOUND @value0 knows
  RETURN vertex","bindVars":{"value0":"persons/alice"}}

What we have here is our own version of the famous N+1 problem. If we were to add more people to this graph things would get out of hand quickly.

Facebook, which has been using GraphQL in production for years, is probably even less excited about the prospect of N+1 queries battering their database than we are. So what are they doing to solve this?

Using Dataloader

Dataloader is a small library released by Facebook that solves the N+1 problem by cleverly leveraging the way promises work. To use it, we need to give it a batch loading function and then replace our direct database calls with calls to Dataloader’s load method in all our resolves.

What, you might ask, is a batch loading function? The dataloader documentation offers that “A batch loading function accepts an Array of keys, and returns a Promise which resolves to an Array of values.”

We can write one of those.

async function getFriendsByIDs (ids) {
  let query = aql`
    FOR id IN ${ ids }
      let friends = (
        FOR vertex IN OUTBOUND id knows
          RETURN vertex
      )
      RETURN friends
  `
  let response = await db.query(query)
  return response.all()
}
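The shape of that function matters: Dataloader expects one value back for every key, in the same order as the keys it was given, which the query above preserves. A quick sketch of calling it directly shows the array-in, array-of-arrays-out contract:

import { getFriendsByIDs } from '../src/database'

getFriendsByIDs(['persons/alice', 'persons/bob']).then(result => {
  // One entry per id, in the same order as the ids:
  // result[0] is Alice's friends, result[1] is Bob's friends.
  console.log(result.map(friends => friends.map(friend => friend.name)))
})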

We can then use that in a new test.

import {
  graphql,
  GraphQLObjectType,
  GraphQLString,
  GraphQLList
} from 'graphql'
import DataLoader from 'dataloader'
import {
  db,
  getFriendsByIDs
} from '../src/database'
import schema from '../src/schema'

describe('Using dataloader', () => {

  it('returns friends of friends', async () => {

    let Person = new GraphQLObjectType({
      name: 'Person',
      fields: () => ({
        name: {
          type: GraphQLString
        },
        friends: {
          type: new GraphQLList(Person),
          resolve(root, args, context) {
            return context.FriendsLoader.load(root._id)
          }
        }
      })
    })

    let query = `
        query {
          person(name: "Eve") {
            name
            friends {
              name
              friends {
                name
              }
            }
          }
        }
      `
    const FriendsLoader = new DataLoader(getFriendsByIDs)
    let result = await graphql(schema, query, {}, { FriendsLoader })
    let { person } = result.data
    expect(person.name).toEqual('Eve')
    expect(person.friends.length).toEqual(2)
    let names = person.friends.map(friend => friend.name)
    expect(names).toContain('Alice', 'Bob')
  })

})

The key section of the above test is this:

    const FriendsLoader = new DataLoader(getFriendsByIDs)
    //                         schema, query, root, context
    let result = await graphql(schema, query, {}, { FriendsLoader })

The context object is passed as the fourth parameter to the graphql function which is then available as the third parameter in every resolve function. With our FriendsLoader attached to the context object, you can see us accessing it in the resolve function on the Person type.

Let’s see what effect that batch loading has on our queries.

// getPersonByName('Eve') from the person resolver in our schema 
{"query":"FOR person IN persons
  FILTER person.name == @value0
  LIMIT 1 RETURN person","bindVars":{"value0":"Eve"}}
// getFriendsByIDs(["persons/eve"]) -> returns Bob & Alice.
{"query":"FOR id IN @value0
   let friends = (
    FOR vertex IN  OUTBOUND id knows
      RETURN vertex
    )
  RETURN friends","bindVars":{"value0":["persons/eve"]}}
// getFriendsByIDs(["persons/alice","persons/bob"])
{"query":"FOR id IN @value0
   let friends = (
    FOR vertex IN  OUTBOUND id knows
      RETURN vertex
    )
  RETURN friends","bindVars":{"value0":["persons/alice","persons/bob"]}}

Now for a three level query (Eve, her friends, their friends) we are down to just 1 query per level and the N+1 problem is no longer a problem.

When it’s time to serve your data to the world, express-graphql supplies a middleware that we can pass our schema and loaders to like this:

import express from 'express'
import graphqlHTTP from 'express-graphql'
import schema from './schema'
import DataLoader from 'dataloader'
import { getFriendsByIDs } from '../src/database'

const FriendsLoader = new DataLoader(getFriendsByIDs)
const app = express()
app.use('/graphql', graphqlHTTP({ schema, context: { FriendsLoader }}))
app.listen(3000)
// http://localhost:3000/graphql is up and running!
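With that running, any HTTP client can talk to the endpoint by POSTing JSON. A minimal sketch using fetch (assuming the schema from the earlier examples) looks like this:

fetch('http://localhost:3000/graphql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: `
      query {
        person(name: "Eve") {
          name
          friends { name }
        }
      }
    `
  })
})
  .then(response => response.json())
  .then(({ data }) => console.log(data.person))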

What we just did

With just those few code examples we’ve built a backend system that provides a query-able API for clients backed by a graph database. Growing it would look like adding a few more functions and a few more types. The code stays modular and testable. Dataloader has ensured that we didn’t even pay a performance penalty for that.

A perfect combination

While geeking out on the technology is fun, it loses sight of what I think is the larger point: The design of both GraphQL and ArangoDB allow you to combine and recombine some really simple primitives to tackle anything you can think of.

With ArangoDB, it’s all just documents; whether you use them like that, or treat them as key/value pairs or a graph, is up to you. While this approach is marketed as a “multi-model” database, the term is unfortunate since it makes the database sound like it’s trying to do lots of things instead of leveraging some fundamental similarity between these types of data. That similarity becomes the “primitive” that makes all this flexibility possible.

For GraphQL, my application is just a bunch of functions in an Abstract Syntax Tree which get combined and recombined by client queries. The parser and execution engine take care of what gets called when.

In each case what I need to understand is simple; the behaviour I can produce is complex.

I’m still honing my minimal set for front end development, but for backend development this is now how I build. These days I’m refocusing my learning to go narrow and deep and it feels good. Infinite width never felt sustainable. It’s not apparent at first, but once that burden is lifted off your shoulders you will realize how heavy it was.

Working with the Google Vision API

I remember hearing a story about a developer whose contract with the military specified the number of kilos of documentation that were required to accompany the system they were building. I think of that story from time to time when I use Google products.

Google’s Vision API gives access to legit state-of-the-art Artificial Intelligence and is amazing for extracting text from images, but a concise modern example doesn’t seem to exist in spite of the huge volume of documentation.

The example they give is in the classic callback style:

var vision = require('@google-cloud/vision');

var visionClient = vision({
  projectId: 'grape-spaceship-123',
  keyFilename: '/path/to/keyfile.json'
});

visionClient.detectText('./image.jpg', function(err, text) {
  // text = [
  //   'This was text found in the image',
  //   'This was more text found in the image'
  // ]
});

With all that has been written about the inversion-of-control problems of callbacks, and with ES2015 support nearly complete and in wide use thanks to Babel, examples like this feel distinctly retro.

Also painful for anyone working with Docker is that the authentication appears to require me to include a keyfile.json somewhere in my container, where what I actually want is to store that stuff in the environment.

After a bit of experimentation, it turns out that the google-cloud-node library doesn’t let us down. It’s filled with all the promisey goodness we scripters-of-java have come to expect. If you are using jest this test should get you going:

import Vision from '@google-cloud/vision'

describe('Google Vision client', () => {

  it('successfully connects', async () => {
    let client = Vision({
      projectId: process.env.GOOGLE_VISION_PROJECT_ID,
      credentials: {
        private_key: process.env.GOOGLE_VISION_PRIVATE_KEY.replace(/\\n/g, '\n'),
        client_email: process.env.GOOGLE_VISION_CLIENT_EMAIL
      }
    })

    let [[text, ...words], annotations] = await client.detectText(__dirname + '/data/foo.jpg')
    expect(text).toEqual("foo bar\n")
    expect(words).toContain("foo", "bar")
  })

})

The project id is easy enough to find, but the environment variables used to avoid the keyfile.json are actually found within the keyfile.

{
  "type": "service_account",
  "project_id": "...",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "...@developer.gserviceaccount.com",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/..."
}

The keyfile above was created by going to the credentials console and following the instructions here.
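If you already have the keyfile and just want those values out of it, a throwaway node script (hypothetical; it assumes the keyfile is saved beside it as keyfile.json) will print them in an environment-variable-friendly form:

// print-env.js: pull the values the Vision client needs out of the keyfile
const keyfile = require('./keyfile.json')

console.log(`GOOGLE_VISION_PROJECT_ID=${keyfile.project_id}`)
console.log(`GOOGLE_VISION_CLIENT_EMAIL=${keyfile.client_email}`)
// JSON.stringify keeps the newlines escaped as \n, which is what we
// un-escape with replace(/\\n/g, '\n') in the test above.
console.log(`GOOGLE_VISION_PRIVATE_KEY=${JSON.stringify(keyfile.private_key)}`)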

Note the replace(/\\n/g, '\n') happening on the GOOGLE_VISION_PRIVATE_KEY. This comes from issue 1173, and without it you end up with the error Error: error:0906D06C:PEM routines:PEM_read_bio:no start line. Replacing newlines with newlines seems silly, but you gotta do what you gotta do.
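What’s actually going on is that the value stored in the environment contains the literal two-character sequence \n, while the PEM parser needs real newline characters. A tiny sketch (with a made-up key) of the difference:

// Hypothetical value as it comes out of the environment: literal "\n" sequences
let raw = '-----BEGIN PRIVATE KEY-----\\nMIIEv...\\n-----END PRIVATE KEY-----\\n'

let fixed = raw.replace(/\\n/g, '\n')
// fixed now contains real line breaks, which is what the PEM parser expects:
// -----BEGIN PRIVATE KEY-----
// MIIEv...
// -----END PRIVATE KEY-----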

The last missing piece is an image with some text. I created a quick test image in Gimp with the words “foo bar”:

[Image: a test image containing the words “foo bar”]

While it wasn’t clear at first glance, google-cloud-node is a pretty sophisticated and capable library, despite being theoretically “alpha”. Google is remaking itself as “the AI company” and the boundary pushing stuff it’s doing means I’m probably going to be using this client a lot. I really was hoping to find a small amount of the “right” documentation instead of the huge volume of partial answers spread across their sprawling empire. Hopefully this is a useful contribution towards that reality.

Dealing with supernodes in ArangoDB

About a year ago I wrote about data modeling in ArangoDB. The main takeaway is to avoid the temptation to make unnecessary vertices (and all the attendant edges), since traversing high-degree vertices (vertices with lots of edges pointing at them) is an expensive process.

Graphs are cool, and it’s easy to forget that ArangoDB is also a great document store. Treating it as such means “embedding”, to borrow a term from MongoDB, which lets you keep your traversals fast.

But while good data modeling can prevent you from creating some high-degree vertices, you will run into them eventually, and ArangoDB’s new “vertex-centric” indexes are a feature that is there for exactly those moments.

First we can try and get a sense of the impact these high-degree vertices have on a traversal. To do that I wrote a script that would generate a star graph starting with three “seed” vertices, a start, middle, and end.

The goal was to walk across the middle vertex to get from start to end while changing the number of vertices connected to the middle.

// the seed vertices
let seedVertices = [
  {name: 'start', _key: '1', type: "seed"},
  {name: 'middle', _key: '2', type: "seed"},
  {name: 'end', _key: '3', type: "seed"}
]
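The rest of the script just wires those up and piles extra vertices onto the middle one. The original script isn’t shown here, but a rough sketch of the idea with arangojs (hypothetical; it assumes the vertices and edges collections already exist, matching the names used in the traversal query below) looks something like this:

import arangojs from 'arangojs'

const db = arangojs({ url: 'http://127.0.0.1:8529', databaseName: 'starGraph' })
const vertices = db.collection('vertices')
const edges = db.collection('edges')

async function buildStarGraph (fillerCount) {
  // seedVertices is the array defined above
  await Promise.all(seedVertices.map(vertex => vertices.save(vertex)))

  // the "seed" edges form the path we want to traverse: start -> middle -> end
  await edges.save({ type: 'seed', _from: 'vertices/1', _to: 'vertices/2' })
  await edges.save({ type: 'seed', _from: 'vertices/2', _to: 'vertices/3' })

  // everything else just crowds around the middle vertex
  for (let i = 0; i < fillerCount; i++) {
    let filler = await vertices.save({ name: `filler_${i}` })
    await edges.save({ type: 'filler', _from: filler._id, _to: 'vertices/2' })
  }
}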

With just 10 vertices, this is nowhere near deserving the name “supernode”, but it’s pretty clear why these things are called star-graphs.

[Image: a small star graph. A baby “super node”.]

Next we crank up the number of vertices so we can see the effect on traversal time.

[Image: traversal times across the middle vertex as the number of connected vertices grows]

By the time we get to a vertex surrounded by 100,000 to 1,000,000 other vertices we are starting to get into full-blown supernode territory. You can see that once it is sorting through a million incident edges, ArangoDB takes 4.3 seconds to traverse across that middle vertex.

A “vertex-centric” index is one that includes either _to or _from plus some other edge attribute. In this case I’ve added a type attribute, which I’ll combine with _to to make my index. (Note that if I were allowing “any” as a direction I would need a second index that combines type and _from.)

[Screenshot: creating a hash index on the type and _to attributes of the edges collection]
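The screenshot shows the index being created through the web interface; the same index can be created from arangosh (a sketch, assuming the edge collection is called edges as it is in the query below):

// arangosh: a hash index on type + _to speeds up filtered traversals over inbound edges
db.edges.ensureIndex({ type: "hash", fields: [ "type", "_to" ] });

// if "any" is allowed as a direction, the _from variant is needed as well
db.edges.ensureIndex({ type: "hash", fields: [ "type", "_from" ] });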

ArangoDB calculates the path and offers it to you as you do a traversal. You can access the path by declaring a variable to receive it. In the query below, p contains the path. If we use the ALL array comparison operator on p.edges to say that all the edges in the path should have a type of “seed”, that should be enough to get Arango to use our new index.

    FOR v,e,p IN 1..2 ANY 'vertices/1' edges
      FILTER p.edges[*].type ALL == 'seed' &&  v.name == 'end'
        RETURN v

The selectivity score shown by explain doesn’t leave you very hopeful that Arango will use the index…

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields               Ranges
  2   hash   edges        false    false         0.00 %   [ `type`, `_to` ]    base INBOUND
  2   edge   edges        false    false        50.00 %   [ `_from`, `_to` ]   base OUTBOUND

but having our query execute in 0.2 milliseconds instead of 4.3 seconds is a pretty good indication it’s working.

[Screenshot: the explained query]

Back to modeling

For me, this little experiment has underscored the importance of good data modeling. You don’t have to worry about the number of edges if you don’t create a vertex in the first place. If you are conservative about what you are willing to make a vertex, and make good use of indexes, ArangoDB is going to be able to gracefully handle some pretty hefty data.

With vertex-centric indexes, and other features like the new Smartgraphs, ArangoDB has gone from being a great document database with a trick up its sleeve (Joins!) to being a really solid graph database, and there are new features landing regularly. I’m curious to see where they go next.

Making requests in vanilla js with Apollo

There are lots of good reasons to be running GraphQL on the server. It’s clean, needs no ORMs or frameworks, and has some interesting security properties too. But just because you are rockin’ the new hotness on the server side doesn’t mean you want it on the client side too. Sometimes the right thing is the simplest thing that can possibly work.

The Apollo Client is a GraphQL client made by the people behind Meteor. It aims to be an advanced and capable client that plays nice with the rest of the ecosystem. It has a lot going on, and sadly doesn’t seem to spend much time advertising that it’s actually a pretty great fit for those “simplest thing that can possibly work” moments as well.

Installing it is roughly what you might expect, but you also need the graphql-tag library so you can create queries with Javascript’s new tagged template literals.

npm install --save apollo-client graphql-tag

So here, in all its glory, is the “simplest thing that can possibly work”:

import ApolloClient from 'apollo-client'
import gql from 'graphql-tag'

const client = new ApolloClient();

let query = gql`
  query {
    foo {
      bar
    }
  }
`
client.query({query}).then((results) => {
  //do something useful
})

I think this is actually even simpler than Lokka, which bills itself as the “Simple JavaScript Client for GraphQL”.

If you need to specify your endpoint as something other than the host the js came from, then you get to add just a little extra:

import ApolloClient, { createNetworkInterface } from 'apollo-client'

const opts = {uri: 'http://example.com:8080/graphql'}
const networkInterface = createNetworkInterface(opts)
const client = new ApolloClient({
  networkInterface,
});

But simple doesn’t mean we are restricted to queries only. Mutations can be simple too:

let mutation = gql`
  mutation ($foo: [FooInput] $bar: String!) {
    addFoo(
      foo: $foo
      bar: $bar
    ){
      foo
      bar
    }
  }
`

client.mutate({mutation, variables: {foo: [1,2,3], bar: "baz"}}).then((results) => {
  //do something with result
})

Obviously you will need the server side schema to support that, but that is all that is needed on the client.

Apollo has a tonne of features and integrates with Redux nicely (it does caching with its own internal Redux store unless you want it to use yours). While simplicity doesn’t appear to be its focus, the Apollo client is certainly capable of it. You’d just never guess that from the documentation. Hopefully this makes it a little easier to appreciate the simple side of Apollo.

Installing R-Studio on Ubuntu 16.10

Installing things on Linux is either really easy, or a yak shave with surprisingly little between those extremes.

It seems that Ubuntu 16.10 has removed Gstreamer 0.10 from the repos and replaced it with Gstreamer 1.0, which is great… until you need to install R-Studio.

While the R-Studio people are aiming to drop the Gstreamer dependency, for the moment, as of 16.10, installing it has fallen into the yak-shave category.

Installing R-Studio works fine, but if you try to run it (from the terminal) you will get the error:

rstudio: error while loading shared libraries: libgstreamer-0.10.so.0: cannot open shared object file: No such file or directory

We can see that it’s failing to load Gstreamer, but since it’s been removed from the Ubuntu repos fixing this will mean getting those packages elsewhere.

To start with, we can download the latest R-studio daily build and install it using dpkg:

$ wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-1.0.124-amd64.deb
$ sudo dpkg -i rstudio-1.0.124-amd64.deb

The dpkg command can also query the package to display information about it. If we use the uppercase I option we can confirm that this package requires exactly version 0.10 of libgstreamer:

dpkg -I rstudio-1.0.124-amd64.deb 
 new debian package, version 2.0.
 size 98840122 bytes: control archive=42847 bytes.
     554 bytes,    12 lines      control              
  163246 bytes,  1548 lines      md5sums              
     198 bytes,    10 lines   *  postinst             #!/bin/sh
     158 bytes,    10 lines   *  postrm               #!/bin/sh
 Package: rstudio
 Version: 1.0.124
 Section: devel
 Priority: optional
 Architecture: amd64
 Depends: libjpeg62, libedit2, libgstreamer0.10-0, libgstreamer-plugins-base0.10-0, libssl1.0.0,  libc6 (>= 2.7)
 Recommends: r-base (>= 2.11.1)
 Installed-Size: 526019
 Maintainer: RStudio <info@rstudio.com>
 Description: RStudio
  RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, and workspace management.

Debian (which Ubuntu is based on) has the old Gstreamer packages we need to satisfy those dependencies, so we can get them from there. If you need something other than AMD64, see here and here. If you have a 64-bit machine, you can download and install like this:

# download with wget
$ wget http://ftp.ca.debian.org/debian/pool/main/g/gstreamer0.10/libgstreamer0.10-0_0.10.36-1.5_amd64.deb
$ wget http://ftp.ca.debian.org/debian/pool/main/g/gst-plugins-base0.10/libgstreamer-plugins-base0.10-0_0.10.36-2_amd64.deb

# Now install with dpkg
$ sudo dpkg -i libgstreamer0.10-0_0.10.36-1.5_amd64.deb
$ sudo dpkg -i libgstreamer-plugins-base0.10-0_0.10.36-2_amd64.deb

While that solves R-Studio’s problems, we now have one of our own. We’ve purposefully installed old packages and don’t want Ubuntu’s package manager to enthusiastically upgrade them next time we update.
To resolve that problem we will put a hold on them with apt-mark:

$ sudo apt-mark hold libgstreamer-plugins-base0.10-0
libgstreamer-plugins-base0.10-0 set on hold.
$ sudo apt-mark hold libgstreamer0.10-0
libgstreamer0.10-0 set on hold.

And we can check the packages that are on hold with:

$ sudo apt-mark showhold
libgstreamer-plugins-base0.10-0
libgstreamer0.10-0

Hopefully that saves someone some Googling.
Now that that’s working, it’s time to play with some R!