Dealing with supernodes in ArangoDB

About a year ago I wrote about data modeling in ArangoDB. The main takeaway is to avoid the temptation to create unnecessary vertices (and all the attendant edges), since traversing a high-degree vertex (a vertex with lots of edges pointing at it) is an expensive process.

Graphs are cool and it’s easy to forget that ArangoDB is also a great document store. Treating it as such means “embedding”, to borrow a term from MongoDB, which lets you keep your traversals fast.

But while good data modeling can prevent you from creating some high-degree vertices, you will run into them eventually, and ArangoDB’s new “vertex-centric” indexes are a feature that exists for exactly those moments.

First we can try to get a sense of the impact these high-degree vertices have on a traversal. To do that I wrote a script that generates a star graph, starting with three “seed” vertices: a start, a middle, and an end.

The goal was to walk across the middle vertex to get from start to end while changing the number of vertices connected to the middle.

// the seed vertices
let seedVertices = [
  {name: 'start', _key: '1', type: "seed"},
  {name: 'middle', _key: '2', type: "seed"},
  {name: 'end', _key: '3', type: "seed"}
]
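The script then surrounds the middle vertex with a configurable number of filler vertices. A minimal sketch of what that generation step might look like in arangosh (the collection names “vertices” and “edges” and the “filler” type are my assumptions, not the original script):

// assumes the seed vertices above have already been saved into "vertices"
let n = 10; // number of filler vertices around the middle
for (let i = 0; i < n; i++) {
  let v = db.vertices.save({name: `filler_${i}`, type: "filler"});
  // every filler vertex gets an edge pointing at the middle vertex
  db.edges.save({_from: v._id, _to: "vertices/2", type: "filler"});
}
// connect start -> middle and middle -> end so there is a path to walk
db.edges.save({_from: "vertices/1", _to: "vertices/2", type: "seed"});
db.edges.save({_from: "vertices/2", _to: "vertices/3", type: "seed"});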

With just 10 vertices, this is nowhere near deserving the name “supernode”, but it’s pretty clear why these things are called star-graphs.

[Image: star_graph. A baby “supernode”.]

Next we crank up the number of vertices so we can see the effect on traversal time.

[Image: star_graph_traversal]

A vertex surrounded by 100,000 to 1,000,000 other vertices is getting into full-blown supernode territory, and once ArangoDB has to sort through a million incident edges it takes 4.3 seconds to traverse across that middle vertex.

A “vertex-centric” index is one that includes either _to or _from plus some other edge attribute. In this case I’ve added a type attribute, which I’ll combine with _to to make my index. (Note that if I were allowing “any” as a direction, I would need a second index combining type and _from.)

[Image: creating_a_hash_index]
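For reference, the same index can be created from arangosh; this is a sketch assuming the edge collection is named “edges”, matching the explain output further down:

db.edges.ensureIndex({ type: "hash", fields: ["type", "_to"] });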

ArangoDB calculates the path and offers it to you as you do a traversal. You can access the path by declaring a variable to receive it. In the query below, p contains the path. If we use the ALL array comparison operator on p.edges to say that all the edges in the path should have a type of “seed”, that should be enough to get Arango to use our new index.

    FOR v,e,p IN 1..2 ANY 'vertices/1' edges
      FILTER p.edges[*].type ALL == 'seed' &&  v.name == 'end'
        RETURN v

The selectivity score shown by explain doesn’t leave you very hopeful that Arango will use the index…

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields               Ranges
  2   hash   edges        false    false         0.00 %   [ `type`, `_to` ]    base INBOUND
  2   edge   edges        false    false        50.00 %   [ `_from`, `_to` ]   base OUTBOUND

but having our query execute in 0.2 milliseconds instead of 4.3 seconds is a pretty good indication it’s working.

[Image: arango_query_explained]

Back to modeling

For me, this little experiment has underscored the importance of good data modeling. You don’t have to worry about the number of edges if you don’t create the vertex in the first place. If you are conservative about what you are willing to make a vertex, and make good use of indexes, ArangoDB can gracefully handle some pretty hefty data.

With vertex-centric indexes, and other features like the new SmartGraphs, ArangoDB has gone from being a great document database with a trick up its sleeve (joins!) to being a really solid graph database, and there are new features landing regularly. I’m curious to see where they go next.

Making requests in vanilla js with Apollo

There are lots of good reasons to be running GraphQL on the server. It’s clean, needs no ORMs or frameworks, and has some interesting security properties too. But just because you are rockin’ the new hotness on the server side doesn’t mean you want it on the client side too. Sometimes the right thing is the simplest thing that can possibly work.

The Apollo Client is a GraphQL client made by the people behind Meteor. It aims to be an advanced and capable client that plays nice with the rest of the ecosystem. It has a lot going on, and sadly doesn’t seem to spend much time advertising that it’s actually a pretty great fit for those “simplest thing that can possibly work” moments as well.

Installing it is roughly what you might expect, but you also need the graphql-tag library so you can create queries using JavaScript’s new tagged template literals.

npm install --save apollo-client graphql-tag

So here, in all its glory, is the “simplest thing that can possibly work”:

import ApolloClient from 'apollo-client'
import gql from 'graphql-tag'

const client = new ApolloClient();

let query = gql`
  query {
    foo {
      bar
    }
  }
`
client.query({query}).then((results) => {
  //do something useful
})

I think this is even simpler than Lokka, which bills itself as the “Simple JavaScript Client for GraphQL”.

If you need to specify your endpoint as something other than the host the js came from, then you get to add just a little extra:

import ApolloClient, { createNetworkInterface } from 'apollo-client'

const opts = {uri: 'http://example.com:8080/graphql'}
const networkInterface = createNetworkInterface(opts)
const client = new ApolloClient({
  networkInterface,
});

But simple doesn’t mean we are restricted to queries only. Mutations can be simple too:

let mutation = gql`
  mutation ($foo: [FooInput] $bar: String!) {
    addFoo(
      foo: $foo
      bar: $bar
    ){
      foo
      bar
    }
  }
`

client.mutate({mutation, variables: {foo: [1,2,3], bar: "baz"}}).then((results) => {
  //do something with result
})

Obviously you will need the server side schema to support that, but that is all that is needed on the client.

Apollo has a tonne of features and integrates with Redux nicely (it does caching with its own internal Redux store unless you want it to use yours). While simplicity doesn’t appear to be its focus, the Apollo client is certainly capable of it. You’d just never guess that from the documentation. Hopefully this will make it a little easier to appreciate the simple side of Apollo.

Installing R-Studio on Ubuntu 16.10

Installing things on Linux is either really easy, or a yak shave with surprisingly little between those extremes.

It seems that Ubuntu 16.10 has removed Gstreamer 0.10 from the repos and replaced it with Gstreamer 1.0, which is great… until you need to install R-Studio.

While the R-Studio people are aiming to drop the Gstreamer dependency, for the moment, as of 16.10, installing it has fallen into the yak-shave category.

Installing R-Studio works fine, but if you try to run it (from the terminal) you will get this error:

rstudio: error while loading shared libraries: libgstreamer-0.10.so.0: cannot open shared object file: No such file or directory

We can see that it’s failing to load Gstreamer, but since it’s been removed from the Ubuntu repos fixing this will mean getting those packages elsewhere.

To start with, we can download the latest R-studio daily build and install it using dpkg:

$ wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-1.0.124-amd64.deb
$ sudo dpkg -i rstudio-1.0.124-amd64.deb

The dpkg command can also query the package to display information about it. If we use the uppercase I option we can confirm that this package requires exactly version 0.10 of libgstreamer:

dpkg -I rstudio-1.0.124-amd64.deb 
 new debian package, version 2.0.
 size 98840122 bytes: control archive=42847 bytes.
     554 bytes,    12 lines      control              
  163246 bytes,  1548 lines      md5sums              
     198 bytes,    10 lines   *  postinst             #!/bin/sh
     158 bytes,    10 lines   *  postrm               #!/bin/sh
 Package: rstudio
 Version: 1.0.124
 Section: devel
 Priority: optional
 Architecture: amd64
 Depends: libjpeg62, libedit2, libgstreamer0.10-0, libgstreamer-plugins-base0.10-0, libssl1.0.0,  libc6 (>= 2.7)
 Recommends: r-base (>= 2.11.1)
 Installed-Size: 526019
 Maintainer: RStudio <info@rstudio.com>
 Description: RStudio
  RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, and workspace management.

Debian (which Ubuntu is based on) has the old Gstreamer packages we need to satisfy those dependencies, so we can get them from there. If you need something other than AMD64, see here and here. If you have a 64-bit machine, you can download and install like this:

# download with wget
$ wget http://ftp.ca.debian.org/debian/pool/main/g/gstreamer0.10/libgstreamer0.10-0_0.10.36-1.5_amd64.deb
$ wget http://ftp.ca.debian.org/debian/pool/main/g/gst-plugins-base0.10/libgstreamer-plugins-base0.10-0_0.10.36-2_amd64.deb

# Now install with dpkg
$ sudo dpkg -i libgstreamer0.10-0_0.10.36-1.5_amd64.deb
$ sudo dpkg -i libgstreamer-plugins-base0.10-0_0.10.36-2_amd64.deb

While that solves R-Studio’s problem, we now have one of our own. We’ve purposefully installed old packages and don’t want Ubuntu’s package manager to enthusiastically upgrade them the next time we update.
To resolve that problem we will put a hold on them with apt-mark:

$ sudo apt-mark hold libgstreamer-plugins-base0.10-0
libgstreamer-plugins-base0.10-0 set on hold.
$ sudo apt-mark hold libgstreamer0.10-0
libgstreamer0.10-0 set on hold.

And we can check the packages that are on hold with:

$ sudo apt-mark showhold
libgstreamer-plugins-base0.10-0
libgstreamer0.10-0

Hopefully that saves someone some Googling.
Now that that’s working, it’s time to play with some R!

GraphQL and security

Imagine you have a web application that allows people to view widgets by name. Somewhere deep in your codebase, a programmer has thoughtfully updated the existing ES5 SQL injection to this stylish new ES6 SQL injection:

`select * from widgets where name = '${name}';`

Injection attacks just like this one persist in spite of the fact that a search for “sql injection tutorial” returns around 3,740,000 results on Google. They made the top of the OWASP Top 10 in 2010 and 2013 and probably will again in 2016, and likely for the foreseeable future. Motherboard even calls it “the hack that will never go away”.

The standard answer to this sort of problem is input sanitization. It’s standard enough that it shows up in jokes.

[Image: xkcd’s famous “Exploits of a Mom”]

But when faced with such consistent failure, it’s reasonable to ask if there isn’t something systemic going on.

There is a sub-field of security research known as Language Theoretic Security (Langsec) that is asking exactly that question.

Exploitation is unexpected computation caused reliably or probabilistically by some crafted inputs. — Meredith Patterson

Dropping the students table in the comic is exactly the kind of “unexpected computation” they are talking about. To combat it, Langsec encourages programmers to consider their inputs as a language and the sum of the adhoc checks on those inputs as a parser for that language.

They advocate a clean separation of concerns between the recognition and processing of inputs, and bringing a recognizer of appropriate strength to bear on the input you are parsing (i.e. not validating HTML/XML with a regex).

Where this starts intersecting with GraphQL is in what this actually implies:

This design and programming paradigm begins with a description of valid inputs to a program as a formal language (such as a grammar or a DFA). The purpose of such a disciplined specification is to cleanly separate the input-handling code and processing code.

A LangSec-compliant design properly transforms input-handling code into a recognizer for the input language; this recognizer rejects non-conforming inputs and transforms conforming inputs to structured data (such as an object or a tree structure, ready for type or value-based pattern matching).

The processing code can then access the structured data (but not the raw inputs or parsers’ temporary data artifacts) under a set of assumptions regarding the accepted inputs that are enforced by the recognizer.

If all that starts sounding kind of familiar, well it did to me too.

GraphQL

GraphQL allows you to define a formal language via its type system. As queries arrive, they are lexed, parsed, matched against the user-defined types, and formed into an Abstract Syntax Tree (AST). The contents of that AST are then made available to processing code via resolve functions.

The promise of Langsec is “software free from broad and currently dominant classes of bugs and vulnerabilities related to incorrect parsing and interpretation of messages between software components”, and GraphQL seems poised to put this within reach of every developer.

Turning back to the example we started with, how could we use GraphQL to protect against that SQL injection? Let’s get this going in a test.

require("babel-polyfill")

import expect from 'expect'
import {
  graphql,
  GraphQLSchema,
  GraphQLScalarType,
  GraphQLObjectType,
  GraphQLString,
  GraphQLNonNull
} from 'graphql'
import { GraphQLError } from 'graphql/error';
import { Kind } from 'graphql/language';

describe('SQL injection', () => {

  it('returns a SQL injected string', async () => {

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          widgets: {
            type: GraphQLString,
            args: {
              name: {
                description: 'The name of the widget',
                type: new GraphQLNonNull(GraphQLString)
              }
            },
            resolve: (source, {name}) => `select * from widgets where name = '${name}';`
          }
        })
      })
    })

    let query = `
      query widgetByName($name: String!) {
        widgets(name: $name)
      }
    `

    //args: schema, query, rootValue, contextValue, variables
    let result = await graphql(schema, query, null, null, {name: "foo'; drop table widgets; --"})
    expect(result.data.widgets).toEqual("select * from widgets where name = 'foo'; drop table widgets; --'")
  })

})

GraphQL brings lots of benefits (no need for API versioning, all data in a single round trip, etc…), and while those are compelling, simply passing strings into our backend systems misses an opportunity to do something different.

Let’s create a custom type that is more specific than just a “string”: an AlphabeticString.

  it('is fixed with custom types', async () => {

    let AlphabeticString = new GraphQLScalarType({
      name: 'AlphabeticString',
      description: 'represents a string with no special characters.',
      serialize: String,
      parseValue: (value) => {
        if(value.match(/^([A-Za-z]|\s)+$/)) {
          return value
        }
        return null
      },
      parseLiteral: (ast) => {
        if(ast.kind === Kind.STRING && ast.value.match(/^([A-Za-z]|\s)+$/)) {
          return ast.value
        }
        return null
      }
    });

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          widgets: {
            type: GraphQLString,
            args: {
              name: {
                description: 'The name of the widget',
                type: new GraphQLNonNull(AlphabeticString)
              }
            },
            resolve: (source, {name}) => {
              return `select * from widgets where name = '${name}';`
            }
          }
        })
      })
    })

    let query = `
    query widgetByName($name: AlphabeticString!) {
        widgets(name: $name)
      }
    `

    //args: schema, query, rootValue, contextValue, variables
    let result = await graphql(schema, query, null, null, {name: "foo'; drop table widgets; --"})
    expect(result.errors).toExist()
    expect(result.errors[0].message).toInclude("got invalid value")
  })

This test now passes; the SQL string is rejected during parsing before ever reaching the resolve function.

On making custom types

There are already libraries out there that provide custom types for things like URLs, datetimes and other common formats, but you will definitely want to be able to make your own.

To start defining your own types you will need to understand the role the various functions play:

    let MyType = new GraphQLScalarType({
      name: 'MyType',
      description: 'this will show up in the documentation!',
      serialize: (value) => { /* ... */ },
      parseValue: (value) => { /* ... */ },
      parseLiteral: (ast) => { /* ... */ }
    });

The serialize function is the easiest to understand: GraphQL responses are serialized to JSON, and if this type needs special treatment before being included in a response, this is the place to do it.
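As a hedged illustration (the DateType scalar below is my own example, not something from this article or shipped with graphql-js), a scalar representing dates might serialize an internal JavaScript Date into an ISO string before it goes into the JSON response:

import { GraphQLScalarType } from 'graphql'
import { Kind } from 'graphql/language'

const DateType = new GraphQLScalarType({
  name: 'DateType',
  description: 'An ISO-8601 formatted date',
  // turn the internal Date value into something JSON-friendly
  serialize: (value) => value instanceof Date ? value.toISOString() : String(value),
  // the value arrived via query variables
  parseValue: (value) => new Date(value),
  // the value arrived as a literal in the query document
  parseLiteral: (ast) => ast.kind === Kind.STRING ? new Date(ast.value) : null
})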

Understanding parseValue and parseLiteral requires a quick look at two different queries:

//name is a string literal, so parseLiteral is called 
query {
  widgets(name: "Soft squishy widget")
}

//name is a value so parseValue is called
query widgetByName($name: AlphabeticString!) {
  widgets(name: $name)
}

In the first query, name is a string literal, so your type’s parseLiteral function will be called. In the second query, the name is supplied as the value of a variable, so parseValue is called.

Your type could end up being used in either of those scenarios so it’s important to do your validation in both of those functions.

The standard GraphQL types (GraphQLInt, GraphQLFloat, etc.) also implement those same functions.

Putting the whole thing together, the process then looks something like this:

A query arrives and gets tokenized; the parser builds the AST, calling your parseLiteral functions for literals in the query and parseValue for any variables. When the AST is complete, execution recurses down it calling resolve, and serialize is called on what the resolvers return to create the response.

Where to go from here

Langsec is a deep topic, with implications wider than just what is discussed here. While GraphQL is certainly not “Langsec in a box”, it not only seems to be making the design patterns that follow from Langsec’s insights a reality, it has a shot at making them mainstream. I’d love to see the Langsec lens turned on GraphQL to see how it can guide the evolution of the spec and the practices around it.

I would encourage you to dig into the ideas of Langsec, and the best place to start is here:

A look at the React Lifecycle

Every React component is required to provide a render function. It can return false or it can return elements, but it needs to be there. If you provide just a single function, it’s assumed to be a render function:

const Foo = ({thing}) => <p>Hello {thing}</p>
<Foo thing="world" />

There has been a fair bit written about the chain of lifecycle methods that React calls leading up to its invocation of the render function and afterwards. The basic order is this:

constructor
render
componentDidMount

But things are rarely that simple, and often this.setState is called in componentDidMount which gives a call chain that looks like this:

constructor
render
componentDidMount
shouldComponentUpdate
componentWillUpdate
render
componentDidUpdate

Nesting components inside each other adds another wrinkle to this, as does my use of ES6/7, which adds a few subtle changes to the existing lifecycle methods. To get this sorted out in my own head, I created two classes: an Owner and an Ownee.

class Owner extends React.Component {

  // ES7 Property Initializers replace getInitialState
  //state = {}

  // ES6 class constructor replaces componentWillMount
  constructor(props) {
    super(props)
    console.log("Owner constructor")
    this.state = {
      foo: "baz"
    }
  }

  componentWillReceiveProps(nextProps) {
    console.log("Owner componentWillReceiveProps")
  }

  shouldComponentUpdate(nextProps, nextState) {
    console.log("Owner shouldComponentUpdate")
    return true
  }

  componentWillUpdate(nextProps, nextState) {
    console.log("Owner componentWillUpdate")
  }

  render() {
    console.log("Owner render")
    return (
      <div className="owner">
        <Ownee foo={ this.state.foo } />
      </div>
    )
  }

  componentDidUpdate(nextProps, nextState) {
    console.log("Owner componentDidUpdate")
  }

  componentDidMount() {
    console.log("Owner componentDidMount")
  }

  componentWillUnmount() {
    console.log("Owner componentWillUnmount")
  }

}

A component is said to be the owner of another component when it sets its props. A component whose props are being set is an ownee, so here is our Ownee component:

class Ownee extends React.Component {

  // ES6 class constructor replaces componentWillMount
  constructor(props) {
    super(props)
    console.log("  Ownee constructor")
  }

  componentWillReceiveProps(nextProps) {
    console.log("  Ownee componentWillReceiveProps")
  }

  shouldComponentUpdate(nextProps, nextState) {
    console.log("  Ownee shouldComponentUpdate")
    return true
  }

  componentWillUpdate(nextProps, nextState) {
    console.log("  Ownee componentWillUpdate")
  }

  render() {
    console.log("  Ownee render")
    return (
        <p>Ownee says {this.props.foo}</p>
    )
  }

  componentDidUpdate(nextProps, nextState) {
    console.log("  Ownee componentDidUpdate")
  }

  componentDidMount() {
    console.log("  Ownee componentDidMount")
  }

  componentWillUnmount() {
    console.log("  Ownee componentWillUnmount")
  }

}

This gives us the following chain:

Owner constructor
Owner render
  Ownee constructor
  Ownee render
  Ownee componentDidMount
Owner componentDidMount

Adding this.setState({foo: "bar"}) into the Owner’s componentDidMount gives us a more complete view of the call chain:

Owner constructor
Owner render
  Ownee constructor
  Ownee render
  Ownee componentDidMount
Owner componentDidMount
Owner shouldComponentUpdate
Owner componentWillUpdate
Owner render
  Ownee componentWillReceiveProps
  Ownee shouldComponentUpdate
  Ownee componentWillUpdate
  Ownee render
  Ownee componentDidUpdate
Owner componentDidUpdate
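For reference, the only change behind that longer chain is the setState call; a minimal sketch of the modified method (the rest of the Owner class is unchanged):

  componentDidMount() {
    console.log("Owner componentDidMount")
    // triggers the update pass: shouldComponentUpdate -> componentWillUpdate -> render -> componentDidUpdate
    this.setState({foo: "bar"})
  }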

Things definitely get more complicated when components start talking to each other and passing around functions that call setState, but the basic model is reassuringly straightforward. The changes that ES6/7 bring to the React lifecycle are relatively minor, but nonetheless nice to have clear in my head as well.
If you want to explore this further I’ve created a JSbin.

Packaging, pid-files and systemd

When I first built my ArangoDB package, one of the problems I had was getting ArangoDB to start after a reboot. While reworking it for Arango 3.0 I ran into this again.
The reason this can be tricky is that ArangoDB, like basically all forking processes, needs to write a pid-file somewhere. Where things get confusing is that anything you create in /var/run will be gone the next time you reboot, leading to errors like this:

-- Unit arangodb.service has begun starting up.
Aug 24 08:50:27 longshot arangod[10366]: {startup} starting up in daemon mode
Aug 24 08:50:27 longshot arangod[10366]: cannot write pid-file '/var/run/arangodb3/arangod.pid'
Aug 24 08:50:27 longshot systemd[1]: arangodb.service: Control process exited, code=exited status=1
Aug 24 08:50:27 longshot systemd[1]: Failed to start ArangoDB.
-- Subject: Unit arangodb.service has failed

If you DuckDuckGo it you can see that people stumble into this pretty regularly.

To understand what’s going on here, it’s important to know what /var/run is actually for.

The Filesystem Hierarchy Standard describes it as a folder for “run-time variable data” and lays out some rules for the folder:

This directory contains system information data describing the system since it was booted. Files under this directory must be cleared (removed or truncated as appropriate) at the beginning of the boot process. Programs may have a subdirectory of /var/run; this is encouraged for programs that use more than one run-time file. Process identifier (PID) files, which were originally placed in /etc, must be placed in /var/run. The naming convention for PID files is <program-name>.pid. For example, the crond PID file is named /var/run/crond.pid.

Since those words were written in 2004, the evolving needs of init systems, variations across distributions and the idea of storing pid-files (which shouldn’t survive reboot) with logs and stuff (which should) have all conspired to push for the creation of a standard place to put ephemeral data: /run.

Here in 2016, /run is a done deal, and for backwards compatibility, /var/run is now simply a symlink to /run:

mike@longshot ~/$  ls -l /var/
total 52
...
lrwxrwxrwx  1 root root     11 Sep 30  2015 lock -> ../run/lock
lrwxrwxrwx  1 root root      6 Sep 30  2015 run -> ../run
...

Looking back at our cannot write pid-file '/var/run/arangodb3/arangod.pid' error, a few things are clear. First, we should probably stop using /var/run since /run has been standard since around 2011.

Second, our files disappear because /run is a tmpfs. While there are some subtleties, it’s basically storing your files in RAM.

So the question is: how do we ensure our /run folder is prepped with our /run/arangodb3 directory (and whatever other files) before our systemd unit file is run? As it happens, systemd has a subproject that deals with this: tmpfiles.d.

The well-named tmpfiles.d creates tmpfiles in /run and /tmp (and a few others). It does this by reading conf files written in a simple configuration format out of certain folders. A quick demo:

mike@longshot ~$  sudo bash -c "echo 'd /run/foo 0755 mike users -' > /usr/lib/tmpfiles.d/foo.conf"
mike@longshot ~$  sudo systemd-tmpfiles --create foo.conf
mike@longshot ~$  ls -l /run
...
drwxr-xr-x  2 mike     users     40 Aug 24 14:18 foo
...

While we specified an individual conf file by name there, running systemd-tmpfiles --create without arguments would create the files for all the conf files that exist in /usr/lib/tmpfiles.d/.

mike@longshot ~$  ls -l /usr/lib/tmpfiles.d/
total 104
-rw-r--r-- 1 root root   30 Jul  5 10:35 apache.conf
-rw-r--r-- 1 root root   78 May  8 16:35 colord.conf
-rw-r--r-- 1 root root  574 Jul 25 17:10 etc.conf
-rw-r--r-- 1 root root  595 Aug 11 08:04 gvfsd-fuse-tmpfiles.conf
-rw-r--r-- 1 root root  362 Jul 25 17:10 home.conf
...

Tying all this together is a systemd service that runs just before sysinit.target and uses that exact command to create all the tmpfiles:

mike@longshot ~/$  systemctl cat systemd-tmpfiles-setup.service
# /usr/lib/systemd/system/systemd-tmpfiles-setup.service
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Create Volatile Files and Directories
Documentation=man:tmpfiles.d(5) man:systemd-tmpfiles(8)
DefaultDependencies=no
Conflicts=shutdown.target
After=local-fs.target systemd-sysusers.service
Before=sysinit.target shutdown.target
RefuseManualStop=yes

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemd-tmpfiles --create --remove --boot --exclude-prefix=/dev

If your unit file includes After=sysinit.target, you know that the tmpfiles you specified will exist by the time your unit is run.
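A minimal sketch of the relevant parts of such a unit file (the exact arangod flags and paths here are my assumptions, not the packaged unit file); the ordering and the pid-file location under /run are the important bits:

[Unit]
Description=ArangoDB
After=sysinit.target network.target

[Service]
Type=forking
# written by arangod when it daemonizes; /run/arangodb3 is created by tmpfiles.d
PIDFile=/run/arangodb3/arangod.pid
ExecStart=/usr/sbin/arangod --daemon --pid-file /run/arangodb3/arangod.pid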

Knowing that this plumbing is in place, your package should include a conf file which gets installed into /usr/lib/tmpfiles.d/. Here is mine for ArangoDB:

mike@longshot ~/projects/arangodb_pkg (master)$  cat arangodb-tmpfile.conf 
d /run/arangodb3 0755 arangodb arangodb -

While this will ensure that tmpfiles are created the next time the computer boots, we also need to make sure the service can be started right now. If you are packaging software for Arch Linux, that means having a post_install hook that looks like this:

post_install() {
  systemd-tmpfiles --create arangodb.conf
}

If you are running systemd, and you probably are, this is the way to go. While it’s not hard to find people using mkdir in their unit file’s ExecStartPre section (been there, done that) or writing some sort of startup script, this is much cleaner. Make use of the infrastructure that is there.

D3 and React 3 ways

D3 and React are two of the most popular libraries out there and a fair bit has been written about using them together.
The reason this has been worth writing about is the potential for conflict between them. With D3 adding and removing DOM elements to represent data, and React tracking and diffing DOM elements, either library could end up with elements deleted out from under it or operations returning unexpected elements (their apparent approach when finding such an element is “kill it with fire”).

One way of avoiding this situation is simply telling a React component not to update its children via shouldComponentUpdate(){ return false }. While effective, having React manage all of the DOM except for some designated area doesn’t feel like the cleanest solution. A little digging shows that there are some better options out there.
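Before looking at those, here is a sketch of what that blunt escape hatch looks like (the component name is just illustrative):

class D3Container extends React.Component {
  // React will never re-render this subtree, so D3 is free to mutate it
  shouldComponentUpdate() {
    return false
  }

  componentDidMount() {
    // D3 owns this.node from here on
  }

  render() {
    return <div ref={(el) => { this.node = el }} />
  }
}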

To explore these, I’ve taken D3 creator Mike Bostock’s letter frequency bar chart example and used it as the example for all three cases. I’ve updated it to ES6, D3 version 4 and implemented it as a React component.

[Image: Mike Bostock’s letter frequency chart]

Option 1: Use Canvas

One nice option is to use HTML5’s canvas element. Draw what you need and let React render the one element into the DOM. Mike Bostock has an example of the letter frequency chart done with canvas. His code can be transplanted into React without much fuss.

class CanvasChart extends React.Component {

  componentDidMount() {
    //All Mike's code
  }

  render() {
    return <canvas width={this.props.width} height={this.props.height} ref={(el) => { this.canvas = el }} />
  }
}

I’ve created a working demo of the code on Plunkr.
The canvas approach is something to consider if you are drawing or animating a large amount of data. Speed is also in its favour, though React probably narrows the speed gap a bit.

A single canvas element is produced, and since the chart is drawn with JavaScript, no other elements need to be created or destroyed, avoiding the conflict with React entirely.

Option 2: Use react-faux-dom

Oliver Caldwell’s react-faux-dom project creates a JavaScript object that passes for a DOM element. D3 can do its DOM operations on that, and when it’s done you just call toReact() to return React elements. Updating Mike Bostock’s original bar chart demo gives us this:

import React from 'react'
import ReactFauxDOM from 'react-faux-dom'
import d3 from 'd3'

class SVGChart extends React.Component {

  render() {
    let data = this.props.data

    let margin = {top: 20, right: 20, bottom: 30, left: 40},
      width = this.props.width - margin.left - margin.right,
      height = this.props.height - margin.top - margin.bottom;

    let x = d3.scaleBand()
      .rangeRound([0, width])

    let y = d3.scaleLinear()
      .range([height, 0])

    let xAxis = d3.axisBottom()
      .scale(x)

    let yAxis = d3.axisLeft()
      .scale(y)
      .ticks(10, "%");

    //Create the element
    const div = new ReactFauxDOM.Element('div')
    
    //Pass it to d3.select and proceed as normal
    let svg = d3.select(div).append("svg")
      .attr("width", width + margin.left + margin.right)
      .attr("height", height + margin.top + margin.bottom)
      .append("g")
      .attr("transform", `translate(${margin.left},${margin.top})`);

      x.domain(data.map((d) => d.letter));
      y.domain([0, d3.max(data, (d) => d.frequency)]);

    svg.append("g")
      .attr("class", "x axis")
      .attr("transform", `translate(0,${height})`)
      .call(xAxis);

    svg.append("g")
      .attr("class", "y axis")
      .call(yAxis)
      .append("text")
      .attr("transform", "rotate(-90)")
      .attr("y", 6)
      .attr("dy", ".71em")
      .style("text-anchor", "end")
      .text("Frequency");

    svg.selectAll(".bar")
      .data(data)
      .enter().append("rect")
      .attr("class", "bar")
      .attr("x", (d) => x(d.letter))
      .attr("width", 20)
      .attr("y", (d) => y(d.frequency))
      .attr("height", (d) => {return height - y(d.frequency)});

    //DOM manipulations done, convert to React
    return div.toReact()
  }

}

This approach has a number of advantages, and as Oliver points out, one of the big ones is being able to use this with Server Side Rendering. Another bonus is that existing D3 visualizations hardly need to be modified at all to get them working with React. If you look back at the original bar chart example, you can see that it’s basically the same code.

Option 3: D3 for math, React for DOM

The final option is a full embrace of React, both its idea of components and its dominion over the DOM. In this scenario D3 is used strictly for its math and formatting functions. Colin Megill put this nicely, stating “D3’s core contribution is not its DOM model but the math it brings to the client”.

I’ve re-implemented the letter frequency chart following this approach. D3 is only used to do a few calculations and format numbers. No DOM operations at all. Creating the SVG elements is all done with React by iterating over the data and the arrays generated by D3.
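The shape of that approach, in a much-reduced sketch (the component and prop names are mine, not from the full re-implementation, and it assumes data objects of the form {letter, frequency}):

import React from 'react'
import d3 from 'd3'

// D3 supplies the scales and math, React renders the actual SVG elements.
const Bars = ({data, width, height}) => {
  const x = d3.scaleBand()
    .domain(data.map((d) => d.letter))
    .rangeRound([0, width])
    .padding(0.1)

  const y = d3.scaleLinear()
    .domain([0, d3.max(data, (d) => d.frequency)])
    .range([height, 0])

  return (
    <svg width={width} height={height}>
      {data.map((d) =>
        <rect key={d.letter} className="bar"
          x={x(d.letter)} width={x.bandwidth()}
          y={y(d.frequency)} height={height - y(d.frequency)} />
      )}
    </svg>
  )
}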

[Image: My pure React re-implementation of Mike Bostock’s letter frequency bar chart. D3 for math, React for DOM. No cheating.]

What I learned from doing this is that D3 does a lot of work for you, especially when generating axes. You can see a fair number of “magic values” in the code, a little +5 here or a -4 there, to get everything aligned right. All of that could probably be extracted into props like “margin” or “padding”, but it will take a few more iterations (and probably some actual reuse of these components) to get there. D3 already has that stuff figured out.

This approach is a lot of work in the short term, but it has some real benefits. First, I like it for its consistency with the React way of doing things. Second, in the long term, once good boundaries between components are established, you can see lots of possibilities for reuse. The modular nature of D3 version 4 probably also means this approach will lead to smaller bundles, since you can be very selective about which functions you include.
If you can see yourself doing a lot of D3 and React in the future, the price paid for this purity would be worth it.

Where to go from here

It’s probably worth pointing out that D3 isn’t a charting library, it’s a generic data visualisation framework. So while the examples above might be useful for showing how to integrate D3 and React, they aren’t trying to suggest that this is a great use of D3 (though it’s not an unreasonable use either). If all you need is a bar chart there are libraries like Chart.js and react-chartjs aimed directly at that.

In my particular case I had an existing D3 visualization, and react-faux-dom was the option I used. It’s a nice balance between purity and pragmatism, and probably the right choice for most cases.

Hopefully this will save people some digging.