Tagged template literals and the hack that will never go away

Tagged template literals were added to Javascript as part of ES 2015. While a fair bit has been written about them, I’m going to argue their significance is underappreciated and I’m hoping this post will help change that. In part, it’s significant because it strikes at the root of a problem people had otherwise resigned themselves to living with: SQL injection.

Just so we are clear, before ES 2015, combining query strings with untrusted user input to create a SQL injection was done via concatenation using the plus operator.

let query = 'select * from widgets where id = ' + id + ';'

As of ES 2015, you can create far more stylish SQL injections using backticks.

let query = `select * from widgets where id = ${id};`

By itself this addition is really only remarkable for not being included in the language sooner. The backticks are weird, but it gives us some much-needed multiline string support and a very Rubyish string interpolation syntax. It’s pairing this new syntax with another language feature known as tagged templates that has a real potential to make an impact on SQL injections.

> let id = 1
// define a function to use as a "tag"
> sql = (strings, ...vars) => ({strings, vars})
[Function: sql]
// pass our tag a template literal
> sql`select * from widgets where id = ${id};`
{ strings: [ 'select * from widgets where id = ', ';' ], vars: [ 1 ] }

What you see above is just a function call, but it no longer works like other languages. Instead of doing the variable interpolation first and then calling the sql function with select * from widgets where id = 1;, the sql function is called with an array of strings and the variables that are supposed to be interpolated.

You can see how different this is from the standard evaluation process by adding brackets to make this a standard function invocation. The string is interpolated before being passed to the sql function, entirely loosing the distinction between the variable (which we probably don’t trust) and the string (that we probably do). The result is an injected string and an empty array of variables.

> sql(`select * from widgets where id = ${id};`)
{ strings: 'select * from widgets where id = 1;', vars: [] }

This loss of context is the heart of matter when it comes to SQL injection (or injection attacks generally). The moment the strings and variables are combined you have a problem on your hands.

So why not just use parameterized queries or something similar? It’s generally held that good code expresses the programmers intent. I would argue that our SQL injection example code perfectly expresses the programmers intent; they want the id variable to be included in the query string. As a perfect expression of the programmers intent, this should be acknowledged as “good code”… as well as a horrendous security problem.

let query = sql(`select * from widgets where id = ${id};`)

When the clearest expression of a programmers intent is also a security problem what you have is a systemic issue which requires a systemic fix. This is why despite years of security education, developer shaming and “push left” pep-talks SQL injection stubbornly remains “the hack that will never go away”. It’s also why you find Mike Samuel from Google’s security team as the champion of the “Template Strings” proposal.

You can see the fruits of this labour by noticing library authors leveraging this to deliver a great developer experience while doing the right thing for security. Allan Plum, the driving force behind the Arangodb Javascript driver leveraging tagged template literals to let users query ArangoDB safely.

The aql (Arango Query Language) function lets you write what would in any other language be an intent revealing SQL injection, safely returns an object with a query and some accompanying bindvars.

aql`FOR thing IN collection FILTER thing.foo == ${foo} RETURN thing`
{ query: 'FOR thing IN collection FILTER thing.foo == @value0 RETURN thing',
  bindVars: { value0: 'bar' } }

Mike Samuel himself has a number of node libraries that leverage Tagged Template Literals, among them one to safely handle shell commands.

sh`echo -- ${a} "${b}" 'c: ${c}'`

It’s important to point out that Tagged Template Literals don’t entirely solve SQL injections, since there are no guarantees that any particular tag function will do “the right thing” security-wise, but the arguments the tag function receives set library authors up for success.

Authors using them get to offer an intuitive developer experience rather than the clunkiness of prepared statements, even though the tag function may well be using them under the hood. The best experience is from safest thing; It’s a great example of creating a “pit of success” for people to fall into.

// Good security hinges on devs learning to write
// stuff like this instead of stuff that makes sense.
// Clunky prepared statement is clunky.
const ps = new sql.PreparedStatement(/* [pool] */)
ps.input('param', sql.Int)
ps.prepare('select * from widgets where id = @id;', err => {
    // ... error checks
    ps.execute({id: 1}, (err, result) => {
        // ... error checks
        ps.unprepare(err => {
            // ... error checks
        })
    })
})

It’s an interesting thought that Javascripts deficiencies seem to have become it’s strength. First Ryan Dahl filled out the missing IO pieces to create Node JS and now missing features like multiline string support provide an opportunity for some of the worlds most brilliant minds to insert cutting edge security features along-side these much needed fixes.

I’m really happy to finally see language level fixes for things that are clearly language level problems, and excited to see where Mike Samuel’s mission to “make the easiest way to express an idea in code a secure way to express that idea” takes Javascript next. It’s the only way I can see to make “the hack that will never go away” go away.

GraphQL and security

Imagine you have a web application that allows people view widgets by name. Somewhere deep in your codebase, a programmer has thoughtfully updated the existing ES5 SQL injection to this stylish new ES6 SQL injection:

`select * from widgets where name = '${name}';`

Injection attacks just like this one persist in-spite of the fact that a search for “sql injection tutorial” returns around 3,740,000 results on Google. They have made the top of the OWASP top 10 in 2010 and 2013 and probably will again in 2016 and likely for the foreseeable future. Motherboard even calls it “the hack that will never go away“.

The standard answer to this sort of problem is input sanitization. It’s standard enough that it shows up in jokes.

exploits_of_a_mom
xkcd’s famous “Exploits of a Mom

But when faced with such consistent failure, it’s reasonable to ask if there isn’t something systemic going on.

There is a sub-field of security research known as Language Theoretic Security (Langsec) that is asking exactly that question.

Exploitation is unexpected computation caused reliably or probabilistically by some crafted inputs. — Meredith Patterson

Dropping the students table in the comic is exactly the kind of “unexpected computation” they are talking about. To combat it, Langsec encourages programmers to consider their inputs as a language and the sum of the adhoc checks on those inputs as a parser for that language.

They advocate a clean separation of concerns between the recognition and processing of inputs, and bringing a recognizer of appropriate strength to bear on the input you are parsing (ie: not validating HTML/XML with a regex)

Where this starts intersecting with GraphQL is in what this actually implies:

This design and programming paradigm begins with a description of valid inputs to a program as a formal language (such as a grammar or a DFA). The purpose of such a disciplined specification is to cleanly separate the input-handling code and processing code.

A LangSec-compliant design properly transforms input-handling code into a recognizer for the input language; this recognizer rejects non-conforming inputs and transforms conforming inputs to structured data (such as an object or a tree structure, ready for type or value-based pattern matching).

The processing code can then access the structured data (but not the raw inputs or parsers’ temporary data artifacts) under a set of assumptions regarding the accepted inputs that are enforced by the recognizer.

If all that starts sounding kind of familiar, well it did to me too.

GraphQL

GraphQL allows you to define a formal language via it’s type system. As queries arrive, they are lexed, parsed, matched against the user defined types and formed into an Abstract Syntax Tree (AST). The contents of that AST are then made available to processing code via resolve functions.

The promise of Langsec is “software free from broad and currently dominant classes of bugs and vulnerabilities related to incorrect parsing and interpretation of messages between software components”, and GraphQL seems poised to put this within reach of every developer.

Turning back to the example we started with, how could we use GraphQL to protect against that SQL injection? Lets get this going in a test.

require("babel-polyfill")

import expect from 'expect'
import {
  graphql,
  GraphQLSchema,
  GraphQLScalarType,
  GraphQLObjectType,
  GraphQLString,
  GraphQLNonNull
} from 'graphql'
import { GraphQLError } from 'graphql/error';
import { Kind } from 'graphql/language';

describe('SQL injection', () => {

  it('returns a SQL injected string', async () => {

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          widgets: {
            type: GraphQLString,
            args: {
              name: {
                description: 'The name of the widget',
                type: new GraphQLNonNull(GraphQLString)
              }
            },
            resolve: (source, {name}) => `select * from widgets where name = '${name}';`
          }
        })
      })
    })

    let query = `
      query widgetByName($name: String!) {
        widgets(name: $name)
      }
    `

    //args: schema, query, rootValue, contextValue, variables
    let result = await graphql(schema, query, null, null, {name: "foo'; drop table widgets; --"})
    expect(result.data.widgets).toEqual("select * from widgets where name = 'foo'; drop table widgets; --'")
  })

})

GraphQL brings lots of benefits (no need for API versioning, all data in a single round trip, etc…), and while those are compelling, simply passing strings into our backend systems misses an opportunity to do something different.

Let’s create a custom type that is more specific than just a “string”; an AlphabeticString.

  it('is fixed with custom types', async () => {

    let AlphabeticString = new GraphQLScalarType({
      name: 'AlphabeticString',
      description: 'represents a string with no special characters.',
      serialize: String,
      parseValue: (value) => {
        if(value.match(/^([A-Za-z]|\s)+$/)) {
          return value
        }
        return null
      },
      parseLiteral: (ast) => {
        if(value.match(/^([A-Za-z]|\s)+$/)) {
          return ast.value
        }
        return null
      }
    });

    let schema = new GraphQLSchema({
      query: new GraphQLObjectType({
        name: 'Query',
        fields: () => ({
          widgets: {
            type: GraphQLString,
            args: {
              name: {
                description: 'The name of the widget',
                type: new GraphQLNonNull(AlphabeticString)
              }
            },
            resolve: (source, {name}) => {
              return `select * from widgets where name = '${name}';`
            }
          }
        })
      })
    })

    let query = `
    query widgetByName($name: AlphabeticString!) {
        widgets(name: $name)
      }
    `

    //args: schema, query, rootValue, contextValue, variables
    let result = await graphql(schema, query, null, null, {name: "foo'; drop table widgets; --"})
    expect(result.errors).toExist()
    expect(result.errors[0].message).toInclude("got invalid value")
  })

This test now passes; the SQL string is rejected during parsing before ever reaching the resolve function.

On making custom types

There are already libraries out there that provide custom types for things like URLs, datetime’s and other things, but you will definitely want to be able to make your own.

To start defining your own types you will need to understand the role
the various functions play:

    let MyType = new GraphQLScalarType({
      name: 'MyType',
      description: 'this will show up in the documentation!',
      serialize: (value) => { //... }
      parseValue: (value) => { //... },
      parseLiteral: (ast) => { //... }
    });

The serialize function is the easiest to understand: GraphQL responses are serialized to JSON, and if this type needs special treatment before being included in a response, this is the place to do it.

Understanding parseValue and parseLiteral requires a quick look at two different queries:

//name is a string literal, so parseLiteral is called 
query {
  widgets(name: "Soft squishy widget")
}

//name is a value so parseValue is called
query widgetByName($name: AlphabeticString!) {
  widgets(name: $name)
}

In the first query, name is a string literal, so your types parseLiteral function will be called. The second query, the name is supplied as the value of a variable so parseValue is called.

Your type could end up being used in either of those scenarios so it’s important to do your validation in both of those functions.

The standard GraphQL types (GraphQLInt, GraphQLFloat, etc.) also implement those same functions.

Putting the whole thing together, the process then looks something like this:

A query arrives, gets tokenized, the parser builds the AST, calling parseValue and parseLiteral as needed. When the AST is complete, it recurses down the AST calling resolve, using parseValue and parseLiteral again on whatever is returned before calling serialize on each to create the response.

Where to go from here

Langsec is a deep topic, with implications wider than just what is discussed here. While GraphQL is certainly not “Langsec in a box”, it not only seems to be making the design patterns that follow from Langsec’s insights a reality, it has a has a shot at making them mainstream. I’d love to see the Langsec lens turned on GraphQL and see how it can guide the evolution of the spec and the practices around it.

I would encourage you to dig into the ideas of Langsec, and the best place to start is here:

An intro to injection attacks

I’ve found myself explaining SQL injection attacks to people a few times lately and thought I would write up something I can just point to instead.

For illustration purposes lets make a toy authentication system.

Lets say you have a database with table for all your users that looks like this:

The structure of table "users":
+-----+-----------+----------------+
| uid | username  | password       |
+-----+-----------+----------------+
|   1 | mike      | catfish        |
+-----+-----------+----------------+
|   2 | sally     | floride        |
+-----+-----------+----------------+
|   3 | akira     | pickles        |
+-----+-----------+----------------+

Lets say a user wants to use your application and you ask them for their username and password, and they give you ‘akira’ and ‘pickles’.

The next step in authenticating this user is to check in the database to see if we have a user with both a username of ‘akira’ and a password of ‘pickles’ which we can do with a SQL statement like this:

select * from user where username = 'akira' and password = 'pickles';

Since we know we have one row that satisfies both of the conditions we set (value in username must equal ‘akira’ and the value in password must equal ‘pickles’), if we hand that string of text to the database and ask it to execute it, we would expect the database to return the following data:

+-----+-----------+----------------+
| uid | username  | password       |
+-----+-----------+----------------+
|   3 | akira     | pickles        |
+-----+-----------+----------------+

If the database returns a row we can let the user go ahead and use our application.

Of course if we need our SQL statement to work for everyone and so we can’t just write ‘akira’ in there. So lets replace the username and password with variables (PHP style):

select * from user where username = '$username' and password = '$password';

Now if someone logs in with ‘mike’ and ‘catfish’ our application is going to place the value ‘mike’ in the variable $username  and ‘catfish’ in the variable $password and the PHP interpreter will be responsible for substituting the variable names for the actual values so it can create the finished string that looks like this:

select * from user where username = 'mike' and password = 'catfish';

This will be passed to the database which executes the command and return a row.

Unfortunately mixing data supplied by the user with our pre-exisisting SQL commands and passing the result to the database as one single string has set the stage for some bad behaviour:

$username = "";
$password = " ' or user_name like '%m% ";

select * from user where username = '$username' and password = '$password';

Once the final string is assembled suddenly the meaning is very different:

select * from user where username = ' ' and password = '' or username like '%m%';

Now our database will return a row if the username AND password are both empty strings OR if any of the usernames contains the letter ‘m’. Suddenly the bar for logging into our application is a lot lower.

There are tonnes of possible variations on this, and it’s a common enough problem that its the stuff of jokes among programmers:

XKCD: Little bobby tables.

The root of the problem is the commingling of user supplied data with commands intended  to be executed. The user supplied password value of “ ‘ or username like ‘%m%” entirely changed the meaning of our SQL command where if we had been able make it clear that this was just a string to search the database for we would have had the expected behaviour of comparing the string ” ‘ or username like ‘%m%” to strings in our list of passwords (‘pickles’, ‘catfish’ and ‘floride’).

If you think about it like that you realize that this is not just a problem with SQL, but a problem that shows up everywhere data and commands are mixed.

Keeping these things separate is the only way to stay safe. When the data is kept separate it can be sanitized, or singled out for whatever special treatment is appropriate for the situation by libraries/drivers or whatever else.

Separating data from commands can look pretty different depending on the what you are working on. Keeping data and commands separate when executing system commands in Ruby or Node.js looks different from keeping them separate using prepared statements to safely query a database using JRuby/Java. The same rules apply to NoSQL things like Mongodb as well.

People far smarter than I am still get caught by this stuff pretty regularly, so there is no magic bullet to solve it. If you are working with Rails, security tools like Breakman can help catch things but the only real solution is awareness and practice.