
One week of Salt: frustrations and reflections.

 

Reflecting on Salt. (Credit: Psyberartist, CC-BY 2.0)

Salt, or SaltStack, is an automatic deployment tool for clusters of machines, in the same category as Chef, Puppet and Ansible. I’ve previously used Chef, found the experience frustrating, and thought I’d give Salt a go. The task: converting my tilemill-server bash scripts, which set up a complete TileMill server stack, with OpenStreetMap data, Nginx and a couple of other goodies.

The goals in this kind of automation for me are threefold:

  1. So I can quickly build new servers for Mapmaking for Academics workshops.
  2. So I can manage other, similar (but slightly different) servers, such as cycletour.org and gis.researchmaps.net. “Manage” here means documenting and controlling the slight configuration differences, being able to roll out further changes (such as web passwords) without having to log in.
  3. So other people can do this too. (That is, the solution has to be simple enough to explain so that other people can do it.)

The product of my labours: https://github.com/stevage/saltymill

Disclaimers and apologies

A week is too soon to feel the real benefits of automatic deployment and configuration management, but plenty long enough for frustrations and annoyances. There are lots of good things about Salt, but those will go in another post.

I’ve intentionally written this post as I go, to document things that are surprising, counter-intuitive or bothersome to the newcomer. Once you become familiar with a system, it’s much harder to remember what it was like as a newbie, and easier to make allowances for it. So experienced Salt users will probably bristle at the wilful ignorance displayed herein.

A few things I quickly appreciated, compared to Chef:

  • Two parties, not three. Chef has the client (your laptop), the Chef Server (which I never used because it was too complicated) and the target machine itself. Salt doesn’t have the client, so you do everything either on the master (salt '*' state.highstate) or on the target machine in masterless mode (salt-call --local state.highstate).
  • No need for “knife upload”: the Salt Master just looks for your files in /srv/salt, so you can use any method you like for getting them there. I always hated having to upload recipes, environments etc individually – you always forgot something.
  • YAML rather than Ruby.
  • A nifty command line tool that lets you try states out on the fly, or run commands remotely on several servers at once. (Chef’s shell, formerly ‘shef’, never really worked out for me.)

Mediocre documentation

Automatic deployment systems are complicated, and you’d love some clear mental concepts to hang on to. Unfortunately, Salt’s chosen metaphor is pretty clumsy (salt grains, pillars, mines… meh), and they’re not well explained. Apparently the concept of a “state” is pretty important to Salt, but even after reading much documentation and doing quite a reasonable amount of hacking away, I still have only the vaguest notion of what a Salt state is. What’s a “high state”? A “low state”? How can I query the “state” of a minion? No idea. It’s hard to tell if “states” are implementation magic that makes the whole thing work, or if I, the user, am supposed to know about them.

(After more digging around, I think “state” has several meanings, of which one is just one of the actions defined in a SLS file.)

And then, strangely, how do you refer to the most important objects that you work with – the actual code that specifies what you want done? “SLS files”. Or “state files”. Or “SLSes”. Or “Salt states”. Or “Salt state files”. Those terms are all used in the docs. (I thought they were also called “formulas”, but those are apparently only pre-canned state files.)

YAML headaches

It turns out YAML has some weird indenting and formatting rules that occasionally ruin your day. It’s especially tedious having to learn this extra language and its subtleties (> vs |, multi-line hyphenated lists versus single-line square-bracketed lists) right when you just want to get started – it steepens the learning curve.
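
To illustrate (a contrived sketch, not from my actual formulas): “>” folds line breaks into spaces, “|” preserves them, and lists come in two flavours:

folded: >
  this all becomes
  one single line
literal: |
  these lines
  stay separate
hyphenated_list:
  - one
  - two
inline_list: [one, two]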

IDs, names

The behaviour of statement IDs is cute at first, but not well explained, and becomes pretty tiresome. The idea is you can write this:

httpd:
  pkg.installed

Succinct, huh? That’s actually shorthand for this:

install_apache_please:
  pkg:
    - name: httpd
    - installed

That is, the “id” (httpd) is also the default value for “name”. And you can compress one item of the following list (“installed”) onto the function name (“pkg”) with a dot.

So far, so good. But it quickly leads to this kind of silliness:

"{{ pillar.installdir}}/myfile.txt":
  file.managed:
    - unless: ...
another_command:  
cmd.wait: - watch: [ "{{ pillar.installdir}}/myfile.txt" ]

Free-text strings don’t make great IDs. On the whole, I’d prefer a syntax which is robustly readable and predictable, rather than one which is very succinct in special cases, but mostly less so.

The next problem is that all those IDs have to be unique across all your .sls files. That’s tedious when you have repetitious little actions that you can’t be bothered naming properly.
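
For example (hypothetical files): if webserver.sls grabs the obvious ID for a throwaway action, database.sls can’t use it, and has to invent something uglier:

# webserver.sls
cleanup:
  cmd.run:
    - name: rm -rf /tmp/build-files

# database.sls - "cleanup" is taken, so:
cleanup_database_temp_things:
  cmd.run:
    - name: rm -rf /tmp/db-build-files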

Another annoyance is that Salt’s data structure sometimes requires lists, and sometimes requires dicts, and it’s hard to remember which is which. For example:

finishlog:
  cmd.run:
    - name: |
        echo "All done! Enjoy your new server.<br/>" >> /var/log/salt/buildlog.html
  file.managed:
    - name: /var/log/salt/index.html
    - source: salt://initindex.html
    - template: jinja
    - context:
       buildtitle: "Your server is ready!"
       buildsubtitle: "Get in there and make something."
       buildtitlecolor: "hsl(130,70%,70%)"

Notice how the first level of indentation doesn’t have hyphens, the second level does, then the third level doesn’t? I’m not sure what the philosophy is here – it looks to me like each level is just a list of key/value pairs.

Jinja templating logic

Jinja is a fantastic templating language. But Salt uses this template language as its core programming logic. That’s like writing a C program using macros all over the place. It’s sort of ok as a way to access data values:

update_data:
 cmd.run:
 - cwd: {{ pillar['tm_dir'] }}

Here, the {{ … }} bit is a Jinja substitution, so the actual YAML code that gets parsed and executed looks like this:

update_data:
 cmd.run:
 - cwd: /mnt/saltymill

As most people who have dealt with macro preprocessors know, they rapidly get complex and very hard to debug. The error you see is a YAML parse error, but is the problem in your YAML code, or in what got substituted by Jinja?

It’s certainly possible to use the full range of Jinja functions (that is, {% … %} territory) inside Salt formulas, but for sanity I’m trying to avoid it. On the flip side, certain things that I expected to be easy, like defining a variable in a state {% set foo = blah %} and then being able to access it from inside any other template, turn out to not work. You need to explicitly pass context variables around, or import state files, and it becomes quite cumbersome.
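
For the record, the workaround looks something like this (hypothetical file and variable names): define the variable in one file, then explicitly import it everywhere else you need it:

{# settings.sls: #}
{% set tm_dir = '/mnt/saltymill' %}

{# any other state file that wants tm_dir: #}
{% from "settings.sls" import tm_dir with context %}

update_data:
  cmd.run:
    - cwd: {{ tm_dir }}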

Help may be at hand.  Evan Borgstrom also noticed that “simple readability of YAML starts to get lost in the noise of the Jinja syntax” and is working on a new, pure Python syntax called NaCl.

Many ordering options

There are four competing systems simultaneously operating to determine the order in which actions get executed, ranked here from most important to least important.

  1. Requisite order: The “preferred”, declarative system, where if X requires Y and Z requires X, then Salt just figures out that Y has to happen first.
  2. Explicit order: you can set an “order” flag on an action, like “order: 1” or “order: last”. Basically a last resort for when you’re tearing your hair out, I think. (I tried it once, and it seemed to have no effect.)
  3. Definition order: things happen in the order they’re declared in. (By far my preferred option). This doesn’t always seem to work, in the top.sls, for reasons I don’t understand.
  4. Lexicographical order (yes, really): actions are executed in alphabetical order of their state name. (Apparently this was a good thing because at least it was consistent across platforms.)

The requisite ordering system sounds good, but in practice it has two shortcomings:

  1. Repetitiveness: let’s say you have 10 psql commands that all strictly depend on Postgresql being installed. It’s very repetitive to state that dependency explicitly (see the sketch after this list), and much more efficient to simply install Postgresql at the top of the formula and assume its existence thereafter.
  2. Scope: Requiring actions in other files seems to prevent the running of just this file. That is, if in load_db.sls you do “require: [ cmd: install_postgres ]”, then this command line no longer works: salt ‘*’ state.sls load_db
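
Here’s what the repetitiveness looks like (a sketch with hypothetical state names): every command carries the same require baggage:

install_postgres:
  pkg.installed:
    - name: postgresql

create_schema:
  cmd.run:
    - name: psql -d mydb -f /tmp/schema.sql
    - require:
      - pkg: install_postgres

load_data:
  cmd.run:
    - name: psql -d mydb -f /tmp/data.sql
    - require:
      - pkg: install_postgres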

I’m yet to see any advantages to a declarative style. I like the certainty of knowing that a given sequence of steps works. The declarative model implies that one day the steps might be rearranged based only on my statement of what the dependencies are.

Arbitrary rules

There’s a rule that says you can’t have the same kind of state multiple times in a statement. This is invalid:

/var/log/salt/buildlog.html:
  file.managed:
    - source: salt://initlog.html
  file.append:
    - text: Commencing build...

Why? I don’t know. The rules dictate that you have to do this:

/var/log/salt/buildlog.html:
  file.managed:
    - source: salt://initlog.html

append_to_that_file:
  file.append:
    - name: /var/log/salt/buildlog.html
    - text: Commencing build...

Notice the cascade of consequences as this fragile “readability” tower crashes down:

  1. You can’t have two states of the same type in a statement, so we move the second state to a new statement; but:
  2. You can’t have two statements with the same ID, so we give the second statement a new ID; meaning:
  3. We now have to explicitly define the “name” of the file.append function.

Clumsy bootstrapping

I miss Chef’s elegant “knife bootstrap”, though. In a single command, Chef:

  1. Defines properties about the node, like its environment and role.
  2. Connects to the node
  3. Installs Chef
  4. Registers the node with the Chef server, or your client, or both.
  5. Starts a deployment, as required.

With Salt, these all seem to be manual steps (sketched in shell form after this list):

  1. I SSH to the node
  2. I download and install Salt (using one-liner bootstrap script)
  3. I write ‘grains’ to the newly created /etc/salt/grains file (yes, I think I’m doing this wrong – roles should be defined through pillars maybe)
  4. The node then attempts to register itself with the Salt Master, but because the connection is insecure:
  5. I have to go to the Salt Master and accept the pending key: yes | salt-key -A [and the docs suggest you’re supposed to manually verify the keys!]
  6. Now I can launch a deployment: salt <nodename> state.highstate
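
In shell terms, the whole dance looks roughly like this (hypothetical node name and grain; the bootstrap URL is from memory, so check the docs):

# On the new node:
wget -O - http://bootstrap.saltstack.org | sudo sh      # the one-liner bootstrap script
echo "role: tileserver" | sudo tee -a /etc/salt/grains  # hypothetical grain

# Back on the Salt Master:
yes | salt-key -A                  # accept the new node's pending key
salt 'newnode' state.highstate     # kick off a deployment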

When I rebuild a server, the above are preceded by these steps:

  1. Log in to OpenStack Dashboard, click ‘rebuild’ on the instance.
  2. Edit my ~/.ssh/known_hosts to remove the old SSH key.
  3. On the Salt Master, delete the old key: yes | salt-key -d <node>

There are probably ways of streamlining this process. I hope so, anyway, because it’s pretty clumsy. I don’t know yet whether the SaltMaster can automatically launch deployments when new nodes register.

Digital humanities for beginners: get started with the Trove API

Trove is the National Library of Australia’s “discovery interface” – an amazing catalogue of books, newspapers, maps, music, journal articles and more. The Trove API is a special website that programs can talk to, to run queries, retrieve individual records and so on. With some creative ideas and a bit of coding skill, you could make a pretty nifty tool and win a digital humanities award.

The documentation is pretty thorough, but doesn’t tell you how to get started. These days, web development is mostly about assembling the right tools and putting them together, but it can be hard for a novice to know what they need.

This guide is for you if:

  • You want to try exploring Trove programmatically; and
  • You’ve done some coding before; but
  • You’re not very familiar with coding against an API, or writing web applications.

In short, if you think the digital humanities might be for you.

What we’ll need

  • A server on which to run your web application. You can use your laptop directly, but better to use a virtual machine inside your laptop. Even better, a server running on the internet, like a VM on the NeCTAR Research Cloud.
  • A web application framework: Flask and Python
  • A library that makes it easy to make requests to other web servers (we’ll use Requests)
  • A JavaScript framework which makes dynamically modifying web pages much easier (we’ll use jQuery)
  • A CSS framework, which makes your page look attractive with no effort (we’ll use Twitter Bootstrap)
  • A templating engine, to combine HTML and the results of our queries into a web page (we’ll use Jinja2)
  • Our server logic, expressed as code in files.

A server

As an Australian academic, you can create your own server on the NeCTAR Research Cloud. That would be the best option, but can be a bit fiddly for a novice so we won’t explain that here.

The next best option is to create a server on a virtual machine inside your laptop, using VirtualBox and Vagrant. That keeps everything related to this one project self-contained, which is a big bonus when you start working on other projects. It also creates a pristine, controlled Linux environment, which means you’re less likely to run into issues caused by the peculiarities of your own system. Try the Vagrant Getting Started guide.

As a last resort, the simplest approach is to just install stuff directly on your computer. This may work ok to begin with.

Building a web application

We’re going to build a web application that will talk to Trove. This isn’t the only approach. You could:

  • Build a command line tool that pulls stuff out of Trove and saves it to disk
  • Make a desktop application that a user would have to download and install
  • Make a static website (not a web application) that does everything with JavaScript.

But a web application is probably what you’re going to want eventually and is the easiest to share with other people.

We’ll use Python, the programming language, and Flask, a web application framework for Python.

Why Flask? It’s easy to install, lightweight, and is well suited to experimental programming like this. (An alternative would be Django, which would be much better if you wanted to store stuff in a database and write something big and scalable.)

You can install these on the command line like this:

sudo apt-get install -y python-pip
sudo pip install flask

That is, first you install Python’s “pip” installer. Pip then knows how to install Python packages like Flask.

Whatever server you’re using, create a directory somewhere (perhaps under your home directory), called ‘trovetest’.

cd ~
mkdir trovetest
cd trovetest

Making requests to the Trove API

The Trove documentation says that to perform a query, we need to access a URL like this:

http://api.trove.nla.gov.au/result?key=<INSERT KEY>&zone=book&q=%22piers%20anthony%22

You might think you need to construct that string yourself, complete with question marks, equals signs and %22’s. You could do that using Python’s built-in urllib2 library:

import urllib2
apikey='1b2c3d4f'
troveURL = 'http://api.trove.nla.gov.au/'
zone = 'book'
search = '%22' + 'piers' + '%20' + 'anthony' + '%22'
r = urllib2.urlopen(troveURL + 'result' + '?' + 'key=' + apikey + '&' + 'zone=' + zone + '&' + 'q=' + search).read()

But it’s easier than that, if we use the Requests library.

import requests
apikey = '1b2c3d4f'
troveURL = 'http://api.trove.nla.gov.au/'
search = '"piers anthony"'  # no escaping needed: Requests URL-encodes parameters for us
r = requests.get(troveURL + 'result/', params = {
    'key': apikey,
    'zone': 'book',
    'q': search
    } )

So, install the Requests library as well:

sudo pip install requests

jQuery

These days, virtually all web applications use a JavaScript framework, to make manipulating the web page more practical. We’ll use jQuery, which is also required by Twitter Bootstrap.

Create a directory under ‘trovetest’ called ‘static’, and download the latest jQuery into it.

CSS Framework

Making a web page look good using straight CSS (cascading style sheets) is really hard. Making it look good in every browser and device is a nightmare. Save yourself the pain, and start from somewhere sensible like Bootstrap, made by Twitter. (By default, it looks a bit like Twitter’s website, but you can change that.)

Without Twitter Bootstrap

Adding Bootstrap and minimal changes.


Although it’s best to download both jQuery and Bootstrap to your server and link to them there, you can get started quickly by linking to them on the web. Just include this text at the top of your HTML files.

<head>
<script src="http://code.jquery.com/jquery-1.10.1.min.js"></script>
<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/css/bootstrap.min.css">
<script src="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/js/bootstrap.min.js"></script>
</head>

Templating engine

The most common way to build a web page dynamically is by writing a “template” of the HTML page. It can be as simple as this:

<body>
<h1>Request complete</h1>
You requested this article: {{ article.name }}
</body>

When you tell the web application framework to render the page, you pass in the list of variables – in this case, an object called ‘article’ with an attribute ‘name’.
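
For instance, here’s a minimal sketch of that hand-off in Flask (a made-up route and hard-coded data, and it assumes a templates/example.html exists):

from flask import Flask, render_template

app = Flask(__name__)

@app.route("/example")
def example():
    # A made-up article; Jinja2 lets the template write article.name even for a dict
    article = {'name': 'A hypothetical article'}
    return render_template('example.html', article=article)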

Templating example

We’ll use the Jinja2 templating engine, because that’s what Flask uses.

Our server logic

Now that everything is in place, let’s write a simple application that makes just one request. We’ll ask the Trove API for a list of newspaper articles mentioning Piers Anthony (following their documentation’s example).

Create this file as ~/trovetest/simple.py

from flask import Flask, render_template, request  # load Flask itself
import requests, json  # load the Requests library, and the JSON library for interpreting responses

app = Flask(__name__)  # create our web application object, named after this file
apikey = '1b2c3d...'  # insert your API key, following instructions here: http://trove.nla.gov.au/general/api
troveURL = 'http://api.trove.nla.gov.au/'  # all Trove API URLs start with this

@app.route("/list")  # this is what we do when someone goes to "http://localhost:5000/list"
def list():  # the function that defines the behaviour
    query = {  # this dict structure contains the parameters that make up a URL,
        'key': apikey,  # following the Trove API documentation
        'zone': 'newspaper',
        'q': 'piers anthony',
        'encoding': 'json',  # we specify 'json' as the format because json is easy to parse
        'include': 'tags'
    }
    r = requests.get(troveURL + 'result/', params=query)

    # Requests transforms the params dict into a string like
    # ?key=...&zone=newspaper&q=piers%20anthony&include=tags&encoding=json

    # r is now an object containing lots of information about the response - error codes, content etc.

    r.encoding = 'ISO-8859-1'  # We avoid some unicode conversion errors by adding this step.
    results = r.json()  # Convert the text we receive (in JSON format) into a Python dict
    return render_template('list.html', results=results)  # Render the template, passing through that dict

app.run(host='0.0.0.0', debug=True)  # Run the web server, allowing anyone to connect, in debug mode. (Not safe for a public web server.)

 

And save this as ~/trovetest/templates/list.html:

<!doctype html>
<head>
    <title>Trove test</title>
    <script src="http://code.jquery.com/jquery-1.10.1.min.js"></script>
    <link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/css/bootstrap.min.css">
    <script src="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/js/bootstrap.min.js"></script>
</head>
<body>
    <h1>Results</h1>

    <ul>
    {% for article in results.response.zone.0.records.article %}
      <li>
        <a href="{{ article.troveUrl }}">{{ article.title.value }}</a> - <a href="item/{{ article.id }}">More info</a>
        <br/>
        <i><small>{{ article.snippet | safe }}</small></i>
      </li>
    {% endfor %}
    </ul>
</body>

Run your server

On the command line, tell Python (and hence Flask) to run your application:

~/trovetest$ python simple.py
 * Running on http://0.0.0.0:5000/
 * Restarting with reloader

Flask will sit there waiting for someone to connect to your webserver in a browser.

In your browser, go to http://localhost:5000/list

The results of our little Trove query.

What next?

That was a very quick, high-level view of all the pieces you need. If you made it through the example, you’ll be in an excellent position to start making interesting web applications building on the Trove API. There’s no shortage of information on the web about all these technologies – the hard bit is knowing what you need to know.

All sorts of fun things can be done:

  • Cool exploration interfaces
  • Bots
  • Visualisations like graphs, charts etc
  • Text mining and analysis

Tim Sherratt (aka @wragge) has made many fun creations with the Trove API, documented on his discontents.com.au blog. Tim works for the Trove team, who are generally pretty happy to help.

 

Anonymous longitudinal surveys with LimeSurvey

A researcher at the Australian Research Centre in Sex, Health and Society (ARCSHS, pronounced “archers”) carries out surveys that are both:

  • anonymous: personal details are not collected, and steps are taken to avoid being able to identify participants; and
  • longitudinal: participants are contacted at intervals to answer similar questions. Questions often refer to previous answers given by the participant.

It’s a tricky combination, not well supported by most survey software. Here’s a solution, using LimeSurvey, that doesn’t require modifying any code. These instructions are aimed at people with no familiarity with LimeSurvey.

Quick summary:

  1. The survey is carried out in non-anonymous mode; anonymity is maintained through an external email redirection service.
  2. Additional attributes are added to the token tables for follow-up rounds.
  3. Answers from each survey are transferred to the additional attributes through Excel manipulation.

Survey hosting

LimeService is a very cheap way to host LimeSurvey, and removes the burden of server administration.

Anonymous email addresses

Participants must not sign up for the survey using their actual email address. Instead, they should go to a third-party email forwarding site like http://notsharingmy.info. For the participant:

  1. Click on a link
  2. Enter their email address, click “Get an obscure email”
  3. Copy the generated email address into the survey registration form.

(Note: as of January 2013, notsharingmy.info is unreliable. I’m working on another solution.)

About tokens

This solution relies heavily on LimeSurvey’s “tokens”, which are explained badly in the interface. Enabling tokens just means you want to track information about individual participants: individually adding them to the list, inviting them, keeping track of who has completed the survey or who needs another reminder. We also add “additional attributes”: the participants’ previous answers.

1. Set up the initial survey

  1. Create a new survey.
  2. Fill out the Description, Welcome Message, End Message.
  3. Create the first question group.
  4. Create some questions.
  5. Set these survey “general settings”:
    1. Tokens > Allow public registration? Yes
    2. (Optional) Modify the registration text as described below in “Avoiding personal information”
  6. Activate the survey. You’ll be asked if you want to initialise tokens. Yes, you do.
  7. Publicise the URL, and get lots of responses. Great.

Avoiding personal information

By default, LimeSurvey collects from each self-registering participant their first name, last name and email address. For a truly anonymous survey, you may wish to avoid collecting their name.

  1. Under Global Settings, choose Template Editor.
  2. As the warning points out, editing the default template isn’t a great idea. Let’s make a copy called “no_name_collecting”.
  3. Now, edit the copy. Select “startpage.pstpl”, then add this code just before the “</head>” line:
    ...
    <script>
    $(function(){
      $("[name=register_firstname]").parent().parent().hide();
      $("[name=register_lastname]").parent().parent().hide();
    });
    </script>
    </head>

  4. Next, you might want to change the registration message, to direct people to create an anonymous email address. On the “Register” screen, select “register.pstpl”.
  5. In the code, replace {REGISTERMESSAGE1} and {REGISTERMESSAGE2} with any text or HTML you like.

2. Export responses and tokens

  1. Export the responses as CSV.
  2. Export the tokens as CSV.

3. Create follow-up survey

The first survey was a success, and there are now lots of entries in the tokens table and the responses table. More in the former than the latter, because some people will sign up and not answer any questions.

  1. Copy the first survey. (Click the + button, as if starting a new survey.)
  2. You’ll probably want to remove some irrelevant questions and groups, so do that.
  3. Now add one “additional attribute” for every response that you want to refer to from the initial survey, in Token Management. The other fields don’t matter; leave them blank.
  4. Next, you may want to use these attributes in some questions. Let’s say you want to ask “Last time you did this survey you were 42. How old are you now?” In the question text, click ‘LimeSurvey replacement field properties’, then select the extended attribute (see the example after this list).
  5. You can get fancy and set that value as the default for the question; the LimeSurvey documentation has instructions on how to do this.
  6. Repeat this for each question, wherever those extended attributes are needed.
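
For example, assuming the age answer was imported into attribute_1, the question text ends up containing a replacement field something like this (the exact field name depends on which attribute you chose):

Last time you did this survey you were {TOKEN:ATTRIBUTE_1}. How old are you now?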

4. Transfer previous responses to the follow-up survey

  1. Download the token table for this survey. It will have no data, but the headers will be useful. So far, you have created a way to hold answers from previous surveys for each participant, but you haven’t actually put those answers in there. Time for some Excel magic. You want to create a token table for the follow-up survey, using bits from the three CSV files you’ve downloaded so far:
    1. Tokens table for the initial survey (containing the email addresses people signed up with).
    2. Responses for the initial survey.
    3. Tokens table for the follow-up survey (containing the additional attributes).
  2. Copy all the rows of tokens from the initial survey to the follow-up survey. Don’t copy the header row. Make sure you retain the additional attribute fields (attribute_1 etc.); they should be blank.
  3. Notice that these tokens have already been “completed”. Clear out the invited and completed fields. Set the remindercount field to 0 and the usesleft field to 1.
  4. Now, copy columns from the participant responses table into attribute fields, one by one, as appropriate. Make sure the IDs line up correctly.
  5. Save the file (“followup-tokens.csv” if you like). Now import it into your follow-up survey.

5. Activate the follow-up survey

Whew. You’re now ready to launch.

  1. Activate the survey.
  2. Send out invitations, under “Token control”.
  3. The default email template is pretty gross, so clean it up a bit before you send.
  4. Each participant receives a custom URL which logs them in without requiring any username and password.

You can then repeat this process for each subsequent iteration.