Steve Bennett blogs

…about maps, open data, Git, and other tech.

A pattern for multi-instrument data harvesting with MyTardis

Here’s a formula for setting up a multi-instrument MyTardis installation. It describes one particular pattern, which has been implemented at RMIT’s Microscopy and Microanalysis Facility (RMMF), and is being implemented at Swinburne. For brevity, the many variations on this pattern are not discussed, and gross simplifications are made.

This architecture is not optimal for your situation. Maybe a documented a suboptimal architecture is more useful than an undocumented optimal one.

The goal

The components:

  • RSync server running on each instrument machine. (On Windows, use DeltaCopy.)
  • Harvester script which collates data from the RSync servers to a staging area.
  • Atom provider which serves the data in the staging area as an Atom feed.
  • MyTardis, which uses the Atom Ingest app to pull data from the Atom feed. It saves data to the MyTardis store and metadata to the MyTardis database.


Things you need to find out:

  1. List of instruments you’ll be harvesting from. Typically there’s a point of diminishing returns: 10 instruments in total, of which 3 get most of the use, and numbers #9 and #10 are running DOS and are more trouble than they’re worth.
  2. For each instrument: what data format(s) are produced? Some “instruments” have several distinct sensors or detectors: a scanning electron microscope (producing .tif images) with add-on EDAX detector (producing .spt files), for instance. If there is no MyTardis filter for the given data format(s), then MyTardis will store the files without extracting any metadata, making them opaque to the user.

    Details collected for each instrument at RMIT’s microscopy facility.

  3. Discuss with the users how MyTardis’s concept of “dataset” and “experiment” will apply to the files they produce. If possible, arrange for them to save files into a file structure which makes this clear: /User_data/user123/experiment14/dataset16/file1.tif . Map the terms “dataset” and “experiment” as appropriate: perhaps “session” and “project”.
  4. Networking. Each instrument needs to be reachable on at least one port from the harvester.
  5. You’ll need a staging area. Somewhere big, and readily accessible.
  6. You’ll eventually need a permanent MyTardis store. Somewhere big, and backed up. (The Chef cookbooks actually assume storage inside the VM, at present.)
  7. You’ll eventually need a production-level MyTardis database. It may get very large if you have a lot of metadata (eg, Synchrotron data). (The Chef cookbooks install Postgres locally.)

RSync Server

We assume each PC is currently operating in “stand-alone” mode (users save data to a local hard disk) but is reachable by network. If your users have authenticated access to network storage, you’ll do something a bit different.

Install an Rsync server on each PC. For Windows machines, use DeltaCopy. It’s free. Estimate 5 minutes per machine. Make it auto-start, running in the background, all the time, serving up the user  data directory on a given port.

Harvester script

The harvester script collates data from these different servers into the staging area. The end result is a directory with this kind of structure:

            2012-03-14 exp1
            2012-03-15 exp1

You will need to customise the scripts for your installation.

  1. Create  a staging area directory somewhere. Probably it will be network storage mounted by NFS or something.
  2. sudo mkdir /opt/harvester
  3. Create a user to run the scripts, “harvester”. Harvested data will end up being owned by this user. Make sure that works for you.
  4. sudo chown harvester /opt/harvester
  5. sudo -su harvester bash
  6. git clone
  7. Review the scripts, understand what they do, make changes as appropriate.
  8. Create a Cron job to run the harvester at frequent intervals. For example:
    */5 * * * * /usr/local/microtardis/harvest/ >/dev/null 2>&1
  9. Test.

Atom provider

My boldest assumption yet: you will use a dedicated VM simply to serve an Atom feed from the staging area.

Since you already know how to use Chef, use the Chef Cookbook for the Atom Dataset Provider to provision this VM from scratch in one hit.

  1. git clone
  2. Modify cookbooks/atom-dataset-provider/files/default/feed.atom as appropriate. See the documentation at
  3. Modify environments/dev.rb and environments/prod.rb as appropriate.
  4. Upload it all to your Chef server:
    knife cookbook upload –all
    knife environment from file environments/*
    knife role from file roles/*
  5. Bootstrap your VM. It works first try.


You’re doing great. Now let’s install MyTardis. You’ll need another VM for that. Your local ITS department can probably deliver one of those in a couple of hours, 6 months tops. If possible (ie, if no requirement to host data on-site), use the Nectar Research Cloud.

You’ll use Chef again.

  1. git clone
  2. Follow the instructions there.
  3. Or, follow the instructions here: Getting started with Chef on the NeCTAR Research Cloud.
  4. Currently, you need to manually configure the Atom harvester to harvest from your provider. Instructions for that are on Github.

MyTardis is now running. For extra credit, you will want to:

  • Develop or customise filters to extract useful metadata.
  • Add features that your particular researchers would find helpful. (On-the-fly CSV generation from binary spectrum files was a killer feature for some of our users.)
  • Extend harvesting to further instruments, and more kinds of data.
  • Integrate this system into the local research data management framework. Chat to your friendly local librarian. Ponder issues like how long to keep data, how to back it up, and what to do when users leave the institution.
  • Integrate into authentication systems: your local LDAP, or perhaps the AAF.
  • Find more sources of metadata: booking systems, grant applications, research management systems…


After developing this pattern for RMMF, it seems useful to apply it again. And if we (VeRSI, RMIT’s Ian Thomas, Monash’s Steve Androulakis, etc) can do it, perhaps you’ll find it useful too. I was inspired by Peter Sefton’s UWS blog post Connecting Data Capture applications to Research Data Catalogues/Registries/Stores (which raises the idea of design patterns for research data harvesting).

Some good things about this pattern:

  • It’s pretty flexible. Using separate components means individual parts can be substituted out. Perhaps you have another way of producing an Atom feed. Or maybe users save their data directly to a network share, eliminating the need for the harvester scripts.
  • By using simple network transfers (eg, HTTP), it suits a range of possible network architectures (eg, internal staging area, but MyTardis on the cloud; or everything on one VM…)
  • Work can be divided up: one party can work on harvesting from the instruments, while another configures MyTardis.
  • You can archive the staging area directly if you want a raw copy of all data, especially during the testing phase.

Some bad things:

  • It’s probably a bit too complicated – perhaps one day there will be a MyTardis ingest from RSync servers directly.
  • Since the Atom provider is written in NodeJS, it means another set of technologies to become familiar with for troubleshooting.
  • There are lots of places where things can go wrong, and there is no monitoring layer (eg, Nagios).


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: