Steve Bennett blogs

…about maps, open data, Git, and other tech.

Tag Archives: atom

A pattern for multi-instrument data harvesting with MyTardis

Leave a comment Posted by steveko on September 19, 2012

Here’s a formula for setting up a multi-instrument MyTardis installation. It describes one particular pattern, which has been implemented at RMIT’s Microscopy and Microanalysis Facility (RMMF), and is being implemented at Swinburne. For brevity, the many variations on this pattern are not discussed, and gross simplifications are made.

This architecture is not optimal for your situation. Maybe a documented a suboptimal architecture is more useful than an undocumented optimal one.

The goal

The components:

RSync server running on each instrument machine. (On Windows, use DeltaCopy.)
Harvester script which collates data from the RSync servers to a staging area.
Atom provider which serves the data in the staging area as an Atom feed.
MyTardis, which uses the Atom Ingest app to pull data from the Atom feed. It saves data to the MyTardis store and metadata to the MyTardis database.

Preparation

Things you need to find out:

List of instruments you’ll be harvesting from. Typically there’s a point of diminishing returns: 10 instruments in total, of which 3 get most of the use, and numbers #9 and #10 are running DOS and are more trouble than they’re worth.
For each instrument: what data format(s) are produced? Some “instruments” have several distinct sensors or detectors: a scanning electron microscope (producing .tif images) with add-on EDAX detector (producing .spt files), for instance. If there is no MyTardis filter for the given data format(s), then MyTardis will store the files without extracting any metadata, making them opaque to the user.
Details collected for each instrument at RMIT’s microscopy facility.
Discuss with the users how MyTardis’s concept of “dataset” and “experiment” will apply to the files they produce. If possible, arrange for them to save files into a file structure which makes this clear: /User_data/user123/experiment14/dataset16/file1.tif . Map the terms “dataset” and “experiment” as appropriate: perhaps “session” and “project”.
Networking. Each instrument needs to be reachable on at least one port from the harvester.
You’ll need a staging area. Somewhere big, and readily accessible.
You’ll eventually need a permanent MyTardis store. Somewhere big, and backed up. (The Chef cookbooks actually assume storage inside the VM, at present.)
You’ll eventually need a production-level MyTardis database. It may get very large if you have a lot of metadata (eg, Synchrotron data). (The Chef cookbooks install Postgres locally.)

RSync Server

We assume each PC is currently operating in “stand-alone” mode (users save data to a local hard disk) but is reachable by network. If your users have authenticated access to network storage, you’ll do something a bit different.

Install an Rsync server on each PC. For Windows machines, use DeltaCopy. It’s free. Estimate 5 minutes per machine. Make it auto-start, running in the background, all the time, serving up the user data directory on a given port.

Harvester script

The harvester script collates data from these different servers into the staging area. The end result is a directory with this kind of structure:

user_123
        TiO2
            2012-03-14 exp1
                run1.tif
                run2.tif
                edax.spc
            2012-03-15 exp1
                anotherrun.tif
                highmag.tif
user_456
...

You will need to customise the scripts for your installation.

Create a staging area directory somewhere. Probably it will be network storage mounted by NFS or something.
sudo mkdir /opt/harvester
Create a user to run the scripts, “harvester”. Harvested data will end up being owned by this user. Make sure that works for you.
sudo chown harvester /opt/harvester
sudo -su harvester bash
git clone https://github.com/VeRSI-Australia/RMIT-MicroTardis-Harvest
Review the scripts, understand what they do, make changes as appropriate.
Create a Cron job to run the harvester at frequent intervals. For example:
*/5 * * * * /usr/local/microtardis/harvest/harvest.sh >/dev/null 2>&1
Test.

Atom provider

My boldest assumption yet: you will use a dedicated VM simply to serve an Atom feed from the staging area.

Since you already know how to use Chef, use the Chef Cookbook for the Atom Dataset Provider to provision this VM from scratch in one hit.

git clone https://github.com/stevage/atom-provider-chef
Modify cookbooks/atom-dataset-provider/files/default/feed.atom as appropriate. See the documentation at https://github.com/stevage/atom-dataset-provider.
Modify environments/dev.rb and environments/prod.rb as appropriate.
Upload it all to your Chef server:
knife cookbook upload –all
knife environment from file environments/*
knife role from file roles/*
Bootstrap your VM. It works first try.

MyTardis

You’re doing great. Now let’s install MyTardis. You’ll need another VM for that. Your local ITS department can probably deliver one of those in a couple of hours, 6 months tops. If possible (ie, if no requirement to host data on-site), use the Nectar Research Cloud.

You’ll use Chef again.

git clone https://github.com/mytardis/mytardis-chef

Follow the instructions there.
Or, follow the instructions here: Getting started with Chef on the NeCTAR Research Cloud.
Currently, you need to manually configure the Atom harvester to harvest from your provider. Instructions for that are on Github.

MyTardis is now running. For extra credit, you will want to:

Develop or customise filters to extract useful metadata.
Add features that your particular researchers would find helpful. (On-the-fly CSV generation from binary spectrum files was a killer feature for some of our users.)
Extend harvesting to further instruments, and more kinds of data.
Integrate this system into the local research data management framework. Chat to your friendly local librarian. Ponder issues like how long to keep data, how to back it up, and what to do when users leave the institution.
Integrate into authentication systems: your local LDAP, or perhaps the AAF.
Find more sources of metadata: booking systems, grant applications, research management systems…

Discussion

After developing this pattern for RMMF, it seems useful to apply it again. And if we (VeRSI, RMIT’s Ian Thomas, Monash’s Steve Androulakis, etc) can do it, perhaps you’ll find it useful too. I was inspired by Peter Sefton’s UWS blog post Connecting Data Capture applications to Research Data Catalogues/Registries/Stores (which raises the idea of design patterns for research data harvesting).

Some good things about this pattern:

It’s pretty flexible. Using separate components means individual parts can be substituted out. Perhaps you have another way of producing an Atom feed. Or maybe users save their data directly to a network share, eliminating the need for the harvester scripts.
By using simple network transfers (eg, HTTP), it suits a range of possible network architectures (eg, internal staging area, but MyTardis on the cloud; or everything on one VM…)
Work can be divided up: one party can work on harvesting from the instruments, while another configures MyTardis.
You can archive the staging area directly if you want a raw copy of all data, especially during the testing phase.

Some bad things:

It’s probably a bit too complicated – perhaps one day there will be a MyTardis ingest from RSync servers directly.
Since the Atom provider is written in NodeJS, it means another set of technologies to become familiar with for troubleshooting.
There are lots of places where things can go wrong, and there is no monitoring layer (eg, Nagios).

research data atom, chef, e-research, featured, mytardis, patterns, research data, research data management

How OData will solve data sharing and reuse for e-Research

3 Comments Posted by steveko on June 21, 2012

OData’s shiny logo.

Ok, now that I have your attention, let me start backpedalling. In the Australian e-Research sector, I’ve worked on problems of research data management (getting researchers’ data into some kind of database) and registering (recording the existence of that data somewhere else, like an institutional store, or Research Data Australia). There hasn’t been a huge amount of discussion about the mechanics of reusing data: the discussion has mostly been “help researcher A find researcher B’s data, then they’ll talk”.

Now, I think the mechanics of how that data exchange is going to take place will become more important. We’re seeing the need for data exchange between instances of MyTardis (eg, from a facility to a researcher’s home institution) and we will almost certainly see demand for exchange between heterogeneous systems (eg, MyTardis to/from XNAT). And as soon as data being published becomes the norm, consumers of that data will expect it to be available in easy to use formats, with standard protocols. They will expect to be able to easily get data from one system into another, particularly once those two systems are hosted in the cloud.

Until this week, I didn’t know of a solution to these problems. There is the Semantic Web aka RDF aka Linked Data, but the barrier to entry is high. I know: I tried and failed. Then I heard about OData from Conal Tuohy.

What’s OData?

OData is:

An application-level data publishing and editing protocol that is intended to become “the language of data sharing”
A standardised, conventionalised application of AtomPub, JSON, and REST.
A technology built at Microsoft (as “Astoria”), but now released as an open standard, under the “We won’t sue you for using it” Open Specification Promise.

You have OData producers (or “services”) and consumers. A consumer could be: a desktop app that lets a user browse data from any OData site; a web application that mashes up data between several producers; another research data system that harvests its users’ data from other systems; a data aggregator that mirrors and archives data from a wide range of sites. Current examples of desktop apps include a plugin to Excel, a desktop data browser called LINQPad, a SilverLight (shudder) data explorer, high-powered business analytics tools for millionaires, a 3.5 star script to import into SQL Server…

I already publish data though!

Perhaps you already have a set of webservices with names like “ListExperiments”, “GetExperimentById”, “GetAuthorByNLAID”. You’ve documented your API, and everyone loves it. Here’s why OData might be a smarter choice, either now, or for next time:

Data model discovery: instead of a WSDL documenting the *services*, you have a CSDL describing the *data model*. (Of course if your API provides services other than simple data retrieval or update, that’s different. Although actually OData 3.0 supports actions and functions.)
Any OData consumer can explore all your exposed data and download it, without needing any more information. No documentation required.

What do you need to do to provide OData?

To turn an existing data management system (say, PARADISEC’s catalogue of endangered language field recordings) into an OData producer, you bolt an Atom endpoint onto the side. Much like many systems have already done for RIF-CS, but instead of using the rather boutique, library-centric standards OAI-PMH and RIF-CS to describe collections, you use AtomPub and HTTP to provide individual data items.

Specifically, you need to do this:

Design an object model for the data that you want to expose. This could be a direct mapping of an underlying database, but it needn’t be.
Describe that object model in CSDL (Conceptual Schema Definition Language), a Microsofty schema serialised as EDMX, looking like this example.
Provide an HTTP endpoint which responds to GET requests. In particular:
- it provides the EDMX file when “$metadata” is requested.
- it provides an Atom file containing requested data, broken up into reasonably sized chunks. Optionally, it can provide this in JSON instead.
- optionally, it supports a standard filtering mechanism, like “http://data.stackexchange.com/stackoverflow/atom/Badges?$filter=Id eq 92046“. (Ie, return all badges with Id equal to 92046)
For binary or large data, the items in the Atom feed should link rather than contain the raw data.
Optionally support the full CRUD range by responding to PUT, PATCH and DELETE requests.

There are libraries for a number of technologies, but if your technology doesn’t have an OData library, it probably has libraries for components like Atom, HTTP, JSON, and so on.

As an example, Tim Dettrick (UQ) and I have built an Atom dataset provider for MyTardis – before knowing about OData. To make it OData compliant probably requires:

modifying the feed to use (even more) standard names
following the OData convention for URLs
following the OData convention for filtering (eg, it already supports page size specification, but not using the OData standard naming)
supporting the $metadata request to describe its object model
probably adding a few more functions required by the spec.

Having done that, this would then be a generic file system->Odata server useful for any similar data harvesting project.

So, why not Linked Data?

Linked Data aka RDF aka Semantic Web is another approach that has been around for many years. Both use HTTP with content negotiation, an entity-attribute-value data model (ie, equivalent to XML or JSON), stable URIs and serialise primarily in an XML format. Key differences (in my understanding):

The ultimate goal of Linked Data is building a rich semantic web of data, so that data from a wide variety of sources can be aggregated, cross-checked, mashed up etc in interesting ways. The goal of OData is less lofty: a standardised data publishing and consuming mechanism. (Which is probably why I think OData works better in the short term)
Linked Data suggests, but does not mandate, a triplestore as the underlying storage. OData suggests, but does not mandate, a traditional RDBMS. Pragmatically, table-row-column RDBMS are much easier to work with, and the skills are much more widely available.
RDF/XML is hard to learn, because it has many unique concepts: ontologies, predicates, classes and so on. The W3C RDF primer shows the problem: you need to learn all this just in order to express your data. By contrast, OData is more pragmatic: you have a data model, so go and write that as Atom. In short, OData is conceptually simple, whereas RDF is conceptually rich but complex.[I’m not sure my comparison is entirely fair: most of my reading and experience of Linked Data comes from semantic web ideologues, who want you to go the whole nine yards with ontologies that align with other ontologies, proper URIs etc. It may be possible do quickly generate some bastardised RDF/XML with much less effort – at the risk of the community’s wrath.]
Linked Data is, unsurprisingly, much better at linking outside your own data source. In fact, I’m not sure how possible it is in OData. In some disciplines, such as the digital humanities cultural space, linking is pretty crucial: combining knowledge across disciplines like literature, film, history etc. In hard sciences, this is not (yet) important: much data is generated by some experiment, stored, and analysed more or less in isolation from other sources of data.

More details in this great comparison matrix.

In any case, the debate is moot for several reasons:

Linked Data and OData will probably play nicely together if efforts to use schema.org ontologies are successful.
There is room for more than one standard.
In some spaces, OData is already gaining traction, while in others Linked Data is the way to go. This will affect which you choose. IMHO it is telling that StackExchange (a programmer-focused set of Q&A sites) publishes its data in OData: it’s a programmer-friendly technology.

Other candidates?

GData

It is highly implausible that Google will be beaten by Microsoft in such an important area as data sharing. And yet Google Data Protocol (aka GData) is not yet a contender: it is controlled entirely by Google, and limited in scope to accessing services provided by Google or by the Google App engine. A cursory glance suggests that it is in fact very similar to OData anyway: AtomPub, REST, JSON, batch processing (different syntax), a filterying/query syntax. I don’t see support for a data model discovery, however: there’s no $metadata.

So, pros and cons?

Pros

Uses technologies you were probably going to use anyway.
A reasonable (but heavily MS-skewed) ecosystem
Potentially avoids the cost of building data exploration and query tools, letting some smart client do that for you.
Driven by the private sector, meaning serious dollars invested in high quality tools.

Cons

Driven by business and Microsoft, meaning those high quality tools are expensive and locked to MS technologies that we don’t use.
Future potential unclear
Lower barrier to entry than linked data/RDF

research data atom, atompub, csdl, e-research, gdata, json, linked data, odata, rdf, rdf/xml, research data, rest, rss, semantic web

	arnebab on 10 things I hate about Gi…
	arnebab on 10 things I hate about Gi…
	arnebab on 10 things I hate about Gi…
	Smitty on 10 things I hate about Gi…
	Igor on 10 things I hate about Gi…