Steve Bennett blogs

…about maps, open data, Git, and other tech.

Category Archives: research data

Normalize cross-tabs for Tableau: a free Google Sheets tool

Problem

You want to do some visualisation magic in Tableau, but your spreadsheet looks like this:

All those green columns are dependent variables: separate measurements about the single location identified by the white columns.

Tableau would be so much happier if your spreadsheet instead had one row per observation: the identifier (white) columns, plus a single column naming the variable and a single column holding its value.

This is called “normalizing” the “cross-tab” format, converting from “wide format” to “long format”, or “UNPIVOT”. Tableau provides an Excel plugin for reshaping data; unfortunately, if you don’t use Excel, you’re stuck, which is an odd gap.

Solution

Anyway, I’ve made a Google Sheets script “Normalize cross-tab” that will do it for you.

As the instructions say, to use it, you:

  1. Reorganise your data so that all the independent variable columns are to the left of all the dependent ones; then
  2. Place the cursor somewhere in the first (leftmost) dependent variable column.

It then creates a new sheet, “NormalizedResult”, and puts the result there.
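
If you’d rather script the reshape yourself, the same wide-to-long operation is also available in Python with pandas. A minimal sketch, with hypothetical file and column names standing in for the white identifier columns:

    import pandas as pd

    # The identifier columns stay as-is; every other column is unpivoted
    # into variable-name/value pairs, one row per observation.
    wide = pd.read_excel("crosstab.xlsx")            # hypothetical input file
    long = wide.melt(id_vars=["Location", "Year"],   # hypothetical identifiers
                     var_name="Variable",
                     value_name="Value")
    long.to_csv("normalized.csv", index=False)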

How to use

It’s surprisingly clumsy to share Google Scripts, at least until the new “Add-ons” feature is mature. Here’s the best I can do for you:

1. Copy the script to the clipboard

Go to https://raw.githubusercontent.com/stevage/normalize-crosstab/master/normalizeCrossTab.gs, select all the text, and copy to the clipboard.

2. Upload your spreadsheet to Google Sheets

Upload your Excel spreadsheet into Google Sheets, if it’s not there already.

3. Tools > Script Editor…

4. Click “Spreadsheet”

5. Paste

In the window labelled “Code.gs”, select all the existing text and paste the script from the clipboard over it.

6. Save.

You’ll be asked to give this script “project” a name. Any name will do.

7. Select the “start” function.

8. Click Run

Click Continue and accept the authorisation request.

9. Follow the instructions of the script

Now, switch windows to your Google Sheet, and you’ll see the sidebar.

10. Download your normalised spreadsheet

On the NormalizedResult page, choose File > Download as…

If you want to convert several spreadsheets, you can save yourself pain by loading them all into the same workbook. Just remember that the script always saves its output to the NormalizedResult sheet.

The Data Guru in Residence

Cross-posted at Code for Australia.

Last week, Code for Australia launched its first fellowship program, a four-month project where a civic-minded developer will try a new approach to helping government solve problems with their data. For the next few months, I’ll be the Data Guru in Residence, blogging mostly to http://melbdataguru.tumblr.com. The program got a brief mention in The Age.

My goals are to find interesting and useful datasets, help make them public, and do fun stuff with them. It’s a kind of test run for the Code for Australia hacker in residence program currently being developed. Since I work for the University of Melbourne, I’ll be targeting datasets that are useful for researchers, and using VicNode to store data wherever it’s needed.

To start with, I’m spending some time with the CityLab team at City of Melbourne. They’re very progressive on the open data front, and their Open Data Platform has some really high quality datasets, like the 70,000-tree Urban Forest or the Development Activity Monitor, which contains detailed information on property developments.

“Living, Breathing Melbourne”, our GovHack Project, would be so much better with live data feeds.

Some of the immediate targets on the radar are live feeds from the city’s pedestrian sensors and bike share stations. I’d love to incorporate these into the successful GovHack project, Living Breathing Melbourne, built with Yuri Feldman and Andrew Chin. There’s also lots of interesting data from the Census of Land Use and Employment, with immense detail on how floorspace is divided up between residential, retail, commercial and so on. There are Mahlstedt fire plans, LIDAR data, and a really detailed, textured 3D model of the CBD. And of course other data that’s already public, but whose full potential hasn’t yet been realised.

If you’re from a government body (Federal, State, Council, or agency), based in or around Melbourne and you could use the services of a Data Guru, please get in touch!

What I learned at e-Research Australasia 2012

  1. Waterfall development still doesn’t work.
  2. Filling an institutional research data registry is still a hard slog.
  3. Omero is not just for optical microscopy.
  4. Spending money so universities can build tools that the private sector will soon provide for free ends badly.
  5. Research data tools that mimic the design of laymen’s tools (“realestate.com.au for ecologists”) work well.
  6. AURIN is brilliant. AuScope is even more brilliant than before.
  7. Touchscreen whiteboards are here, are extremely cool, and cost less than $5000.
  8. AAF still doesn’t work, but will soon. Please.
  9. NeuroHub. If neuroscience is your thing.
  10. Boring, mildly offensive after-dinner entertainment works well as a team-building exercise.
  11. All the state-based e-Research organisations (VeRSI, IVEC, QCIF, Intersect, eRSA, TPAC etc.) are working on a joint framework for e-research support.
  12. Cytoscape: An Open Source Platform for Complex Network Analysis and Visualization
  13. Staff at e-Research organisations have a much more informed view of future e-Research projects, having worked on so many of them.
  14. If you tick the wrong checkbox, your paper turns into a poster.
  15. People find the word “productionising” offensive, but don’t mind “productifying”.
  16. CloudStor’s FileSender is the right way for people in the university sector to send big files to each other.

Build it? Wait for someone else to build it?

And a thought that hit me after the conference: although a dominant message over the last few years has been “the data deluge is coming! start building things!”, there are sometimes significant advantages in waiting. e-Research organisations overseas are also building data repositories, registries, tools etc. In many cases, it would pay to wait for a big project overseas to deliver a reusable product, rather than going it alone on a much smaller scale. So, since we (at e-Research organisations) are trying to help many researchers, perhaps we should weigh the prospect of some other tool arriving on the scene when assessing the merit of any individual project.

A pattern for multi-instrument data harvesting with MyTardis

Here’s a formula for setting up a multi-instrument MyTardis installation. It describes one particular pattern, which has been implemented at RMIT’s Microscopy and Microanalysis Facility (RMMF), and is being implemented at Swinburne. For brevity, the many variations on this pattern are not discussed, and gross simplifications are made.

This architecture is not optimal for your situation. Maybe a documented suboptimal architecture is more useful than an undocumented optimal one.

The goal

The components:

  • RSync server running on each instrument machine. (On Windows, use DeltaCopy.)
  • Harvester script which collates data from the RSync servers to a staging area.
  • Atom provider which serves the data in the staging area as an Atom feed.
  • MyTardis, which uses the Atom Ingest app to pull data from the Atom feed. It saves data to the MyTardis store and metadata to the MyTardis database.

Preparation

Things you need to find out:

  1. List of instruments you’ll be harvesting from. Typically there’s a point of diminishing returns: 10 instruments in total, of which 3 get most of the use, and numbers #9 and #10 are running DOS and are more trouble than they’re worth.
  2. For each instrument: what data format(s) are produced? Some “instruments” have several distinct sensors or detectors: a scanning electron microscope (producing .tif images) with add-on EDAX detector (producing .spt files), for instance. If there is no MyTardis filter for the given data format(s), then MyTardis will store the files without extracting any metadata, making them opaque to the user.

    Details collected for each instrument at RMIT’s microscopy facility.

  3. Discuss with the users how MyTardis’s concept of “dataset” and “experiment” will apply to the files they produce. If possible, arrange for them to save files into a file structure which makes this clear: /User_data/user123/experiment14/dataset16/file1.tif . Map the terms “dataset” and “experiment” as appropriate: perhaps “session” and “project”.
  4. Networking. Each instrument needs to be reachable on at least one port from the harvester.
  5. You’ll need a staging area. Somewhere big, and readily accessible.
  6. You’ll eventually need a permanent MyTardis store. Somewhere big, and backed up. (The Chef cookbooks actually assume storage inside the VM, at present.)
  7. You’ll eventually need a production-level MyTardis database. It may get very large if you have a lot of metadata (eg, Synchrotron data). (The Chef cookbooks install Postgres locally.)

RSync Server

We assume each PC is currently operating in “stand-alone” mode (users save data to a local hard disk) but is reachable by network. If your users have authenticated access to network storage, you’ll do something a bit different.

Install an Rsync server on each PC. For Windows machines, use DeltaCopy. It’s free. Estimate 5 minutes per machine. Make it auto-start, running in the background, all the time, serving up the user data directory on a given port.

Harvester script

The harvester script collates data from these different servers into the staging area. The end result is a directory with this kind of structure:

user_123
    TiO2
        2012-03-14 exp1
            run1.tif
            run2.tif
            edax.spc
        2012-03-15 exp1
            anotherrun.tif
            highmag.tif
user_456
...

You will need to customise the scripts for your installation.

  1. Create  a staging area directory somewhere. Probably it will be network storage mounted by NFS or something.
  2. sudo mkdir /opt/harvester
  3. Create a user to run the scripts, “harvester”. Harvested data will end up being owned by this user. Make sure that works for you.
  4. sudo chown harvester /opt/harvester
  5. sudo -su harvester bash
  6. git clone https://github.com/VeRSI-Australia/RMIT-MicroTardis-Harvest
  7. Review the scripts, understand what they do, make changes as appropriate.
  8. Create a Cron job to run the harvester at frequent intervals. For example:
    */5 * * * * /usr/local/microtardis/harvest/harvest.sh >/dev/null 2>&1
  9. Test.
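
At its core, the harvesting step is just an rsync pull per instrument into the staging area. Here’s an illustrative sketch of that idea in Python; it is not the actual script from the repository above, and the instrument hostnames, module names and paths are hypothetical:

    import subprocess

    # Hypothetical instrument PCs running rsync/DeltaCopy servers.
    INSTRUMENTS = {
        "sem1": "rsync://sem1-pc.example.edu:873/userdata/",
        "edax1": "rsync://edax1-pc.example.edu:873/userdata/",
    }
    STAGING = "/mnt/staging"

    for name, source in INSTRUMENTS.items():
        # -a preserves the directory structure and timestamps;
        # --ignore-existing avoids re-transferring already-harvested files.
        subprocess.run(
            ["rsync", "-a", "--ignore-existing", source, f"{STAGING}/{name}/"],
            check=True)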

Atom provider

My boldest assumption yet: you will use a dedicated VM simply to serve an Atom feed from the staging area.

Since you already know how to use Chef, use the Chef Cookbook for the Atom Dataset Provider to provision this VM from scratch in one hit.

  1. git clone https://github.com/stevage/atom-provider-chef
  2. Modify cookbooks/atom-dataset-provider/files/default/feed.atom as appropriate. See the documentation at https://github.com/stevage/atom-dataset-provider.
  3. Modify environments/dev.rb and environments/prod.rb as appropriate.
  4. Upload it all to your Chef server:
    knife cookbook upload --all
    knife environment from file environments/*
    knife role from file roles/*
  5. Bootstrap your VM. It works first try.
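
Once the VM is up, it’s worth checking that the feed is actually being served. A quick sanity check, with a hypothetical hostname and port standing in for your provider’s address:

    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"
    FEED_URL = "http://atom-provider.example.edu:4000/"  # hypothetical address

    with urllib.request.urlopen(FEED_URL) as response:
        feed = ET.parse(response).getroot()

    # Print the title and timestamp of each dataset entry in the feed.
    for entry in feed.findall(ATOM + "entry"):
        print(entry.findtext(ATOM + "title"), entry.findtext(ATOM + "updated"))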

MyTardis

You’re doing great. Now let’s install MyTardis. You’ll need another VM for that. Your local ITS department can probably deliver one of those in a couple of hours, 6 months tops. If possible (ie, if no requirement to host data on-site), use the Nectar Research Cloud.

You’ll use Chef again.

  1. git clone https://github.com/mytardis/mytardis-chef
  2. Follow the instructions there.
  3. Or, follow the instructions here: Getting started with Chef on the NeCTAR Research Cloud.
  4. Currently, you need to manually configure the Atom harvester to harvest from your provider. Instructions for that are on Github.

MyTardis is now running. For extra credit, you will want to:

  • Develop or customise filters to extract useful metadata (a sketch of the extraction step follows this list).
  • Add features that your particular researchers would find helpful. (On-the-fly CSV generation from binary spectrum files was a killer feature for some of our users.)
  • Extend harvesting to further instruments, and more kinds of data.
  • Integrate this system into the local research data management framework. Chat to your friendly local librarian. Ponder issues like how long to keep data, how to back it up, and what to do when users leave the institution.
  • Integrate into authentication systems: your local LDAP, or perhaps the AAF.
  • Find more sources of metadata: booking systems, grant applications, research management systems…
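
On the first of those points: MyTardis filters are Python, and the step at their core is often just reading instrument file headers. A rough sketch of that extraction step for a .tif, using Pillow; this is not the MyTardis filter API itself, only the metadata-gathering part a filter would wrap:

    from PIL import Image
    from PIL.TiffTags import TAGS

    def extract_tiff_metadata(path):
        """Return TIFF tag names and values as a dict of candidate metadata."""
        with Image.open(path) as img:
            return {TAGS.get(tag, str(tag)): value
                    for tag, value in img.tag_v2.items()}

    # e.g. extract_tiff_metadata("run1.tif") might yield image dimensions,
    # acquisition software, and instrument-specific tags.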

Discussion

After developing this pattern for RMMF, it seems useful to apply it again. And if we (VeRSI, RMIT’s Ian Thomas, Monash’s Steve Androulakis, etc) can do it, perhaps you’ll find it useful too. I was inspired by Peter Sefton’s UWS blog post Connecting Data Capture applications to Research Data Catalogues/Registries/Stores (which raises the idea of design patterns for research data harvesting).

Some good things about this pattern:

  • It’s pretty flexible. Using separate components means individual parts can be substituted out. Perhaps you have another way of producing an Atom feed. Or maybe users save their data directly to a network share, eliminating the need for the harvester scripts.
  • By using simple network transfers (eg, HTTP), it suits a range of possible network architectures (eg, internal staging area, but MyTardis on the cloud; or everything on one VM…)
  • Work can be divided up: one party can work on harvesting from the instruments, while another configures MyTardis.
  • You can archive the staging area directly if you want a raw copy of all data, especially during the testing phase.

Some bad things:

  • It’s probably a bit too complicated – perhaps one day there will be a MyTardis ingest from RSync servers directly.
  • Since the Atom provider is written in NodeJS, it adds another set of technologies to become familiar with for troubleshooting.
  • There are lots of places where things can go wrong, and there is no monitoring layer (eg, Nagios).

How OData will solve data sharing and reuse for e-Research

OData’s shiny logo.

Ok, now that I have your attention, let me start backpedalling. In the Australian e-Research sector, I’ve worked on problems of research data management (getting researchers’ data into some kind of database) and registering (recording the existence of that data somewhere else, like an institutional store, or Research Data Australia). There hasn’t been a huge amount of discussion about the mechanics of reusing data: the discussion has mostly been “help researcher A find researcher B’s data, then they’ll talk”.

Now, I think the mechanics of how that data exchange is going to take place will become more important. We’re seeing the need for data exchange between instances of MyTardis (eg, from a facility to a researcher’s home institution) and we will almost certainly see demand for exchange between heterogeneous systems (eg, MyTardis to/from XNAT). And as soon as data being published becomes the norm, consumers of that data will expect it to be available in easy to use formats, with standard protocols. They will expect to be able to easily get data from one system into another, particularly once those two systems are hosted in the cloud.

Until this week, I didn’t know of a solution to these problems. There is the Semantic Web aka RDF aka Linked Data, but the barrier to entry is high. I know: I tried and failed. Then I heard about OData from Conal Tuohy.

What’s OData?

OData is:

  • An application-level data publishing and editing protocol that is intended to become “the language of data sharing”
  • A standardised, conventionalised application of AtomPub, JSON, and REST.
  • A technology built at Microsoft (as “Astoria”), but now released as an open standard, under the “We won’t sue you for using it” Open Specification Promise.

You have OData producers (or “services”) and consumers. A consumer could be: a desktop app that lets a user browse data from any OData site; a web application that mashes up data between several producers; another research data system that harvests its users’ data from other systems; a data aggregator that mirrors and archives data from a wide range of sites. Current examples of desktop apps include a plugin to Excel, a desktop data browser called LINQPad, a Silverlight (shudder) data explorer, high-powered business analytics tools for millionaires, and a 3.5-star script to import into SQL Server.

I already publish data though!

Perhaps you already have a set of webservices with names like “ListExperiments”, “GetExperimentById”, “GetAuthorByNLAID”. You’ve documented your API, and everyone loves it. Here’s why OData might be a smarter choice, either now, or for next time:

  • Data model discovery: instead of a WSDL documenting the *services*, you have a CSDL describing the *data model*. (Of course if your API provides services other than simple data retrieval or update, that’s different. Although actually OData 3.0 supports actions and functions.)
  • Any OData consumer can explore all your exposed data and download it, without needing any more information. No documentation required.

What do you need to do to provide OData?

To turn an existing data management system (say, PARADISEC’s catalogue of endangered language field recordings) into an OData producer, you bolt an Atom endpoint onto the side. Much like many systems have already done for RIF-CS, but instead of using the rather boutique, library-centric standards OAI-PMH and RIF-CS to describe collections, you use AtomPub and HTTP to provide individual data items.

Specifically, you need to do this:

  • Design an object model for the data that you want to expose. This could be a direct mapping of an underlying database, but it needn’t be.
  • Describe that object model in CSDL (Conceptual Schema Definition Language), a Microsofty schema serialised as EDMX, looking like this example.
  • Provide an HTTP endpoint which responds to GET requests. In particular:
    • it provides the EDMX file when “$metadata” is requested.
    • it provides an Atom file containing requested data, broken up into reasonably sized chunks. Optionally, it can provide this in JSON instead.
    • optionally, it supports a standard filtering mechanism, like “http://data.stackexchange.com/stackoverflow/atom/Badges?$filter=Id eq 92046“. (Ie, return all badges with Id equal to 92046)
  • For binary or large data, the items in the Atom feed should link rather than contain the raw data.
  • Optionally support the full CRUD range by responding to PUT, PATCH and DELETE requests.

There are libraries for a number of technologies, but if your technology doesn’t have an OData library, it probably has libraries for components like Atom, HTTP, JSON, and so on.
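
To make the “no documentation required” point concrete, here’s roughly what a consumer does, using the StackExchange endpoint cited above and Python’s requests library (assuming the endpoint still responds):

    import requests

    BASE = "http://data.stackexchange.com/stackoverflow/atom"

    # 1. Data model discovery: fetch the CSDL/EDMX description.
    model = requests.get(BASE + "/$metadata").text

    # 2. Standard filtering, asking for JSON rather than Atom.
    response = requests.get(BASE + "/Badges",
                            params={"$filter": "Id eq 92046",
                                    "$format": "json"})
    print(response.json())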

As an example, Tim Dettrick (UQ) and I have built an Atom dataset provider for MyTardis – before knowing about OData. To make it OData compliant probably requires:

  • modifying the feed to use (even more) standard names
  • following the OData convention for URLs
  • following the OData convention for filtering (eg, it already supports page size specification, but not using the OData standard naming)
  • supporting the $metadata request to describe its object model
  • probably adding a few more functions required by the spec.

Having done that, this would then be a generic file system->OData server, useful for any similar data harvesting project.

So, why not Linked Data?

Linked Data aka RDF aka Semantic Web is another approach that has been around for many years. Both use HTTP with content negotiation, an entity-attribute-value data model (ie, equivalent to XML or JSON), stable URIs and serialise primarily in an XML format. Key differences (in my understanding):

  • The ultimate goal of Linked Data is building a rich semantic web of data, so that data from a wide variety of sources can be aggregated, cross-checked, mashed up etc in interesting ways. The goal of OData is less lofty: a standardised data publishing and consuming mechanism. (Which is probably why I think OData works better in the short term)
  • Linked Data suggests, but does not mandate, a triplestore as the underlying storage. OData suggests, but does not mandate, a traditional RDBMS. Pragmatically, table-row-column RDBMS are much easier to work with, and the skills are much  more widely available.
  • RDF/XML is hard to learn, because it has many unique concepts: ontologies, predicates, classes and so on. The W3C RDF primer shows the problem: you need to learn all this just in order to express your data. By contrast, OData is more pragmatic: you have a data model, so go and write that as Atom. In short, OData is conceptually simple, whereas RDF is conceptually rich but complex. [I’m not sure my comparison is entirely fair: most of my reading and experience of Linked Data comes from semantic web ideologues, who want you to go the whole nine yards with ontologies that align with other ontologies, proper URIs etc. It may be possible to quickly generate some bastardised RDF/XML with much less effort, at the risk of the community’s wrath.]
  • Linked Data is, unsurprisingly, much better at linking outside your own data source. In fact, I’m not sure how possible it is in OData. In some disciplines, such as the digital humanities and cultural space, linking is pretty crucial: combining knowledge across disciplines like literature, film, history etc. In hard sciences, this is not (yet) important: much data is generated by some experiment, stored, and analysed more or less in isolation from other sources of data.

More details in this great comparison matrix.

In any case, the debate is moot for several reasons:

  • Linked Data and OData will probably play nicely together if efforts to use schema.org ontologies are successful.
  • There is room for more than one standard.
  • In some spaces, OData is already gaining traction, while in others Linked Data is the way to go. This will affect which you choose. IMHO it is telling that StackExchange (a programmer-focused set of Q&A sites) publishes its data in OData: it’s a programmer-friendly technology.

Other candidates?

GData

It is highly implausible that Google will be beaten by Microsoft in such an important area as data sharing. And yet Google Data Protocol (aka GData) is not yet a contender: it is controlled entirely by Google, and limited in scope to accessing services provided by Google or by the Google App Engine. A cursory glance suggests that it is in fact very similar to OData anyway: AtomPub, REST, JSON, batch processing (different syntax), and a filtering/query syntax. I don’t see support for data model discovery, however: there’s no $metadata.

So, pros and cons?

Pros

  • Uses technologies you were probably going to use anyway.
  • A reasonable (but heavily MS-skewed) ecosystem.
  • Potentially avoids the cost of building data exploration and query tools, letting some smart client do that for you.
  • Driven by the private sector, meaning serious dollars invested in high quality tools.
  • Lower barrier to entry than Linked Data/RDF.

Cons

  • Driven by business and Microsoft, meaning those high quality tools are expensive and locked to MS technologies that we don’t use.
  • Future potential unclear.