Steve Bennett blogs

…about maps, open data, Git, and other tech.

Getting started with Chef on the NeCTAR Research Cloud

Opscode Chef is a powerful tool for automating the configuration of new servers, and indeed entire clouds, multi-server architectures etc. We now use it to deploy MyTardis. It’s quite daunting at first, so here’s a quick guide to the setup I use. You’ll probably need to refer to more complete tutorials to cover gaps (eg Opscode’s Fast Start Guide, the MyTardis Chef documentation and this random other one). We’ll use installing MyTardis as the goal.

  1. Sign up to Hosted Chef. That gives you a place where Chef cookbooks, environments, roles etc will live.
  2. Install Chef (client) locally.
  3. Get the MyTardis Chef cookbook:
    cd ~/mytardis
    git clone https://github.com/mytardis/mytardis-chef
  4. Make a directory, say ~/chef, and download the “…-validator.pem” and knife.rb files that Opscode gives you into it. The knife.rb file contains settings for Knife, like directories. Modify it so it points to ~/mytardis/mytardis-chef. Mine looks like this:
    current_dir = File.dirname(__FILE__)
    log_level :info
    log_location STDOUT
    node_name "stevage"
    client_key "#{current_dir}/stevage.pem"
    validation_client_name "versi-validator"
    validation_key "#{current_dir}/versi-validator.pem"
    chef_server_url "https://api.opscode.com/organizations/versi"
    cache_type 'BasicFile'
    cache_options( :path => "#{ENV['HOME']}/.chef/checksums" )
    cookbook_path ["#{current_dir}/../mytardis/mytardis-chef/site-cookbooks", "#{current_dir}/../mytardis/mytardis-chef/cookbooks"]
  5. You now need to upload these cookbooks and assorted stuff to Chef Server. This part of Chef is really dumb. (Why isn’t there a single command to do this – or why can’t knife automatically synchronise either a local directory tree or git repo with the Chef server. Why is one command ‘upload’ and another ‘from file’…? Why are cookbooks looked for in the ‘cookbooks path’ but roles are sought under ./roles/?)
    knife cookbook upload -a
    knife role from file ../mytardis/mytardis-chef/roles/mytardis.json
  6. Get a NeCTAR Research Cloud VM. Launch an instance of “Ubuntu 12.04 LTS (Precise) amd64 UEC”. Call it “mytardisdemo”, give it your NeCTAR public key, and do what you have to (security groups…) to open port 80.
  7. Download your NeCTAR private key to your ~/chef directory.
  8. Now, to make your life a bit easier, here are three scripts that I’ve made to simplify working with Knife.

    bootstrap.sh:

    #!/bin/bash -x
    # Bootstraps Chef on the remote host, then runs its runlist.
    source ./settings$1.sh
    knife bootstrap $CHEF_IP -x $CHEF_ACCOUNT --no-host-key-verify -i nectar.pem --sudo $CHEF_BOOTSTRAP_ARGS

    update.sh:

    #!/bin/bash -x
    # Runs Chef-client on the remote host, to run the latest version of any applicable cookbooks.
    source ./settings$1.sh
    knife ssh name:$CHEF_HOST -a ipaddress -x $CHEF_ACCOUNT -i nectar.pem -o "VerifyHostKeyDNS no" "sudo chef-client"

    connect.sh:

    #!/bin/bash -x
    # Opens an SSH session to the remote host.
    source ./settings$1.sh
    ssh $CHEF_ACCOUNT@$CHEF_IP -i nectar.pem -o "VerifyHostKeyDNS no"

    (We disable host key checking because otherwise SSH baulks every time you allocate a new VM with an IP that it’s seen before.)

    With these, you can have several “settings…sh” files, each one describing a different remote host. (Or just a single settings.sh.) So, create a “settings-mytardisdemo.sh” file, like this:

    # Settings for Ubuntu Precise NeCTAR VM running mytardis as a demo
    export CHEF_HOST=mytardisdemo
    export CHEF_IP=115.146.94.53
    export CHEF_ACCOUNT=ubuntu
    export CHEF_BOOTSTRAP_ARGS="-d ubuntu12.04-gems -r 'role[mytardis]'"
  9. Now you have it all in place: A chef server with the cookbook you’re going to run, a server to run it on, and the local tools to do it with. So, bootstrap your instance:
    ./bootstrap.sh -mytardisdemo

    If all goes well (it won’t), that will churn away for 40 minutes or so, then MyTardis will be up and running.

  10. If you change a cookbook or something (don’t forget to re-upload) and want to run chef again:
    ./update.sh -mytardisdemo
  11. And to SSH into your box:
    ./connect.sh -mytardisdemo
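
As a final sanity check, knife can also query Chef Server directly, to confirm the node registered and ran its runlist. Something like this should do it (the node name is whatever shows up in the node list; it may well be the VM’s hostname rather than “mytardisdemo”):

    knife node list
    knife node show mytardisdemo   # substitute the name that "knife node list" printed
    knife status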

How OData will solve data sharing and reuse for e-Research

OData’s shiny logo.

Ok, now that I have your attention, let me start backpedalling. In the Australian e-Research sector, I’ve worked on problems of research data management (getting researchers’ data into some kind of database) and registering (recording the existence of that data somewhere else, like an institutional store, or Research Data Australia). There hasn’t been a huge amount of discussion about the mechanics of reusing data: the discussion has mostly been “help researcher A find researcher B’s data, then they’ll talk”.

Now, I think the mechanics of how that data exchange is going to take place will become more important. We’re seeing the need for data exchange between instances of MyTardis (eg, from a facility to a researcher’s home institution) and we will almost certainly see demand for exchange between heterogeneous systems (eg, MyTardis to/from XNAT). And as soon as data being published becomes the norm, consumers of that data will expect it to be available in easy to use formats, with standard protocols. They will expect to be able to easily get data from one system into another, particularly once those two systems are hosted in the cloud.

Until this week, I didn’t know of a solution to these problems. There is the Semantic Web aka RDF aka Linked Data, but the barrier to entry is high. I know: I tried and failed. Then I heard about OData from Conal Tuohy.

What’s OData?

OData is:

  • An application-level data publishing and editing protocol that is intended to become “the language of data sharing”
  • A standardised, conventionalised application of AtomPub, JSON, and REST.
  • A technology built at Microsoft (as “Astoria”), but now released as an open standard, under the “We won’t sue you for using it” Open Specification Promise.

You have OData producers (or “services”) and consumers. A consumer could be: a desktop app that lets a user browse data from any OData site; a web application that mashes up data between several producers; another research data system that harvests its users’ data from other systems; a data aggregator that mirrors and archives data from a wide range of sites. Current examples of desktop apps include a plugin to Excel, a desktop data browser called LINQPad, a Silverlight (shudder) data explorer, high-powered business analytics tools for millionaires, and a 3.5-star script to import into SQL Server.

I already publish data though!

Perhaps you already have a set of webservices with names like “ListExperiments”, “GetExperimentById”, “GetAuthorByNLAID”. You’ve documented your API, and everyone loves it. Here’s why OData might be a smarter choice, either now, or for next time:

  • Data model discovery: instead of a WSDL documenting the *services*, you have a CSDL describing the *data model*. (Of course if your API provides services other than simple data retrieval or update, that’s different. Although actually OData 3.0 supports actions and functions.)
  • Any OData consumer can explore all your exposed data and download it, without needing any more information. No documentation required.

What do you need to do to provide OData?

To turn an existing data management system (say, PARADISEC’s catalogue of endangered language field recordings) into an OData producer, you bolt an Atom endpoint onto the side. Much like many systems have already done for RIF-CS, but instead of using the rather boutique, library-centric standards OAI-PMH and RIF-CS to describe collections, you use AtomPub and HTTP to provide individual data items.

Specifically, you need to do this (there’s a quick curl sketch after the list):

  • Design an object model for the data that you want to expose. This could be a direct mapping of an underlying database, but it needn’t be.
  • Describe that object model in CSDL (Conceptual Schema Definition Language), a Microsofty schema serialised as EDMX, looking like this example.
  • Provide an HTTP endpoint which responds to GET requests. In particular:
    • it provides the EDMX file when “$metadata” is requested.
    • it provides an Atom file containing requested data, broken up into reasonably sized chunks. Optionally, it can provide this in JSON instead.
    • optionally, it supports a standard filtering mechanism, like “http://data.stackexchange.com/stackoverflow/atom/Badges?$filter=Id eq 92046” (ie, return all badges with Id equal to 92046).
  • For binary or large data, the items in the Atom feed should link rather than contain the raw data.
  • Optionally support the full CRUD range by responding to PUT, PATCH and DELETE requests.
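
As a rough sketch of what consuming such an endpoint looks like, here’s the StackExchange example from above, driven by nothing fancier than curl (assuming that endpoint is still live):

# fetch the service's data model (CSDL, wrapped in EDMX)
curl 'http://data.stackexchange.com/stackoverflow/atom/$metadata'
# fetch one collection, filtered; -G and --data-urlencode take care of the spaces in the filter expression
curl -G 'http://data.stackexchange.com/stackoverflow/atom/Badges' --data-urlencode '$filter=Id eq 92046'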

There are libraries for a number of technologies, but if your technology doesn’t have an OData library, it probably has libraries for components like Atom, HTTP, JSON, and so on.

As an example, Tim Dettrick (UQ) and I have built an Atom dataset provider for MyTardis – before knowing about OData. To make it OData compliant probably requires:

  • modifying the feed to use (even more) standard names
  • following the OData convention for URLs
  • following the OData convention for filtering (eg, it already supports page size specification, but not using the OData standard naming)
  • supporting the $metadata request to describe its object model
  • probably adding a few more functions required by the spec.

Having done that, this would then be a generic file system -> OData server, useful for any similar data harvesting project.

So, why not Linked Data?

Linked Data aka RDF aka Semantic Web is another approach that has been around for many years. Both use HTTP with content negotiation, an entity-attribute-value data model (ie, equivalent to XML or JSON), stable URIs and serialise primarily in an XML format. Key differences (in my understanding):

  • The ultimate goal of Linked Data is building a rich semantic web of data, so that data from a wide variety of sources can be aggregated, cross-checked, mashed up etc in interesting ways. The goal of OData is less lofty: a standardised data publishing and consuming mechanism. (Which is probably why I think OData works better in the short term)
  • Linked Data suggests, but does not mandate, a triplestore as the underlying storage. OData suggests, but does not mandate, a traditional RDBMS. Pragmatically, table-row-column RDBMS are much easier to work with, and the skills are much more widely available.
  • RDF/XML is hard to learn, because it has many unique concepts: ontologies, predicates, classes and so on. The W3C RDF primer shows the problem: you need to learn all this just in order to express your data. By contrast, OData is more pragmatic: you have a data model, so go and write that as Atom. In short, OData is conceptually simple, whereas RDF is conceptually rich but complex. [I’m not sure my comparison is entirely fair: most of my reading and experience of Linked Data comes from semantic web ideologues, who want you to go the whole nine yards with ontologies that align with other ontologies, proper URIs etc. It may be possible to quickly generate some bastardised RDF/XML with much less effort – at the risk of the community’s wrath.]
  • Linked Data is, unsurprisingly, much better at linking outside your own data source. In fact, I’m not sure how possible it is in OData. In some disciplines, such as the digital humanities and cultural space, linking is pretty crucial: combining knowledge across disciplines like literature, film, history etc. In hard sciences, this is not (yet) important: much data is generated by some experiment, stored, and analysed more or less in isolation from other sources of data.

More details in this great comparison matrix.

In any case, the debate is moot for several reasons:
  • Linked Data and OData will probably play nicely together if efforts to use schema.org ontologies are successful.
  • There is room for more than one standard.
  • In some spaces, OData is already gaining traction, while in others Linked Data is the way to go. This will affect which you choose. IMHO it is telling that StackExchange (a programmer-focused set of Q&A sites) publishes its data in OData: it’s a programmer-friendly technology.

Other candidates?

GData

It is highly implausible that Google will be beaten by Microsoft in such an important area as data sharing. And yet Google Data Protocol (aka GData) is not yet a contender: it is controlled entirely by Google, and limited in scope to accessing services provided by Google or by the Google App Engine. A cursory glance suggests that it is in fact very similar to OData anyway: AtomPub, REST, JSON, batch processing (different syntax), a filtering/query syntax. I don’t see support for data model discovery, however: there’s no $metadata.

So, pros and cons?

Pros

  • Uses technologies you were probably going to use anyway.
  • A reasonable (but heavily MS-skewed) ecosystem
  • Potentially avoids the cost of building data exploration and query tools, letting some smart client do that for you.
  • Driven by the private sector, meaning serious dollars invested in high quality tools.

Cons

  • Driven by business and Microsoft, meaning those high quality tools are expensive and locked to MS technologies that we don’t use.
  • Future potential unclear
  • Lower barrier to entry than linked data/RDF

10 things I hate about Git

Git is the source code version control system that is rapidly becoming the standard for open source projects. It has a powerful distributed model which allows advanced users to do tricky things with branches, and rewriting history. What a pity that it’s so hard to learn, has such an unpleasant command line interface, and treats its users with such utter contempt.

1. Complex information model

The information model is complicated – and you need to know all of it. As a point of reference, consider Subversion: you have files, a working directory, a repository, versions, branches, and tags. That’s pretty much everything you need to know. In fact, branches are tags, and files you already know about, so you really need to learn three new things. Versions are linear, with the odd merge. Now Git: you have files, a working tree, an index, a local repository, a remote repository, remotes (pointers to remote repositories), commits, treeishes (pointers to commits), branches, a stash… and you need to know all of it.

2. Crazy command line syntax

The command line syntax is completely arbitrary and inconsistent. Some “shortcuts” are graced with top level commands: “git pull” is exactly equivalent to “git fetch” followed by “git merge”. But the shortcut for “git branch” combined with “git checkout”? “git checkout -b”. Specifying filenames completely changes the semantics of some commands (“git commit” ignores local, unstaged changes in foo.txt; “git commit foo.txt” doesn’t). The various options of “git reset” do completely different things.
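
To make those equivalences concrete, here’s roughly how they play out at the prompt:

# "git pull" is, more or less, a fetch followed by a merge
git pull origin master
git fetch origin && git merge origin/master

# ...but the branch-then-switch shortcut hides under checkout
git branch newfeature && git checkout newfeature
git checkout -b newfeature

# ...and naming a file changes what gets committed
git commit           # commits whatever is staged; unstaged edits to foo.txt are ignored
git commit foo.txt   # commits foo.txt's changes, staged or not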

The most spectacular example of this is the command “git am”, which as far as I can tell, is something Linus hacked up and forced into the main codebase to solve a problem he was having one night. It combines email reading with patch applying, and thus uses a different patch syntax (specifically, one with email headers at the top).

3. Crappy documentation

The man pages are one almighty “fuck you”. They describe the commands from the perspective of a computer scientist, not a user. Case in point:

git-push – Update remote refs along with associated objects

Here’s a description for humans: git-push – Upload changes from your local repository into a remote repository

Update, another example (thanks cgd):

git-rebase – Forward-port local commits to the updated upstream head

Translation: git-rebase – Sequentially regenerate a series of commits so they can be applied directly to the head node

4. Information model sprawl

Remember the complicated information model in step 1? It keeps growing, like a cancer. Keep using Git, and more concepts will occasionally drop out of the sky: refs, tags, the reflog, fast-forward commits, detached head state (!), remote branches, tracking, namespaces…

5. Leaky abstraction

Git doesn’t so much have a leaky abstraction as no abstraction. There is essentially no distinction between implementation detail and user interface. It’s understandable that an advanced user might need to know a little about how features are implemented, to grasp subtleties about various commands. But even beginners are quickly confronted with hideous internal details. In theory, there is the “plumbing” and “the porcelain” – but you’d better be a plumber to know how to work the porcelain.
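
For anyone who hasn’t hit that distinction yet, here’s a small taste; the porcelain is what you’re meant to type, and the plumbing is what it’s built on (and what error messages and Stack Overflow answers drag you into almost immediately):

# porcelain: the user-facing view of the latest commit
git log --oneline -1
# plumbing: the same commit, as Git actually stores it
git rev-parse HEAD
git cat-file -p HEAD
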
A common response I get to complaints about Git’s command line complexity is that “you don’t need to use all those commands, you can use it like Subversion if that’s what you really want”. Rubbish. That’s like telling an old granny that the freeway isn’t scary, she can drive at 20kph in the left lane if she wants. Git doesn’t provide any useful subsets – every command soon requires another; even simple actions often require complex actions to undo or refine.

Here was the (well-intentioned!) advice from a GitHub maintainer of a project I’m working on (with apologies!):
  1. Find the merge base between your branch and master: ‘git merge-base master yourbranch’
  2. Assuming you’ve already committed your changes, rebase your commit onto the merge base, then create a new branch:
  3. git rebase --onto <basecommit> HEAD~1 HEAD
  4. git checkout -b my-new-branch
  5. Checkout your ruggedisation branch, and remove the commit you just rebased: ‘git reset --hard HEAD~1’
  6. Merge your new branch back into ruggedisation: ‘git merge my-new-branch’
  7. Checkout master (‘git checkout master’), merge your new branch in (‘git merge my-new-branch’), and check it works when merged, then remove the merge (‘git reset --hard HEAD~1’).
  8. Push your new branch (‘git push origin my-new-branch’) and log a pull request.
Translation: “It’s easy, Granny. Just rev to 6000, dump the clutch, and use wheel spin to get round the first corner. Up to third, then trail brake onto the freeway, late apexing but watch the marbles on the inside. Hard up to fifth, then handbrake turn to make the exit.”

6. Power for the maintainer, at the expense of the contributor

Most of the power of Git is aimed squarely at maintainers of codebases: people who have to merge contributions from a wide number of different sources, or who have to ensure a number of parallel development efforts result in a single, coherent, stable release. This is good. But the majority of Git users are not in this situation: they simply write code, often on a single branch for months at a time. Git is a 4 handle, dual boiler espresso machine – when all they need is instant.

Interestingly, I don’t think this trade-off is inherent in Git’s  design. It’s simply the result of ignoring the needs of normal users, and confusing architecture with interface. “Git is good” is true if speaking of architecture – but false of user interface. Someone could quite conceivably write an improved interface (easygit is a start) that hides unhelpful complexity such as the index and the local repository.

7. Unsafe version control

The fundamental promise of any version control system is this: “Once you put your precious source code in here, it’s safe. You can make any changes you like, and you can always get it back”. Git breaks this promise. Several ways a committer can irrevocably destroy the contents of a repository:

  1. git add . / … / git push -f origin master
  2. git push origin +master
  3. git rebase -i <some commit that has already been pushed and worked from> / git push

8. Burden of VCS maintenance pushed to contributors

In the traditional open source project, only one person had to deal with the complexities of branches and merges: the maintainer. Everyone else only had to update, commit, update, commit, update, commit… Git dumps the burden of understanding complex version control on everyone – while making the maintainer’s job easier. Why would you do this to new contributors – those with nothing invested in the project, and every incentive to throw their hands up and leave?

9. Git history is a bunch of lies

The primary output of development work should be source code. Is a well-maintained history really such an important by-product? Most of the arguments for rebase, in particular, rely on aesthetic judgments about “messy merges” in the history, or “unreadable logs”. So rebase encourages you to lie in order to provide other developers with a “clean”, “uncluttered” history. Surely the correct solution is a better log output that can filter out these unwanted merges.
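
For what it’s worth, the stock log command can already do a little of that filtering, which rather supports the point that rewriting history isn’t the only way to get a readable log:

git log --no-merges     # hide merge commits from the log
git log --first-parent  # follow only the main line through each merge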

10. Simple tasks need so many commands

The point of working on an open source project is to make some changes, then share them with the world. In Subversion, this looks like:
  1. Make some changes
  2. svn commit

If your changes involve creating new files, there’s a tricky extra step:

  1. Make some changes
  2. svn add
  3. svn commit

For a Github-hosted project, the following is basically the bare minimum:

  1. Make some changes
  2. git add [not to be confused with svn add]
  3. git commit
  4. git push
  5. Your changes are still only halfway there. Now log in to Github, find your commit, and issue a “pull request” so that someone downstream can merge it.

In reality though, the maintainer of that Github-hosted project will probably prefer your changes to be on feature branches. They’ll ask you to work like this:

  1. git checkout master [to make sure each new feature starts from the baseline]
  2. git checkout -b newfeature
  3. Make some changes
  4. git add [not to be confused with svn add]
  5. git commit
  6. git push
  7. Now log in to Github, switch to your newfeature branch, and issue a “pull request” so that the maintainer can merge it.

So, moving your changes from your local directory to the actual project repository takes: add, commit, push, “click pull request”, pull, merge, push. (I think.)

As an added bonus, here’s a diagram illustrating the commands a typical developer on a traditional Subversion project needed to know about to get their work done. This is the bread and butter of VCS: checking out a repository, committing changes, and getting updates.

“Bread and butter” commands and concepts needed to work with a remote Subversion repository.

And now here’s what you need to deal with for a typical Github-hosted project:

The “bread and butter” commands and concepts needed to work with a Github-hosted project.

If the power of Git is sophisticated branching and merging, then its weakness is the complexity of simple tasks.

Update (August 3, 2012)

This post has obviously struck a nerve, and gets a lot of traffic. Thought I’d address some of the most frequent comments.

  1. “The comparison between a Subversion repository with commit access and a Git repository without it isn’t fair.” True. But that’s been my experience: most SVN repositories I’ve seen have many committers – it works better that way. Git (or at least Github) repositories tend not to: you’re expected to submit pull requests, even after you reach the “trusted” stage. Perhaps someone else would like to do a fairer apples-to-apples comparison.
  2. “You’re just used to SVN.” There’s some truth to this, even though I haven’t done a huge amount of coding in SVN-based projects. Git’s commands and information model are still inherently difficult to learn, and the situation is not helped by using Subversion command names with different meanings (eg, “svn add” vs “git add”).
  3. “But my life is so much better with Git, why are you against it?” I’m not – I actually quite like the architecture and what it lets you do. You can be against a UI without being against the product.
  4. “But you only need a few basic commands to get by.” That hasn’t been my experience at all. You can just barely survive for a while with clone, add, commit, and checkout. But very soon you need rebase, push, pull, fetch, merge, status, log, and the annoyingly-commandless “pull request”. And before long, cherry-pick, reflog, etc etc…
  5. “Use Mercurial instead!” Sure, if you’re the lucky person who gets to choose the VCS used by your project.
  6. “Subversion has even worse problems!” Probably. This post is about Git’s deficiencies. Subversion’s own crappiness is no excuse.
  7. “As a programmer, it’s worth investing time learning your tools.” True, but beside the point. The point is, the tool is hard to learn and should be improved.
  8. “If you can’t understand it, you must be dumb.” True to an extent, but see the previous point.
  9. “There’s a flaw in point X.” You’re right. As of writing, over 80,000 people have viewed this post. Probably over 1000 have commented on it, on Reddit (530 comments), on Hacker News (250 comments), here (100 comments). All the many flaws, inaccuracies, mischaracterisations, generalisations and biases have been brought to light. If I’d known it would be so popular, I would have tried harder. Overall, the level of debate has actually been pretty good, so thank you all.

A few bonus command inconsistencies:

Reset/checkout

To reset one file in your working directory to its committed state:

git checkout file.txt

To reset every file in your working directory to its committed state:

git reset --hard

Remotes and branches

git checkout remotename/branchname
git pull remotename branchname

There’s another command where the separator is remotename:branchname, but I don’t recall right now.
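
(It’s probably the refspec form of git push, where the colon separates the local branch from the remote branch it should update:)

git push remotename localbranchname:remotebranchname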

Command options that are practically mandatory

And finally, a list of commands I’ve noticed which are almost useless without additional options.

  • git branch foo: creates a branch but does nothing with it. Useful version: git checkout -b foo, which creates the branch and switches to it.
  • git remote: shows the names of your remotes. Useful version: git remote -v, which shows names and URLs.
  • git stash: stores modifications to tracked files, then rolls them back. Useful version: git stash -u, which does the same for untracked files too.
  • git branch: lists the names of local branches. Useful version: git branch -av, which lists local and remote-tracking branches, with each one’s latest commit message.
  • git rebase: destroys history blindfolded. Useful version: git rebase -i, which lets you rewrite the upstream history of a branch, choosing which commits to keep, squash, or ditch.
  • git reset foo: unstages files. Useful versions: git reset --hard, which discards local modifications, and git reset --soft, which returns to another commit but doesn’t touch the working directory.
  • git add: does nothing except print a warning. Useful versions: git add ., which stages all local modifications/additions, and git add -A, which also stages deletions.

Update 2 (September 3, 2012)

A few interesting links:

Semantic Google keywords

I seem to have picked up the habit of using semantic keywords in my Google searches. I have no idea if Google actually supports any of them, but, if they don’t, they should!

  1. buy
    Example: buy usb extension cable
    Meaning: I’m specifically looking to purchase something, so don’t show me any site that doesn’t help me buy things, preferably online.
  2. review
    Example: Samsung galaxy note review
    Meaning: I want to read reviews about something. Don’t show any site that doesn’t have at least one review.
  3. experience
    Example: Samsung galaxy note experience
    Meaning: I’m interested in people’s personal experiences: blogs and informal reviews, not professional reviews from technophiles.
  4. get/download
    Example: get git
    Meaning: Take me to sites that either let me download Git, or will explain how I can do this.
  5. photo
    Example: photo Birrarung Marr
    Meaning: I want to see what Birrarung Marr looks like. Show me photos, pictures, whatever!
  6. how to
    Example: how to uninstall Crayon Physics
    Meaning: I’m looking for technical solutions, problem solving, troubleshooting type sites.
  7. compare
    Example: compare HDMI DLNA
    Meaning: I’m looking for information about the differences between these two things: massively prioritise pages that treat just these two things.
  8. alternatives
    Example: BeyondPod alternatives
    Meaning: I’m looking for things like BeyondPod, but different to it. Especially show me comparisons between it and its alternatives.
  9. why
    Example: why github
    Meaning: I want to know the benefits of GitHub. Prioritise sites that tell me specifically about its advantages, the problems that it solves – not how to use it, or general descriptions of it.

Improving on the “administration rights required” workflow

Consider an action like creating a new “space” in Confluence. Because of the visibility of this action, people want it restricted. Controlled. Managed. So Atlassian makes it only available to people with administrator rights. Which seems ok, until you realise that the workflow of the non-administrator ends up looking like this:

  1. …doing stuff…
  2. Decide to create new space
  3. Attempt to create space, discover that you need an administrator to do it
  4. Find out who the administrator is
  5. Ask them to do it
  6. Wait
And for the administrator it looks like this:
  1. …doing stuff…
  2. Receive request to make a new Confluence space.
  3. Confluence? What? I’m busy configuring a new VM here…oh, fine.
  4. Go log into Confluence, remember how to create a space
  5. Get back to work
And probably there will be some miscommunication about exactly what is required. Confluence spaces are a bit of a trivial example. In other instances, you need a lot of information to perform the administrative action, and if there’s a mistake, it again requires the administrator to fix it. Eventually the administrator gets sick of being bothered with such trivia, and the non-administrator gets sick of hassling them and ends up finding an alternative, or just living with a misconfigured thing.
For the user: frustration, blockage, helplessness
For the administrator: disruption, menial tasks

Alternative workflow #1: approval only

For the non-administrator:

  1. …doing stuff
  2. Decide to create a new Confluence space
  3. Enter the form, fill out all the details, press Ok.
  4. (Confluence sends an “Is this ok?” email to the administrator)
  5. Wait
For the administrator:
  1. …doing stuff
  2. Receive approval request.
  3. Since it’s from a fairly trustworthy user, and seems to make sense, click the link.
  4. Get back to work.
It’s better. The administrator now doesn’t need to know anything about the request. This workflow exists in some kinds of systems (CMSes particularly), but could be a lot more widespread. Ironically, although Jira is a workflow tool, it doesn’t actually have any workflows built in for administration.
For the user: blockage
For the administrator: disruption

Alternative workflow #2: private pending approval

Since creating a space has very limited potential for destruction if it’s hidden, how about this:

For the non-administrator:

  1. …doing stuff
  2. Decide to create Confluence space
  3. Fill out form
  4. Press Ok
  5. (Confluence sends email to administrator, creates space in “private” mode)
  6. Start working in new space. Do almost anything except collaborate with others in this space.
The benefit is clear: there is now no “waiting” step.
For the administrator, it’s the same as before:
  1. …doing stuff
  2. Receive approval request.
  3. Since it’s from a fairly trustworthy user, and seems to make sense, click the link.
    1. If more information is required, look at what the user has done in the new space, for a bit of context.
  4. Get back to work.
There is a mild benefit here, too: the administrator is no longer under as much pressure to immediately approve the thing, and can even see what has been done with the space.
For the user: flow maintained
For the administrator: very mild disruption

Alternative workflow #3: public until reverted

Most users aren’t destructive. And especially in professional environments, virtually all users can be trusted to act in good faith. (Managers strangely predict “chaos” if many users are authorised). So, let them do it:

  1. Decide to create Confluence space
  2. Create Confluence space
  3. Work in new space
For the administrator:
  1. Receive notification that a space has been created
  2. If it looks wrong, strange, inappropriate etc, discuss with the user, and possibly remove it.
  3. Otherwise, keep working
For the user: productivity, responsibility
For the administrator: productivity, trust

Why is buying stuff from eBay so complicated?

So you’ve found the thing you want, and you already have both an eBay account and a PayPal account. You don’t want to do anything complicated – send this item to me, and charge my (already stored) card appropriately.

  1. Click Buy it Now
  2. Sign in
  3. Click Commit to Buy
  4. Click Pay Now
  5. Click Continue
  6. Sign in again (Paypal this time)
  7. Click Continue (confirming payment type?)
  8. Click Confirm Payment
In other words, “I’d like to buy this. Yes. Yes. Yes. Yes. Yes.” Maybe there was more genius to Amazon’s one-click purchasing than I thought.

Introducing: Cooking for engineers

Here’s what I hate about recipes:

  1. They’re delivered as unstructured narratives.
  2. They mix identifying information (“carrots”) with process information (“thinly sliced”)
  3. They’re overly specific (3/4 of a teaspoon, does it matter?)
  4. They have too many ingredients, and you don’t know which ones you can leave out
  5. They always have one or two ingredients you wouldn’t have lying around. Shrimp paste?

I can’t fix all of that. But here goes:

Cooking for engineers: spicy cauliflower and almonds

"Cooking for engineers" version of a spicy cauliflower dish

A cauliflower dish I made up the other night, seen by an engineer.

Recipe for non-engineers

Here’s what a conventional version of that dish might look like:

Ingredients:

1/2 cauliflower, chopped

1 bok choi

1 tbsp coriander, chopped

1/2 cup slivered almonds

1 tbsp ground turmeric

1 tbsp mustard seeds

1-4 tsp chilli powder, to taste

sesame oil

Put 2cm of water in a saucepan and bring it to the boil. Simmer the cauliflower 2 minutes, then drain, discarding the water. Add half the sesame oil, and return the cauliflower to the saucepan.

Meanwhile, heat the rest of the sesame oil in a frying pan. When hot, add the slivered almonds and remove from the heat, stirring continuously. Once browned, add the almonds to the frying cauliflower, and add the spices. After a few minutes, add the bok choi. Stir frequently until the bok choi is soft, then serve with rice and roti bread.

Principles of cooking for engineers

Here are the rules:

  1. One column per preparation dish (saucepan, mixing bowl, tray…)
  2. Red arrows show cooking. Maybe thicker arrows for hotter.
  3. Ingredients start off to the sides. Name of thing in bold. Quantities as rough as appropriate, in parentheses.
  4. Thin blue arrows show transfer of stuff.
  5. Text between two black lines means “keep doing the thing above until this happens”
  6. Stuff you need to do (processes) in square boxes.
  7. Collapse stuff down wherever appropriate. It’s a communication tool, not an exhaustive process analysis.

Explaining step 7: the diagram above could have shown a colander as another column, with cauliflower transferred from the saucepan to the colander and back. But why would you do that?

Discussion

Normal recipe format works pretty well for most people, but I find it takes multiple readings before I can start. Effectively, I’m constructing this kind of flow chart in my head, working out what gets transferred from where, to where. I hope that a refined version of this diagramming methodology would allow you to confidently dive straight in, with no nasty surprises.

On the downside, diagrams are hard to manage. I realise I forgot an ingredient (crushed garlic and ginger paste), but it’s time-consuming to modify the diagram and re-upload.

And lastly, yes, it’s a bit facile to call this “cooking for engineers”. Probably real engineers want precise measurements, correct use of flow control symbols and so forth.

New Gmail feature: auto mailing list management

Ok, Gmail, you’re halfway there: you’ve got labels, and you can detect mailing lists (Google Groups ones, anyway). Now, take it a bit further:

  1. Automatically create a label for every group that I receive mail from. Don’t wait for me to do it.
  2. Make these labels more than just a label. Give them options like “skip the inbox” or “delete on sight”.
  3. Have a view which shows all mail from all mailing lists, with some smart options to make this even more useful.

For extra credit:

4. Detect other sources of regular mail which are not “mailing lists” as such. Newsletters from the bank. Quarterly updates from my alma mater. Treat them exactly the same.

Most of the features are already there, but it’s so tedious having to set up a special rule and label for every single list.

Penny Auctions – a bit of analysis

swoopo, bidray, bidstick (bids tick, apparently), bidrivals and dozens of others are running what we’ll call “penny auctions”. Using bidrivals.com as the example, they all work on the following principles:

  1. There are consumer electronics for auction, usually at big discounts.
  2. It costs a certain amount to make a bid, regardless of whether that bid is ultimately successful. For bidrivals.com, it’s 40 British pence.
  3. Every bid raises the price by a fixed amount. In this example, by 1c. It also extends the auction to last another 15 seconds or so.
  4. If you “win” the auction you must then buy the item at the final price.
  5. It’s not a lottery. Because they say so.

At first glance, the auction looks great – buy a phone for $20! Buy a plasma tv for $1.53!

But not so fast. A couple of things that are not obvious to the beginner:

  1. Every dollar of the final price represents $40 in bidding fees. A $1000 TV selling for $1 is a big loss for the site. The same TV selling for $25 is a small profit. Sold for $1000 it’s a $40,000 profit.
  2. You can use a site-provided bot (“bidbot”, “bidbutler”…) to bid on your behalf. If two people do this simultaneously, they’ll both lose a lot of money with no apparent gain.

So, is it a scam? Well, there are really two questions:

  1. If the site is running completely as described, legitimately, and not using shill bidders (bidding on their own auctions), is this an honest way to make a living – and should you participate?
  2. How do you know if a site is legitimate? Is it likely to be?

Is this honest?

I see very little to distinguish these penny auctions from gambling:

  • When you bid, whether you win or not depends entirely on whether anyone else bids in the next 15 seconds. Assuming you’re bidding on an item which is clearly a bargain (eg, $5 for a TV), then the normal considerations of auctions do not apply: any rational person would bid if they could do so for free.
  • The house take is enormous. Frighteningly so. For example, imagine the site sells items at an average 65% discount from RRP, and bids cost 40 times the amount they increase the price by. Winning a $100 item then costs the bidders (100-65)x40 = $1,400 in bidding fees, for a prize that is now worth only $65 to the winner (the $100 item minus the $35 final price): roughly $21.50 in fees for every dollar of value won (the arithmetic is spelled out below this list). By comparison, a skilled blackjack player in a casino can pay as little as $1.01 to win $1’s worth of value.
  • Most bids give no return to the bidder. This means that even if you don’t want to call it “gambling”, it should still be regulated, as the potential for dishonesty is great. You don’t want to be bidding for a dead donkey.
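
Spelling out the arithmetic from the example above (a $100 RRP item, a 65% average discount, and 1c bid increments costing 40c each):

    final price            = $100 x (100% - 65%)   = $35
    bids to get there      = $35 / 1c per bid      = 3,500 bids
    bidding fees collected = 3,500 x 40c           = $1,400
    winner's net gain      = $100 item - $35 paid  = $65
    fees per $1 of value   = $1,400 / $65          ≈ $21.50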

There is quite a bit of antipathy towards these sites: Washington Post, Jeff Atwood, Ed Oswald.

Can you trust them?

There are two main risks:

  • The site may use “shill bidders” to bid on items that would otherwise go for a low price. This could prevent you ever winning, or cause you to spend far more than you want.
  • Even if you “win”, the site may never ship. The whole thing could be a scam.

Fortunately, there are sites on the lookout for this kind of thing, such as pennyauctionwatch.com. There is evidence of dodgy sites, such as fake testimonials.

So, what are the incentives for a site to use shill bidding? Well, as we saw above, the difference between a $1000 item selling for $1 and selling for $25 doesn’t look like much, but it’s the difference between posting a big loss and breaking even. Imagine there is fairly steady bidding activity, but there are just a few gaps before that $25 mark. If the site could shill just a few times, they would massively increase their profitability.

But how much should they shill? Consider two strategies:

  1. bid whenever the time gets to 1 second; or
  2. bid immediately after anyone else bids.

In 1), the shill bids guarantee almost any asking price, as long as there is still some demand. This has the potential to greatly increase profit, and decrease variance.
In 2), half the bids end up being shill bids. This causes two problems: first, you’re directly losing one bid fee for every shill bid. Second, by inflating the price, you’re accelerating reaching the point at which people no longer want to bid, because the prize at stake is shrinking. So if people might normally bid strongly up to half the value of the item, then shilling along the way is just replacing paying bids with free ones. You might even decrease the final sale value, and every dollar of sale value lost is $40 of bidding fees lost.

Conclusion: shill bidding seems likely to occur, in small doses, because the incentive is just so strong.

Can you beat them?

Probably not. It’s been tried. To beat it:

  • You have to find a site that is not a complete scam.
  • You have to find a site that is completely honest. Even a little bit of shill bidding will crush you.
  • You have to defeat an absolutely incredible house take of 95% (remember, normal house take for gambling ranges from 1% to 5%).
  • You have to know enough about the auctions and your fellow users to make you fairly confident that no one will bid in the next 15 seconds. In the $1000 TV at $25 case (40c to bid), you need there to be a greater than 1/250 chance that you will win the auction with this one bid. Sound easy? Think: if that were the case, how did the price get to $25? It would stop, on average, at $2.50.

If it is beatable, you’d think people would have done it. And, since they’re capped at 4 wins per month generally, there would not be much harm in them sharing their secret. Unless they have a network of penny auction-beating bots. There’s a thought.

But just in case you wanted to try:

  • Compare sites. Find a safe one that appears to be losing money.
  • Collect lots of data. Try GreaseMonkey.
  • Find the right time of day, with the least competition.
  • Track all the auctions, pick individual moments and place bids.
  • Don’t try and win a specific auction. Bid any time your expectation on that bid is positive. The moment could pass.
  • Consider the effect of distractions. A good moment might be when several auctions are closing at the same time. You could even engineer that by bidding on several simultaneously.
  • Consider using several accounts to bid with, to drive off other bidders. If you know they’re paying attention and will react appropriately, that is. I’m thinking you bid with a group, gradually spacing their bids further apart and hoping you can sneak a 15 second gap through.

More reading:
