Steve Bennett blogs

…about maps, open data, Git, and other tech.

Getting started with Chef on the NeCTAR Research Cloud

Opscode Chef is a powerful tool for automating the configuration of new servers, and indeed entire clouds, multi-server architectures etc. We now use it to deploy MyTardis. It’s quite daunting at first, so here’s a quick guide to the setup I use. You’ll probably need to refer to more complete tutorials to cover gaps (eg Opscode’s Fast Start Guide, the MyTardis Chef documentation and this random other one). We’ll use installing MyTardis as the goal.

  1. Sign up to Hosted Chef. That gives you a place where Chef cookbooks, environments, roles etc will live.
  2. Install Chef (client) locally.
  3. Get the MyTardis Chef cookbook:
    cd ~/mytardis
    git clone https://github.com/mytardis/mytardis-chef
  4. Make a directory, say ~/chef, and download the “…-validator.pem” and knife.rb files that Opscode gives you into it. The knife.rb file contains settings for Knife, like directories. Modify it so it points to ~/mytardis/mytardis-chef. Mine looks like this:
    current_dir = File.dirname(__FILE__)
    log_level :info
    log_location STDOUT
    node_name "stevage"
    client_key "#{current_dir}/stevage.pem"
    validation_client_name "versi-validator"
    validation_key "#{current_dir}/versi-validator.pem"
    chef_server_url "https://api.opscode.com/organizations/versi"
    cache_type 'BasicFile'
    cache_options( :path => "#{ENV['HOME']}/.chef/checksums" )
    cookbook_path ["#{current_dir}/../mytardis/mytardis-chef/site-cookbooks", "#{current_dir}/../mytardis/mytardis-chef/cookbooks"]
  5. You now need to upload these cookbooks and assorted stuff to Chef Server. This part of Chef is really dumb. (Why isn’t there a single command to do this – or why can’t knife automatically synchronise either a local directory tree or git repo with the Chef server. Why is one command ‘upload’ and another ‘from file’…? Why are cookbooks looked for in the ‘cookbooks path’ but roles are sought under ./roles/?)
    knife cookbook upload -a
    knife role from file ../mytardis/mytardis-chef/roles/mytardis.json
  6. Get a NeCTAR Research Cloud VM. Launch an instance of “Ubuntu 12.04 LTS (Precise) amd64 UEC”. Call it “mytardisdemo”, give it your NeCTAR public key, and do what you have to (security groups…) to open port 80.
  7. Download your NeCTAR private key to your ~/chef directory.
  8. Now, to make your life a bit easier, here are three scripts that I’ve made to simplify working with Knife.

    bootstrap.sh:

    #!/bin/bash -x
    # Bootstraps Chef on the remote host, then runs its runlist.
    source ./settings$1.sh
    knife bootstrap $CHEF_IP -x $CHEF_ACCOUNT --no-host-key-verify -i nectar.pem --sudo $CHEF_BOOTSTRAP_ARGS

    update.sh:

    #!/bin/bash -x
    # Runs Chef-client on the remote host, to run the latest version of any applicable cookbooks.
    source ./settings$1.sh
    knife ssh name:$CHEF_HOST -a ipaddress -x $CHEF_ACCOUNT -i nectar.pem -o "VerifyHostKeyDNS no" "sudo chef-client"

    connect.sh:

    #!/bin/bash -x
    # Opens an SSH session to the remote host.
    source ./settings$1.sh
    ssh $CHEF_ACCOUNT@$CHEF_IP -i nectar.pem -o "VerifyHostKeyDNS no"

    (We disable host key checking because otherwise SSH baulks every time you allocate a new VM with an IP that it’s seen before.)

    With these, you can have several “settings…sh” files, each one describing a different remote host. (Or just a single settings.sh.) So, create a “settings-mytardisdemo.sh” file, like this:

    # Settings for Ubuntu Precise NeCTAR VM running mytardis as a demo
    export CHEF_HOST=mytardisdemo
    export CHEF_IP=115.146.94.53
    export CHEF_ACCOUNT=ubuntu
    export CHEF_BOOTSTRAP_ARGS="-d ubuntu12.04-gems -r 'role[mytardis]'"
  9. Now you have it all in place: A chef server with the cookbook you’re going to run, a server to run it on, and the local tools to do it with. So, bootstrap your instance:
    ./bootstrap.sh -mytardisdemo

    If all goes well (it won’t), that will churn away for 40 minutes or so, then MyTardis will be up and running.

  10. If you change a cookbook or something (don’t forget to re-upload) and want to run chef again:
    ./update.sh -mytardisdemo
  11. And to SSH into your box:
    ./connect.sh -mytardisdemo
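
As a final sanity check, knife can also query Chef Server directly, to confirm the node registered and ran its runlist. Something like this should do it (the node name is whatever shows up in the node list; it may well be the VM’s hostname rather than “mytardisdemo”):

    knife node list
    knife node show mytardisdemo   # substitute the name that "knife node list" printed
    knife status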

How OData will solve data sharing and reuse for e-Research

OData’s shiny logo.

Ok, now that I have your attention, let me start backpedalling. In the Australian e-Research sector, I’ve worked on problems of research data management (getting researchers’ data into some kind of database) and registering (recording the existence of that data somewhere else, like an institutional store, or Research Data Australia). There hasn’t been a huge amount of discussion about the mechanics of reusing data: the discussion has mostly been “help researcher A find researcher B’s data, then they’ll talk”.

Now, I think the mechanics of how that data exchange is going to take place will become more important. We’re seeing the need for data exchange between instances of MyTardis (eg, from a facility to a researcher’s home institution) and we will almost certainly see demand for exchange between heterogeneous systems (eg, MyTardis to/from XNAT). And as soon as data being published becomes the norm, consumers of that data will expect it to be available in easy to use formats, with standard protocols. They will expect to be able to easily get data from one system into another, particularly once those two systems are hosted in the cloud.

Until this week, I didn’t know of a solution to these problems. There is the Semantic Web aka RDF aka Linked Data, but the barrier to entry is high. I know: I tried and failed. Then I heard about OData from Conal Tuohy.

What’s OData?

OData is:

  • An application-level data publishing and editing protocol that is intended to become “the language of data sharing”
  • A standardised, conventionalised application of AtomPub, JSON, and REST.
  • A technology built at Microsoft (as “Astoria”), but now released as an open standard, under the “We won’t sue you for using it” Open Specification Promise.

You have OData producers (or “services”) and consumers. A consumer could be: a desktop app that lets a user browse data from any OData site; a web application that mashes up data between several producers; another research data system that harvests its users’ data from other systems; a data aggregator that mirrors and archives data from a wide range of sites. Current examples of desktop apps include a plugin to Excel, a desktop data browser called LINQPad, a Silverlight (shudder) data explorer, high-powered business analytics tools for millionaires, and a 3.5-star script to import into SQL Server.

I already publish data though!

Perhaps you already have a set of webservices with names like “ListExperiments”, “GetExperimentById”, “GetAuthorByNLAID”. You’ve documented your API, and everyone loves it. Here’s why OData might be a smarter choice, either now, or for next time:

  • Data model discovery: instead of a WSDL documenting the *services*, you have a CSDL describing the *data model*. (Of course if your API provides services other than simple data retrieval or update, that’s different. Although actually OData 3.0 supports actions and functions.)
  • Any OData consumer can explore all your exposed data and download it, without needing any more information. No documentation required.

What do you need to do to provide OData?

To turn an existing data management system (say, PARADISEC’s catalogue of endangered language field recordings) into an OData producer, you bolt an Atom endpoint onto the side. Much like many systems have already done for RIF-CS, but instead of using the rather boutique, library-centric standards OAI-PMH and RIF-CS to describe collections, you use AtomPub and HTTP to provide individual data items.

Specifically, you need to do this (there’s a quick curl sketch after the list):

  • Design an object model for the data that you want to expose. This could be a direct mapping of an underlying database, but it needn’t be.
  • Describe that object model in CSDL (Conceptual Schema Definition Language), a Microsofty schema serialised as EDMX, looking like this example.
  • Provide an HTTP endpoint which responds to GET requests. In particular:
    • it provides the EDMX file when “$metadata” is requested.
    • it provides an Atom file containing requested data, broken up into reasonably sized chunks. Optionally, it can provide this in JSON instead.
    • optionally, it supports a standard filtering mechanism, like “http://data.stackexchange.com/stackoverflow/atom/Badges?$filter=Id eq 92046” (ie, return all badges with Id equal to 92046).
  • For binary or large data, the items in the Atom feed should link rather than contain the raw data.
  • Optionally support the full CRUD range by responding to PUT, PATCH and DELETE requests.
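
As a rough sketch of what consuming such an endpoint looks like, here’s the StackExchange example from above, driven by nothing fancier than curl (assuming that endpoint is still live):

# fetch the service's data model (CSDL, wrapped in EDMX)
curl 'http://data.stackexchange.com/stackoverflow/atom/$metadata'
# fetch one collection, filtered; -G and --data-urlencode take care of the spaces in the filter expression
curl -G 'http://data.stackexchange.com/stackoverflow/atom/Badges' --data-urlencode '$filter=Id eq 92046'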

There are libraries for a number of technologies, but if your technology doesn’t have an OData library, it probably has libraries for components like Atom, HTTP, JSON, and so on.

As an example, Tim Dettrick (UQ) and I have built an Atom dataset provider for MyTardis – before knowing about OData. To make it OData compliant probably requires:

  • modifying the feed to use (even more) standard names
  • following the OData convention for URLs
  • following the OData convention for filtering (eg, it already supports page size specification, but not using the OData standard naming)
  • supporting the $metadata request to describe its object model
  • probably adding a few more functions required by the spec.

Having done that, this would then be a generic file system -> OData server, useful for any similar data harvesting project.

So, why not Linked Data?

Linked Data aka RDF aka Semantic Web is another approach that has been around for many years. Both use HTTP with content negotiation, an entity-attribute-value data model (ie, equivalent to XML or JSON), stable URIs and serialise primarily in an XML format. Key differences (in my understanding):

  • The ultimate goal of Linked Data is building a rich semantic web of data, so that data from a wide variety of sources can be aggregated, cross-checked, mashed up etc in interesting ways. The goal of OData is less lofty: a standardised data publishing and consuming mechanism. (Which is probably why I think OData works better in the short term)
  • Linked Data suggests, but does not mandate, a triplestore as the underlying storage. OData suggests, but does not mandate, a traditional RDBMS. Pragmatically, table-row-column RDBMS are much easier to work with, and the skills are much more widely available.
  • RDF/XML is hard to learn, because it has many unique concepts: ontologies, predicates, classes and so on. The W3C RDF primer shows the problem: you need to learn all this just in order to express your data. By contrast, OData is more pragmatic: you have a data model, so go and write that as Atom. In short, OData is conceptually simple, whereas RDF is conceptually rich but complex. [I’m not sure my comparison is entirely fair: most of my reading and experience of Linked Data comes from semantic web ideologues, who want you to go the whole nine yards with ontologies that align with other ontologies, proper URIs etc. It may be possible to quickly generate some bastardised RDF/XML with much less effort – at the risk of the community’s wrath.]
  • Linked Data is, unsurprisingly, much better at linking outside your own data source. In fact, I’m not sure how possible it is in OData. In some disciplines, such as the digital humanities and cultural space, linking is pretty crucial: combining knowledge across disciplines like literature, film, history etc. In hard sciences, this is not (yet) important: much data is generated by some experiment, stored, and analysed more or less in isolation from other sources of data.

More details in this great comparison matrix.

In any case, the debate is moot for several reasons:
  • Linked Data and OData will probably play nicely together if efforts to use schema.org ontologies are successful.
  • There is room for more than one standard.
  • In some spaces, OData is already gaining traction, while in others Linked Data is the way to go. This will affect which you choose. IMHO it is telling that StackExchange (a programmer-focused set of Q&A sites) publishes its data in OData: it’s a programmer-friendly technology.

Other candidates?

GData

It is highly implausible that Google will be beaten by Microsoft in such an important area as data sharing. And yet Google Data Protocol (aka GData) is not yet a contender: it is controlled entirely by Google, and limited in scope to accessing services provided by Google or by the Google App Engine. A cursory glance suggests that it is in fact very similar to OData anyway: AtomPub, REST, JSON, batch processing (different syntax), a filtering/query syntax. I don’t see support for data model discovery, however: there’s no $metadata.

So, pros and cons?

Pros

  • Uses technologies you were probably going to use anyway.
  • A reasonable (but heavily MS-skewed) ecosystem
  • Potentially avoids the cost of building data exploration and query tools, letting some smart client do that for you.
  • Driven by the private sector, meaning serious dollars invested in high quality tools.

Cons

  • Driven by business and Microsoft, meaning those high quality tools are expensive and locked to MS technologies that we don’t use.
  • Future potential unclear
  • Lower barrier to entry than linked data/RDF

10 things I hate about Git

Git is the source code version control system that is rapidly becoming the standard for open source projects. It has a powerful distributed model which allows advanced users to do tricky things with branches, and rewriting history. What a pity that it’s so hard to learn, has such an unpleasant command line interface, and treats its users with such utter contempt.

1. Complex information model

The information model is complicated – and you need to know all of it. As a point of reference, consider Subversion: you have files, a working directory, a repository, versions, branches, and tags. That’s pretty much everything you need to know. In fact, branches are tags, and files you already know about, so you really need to learn three new things. Versions are linear, with the odd merge. Now Git: you have files, a working tree, an index, a local repository, a remote repository, remotes (pointers to remote repositories), commits, treeishes (pointers to commits), branches, a stash… and you need to know all of it.

2. Crazy command line syntax

The command line syntax is completely arbitrary and inconsistent. Some “shortcuts” are graced with top level commands: “git pull” is exactly equivalent to “git fetch” followed by “git merge”. But the shortcut for “git branch” combined with “git checkout”? “git checkout -b”. Specifying filenames completely changes the semantics of some commands (“git commit” ignores local, unstaged changes in foo.txt; “git commit foo.txt” doesn’t). The various options of “git reset” do completely different things.
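
To make those equivalences concrete, here’s roughly how they play out at the prompt:

# "git pull" is, more or less, a fetch followed by a merge
git pull origin master
git fetch origin && git merge origin/master

# ...but the branch-then-switch shortcut hides under checkout
git branch newfeature && git checkout newfeature
git checkout -b newfeature

# ...and naming a file changes what gets committed
git commit           # commits whatever is staged; unstaged edits to foo.txt are ignored
git commit foo.txt   # commits foo.txt's changes, staged or not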

The most spectacular example of this is the command “git am”, which as far as I can tell, is something Linus hacked up and forced into the main codebase to solve a problem he was having one night. It combines email reading with patch applying, and thus uses a different patch syntax (specifically, one with email headers at the top).

3. Crappy documentation

The man pages are one almighty “fuck you”. They describe the commands from the perspective of a computer scientist, not a user. Case in point:

git-push – Update remote refs along with associated objects

Here’s a description for humans: git-push – Upload changes from your local repository into a remote repository

Update, another example (thanks cgd):

git-rebase – Forward-port local commits to the updated upstream head

Translation: git-rebase – Sequentially regenerate a series of commits so they can be applied directly to the head node

4. Information model sprawl

Remember the complicated information model in step 1? It keeps growing, like a cancer. Keep using Git, and more concepts will occasionally drop out of the sky: refs, tags, the reflog, fast-forward commits, detached head state (!), remote branches, tracking, namespaces…

5. Leaky abstraction

Git doesn’t so much have a leaky abstraction as no abstraction. There is essentially no distinction between implementation detail and user interface. It’s understandable that an advanced user might need to know a little about how features are implemented, to grasp subtleties about various commands. But even beginners are quickly confronted with hideous internal details. In theory, there is the “plumbing” and “the porcelain” – but you’d better be a plumber to know how to work the porcelain.
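
For anyone who hasn’t hit that distinction yet, here’s a small taste; the porcelain is what you’re meant to type, and the plumbing is what it’s built on (and what error messages and Stack Overflow answers drag you into almost immediately):

# porcelain: the user-facing view of the latest commit
git log --oneline -1
# plumbing: the same commit, as Git actually stores it
git rev-parse HEAD
git cat-file -p HEAD
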
A common response I get to complaints about Git’s command line complexity is that “you don’t need to use all those commands, you can use it like Subversion if that’s what you really want”. Rubbish. That’s like telling an old granny that the freeway isn’t scary, she can drive at 20kph in the left lane if she wants. Git doesn’t provide any useful subsets – every command soon requires another; even simple actions often require complex actions to undo or refine.

Here was the (well-intentioned!) advice from a GitHub maintainer of a project I’m working on (with apologies!):
  1. Find the merge base between your branch and master: ‘git merge-base master yourbranch’
  2. Assuming you’ve already committed your changes, rebase your commit onto the merge base, then create a new branch:
  3. git rebase --onto <basecommit> HEAD~1 HEAD
  4. git checkout -b my-new-branch
  5. Checkout your ruggedisation branch, and remove the commit you just rebased: ‘git reset --hard HEAD~1’
  6. Merge your new branch back into ruggedisation: ‘git merge my-new-branch’
  7. Checkout master (‘git checkout master’), merge your new branch in (‘git merge my-new-branch’), and check it works when merged, then remove the merge (‘git reset --hard HEAD~1’).
  8. Push your new branch (‘git push origin my-new-branch’) and log a pull request.
Translation: “It’s easy, Granny. Just rev to 6000, dump the clutch, and use wheel spin to get round the first corner. Up to third, then trail brake onto the freeway, late apexing but watch the marbles on the inside. Hard up to fifth, then handbrake turn to make the exit.”

6. Power for the maintainer, at the expense of the contributor

Most of the power of Git is aimed squarely at maintainers of codebases: people who have to merge contributions from a wide number of different sources, or who have to ensure a number of parallel development efforts result in a single, coherent, stable release. This is good. But the majority of Git users are not in this situation: they simply write code, often on a single branch for months at a time. Git is a 4 handle, dual boiler espresso machine – when all they need is instant.

Interestingly, I don’t think this trade-off is inherent in Git’s  design. It’s simply the result of ignoring the needs of normal users, and confusing architecture with interface. “Git is good” is true if speaking of architecture – but false of user interface. Someone could quite conceivably write an improved interface (easygit is a start) that hides unhelpful complexity such as the index and the local repository.

7. Unsafe version control

The fundamental promise of any version control system is this: “Once you put your precious source code in here, it’s safe. You can make any changes you like, and you can always get it back”. Git breaks this promise. Several ways a committer can irrevocably destroy the contents of a repository:

  1. git add . / … / git push -f origin master
  2. git push origin +master
  3. git rebase -i <some commit that has already been pushed and worked from> / git push

8. Burden of VCS maintenance pushed to contributors

In the traditional open source project, only one person had to deal with the complexities of branches and merges: the maintainer. Everyone else only had to update, commit, update, commit, update, commit… Git dumps the burden of understanding complex version control on everyone – while making the maintainer’s job easier. Why would you do this to new contributors – those with nothing invested in the project, and every incentive to throw their hands up and leave?

9. Git history is a bunch of lies

The primary output of development work should be source code. Is a well-maintained history really such an important by-product? Most of the arguments for rebase, in particular, rely on aesthetic judgments about “messy merges” in the history, or “unreadable logs”. So rebase encourages you to lie in order to provide other developers with a “clean”, “uncluttered” history. Surely the correct solution is a better log output that can filter out these unwanted merges.
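
For what it’s worth, the stock log command can already do a little of that filtering, which rather supports the point that rewriting history isn’t the only way to get a readable log:

git log --no-merges     # hide merge commits from the log
git log --first-parent  # follow only the main line through each merge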

10. Simple tasks need so many commands

The point of working on an open source project is to make some changes, then share them with the world. In Subversion, this looks like:
  1. Make some changes
  2. svn commit

If your changes involve creating new files, there’s a tricky extra step:

  1. Make some changes
  2. svn add
  3. svn commit

For a Github-hosted project, the following is basically the bare minimum:

  1. Make some changes
  2. git add [not to be confused with svn add]
  3. git commit
  4. git push
  5. Your changes are still only halfway there. Now log in to Github, find your commit, and issue a “pull request” so that someone downstream can merge it.

In reality though, the maintainer of that Github-hosted project will probably prefer your changes to be on feature branches. They’ll ask you to work like this:

  1. git checkout master [to make sure each new feature starts from the baseline]
  2. git checkout -b newfeature
  3. Make some changes
  4. git add [not to be confused with svn add]
  5. git commit
  6. git push
  7. Now log in to Github, switch to your newfeature branch, and issue a “pull request” so that the maintainer can merge it.

So, moving your changes from your local directory to the actual project repository takes: add, commit, push, “click pull request”, pull, merge, push. (I think.)

As an added bonus, here’s a diagram illustrating the commands a typical developer on a traditional Subversion project needed to know about to get their work done. This is the bread and butter of VCS: checking out a repository, committing changes, and getting updates.

“Bread and butter” commands and concepts needed to work with a remote Subversion repository.

And now here’s what you need to deal with for a typical Github-hosted project:

The “bread and butter” commands and concepts needed to work with a Github-hosted project.

If the power of Git is sophisticated branching and merging, then its weakness is the complexity of simple tasks.

Update (August 3, 2012)

This post has obviously struck a nerve, and gets a lot of traffic. Thought I’d address some of the most frequent comments.

  1. “The comparison between a Subversion repository with commit access and a Git repository without it isn’t fair.” True. But that’s been my experience: most SVN repositories I’ve seen have many committers – it works better that way. Git (or at least Github) repositories tend not to: you’re expected to submit pull requests, even after you reach the “trusted” stage. Perhaps someone else would like to do a fairer apples-to-apples comparison.
  2. “You’re just used to SVN.” There’s some truth to this, even though I haven’t done a huge amount of coding in SVN-based projects. Git’s commands and information model are still inherently difficult to learn, and the situation is not helped by using Subversion command names with different meanings (eg, “svn add” vs “git add”).
  3. “But my life is so much better with Git, why are you against it?” I’m not – I actually quite like the architecture and what it lets you do. You can be against a UI without being against the product.
  4. “But you only need a few basic commands to get by.” That hasn’t been my experience at all. You can just barely survive for a while with clone, add, commit, and checkout. But very soon you need rebase, push, pull, fetch, merge, status, log, and the annoyingly-commandless “pull request”. And before long, cherry-pick, reflog, etc etc…
  5. “Use Mercurial instead!” Sure, if you’re the lucky person who gets to choose the VCS used by your project.
  6. “Subversion has even worse problems!” Probably. This post is about Git’s deficiencies. Subversion’s own crappiness is no excuse.
  7. “As a programmer, it’s worth investing time learning your tools.” True, but beside the point. The point is, the tool is hard to learn and should be improved.
  8. “If you can’t understand it, you must be dumb.” True to an extent, but see the previous point.
  9. “There’s a flaw in point X.” You’re right. As of writing, over 80,000 people have viewed this post. Probably over 1000 have commented on it, on Reddit (530 comments), on Hacker News (250 comments), here (100 comments). All the many flaws, inaccuracies, mischaracterisations, generalisations and biases have been brought to light. If I’d known it would be so popular, I would have tried harder. Overall, the level of debate has actually been pretty good, so thank you all.

A few bonus command inconsistencies:

Reset/checkout

To reset one file in your working directory to its committed state:

git checkout file.txt

To reset every file in your working directory to its committed state:

git reset --hard

Remotes and branches

git checkout remotename/branchname
git pull remotename branchname

There’s another command where the separator is remotename:branchname, but I don’t recall right now.
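
(It’s probably the refspec form of git push, where the colon separates the local branch from the remote branch it should update:)

git push remotename localbranchname:remotebranchname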

Command options that are practically mandatory

And finally, a list of commands I’ve noticed which are almost useless without additional options.

  • git branch foo: creates a branch but does nothing with it. Useful version: git checkout -b foo, which creates the branch and switches to it.
  • git remote: shows the names of your remotes. Useful version: git remote -v, which shows names and URLs.
  • git stash: stores modifications to tracked files, then rolls them back. Useful version: git stash -u, which does the same for untracked files too.
  • git branch: lists the names of local branches. Useful version: git branch -av, which lists local and remote-tracking branches, with each one’s latest commit message.
  • git rebase: destroys history blindfolded. Useful version: git rebase -i, which lets you rewrite the upstream history of a branch, choosing which commits to keep, squash, or ditch.
  • git reset foo: unstages files. Useful versions: git reset --hard, which discards local modifications, and git reset --soft, which returns to another commit but doesn’t touch the working directory.
  • git add: does nothing except print a warning. Useful versions: git add ., which stages all local modifications/additions, and git add -A, which also stages deletions.

Update 2 (September 3, 2012)

A few interesting links:

Semantic Google keywords

I seem to have picked up the habit of using semantic keywords in my Google searches. I have no idea if Google actually supports any of them, but, if they don’t, they should!

  1. buy
    Example: buy usb extension cable
    Meaning: I’m specifically looking to purchase something, so don’t show me any site that doesn’t help me buy things, preferably online.
  2. review
    Example: Samsung galaxy note review
    Meaning: I want to read reviews about something. Don’t show any site that doesn’t have at least one review.
  3. experience
    Example: Samsung galaxy note experience
    Meaning: I’m interested in people’s personal experiences: blogs and informal reviews, not professional reviews from technophiles.
  4. get/download
    Example: get git
    Meaning: Take me to sites that either let me download Git, or will explain how I can do this.
  5. photo
    Example: photo Birrarung Marr
    Meaning: I want to see what Birrarung Marr looks like. Show me photos, pictures, whatever!
  6. how to
    Example: how to uninstall Crayon Physics
    Meaning: I’m looking for technical solutions, problem solving, troubleshooting type sites.
  7. compare
    Example: compare HDMI DLNA
    Meaning: I’m looking for information about the differences between these two things: massively prioritise pages that treat just these two things.
  8. alternatives
    Example: BeyondPod alternatives
    Meaning: I’m looking for things like BeyondPod, but different to it. Especially show me comparisons between it and its alternatives.
  9. why
    Example: why github
    Meaning: I want to know the benefits of GitHub. Prioritise sites that tell me specifically about its advantages, the problems that it solves – not how to use it, or general descriptions of it.

Improving on the “administration rights required” workflow

Consider an action like creating a new “space” in Confluence. Because of the visibility of this action, people want it restricted. Controlled. Managed. So Atlassian makes it only available to people with administrator rights. Which seems ok, until you realise that the workflow of the non-administrator ends up looking like this:

  1. …doing stuff…
  2. Decide to create new space
  3. Attempt to create space, discover that you need an administrator to do it
  4. Find out who the administrator is
  5. Ask them to do it
  6. Wait
And for the administrator it looks like this:
  1. …doing stuff…
  2. Receive request to make a new Confluence space.
  3. Confluence? What? I’m busy configuring a new VM here…oh, fine.
  4. Go log into Confluence, remember how to create a space
  5. Get back to work
And probably there will be some miscommunication about exactly what is required. Confluence spaces are a bit of a trivial example. In other instances, you need a lot of information to perform the administrative action, and if there’s a mistake, it again requires the administrator to fix it. Eventually the administrator gets sick of being bothered with such trivia, and the non-administrator gets sick of hassling them and ends up finding an alternative, or just living with a misconfigured thing.
For the user: frustration, blockage, helplessness
For the administrator: disruption, menial tasks

Alternative workflow #1: approval only

For the non-administrator:

  1. …doing stuff
  2. Decide to create a new Confluence space
  3. Enter the form, fill out all the details, press Ok.
  4. (Confluence sends an “Is this ok?” email to the administrator)
  5. Wait
For the administrator:
  1. …doing stuff
  2. Receive approval request.
  3. Since it’s from a fairly trustworthy user, and seems to make sense, click the link.
  4. Get back to work.
It’s better. The administrator now doesn’t need to know anything about the request. This workflow exists in some kinds of systems (CMSes particularly), but could be a lot more widespread. Ironically, although Jira is a workflow tool, it doesn’t actually have any workflows built in for administration.
For the user: blockage
For the administrator: disruption

Alternative workflow #2: private pending approval

Since creating a space has very limited potential for destruction if it’s hidden, how about this:

For the non-administrator:

  1. …doing stuff
  2. Decide to create Confluence space
  3. Fill out form
  4. Press Ok
  5. (Confluence sends email to administrator, creates space in “private” mode)
  6. Start working in new space. Do almost anything except collaborate with others in this space.
The benefit is clear: there is now no “waiting” step.
For the administrator, it’s the same as before:
  1. …doing stuff
  2. Receive approval request.
  3. Since it’s from a fairly trustworthy user, and seems to make sense, click the link.
    1. If more information is required, look at what the user has done in the new space, for a bit of context.
  4. Get back to work.
There is a mild benefit here, too: the administrator is no longer under as much pressure to immediately approve the thing, and can even see what has been done with the space.
For the user: flow maintained
For the administrator: very mild disruption

Alternative workflow #3: public until reverted

Most users aren’t destructive. And especially in professional environments, virtually all users can be trusted to act in good faith. (Managers strangely predict “chaos” if many users are authorised). So, let them do it:

  1. Decide to create Confluence space
  2. Create Confluence space
  3. Work in new space
For the administrator:
  1. Receive notification that a space has been created
  2. If it looks wrong, strange, inappropriate etc, discuss with the user, and possibly remove it.
  3. Otherwise, keep working
For the user: productivity, responsibility
For the administrator: productivity, trust

Why is buying stuff from eBay so complicated?

So you’ve found the thing you want, and you already have both an eBay account and a PayPal account. You don’t want to do anything complicated – send this item to me, and charge my (already stored) card appropriately.

  1. Click Buy it Now
  2. Sign in
  3. Click Commit to Buy
  4. Click Pay Now
  5. Click Continue
  6. Sign in again (Paypal this time)
  7. Click Continue (confirming payment type?)
  8. Click Confirm Payment
In other words, “I’d like to buy this. Yes. Yes. Yes. Yes. Yes.” Maybe there was more genius to Amazon’s one-click purchasing than I thought.

Introducing: Cooking for engineers

Here’s what I hate about recipes:

  1. They’re delivered as unstructured narratives.
  2. They mix identifying information (“carrots”) with process information (“thinly sliced”)
  3. They’re overly specific (3/4 of a teaspoon, does it matter?)
  4. They have too many ingredients, and you don’t know which ones you can leave out
  5. They always have one or two ingredients you wouldn’t have lying around. Shrimp paste?

I can’t fix all of that. But here goes:

Cooking for engineers: spicy cauliflower and almonds

"Cooking for engineers" version of a spicy cauliflower dish

A cauliflower dish I made up the other night, seen by an engineer.

Recipe for non-engineers

Here’s what a conventional version of that dish might look like:

Ingredients:

1/2 cauliflower, chopped

1 bok choi

1 tbsp coriander, chopped

1/2 cup slivered almonds

1 tbsp ground turmeric

1 tbsp mustard seeds

1-4 tsp chilli powder, to taste

sesame oil

Put 2cm of water in a saucepan and bring it to the boil. Simmer the cauliflower 2 minutes, then drain, discarding the water. Add half the sesame oil, and return the cauliflower to the saucepan.

Meanwhile, heat the rest of the sesame oil in a frying pan. When hot, add the slivered almonds and remove from the heat, stirring continuously. Once browned, add the almonds to the frying cauliflower, and add the spices. After a few minutes, add the bok choi. Stir frequently until the bok choi is soft, then serve with rice and roti bread.

Principles of cooking for engineers

Here are the rules:

  1. One column per preparation dish (saucepan, mixing bowl, tray…)
  2. Red arrows show cooking. Maybe thicker arrows for hotter.
  3. Ingredients start off to the sides. Name of thing in bold. Quantities as rough as appropriate, in parentheses.
  4. Thin blue arrows show transfer of stuff.
  5. Text between two black lines means “keep doing the thing above until this happens”
  6. Stuff you need to do (processes) in square boxes.
  7. Collapse stuff down wherever appropriate. It’s a communication tool, not an exhaustive process analysis.

Explaining step 7: the diagram above could have shown a colander as another column, with cauliflower transferred from the saucepan to the colander and back. But why would you do that?

Discussion

Normal recipe format works pretty well for most people, but I find it takes multiple readings before I can start. Effectively, I’m constructing this kind of flow chart in my head, working out what gets transferred from where, to where. I hope that a refined version of this diagramming methodology would allow you to confidently dive straight in, with no nasty surprises.

On the downside, diagrams are hard to manage. I realise I forgot an ingredient (crushed garlic and ginger paste), but it’s time-consuming to modify the diagram and re-upload.

And lastly, yes, it’s a bit facile to call this “cooking for engineers”. Probably real engineers want precise measurements, correct use of flow control symbols and so forth.

New Gmail feature: auto mailing list management

Ok, Gmail, you’re halfway there: you’ve got labels, and you can detect mailing lists (Google Groups ones, anyway). Now, take it a bit further:

  1. Automatically create a label for every group that I receive mail from. Don’t wait for me to do it.
  2. Make these labels more than just a label. Give them options like “skip the inbox” or “delete on sight”.
  3. Have a view which shows all mail from all mailing lists, with some smart options to make this even more useful.

For extra credit:

4. Detect other sources of regular mail which are not “mailing lists” as such. Newsletters from the bank. Quarterly updates from my alma mater. Treat them exactly the same.

Most of the features are already there, but it’s so tedious having to set up a special rule and label for every single list.

Penny Auctions – a bit of analysis

swoopo, bidray, bidstick (bids tick, apparently), bidrivals and dozens of others are running what we’ll call “penny auctions”. Using bidrivals.com as the example, they all work on the following principles:

  1. There are consumer electronics for auction, usually at big discounts.
  2. It costs a certain amount to make a bid, regardless of whether that bid is ultimately successful. For bidrivals.com, it’s 40 British pence.
  3. Every bid raises the price by a fixed amount. In this example, by 1c. It also extends the auction to last another 15 seconds or so.
  4. If you “win” the auction you must then buy the item at the final price.
  5. It’s not a lottery. Because they say so.

At first glance, the auction looks great – buy a phone for $20! Buy a plasma tv for $1.53!

But not so fast. A couple of things that are not obvious to the beginner:

  1. Every dollar of the final price represents $40 in bidding fees. A $1000 TV selling for $1 is a big loss for the site. The same TV selling for $25 is a small profit. Sold for $1000 it’s a $40,000 profit.
  2. You can use a site-provided bot (“bidbot”, “bidbutler”…) to bid on your behalf. If two people do this simultaneously, they’ll both lose a lot of money with no apparent gain.

So, is it a scam? Well, there are really two questions:

  1. If the site is running completely as described, legitimately, and not using shill bidders (bidding on their own auctions), is this an honest way to make a living – and should you participate?
  2. How do you know if a site is legitimate? Is it likely to be?

Is this honest?

I see very little to distinguish these penny auctions from gambling:

  • When you bid, whether you win or not depends entirely on whether anyone else bids in the next 15 seconds. Assuming you’re bidding on an item which is clearly a bargain (eg, $5 for a TV), then the normal considerations of auctions do not apply: any rational person would bid if they could do so for free.
  • The house take is enormous. Frighteningly so. For example, imagine the site sells items at an average 65% discount from RRP, and bids cost 40 times the amount they increase the price by. Winning a $100 item then costs the bidders (100-65)x40 = $1,400 in bidding fees, for a prize that is now worth only $65 to the winner (the $100 item minus the $35 final price): roughly $21.50 in fees for every dollar of value won (the arithmetic is spelled out below this list). By comparison, a skilled blackjack player in a casino can pay as little as $1.01 to win $1’s worth of value.
  • Most bids give no return to the bidder. This means that even if you don’t want to call it “gambling”, it should still be regulated, as the potential for dishonesty is great. You don’t want to be bidding for a dead donkey.
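
Spelling out the arithmetic from the example above (a $100 RRP item, a 65% average discount, and 1c bid increments costing 40c each):

    final price            = $100 x (100% - 65%)   = $35
    bids to get there      = $35 / 1c per bid      = 3,500 bids
    bidding fees collected = 3,500 x 40c           = $1,400
    winner's net gain      = $100 item - $35 paid  = $65
    fees per $1 of value   = $1,400 / $65          ≈ $21.50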

There is quite a bit of antipathy towards these sites: Washington Post, Jeff Atwood, Ed Oswald.

Can you trust them?

There are two main risks:

  • The site may use “shill bidders” to bid on items that would otherwise go for a low price. This could prevent you ever winning, or cause you to spend far more than you want.
  • Even if you “win”, the site may never ship. The whole thing could be a scam.

Fortunately, there are sites on the lookout for this kind of thing, such as pennyauctionwatch.com. There is evidence of dodgy sites, such as fake testimonials.

So, what are the incentives for a site to use shill bidding? Well, as we saw above, the difference between a $1000 item selling for $1 and selling for $25 doesn’t look like much, but it’s the difference between posting a big loss and breaking even. Imagine there is fairly steady bidding activity, but there are just a few gaps before that $25 mark. If the site could shill just a few times, they would massively increase their profitability.

But how much should they shill? Consider two strategies:

  1. bid whenever the time gets to 1 second; or
  2. bid immediately after anyone else bids.

In 1), the shill bids guarantee almost any asking price, as long as there is still some demand. This has the potential to greatly increase profit, and decrease variance.
In 2), half the bids end up being shill bids. This causes two problems: first, you’re directly losing one bid fee for every shill bid. Second, by inflating the price, you’re accelerating reaching the point at which people no longer want to bid, because the prize at stake is shrinking. So if people might normally bid strongly up to half the value of the item, then shilling along the way is just replacing paying bids with free ones. You might even decrease the final sale value, and every dollar of sale value lost is $40 of bidding fees lost.

Conclusion: shill bidding seems likely to occur, in small doses, because the incentive is just so strong.

Can you beat them?

Probably not. It’s been tried. To beat it:

  • You have to find a site that is not a complete scam.
  • You have to find a site that is completely honest. Even a little bit of shill bidding will crush you.
  • You have to defeat an absolutely incredible house take of 95% (remember, normal house take for gambling ranges from 1% to 5%).
  • You have to know enough about the auctions and your fellow users to make you fairly confident that no one will bid in the next 15 seconds. In the $1000 TV at $25 case (40c to bid), you need there to be a greater than 1/250 chance that you will win the auction with this one bid. Sound easy? Think: if that were the case, how did the price get to $25? It would stop, on average, at $2.50.

If it is beatable, you’d think people would have done it. And, since they’re capped at 4 wins per month generally, there would not be much harm in them sharing their secret. Unless they have a network of penny auction-beating bots. There’s a thought.

But just in case you wanted to try:

  • Compare sites. Find a safe one that appears to be losing money.
  • Collect lots of data. Try GreaseMonkey.
  • Find the right time of day, with the least competition.
  • Track all the auctions, pick individual moments and place bids.
  • Don’t try and win a specific auction. Bid any time your expectation on that bid is positive. The moment could pass.
  • Consider the effect of distractions. A good moment might be when several auctions are closing at the same time. You could even engineer that by bidding on several simultaneously.
  • Consider using several accounts to bid with, to drive off other bidders. If you know they’re paying attention and will react appropriately, that is. I’m thinking you bid with a group, gradually spacing their bids further apart and hoping you can sneak a 15 second gap through.

More reading:
