Steve Bennett blogs

…about maps, open data, Git, and other tech.


Trello Tennis

(Credit: Flickr’s Carine06, CC-BY-SA)

Here’s my method of keeping track of stuff to do. Other tools just didn’t work for me. Most to-do list tools make two assumptions, I think:

  1. Work arrives from somewhere (eg, a manager), and is readily broken up into pieces roughly 0.5-4 hours in size.
  2. Your primary responsibility is processing these tasks without gaps, maintaining a high level of output.

My work, technology for academics, tends to involve  coordinating lots of little projects with lots of different people. At any given moment, there are few tasks “in my court”, and they tend to be small. My primary responsibility is to keep all projects moving, and ensure that projects aren’t held up by a task being with me too long.

So, I ran with the ball in my court / ball in your court metaphor. It’s useful when:

  1. Your work is primarily about projects, and particularly a relationship between you and the main stakeholder in that project.
  2. You anxiously wonder whether you’re supposed to be doing something on a project.
  3. You have trouble remembering all the relevant bits of information for each project.

“Projects” here can be very small, and often include very early stage proposals that haven’t formed into anything very concrete.

Presenting: Trello Tennis.

Trello Tennis in (fictitious) action

To start with, the two most important columns:

  • Ball in my court: projects whose current state requires me to do something (as small as writing an email or checking something, or as big as preparing a whole workshop structure)
  • Ball in their court: projects whose current state is waiting for someone else to do something, such as providing feedback on a document, organising an event, or replying to an email.

The goal each day is to move projects from the “my court” column to the “their court” column. I periodically review “my court”, placing the most urgent items at the top. I also review “their court”, checking whether I should follow up or offer assistance.

More columns that turned out to be useful for me:

  • All good/meeting scheduled: projects whose current state is that both parties are waiting for an event, usually a meeting. Neither of us has to do anything until then.
  • Wut?: the most dangerous column, projects where I’m not sure what the current state is after all. Am I waiting on them? Did I promise to research something and get back to them?
  • Somebody else’s problem (SEP): where responsibility for the project has been shifted to someone else. I don’t need to do anything, and I don’t even need to monitor it at the moment. Someone else may eventually hand it back to me, but until then I don’t worry.
  • Archive: the project is actually finished.
  • Blocked: there is a technical reason why the project can’t advance, and no one is currently doing anything about it.

Using Trello

Trello works really well for this.

The name of each card consists of both the project name, and a quick summary of my current task (if in my court), or what I’m waiting on if in theirs. Examples:

  • WidgetUpgrade – check if compatible with Ubuntu Precise
  • FoobarSystem – Anita fixing DNS problem
  • Maps for Psych – meeting Thurs 3rd 2pm

What’s particularly fantastic about using Trello is that each card can also track all the work on that project. Changing the title automatically creates a history trace, or I can add comments to record more information. Need somewhere to record a server name and login details? Use the card’s description field.

You can also use coloured labels to group projects, such as tagging by university in my case.

Of course, each individual project frequently has its own task tracking system: perhaps another Trello board, Github issues, Pivotal Tracker, Google spreadsheets etc. I link to them from the card’s description, ultimately tying together all information systems for all projects under a single roof.

The rules

Each day (or whenever):

  1. If there is anything in “Wut?”, attempt to clarify its state, then move it to the right column.
  2. Check each item in “Ball in their court” to see if it needs attention. Set alarms if you need. If it’s time to follow up, just move it back to “Ball in my court” and adjust the title.
  3. Review the “Ball in my court” list, ordering them appropriately – most urgent at the top.
  4. Work through items in “Ball in my court” if you have any free time. :)

To make this natural, I order the columns from left to right as follows: Wut, My Court, Blocked, Their Court, All Good, SEP, Archive. Stress is on the left, calm is on the right.

The goal is to get your board to look like this blissful fantasy:

A totally successful day in Trello Tennis land.

These examples are actually tiny compared to (my) reality. By the end of 2013, I had 21 active cards, plus another 10 inactive (SEP and Archive). If you find this technique at all useful, or have any improvements, I’d love to hear about it.

Filtering for sanity

(Added 4 Feb 2014)
Sometimes it gets daunting looking at the huge mounds of work in front of you. Filtering can help cut through the distraction. Two types have worked for me:


Filter by client

Are you working for a particular client today, and don’t want to see anything related to any other clients? Assign one label per client, then turn on the filter!

Filter by … you

To focus on just one or two tasks that you need to finish, first assign yourself to those cards. Then, filter to only show cards that you’re assigned to. This feature is intended for collaboration, but it works nicely.

Terrain in TileMill: a walkthrough for non-GIS types

I created a basemap for http://cycletour.org with TileMill and OpenStreetMap. It looked…ok.


But it felt like there was something missing. Terrain. Elevation. Hills. Mountains. I’d put all that in the too hard basket, until a quick google turned up two blog posts from MapBox, the wonderful people who make TileMill: “Working with terrain data” and “Using TileMill’s raster-colorizer”. Putting these two together, plus a little OCD, led me to this:


Slight improvement! Now, I felt that the two blog posts skipped a couple of little steps, and were slightly confusing on the difference between the two approaches, so here’s my step by step. (MapBox, please feel free to reuse any of this content.)

My setup is an 8 core Ubuntu VM with 32GB RAM and lots of disk space. I have OpenStreetMap loaded into PostGIS.

1. Install the development version of TileMill

You need to do this if you want to use the raster-colorizer, which you’ll want while developing your terrain style if you’re after the “snow up high, green valleys below” look. Without it, you have to pre-render the elevation color effect, which is time consuming: to tweak anything (say, to move the snow line slightly), you need to re-render all the tiles.

Fortunately, it’s pretty easy.

  1. Get the “install-tilemill” gist (my version works slightly better)
  2. Probably uncomment the mapnik-from-source line (and comment out the other one). I don’t know whether you need the latest mapnik.
  3. Run it. Oh – it will uninstall your existing TileMill. Watch out for that.
  4. Reassemble stuff. The dev version puts projects in ~/Documents/<username>/MapBox/project, which is weird.
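
If it helps, steps 1-3 look roughly like this; the gist URL is a placeholder, so substitute whichever version of install-tilemill you’re using:

# Placeholder URL: point this at your preferred fork of the install-tilemill gist
wget -O install-tilemill.sh https://gist.githubusercontent.com/<user>/<gist-id>/raw/install-tilemill.sh
# Review and edit it first: uncomment the mapnik-from-source line if you want it,
# and remember it will uninstall any existing TileMill.
chmod +x install-tilemill.sh
sudo ./install-tilemill.sh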

2. Get some terrain data.

The easiest source is the ubiquitous NASA SRTM DEM (digital elevation model) data, which you get from CGIAR. The user interface is awful. You can only download pieces that are 5 degrees by 5 degrees, so Victoria is 4 pieces.

Screen shot 2013-09-11 at 10.46.00 PM

If you’re downloading more than about that many, you’ll probably want to automate the process. I wrote this quick script to get all the bits for Australia:

# Fetch the CGIAR SRTM tiles covering Australia (x 59-67, y 14-21), skipping any already downloaded
for x in {59..67}; do
  for y in {14..21}; do
    echo $x,$y
    if [ ! -f srtm_${x}_${y}.zip ]; then
      wget http://droppr.org/srtm/v4.1/6_5x5_TIFs/srtm_${x}_${y}.zip
    else
      echo "Already got it."
    fi
  done
done
unzip '*.zip'
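
If your area of interest is somewhere else, you can work out the tile numbers from a rough bounding box. This assumes I have the CGIAR 5x5 degree scheme right (x counts eastward from 180°W, y counts southward from 60°N), and it only handles the simple case where the intermediate values stay positive:

# Tile indices for a rough bounding box (here: Victoria), using
# x = floor((lon+180)/5)+1 and y = floor((60-lat)/5)+1
WEST=141; EAST=149; NORTH=-34; SOUTH=-39
XMIN=$(( (WEST + 180) / 5 + 1 ))
XMAX=$(( (EAST + 180) / 5 + 1 ))
YMIN=$(( (60 - NORTH) / 5 + 1 ))
YMAX=$(( (60 - SOUTH) / 5 + 1 ))
echo "x: ${XMIN}-${XMAX}, y: ${YMIN}-${YMAX}"   # prints x: 65-66, y: 19-20, the four Victorian tiles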


3. Process SRTM .tifs with GDAL.

To have any fun with terrain mapping in TileMill, you need to produce separate layers from the terrain data:

  1. The heightmap itself, so you can colour high elevations differently from low ones.
  2. A “hillshading” layer, where southeast facing slopes are dark, and northwest ones are light. This is what produces the “terrain” illusion.
  3. A “slopeshading” layer, where steep slopes (regardless of aspect) are dark. I’m ambivalent about how useful this is, but you’ll want to play with it.
  4. Contours. These can make your map look AMAZING.
Screen shot 2013-09-11 at 10.53.41 PM

Contours – they’re the best.

In addition, you’ll need some extra processing:

  1. Merge all the .tifs into one. (I made the mistake of keeping them separate, which means a lot of extra layers in TileMill.) Because they’re GeoTIFFs, GDAL can magically merge them without further instruction.
  2. Re-project it (converting it from some random EPSG to Google Web Mercator – can you tell I’m not a real GIS person?)
  3. A bit of scaling here and there.
  4. Generating .tif “overviews”, which are basically smaller versions of the tifs stored inside the same file, so that TileMill doesn’t explode.

Hopefully you already have GDAL installed. It probably came with the development version of TileMill.
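
A quick way to check that the GDAL tools the script below relies on are actually on your path:

for tool in gdal_merge.py gdalwarp gdaldem gdal_translate gdaladdo gdal_contour; do
  which $tool > /dev/null || echo "$tool is missing"
done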

Here’s my script for doing all the processing:

#!/bin/bash
echo -n "Merging files: "
gdal_merge.py srtm_*.tif -o srtm.tif
f=srtm
echo -n "Re-projecting: "
gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3785 -r bilinear $f.tif $f-3785.tif

echo -n "Generating hill shading: "
gdaldem hillshade -z 5 $f-3785.tif $f-3785-hs.tif
echo and overviews:

gdaladdo -r average $f-3785-hs.tif 2 4 8 16 32
echo -n "Generating slope files: "
gdaldem slope $f-3785.tif $f-3785-slope.tif
echo -n "Translating to 0-90..."
gdal_translate -ot Byte -scale 0 90 $f-3785-slope.tif $f-3785-slope-scale.tif
echo "and overviews."
gdaladdo -r average $f-3785-slope-scale.tif 2 4 8 16 32
echo -n Translating DEM...
gdal_translate -ot Byte -scale -10 2000 $f-3785.tif $f-3785-scale.tif
echo and overviews.
gdaladdo -r average $f-3785-scale.tif 2 4 8 16 32
echo "Creating contours"
gdal_contour -a elev -i 20 $f-3785.tif $f-3785-contour.shp

Take my word for it that the above script does everything I promise. The options I’ve chosen are all pretty standard, except that:

  • I’m exaggerating the hillshading by a factor of 5 vertically (“-z 5”). For no particularly good reason.
  • Contours are generated at 20m intervals (“-i 20”).
  • Terrain is scaled in the range -10 to 2000m. Probably an even lower lower bound would be better (you’d be surprised how much terrain is below sea level – especially coal mines). Excessively low terrain results in holes that can’t be styled and turn up white.
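
Before loading anything into TileMill, a quick sanity check that each output is in Web Mercator and has its overviews doesn’t hurt (file names as per the script above):

# Each .tif should report a Mercator projection and a list of overview levels
for f in srtm-3785-hs.tif srtm-3785-slope-scale.tif srtm-3785-scale.tif; do
  echo "== $f"
  gdalinfo $f | grep -iE "mercator|overviews"
done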

4. Load terrain layers into TileMill

Now you have four useful files. I’d suggest adding them as layers in this order (as seen in TileMill):

  1. srtm-3785-contour.shp – the shapefile containing all the contours.
  2. srtm-3785-hs.tif – the hillshading file.
  3. srtm-3785-slope-scale.tif – the scaled slope shading file.
  4. srtm-3785.tif – the height map itself, in metres. (I also generate srtm-3785-scale.tif, which is scaled to a 0-255 range, but I find metres makes more sense.)

For each of these, you must set the projection to 900913 (that’s how you spell GOOGLE in numbers). For the three ‘tifs’, set band=1 in the “advanced” box. I gather that GeoTiffs can have multiple bands of data, and this is the band where TileMill expects to find numeric data to apply colour to.

Screen shot 2013-09-11 at 11.06.39 PM

5. Style the layers

Mapbox’s blog posts go into detail about how to do this, so I’ll just copy/paste my styles. The key lessons here are:

  • Very slight differences in opacity when stacking terrain layers make a huge impact on the appearance of your map. Changing the colour of a road doesn’t make that much difference, but with raster data, a slight change can affect every single pixel.
  • There are lots of different raster-comp-ops to try out, but ‘multiply’ is a good default. (Remember, order matters).
  • Carefully work out each individual zoom level. It seems to work best to have hillshading transition to contours around zoom 12-13. The SRTM data isn’t detailed enough to really allow hillshading above zoom 13.

My styles:

.hs[zoom <= 15] {
  [zoom >= 15] { raster-opacity: 0.1; }
  [zoom >= 13] { raster-opacity: 0.125; }
  [zoom = 12] { raster-opacity: 0.15; }
  [zoom <= 11] { raster-opacity: 0.12; }
  [zoom <= 8] { raster-opacity: 0.3; }

  raster-scaling: bilinear;
  raster-comp-op: multiply;
}

// not really convinced about the value of slope shading
.slope[zoom <= 14][zoom >= 10] {
  raster-opacity: 0.1;
  [zoom = 14] { raster-opacity: 0.05; }
  [zoom = 13] { raster-opacity: 0.05; }
  raster-scaling: lanczos;
  raster-colorizer-default-mode: linear;
  raster-colorizer-default-color: transparent;

  // this combo is ok
  raster-colorizer-stops:
    stop(0, white)
    stop(5, white)
    stop(80, black);
  raster-comp-op: color-burn;
}

// colour-graded elevation model
.dem {
  [zoom >= 10] { raster-opacity: 0.2; }
  [zoom = 9] { raster-opacity: 0.225; }
  [zoom = 8] { raster-opacity: 0.25; }
  [zoom <= 7] { raster-opacity: 0.3; }
  raster-scaling: bilinear;
  raster-colorizer-default-mode: linear;
  raster-colorizer-default-color: hsl(60,50%,80%);

  // hay, forest, rocks, snow
  // if using the srtm-3785-scale.tif file, these stops should be in the range 0-255.
  raster-colorizer-stops:
    stop(0, hsl(60,50%,80%))
    stop(392, hsl(110,80%,20%))
    stop(785, hsl(120,70%,20%))
    stop(1100, hsl(100,0%,50%))
    stop(1370, white);
}

.contour[zoom >= 13] {
  line-smooth: 1.0;
  line-width: 0.75;
  line-color: hsla(100,30%,50%,20%);
  [zoom = 13] {
    line-width: 0.5;
    line-color: hsla(100,30%,50%,15%);
  }

  [zoom >= 16],
  [elev =~ ".*00"] {
    l/text-face-name: 'Roboto Condensed Light';
    l/text-size: 8;
    l/text-name: '[elev]';
    [elev =~ ".*00"] { line-color: hsla(100,30%,50%,40%); }
    [zoom >= 16] { l/text-size: 10; }
    l/text-fill: gray;
    l/text-placement: line;
  }
}

And finally a gratuitous shot of Mt Feathertop, showing the major approaches and the two huts: MUMC Hut to the north and Federation Hut further south. Terrain is awesome!

Screen shot 2013-09-11 at 11.25.23 PM

A pattern for multi-instrument data harvesting with MyTardis

Here’s a formula for setting up a multi-instrument MyTardis installation. It describes one particular pattern, which has been implemented at RMIT’s Microscopy and Microanalysis Facility (RMMF), and is being implemented at Swinburne. For brevity, the many variations on this pattern are not discussed, and gross simplifications are made.

This architecture is probably not optimal for your situation, but maybe a documented suboptimal architecture is more useful than an undocumented optimal one.

The goal

The components:

  • RSync server running on each instrument machine. (On Windows, use DeltaCopy.)
  • Harvester script which collates data from the RSync servers to a staging area.
  • Atom provider which serves the data in the staging area as an Atom feed.
  • MyTardis, which uses the Atom Ingest app to pull data from the Atom feed. It saves data to the MyTardis store and metadata to the MyTardis database.

Preparation

Things you need to find out:

  1. List of instruments you’ll be harvesting from. Typically there’s a point of diminishing returns: 10 instruments in total, of which 3 get most of the use, and numbers #9 and #10 are running DOS and are more trouble than they’re worth.
  2. For each instrument: what data format(s) are produced? Some “instruments” have several distinct sensors or detectors: a scanning electron microscope (producing .tif images) with add-on EDAX detector (producing .spt files), for instance. If there is no MyTardis filter for the given data format(s), then MyTardis will store the files without extracting any metadata, making them opaque to the user.

    Details collected for each instrument at RMIT’s microscopy facility.

  3. Discuss with the users how MyTardis’s concept of “dataset” and “experiment” will apply to the files they produce. If possible, arrange for them to save files into a file structure which makes this clear: /User_data/user123/experiment14/dataset16/file1.tif . Map the terms “dataset” and “experiment” as appropriate: perhaps “session” and “project”.
  4. Networking. Each instrument needs to be reachable on at least one port from the harvester.
  5. You’ll need a staging area. Somewhere big, and readily accessible.
  6. You’ll eventually need a permanent MyTardis store. Somewhere big, and backed up. (The Chef cookbooks actually assume storage inside the VM, at present.)
  7. You’ll eventually need a production-level MyTardis database. It may get very large if you have a lot of metadata (eg, Synchrotron data). (The Chef cookbooks install Postgres locally.)

RSync Server

We assume each PC is currently operating in “stand-alone” mode (users save data to a local hard disk) but is reachable by network. If your users have authenticated access to network storage, you’ll do something a bit different.

Install an Rsync server on each PC. For Windows machines, use DeltaCopy. It’s free. Estimate 5 minutes per machine. Make it auto-start, running in the background, all the time, serving up the user  data directory on a given port.
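
Once the server is running, a quick test from the machine that will do the harvesting looks something like this (the hostname and the "userdata" module name are whatever you configured in DeltaCopy):

rsync rsync://sem-pc1/userdata/                      # list what the instrument PC is serving
rsync -av rsync://sem-pc1/userdata/ /tmp/sem-test/   # trial copy to confirm transfers work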

Harvester script

The harvester script collates data from these different servers into the staging area. The end result is a directory with this kind of structure:

user_123
        TiO2
            2012-03-14 exp1
                run1.tif
                run2.tif
                edax.spc
            2012-03-15 exp1
                anotherrun.tif
                highmag.tif
user_456
...

You will need to customise the scripts for your installation.

  1. Create  a staging area directory somewhere. Probably it will be network storage mounted by NFS or something.
  2. sudo mkdir /opt/harvester
  3. Create a user to run the scripts, “harvester”. Harvested data will end up being owned by this user. Make sure that works for you.
  4. sudo chown harvester /opt/harvester
  5. sudo -su harvester bash
  6. git clone https://github.com/VeRSI-Australia/RMIT-MicroTardis-Harvest
  7. Review the scripts, understand what they do, make changes as appropriate.
  8. Create a Cron job to run the harvester at frequent intervals. For example:
    */5 * * * * /usr/local/microtardis/harvest/harvest.sh >/dev/null 2>&1
  9. Test.
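
The real scripts do more (per-user collation, locking, logging), but the core of a harvest pass is not much more than this sketch, in which the staging path, instrument hostnames and the "userdata" module name are all assumptions:

#!/bin/bash
# Minimal harvest pass: pull each instrument's user data into the staging area
STAGING=/mnt/staging               # wherever your staging area is mounted
for host in sem-pc1 tem-pc2; do    # one entry per instrument PC running an rsync server
  rsync -a "rsync://$host/userdata/" "$STAGING/$host/"
done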

Atom provider

My boldest assumption yet: you will use a dedicated VM simply to serve an Atom feed from the staging area.

Since you already know how to use Chef, use the Chef Cookbook for the Atom Dataset Provider to provision this VM from scratch in one hit.

  1. git clone https://github.com/stevage/atom-provider-chef
  2. Modify cookbooks/atom-dataset-provider/files/default/feed.atom as appropriate. See the documentation at https://github.com/stevage/atom-dataset-provider.
  3. Modify environments/dev.rb and environments/prod.rb as appropriate.
  4. Upload it all to your Chef server:
    knife cookbook upload --all
    knife environment from file environments/*
    knife role from file roles/*
  5. Bootstrap your VM. It works first try.
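
Step 5 boils down to a single knife command, something like the following; the IP address, SSH user and role name are assumptions, and the role should match whatever is in the roles/ directory you just uploaded:

knife bootstrap 203.0.113.10 -x ubuntu --sudo --run-list 'role[atom-dataset-provider]'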

MyTardis

You’re doing great. Now let’s install MyTardis. You’ll need another VM for that. Your local ITS department can probably deliver one of those in a couple of hours, 6 months tops. If possible (ie, if no requirement to host data on-site), use the Nectar Research Cloud.

You’ll use Chef again.

  1. git clone https://github.com/mytardis/mytardis-chef
  2. Follow the instructions there.
  3. Or, follow the instructions here: Getting started with Chef on the NeCTAR Research Cloud.
  4. Currently, you need to manually configure the Atom harvester to harvest from your provider. Instructions for that are on Github.

MyTardis is now running. For extra credit, you will want to:

  • Develop or customise filters to extract useful metadata.
  • Add features that your particular researchers would find helpful. (On-the-fly CSV generation from binary spectrum files was a killer feature for some of our users.)
  • Extend harvesting to further instruments, and more kinds of data.
  • Integrate this system into the local research data management framework. Chat to your friendly local librarian. Ponder issues like how long to keep data, how to back it up, and what to do when users leave the institution.
  • Integrate into authentication systems: your local LDAP, or perhaps the AAF.
  • Find more sources of metadata: booking systems, grant applications, research management systems…

Discussion

After developing this pattern for RMMF, it seems useful to apply it again. And if we (VeRSI, RMIT’s Ian Thomas, Monash’s Steve Androulakis, etc) can do it, perhaps you’ll find it useful too. I was inspired by Peter Sefton’s UWS blog post Connecting Data Capture applications to Research Data Catalogues/Registries/Stores (which raises the idea of design patterns for research data harvesting).

Some good things about this pattern:

  • It’s pretty flexible. Using separate components means individual parts can be substituted out. Perhaps you have another way of producing an Atom feed. Or maybe users save their data directly to a network share, eliminating the need for the harvester scripts.
  • By using simple network transfers (eg, HTTP), it suits a range of possible network architectures (eg, internal staging area, but MyTardis on the cloud; or everything on one VM…)
  • Work can be divided up: one party can work on harvesting from the instruments, while another configures MyTardis.
  • You can archive the staging area directly if you want a raw copy of all data, especially during the testing phase.

Some bad things:

  • It’s probably a bit too complicated – perhaps one day there will be a MyTardis ingest from RSync servers directly.
  • Since the Atom provider is written in NodeJS, it means another set of technologies to become familiar with for troubleshooting.
  • There are lots of places where things can go wrong, and there is no monitoring layer (eg, Nagios).

10 things I hate about Git

Git is the source code version control system that is rapidly becoming the standard for open source projects. It has a powerful distributed model which allows advanced users to do tricky things with branches, and rewriting history. What a pity that it’s so hard to learn, has such an unpleasant command line interface, and treats its users with such utter contempt.

1. Complex information model

The information model is complicated – and you need to know all of it. As a point of reference, consider Subversion: you have files, a working directory, a repository, versions, branches, and tags. That’s pretty much everything you need to know. In fact, branches are tags, and files you already know about, so you really need to learn three new things. Versions are linear, with the odd merge. Now Git: you have files, a working tree, an index, a local repository, a remote repository, remotes (pointers to remote repositories), commits, treeishes (pointers to commits), branches, a stash… and you need to know all of it.

2. Crazy command line syntax

The command line syntax is completely arbitrary and inconsistent. Some “shortcuts” are graced with top level commands: “git pull” is exactly equivalent to “git fetch” followed by “git merge”. But the shortcut for “git branch” combined with “git checkout”? “git checkout -b”. Specifying filenames completely changes the semantics of some commands (“git commit” ignores local, unstaged changes in foo.txt; “git commit foo.txt” doesn’t). The various options of “git reset” do completely different things.
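
To make the asymmetry concrete (the branch names are just examples):

# The two-step form...
git fetch
git merge origin/master      # assuming you're on master, tracking origin/master
# ...is what "git pull" rolls into a single top-level command. The equally common two-step
git branch newfeature
git checkout newfeature
# ...gets no command of its own; it's spelled "git checkout -b newfeature".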

The most spectacular example of this is the command “git am”, which as far as I can tell, is something Linus hacked up and forced into the main codebase to solve a problem he was having one night. It combines email reading with patch applying, and thus uses a different patch syntax (specifically, one with email headers at the top).

3. Crappy documentation

The man pages are one almighty “fuck you”. They describe the commands from the perspective of a computer scientist, not a user. Case in point:

git-push – Update remote refs along with associated objects

Here’s a description for humans: git-push – Upload changes from your local repository into a remote repository

Update, another example: (thanks cgd)

git-rebase – Forward-port local commits to the updated upstream head

Translation: git-rebase – Sequentially regenerate a series of commits so they can be applied directly to the head node

4. Information model sprawl

Remember the complicated information model in step 1? It keeps growing, like a cancer. Keep using Git, and more concepts will occasionally drop out of the sky: refs, tags, the reflog, fast-forward commits, detached head state (!), remote branches, tracking, namespaces…

5. Leaky abstraction

Git doesn’t so much have a leaky abstraction as no abstraction. There is essentially no distinction between implementation detail and user interface. It’s understandable that an advanced user might need to know a little about how features are implemented, to grasp subtleties about various commands. But even beginners are quickly confronted with hideous internal details. In theory, there is the “plumbing” and “the porcelain” – but you’d better be a plumber to know how to work the porcelain.
A common response I get to complaints about Git’s command line complexity is that “you don’t need to use all those commands, you can use it like Subversion if that’s what you really want”. Rubbish. That’s like telling an old granny that the freeway isn’t scary, she can drive at 20kph in the left lane if she wants. Git doesn’t provide any useful subsets – every command soon requires another; even simple actions often require complex actions to undo or refine.
Here was the (well-intentioned!) advice from a GitHub maintainer of a project I’m working on (with apologies!):
  1. Find the merge base between your branch and master: 'git merge-base master yourbranch'
  2. Assuming you’ve already committed your changes, rebase your commit onto the merge base, then create a new branch:
  3. git rebase --onto <basecommit> HEAD~1 HEAD
  4. git checkout -b my-new-branch
  5. Checkout your ruggedisation branch, and remove the commit you just rebased: 'git reset --hard HEAD~1'
  6. Merge your new branch back into ruggedisation: 'git merge my-new-branch'
  7. Checkout master ('git checkout master'), merge your new branch in ('git merge my-new-branch'), and check it works when merged, then remove the merge ('git reset --hard HEAD~1').
  8. Push your new branch ('git push origin my-new-branch') and log a pull request.
Translation: “It’s easy, Granny. Just rev to 6000, dump the clutch, and use wheel spin to get round the first corner. Up to third, then trail brake onto the freeway, late apexing but watch the marbles on the inside. Hard up to fifth, then handbrake turn to make the exit.”

6. Power for the maintainer, at the expense of the contributor

Most of the power of Git is aimed squarely at maintainers of codebases: people who have to merge contributions from a wide number of different sources, or who have to ensure a number of parallel development efforts result in a single, coherent, stable release. This is good. But the majority of Git users are not in this situation: they simply write code, often on a single branch for months at a time. Git is a 4 handle, dual boiler espresso machine – when all they need is instant.

Interestingly, I don’t think this trade-off is inherent in Git’s  design. It’s simply the result of ignoring the needs of normal users, and confusing architecture with interface. “Git is good” is true if speaking of architecture – but false of user interface. Someone could quite conceivably write an improved interface (easygit is a start) that hides unhelpful complexity such as the index and the local repository.

7. Unsafe version control

The fundamental promise of any version control system is this: “Once you put your precious source code in here, it’s safe. You can make any changes you like, and you can always get it back”. Git breaks this promise. Here are several ways a committer can irrevocably destroy the contents of a repository:

  1. git add . / … / git push -f origin master
  2. git push origin +master
  3. git rebase -i <some commit that has already been pushed and worked from> / git push
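
A sketch of the first scenario, with a made-up repository and commit count:

# Hypothetical repository: how a force-push quietly discards remote history
git clone git@example.com:project.git
cd project
git reset --hard HEAD~5      # throw away the last five commits locally
git push -f origin master    # the remote master now points five commits back as well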

8. Burden of VCS maintenance pushed to contributors

In the traditional open source project, only one person had to deal with the complexities of branches and merges: the maintainer. Everyone else only had to update, commit, update, commit, update, commit… Git dumps the burden of  understanding complex version control on everyone – while making the maintainer’s job easier. Why would you do this to new contributors – those with nothing invested in the project, and every incentive to throw their hands up and leave?

9. Git history is a bunch of lies

The primary output of development work should be source code. Is a well-maintained history really such an important by-product? Most of the arguments for rebase, in particular, rely on aesthetic judgments about “messy merges” in the history, or “unreadable logs”. So rebase encourages you to lie in order to provide other developers with a “clean”, “uncluttered” history. Surely the correct solution is a better log output that can filter out these unwanted merges.

10. Simple tasks need so many commands

The point of working on an open source project is to make some changes, then share them with the world. In Subversion, this looks like:
  1. Make some changes
  2. svn commit

If your changes involve creating new files, there’s a tricky extra step:

  1. Make some changes
  2. svn add
  3. svn commit

For a Github-hosted project, the following is basically the bare minimum:

  1. Make some changes
  2. git add [not to be confused with svn add]
  3. git commit
  4. git push
  5. Your changes are still only halfway there. Now login to Github, find your commit, and issue a “pull request” so that someone downstream can merge it.

In reality though, the maintainer of that Github-hosted project will probably prefer your changes to be on feature branches. They’ll ask you to work like this:

  1. git checkout master [to make sure each new feature starts from the baseline]
  2. git checkout -b newfeature
  3. Make some changes
  4. git add [not to be confused with svn add]
  5. git commit
  6. git push
  7. Now login to Github, switch to your newfeature branch, and issue a “pull request” so that the maintainer can merge it.
So, to move your changes from your local directory to the actual project repository, the full sequence is: add, commit, push, “click pull request”, pull, merge, push. (I think.)

As an added bonus, here’s a diagram illustrating the commands a typical developer on a traditional Subversion project needed to know about to get their work done. This is the bread and butter of VCS: checking out a repository, committing changes, and getting updates.

“Bread and butter” commands and concepts needed to work with a remote Subversion repository.

And now here’s what you need to deal with for a typical Github-hosted project:

The “bread and butter” commands and concepts needed to work with a Github-hosted project.

If the power of Git is sophisticated branching and merging, then its weakness is the complexity of simple tasks.

Update (August 3, 2012)

This post has obviously struck a nerve, and gets a lot of traffic. Thought I’d address some of the most frequent comments.

  1. “The comparison between a Subversion repository with commit access and a Git repository without it isn’t fair.” True. But that’s been my experience: most SVN repositories I’ve seen have many committers – it works better that way. Git (or at least Github) repositories tend not to: you’re expected to submit pull requests, even after you reach the “trusted” stage. Perhaps someone else would like to do a fairer apples-to-apples comparison.
  2. “You’re just used to SVN.” There’s some truth to this, even though I haven’t done a huge amount of coding in SVN-based projects. Git’s commands and information model are still inherently difficult to learn, and the situation is not helped by using Subversion command names with different meanings (eg, “svn add” vs “git add”).
  3. “But my life is so much better with Git, why are you against it?” I’m not – I actually quite like the architecture and what it lets you do. You can be against a UI without being against the product.
  4. “But you only need a few basic commands to get by.” That hasn’t been my experience at all. You can just barely survive for a while with clone, add, commit, and checkout. But very soon you need rebase, push, pull, fetch, merge, status, log, and the annoyingly-commandless “pull request”. And before long, cherry-pick, reflog, etc etc…
  5. “Use Mercurial instead!” Sure, if you’re the lucky person who gets to choose the VCS used by your project.
  6. “Subversion has even worse problems!” Probably. This post is about Git’s deficiencies. Subversion’s own crappiness is no excuse.
  7. “As a programmer, it’s worth investing time learning your tools.” True, but beside the point. The point is, the tool is hard to learn and should be improved.
  8. “If you can’t understand it, you must be dumb.” True to an extent, but see the previous point.
  9. “There’s a flaw in point X.” You’re right. As of writing, over 80,000 people have viewed this post. Probably over 1000 have commented on it, on Reddit (530 comments), on Hacker News (250 comments), here (100 comments). All the many flaws, inaccuracies, mischaracterisations, generalisations and biases have been brought to light. If I’d known it would be so popular, I would have tried harder. Overall, the level of debate has actually been pretty good, so thank you all.

A few bonus command inconsistencies:

Reset/checkout

To reset one file in your working directory to its committed state:

git checkout file.txt

To reset every file in your working directory to its committed state:

git reset --hard

Remotes and branches

git checkout remotename/branchname
git pull remotename branchname

There’s another command where the separator is remotename:branchname, but I don’t recall right now.

Command options that are practically mandatory

And finally, a list of commands I’ve noticed which are almost useless without additional options.

  • git branch foo – creates a branch but does nothing with it. Useful: git checkout -b foo – creates the branch and switches to it.
  • git remote – shows the names of remotes. Useful: git remote -v – shows names and URLs of remotes.
  • git stash – stores modifications to tracked files, then rolls them back. Useful: git stash -u – does the same to untracked files as well.
  • git branch – lists names of local branches. Useful: git branch -rv – lists local and remote tracking branches, showing the latest commit message.
  • git rebase – destroys history blindfolded. Useful: git rebase -i – lets you rewrite the upstream history of a branch, choosing which commits to keep, squash, or ditch.
  • git reset foo – unstages files. Useful: git reset --hard – discards local modifications; git reset --soft – returns to another commit, but doesn’t touch the working directory.
  • git add – does nothing except print a warning. Useful: git add . – stages all local modifications/additions; git add -A – stages all local modifications/additions/deletions.

Update 2 (September 3, 2012)

A few interesting links: