Steve Bennett blogs

…about maps, open data, Git, and other tech.

Tag Archives: github

Git: what they didn’t tell you

Credit:Tim Strater from Rotterdam, Nederland CC-BY-SA

Of all the well-documented difficulties I’ve had working with Git over the years, a few conceptual difficulties really stand out. They’re quirks in the Git architecture that took me far too long to realise, far too long to believe, or far too long to really grasp. And maybe you have the same problem without realising it.

Branch names are completely arbitrary


git branch -d master
git checkout develop -b master

There, I did it. I’m now calling the develop branch master. What you call this branch, and what I call it, and what your Github repo calls it, and what my Github repo calls it just don’t matter. Four different names? No problem. There are some flimsy conventions that Git half-heartedly follows to link two branches with the same name, but it gives up pretty easily.

Remote branches are local

I have very frequently fallen into this trap:

$ git diff origin/master
$

No differences, so my branch must be in sync with Origin, right? Wrong. What is true is my branch is in sync with the local copy of Origin. If you don’t run git fetch, then Git will never even update its local copies.

Technically, Git has always been upfront about this. The Git book opens the section on remote branches:

Remote branches are references to the state of branches on your remote repositories.

But it’s counterintuitive, and so I keep messing it up. I keep hoping (and assuming) that Git will one day include an auto-fetch option, where it constantly synchronises remote branches

‘Detached HEAD’ mode is fine

Here’s the message that we have seen many times:

Note: checking out ‘origin/develop’.

You are in ‘detached HEAD’ state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b new_branch_name

This scary looking message threw me off for a long time, despite the fact that it’s actually one of Git’s most helpful messages – it tells you everything you need to know.

It boils down to this: you can’t make a commit if you’re not at the top of a branch. The two most common situations that cause this are:

    1. Checking out a remote branch. You should do this:

      git checkout origin/develop -b mydevelop
      or even, if you want to abandon your branch completely:
      git branch -d develop
      git checkout origin/develop -b develop
  1. Checking out a commit in the middle of a branch, like:

    git checkout a8e6b18

    Usually in this case you just want to look at it, so you can just ignore the message.

There’s nothing special about ‘git clone’

For a long time, I thought that ‘git init’, ‘git pull’ and ‘git clone’ somehow created repositories that were different, even if they ended up with the same commits in them. It’s hard to recreate my state of mind, but I spent a long time trying to salvage certain directories on disk when I should have just abandoned them.

Similarly, there is no difference between:

  1. git clone http://github.com/stevage/myrepo
  2. git init
    git remote add origin http://github.com/stevage/myrepo
    git pull origin master

Well, in the second case Git’s a bit confused about which local branches map onto which remote branches, so you have to be more explicit or fix it with some configuration option.

Don’t call any remote ‘origin’

Credit: Chevassu (GFDL)

For some reason, Git encourages you to call the source of the first clone “origin”. I have found this very confusing and ultimately very unhelpful. Let’s say you’re working on a project called widget, and you fork it in Github so you can work on it. You will want both remotes accessible locally, so you will probably do one of these:

  1. git clone http://github.com/stevage/widget
    git remote add widget http://github.com/widget/widget
  2. git clone http://github.com/widget/widget
    git remote add mine http://github.com/stevage/widget

So you either have remotes called “origin” and “widget”, or remotes called “origin” and “mine”. But on the next project, you might make the opposite choice, and soon you really don’t remember what “origin” means.

My tip: never name any remote “origin” ever. Name them all after their Github username.

  1. git init
    git remote add widget http://github.com/widget/widget
    git remote add stevage http://github.com/stevage/widget
    git fetch --all
    get checkout stevage -b master

You will never understand “git reset”‘s options

The difference between “git reset”, “git reset –soft” and “git reset –hard” is beyond your comprehension. And you probably wanted “git checkout” anyway.

Rule of thumb:

  • If your directory is FUBAR: git reset –hard
  • If you just want to throw away changes to one file: git checkout file
  • In all other cases, google it.

Super lightweight map websites with Github

Github, the online version control repository host for Git, recently added support for GeoJSON files. Sounds boring, right? It actually lets you do something very cool: build your own “dots on a map” website with virtually zero code.

An example of GeoJSON on Github I whipped up.

An example of GeoJSON on Github I whipped up. Click it.

Here’s what you need to do.

  1. Get a Github repository if you don’t have one already. They’re free.
  2. Create a GeoJSON file. You can export to this format from various tools. One easy way to get started would be to upload a CSV file with locations to dotspotting.org then download the GeoJSON from there. Or even easier, use geojson.io to place dots, lines and polygons with a graphical tool. It can save directly to your GitHub.
  3. Here’s what my test file looks like:
    {
     "type": "FeatureCollection",
     "features": [{ "type": "Feature",
     "geometry": {
     "type": "Point",
     "coordinates": [144.9,-37.8]
     },
     "properties": {
     "title": "Scooter",
     "description": "Here's a dot",
     "marker-size": "medium",
     "marker-symbol": "scooter",
     "marker-color": "#a59",
     "stroke": "#555555",
     "stroke-opacity": 1.0,
     "stroke-width": 2,
     "fill": "#555555",
     "fill-opacity": 0.5
     }
     },
     { "type": "Feature",
     "geometry": {
     "type": "Point",
     "coordinates": [144.4,-37.5]
     },
     "properties": {
     "title": "Cafe",
     "description": "Coffee and stuff",
     "marker-size": "medium",
     "marker-symbol": "cafe",
     "marker-color": "#f99",
     "stroke": "#555555",
     "stroke-opacity": 1.0,
     "stroke-width": 2,
     "fill": "#555555",
     "fill-opacity": 0.5
     }
     }]
    }

    It’s worth validating with GeoJSONLint.

  4. Commit this file, say test.geojson, to your Github repository. You can get a preview of it in Github:

    The test GeoJSON file, as seen on GitHub.

    The test GeoJSON file, as seen on GitHub.

  5. Now the really cool part. Embed the map into your own website. This is stupidly easy:
    <!DOCTYPE html>
    <html>
    <body>
    <script src="https://embed.github.com/view/geojson/stevage/georly/master/test1.geojson?height=500&width=1000"></script>
    </body>
    </html>

    If you don’t have a website, site44 is an extremely easy way to get started. You place HTML files into your DropBox, and they get automagicked onto the web, with a subdomain: something.site44.com.

Now what?

That’s it! What’s especially interesting about the hosting on GitHub is it’s a very easy way to have a lightweight shared geospatial database of points, lines or polygons. Here’s how you could add dots to my map:

  1. Fork my repository
  2. Add a few points, by modifying the GeoJSON file
  3. Commit your changes to your repository.
  4. Send me a pull request
  5. I accept the changes, and voila – now your points are shown with mine.

Using this method, we have a “review before publish” workflow, and a full version history of every change.

Caveats

This is a nifty tool for prototyping social mapping applications, but it obviously won’t cut it for production purposes:

  • No support for different layers: all the dots are always shown
  • No support for different basemaps: always the same OpenStreetMap style
  • No authoring tools: you must use something else to generate the GeoJSON
  • Obligatory “rendered with (heart) by GitHub” footer.

Soon you’ll want to build a proper application, using tools like MapBox, CloudMade, CartoDB, Leaflet etc.

10 things I hate about Git

Git is the source code version control system that is rapidly becoming the standard for open source projects. It has a powerful distributed model which allows advanced users to do tricky things with branches, and rewriting history. What a pity that it’s so hard to learn, has such an unpleasant command line interface, and treats its users with such utter contempt.

1. Complex information model

The information model is complicated – and you need to know all of it. As a point of reference, consider Subversion: you have files, a working directory, a repository, versions, branches, and tags. That’s pretty much everything you need to know. In fact, branches are tags, and files you already know about, so you really need to learn three new things. Versions are linear, with the odd merge. Now Git: you have files, a working tree, an index, a local repository, a remote repository, remotes (pointers to remote repositories), commits, treeishes (pointers to commits), branches, a stash… and you need to know all of it.

2. Crazy command line syntax

The command line syntax is completely arbitrary and inconsistent. Some “shortcuts” are graced with top level commands: “git pull” is exactly equivalent to “git fetch” followed by “git merge”. But the shortcut for “git branch” combined with “git checkout”? “git checkout -b”. Specifying filenames completely changes the semantics of some commands (“git commit” ignores local, unstaged changes in foo.txt; “git commit foo.txt” doesn’t). The various options of “git reset” do completely different things.

The most spectacular example of this is the command “git am”, which as far as I can tell, is something Linus hacked up and forced into the main codebase to solve a problem he was having one night. It combines email reading with patch applying, and thus uses a different patch syntax (specifically, one with email headers at the top).

3. Crappy documentation

The man pages are one almighty “fuck you”. They describe the commands from the perspective of a computer scientist, not a user. Case in point:

git-push – Update remote refs along with associated objects

Here’s a description for humans: git-push – Upload changes from your local repository into a remote repository

Update, another example: (thanks cgd)

git-rebase – Forward-port local commits to the updated upstream head

Translation: git-rebase – Sequentially regenerate a series of commits so they can be applied directly to the head node

4. Information model sprawl

Remember the complicated information model in step 1? It keeps growing, like a cancer. Keep using Git, and more concepts will occasionally drop out of the sky: refs, tags, the reflog, fast-forward commits, detached head state (!), remote branches, tracking, namespaces

5. Leaky abstraction

Git doesn’t so much have a leaky abstraction as no abstraction. There is essentially no distinction between implementation detail and user interface. It’s understandable that an advanced user might need to know a little about how features are implemented, to grasp subtleties about various commands. But even beginners are quickly confronted with hideous internal details. In theory, there is the “plumbing” and “the porcelain” – but you’d better be a plumber to know how to work the porcelain.
A common response I get to complaints about Git’s command line complexity is that “you don’t need to use all those commands, you can use it like Subversion if that’s what you really want”. Rubbish. That’s like telling an old granny that the freeway isn’t scary, she can drive at 20kph in the left lane if she wants. Git doesn’t provide any useful subsets – every command soon requires another; even simple actions often require complex actions to undo or refine.
Here was the (well-intentioned!) advice from a GitHub maintainer of a project I’m working on (with apologies!):
  1. Find the merge base between your branch and master: ‘git merge-base master yourbranch’
  2. Assuming you’ve already committed your changes, rebased your commit onto the merge base, then create a new branch:
  3. git rebase –onto <basecommit> HEAD~1 HEAD
  4. git checkout -b my-new-branch
  5. Checkout your ruggedisation branch, and remove the commit you just rebased: ‘git reset –hard HEAD~1’
  6. Merge your new branch back into ruggedisation: ‘git merge my-new-branch’
  7. Checkout master (‘git checkout master’), merge your new branch in (‘git merge my-new-branch’), and check it works when merged, then remove the merge (‘git reset –hard HEAD~1’).
  8. Push your new branch (‘git push origin my-new-branch’) and log a pull request.
Translation: “It’s easy, Granny. Just rev to 6000, dump the clutch, and use wheel spin to get round the first corner. Up to third, then trail brake onto the freeway, late apexing but watch the marbles on the inside. Hard up to fifth, then handbrake turn to make the exit.”

6. Power for the maintainer, at the expense of the contributor

Most of the power of Git is aimed squarely at maintainers of codebases: people who have to merge contributions from a wide number of different sources, or who have to ensure a number of parallel development efforts result in a single, coherent, stable release. This is good. But the majority of Git users are not in this situation: they simply write code, often on a single branch for months at a time. Git is a 4 handle, dual boiler espresso machine – when all they need is instant.

Interestingly, I don’t think this trade-off is inherent in Git’s  design. It’s simply the result of ignoring the needs of normal users, and confusing architecture with interface. “Git is good” is true if speaking of architecture – but false of user interface. Someone could quite conceivably write an improved interface (easygit is a start) that hides unhelpful complexity such as the index and the local repository.

7. Unsafe version control

The fundamental promise of any version control system is this: “Once you put your precious source code in here, it’s safe. You can make any changes you like, and you can always get it back”. Git breaks this promise. Several ways a committer can irrevocably destroy the contents of a repository:

  1. git add . / … / git push -f origin master
  2. git push origin +master
  3. git rebase -i <some commit that has already been pushed and worked from> / git push

8. Burden of VCS maintainance pushed to contributors

In the traditional open source project, only one person had to deal with the complexities of branches and merges: the maintainer. Everyone else only had to update, commit, update, commit, update, commit… Git dumps the burden of  understanding complex version control on everyone – while making the maintainer’s job easier. Why would you do this to new contributors – those with nothing invested in the project, and every incentive to throw their hands up and leave?

9. Git history is a bunch of lies

The primary output of development work should be source code. Is a well-maintained history really such an important by-product? Most of the arguments for rebase, in particular, rely on aesthetic judgments about “messy merges” in the history, or “unreadable logs”. So rebase encourages you to lie in order to provide other developers with a “clean”, “uncluttered” history. Surely the correct solution is a better log output that can filter out these unwanted merges.

10. Simple tasks need so many commands

The point of working on an open source project is to make some changes, then share them with the world. In Subversion, this looks like:
  1. Make some changes
  2. svn commit

If your changes involve creating new files, there’s a tricky extra step:

  1. Make some changes
  2. svn add
  3. svn commit

For a Github-hosted project, the following is basically the bare minimum:

  1. Make some changes
  2. git add [not to be confused with svn add]
  3. git commit
  4. git push
  5. Your changes are still only halfway there. Now login to Github, find your commit, and issue a “pull request” so that someone downstream can merge it.

In reality though, the maintainer of that Github-hosted project will probably prefer your changes to be on feature branches. They’ll ask you to work like this:

  1. git checkout master [to make sure each new feature starts from the baseline]
  2. git checkout -b newfeature
  3. Make some changes
  4. git add [not to be confused with svn add]
  5. git commit
  6. git push
  7. Now login to Github, switch to your newfeature branch, and issue a “pull request” so that the maintainer can merge it.
So, to move your changes from your local directory to the actual project repository will be: add, commit, push, “click pull request”, pull, merge, push. (I think)

As an added bonus, here’s a diagram illustrating the commands a typical developer on a traditional Subversion project needed to know about to get their work done. This is the bread and butter of VCS: checking out a repository, committing changes, and getting updates.

“Bread and butter” commands and concepts needed to work with a remote Subversion repository.

And now here’s what you need to deal with for a typical Github-hosted project:

The “bread and butter” commands and concepts needed to work with a Github-hosted project.

If the power of Git is sophisticated branching and merging, then its weakness is the complexity of simple tasks.

Update (August 3, 2012)

This post has obviously struck a nerve, and gets a lot of traffic. Thought I’d address some of the most frequent comments.

  1. The comparison between a Subversion repository with commit access and a Git repository without it isn’t fair True. But that’s been my experience: most SVN repositories I’ve seen have many committers – it works better that way. Git (or at least Github) repositories tend not to: you’re expected to submit pull requests, even after you reach the “trusted” stage. Perhaps someone else would like to do a fairer apples-to-apples comparison.
  2. You’re just used to SVN There’s some truth to this, even though I haven’t done a huge amount of coding in SVN-based projects. Git’s commands and information model are still inherently difficult to learn, and the situation is not helped by using Subversion command names with different meanings (eg, “svn add” vs “git add”).
  3. But my life is so much better with Git, why are you against it? I’m not – I actually quite like the architecture and what it lets you do. You can be against a UI without being against the product.
  4. But you only need a few basic commands to get by. That hasn’t been my experience at all. You can just barely survive for a while with clone, add, commit, and checkout. But very soon you need rebase, push, pull, fetch , merge, status, log, and the annoyingly-commandless “pull request”. And before long, cherry-pick, reflog, etc etc…
  5. Use Mercurial instead! Sure, if you’re the lucky person who gets to choose the VCS used by your project.
  6. Subversion has even worse problems! Probably. This post is about Git’s deficiencies. Subversion’s own crappiness is no excuse.
  7. As a programmer, it’s worth investing time learning your tools. True, but beside the point. The point is, the tool is hard to learn and should be improved.
  8. If you can’t understand it, you must be dumb. True to an extent, but see the previous point.
  9. There’s a flaw in point X. You’re right. As of writing, over 80,000 people have viewed this post. Probably over 1000 have commented on it, on Reddit (530 comments), on Hacker News (250 comments), here (100 comments). All the many flaws, inaccuracies, mischaracterisations, generalisations and biases have been brought to light. If I’d known it would be so popular, I would have tried harder. Overall, the level of debate has actually been pretty good, so thank you all.

A few bonus command inconsistencies:

Reset/checkout

To reset one file in your working directory to its committed state:

git checkout file.txt

To reset every file in your working directory to its committed state:

git reset --hard

Remotes and branches

git checkout remotename/branchname
git pull remotename branchname

There’s another command where the separator is remotename:branchname, but I don’t recall right now.

Command options that are practically mandatory

And finally, a list of commands I’ve noticed which are almost useless without additional options.

Base command Useless functionality Useful command Useful functionality
git branch foo Creates a branch but does nothing with it git checkout -b foo Creates branch and switches to it
git remote Shows names of remotes git remote -v Shows names and URLs of remotes
git stash Stores modifications to tracked files, then rolls them back git stash -u Also does the same to untracked files
git branch Lists names of local branches git branch -rv Lists local and remote tracking branches; shows latest commit message
git rebase Destroy history blindfolded git rebase -i Lets you rewrite the upstream history of a branch, choosing which commits to keep, squash, or ditch.
git reset foo Unstages files git reset –hard
git reset –soft
Discards local modifications
Returns to another commit, but doesn’t touch working directory.
git add Nothing – prints warning git add .
git add -A
Stages all local modifications/additions
Stages all local modifications/additions/deletions

Update 2 (September 3, 2012)

A few interesting links: