Steve Bennett blogs

…about maps, open data, Git, and other tech.

Tag Archives: api

OpenTrees.org: how to aggregate 373,000 trees from 9 open data sources

Posted by steveko on April 7, 2015

I try to convince government bodies, especially local councils, to publish more open data. It’s much easier when there is a concrete benefit to point to: if you publish your tree inventory, it could be joined up with all the other councils’ tree inventories, to make some kind of big tree-explorey interface thing.

Introducing: opentrees.org. It’s fun! Click on “interesting trees”, hover over a few, and click on the ones that take your fancy. You can play for ages.

Here’s how I made it.

First you get the data

Through a bit of searching on data.gov.au, I found tree inventories (normally called “Geelong street trees” or similar) for: Geelong, Ballarat (both participating in OpenCouncilData), Corangamite (I visited last year), Colac-Otways (friends of Corangamite), Wyndham (a surprise!), Manningham (total surprise). It showed two results from data.sa.gov.au: Adelaide, and the Waite Arboretum (in Adelaide). Plus the City of Melbourne’s (open data pioneers) “Urban Forest” dataset on data.melbourne.vic.gov.au.

Every dataset is different. For instance:

GeoJSON files for Corangamite, Colac-Otways, Ballarat, Manningham
CSV for Melbourne and Adelaide. Socrata has a “JSON” export, but it’s not GeoJSON.
Wyndham has a GeoJSON file, but for some reason the data is represented as “MultiPoint” rather than “Point”, which GDAL couldn’t handle. They also have a CSV, which is also very weird, with an embedded WKT geometry (also MULTIPOINT) in a projected (probably UTM) coordinate system. There are also several blank columns.
Waite Arboretum’s data is in zipped Shapefile and KML. KML is the worst, because it seems to have attributes encoded as HTML, so I used the Shapefile.

Source code for gettrees.sh.
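
For what it’s worth, here is one possible workaround for the Wyndham “MultiPoint” quirk, sketched in Python 3. It is not what gettrees.sh does; it just shows the idea of rewriting each one-point MultiPoint as a plain Point before handing the file to other tools. The sample feature (the tree_id property and the coordinates) is made up for illustration.

```python
import json


def multipoint_to_point(feature):
    """Return a copy of a GeoJSON feature, turning a one-point MultiPoint into a Point."""
    geometry = feature.get("geometry") or {}
    coords = geometry.get("coordinates") or []
    if geometry.get("type") == "MultiPoint" and len(coords) == 1:
        new_geometry = {"type": "Point", "coordinates": list(coords[0])}
        return dict(feature, geometry=new_geometry)
    return feature  # anything else is left untouched


def flatten_collection(collection):
    """Apply multipoint_to_point to every feature in a FeatureCollection."""
    features = [multipoint_to_point(f) for f in collection["features"]]
    return dict(collection, features=features)


# Made-up sample data, shaped like the Wyndham file described above.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"tree_id": 1},
            "geometry": {"type": "MultiPoint", "coordinates": [[144.66, -37.9]]},
        }
    ],
}

fixed = flatten_collection(sample)
print(json.dumps(fixed["features"][0]["geometry"]))
```

A MultiPoint that really does contain several points is passed through unchanged, since there is no obviously right single point to pick.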

Tip for data providers #1: Choose CSV files for all point data, with columns “lat” and “lon”. (They’re much easier to manipulate than other formats, it’s easy to strip fields you don’t need, and they’re useful for doing non-spatial things with.)

Then you load the data

Next we load all the data files, as they are, into separate tables in PostGIS. GDAL is the magic tool here. Its conversion tool, ogr2ogr, has a slightly weird command line but works very well. A few tips:

Set the target table geometry type to be “GEOMETRY”, rather than letting it choose a more specific type like POINT or MULTIPOINT. This makes it easier to combine layers later.
-nlt GEOMETRY
Re-project all geometry to Web Mercator (EPSG:3857) when you load. Save yourself pain.
-t_srs EPSG:3857
Load data faster by using Postgres “copy” mode:
--config PG_USE_COPY YES
Specify your own table name:
-nln adelaide

Tip for data providers #2: Provide all data in unprojected (latitude/longitude) coordinates by preference, or Web Mercator (EPSG:3857).
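
If you’re curious what the re-projection to Web Mercator involves, the standard spherical formula is short enough to sketch in Python. In practice ogr2ogr does this for you; the function name here is my own, and the Melbourne coordinates are only approximate sample values.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (the WGS84 semi-major axis)


def lonlat_to_web_mercator(lon, lat):
    """Convert longitude/latitude in degrees to Web Mercator (EPSG:3857) metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


# Roughly central Melbourne; southern latitudes give a negative y.
x, y = lonlat_to_web_mercator(144.9631, -37.8136)
print(round(x), round(y))
```

The formula blows up at the poles, which is why Web Mercator maps stop at about 85 degrees north and south.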

CSV files unfortunately require creating a companion ‘.vrt’ file for non-trivial cases (eg, weird projections, weird column names). For example:
<OGRVRTDataSource>
  <OGRVRTLayer name="melbourne">
    <SrcDataSource>melbourne.csv</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>WGS84</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
The command to load a dataset looks like:
ogr2ogr --config PG_USE_COPY YES -overwrite -f "PostgreSQL" PG:"dbname=trees" -t_srs EPSG:3857 melbourne.vrt -nln melbourne -lco GEOMETRY_NAME=the_geom -lco FID=gid -nlt GEOMETRY
Source code for loadtrees-db.sh.
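
With nine sources, you end up running that command many times with only the file and table name changing. Here is a minimal Python sketch of generating it; the real loadtrees-db.sh is a shell script, the function name is mine, and “adelaide.vrt” is an assumed file name. The flags are exactly the ones from the command above.

```python
import shlex


def build_ogr2ogr_command(source, table, dbname="trees"):
    """Build the ogr2ogr argument list shown above for one source file and table."""
    return [
        "ogr2ogr",
        "--config", "PG_USE_COPY", "YES",
        "-overwrite",
        "-f", "PostgreSQL", "PG:dbname=" + dbname,
        "-t_srs", "EPSG:3857",
        source,
        "-nln", table,
        "-lco", "GEOMETRY_NAME=the_geom",
        "-lco", "FID=gid",
        "-nlt", "GEOMETRY",
    ]


# Print the commands rather than running them, so nothing touches a database here.
for source, table in [("melbourne.vrt", "melbourne"), ("adelaide.vrt", "adelaide")]:
    command = build_ogr2ogr_command(source, table)
    print(" ".join(shlex.quote(arg) for arg in command))
```

To actually run a command, you could pass the list to subprocess.run(command, check=True), which avoids any shell quoting problems.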

Merge the data

Unfortunately most councils do not yet publish data in the (very easy to follow!) opencouncildata.org standards. So we have to investigate the data and try to match the fields into the scheme. Basically, it’s a bunch of hand-crafted SQL INSERT statements like:
INSERT INTO alltrees
  (the_geom, ref, genus, species, scientific, common, location,
   height, crown, dbh, planted, maturity, source)
SELECT
  the_geom,
  tree_id AS ref,
  genus_desc AS genus,
  spec_desc AS species,
  trim(concat(genus_desc, ' ', spec_desc)) AS scientific,
  common_nam AS common,
  split_part(location_t, ' ', 1) AS location,
  height_m AS height,
  canopy_wid AS crown,
  diam_breas AS dbh,
  CASE WHEN length(year_plant::varchar) = 4
       THEN to_date(year_plant::varchar, 'YYYY') END AS planted,
  life_stage AS maturity,
  'colac_otways' AS source
FROM colac_otways;
Notice that we have to convert the year (“year_plant”) into an actual date. I haven’t yet fully handled complicated fields like health, structure, height and dbh, so there’s still a mish-mash of non-numeric values and different units (Adelaide records the circumference of trees rather than the diameter!).
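
The two conversions mentioned here are simple enough to sketch in Python. This is only an illustration of the logic, not part of mergetrees.sql: the function names are mine, and the circumference conversion assumes the trunk is roughly circular.

```python
import math
from datetime import date


def year_to_date(value):
    """Mirror the SQL CASE above: only a four-digit year becomes a date (1 January)."""
    text = str(value).strip() if value is not None else ""
    if len(text) == 4 and text.isdigit() and int(text) > 0:
        return date(int(text), 1, 1)
    return None  # anything else (blank, "unknown", "05") stays empty


def circumference_to_diameter(circumference):
    """Convert a trunk circumference to a diameter, in the same units."""
    return circumference / math.pi


print(year_to_date(2005))
print(round(circumference_to_diameter(100.0), 1))
```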

Tip for data providers #3: Follow the opencouncildata.org standards, and participate in the process.

Source code for mergetrees.sql

Clean the data

We now have about 370,000 trees, but the data is of very variable quality. For instance, in some datasets, values like “Stump”, “Unknown” or “Fan Palm” appear in the “scientific name” column. We need to clean them out:
UPDATE alltrees
SET scientific='', genus='', species='', description=scientific
WHERE scientific='Vacant Planting'
   OR scientific ILIKE 'Native%'
   OR scientific ILIKE 'Ornamental%'
   OR scientific ILIKE 'Rose %'
   OR scientific ILIKE 'Fan Palm%'
   OR scientific ILIKE 'Unidentified%'
   OR scientific ILIKE 'Unknown%'
   OR scientific ILIKE 'Stump';
We also want to split scientific names into individual genus and species fields, handle varieties, sub-species and so on. Then there are the typos which, due to some quirk in tree management software, become faithfully and consistently retained across a whole dataset. This results in hundreds of Angpohoras, Qurecuses, Botlebrushes etc. We also need to turn non-values (“Not assessed”, “Unknown”, “Unidentified”) into actual NULL values.
UPDATE alltrees SET crown=NULL WHERE crown ILIKE 'Not Assessed';
Source code for cleantrees.sql
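
To make the cleaning rules concrete, here is roughly the same logic as a Python sketch. It is an approximation rather than a port of cleantrees.sql: the function names are mine, matching is case-insensitive throughout (slightly looser than the SQL), and it ignores varieties and sub-species.

```python
# Placeholder names taken from the UPDATE statement above.
PLACEHOLDER_PREFIXES = ("native", "ornamental", "rose ", "fan palm", "unidentified", "unknown")
PLACEHOLDER_EXACT = {"vacant planting", "stump"}
NON_VALUES = {"not assessed", "unknown", "unidentified"}


def null_if_non_value(value):
    """Turn non-values such as 'Not Assessed' into None; keep everything else."""
    if value is None:
        return None
    return None if value.strip().lower() in NON_VALUES else value


def is_placeholder_name(scientific):
    """True if the 'scientific name' is really a note such as 'Stump' or 'Fan Palm'."""
    lowered = scientific.strip().lower()
    return lowered in PLACEHOLDER_EXACT or lowered.startswith(PLACEHOLDER_PREFIXES)


def split_scientific(scientific):
    """Split 'Genus species ...' into (genus, species); placeholders give ('', '')."""
    if not scientific or is_placeholder_name(scientific):
        return "", ""
    parts = scientific.split()
    genus = parts[0].capitalize()
    species = parts[1].lower() if len(parts) > 1 else ""
    return genus, species


for name in ["Eucalyptus camaldulensis", "Fan Palm", "Quercus"]:
    print(name, "->", split_scientific(name))
```

Fixing the consistent typos would need an extra lookup table of known misspellings, which this sketch leaves out.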

Tip for data providers #4: The cleaner your data, the more interesting things people can do with it. (But we’d rather see dirty data than nothing.)

Make a map

I use TileMill to make web maps. For this project it has a killer feature: the ability to pre-render a map of hundreds of thousands of points, and allow the user to interact with those points, without exploding the browser. That’s incredibly clever. Having complete control of the cartography is also great, and looks much better than, say, dumping a bunch of points on a Google Map.

As far as TileMill maps go, it’s very conventional. I add a PostGIS layer for the tree points, plus layers for other features such as roads, rivers and parks, pointing to an OpenStreetMap database I already had loaded. I also show the names of the local government areas with their boundaries, which fade out and disappear as you zoom in.

My style is intentionally all about the trees. There are some very discreet roads and footpaths to serve as landmarks, but they’re very subdued. I use colour (from green to grey) to indicate when species and/or genus information is missing. The Waite Arboretum data has polygons for (I presume) crown coverage, which I show as a semi-opaque dark green.

Source code for the TileMill CartoCSS style.

There’s also an interactive layer, so the user can hover over a tree to see more information. It looks like this:
{{{common}}} {{{scientific}}} <table> {{#genus}}<tr><th>Genus </th><td>{{{genus}}}</td></tr>{{/genus}} {{#species}}<tr><th>Species</th><td>{{{species}}}</td></tr>{{/species}} {{#variety}}<tr><th>Variety</th><td>{{{variety}}}</td></tr>{{/variety}}
...
I also whipped up two more layers:

OpenStreetMap trees, showing “natural=tree” objects in OpenStreetMap. The data is very sketchy. This kind of data is something that councils collect much better than OpenStreetMap.
Interesting trees. I compute the “interestingness” of a tree by calculating the number of other trees in the total database of the same species. A tree in a set of 5 or less is very interesting (red), 25 or less is somewhat interesting (yellow).

Source code for makespecies.sql.
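
The “interestingness” rule is just counting. As a sketch of the idea in Python (the real thing is SQL in makespecies.sql; the function names and the “ordinary” label for everything else are my own):

```python
from collections import Counter


def classify(count):
    """Apply the thresholds described above to one species count."""
    if count <= 5:
        return "very interesting"      # shown in red
    if count <= 25:
        return "somewhat interesting"  # shown in yellow
    return "ordinary"


def interestingness(species_per_tree):
    """Map each species to a label, given one species name per tree."""
    counts = Counter(species_per_tree)
    return {species: classify(n) for species, n in counts.items()}


# Made-up sample: three rare trees, twenty uncommon ones, a hundred common ones.
trees = ["Agathis robusta"] * 3 + ["Ginkgo biloba"] * 20 + ["Platanus x acerifolia"] * 100
print(interestingness(trees))
```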

Build a website

It’s very easy to display a tiled, interactive map in a browser, using Leaflet.JS and Mapbox’s extensions. It’s a lot more work to turn that into an interesting website. A couple of the main features:

The base CSS is Twitter Bootstrap, mostly because I don’t know any better.
Mapbox.js handles the interactivity, but I intercept clicks (map.gridLayer.on) to look up the species and genus on Wikipedia. It’s straightforward using jQuery but I found it fiddly due to unfamiliarity. The Wikipedia API is surprisingly rough, and doesn’t have a proper page of its own – there’s the MediaWiki API page, the Wikipedia API Sandbox, and this useful StackOverflow question which that community helpfully shut down as a service to humanity.
To make embedding the page in other sites (such as Open Council Data trees) work better, the “?embed” URL parameter hides the titlebar.
You can go straight to certain councils with bookmarks: opentrees.org/#adelaide
I found the fonts (the title font is “Lancelot“) on Adobe Edge.
The header background combines the forces of subtlepatterns.com and px64.net.

Source code for treesmap.html, treesmap.js, treesmap.css.
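
For anyone attempting the Wikipedia lookup, here is a sketch in Python 3 of the kind of request involved. The site itself does this from the browser with jQuery, and I’m not claiming these are the exact parameters treesmap.js uses; this is just one reasonable way to ask the MediaWiki API for the introductory text of a page. The code only builds the URL and does not fetch it.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def wikipedia_extract_url(title):
    """Build a MediaWiki API URL asking for the intro extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",  # page text extracts
        "exintro": 1,        # only the section before the first heading
        "redirects": 1,      # follow redirects, e.g. from a synonym
        "format": "json",
        "titles": title,
    }
    return WIKIPEDIA_API + "?" + urlencode(params)


print(wikipedia_extract_url("Eucalyptus camaldulensis"))
```

A sensible fallback, in the spirit of the site, would be to try the species first and then the genus if no page comes back.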

And of course there’s a server component as well. The lightweight tilelive_server, written mostly by Yuri Feldman, glues together the necessary server-side bits of MapBox’s technology. I pre-generate a large-ish chunk of map tiles, then the rest are computed on demand. This bit of nginx code makes that work (well, after tilelive_server generated 404s appropriately):
location /treetiles/ {
    # Redirect to TileLive. If tile not found, redirect to TileMill.
    rewrite_log on;
    rewrite ^.*/(\d+)/(\d+)/(\d+.*)$ /supertrees_c8887d/$1/$2/$3 break;
    proxy_intercept_errors on;
    error_page 404 = @dynamictiles;
    proxy_set_header Host $http_host;
    proxy_pass http://127.0.0.1:5044;
    proxy_cache my-cache;
}
location @dynamictiles {
    rewrite_log on;
    rewrite ^.*/(\d+)/(\d+)/(\d+.*)$ /tile/supertrees/$1/$2/$3 break;
    proxy_pass http://guru.cycletour.org:20008;
    proxy_cache my-cache;
}
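
If nginx config isn’t your thing, the logic amounts to “try the pre-generated tile, and on a 404 ask the dynamic renderer instead”. Here is the same idea as a Python sketch. The two fetcher functions are stand-ins that I made up (each takes a path and returns a status code and a body); only the regular expression and the two path prefixes come from the config above.

```python
import re

# Same pattern as the nginx rewrite rules: .../zoom/x/y.ext
TILE_PATH = re.compile(r"^.*/(\d+)/(\d+)/(\d+.*)$")


def rewrite_tile_path(path, prefix):
    """Rewrite an incoming tile path onto a backend prefix, or None if it doesn't match."""
    match = TILE_PATH.match(path)
    if match is None:
        return None
    zoom, x, y = match.groups()
    return "{}/{}/{}/{}".format(prefix, zoom, x, y)


def fetch_tile(path, fetch_pregenerated, fetch_dynamic):
    """Try the pre-generated tiles first; fall back to dynamic rendering on a 404."""
    target = rewrite_tile_path(path, "/supertrees_c8887d")
    if target is None:
        return 404, b""
    status, body = fetch_pregenerated(target)
    if status == 404:
        status, body = fetch_dynamic(rewrite_tile_path(path, "/tile/supertrees"))
    return status, body


print(rewrite_tile_path("/treetiles/12/3701/2513.png", "/supertrees_c8887d"))
```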

Too hard basket

A really obvious feature would be to show native and introduced species in different colours. Try as I might, I could not find any database with this information. There are numerous online plant databases, but none seemed to have this information in a way I could access. If you have ideas, I’d love to hear from you.

It would also be great to build a mobile app, so you can easily answer the question “what is this tree in front of me”, and who knows what else.

In conclusion

Dear councils,

Please release datasets such as tree inventories, garbage collection locations and times, and customer service centres, following the open standards at opencouncildata.org. We’ll do our best to make fun, interesting and useful things with them.

Love,

The open data community

mapmaking api, data.gov.au, data.melbourne.vic.gov.au, data.sa.gov.au, datamelb, leaflet, mapbox, melbdataguru, open data, opencouncildata, opentrees.org, plants, tilemill, trees, wikipedia

Digital humanities for beginners: get started with the Trove API

Posted by steveko on February 5, 2014

Trove is the National Library of Australia’s “discovery interface” – an amazing catalogue of books, newspapers, maps, music, journal articles etc. The Trove API is a special website that programs can talk to in order to run queries, retrieve individual records and so on. With some creative ideas and a bit of coding skill, you could make a pretty nifty tool and win a digital humanities award.

The documentation is pretty thorough, but doesn’t tell you how to get started. These days, web development is mostly about assembling the right tools and putting them together, but it can be hard for a novice to know what they need.

This guide is for you if:

You want to try exploring Trove programmatically; and
You’ve done some coding before; but
You’re not very familiar with coding against an API, or writing web applications.

In short, if you think the digital humanities might be for you.

What we’ll need

A server on which to run your web application. You can use your laptop directly, but better to use a virtual machine inside your laptop. Even better, a server running on the internet, like a VM on the NeCTAR Research Cloud.
A web application framework: Flask and Python
A library that makes it easy to make requests to other web servers (we’ll use Requests)
A JavaScript framework which makes dynamically modifying web pages much easier (we’ll use jQuery)
A CSS framework, which makes your page look attractive with no effort (we’ll use Twitter Bootstrap)
A templating engine, to combine HTML and the results of our queries into a web page (we’ll use Jinja2)
Our server logic, expressed as code in files.

A server

As an Australian academic, you can create your own server on the NeCTAR Research Cloud. That would be the best option, but can be a bit fiddly for a novice so we won’t explain that here.

The next best option is to create a server on a virtual machine inside your laptop, using VirtualBox and Vagrant. That keeps everything related to this one project self-contained, which is a big bonus when you start working on other projects. It also creates a pristine, controlled Linux environment which means you’re less likely to run into issues caused by the peculiarities of your own system. Try the Vagrant Getting Started guide.

As a last resort, the simplest approach is to just install stuff directly on your computer. This may work ok to begin with.

Building a web application

We’re going to build a web application that will talk to Trove. This isn’t the only approach. You could:

Build a command line tool that pulls stuff out of Trove and saves it to disk
Make a desktop application that a user would have to download and install
Make a static website (not a web application) that does everything with JavaScript.

But a web application is probably what you’re going to want eventually and is the easiest to share with other people.

We’ll use Python, the programming language, and Flask, a web application framework for Python.

Why Flask? It’s easy to install, lightweight, and is well suited to experimental programming like this. (An alternative would be Django, which would be much better if you wanted to store stuff in a database and write something big and scalable.)

You can install these on the command line like this:

sudo apt-get install -y python-pip
sudo pip install flask

That is, first you install Python’s “pip” installer. Pip then knows how to install Python packages like Flask.

Whatever server you’re using, create a directory somewhere (perhaps under your home directory), called ‘trovetest’.

cd ~
mkdir trovetest
cd trovetest

Making requests to the Trove API

The Trove documentation says that to perform a query, we need to access a URL like this:

http://api.trove.nla.gov.au/result?key=<INSERT KEY>&zone=book&q=%22piers%20anthony%22

You might think you need to construct that string yourself, complete with question marks, equals signs and %22’s. You could do that using Python’s built-in urllib2 library:

import urllib2
apikey='1b2c3d4f'
troveURL = 'http://api.trove.nla.gov.au/'
zone = 'book'
search = '%22' + 'piers' + '%20' + 'anthony' + '%22'
r = urllib2.urlopen(troveURL + 'result' + '?' + 'key=' + apikey + '&' + 'zone=' + zone + '&' + 'q=' + search).read()

But it’s easier than that, if we use the Requests library.

import requests
apikey = '1b2c3d4f'
troveURL = 'http://api.trove.nla.gov.au/'
search = '"piers anthony"'  # plain text, quotes and all: Requests does the encoding for us
r = requests.get(troveURL + 'result/', params={
    'key': apikey,
    'zone': 'book',
    'q': search
})

So, install the Requests library as well:

sudo pip install requests
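
If you want to see what that encoding step actually does, Python’s standard library can show it without making any request. This is a Python 3 sketch using urllib.parse (the examples above are Python 2), and the API key is the same dummy value as before. Requests may write spaces as “+” rather than “%20”; both mean the same thing in a query string.

```python
from urllib.parse import quote, urlencode


def build_query_string(params):
    """Percent-encode a dict of parameters into a URL query string."""
    return urlencode(params, quote_via=quote)


params = {"key": "1b2c3d4f", "zone": "book", "q": '"piers anthony"'}
print("http://api.trove.nla.gov.au/result?" + build_query_string(params))
# prints http://api.trove.nla.gov.au/result?key=1b2c3d4f&zone=book&q=%22piers%20anthony%22
```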

jQuery

These days, virtually all web applications use a JavaScript framework, to make manipulating the web page more practical. We’ll use jQuery, which is also required by Twitter Bootstrap.

Create a directory under ‘trovetest’ called ‘static’, and download and unzip the latest jQuery.

CSS Framework

Making a web page look good using straight CSS (cascading style sheets) is really hard. Making it look good in every browser and device is a nightmare. Save yourself the pain, and start from somewhere sensible like Bootstrap, made by Twitter. (By default, it looks a bit like Twitter’s website, but you can change that.)

Without Twitter Bootstrap

Adding Bootstrap and minimal changes.

Although it’s best to download both jQuery and Bootstrap to your server and link to them there, you can get started quickly by linking to them on the web. Just include this text at the top of your HTML files.

<head>
<script src="http://code.jquery.com/jquery-1.10.1.min.js"></script>
<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/css/bootstrap.min.css">
<script src="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/js/bootstrap.min.js"></script>
</head>

Templating engine

The most common way to build a web page dynamically is by writing a “template” of the HTML page. It can be as simple as this:

<body>
<h1>Request complete</h1>
You requested this article: {{ article.name }}
</body>

When you tell the web application framework to render the page, you pass in the list of variables – in this case, an object called ‘article’ with an attribute ‘name’.

We’ll use the Jinja2 templating engine, because that’s what Flask uses.
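
To demystify what a templating engine does with those double curly braces, here is a toy version in plain Python. To be clear, this is not Jinja2 and not how Jinja2 is implemented; it only handles simple {{ name.attribute }} placeholders, and the function names and sample article are made up.

```python
import re

# Matches {{ something }} or {{ something.nested }}
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(context, dotted_name):
    """Follow a dotted name like 'article.name' through dicts or object attributes."""
    value = context
    for part in dotted_name.split("."):
        if isinstance(value, dict):
            value = value.get(part, "")
        else:
            value = getattr(value, part, "")
    return value


def render(template, context):
    """Replace every placeholder in the template with its value from the context."""
    return PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), template)


page = render(
    "You requested this article: {{ article.name }}",
    {"article": {"name": "A Spell for Chameleon"}},
)
print(page)
```

The real thing adds loops, conditionals, filters and HTML escaping, which is why you want Jinja2 rather than a regular expression.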

Our server logic

Now that everything is in place, let’s write a simple application that will simply make one request. We’ll ask the Trove API for a list of newspaper articles mentioning Piers Anthony (to follow their documentation’s example).

Create this file as ~/trovetest/simple.py

from flask import Flask, render_template, request  # load Flask itself
import requests, json  # load Requests library, and JSON library for interpreting responses

app = Flask(__name__)  # create our web application object, named after this file
apikey = '1b2c3d...'  # insert your API key, following instructions here: http://trove.nla.gov.au/general/api
troveURL = 'http://api.trove.nla.gov.au/'  # all Trove API URLs start with this

@app.route("/list")  # this is what we do when someone goes to "http://localhost:5000/list"
def list_articles():  # the function that defines the behaviour (not named "list", to avoid hiding Python's built-in)
    query = {  # this dict structure contains the parameters that make up a URL
        'key': apikey,  # following the Trove API documentation.
        'zone': 'newspaper',
        'q': 'piers anthony',
        'encoding': 'json',  # we specify 'json' as the format because json is easy to parse.
        'include': 'tags'
    }
    r = requests.get(troveURL + 'result/', params=query)

    # Requests will transform the params property into a query string like
    # ?key=...&zone=newspaper&q=piers+anthony&encoding=json&include=tags
    # (the order of the parameters may vary)

    # After calling requests.get(), r is now an object containing lots of information
    # about the response - error codes, content etc.

    r.encoding = 'ISO-8859-1'  # We avoid some unicode conversion errors by adding this step.
    results = r.json()  # Convert the text we receive (in JSON format) into a Python dict
    return render_template('list.html', results=results)  # Render the template, passing through that dict

if __name__ == '__main__':
    # Run the defined web server, allowing anyone to connect, in debug mode.
    # (This is not safe for a public web server.)
    app.run(host='0.0.0.0', debug=True)

And save this as ~/trovetest/templates/list.html:

<!doctype html>
<head>
    <title>Trove test</title>
    <script src="http://code.jquery.com/jquery-1.10.1.min.js"></script>
    <link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/css/bootstrap.min.css">
    <script src="http://netdna.bootstrapcdn.com/bootstrap/3.1.0/js/bootstrap.min.js"></script>
</head>
<body>
    <h1>Results</h1>

    <ul>
    {% for article in results.response.zone.0.records.article %}
      <li>
        <a href="{{ article.troveUrl }}">{{ article.title.value }}</a> - <a href="item/{{ article.id }}">More info</a>
        <br/>
        <i><small>{{ article.snippet | safe}}</small></i>
      </li>
    {% endfor %}
    </ul>
</body>
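
The template reaches quite deep into the response (results.response.zone.0.records.article), so it can help to poke at the same path in Python first. Here is a small sketch; the helper names are mine and the sample dict is invented, shaped only from the fields the template uses. A real response carries more fields, and the two URLs in the sample are placeholders.

```python
def extract_articles(results):
    """Follow response -> zone[0] -> records -> article, returning [] if anything is missing."""
    zones = results.get("response", {}).get("zone", [])
    if not zones:
        return []
    return zones[0].get("records", {}).get("article", [])


def summarise(results):
    """Return (id, title, troveUrl) for each article in the response."""
    return [
        (a.get("id"), a.get("title", {}).get("value"), a.get("troveUrl"))
        for a in extract_articles(results)
    ]


# Invented sample data with the same nesting the template expects.
sample_results = {
    "response": {
        "zone": [
            {
                "name": "newspaper",
                "records": {
                    "article": [
                        {"id": "1", "title": {"value": "The Example Times"}, "troveUrl": "http://example.org/1"},
                        {"id": "2", "title": {"value": "The Sample Herald"}, "troveUrl": "http://example.org/2"},
                    ]
                },
            }
        ]
    }
}

for article_id, title, url in summarise(sample_results):
    print(article_id, title, url)
```

Returning an empty list when a level is missing means a query with no hits produces an empty page rather than an error.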

Run your server

On the command line, tell Python (and hence Flask) to run your application:

~/trovetest$ python simple.py
 * Running on http://0.0.0.0:5000/
 * Restarting with reloader

Flask will sit there waiting for someone to connect to your webserver in a browser.

In your browser, go to http://localhost:5000/list

The results of our little Trove query.

What next?

That was a very quick, high level view of all the pieces you need. If you made it through the example, you’ll be in an excellent position to start making interesting web applications building on the Trove API. There’s no shortage of information on the web about all these technologies – the hard bit is knowing what you need to know.

All sorts of fun things can be done:

Cool exploration interfaces
Bots
Visualisations like graphs, charts etc
Text mining and analysis

Tim Sherratt (aka @wragge) has made many fun creations with the Trove API, documented on his discontents.com.au blog. Tim works for the Trove team, who are generally pretty happy to help.

e-Research api, application programming interface, css, digital humanities, flask, jinja2, jquery, python, trove, trove api, twitter bootstrap
