Every dataset gets rated against Tim Berners-Lee's 5 Stars of Openness using an algorithm that has to cope with a number of technical challenges. For example, it is relatively simple to check that a link works, but how do you tell a CSV file apart from a text file that has lots of commas in it? I'm keen to share what we have done and welcome feedback for improving it. As always, the code is open source Python on GitHub, so anyone can play around with it, modify it and submit improvements.
We run the algorithm as a background process when a dataset is added to data.gov.uk and then again at weekly intervals, in case the file goes missing or gets updated.
The first job is to make a request to the 'Download' URL and see what comes back. Commonly it returns a file to download, but there were plenty of other eventualities to cope with:
- temporary connectivity problems requiring some retries - switching from Python's urllib2 to the 'requests' module handles this.
- 30X redirects need to be followed before you get to the file - again 'requests' deals with this.
- URLs were failing because of spaces at the start or the end - it appears that Web browsers strip these out automatically.
- colons in the URL parameters are legal but caused errors - upgrading to the latest 'requests' solved this.
- what if the file is massive? Or has no end? - we choose not to download files over a certain size threshold, as determined from the Content-Length header. If that header is missing or incorrect, we cancel the download as soon as it goes over the threshold.
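As an illustration of the size check described in the last item, here is a minimal Python 3 sketch. It is not the code we run: the names clean_url, read_capped and TooLarge and the 50 MB figure are assumptions made for the example. It first trusts the Content-Length header when present, then counts bytes as they arrive in case the header was missing or wrong.

```python
MAX_BYTES = 50 * 1024 * 1024  # assumed threshold, for illustration only


class TooLarge(Exception):
    """Raised when a download is, or turns out to be, over the threshold."""


def clean_url(url):
    # Strip stray spaces at either end, as web browsers appear to do.
    return url.strip()


def read_capped(chunks, content_length=None, max_bytes=MAX_BYTES):
    """Collect an iterable of byte chunks, giving up once over max_bytes."""
    try:
        declared = int(content_length)
    except (TypeError, ValueError):
        declared = None  # header missing or malformed
    if declared is not None and declared > max_bytes:
        raise TooLarge('Content-Length of %d exceeds threshold' % declared)
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        # The header may be absent or wrong, so keep counting as we go.
        if len(received) > max_bytes:
            raise TooLarge('Download exceeded threshold')
    return bytes(received)


# With the 'requests' module this might be wired up roughly as:
#   response = requests.get(clean_url(url), stream=True, timeout=30)
#   body = read_capped(response.iter_content(chunk_size=8192),
#                      response.headers.get('Content-Length'))
```

Passing stream=True matters here, since otherwise 'requests' reads the whole body before you get a chance to stop it.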
Our original version of the code started by doing a HEAD request to get basic info about the file, such as its size. But we found plenty of sites returned a range of misleading responses - including '405 Not Supported', '503 Service Unavailable', '403 Not Allowed' or no connection at all, when the equivalent GET worked fine.
Where a link is broken, we keep track of attempts in a report for the publisher and also display it when you hover over the star rating. For example, it might say:
Download failed: Server responded: 404 Not Found. There have been 5 failed attempts from 2/10/12. The content was last successfully accessed on 28/9/12.
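To show the kind of bookkeeping behind a message like that, here is a small Python 3 sketch. The LinkStatus class and its field names are hypothetical, made up for this example rather than taken from our code; it simply counts consecutive failures and remembers the last success.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


def _short(d):
    # Day/month/two-digit year without zero padding, e.g. 2/10/12.
    return f'{d.day}/{d.month}/{d:%y}'


@dataclass
class LinkStatus:
    """Hypothetical record of download attempts for one link."""
    failure_count: int = 0
    first_failure: Optional[date] = None
    last_success: Optional[date] = None
    last_reason: str = ''

    def record_success(self, when):
        # A success resets the run of failures.
        self.failure_count = 0
        self.first_failure = None
        self.last_reason = ''
        self.last_success = when

    def record_failure(self, when, reason):
        if self.failure_count == 0:
            self.first_failure = when
        self.failure_count += 1
        self.last_reason = reason

    def summary(self):
        if self.failure_count == 0:
            return ''
        parts = [
            f'Download failed: {self.last_reason}.',
            f'There have been {self.failure_count} failed attempts '
            f'from {_short(self.first_failure)}.',
        ]
        if self.last_success is not None:
            parts.append('The content was last successfully accessed on '
                         f'{_short(self.last_success)}.')
        return ' '.join(parts)
```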
We calculate a hash of the file contents so that we can tell if it has been updated the next time we download it. We also record the Content-Type that the server provides, as a clue to the file type (although it is a pretty unreliable one in our experience).
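Hashing the contents as they arrive could look something like the Python 3 sketch below. The choice of SHA-1 and the function names content_hash and has_changed are assumptions for the example; any stable digest would do the job of spotting a changed file.

```python
import hashlib


def content_hash(chunks):
    """Return a hex digest for an iterable of byte chunks."""
    digest = hashlib.sha1()  # assumed algorithm, for illustration
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def has_changed(previous_hash, chunks):
    # Compare against the digest stored at the previous download.
    return content_hash(chunks) != previous_hash
```

Feeding the digest chunk by chunk means the file never has to be held in memory in one piece.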
The file is saved to data.gov.uk's server and becomes available as the 'Cached' copy on data.gov.uk. Hopefully when a public body rearranges their website and files become unavailable, this cache will be of benefit. Note that The National Archives also takes copies of files, but their remit is slightly different - they store the complete website, but only for the central government departments and agencies.
Now we have the file stored, we look at its contents to determine the format of the file, since that determines the number of stars it gets. There are various tools that do this, but we couldn't find one that covered all the data formats we focus on, and some were most unreliable!
The process starts with two open libraries that examine a data file for signatures of well-known formats - 'magic' and 'file'. These detected PDFs, Torrent index files and many office files fine. However 'magic' categorised older Excel and PowerPoint files as Word files. And plenty of files just come out as 'binary' or not understood at all, so we had to cross-reference the results with 'file'.
It was recently suggested to us that we use the National Archives' Droid tool, and in tests I found it extremely accurate at identifying binary formats like Microsoft documents, Open Spreadsheet (ODS) and Shapefiles, so we hope to harness it in future.
However plenty of our data formats were not recognised or differentiated, so more code was needed:
- CSV files are mainstays of open data, yet 'magic' and 'file' categorised them simply as plain text. Although there are plenty of tools for reading CSVs, CSV is not a tight standard at all and the different dialects cause problems. We have CSVs that quote all, some or no items, whilst others miss off commas between blank cells at the end of a row, and some use tabs or pipes instead of commas. Python's 'csv' library is pretty good at much of this, but we turned to OKF's messytables to cope with all the different CSV properties. We found some files with non-standard character encodings, so we contributed a fix to messytables that uses 'chardet' to detect and cope with this.
- JSON data is another wonderful format for on-line data, yet amazingly none of the tools detected it! Rather than using a standard JSON parsing library, that would require the full file to be loaded and parsed (possibly tens of megabytes worth!), we wrote a simple parser with a state machine, to see if the first chunk of the file was legal JSON.
- Many files were detected as just 'XML', but we needed to pick out if this was RDF, KML, XHTML, IATI, WMS or RSS. We did this with some regular expressions. These had to avoid being tripped up by the presence or absence of <xml> and <!doctype> tags and by whitespace. Curiously, some files also started with a couple of random binary characters.
- Some HTML files have embedded RDFa data in them, so we used another regex to check for those.
The most interesting problem was differentiating a CSV from a file with lots of commas in it. For example, a CSV might have a very wordy start:
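To give a flavour of the XML item above, here is a rough Python 3 sketch of regex-based sniffing. It is not our actual set of expressions: the function name sniff_xml_subtype and the table of root element names are assumptions for the example. It skips any junk before the first '<', then any XML declaration, doctype or comments, and looks at the name of the first real element.

```python
import re

# Optional declaration, doctype and comments before the root element.
_PROLOG = re.compile(
    r'(?:\s*(?:<\?.*?\?>|<!DOCTYPE[^>]*>|<!--.*?-->))*\s*',
    re.IGNORECASE | re.DOTALL)
# Root element, capturing its local name without any namespace prefix.
_ROOT = re.compile(r'<(?:[A-Za-z_][\w.-]*:)?([A-Za-z_][\w.-]*)')

# Lower-cased root element names mapped to formats (illustrative list).
ROOT_TO_FORMAT = {
    'rdf': 'RDF',
    'kml': 'KML',
    'html': 'XHTML',
    'iati-activities': 'IATI',
    'iati-organisations': 'IATI',
    'wms_capabilities': 'WMS',
    'wmt_ms_capabilities': 'WMS',
    'rss': 'RSS',
}


def sniff_xml_subtype(chunk):
    """Guess the XML flavour from the first chunk of a file (bytes)."""
    text = chunk.decode('utf-8', errors='replace')
    start = text.find('<')  # ignore stray leading bytes or whitespace
    if start == -1:
        return None
    text = text[start:]
    prolog = _PROLOG.match(text)  # always matches, possibly empty
    root = _ROOT.match(text, prolog.end())
    if root is None:
        return None
    return ROOT_TO_FORMAT.get(root.group(1).lower(), 'XML')
```

A doctype with an internal subset would fool this simple prolog pattern, which is the sort of edge case that makes this job fiddly.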
Electricity - kWh
"* please see note number five in the notes section on http://www.carbonculture.net/orgs/cabinet-office/70-whitehall/","**Please note: empty cells representing half-hourly consumption mean that there is no data available for that time; numbers followed by an 'E' are unreliable numbers"
and a text file (explaining a data format) with lots of commas. It parses fine as CSV, but it isn't data and shouldn't get 3 stars!:
UK Rainfall (mm)
We came up with a threshold based on numbers of cells per row, over a number of rows. It seems to meet the current needs, but we can't guarantee it will always be correct.
And then there are formats which we've simply not attempted to detect, like this one:
Areal series, starting from 1910
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May, Summer=June-Aug, Autumn=Sept-Nov. (Winter: Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where values are equal, rankings are based in order of year descending.
Data are provisional from March 2012 & Spring 2012. Last updated 03/09/2012
Year JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
1910 111.4 126.1 49.9 71.8 70.2 97.1 140.2 27.0 89.4 128.4 142.2
1911 59.2 99.7 62.1 69.0 52.2 77.0 43.3 69.4 91.5 141.3 188.4
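Returning to the cells-per-row threshold for telling CSV data from prose with commas, the idea can be sketched in Python 3 as below. The function name looks_like_csv_data and all three numbers (rows sampled, minimum cells, minimum fraction of rows) are assumptions for the example, not the values we actually use.

```python
import csv
import io
from collections import Counter


def looks_like_csv_data(text, sample_rows=20, min_cells=3,
                        min_fraction=0.6):
    """Guess whether text is tabular CSV rather than prose with commas."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if row:  # ignore blank lines
            rows.append(row)
        if len(rows) >= sample_rows:
            break
    if len(rows) < 2:
        return False
    # Find the most common number of cells per row in the sample.
    widths = Counter(len(row) for row in rows)
    common_width, count = widths.most_common(1)[0]
    # Data should have enough columns, and most rows should agree on it.
    return common_width >= min_cells and count / len(rows) >= min_fraction
```

Using the most common row width, rather than the first row, is what lets a CSV with a wordy start still pass, while prose whose comma counts vary from line to line does not.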
Finally, once we know the file format, it is relatively simple to map that to a star rating. There is a bit of debate about whether the newer XLSX format is open or not, but we decided that, whatever the legal position, it was not as accessible to users as CSV, so we give the few XLSX files we have 2 stars and CSV gets 3.
We had trouble automatically differentiating between 4 and 5 stars - analysis of the vocabularies and links requires some judgement and we'd love to hear suggestions about tackling this part. At the moment anything in RDF gets 5 stars, and the bulk of our RDF is Organograms, for which that is correct.
We've yet to detect APIs well, and doing so raises questions about how many stars to award.
And finally, perhaps the most contentious decision was to mark HTML pages as 0 stars. A good number of data.gov.uk links go not to data, but to an HTML page. Sometimes these are nice web apps providing a human interface to the data. And often these (such as hundreds of ONS datasets) provide extra useful metadata as well as an obvious link to the data. These are pretty good quality, and you could argue for a decent star rating. But all too often the link to the data is hidden in the page, or the page is just a 'search the website' or 'file not found' page, or it requires registration, and these are definitely ones we want to score as 0. In a bid to encourage all datasets to provide a simple download of all their data, and in the absence of a more sophisticated algorithm, we score all HTML pages 0.
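The mapping itself can be as simple as a lookup table. The Python 3 sketch below includes only the four ratings mentioned in this post; the names STARS_BY_FORMAT and stars_for_format are made up for the example, and formats not listed here are deliberately left unrated rather than guessed at.

```python
# Ratings stated in this post; other formats are omitted from the sketch.
STARS_BY_FORMAT = {
    'HTML': 0,  # not a direct download of the data
    'XLSX': 2,  # judged less accessible to users than CSV
    'CSV': 3,
    'RDF': 5,   # currently all RDF gets 5 stars
}


def stars_for_format(detected_format):
    """Return the star rating, or None if the format is not in the table."""
    if detected_format is None:
        return None
    return STARS_BY_FORMAT.get(detected_format.upper())
```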
As you can see, a fair amount of effort has gone into getting these 5 star ratings working reasonably smoothly, but there is still plenty of scope for improvement and feedback is most welcome.
The code is implemented as CKAN extensions based on Celery message queue tasks.