Rating the datasets - how many stars?

data.gov.uk continues to open up large quantities of data that was previously closed. Government data has become 'open by default' and the site has grown from the initial 1000 datasets to an amazing 8000 - one of the most successful open data movements anywhere.

But occasionally we get a message from a member of the public who is surprised to find the data they expected is not at the end of the 'Download' link. Maybe the data has moved, or it is not in the format advertised, or it is no longer available at all. Many will be surprised to hear that a small number of datasets require registration before access, or can only be posted to you on CD-ROM, or even require fees.

With 8000 datasets being listed on data.gov.uk by over a thousand different individuals across 700 bodies over the last three years, it's not surprising that there are some issues and variance in quality.

Tim Berners-Lee set the tone for data releases: it started with "Raw Data Now!" (just get it out there, whatever the current quality) but he also set out what it should aspire to - the Five Stars of Openness:

Data gets three stars if it is currently available (not a broken link), openly licensed (no particular legal restrictions on reuse), structured and in an open format. Plenty of datasets on data.gov.uk are like this, such as spreadsheet tables stored in CSV files, or geographical boundaries in KML.

To get all the way to five stars it needs to be linked to other datasets on the internet. To do this, all its data points are made available at separate addresses on the internet, the data properties are expressed in common standards, and links to other datasets are added. For more about Linked Data, see: What is Linked Data.

When data.gov.uk was relaunched in June, every dataset was given a star rating. We've been working to improve the rating algorithm and have this week added the ratings to the search page, so you can see what the distribution of stars is across all the datasets, and compare the quality of data from different departments.
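
As a rough illustration of the two views just described (the overall distribution of stars, and a comparison between publishers), here is a minimal Python sketch. The publisher names, the ratings, and both function names are made up for the example; they are not real data.gov.uk figures or code.

```python
from collections import Counter

# Made-up sample: (publisher, stars) pairs. Not real data.gov.uk figures.
SAMPLE_RATINGS = [
    ("Department A", 0),
    ("Department A", 3),
    ("Department B", 1),
    ("Department B", 3),
    ("Department B", 5),
]


def star_distribution(ratings):
    """Count how many datasets received each star score."""
    return Counter(stars for _, stars in ratings)


def average_by_publisher(ratings):
    """Return the mean star score for each publisher."""
    totals = {}
    for publisher, stars in ratings:
        count, total = totals.get(publisher, (0, 0))
        totals[publisher] = (count + 1, total + stars)
    return {name: total / count for name, (count, total) in totals.items()}


if __name__ == "__main__":
    print(dict(star_distribution(SAMPLE_RATINGS)))
    print(average_by_publisher(SAMPLE_RATINGS))
```

An average is only one way to compare publishers; showing the full distribution for each body would hide less.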

We firmly believe that the way to improve the quality is to make these scores public. It may well influence a small council to put out its spend data as a spreadsheet rather than a PDF, or maybe a spreadsheet of poverty indicators or international aid donations could be upgraded to a format that allows comparisons internationally.

With high quality standards expected, and a rating algorithm that will evolve in sophistication as we iterate further, many of the ratings might appear harsh:

  • If a dataset just has a link to another web page that requires you to hunt around for the actual data, we award it 0 stars.
  • If data is offered not under the Open Government Licence but under its own terms and conditions, we award it 0 stars. (You might not easily know whether you can even print off that dataset.)
  • Some PDF files are produced pretty well, containing embedded spreadsheets - but that makes it difficult for a user's automatic tool, so we score that the same as a bad PDF scan - 1 star.
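
The rules above can be sketched roughly in Python as follows. This is an illustration only, not the actual data.gov.uk rating algorithm: the function name, its parameters, and the format table are assumptions made for the example. The 2-star and 4-star rungs, and the 1-star default for unknown formats, follow Tim Berners-Lee's general five-star scheme rather than anything stated in this post.

```python
# Hypothetical sketch of a star rating; not the data.gov.uk implementation.
# The format table is an assumption based on the examples in the post.
FORMAT_STARS = {
    "pdf": 1,  # hard for automatic tools, even when well produced
    "xls": 2,  # structured but proprietary (assumed, per the five-star scheme)
    "csv": 3,  # structured and in an open format
    "kml": 3,
    "rdf": 4,  # data points have their own addresses (assumed)
}


def star_rating(link_ok, is_landing_page, openly_licensed, file_format,
                links_to_other_data=False):
    """Return a 0-5 star score for one dataset under the assumed rules."""
    # A broken link, or a page you must hunt around on, earns nothing.
    if not link_ok or is_landing_page:
        return 0
    # Without an open licence, reuse rights are unclear: 0 stars.
    if not openly_licensed:
        return 0
    # Unknown formats default to 1 star (an assumption).
    stars = FORMAT_STARS.get(file_format.lower(), 1)
    # The fifth star needs links to other datasets on top of 4-star data.
    if stars == 4 and links_to_other_data:
        stars = 5
    return stars
```

For example, under these assumptions `star_rating(True, False, True, "csv")` gives 3, while the same file behind custom terms and conditions gives 0.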

Read more about it here: 5 Stars Rating Algorithm

So when you look today and see that half of the datasets get 0 stars, be proud that this government sets the bar of quality high, and is brave enough to be open about not only what is good, but what is not good enough.

Comments

Rating Data Quality

The "openness" rating is a great idea and one I haven't considered before, however my first thought when reading the title of this blog post was that it would be discussing rating the "quality" of the data.  This is always highly contentious and subjective, so how could it be achieved? The original data provider can give an assessment, however this must be as transparent as possible, and must also be comparable between datasets and providers, therefore some sort of framework is required.  An indication could be derived from mandatory elements within the accompanying metadata, with an option for additional validation by crowd-sourced ratings. The age, breadth, collection techniques and spatial scale will hav
