5 Stars Rating Algorithm

Every dataset gets rated against Tim Berners-Lee's 5 Stars of Openness using an algorithm that has to cope with a number of technical challenges. For example, it is relatively simple to check that a link works, but how do you tell a CSV file apart from a text file that has lots of commas in it? I'm keen to share what we have done and welcome feedback for improving it. As always, the code is open source Python on GitHub, so anyone can play around with it, modify it and submit improvements.

We run the algorithm as a background process when a dataset is added to data.gov.uk and then again at weekly intervals, in case the file goes missing or gets updated.

The first job is to make a request to the 'Download' URL and see what comes back. Commonly it returns a file to download, but there were plenty of other eventualities to cope with:

  • temporary connectivity problems requiring some retries - switching from Python's urllib2 to the 'requests' module handles this.
  • 30X redirects need to be followed before you get to the file - again 'requests' deals with this.
  • URLs were failing because of spaces at the start or the end - it appears that web browsers strip them out automatically.
  • colons in the URL parameters are legal but caused errors - upgrading to the latest 'requests' solved this.
  • what if the file is massive? Or has no end? - we choose not to download files over a certain threshold, determined from the Content-Length header. If that header is missing or incorrect, we cancel the download as soon as it goes over the threshold.
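A minimal sketch of the size cap described in the last bullet, not the production code. The names `MAX_BYTES`, `DownloadTooLarge`, `clean_url` and `read_capped` are hypothetical, and the threshold value is invented for illustration:

```python
from typing import Iterable, Optional

# Hypothetical threshold; the real value is not stated in this post.
MAX_BYTES = 50 * 1024 * 1024


class DownloadTooLarge(Exception):
    """Raised when a download is abandoned for exceeding the size threshold."""


def clean_url(url: str) -> str:
    # Strip stray whitespace from either end, as web browsers appear to do.
    return url.strip()


def read_capped(chunks: Iterable[bytes],
                content_length: Optional[str] = None,
                max_bytes: int = MAX_BYTES) -> bytes:
    """Collect body chunks, giving up as soon as the size threshold is passed."""
    # First check: use the Content-Length header if it is present and numeric.
    if content_length is not None and content_length.strip().isdigit():
        if int(content_length) > max_bytes:
            raise DownloadTooLarge("Content-Length is %s bytes" % content_length.strip())

    # Second check: the header may be missing or wrong, so count the bytes
    # actually received and cancel once the running total goes over.
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > max_bytes:
            raise DownloadTooLarge("received more than %d bytes" % max_bytes)
    return bytes(received)
```

With the 'requests' module, one way to feed this would be `response = requests.get(clean_url(url), stream=True)` followed by `read_capped(response.iter_content(chunk_size=8192), response.headers.get('Content-Length'))`, so that the body is only pulled down chunk by chunk.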

Our original version of the code started by doing a HEAD request to get basic info about the file, such as its size. But we found plenty of sites returned a range of misleading responses - including '405 Method Not Allowed', '503 Service Unavailable', '403 Forbidden' or no connection at all - when the equivalent GET worked fine.

Where a link is broken, we keep track of attempts in a report for the publisher and also display it when you hover over the star rating. For example, it might say:

Download failed: Server responded: 404 Not Found. There have been 5 failed attempts from 2/10/12. The content was last successfully accessed on 28/9/12.
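A message like this could be assembled from a few tracked values. The sketch below is only illustrative; `format_date` and `failure_message` are hypothetical helpers, and how the attempts are actually stored is not described here:

```python
from datetime import date
from typing import Optional


def format_date(d: date) -> str:
    # Day/month/two-digit year without zero padding, as in the example message.
    return "%d/%d/%02d" % (d.day, d.month, d.year % 100)


def failure_message(reason: str, failures: int, first_failure: date,
                    last_success: Optional[date]) -> str:
    """Build the hover text for a broken link from the tracked attempt data."""
    message = "Download failed: %s. There have been %d failed attempts from %s." % (
        reason, failures, format_date(first_failure))
    # Only mention the last success if the link has ever worked.
    if last_success is not None:
        message += " The content was last successfully accessed on %s." % (
            format_date(last_success))
    return message
```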

We calculate a hash of the file contents so that we can tell if it has been updated the next time we download it. We also record the Content-Type that the server provides, as a clue to the file type (although it is a pretty unreliable one in our experience).
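The change check can be sketched as follows. This assumes SHA-1 purely for illustration (the post does not say which hash is used), and `content_hash` and `has_changed` are hypothetical names:

```python
import hashlib
from typing import Iterable, Optional


def content_hash(chunks: Iterable[bytes]) -> str:
    """Hash the file contents chunk by chunk, so large files need not sit in memory."""
    digest = hashlib.sha1()  # assumption: any stable digest would do here
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def has_changed(chunks: Iterable[bytes], previous_hash: Optional[str]) -> bool:
    """True if there is no earlier hash on record or the contents now hash differently."""
    return previous_hash is None or content_hash(chunks) != previous_hash
```

Because the digest depends only on the bytes, not on how they were split into chunks, the same function works whether the content comes from a streamed download or from the cached copy on disk.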

The file is saved to data.gov.uk's server and becomes available as the 'Cached' copy on data.gov.uk. Hopefully when a public body rearranges their website and files become unavailable, this cache will be of benefit. Note that The National Archives also takes copies of files, but their remit is slightly different - they store the complete website, but only for the central government departments and agencies.

Now that we have the file stored, we look at its contents to determine its format, since that determines the number of stars it gets. There are various tools that do this, but we couldn't find one that covered all the data formats we focus on, and some were most unreliable!

The process starts with two open libraries that examine a data file for signatures of well-known formats - 'magic' and 'file'. These detected PDFs, Torrent index files and many office files fine. However, 'magic' categorised older Excel and PowerPoint files as Word files. And plenty of files just come out as 'binary' or not understood at all, so we had to cross-reference the results with 'file'.

It was recently suggested to us that we use the National Archives' DROID tool, and in tests I found it extremely accurate at identifying binary formats like Microsoft documents, OpenDocument Spreadsheet (ODS) and Shapefiles, so we hope to harness it in future.

However plenty of our data formats were not recognised or differentiated, so more code was needed:

  • CSV files are mainstays of open data, yet 'magic' and 'file' categorised them simply as plain text. Although there are plenty of tools for reading CSVs, CSV is not a tight standard at all and the different dialects cause problems. We have CSVs that quote all, some or no items, whilst others miss off commas between blank cells at the end of a row, and some use tabs or pipes instead of commas. Python's 'csv' library is pretty good at much of this, but we turned to OKF's messytables to cope with all the different CSV properties. We found some files with non-standard character encodings, so we contributed a fix to messytables that uses 'chardet' to detect and cope with this.
  • JSON data is another wonderful format for on-line data, yet amazingly none of the tools detected it! Rather than using a standard JSON parsing library, which would require the full file to be loaded and parsed (possibly tens of megabytes worth!), we wrote a simple parser with a state machine, to see if the first chunk of the file was legal JSON.
  • Many files were detected as just 'XML', but we needed to pick out whether this was RDF, KML, XHTML, IATI, WMS or RSS. We did this with some regular expressions. These had to avoid being tripped up by the presence or absence of <?xml ...?> declarations and <!doctype> tags, and by whitespace. Curiously, some files also started with a couple of random binary characters.
  • Some HTML files have embedded RDFa data in them, so we used another regex to check for those.
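To give a flavour of the XML case, here is a rough sketch of picking out the flavour from the root element with a regular expression. It is not the expression set used in the real code: `PREAMBLE`, `ROOT_TO_FORMAT` and `sniff_xml` are hypothetical names, the root element names are listed as assumptions about each format, and the sketch assumes the file has already been classified as XML by an earlier step.

```python
import re
from typing import Optional

# Skip whatever may precede the root element: a few stray leading bytes,
# whitespace, an XML declaration, comments and a DOCTYPE.
PREAMBLE = re.compile(
    rb"^[^<]{0,8}"                      # stray characters before the first tag
    rb"(?:\s*<\?xml[^>]*\?>)?"          # optional XML declaration
    rb"(?:\s*<!--.*?-->)*"              # optional comments
    rb"(?:\s*<!DOCTYPE[^>]*>)?"         # optional DOCTYPE
    rb"(?:\s*<!--.*?-->)*"              # optional comments again
    rb"\s*<(?:[\w.-]+:)?([\w.-]+)",     # root element, ignoring any namespace prefix
    re.IGNORECASE | re.DOTALL,
)

# Assumed root element names for each format of interest.
ROOT_TO_FORMAT = {
    b"rdf": "RDF",
    b"kml": "KML",
    b"html": "XHTML",
    b"rss": "RSS",
    b"iati-activities": "IATI",
    b"iati-organisations": "IATI",
    b"wms_capabilities": "WMS",
    b"wmt_ms_capabilities": "WMS",
}


def sniff_xml(head: bytes) -> Optional[str]:
    """Guess the XML flavour from the first chunk of a file, or None if no root tag is found."""
    match = PREAMBLE.match(head)
    if not match:
        return None
    # Unrecognised root elements fall back to plain 'XML'.
    return ROOT_TO_FORMAT.get(match.group(1).lower(), "XML")
```

Matching on bytes rather than decoded text is a deliberate choice here, since the stray binary characters at the start of some files would otherwise cause decoding errors before the regex even runs.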

The most interesting problem was differentiating a CSV from a file with lots of commas in it. For example, a CSV might have a very wordy start:

Cabinet Office
Electricity - kWh
"* please see note number five in the notes section on http://www.carbonculture.net/orgs/cabinet-office/70-whitehall/","**Please note: empty cells representing half-hourly consumption mean that there is no data available for that time; numbers followed by an 'E' are unreliable numbers"
Site Name,Utility,Unit,Date,00:00,00:30,01:00,01:30,02:00,02:30,03:00,03:30,04:00,04:30,05:00,05:30,06:00,06:30,07:00,07:30,08:00,08:30,09:00,09:30,10:00,10:30,11:00,11:30,12:00,12:30,13:00,13:30,14:00,14:30,15:00,15:30,16:00,16:30,17:00,17:30,18:00,18:30,19:00,19:30,20:00,20:30,21:00,21:30,22:00,22:30,23:00,23:30,Total
70 Whitehall,Electricity,kWh,2010-07-31,69,70,86,74,67,67,67,80,81,68,66,66,71,103,37,18,7,0,1,8,20,75,74,71,97,87,76,73,84,91,81,72,70,90,84,73,67,71,91,81,62,67,63,63,72,77,62,61,3161

Compare that with a text file (explaining a data format) that has lots of commas. It parses fine as CSV, but it isn't data and shouldn't get 3 stars:

UK Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May, Summer=June-Aug, Autumn=Sept-Nov. (Winter: Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where values are equal, rankings are based in order of year descending.
Data are provisional from March 2012 & Spring 2012. Last updated 03/09/2012

We came up with a threshold based on the number of cells per row, over a number of rows. It seems to meet current needs, but we can't guarantee it will always be correct.

And then there are formats which we've simply not attempted to detect, like this one:

Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC
1910  111.4  126.1   49.9     71.8   70.2   97.1  140.2   27.0   89.4  128.4  142.2
1911   59.2   99.7   62.1   69.0   52.2   77.0   43.3   69.4   91.5  141.3  188.4
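The cells-per-row threshold for telling a CSV from comma-heavy prose might look something like the sketch below. The function name `looks_like_csv` and the values of `min_cells`, `min_rows` and `sample_rows` are invented for illustration; the actual thresholds are in the source code rather than in this post.

```python
import csv
import io


def looks_like_csv(text: str, min_cells: int = 3, min_rows: int = 3,
                   sample_rows: int = 50) -> bool:
    """Treat text as CSV only if enough of its early rows have enough cells."""
    reader = csv.reader(io.StringIO(text, newline=""))
    wide_rows = 0
    for index, row in enumerate(reader):
        if index >= sample_rows:
            break  # only the start of the file is examined
        if len(row) >= min_cells:
            wide_rows += 1
    # A few commas in a sentence give the odd wide row; real tables give many.
    return wide_rows >= min_rows
```

With these illustrative settings, the rainfall description above yields only one row of three or more cells, so it would not be counted as CSV, whereas the header and data rows of the Cabinet Office file would quickly pass the threshold once a few data rows have been read.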

Finally, once we know the file format, it is relatively simple to map that to a star rating. There is a bit of debate about whether the newer XLSX format is open or not, but we decided that, whatever the legal position, it was not as accessible to users as CSV, so we give the few XLSX files we have 2 stars and CSV gets 3.

We had trouble automatically differentiating between 4 and 5 stars - analysis of the vocabularies and links requires some judgement and we'd love to hear suggestions about tackling this part. At the moment anything in RDF gets 5 stars, and the bulk of our RDF is Organograms, for which that is correct.

We have yet to detect APIs well, and doing so raises questions about how many stars to award.

And finally, perhaps the most contentious decision was to mark HTML pages as 0 stars. A good number of data.gov.uk links go not to data, but to an HTML page. Sometimes these are nice web apps providing a human interface to the data. And often these (such as hundreds of ONS datasets) provide extra useful metadata as well as an obvious link to the data. These are pretty good quality, and you could argue for a decent star rating. But all too often the link is more hidden in the page, or it is just a 'search the website' page or 'file not found' page, or it requires registration, and these are definitely ones we want to score as 0. In a bid to encourage all datasets to provide a simple download of all their data, and in the absence of a more sophisticated algorithm, we score all HTML pages 0.
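Pulling together only the scores mentioned above, the final mapping step can be sketched as a simple lookup. `STARS_BY_FORMAT` and `stars_for` are hypothetical names, and the table is deliberately incomplete: the real one covers many more formats, which this sketch simply leaves unrated.

```python
from typing import Optional

# Only the scores stated in this post; other formats are omitted on purpose.
STARS_BY_FORMAT = {
    "HTML": 0,   # pages rather than data
    "XLSX": 2,   # judged less accessible to users than CSV
    "CSV": 3,
    "RDF": 5,    # currently everything in RDF gets 5 stars
}


def stars_for(detected_format: Optional[str]) -> Optional[int]:
    """Return the star rating for a detected format, or None if this sketch does not cover it."""
    if detected_format is None:
        return None
    return STARS_BY_FORMAT.get(detected_format.upper())
```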

As you can see, a fair amount of effort has gone into getting these 5 star ratings working reasonably smoothly, but there is still plenty of scope for improvement and feedback is most welcome.

Source code

These are CKAN extensions based on Celery message queue tasks.

Comments

Does one size really fit all?

One of the reasons I am pleased to be posting as exstat rather than currentstat is that I do not have to cope with instructions from "the centre" which tell me how the data should be made available without considering the context of how they are likely to be used.

This comment is my reaction to the apparent writing off of anything that links to an html page, which is the way most official statistical data are made available.  You accept that most of the ONS (and other departmental statistics) html pages are well designed with useful metadata but your algorithm still gives it a ranking of zero.

We all accept the importance of metadata but it appears that the way for statisticians to boost their ratings is to have the links pointing direct to the data, thereby making it extremely easy for users to ignore these metadata.  Self-defeating, or what?

These statistical html pages typically also offer links to related datasets and to previous versions of the dataset.  That's far better than sprinkling data.gov.uk with goodness knows how many versions of essentially the same thing, which is common for some of the other datasets you have (and is probably why you have 8000+ in the first place).

I appreciate that the algorithm is probably intended as a first go at rating datasets but I also fear that the first cut will become a tool to enforce a standard which may well be unsuited to the uses actually made for the data.  For many datasets, the ranking and the tool may be very useful, but if a html page which actively helps users is zero rated, let's consider that it may be the ranking design or the tool which are the problem, not the page.

Good debate

It is great to see some debate on the merits or demerits of a star rating automated system!

The basic premise behind data.gov.uk is simple:

1-To provide a unified catalogue of government data available freely (at the moment, some exceptions in our implementation are given to datasets related to the INSPIRE directive)

2-to provide quick and easy access to the said data so others can make use of it

It was a clear mandate from its very inception, one supported by the Transparency Board, that the catalogue should link to the data file itself, providing a description, basic metadata and licensing information so people could make use of it immediately if they chose to do so.  

We catalogue metadata and provide a direct link to the data; we are not a collection of links to pages.  Our metadata allows for HTML links that contain further info on the data to be provided under additional information, but these will not count towards the star rating. They would allow people thinking of using the data to then travel to the page and read further if they wish to do so.

As for previous versions of the data, give it a couple of years and a great part of the historical record would be captured in our datasets.

While I appreciate the value of the information provided in the ONS pages, the additional information option on our dataset pro-forma accommodates its function quite well.  It is not about boosting ratings, it is about making publishers think about how they publish their data.

The star rating applies to the dataset, as a published package pointing to a data file.  It incorporates elements of format, license (yes, although we have had some technical issues, it does take into account license) and so on.

If we take the premise of a data catalogue as a source, then being thrown to a page where in many instances it is not obvious which file is the one you read about in the dataset you started with, we have a problem, at least in relation to the idea of using a catalogue to obtain data efficiently.

A good example:

http://www.ons.gov.uk/ons/rel/psa/a-generational-accounts-approach-to-long-term-public-finance-in-the-uk/july-2010/index.html

What benefit does this page have over providing the information contained in it in the description section of our metadata pro-forma and a direct link to the file?

Or worse:

http://data.gov.uk/dataset/construction_statistics/resource/7c875e3e-a5c1-4637-a5de-69e9405dadd2

That is the issue with pointing to HTML pages.

Imagine an old-style library: you find the book you want in the catalogue and go to it, thinking that in row PG, book number 000343, you’ll find THAT book, only to find a biography of the author with a critique of his or her work, and only on the last page do you find the actual location of that book. I’d be annoyed.

The feature is in beta, and open to improvement by conversations such as this.

Bash the bad pages, not all of them

On the other hand, if I wanted to see what was available and what I ought to know before diving head first into data I don't really know, I would prefer to have a short well-designed summary and critique which also told me where to go for all the datasets of interest.  Different once I know what I need, granted.

I agree that the two ONS pages you mention are pretty poor.  The most fundamental flaw of all is that they appear to point to PDF files, with no obvious (to me) way of getting at the data.  The "generational accounts" output is essentially an article illustrating some of the issues, so in that sense the data illustrated are, er, illustrative (and rapidly overtaken by revisions and fresh data) so I can see some sense in that.  The construction outputs really should give access to something other than PDF though.  (There's also an issue about the links being to archived pages, which looks poor.)

So I would agree with marking down pages that do not themselves do the job.  However, the problem here is that you are not distinguishing between these and those pages that ARE doing the job for their users.  You then lose focus on improving the "output experiences" that are really bad (and will still be bad whether accessed direct from data.gov.uk or via a html landing page).  Or, to adapt your phrase, to target the publishers who most need to think about how they publish the data (Or need most assistance.  Or are desperately short of resource to do the day job...)

I don't see much in the Additional Information section which would accommodate a description of the various changes in definitions of claimant count, reported crime, changes to survey design and so on, which can all impact on the comparability of a dataset over time.  Most of it seems to be about process (again). You say that it is OK to have a link to an HTML page which people can visit to read if they wish.  In my view, that is too weak a relationship between metadata and data, and unnecessarily invites inadvertent misuse of the data.  That might sound a bit precious, but some datasets are so complex that anyone choosing just to jump in to the data is likely to drown!

The basic premise and the clear mandate to link direct to the data file itself is probably fine as a general guideline but it should not be promoted to the status of evangelical theology. If the mandate does not give good results in certain circumstances, it should be relaxed in a commonsense fashion. 

I'm well out of it (probably in more senses than one) but I think there was always a possibility of a clash between National Statistics principles, as enshrined in its Code of Practice and supported by law, and those underlying data.gov.uk.  One wanted a portal which keeps data, metadata and commentary close together.  The other wanted as simple an interface as possible to the data part of the trinity.  I deliberately put it in the past, but it would be interesting to know if (and how) the two have been reconciled.  I imagine that you have fairly frequent contact on this sort of thing with the Statistics Commission and ONS.

Indeed...

I think we are, in principle, concerned about the same thing.  We both want good metadata and proper contextualisation and we also want (I infer from your writing here) good access to the data. Where we seem to diverge is in the method by which we achieve this.

It would be a monumental task to write something that could assess whether an HTML page was right (in the sense of a good and proper ONS page with great info, proper links to CSV files, etc) against some of the examples I showed you (which are not rare actually: over 19% of ONS datasets point to a broken link at any given time, and it is quite a job keeping up with that).

I agree with you that the PDF issue is not ideal and I hope that in the future, open formats will be used.

The automated scoring is in beta and it is constantly being tested and improved.  We also take views such as yours to better understand how people perceive it and to adjust our intent and make sure we are on message with the purpose of the tool, avoiding it becoming misconstrued as something else. For that, your views are essential to us.

On a basic level, the tool is serving to make publishers think about their data and how it matches the 5 stars premise.  As we do not intend to replace the ONS site, I don’t think the impact of requiring a direct link to the file is high. The ONS site will always be there (I hope) and if a link to the proper web page entry for a given dataset is always provided then I consider that adequate.

If we were ‘replacing’ somehow the ONS web site service with just our dataset metadata then I would agree that either we expand greatly our metadata fields or it would be a dis-service, as it stands, we require direct links and good metadata, obtaining that alone is a journey in itself and we rate the dataset against its contents and for our purposes, it is:

Format of links (against the requirement for a direct link), license type (open or restricted) and implementation of the semantic web principles.  

Key things to remember are that the five star rating was conceived as a way to measure the degree to which a given dataset (as in a file, not a metadata record as ours) complied with the full principles of the semantic web; linked data. As Tim Berners-Lee put it:

Under the star scheme, you get one (big!) star if the information has been made public at all, even if it is a photo of a scan of a fax of a table -- if it has an open licence. The[n] you get more stars as you make it progressively more powerful, easier for people to use.

The key words there are ‘progressively easier to use’. The ratings are a measure of the data against an arbitrary standard position to facilitate use. Within the context of data.gov.uk, we rate against the completeness of the metadata; if we cannot check the compliance of the ‘data’ against format and structure then we rate it down. Basically, we are telling users ‘we don’t know if this link will provide you with data of x quality’ and in fact, for 19% of the datasets, in the case of ONS data, that assessment proves right.

As for licensing, which has been mentioned in other comments in this discussion, open license is not a requirement for the five star per se (although it is our ultimate goal), Tim again (sic):

Linked Data does not of course in general have to be open -- there is a lot of important use of lnked data internally, and for personal and group-wide data. You can have 5-star Linked Data without it being open. However, if it claims to be Linked Open Data then it does have to be open, to get any star at all.

Again, 5 star ratings in data.gov.uk is against the dataset as published, if there is a link to HTML, then it fails one of the criteria. Berners-Lee again (minor spelling corrections):

… People have been pressing me, for government data, to add a new requirement, and that is there should be metadata about the data itself, and that that metadata should be available from a major catalog. Any open dataset (or even datasets which are not but should be open) can be registered at ckan.net. Government datasets from the UK and US should be registered at data.gov.uk or data.gov respectively. Other countries I expect to develop their own registries. Yes, there should be metadata about your dataset.

The point is, we are not trying to fool anyone, or obscure data issues with fancy ratings. We are trying to apply key principles, and in the process have developed some clever enough ways to automate that process. We have done this in an open and transparent way, laying out how we did it and what the criteria are, and providing the code for others to use, change and improve.  We also, as in this instance, engage in conversations about the different ways in which this can be done and its risks.  I think this is a great improvement from where we were three years ago with regard to government data and its accessibility. You are right of course that in any endeavour like this we need to exercise a measure of flexibility and not throw the baby out with the bathwater. I agree completely, and although it appears that we are being black and white on this, it is just the start: it is a trial of a tool, it has some issues, it will change, it will get better and so will the ability of our policies to accommodate different scenarios. The fact is, we are friends in the battle to liberate data and we share the same desire and goal.

Cheers!

Source: http://www.w3.org/DesignIssues/LinkedData.html

A good primer is:

http://tinyurl.com/cncsu2z

Ways forward

I share some of the concerns expressed in comments above and below that a five star rating (a) doesn't fit all cases; and (b) doesn't capture all that we need to know. It's useful, but as part of a set of different ways of assessing datasets, and we should be careful not to let the pursuit of an automated five-star rating become the sole driver of decisions over how to share data. In particular, I've argued in the past that at least some elements of the five stars need to be seen as cumulative (or as independent indicators), so that, if our goal is accessible information for citizens, then it should not be possible to leap to five stars by just providing RDF data if there is no mechanism for the average user to get CSV / Excel / KML etc. formats that they can work with without advanced technical skills. And, as you recognise above and as is illustrated in exstat's comments below, context is really important (Star 2 in the draft 'five stars of open data engagement') and, whilst not part of a technical five-star rating, it is important to find ways to measure it.

Some practical suggestions:

  • Can the method be modified so that it more clearly incentivises providing both 3-star and 5-star data? For example, visually showing with a bright star when a step has been met, or when the star is just lit because a star above it is lit, and circling 3 and 5 to show that both are important. 3-star data enables the everyday user to access data; 5-star builds towards a web of government data with longer-term public value. Ultimately it doesn't really matter which way round these are provided (3-star converted to 5-star; or 5-star flattened into 3-star formats), but it is important to balance the incentives so that both are provided.
  • Can the data.gov.uk meta-data standard include a requirement for a link to documentation for a dataset, such that this would be displayed above the list of datasets for download? It would then be possible to (a) come up with an assessment method to automatically check if a dataset has associated documentation; and (b) to think about how user-feedback on the quality of documentation could be sought, to find out if the documentation really was enough to help the user make sense of the data. This could help incentivise the creation of simple, human readable pages of documentation for datasets. It may even be appropriate to draw on the work on plain language content from GDS and provide a template for good dataset documentation with sections for caveats; considerations; analysis approaches and so-on.

useful analysis

Tim,

Thanks for this useful analysis - all good ideas to ponder on.

Documentation would be helpful for some datasets, although I'd be interested to see examples of where it is particularly lacking. I believe there are a few comments on datasets where people request more info and responses have been forthcoming. But it's not as big an issue as hundreds of links being broken. I think just getting the metadata into plainer English would be a more worthy aim. For instance, titles like "Lower Layer Super Output Area (LSOA) boundaries", "A Generational Accounts Approach to Long Term Public Finance in the UK" and "Combined Online Information System" are from our top 10 frequently viewed datasets, yet that might be because their titles are so incomprehensible to the average citizen and could be improved. Even when you include the full metadata record, it is sometimes difficult, such as "1:1M scale Offshore Quaternary Map". I don't think an automatic system could be much help here. We could pester departments to improve their metadata (and we do a bit of that), or maybe we should employ someone articulate to improve them centrally.

I agree that Linked Data formats are not great for everybody. I think it would be feasible to write an automatic converter from RDF to CSV. But to be honest, I'm struggling to see the specific examples and use cases driving this. We have organogram RDF which are beautifully rendered in a web-app for the average citizen to explore. The same for Bathing Water Quality. The older transport nodes, school info etc have APIs, CSV, as well as some third party web apps, but although the web apps were v. popular, I don't know if there was much take up of CSV. So let's have a hunt for RDF that needs greater accessibility and see if there is a need for alternative formats.

Direct links are not good practice for all datasets

Antonio,

You obviously have particular views on this. I mostly disagree, and support the points made by exstat above. All I can say is that, implemented on the above basis, I don't think the star system is likely to be very useful to either data owners or data re-users.

Datasets are not books, and metadata is not full documentation. You are kidding yourself if you think that the Data.gov.uk 'pro-forma' is any substitute for a properly written landing page.

If I were a data owner making available a dataset of substance I would specifically want to avoid providing a direct download link from Data.gov.uk, if that meant the re-user might miss any additional data or metadata files, licensing caveats, information warnings or technical docs produced in support of the data. That may not be the process that you or the Transparency Board had in mind, but data owners have their own responsibilities.

As a data re-user I would not want to overlook that information either. That's why, whenever I use Data.gov.uk to find a dataset, I invariably explore the data owner's site for first-hand information in the full context rather than relying on the Data.gov.uk record.

-- Owen Boswarva, 21/11/2012

Antonio As usual, Owen has

Antonio

As usual, Owen has made the points I would make much more clearly and succinctly.

I hope you are bashing ONS with a heavy implement ... but about the 19 per cent of pages that have links that don't work.  I believe they have redesigned their website recently (I can't remember a time when they weren't doing so, planning to do so, or consulting about it!).  That should not be an excuse but it may be part of the reason.

The use of PDF files alone is hard to credit, so much so that I suspect there is a way to get at the data that neither of us have managed to locate - which would be bad in any case.  Historically, when Bill McLennan made the brave and historic decision to forgo income and go to website dissemination, everything had to be put up quickly and naturally that meant PDF.  When I was still working I repeatedly badgered them about this, and things changed, though in some cases the actual data were only available via bookmarks in a PDF document, which again is not good.  I would find it disappointing if they had regressed, if I can use that word about a statistical office.

You imply that the pages get zero ratings because you cannot tell whether the links they contain are valid.  Wouldn't it be better to describe them as unrated? 

Perhaps a concrete example may help.  Some time before the last election, an Opposition spokesman put out a high profile report quoting Home Office figures on increases in certain types of crime (violent crime, I think) over the period of the last government.  Fair enough, except he completely ignored the huge caveats put on the published figures about definitions, recording practice and heaven knows what.  Also the standard advice that the British Crime Survey was the preferred source for many types of crime.  Now I don't know whether he (or his advisers) were aware of this, but they should have been, and when picked up on it he attempted to brush objections aside.  You can't stop people misusing data deliberately, and you can only try to minimise accidental misuse, but if the standard way to the recorded crime data does not say "Hold on a minute.  This is best for local figures, not national." you are doing nobody a service.  It's also presenting any deliberate misuser with a built in excuse, that there was no warning.

I had better make clear that I have never worked on crime figures of any type!

I have no idea whether the Transparency Board has representatives from providers and/or users in the analytical community.  I expect it does (I haven't looked) but if, say, it was dominated by the experts in the technology and principles of open data and the semantic web, I suppose the kinds of issues I am raising might not have been at the top of people's minds.

Disservice?!

" if the standard way to the recorded crime data does not say "Hold on a minute.  This is best for local figures, not national." you are doing nobody a service."

This barrier to the data, this "hold on a minute while I send you on a custom human-only journey before you get the data", is not appropriate for many of the reuses of the data. And I don't think Owen represents the views of most citizens or businesses when he says he would prefer not to have a direct link to the data either!

To name but two things it prevents:

* comparison of openness

* browsing by data format

The principle is clear - it is public data and the data links should be provided with the metadata in an open format, not buried as links in human-only text. There is no disservice by providing the important metadata on the data.gov.uk page with links to both the detailed notes and the data file.

In fact, if ONS doesn't, I would expect some citizens might crowd-source or scrape all the valuable data links and publish them themselves. This would be a great benefit to the public, for all sorts of purposes not imagined by the ONS.

No 100% correct answer

David

We really ought to be trying to avoid an "either/or" approach to this.  I am not arguing that direct links are always wrong; I would think they are fine for quite a few statistical datasets.  And I don't think you should be arguing that links to an html page are always wrong either.  It is quite a subtle judgement, dependent on the complexity and messiness of the data, and one not well served by hard and fast rules.

Experienced (re-)users will set up their own direct links of course, or interact directly with the data, but the question is how do you assist them best when they are just starting out?  A "human only" journey suits most humans.  A journey that a machine can follow may be just as good for humans, but I would not take it as read.

By the way, I strongly suspect that the "two things" that you refer to are of little interest to most users of statistical (and possibly some other) data.  Most simply want to use the data rather than republish it, so licensing issues do not affect them.  And I doubt many care too much about the format it is in, as long as it can be read by their computers, so they won't be browsing by data format either.  I'm not belittling either of the issues; some important users will be very interested in one or both, and I fully appreciate the importance of adding value through linking and reuse, and of machine-readability.  But there is a danger of overzealous evangelicalism here; we have to avoid making understandable data less accessible to the everyday user as an unforeseen consequence of making the pure data more accessible to the data pro.

Placing more metadata on the data.gov.uk pages is an interesting idea, but it would be necessary to be clear about what is meant by metadata.  It has to be more than the description of the process, which is almost all that you seem to get now.  But on the other hand you don't really want twenty pages of "notes and definitions", which some datasets get, and some might even need!

Perhaps I could finish with a suggestion.  As what you are doing is ranking openness, perhaps it would be best always to describe it explicitly as such, rather than referring to "ranking of datasets", which to me implies a ranking of the quality of the data as well.

Direct links are not good practice for *all* datasets

Come on, David. I pretty clearly said direct links weren't good practice for *all* datasets, and referred to datasets "of substance". I didn't say direct links were a bad idea per se.

My point is it should be a matter of judgment for the data owner; because they will have the most nuanced understanding of the dataset and the pitfalls of misinterpreting it.

There are many small, simple datasets on Data.gov.uk for which a direct link would be perfectly reasonable. However your current methodology (sorry: method) penalises data owners for failing to provide a direct link in all cases. That's my objection.

-- Owen Boswarva, 22/11/2012

Good ideas

This is brilliant, thank you guys for taking the time to put your views forth!

To make it clear to all: I may be very specific and committed to what we are doing, and will of course defend its basic implementation, BUT I also share the view that the 5 star rating is too simple to account for the complexity of the open data landscape.  We are experimenting with ways in which we can use it to give better context on the potential quality of a dataset, but it is just that, experimentation.  It will probably amount to an aide-mémoire, an added-value item across the set of tools we provide around our data.  That means that we might get it right, we might get it wrong, or we may find a balance, which will determine its final shape.  That’s where you guys come in, to have the healthy back and forth we’ve had in this blog.

This is not a black and white issue.  Although the rating will probably stay, it will incorporate any caveats it needs, along with whatever changes to the algorithm are required to make it better and more useful.

I like Tim’s suggestions for modifying the method and we will be looking at those in the near future.  There is already a process in the publication system to add links to pages with further information and to legislative documentation.  Changing the metadata profile is something we are always investigating (without forgetting the very wise saying ‘if it ain’t broke, don’t fix it’ - sorry for my colloquial use of English) but remember that increasing the level of information captured will most likely apply to future datasets or be implemented when datasets are updated (it will be quite a resource intensive activity for publishers to go back and ‘fill’ it in for existing datasets).  Some of the work GDS has done in simplifying the conveying of information will surely find its way to departments when dealing with metadata.

On the ‘controversial’ yet very important issue of pointing to HTML pages, we obviously differ in opinion but there are ways forward.  With the advent of government web APIs there will certainly be changes in how data is presented.  We may also evolve to a point, especially after everyone is on the single domain, where we can use the content syndication features of .gov.uk to pull key items into the dataset (as further tabs) so someone will see the same content as it would appear on the .gov.uk ONS website.

ONS is actually a great organisation to work with and they have been very supportive of our views (well, understanding of our views; they follow the same lines as exstat and Owen) and I look forward to continuing to work with them.

So…we have taken a view on star ratings and are experimenting with it, you guys have put forth a very powerful argument and highlighted what you see as holes in the approach, for which I thank you.  Now, if it wasn’t star ratings…how could we (think of the technical implications of implementing it) indicate compliance (with what?) or quality in a way that can be automated? 

exstat, thanks for the

exstat, thanks for the question. ONS have been showing the way in releasing data for many years!

I'm very pleased that people are concerned about getting 0 stars. I've said that the scoring system is on the harsh side, but I fail to see that providing a direct link to the data is something to be discouraged.

We aim to add the links to the ONS data, to show alongside the links to the original ONS metadata records. This should raise their openness scores and benefit the public.

I agree that the ONS metadata is often useful, and that's why we show it in data.gov.uk.

Depends what you mean by metadata

David

Looking at a horribly small sample of statistical datasets, the metadata on data.gov.uk is mainly about the process of release.  I'm more concerned with the metadata which describe the data and point out potential pitfalls for users, such as discontinuities in the series.  Those metadata are generally available from statistical HTML pages alongside the data.  I don't seriously think they should be held on the data.gov.uk pages as well, not least because to do so would dilute the identity of National Statistics, but in that case it surely has to be a Good Thing to go to an HTML landing page, not something getting a zero rating.

There is a bit of a theme here.  Your algorithm is rating sites on process, not quality.  Your metadata explain the process, not the data.  We're all on the same side here, I think, but let's keep away from anything which looks like box ticking and rate datasets in a more holistic fashion.  If you don't, people are going to be annoyed with, rather than concerned about, getting a zero rating.  They'll still be concerned, but about the rating system itself.  And that's not healthy for anyone.

That approach worries me ..

The point of the star system (i.e. Tim Berners-Lee's 5 Stars of Openness) is to characterise the qualities of the open dataset itself. You are using it to describe the qualities of the metadata provided to Data.gov.uk by the data owner. 

Those are two different things. If you have not gathered sufficient information to locate the dataset, any rating will be bad information.

iCalendar rating

Thanks David - a good improvement. Any chance iCalendar files can get 3 stars rather than 1 though?

Thanks, Rick. I've now added

Thanks, Rick. I've now added this format to the mapping and hopefully we should get this deployed this week. Cheers, David
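
As a rough illustration of what a format mapping of this kind could look like, here is a minimal Python sketch. It is not the deployed code: the names `FORMAT_STARS` and `stars_for_format` are hypothetical, the PDF and XLS values simply follow the general 5 Stars scheme, and the fallback of 1 star for unrecognised formats is an assumption. Only the 3 stars for CSV and the request for iCalendar to score 3 come from this thread.

```python
# Hypothetical lookup from a resource's format to a star score.
# Only CSV -> 3 and iCalendar -> 3 are taken from the discussion;
# the other entries and the default are illustrative assumptions.
FORMAT_STARS = {
    "pdf": 1,        # available, but not structured data
    "xls": 2,        # structured, proprietary format
    "csv": 3,        # structured, non-proprietary format
    "ics": 3,        # iCalendar file extension
    "icalendar": 3,  # iCalendar by name
}


def stars_for_format(fmt, default=1):
    """Return the star score for a format string, ignoring case and padding."""
    key = fmt.strip().lower()
    return FORMAT_STARS.get(key, default)
```

With a table like this, supporting a new format is a one-line change to the dictionary rather than a change to the rating logic.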

fixed

Rick, the fix for iCalendar is now deployed - only 5 datasets so far with that format, but let's hope there are more.

Three stars for a dataset that isn't even open data?

David,

I've been looking at some of the Openness Scores on individual datasets, and in short I'm not convinced the methodology you are using reflects the '5 Stars of Openness' at all. 

You seem to be considering only the data format, and not whether the data is subject to an open data licence. For example this Environment Agency dataset:

     Water Framework Directive – Surface Water Classification Status and Objectives

has been given three stars because it's in CSV format. However the dataset is freely available only for non-commercial re-use. It isn't open data.

It is rather fundamental to the 5 Stars scale that a dataset must be open to get even the first star. If you want to take an approach based only on the data format, perhaps you should devise your own rating system?

-- Owen Boswarva, 20/11/2012

bug

Owen, Thanks for highlighting this, although it is limited in scope. Looks like a bug picking up this licence. I'll fix it in the next day or so.

There are at least a thousand datasets which are marked 0 stars because of the licence not being open, so I think you are rather harsh to claim that the 'methodology' is broken (let alone the 'method' ;-). I should probably have mentioned this step in the blog post though.

There are probably other bugs out there - do let us know if you spot any. BTW if you hover the mouse over the star rating, it tells you more about how it was rated.

More examples

David,

I'm glad to have confirmation that your intention is for datasets without open data licences to be marked 0 stars. Thank you.

I recognise the tendency of developers to focus on technical standards. However when it comes to measuring "openness" it doesn't much matter how nicely the data is packaged and delivered if it's not legally available for open re-use.

Last year the UK Location group tried to put forward an argument for replacing the five star rating for open data with a "different type of rating" that gave greater weight to technical interoperability, data quality, availability of applications and other factors.

Consequently when your post failed to mention open licensing as the first criterion it rather jumped out at me as a glaring omission. (Antonio seems to be hedging on this issue as well; see "open license is not a requirement for the five star per se" in his comment below.)

Thanks for taking a look at the dataset I gave as an example. I appreciate the application of three stars to a closed dataset might be an isolated bug or anomaly. However there also seems to be no shortage of datasets with one star despite no open data licence: here, here, here, here and here.

-- Owen Boswarva, 21/11/2012

bug fixed

I've deployed a fix for these issues. Deleted resources were being included in the rating for a package, and datasets with closed licences were slipping through with 1 star. Cheers for letting us know about these, and do say if there are any more problems you spot.
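
To make the two rules in that fix concrete, here is a minimal Python sketch of a package-level rating. It is a sketch under stated assumptions, not the deployed code: the dictionary keys (`licence_is_open`, `resources`, `state`, `stars`) and the function name `rate_package` are hypothetical, and taking the best-scoring resource as the package score is an assumption, as is scoring a package with no live resources at 0.

```python
def rate_package(package):
    """Return a 0-5 star score for a package (dataset) described as a dict.

    Assumed (hypothetical) structure:
      package["licence_is_open"] -> bool
      package["resources"]       -> list of {"state": str, "stars": int}
    """
    # Rule 1: a dataset without an open licence gets 0 stars,
    # however well its files are formatted.
    if not package.get("licence_is_open", False):
        return 0

    # Rule 2: deleted resources must not contribute to the rating.
    live = [r for r in package.get("resources", [])
            if r.get("state") != "deleted"]
    if not live:
        return 0  # assumption: nothing left to rate

    # Assumption: the package takes the score of its best live resource.
    return max(r["stars"] for r in live)
```

For example, a package with an open licence, a deleted 3-star CSV and a live 1-star PDF would score 1 under this sketch, and the same package with a closed licence would score 0.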
