Developing a National Core Reference Data Set

Today was a big day for open data in the UK with the publication of Stephan Shakespeare’s Independent Review of Public Sector Information. Of course the Government will be responding formally to the review over the next couple of months, but I wanted to explore one of the main issues highlighted by the review. For me, the stand-out recommendation from the report was to establish a National Core Reference Data set.

Stephan does not set out these data in detail, but offers a common sense set of principles to assist us:  

   

“Within such National Core Reference Data we would also expect to find the connective tissue of place and location, the administrative building blocks of registered legal entities, the details of land and property ownership.”  

“We should define 'National Core Reference Data' as the most important data held by each government department and other publicly funded bodies; this should be identified by an external body; it should (a) identify and describe the key entities at the heart of a department’s responsibilities and (b) form the foundation for a range of other datasets, both inside and outside government, by providing points of reference and interconnection.”

 

As a result of a number of Prime Minister’s letters, the 2011 Growth Review, the Open Data White Paper and Departmental Open Data Strategies, the Transparency Team in the Cabinet Office have been focussed on getting the most valuable data out of government and into the hands of developers. We have drawn upon advice from senior figures in the world of open data like Nigel Shadbolt, Rufus Pollock and others who sit on the Transparency Board to help define our priorities. In addition, through the user request process on data.gov.uk and with assistance from the Open Data User Group, released data sets are increasingly driven by user demand. We report to Parliament on progress against the commitments made in the Open Data White Paper through quarterly Written Ministerial Statements.

 

Of course, we have already made substantial progress, with over 9,000 data sets released by government departments and agencies, including critical information such as government spending, crime data and school performance. Stephan himself, as Chair of the Data Strategy Board, has helped drive the release of core location data from the Trading Funds, such as Ordnance Survey OpenData maps and the important recent release of Land Registry historical property data.

 

But clearly there is more to do, and Stephan offers a potential route to the next stage. I like Stephan’s suggestion that there is a core set of data that is critical to each government department, although I suspect it may not be straightforward to define. If we were thinking about transport, it seems likely this might involve the location of transport terminals, live timetable information, government subsidies/spending, and fares (all of which are already openly available), but is there more?

 

I would like to open a conversation with data users and, of course, government departments about what we think this core reference data set might consist of. It seems sensible to think of it as additional to the unique demand-led process we have in place via data.gov.uk and ODUG, and I suspect there will also always be a role for a central focus on particular data sets that might not be seen as ‘core’ but which could still make a transformational difference to citizens.

 

I am interested in your comments about how we might create a more detailed set of definitions for a national core reference data set that could be applied across Departments, and I’m also interested in what people think the top five or so data sets in each department might be, regardless of whether they have already been released.

 

Paul Maltby

@_OpenP

Comments

Some initial comments

Paul,

Four quick points:

  • Core reference data is not evenly distributed across Government. It's disproportionately held by the key delivery departments and by the trading funds (by design), along with technical/scientific agencies like the Environment Agency. If we understand "core reference data" to mean the nation's information infrastructure, some public authorities may not have primary responsibility for any such data. I would suggest that "top five or so" should be very flexible.
  • There is not necessarily a close relationship between the importance of a dataset to the operation of a public authority and the importance of that dataset to the outside world. Innovation is unpredictable, and re-use of a dataset may diverge radically from the original use. To me it would make more sense to draw up lists of core reference data based on theme or market sector instead of by public authority.
  • If you want external input on the most important datasets, there is a problem of discovery. Although there are examples of good practice from individual departments, we do not yet have a comprehensive inventory of government datasets (or anything like it). Some important datasets will be familiar only to internal users.
  • This exercise will necessarily produce an incomplete picture of the UK's core reference data, because quite a few datasets that fit that definition are now held by companies in the privatised industries (e.g. utilities and public transport providers).

-- Owen Boswarva, 15/05/2013


Related blog post

I've written a blog for the Guardian website that follows up on some of these points, with particular reference to the importance of core reference data held by the Public Data Group trading funds:

     The cost of unlocking economic potential from core reference data

-- Owen Boswarva, 23/05/2013


National Core Reference Data

I like a list because when I've done something on it, I can tick it off and it looks like I'm making progress.  With this simple principle, can someone suggest a list of National Core Reference Data Sets?  Let's leave PAF out of it for now as we know all about that one.

From my perspective, within Health, I would see GP Practices as a Core Data Set, and work is in train between the HSCIC and Ordnance Survey to add the UPRN to this dataset as a pilot project to link AddressBase and datasets provided by the HSCIC.  Simple steps, but it's a start!

Anyone else care to contribute?

Graham Hyde


First we need an asset register

Hi all,

I completely agree with the suggestion that we need a core reference data set from each area of Government. I also agree that place, ownership and some indication of 'activity' should form the basis of this.

To reflect on Owen's points; before we can work out what the core dataset is for each department, we need what I would call a 'data asset register'. Many data-reliant businesses have these already to record:

  • What it is and what it contains
  • Who is responsible for it
  • How it can and can't be used
  • Where it is used already

If the likes of ODUG and the Transparency Team had this asset list, it would make it much easier to see which datasets were already 'core' to the activities of each department, validate the potential value to society / UK Plc and then get it released (subject to the normal processes of ensuring data security, compliance and so on).
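As a rough illustration of the four items listed above, a register entry might be sketched in Python along the following lines. The class name, the field names, the helper functions and the example entries are all hypothetical, and treating the number of recorded uses as a hint that a dataset is 'core' is only one possible proxy, assumed here for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataAsset:
    """One entry in a hypothetical data asset register."""
    name: str       # what it is
    contents: str   # what it contains
    owner: str      # who is responsible for it
    permitted_uses: List[str] = field(default_factory=list)  # how it can be used
    restrictions: List[str] = field(default_factory=list)    # how it can't be used
    current_uses: List[str] = field(default_factory=list)    # where it is used already


def assets_for_owner(register: List[DataAsset], owner: str) -> List[DataAsset]:
    """Return the register entries that a given body is responsible for."""
    return [asset for asset in register if asset.owner == owner]


def candidate_core(register: List[DataAsset], min_uses: int) -> List[str]:
    """Names of assets with at least `min_uses` recorded current uses.

    Assumption: heavy existing use is taken as a hint that a dataset is core.
    """
    return [a.name for a in register if len(a.current_uses) >= min_uses]


# Placeholder entries, invented purely for illustration.
register = [
    DataAsset(
        name="Example dataset A",
        contents="One row per registered entity, with a location",
        owner="Department X",
        permitted_uses=["internal analysis"],
        restrictions=["contains personal data"],
        current_uses=["service delivery", "annual reporting"],
    ),
    DataAsset(
        name="Example dataset B",
        contents="Monthly summary totals",
        owner="Department Y",
        current_uses=["annual reporting"],
    ),
]

print(candidate_core(register, min_uses=2))
```

Even a flat list like this would let someone outside a department ask simple questions such as "what does Department X hold?" or "which datasets are already relied on in more than one place?".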

If I took an example such as HMRC; the core dataset here is obviously a list of tax payers (or tax paying entities) associated with a location and in the case of businesses, a unique ID (VAT number). ODUG are already working on this core dataset (well, the business part of it at least) but on top of this, you could see other core data assets from HMRC being the amount of tax paid and information on our Customs processes. However, without a list of these datasets, it's difficult to find the potential value.


MOJ core reference data set

I agree that the idea of departments releasing core reference data sets is an exciting one, giving some good structure to the open data drive within government.

Thinking about what it might mean for the Ministry of Justice, a core set of data on each prison and court would seem appropriate, including location and category or type. Also important, of course, is specifying the formats for this core reference data. We should definitely think about this as we redesign the court finder tool, which does not currently produce machine-readable data.


Continuity in the core

The need for continuity in whatever comprises a national core reference dataset has been stressed.  It would not be reasonable to argue that everything that exists now must exist for evermore, but it is important that the needs of the user base generated through open data initiatives for a source are given proper weight and not dismissed because a politician has taken a dislike to a source.  Can we trust ministers to do this?

In this context, it is worth a look at http://www.statisticsauthority.gov.uk/reports---correspondence/correspondence/letter-from-brandon-lewis-mp-to-andrew-dilnot---250413.pdf from the UK Statistics Authority site.  DCLG is withdrawing statistics at regional level.  It argues (with some justification) that regions are not that useful for service delivery, that there are huge variations within regions and so on.  Great play is also made of the abolition of regional government and - in a pejorative sense, I suspect - of the regions' role at European Union level.  DCLG appears to favour new concepts like Local Enterprise Partnerships.  Fine, but my understanding is that a local authority can be in more than one LEP, or indeed in none, and that LEPs are quite likely to change in composition over time.  So overlap, incomplete coverage and no continuity.  Regions have been around for decades; whatever their faults they do provide continuity over time.  The abolition of regional government, also given great weight in the document,

The argument is that if data are made available at much smaller area level, anyone who wishes to can construct data for regions or any other geographical entity.  Whether it is right to impose the need to do this upon users who still find regions useful is a moot point; far more efficient for DCLG to do it once.  And it is not generally possible to produce results from statistical surveys at small area level, as the sample sizes are not designed to do so - regions are often the only game in town for these.  Similarly for research exercises; econometric research on business performance often looks to see if there are regional effects and it is hard to see how there could be agreement on alternative geographical entities for that kind of use for which suitable data exist.
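For simple counts, the "construct it yourself" step amounts to summing small-area figures using a lookup from area to region. A minimal Python sketch is given below; the function name, area codes, region names and figures are made up for illustration. It only works for additive quantities, which is part of the point: rates, and survey estimates that were never produced at small-area level, cannot be rebuilt this way.

```python
from collections import defaultdict
from typing import Dict


def aggregate_to_regions(
    small_area_counts: Dict[str, int],
    area_to_region: Dict[str, str],
) -> Dict[str, int]:
    """Sum additive small-area counts into regional totals.

    Raises KeyError if an area has no region in the lookup, so gaps in
    coverage are noticed rather than silently dropped.
    """
    totals: Dict[str, int] = defaultdict(int)
    for area, count in small_area_counts.items():
        totals[area_to_region[area]] += count
    return dict(totals)


# Invented codes and figures, purely for illustration.
counts = {"area-001": 120, "area-002": 80, "area-003": 50}
lookup = {"area-001": "Region A", "area-002": "Region A", "area-003": "Region B"}

print(aggregate_to_regions(counts, lookup))
```

Every user who wants regional figures has to maintain the lookup and repeat this step, whereas the publisher could do it once.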

The underlying argument seems to be that because government (or at least DCLG) has abolished regional government (if anyone noticed it existed!) obstacles should be put in the way of those who wish to use them (or have no feasible alternative).  To go back to the beginning, presumably regional data would have been considered core up to 2010 and non-core afterwards.  Is this kind of political flip-flopping (which I thought statistical independence was supposed to insure against) something we should worry about in the wider context of the core data set?


National Core Reference Data Set

I am in the Health Sector. One thing I did over and over was create a reference SQL database of various datasets, called UK Health Dimensions, which we update every week and supply to NHS organisations. Last year I set up a company and started supplying a comprehensive SQL database of the datasets used by NHS organisations. So far Harrogate District Hospital, Heatherwood & Wexham NHSFT, Somerset Partnership NHS Trust, Taunton & Somerset NHS FT, Berkshire Healthcare NHS FT, South London CSU and Central Southern CSU are using it. Also two universities are reviewing it.

It has 1400 tables covering:

  1. A table for every data dictionary attribute and data element (Admission_Method, Actual_Delivery_Place, etc) with SCD history back to 2005
  2. A table for all the TRUD SDS datasets (GP Practice, Postcodes, etc) with SCD history back to 2009
  3. An extensive, detailed Date Dimension with one record per day for 1st Jan 1901 to 31st Dec 2199
  4. A detailed Time Dimension with one record per second of the day
  5. An age-band dimension
  6. ICD10 tables, containing the ICD10 2000 version and the 4th edition that came out Apr 2012 as an SCD
  7. OPCS4 tables, including v4.5 and 4.6.
  8. PbR tables back to 2010/11
  9. Ordnance Survey tables (LSOAs, MSOAs) which are to be used for building maps of the UK
  10. Read Code v2 as SCD back to 2012

See www.fjmcmanus.co.uk/dimensions

We hope to expand it or alter it for other sectors of UK public and private industry. 

If any organisations in the health sector, including companies supplying BI services to the health sector would like a free trial, please let us know.
