COINS as Linked Data

Author: Ian Dickinson, ian@epimorphics.com
Published: Nov 11th, 2010

Introduction

In common with other governments around the world, the UK Coalition Government has made a commitment to to increasing government transparency. See, for example, Chapter 16 of the Coalition Agreement. A key initiative under the general umbrella of government transparency worldwide is open data – the release to the public of government data, unencumbered by restrictive licenses or fees.

Unsurprisingly perhaps, one of the topics of greatest interest has proved to be how the government spends its money. In June 2010, HM Treasury, the principal custodian of financial data for the UK government, released previously restricted information from its Combined Online Information System (COINS). Until then, access to COINS by journalists and members of the public had been very difficult, and COINS itself had been the subject of various Freedom of Information requests. The release of COINS data therefore represented a step change in the Treasury's willingness to share data openly.

While the release of the COINS data is very significant, it does not by itself make the government's finances scrutable. COINS is a highly technical information source, requiring detailed knowledge of the encoding schemes used to make sense of. Moreover, the data files are too large to load into widely available analysis tools such as Microsoft Excel, a common tool-of-choice for many data investigators. Some enterprising teams have stepped up to the challenge of providing better access to the COINS data, notable among them the Where does my money go? project, and the Guardian datastore. We believe, however, that more can be done to open up the COINS dataset for exploration and analysis.

Linked Open Data is the name of a distributed, global initiative to provide better means to open up datasets, and, crucially, to publish links and connections between data.

At its heart, Linked Open Data uses a collection of data encoding standards maintained by the W3C, with the resource description framework (RDF) at its core. Linked Open Data, often abbreviated LOD, is founded on a small number of key principles:

  • naming things with uniform resource identifiers (URI's), which can be resolved to get both human-readable and machine-readable descriptions of the resource
  • a uniform data model based on the assertion of a named relationship between two resources, or between a resource and a data value (such as a number or a date)
  • extensibility, or the ability to add extra annotations to a resource without first requiring a prescriptive data schema to be updated

For more background information on linked data, see linkeddata.org and the Wikipedia page on linked data.

This article explains how we converted COINS to RDF in linked-data style, so that it can be more easily explored and investigated. We will explain some of the principles involved in the representation, and illustrates some of the queries that can be performed against the RDF data.

What is COINS?

COINS is a large database system at the centre of operations in HM Treasury. The Treasury itself describes COINS as:

COINS – the Combined On-line Information System – is used by the Treasury to collect financial data from across the public sector to support fiscal management, the production of Parliamentary Supply Estimates and public expenditure statistics, the preparation of Whole of Government Accounts (WGA) and to meet data requirements of the Office for National Statistics (ONS). Up to nine years of data are actively maintained : five historic (or outturn) years, the current year and up to three future (or plan) years depending on the timing of the latest spending review. COINS is a consolidation system rather than an accounts application, so it does not hold details of individual financial transactions by departments.

More detailed descriptions of COINS and its contents can be found on the Treasury's web site.

The COINS data release

The currently released data from COINS contains snapshots of the contents of the database on 14th June, 2010. Not all spending lines have been included: data from the Ministry of Defence and numbers relating directly to security and intelligence have been omitted. Nonetheless, the five data files released contain between 3.3 and 4.9 million rows of data. Each row represents one aggregate transaction, reported by one of the "spending departments" – for example the Department of Health. These transactions are not individual items of expenditure (other datasets are becoming available which do provide that level of detail), rather these are figures that the departments report to the Treasury to allow overall government spending to be monitored, understood and reported accurately.

Some of the data rows in the published data files are zero-valued or dummy entries. These are typically entered during the process of preparing to make a finalised entry into COINS. We have omitted these values from our linked-data presentation of COINS. The COINS release contains two sets of tables: the fact tables present a snapshot of the state of the COINS database when the data was released, while the adjustment tables catalogue the various additions and updates to the database that lead to the snapshot state. In our work to date, we have concentrated exclusively on the fact tables.

Hypercube structure of COINS

Each datum in COINS is identified by seven key indices. The combination of values of these indices is unique for every data entry in COINS. By analogy to the way in which three values, x, y and z can be used to identify a point in three-dimensional space, we refer to these seven indices as the dimensions of the COINS dataset, and the corresponding structure a hypercube (since it has more than the three dimensions we normally experience). This terminology is shared with OLAP cubes. The seven identifying dimensions in COINS are:

  1. time – contains the year or month to which the data relates.
  2. data-type – distinguishes Plans (i.e. budgets), and Outturn (i.e. actuals)
  3. data-subtype – the status of the reported transaction (draft, submitted, approved) and the reason for entering the data (e.g. corrections, reserve claims).
  4. department – identifies the department responsible for the spending. Note that for budgeting and forecast outturn, the "arms-length bodies" are consolidated into their parent department.
  5. account – contains the chart of accounts, identifying the economic nature of the data.
  6. programme object – records the function associated with the data, similar to a cost centre. The programme object codes and descriptions are maintained by departments.
  7. counterparty – records the second party in some transactions between departments.

In addition to these dimensions, there thirty-three further fields which provide some additional insight into the transaction being reported. These include, for example, the type of department identified by the department ID, whether expenditure falls under resource or capital budgets, which National Account Code (NAC) it is classified under, etc. Some fields in the COINS release are annotated as no longer being in use, so we have omitted these from our linked-data translation.

For illustration, an example entry from COINS is shown below:

value 1119900 this is the value of the transaction in thousands of pounds sterling
dataType plans this amount is planned expenditure
time 2009-10 the time at which the plan was recorded
accountCode 51611300 account code for Other Grants to Local Authorities (CAP)
counterpartyCode cpid.na no counterparty recorded
dataSubtype pdiraa Plans Dept Internal Restructuring Approved Adjustments
departmentCode des022 Department for Children, Schools and Families
programmeObjectCode p0110002 labelled as P01 Devolved Capital LA
accountingAuthority aa101 Central Government, supply: all voted payments and receipts by central government departments
accountsCapitalCurrent a-cap capital expenditure
activityCode act0201 the activity code is "Used to allocate expenditure to Departmental Annual Reports and Supplementary Budgetary Information publications."
budgetBoundary del department expenditure limit
budgetCapitalCurrent b-cap capital budget
cgaBodyType dept
cofog cofog090102 COFOG code 09.1.2
departmentGroup dept022
esa esa-d92b ESA code D.92 Investment grant
estimateLine el12 Estimate line is "Used to identify specific sections in Supply Estimates in the current financial year"
estimateLineLastYear e12ly
estimateLineNextYear e12ny
estimateNumber me01
estimateNumberLastYear m01ly
estimateNumberNextYear m01ny
estimatesCapitalCurrent e-gra
nac n1001 national account code N10.01 Capital grants to local authorities: other
pesaCurrentGrants ocg Other Current Grants
pesaDelivery regional
pesaServices individual
programmeAdmin prog
programmeObjectGroupCode p01s100101 P01 S100101 Grants to Voluntary Aided schools for capital and repairs
requestForResources rfr01
requestForResourcesLastYear r01ly
requestForResourcesNextYear r01ny
resourceCapital capital
sbi sbi01
sector la Local Authority
signage positive
territory eng applies to England only

COINS as linked data: overview

Our goal is to make the COINS data available as linked data. There are several reasons why doing so will, we hope, be useful to users of the COINS data:

  • using open standard representation makes it easier to work with the data with available technologies, such as the SPARQL query language and the linked data API. As time goes on, we can expect more products and open-source software to make use of these technologies to provide easy-to-use explorer tools;
  • individual transactions and groups of transactions are given an identity, and so can be referenced by web address (URL), to allow them to be discussed, annotated, or listed as source data for articles or visualizations;
  • cross-links between linked-data datasets allow for much richer exploration of related datasets

To achieve a linked-data presentation of the COINS data, we need a vocabulary that specialises the very general notations used in RDF to the particular needs of COINS. For some other datasets, this requires authoring a new vocabulary (sometimes referred to as an ontology), or adapting an existing one to the purpose at hand. Fortunately, there is already a suitable vocabulary for encoding statistical data that has a hypercube structure: the data cube vocabulary.

The data cube vocabulary

The data cube vocabulary allows a statistical dataset, on the web in linked-data format together with descriptions of its dimensions and metadata. The vocabulary was developed in 2010 as a joint activity among a group of academic research groups and linked-data companies. It derives some of its basic principles from the work of the Statistical Data and Metadata Exchange (SDMX) Initiative.

The vocabulary specification document describes the basic cube organization as follows:

A statistical data set comprises a collection of observations made at some points across some logical space. The collection can be characterized by a set of dimensions that define what the observation applies to (e.g. time, area, population) along with metadata describing what has been measured (e.g. economic activity), how it was measured and how the observations are expressed (e.g. units, multipliers, status). We can think of the statistical data set as multi-dimensional space, or hyper-cube, indexed by those dimensions. This space is commonly referred to as a cube for short; though the name shouldn't be taken literally, it is not meant to imply that there are exactly three dimensions (there can be more or fewer) nor that all the dimensions are somehow similar in size. [source]

For the COINS data, the dimensions and attributes of the cube are outlined above, although they are not in linked-data form. We will also require the data-set descriptor (DSD), which describes the arrangement of dimensions and attributes to consumers of the data, and we will need to convert the encoded values used in the data into RDF resources. Finally, given the importance of tracing statistical data to its origins, we need to show the provenance of the data in linked data form.

COINS graph structure

The RDF encoding of data links nodes, or resources via named predicates to other resources or to data literals (such as a string or a number). Mathematically, this structure is a directed acyclic graph, and we use the term named graph to group together related RDF resources so that we can refer to them by URI.

For each of the years of the COINS release, we will have an authoritative graph which contains the main data transcribed from the Treasury COINS data, a metadata graph which contains provenance and other information describing the authoritative graph, and and supplementary graph which contains additional links we are able to derive from the main data but which is not part of the core data release. Figure 1, below, shows the relationships of these graph structures, together with an overview of the data cube vocabulary applied to COINS:

diagram of COINS graph structure

Figure 1: The basic COINS graph structure

Each row of data in the COINS release, omitting the zero-valued dummy entries, will become a single qb:Observation resource. As we show below, the URI for that resource is composed from the code values of the dimensions. Values for each of the dimensions, and for the other non-null valued columns in COINS become properties attached to the observation. Observations may be grouped together as "slices" of the dataset. While these may provide helpful means of navigating the structure of the dataset, they are not fundamental to the encoding. Note that a given observation may be grouped into more than one qb:slice of the data.

Individual column values in COINS are represented as code lists. Some narrative documentation on the code lists used was released with COINS, but definitive catalogues of the code lists is not currently available. As we show below, we have created code lists from the data values in the release, augmented with descriptions from narrative sources. This makes the linked-data presentation of COINS a complete and self-contained unit, but limits the amount of cross-linking to other datasets – a key value of the linked data approach. Where possible, we provide links in the supplementary graph which annotate the code list entries with relationships to other vocabularies. For example, the Ordnance Survey maintain a list of geographic entries, which are clearly related to the territory codes used in COINS.

COINS datasets and named graphs

COINS is an example of one of the more complex statistical datasets being publishing via data.gov.uk. Part of the complexity of COINS arises from the nature of the data being released. The published COINS datasets cover expenditure related to five different years (2005–06 to 2009–10). The actual COINS database at HM Treasury is updated daily. In principle at least, multiple snapshots of the COINS data could be released through the year. In the current, somewhat experimental, phase, we only have one snapshot which was released in June 2010.

In conceptual terms, we have at least four perspectives on what we mean by “COINS” data: the abstract notion of “all of COINS”, the data for a particular year, the version of the data for a particular year released on a given date, and the constituent graphs which hold both the authoritative data translated from HMT’s own sources, and additional supplementary information which we derive from the data, for example by cross-linking to other datasets.

Figure 2, below, shows these conceptual objects, together with the naming schemes we have used and the RDF predicates which link these different views:

Diagram of the various perspectives of COINS data and their relationships to RDF resources and named graphs

Figure 2: The various perspectives of COINS data and their relationships to RDF resources and named graphs

The URI patterns we use are summarised in the table below

URI pattern RDF type Meaning
/{domain} void:Dataset The top-level, most abstract concept in the domain. In our case, the domain is finance, and the uppermost concept is the whole of COINS.
/dataset/{domain}/{id} qb:DataSet A particular identified collection of data, such as the COINS data for 2009 – 10
/dataset/{domain}/{version}/{id} void:Dataset Identifies a particular version of the dataset by date of release
/dataset/{domain}/{version}/{id}/{subset} void:Dataset Container for the authoritative, supplementary or metadata triples for a dataset version. This identifier is both a dataset resource and a graph name.

When the data is published via a SPARQL endpoint, the default graph will contain a union of the current versions of the subset graphs, making it possible to query older versions of the data by name while also being straightforward to query the most up-to-date data.

Converting COINS to RDF

In this section, we describe the process of converting the released COINS data to RDF linked data. Our aims are primarily to share experience and lessons with other developers engaged in similar tasks, and to provide a more detailed description of the structure of COINS in RDF to guide consumers of the data.

Downloading and preparing

The COINS release files are available via data.gov.uk, as either straightforward or BitTorrent downloads. The files, once downloaded, are .zip archives containing CSV files (notionally comma-separated values, though in this case the comma character is replaced by the @ character). Listing 1, which was run under Linux but should work on Windows with cygwin, prepares the files for the next stage of the process. The primary goals are to convert the data from UTF-16 to UTF-8, and compact the CSV's slightly by omitting unnecessary strings such as NULL.

It would be possible at this point to process the files directly from the CSV format. However, we found it much more convenient to load the data into a database, so that the conversion code could use SQL queries to, for example, efficiently collect the lists of code values used for each column. The script in listing 2 loads one of the above generated CSV files into a PostgresQL instance, adding an index column to identify each row of the table with an index value:

Defining the cube structure: the COINS DSD

The data structure definition, or DSD, specifies the structure of the qb:dataset – the dimensions and attributes available. For dimensions, the order in which they are applied is also specified explicitly, since the underlying RDF representation is not ordered: the sequence of statements in an RDF source file does not control the order in which triples are extracted from a triple store containing that source file.

While values are normally attached to observation resources directly, the DSD also allows some information to be attached at the dataset level for brevity. In the case of COINS, all of the values are in thousands of pounds sterling. In principle, this information could be attached at the dataset level, and become implicitly part of each observation. In general, avoiding repetition is a good thing. However, one of our use cases for the linked data version of COINS is to allow others to link to individual observations, which suggests that these observations should be standalone and self-contained – and should therefore have explicit multipliers and units on each observation. The cube vocabulary community has not reached a consensus on this issue as yet; one suggestion is to author data without the duplication, but have the data publication tools "flatten" the compact representation into standalone observations during the publication process. The complete COINS DSD can be seen in listing 4.

Code lists

For each of the dimensions and attributes in the DSD, with the exception of the value property itself (sdmx:obsValue), we define the range of permitted values as a code list. For some datasets, code lists are available as separate resources, which can be converted into linked data form. In the case of COINS, we do not have such standalone code lists. We therefore generate controlled vocabularies to represent the code lists by querying the released COINS data in the PostgresQL database. Specifically, for each column in the database, we construct the set of unique values that are in use, then create a code list resource programmatically. For example, the only values in the territory field of the database are:

  • ENG
  • E&W
  • GB
  • NI
  • SCO
  • UK
  • WAL

We create a code list from this set of values, by defining each to be a member of a SKOS concept scheme. So that we can specify the range constraint on the RDF property denoting the territory attribute of the DSD, we also create an RDFS class to denote the entries in the code list. The final listing is shown in full in listing 5. Currently, these code lists are generated automatically from the COINS data using some custom Ruby code. In future, we expect that commercially available or open-source tools with this capability will become the norm.

Observations and slices of the cube

An individual data point in the data cube vocabulary is an observation, so we represent each transaction from the COINS dataset as an observation, even though this use slightly stretches our real-world understanding of that word. Each observation has one RDF property for each of the dimensions, and a further property for the other fields in COINS, which the data cube vocabulary terms attributes. In addition, we attach properties to denote the value of the observation, its units and multiplier. For COINS, we know from the narrative description of the dataset that the values are in thousands of pounds sterling, so the units are pounds sterling and the multiplier is 103 (encoded using the SDMX code value sdmx-code:unit-mult-3).

Each observation is given a URI, so that it can be referred to by name. For linked data, it is important that the URI actually resolves to a representation of the observation, so that consumers of the data can easily retrieve the details of a given value, and follow links from there to other observations in the dataset. From one point of view, the actual structure of the URI as a sequence of letters and words is not important: for machine processing, it is the use of consistent identifiers that allows resources, such as observations, to be linked together to create context and meaning. However, as human readers of the data, and especially as software developers, it is useful to have the URI's follow a meaningful and useful pattern.

Our COINS observation URI's are composed of a base name, identifying the resources as part of COINS, followed by a compound name sufficient to uniquely identify each observation. The pattern is:

http://{base}/data/coins/{year}/{version}/{key}

The version allows us to distinguish different releases of COINS data: the only snapshot provided by HM Treasury to date was released on the 4th of June 2010, then updated on the 14th. We therefore use 2010-06-14 as the version field, though this will change with future releases. The year field denotes the government year which the data applies to (e.g. 2005-06), while the key is used to uniquely identify each observation in the dataset. Since each observation has a unique combination of values for the seven dimensions used in COINS, we use the short-form name of the code value to form the URI. Thus:

http://source.data.gov.uk/data/coins/2009-10/2010-06-14/snapshot30/bcaa/mod017/31070000/segment.na/mod017.cpi

With respect to the base URI, we use two different subdomains of data.gov.uk: source and finance. The team's strategy here is to distinguish between stable long-lived identifiers, such as the names of statistical concepts or of the COINS DSD, and the outputs from data conversion. Ontology terms, concept schemes and other concepts will be maintained under domain.data.gov.uk, where domain is appropriate to the dataset (e.g. finance for COINS). Data conversion, which is currently being handled by the central Linked Data Kernel Project team, will in time be managed from the data-owning departments. At that point the departments may elect to change the data URI's to a domain that they control. Should this happen in future, URI's under source.data.gov.uk will be automatically redirected to the new identities, so data-consuming applications should see few, if any, disruptions.

Data slices

A slice through a dataset is a view of the data in which some dimensions are fixed. For example, in a sales volume dataset organized with dimensions of time against product line, fixing the time dimension to one of its values, say July 2010, would create a slice showing the volume of sales in each product line for that month. In SDMX, it is common to create slices in which all dimensions except the time dimension are fixed, creating a time series. In COINS, while there is a time dimension, it does not record a particular series of values, such as the months of a year. Rather, the meaning of the time dimension is bound up with the datatype dimension: for example, a forecast of expenditure expected in a forthcoming time period.

We therefore do not create a slice specifically to represent a time series. In fact, it is somewhat unclear at this point which slices through the data will be useful to COINS-RDF users. Currently, we create a set of slices in which the time and datatype are fixed, but we will review this design choice over time. While the datacube vocabulary links datasets to slices, and slices to observations, it also includes a reference from the observation to the dataset in which it appears (see figure 1). The flexibility of SPARQL as a query language makes it straightforward to exploit this link to query observations in a dataset without needing the associated slice:

select ?obs where {?obs qb:dataSet <http://source.data.gov.uk/dataset/coins/fact-table-extract-2009-10>} limit 10

Metadata and provenance

An important benefit of linked data is that we are able to annotate data, at a fine-grained level of detail, to record information about the data itself. This includes where it came from – the provenance of the data – but could include annotations from reviewers, links to other useful resources, etc. Being able to trust that data is correct and reliable is a central value for government-published data, so recording provenance is a key requirement for the COINS data.

We use the Open Provenance Model vocabulary (OPMV) to record the provenance of the COINS data. OPMV allows us to record not only the source of the data and its publisher, but also the stages of transformation that were undertaken to convert the COINS CSV release files into RDF on the web.

Accessing the data

So far we have shown how the data conversion from raw CSV to linked data has worked, but not how to access the data once it is published. To a large extent, we would not expect end users of COINS data to need to access the data in its raw form. Rather, exploration an analysis of the data will be driven by user-friendly tools and web widgets. However, it is not appropriate to speculate on what tools will be created in the future, so in this section we will illustrate some of the more basic means of accessing COINS as linked data. In particular, we will show some sample SPARQL queries, which is the most flexible means of accessing the data. We will also show a configuration of the linked data API, a middleware layer designed to make RDF data more amenable to applications using a JSON-based HTTP application program interface (API). The linked-data API does not offer quite as much flexibility as the full SPARQL end-point, but is considerably easier to use for web applications.

Example SPARQL queries

Before showing some example queries, we should re-iterate some of the guidance from the Treasury on the use of COINS data. First, COINS contains a mixture of tentative and firm data. It also contains a mixture of forecast and outturn data. A transaction value may be entered in draft form by a department, reviewed by Treasury officials, perhaps adjusted, and finally entered in accepted form. Consequently, we must be careful to pick out consistent data, or risk double-counting some values. The Treasury advises various criteria when selecting actual spending. Quoting from the document:

Defining the data required for a given aggregate can be very complex. The example below shows the "fields" and settings needed for those fields to produce a DEL resource outturn number for a given department and year from "live" data (e.g. not from snapshots).

Using [the COINS fact table] you need the following settings:

Data_type: set to ‘Outturn’
Department_code: set to required departments(s);
Time1: set to required year;
Budget_Boundary: set to DEL;
Resource_capital set to Resource (on 2010-11 budgeting basis);
Data_subtype: set to all entries with “approved” or = submitted_outturn (both conditions required).

[From Understanding the COINS data, pp8–9]

On the last point, it is inefficient to have to query all the time for a string match (i.e. does the data-subtype contain the string "approved"?) A small illustration of the flexibility of the open, schemaless data format we are using is that we can augment the code list for data-subtype with a property denoting the status of the transaction. Since at its core the distinction is between transactions that have been or have not been accepted as final by the Treasury, we re-use the terminology approved and draft from the (very simple) status ontology at http://reference.data.gov.uk/def/status. Thus a typical data-subtype code entry becomes:

<http://finance.data.gov.uk/def/coins/data-subtype/afaa>
    a <http://finance.data.gov.uk/def/coins/DataSubtype>, skos:Concept;
    rdfs:label "AME Forecast Approved Adjustments"@en;
    skos:prefLabel "AME Forecast Approved Adjustments"@en;
    skos:notation "afaa";
    skos:topConceptOf <http://finance.data.gov.uk/def/coins/data-subtype>;
    <http://reference.data.gov.uk/def/status> <http://reference.data.gov.uk/def/status/approved>.

Example query: spending for one department

The following query shows the outturn spending for the Department of Children, Schools and Families (DCSF) for 2007-08:

select ?boundaryLabel ?rcLabel (COUNT(?amount) as ?n) (SUM(?amount) as ?total)
where
{
   ?obs
        coins-dimension:departmentCode <http://finance.data.gov.uk/def/coins/department-code/des022>;
        coins-dimension:dataType <http://finance.data.gov.uk/def/coins/data-type/outturn>;
        coins-measure:amount ?amount;
        coins-attribute:budgetBoundary ?boundary;
        coins-attribute:resourceCapital ?rc
        .

    graph <http://source.data.gov.uk/finance/coins/2010-06-14/schema>
    {
        ?boundary rdfs:label ?boundaryLabel.
        ?rc rdfs:label ?rcLabel
    }
}
group by ?boundaryLabel ?rcLabel

This query produces the following result:

-----------------------------------------------------------------
| boundaryLabel    | rcLabel                   | n   | total    |
=================================================================
| "DEL"@en         | "Resource"@en             | 264 | 44725725 |
| "DEL"@en         | "Capital"@en              | 82  |  5225818 |
| "AME"@en         | "Resource"@en             | 14  |    11396 |
| "Not DEL/AME"@en | "Not Resource/Capital"@en | 15  |  4416790 |
-----------------------------------------------------------------
Time: 2.013 sec

Example query: Total spending by department

The following query shows capital and resource spending, by department, for 2007–08. Note the use of a union operator to restrict the reported outcome results to just those categories:

select ?deptLongName ?rcLabel (COUNT(?amount) as ?n) (SUM(?amount) as ?total)
where
{
   ?obs
        coins-dimension:dataType <http://finance.data.gov.uk/def/coins/data-type/outturn>;
        coins-dimension:departmentCode ?dept;
        coins-measure:amount ?amount;
        coins-attribute:resourceCapital ?rc
    .

    {?obs coins-attribute:resourceCapital <http://finance.data.gov.uk/def/coins/resource-capital/capital>}
    union
    {?obs coins-attribute:resourceCapital <http://finance.data.gov.uk/def/coins/resource-capital/resource>}.

    graph <http://source.data.gov.uk/finance/coins/2010-06-14/schema>
    {
        ?rc rdfs:label ?rcLabel.
        ?dept rdfs:comment ?deptLongName.
    }
}
group by ?deptLongName ?rcLabel
order by ?deptLongName ?rcLabel

A sample of the results from this query is show below:

-------------------------------------------------------------------------------------
| deptLongName                                   | rcLabel       | n    | total     |
=====================================================================================
| "Armed Forces Retired Pay, Pensions etc"       | "Resource"@en | 27   | 5474562   |
| "Army Base Repair Organisation"                | "Capital"@en  | 4    | 0         |
| "British Waterways Board"                      | "Capital"@en  | 6    | 0         |
| "Cabinet Office"                               | "Capital"@en  | 40   | 33072     |
| "Cabinet Office"                               | "Resource"@en | 187  | 359887    |
| "Cabinet Office: Civil superannuation"         | "Resource"@en | 42   | 7479267   |
| "Central Office of Information"                | "Resource"@en | 5    | 336       |
| "Charity Commission"                           | "Capital"@en  | 2    | 968       |
| "Charity Commission"                           | "Resource"@en | 25   | 32458     |

Example query: retrieving all data for a given observation

Pulling back data on individual observations in RDF using SPARQL is straightforward. In the simplest case, where we know the identity of the observation (perhaps through some user interactions on the UI), we can simply ask:

describe <http://source.data.gov.uk/data/coins/2009-10/2010-06-14/2009-10/plans/pdiraa/des022/51611300/p0110002/cpid.na>

The output from the above query will appear similar to listing 6. If the specific identity (i.e. the URI) of the observation is not known, we can still use describe with a where clause to select the observations of interest. For example, the following query selects the same observation:

describe ?obs
  where {
      ?obs coins-dimension:dataType <http://finance.data.gov.uk/def/coins/data-type/plans>;
           coins-dimension:dataSubtype <http://finance.data.gov.uk/def/coins/data-subtype/pdiraa>;
           coins-dimension:departmentCode <http://finance.data.gov.uk/def/coins/department-code/des022>;
           coins-dimension:accountCode <http://finance.data.gov.uk/def/coins/account-code/51611300>;
           coins-dimension:programmeObjectCode <http://finance.data.gov.uk/def/coins/programme-object-code/p0110002>
  }

Other means of accessing the COINS data

While SPARQL provides a rich and feature-complete means of accessing the COINS data, it is not a notation that is familiar to many developers. The linked data API (LOD API) provides a simplified entry point to a large subset of SPARQL capability against some RDF data store, in a way that is optimised for access by web applications. In particular, the LOD API provides a REST-style interface to both resources and collections of resources from a triple store, and can return results in JSON format, which makes integration with other web applications easier.

We also set up HTTP-level redirections, via the web server, so that directly resolving the URI of a resource from COINS will directly return a description of that resource from the triple store. This means that performing an HTTP GET on, for example, the identity of an observation will return a description of that observation. Depending on the MIME type requested, this description may be in JSON, XML, RDF or human-readable HTML format. Since the URI of an observation, or dataset, or statistical concept can be used both to identify and to access a resource, it becomes straightforward to add additional information, such as annotations, to the data. As long as there is a persistent store which can contain the extra assertions, and user-contributed information is distinguished from authoritative source data, it is easy to see how, for example, a collaborative annotations service could be created to facilitate "crowd-sourced" dataset investigation.

Both the direct-access approach and the LOD API approach to inspecting the data can be very useful in layering richer, more comprehensible user interfaces on top of complex datasets, such as COINS.

Conclusion

In this article, we described the translation of the first release of HM Treasury's COINS data as linked-data using RDF. We have outlined the process used to convert the data, the various elements of the encoding using the Data Cube vocabulary, and illustrated some sample queries that can be made against the data using the query language SPARQL. This is a first release of the linked data version of COINS, and we anticipate that some clarifications or refinements may be needed in future releases, though we will endeavour to keep the URI structure as stable as possible. Feedback, questions and suggestions for improvement are welcome – please email the author at the address at the head of this article.

Code listings

#!/bin/bash
# Presumes files are located in ./source and will be written to .

for f in source/*.zip
do
    # extract the file basename as $base
    f0=${f%.*}
    base=${f0##*/}

    echo Converting from $f to $base

    # extract the zip file to stdout \
    7z -so e $f |                    \
    # convert UTF-16 to UTF-8        \
    iconv -f utf-16 -t utf-8 |       \
    # remove the word NULL           \
    sed -e 's/@NULL/@/g' |           \
    # omit lines with no data and save file with an alt. extension \
    grep '@' > $base.utf-8.nonull.csv
done

Listing 1: unzipping and adjusting downloaded files

DROP TABLEs if exists coins_fact_table_2009_2010;

CREATE TABLE coins_fact_table_2009_2010 (
    data_type text,
    data_type_description text,
    department_code text,
    department_description text,
    account_code text,
    account_description text,
    data_subtype text,
    data_subtype_description text,
    t_time text,
    counterparty_code text,
    couterparty_description text,
    programme_object_code text,
    programme_object_description text,
    programme_object_group_code text,
    programme_object_group_description text,
    accounting_authority text,
    accounts_capital_current text,
    activity_code text,
    budget_boundary text,
    budget_capital_current text,
    resource_capital text,
    programme_admin text,
    cga_body_type text,
    cofog text,
    dept_group text,
    estimate_line text,
    estimate_line_last_year text,
    estimate_line_next_year text,
    esa text,
    estimates_aina text,
    estimates_capital_current text,
    eu text,
    income_category text,
    lg text,
    lg_body_type text,
    estimate_number text,
    estimate_number_last_year text,
    estimate_number_next_year text,
    nac text,
    near_cash_non_cash text,
    nhs_body_type text,
    pc_body_type text,
    pesa text,
    pesa_1_1 text,
    pesa_aef_grants text,
    pesa_capital_support text,
    pesa_current_grants text,
    pesa_delivery text,
    pesa_non_aef_grants text,
    pesa_services text,
    pesa_tables text,
    pesa_transfer text,
    request_for_resources text,
    request_for_resources_last_year text,
    request_for_resources_next_year text,
    sbi text,
    sector text,
    signage text,
    territory text,
    cbal text,
    grant_provision text,
    levy_funded text,
    local_government_use_only text,
    net_subhead text,
    non_id_exceptions text,
    notocs text,
    obal text,
    outside_tes text,
    pension text,
    pesa_1_1_cap text,
    pesa_1_1_ec_payments text,
    pesa_1_1_local_exp text,
    pesa_1_1_nat_lottery text,
    pesa_1_1_tax_credits text,
    pesa_1_2_nhs text,
    pesa_1_4_locally_financed text,
    pesa_bbc text,
    pesa_stu_loans text,
    social_fund text,
    trust_debt_remuneration text,
    value numeric
);

copy coins_fact_table_2009_2010 from 'fact_table_extract_2009_10.utf-8.nonull.csv' delimiter '@' csv header null 'null';

DROP SEQUENCE if exists coins_fact_table_2009_2010_rowid_seq;
CREATE SEQUENCE coins_fact_table_2009_2010_rowid_seq;

ALTER TABLE coins_fact_table_2009_2010
    ADD rowid INT UNIQUE;

ALTER TABLE coins_fact_table_2009_2010
    ALTER COLUMN rowid
        SET DEFAULT NEXTVAL('coins_fact_table_2009_2010_rowid_seq');

UPDATE coins_fact_table_2009_2010
    SET rowid = NEXTVAL('coins_fact_table_2009_2010_rowid_seq');

Listing 2: loading a source datafile into PostgresQL

@prefix rdf:                <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:               <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:                <http://www.w3.org/2002/07/owl#> .
@prefix skos:               <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:                <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf:               <http://xmlns.com/foaf/0.1/> .
@prefix void:               <http://rdfs.org/ns/void#> .
@prefix dcterms:            <http://purl.org/dc/terms/> .
@prefix sdmx:               <http://purl.org/linked-data/sdmx#> .
@prefix sdmx-concept:       <http://purl.org/linked-data/sdmx/2009/concept#> .
@prefix sdmx-dimension:     <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-attribute:     <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix sdmx-measure:       <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix sdmx-code:          <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix sdmx-subject:       <http://purl.org/linked-data/sdmx/2009/subject#> .
@prefix qb:                 <http://purl.org/linked-data/cube#> .
@prefix coins-data:         <http://source.data.gov.uk/data/coins/> .
@prefix coins-dsd:          <http://source.data.gov.uk/dsd/coins/> .
@prefix coins-dimension:    <http://source.data.gov.uk/dsd/coins/dimension/> .
@prefix coins-attribute:    <http://source.data.gov.uk/dsd/coins/attribute/> .
@prefix coins-measure:      <http://source.data.gov.uk/dsd/coins/measure/> .

Listing 3: RDF prefixes used in examples

<http://source.data.gov.uk/dsd/coins>
    a qb:DataStructureDefinition;
    qb:component [qb:componentProperty sdmx-dimension:refPeriod; qb:order 0];
    qb:component [qb:componentProperty coins-dimension:dataType; qb:order 1];
    qb:component [qb:componentProperty coins-dimension:dataSubtype; qb:order 2];
    qb:component [qb:componentProperty coins-dimension:departmentCode; qb:order 3];
    qb:component [qb:componentProperty coins-dimension:accountCode; qb:order 4];
    qb:component [qb:componentProperty coins-dimension:programmeObjectCode; qb:order 5];
    qb:component [qb:componentProperty coins-dimension:counterpartyCode; qb:order 6];
    qb:component [qb:componentProperty coins-attribute:accountingAuthority];
    qb:component [qb:componentProperty coins-attribute:accountsCapitalCurrent];
    qb:component [qb:componentProperty coins-attribute:activityCode];
    qb:component [qb:componentProperty coins-attribute:budgetBoundary];
    qb:component [qb:componentProperty coins-attribute:budgetCapitalCurrent];
    qb:component [qb:componentProperty coins-attribute:cgaBodyType];
    qb:component [qb:componentProperty coins-attribute:cofog];
    qb:component [qb:componentProperty coins-attribute:departmentGroup];
    qb:component [qb:componentProperty coins-attribute:esa];
    qb:component [qb:componentProperty coins-attribute:estimateLine];
    qb:component [qb:componentProperty coins-attribute:estimateLineLastYear];
    qb:component [qb:componentProperty coins-attribute:estimateLineNextYear];
    qb:component [qb:componentProperty coins-attribute:estimateNumber];
    qb:component [qb:componentProperty coins-attribute:estimateNumberLastYear];
    qb:component [qb:componentProperty coins-attribute:estimateNumberNextYear];
    qb:component [qb:componentProperty coins-attribute:estimatesAina];
    qb:component [qb:componentProperty coins-attribute:estimatesCapitalCurrent];
    qb:component [qb:componentProperty coins-attribute:incomeCategory];
    qb:component [qb:componentProperty coins-attribute:nac];
    qb:component [qb:componentProperty coins-attribute:pesa11];
    qb:component [qb:componentProperty coins-attribute:pesaCurrentGrants];
    qb:component [qb:componentProperty coins-attribute:pesaDelivery];
    qb:component [qb:componentProperty coins-attribute:pesaServices];
    qb:component [qb:componentProperty coins-attribute:pesaTransfer];
    qb:component [qb:componentProperty coins-attribute:programmeAdmin];
    qb:component [qb:componentProperty coins-attribute:programmeObjectGroupCode];
    qb:component [qb:componentProperty coins-attribute:requestForResources];
    qb:component [qb:componentProperty coins-attribute:requestForResourcesLastYear];
    qb:component [qb:componentProperty coins-attribute:requestForResourcesNextYear];
    qb:component [qb:componentProperty coins-attribute:resourceCapital];
    qb:component [qb:componentProperty coins-attribute:sbi];
    qb:component [qb:componentProperty coins-attribute:sector];
    qb:component [qb:componentProperty coins-attribute:signage];
    qb:component [qb:componentProperty coins-attribute:territory];
    qb:component [qb:componentProperty coins-measure:amount; qb:componentAttachment qb:DataSet];
    .

Listing 4: COINS DSD

<http://finance.data.gov.uk/def/statistical-concept/territory>
    a sdmx:Concept;
    rdfs:label "territory"@en;
    skos:prefLabel "territory"@en;
    rdfs:comment """This indicates the UK country in which
    residents benefit from that expenditure. Spending should be allocated
    to an area, or combination of areas, on the basis of who benefits
    from the spending.""";
    <http://purl.org/dc/terms/source> <http://www.hm-treasury.gov.uk/d/coins_guidance.pdf>;
    <http://www.epimorphics.com/vocabs/coins-generate#columnName> "territory"
    .

<http://finance.data.gov.uk/def/Territory>
    a rdfs:Class, owl:Class;
    rdfs:subClassOf skos:Concept;
    rdfs:label "territory values class"@en;
    skos:prefLabel "territory values class"@en;
    rdfs:comment "The class of values in the territory code-list scheme";
    .

<http://finance.data.gov.uk/def/territory>
    a skos:ConceptScheme;
    rdfs:label "territory code list concept scheme"@en;
    skos:prefLabel "territory code list concept scheme"@en;
    skos:hasTopConcept <http://finance.data.gov.uk/def/territory/eng>,
          <http://finance.data.gov.uk/def/territory/e&w>,
          <http://finance.data.gov.uk/def/territory/gb>,
          <http://finance.data.gov.uk/def/territory/ni>,
          <http://finance.data.gov.uk/def/territory/sco>,
          <http://finance.data.gov.uk/def/territory/uk>,
          <http://finance.data.gov.uk/def/territory/wal>
    .

<http://finance.data.gov.uk/def/territory/eng>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "ENG"@en;
    skos:prefLabel "ENG"@en;
    skos:notation "eng";
    rdfs:comment "England";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

<http://finance.data.gov.uk/def/territory/e&w>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "E&W"@en;
    skos:prefLabel "E&W"@en;
    skos:notation "e&w";
    rdfs:comment "England and Wales";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

<http://finance.data.gov.uk/def/territory/gb>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "GB"@en;
    skos:prefLabel "GB"@en;
    skos:notation "gb";
    rdfs:comment "Great Britain";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

<http://finance.data.gov.uk/def/territory/ni>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "NI"@en;
    skos:prefLabel "NI"@en;
    skos:notation "ni";
    rdfs:comment "Northern Ireland";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

<http://finance.data.gov.uk/def/territory/sco>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "SCO"@en;
    skos:prefLabel "SCO"@en;
    skos:notation "sco";
    rdfs:comment "Scotland";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

<http://finance.data.gov.uk/def/territory/uk>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "UK"@en;
    skos:prefLabel "UK"@en;
    skos:notation "uk";
    rdfs:comment "United Kingdom";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

<http://finance.data.gov.uk/def/territory/wal>
    a <http://finance.data.gov.uk/def/Territory>, skos:Concept;
    rdfs:label "WAL"@en;
    skos:prefLabel "WAL"@en;
    skos:notation "wal";
    rdfs:comment "Wales";
    skos:topConceptOf <http://finance.data.gov.uk/def/territory>;
    .

Listing 5: sample code list encoding – territory

<http://source.data.gov.uk/data/coins/2010-06-14/2009-10/plans/pdiraa/des022/51611300/p0110002/cpid.na>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
                    <http://purl.org/linked-data/cube#Observation> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/accountingAuthority>
                    <http://finance.data.gov.uk/def/coins/accounting-authority/aa101> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/accountsCapitalCurrent>
                    <http://finance.data.gov.uk/def/coins/accounts-capital-current/a-cap> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/activityCode>
                    <http://finance.data.gov.uk/def/coins/activity-code/act0201> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/budgetBoundary>
                    <http://finance.data.gov.uk/def/coins/budget-boundary/del> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/budgetCapitalCurrent>
                    <http://finance.data.gov.uk/def/coins/budget-capital-current/b-cap> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/cgaBodyType>
                    <http://finance.data.gov.uk/def/coins/cga-body-type/dept> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/cofog>
                    <http://finance.data.gov.uk/def/coins/cofog/cofog090102> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/departmentGroup>
                    <http://finance.data.gov.uk/def/coins/department-group/dept022> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/esa>
                    <http://finance.data.gov.uk/def/coins/esa/esa-d92b> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimateLine>
                    <http://finance.data.gov.uk/def/coins/estimate-line/el12> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimateLineLastYear>
                    <http://finance.data.gov.uk/def/coins/estimate-line-last-year/e12ly> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimateLineNextYear>
                    <http://finance.data.gov.uk/def/coins/estimate-line-next-year/e12ny> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimateNumber>
                    <http://finance.data.gov.uk/def/coins/estimate-number/me01> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimateNumberLastYear>
                    <http://finance.data.gov.uk/def/coins/estimate-number-last-year/m01ly> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimateNumberNextYear>
                    <http://finance.data.gov.uk/def/coins/estimate-number-next-year/m01ny> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/estimatesCapitalCurrent>
                    <http://finance.data.gov.uk/def/coins/estimates-capital-current/e-gra> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/nac>
                    <http://finance.data.gov.uk/def/coins/nac/n1001> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/pesaCurrentGrants>
                    <http://finance.data.gov.uk/def/coins/pesa-current-grants/ocg> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/pesaDelivery>
                    <http://finance.data.gov.uk/def/coins/pesa-delivery/regional> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/pesaServices>
                    <http://finance.data.gov.uk/def/coins/pesa-services/individual> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/programmeAdmin>
                    <http://finance.data.gov.uk/def/coins/programme-admin/prog> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/programmeObjectGroupCode>
                    <http://finance.data.gov.uk/def/coins/programme-object-group-code/p01s100101> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/requestForResources>
                    <http://finance.data.gov.uk/def/coins/request-for-resources/rfr01> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/requestForResourcesLastYear>
                    <http://finance.data.gov.uk/def/coins/request-for-resources-last-year/r01ly> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/requestForResourcesNextYear>
                    <http://finance.data.gov.uk/def/coins/request-for-resources-next-year/r01ny> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/resourceCapital>
                    <http://finance.data.gov.uk/def/coins/resource-capital/capital> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/sbi>
                    <http://finance.data.gov.uk/def/coins/sbi/sbi01> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/sector>
                    <http://finance.data.gov.uk/def/coins/sector/la> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/signage>
                    <http://finance.data.gov.uk/def/coins/signage/positive> ;
      <http://finance.data.gov.uk/dsd/coins/attribute/territory>
                    <http://finance.data.gov.uk/def/coins/territory/eng> ;
      <http://finance.data.gov.uk/dsd/coins/dimension/accountCode>
                    <http://finance.data.gov.uk/def/coins/account-code/51611300> ;
      <http://finance.data.gov.uk/dsd/coins/dimension/counterpartyCode>
                    <http://finance.data.gov.uk/def/coins/counterparty-code/cpid.na> ;
      <http://finance.data.gov.uk/dsd/coins/dimension/dataSubtype>
                    <http://finance.data.gov.uk/def/coins/data-subtype/pdiraa> ;
      <http://finance.data.gov.uk/dsd/coins/dimension/dataType>
                    <http://finance.data.gov.uk/def/coins/data-type/plans> ;
      <http://finance.data.gov.uk/dsd/coins/dimension/departmentCode>
                    <http://finance.data.gov.uk/def/coins/department-code/des022> ;
      <http://finance.data.gov.uk/dsd/coins/dimension/programmeObjectCode>
                    <http://finance.data.gov.uk/def/coins/programme-object-code/p0110002> ;
      <http://finance.data.gov.uk/dsd/coins/measure/amount>
                    1119900 ;
      <http://purl.org/dc/elements/1.1/source>
                    2703 ;
      <http://purl.org/linked-data/cube#dataSet>
                    <http://source.data.gov.uk/dataset/coins/coins_fact_table_2009_2010> ;
      <http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure>
                    <http://finance.data.gov.uk/def/currency/GBP> ;
      <http://purl.org/linked-data/sdmx/2009/attribute#unitMult>
                    <http://purl.org/linked-data/sdmx/2009/code#unitMult-3> ;
      <http://purl.org/linked-data/sdmx/2009/dimension#refPeriod>
                    <http://finance.data.gov.uk/def/coins/time/2009-10> .

Figure 6: one encoded observation