Combined Online Information System

Posted By HMT on 04/06/2010 - 08:08

81 comments
Your rating: None Average: 4.1 (49 votes)

The Combined Online Information System (COINS) is a database of UK Government expenditure provided by government departments. The data is used to produce reports for Parliament and the public including: expenditure data in the Budget and Pre-Budget reports; Supply Estimates; Public Expenditure Statistical Analyses (PESA); the monthly Public Sector Finance Releases. It is also used by the ONS for statistical purposes.

Overview

Released
2010-06-04
Last updated
2010-06-14
Update frequency
Periodic
Tags
Department Her Majesty's Treasury
Wiki

Resources

Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Download
Full description

Details

Name
coins
Licence
OKD Compliant::UK Crown Copyright with data.gov.uk rights
Version
-
Geographic coverage
Geographical granularity
-
Temporal granularity
months
Agency
-
Precision
-
Taxonomy url
-
Temporal coverage to
-
Temporal coverage from
-
Categories
-

Contact information

Team
COINS Service Management team

Add new comment

Comments (81)

data quality questions

I wonder why these data are encoded in utf16?

Is there a formal description of the schema - the data's hardly exported in tnf, but it would at least be nice to get the format descriptions. I cannot find this info in the treasury docs.

fwiw, the raw data for facts..2009_10 compresses down to 15MB with bzip2 after ripping out the NULLs, as opposed to the 70MB on the web site.

The coins data is actually

The coins data is actually pretty small

try www.publicspendingdata.co.uk for smal csv files for each department / Program / account

More tips for Linux users

The file adjustment_table_extract_2009_10.txt contains many broken lines, i.e. some fields contain one or more newline characters. Here is a small Perl script to convert from UTF-16 to UTF-8, repair the breaks and do some filtering:

#!/usr/bin/perl
# - convert text encoding from UTF-16 to UTF-8
# - convert DOS/Windows newlines (CR-LF) to host system's newlines
# - fix broken lines (spurious line breaks in some records)
# - delete "NULL"s
#
# Examples of usage:
# coins.pl 82 < infile.txt > outfile.txt
# zcat infile.gz | coins.pl 82 > outfile.txt
# 7z -so e infile.zip | coins.pl 82 > outfile.txt
#
my $fields = $ARGV[0]; # Number of fields expected (81 in fact table, 82 in adjustment table)
my $part;
binmode(STDIN, ':raw:encoding(UTF-16)');
binmode(STDOUT, ':raw:encoding(UTF-8)');
while (<STDIN>) {
s/[\n\r]//g; # remove CR and LF
s/\@NULL/@/g; # remove "NULL"
my $data = $_;
my $str = $data; # count the number of fields
$str =~ s/[^@]//g;
if (length($str) > $fields-2) { # enough fields?
print "$data\n";
$part = "";
} else { # not enough fields
$part .= $data; # join the parts that we have so far
$str = $part;
$str =~ s/[^@]//g;
if (length($str) > $fields-2) { # do we have enough now?
print "$part\n";
$part = "";
}
}
}

Geographic Granularity - a null field

Local government... (also citizen bodies?) would I am sure appreciate some drill down route-map into what the local spends are... perhaps down to NUTS 3 level or even district council.
From a quick scan of the code table there I can't see a way into this..

Problems with the adjustments file

Is anyone else having problems with the adjustments file? Having converted the 2009-10 file to utf8, I'm finding that some records have newlines in them, so that when I try to import the file into PostgresQL the loader complains that lines are truncated. I'm also seeing some lines that have more fields than they should.

2009_10 adjustments file

Hi Ian
I've not got any embedded newlines. however, there are a lot of files with embedded @ signs in the data, leading to (I think) 466 lines with broken formatting. Notwithstanding comments on the site, it's not simple to identify where these have gone wrong with general regex's: 528 lines have the string "@ ", 8295 the string " @", 406 lines have the string " @ ".

I think that the only realistic way to get this fixed is to ask for better exporting mechanism.

I find it ironic that there's a big ad for the semantic web on this page as SW depends on accurate data.

I must say that I'm struggling to understand what the data means, eg there seem to be some very odd records (GDP figures and GDP deflators), reading the notes, I cannot see how the number of records can fall between forecasts, but it does and some of the snapshots look very short.
Tim

Adjustment Files - Rosslyn

The adjustment files were very poorly prepared. We found thousands of errors in both files. We had to adjust our data validation tool and create a version just for these files. Combinations of missing end of lines and poor column delimeters - who ever thought of using '@' to seperate columns, when this is also widely used in the descriptions. Obviously the data prepartion must have been done in a hurry and not tested.
Couple of hints - start looking for the following in the rows and replace the '@' with a space:
1. ' @ '
2. '@ xx-' where xx can be any value from 01 to 31
Tools such as UltraEdit are quite useful but you really need to a data validation application that can be modified on the fly, else these files will not be accurate.
Please visit https://rapidgateway.rapidintel.com to get access to the data in user friendly format.

Patience required

To all those who are complaining about the size of the files, format, etc, please just have a bit of patience. There are many groups who are working to provide more person-friendly ways of drilling down into this data. It's a massive step that it has been released at all - I've been involved in open data for some time, and I didn't expect HMT to release COINS data on anything like this timescale. So kudos to them. Everyone else just hang-on a short while until the user-friendly interfaces come along, or better still pitch in with ideas of the capabilities you want. What queries would you want to ask if you had a suitably user-friendly tool?

Tips for Linux users - correction 2

Sorry, the HTML filtering on the web site corrupted the text file in my previous comment. Here it is again:

#!/usr/bin/perl
binmode(STDOUT, ':raw:encoding(UTF-8)');
binmode(STDIN, ':raw:encoding(UTF-16)');
print while <STDIN>;

Tips for Linux users - correction

The method I gave in my last comment for converting from UTF-16 to UTF-8 doesn't work for the large files due to a limitation in iconv.

Instead, you can use a small Perl script. Copy the following four lines into a text editor:

#!/usr/bin/perl
binmode(STDOUT, ':raw:encoding(UTF-8)');
binmode(STDIN, ':raw:encoding(UTF-16)');
print while ;

Save the file as "utf16to8.pl" in your home directory. Make the file executable:

chmod +x utf16to8.pl

Now you can convert a text file of any size from UTF-16 to UTF-8 as follows:

~/utf16to8.pl < filename.txt > filename.utf-8.txt

Or, to unzip, convert and remove NULLs all in one go, type:

7z -so e filename.zip | ~/utf16to8.pl | sed -e 's/@NULL/@/g' > filename.utf-8.nonull.txt

Re: Tips for Linux users

This was very helpful - thank you. The iconv solution worked fine on my 64bit Ubuntu 10.04 system.

Linux users more tips

#!/bin/bash
wget -o getdatai.coinsUrls2010.log -i coinsUrls2010
wget -o getdata.coinsUrls2009.log -i coinsUrls2009

for i in `ls -1 *.zip`
do
echo uncompressing and cleaning $i
cat $i | funzip | tr -dc "[:alnum:][:space:][:punct:]" > ../data/$i.txt
rm $i
done

note that using a stream decompressor such as funzip means you dont need p7 or large files support
Ive put the out turn data into a html format a bit like the gurdian on
http://www.publicspendingdata.co.uk/

and there are also .csv files of just the Outturn data for you to download and analyze in excel... the files are pretty small and easily downloadable which might be useful for those suffering decompression difficulties...

Tips for Linux users

The compressed file can be extracted using 7zip. The result is a text file in UTF-16 encoding which can be converted to the more common UTF-8 encoding using iconv. The NULL strings can be removed using sed.

Extract the data:

7z e filename.zip

where "filename.zip" is, for example, fact_table_extract_2009_10.zip.

Convert from UTF-16 to UTF-8:

iconv -f utf-16 -t utf-8 filename.txt > filename.utf-8.txt

Remove NULL strings:

cat filename.utf-8.txt | sed -e 's/@NULL/@/g' > filename.utf-8.nonull.txt

Or, more succinctly, you can do the whole job in one go:

7z -so e filename.zip | iconv -f utf-16 -t utf-8 | sed -e 's/@NULL/@/g' > filename.utf-8.nonull.txt

On Debian based systems, 7z is part of the p7zip-full package.

Hope that helps someone.

Improved spend visibility

Rosslyn Analytics, a London-based technology company that specializes in enabling organisations to quickly and easily obtain spend visibility, has launched a dedicated portal that gives the general public the ability to view the UK government’s recently published public sector data from COINS. This portal can be found at https://rapidgateway.rapidintel.com.

A few more useful links

The Open Knowledge Foundation have been logging some of their learning on accessing the data in this Etherpad: http://pad.okfn.org/coins

The page includes links to spreadsheets created to work out what the codes in the dataset mean, contains some example python code for accessing the data. It also links to this set of Python scripts: http://bitbucket.org/okfn/coins

Rosslyn Analytics have imported the data into their analysis tool which generates graphical representation of some of the data: http://www.rosslynanalytics.com/newspress/index.php/rosslyn-analytics-ma...

Equivalent data set for income?

Will it be possible to get an equivalent of the COINS dataset for Income that the treasury receives?

Well done

It is nice to see that the goverment is being open with the information it hold.

Well done.

Hard work extracting

Perhaps in the coming days the data.gov.uk guardians can provide alternative formats for the data.

Having spent a full minute downloading the data (wow for torrents), it has taken me a long time to unzip.

Perhaps encode as ASCII and offer in a .tar.gz format?

So many people complaining about the format...

It's possibly the best format this data could have been released in. Seriously, this way anyone with a bit of python/perl/php/ruby/etc hacking knowledge and first year statistics tuition can do their own data mining and present it to the world, rather than everyone being stuck with the govt's pre-chewed spun-to-death version.

Complaints? Frankly I'm amazed the government has done this. Long may it continue.

Live Data

It would be nice to see access to the live data, via a database connection.
And its all very well expecting people to create software to access the data, but it would be great if some code that reads the files was made available along with it, that would go along way to explaining the format of the data.

Awesome

This is great. As mentioned above, that this is the raw data is very good. For those who are having a tough time importing it, rest assured that there are a whole bunch of concerned citizens and organisations that will be putting it together in a more digestible format very soon.

Data (So Far)

So I've spent a few hours working at the data - first part was going through the process of putting it in UTF8, then re-writing it into a more traditional CSV, importing the whole thing into MySQL and started rationalizing it for things such as Department Codes, Account ID etc, setting table indexes and pulling out DISTINCT sets.

Unfortunately the more processing I do the more stags I hit... SQL info result statements included in the CSV, clusters of non-integer account codes and so forth.

It looks like the Guardian has beat me to the punch in creating a searchable index, but the data format hasn't exactly helped - I understand this was rushed but IMO spitting out cleaned-up data with separate CSVs for each department and a list of department codes in an additional CSV would have taken little additional time on the governments end and have helped those of us not on the civil service payroll enormously :)

-- Mark

Impressive

This is a fantastic initiative. This sort of transparency should be the "default" for all government data. Now could you guys export this to North America now? ;-)

For Linux users

To decompress the fact table, install p7zip and run "7za e fact_table_extract_2009_10.zip".

The files themselves are encoded as UTF-16 (not UTF-8) text, which is a remarkably inefficient encoding for data that is probably mostly ASCII to begin with, but that's probably the fault of whatever database export/dump utility they are using on the system.

With data of this level of volume, you can't expect to use ordinary desktop tools for analysis. The Government should be commended for making this available - I don't know of any other democracy reaching this level of transparency.

Nice!

I think this might take a few minutes to sift through.

I would urge everybody to 'read the readme' (as it were) to be sure you know what it is that you are looking at. At least skim it, especially the "what the data won't tell you" bit.

For people who are just curious as to what it is, pages 11-20 of the guidance pdf have explanations of the column headings which might answer a lot of questions without having to download all the data.

It might be helpful for someone to add to the above datasets a 'representative sample' file with e.g. a hundred or so lines of data, so people can at least get an idea of what it looks like.
Rather more detailed/complicated than "DWP paid x for a red stapler on April 1st"...

-- DW.

Ways of exploring the data (without downloading it)

The guardian have produced a COINS Explorer interface for people who want to browse the data here: http://coins.guardian.co.uk/coins-explorer/search

The guardian site allows you to drill down into particular departments or areas of the dataset and then to export a CSV file. By heading deep enough into the data it should be possible to get simple CSV files which will open even in old versions of Excel etc.

The WhereDoesMyMoneyGo website also host a copy of the dataset here: http://coins.wheredoesmymoneygo.org/coins

For anyone who just wants to browse what's available at present - those should save a lot of hacking about with the data.

another concern

I do not want to complain on the weight or the format of the data, but it seems to me that this is just a query from a database and nobody did care about the quality after it (who uses @ as a separation element?!?!. the most difficult thing is not importing the data in this or that system but controlling that everything is correct because a few records do not import in the right shape. however, I really enjoy reading this sentence from the guidance booklet "Therefore any data extracted from the database are almost immediately out of date." any?in 5Gb data?

The Plan

To those complaining about the format and size of the data....

The new government has stated that they expect the digesting, analysis and redistribution of this data to be conducted by 3rd parties. These 3rd parties, who are willing to put money into making the data more readable and then potentially sell it on, are the ones who are more likely to download and use the data in it's current form.

Good Choice

I'm glad to see this data available, and shared in such a format that allows for easy analysis.

The choice of sharing medium (bittorrent) is also a good choice. It allows large files to be shared while reduce load on the host server, and therefore lowering the necessary server costs saving tax payer money. In theory at least.

Thank you.

Corrupt zip file ?

Zipped 2009/10 Fact table, 67MiB (4.28GiB uncompressed) is coming up as corrupt when I try and extract the contents.

addressing some concerns

to all the people who are commenting that the information is being released in a difficult to consume form, this is GOOD. this is what we really need, the RAW data. if it were sliced up into easier to consume packages, or presented in nice webpages, it would just mean the civil service taking us further away from the raw original data in the form it was produced, further from the truth basically.

now that the raw data has been released, hacker types (in the traditional sense) the world over can dig through it and present it in any form they/you want. in the coming weeks months and years you will start to see other websites where you will be able to browse, consume and interpret the data in every form imaginable, from traditional spreadsheet style interfaces to insane hollywood style 3d topographical visualisations. these will be built by citizens, using the raw data finally releaed here by government. this page and these csv files are meant for consumption by these hackers, if they were put into a form easily understandable/consumable by the general public this would completely defeat the point of the exercise.

right, off to hack some code! ive been looking for a data set to feed into those python machine learning algorithms they taught us at uni...

Formatting

Could whomever is responsible for creating these files please give it another go with the following:

1. Use UTF-8 or ASCII instead of UTF-8 - Almost half this file is empty space!
2. Use proper CSV, double quoted where needed
3. Compress to RAR
4. Replace NULL with an empty field.

I've just done it here on my side and been able to reduce the 09-10 file down from the original 4 GB to 1.8GB, 23MB RAR - which is small enough to open in Excel if you've got 6+ GB of RAM installed.

- Mark Randall

Re: Formatting

3. Compress to RAR

Right, you want open and transparent data that's compressed using a proprietary algorithm. Isn't that defeating the purpose? Not to mention, PPMd and LZMA (as implemented by 7-zip) have generally yielded higher compression ratios than RAR's proprietary mess.

Not to mention that any RAR files created would be wholly illegal unless someone paid for WinRAR. No such restriction on 7-zip and it's technologies.

From the WinRAR license:

8. There are no additional license fees, apart from the cost of purchasing a license, associated with the creation and distribution of RAR archives, volumes, self-extracting archives or self-extracting volumes. Legally registered owners may use their copies of RAR/WinRAR to produce archives and self-extracting archives and to distribute those archives free of any additional RAR royalties.

In short, you must pay for a license before you can distribute archives. The whole purpose of opening up this dataset is to allow anyone to audit the government's spending, with the goal of reducing wasteful spending. Paying for WinRAR when superior and free alternatives exist is just more wasteful spending.

RAR is inappropriate

As it is a secret, proprietary format which cannot be opened with free software.

sorry Mark...

...it's not the number of bytes that's important here but the number of records (rows). We all know that we can reduce the number of fields (columns) as many are redundant and others are irrelevant, but 2+ mln records are too many for Xls, also if you have just two fields. I can open the entire 4Gb in Xls but before visualize the content I'm told not the entire file would be open...

Making the Data Accessible

Please provide this data in a form that can be drilled down from within a browser.

You get a tick for making the data available, but loose points for inaccessibility!

Is this CSV?

I unzipped the (non-torrent) version of the 09/10 adjustment table and it wasn't CSV but rather 2-sign delimited (think tab-delim with an @ instead of a tab). also the data wasn't clean for import to something like Excel as it had some lines of non-table data at the end - just the sort of thing to upset already hard-pushed spreadsheet importers on non-high end rigs.

Checksum for 2009/10 zip file

sha1sum fact_table_extract_2009_10.zip
c163e7e1fba578e38bb3cf295bd1be0514d2fcf9 fact_table_extract_2009_10.zip

Corrupt on unzipping. Does anyone else get a different sha1sum checksum for this file?

COINS contains millions of

COINS contains millions of rows of data; as a consequence the files are large and the data held within the files complex. Using these download files will require some degree of technical competence and expertise in handling and manipulating large volumes of data. It is likely that these data will be most easily used by organisations that have the relevant expertise, rather than by individuals. By having access to these data, institutions and experts will be able to process and present them in a way that is more accessible to the general public. In addition, subsets of data from the COINS database will also be made available in more accessible formats by August 2010.

Mac problems?

It seems the out of the box unarchiver on OSX can't deal with large files, such as the COINS fact table. It spits out a cpgz file, which if you try to unzip just creates another zip, and so on, and so on.

Thankfully I managed to unzip it using Keka (http://www.kekaosx.com/en/), which is an OSX form of 7zip. This worked without any problems. Just thought I'd share in case anyone else was having any difficulties.

Unable to open

Typical Tory duplicity - how the hell is an ordinary person meant to open this.

Try the user interface

Try the user interface provided by the guardian newspaper

geoff

I downloaded the file but as

I downloaded the file but as I am no computer whizz I have been unable to find out how to open it. Why have you made it so complicated? I normally use Adobe and beyond that I am out of my depth.

How to unzip

Please list successful solutions for Windows XP and Linuz (Debian Etch)

I have tried several programs on both OS and non-succeed (e.g. on fact 2010)

Thanks

Mark

How to unzip

PKZIP works.
The file unzips as a .txt file, not .csv, so I simply changed the filename to .csv and opened it in Excel 2010, which asked for the delimiter, which is @ (not comma). Then it works.

Except that Excel 2010 has 'only' about 1,000,000 rows, which isn't nearly enough. In fact the data that did download is about 50% nil values.

How to open the remainder of the file? I don't know.

Why the hell isnt the data

Why the hell isnt the data stored centrally in a database with webpages drilling into it? Most ppl dont know what to do with a csv file.

Surely if you going to go to the effort of publishing then do it properly so that non technical people actually do have access.

How far should they really go?

I can open it fine, with a bit of trial and error.

Try microsoft website, download the free version of sql server & import the data. easy peasy.

I'm all for freedom of information, but I'd be annoyed if the govt then spent millions at inflated rates for some contractor/fms company to come in & make it "easy" for all to browse at their leisure. Importing this data is far simpler than trying to make some meaning of it. Think about it.

Having the data yourself

Having the data yourself means that no-one can claim the govt is preventing you finding facts. Also, these services will spring up as interested parties analyse the data, and they'll be independent. Much more sensible this way, I'd say.

Interesting that the data is

Interesting that the data is downloadable via a torrent considering the problems such services cause via copyright infringement.

Torrents

Torrents are just a technology, it is not inherently linked to copyright infringement. It's like saying 'Interesting that this is available on the Internet, where there is lots of porn too.' Perhaps true, but not really of any point.