Guidance: A very basic standard file format for data
We have talked before about some specialised ways in which to publish our data, especially Linked Data. However, we are mindful of Tim Berners-Lee’s advice to publish in “raw” form first, and then worry about richer forms. Those richer forms may take longer to produce, but they add semantic information and make the data useful in more situations.
To meet the urgent need for open, standardised formats to support the Government’s transparency commitments this year, we needed a recommended, basic format that is easy for publishers to produce and that serves the community of developers. To this end, we have come up with a very basic set of technical guidance on the simplest possible format: "flat" files with separators between the fields. This is most widely seen in Comma-Separated Values or “CSV” form.
Although CSV is as basic as it gets, there are still several technical things to consider in such a specification. In the jargon, these are the delimiter and terminator characters, how fields are escaped and quoted, and the character set encoding. Experience on data.gov.uk so far shows that different publishers can take different decisions about each of these elements of a “CSV” file. As we start to publish data about the same theme (spending, salaries, organograms and so on) from many different departments, developers need to be assured that there is a precise and common standard for all the “CSV” files which departments will be publishing.
An important consideration is that Government departments need as far as possible to be able to generate the data files with what they already have. It is not just that additional software could have potential costs (indeed there is usually good open source software available); the process of getting new software qualified and installed on each separate departmental network is both costly and, perhaps more importantly, involves substantial lead times. There are similar issues with bespoke coding. What’s more, we need to be mindful of the skills of the people who will be producing the information. So our strategy has been to set a standard that the vast majority of data owners in Government can meet using what they already have on their desktops - and to define that sufficiently so that those who are writing bespoke extraction software can also produce to that standard.
At the moment the vast majority of government desktops have Microsoft Excel and the vast majority of datasets that publishers will be releasing are held in, or manipulated with, Excel. The Government has policies to make greater use of open source in the future, but as far as data.gov.uk goes “we are where we are”. So asking for a format that Excel cannot produce would be very difficult to implement, increasing costs and slowing down the stream of data released.
Microsoft Excel’s “Save As CSV” function produces files that are reasonable on most of these grounds. Records are terminated with a newline (encoded Windows-style as CRLF), and fields are separated with a single comma character (,). Where a field includes a double-quote character ("), each one is replaced with two of them. Where a field includes a comma or a double-quote, the whole field is enclosed in double-quotes, one at the start and one at the end. An example of this is:
Test Test, Test" Test test “Test" ","",,,",,
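The quoting rules described above can be reproduced outside Excel. The following is a minimal sketch, assuming Python's standard `csv` module and its built-in `excel` dialect (comma delimiter, doubled double-quotes, quoting only where needed, CRLF terminators). The function name `to_excel_style_csv` and the sample rows are illustrative, not part of this guidance.

```python
import csv
import io


def to_excel_style_csv(rows):
    """Serialise rows using the csv module's "excel" dialect.

    That dialect separates fields with a comma, ends records with CRLF,
    doubles any embedded double-quote, and quotes a field only when it
    contains a comma, a double-quote or a line break.
    """
    buffer = io.StringIO(newline="")  # keep the CRLF terminators untouched
    writer = csv.writer(buffer, dialect="excel")
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative data only: one field with a comma, one with double-quotes.
sample_rows = [
    ["name", "note"],
    ["Smith, J", 'said "hello"'],
    ["plain", ""],
]

print(repr(to_excel_style_csv(sample_rows)))
# 'name,note\r\n"Smith, J","said ""hello"""\r\nplain,\r\n'
```

Bespoke extraction software that writes files this way should produce output that parses the same way as a file saved from Excel, at least for the delimiter, quoting and terminator rules.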
Excel’s CSV files use the Microsoft Windows “Latin-1” character encoding (more precisely Windows-1252 on Western European systems, and sometimes known as “ANSI”), rather than a form of Unicode like UTF-8. This means that developers need to tell their code which encoding to expect, and that text in non-Western European scripts cannot be imported without manual work. Even so, for the interim we have chosen this format. In the medium term we will find a better solution that provides Unicode-encoded data.
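For developers consuming these files, the encoding mostly matters at the point of reading. The sketch below assumes the file was saved on a Western European Windows system, so that the bytes are Windows-1252 (`cp1252` in Python), and converts them to UTF-8. The function name `transcode_to_utf8` and the sample bytes are hypothetical.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed Excel encoding and re-encode as UTF-8.

    Assumption: the publisher's file is Windows-1252. If the file was
    actually saved under a different Windows code page, the decode step
    may raise UnicodeDecodeError or silently yield the wrong characters.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Illustrative bytes as Excel might write them: e-acute is 0xE9 and the
# pound sign is 0xA3 in Windows-1252.
excel_bytes = b"caf\xe9,\xa3100\r\n"
utf8_bytes = transcode_to_utf8(excel_bytes)

print(utf8_bytes.decode("utf-8"), end="")
```

When reading a file directly, the same assumption can be stated with `open(path, encoding="cp1252", newline="")` before handing the file object to a CSV parser.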
There are also less technical, but still important, considerations: representations and headers. We know that many developers try to import the data we publish automatically, and that too many datasets are created with formatting and comments done with an eye on human rather than machine readers. CSV avoids much of the formatting, macros and other elements that can obscure the data from machines, but inconsistent representations and headers can be just as unhelpful.
Our standard is that there should be precisely one header line, with very brief, informative headings (like “amount in Sterling”). Numbers should always be written without thousands separators or unit symbols (like currency or size), and dates written out in British form with leading zeros (e.g. “23/02/2010”). Where there is a cross-government push for all bodies to publish their data on the same topic (like with senior staff salaries), we would expect a mandated, exact standard set of representations and headers that all publishers must use (without any changes, additions or removals). Any comments or helpful advice about the data should be included in the metadata that accompanies the entry on data.gov.uk, or in additional documentation or Web pages that can help the developer work with the data.
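Publishers or developers who want to check a file against these conventions could automate part of it. The following is a rough sketch only: the helper names, the sample header “date”, and the exact number pattern (an optional minus sign, digits, and an optional decimal part) are assumptions made for illustration, not rules set by this guidance.

```python
import csv
import io
import re
from datetime import datetime

# Assumed pattern: digits with an optional sign and decimal part,
# and no thousands separators or unit symbols.
NUMBER_RE = re.compile(r"-?\d+(\.\d+)?")
# Two-digit day and month (leading zeros) and a four-digit year.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def is_plain_number(value: str) -> bool:
    """True if the value has no thousands separators or unit symbols."""
    return NUMBER_RE.fullmatch(value) is not None


def is_british_date(value: str) -> bool:
    """True if the value is a real calendar date written as DD/MM/YYYY."""
    if DATE_RE.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def check_headers(csv_text: str, expected_headers) -> bool:
    """True if the first record matches the expected headers exactly."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    first_row = next(reader, None)
    return first_row == list(expected_headers)


# Illustrative file content; "date" is a made-up header for this example.
sample = "date,amount in Sterling\r\n23/02/2010,1234567.50\r\n"
print(check_headers(sample, ["date", "amount in Sterling"]))  # True
print(is_plain_number("1,234,567.50"))  # False: thousands separators
print(is_british_date("3/2/2010"))      # False: missing leading zeros
```

A check like this only covers what can be tested mechanically. Whether the headings are brief and informative still needs a human eye.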
We would hugely value any feedback you have on this guidance.