Guidance: A very basic standard file format for data

We have talked before about some specialised ways in which to publish our data, especially Linked Data. However, we are mindful of Tim Berners-Lee’s advice to publish in “raw” form first, and then worry about additional means. These forms may take longer to do, but make the data useful for more situations, adding semantic information.

To meet the urgent need for open, standardised formats to support the Government’s transparency commitments this year we needed a recommended, basic format that is easily reproduced for publishers, and serves the community of developers. To this end, we have come up with a very basic set of technical guidance on the simplest possible format, "flat" files with separations between the fields. This is most widely seen in Comma-Separated Variable or “CSV” form.

Having said that CSV is as basic as it gets, there are still several technical things to consider in such a specification. In the jargon, these are the delimiter and terminator characters, how fields are escaped and quoted, and the character set encoding. Experience on data.gov.uk so far shows that different publishers can take different decisions about each of these elements of a “CSV” file. As we start to publish data about the same theme (spending, salaries, organograms and so on) from many different departments developers need to be assured that there is a precise and common standard for all the ‘CSV’ files which departments will be publishing.

An important consideration is that Government departments need as far as possible to be able to generate the data files with what they already have. It is not just that additional software could have potential costs (indeed there is usually good open source software available); the process of getting new software qualified and installed on each separate departmental network is both costly and, perhaps more importantly, involves substantial lead times. There are similar issues with bespoke coding. What’s more, we need to be mindful of the skills of the people who will be producing the information. So our strategy has been to set a standard that the vast majority of data owners in Government can meet using what they already have on their desktops - and to define that sufficiently so that those who are writing bespoke extraction software can also produce to that standard.

At the moment the vast majority of government desktops have Microsoft Excel and the vast majority of datasets that publishers will be releasing are held in, or manipulated with, Excel. The Government has policies to make greater use of open source in the future, but as far as data.gov.uk goes “we are where we are”. So asking for a format that Excel cannot produce would be very difficult to implement, increasing costs and slowing down the stream of data released.

Microsoft Excel’s “Save As CSV” function produces files that are reasonable on most of these grounds. Records are terminated with a newline (Windows CRLF encoded), with fields separated with a single comma character (,). Where a field includes a double-quote character (") it is replaced with two of them. Where a field includes a comma or a double-quote it is quoted with a double-quote at the start and end of the field. An example of this is:

Test       Test,    Test"   Test test          “Test" ","",,,",,

Test,"Test,","Test""",Test test,"""Test""",""","""",,,"",,"

Excel’s CSV files use the Microsoft Windows “Latin-1” character encoding (sometimes known as “ANSI”), rather than a form of Unicode like UTF-8. Though this means that code will need manual tweaks to import non-Western European scripts, for the interim we have chosen this format. In the medium-term we will find a better solution that provides Unicode-encoded data.

There are also less technical, but still important considerations – representations and headers. We know that many developers try to import automatically the data we publish, and that too many datasets are created with formatting and comments done with an eye on human rather than machine readers. CSV avoids much of the formatting, macros and other elements that can obscure the data from machines, but these can be just as unhelpful.

Our standard is that there should be precisely one header line, with very brief, informative headings (like “amount in Sterling”). Numbers should always be written without thousands separators or unit symbols (like currency or size), and dates written out in British form with leading zeros (e.g. “23/02/2010”). Where there is a cross-government push for all bodies to publish their data on the same topic (like with senior staff salaries), we would expect a mandated exact standard set of representations and headers that all publishers must use (without any changes, additions or removals). Any comments or helpful advice about the data should be included in the meta-data that accompanies the entry on data.gov.uk, or in additional documentation or Web pages that can help the developer work with the data.

We would hugely value any feedback you have on this guidance.

Comments

Data Archive

Have you spoken to the Data Archive folks, who try to promote best practice in data curation/preservation for social science/humanities research data in ac.uk?

http://www.data-archive.ac.uk/

And also the DCC/Digital Curation Centre:

http://www.dcc.ac.uk/

It would be a shame to miss an opportunity to advocate/share experiences around a simple open standard with these folks, too, or sound them out on pragmatic approaches that they've come round to from past experience?

 Flag as offensive 

Notes on broken representations at a slightly higher level..

As well as issues of data representation at the low level, I think it's also worth flagging that another problematic area with CSV files is when the CSV file gets a bit irregular...

If we define a simple table as one with a set of columns and a set of rows, assume that the list of columns defines some sort of record (such as a database record), and let each row correspond to the data included in one record.

Every row is much like the other. So for example, each row may define a set of measures applying to a local school.

But then we often get something a "subtotal" row appearing in the middle of the table, where values above are being subtotalled for some reason, and these only applies to some of the columns. The rwo in this case is not of the same record type or record structure as the rows containing the schools data.

Or consider this: a note appears in the table, breaking it in two ways: eg an asterisk appears in one of the table cells (so the cell can now longer be easily read/parsed by a machine), and then somewhere underneath the table, we get a footnote corresponding to that note (so if we're trying to read in rows according to the columns-define-the-record-structure way, we get things breaking. Again, the format of the row is not the correct "type")

It might be that there is a really simple convention we can adopt to get round this, or a series of conventions that make it increasingly easy (at a pragmatic level) to structure these exceptions, so that they get easier to handle at the machine readable level, but also lead to folk structuring their data more powerfully.

For example, to cope with notes, leave a series of 3 blank lines below the table before the notes. If you need to mark a cell with a footnote always leave a space then put the footnote in square brackets eg [3] Or create a blank column to the right or left of the table, and then have a Note Referent and Note Detail column (Note referent contains the row or cell number that the note refers to, Note Detail the actual note; and yes, I know, the dislocation between cell and referent means that changes may get made to the spreadsheet that makes the notes point the wrong way... but we're being pragmatic)

Examples and templates are both ways of trying to suggest such conventions to folk - and if they can see a payoff, then they may start to adopt them...

I think we need to recognise that folk will publish "broken"/irregular CSV, but there may be simple/effective ways that make it easier to work round/accommodate these irregularities if it's easy and beneficial for folk to adopt them as a convention or a norm.

PS Just by the by, it's maybe also worth considering what other guidance is given by government; e.g. http://www.parliament.uk/briefingpapers/commons/lib/research/briefings/s... is a Parliamentary research briefing paper on tabular layouts...

 Flag as offensive 

csv metadata descriptor

There's a proposed format for a csv metadata descriptor and validator - written itself in csv - named Interface Control Document. What it does is similar to what a DTD or an XML Schema does for XML data. It could be a way to standardize the way csv is used for open government data.

See http://cutplace.sourceforge.net/tutorial.html for an example using the cutplace tool.

 Flag as offensive 

File Formats

As a 'non-technical' web editor, I welcome the information and much appreciate the comment that "we need to be mindful of the skills of the people who will be producing the information". 

Whilst I understand the overall drive for 'open data', the journey from what I know and have to deal with and how I get there is still a mystery!

 

Julia

 Flag as offensive 

File format and contents

I fully suport reducing data to tables and making available in a simple format such as CSV - pleased to see you note the pitfalls even here of line-end conventions, escape characters when the data value includes a quite or whatever, and multiple languages.  Surprised you don't firmly recommend Unicode rather than the proprietory Latin-1.  It cannot be satsfactory to mix files that potentially use a single code for different characters (language glyphs).

The biggest problem however is glossed over by you and by Berners-Lee: that of column headers.  The data table should, as you suggest, have a single header line (but beware so many spreadsheet users leave column A and row 1 blank because they are used to leaving margons on paper!).  Your example "amount in sterling" is then precisely NOT informative - amount of what, how measured?  Berners-Lee suggests to "use terms other people are already sharing. Like foaf:name for the name of a person."  I've NEVER seen foaf:name as a fieldname and would not have any idea what it meant. Even worse, people's name is one of the most ambiguous fields ever, with no general parsing rules and highly culturally determined.

Current metadata standards are entirely directed at cataloguing and discovery.  There is no standard or requirement for documenting the field structure and semantics of data.  But that is essential if data is to be reused, and absolutely critical when data files are merged.  The simple example of Mars lander that crashed because units were confused shows how costly omissions can be.  I have, with a colleague, suggested a scheme that is a dirct extension to yours, based on a data table, discovery metadata, and a codebook that explains and documents the data columns.  Happy to send a PDF of a poster on request.

allan.reese@cefas.co.uk

 Flag as offensive 

not underestimate the importance of graph structure

Data rapresented in a table format carry much less information than data in a graph. I.e. tu supply the same amount of information you will need more that one table. The table representation should be just a view on data. This is also the basic principle of E-R db. That's why a simple CSV (that cannot even represent the complexity of a E-R db) is only the most basic step, and not an arrival point.

About semantics tags, it's all about standard definitions. Sure FOAF is not appropriate to governamental data. The definition of a standard set of "tags", will improve potential of:

  • data reusage
  • data integration
  • distributed query over data
  • data linkage

FMF

 Flag as offensive 

A CSV File Format Standard

Please consider the open CSV-1203 file format standard http://mastpoint.curzonnassau.com/csv-1203/csv-1203.pdf which I describe as "Best practice for business-to-business operations".

 Flag as offensive 

RFC 4180

does RFC 4180 assist or hinder the definition and understanding of what CSV actually should look like?

 Flag as offensive