Code of Practice

3 Disclosing datasets in an electronic form which is capable of re-use

3.1

When releasing any dataset under the Act public authorities must, as far as reasonably practicable, provide it in an electronic form which is capable of re-use, i.e. a re-usable format. A re-usable format is one that is machine readable, such as Comma-separated Value (CSV) format.

TXT

Perhaps it would be more useful to specify a pure text-based format, such as CSV, because it's probably not unknown for someone to convert a CSV to a PDF and send that out!

This needs a better definition of machine readable

The definition of re-usable needs to be expanded and clarified here. 

Points might include that a re-usable dataset is one for which:

  • A requester can open the data file in freely available and widely accessible software packages - this means that formats such as CSV should be preferred over, or offered in addition to, formats like Excel, and proprietary formats which can only be opened with commercial or specialist software should be avoided.
  • It is possible to process the data directly, carrying out any appropriate operations on it such as sorting columns, filtering rows, running aggregates of values - this requires well-structured data. Where possible, the meaning of the data should not be contained in the layout.
  • Common elements in the dataset are expressed in uniform ways - for example, dates are always in the same format, codes or names are always in the same case, and numbers are expressed consistently (e.g. 1,000 or 1000 but not a mixture of the two). 
  • The meaning of fields and values is clearly documented - either through clear naming of fields, or through accompanying descriptions provided along with the data.  
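As a rough illustration of the uniformity point above, the sketch below reports which date and number styles appear in a column of a CSV file. The patterns, the function names and the sample rows are all assumptions made up for this example; a real dataset would need its own rules.

```python
import csv
import io
import re

# Hypothetical patterns for illustration only; extend them to suit the dataset.
DATE_PATTERNS = {
    "iso": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "uk_slash": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
}
NUMBER_PATTERNS = {
    "plain": re.compile(r"^\d+(\.\d+)?$"),
    "thousands_commas": re.compile(r"^\d{1,3}(,\d{3})+(\.\d+)?$"),
}


def styles_used(values, patterns):
    """Return the names of the patterns matched by the values ("other" if none match)."""
    found = set()
    for value in values:
        value = value.strip()
        if not value:
            continue  # empty cells are ignored in this sketch
        name = next((n for n, p in patterns.items() if p.match(value)), "other")
        found.add(name)
    return found


def check_uniformity(csv_text, column, patterns):
    """Report which styles occur in one named column of CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return styles_used((row[column] for row in rows), patterns)


# Made-up sample rows that deliberately mix styles.
SAMPLE = '''date,amount
2012-01-05,1000
06/01/2012,"1,000"
2012-01-07,250
'''

date_styles = check_uniformity(SAMPLE, "date", DATE_PATTERNS)
amount_styles = check_uniformity(SAMPLE, "amount", NUMBER_PATTERNS)
print(sorted(date_styles))
print(sorted(amount_styles))
```

A column that reports more than one style is a candidate for tidying before release. This only checks surface formatting; it says nothing about whether the values themselves are correct.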

In addition, where possible machine readability is enhanced if:

  • The dataset uses common standards where they exist - including standard identifiers and standard field names. These might be standards like the public spending vocabulary developed for government, or third-party standards such as KML for indicating 'points of interest'.

It is important to also note that it may be possible to provide data in a variety of machine readable formats, and where possible the authority should correspond with the requester to identify the best format. For example, points of interest could be provided in a CSV spreadsheet form, or KML. Ideally both would be provided: though depending on the context and the re-user one may be more appropriate than the other. 
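To show how little effort offering both forms can take, here is a minimal sketch that turns a CSV of points into KML using only the Python standard library. The column names (name, latitude, longitude), the function name and the sample sites are assumptions for illustration, not part of any published dataset.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def csv_points_to_kml(csv_text):
    """Convert CSV rows with name, latitude and longitude columns into a KML string."""
    ET.register_namespace("", KML_NS)  # serialise KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in csv.DictReader(io.StringIO(csv_text)):
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = row["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML writes coordinates as longitude,latitude (in that order).
        coordinates = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coordinates.text = f'{row["longitude"]},{row["latitude"]}'
    return ET.tostring(kml, encoding="unicode")


# Made-up sample points, for illustration only.
SAMPLE_POINTS = """name,latitude,longitude
Site A,51.5000,-0.1000
Site B,52.2000,0.1200
"""

kml_text = csv_points_to_kml(SAMPLE_POINTS)
print(kml_text)
```

The sketch assumes the coordinates are already WGS84 latitude and longitude, which is what KML expects; data held as national grid eastings and northings would need converting first.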

To back up these points, the following blog posts might be useful:

I'm sure there are many others on the same theme...

More examples needed

I agree with TimDavies that this needs a fuller definition. It would also benefit from a wider range of examples. Not all datasets can sensibly be produced in CSV - collections of documents or photographs for instance could meet the definition of a dataset. And re-usability, like openness, is a range rather than a dichotomy - for a table, PDF is more re-usable than JPEG, but less re-usable than Excel or CSV.

Yes More Examples Please

For example, spatial data. At the moment it is deemed acceptable to publish this as PDF maps, whereas it would be much more valuable to the GIS community and others to release it as vector shapefiles or Google KML files that can be reused by others. There is a wealth of this information in local government planning departments and large projects such as the Olympics.

Practical guidance

More detailed guidance needs to be provided alongside the Code of Practice to give a more practical definition of datasets. This could be developed alongside the guidance provided by the Information Commissioner’s Office.

I agree the definition of re-usable needs more work

I agree with Tim that the definition of re-usable needs more work. The statement that "a re-usable format is one that is machine readable" is rather dogmatic, and not a universally held view. That is certainly the way IT developers tend to think about re-use of data, and the statement is more likely to be true for large datasets. However, many datasets are re-used directly for analysis (i.e. just read and considered) and do not necessarily need to be pre-structured.

The Code of Practice should encourage discussion based on the specifics of individual datasets. A single dataset may be suitable for a range of different re-uses and there will not always be an ideal format. When in doubt the public authority should try to strike a balance between making the data re-usable and preserving the data in its original context.

3.2

Datasets are, by their nature, often created in formats that are capable of re-use. Therefore, if a public authority publishes a dataset in a non-re-usable format such as an image file, it should consider also publishing the dataset in its re-usable format before its conversion to an image file, or keeping a re-usable version of the dataset available. Where datasets are only held in non-re-usable formats, and it is impractical or too burdensome to convert the dataset into a re-usable format, the public authority is not obliged to convert the dataset before releasing it.

Engagement?

Worth a point that authorities are encouraged to respond to requests by proactively engaging with the requester and understanding how they might move towards providing more structured and regularly released open data in future? 

3.3

Public authorities should consider best practice and any guidance on the provisions of datasets issued by the Information Commissioner.

It may be useful to explicitly draw attention here to data protection guidance, and the recently released Anonymisation Code of Practice.

In addition, there is a specific risk of personal data leaking out through the tracked changes or hidden fields of machine-readable files that authorities previously published only in PDF or print forms - and so in either this guidance, or the ICO guidance, it may be necessary to address this privacy issue explicitly.

It could be addressed here, or in the section on the costs of preparing a dataset for machine readable distribution, to note that this should include an audit of any potential privacy implications of releasing the data in the format of choice. 
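One small part of such an audit might be sketched as follows, assuming the file is a Word .docx (a zip archive whose main text lives in word/document.xml, with tracked insertions and deletions stored as w:ins and w:del elements). The function name is hypothetical, and the in-memory file built at the end is a cut-down stand-in for a real document, created only to exercise the check.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(docx_file):
    """Count tracked insertions and deletions in the main body of a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return {
        "insertions": sum(1 for _ in root.iter(f"{{{W_NS}}}ins")),
        "deletions": sum(1 for _ in root.iter(f"{{{W_NS}}}del")),
    }


# Build a minimal stand-in archive in memory (not a complete Word document).
body = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>Published text</w:t></w:r>"
    '<w:del w:id="1" w:author="A">'
    "<w:r><w:delText>removed name</w:delText></w:r>"
    "</w:del>"
    "</w:p></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", body)
buffer.seek(0)

result = count_tracked_changes(buffer)
print(result)
```

A non-zero count suggests the file still carries text that is invisible on screen but recoverable by a recipient. This sketch looks only at the main body: headers, footnotes, comments, document properties and other file types (hidden sheets or columns in spreadsheets, for example) would each need their own checks.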
