A simple intro to open data
If you have come to data.gov.uk wondering what open data is all about, this short guide will provide you with the basic information you need to start playing with government data and will point you to tools and resources that can help you further in working with open data.
There are many reasons to be interested in open data: you may be a student wanting to use data for research, a local campaigner or a charity looking for evidence to support your decisions, or a business looking into how open data can enhance your products or inform your processes. Although this guide is aimed at those taking their very first steps on the path to open data, those with a better understanding of data will find the resources section valuable.
What is open data?
It is only right that we start with a definition. Open data is data that is published in an open format, is machine readable and is published under a licence that allows for free reuse.
Open format – you will be familiar with the .doc or .xls endings on your files; you see them every time you open a folder on your computer or someone emails you the latest holiday pictures.
An open format is one that is platform independent (it does not matter which operating system your computer runs, whether Windows, Mac or any other) and machine readable (more on that in a bit). The most common open formats you will encounter are:
For numerical files (spreadsheets) - CSV, ODS
For text - ODT, TXT
Machine readable – This basically means that the data, both in its format (CSV) and its structure, can be read by a computer without human aid. This means that the data is clearly structured in a logical way…
In the example above, you can see how the first row provides the heading titles (the things for which the file provides values) and it is followed by the values corresponding to those headings, in the same order, separated by commas.
This allows the file to be read by computers and for awesome things to be produced from the data, from visualisations to detailed analytics.
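To see what machine readable means in practice, here is a minimal sketch in Python that reads a small CSV using only the standard library. The sample rows and the column names (Name, Town, Region) are made up for illustration and are not taken from any real dataset.

```python
import csv
import io

# Made-up sample data: a header row followed by comma-separated values.
SAMPLE_CSV = """Name,Town,Region
Example House,London,London
Sample Court,Leeds,Yorkshire and the Humber
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_rows(SAMPLE_CSV)
for row in rows:
    # Each value can be looked up by the heading it sits under.
    print(row["Name"], "is in", row["Region"])
```

Because the first row names the columns and every later row follows the same order, the program can match each value to its heading without any human help.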
Open licence – Files generally come with restrictions or licences attached to them. You may be able to use that great picture of a sunset that you found on the internet for your monthly newsletter, but it is likely to carry a licence that either prevents you from doing so freely (there is a charge to use the image) or does not allow you to use the image at all.
The same restrictions apply to data. For data to be of value to everyone, it needs to have as few barriers to re-use as possible. That way you can pick up a data file and use it as you wish without worrying about infringing someone’s copyright. In the UK we have the Open Government Licence, which allows you to copy, adapt and reuse government data published under it, including for commercial purposes, provided you acknowledge the source.
What can I do with open data?
This is where things get interesting. Working with data can provide you with new insights into a subject, help you prove a point, provide your application or web service with what it needs to run, inform financial and social statistics or help you understand customer behaviour.
So, let’s make it a bit more real and try to do something basic with some data: let’s visualise it and see if we get some new insights.
In a data catalogue, data is managed by creating datasets. The definition of a dataset is a hot topic amongst experts, but let’s say that it is basically a collection of data files. So if you keep a budget spreadsheet every month, for last year you will have 12 files, which together could be called Smith’s Household budget 2012. That would become the title of our dataset. Just as we used to throw receipts and bank statements into a folder and label it, inside the dataset we will find a spreadsheet file for each month.
In the budget example, you would probably use very specific labels for certain items, for example calling a transaction for pizza ‘movie night’ because that’s why you ordered it. You may have entered amounts excluding VAT, or you may not have broken down whole transactions (you went shopping and bought five things: three are toiletries, one is a kitchen appliance and one is a perfume). As the generator of the data, you know this but I don’t, which means that if I want to analyse your 2012 budget by area of expenditure I will end up with a less than accurate view of where your money went. To help those using data, we provide something called metadata, which is basically data about data; it tells us about the aspects of the data that may be relevant for it to be understood.
Let’s start by downloading the following file (clicking the link will start the download; it should not take you away from this page). Save it somewhere on your computer, perhaps on your desktop for easy access: CLICK TO DOWNLOAD
Now let’s open it. If you have Microsoft Excel installed on your computer, it should open the file automatically once you click on it; on a Mac it may be ‘Numbers’, if you have it installed. Otherwise, you may be asked which software you wish to use to open it. If you see Notepad listed in the options, choose that. If not, you can download a free office package such as LibreOffice or OpenOffice.
If you open it in Notepad on Windows you will see something similar to the example shown earlier. This is just to check that the file can be read. If you opened it in one of the office suites, it should look like any other spreadsheet you have seen before.
The file we downloaded is a list of all government properties across the UK, with details such as the location and name of each building.
We are interested in one bit of information for this first step: the number of buildings in different cities across the UK. Can we find where the largest concentration of government buildings is? Let’s try it.
Close the file and go to Google Fusion Tables (you may require a Google account to use this site).
Click CREATE and it will take you to a screen where you can enter your data. It will ask you where the data is; the option ‘From this PC’ should be selected by default. Click on choose file and point to the file you downloaded earlier from this guide.
Once the file has been uploaded, Fusion Tables will show you the contents of the file as it understands it (this is where the machine readable bit comes in: if the file is well structured, Fusion Tables should be able to identify each column header and all the values attached to each automatically).
Notice how, if the data is not well ordered (which is a common problem), it also allows you to select which row contains the column names. This helps when there is explanatory text in the first few rows, which would otherwise confuse Fusion Tables and yield nonsensical results. Here everything appears to be in order: the top row has all the headers and the rest of the rows are the values corresponding to each of those headers. Click next to go on and provide some details for our data.
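The idea of choosing which row holds the column names can be sketched in Python as follows. This is only an illustration of the principle, not how Fusion Tables works internally; the sample text, the function name read_with_header_at and the column names are all hypothetical.

```python
import csv

# Made-up file contents: two lines of explanatory text come before the
# real header row, which would confuse a naive reader.
SAMPLE_CSV = """Government property list (made-up example)
Extracted for illustration only
Name,Region
Building A,London
Building B,North West
"""


def read_with_header_at(text, header_row):
    """Parse CSV text whose column names sit on the given zero-based row.

    Lines before the header row are discarded. Splitting on line breaks
    is a simplification that assumes no value contains a line break.
    """
    lines = text.splitlines()[header_row:]
    return list(csv.DictReader(lines))


# The column names are on the third line, so the zero-based row is 2.
rows = read_with_header_at(SAMPLE_CSV, 2)
for row in rows:
    print(row["Name"], row["Region"])
```

Reading the same text with the header row left at 0 would treat the explanatory sentence as a column name, which is exactly the kind of nonsensical result the import screen lets you avoid.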
We won't change anything on this screen. Simply click finish to begin importing and you will see this message (this could take a minute or so).
Once Fusion Tables has uploaded your data, you will see a table view with the contents of the file.
Fusion Tables offers many of the functions of traditional spreadsheet software, but it also has an extra tool: it can plot your data geographically using Google Maps. Our file contains the latitude and longitude of each building, which will allow Fusion Tables to plot the location of each.
Click on the third tab, Map of latitude. By default it will produce the map based on the latitude values in the file…
And you’ll get something like the picture above. If you zoom in (you can use your mouse wheel or the + and – buttons on the left side of the map) you can click on each red dot and see information about that building. But it does not tell us the concentration of buildings by place! We know, by looking at the table, that there is a column called Region. That appears to be as close as we will get to a grouping for this data, so let’s use it; counting buildings by town would be difficult to chart and impossible to read, given the hundreds of buildings and the many towns they are in.
Click on the red button next to the ‘Map’ tab (the one with the white + in a red square) and choose ‘add chart’. You should now see the chart creation interface.
From the set of charts available, choose the vertical bar chart. Click on the button under the Category heading and select ‘Region’. It will show the category as Region and allow you to choose which values you want to measure and chart. Select the ‘summarise data’ tick box and you should see, for each region, a count of the rows (buildings in the data file) whose ‘Region’ column has that value.
Now we have our answer: we know how many government buildings in the dataset are in each region.
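The same count can be reproduced outside Fusion Tables. The following is a minimal sketch in Python, assuming the file has a column named Region as described in this guide; the sample rows and the function name count_by_region are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the downloaded file.
SAMPLE_CSV = """Name,Region
Building A,London
Building B,London
Building C,North West
"""


def count_by_region(csv_file, column="Region"):
    """Count how many rows (buildings) fall under each value of a column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


counts = count_by_region(io.StringIO(SAMPLE_CSV))

# List the regions from the most buildings to the fewest.
for region, total in counts.most_common():
    print(region, total)
```

To try this on the file you downloaded, you could pass an opened file instead of the sample text, for example open("properties.csv", newline="", encoding="utf-8"), where properties.csv is a placeholder for whatever name you saved the file under.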
Issues with data
Not all data is perfectly ready for use. The quality of the structure varies from publisher to publisher. Even if they have provided a CSV, the results can only be as good as the original spreadsheet from which it was created. Issues with datasets are common and often require manual work to normalise and clean the data so that it is usable and comparable with other datasets.
For example, the file you used for this exercise was modified for this guidance. The original had thousands of rows, which would have meant that creating the map would have taken far longer. The ‘Town’ column was of mixed quality: in some instances it contained a town, in others a district within a town, such as Whitehall, so we deleted it.
The post code column does not have a post code for every entry, which, again, would have led to inaccurate or incomplete charting. The only column that appeared to be both complete and to contain geographical data was the ‘Region’ column, which is why we used it.
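A quick way to spot incomplete columns like this before charting is to count the blank values in each one. Here is a minimal sketch in Python; the sample rows, the column name Post Code and the function name count_blank_values are assumptions for illustration and may not match the real file.

```python
import csv
import io

# Made-up sample rows: two of the three buildings have no post code.
SAMPLE_CSV = """Name,Post Code,Region
Building A,AB1 2CD,London
Building B,,London
Building C,  ,North West
"""


def count_blank_values(csv_file):
    """Return, for each column, the number of rows where it is empty."""
    reader = csv.DictReader(csv_file)
    blanks = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Treat missing values and whitespace-only values as blank.
            if not (row[name] or "").strip():
                blanks[name] += 1
    return blanks


blanks = count_blank_values(io.StringIO(SAMPLE_CSV))
for name, total in blanks.items():
    print(name, "has", total, "blank value(s)")
```

A column with no blanks, such as Region in this sample, is a safer choice for grouping than one with gaps.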
If we wanted to do more with this data, there are plenty of other variables in the file that we could use, such as post code and department, and these could enable interesting comparisons and analysis. The following resources can help you explore further how to work with data (some may require registration).
Working with data