Bathing Water Quality as Linked Data

Level: introductory

Is it safe to go back in the water? In the movie Jaws 2 the risk to avoid was a giant shark, but fortunately this is not a threat we have to worry about in the UK. There are, however, other reasons to be careful. One of these is water quality: we would prefer to swim in water that is relatively clean, and not contaminated by sewage. For England and Wales, the duty of monitoring bathing waters to assess water quality falls to the Environment Agency of England and Wales (EA). Under the aegis of the European Bathing Water Directive, EA staff members collect weekly water samples during the May to September bathing season, and test those samples for compliance with clean water regulations. Currently this data is collated for reports at the EU level, and results are also published on the EA web site. However, the data itself was not directly available outside the EA organization. EA wanted to make the data available directly, both for transparency reasons and to encourage new and creative uses of the data. We designed and implemented a linked data system that would allow the water quality data to be made available, in a timely fashion, for anyone to re-use.

So why use a Linked Data approach to publishing the EA's water quality data? To answer that, let's back up a bit and review what linked data actually is. Data is, when you strip it down, one of two things: numbers or words. For example: 36, 2011 or "Lulworth Cove". These data elements are hard to interpret by themselves: we need some context, some additional information. 36 could be the number of points in an international music competition, or the age of a TV presenter, or many other things. In this particular case, the value I'm thinking of is the concentration, in colonies per 100ml, of faecal streptococci found in a water sample taken at Lulworth Cove on August 2nd, 2011. I have to convey that context - the units, the type of thing being measured, the time and location - to allow someone else to make sense of the measurement and conclude that it's a low reading, and indicates nice clean water. The question then is how to convey that information: the association between the data itself (36, etc.) and the metadata that gives it meaning (units, measure type, etc.).

Linked data addresses this central issue not by changing the data itself - at root, data is still words and numbers - but by changing the way that we interact with the data, especially via computer programs. In particular:

  1. every data resource, such as a particular water quality reading, is given a unique identity (known as a URI);
  2. that unique identity is based on HTTP, the widely-used network protocol used by web browsers to fetch web pages for display;
  3. properties of resources, such as their type, scale, units, provenance, last updated time, etc, are represented using the Resource Description Framework, RDF, in which values are connected together in networks of named links, called properties;
  4. the names of the properties, and also resource types, classification codes, etc, are also all given web identifier URIs. This means that they too can be fetched (technically the term we use is resolved) using HTTP. The information returned when resolving, say, an RDF property gives a clear semantic meaning to the property.
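
To make the four points above concrete, here is a minimal sketch of the Lulworth Cove reading expressed as subject, property, value statements in plain Python. It is only an illustration: the property URIs under http://example.org/def/ and the sample URI are hypothetical placeholders, not the vocabulary or identifiers that EA actually publishes, and a real application would more likely use an RDF library than a list of tuples.

```python
# Hypothetical vocabulary: these property URIs are placeholders,
# not the terms the Environment Agency actually publishes.
EX = "http://example.org/def/"

# The article only gives a shortened link for the real sample URI,
# so a placeholder identifier stands in for it here.
SAMPLE = "http://example.org/id/sample/lulworth-cove-2011-08-02"
BATHING_WATER = "http://environment.data.gov.uk/id/bathing-water/ukk2204-20000"

# Each statement is (subject, property, value): a named link from a resource
# to either another resource (a URI) or a literal (a number or a word).
triples = [
    (SAMPLE, EX + "bathingWater", BATHING_WATER),
    (SAMPLE, EX + "sampleDate", "2011-08-02"),
    (SAMPLE, EX + "determinand", "faecal streptococci"),
    (SAMPLE, EX + "count", 36),
    (SAMPLE, EX + "unit", "colonies per 100ml"),
    (BATHING_WATER, EX + "name", "Lulworth Cove"),
]


def describe(subject, graph):
    """Collect every property/value pair attached to one resource."""
    return {prop: value for subj, prop, value in graph if subj == subject}


def follow(subject, prop, graph):
    """Follow one named link from a resource and return what it points to."""
    for subj, p, value in graph:
        if subj == subject and p == prop:
            return value
    return None


if __name__ == "__main__":
    # Start at the sample, follow the link to its bathing water, read its name.
    water = follow(SAMPLE, EX + "bathingWater", triples)
    print(follow(water, EX + "name", triples))
    print(describe(SAMPLE, triples)[EX + "count"])
```

The bare number 36 never travels alone: it is always reachable together with its units, its measure type and the place it was taken.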

Using URIs for data has a number of advantages. A resource identifier such as http://environment.data.gov.uk/id/bathing-water/ukk2204-20000 provides a globally unique identifier for a particular bathing water, in this case Lulworth Cove in Dorset. Because the identifier is based on HTTP, I can follow that link, in a program or by clicking on it in a browser, to get a representation of the resource. The link is set up so that the representation returned can be varied; by default it returns human-readable HTML. However, by requesting an alternative representation format, such as JSON or Turtle, other formats can be delivered. The same resource as above is also available in JSON, Turtle, or XML formats. The point of these different formats is that they describe the same underlying resource, but using different encodings that might be easier for a programmer to use. Web developers, for example, often like to use JSON because that format can be easily handled by JavaScript programs. It's not just the bathing waters themselves that have URIs: the water sample I referred to above has its own URI. It's quite a long URI, since it has to capture the characteristics that distinguish it from other similar samples at the same site, or at different sites on the same day, so I've shortened it to http://goo.gl/DqKCQ. However, if you click through to that page, you'll see the full URI, and links on the top right of the page to also view the sample data in JSON, Turtle, etc.
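
One common way for a program to ask for an alternative representation is the HTTP Accept header. The sketch below uses only the Python standard library. It assumes that the service honours Accept headers and that these media types are the ones it recognises; neither assumption has been checked against the live service, and the helper names are made up for illustration.

```python
from urllib.request import Request, urlopen

LULWORTH_COVE = "http://environment.data.gov.uk/id/bathing-water/ukk2204-20000"

# Widely used media types for the formats mentioned in the article.
# Assumption: the service may expect different ones (for example for XML).
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "turtle": "text/turtle",
    "xml": "application/xml",
}


def build_request(uri, fmt="html"):
    """Build (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(uri, fmt="html", timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(build_request(uri, fmt), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


if __name__ == "__main__":
    # Inspect the request without touching the network.
    request = build_request(LULWORTH_COVE, "turtle")
    print(request.full_url)
    print(request.get_header("Accept"))
```

The identifier stays the same in every case; only the requested encoding changes.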

The web domain used in these identifiers is environment.data.gov.uk. It suggests that the domain owner - EA in this case - can make authoritative statements about the resource. But it is also true that anyone can make a statement - using RDF - about that bathing water. I might, for example, publish a list of bathing water sites I have visited with my family and whether we liked them or not. By using the official identity URI, it is much more likely that my data can be found, and re-used, by other people.

It's this practice of connecting resources together using named, meaningful links that gives linked data both its name and its power. It would be quite possible, for example, for someone to publish data about the time of year that given bathing water sites are open to dog walkers or horse riders, using the same reference identifiers to unambiguously identify the beach locations. Someone else might then write a smartphone app which lists bathing waters that are both clean and dog-friendly (or dog-free), and perhaps further link to weather or tide data. Or recommended nearby pubs and restaurants! Once the reference data has been created, the possibilities to extend and refine it with additional data sources are extensive.

A key enabler to unlocking the potential of linked data is to make life easy for software developers. End-users typically don't want raw data; they prefer it nicely presented and easy to understand. Their goal, of course, is not just to observe data, but to interpret it and make decisions. The question "what was the most recent faecal streptococci count?" is implicitly part of "how clean is the water?" which is probably really part of "shall we take the kids to the beach today?". Making it easier for developers to create web pages and apps that help end-users to make those kinds of decisions was an important goal for EA. While the so-called follow your nose approach to linked data - following the links in the data to see where they lead - does work, app writers really need more support than that. The query language SPARQL is a powerful and general-purpose tool for accessing RDF data, but having to learn SPARQL is an extra burden for developers. We elected instead to use an API approach, in which a collection of HTTP-accessible end-points provide a programming interface that web developers can make use of easily. In particular we use the Linked Data API (LDA) to provide a programming interface to the data. The LDA uses well-established conventions for accessing the details of individual data resources, and of collections of resources. For example, the following link returns a list of all bathing waters, in JSON, five at a time:

http://environment.data.gov.uk/id/bathing-water.json?_pageSize=5
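
As a rough sketch of how a developer might assemble such a link in code, the following uses only the Python standard library. The function names are made up for illustration. The one convention it relies on is that parameters controlling the API itself, such as _pageSize, carry a leading underscore, while un-prefixed parameters are treated as properties of the data.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

BASE = "http://environment.data.gov.uk/id/bathing-water"


def list_url(page_size, fmt="json"):
    """Build a list-endpoint URL that returns page_size items per page."""
    # The leading underscore marks _pageSize as a control for the API itself.
    return f"{BASE}.{fmt}?{urlencode({'_pageSize': page_size})}"


def split_params(url):
    """Separate underscore-prefixed API controls from ordinary parameters."""
    controls, filters = {}, {}
    for key, value in parse_qsl(urlsplit(url).query):
        (controls if key.startswith("_") else filters)[key] = value
    return controls, filters


if __name__ == "__main__":
    url = list_url(5)
    print(url)
    print(split_params(url))
    # Without the underscore, the parameter no longer counts as an API control.
    print(split_params(BASE + ".json?pageSize=5"))
```

Keeping the underscore in one helper function avoids the easy mistake of writing pageSize instead of _pageSize.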

From easily-understood pieces like this (which, incidentally, are documented at some length), developers can build quite feature-rich user experiences. We are aware of at least one app currently in Apple's iPhone store which makes use of the data. As part of our work for EA, we also implemented a reference application: the Bathing Water Data Explorer, which allows users to search for bathing waters of interest by name, county and postcode, and then view all of the details, including detailed water sample history, for that location. While this application is intended to be useful to end-users, its main goal is to illustrate to other developers the range of bathing water quality data now available to use - completely free of charge - through the API.

Conclusion

Bathing water quality is one of a number of datasets curated by the Environment Agency, which they would like to make more accessible to the public and the developer community. We created a publishing system so that weekly bathing water quality sample updates can be incrementally published by EA staff, and made available as linked data for consumption via API calls. To illustrate one use of the data, the Bathing Water Data Explorer acts as a reference application. Together with extensive developer documentation, EA hopes that this will encourage a broad range of innovative uses of the data.

Dr Ian Dickinson

Comments

"Data is, when you strip it down, one of two things: numbers or words." -> What about things? Data is just a thing + its relationships with other things (incl. literals, that is, numbers or words).

Numbers vs. things

In response to @fadirra

Well, here lie deep philosophical waters indeed. As you may already be aware, the semantic web and linked data communities have spent a very long time debating the nuances between a thing and the representation of that thing and data about that thing. What we're trying to do with linked data is exactly to discuss things and their relationships to each other, and be able to express those things and relationships in a way that our computer programs can process them. In particular, we want it so that your computer program can process my data in such a way that the things and relationships you compute with bear a meaningful relationship to the semantics I intended my data to have. That's really one of the main goals.

Making precise distinctions between the identity of something, the representation of a thing and the thing itself turns out to be tricky, just using the tools we currently have to hand (i.e. web protocols as they are presently defined). When I quote an identifier such as:

http://environment.data.gov.uk/doc/bathing-water/ukk4304-34100

I mean that to refer to a particular UK bathing water location. However, what you get when you follow that link is a web page, not a beach per se. I can say that I helped create http://environment.data.gov.uk/doc/bathing-water/ukk4304-34100, meaning that I helped create the web page (or, equivalently, the information resource it denotes), not that I put any sand or rocks on the beach.

When I wrote in the article that data is numbers and characters, I meant that that's usually the base level of abstraction we start from before building up structures in order to talk about things and their relationships. Of course that's an arbitrary starting point, since we could say that our computers encode both strings and numbers as binary digits, and those in turn are encoded as patterns of electromagnetic activity, and those in turn, etc, until the whole argument disappears in a puff of quantum logic. However, in programming terms, strings and numbers are typically the starting point from which we build up the other abstractions and structures.

I hope I've addressed your question, but please follow up if not. Let me turn the issue back to you: what would it be like if we had things, as distinct from words and numbers in, say, an Excel spreadsheet?

Thanks for the reply. I get the point that a thing in Linked Data could be identified by a unique identifier (IRI), and then if we dereference the IRI, we could get the description (data) about it. The description itself usually comes in two versions: a human-readable one (e.g. HTML) and a machine-readable one (e.g. RDF). If we regard IRIs as strings then yes, a piece of data will basically contain only strings and numbers in a structured way (e.g. triple-based). So, addressing the issue, the Excel spreadsheet would contain IRIs (as identifiers of things), strings, and numbers.

One last thing, what if we have, say, the image of a particular bathing water location, then should we regard the image as a thing, or data?

Are pictures things or data?

Well, this is one of those questions relating to levels of abstraction. We encode images as a sequence of numbers (or, equivalently, one very large number!) and then use standardised file formats (.jpeg, .png, etc) to enable the interpretation of those numbers as images, through the help of some rendering program. So you could ask about a .png file: is this a picture, or the representation of a picture? At the end of the day, I'm not sure such hair-splitting is useful to practical application development, semantic or otherwise. In any case, in a linked-data data model, you wouldn't usually have image representation data, you'd have the location of a file (on the web or on the file system) that contains the encoding that represents the picture. As is well known, every problem in computer science can be solved by adding another layer of indirection!

API link

A link to API http://environment.data.gov.uk/id/bathing-water.json?pageSize=5 does not work at the moment

Typo in URL

Hi VasilyB,

Sorry about that - there's a typo in the article. It should be _pageSize, not pageSize. In the LDA, meta-properties, that control the operation of the API itself, are prefixed with an underscore to minimise the chance that they clash with actual properties in the data.

I'll ask to get the article updated - sadly, guest authors are not given write access, we have to go through the site editor to get changes made.

Ian

Typo fixed now

Thanks to prompt action by the editor, the broken link is now fixed.

Ian
