Monday, January 31, 2011

Data Scraping vs Providing Linked Data

In my first blog post I gave a brief overview of the Open Government Movement and how the Linked Data principles make publicly available data released by the UK and US governments open for citizen utility and economic opportunities. The second blog post described the Linked Data principles in more detail and how they can be part of a Corporate Transparency effort. I used two examples of publicly available data: payments data published on the AstraZeneca US website and energy consumption data published by the Volvo Group.

In this blog post I will describe how the PDF with payments information is being scraped and re-created as data. I will contrast this with how spending data is published as linked open data by local governments in the UK.

Re-actively let others scrape and re-create data
vs.
Pro-actively provide linked data

NB: My descriptions of these different approaches focus purely on the data. I do not make any judgement from a business or government perspective, nor can I fully assess them in the context of U.S. and UK legislation.

Data scraping
A recent re-tweet by Beth Noveck (@bethnoveck), the former US deputy chief technology officer for open government, pointed me to a post on the ProPublica Nerd Blog, Scraping for Journalism: A Guide for Collecting Data. It lists useful tools for scraping data off PDFs and HTML pages, and shows how to refine messy data using tools such as Google Refine. The author reports experiences from developing the Dollars for Docs news application, which lets users search pharmaceutical company payments to doctors. One of the sources is the Physician Engagement, which summarizes payments made to U.S. physicians who have spoken on behalf of AstraZeneca and/or its products.

Data taken from internal databases has been published as a PDF file. The data journalists at ProPublica must first interpret this file before they can scrape off text strings, re-create the data, and populate a public database with it together with data from other pharmaceutical companies.
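To make the effort concrete, here is a minimal sketch of the re-creation step, assuming the text has already been extracted from the PDF. The column layout, the sample line, and the names ROW and parse_row are all hypothetical and are not taken from the actual Physician Engagement file or from ProPublica's code.

```python
import re

# Hypothetical line of text as it might come out of a PDF text extractor.
# The layout (name, location, amount separated by runs of spaces) is an
# assumption made for illustration only.
SAMPLE = "Doe, Jane   Springfield, IL   $1,250"

# Columns are assumed to be separated by two or more spaces.
ROW = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<location>.+?)\s{2,}\$(?P<amount>[\d,]+)$"
)


def parse_row(text):
    """Turn one extracted text line into a record, or None if it does not fit."""
    match = ROW.match(text.strip())
    if match is None:
        return None  # headers, page footers, wrapped lines, and so on
    return {
        "name": match["name"],
        "location": match["location"],
        "amount": int(match["amount"].replace(",", "")),
    }


print(parse_row(SAMPLE))
```

Note that the meaning of each column lives only in the scraper's regular expression. Whoever re-creates the data has to guess it from the page layout, and the guess can break whenever the layout changes.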

Providing Open Linked Data
In a blog post series on Talis' Nodalities blog, Richard Wallis writes about Linked Spending Data – How and Why Bother. In the second part of the series he describes how the payment data can be searched and navigated, and how users can pose questions about individual payments.

This is enabled by the way the payment data has been provided as a standard model of triples (subject/predicate/object) with links to explicit semantics defined and structured in the Payment Ontology and the Data Cube vocabulary.


Some examples of what the above triples represent:
  • A data item instance is globally identified by the URI http://spending.lichfielddc.gov.uk/spend/860567 and is typed as an ExpenditureLine, defined in the Payment Ontology as a sub-class of Observation (from the Data Cube Vocabulary)

  • The value "120.00" is a property of this ExpenditureLine defined as netAmount (described in the Payment Ontology as "The net amount of the payment. This is the effective cost to the payer after any reclaimable tax has been deducted")
  • The local authority known as Lichfield District Council is globally identified with the URI http://statistics.data.gov.uk/id/local-authority/41UD and is the Payer of the payments part of the Invoice identified as http://spending.lichfielddc.gov.uk/invoice/7747. The Payer is also defined as a dimension (in the Data Cube Vocabulary).
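The examples above can be sketched as plain subject/predicate/object tuples. This is only an illustration: the instance URIs and the value come from the text, but the namespace URIs, the property name payer, and the helper objects_of are assumptions. Attaching the payer directly to the invoice is also a simplification, since the text attaches it to the payments that make up the invoice.

```python
# Namespace URIs below are assumptions for illustration; check the Payment
# Ontology and Data Cube vocabulary documentation for the authoritative ones.
PAYMENT = "http://reference.data.gov.uk/def/payment#"
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Instance URIs as given in the text.
LINE = "http://spending.lichfielddc.gov.uk/spend/860567"
COUNCIL = "http://statistics.data.gov.uk/id/local-authority/41UD"
INVOICE = "http://spending.lichfielddc.gov.uk/invoice/7747"

# A three-column table: every row is (subject, predicate, object).
triples = [
    (LINE, RDF_TYPE, PAYMENT + "ExpenditureLine"),
    (PAYMENT + "ExpenditureLine", RDFS_SUBCLASS_OF, QB + "Observation"),
    (LINE, PAYMENT + "netAmount", "120.00"),
    # Simplification: payer attached to the invoice, property name assumed.
    (INVOICE, PAYMENT + "payer", COUNCIL),
]


def objects_of(table, subject, predicate):
    """Return all objects for a given subject and predicate (hypothetical helper)."""
    return [o for s, p, o in table if s == subject and p == predicate]


print(objects_of(triples, LINE, PAYMENT + "netAmount"))
```

Because every subject and predicate is a global URI, a question such as "what is the net amount of this expenditure line?" becomes a simple lookup, with no guessing about what a column means.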


five star open Web data




This is a great example of what 5-star rated linked open data looks like (see my previous blog post for more details). In a future blog post I hope to get back to the Physician Engagement example, turned into a 5-star showcase of pro-actively providing linked open data so that others do not have to scrape data off PDFs.

Kudos to Richard Wallis (@rjw) for the great blog series on spending data, and to Kingsley Uyi Idehen (@kidehen) for pointing out the value of linked data as a 3-column restricted table of triples with global references (URIs).