Monday, January 31, 2011

Data Scraping vs Providing Linked Data

In my first blog post I gave a brief overview of the Open Government Movement and how the Linked Data principles make publicly available data released by the UK and US governments open for citizen utility and economic opportunities. The second blog post described the Linked Data principles in more detail and how they can be part of a Corporate Transparency effort. I used two examples of publicly available data: payments data published on the AstraZeneca US website and energy consumption data published by the Volvo Group.

In this blog post I will describe how the pdf with payments information is being scraped and re-created as data. I will contrast it with how spending data is published as linked open data by local governments in the UK.

 Re-actively let others scrape off and re-create data
vs.
Pro-actively provide linked data

NB: My descriptions of these different approaches focus purely on the data. I do not make any judgement from a business or government perspective, nor can I fully understand the approaches in the context of U.S. and UK legislation.

Data scraping 
A recent re-tweet by Beth Noveck (@bethnoveck), the former US deputy chief technology officer for open government, pointed me to a post on the ProPublica Nerd Blog: Scraping for Journalism: A Guide for Collecting Data. It lists useful tools for scraping data off pdfs and html pages, and shows how to refine messy data using tools such as Google Refine. The author reports experiences from developing the Dollars for Docs news application, which lets users search pharmaceutical company payments to doctors. One of the sources is the Physician Engagement report, which summarizes payments made to U.S. physicians who have spoken on behalf of AstraZeneca and/or its products.

Data taken from internal databases has been published as a pdf file. The data journalists at ProPublica have to interpret this file before they can scrape off text strings, re-create the data, and populate a public database with it together with data from other pharmaceutical companies.
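
The re-creation step can be pictured with a minimal sketch. It assumes the text pulled out of the pdf arrives as lines of the form "name, city, state, dollar amount". The sample lines, the field names and the parse_line helper are all invented for illustration; they are not taken from the actual report or from ProPublica's code.

```python
import re
from decimal import Decimal

# Hypothetical line layout: "Name, City, ST $1,234.00".
# The layout of the real pdf may well differ.
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?),\s*(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+"
    r"\$(?P<amount>[\d,]+\.\d{2})$"
)


def parse_line(line):
    """Turn one scraped text line into a record, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # page headers, footers and wrapped lines end up here
    record = match.groupdict()
    # Strip the thousands separator and keep the amount as an exact decimal.
    record["amount"] = Decimal(record["amount"].replace(",", ""))
    return record


# Made-up sample text, not real payment data.
scraped_lines = [
    "Physician Name  City  State  Amount",
    "Example Doctor, Springfield, IL $1,250.00",
    "Another Example, Portland, OR $300.00",
]

records = [r for r in (parse_line(line) for line in scraped_lines) if r is not None]
```

Even in this toy version the scraper has to guess at structure that the publisher already had in its internal database, which is exactly the effort that pro-actively published data would save.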

Providing Open Linked Data
In a blog post series on Talis' Nodalities blog, Richard Wallis writes about Linked Spending Data – How and Why Bother. In the second part of the series he describes how the payment data can be searched and navigated, and how users can pose questions about individual payments.

This is enabled by the way the payment data has been provided as a standard model of triples (subject/predicate/object) with links to explicit semantics defined and structured in the Payment Ontology and the Data Cube vocabulary.


Some examples of what the above triples represent:
  • A data item instance is globally identified by the URI http://spending.lichfielddc.gov.uk/spend/860567 and is typed as an ExpenditureLine, defined in the Payment Ontology as a sub-class of Observation (from the Data Cube Vocabulary)

  • The value "120.00" is a property of this ExpenditureLine defined as netAmount (described in the Payment Ontology as "The net amount of the payment. This is the effective cost to the payer after any reclaimable tax has been deducted")
  • The local authority known as Lichfield District Council is globally identified by the URI http://statistics.data.gov.uk/id/local-authority/41UD and is the Payer of the payments that are part of the Invoice identified as http://spending.lichfielddc.gov.uk/invoice/7747. The Payer is also defined as a dimension (in the Data Cube Vocabulary).
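
To make the subject/predicate/object shape concrete, here is a minimal sketch of the statements above as plain Python tuples. The two resource URIs and the value come from the examples above. The Payment Ontology namespace, the exact property names (such as "payer") and the choice to attach the payer directly to the invoice are my assumptions for illustration, so check the published ontology for the authoritative terms.

```python
# rdf:type is the standard RDF predicate for "is an instance of".
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumed namespace for the Payment Ontology; verify against the published ontology.
PAYMENT = "http://reference.data.gov.uk/def/payment#"

expenditure_line = "http://spending.lichfielddc.gov.uk/spend/860567"
invoice = "http://spending.lichfielddc.gov.uk/invoice/7747"
council = "http://statistics.data.gov.uk/id/local-authority/41UD"

# Each statement is one (subject, predicate, object) row.
triples = [
    (expenditure_line, RDF_TYPE, PAYMENT + "ExpenditureLine"),
    (expenditure_line, PAYMENT + "netAmount", "120.00"),
    # Simplification: the payer is attached to the invoice here, and the
    # property name "payer" is an assumption.
    (invoice, PAYMENT + "payer", council),
]


def describe(subject, rows):
    """Collect every predicate/object pair stated about one subject."""
    return {predicate: obj for s, predicate, obj in rows if s == subject}


line_facts = describe(expenditure_line, triples)
```

Because subjects, predicates and most objects are URIs, rows from different publishers can be put in the same table and still refer unambiguously to the same things.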



This is a great example of what 5-star rated linked open data looks like (see my previous blog post for more details). In a future blog post I hope to get back to the Physician Engagement example, turned into a 5-star showcase of pro-actively providing linked open data so that others do not have to scrape data off pdfs.

Kudos to Richard Wallis (@rjw) for the great blog series on spending data, and to Kingsley Uyi Idehen (@kidehen) for pointing out the value of linked data as a 3-col restricted table of triples with global references (URIs).

    Sunday, December 12, 2010

    Corporate Transparency and Linked Data

    In my previous blog post I described the Open Government Movement and how the Linked Data principles make publicly available data released by the UK and US governments open for citizen utility and economic opportunities.

    A recent blog post made me aware that I, and many others, tend to use the term open data to mean publicly available data:
    "Simply put, all open data is publicly available. But not all publicly available data is open. Open data does not mean that a government or other entity releases all of its data to the public. ... Rather, open data means that whatever data is released is done so in a specific way to allow the public to access it without having to pay fees or be unfairly restricted in its use."

    Source: What “open data” means – and what it doesn't, by Melanie Chernoff, published on the Open Knowledge Foundation Blog
    In this blog post I will adopt this distinction as I focus on publicly available data released by enterprises. I will start to look into how Linked Data principles can be applied to the data that enterprises make publicly available as part of their efforts for Corporate Transparency and Social Responsibility.

    What will the movement in Governments for Linked Open Data mean for Enterprises?

    How can Corporate Transparency be supported by applying Linked Data principles?

    Let me first introduce the Linked Data principles and also the 5-star deployment scheme for Linked Open Data. With this in mind I will highlight examples of data made publicly available by two enterprises: Volvo Group and AstraZeneca. And then, outline steps for Linking Open Enterprise Data -- from a 1-star to a 5-star rating.

    Linked Data principles
    The four principles, or rules, of Linked Data have been outlined by Tim Berners-Lee, often referred to as the "inventor of the web", in his Design Issues: Linked Data note:
    1. Use URIs (global identifiers) to identify things.
    2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
    3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
    4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
    Source: Linked Data page on Wikipedia
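
    Principles 2 and 3 can be illustrated with a small sketch of how a user agent might ask for an RDF description of a thing. The URI is a placeholder under example.org and the build_rdf_request helper is a name I made up. The request is only prepared here, not sent, and whether a given server honours the Accept header depends on that server.

```python
from urllib.request import Request


def build_rdf_request(uri, media_type="application/rdf+xml"):
    """Prepare an HTTP GET that asks for an RDF representation of a URI."""
    # The Accept header tells the server which format the client prefers.
    return Request(uri, headers={"Accept": media_type})


# Placeholder identifier, not a real published resource.
request = build_rdf_request("http://example.org/id/thing/42")

# Actually dereferencing the URI would be a call such as:
#   urllib.request.urlopen(request).read()
```

    A browser asking for the same URI would typically send an Accept header for html and get a human-readable page back, which is how one identifier can serve both people and programs.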
    5-star scheme for Linked Open Data
    Get a 5* mug - profits help W3C

    In my previous blog post I wrote about the 5-star deployment scheme for Linked Open Data presented by Tim Berners-Lee at the International Open Government Data Conference (#iogdc) in Washington, D.C., and the Open Government Data Camp (#ogdcamp) in London.

    What is required for 1-5 star ratings? What are the costs and benefits? I will elaborate on this for publicly available data released by enterprises, based on Linked Open Data star scheme by example, by Michael Hausenblas. To spread this nice idea you can buy your own 5-star mug and T-shirt.

    Publicly available enterprise data
    So, what does this mean for enterprises? Below are two examples of data made publicly available by two enterprises: Volvo Group and AstraZeneca. Both are large international enterprises, in different industries, that are regulated in different respects and regions, for example through the Corporate Integrity Agreement (CIA) for health services in the US.

    Volvo Group, Corporate Social Responsibility: publishes a yearly Sustainability Report with a Scorecard including key sustainability performance indicators such as Energy consumption (example from the scorecard 2009). Data is formatted as an html table and the whole report as a pdf.
    AstraZeneca, Corporate Transparency: publishes, for example, data on Physician Engagement, a summary of payments made to U.S. physicians who have spoken on behalf of AstraZeneca and/or its products. Data is published in a table of 2000+ rows as a pdf (Speaker compensation report, January - June 2010).


    Linking Open Enterprise Data

    These examples of data are made publicly available in a way that makes it possible for consumers to look at the data, print it, store it locally, and enter it manually into another system. If this had been done with an open license (such as PDDL, ODC-By or CC0) they would have earned a nice 1-star rating.

    For a 2-star rating, data should be made available as structured data (e.g., Excel instead of pdf) so that it can also be reused. Consumers can now directly process it with proprietary software to aggregate it, perform calculations, visualize it, etc. For a 3-star rating, data should be in non-proprietary, open formats (e.g., CSV instead of Excel). Consumers can now manipulate the data in any way they like, without being confined by the capabilities of any particular software.
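
    As a rough sketch of the 3-star step, the snippet below writes a small table to CSV using nothing but the Python standard library. The rows, column names and the to_csv helper are invented placeholders and do not reproduce figures from either company's report.

```python
import csv
import io

# Made-up rows standing in for a published table; not real report data.
rows = [
    {"indicator": "Example indicator A", "year": 2009, "unit": "GWh", "value": 10},
    {"indicator": "Example indicator B", "year": 2009, "unit": "tonnes", "value": 20},
]


def to_csv(table):
    """Serialize a list of dicts to CSV text, with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(table[0].keys()))
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


csv_text = to_csv(rows)
```

    Any spreadsheet, database or script can read the result, which is the point of the third star: the consumer is no longer tied to one vendor's software.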


    A key enabler for 4-star and 5-star ratings is to choose or design a vocabulary of terms for the things (using URIs) the information is about, and for the descriptions of these things, so that data can be linked. Consumers can now reuse parts of the data with explicit semantics and discover more (related) data while consuming the data.
    Source: Linked Open Data star scheme by example and Star badges
    Available vocabularies 
    A vocabulary that could help the AstraZeneca physician engagement example reach the 4-star rating is the Payments Ontology, which is being used for publishing UK government spending data as linked data (see COINS as Linked Data). The ontology (see Guide to the Payments Ontology) has been developed as a general purpose vocabulary for representing organizational spending information and is not specific to government or local government applications.

    Of relevance for bringing the Volvo Group example to the 4-star rating is the work in the eGovernment Interest Group for Linked Environment Data that Environment Agencies from Europe and the US are setting up. The Statistical Core Vocabulary (SCOVO) for representing statistical data on the Web has been used by the German Federal Environment Agency (UBA) to publish linked environment data.

    Thoughts for future posts
    In future blog posts I will continue to explore the opportunities, and challenges, of Linking Open Enterprise Data. I am also interested in experiences of applying Linked Data principles to data sources available within enterprise networks, to make it easier for employees and partners to consume them and to combine them with other linked data sources -- internal, shared, licensed and publicly available sources.

    While writing this post I was thinking of provenance, i.e. open history of data, in relation to the 5-star deployment scheme -- Maybe a 6-star rating for embedding provenance data using emerging provenance vocabularies? I wonder what Tim Berners-Lee thinks about that :-)


    Kudos to Michael Hausenblas (@mhausenblas) for the great 5-star scheme examples with costs and benefits, and the nice star badges, to Bill Roberts (@billroberts) for excellent input for the payment data example, and to Melanie Chernoff (@melaniechernoff) for the interesting blog post on publicly available and open data.

    Monday, November 22, 2010

    The Open Government Data Movement

    Last week I spent all my commuting hours catching up on two busy Twitter streams on my iPhone: #iogdc and #ogdcamp.

    For three days the International Open Government Data Conference (iogdc), in Washington, D.C., gathered the community of data owners, developers and policy makers from around the globe to share lessons learned, stimulate new ideas, and demonstrate the power of democratizing data. For two days London was the meeting place for a more European focused audience in the Open Government Data Camp (ogdcamp).

    Alexander B. Howard from O’Reilly Media published a great blog post on the second day in Washington that summarizes the whole week very well: Open data: accountability, citizen utility and economic opportunity.

    In both conferences Tim Berners-Lee, the inventor of the World Wide Web, described his “five star system” for open government data: from "one star" for making data available, often in pdf format, to "five stars" for linked data using the semantic web standard RDF (Resource Description Framework). What is required to earn the stars? See this excellent page: Linked Open Data star scheme by example.

    In a five-minute interview on YouTube, David Eaves, Public Policy Entrepreneur in Canada, gives his view on Open Government Data: why it is important, what the benefits are, and what governments should do.




    Two examples of what is happening in the Open Government sphere.
    In my next blog post I will share some thoughts on what this movement for Open Government Data means in an enterprise context.