Monday, March 21, 2011

When will we see the first data.xyz.com?



"http://data.xyz.com is the home of our open linked data"
                   Says the CIO of Corporation XYZ
When will we see such an announcement from a corporation?

I really liked the tweet today from Milton Keynes, UK (@mdaquin) pointing me to data.open.ac.uk, that is the home of open linked data from The Open University.

I would love to see an announcement from a corporation with high ambitions on corporate transparency and an understanding of the value of sharing pre-competitive data, with a CIO who has good insight into open data and linked data principles. A corporation that clearly states the applied open license (such as PDDL, ODC-By or CC0), and that has also earned a 5-star ranking (see Linked Data star scheme by example).

Or, does this already exist? Let me know if you know of something similar in an enterprise context.

For more information about the benefits of Linked Data, see a nice blog post by Stuart Brown (@stuartbrown) on the blog of the LUCERO Project (Linking University Content for Education and Research Online). See also my previous post on Corporate Transparency and Linked Data.

Sunday, March 13, 2011

Three presentations

The coming two weeks I'll be working on presentations for three events I have the opportunity to participate in. I will use this blog post as a way to shape my thinking, and a new blog post when developing the slides and manuscripts.
  1. Linked Data in Pharma
    A brief presentation of a short paper that has been accepted for the first international workshop on linked web data management in Uppsala, 25 March. The title of the paper is "Linked Data, an opportunity to mitigate complexity in pharmaceutical research and development" (link to be added). I have written it together with my colleague Bosse Andersson.
  2. Semantics for Clinical Data
    Some reflections on different approaches to provide semantics for clinical data, to be discussed in the EBI Industry Workshop on Biomedical Data and Model Interoperability in Cambridge, 28-29 March.
  3. Linked Clinical Data
    An introduction to Linked Data principles and pragmatic examples for the CDISC Interchange Europe 2011 conference in Brussels, 13-14 April.
I did find it hard to start working on this with all the terrible news on what is happening in Japan just now. Kudos to Jim Hendler and Ivan Herman for their tweets today on the power of linked open data with an interactive map using open earthquake data.
See Ivan Herman's blog post

Background, Audiences and Intentions 
Some brief notes on the background to my participation in the three events, on what I know about the audiences, and on my intentions with what I will talk about.

1.  Linked Data in Pharma
The first one is an event I learned about on the Twitter feed for #linkeddata. It's a workshop on linked data management arranged in conjunction with a conference on database technology. We saw this as an opportunity to go to a workshop here in Sweden on this interesting topic. We decided to re-write an article from last year for an internal publication to describe some insights from working in the W3C interest group for semantic web in Health Care and Life Science (HCLS), and in the Large Knowledge Collider (LarKC) EU-project.

The article we started from had an intended audience of colleagues in a pharma company with no knowledge of the standards and principles behind the huge cloud of linked open data, while the participants in the workshop will be highly knowledgeable researchers and practitioners in linked data management. My hope is that during 2011 we will have more internal experiences to report in an extended paper, as the linked data idea now also gets a lot of interest internally.
The Linking Open Data cloud diagram

2.  Semantics for Clinical Data
The second event is the result of interactions we have had with Bernard de Bono, who leads Drug Disease Modeling Resources (DDMoRe), one of the projects in the Innovative Medicines Initiative (IMI). I met Bernard in an EBI industry workshop on ontology engineering last year, and we talked about existing metadata standards for clinical data and the opportunities in ontology-based annotations of clinical data.

The list of attendees includes people from many of the European pharma companies and also from research centers such as EBI and INSERM. I assume many of them work in the pre-clinical / drug discovery phase and have a bioinformatics focus, so together with the people from CDISC I hope to be able to add a clinical perspective.

My contribution will be some reflections on different approaches to providing semantics along with clinical data. First, how it has been done when a lot of the semantics, that is, the knowledge of what clinical data represents, has been implicit and carried by people and documents. Then, how semantics is now made explicit for humans, both as standardized data exchange containers, e.g. the CDISC SDTM domain for lab test data, and as text strings of standardized codes and labels, so-called controlled terminologies, e.g. the list of lab test procedure codes, to simplify the programming needed to transform, integrate and analyze data. By linking to Bernard's presentation on the RICORDO 2] toolkit for semantic integration of biomedical resources, I will outline how clinical data can be annotated with ontology-based standards, making the semantics explicit using formal and machine-processable formats. I will also briefly talk about how clinical metadata registries could be used to support ontology-based annotation.

3.  Linked Clinical Data 
The abstract I proposed for the third event was triggered by the frustration I perceived from the FDA representatives at CDISC Interchange US in 2009, and is a follow-up to the brief discussions I had with some of the CDISC folks on linked data principles and semantic web standards. Here is how Jay Levin expressed it in the FDA panel in November 2009:
We want to separate the analysis view from how clinical data is exchanged. To have a very normalized, flexible way to convey the data as it actually was collected, as it occurred. And then from that create any number of disease area specific views and analysis specific views. You have tremendous options. So, instead of being locked into this difficult dance that I see happening with SDTM then you always try to decide how useful it’s going to be for correct analysis vs. how consistent it could be if you free up the potential ways data can be represented for disease specific areas. 1]
In my presentation I want to show examples of the RDF data model (triples) as such a "very normalized, flexible way to convey the data" (see also my comments on the blog post Wondering why the FDA hasn't more actively promoted CDISC standards). I'll also share the good news on how linked data principles are now applied by key players such as the UK and US governments, as described in my first blog post on The Open Government Data Movement. I will also show what RDF triples of linked data look like in practice, using the payment example from a local authority in the UK that I also used in my previous blog post on publishing linked data.
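
To make the idea of triples as a "very normalized, flexible way to convey the data" a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the `http://example.org/study/` namespace, the property names and the lab values are all made up and do not come from CDISC, HL7 or any real vocabulary, and a real implementation would use an RDF library and proper datatypes. The point is simply that the data is stored once as subject-predicate-object statements, and an analysis-specific view is derived from it afterwards.

```python
# Hypothetical namespace: none of these URIs belong to a real vocabulary.
EX = "http://example.org/study/"

# Each fact is a single (subject, predicate, object) triple.
triples = [
    (EX + "subject/001", EX + "hasObservation", EX + "obs/1"),
    (EX + "obs/1", EX + "test", "Glucose"),
    (EX + "obs/1", EX + "value", 5.4),
    (EX + "obs/1", EX + "unit", "mmol/L"),
    (EX + "subject/001", EX + "hasObservation", EX + "obs/2"),
    (EX + "obs/2", EX + "test", "Hemoglobin"),
    (EX + "obs/2", EX + "value", 140),
    (EX + "obs/2", EX + "unit", "g/L"),
]


def match(s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def lab_view(subject):
    """Derive one analysis-specific view (a row per lab test) from the triples."""
    rows = []
    for _, _, obs in match(s=subject, p=EX + "hasObservation"):
        # Collect the properties of this observation, keyed by local name.
        props = {pred[len(EX):]: obj for _, pred, obj in match(s=obs)}
        rows.append((props["test"], props["value"], props["unit"]))
    return rows


if __name__ == "__main__":
    for row in lab_view(EX + "subject/001"):
        print(row)
```

Other views (per disease area, per analysis) would be further functions over the same list of triples, without changing how the data itself is conveyed.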

five star open Web data

My key message will be some proposed pragmatic steps for how the CDISC standards can be published using the 5-star rating scheme for linked open data described in my second blog post.
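
As a rough sketch of how such steps could be checked, the 5-star scheme can be read as a cumulative checklist: each star requires all the previous ones. The small Python example below encodes that reading. The `Dataset` fields and the example publications are my own hypothetical illustrations, not a description of how any CDISC standard is actually published today.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of how a dataset is published."""
    on_web_open_license: bool      # 1 star: on the web, open license
    machine_readable: bool         # 2 stars: structured, machine-readable data
    non_proprietary_format: bool   # 3 stars: open, non-proprietary format
    uses_uris: bool                # 4 stars: URIs identify things (e.g. RDF)
    links_to_other_data: bool      # 5 stars: linked to other people's data


def stars(d: Dataset) -> int:
    """Count stars cumulatively: stop at the first criterion that is not met."""
    criteria = [
        d.on_web_open_license,
        d.machine_readable,
        d.non_proprietary_format,
        d.uses_uris,
        d.links_to_other_data,
    ]
    count = 0
    for met in criteria:
        if not met:
            break
        count += 1
    return count


# Hypothetical examples of publication styles, for illustration only.
pdf_on_web = Dataset(True, False, False, False, False)
spreadsheet = Dataset(True, True, False, False, False)
linked_rdf = Dataset(True, True, True, True, True)

if __name__ == "__main__":
    print(stars(pdf_on_web), stars(spreadsheet), stars(linked_rdf))
```

Treating the stars as cumulative is a design choice in this sketch: a dataset that uses URIs but has no open license gets no stars at all, which matches the emphasis on stating the license clearly.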

The title of the CDISC track is "eHRs and the World Beyond", and Patient Controlled Health Records (PCHR) or Personal Health Records (PHR), e.g. Google Health, could be the next big thing. So, as food for thought, I will also include a slide from the explorative work we do on leveraging semantics developed for PCHR for clinical research data as well. That is, the Computer-Based Patient Record (CPR) ontology developed by Chimezie Ogbuji, Case Western Reserve University's Center for Clinical Investigation, previously Cleveland Clinic.


1]  Jay Levin referred to the HL7 standards as the "normalized, flexible way". Earlier in 2009, he and others from FDA made some initial statements on moving from CDISC's SDTM standards to HL7's CDA (Clinical Document Architecture) standard for submissions of clinical data. This was not well received by CDISC, nor by the representatives from pharma and CRO companies. During 2010 FDA and CDISC came to a common agreement on CDISC SDTM. (That is, the 40+ different containers with standardized variable names, and the evolving controlled terminologies.) See two posts on CDISC's blog: Clear Messages from FDA CDER and CBER and FDA CDER Data Standards Plan V 1.0 and PDUFA IV IT Plan Update
2] Researching Interoperability using Core Reference Datasets and Ontologies for the Virtual Physiological Human (RICORDO)