Linked Data for Enterprises

Friday, July 22, 2011

ICBO2011 Preparations, part two

Via the email lists for the Clinical Data Interchange Standards Consortium (CDISC) Terminology team, and for Electronic Health Records for Clinical Research (EHR4CR), one of the Innovative Medicines Initiative (IMI) projects, I have seen some recent discussions on cross-terminology mapping challenges. These challenges arise because terminologies and coding nomenclatures, such as SNOMED CT, LOINC, CDISC SDTM CT:s, and MedDRA, have all been developed for different purposes, with disparate approaches and structures.

Together with attendees from NCI, NCBO, FDA, Mayo, SAS, Stanford and other organizations, a few colleagues and I will attend the International Conference on Biomedical Ontology (ICBO) next week. See my previous blog post for some more background.

Photo (Flickr): Automania

Apples and Oranges

In preparation for the first day's workshop, Representing Adverse Events, I found this paper highly interesting, as it compares and contrasts SNOMED CT and MedDRA and also describes the challenges in mapping between them: Heterogeneous but “standard” coding systems for adverse events: Issues in achieving interoperability between apples and oranges.

OBO Foundry based ontologies as "catalyst"
I hope the adverse event workshop, and the whole ICBO event, will be an opportunity for me to learn more about the Open Biological and Biomedical Ontologies (OBO) Foundry approach, and to discuss the challenges and opportunities in a “common language with which to energize cross-disciplinary research” 1).

I hope to better understand “how legacy terminologies, such as SNOMED CT, and the data coded with their aid can be successfully used for information-driven clinical and translational research” 2). My understanding is that the approach to be discussed at this event is the use of OBO Foundry based high-level reference ontologies, such as the Ontology for General Medical Science (OGMS), as a kind of catalyst instead of direct terminology-to-terminology mappings.

Yet another "standard", or ...
At the same time I found this cartoon, circulating on Twitter this week, quite amusing. So, I think it will be a hot and interesting week in Buffalo, NY.

xkcd: Standards

Here's a brief introduction to the use case I and a colleague will present at the adverse event workshop:

"A use case will be presented describing how a query from a regulatory authority is handled as part of the regular ongoing pharmacovigilance in pharmaceutical research and development. It will illustrate how databases and literature are being reviewed manually, exemplify how different databases are structured and highlight some of issues in the coding of data. With this use case, we hope to provide a background to our interest in an ontologically based approach to enable a more automatic way to access, structure and analyze patient safety related data."

1) For the Sake of Research and Patient Care, Scientists Must Find Common Language

2) A Unified Framework for Biomedical Terminologies and Ontologies

Tuesday, June 28, 2011

ICBO2011 Preparations

In a couple of weeks I will attend the International Conference on Biomedical Ontology (ICBO) 2011, in Buffalo, NY.

“In July, hundreds of international scientists from dozens of biomedical fields will meet at the University at Buffalo seeking a common language with which to energize cross-disciplinary research.” From ICBO News: For the Sake of Research and Patient Care, Scientists Must Find Common Language

And yes, it will be a great opportunity for me to see the Niagara Falls again. This time from the American side. Last time I saw it was in 1999 from the Canadian side when I attended the W3C conference in Toronto. The WWW8 conference where I was absolutely thrilled by the power of the simple and elegant model of RDF triples. At the WWW8 I also heard Tim Berners-Lee talk about the Semantic Web for the first time.

In the coming weeks I hope to be able to do a recap of a couple of ontology-related papers and articles, and also to read and digest some new ones listed for the events I have signed up for:

Representing Adverse Events (full day workshop)
Improving Structured EHR Data (half day tutorial)
OBI: A Shared Ontology for Representing Biomedical Studies and Resources (half day tutorial)

I will use one or two forthcoming blog posts to write up the insights and reflections that come to mind while reading.

Here's a quote that I think captures well my motivation to learn more about ontologies and to get my ICBO2011 attendance approved by my managers. It's taken from the great article More than Words: Biomedical Ontologies, which references the work of several of the international scientists who will get together at ICBO2011.

“… true ontologies are more than just controlled terms. They capture, in a logical, systematic way, what scientists regard as the basic truths about a topic. Like equations in physics or axioms in mathematics, they can even be the basis for computational models. When connected to databases, scientific papers, and software applications, ontologies ‘help cope with the ever-growing, chaotic accumulation of text and facts’ in biomedical and translational research.”

Sunday, June 12, 2011

SemTech2011

Over the last couple of days the Twitter feeds for #semanticweb and #linkeddata have been very busy, and #semtech peaked at more than one tweet per minute during the Semantic Technology Conference 2011 in San Francisco, 5-9 June.

See the #SemTech 2011 Twitterscript for a great overview of all the #semtech tweets sent during the conference, aligned with the sessions going on at the time. Kudos to @glenn_mcdonald and @needlebase.

For me, over here in Sweden, it's been a couple of late evenings and some busy mornings catching up on Twitter while commuting. Below are some of the presentations, discussions, and blogs I found especially interesting.

schema.org

A couple of days before the conference, the news came out on Twitter about the announcement from Google, Yahoo and Microsoft (Bing) of their joint schema.org: a single, global vocabulary, together with the use of Microdata to encode structured data in web pages using this vocabulary, so that search engines can do a better job.


A graph centric visualization of the schema.org vocabulary with "Thing" in the center of it

The first comment I re-tweeted as an "I Liked" on this topic was a tweet on Friday 5 June by Darin L. Stewart (@darinlstewart) pointing to his posting, with some early critique, on Gartner's blog: Schema.org: Webmaster One-Stop or Linked Data Land Grab? At the same time came the first RDF Schema version of the vocabulary on schema.rdfs.org. Great job done by Michael Hausenblas (@MHausenblas) et al. I also found it interesting to read the quick, positive comment from Chris Bizer, the Linked Data guru behind DBpedia, on Google's official webmaster blog. During the conference schema.org was also the *hot* topic, and late Wednesday evening my time I followed a heated online IRC discussion from the BOF on structured data in HTML and vocabularies. For more reading on this topic see the link bundle called schema.org is in town, compiled by Michael Hasheke (@hashek).

Linked Data Tutorial and Cookbook

Among all the tutorials and presentations at the conference I picked up two great Linked Data resources. First of all, Juan Sequeda's (@juansequeda) tutorial series, and also a presentation "I liked, very much": The Joy of Data - A cookbook for publishing and consuming Linked Data by Bernadette Hyland (@BernHylland). These two triggered me to create a separate Linked Data Resource Page with my favorites, including these two.

Linked Health Data

On the last day of the conference I spotted some tweets that took me to the presentation I liked most of all: Clinical quality linked data on health.data.gov, presented by George Thomas (@georgethomas). See also his blog post on data.gov, which makes an excellent argument for linking publicly available health data such as hospital compare data:

In addition to making flatfiles available to download on the Web, and providing applications that enable programmatic access to backend databases through the Web, imagine using the Web itself as a database: a massively distributed, decentralized database. This is what Linked Data is about – putting data in the Web.

Two technologies to catch up with

Many tweets talked about Callimachus, a framework for data-driven applications based on Linked Data principles that allows Web authors to quickly and easily create semantically-enabled Web applications. I will have a look at the Callimachus videos they published. And a presentation on Semantic Architecture & Composing Resource Oriented System, by Brian Sletten (@bsletten), made me curious to learn more about the architectural thinking called NetKernel.

Monday, May 2, 2011

Linking Clinical Data Standards

This is a follow-up to an earlier blog post where I outlined the background, audience and intention of three presentations. Two of them have been published on Slideshare:

Linked Data in Pharma (short paper, presentation)
Linking Clinical Data Standards (presentation - viewed 600+ times)

Here I focus on the second presentation, which I gave recently at the CDISC (Clinical Data Interchange Standards Consortium) conference in Brussels. One of the key people in the CDISC community, Dave Iberson-Hurst, lists semantic web as one of three themes and kindly refers to my presentation in a recent blog post.

My presentation, and also a very nice presentation from Roche, triggered interesting questions. Questions both on what I proposed as pragmatic first steps for linking clinical data standards, and also on what I see as future opportunities. Below you find the questions and my "answers", or rather thoughts. In a coming blog post I will discuss what all of this could mean for CDISC SHARE (metadata repository).

In my presentation - the last one on the first day - I urged the CDISC community to consider the use of semantic web standards and linked data principles for clinical data standards. It was very nice to be able to refer back to two of the presentations in the earlier sessions.

Pragmatic steps for CDISC
Firstly, to the presentation by Rebecca Kush, President of CDISC, on the value of open and free standards. The key message in my presentation pointed out:

Linking clinical data standards from Kerstin Forsberg

Roche use Semantic Web for clinical data standards

And secondly, to the presentation from Roche on the development of a "Global Data Standard Repository" (GDSR) using semantic web standards and an ontology tool (TopBraid Composer). My first slides, introducing the idea of "Triples" (the RDF standard model) and "Global Identifiers" (URI:s), were a recap for the audience, as Frederik Malfait (IMOS Consulting, presenting on behalf of Roche) had already introduced these in a really good way.

Questions and Answers

Even though it was the last presentation for the day (just before a very nice evening with TinTin at the Brussels Comic Strip Center) many people stayed around and I got the opportunity to sort out a key question, and also to outline two future opportunities:

Q: Do you mean we should publish the actual clinical data openly?

A: No! What should be made publicly available is another topic. My key message is that the free and open clinical data standards as they are currently constructed should be made available as linked open clinical data standards 1]. This means, using semantic web standards. (I propose the use of RDF/XML format as an alternative to Excel and ODM/XML.) And, also applying the Linked Data principles. (For example, assigning URI:s as global identifiers as an alternative to text strings for the submission values.)
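To make the last point a bit more concrete, here is a minimal Python sketch of what "a URI instead of a text string" could look like for a controlled terminology term. The base URI `http://example.org/cdisc/sdtm-ct/` and the helper names `mint_uri` and `to_ntriple` are my own hypothetical choices for illustration, not anything CDISC or NCI publishes; the code list and submission value (`VSTESTCD`, `SYSBP`) are used only as an example.

```python
from urllib.parse import quote

# Hypothetical base URI, for illustration only (not an official CDISC namespace).
BASE = "http://example.org/cdisc/sdtm-ct/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def mint_uri(codelist: str, submission_value: str) -> str:
    """Build a global identifier from a code list name and a submission value."""
    # Percent-encode both parts so spaces or slashes cannot break the URI path.
    return BASE + quote(codelist, safe="") + "/" + quote(submission_value, safe="")


def to_ntriple(subject: str, predicate: str, literal: str) -> str:
    """Serialize one triple with a plain literal object as an N-Triples line."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


# The text string "SYSBP" becomes a URI that others can link to.
term = mint_uri("VSTESTCD", "SYSBP")
print(to_ntriple(term, RDFS_LABEL, "Systolic Blood Pressure"))
```

The submission value is still there, but as part of an identifier that can be looked up and linked to, rather than as a bare string in a spreadsheet cell.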

Q: Does this relate to ontologies for bioinformatics?
A: Yes. The insights from developing for example the Gene Ontology are highly applicable when representing and structuring the entities and relations in the clinical reality. In some extra slides to my presentation I propose explorative work to construct the next generation of clinical data standards using modern ontologies 2] based on the so called Open Biological and Biomedical Ontologies (OBO) Foundry.

Q: Do you mean that this would take away the need for manual transformation of clinical data?
A: Yes and No.
Yes, because the above outlined next generation of clinical data standards (i.e. using semantic web standards, applying linked data principles and being based on modern ontologies) would improve the research utility of clinical datasets. That is, firstly, a very normalized, flexible way to convey clinical data. And, secondly, machine-processable clinical data ready for automatic transformation and direct querying, and ready for inferencing and reasoning.
No, because existing data needs to be transformed according to the above. And no for quite some time, as there are many things to explore and learn. A highly pragmatic, incremental and stepwise approach is required 3].

1] See my presentation slides 31-36 for more details on the pragmatic steps I propose for CDISC, and NCI.

2] The two OBO Foundry based ontologies I am referring to are the Translational Medicine Ontology, TMO (a.k.a. the Pharma Ontology) and the Computer-Based Patient Record (CPR) Ontology. See also an excellent article on biomedical ontologies: More Than Words, in the Clinical and Translational Science Network.

3] See my short paper Linked data, an opportunity to mitigate the complexity in pharmaceutical research and development

Kudos to Frederik Malfait and Jonathan Chainey (Roche), Dave Iberson-Hurst (@Assero_UK), Bron Kisler (@CDISC), Philippe Verplancke and Isabelle de Zegher for great discussions F2F in Brussels.

Monday, March 21, 2011

When will we see the first data.xyz.com?

"http://data.xyz.com is the home of our open linked data"

Say the CIO of Corporation XYZ

When will we see such an announcement from a corporation?

I really liked the tweet today from Milton Keynes, UK (@mdaquin) pointing me to data.open.ac.uk, that is the home of open linked data from The Open University.

I would love to see an announcement from a corporation with high ambitions on corporate transparency and an understanding of the value of sharing pre-competitive data, with a CIO who has good insights into open data and linked data principles. A corporation that clearly states the applied open license (such as PDDL, ODC-by or CC0), and that has also earned a 5 star ranking (see Linked Data star scheme by example).
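For readers who have not seen the star scheme, here is a rough Python sketch of how I read it: each star builds on the previous one, so a dataset only earns a star if all the lower ones are already in place. The `Dataset` class and its field names are hypothetical, made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist for one published dataset."""
    open_license: bool  # 1 star: on the Web under an open license
    structured: bool    # 2 stars: machine-readable structured data
    open_format: bool   # 3 stars: non-proprietary format (e.g. CSV)
    uses_uris: bool     # 4 stars: URIs identify things (RDF)
    linked: bool        # 5 stars: linked to other people's data


def star_rating(d: Dataset) -> int:
    """Count stars cumulatively: stop at the first criterion that is not met."""
    criteria = [d.open_license, d.structured, d.open_format, d.uses_uris, d.linked]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A pdf on a website under an open license, and a fully linked RDF dataset.
pdf_report = Dataset(True, False, False, False, False)
linked_rdf = Dataset(True, True, True, True, True)
print(star_rating(pdf_report), star_rating(linked_rdf))
```

Note that without an open license the rating stays at zero, however well structured the data is.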

Or, does this already exist? Let me know if you know of something similar in an enterprise context.

For more information about the benefits of Linked Data, see a nice blog post by Stuart Brown (@stuartbrown) on the blog of the LUCERO Project (Linking University Content for Education and Research Online). See also my previous post on Corporate Transparency and Linked Data.

Sunday, March 13, 2011

Three presentations

The coming two weeks I'll be working on presentations for three events I have got the opportunity to participate in. I will use this blog post as a way to shape my thinking and a new blog post when developing the slides and manuscripts.

Linked Data in Pharma

A brief presentation of a short paper we have had accepted for the first international workshop on linked web data management in Uppsala, 25 March. The title of the paper is: Linked Data, an opportunity to mitigate complexity in pharmaceutical research and development (link to be added). I have written it together with my colleague Bosse Andersson.

Semantics for Clinical Data

Some reflections on different approaches to provide semantics for clinical data to be discussed in the EBI Industry Workshop on Biomedical Data and Model Interoperability in Cambridge, 28-29 March.

Linked Clinical Data

An introduction to Linked Data principles and pragmatic examples for the CDISC Interchange Europe 2011 conference in Brussels, 13-14 April.

I found it hard to start working on this with all the terrible news on what is happening in Japan just now. Kudos to Jim Hendler and Ivan Herman for their tweets today on the power of linked open data with an interactive map using open earthquake data.

See Ivan Herman's blog post

Background, Audiences and Intentions

Some brief notes on the background to my participation in the three events, and also on what I know about the audiences, and my intentions with what I will talk about.

1. Linked Data in Pharma

The first one is an event I learned about on the Twitter feed for #linkeddata. It's a workshop on linked data management arranged in conjunction with a conference on database technology. We saw this as an opportunity to go to a workshop here in Sweden on this interesting topic. We decided to re-write an article from last year for an internal publication to describe some insights from working in the W3C interest group for semantic web in Health Care and Life Science (HCLS), and in the Large Knowledge Collider (LarKC) EU-project.

The article we started from had an intended audience of colleagues in a pharma company with no knowledge of the standards and principles behind the huge cloud of linked open data.

The Linking Open Data cloud diagram

The participants in the workshop, on the other hand, will be highly knowledgeable researchers and practitioners in linked data management. My hope is that during 2011 we will have more internal experiences to report in an extended paper, as the linked data idea is now also getting a lot of interest internally.

2. Semantics for Clinical Data 

The second event is the result of interactions we have had with Bernard de Bono, leading the Drug Disease Modeling Resources (DDMoRe), one of the projects in the Innovative Medicines Initiative (IMI). I met Bernard at an EBI industry workshop on ontology engineering last year and we talked about existing metadata standards for clinical data and the opportunities in ontology based annotations of clinical data.

The list of attendees includes people from many of the European pharma companies and also from research centers such as EBI and INSERM. I assume many of them work in the pre-clinical / drug discovery phase and have a bioinformatics focus, so together with the people from CDISC I hope to be able to add a clinical perspective.

My contribution will be some reflections on different approaches to providing semantics along with clinical data. First, how it has been done when a lot of the semantics, that is, the knowledge of what clinical data represents, has been implicit and carried by people and documents. Then, how semantics is now made explicit for humans as standardized data exchange containers, e.g. the CDISC SDTM domain for lab test data, and as text strings of standardized codes and labels, so-called controlled terminologies, e.g. the list of lab test procedure codes, to simplify the programming needed to transform, integrate and analyze data. By linking to Bernard's presentation on the RICORDO 2] toolkit for semantic integration of biomedical resources, I will outline how clinical data can be annotated with ontology-based standards, making the semantics explicit using formal and machine-processable formats. I will also briefly talk about how clinical metadata registries could be used to support ontology-based annotation.

3. Linked Clinical Data

The abstract I proposed for the third event was triggered by the frustration I sensed from the FDA representatives at CDISC Interchange US in 2009. It is also a follow-up to the brief discussions I had with some of the CDISC folks on linked data principles and semantic web standards. Here is how Jay Levin expressed it in the FDA panel in November 2009:

We want to separate the analysis view from how clinical data is exchanged. To have a very normalized, flexible way to convey the data as it actually was collected, as it occurred. And then from that create any number of disease area specific views and analysis specific views. You have tremendous options. So, instead of being locked into this difficult dance that I see happening with SDTM then you always try to decide how useful it’s going to be for correct analysis vs. how consistent it could be if you free up the potential ways data can be represented for disease specific areas. 1]

In my presentation I want to show examples of the RDF data model (triples) as such a "very normalized, flexible way to convey the data" (see also my comments on this blog post Wondering why the FDA hasn't more actively promoted CDISC standards). I'll also share the good news on how linked data principles are now applied by key players such as the UK and US governments, as described in my first blog post on The Open Government Data Movement. I will also show what RDF triples of linked data look like in practice, using the payment example from a local authority in the UK that I also used in my previous blog post on publishing linked data.

My key message will be some proposed pragmatic steps for how the CDISC standards can be published using the 5-star rating scheme for linked open data described in my second blog post.

The title of the CDISC track is "eHRs and the World Beyond", and Patient Controlled Health Records (PCHR) or Personal Health Records (PHR), e.g. Google Health, could be the next big thing. So, as food for thought, I will also include a slide from the explorative work we do on leveraging semantics developed for PCHR for clinical research data as well. That is, the Computer-Based Patient Record (CPR) ontology developed by Chimezie Ogbuji, Case Western Reserve University's Center for Clinical Investigation, previously Cleveland Clinics.

1] Jay Levin referred to the HL7 standards as the "normalized, flexible way". Earlier in 2009, he and others from FDA made some initial statements on moving from CDISC's SDTM standards to HL7's CDA (Clinical Document Architecture) standard for submissions of clinical data. This was not well received by CDISC, nor by the representatives from pharma and CRO companies. During 2010 FDA and CDISC came to a common agreement on CDISC SDTM. (That is, the 40+ different containers with standardized variable names, and the evolving controlled terminologies.) See two posts on CDISC's blog: Clear Messages from FDA CDER and CBER and FDA CDER Data Standards Plan V 1.0 and PDUFA IV IT Plan Update.

2] Researching Interoperability using Core Reference Datasets and Ontologies for the Virtual Physiological Human (RICORDO)

Monday, January 31, 2011

Data Scraping vs Providing Linked Data

In my first blog post I gave a brief overview of the Open Government Movement and how the Linked Data principles make publicly available data released by the UK and US governments open for citizen utility and economic opportunities. The second blog post described in more detail the Linked Data principles and how they can be part of a Corporate Transparency effort. I used two examples of publicly available data: payments data published on the AstraZeneca US website and energy consumption data published by the Volvo Group.

In this blog post I will describe how the pdf with payments information is being scraped and re-created as data. I will contrast this with how spending data is published as linked open data by local governments in the UK.

Re-actively let others scrape off and re-create data
vs.
Pro-actively provide linked data

NB: My descriptions of these different approaches have a true data focus. I do not make any judgement from a business or government perspective, nor can I fully understand them in the U.S. and UK legislation contexts.

Data scraping

A recent re-tweet by Beth Noveck (@bethnoveck), the previous US deputy chief technology officer for open government, pointed me to a post about Scraping for Journalism: A Guide for Collecting Data on the ProPublica Nerd Blog. It lists useful tools to scrape data off pdf:s and html-pages, and explains how to refine messy data using tools such as Google Refine. The author reports experiences from developing the Dollars for Docs news application that lets users search pharmaceutical company payments to doctors. One of the sources is the Physician Engagement summarizing payments made to U.S. physicians who have spoken on behalf of AstraZeneca and/or its products.

Data taken from internal databases has been published as a pdf file, which the data journalists from ProPublica must interpret before they can scrape off text strings, re-create the data, and populate a public database with it together with data from other pharmaceutical companies.

Providing Open Linked Data

In a blog post series on Talis' Nodalities blog, Richard Wallis writes about Linked Spending Data – How and Why Bother. In the second part of the series he describes how the payment data can be searched and navigated, and also how users can pose questions regarding individual payments.

This is enabled by the way the payment data has been provided as a standard model of triples (subject/predicate/object) with links to explicit semantics defined and structured in the Payment Ontology and the Data Cube vocabulary.

Some examples of what the above triples represent:

A data item instance is globally identified by the URI http://spending.lichfielddc.gov.uk/spend/860567 and is typed as an ExpenditureLine, defined in the Payment Ontology as a sub-class of Observation (from the Data Cube Vocabulary).
The value "120.00" is a property of this ExpenditureLine defined as netAmount (described in the Payment Ontology as "The net amount of the payment. This is the effective cost to the payer after any reclaimable tax has been deducted")
The local authority known as Lichfield District Council is globally identified with the URI http://statistics.data.gov.uk/id/local-authority/41UD and is the Payer of the payments part of the Invoice identified as http://spending.lichfielddc.gov.uk/invoice/7747. The Payer is also defined as a dimension (in the Data Cube Vocabulary).
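As a minimal sketch, the three statements above can be written down as plain subject/predicate/object tuples in Python and queried with a simple pattern match. The three instance URIs are the ones quoted above. The Payment Ontology namespace and the exact property names (`netAmount`, `payer`) are my assumptions for illustration, and attaching `payer` directly to the invoice is a simplification of how the published data relates payments and invoices. The `match` helper is hypothetical, not part of any library.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumed namespace for the Payment Ontology (illustration only).
PAY = "http://reference.data.gov.uk/def/payment#"

# Instance URIs as quoted in the examples above.
LINE = "http://spending.lichfielddc.gov.uk/spend/860567"
INVOICE = "http://spending.lichfielddc.gov.uk/invoice/7747"
COUNCIL = "http://statistics.data.gov.uk/id/local-authority/41UD"

# A "3-column table" of triples: (subject, predicate, object).
triples = [
    (LINE, RDF_TYPE, PAY + "ExpenditureLine"),
    (LINE, PAY + "netAmount", "120.00"),
    (INVOICE, PAY + "payer", COUNCIL),  # simplified link, see note above
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything we know about the expenditure line.
for subject, predicate, obj in match(triples, s=LINE):
    print(predicate, obj)
```

Because subjects, predicates and most objects are global URIs, the same pattern match works unchanged if triples from another council are added to the list.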

This is a great example of what 5-star rated linked open data looks like (see my previous blog post for more details). In a future blog post I hope to get back to the Physician Engagement example, turned into a 5-star showcase of pro-actively providing linked open data so that others do not need to scrape data off pdf:s.

Kudos to Richard Wallis (@rjw) for the great blog series on spending data. And also to Kingsley Uyi Idehen (@kidehen) for pointing out the value of linked data as a 3-col restricted table of triples with global references (URIs).
