Sunday, May 27, 2012

AstraZeneca re-joins W3C HCLS

After a warm and sunny day of kayaking out in the archipelago north of Gothenburg, it was nice to catch up on Twitter and see the official announcement from W3C that my employer, AstraZeneca, has joined W3C. It's actually a re-join, as we first joined W3C in 2006 to participate in the Semantic Web Health Care and Life Sciences Interest Group (HCLS IG).


Update 6 June: I recommend this nice slide deck for an overview of Semantic Web and Related Work at W3C, presented by Ivan Herman (@ivan_herman) at the 2012 Semantic Tech & Business Conference in San Francisco, CA, USA, on 5 June.

I attended and reported back from the W3C conference in Edinburgh in May 2006 (WWW2006) and from the next one in Banff in May 2007 (WWW2007) together with my former colleague Bosse Andersson (@bbalsa). My focus was on applying semantic web standards for clinical data and in 2007 Eric Neumann (@ericneumann), one of the HCLS pioneers, and I published a W3C Note in the Drug Safety and Efficacy task force on CDISC's Study Data Tabulation Model (SDTM). And together with most of the members in the HCLS group I co-authored an important article in BMC Bioinformatics Advancing translational research with the Semantic Web.  

In late 2007 I had to focus on other tasks, while Bosse and colleagues in the US (Julia Kozlovsky, Elgar Pichler and Otto Ritter) continued the interactions with other parties across life science and health care in two of the HCLS groups: Linking Open Drug Data (LODD) and Translational Medicine Ontology (TMO).

In early 2010, when I returned my job focus to semantic interoperability, AstraZeneca had decided not to renew the W3C membership. To stay updated I started to use social media as a way to engage with the semantic web and linked data community, to follow thought leaders in the intersection between eHealth and Clinical Research, and to share news and insights with colleagues.

In early 2011 Bosse and I wrote a short paper to summarise insights from AstraZeneca's engagement in W3C HCLS and in the EU project Large Knowledge Collider (LarKC). The paper, Linked Data, an opportunity to mitigate complexity in pharmaceutical research and development, starts with a look back to one of the most inspiring meetings I have been in:
During the WWW2007 conference a breakthrough of the Linked  Data idea happened in a session where web experts demonstrated the power of a new generation of the web, a web of data. For us attending the session it was hard to imagine the full potential on what this idea would mean for individual scientists and for a  pharmaceutical company. 
As described in my earlier blog post, we now have a new program in AstraZeneca called Integrative Informatics (i2), establishing the components required to let a linked data cloud grow across R&D. Re-joining W3C and reconnecting with HCLS is one step in this.




Sunday, May 6, 2012

Semantic models for CDISC based standard and metadata management

In mid April we gave a presentation at the 2012 CDISC (Clinical Data Interchange Standards Consortium) Interchange Europe with the title: Semantic models for CDISC based standard and metadata management (see our slides and short paper). This time it was held in a sunny, but chilly, Stockholm at a very nice hotel (Elite Marina Tower). Last year Frederik Malfait, consulting at Roche, and I, working for AstraZeneca, gave two different presentations at the 2011 conference in Brussels. See my blog post: Linking Clinical Data Standards

Since then we have seen more interest in semantic web standards in the CDISC community; see for example the article in Applied Clinical Trials Online (@Clin_Trials): Digital Data, the Semantic Web, and Research, by Wayne Kubick, the new CTO of CDISC. This year Frederik and I gave a joint presentation with a key message to the CDISC organisation: "Put semantics into the semantics". That is, to start using semantic web standards and linked data principles for the whole suite of CDISC standards. See our list of proposals below.

In my introduction I described the current situation, where the question now is "Not when, but how" to best adopt CDISC standards. At the same time, the different CDISC standards are not linked to each other and are published in different formats, and so-called metadata registries (MDR) are requested for robust life cycle management of standards.

Real world use 

In my brief introduction (see slides 5-11) to the core semantic web standard, the so-called RDF triple, I showed an example of how Google uses RDF based standards to improve search (see my previous blog post on schema.org). I also showed how NCI uses RDF to publish the NCI Thesaurus, see RDF/OWL download of NCIt via LexEVS, and how RDF is used for an early version of the domain model for biomedical research (BRIDG), see RDF/OWL representation of BRIDG/ISO21090. In both these cases the RDF is published as XML, but RDF triples can also be published in other serialisation formats (e.g. JSON, Turtle, and N-Triples). I also showed the latest version of the Linked Open Data cloud, with even more linked datasets than the one Frederik and I had in our presentations last year. I then turned over to the main part of our presentation, describing two real-world examples of how sponsors are now starting to use semantic web standards and linked data principles.
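To make the idea of a triple and its serialisation a bit more concrete, here is a minimal sketch in Python using only the standard library. The example.org URIs, the property name and the helper to_ntriples are made up for illustration and do not come from any of the datasets mentioned above; a real project would more likely use a dedicated RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str                  # a URI
    predicate: str                # a URI
    obj: str                      # a URI, or a plain string literal
    obj_is_literal: bool = False  # True when obj is a literal, not a URI


def to_ntriples(t: Triple) -> str:
    """Serialise one triple as a single N-Triples line.

    Simplified: only backslashes and double quotes are escaped in literals,
    and language tags and datatypes are not handled.
    """
    if t.obj_is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# Hypothetical example data (not a real study or vocabulary).
triple = Triple(
    "http://example.org/id/study/S001",
    "http://example.org/def/title",
    "An example study",
    obj_is_literal=True,
)
print(to_ntriples(triple))
```

The same statement could be written in Turtle, RDF/XML or JSON; the triple itself stays the same and only the notation changes.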

Linked Data cloud to grow across AstraZeneca R&D

Photo from CDISC Facebook
In AstraZeneca we have a new program called Integrative Informatics (i2) establishing the components required to let a linked data cloud grow across R&D. A key component is the URI policy for how to make for example a Clinical Study linkable by giving it a URI, that is a Uniform Resource Identifier, e.g. http://research.data.astrazeneca.com/id/clinicalstudy/D5890C00003. This is an identifier for a clinical study with the study code D5890C00003 that should be persistent and not dependent on any system. In the same way we will give guidance on how to use URI:s to make other key entities such as Investigator and Lab linkable. Also standard data elements from CDISC and internal ones to be managed in a future MDR should have URI:s to make them linkable. For more information on how URI:s are being used in for example the UK and US governments, see my URI design page.
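As a rough illustration of what such a URI policy could look like in practice, here is a small Python sketch that builds identifiers following the pattern of the clinical study example above. The function name mint_uri and the path segments for investigator and lab are my own assumptions for illustration, not the actual policy.

```python
from urllib.parse import quote

# Base taken from the clinical study example in the text.
BASE = "http://research.data.astrazeneca.com/id"


def mint_uri(entity_type: str, local_id: str) -> str:
    """Build an identifier URI from an entity type and a local code.

    The type is lower-cased and both parts are percent-encoded, so the
    result depends only on the entity type and its code, not on the system
    that happens to hold the data.
    """
    type_segment = quote(entity_type.lower(), safe="")
    id_segment = quote(local_id, safe="")
    return f"{BASE}/{type_segment}/{id_segment}"


study_uri = mint_uri("clinicalstudy", "D5890C00003")
print(study_uri)

# Hypothetical: the "investigator" segment and the code are assumptions.
print(mint_uri("Investigator", "INV 001"))
```

The point of keeping the function this simple is that the URI depends only on the entity type and its code, never on the application that stores the record.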

A semantic web standard based MDR in Roche

Photo from CDISC Facebook
Frederik described the schema, content and architecture of the Roche Biomedical MDR. He then went through a demo using an RDF representation of a CDISC standard example and of an internal Roche standard (you will find the screenshots from the demo at the end of the slide deck). He first showed how the standards could be viewed using a general tool (TopBraid Composer from TopQuadrant, but it could be any other RDF tool such as Protégé, a common open source tool). On slides 20-28 you can see how SDTM model v1.2, SDTM IG v3.1.2, and the SDTM CT:s are all linked together (for example Observation Class: Event - Domain: AE - Variable: AEOUT - Submission value: NOT RECOVERED/NOT RESOLVED). He then showed the same RDF representation via the application Roche Global Standard Data Browser (slides 29-37). Frederik also showed how the linked data standards can be exported in SAS and Excel formats (slides 42-50). And finally, he showed an example from a Roche standard questionnaire.
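The chain from observation class to submission value can be sketched as a handful of triples that are followed link by link. This is only an illustration in plain Python: the identifiers and link names (hasDomain, hasVariable, hasSubmissionValue) are hypothetical and are not taken from the Roche schema.

```python
# Hypothetical identifiers and link names, for illustration only.
TRIPLES = [
    ("class:Event", "hasDomain", "domain:AE"),
    ("domain:AE", "hasVariable", "variable:AEOUT"),
    ("variable:AEOUT", "hasSubmissionValue", "NOT RECOVERED/NOT RESOLVED"),
]


def follow(start, predicates, triples):
    """Follow a chain of links from start and return the nodes visited.

    At each step the first matching triple is used; the walk stops early
    if no triple matches the current node and link name.
    """
    path = [start]
    node = start
    for predicate in predicates:
        matches = [o for s, p, o in triples if s == node and p == predicate]
        if not matches:
            break
        node = matches[0]
        path.append(node)
    return path


path = follow(
    "class:Event",
    ["hasDomain", "hasVariable", "hasSubmissionValue"],
    TRIPLES,
)
print(" - ".join(path))
```

Because everything is expressed as links, the same walk works whether the pieces come from the model, the implementation guide or the controlled terminology.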

Proposals to CDISC

In the slides you can see that Frederik had to transform CDISC standards into RDF using a schema he developed for Roche and give them URI:s in a Roche namespace (e.g. http://gdsr.roche.com/cdisc/sdtmig-3-1-2#Column.AE.AEOUT for one of the data elements). This is not ideal; instead we would like CDISC to provide these. Hence the drive from our leadership in Roche and AstraZeneca for Frederik and myself to push back to CDISC.

Below a draft list of proposals to CDISC: 
  • Decide on a URI design for CDISC standards (e.g. http://id.cdisc.org/sdtm).
  • Review the schema Frederik has proposed for the core MDR in CDISC SHARE. 
  • Publish the new SDTM v1.3 and SDTM IG v3.1.3 as RDF in XML, JSON, Turtle, and N-Triples formats using the reviewed schema and URI design. (As options to current publication formats, i.e. PDF, HTML, CSV, XML/ODM.)
  • Work together with NCI on enhancing the RDF/OWL version of NCI Thesaurus. Also review the option to use the RDF/SKOS standard and apply linked data principles. Publish coming versions of CDISC CT:s as RDF in XML, JSON, Turtle, and N-Triples. 
  • Work together with NCI on enhancing the RDF/OWL representation of BRIDG/ISO21090 model and apply linked data principles to make all BRIDG classes, properties and ISO21090 data types linkable.
  • Extend the MDR schema for CDISC SHARE for linkage to relevant BRIDG classes and properties and to ISO21090 data types.
  • Start exploring semantic web standards and linked data principles also for clinical data, including making individual clinical data points linkable using URI:s and annotating them using existing and emerging clinical standard terminologies and ontologies.