Saturday, December 29, 2012

My MOOCs Spring 2013

Great to see that the news program on SVT (Swedish Television) covered MOOCs (Massive Open Online Courses) in a news story the other day.

SVT Nyheter, 27 Dec. 2012: Toppuniversitet ger gratiskurser på nätet ("Top universities offer free courses online").

During 2012 I followed a few courses via one of the organisations mentioned in the news program: Coursera. Two of the courses were excellent: Model Thinking and Fundamentals of Pharmacology, and they are on Coursera's list of 211 (!) courses. The course "Software Engineering for Software as a Service (SaaS)", on the other hand, was not of the same high quality, and it is no longer on the list.

For Spring 2013 I have enrolled in three MOOCs. So now I know what to do while commuting two hours per day during the coming months :-)

It's great to see how all of this has taken off during 2012, offering courses not only for data nerds like myself but also for many others.

So, I was thinking of my sister when I read these teasers from Coursera:
  • "Ever wonder why people do what they do? This course offers some answers based on the latest research from Social Psychology."
  • "In the course Introductory Human Physiology students learn to recognize and to apply the basic concepts that govern integrated body function (as an intact organism) in the body's nine organ systems."

Sunday, September 16, 2012

Mind maps just begging for RDF triples and formal models

Earlier this week the CDISC English Speaking User Group (ESUG) Committee arranged a webinar, "CDISC SHARE - How SHARE is developing as a project/standard", with Simon Bishop, Standards and Operations Director, GSK. I found the comprehensive presentation from Simon and his colleague Diane Wold very interesting.

Interesting because the presentation is an excellent illustration of how "Current standards (company standards, SDTM standards, other standards) do not currently deliver the capability we require". I also find the presentation interesting because it points to mind maps as a way forward, as "Diagrams help us understand clinical processes and how this translates into datasets and variables." (Quotes from slide 20 of the presentation: Conclusions.)

Below are a couple of examples of mind maps from the presentation, and also the background to my thinking that they are mind maps just begging for RDF triples and formal models of the clinical and biomedical reality, to make them fully ready "both for human understanding and for computer interpretation".


High level mind map from the Parkinson's disease example
by Diane Wold, GSK (slide 14)

Current standards do not currently deliver the capability we require

This conclusion is backed up in the first half of the presentation with examples from GSK's internal standards and from CDISC's SDTM standards. These are low-level data standards specifying data structures and data elements (variables): standards for exchange of data in bulk (in containers such as the SDTM Vital Signs and Lab domains) or standards for exchange of captured data (in specified variables such as data modules for specific blood pressure and temperature measurements). Good examples in the presentation show the challenges in analysing and aggregating clinical data put into SDTM dataset variables as containers "lacking documented relationships between the variables".

Example from data represented in the proposed
SDTM standard for Parkinson's disease (slide 12)

Diagrams help us understand clinical processes and how this translates into datasets and variables

The presentation nicely illustrates, with a couple of diagrams or "mind maps", the value of drawing diagrams to understand the higher level of relationships: the clinical processes in which clinical data is captured for different diseases.

Example of a map of the clinical process (slide 15)
It also illustrates the value of drawing diagrams to understand the mid level of relationships, in terms of "concepts" *) and "concept variables", and how these should be put into the SDTM variables (in red). (The example below is unfortunately not the same as the Parkinson's disease examples above.)

Example of a map for the concept of Temperature measurement (slide 28)


Mind maps just begging for RDF triples and formal models

When I see these mind maps I see graphs just begging for RDF triples (subject, predicate, object), that is, the fundamental semantic web standard. See my two earlier blog posts from two presentations at CDISC Interchange Europe: Semantic models for CDISC standards and metadata and Linking Clinical Data Standards.
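
As a small, concrete illustration, here is a minimal sketch in Python (using the rdflib library) of how a single arrow in such a mind map could be written down as triples. The node and edge names are invented for illustration only; they are not taken from the GSK slides.

# A minimal sketch: one mind-map arrow expressed as RDF triples.
# The namespace, node and edge names are hypothetical, not from the GSK slides.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/mindmap/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# "Tremor assessment" --assesses--> "Parkinson's disease"
g.add((EX.TremorAssessment, RDF.type, EX.ClinicalAssessment))
g.add((EX.TremorAssessment, RDFS.label, Literal("Tremor assessment")))
g.add((EX.TremorAssessment, EX.assesses, EX.ParkinsonsDisease))
g.add((EX.ParkinsonsDisease, RDFS.label, Literal("Parkinson's disease")))

print(g.serialize(format="turtle"))

The point is simply that each labelled arrow in such a diagram is already a subject-predicate-object statement waiting to be made explicit.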

An interesting exercise would be to have the Parkinson's disease example completed in the concept mapping tool (CMAP) the whole way down to SDTM, and to export the mind maps as RDF triples. However, this is nice, but not enough ...

When I see these mind maps I can also see how easy it is to start drawing such diagrams and exporting them as representations of generic mind maps. However, to fulfil the ultimate goal of having them "captured in a way that these can be used both for human understanding and for computer interpretation", the "mind maps" need underlying formal models of the clinical and biomedical reality.
 
Therefore, I see an interesting connection between the high level maps for disease and clinical processes and the Ontology for General Medical Science (OGMS). OGMS is an ontology of entities involved in a clinical encounter and provides a formal theory of disease that has been further elaborated by specific disease ontologies. See my blog post from last year on Disease terminologies and ontologies.


*) The CDISC SHARE project talks about scientific, or research, concepts. They have also been called observation concepts. However, the word "concept" is overused and carries challenges in itself, see From concept to clinical reality.

Kudos to Frederik Malfait, consulting at Roche and my co-presenter on Semantic models for CDISC data standards and metadata, for pointing me to this presentation.

Wednesday, June 20, 2012

To Whom It May Concern

A nice tweet from Phil Archer (@philarcher1) this morning reminded me of a "triple tweet" I posted earlier this year on the topic of creating data and metadata To Whom It May Concern ('a formal salutation used for opening a letter to an unknown recipient', source: Wikipedia).

So, here's Phil's tweet quoting Sharon Dawes (@ssdawes):


And here is my "triple tweet" on what I see as one of the core values of Linked Data:




I posted them after I had the pleasure of meeting David Wood (@prototypo) and Bernadette Hyland (@BernHyland) F2F at a Linked Data and URI workshop in Boston in late January:

Sunday, May 27, 2012

AstraZeneca re-joins W3C HCLS

After a warm and sunny day of kayaking out in the archipelago north of Gothenburg it was nice to catch up on Twitter and see the official announcement from W3C that my employer, AstraZeneca, has joined W3C. It's actually a re-join, as we joined W3C in 2006 to participate in the Semantic Web Health Care and Life Sciences Interest Group (HCLS IG).


Update 6 June: I recommend this nice slide deck for an overview of Semantic Web and Related Work at W3C, presented by Ivan Herman (@ivan_herman) at the 2012 Semantic Tech & Business Conference in San Francisco, CA, USA, 5 June.

I attended and reported back from the W3C conference in Edinburgh in May 2006 (WWW2006) and from the next one in Banff in May 2007 (WWW2007) together with my former colleague Bosse Andersson (@bbalsa). My focus was on applying semantic web standards to clinical data, and in 2007 Eric Neumann (@ericneumann), one of the HCLS pioneers, and I published a W3C Note in the Drug Safety and Efficacy task force on CDISC's Study Data Tabulation Model (SDTM). And together with most of the members of the HCLS group I co-authored an important article in BMC Bioinformatics, Advancing translational research with the Semantic Web.

In late 2007 I had to focus on other tasks while Bosse and colleagues in the US, Julia Kozlovsky, Elgar Pichler and Otto Ritter, continued the interactions with other parties across life science and health care in two of the HCLS groups: Linking Open Drug Data (LODD) and Translational Medicine Ontology (TMO).

In early 2010, when I returned my job focus to semantic interoperability, AstraZeneca had decided not to renew the W3C membership. To stay updated I started to use social media as a way to engage with the semantic web and linked data community, to follow thought leaders in the intersection between eHealth and Clinical Research, and to share news and insights with colleagues.

In early 2011 Bosse and I wrote a short paper to summarise insights from AstraZeneca's engagement in W3C HCLS and in the EU project Large Knowledge Collider (LarKC). The paper, Linked Data, an opportunity to mitigate complexity in pharmaceutical research and development, starts with a look back at one of the most inspiring meetings I have attended:
During the WWW2007 conference a breakthrough of the Linked Data idea happened in a session where web experts demonstrated the power of a new generation of the web, a web of data. For us attending the session it was hard to imagine the full potential on what this idea would mean for individual scientists and for a pharmaceutical company.
As described in my earlier blog post, we now have a new program in AstraZeneca called Integrative Informatics (i2) establishing the components required to let a linked data cloud grow across R&D. Re-joining W3C and re-connecting with HCLS is one step in this.




Sunday, May 6, 2012

Semantic models for CDISC based standard and metadata management

In mid April we did a presentation at the 2012 CDISC (Clinical Data Interchange Standards Consortium) Interchange Europe with the title: Semantic models for CDISC based standard and metadata management (see our slides and short paper). This time in a sunny, but chilly, Stockholm at a very nice hotel (Elite Marina Tower). Last year Frederik Malfait, consulting at Roche, and I, working for AstraZeneca, had two different presentations at the 2011 conference in Brussels. See my blog post: Linking Clinical Data Standards.

Since then we have seen more interest in semantic web standards in the CDISC community, see for example the article in Applied Clinical Trials Online (@Clin_Trials): Digital Data, the Semantic Web, and Research, by Wayne Kubick, the new CTO of CDISC. This year Frederik and I did a joint presentation with a key message to the CDISC organisation: "Put semantics into the semantics". That is, to start using semantic web standards and linked data principles for the whole suite of CDISC standards. See below our list of proposals.

In my introduction I described the current situation, where the question now is "Not when, but how" to best adopt CDISC standards. At the same time, the different CDISC standards are not linked to each other and are published in different formats, and so-called metadata registries (MDR) are requested for robust life cycle management of standards.

Real world use 

In my brief introduction (see slides 5-11) to the core semantic web standard, the so-called RDF triple, I showed an example of how Google uses RDF-based standards to improve search (see my previous blog post on schema.org). I also showed how NCI uses RDF to publish the NCI Thesaurus, see RDF/OWL download of NCIt via LexEVS, and how RDF is used for an early version of the domain model for biomedical research (BRIDG), see RDF/OWL representation of BRIDG/ISO21090. In both these cases the RDF is published as XML, but RDF triples can also be published in other serialisation formats (i.e. XML, JSON, Turtle, and N-Triples). I also showed the latest version of the Linked Open Data cloud, with even more linked datasets than the one Frederik and I had in our presentations last year. I then turned over to the main part of our presentation, describing two real-world examples of how two sponsors are now starting to use semantic web standards and linked data principles.
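
As a small aside on those serialisation formats: they are just different ways of writing down the same triples. Here is a minimal sketch with rdflib, using a hypothetical concept URI (not a real NCI Thesaurus entry):

# The same RDF content written out in two serialisations.
# The concept URI below is hypothetical, not a real NCI Thesaurus entry.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/thesaurus/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)
g.add((EX.ExampleConcept, SKOS.prefLabel, Literal("Example concept", lang="en")))

print(g.serialize(format="turtle"))  # compact, prefix-based
print(g.serialize(format="nt"))      # N-Triples: one full triple per line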

Linked Data cloud to grow across AstraZeneca R&D

Photo from CDISC Facebook
In AstraZeneca we have a new program called Integrative Informatics (i2) establishing the components required to let a linked data cloud grow across R&D. A key component is the URI policy for how to make, for example, a Clinical Study linkable by giving it a URI, that is a Uniform Resource Identifier, e.g. http://research.data.astrazeneca.com/id/clinicalstudy/D5890C00003. This is an identifier for a clinical study with the study code D5890C00003 that should be persistent and not dependent on any system. In the same way we will give guidance on how to use URI:s to make other key entities such as Investigator and Lab linkable. Standard data elements from CDISC, and internal ones to be managed in a future MDR, should also have URI:s to make them linkable. For more information on how URI:s are being used in, for example, the UK and US governments, see my URI design page.
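
To make the idea a bit more concrete, here is a sketch of a few triples about such a study URI. The study URI is the one from the text above, but the class and property names are invented for illustration; they are not the actual i2 model.

# A sketch only: the study URI is from the post, but the vocabulary (class and
# property names) is invented for illustration and is not the actual i2 model.
from rdflib import Graph, Namespace, URIRef, Literal, RDF

AZID = Namespace("http://research.data.astrazeneca.com/id/clinicalstudy/")
VOCAB = Namespace("http://example.org/vocab/")  # hypothetical vocabulary

study = AZID["D5890C00003"]

g = Graph()
g.bind("vocab", VOCAB)
g.add((study, RDF.type, VOCAB.ClinicalStudy))
g.add((study, VOCAB.studyCode, Literal("D5890C00003")))
g.add((study, VOCAB.hasInvestigator,
       URIRef("http://example.org/id/investigator/42")))  # hypothetical URI

print(g.serialize(format="turtle"))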

A semantic web standard based MDR in Roche

Photo from CDISC Facebook
Frederik described the schema, content and architecture of the Roche Biomedical MDR. He then went through a demo using an RDF representation of a CDISC standard example and of an internal Roche standard (you will find the screenshots from the demo at the end of the slide deck). He first showed how the standards can be viewed using a general tool (TopBraid Composer from TopQuadrant, but it could be any other RDF tool, such as the common open source tool Protégé). On slides 20-28 you can see how the SDTM model v1.2, SDTM IG v3.1.2, and the SDTM CT:s are all linked together (for example Observation Class: Event - Domain: AE - Variable: AEOUT - Submission value: NOT RECOVERED/NOT RESOLVED). He then showed the same RDF representation via the application Roche Global Standard Data Browser (slides 29-37). Frederik also showed how the linked data standards can be exported in SAS and Excel formats (slides 42-50). And finally, he showed an example from a Roche standard questionnaire.
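
I do not have access to the schema Frederik used, so the sketch below is only a hypothetical illustration of how such pieces (observation class, domain, variable, controlled term) could hang together as triples; the class and property names are made up.

# Hypothetical sketch of SDTM pieces linked as RDF. The class and property names
# are invented for illustration and are NOT the actual Roche/CDISC SHARE schema.
from rdflib import Graph, Namespace, Literal, RDF

SDTM = Namespace("http://example.org/sdtm/")  # hypothetical namespace

g = Graph()
g.bind("sdtm", SDTM)

g.add((SDTM.AE, RDF.type, SDTM.Domain))
g.add((SDTM.AE, SDTM.observationClass, SDTM.Event))
g.add((SDTM.AEOUT, RDF.type, SDTM.Variable))
g.add((SDTM.AEOUT, SDTM.domain, SDTM.AE))
g.add((SDTM.AEOUT, SDTM.codelistTerm, SDTM.Term_NotRecoveredNotResolved))
g.add((SDTM.Term_NotRecoveredNotResolved, SDTM.submissionValue,
       Literal("NOT RECOVERED/NOT RESOLVED")))

print(g.serialize(format="turtle"))

Once the relationships are made explicit like this, they can be followed and queried instead of only being documented in prose.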

Proposals to CDISC

In the slides you can see that Frederik had to transform the CDISC standards into RDF using a schema he developed for Roche and give them URI:s in a Roche namespace (e.g. http://gdsr.roche.com/cdisc/sdtmig-3-1-2#Column.AE.AEOUT for one of the data elements). This is not an ideal way; instead we would like CDISC to provide these. Hence the drive from our leadership in Roche and AstraZeneca for Frederik and me to push back to CDISC.

Below is a draft list of proposals to CDISC:
  • Decide on a URI design for CDISC standards (e.g. http://id.cdisc.org/sdtm).
  • Review the schema Frederik has proposed for the core MDR in CDISC SHARE. 
  • Publish the new SDTM v1.3 and SDTM IG v3.1.3 as RDF in XML, JSON, Turtle, and N-Triples formats using the reviewed schema and URI design (as options to the current publication formats, i.e. PDF, HTML, CSV, XML/ODM); see the sketch after this list.
  • Work together with NCI on enhancing the RDF/OWL version of NCI Thesaurus. Also review the option to use the RDF/SKOS standard and apply linked data principles. Publish coming versions of CDISC CT:s as RDF in XML, JSON, Turtle, and N-Triples. 
  • Work together with NCI on enhancing the RDF/OWL representation of BRIDG/ISO21090 model and apply linked data principles to make all BRIDG classes, properties and ISO21090 data types linkable.
  • Extend the MDR schema for CDISC SHARE for linkage to relevant BRIDG classes and properties and to ISO21090 data types.
  • Start exploring semantic web standards and linked data principles also for clinical data, including making individual clinical data points linkable using URI:s and annotating them using existing and emerging clinical standard terminologies and ontologies.
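
To make the publication proposal a bit more tangible, here is a minimal, entirely hypothetical sketch of what one SDTM IG data element might look like when published under the suggested URI design; neither the URIs nor the property names exist today.

# Hypothetical sketch of one SDTM IG data element published under the proposed
# URI design (http://id.cdisc.org/...). Neither the URIs nor the properties are
# real today; they only illustrate the proposal.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

CDISC = Namespace("http://id.cdisc.org/sdtmig/3-1-3/")  # proposed, not real

g = Graph()
g.bind("sdtmig", CDISC)

g.add((CDISC.AEOUT, RDF.type, CDISC.DataElement))
g.add((CDISC.AEOUT, RDFS.label, Literal("Outcome of Adverse Event")))
g.add((CDISC.AEOUT, CDISC.domain, CDISC.AE))

# The same graph can then be published in several serialisations:
for fmt in ("turtle", "nt", "xml"):  # JSON-LD is also possible with newer rdflib
    print(g.serialize(format=fmt))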

Monday, April 9, 2012

Describe things vs Improve markup of pages that describe things

Easter Monday is a public holiday in Sweden and it's been a rainy and cold day -- so, it's time to write a new blog post. It's triggered by a nice blog post published just before the weekend by Phil Archer (@philarcher1) with the interesting title: Danbri has moved on – should we follow? In his blog post Phil reflects on a presentation Dan Brickley (@danbri) did the week before at a Linked Data meetup in London.

Phil focuses on Dan's point about the best practice so far in the semantic web community: "look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use."

And Phil wonders if it's now time to move on and "embrace schema.org as the vocabulary to use wherever possible? It won't cover everything, but it might cover the 50% of classes and properties that dominate any domain of interest." In his presentation, Schema.org and One Hundred Years of Search, Dan also argues that search terms have barely changed in style for 100 years and more.

For more info about the joint vocabulary from Google, Bing (Microsoft) and Yahoo called schema.org, see my remote report from the SemTech 2011 conference.


Improve markup of pages that describe things

When listening to the video with Dan I found this statement in his slides (on slide 33) very interesting, describing the scope of the schema.org vocabulary as "In-page structured data for search":
"Not asking an unconstrained 'so, how do we describe cars?', but “how can we improve markup on existing pages that describe cars?” (or Comics, SoftwareApps, Sports, ...)".
I always like it when someone clearly states what is not included -- what's not intended. So, this is a helpful statement for me. And it will be interesting to follow how Schema.org will be extended and refined for domains such as Medicine/Health, see the list of Schema.org proposals maintained by W3C Web Schemas.

At the same time, a lot of the semantics I look for in my daily work is more about "how to describe cars?". Well, not cars really -- it's about other kinds of 'things' and their parts, relationships and impacts on each other. It's about "how to describe 'things' in small portions of the biological, chemical, clinical and health economic reality studied in clinical research and documented in health care". And also about "how to organise data about these 'things' not only to improve search but also to improve how data about these entities can be combined and queried in new ways."

Describe things
This is also the driver for me to learn more about how to "capture, in a logical, systematic way, what scientists regard as the basic truths about a topic. Like equations in physics or axioms in mathematics, they can even be the basis for computational models." (from More than Words). See also several of my earlier blog posts on this approach, for example my post on Disease terminologies and ontologies.

In a future blog post I hope to learn more about how this approach has been applied in the Chemical Information Ontology (CHEMINF) to describe Chemical Entities of Biological Interest (ChEBI). This is nicely explained in one of my favorite papers, collected in my (kerfors) CiteULike library: The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web by Janna Hastings, Leonid Chepelev, Egon Willighagen, Nico Adams, Christoph Steinbeck and Michel Dumontier.

Example from Chemical Entities of Biological Interest (ChEBI):
the entity Hemoglobin in HTML view via the ontology browser Ontobee

Sunday, January 8, 2012

"To-day we have naming of parts"

I like to explore the associations people share with me when I describe something new to them. So, I was pleased the other day when a colleague in the UK shared the phrase that stuck in her head when I described the idea of URI:s (Uniform Resource Identifiers) and Linked Data: "To-day we have naming of parts".

So, after some googling I found the poem by Henry Reed (1914-1986), "Naming of Parts." New Statesman and Nation 24, no. 598 (8 August 1942).

To-day we have naming of parts. Yesterday,
We had daily cleaning. And to-morrow morning,
We shall have what to do after firing. But to-day,
To-day we have naming of parts. Japonica
Glistens like coral in all of the neighboring gardens,
     And to-day we have naming of parts.

Hear Henry Reed and Frank Duncan read "Naming of parts" (mp3)


Through a very nice website, The Poetry of Henry Reed, I learned more about this World War II British poet, critic, translator, and radio dramatist. It helped me to better understand this wonderful, and sad, poem about the contrast between the world of weapons and the world of nature.

Naming parts and other things
I also learned about an article (DOI:10.1038/nbt0102-27) in Nature Biotechnology (2002) that uses the first line of Henry Reed's famous poem as its title. In the article a professor of genomics at the University of Manchester describes the identification of previously non-annotated genes in yeast.

And I also found a blog post from 2009 that also uses the first line of Henry Reed's poem in its title:
Naming of parts and other things. That is, David Bawden's (@David_Bawden) post on his nice blog: "The Occasional Informationist, irregular thoughts on the information sciences". In this post he describes a meeting with John Wilbanks (@wilbanks) at the British Library:
In his presentation of the need for annotation of digital reporting of scientific findings, Wilbanks commented simply that we need to call the same thing by the same name; this makes possible the semantic linking of information and data, the creation of ontologies, and so on, without which it will not be possible to share information across disciplinary and sub-disciplinary silos. 
He exemplified this with examples both simple – the various names for coffee in different languages – and complex – the variant terminology used in hundreds of datasets relating to polar climate change, and in over a thousand related to genomics.
There was another aspect to this point. What we call an information object in the digital world – DOIs and all the rest – is also fundamental; if we do not call these digital objects the same thing, we will have great difficulty in finding them.

Names of today
So, let me conclude this post with a couple of examples of naming parts and other things using names of today, that is, http-based URI:s. The three example URI:s are also three examples of large efforts to publish linked data "about the named things":

  1. British Library's URI for the poet Henry Reed
    http://bnb.data.bl.uk/id/person/ReedHenry1914-1986
  2. Wikipedia's, i.e. DBpedia's, URI for the poet Henry Reed
    http://dbpedia.org/resource/Henry_Reed_%28poet%29
  3. The DOI for the article about identifying genes in yeast, turned into a URI by CrossRef
    http://dx.doi.org/10.1038/nbt0102-27

1. The British Library publishes metadata about bibliographic resources ("things") using Linked Data techniques and technologies. Part of that is to assign http-based URI:s to the creators. For a great introduction to the underlying model see the blog post: British Library Data Model: Overview by Tim Hodson (@timhodson).

So, for example, the data model specifies that persons who are the identified creators of bibliographic resources, such as the poet Henry Reed (http://bnb.data.bl.uk/id/person/ReedHenry1914-1986), should be of the types Agent and Person according to the basic, and very often used, vocabulary for linked data called Friend of a Friend (FOAF).
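
In triple form, that typing boils down to a couple of simple statements. A minimal sketch (the person URI is the British Library one mentioned above; the foaf:name statement is added by me for illustration):

# A minimal sketch of the FOAF typing described above. The person URI is the
# British Library one from the post; the foaf:name triple is added for illustration.
from rdflib import Graph, URIRef, Literal, RDF
from rdflib.namespace import FOAF

reed = URIRef("http://bnb.data.bl.uk/id/person/ReedHenry1914-1986")

g = Graph()
g.bind("foaf", FOAF)
g.add((reed, RDF.type, FOAF.Agent))
g.add((reed, RDF.type, FOAF.Person))
g.add((reed, FOAF.name, Literal("Henry Reed")))

print(g.serialize(format="turtle"))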


2. A large part of the structured content published on Wikipedia pages is also made available as linked data, called DBpedia. See this great article: How DBpedia Treats Wikipedia as a Database. The so-called resources ("things") that the Wikipedia pages describe are given http-based URI:s in DBpedia, and each resource is typed using a thin model called the DBpedia ontology.

So, here we can see that the poet Henry Reed is also identified in DBpedia (http://dbpedia.org/resource/Henry_Reed_%28poet%29) and described with the structured data from the Wikipedia page about him, such as his birth date and death date, and also the fact that he is categorized using the concept 'English poets'. This concept also has a URI: http://dbpedia.org/resource/Category:English_poets. So, we may have more than one URI for the same Henry Reed. These can be related to each other using the sameAs statement.
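
Here is a sketch of what such a sameAs statement could look like, linking the two URIs for Henry Reed mentioned above (as noted just below, no such link is actually published yet):

# A sketch of an owl:sameAs link between the two Henry Reed URIs from the post.
# As noted in the text, no such link is published by the British Library yet.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

bl_reed = URIRef("http://bnb.data.bl.uk/id/person/ReedHenry1914-1986")
dbpedia_reed = URIRef("http://dbpedia.org/resource/Henry_Reed_%28poet%29")

g = Graph()
g.bind("owl", OWL)
g.add((bl_reed, OWL.sameAs, dbpedia_reed))

print(g.serialize(format="turtle"))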


This is not yet done by the British Library, but I assume it will be done later as, for example, the Swedish Library catalogue relates its URI:s to DBpedia's.

Here is another URI, http://dbpedia.org/resource/Category:Firearm_components, for a categorization concept, and in the DBpedia interface you can see a list of such resources ("things") and links to them using URI:s such as http://dbpedia.org/resource/Sling_%28firearms%29.


3. CrossRef has made metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are used in publishing to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So, here is the identifier of the article about identifying genes in yeast: http://dx.doi.org/10.1038/nbt0102-27.

Kudos to my colleague for the opportunity to learn more about this wonderful poem and for a great discussion.
To ReedingLessons, the signature behind the great website about Henry Reed.
To @David_Bawden for his nice blog The Occasional Informationist.
And, finally, to @wilbanks, a great source of inspiration.