Thursday, December 15, 2016

Wikipedia pages for clinical trials

What would add value on Wikipedia pages for clinical trials? How could they support, for example, recruiting and retaining patients? How could the underlying structured data in Wikidata be useful?
I had hoped to explore these questions during an internal innovation day. However, I got the chance to look into another interesting thing instead: Jupyter Notebooks, something I have been eager to do since last summer.

Anyhow, I hope to get involved in this interesting work during 2017. I see great opportunities both to contribute to it and to benefit from it in the work I do internally on a master list for clinical studies. I have started by engaging in two issues: Normalize study_phase values (see an issue on the OpenTrials GitHub) and a discussion about study identifiers (OpenTrials issue on GitHub). Below is some background on having clinical trials on Wikipedia and in its data backbone, called Wikidata, and also a note about OpenTrials.
Today (November 2016) a few studies have Wikipedia pages, e.g. Lilly's PARAMOUNT study. 16 studies, including the Lilly study, are typed as Clinical Trial (Q30612) in Wikidata (SPARQL query). The Wikidata identifier for the Lilly study is Q17148583 and its URI is

Wikidata is the backbone of Wikipedia, where the entities behind Wikipedia pages, such as compounds, are registered. For example, the Wikidata entity rosuvastatin (Q415159) holds some of the core structured data behind the infobox to the right on the Wikipedia page for Rosuvastatin.
Wikidata entity and Wikipedia page for Rosuvastatin (also known as Crestor)

Examples of structured data for Rosuvastatin in Wikidata: it is classified as a pharmaceutical drug (Q12140), it is a subclass of statin (Q954845), and its ATC code is given by a property (P267). The structured data in Wikidata can be queried using SPARQL, a query language for data structured as so-called RDF. A live SPARQL query gets the ATC codes for statins.
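
A minimal sketch of what such a query could look like, built as a string in Python. This is my own reconstruction from the identifiers mentioned above (Q954845 and P267), not the original live query. It assumes the standard Wikidata prefixes (wd:, wdt:) and the "subclass of" property P279; the function names are hypothetical. Nothing is sent over the network here, the code only builds the query and a request URL for the public Wikidata SPARQL endpoint.

```python
import urllib.parse

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_statin_atc_query(class_qid="Q954845", atc_property="P267"):
    """Return a SPARQL query for items that are a subclass of the given
    class (statin by default), together with their ATC codes."""
    return (
        "SELECT ?drug ?drugLabel ?atc WHERE {\n"
        f"  ?drug wdt:P279 wd:{class_qid} .\n"      # subclass of statin
        f"  ?drug wdt:{atc_property} ?atc .\n"      # ATC code property
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def build_request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"


query = build_statin_atc_query()
url = build_request_url(query)
print(query)
```

The resulting URL can be fetched with any HTTP client, or the query text can be pasted into the Wikidata query editor.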
Live link to the SPARQL example via:

The data for drugs and chemical compounds are sourced from DrugBank using a so-called bot (see A simple way to write Wikidata bots [blog post] and Drug and Chemical compound items in Wikidata as a data source for Wikipedia infoboxes [video]).

There are plans to integrate OpenTrials info into Wikidata. That is, all studies registered in, available via OpenTrials Explorer. For more info about OpenTrials see my recent blog post.

A first part of that work is to develop a data model in Wikidata for trials, using the data elements for
One example is how the NCT number, the identifier of studies, has been defined as a Wikidata property (P3098), with statements about it such as its format as a regular expression: NCT(\d{8}). Other examples are properties describing clinical trials, such as study phases.
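
The format statement can be used directly to check identifiers. Here is a small Python sketch that applies the regular expression quoted above; the function names are my own and only illustrate the idea.

```python
import re

# Format constraint quoted above for the NCT number property (P3098).
NCT_PATTERN = re.compile(r"NCT(\d{8})")


def is_valid_nct(identifier):
    """True if the whole string is 'NCT' followed by exactly eight digits."""
    return NCT_PATTERN.fullmatch(identifier) is not None


def find_nct_numbers(text):
    """Return all NCT numbers mentioned in free text, in order of appearance."""
    return [match.group(0) for match in NCT_PATTERN.finditer(text)]


print(is_valid_nct("NCT01732822"))   # well-formed identifier
print(is_valid_nct("NCT1732822"))    # only seven digits
print(find_nct_numbers("Registered as NCT01732822."))
```

Note that fullmatch is used for validation, so extra characters before or after the identifier are rejected, while finditer picks identifiers out of longer text.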

Saturday, November 5, 2016


I have followed the development of OpenTrials (@opentrials) since Ben Goldacre's (@bengoldacre) first comments about the lack of an open infrastructure to improve the sharing of information about clinical trials. See my blog post from 2013 Talking to machines.

It was nice to be able to give some initial feedback on the human user interface earlier this year, and I am very happy to see the API for programmatic data access. In this blog post I ask for some clarifications about study URIs as a key enabler for linking information about studies.

Intro to OpenTrials

I couldn't make it to the recent Hack Day in Berlin, just before the beta version was launched at the World Health Summit. But it was great to follow the two events via the Twitter feed.

For a short intro to OpenTrials, watch this short video from the launch with Ben Goldacre.

Human user access and Programmatic data access

The user access to search the 300,000+ trials is via
For programmatic access via APIs, I find the blog post from the hack day excellent. It includes links to the API documentation (in Swagger), to a notebook showing sample code (Python) and to another example using R.

Code from the OpenTrials Hack Day in Berlin (photo by benmeg / CC BY)
I point colleagues in industry to this, and also to OpenFDA, as two great examples of access to data both for humans and for programs. We have lots to learn from these two open data initiatives, both when we define requirements and when we develop solutions.

I was also glad to see a comment from Ben Meghreblian (@benmeg), OpenTrials community manager, in an interview for the AllTrials initiative the other day: "While API access is very useful, the best way a registry can offer its entire database is as a regular download, similar to what the FDA does with its OpenFDA website."

Study URIs

In the same interview Ben also concluded:
One thing we (IMHO), both in open data initiatives and in industry, "need to spend a little on making sure the information is discoverable, machine readable, and impactful" is to establish persistent URIs as identifiers of studies. So, instead of a text string such as "D5135C00001" as a secondary/sponsor identifier in e.g. I am pushing for http-based study URIs such as:

A first step is an internal process to assign URIs to both old and new studies, and also an internal study look-up API service. This study look-up API provides basic study descriptions, such as study phase and acronym, and is presented on a study "home" page with the same http address as the URI. On this page we also provide other identifiers for the same study, e.g. the ClinicalTrials.gov NCT number "NCT01732822", and link to it using the URL.

I have argued for study URIs from but my understanding, from interactions with some of the people behind it, is that they see their URLs as pragmatic, persistent study URIs. So study URIs = study page URLs.

I would also like to include the identifier for the same study as represented in OpenTrials, either as a study URI distinct from the study page URL, or deliberately using the same http scheme for both. I think the current ones may be locating study pages (URLs) rather than identifying studies (URIs), for example:

It would be great to have some clarifications about this. What I would like to have are namespaces for study identifiers (e.g. azct, nct, and opentrials) so that I can make assertions like these about the same study:

<azct:D5135C00001> <owl:sameAs> <nct:NCT01732822>
<azct:D5135C00001> <owl:sameAs> <opentrials:9b48fd6a-2c6c-4455-bcc2-b1aff574298e> 

<azct:D5135C00001> <azct:hasAcronym> "EUCLID"

I have also posted this as an issue (#552) on the OpenTrials GitHub.
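
To show why such sameAs assertions are useful, here is a minimal sketch in plain Python (no RDF library) that stores the three statements above and uses them to look up the acronym from any of the identifiers. The namespace base URIs on example.org are placeholders I made up, since the real ones are exactly what I am asking for; only the owl namespace is the real one. The helper names are hypothetical.

```python
# Placeholder namespace base URIs (assumptions); only "owl" is the real namespace.
NAMESPACES = {
    "azct": "http://example.org/azct/",
    "nct": "http://example.org/nct/",
    "opentrials": "http://example.org/opentrials/",
    "owl": "http://www.w3.org/2002/07/owl#",
}

# The assertions from this post as (subject, predicate, object) tuples.
TRIPLES = [
    ("azct:D5135C00001", "owl:sameAs", "nct:NCT01732822"),
    ("azct:D5135C00001", "owl:sameAs",
     "opentrials:9b48fd6a-2c6c-4455-bcc2-b1aff574298e"),
    ("azct:D5135C00001", "azct:hasAcronym", "EUCLID"),
]


def expand(prefixed_name):
    """Expand a prefixed name such as 'nct:NCT01732822' to a full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return NAMESPACES[prefix] + local_name


def same_as_group(identifier, triples):
    """Collect every identifier linked to `identifier` through owl:sameAs,
    treating the relation as symmetric and transitive."""
    group = {identifier}
    changed = True
    while changed:
        changed = False
        for subject, predicate, obj in triples:
            if predicate != "owl:sameAs":
                continue
            # Exactly one side is already known: pull in the other side.
            if (subject in group) != (obj in group):
                group.update((subject, obj))
                changed = True
    return group


def acronyms_for(identifier, triples):
    """Look up acronyms stated for any identifier of the same study."""
    group = same_as_group(identifier, triples)
    return {o for s, p, o in triples if p == "azct:hasAcronym" and s in group}


print(acronyms_for("nct:NCT01732822", TRIPLES))
print(expand("owl:sameAs"))
```

In a real setting an RDF store with OWL reasoning would do this work; the loop above only mimics the sameAs inference for a handful of statements.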

Sunday, May 22, 2016

Awesome graphic as Graphs

The classic continuum from Data via Information to Knowledge is nicely visualized in a three part graphic. I've seen it shared many times the last couple of years on Twitter and LinkedIn. Today I saw it extended with Insight and Wisdom. It made it even more awesome.

Original graphic by Hugh MacLeod @hughcards
extended by David Sommerville @smrvl  

It was my friend and former colleague Martin Börjesson @futuramb who retweeted a tweet from John Hagel @jhagel, management consultant and author. It took me to the creator of the original graphic, Hugh MacLeod @hughcards, cartoonist and co-founder of @gapingvoid. The extension was done by David Sommerville @smrvl, Digital Design Director for @TheAtlantic.

So, I started to think about representing the five pieces as executable and queryable graphs:

  • 1 DataPoint class
  • 21 DataPoints
  • 2 InfoClasses (represented by the green and lilac labels) 
  • 21 Classifications 
  • 1 type of Relationship
  • 18 Relationships
  • 1 new InfoClass (yellow) 
  • 2 new Classifications
  • 1 Relationship Query

RDF triples, RDF Schema and SPARQL would be one option.

Neo4j Property Graph and Cypher, another option.
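
Before picking RDF or Neo4j, the idea can be sketched with plain Python tuples. This is only a toy version: it uses three data points instead of 21, the names (dp1, classifiedAs, relatedTo and so on) are made up, and the mapping of the five pieces to classifications, relationships and a path query is my own reading of the graphic.

```python
from collections import defaultdict

# Statements as (subject, predicate, object). All names are illustrative.
triples = [
    # Data: the data points themselves.
    ("dp1", "type", "DataPoint"),
    ("dp2", "type", "DataPoint"),
    ("dp3", "type", "DataPoint"),
    # Information: data points classified into InfoClasses.
    ("dp1", "classifiedAs", "Green"),
    ("dp2", "classifiedAs", "Lilac"),
    ("dp3", "classifiedAs", "Green"),
    # Knowledge: relationships between data points.
    ("dp1", "relatedTo", "dp2"),
    ("dp2", "relatedTo", "dp3"),
    # Insight: a new InfoClass (yellow) for two of the data points.
    ("dp1", "classifiedAs", "Yellow"),
    ("dp3", "classifiedAs", "Yellow"),
]


def members_of(info_class):
    """All data points classified into the given InfoClass."""
    return {s for s, p, o in triples if p == "classifiedAs" and o == info_class}


def find_path(start, goal):
    """Wisdom: breadth-first search along relatedTo edges, in either direction."""
    neighbours = defaultdict(set)
    for s, p, o in triples:
        if p == "relatedTo":
            neighbours[s].add(o)
            neighbours[o].add(s)
    queue, seen = [[start]], {start}
    while queue:
        path = queue.pop(0)
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours[path[-1]] - seen):
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


print(sorted(members_of("Yellow")))
print(find_path("dp1", "dp3"))
```

In RDF the tuples would become triples queried with a SPARQL property path, and in Neo4j they would become nodes and relationships queried with a Cypher path pattern.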

Well, we'll see if I can find the time to do it, or convince some graph and linked data friends to have a go at it :-)

Thursday, May 19, 2016

Global, persistent and resolvable identifiers for clinical data

Yesterday two thought leaders in clinical data standards published great blog posts: Dave Iberson-Hurst (@Assero_UK) and Armando Oliva (@nomini). Dave's post has the title Wear Sunscreen, but it's really about "CDISC 2.0". Armando's post has the title Improving the Study Data Tabulation Model.

Discussion threads on Twitter and LinkedIn today made me write this post about one of the many great proposals in the two blog posts: 1. SDTM should incorporate unique identifiers for each record in each domain.

In today's clinical data standards for 2-dimensional/tabular data exchange, e.g. CDISC SDTM, keys are either natural keys, e.g. STUDYID, USUBJID and LBTESTCD in a dataset of lab data according to SDTM, or surrogate keys, e.g. LBSEQ. A define.xml file should be the source for the study-specific Key Variables of each dataset. For more details about SDTM keys and their challenges, see Duplicate records - it may be a good time to contact your data management team, PharmaSUG 2016, Sergiy Sirichenko and Max Kanevsky (@pinnacle_21).
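
A small Python sketch of the duplicate-records problem with natural keys. The lab records are made up for illustration, and in practice the key variables would be read from define.xml rather than hard-coded as they are here; the function name is hypothetical.

```python
from collections import Counter

# Natural key variables for the lab dataset (hard-coded here as an assumption).
LB_KEY_VARIABLES = ("STUDYID", "USUBJID", "LBTESTCD")

# Made-up lab records for illustration only.
lb_records = [
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "ALT", "LBSEQ": 1, "LBORRES": "31"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "ALT", "LBSEQ": 2, "LBORRES": "35"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "AST", "LBSEQ": 3, "LBORRES": "28"},
]


def duplicate_keys(records, key_variables):
    """Return the key value combinations that occur in more than one record."""
    counts = Counter(tuple(record[k] for k in key_variables) for record in records)
    return [key for key, n in counts.items() if n > 1]


# The natural key does not separate the two ALT results for the same subject ...
print(duplicate_keys(lb_records, LB_KEY_VARIABLES))
# ... while the surrogate key LBSEQ does, within a subject.
print(duplicate_keys(lb_records, ("USUBJID", "LBSEQ")))
```

The surrogate key removes the duplicates, but it carries no meaning outside the dataset, which is one reason why the proposal for unique record identifiers is interesting.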

Armando details the proposal in his blog post, saying that the identifiers should be "globally unique".
This is a discussion I have looked forward to since I urged CDISC to consider semantic web standards and linked data principles in my presentation at the CDISC EU conference in 2011.

Linking Clinical Data Standards
My presentation at CDISC EU Interchange 2011
I now see how smart programmers and informaticians use checksums as record identifiers, as a practical way to get around this problem and to simplify the integration and review of clinical data.
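
A minimal sketch of the checksum idea in Python, assuming each record is a flat dictionary of variable names and values. The choice of SHA-256 over a sorted JSON serialization is mine, not something taken from the programmers mentioned above, and the record values are made up.

```python
import hashlib
import json


def record_checksum(record):
    """SHA-256 hex digest of a canonical JSON serialization of the record.

    Sorting the keys makes the checksum independent of variable order."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up lab record for illustration only.
record = {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "ALT", "LBORRES": "31"}
reordered = {"LBORRES": "31", "LBTESTCD": "ALT", "USUBJID": "S1-001", "STUDYID": "S1"}
changed = dict(record, LBORRES="35")

print(record_checksum(record) == record_checksum(reordered))  # same content, same identifier
print(record_checksum(record) == record_checksum(changed))    # changed value, new identifier
```

A checksum identifies the content rather than the record, so two records with identical content get the same identifier, and any correction to a value changes it.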

A phrase we often use when talking about linking data and semantic web standards is: "global, persistent and resolvable identifiers".

  • An http URI scheme makes it possible to resolve identifiers. An example of a URI that has a resolver service is the URI for the UK postcode SO160AS 1).
  • The URIs assigned to CDISC standard items, such as the one for the standard lab result variable in CDISC SDTM, do not (yet) resolve.

So what would a URI look like for a single data point in a clinical study? HL7 FHIR uses so-called UUIDs. Trusty URIs use hash values: "URIs that contain a certain kind of hash value that can be used to verify the respective resource".
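
The two approaches can be compared in a few lines of Python. This is a simplified sketch: the base URI on example.org is a placeholder, and the hash suffix only illustrates the principle of a verifiable, content-based URI. It does not follow the actual encoding rules of the Trusty URI specification.

```python
import base64
import hashlib
import uuid

BASE_URI = "http://example.org/datapoint/"  # placeholder namespace (assumption)


def uuid_urn():
    """A random (version 4) UUID written as a URN of the form urn:uuid:..."""
    return uuid.uuid4().urn


def hash_uri(content: bytes):
    """Append a URL-safe base64 SHA-256 digest of the content to the base URI."""
    digest = hashlib.sha256(content).digest()
    suffix = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return BASE_URI + suffix


def verify(uri, content: bytes):
    """True if the URI was derived from exactly this content."""
    return uri == hash_uri(content)


data_point = b"USUBJID=S1-001;LBTESTCD=ALT;LBORRES=31"  # made-up content
uri = hash_uri(data_point)

print(uuid_urn())                      # different on every call
print(verify(uri, data_point))         # content matches the URI
print(verify(uri, data_point + b"0"))  # content was altered
```

A UUID says nothing about the data it names, while a hash-based URI lets anyone holding the data check that it is the data the identifier was issued for.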

I am eager to learn more about the potential of using URIs in combination with blockchains. This presentation on using blockchain technology and semantic standards for provenance across the supply chain made me think ...

... about semantic blockchains in the clinical data supply chain, with identifiers assigned to each data point throughout the supply chain of clinical data: captured in EHRs and smartphones, fed into clinical trial records, aggregated into summary-level TLFs and later included in secondary-use analyses.


2) CDISC2RDF see

Friday, May 6, 2016

Twitter Feeds and Blog posts from Conferences

Conferences are a great way to meet interesting people and learn new things. It is always nicest when you can attend IRL, but it is also interesting to follow remotely via Twitter feeds, live blogging, and blog posts with reports and presentations.

Conference Live Blogging

When I can attend conferences IRL I like to take notes using Twitter, and I try to gather links and tweets using Storify as a kind of live blogging. Check out Storify/kerfors from events such as the recent Linked Data in Sweden, 2016 (ldsv2016) and the HL7 FHIR workshops at Vitalis, the eHealth conference (Vitalis2016).

Me in action live blogging

When I cannot attend I like to follow conferences from a distance and read people's blog reports.

This week I've been following the great #csvconf feed from "a data conference that's not literally about CSV file format but rather what CSV represents to our community: data interoperability, hackability, simplicity, etc." One of the most interesting Twitter feeds from conferences I've seen so far.
Many thanks to some of the people tweeting from the event: , @_inunddata, @EmilyGarfield (Emily also posted some very nice drawings from the event.)

Conference Reports as blog posts

The recent CDISC Europe conference in Vienna (#CDISCEurope) had a pretty thin feed, but with some great tweets from Magnus Wallberg (@CMWallberg), Technology Evangelist at WHO Uppsala Monitoring Center.
Magnus also wrote an excellent report as a blog post: A great mix of standards and great visions when CDISC met in Vienna

Update: Just after I published this blog post I saw a blog post by Wayne Kubick (@WayneKubick), CTO for HL7 and former CTO for CDISC, HL7’s FHIR and BioPharma, and his article in Applied Clinical Trials, Building on FHIR for Pharmaceutical Research, from an HL7 event I recently followed: the Partners in Interoperability workshop in Washington DC.

Conference Presentations accompanying blog posts 

I also very much like it when presenters quickly post their conference presentations on e.g. Slideshare. And it's also very nice to see accompanying blog posts with the speaker's notes and additional material. I very much liked Dave Iberson-Hurst's (@assero_UK) blog post with his CDISC Europe presentation this year. It is a post in his Semantic Web & Metadata series: CDISC Standards: Assessing the Impact of Change

I tried something similar when I wrote a blog post to prepare for my presentation "Linked Data efforts for data standards in biopharma and healthcare" at the Linked Data in Sweden, 2016 meeting a week ago: Linked Data in Sweden 2016

Thursday, April 21, 2016

Linked Data in Sweden 2016

It's time for the 5th "Linked Data in Sweden" event, Tuesday 26 April. Last year I organized the meeting in Gothenburg together with Fredrik Landqvist. This year we are back in Stockholm, this time at the Royal Armoury. I just learned that it is the oldest museum in Sweden. It was established by King Gustav II Adolph in 1628.

There are several interesting presentations on the agenda from e.g. Scania, Nobel Media, Wikimedia, Findwise and the National Library of Sweden. I will give a short update on Linked Data efforts for data standards in biopharma and healthcare. So, I have started to think about things I would like to cover and will tweet an item per day about things I find interesting. Below is the emerging list of links, with a video presentation per item. I don't have much spare time, so I will shape them into a couple of slides on the train up to Stockholm; see the slides at the end of this blog post.

Standards represented as Linked Data

The first items on my list are examples of when the authoritative sources of the content, in this case traditional standard organisations, publish linked data versions of their own content. This is very much what I was hoping for in my keynote at the Semantic Web Applications and Tools for Life Sciences (SWAT4LS) workshop in late 2013: Pushing back, standards and standard organizations in a Semantic Web enabled world.
  • CDISC in RDF
  • HL7 FHIR in RDF
  • MeSH in RDF
  • ICD-11 in OWL
  • Other standards, e.g. ATC, WHO Drug and MedDRA


In 2011 I presented Linking Clinical Data Standards at the CDISC (Clinical Data Interchange Standards Consortium) EU conference in Brussels. A year later, in Stockholm, Frederik Malfait (IMOS Consulting and consultant at Roche) and I together presented Semantic models for CDISC based standard and metadata management. At the 2nd Linked Data in Sweden meeting in 2013 I presented Länkade kliniska data standards (Linked clinical data standards).

The same spring the CTO of CDISC, Wayne Kubick, agreed to make this a task for the PhUSE organisation (Pharmaceutical Users Software Exchange). The PhUSE Semantic Technology project started later that year.

Overview of PhUSE Semantic Technology Project
by Frederik Malfait (21:16 - 37:00)

In the summer of 2015 CDISC published their standards in RDF. In the future, representation of CDISC standards in RDF will be one of the outputs of CDISC's metadata registry (SHARE).


The Fast Healthcare Interoperability Resources (FHIR, pronounced "fire") is a proposed standard describing data formats and elements (known as "resources") and an Application Programming Interface (API) for exchanging electronic health records. The standard was created by the Health Level Seven International (HL7) health-care standards organization. And it is hot! I recently attended a FHIR workshop organised by HL7 Sweden at the Swedish eHealth conference Vitalis (see my Storify Vitalis2016).

The HL7 FHIR project and the W3C Semantic Web Health Care and Life Sciences Interest Group both work on RDF representations of FHIR. The HL7 work, led by Grahame Grieve, one of the creators of FHIR, and the W3C HCLS group's work, led by David Booth, the initiator of the so-called Yosemite project, will be aligned.


The Medical Subject Headings (MeSH) is the National Library of Medicine's controlled vocabulary thesaurus. It is used to index biomedical journals. The rationale and design of MeSH in RDF are described in a good article: Desiderata for an authoritative Representation of MeSH in RDF

ICD-11 in OWL

The 11th revision of the International Classification of Diseases (ICD-11) is based on a content model encoded in OWL that takes it beyond the long list of terms in ICD-10. There is an excellent introduction by Mark Musen both to ICD-11 and to how the ontology tool called iCAT, based on WebProtege, has been used to represent ICD-11. Most editors, however, want to stick to Excel spreadsheets. This is a shared experience for all data standards mentioned here.

Other standards e.g. ATC, WHO Drug, MedDRA

There are several other standards I would like to see RDF/OWL versions of, to make our use of them in biopharma more robust. For example ATC (Anatomical Therapeutic Chemical Classification System), WHO Drug Dictionary and MedDRA (Medical Dictionary for Regulatory Activities). In early 2015 I was invited to WHO Uppsala Monitoring Center to talk about the value of this.

In the same way as it took CDISC almost 5 years to go from early ideas on using semantic web standards and linked data principles to actually applying them, I think it will take some more years before we have "Standardized the Standards", to quote David Booth, who leads the Yosemite project (see below).

New initiatives outside the traditional standard organisations

Here are a couple of interesting initiatives I also wanted to cover but will probably not have the time for.

See my Storify LDSV2016 with notes and links from the event.

And here are the slides for my presentation in the afternoon, which I put together on the train from Gothenburg to Stockholm this morning.

Wednesday, December 9, 2015

SWAT4LS 2015 Industry stream

It's been a great first day at SWAT4LS. I bought a few books in the lovely Cambridge University bookstore and had a nice conference dinner.

I'm now preparing for tomorrow's task to be the chair for the industry stream in SWAT4LS (see my previous blog post for more information about this event). So, here's a list of the 6 abstracts, companies and projects/tools that I'll introduce tomorrow morning:
  1. The BioHub Knowledge Base: Ontology and Repository for Sustainable Biosourcing. Text Mining/NLP research group within the School of Computer Science at the University of Manchester together with Unilever, BioHub Knowledge Base (BioHubKB)
  2. Customizing “General SPARQL” for visualisation of in-house data in Cytoscape. General Bioinformatics, General SPARQL
  3. GraphScope – smart data access for the life sciences. SearchHaus, GraphScope
  4. Semantic Technologies Make Sense for Life Sciences. SmartLogic
  5. Advancing Knowledge Discovery for Alzheimer’s Disease: The Alzforum Experience. Alzforum
  6. Everybody a Translational Data Scientist. Ontoforce, DISQOVER