Thursday, May 19, 2016

Global, persistent and resolvable identifiers for clinical data

Yesterday two thought leaders in clinical data standards, Dave Ibersen-Hurst (@Assero_UK) and Armando Oliva (@nomini), published great blog posts. Dave's post has the title Wear Sunscreen, but it is really about "CDISC 2.0". Armando's post has the title Improving the Study Data Tabulation Model.

Discussion threads on Twitter and LinkedIn today made me write this post about one of the many great proposals in the two blog posts: 1. SDTM should incorporate unique identifiers for each record in each domain.

In today's clinical data standards for 2-dimensional/tabular data exchange, e.g. CDISC SDTM, keys are either natural keys, e.g. STUDYID, USUBJID and LBTESTCD in a dataset of lab data according to SDTM, or surrogate keys, e.g. LBSEQ. A define.xml file should be the source for the study-specific Key Variables of each dataset. For more details about SDTM keys and their challenges, see Duplicate records - it may be a good time to contact your data management team, PharmaSUG 2016, Sergiy Sirichenko and Max Kanevsky (@pinnacle_21).
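To make the difference between the two kinds of keys concrete, here is a minimal sketch in Python. The records and their values are invented for illustration; only the variable names (STUDYID, USUBJID, LBTESTCD, LBSEQ) come from SDTM, and the helper duplicate_keys is hypothetical, not part of any CDISC tool.

```python
from collections import Counter

# Invented lab records; only the variable names follow SDTM LB.
records = [
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "GLUC", "LBORRES": "5.1", "LBSEQ": 1},
    {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "GLUC", "LBORRES": "5.4", "LBSEQ": 2},
    {"STUDYID": "S1", "USUBJID": "S1-002", "LBTESTCD": "GLUC", "LBORRES": "4.9", "LBSEQ": 1},
]


def duplicate_keys(rows, key_vars):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[v] for v in key_vars) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Natural key: built from variables that carry meaning.
natural_dups = duplicate_keys(records, ["STUDYID", "USUBJID", "LBTESTCD"])

# Surrogate key: relies on the sequence number within a subject.
surrogate_dups = duplicate_keys(records, ["STUDYID", "USUBJID", "LBSEQ"])

print("Duplicates on natural key:", natural_dups)
print("Duplicates on surrogate key:", surrogate_dups)
```

In this toy data the natural key does not separate the two glucose results for the first subject, while the surrogate key does. Neither key is unique outside the study, which is where the proposal for record identifiers comes in.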

Armando details the proposal in his blog post, where he says that the identifiers should be "globally unique".
This is a discussion I have looked forward to since I urged CDISC to consider semantic web standards and linked data principles in my presentation at the CDISC EU conference in 2011.

Linking Clinical Data Standards
My presentation at CDISC EU Interchange 2011
I now see how smart programmers and informaticians use checksums as record identifiers, as a practical way to get around this problem and to simplify the integration and review of clinical data.
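A checksum-based record identifier could look something like the sketch below. This is my own illustration, not a description of what any particular team does: the choice of SHA-256, the JSON serialization with sorted keys, and the function name record_checksum are all assumptions.

```python
import hashlib
import json


def record_checksum(record):
    """Return a SHA-256 hex digest of a record, independent of key order."""
    # Sorting the keys and fixing the separators gives the same byte
    # string for the same content, whatever order the variables came in.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Invented records for illustration only.
a = {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "GLUC", "LBORRES": "5.1"}
b = {"LBORRES": "5.1", "LBTESTCD": "GLUC", "USUBJID": "S1-001", "STUDYID": "S1"}
c = {"STUDYID": "S1", "USUBJID": "S1-001", "LBTESTCD": "GLUC", "LBORRES": "5.4"}

print(record_checksum(a) == record_checksum(b))  # same content, different order
print(record_checksum(a) == record_checksum(c))  # one value changed
```

An identifier built this way follows the content: any change to a value gives a new identifier. That is handy for spotting changes between data deliveries, but it also means that two truly identical records get the same identifier.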

A phrase we often use when talking about linked data and semantic web standards is: "global, persistent and resolvable identifiers".

  • An HTTP URI scheme makes it possible to resolve identifiers. An example of a URI that has a resolver service is http://data.ordnancesurvey.co.uk/id/postcodeunit/SO160AS, the URI for the UK postcode SO160AS 1).
  • The URIs assigned to CDISC standard items, such as http://rdf.cdisc.org/std/sdtmig-3-1-3#Column.LB.LBSTRES for the standard lab result variable in CDISC SDTM, do not (yet) resolve 2).

So what would a URI look like for a single data point in a clinical study? HL7 FHIR uses so-called UUIDs. Trusty URIs use hash values: "URIs that contain a certain kind of hash value that can be used to verify the respective resource" (http://trustyuri.net/).
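As a rough sketch of the two ideas, the Python below mints a UUID-based URI and a hash-suffixed URI for a single data point. The base URI http://example.org/study/ is a made-up namespace with no resolver behind it, and the function names are hypothetical. The hash-suffixed form only borrows the idea behind Trusty URIs; it does not follow the actual Trusty URI encoding.

```python
import hashlib
import uuid

BASE = "http://example.org/study/"  # hypothetical namespace, nothing resolves here


def random_uuid_urn():
    """A random (version 4) UUID written in URN form."""
    return uuid.uuid4().urn


def name_based_uri(studyid, usubjid, domain, seq):
    """Deterministic URI: the same key values always give the same version 5 UUID."""
    name = f"{BASE}{studyid}/{usubjid}/{domain}/{seq}"
    return BASE + "id/" + str(uuid.uuid5(uuid.NAMESPACE_URL, name))


def hash_suffixed_uri(local_name, content):
    """URI ending in a SHA-256 digest of the content (not the Trusty URI format)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{BASE}{local_name}.{digest}"


def verify(uri, content):
    """Re-hash the content and compare it with the digest at the end of the URI."""
    expected = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return uri.rsplit(".", 1)[-1] == expected


print(random_uuid_urn())
print(name_based_uri("S1", "S1-001", "LB", 1))

uri = hash_suffixed_uri("lb-result", "GLUC=5.1 mmol/L")
print(uri)
print(verify(uri, "GLUC=5.1 mmol/L"))
```

The random UUID says nothing about the data it names, the name-based one can be recreated from the keys, and the hash-suffixed one lets a receiver check that the content has not changed. None of them is resolvable until someone runs a service at the base URI.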

I am eager to learn more about the potential of using URIs in combination with blockchains. This presentation on using blockchain technology and semantic standards for provenance across the supply chain made me think ...



... about Semantic blockchains in the Clinical Data Supply Chain, with identifiers assigned to each data point throughout the supply chain of clinical data: captured in EHRs and smartphones, fed into clinical trial records, aggregated into summary-level TLFs and later included in secondary-use analyses.

Thoughts?

1) https://www.ordnancesurvey.co.uk/education-research/research/linked-data-web.html 
2) CDISC2RDF see https://github.com/phuse-org/rdf.cdisc.org