Linked Data for Enterprises: 2013

Sunday, November 17, 2013

De-identification and Informed Consent in Clinical Trials

Thursday evening I was following the great #PACCR feed on Twitter from a "Patients at Center of Clinical Research" discussion hosted by the Eli Lilly Clinical Open Innovation team. (Thank you Rahlyn Gossen, @RebarInter, for the pointer.)

A couple of interesting comments came up in some tweets on the topic of de-identification. As de-identification (sometimes called anonymization) is a key topic for clinical trial data transparency, I did find these quotes really interesting.

.@reginaholliday No one even asks if clinical trial participants want to be de-identified. Some people don't want to. #PACCR
— Rebar Interactive (@RebarInter) November 14, 2013
This was said in the meeting by Regina Holliday (@ReginaHolliday), a great tweeter promoting patients' rights within medicine.

Is it ethical to remove rare/genetic diseases within De-identified Data to protect against re-identification? http://t.co/cGRxe9Kzv5 #PACCR
— Daniel Barth-Jones (@dbarthjones) November 14, 2013

Daniel Barth-Jones (@dbarthjones), of Columbia University and an expert in Data Privacy and De-identification Policy, asked this in another tweet and referenced a very interesting blog post from Harvard Law School on Ethical Concerns, Conduct and Public Policy for Re-Identification and De-identification Practice.

"When re-identification risks are exaggerated, we need to recognize that the resulting fears cause needless harms. Such fears can push us toward diminishing our use of properly de-identified data, or distorting the accuracy of our statistical methods because we’ve engaged in ill-motivated de-identification and have altered data even in cases where there was not anything more than de minimis re-identification risks."

From the same blog post, part of the Online Symposium on the Law, Ethics & Science of Re-identification Demonstrations at the Bill of Health at Harvard Law School, in the fields of health law policy, biotechnology, and bioethics:

"We must achieve an ethical equipoise between potential privacy harms and the very real benefits that result from the advancement of science and healthcare improvements which are accomplished with de-identified data."

There were also a couple of interesting #PACCR tweets on the topic of Informed Consent quoting Sharon Terry (@sharonfterry), CEO of Genetic Alliance:

Informed consent information should be dynamic, granular, matrixed and contextual. @sharonfterry #PACCR #clinicaltrials
— Lilly Clinical OI (@Lilly_COI) November 14, 2013

I would like to learn more about this thinking and how it could potentially be realized by:

Structuring and formalizing the Informed Consent content to become a semantically rich, machine-executable contract/policy for transparency and accountability in using clinical trial data.
For more information see:

Permission Ontology for informed consent and HIPAA compliance (presentation in PDF) at the CTSA Ontology Workshop, February 2013

Information accountability policy whose restrictions are based on usage rules, not access or collection rules: testimony by Danny Weitzner (@djweitzner) at the Privacy and Civil Liberties Oversight Board Workshop Regarding Surveillance Programs Operated, July 2013
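
To make the idea of a granular, usage-based consent policy a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `ConsentRule` fields, the category and purpose names, and the matching logic are hypothetical and are not taken from the Permission Ontology or from any existing consent standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRule:
    """One granular permission (hypothetical structure, for illustration)."""
    data_category: str  # e.g. "genomic", "lab-results"
    purpose: str        # e.g. "secondary-research"
    identified: bool    # True if the participant allows identified use


def is_use_permitted(rules, data_category, purpose, identified):
    """Return True if any rule covers the requested use.

    A rule that allows identified use also covers de-identified use.
    """
    for rule in rules:
        if rule.data_category != data_category or rule.purpose != purpose:
            continue
        if identified and not rule.identified:
            continue
        return True
    return False


# Hypothetical participant: genomic data may be used with identity attached,
# lab results only in de-identified form.
participant_rules = [
    ConsentRule("genomic", "secondary-research", identified=True),
    ConsentRule("lab-results", "secondary-research", identified=False),
]

print(is_use_permitted(participant_rules, "genomic", "secondary-research", True))
print(is_use_permitted(participant_rules, "lab-results", "secondary-research", True))
```

The point of the sketch is that the restriction is evaluated per use (category, purpose, identified or not) rather than once at collection time, which is where a participant's wish not to be de-identified could be recorded.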

I do find all of this very interesting. And I hope such a "dynamic, granular, matrixed and contextual" approach can be part of new clinical trial data transparency policies:

"To find solutions that are 'good enough' and provide both dramatic privacy protections and useful analytic data" (from the same blog post).

Monday, October 7, 2013

The future of CDISC CT:s

A poll posted by Lex Jansen (@lexjansen) in the LinkedIn group for CDISC (Clinical Data Interchange Standards Consortium) prompted me to write down some thoughts on the future of CDISC's so-called Controlled Terminologies (CT:s):

When you import CDISC Controlled Terminology from NCI EVS at http://evs.nci.nih.gov/ftp1/CDISC, which format do you use?

(Excel, Text, ODM XML, or OWL/RDF)

My vote goes to the formats with the best potential for the future, that is, the formats serializing RDF-modeled data, e.g. Turtle, JSON-LD, N-Triples, and RDF/XML. (See the blog post: Understanding RDF serialisation formats.)
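
As a small illustration of what "serializing RDF-modeled data" means, the sketch below writes one and the same triple in two of these formats using plain Python strings. The subject URI is a made-up example.org address, not an actual CDISC or NCI EVS identifier, and the helper functions are my own; a real project would more likely use an RDF library.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical URI for a controlled term; not an actual CDISC or NCI identifier.
subject = "http://example.org/ct/sex/F"
predicate = SKOS + "prefLabel"
label = "Female"


def escape_literal(text):
    """Escape backslashes and double quotes for a quoted RDF literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(s, p, literal):
    """Serialize one triple with a plain literal object as an N-Triples line."""
    return f'<{s}> <{p}> "{escape_literal(literal)}" .'


def to_turtle(s, p, literal):
    """Serialize the same triple as Turtle, abbreviating the SKOS namespace."""
    if not p.startswith(SKOS):
        raise ValueError("this sketch only abbreviates SKOS predicates")
    local_name = p[len(SKOS):]
    prefix_line = f"@prefix skos: <{SKOS}> ."
    triple_line = f'<{s}> skos:{local_name} "{escape_literal(literal)}" .'
    return prefix_line + "\n\n" + triple_line


print(to_ntriples(subject, predicate, label))
print(to_turtle(subject, predicate, label))
```

Both outputs carry exactly the same statement; only the syntax differs, which is why the choice of serialization matters less than the choice to model the data as RDF.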

Today's RDF version

The recently published OWL/RDF version of the CT:s (serialized in XML) uses the first version of the CDISC2RDF schema 1), implementing the model behind the existing export of a limited part of the content in NCI Thesaurus (NCIt).

It is modeled to support today's use of the CT:s only as text strings to populate variables in CDISC-defined data sets (e.g. SDTM domains) with submission values. That is, it provides study-specific clarity, making it easy for humans to read the clinical data and metadata.

Next RDF version

Based on very useful discussions with the terminology expert Julie James (LinkedIn profile), working for HL7, IMI EHR4CR and the FDA/PhUSE Metadata definition project, these are my thoughts for the next RDF version:

To provide cross-study semantic interoperability, making it easy for machines to directly integrate and query clinical data and metadata across health care and clinical research, we need an enhanced model.

That is, a model that fully leverages the content in NCIt and addresses the issues people have experienced when attempting to implement the CT:s in BRIDG / ISO 21090, using the insights from the IMI EHR4CR project and from the development of the IHE DEX profile (Data Element Exchange).

I think there is also an opportunity to leverage the work on binding value sets to data elements that is part of the HL7 FHIR (Fast Healthcare Interoperability Resources) development 2). Julie also pointed me to a new ISO standard: ISO/CD 17583 3). The next version should also apply both the OID (Object Identifier) standard and the URI (Uniform Resource Identifier) standard to identify each value set and each value.
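
One way to carry both kinds of identifier is to express the OID as a `urn:oid:` URI next to an ordinary HTTP URI. The sketch below assumes exactly that; the OID value and the example.org URI are placeholders I made up, not identifiers assigned by CDISC, NCI or HL7, and the validation pattern is a simplified check rather than a full OID grammar.

```python
import re

# Simplified check: first arc 0-2, then dot-separated arcs without leading zeros.
OID_PATTERN = re.compile(r"^[0-2](\.(0|[1-9][0-9]*))+$")


def oid_to_urn(oid):
    """Return the urn:oid: form of a dotted OID string."""
    if not OID_PATTERN.match(oid):
        raise ValueError(f"not a well-formed OID: {oid!r}")
    return "urn:oid:" + oid


# Hypothetical value set identified in both ways (placeholder identifiers).
value_set = {
    "label": "Sex",
    "oid_urn": oid_to_urn("2.999.1.2"),
    "uri": "http://example.org/ct/valueset/sex",
}

print(value_set["oid_urn"])
print(value_set["uri"])
```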

References:
1) CDISC2RDF poster (presented at DILS 2013, Data Integration in Life Science conference) and FDA/PhUSE Semantic Technology project
2) http://www.hl7.org/implement/standards/fhir/terminologies.htm
3) ISO/CD 17583: Health informatics -- Terminology constraints for coded data elements expressed in (ISO 21090) Harmonized Data Types used in healthcare information interchange.

Friday, September 13, 2013

Justifications of Mappings

A common theme in the Semantic Trilogy events in Montreal this summer (see Semantic Trilogy preparations and Semantic Trilogy report part 1) was mappings such as the mappings provided via the NCBO BioPortal.

For example, the mappings in BioPortal expressed as skos:closeMatch are the result of using the LOOM lexical algorithm. Examples of not-so-good mappings, such as this one, were highlighted:

<skos:closeMatch>

<Int. Classification for Patient Safety: Chair (subclass to Piece of Furniture)>

One view was: ‘Don’t use them!’ (tweet). Another view was “Give us the justification of the mappings so we can decide when it makes sense to use them.”

Mappings in chemical informatics

When I came back from the Semantic Trilogy and read about mappings, or linksets as they are called, in the new version of the Open PHACTS specification "Dataset Descriptions for the Open Pharmacological Space", I saw some opportunities to make mappings more explicit and hence more useful.

I think the editor, Alasdair Gray (@gray_alasdair), and the whole team of authors, have done a great job on this specification.

"The Dataset Descriptions for the Open Pharmacological Space is a specification for the metadata to described datasets, and the linksets that relate them, to enable their use within the Open PHACTS discovery platform. The specification defines the metadata properties that are expected to describe datasets and linksets; detailing the creation and publication of the dataset."

I especially liked the part on making the justification of mappings explicit. For example, what is the justification behind stating that there is a close match (skos:closeMatch), or exact match (skos:exactMatch), between what is described in two different chemical datasets, such as the RDF datasets sourced from ChemSpider and ChEMBL?

The figure depicts four distinct linksets: two sourced from ChemSpider depicted in blue which use different link predicates; one sourced from ChEMBL depicted in red; and one sourced from a third party depicted in green.

My understanding is that for the chemical informatics community the Open PHACTS specification will establish a vocabulary to express the justifications for links/mappings between chemical entities. This enables them to explicitly state justifications such as "Has isotopically unspecified parent" or "Have the same InChI key" (see B.2 Link Justification Vocabulary Terms to also get the URIs for these terms).
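
A minimal sketch of why an explicit justification is useful: if each mapping carries a justification term, a consumer can decide which mappings to trust. The data structure and all URIs below are hypothetical example.org placeholders; the actual justification URIs are the ones listed in the specification, which I do not reproduce here.

```python
from dataclasses import dataclass

SKOS = "http://www.w3.org/2004/02/skos/core#"


@dataclass(frozen=True)
class Mapping:
    """A link between two entities plus the reason the link was asserted."""
    source: str
    predicate: str
    target: str
    justification: str


# Placeholder justification terms (not the URIs from the specification).
SAME_INCHI_KEY = "http://example.org/justification/same-inchi-key"
LEXICAL_MATCH = "http://example.org/justification/lexical-match"

mappings = [
    Mapping("http://example.org/datasetA/compound/1", SKOS + "exactMatch",
            "http://example.org/datasetB/molecule/9", SAME_INCHI_KEY),
    Mapping("http://example.org/datasetA/compound/2", SKOS + "closeMatch",
            "http://example.org/datasetB/molecule/7", LEXICAL_MATCH),
]


def select_by_justification(all_mappings, accepted):
    """Keep only the mappings whose justification the consumer accepts."""
    return [m for m in all_mappings if m.justification in accepted]


trusted = select_by_justification(mappings, {SAME_INCHI_KEY})
for m in trusted:
    print(m.source, m.predicate, m.target)
```

This is the "give us the justification so we can decide" view in practice: the lexical match is not thrown away, it is simply left out when the use case needs structural identity.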

Mappings between medical terminologies

Together with members of the EU projects EHR4CR and SALUS, MedDRA MSSO, and W3C HCLS, I am now exploring the idea of establishing a similar approach for the medical terminology community. That is, a vocabulary of terms to express the justifications for different mappings between concepts/terms in terminologies across healthcare and clinical research, such as ICD9, SNOMED CT and MedDRA.

This is part of a broader discussion on the use of terminologies in Semantic Web-focused environments, with formal representations in RDF of both the terminologies themselves and of the mappings between them. Here's an example of a visualization from such a formal representation of MedDRA and SNOMED-CT terms and mappings between them in SKOS/RDF.

The example shows the hierarchy of cardiac disorders in both the MedDRA and SNOMED-CT concept schemes, expressed using the skos:broader property. Mappings between similar concepts in both concept schemes are stated using the skos:exactMatch property.
From: SALUS Harmonized Ontology for Post Market Safety Studies
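
The structure in the visualization can be sketched with a handful of triples. The concept names below are readable placeholders that I made up; they are not actual MedDRA or SNOMED-CT codes, and the traversal assumes each concept has at most one broader concept, which real terminologies do not guarantee.

```python
BROADER = "skos:broader"
EXACT_MATCH = "skos:exactMatch"

# Placeholder concepts in two concept schemes (not real terminology codes).
triples = [
    ("meddra:AtrialFibrillation", BROADER, "meddra:CardiacArrhythmias"),
    ("meddra:CardiacArrhythmias", BROADER, "meddra:CardiacDisorders"),
    ("snomed:AtrialFibrillation", BROADER, "snomed:DisorderOfHeart"),
    ("meddra:AtrialFibrillation", EXACT_MATCH, "snomed:AtrialFibrillation"),
]


def broader_chain(concept, graph):
    """Follow skos:broader upwards, taking the first parent at each step."""
    chain = []
    seen = {concept}
    current = concept
    while True:
        parents = [o for s, p, o in graph if s == current and p == BROADER]
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


def exact_matches(concept, graph):
    """skos:exactMatch is symmetric, so look in both directions."""
    found = set()
    for s, p, o in graph:
        if p != EXACT_MATCH:
            continue
        if s == concept:
            found.add(o)
        elif o == concept:
            found.add(s)
    return found


print(broader_chain("meddra:AtrialFibrillation", triples))
print(exact_matches("snomed:AtrialFibrillation", triples))
```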

Monday, July 29, 2013

Semantic Trilogy report part 1

It's been two very nice summer weeks of vacation after I got home from a week at the Semantic Trilogy events in Montreal, Qc, Canada. (See my previous blog post: Semantic Trilogy preparations.) Here's the first part of my report from seven intensive days of conferences, tutorials, workshops and great discussions with researchers in biomedical ontologies and data integration in life sciences.

It was very nice to meet colleagues from other pharma companies (Sanofi, UCB and Novo Nordisk), and to talk with early adopters at traditional software vendors, such as Siemens, and with experts from niche vendors, such as IO Informatics. It was also nice to discuss common topics, such as the use of semantic web standards and linked data principles on, for example, clinicaltrials.gov, with key individuals such as Olivier Bodenreider, NLM (National Library of Medicine).

Notes
During the two main conferences I used Twitter as my notebook, and in the evenings I gathered tweets and related links in two Storify items:

ICBO2013
Storify: 4th International Conference on Biomedical Ontology (ICBO), 7-9 July
DILS2013
Storify: 9th Conference on Data Integration in the Life Sciences (DILS), 11-12 July

My poster
The last evening I presented a CDISC2RDF poster on our joint AstraZeneca and Roche CDISC2RDF project, now part of the FDA/PhUSE Semantic Technology working group. I really enjoyed the discussions it triggered.

My #dils2013 poster http://t.co/HWr1Yy9uQY Thx @jpmccu @micheldumontier @siv_arabandi pic.twitter.com/NR3lrWZT6R
— Kerstin Forsberg (@kerfors) July 12, 2013

I'll be back in mid August, after a couple of days of trekking in the Swedish mountains, with more details about the papers, presentations and discussions I found most interesting. (For a first glimpse of two of them see this blog post from HL7 Watch by Barry Smith: An OGMS-Based Model for Clinical Information.)

Monday, June 24, 2013

Semantic Trilogy preparation

The Swedish Midsummer weekend is over and it's time to look forward. Saturday 6th to Friday 12th of July I'll attend the Semantic Trilogy in Montreal, Qc, Canada.

I plan to attend these events during the week:

6 July, Semantic Trilogy Hackathon, "Making big data out of small contributions"
7 July, Tutorial, OBO Foundry 101: Collaborative ontology development, tool support and semantic web
8-9 July, 4th International Conference on Biomedical Ontology (ICBO 2013)
10 July, OGMS Workshop (Ontology for General Medical Science) and 4th Canadian Semantic Web Symposium (CSWS 2013)
11-12 July, 9th International Conference on Data Integration in the Life Sciences (DILS 2013). I'll present my poster on CDISC2RDF in the poster session.

In 2011 I, together with three colleagues, attended the ICBO 2011 event (see my three blog posts: Preparations part 1 and part 2, report). So, I look forward to reconnecting with people in the OBO (The Open Biological and Biomedical Ontologies) community.

I also look forward to meeting, face to face, interesting people in the W3C HCLS (Semantic Web Health Care and Life Sciences Interest Group), and people interested in ontologies and the semantic web working for e.g. Sanofi, Novo Nordisk and Mayo Clinic.

I'm also very happy that I'll get the opportunity to attend my third semantic web related event in Canada.

In 1999 I attended the W3C conference in Toronto, the WWW8 conference, where I was absolutely thrilled by the power of the simple and elegant model of RDF triples. At WWW8 I also heard Tim Berners-Lee talk about the Semantic Web for the first time.

In 2007 I attended the WWW2007 conference in wonderful Banff.

"During the WWW2007 conference a breakthrough of the Linked Data idea happened in a session where web experts demonstrated the power of a new generation of the web, a web of data. For us attending the session it was hard to imagine the full potential on what this idea would mean for individual scientists and for a pharmaceutical company."

From Linked Data, an opportunity to mitigate complexity in pharmaceutical research and development, Bo Andersson and Kerstin Forsberg, LWDM 2011

And yes, I do hope to also get some time during the weekend to visit the Jazz Festival.

Tuesday, June 11, 2013

Standards for common aspects

Over the last three years I have been engaged with different groups working on standards, both for data exchange, such as CDISC, and for vocabularies, such as MedDRA MSSO and NCI EVS, as they now start to see the value of using "standards for standards".

From Flickr bitpuddle
(Twitter @eric_d_hancock)

Standards for standards

So, "I push back" to standards organisations to use semantic web standards and linked data principles to make their standards directly usable for humans and for machines.

A good example is CDISC and their growing interest in using semantic web standards (based on RDF, Resource Description Framework): CDISC2RDF. For some background see Clinical studies and the road to Linked Data. Today FDA, CDISC, pharma:s, CRO:s and software vendors are working together on this in an FDA working group for Semantic Technology organised by PhUSE.

Standards for common aspects

Over the last year or so, I have also tried to keep up to date with groups developing RDF-based standards for common aspects such as:

data descriptions (VoID)
data provenance and versioning (PROV and PAV)
concept based vocabularies and value sets (SKOS)
multi-dimensional statistical data (RDF Data Cube)

I try to ensure that we have a good view of the maturity and applicability of these standards so we can use them in our internal “integration factory”. But most of all, so we can “push back” to vendors. I foresee that, in the same way as we started to add requirements on web interfaces for better end-user usability back in the late 90s, we should now start to add requirements on web interfaces for better machine usability. So we need to understand how to incorporate these common aspects in our URS:s, RFI:s, RFP:s etc.

That is, we should ask software vendors to use RDF-based standards for common aspects, for example:

Medidata's Rave and Perceptive's IMPACT to describe datasets using VoID.
Accelrys' Pipeline Pilot to use W3C PROV.
Microsoft's SharePoint to use term sets for tagging in SKOS.
SAS Institute's Drug Development to create analysis results using RDF Data Cube.

So, this interview with Reza B'Far, Vice President of Development, Oracle, on the W3C blog made me very glad: Oracle on Data on the Web

Oracle to use W3C provenance standard to create a single audit time line across systems

"One of the hugest problems we faced was maintaining transaction audit trails in a heterogeneous environment in a standard and compatible way. Audit trails are described with literally millions of different formats in different organizations. This used to mean it was impossible to create a single audit time line. PROV solves this problem. We now provide (and consume) a PROV feed that unifies the audit trails generated by transactions across heterogeneous systems."

See also the Implementation report with 60+ examples of usage of the W3C Provenance specifications.
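
The "single audit time line" in the Oracle quote above can be sketched as follows: records from two systems with different formats are normalized to a shared shape and then sorted by time. The two source formats and their field names are invented for illustration. The keys `prov:wasAssociatedWith` and `prov:startedAtTime` borrow the names of PROV-O properties, but this is plain Python dictionaries, not a real PROV serialization and not Oracle's implementation.

```python
from datetime import datetime, timezone

# Hypothetical audit records in two system-specific shapes.
system_a = [
    {"user": "alice", "action": "update-record", "ts": "2013-06-10T09:15:00+00:00"},
]
system_b = [
    {"who": "bob", "what": "approve-record", "when": 1370855400},  # Unix seconds
]


def from_system_a(rec):
    """Normalize a system A record (ISO 8601 timestamp string)."""
    return {
        "activity": rec["action"],
        "prov:wasAssociatedWith": rec["user"],
        "prov:startedAtTime": datetime.fromisoformat(rec["ts"]),
    }


def from_system_b(rec):
    """Normalize a system B record (Unix timestamp in seconds, UTC)."""
    return {
        "activity": rec["what"],
        "prov:wasAssociatedWith": rec["who"],
        "prov:startedAtTime": datetime.fromtimestamp(rec["when"], tz=timezone.utc),
    }


normalized = [from_system_a(r) for r in system_a] + [from_system_b(r) for r in system_b]

# One timeline across both systems, ordered by start time.
timeline = sorted(normalized, key=lambda r: r["prov:startedAtTime"])

for entry in timeline:
    print(entry["prov:startedAtTime"].isoformat(),
          entry["prov:wasAssociatedWith"], entry["activity"])
```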

For a nice intro to the W3C Provenance Specifications, see the tutorial by Paul Groth (@pgroth) at the Extended (European) Semantic Web conference.

Saturday, May 25, 2013

Three Linked Data meetings in Sweden

I'm back after two nice days in the south of Sweden. Yesterday, 24 May, I attended the first meetup for Linked Data in Malmö.

This was the third Linked Data meeting in Sweden. They have all been great events with more than 30 attendees each. I do hope these will encourage more friends and colleagues in Sweden across academia, industry, consulting companies and government to start applying the Linked Data principles and using the stack of Semantic Web standards.

Links to all three events:

Join our discussion about Linked Data and Semantic Web in Sweden in the Facebook group: Semantiska webben i Sverige, SSWIG

Kudos to Bosse Andersson (@bBalsa), Marie Gustavsson-Friberg (@mariegus) and Eva Blomqvist (@evabl444) for arranging. I look forward to the next one!

Sunday, March 31, 2013

Talking to machines

Last week, while commuting, I remotely followed two events related to Evidence Based Medicine (EBM), both of which took place in Oxford:

Cochrane UK & Ireland 21st Anniversary Symposium (Storify summary)
EvidenceLive 2013 (summary blog)

Ben Goldacre spoke at both events. At the Cochrane event he talked about getting better at talking to the Public, to Policy makers and to Machines. In the last part of his talk, Talking to Machines, he says that it's "odd how we share results of RCTs (Randomised, Controlled Trials) in C19th essay format!" This is also how the Cochrane Collaboration shares reviews and meta-analyses of clinical trial data.

Ben Goldacre, Talking to Machines
[28.30-39.00 mins]

Structured data in RDF

Instead we should use "C21th structured data standards". I was especially pleased to hear how he was even more explicit: "Publish in RDF a good, quality standard, nice data format" [at 36.50 mins]

See also what the web development director at Cochrane, Chris Mavergames, says in his excellent presentation on how linked data can help free content from the 'container of the article'.

Future of the Article from Chris Mavergames

This is related to the work we do on linked clinical data standards, see my recent blog post: CDISC2RDF. That is, semantic web versions of data standards for clinical data on subject/participant level.

Clinical Data Transparency

Given the recent move towards clinical data transparency (see a good summary in Nature this week, Drug-company data vaults to be opened), I foresee a discussion also on data standards for the summary level data in clinical study reports and peer-reviewed papers using semantic web standards.

An alternative could be to represent tables in the reports and papers as RDF using the RDF Data Cube Vocabulary (for multi-dimensional statistical data); see the CSVImport and the CubeViz projects (Representing and browsing multi-dimensional statistical data as RDF using the RDF Data Cube Vocabulary, previously called Stats2RDF). This EU/FP7 project has used this vocabulary to publish biomedical statistical data, e.g. the WHO's Global Health Observatory dataset (see Publishing and Interlinking the Global Health Observatory Dataset).
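
To show the basic idea of turning a results table into RDF Data Cube observations, here is a minimal sketch. `qb:Observation` and `qb:dataSet` are terms from the RDF Data Cube Vocabulary; everything else is assumed for illustration: the `ex:` dimension and measure names, the dataset URI, and the numbers in the table, which are made up and do not come from any study.

```python
# A made-up summary table: one row per treatment arm and visit.
table = [
    {"arm": "placebo", "visit": "week-4", "mean_change": -1.2},
    {"arm": "active", "visit": "week-4", "mean_change": -3.4},
]

DATASET = "http://example.org/dataset/1"  # hypothetical dataset URI


def table_to_observations(rows, dataset_uri):
    """Turn each table row into one qb:Observation, as (s, p, o) tuples."""
    triples = []
    for number, row in enumerate(rows, start=1):
        obs = f"{dataset_uri}/obs/{number}"
        triples.append((obs, "rdf:type", "qb:Observation"))
        triples.append((obs, "qb:dataSet", dataset_uri))
        # Dimensions identify the cell; the measure holds its value.
        triples.append((obs, "ex:arm", row["arm"]))
        triples.append((obs, "ex:visit", row["visit"]))
        triples.append((obs, "ex:meanChange", row["mean_change"]))
    return triples


observations = table_to_observations(table, DATASET)
for triple in observations:
    print(triple)
```

Each cell of the table becomes an addressable observation, which is what makes it possible to query and link summary results across reports instead of reading them out of a document.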

A challenge is to express the clinical trial design and other contextual information as structured data to make it easier to make informed decisions for trial reviews and cross-trial analyses.

Tuesday, February 12, 2013

CDISC2RDF

In a recent article from semanticweb.com (The Voice of Semantic Web Technology and Linked Data Business) the project CDISC2RDF is nicely described: Clinical Studies And The Road To Linked Data.

The project will be presented at the Conference on Semantics in Health Care & Life Sciences (CSHALS) meeting at the end of February by Charlie Mead, co-chair of the W3C’s Health Care and Life Sciences Interest Group (HCLSIG).

Here is a slide deck describing the first deliverable of the project. A refined slide deck will be presented at the CSHALS meeting together with a couple of CDISC2RDF blog posts to describe the transformation process.

Cdisc2 rdf overveiw from Kerstin Forsberg
