Linked Data for Enterprises

Wednesday, June 20, 2012

To Whom It May Concern

A nice tweet from Phil Archer (@philarcher1) this morning reminded me of a "triple tweet" I posted earlier this year on the topic of creating data and metadata To Whom It May Concern ('a formal salutation used for opening a letter to an unknown recipient', source: Wikipedia).

So, here's Phil's tweet quoting Sharon Dawes (@ssdawes):

Metadata should support users you don't know - what a great line from Sharon Dawes at #pmod
— Phil Archer (@philarcher1) June 20, 2012

And here is my "triple tweet" on what I see as one of the core values of Linked Data:

Facilitating the use of data for people you don't know, who you may never meet. #LinkedData
— Kerstin Forsberg (@kerfors) February 14, 201

Reusable data is published without prior agreement by or coordination of data consumer. #LinkedData
— Kerstin Forsberg (@kerfors) February 14, 2012

Future unknown stakeholders can benefit from the availability of shared data. #LinkedData
— Kerstin Forsberg (@kerfors) February 14, 2012

I posted them after I had the pleasure of meeting David Wood (@prototypo) and Bernadette Hyland (@BernHyland) F2F at a Linked Data and URI workshop in Boston in late January:

Hat tip for previous tweets on the value of #LinkedData to @BernHyland @prototypo
— Kerstin Forsberg (@kerfors) February 14, 2012

Sunday, May 27, 2012

AstraZeneca re-joins W3C HCLS

After a warm and sunny day of kayaking out in the archipelago north of Gothenburg it was nice to catch up on Twitter and see the official announcement from W3C that my employer, AstraZeneca, has joined W3C. It's actually a re-join, as we first joined W3C in 2006 to participate in the Semantic Web Health Care and Life Sciences Interest Group (HCLS IG).

RT @w3c: AstraZeneca joined W3C bit.ly/JXYQj9 < Yes - we are back. Kudos to @TPlasterer & @John_Reynders
— Kerstin Forsberg (@kerfors) May 27, 2012

Update 6 June: I recommend this nice slide deck for an overview of Semantic Web and Related Work at W3C, presented by Ivan Herman (@ivan_herman) at the 2012 Semantic Tech & Business Conference in San Francisco, CA, USA, 5 June.

I attended and reported back from the W3C conference in Edinburgh in May 2006 (WWW2006) and from the next one in Banff in May 2007 (WWW2007) together with my former colleague Bosse Andersson (@bbalsa). My focus was on applying semantic web standards to clinical data, and in 2007 Eric Neumann (@ericneumann), one of the HCLS pioneers, and I published a W3C Note in the Drug Safety and Efficacy task force on CDISC's Study Data Tabulation Model (SDTM). Together with most of the members of the HCLS group I also co-authored an important article in BMC Bioinformatics: Advancing translational research with the Semantic Web.

In late 2007 I had to focus on other tasks while Bosse and colleagues in the US, Julia Kozlovsky, Elgar Pichler and Otto Ritter, continued the interactions with other parties across life science and health care in two of the HCLS groups: Linking Open Drug Data (LODD) and Translational Medicine Ontology (TMO).

In early 2010, when I returned my job focus to semantic interoperability, AstraZeneca had decided not to renew the W3C membership. To stay updated I started to use social media as a way to engage with the semantic web and linked data community, to follow thought leaders in the intersection between eHealth and Clinical Research, and to share news and insights with colleagues.

In early 2011 Bosse and I wrote a short paper to summarise insights from AstraZeneca's engagement in W3C HCLS and in the EU project Large Knowledge Collider (LarKC). The paper, Linked Data, an opportunity to mitigate complexity in pharmaceutical research and development, starts with a look back to one of the most inspiring meetings I have been in:

During the WWW2007 conference a breakthrough of the Linked Data idea happened in a session where web experts demonstrated the power of a new generation of the web, a web of data. For us attending the session it was hard to imagine the full potential on what this idea would mean for individual scientists and for a pharmaceutical company.

As described in my earlier blog post, we now have a new program in AstraZeneca called Integrative Informatics (i2), establishing the components required to let a linked data cloud grow across R&D. Re-joining W3C and re-connecting with HCLS is one step in this.

Sunday, May 6, 2012

Semantic models for CDISC based standard and metadata management

In mid April we gave a presentation at the 2012 CDISC (Clinical Data Interchange Standards Consortium) Interchange Europe with the title: Semantic models for CDISC based standard and metadata management (see our slides and short paper). This time in a sunny, but chilly, Stockholm at a very nice hotel (Elite Marina Tower). Last year Frederik Malfait, consulting at Roche, and I, working for AstraZeneca, had two different presentations at the 2011 conference in Brussels. See my blog post: Linking Clinical Data Standards.

Since then we have seen more interest in semantic web standards in the CDISC community, see for example the article in Applied Clinical Trials Online (@Clin_Trials): Digital Data, the Semantic Web, and Research, by Wayne Kubick, the new CTO of CDISC. This year Frederik and I did a joint presentation with a key message to the CDISC organisation: "Put semantics into the semantics". That is, to start using semantic web standards and linked data principles for the whole suite of CDISC standards. See our list of proposals below.

In my introduction I described the current situation, where the question is now "Not when, but how" to best adopt CDISC standards. At the same time the different CDISC standards are not linked, they are published in different formats, and so-called metadata registries (MDR) are requested for robust life cycle management of standards.

Real world use

In my brief introduction (see slides 5-11) to the core semantic web standard, the so-called RDF triple, I showed an example of how Google uses RDF-based standards to improve search (see my previous blog post on schema.org). I also showed how NCI uses RDF to publish the NCI Thesaurus, see RDF/OWL download of NCIt via LexEVS, and how RDF is used for an early version of the domain model for biomedical research (BRIDG), see RDF/OWL representation of BRIDG/ISO21090. In both these cases the RDF is published as XML, but RDF triples can also be published in other serialisation formats (e.g. JSON, Turtle, and N-Triples). I also showed the latest version of the Linked Open Data cloud, with even more linked datasets than the one Frederik and I had in our presentations last year. I then turned over to the main part of our presentation, describing two real-world examples of how two sponsors are now starting to use semantic web standards and linked data principles.
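To make the idea of an RDF triple a bit more concrete, here is a minimal Python sketch of one triple (subject, predicate, object) written out in the N-Triples format. The `Triple` helper, the `to_ntriples` function and the example.org study URI are made up for illustration; only the Dublin Core `title` property is a real, widely used term. The escaping is simplified (backslashes and double quotes only), so treat it as a sketch and not as a complete serialiser.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement. Hypothetical helper, not part of any library."""
    subject: str             # URI of the thing being described
    predicate: str           # URI of the property
    obj: str                 # URI or plain literal value
    obj_is_uri: bool = True  # False when obj is a plain string literal


def to_ntriples(t: Triple) -> str:
    """Render one triple as a single N-Triples line (simplified escaping)."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# Assumed example data: a made-up study URI described with a Dublin Core title.
triple = Triple(
    "http://example.org/id/study/S1",
    "http://purl.org/dc/terms/title",
    "An example study",
    obj_is_uri=False,
)
line = to_ntriples(triple)
print(line)
```

The same three parts could just as well be written as RDF/XML, JSON or Turtle; the serialisation changes, the triple does not.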

Linked Data cloud to grow across AstraZeneca R&D

Photo from CDISC Facebook

In AstraZeneca we have a new program called Integrative Informatics (i2) establishing the components required to let a linked data cloud grow across R&D. A key component is the URI policy for how to make for example a Clinical Study linkable by giving it a URI, that is a Uniform Resource Identifier, e.g. http://research.data.astrazeneca.com/id/clinicalstudy/D5890C00003. This is an identifier for a clinical study with the study code D5890C00003 that should be persistent and not dependent on any system. In the same way we will give guidance on how to use URI:s to make other key entities such as Investigator and Lab linkable. Also standard data elements from CDISC and internal ones to be managed in a future MDR should have URI:s to make them linkable. For more information on how URI:s are being used in for example the UK and US governments, see my URI design page.
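As a small illustration of what such a URI policy could look like in practice, the Python sketch below builds a URI from an entity type and a local identifier. The pattern (base, then entity type, then identifier) is only inferred from the single example URI above, and the function name `mint_uri` is hypothetical; the actual policy may well differ.

```python
from urllib.parse import quote

# Base taken from the example URI in the text; the general pattern is an assumption.
BASE = "http://research.data.astrazeneca.com/id"


def mint_uri(entity_type: str, local_id: str) -> str:
    """Build a URI of the form <base>/<entity type>/<identifier>.

    Both parts are percent-encoded so that the result is always a valid URI path.
    """
    return f"{BASE}/{quote(entity_type, safe='')}/{quote(local_id, safe='')}"


study_uri = mint_uri("clinicalstudy", "D5890C00003")
print(study_uri)
```

The point of the design is that nothing in the URI refers to a specific system or file format, which is what makes it possible to keep it persistent.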

A semantic web standard based MDR in Roche

Photo from CDISC Facebook

Frederik described the schema, content and architecture of the Roche Biomedical MDR. He then went through a demo using an RDF representation of a CDISC standard example and of an internal Roche standard (you will find the screenshots from the demo at the end of the slide deck). He first showed how the standards could be viewed using a general tool (TopBraid Composer from TopQuadrant, but it could be any other RDF tool such as Protégé, a common open source tool). On slides 20-28 you can see how SDTM model v.1.2, SDTM IG v3.1.2, and SDTM CT:s are all linked together (for example Observation Class: Event - Domain: AE - Variable: AEOUT - Submission value: NOT RECOVERED/NOT RESOLVED). He then showed the same RDF representation via the application Roche Global Standard Data Browser (slides 29-37). Frederik also showed how the linked data standards can be exported in SAS and Excel formats (slides 42-50). And finally, he showed an example from a Roche standard questionnaire.
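The chain of links in the example above (class, domain, variable, submission value) can be sketched as a handful of triples that are followed one step at a time. The namespace and the predicate names (`hasDomain`, `hasVariable`, `hasSubmissionValue`) in this Python sketch are invented for illustration; they are not the names used in the Roche schema.

```python
# Hypothetical namespace and predicate names, for illustration only.
EX = "http://example.org/std#"

triples = [
    (EX + "Event", EX + "hasDomain", EX + "AE"),
    (EX + "AE", EX + "hasVariable", EX + "AEOUT"),
    (EX + "AEOUT", EX + "hasSubmissionValue", "NOT RECOVERED/NOT RESOLVED"),
]


def objects(subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(start, predicates):
    """Follow a chain of predicates from start, taking the first match at each step."""
    path = [start]
    node = start
    for predicate in predicates:
        matches = objects(node, predicate)
        if not matches:
            break  # the chain ends here; return what was reached so far
        node = matches[0]
        path.append(node)
    return path


path = follow(
    EX + "Event",
    [EX + "hasDomain", EX + "hasVariable", EX + "hasSubmissionValue"],
)
print(" - ".join(node.replace(EX, "") for node in path))
```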

Proposals to CDISC

In the slides you can see that Frederik had to transform CDISC standards into RDF using a schema he developed for Roche and give them URI:s in a Roche namespace (e.g. http://gdsr.roche.com/cdisc/sdtmig-3-1-2#Column.AE.AEOUT for one of the data elements). This is not an ideal approach; instead we would like CDISC to provide these. Hence the drive from our leadership in Roche and AstraZeneca for Frederik and myself to push back to CDISC.

Below is a draft list of proposals to CDISC:

Decide on a URI design for CDISC standards (e.g. http://id.cdisc.org/sdtm).
Review the schema Frederik has proposed for the core MDR in CDISC SHARE.
Publish the new SDTM v1.3 and SDTM IG v.3.1.3 as RDF in XML, JSON, Turtle, and N-Triples formats using the reviewed schema and URI design. (As options to current publication formats, i.e. PDF, html, csv, xml/odm.)
Work together with NCI on enhancing the RDF/OWL version of NCI Thesaurus. Also review the option to use the RDF/SKOS standard and apply linked data principles. Publish coming versions of CDISC CT:s as RDF in XML, JSON, Turtle, and N-Triples.
Work together with NCI on enhancing the RDF/OWL representation of BRIDG/ISO21090 model and apply linked data principles to make all BRIDG classes, properties and ISO21090 data types linkable.
Extend the MDR schema for CDISC SHARE for linkage to relevant BRIDG classes and properties and to ISO21090 data types.
Start exploring semantic web standards and linked data principles also for clinical data, including making individual clinical data points linkable using URI:s and annotating them using existing and emerging clinical standard terminologies and ontologies.

Monday, April 9, 2012

Describe things vs Improve markup of pages that describe things

Easter Monday is a public holiday in Sweden and it's been a rainy and cold day -- so, it's time to write a new blog post. It's triggered by a nice blog post published just before the weekend by Phil Archer (@philarcher1) with the interesting title: Danbri has moved on – should we follow? In his blog post Phil reflects on a presentation Dan Brickley (@danbri) did the week before at a Linked Data meetup in London.

Phil focuses on Dan's point about the best practice so far in the semantic web community: "look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use."

And Phil wonders if it's now time to move on and "embrace schema.org as the vocabulary to use wherever possible? It won't cover everything, but it might cover the 50% of classes and properties that dominate any domain of interest." In his presentation, Schema.org and One Hundred Years of Search, Dan also argues that search terms have barely changed in style for 100 years and more.

For more info about schema.org, the joint vocabulary from Google, Bing (Microsoft) and Yahoo, see my remote report from the SemTech 2011 conference.

Improve markup of pages that describe things
When listening to the video with Dan I found this statement in his slides (on slide 33) very interesting, describing the scope of the schema.org vocabulary as "In-page structured data for search":

"Not asking an unconstrained 'so, how do we describe cars?', but “how can we improve markup on existing pages that describe cars?” (or Comics, SoftwareApps, Sports, ...)".

I always like it when someone clearly states what is not included -- what's not intended. So, this is a helpful statement for me. And it will be interesting to follow how Schema.org will be extended and refined for domains such as Medicine/Health, see the list of Schema.org proposals maintained by W3C Web Schemas.

At the same time, a lot of the semantics I look for in my daily work is more about "how to describe cars?". Well, not cars really -- it's about other kinds of 'things' and their parts, relationships and impacts on each other. It's about "how to describe 'things' in small portions of the biological, chemical, clinical and health economic reality studied in clinical research and documented in health care". Also, "how to organise data about these 'things' not only to improve search but also to improve how data about these entities can be combined and queried in new ways."

Describe things

This is also the driver for me to learn more about how to: "capture, in a logical, systematic way, what scientists regard as the basic truths about a topic. Like equations in physics or axioms in mathematics, they can even be the basis for computational models." from More than Words. See also several of my earlier blog posts on this approach, for example my post on Disease terminologies and ontologies.

In a future blog post I hope to learn more about how this approach has been applied in the Chemical Information Ontology (ChemInfo) to describe Chemical Entities of Biological Interest (ChEBI). This is nicely explained in one of the favorite papers I have collected in my (kerfors) CiteULike library: The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web by Janna Hastings, Leonid Chepelev, Egon Willighagen, Nico Adams, Christoph Steinbeck, Michel Dumontier.

Example from Chemical Entities of Biological Interest (ChEBI)

the entity Hemoglobin in html view via the ontology-browser Ontobee

Sunday, January 8, 2012

"To-day we have naming of parts"

I like to explore associations people share with me when I describe something new to them. So, I was pleased the other day when a colleague in the UK shared the phrase that stuck in her head when I described the idea of URI:s (Uniform Resource Identifiers) and Linked Data: "To-day we have naming of parts".

So, after some googling I found the poem by Henry Reed (1914-1986), "Naming of Parts." New Statesman and Nation 24, no. 598 (8 August 1942).

To-day we have naming of parts. Yesterday,
We had daily cleaning. And to-morrow morning,
We shall have what to do after firing. But to-day,
To-day we have naming of parts. Japonica
Glistens like coral in all of the neighboring gardens,
And to-day we have naming of parts.

Hear Henry Reed and Frank Duncan read "Naming of parts" (mp3)

Through a very nice website, The Poetry of Henry Reed, I learned more about this World War II British poet, critic, translator, and radio dramatist. It helped me to better understand this wonderful, and sad, poem about the contrast between the world of weapons and the world of nature.

Naming parts and other things
I also learned about an article (DOI:10.1038/nbt0102-27) in Nature Biotechnology (2002) using the first stanza in Henry Reed's famous poem as its title. In the article a professor of genomics at the University of Manchester describes the identification of previously non-annotated genes in yeast.

And I found a blog post from 2009 that also used the first stanza in Henry Reed's poem in its title:
Naming of parts and other things. That is, David Bawden's (@David_Bawden) post on his nice blog: "The Occasional Informationist, irregular thoughts on the information sciences". In this post he describes a meeting with John Wilbanks (@wilbanks) at the British Library:

In his presentation of the need for annotation of digital reporting of scientific findings, Wilbanks commented simply that we need to call the same thing by the same name; this makes possible the semantic linking of information and data, the creation of ontologies, and so on, without which it will not be possible to share information across disciplinary and sub-disciplinary silos.

He exemplified this by examples both simple – the various names for coffee in different languages – and complex – the variant terminology used in hundreds of datasets relating to polar climate change, and in over a thousand related to genomics.

There was another aspect to this point. What we call an information object in the digital world – DOIs and all the rest – is also fundamental; if we do not call these digital objects the same thing, we will have great difficulty in finding them.

Names of today
So, let me conclude this post with a couple of examples of naming parts and other things using the names of today, that is, http-based URI:s. The three example URI:s are also three examples of large efforts to publish linked data "about the named things":

British Library's URI for the poet Henry Reed
http://bnb.data.bl.uk/id/person/ReedHenry1914-1986
Wikipedia's, i.e. DBpedia's, URI for the poet Henry Reed
http://dbpedia.org/resource/Henry_Reed_%28poet%29
The DOI for the article about identifying genes in yeast turned into a URI by CrossRef
http://dx.doi.org/10.1038/nbt0102-27

1. The British Library publishes metadata about bibliographic resources ("things") using Linked Data techniques and technologies. Part of that is to assign http-based URI:s to the creators. For a great introduction to the underlying model see the blog post: British Library Data Model: Overview by Tim Hodson (@timhodson).

So, for example the data model specifies that persons who are the identified creators of bibliographic resources, such as the poet Henry Reed (http://bnb.data.bl.uk/id/person/ReedHenry1914-1986), should be of the type Agent and Person according to the basic, and very often used vocabulary for linked data, called Friend of a Friend (FOAF).

2. A large part of the structured content published on Wikipedia pages is also made available as linked data called DBpedia. See this great article: How DBpedia Treats Wikipedia as a Database. The so-called resources ("things") that the Wikipedia pages describe are given http-based URI:s in DBpedia, and each resource is typed using a thin model called the DBpedia ontology.

So, here we can see that the poet Henry Reed is also identified in DBpedia (http://dbpedia.org/resource/Henry_Reed_%28poet%29) and described with the structured data from the Wikipedia page about him. Such as his birth date and death date, and also the fact that he is categorized using the concept 'English poets'. This concept also has a URI http://dbpedia.org/resource/Category:English_poets. So, we may have more than one URI for the same Henry Reed. These can be related to each other using the sameAs statement.

This is not yet done by the British Library, but I assume this will be done later as for example the Swedish Library catalogue relates their URI:s to DBpedia's.
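A sameAs link between the two URI:s for Henry Reed could be sketched like this in Python. The two URI:s come from this post and `owl:sameAs` is the standard OWL property, but the statement itself is hypothetical (as noted, the British Library does not publish it yet) and the helper `same_as_closure` is made up for illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

BL_URI = "http://bnb.data.bl.uk/id/person/ReedHenry1914-1986"
DBPEDIA_URI = "http://dbpedia.org/resource/Henry_Reed_%28poet%29"

# Hypothetical statement: nobody has published this link, it is assumed here.
statements = [(BL_URI, OWL_SAME_AS, DBPEDIA_URI)]


def same_as_closure(uri, statements):
    """Return every URI reachable from uri via sameAs links, in either direction."""
    found = {uri}
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for s, p, o in statements:
            if p != OWL_SAME_AS:
                continue
            # sameAs is symmetric, so follow the link both ways.
            for a, b in ((s, o), (o, s)):
                if a == current and b not in found:
                    found.add(b)
                    frontier.append(b)
    return found


aliases = same_as_closure(DBPEDIA_URI, statements)
print(sorted(aliases))
```

With such a link in place, a consumer starting from either URI can collect what both sources say about the same poet.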

Here is another URI, http://dbpedia.org/resource/Category:Firearm_components, for a categorization concept, and in the DBpedia interface you can see a list of such resources ("things") and links to them using URI:s such as http://dbpedia.org/resource/Sling_%28firearms%29.

3. CrossRef has made metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are used to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So, here is the identifier of the article about identifying genes in yeast http://dx.doi.org/10.1038/nbt0102-27.
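Turning a DOI into an http-based URI is, in the simplest case, a matter of putting the resolver address in front of it. The Python sketch below assumes the http://dx.doi.org/ prefix used in this post; the function name `doi_to_uri` and the handling of a leading "doi:" label are my own assumptions for illustration.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # prefix as used in the example URI in this post


def doi_to_uri(doi: str) -> str:
    """Turn a DOI string, with or without a leading 'doi:' label, into an http URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Keep the slash between prefix and suffix, percent-encode anything unsafe.
    return RESOLVER + quote(doi, safe="/")


article_uri = doi_to_uri("DOI:10.1038/nbt0102-27")
print(article_uri)
```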

Kudos to my colleague for the opportunity to learn more about this wonderful poem and for a great discussion.
To ReedingLessons the signature behind the great website about Henry Reed.
To @David_Bawden for his nice blog The Occasional Informationist.
And, finally, to @wilbanks a great source of inspiration.

Sunday, December 11, 2011

Linked Enterprise Data Patterns Workshop

Earlier this week I followed yet another event remotely. This time the workshop arranged by W3C on Linked Enterprise Data Patterns, in Cambridge, MA. So, I had some nice hours on the bus in the dark evenings and mornings over here in Sweden when I followed things on the:

conference website: position papers
twitter feed: #LEDP
irc channel log/scribe: first day and second day

Here's a couple of things I did find extra interesting:

An article on IBM developerWorks presented by Martin Nally: Toward a Basic Profile for Linked Data, A collection of best practices and a simple approach for a Linked Data architecture

New role proposed by Tim Berners-Lee (@timberners_lee) "Chief Identity Officer".

IBM DB2 will include RDF support sometime in 2012.

I have followed the work of Eric Prud'hommeaux, W3C, on access controls and policy mediation to enable networks of parties across industry, health care, and academia to share sensitive data such as clinical records.

In this workshop Eric presented ideas I want to understand better: Combining XACML (eXtensible Access Control Markup Language) with SPIN rules in SPARQL queries. Eric's Position paper: SPARQL Access Policies, and presentation: Access Control Landscape. Controlling READ/WRITE of information as sets
A paper from 2008 co-authored by Eric that I have found very useful: Policy Mediation to Enable Collaborative Use of Sensitive Data

Two papers on identity and URI:s with interesting people as co-authors that I'll read in more detail:

Identity Crisis in Linked Data, co-authored by Ora Lassila (@gotsemantic), Nokia, who is also one of the authors of the famous Scientific American article on the semantic web from 2001
Diverted URI Pattern, co-authored by David Wood (@prototypo), and also the editor of the great book Linking Enterprise Data.

And, finally, a quite interesting discussion on 'silo folks & data integration folks' between David Wood and Bradley P. Allen captured by Sandro Hawke (@Sandhawke) in the irc channel log/scribe from the first day.

davidw: Where RDF really shines is in crossing silos, connecting things where traditional approaches have left off.

davidw: Some orgs that have succeeded well (DoD, O'Reilly), they built a new team and hire ontologists if they need them, they get consultants in, they build a skunk works to do that bit between the silos. They leave the DBAs in place, because the DBA stuff still needs to get done.

davidw: And they have consultants/new team to build out that bridging infrastructure. You're not going to convert your silo folks -- really good at silos -- into data integration folks.

Allen: That's what we're doing, with a startup group, showing we can solve this interop problem.

Allen: When people see this, they perk up, and want to know more.

Sunday, December 4, 2011

Large organisations using Semantic Web

Earlier this week the east version of the Semantic Tech & Biz Conference took place in Washington, DC, and I followed it via the #semtechbiz feed on Twitter. The activity in this feed was lower than at the much larger west version that took place in San Francisco in early June, an event I also followed remotely, see my blog post: SemTech2011 report.

Below I highlight one of the many case studies presented at the conference in Washington, DC, on the theme "here is what we did", that is, what the U.S. military (DoD) does in its so-called Enterprise Information Web. Further down you find examples of what Chevron and Statoil did in the oil industry. In two side notes I wonder about the use of semantic technologies in Norway, and I am reminded of some explorative work I did ten years ago on Topic Maps and Published Subject Identifiers (PSI:s).

Enterprise Information Web

One of the many case studies presented in the conference was the U.S. military (DoD Defense Information Systems Agency) Enterprise Information Web. In the recent RFI, Request for Interest, they write "the envisioned EIW is built on semantic web, which will allow better enterprise-wide collection, analysis and reporting of data necessary for managing personnel information and business systems, as well as protecting troops on the ground with crucial intelligence."

A YouTube video with Dennis E. Wisnosky, Chief Technical Officer and Chief Architect at DoD
See also: DoD Turns to Semantic Web To Improve data Sharing

Being a non-American I do find it a bit hard to relate to DoD and to some of the critical comments on the YouTube video. However, as I wrote in one of my tweets: 30+ years ago the U.S. military needed the Internet - now they use Semantic Web standards and Linked Data principles. And I think this video gives some really nice explanations.

How two large organisations in oil industry use semantic web

This week I also saw another interesting case study, that is, how the semantic web standard OWL is used in the oil industry. In an interview with Roger Cutler, published on the W3C blog, he describes the typical situation in most large organisations where information "lives in different forms in number of different systems and is handled separately by different organizations with different data models", and he talks about how this has traditionally been addressed:

People use point-to-point solutions or big data warehouses, but neither approach scales gracefully. Point-to-point solutions become very complex and hard to maintain. Data warehouses create replication issues and tend to be fragile. So, the possibility of a smarter, more agile, more cost-effective way of dealing with integration would have a great deal of value to us. The Semantic Web is not guaranteed to be the solution, but it looks plausible and we’d like to see if it lives up to its promise in practice.

I also noted that Roger Cutler, Research Consultant at Chevron Information Technology Company, talks about the "expressiveness and reasoning achievable with OWL". I like that because I sometimes hear comments along the lines that OWL, and OWL2, is too complex and maybe not so useful in an industrial setting. In the interview Roger says:

We have demonstrated a case in which similar objectives were obtained in the context of an ontology with about fifteen lines of readily comprehensible rules and in a relational database context with over 1000 lines of pretty complex code.

I also see that there is a W3C Oil, Gas and Chemicals Business Group, which includes a representative from Statoil, Jennifer Sampson. And I now also see an interesting case study presented by Jennifer at the SemTech conference in San Francisco: Semantic Technologies and Statoil's Integration Layer for Plant Information Systems.

Side note: Semantic technologies in Norway
The Statoil presentation looks really interesting and is a trigger for me to catch up with how semantic technologies are used in Norway. I have been thinking about that for some time. I visited Statoil's office in Stavanger a couple of years ago to talk about metadata standards. And I see some interesting signals that semantic technologies have been much more used in Norway than in Sweden.

Side note: Topic Maps and Published Subject Identifiers (PSI:s)

Back in 2002, before the OWL standard existed and the Linked Data principles were defined, I supervised a master's thesis with an Evaluation of Topic Maps for information navigation in cardiovascular research. Topic Maps is a semantic technology that has a strong presence in Norway. The master's students I supervised worked together with Steve Pepper, the Topic Maps guru. A key learning I took away from some really good discussions back in 2002 with Steve, and also Lars Marius Garshol (@larsga), was the idea of Published Subject Identifiers (PSI:s). In a future blog post I will do a recap of PSI:s and try to relate them to today's http-based URI:s as one of the Linked Data principles.

Kudos to Bernadette Hyland (@BernHyland) and Dave Smith (@DruidSmith)
for their #semtechbiz tweets. And also to @semanticweb for the great news service:
"Voice of Semantic Web Technologies and Linked Data Business" and to the @W3C blog.
