Linked Data for Enterprises

Sunday, December 4, 2011

Large organisations using Semantic Web

Earlier this week the east version of the Semantic Tech & Biz Conference took place in Washington, DC, and I followed it via the #semtechbiz feed on Twitter. The activity in this feed was lower than at the much larger west version that took place in San Francisco in early June, an event I also followed remotely; see my blog post: SemTech2011 report.

Below I highlight one of the many case studies presented at the conference in Washington, DC, on the theme "here is what we did", namely what the U.S. military (DoD) does in its so-called Enterprise Information Web. Further down you find examples of what Chevron and Statoil did in the oil industry. In two side notes I wonder about the use of semantic technologies in Norway, and I am reminded of some exploratory work I did ten years ago on Topic Maps and Published Subject Identifiers (PSI:s).

Enterprise Information Web

One of the many case studies presented in the conference was the U.S. military (DoD Defense Information Systems Agency) Enterprise Information Web. In the recent RFI, Request for Interest, they write "the envisioned EIW is built on semantic web, which will allow better enterprise-wide collection, analysis and reporting of data necessary for managing personnel information and business systems, as well as protecting troops on the ground with crucial intelligence."

A YouTube video with Dennis E. Wisnosky, Chief Technical Officer and Chief Architect at DoD
See also: DoD Turns to Semantic Web To Improve data Sharing

Being a non-American, I find it a bit hard to relate to the DoD and to some of the critical comments on the YouTube video. However, as I wrote in one of my tweets: 30+ years ago the U.S. military needed the Internet - now they use Semantic Web standards and Linked Data principles. And I think this video gives some really nice explanations.

How two large organisations in the oil industry use the Semantic Web

This week I also saw another interesting case study: how the semantic web standard OWL is used in the oil industry. In an interview with Roger Cutler, published on the W3C blog, he describes the typical situation in most large organisations, where information "lives in different forms in number of different systems and is handled separately by different organizations with different data models", and he talks about how this has traditionally been addressed:

People use point-to-point solutions or big data warehouses, but neither approach scales gracefully. Point-to-point solutions become very complex and hard to maintain. Data warehouses create replication issues and tend to be fragile. So, the possibility of a smarter, more agile, more cost-effective way of dealing with integration would have a great deal of value to us. The Semantic Web is not guaranteed to be the solution, but it looks plausible and we’d like to see if it lives up to its promise in practice.

I also noted that Roger Cutler, Research Consultant at Chevron Information Technology Company, talks about the "expressiveness and reasoning achievable with OWL". I like that because I sometimes hear comments along the lines that OWL, and OWL2, are too complex and maybe not so useful in an industrial setting. In the interview Roger says:

We have demonstrated a case in which similar objectives were obtained in the context of an ontology with about fifteen lines of readily comprehensible rules and in a relational database context with over 1000 lines of pretty complex code.

I also see that there is a W3C Oil, Gas and Chemicals Business Group, which includes a representative from Statoil, Jennifer Sampson. And I now also see an interesting case study presented by Jennifer at the SemTech conference in San Francisco: Semantic Technologies and Statoil's Integration Layer for Plant Information Systems.

Side note: Semantic technologies in Norway
The Statoil presentation looks really interesting and is a trigger for me to catch up with how semantic technologies are used in Norway. I have been thinking about that for some time. I visited Statoil's office in Stavanger a couple of years ago to talk about metadata standards. And I see some interesting signals that semantic technologies are much more widely used in Norway than in Sweden.

Side note: Topic Maps and Published Subject Identifiers (PSI:s)

Back in 2002, before the OWL standard existed and the Linked Data principles were defined, I supervised a master's thesis with an Evaluation of Topic Maps for information navigation in cardiovascular research. Topic Maps is a semantic technology that has a strong presence in Norway. The master's students I supervised worked together with Steve Pepper, the Topic Maps guru. A key learning I took away from some really good discussions back in 2002 with Steve, and also Lars Marius Garshol (@larsga), was the idea of Published Subject Identifiers (PSI:s). In a future blog post I will do a recap of PSI:s and try to relate them to today's http-based URI:s as one of the Linked Data principles.

Kudos to Bernadette Hyland (@BernHyland) and Dave Smith (@DruidSmith)
for their #semtechbiz tweets. And also to @semanticweb for the great news service:
"Voice of Semantic Web Technologies and Linked Data Business" and to the @W3C blog.

Sunday, October 23, 2011

Query Federation and Linked Closed Data

This is a blog post with highlights from the 10th International Semantic Web Conference taking place in Bonn, which I picked up while following the event from a distance.

>> Updated 2 November with a presentation by Peter Haase on Fedbench, see below. And also with this nice blog post by Ivan Herman (the leader of W3C's Semantic Web work): Some notes on ISWC2011…

>> Updated 13 November with a link to a paper on federated search in life science, see below.

Today and tomorrow, Sunday - Monday 23-24 October, are the workshop days, with 16 workshops arranged before the main conference. Through the day I have on and off been catching up on the busy Twitter feeds on my iPhone while being out walking in the nice weather on the West Coast of Sweden. And now in the evening I have picked up two things that I found especially interesting from an enterprise perspective on linked data and URI:s.

Query Federation

Being able to do federated querying of data from different internal and external data sources is a key capability in an enterprise context. An interesting paper presented in the Consuming Linked Data Workshop describes how this can be done using the VOID standard (Vocabulary of Interlinked Datasets): SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions by Olaf Görlitz and Steffen Staab.

Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions

The paper uses a scenario in which researchers in the life science domain have numerous databases at hand which contain detailed information about pathways, genes, proteins, drugs and so forth. It describes and evaluates a framework called SPLENDID, which includes an Index Manager, a Query Optimizer, and a Query Executor.
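To make the idea of description-based source selection a bit more concrete, here is a minimal Python sketch. It is not SPLENDID's code: the endpoint URLs, the `ex:` predicates and the `select_sources` helper are all made up for illustration, and the "index" is just a hand-written stand-in for the kind of per-dataset predicate statistics that a VOID description can publish.

```python
# Hypothetical VOID-style summaries: endpoint URL -> predicates the dataset uses.
# All URLs and predicate names are invented for this sketch.
VOID_INDEX = {
    "http://example.org/drugs/sparql": {"ex:drugName", "ex:targets"},
    "http://example.org/proteins/sparql": {"ex:proteinName", "ex:encodedBy"},
    "http://example.org/genes/sparql": {"ex:geneSymbol", "ex:encodedBy"},
}


def select_sources(triple_patterns, index=VOID_INDEX):
    """Map each (subject, predicate, object) pattern to the endpoints worth asking."""
    plan = {}
    for pattern in triple_patterns:
        predicate = pattern[1]
        if predicate.startswith("?"):
            # Unbound predicate: the summaries cannot rule out any endpoint.
            plan[pattern] = sorted(index)
        else:
            plan[pattern] = sorted(
                endpoint for endpoint, predicates in index.items()
                if predicate in predicates
            )
    return plan


# A two-pattern query: which genes encode the proteins targeted by a drug?
query = [
    ("?drug", "ex:targets", "?protein"),
    ("?protein", "ex:encodedBy", "?gene"),
]
plan = select_sources(query)
for pattern, endpoints in plan.items():
    print(pattern, "->", endpoints)
```

The point of the sketch is only that a federation engine can avoid sending every sub-query to every endpoint when the datasets describe themselves. A real query optimizer would also use statistics to order joins, which is left out here.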

A takeaway highlighted in the Twitter feed from the presentation of the paper is the value of publishing VOID data for Linked Data sets. The paper also includes references to an interesting product that I have spotted in other tweets earlier on: FedX, a framework for transparent access to Linked Data sources through a federation using optimization techniques. See also a recent discussion thread: SPARQL Federated Query Clients, on W3C's Linked Open Data email list. I also find this paper from 2009 by key people in W3C's interest group for semantic web in life science highly relevant: A journey to Semantic Web query federation in the life science.

Later on in the conference a research paper was presented that addresses the topic of central repositories vs. federated querying and processing:

Fedbench - A Benchmark Suite for Federated Semantic Data Processing

Linked Closed Data

In an enterprise context the value of transparency and of open sharing of data is gaining more and more recognition. By applying the Linked Data principles corporations can enable meaningful use of data. See my previous blog post on Corporate Transparency and Linked Data. At the same time there are of course datasets for which access to and use of the data is subject to legal, business, data privacy or ethical restrictions which go beyond attribution and share-alike obligations.

A vision paper presented at the Consuming Linked Data Workshop outlines A research agenda for Linked Closed Data by Marcus Cobden, Jennifer Black, Nicholas Gibbins, Les Carr, and Nigel Shadbolt. The authors define Linked Closed Data as Semantic Web datasets which are published in accordance with Linked Data principles, but which include access and licence restrictions.

I was glad to see that Ivan Herman in his blog post also highlights this: "we can and we should speak about Linked Closed Data alongside Linked Open Data is important if we want the Semantic Web to be adopted and used by the enterprise world as well."

Kudos to @juansequeda and @ivan_herman for your great tweets today.

(While writing this blog post I can see on Twitter that the folks at the conference in Bonn are now having a linked data gathering and getting ready to play "#semanticbeerpong" :) For me it's time for a cup of tea instead ...)

Monday, September 19, 2011

Semantic Interoperability in 4 tweets

Q: Semantic interoperability? A: Data & context formated & organised for computers & humans to use it in new & meaningful ways.
— Kerstin Forsberg (@kerfors) August 28, 2011

Q: Data Context? A: Explicit semantics & provenance making the aboutness and history of data transparent.
— Kerstin Forsberg (@kerfors) August 28, 2011

Q: Data formated for computers? A: Data directly usable by applications, ready for transformation, querying and reference.
— Kerstin Forsberg (@kerfors) August 28, 2011

Q: Data organised for humans? A: Data you can click on. Data with affordance making it easy to use it new & meaningful ways
— Kerstin Forsberg (@kerfors) August 28, 2011

Thursday, September 1, 2011

ICBO2011, Disease terminologies and ontologies

This is my fourth blog post from the International Conference on Biomedical Ontology (ICBO) 2011, in Buffalo, NY. This time I will focus on disease vocabularies. In earlier blog posts I have highlighted the differences between two types of vocabularies:

Vocabularies of terms for concepts organized as terminology hierarchies (e.g. SNOMED CT) or classification systems (e.g. ICD and MedDRA), being used as coding nomenclatures for diseases, or rather diagnoses, in EHR, clinical trials and patient safety databases.
Vocabularies of terms for types of entities in reality, and of the relationships between such entities, structured in ontologies according to the best current scientific understanding of physiological and pathological processes.

In my previous blog post from ICBO I listed examples of high quality, "true", ontologies, and also different approaches to manage "Mapping mania" for the legacy of terminologies. See also another blog post that describes very well how terminologies relate to ontologies, and also to information models etc.: Why Do We Need Ontologies in Healthcare Applications.

In this blog post I use a review of a common terminology, that is SNOMED CT, and the Mental Disease Ontology under development, as examples to highlight problems and potentials with these two types of vocabularies.

Terms for concepts organized as terminology hierarchies
While working on this blog post I saw a posting on Google+ (the new social media tool, excellent for online discussions) pointing to a recent report on the practical use of SNOMED CT in a commercial clinical system focused on cardiovascular and respiratory diseases, and diabetes mellitus. The Google+ posting came from Alan Ruttenberg, one of the key people in the biomedical ontology (OBO) community and organiser of the ICBO event. The main author of the paper Alan pointed to, Getting the foot out of the pelvis: modeling problems affecting use of SNOMED CT hierarchies in practical applications, is Alan Rector, one of the key people in the biomedical terminology community.

Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) is now mandated in the USA, UK, and several other countries for coding of clinical problems in EHR. The SNOMED identifiers, codes such as 38341003 for the term 'hypertensive disorder', provide a stable reference point for coding of diagnoses. And it is one of the key terminologies in the EHR4CR IMI-project, for example when querying EHR data for protocol feasibility.

"When doctors apply SNOMED codes to a patient, they are stating that those codes and all their ancestors in the hierarchy apply to that patient. When researchers use codes in queries, they are querying for those codes and all of their descendants."

Source: Getting the foot out of the pelvis: modeling problems affecting use of SNOMED CT hierarchies in practical applications, J Am Med Inform Assoc. 2011 July; 18(4): 432–440.
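The quoted rule about ancestors and descendants can be illustrated with a small Python sketch. Only the code 38341003 (hypertensive disorder) is taken from the text above. The other identifiers and the shape of the hierarchy are invented for illustration and do not reflect the real SNOMED CT content.

```python
# Toy is-a hierarchy: child code -> set of parent codes.
# Only 38341003 comes from the text; every "X-..." identifier is made up.
IS_A = {
    "X-VASCULAR": set(),
    "38341003": {"X-VASCULAR"},          # hypertensive disorder
    "X-ESSENTIAL-HT": {"38341003"},
    "X-SECONDARY-HT": {"38341003"},
    "X-RENAL-HT": {"X-SECONDARY-HT"},
}


def ancestors(code):
    """All codes reachable by following is-a links upwards."""
    seen = set()
    stack = list(IS_A.get(code, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, ()))
    return seen


def descendants(code):
    """All codes that have `code` somewhere among their ancestors."""
    return {other for other in IS_A if code in ancestors(other)}


def matches_query(recorded_code, query_code):
    """A record matches a query code if it is that code or one of its descendants."""
    return recorded_code == query_code or query_code in ancestors(recorded_code)


# Coding a patient with X-RENAL-HT implicitly asserts all of its ancestors.
print(sorted(ancestors("X-RENAL-HT")))
# Querying for hypertensive disorder retrieves all of its descendants.
print(sorted(descendants("38341003")))
```

Seen this way it is clear why errors in the hierarchy matter: a single misplaced is-a link changes both what a coded record asserts and what a query returns.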

The article lists, and exemplifies, the major types of problems when using SNOMED-CT hierarchies. It also illustrates existing hierarchies, for example for hypertensive disorder (A) and a suggested revised hierarchy for Hypertension (B).

The authors’ conclusion is quite tough:

“… anyone using SNOMED codes should exercise caution. Errors in the hierarchies, or attempts to compensate for them, are likely to compromise interoperability and meaningful use.”

Terms for types of entities in reality structured in ontologies

In preparation for the conference I studied one of the disease area ontologies under development: Mental Disease Ontology. I do not have any medical insights into this disease area, but became interested in it because it uses the Ontology of General Medical Science (OGMS).

Source: Toward an Ontological Treatment of Disease and Diagnosis

OGMS is a so-called mid-level ontology. The objective for it is to support research on Electronic Health Record (EHR) technology and integration of clinical and research data. My interest in OGMS started at the Clinical Trial Ontology workshop at the NIH Campus in Bethesda, MD, in 2007. When the OBO community took the insights and best practice from developing large biology ontologies (such as the Gene Ontology and the Protein Ontology), within the framework called OBO Foundry, into the clinical space, a couple of things were often confused:

The process of observing, the results of the observation and what is being observed
Disorders and diseases on the one hand and diagnoses on the other

To address these, and other confusions, the development of OGMS started.

"OGMS comprises representations of highly general universals in the domains of anatomy, physiology and pathology, of diagnosis and treatment, and of information artifacts such as clinical histories and lab test results.”

From the paper: Research Foundations for a realist ontology of mental disease, authored by Barry Smith and Werner Ceusters, two of the key people in the biomedical ontology (OBO) community. In this paper the authors describe how the development of an ontology for mental disease addresses the need for acceptable definitions for 'mental disorder', 'disease' and 'illness' as it has been called out in the research agenda for the new edition (DSM-V) of the Diagnostic and Statistical Manual, scheduled for release in May 2013.

The authors define three different lists of types of entities according to the best current scientific understanding in the domain of mental diseases:

Mental health related entities that can exist in the absence of any mental disorder, using terms to denote these entities such as behavior and interpersonal process
Mental disorder related core entities, e.g. using terms to denote these entities such as pathological mental process and mental disease course
Diagnosis related core entities using terms to denote these entities such as disease picture components and collection of marker features for disease X (e.g. Diagnostic Criteria for Asperger's Syndrome and for ADHD)

I find this statement of the authors highly interesting:

“We do not suggest that all the terms proposed in the above should be used by clinicians, although moves in this direction would help to make medical jargon less ambiguous (while at the same time potentially bringing other costs). What is more important is a broad recognition of the existence of the types of entities denoted by these terms, since without this broad recognition we will not achieve the sort of terminological clarity that is needed for computational purposes such as integration of mental health data with biological and other sorts of data. Finding better terms for the entities in question is, in this light, a secondary issue.”

Some reflections

As outlined in one of my earlier blog posts in preparation for ICBO, I hoped to better understand the emerging trend of well-designed “true” ontologies, and at the same time understand how legacy terminologies, such as SNOMED CT, and data coded with their aid can be successfully used for information-driven clinical and translational research. By attending ICBO I have got a much better understanding of the problems and potentials of the two different types of vocabularies. However, I still struggle to understand how to combine them in the short and long term.

Kudos to @alanruttenberg for a great ICBO conference and for the Google+ posting,
and also to @jamoussou for the great blog post on why we need ontologies.

Wednesday, August 31, 2011

Ideas on Linked Open Transportation Data for TravelHack

Earlier this summer I saw some tweets about a nice event here in Gothenburg: West Coast TravelHack 2011, 8-9 October. As I am a daily commuter (with Västtrafik's trams, buses and trains) and an information architect addicted to the linked data idea, and I also have a background as researcher in mobile informatics, I got two ideas and wrote them up as tweets (tweet 1 and tweet 2)

Today, I saw some tweets linking to two articles about the interesting FixMyTransport:

mySociety launches FixMyTransport.com, Open Knowledge Foundation Blog
How to create sustainable open data projects with purpose, O'Reilly Radar

Looking for hackers

I was reminded of my two ideas and also of my time as a part-time industrial PhD researcher. My research in the Mobile Informatics group, at the Victoria Institute and IT University, concerned the mechanisms needed to provide highly mobile professionals, such as news journalists, with contextualized information using mobile applications: "Mobile Newsmaking" (thesis, presentation)

So, I posted a tweet about FixMyTransport in Swedish and Karl-Petter Åkesson (@kallep), an old friend from my time as part-time researcher, kindly replied and said in his tweets back (tweet 1 and tweet 2): Why not get together with a couple of hackers and show how your ideas for a linked data infrastructure could enable nice apps and services for commuters. Great, I tweeted back -- but I don't know that many great hackers, as it's ten years since I did my research on mobile applications.

So, now I am looking for some great hackers to potentially explore my ideas on Linked Open Transportation Data at the TravelHack event 8-9 October.

Give every bus stop, tram route and train station etc. a URI

Identify "things" globally by using http based URIs (Uniform Resource Identifiers) - today all public schools, roads, ministers, and many bus stops, in the UK have URIs.

For example the URI http://transport.data.gov.uk/id/stop-point/1800SJH1081 identifies a bus stop in Manchester. Assigning an http based URI is what the first two principles of Linked Data say.

The third principle says that you should provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML. So, if you put this URI http://transport.data.gov.uk/id/stop-point/1800SJH1081 in a web browser it will give you a nice HTML page documenting the metadata that describes the bus stop. An app or service could choose between, for example, an RDF/XML file or a JSON file. See my Linked Data page for some nice videos, books, blogs etc.
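The way an app would "choose" a format is usually HTTP content negotiation: the client states the format it wants in the Accept header. Here is a minimal Python sketch using only the standard library. The `build_request` helper and the format table are my own illustration, and I am assuming the server honours these media types, which I have not verified. The sketch only builds the request and does not touch the network.

```python
from urllib.request import Request

# The bus stop URI from the text above.
STOP_URI = "http://transport.data.gov.uk/id/stop-point/1800SJH1081"

# Standard media types; whether a given server supports each one is an assumption.
ACCEPT = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "json": "application/json",
}


def build_request(uri, fmt):
    """Build a GET request asking for one representation of the thing behind the URI."""
    return Request(uri, headers={"Accept": ACCEPT[fmt]})


req = build_request(STOP_URI, "rdfxml")
print(req.full_url, req.get_header("Accept"))

# To actually dereference the URI (needs network access and a live server):
#   from urllib.request import urlopen
#   with urlopen(req) as response:
#       body = response.read()
```

The same URI is used in every case. It is only the Accept header that changes, which is what lets one identifier serve both people in browsers and apps that want data.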

Use a common vocabulary for transportation

And all these "things" can also be typed, described and linked using classes, properties and relationships from a range of vocabularies for different domains.

For the transportation domain I have seen some nice tweets pointing me to TRANSIT: A vocabulary for describing transit systems and routes.

There are also many general vocabularies and ontologies that are commonly used to publish linked data. You can 'cherry-pick' from some of the most common: for example, Friend-of-a-Friend (FOAF) provides terms for describing people and their social network, SIOC (Semantically-Interlinked Online Communities) covers online communities, and Dublin Core defines general metadata attributes.
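As a small illustration of typing and describing a "thing" with terms cherry-picked from several vocabularies, here is a Python sketch that writes two statements about the bus stop as N-Triples. The title text is invented, and the TRANSIT namespace and its Stop class are my assumption of how that vocabulary names things, so check the vocabulary itself before relying on them.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"
TRANSIT = "http://vocab.org/transit/terms/"  # assumed namespace for TRANSIT

STOP = "http://transport.data.gov.uk/id/stop-point/1800SJH1081"

# Each statement is (subject, predicate, (kind, value)).
triples = [
    (STOP, RDF_TYPE, ("uri", TRANSIT + "Stop")),                   # typed with TRANSIT (assumed class)
    (STOP, DCTERMS + "title", ("literal", "Example bus stop")),    # described with Dublin Core
]


def to_ntriples(statements):
    """Serialise statements as N-Triples, one statement per line."""
    lines = []
    for subject, predicate, (kind, value) in statements:
        if kind == "uri":
            obj = "<" + value + ">"
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        lines.append("<" + subject + "> <" + predicate + "> " + obj + " .")
    return "\n".join(lines)


output = to_ntriples(triples)
print(output)
```

Because every term is itself a URI, terms from different vocabularies can be mixed freely in the description of one and the same bus stop.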

Kudos to @peterkz_swe, @egonwillighagen, @wieselgren, @kallep
for nice interactions on Twitter inspiring me to write this blog post

Thursday, August 18, 2011

A prediction 3-5 years from now

Making predictions can be tricky. However, a former colleague, and actually also my manager for a short time, Jean-Peter Fendrich (@carokanns) recently published a few predictions 3-5 years from now in the LinkedIn group Volvo IT Innovation Centre

Inspired by iPad and its competitors there will come a new device that replaces the Laptop as we know it now.
Html5 will make all these app's and app technologies obsolete.
We will finally have standards and infrastructure that support "mobile wallet" - replacing cash, credit cards and other payment systems.

JP asked for feedback and more predictions, so I posted the following:

Globally identified "things" using http based URIs (Uniform Resource Identifiers) - today all public schools, roads, ministers, and many bus stops, in UK have URIs.
And all these "things" will also be typed, described and linked using classes, properties and relationships from a range of vocabularies/ontologies for different domains, see for example TRANSIT: A vocabulary for describing transit systems and routes.

(I also referred to a recent report from Booz&co with the title: Designing the Transcendent Web: The Power of Web 3.0. )

As JP and I, together with Martin Börjesson (@futuramb), Annika Eriksson, Christian Forsäng and Else-Marie (Emma) Malmek, were some of the folks introducing the first generation of web technology (Web 1.0) in the Volvo organisation back in the mid-90s, it was nice to highlight the third generation (Web 3.0) in this Volvo IT group.

The focus in my blog postings and tweets the last year or so has been on two of the foundations for Web 3.0, i.e. the Linked Data principles and in particular the use of http based URIs. For more details, see one of my first blog posts: Corporate Transparency and Linked Data. See also my list on URI Design that I try to keep updated.

"Data is the new electricity. URIs are the conduction mechanism."
Quote by Kingsley Uyi Idehen (@kidehen)

Tuesday, August 9, 2011

ICBO2011 Reports

The last week in July three colleagues and I attended the International Conference on Biomedical Ontology (ICBO) 2011, in Buffalo, NY. As I have been a "remote hang-around" on Twitter following other conferences from a distance (see for example my blog post following the SemTech conference earlier this summer) it was great fun this time to be active on Twitter IRL in Buffalo: My #ICBO2011 tweets

And yes, I did see the Niagara Falls again -- this time I did get really close to them on a boat tour with the "Maid of the Mist".

Now, after a long journey home, and a couple of relaxing days on the Swedish west coast and in central London, it's time to use my tweets, the conference presentations and proceedings (pdf) to pull together some of my insights and learnings. Here's my first report with some notes and reflections from the conference and follow up to my previous blog posts in preparation for the conference (part 1 and part 2). See also my fourth blog post from ICBO published 1 September.

High quality, "true", ontologies
It was nice to see presentations and read papers on ontologies from a broad spectrum of domains, such as:

Genes
See a recent paper: How the Gene Ontology Evolves, describing the ways in which curators of the Gene Ontology (GO) have incorporated new knowledge.
Protein complex and supra-complex
See the presentation on this topic in the panel the first day: From proteins to diseases, by Bill Crosby (Department of Biological Sciences, University of Windsor)
Emotions and Chronic pain
See the presentation and paper on how to represent emotions based on research in affective disorders such as bipolar, depression and schizoaffective disorder, by Janna Hastings, (European Bioinformatics Institute, UK, and, Swiss Centre for Affective Sciences, University of Geneva, Switzerland). See also the announcement of the development of an ontology for Chronic pain and a nice video: Toward a New Vocabulary of Pain.
Demographics
See the presentation describing how "demographic data in current information systems is ad hoc, and current standards are insufficient to support accurate capture and exchange of demographic data", and the proposed use of the Demographics Application Ontology as a solution.
Adverse Events
In the workshop on representing adverse events we learned about interesting work on adverse event ontologies. (See a video of the workshop organizer Mélanie Courtot: Towards an Adverse Event Reporting Ontology). We also learned about the development of ontologies to represent temporal relationships (e.g. Clinical Narrative Temporal Relation Ontology), which are a key aspect in handling safety issues and regular ongoing pharmacovigilance in pharmaceutical research and development.

All of these are examples of high quality "true"1) and modular ontologies developed beneath the Basic Formal Ontology (BFO) providing formal definitions for types of entities in reality and for the relationships between such entities (so called ontological realism). Such ontologies are designed to allow annotations of experimental and clinical data "to be unified through disambiguation of the terms employed in a way that allows complex statistical and other analyses to be performed which lead to the computational discovery of novel insights"2).

My own reflections:
So far we have seen none, or very little, uptake of such high quality "true" ontologies for clinical data. Something I also highlighted in my earlier blog post on clinical data standards. In a coming blog post I will present a demo using the Demographics Application Ontology showing how a high quality "true" ontology can be used to support accurate capture and exchange of demographic data. I will also outline some ideas on how this could be used also for clinical study data (CRF:s and databases).

"Mapping mania" for the legacy of terminologies

A common theme in several of the presentations, papers and panels was the mappings (matching, alignment) needed between terms and concepts organized as terminologies and coding nomenclatures, such as SNOMED CT, LOINC, ICD, CDISC SDTM CT:s (derived from NCI Thesaurus), and MedDRA. Here are some examples:

Extraction of the anatomy value set from SNOMED CT to be reused for the 11th revision of the International Classification of Diseases (ICD-11). See a presentation on the problems and proposed patterns by some well known people (Harold Solbrig and Christopher Chute at Mayo Clinic, Kent Spackman working for IHTSDO, and Alan L. Rector at University of Manchester)
The Ontology Alignment Evaluation Initiative (OAEI) was mentioned by several presenters as a forum to discuss the problems of direct matching between different terminological resources.
The use of an ontology matching tool called AgreementMaker was presented.
In a panel on: National Center for Biomedical Ontology (NCBO) Technology in Support of Clinical and Translational Science, the basic lexical term mappings were mentioned as an example of a service available both via BioPortal's graphical interface and as REST services.

These are all examples of a legacy already in use, or in the process of being used, for the annotations of EHR, clinical trials and patient safety data, for example in the huge US initiative on meaningful use of EHR as highlighted by Roberto Roch in his keynote on Practical Applications of Ontologies in Clinical Systems.

My own reflections:
In my previous blog post preparing for the conference I referred to the mapping problem as comparing "Apples and Oranges" and sometimes I think of it as a "mapping mania". In the conference I did hear the comment "Mappings are hard" several times, and also the question "Who will create, validate and maintain all the mappings?"

After some more days of vacation I will get back later on in August with more notes and reflections from the conference:

I will report from the debate on how to accurately connect data from measurements and questionnaires (information entities) to ontologies (real world entities). I think this is a key aspect to get machine-processable clinical data ready for automatic transformation and direct querying, and ready for inferencing and reasoning.
Another theme I would like to cover is referent tracking, i.e. assigning globally unique identifiers to each entity in reality about which information is stored, for example diagnoses, procedures, demographics, encounters, hypersensitivity, and observations as they are reported in EHRs. This is something I think is a key enabler for accurate secondary use of EHRs.
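To give a feeling for what assigning an identifier to each individual entity could mean in practice, here is a very small Python sketch. It is my own simplification and not an actual referent tracking system: the `ReferentRegistry` class, the system names and the local keys are all hypothetical, and random UUIDs stand in for whatever identifier scheme a real system would use.

```python
import uuid


class ReferentRegistry:
    """Hands out one globally unique identifier per individual entity (hypothetical)."""

    def __init__(self):
        self._ids = {}  # (source system, local key) -> identifier

    def identify(self, source_system, local_key):
        """Return the identifier for this entity, minting one on first sight."""
        key = (source_system, local_key)
        if key not in self._ids:
            self._ids[key] = uuid.uuid4().urn
        return self._ids[key]


registry = ReferentRegistry()

# Two records that talk about the same particular diagnosis share one identifier.
first = registry.identify("ehr", "patient-17/diagnosis-1")
again = registry.identify("ehr", "patient-17/diagnosis-1")

# Another patient's diagnosis gets its own identifier, even if both diagnoses
# were coded with the same terminology code.
other = registry.identify("ehr", "patient-42/diagnosis-1")

print(first == again, first == other)
```

The distinction the sketch tries to show is between a code for a type of thing (such as a SNOMED code) and an identifier for one particular instance of it. Secondary use of EHR data needs both.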

1) See More than Words: Biomedical Ontologies

2) See Dispositions and Processes in the Emotion Ontology
