Open W3C standards like SKOS offer a great opportunity to combine corporate information with Internet-based resources

Dr. Horst Baumgarten has worked for Roche for almost 25 years. He was head of Information Management at Roche Professional Diagnostics for 15 years and became head of Scientific Information Technologies in 2010. His “mantra” is simple: “I want to help my colleagues at Roche find the information they need for their daily work more easily”. Baumgarten is convinced that the general data access problem can no longer be solved with traditional approaches. Instead, he has started to implement semantic technologies at Roche. He is optimistic that these new technologies and concepts will at least provide some “pain relief” by making relevant data findable almost instantaneously.

The PoolParty team had the chance to talk with Dr. Baumgarten about Roche's activities in the area of semantic technologies and Linked Data.

What is the purpose of your thesaurus project?

Simply put, to make the many diverse glossaries within the Roche Intranet more accessible than they are today. As you may imagine, a company with more than 80,000 employees has a huge Intranet, and it is sometimes hard to find the information you need in general and glossary information in particular. We therefore want to provide a central entry point for as many glossaries as possible. They come from very different sources, such as Quality Management, IT and Global Finance.

Why did you choose thesauri to organize your information?

We face very different interpretations of terms like Thesaurus, Ontology, Glossary, Abbreviation list and many more. We therefore decided to talk only about a “Glossary for Roche”, which contains a mixture of data sources. The technical solution, however, is to manage this variety of data with just one thesaurus management tool. We are starting with abbreviation lists and glossaries that include many synonyms, and we will develop them into a true thesaurus in the very near future.

What kind of problems are you able to solve with this approach?

The primary problem is a universal one: not finding the information you need quickly. We want to drastically reduce search time, at least for this very specific topic, the glossaries at Roche. To be honest, this can only be “pain relief”, because it is impossible to make all relevant data in a company as big as Roche available “at your fingertips”.

What role do SKOS and/or Linked Data play in achieving your goals?

Having worked with IT topics for more than 30 years, I am aware of the benefits of (open) standards, and SKOS is one of them. This makes Roche less dependent on a particular supplier and gives us access to a wealth of existing data from the Internet. For example, Linked Data from the Gene Ontology or DBpedia can easily be linked to Roche-internal data. This provides far richer information than was possible before Linked Data.
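To illustrate the kind of linking described in this answer, here is a minimal sketch of a single glossary entry expressed in SKOS and mapped to a DBpedia resource. It is not Roche's or PoolParty's implementation: the namespace `http://example.org/glossary/`, the entry and its labels, and the function name `concept_to_turtle` are all hypothetical, and the choice of `skos:exactMatch` as the mapping property is an assumption. The sketch does no escaping of special characters in labels.

```python
# Minimal sketch: serialise one glossary entry as SKOS in Turtle syntax.
# The namespace, the entry and its labels are hypothetical examples.

INTERNAL_NS = "http://example.org/glossary/"  # hypothetical internal namespace


def concept_to_turtle(local_id, pref_label, alt_labels, exact_matches):
    """Return a Turtle snippet describing one skos:Concept."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{INTERNAL_NS}{local_id}> a skos:Concept ;",
        f'    skos:prefLabel "{pref_label}"@en ;',
    ]
    # Synonyms and abbreviations become alternative labels.
    for label in alt_labels:
        lines.append(f'    skos:altLabel "{label}"@en ;')
    # Links to external Linked Data resources become mapping statements.
    for uri in exact_matches:
        lines.append(f"    skos:exactMatch <{uri}> ;")
    # The final statement ends with " ." instead of the trailing " ;".
    lines[-1] = lines[-1][:-2] + " ."
    return "\n".join(lines)


turtle = concept_to_turtle(
    local_id="polymerase-chain-reaction",
    pref_label="Polymerase chain reaction",
    alt_labels=["PCR"],
    exact_matches=["http://dbpedia.org/resource/Polymerase_chain_reaction"],
)
print(turtle)
```

Because the abbreviation is stored as an alternative label of the same concept, a search for either form can resolve to one entry, and the mapping statement gives that entry a path to externally published data.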

What are the most important values you generate for your stakeholders?

It is about combining content from many different homepages into just one application. Currently users face databases, PDF, Word and HTML files and many other formats, not to mention the sometimes long paths (many clicks) to reach the information they are looking for. With the “Glossary for Roche”, stakeholders will be able to publish their content not only on their own homepage but also at a central entry point with a very simple address like “glossary.roche.com” (for Roche-internal use only). Their information will thus be found much more easily than today.

What are the most important arguments for using Semantic Web standards and Linked Data, especially in the life science domain?

I am fascinated by the chance to access very different data sources, with the help of semantic technologies, in a dramatically more efficient way than in the past. As an example, my colleagues from Pharma Development have made more than 100 external ontologies available for internal purposes, and their colleagues from Diagnostics can use these ontologies too.

What kind of applications can be built or have been built on top of your thesauri?

You are kidding, there are too many… The “Glossary for Roche” is just one application, but others are already in progress, such as a product thesaurus for Diagnostics and an innovative (semantically enabled) new way to access “zillions” of homepage contents at Roche. And we are not only talking about setting up a thesaurus, but about USING it. Examples include Internet and Intranet searches and searches in departmental shares, which all use the same thesaurus in the background. This is a kind of “one size fits all” concept.

Why did you choose PoolParty to manage your thesauri?

PoolParty is simple to use and focuses on essential but powerful functions. The support is great and very open to ideas from customers. And the PoolParty guys have in-depth knowledge of semantics in general and of solving customer problems in the life science area in particular.

How do you manage to get your thesauri used? How are you going to build an “ecosystem” around your work?

Bringing a thesaurus to life is the same task as with any other application in Information Management: find out the needs of potential customers, find stakeholders, get the money to set up a pilot, develop the pilot and ask pilot users for feedback, modify the system to meet the “final” needs of the customers, market the productive application on every possible channel (always keeping in contact with the customers) and support the continuous improvement of the application. In other words, there is nothing special about thesauri.

What are your future plans and next steps?

Step 1: Use the experience from the “Glossary for Roche” project to carry out comparable “small” projects, which would be a rather easy job.

Step 2: Set up a product glossary for Roche Diagnostics, which would be the hard job, because it should provide a metadata level for many applications, from development through production to customer information. These systems run very well and are optimized for their specific purpose, but currently have limited interfaces to other systems. Thus it is not easy to retrieve information from different systems in a single step, which would be great for users.

I would be more than happy if we at Roche could implement such a product glossary within the next couple of years. This will be a demanding schedule for the implementation of such an innovative approach. And it will need a lot of colleagues who are as enthusiastic about this project and its opportunities as I am.