Towards a reuse-oriented methodology for ontology engineering

ELENA PASLARU BONTAS, MALGORZATA MOCHOL

Given the intrinsic meaning of the term “ontology” and the effort required by
ontology building, reuse and reusability are very important issues for
cost-effective and high-quality ontology engineering. While several
methodologies describing the reuse process have already emerged in the
Semantic Web community, the implications of reuse in concrete application
settings have not yet been examined to a satisfactory extent. In this paper we
analyze the costs and benefits of the reuse process on the basis of two case
studies, which attempt to build new ontologies in the domains of eRecruitment
and medicine by means of ontological knowledge sources available on the Web.

Introduction

Ontology engineering is already considered a mature discipline in the context
of the Semantic Web. A variety of methodologies and tools to build, manage
and merge ontologies have emerged in the last decades (Lopez, 2002). Most of
the currently available ontologies are, however, not aligned to a specific
methodology. They are rather the result of some ad hoc, application- and
domain-dependent engineering process. While it is generally accepted that
building ontologies from scratch is a challenging, time-consuming and
error-prone task, the development of new ontologies still does not tap the
full potential of the knowledge sources available on the Web.

Ontology reuse can be defined as the process in which existing (ontological)
knowledge is used as input to generate new ontologies. Depending on the
content of the knowledge sources and the degree to which they overlap, one
can distinguish between ontology merging and integration (Pinto and Martins,
2001). The limited reuse of ontologies currently available on the Web can be
explained by the difficulties of building reusable ontological sources, which
have to strike a balance between a rich conceptualization and application
specificity (Gruber, 1995), and by the fact that this issue has been poorly
explored in existing engineering methodologies. A general explanation of the
reuse process is given in (Uschold and King, 1995). (Pinto and Martins, 2001)
offers a detailed methodology for performing ontology integration, but leaves
out methods to decide on the circumstances under which a reuse-oriented
approach is profitable. Furthermore, to the knowledge of the authors, none of
the reuse methodologies provides tool support that would allow an (at least
partial) automation of the process.

In this paper we present our experiences in building domain ontologies on
the basis of existing ontological knowledge (i.e. building by reuse). We adapt
generic reuse methodologies to the current state of the art of the Semantic
Web in order to be able to apply them efficiently to generate two ontologies,
in the e-Recruitment and the medical domain, respectively. Finally, we analyze
the profitability of ontology reuse in the two use cases in terms of costs and
benefits.

The remainder of this paper is organized as follows: we describe our reuse
process model in Section “Our Approach to Ontology Reuse” and present its
application in two concrete Semantic Web projects in Sections “Case Study
Human Resources” and “Case Study Medicine”. After a general description of
the methodology employed to estimate and compare the costs and the benefits
of reuse (Section “A Cost Benefit Analysis of the Reuse Process”), we
concretize these aspects in the context of the empirical studies (Section
“Costs and Benefits of Reuse in the Case Studies”). The limitations of our
approach together with planned future work are the subject of Section
“Conclusions and Future Work”.

Our Approach to Ontology Reuse

Typically, ontology reuse starts with the identification of potentially
relevant knowledge sources. The candidate ontologies usually differ in
represented content and degree of formality (thesauri, XML schemas, UML
diagrams, etc.). Even when translation tools are available for some
representation formats [1], the result still requires human evaluation and
refinement. Given a common representation formalism, the source ontologies
have to be compared and possibly merged. For this purpose one needs a generic
schema-matching algorithm that can deal with the heterogeneity of the
incoming sources w.r.t. their structure, domain and application view upon the
domain.

Our approach copes with such limitations by proposing an incremental process
which concentrates on the concepts represented in the input sources and
subsequently takes into account additional information such as semantic
relationships and axioms, depending on the application needs (see Figure 1).

1. In fact large amounts of domain knowledge are encoded in thesauri like Cyc, UMLS (a
medical ontology containing over 300,000 concepts) using proprietary formats or even
natural language without any technical support for translation tools.

Our approach does not depend on any schema-matching algorithm but merges the
schemas depending on their degree of formality.

We start by considering the vocabulary of the sources (concepts, relations,
and axioms) and compute a common vocabulary depending on the natural language
in which the ontological primitives were originally named. Categorizing the
domain knowledge, which is implicitly stored and managed in experts’ minds,
and representing it explicitly can be significantly simplified by generating
a preliminary vocabulary of the domain, which serves as a starting point for
the domain experts for further refinements during the conceptualization.

Figure 1: The Reuse Process

The vocabulary contains several lists of potential ontological primitives: a
list of concept names, a list of properties and a list of axioms.

The candidate concepts are generated in the following steps. First, we merge
the source vocabularies, generate separate lists of ontological primitives
according to the degree of formality of the considered models, and eliminate
duplicates in order to avoid unnecessary computations. We then compute
syntactic similarities between concept names belonging to different source
ontologies (Cohen et al., 2003). To improve the accuracy of the similarity
computation we assume common naming conventions in ontology engineering [2]
and generate a bag of terms for each concept name in the source vocabularies.
Word stemming and stop-word elimination are applied to improve precision and
recall. As a result of this phase, similar bags of words are aggregated into
concept names.
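
The concept generation steps described above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the stop-word
list, the crude suffix stripper, the Jaccard measure on term sets and the
threshold of 0.8 are hypothetical stand-ins, not the string metrics or
settings used in the case studies.

```python
import re

# Hypothetical stop-word list; a real setup would use a fuller, language-specific one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_terms(concept_name):
    """Turn a concept name such as 'Programming_Skills' into a set of stemmed terms."""
    # Words are assumed to be delimited by space, underscore, hyphen or capitalization.
    spaced = re.sub(r"[_\-\s]+", " ", concept_name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    words = spaced.lower().split()
    return frozenset(naive_stem(w) for w in words if w not in STOP_WORDS)


def jaccard(a, b):
    """Jaccard similarity of two term sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def aggregate_concepts(sources, threshold=0.8):
    """Merge the source vocabularies and group concept names with similar bags.

    sources maps a source name to its list of concept names. Each returned
    cluster records the representative bag, the names and the contributing sources.
    """
    clusters = []
    for source, names in sources.items():
        for name in names:
            bag = bag_of_terms(name)
            for cluster in clusters:
                if jaccard(bag, cluster["terms"]) >= threshold:
                    cluster["names"].add(name)
                    cluster["sources"].add(source)
                    break
            else:
                # No sufficiently similar cluster yet: start a new one.
                clusters.append({"terms": bag, "names": {name}, "sources": {source}})
    return clusters
```

With this sketch, names such as "ProgrammingSkills" and "programming skill"
from two different sources end up in the same cluster, because both reduce to
the same bag of terms.
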
We compute the ranking of the concepts by considering frequencies (con-
cept names occurring in several source ontologies are ranked higher), source
priority (relevance measure of the corresponding source to the target applica-
tion domain) and application requirements.
After identifying relevant concepts, the user selects the relevant relation-
ships, which can be added incrementally to the ontology until a certain level
of complexity has been achieved.
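
The ranking step described above could be sketched as below. The text names
the three ranking criteria but not how they are combined, so the scoring rule
(sum of the priorities of the sources containing a concept, plus a fixed bonus
for concepts named in the application requirements) and all weights are
assumptions made for illustration.

```python
def rank_concepts(occurrences, source_priority, required=frozenset(), bonus=1.0):
    """Rank concepts by frequency, source priority and application requirements.

    occurrences maps a concept name to the set of sources it occurs in.
    source_priority maps a source name to a relevance weight (hypothetical scale).
    required is the set of concepts explicitly demanded by the application.
    """
    scores = {}
    for concept, sources in occurrences.items():
        # A concept found in several sources accumulates one priority per source,
        # so frequency and source priority both raise the score.
        score = sum(source_priority.get(s, 0.0) for s in sources)
        if concept in required:
            score += bonus
        scores[concept] = score
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    ranking = rank_concepts(
        {"Skill": {"KOWIEN", "HR-XML"}, "Occupation": {"SOC"}, "Industry": {"WZ2003"}},
        {"KOWIEN": 1.0, "HR-XML": 0.5, "SOC": 0.8, "WZ2003": 0.4},
        required={"Industry"},
    )
    for concept, score in ranking:
        print(concept, score)
```
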

The presented methodology, though relatively straightforward and not tapping
the full potential of the newest approaches in ontology matching/merging, has
proved to be very useful and cost-saving in the application domains presented
below, since it does not have to cope with the limitations related to the
heterogeneity of the available source ontologies. Such heterogeneity issues
currently make the automatic use of matching techniques a tedious and
error-prone process.

A Cost Benefit Analysis of the Reuse Process

As mentioned in the previous sections, reusing existing knowledge for
ontology engineering is today complicated by serious technical problems that
should not be underestimated when deciding how to build a specific
application ontology (i.e. from scratch, by reuse, using ontology learning
techniques, or a combination of the three).

As in software engineering, reusing an existing component implies costs for
its discovery, comprehension, evaluation, adaptation and updating. While most
ontologies that emerged in the last decades can be accessed on the Web (i.e.
the costs for ontology discovery are relatively low), these ontologies show
significant differences w.r.t. intrinsic and extrinsic features. In the first
category we mention the representation language, the modeled domain, the view
upon the domain, the granularity and the degree of formality. The second
category contains features such as maturity, development stage or underlying
methodology. This variety severely complicates the evaluation process and is
therefore a major cost factor. The customization of the source ontologies
with regard to a given set of requirements requires tools for their
translation, comparison and merging. Existing tools, though containing
valuable ideas and techniques, currently lack real-world practice and are
usually confined to specific domains, representation languages or ontology
types (e.g. taxonomies), and are thus not able to deal with this
heterogeneity (Do et al., 2002).

2. Ontological primitives are denominated by complex phrases, in which single
words are capitalized or delimited by space, underscore etc.

The benefits of reuse have been recognized in numerous engineering
disciplines, including ontology engineering. Besides implementation cost
savings, an important advantage in this case is interoperability. For
ontologies, interoperability is achieved on the syntactic and the semantic
level. While the former can be provided by using commonly agreed interfaces,
semantic interoperability assumes the use of explicitly and formally defined
domain models.

Each of these aspects is relevant for the engineering team’s decision whether
to build an ontology anew or to generate it from existing sources. Ideally, a
cost benefit analysis should be supported by means to quantify and compare
the mentioned factors. The costs of ontology building and the cost savings
due to ontology reuse can be estimated using appropriate cost models (see
below). However, the advantages achieved through increased interoperability
can hardly be expressed in a reliable quantitative manner. In the following
we give a brief description of ONTOCOM (Paslaru and Mochol, 2005), a cost
model that aims at predicting the costs (expressed as effort in person months
or as duration) involved in developing an ontology. ONTOCOM is applied in the
cost benefit analysis of the case studies to estimate the presumed cost
savings induced by reusing existing ontologies.

A first step towards a cost estimation model for ontology engineering is the
definition of an appropriate process model. After analyzing general-purpose
cost estimation methodologies (Stewart, 1995; NASA, 2004) in terms of their
suitability for ontology engineering, we elaborated a methodology to realize
a cost model for this field. In a first phase a top-down approach is applied
to identify the cost-intensive sub-tasks of the project, in our case ontology
building. Then a parametric method similar to those adopted in related
disciplines such as software engineering (Boehm et al., 1997) is employed to
define a cost calculation formula. The model is finally refined using the
expert judgment method (the Delphi method (Linstone and Turoff, 1975)), which
provides a fine-grained description of how the human-driven model validation
and refinement should be performed.
In the first step we identified three areas of ontology engineering for which
the parametric cost calculation should be further defined:
• building [3], which includes the effort invested in requirements
  specification, conceptualization, implementation, instantiation and
  ontology evaluation,
• maintenance, which involves costs related to analyzing and updating the
  ontology, and
• reuse, which relates to the costs for the acquisition and re-use of
  available knowledge sources.

3. See (Lopez, 2002) for a detailed description of ontology building and its
sub-tasks.

The overall effort (in person months, PM) is calculated as the sum of the
costs accruing in the three areas:

\[ PM = PM_B + PM_M + PM_R \tag{1} \]

$PM_B$, $PM_M$ and $PM_R$ represent the effort associated with building,
maintenance and reuse of ontologies, respectively. Different sets of cost
drivers are relevant for the effort estimation of these development phases.
Cost drivers have a rating level (from extra low to very high) that expresses
their impact on the development effort. For the purpose of quantitative
analysis, each rating level of each cost driver is associated with a weight
(effort multiplier, EM). The average EM assigned to a cost driver is 1.0
(nominal weight). If a rating level causes more development effort, its
corresponding EM is above 1.0. If the rating level reduces the effort, the
corresponding EM is less than the nominal value. The values associated with
each cost driver and effort multiplier are subject to further calibration on
the basis of the statistical analysis of real-world project data (i.e. the
real costs involved in ontology engineering projects).
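
The role of rating levels and effort multipliers can be illustrated with a
small sketch. The rating table, the driver names and all numeric values below
are hypothetical; the text does not give calibrated multipliers.

```python
from math import prod

# Hypothetical rating-to-multiplier table; 1.0 is the nominal weight.
EXAMPLE_EM = {
    "extra low": 0.7,
    "very low": 0.8,
    "low": 0.9,
    "nominal": 1.0,
    "high": 1.15,
    "very high": 1.3,
}


def combined_multiplier(ratings, table=EXAMPLE_EM):
    """Multiply the effort multipliers of all rated cost drivers."""
    return prod(table[level] for level in ratings.values())


def total_effort(pm_building, pm_maintenance, pm_reuse):
    """Equation (1): overall effort as the sum of the three areas."""
    return pm_building + pm_maintenance + pm_reuse


if __name__ == "__main__":
    # Driver names are placeholders, not the model's actual cost drivers.
    ratings = {"driver_1": "high", "driver_2": "low", "driver_3": "nominal"}
    print(combined_multiplier(ratings))  # above 1.0 means more effort than nominal
    print(total_effort(4.0, 1.0, 2.5))
```

A combined multiplier above 1.0 signals that the rated drivers increase the
effort relative to a nominal project, one below 1.0 that they reduce it.
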
The costs caused by ontology building are calculated as:

\[ PM_B = A \cdot Size_B \cdot \prod_i CD_i \tag{2} \]

$Size_B$ is the number of thousands of ontological primitives (e.g. for an
ontology with 1000 primitives $Size_B = 1$). The $CD_i$ are the effort
multipliers for the cost drivers. The constant $A$ accounts for the
multiplicative effects of efforts with increasing project size and is adapted
from the COCOMO framework (Boehm et al., 1997). Ontology maintenance costs
are estimated in a similar way:

\[ PM_M = A \cdot Size_M \cdot \prod_i CD_i \tag{3} \]

$Size_M$ is the sum of the added and modified ontology fragments influenced
by the appropriate cost factors. A different approach is applied for reuse
processes:

\[ PM_R = A \cdot Size_R \cdot \prod_i CD_i , \tag{4} \]

where

\[
\begin{aligned}
Size_R ={} & Size_{dir} \cdot (OU\_UNFM + OE) \\
 & + Size_{trans} \cdot (OU\_UNFM + OE + OT) \\
 & + Size_{mod} \cdot (OU\_FM + OE + OM) \\
 & + Size_{transmod} \cdot (OU\_UNFM + OE + OT + OM)
\end{aligned}
\tag{5}
\]

The reused size $Size_R$ is divided into the sizes of the directly integrated
($Size_{dir}$), translated ($Size_{trans}$) and/or modified ($Size_{mod}$,
$Size_{transmod}$) components, which are subject to different cost drivers:
the unfamiliarity of ontologists and domain experts (OUNF), ontology
understanding (OU), evaluation (OE), modification (OM) and translation (OT).
For a detailed explanation of the cost drivers see (Paslaru and Mochol,
2005).
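
As a worked illustration of equations (4) and (5), the following sketch
computes the reused size and the reuse effort. The terms printed as OU_UNFM
and OU_FM in equation (5) are treated here as opaque numeric inputs, since
their exact definition is left to the cited technical report; the constant A,
the driver values and the multipliers are all hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class ReuseDrivers:
    ou_unfm: float  # the term printed as OU_UNFM in equation (5), taken as given
    ou_fm: float    # the term printed as OU_FM in equation (5), taken as given
    oe: float       # ontology evaluation
    ot: float       # ontology translation
    om: float       # ontology modification


def reused_size(size_dir, size_trans, size_mod, size_transmod, d):
    """Equation (5): weighted size of the reused components."""
    return (
        size_dir * (d.ou_unfm + d.oe)
        + size_trans * (d.ou_unfm + d.oe + d.ot)
        + size_mod * (d.ou_fm + d.oe + d.om)
        + size_transmod * (d.ou_unfm + d.oe + d.ot + d.om)
    )


def reuse_effort(a, size_r, multipliers):
    """Equation (4): PM_R = A * Size_R * product of the effort multipliers."""
    return a * size_r * prod(multipliers)


if __name__ == "__main__":
    drivers = ReuseDrivers(ou_unfm=0.2, ou_fm=0.1, oe=0.1, ot=0.3, om=0.4)
    # Sizes are given in thousands of ontological primitives, as for Size_B.
    size_r = reused_size(1.0, 2.0, 3.0, 4.0, drivers)
    print(size_r)
    print(reuse_effort(a=2.0, size_r=size_r, multipliers=[1.0, 1.15, 0.9]))
```

The same multiplicative shape applies to equations (2) and (3), with $Size_B$
or $Size_M$ in place of $Size_R$.
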
We now turn to the presentation of the two case studies on ontology reuse
from the domains of recruitment and medicine.

Case Study Human Resources

The “Knowledge Nets” [4] project explores the potential of the Semantic Web
from a business and a technical viewpoint by means of pre-selected use
scenarios. One of the scenarios analyzed the online job seeking and job
procurement processes and the implications of Semantic Web technologies in
this area (Mochol et al., 2004; Bizer et al., 2005).

The first step towards the realization of the e-Recruitment scenario was the
creation of a human resources ontology (HR-ontology). The requirements
analysis revealed the necessity of aligning the resulting ontology with
commonly used domain standards and classifications in order to maximize the
integration of job seeker profiles and job postings.

First we identified the sub-domains of the application setting (skills, types
of professions, etc.) and several useful knowledge sources covering them
(approx. 25). As candidate ontologies we selected some of the most relevant
classifications in the area, deployed by federal agencies or statistical
organizations: Profession Reference Number Classification – BKZ (text file),
Standard Occupational Classification – SOC [5] (text file), Classification of
Industrial Sector – WZ2003 [6] (text file), North American Industry
Classification System – NAICS [7] (text file), Human Resources XML – HR-XML
[8] (XML schema), HR-BA-XML (XML schema) and KOWIEN Skill Ontology [9]
(DAML+OIL).

Depending on the language used in the knowledge sources (English/German) we
generated lists of concept names. Except for the KOWIEN ontology, the
candidate sources did not support additional ontological primitives. In order
to reduce the computation effort required to compare and merge similar
concept names, we identified the sources which had to be completely
integrated into the target ontology. For the remaining sources we identified
several thematic clusters for further similarity computations. For instance,
the Profession Reference Number Classification and the Standard Occupational
Classification were directly integrated into the final ontology, while the
KOWIEN skill ontology was subject to additional customization. To obtain an
appropriate vocabulary for a core skill ontology we compiled a small
conceptual vocabulary (15 concepts) from various job portals and job
procurement Web sites and matched it against the comprehensive KOWIEN
vocabulary. Next, the relationships extracted from KOWIEN and various job
portals were evaluated by HR experts and inserted into the target skill
sub-ontology. The resulting conceptual model was translated mostly manually
to OWL (since, except for KOWIEN, the knowledge sources were not formalized
using a Semantic Web representation language).

4. http://nbi.inf.fu-berlin.de/research/wissensnetze
5. http://www.bls.gov/soc/
6. http://www.destatis.de/allg/d/klassif/wz2003.htm
7. http://www.census.gov/epcd/www/naics.html
8. http://www.hr-xml.org
9. KOWIEN - Cooperative Knowledge Management in Engineering Networks;
http://www.kowien.uni-essen.de/

Case Study Medicine

The project “A Semantic Web for Pathology” [10] analyzes the impact of
ontologies within a retrieval system for image and text data in the medical
domain. The underlying ontology is used for concept-based search techniques
and for the semantic annotation of medical data (i.e. medical reports in text
form) (Paslaru et al., 2004).

In order to generate the ontology using available medical sources we applied
the reuse-oriented methodology described in Section “Our Approach to Ontology
Reuse”. First, we identified and analyzed relevant knowledge sources
describing aspects of pathology-related knowledge and diagnosis procedures.
However, the sources to be reused in this setting differ to a large extent in
content area and granularity, representation format and degree of formality:
i) SNOMED [11] and DigitalAnatomist [12] describe the anatomy of the lung and
typical diseases (database); ii) the UMLS Semantic Network [13] contains
generic and core medical concepts as part of UMLS (database format); iii)
XML-HL7 is an XML-based format for the representation of patient data; and
iv) the Immunohistology Guidelines are a list of stains to be applied in
diagnosis procedures in our partner healthcare organization (textual
description).

10. http://nbi.inf.fu-berlin.de/research/swpatho/deutsch/projektbeschreibung.htm
11. http://www.snomed.org
12. http://www.digitalanatomist.com/
13. Unified Medical Language System, National Library of Medicine;
http://www.nlm.nih.gov/research/umls

After merging the vocabularies of the sources according to the language used
in the documents (English/German), a preliminary vocabulary consisting of
concepts and relations was selected. In the pre-processing phase we
transformed complex concept names and computed similarities among the
corresponding bags of terms. The concepts were ranked according to their
application relevance, which was defined by a lexicon generated from the
archive of medical documents. Candidate relationships were extracted from
DigitalAnatomist, SNOMED and the UMLS Semantic Network. Approximately 50
relations were evaluated by domain experts, who finally inserted
approximately 20 generic and medicine-specific core relations into the target
ontology. The implementation of the target ontology was performed
semi-automatically by translating the corresponding database-stored data to
OWL, while significant amounts of domain-specific knowledge were encoded
manually, since they were not available in structured form in any of the
knowledge sources.

Costs and Benefits of Reuse in the Case Studies

The cost and benefit analysis of the presented case studies focused on the
estimation of the presumed cost savings achieved by reuse. The actual costs
incurred in the two projects were compared with the predicted costs of
building the corresponding ontologies from scratch. The costs of the latter
approach were calculated using ONTOCOM (Paslaru and Mochol, 2005).

In the recruitment scenario we found several taxonomies for the description
of skills and the classification of job profiles and industrial sectors,
which we wanted to reuse in our ontology. 15% of the total time was spent on
gathering the relevant sources, while about 35% was invested in their
customization. Several ontologies were fully integrated into the resulting
ontology, while KOWIEN and the XML-based sources required additional
customization. This part of the ontology building process produced over 40%
of the total engineering costs. The last phase of ontology building,
refinement and evaluation, cost 10% of the overall resources.

In our experience, reusing existing knowledge sources was profitable for the
HR domain and for our application setting. A cost estimation for a new
implementation revealed that the reuse approach was more cost-effective (2.5
PM for the HR-ontology with reuse vs. 4 PM for development from scratch). At
the same time, re-using standard classifications is expected to considerably
increase the usability of our e-Recruitment application. Nevertheless, there
is a need for reliable tools for translating between various representations
and for ontology customization in order to further optimize reuse costs.

The main challenge of the second scenario was the evaluation of existing
sources. Medicine is one of the best examples of application domains where
ontologies have already been deployed at large scale and have already
demonstrated their utility (Gangemi et al., 1999). However, most of the
available ontologies in this domain are very comprehensive knowledge bases,
which differ in the formalized domain, quality and appropriateness for
certain application tasks. Additionally, most of the available medical
ontologies lack a “reuse-friendly” representation format.

Since the retrieval system using the ontology is still under development, the
product-oriented benefits of the reuse process cannot be fully evaluated at
this point. However, we can say that, for this scenario, the effort related
to the customization of the source ontologies required over 45% of the time
necessary to build the target ontology. A further 15% of the engineering time
was spent on translating the input representation formalisms to OWL. The
reuse-oriented approach gave rise to considerable effort to evaluate and
extend the outcomes (approx. 40% of the total engineering time).

In our experience in this case study, the benefits of reuse were outweighed
by its costs, because of the difficulties related to the evaluation and
(technical) management of large-scale ontologies and because of the costs of
the subsequent refinement phase. Using ONTOCOM we approximated the costs of
semi-automatically building a similar ontology on the basis of a
domain-specific document corpus. From a resource point of view, building the
first ontology involved four times as many resources as a new implementation
(5 person months for the UMLS-based ontology with 1200 concepts vs. 1.25
person months for a manually developed ontology). At the same time, the
recall of the ontology w.r.t. the semantic annotation task would consequently
be improved in the latter case because of the text-close nature of the
generation method.
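
The effort figures reported for the two case studies can be compared with a
few lines of arithmetic. The sketch below uses only the person-month values
stated in the text; the function and key names are illustrative.

```python
def compare_effort(reuse_pm, new_pm):
    """Compare the effort of building by reuse with that of a new implementation."""
    saving = new_pm - reuse_pm
    return {
        "saving_pm": saving,                    # negative means reuse cost more
        "saving_pct": 100.0 * saving / new_pm,  # relative to the new implementation
        "reuse_to_new_ratio": reuse_pm / new_pm,
    }


if __name__ == "__main__":
    # Human resources: 2.5 PM with reuse vs. 4 PM from scratch.
    print(compare_effort(2.5, 4.0))
    # Medicine: 5 PM with reuse vs. 1.25 PM for the new implementation.
    print(compare_effort(5.0, 1.25))
```

For the HR ontology this gives a saving of 1.5 PM (37.5%), while in the
medical case reuse required 3.75 PM more, a ratio of 4 to 1.
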

Conclusions and Future Work

In this paper we described two case studies on ontology reuse and a simple
methodology underlying them. Furthermore, we introduced a method to estimate
the costs arising in ontology building processes, which was used to analyze
the costs and the benefits of reuse in the mentioned case studies.

Ontology integration means not only the translation of the representation
languages to a common format, but also the matching of the resulting schemas.
Our experience during the presented case studies showed that, due to
scalability and heterogeneity issues, neither of these steps can be performed
efficiently using current techniques. This was the fundamental motivation for
applying a possibly less technically sophisticated reuse methodology in the
case studies. However, incrementally exploiting the “lowest common
denominator” of the source ontologies (i.e. their vocabulary) proved to be
extremely useful in our reuse experiments.

We are working on further methods to optimize the costs and the quality of
the reuse process. Currently a support tool for the described methodology is
being developed in order to reduce the manual effort invested in ontology
reuse so far. At the same time, the cost estimation methods presented here
are subject to additional detailed empirical refinements.

Acknowledgements
This work is a result of the cooperation within the Semantic Web PhD-Network
Berlin-Brandenburg [14] and has been partially supported by the KnowledgeWeb
- Network of Excellence, by the project “A Semantic Web for Pathology” funded
by the German Research Foundation DFG and by the “Knowledge Nets” project,
which is part of the InterVal - Berlin Research Centre for the Internet
Economy, funded by the German Ministry of Research BMBF.

References to Publications and Bibliography

Bizer, C. et al.; The Impact of Semantic Web Technologies on Job Recruitment
Processes. 7. Internationale Tagung Wirtschaftsinformatik (WI 2005), Bamberg,
Germany, February 2005.

Boehm, B. et al.; COCOMO II Model Definition Manual.
http://sunset.usc.edu/research/COCOMOII/Docs/modelman.pdf, 1997.

Cohen, W. W. & Ravikumar, P. & Fienberg, S. E.; A Comparison of String
Distance Metrics for Name-Matching Tasks. Proceedings of IIWEB Workshop at
the IJCAI03, 2003.

Do, H. & Melnik, S. & Rahm, E.; Comparison of schema matching evaluations.
Proceedings of the 2nd International Workshop on Web Databases (German
Informatics Society), 2002.

Gangemi, A. & Pisanelli, D. M. & Steve, G.; An Overview of the ONIONS
Project: Applying Ontologies to the Integration of Medical Terminologies.
Data & Knowledge Engineering, 31(2):183-220, 1999.

14. http://nbi.inf.fu-berlin.de/research/KnowledgeWeb/phd/phd.html

Gruber, T. R.; Toward principles for the design of ontologies used for
knowledge sharing. Int. J. Hum.-Comput. Stud., 43(5-6):907–928, 1995.

Grüninger, M. & Fox, M.; Methodology for the Design and Evaluation of
Ontologies. Proceedings Workshop on Basic Ontological Issues in Knowledge
Sharing, IJCAI95, 1995.

Linstone, H. A. & Turoff, M.; The Delphi Method: Techniques and Application.
Addison-Wesley, 1975.

Lopez, F. M.; Overview and analysis of methodologies for building ontologies.
Knowledge Engineering Review, 17(2), 2002.

Mochol, M. & Oldakowski, R. & Heese, R.; Ontology based Recruitment Process.
Proceedings Workshop Semantische Technologien für Informationsportale,
INFORMATIK 2004, Ulm, Germany, 2004.

NASA; National Aeronautics and Space Administration; NASA Cost Estimating
Handbook 2004. http://ceh.nasa.gov/, 2004.

Pinto, H. S. & Martins, J. P.; A methodology for ontology integration. K-CAP
2001: Proceedings of the international conference on Knowledge capture, ACM
Press, 2001.

Paslaru Bontas, E. et al.; Generation and Management of a Medical Ontology in
a Semantic Web Retrieval System. Proceedings of the OTM Conferences, Larnaca,
Cyprus, 2004.

Paslaru Bontas, E. & Mochol, M.; A Cost Model for Ontology Engineering.
Technical Report, TR-B-05-03, FU Berlin,
ftp://ftp.inf.fu-berlin.de/pub/reports/tr-b-05-03.pdf, 2005.

Stewart, R. D. & Wyskida, R. M. & Johannes, J. D.; Cost Estimator’s Reference
Manual. Wiley, 2nd edition, 1995.

Uschold, M. & King, M.; Towards a Methodology for Building Ontologies.
Proceedings Workshop on Basic Ontological Issues in Knowledge Sharing,
IJCAI95, 1995.
