Abstract.
EU-IST Strategic Targeted Research Project (STREP) IST-2004-026460 TAO
Deliverable D6.1 (WP6)
This deliverable describes the results of requirement analysis and definition of transitioning
scenarios to semantic technologies that have been carried out on the open-source software
engineering case study 1. It provides an outline of the application to be refactored and an
introduction to the underlying application-specific concepts. A structured set of requirements
is collected and exploited as a starting point to set up precise technical directions that will be
followed in the re-engineering work. This work resulted in the technical specification and scope
delimitation of the transitioning scenarios regarding ontology acquisition, semantic annotation,
and semantic-based access.
Keyword list: Semantic Modelling, Ontology Acquisition, Semantic Annotation, Semantic Web
Services, SOA, software engineering
Copyright © 2007 University of Sheffield (USFD)
TAO Consortium
This document is part of a research project partially funded by the IST Programme of the Commission of the European
Communities as project number IST-2004-026460.
Atos Origin Sociedad Anonima Espanola
Dept Stream-Public Sector
Atos Origin Spain, C/Albarracin, 25, 28037 Madrid, Spain
Tel: +34 91 214 8610, Fax: +34 91 754 3252
Contact person: Nuria de Lama
E-mail: nuria.delama@atosorigin.com

Dassault Aviation SA
DGT/DPR
78, quai Marcel Dassault
92552 Saint-Cloud Cedex 300, France
Tel: +33 1 47 11 53 00, Fax: +33 1 47 11 53 65
Contact person: Farid Cerbah
E-mail: Farid.Cerbah@dassault-aviation.com
The challenge addressed by this case study is that successful code reuse and bug avoid-
ance in software engineering requires numerous qualities, both of the library code and of
the development staff; two important qualities are ease of identification of relevant com-
ponents and ease of understanding of their parameters and usage profiles. The attraction
of using semantic technology to address this problem lies in its potential to transform ex-
isting software documentation into a conceptually organised and semantically interlinked
knowledge space that incorporates unstructured data from multiple software artefacts: fo-
rum postings, manuals, structured data from source code and configuration files. The
enriched information can then be used to add novel functionality to web-based documen-
tation of the software concerned, providing the developer with new and powerful ways
to locate and integrate components (either for reuse or for integration with new develop-
ment). The research is being evaluated on transitioning an existing open source project,
GATE, which has a long development history and extensive documentation in multiple
formats and modalities.
The advantage of transitioning GATE to ontologies will be two-fold. Firstly, GATE
components and services will be easier to discover and integrate within other applications
due to the use of semantic web service technology.
Secondly, users will be able to use knowledge access tools and easily find all information relevant to a given GATE concept, searching across all the different software artefacts:
the GATE documentation, XML configuration files, video tutorials, screen shots, user
discussion forum, etc.
In this context, one of the key requirements towards the semantic web technology in
our application is that it should be applied to existing software projects and require little
human involvement in its creation and maintenance. Since most projects do not already maintain an ontology, an essential part of our methodological and technological approach lies in developing:

automatic methods for ontology learning;

alongside this, the exposure of semantic-based access to the GATE document base (application of knowledge stores and content augmentation technology);

finally, uniting the results of these two processes, to address the various application-specific scenarios discussed in this deliverable.
In general, this case study has an ambitious scope. However, it should be stressed that not all parts will be equally developed. The main effort will be focused on the ontology
acquisition, content augmentation, and knowledge access scenarios, elaborated below, as
they are the key building blocks of the transitioned application.
1 Introduction 3
3 Requirements Analysis 13
3.1 Requirements towards the SOA-based GATE . . . . . . . . . . . . . . . 13
3.1.1 Computational complexity and load balancing . . . . . . . . . . . 13
3.1.2 Combined Methodological and Technological Support . . . . . . 14
3.1.3 Multi-role Support . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.4 Service-based Access to Key Components . . . . . . . . . . . . . 15
3.1.5 Support for Building Service-based Applications . . . . . . . . . 16
3.1.6 Compatibility with the Existing GATE Development Environment 16
3.2 Requirements towards the TAO Technology . . . . . . . . . . . . . . . . 17
5 Transitioning Scenarios 24
5.1 Ontology Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6 Conclusion 38
Introduction
GATE¹ [CMBT02] is a leading open-source architecture and infrastructure for the build-
ing and deployment of Human Language Technology applications, used by thousands of
users at hundreds of sites. The development team consists at present of over 15 peo-
ple, but over the years more than 30 people have been involved in the project. As such,
this software product exhibits all the specific problems that large software architectures
encounter and has been chosen as the data-intensive case study in the TAO project.
While GATE has increasingly facilitated the development of knowledge-based appli-
cations with semantic features (e.g. [BTMC04, KPO+ 04, Sab06]), its own implementa-
tion has continued to be based on compositions of functionalities justified on the syntactic
level, understood by informal human-readable documentation. By its very nature as a suc-
cessful and accepted general architecture, a systematic understanding of its concepts and
their relation is shared between its human users. It is simply that this understanding has
not been formalised into a description that can be reasoned about by machines or made
easier to access by new users. Indeed, GATE users who want to learn about the system find it difficult due to the large amount of heterogeneous information, which cannot be
accessed via a unified interface.
The advantage of transitioning GATE to ontologies will be two-fold. Firstly, GATE
components and services (see Chapter 4) will be easier to discover and integrate within
other applications (see [SP05, Sap05]) due to the use of semantic web service technology.
Secondly, users will be able to use knowledge access tools (see Chapter 5) and easily find all information relevant to a given GATE concept, searching across all the different software artefacts: the GATE documentation, XML configuration files, video tutorials,
screen shots, user discussion forum, etc.
The work in this case study over the course of the TAO project is structured in three
phases, corresponding to the three years of the project (see Figure 1.1).
Work in the first year is dedicated to requirements analysis; definition of the GATE
¹ http://gate.ac.uk
web services, following the methodology; data provision to all technical workpackages,
and definition of the scenarios to be covered by the case study in subsequent years.
The effort in the second year will focus on experimentation and customisation of
the ontology learning (WP2) and content augmentation (WP3) tools for the needs of the
case study. In more detail, ontology learning will be applied to the GATE source code,
javadocs, and forum postings in order to extract a first version of the domain ontology.
This will then be refined manually by a GATE expert and stored into the knowledge stores
from WP4. Then the content augmentation tools will be applied to index semantically all
software artefacts and the results will be accessible via the knowledge store.
In the last year, the work will address semantic annotation of the WSDL service definitions and use of the TAO Suite to address the case study's scenarios. The case study
prototype will be evaluated against the legacy system along several dimensions, such as
search and navigation within the software documentation. The report will also provide an
assessment of the usefulness of the new functionalities, such as content augmentation and
semantic web services.
Chapter 2

GATE as a Legacy Application
Language Resources (LRs) represent entities such as documents, corpora and on-
tologies;
Processing Resources (PRs) represent entities that are primarily algorithmic, such
as parsers, generators or ngram modellers;
Visual Resources (VRs) represent visualisation and editing components that par-
ticipate in GUIs.
These resources can be local to the user's machine or remote (available via HTTP), and all can be extended by users without modification to GATE itself.
One of the main advantages of separating the algorithms from the data they require is
that the two can be developed independently by language engineers with different types of
expertise, e.g. programmers and linguists. Similarly, separating data from its visualisation
allows users to develop alternative visual resources, while still using a language resource
provided by GATE.
Collectively, all resources are known as CREOLE (a Collection of REusable Objects
for Language Engineering), and are declared in a repository XML file, which describes
their name, implementing class, parameters, icons, etc. This repository is used by the
framework to discover and load available resources.
For example, the corpus LR is declared as:
<RESOURCE>
<NAME>GATE corpus</NAME>
<CLASS>gate.corpora.CorpusImpl</CLASS>
<COMMENT>GATE transient corpus</COMMENT>
<INTERFACE>gate.Corpus</INTERFACE>
<PARAMETER NAME="documentsList"
COMMENT="A list of GATE documents"
OPTIONAL="true"
ITEM_CLASS_NAME="gate.Document"
>java.util.List</PARAMETER>
<ICON>lrs.gif</ICON>
</RESOURCE>
The PARAMETER tags describe the parameters which each resource needs when created or executed. Parameters can be optional; e.g., if a document list is provided when the corpus is constructed, the corpus will be populated automatically with these documents.
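To illustrate how a framework might consume such a repository entry, the following sketch parses a RESOURCE element of this shape into a simple record that could drive discovery and loading. This is an illustration in Python, not GATE's actual Java loader; the field names in the resulting dictionary are our own.

```python
import xml.etree.ElementTree as ET

CREOLE_ENTRY = """
<RESOURCE>
  <NAME>GATE corpus</NAME>
  <CLASS>gate.corpora.CorpusImpl</CLASS>
  <COMMENT>GATE transient corpus</COMMENT>
  <INTERFACE>gate.Corpus</INTERFACE>
  <PARAMETER NAME="documentsList"
             COMMENT="A list of GATE documents"
             OPTIONAL="true"
             ITEM_CLASS_NAME="gate.Document">java.util.List</PARAMETER>
  <ICON>lrs.gif</ICON>
</RESOURCE>
"""

def parse_resource(xml_text):
    """Turn a RESOURCE element into a record a framework could use
    to locate and instantiate the implementing class."""
    root = ET.fromstring(xml_text)
    resource = {
        "name": root.findtext("NAME"),
        "class": root.findtext("CLASS"),
        "interface": root.findtext("INTERFACE"),
        "parameters": [],
    }
    for param in root.findall("PARAMETER"):
        resource["parameters"].append({
            "name": param.get("NAME"),
            "type": param.text.strip(),
            "optional": param.get("OPTIONAL") == "true",
        })
    return resource

resource = parse_resource(CREOLE_ENTRY)
print(resource["class"])          # gate.corpora.CorpusImpl
print(resource["parameters"][0])  # the documentsList parameter record
```

A real loader would go on to resolve the class name against the classpath; here the record simply makes the declared metadata programmatically accessible.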
When an application is developed within GATE's graphical environment, the user chooses which processing resources go into it (e.g. tokeniser, POS tagger), in what order they will be executed, and on which data (e.g. document or corpus).
The execution parameters of each resource are also set there, e.g. a loaded document
is given as a parameter to each PR. When the application is run, the modules will be
executed in the specified order on the given document. The results can be viewed in GATE's graphical environment.

The conditional controller (see Figure 2.2) was introduced in GATE v2 in order to allow flexible processing of heterogeneous data (e.g., a corpus consisting of documents from several domains, styles, and genres). The controller allows the user to specify, for each PR, the conditions under which it is executed: always; never; or conditionally, only if a particular feature with a given value is present on the document.
For example, a categorisation module that always fires can assign different genres
to each document as features (e.g., sport, political, business). Then the application can
contain PRs trained specifically for the given type of documents and at execution time,
based on the document type, the controller will determine which PRs are executed. For
example, a text about sport would use a different named entity recogniser from the one
about politics (England is a team in the first case and a location in the second) or a text
in all uppercase would use a different POS tagger (amongst other things) from a text in
mixed case (see Figure 2.2). However, the application will list all alternative PRs (e.g., two POS taggers and three entity recognisers), but only those appropriate for the current document will be executed on it, whereas a different set of PRs might be executed on the next one.
The iterative strategy is equivalent to a loop and will enable one or more PRs to be
executed until a given condition is met.
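The conditional execution strategy described above can be sketched as follows. This is an illustrative model only: the PR names and condition helpers are hypothetical, and the real GATE controller operates on Java resources rather than dictionaries.

```python
# Sketch of a conditional controller: each PR carries a run condition
# that is checked against the document's features before execution.

def run_pipeline(prs, document):
    """Execute each PR whose condition matches the document's features."""
    executed = []
    for pr in prs:
        if pr["condition"](document["features"]):
            executed.append(pr["name"])
    return executed

always = lambda features: True      # "always" strategy
never = lambda features: False      # "never" strategy

def feature_equals(name, value):
    """Run only if the document carries the given feature value."""
    return lambda features: features.get(name) == value

pipeline = [
    # the categorisation module always fires and assigns the genre feature
    {"name": "genre-classifier", "condition": always},
    # genre-specific recognisers fire conditionally
    {"name": "sport-NE-recogniser", "condition": feature_equals("genre", "sport")},
    {"name": "politics-NE-recogniser", "condition": feature_equals("genre", "political")},
]

sport_doc = {"features": {"genre": "sport"}}
print(run_pipeline(pipeline, sport_doc))
# ['genre-classifier', 'sport-NE-recogniser']
```

The iterative strategy mentioned above would simply re-run such a loop until a termination condition over the document's features is met.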
All data sources are updated frequently, with stable versions corresponding to software releases. Thus, semantic annotation of and access to these sources needs to be sensitive to this dynamic aspect.
Some of the data sources are a mixture of structured and unstructured text and also
images.
The size of the corpus itself is not likely to pose a serious challenge for the semantic annotation and ontology learning tools. However, the instance data generated from this analysis is likely to produce millions of triples and thus pose a scalability challenge for the TAO semantic repository. For instance, the Dhruv system [ASHK06], which aims to provide semantic-based support for software bug resolution, has reported such scalability problems, which hampered the development of a full-blown system.
<?xml version="1.0"?>
<CREOLE-DIRECTORY>
<!-- Processing Resources -->
<CREOLE>
<RESOURCE>
...
</RESOURCE>
</CREOLE>
</CREOLE-DIRECTORY>
In addition to a jar and xml file, some plugins also contain Java source code and other
resources required by the specific application, e.g., gazetteer lists, finite-state transduction
grammars, grammar definition files. At the very least, TAO will attempt to utilise the
creole.xml and the Java code.
Chapter 3
Requirements Analysis
This chapter defines a set of requirements towards the refactored legacy application, which are then used to derive a set of requirements towards
the TAO semantic technology.
Support for load balancing by driving several GATE instances through a single
access point
of kilobytes per second; however, more complex ones run at tens of kilobytes per second, or sometimes even less. The majority of GATE PRs have a linear execution time [CMB+ 05], but when applied to large documents they can still take several minutes on each.
At the same time, GATE is not multi-threaded; therefore, if a GATE-based service is to support more than one processing request at any given time, what is required is a pool of several GATE instances, or several instances of the same service. The expectation is that the TAO methodology will provide guidance on how best to address this issue.
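A minimal sketch of the pooling idea follows. The GateInstance class is a stand-in, not the real GATE API: the point is only that a service front-end can keep several single-threaded instances in a queue, lease one per request, and return it afterwards.

```python
import queue
import threading

class GateInstance:
    """Stand-in for a (non-thread-safe) GATE instance."""
    def __init__(self, instance_id):
        self.instance_id = instance_id

    def process(self, text):
        return f"annotated({text})@{self.instance_id}"

class InstancePool:
    """Lease instances to concurrent requests, one at a time each."""
    def __init__(self, size):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(GateInstance(i))

    def process(self, text):
        instance = self._pool.get()   # blocks until an instance is free
        try:
            return instance.process(text)
        finally:
            self._pool.put(instance)  # return the instance to the pool

pool = InstancePool(size=3)
results = []
threads = [threading.Thread(target=lambda t=t: results.append(pool.process(t)))
           for t in ["doc1", "doc2", "doc3", "doc4"]]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(results))  # 4
```

With four concurrent requests and three pooled instances, the fourth request simply blocks until an instance is returned, which is exactly the load-balancing behaviour required of the single access point.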
combine them. However, a typical NLP application consists of over 10 PRs, so expecting non-expert users to know how to combine these is probably not realistic. On the other hand,
expert users are likely to require just that. For non-experts the right granularity would be
to use sets of PRs, already composed by an expert user, and exposed as a single service,
e.g., term extractor.
Consequently, the requirement is to define the GATE annotation services in a flexible way, so they can have different internal granularity and also be composable into larger services, which themselves can then behave like a single GATE annotation service for non-expert users.
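The granularity requirement above amounts to a composite pattern, which can be sketched as follows. The service and PR names are illustrative, not part of any actual GATE interface: the point is that an expert-assembled pipeline presents the same annotate interface as a single service.

```python
class AnnotationService:
    """A single annotation service wrapping one processing step."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def annotate(self, doc):
        return self.fn(doc)

class CompositeService(AnnotationService):
    """A pipeline of services exposed as one service,
    e.g. a term extractor assembled by an expert user."""
    def __init__(self, name, parts):
        self.name = name
        self.parts = parts

    def annotate(self, doc):
        # run each part in order; the composite looks like one service
        for part in self.parts:
            doc = part.annotate(doc)
        return doc

# fine-grained services for expert users
tokeniser = AnnotationService("tokeniser", lambda d: d + "+tokens")
tagger = AnnotationService("pos-tagger", lambda d: d + "+pos")

# coarse-grained composite for non-expert users
term_extractor = CompositeService("term-extractor", [tokeniser, tagger])

print(term_extractor.annotate("doc"))  # doc+tokens+pos
```

Because CompositeService is itself an AnnotationService, composites can nest, giving the "different internal granularity" the requirement asks for.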
LRs which can be wrapped as data-intensive services. Efficient storage and access
of potentially large data sets. At least two types needed: document and ontology
services
In order to reduce the amount of information exchanged over the network, it would be
better to retrieve only a part of a LR, e.g. an AnnotationSet and not a whole document.
The first four functionalities are currently supported by the GATE DataStore, so for compatibility reasons it would be necessary for the document storage service to have a compatible API.
The search facility is currently not implemented by GATE DataStores, but is required for the DocService, as users need to be able to retrieve efficiently only the relevant documents (to minimise network traffic, amongst other things). The idea is to enhance the current datastore API by adding a search API, and thus obtain RDBMS-like functionality. The indexing and querying of the data would be performed on the remote server.
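The combination of DataStore-style persistence with a server-side search facility can be sketched as below. The method names are assumptions for illustration, not the actual GATE DataStore interface; the essential point is that search runs where the data lives, so only matching identifiers cross the network.

```python
class DocService:
    """Sketch of a document service: CRUD plus server-side search."""
    def __init__(self):
        self._docs = {}

    def save(self, doc_id, content):
        self._docs[doc_id] = content

    def load(self, doc_id):
        return self._docs[doc_id]

    def delete(self, doc_id):
        del self._docs[doc_id]

    def search(self, keyword):
        """Server-side search: only matching ids are returned
        to the client, minimising network traffic."""
        return [doc_id for doc_id, content in self._docs.items()
                if keyword in content]

svc = DocService()
svc.save("d1", "the POS tagger assigns categories")
svc.save("d2", "the tokeniser splits text")
print(svc.search("tagger"))  # ['d1']
```

A real implementation would back search with a proper index rather than a linear scan, but the API shape (retrieve ids first, then fetch only the needed documents or annotation sets) is the requirement being illustrated.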
Key functionalities required from the ontology access service:
both local and remote data in the same fashion and/or build applications containing both
local and remote components.
From the methodological perspective, how does the human post-editing of the learnt
domain ontology fit within the entire transitioning process? Would it be possible to develop a relatively simple how-to guide on conceptual modelling, so that software engineers with limited expertise in ontologies could carry out this step?
Upper Level and Other Existing Ontologies Another set of issues that needs to be investigated is whether there would be any benefit in grounding the learnt and post-edited ontologies in an upper-level ontology. If yes, which ontology, or what are the criteria for choosing the most relevant one among existing ones? In addition, there are
existing ontologies oriented towards software artefacts, e.g., those from the Dhruv
system [ASHK06], which also need to be investigated.
Dealing with Dynamic Software Artefacts How would the learning and content augmentation methods cope with the dynamic nature of software artefacts, i.e., new concepts becoming important as the software matures through versions? This is probably less relevant for legacy systems where there is no further development of the
system itself, however GATE specifically is updated continuously and expanded
with new plugins by its user community (although this does not necessarily entail
any changes to the key parts of the GATE API).
¹ For example, http://www.cs.cmu.edu/~anupriya/code.owl, http://www.cs.cmu.edu/~anupriya/bugs.owl
This section describes the GATE web services, which have been defined over the existing
legacy application, while retaining backwards compatibility and addressing all other key
requirements defined in the previous chapter.
The requirements towards the SOA-based application model were defined in Chapter 3. In brief, the main requirements are for a non-monolithic, human-assisted, and backwards-compatible model. In particular, backwards compatibility with the GATE graphical user interface is of paramount importance, as it is used extensively by the majority of GATE users.
Chapter 4

Design of the GATE Web Services
The saved application should be as self-contained as possible, i.e. the plugins used by
the application should be in subdirectories under the saved application location. This is
necessary because in some cases read permission on parent directories may be restricted
by the servlet container.
<parameters>
<param name="depth">
<runtimeParameter prName="MyAnnotator" prParam="maxDepth"/>
</param>
<param name="annotatorName">
<documentFeature name="annotator"/>
</param>
</parameters>
Parameter values are all strings, but when mapped to PR parameters they are automatically converted to other types (e.g. URL, Integer, etc.).
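The string-to-type conversion step can be sketched as a converter table keyed by the declared Java-side type. The table below is illustrative; the actual GAS implementation may resolve types differently.

```python
from urllib.parse import urlparse

# Hypothetical converter table: declared Java type -> Python conversion.
CONVERTERS = {
    "java.lang.Integer": int,
    "java.lang.Double": float,
    "java.lang.Boolean": lambda s: s.lower() == "true",
    "java.net.URL": urlparse,
}

def convert_param(value, declared_type):
    """Convert a string parameter value to its declared type,
    falling back to the plain string when no converter is known."""
    converter = CONVERTERS.get(declared_type, str)
    return converter(value)

print(convert_param("7", "java.lang.Integer"))                    # 7
print(convert_param("true", "java.lang.Boolean"))                 # True
print(convert_param("http://gate.ac.uk", "java.net.URL").scheme)  # http
```

The fallback to the raw string mirrors the fact that string-typed PR parameters need no conversion at all.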
Again, as for standard GATE PRs, the GAS service definition also specifies what
annotation sets the service requires as input and populates as output:
<annotationSets>
<annotationSet name="Original markups" in="true"/>
<annotationSet name="" out="true"/>
<annotationSet name="Key" in="true" out="true"/>
</annotationSets>
Note that a service can use the same set for both input and output, and can use the
default annotation set (the second example above).
For a complete example of a GAS service WSDL and service definition file see Ap-
pendix A.
Effectively a GAS service behaves roughly like a current GATE application, i.e. it
takes a document and a set of parameters as input and returns a modified document. This
is required in order to incorporate a GAS in a GATE application, like any other PR.
Future work needs to consider generating a simple service deployment (with every-
thing running on a single server) with a couple of clicks. The result of this operation will
be a WAR file, ready to be unpacked in a web application server. If one needs a more
advanced deployment, e.g. on a cluster, then more administrative intervention will be
required.
apply it with the content augmentation tools from WP3 to the WSDL definitions
also apply these to the service code and documentation, to obtain semantic descrip-
tions of the web services.
Chapter 5
Transitioning Scenarios
As discussed in the introduction, there are three kinds of tasks to be completed to trans-
form GATE into a semantic service-oriented architecture:
alongside this, the exposure of semantic-based access to the GATE document base
(application of WP4 storage and WP3 content augmentation technology);
finally, uniting the results of these two processes, the description of this function-
ality in semantic terms (using the WP5 infrastructure to define the Semantic Web
Services).
Overall this case study has a wide scope. However, it should be stressed that not all parts will be equally developed. The main effort will be focused on the ontology
acquisition, semantic content augmentation, and knowledge access scenarios, elaborated
below, as they are the key building blocks of the transitioned application.
The overall ontology acquisition process will follow the TAO methodology defined in
D1.2 and refined for the case studies in Section 4.1.2 of D7.1.
the semantic annotations from the content augmentation services, run over the multiple
software artefacts in GATE.
In the context of semantic-based access, several scenarios have been identified as rel-
evant for this case study:
Automatic generation of reference pages from the ontology: provides users with a single point of access to all knowledge, continuously kept up to date.

Natural language queries for semantic search: an intuitive way to formulate semantic queries without needing to know the underlying query language of the semantic repository.

Semantic-based filtering of forum postings: allows users to filter which postings they see, by defining a set of relevant concepts from the domain ontology (based on the semantic content augmentation of the forum postings).

Expertise location: enables a (new) team member to identify the most appropriate team members to consult on a given problem. Developer expertise will be discovered automatically from forum postings.
However, due to limits on the available effort, we will focus on implementing the first two scenarios, with the latter two remaining at a fairly early prototype level, to be explored further in year 3, should more effort become available.
In order to overcome this problem and provide a single point of reference, the Dhruv system [ASHK06], for example, automatically generates so-called cross-link pages,
which provide a unified point of information (see Figure 5.1). The cross-link pages show
information about the ontology class (code:Function in this case), the file that contains the
source code, who authored this file, other related sources, version control logs, discussion
posts, bug reports, etc. The system also recognises domain terms, which are not added to
the ontology, but are simply used to cross-reference the different artefacts (see the right
hand-side part of the figure).
The problem with adopting the Dhruv approach wholesale is that, in a large system such as GATE, there would be too many hits and these cross-link pages would become hard to use and navigate. For instance, there are 333 hits for POS Tagger in the forum postings
and 169 hits among the other kinds of documentation.
The approach that we envisage here is more ontology-centred. In other words, the
user would be able to browse the ontology or search for relevant concepts, instances,
and properties by providing one or more keywords. For instance, searching for tagger,
the user would be offered several matching concepts: POSTagger, TreeTagger, and any
other entry which contains tagger within the label string. Then let us assume that the user
selects POSTagger as the instance of interest.
Instead of being shown triples about the POSTagger instance in a formal way (e.g., as in Protege or the KIM KnowledgeBase Explorer [PKK+ 04]), we envisage automatically generating a web page, which can be shown on its own or alongside the ontology tree,
where POSTagger is selected. The contents of the page can be as follows:
Super and sub-concepts of the selected concept; if an instance is selected, then show
its class and possibly other instances of the same class
Values of key properties (e.g., parameters of POSTagger), or properties for which the selected entity appears as a value (e.g., which PRs take Ontologies)
Links to sections in the user guide and other guides where the concept is mentioned. Importance could perhaps be judged by whether the mention is in the title or the body of the section/chapter.
Links to Javadocs and key forum postings.
Ideally the user should be able to customise what they are interested in, so some of the
information can be filtered out. For instance, allow them to exclude certain data sources or
ask for information after a given date. Experiments will show exactly how many hits will be returned, so that appropriate ranking and filtering can be done. The exclusion of certain types of information will be implemented by specifying more refined SeRQL queries against the knowledge store containing the content augmentation index.
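The way user-level filters could be compiled into refined queries can be sketched as string construction over a SeRQL-like pattern. Both the schema (gate:mentions, gate:fromSource, gate:postedOn) and the exact query shape are assumptions for illustration, not the deliverable's actual knowledge-store schema.

```python
def build_query(concept, excluded_sources=(), after_date=None):
    """Compile user filters into a SeRQL-style query string
    (hypothetical schema: mentions / fromSource / postedOn)."""
    query = (f"SELECT Doc FROM {{Doc}} gate:mentions {{<{concept}>}};\n"
             f" gate:fromSource {{Source}};\n"
             f" gate:postedOn {{Date}}")
    # each user-selected filter becomes a WHERE condition
    conditions = [f'Source != "{s}"' for s in excluded_sources]
    if after_date:
        conditions.append(f'Date >= "{after_date}"')
    if conditions:
        query += "\nWHERE " + " AND ".join(conditions)
    return query

# e.g. everything about the POSTagger concept, excluding the forum,
# posted after the start of 2006
q = build_query("gate:POSTagger",
                excluded_sources=["forum"],
                after_date="2006-01-01")
print(q)
```

The point is only the compilation step: the user never sees the query language, and each added filter narrows the result set on the server side before anything is returned.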
Another alternative would be to generate these as wiki pages, so the user can then
correct any mistakes in an intuitive way, perhaps using a controlled language [DHCT06]
or a special semantic wiki syntax [VKV+ 06, MH05], which then directly updates the
ontology as well. If a semantic wiki is used, then the update of a wiki page would trigger
the following processes:
One argument for using a wiki as the basis of our system is that these are already used
heavily, especially in open-source software development projects, as a collaborative way
for documentation authoring, coordination, etc. At present, the GATE team uses a wiki
internally to list jobs, track deadlines, and share relevant publications on selected topics.
Consequently, opening this out and enriching it with semantic technology might be a good solution, provided that such editing is in fact wanted.
Semantic wikis or software engineering support systems such as Dhruv do not al-
low sufficiently flexible exploration of the available semantic knowledge about con-
cepts/instances which are the objects (ranges) of the semantic triples, although some
promising GUIs have been developed recently (e.g., IkeWiki [Sch06], OntoWiki
[ADR06]). For example, if the generated page is about Processing Resource with an
is-a link to Resource, then the way to find further information about Resource would be
to click on the link.
A better approach could be to enable a small summary to be generated and shown on mouse hover over Resource (IkeWiki currently supports something similar by showing the content of the target wiki page, but that is usually far too long). Alternatively, a right click could open a separate small window, such as the KIM KnowledgeBase Explorer, which shows the relevant portion of the ontology and can then be navigated further.
Given that the automatically generated pages have URLs, they can be used as URIs
for resources in RDF triples. This means that the system can automatically collect a list of bookmarks for each user, time-stamped, and with the possibility for the user to add further semantic information, if desired. They can then access these from any machine. A similar idea has been proposed in the SHAWN semantic wiki [Aum05]. However, here we can go one step further, by automatically associating the concepts, instances, and relations found on this page as semantic tags for it. Clustering this information would enable us to build a profile of the user's expertise and interests in the domain, and then also deploy an Amazon-style recommender system ("Other GATE users interested in <domain concept/instance> also visited these pages: ...").
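The proposed "users interested in X also visited" mechanism can be sketched as simple co-visitation counting over per-user logs of visited concept pages. The data and function names below are illustrative only.

```python
from collections import Counter

# Hypothetical per-user logs of visited concept pages.
VISITS = {
    "alice": ["POSTagger", "Tokeniser", "Gazetteer"],
    "bob":   ["POSTagger", "Gazetteer"],
    "carol": ["Tokeniser", "Ontology"],
}

def recommend(concept, visits, top_n=2):
    """Rank pages most often co-visited with `concept` by users
    who visited the concept's page themselves."""
    counts = Counter()
    for pages in visits.values():
        if concept in pages:
            counts.update(p for p in pages if p != concept)
    return [page for page, _ in counts.most_common(top_n)]

print(recommend("POSTagger", VISITS))  # ['Gazetteer', 'Tokeniser']
```

A production system would weight visits by recency and by the semantic tags attached to each page, but plain co-visitation counts already yield the Amazon-style ranking described above.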
The development of this scenario will proceed in two phases. Firstly, we'll implement the automatic generation of reference web pages, from queries against the knowledge store. Next, we'll evaluate the result with GATE users and identify possible improvements. We'll also evaluate whether the developers are happy to engage in metadata editing. If they are, then in the second phase we'll consider moving towards a semantic wiki, probably by taking an existing one off the shelf, after suitable evaluation.
and a suitable interface for semantic-based search and browse. A similar problem is being
currently addressed in the context of digital libraries [WTA06] and we will consider using
some of these techniques as well.
Since users tend to post messages on the discussion forums when they have failed to find a solution to their problem in the user manuals and earlier forum postings, analysing which topics are being discussed can also identify potential weaknesses in the current documentation, which can then be rectified. Again, this is an example where classification with respect to an ontology can help with the maintenance and updating of software documentation.
In order to verify this hypothesis, we carried out a limited manual analysis of topics
on the GATE forum postings between April and June 2006. All threads were classified as
belonging to one of four topics: questions about the core system, existing plugins (i.e., op-
tional components), problems with writing new plugins, and problems with writing new
applications. Table 5.1 shows the results, which indicate that the majority of the users are
having problems with using the core system or some of its plugins. Some of the problems
are clearly due to shortcomings of the user guide, e.g., lack of information on which class implements a certain plugin, which directories contain the corresponding configuration files, and what parameters a plugin takes. All of this information changes over
time with each release and maintaining the user guide is therefore a time-consuming task.
However, ontology-based search could provide the user with this information, if the on-
tology concepts have been learnt from the source code and this provenance information
has been kept in the ontology itself (e.g., via implementedInClass or referencedInClass
properties).
Table 5.1: Number of questions posted each month under the different categories.
Another aspect is queries that identify all places in the code, documentation, and mailing list postings which are relevant to a particular instance, or to a more complex conceptual query involving several instances and properties. The results could be presented as a generated summary HTML page with pointers to the original documents. However, given the size of some of these documents, it might also be useful to highlight the relevant parts within them. This may be implemented in two ways:
- when processed, all documents are saved back to their original location with the
new semantic markup inserted, and we then only highlight these tags when needed
CHAPTER 5. TRANSITIONING SCENARIOS 32
- overlay the highlights on the original page, as the KIM plugin does [PKK+04].
We will need to investigate how this plugin can be integrated within the case study, or
how to implement something similar if it is not appropriate. The key problem is
that some documents live in SourceForge's SVN, others in our own SVN, while
others still are emails on Gmane.
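The first option, inserting semantic markup into the documents themselves, can be sketched as follows. This is an illustration only: the document text, annotation offsets, and concept URI are invented, and the span/attribute names are not a committed markup format.

```python
def highlight(text, annotations):
    """Insert <span> markup around annotated character spans.

    `annotations` is a list of (start, end, concept-URI) tuples. Inserting
    from the last span backwards keeps the earlier offsets valid.
    """
    for start, end, uri in sorted(annotations, key=lambda a: a[0], reverse=True):
        text = (text[:start]
                + f'<span class="sem" about="{uri}">' + text[start:end] + "</span>"
                + text[end:])
    return text

doc = "The Transfer PR copies annotations between sets."
out = highlight(doc, [(4, 12, "gate:TransferPR")])  # "Transfer" is chars 4-12
```

At display time, highlighting then reduces to styling the inserted spans, rather than re-running the annotation process.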
any user/reader with finding relevant information from the forum postings, which are not
tightly controlled artefacts. However, the recommendation mechanisms in Hipikat would
be relevant, if we are to provide rankings of the matched results from the semantic search.
In this case, the triples are constructed with a domain London, as this is the page title.
1. http://wiki.ontoworld.org/index.php/Semantic_MediaWiki
The category at the end defines what can be thought of as the class of the wiki page's
title, London in this case.
An example query is:
<ask>
[[Category:Publication]]
[[project:SEKT]]
[[published: >= 1.1.2006]]
</ask>
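A minimal sketch of how such a query could be evaluated over page-level property/value pairs follows. The page titles and data below are invented for illustration; this is not Semantic MediaWiki's actual query engine.

```python
from datetime import date

# Hypothetical per-page annotations: each wiki page carries property/value pairs.
pages = {
    "SEKT-D1.1": {"Category": "Publication", "project": "SEKT",
                  "published": date(2006, 3, 15)},
    "OldReport": {"Category": "Publication", "project": "SEKT",
                  "published": date(2005, 6, 1)},
}

def ask(pages, category, project, published_since):
    """Return titles of pages matching all three conditions of the <ask> query."""
    return [title for title, props in pages.items()
            if props.get("Category") == category
            and props.get("project") == project
            and props.get("published") >= published_since]

print(ask(pages, "Publication", "SEKT", date(2006, 1, 1)))
# → ['SEKT-D1.1']
```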
The semantic statements are typically edited in the standard text-based wiki editor,
which is convenient for expert users, but is also error-prone, e.g., spelling errors could
result in the creation of a new typed link instead of the intended one. In order to help
avoid this problem and make the wiki more attractive to novices, IkeWiki [Sch06] also
supports a WYSIWYG editor, which helps with the addition/deletion of triples by showing
a drop-down list of available link types (and values, if these already exist). The same
problem is addressed via an extendable set of widgets in OntoWiki [ADR06].
The semantic wikis discussed so far suffer from several problems [ODM+ 06]. Firstly,
the semantic annotation on each page assumes that the domain of the RDF triple is the
main subject of the page. Secondly, they assume that each wiki page describes a concept,
whereas this might not necessarily be the case (e.g., Wikipedia disambiguation pages).
To help enrich this simple approach, SemperWiki [ODM+ 06] defines a formal model
of annotation which, in addition to the subject-predicate-object triple, also encodes the
context in which the annotation was made (e.g., provenance (who), timestamp (when), or
limits on validity (where)). A distinction is made between formal annotations (in a
machine-processable format such as RDF) and semantic annotations (formal annotations referring to
ontologies). In addition, a distinction is made between the document describing a concept
(e.g., wiki page about London) and the concept itself (i.e., London). SemperWiki uses
URLs for the former and URNs for the latter, which enables clean separation.
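The separation can be illustrated with a small sketch. The identifiers below are invented for illustration; SemperWiki's actual URN scheme and wiki addresses may differ.

```python
# Hypothetical mapping from a concept (identified by a URN) to the wiki
# document that describes it (identified by a URL). Statements about the
# concept "London" and statements about the page about London stay distinct.
doc_of = {"urn:semperwiki:concept:London": "http://wiki.example.org/wiki/London"}

concept = "urn:semperwiki:concept:London"
page = doc_of[concept]

assert concept.startswith("urn:")   # the concept itself: not dereferenceable
assert page.startswith("http://")   # the document: an ordinary web address
```

A triple such as "has population 7 million" then attaches to the URN, while "was last edited in 2006" attaches to the URL, which is exactly the clean separation described above.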
Another point that needs consideration is that most semantic wikis support document-
level annotation only (see [ODM+ 06]), i.e., the RDF/OWL statements are typically about
the concept defined in the page, whereas content augmentation techniques typically create
finer-grained annotations at sentence or word level. Also, given that content augmentation
techniques typically annotate with respect to one or more ontologies, it is necessary for a
semantic wiki (if used), to support namespaces or other ways in which the URIs can refer
to resources outside the wiki itself (see [ODM+ 06] for some wikis that support what they
call terminology reuse).
One of the outstanding problems that has only recently started to be addressed
([VK06, Sch06]) is how to relate already existing formal ontologies to semantic wikis,
with their more document-centric, wiki-internal, less formal approach to ontologies. This
would be an interesting issue to consider in TAO, because the ontology learning and con-
tent augmentation services would have produced, and would be continuously updating, the
3. http://www.ifi.unizh.ch/attempto/
4. http://rewerse.net/
5. http://www.ifi.unizh.ch/attempto/tools/
6. POS: part of speech, e.g., noun, verb, adjective, determiner.
7. http://www.monrai.com/products/cypher/cypher_manual
Chapter 6
Conclusion
This deliverable presented a number of requirements towards the TAO methodology and
technology. As a result, we defined the transitioning scenarios to be addressed in this
project. The result of the redesign process, discussed in Chapter 4, provides the starting
point for the definition of the semantic web services, with the help of tools from the TAO
suite.
The work planned for the next two years on the transitioning scenarios will lead to
a thorough evaluation of the benefits from semantic technologies by comparing against
the functionalities of the legacy application. The evaluation will be along the following
dimensions:
- ease of finding information on a given topic, with and without semantic-based access
- scalability of the knowledge stores and ability to store all case study data
- ease of customisation and maintenance of the ontology learning and content augmentation services
In order to measure these, we will carry out a number of task-based evaluations in year
3 and the results will be used for exploitation activities towards the software engineering
sector.
Bibliography
[ADR06] S. Auer, S. Dietzold, and T. Riechert. OntoWiki: A Tool for Social, Semantic
Collaboration. In Proceedings of the Fifth International Semantic Web
Conference (ISWC'06), 2006.
[DRR+ 05] B. Decker, E. Ras, J. Rech, B. Klein, and C. Hoecht. Self-Organized Reuse
of Software Engineering Knowledge Supported by Semantic Wikis. In Work-
shop on Semantic Web Enabled Software Engineering (SWESE), Galway, Ire-
land, 2005.
[FKK+06] Norbert E. Fuchs, Kaarel Kaljurand, Tobias Kuhn, Gerold Schneider, Loïc
Royer, and Michael Schröder. Attempto Controlled English and the Semantic
Web. Deliverable I2D7, REWERSE Project, April 2006.
[Kal06] Kaarel Kaljurand. Writing OWL ontologies in ACE. Technical report,
University of Zurich, August 2006.
[LM04] Vanessa Lopez and Enrico Motta. Ontology driven question answering in
AquaLog. In NLDB 2004 (9th International Conference on Applications of
Natural Language to Information Systems), Manchester, 2004.
[MH05] H. Muljadi and T. Hideaki. Semantic wiki as an integrated content and meta-
data management system. In Poster Session at the International Semantic
Web Conference (ISWC05), 2005.
[Sab06] M. Sabou. Building Web Service Ontologies. PhD thesis, Vrije Universiteit,
2006.
[Sow02] J. Sowa. Architectures for intelligent systems. IBM Systems Journal, 41(3),
2002.
[SP05] M. Sabou and J. Pan. Towards Improving Web Service Repositories through
Semantic Web Techniques. In Workshop on Semantic Web Enabled Software
Engineering (SWESE), Galway, Ireland, 2005.
[TR06] S. Thaddeus and S.V. Kasmir Raja. A Semantic Web Tool for Knowledge-based
Software Engineering. In Workshop on Semantic Web Enabled Software
Engineering (SWESE), Athens, GA, USA, 2006.
This appendix shows a sample WSDL of a GAS service, followed by a service definition
file.
APPENDIX A. SAMPLE WSDL AND SERVICE DEFINITION 44
</element>
<element name="getOptionalParameterNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getOptionalParameterNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="getInputAnnotationSetNames">
<complexType/>
</element>
<element name="getInputAnnotationSetNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getInputAnnotationSetNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="getOutputAnnotationSetNames">
<complexType/>
</element>
<element name="getOutputAnnotationSetNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getOutputAnnotationSetNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="processRemoteDocument">
<complexType>
<sequence>
<element name="executiveLocation" type="xsd:anyURI"/>
<element name="taskId" type="xsd:string"/>
<element name="docServiceLocation" type="xsd:anyURI"/>
<element name="docId" type="xsd:string"/>
<element maxOccurs="unbounded" name="annotationSets"
type="impl:AnnotationSetMapping"/>
<element maxOccurs="unbounded" name="parameterValues"
type="impl:ParameterValue"/>
</sequence>
</complexType>
</element>
<complexType name="AnnotationSetMapping">
<sequence>
<element name="docServiceASName" nillable="true" type="xsd:string"/>
<element name="gateServiceASName" nillable="true" type="xsd:string"/>
</sequence>
</complexType>
<complexType name="ParameterValue">
<sequence>
<element name="name" nillable="true" type="xsd:string"/>
<element name="value" nillable="true" type="xsd:string"/>
</sequence>
</complexType>
<element name="processRemoteDocumentResponse">
<complexType/>
</element>
<complexType name="GateWebServiceFault">
<sequence/>
</complexType>
<element name="fault" type="impl:GateWebServiceFault"/>
<element name="processDocument">
<complexType>
<sequence>
<element name="documentXml" type="xsd:string"/>
<element maxOccurs="unbounded" name="parameterValues"
type="impl:ParameterValue"/>
</sequence>
</complexType>
</element>
<element name="processDocumentResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded" name="processDocumentReturn"
type="impl:AnnotationSetData"/>
</sequence>
</complexType>
</element>
<complexType name="AnnotationSetData">
<sequence>
<element name="name" nillable="true" type="xsd:string"/>
<element name="xmlData" nillable="true" type="xsd:string"/>
</sequence>
</complexType>
</schema>
</wsdl:types>
<wsdl:message name="getRequiredParameterNamesResponse">
<wsdl:part element="impl:getRequiredParameterNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getOptionalParameterNamesResponse">
<wsdl:part element="impl:getOptionalParameterNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getRequiredParameterNamesRequest">
<wsdl:part element="impl:getRequiredParameterNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getInputAnnotationSetNamesResponse">
<wsdl:part element="impl:getInputAnnotationSetNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processRemoteDocumentRequest">
<wsdl:part element="impl:processRemoteDocument" name="parameters"/>
</wsdl:message>
<wsdl:message name="GateWebServiceFault">
<wsdl:part element="impl:fault" name="fault"/>
</wsdl:message>
<wsdl:message name="getOptionalParameterNamesRequest">
<wsdl:part element="impl:getOptionalParameterNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getOutputAnnotationSetNamesRequest">
<wsdl:part element="impl:getOutputAnnotationSetNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getOutputAnnotationSetNamesResponse">
<wsdl:part element="impl:getOutputAnnotationSetNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processRemoteDocumentResponse">
<wsdl:part element="impl:processRemoteDocumentResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processDocumentRequest">
<wsdl:part element="impl:processDocument" name="parameters"/>
</wsdl:message>
<wsdl:message name="getInputAnnotationSetNamesRequest">
<wsdl:part element="impl:getInputAnnotationSetNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processDocumentResponse">
<wsdl:part element="impl:processDocumentResponse"
name="parameters"/>
</wsdl:message>
<wsdl:portType name="GateWebService">
<wsdl:operation name="getRequiredParameterNames">
<wsdl:input message="impl:getRequiredParameterNamesRequest"
name="getRequiredParameterNamesRequest"/>
<wsdl:output message="impl:getRequiredParameterNamesResponse"
name="getRequiredParameterNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="getOptionalParameterNames">
<wsdl:input message="impl:getOptionalParameterNamesRequest"
name="getOptionalParameterNamesRequest"/>
<wsdl:output message="impl:getOptionalParameterNamesResponse"
name="getOptionalParameterNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="getInputAnnotationSetNames">
<wsdl:input message="impl:getInputAnnotationSetNamesRequest"
name="getInputAnnotationSetNamesRequest"/>
<wsdl:output message="impl:getInputAnnotationSetNamesResponse"
name="getInputAnnotationSetNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="getOutputAnnotationSetNames">
<wsdl:input message="impl:getOutputAnnotationSetNamesRequest"
name="getOutputAnnotationSetNamesRequest"/>
<wsdl:output message="impl:getOutputAnnotationSetNamesResponse"
name="getOutputAnnotationSetNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="processRemoteDocument">
<wsdl:input message="impl:processRemoteDocumentRequest"
name="processRemoteDocumentRequest"/>
<wsdl:output message="impl:processRemoteDocumentResponse"
name="processRemoteDocumentResponse"/>
<wsdl:fault message="impl:GateWebServiceFault" name="GateWebServiceFault"/>
</wsdl:operation>
<wsdl:operation name="processDocument">
<wsdl:input message="impl:processDocumentRequest"
name="processDocumentRequest"/>
<wsdl:output message="impl:processDocumentResponse"
name="processDocumentResponse"/>
<wsdl:fault message="impl:GateWebServiceFault"
name="GateWebServiceFault"/>
</wsdl:operation>
</wsdl:portType>
<wsdl:binding name="GateWebService" type="impl:GateWebService">
<wsdlsoap:binding style="document"
transport="http://schemas.xmlsoap.org/soap/http"/>
<wsdl:operation name="getRequiredParameterNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getRequiredParameterNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getRequiredParameterNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="getOptionalParameterNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getOptionalParameterNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getOptionalParameterNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="getInputAnnotationSetNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getInputAnnotationSetNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getInputAnnotationSetNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="getOutputAnnotationSetNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getOutputAnnotationSetNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getOutputAnnotationSetNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="processRemoteDocument">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="processRemoteDocumentRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="processRemoteDocumentResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
<wsdl:fault name="GateWebServiceFault">
<wsdlsoap:fault name="GateWebServiceFault" use="literal"/>
</wsdl:fault>
</wsdl:operation>
<wsdl:operation name="processDocument">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="processDocumentRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="processDocumentResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
<wsdl:fault name="GateWebServiceFault">
<wsdlsoap:fault name="GateWebServiceFault" use="literal"/>
</wsdl:fault>
</wsdl:operation>
</wsdl:binding>
<wsdl:service name="GateWebServiceService">
<wsdl:port binding="impl:GateWebService" name="GATEService">
<wsdlsoap:address location="http://..."/>
</wsdl:port>
</wsdl:service>
</wsdl:definitions>
This GAS service definition file specifies that the unnamed (default) annotation set is
used as input (e.g., to read tokens), and that the annotation set called Output is used for
storing the annotations produced by the service. In addition, the service has one parameter,
called annTypes, whose value is passed as the value of the annotationTypes parameter
on the processing resource Transfer, embedded in the GAS service. As different services
have different embedded PRs, their service definitions specify which parameters are
passed on to which embedded PR.
<service xmlns="http://gate.ac.uk/gate-service/definition">
<annotationSets>
<annotationSet name="" in="true" />
<annotationSet name="Output" out="true" />
</annotationSets>
<parameters>
<param name="annTypes" optional="true">
<runtimeParameter prName="Transfer" prParam="annotationTypes" />
</param>
</parameters>
</service>
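As an illustration of how a client might read such a definition file, the following sketch parses the XML above with Python's standard library. This is illustrative only, not part of the GAS implementation.

```python
import xml.etree.ElementTree as ET

NS = "{http://gate.ac.uk/gate-service/definition}"

# The service definition from above, inlined for a self-contained example.
definition = """<service xmlns="http://gate.ac.uk/gate-service/definition">
  <annotationSets>
    <annotationSet name="" in="true"/>
    <annotationSet name="Output" out="true"/>
  </annotationSets>
  <parameters>
    <param name="annTypes" optional="true">
      <runtimeParameter prName="Transfer" prParam="annotationTypes"/>
    </param>
  </parameters>
</service>"""

root = ET.fromstring(definition)

# Which annotation sets are read and written by the service.
inputs = [a.get("name") for a in root.iter(NS + "annotationSet") if a.get("in") == "true"]
outputs = [a.get("name") for a in root.iter(NS + "annotationSet") if a.get("out") == "true"]

# How each service-level parameter maps onto a parameter of an embedded PR.
mapping = {p.get("name"): (rp.get("prName"), rp.get("prParam"))
           for p in root.iter(NS + "param")
           for rp in p.iter(NS + "runtimeParameter")}

print(inputs, outputs, mapping)
```

For the definition above this recovers the unnamed input set, the Output set, and the annTypes → (Transfer, annotationTypes) mapping described in the text.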