
EU-IST Strategic Targeted Research Project (STREP) IST-2004-026460 TAO

TAO: Transitioning Applications to Ontologies

D6.1 Case Study 1: Requirement analysis and application of TAO
methodology in data intensive applications

Kalina Bontcheva, Ian Roberts, Milan Agatonovic, Julien Nioche,
James Sun (University of Sheffield)

Abstract.
Deliverable D6.1 (WP6)

This deliverable describes the results of the requirements analysis and the definition of transitioning
scenarios to semantic technologies, carried out for case study 1, the open-source software
engineering case study. It provides an outline of the application to be refactored and an
introduction to the underlying application-specific concepts. A structured set of requirements
is collected and exploited as a starting point to set up precise technical directions that will be
followed in the re-engineering work. This work resulted in the technical specification and scope
delimitation of the transitioning scenarios regarding ontology acquisition, semantic annotation,
and semantic-based access.

Keyword list: Semantic Modelling, Ontology Acquisition, Semantic Annotation, Semantic Web
Services, SOA, software engineering

Document Id. TAO/2007/D6.1/v1.0


Project TAO IST-2004-026460
Date April 10, 2007
Distribution public

Copyright © 2007 University of Sheffield (USFD)
TAO Consortium

This document is part of a research project partially funded by the IST Programme of the Commission of the European
Communities as project number IST-2004-026460.

University of Sheffield
Department of Computer Science
Regent Court, 211 Portobello St.
Sheffield S1 4DP
UK
Tel: +44 114 222 1891, Fax: +44 114 222 1810
Contact person: Kalina Bontcheva
E-mail: K.Bontcheva@dcs.shef.ac.uk

Mondeca
3, cité Nollez
75018 Paris
France
Tel: +33 (0) 1 44 92 35 03, Fax: +33 (0) 1 44 92 02 59
Contact person: Jean Delahousse
E-mail: jean.delahousse@mondeca.com

University of Southampton
Southampton SO17 1BJ
UK
Tel: +44 23 8059 8343, Fax: +44 23 8059 2865
Contact person: Terry Payne
E-mail: trp@ecs.soton.ac.uk

Sirma Group Corp., Ontotext Lab
Office Express IT Centre, 5th Floor
135 Tsarigradsko Shose
Sofia 1784
Bulgaria
Tel: +359 2 9768, Fax: +359 2 9768 311
Contact person: Atanas Kiryakov
E-mail: naso@sirma.bg

Atos Origin Sociedad Anonima Espanola
Dept Stream-Public Sector
Atos Origin Spain, C/Albarracin, 25, 28037 Madrid
Spain
Tel: +34 91 214 8610, Fax: +34 91 754 3252
Contact person: Nuria de Lama
E-mail: nuria.delama@atosorigin.com

Dassault Aviation SA
DGT/DPR
78, quai Marcel Dassault
92552 Saint-Cloud Cedex 300
France
Tel: +33 1 47 11 53 00, Fax: +33 1 47 11 53 65
Contact person: Farid Cerbah
E-mail: Farid.Cerbah@dassault-aviation.com

Jozef Stefan Institute
Jamova 39
1000 Ljubljana
Slovenia
Tel: +386 1 4773 778, Fax: +386 1 4251 038
Contact person: Marko Grobelnik
E-mail: marko.grobelnik@ijs.si
Executive Summary

This deliverable describes the results of the requirements analysis and the definition of
transitioning scenarios to semantic technologies, carried out for case study 1, the open-source
software engineering case study. It provides an outline of the application to be refactored
and an introduction to the underlying application-specific concepts. A structured set of
requirements is collected and exploited as a starting point to set up precise technical
directions that will be followed in the re-engineering work. This work resulted in the
technical specification and scope delimitation of the transitioning scenarios regarding
ontology acquisition, semantic annotation, and semantic-based access.
The challenge addressed by this case study is that successful code reuse and bug avoidance
in software engineering require numerous qualities, both of the library code and of the
development staff; two important qualities are ease of identification of relevant components
and ease of understanding of their parameters and usage profiles. The attraction of using
semantic technology to address this problem lies in its potential to transform existing
software documentation into a conceptually organised and semantically interlinked knowledge
space that incorporates both unstructured data from multiple software artefacts (forum
postings, manuals) and structured data from source code and configuration files. The
enriched information can then be used to add novel functionality to web-based documentation
of the software concerned, providing the developer with new and powerful ways to locate and
integrate components (either for reuse or for integration with new development). The
research is being evaluated on transitioning an existing open source project, GATE, which
has a long development history and extensive documentation in multiple formats and
modalities.
The advantage of transitioning GATE to ontologies will be two-fold. Firstly, GATE
components and services will be easier to discover and integrate within other applications
due to the use of semantic web service technology.
Secondly, users will be able to use knowledge access tools to easily find all information
relevant to a given GATE concept, searching across all the different software artefacts:
the GATE documentation, XML configuration files, video tutorials, screen shots, the user
discussion forum, etc.
In this context, one of the key requirements towards the semantic web technology in
our application is that it should be applicable to existing software projects and require little
human involvement in its creation and maintenance. Since the majority of projects do not
already maintain an ontology, an essential part of our methodological and technological
approach is developing automatic methods for ontology learning and content augmentation,
specifically tailored to software artefacts.


Overall, there are three kinds of tasks to be completed to transform GATE into a
semantic service-oriented architecture:

the elicitation of an ontology to represent the concepts treated by GATE components
(application of ontology learning components);

alongside this, the exposure of semantic-based access to the GATE document base
(application of knowledge stores and content augmentation technology);

finally, uniting the results of these two processes, to address the various application-
specific scenarios discussed in this deliverable.

In general, this case study has an ambitious scope. However, it should be stressed that
not all parts will be equally developed. The main effort will be focused on the ontology
acquisition, content augmentation, and knowledge access scenarios, elaborated below, as
they are the key building blocks of the transitioned application.



Contents

1 Introduction

2 GATE as a Legacy Application
  2.1 GATE Overview
    2.1.1 Main Architectural Elements
    2.1.2 Data representation and storage
    2.1.3 Execution strategies
  2.2 Different GATE Software Artefacts
    2.2.1 Source code and Javadoc
    2.2.2 User forum postings
    2.2.3 User, developer, and annotator manual
    2.2.4 Plugins, resources, and configuration files

3 Requirements Analysis
  3.1 Requirements towards the SOA-based GATE
    3.1.1 Computational complexity and load balancing
    3.1.2 Combined Methodological and Technological Support
    3.1.3 Multi-role Support
    3.1.4 Service-based Access to Key Components
    3.1.5 Support for Building Service-based Applications
    3.1.6 Compatibility with the Existing GATE Development Environment
  3.2 Requirements towards the TAO Technology

4 Design of the GATE Web Services
  4.1 Architecture Overview
  4.2 GAS: GATE Annotation Services
    4.2.1 The saved application state
    4.2.2 Service definition
    4.2.3 Using a GAS service
  4.3 From SOA to Semantic Web Services

5 Transitioning Scenarios
  5.1 Ontology Acquisition
  5.2 Semantic Content Augmentation
  5.3 Semantic-based Access Scenarios: Overview
    5.3.1 Automatic creation of reference pages from the ontology
    5.3.2 Natural language queries for semantic search
    5.3.3 Semantic-based filtering of forum postings
    5.3.4 Expertise Location
  5.4 Related Work
    5.4.1 Semantic Technologies for Software Engineering
    5.4.2 Semantic wikis
    5.4.3 Semantic-based search and browsing

6 Conclusion

A Sample WSDL and Service Definition


Chapter 1

Introduction

GATE (http://gate.ac.uk) [CMBT02] is a leading open-source architecture and infrastructure for the build-
ing and deployment of Human Language Technology applications, used by thousands of
users at hundreds of sites. The development team consists at present of over 15 peo-
ple, but over the years more than 30 people have been involved in the project. As such,
this software product exhibits all the specific problems that large software architectures
encounter and has been chosen as the data-intensive case study in the TAO project.
While GATE has increasingly facilitated the development of knowledge-based appli-
cations with semantic features (e.g. [BTMC04, KPO+ 04, Sab06]), its own implementa-
tion has continued to be based on compositions of functionalities justified at the syntactic
level and understood through informal, human-readable documentation. By its very nature as a suc-
cessful and accepted general architecture, a systematic understanding of its concepts and
their relation is shared between its human users. It is simply that this understanding has
not been formalised into a description that can be reasoned about by machines or made
easier to access by new users. Indeed, GATE users who want to learn about the system
find it difficult due to the large amount of heterogeneous information, which cannot be
accessed via a unified interface.
The advantage of transitioning GATE to ontologies will be two-fold. Firstly, GATE
components and services (see Chapter 4) will be easier to discover and integrate within
other applications (see [SP05, Sap05]) due to the use of semantic web service technology.
Secondly, users will be able to use knowledge access tools (see Chapter 5) to easily find
all information relevant to a given GATE concept, searching across all the different
software artefacts: the GATE documentation, XML configuration files, video tutorials,
screen shots, the user discussion forum, etc.
The work in this case study over the course of the TAO project is structured in three
phases, corresponding to the three years of the project (see Figure 1.1).
Work in the first year is dedicated to requirements analysis; definition of the GATE
web services, following the methodology; data provision to all technical workpackages,
and definition of the scenarios to be covered by the case study in subsequent years.
The effort in the second year will focus on experimentation and customisation of
the ontology learning (WP2) and content augmentation (WP3) tools for the needs of the
case study. In more detail, ontology learning will be applied to the GATE source code,
javadocs, and forum postings in order to extract a first version of the domain ontology.
This will then be refined manually by a GATE expert and stored into the knowledge stores
from WP4. Then the content augmentation tools will be applied to semantically index all
software artefacts, and the results will be accessible via the knowledge store.
In the last year, the work will address semantic annotation of the WSDL service def-
initions and the use of the TAO Suite to address the case study scenarios. The case study
prototype will be evaluated against the legacy system along several dimensions, such as
search and navigation within the software documentation. The report will also provide an
assessment of the usefulness of the new functionalities, such as content augmentation and
semantic web services.

Figure 1.1: GATE Case Study

This deliverable is structured as follows. Chapter 2 describes the legacy application


and its key concepts. Next, Chapter 3 defines a set of requirements towards the TAO
methodology and various TAO technologies. The transitioning of the legacy application
towards a SOA approach is discussed in Chapter 4. Chapter 5 presents the knowledge
access scenarios to be addressed in the case study prototype. Finally, we draw some
conclusions.
Chapter 2

GATE as a Legacy Application

2.1 GATE Overview


GATE [Cun02] is an architecture, a framework and a development environment for LE
(Language Engineering). As an architecture, it defines the organisation of an LE sys-
tem and the assignment of responsibilities to different components, and ensures that the
component interactions satisfy the system requirements. As a framework, it provides a
reusable design for an LE software system and a set of prefabricated software building
blocks that language engineers can use, extend and customise for their specific needs. As
a development environment, it helps its users to minimise the time they spend building
new LE systems or modifying existing ones, by aiding overall development and provid-
ing a debugging mechanism for new modules. GATE's component-based model allows for
easy coupling and decoupling of processors, thereby facilitating comparison of alternative
configurations of the system or different implementations of the same module (e.g.,
different parsers). The availability of tools for easy visualisation of data at each point
during the development process aids immediate interpretation of the results.
The GATE framework comprises a core library (analogous to a bus backplane) and
a set of reusable LE modules. The framework implements the architecture and provides
(amongst other things) facilities for processing and visualising resources, including rep-
resentation, import and export of data.
The reusable modules provided as plugins within the backplane are able to perform
basic language processing tasks such as POS tagging and semantic tagging. This elim-
inates the need for users to keep recreating the same kind of resources, and provides a
good starting point for new applications.
Applications developed within GATE can be deployed outside its Graphi-
cal User Interface (GUI), using programmatic access via the GATE API (see
http://gate.ac.uk); GATE itself is freely available for download from the same site. In
addition, the reusable modules, the document and annotation model, and the visualisation
components can all be used independently of the de-
velopment environment.
GATE components may be implemented in a variety of programming languages and
databases, but in each case they are represented to the system as a Java class. This class
may simply call the underlying program or provide an access layer to a database; alterna-
tively it may implement the whole component.

2.1.1 Main Architectural Elements


GATE distinguishes between data, algorithms, and ways of visualising them. In other
words, GATE components are one of three types:

Language Resources (LRs) represent entities such as documents, corpora and on-
tologies;

Processing Resources (PRs) represent entities that are primarily algorithmic, such
as parsers, generators or ngram modellers;

Visual Resources (VRs) represent visualisation and editing components that par-
ticipate in GUIs.

These resources can be local to the user's machine or remote (available via HTTP), and
all can be extended by users without modification to GATE itself.
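
To make this concrete, the following is a minimal sketch of a processing resource written against the GATE Java API (the class name and its behaviour are invented for illustration):

import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ExecutionException;

// Illustrative PR sketch: stores the character count of the
// document being processed as a document feature.
public class DocumentSizer extends AbstractLanguageAnalyser {
  public void execute() throws ExecutionException {
    if (getDocument() == null)
      throw new ExecutionException("No document to process");
    long size = getDocument().getContent().size();
    getDocument().getFeatures().put("characterCount", size);
  }
}

Such a class would then be declared in the CREOLE repository file described below, making it discoverable and loadable by the framework.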
One of the main advantages of separating the algorithms from the data they require is
that the two can be developed independently by language engineers with different types of
expertise, e.g. programmers and linguists. Similarly, separating data from its visualisation
allows users to develop alternative visual resources, while still using a language resource
provided by GATE.
Collectively, all resources are known as CREOLE (a Collection of REusable Objects
for Language Engineering), and are declared in a repository XML file, which describes
their name, implementing class, parameters, icons, etc. This repository is used by the
framework to discover and load available resources.
For example, the corpus LR is declared as:

<RESOURCE>
<NAME>GATE corpus</NAME>
<CLASS>gate.corpora.CorpusImpl</CLASS>
<COMMENT>GATE transient corpus</COMMENT>
<INTERFACE>gate.Corpus</INTERFACE>
<PARAMETER NAME="documentsList"
COMMENT="A list of GATE documents"
CHAPTER 2. GATE AS A LEGACY APPLICATION 7

Figure 2.1: GATEs document viewer/editor

OPTIONAL="true"
ITEM_CLASS_NAME="gate.Document"
>java.util.List</PARAMETER>
<ICON>lrs.gif</ICON>
</RESOURCE>

A PARAMETER tag describes a parameter which the resource needs when created or
executed. Parameters can be optional, e.g. if a document list is provided when the corpus
is constructed, it will be populated automatically with these documents.
When an application is developed within GATE's graphical environment, the user
chooses which processing resources go into it (e.g. tokeniser, POS tagger), in what order
they will be executed, and on which data (e.g. document or corpus).
The execution parameters of each resource are also set there, e.g. a loaded document
is given as a parameter to each PR. When the application is run, the modules will be
executed in the specified order on the given document. The results can be viewed in the
document viewer/editor (see Figure 2.1).

Figure 2.1: GATE's document viewer/editor

2.1.2 Data representation and storage


GATE supports a variety of formats including XML, RTF, HTML, SGML, email and plain
text. In all cases, when a document is created/opened in GATE, the format is analysed and
converted into a single unified model of annotation. The annotation format is a modified
form of the TIPSTER format [Gri97] which has been made largely compatible with the
Atlas format [BDG+ 00], and uses the now standard mechanism of stand-off markup
[TM97]. The annotations associated with each document are a structure central to GATE,
because they encode the language data read and produced by each processing module.
The GATE framework also provides persistent storage of language resources. It cur-
rently offers three storage mechanisms: one uses relational databases (e.g. Oracle) and
the other two are file-based, using Java serialisation or an XML-based internal format.
GATE documents can also be exported back to their original format (e.g. SGML/XML
for the British National Corpus (BNC)) and the user can choose whether some additional
annotations (e.g. named entity information) are added to it or not.
To summarise, the existence of a unified data structure ensures a smooth communi-
cation between components, while the provision of import and export capabilities makes
communication with the outside world simple.
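
For illustration, the sketch below shows how this unified document model is typically exercised through the GATE Java API (the URL is a placeholder, and GATE must be installed and initialised first):

import gate.*;
import java.net.URL;

public class DocumentModelDemo {
  public static void main(String[] args) throws Exception {
    Gate.init();  // initialise the GATE framework

    // Any supported format (XML, HTML, email, plain text, ...)
    // is converted into the same unified annotation model.
    Document doc =
        Factory.newDocument(new URL("http://gate.ac.uk/index.html"));

    // Stand-off annotations live in annotation sets, kept
    // separate from the document content itself.
    AnnotationSet markup = doc.getAnnotations("Original markups");
    System.out.println(markup.size() + " original markup annotations");

    // Export back to XML, with the annotations included.
    System.out.println(doc.toXml());
  }
}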

2.1.3 Execution strategies


The execution strategies in GATE determine how the algorithms (i.e. processing re-
sources) are combined to form a complete application and how that application is exe-
cuted on the provided input (i.e., language resources such as corpora). These execution
strategies are implemented in GATE as controllers and the user chooses the one most
suitable for their application.
The two main execution strategies commonly found in NLP applications are pipeline
(i.e., sequential) and parallel. For example, GATE's pipeline controller executes the PRs
in the specified order and with the given parameters (e.g., a document). A variation of
this controller is the corpus pipeline, which executes the PRs sequentially over a given
corpus, taking care of loading the documents from the disk (if necessary), processing
each of them, and saving the results.
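
As a sketch of how such a pipeline is assembled programmatically (the class names are those of the standard ANNIE distribution, but the snippet assumes the ANNIE plugin has already been loaded into the CREOLE register):

import gate.*;
import gate.creole.SerialAnalyserController;

public class PipelineDemo {
  public static void main(String[] args) throws Exception {
    Gate.init(); // assumes the ANNIE plugin is then loaded as well

    // The controller implements the sequential execution strategy.
    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");

    // PRs run in the order in which they are added; the splitter
    // relies on the tokens produced by the tokeniser.
    pipeline.add((ProcessingResource) Factory.createResource(
        "gate.creole.tokeniser.DefaultTokeniser"));
    pipeline.add((ProcessingResource) Factory.createResource(
        "gate.creole.splitter.SentenceSplitter"));

    // The corpus pipeline variant applies the PRs to each document.
    Corpus corpus = Factory.newCorpus("demo");
    corpus.add(Factory.newDocument("GATE is developed in Sheffield."));
    pipeline.setCorpus(corpus);
    pipeline.execute();
  }
}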
At present, GATE does not provide a controller which supports parallel execution of
PRs, although this is planned as future work. However, it does support execution of appli-
cations on different machines over data shared on the server, which enables simultaneous
processing of documents.
Other types of execution strategies are conditional (equivalent to an if statement in
programming languages) and iterative (equivalent to loop statements). The conditional
controller (see Figure 2.2) was introduced in GATE v2 in order to allow flexible process-
ing of heterogeneous data (e.g., a corpus consisting of documents from several domains,
styles, and genres). The controller allows the user to specify for each PR the conditions
when it can be executed: always; never; or conditionally only if a particular
feature with a given value is present on the document.
For example, a categorisation module that always fires can assign different genres
to each document as features (e.g., sport, political, business). Then the application can
contain PRs trained specifically for the given type of documents and at execution time,
based on the document type, the controller will determine which PRs are executed. For
example, a text about sport would use a different named entity recogniser from the one
about politics (England is a team in the first case and a location in the second) or a text
in all uppercase would use a different POS tagger (amongst other things) from a text in
mixed case (see Figure 2.2). However, the application will list all alternative PRs (e.g. two
POS taggers and three entity recognisers) but only those appropriate for the current document
will be executed on it, whereas a different set of PRs might be executed on the next one.

Figure 2.2: GATE's conditional controller
The iterative strategy is equivalent to a loop and will enable one or more PRs to be
executed until a given condition is met.
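
The effect of the conditional strategy just described can be pictured with the simplified logic below (an illustrative sketch only, not the actual controller implementation):

import java.util.*;

// Illustrative sketch: a module fires either always or only when a
// document feature has the expected value, mirroring the behaviour
// of the conditional controller described above.
public class ConditionalDemo {
  enum RunMode { ALWAYS, NEVER, CONDITIONAL }

  interface Module { void run(); }

  static void maybeRun(Module m, RunMode mode, String feature,
      String expected, Map<String, String> docFeatures) {
    boolean fire = mode == RunMode.ALWAYS
        || (mode == RunMode.CONDITIONAL
            && expected.equals(docFeatures.get(feature)));
    if (fire) m.run();
  }

  public static void main(String[] args) {
    Map<String, String> doc = new HashMap<>();
    doc.put("genre", "sport"); // set by an always-firing classifier
    // Only the sport-specific recogniser fires on this document.
    maybeRun(() -> System.out.println("sport NE recogniser"),
        RunMode.CONDITIONAL, "genre", "sport", doc);
    maybeRun(() -> System.out.println("politics NE recogniser"),
        RunMode.CONDITIONAL, "genre", "politics", doc);
  }
}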

2.2 Different GATE Software Artefacts


This section describes the different software artefacts which are subject to ontology learn-
ing and content augmentation, prior to being indexed in the semantic repository for knowl-
edge access.
Overall, there are several aspects of these software artefacts that need to be considered:

All data sources are updated frequently, with stable versions corresponding to software
releases. Thus, semantic annotation and access to these sources needs to be made
sensitive to this dynamic aspect.

Some of the data sources are a mixture of structured and unstructured text, and also
images.

The size of the corpus itself is not likely to pose a serious challenge for the semantic
annotation and ontology learning tools. However, the instance data generated from this
analysis is likely to produce millions of triples and thus pose a scalability challenge
for the TAO semantic repository. For instance, the Dhruv system [ASHK06], which
aims at providing semantic-based support for software bug resolution, has reported
such scalability problems, which hampered the development of a full-blown system.

2.2.1 Source code and Javadoc


The GATE source code and javadoc corpus, used within TAO, consists of 536 Java files,
which form the GATE 3.1 release of the system. They are organised in 21 gate packages,
with thousands of lines of code. These files do not include any of the optional plugin
components distributed with GATE 3.1, as these are considered separately below. The
code formatting follows certain coding conventions, including tabulation, variable naming
conventions, Java reserved words, etc. The level of provision of Javadoc comments varies
across the different classes, depending on the developer(s) responsible for their creation
and maintenance.

2.2.2 User forum postings


GATE's discussion forum has on average about 120 posts per month, with up to 15 on
some days. It is available from SourceForge and GMANE.
For the purposes of this project, we downloaded a representative corpus of 2160 post-
ings, dated between 1 January 2005 and 31 July 2006. The emails were downloaded from
GMANE, using a news reader, in order to preserve their original document structure,
i.e., obtain the original email headers and message body format. In addition, GMANE
anonymises all email addresses, which is needed in order to be able to release the corpus
to the research community.

2.2.3 User, developer, and annotator manual


At present, GATE has three software manuals: the user, developer, and annotator manuals.
These are all unstructured texts which come in different formats, e.g., PDF, Word, HTML.
Therefore, TAO technology needs to be able to read these formats and extract the text
content from them for semantic processing.
The user guide for GATE version 3.1 (March 2006) consists of 318 pages in PDF (or
equivalent HTML), including over 30 screen shots and UML diagrams, and also XML
and code snippets. From the perspective of automatic analysis, the user guide size and
content pose a significant challenge.
The developer manual for GATE 3.1 is 40 pages long in PDF and again contains a
mixture of unstructured text, tables, and code snippets.
The annotation manual is 11 pages long in PDF and consists of unstructured text
oriented towards non-programmers.
One important aspect of all these manuals is that they have undergone hundreds of
revisions, with stable archived versions corresponding to the respective GATE releases.
Another challenge comes from the need to analyse and index the images within the man-
uals, not just the text.

2.2.4 Plugins, resources, and configuration files


GATE resources (LRs, PRs, and VRs) are made available to the GATE system using
a plugin mechanism, which is based on a JAR and an XML configuration file, called
creole.xml. At present, the user guide describes around 25 such plugins, but their number
is growing continuously. An example XML configuration file is shown below:

<?xml version="1.0"?>
<CREOLE-DIRECTORY>
<!-- Processing Resources -->
<CREOLE>

<!-- creole.xml for gazetteer list collector -->


<RESOURCE>
<NAME>Gazetteer List Collector</NAME>
<CLASS>gate.creole.GazetteerListsCollector</CLASS>
<COMMENT>Gazetteer lists collector
(http://gate.ac.uk/sale/tao/#sec:misc-creole:listscollector)
</COMMENT>
<PARAMETER NAME="document" RUNTIME="true">
gate.Document
</PARAMETER>

...
</RESOURCE>
</CREOLE>

<!-- Visual Resources -->


<CREOLE>
<!-- VR def for gate Syntax Tree Editor/Viewer -->
<RESOURCE>
<NAME>Syntax tree viewer</NAME>
<CLASS>gate.gui.SyntaxTreeViewer</CLASS>
<!-- type values can be "large" or "small"-->
<GUI>
<MAIN_VIEWER/>
<ANNOTATION_TYPE_DISPLAYED>
Sentence
</ANNOTATION_TYPE_DISPLAYED>
<PARAMETER NAME="tokenType" DEFAULT="Token" RUNTIME="false"
OPTIONAL="true">java.lang.String</PARAMETER>
</GUI>
</RESOURCE>
</CREOLE>
</CREOLE-DIRECTORY>

In addition to a jar and xml file, some plugins also contain Java source code and other
resources required by the specific application, e.g., gazetteer lists, finite-state transduction
grammars, grammar definition files. At the very least, TAO will attempt to utilise the
creole.xml and the Java code.
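
As an indication of how straightforward this structured source is to exploit, resource names and implementing classes can be harvested from a creole.xml file with standard Java XML parsing (a sketch; the file path is illustrative and error handling is omitted):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class CreoleHarvester {
  public static void main(String[] args) throws Exception {
    Document d = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse("plugins/ANNIE/creole.xml");

    // Each RESOURCE element yields a candidate domain term (NAME)
    // and its grounding in the code (CLASS).
    NodeList resources = d.getElementsByTagName("RESOURCE");
    for (int i = 0; i < resources.getLength(); i++) {
      Element r = (Element) resources.item(i);
      String name = r.getElementsByTagName("NAME").item(0).getTextContent();
      String cls = r.getElementsByTagName("CLASS").item(0).getTextContent();
      System.out.println(name + " -> " + cls);
    }
  }
}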
Chapter 3

Requirements Analysis

This chapter defines a set of requirements towards the refactored legacy application,
which are then used to derive a set of requirements towards the TAO semantic technology.

3.1 Requirements towards the SOA-based GATE


Let us start by listing a number of requirements towards the refactored SOA-based
GATE; each is discussed in detail in the remainder of this section:

Support for load balancing by driving several GATE instances through a single
access point

Methodological support for non-specialists wanting to use NLP tools, not purely
technological support (as is currently the case)

Multi-role support, going beyond the single role of language engineers

Service-orientated, not monolithic

Assistive instead of autonomous, e.g., automatic processes helped by humans

Compatibility with existing GATE development environment

3.1.1 Computational complexity and load balancing


Language processing tasks tend to have varying complexity and requirements towards
computation time. Simple tasks, such as tokenisation, are capable of processing hundreds
of kilobytes per second; more complex ones, however, run at tens of kilobytes per second or
even sometimes less. The majority of GATE PRs have a linear execution time [CMB+ 05],
but still when applied to large documents they can take several minutes on each.
At the same time, GATE is not multi-threaded and therefore if a GATE-based service
is to be capable of supporting more than one processing request at any given time, then
what is actually required is something like a pool of several instances of GATE or several
instances of the same service. The expectation is that guidance on how to best address
this issue would be provided by the TAO methodology.
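
One plausible shape for such a pool is sketched below in Java; GateWorker is a hypothetical wrapper around one single-threaded GATE instance with a loaded application:

import java.util.concurrent.*;

// Sketch: serve concurrent annotation requests from a fixed pool
// of single-threaded GATE instances.
public class GasPool {
  private final BlockingQueue<GateWorker> pool;

  public GasPool(int size) throws Exception {
    pool = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++)
      pool.put(new GateWorker()); // each wraps its own GATE instance
  }

  public String annotate(String documentContent) throws Exception {
    GateWorker w = pool.take();   // blocks until an instance is free
    try {
      return w.process(documentContent);
    } finally {
      pool.put(w);                // return the instance to the pool
    }
  }
}

// Hypothetical wrapper; in a real deployment this would hold a
// loaded GATE application and run it over the supplied content.
class GateWorker {
  String process(String content) { return content; }
}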

3.1.2 Combined Methodological and Technological Support


Natural language processing in general, and Information Extraction (IE) in particular, are
typically not end-user applications in themselves, but tend to be components in informa-
tion seeking and knowledge management applications. Despite the breadth and depth of
literature describing algorithms, evaluation protocols and performance statistics for IE,
further methodological work is needed on how to go about specifying, implementing and
customising IE functionality for a new task or domain, especially oriented towards non-
experts. In the same way that GATE is not just an implementation but also an abstract
architecture, so the TAO project aims to increase its impact by defining a methodology
to support the use of GATE-based IE services for content augmentation, which will be
integrated in the TAO infrastructure.
The relevant parts of the TAO methodology will need to cover:

how to decide if IE is applicable to your problem

how to define the problem with reference to a corpus of examples

how to identify similarities with other problems and thence to estimate likely
performance levels

how to design a balance of adaptation interventions for system setup and maintenance

3.1.3 Multi-role Support


GATE has both a class library for programmers embedding NLP in applications and a
development environment for skilled language engineers. However, what is also needed
in the TAO case is support for a wider constituency of users, who are essentially going to
combine web services provided by GATE with other web services (e.g., ontology learning
from WP2) into their own applications.
Another aspect of this is to choose the correct level of granularity for exposing the
services. One extreme is to expose each individual PR as a service and then allow users to

combine them. However, the typical NLP application consists of over 10 PRs, so asking
non-expert users to know how to combine these is probably not realistic. On the other hand,
expert users are likely to require just that. For non-experts the right granularity would be
to use sets of PRs, already composed by an expert user, and exposed as a single service,
e.g., term extractor.
Consequently, the requirement is to define the GATE annotation services in a flexible
way, so they can have different internal granularity and also be composable into larger
services, which can themselves then behave like a single GATE annotation service for
non-expert users.

3.1.4 Service-based Access to Key Components


The following are the key components, which need to be refactored towards service-based
access:

LRs, which can be wrapped as data-intensive services. These require efficient storage
and access of potentially large data sets. At least two types are needed: document and
ontology services.

PRs, which can be wrapped as algorithmic services. These run on different machines,
some being computationally intensive, with input and output passing either through
the data services or directly on an XML document.

At present, we separate data-intensive services into two kinds: document/corpora stor-


age service and ontology/KB service. The rationale behind this separation is that they do
not store the same types of information and search is different (and reasoning is required
in the case of ontologies). This separation also makes it possible to reuse existing knowl-
edge stores, e.g. Sesame or OWLIM.
Key functionalities required from the document storage service (called DocService in
brief):

1. Browse stored contents

2. Create and delete documents and corpora

3. Add/remove documents to/from corpora

4. Concurrent modification of document annotations

5. Search documents according to particular criteria

In order to reduce the amount of information exchanged over the network, it would be
better to retrieve only a part of an LR, e.g. an AnnotationSet rather than a whole document.
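
One possible shape for such a service is sketched below as a Java interface; all names are hypothetical and serve only to make the listed functionalities concrete:

import java.util.List;

// Hypothetical document storage service interface, covering the
// five functionalities listed above plus partial retrieval.
public interface DocService {
  List<String> listCorpora();                         // 1. browse
  String createDocument(String corpusId, String content);
  void deleteDocument(String docId);                  // 2. create/delete
  void addToCorpus(String corpusId, String docId);    // 3. add/remove
  void updateAnnotationSet(String docId, String setName,
                           String annotationXml);     // 4. concurrent edits
  List<String> search(String query);                  // 5. search

  // Retrieve a single annotation set rather than a whole document,
  // to reduce network traffic.
  String getAnnotationSet(String docId, String setName);
}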

The first four functionalities are currently supported by the GATE DataStore, so for
compatibility reasons the document storage service would need to offer a compatible
API.

The search facility is currently not implemented by GATE DataStores, but is required
for the DocService, as users need to be able to retrieve efficiently only the relevant documents
(to minimise network traffic, amongst other things). The idea is to enhance the current
datastore API by adding a search API and thus provide RDBMS-like functionality. The
indexing and querying of the data would be performed on the remote server.
Key functionalities required from the ontology access service:

Create, modify, and access concepts and concept instances

Create, modify, and access properties and property instances

Support ontology languages (OWL, RDF)

3.1.5 Support for Building Service-based Applications


As discussed in Section 2.1.3, GATE applications are built from processing resources,
following a particular execution strategy (typically one of sequential, conditional, or it-
erative). Consequently, when processing resources are made available as web services, a
workflow manager is required in order to allow users to perform service composition and
execution, i.e., this is one more problem which we will address with the help of the TAO
methodology and the corresponding tool support from WP5.
Some processing resources in GATE depend on others being run earlier in the appli-
cation (e.g., part-of-speech tagging requires the text to be tokenised), and at present the
development environment neither makes these dependencies explicit nor helps the user
verify that the composed application will operate correctly. Currently, if a component
does not find the required data at execution time, it issues a warning message and quits;
the application as a whole runs regardless, but produces no results. Consequently, when
PRs are exposed as web services, it might be beneficial if the associated semantic
descriptions could be used to better facilitate this process.
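
The kind of check that semantic service descriptions would enable can be sketched as follows (all types and names are hypothetical):

import java.util.*;

// Hypothetical composition check: verify that every annotation type
// a service requires is produced by some service earlier in the chain.
public class CompositionChecker {
  static class ServiceDesc {
    final String name; final Set<String> requires, produces;
    ServiceDesc(String n, Set<String> r, Set<String> p) {
      name = n; requires = r; produces = p;
    }
  }

  static List<String> validate(List<ServiceDesc> chain) {
    Set<String> available = new HashSet<>();
    List<String> problems = new ArrayList<>();
    for (ServiceDesc s : chain) {
      for (String req : s.requires)
        if (!available.contains(req))
          problems.add(s.name + " needs '" + req
              + "' but no earlier service produces it");
      available.addAll(s.produces);
    }
    return problems;
  }

  public static void main(String[] args) {
    // Deliberately wrong order: the tagger precedes the tokeniser.
    List<String> issues = validate(Arrays.asList(
        new ServiceDesc("tagger",
            new HashSet<>(Arrays.asList("Token")),
            new HashSet<>(Arrays.asList("Token.pos"))),
        new ServiceDesc("tokeniser",
            new HashSet<>(), new HashSet<>(Arrays.asList("Token")))));
    for (String p : issues) System.out.println(p);
  }
}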

3.1.6 Compatibility with the Existing GATE Development Environment
One of the key features of GATE, which is largely appreciated by its users, is the graphical
development environment, and in particular the specialised visualisation components for
the different kinds of linguistic data. Consequently, one of the key requirements towards
the new services is that they are compatible with the existing GUI, so the user can view

both local and remote data in the same fashion and/or build applications containing both
local and remote components.

3.2 Requirements towards the TAO Technology


In this section we take the requirements identified earlier and recast them in the con-
text of the TAO technology. We also mention a set of issues which need to be addressed
under each requirement.

Workflow methodology backed by an infrastructure The requirement for combined
methodological and technological support (see Section 3.1.2) leads to a TAO
requirement for a workflow methodology supported by an infrastructure. The
methodology is required in order to guide users with best practice in using se-
mantic technology. For instance, methodological guidance can help with advice
for domain experts on how to make the most from the output of the ontology learn-
ing tools and post-edit these into a domain ontology, with a minimum effort, while
avoiding known ontology modelling pitfalls.

Flexible System Building Environment Building applications which combine a num-
ber of services is a complex task, so a development and debug environment is
required to help with service composition, handling dependencies between them,
execution, etc.

Web Service Refactoring Guidance Refactoring legacy applications poses a number of
questions, so guidance is needed on how best to address issues such as enabling
a service to handle concurrent requests, data exchange (e.g., multilingual content
needs to be exchanged between services), service discovery, security and privacy
issues.

Assisted Service Composition It is expected that the semantic descriptions associated
with the web services would enable the TAO Suite to assist non-expert users in
configuring SWS into applications.

Low Human Overhead of Ontology Learning and Content Augmentation Services
One of the key requirements towards the semantic web technology in our application
is that it should be applicable to existing software applications and require
minimal human involvement in its creation and maintenance. Since the majority
of projects do not already maintain an ontology (GATE among them),
automatic methods for ontology learning, specifically tailored to software artefacts,
are key to success.
One of the issues that need to be addressed is whether we need to keep provenance
information for learnt concepts.

From the methodological perspective, how does the human post-editing of the learnt
domain ontology fit within the entire transitioning process? Would it be possible
to develop a relatively simple how-to guide on conceptual modelling, so that
software engineers with limited expertise in ontologies could carry out this step?

Upper Level and Other Existing Ontologies Another set of issues that needs to be
investigated is whether there would be any benefit in grounding the learnt and post-edited
ontologies in an upper level ontology. If yes, which ontology, or what are the cri-
teria for choosing the most relevant one among existing ones? In addition, there are
existing ontologies oriented towards software artefacts, e.g., those from the Dhruv
system [ASHK06], which also need to be investigated.

Scalability of the Semantic Repositories Semantic repositories currently scale to mil-
lions of triples, with some new file-based ones (BigOWLIM) scaling up to 1 bil-
lion. This is a rather encouraging recent development, because previous work on
semantically annotating bug postings [ASHK06] reported on scalability being the
main stumbling block.

Learning Domain Concepts and Properties Semantic descriptions of web services
rely on the use of two kinds of ontologies. On the one hand, generic web service
ontologies (such as OWL-S or WSMO) describe the kind of information that each
web service should provide about itself. Most commonly, each web service should
describe the service it provides (functionality) and the parameters it expects. The
values of all these kinds of information (e.g. BuyBook, Book, onlyInUSA) come
from so-called domain ontologies. While there is much effort dedicated to deciding
on the terms of the generic ontologies, the acquisition of domain ontologies is left
to users who annotate their services.
There exist some ontologies describing software artefacts (e.g., http://www.cs.cmu.edu/~anupriya/code.owl and http://www.cs.cmu.edu/~anupriya/bugs.owl), but they describe
generic software concepts such as bugs, directories, method names, code blocks,
etc. Consequently, such ontologies are also not suitable to act as domain ontologies
for semantic web services.
In other words, the ontology learning approaches need to acquire domain and
application-specific terms. For example, in the case of GATE these would be lan-
guage resource, part-of-speech tagger, etc.

Dealing with Dynamic Software Artefacts How would the learning and content aug-
mentation methods cope with the dynamic nature of software artefacts, i.e., new
concepts becoming important as the software matures through versions? This is prob-
ably less relevant for legacy systems where there is no further development of the
system itself, however GATE specifically is updated continuously and expanded
with new plugins by its user community (although this does not necessarily entail
any changes to the key parts of the GATE API).

Exploiting Redundancy across Multiple Software Artefacts Ontology Learning (OL)
and Semantic Annotation (SA) of source code alone is not sufficient: our analysis
of the GATE code showed that the comments and class names tend to use the same
term consistently to refer to a given domain concept, whereas the user manual and
forum postings use more varied terms (e.g., POS tagger, Hepple tagger). In addition,
software developers usually write comments to be as short as possible and thus
introduce abbreviations, e.g., doc for document or splitter for sentence splitter.
The consequence is that the majority of terms in the source code are single-word
terms, whereas the user-oriented unstructured manuals tend to use more multi-word
terms to refer to the same concepts.

Consequently, the goal is to develop OL and SA approaches which can exploit
the redundancy and complementarity of the different software artefacts, in order to
improve their performance.
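
As a small illustration of the kind of normalisation this implies, code identifiers can be expanded and camel-case split before matching them against the multi-word terms found in the manuals (a sketch; the abbreviation table is illustrative):

import java.util.*;

// Sketch: normalise source-code identifiers so that they can be
// matched against multi-word terms from the manuals and forum.
public class TermNormaliser {
  // Illustrative abbreviation table gathered from code comments.
  static final Map<String, String> ABBREV = Map.of(
      "doc", "document", "splitter", "sentence splitter");

  static String normalise(String identifier) {
    String expanded = ABBREV.getOrDefault(identifier, identifier);
    // Split camel case: "SentenceSplitter" -> "sentence splitter"
    return expanded.replaceAll("(?<=[a-z])(?=[A-Z])", " ").toLowerCase();
  }

  public static void main(String[] args) {
    System.out.println(normalise("SentenceSplitter")); // sentence splitter
    System.out.println(normalise("doc"));              // document
  }
}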
Chapter 4

Design of the GATE Web Services

This chapter describes the GATE web services, which have been defined over the existing
legacy application, while retaining backwards compatibility and addressing all other key
requirements defined in the previous chapter.

4.1 Architecture Overview


As already mentioned, prior to TAO, NLP applications built with GATE were monolithic
in nature and consisted of tightly coupled processing resources (see Figure 4.1).

Figure 4.1: Legacy GATE application model

The requirements towards the SOA-based application model were defined in Chap-
ter 3. In brief, the main requirements are for a non-monolithic, human-assisted, and
backwards-compatible model. In particular, backwards compatibility with the GATE
graphical user interface is of paramount importance, as it is being used extensively by
the majority of GATE users.


The resulting SOA-based, backwards-compatible application model is shown in Fig-
ure 4.2. In this model, local and remote processing resources can be combined in applica-
tions, with the local algorithms being tightly coupled, whereas the service-based resources
are invoked via clients. It is also possible to have applications consisting entirely of re-
mote processing resources.

Figure 4.2: SOA-based, backwards compatible application model

4.2 GAS: GATE Annotation Services


A GAS service is defined as a pipeline of GATE PRs which is made available as a web
service. In practice, a GAS service can be a front-end to several instantiations of this GATE
pipeline, running on different servers in order to improve scalability. The configuration of
the GAS service will specify this behaviour. It is transparent to the external user of the
GAS service how many machines are actually used to produce an annotation.
A GAS service is configured with two files, first the GATE application file defining
the pipeline as produced by saving the application state from GATE, and second a service
definition defining the input and output annotation sets required by the service and any
parameters the service takes. These files are described next.

4.2.1 The saved application state


The GATE pipeline run by a service is typically defined by a GATE application file, cre-
ated in the GATE GUI using Save application state. The application should be fully
configured before saving, i.e. all required runtime parameters should be set (except pos-
sibly those that will be mapped to required service parameters, as described in the next
section).

The saved application should be as self-contained as possible, i.e. the plugins used by
the application should be in subdirectories under the saved application location. This is
necessary because in some cases read permission on parent directories may be restricted
by the servlet container.

4.2.2 Service definition


Similar to a GATE PR, a GAS can declare required and optional parameters whose values
are provided by the caller. The parameter values are mapped onto either feature values on
the document being processed by the embedded GATE PRs, or runtime parameter values
of the various PRs that make up the application. The mapping is defined in a service
definition, an XML file supplied by the GAS creator:

<parameters>
<param name="depth">
<runtimeParameter prName="MyAnnotator" prParam="maxDepth"/>
</param>
<param name="annotatorName">
<documentFeature name="annotator"/>
</param>
</parameters>

Parameter values are all strings, but when mapped to PR parameters they are converted
automatically to the appropriate types (e.g. URL, Integer, etc.).
Again, as for standard GATE PRs, the GAS service definition also specifies what
annotation sets the service requires as input and populates as output:

<annotationSets>
<annotationSet name="Original markups" in="true"/>
<annotationSet name="" out="true"/>
<annotationSet name="Key" in="true" out="true"/>
</annotationSets>

Note that a service can use the same set for both input and output, and can use the
default annotation set (the second example above).
For a complete example of a GAS service WSDL and service definition file see Ap-
pendix A.

4.2.3 Using a GAS service


As shown in Figure 4.2, the traditional GATE environment has a client to connect to a
GAS Service as if it were a normal GATE PR. Consequently, a GAS Service is able to
take as input the content of a local document.

Effectively a GAS service behaves roughly like a current GATE application, i.e. it
takes a document and a set of parameters as input and returns a modified document. This
is required in order to incorporate a GAS in a GATE application, like any other PR.
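
From the caller's perspective, the interaction can be pictured as below (a hypothetical client sketch; the class, methods, and endpoint are illustrative rather than a released API, and the real call goes over the WSDL interface shown in Appendix A):

// Hypothetical GAS client usage; class and method names are
// illustrative, not part of a released API.
public class GasClientDemo {
  public static void main(String[] args) throws Exception {
    GasClient client =
        new GasClient("http://example.org/gas/pos-tagger");

    // Parameters are plain strings, as in the service definition.
    client.setParameter("annotatorName", "demo");

    // Input: document content; output: the modified document
    // with the populated output annotation sets.
    String annotated = client.process("GATE is developed in Sheffield.");
    System.out.println(annotated);
  }
}

// Stub so the sketch compiles; a real client would issue the
// SOAP call described by the WSDL in Appendix A.
class GasClient {
  GasClient(String endpoint) {}
  void setParameter(String name, String value) {}
  String process(String content) { return content; }
}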
Future work needs to consider generating a simple service deployment (with every-
thing running on a single server) with a couple of clicks. The result of this operation will
be a WAR file, ready to be unpacked in a web application server. If one needs a more
advanced deployment, e.g. on a cluster, then more administrative intervention will be
required.

4.3 From SOA to Semantic Web Services


Having finalised the design of the web services, it is now possible to start moving
towards obtaining their semantic descriptions. For year 2 of TAO we plan to carry out
the following research experiments:

define the domain ontology (created with WP2 learning tools)

apply it with the content augmentation tools from WP3 to the WSDL definitions

also apply these to the service code and documentation, to obtain semantic descrip-
tions of the web services.
Chapter 5

Transitioning Scenarios

As discussed in the introduction, there are three kinds of tasks to be completed to trans-
form GATE into a semantic service-oriented architecture:

the elicitation of an ontology to represent the concepts treated by GATE components


(application of WP2 technology on the GATE API documentation);

alongside this, the exposure of semantic-based access to the GATE document base
(application of WP4 storage and WP3 content augmentation technology);

finally, uniting the results of these two processes, the description of this function-
ality in semantic terms (using the WP5 infrastructure to define the Semantic Web
Services).

Overall, this case study has a wide scope. However, it should be stressed that not all
parts will be equally developed. The main effort will be focused on the ontology
acquisition, semantic content augmentation, and knowledge access scenarios, elaborated
below, as they are the key building blocks of the transitioned application.

5.1 Ontology Acquisition


The acquisition of the domain ontology is the enabling block for the rest of the work and
therefore it will be the first task to be undertaken in year 2. This ontology is required by
all semantic-based access and web service description scenarios.
Whereas traditionally such an ontology would be defined by hand, we will follow the
TAO methodology and first apply the ontology learning tools from WP2 to obtain a draft
ontology (model bootstrapping), which will be then post-edited by a domain expert to
obtain the gold-standard. In addition, existing relevant ontologies will be considered for
reuse.


The overall ontology acquisition process will follow the TAO methodology defined in
D1.2 and refined for the case studies in Section 4.1.2 of D7.1.

5.2 Semantic Content Augmentation


Semantic content augmentation will be a service provided by WP3 and adapted for the
specifics of software artefacts (e.g., special capitalisation and naming conventions). Due
to the size of the GATE document base (hundreds of documents, growing and changing
daily), it is infeasible to assume that GATE developers will post-edit the content
augmentation results and correct any mistakes, although we will pilot the semantic an-
notation tools in order to quantify this overhead and to explore whether post-editing could
be applied to only an important subset of the software documentation (e.g., the user and developer
manuals). In that respect, the GATE content augmentation scenario is different from that
in the aviation case study, where all results will be post-edited, due to the specifics of their
document authoring practices (see section 4.2 of D7.1).
Similar to the other case study, we will distinguish the generic functionalities that will
be directly reused from the TAO suite from case-study-specific customisations, which
will be developed specifically for software artefacts. For the most part, the annotation services will be pro-
vided by WP3 as generic tools, although minor customisation is envisaged, mostly to deal
with specific formatting and naming conventions. In addition, the semantic annotation
tools need to distinguish between the terms of the programming language and those of
the application (hashMap vs document), as only the latter should be annotated. Such gen-
eral, software terms will either be learnt automatically or supplied as a list of terms in a
gazetteer.
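
A minimal sketch of this filtering step (the stop list contents are illustrative):

import java.util.*;

// Sketch: annotate only application terms, skipping general
// programming-language vocabulary held in a gazetteer-style list.
public class TermFilter {
  static final Set<String> LANGUAGE_TERMS = new HashSet<>(
      Arrays.asList("hashmap", "string", "arraylist", "iterator"));

  static boolean isDomainTerm(String candidate) {
    return !LANGUAGE_TERMS.contains(candidate.toLowerCase());
  }

  public static void main(String[] args) {
    System.out.println(isDomainTerm("hashMap"));  // false: skip it
    System.out.println(isDomainTerm("document")); // true: annotate it
  }
}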
Another key challenge will be dealing with a wide range of file formats (HTML,
source code, PDF, images) and differently-sized documents.
Once a first prototype of the content augmentation service is created, it will enable
work on the semantic-based access scenarios described next.

5.3 Semantic-based Access Scenarios: Overview


One of the roles of semantic technologies within the GATE case study will be to manage
access to the different knowledge sources, i.e., source code, javadoc, discussion forums,
publications, user guides, etc. An important aspect that needs to be dealt with is data
dynamics, i.e., many of these knowledge sources are updated continuously, so the main
challenge comes from implementing a process which indexes them regularly with respect
to the domain ontology and updates the knowledge base/semantic annotation repository
accordingly.
The knowledge access tools will present knowledge from the domain ontology and

the semantic annotations from the content augmentation services, run over the multiple
software artefacts in GATE.
In the context of semantic-based access, several scenarios have been identified as rel-
evant for this case study:

Automatic generation of reference pages from the ontology provides users with a
single point of access to all knowledge, continuously kept up to date.

Natural language queries for semantic search provide an intuitive way to formulate se-
mantic queries without the need for knowledge of the underlying query language of the
semantic repository.

Semantic-based filtering of forum postings allows users to filter which postings they
see, by defining a set of relevant concepts from the domain ontology (based on the
semantic content augmentation of the forum postings).

Expertise location enables a (new) team member to identify the most appropriate
team members to consult on a given problem. Developer expertise will be
discovered automatically from forum postings.

However, due to the limited effort available, we will focus on implementing the first
two scenarios, with the latter two remaining at a fairly early prototype level, to be
explored further in year 3, should more effort become available.

5.3.1 Automatic creation of reference pages from the ontology


The current GATE web site provides links to various kinds of documentation about the
system: user/developer manuals, source code, javadoc, papers, tutorials, movies, etc. In
addition, there is a link to the discussion forum, which is hosted on SourceForge.
The GATE documentation pages are searchable via Google, restricted to search only
on gate.ac.uk. However, this does not return results from the forum postings, which are
searchable via a different field on SourceForge.
The overall result is that a new user who wishes to find out information about a GATE
component, e.g., POS (part-of-speech) Tagger, needs to do these searches in two separate
places; i.e., even for important GATE components there is no single comprehensive
reference page. In addition, the user has no control over how results are presented,
e.g., ordered by recency (only possible in the forum search) or restricted by type of
source (e.g., results only from papers or the user guide).
Another problem is that there is typically more than one term referring to the same
concept (POS Tagger, Hepple Tagger, part-of-speech tagger). Currently, searching for
these gives different results, whereas a concept-based search would return the same hits.

In order to overcome this problem and provide a single point of reference, the Dhruv
system [ASHK06], for example, automatically generates so-called cross-link pages,
which provide a unified point of information (see Figure 5.1). The cross-link pages show
information about the ontology class (code:Function in this case), the file that contains the
source code, who authored this file, other related sources, version control logs, discussion
posts, bug reports, etc. The system also recognises domain terms, which are not added to
the ontology, but are simply used to cross-reference the different artefacts (see the right
hand-side part of the figure).

Figure 5.1: Dhruv cross-link pages

The problem with adopting the Dhruv approach wholesale is that in a large system
such as GATE there would be too many hits, and these cross-link pages would become hard
to use and navigate. For instance, there are 333 hits for POS Tagger in the forum postings
and 169 hits among the other kinds of documentation.
The approach that we envisage here is more ontology-centred. In other words, the
user would be able to browse the ontology or search for relevant concepts, instances,
and properties by providing one or more keywords. For instance, searching for tagger,
the user would be offered several matching concepts: POSTagger, TreeTagger, and any
other entry whose label contains the string tagger. Let us then assume that the user
selects POSTagger as the instance of interest.
Instead of being shown triples about the POSTagger instance in a formal way (e.g., as
in Protege or the KIM KnowledgeBase Explorer [PKK+04]), we envisage automatically
generating a web page, which can be shown on its own or alongside the ontology tree,
with POSTagger selected. The contents of the page can be as follows (a query sketch is
given below):

- Super- and sub-concepts of the selected concept; if an instance is selected, then show
its class and possibly other instances of the same class.
- Values of key properties (e.g., the parameters of POSTagger), as well as properties of
other entries which have the selected entry as a value (e.g., which PRs take Ontologies).
- Links to sections of the user guide and other guides where the concept is mentioned;
importance could be judged by whether the mention occurs in the title or the body of
the section/chapter.
- Links to Javadocs and key forum postings.

Ideally, the user should be able to customise what they are interested in, so that some of
the information can be filtered out: for instance, excluding certain data sources or
requesting only information after a given date. Experiments will show how many hits are
typically returned, so that appropriate ranking and filtering can be designed. The
exclusion of certain types of information will be implemented by issuing more refined
SeRQL queries to the knowledge store containing the content augmentation index.
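
As a minimal sketch of how the data for such a page could be retrieved, the following
Java fragment queries the knowledge store. It assumes the Sesame 2 (openrdf) repository
API, through which OWLIM is accessed, and a hypothetical namespace
http://gate.ac.uk/ns/gate-kb# for the learnt domain ontology; the actual class and
property names will only be fixed once ontology learning has run.

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class ReferencePageSketch {

  // Hypothetical namespace for the learnt GATE domain ontology.
  private static final String GKB = "http://gate.ac.uk/ns/gate-kb#";

  public static void printReferenceData(Repository repo) throws Exception {
    RepositoryConnection conn = repo.getConnection();
    try {
      // 1. Direct super-concepts of POSTagger.
      String supers = "SELECT DISTINCT ?super WHERE { <" + GKB + "POSTagger>"
          + " <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?super }";
      // 2. All properties and values attached to POSTagger
      //    (e.g., its parameters).
      String props = "SELECT ?p ?v WHERE { <" + GKB + "POSTagger> ?p ?v }";

      for (String query : new String[] { supers, props }) {
        TupleQueryResult result =
            conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
        while (result.hasNext()) {
          BindingSet row = result.next();
          System.out.println(row); // in reality, rendered into HTML
        }
        result.close();
      }
    } finally {
      conn.close();
    }
  }
}

The date restriction mentioned above could be added to such queries with a SPARQL
FILTER clause over a timestamp recorded during content augmentation.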
Another alternative would be to generate these as wiki pages, so that the user can
correct any mistakes in an intuitive way, perhaps using a controlled language [DHCT06]
or a special semantic wiki syntax [VKV+06, MH05], which then directly updates the
ontology as well. If a semantic wiki is used, then the update of a wiki page would trigger
the following processes (sketched after the list):

- parsing of the semantic information (via the special wiki syntax);
- mapping to the corresponding repository API calls to update the ontology. If an earlier
version of the page contained semantic information which has since been removed,
then it needs to be deleted from the repository, not just the new information added.
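A minimal sketch of the second step, again assuming the Sesame 2 API, and assuming
that the wiki parser returns a page's triples as a set of openrdf Statement objects;
diffing against the previously stored version ensures that removed statements are
deleted as well:

import java.util.HashSet;
import java.util.Set;
import org.openrdf.model.Statement;
import org.openrdf.repository.RepositoryConnection;

public class WikiOntologyUpdater {

  /**
   * Synchronises the repository with the triples parsed from a newly
   * saved wiki page. oldTriples holds the statements contributed by
   * the previous version of the same page.
   */
  public static void applyPageUpdate(RepositoryConnection conn,
      Set<Statement> oldTriples, Set<Statement> newTriples)
      throws Exception {
    // Statements present before, but no longer on the page: delete.
    Set<Statement> toRemove = new HashSet<Statement>(oldTriples);
    toRemove.removeAll(newTriples);
    // Statements on the page which are not yet stored: add.
    Set<Statement> toAdd = new HashSet<Statement>(newTriples);
    toAdd.removeAll(oldTriples);

    for (Statement s : toRemove) {
      conn.remove(s);
    }
    for (Statement s : toAdd) {
      conn.add(s);
    }
  }
}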

One argument for using a wiki as the basis of our system is that wikis are already used
heavily, especially in open-source software development projects, as a collaborative means
of documentation authoring, coordination, etc. At present, the GATE team uses a wiki
internally to list jobs, track deadlines, and share relevant publications on selected topics.
Consequently, opening this out and enriching it with semantic technology might be a good
solution, provided that the developers are willing to engage in such editing.

Semantic wikis and software engineering support systems such as Dhruv do not al-
low sufficiently flexible exploration of the available semantic knowledge about con-
cepts/instances which are the objects (ranges) of the semantic triples, although some
promising GUIs have been developed recently (e.g., IkeWiki [Sch06], OntoWiki
[ADR06]). For example, if the generated page is about Processing Resource with an
is-a link to Resource, then the only way to find further information about Resource is
to click on the link.
A better approach could be to generate a small summary and show it on mouse hover
over Resource (IkeWiki currently supports something similar by showing the content of
the target wiki page, but that is usually far too long). Alternatively, a right click could
open a separate small window, such as the KIM KnowledgeBase Explorer, which shows
the relevant portion of the ontology and can then be navigated further.
Given that the automatically generated pages have URLs, these can be used as URIs
for resources in RDF triples. This means that the system can automatically collect a
time-stamped list of bookmarks for each user, with the possibility for the user to add
further semantic information if desired; users can then access these bookmarks from any
machine. A similar idea has been proposed in the SHAWN semantic wiki [Aum05]. However,
here we can go one step further, by automatically associating the concepts, instances,
and relations found on a page with it as semantic tags. Clustering this information
would enable us to build a profile of the user's expertise and interests in the domain,
and then to deploy an Amazon-style recommender (Other GATE users interested in
<domain concept/instance> also visited these pages: ...). A bookmark-recording sketch
is given below.
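
Purely as an illustration, a bookmark could be recorded as a handful of RDF
statements; the namespaces and property names (visited, timestamp) below are
hypothetical placeholders, and the Sesame 2 ValueFactory API is assumed:

import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.repository.RepositoryConnection;

public class BookmarkRecorder {

  // Hypothetical namespaces for users and bookmark properties.
  private static final String USER_NS = "http://gate.ac.uk/ns/users#";
  private static final String BM_NS = "http://gate.ac.uk/ns/bookmarks#";

  /** Records that a user visited a generated reference page. */
  public static void record(RepositoryConnection conn, String userId,
      String pageUrl, String isoTimestamp) throws Exception {
    ValueFactory vf = conn.getRepository().getValueFactory();
    URI user = vf.createURI(USER_NS + userId);
    URI page = vf.createURI(pageUrl);
    // user visited page
    conn.add(user, vf.createURI(BM_NS + "visited"), page);
    // page timestamp, e.g. "2007-04-10T12:00:00"
    conn.add(page, vf.createURI(BM_NS + "timestamp"),
        vf.createLiteral(isoTimestamp));
  }
}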
The development of this scenario will proceed in two phases. First, we'll implement
the automatic generation of reference web pages from queries against the knowledge
store. Next, we'll evaluate the result with GATE users and identify possible improve-
ments. We'll also evaluate whether the developers are happy to engage in metadata edit-
ing. If they are, then in the second phase we'll consider moving towards a semantic wiki,
probably by taking an existing one off the shelf, after suitable evaluation.

5.3.2 Natural language queries for semantic search


One of the key problems at present is that users find it difficult to locate information,
due to the limited keyword search facilities and the fact that information is spread across
different web sites. While semantic search will be able to provide better results across the
multiple software artefacts, there remains the problem of helping non-expert users to
formulate their queries in an intuitive fashion, without any knowledge of SeRQL.
One promising solution would be to offer natural language queries. However, one
downside of using natural language for formulating queries is that unconstrained lan-
guage tends to be ambiguous. Therefore, we envisage using a controlled natural language
for formulating queries against the knowledge stores.

A controlled language is a subset of a natural language which is generally designed
to be less ambiguous than the complete language, and furthermore to include only certain
vocabulary terms and grammar rules which are relevant for a specific task. Controlled
languages have a long history of use: since the 1970s, CLs have been used by large
companies such as Caterpillar, Boeing, and Perkins for the generation of multilingual
documentation.
More recently, work began on using CLs in knowledge management. Computer Pro-
cessable English (CPE), developed at the Cambridge Computer Laboratory, is an attempt
to devise a controlled language which can be easily translated into a knowledge represen-
tation language; the project has been successful in easing the process of creating domain-
specific knowledge bases [Pul96]. ClearTalk is a language used by a series of knowledge
engineering tools created at the University of Ottawa; these tools can extract knowledge
from ClearTalk automatically, and tools to assist the production of ClearTalk from un-
restricted language also exist [Sku03, Sow02]. The KANT system also has the ability
to produce conceptual graphs and OWL statements from its intermediate representation
[KAMN98]. [Sow02] also demonstrates that there is a mapping between the database
query language SQL and controlled languages.
In the context of this case study, we envisage that the controlled language queries will
be translated into the respective query language of the knowledge stores (from WP4)
and then passed on for execution. The returned matching triples will be displayed as
hypertext, with links enabling further exploration of the knowledge base. For example,
if the user asks What are the parameters of the POS Tagger?, the query would be
translated into SeRQL or SPARQL and passed to OWLIM for evaluation; the matching
results would be triples, which will then be displayed to the user in a friendly manner
(a sketch of a possible translation is given below).
If the user needs further information, they can click on a link which will display
information about the underlying concept (i.e., its properties, super- and sub-classes, and
instances).
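
To make the pipeline concrete, here is a minimal sketch of what the translated query and
its execution could look like, again assuming the Sesame 2 API and the hypothetical
gate-kb vocabulary used earlier (hasParameter and parameterName are illustrative names,
not the learnt ones):

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;

public class NLQuerySketch {

  /**
   * "What are the parameters of the POS Tagger?" translated into
   * SPARQL over the hypothetical domain ontology.
   */
  private static final String QUERY =
      "PREFIX gkb: <http://gate.ac.uk/ns/gate-kb#>\n"
      + "SELECT ?param ?name WHERE {\n"
      + "  gkb:POSTagger gkb:hasParameter ?param .\n"
      + "  ?param gkb:parameterName ?name .\n"
      + "}";

  public static void run(RepositoryConnection conn) throws Exception {
    TupleQueryResult result =
        conn.prepareTupleQuery(QueryLanguage.SPARQL, QUERY).evaluate();
    try {
      while (result.hasNext()) {
        BindingSet row = result.next();
        // In the real interface, each binding becomes a hyperlink to
        // the generated reference page for that resource.
        System.out.println(row.getValue("name"));
      }
    } finally {
      result.close();
    }
  }
}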
An interesting issue that will need to be investigated is the appropriate presentation of
cases where too many results are returned from the knowledge store. Initially, the
implementation will present them all, but later on, if required, we will investigate methods
for result summarisation or clustering.
See Section 5.4.3 for details of related work, which will be taken as a starting point in
the implementation of this scenario.

5.3.3 Semantic-based filtering of forum postings


GATE's discussion forum receives on average about 120 posts per month, with up to 15 on
some days. Due to the volume of this information, it would be helpful if developers
could choose to read only postings related to their areas of interest. Therefore, what is
required is automatic classification of postings with respect to concepts in the ontology,
together with a suitable interface for semantic-based search and browse. A similar problem
is currently being addressed in the context of digital libraries [WTA06], and we will
consider using some of these techniques as well.

Forum Postings Analysis

Users tend to post messages on the discussion forums when they have failed to find a
solution to their problem in the user manuals and earlier forum postings. Therefore, by
analysing which topics are being discussed, one can also identify potential weaknesses
in the current documentation, which can then be rectified. Again, this is an example
where classification with respect to an ontology can help with the maintenance and update
of software documentation.
In order to verify this hypothesis, we carried out a limited manual analysis of topics
in the GATE forum postings between April and June 2006. All threads were classified as
belonging to one of four topics: questions about the core system, existing plugins (i.e., op-
tional components), problems with writing new plugins, and problems with writing new
applications. Table 5.1 shows the results, which indicate that the majority of users have
problems with using the core system or some of its plugins. Some of the problems
are clearly due to shortcomings of the user guide, e.g., lack of information on which class
implements a certain plugin, which directories contain the corresponding configuration
files, or what parameters a plugin takes. All of this information changes over
time with each release, and maintaining the user guide is therefore a time-consuming task.
However, ontology-based search could provide the user with this information, if the on-
tology concepts have been learnt from the source code and this provenance information
has been kept in the ontology itself (e.g., via implementedInClass or referencedInClass
properties); a query sketch is given after Table 5.1.

Month    | Core GATE | Exist. plugins | New plugins | New apps
---------+-----------+----------------+-------------+---------
April 06 |    17     |       13       |      0      |    1
May 06   |    27     |       14       |      5      |    2
June 06  |    12     |       14       |      4      |    2

Table 5.1: Number of questions posted each month under the different categories.
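
As a sketch of how such provenance properties could be exploited, the query below (using
the same hypothetical gate-kb vocabulary as earlier) retrieves the classes that implement
or reference a given plugin; it would be executed exactly as in the earlier Sesame
sketches:

public class ProvenanceQuerySketch {

  /**
   * Which class implements the POSTagger plugin, and which other
   * classes reference it? All names are placeholders until the
   * ontology has actually been learnt from the source code.
   */
  public static final String QUERY =
      "PREFIX gkb: <http://gate.ac.uk/ns/gate-kb#>\n"
      + "SELECT ?class WHERE {\n"
      + "  { gkb:POSTagger gkb:implementedInClass ?class }\n"
      + "  UNION\n"
      + "  { gkb:POSTagger gkb:referencedInClass ?class }\n"
      + "}";
}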

Another aspect is queries that identify all places in the code, documentation, and mail-
ing list postings which are relevant to a particular instance, or to a more complex
conceptual query involving several instances and properties. The results could be presented
as a generated summary HTML page with pointers to the original documents. However,
given the size of some of these documents, it would also be useful to highlight the
relevant parts within them. This may be implemented in two ways (a sketch of the first
option follows the list):

- when processed, all documents are saved back to their original location with the
new semantic markup inserted, and then we simply highlight these tags when needed;

- overlay the highlights on the original page, as the KIM plugin does [PKK+04].
We'll need to investigate how this plugin can be integrated within the case study, or
how to implement something similar if it is not appropriate. The key problem is
that some documents live in SourceForge's SVN, others in our own SVN, whereas
others still are emails on Gmane.
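
A minimal sketch of the first option, using the standard GATE embedded API to
re-serialise a processed document with its semantic annotations inlined as XML tags; the
annotation type Mention and the command-line arguments are assumptions for illustration
only:

import java.io.File;
import java.io.FileWriter;
import java.net.URL;
import gate.Document;
import gate.Factory;
import gate.Gate;

public class MarkupSaver {

  public static void main(String[] args) throws Exception {
    Gate.init(); // initialise the GATE library

    // Load an already processed document from its original location.
    Document doc = Factory.newDocument(new URL(args[0]));

    // Serialise the text with the semantic annotations (assumed here
    // to be of type "Mention" in the default set) inlined as tags.
    String xml = doc.toXml(doc.getAnnotations().get("Mention"), false);

    FileWriter out = new FileWriter(new File(args[1]));
    out.write(xml);
    out.close();
  }
}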

5.3.4 Expertise Location


As part of the process of designing the semantic-based search and browse facilities, we
carried out a small experiment using a set of domain concepts to discover who is the
most suitable GATE developer to consult on a given problem. A subset of the GATE fo-
rum postings was analysed to identify all responses by GATE developers, whose names
were supplied as a list. The result was an association of domain concepts, developer
names (given here as initials), and frequency of answers; some examples are: POS tagger
(DM (43 postings), IR (12)) and Jena ontologies (VT (45), KB (6), IR (2)). The goal is
to enable developers to identify easily whom they should consult when working on a new
topic. Conversely, as already discussed, the assignment of GATE concepts to forum
postings will enable our system to notify developers only of postings related to their area
of expertise. A sketch of the underlying counting step is given below.
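
The core of this experiment is a simple frequency count over (concept, developer) pairs
extracted from the annotated replies; a minimal sketch:

import java.util.HashMap;
import java.util.Map;

public class ExpertiseCounter {

  // concept -> (developer initials -> number of answers)
  private final Map<String, Map<String, Integer>> counts =
      new HashMap<String, Map<String, Integer>>();

  /** Called once per developer reply mentioning a domain concept. */
  public void count(String concept, String developer) {
    Map<String, Integer> perDev = counts.get(concept);
    if (perDev == null) {
      perDev = new HashMap<String, Integer>();
      counts.put(concept, perDev);
    }
    Integer n = perDev.get(developer);
    perDev.put(developer, n == null ? 1 : n + 1);
  }

  /** E.g. experts for "POS tagger" -> {DM=43, IR=12}. */
  public Map<String, Integer> expertsFor(String concept) {
    Map<String, Integer> perDev = counts.get(concept);
    return perDev == null ? new HashMap<String, Integer>() : perDev;
  }
}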
The success of this task depends on the creation and maintenance of user profiles.
The SEKT project investigated how these can be stored in the ontology [GMG05] and
also derived from server log files. Here we will take this further by building a model of
users' knowledge based on the topics that they have written about in the forum. In terms
of technological approach, the SEKT work made use of knowledge discovery techniques,
so in TAO the work from WP2 will probably be a good starting point, combined with
results from the content augmentation tools from WP3.

5.4 Related Work

5.4.1 Semantic Technologies for Software Engineering


Large software frameworks and applications tend to have a significant learning curve,
both for new developers working on system extensions and for other software engineers
who wish to integrate relevant parts into their own applications. This problem is made
worse in the case of open-source projects, where developers are distributed geographically
and tend to have limited time for answering user support requests and for helping novice
users by writing extensive tutorials and step-by-step guides. At the same time, such online
communities typically create a large amount of information about the project (e.g., forum
discussions, bug resolutions), which is weakly integrated and hard to explore [ASHK06].
In other words, there is a practical need for better tools to help guide users through the
jungle of APIs, tools and platforms [Knu05].


Recent research has begun to demonstrate that semantic technologies are a promis-
ing way to address some of these problems [Knu05]. For instance, search and browse
access to web service repositories can be improved by a combination of ontology learn-
ing and semantic-based access (e.g., ontology-based browsing and visualisation) [SP05].
Similarly, semantic wikis have been proposed as a way of supporting software
reuse across projects, where a domain ontology describing the different software artefacts
emerges as a side effect of authoring wiki pages [DRR+05].
Perhaps the closest work to what we propose here is the Dhruv system [ASHK06],
which supports bug resolution by relying on a semantic model, obtained by instantiating
hand-built domain ontologies with data from the different information sources associated
with an open-source project.
The main differences between our work and Dhruv [ASHK06] are:

- The application-specific ontology is not represented explicitly in Dhruv, whereas
learning and using such ontologies is the focus in TAO; the ontologies in Dhruv are
general-purpose software ontologies.
- Dhruv only populates the ABox (i.e., instances) and uses a fixed TBox (four generic
software engineering ontologies), whereas we address both.
- The focus of Dhruv is different, in that it aims at helping developers at a very
low level (e.g., method names), i.e., JavaDoc on steroids.
- The focus of TAO is at the higher service/component level, although visualisations
similar to those in Dhruv can be used.

Work on the Ontology-driven Software Engineering Environment (OSEE) [TR06] is
also relevant because, similarly to us, it makes a clear separation between the data layer,
the middleware layer, and the application layer. The data layer, as in the TAO methodol-
ogy, stores the ontologies and other related data. The middleware layer is different, as it
uses agents, whereas TAO uses semantic web services. The application layer is specifi-
cally oriented towards software engineering applications, so it has components such as a
code generator and a case editor, although it also has an ontology editor and a semantic
search engine, which are components provided by the TAO Suite and customised for the
needs of this case study. Overall, the goal of OSEE is to support semantic-based software
development, whereas in this case study we do not alter the software development
practices at all, but layer semantic technology on top, to enable new uses of the already
existing software artefacts.
The Hipikat [CMB02] Eclipse plugin for supporting software development is relevant
to our forum postings and expertise location scenarios, which we plan to develop in the
last stage of the project. The difference, again, is that Hipikat is integrated with the soft-
ware development environment, whereas our scenarios address the problem of helping
any user/reader to find relevant information in the forum postings, which are not
tightly controlled artefacts. However, the recommendation mechanisms in Hipikat would
be relevant if we are to provide rankings of the matched results from semantic search.

5.4.2 Semantic wikis


Semantic wikis are another candidate for creating a knowledge access platform for soft-
ware engineering, due to their collaborative nature and the fact that many open-source
software projects already use wikis in one way or another. Here we discuss the pros and
cons of semantic wikis, which are extensions of traditional wikis with semantic knowl-
edge.
The Platypus and Rhizome [Sou05] wikis require the user to write semantic information
in N-Triples format, separate from the text itself, and are thus not relevant to our scenarios,
because we are aiming at users without expert knowledge of OWL or RDF.
The SHAWN [Aum05] semantic wiki is aimed primarily at metadata management,
supported by a wiki user interface. The user creates semantic triples in a way similar
to that of the semantic MediaWikis discussed next. An interesting feature of this wiki is
that it addresses some personal information management (PIM) issues, such as extracting
calendar events and maintaining bookmarks. It automatically collects the URLs of pages
visited by the user and enables them to add semantic annotations to these. In this way, one
can access and share one's bookmarks effectively, without necessarily putting them up on
del.icio.us.
Work on Semantic MediaWikis [VKV+06, MH05] aims at adding typed links and
attributes to Wikipedia, and at mapping and storing this information as RDF, in order to
support semantic-based search. The key here is that, in the Wikipedia spirit, there is no
explicit ontology or pre-defined set of link types or attributes. Users are free to create
their own, but this can cause problems with spelling and with remembering which types
have already been defined. This kind of wiki also combines metadata and content
management in a wiki-based interface, unlike the previous ones, which are mostly
metadata-oriented.
[VKV+06] also supports semantic-based search, which is very similar in spirit to
the KIM semantic search, with its drop-down boxes for specifying domain, range, and
property. The query is translated into SPARQL in order to support dynamic pages.
Examples of semantic statements are shown below (from
http://wiki.ontoworld.org/index.php/Semantic_MediaWiki):

London is the capital city of [[capital of::England]].
London's population is [[population:=7million]] people.
[[Category:City]]

In this case, the triples are constructed with London as their subject (domain), as this
is the page title. The category at the end defines what can be thought of as the class of
the entity described by the wiki page, London in this case.
An example query is:

<ask>
[[Category:Publication]]
[[project:SEKT]]
[[published: >= 1.1.2006]]
</ask>

Semantic statements are typically edited in the standard text-based wiki editor,
which is convenient for expert users but also error-prone: e.g., spelling errors can
result in the creation of a new typed link instead of the intended one. In order to
avoid this problem and make the wiki more attractive to novices, IkeWiki [Sch06] also
supports a WYSIWYG editor, which helps with the addition/deletion of triples by showing
a drop-down list of the available link types (and values, if these already exist). The same
problem is addressed via an extendable set of widgets in OntoWiki [ADR06].
The semantic wikis discussed so far suffer from several problems [ODM+06]. Firstly,
the semantic annotation on each page assumes that the subject (domain) of the RDF triple
is the main subject of the page. Secondly, they assume that each wiki page describes a
concept, which is not necessarily the case (e.g., Wikipedia disambiguation pages).
To enrich this simple approach, SemperWiki [ODM+06] defines a formal model
of annotation which, in addition to the subject-predicate-object triple, also encodes the
context in which the annotation was made (e.g., provenance (who), timestamp (when), or
limits on validity (where)). A distinction is made between formal annotations (in a
machine-processable format such as RDF) and semantic annotations (formal annotations
referring to ontologies). In addition, a distinction is made between the document describing
a concept (e.g., the wiki page about London) and the concept itself (i.e., London);
SemperWiki uses URLs for the former and URNs for the latter, which enables a clean
separation.
Another point that needs consideration is that most semantic wikis support document-
level annotation only (see [ODM+06]), i.e., the RDF/OWL statements are typically about
the concept defined in the page, whereas content augmentation techniques typically create
finer-grained annotations at sentence or word level. Also, given that content augmentation
techniques typically annotate with respect to one or more ontologies, a semantic wiki (if
used) would need to support namespaces or other ways in which URIs can refer to
resources outside the wiki itself (see [ODM+06] for some wikis that support what they
call terminology reuse).
One of the outstanding problems, which has only recently started to be addressed
[VK06, Sch06], is how to relate already existing formal ontologies to semantic wikis,
with their more document-centric, wiki-internal, less formal approach to ontologies. This
would be an interesting issue to consider in TAO, because the ontology learning and con-
tent augmentation services will have produced, and will be continuously updating, the
domain ontology (stored in OWLIM); if semantic wiki technology is to be employed,
it would need to take this existing knowledge into account.

5.4.3 Semantic-based search and browsing


There is a great deal of research on semantic-based search and browsing. Here we
specifically focus on approaches and systems which could potentially be used as a basis
for implementing (parts of) the scenarios discussed above.
Magpie [DDM04] is a suite of tools which supports the interpretation of webpages and
collaborative sense-making. It annotates webpages with metadata in a fully automatic
fashion, with no manual intervention, by matching the text against instances in the
ontology. It uses an ontology to provide a very specific and personalised viewpoint on the
webpages the user is browsing. In other words, Magpie adapts the browsing experience
for different users depending on their goals, knowledge, and context, which might be a
good approach to deriving user profiles and customised interfaces in this case study.
In more detail, Magpie maintains a kind of browsing history in windows called col-
lectors. Each collector shows the instances of a given concept that have been mentioned
on the page, or a list of related instances (e.g., people working on a given project who
were not mentioned on the page). The user can then click on these instances and browse
their semantic data, or create semantic bookmarks to retrieve this information later through
semantic queries.
Another relevant system is KIM [PKK+04], an extendable platform for knowledge
management which includes semantic-based search. It also includes a set of front-ends
for online use that offer semantically enhanced browsing. The KIMExplorer allows
web-based browsing of the ontology, which can also be queried by constructing SeRQL
queries via web forms.
One of the key problems experienced by users wishing to search semantically
annotated content, not just via the KIM GUI, is that most systems require some knowledge
of SPARQL (or a similar query language), or provide triple-based query interfaces, which
again require the user to be familiar with the underlying ontology representation. One
way to address this problem is to provide natural language query interfaces to
complement the existing, more formal methods.
AquaLog (http://kmi.open.ac.uk/technologies/aqualog/) [LM04] is an ontology-driven,
portable question-answering (QA) system built to provide a natural language query
interface to a knowledge base. The application scenario for AquaLog and other similar
systems is very similar to that of a natural language interface to a relational database,
but with semi-structured data encoded in ontologies instead of structured data stored in
an RDBMS.
AquaLog translates controlled natural language queries into a triple representation
called Query-Triples, by first performing shallow parsing. Further processing is con-
ducted by a Relation Similarity Service (RSS) module, which maps questions into
ontology-compliant queries. This module uses string similarity metrics, lexical resources
such as WordNet [Fel98], and domain-dependent lexicons in order to generate
query-triples that are compliant with the underlying ontology [LM04]. If the RSS
module fails to discover matching relations or concepts within the KB, it asks the user, as
a last resort, to disambiguate the relation or concept from a given set of candidates.
Another relevant effort is Attempto Controlled English (ACE,
http://www.ifi.unizh.ch/attempto/). It is a subset of standard English designed for
knowledge representation and technical specifications, and constrained to be unambiguously
machine-readable into discourse representation structures, a form of first-order logic
(it can also be translated into other formal languages). ACE has been adopted as the
controlled language of the EU FP6 Network of Excellence REWERSE (Reasoning on the
Web with Rules and Semantics, http://rewerse.net/) [FKK+06].
The Attempto Parsing Engine (APE, http://www.ifi.unizh.ch/attempto/tools/)
consists principally of a definite clause grammar, augmented with features and inheritance,
and written in Prolog [Hoe04]. The tool produces Discourse Representation Structures
(DRS), which can be mapped cleanly to first-order logic and OWL DL [KF06, Kuh06].
One of the problems with ACE is that it currently has limited support for datatype
properties [Kal06], which we expect to need for our domain ontology. Another obstacle
to using ACE is that it has a predefined lexicon, so out-of-vocabulary words can only be
used if they are annotated with a part-of-speech tag, e.g., Every carnivore is a
n:meat-eater, which requires the user to be familiar with the lexicon [Kal06].
Another relevant tool is Cypher (http://www.monrai.com/products/cypher/), which is
proprietary (but free of charge) software that translates natural language input into RDF
and SeRQL (Sesame RDF Query Language), according to grammars and lexica defined by
the user in XML. As Cypher has only recently been released, it is difficult to evaluate
it in detail here. One possible downside is that it requires the development of a
grammar and lexicon for each application domain.

Chapter 6

Conclusion

This deliverable presented a number of requirements towards the TAO methodology and
technology. As a result, we defined the transitioning scenarios to be addressed in this
project. The result of the redesign process, discussed in Chapter 4, provides the starting
point for the definition of the semantic web services, with the help of tools from the TAO
suite.
The work planned for the next two years on the transitioning scenarios will lead to
a thorough evaluation of the benefits of semantic technologies, by comparison against
the functionality of the legacy application. The evaluation will be along the following
dimensions:

usefulness of the domain ontology for service annotation

usefulness of the domain ontology for content augmentation

ease of finding information on a given topic, with and without semantic-based ac-
cess

scalability of the knowledge stores and ability to store all case study data

ease of customisation and maintenance of the ontology learning and content aug-
mentation services

benefit of assisted vs manual annotation

In order to measure these, we will carry out a number of task-based evaluations in year
3, and the results will be used for exploitation activities towards the software engineering
sector.

Bibliography

[ADR06] S. Auer, S. Dietzold, and T. Riechert. OntoWiki: A Tool for Social, Semantic Collaboration. In Proceedings of the Fifth International Semantic Web Conference (ISWC06), 2006.

[ASHK06] A. Ankolekar, K. Sycara, J. Herbsleb, and R. Kraut. Supporting Online Problem Solving Communities with the Semantic Web. In Proceedings of WWW, 2006.

[Aum05] D. Aumueller. Semantic authoring and retrieval within a Wiki. In Proceedings of the Second European Semantic Web Conference (ESWC05), 2005.

[BDG+00] S. Bird, D. Day, J. Garofolo, J. Henderson, C. Laprun, and M. Liberman. ATLAS: A flexible and extensible architecture for linguistic annotation. In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, 2000.

[BTMC04] K. Bontcheva, V. Tablan, D. Maynard, and H. Cunningham. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering, 10(3/4):349–373, 2004.

[CMB02] D. Cubranic, G. Murphy, and K. Booth. Hipikat: A Developer's Recommender. In OOPSLA, 2002.

[CMB+05] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, C. Ursu, M. Dimitrov, M. Dowman, N. Aswani, and I. Roberts. Developing Language Processing Components with GATE Version 3 (a User Guide). http://gate.ac.uk/, 2005.

[CMBT02] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL02), 2002.

[Cun02] H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36:223–254, 2002.

[DDM04] J. Domingue, M. Dzbor, and E. Motta. Magpie: Supporting Browsing and Navigation on the Semantic Web. In N. Nunes and C. Rich, editors, Proceedings of the ACM Conference on Intelligent User Interfaces (IUI), pages 191–197, 2004.

[DHCT06] B. Davis, S. Handschuh, H. Cunningham, and V. Tablan. Further use of Controlled Natural Language for Semantic Annotation of Wikis. In Proceedings of the 1st Semantic Authoring and Annotation Workshop at ISWC2006, Athens, Georgia, USA, November 2006.

[DRR+05] B. Decker, E. Ras, J. Rech, B. Klein, and C. Hoecht. Self-Organized Reuse of Software Engineering Knowledge Supported by Semantic Wikis. In Workshop on Semantic Web Enabled Software Engineering (SWESE), Galway, Ireland, 2005.

[Fel98] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[FKK+06] N. E. Fuchs, K. Kaljurand, T. Kuhn, G. Schneider, L. Royer, and M. Schroder. Attempto Controlled English and the semantic web. Deliverable I2D7, REWERSE Project, April 2006.

[GMG05] M. Grcar, D. Mladenic, and M. Grobelnik. User Profile Inference Module. Technical report, SEKT project deliverable D5.5.2, 2005.

[Gri97] R. Grishman. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA, 1997. http://www.itl.nist.gov/div894/894.02/related_projects/tipster/.

[Hoe04] S. Hoefler. The syntax of Attempto Controlled English: An abstract grammar for ACE 4.0. Technical Report ifi-2004.03, Department of Informatics, University of Zurich, 2004.

[Kal06] K. Kaljurand. Writing OWL ontologies in ACE. Technical report, University of Zurich, August 2006.

[KAMN98] C. Kamprath, E. Adolphson, T. Mitamura, and E. Nyberg. Controlled Language for Multilingual Document Production: Experience with Caterpillar Technical English. In Second International Workshop on Controlled Language Applications (CLAW 98), 1998.

[KF06] K. Kaljurand and N. E. Fuchs. Bidirectional mapping between OWL DL and Attempto Controlled English. In Fourth Workshop on Principles and Practice of Semantic Web Reasoning, Budva, Montenegro, June 2006.

[Knu05] H. Knublauch. Ramblings on Agile Methodologies and Ontology-Driven Software Development. In Workshop on Semantic Web Enabled Software Engineering (SWESE), Galway, Ireland, 2005.

[KPO+04] A. Kiryakov, B. Popov, D. Ognyanoff, D. Manov, A. Kirilov, and M. Goranov. Semantic annotation, indexing and retrieval. Journal of Web Semantics, ISWC 2003 Special Issue, 1(2):671–680, 2004.

[Kuh06] T. Kuhn. Attempto Controlled English as ontology language. In F. Bry and U. Schwertel, editors, REWERSE Annual Meeting 2006, March 2006.

[LM04] V. Lopez and E. Motta. Ontology driven question answering in AquaLog. In NLDB 2004 (9th International Conference on Applications of Natural Language to Information Systems), Manchester, 2004.

[MH05] H. Muljadi and T. Hideaki. Semantic wiki as an integrated content and metadata management system. In Poster Session at the International Semantic Web Conference (ISWC05), 2005.

[ODM+06] E. Oren, R. Delbru, K. Moeller, M. Voelkel, and S. Handschuh. Annotation and Navigation in Semantic Wikis. In First Workshop on Semantic Wikis: From Wiki to Semantic (SemWiki2006), 2006.

[PKK+04] B. Popov, A. Kiryakov, A. Kirilov, D. Manov, D. Ognyanoff, and M. Goranov. KIM Semantic Annotation Platform. Natural Language Engineering, 2004.

[Pul96] S. Pulman. Controlled Language for Knowledge Representation. In CLAW96: Proceedings of the First International Workshop on Controlled Language Applications, pages 233–242, Leuven, Belgium, 1996.

[Sab06] M. Sabou. Building Web Service Ontologies. PhD thesis, Vrije Universiteit, 2006.

[Sap05] B. Sapcota. Web Service Discovery in Distributed and Heterogeneous Environments. In International Conference on Web Intelligence (WI05), Compiegne, France, 2005.

[Sch06] S. Schaffert. IkeWiki: A Semantic Wiki for Collaborative Knowledge Management. In Semantic Technologies in Collaborative Applications (STICA06), 2006.

[Sku03] D. Skuce. A Controlled Language for Knowledge Formulation on the Semantic Web. http://www.site.uottawa.ca:4321/factguru2.pdf, 2003.

[Sou05] A. Souzis. Building a Semantic Wiki. IEEE Intelligent Systems, 20(5):87–91, 2005.

[Sow02] J. Sowa. Architectures for intelligent systems. IBM Systems Journal, 41(3), 2002.

[SP05] M. Sabou and J. Pan. Towards Improving Web Service Repositories through Semantic Web Techniques. In Workshop on Semantic Web Enabled Software Engineering (SWESE), Galway, Ireland, 2005.

[TM97] H. Thompson and D. McKelvie. Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe97, Barcelona, 1997.

[TR06] S. Thaddeus and S. V. Kasmir Raja. A Semantic Web Tool for Knowledge-based Software Engineering. In Workshop on Semantic Web Enabled Software Engineering (SWESE), Athens, GA, USA, 2006.

[VK06] D. Vrandecic and M. Kroetzsch. Reusing Ontological Background Knowledge in Semantic Wikis. In Proceedings of the First Workshop on Semantic Wikis (SemWiki06), 2006.

[VKV+06] M. Voelkel, M. Kroetzsch, D. Vrandecic, H. Haller, and R. Studer. Semantic Wikipedia. In Proceedings of the International World Wide Web Conference (WWW06), 2006.

[WTA06] P. Warren, I. Thurlow, and D. Alsmeyer. Applying Semantic Technology to a Digital Library. In J. Davies, R. Studer, and P. Warren, editors, Semantic Web Technologies. John Wiley and Sons, 2006.
Appendix A

Sample WSDL and Service Definition

This appendix shows a sample WSDL of a GAS service, followed by a service definition
file.

<?xml version="1.0" encoding="UTF-8"?>


<wsdl:definitions targetNamespace="http://gate.ac.uk/gate-service/1.0"
xmlns:apachesoap="http://xml.apache.org/xml-soap"
xmlns:impl="http://gate.ac.uk/gate-service/1.0"
xmlns:intf="http://gate.ac.uk/gate-service/1.0"
xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
xmlns:wsdlsoap="http://schemas.xmlsoap.org/wsdl/soap/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<!--WSDL created by Apache Axis version: 1.3
Built on Oct 05, 2005 (05:23:37 EDT)-->
<wsdl:types>
<schema elementFormDefault="qualified"
targetNamespace="http://gate.ac.uk/gate-service/1.0"
xmlns="http://www.w3.org/2001/XMLSchema">
<element name="getRequiredParameterNames">
<complexType/>
</element>
<element name="getRequiredParameterNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getRequiredParameterNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="getOptionalParameterNames">
<complexType/>
</element>
<element name="getOptionalParameterNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getOptionalParameterNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="getInputAnnotationSetNames">
<complexType/>
</element>
<element name="getInputAnnotationSetNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getInputAnnotationSetNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="getOutputAnnotationSetNames">
<complexType/>
</element>
<element name="getOutputAnnotationSetNamesResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded"
name="getOutputAnnotationSetNamesReturn" type="xsd:string"/>
</sequence>
</complexType>
</element>
<element name="processRemoteDocument">
<complexType>
<sequence>
<element name="executiveLocation" type="xsd:anyURI"/>
<element name="taskId" type="xsd:string"/>
<element name="docServiceLocation" type="xsd:anyURI"/>
<element name="docId" type="xsd:string"/>
<element maxOccurs="unbounded" name="annotationSets"
type="impl:AnnotationSetMapping"/>
<element maxOccurs="unbounded" name="parameterValues"
type="impl:ParameterValue"/>
</sequence>
</complexType>
</element>
<complexType name="AnnotationSetMapping">
<sequence>
<element name="docServiceASName" nillable="true" type="xsd:string"/>
<element name="gateServiceASName" nillable="true" type="xsd:string"/>
</sequence>
</complexType>
<complexType name="ParameterValue">
<sequence>
<element name="name" nillable="true" type="xsd:string"/>
<element name="value" nillable="true" type="xsd:string"/>
</sequence>
</complexType>
<element name="processRemoteDocumentResponse">
<complexType/>
</element>
<complexType name="GateWebServiceFault">
<sequence/>
</complexType>
<element name="fault" type="impl:GateWebServiceFault"/>
<element name="processDocument">
<complexType>
<sequence>
<element name="documentXml" type="xsd:string"/>
<element maxOccurs="unbounded" name="parameterValues"
type="impl:ParameterValue"/>
</sequence>
</complexType>
</element>
<element name="processDocumentResponse">
<complexType>
<sequence>
<element maxOccurs="unbounded" name="processDocumentReturn"
type="impl:AnnotationSetData"/>
</sequence>
</complexType>
</element>
<complexType name="AnnotationSetData">
<sequence>
<element name="name" nillable="true" type="xsd:string"/>
<element name="xmlData" nillable="true" type="xsd:string"/>
</sequence>
</complexType>
</schema>
</wsdl:types>
<wsdl:message name="getRequiredParameterNamesResponse">
<wsdl:part element="impl:getRequiredParameterNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getOptionalParameterNamesResponse">
<wsdl:part element="impl:getOptionalParameterNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getRequiredParameterNamesRequest">
<wsdl:part element="impl:getRequiredParameterNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getInputAnnotationSetNamesResponse">
<wsdl:part element="impl:getInputAnnotationSetNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processRemoteDocumentRequest">
<wsdl:part element="impl:processRemoteDocument" name="parameters"/>
</wsdl:message>
<wsdl:message name="GateWebServiceFault">
<wsdl:part element="impl:fault" name="fault"/>
</wsdl:message>
<wsdl:message name="getOptionalParameterNamesRequest">
<wsdl:part element="impl:getOptionalParameterNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getOutputAnnotationSetNamesRequest">
<wsdl:part element="impl:getOutputAnnotationSetNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="getOutputAnnotationSetNamesResponse">
<wsdl:part element="impl:getOutputAnnotationSetNamesResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processRemoteDocumentResponse">
<wsdl:part element="impl:processRemoteDocumentResponse"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processDocumentRequest">
<wsdl:part element="impl:processDocument" name="parameters"/>
</wsdl:message>
<wsdl:message name="getInputAnnotationSetNamesRequest">
<wsdl:part element="impl:getInputAnnotationSetNames"
name="parameters"/>
</wsdl:message>
<wsdl:message name="processDocumentResponse">
<wsdl:part element="impl:processDocumentResponse"
name="parameters"/>
</wsdl:message>
<wsdl:portType name="GateWebService">
<wsdl:operation name="getRequiredParameterNames">
<wsdl:input message="impl:getRequiredParameterNamesRequest"
name="getRequiredParameterNamesRequest"/>
<wsdl:output message="impl:getRequiredParameterNamesResponse"
name="getRequiredParameterNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="getOptionalParameterNames">
<wsdl:input message="impl:getOptionalParameterNamesRequest"
name="getOptionalParameterNamesRequest"/>
<wsdl:output message="impl:getOptionalParameterNamesResponse"
name="getOptionalParameterNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="getInputAnnotationSetNames">
<wsdl:input message="impl:getInputAnnotationSetNamesRequest"
name="getInputAnnotationSetNamesRequest"/>
<wsdl:output message="impl:getInputAnnotationSetNamesResponse"
name="getInputAnnotationSetNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="getOutputAnnotationSetNames">
<wsdl:input message="impl:getOutputAnnotationSetNamesRequest"
name="getOutputAnnotationSetNamesRequest"/>
<wsdl:output message="impl:getOutputAnnotationSetNamesResponse"
name="getOutputAnnotationSetNamesResponse"/>
</wsdl:operation>
<wsdl:operation name="processRemoteDocument">
<wsdl:input message="impl:processRemoteDocumentRequest"
name="processRemoteDocumentRequest"/>
<wsdl:output message="impl:processRemoteDocumentResponse"
name="processRemoteDocumentResponse"/>
<wsdl:fault message="impl:GateWebServiceFault" name="GateWebService
</wsdl:operation>
<wsdl:operation name="processDocument">
<wsdl:input message="impl:processDocumentRequest"
name="processDocumentRequest"/>
<wsdl:output message="impl:processDocumentResponse"
name="processDocumentResponse"/>
<wsdl:fault message="impl:GateWebServiceFault"
name="GateWebServiceFault"/>
</wsdl:operation>
</wsdl:portType>
<wsdl:binding name="GateWebService" type="impl:GateWebService">
<wsdlsoap:binding style="document"
transport="http://schemas.xmlsoap.org/soap/http"/>
<wsdl:operation name="getRequiredParameterNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getRequiredParameterNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getRequiredParameterNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="getOptionalParameterNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getOptionalParameterNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getOptionalParameterNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="getInputAnnotationSetNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getInputAnnotationSetNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getInputAnnotationSetNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="getOutputAnnotationSetNames">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="getOutputAnnotationSetNamesRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="getOutputAnnotationSetNamesResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
</wsdl:operation>
<wsdl:operation name="processRemoteDocument">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="processRemoteDocumentRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="processRemoteDocumentResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
<wsdl:fault name="GateWebServiceFault">
<wsdlsoap:fault name="GateWebServiceFault" use="literal"/>
</wsdl:fault>
</wsdl:operation>
<wsdl:operation name="processDocument">
<wsdlsoap:operation soapAction=""/>
<wsdl:input name="processDocumentRequest">
<wsdlsoap:body use="literal"/>
</wsdl:input>
<wsdl:output name="processDocumentResponse">
<wsdlsoap:body use="literal"/>
</wsdl:output>
<wsdl:fault name="GateWebServiceFault">
<wsdlsoap:fault name="GateWebServiceFault" use="literal"/>
</wsdl:fault>
</wsdl:operation>
</wsdl:binding>
<wsdl:service name="GateWebServiceService">
<wsdl:port binding="impl:GateWebService" name="GATEService">
<wsdlsoap:address location="http://..."/>
</wsdl:port>
</wsdl:service>
</wsdl:definitions>

This GAS service definition file specifies that the unnamed annotation set is used as
input (e.g., to read tokens), and that the annotation set called Output is used for storing
the annotations produced by the service. In addition, the service has one parameter, called
annTypes, whose value is passed on as the value of the parameter annotationTypes
of the processing resource Transfer, embedded in the GAS service. As different services
embed different PRs, each service definition specifies which embedded PR each parameter
is passed on to.

<service xmlns="http://gate.ac.uk/gate-service/definition">
<annotationSets>
<annotationSet name="" in="true" />
<annotationSet name="Output" out="true" />
</annotationSets>
<parameters>
<param name="annTypes" optional="true">
<runtimeParameter prName="Transfer" prParam="annotationTypes" />
</param>
</parameters>
</service>
