
Web-based context generation for semantic annotation of geological texts


Ikechukwu Nkisi-Orji, Nirmalie Wiratunga, Rachel Heaven, Stewart Massie and Kit-ying Hui
School of Computing and Digital Media, Robert Gordon University, Aberdeen
{i.o.nkisi-orji, n.wiratunga, s.massie, k.hui}@rgu.ac.uk
British Geological Survey, Keyworth, Nottingham
reh@bgs.ac.uk

Abstract—Semantic annotation is an enabling technology. It links the content of documents to ontology concepts that unambiguously describe them, enabling information retrieval applications. The annotation process itself is a challenging task due to the presence of polysemies and the lack of sufficient contextual text in some ontologies. This paper proposes a novel approach to the semantic annotation of document segments by first generating contextual texts for ontology concepts using documents sourced on the Web. Document segments are subsequently annotated with concepts based on the similarity of contexts.

Index Terms—semantic annotation; ontologies; document segments

I. INTRODUCTION
Semantic annotation links content expressed in documents
to concepts in ontologies. Ontologies provide a specification of
concepts in a domain and how they relate to each other [1]. As
a result, semantic annotation facilitates unambiguous access
to document content. Document segment annotation marks up units within documents, such as chapters and sections, with
representative concepts from domain ontologies. This enables
search models to target segments within documents [2] and to
link semantically related content [3]. Successful annotation of documents has been shown to improve information retrieval but is largely dependent on manual annotation, which is tedious, time-consuming and lacks scalability [4]. This paper proposes the recommendation of ontology concepts to annotate document segments based on the similarity between the content of a segment and the contexts of concepts. Concepts whose contexts are most similar to the segment are recommended as annotations. Typically, domain-specific ontologies lack adequate textual contexts for this purpose. Therefore, with the textual labels of ontology concepts forming queries to search engines, we augment concepts with contextual texts extracted from web documents.
II. BACKGROUND AND RELATED WORK
Electronic document authoring and mass digitisation efforts
have made vast amounts of domain-rich content available on
the web. To improve access to such content, several approaches
have been proposed in literature to semantically annotate
segments within documents. Document segments have been
annotated using texts in segment titles [2]. This approach
assumes that an author expresses the content of a segment in
the title which matches the labels of corresponding ontology

concepts and that the hierarchical structure of documents


(sections and sub-sections) corresponds with the hierarchical
structure of the ontology. These expectations are rigid and do
not support annotation challenges typical with multi-concept
segments. For example, it is easy to see how an author may
be unable to include all relevant concepts in a segments
title. Besides, the wrong sense of a polysemous term is easily
annotated without disambiguation [4].
Document segment annotation using DBpedia resources provides a more flexible approach that is able to disambiguate content [3]. Titles of DBpedia concepts form candidate annotations, while the corresponding textual contents provide context for disambiguation. Key terms in a document's segment are linked to DBpedia concepts, and the titles of the central concepts in the resulting DBpedia graph clusters are selected as annotations. Although promising, the requirement to use a fixed knowledge resource such as DBpedia makes this approach unsuitable when candidate annotations have no equivalent concepts in the knowledge resource. This is especially true for domain-specific concepts such as those contained in the ontologies used for this study. For example, THESAURUS describes general geoscience-related concepts such as dams and frozen soils. Despite these being common domain terms, a look-up on Wikipedia1 showed that about 45% of concepts have no corresponding articles. Machine learning approaches which cluster and annotate document segments have also been proposed [5], but the requirement for training corpora from which all possible annotations are learned limits the practicability of such approaches.
An initial study showed that we are able to recommend
good annotations for document segments based on annotations
assigned to segments with similar content in an annotated
corpus. The content of each annotated segment in the corpus
contributes to the contextual text of the ontology concept used
to annotate it. Analysis of the manually annotated corpus
used for the study showed that only a small fraction (5.4%)
of available ontology concepts appeared as annotations. The
implication is that about 95% of concepts cannot be recommended for document segments. Hence the need to source
contextual text from elsewhere.
1 DBpedia is linked open data generated from Wikipedia. As a result,
Wikipedia is often ahead of DBpedia with respect to the number of concepts
described.

Fig. 1. Pseudo-document generation from web documents

III. CONTEXT GENERATION FOR ANNOTATION


Our approach generates contexts for ontology concepts from the web. As shown in Fig. 1, a web search is followed by the retrieval and ranking of documents from which contextual texts are extracted.
A. Retrieve documents from the web
Ontology concepts usually have one or more labels which provide short textual descriptions (e.g. dams). All labels of each concept are issued as web search queries, and the result links are then used to retrieve the corresponding web documents. Due to reasons such as the presence of polysemous words in the search queries, some of the retrieved documents may not be relevant to the ontology's domain. Consequently, the retrieved documents are ranked to ensure that the most domain-related documents are used to generate contexts.
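The retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `web_search` stands in for whatever search engine API is used, and the stubbed corpus is invented for demonstration.

```python
# Sketch of the retrieval step. `web_search(query, n)` is a hypothetical
# client assumed to return a list of document texts for a query.

def retrieve_documents(concept_labels, web_search, max_results=10):
    """Issue each concept label as a web query and collect result documents."""
    documents = []
    seen = set()
    for label in concept_labels:
        for doc in web_search(label, max_results):
            if doc not in seen:  # drop duplicate pages returned for several labels
                seen.add(doc)
                documents.append(doc)
    return documents

# Stub standing in for a real search engine, with invented snippets.
def fake_search(query, n):
    corpus = {
        "dams": ["Dams impound water for storage.", "Beaver dams alter streams."],
        "frozen soils": ["Frozen soils affect foundation design."],
    }
    return corpus.get(query, [])[:n]

docs = retrieve_documents(["dams", "frozen soils"], fake_search)
print(len(docs))  # 3
```

In a real pipeline the stub would be replaced by calls to a search API, and `doc` would be the text extracted from each result link.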
B. Rank documents and extract context
The retrieved set of documents, R, for a concept, x, is ranked using the input ontology, C. The intuition is that a domain-related term appearing in domain-related documents most likely carries the desired sense of the term. Since related concepts tend to form clusters in an ontology, the ranking process measures the overlap between the concepts mentioned in a document and the ontological neighbourhood of the concept for which the document was retrieved. To this end, the ontology concepts C′ ⊆ C present in each r ∈ R are identified. The ontology-based semantic measure [6] shown in Equation (1) is then used to estimate the relevance of each y ∈ C′ to x. This involves finding the most specific common subsumer (mscs) of x and y: starting from the root of C, the mscs is the most distant concept that subsumes both x and y. Thus, higher relevance is assigned to y if it is close to x and both appear near the leaf nodes.
relevance(y, x) = 2 · MSCS(x, y) / (N(x) + N(y) + 2 · MSCS(x, y))    (1)

where relevance(y, x) is the relevance of y to x, N(x) is the minimum node count from x to the mscs, N(y) is the minimum node count from y to the mscs, and MSCS(x, y) is the minimum node count from the mscs to the root of C.
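Equation (1) is straightforward to compute once the node counts are known. The following worked example mirrors the paper's notation; the specific node counts are invented for illustration.

```python
# Worked example of Equation (1): Wu-Palmer-style relevance of y to x.

def relevance(n_x, n_y, mscs_depth):
    """n_x, n_y    -- minimum node counts from x and y to their mscs
    mscs_depth -- minimum node count from the mscs to the root of C"""
    return (2 * mscs_depth) / (n_x + n_y + 2 * mscs_depth)

# x one edge and y two edges below a shared subsumer three levels deep:
print(relevance(1, 2, 3))  # 2*3 / (1 + 2 + 2*3) = 6/9 = 0.666...
```

Note how a deeper mscs (both concepts near the leaves) pushes the score towards 1, while distant concepts under a shallow subsumer score near 0.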
The normalised cumulative relevance over C′ estimates the relevance of r to x and determines its rank. Subsequently, contextual text for a concept is extracted from the top 30 results, as in [7], by extracting the sentences that contain any concept label. The restriction to sentence-level contexts is to minimise the inclusion of irrelevant content.
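The ranking and extraction steps can be sketched as below. This assumes the cumulative relevance score per document has already been computed (e.g. by summing Equation (1) over C′); the sentence splitter and example documents are simplifications for illustration.

```python
import re

def rank_documents(doc_scores):
    """Sort (document, cumulative_relevance) pairs, most relevant first."""
    return [doc for doc, _ in sorted(doc_scores, key=lambda p: p[1], reverse=True)]

def extract_context(documents, labels, top_k=30):
    """Keep only sentences from the top-k documents that mention a concept label."""
    sentences = []
    for doc in documents[:top_k]:
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            if any(label.lower() in sent.lower() for label in labels):
                sentences.append(sent)
    return " ".join(sentences)

ranked = rank_documents([("Dams store water. Fish swim.", 0.4),
                         ("A dam is a barrier. Dams fail rarely.", 0.9)])
print(extract_context(ranked, ["dam"]))
# A dam is a barrier. Dams fail rarely. Dams store water.
```

The sentence "Fish swim." is discarded because it mentions no concept label, which is exactly the irrelevant content the sentence-level restriction is meant to exclude.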
C. Recommend annotations
Annotation recommendation for a document's segment is achieved by comparing the text in the segment to the contextual texts of ontology concepts. The greater the similarity between the contextual and segmental texts, the greater the annotation recommendation confidence. For this purpose, vector space models (VSMs) are generated for the texts, and similarity is measured by the cosine of the angle between the VSMs.
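A minimal sketch of the VSM comparison, using raw term-frequency vectors; the paper does not specify a weighting scheme, so a full system might instead use tf-idf.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

segment = "dams impound river water"
context = "a dam impounds water behind a barrier"
score = cosine_similarity(segment, context)
print(round(score, 3))  # 0.167 -- only "water" is shared between the vectors
```

Because only surface tokens are compared here, "dam" and "dams" do not match; stemming or lemmatisation would tighten the correspondence between segment and context vocabulary.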
IV. EVALUATION
The proposed approach will be applied to two ontologies: THESAURUS and CHRONOSTRAT. THESAURUS describes 6,095 common geoscience-related terms, while CHRONOSTRAT is a specialised dictionary of 406 geological time divisions. A corpus of 1,948 document sections, manually annotated by domain experts, will provide the ground truth for evaluation. Mean average precision (MAP) will be used to assess the quality of the recommended annotations. The results obtained will be compared to the approach of Hazman et al. [2], which does not require such contextual text.
REFERENCES
[1] T. R. Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, vol. 5, no. 2, pp. 199–220, 1993.
[2] M. Hazman, S. R. El-Beltagy, and A. Rafea, "An ontology based approach for automatically annotating document segments," International Journal of Computer Science, vol. 9, no. 2, 2012.
[3] I. Hulpus, C. Hayes, M. Karnstedt, and D. Greene, "Unsupervised graph-based topic labelling using DBpedia," in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 2013, pp. 465–474.
[4] R. Berlanga, V. Nebot, and M. Perez, "Tailored semantic annotation for semantic search," Web Semantics: Science, Services and Agents on the World Wide Web, 2014.
[5] M. Schaal, R. M. Muller, M. Brunzel, and M. Spiliopoulou, "RELFIN - topic discovery for ontology enhancement and annotation," in The Semantic Web: Research and Applications. Springer, 2005, pp. 608–622.
[6] Z. Wu and M. Palmer, "Verbs semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1994, pp. 133–138.
[7] W. Yih and V. Qazvinian, "Measuring word relatedness using heterogeneous vector space models," in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012, pp. 616–620.
