
A Novel Approach to Automatic Gazetteer Generation using Wikipedia

Ziqi Zhang, University of Sheffield, UK, z.zhang@dcs.shef.ac.uk
José Iria, University of Sheffield, UK, j.iria@dcs.shef.ac.uk

Proceedings of the 2009 Workshop on the People’s Web Meets NLP, ACL-IJCNLP 2009, pages 1-9, Suntec, Singapore, 7 August 2009. © 2009 ACL and AFNLP.

Abstract

Gazetteers, or entity dictionaries, are important knowledge resources for solving a wide range of NLP problems, such as entity extraction. We introduce a novel method to automatically generate gazetteers from seed lists using an external knowledge resource, Wikipedia. Unlike previous methods, our method exploits the rich content and various structural elements of Wikipedia, and does not rely on language- or domain-specific knowledge. Furthermore, applying the extended gazetteers to an entity extraction task in a scientific domain, we empirically observed a significant improvement in system accuracy compared with systems using the seed gazetteers alone.

1 Introduction

Entity extraction is the task of identifying and classifying atomic text elements into predefined categories such as person names, place names, and organization names. It often serves as a fundamental step for complex Natural Language Processing (NLP) applications such as information retrieval, question answering, and machine translation. It has been recognized that gazetteers, or entity dictionaries, play a crucial role in this task (Roberts et al., 2008). In addition, they serve as important resources for other studies, such as assessing the level of ambiguity of a language, and disambiguation (Maynard et al., 2004).

Because building and maintaining high quality gazetteers by hand is very time consuming (Kazama and Torisawa, 2008), many solutions have been proposed for generating gazetteers automatically from existing resources. In particular, the success that solutions exploiting Wikipedia [1] have been enjoying in many other NLP applications has encouraged a number of research works on automatic gazetteer generation to use Wikipedia, such as the works by Toral and Muñoz (2006) and Kazama and Torisawa (2007).

[1] http://en.wikipedia.org

Unfortunately, current systems still present several limitations. First, none have exploited the full content and structure of Wikipedia articles; instead, they make use only of an article’s first sentence. However, the full content and structure of Wikipedia carry rich information that has proven useful in many other NLP problems, such as document classification (Gabrilovich and Markovitch, 2006), entity disambiguation (Bunescu and Paşca, 2006), and semantic relatedness (Strube and Ponzetto, 2006). Second, no other works have evaluated their methods in the context of entity extraction tasks. Evaluating the generated gazetteers in real NLP applications is important, because their quality has a major impact on the performance of the NLP applications that use them. Third, the majority of approaches focus on the newswire domain and the four classic entity types, location (LOC), person (PER), organization (ORG) and miscellaneous (MISC), which have been studied extensively. However, it has been argued that entity extraction is often much harder in scientific domains, due to the complexity of domain languages, the density of information, and the specificity of classes (Murphy et al., 2006; Byrne, 2007; Nobata et al., 2000).

In this paper we propose a novel approach to automatically generating gazetteers using external knowledge resources. Our method is language- and domain-independent, and scalable. We show that the content and various structural elements of Wikipedia can be successfully exploited to generate high quality gazetteers. To assess gazetteer quality, we evaluate it in the context of entity extraction in the scientific domain of Archaeology, and demonstrate that the generated gazetteers improve the performance of an SVM-based entity tagger across all entity types on an archaeological corpus.

The rest of the paper is structured as follows. In the next section, we review related work.
In section 3 we explain our methodology for automatic gazetteer generation. Section 4 introduces the problem domain and describes the experiments conducted. Section 5 presents and discusses the results. Finally, we conclude with an outline of future work.

2 Related Work

Existing methods for automatic gazetteer generation can be categorized into two mainstreams: the pattern driven approach and the knowledge resource approach.

The pattern driven approach uses domain- and language-specific patterns to extract candidate entities from unlabeled corpora. The idea is to include features derived from unlabeled data to improve a supervised learning model. For example, Riloff and Jones (1999) introduced a bootstrapping algorithm which starts from seed lists and iteratively learns and refines domain-specific extraction patterns for a semantic category, which are then used for building dictionaries from unlabeled data. Talukdar et al. (2006), also starting with seed entity lists, apply pattern induction to an unlabeled corpus and then use the induced patterns to extract candidate entities from the corpus to build extended gazetteers. They showed that using the token membership feature with the extended gazetteer improved the performance of a Conditional Random Field (CRF) entity tagger. Kozareva (2006) designed language-specific extraction patterns and validation rules to build Spanish location (LOC), person (PER) and organization (ORG) gazetteers from unlabeled data, and used these to improve a supervised entity tagger.

However, the pattern driven approach has been criticized for weak domain adaptability and inadequate extensibility, due to the specificity of the derived patterns (Toral and Muñoz, 2006; Kazama and Torisawa, 2008). Also, it is often difficult and time-consuming to develop domain- and language-specific patterns.

The knowledge resource approach attempts to solve these problems by relying on the abundant information and domain-independent structures in existing large-scale knowledge resources. Magnini et al. (2002) used WordNet as a gazetteer, together with rules, to extract entities such as LOC, PER and ORG. They used two relations in WordNet: Word_Class, referring to concepts bringing external evidence, and Word_Instance, referring to particular instances of those concepts. Concepts belonging to Word_Class are used to identify trigger words for candidate entities in the corpus, while concepts of Word_Instance are used directly as lookup dictionaries. They achieved good results on a newswire corpus. The main limitation of WordNet is its lack of domain-specific vocabulary, which is critical to domain-specific applications (Schütze and Pedersen, 1997). Roberts et al. (2008) used terminology extracted from UMLS as gazetteers and tested it in an entity extraction task over a medical corpus. Contrary to WordNet, UMLS is an example of a domain-specific knowledge resource, so its application is also limited.

Recently, the exponential growth of the information content in Wikipedia has made this Web resource increasingly popular for solving a wide range of NLP problems across different domains.

Concerning automatic gazetteer generation, Toral and Muñoz (2006) tried to build gazetteers for LOC, PER, and ORG by extracting all noun phrases from the first sentences of Wikipedia articles. Next, they map the noun phrases to WordNet synsets and follow the hypernymy hierarchy until they reach a synset belonging to the entity class of interest. However, they did not evaluate the generated gazetteers in the context of entity extraction. Due to the lack of domain-specific knowledge in WordNet, their method is limited if applied to domain-specific gazetteer generation. In contrast, our method overcomes this limitation since it does not rely on any resources other than Wikipedia. Another fundamental difference is that our method exploits more complex structures of Wikipedia.

Kazama and Torisawa (2007) argued that while traditional gazetteers map word sequences to predefined entity categories, such as “London → {LOCATION}”, a gazetteer is useful as long as it returns consistent labels, even if these are not predefined categories. Following this hypothesis, they mapped Wikipedia article titles to their hypernyms by extracting the first noun phrase after “be” in the first sentence of the article, and used these as gazetteers in an entity extraction task. In their experiment, they mapped over 39,000 search candidates to approximately 1,200 hypernyms; using these hypernyms as category labels in an entity extraction task showed an improvement in system performance. Later, Kazama and Torisawa (2008) did the same in another experiment on a Japanese corpus and achieved consistent results.
Although novel, their method in fact bypasses the real problem of generating gazetteers of specific entity types. Our method is essentially different in this aspect. In addition, they only use the first sentence of Wikipedia articles.

3 Automatic Gazetteer Generation – the Methodology

In this section, we describe our methodology for automatic gazetteer generation using the knowledge resource approach.

3.1 Wikipedia as the knowledge resource

To demonstrate the validity of our approach, we have selected the English Wikipedia as the external knowledge resource. Wikipedia is a free, multilingual and collaborative online encyclopedia that is growing rapidly and offers good quality of information (Giles, 2005). Articles in Wikipedia are identified by unique names and refer to specific entities. Wikipedia articles have many useful structures for knowledge extraction; for example, articles are inter-connected by hyperlinks carrying relations (Gabrilovich and Markovitch, 2006); articles about similar topics are categorized under the same labels, or grouped in lists; and categories are organized as taxonomies, with each category associated with one or more parent categories (Bunescu and Paşca, 2006). These relations are useful for identifying related articles, and thus entities, which is important for automatic gazetteer generation. Compared to other knowledge resources such as WordNet and UMLS, Wikipedia covers significantly larger amounts of information across different domains; therefore, it is more suitable for building domain-specific gazetteers. For example, as of February 2009 there are only 147,287 unique words in WordNet [2], whereas the English Wikipedia is significantly larger, with over 2.5 million articles. A study by Holloway (2007) identified that by 2005 there were already 78,977 unique categories, divided into 1,069 disconnected category clusters, which can be considered as the same number of different domains.

[2] According to http://wordnet.princeton.edu/man/wnstats.7WN, February 2009

3.2 The methodology

We propose an automatic gazetteer generation method using Wikipedia article contents, hyperlinks, and category structures, which can generate entity gazetteers of any type. Our method takes input seed entities of any type and extends them to more complete lists of the same type. It is based on three hypotheses:

1. Wikipedia contains articles about domain-specific seed entities.
2. Using the articles about the seed entities, we can extract fine-grained type labels for them, which can be considered as a list of hypernyms of the seed entities, and hyponyms of the seeds’ predefined entity type.
3. Following the links on Wikipedia articles, we can reach a large collection of articles that are related to the source articles. If a related article’s type label (as extracted above) matches any of those extracted for the seed entities, we consider it a similar entity of the predefined type.

Naturally, we divide our method into three steps: first we match each seed entity to a Wikipedia article (the matching phase); next we label the seed entities using the articles retrieved for them and build a pool of fine-grained type labels (the labeling phase); finally we extract similar entities by following links in the articles of the seed entities (the expansion phase). The pseudo-algorithm is illustrated in Figure 1.

3.2.1 Matching seed entities to Wikipedia articles

For a given seed entity, we first use the exact phrase to retrieve Wikipedia articles. If none is found, we use the leftmost longest match, as done by Kazama and Torisawa (2007). In Wikipedia, searches for ambiguous phrases are redirected to a Disambiguation Page, from which users have to manually select a sense. We filter out any matches that are directed to disambiguation pages. This filtering strategy is also applied in step 3, when extracting candidate entities.
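
To make the matching phase concrete, the following minimal Python sketch reproduces the lookup logic just described. The search_wikipedia and is_disambiguation helpers are hypothetical stand-ins for whatever Wikipedia access layer is in use (our implementation used the Java Wikipedia Library; see section 4.3):

    def match_seed(seed, search_wikipedia, is_disambiguation):
        # Matching phase (section 3.2.1): map one seed entity to an article.
        # search_wikipedia(title) -> article or None, and
        # is_disambiguation(article) -> bool are hypothetical helpers.
        article = search_wikipedia(seed)  # 1. exact phrase lookup
        # 2. leftmost longest match: drop tokens from the right until found.
        tokens = seed.split()
        while article is None and len(tokens) > 1:
            tokens = tokens[:-1]
            article = search_wikipedia(" ".join(tokens))
        # 3. discard matches that resolve to a disambiguation page.
        if article is not None and is_disambiguation(article):
            return None
        return article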

3.2.2 Labeling seed entities

After retrieving Wikipedia articles for all seed entities, we extract fine-grained type labels from these articles. We identified two types of information in Wikipedia from which potentially reliable labels can be extracted.

Input: seed entities SE of type T
Output: new entities NE of type T

STEP 1 (section 3.2.1)
1.1. Initialize set P of articles for SE
1.2. For each entity e in SE:
1.3.   Retrieve the Wikipedia article p for e
1.4.   Add p to P

STEP 2 (section 3.2.2)
2.1. Initialize set L
2.2. For each p in P:
2.3.   Extract fine-grained type labels l
2.4.   Add l to L

STEP 3 (section 3.2.3)
3.1. Initialize set HL
3.2. For each p in P:
3.3.   Add the hyperlinks from p to HL
3.4. If necessary, recursively crawl the extracted hyperlinks and repeat 3.2 and 3.3
3.5. For each link hl in HL:
3.6.   Extract fine-grained type labels l'
3.7.   If L contains l':
3.8.     Add the title of hl to NE
3.9.     Add the titles of the redirect links of hl to NE

Figure 1. The proposed pseudo-algorithm for gazetteer generation from the content and various structural elements of Wikipedia.
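
Read as code, Figure 1 amounts to a short driver. In the sketch below, match_seed is the matching sketch from section 3.2.1, wiki is a hypothetical access layer over Wikipedia, extract_labels stands for either labeling method of section 3.2.2, and expand is sketched in section 3.2.3; none of these names come from our actual implementation.

    def extend_gazetteer(seeds, wiki, extract_labels):
        # STEP 1: matching phase (section 3.2.1).
        pages = []
        for seed in seeds:
            page = match_seed(seed, wiki.search, wiki.is_disambiguation)
            if page is not None:
                pages.append(page)
        # STEP 2: labeling phase (section 3.2.2): pool the seeds'
        # fine-grained type labels.
        label_pool = set()
        for page in pages:
            label_pool.update(extract_labels(page))
        # STEP 3: expansion phase (section 3.2.3).
        return expand(pages, label_pool, wiki, extract_labels)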

As Kazama and Torisawa (2007) observed, in the first sentence of an article, the head noun of the noun phrase just after “be” is most likely the hypernym of the entity of interest, and thus a good category label. There are two pitfalls to this approach. First, the head noun may be too generic to represent a domain-specific label. For example, following their approach, the label extracted for the archaeological term “Classic Stage” [3] from the sentence “The Classic Stage is an archaeological term describing a particular developmental level.” is “term”, which is the head noun of “archaeological term”. Clearly, in such a case the full phrase is more domain-specific. For this reason, we use the exact noun phrase as the category label in our work. Second, their method ignores correlative conjunctions, which in most cases indicate equally useful labels. For example, the two noun phrases “city” and “metropolitan borough” in the sentence “Sheffield is a city and metropolitan borough in South Yorkshire, England” are equally useful labels for the article “Sheffield”. Therefore, we also extract the noun phrase connected by a correlative conjunction as a label. We apply this method to the articles retrieved in 3.2.1. For simplicity, we refer to this approach to labeling seed entities as FirstSentenceLabeling, and to the labels created as Ls. Note that our method is essentially different from Kazama and Torisawa’s, as we do not add these extracted nouns to gazetteers; instead, we only use them for guiding the extraction of candidate entities, as described in section 3.2.3.

[3] Any Wikipedia examples for illustration in this paper make use of the English Wikipedia, February 2009, unless otherwise stated.
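
One possible implementation of FirstSentenceLabeling, using spaCy for the parsing; the library choice is illustrative (our system is Java-based), and the conjunction handling is approximated by taking the noun phrase that immediately follows the first one after the copula:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def first_sentence_labels(first_sentence):
        # Label an entity by the noun phrase(s) following the copula
        # 'be' in its article's first sentence, keeping the full phrase
        # rather than only the head noun.
        doc = nlp(first_sentence)
        be = next((t.i for t in doc if t.lemma_ == "be"), None)
        if be is None:
            return []
        labels = []
        for chunk in doc.noun_chunks:
            if chunk.start > be:
                # Drop a leading determiner ("a city" -> "city").
                words = [t.text for t in chunk if t.pos_ != "DET"]
                labels.append(" ".join(words))
            if len(labels) == 2:
                # First NP after the copula plus the next NP, which
                # approximates the conjoined-NP case described above.
                break
        return labels

    # first_sentence_labels("Sheffield is a city and metropolitan "
    #                       "borough in South Yorkshire, England")
    # is expected to yield something like ['city', 'metropolitan borough'].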

As mentioned in section 3.1, similar articles in Wikipedia are manually grouped under the same categories by their authors, and categories are further organized as a taxonomy. As a result, we extract the category labels of articles as fine-grained type labels and consider them to be hypernyms of the entity’s article. We refer to this method as CategoryLabeling, and apply it to the seed entities to create a list of category labels, which we denote by Lc.

Three situations arise in which CategoryLabeling introduces noisy labels. First, some articles are categorized under a category with the same title as the article itself. For example, the article about “Bronze Age” is categorized under the category “Bronze Age”. In this case, we explore the next higher level of the category tree, i.e., we extract the categories of the category “Bronze Age”, including “2nd Millennium”, “3rd millennium BC”, “Bronze”, “Periods and stages in Archaeology”, and “Prehistory”. Second, some categories are meaningless and exist for management purposes, such as “Articles to be Merged since 2008” and “Wikipedia Templates”. For these, we manually create a small list of “stop” categories to be discarded. Third, according to Strube and Ponzetto (2008), the category hierarchy is sometimes noisy. To reduce noisy labels, we only keep labels that are extracted for at least 2 seed entities.

Once the pool of fine-grained type labels has been created, in the next step we consider its labels as fine-grained and immediate hypernyms of the seed entities, and use them as a control vocabulary to guide the extraction of candidate entities.
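
A corresponding sketch of CategoryLabeling with the three noise-handling rules; categories_of and parent_categories are hypothetical accessors over the category graph, and the stop list is abbreviated:

    from collections import Counter

    # Abbreviated list of "stop" categories kept for management purposes.
    STOP_CATEGORIES = {"Articles to be Merged since 2008",
                       "Wikipedia Templates"}

    def category_labels(article_title, categories_of, parent_categories):
        # Rules 1 and 2: a category named after the article itself is
        # replaced by its parent categories; stop categories are dropped.
        labels = []
        for cat in categories_of(article_title):
            cats = parent_categories(cat) if cat == article_title else [cat]
            labels.extend(c for c in cats if c not in STOP_CATEGORIES)
        return labels

    def build_label_pool(seed_titles, categories_of, parent_categories):
        # Rule 3: keep only labels extracted for at least 2 seed entities.
        counts = Counter()
        for title in seed_titles:
            counts.update(set(category_labels(title, categories_of,
                                              parent_categories)))
        return {label for label, n in counts.items() if n >= 2}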

3.2.3 Extracting candidate entities

To extract candidate entities, we first identify from Wikipedia the entities that are related to the seed entities. Then we select from them the candidates that share one or more common hypernyms with the seed entities. The intuition is that, in the taxonomy, nodes that share common immediate parents are mostly related and, therefore, good candidates for the extended gazetteers.

We extract related entities by following the hyperlinks from the articles retrieved for the seed entities, as in section 3.2.1. This is because, in Wikipedia, articles often contain mentions of entities that also have a corresponding article, and these mentions are represented as outgoing hyperlinks. They link the main article of an entity (the source entity) to other sets of entities (related entities). Therefore, by following these links we can reach a large set of entities related to the seed list. To reduce noise, we also filter out links to disambiguation pages, as in section 3.2.1. Next, for each candidate in the related set, we use the two labeling approaches introduced in section 3.2.2 to extract its type labels. If any of these is included in the control vocabulary built with the same labeling approach, we accept the candidate into the extended gazetteers. That is, if the control vocabulary is built by FirstSentenceLabeling, we only use FirstSentenceLabeling to label the candidate; the same applies to CategoryLabeling. One can easily extend this stage by recursively crawling the hyperlinks contained in the retrieved pages. In addition, some Wikipedia articles have one or more redirecting links, which group several surface forms of a single entity. For example, a search for “army base” is redirected to the article “military base”. These surface forms can be considered synonyms, and we thus also select them for the extended gazetteers.
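
The selection rule just described (label each linked article with the same labeling method used for the seeds, accept it on any overlap with the control vocabulary, and pull in its redirect titles as synonyms) might be sketched as follows, completing the driver shown after Figure 1:

    def expand(pages, label_pool, wiki, extract_labels):
        # Expansion phase (STEP 3 of Figure 1): follow outgoing links,
        # keep linked entities whose labels overlap the control
        # vocabulary, and add their redirect titles as surface forms.
        accepted = set()
        for page in pages:
            for linked in wiki.outlinks(page):
                # Same filtering as the matching phase.
                if wiki.is_disambiguation(linked):
                    continue
                if label_pool & set(extract_labels(linked)):
                    accepted.add(wiki.title(linked))
                    # Redirects group surface forms of one entity,
                    # e.g. "army base" -> "military base".
                    accepted.update(wiki.redirect_titles(linked))
        return accepted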

After applying the above processes to all seed entity articles, we obtain as output extended gazetteers of domain-specific types. To eliminate potentially ambiguous entities, for each extended gazetteer we exclude entities that are found in domain-independent gazetteers. For example, we use a generic person name gazetteer to exclude ambiguous person names from the extended gazetteer for LOC.

4 Experiments

In this section we describe our experiments. Our goal is to build extended gazetteers using the methods proposed in section 3, and to test them in an entity extraction task to improve a baseline system. First we introduce the setting, an entity extraction task in the archaeological domain; next we describe data preparation, including training data annotation and gazetteer generation; then we introduce our baseline; and finally we present the results.

4.1 The Problem Domain

The problem of entity extraction has been studied extensively across different domains, particularly in newswire articles (Talukdar et al., 2006) and biomedical science (Roberts et al., 2008). In this experiment, we present the problem within the domain of archaeology, a discipline that has a long history of active fieldwork and a significant amount of legacy data dating back to the nineteenth century and earlier. Jeffrey et al. (2009) report that, despite the existing fast-growing large corpora, little has been done to develop high quality meta-data for efficient access to the information in these datasets, which has become a pressing issue in archaeology. To the best of our knowledge, three works have piloted research on using information extraction techniques for automatic meta-data generation in this field. Greengrass et al. (2008) applied entity and relation extraction to historical court records to extract names, locations and trial names and their relations; Amrani et al. (2008) used a series of text-mining technologies to extract archaeological knowledge from specialized texts, one of these tasks concerning entity extraction; and Byrne (2007) applied entity and relation extraction to a corpus of archaeology site records, concentrating on nested entity recognition of 11 entity types.

Our work deals with archaeological entity extraction from unstructured legacy data, which mostly consists of full-length archaeological reports varying from 5 to over a hundred pages. According to Jeffrey et al. (2009), three types of entities are most useful to an archaeologist:

- Subject (SUB) – topics that reports refer to, such as findings of artifacts and monuments. It is the most ambiguous type, because it covers various specialized domains such as warfare, architecture, agriculture, machinery, and education. Examples include “Roman pottery”, “spearhead”, and “courtyard”.
- Temporal terms (TEM) – archaeological dates of interest, which are written in a number of ways, such as years (“1066 - 1211”, “circa 800AD”); centuries (“C11”, “the 1st century”); concepts (“Bronze Age”, “Medieval”); and acronyms (“BA” for Bronze Age, “MED” for Medieval).
- Location (LOC) – place names of interest, such as place names and site addresses related to a finding or excavation. In our study, these refer to UK-specific places.

Source          Domain       Tag Density
astro-ph        Astronomy    5.4%
MUC7            Newswire     11.8%
GENIA           Biomedical   33.8%
AHDS-selected   Archaeology  9.2%

Table 1. Comparison of tag density in four test corpora for entity extraction tasks. The “AHDS-selected” corpus used in this work has a tag density comparable to that of MUC7.

4.2 Corpus and resources

We developed and tested our system on 30 full-length UK archaeological reports archived by the Arts and Humanities Data Service (AHDS) [4]. These articles vary from 5 to 120 pages, with a total of 225,475 words. The corpus is tagged by three archaeologists, and is used for building and testing the entity extraction system. Compared to other test data reported in Murphy et al. (2006), our task can be considered hard, due to the heterogeneity of information across the entity types and the lower tag density of the corpus (the percentage of words tagged as entities); see Table 1. Also, according to Vlachos (2007), full-length articles are harder than abstracts, which are common in the biomedical domain. This corpus is then split into five equal parts for a five-fold cross validation experiment.

For seed gazetteers, we used the MIDAS Period list [5] as the gazetteer for TEM; the Thesaurus of Monument Types (TMT2008) from English Heritage [6] and the Thesaurus of Archaeology Objects from the STAR project [7] as gazetteers for SUB; and the UK Government list of administrative areas as the gazetteer for LOC. In the following sections, we will refer to these gazetteers as GAZ_original.

[4] http://ahds.ac.uk/
[5] http://www.midas-heritage.info and http://www.fish-forum.info
[6] http://thesaurus.english-heritage.org.uk
[7] http://hypermedia.research.glam.ac.uk/kos/STAR/
4.3 Automatic gazetteer generation

We used the seed gazetteers together with the methods presented in section 3 to build new gazetteers for each entity type, and merged them with the seeds to form the extended gazetteers tested in our experiments. Since we introduced two methods for labeling seed entities (section 3.2.2), which are also used separately for selecting extracted candidate entities (section 3.2.3), we designed four experiments to test the methods separately as well as in combination. Specifically, for each entity type, GAZ_EXTfirstsent denotes the extended gazetteer built using FirstSentenceLabeling for labeling seed entities and selecting candidate entities; GAZ_EXTcategory refers to the extended gazetteer built with CategoryLabeling; GAZ_EXTunion merges the entities of the two extended gazetteers into a single gazetteer; and GAZ_EXTintersect is the intersection of GAZ_EXTfirstsent and GAZ_EXTcategory, i.e., it contains only entities that appear in both. Table 2 lists statistics of the gazetteers and Table 3 displays example type labels extracted by the two methods.
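
In set terms, the four variants reduce to a few operations; a tiny illustration with hypothetical entries for one entity type:

    # Hypothetical per-type entity sets (TEM-like examples).
    seed = {"Bronze Age", "Iron Age"}                           # GAZ_original
    ext_firstsent = seed | {"Chalcolithic"}                     # GAZ_EXTfirstsent
    ext_category = seed | {"Chalcolithic", "Urnfield culture"}  # GAZ_EXTcategory

    gaz_union = ext_firstsent | ext_category                    # GAZ_EXTunion
    gaz_intersect = ext_firstsent & ext_category                # GAZ_EXTintersect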

To implement the entity extraction system, we used the Runes [8] data representation framework, a collection of information extraction modules from T-rex [9], and the machine learning framework Aleph [10]. The core of the tagger system is an SVM classifier. We used the Java Wikipedia Library [11] (JWPL v0.452b) and the Wikipedia dump of February 2007 published with it.

[8] http://runes.sourceforge.net/
[9] http://t-rex.sourceforge.net/
[10] http://aleph-ml.sourceforge.net/
[11] http://www.ukp.tu-darmstadt.de/software/jwpl/

4.4 Feature selection and baseline system

We trained our baseline system by tuning the feature sets used and the size of the token window considered for feature generation, and we selected the best performing setting as the baseline. Later, we add the official gazetteers of section 4.2 and the extended gazetteers of section 4.3 to the baseline, using gazetteer membership as an additional feature, to empirically verify the improvement in system accuracy.

The baseline setting thus used a window size of 5 and the following feature set (a minimal sketch of such a feature generator follows the list):

- Morphological root of a token
- Exact token string
- Orthographic type (e.g., lowercase, uppercase)
- Token kind (e.g., number, word)
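
In the sketch below, the morphological root is approximated with a crude suffix-stripping rule, and the gazetteer membership feature of the previous paragraph is included; all names are ours rather than the paper’s:

    def token_kind(tok):
        # Coarse token kind: number, word, or symbol.
        if tok.isdigit():
            return "number"
        return "word" if tok.isalpha() else "symbol"

    def ortho_type(tok):
        # Orthographic type of the token.
        if tok.isupper():
            return "allcaps"
        return "upperinit" if tok[:1].isupper() else "lowercase"

    def features(tokens, i, gazetteer, window=5):
        # Feature map for tokens[i] over a window of `window` tokens
        # centred on i (window size 5 -> two tokens each side).
        feats = {}
        lo = max(0, i - window // 2)
        hi = min(len(tokens), i + window // 2 + 1)
        for j in range(lo, hi):
            tok, rel = tokens[j], j - i
            feats["string[%d]" % rel] = tok
            feats["root[%d]" % rel] = tok.lower().rstrip("s")  # crude root
            feats["ortho[%d]" % rel] = ortho_type(tok)
            feats["kind[%d]" % rel] = token_kind(tok)
            feats["gaz[%d]" % rel] = tok in gazetteer  # membership feature
        return feats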

4.5 Result

Table 4 displays the results obtained under each setting, using the standard metrics of Recall (R), Precision (P) and F-measure (F1). The bottom row shows the Inter Annotator Agreement (IAA) between the annotators on a shared sample corpus of the same kind as that used for building the system, calculated using the metric of Hripcsak and Rothschild (2005).

                   LOC                    SUB                   TEM
GAZ_original       11,786 (8,228 found)   5,725 (4,320 found)   61 (43 found)
GAZ_EXTfirstsent   19,385 (7,599)         11,182 (5,457)        163 (102)
GAZ_EXTcategory    18,861 (7,075)         13,480 (7,745)        305 (245)
GAZ_EXTunion       23,741 (11,955)        16,697 (10,972)       333 (272)
GAZ_EXTintersect   14,022 (2,236)         7,455 (1,730)         133 (72)

Table 2. Number of unique entities in each gazetteer, including official and extended versions. Each GAZ_EXT includes GAZ_original. For GAZ_original, the numbers in brackets are the numbers of entities found in Wikipedia; for the others, they are the numbers of extracted entities that are new to the corresponding GAZ_original.

LOC, FirstSentenceLabeling (597 unique labels): village, small village, place, town, civil parish
LOC, CategoryLabeling (779): villages in north Yorkshire, north Yorkshire geography stubs, villages in Norfolk, villages in Somerset, English market towns
SUB, FirstSentenceLabeling (1342): facility, building, ship, tool, device, establishment
SUB, CategoryLabeling (761): ship types, monument types, gardening, fortification, architecture stubs
TEM, FirstSentenceLabeling (11): period, archaeological period, era, century, millennium
TEM, CategoryLabeling (10): Periods and stages in archaeology, Bronze age, middle ages, historical eras, centuries

Table 3. Most frequently extracted fine-grained type labels (counted by the number of seed entities sharing each label) for each entity type and labeling method. The numbers in brackets are the numbers of unique labels extracted.

                       LOC                  SUB                  TEM
                       P     R     F1       P     R     F1       P     R     F1
Baseline (B)           69.4  67.4  68.4     69.6  62.3  65.7     82.3  81.4  81.8
B + GAZ_original       69.0  72.1  70.5     69.7  65.4  67.5     82.3  82.7  82.5
B + GAZ_EXTfirstsent   69.9  76.7  73.1     70.0  68.3  69.1     82.6  84.6  83.6
B + GAZ_EXTcategory    69.1  75.1  72.0     68.8  67.0  67.9     82.0  83.7  82.8
B + GAZ_EXTunion       68.9  75.0  71.8     69.8  66.5  68.1     82.4  83.4  82.9
B + GAZ_EXTintersect   69.3  76.2  72.6     69.7  67.6  68.6     82.6  84.3  83.4
IAA                    -     -     75.3     -     -     63.6     -     -     79.9

Table 4. Experimental results showing the accuracy of the systems in the entity extraction task for each entity type, varying the feature set used.

The metric is equivalent to scoring one annotator against the other using the F1 metric, and in practice system performance can be slightly higher than IAA (Roberts et al., 2008). The IAA figures for all types of entities are low, indicating that the entity extraction task for the archaeological domain is difficult, which is consistent with Byrne (2007)’s finding.
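
Under the Hripcsak and Rothschild (2005) formulation, IAA is simply the F1 of one annotator’s entity spans scored against the other’s; a minimal sketch, assuming exact-match span comparison:

    def iaa_f1(spans_a, spans_b):
        # Inter-annotator agreement as F1: score annotator A's entity
        # spans against annotator B's, treating B as the reference.
        # Spans are (start, end, type) triples; exact-match counting.
        a, b = set(spans_a), set(spans_b)
        if not a or not b:
            return 0.0
        precision = len(a & b) / len(a)
        recall = len(a & b) / len(b)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)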

5 Discussion

As shown in Table 2, our methods have generated domain-specific gazetteers that almost double the original seed gazetteers in every case, even for the smallest seed gazetteer, TEM. This confirms the hypotheses formulated in section 3.2: by utilizing the hypernymy relation and exploring information in an external resource, one can extend a gazetteer with entities of similar types without using language- and domain-specific knowledge. Also, by taking the intersection of the entities generated by the two labeling methods (bottom row of Table 2), we see that the overlap is relatively small (30%-40% of the list generated by either method), indicating that the extended gazetteers produced by the two methods are quite different and may be used to complement each other. Combining this with the figures in Table 3, we see that both methods extract fine-grained type labels, each of which yields, on average, 4-14 candidate entities.

The quality of the gazetteers can be checked using the figures in Table 4. First, all extended gazetteers improved over the baselines for the three entity types, with the highest increases in F1 of 4.7%, 3.4% and 1.8% for LOC, SUB, and TEM respectively.
In addition, they all outperform the original gazetteers, indicating that the quality of the extended gazetteers is good for the entity extraction task.

By comparing the effects of each extended gazetteer, we notice that using the gazetteers built with type labels extracted from the first sentence of Wikipedia articles always outperforms using those built via the Wikipedia categories, indicating that the first method (FirstSentenceLabeling) results in better quality gazetteers. This is due to two reasons. First, the category tree in Wikipedia is not a strict taxonomy, and does not always contain is-a relationships (Strube and Ponzetto, 2006). Although we have eliminated categories that are extracted for only one seed entity, the results indicate that the extended gazetteers are still noisier than those built by FirstSentenceLabeling. To illustrate, the articles for the SUB seed entities “quiver” and “arrowhead” are both categorized under “Archery”, which permits noisy candidates such as “Bowhunting”, “Camel archer” and “archer”. Applying a stricter filtering threshold may resolve this problem. Second, compared to Wikipedia categories, the labels extracted from the first sentences are sometimes very fine-grained and restrictive. For example, the labels extracted for “Buckinghamshire” from the first sentence are “ceremonial Home County” and “Non-metropolitan County”, both of which are UK-specific LOC concepts. These rather restrictive labels help keep the gazetteer expansion within the domain of interest. The better performance with FirstSentenceLabeling indicates that such restrictions have played a positive role in reducing noise in the generated labels, and thereby in improving the quality of the candidate entities.

We also tested the effect of combining the two approaches, and noticed that taking the intersection of the gazetteers generated by the two approaches outperforms taking the union, but the figures are still lower than those of the single best method. This is understandable, because by admitting members of the noisier gazetteer the system performance degrades.

6 Conclusion

We have presented a novel language- and domain-independent approach for automatically generating domain-specific gazetteers for entity recognition tasks using Wikipedia. Unlike previous approaches, our approach makes use of the richer content and structural elements of Wikipedia. By applying this approach to a corpus of the Archaeology domain, we empirically observed a significant improvement in system accuracy when compared with the baseline systems, and with the baselines plus the original gazetteers.

The extensibility and domain adaptability of our methods still need further investigation. In particular, our methods can be extended with several statistical filtering thresholds to control the label generation and candidate entity extraction in an attempt to reduce noise; the effect of recursively crawling Wikipedia articles in the candidate extraction stage is also worth studying. Additionally, it would be interesting to study other structures of Wikipedia, such as list structures and infoboxes, for gazetteer generation. In future work we will investigate these possibilities, and also test our approach in different domains.

Acknowledgement

This work is funded by the Archaeotools [12] project, carried out by the Archaeology Data Service, University of York, UK and the Organisation, Information and Knowledge Group (OAK) of the University of Sheffield, UK.

[12] http://ads.ahds.ac.uk/project/archaeotools/

References

Ahmed Amrani, Vichken Abajian, Yves Kodratoff and Oriane Matte-Tailliez. 2008. A Chain of Text-mining to Extract Information in Archaeology. In Proceedings of Information and Communication Technologies: From Theory to Applications (ICTTA 2008), 1-5.

Razvan Bunescu and Marius Paşca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of EACL 2006.

Kate Byrne. 2007. Nested Named Entity Recognition in Historical Archive Text. In Proceedings of the International Conference on Semantic Computing.

Evgeniy Gabrilovich and Shaul Markovitch. 2006. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 1301-1306, Boston.

Jim Giles. 2005. Internet Encyclopedias Go Head to Head. Nature 438, 900-901.

Mark Greengrass, Sam Chapman, Jamie McLaughlin, Ravish Bhagdev and Fabio Ciravegna. 2008. Finding Needles in Haystacks: Data-mining in Distributed Historical Datasets. In The Virtual Representation of the Past. Ashgate, London.

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the F-measure and Reliability in Information Retrieval. Journal of the American Medical Informatics Association, 296-298.

Todd Holloway, Miran Bozicevic and Katy Börner. 2007. Analyzing and Visualizing the Semantic Coverage of Wikipedia and its Authors. Complexity, Volume 12, Issue 3, 30-40.

Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. 2009. The Archaeotools Project: Faceted Classification and Natural Language Processing in an Archaeological Context. To appear in a special theme issue of the Philosophical Transactions of the Royal Society A, “Crossing Boundaries: Computational Science, E-Science and Global E-Infrastructures”.

Jun’ichi Kazama and Kentaro Torisawa. 2008. Inducing Gazetteers for Named Entity Recognition by Large-scale Clustering of Dependency Relations. In Proceedings of ACL-2008: HLT, 407-415.

Jun’ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as External Knowledge for Named Entity Recognition. In Proceedings of EMNLP-CoNLL 2007, 698-707.

Zornitsa Kozareva. 2006. Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists. In Proceedings of the EACL-2006 Student Research Workshop.

Bernardo Magnini, Matteo Negri, Roberto Prevete and Hristo Tanev. 2002. A WordNet-Based Approach to Named Entity Recognition. In Proceedings of COLING-2002 on SEMANET: Building and Using Semantic Networks, 1-7.

Diana Maynard, Kalina Bontcheva and Hamish Cunningham. 2004. Automatic Language-Independent Induction of Gazetteer Lists. In Proceedings of LREC 2004.

Tara Murphy, Tara McIntosh and James R. Curran. 2006. Named Entity Recognition for Astronomy Literature. In Proceedings of the Australasian Language Technology Workshop.

Chikashi Nobata, Nigel Collier and Jun’ichi Tsujii. 2000. Comparison between Tagged Corpora for the Named Entity Task. In Proceedings of the Workshop on Comparing Corpora at ACL 2000.

Ellen Riloff and Rosie Jones. 1999. Learning Dictionaries for Information Extraction by Multi-level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 474-479.

Angus Roberts, Robert Gaizauskas, Mark Hepple and Yikun Guo. 2008. Combining Terminology Resources and Statistical Methods for Entity Recognition: an Evaluation. In Proceedings of LREC 2008.

Hinrich Schütze and Jan O. Pedersen. 1997. A Co-occurrence-based Thesaurus and Two Applications to Information Retrieval. Information Processing and Management, 33(3): 307-318.

Michael Strube and Simone Paolo Ponzetto. 2006. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence, 1419-1424.

Partha Pratim Talukdar, Thorsten Brants, Mark Liberman and Fernando Pereira. 2006. A Context Pattern Induction Method for Named Entity Extraction. In Proceedings of CoNLL-2006, 141-148.

Antonio Toral and Rafael Muñoz. 2006. A Proposal to Automatically Build and Maintain Gazetteers for Named Entity Recognition by Using Wikipedia. In Proceedings of the Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics.

Andreas Vlachos. 2007. Evaluating and Combining Biomedical Named Entity Recognition Systems. In Proceedings of the Workshop on Biological, Translational, and Clinical Language Processing (BioNLP 2007).
