Link(s) to article on the publisher's website:
http://www.informatik.uni-trier.de/~ley/db/conf/ISCAicis/ISCAicis2002.html#YangZ02
Abstract
The proliferation of online information resources
increases the importance of effective and efficient
distributed searching. However, the problem of word
mismatch seriously hurts the effectiveness of distributed
information retrieval. Automatic query expansion has
been suggested as a technique for dealing with the
fundamental issue of word mismatch. In this paper, we
propose a method, query expansion with Naive Bayes, to
address the problem, discuss its implementation in the IISS
system, and present experimental results demonstrating its
effectiveness. This technique not only enhances the
discriminatory power of typical queries for choosing the
right collections but also significantly improves retrieval
results.
Keywords: Naive Bayes, Query Expansion, Word
Mismatch, Distributed Information Retrieval
Introduction
Related Work
[Figure/table residue: related terms — Cultivation, Tillage, Breeding, Development, Harvesting, Mowing, Reaping, Stacking, Winnowing, Husking, Threshing]
3.1.1

The probability of word $w_t$ in class $c_j$ is estimated from the training documents with Laplace smoothing:

$$P(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i)\, P(c_j \mid d_i)} \quad (2)$$

The class prior probabilities are estimated in the same way:

$$P(c_j) = \frac{1 + \sum_{d_i \in D} P(c_j \mid d_i)}{|C| + |D|} \quad (3)$$

where $|C|$ is the number of classes, $D = \{d_1, d_2, \ldots, d_{|D|}\}$ is the set of training documents, $|V|$ is the vocabulary size, $N(w_t, d_i)$ is the number of occurrences of word $w_t$ in document $d_i$, $P(c_j \mid d_i)$ indicates the class membership of document $d_i$, and $P(w_t \mid c_j)$ is the probability of word $w_t$ in documents in class $c_j$.

3.1.2

Given a document $d_{PD}$, Bayes' rule is used to compute the posterior probability of its class:

$$P(c_j \mid d_{PD}) = \frac{P(c_j)\, P(d_{PD} \mid c_j)}{P(d_{PD})} \quad (4)$$

where

$$P(d_{PD}) = \sum_{r=1}^{|C|} P(c_r)\, P(d_{PD} \mid c_r) \quad (5)$$

Under the Naive Bayes word-independence assumption, the likelihood of $d_{PD}$ given class $c_j$ is the product of the probabilities of its words:

$$P(d_{PD} \mid c_j) = \prod_{t=1}^{|d_{PD}|} P(w_t \mid c_j) \quad (6)$$

Substituting (5) and (6) into (4) gives

$$P(c_j \mid d_{PD}) = \frac{P(c_j) \prod_{t=1}^{|d_{PD}|} P(w_t \mid c_j)}{\sum_{r=1}^{|C|} P(c_r) \prod_{t=1}^{|d_{PD}|} P(w_t \mid c_r)} \quad (7)$$
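To make these estimation and classification steps concrete, the sketch below implements a multinomial Naive Bayes classifier along the lines of equations (2)-(7). It is only an illustration under simplifying assumptions, not the IISS implementation: the class memberships $P(c_j \mid d_i)$ are taken as hard 0/1 labels, and the function names and toy documents are invented for the example.

```python
# A minimal sketch (assumed, not the authors' IISS code) of multinomial Naive
# Bayes following equations (2)-(7): Laplace-smoothed word probabilities,
# smoothed class priors, and posterior computation for a new document.
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: class id per document.
    Hard labels play the role of P(c_j | d_i) in equations (2)-(3)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    v = len(vocab)

    word_counts = {c: Counter() for c in classes}   # sum over d of N(w_t, d) per class
    doc_counts = Counter(labels)
    for d, c in zip(docs, labels):
        word_counts[c].update(d)

    # Equation (2): P(w_t | c_j) with Laplace smoothing.
    p_word = {}
    for c in classes:
        total = sum(word_counts[c].values())
        p_word[c] = {w: (1 + word_counts[c][w]) / (v + total) for w in vocab}
        p_word[c]["__unseen__"] = 1.0 / (v + total)  # back-off for unseen words

    # Equation (3): smoothed class priors.
    p_class = {c: (1 + doc_counts[c]) / (len(classes) + len(docs))
               for c in classes}
    return p_class, p_word

def classify(doc, p_class, p_word):
    """Equations (4)-(7): posterior P(c_j | d) computed in log space, then
    normalised by the denominator of (7), i.e. equation (5)."""
    log_post = {c: math.log(prior) +
                   sum(math.log(p_word[c].get(w, p_word[c]["__unseen__"]))
                       for w in doc)
                for c, prior in p_class.items()}
    m = max(log_post.values())
    z = sum(math.exp(lp - m) for lp in log_post.values())
    return {c: math.exp(lp - m) / z for c, lp in log_post.items()}

# Toy usage with made-up documents and classes.
docs = [["wheat", "harvest", "yield"], ["stock", "market", "price"]]
labels = ["grain", "finance"]
p_class, p_word = train_naive_bayes(docs, labels)
print(classify(["wheat", "price", "harvest"], p_class, p_word))
```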
4 Experiment

4.1 Testbed
                                    REUTER-TRAINING    REUTER-TEST
Number of queries                         96                96
Raw text size in megabytes                9.68              70.5
Number of documents                       3294              17309
Mean words per document                   176               176
Mean relevant documents per query         33                168
Number of words                           5808              29568
Number of collections                     96                200
Mean documents per collection             33                100

Table 1: Statistics about the sets of collections used for evaluation
4.2 Experimental Setup

4.3 Experimental Results
5.1
5.1.1
$$\mathrm{Weight}(q \mid c_i) = df \cdot icf \quad (8)$$

$$R(q, C_i) = \sum_{t_j \in q} \sum_{d_k \in C_i} tf_{jk} \log\frac{N}{df_{t_j}} \quad (10)$$

where $df_{t_j}$ is the number of documents containing term $t_j$.
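For concreteness, the following sketch shows one way to compute these two collection-ranking scores. It is an assumed reading of equations (8) and (10), not the paper's implementation: in particular, summing df·icf over the query terms and the exact definition of icf are assumptions made for illustration.

```python
# Illustrative (assumed) implementation of the collection-ranking scores:
# equation (8), a df*icf weight, and equation (10), R(q, C_i) as a sum of
# tf * log(N / df) over query terms and documents in the collection.
import math

def weight_df_icf(query_terms, collection, collections):
    """Equation (8) as interpreted here: sum over query terms of df * icf,
    where df is the number of documents in this collection containing the
    term and icf = log(#collections / #collections containing the term).
    Both the summation and this icf definition are assumptions."""
    score = 0.0
    for t in query_terms:
        df = sum(1 for doc in collection if t in doc)
        cf = sum(1 for coll in collections if any(t in doc for doc in coll))
        if df and cf:
            score += df * math.log(len(collections) / cf)
    return score

def rank_r(query_terms, collection, n_docs_total, df_global):
    """Equation (10): R(q, C_i) = sum over t_j in q and d_k in C_i of
    tf_jk * log(N / df_tj), with N the total number of documents and
    df_tj the number of documents containing t_j."""
    score = 0.0
    for t in query_terms:
        df = df_global.get(t, 0)
        if df == 0:
            continue
        idf = math.log(n_docs_total / df)
        for doc in collection:          # doc is a list of tokens
            score += doc.count(t) * idf
    return score
```

Either score can then be used to rank the collections so that only the top-ranked ones are searched.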
5.1.2 Selection Result
[Figure: collection selection performance vs. number of collections selected (2-20), for the base query and for 10, 30, and 50 expansion concepts]
5.2

5.2.1 Expansion Concepts
are very strong indicators of relevance, so those non-relevant expansion concepts hurt retrieval effectiveness. Analysis of the results also reveals that for more than 30 expansion concepts, retrieval is improved at all document cut-offs. The improvement at higher cut-offs (from 70 to 100) is around 7%, which is more noticeable than at lower cut-offs.
5.2.2 Selection Collections

5.2.3 Training Collections
5.2.4
Acknowledgement
References
[Figure: precision vs. documents retrieved (10-90) for the base query and for 10, 30, and 50 expansion concepts]
[Figure: precision vs. documents retrieved (10-90) for 5, 10, 15, and 20 selected collections]
[Figure: precision vs. documents retrieved (10-90) for the full training set and for 25%, 50%, and 75% of the training data]