Abstract: As the number of Internet servers increases rapidly, it becomes difficult to determine the relevant servers when searching for information. We develop a new method to rank Internet servers for Boolean queries. Our method reduces time and space complexity from exponential to polynomial in the number of Boolean terms. We contrast it with other known methods and describe its implementation.
Index Terms: Boolean query, information retrieval, ranking, resource discovery, similarity measure.
1 INTRODUCTION

Fig. 1. Resource discovery process: (1) A user sends a query to the directory of services. (2) The directory of services returns a ranked list of relevant servers. (3) The user sends his query to one or more of the relevant servers, which (4) return matching documents.

servers. The WAIS directory of servers is similar to the directory of services in our model. It ranks servers based on a word-weighting algorithm, but it is maintained manually.

In Content Routing System (CRS) [10], each server is characterized by a content label, which is a Boolean combination of attribute-value pairs and is manually constructed by administrators or automatically derived from frequently occurring terms in the database. A server is relevant to a query if its content label satisfies the query. Users can refine queries when browsing the content labels of selected servers. In addition, this system automatically forwards queries to relevant servers and merges their results. Currently, nesting of Boolean operations is not supported.

The GlOSS [11] system uses a probabilistic scheme to find relevant servers for user queries. In GlOSS, each server extracts a histogram of term occurrences in its database. The histograms are used to estimate the query result size (defined as the number of documents in the database times the probability that a document contains all the query terms) and to determine relevant servers. This method is built upon the assumption that terms appear in different documents of a database following independent and uniform probability distributions. GlOSS only considers Boolean and queries and does not rank servers.

Indie [4] is designed and implemented based on the client-directory-server model. Each Indie resource is managed by a server called an Indie broker, which maintains a generator that describes the objects stored in its database. The generator, a nested Boolean expression, is used as a filter to collect data from information providers. The logically centralized but replicated server, called directory of services, is a specialized broker that contains only the generators of every Indie broker in the system. Users send a Boolean query to the directory of services, which compares the query with each generator in its database, finds the similarity between them, then sends a ranked list of relevant Indie brokers to the user.

Depending on the type and format of the user query and server description, each system employs a different similarity measure to determine relevant servers. Most systems only support a simple query type, such as keywords or simple Boolean combinations. They do not solve nested Boolean queries and rank servers accordingly. The method presented in this paper measures the similarity between Boolean expressions and can be directly applied to Indie.

3 SIMILARITY MEASURE

Well-known similarity measures, such as Dice's coefficient, Jaccard's coefficient, Cosine coefficient, and Overlap coefficient, have been used to compute the similarities of one document to another document, and documents to queries for automatic classification, clustering, and indexing [12]. For these measures, documents and queries are represented as sets of keywords or vectors.

In the cluster-based retrieval system, documents with high similarities are grouped into a cluster. User queries are first compared with cluster representatives, then compared with documents in the clusters that have high similarities with the queries [12]. In the client-directory-server model, the function of directory of services is similar to cluster-based retrieval, where servers are clusters described by cluster representatives (i.e., server descriptions). For user queries and cluster representatives both described as Boolean expressions, the above similarity measures cannot be applied directly. The degree of similarity between user queries and server descriptions is determined by how much these Boolean expressions overlap. Consider the example below.

EXAMPLE 2. Suppose RA and RB are the server descriptions of two retrieval systems (A and B) stored in the directory of services, and Q1 and Q2 are two user queries:

\[ R_A = t_1 \wedge t_3, \]
\[ R_B = t_2 \wedge t_4, \]
\[ Q_1 = t_1 \wedge t_2 \wedge t_3, \]
\[ Q_2 = (t_1 \vee t_2) \wedge t_3. \]

Both RA and RB overlap with Q1, but RA contains two overlapped terms (t1 and t3) while RB contains only one (t2). Thus, RA is more relevant to query Q1 than RB, assuming all terms are weighted equally. However, for an and-or-combined query Q2, it becomes more complicated to determine which server description is more relevant.

We need a systematic method to measure the overlap between user queries and server descriptions. Furthermore, this method must perform efficiently even when the number of server descriptions increases. Radecki employed several measures to rank similarity between Boolean expressions [13], [14]. In the following sections, we review Radecki's measures and present our modified measure. We demonstrate our improvements in space and time complexity and compare the two measures on a synthetic benchmark.

3.1 Background

Radecki proposed two similarity measures, S and S*, based on Jaccard's coefficient. He defined the similarity value S between queries Q1 and Q2 as the ratio of the number of common documents to the total number of documents returned in response to both queries. This ratio, commonly known as Jaccard's coefficient, can be described as
LI AND DANZIG: BOOLEAN SIMILARITY MEASURES FOR RESOURCE DISCOVERY 865
\[ S(Q_1, Q_2) = \frac{|\psi(Q_1) \cap \psi(Q_2)|}{|\psi(Q_1) \cup \psi(Q_2)|}, \]

where $\cap$ denotes set intersection, $\cup$ denotes set union, and $\psi(Q_1)$ and $\psi(Q_2)$ are the response sets to Q1 and Q2, respectively.

\[ \psi_B(R_B) = \{b_1, b_2, b_3\}. \]

Assume, for query Q1, the system responses are

\[ \psi_A(Q_1) = \{a_1, a_2, a_4\}, \]
\[ \psi_B(Q_1) = \{b_1, b_3\}, \]

where $\psi_A(Q_1)$ and $\psi_B(Q_1)$ are the responses to Q1 in systems A and B, respectively. The similarity measures between Q1 against RA and RB are then

\[ S(Q_1, R_A) = 3/4 = 0.750, \]
\[ S(Q_1, R_B) = 2/3 = 0.667. \]

In the case of a directory of services, however, the similarity measure is used to estimate the importance of entire information systems and decide the order in which users should search them. If the similarity is calculated based on the query results from every information system, the searching order is no longer needed because you have already searched them all.

Radecki proposed a similarity measure S* that is independent of the responses to the queries [14]. In S*, Boolean expression Q is transformed into its reduced disjunctive normal form (RDNF), denoted as $\tilde{Q}$, which is the disjunction of a list of reduced atomic descriptors. If set T is the union of all

\[ (\tilde{Q})_{T_Q \cup T_R} = \bigvee_{i=1}^{m} \Bigl( \bigwedge_{j=1}^{k} \tilde{q}_{i,j} \Bigr), \]
\[ (\tilde{R})_{T_Q \cup T_R} = \bigvee_{i=1}^{n} \Bigl( \bigwedge_{j=1}^{k} \tilde{r}_{i,j} \Bigr), \]

\[ T_{R_B} = \{t_2, t_4\}, \]

where set $T_X$ is the union of all the descriptors in Boolean expression X (X = Q2, RA, or RB). To transform Q2 to its RDNF, we can apply the distributive law

\[ (t_1 \vee t_2) \wedge t_3 = (t_1 \wedge t_3) \vee (t_2 \wedge t_3), \]

and expand the two conjunctions, $(t_1 \wedge t_3)$ and $(t_2 \wedge t_3)$, to their associated reduced atomic descriptors. The expansion process is based on the equation

\[ t_a = (t_a \wedge t_b) \vee (t_a \wedge \neg t_b), \]

where $t_a$ and $t_b$ are descriptors. Consider Q2 and RA first. Since $T_{Q_2} \cup T_{R_A} = \{t_1, t_2, t_3\}$, each reduced atomic descriptor in $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}}$ and $(\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}}$ must contain all the $t_i$'s ($1 \le i \le 3$) or their negated forms. Thus, the conjunctions in Q2 are expanded to

\[ t_1 \wedge t_3 = (t_1 \wedge t_3 \wedge t_2) \vee (t_1 \wedge t_3 \wedge \neg t_2), \]
\[ t_2 \wedge t_3 = (t_2 \wedge t_3 \wedge t_1) \vee (t_2 \wedge t_3 \wedge \neg t_1). \]

The RDNFs of Q2 and RA are

\[ (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} = (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3), \]
\[ (\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}} = (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3). \]
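The expansion into reduced atomic descriptors can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes that a reduced atomic descriptor over a descriptor set corresponds to one satisfying truth assignment, and that S* is the Jaccard ratio of the two sets of reduced atomic descriptors. The names `rdnf` and `s_star` and the encoding of expressions as Python lambdas are illustrative choices.

```python
from itertools import product

def rdnf(expr, terms):
    """Return the reduced atomic descriptors of expr over the given terms.

    expr takes a dict {term: bool}; each atom is the tuple of truth values
    (one per term, in order) for which expr evaluates to True.
    """
    atoms = set()
    for values in product([True, False], repeat=len(terms)):
        if expr(dict(zip(terms, values))):
            atoms.add(values)
    return atoms

def s_star(q, r, terms):
    """Jaccard ratio of the two RDNFs taken over the same descriptor set."""
    q_atoms, r_atoms = rdnf(q, terms), rdnf(r, terms)
    return len(q_atoms & r_atoms) / len(q_atoms | r_atoms)

# Q2, RA, and RB of the running example, written as Python predicates.
q2 = lambda v: (v["t1"] or v["t2"]) and v["t3"]
ra = lambda v: v["t1"] and v["t3"]
rb = lambda v: v["t2"] and v["t4"]

print(len(rdnf(q2, ["t1", "t2", "t3"])))          # 3 reduced atomic descriptors
print(s_star(q2, rb, ["t1", "t2", "t3", "t4"]))   # 0.25
```

Enumerating all assignments costs time exponential in the number of descriptors, which is the blow-up that motivates a measure that avoids the RDNF.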
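\[ (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} = (t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (t_1 \wedge t_2 \wedge t_3 \wedge \neg t_4) \vee (t_1 \wedge \neg t_2 \wedge t_3 \wedge t_4) \vee (t_1 \wedge \neg t_2 \wedge t_3 \wedge \neg t_4) \vee (\neg t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (\neg t_1 \wedge t_2 \wedge t_3 \wedge \neg t_4), \]
\[ (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} = (t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (t_1 \wedge t_2 \wedge \neg t_3 \wedge t_4) \vee (\neg t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (\neg t_1 \wedge t_2 \wedge \neg t_3 \wedge t_4). \]

Radecki defines the similarity value S* between two Boolean expressions (Q and R) as the ratio of the number of

\[ S^*(Q_2, R_B) = \frac{\bigl| (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} \cap (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} \bigr|}{\bigl| (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} \cup (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} \bigr|} = \frac{2}{8} = 0.250. \]

Therefore, RA is more relevant to query Q2 than RB.

From Example 4, we can see that Q2 is transformed to different RDNFs, $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}}$ and $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}}$, when com-

Let Q and R be two Boolean expressions, and $T_Q$ and $T_R$ be their sets of descriptors. We denote $\hat{Q}$ and $\hat{R}$ as the CDNFs of Q and R, and express them as

\[ \hat{Q} = \bigvee_{i=1}^{m} \Bigl( \bigwedge_{u=1}^{x_i} \hat{q}_{i,u} \Bigr), \]
\[ \hat{R} = \bigvee_{j=1}^{n} \Bigl( \bigwedge_{v=1}^{y_j} \hat{r}_{j,v} \Bigr), \]

where each conjunction ($\bigwedge_{u=1}^{x_i} \hat{q}_{i,u}$ and $\bigwedge_{v=1}^{y_j} \hat{r}_{j,v}$) is a com-

scriptors from $T_{R_A}$ and $T_{R_B}$. In other words, the descriptors in $\hat{Q}_2$ are independent of those in other Boolean expressions, such as RA and RB.

We denote our similarity measure S⊕ and define the similarity of two Boolean expressions as the summation of the individual similarity measures (s⊕) between each compact atomic descriptor. The individual similarity measure s⊕ is defined as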
EXAMPLE 7. Using the above definitions, we compute S⊕ for Q2, RA, and RB of Example 6. We find

\[ \hat{Q}_2 = \underbrace{(t_1 \wedge t_3)}_{\hat{Q}_2^{1}} \vee \underbrace{(t_2 \wedge t_3)}_{\hat{Q}_2^{2}}, \qquad T_{Q_2}^{1} = \{t_1, t_3\}, \qquad T_{Q_2}^{2} = \{t_2, t_3\}, \]
\[ \hat{R}_A = \underbrace{(t_1 \wedge t_3)}_{\hat{R}_A^{1}}, \qquad T_{R_A}^{1} = \{t_1, t_3\}, \]

\[ s^{\oplus}(\hat{Q}_2^{2}, \hat{R}_B^{1}) = \frac{1}{2^{|T_{R_B}^{1} - T_{Q_2}^{2}|} + 2^{|T_{Q_2}^{2} - T_{R_B}^{1}|} - 1} = \frac{1}{2^{1} + 2^{1} - 1} = 0.333, \]

which leads to

\[ S^{\oplus}(Q_2, R_B) = s^{\oplus}(\hat{Q}_2^{1}, \hat{R}_B^{1}) + s^{\oplus}(\hat{Q}_2^{2}, \hat{R}_B^{1}) = 0.333. \]

Therefore, RA is more relevant to query Q2 than RB.

Notice that the similarity values calculated using S⊕ (in Example 7) are different from those calculated using S* (in Example 5). It is meaningless to compare these values directly because both are measured on a relative scale. How-

2) Calculate S based on the number of hit documents on each server.
3) Calculate S* and S⊕ for each filter-query pair.
4) Rank servers based on S, S*, and S⊕.
5) Compare their rankings using Spearman rank-order correlation coefficient (rs) [15].
6) Compare rs(S*, S) and rs(S⊕, S) using confidence inter-

the average of the ranks that would have been assigned had no ties happened.

Let a1, ..., an and b1, ..., bn be two rankings for Qi generated by various similarity measures, where n is the number of elements in the ranking (N in our case). The tied ranks in each ranking form a group. Assume there are gu different groups in a1, ..., an, each group has uk (1 ≤ k ≤ gu) tied elements. Similarly, ranking b1, ..., bn has gv groups, each has vk (1 ≤ k ≤ gv) tied elements. The rs coefficient can be obtained by [15]:

\[ r_s = \frac{\frac{1}{6}(n^3 - n) - \sum_{k=1}^{n} (a_k - b_k)^2 - U - V}{\sqrt{\bigl[\frac{1}{6}(n^3 - n) - 2U\bigr] \bigl[\frac{1}{6}(n^3 - n) - 2V\bigr]}}, \]
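The tie-corrected coefficient can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: it assumes the standard tie-correction terms U = (1/12) Σ (u_k³ − u_k) and V = (1/12) Σ (v_k³ − v_k), which this excerpt does not spell out, and the function names `midranks`, `tie_term`, and `spearman_tied` are hypothetical.

```python
from collections import Counter
from math import sqrt

def midranks(scores):
    """Rank scores in descending order; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = average
        pos = end + 1
    return ranks

def tie_term(ranks):
    """Assumed correction term: (1/12) * sum(u**3 - u) over tied groups."""
    return sum(u ** 3 - u for u in Counter(ranks).values()) / 12

def spearman_tied(a, b):
    """Spearman rank-order correlation of two rankings, corrected for ties."""
    n = len(a)
    base = (n ** 3 - n) / 6
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    u, v = tie_term(a), tie_term(b)
    return (base - d2 - u - v) / sqrt((base - 2 * u) * (base - 2 * v))

a = midranks([0.9, 0.5, 0.5, 0.1])   # [1.0, 2.5, 2.5, 4.0]
b = midranks([0.8, 0.6, 0.4, 0.2])   # [1.0, 2.0, 3.0, 4.0]
print(round(spearman_tied(a, b), 4))  # 0.9487
```

Under the assumed correction terms, the result equals the Pearson correlation of the two mid-rank vectors, which gives an independent way to check an implementation.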
Fig. 2. The difference between rs(S⊕, S) and rs(S*, S) for 35 Boolean queries on the CISI database. Points above zero indicate that S⊕ generates a ranking closer to that of S than S* does for the associated query.

number of samples (N in our case), and n1 is the number of times S⊕ is superior to S* (i.e., rs(S⊕, S) > rs(S*, S)). By definition [16], if np ≥ 10 and the confidence interval does not include 0.5, we can say, with 95 percent confidence, that S⊕ is superior to S*.

4.1 CISI Experiment

The CISI dataset consists of 1,460 information science documents and 35 Boolean queries. All documents are indexed with terms that occur in the title and abstract but are not on a stop list of 429 common words. All indexed terms are stored in their original forms without stemming. A Boolean query is a nested structure of terms with logical and, or, and not operators in between. Documents are hit by a query if they satisfy all the conditions in the query.

Following the six steps described previously, we calculate rs(S⊕, S) and rs(S*, S) for the 35 queries. Fig. 2 shows the value of rs(S⊕, S) minus rs(S*, S) for each query. Among them, rs(S⊕, S) is greater than rs(S*, S) 24 times (the points above zero) and less than rs(S*, S) 11 times (the points below zero). This indicates that S⊕ generates a ranking closer to that of S for 24 of the 35 queries, whereas S* has the closer order for only 11 of the 35. The mean Spearman coefficients of rs(S⊕, S) and rs(S*, S) are 0.331 and 0.275, respectively. This shows that S⊕ has a better average estimation than S* on the CISI database.

\[ \text{95\% confidence interval for proportion} = p \mp 1.960 \sqrt{\frac{p(1 - p)}{n}} = (0.532, 0.840). \]

The confidence interval does not include 0.5. Therefore, we can say, with 95 percent confidence, that S⊕ is superior to S* in the CISI experiment.

4.2 USC Homer Experiment

In this experiment, we manually create 32 Boolean query samples, each averaging 3.6 descriptors picked from 24 terms in diverse fields. We submit these queries to the USC Homer database and compute the results. Fig. 3 shows the values of rs(S⊕, S) minus rs(S*, S) for the 32 queries.

In Fig. 3, rs(S⊕, S) is greater than rs(S*, S) 22 times (the points above zero) and less than rs(S*, S) 10 times (the points below zero). This indicates that S⊕ generates a ranking closer to that of S for 22 of the 32 queries, whereas S* has the closer order for only 10 of the 32. The mean Spearman coefficients of rs(S⊕, S) and rs(S*, S) are 0.595 and 0.494, respectively. This shows that S⊕ has a better average estimation than S* on the USC Homer database.

From the results of rs(S⊕, S) and rs(S*, S), the sample proportion of rs(S⊕, S) > rs(S*, S) is

\[ p = \frac{22}{32} = 0.688. \]

Because np ≥ 10 (n = 32), we can calculate the confidence interval for the proportion for S⊕:
Fig. 3. The difference between rs(S⊕, S) and rs(S*, S) for 32 Boolean queries on the USC Homer database. Points above zero indicate that S⊕ generates a ranking closer to that of S than S* does for the associated query.
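The confidence-interval test used in both experiments is short enough to sketch directly. The function name `proportion_ci` is an illustrative assumption; the formula is the normal-approximation interval p ∓ 1.960·sqrt(p(1 − p)/n) quoted in the text, applied here to the CISI counts (24 wins out of 35 queries).

```python
from math import sqrt

def proportion_ci(wins, n, z=1.960):
    """Normal-approximation confidence interval for a sample proportion."""
    p = wins / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = proportion_ci(24, 35)   # CISI: 24 of 35 queries
print(f"({low:.3f}, {high:.3f})")   # (0.532, 0.840)
print(low > 0.5)                    # True: the interval excludes 0.5
```

The same call with the USC Homer counts, `proportion_ci(22, 32)`, gives the interval for the second experiment.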
\[
\begin{aligned}
Q &= (t_1 \vee t_2) \wedge t_3 \\
  &= (t_1 \wedge t_3) \vee (t_2 \wedge t_3) && (\rightarrow \hat{Q}) && (5) \\
  &= (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) && (\rightarrow t_1 \wedge t_3) \\
  &\quad \vee (t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) && (\rightarrow t_2 \wedge t_3) && (6) \\
  &= (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \\
  &= \tilde{Q}
\end{aligned}
\]

Case 2: We expand Q first, then distribute it.

\[
\begin{aligned}
Q &= (t_1 \vee t_2) \wedge t_3 && (7) \\
  &= \bigl\{ [(t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge t_2 \wedge \neg t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge \neg t_3)] \\
  &\quad \vee [(t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge t_2 \wedge \neg t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge \neg t_3)] \bigr\} \\
  &\quad \wedge [(t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge \neg t_2 \wedge t_3)] && (8) \\
  &= [(t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge t_2 \wedge \neg t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge \neg t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge \neg t_3)] \\
  &\quad \wedge [(t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge \neg t_2 \wedge t_3)] \\
  &= (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \\
  &= \tilde{Q}
\end{aligned}
\]

In Case 1, the expansion is performed after the distribution. Therefore each compact atomic descriptor is expanded (from (5) to (6)) instead of each descriptor (from (7) to (8)), as in Case 2. A compact atomic descriptor usually contains more than one descriptor after applying the distributive law to its original Boolean expression. In the above example, each of the two compact atomic descriptors in $\hat{Q}$ (5), $(t_1 \wedge t_3)$ and $(t_2 \wedge t_3)$, contains two descriptors. Eight additional descriptors are added from (5) to (6) after the expansion. On the other hand, each individual descriptor in the original Boolean expression is expanded in Case 2. Thirty-three additional descriptors are added from (7) to (8). The second approach needs more space than the first one for storing those intermediate descriptors, which consequently causes it to spend more time checking the duplicates before obtaining the final $\tilde{Q}$.

In our example, the original Boolean expression contains only three descriptors. It is the simplest transformation case. For more complicated Boolean expressions, the difference between Case 1 and Case 2 is bigger. Therefore we use the first approach (i.e., Boolean expression → CDNF → RDNF) in our complexity analysis. Based on this, the time complexities of S⊕ and S* are equal to the transformation time from Boolean expression to CDNF or RDNF plus the time to compute the similarity measures. For a single Boolean expression,

\[ \mathrm{Time}_{S^{\oplus}} = \mathrm{Time}_{S^{\oplus}}(\text{transformation}) + \mathrm{Time}_{S^{\oplus}}(\text{computation}), \]
\[ \mathrm{Time}_{S^*} = \mathrm{Time}_{S^*}(\text{transformation}) + \mathrm{Time}_{S^*}(\text{computation}), \]

where

\[ \mathrm{Time}_{S^{\oplus}}(\text{transformation}) = \mathrm{Time}(\text{Boolean expression} \rightarrow \text{CDNF}), \]
\[ \mathrm{Time}_{S^*}(\text{transformation}) = \mathrm{Time}(\text{Boolean expression} \rightarrow \text{CDNF}) + \mathrm{Time}(\text{CDNF} \rightarrow \text{RDNF}). \]

5.1 From Boolean Expression To CDNF

To simplify the analysis, we use binary trees [17] to represent the Boolean expressions. Each external node or leaf represents a descriptor. All the internal nodes, including the root, are logical operators. The negation not can be stored with the associated descriptor, therefore we do not denote it separately. The height of a tree is the longest path from any leaf to the root.

The binary trees are transformed to their equivalent CDNF binary trees using the distributive law. The technique is to transform an and-rooted subtree to an equivalent or-rooted subtree, one at a time, in a top-down approach. An example is shown in Fig. 4, where A, B, and C are the subtrees of associated nodes.

Fig. 4. Compact disjunctive normalization. We use the distributive law $(A \vee B) \wedge C = (A \wedge C) \vee (B \wedge C)$ on the subtrees A, B, and C.

We first change the current root node from and to or, and change its or-rooted child node to be and-rooted. Then, we demote the other child (C) by one level, and add one and node at its original position to be its new parent. Finally, we replicate the demoted child (C) and exchange it with one of the children (B) on the other subtree. The same procedure is repeated until reaching the leaves.

The space complexity of transforming the Boolean expression to a CDNF varies from O(n) to O(n²), depending on how the Boolean expression is constructed, where n is
the sum of the total number of descriptors and logical operators in the Boolean expression. For example, a linear binary tree (Fig. 5a) generates an O(n) CDNF, while a complete binary tree (Fig. 5b) generates an O(n²) CDNF. Notice that n is equal to the total number of nodes if the Boolean expression is represented as a binary tree. Intuitively, the O(n²) CDNF can be derived by starting with a tree of height h. It will have $O(n) = O(2^h)$ nodes for a complete binary tree. The distributive law no more than doubles the height. Thus, the new tree is bounded by size $O(2^{2h}) = O((2^h)^2) = O(n^2)$.

Fig. 5. Various binary trees: (a) a linear binary tree, and (b) a complete binary tree, where and and or are logical operators, t1, ..., tp are descriptors.

duplicated from Fig. 6b to Fig. 6c. The time complexity T(n) consists of the time for adding additional and nodes and for duplicating subtrees. Since there is no need for distribution for n ≤ 3, T(n) = 0 in these cases. Otherwise,

\[ T(n) = 1 + \frac{n - 1}{2} + 2 \left[ 1 + \frac{n - 3}{4} + 2\,T\!\left( \frac{n - 1}{2} \right) \right]. \]

Therefore,

\[ T(n) = \begin{cases} 0 & \text{if } n \le 3, \\ n + 1 + 4\,T\!\left( \frac{n - 1}{2} \right) & \text{otherwise.} \end{cases} \]

Let h be the height of the original binary tree, then $n = 2^h - 1$, where $h \ge 1$. We can derive

\[
\begin{aligned}
T(n) &= T(2^h - 1) \\
     &= 2^h + 4\,T(2^{h-1} - 1) \\
     &= 2^h + 4 \cdot 2^{h-1} + 4^2\,T(2^{h-2} - 1) \\
     &= 2^h + 4 \cdot 2^{h-1} + 4^2 \cdot 2^{h-2} + \cdots + 4^{h-3} \cdot 2^{3} + 4^{h-2}\,T(2^{h-(h-2)} - 1) \\
     &= 2^h + 2 \cdot 2^h + 2^2 \cdot 2^h + \cdots + 2^{h-3} \cdot 2^h \\
     &= 2^h \left( 2^{h-2} - 1 \right) \\
     &= (n + 1) \left[ \frac{n + 1}{4} - 1 \right] \\
     &= O(n^2).
\end{aligned}
\]

\[ M(n) = 1 + 2 \left[ 1 + 2\,M\!\left( \frac{n - 1}{2} \right) \right]. \]

Therefore,

\[ M(n) = \begin{cases} n & \text{if } n \le 3, \\ 3 + 4\,M\!\left( \frac{n - 1}{2} \right) & \text{otherwise,} \end{cases} \]

Fig. 6. The CDNF transformation of an n-node complete binary tree. Originally, only the root is the and operator, the other internal nodes are all or operators. A, B, C1, and C2 are subtrees. The number in the brackets means the number of nodes in this subtree. (a) The original binary tree. (b) The binary tree after distribution on the first level node (i.e., the root), where the subtree C is duplicated. (c) The binary tree after distributions on the second level nodes, where the subtrees A and B are duplicated.
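The closed form of the time recurrence can be checked numerically. The sketch below, with the hypothetical names `t_recursive` and `t_closed`, evaluates T(n) = n + 1 + 4T((n − 1)/2) with T(n) = 0 for n ≤ 3 and compares it against (n + 1)[(n + 1)/4 − 1] for complete binary trees with n = 2^h − 1 nodes. The comparison starts at h = 2 because the closed form is derived for trees that need at least the base case n = 3.

```python
def t_recursive(n):
    """Distribution time for an n-node complete binary tree (recurrence)."""
    if n <= 3:
        return 0
    return n + 1 + 4 * t_recursive((n - 1) // 2)

def t_closed(n):
    """Closed form (n + 1) * ((n + 1) / 4 - 1), valid for n = 2**h - 1, h >= 2."""
    return (n + 1) * ((n + 1) // 4 - 1)

for h in range(2, 11):
    n = 2 ** h - 1
    assert t_recursive(n) == t_closed(n)

print(t_recursive(7), t_recursive(15))  # 8 48
```

Both values grow as n², consistent with the O(n²) bound.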
into four groups (A, B, C1, C2) of equal size, each group
where A, B, C1, and C2 are subtrees, and (9), (10), and (11)
having n+1
8
descriptors. Let k = n8+1 , then represen Figs. 6a, 6b, and 6c, respectively. Equation (12)
represents the resulting CDNF Q$ , which consists of n+1
2
c
A = t1 L tk , h d i
4
atomic descriptors of Q$ 1 and Q$ 2 , and r1 and r2 the number (1 i c1) atomic descriptor of Q$ 1 contains ki descriptors
~ ~
of reduced atomic descriptors of Q1 and Q2 . We observe and the jth (1 j c ) atomic descriptor of Q$ contains h de-
2 2 j
that the resulting RDNFs of Boolean expressions Q1 and Q2 scriptors, where ki p1, hj p2, and max(p1, p2) p p1 + p2.
contain the following characteristics: To speed up the computation time, all the descriptors
each reduced atomic descriptor contains p descriptors, within the compact atomic descriptors or reduced atomic
descriptors are sorted before calculating their similarities.
max (p1, p2) p p1 + p2,
k log k to sort Q$ , and
c1 c2
r1 = min (c1 2 , 2 ),
p 2 p Therefore it takes i = 1 i h log h i 1 j =1 j j
r2 = min (c2 2 , 2 ),
p 2 p to sort Q$ 2 . To compare these two CDNFs term-by-term, it
~ c1 c2
Q1 has (r1 p) descriptors, takes i =1 j =1 max(ki , hj ) time. Hence,
~
Q2 has (r2 p) descriptors.
~ ~
TimeS computation c h
In Q1 and Q2 , each compact atomic descriptor containing c1 c2 c1 c2
F F n + 1I 2
p-2
I ~ ~ r r
= minG G 1
,2 p JJK i =1 j =1 p time to compare them.
r1
GH H 4 JK 2 sort Q1 and Q2 , and
1 2
Thus,
=S
R|cn + 1h 2
1
2 p-6
if n1 < 7 , TimeS* computationc h
|T2 p
otherwise, r1 r2 r1 r2
2 , p = p log p + p log p + p
i =1 j =1 i =1 j =1
a f c
Space RDNF = r1 p + r1 p - 1 h c h c h
= r1 + r2 p log p + r1r2 p.
2 p +1 p - 1
Because RDNF is obtained by expanding its CDNF, we
p
=O2 p,e j are certain that c1 r1 and c2 r2. Therefore,
~
where (r1 p) is the number of descriptors in Q1 and (r1 p c
TimeS computation h
~
1) is the number of logical operators in Q1 . is always less than or equal to
The time for transforming CDNF to RDNF consists of c
TimeS* computation . h
1) expanding each compact descriptor, and
If Q1 and Q2 are both n-node binary trees as described
2) checking and removing duplicate reduced atomic
above, then ki = 2 (1 i c1) and hj = 2 (1 j c2). Thus,
descriptors.
Because 2 can be done as 1 is being executed, it is omitted in
c
TimeS computation h
our analysis. Thus, the time complexity for an n-node bi-
nary tree is =G
F n + 1IJ 2 log 2 + FG n + 1IJ
2 2
2 log 2 +
H 4 K H 4 K
f FGH n 4+ 1IJK 2
2
a
Time CDNF fi RDNF = p-2
p FG n + 1IJ FG n + 1IJ 2
2 2
p 2
H 4 KH 4 K
= Oe 2 pn j. 4
= Oen j,
ing the notation given above, we further define that the ith
=O2 p. e j
874 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 6, NOVEMBER/DECEMBER 1997
5.4 Remarks

Below, we summarize the previous time and space analysis.

\[
\begin{aligned}
\mathrm{Time}_{S^{\oplus}} &= \mathrm{Time}_{S^{\oplus}}(\text{transformation}) + \mathrm{Time}_{S^{\oplus}}(\text{computation}) \\
 &= \mathrm{Time}(\text{Boolean expression} \rightarrow \text{CDNF}) + \mathrm{Time}_{S^{\oplus}}(\text{computation}) \\
 &= O(n^2) + O(n^4), \\
\mathrm{Time}_{S^*} &= \mathrm{Time}_{S^*}(\text{transformation}) + \mathrm{Time}_{S^*}(\text{computation}) \\
 &= \mathrm{Time}(\text{Boolean expression} \rightarrow \text{CDNF}) + \mathrm{Time}(\text{CDNF} \rightarrow \text{RDNF}) + \mathrm{Time}_{S^*}(\text{computation}) \\
 &= O(n^2) + O(2^{p} p\, n^2) + O(2^{2p} p), \\
\mathrm{Space}_{S^{\oplus}} &= \mathrm{Space}(\text{CDNF}) = O(n^2), \\
\mathrm{Space}_{S^*} &= \mathrm{Space}(\text{CDNF}) + \mathrm{Space}(\text{RDNF}) = O(n^2) + O(2^{p} p).
\end{aligned}
\]

The above comparisons are analyzed based on a pair of Boolean expressions only. For (N + 1) Boolean expressions, consisting of one incoming query and N server descriptions, their time and space complexities are in proportion to N [18]. As discussed previously, an n-node binary tree consists of (n + 1)/2 leaves (or descriptors in the Boolean expression). The number of distinct descriptors p must be no larger than (n + 1)/2, i.e., p ≤ (n + 1)/2 = O(n). Thus, the complexities of the two measures S⊕ and S* can be simplified as shown in Table 2.

TABLE 2
TIME AND SPACE COMPLEXITIES OF S⊕ AND S* FOR ONE USER QUERY AGAINST N SERVER DESCRIPTIONS

                    S⊕            S*
time complexity     O(N n^4)      O(N 2^(2n) n)
space complexity    O(N n^2)      O(N 2^n n)

Both the user query and the server descriptions are n-node binary trees.

S⊕ outperforms S* in both time and space complexities. The above analysis shows that our similarity measure based on CDNFs consumes up to exponentially less time and space than Radecki's method. The following example further illustrates the performance difference between the two measures.

EXAMPLE 8. Consider a directory of services containing 100 server descriptions, each consisting of five descriptors. The time and space used to calculate the similarities S* and S⊕ for a five-descriptor user query are:

\[
\begin{aligned}
\mathrm{Time}_{S^*}(\text{1 query, 100 server descriptions}) &= 100 \cdot 2^{2 \cdot 5} \cdot 5 = 512{,}000, \\
\mathrm{Time}_{S^{\oplus}}(\text{1 query, 100 server descriptions}) &= 100 \cdot 5^{4} = 62{,}500, \\
\mathrm{Space}_{S^*}(\text{1 query, 100 server descriptions}) &= 100 \cdot 2^{5} \cdot 5 = 16{,}000, \\
\mathrm{Space}_{S^{\oplus}}(\text{1 query, 100 server descriptions}) &= 100 \cdot 5^{2} = 2{,}500.
\end{aligned}
\]

When using S⊕, the directory of services is eight times faster in searching the relevant servers, and takes only one-sixth the space of S*.

6 IMPLEMENTATION

In the client-directory-server model, the directory of services ranks the servers by comparing their descriptions with the query. Both the query and the server descriptions need to be normalized before the comparison. In our method, the normalization of a Boolean expression is independent of other Boolean expressions in the comparison. Therefore, we can prenormalize the server descriptions and store them in the directory of services. Below, we describe the implementation of our Boolean similarity measure.

We use the UNIX tools flex and bison to parse the nested Boolean expressions and build the associated binary parse trees. Each attribute-value pair in the user query and the server description is represented as a three-element subtree in the binary parse tree. The three-element subtree consists of one parent node and two child nodes. The left and right child nodes, i.e., the leaves, are the attribute name and its value, respectively. The leaves are joined by the parent node, which is a relational operator (such as =). These subtrees are merged by the logical operators (and and or) to form the binary parse tree.

The binary parse trees are transformed to their equivalent CDNF binary trees based on the distributive law. Notice that, while replicating the subtree (such as C in Fig. 4), we only copy the logical operator nodes in order to save space. For relational operator nodes, only their associated pointers are copied. All the nodes in the binary tree whose parents are or are linked together after the distributive normalization. Consider the following example.

EXAMPLE 9. Let Q1 be an incoming user query and RA, RB, RC be three server descriptions stored as CDNFs ($\hat{R}_A$, $\hat{R}_B$, $\hat{R}_C$) in the directory of services,

Q1: ((keyword = network) or (keyword = UNIX)) and (author = Smith),
$\hat{R}_A$: (keyword = network),
$\hat{R}_B$: (keyword = database) or (keyword = computer),
$\hat{R}_C$: ((keyword = UNIX) and (author = Smith)) or ((keyword = database) and (author = McLeod)).
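The normalization and comparison steps can be sketched compactly. This is a minimal illustration under stated assumptions, not the flex/bison implementation described above: expressions are nested tuples rather than parse trees, negation is not handled, the distribution is done by bottom-up recursion instead of the top-down tree rewriting of Fig. 4, and the individual measure is assumed to take the form used in Example 7, with a value of zero when two compact atomic descriptors share no descriptor (inferred from the total of 0.333 in that example). The names `cdnf`, `s_plus`, and `similarity` are hypothetical.

```python
from itertools import product

# An expression is a descriptor string, ("and", x, y), or ("or", x, y).

def cdnf(expr):
    """Return the CDNF as a list of compact atomic descriptors (frozensets)."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, left, right = expr
    l, r = cdnf(left), cdnf(right)
    if op == "or":
        merged = l + r
    else:  # "and": distribute over the disjunctions on both sides
        merged = [a | b for a, b in product(l, r)]
    unique = []
    for conj in merged:  # drop duplicate conjunctions, keep order
        if conj not in unique:
            unique.append(conj)
    return unique

def s_plus(tq, tr):
    """Individual similarity of two compact atomic descriptors (assumed form)."""
    if not tq & tr:
        return 0.0  # assumption: no common descriptor means no similarity
    return 1 / (2 ** len(tr - tq) + 2 ** len(tq - tr) - 1)

def similarity(query, description):
    """Sum the individual similarities over all pairs of conjunctions."""
    return sum(s_plus(q, r) for q in cdnf(query) for r in cdnf(description))

q2 = ("and", ("or", "t1", "t2"), "t3")
ra = ("and", "t1", "t3")
rb = ("and", "t2", "t4")
print(round(similarity(q2, rb), 3))               # 0.333
print(similarity(q2, ra) > similarity(q2, rb))    # True
```

Because `cdnf` depends only on its own argument, server descriptions can be normalized once and cached, which mirrors the prenormalization described in this section.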
Fig. 8. Normalized user query $\hat{Q}_1$. The head links all the nodes whose parents are or. The dashed subtree is a replicated subtree. The $\hat{Q}_1^{i}$ is one of the compact atomic descriptors in $\hat{Q}_1$.
Fig. 9. Normalized server description $\hat{R}_C$. The head links all the nodes whose parents are or. The $\hat{R}_C^{j}$ is one of the compact atomic descriptors in $\hat{R}_C$.
The Q1 is normalized to Q$ 1 before comparison, link generated in each normalized binary tree is pointed at
by KHDG.
Q$ 1 = cakeyword = networkf and bauthor = Smithgh After the normalization process, we compare each com-
or cakeyword = UNIX f and bauthor = Smithgh. ponent in the links of the two binary trees. Each element in
the server description link represents a compact atomic de-
The similarity values between the user query and the scriptor R$ Cj in the server description R$ C . Each element in
three server descriptions are the user query link represents a compact atomic descriptor
Q$ 1i in the user query Q$ 1 . To calculate s (Q$ 1i , R$ Cj ) , we com-
2 1
c
S Q1 , RA =h s eQ$ , R$ j = 41 ,
i
1
j
A
pare all the nodes under Q$ i with all the nodes under R$ j
i =1 j =1 1 C
2 2 and find out the number of uncommon nodes between
c
S Q1 , RB =h s eQ$ , R$ j = 0,
i
1
j
B them. Then we sum up all the s (Q$ i , R$ j ) to get S(Q1, RC).
1 C
i =1 j =1
2 2 Similarly, we calculate S(Q1, RA) and S(Q1, RB), sort RA,
c
S Q1 , RC =h s eQ$ , R$ j = 31 .
i
1
j
C RB, and RC in descending order of their similarity values
i =1 j =1
with Q1, and return the result to the user.
Figs. 7 and 8 show the binary parse tree of the user query
Q1 in Example 9 before and after normalization. Fig. 9
shows the server description RC after normalization. The
7 CONCLUSIONS

We have developed a new method using compact disjunctive normal form (CDNF) to rank the similarity between Boolean expressions. We compared our method with Radecki's measure on two databases and used the Spearman rank coefficients and the confidence intervals to show that our method can get a closer ranking order to that generated by Jaccard's coefficient. The theoretical analysis proves that this new measure outperforms the one proposed by Radecki significantly in terms of time and space complexity. These results demonstrate that our similarity measure can greatly improve the searching process in today's world of overwhelming information.

In addition to ranking results, similarity estimates can be used to help identify similar, but autonomously managed, retrieval systems. For example, the similarity measure can be used to cluster servers with similar descriptions in a single directory entry. When the similarity measures of two servers exceed a certain value, they can be merged to remove redundancy. Moreover, the administrator can create new servers by using the most frequently asked queries as the filter and select other relevant servers as its information sources. Thus, most user queries can be satisfied by a small number of servers, which reduces search time. For people using Boolean expressions to represent their interests, such as collaborative filtering [19] or user profile [20], [21], the similarity measure can help find other individuals having common interests, so that they may share their collections. Our method can also benefit systems that support automatic query formulations by relevance-feedback [22], [7], [23], where the reformed queries could be in complex Boolean forms.

ACKNOWLEDGMENTS

This work was supported in part by the Advanced Research Projects Agency under contract number DABT63-93-C-0052, HBP NIH grant 1-P20-MH/DA52194-01A1, National Science Foundation Institutional Infrastructure grant number CDA-9216321, and NSF NYI grant number NCR-9457518.

REFERENCES

[1] S.-H. Li and P.B. Danzig, "Vocabulary Problem in Internet Resource Discovery," Proc. Second Int'l Workshop Next Generation Information Technologies and Systems, pp. 139-145, Naharia, Israel, June 1995. Available from ftp://catarina.usc.edu/shli/ngits.ps.gz.
[2] D.R. Hardy and M.F. Schwartz, "Essence: A Resource Discovery System Based on Semantic File Indexing," Proc. Winter 1993 Usenix Conf., pp. 361-374, Jan. 1993.
[3] B. Kahle and A. Medlar, "An Information System for Corporate Users: Wide Area Information Servers," ConneXions: The Interoperability Report, vol. 5, no. 11, pp. 2-9, 1991.
[4] P.B. Danzig, S.-H. Li, and K. Obraczka, "Distributed Indexing of Autonomous Internet Services," Computing Systems, vol. 5, no. 4, pp. 433-459, 1992.
[5] P.G. Anick, J.D. Brennan, R.A. Flynn, D.R. Hanssen, B. Alvey, and J.M. Robbins, "A Direct Manipulation Interface for Boolean Information Retrieval via Natural Language Query," Proc. 13th Ann. Int'l ACM SIGIR Conf., pp. 135-150, Brussels, Sept. 1990.
[6] D. Young and B. Shneiderman, "A Graphical Filter/Flow Representation of Boolean Queries: A Prototype Implementation and Evaluation," J. Am. Soc. Information Science, vol. 44, no. 6, pp. 327-339, July 1993.
[7] V.I. Frants and J. Shapiro, "Algorithms for Automatic Construction of Query Formulations in Boolean Form," J. Am. Soc. Information Science, vol. 42, no. 1, pp. 16-26, Jan. 1991.
[8] K. Obraczka, P.B. Danzig, and S.-H. Li, "Internet Resource Discovery Services," Computer, vol. 26, no. 9, pp. 8-22, Sept. 1993.
[9] A. Emtage and P. Deutsch, "Archie: An Electronic Directory Service for the Internet," Proc. Winter 1992 Usenix Conf., pp. 93-110, 1992.
[10] M.A. Sheldon, A. Duda, R. Weiss, J.W. O'Toole Jr., and D.K. Gifford, "A Content Routing System for Distributed Information Servers," Proc. Fourth Int'l Conf. Extending Database Technology, Cambridge, England, Mar. 1994.
[11] L. Gravano, H. Garcia-Molina, and A. Tomasic, "The Efficacy of GlOSS for the Text Database Discovery Problem," Technical Report STAN-CS-TN-93-2, Stanford Univ., 1993.
[12] C.J. van Rijsbergen, Information Retrieval, second ed. London: Butterworth & Co. (Publishers) Ltd., 1979.
[13] T. Radecki, "A Model of a Document-Clustering-Based Information Retrieval System with a Boolean Search Request Formulation," Information Retrieval Research, R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, and P.W. Williams, eds., pp. 334-344. London: Butterworth & Co. (Publishers) Ltd., 1981.
[14] T. Radecki, "Similarity Measures for Boolean Search Request Formulations," J. Am. Soc. Information Science, vol. 33, no. 1, pp. 8-17, 1982.
[15] M. Kendall and J.D. Gibbons, Rank Correlation Methods, fifth ed. London: Edward Arnold, 1990.
[16] R. Jain, The Art of Computer Systems Performance Analysis. New York: John Wiley & Sons, 1991.
[17] R. Sedgewick, Algorithms, second ed. Reading, Mass.: Addison-Wesley, 1988.
[18] S.-H. Li and P.B. Danzig, "Boolean Similarity Measures for Resource Discovery," Technical Report USC-CS-94-579, Univ. of Southern California, 1994.
[19] D. Goldberg, D. Nichols, B.M. Oki, and D. Terry, "Using Collaborative Filtering to Weave an Information Tapestry," Comm. ACM, vol. 35, no. 12, pp. 61-70, Dec. 1992.
[20] C. Danilowicz, "Modeling of User Preferences and Needs in Boolean Retrieval Systems," Information Processing & Management, vol. 30, no. 3, pp. 363-378, 1994.
[21] T.W. Yan and H. Garcia-Molina, "Index Structures for Selective Dissemination of Information Under the Boolean Model," ACM Trans. Database Systems, vol. 19, no. 2, pp. 332-364, June 1994.
[22] M. Dillon and J. Desper, "The Use of Automatic Relevance Feedback in Boolean Retrieval Systems," J. Documentation, vol. 36, no. 3, pp. 197-208, Sept. 1980.
[23] G. Salton, E.A. Fox, and E.M. Voorhees, "Advanced Feedback Methods in Information Retrieval," J. Am. Soc. Information Science, vol. 36, no. 3, pp. 200-210, May 1985.

Shih-Hao Li received his BS in communication engineering from the National Chiao-Tung University, Hsinchu, Taiwan, in 1985, his MS in computer engineering in 1991, and his PhD in computer science in 1996, both from the University of Southern California. His research interests include Internet resource discovery, information retrieval, and distributed computing. He is currently a senior software engineer at Infoseek Corporation. Dr. Li is a member of the ACM and the IEEE Computer Society.

Peter B. Danzig received his BS in applied physics from the University of California, Davis, in 1982 and his PhD in computer science from the University of California, Berkeley, in 1989. He is currently on leave from the University of Southern California, where he is an associate professor and is chief Internet architect at Network Appliance. His research addresses both building scalable Internet information systems and flow, congestion, and admission control algorithms for the Internet. He has served on several ACM SIGCOMM and ACM SIGMETRICS program committees. He is a member of the IEEE and the ACM.