
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 6, NOVEMBER/DECEMBER 1997 863

Boolean Similarity Measures for Resource Discovery
Shih-Hao Li, Member, IEEE Computer Society, and Peter B. Danzig, Member, IEEE

Abstract: As the number of Internet servers increases rapidly, it becomes difficult to determine the relevant servers when
searching for information. We develop a new method to rank Internet servers for Boolean queries. Our method reduces time and
space complexity from exponential to polynomial in the number of Boolean terms. We contrast it with other known methods and
describe its implementation.

Index Terms: Boolean query, information retrieval, ranking, resource discovery, similarity measure.

The authors are with the Computer Science Department, University of Southern California, Los Angeles, CA 90089. E-mail: {shli, danzig}@usc.edu.
Manuscript received 29 July 1994; revised 29 Aug. 1995.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 104424.
1041-4347/97/$10.00 © 1997 IEEE

1 INTRODUCTION

Searching for information in the Internet is a considerable task. Thousands of servers provide different information over the networks. Determining appropriate servers for searching is a common problem. Novice users have no idea where to send requests, and experienced users may miss new servers having relevant information.

A user query can be described using natural language, keywords, or a database query language. We assume each query is transformed to a standard format, such as a Boolean expression, by an associated query engine. Because each user requests different information, it is inappropriate to broadcast requests to all servers. That overwhelms the underlying networks and overloads irrelevant servers.

To solve this problem, we propose the client-directory-server model [1]. Our goal is to give users a list of relevant servers ranked according to their relevance to the query. In this model, a directory of services records a description of each information server, called a server description. A user sends his query to the directory of services, which determines and ranks the servers relevant to the user's request. The user employs the rankings when selecting the servers to query directly. Fig. 1 shows the details.

A server description can be automatically generated by clustering algorithms [1], by information extraction tools [2], or can be manually assigned by administrators [3], [4]. In either case, it can represent a summary of the underlying database contents or function as a filter to collect information satisfying certain conditions. In this research, we focus on Boolean environments, where both user queries and server descriptions are written in Boolean expressions. We believe that Boolean expressions can precisely describe a server's contents as well as a user's information need. Using the above methods [1], [2], server descriptions can easily be formulated as Boolean expressions. For user queries, existing tools or algorithms can help users generate or reconstruct complicated Boolean queries to express their information needs [5], [6], [7].

EXAMPLE 1. Consider the following Boolean expression,

((keyword = network) or (keyword = UNIX)) and (author = Smith),

where keyword and author are predefined attribute names, and network, UNIX, and Smith are their corresponding values. In the discussion, we would represent this expression as $(t_1 \vee t_2) \wedge t_3$, where $t_i$ ($1 \le i \le 3$) are called descriptors, $\vee$ is the logical or operator, and $\wedge$ is the logical and operator.

In this paper, we develop an efficient algorithm to rank servers based on their similarities with respect to a query. We describe two existing similarity measures for Boolean expressions, introduce our new measure, and experimentally contrast it with the well-known Jaccard's coefficient. We review related work on Internet resource discovery in Section 2. Section 3 describes existing and our new Boolean similarity measures. We show experimental results of both measures in Section 4 and analyze their time and space complexity in Section 5. Section 6 discusses the implementation of our method, and Section 7 presents our conclusions.

2 RELATED WORK

Internet resource discovery services [8], such as Archie [9], WAIS [3], CRS [10], GlOSS [11], and Indie [4], all provide services similar to the client-directory-server model. They determine relevant servers for users to submit queries.

In Archie [9], a centralized server collects file and directory names from anonymous Internet FTP servers. Users send queries containing the requested file name to the centralized server, get back a list of matching hosts, and retrieve the file manually. This system only searches documents by their file names and does not support complicated Boolean queries.

WAIS [3] has a special server, called the directory of servers, which contains the description of each WAIS server and compares them with user queries to determine relevant
servers. The WAIS directory of servers is similar to the directory of services in our model. It ranks servers based on a word-weighting algorithm, but it is maintained manually.

Fig. 1. Resource discovery process: (1) A user sends a query to the directory of services. (2) The directory of services returns a ranked list of relevant servers. (3) The user sends his query to one or more of the relevant servers, which (4) return matching documents.

In Content Routing System (CRS) [10], each server is characterized by a content label, which is a Boolean combination of attribute-value pairs and is manually constructed by administrators or automatically derived from frequently occurring terms in the database. A server is relevant to a query if its content label satisfies the query. Users can refine queries when browsing the content labels of selected servers. In addition, this system automatically forwards queries to relevant servers and merges their results. Currently, nesting of Boolean operations is not supported.

The GlOSS [11] system uses a probabilistic scheme to find relevant servers for user queries. In GlOSS, each server extracts a histogram of term occurrences in its database. The histograms are used to estimate the query result size (defined as the number of documents in the database times the probability that a document contains all the query terms) and to determine relevant servers. This method is built upon the assumption that terms appear in different documents of a database following independent and uniform probability distributions. GlOSS only considers Boolean and queries and does not rank servers.

Indie [4] is designed and implemented based on the client-directory-server model. Each Indie resource is managed by a server called an Indie broker, which maintains a generator that describes the objects stored in its database. The generator, a nested Boolean expression, is used as a filter to collect data from information providers. The logically centralized but replicated server, called directory of services, is a specialized broker that contains only the generators of every Indie broker in the system. Users send a Boolean query to the directory of services, which compares the query with each generator in its database, finds the similarity between them, then sends a ranked list of relevant Indie brokers to the user.

Depending on the type and format of the user query and server description, each system employs a different similarity measure to determine relevant servers. Most systems only support a simple query type, such as keywords or simple Boolean combinations. They do not solve nested Boolean queries and rank servers accordingly. The method presented in this paper is to measure the similarity between Boolean expressions, which can be directly applied to Indie. It can also be used for full text data or keyword queries by combining them with all and or or Boolean operators.

3 SIMILARITY MEASURE

Well-known similarity measures, such as Dice's coefficient, Jaccard's coefficient, Cosine coefficient, and Overlap coefficient, have been used to compute the similarities of one document to another document, and documents to queries for automatic classification, clustering, and indexing [12]. For these measures, documents and queries are represented as sets of keywords or vectors.

In the cluster-based retrieval system, documents with high similarities are grouped into a cluster. User queries are first compared with cluster representatives, then compared with documents in the clusters that have high similarities with the queries [12]. In the client-directory-server model, the function of directory of services is similar to cluster-based retrieval, where servers are clusters described by cluster representatives (i.e., server descriptions). For user queries and cluster representatives both described as Boolean expressions, the above similarity measures cannot be applied directly. The degree of similarity between user queries and server descriptions is determined by how much these Boolean expressions overlap. Consider the example below.

EXAMPLE 2. Suppose $R_A$ and $R_B$ are the server descriptions of two retrieval systems (A and B) stored in the directory of services, and $Q_1$ and $Q_2$ are two user queries:

\[ R_A = t_1 \wedge t_3, \]
\[ R_B = t_2 \wedge t_4, \]
\[ Q_1 = t_1 \wedge t_2 \wedge t_3, \]
\[ Q_2 = (t_1 \vee t_2) \wedge t_3. \]

Both $R_A$ and $R_B$ overlap with $Q_1$, but $R_A$ contains two overlapped terms ($t_1$ and $t_3$) while $R_B$ contains only one ($t_2$). Thus, $R_A$ is more relevant to query $Q_1$ than $R_B$, assuming all terms are weighted equally. However, for an and-or-combined query $Q_2$, it becomes more complicated to determine which server description is more relevant.

We need a systematic method to measure the overlap between user queries and server descriptions. Furthermore, this method must perform efficiently even when the number of server descriptions increases. Radecki employed several measures to rank similarity between Boolean expressions [13], [14]. In the following sections, we review Radecki's measures and present our modified measure. We demonstrate our improvements in space and time complexity and compare the two measures on a synthetic benchmark.

3.1 Background

Radecki proposed two similarity measures, $S$ and $S^*$, based on Jaccard's coefficient. He defined the similarity value $S$ between queries $Q_1$ and $Q_2$ as the ratio of the number of common documents to the total number of documents returned in response to both queries. This ratio, commonly known as Jaccard's coefficient, can be described as

\[ S(Q_1, Q_2) = \frac{|\psi(Q_1) \cap \psi(Q_2)|}{|\psi(Q_1) \cup \psi(Q_2)|}, \]

where $\cap$ denotes set intersection, $\cup$ denotes set union, and $\psi(Q_1)$ and $\psi(Q_2)$ are the response sets to $Q_1$ and $Q_2$, respectively. To apply $S$ in our environment, we denote $\pi(R)$ and $\psi_R(Q)$ as the sets of documents in the cluster represented by $R$ and in $R$'s response to query $Q$. The similarity value $S$ between $Q$ and $R$ is then defined as the ratio of the number of common documents to the total number of documents in $\psi_R(Q)$ and $\pi(R)$,

\[ S(Q, R) = \frac{|\psi_R(Q) \cap \pi(R)|}{|\psi_R(Q) \cup \pi(R)|}. \tag{1} \]

Because all the documents satisfying query $Q$ belong to cluster $R$ (i.e., $\psi_R(Q) \subseteq \pi(R)$), (1) can be simplified as

\[ S(Q, R) = \frac{|\psi_R(Q)|}{|\pi(R)|}. \tag{2} \]

EXAMPLE 3. Using the definitions from Example 2, we assume system A (represented by $R_A$) contains documents $\{a_1, a_2, a_3, a_4\}$ and system B (represented by $R_B$) contains documents $\{b_1, b_2, b_3\}$. Thus,

\[ \pi(R_A) = \{a_1, a_2, a_3, a_4\}, \]
\[ \pi(R_B) = \{b_1, b_2, b_3\}. \]

Assume, for query $Q_1$, the system responses are

\[ \psi_A(Q_1) = \{a_1, a_2, a_4\}, \]
\[ \psi_B(Q_1) = \{b_1, b_3\}, \]

where $\psi_A(Q_1)$ and $\psi_B(Q_1)$ are the responses to $Q_1$ in systems A and B, respectively. The similarity measures between $Q_1$ against $R_A$ and $R_B$ are then

\[ S(Q_1, R_A) = 3/4 = 0.750, \]
\[ S(Q_1, R_B) = 2/3 = 0.667. \]

In the case of a directory of services, however, the similarity measure is used to estimate the importance of entire information systems and decide the order in which users should search them. If the similarity is calculated based on the query results from every information system, the searching order is no longer needed because you have already searched them all.

Radecki proposed a similarity measure $S^*$ that is independent of the responses to the queries [14]. In $S^*$, Boolean expression $Q$ is transformed into its reduced disjunctive normal form (RDNF), denoted as $\tilde{Q}$, which is the disjunction of a list of reduced atomic descriptors. If set $T$ is the union of all the descriptors that appear in the to-be-compared Boolean expression pair, then a reduced atomic descriptor is defined as a conjunction of all the elements in $T$ in either their original or negated forms. Let $Q$ and $R$ be two Boolean expressions and $T_Q$ and $T_R$ be the sets of descriptors that appear in $Q$ and $R$, respectively. Suppose $T_Q \cup T_R = \{t_1, t_2, \ldots, t_k\}$, where $k$ is the set size of $T_Q \cup T_R$. Then, the RDNFs of $Q$ and $R$ are

\[ (\tilde{Q})_{T_Q \cup T_R} = \bigvee_{i=1}^{m} \Big( \bigwedge_{j=1}^{k} \tilde{q}_{i,j} \Big), \]
\[ (\tilde{R})_{T_Q \cup T_R} = \bigvee_{i=1}^{n} \Big( \bigwedge_{j=1}^{k} \tilde{r}_{i,j} \Big), \]

where $m$ and $n$ are the number of reduced atomic descriptors in $(\tilde{Q})_{T_Q \cup T_R}$ and $(\tilde{R})_{T_Q \cup T_R}$. Each reduced atomic descriptor ($\bigwedge_{j=1}^{k} \tilde{q}_{i,j}$ and $\bigwedge_{j=1}^{k} \tilde{r}_{i,j}$) in the two RDNFs consists of the same number of descriptors ($k$), which is the set size of $T_Q \cup T_R$. Each $\tilde{q}_{i,j}$ and $\tilde{r}_{i,j}$ in the RDNFs represents the corresponding descriptor $t_j$ or its negation $\neg t_j$ ($\neg$ is the logical not operator). For example, $\tilde{q}_{2,1}$ denotes the first descriptor in the second reduced atomic descriptor of RDNF $(\tilde{Q})_{T_Q \cup T_R}$, where $\tilde{q}_{2,1}$ is either $t_1$ or $\neg t_1$ depending on how $Q$ is transformed. The following example illustrates the transformation from Boolean expressions to RDNFs.

EXAMPLE 4. From Example 2,

\[ Q_2 = (t_1 \vee t_2) \wedge t_3, \qquad T_{Q_2} = \{t_1, t_2, t_3\}, \]
\[ R_A = t_1 \wedge t_3, \qquad T_{R_A} = \{t_1, t_3\}, \]
\[ R_B = t_2 \wedge t_4, \qquad T_{R_B} = \{t_2, t_4\}, \]

where set $T_X$ is the union of all the descriptors in Boolean expression $X$ ($X = Q_2$, $R_A$, or $R_B$). To transform $Q_2$ to its RDNF, we can apply the distributive law

\[ (t_1 \vee t_2) \wedge t_3 = (t_1 \wedge t_3) \vee (t_2 \wedge t_3), \]

and expand the two conjunctions, $(t_1 \wedge t_3)$ and $(t_2 \wedge t_3)$, to their associated reduced atomic descriptors. The expansion process is based on the equation

\[ t_a = (t_a \wedge t_b) \vee (t_a \wedge \neg t_b), \]

where $t_a$ and $t_b$ are descriptors. Consider $Q_2$ and $R_A$ first. Since $T_{Q_2} \cup T_{R_A} = \{t_1, t_2, t_3\}$, each reduced atomic descriptor in $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}}$ and $(\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}}$ must contain all the $t_i$'s ($1 \le i \le 3$) or their negated forms. Thus, the conjunctions in $Q_2$ are expanded to

\[ t_1 \wedge t_3 = (t_1 \wedge t_3 \wedge t_2) \vee (t_1 \wedge t_3 \wedge \neg t_2), \]
\[ t_2 \wedge t_3 = (t_2 \wedge t_3 \wedge t_1) \vee (t_2 \wedge t_3 \wedge \neg t_1). \]

The RDNFs of $Q_2$ and $R_A$ are

\[ (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} = (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3), \]
\[ (\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}} = (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3). \]

Similarly, because $T_{Q_2} \cup T_{R_B} = \{t_1, t_2, t_3, t_4\}$, the RDNFs of $Q_2$ and $R_B$ are

\[
\begin{aligned}
(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} ={}& (t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (t_1 \wedge t_2 \wedge t_3 \wedge \neg t_4) \\
& \vee (t_1 \wedge \neg t_2 \wedge t_3 \wedge t_4) \vee (t_1 \wedge \neg t_2 \wedge t_3 \wedge \neg t_4) \\
& \vee (\neg t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (\neg t_1 \wedge t_2 \wedge t_3 \wedge \neg t_4), \\
(\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} ={}& (t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (t_1 \wedge t_2 \wedge \neg t_3 \wedge t_4) \\
& \vee (\neg t_1 \wedge t_2 \wedge t_3 \wedge t_4) \vee (\neg t_1 \wedge t_2 \wedge \neg t_3 \wedge t_4).
\end{aligned}
\]

Radecki defines the similarity value $S^*$ between two Boolean expressions ($Q$ and $R$) as the ratio of the number of common reduced atomic descriptors in $\tilde{Q}$ and $\tilde{R}$ to the total number of reduced atomic descriptors in them,

\[ S^*(Q, R) = \frac{\left| (\tilde{Q})_{T_Q \cup T_R} \cap (\tilde{R})_{T_Q \cup T_R} \right|}{\left| (\tilde{Q})_{T_Q \cup T_R} \cup (\tilde{R})_{T_Q \cup T_R} \right|}. \tag{3} \]

EXAMPLE 5. Continuing with Example 4,

\[ S^*(Q_2, R_A) = \frac{\left| (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} \cap (\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}} \right|}{\left| (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} \cup (\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}} \right|} = \frac{2}{3} = 0.667, \]

\[ S^*(Q_2, R_B) = \frac{\left| (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} \cap (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} \right|}{\left| (\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} \cup (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} \right|} = \frac{2}{8} = 0.250. \]

Therefore, $R_A$ is more relevant to query $Q_2$ than $R_B$.

From Example 4, we can see that $Q_2$ is transformed to different RDNFs, $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}}$ and $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}}$, when computing with $R_A$ and $R_B$. This means whenever a new user query is compared against $N$ server descriptions, it needs $2N$ RDNF transformations to calculate the similarity between them. This method suffers when the number of server descriptions is large and users query frequently. The system will spend significant amounts of time recomputing RDNFs, and, consequently, will perform badly. To solve this problem, we modify Radecki's method so that it need not recompute RDNFs of server descriptions while still providing results of equivalent or better quality.

3.2 New Similarity Measure

We propose a new measure based on Radecki's similarity measure $S^*$, that is independent of the underlying information systems and requires less computation. We transform a Boolean expression to its compact disjunctive normal form (CDNF), using the distributive law described in the previous section. The CDNF is a disjunction of compact atomic descriptors, each being a conjunction of subsets of descriptors in the original Boolean expression. The descriptors in each compact atomic descriptor are determined while performing the distributive law.

Let $Q$ and $R$ be two Boolean expressions, and $T_Q$ and $T_R$ be their sets of descriptors. We denote $\hat{Q}$ and $\hat{R}$ as the CDNFs of $Q$ and $R$, and express them as

\[ \hat{Q} = \bigvee_{i=1}^{m} \Big( \bigwedge_{u=1}^{x_i} \hat{q}_{i,u} \Big), \]
\[ \hat{R} = \bigvee_{j=1}^{n} \Big( \bigwedge_{v=1}^{y_j} \hat{r}_{j,v} \Big), \]

where each conjunction ($\bigwedge_{u=1}^{x_i} \hat{q}_{i,u}$ and $\bigwedge_{v=1}^{y_j} \hat{r}_{j,v}$) is a compact atomic descriptor, and $m$ and $n$ are their number in $\hat{Q}$ and $\hat{R}$. The $x_i$ is the number of descriptors in the $i$th ($1 \le i \le m$) compact atomic descriptor of $\hat{Q}$, and $y_j$ is the number of descriptors in the $j$th ($1 \le j \le n$) compact atomic descriptor of $\hat{R}$. Each $\hat{q}_{i,u}$ and $\hat{r}_{j,v}$ in the CDNFs represents a descriptor in $T_Q$ and $T_R$, respectively.

EXAMPLE 6. The CDNFs of $Q_2$, $R_A$, and $R_B$ in Example 2 are

\[ \hat{Q}_2 = (t_1 \wedge t_3) \vee (t_2 \wedge t_3), \]
\[ \hat{R}_A = (t_1 \wedge t_3), \]
\[ \hat{R}_B = (t_2 \wedge t_4). \]

Each compact atomic descriptor in $\hat{Q}_2$ consists of only the descriptors in $T_{Q_2}$ without introducing new descriptors from $T_{R_A}$ and $T_{R_B}$. In other words, the descriptors in $\hat{Q}_2$ are independent of those in other Boolean expressions, such as $R_A$ and $R_B$.

We denote our similarity measure $S^{\oplus}$ and define the similarity of two Boolean expressions as the summation of the individual similarity measures ($s^{\oplus}$) between each compact atomic descriptor. The individual similarity measure $s^{\oplus}$ is defined as

\[
s^{\oplus}(\hat{Q}^i, \hat{R}^j) =
\begin{cases}
\dfrac{1}{2^{|T_R^j - T_Q^i|} + 2^{|T_Q^i - T_R^j|} - 1} & \text{if } T_Q^i \cap T_R^j \neq \emptyset \text{ and } \forall t \in T_Q^i,\ \neg t \notin T_R^j, \\[2ex]
0 & \text{otherwise,}
\end{cases}
\]

where $\hat{Q}^i$ indicates the $i$th compact atomic descriptor of CDNF $\hat{Q}$, and $\hat{R}^j$ indicates the $j$th compact atomic descriptor of CDNF $\hat{R}$. $T_Q^i$ and $T_R^j$ are the sets of descriptors in $\hat{Q}^i$ and $\hat{R}^j$. $|T_R^j - T_Q^i|$ is the number of descriptors that appear in $T_R^j$ but not in $T_Q^i$. $|T_Q^i - T_R^j|$ is the number of descriptors that appear in $T_Q^i$ but not in $T_R^j$. The similarity measure $S^{\oplus}$ is the sum of the individual $s^{\oplus}$ given by

\[ S^{\oplus}(Q, R) = \sum_{i=1}^{|\hat{Q}|} \sum_{j=1}^{|\hat{R}|} s^{\oplus}(\hat{Q}^i, \hat{R}^j), \tag{4} \]

where $|\hat{Q}|$ and $|\hat{R}|$ are the number of compact atomic descriptors in $\hat{Q}$ and $\hat{R}$, respectively.
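
The two response-independent measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`rdnf`, `s_star`, `s_oplus_pair`, `s_oplus`) are hypothetical, a Boolean expression is assumed to be given as a Python callable over a truth assignment for $S^*$, and a CDNF is assumed to be given as a list of sets of literal strings (with a leading `~` marking negation) for $S^{\oplus}$. The RDNF is obtained here by brute-force truth-table enumeration rather than by symbolic distribution and expansion, which yields the same set of reduced atomic descriptors.

```python
from itertools import product


def rdnf(expr, descriptors):
    """Reduced atomic descriptors of expr over `descriptors`, represented as
    the set of satisfying truth assignments (one tuple per descriptor)."""
    names = sorted(descriptors)
    return {
        bits
        for bits in product((True, False), repeat=len(names))
        if expr(dict(zip(names, bits)))
    }


def s_star(q, t_q, r, t_r):
    """Radecki's S*: Jaccard coefficient of the two RDNFs over T_Q union T_R."""
    t = set(t_q) | set(t_r)
    rq, rr = rdnf(q, t), rdnf(r, t)
    return len(rq & rr) / len(rq | rr)


def negate(literal):
    """'t1' -> '~t1' and '~t1' -> 't1' (the '~' prefix is an assumed encoding)."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def s_oplus_pair(tq, tr):
    """Individual measure between two compact atomic descriptors."""
    if not (tq & tr) or any(negate(t) in tr for t in tq):
        return 0.0
    return 1.0 / (2 ** len(tr - tq) + 2 ** len(tq - tr) - 1)


def s_oplus(cdnf_q, cdnf_r):
    """Sum of the individual measures over all pairs, as in (4)."""
    return sum(s_oplus_pair(tq, tr) for tq in cdnf_q for tr in cdnf_r)


# Expressions of Example 2, written as callables for S*.
q2 = lambda v: (v["t1"] or v["t2"]) and v["t3"]
ra = lambda v: v["t1"] and v["t3"]
rb = lambda v: v["t2"] and v["t4"]

# The same expressions as CDNFs (Example 6), written as sets of literals.
cdnf_q2 = [frozenset({"t1", "t3"}), frozenset({"t2", "t3"})]
cdnf_ra = [frozenset({"t1", "t3"})]
cdnf_rb = [frozenset({"t2", "t4"})]

print(round(s_star(q2, {"t1", "t2", "t3"}, ra, {"t1", "t3"}), 3))  # 0.667
print(round(s_star(q2, {"t1", "t2", "t3"}, rb, {"t2", "t4"}), 3))  # 0.25
print(round(s_oplus(cdnf_q2, cdnf_ra), 3))  # 1.333
print(round(s_oplus(cdnf_q2, cdnf_rb), 3))  # 0.333
```

Note that `s_star` must rebuild the query's RDNF for every server description, because the descriptor union changes with each pair, whereas the CDNF passed to `s_oplus` is computed once per expression.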

EXAMPLE 7. Using the above definitions, we compute $S^{\oplus}$ for $Q_2$, $R_A$, and $R_B$ of Example 6. We find

\[ \hat{Q}_2 = \underbrace{(t_1 \wedge t_3)}_{\hat{Q}_2^1} \vee \underbrace{(t_2 \wedge t_3)}_{\hat{Q}_2^2}, \qquad T_{Q_2}^1 = \{t_1, t_3\}, \quad T_{Q_2}^2 = \{t_2, t_3\}, \]
\[ \hat{R}_A = \underbrace{(t_1 \wedge t_3)}_{\hat{R}_A^1}, \qquad T_{R_A}^1 = \{t_1, t_3\}, \]
\[ \hat{R}_B = \underbrace{(t_2 \wedge t_4)}_{\hat{R}_B^1}, \qquad T_{R_B}^1 = \{t_2, t_4\}, \]

where $T_{Q_2}^1$, $T_{Q_2}^2$, $T_{R_A}^1$, and $T_{R_B}^1$ are the sets of descriptors in the compact atomic descriptors $\hat{Q}_2^1$, $\hat{Q}_2^2$, $\hat{R}_A^1$, and $\hat{R}_B^1$, respectively. The individual similarity measures are, therefore

\[ s^{\oplus}(\hat{Q}_2^1, \hat{R}_A^1) = \frac{1}{2^{|T_{R_A}^1 - T_{Q_2}^1|} + 2^{|T_{Q_2}^1 - T_{R_A}^1|} - 1} = \frac{1}{2^0 + 2^0 - 1} = 1.000, \]

\[ s^{\oplus}(\hat{Q}_2^2, \hat{R}_A^1) = \frac{1}{2^{|T_{R_A}^1 - T_{Q_2}^2|} + 2^{|T_{Q_2}^2 - T_{R_A}^1|} - 1} = \frac{1}{2^1 + 2^1 - 1} = 0.333, \]

which yields

\[ S^{\oplus}(Q_2, R_A) = s^{\oplus}(\hat{Q}_2^1, \hat{R}_A^1) + s^{\oplus}(\hat{Q}_2^2, \hat{R}_A^1) = 1.333. \]

Similarly, for $Q_2$ and $R_B$, we get

\[ s^{\oplus}(\hat{Q}_2^1, \hat{R}_B^1) = 0 \quad (\text{because } T_{Q_2}^1 \cap T_{R_B}^1 = \emptyset), \]

\[ s^{\oplus}(\hat{Q}_2^2, \hat{R}_B^1) = \frac{1}{2^{|T_{R_B}^1 - T_{Q_2}^2|} + 2^{|T_{Q_2}^2 - T_{R_B}^1|} - 1} = \frac{1}{2^1 + 2^1 - 1} = 0.333, \]

which leads to

\[ S^{\oplus}(Q_2, R_B) = s^{\oplus}(\hat{Q}_2^1, \hat{R}_B^1) + s^{\oplus}(\hat{Q}_2^2, \hat{R}_B^1) = 0.333. \]

Therefore, $R_A$ is more relevant to query $Q_2$ than $R_B$.

Notice that the similarity values calculated using $S^{\oplus}$ (in Example 7) are different from those calculated using $S^*$ (in Example 5). It is meaningless to compare these values directly because both are measured on a relative scale. However, they can be used to rank a list of Boolean expressions measured by the same method.

4 EXPERIMENTS

To compare the rankings estimated by similarity measures $S^*$ and $S^{\oplus}$, we conduct experiments on two databases. One is the standard CISI dataset. The other is the Homer database at the University of Southern California. We use the result of $S$ as the criterion, and compare it to that of $S^*$ and $S^{\oplus}$. Each experiment consists of the following steps:

1) Create individual server databases by using queries as filters.
2) Calculate $S$ based on the number of hit documents on each server.
3) Calculate $S^*$ and $S^{\oplus}$ for each filter-query pair.
4) Rank servers based on $S$, $S^*$, and $S^{\oplus}$.
5) Compare their rankings using Spearman rank-order correlation coefficient ($r_s$) [15].
6) Compare $r_s(S^*, S)$ and $r_s(S^{\oplus}, S)$ using confidence interval for the proportion [16].

We describe the details as follows:

During the experiment, all queries play two roles. First, each query is used as the filter of a server to collect specific documents from the testing database. Thus, we can create $N$ servers by running $N$ queries on the database, where each server description is represented by the associated filter. Second, each query is submitted to all the $N$ servers. The number of hit documents is used to calculate $S$ using (2). Based on the $S$ values from the $N$ servers, we can rank them for each query and use that as the standard ranking to evaluate $S^*$ and $S^{\oplus}$.

To calculate the rankings estimated by $S^*$ and $S^{\oplus}$, we apply (3) and (4) to each filter-query pair (i.e., query pair) and sort them in descending order. To compare which method generates a ranking closer to the standard, we compute the degree of association between ($S^*$, $S$) and between ($S^{\oplus}$, $S$) by applying the Spearman rank-order correlation coefficient ($r_s$) [15]. The $r_s$ ranges between $-1$ and $1$. If two rankings are identical, $r_s = 1$. If one ranking is the reverse of the other, $r_s = -1$. The larger the $r_s$, the closer the rankings.

Let $Q_1, Q_2, \ldots, Q_N$ denote the $N$ queries as well as the filters. For each query $Q_i$ ($1 \le i \le N$), we rank filter $Q_j$ ($1 \le j \le N$) according to the similarity values $S(Q_i, Q_j)$, $S^*(Q_i, Q_j)$, and $S^{\oplus}(Q_i, Q_j)$ separately. For tied values, each $Q_j$ is assigned the average of the ranks that would have been assigned had no ties happened.

Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be two rankings for $Q_i$ generated by various similarity measures, where $n$ is the number of elements in the ranking ($N$ in our case). The tied ranks in each ranking form a group. Assume there are $g_u$ different groups in $a_1, \ldots, a_n$, each group has $u_k$ ($1 \le k \le g_u$) tied elements. Similarly, ranking $b_1, \ldots, b_n$ has $g_v$ groups, each has $v_k$ ($1 \le k \le g_v$) tied elements. The $r_s$ coefficient can be obtained by [15]:

\[ r_s = \frac{\frac{1}{6}(n^3 - n) - \sum_{k=1}^{n} (a_k - b_k)^2 - U - V}{\sqrt{\left[\frac{1}{6}(n^3 - n) - 2U\right]\left[\frac{1}{6}(n^3 - n) - 2V\right]}}, \]

where

\[ U = \frac{1}{12} \sum_{k=1}^{g_u} (u_k^3 - u_k), \]
\[ V = \frac{1}{12} \sum_{k=1}^{g_v} (v_k^3 - v_k). \]

For each query, we can determine which method performs better by their $r_s$ values with respect to $S$. Among the $N$ observations, we measure the confidence that $S^{\oplus}$ is superior to $S^*$ by calculating the confidence interval for the proportion, defined as follows [16]:

\[ \text{Sample proportion} = p = \frac{n_1}{n}, \]
\[ \text{Confidence interval for proportion} = p \mp z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}, \]

where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of a unit normal variate ($z_{1-\alpha/2} = 1.960$ for 95 percent confidence level), $n$ is the total number of samples ($N$ in our case), and $n_1$ is the number of times $S^{\oplus}$ is superior to $S^*$ (i.e., $r_s(S^{\oplus}, S) > r_s(S^*, S)$). By definition [16], if $np \ge 10$ and the confidence interval does not include 0.5, we can say, with 95 percent confidence, that $S^{\oplus}$ is superior to $S^*$.

4.1 CISI Experiment

The CISI dataset consists of 1,460 information science documents and 35 Boolean queries. All documents are indexed with terms occurring in the title and abstract, but not on a stop list, of 429 common words. All indexed terms are stored in their original forms without stemming. A Boolean query is a nested structure of terms with logical and, or, and not operators in between. Documents are hit by a query if they satisfy all the conditions in the query.

Following the six steps described previously, we calculate $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$ for the 35 queries. Fig. 2 shows the value of $r_s(S^{\oplus}, S)$ minus $r_s(S^*, S)$ for each query. Among them, $r_s(S^{\oplus}, S)$ is greater than $r_s(S^*, S)$ for 24 times (the points above zero) and less than $r_s(S^*, S)$ for 11 times (the points below zero). This indicates $S^{\oplus}$ generates a ranking closer to that of $S$ for 24 out of 35 times, whereas $S^*$ only has closer order for 11 out of 35 times. The mean Spearman coefficients of $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$ are 0.331 and 0.275, respectively. This shows $S^{\oplus}$ has a better average estimation than $S^*$ on the CISI database.

Fig. 2. The difference between $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$ for 35 Boolean queries on the CISI database. The points above zero indicate $S^{\oplus}$ generates a ranking closer to that of $S$ than $S^*$ for the associated query.

From the results of $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$, the sample proportion of $r_s(S^{\oplus}, S) > r_s(S^*, S)$ is

\[ p = \frac{24}{35} = 0.686. \]

Because $np \ge 10$ ($n = 35$), we can calculate the confidence interval for the proportion for $S^{\oplus}$:

\[ \text{95\% confidence interval for proportion} = p \mp 1.960 \sqrt{\frac{p(1-p)}{n}} = (0.532, 0.840). \]

The confidence interval does not include 0.5. Therefore, we can say, with 95 percent confidence, that $S^{\oplus}$ is superior to $S^*$ in the CISI experiment.

4.2 USC Homer Experiment

In this experiment, we manually create 32 Boolean query samples, each averaging 3.6 descriptors picked up from 24 terms in diverse fields. We submit these queries to the USC Homer database and compute the results. Fig. 3 shows the values of $r_s(S^{\oplus}, S)$ minus $r_s(S^*, S)$ for the 32 queries.

In Fig. 3, $r_s(S^{\oplus}, S)$ is greater than $r_s(S^*, S)$ for 22 times (the points above zero) and less than $r_s(S^*, S)$ for 10 times (the points below zero). This indicates $S^{\oplus}$ generates a ranking closer to that of $S$ for 22 out of 32 times, whereas $S^*$ only has closer order for 10 out of 32 times. The mean Spearman coefficients of $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$ are 0.595 and 0.494, respectively. This shows $S^{\oplus}$ has a better average estimation than $S^*$ on the USC Homer database.

From the results of $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$, the sample proportion of $r_s(S^{\oplus}, S) > r_s(S^*, S)$ is

\[ p = \frac{22}{32} = 0.688. \]

Because $np \ge 10$ ($n = 32$), we can calculate the confidence interval for the proportion for $S^{\oplus}$:

\[ \text{95\% confidence interval for proportion} = p \mp 1.960 \sqrt{\frac{p(1-p)}{n}} = (0.527, 0.848). \]

The confidence interval does not include 0.5. Therefore, we can say, with 95 percent confidence, that $S^{\oplus}$ is superior to $S^*$ in this experiment.

Fig. 3. The difference between $r_s(S^{\oplus}, S)$ and $r_s(S^*, S)$ for 32 Boolean queries on the USC Homer database. The points above zero indicate $S^{\oplus}$ generates a ranking closer to that of $S$ than $S^*$ for the associated query.

4.3 Discussion

The queries associated with each dataset are designed to hit a number of documents in the collection. Therefore, the servers generated by using queries as the filters contain different portions of the collection. The CISI database is a collection of documents in library science and related areas. It is an experimental database commonly used by researchers working on information retrieval. The USC Homer is an on-line library catalog system that covers a broad range of collections, such as business, law, literature, medicine, science, and engineering. So, for example, each server in the first experiment is a subset of documents focusing on a specific topic in information science, while the servers in the second experiment contain documents in widely different fields. Table 1 gives the additional characteristics of the two experiments. We obtained similar results from the two databases even though they have different sizes and cover different fields of documents. In the two experiments, both the average Spearman coefficient and the confidence interval for proportion show that $S^{\oplus}$ is superior to $S^*$.

TABLE 1
CHARACTERISTICS OF THE CISI AND USC HOMER EXPERIMENTS

                                        CISI      USC Homer
Number of documents                     1,460     < 800,000
Number of queries                       35        32
Number of servers                       35        32
Mean number of terms per query          7.14      3.6
Mean number of documents per server     91.7      5,492

5 ANALYSIS AND COMPARISON

Space and time are two of the important factors in designing a real-time system. In an on-line information retrieval system, the system response time is highly dependent on the underlying data structures and associated indexing and searching techniques. In this section, we analyze the space and time complexities of the two searching techniques, similarity measures $S^*$ and $S^{\oplus}$.

As mentioned earlier, to calculate $S^{\oplus}$, we need to apply the distributive law, such as

\[ (t_1 \vee t_2) \wedge t_3 = (t_1 \wedge t_3) \vee (t_2 \wedge t_3), \]

to obtain CDNFs, where $t_1$, $t_2$, and $t_3$ are descriptors. To calculate Radecki's $S^*$, we need to transform Boolean expressions to RDNFs. Two steps are required in the transformation:

1) distribution, where the distributive law is used to produce the corresponding disjunctive normal form; and
2) expansion, where we use $t_1 = (t_1 \wedge t_2) \vee (t_1 \wedge \neg t_2)$ so that each reduced atomic descriptor contains all the descriptors (original or negated) in the to-be-compared Boolean expressions.

The order of these two steps affects the complexity, but not the result, of transforming a Boolean expression to a RDNF. If the distribution is performed before the expansion, it is equivalent to transforming the Boolean expression to its CDNF and then expanding the CDNF to a RDNF. If the expansion is performed before the distribution, it needs more space and computation because extra negated descriptors, such as $\neg t_2$, will be generated in the expansion step. The following example will clarify this idea.

Case 1: We transform Boolean expression $Q$ to CDNF $\hat{Q}$, then expand it to RDNF $\tilde{Q}$.

\[
\begin{aligned}
Q &= (t_1 \vee t_2) \wedge t_3 \\
  &= (t_1 \wedge t_3) \vee (t_2 \wedge t_3) \quad (\Rightarrow \hat{Q}) && (5) \\
  &= \big[ (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \big] \quad (\Leftarrow t_1 \wedge t_3) \\
  &\quad \vee \big[ (t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \big] \quad (\Leftarrow t_2 \wedge t_3) && (6) \\
  &= (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \\
  &= \tilde{Q}
\end{aligned}
\]

Case 2: We expand $Q$ first, then distribute it.

\[
\begin{aligned}
Q &= (t_1 \vee t_2) \wedge t_3 && (7) \\
  &= \Big\{ \big[ (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge t_2 \wedge \neg t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge \neg t_3) \big] \quad (\Leftarrow t_1) \\
  &\quad \vee \big[ (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge t_2 \wedge \neg t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge \neg t_3) \big] \Big\} \quad (\Leftarrow t_2) \\
  &\quad \wedge \big[ (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge \neg t_2 \wedge t_3) \big] \quad (\Leftarrow t_3) && (8) \\
  &= \big\{ (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge t_2 \wedge \neg t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \\
  &\quad \vee (t_1 \wedge \neg t_2 \wedge \neg t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge \neg t_3) \big\} \\
  &\quad \wedge \big\{ (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \vee (\neg t_1 \wedge \neg t_2 \wedge t_3) \big\} \\
  &= (t_1 \wedge t_2 \wedge t_3) \vee (t_1 \wedge \neg t_2 \wedge t_3) \vee (\neg t_1 \wedge t_2 \wedge t_3) \\
  &= \tilde{Q}
\end{aligned}
\]

In Case 1, the expansion is performed after the distribution. Therefore each compact atomic descriptor is expanded (from (5) to (6)) instead of each descriptor (from (7) to (8)), as in Case 2. A compact atomic descriptor usually contains more than one descriptor after applying the distributive law to its original Boolean expression. In the above example, each of the two compact atomic descriptors in $\hat{Q}$ (5), $(t_1 \wedge t_3)$ and $(t_2 \wedge t_3)$, contains two descriptors. Eight additional descriptors are added from (5) to (6) after the expansion. On the other hand, each individual descriptor in the original Boolean expression is expanded in Case 2. Thirty-three additional descriptors are added from (7) to (8). The second approach needs more space than the first one for storing those intermediate descriptors, which consequently causes it to spend more time checking the duplicates before obtaining the final $\tilde{Q}$.

\[ \mathrm{Time}_{S^{\oplus}} = \mathrm{Time}_{S^{\oplus}}(\text{transformation}) + \mathrm{Time}_{S^{\oplus}}(\text{computation}), \]
\[ \mathrm{Time}_{S^*} = \mathrm{Time}_{S^*}(\text{transformation}) + \mathrm{Time}_{S^*}(\text{computation}), \]

where

\[ \mathrm{Time}_{S^{\oplus}}(\text{transformation}) = \mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}), \]
\[ \mathrm{Time}_{S^*}(\text{transformation}) = \mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}) + \mathrm{Time}(\text{CDNF} \Rightarrow \text{RDNF}). \]

Similarly, the space complexities of $S^{\oplus}$ and $S^*$ are determined by the storage requirements for the CDNF and RDNF, respectively. For a single Boolean expression,

\[ \mathrm{Space}_{S^{\oplus}} = \mathrm{Space}(\text{CDNF}), \]
\[ \mathrm{Space}_{S^*} = \mathrm{Space}(\text{CDNF}) + \mathrm{Space}(\text{RDNF}). \]

In the following sections, we discuss the complexity of each individual step.

5.1 From Boolean Expression To CDNF

To simplify the analysis, we use binary trees [17] to represent the Boolean expressions. Each external node or leaf represents a descriptor. All the internal nodes, including the root, are logical operators. The negation not can be stored with the associated descriptor, therefore we do not denote it separately. The height of a tree is the longest path from any leaf to the root.

The binary trees are transformed to their equivalent CDNF binary trees using the distributive law. The technique is to transform an and-rooted subtree to an equivalent or-rooted subtree, one at a time, in a top-down approach. An example is shown in Fig. 4, where A, B, and C are the subtrees of associated nodes.

Fig. 4. Compact disjunctive normalization. We use the distributive law $(A \vee B) \wedge C = (A \wedge C) \vee (B \wedge C)$ on the subtrees A, B, and C.
In our example, the original Boolean expression contains We first change the current root node from and to or,
only three descriptors. It is the simplest transformation and change its or-rooted child node to be and-rooted. Then,
case. For more complicated Boolean expressions, the differ- we demote the other child (C) by one level, and add one
ence between Case 1 and Case 2 is bigger. Therefore we use and node at its original position to be its new parent. Fi-
the first approach (i.e., Boolean expression CDNF nally, we replicate the demoted child (C) and exchange it
RDNF) in our complexity analysis. Based on this, the time with one of the children (B) on the other subtree. The same
complexities of S and S* are equal to the transformation procedure is repeated until reaching the leaves.
time from Boolean expression to CDNF or RDNF plus the The space complexity of transforming the Boolean ex-
time to compute the similarity measures. For a single Boo- pression to a CDNF varies from O(n) to O(n2), depending
lean expression, on how the Boolean expression is constructed, where n is
the sum of the total number of descriptors and logical operators in the Boolean expression. For example, a linear binary tree (Fig. 5a) generates an $O(n)$ CDNF, while a complete binary tree (Fig. 5b) generates an $O(n^2)$ CDNF. Notice that $n$ is equal to the total number of nodes if the Boolean expression is represented as a binary tree. Intuitively, the $O(n^2)$ CDNF can be derived by starting with a tree of height $h$. It will have $O(n) = O(2^h)$ nodes for a complete binary tree. The distributive law no more than doubles the height. Thus, the new tree is bounded by size $O(2^{2h}) = O((2^h)^2) = O(n^2)$.

Fig. 5. Various binary trees: (a) a linear binary tree, and (b) a complete binary tree, where and and or are logical operators, and $t_1, \ldots, t_p$ are descriptors.

The time complexity is primarily determined by the number of times the distributive law is invoked and the size of the subtree to be duplicated. Basically it is of the same order as the space complexity: $O(n)$ for a linear binary tree and $O(n^2)$ for a complete binary tree.

The linear and complete binary trees are used as the lower and upper bounds for complexity analysis. Below, we discuss only the worst case: a complete binary tree with an and node at the root and or nodes elsewhere. In this case, the distributive law is applied to all the subtrees at each level. The case of the linear binary tree is described in [18].

Fig. 6 shows an n-node complete binary tree, i.e., one in which each internal node has two children. Every time the topmost and node is distributed, an additional and node and a copy of one of its subtrees will be created. Fig. 6a is transformed to Fig. 6b by creating an and node and a duplicate C. Similarly, A and B are duplicated from Fig. 6b to Fig. 6c. The time complexity $T(n)$ consists of the time for adding additional and nodes and for duplicating subtrees. Since there is no need for distribution for $n \le 3$, $T(n) = 0$ in these cases. Otherwise,

\[
T(n) = 1 + \frac{n-1}{2} + 2\left[ 1 + \frac{n-3}{4} + 2\,T\!\left(\frac{n-1}{2}\right) \right].
\]

Therefore,

\[
T(n) =
\begin{cases}
0 & \text{if } n \le 3, \\
n + 1 + 4\,T\!\left(\frac{n-1}{2}\right) & \text{otherwise.}
\end{cases}
\]

Let $h$ be the height of the original binary tree, then $n = 2^h - 1$, where $h \ge 1$. We can derive

\[
\begin{aligned}
T(n) &= T(2^h - 1) \\
&= 2^h + 4\,T(2^{h-1} - 1) \\
&= 2^h + 4 \cdot 2^{h-1} + 4^2\,T(2^{h-2} - 1) \\
&= 2^h + 4 \cdot 2^{h-1} + 4^2 \cdot 2^{h-2} + \cdots + 4^{h-3} \cdot 2^{3} + 4^{h-2}\,T(2^{h-(h-2)} - 1) \\
&= 2^h + 2 \cdot 2^h + 2^2 \cdot 2^h + \cdots + 2^{h-3} \cdot 2^h \\
&= 2^h \left( 2^{h-2} - 1 \right) \\
&= (n+1) \left[ \frac{n+1}{4} - 1 \right] \\
&= O(n^2).
\end{aligned}
\]

Similarly, the space complexity $M(n)$ consists of the space to store the root and its two to-be-distributed subtrees. For $n \le 3$, the space is not changed because there is no need for distribution. Otherwise,

\[
M(n) = 1 + 2\left[ 1 + 2\,M\!\left(\frac{n-1}{2}\right) \right].
\]

Therefore,

\[
M(n) =
\begin{cases}
n & \text{if } n \le 3, \\
3 + 4\,M\!\left(\frac{n-1}{2}\right) & \text{otherwise,}
\end{cases}
\]

which can be derived as

\[
\begin{aligned}
M(n) &= M(2^h - 1) \\
&= 3 + 4\,M(2^{h-1} - 1) \\
&= 3 + 4 \cdot 3 + 4^2\,M(2^{h-2} - 1) \\
&= 3 + 4 \cdot 3 + 4^2 \cdot 3 + \cdots + 4^{h-3} \cdot 3 + 4^{h-2}\,M(2^{h-(h-2)} - 1) \\
&= 3 \left( 1 + 4 + 4^2 + \cdots + 4^{h-2} \right) \\
&= 4^{h-1} - 1 \\
&= \frac{(n+1)^2}{4} - 1 \\
&= O(n^2).
\end{aligned}
\]
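As a sanity check on the two derivations above, the following minimal Python sketch evaluates the recurrences $T(n) = n + 1 + 4T((n-1)/2)$ and $M(n) = 3 + 4M((n-1)/2)$ directly and compares them with the closed forms $(n+1)[(n+1)/4 - 1]$ and $(n+1)^2/4 - 1$. The function names are illustrative only, and the check is restricted to complete binary trees with $n = 2^h - 1$ and $h \ge 2$, which is where the closed forms apply.

```python
def t_recurrence(n):
    """Distribution time T(n) for an n-node complete binary tree."""
    if n <= 3:
        return 0  # no distribution is needed for three nodes or fewer
    return n + 1 + 4 * t_recurrence((n - 1) // 2)


def m_recurrence(n):
    """Space M(n) for an n-node complete binary tree."""
    if n <= 3:
        return n  # the tree is left unchanged
    return 3 + 4 * m_recurrence((n - 1) // 2)


def t_closed(n):
    """Closed form (n + 1) * ((n + 1) / 4 - 1)."""
    return (n + 1) * ((n + 1) // 4 - 1)


def m_closed(n):
    """Closed form (n + 1)^2 / 4 - 1."""
    return (n + 1) ** 2 // 4 - 1


for h in range(2, 11):
    n = 2 ** h - 1  # number of nodes in a complete binary tree of height h
    assert t_recurrence(n) == t_closed(n)
    assert m_recurrence(n) == m_closed(n)
    print(h, n, t_recurrence(n), m_recurrence(n))
```

Both quantities grow by a factor of roughly four each time the height increases by one, which is the quadratic growth in $n$ stated above.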
Fig. 6. The CDNF transformation of an n-node complete binary tree. Originally, only the root is the and operator; the other internal nodes are all or operators. A, B, C1, and C2 are subtrees. The number in the brackets is the number of nodes in the subtree. (a) The original binary tree. (b) The binary tree after distribution on the first-level node (i.e., the root), where the subtree C is duplicated. (c) The binary tree after distributions on the second-level nodes, where the subtrees A and B are duplicated.

Let $t_i$ ($1 \le i \le \frac{n+1}{2}$) denote the $\frac{n+1}{2}$ descriptors (i.e., leaves) in the original n-node full binary tree. We divide the $t_i$ into four groups (A, B, C1, C2) of equal size, each group having $\frac{n+1}{8}$ descriptors. Let $k = \frac{n+1}{8}$, then

\[
\begin{aligned}
A &= (t_1 \lor \cdots \lor t_k), \\
B &= (t_{k+1} \lor \cdots \lor t_{2k}), \\
C_1 &= (t_{2k+1} \lor \cdots \lor t_{3k}), \\
C_2 &= (t_{3k+1} \lor \cdots \lor t_{4k}).
\end{aligned}
\]

Therefore, Fig. 6 can be presented as

\[
\begin{aligned}
Q &= (A \lor B) \land (C_1 \lor C_2) && (9) \\
&= (A \land (C_1 \lor C_2)) \lor (B \land (C_1 \lor C_2)) && (10) \\
&= (A \land C_1) \lor (A \land C_2) \lor (B \land C_1) \lor (B \land C_2) && (11) \\
&= \bigl(\underbrace{(t_1 \lor \cdots \lor t_k)}_{A} \land \underbrace{(t_{2k+1} \lor \cdots \lor t_{3k})}_{C_1}\bigr) \\
&\quad \lor \bigl(\underbrace{(t_1 \lor \cdots \lor t_k)}_{A} \land \underbrace{(t_{3k+1} \lor \cdots \lor t_{4k})}_{C_2}\bigr) \\
&\quad \lor \bigl(\underbrace{(t_{k+1} \lor \cdots \lor t_{2k})}_{B} \land \underbrace{(t_{2k+1} \lor \cdots \lor t_{3k})}_{C_1}\bigr) \\
&\quad \lor \bigl(\underbrace{(t_{k+1} \lor \cdots \lor t_{2k})}_{B} \land \underbrace{(t_{3k+1} \lor \cdots \lor t_{4k})}_{C_2}\bigr) \\
&= \bigvee_{i=1}^{2k} \; \bigvee_{j=2k+1}^{4k} (t_i \land t_j) \\
&= \bigvee_{i=1}^{(n+1)/4} \; \bigvee_{j=(n+5)/4}^{(n+1)/2} (t_i \land t_j), && (12)
\end{aligned}
\]

where A, B, C1, and C2 are subtrees, and (9), (10), and (11) represent Figs. 6a, 6b, and 6c, respectively. Equation (12) represents the resulting CDNF $\hat{Q}$, which consists of $\left(\frac{n+1}{4}\right)^2$ compact atomic descriptors with two descriptors in each of them. We can, therefore, show that the characteristics of the CDNF of an n-node complete binary tree are:

• $\frac{(n+1)^2}{4} - 1$ total nodes,
• $\frac{(n+1)^2}{8}$ descriptors,
• $\frac{(n+1)^2}{8} - 1$ logical operators,
• $\left(\frac{n+1}{4}\right)^2$ compact atomic descriptors,
• each compact atomic descriptor contains two descriptors,
• $\mathrm{Time}_{\text{complete}}(\text{Boolean expression} \to \text{CDNF}) = O(n^2)$,
• $\mathrm{Space}_{\text{complete}}(\text{CDNF}) = O(n^2)$.

5.2 From CDNF To RDNF

Assume $Q_1$ and $Q_2$ are two Boolean expressions, which have $n_1$ and $n_2$ total nodes and $p_1$ and $p_2$ distinct descriptors, respectively. Let $p$ be the size of the union of these two distinct descriptor sets, $c_1$ and $c_2$ the number of compact
atomic descriptors of $\hat{Q}_1$ and $\hat{Q}_2$, and $r_1$ and $r_2$ the number of reduced atomic descriptors of $\tilde{Q}_1$ and $\tilde{Q}_2$. We observe that the resulting RDNFs of Boolean expressions $Q_1$ and $Q_2$ have the following characteristics:

• each reduced atomic descriptor contains $p$ descriptors,
• $\max(p_1, p_2) \le p \le p_1 + p_2$,
• $r_1 = \min(c_1 \times 2^{p-2},\, 2^p)$,
• $r_2 = \min(c_2 \times 2^{p-2},\, 2^p)$,
• $\tilde{Q}_1$ has $(r_1 \times p)$ descriptors,
• $\tilde{Q}_2$ has $(r_2 \times p)$ descriptors.

In $\tilde{Q}_1$ and $\tilde{Q}_2$, each compact atomic descriptor containing two descriptors is expanded to $2^{p-2}$ reduced atomic descriptors containing $p$ descriptors. However, some of these reduced atomic descriptors are duplicates. Therefore, the total number should not exceed $2^p$, which is the number of all possible combinations of reduced atomic descriptors containing $p$ descriptors. The space complexity of $\tilde{Q}_1$ can be derived as

\[
\begin{aligned}
c_1 &= \left(\frac{n_1+1}{4}\right)^2, \\
r_1 &= \min\!\left( \left(\frac{n_1+1}{4}\right)^2 \times 2^{p-2},\; 2^p \right) \\
    &= \begin{cases}
       (n_1+1)^2 \times 2^{p-6} & \text{if } n_1 < 7, \\
       2^p & \text{otherwise,}
       \end{cases} \\
\mathrm{Space}(\text{RDNF}) &= r_1 \times p + (r_1 \times p - 1) \\
    &\le 2^{p+1} p - 1 \\
    &= O(2^p p),
\end{aligned}
\]

where $(r_1 \times p)$ is the number of descriptors in $\tilde{Q}_1$ and $(r_1 \times p - 1)$ is the number of logical operators in $\tilde{Q}_1$.

The time for transforming CDNF to RDNF consists of

1) expanding each compact descriptor, and
2) checking and removing duplicate reduced atomic descriptors.

Because step 2 can be done as step 1 is being executed, it is omitted in our analysis. Thus, the time complexity for an n-node binary tree is

\[
\begin{aligned}
\mathrm{Time}(\text{CDNF} \to \text{RDNF}) &= \left(\frac{n+1}{4}\right)^2 \times 2^{p-2} \times p \\
&= O(2^p p\, n^2).
\end{aligned}
\]

5.3 Computation Time

To calculate the similarity measure $S$ between two CDNFs, we need to compare their compact atomic descriptors. Using the notation given above, we further define that the $i$th ($1 \le i \le c_1$) atomic descriptor of $\hat{Q}_1$ contains $k_i$ descriptors and the $j$th ($1 \le j \le c_2$) atomic descriptor of $\hat{Q}_2$ contains $h_j$ descriptors, where $k_i \le p_1$, $h_j \le p_2$, and $\max(p_1, p_2) \le p \le p_1 + p_2$.

To speed up the computation, all the descriptors within the compact atomic descriptors or reduced atomic descriptors are sorted before calculating their similarities. Therefore it takes $\sum_{i=1}^{c_1} k_i \log k_i$ time to sort $\hat{Q}_1$, and $\sum_{j=1}^{c_2} h_j \log h_j$ time to sort $\hat{Q}_2$. To compare these two CDNFs term-by-term, it takes $\sum_{i=1}^{c_1} \sum_{j=1}^{c_2} \max(k_i, h_j)$ time. Hence,

\[
\begin{aligned}
\mathrm{Time}_{S}(\text{computation})
&= \sum_{i=1}^{c_1} k_i \log k_i + \sum_{j=1}^{c_2} h_j \log h_j + \sum_{i=1}^{c_1} \sum_{j=1}^{c_2} \max(k_i, h_j) \\
&\le \sum_{i=1}^{c_1} p_1 \log p_1 + \sum_{j=1}^{c_2} p_2 \log p_2 + c_1 c_2 \max(p_1, p_2) \\
&\le (c_1 + c_2)\, p \log p + c_1 c_2\, p.
\end{aligned}
\]

Similarly, we transform the same two Boolean expressions to RDNFs $\tilde{Q}_1$ and $\tilde{Q}_2$, where each reduced atomic descriptor contains exactly $p$ descriptors. Using the same optimal sorting method, it takes $\sum_{i=1}^{r_1} p \log p + \sum_{j=1}^{r_2} p \log p$ time to sort $\tilde{Q}_1$ and $\tilde{Q}_2$, and $\sum_{i=1}^{r_1} \sum_{j=1}^{r_2} p$ time to compare them. Thus,

\[
\begin{aligned}
\mathrm{Time}_{S^*}(\text{computation})
&= \sum_{i=1}^{r_1} p \log p + \sum_{j=1}^{r_2} p \log p + \sum_{i=1}^{r_1} \sum_{j=1}^{r_2} p \\
&= (r_1 + r_2)\, p \log p + r_1 r_2\, p.
\end{aligned}
\]

Because RDNF is obtained by expanding its CDNF, we are certain that $c_1 \le r_1$ and $c_2 \le r_2$. Therefore, $\mathrm{Time}_{S}(\text{computation})$ is always less than or equal to $\mathrm{Time}_{S^*}(\text{computation})$.

If $Q_1$ and $Q_2$ are both n-node binary trees as described above, then $k_i = 2$ ($1 \le i \le c_1$) and $h_j = 2$ ($1 \le j \le c_2$). Thus,

\[
\begin{aligned}
\mathrm{Time}_{S}(\text{computation})
&= \left(\frac{n+1}{4}\right)^2 2 \log 2 + \left(\frac{n+1}{4}\right)^2 2 \log 2 + \left(\frac{n+1}{4}\right)^2 \left(\frac{n+1}{4}\right)^2 2 \\
&= O(n^4), \\
\mathrm{Time}_{S^*}(\text{computation})
&= 2^p p \log p + 2^p p \log p + 2^p\, 2^p p \\
&= O(2^{2p} p).
\end{aligned}
\]
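The two transformations analyzed in Sections 5.1 and 5.2 can be illustrated with a small Python sketch. It is not the tree-based implementation described in this paper: it assumes a set-based encoding in which a literal is a `(name, polarity)` pair and each atomic descriptor is a `frozenset` of literals, and the function names `to_cdnf` and `to_rdnf` are hypothetical. The sketch applies the distributive law to obtain the CDNF and then expands each compact atomic descriptor over all descriptors to obtain the RDNF; it does not attempt absorption or any other simplification.

```python
from itertools import product


def _contradictory(term):
    """True if a conjunction contains both t and (not t)."""
    names = [name for name, _ in term]
    return len(names) != len(set(names))


def to_cdnf(expr):
    """Return the CDNF of `expr` as a set of frozensets of literals.

    `expr` is a descriptor name (str), ("not", name), or
    ("and" | "or", left, right).
    """
    if isinstance(expr, str):
        return {frozenset({(expr, True)})}
    op = expr[0]
    if op == "not":
        return {frozenset({(expr[1], False)})}
    left, right = to_cdnf(expr[1]), to_cdnf(expr[2])
    if op == "or":
        return left | right
    if op == "and":
        # Distributive law: (A or B) and C = (A and C) or (B and C).
        terms = {a | b for a, b in product(left, right)}
        return {t for t in terms if not _contradictory(t)}
    raise ValueError(f"unknown operator: {op!r}")


def to_rdnf(cdnf, descriptors):
    """Expand every compact atomic descriptor over all `descriptors`."""
    rdnf = set()
    for term in cdnf:
        present = {name for name, _ in term}
        missing = sorted(set(descriptors) - present)
        # Each missing descriptor appears once plain and once negated.
        for polarities in product((True, False), repeat=len(missing)):
            rdnf.add(term | frozenset(zip(missing, polarities)))
    return rdnf  # duplicates collapse because rdnf is a set


# Q = (t1 or t2) and t3, the expression used in Case 1.
query = ("and", ("or", "t1", "t2"), "t3")
cdnf = to_cdnf(query)
rdnf = to_rdnf(cdnf, ["t1", "t2", "t3"])
print(len(cdnf), len(rdnf))  # prints "2 3"
```

For this query the CDNF holds two compact atomic descriptors of two descriptors each, while the RDNF holds three reduced atomic descriptors of three descriptors each, which matches (5) and the final form of $\tilde{Q}$ in Case 1.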
5.4 Remarks

Below, we summarize the previous time and space analysis.

\[
\begin{aligned}
\mathrm{Time}_{S} &= \mathrm{Time}_{S}(\text{transformation}) + \mathrm{Time}_{S}(\text{computation}) \\
&= \mathrm{Time}(\text{Boolean expression} \to \text{CDNF}) + \mathrm{Time}_{S}(\text{computation}) \\
&= O(n^2) + O(n^4), \\
\mathrm{Time}_{S^*} &= \mathrm{Time}_{S^*}(\text{transformation}) + \mathrm{Time}_{S^*}(\text{computation}) \\
&= \mathrm{Time}(\text{Boolean expression} \to \text{CDNF}) + \mathrm{Time}(\text{CDNF} \to \text{RDNF}) + \mathrm{Time}_{S^*}(\text{computation}) \\
&= O(n^2) + O(2^p p\, n^2) + O(2^{2p} p), \\
\mathrm{Space}_{S} &= \mathrm{Space}(\text{CDNF}) \\
&= O(n^2), \\
\mathrm{Space}_{S^*} &= \mathrm{Space}(\text{CDNF}) + \mathrm{Space}(\text{RDNF}) \\
&= O(n^2) + O(2^p p).
\end{aligned}
\]

The above comparisons are analyzed based on a pair of Boolean expressions only. For $(N + 1)$ Boolean expressions, consisting of one incoming query and $N$ server descriptions, their time and space complexities are in proportion to $N$ [18]. As discussed previously, an n-node binary tree consists of $\frac{n+1}{2}$ leaves (or descriptors in the Boolean expression). The number of distinct descriptors $p$ must be no larger than $\frac{n+1}{2}$, i.e., $p \le \frac{n+1}{2} = O(n)$. Thus, the complexities of the two measures $S$ and $S^*$ can be simplified as shown in Table 2.

TABLE 2
TIME AND SPACE COMPLEXITIES OF S AND S* FOR ONE USER QUERY AGAINST N SERVER DESCRIPTIONS

                     S               S*
time complexity      O(N n^4)        O(N 2^{2n} n)
space complexity     O(N n^2)        O(N 2^n n)

Both the user query and the server descriptions are n-node binary trees.

Apparently, $S$ outperforms $S^*$ in both time and space complexities. The above analysis shows that our similarity measure based on CDNFs consumes up to exponentially less time and space than Radecki's method. The following example further illustrates the performance difference between the two measures.

EXAMPLE 8. Consider a directory of services containing 100 server descriptions, each consisting of five descriptors. The time and space used to calculate the similarities $S^*$ and $S$ for a five-descriptor user query are:

\[
\begin{aligned}
\mathrm{Time}_{S^*}(\text{1 query, 100 server descriptions}) &= 100 \times 2^{2 \times 5} \times 5 = 512{,}000, \\
\mathrm{Time}_{S}(\text{1 query, 100 server descriptions}) &= 100 \times 5^4 = 62{,}500, \\
\mathrm{Space}_{S^*}(\text{1 query, 100 server descriptions}) &= 100 \times 2^5 \times 5 = 16{,}000, \\
\mathrm{Space}_{S}(\text{1 query, 100 server descriptions}) &= 100 \times 5^2 = 2{,}500.
\end{aligned}
\]

When using $S$, the directory of services is about eight times faster in searching for the relevant servers, and takes only about one-sixth the space of $S^*$.

6 IMPLEMENTATION

In the client-directory-server model, the directory of services ranks the servers by comparing their descriptions with the query. Both the query and the server descriptions need to be normalized before the comparison. In our method, the normalization of a Boolean expression is independent of other Boolean expressions in the comparison. Therefore, we can prenormalize the server descriptions and store them in the directory of services. Below, we describe the implementation of our Boolean similarity measure.

We use the UNIX tools flex and bison to parse the nested Boolean expressions and build the associated binary parse trees. Each attribute-value pair in the user query and the server description is presented as a three-element subtree in the binary parse tree. The three-element subtree consists of one parent node and two child nodes. The left and right child nodes, i.e., the leaves, are the attribute name and its value, respectively. The leaves are joined by the parent node, which is a relational operator (= or ≠). These subtrees are merged by the logical operators (and and or) to form the binary parse tree.

The binary parse trees are transformed to their equivalent CDNF binary trees based on the distributive law. Notice that, while replicating the subtree (such as C in Fig. 4), we only copy the logical operator nodes in order to save space. For relational operator nodes, only their associated pointers are copied. All the nodes in the binary tree whose parents are or are linked together after the distributive normalization. Consider the following example.

EXAMPLE 9. Let $Q_1$ be an incoming user query and $R_A$, $R_B$, $R_C$ be three server descriptions stored as CDNFs ($\hat{R}_A$, $\hat{R}_B$, $\hat{R}_C$) in the directory of services,

$Q_1$: ((keyword = network) or (keyword = UNIX)) and (author = Smith),
$\hat{R}_A$: (keyword = network),
$\hat{R}_B$: (keyword = database) or (keyword = computer),
$\hat{R}_C$: ((keyword = UNIX) and (author = Smith)) or ((keyword = database) and (author = McLeod)).
Fig. 7. User query $Q_1$ before normalization.

Fig. 8. Normalized user query $\hat{Q}_1$. The head links all the nodes whose parents are or. The dashed subtree is a replicated subtree. The $\hat{Q}_1^i$ is one of the compact atomic descriptors in $\hat{Q}_1$.

Fig. 9. Normalized server description $\hat{R}_C$. The head links all the nodes whose parents are or. The $\hat{R}_C^j$ is one of the compact atomic descriptors in $\hat{R}_C$.

The $Q_1$ is normalized to $\hat{Q}_1$ before comparison,

$\hat{Q}_1$ = ((keyword = network) and (author = Smith)) or ((keyword = UNIX) and (author = Smith)).

The similarity values between the user query and the three server descriptions are

\[
\begin{aligned}
S(Q_1, R_A) &= \sum_{i=1}^{2} \sum_{j=1}^{1} s(\hat{Q}_1^i, \hat{R}_A^j) = \frac{1}{4}, \\
S(Q_1, R_B) &= \sum_{i=1}^{2} \sum_{j=1}^{2} s(\hat{Q}_1^i, \hat{R}_B^j) = 0, \\
S(Q_1, R_C) &= \sum_{i=1}^{2} \sum_{j=1}^{2} s(\hat{Q}_1^i, \hat{R}_C^j) = \frac{1}{3}.
\end{aligned}
\]

Figs. 7 and 8 show the binary parse tree of the user query $Q_1$ in Example 9 before and after normalization. Fig. 9 shows the server description $R_C$ after normalization. The link generated in each normalized binary tree is pointed at by head.

After the normalization process, we compare each component in the links of the two binary trees. Each element in the server description link represents a compact atomic descriptor $\hat{R}_C^j$ in the server description $\hat{R}_C$. Each element in the user query link represents a compact atomic descriptor $\hat{Q}_1^i$ in the user query $\hat{Q}_1$. To calculate $s(\hat{Q}_1^i, \hat{R}_C^j)$, we compare all the nodes under $\hat{Q}_1^i$ with all the nodes under $\hat{R}_C^j$ and find out the number of uncommon nodes between them. Then we sum up all the $s(\hat{Q}_1^i, \hat{R}_C^j)$ to get $S(Q_1, R_C)$.

Similarly, we calculate $S(Q_1, R_A)$ and $S(Q_1, R_B)$, sort $R_A$, $R_B$, and $R_C$ in descending order of their similarity values with $Q_1$, and return the result to the user.
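The comparison and ranking steps of Example 9 can be sketched in Python as follows. The sketch makes two assumptions that are not taken from the implementation: each compact atomic descriptor is encoded as a `frozenset` of attribute-value pairs, and the number of uncommon nodes is read as the size of the symmetric difference of two such sets. It does not recompute $s$ or $S$; the similarity values (1/4, 0, and 1/3) are taken as given, and the helper names are hypothetical.

```python
from fractions import Fraction

# Compact atomic descriptors of the normalized query and of the server
# description RC, each encoded as a set of (attribute, value) pairs.
q1_hat = [
    frozenset({("keyword", "network"), ("author", "Smith")}),
    frozenset({("keyword", "UNIX"), ("author", "Smith")}),
]
rc_hat = [
    frozenset({("keyword", "UNIX"), ("author", "Smith")}),
    frozenset({("keyword", "database"), ("author", "McLeod")}),
]


def uncommon(a, b):
    """Count the attribute-value pairs that occur in only one of a and b."""
    return len(a ^ b)


# counts[i][j] compares the i-th query descriptor with the j-th RC descriptor.
counts = [[uncommon(q, r) for r in rc_hat] for q in q1_hat]
print(counts)  # [[2, 4], [0, 4]]

# Similarity values S(Q1, R) taken as given from Example 9.
similarities = {
    "RA": Fraction(1, 4),
    "RB": Fraction(0),
    "RC": Fraction(1, 3),
}


def rank_servers(scores):
    """Return server names in descending order of similarity."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


ranking = rank_servers(similarities)
print(ranking)  # ['RC', 'RA', 'RB']
```

The second query descriptor matches the first descriptor of RC exactly (zero uncommon pairs), which is consistent with RC receiving the highest rank.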
7 CONCLUSIONS

We have developed a new method using compact disjunctive normal form (CDNF) to rank the similarity between Boolean expressions. We compared our method with Radecki's measure on two databases and used the Spearman rank coefficients and the confidence intervals to show that our method can get a closer ranking order to that generated by Jaccard's coefficient. The theoretical analysis proves that this new measure outperforms the one proposed by Radecki significantly in terms of time and space complexity. These results demonstrate that our similarity measure can greatly improve the searching process in today's world of overwhelming information.

In addition to ranking results, similarity estimates can be used to help identify similar, but autonomously managed, retrieval systems. For example, the similarity measure can be used to cluster servers with similar descriptions in a single directory entry. When the similarity measures of two servers exceed a certain value, they can be merged to remove redundancy. Moreover, the administrator can create new servers by using the most frequently asked queries as the filter and select other relevant servers as its information sources. Thus, most user queries can be satisfied by a small number of servers, which reduces search time. For people using Boolean expressions to represent their interests, such as collaborative filtering [19] or user profile [20], [21], the similarity measure can help find other individuals having common interests, so that they may share their collections. Our method can also benefit systems that support automatic query formulations by relevance feedback [22], [7], [23], where the reformed queries could be in complex Boolean forms.

ACKNOWLEDGMENTS

This work was supported in part by the Advanced Research Projects Agency under contract number DABT63-93-C-0052, HBP NIH grant 1-P20-MH/DA52194-01A1, National Science Foundation Institutional Infrastructure grant number CDA-9216321, and NSF NYI grant number NCR-9457518.

REFERENCES

[1] S.-H. Li and P.B. Danzig, "Vocabulary Problem in Internet Resource Discovery," Proc. Second Int'l Workshop Next Generation Information Technologies and Systems, pp. 139-145, Naharia, Israel, June 1995. Available from ftp://catarina.usc.edu/shli/ngits.ps.gz.
[2] D.R. Hardy and M.F. Schwartz, "Essence: A Resource Discovery System Based on Semantic File Indexing," Proc. Winter 1993 Usenix Conf., pp. 361-374, Jan. 1993.
[3] B. Kahle and A. Medlar, "An Information System for Corporate Users: Wide Area Information Servers," ConneXions - The Interoperability Report, vol. 5, no. 11, pp. 2-9, 1991.
[4] P.B. Danzig, S.-H. Li, and K. Obraczka, "Distributed Indexing of Autonomous Internet Services," Computing Systems, vol. 5, no. 4, pp. 433-459, 1992.
[5] P.G. Anick, J.D. Brennan, R.A. Flynn, D.R. Hanssen, B. Alvey, and J.M. Robbins, "A Direct Manipulation Interface for Boolean Information Retrieval via Natural Language Query," Proc. 13th Ann. Int'l ACM SIGIR Conf., pp. 135-150, Brussels, Sept. 1990.
[6] D. Young and B. Shneiderman, "A Graphical Filter/Flow Representation of Boolean Queries: A Prototype Implementation and Evaluation," J. Am. Soc. Information Science, vol. 44, no. 6, pp. 327-339, July 1993.
[7] V.I. Frants and J. Shapiro, "Algorithms for Automatic Construction of Query Formulations in Boolean Form," J. Am. Soc. Information Science, vol. 42, no. 1, pp. 16-26, Jan. 1991.
[8] K. Obraczka, P.B. Danzig, and S.-H. Li, "Internet Resource Discovery Services," Computer, vol. 26, no. 9, pp. 8-22, Sept. 1993.
[9] A. Emtage and P. Deutsch, "Archie: An Electronic Directory Service for the Internet," Proc. Winter 1992 Usenix Conf., pp. 93-110, 1992.
[10] M.A. Sheldon, A. Duda, R. Weiss, J.W. O'Toole Jr., and D.K. Gifford, "A Content Routing System for Distributed Information Servers," Proc. Fourth Int'l Conf. Extending Database Technology, Cambridge, England, Mar. 1994.
[11] L. Gravano, H. Garcia-Molina, and A. Tomasic, "The Efficacy of GlOSS for the Text Database Discovery Problem," Technical Report STAN-CS-TN-93-2, Stanford Univ., 1993.
[12] C.J. van Rijsbergen, Information Retrieval, second ed. London: Butterworth & Co. (Publishers) Ltd., 1979.
[13] T. Radecki, "A Model of a Document-Clustering-Based Information Retrieval System with a Boolean Search Request Formulation," Information Retrieval Research, R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, and P.W. Williams, eds., pp. 334-344. London: Butterworth & Co. (Publishers) Ltd., 1981.
[14] T. Radecki, "Similarity Measures for Boolean Search Request Formulations," J. Am. Soc. Information Science, vol. 33, no. 1, pp. 8-17, 1982.
[15] M. Kendall and J.D. Gibbons, Rank Correlation Methods, fifth ed. London: Edward Arnold, 1990.
[16] R. Jain, The Art of Computer Systems Performance Analysis. New York: John Wiley & Sons, 1991.
[17] R. Sedgewick, Algorithms, second ed. Reading, Mass.: Addison-Wesley, 1988.
[18] S.-H. Li and P.B. Danzig, "Boolean Similarity Measures for Resource Discovery," Technical Report USC-CS-94-579, Univ. of Southern California, 1994.
[19] D. Goldberg, D. Nichols, B.M. Oki, and D. Terry, "Using Collaborative Filtering to Weave an Information Tapestry," Comm. ACM, vol. 35, no. 12, pp. 61-70, Dec. 1992.
[20] C. Danilowicz, "Modeling of User Preferences and Needs in Boolean Retrieval Systems," Information Processing & Management, vol. 30, no. 3, pp. 363-378, 1994.
[21] T.W. Yan and H. Garcia-Molina, "Index Structures for Selective Dissemination of Information Under the Boolean Model," ACM Trans. Database Systems, vol. 19, no. 2, pp. 332-364, June 1994.
[22] M. Dillon and J. Desper, "The Use of Automatic Relevance Feedback in Boolean Retrieval Systems," J. Documentation, vol. 36, no. 3, pp. 197-208, Sept. 1980.
[23] G. Salton, E.A. Fox, and E.M. Voorhees, "Advanced Feedback Methods in Information Retrieval," J. Am. Soc. Information Science, vol. 36, no. 3, pp. 200-210, May 1985.

Shih-Hao Li received his BS in communication engineering from the National Chiao-Tung University, Hsinchu, Taiwan, in 1985, his MS in computer engineering in 1991, and his PhD in computer science in 1996, both from the University of Southern California. His research interests include Internet resource discovery, information retrieval, and distributed computing. He is currently a senior software engineer at Infoseek Corporation. Dr. Li is a member of the ACM and the IEEE Computer Society.

Peter B. Danzig received his BS in applied physics from the University of California, Davis, in 1982 and his PhD in computer science from the University of California, Berkeley, in 1989. He is currently on leave from the University of Southern California, where he is an associate professor and is chief Internet architect at Network Appliance. His research addresses both building scalable Internet information systems and flow, congestion, and admission control algorithms for the Internet. He has served on several ACM SIGCOMM and ACM SIGMETRICS program committees. He is a member of the IEEE and the ACM.
