SSW: A Small-World-Based
Overlay for Peer-to-Peer Search
Mei Li, Member, IEEE, Wang-Chien Lee, Member, IEEE,
Anand Sivasubramaniam, Senior Member, IEEE, and Jing Zhao

• M. Li is with Microsoft Corporation, One Microsoft Way, Redmond, WA 98052. E-mail: meili@microsoft.com.
• W.-C. Lee is with the Department of Computer Science & Engineering, Pennsylvania State University, 360D IST Building, University Park, PA 16802. E-mail: wlee@cse.psu.edu.
• A. Sivasubramaniam is with the Department of Computer Science & Engineering, Pennsylvania State University, 354A IST Building, University Park, PA 16802. E-mail: anand@cse.psu.edu.
• J. Zhao is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. E-mail: zhaojing@cs.ust.hk.

Manuscript received 6 Sept. 2006; revised 30 Mar. 2007; accepted 5 June 2007; published online 9 Aug. 2007. Recommended for acceptance by G. Lee. Digital Object Identifier no. 10.1109/TPDS.2007.70757.
Abstract—Peer-to-peer (P2P) systems have become a popular platform for sharing and exchanging voluminous information among
thousands or even millions of users. The massive amount of information shared in such systems mandates efficient semantic-based
search instead of key-based search. The majority of existing proposals can only support simple key-based search rather than
semantic-based search. This paper presents the design of an overlay network, namely, semantic small world (SSW), that facilitates
efficient semantic-based search in P2P systems. SSW achieves this efficiency through four ideas: 1) semantic clustering, where peers with similar semantics organize into peer clusters; 2) dimension reduction, where peer clusters are adaptively mapped to a one-dimensional naming space, avoiding the high maintenance overhead of capturing high-dimensional data semantics in the overlay; 3) small world network, where peer clusters form a one-dimensional small world network that is search efficient with low maintenance overhead; and 4) efficient search algorithms, where peers perform efficient semantic-based searches, including approximate point queries and range queries, in the proposed overlay. Extensive experiments using both synthetic data and real data
demonstrate that SSW is superior to the state of the art on various aspects, including scalability, maintenance overhead, adaptivity to
distribution of data and locality of interest, resilience to peer failures, load balancing, and efficiency in support of various types of
queries on data objects with high dimensions.
Index Terms—Distributed systems, distributed indexing, overlay, Internet content sharing, peer-to-peer systems.
1 INTRODUCTION
PEER-TO-PEER (P2P) systems have become a popular platform for sharing and exchanging resources and voluminous information among thousands or even millions of users. In contrast to the traditional client-server computing models, a host node in P2P systems can act as both a server and a client. Despite avoiding performance bottlenecks and single points of failure, these decentralized systems present fundamental challenges when searching for resources (for example, data and services) available at one or more of these numerous host nodes.

These challenges have motivated the proposals of distributed hash tables (DHTs), for example, CAN [19], Chord [22], Pastry [21], and Tapestry [25], which use hashed keys to direct the searches to the specific peer(s) holding the requested data objects. Although these techniques address the scalability issue with respect to the number of peers in the P2P system, it is equally important to address the voluminous information content of such systems. Given the vast repositories of information (for example, documents) shared in P2P systems, it is undesirable to require users to remember the key or identifier associated with a document in order to search for such a document. Instead, it is more favorable to allow users to issue searches based on the content of documents (just as on the Internet today). This mandates the employment of content-based/semantic-based searches.1 The primary goal of this study is to design a P2P overlay network that supports efficient semantic-based search.

Various digital objects such as documents and multimedia can be represented and stored as data objects in P2P systems. The semantics or features of these data objects can be identified by a k-element vector, namely, a Semantic Vector (SV; also called a feature vector in the literature). SVs can be derived from the content or metadata of the objects. Each element in the vector represents a particular feature or attribute associated with the data object, with a weight indicating the importance of this feature element in representing the semantics of the data object. The SV of a data object can be mapped to a point in a k-dimensional space. Thus, each data object can be seen as a point in a multidimensional semantic space. As a result, queries on data objects in this semantic space can be specified in terms of these attributes. The number of attributes (elements) capturing the semantics of data objects is normally very large, and thus, the dimensionality of the semantic space, k, is high. For instance, the dimensionality of the SV for documents (latent semantic vector [6]) is around 50-300. The euclidean distance is used to represent the semantic distance between two SVs, whereas SVs are assumed to be unit vectors.

1. We do not exploit the differences between semantic-based and content-based searches. These two terms are used interchangeably in this paper.
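To make the SV representation and the euclidean semantic distance above concrete, here is a minimal sketch (ours, not from the paper; the helper names are illustrative):

```python
import numpy as np

def make_sv(feature_weights):
    """Normalize raw feature weights into a unit-length semantic vector (SV)."""
    v = np.asarray(feature_weights, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def semantic_distance(sv_a, sv_b):
    """Euclidean distance between two SVs; smaller means semantically closer."""
    return float(np.linalg.norm(sv_a - sv_b))

# Two toy 4-dimensional "documents" (real latent semantic vectors have k around 50-300).
doc1 = make_sv([0.9, 0.3, 0.0, 0.1])
doc2 = make_sv([0.8, 0.4, 0.1, 0.0])
print(semantic_distance(doc1, doc2))  # small value, so the documents are similar
```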
Fig. 3. An illustrative example of ASL. (a) ClusterIDs. (b) GPT. (c) LPT. (d) SSW-1D.
starting from the root to the leaf node corresponding to this peer's subspace. Thus, we call this partial partition history a local partition tree (LPT). In the LPT, each node (except the leaf node) has two children representing the two subspaces generated in a space partitioning event. Between the two subspaces, the one that the peer does not reside in may undergo further partitioning that is not known to the peer, and thus, it is called an old neighbor subspace. The reader should not confuse the old neighbor subspaces with short-range contacts or long-range contacts. Although a peer has maintained the history of the data space partition events generating these subspaces in its LPT, it does not necessarily have contacts in each of these old neighbor subspaces.

Fig. 3c illustrates the LPT for a peer residing in cluster 10 (marked by dark rectangles). Cluster 10 has been involved in four partitioning events, generating four old neighbor subspaces (marked by light rectangles). In this figure, we also show how the subspace corresponding to a leaf node in the LPT is partitioned further. Based on this figure, we can see that the two old neighbor subspaces (that is, the tree nodes with labels 0000 and 1100) are partitioned further and each consists of a collection of smaller subspaces.

Fig. 3d illustrates SSW-1D built upon the ClusterIDs. A peer in cluster 4 maintains short-range contacts to the neighboring clusters 2 and 5. It also maintains a long-range contact to a distant cluster 8. Peers in clusters 8, 10, and 12 maintain long-range contacts to clusters 13, 6, and 14, respectively. The contacts of other peers are not shown here for readability.
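As an illustration of the partition history a peer keeps (Fig. 3c), the following sketch (our own simplification; the field names are hypothetical) models an LPT node that records the tree label, the depth, and the subspace produced by each partitioning event:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]            # per-dimension [low, high) range of a subspace

@dataclass
class LPTNode:
    label: str                            # binary tree label, e.g., "10" or "1100"
    depth: int                            # number of partitioning events from the root
    bounds: List[Interval]                # the subspace this node covers
    left: Optional["LPTNode"] = None      # child whose label appends bit 0
    right: Optional["LPTNode"] = None     # child whose label appends bit 1

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

# A 2-dimensional space split once along dimension 0; the peer stays in subspace "1",
# so subspace "0" is an old neighbor subspace that may be partitioned further elsewhere.
root = LPTNode("", 0, [(0.0, 1.0), (0.0, 1.0)])
root.left = LPTNode("0", 1, [(0.0, 0.5), (0.0, 1.0)])
root.right = LPTNode("1", 1, [(0.5, 1.0), (0.0, 1.0)])
```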
5 SEARCH

In this section, we explain the search process in SSW-1D. For the details of overlay maintenance, please refer to [14]. For simplicity, we refer to SSW-1D as SSW in the rest of this paper, where the context is clear. We present the algorithms for two common types of semantic-based searches, that is, approximate point query (denoted as point* query) and range query. A point* query, specified by a query SV or query point q, returns the data objects matching q if there is such a data object in the system (similar to a regular point query); otherwise, it returns a data object similar to q (different from a regular point query). A range query, specified by a query point q and a query radius r, returns the data objects with dissimilarity to the query point q smaller than the specified range r. Note that a point query can be treated as a special case of range query, with r = 0.

To process these queries, we need to address two issues:

• Search space resolution (SSR). To process a query, a peer first needs to determine which portions of the overlay (peers and data subspaces) host the requested data (or the index information for the data). Since the overlay (index) structure of SSW is constructed over the linearized naming space, and each peer only has partial knowledge of the mapping from the high-dimensional data space to the one-dimensional overlay, SSR is a nontrivial issue.
• Multidestination query propagation (MDQP). By the nature of range queries, it is easy to observe that the searched data may be spread out in multiple semantic subspaces. Therefore, for a range query, we need to address how a query can efficiently be propagated to multiple peer clusters that may have the qualified data. The lack of complete neighborhood information in all data dimensions makes this issue very difficult to solve.

The solution to SSR is presented in Section 5.1, followed by the algorithms for point* and range query in Sections 5.2 and 5.3, respectively. The solution to MDQP is also discussed in Section 5.3.

5.1 Search Space Resolution

Since SSW is constructed over the one-dimensional naming space captured by ClusterIDs, SSR is basically to figure out the ClusterIDs of the peer clusters intersecting with the search space (or query region) associated with a given query. We propose a localized algorithm for SSR based on the concept of the LPT as follows: A peer examines its LPT, starting from the root node, and decides whether a subtree in the LPT should be pruned or not by examining whether the corresponding subspace intersects with the search space or not. To determine whether a subspace intersects with the search space, the peer first computes the minimum distance mindist from the query point to the subspace. If the computed mindist is greater than r (the range of the search space), the subspace does not intersect with the search space, and thus, the corresponding subtree is pruned. Otherwise, the corresponding subtree is examined further. Both subtrees may be examined if both corresponding subspaces intersect with the query region. This process continues till the leaf nodes in the LPT, called terminal nodes, are reached.

We assign a pseudo cluster name (PCN) to the corresponding subspace of a terminal node. A PCN is a tuple ⟨a, h⟩, denoted as a_h, where a and h are set to the tree label and depth of the corresponding terminal node in the LPT. Here, a and h correspond to the value of the PCN and the number of bits resolved in the PCN, respectively. If a PCN's value is the same as the ClusterID of the current peer, this PCN is completely resolved. Otherwise, the PCN may be refined further by invoking SSR at the peers managing the old neighbor subspace corresponding to the terminal node. Algorithm 1 shows the pseudocode for SSR at a peer.

Algorithm 1: Algorithm for SSR.
Peer i issues ssr(x, q, r, a, h): i.ssr(x, q, r, a, h) (x, set to the root node initially, is the current tree node to be examined in i's LPT. q and r are the center and range of the search space. a and h, set to 0 initially, indicate the value and the number of bits resolved in the PCN.)
1: if x is the leaf node in i.LPT then
2:   Return a_h.
3: else
4:   h = h + 1;
5:   if mindist(q, x.left) ≤ r then
6:     Set the hth most significant bit of a to 0.
7:     i.ssr(x.left, q, r, a, h).
8:   end if
9:   if mindist(q, x.right) ≤ r then
10:    Set the hth most significant bit of a to 1.
11:    i.ssr(x.right, q, r, a, h).
12:  end if
13: end if

Fig. 4. An illustrative example of SSR. (a) Search space. (b) GPT. (c) SSW-1D.

Fig. 4a illustrates an example for SSR. The large circle q2 intersecting with clusters 8, 11, 12, and 14 depicts the search space (also marked on the GPT in Fig. 4b). A peer in cluster 10 resolves the search space and obtains the corresponding PCNs as 8_3, 11_4, and 12_2. The search space corresponding to PCN 12_2 is later resolved as 12_4 and 14_3 by a peer residing in the old neighbor subspace corresponding to the tree node with label 1100 in cluster 10's LPT.
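To complement Algorithm 1, the following is a compact Python rendering of the same pruning idea, reusing the LPTNode sketch above (an illustrative sketch under our assumptions; unlike the pseudocode, it collects every intersecting terminal PCN into a list):

```python
import math

def mindist(q, bounds):
    """Minimum euclidean distance from query point q to an axis-aligned subspace."""
    total = 0.0
    for qi, (lo, hi) in zip(q, bounds):
        d = lo - qi if qi < lo else (qi - hi if qi > hi else 0.0)
        total += d * d
    return math.sqrt(total)

def ssr(node, q, r, a=0, h=0, pcns=None):
    """Collect the PCNs (a, h) of terminal LPT nodes whose subspaces intersect the search space."""
    if pcns is None:
        pcns = []
    if node.is_leaf():                                # terminal node: report the PCN resolved so far
        pcns.append((a, h))
        return pcns
    h += 1
    if mindist(q, node.left.bounds) <= r:             # prune the left subtree unless it intersects
        ssr(node.left, q, r, (a << 1), h, pcns)       # bit h of the PCN is 0
    if mindist(q, node.right.bounds) <= r:            # prune the right subtree unless it intersects
        ssr(node.right, q, r, (a << 1) | 1, h, pcns)  # bit h of the PCN is 1
    return pcns

# With the two-level LPT from the previous sketch, a point query (r = 0) at (0.9, 0.3)
# resolves to the single PCN (1, 1): value 1 with one bit resolved.
```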
5.2 Point* Query

The algorithm for processing a point* query consists of two stages: navigation across clusters, which navigates to the destination peer cluster covering the query point, and flooding search within cluster, which floods the query within the destination peer cluster for query results. When a message is received, a peer first checks whether q falls within the range of its peer cluster. If that is not the case, SSR is invoked to obtain the PCN for q based on the partition history (LPT) stored at this peer. The query message is then forwarded to the contact with the shortest distance to the PCN in the one-dimensional naming space. The above process is repeated until the cluster whose semantic subspace covers q is reached. At this point, flooding search within the cluster is invoked by flooding the message to other peers in the same peer cluster. Then, the data object with the highest similarity to the query is returned as the result.

Here, we show an example to illustrate point* query processing in SSW-1D. Let us go back to Fig. 4a. Assume that Peer 1 in Cluster 4 issues a point* query q1 = [0.9, 0.3], indicated by the small gray circle in Cluster 11. Peer 1 first checks its own cluster range. Since [0.9, 0.3] is not within the subspace of its peer cluster, Peer 1 invokes SSR and estimates the PCN for the query as 8_1. Peer 1 then forwards the message to Cluster 8, which invokes SSR and refines the PCN as 10_3. The message is then forwarded to Cluster 10, which refines the PCN further as 11_4. The query is finally forwarded to a peer in Cluster 11, which finds that q1 is within its own cluster range and, thus, floods the message within its peer cluster to obtain the query results.
Despite the need for SSR, navigation in SSW-1D is efficient. We group the routing hop(s) to resolve 1 bit of the PCN as a PCN resolving phase. A query may need to go through multiple PCN resolving phases, where each phase brings the query message halfway closer to the target. The following theorem obtains the search path length for SSW-1D (different from Theorem 1, where a peer has short-range contacts along all dimensions; in SSW-1D, a peer has only two short-range contacts in a k-dimensional space (k >> 1) and, thus, only has partial knowledge of its neighborhood).

Theorem 2. For a k-dimensional space, given an SSW-1D of N peers with maximum cluster size M and number of long-range contacts l, the average search path length for navigation across clusters is O(log³(2N/M)/l).

Proof. The average number of peer clusters in the system is 2N/M, which implies that a total of log(2N/M) bits in the PCN need to be resolved, requiring log(2N/M) PCN resolving phases. Starting from the most significant bit, each phase resolves 1 bit in the PCN, thereby reducing the distance to the target cluster by half. Let S denote the initial distance to the target cluster; that is, S = 2N/M. According to Theorem 1 (dimensionality k is 1 here), the path lengths for these log(2N/M) phases are log²(S)/l, log²(S/2)/l, log²(S/4)/l, ..., log²(S/2^(log S − 1))/l, respectively. The search path length in SSW-1D is the summation of all these path lengths incurred by the different phases, which is O(log³(S)/l). Hence, the above theorem is proved.

During the query process, SSW adapts to the locality of users' interests as follows: Each peer maintains a query-hit list, which consists of the peers who have query hits (provide query results) in the past X queries issued by this peer. For every X queries, a node replaces the long-range contact having the lowest hit rate with the entry having the highest hit rate in the query-hit list with a probability of d_i/(d_i + d_j), where d_i and d_j represent the naming distances of the old long-range contact and the candidate long-range contact to the current peer, respectively. We prove that this update strategy maintains the properties of small world networks; that is, the distribution of long-range contacts is still proportional to 1/d, where d is the distance of a long-range contact.

Theorem 3. By replacing a long-range contact at distance d_i by another peer at distance d_j with probability d_i/(d_i + d_j), the distribution of long-range contacts is proportional to 1/d.

Proof. Basically, if we can prove that, at the steady state, the probability that a peer at distance d becomes a long-range contact, denoted by p_in, is equal to the probability that a long-range contact at distance d is replaced by other peers, denoted by p_out, we prove the above theorem.

Let p_i denote the probability that peer i is the long-range contact of the current peer, which is equal to C/d_i, where C is the normalization constant that brings the total probability to 1. Let p_{i→j} denote the probability that a long-range contact i is replaced by another peer j, which is equal to d_i/(d_i + d_j). We can then obtain p_out = Σ_{j=1}^{N} p_i · p_{i→j} = Σ_{j=1}^{N} (C/d_i) · (d_i/(d_i + d_j)) = Σ_{j=1}^{N} C/(d_i + d_j) and p_in = Σ_{j=1}^{N} p_j · p_{j→i} = Σ_{j=1}^{N} (C/d_j) · (d_j/(d_i + d_j)) = Σ_{j=1}^{N} C/(d_i + d_j). Since p_in = p_out, this proves the above theorem.
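A minimal sketch of this interest-driven update rule (ours; it simply applies the replacement probability d_i/(d_i + d_j) discussed above and in Theorem 3):

```python
import random

def maybe_replace_long_range_contact(d_old, d_candidate):
    """Replace the worst long-range contact (naming distance d_old) by the best query-hit
    candidate (naming distance d_candidate) with probability d_old / (d_old + d_candidate)."""
    p = d_old / (d_old + d_candidate)
    return random.random() < p

# Example: a distant, rarely useful contact (distance 64) is very likely to be swapped
# for a nearby peer (distance 4) that answered recent queries: p = 64/68.
print(maybe_replace_long_range_contact(d_old=64, d_candidate=4))
```

This biased coin flip is what keeps the long-range contacts distributed proportionally to 1/d even as the peer adapts to its query workload.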
5.3 Range Query

The query region (search space) of a range query may intersect with multiple subspaces, called candidate subspaces, managed by different peer clusters (called candidate clusters, accordingly). An idea for processing a range query is to greedily route the query message, based on the shortest distance to the specified query region, toward a peer located in a candidate subspace. Once the query message reaches such a subspace, in addition to being propagated to other peers located in the same subspace, the query message is propagated further (as needed) to other candidate subspaces.

In the following, we first describe a general algorithm for processing range queries and then discuss our strategy for propagating the query message to multiple candidate subspaces, that is, the issue of MDQP.

5.3.1 Range Query Algorithm

Based on the above idea, we develop an algorithm for processing range queries. Given that a peer issues or receives a range query, if its subspace intersects with the specified query region, the peer propagates the query message to all other members in its peer cluster. Meanwhile, regardless of the existence of overlap between a peer's subspace and the query region, this peer invokes SSR, determines whether there exist (other) subspaces intersecting with (the remaining part of) the query region, and obtains their PCNs. Next, the peer forwards the query message greedily toward the candidate subspace(s) using our MDQP algorithm (to be discussed shortly).

5.3.2 Multidestination Query Propagation

If a peer maintains the neighborhood information of all data dimensions as in CAN, MDQP can be solved easily by propagating the message to the neighbors recursively. However, as discussed in Section 4, to address the issue of high dimensionality, the peer clusters form into SSW-1D, and a peer does not maintain the complete neighborhood information of all data dimensions.

One possible strategy for MDQP is to propagate the query message to the candidate subspaces sequentially in a certain order. In this way, the single-destination routing protocol (as described for the point* query in Section 5.2) can be directly applied to solve the problem of MDQP. However, this sequential forwarding strategy incurs long query latency. In the following, we present an MDQP algorithm that propagates a query message to multiple candidate subspaces in parallel. Since our MDQP algorithm invokes multiple propagation paths simultaneously, an important issue is avoiding redundant query message propagation and processing. Thus, a peer, upon receiving a query message, needs to know which portions of the query region have been processed or are being processed by other peers so that it will not process these regions again. We observe that a PCN (for example, a_h) corresponds to a GPT subtree rooted at the tree node with the tree label being a and the depth being h. This information can be utilized to propagate the query message systematically from the virtual GPT downward along different paths. In other words, if a peer obtains multiple PCNs through SSR, this peer forks multiple query messages, each of which is attached with a PCN and forwarded toward the subspace(s) corresponding to that PCN. A receiver of the query message only processes the portion of the query region that intersects with the subspace indicated by the attached PCN. This process is repeated recursively until there is no more subspace to be examined. Algorithm 2 shows the pseudocode for range query processing.

Algorithm 2: Algorithm for range query.
Peer i issues range(q, r): i.range(q, r, a_h) (a_h is the currently estimated PCN for q, where a and h are initialized to 0, respectively.)
1: if range(q, r) intersects with i's subspace then
2:   Invoke flooding search within cluster.
3: end if
4: Invoke SSR and record in U the PCNs (indicated by a'_h') of the subspaces intersecting with the query region corresponding to PCN a_h.
5: for all a'_h' ∈ U do
6:   Fork a query message range(q, r, a'_h') and forward it toward a'.
7: end for

Fig. 5. An illustrative example of range query. (a) Query region. (b) GPT. (c) SSW-1D. (d) MDQP.

Fig. 5 illustrates an example for range query processing. Assume that peer 1 in cluster 4 issues a range query depicted by the large circle q2 in Fig. 5a. For presentation clarity, here, we redisplay the figures for the corresponding GPT and SSW-1D in Figs. 5b and 5c. Peer 1 first invokes SSR and obtains the PCN of the only candidate cluster/subspace as 8_1. As a result, the query is forwarded to cluster 8. The peer receiving the message in cluster 8 further obtains the PCNs for the query region via SSR as 10_3 and 12_2. Therefore, this peer forks two query messages and forwards them in accordance with the two PCNs, respectively. The message corresponding to 10_3 reaches cluster 10 and then is further forwarded to cluster 11. The message corresponding to 12_2 reaches cluster 13 first and then is forwarded to clusters 12 and 14 simultaneously.
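To complement Algorithm 2, the sketch below (our simplification; the message fields and peer methods are hypothetical) shows the fork-per-PCN step that realizes MDQP:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class RangeQueryMsg:
    q: Tuple[float, ...]            # query point
    r: float                        # query radius
    pcn: Tuple[int, int] = (0, 0)   # (a, h): PCN attached to this branch of the propagation

def handle_range_query(peer, msg: RangeQueryMsg):
    # Answer locally if this peer's subspace intersects the query region.
    if peer.subspace_intersects(msg.q, msg.r):        # hypothetical predicate
        peer.flood_within_cluster(msg)                # hypothetical flooding step
    # Resolve candidate subspaces within the region named by the attached PCN,
    # then fork one message per PCN so the propagation paths proceed in parallel.
    for a, h in peer.ssr(msg.q, msg.r, *msg.pcn):     # hypothetical SSR call (Algorithm 1)
        fork = RangeQueryMsg(msg.q, msg.r, (a, h))
        peer.forward_toward(a, fork)                  # greedy forwarding toward the subspace labelled a
```

Because each forked message carries its own PCN, a receiver only works on the part of the query region under that GPT subtree, which is what prevents redundant processing across the parallel branches.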
6 PERFORMANCE EVALUATION

We move on to evaluate SSW's benefits through extensive simulations. We compare SSW with pSearch, the state of the art in semantic-based search in P2P systems. Based on [23], pSearch takes four groups of subvectors, each with 2.3 ln N dimensions (that is, p = 4 and m = 2.3 ln N). In addition, to compare the effectiveness of our proposed dimension reduction method ASL with the rolling index used in pSearch, we also implement the rolling index on top of the m-dimensional small world network, called small world rolling index (SWRI). We conduct point* query and range query in the simulation. We extend pSearch/SWRI to support point* query and range query as follows: To process a point* query, the query vector is partitioned into p subvectors, and the query is routed to p subspaces, covering each of these p subvectors. The most similar data object is returned from each of the p subspaces, and the most similar one among these p data objects is returned as the result. To process a range query, the query message is forwarded to a peer whose subspace intersects with the query region in the first of the p subvectors and is then propagated to the peers whose subspaces intersect with the other portion of the query region in this subvector. These peers then return the data objects whose SVs are within the query region in the complete semantic space as the results. In the following, we explain the simulation setup before we present the simulation results.

6.1 Simulation Setup

A random mixture of operations, including peer join, peer leave, and search (based on certain ratios), is injected into the network. During each run of the simulation, we issue 10N random queries into the system. The proportion of peer join to peer leave operations is kept the same to maintain the network at approximately the same size. The simulation parameters, their values, and the defaults (unless otherwise stated) are given in Table 1. Most of these parameters are self-explanatory. More details on some of the system parameters, the synthetic data set, the real data set, and the query workload are given below.

TABLE 1. Parameters Used in the Simulations.

Synthetic data set. The data set is defined by the dimensionality of the SV (and the semantic space), the weight of each dimension in the SV, the number of data objects per peer n, and the data distribution in the semantic space. The default setting for the dimensionality of the SV, k, is 100. We use three different weight assignments for the SV, namely, uniform, linear, and Zipf dimension weight assignments. In the uniform dimension weight assignment, each dimension has the same weight. In the linear dimension weight assignment, dimension i has weight (k + 1 − i)/k (1 ≤ i ≤ 100). The Zipf distribution is captured by the distribution function 1/i^θ, where θ is the skewness parameter. In our Zipf dimension weight assignment, θ is set to 1; that is, dimension i has weight 1/i. If unspecified, the linear dimension weight assignment is the default setting.

Data distribution in the semantic space is determined by two factors: 1) the semantic distribution of data objects among different peers and 2) the semantic distribution of data objects at a single peer. The former controls the data hot spots in the system, and the latter controls the semantic similarity between data objects at a single peer, namely, semantic closeness. To model both factors, we associate a Zipf distribution with each: Dataseed-Zipf for the former and Data-Zipf for the latter, with their skewness parameters being θ_d1 and θ_d2, respectively. We first draw a seed for each peer following the Zipf distribution controlled by θ_d1. This serves as the centroid around which the actual data objects for that peer are composed following θ_d2.
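The three dimension weight assignments can be written down directly (an illustrative sketch; the helper names are ours):

```python
def uniform_weights(k):
    """Every dimension carries the same weight."""
    return [1.0] * k

def linear_weights(k):
    """Dimension i (1-based) gets weight (k + 1 - i) / k."""
    return [(k + 1 - i) / k for i in range(1, k + 1)]

def zipf_weights(k, theta=1.0):
    """Dimension i (1-based) gets weight 1 / i**theta; theta = 1 in the experiments."""
    return [1.0 / i ** theta for i in range(1, k + 1)]

print(linear_weights(5))  # [1.0, 0.8, 0.6, 0.4, 0.2]
print(zipf_weights(5))    # [1.0, 0.5, 0.33..., 0.25, 0.2]
```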
Real data set. We derive the latent semantic indexes for 527,000 documents in TREC versions 4 and 5 [1]. We distribute these documents to peers in two ways: 1) random data distribution, where documents are randomly distributed among peers, and 2) clustered data distribution, where documents are first assigned into N clusters using the K-means algorithm, and each cluster is then randomly assigned to a peer.

Workload parameters. A point* query is specified by the query point, and a range query is specified by the query radius and the query point. The query radii for range queries are randomly drawn from the range [0, R], where R indicates the maximum query radius. Similar to the data parameters, we consider two factors in generating query points: the distribution of query points emanating across the peers in the system and the skewness of the query points emanating from a single peer. The former controls query hot spots (that is, more users are interested in a few data items) in the system, and the latter controls the locality of interest for a single peer, namely, query locality (that is, a user is more interested in one part of the semantic space). We use two Zipf distributions with parameters θ_q1 (for Queryseed-Zipf) and θ_q2 (for Query-Zipf) to control the skewness; that is, θ_q2 captures the skewness of the queries around the centroid generated by θ_q1.

Although the main focus of this paper is to improve search performance with minimum overhead, we also try to explore the strengths and weaknesses of SSW in other aspects such as fault tolerance and load balance. In this paper, we use the following metrics for our evaluations:

• Search path length is the average number of logical hops traversed by search messages to reach the destination.
• Search cost is the average number of messages incurred per search.
• Maintenance cost is the number of messages incurred per membership change, consisting of the overlay maintenance cost and the foreign index publishing cost. Since the size of different messages (join, query, index publishing, cluster split, and cluster merge) is more or less the same (dominated by the size of one SV, that is, 400 bytes), we focus on the number of messages in the paper for clarity.
• Search failure ratio is the percentage of unsuccessful searches that fail to locate existing data objects in the system.
• Index load is the number of foreign index entries maintained at a peer.
• Routing load is the number of search messages that a peer processes.
• Result quality measures the quality of the returned data object for a point* query (a small sketch of this computation follows the list). To calculate the result quality, we first calculate the normalized dissimilarity (euclidean distance)3 between the query and the result returned by SSW (or pSearch/SWRI), denoted as d_real, and the normalized dissimilarity between the query and the data object most similar to the query in the system, denoted as d_ideal. Then, we use 1 − (d_real − d_ideal) to represent the result quality. When the difference between d_ideal and d_real is very small, the result quality is high.

3. To perform the normalization, we divide the euclidean distance between two vectors by √k, which is the maximum euclidean distance between two vectors in the semantic space.
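A minimal sketch of the result-quality computation (ours; it applies the √k normalization from footnote 3 and the 1 − (d_real − d_ideal) definition above):

```python
import math

def normalized_dissimilarity(sv_a, sv_b, k):
    """Euclidean distance divided by sqrt(k), taken here as the maximum distance in the k-dimensional space."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(sv_a, sv_b)))
    return dist / math.sqrt(k)

def result_quality(query_sv, returned_sv, ideal_sv, k):
    d_real = normalized_dissimilarity(query_sv, returned_sv, k)
    d_ideal = normalized_dissimilarity(query_sv, ideal_sv, k)
    return 1.0 - (d_real - d_ideal)  # 1.0 when the returned object is as close as the best object in the system
```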
6.2 Simulation Results

We have conducted extensive experiments under both synthetic and real data sets to demonstrate SSW's strengths in various aspects. In the following, we first present the results that demonstrate SSW's nice properties as a generic overlay structure, followed by the results demonstrating SSW's efficiency in support of different types of queries. We present the results under different data sets when they display different trends. Otherwise, we present the results under one specific data set for presentation brevity and the interest of readers.

6.2.1 Overlay Properties

In the following, we demonstrate SSW's scalability, adaptivity to data distribution and query locality, resilience to peer failures, and load balancing property.

Scalability. We vary the number of peers from 2^8 to 2^14 to evaluate the search efficiency and maintenance cost of SSW. According to our preliminary simulation results (see [14]), we find that SSW with four long-range contacts has a reasonable trade-off between search efficiency and maintenance overhead for most of the settings, and we use this value in the experiments. Since pSearch does not use any clustering, we disable the clustering feature of SSW (that is, the cluster size M is set to 1) in this set of experiments. We have demonstrated that SSW can perform even better with appropriate cluster sizes (as detailed in [14]).

Fig. 6. Comparing the scalability of the schemes. (a) Search path length. (b) Overlay maintenance cost. (c) Index publishing cost.

Fig. 6a shows the average path length under the uniformly distributed synthetic data set (θ_d1 = θ_d2 = 0). The results under other data sets are almost the same and are omitted. Since the size of peer clusters is set to 1 in this experiment, there is no flooding within a cluster, and the average search path length for SSW represents the search cost as well. The search path length for SSW increases slowly with the size of the network, confirming the search path length bound in Theorem 2. In addition, the constant hidden in the big-O notation is smaller than 1, as shown in the figure. The plot of pSearch's path length has a slope close to SSW (with four long-range contacts) but has a much higher offset. In fact, the search path length of SSW is about 40 percent shorter than that of pSearch at a network size of 16K peers. The search path length of SWRI is between pSearch and SSW. This confirms the effectiveness of ASL versus the rolling index in terms of the search path length.

The overlay maintenance cost is proportional to the number of states maintained at each peer, which are 20, 24 (20 short-range contacts and four long-range contacts), and 6 (two short-range contacts and four long-range contacts) for pSearch, SWRI, and SSW, respectively. Fig. 6b shows the overlay maintenance cost for the same experiments as in Fig. 6a. These two figures confirm our expectation that, compared to pSearch, SSW can achieve much better search performance with smaller overlay maintenance overhead.

The other maintenance cost to consider is the overhead of publishing the foreign index upon peer joins (apart from the cost shown in Fig. 6b). This cost is proportional to the number of data objects that need to be published, and the corresponding relationship is shown in Fig. 6c. Due to the fact that pSearch and SWRI have to publish a data object multiple times, the index publishing costs for pSearch and SWRI are much higher than that for SSW. In addition, the index publishing cost for SWRI is lower than that for pSearch. Note that these results for SSW are conservative, since θ_d2 = 0 (uniform data distribution), and the overhead is likely to be lower with any skewness (as will be shown later).

This set of experiments confirms the scalability of SSW. It also confirms our expectation that ASL is a better dimension reduction method than the rolling index in various aspects, including search cost and index publishing cost (later, we will show that ASL is better than the rolling index in terms of the result quality for point* query as well). In the remaining experiments, we only compare with pSearch for presentation clarity.

In the above experiments, the size of the peer clusters M has been set to 1. When M is larger, cluster splits or merges occur less frequently, resulting in lower overlay maintenance costs. Furthermore, the total number of peer clusters in the system decreases with larger cluster sizes, thereby reducing the cost of navigation across clusters. The downside of large-sized clusters is the higher search cost within a cluster (due to flooding). We have performed a series of experiments by varying the cluster size from 1 to N, which shows that, within a spectrum of cluster sizes between 2 and 16, SSW does better than with a cluster size of 1 in terms of maintenance cost, path length, and search cost (see [14] for the details). Thus, we set the cluster size to 8 for the rest of the experiments if unspecified otherwise.

Adaptivity to data distribution and query locality. In SSW, a peer selects the semantic centroid of its largest local data cluster as the join point (semantic position) when it joins the network. The rationale is that, when data is more skewed around the centroid, fewer foreign indexes for data objects need to be published outside the cluster, thereby reducing the foreign index publishing cost. To better understand the effect of semantic closeness on the foreign index publishing cost, we vary θ_d2, the skewness for Data-Zipf, from 0 to 1. In addition, we also vary θ_d1, the skewness for Dataseed-Zipf, from 0 to 1 to observe the effect of data hot spots. The results are shown in Fig. 7a. As pointed out, a higher skewness lowers the foreign index publishing cost of SSW significantly. pSearch's foreign index publishing costs are in the range of 3,500 and only decrease slightly under skewed data distribution. We omit the plot for pSearch in this figure to avoid distorting the graph.

Fig. 7b compares the foreign index publishing cost under a real data set with random data distribution and with clustered data distribution. Similar to the trends observed under synthetic data sets, if data objects are clustered rather than randomly distributed, SSW reduces the foreign index publishing cost significantly. pSearch's foreign index publishing cost is also reduced under the clustered data distribution. However, the reduction is not as significant as in SSW.

In SSW, long-range contacts can be updated based on query history to exploit query locality. To study this improvement, we vary θ_q2, the skewness for Query-Zipf, from 0 to 1. We also vary θ_q1, the skewness for Queryseed-Zipf, to observe the effect of query hot spots. Figs. 8a and 8b respectively compare the search path length of SSW without updates and with updates of long-range contacts (based on what was described in Section 5.2). Without any updates, query locality has little impact on the results. With long-range contact updates, however, query locality significantly enhances the search performance. For instance, we see nearly a 78 percent reduction in path length when θ_q2 increases from 0 to 1.0, with θ_q1 set to 1.0. pSearch's result (the plot is not shown here) is similar to the one without updates, except that pSearch's path length is much higher (in the range of 40).
Fig. 7. Effect of data distributions. (a) Synthetic data set. (b) Real data set.
Fig. 8. Effect of query distributions. Cost for pSearch is much higher (not shown for clarity). (a) Without updates. (b) With updates.
Tolerance to peer failures. Peer failure is a common event in P2P systems. Thus, a robust system needs to be resilient to these failures. To evaluate the tolerance of SSW to peer failures, a specified percentage of peers are made to fail after the network is built up. We then measure the ratio of searches that fail in finding data objects existing in the network (we do not consider failures due to the data residing on the failed peers).

Fig. 9 shows the fraction of searches that fail as a function of the ratio of induced peer failures (from 0 percent to 50 percent). Since the fault tolerance of SSW is largely dependent on the cluster size, we also consider different values for M (the cluster size) in these experiments. Even though each peer in pSearch maintains a large number of states (20), the search failure ratio grows rapidly with the ratio of peer failures. On the other hand, at a cluster size of 1, SSW, with a much smaller number of states maintained per peer (two short-range contacts and four long-range contacts), has a similar search failure rate as pSearch. Increasing the cluster size to 4 substantially improves SSW's fault tolerance. Beyond sizes of 4, the search failure ratio, even with as high as 30 percent of peer failure, is very close to 0. These results reiterate the benefits of forming peer clusters.

Fig. 9. Effect of peer failure ratio on the failure of search operations ( = 0 percent).

Load balancing. We evaluate the load balance of SSW based on two aspects: index load and routing load. For the index load, we evaluate the distribution of the foreign index maintained at each peer under different data distribution patterns. Since the load is evenly balanced under the uniform distribution for both pSearch and SSW, we present only the results under the skewed data distribution in Fig. 10a. For presentation clarity, we show the CDF of the foreign index load. As we expected, pSearch has a much more uneven index load distribution compared to SSW. For instance, the CDF of the foreign index load for pSearch reaches 90 percent with the first 16 percent of nodes. In contrast, SSW displays a relatively even distribution of index load, even under this skewed data set. This confirms our intuition that placing peers in the semantic space in accordance with their local data objects can effectively partition the search space according to data distribution.
Fig. 10. Distribution of foreign index load and routing load among the peers. (a) Index load. (b) Routing load.
Fig. 11. Result quality of point* query under different data sets. (a) Synthetic data set. (b) Real data set.
Similarly, we have varied the query distribution in order to study the routing load distribution across the nodes, again with uniform and skewed distributions (this time with queries). We present the results for the skewed query distribution (θ_q1 and θ_q2 were set to 1) in Fig. 10b. Similarly, we show the CDF of the routing load. We find that the routing load is more evenly distributed in SSW compared to pSearch. For instance, the CDF of the routing load for pSearch reaches 90 percent with the first 20 percent of nodes. At this point, the CDF of the routing load for SSW is only 57 percent. This is due to the introduction of randomness through the long-range contacts and the good balance within a peer cluster itself.

6.2.2 Query Performance

In the following, we present the results specific to point* query and range query.

Point* query. Although the results on the search path length and search cost of point* query have been presented earlier, here, we present the result quality of point* query under different data sets. Since the result quality highly depends on the SV dimension weight assignments, we vary the SV dimension weight assignments, and the results are presented in Fig. 11a. SSW partitions the data space in a way adaptive to the data distribution in the semantic space, whereas pSearch partitions the data space uniformly (into two equally sized subspaces). We expect that our partition scheme adapts to different SV dimension weight assignments automatically, whereas pSearch does not. This is confirmed in Fig. 11a. Different SV dimension weight assignments have little effect on the result quality of SSW (as well as on the search path length, search cost, and maintenance cost). On the other hand, pSearch is very sensitive to different dimension weight assignments. In all three settings, the result quality of SSW is higher than that of pSearch.

Fig. 11b shows the result quality of point* query under a real data set with random data distribution and with clustered data distribution. In this figure, we can see that the result quality obtained under the real data set is similar to that under the synthetic data set with the Zipf dimension weight assignment. This is not surprising, since the vector elements in the document SVs have weights following the Zipf distribution.

In addition, we vary the dimensionality of the semantic space (SVs) from 10 to 100 and evaluate its effect. The search path length is more or less the same under different dimensionalities (the results are omitted). This is not unexpected, since the underlying routing structure of SSW-1D is based on the linearized naming space, which is not tied to the data dimensionality. On the other hand, there are some changes in the result quality. Fig. 12 shows the result
Fig. 12. Effect of dimensionality on point* query.
Fig. 13. Effect of query radius on range query. (a) Synthetic data set. (b) Real data set.
Fig. 14. Effect of dimensionality on range query.