This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8933. <http://pdos.lcs.mit.edu/chord/>.

1 Alternatively, a node may detect failures only when it actually needs to contact a neighbor; however, this merely defers the network traffic for finding a new neighbor until the old one fails. It also raises the risk that all of a node's neighbors fail, permanently disconnecting that node from the network.
nected structure where lookups are correct? How much work is required to provide a richer structure where lookups are correct and also fast?

To answer these questions, we make two kinds of observations about P2P maintenance protocols. First, we give lower bounds on the maintenance protocol bandwidth for connectivity in any P2P network as nodes join and leave. We characterize this lower bound using the notion of half life, which essentially measures the time for replacement of half the nodes in the network by new arrivals. We show that per-node maintenance protocol bandwidth is lower-bounded by $\Omega(\log N)$ per half life for any P2P system that wishes to remain connected with high probability.2 Second, we analyze the maintenance protocol used by Chord [5], a P2P routing protocol. We show that Chord consumes bandwidth only logarithmically larger than our lower bound. Critical to this analysis is a demonstration that Chord's join, lookup, and maintenance protocols work correctly even when the system is not in its idealized stable state.

This style of evolutionary analysis of P2P networks has not been well developed. Many P2P systems focus on models in which nodes join and depart only in a well-behaved fashion, allowing maintenance to happen only at the time of arrival and departure. We believe this kind of well-behaved model is unrealistic. Other protocols allow for the possibility of unexpected failures, and show that the system is still well-structured after such failures occur. These analyses, however, assume that the system begins in an ideal starting state, and do not show how the system returns to this ideal state after the failures; thus, accumulation of failures over time eventually disrupts the system. (See, e.g., [1, 3, 4, 5, 6].)

Perhaps the closest to our evolutionary analysis is the recent work of Pandurangan et al. [2], who study a centralized, flooding-based P2P protocol. Using a Poisson arrival/departure model, they show that their protocol results in an overlay network whose diameter remains logarithmic, with high probability. However, their scheme does not solve the problem of routing within the P2P network: to find the node responsible for a given data item, they propose flooding the network, requiring $\Omega(N)$ messages. Also, their system requires a central server to guarantee connectivity.

We believe that our evolutionary analysis, with its recognition that the ideal state will rarely occur, is crucial for proper understanding of P2P protocols in practice.

2 A HALF LIFE LOWER BOUND

In this section, we give a general lower bound for the bandwidth of maintenance messages in P2P systems, based on the rate of node joins and departures. If there are $N$ live nodes at time $t$, then the doubling time at time $t$ is the time that it takes for $N$ additional nodes to arrive. The halving time at time $t$ is the time for half of the nodes alive at time $t$ to depart. The half life at time $t$ is the smaller of the doubling and halving times at time $t$. Finally, the half life of the entire system is the minimum half life over all times $t$. Intuitively, a half life of $\tau$ means that after time $t + \tau$, only half the state of the system can be extrapolated from its state at time $t$.

For example, consider a Poisson model of arrivals/departures [2]: nodes arrive according to a Poisson process with rate $\lambda$, while each node in the system departs independently according to an exponential distribution with rate parameter $\mu$ (i.e., expected node lifetime is $1/\mu$). If there are $N$ nodes in the system at time $t$, then the expected doubling time is $N/\lambda$ and the expected halving time is $(1/\mu)\ln 2$. (The probability $p$ that a node fails in time $\tau$ is $1 - e^{-\mu\tau}$; setting $\tau = (1/\mu)\ln 2$ makes $p = 1/2$.) The half life is then $\min\big((\ln 2)/\mu,\; N/\lambda\big)$. If $\lambda$ and $\mu$ are fixed and the system is in a steady state, then the arrival rate of $\lambda$ must be balanced by the departure rate of $N\mu$ (each of $N$ nodes is leaving at rate $\mu$), implying $N = \lambda/\mu$. Then the doubling time is $1/\mu$ and the halving time and half life are both $(1/\mu)\ln 2$. This reflects a general property: in any system where the number of nodes is stable, the doubling time, halving time, and half life are all equal to within constant factors.

Using this Poisson model, we derive a lower bound on the rate at which bandwidth must be consumed to maintain connectivity of the P2P network.
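The half-life quantities of the Poisson model in Section 2 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the last helper assumes that a node's neighbors fail independently, each with probability 1/2 per half life, as the model's independent exponential lifetimes suggest.

```python
import math


def doubling_time(n_nodes: float, arrival_rate: float) -> float:
    """Expected time for n_nodes additional nodes to arrive (N / lambda)."""
    return n_nodes / arrival_rate


def halving_time(departure_rate: float) -> float:
    """Time after which a node has departed with probability 1/2 (ln 2 / mu)."""
    return math.log(2) / departure_rate


def half_life(n_nodes: float, arrival_rate: float, departure_rate: float) -> float:
    """Half life: the smaller of the doubling time and the halving time."""
    return min(doubling_time(n_nodes, arrival_rate), halving_time(departure_rate))


def failure_probability(departure_rate: float, elapsed: float) -> float:
    """Probability that a node with exponential(mu) lifetime fails within `elapsed`."""
    return 1.0 - math.exp(-departure_rate * elapsed)


def steady_state_size(arrival_rate: float, departure_rate: float) -> float:
    """Network size at which arrivals (lambda) balance departures (N * mu)."""
    return arrival_rate / departure_rate


def all_neighbors_fail_probability(n_neighbors: int) -> float:
    """Assumption: independent failures, each with probability 1/2 per half life."""
    return 0.5 ** n_neighbors


if __name__ == "__main__":
    lam, mu = 2.0, 0.5  # illustrative rates, not taken from the paper
    n = steady_state_size(lam, mu)
    print("steady-state size:", n)
    print("doubling time:", doubling_time(n, lam))
    print("halving time:", halving_time(mu))
    print("half life:", half_life(n, lam, mu))
```

In the steady state the doubling time equals $1/\mu$ and the half life equals the halving time, matching the constant-factor relationship described in the text.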
2 Throughout this paper, with high probability (abbreviated whp) means with probability [...].

Theorem 2.1. Consider any P2P system with any initial configuration. Suppose that there is some node $n$ that, on average, receives notification about fewer than $k$ new nodes per $\tau$ time. Then there is a sequence of joins and leaves with half life $\tau$ and a time $t$ so that node $n$ is disconnected from the network by time $t$ with probability [...].

Corollary 2.2. Consider any $N$-node P2P network that remains connected with high probability for every sequence of joins and leaves with half life $\tau$. Then every node must be notified about an average of $\Omega(\log N)$ new nodes per $\tau$ time.

In a half life, the probability that any particular node in the network fails is $1/2$. Thus, if any node has fewer than $\log N$ neighbors, then the probability that they all fail during this half life is larger than $1/N$. In each half life, then, each node loses about $(\log N)/2$ neighbors; it must replace its failed neighbors to remain connected in the next half life.3

3 Note that this does not require that each node $n$ can learn about $\Omega(\log N)$ nodes every half life, since $n$ may receive a message containing information about many new nodes; instead, it requires that $n$ receive information about new nodes at an average rate of $\Omega(\log N)$ per half life.

3 A DYNAMIC MODEL FOR CHORD

This section outlines and analyzes two maintenance protocols in Chord. The first is weak stabilization from [5], which maintains a small amount of correct routing information in the face of concurrent arrivals and departures. The second is strong stabilization, which ensures a correct routing overlay from an arbitrary initial condition.

Background on Chord. Chord nodes4 and keys are hashed into a random location on the unit circle; a key is assigned to the first node encountered moving clockwise from it. Each node knows its successor node (the node immediately following it on the circle), which allows correct lookup of any key $k$ by walking around the circle until reaching $k$'s successor. We speed this search using fingers: $n.\mathit{finger}[i]$ is the first node following [...] on the identifier circle. Intuitively, any node always has a finger pointing halfway to any destination, so that a sequence of $\log N$ "halvings" of the distance takes us to the key. Each node $u$ also maintains its predecessor, the node closest to $u$ that has $u$ as its successor.

Each node $n$ periodically executes a weak stabilization procedure to maintain the desired routing invariants: it contacts its successor $s$, and if $p = s.\mathit{predecessor}$ falls between nodes $n$ and $s$, sets $n.\mathit{successor} = p$. To maintain finger pointers, each node $n$ periodically searches for improved fingers by running find_successor on the target of each finger $i$.

A node departing the Chord ring can cause disconnection of the ring because another node may no longer be able to contact its successor. To alleviate this, each node keeps a successor list of the first [...] nodes following it on the ring. A node $n$ maintains its successor list by repeatedly fetching the successor list of $s = n.\mathit{successor}$, removing its last entry, and prepending $s$ to it. If node $s$ fails, then $n$ sets $n.\mathit{successor}$ to the next node on its successor list. Node $n$ also periodically checks whether its predecessor has failed; if it has, $n$ sets $n.\mathit{predecessor} = \mathrm{nil}$.

4 For load balancing, each "real" Chord node maintains $\log N$ virtual nodes with different identifiers; since load balancing is not our concern, we omit virtual nodes from our discussion, and consider work per virtual node.

[Figure: pseudocode for the Chord protocol. The listing itself is not recoverable from this copy; only its comments survive: "ask node n to find the successor of id"; "search the local table for the highest predecessor of id"; "periodically verify n's immediate successor, and tell the successor about n" (n.stabilize()); "n' thinks it might be our predecessor"; "periodically refresh finger table entries"; "get first non-trivial finger entry".]
4
the ring-like state with successor lists of length [...], and allow [...] random joins and [...]/2 random failures at arbitrary times over at least [...] rounds. Then, with high probability, we end up in the ring-like state.

Intuitively, the theorem follows because appendages are not too big, and not too many nodes join them. Thus over [...] rounds, the appendage nodes have time to join the cycle.

Theorem 3.3. In the ring-like state, lookups require $O(\log N)$ time.

This theorem follows from Properties 2 and 3 of Definition 3.1. For every node $u$ and index $i$, the pointer $u.\mathit{finger}[i]$ is accurate with respect to good nodes. Thus our analysis showing logarithmic time search when all fingers are correct can be easily adapted to show that, in logarithmically many steps, [...]

[...] protocol maintains a state in which routing is done correctly and quickly. But, fearful of bugs in an implementation, or a breakdown in our model,5 we now wish to take a more cautious view. In this section, we extend the Chord protocol to one that will stabilize the network from an arbitrary state, even one not reachable by correct operation of the protocol. This protocol does not reconnect a disconnected network; we rely on some external means to do so.

This approach is in keeping with our focus on the behavior of our system over time. Over a sufficiently long period of time, extremely unlikely events (such as the simultaneous failure of all nodes in a successor list) can happen. We need to cope with them.

A Chord network is weakly stable if, for all nodes $u$, we have $(u.\mathit{successor}).\mathit{predecessor} = u$, and strongly stable if, in addition, for each node $u$, there is no node $v$ so that $u < v < u.\mathit{successor}$. A loopy [...]
[Figure 3: pseudocode for the strong stabilization protocol, in which each node maintains a second successor pointer. The listing is not recoverable from this copy.]
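The weak and strong stability conditions defined in this section can be checked mechanically on a snapshot of the network. The sketch below is an illustration under assumptions, not the paper's protocol: it represents the network as two hypothetical dictionaries mapping integer node identifiers to successor and predecessor identifiers, and it reads $u < v < u.\mathit{successor}$ as "v lies strictly inside the clockwise interval from u to its successor".

```python
from typing import Dict


def between(x: int, a: int, b: int) -> bool:
    """True if x lies strictly inside the clockwise ring interval (a, b)."""
    if a < b:
        return a < x < b
    # The interval wraps past zero; a == b denotes the whole ring except a.
    return x > a or x < b


def is_weakly_stable(succ: Dict[int, int], pred: Dict[int, int]) -> bool:
    """Every node is the predecessor of its own successor."""
    return all(pred.get(succ[u]) == u for u in succ)


def is_strongly_stable(succ: Dict[int, int], pred: Dict[int, int]) -> bool:
    """Weakly stable, and no node sits between any node and its successor."""
    if not is_weakly_stable(succ, pred):
        return False
    return not any(between(v, u, succ[u]) for u in succ for v in succ)


def invert(succ: Dict[int, int]) -> Dict[int, int]:
    """Predecessor map consistent with a successor map that is a permutation."""
    return {s: u for u, s in succ.items()}


if __name__ == "__main__":
    # A correctly ordered ring on identifiers 1..5.
    ring = {1: 2, 2: 3, 3: 4, 4: 5, 5: 1}
    # A single cycle that winds around the identifier space twice.
    twisted = {1: 3, 3: 5, 5: 2, 2: 4, 4: 1}
    for name, succ in (("ring", ring), ("twisted", twisted)):
        pred = invert(succ)
        print(name, is_weakly_stable(succ, pred), is_strongly_stable(succ, pred))
```

The twisted example passes the weak check because every pointer pair is mutually consistent, yet fails the strong check because nodes are skipped over; local pointer agreement alone cannot distinguish it from a correct ring.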
[...] predecessor. We extend our previous stabilization protocol by allowing each node to maintain a second successor pointer. This second successor is generated by self-search, and improved in exactly the same way as in the previous protocol. See Figure 3.

Theorem 3.4. A connected Chord network strongly stabilizes within [...] rounds if no nodes join it, and in [...] rounds if there are no joins and at most [...] failures occur over [...] rounds.

Corollary 3.5. A connected loopy Chord network strongly stabilizes within [...] rounds with no failures, and [...] rounds if at most [...] failures occur over [...] rounds.

The requirement on the failure rate exists solely to allow us to maintain a successor list with sufficiently many live nodes, and thus maintain connectivity. The corollary follows because a loopy Chord network will never permit any new nodes to join until its loops merge: in a loopy network, for all $u$, we have [...], since $u$'s self-search never returns $u$ in a loopy network. Thus, no node attempting to join can ever find a node on the cycle to choose as its successor.

While the runtime of our strong stabilization protocol is large, recall that strong stabilization needs to be invoked only when the system gets into a pathological state. Such pathologies ought to be extremely rare, which means that the lengthy recovery is a small fraction of the overall lifetime of the system. For example, if pathological states occur only once every [...] rounds, then the system will only be spending a 1/[...] fraction of its time on strong stabilization.

Nonetheless, it would clearly be preferable to develop a strong stabilization protocol that, like weak stabilization, simply executes at a low rate in the background, rather than bringing everything else to a halt for lengthy periods.

4 CONCLUSION

We have described the operation of Chord in a general model of evolution involving joins and departures. We have shown that a limited amount of housekeeping work per node allows the system to resolve queries efficiently. There remains the possibility of reducing this housekeeping work by logarithmic factors. Our current scheme postulates that the half life of the system is known; an interesting question is whether the correct maintenance rate can be learned from observation of the behavior of neighbors. Another area to address is recovery from pathological situations. Our protocol exhibits slow recovery from certain pathological "disorderings" of the Chord ring. Although it is of course impossible to recover from total disconnection, an ideal protocol would recover quickly from any state in which the system remained connected.

REFERENCES

[1] FIAT, A., AND SAIA, J. Censorship resistant peer-to-peer content addressable networks. In Proc. SODA 2001.
[2] PANDURANGAN, G., RAGHAVAN, P., AND UPFAL, E. Building low-diameter peer-to-peer networks. In Proc. FOCS 2001.
[3] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network. In Proc. SIGCOMM 2001.
[4] ROWSTRON, A., AND DRUSCHEL, P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. Middleware 2001.
[5] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. SIGCOMM 2001.
[6] ZHAO, B., KUBIATOWICZ, J., AND JOSEPH, A. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, Computer Science Division, U. C. Berkeley, Apr. 2001.