Gnutella Thesis

UNIVERSITY OF CINCINNATI
April 20, 2001
I, Mihajlo A. Jovanovic, hereby submit this as part of the requirements for the degree of:
MASTER OF SCIENCE
in:
Computer Science
It is entitled:
Modeling Large-scale Peer-to-Peer Networks and a Case Study of Gnutella
Approved by: Dr. Fred S. Annexstein, Dr. Ken A. Berman, Dr. Yizong Cheng
Modeling Large-scale Peer-to-Peer Networks and a Case Study of Gnutella
A thesis submitted to the Division of Graduate Studies and Research of the University of Cincinnati in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in the Department of Electrical and Computer Engineering and Computer Science of the College of Engineering
June, 2000
by Mihajlo A. Jovanović
B.S., Department of Mathematics and Computer Science, Otterbein College, Westerville, Ohio, 1997
Thesis Advisor and Committee Chair: Dr. Fred S. Annexstein and Dr. Kenneth A. Berman
Abstract
The ongoing digital revolution has brought about novel network applications such as Gnutella, Freenet, and Napster, intended to facilitate worldwide sharing of information. These applications have embraced the familiar peer-to-peer (P2P) architecture model of the original Internet in new and innovative ways, forever changing the world of personal computing. However, if P2P is to truly replace the well-established client-server model as the computing paradigm of the future, more efficient decentralized algorithms must first be designed. This requires a better understanding of the P2P network model on which those algorithms would operate. Such a model includes both network topology and traffic. In this thesis, we study both of these factors, using as our case study Gnutella, a fully decentralized file-sharing network application. In order to study the Gnutella network topology, we have developed a network crawler that allows topology discovery to be performed in parallel. Upon analyzing the obtained topology data, we discovered that it exhibits strong small-world properties. More specifically, we observed the properties of small diameter and clustering in the Gnutella network topology. In addition, we report evidence of four different power laws previously observed in other technological networks, such as the Internet and the WWW. In the second part of our thesis, we utilize our topology model to study network traffic. Specifically, we show that heterogeneous latencies present in many large-scale P2P network applications, when combined with the standard protocol mechanisms of time-to-live (TTL) and unique message identification (UID) used to govern flooding message transmissions, can potentially have a devastating effect on the reachability of message broadcast. We call this combined effect short-circuiting, and we investigate the consequences of this phenomenon. We show through experimentation that, in the worst case, short-circuiting can almost completely eliminate the reach of broadcast messages. We report measurements obtained through both network simulation studies and experimental studies performed on Gnutella. Our results indicate that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall large-scale system. We believe our discoveries of both network topology properties and short-circuiting are an important step toward a uniform model of P2P network applications, and could serve as a valuable tool in analyzing the performance of existing algorithms, as well as in designing new, more scalable solutions.
Acknowledgments
First, I would like to thank my advisers, Dr. Fred Annexstein and Dr. Kenneth Berman, for hours of intellectually stimulating discussions, suggestions and ideas. For the duration of this thesis, they have been not just my advisers but also my mentors, providing constant encouragement as well as financial support in the form of a Research Assistantship. I would also like to thank Dr. Yizong Cheng for taking the time out of his busy schedule to be on my thesis committee, and Dr. John Schlipf for attending my thesis defense. Special thanks go to Dr. John Franco for providing motivation and guidance, particularly during my first year at UC, and also to Linda Gruber for her always kind and helpful attitude. I extend my sincere gratitude to the Department of Electrical and Computer Engineering and Computer Science for its generous support, without which this work would not have been possible. The department provided me with a Graduate Assistantship during my first year and a University Graduate Scholarship for three full academic years. Finally, I dedicate this work to my parents, Aleksandar and Mirjana, without whose love and support, even from half a world away, I could not have done it.
Table of Contents
1 Introduction
  1.1 Peer-to-Peer Computing
    1.1.1 Example Applications
  1.2 Modeling P2P Applications
    1.2.1 Benefits to Modeling
2 Modeling Topology of Large P2P Networks
  2.1 Small-World Networks
    2.1.1 Modeling Small-World Networks
    2.1.2 Gnutella as a Small-World
  2.2 Power-Laws
    2.2.1 Power-Law Models
    2.2.2 Power-Laws in Gnutella
3 Modeling Network Latencies
  3.1 Latency Effects
  3.2 Modeling the Short-Circuiting Effect
  3.3 Empirical Experiments
    3.3.1 Gnutella Studies
    3.3.2 Network Simulation Studies
4 Gnutella Crawler Implementation
  4.1 Introduction to Gnutella
    4.1.1 Gnutella Protocol
    4.1.2 Discovering Gnutella Network Topology
  4.2 Design
    4.2.1 Algorithm
    4.2.2 Initial Implementation
    4.2.3 Parallel Algorithm
    4.2.4 Limitations
  4.3 Distributed Computing Solution Using Java RMI
5 Conclusions and Future Research
  5.1 Conclusions
  5.2 Future Directions
    5.2.1 Network Topology Modeling
    5.2.2 Network Visualization
    5.2.3 Server Placement
Appendix
A Visualizations of the Gnutella Network Topology
B Java source code for gnutsim
C Network Simulation Results
List of Figures
2.1 Values for the clustering coefficient as defined in definition 3 for the Gnutella, Barabási-Albert, Watts-Strogatz, random graph, and the 2D torus topologies
2.2 Log-log plots of degree versus rank (power-law 1)
2.3 Log-log plot of frequency versus degree (power-law 2)
2.4 Log-log plot of the number of pairs of nodes versus the number of hops (power-law 3) for four snapshots of the Gnutella topology
2.5 Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots of the Gnutella topology
3.1 The results of level-1 short-circuiting effects on the broadcast horizon on the Gnutella network, October 2000. The y-axis represents the broadcast horizon size, and the x-axis labels each of 68 broadcast trials. The top line is the resulting horizon from multiple distinct broadcasts from the same source, and the lower line is the resulting horizon from a single broadcast message from a single source. The discrepancy represents level-1 short-circuiting effects.
3.2 Horizon-size versus t
3.3 Horizon-size variation over time with broadcasting client using multiple connections on the Gnutella network, March 2001. The y-axis represents the horizon size, and the x-axis labels each of 180 broadcast trials, performed consecutively in six minute intervals.
3.4 Difficulty in conducting experiments on today's Gnutella network
3.5 Short-circuiting effects for the Watts-Strogatz topology (nodes = 10000, k = 3, p = 0.2)
A.1 Gnutella network topology using CAIDA's Otter
A.2 Gnutella network topology using LEDA's 2D spring layout
A.3 Gnutella network topology using experimental layout
A.4 Gnutella network backbone (dominating set using greedy algorithm) using LEDA's 3D spring layout
A.5 Gnutella network backbone (nodes with degree > 10) using LEDA's 3D spring layout
A.6 Gnutella network backbone (nodes with degree > 20) using LEDA's 3D spring layout
Chapter 1 Introduction
The new wave of innovative network applications such as Gnutella, Freenet, Jabber, Popular Power, SETI@Home, Publius, Free Haven, Groove, and others has brought on a revolution in personal computing, threatening the long-established client-server architecture of the Internet. For lack of a better term, this revolution has been labeled peer-to-peer (P2P), or simply peer computing. The success of this revolution will depend on the ability of modern P2P network applications to provide efficient communication between an increasingly large number of autonomous hosts dispersed all over the Internet. To cope with this problem, some P2P applications, like instant messaging and Napster, rely on a centralized server. Other applications, such as Gnutella and Freenet, adopt a fully decentralized design and require scalable algorithmic solutions for functions such as routing and searching. Gnutella, for example, utilizes a flooding mechanism for transmitting messages through the network. These algorithms are typically built into the application in the form of an application-level protocol.
The inadequacy of the existing protocols became painfully clear to Gnutella developers during the summer of 2000, when the size of the user community rapidly increased. The problem is that the original protocols were designed without any knowledge about the nature of the network on which they would be operating. In P2P applications such as Gnutella and Freenet, much like in social networks, this nature is determined by collective phenomena, as users connect to each other in a seemingly random manner. Under these circumstances, and given the highly dynamic nature of these networks, even relatively simple protocols result in complex interactions that are difficult to predict.
To provide a better understanding of such interactions, in this thesis we study the nature of P2P networks using Gnutella as our case study. In particular, we study two fundamental components of a network, namely the topology and the traffic. In the first part of this thesis (chapter 2), we focus on the network topology model. In order to study the Gnutella network topology, we have designed and implemented a distributed network crawler that allows topology discovery to be performed in parallel, an important feature considering the highly dynamic nature of Gnutella. The analysis of the obtained topology data reveals several important structural characteristics of P2P networks:
1. We report that the Gnutella network is a small-world topology, exhibiting both the small diameter and the clustering typical of many social networks.
2. We present evidence of four different power laws also found in other technological networks, such as the Internet and the WWW.
As a result, we conclude that many P2P networks, such as Gnutella, possess characteristics of both technological and social networks. It is our thesis that these characteristics can be utilized for designing more efficient algorithms operating on such networks.
In the second part of this thesis (chapter 3), we turn our focus to network traffic. More specifically, we study the effects of heterogeneous latencies on reachability in P2P networks operating under flooding protocols. We show that heterogeneous latencies present in many large-scale P2P network applications, when combined with the standard protocol mechanisms of time-to-live (TTL) and unique message identification (UID) used to govern flooding message transmissions, can potentially have a devastating effect on the reachability of message broadcast. We call this combined effect short-circuiting, and we investigate the consequences of this phenomenon. We show through experimentation that, in the worst case, short-circuiting can almost completely eliminate the reach of broadcast messages. We report measurements obtained through both network simulation studies and experimental studies performed on Gnutella. Our results indicate that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall large-scale system.
In chapter 4, we describe the design and implementation of our parallel network crawler. Finally, chapter 5 concludes this thesis with a description of future work. For the remainder of this chapter, we first present a brief overview of the P2P computing paradigm. Then, we summarize the main reasons for network modeling and present our formal model.
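The short-circuiting effect described above can be illustrated with a small event-driven flooding simulation. This is only a minimal sketch under stated assumptions, not the gnutsim simulator used in this thesis: the function names `make_graph` and `flood_reach`, the five-node example graph, and its latency values are all hypothetical. Each node forwards only the first copy of a message it receives (the UID check) and decrements the TTL on every hop.

```python
import heapq


def make_graph(edges):
    """Build an undirected adjacency map {node: {neighbor: latency}}."""
    adj = {}
    for u, v, latency in edges:
        adj.setdefault(u, {})[v] = latency
        adj.setdefault(v, {})[u] = latency
    return adj


def flood_reach(adj, source, ttl):
    """Return the set of nodes that receive one flooded message."""
    seen = set()                       # nodes that have already seen this UID
    events = [(0.0, 0, source, ttl)]   # (arrival time, tie-breaker, node, remaining TTL)
    seq = 1
    while events:
        time, _, node, remaining = heapq.heappop(events)
        if node in seen:
            continue                   # duplicate UID: dropped, even if this copy has more TTL left
        seen.add(node)
        if remaining <= 0:
            continue                   # TTL exhausted: the node does not forward
        for nbr, latency in adj[node].items():
            heapq.heappush(events, (time + latency, seq, nbr, remaining - 1))
            seq += 1
    return seen


# Same topology twice; only the latency of the direct link S-A differs.
topology = [("S", "A"), ("S", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
uniform = make_graph([(u, v, 1.0) for u, v in topology])
skewed = make_graph([(u, v, 10.0 if (u, v) == ("S", "A") else 1.0) for u, v in topology])

print(sorted(flood_reach(uniform, "S", ttl=3)))  # ['A', 'B', 'C', 'D', 'S']
print(sorted(flood_reach(skewed, "S", ttl=3)))   # ['A', 'B', 'C', 'S']
```

In the second run, node A first hears the message over the two-hop path through B, so it forwards a copy with less remaining TTL; the later copy arriving over the slow direct link is discarded as a duplicate, and node D falls outside the broadcast horizon.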
1.1 Peer-to-Peer Computing
As with many new technologies, there is no single universally accepted definition of P2P. The recently formed Peer-to-Peer Working Group, a consortium led by industry giants such as Hewlett-Packard, Intel and IBM, defines peer computing as sharing of computer resources by direct exchange. Indeed, it is this notion of direct access to resources, instead of access through a centralized server as in the traditional client-server model, that characterizes P2P. However, this definition may be too general, as it would seem to include applications typically considered client-server, such as FTP and TELNET. According to [25], the two fundamental criteria that each P2P application must satisfy are (1) treating variable connectivity and temporary network addresses as the norm and (2) giving nodes at the edges of the network significant autonomy. Using this definition, applications such as email are not P2P since addresses are not machine independent, while instant messaging applications such as ICQ and Jabber are P2P, because they devolve connection management to the individual nodes and dynamically map users to their IP addresses.
However, the fundamental idea of having computers act as peers is hardly new. Some may even argue it has its roots in the original design of the Internet, as part of the early ARPANET architecture. In fact, early network applications such as USENET and DNS were based on a peer-to-peer communication model and can be considered predecessors to modern P2P technologies. The true innovation of these technologies therefore lies not in their architecture design, but rather in their implementation and scale. In order for these applications to extend the scope of P2P computing beyond a single LAN, they needed to overcome serious technical challenges posed by technologies such as firewalls, dynamic IP, and NAT, designed to obstruct open communications between computers for reasons of security. They did so by migrating application complexity to the edges of the network, thereby creating a much more significant role for Internet-connected PCs than previously offered by the traditional client-server model.
This idea of transferring the complexity to the edges can best be explained by comparison with a telephone network. At first glance a telephone network may seem P2P, since communication occurs directly between two points in the network. However, the crucial difference between a telephone network and P2P is that the former relies on an intelligent network for functions such as routing, and on relatively dumb devices in the form of telephone sets. In contrast, a P2P application like Gnutella relies on an existing, dumb network (the Internet) and incorporates all the application logic at the endpoints. The main advantage of such a design from the perspective of a researcher is that it enables rapid development and deployment of innovative technologies, which can perhaps serve as an explanation for the large number of P2P applications we are seeing today.
1.1.1 Example Applications
Current network applications have embraced three forms of peer computing: sharing of information, sharing of computing power, and communication. This does not mean that the P2P computing model is limited to these resources, but simply that a P2P application for sharing other types of resources has not yet been designed. Table 1.1 lists the most popular P2P applications in each category.

Sharing of Information    Sharing of Computing Power    Communication
Gnutella                  SETI@Home                     AIM
Freenet                   Folding@Home                  ICQ
Napster                   FightAIDS@Home                MSN
Publius                   PopularPower                  Jabber
Free Haven                Intel's NetBatch

Table 1.1: List of most popular P2P applications

Applications such as SETI@Home highlight a clear relationship between P2P and another computing paradigm commonly referred to as distributed computing. These applications allow the computing power of thousands of Internet-connected PCs to be harnessed for computationally intensive tasks that would otherwise require the use of a supercomputer. Examples include processing radio signals from outer space in search of extraterrestrial intelligence [4] and simulating protein folding [2].
Perhaps the most popular form of peer computing on the Internet is instant messaging. Unlike email, where messages travel through centralized mail servers, instant messaging allows individuals to communicate directly with each other. To route messages between users across the entire Internet, applications such as AIM, ICQ, MSN, and Jabber rely on a centralized back-end server to dynamically map users to their IP addresses and to buffer messages in case the user is offline.
Ongoing work toward the development of a generalized platform for building P2P applications [16] can perhaps be taken as an indication that the P2P model is here to stay. The main goal of the Groove developers is to abstract away many common challenges in building P2P network applications, such as providing open PC-to-PC communication. The main obstacles arise from the fact that the Internet architecture has been built for years around the prevalent client-server model. As a result, numerous technologies such as firewalls, dynamic IP, NAT, and asymmetric bandwidth connections have been deployed on the Internet, driven by the fundamental assumption that most Internet-connected PCs will only serve as clients.
This underlying assumption is being strongly challenged by P2P applications such as Gnutella, Napster, and Freenet, which strive to provide a fully distributed worldwide information sharing system. These applications require their users to serve both as consumers and producers of information in a large distributed information storage system. The idea behind peer-to-peer information sharing is that much of the desired content is stored on individual workstations and not behind some centralized server. Applications like Gnutella allow users to connect directly to each other for the purpose of exchanging information.
From the perspective of this thesis, a common thread that ties all of these applications together is that they all form highly dynamic networks of peers with complex topology. Understanding the nature of these networks, particularly with regard to their topological structure, is the main topic of chapter 2. In addition, applications such as Gnutella and Free Haven [14], which rely on a broadcast search mechanism typically implemented through flooding, are susceptible to a potential negative effect of heterogeneous latencies on message reachability, a phenomenon we call short-circuiting. We examine this phenomenon in detail in chapter 3.
1.2 Modeling P2P Applications
In this section we present our formal model for representing network topology. We model the topology of a P2P network with an undirected graph G whose nodes represent hosts and whose edges represent Internet connections between those hosts. For the remainder of this thesis, we use the term network graph for a graph representing the topological structure of a network. In order to study the effects of latencies on broadcast flooding operations in chapter 3, we will further refine our model to include edge weights denoting network latencies along communication links.
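As a concrete illustration of the model above, the following minimal sketch stores an undirected network graph as an adjacency map and attaches an optional latency weight to each edge. The class name `NetworkGraph`, the host labels, and the latency values are hypothetical and chosen only for illustration; the thesis does not prescribe a particular data structure.

```python
class NetworkGraph:
    """Undirected graph: nodes are hosts, edges are connections between hosts."""

    def __init__(self):
        # Adjacency map: host -> {neighboring host: latency of the link}
        self.adj = {}

    def add_edge(self, u, v, latency=1.0):
        # An undirected edge is stored in both directions with the same weight.
        self.adj.setdefault(u, {})[v] = latency
        self.adj.setdefault(v, {})[u] = latency

    def neighbors(self, u):
        return list(self.adj.get(u, {}))

    def num_nodes(self):
        return len(self.adj)

    def num_edges(self):
        # Every undirected edge appears twice in the adjacency map.
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2


g = NetworkGraph()
g.add_edge("hostA", "hostB", latency=0.05)
g.add_edge("hostB", "hostC", latency=0.30)
g.add_edge("hostA", "hostC")  # unweighted use: latency defaults to 1.0
```

Leaving every latency at its default recovers the unweighted topology graph studied in chapter 2, while distinct latencies give the weighted refinement used in chapter 3.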
1.2.1 Benefits to Modeling
There are many reasons for obtaining an accurate network model. The main ones can be summarized as follows:
Provides insight into the nature of the underlying system: Even if it were possible to catalog all the vertices and edges of a graph, such information would not explain the evolutionary process of the corresponding network, nor would it provide a deeper understanding of its nature.
Enables formal analysis of algorithms: The performance of graph algorithms is closely related to the structural properties of the underlying graph [28]. A well-formulated graph model can aid in the analysis of algorithms operating on such topologies.
Allows generation of realistic topologies for simulation purposes: Besides formal analysis, simulation is a widely used method of assessing the performance of algorithms. However, successful simulations require realistic topologies that accurately capture the important structural characteristics present in the original networks.
Facilitates design of new scalable algorithms: If the nature of a particular topology is well understood, algorithms can be designed to take advantage of particular structural properties.
Helps in understanding of related network structures: A good understanding of the nature of a particular system could lead to a better understanding of other dynamic, decentralized network structures for which complete topological data may not be available.
Allows prediction of future trends: A good network model can be used to simulate future growth, thereby allowing developers to address potential problems in advance.
As we have mentioned earlier, the topology of many P2P networks such as Gnutella is completely defined by usage patterns, or collective phenomena. In this sense, there is a clear relationship between P2P and social networks. In recent years, a lot of research has been done on social network models. In the following chapter we present some of the most notable network models and discuss how they can be adapted for P2P networks. We support our claims with results obtained on the Gnutella network topology.
Chapter 2 Modeling Topology of Large P2P Networks

In this chapter we focus on one major aspect of the overall network model, namely the topology. We analyzed instances of the Gnutella network topology obtained by our network crawler between May and December of 2000. In our analysis, we discovered some important structural properties of the topology graph, such as the small-world properties and several power-law distributions of certain graph metrics. It is our thesis that these properties can be used to test the representativeness of synthetically generated topologies used to model P2P networks such as Gnutella. Conversely, we believe these properties are an essential ingredient of an accurate P2P network topology model.
Here we present our results in the context of other related research. We begin with a brief introduction to small-world networks and their characteristics. We then present our discoveries on Gnutella, showing that the Gnutella network topology exhibits strong small-world properties. Next, we describe several power-laws recently observed in various network structures arising in technology. Finally, we report four of these power-laws characterizing the topology of the Gnutella network. It is our thesis that these power-laws are a fundamental property of many large-scale P2P networks, and therefore must be dealt with in their corresponding models.
2.1 Small-World Networks
The small-world phenomenon in the context of a worldwide social network refers to the widely accepted belief that we are all connected by a short chain of intermediate acquaintances. One of the first experimental studies of this phenomenon was conducted by Stanley Milgram in the late 1960s. Milgram's famous experiment consisted of taking a number of letters addressed to a person in the Boston area, and distributing them to a randomly selected group of people in Nebraska. Each person who received a letter was asked to pass it to someone they knew on a first-name basis in an effort to get it closer to its destination. For the letters that eventually reached their destination, Milgram observed that the average number of steps for a letter to get from Nebraska to Boston was between five and six. The results of Milgram's experiment were the first to quantify the phenomenon, giving birth to the popular expression "six degrees of separation".
One way to model the small-world phenomenon is by a graph whose vertices are people and whose edges connect two people who know each other. Such a graph is often referred to as the human acquaintanceship graph. As suggested by the phenomenon, the acquaintanceship graph is characterized by small diameter. Stated more precisely, its diameter seems to be of the order of log n, where n is the size of the graph. Furthermore, the acquaintanceship graph also shows a tendency to be clustered. Clustering can be thought of as a measure of how well connected each node's neighborhood is. For the human acquaintanceship graph this property seems intuitive, as two people with a common friend are with high probability themselves friends. It is these two properties of clustering and small diameter that define a class of graphs Watts and Strogatz call small-world graphs.
Watts and Strogatz argue in [27] that the structure of many biological, technological, and social networks exhibits small-world behavior. As examples of such networks, they studied the neural network of the nematode worm Caenorhabditis elegans (the only completely mapped neural network), the electric power grid of the western US, and the Hollywood graph. The collaboration graph of film actors, appropriately termed the Hollywood graph, contains 225,000 vertices representing actors and an edge for any two actors who have appeared in a feature film together. Similar collaboration graphs exist for active scientists [17] and even baseball players [24]. Since each of these social networks is a subgraph of the acquaintanceship graph, it is not surprising that they also show the properties of clustering and small diameter.
Without providing a strict mathematical definition, Watts and Strogatz define small-world behavior in terms of two properties, namely the characteristic path length and clustering. In order to quantify these properties for various networks, they defined the characteristic path length L and the clustering coefficient C as follows:
Definition 1. Characteristic path length L, a global property, is defined as the number of edges in the shortest path between two vertices, averaged over all pairs of vertices.
Definition 2. Clustering coefficient C_v, a local (node) property measuring the cliquishness of vertex v, is calculated by taking all the neighbors of v, counting the edges between them, and then dividing by the maximum number of edges that could possibly be drawn between those neighbors. The clustering coefficient C of a graph is defined as the average of C_v over all vertices v.
Table 2.1 shows the L and C values for the three real networks mentioned above, benchmarked against a random graph of the same size. The results clearly demonstrate the small-world phenomenon for these networks: $L \gtrsim L_{random}$ but $C \gg C_{random}$.
              n          L_actual   L_random   C_actual   C_random
Film actors   225,226    3.65       2.99       0.79       0.00027
Power grid    4,941      18.7       12.4       0.080      0.005
C. elegans    282        2.65       2.25       0.28       0.05
Table 2.1: Small-world behavior of three real networks

Recently, Lada Adamic [5] showed that the web hyperlink graph, in which nodes are static home pages and edges are hyperlinks between those pages, is also a small world. In addition, the author demonstrated how this fact could be used to improve the performance of web search engines.
Besides small diameter and clustering, many small-world networks share other important properties:
They tend to be sparse: These graphs all have relatively few edges, considering their vast number of vertices. Stated more precisely, in small-world graphs the number of edges is typically closer to the number of vertices n than to the maximum possible number of edges $\binom{n}{2}$. The Hollywood graph, for example, has 225,000 vertices connected by 13 million edges, far short of the roughly 25 billion in a clique. The largest studied sample of the WWW graph contains 1.5 billion links connecting 200 million pages. This means that only a fraction of about $7.5 \times 10^{-8}$ (about $7.5 \times 10^{-6}$ percent) of all possible edges exists in the WWW graph.
They are self-organizing: Most of these small-world networks are not deliberate constructions. Instead, they can be viewed as naturally occurring artifacts that have developed through some evolutionary process. A good theoretical model for generating realistic small-world topologies must inevitably provide deeper insight into the nature of such a process.
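Definitions 1 and 2 earlier in this section translate directly into code. The following is a minimal sketch, assuming the graph is connected and given as a map from each vertex to the set of its neighbors; the function names and the four-vertex example are hypothetical. One further assumption is labeled in the code: a vertex with fewer than two neighbors contributes C_v = 0, a convention the definitions do not spell out.

```python
from collections import deque


def characteristic_path_length(adj):
    """Definition 1: shortest-path length in edges, averaged over all vertex pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1            # ordered pairs (source, other vertex)
    return total / pairs


def clustering_coefficient(adj):
    """Definition 2: average of C_v over all vertices v."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue                      # assumption: C_v = 0 for fewer than two neighbors
        # Count edges between neighbors of v, each unordered pair once.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
L = characteristic_path_length(example)   # 8 / 6 = 1.333...
C = clustering_coefficient(example)       # (1 + 1 + 1/3 + 0) / 4 = 0.583...
```

Running one breadth-first search per vertex costs O(n(n + m)) time for n vertices and m edges, which is practical for graphs of roughly a thousand nodes such as the Gnutella snapshots analyzed in this chapter.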
2.1.1
Modeling Small-World Networks
The simplest way to model the small-world phenomenon is by means of a uniform random graph. Graphs of this type were thoroughly studied by Erdos and Rnyi in e the 1960s. While these graphs exhibit small diameter, their major limitation as a model of the small-world is that they show no tendency to form clusters. To address this problem, Watts and Strogatz proposed a model based on interpolating between a completely regular and completely random topology [27]. The authors start by taking a highly regular ring lattice topology, created by arranging n vertices in a circle and joining each vertex to its k nearest neighbors for some small constant k. Each edge in the original lattice is then examined and redirected to another randomly chosen destination with probability p. This method allowed the authors to tune the graph between regularity (p = 0) and disorder (p = 1), and thereby to probe the intermediate region 0 < p < 1, about which little is known. Because of the potential rewiring of edges, Watts and Strogatz refer to their model as the rewired ring lattice. Another way to look at this construction process is to observe that all the edges in the original lattice are local contacts. The rewiring process can then simply be viewed as adding a number of long-range contacts. Watts and Strogatz observed that adding only a few such edges results in a dramatic decrease in diameter size while still preserving the clustering property of the original lattice. While the Watts-Strogatz model remains one of the most popular models of the small-world, most of the recent research utilizes a variation of the model proposed by Newman and Watts. In this version, instead of rewiring the existing links, new shortcut links are added. This greatly simplies the analysis by eliminating the possibility present in the original model for a portion of the graph to become disconnected from the rest. 
The model was later generalized by Kleinberg in [19], who introduced an additional parameter, thereby defining an entire family of random networks. Kleinberg showed that the performance of decentralized algorithms varies within this family of network models, proving the existence of a unique model within the family for which decentralized algorithms are effective. The idea most relevant to our thesis is that the small-world property of a network topology can significantly impact the performance of algorithms, such as those for routing, operating on that topology.
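The ring-lattice construction with added shortcuts (the Newman-Watts variant) can be sketched as follows. This is a minimal illustration rather than the generator used for the experiments in this thesis: the function names are hypothetical, k is assumed to be even (k/2 neighbors on each side), and one shortcut trial is made per lattice edge.

```python
import random

def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbors (k assumed even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def newman_watts(n, k, p, seed=None):
    """Keep every lattice edge; for each lattice edge add a random shortcut with probability p."""
    rng = random.Random(seed)
    adj = ring_lattice(n, k)
    for v in range(n):
        for _ in range(k // 2):          # one trial per lattice edge starting at v
            if rng.random() < p:
                w = rng.randrange(n)
                if w != v:               # skip self-loops; repeated edges are absorbed by the set
                    adj[v].add(w)
                    adj[w].add(v)
    return adj
```

Because no lattice edge is ever removed, the resulting graph stays connected for every value of p, which is the property that simplifies the analysis.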
2.1.2
Gnutella as a Small-World
Upon analyzing the Gnutella network topology data obtained by our crawler, we discovered both the small diameter and the clustering properties characteristic of small-world networks. To show this, we calculated the clustering coefficient and the characteristic path length, as defined by Watts and Strogatz, for five different snapshots of the Gnutella topology obtained during November and December of 2000. Since the results presented in this chapter are based on these particular datasets, we present some basic statistics for them in table 2.2.

Snapshot date    Nodes    Edges    Diameter
11/13/2000         992     2465           9
11/16/2000        1008     1782          12
12/20/2000        1077     4094          10
12/27/2000        1026     3752           8
12/28/2000        1125     4080           8

Table 2.2: Statistics for five snapshots of the Gnutella network topology

We present the statistics for the clustering coefficient C and the characteristic path length L in tables 2.3 and 2.4. The values for each are benchmarked against the random graph G(n, p) and the 2-D mesh of the same size (in terms of the number of nodes) as the original Gnutella topology graph. For random graphs, average values over 100 trials are shown.
                 Count source vertex                Do not count source vertex
Snapshot date    Gnutella    G(n,p)      2D mesh     Gnutella    G(n,p)      2D mesh
11/13/2000       0.643587    0.389914    0.413181    0.035122    0.007789    0
11/16/2000       0.701287    0.492788    0.41276     0.010896    0.005636    0
12/20/2000       0.539189    0.268877    0.412366    0.065172    0.009371    0
12/27/2000       0.514996    0.278801    0.41276     0.063023    0.010213    0
12/28/2000       0.521659    0.27966     0.411995    0.054443    0.009013    0

Table 2.3: Values for the clustering coefficient C as defined by Watts and Strogatz in definition 2

Because it is not clear from their definition whether Watts and Strogatz consider each vertex to be a neighbor of itself, we have calculated the results using both methods. Based on the results in table 2.3, we believe the two were not counting the source vertex. However, the results obtained on a 2D mesh, typically regarded as a highly clustered topology, highlight a potential inconsistency with this definition. For this reason we propose a more consistent definition for the clustering coefficient of a graph:

Definition 3 Characteristic coefficient $C(l)_v$ of vertex v is calculated by dividing the number of cross edges in a BFS-tree of depth l and rooted at v, by the maximum possible number of cross edges, given by $\binom{k}{2} - (k - 1)$, where k is the number of vertices in the BFS-tree. Clustering coefficient C(l) of a graph is defined as the average of $C(l)_v$ over all vertices v.
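A minimal sketch of how the BFS-based coefficient of definition 3 could be computed is given below. It is not the code used to produce the results in this chapter. It assumes that the cross edges are all edges among the k vertices within depth l of the root other than the k - 1 tree edges, and that the maximum number of cross edges is k(k - 1)/2 - (k - 1). The graph is a plain adjacency dictionary, and all function names are hypothetical.

```python
from collections import deque

def bfs_ball(adj, root, depth):
    """Vertices of the BFS-tree of the given depth rooted at root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == depth:
            continue                      # do not expand beyond the depth bound
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

def clustering_at(adj, root, depth):
    """C(l)_v: cross edges divided by the maximum possible number of cross edges."""
    ball = bfs_ball(adj, root, depth)
    k = len(ball)
    max_cross = k * (k - 1) // 2 - (k - 1)
    if max_cross == 0:
        return 0.0
    edges = sum(1 for v in ball for w in adj[v] if w in ball) // 2
    return (edges - (k - 1)) / max_cross  # subtract the k - 1 tree edges

def clustering(adj, depth):
    """C(l): average of C(l)_v over all vertices."""
    return sum(clustering_at(adj, v, depth) for v in adj) / len(adj)

def torus(m):
    """m x m 2D torus, used here only as a check."""
    return {(x, y): {((x + 1) % m, y), ((x - 1) % m, y),
                     (x, (y + 1) % m), (x, (y - 1) % m)}
            for x in range(m) for y in range(m)}
```

On a sufficiently large 2D torus, a depth-2 BFS-tree has 13 vertices and 4 cross edges, giving 4/66 ≈ 0.0606, and a depth-3 tree has 25 vertices and 12 cross edges, giving 12/276 ≈ 0.0435. `clustering(torus(10), 2)` and `clustering(torus(10), 3)` reproduce these values.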
(a) l = 2
Snapshot date    Gnutella     BA            WS           G(n,p)        2D torus
11/13/2000       0.0223545    0.0149507     0.0372667    0.00403533    0.0606061
11/16/2000       0.0088999    0.0095887     0.0372356    0.00249125    0.0606061
12/20/2000       0.0300611    0.0178844     0.0537228    0.00618598    0.0606061
12/27/2000       0.0205752    0.0184729     0.0539221    0.00620002    0.0606061
12/28/2000       0.0206982    0.0173541     0.0535703    0.00561928    0.0606061

(b) l = 3
Snapshot date    Gnutella     BA            WS           G(n,p)        2D torus
11/13/2000       0.0141344    0.00693268    0.0110796    0.00391614    0.0434783
11/16/2000       0.0100001    0.00524975    0.0110373    0.00243858    0.0434783
12/20/2000       0.0136551    0.00743268    0.0143759    0.00601365    0.0434783
12/27/2000       0.0125729    0.00773103    0.014582     0.00602383    0.0434783
12/28/2000       0.0122163    0.00718639    0.0142141    0.00545913    0.0434783

Figure 2.1: Values for the clustering coefficient as defined in definition 3 for the Gnutella, Barabási-Albert, Watts-Strogatz, random graph, and 2D torus topologies

We believe our definition to be in better agreement with our intuitive understanding of clustering. Furthermore, it allows us to identify the aspect of clustering in various topologies that contributes to the short-circuiting effect we study in chapter 3. The results for the new clustering coefficient are presented in figure 2.1. Besides the values for Gnutella, the random graph, and the 2D torus, each table also contains results for the Barabási-Albert (discussed in the subsequent section) and the Watts-Strogatz models. The parameters for these models were chosen so that the number of nodes and the average degree of the resulting graph are approximately equal to those of the original Gnutella topology. For example, the Gnutella topology snapshot from 12/20/2000 is compared to the Watts-Strogatz topology generated according to the following parameters: n = 1125, k = 3, and p = 1 (every node gets a random edge; the Newman-Watts version of the model is used).
Snapshot date    Gnutella    BA         WS         G(n,p)     2D mesh
11/13/2000       3.72299     3.47491    4.59706    4.48727    20.6667
11/16/2000       4.42593     4.07535    4.61155    5.5372     21.3333
12/20/2000       3.3065      3.19022    4.22492    3.6649     22
12/27/2000       3.30361     3.18046    4.19174    3.70995    21.3333
12/28/2000       3.32817     3.20749    4.25202    3.7688     22.6667

Table 2.4: Values for the characteristic path length L for the Gnutella, Barabási-Albert, Watts-Strogatz, random graph, and 2D mesh topologies

As can be seen, all of the Gnutella topology instances show the small-world phenomenon: the characteristic path length is comparable to that of a random graph (table 2.4), while the clustering coefficient is considerably higher. These results clearly indicate strong small-world properties of the Gnutella network topology. It is our thesis that this is an important issue to consider when modeling P2P networks such as Gnutella. More specifically, an accurate P2P model must generate topologies exhibiting the described small-world properties. Furthermore, our discovery can aid in designing and predicting the performance of distributed algorithms, such as those for routing and searching. For example, Gnutella's current broadcast routing strategy is not likely to work well on the clustered topology of a small-world network, as it would generate large numbers of duplicate messages. This would result in poor utilization of network bandwidth and hinder scaling, a phenomenon recently observed in practice [13].
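For reference, the characteristic path length can be computed with one breadth-first search per vertex. The sketch below is an illustration under stated assumptions, not the code behind table 2.4: L is taken to be the average hop distance over all ordered pairs of distinct, mutually reachable vertices, and the function names are hypothetical.

```python
from collections import deque

def characteristic_path_length(adj):
    """Average shortest-path length over ordered pairs of distinct reachable vertices."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1            # exclude the pair (s, s)
    return total / pairs

def mesh(m):
    """m x m 2D mesh (grid without wrap-around), used here only as a check."""
    adj = {(x, y): set() for x in range(m) for y in range(m)}
    for (x, y) in adj:
        for nb in ((x + 1, y), (x, y + 1)):
            if nb in adj:
                adj[(x, y)].add(nb)
                adj[nb].add((x, y))
    return adj
```

For an m × m mesh this average equals 2m/3, so `characteristic_path_length(mesh(31))` gives 20.6667, the 2D mesh value listed for the 11/13/2000 snapshot.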
2.2
Power-Laws
The major limitation of the described small-world models stems from increasing evidence of various power-laws of the form $y = x^a$ governing the distribution of various graph metrics for many large, self-organizing networks [15, 10, 11, 20]. Faloutsos et al. [15] discovered four of these power-laws characterizing the topology of the Internet at both the inter-domain and the router level. These power-laws are defined as follows:

Power-Law 1 (rank exponent R): The outdegree, $d_v$, of a node v, is proportional to the rank of the node, $r_v$, to the power of a constant, R: $d_v \propto r_v^{R}$. The rank $r_v$ of a node v is defined as its index in the order of decreasing outdegree.

Power-Law 2 (out-degree exponent O): The frequency, $f_d$, of an out-degree, d, is proportional to the out-degree to the power of a constant, O: $f_d \propto d^{O}$.

Power-Law 3 (hop-plot exponent H): The total number of pairs of nodes, P(h), within h hops, is proportional to the number of hops to the power of a constant, H: $P(h) \propto h^{H}$, $h \ll \delta$, where $\delta$ is the diameter. The number of pairs P(h) is the total number of pairs of nodes within less than or equal to h hops, including self-pairs, and counting all other pairs twice.

Power-Law 4 (eigen exponent E): The eigenvalues, $\lambda_i$, of a graph are proportional to the order, i, to the power of a constant, E: $\lambda_i \propto i^{E}$.
Several research groups have also independently discovered evidence of the same power-laws describing structural properties of the web graph [10, 11, 20]. Since these discoveries occurred on various scales and levels of granularity, they could be taken as indications of a possible self-similar or fractal nature of the web. Of particular interest is the fact that all of these groups reported practically identical values for the power-law 2 exponent, ranging between 2.1 and 2.2. This observation led the authors in [15] to suggest the use of power-law exponents as a way of characterizing different families of graphs. In addition, they demonstrated how these exponents could be used to approximate important graph metrics, such as the number of nodes, the number of edges, the average neighborhood size, and the effective diameter. Albert, Jeong, and Barabási went even further to argue the scale-invariant nature of the power-law distributions, suggesting that large networks self-organize into a scale-free state, a feature unpredicted by all existing random graph models [10].

The significance of these power-laws is that they clearly outline the inadequacy of the described small-world models to accurately capture the true nature of many large networks. The problem is that these models do not explain the existence of highly connected nodes, a simple consequence of power-law 2. The described power-law observations have therefore opened up a search for alternative techniques for generating realistic network topologies that exhibit such power-law phenomena.
2.2.1
Power-Law Models
Based on the discoveries described above, a number of alternative models have been proposed that produce graphs exhibiting the observed power-law properties. While some set out to synthetically reproduce various power-law distributions, accepting them as empirical facts, others attempt to provide an explanation as to the origin of such phenomena. An example of the latter is a model proposed by Barabási and Albert [10]. The two argue that the existence of power-laws in many real networks is caused by two key features: growth and preferential attachment. Growth describes the dynamic nature of many real networks, in which new vertices are continuously added. Preferential attachment models the fact that in real networks, new vertices are more likely to link to existing vertices of high degree, resulting in the so-called rich-get-richer phenomenon. In the case of the web graph, these two features are evident as new pages are created daily, typically containing hyperlinks to already highly connected and therefore highly visible pages.

Barabási and Albert build their model by starting with a small number of vertices and no edges. Then, a new vertex is added at each time step by linking it to m other vertices already present in the system. The existing vertices are chosen with probability proportional to their degree. This process produces a random graph that reaches a steady state characterized by the same power-law distribution observed in many real networks. Notice that, without continuous addition of new vertices, this model would eventually produce a clique, as all the vertices would ultimately be connected. In fact, the authors proved that both growth and preferential attachment are necessary to correctly model the behavior of real networks: the growth factor ensures a stationary power-law distribution, and preferential attachment is responsible for its scale-free nature. The Barabási-Albert model possesses a certain intuitive appeal, particularly when used to model the topology of P2P networks such as Gnutella. Recently, a topology generator called BRITE was proposed that produces graphs exhibiting all four of the discussed power-laws, based on factors such as the growth and preferential attachment studied by Barabási and Albert [21]. We are currently experimenting with adopting this model for P2P networks such as Gnutella.

If the goal is simply to generate graphs that match exactly the power-law properties observed empirically, then the graph model proposed by Aiello, Chung, and Lu could be used [7]. This model involves two parameters, α and β, representing the intercept and the slope of the plot of the degree distribution on a log-log scale. Since any fixed pair of values for α and β defines a finite set of graphs, the authors propose simply selecting a graph from this set at random. More recently, Internet topology generators have been proposed that subscribe to the same philosophy of using power-laws to guide graph construction [23].
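The growth and preferential attachment steps of the Barabási-Albert construction can be sketched as follows. This is an illustrative sketch, not the generator used for the BA columns in this chapter. Since degree-proportional selection is undefined while every vertex has degree zero, it assumes that the first new vertex links to all m seed vertices. The function name is hypothetical.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a graph to n vertices; each new vertex links to m existing ones,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = []
    # Each vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = []
    targets = list(range(m))      # assumption: first new vertex joins all m seed vertices
    for v in range(m, n):
        for t in targets:
            edges.append((v, t))
            endpoints.extend((v, t))
        chosen = set()
        while len(chosen) < m:    # m distinct targets for the next new vertex
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges
```

For example, `barabasi_albert(1000, 4, seed=0)` yields a graph with 1000 vertices and 3984 edges; sorting its degree sequence should reveal a small number of highly connected vertices, which the uniform random graph lacks.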
2.2.2
Power-Laws in Gnutella
Upon analyzing the Gnutella topology data obtained using our network crawler, we discovered that it obeys all four of the power-laws described in the previous section. The results for power-laws 1 through 4 are presented in figures 2.2, 2.3, 2.4, and 2.5, respectively. Power-law relationships between variables are typically plotted on a logarithmic scale, since their plot should, by definition, appear linear. Power-law exponents can then be defined as the slope of this linear plot. We used linear regression to fit a line to a set of two-dimensional points using the least-squares method. To quantify the validity of the approximation, with each figure we include the absolute value of the correlation coefficient r, which ranges between −1 and 1. An |r| value of 1 indicates perfect linear correlation.

As mentioned earlier, power-law 1 is evaluated by sorting all nodes in descending order according to their degree, and plotting degree versus rank of a node in this sequence on a log-log scale. For comparison, we present plots for both a snapshot of the Gnutella network topology and a simple connected random graph of the same size. Figure 2.2 shows that this power-law holds for the Gnutella topology instance with rank exponent R = 0.98 and a correlation coefficient of 0.94, which cannot be said for the random topology.

Figure 2.2: Log-log plots of degree versus rank (power-law 1)
  (a) Gnutella 12/28/00 (|r| = 0.94), legend: Gnutella 12/28/2000, exp(6.04022)*x**(1.42696)
  (b) Random graph

Power-law 2 is of particular importance, because it is the one most frequently cited in recent studies of large network topologies. Figure 2.3 shows a node degree power-law exponent of 1.4 for the Gnutella topology. We must remark that a group called Clip2 independently discovered this particular power-law for the Gnutella network topology [13]. However, they reported a power-law exponent of 2.3, in disagreement with our result. We believe this discrepancy arises because our results are based on network crawls performed during December of 2000, while the other result dates back to the summer of the same year. Since that time, the Gnutella network has undergone significant changes in terms of its structure and size, as described in [13]. While the values of the node degree exponent O for all of the Gnutella topology instances obtained during the month of December are consistently around 1.4, we have observed O values of 1.6 for the data obtained in November. This may be taken as an indication of a highly dynamic, evolving state of the Gnutella network. We are nevertheless currently attempting to establish contact with people from Clip2 in order to further examine the reasons for this discrepancy. Interestingly, power-law degree distributions have recently been reported for another file-sharing P2P application, Freenet [22].
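The fitting procedure described in this section (least-squares regression on log-log data, with |r| as the quality measure) can be sketched as follows. This is a minimal illustration using only the standard library, not the tool used to produce the figures. The function names are hypothetical, and all inputs are assumed to be strictly positive and not all equal.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y).
    Returns (exponent, log-intercept, |r|) so that y is approximately exp(intercept) * x**exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)        # correlation coefficient of the log-log points
    return slope, intercept, abs(r)

def rank_degree_points(degrees):
    """Power-law 1 data: degrees sorted in decreasing order, paired with ranks 1, 2, ..."""
    ordered = sorted(degrees, reverse=True)
    return list(range(1, len(ordered) + 1)), ordered
```

Given a list of node degrees `deg`, calling `fit_power_law(*rank_degree_points(deg))` returns an estimate of the rank exponent together with |r|.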
Figure 2.3: Log-log plot of frequency versus degree (power-law 2)
  (a) Gnutella 12/28/00 (|r| = 0.96), legend: Gnutella 12/28/00, exp(7.27358)*x**(0.98116)
  (b) Random graph

It has been shown that power-laws 3 and 4 hold for almost all types of topologies, including random, regular, and hierarchical [21]. Power-law 3 by definition holds for regular topologies such as a ring topology and a 2-D mesh, with hop-plot exponents of 1 and 2, respectively, for $h \ll \delta$. It is therefore not surprising that we have also observed these power-laws in the Gnutella network topology. However, a case has been made that, while the mere presence of these two power-laws is not a distinguishing property of a graph, the values of their exponents can be. For this reason, instead of plotting power-laws 3 and 4 for a single instance of the Gnutella topology and a random graph of the same size, we compare results for several different snapshots of the Gnutella topology.

Figure 2.4 shows the hop-plots for four of the Gnutella topology snapshots described previously. For each one, we approximated only the first four hops. Clearly, power-law 3 holds for all four snapshots with very high correlation coefficients of 0.99. More importantly, the hop-plot exponents seem to be clustered tightly around the value of 3.5. Notice that this value lies right between the exponent values reported for the inter-domain and router level topology instances of the Internet [15]. Like the authors in [15, 21], we must concede that the results for this particular power-law may be misleading given such a small number of data points. This limitation is imposed by the fact that these graphs have a small diameter.

Figure 2.4: Log-log plot of the number of pairs of nodes versus the number of hops (power-law 3) for four snapshots of the Gnutella topology
  (a) Gnutella 11/16/00 (|r| = 0.99), legend: Gnutella snapshot 11/16/2000, exp(8.36937)*x**(3.48228), maximum number of pairs
  (b) Gnutella 12/20/00 (|r| = 0.99)
  (c) Gnutella 12/27/00 (|r| = 0.99)
  (d) Gnutella 12/28/00 (|r| = 0.99)

An application of power-law 3 that seems particularly applicable to Gnutella was suggested by the authors in [15]. They introduced the concept of the effective diameter $\delta_{ef}$, which is essentially the number of hops required to reach a sufficiently large portion of a network. In other words, any two nodes are within $\delta_{ef}$ hops of each other with high probability. We present the definition below for convenience.

Definition 4 (effective diameter) Given a graph with N nodes, E edges, and hop-plot exponent H, the effective diameter, $\delta_{ef}$, is defined as:

$\delta_{ef} = \left( \frac{N^2}{N + 2E} \right)^{1/H}$

Substituting the values for the Gnutella topology snapshot from December 28, 2000, we find that, at that time, a better value for the maximum TTL would have been 4 (instead of 7, which is the default specified by the Gnutella protocol).

Similar trends to the ones reported for the hop-plots appear in the eigenvalue plots. Figure 2.5 shows the first 20 eigenvalues versus their order on a log-log scale for the Gnutella topology snapshots. Once again, we see the consistency of power-law exponents across different snapshots. Interestingly, the exponents for the snapshots obtained during the month of December are practically equal, while the exponent for the snapshot from November is slightly smaller. Again, this may be taken as an indication that the Gnutella network was going through an evolutionary state, captured by these power-law exponents. There is a rich literature proving that the eigenvalues of a graph are closely related to its topological properties. In the future, we plan to further analyze the eigenvalues of P2P network topologies and their practical implications.

Our empirical results clearly outline strong power-law properties of the Gnutella network topology. It is our thesis that these properties can be utilized to improve the performance of algorithms such as those used for searching [6]. In addition, we believe that an accurate model of the network topology of P2P network applications such as Gnutella must exhibit power-laws 1 and 2, as well as produce all four power-law exponents in close agreement with the ones observed empirically.
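The hop-plot and the effective diameter can be computed as sketched below. The pair counts follow the convention of power-law 3 (self-pairs included, every other pair counted twice), so P(0) = N and P(1) = N + 2E. The effective diameter is taken as (N^2 / (N + 2E))^(1/H), following [15]. The function names are hypothetical, and the value H = 3.5 in the usage note is the approximate exponent quoted in the text, not a fitted value for any one snapshot.

```python
from collections import deque

def hop_plot(adj):
    """P(h) for h = 0, 1, ...: ordered pairs of nodes within h hops, self-pairs included."""
    counts = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        for d in dist.values():
            counts[d] = counts.get(d, 0) + 1
    plot, running = [], 0
    for h in range(max(counts) + 1):
        running += counts.get(h, 0)      # cumulative count up to h hops
        plot.append(running)
    return plot

def effective_diameter(n, e, hop_exponent):
    """Effective diameter for n nodes, e edges, and hop-plot exponent H."""
    return (n * n / (n + 2 * e)) ** (1.0 / hop_exponent)
```

With the 12/28/2000 statistics from table 2.2 (N = 1125, E = 4080) and H ≈ 3.5, `effective_diameter(1125, 4080, 3.5)` evaluates to roughly 4.07, consistent with the suggested maximum TTL of 4.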
Figure 2.5: Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots of the Gnutella topology
  (a) Gnutella 11/16/00 (|r| = 0.97), legend: Gnutella 11/16/2000, exp(2.27850)*x**(0.22301)
  (b) Gnutella 12/20/00 (|r| = 0.89), legend: Gnutella 12/20/2000, exp(2.83511)*x**(0.30114)
  (c) Gnutella 12/27/00 (|r| = 0.94), legend: Gnutella 12/27/2000, exp(2.82127)*x**(0.29278)
  (d) Gnutella 12/28/00 (|r| = 0.94), legend: Gnutella 12/28/2000, exp(2.81997)*x**(0.29412)
Chapter 3 Modeling Network Latencies

In this chapter we further refine our model of P2P networks to include traffic. In particular, we study the effects of heterogeneous latencies on reachability in P2P network applications operating under flooding protocols. We call this potentially devastating effect short-circuiting. Traditionally, latency has been studied to model network performance as it relates to throughput. Network reachability has traditionally been studied through the analysis of distance in graphs. In this work, we point towards a novel fact: heterogeneous latencies can significantly impact reachability, independent of distance. We begin with a brief introduction of short-circuiting. We then present our formal model for studying the effects of short-circuiting. Finally, we report our results from both network simulation studies and empirical tests performed on Gnutella. We conclude based on these results that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall system.
3.1
Latency Effects
We have seen in chapter 1 that P2P applications are inherently decentralized and therefore rely on efficient decentralized algorithms for communication between hosts. As a result, many of these applications, including Gnutella, have adopted a flooding mechanism to forward messages in an effort to maximize reachability. Reachability, or the number of hosts receiving a particular message, is an important performance metric for many P2P applications, particularly those used for file-sharing. Flooding dictates that each host simply forward each received message to all of its neighbors, except the one from which the message was received. As such, flooding provides a simple and effective way of broadcasting messages in a dynamically changing network without requiring routing tables or knowledge of the global network topology. However, it clearly does not scale for Internet-wide applications, as it generates a large number of redundant messages and uses all available paths across the network. For this reason, in practice, flooding is typically implemented in combination with one or more of the following standard governing mechanisms designed to restrict its scope and limit redundant messages:

Mechanism 1. Time-to-Live Bounds. Time-to-Live (TTL) is a governing mechanism that prevents messages from traveling farther than a specified number of hops, defined by an initial TTL value. TTL bounds are implemented by including in each message header a TTL value field. When a node receives a message, it first checks whether the TTL value is greater than zero. If so, the node continues the flood with a decremented TTL. Otherwise the message is dropped.

Mechanism 2. Unique Message Identification. Unique Message Identification is a mechanism that prevents unique messages from being transmitted more than once from each node. This mechanism is implemented by including in each message header a UID (a unique ID label, or unique sequence number). When a node receives a message, it checks whether it has previously seen that message. If it has, the message is dropped and not forwarded. Otherwise, the node stores the new UID in a local table, and then continues the flood.

Mechanism 3. Path Identification. Path Identification is a mechanism that prevents message paths from looping. This mechanism is implemented by including in each message a header that records which nodes of the network have already encountered the message. Before forwarding a message, each node checks the header to verify whether or not it has previously seen the message. If so, the message is dropped and not forwarded. If not, the node adds its name to the header, and then continues the flood.

Ordinarily, a broadcast operation functioning under these mechanisms should reach all nodes within the TTL bound of the broadcast source. However, we have discovered that network latencies can negatively impact the reachability of broadcast operations. We define latency as the time it takes a message to traverse a link in the network. We will show that, when Mechanisms 1 and 2 are implemented together, heterogeneous network latencies can potentially have a devastating effect on reachability. We call this phenomenon the short-circuiting effect, and describe it as follows:

Short-circuiting Effect. Consider a message broadcast from a source node a, and consider a path $P = \{u_1, u_2, \ldots, u_p\}$ joining nodes $a = u_1$ and $b = u_p$. It is possible that there may be no throughput of the broadcast messages from a to b along P, even if the hop-length p of the path P is less than or equal to the TTL value t. This can result from heterogeneous latencies, as the following scenario shows. Suppose there exists a message path Q from a to some intermediate node $x = u_i$ of P, having a strictly smaller latency (but possibly a greater hop number). Then a broadcast message originating from a and following path P will be killed (by Mechanism 2) when it reaches x, since it is the duplicate of an earlier arriving message originating from a but following path Q. Notice that there may also be no throughput along the path R consisting of the path Q together with the subpath of P from x to b. This results from the fact that R may have a hop-length strictly greater than t, and hence, by Mechanism 1, there is no throughput of the broadcast message originating at a along path R. Indeed, there may be no throughput of the broadcast message along any path from a to b; it is this latency effect on reachability which we call short-circuiting.

For the remainder of this chapter, we will consider broadcasts as operating under the combination of Mechanisms 1 and 2. Note that short-circuiting-like effects cannot be caused by the combination of Mechanisms 1 and 3, since, in that case, all loop-free paths within the TTL bound are valid message paths.
3.2
Modeling the Short-Circuiting Effect
In order to analyze the problem of short-circuiting, we refine our network model from chapter 1 to include edge weights representing latency values on communication links. We consider the latency of a message path to be the sum of the latencies of its edges. The flooding operation governed by mechanisms 1 and 2 in a network G is defined by the following protocol regimen. We denote packets in the network by p(u, t, h), with unique message identifier UID = u, initial TTL value TTL = t, and current hop-value HOP = h. The hop-value denotes the number of hops from the packet's source node. We denote a packet (ready for broadcast) originating at node s, with initial TTL = t, by $p(u_s, t, 0)$. The broadcast regimen operates as follows, and defines the valid message paths associated with the transmission of the broadcast packet.

1. Source s sends $p(u_s, t, 0)$ to all the neighbors of s, injecting the packet on all links connected to s at the same time.
2. Nodes process packets on a first-come-first-served basis as follows: when a node v receives packet $p(u_s, t, h)$, it checks whether the UID $u_s$ has been seen previously. If it has, then the packet is dropped with no further processing.
3. If not, then v records $u_s$ in its local table and checks whether t > h. If t > h, then v replicates and forwards the message $p(u_s, t, h + 1)$ (with incremented hop count) to all neighbors except the node from which it received the packet. If t = h, then the packet is dropped and not forwarded.

When latencies are introduced into this model of a flooding broadcast, complications arise as to the reachability of nodes. To determine reachability, it is not sufficient to consider only minimum-cost paths from s to v. In order to quantify reachability, we introduce the notion of a horizon, defined as follows:

Definition 5 The t-horizon R(s, t) from a source node s is the set of all nodes v which receive a packet $p(u_s, t, \cdot)$ broadcast from s with TTL = t. The t-neighborhood N(s, t) from a source node s is the set of all nodes within a hop-distance of t from s. Likewise, for a set of source nodes S, we denote by R(S, t) and N(S, t) the t-horizon and the t-neighborhood, respectively, from S, where we assume that the broadcast is initiated by each $s \in S$ simultaneously.

In the subsequent sections, we present our experimental results on the size of the t-horizon as a function of latencies under the described broadcast model.
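The broadcast regimen above lends itself to a small event-driven simulation. The sketch below is an illustration, not the simulator used for the results in this chapter. It makes an explicit convention that should be treated as an assumption: a packet arrives at a node with a hop count equal to the number of links it has traversed, and it is forwarded only while that count is less than t, so the horizon is always contained in the t-neighborhood. The graph is a dictionary mapping each node to a dictionary of neighbor latencies, and the function names are hypothetical.

```python
import heapq
from collections import deque
from itertools import count

def t_horizon(adj, source, ttl):
    """Nodes that receive a flooded packet under TTL bounds plus UID duplicate suppression."""
    seen = {source}                       # nodes that have recorded the UID
    order = count()                       # tie-breaker so the heap never compares node labels
    events = []
    if ttl >= 1:
        events = [(lat, next(order), nb, 1, source) for nb, lat in adj[source].items()]
    heapq.heapify(events)
    while events:
        time, _, node, hops, sender = heapq.heappop(events)
        if node in seen:
            continue                      # duplicate UID: dropped (Mechanism 2)
        seen.add(node)
        if hops < ttl:                    # TTL not yet exhausted (Mechanism 1)
            for nb, lat in adj[node].items():
                if nb != sender:
                    heapq.heappush(events, (time + lat, next(order), nb, hops + 1, node))
    return seen - {source}

def t_neighborhood(adj, source, ttl):
    """Nodes within hop-distance ttl of the source, ignoring latencies."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return {v for v, d in dist.items() if 0 < d <= ttl}

# Path P = a-x-b has 2 hops, but the detour Q = a-c-x is faster than the direct link a-x.
slow = {'a': {'x': 10, 'c': 1}, 'c': {'a': 1, 'x': 1},
        'x': {'a': 10, 'c': 1, 'b': 1}, 'b': {'x': 1}}
uniform = {u: {v: 1 for v in nbrs} for u, nbrs in slow.items()}
```

With t = 2, `t_horizon(uniform, 'a', 2)` contains b, whereas `t_horizon(slow, 'a', 2)` does not: the copy arriving at x via c has already used both hops, and the later direct copy from a is discarded as a duplicate. The 2-neighborhood of a is {b, c, x} in both cases.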
3.3
Empirical experiments
We have conducted a series of experiments to empirically test the effects of short-circuiting. These experiments are divided into two categories: simulations performed on various static network topologies and empirical tests performed on a real P2P network application. For the latter, we use Gnutella as our case study.
3.3.1
Gnutella Studies
We have already mentioned Gnutella as a rapidly evolving technology based on the peer-to-peer network model. In this section we continue our case study of Gnutella with the analysis of short-circuiting effects on reachability. In order to see why Gnutella presents a meaningful testbed for studying the problem of short-circuiting, let us briefly describe its design. Gnutella's application-level protocol supports two basic types of broadcast requests: ping, which is essentially a request for a host to announce itself, and query. These messages are propagated through the network by means of a flooding broadcast. The response messages are then routed back along the same path by which the original request arrived, by means of dynamically updated routing tables maintained by each host. The flooding in Gnutella is implemented using mechanisms 1 and 2 described in previous sections, with the Gnutella software generally limiting TTL values to at most 7. Its routing protocol, together with heterogeneous latencies, makes Gnutella potentially vulnerable to the short-circuiting effects we have described.

Our original interest in the effects of short-circuiting arose from an experiment that involved crawling and mapping the entire Gnutella network. In particular, we noted that the number of reachable hosts reported by a client was substantially less than the number obtained from an off-line analysis of the generated topology map. This analysis consisted of calculating the number of elements in the BFS tree rooted at a node representing that particular client. We consistently noted discrepancies of this nature of approximately one half. After conjecturing that short-circuiting may play a substantial role in such discrepancies, we attempted to verify this empirically.

Figure 3.1: The results of level-1 short-circuiting effects on the broadcast horizon on the Gnutella network, October 2000. The y-axis represents the broadcast horizon size, and the x-axis labels each of 68 broadcast trials. The top line is the resulting horizon from multiple distinct broadcasts from the same source, and the lower line is the resulting horizon from a single broadcast message from a single source. The discrepancy represents level-1 short-circuiting effects.

To test our hypothesis, we have devised an experimental method of discovering what we call the level-1 short-circuiting effect. These are the effects of short-circuiting caused by paths interfering at the first level. That is, in our experiments we compare the 7-horizon of a message broadcast from v with the 6-horizon of distinct message broadcasts from the neighbors of v. The idea is that sending messages with distinct ID labels will prevent them from interfering with each other, thereby allowing us to measure a subset of the total short-circuiting effect. The actual number of hosts reached by the broadcast of the shared message is compared to the union of the host sets reached by the set of distinct broadcast messages. More refined estimates of short-circuiting effects can be obtained by comparing the hop counts of messages responding to a shared broadcast to the hop counts of messages responding to distinct broadcasts: if the former is larger than the minimum of the latter, then we posit that short-circuiting has occurred. Figure 3.1 shows the results of a particular experiment of this nature conducted in October of 2000. We note that the observed reductions average 55%.
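The comparison behind the level-1 measurement can be expressed as a small set computation. The sketch below is only an illustration of the idea and not the measurement tool used on Gnutella. It assumes that each broadcast yields a mapping from responding host to observed hop count, and that hop counts measured from a neighbor of v are one less than the corresponding hop counts measured from v. The function and parameter names are hypothetical.

```python
def level1_short_circuit(shared, distinct):
    """shared:   host -> hop count for the single shared broadcast from v.
    distinct: list of host -> hop count mappings, one per neighbor of v,
              each obtained with its own message ID.
    Returns (missed hosts, hosts reached over a longer path, fraction missed)."""
    best = {}
    for responses in distinct:
        for host, hops in responses.items():
            via_v = hops + 1              # assumption: one extra hop from v to its neighbor
            if host not in best or via_v < best[host]:
                best[host] = via_v
    missed = set(best) - set(shared)
    detoured = {h for h, hops in shared.items() if h in best and hops > best[h]}
    reduction = len(missed) / len(best) if best else 0.0
    return missed, detoured, reduction
```

A host in `missed` was reachable through some neighbor but never answered the shared broadcast; a host in `detoured` answered, but over a path longer than the shortest one observed through the distinct broadcasts.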
Figure 3.2: Horizon size versus t (series: 2 servers, 3 servers).
In another set of experiments we focused on the t-horizon as a function of the TTL value. We performed the experiment by connecting to a set of servers and sending successive ping messages with increasing TTL. Figure 3.2 shows the results of one such experiment using two and three broadcast servers. As predicted by short-circuiting, we observed a decrease in t-horizon once the TTL exceeded a certain threshold, typically around 5. We have been able to explain this phenomenon analytically in [9]. This particular experiment required connections to selected servers to persist over a longer period of time, so that a number of test trials could be performed.
Difficulties in conducting experiments on Gnutella. Overall, we have found it quite challenging to isolate the effects of short-circuiting, as well as other phenomena, on the Gnutella application. The challenge has been mainly due to system instability, both in terms of topology and latencies. One of our preliminary experiments focused on measuring variance in the size of the broadcast horizon over time. We found that several identical tests of horizon size, performed consecutively, can differ drastically in their results. Figure 3.3 shows the size of the broadcast horizon over time using four broadcast servers. Each data point represents the horizon size for a particular broadcast trial, with trials performed consecutively at six-minute intervals.
Figure 3.3: Horizon-size variation over time with the broadcasting client using multiple connections on the Gnutella network, March 2001. The y-axis represents the horizon size, and the x-axis labels each of 180 broadcast trials, performed consecutively at six-minute intervals.
We attribute this phenomenon to the highly dynamic nature of the network and constantly changing network conditions and topology. (We remark that in our network simulations, we have also observed that slight changes in latency distribution can result in dramatic changes in the size of the t-horizon.) Such high variance, as well as the existence of a number of factors influencing the actual number of hosts reached, makes it challenging to obtain meaningful results.
By far the biggest challenge to isolating the effects of short-circuiting on Gnutella is the emergence of a new generation of intelligent Gnutella clients. These clients contain built-in application logic designed to promote overall network health by conserving bandwidth. While such clients have succeeded in allowing the Gnutella network to scale up to about five times its original size, they have also created a serious obstacle to conducting sophisticated experimental studies on the network. To see this, consider a simple procedure for calculating the size of the t-horizon in Gnutella, performed by sending a ping message and counting the number of responses. Figure 3.4 shows the results of an experiment in which eight of these procedures were performed simultaneously.
Figure 3.4: Difficulty in conducting experiments on today's Gnutella network (series: ping1 through ping8).
As the figure shows, typically only one of these procedures results in a considerable number of responses. The reason is that Gnutella clients are now intelligent enough to recognize when messages are the same, and will forward only one of them. In addition, many clients now cache the responses to ping and query messages for a certain amount of time. While such design decisions are understandable from a performance standpoint, they also effectively take away the ability to accurately determine the exact size of the broadcast horizon in Gnutella at any given time. As a result we have found it extremely difficult to repeat experiments such as those reported in figures 3.1 and 3.2 on the current system.
Because of the difficulties with measuring short-circuiting effects directly on the application, we turned our attention to a series of network simulation studies in which we were able to precisely isolate the effects of short-circuiting on theoretical network topologies.
3.3.2 Network Simulation Studies
In order to study the practical impact of short-circuited t-horizon reductions, we needed to carefully consider both the topology of the network and the assignment of latencies. Simulated studies allowed us to isolate the effects of short-circuiting on fixed topologies. We conducted the simulations using our network simulator gnutsim, based on a modified version of Dijkstra's shortest path algorithm. The Java source code for gnutsim is given in appendix B.
To carry out these simulations, we needed to choose the network topological model, as well as the network latency model. We report in this chapter on a number of well-known regular topologies, such as the mesh and the hypercube, as well as the Watts-Strogatz small-world topology and snapshots of the Gnutella topology obtained through crawling. To model network latencies we used several classes of weights representing various commonly used Internet connection bandwidths. We conducted our experiments using random distributions of these weights.
We present the statistics of our simulation studies as tables, which report the reduction ratios in reachability caused by short-circuiting, given randomly chosen latencies on a fixed topology. Each table is associated with a fixed topology. Each row of the table represents results from 100 trials using random latencies. In each row we report, for a fixed t, the worst, average, and best observed t-horizon, and the t-neighborhood (which is equal to the t-horizon when using uniform latencies). We then give the reduction ratios, obtained by dividing the worst t-horizon by the t-neighborhood (column WRR) and the average t-horizon by the t-neighborhood (column MRR). Figure 3.5 presents the results for the Watts-Strogatz small-world topology. The histogram in part (b) represents the distribution of t-horizon values over 100 trials using random latencies for t = 10, which is the value of t for which the reduction ratios are the most severe. The results for other topologies are presented in appendix C.
TTL   Worst   Avg    Best   Nbhd    WRR    MRR
1     8       8      8      8       100%   100%
2     18      21     24     25      72%    84%
3     24      47     66     69      35%    68%
4     43      84     124    138     31%    61%
5     67      150    238    310     22%    48%
6     121     274    424    678     18%    40%
7     278     498    723    1399    20%    36%
8     434     819    1364   2771    16%    30%
9     765     1388   2307   5018    15%    28%
10    977     2148   3420   7729    13%    28%
11    2030    3153   4549   9449    21%    33%
12    2252    4290   5812   9928    23%    43%
13    3692    5519   6599   9994    37%    55%
14    4995    6392   7563   10000   50%    64%
(a) Reduction ratios for the Watts-Strogatz topology
(b) Histogram of 1000 trials with random distribution of latencies (t = 10)
Figure 3.5: Short-circuiting effects for the Watts-Strogatz topology (nodes = 10000, k = 3, p = 0.2)
Observations and Conclusions. Our empirical results indicate that, in practice, the effects of short-circuiting are not as devastating as suggested by the theoretical results in [9]. We have observed the most significant impact on small-world topologies such as our Gnutella snapshots and Watts-Strogatz network models. For these graphs, we have observed reduction ratios in t-horizon size of over 90% in the worst case, for certain values of t. In other words, we have observed that with random latencies one can expect instances where the ratio of the size of the t-neighborhood to the size of the t-horizon is greater than 10 to 1, as shown in figure ??. Furthermore, the histogram in the same figure shows that the reduction in reachability caused by short-circuiting was always greater than 50% using random latencies. In our experimental studies we have also observed that both random graphs and highly structured graphs such as the mesh and hypercube tend to have, on average, less pronounced short-circuiting effects than small-world graphs. Intuitively, this can best be understood by considering the potentially stimulating effect of the clustering property, as defined in chapter 2, on short-circuiting.
In general, for a fixed TTL = t, the t-horizon sizes tend to be normally distributed with small variance, independent of network topology. We have also observed that, independent of topology, mean reduction ratios depend on the value of TTL = t. Our results suggest that the reduction in reachability becomes more severe as t increases until a certain threshold is reached, usually at about the point where t equals half the network radius or diameter, after which the reduction becomes less severe.
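The kind of simulation described in this section can be illustrated with a short Python sketch. It is not gnutsim (whose Java source is in appendix B); it is a minimal model written under stated assumptions: latencies are symmetric per edge, a host keeps only the first copy of a message that arrives, and that copy is forwarded only if it has hops left. The function names, the toy graph, and the latency values are hypothetical.

```python
import heapq
import random
from collections import deque


def t_neighborhood(adj, source, t):
    """Hosts within t hops of source (plain BFS), excluding the source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == t:
            continue                      # TTL exhausted at this depth
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist) - {source}


def t_horizon(adj, latency, source, t):
    """Hosts reached by one flooded message under TTL and duplicate-drop rules.

    Copies of the message are processed in order of arrival time, as in
    Dijkstra's algorithm, but each host keeps only the first copy it sees
    and forwards it only if that copy has hops left.
    """
    hops_used = {source: 0}
    events = [(latency[frozenset((source, v))], 1, v) for v in adj[source]]
    heapq.heapify(events)
    while events:
        time, hops, node = heapq.heappop(events)
        if node in hops_used:
            continue                      # duplicate message ID: dropped
        hops_used[node] = hops
        if hops < t:
            for v in adj[node]:
                if v not in hops_used:
                    arrival = time + latency[frozenset((node, v))]
                    heapq.heappush(events, (arrival, hops + 1, v))
    return set(hops_used) - {source}


def random_latencies(adj, classes, rng):
    """Assign every undirected edge a latency drawn uniformly from classes."""
    return {frozenset((u, v)): rng.choice(classes) for u in adj for v in adj[u]}


# Toy graph: s-a and a-b are fast links, s-b is slow, d hangs off b.
adj = {"s": ["a", "b"], "a": ["s", "b"], "b": ["s", "a", "d"], "d": ["b"]}
latency = {frozenset(("s", "a")): 1, frozenset(("a", "b")): 1,
           frozenset(("s", "b")): 10, frozenset(("b", "d")): 1}
horizon = t_horizon(adj, latency, "s", 2)        # b is first reached via a
neighborhood = t_neighborhood(adj, "s", 2)
```

In the toy graph, host b first receives the message over the two-hop path through a, so it has no hops left to forward it, and the later one-hop copy is dropped as a duplicate. Host d is therefore in the 2-neighborhood of s but not in its 2-horizon. With uniform latencies the two sets coincide.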
Chapter 4 Gnutella Crawler Implementation

In this chapter we discuss issues related to design and implementation of our Gnutella network crawler. We begin by providing a brief introduction to Gnutella and its protocol, necessary for understanding the remainder of this chapter. We then present both the sequential and parallel algorithms for discovering topology of the Gnutella network, followed by the discussion of our distributed implementation using Java RMI.
4.1 Introduction to Gnutella
Gnutella can best be described as a fully distributed information-sharing technology. It originated as a project at Nullsoft, a subsidiary of America Online, but was abandoned out of fear of its potential use for copyright infringement. After it was quickly reverse-engineered by several programmers and open-source enthusiasts, Gnutella's popularity took off. Gnutella allows distributed file sharing by allowing each user to specify directories on their local machine they want to share. In this sense, Gnutella can be viewed as a distributed file storage system with search capabilities. Unlike its predecessor Napster, which relies on a centralized search database, Gnutella promotes decentralization of all network functions.
As we have already seen, Gnutella is based on a peer-to-peer model. This means that users connect to each other directly through a piece of client-server software, forming a high-level network. Throughout this thesis, we refer to this high-level network as the Gnutella network, or GnutellaNet. Because Gnutella software functions as both a server and a client, it is sometimes referred to as a servant. In this thesis we may use the terms client, servant, and host interchangeably to refer to Gnutella software running on a particular machine.
4.1.1 Gnutella Protocol
Each Gnutella client implements the application-level Gnutella protocol, which specifies how messages are routed between GnutellaNet hosts. We have already described Gnutella's protocol design at a high level in chapter 3. We will now complete our description with a few implementation details. The Gnutella protocol supports four basic types of messages, summarized in table 4.1. The routing technique employed by the Gnutella protocol is a form of controlled flooding, where messages are passed recursively between hosts. Flooding operates by each Gnutella host forwarding the received ping and query messages to all of its neighbors, except the one that sent the message. To limit the exponential spread of messages through the network, each message header contains a time-to-live (TTL) field. TTL is used in the same fashion as in the IP protocol: at each hop its value is decremented until it reaches zero, at which point the message is dropped. This is equivalent to mechanism 1 described in chapter 3. The maximum TTL value specified by the Gnutella protocol is seven. Recall that this restriction effectively segments the Gnutella network into subnets, imposing on each user a virtual horizon beyond which their messages cannot reach. In practice, this situation is acceptable as information may still get around.
Type         Description                             Contains
Ping         Request for a host to announce itself   No body
Pong         Reply to Ping message                   IP and port of responding host, number and size of files shared
Query        Search request                          Minimum speed requirement for responding host, search string
Query Hits   Reply to Query message                  IP, port, and speed of responding host, number of matching files and their indexed result set
Table 4.1: Gnutella protocol message description
Each Gnutella message is also flagged with a unique ID. The message ID is used by peers to detect and subsequently drop duplicate messages, which indicate a loop in the GnutellaNet topology (mechanism 2). In addition, it is used to route the response messages back along the same path by which the original request arrived. This is implemented by each host maintaining a dynamic routing table of message IDs and connection labels, each label indicating the connection along which that specific message arrived. When a response message arrives at a host, it should contain the same message ID as the original request. The host then checks its routing table to determine along which link the response message should be forwarded. This technique greatly improves efficiency while also conserving network bandwidth.
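The interplay of the TTL field, the message ID, and the routing table can be illustrated with a small in-memory model. The Python sketch below is an assumption-laden toy, not a wire-level client: the class `Host`, its methods, and the helper `link` are hypothetical names, calls are synchronous (so latencies are not modeled), and every host answers every request it accepts.

```python
class Host:
    """Toy in-memory model of one host; not a wire-level Gnutella client."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []     # directly connected Host objects
        self.routes = {}        # message ID -> neighbor the request came from
        self.responses = []     # replies collected by the originator

    def originate(self, msg_id, ttl):
        self.routes[msg_id] = None          # None marks "I sent this request"
        for n in self.neighbors:
            n.on_request(msg_id, ttl, self)

    def on_request(self, msg_id, ttl, sender):
        if msg_id in self.routes:
            return                          # mechanism 2: duplicate ID, drop
        self.routes[msg_id] = sender
        sender.on_response(msg_id, self.name)   # answer the request
        ttl -= 1                            # mechanism 1: one hop consumed
        if ttl <= 0:
            return
        for n in self.neighbors:
            if n is not sender:
                n.on_request(msg_id, ttl, self)

    def on_response(self, msg_id, responder):
        if msg_id not in self.routes:
            return                          # unknown ID: nothing to route
        back = self.routes[msg_id]
        if back is None:
            self.responses.append(responder)    # we are the originator
        else:
            back.on_response(msg_id, responder) # relay along the reverse path


def link(a, b):
    """Create an undirected connection between two hosts."""
    a.neighbors.append(b)
    b.neighbors.append(a)


# Line topology s - a - b - c; a ping with TTL 2 reaches a and b only.
s, a, b, c = (Host(n) for n in "sabc")
link(s, a); link(a, b); link(b, c)
s.originate("ping-1", ttl=2)
```

Here the reply from b travels back through a, using the entry that a stored when it forwarded the request, and c never sees the message because the TTL is exhausted at b.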
4.1.2 Discovering Gnutella Network Topology
Topology discovery in IP networks is a well-studied area of research [26]. Generally the approach is based on some protocol-specific feature, as in the case of traceroute. Although the Gnutella protocol is much simpler than IP and provides no feedback regarding message delivery, it nevertheless provides the functionality necessary for mapping the GnutellaNet topology. Notice that, according to the Gnutella protocol, it is possible to discover the neighbors of a particular host by connecting to that host and sending a ping message with TTL = 2. As a result, pong messages are sent back from the connected host and all of its immediate neighbors. A complete network topology could therefore be discovered by connecting to all the hosts, discovering their neighbors, and combining the information into a single graph. We refer to this process as crawling. Notice that, by following the described procedure, each edge would be discovered twice, once from each endpoint, thus introducing a level of redundancy. However, it is still necessary to connect to all the hosts in order to guarantee that the obtained topology map is complete.
Compared with IP networks, GnutellaNet is highly dynamic. This means that its topology is constantly changing: nodes and edges are added and removed as hosts join and leave the network, establish new connections, and close existing ones. Therefore any topology discovery algorithm operating on the Gnutella network is really capturing an instance, or a snapshot, of the topology at a specific point in time. Clearly, this poses an additional requirement for any topology discovery algorithm to be efficient, since the accuracy of the topology map degrades as the running time of the algorithm used to obtain it grows. In designing our crawler, we have paid close attention to this requirement.
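The step of combining per-host neighbor information into a single graph, and the fact that each edge is seen from both endpoints, can be sketched as follows. This Python fragment is illustrative only; the function name and the input layout (a mapping from each probed host to the addresses that answered its TTL = 2 ping) are assumptions.

```python
def edges_from_pongs(pongs_by_host):
    """Build an undirected edge set from TTL = 2 ping results.

    pongs_by_host maps each probed host to the addresses that answered,
    which include the probed host itself and its immediate neighbors.
    """
    edges = set()
    for host, responders in pongs_by_host.items():
        for other in responders:
            if other != host:
                # frozenset makes (A, B) and (B, A) the same edge
                edges.add(frozenset((host, other)))
    return edges


# Star with center A: four neighbor observations collapse into two edges.
observed = {"A": ["A", "B", "C"], "B": ["B", "A"], "C": ["C", "A"]}
edges = edges_from_pongs(observed)
```

Storing each edge as an unordered pair removes the redundancy automatically, while probing both endpoints still guards against an edge being missed when one endpoint cannot be reached.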
4.2 Design
In this section we discuss some issues related to design of our Gnutella network crawler. We present informal performance analysis for both our sequential and parallel algorithms for discovering Gnutella network topology.
4.2.1 Algorithm
Based on the procedure described in the previous section for discovering GnutellaNet topology, an intuitive design solution might be to use BFS to crawl the network, applying the algorithm for discovering direct neighbors to each encountered host. However, there are some practical issues that make this approach inefficient. To see this, let us first examine the basic operation of discovering the neighbors of a single Gnutella host. This operation requires establishing a connection, sending a ping message, and waiting for all pong messages to be received: overall a time-consuming process with running time on the order of several minutes. However, it is clear that the cost of this operation represents a lower bound for any topology discovery algorithm operating on Gnutella and based on the procedure described in the previous section. We will therefore use this basic operation as a unit in our performance analysis of algorithms for discovering GnutellaNet topology. The complexity of the BFS algorithm for discovering the topology of the Gnutella network with N hosts is clearly O(log N). Also, for the moment, let us assume that our crawling workstation is capable of maintaining up to b simultaneous network connections. Then if b ≥ N and we had a list of addresses for all the Gnutella hosts, we could simply connect to all of them simultaneously and obtain the entire network topology map in constant time. Fortunately such a list is available, as every Gnutella client maintains a dynamically updated list of live hosts. Using this list as input, we can now formulate our new algorithm for discovering GnutellaNet topology as follows:
Procedure buildTopoMap (G, l)
Input: An empty graph G, and a complete host list l
Output: A graph G representing the Gnutella network topology
for each element h of l
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge (h, h2) to G
            if (h2 is not in l)
                add h2 to the end of l
Due to the highly dynamic nature of the network, the input list of hosts is not guaranteed to be either complete or perfectly accurate. This means that new hosts not contained in the list could have just joined the network and, furthermore, hosts contained in the list may no longer be active. Nevertheless our algorithm will still work, as new hosts will be discovered at run-time and added to the end of the list. Similarly, hosts that are no longer active will simply be ignored. The ability of our algorithm to work with incomplete input data is particularly important considering the highly dynamic nature of the Gnutella network. However, the more complete the list is, the closer the performance of our algorithm will be to optimal. Notice that our algorithm in effect partitions the problem of discovering Gnutella network topology into two steps, or phases: discovering nodes (host list) and discovering edges (connections). Since the functionality for solving the first phase is already provided through the existing Gnutella client software, our algorithm's focus is on the second phase of the problem.
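A sequential rendering of buildTopoMap in Python is sketched below, mainly to show how stale and newly discovered hosts are handled. It is not the crawler's code. The network access is abstracted behind a hypothetical callable `ping_neighbors(h)`, assumed to return the set of addresses answering a TTL = 2 ping sent to h, or None when the connection fails; the toy network is made up.

```python
def build_topo_map(host_list, ping_neighbors):
    """Return (hosts, edges) discovered from an initial, possibly stale list.

    ping_neighbors(h) is a hypothetical callable: the set of addresses that
    answered a TTL = 2 ping sent to h, or None if h could not be reached.
    """
    hosts = list(host_list)        # grows as new hosts are discovered
    known = set(hosts)
    edges = set()
    i = 0
    while i < len(hosts):
        h = hosts[i]
        i += 1
        responders = ping_neighbors(h)
        if responders is None:
            continue               # host left the network: ignore it
        for h2 in responders:
            if h2 != h:
                edges.add(frozenset((h, h2)))
            if h2 not in known:
                known.add(h2)
                hosts.append(h2)   # crawl it later in the same pass
    return hosts, edges


# Toy network A - B - C; the input list misses B and C and contains a dead host X.
net = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}


def fake_ping(h):
    return ({h} | net[h]) if h in net else None


hosts, edges = build_topo_map(["A", "X"], fake_ping)
```

Starting from an incomplete list, the pass still recovers both edges of the toy network, and the unreachable host X is skipped without affecting the result.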
4.2.2 Initial Implementation
We have implemented the algorithm presented in the previous section as a Java application. We chose Java as our development platform primarily for its support for networking and threads. Platform independence was also an important benefit, particularly for our distributed implementation described in the subsequent sections. The main problem with our initial implementation stems from our original assumption that the number of connections that could be maintained simultaneously is greater than the total number of Gnutella hosts. In practice, this assumption does not hold, as the number of live Gnutella hosts at any given time is typically on the order of thousands. To cope with this situation we were forced to organize threads into groups of b, where b is the maximum number of simultaneous connections that our system could handle. This strategy introduces additional complexity and, as already discussed, sacrifices the integrity of a time-critical task such as topology discovery in a highly dynamic network. However, since connections to different Gnutella hosts can be made asynchronously, a natural solution would be to run the crawler in parallel. The following section describes issues involved in discovering GnutellaNet topology in parallel, as well as our implementation using Java RMI.
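The idea of probing hosts in groups of at most b simultaneous connections can be illustrated with Python threads. This is a sketch of the grouping strategy only, not the Java implementation; `crawl_in_groups` and `ping_neighbors` are hypothetical names, and the probe used in the example is a stand-in that performs no network I/O.

```python
import threading


def crawl_in_groups(hosts, ping_neighbors, b):
    """Probe hosts with at most b probes in flight at any time.

    Returns a dict mapping each host to whatever ping_neighbors returned.
    """
    results = {}
    lock = threading.Lock()

    def probe(h):
        responders = ping_neighbors(h)
        with lock:                      # protect the shared result map
            results[h] = responders

    for start in range(0, len(hosts), b):
        group = [threading.Thread(target=probe, args=(h,))
                 for h in hosts[start:start + b]]
        for t in group:
            t.start()
        for t in group:
            t.join()                    # next group starts only after this one
    return results


# Five hosts and b = 2 give three consecutive groups.
probed = crawl_in_groups(["A", "B", "C", "D", "E"], lambda h: {h}, 2)
```

Because each group must finish before the next begins, the total running time grows with the number of groups, which is the loss of timeliness noted above.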
4.2.3 Parallel Algorithm
The simplest and perhaps the most natural way to make our topology discovery algorithm run in parallel would be to partition the initial list of Gnutella host addresses. Each processor would then be responsible for discovering the neighbors of only a subset of hosts. In addition, each processor would need some way of knowing whether a newly discovered host address has already been crawled by another processor. One way this could be done is by hashing the host address string and checking the result (modulo the number of processors participating in the crawl) against the processor's index. If there is a match, the processor would know that it should go ahead and crawl the host. If not, it would need to pass the information to the appropriate processor. In fact, this technique is commonly used for indexing the WWW by many search engines, including Google, primarily because it results in good load balancing. However, it also requires additional inter-processor communication in order to pass the Gnutella host addresses discovered at run-time to the appropriate processors.
Instead, we have opted for a perhaps less elegant but more robust solution. Our algorithm provides each processor with a complete input list of active hosts. Each processor then calculates the subset for which it is responsible, based on its unique processor number and the total number of processors involved in the computation. For example, processor 0 of 10 would only attempt to discover the neighbors of the first 10% of hosts from the input list. The parallel version of the topology discovery algorithm presented in the previous section is formulated below. For clarity, we assume that the size of the initial list of hosts is a multiple of the number of processors.
Procedure parallelBuildTopoMap (G, l)
Input: An empty graph G, and a complete host list l
Output: A graph G representing the Gnutella network topology
startIndex = (sizeof_hosts / numberof_procs) * procID
endIndex = startIndex + (sizeof_hosts / numberof_procs) - 1
l2 = l[startIndex..endIndex]
for each element h of l2
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge (h, h2) to G
            if (h2 is not in l)
                add h2 to the end of l2
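The two ways of assigning hosts to processors discussed above can be compared in a few lines of Python. The sketch is illustrative: `range_slice` mirrors the index arithmetic of parallelBuildTopoMap, while `hash_owner` shows the hashing alternative. Both names are hypothetical, the sample addresses are made up, and CRC-32 is used only as a convenient stable hash, not because the source prescribes it.

```python
import zlib


def range_slice(hosts, proc_id, num_procs):
    """Contiguous share of the list for processor proc_id.

    Assumes len(hosts) is a multiple of num_procs, as in the pseudocode.
    """
    size = len(hosts) // num_procs
    start = size * proc_id
    return hosts[start:start + size]


def hash_owner(address, num_procs):
    """Hashing alternative: index of the processor responsible for address."""
    return zlib.crc32(address.encode("utf-8")) % num_procs


hosts = ["10.0.0.%d:6346" % i for i in range(20)]
first_share = range_slice(hosts, 0, 10)          # first 10% of the list
owner = hash_owner("10.0.0.99:6346", 10)         # owner of a host found at run-time
```

Range partitioning needs no communication but gives no rule for hosts discovered at run-time. The hash rule assigns every address, including new ones, to exactly one processor, at the cost of forwarding addresses between processors.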
Despite its apparent simplicity, and owing to the highly asynchronous nature of the task, our parallel algorithm achieves optimal speed-up in the best case. In addition, as long as the total number of Gnutella hosts N ≤ pb, where p is the number of processors and b is the maximum number of connections each processor can maintain simultaneously, our algorithm will run in constant time. In practice, we were typically able to satisfy this requirement with only a few processors, as the size of the largest connected public segment of the Gnutella network at the time rarely exceeded two thousand users.
One potential problem with our algorithm is that its performance depends on the completeness of the input list of host addresses. Recall from our previous discussion that the input list is not guaranteed to be complete, as new hosts could have joined the network. Because our algorithm only partitions the initial set of hosts, each processor would discover new hosts independently. This would result in redundant work being performed by all the processors. Notice that this would not be a problem had we used the hashing solution mentioned above. However, it is easy to show that, as long as the number of hosts discovered at run-time is at most b, the performance of our algorithm will be within a factor of two of optimal. This is true because only a single additional step will be required by each processor.
Typically an important issue in designing parallel algorithms is load balancing. In our case, this refers to the actual number of connections each processor is required to make. Recall that the input list of potential hosts may also contain some hosts that have recently left the network. Therefore, even though each processor receives an equal number of potential hosts to connect to, the number of actual live hosts in a list is likely to be smaller and will vary between processors. However, our experiments indicate this is not a significant problem. To see this, recall that, even though the actual number of connections made by each processor could vary, they are still handled simultaneously by each processor in a single logical step.
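The step counts behind these claims can be made concrete with a small calculation, where one step is the basic operation of probing up to b hosts simultaneously. The Python sketch below is a simplified model, assuming that every processor probes all late-discovered hosts itself; the function names and sample numbers are hypothetical.

```python
import math


def crawl_steps(num_hosts, p, b):
    """Steps needed when p processors each handle b connections per step."""
    per_processor = math.ceil(num_hosts / p)
    return math.ceil(per_processor / b)


def steps_with_late_hosts(num_hosts, p, b, late_hosts):
    """Add the redundant work of each processor probing every late host."""
    return crawl_steps(num_hosts, p, b) + math.ceil(late_hosts / b)


base = crawl_steps(2000, 4, 500)                     # N <= p*b
with_late = steps_with_late_hosts(2000, 4, 500, 300) # at most b late hosts
```

With N = 2000, p = 4, and b = 500 the crawl takes a single step. Discovering up to b additional hosts at run-time adds one more step per processor, which gives the factor of two.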
4.2.4 Limitations
The main limitation of our crawler is related to the notion of private networks. Since a significant portion of Gnutella users reside behind a firewall that prevents anyone on the outside from establishing a direct connection to them, our crawler is not able to accurately discover the topology between such hosts. Notice that these hosts may still appear in the final topology graph, due to their connections with hosts outside the firewall. In this sense, the topology obtained by our crawler can be viewed as a subgraph of the actual Gnutella network topology. In addition, even though the running time of our algorithm is optimal for any topology discovery algorithm based on the Gnutella protocol, the actual execution time is still bounded below by the round-trip time (RTT) of messages in the Gnutella network and can take up to several minutes. One could therefore question the integrity of our topology data, on the grounds that the network structure may have changed significantly over the course of several minutes. Despite these limitations we believe our crawler is a valuable tool, able to accurately capture important structural properties of the actual Gnutella network topology.
4.3 Distributed Computing Solution Using Java RMI
We have implemented our parallel algorithm for GnutellaNet topology discovery for a network of workstations (NOW), primarily because we felt it would give the greatest amount of flexibility and portability to our code. In addition, we felt that the task at hand would be perfectly suited for a distributed computing model, since it requires very little inter-processor communication. In fact, in our design, communication only occurs at the beginning of the process, to distribute input, and at the end, to gather the output at a central location. The mechanism for this communication is provided by Java RMI.
Remote method invocation (RMI) is JavaSoft's implementation of remote procedure calls (RPC). It is distributed as a standard Java library, providing the necessary functionality for distributed object communication. In our implementation, crawling a subset of the Gnutella network is provided as a service residing at various remote locations throughout our network. In other words, our parallel algorithm described in the previous section is implemented as a distributed object residing on remote machines. Our distributed computing system includes an object serving as the brain of the entire computation. This central object is responsible for bootstrapping the entire topology discovery process by distributing the initial list of Gnutella hosts to the other remote objects. Upon receiving the input, each remote object performs topology discovery of its portion of the network, and subsequently returns a graph object representing network topology to the central object. The central object is then responsible for merging all the output graphs into a single one representing the topology of the entire Gnutella network. We should mention that our crawler utilized some Java classes providing functionality related to Gnutella protocol compliance from furi, a full-fledged open-source Gnutella client developed by William Wong [3].
The main feature of our distributed implementation is that it allows a heterogeneous network of workstations to participate in the discovery of the Gnutella network topology. As explained, this topology discovery can be executed in constant time using only a few processors. In addition, the output graph representing Gnutella network topology is provided in GML format [18], which is a fast-growing standard for representing graph data structures, and can immediately be viewed using visualization tools such as LEDA's graphwin [8].
Several visualizations of the Gnutella network topology data obtained using our crawler are presented in appendix A.
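The final merge performed by the central object, followed by GML output, can be illustrated in Python. This sketch does not reproduce the RMI-based Java implementation: `merge_graphs` and `to_gml` are hypothetical names, partial results are assumed to be sets of unordered host pairs, and host labels are assumed to contain no double quotes.

```python
def merge_graphs(partial_edge_sets):
    """Union the edge sets returned by the individual crawl workers."""
    merged = set()
    for edges in partial_edge_sets:
        merged |= edges
    return merged


def to_gml(edges):
    """Serialize an undirected edge set as a minimal GML document."""
    nodes = sorted({host for edge in edges for host in edge})
    ids = {host: i for i, host in enumerate(nodes)}
    lines = ["graph [", "  directed 0"]
    for host in nodes:
        lines.append('  node [ id %d label "%s" ]' % (ids[host], host))
    for u, v in sorted(tuple(sorted(edge)) for edge in edges):
        lines.append("  edge [ source %d target %d ]" % (ids[u], ids[v]))
    lines.append("]")
    return "\n".join(lines)


# Two workers report overlapping pieces of the path A - B - C.
part1 = {frozenset(("A", "B"))}
part2 = {frozenset(("A", "B")), frozenset(("B", "C"))}
merged = merge_graphs([part1, part2])
gml = to_gml(merged)
```

Edges reported by more than one worker appear once in the merged graph, so the redundant discoveries made during a parallel crawl do not inflate the output.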
Chapter 5 Conclusions and future research

5.1 Conclusions
Modeling complex network structures produced by modern P2P network applications is a difficult task. The main contribution of this thesis to the task at hand is two-fold. First, we made several important discoveries regarding the structure of the underlying network topology of a P2P network application known as Gnutella. Specifically, we discovered that it exhibits the small-world properties of clustering and small diameter. In addition, we observed four different power-law relationships among various graph metrics. It is our thesis that these empirical observations must be accounted for by any accurate graph-based model of P2P network topology. Second, we pointed out the potentially devastating effects of heterogeneous latencies on the reachability of message broadcast in P2P network applications operating under flooding protocols. Even though our empirical results indicate that this problem, which we call short-circuiting, is on average not devastating to overall system performance, we believe it should be taken seriously by protocol designers. It is our hope that our results can be used in designing the new generation of application-level protocols for P2P network applications.
5.2 Future Directions
Future research directions can be divided into three categories: those dealing with network topology, visualization, and server placement. In the following sections, we briefly discuss each one.
5.2.1 Network Topology Modeling
In this thesis we have reported discoveries of some structural properties of P2P network topologies. However, the search continues toward a uniform model of P2P network topology, encompassing all of the structural properties observed in real network applications. We speculate that for many P2P network applications, including Gnutella, such a model will be a modification of the discussed Barabási-Albert model, perhaps accounting for hosts leaving the network and dynamically changing connections. In addition, more research needs to be done on spectral analysis of the topology graph's eigenvalues and their relationship with the structural properties.
5.2.2 Network Visualization
Better graph drawing algorithms need to be designed for visualizing the topology of large-scale P2P networks. Such algorithms should be able to present topological structure of a network in a way so that meaningful conclusions can be drawn. Network visualizations can then be used by engineers to identify network-related problems.
5.2.3 Server Placement
The problem of finding an optimal placement of servers has received a lot of attention in the Internet community. Many P2P file-sharing applications such as Gnutella present another attractive practical application of this problem. For example, the act of a Gnutella user connecting to the network can be modeled as a graph augmentation problem. This problem can be formulated as adding a single vertex and t edges to a graph G so that the size of the t-horizon is optimized. In the future, we plan to examine some theoretical issues behind this problem using the knowledge we have obtained on the Gnutella topology model.
Bibliography
[1] Cooperative Association for Internet Data Analysis (CAIDA). http://www.caida.org.
[2] Folding@home. http://www.stanford.edu/group/pandegroup/Cosm.
[3] The Furi Homepage. http://www.jps.net/williamw/furi/.
[4] SETI@home. http://setiathome.ssl.berkeley.edu.
[5] Lada Adamic. The small world web. In ECDL'99, pages 443-452, Springer, 1999. Lecture Notes in Computer Science 1696.
[6] Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman. Search in power-law networks. http://www.parc.xerox.com/istl/groups/iea/papers/plsearch/, March 20, 2001.
[7] William Aiello, Fan R. K. Chung, and Linyuan Lu. A random graph model for massive graphs. In ACM Symposium on Theory of Computing, pages 171-180, Portland, Oregon, 2000.
[8] Algorithmic Solutions Software GmbH. The LEDA Homepage. http://www.algorithmic-solutions.com/as html/products/products.html.
[9] Fred S. Annexstein, Kenneth A. Berman, and Mihajlo A. Jovanović. Latency effects on reachability in large-scale peer-to-peer networks. In ACM Symposium on Parallel Algorithms and Architectures, July 2001.
[10] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286:509-512, October 15, 1999.
[11] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structures in the web. Computer Networks, 33(1-6):309-320, June 2000.
[12] Brown University. The Java Data Structures Library (JDSL). http://www.cs.brown.edu/cgc/jdsl/.
[13] Gnutella: To the bandwidth barrier and beyond. Clip2.com, November 6, 2000. http://dss.clip2.com/gnutella.html.
[14] Roger Dingledine, Michael J. Freedman, and David Molnar. The free haven project: Distributed anonymous storage service. In Workshop on Design Issues in Anonymity and Unobservability, July 2000.
[15] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251-262, 1999.
[16] Groove Networks, Inc. Introducing Groove. http://www.groove.net/products/.
[17] Jerrold W. Grossman and Patrick D. F. Ion. The Erdős Number Project. http://www.oakland.edu/ grossman/erdoshp.html.
[18] Michael Himsolt. GML: A portable graph file format. Technical Report 94030, University of Passau, 1997.
[19] Jon Kleinberg. The small-world phenomenon: An algorithmic perspective. Technical Report 99-1776, Cornell University Department of Computer Science, October 1999.
[20] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. The web as a graph: measurements, models, and methods. In 5th Annual International Conference on Computing and Combinatorics, volume 1627, pages 1-17, 1999. Lecture Notes in Computer Science.
[21] Albert Medina, Ibrahim Matta, and John Byers. On the origin of power laws in internet topologies. ACM Computer Communications Review, 30(2), April 2000.
[22] Andrew Oram, editor. Harnessing the Power of Disruptive Technologies. O'Reilly & Associates, 1st edition, March 2001.
[23] Christopher R. Palmer and J. Gregory Steffan. Generating network topologies that obey power laws. http://citeseer.nj.nec.com/palmer00generating.html, 2000.
[24] T. Remes. Six degrees of Rogers Hornsby. New York Times, August 17, 1997.
[25] Clay Shirky. What is p2p... and what isn't? The O'Reilly Network, November 24, 2000. http://www.openp2p.com/pub/a/p2p/2000/11/24/shirky1whatisp2p.html.
[26] R. Siamwalla, R. Sharma, and S. Keshav. Discovering internet topology. http://www.cs.cornell.edu/skeshav/papers.html, 1998.
[27] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440-442, June 1998.
[28] Ellen W. Zegura, Kenneth L. Calvert, and Michael J. Donahoo. A quantitative comparison of graph-based models for Internet topology. IEEE/ACM Transactions on Networking, 5(6):770-783, December 1997.
Appendix A Visualizations of the Gnutella Network Topology

In this appendix we present visualizations of the Gnutella network topology data obtained using our crawler between November 13 and December 28 of 2000. The visualizations were produced with Otter, a network visualization tool developed by CAIDA [1], and with LEDA's graph drawing software [8].
Figure A.1: Gnutella network topology using CAIDA's Otter
Figure A.2: Gnutella network topology using LEDA's 2D spring layout
Figure A.3: Gnutella network topology using experimental layout
Figure A.4: Gnutella network backbone (dominating set using greedy algorithm) using LEDA's 3D spring layout
Figure A.5: Gnutella network backbone (nodes with degree > 10) using LEDA's 3D spring layout
Figure A.6: Gnutella network backbone (nodes with degree > 20) using LEDA's 3D spring layout
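Figure A.4 draws a backbone obtained as a dominating set chosen by a greedy algorithm. The appendix does not spell out which greedy rule was used, so the following is only a sketch of the standard one, given here as an assumption: repeatedly pick the node that dominates the largest number of not-yet-dominated nodes (itself plus its neighbors). The class and method names are hypothetical, and the graph is assumed to be simple and undirected.

```java
import java.util.ArrayList;
import java.util.List;

final class GreedyDominatingSet {
    /** Builds undirected adjacency lists from an edge list of {u, v} pairs. */
    static List<List<Integer>> fromEdges(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        return adj;
    }

    /** Returns nodes in the order the greedy rule picks them. */
    static List<Integer> dominatingSet(List<List<Integer>> adj) {
        int n = adj.size();
        boolean[] dominated = new boolean[n];
        int remaining = n;
        List<Integer> chosen = new ArrayList<>();
        while (remaining > 0) {
            int best = -1, bestGain = 0;
            for (int u = 0; u < n; u++) {
                // Gain = number of currently undominated nodes that u would dominate.
                int gain = dominated[u] ? 0 : 1;
                for (int v : adj.get(u)) if (!dominated[v]) gain++;
                if (gain > bestGain) { bestGain = gain; best = u; }
            }
            chosen.add(best);
            if (!dominated[best]) { dominated[best] = true; remaining--; }
            for (int v : adj.get(best)) {
                if (!dominated[v]) { dominated[v] = true; remaining--; }
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Path 0 - 1 - 2 - 3 - 4: the greedy rule picks node 1, then node 3.
        int[][] path = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
        System.out.println(dominatingSet(fromEdges(5, path)));
    }
}
```

The greedy rule does not guarantee a minimum dominating set, but on a topology with a few very high-degree nodes it tends to select those hubs first, which is consistent with the degree-based backbones in Figures A.5 and A.6.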
Appendix B Java source code for gnutsim

The following is the Java source code for our Gnutella network simulator gnutsim, which we used to study the problem of short-circuiting. Our code makes use of some classes from the JDSL package developed at Brown University [12].
/*
 * gnutsim - Gnutella message transmission simulator
 * Copyright (C) November 2000
 * Mihajlo A. Jovanovic
 * mjovanov@ececs.uc.edu
 */

import jdsl.core.api.*;
import jdsl.core.ref.ArrayHeap;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.io.FileWriter;
import java.io.File;
import java.util.Vector;
import java.util.Hashtable;
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.Random;
import java.util.Date;
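
// Orders messages by accumulated cost so that the send queue yields the earliest one first.
class MsgComparator implements Comparator {
    public int compare(Object a, Object b) {
        return ((Msg) a).compareTo((Msg) b);
    }

    // The relational predicates delegate to compare() rather than always returning true.
    public boolean isLessThan(Object a, Object b) { return compare(a, b) < 0; }
    public boolean isGreaterThan(Object a, Object b) { return compare(a, b) > 0; }
    public boolean isEqualTo(Object a, Object b) { return compare(a, b) == 0; }
    public boolean isLessThanOrEqualTo(Object a, Object b) { return compare(a, b) <= 0; }
    public boolean isGreaterThanOrEqualTo(Object a, Object b) { return compare(a, b) >= 0; }
    public boolean isComparable(Object b) { return b instanceof Msg; }
}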

// Orders hosts by the cost of the next message waiting in their send queue.
class HostComparator implements Comparator {
    public int compare(Object a, Object b) {
        return ((Host) a).compareTo((Host) b);
    }

    // The relational predicates delegate to compare() rather than always returning true.
    public boolean isLessThan(Object a, Object b) { return compare(a, b) < 0; }
    public boolean isGreaterThan(Object a, Object b) { return compare(a, b) > 0; }
    public boolean isEqualTo(Object a, Object b) { return compare(a, b) == 0; }
    public boolean isLessThanOrEqualTo(Object a, Object b) { return compare(a, b) <= 0; }
    public boolean isGreaterThanOrEqualTo(Object a, Object b) { return compare(a, b) >= 0; }
    public boolean isComparable(Object b) { return b instanceof Host; }
}

class Msg {
    private int guid;      // unique message identifier (UID)
    private int ttl = 7;   // remaining time-to-live
    private int cost = 0;  // accumulated link weight, i.e. arrival time

    Msg(int id) {
        guid = id;
    }

    // Copy constructor
    Msg(Msg m) {
        guid = m.getGuid();
        ttl = m.getTtl();
        cost = m.getCost();
    }

    public void setTtl(int newTTL) { ttl = newTTL; }
    public int getGuid() { return guid; }
    public int getTtl() { return ttl; }
    public int getCost() { return cost; }

    // Decrements the TTL; returns true while the message may still be forwarded.
    public boolean decTTL() {
        ttl--;
        return ttl > 0;
    }

    public void incrCost(int w) { cost += w; }

    public int compareTo(Msg m) {
        return Integer.compare(cost, m.getCost());
    }

    // Two messages are the same message if they carry the same GUID.
    public boolean equals(Object msg) {
        return msg instanceof Msg && guid == ((Msg) msg).getGuid();
    }

    public int hashCode() { return guid; }

    public String toString() {
        return "GUID: " + guid + " TTL: " + ttl + " Cost: " + cost;
    }
}

class Host {
    Vector msgHistory = new Vector(10, 10);
    Hashtable neighbors = new Hashtable(); // keys: neighbors (Host), values: link weights (Integer)
    ArrayHeap sendQueue = new ArrayHeap(new MsgComparator());
    String id;

    Host(String address) {
        id = address;
    }

    public String getID() { return id; }

    public void clearAndReset(Random r, Hashtable map) {
        msgHistory.clear();
        // Recalculate link weights
        for (Enumeration e = neighbors.keys(); e.hasMoreElements();) {
            int w = r.nextInt(gnutsim.MAX_WEIGHT);
            neighbors.put(e.nextElement(), (Integer) map.get(Integer.valueOf(w)));
        }
    }

    public void setBroadcastMsg(Msg newMsg) {
        msgHistory.add(newMsg);
        for (Enumeration e = neighbors.keys(); e.hasMoreElements();) {
            Host h = (Host) e.nextElement();
            Msg outMsg = new Msg(newMsg);
            outMsg.incrCost(((Integer) neighbors.get(h)).intValue());
            sendQueue.insert(outMsg, h);
        }
    }

    public boolean wasMsgSeen(Msg msg) {
        return msgHistory.contains(msg);
    }

    public void setNeighbors(Hashtable h) { neighbors = h; }

    public void addNeighbor(Host h, int w) {
        if (neighbors == null)
            neighbors = new Hashtable();
        neighbors.put(h, Integer.valueOf(w));
    }

    public void receiveMsg(Host sender, Msg inMsg) {
        // Drop duplicates: a message with a known GUID is never forwarded again.
        if (msgHistory.contains(inMsg))
            return;
        msgHistory.add(inMsg);

        if (inMsg.decTTL()) {
            /* for all neighbors except sender:
               1. create a new Msg object (m), incr cost
               2. add to the send queue (msg, neighbor) */
            for (Enumeration e = neighbors.keys(); e.hasMoreElements();) {
                Host h = (Host) e.nextElement();
                if (h.equals(sender))
                    continue;
                Msg outMsg = new Msg(inMsg);
                outMsg.incrCost(((Integer) neighbors.get(h)).intValue());
                sendQueue.insert(outMsg, h);
            }
        }
    }

    // Delivers the cheapest queued message and returns the host that received it.
    public Host sendNextMsg() {
        Msg outMsg = (Msg) sendQueue.min().key();
        Host receiver = (Host) sendQueue.removeMin();
        receiver.receiveMsg(this, outMsg);
        return receiver;
    }

    // Cost of the next queued message, or -1 if the queue is empty.
    public int getNextMsgCost() {
        if (sendQueue.isEmpty())
            return -1;
        return ((Msg) sendQueue.min().key()).getCost();
    }

    public boolean equals(Object host) {
        return host instanceof Host && id.equals(((Host) host).getID());
    }

    // Kept consistent with equals(), since hosts are used as Hashtable keys.
    public int hashCode() { return id.hashCode(); }

    public int compareTo(Host m) {
        return Integer.compare(getNextMsgCost(), m.getNextMsgCost());
    }

    public String toString() { return id; }
}

public class gnutsim {
    static final int NUM_OF_TRIALS = 100;
    static final int MAX_WEIGHT = 9;

    // Linear scan of the heap's keys; true if el is already queued.
    static boolean isArrayHeapElement(ArrayHeap a, Object el) {
        for (ObjectIterator i = a.keys(); i.hasNext();) {
            Object o = i.nextObject();
            if (el.equals(o))
                return true;
        }
        return false;
    }

    public static void main(String args[]) {
        ArrayHeap pq = new ArrayHeap(new HostComparator());
        // CREATE WEIGHTED TOPOLOGY
        String line = "";
        StringTokenizer t;
        String token = null;
        Hashtable nodes = null; // keys: node ID (Integer), values: hosts (Host)
        Random r = new Random((new Date()).getTime());
        // Maps a random draw in [0, MAX_WEIGHT) to a link weight.
        int[] weights = {1, 6, 31, 127, 500, 2001, 8005, 16400, 33000};
        Hashtable map = new Hashtable();
        for (int k = 0; k < weights.length; k++)
            map.put(Integer.valueOf(k), Integer.valueOf(weights[k]));
        int min = -1, max = -1, accum = 0, ttl = -1;

        try {
            for (int trial = 0; trial < NUM_OF_TRIALS; trial++) {
                if (trial == 0) {
                    ttl = Integer.parseInt(args[1]);
                    File f = new File(args[0]);
                    if (!f.exists() || !f.canRead())
                        throw new Exception("Cannot read file " + f);

                    BufferedReader in = new BufferedReader(
                        new InputStreamReader(new FileInputStream(f)));
                    while ((line = in.readLine()) != null) {
                        t = new StringTokenizer(line, " ");
                        if (!t.hasMoreTokens())
                            continue; // skip blank lines
                        token = t.nextToken();
                        if (token.equals("t"))
                            nodes = new Hashtable(2 * Integer.parseInt(t.nextToken()));
                        else if (token.equalsIgnoreCase("?")) {
                            int i = Integer.parseInt(t.nextToken());
                            Host h = new Host(t.nextToken());
                            nodes.put(Integer.valueOf(i), h);
                        } else if (token.equalsIgnoreCase("L")) {
                            t.nextToken();
                            int nodeID = Integer.parseInt(t.nextToken());
                            Host h1 = (Host) nodes.get(Integer.valueOf(nodeID));
                            nodeID = Integer.parseInt(t.nextToken());
                            Host h2 = (Host) nodes.get(Integer.valueOf(nodeID));
                            if (h1 == null || h2 == null)
                                throw new Exception("Invalid .odf file format!");
                            /* UNIFORM WEIGHTS
                            h1.addNeighbor(h2, 1);
                            h2.addNeighbor(h1, 1);
                            */
                            int w = r.nextInt(MAX_WEIGHT);
                            int weight = ((Integer) map.get(Integer.valueOf(w))).intValue();
                            h1.addNeighbor(h2, weight);
                            h2.addNeighbor(h1, weight);
                        }
                    }
                    in.close();
                } else {
                    // clear all host objects
                    for (Enumeration e = nodes.elements(); e.hasMoreElements();)
                        ((Host) e.nextElement()).clearAndReset(r, map);
                }

                // ADD BROADCAST SERVER ONTO PQ
                Msg m = new Msg(1);
                m.setTtl(ttl);
                Host h = (Host) nodes.get(Integer.valueOf(0));
                h.setBroadcastMsg(m);
                pq.insert(h, Boolean.TRUE);

                while (!pq.isEmpty()) {
                    Locator l = pq.min();
                    Host nextHost = (Host) l.key();
                    Host newHost = nextHost.sendNextMsg();
                    pq.remove(l);
                    if (nextHost.getNextMsgCost() != -1)
                        pq.insert(nextHost, Boolean.TRUE);
                    // if new host is not already in the pq and its cost is not -1, add to pq
                    if (!isArrayHeapElement(pq, newHost) && newHost.getNextMsgCost() != -1)
                        pq.insert(newHost, Boolean.TRUE);
                }

                int horSize = 0;
                for (Enumeration e = nodes.elements(); e.hasMoreElements();)
                    if (((Host) e.nextElement()).wasMsgSeen(m))
                        horSize++;
                System.out.println("Total horizon size: " + horSize);
                if (min == -1 || horSize < min) min = horSize;
                if (max == -1 || horSize > max) max = horSize;
                accum += horSize;
            }
            System.out.println("Average horizon size: " + accum * 1.0 / NUM_OF_TRIALS);
            System.out.println("Min horizon size: " + min);
            System.out.println("Max horizon size: " + max);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Usage: java gnutsim [graph_file.odf] [TTL]");
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
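Because gnutsim depends on JDSL and on an Otter .odf input file, the short-circuiting effect can be easier to see in a standalone miniature. The sketch below is not part of gnutsim: it replaces the JDSL heaps with a single java.util.PriorityQueue of delivery events, and the five-node topology, its latencies, and the names ShortCircuitDemo and horizon are made up for illustration. It follows the same forwarding rules as the listing above: a duplicate UID is dropped, the TTL is decremented on receipt, and a message is never sent back to its sender.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class ShortCircuitDemo {
    /** Each edge is {u, v, latency}; links are undirected. Returns the number of nodes reached. */
    static int horizon(int n, int[][] edges, int source, int ttl) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(new int[] {e[1], e[2]});
            adj.get(e[1]).add(new int[] {e[0], e[2]});
        }
        boolean[] seen = new boolean[n];
        // Event = {arrivalTime, sender, receiver, ttlOnArrival}, processed in order of arrival time.
        PriorityQueue<long[]> events = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        seen[source] = true;
        int reached = 1;
        for (int[] nb : adj.get(source)) {
            events.add(new long[] {nb[1], source, nb[0], ttl});
        }
        while (!events.isEmpty()) {
            long[] ev = events.poll();
            int sender = (int) ev[1];
            int node = (int) ev[2];
            if (seen[node]) continue;        // duplicate UID: dropped, even if it carries more TTL
            seen[node] = true;
            reached++;
            long remaining = ev[3] - 1;      // TTL is decremented on receipt
            if (remaining <= 0) continue;    // expired: not forwarded
            for (int[] nb : adj.get(node)) {
                if (nb[0] == sender) continue;
                events.add(new long[] {ev[0] + nb[1], node, nb[0], remaining});
            }
        }
        return reached;
    }

    public static void main(String[] args) {
        // Nodes: 0 = source, 3 = hub, 4 = node behind the hub.
        int[][] uniform = {{0, 1, 1}, {1, 2, 1}, {2, 3, 1}, {0, 3, 1}, {3, 4, 1}};
        int[][] skewed  = {{0, 1, 1}, {1, 2, 1}, {2, 3, 1}, {0, 3, 100}, {3, 4, 1}};
        System.out.println("uniform latencies: " + horizon(5, uniform, 0, 3)); // 5
        System.out.println("slow direct link:  " + horizon(5, skewed, 0, 3));  // 4
    }
}
```

With the slow direct link, the hub first hears the message over the three-hop detour, at which point its TTL is exhausted; the later copy arriving over the direct link still has TTL to spare but is discarded as a duplicate, so node 4 is never reached even though it is only two hops from the source.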
Appendix C Network Simulation Results

In this appendix we present the statistics obtained from our network simulation studies. The tables report the reduction in reachability caused by short-circuiting when latencies are chosen at random on a fixed topology. Each table corresponds to one fixed topology, and each row summarizes 100 trials with random latencies. For a fixed t, a row reports the worst, average, and best observed t-horizon (Worst, Avg, Best) and the t-neighborhood (Nbhd), which equals the t-horizon when latencies are uniform. The reduction ratios are then obtained by dividing the worst observed t-horizon by the t-neighborhood (WRR) and the average observed t-horizon by the t-neighborhood (MRR).
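As a worked example of the two ratios, the following sketch recomputes the TTL = 2 row of Table C.1 (worst 9, average 14, t-neighborhood 16). The class name ReductionRatios and the helper percent are hypothetical, and rounding to the nearest whole percent is an assumption that happens to reproduce the tabulated values; the averages in the tables are themselves rounded, so recomputed MRR values can differ by a point in other rows.

```java
final class ReductionRatios {
    /** Ratio of an observed t-horizon to the t-neighborhood, as a whole percentage. */
    static long percent(int observed, int neighborhood) {
        return Math.round(100.0 * observed / neighborhood);
    }

    public static void main(String[] args) {
        int worst = 9, avg = 14, nbhd = 16;  // TTL = 2 row of Table C.1
        System.out.println("WRR = " + percent(worst, nbhd) + "%"); // WRR = 56%
        System.out.println("MRR = " + percent(avg, nbhd) + "%");   // MRR = 88%
    }
}
```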
TTL   Worst   Avg    Best   Nbhd   WRR    MRR
1     7       7      7      7      100%   100%
2     9       14     16     16     56%    88%
3     12      28     41     42     29%    67%
4     15      52     83     96     16%    54%
5     28      105    188    252    11%    42%
6     55      181    337    494    11%    37%
7     105     333    525    830    13%    40%
8     185     496    719    1055   18%    47%
9     371     659    877    1121   33%    59%
10    468     804    983    1129   41%    71%

Table C.1: Short-circuiting effects on the Watts-Strogatz topology (nodes = 1129, k = 3, p = 0.2)

TTL   Worst   Avg    Best   Nbhd   WRR    MRR
1     2       2      2      2      100%   100%
2     4       4      4      4      100%   100%
3     10      10     10     10     100%   100%
4     65      92     113    113    58%    81%
5     214     492    689    844    25%    58%
6     246     589    843    1107   22%    53%
7     419     806    1040   1124   37%    72%
8     566     915    1071   1125   50%    81%

Table C.2: Short-circuiting effects on the Gnutella topology (nodes = 1125, edges = 4080)

TTL   Worst   Avg    Best   Nbhd    WRR    MRR
1     6       6      6      6       100%   100%
2     54      54     54     54      100%   100%
3     405     410    419    419     97%    98%
4     1473    2216   2606   2851    52%    78%
5     4686    5986   6875   9021    52%    66%
6     6557    8143   8809   9998    66%    81%
7     8113    9060   9443   10000   81%    91%

Table C.3: Short-circuiting effects on a random topology (nodes = 10000, edges = 40000)

TTL   Worst   Avg    Best   Nbhd   WRR    MRR
1     11      11     11     11     100%   100%
2     56      56     56     56     100%   100%
3     92      150    176    176    52%    85%
4     263     319    372    386    68%    83%
5     307     523    606    638    48%    82%
6     478     720    821    848    56%    85%
7     533     852    933    968    55%    88%
8     699     948    1002   1013   69%    94%
9     883     991    1020   1023   86%    97%
10    916     1011   1024   1024   89%    99%

Table C.4: Short-circuiting effects on a hypercube topology (N = 2^10)

TTL   Worst   Avg    Best   Nbhd   WRR    MRR
1     14      14     14     14     100%   100%
2     92      92     92     92     100%   100%
3     258     315    368    378    68%    83%
4     685     858    1008   1093   63%    78%
5     1120    1750   2139   2380   47%    74%
6     2243    3079   3544   4096   55%    75%
7     2796    4422   5298   5812   48%    76%
8     3970    5813   6644   7099   56%    82%
9     6023    6844   7424   7814   77%    88%
10    6259    7558   7950   8100   77%    93%
11    6930    7907   8147   8178   85%    97%
12    7877    8108   8187   8191   96%    99%
13    8050    8174   8192   8192   98%    100%

Table C.5: Short-circuiting effects on a hypercube topology (N = 2^13)
