Series Editor
Nicola Bellomo
Politecnico di Torino
Italy
Niloy Ganguly
Andreas Deutsch
Animesh Mukherjee
Editors
Birkhäuser
Boston • Basel • Berlin
Editors
Niloy Ganguly
Indian Institute of Technology
Department of Computer Science
and Engineering
Kharagpur 721302
India
niloy@cse.iitkgp.ernet.in
Andreas Deutsch
Center for Information Services
and High Performance Computing
Technische Universität Dresden
01062 Dresden
Germany
andreas.deutsch@tu-dresden.de
Animesh Mukherjee
Indian Institute of Technology
Department of Computer Science
and Engineering
Kharagpur 721302
India
animeshm@cse.iitkgp.ernet.in
Mathematics Subject Classification (2000): 05C85, 68M10, 82B43, 90B15, 90B18, 90B40, 90C35, 91D30,
92D30, 94C15
was approximately 40. There were around 20 speakers, including both senior
researchers and young scientists, who spoke about the dynamics on and of
different systems exhibiting a complex network structure.
The theme of this edited volume is identical to that of the workshop. Its
primary aim is to show how the theories of complex networks are being suc-
cessfully used by researchers to tackle numerous difficult problems in various
domains. Towards this aim, it presents extended versions of some of the
very high-quality submissions received at the workshop together with new
invited contributions, which can play an extremely important role in the
understanding as well as advancement of the field. Since the target audience
of this book is expected to be largely cross-disciplinary, the chapters have
been made as readable as possible, explaining all the intricate technicalities
wherever necessary in sufficient detail.
The uniqueness of this volume lies in the fact that it presents an equal
mix of (a) very relevant reviews (eight chapters) of important works in the
field, which gives the reader an up-to-date picture of the state of the art, and
(b) independent research reports (eight chapters) providing a clear conception
about how complex networks can be extremely useful in harnessing even the
hardest problems of a particular discipline. The editors feel that research
in this area has reached a stage where there is an urgent need to have a
comprehensive knowledge of the past and the present before the future can
be planned. The blend of reviews and contributory chapters presented in
this volume strives to achieve this objective and, thereby, set the platform for
“Phase II” research in complex networks.
The volume consists of three parts. The contributions in Part I center
around the application of complex networks in the understanding of biolog-
ical problems. This part consists of five chapters. The first chapter is From
Network Structure to Dynamics and Back Again: Relating Dynamical Stability
and Connection Topology in Biological Complex Systems, in which Sitabhra
Sinha presents a study of how the topology of a biological network influences
the nature of its dynamics, and conversely, how dynamical considerations put
constraints on the network structure. The next chapter deals with Regula-
tion of Apoptosis via the NFκB Pathway: Modeling and Analysis, in which
Madalena Chaves et al. model and analyze, in the framework of complex net-
works, the interaction of the nuclear factor κB with the apoptosis signaling
pathway. In the third chapter, Network-Based Models in Molecular Biology,
Andreas Beyer presents a survey on the extensive literature that employs
complex networks to understand numerous intricate phenomena in biology.
The fourth chapter, Ecological Networks: Structure, Interaction Strength, and
Stability, by Samit Bhattacharyya and Somdatta Sinha, presents a detailed
survey of the various studies conducted on ecological networks and especially
on food webs. In the last chapter, Signaling and Feedback in Biological Net-
works, Sandeep Krishna et al. review some important studies on the signaling
and feedback mechanisms that are observed in different biological networks.
Part II is also spread over five chapters and focuses on social networks. This
part begins with a chapter on Topographic Spreading Analysis of an Empirical
Sex Workers’ Network, by Johannes Bjelland et al., where the authors present
a “topographic” analysis of spreading (of HIV) on an empirical network of fe-
male sex workers. The authors find that the HIV graph breaks into small
components, thereby reducing the spreading if perfect condom protection is
made possible. The next chapter, Spectral Characterization of Network Struc-
tures and Dynamics, by Anirban Banerjee and Jürgen Jost, centers around
the investigation of the spectral properties of complex networks with a special
thrust on social networks. The third chapter, Dynamics of Social Complex
Networks: Some Insights into Recent Research, is authored by Sergi Lozano
and presents a comprehensive review of how complex network theory has been
instrumental in explaining the structure and the dynamics of a society. The
last two chapters show how complex networks can be applied to explain the
dynamics of human languages. The first one, titled The Structure and Dynam-
ics of Linguistic Networks, by Monojit Choudhury and Animesh Mukherjee,
is a review of the current literature on linguistic networks. The second one,
Networks Generated from Natural Language Text, by Chris Biemann and Uwe
Quasthoff, presents a survey focusing on how corpus linguistics (i.e., the study
of language as expressed in corpora) can be studied within the framework of
complex networks.
Part III presents a comprehensive overview of the networks that are preva-
lent in information sciences. This part is laid out in six chapters. The first
chapter in this part, Efficiency of Navigation in Indexed Networks, by Petter
Holme, explores the efficiency of navigation of data packets on “indexed”
graphs. The second chapter, Evolution of Apache Open Source Software, by
Haoran Wen et al., attempts to explain the evolution of the Apache open
source software through the analysis of its call graphs. The next chapter, Some
New Applications of Network Growth Models, by Gourab Ghoshal, presents
new models of growth for peer-to-peer file-sharing networks. The fourth chap-
ter, The Big Friendly Giant: The Giant Component in Clustered Random
Graphs, by Yakir Berchenko et al., is a theoretical study of the properties of
the giant component in a special kind of random graph, which is relevant for
various information networks. The fifth chapter, Technological Networks, by
Bivas Mitra, presents a detailed review of the large number of studies that
have been conducted on information networks, especially the World Wide
Web and peer-to-peer networks. The last chapter, Advances in the Theory of
Complex Networks, by Fernando Peruani, presents a survey of some of the the-
oretical advancements that have taken place and helps in providing a better
understanding of the structure and dynamics of information networks.
These contributions collectively demonstrate that complex networks in-
deed provide an elegant research framework relevant to a variety of scientific
disciplines. The chapters are designed to serve as the state of the art not only
for students and newcomers who intend to pursue research in this field but
also for the experts. All the chapters have been carefully peer reviewed for
their scientific content as well as readability and self-consistency.
We would like to thank the authors for their contributions, construc-
tive co-operation and gracious acceptance of the editorial comments. We are
also indebted to Ranjita Bhagwan, Chris Biemann, Lutz Brusch, Geoffrey
Canright, Michael Gamon, Gourab Ghoshal, Petter Holme, A. Kumaran,
Abyayananda Maiti, Pabitra Mitra, Luis Morelli, Gautam Mukherjee, Romit
Roy Choudhury, Gustavo Sibona and Biplab K. Sikdar for their constructive
criticisms, comments and suggestions, which have significantly improved the
quality of the chapters. In addition, we would also like to extend our grat-
itude to Rishabh Singh for his painstaking effort in helping to prepare the
Glossary of Essential Terms. Finally, we are also grateful to Tom Grasso and
the Birkhäuser team for all their help and support towards the publication of
this volume.
Preface
List of Contributors
Technological Networks
Bivas Mitra
Index
List of Contributors
Bivas Mitra
Department of Computer Science
and Engineering
Indian Institute of Technology
Kharagpur 721302
India
bivasm@cse.iitkgp.ernet.in
Animesh Mukherjee
Department of Computer Science
and Engineering
Indian Institute of Technology
Kharagpur 721302
India
animeshm@cse.iitkgp.ernet.in
Zachary M. Saul
Department of Computer Science
University of California
Davis, CA 95616
USA
zmsaul@ucdavis.edu
Sitabhra Sinha
The Institute of
Mathematical Sciences
CIT Campus
Taramani
Chennai 600113
India
sitabhra@imsc.res.in
Sitabhra Sinha
1 Introduction
To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.
– William Blake, Auguries of Innocence
Like Blake, physicists look for universal principles that are valid across
many different systems, often spanning several length or time scales. While
the domain of physical systems has often offered examples of such widely ap-
plicable “laws,” biological phenomena tended to be, until quite recently, less
fertile in terms of generating similar universalities, with the notable exception
of allometric scaling relations [20]. However, this situation has changed with
the study of complex networks emerging into prominence. Such systems com-
prise a large number of nodes (or elements) linked with each other according
to specific connection topologies, and are seen to occur widely across the bi-
ological, social and technological worlds [4, 9, 16]. Examples range from the
intra-cellular signaling system which consists of different kinds of molecules af-
fecting each other via enzymatic reactions, to the internet composed of servers
around the world which exchange enormous quantities of information packets
regularly, and food webs which link, via trophic relations, large numbers of
inter-dependent species. While the existence of complex networks in various
domains had been known for some time, the recent excitement among physi-
cists working on such systems has to do with the discovery of certain universal
principles among systems which had hitherto been considered very different
from each other.
This further underlined the fact that most networks occurring in reality are
neither regular (in which case the degree distribution would be close to a
delta function) nor random (which has a Poisson degree distribution), as for
both cases the probability of having a node with large degree (i.e., a hub)
would be significantly smaller than that indicated by the power-law tail of
empirically obtained degree distributions. In addition, it was observed that
there exist non-trivial degree correlations among linked pairs of nodes. For
example, a network where nodes with high degree tend to preferentially con-
nect with other high degree nodes is said to show assortative mixing [15]. On
the other hand, in a disassortative network, nodes with a large number of
links prefer to connect with nodes having low degree. Empirical studies indi-
cate that most biological and technological networks are disassortative, while
social networks tend to be assortative [16]. As assortative mixing promotes
percolation and makes a network more robust to vertex removal, it may be
hard to understand why natural evolution in the biological world has favored
disassortativity. However, in a recent study, we have shown that when one
considers the stability of dynamical states of a network, disassortative net-
works would tend to be more robust, and this may be one of the reasons why
they are preferred [6].
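The degree correlations described above can be made concrete with a small computation. The following is a minimal sketch, assuming an undirected network given as a list of links; it measures assortativity as the Pearson correlation between the degrees found at the two ends of each link, which is the usual way of quantifying the mixing discussed in [15]. The function name and the two toy networks are illustrative and do not come from the chapter.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each link."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count every undirected link in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("all nodes have the same degree; coefficient undefined")
    return cov / var


# Toy example 1: a star, where the hub links only to degree-1 nodes.
star = [(0, i) for i in range(1, 6)]

# Toy example 2: a triangle and a separate 4-clique, so every link joins
# two nodes of equal degree.
cliques = [(0, 1), (1, 2), (0, 2),
           (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]

if __name__ == "__main__":
    print(degree_assortativity(star))     # -1.0
    print(degree_assortativity(cliques))  # 1.0
```

A negative value indicates disassortative mixing (hubs linked to low-degree nodes), a positive value assortative mixing.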
This brings us to the thrust of recent work in the area of complex networks
which has shifted from the initial focus on purely structural aspects of the con-
nection topology to the role such features play in determining the dynamical
processes defined on a network [27]. Over the past few years, much effort has
been made to understand not only how structure affects dynamics, and hence
function, in a network, but also the reverse problem of how functional cri-
teria, such as the need for dynamical stability, can constrain the topological
properties of a network. In this chapter, some of the principal results obtained
by our group will be briefly described. The goal of our research program is
to understand the evolution of robust yet complex biological structures, viz.,
networks occurring in reality that are stable against perturbations and, yet,
which can adapt to a changing environment.
Fig. 2. Structure of the KirBac1.1 protein (left) which comprises four identical
subunits spanning the membrane and intra-cellular regions [13]. The PCN is constructed
by considering a cutoff distance of d_c = 12 Å, whose adjacency matrix is shown for
the entire network (right). Each of the four blocks corresponding to a subunit shows a
clear partition into membrane and intra-cellular compartments, indicating a modular
structure.
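The cutoff construction mentioned in the caption of Fig. 2 can be sketched in a few lines. This is only an illustration under stated assumptions: each residue is represented by a single 3D coordinate (which atom is used is not specified in this excerpt), two residues are linked when their distance is below the cutoff, and the function name and the three sample coordinates are hypothetical.

```python
import math


def contact_network(coords, cutoff=12.0):
    """Adjacency matrix linking every pair of points closer than `cutoff`.

    `coords` is a list of (x, y, z) positions in angstroms, one per residue
    (an assumed representation); the default cutoff follows the value of
    12 angstroms quoted in the caption.
    """
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Hypothetical coordinates: the first two points are 5 angstroms apart,
# the third is far from both.
example = contact_network([(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (30.0, 0.0, 0.0)])
```

A modular structure of the kind described in the caption would show up in such a matrix as dense diagonal blocks with few entries between them.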
From Network Structure to Dynamics and Back Again 7
Fig. 3. A subset of the signal transduction network of the BCR [12]. The kinases
are represented by squares, while other molecules (such as second messengers and
adapters) are depicted as circles.
It is in the context of these networks that questions first arose on the con-
nection between the structural properties of a network and the stability of
its dynamical behavior (see Section 4). Indeed, one not only asks what kind
of structures allow complex networks to be stable against ever-present per-
turbations, but also how the requirement to be robust constrains the kind of
structures such networks can evolve. To stress the universality of the questions
asked by physicists about networks, we note that, like many other networks,
food webs also have been shown to have a modular structure, with species in
each module interacting between themselves strongly and only weakly with
other species [11]. As in the other systems discussed earlier, the role that
modularity plays in stabilizing the dynamics of ecosystems can be seen as a
specific instance of a much more general question.
Having discussed a few instances of how universal principles about net-
works can appear by investigating very different systems in the biological
world, we now describe certain results of our studies on general network mod-
els. However, we stress that each of these results has relevance to problems
appearing in the context of specific biological systems.
The role that the connection topology of a network plays in the nature of
its dynamics has been extensively investigated for spin models occurring in
physics. In fact, such systems had been explored for a long time prior to the
recent interest in complex networks, and many results are known regarding
ordering transitions in both regular and random structures. More recently,
it has been shown that, for partial random rewiring in a system of sufficiently
large size, any finite value of p (the rewiring probability) causes a transition
to the small-world regime, with the Ising model defined on such a network ex-
hibiting a finite temperature ferromagnetic phase transition [5]. However, spin
models are extremely restricted in their dynamical repertoire; therefore, re-
searchers have looked at the effect of introducing other kinds of node dynamics
in such network structures, e.g., oscillators. Motivated by recent observations
that the brain may have a connection structure with small-world properties
(see e.g., Ref. [1]), we have examined the effect of long-range connections (i.e.,
non-local diffusion) over an otherwise regular network of nodes with links be-
tween nearest neighbors on a square lattice [25]. The dynamics considered
is that of the excitable type, with the variable having a single stable state
and a threshold. If a perturbation causes the system variable to exceed the
threshold, we see a rapid transition to a metastable excited state followed by a
slow recovery phase when the system gradually converges to the stable state.
As a result of coupling the dynamics of individual nodes through diffusive
coupling, various spatial patterns (which may be temporally varying) are
observed. Such dynamics are commonly observed in a large variety of biological
Each node in our model network has a dynamical variable associated with
it, which evolves according to a well-known class of difference equations com-
monly used for modeling population dynamics. By varying a non-linear pa-
rameter, the nature of the dynamics (i.e., whether it converges to a steady
state or undergoes chaotic fluctuations) at each node can be controlled. How-
ever, in the absence of coupling, each node will always have a finite, positive
value for its dynamical variable. When coupled in a network (initially in a ran-
dom fashion) with links that can have either positive or negative weights, it is
possible that as a result of dynamical fluctuations, the variable for some nodes
can become negative or zero. As this implies the absence of any activity, the
corresponding node is considered to be “extinct” and thus isolated from the
network. This procedure may create further fluctuations and cause more nodes
to become “extinct,” resulting in a gradual reduction of the size of the network
(Fig. 5). The final asymptotic size of the network, relative to its initial size,
is a measure of its robustness—the more robust network is one with a higher
fraction of nodes having persistent activity. Analysis showed that the network
robustness (as measured by the above global criterion) not only decreased with
N, C and s, as expected from local stability analysis, but actually matched the
May–Wigner theorem quantitatively [23]. In addition, the asymptotic network
exhibited robust macroscopic features: (a) the number of persistently active
nodes was independent of the initial network size, and (b) the asymptotic
number of links between these persistently active nodes was independent of
both the initial size and connectivity [24]. This is all the more surprising, as
the removal of nodes (and hence, links) is not guided by any explicit fitness
criterion, but rather emerges naturally from the nodal dynamics through
fluctuations of individual node properties. Our results imply that asymptotically
active networks are non-extensive: when two networks of size N are coupled to
each other (with the same connectance as the individual networks), although
the resulting network initially has a size 2N, the ensuing dynamical
fluctuations will reduce its size to N. This implies that simply increasing the number
of redundant elements is not a good strategy for designing robust systems.
Fig. 5. Evolution of a network with non-trivial dynamics at the nodes. The initial
(left) and final asymptotic (right) networks are shown. Only nodes having persistent
activity are connected to the network. The figures were drawn using Pajek software.
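The extinction process described above can be illustrated with a small simulation. The sketch below rests on several assumptions that are not fixed by this excerpt: the local map is taken to be an exponential (Ricker-type) population map, the coupling multiplies each node's variable by one plus the weighted sum of its neighbours' variables, and the parameter values, function names and random seed are arbitrary choices made for illustration only.

```python
import math
import random


def random_weights(n, connectance, strength, rng):
    """Random interaction matrix: each off-diagonal link exists with
    probability `connectance` and carries a Gaussian weight of either sign."""
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < connectance:
                weights[i][j] = rng.gauss(0.0, strength)
    return weights


def surviving_fraction(n, connectance, strength, r=4.0, steps=300, seed=0):
    """Fraction of nodes still active after `steps` updates."""
    rng = random.Random(seed)
    weights = random_weights(n, connectance, strength, rng)
    x = [rng.uniform(0.1, 1.0) for _ in range(n)]
    alive = [True] * n
    for _ in range(steps):
        new_x = [0.0] * n
        for i in range(n):
            if not alive[i]:
                continue
            # Assumed coupling form: the neighbours rescale the local variable.
            drive = x[i] * (1.0 + sum(weights[i][j] * x[j] for j in range(n)))
            if drive <= 0.0:
                alive[i] = False  # non-positive activity: node is "extinct"
            else:
                # Assumed local dynamics: exponential (Ricker-type) map.
                new_x[i] = drive * math.exp(r * (1.0 - drive))
        x = new_x
    return sum(alive) / n


if __name__ == "__main__":
    for s in (0.0, 0.2, 0.5):
        print(s, surviving_fraction(30, 0.2, s))
```

With zero interaction strength every node evolves independently and stays positive, so the whole network survives; how the surviving fraction changes with size, connectance and strength is the quantity the text compares with the May–Wigner prediction.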
We have also looked at the effect of empirically reported structures, such
as small-world connection topology and scale-free degree distribution, on the
dynamical stability of networks. Our results indicate that, in general, intro-
ducing such structural features does not alter the outcome expected from
the May–Wigner theorem [6, 22]. However, these details can indeed affect
the nature of the stability-instability transition; for example, the transition
exhibits a cross-over from being very sharp (resembling a first-order phase
transition) for a random network to a more gradual change as the network
becomes more regular in the small-world regime [22].
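For orientation, the May–Wigner criterion referred to above can be evaluated numerically. In its usual formulation (assumed here, since the statement of the theorem is not part of this excerpt), a large random network of N nodes with connectance C and interaction strengths of standard deviation s is almost surely stable when s·sqrt(N·C) < 1 and almost surely unstable otherwise. The function names below are illustrative.

```python
import math


def may_wigner_complexity(n, connectance, strength):
    """Return s * sqrt(N * C), the quantity compared with 1 in the criterion."""
    return strength * math.sqrt(n * connectance)


def is_expected_stable(n, connectance, strength):
    """True when the criterion predicts a stable network (large-N statement)."""
    return may_wigner_complexity(n, connectance, strength) < 1.0


if __name__ == "__main__":
    # Same connectance and strength, increasing size: the prediction flips.
    for n in (100, 400):
        print(n, round(may_wigner_complexity(n, 0.1, 0.3), 3),
              is_expected_stable(n, 0.1, 0.3))
```

The example shows why increasing any of N, C or s pushes a network towards instability: each of them raises the same combined quantity.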
This brings us to the issue of how complex networks can be stable at all, given
that the May–Wigner theorem seems to hold even for networks that have
structures similar to those seen in reality and where non-trivial dynamical
situations have also been considered. The solution to this apparent paradox
lies in the observation that most networks that we see around us did not
occur fully formed but emerged through a process of gradual evolution, where
stability with respect to dynamical fluctuations is likely to be one of the key
criteria for survival. In earlier work, we have shown that a simple model,
where nodes are gradually added to or removed from a network according
to whether this results in a dynamically stable network or not, leads to a
non-equilibrium steady state in which the network is extremely robust [30].
The robustness is manifested by increased resistance and resilience, as well as
decreased probability of large extinction cascades, when the network size (i.e.,
the system diversity) is increased. Thus, our results reconcile the apparently
contradictory conclusions of the May–Wigner theorem and a large number of
empirical studies.
More recently, we have shown that model networks can evolve many of
the observed structural features seen among networks in the natural world,
by taking into account the fact that the majority of such systems must opti-
mize between several (often conflicting) constraints, which may be structural
as well as dynamical in nature. In particular, most networks need to have
high communication efficiency (i.e., low average path length) and low connec-
tivity (to reduce the resource cost involved in maintaining many links) while
being stable with respect to dynamical perturbations. If a network satisfied
only the first two constraints, the optimal structure would have been that of
a star (Fig. 6). Even if the resource cost constraint is somewhat relaxed, so
that the network can have more links than the minimum necessary to make it
Fig. 6. Networks with (I) star and (II) clustered star connection topologies can form
the fundamental building blocks of different types of modular networks. Network con-
figurations with clustered star modules can be constructed by (A) connecting different
modules by single undirected links among the hub nodes, or (B) connecting nodes of a
module to another module only through the hub node of the latter, or (C) connecting
nodes of a module randomly to any node of another module.
Acknowledgments
I would like to thank my collaborators with whom the work described here
has been carried out, in particular, R. K. Pan, S. Sinha, N. Chatterjee, M.
Brede, C. C. Wilmers, J. Saramäki and K. Kaski, as well as S. Vemparala,
D. Kumar, K. V. S. Rao and B. Saha for helpful discussions.
References
1. Achard, S., Salvador, R., Whitcher, B., Suckling, J., Bullmore, E.: A resilient,
low-frequency, small-world human brain functional network with highly connected
association cortical hubs. J. Neurosci., 26, 63–72 (2006)
2. Aftabuddin, M., Kundu, S.: Hydrophobic, hydrophilic and charged amino acid
networks within protein. Biophys. J., 93, 225–231 (2007)
3. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science,
286, 509–512 (1999)
4. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod.
Phys., 74, 47–97 (2002)
5. Barrat, A., Weigt, M.: On the properties of small-world network models. Eur.
Phys. J. B, 13, 547–560 (2000)
6. Brede, M., Sinha, S.: Assortative mixing by degree makes a network more unstable.
Arxiv preprint, cond-mat/0507710 (2005)
7. Chatterjee, N., Sinha, S.: Understanding the mind of a worm: Hierarchical network
structure underlying nervous system function in C. elegans. Prog. Brain Res., 168,
145–153 (2007)
8. Deem, M.W.: Mathematical adventures in biology. Physics Today, 60(1), 42–47
(2007)
9. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets
to the Internet and WWW. Oxford Univ. Press, Oxford (2003)
10. Haliloglu, T., Bahar, I., Erman, B.: Gaussian dynamics of folded proteins. Phys.
Rev. Lett., 79, 3090–3093 (1997)
11. Krause, A.E., Frank, K.A., Mason, D.M., Ulanowicz, R.E., Taylor, W.W.:
Compartments revealed in food-web structure. Nature, 426, 282–284 (2003)
12. Kumar, D., Srikanth, R., Ahlfors, H., Lahesmaa, R., Rao, K.V.S.: Capturing cell-
fate decisions from the molecular signatures of a receptor-dependent signaling
response. Molecular Systems Biology, 3, 150 (2007)
13. Kuo, A., Gulbis, J.M., Antcliff, J.F., Rahman, T., Lowe, E.D., Zimmer, J.,
Cuthbertson, J., Ashcroft, F.M., Ezaki, T., Doyle, D.A.: Crystal structure of the
potassium channel KirBac1.1 in the closed state. Science, 300, 1922–1926 (2003)
14. May, R.M.: Stability and Complexity in Model Ecosystems. Princeton Univ. Press,
Princeton (1973)
15. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett., 89, 208701
(2002)
16. Newman, M.E.J.: The structure and function of complex networks. SIAM Review,
45, 167–256 (2003)
17. Pan, R.K., Sinha, S.: Modular networks emerge from multiconstraint optimization.
Phys. Rev. E, 76, 045103(R) (2007)
18. Pan, R.K., Sinha, S.: The small world of modular networks. Arxiv preprint,
arXiv:0802.3671 (2008)
19. Saramäki, J., Kaski, K.: Modelling development of epidemics with dynamic small-
world networks. J. Theor. Biol., 234, 413–421 (2005)
20. Schmidt-Nielsen, K.: Scaling: Why is Animal Size So Important? Cambridge Univ.
Press, Cambridge (1984)
21. Sen, P., Dasgupta, S., Chatterjee, A., Sreeram, P.A., Mukherjee, G., Manna, S.S.:
Small-world properties of the Indian railway network. Phys. Rev. E, 67, 036106
(2003)
22. Sinha, S.: Complexity vs. stability in small-world networks. Physica A, 346, 147–
153 (2005)
23. Sinha, S., Sinha, S.: Evidence of universality for the May-Wigner stability theorem
for random networks with local dynamics. Phys. Rev. E, 71, 020902(R) (2005)
24. Sinha, S., Sinha, S.: Robust emergent activity in dynamical networks. Phys. Rev.
E, 74, 066117 (2006)
25. Sinha, S., Saramäki, J., Kaski, K.: Emergence of self-sustained patterns in small-
world excitable media. Phys. Rev. E, 76, 015101(R) (2007)
26. Steele, A.J., Tinsley, M., Showalter, K.: Spatiotemporal dynamics of networks of
excitable nodes. Chaos, 16, 015110 (2006)
27. Strogatz, S.H.: Exploring complex networks. Nature, 410, 268–276 (2001)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature,
393, 440–442 (1998)
29. Weng, G., Bhalla, U.S., Iyengar, R.: Complexity in biological signaling systems.
Science, 284, 92–96 (1999)
30. Wilmers, C.C., Sinha, S., Brede, M.: Examining the effects of species richness on
community stability: An assembly model approach. Oikos, 99, 363–367 (2002)
31. Wilmers, C.C.: Understanding ecosystem robustness. Trends Ecol. Evol., 22,
504–506 (2007)
Regulation of Apoptosis via the NFκB
Pathway: Modeling and Analysis
1 Introduction
Programmed cell death (or apoptosis) has an essential biological function, en-
abling successful embryonic development, as well as maintenance of a healthy
living organism [6]. Apoptosis is a physiological process which enables an
organism to remove unwanted or damaged cells. Malfunctioning apoptotic
pathways can lead to many diseases, including cancer and inflammatory or
immune system related problems. A family of proteins called caspases is
primarily responsible for execution of the apoptotic process: in response to
appropriate stimuli, initiator caspases (for instance, caspases 8, 9)
activate effector caspases (for instance, caspases 3, 7), which will then cleave
various cellular substrates to accomplish the cell death process [22].
Nuclear factor κB (NFκB) is a transcription factor for a large group of
genes which are involved in several different pathways. For instance, NFκB
activates its own inhibitor (IκB) [14] as well as groups of pro-apoptotic and
anti-apoptotic genes [21]. Among the latter, NFκB activates transcription of
a gene encoding inhibitor of apoptosis protein (IAP). This protein in turn
helps to downregulate the activity of the caspase cascade which forms
the core of the apoptotic pathway [6, 8].
The canonical NFκB pathway is induced, among other stimuli, by the
cytokine tumor necrosis factor α (TNFα) [21]. Binding of TNFα to death
receptor TNFR1 forms a first complex which eventually activates NFκB.
A second complex is later formed, which will activate the initiator caspase
8 [6], and hence activate the apoptotic process. The same signal (TNFα stim-
ulation) thus triggers two parallel but contrary pathways: the pro-apoptotic
caspase cascade and the anti-apoptotic NFκB-IκB-IAP pathway. These two
pathways, together with the interactions among their components, form a
2 The Model
The network of interactions among the NFκB pathway and the apoptosis sig-
naling cascade to be studied here is shown in Fig. 1. The various components
of the network (here messenger RNAs, proteins, or protein complexes) form
the set of variables or nodes (X_i, i = 1, ..., n) of the Boolean model. The
system will evolve according to a set of logical rules which are deduced from
the interactions or links depicted in the schematic diagram of Fig. 1. The
interactions among nodes can be classified as “activation” or “inhibition” links:
a directed arrow X_i → X_j means that a high concentration of component
X_i activates component X_j, while the symbol X_i ⊣ X_j means that a high
concentration of component X_i inhibits X_j.
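How logical rules can be derived from activation and inhibition links may be easier to see in code. The following is a minimal sketch, not the rule set of this chapter: it assumes a synchronous update in which a node is switched on when at least one of its activators is on and none of its inhibitors is on, and in which nodes without regulators are treated as fixed inputs. The node names X1 to X3 and the function name are hypothetical.

```python
def boolean_step(state, activators, inhibitors):
    """One synchronous update of a Boolean network with signed links.

    `state` maps node names to True/False. `activators[n]` and
    `inhibitors[n]` list the nodes that activate or inhibit node n.
    Assumed rule: on if some activator is on and no inhibitor is on.
    """
    new_state = {}
    for node, value in state.items():
        acts = activators.get(node, [])
        inhs = inhibitors.get(node, [])
        if not acts and not inhs:
            new_state[node] = value  # no regulators: treated as an external input
            continue
        # A node regulated only by inhibitors is on unless one of them is on.
        activated = any(state[a] for a in acts) if acts else True
        inhibited = any(state[i] for i in inhs)
        new_state[node] = activated and not inhibited
    return new_state


# Hypothetical three-node example: X1 activates X2, X3 inhibits X2.
activators = {"X2": ["X1"]}
inhibitors = {"X2": ["X3"]}

without_inhibitor = boolean_step(
    {"X1": True, "X2": False, "X3": False}, activators, inhibitors)
with_inhibitor = boolean_step(
    {"X1": True, "X2": False, "X3": True}, activators, inhibitors)
```

Other conventions are possible (for instance, requiring all activators, or letting activation override inhibition); which one applies to a given node has to be decided from the biology of the interaction.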
The components in our model and the activation or inhibition links
among them are based on existing literature data. For general aspects, the
reviews [6, 21] were used. However, some pathways of regulation among the
NFκB pathway and the caspase cascade are not yet clear, and more work is
needed to understand how these two signaling pathways are interconnected.
In this chapter, we aim to investigate and test several possible hypotheses for
the combined network structure. We will consider four model variants and
try to discriminate between them by comparing our numerical analysis with
experimental data from the literature. The four network variants (see Table 2)
are based on different combinations of three links (A, L, C in Fig. 1) which
have been suggested but are not fully established in the apoptosis literature.
Fig. 1. Schematic diagram of the NFκB pathway and the caspase cascade (light
shaded regions). The oval dark grey shaded region represents the cellular nucleus.
Both pathways are activated by binding of TNFα to death receptor TNFR1 (the
resulting complex is represented simply by the rectangle TNF). Messenger RNAs are
represented by ellipses, while transcription factors, caspases, and other proteins are
represented by squares. To study the interconnections between the two pathways, four
network variants, based on different combinations of the links A, L, and C, will be
analysed and compared (see Table 2).
The NFκB pathway follows very closely the model presented in [17]. Stim-
ulation of death receptors with TNFα leads (see for instance [6]), first, to the
formation of a complex I (T1 in Fig. 1) which will recruit and activate IκB
kinases (IKK). Inhibitor of NFκB, or IκB, acts by binding to NFκB
molecules and preventing their transcriptional function. Active IKK (IKKa)
phosphorylates IκB which releases NFκB, thus enabling its translocation to
the nucleus and transcription of NFκB-dependent genes, including genes for
inhibitor of apoptosis protein (iap), inhibitor of NFκB (iκB), a protein associated
with inhibition of complex T2 (flip), and a protein regulating IKK
activity (a20) [21]. Transcription of IκB mRNA generates a negative feedback
22 M. Chaves et al.
loop in the NFκB pathway [14, 20], which may lead to oscillatory behaviour
in NFκB and IκB concentrations [19]. In a second step, after dissociation of
components of complex I from the death receptor, a second complex is formed
(T2 in Fig. 1) which will recruit and activate initiator caspase 8 (C8a). As a
result of the signaling cascade [8, 22], effector caspase 3 is also activated (C3a).
Thus, complex T1 activates the anti-apoptotic pathway and, after a certain
delay, complex T2 activates the pro-apoptotic pathway.
Two well-documented points of regulation of the apoptotic pathway by
NFκB are inhibition of C3a by IAP and regulation of complex T2 by FLIP [6].
Active caspase 8 was found to be negatively regulated by caspase-8 and
caspase-10-associated RING proteins (CARPs) [18], which seem to play an
analogous role to IAPs, but are less well studied. It was found that CARPs
are overexpressed in tumors, and that their suppression leads to restoration of
the apoptotic pathway, with the CARP being rapidly cleaved. In addition, it
was observed that inhibitors of caspase 3 block CARP cleavage. In our model,
we introduced CARP and a pre-complex CARP0 , which is inhibited by C3a.
Inhibition by C3a is, however, not sufficient to control CARP, and there are
probably other regulators. Since CARP plays a role for caspases 8 and
10 similar to the one IAP plays for caspases 3 and 9 (and in the absence of further details),
we assume that the pre-complex CARP0 is also regulated by a product of the
NFκB pathway.
The points where the caspase cascade influences the NFκB pathway are
less well documented. We will use our model to test different hypotheses by
studying and comparing the network dynamics for the following cases (see
also Table 2): inhibition of IKKa (link L) and/or NFκB (link A) by C3a, or
neither of these links present.
To obtain the logical rules shown in Table 1, some simplifications of the
biological processes were inevitably introduced. For instance, the bound com-
plex NFκB−IκB (either in the cytoplasm or in the nucleus) was not explicitly
considered in the system, but was simply treated as an inhibition effect: the
rule for NFκB says that it vanishes whenever IκB is expressed. Thus, any
state with NFκB = 0 and IκB = 1 represents in fact a high concentration of
bound complex NFκB − IκB, while any state with NFκB = 1 and IκB = 0
represents a high concentration of free NFκB and low concentration of free
IκB. To translate our diagram into a set of logical rules, the convergence of
two or more arrows (either activation or inhibition) at the same node was al-
ways treated as a logical AND, except in three cases: IκB, IAP, and CARP0 .
For these proteins, the overall effect was treated as an AND in the presence of
TNF stimulation, but treated as an OR in the absence of TNF. These three
proteins represent inhibitors whose levels should be stable in the absence of
any stimulus [8]: IAP and CARP0 (or CARP) should be effective inhibitors of
the caspases, and IκB should be at approximately constant levels to control
NFκB transcriptional activity. In contrast, with TNF stimulation, the degra-
dation rates of these proteins can vary and lead to rapid changes in their
concentrations (different degradation rates in the presence or absence of TNF
have been observed, notably for bound IκB [20]). For instance, under TNF
treatment, the rule for inhibition of NFκB is simplified to IκB+ = [iκB and not
IKKa]. Suppose that IKK becomes activated at time t1 , that is IKKa(t1 ) = 1.
Then, in the next iteration of the model, the IκB rule implies that IκB will
degrade very fast, with IκB(t1 +Δ) = 0. In contrast, in the absence of the TNF
stimulus, the rule is IκB+ = [iκB or not IKKa]. If IKK becomes active at time
t1 , one has IκB(t1 + Δ) = iκB(t1 ), meaning that IκB is only rapidly degraded
if no more of its messenger RNA is available. A similar reasoning justifies the
rules for IAP and CARP0 . The rules for these three proteins with inhibiting
roles reflect the fact that their degradation rates, and hence turnover, can be
much faster in response to TNF stimulation.
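The two forms of the IκB rule quoted above can be written as one small function. This is a minimal sketch in Python; the function name `ikb_next` and its argument names are illustrative and do not come from the chapter.

```python
def ikb_next(ikb_mrna: bool, ikk_active: bool, tnf: bool) -> bool:
    """Next value of IkB given its mRNA (ikB), active IKK (IKKa), and TNF.

    With TNF the two inputs are combined with AND, without TNF with OR,
    following the two rules quoted in the text.
    """
    if tnf:
        return ikb_mrna and not ikk_active
    return ikb_mrna or not ikk_active


# IKK has just become active (IKKa = 1); try both mRNA levels.
with_tnf = [ikb_next(m, True, True) for m in (False, True)]
without_tnf = [ikb_next(m, True, False) for m in (False, True)]
print(with_tnf)     # [False, False]: IkB is degraded regardless of its mRNA
print(without_tnf)  # [False, True]: IkB follows its mRNA level
```

The output mirrors the argument in the text: under TNF, active IKK removes IκB in the next iteration, whereas without TNF the protein only disappears when its messenger RNA is also absent.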
where Xi ∈ {0, 1}, X = (X1 , . . . , Xn ) denotes the state of the system at time t,
and X + = (X1+ , . . . , Xn+ ) denotes the next state (at t + Δ). Alternatively, with
asynchronous algorithms, at each iteration the nodes are sequentially updated,
according to a given order (which can be prespecified or randomly chosen).
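The difference between the two updating schemes can be seen on a very small example. The sketch below uses a hypothetical two-node negative feedback loop (not the rules of Table 1); the names `RULES`, `synchronous_step`, and `asynchronous_step` are assumptions made for illustration.

```python
import random

# Toy two-node negative feedback loop (hypothetical, not the chapter's Table 1):
# A is inhibited by B, and B is activated by A.
RULES = {
    "A": lambda s: not s["B"],
    "B": lambda s: s["A"],
}


def synchronous_step(state):
    """All nodes read the current state and are updated together."""
    return {node: rule(state) for node, rule in RULES.items()}


def asynchronous_step(state, order=None, rng=random):
    """Nodes are updated one at a time; later nodes see earlier updates."""
    new_state = dict(state)
    if order is None:
        order = list(RULES)
        rng.shuffle(order)  # randomly chosen order
    for node in order:
        new_state[node] = RULES[node](new_state)
    return new_state


start = {"A": False, "B": False}
sync_next = synchronous_step(start)
async_next = asynchronous_step(start, order=["A", "B"])
print(sync_next)   # {'A': True, 'B': False}
print(async_next)  # {'A': True, 'B': True}
```

Starting from the same state, the two schemes already differ after one iteration, because in the asynchronous scheme node B reads the freshly updated value of A.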
Discrete models focus on the structure of the network (links), thus offering
a more qualitative description of the system’s dynamics. Continuous models
may offer more detailed descriptions of a system, but they also have the dis-
advantage of involving a large set of kinetic parameters, many of which are
unknown. A method for analysis of Boolean models was introduced in [12, 13],
which provides a bridge between discrete and continuous approaches. In this
method, each node Xi of the network is represented by one continuous variable
(xi ) and one discrete variable (Xi , as before). The continuous variables are
Table 1. Boolean rules for the model of regulation of apoptosis via the NFκB pathway.
TNF is a constant input. Identification of the nodes is given in the text. The letter “a”
juxtaposed to a variable name denotes the active form of a molecule. The subscript
“nuc” denotes the given component in the cellular nucleus. Alternative rules are given
for the presence/absence of links A, C, L.
The steady states of a Boolean model are given by all the possible solutions
X ∗ of the equations:
It is easy to see that any steady state of the Boolean model yields a steady
state of the piecewise linear equations (2), since
\[ \frac{dx_i}{dt} = 0 \;\Leftrightarrow\; x_i = \frac{b_i}{a_i}\, F_i(X_1, X_2, \dots, X_n), \quad i = 1, \dots, n, \]
independently of θi . Because the right-hand side of this equation is discontin-
uous, it is difficult to provide general results on the existence and uniqueness
of solutions for system (2) (see for instance [3] and [11]). In view of this dif-
ficulty, in the present study we will assume that trajectories are well defined
and analyze their dynamical behavior.
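For small networks, steady states can be found by brute force. The sketch below assumes that a steady state is a state left unchanged by the Boolean update, X* = F(X*), and then evaluates the corresponding equilibrium x_i = (b_i/a_i) F_i(X*) of the piecewise linear system. The mutual-inhibition network, the function names, and the parameter values are hypothetical.

```python
from itertools import product


# Hypothetical bistable toy network: two nodes that inhibit each other.
def F(state):
    a, b = state
    return (int(not b), int(not a))


def boolean_steady_states(update, n_nodes):
    """Enumerate all 2**n states and keep those with update(X) == X."""
    return [s for s in product((0, 1), repeat=n_nodes) if update(s) == s]


def continuous_steady_state(state, update, a, b):
    """Equilibrium x_i = (b_i / a_i) * F_i(X) of the piecewise linear system."""
    return tuple(bi / ai * fi for ai, bi, fi in zip(a, b, update(state)))


steady = boolean_steady_states(F, 2)
equilibrium = continuous_steady_state((1, 0), F, a=(2.0, 0.5), b=(1.0, 1.0))
print(steady)       # [(0, 1), (1, 0)]
print(equilibrium)  # (0.5, 0.0)
```

Exhaustive enumeration grows as 2^n, so it is only practical for networks of modest size; it is used here because it makes the steady-state condition explicit.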
For the model of Table 1, the steady states depend on the value of TNF
(see Table 2). It is not difficult to check that (both with and without link A)
there are exactly two distinct steady states when TNF = 0, characterized by
the presence or absence of caspases 3 and 8, and hence corresponding to the
survival or apoptotic responses (nodes not indicated below are zero):
This is in agreement with the idea that, under typical conditions, the cell
should be capable of stably maintaining either an apoptotic or a survival
state [8, 4].
Table 2. Steady states of the Boolean model, for each model variant, in the presence
and absence of TNF.
Model  Links            TNF = 0    TNF = 1  Oscillations?
I      A, C, no L       Ap0, Lf0   Ap1      Yes
II     L, C, no A       Ap0, Lf0   none     Yes
III    C, no A, no L    Ap0, Lf0   none     Yes
IV     L, no A, no C    Ap0, Lf0   none     Yes
If TNF = 1, there is only one possible steady state for models
with link A:
For models with no link A, there is no possible steady state when TNF = 1,
and there are only periodic orbits of period higher than 1.
Therefore, during TNF treatment, models with link A may at any time
make a decision towards the apoptotic pathway, while models with no link
A will exhibit oscillatory behaviour and can only make a decision when TNF
treatment ceases. Upon removal of TNF stimulation, trajectories of system (2)
may be expected to converge to either the apoptotic or survival state. The
choice of one or the other state will depend on the initial condition and the
set of parameters ai , bi , and θi . Since these parameters are very likely to
vary from cell to cell, it is reasonable to consider several (randomly chosen)
sets of parameters and then compute the probability of convergence to each
steady state. To examine the dynamics of system (2), and its dependence on
parameters and the structure of the network of interactions, several numerical
studies were performed, as described next.
To test the model and analyse the effects of links A and L (Fig. 1), system (2)
was simulated several times, with randomly chosen sets of parameters. For
simplicity, the synthesis rates and threshold constants were fixed (bi = 1 and
θi = 0.5 for all i), and only parameters ai were allowed to vary, chosen from
a uniform distribution in the interval [1/3, 3] (h⁻¹). This seems reasonable, as
the degradation rates used in [17] are roughly between 0.5 and 4 h⁻¹. Observe
that ai plays a double role: it represents a degradation rate, but also defines the
0/1 threshold concentration (0.5/ai ). Hence, high degradation rates also imply
that a lower concentration is needed to achieve the 0/1 transition. Different
durations of TNF stimulation were considered, namely: 2, 6, 11, 16, and 21
hours. For these simulations, one initial condition was chosen: IκB(0) = 1 and
all other nodes set to zero. This is based on a natural physiological starting
point of the system: previous to stimulation, IKK is in its inactive form, while
IκB is bound to NFκB, preventing transcriptional activity. Caspases reside in
the cytosol in dormant forms [22].
To understand the importance of the links A, C, and L (the least well
documented), four variants of the model depicted in Fig. 1 are compared: (I)
links A and C present, (II) links L and C present, (III) only link C present,
and (IV) only link L present (as listed in Table 2). The first three variants aim
at comparing the effects of links A and L, and the last aims at evaluating the
effect of link C. Other alternatives gave similar results (for example, a model
with all three links gave results very similar to I) and thus are not detailed
here. For each variant, the response of the system to each of the five TNF
durations was simulated 500 times. Since different sets of parameters {ai }
introduce different time scales, variations in the dynamics from one simulation
to another are expected. These variations may also be interpreted as a result
of natural variability in biological systems. The average response over the 500
simulations will then yield the probability of the system converging towards
each of the steady states.
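The sampling procedure can be sketched as follows. The code assumes that system (2) has the form dx_i/dt = −a_i x_i + b_i F_i(X), with X_i = 1 whenever the normalized variable a_i x_i / b_i exceeds θ_i; this form is an assumption chosen to be consistent with the steady-state relation and the 0.5/a_i threshold mentioned in the text. A hypothetical two-node mutual-inhibition network stands in for the rules of Table 1, and simple Euler integration replaces whatever solver was actually used. The values b_i = 1, θ_i = 0.5, a_i uniform in [1/3, 3], and 500 runs are taken from the text.

```python
import random

THETA = 0.5  # threshold on the normalized variable a_i * x_i / b_i
B = 1.0      # synthesis rate, fixed as in the text


def F(X):
    """Hypothetical toy rules: two nodes that inhibit each other."""
    return (int(not X[1]), int(not X[0]))


def discretize(x, a):
    """Boolean state read off from the continuous variables."""
    return tuple(int(a[i] * x[i] / B > THETA) for i in range(2))


def simulate(a, t_end=10.0, dt=0.01):
    """Euler integration of dx_i/dt = -a_i x_i + b_i F_i(X) (assumed form)."""
    x = [0.0, 0.0]
    for _ in range(round(t_end / dt)):
        f = F(discretize(x, a))
        x = [x[i] + dt * (-a[i] * x[i] + B * f[i]) for i in range(2)]
    return discretize(x, a)


def outcome_frequencies(n_runs=500, seed=0):
    """Fraction of runs ending in each Boolean state, over random rates a_i."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_runs):
        a = [rng.uniform(1 / 3, 3) for _ in range(2)]
        final = simulate(a)
        counts[final] = counts.get(final, 0) + 1
    return {state: n / n_runs for state, n in counts.items()}


freqs = outcome_frequencies()
print(freqs)
```

In this toy network the node with the larger degradation rate reaches its threshold first and switches the other off, so the two outcomes are roughly equally likely. In the full model, the corresponding frequencies would be read as probabilities of survival or apoptosis.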
Other open questions that may be studied with our model include com-
petition between the pro- and anti-apoptotic pathways and the point of ir-
reversibility of the apoptotic decision. For instance, how long after caspase
activation is recovery from the apoptotic pathway still possible [22]? To
address these questions, numerical experiments were conducted by letting
NFκB(0) = 1, setting all others to zero, and maintaining C3a(t) = 1 for
durations of 10, 30, 60, and 360 minutes.
For analysis of the numerical results, a “peak” in the trajectory of node
Xj will be defined as a time interval [T0 , T1 ], during which Xj (t) = 1, and such
that Xj (T0 − Δ) = Xj (T1 + Δ) = 0. The period of oscillations is calculated
as the average time interval between the onset of two consecutive peaks, i.e.,
\[ \text{Period} = \frac{1}{N_p - 1} \sum_{i=2}^{N_p} \left( T_{0,i} - T_{0,i-1} \right), \]
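The peak definition and the period formula translate directly into code. The sketch below works on a sampled Boolean trajectory; the sample data and the function names `peak_onsets` and `mean_period` are hypothetical.

```python
def peak_onsets(times, values):
    """Onset times T0 of peaks: samples where the Boolean value goes 0 -> 1."""
    return [t for t, prev, cur in zip(times[1:], values, values[1:])
            if prev == 0 and cur == 1]


def mean_period(onsets):
    """Average interval between consecutive onsets; needs at least two peaks."""
    if len(onsets) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return sum(gaps) / (len(onsets) - 1)


# Hypothetical Boolean trajectory sampled every 0.5 h.
times = [0.5 * k for k in range(12)]
values = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
onsets = peak_onsets(times, values)
period = mean_period(onsets)
print(onsets)  # [1.0, 3.5, 5.0]
print(period)  # 2.0
```

Because the sum telescopes, the same value is obtained as (last onset − first onset) / (number of peaks − 1).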
[Figure 2: two columns of panels (left and right), with one row each for TNF, IKK, IkBn, NFkBn, IAP, C8a, and C3a. Horizontal axis: Time (hours), 0 to 20; vertical axis: 0 to 1. Degradation rates printed in the panels (left / right): IKK 2.6733 / 1.1631; IkBn 1.898 / 2.9469; NFkBn 2.3488 / 2.5784; IAP 0.90041 / 1.8348; C8a 0.79962 / 2.5642; C3a 0.4439 / 0.69736.]
Fig. 2. Example of network dynamics with the hybrid model (variant II), corre-
sponding to cell survival (left) or apoptosis (right) solution. Numbers indicate the
degradation rates for these numerical experiments. Solid lines represent normalized
continuous variables (xi ) and dashed lines represent discrete variables (Xi ).
[Figure 3: survival rate (%) as a function of TNF duration (hours), with one curve for each model variant (I, II, III, IV).]
models II and IV, or 30% in the model with only link C (which favours the
anti-apoptotic pathway) (Fig. 3). These values appear to be in agreement with
experimental data: Rehm et al. [22] report that, for 8 hour treatments with
[Figure 4: two rows of three panels, one column per model variant (I, II, III). Top row: horizontal axis TNF duration (hours). Bottom row: TPeak i − TPeak i−1 (hours) versus Peak i.]
Fig. 4. Top row: Average period of nuclear IκB oscillations for apoptotic or surviving
cells, as a function of TNF stimulus duration. Vertical lines represent standard deviation
over the 500 numerical experiments. Bottom row: Relative timing of successive
peaks in IκB oscillations, for apoptotic (grey) or surviving (black) cells. The “+” signs
mark the experimental peak timing in [19].
high and low concentrations of TNFα, the percentage of cells undergoing ac-
tivation of effector caspases was, respectively, 86% and 24%. The numerical
experiments with our model capture the response to high (or significant) concentrations
of TNFα, so variant I (followed by II and IV) is closer to the
real system.
Quantitative analysis of the oscillatory behaviour reveals some interesting
facts (Fig. 4). To characterize the oscillatory dynamics, the following quan-
tities were computed for nuclear IκB: period of oscillations (approximated),
number of peaks, and relative timing between peaks. First, in all cells os-
cillations cease when TNF stimulation ceases, in agreement with observa-
tions. Second, the timing of successive peaks is also in remarkable quantitative
agreement with experimental data [19], see Fig. 4 (bottom row). The first peak
in nuclear IκB concentration was observed about 72 minutes from the start
of TNF stimulation, and the second peak appears about 4 hours later, very
close to the 75 minutes and 4.5 hours reported in [19]. It is striking that the
time span of the first peak is typically longer than that of the following peaks,
and that the time lapse between consecutive peaks decreases (see Figs. 2, 4).
Third, the average period of oscillations is fairly constant, but “depends” on
the apoptosis/survival decision. Statistical analysis of the period of oscillations
[Figure 5: survival rate (%) as a function of the C3a overexpression interval (minutes), with one curve for each model variant (I, II, III, IV).]
decision, while for the largest part (70% of all cells) the apoptotic pathway
is chosen early on, within an hour of TNF stimulation. Not surprisingly, ex-
amination of the relative values of the parameters ai shows that two thirds of
cells that were able to recover from the apoptotic pathway had degradation
rates for C3a higher than those for NFκB or IκB.
Based on our study of regulation of apoptosis and the NFκB pathway, it
seems clear that the links A and L play quite important roles, and at least
one of these should definitely be included for faithful modeling of apoptosis
via TNF receptors. This eliminates model III. Both links contribute to the
same physiological function: downregulation of NFκB transcriptional activity.
However, link A (direct inhibition of NFκB by C3a) achieves this objective
in a much faster way than link L (“indirect” inhibition of NFκB by C3a,
through complex IKK). The essential difference between models I and II is
thus the length of the pathway representing inhibition of NFκB by C3a. The
shorter path (model I, with link A) leads to much higher apoptosis rates than
the longer path (models II or IV, with link L). The shorter path also renders
recovery from the apoptosis pathway practically impossible, with apoptosis
rates higher than 95% after only half an hour with C3a overexpression (Fig. 5).
The longer path allows a higher recovery rate from the apoptotic pathway,
although the probability of apoptosis does not increase above 70%, even after
6 hours of C3a overexpression. Recent experimental evidence [10] points to
the existence of a link L, that is, caspases are responsible for cleavage or
degradation of (parts of) complex IKK. To further discriminate between a
short or long pathway for the influence of caspases on the NFκB pathway, the
results shown in Fig. 4 suggest the following experiment. First, measure the
period of oscillations during TNF stimulation and then monitor cells for some
time after TNF removal. Next, compare the frequency of oscillations in cells
that survive and in cells that eventually go through the apoptotic program.
If the frequency of oscillations is similar for both groups of cells, or slightly
higher in apoptotic cells, then model II (longer pathway) provides a better
description of the system. If oscillations stopped after a short time interval (as
compared to TNF duration) in apoptotic cells, then model I (shorter pathway)
should be chosen.
5 Conclusion
The present study illustrates the usefulness of Boolean and piecewise linear
models in the analysis of large complex networks. The qualitative dynamics
that emerges from the network structure was studied, leading to predictions on
the response to increasing duration of stimulation, response to overexpression
of a given protein, or indication of which links/interactions play crucial roles
in the regulation of apoptosis. Some quantitative aspects were also analyzed,
such as the probabilities of survival or apoptosis and the frequency/period of
oscillations, and were shown to be in remarkable agreement with experimental
data. Many other questions can be examined in this hybrid framework: for
instance, extending the set of parameters (degradation and synthesis rates,
threshold concentrations) and varying the relative strengths of anti- and pro-
apoptotic links will lead to more refined models, capturing a wider range of
kinetic variability. Although writing the logical rules requires some simplifica-
tions of the biological processes, discrete and hybrid models retain the essen-
tial qualitative properties of the network. The effect of the network structure
on the qualitative dynamics of the system can be easily studied, even when
kinetic details are not well known. This class of models can thus be a pow-
erful method to generate predictions and test new hypotheses for complex
biological networks.
Acknowledgments
The authors thank Peter Scheurich and Monica Schliemann for their many
interesting and fruitful discussions.
References
1. R. Albert and H.G. Othmer. The topology of the regulatory interactions predicts
the expression pattern of the Drosophila segment polarity genes. J. Theor. Biol.,
223:1–18, 2003.
2. G. Bernot, J.-P. Comet, A. Richard, and J. Guespin. Application of formal meth-
ods to biological regulatory networks: extending Thomas’ asynchronous logical
approach with temporal logic. J. Theor. Biol., 229:339–347, 2004.
3. R. Casey, H. de Jong, and J.L. Gouzé. Piecewise-linear models of genetic regulatory
networks: equilibria and their stability. J. Math. Biol., 52:27–56, 2006.
4. M. Chaves, T. Eissing, and F. Allgöwer. Bistable biological systems: a charac-
terization through local compact input-to-state stability. IEEE Trans. Automat.
Control, 53:87–100, 2008.
5. M. Chaves, E.D. Sontag, and R. Albert. Methods of robustness analysis for boolean
models of gene control networks. IEE Proc. Syst. Biol., 153:154–167, 2006.
6. N.N. Danial and S.J. Korsmeyer. Cell death: critical control points. Cell, 116:
205–216, 2004.
7. H. de Jong, J.L. Gouzé, C. Hernandez, M. Page, T. Sari, and J. Geiselmann.
Qualitative simulation of genetic regulatory networks using piecewise linear mod-
els. Bull. Math. Biol., 66:301–340, 2004.
8. T. Eissing, H. Conzelmann, E.D. Gilles, F. Allgöwer, E. Bullinger, and
P. Scheurich. Bistability analysis of a caspase activation model for receptor-induced
apoptosis. J. Biol. Chem., 279:36892–36897, 2004.
9. A. Fauré, A. Naldi, C. Chaouiya, and D. Thieffry. Dynamical analysis of a
generic boolean model for the control of the mammalian cell cycle. Bioinformatics,
22(14):e124–e131, 2006.
10. C. Frelin, V. Imbert, V. Bottero, N. Gonthier, A.K. Samraj, K. Schulze-Osthoff,
P. Auberger, G. Courtois, and J.F. Peyron. Inhibition of the NF-κB survival path-
way via caspase-dependent cleavage of the IKK complex scaffold protein and NF-
κB essential modulator NEMO. Cell Death Differ., 15:152–160, 2008.
Andreas Beyer
1 Introduction
Biological systems are characterized by a large number of diverse interactions.
Interaction maps have been used to abstract those interactions at all biolog-
ical scales ranging from food webs at the ecosystem level down to protein
interaction networks at the molecular scale.
Organisms consist of thousands of cells with hundreds of different types.
Cells in turn contain millions of molecules comprising thousands of different
chemical species. Our genome contains about 23,000 protein coding genes [32],
and the estimated number of chemically different proteins (considering splice
variants and posttranslational modifications) is at least an order of magnitude
larger. It is difficult to estimate the true number of different proteins, because
there are no reliable methods yet for predicting splice variants. For example,
the NCBI database (www.ncbi.nlm.nih.gov) currently lists about 440,000 pro-
tein entries—many of them may however be redundant. In addition, our cells
contain many other molecules with catalytic or regulatory functions, such as
ribosomal RNA, tRNA, and small interfering RNA (siRNA). Further, the cells
contain thousands of different lipid species and other small molecules serving
as structural components of the cell or as substrates for the biochemical re-
actions executed by the metabolic program. Hence, our body is coordinating
the activity and reactions of hundreds of thousands if not millions of different
chemical species [3]. Even a single cell is a prototypic example of a complex
system [27]. Although biological systems follow all basic physical and chem-
ical principles, they cannot be modeled sufficiently using standard methods
from those two disciplines. Typical physical models describe a system as ei-
ther a small number of different entities (e.g. mechanics) or a large number of
very similar or even identical elements (e.g. thermodynamics). Likewise,
chemical reaction systems can only be appropriately described if the number of
reacting species is small. However, the behavior and fate of organisms cannot
[Figure: types of network models. Axis labels: level of detail, coverage.
Conditional interactions: need condition-specific protein abundance or protein activation data.
Causal (directed) interactions: can predict who affects whom; need regulatory information.
Logical networks: consider the type of effect (repressive/activating) and potentially also Boolean rules (such as “need A AND B to activate C”).
Quantitative models: kinetic rate constants are known or derived from the data; can predict the dynamics of the system response.]
located in the respective region of the genome and it is mostly unknown which
of the genes is causal for the disease. Second, even if the causal gene is known,
the molecular mechanisms linking the gene to the disease are usually elusive.
Lage et al. addressed these problems by mapping all genes located in disease-
related loci onto a protein-protein interaction network. They hypothesized
that truly causal genes would cluster together in common protein complexes
of the network. Indeed the authors found protein complexes significantly en-
riched with candidate genes. Often, these complexes also had a molecular
relationship with disease phenotypes. Hence, the investigators not only iden-
tified potentially causal genes, but they also identified protein complexes that
could aid in understanding the molecular mechanisms by which mutations
alter disease susceptibility. This example demonstrated that even comparatively
simple networks can yield new insight into diseases.
38 A. Beyer
large maps of functionally related genes. Such a “functional network” has for
example been used to better predict substrates of kinases [47]. This study
nicely demonstrated the value of such data integration, since previous meth-
ods relying exclusively on kinase binding motifs suffered from a large number
of false positives.
Geneticists define a genetic interaction based on the phenotypes observed
when the genes are knocked out: if the knock-out of one gene “masks” the
phenotype of the other knock-out they are said to be linked [7, 9]. A proto-
typical example is the synthetic lethal interaction. In this case the knock-out
of any single gene has no or only very little effect on viability, whereas the
double knock-out of both genes creates a lethal phenotype. Such a synthetic
lethal phenotype can be explained by redundant functions of the two genes
e.g. in two independent pathways that can compensate for each other [35].
Hence, the fact that two genes create a synthetic lethal phenotype indicates
that they participate in distinct pathways.
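The definition of a synthetic lethal interaction can be expressed as a simple test on knock-out fitness values. This is only an illustrative sketch: the function name `classify_pair`, the fitness scale (1 = wild type, 0 = dead), and the two cut-offs are assumptions, not values from the literature cited here.

```python
def classify_pair(fitness_a, fitness_b, fitness_ab, viable=0.8, lethal=0.1):
    """Label a gene pair from single and double knock-out fitness.

    fitness_a, fitness_b: fitness of the two single knock-outs.
    fitness_ab: fitness of the double knock-out.
    The cut-offs `viable` and `lethal` are hypothetical.
    """
    singles_viable = fitness_a >= viable and fitness_b >= viable
    if singles_viable and fitness_ab <= lethal:
        return "synthetic lethal"
    return "no synthetic lethal interaction"


print(classify_pair(0.95, 0.9, 0.0))  # synthetic lethal
print(classify_pair(0.95, 0.3, 0.0))  # no synthetic lethal interaction
```

In the second call one single knock-out is already strongly impaired, so the lethal double knock-out does not indicate compensation between two pathways.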
Underlying the functional or genetic relationships are sequences of physical
or biochemical interactions “connecting” the two genes. Thus, genetic interac-
tions provide important functional information that can be used for inferring
molecular pathways [7].
Other network types. Many other types of molecular interactions have
systematically been studied using network approaches. For example, protein-
DNA interactions are important for understanding transcriptional regulation,
and they have been studied on a large scale for almost a decade [58, 77]. On
the other hand, protein-RNA networks are substantially less researched, al-
though the relevance of alternative splicing is immediately apparent and it
is known that translation is also heavily regulated via RNA binding proteins
[5]. Yet another prominent example is the use of logical networks for describ-
ing transcriptional regulatory cascades [19, 57]. These networks are similar
to the above-mentioned transcription factor-DNA networks; however, such
logical networks may not always explicitly model the molecular mechanism
underlying the regulatory relationship.
In recent years several new technologies have been developed for measur-
ing all kinds of physical and genetic interactions on a large scale (Fig. 2,
Appendix 1). For example, protein-protein interactions can be measured
with yeast two-hybrid (Y2H) or tandem affinity purification coupled with
mass-spectrometry-based protein identification (TAP-MS). Protein-DNA
interactions can be measured with chromatin immunoprecipitation, after which
DNA microarrays (Appendix 1) can be used to identify the bound DNA fragments
(ChIP-Chip). Likewise, techniques for the large-scale measurement of
gene-gene interactions have been developed. These are just some examples to
[Figure fragment. Protein-protein interactions: yeast two-hybrid (Y2H), TAP-MS, known interacting protein domains.]
demonstrate the fact that today a wide range of interactions can be measured
practically at a genomic scale. However, all of these methods are subject to
considerable noise, and often results from different techniques only agree to
a small extent [73]. Therefore, numerous bioinformatic approaches are under
development for physical and genetic network quality assessment, integration,
assembly, and annotation.
Although all large-scale studies are subject to noise, the rationale for data
integration is that observations of true interactions will reinforce or comple-
ment one another when combined across different studies and/or experimental
techniques. For example, the independent observation of a protein-protein in-
teraction by both Y2H and TAP-MS methods, or by two independent TAP-MS
studies, renders this interaction more likely to be true [73].
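One simple way to make this reinforcement quantitative is a noisy-OR combination, which treats each observation as independent evidence. This is an assumption chosen for illustration; the chapter does not specify an integration scheme, and the per-method confidence values below are hypothetical.

```python
def combined_confidence(probabilities):
    """Noisy-OR: probability that at least one independent observation is true."""
    p_all_false = 1.0
    for p in probabilities:
        p_all_false *= 1.0 - p
    return 1.0 - p_all_false


# Hypothetical confidences: 0.4 for a Y2H observation, 0.6 for a TAP-MS one.
y2h_only = combined_confidence([0.4])
both = combined_confidence([0.4, 0.6])
print(round(y2h_only, 2), round(both, 2))  # 0.4 0.76
```

An interaction seen by both methods receives a higher score than one seen by either method alone, which is the behaviour described in the text. The independence assumption is strong and would need checking for real data sets.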
Network-Based Models in Molecular Biology 41
there is no need for a priori assumptions about which genes/proteins are likely
to respond during the experiment.
Kiesel et al. [36] used microarray time course data to study the tran-
scriptional change during osteoclastogenesis (i.e. during the transition from
precursor cells to mature osteoclasts). The differentiation of precursor cells
into mature osteoclasts involves dramatic changes of the transcriptional pro-
gram, thereby affecting the topology of the interaction network at the protein
level. The authors identified co-expression networks associated with early and
late response to the differentiation stimulus. A co-expression network is a
graph linking two genes if the two genes are similarly expressed either during
the specific experiment or across a range of different conditions. For the Kiesel
study it was necessary to create two distinct networks to fully capture the
complexity of the dynamical changes. One network described the early, and
the second one the late response during differentiation. Accordingly, the two
networks contained different pathways that are known to be associated with
osteoclastogenesis. These findings emphasize the importance of considering
the dynamics of transcriptional changes—often one may lose important de-
tails when looking at only two time points (before and after treatment, before
and after differentiation, etc.).
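A co-expression network in the sense defined above can be built by thresholding a similarity measure between expression profiles. The sketch below uses the Pearson correlation with a cut-off of 0.9; both choices, as well as the gene names and expression values, are assumptions for illustration and are not taken from the Kiesel study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(profiles, threshold=0.9):
    """Link two genes if their profiles correlate at or above the threshold."""
    return [(g, h) for g, h in combinations(sorted(profiles), 2)
            if pearson(profiles[g], profiles[h]) >= threshold]


# Hypothetical expression values at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
edges = coexpression_edges(profiles)
print(edges)  # [('geneA', 'geneB')]
```

Separate early and late networks, as in the study described above, could be obtained by applying the same function to the early and the late time points separately.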
Whereas analyzing microarray time course data can in itself reveal impor-
tant insights into the dynamics of transcriptional networks, combining those
data with other interaction data is significantly more powerful. Expression
data can be combined with transcription factor binding data, adding the di-
mension of protein-DNA interactions [23]. Thereby it becomes possible to
infer the molecular mechanisms by which transcriptional networks change
their state. For example, Ramsey et al. [57] combined time course expression
data with transcription factor binding data to assess the regulatory program
responding to macrophage activation. Putative regulatory relationships were
identified by employing a novel method for identifying time-lagged correla-
tion between transcription factors and potential target genes. Those interac-
tions were later corroborated by additionally taking the binding affinity of
transcription factors in upstream regulatory regions into account. Subsequent
experiments confirmed that this combined analysis of expression and binding
data significantly improved the quality of the inferred regulatory network. In
a similar approach, Ernst et al. [19] analyzed yeast TF-DNA binding data
in combination with the corresponding expression data under various stress
conditions. They identified bifurcation points in the time course expression
data indicating regulatory events. Along with the TF binding data they were
able to identify TFs that were likely regulators of those bifurcations, i.e. they
were regulating a specific subset of the genes.
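The idea of a time-lagged correlation can be illustrated with a generic sketch: the target profile is shifted against the transcription factor profile, and the lag with the highest Pearson correlation is reported. This is not the method of Ramsey et al., whose details are not given here; the function names and the two profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_lag(tf, target, max_lag):
    """Correlate tf[t] with target[t + lag] for lag = 0..max_lag."""
    scores = {}
    for lag in range(max_lag + 1):
        scores[lag] = pearson(tf[:len(tf) - lag], target[lag:])
    return max(scores, key=scores.get), scores


# Hypothetical profiles: the target repeats the TF pulse two time points later.
tf = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
lag, scores = best_lag(tf, target, max_lag=3)
print(lag)  # 2
```

With real data, the number of time points shrinks as the lag grows, so correlations at large lags rest on few samples and should be interpreted with care.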
Alternatively, time course data of expression changes can be combined
with physical protein interaction networks in order to identify pathways or
pathway components that are differentially expressed [11, 30]. Ideker et al.
[31] combined physical interaction networks with expression data and devised
a method based on simulated annealing for identifying relevant subnetworks
(modules) of the physical network. In this case, the network is not a co-
expression network, but a network of proteins binding to other proteins or
to DNA. Expression changes are mapped onto the physical network, i.e. they
become the nodes’ attributes. The algorithm’s task is to identify the most
significant subnetwork enriched for differentially expressed genes. The result
will depend on the topology of the physical network and the strength (extent)
of differential regulation of the individual genes. Several variants of this idea
have been published since then [11, 13, 56, 61]. Most important however is the
central idea: combining dynamic expression data with independently derived
interaction networks significantly improves the statistical power of the analysis
and provides much more insight into the underlying molecular mechanisms [7].
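The following sketch illustrates the general idea of searching a physical network for a high-scoring connected subnetwork by simulated annealing. It is written in the spirit of the approach of Ideker et al. [31] but is not their implementation: the toy network, the z-scores, the cooling schedule, and all function names are assumptions, and the score is a simple aggregate z-score without any calibration against random node sets.

```python
import math
import random

def components(active, adj):
    """Connected components of the subgraph induced by the active nodes."""
    seen, comps = set(), []
    for start in sorted(active):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(v for v in adj[u] if v in active and v not in comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def aggregate_score(comp, z):
    """Sum of node z-scores divided by the square root of the module size."""
    return sum(z[n] for n in comp) / math.sqrt(len(comp))

def best_component(active, adj, z):
    comps = components(active, adj)
    if not comps:
        return frozenset(), float("-inf")
    top = max(comps, key=lambda c: aggregate_score(c, z))
    return top, aggregate_score(top, z)

def anneal(adj, z, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    nodes = sorted(adj)
    active = set(nodes)                       # start with every node switched on
    comp, score = best_component(active, adj, z)
    best_comp, best_score = comp, score
    for i in range(steps):
        temp = t_start * (t_end / t_start) ** (i / steps)   # geometric cooling
        node = rng.choice(nodes)
        active ^= {node}                      # toggle one node on or off
        new_comp, new_score = best_component(active, adj, z)
        delta = new_score - score
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            comp, score = new_comp, new_score
            if score > best_score:
                best_comp, best_score = comp, score
        else:
            active ^= {node}                  # reject the move
    return best_comp, best_score

# Hypothetical chain-shaped interaction network with per-gene z-scores.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
       "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"}}
z = {"A": 2.5, "B": 3.0, "C": 2.0, "D": -1.0, "E": 0.2, "F": 2.8}
module, module_score = anneal(adj, z)
print(sorted(module), round(module_score, 2))
```

Because the search is stochastic, the returned module depends on the seed and the cooling schedule; only the best state visited is reported.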
It is well established that protein concentrations are not only regulated
at the transcriptional level, but also at the level of translation and protein
turnover [5, 10]. These posttranscriptional processes affect the topology of
interaction networks just as much as transcriptional changes. For example,
Brockmann et al. [10] have shown that proteins responding early in signaling
cascades and transcription factors in particular are subject to “translation
on demand.” The coding mRNA of such proteins is constitutively expressed,
but translation is blocked until the protein is actually needed. This allows
for a much faster response compared to transcriptional regulation. Hence,
the presence/absence of these network components is highly dependent on
posttranscriptional processes, which are often neglected in studies assessing
the dynamics of protein expression. It is a very important finding that the
regulatory network components themselves are often not regulated at the
transcriptional level and are therefore missed by studies only applying DNA
microarrays for measuring transcriptional changes [8, 74].
The main bottleneck for studying posttranscriptional network changes
more systematically is the experimental difficulty of protein detection and
quantification. Current state-of-the-art techniques employ mass spectrometry
for identifying, characterizing, and quantifying proteins [1]. The current lim-
its of this technology are high costs, relatively complicated protocols, data
processing, limited number of detectable proteins (in the range of a few hun-
dred), and limited reproducibility [14, 48]. Recently, significant progress has
been made because of much more sensitive instruments, improved protocols,
and better data analysis tools [14, 43, 48]. This progress suggests that
in the near future posttranscriptional network dynamics can be studied
at a level of detail and scope comparable to that of mRNA changes [14, 22].
The previous section focused on the dynamic adaptation of the network topol-
ogy, e.g. the presence or absence of network components. Here, we will ad-
dress state changes of the nodes themselves, i.e. alterations or activities of
biomolecules in response to external or internal stimuli.
Network-Based Models in Molecular Biology 45
edge (interaction) is indicative of the importance of the interaction for the
regulation of the target gene. By applying their method to yeast eQTL data
the authors could infer several known and new regulatory relationships, and
they were able to predict the directionality of information flow for hundreds
of protein-protein interactions.
The method developed by Suthram et al. enables the inference of causal
relationships, but it does not lend any insight into the effect of interactions.
For example, the currents do not predict whether the regulator increases or
represses the activity of the target. Workman et al. [74] went further in this
respect. Using ChIP-Chip data and knock-out expression measurements under
DNA damaging conditions they could infer a causal network for DNA damage
response. They used the method of factor graphs, which is a generalization of
Bayesian networks. Factor graphs are minimal graph models explaining the
observed (expression) data. Importantly, the method predicts whether any
given interaction is activating or repressing. The downside is that this method
requires significantly more comprehensive data than simpler approaches.
These logical network models cannot fully capture the kinetics of signaling.
However, at least some of them predict state changes in response to different
inputs, they provide insights into the sequence of events, and they allow for
analyzing the stability of the regulatory system and for finding “weak spots.”
A weak spot is a gene in the signaling network whose knock-out would max-
imally alter the output. Those genes could be interesting drug targets, e.g.
when looking for new targets in pathogens or when attacking tumor cells.
Also, those weak genes could be causal for diseases, for example if they are
mutated in patients carrying a certain inheritable disease.
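The notion of a weak spot can be made concrete with a small logical (Boolean) model. The network below is entirely hypothetical: two input signals s1 and s2 activate genes a and b, either of which activates c, which in turn drives the output. The sketch knocks out each gene in turn and counts the input conditions under which the steady-state output changes.

```python
from itertools import product

# Hypothetical update rules of a feed-forward Boolean signaling network.
RULES = {
    "a": lambda s: s["s1"],
    "b": lambda s: s["s2"],
    "c": lambda s: s["a"] or s["b"],
    "out": lambda s: s["c"],
}

def steady_output(inputs, knocked_out=None, max_steps=20):
    """Synchronously update all genes until the state stops changing."""
    state = {"a": False, "b": False, "c": False, "out": False, **inputs}
    for _ in range(max_steps):
        new = dict(state)
        for gene, rule in RULES.items():
            # A knocked-out gene is held inactive regardless of its rule.
            new[gene] = False if gene == knocked_out else bool(rule(state))
        if new == state:
            break
        state = new
    return state["out"]

INPUTS = [dict(s1=bool(x), s2=bool(y)) for x, y in product([0, 1], repeat=2)]

def knockout_impact(gene):
    """Number of input conditions whose output differs from the wild type."""
    return sum(steady_output(i) != steady_output(i, knocked_out=gene)
               for i in INPUTS)

impacts = {g: knockout_impact(g) for g in ("a", "b", "c")}
weak_spot = max(impacts, key=impacts.get)
print(impacts, weak_spot)
```

In this toy network a and b are redundant, so only the knock-out of c changes the output under every activating input, which makes c the weak spot.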
Metabolic networks are another important application of dynamic network
modeling. They too are highly dynamic, and fully capturing their kinetics
would allow for developing new drugs and for optimizing yields in biochemical
reactors. However, modelers face problems similar to those in regulatory
networks: although the kinetic properties of enzymes have been measured for
decades, we are still far from completely covering all relevant enzymes in any
multicellular eukaryote [49]. In addition, enzymes may behave completely
differently in vitro than in vivo, where pH, temperature,
and many other important parameters may differ [68]. Hence, complete
dynamic modeling using differential equations is possible only for a relatively
small set of well-studied subsystems. Fortunately, methods have been devel-
oped that do not require kinetic constants for the analysis of metabolic net-
works. For instance, Petri nets have also been used for analyzing metabolic
networks [12]. One of the most mature methods is flux balance analysis (FBA)
[34, 52]. FBA simulates a metabolic network assuming steady state (input
balancing output), which greatly simplifies the data requirements. For example,
an elementary mode is a minimal set of reactions necessary to produce
a given product at steady state [66] (Fig. 3). Elementary modes can
be deduced from the stoichiometric matrix alone. Hence, one only has to know
the possible reactions in the system along with their educts and products to
predict all possible chemical fluxes that do not lead to the accumulation of
intermediate metabolites under the steady state assumption. Depending on the
metabolic network there might be many elementary modes leading from certain
substrate(s) to specific product(s). Such a network would be redundant. However,
even if there are many elementary modes, all of them might require one specific
enzyme, which would then be essential for synthesizing the respective
product (e.g. the enzyme catalyzing reaction R1 in Fig. 3). The concept of
elementary modes has been used to make a range of important predictions:
for example, Stelling et al. [66] were able to predict lethal genes in Escherichia
coli by searching for enzymes whose knock-out would remove all possible
elementary modes leading to essential products. Klamt and Gilles [37] extended
this idea to the concept of minimal cut sets: whereas Stelling and co-authors
were looking for single genes whose knock-out would be detrimental to the
organism, a minimal cut set is a minimal set of genes whose joint knock-out
turns off the synthesis of a given product (Fig. 3). This analysis could be
instrumental for developing combinatorial antibiotics targeting different
enzymes in bacteria such that their synergistic interaction would be lethal to
the pathogens.
Fig. 3. Elementary flux modes. (a) A simple metabolic network consuming
substrate S1 and producing products P1 and P2 via the reactions R1 through R5.
(b–d) Elementary modes (highlighted) are minimal sets of reactions creating the
products P1 (b, c) or P2 (d). Note that removing R1 affects all elementary modes,
i.e. synthesis of all products. Removal of R4 disables synthesis of P1 only.
{R1}, {R4}, and {R2, R3} are minimal cut sets with respect to P1.
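The relationship between the stoichiometric matrix, elementary modes, essential reactions, and minimal cut sets can be checked on a toy network. The topology below is an assumption chosen to be consistent with the caption of Fig. 3 (R1: S1 → M1, R2 and R3: M1 → M2, R4: M2 → P1, R5: M1 → P2, all irreversible); it is not taken from the figure itself. The brute-force enumeration is only feasible for very small networks and is not the algorithm used in [37, 66].

```python
from fractions import Fraction
from itertools import combinations

REACTIONS = ["R1", "R2", "R3", "R4", "R5"]
# Stoichiometric matrix: rows are the internal metabolites, columns R1..R5.
N = [
    [1, -1, -1,  0, -1],   # M1
    [0,  1,  1, -1,  0],   # M2
]

def null_space(rows, cols):
    """Null-space basis of N restricted to the given columns (exact arithmetic)."""
    m = [[Fraction(row[c]) for c in cols] for row in rows]
    pivots, r = [], 0
    for c in range(len(cols)):
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        m[r] = [x / m[r][c] for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                m[i] = [a - m[i][c] * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    basis = []
    for f in (c for c in range(len(cols)) if c not in pivots):
        v = [Fraction(0)] * len(cols)
        v[f] = Fraction(1)
        for i, pc in enumerate(pivots):
            v[pc] = -m[i][f]
        basis.append(v)
    return basis

def elementary_modes():
    """A reaction subset is a mode if its steady-state flux is unique up to
    scaling and strictly positive in every reaction of the subset."""
    modes = []
    for size in range(1, len(REACTIONS) + 1):
        for idx in combinations(range(len(REACTIONS)), size):
            basis = null_space(N, idx)
            if len(basis) == 1 and all(x > 0 for x in basis[0]):
                modes.append(frozenset(REACTIONS[i] for i in idx))
    return modes

def minimal_cut_sets(modes, target):
    """Smallest reaction sets that hit every mode using the target reaction."""
    target_modes = [m for m in modes if target in m]
    cuts = []
    for size in range(1, len(REACTIONS) + 1):
        for combo in combinations(REACTIONS, size):
            c = frozenset(combo)
            if any(prev <= c for prev in cuts):
                continue                      # a smaller cut set is contained
            if all(c & m for m in target_modes):
                cuts.append(c)
    return cuts

modes = elementary_modes()
essential = frozenset.intersection(*modes)    # reactions used by every mode
cuts_p1 = minimal_cut_sets(modes, "R4")       # R4 is the reaction producing P1
print([sorted(m) for m in modes], sorted(essential), [sorted(c) for c in cuts_p1])
```

Under these assumptions the enumeration yields two modes for P1 and one for P2, R1 as the only reaction shared by all modes, and the three minimal cut sets named in the caption.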
In summary, the “reduced modeling approaches” that are currently pop-
ular do not strictly simulate the dynamics on (or of) the networks, but they
simulate dynamic networks in a way that still leads to important conclu-
sions. In most cases it would be impossible to derive those insights without
Acknowledgments
rial. Bone is constantly degraded and newly formed by these two types of
cells. An excess of osteoclasts leads to osteoporosis (brittle bones).
Phenotype. The expression of a genotype. Individuals may have different
physiological or molecular characteristics based on their genotype. For
example, eye and hair color are phenotypes determined by the respective
gene variants (genotype). A phenotype is generally determined by both
environmental and genetic factors. Biologists often refer to “the phenotype
of a gene” as the physiological change in response to knocking out the
respective gene.
Phylogenetic profile. Describes the occurrence pattern of a gene across
different species. Two genes that occur in the same set of species are said to
have similar phylogenetic profiles.
Simulated annealing is an optimization technique for finding global max-
ima (or minima) in complex fitness landscapes with many local optima.
Simulated annealing starts searching for an optimum from some (random)
parameter configuration. After a number of iterations the current param-
eters are randomized to some extent in order to overcome boundaries be-
tween local maxima/minima (“heating” of parameters). This procedure
is repeated until convergence, while reducing the level of parameter ran-
domization each time (“annealing”).
Stoichiometry describes the type and number of molecules consumed and
the type and number of molecules produced by a chemical reaction.
Substrate. A molecule chemically changed/consumed by an (enzymatic) re-
action. For example, proteins that are phosphorylated by kinases are called
substrates of the kinases.
Transcription. The process of copying a gene’s sequence into RNA. Poly-
merases are protein machines “reading” the sequence of a gene and pro-
ducing the complementary RNA.
Transcription factor (TF). A regulatory protein controlling the transcrip-
tion of genes. TFs bind directly or indirectly (bridged via other proteins)
to DNA and change the 3D structure of DNA, attract or block transcrip-
tional machinery at the site, or alter other proteins in the vicinity (e.g.
histones) to manipulate the transcription rate of the target gene.
Translation. The process of synthesizing a protein from the respective mes-
senger RNA (mRNA). Ribosomes are molecular machines (consisting of
RNA and proteins) reading an mRNA sequence and translating it into
the corresponding amino acid sequence.
References
1. Aebersold R, Mann M. (2003) Mass spectrometry-based proteomics. Nature.
422(6928):198–207.
2. Albert R, Othmer HG. (2003) The topology of the regulatory interactions predicts
the expression pattern of the segment polarity genes in Drosophila melanogaster.
J Theor Biol. 223(1):1–18.
24. Gavin et al. (2006) Proteome survey reveals modularity of the yeast cell machinery.
Nature. 440(7084):631–6.
25. Gilbert D, Fuss H, Gu X, Orton R, Robinson S, Vyshemirsky V, Kurth MJ,
Downes CS, Dubitzky W. (2006) Computational methodologies for modelling,
analysis and simulation of signalling networks. Brief Bioinform. 7(4):339–53.
26. Goss PJ, Peccoud J. (1998) Quantitative modeling of stochastic systems in
molecular biology by using stochastic Petri nets. Proc Natl Acad Sci USA. 95(12):6750–5.
27. Han JD. (2008) Understanding biological functions through molecular networks.
Cell Res. 18(2):224–37.
28. Heinrich R, Schuster S. (1998) The modelling of metabolic systems. Structure,
control and optimality. Biosystems. 47(1–2):61–77.
29. Helikar T, Konvalina J, Heidel J, Rogers JA. (2008) Emergent decision-making in
biological signal transduction networks. Proc Natl Acad Sci USA. 105(6):1913–8.
30. Ideker T et al. (2001) Integrated genomic and proteomic analyses of a systemati-
cally perturbed metabolic network. Science. 292:929–934.
31. Ideker T, Ozier O, Schwikowski B, Siegel AF. (2002) Discovering regulatory
and signalling circuits in molecular interaction networks. Bioinformatics. 18
Suppl 1:S233–40.
32. International Human Genome Sequencing Consortium (2004). Finishing the
euchromatic sequence of the human genome. Nature. 431:931–945.
33. Jansen RC. (2003) Studying complex biological systems using multifactorial per-
turbation. Nature Rev Genet. 4:145–151.
34. Joyce AR, Palsson BO. (2008) Predicting gene essentiality using genome-scale in
silico models. Methods Mol Biol. 416:433–57.
35. Kelley R, Ideker T. (2005) Systematic interpretation of genetic interactions using
protein networks. Nature Biotechnol. 23:561–566.
36. Kiesel J, Miller C, Abu-Amer Y, Aurora R. (2007) Systems level analysis of os-
teoclastogenesis reveals intrinsic and extrinsic regulatory interactions. Dev Dyn.
236(8):2181–97.
37. Klamt S, Gilles ED. (2004) Minimal cut sets in biochemical reaction networks.
Bioinformatics. 20(2):226–34.
38. Klipp E, Nordlander B, Kruger R, Gennemark P, Hohmann S. (2005) Integrative
model of the response of yeast to osmotic shock. Nature Biotechnol. 23:975–982.
39. Klipp E. (2007) Modelling dynamic processes in yeast. Yeast. 24(11):943–59.
40. Krogan et al. (2006) Global landscape of protein complexes in the yeast Saccha-
romyces cerevisiae. Nature. 440(7084):637–43.
41. Krüger M, Kratchmarova I, Blagoev B, Tseng YH, Kahn CR, Mann M. (2008)
Dissection of the insulin signaling pathway via quantitative phosphoproteomics.
Proc Natl Acad Sci USA. 105(7):2451–6.
42. Lage K et al. (2007) A human phenome-interactome network of protein complexes
implicated in genetic disorders. Nature Biotechnol. 25:309–316.
43. Lange V et al. (2008) Targeted quantitative analysis of Streptococcus pyogenes
virulence factors by multiple reaction monitoring. Mol Cell Proteomics. [Epub
ahead of print]
44. Lähdesmäki H, Rust AG, Shmulevich I. (2008) Probabilistic inference of
transcription factor binding from multiple data sources. PLoS ONE. 3(3):e1820.
45. Lee I, Date SV, Adai AT, Marcotte EM. (2004) A probabilistic functional network
of yeast genes. Science. 306:1555–1558.
46. Legrain P, Wojcik J, Gauthier JM. (2001) Protein–protein interaction maps: a
lead towards cellular functions. Trends Genet. 17:346–352.
66. Stelling J, Klamt S, Bettenbrock K, Schuster S, Gilles ED. (2002) Metabolic net-
work structure determines key aspects of functionality and regulation. Nature.
420(6912):190–3.
67. Stelzl U et al. (2005) A human protein–protein interaction network: a resource for
annotating the proteome. Cell. 122:957–968.
68. Stryer L. (1995) Biochemistry. Freeman & Co, New York.
69. Stuart JM, Segal E, Koller D, Kim SK. (2003) A gene coexpression network for
global discovery of conserved genetic modules. Science. 302:249–255.
70. Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T. (2006) A direct comparison
of protein interaction confidence assignment schemes. BMC Bioinformatics. 7:360.
71. Suthram S, Beyer A, Karp RM, Eldar Y, Ideker T. (2008) eQED: an efficient
method for interpreting eQTL associations using protein networks. Molec Syst
Biol. 4:162.
72. Tyson JJ. (1991) Modeling the cell division cycle: cdc2 and cyclin interactions.
Proc Natl Acad Sci USA. 88(16):7328–32.
73. von Mering C et al. (2002) Comparative assessment of large-scale data sets of
protein–protein interactions. Nature. 417:399–403.
74. Workman CT et al. (2006) A systems approach to mapping DNA damage response
pathways. Science. 312:1054–1059.
75. Yan W, Hwang D, Aebersold R. (2008) Quantitative proteomic analysis to profile
dynamic changes in the spatial distribution of cellular proteins. Methods Mol Biol.
432:389–401.
76. Yeang CH, Mak HC, McCuine S, Workman C, Jaakkola T, Ideker T. (2005)
Validation and refinement of gene-regulatory pathways on a network of physical
interactions. Genome Biol. 6(7):R62.
77. Zhu J, Zhang MQ. (1999) SCPD: a promoter database of the yeast Saccharomyces
cerevisiae. Bioinformatics. 15:607–611.
Ecological Networks: Structure, Interaction
Strength, and Stability
1 Introduction
Food webs, assemblages of species linked through various interactions, are
the fundamental building blocks of any ecosystem and a central concept in
ecology. The study of a food web allows abstractions of the complexity
and interconnectedness of natural communities that transcend the specific
details of the underlying systems. For example, Fig. 1 shows a typical food web,
where the species are connected through their feeding relationships. The top
predator, Heliaster (starfish), feeds on many gastropods such as Hexaplex, Morula,
and Cantharus, some of which prey on each other [52]. Interactions
between species in a food web can be of many types, such as predation,
competition, mutualism, commensalism, and amensalism (see Section 1.1, Fig. 2).
Mathematical ecologists have used dynamic models to explore how the
size and connectivity of food webs determine the stability and long-term per-
sistence of a community under fluctuations in density [41], invasion of new
species [11], or nonlinear population dynamics [24]. There are two different
approaches for modeling a food web: the static model and dynamic model.
Static models describe the food web by a graph whose vertices are species
and whose links are the interactions/relations between them. These models
are primarily concerned with the robustness of the food web structure against
modifications (i.e., removal and addition) of vertices and links. Based on the
hierarchical position of the species in a food web, there exist two types of
static models: the cascade model and the niche model. The dynamic models,
on the other hand, account for the stability of food webs, and are represented
by coupled ordinary differential equations, where different functional forms
describe the type of interactions between the species. However, neither the
static nor the dynamic models are useful for making long-term predictions
of the changes in structural organization of food webs due to extinction or
invasion of new species in the community. Other models of food webs — the
assembly model and evolutionary model — mainly focus on this aspect. One
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_4,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
58 S. Bhattacharyya and S. Sinha
Fig. 1. Feeding relationship between a predator Heliaster and marine snails in the
northern Gulf of California (adapted from [52]).
Fig. 2. Types of ecological interactions: (i) predation, (ii) competition, (iii) mutualism,
(iv) commensalism, (v) amensalism.
Ecological networks (i.e., food chains and food webs) have been mathemati-
cally described by different types of models, which focus on different aspects
of their structure. We give simple descriptions of a few as follows:
Cascade model. These models are built on the hierarchical positions of the
species in the food chain and use the role of top-down forces (predator to
prey) in shaping the ecological communities. In cascade models, high-ranked
species prey on lower-ranked species in the food chain, and the probability of
consumption depends on the number of connections with these other species.
It has been shown that the mean number of links in the cascade model
grows linearly with the number of species present in the food chain [13].
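A minimal sketch of a cascade-type food web follows. It assumes the common formulation in which species are ranked and each species feeds on every lower-ranked species independently with probability d/S, where S is the number of species and d a density parameter; this specific link probability, the function name, and the parameter values are assumptions and are not spelled out in the text above.

```python
import random

def cascade_web(n_species, density, seed=None):
    """Return a set of (predator, prey) links; predators always outrank prey."""
    rng = random.Random(seed)
    p = density / n_species            # assumed link probability d/S
    links = set()
    for predator in range(n_species):
        for prey in range(predator):   # only lower-ranked species can be eaten
            if rng.random() < p:
                links.add((predator, prey))
    return links

web = cascade_web(n_species=30, density=4.0, seed=1)
expected_links = 4.0 * (30 - 1) / 2    # d(S - 1)/2, linear in S for fixed d
print(len(web), expected_links)
```

Because links only run from higher to lower rank, the generated web contains neither loops nor cannibalism, which is the rigidity that the niche model relaxes.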
Niche model. Like the cascade models, the niche models are also structural
in nature. But unlike the cascade model, the rigid hierarchical effect is relaxed
in the niche model by allowing looping and cannibalism. The “niche model”
takes its name from the premise that each trophic species (i.e., a group of
species that share the same predators and prey) belongs to a specific niche
based on what it eats and, in turn, on what eats it. Recent work [74] has
shown that many diverse real food webs can be better described by the niche
model than by the cascade model, in particular with respect to features such
as cycles and species similarities.
Assembly model and evolutionary model. In this class of models, as
the names suggest, the species composition and structure of the food web
can change with time due to the ongoing introduction of new species and
by species extinctions. As a consequence, these models focus mainly on the
features of the food web after a sufficiently long time, when the size of the
food web and its other properties stabilize. The primary concern of these
models relates to the stability of the food web due to the introduction of
new species into the system by immigration or speciation (assembly) or by
altering one or a few individuals of existing species in the system (evolution-
ary), though the time scales for evolutionary models are longer than those
for assembly ones [60]. A few species from a “pool” are added one by one to
the existing network. The new species may add stability to the web, become
extinct immediately after introduction because of its poor adaptation, or may
cause one or more other species to become extinct due to competition for
resources. Studies of these models usually measure various
properties of the underlying network and compare them with data from real
food webs to determine the robustness in a given time period [9].
Keystone species. This concept provides an important representation for
understanding the organizing forces in ecological communities. A keystone
species is one whose presence contributes critically to the diversity of life in
the food web, and whose removal has a strong adverse impact on the com-
munity structure, even though the species may occupy only a small part of
the ecosystem in terms of biomass or productivity. Keystone species play an
important role in conservation biology [53, 55].
Community matrix. In an ecological network, which describes interactions
between multiple species at different trophic levels, the community matrix
represents the per capita direct effect of one species on other species in the
community [35]. In simple words, the community matrix is a spreadsheet in which
the rows and columns are species and other elements of their environment,
and each entry quantifies the interaction between the corresponding pair.
The matrix can be used to derive the stability and sensitivity to change
of ecological networks. Alternative but related definitions are common in
the ecological literature [41, 78].
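How stability is read off a community matrix can be sketched for the two-species case, where the eigenvalues have a closed form. The sketch assumes the usual local-stability criterion (all eigenvalues have negative real parts); the matrix entries are hypothetical and the function names are illustrative.

```python
import cmath

def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a11, a12), (a21, a22) = a
    tr = a11 + a22
    det = a11 * a22 - a12 * a21
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def is_locally_stable(a):
    """Stable if every eigenvalue has a negative real part."""
    return all(ev.real < 0 for ev in eigenvalues_2x2(a))

# Rows and columns: prey, predator. Entry (i, j) is the effect of species j on i.
self_limited = [[-0.5, -1.0],
                [0.4, 0.0]]     # prey limits itself, predator harms prey (-), gains from it (+)
self_enhancing = [[0.2, -1.0],
                  [0.4, 0.0]]   # same signs of interaction, but prey growth is self-enhancing

print(is_locally_stable(self_limited), is_locally_stable(self_enhancing))
```

For larger communities the same criterion applies, but the eigenvalues have to be computed numerically.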
Interaction strength is a critical descriptor of the magnitude of effect of
one species on the other in an ecological network. There are several interpre-
tations with the common theme of being a measure of the effect of one species
on another or on all others in the network [38, 40]. In this work, we have used
“interaction strength” to represent the per capita direct effects of a species
on another (e.g., the per capita rate of consumption of the prey species by
the predator, “α” in the first equation of Model I in Section 2, which can also
be modulated by prey preference).
The complexity of a large ecological network or a whole food web can be de-
scribed by many indicators: number of trophic levels, number of species and
their connections, density of interactions, etc. Each trophic module embedded
in the entire food web may also have a similar complexity in its structure and
interactions. There are two mutually nonexclusive aspects that underscore a
long-standing problem in food web theory: Do “the details of population dy-
namics in one or a few modules change the structure of the whole system
over time” [64]? The first aspect is structural, and involves the distributive
pattern of trophic structures or motifs which modulates the real food web
system [3, 10, 46]. The other aspect is related to the study of specific trophic
modules in order to infer properties of the entire ecosystem dynamics [4, 23].
The first one essentially refers to the robustness of the system against removal
of nodes in the functional integrity of the ecological network. In particular,
the study involves how removing or replacing native species with exotic in-
vaders can alter the food web structure, which could be measured by the
number of secondary extinctions, and by the breakup of the network into
smaller components. For instance, conceptualizing food webs as energy-flow
networks, Allesina and Bodini [2] have indicated that it is the “dominating
nodes” (removal of which would make a number of species disappear from the
network) that act as an energy bottleneck for resources flowing to the other
members of the food web. However, it has also been shown that removal of
species with low connectivity sometimes may have a large effect on the persis-
tence of a community, which reinforces the notion of keystone species in food
web theory [18, 31].
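Counting secondary extinctions after the removal of a species can be sketched with a purely topological rule: a consumer disappears once none of its prey species is left. This rule, the toy food web, and the function name are assumptions made for illustration; population dynamics are ignored.

```python
def secondary_extinctions(diet, removed):
    """diet maps each species to the set of species it eats (empty for basal species)."""
    basal = {s for s, prey in diet.items() if not prey}
    alive = set(diet) - {removed}
    changed = True
    while changed:
        changed = False
        for s in sorted(alive):
            # A non-basal species without any surviving prey goes extinct.
            if s not in basal and not (diet[s] & alive):
                alive.discard(s)
                changed = True
    return set(diet) - alive - {removed}

# Hypothetical food web.
diet = {
    "algae": set(), "grass": set(),
    "snail": {"algae"}, "rabbit": {"grass"},
    "crab": {"snail"}, "fox": {"rabbit", "snail"},
}
losses = {s: secondary_extinctions(diet, s) for s in diet}
print({s: sorted(v) for s, v in losses.items()})
```

In this toy web the removal of the basal species "algae" takes out a whole chain of consumers, whereas the removal of the well-connected "fox" causes no secondary extinction, illustrating that connectivity alone does not determine the impact of a removal.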
The second aspect that is particularly important for understanding real
ecological network dynamics is related to the distribution of interaction
strengths, which determines how strongly the links in the trophic cascade
are coupled. For example in a consumer-resource interaction, the strength is
a function of the metabolic rate, ingestion rate, and preference of the con-
sumer species for the resource species [43]. The theoretical finding that inter-
action strength is one of the key properties promoting persistence in nonlinear
models of food webs has attracted considerable attention [19, 43, 63, 67].
It is indeed challenging to ecologists to quantify the strengths of species
interactions, identify the patterns that occur between species, and determine
the mechanisms that cause interactions to vary across space and time in natu-
ral ecosystems. These topics are also important for several other reasons. First,
ecosystems, on the whole, provide hospitable conditions for life. Understanding
how the provision of this hospitability is affected by extinctions and alien intro-
ductions is important, because without knowledge of the strengths of species
interactions, our predictions on the consequences of environmental impacts
become indeterminate for any ecosystem with a reasonable degree of com-
plexity [6]. Second, development of a general understanding of how ecological
communities are structured can benefit from an analysis of the general prop-
erties of multispecies population dynamics models [41, 58]. Knowledge of the
pattern of interaction strengths in natural ecosystems can help to guide the
development of appropriate multispecies models.
A number of recent studies consider the influence of interaction strength
on the stability of food webs for real communities [6, 7, 49, 51] and in food
web models with a large number of species [12, 25, 28, 61]. Martinez et al. [39]
studied the effect of the variation of the interaction strength of omnivore links
on the stability of large food webs. General patterns of food web structure also
appear to be an emergent property of dynamical constraints on species inter-
actions [5, 17]. However, in a more realistic study, the food web dynamics has
been considered by introducing time-varying interactions among the adaptive
foragers [20, 21, 30, 56, 71, 72]. Adaptation modifies predator-prey interac-
tion strengths, and thus, acts on the topology of the network by eventually
removing certain links with zero strength. As a consequence, the complexity
of the food web is molded by how adaptive predators design their foraging
strategy. Recently, it has also been shown that adaptation of foraging behav-
ior and stability of food webs can lead to a rise in basal species richness and
link density, which in turn, increases the emergent complexity of the food web
[21, 30].
However, there still remains a significant gap in predictions on food web
stability and the role of the distribution of interaction strengths in the food
web. This key problem in ecology arises from the basic difference between
empiricists and theoreticians in understanding the concepts of stability and
interaction strength [6, 32]. Many theoretical investigations have focused on
the definition of stability in model communities [11, 25, 36, 41, 42, 44]. Multi-
ple definitions of stability have been proposed, some of them designed to have
closer ties to empirical data [14, 15, 33, 37]. This is important because most of
the earlier analytical studies evaluate linear stability of a community at equi-
librium in the face of small perturbations, whereas empirical investigations
focus on community changes (with no assumed equilibrium) in response to
comparatively large perturbations, such as species removals, species additions,
and physical disturbances [75]. The measurement of interaction strengths, on
the other hand, actually centers around different concepts in the system, and
the only consistent aspect is the use of the words “interaction strength”. Laska
and Wootton [32] have clearly mentioned the difference in understanding the
Model I. Figure 3(i) shows a prey-predator system where the predator species
is commensal on the prey species. In this simple model, in the absence of
Fig. 3. Food web configurations of (i) Model I and (ii) Model II.
(Y) can consume both the susceptible and the infected prey, it may have a
higher preference towards the uninfected prey (S), and a concomitant lower
one towards the infected prey (I). Thus, the earlier strong interaction α (in
Model I) is now split into two interactions of variable strength: one
strong interaction and one weak interaction (Fig. 3(ii)). The rate
of change of the virus population (dV/dt) depends on the infected prey
population (I), as every dead and lysed infected prey releases virus into the
environment, initiating a new infection cycle. The temporal evolution of this
ecological network is given as follows:
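\begin{align*}
\frac{dS}{dt} &= rS\left(1 - \frac{S}{K}\right) - \lambda S V - \frac{\xi \alpha S Y}{\gamma + (S + I)}, \\
\frac{dI}{dt} &= \lambda S V - \frac{(1 - \xi)\alpha I Y}{\gamma + (S + I)} - \eta I, \\
\frac{dY}{dt} &= Y\left[-d + \frac{\beta\alpha\,(\xi S + (1 - \xi) I)}{\gamma + (S + I)}\right], \\
\frac{dV}{dt} &= -\mu V + \kappa \eta I,
\end{align*}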
where V is the virus density and λ defines the strength of viral infection on the
susceptible class of prey, and represents the “effective per host contact rate
with viruses.” Parameter η denotes the death rate of the infected prey and μ is
the death rate of the virus. κ denotes the “virus replication parameter,” i.e.,
the number of virus particles produced per infected individual upon lysis. The other
parameter that regulates the interaction strength of predation, indicating the
prey preference of the predator, is ξ (ξ ∈ (0, 1)). The exact choice of these
parameter values is arbitrary, but they are kept within the same range as in
[8, 22].
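A numerical sketch of Model II is given here, with the right-hand sides written as the four equations are read from the text. The values of λ, η, μ, and κ are those quoted for Model II in Section 3; the values of r, K, α, γ, β, d, and ξ, the initial state, and the step size are hypothetical placeholders, and the fixed-step Runge-Kutta integrator is a generic choice, not necessarily the scheme used by the authors.

```python
import math

PARAMS = dict(
    r=1.0, K=2.0, alpha=6.0, gamma=1.0, beta=0.5, d=0.3, xi=0.9,  # hypothetical placeholders
    lam=0.002, eta=0.7, mu=0.05, kappa=13.0,                      # values quoted in the text
)

def model2_rhs(state, p):
    """Time derivatives of susceptible prey S, infected prey I, predator Y, virus V."""
    S, I, Y, V = state
    sat = p["gamma"] + S + I                      # shared saturation term
    dS = (p["r"] * S * (1 - S / p["K"]) - p["lam"] * S * V
          - p["xi"] * p["alpha"] * S * Y / sat)
    dI = (p["lam"] * S * V - (1 - p["xi"]) * p["alpha"] * I * Y / sat
          - p["eta"] * I)
    dY = Y * (-p["d"] + p["beta"] * p["alpha"]
              * (p["xi"] * S + (1 - p["xi"]) * I) / sat)
    dV = -p["mu"] * V + p["kappa"] * p["eta"] * I
    return (dS, dI, dY, dV)

def rk4_step(f, state, dt, p):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, p)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), p)
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), p)
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)), p)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + e)
                 for s, a, b, c, e in zip(state, k1, k2, k3, k4))

state = (1.0, 0.1, 0.2, 1.0)                      # hypothetical initial densities
for _ in range(100):                              # integrate one time unit
    state = rk4_step(model2_rhs, state, 0.01, PARAMS)
print(tuple(round(x, 4) for x in state))
```

Sweeping the placeholder value of α (and of ξ) over a range and recording the long-term minima and maxima of each population is one way to produce bifurcation diagrams of the kind discussed in Section 3.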
3 Results
In Model I, the predation strength α is an important determinant of the
dynamics of the prey and predator populations. As seen in the bifurcation
diagrams of prey and predator populations (Fig. 4), both exhibit equilibrium
dynamics for α < 5.6, but the steady state loses its stability and bifurcates
to limit cycle oscillations with increasing amplitude for α > 5.6 (Fig. 4). We
now show the effect of modifying the structure of this simple
two-species network by adding the new link through the virus, which
not only separates the prey species into two compartments, but also modifies
the predation strength (Model II). For simulation of the Model II network,
the new parameter values are chosen as λ = 0.002, η = 0.7, μ = 0.05, and
κ = 13.
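The right-hand side of Model II can be written down directly from the four rate equations. The following is a minimal sketch, not the authors' simulation code: λ, η, μ and κ take the values quoted above, whereas r, K, γ, d, β, α and ξ are placeholder values chosen only so that the example runs (they are assumptions, not values from the text).

```python
from dataclasses import dataclass


@dataclass
class ModelIIParams:
    # Values quoted in the text for Model II.
    lam: float = 0.002    # lambda: strength of viral infection
    eta: float = 0.7      # death rate of infected prey
    mu: float = 0.05      # death rate of the virus
    kappa: float = 13.0   # virus replication parameter
    # Placeholder values (assumptions for illustration, not from the text).
    r: float = 1.0        # prey growth rate
    K: float = 10.0       # prey carrying capacity
    gamma: float = 1.0    # saturation constant of predation
    d: float = 0.5        # predator death rate
    beta: float = 0.2     # conversion efficiency
    alpha: float = 6.0    # predation strength
    xi: float = 0.5       # prey preference of the predator, 0 < xi < 1


def model_ii_rhs(state, p):
    """Return (dS/dt, dI/dt, dY/dt, dV/dt) for state = (S, I, Y, V)."""
    S, I, Y, V = state
    sat = p.gamma + (S + I)  # shared saturating denominator
    dS = p.r * S * (1 - S / p.K) - p.lam * S * V - p.xi * p.alpha * S * Y / sat
    dI = p.lam * S * V - (1 - p.xi) * p.alpha * I * Y / sat - p.eta * I
    dY = Y * (-p.d + p.beta * p.alpha * (p.xi * S + (1 - p.xi) * I) / sat)
    dV = -p.mu * V + p.kappa * p.eta * I
    return (dS, dI, dY, dV)
```

With no predator, no virus and S = K, every derivative vanishes, which is a quick sanity check on the signs of the terms. Any standard ODE integrator can be applied to `model_ii_rhs` to follow the populations in time.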
The introduction of the new node (V ) and links to the existing module
(Model I) has interesting effects on the population dynamics of the species that
66 S. Bhattacharyya and S. Sinha
[Fig. 4 plot: prey (upper panel) and predator (lower panel) populations against α, for α from 5.5 to 6.5.]
Fig. 4. Bifurcation diagram of the prey and predator in Model I with increasing
predation strength α. At α = 5.6 (approx.), the system undergoes a period-doubling
bifurcation.
[Fig. 5: panels (a) and (b), each plotting susceptible and infected prey, predator, and virus populations against α, for α from 5.5 to 6.5.]
with the predator population going to zero. This happens because, in the
absence of predation, all of S is available for inducing strong viral infection,
which converts the susceptible prey class to the infected one, and the predator
does not have enough prey to survive through predation. Second, the dynamics
of this prey-virus system remain stable with a large virus population and
low prey populations. When the predator has a strong preference for the S
population, i.e., at ξ = 0.99, this situation continues for low predation strength
(until α = 6.2), and the reduced prey-virus system remains stable (Fig. 5(B)).
However, at higher predation strength (α > 6.2), the predator succeeds in
surviving on predation and reduces the population of I strongly enough to
reduce the production of V , which in turn reduces infection, thereby increas-
ing S, which is then available for predation. This kind of a delayed feedback
on S eventually induces oscillations in all four populations, albeit at higher
α compared to Model I. This interesting phenomenon essentially underscores
the fact that distribution of the type (+ or −) and the strength of interactions
can play a significant role in food web structure and dynamics. It can change
the structure of the network by inducing a species to go extinct, and also
promote stability in an otherwise oscillatory system.
parameters, the full network structure persists. These features, i.e., the inter-
action strength and network structure, also regulate the population dynamics
of the species. A combination of type and strength of interactions determines
the dynamical stability of the species in the network. One natural extension of
our study would be to introduce yet another class of prey species, Recovered,
which represents the population of individuals that recover from the infection
after a time, and either return to the susceptible class, or may be immune to
further infections. This would, obviously, increase the complexity of the net-
work by adding new nodes and interactions among them. However, this would
contribute towards understanding the concept “diversity leads to stability” on
large-scale food web processes.
Most of the recent research on food web theory in ecology centers around
the local dynamics of a community, but the evolution of food web dynam-
ics across different spatial scales has also received considerable attention
[26, 27, 45, 57, 73]. “Habitat fragmentation and its impact on life” is one of the
most important issues of present research [66]. The destruction of habitat oc-
curs due to a variety of environmental threats, such as habitat removal, invad-
ing alien species, or hunting, each of which may have different effects on food
web structure. Given that they often act concomitantly, these may also inter-
act with each other in unpredictable ways. Introduction of alien species poses
a significant threat to global biodiversity by altering ecosystem processes, such
as nutrient cycling, or disturbance regimes in a community [65], which, in turn,
also affect the strength of the links. If the performance of interacting species
is habitat dependent, then interaction strength may change with scale. Cer-
tain approaches such as hierarchical communities of competitors [69, 70] and
neutral and quasi-neutral communities [68] have been adapted to show that
community organization is relevant in determining the effects of habitat loss
and spatial patterning. Such research on ecological network theory in the fu-
ture would involve rigorous modeling approaches, both analytical and through
simulations, in combination with field and laboratory experimental studies,
to resolve the crucial questions in conservation and restoration ecology.
Acknowledgments
The authors are thankful to the anonymous referees for constructive, criti-
cal comments, and to the Department of Science and Technology, India, for
financial support.
References
1. Abrams, P. et al. The role of indirect effects in food webs. In Food Webs: Inte-
gration of Patterns and Dynamics (eds G.A. Polis & K.O. Winemiller), 371–395,
Chapman & Hall, New York (1996)
2. Allesina, S. and Bodini, A. Who dominates whom in the ecosystem? Energy flow
and bottlenecks and cascading extinctions. J. Theor. Biol., 230, 351–358 (2004)
Ecological Networks: Structure, Interaction Strength, and Stability 69
3. Bascompte, J. and Melian, C. J. Simple trophic modules for complex food webs.
Ecology, 86, 2868–2873 (2005)
4. Bascompte, J. et al. Interaction strength combinations and the overfishing of a
marine food web. Proc. Natl Acad. Sci. USA, 102, 5443–5447 (2005)
5. Bastolla, U., Lassig, M., Manrubia, S. C. and Valleriani, A. Diversity patterns
from ecological models at dynamical equilibrium. J. Theor. Biol., 212, 11–34
(2001)
6. Berlow, E. L. et al. Interaction strengths in food webs: issues and opportunities.
J. Anim. Ecol., 73, 585–598 (2004)
7. Berlow, E. L., Brose U., and Martinez, N. D. The “Goldilocks factor” in food
webs. Proc. Natl. Acad. Sci. USA, 105, 4079–4080 (2008)
8. Bhattacharyya, S. and Bhattacharya, D. K. Pest control through viral diseases:
mathematical modeling and analysis. J. Theor. Biol., 238, 177–197 (2006)
9. Caldarelli, G., Higgs, P. G. and McKane, A. J. Modelling coevolution in multi-
species communities, J. Theor. Biol., 193, 345–358 (1998)
10. Camacho, J. et al. Quantitative analysis of the local structure of food webs.
J. Theor. Biol., 246, 260–268 (2007)
11. Case, T. J. Invasion resistance arises in strongly interacting species-rich model
competition communities. Proc. Natl. Acad. Sci. USA, 87, 9610–9614 (1990)
12. Chen, X. and Cohen, J. E. Global stability, local stability and permanence in
model food webs. J. Theor. Biol., 212, 223–305 (2001)
13. Cohen, J. E., Briand, F. and Newman, C. M. Community food webs. Biomathe-
matics, 20, Springer-Verlag, Berlin (1990)
14. Dambacher, J. M. et al. Relevance of community structure in assessing indeter-
minacy of ecological predictions. Ecology, 83, 1372–1385 (2002)
15. Dambacher, J. M. et al. Qualitative stability and ambiguity in model ecosystems.
Am. Nat., 161, 876–888 (2003)
16. De Ruiter, P., Neutel, A. M. and Moore, J. C. Energetics, patterns of interaction
strengths, and stability in real ecosystems. Science, 269, 1257–1260 (1995)
17. Drossel, B. and McKane, A. J. Modelling food webs. In Handbook of Graphs
and Networks (eds S. Bornholdt & H. G. Schuster), 218–247, Wiley-VCH, Berlin
(2003)
18. Dunne, J. A. et al. Network structure and biodiversity loss in food webs: robustness
increases with connectance. Ecol. Lett., 5, 558–567 (2002)
19. Emmerson, M. C. and Raffaelli, D. Predator-prey body size, interaction strength
and the stability of a real food web. J. Anim. Ecol., 73, 399–409 (2004)
20. Garcia-Domingo, J. L. and Saldana, J. Food-web complexity emerging from eco-
logical dynamics on adaptive networks. J. Theor. Biol., 247, 819–826 (2007)
21. Garcia-Domingo, J. L. and Saldana, J. Effects of heterogeneous interaction
strengths on food web complexity. Oikos, 117, 336–343 (2008)
22. Ghosh, S., Bhattacharyya, S. and Bhattacharya, D. K. Role of viral infection in
pest control: a mathematical study. Bull. Math. Biol., 69, 2649–2691 (2007)
23. Gross, T. et al. Long food chains are in general chaotic. Oikos, 109, 135–144 (2005)
24. Hastings, A. and Powell, T. Chaos in a 3-species food-chain. Ecology, 72, 896–903
(1991)
25. Jansen, V. A. A. and Kokkoris, G. D. Complexity and stability revisited, Ecol.
Lett., 6, 498–502 (2003)
26. Keitt, T. H. Network theory: an evolving approach to landscape conservation.
Ecological and Modeling for Resource Managers, Springer Berlin, 125–134, (2003)
71. Uchida, S. and Drossel, B. Relation between complexity and stability in food
webs with adaptive behavior. J. Theor. Biol., 247, 713–722 (2007)
72. Uchida, S., Drossel, B. and Brose, U. The structure of food webs with adaptive
behaviour. Ecol. Model., 206, 263–276 (2007)
73. Urban, D. L., Goslee, S., Pierce K. B. and Lookingbill, T.R. Extending commu-
nity ecology to landscapes. Ecoscience, 9, 200–212 (2002)
74. Williams, R. J. and Martinez, N. D. Simple rules yield complex food webs. Nature,
404, 180–183 (2000)
75. Woodward, G. and Hildrew, A. G. Body-size constraints on niche overlap and
intraguild predation in a complex food web. J. Anim. Ecol., 71, 1063–1074 (2002)
76. Wootton, J. T. Estimates and tests of per-capita interaction strength: diet, abun-
dance, and impact of intertidally-foraging birds. Ecological Monographs, 67, 45–
64 (1997)
77. Wootton, J. T. and Emmerson M. Measurement of interaction strength in nature.
Annu. Rev. Ecol. Evol. Syst., 36, 419–444 (2005)
78. Yodzis, P. The indeterminacy of ecological interactions as perceived through per-
turbation experiments. Ecology, 69, 508–515 (1988)
79. Yodzis, P. and Innes, S. Body-size and consumer-resource dynamics. Am. Nat.,
139, 1151–1175 (1992)
Signaling and Feedback in Biological Networks
Center for Models of Life, Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen,
Denmark; sandeep@nbi.dk, mhjensen@nbi.dk, sneppen@nbi.dk
1 Introduction
Cellular processes operate on a wide range of time and length scales to produce
complex and intricate dynamics. It is a great challenge to understand both
how these dynamical patterns are produced, as well as why they are produced;
that is, what functional or evolutionary role do they play? This is one of the
most fruitful areas in which to apply the ideas of complex networks. Living
cells have all the prerequisites for a useful representation as networks. First,
cellular systems contain numerous non-identical active components—genes,
proteins, RNA, etc. These are the nodes of the network. Second, there are
many interactions between these components, which form the links between
the nodes. Not every pair of components interacts, so the resulting network
is not fully connected, nor is it a tree or other simple topology. Thus, cellular
networks provide plenty of scope for analysing their structure and graph-
theoretic properties, and numerous studies have taken advantage of this (see
[1] for reviews and [2–9] for some examples).
Network representations of cellular systems can easily be augmented to
address dynamical issues. Each node can be associated with a dynamical vari-
able which could represent, for example, the concentration of that protein or
the level of expression of that gene. Equations or rules governing the tem-
poral dynamics of these variables can then be written, where the network
structure determines which variables interact with each other. This usually
requires encoding more information about the interactions into the network
representation. For instance, apart from knowing that one node links to an-
other, one needs to know the sign and strength of the interaction. However, in
a network picture it is sometimes difficult to encode more detailed molecular
information, such as whether the binding of a protein to DNA is accompanied
by DNA looping, or whether a small molecule that binds to a protein can also
bind equally well when that protein is bound to DNA.
The question then is: What kind of physiologically useful processes can be
illuminated by the kind of information that is easily represented in a network
picture of a cell? One broad class of such processes is signal propagation.
Signals need to be sent in response to environmental conditions in order to
trigger the appropriate functional proteins, and need to be sent between pro-
teins in order to perform necessary computations. For example, the presence
of food metabolites in the surroundings triggers signals to proteins involved in
transport and metabolism of those molecules; or a sudden change in the tem-
perature triggers signals to proteins which buffer the cell against the shock.
Network representations of cellular systems are particularly suited to study
signal propagation because they precisely delineate the paths along which sig-
nals could travel. The next level of complication occurs when a signal loops
back onto itself. Such feedback loops are at the core of every non-trivial com-
putation performed by a cell [10–16]. Feedback loops are necessary for much
non-trivial dynamical behaviour, in particular, oscillations and multistability,
both of which are important for proper cellular function in different organisms.
Our review will therefore introduce biological networks specifically with
the intention of investigating signal propagation and feedback. We will de-
scribe simple measures for examining signal propagation on networks. We
will use the organism-wide cellular network of E. coli to discuss whether the
network structure has any particular properties which would affect the cost
and specificity of signal propagation. The review will then continue by dis-
cussing feedback in sub-networks of mammalian and yeast cells. We will take
one example each from a biological setting where, respectively, negative and
positive feedback in the network structure play a crucial role in the dynamical
behaviour of the system. Finally, we will conclude by looking at combinations
of feedback loops. We show that two entangled feedback loops, which are com-
mon in bacterial cells, have dynamical properties that are quite different from
those of their individual loops.
2 Signaling
An organism-wide protein network of the bacterium E. coli can be extracted
from the database EcoCyc [17] and represented as a directed, bipartite graph
with 2846 protein nodes and 2774 reaction nodes [18]. The reaction nodes
include all kinds of cellular reactions between proteins: transcription reactions,
complex formations, protein modifications and metabolic reactions. Figure 1A
shows the giant weakly connected component of this graph, consisting of 1938
reactions (of which 812 are transcription reactions, squares) and 1897 proteins
(circles). Figure 1A also illustrates that the E. coli graph is composed of a
large number of relatively small strong components (a strong component is
a sub-graph where there is a directed path between every pair of nodes).
Figure 1B compares this with the strong component structure of a randomised
network with exactly the same number of nodes and links, as well as the same
in- and out-degree (number of in- and out-links) of each node. The E. coli
Fig. 1. E. coli protein reaction network. (A, Left) The graph is the largest weak
component of a bipartite network, consisting of proteins (circles) and reaction nodes
(promoters (squares), complex formations and modifications (black squares)). The two
largest hubs, σ70 and CRP, and their links, have been removed for ease of visualisation.
(A, bottom left) Illustration of the procedure of making the strong component graph.
(A, Right) The resulting strong component graph of the E. coli network. An arrow
in the strong component graph indicates that there is a path connecting the two
strong components in the original graph; nodes correspond to strong components of
minimum size two. (B) The strong component graph for a randomized version of the
E. coli network. The randomisation preserves the total number of nodes, total number
of links and the number of in- and out-links of each node [18].
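A randomisation that keeps the number of nodes, the number of links and every node's in- and out-degree can be obtained by repeatedly swapping the targets of two randomly chosen links. The sketch below illustrates this edge-swap technique on a plain directed edge list. It is an assumption that the randomisation in [18] works this way, and for a bipartite network one would additionally restrict swaps to pairs of links of the same type (protein to reaction, or reaction to protein). The function name and arguments are hypothetical.

```python
import random


def randomise_preserving_degrees(edges, n_swaps, seed=0):
    """Shuffle a directed edge list while keeping every in- and out-degree.

    edges: list of (source, target) pairs without duplicates.
    Each attempted move turns a->b, c->d into a->d, c->b.
    """
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Reject moves that would create a self-loop or a duplicate link.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present.discard((a, b))
        present.discard((c, d))
        present.add(new1)
        present.add(new2)
        edges[i], edges[j] = new1, new2
    return edges
```

Because each accepted swap only exchanges the targets of two links, every source keeps its out-degree and every target keeps its in-degree, while the wiring between them is scrambled.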
obviously limited to reach only these nodes. The strong component graphs
in Fig. 1 show particularly clearly how the network structure affects signal-
ing possibilities. Within each strong component, every node can, in principle,
send a signal to another node. But between strong components the possibil-
ities are hugely reduced. Thus, the E. coli network structure already seems
to be set up to allow plentiful signaling on short length scales, but to allow
only very specific paths on longer length scales. In the random network, how-
ever, most nodes can send signals to almost the entire network (because most
of the nodes are part of one giant strong component). A percolating struc-
ture like this is not conducive to specific signaling because every node has
almost the entire network downstream of it. Figure 2 bolsters this conclusion,
showing that in the E. coli network proteins have a much smaller number of
downstream targets than in the randomised network.
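The two quantities used in this argument, the strong components and the set of downstream targets of a node, are easy to compute on a small directed graph. The following sketch uses a simple reachability-based method that is adequate for illustration but not the fastest choice for large networks; the toy graph and function names are hypothetical and are not taken from the E. coli data.

```python
from collections import deque


def reachable(adj, start):
    """Nodes that can be reached from `start` along directed links."""
    seen = set()
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def strong_components(adj):
    """Group nodes so that every pair in a group is joined by directed paths both ways."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    rev = {}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    components, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # Nodes that are both downstream and upstream of n, plus n itself.
        comp = ({n} | reachable(adj, n)) & ({n} | reachable(rev, n))
        components.append(comp)
        assigned |= comp
    return components


# Toy graph: a three-node cycle feeding a two-node cycle.
toy = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["E"], "E": ["D"]}
toy_components = strong_components(toy)
toy_downstream_of_D = reachable(toy, "D")
```

In the toy graph, signals starting in the first cycle can reach every node, whereas signals starting in the second cycle stay within it. This is the small-scale analogue of the restricted signaling between strong components described above.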
Fig. 3. (A) Schematic showing how the “cost” of a signaling path, A → F , is measured.
In this case proteins B and D are necessary, giving a cost C = 2. (B) Cost of a signaling
path as a function of its length for the real (solid) and randomised (dashed) E. coli
networks [18].
molecules in the cell. The more reactions in the path, and the more reactants
in each reaction, the more conditions that must be met for propagation of the
signal.
We quantify this cost C = C(path) for an arbitrary path from a starting
protein to a target protein by simply counting the number of reactants along
the entire path (not counting the protein nodes which are part of the path), as
described schematically in Fig. 3A. If the same reactant is used several times,
it is only counted once. Notice that the propagation of a signal does not
necessarily mean an increased level of the proteins involved. The key point is
that a change in input state should be transmitted to a changed output state
of the end product. Our cost function is a simple measure of the complexity
of handling such a signal and it could, in principle, be calculated between any
pair of proteins where a path exists in the directed network.
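The counting rule can be sketched in a few lines. The example assumes a path given as an alternating list of protein and reaction nodes, and a lookup table of the reactants of each reaction; both data structures and the toy path are hypothetical, chosen so that the result matches the cost C = 2 quoted for Fig. 3A.

```python
def path_cost(path, reactants):
    """Cost of a signaling path in a bipartite protein/reaction network.

    path: [protein, reaction, protein, ..., protein], alternating node types.
    reactants: dict mapping each reaction to the set of its input proteins.
    Counts distinct reactants that are needed along the path but are not
    themselves protein nodes of the path.
    """
    on_path = set(path[0::2])          # protein nodes that belong to the path
    extra = set()                      # a set, so repeated reactants count once
    for reaction in path[1::2]:
        extra |= set(reactants[reaction]) - on_path
    return len(extra)


# Hypothetical path A -> F in which proteins B and D are needed as co-reactants.
toy_reactants = {"r1": {"A", "B"}, "r2": {"C", "D"}, "r3": {"E", "B"}}
toy_path = ["A", "r1", "C", "r2", "E", "r3", "F"]
toy_cost = path_cost(toy_path, toy_reactants)
```

Here B is required by two reactions but contributes only once, so the cost is set by the two distinct helper proteins B and D.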
Figure 3B shows the average cost of signals propagating from one protein
to another along the shortest path connecting them, as a function of the length
l of that path. Each data point is the average over all pairs which are at the
given distance. Except for paths of length two, the average cost for signals
78 S. Krishna, M.H. Jensen, and K. Sneppen
Fig. 4. The six largest strong components of the E. coli network (A–F), along with
plots of the average cost, C(l) as a function of signaling distance. The grey areas show
the range spanned by C(l) for 100 randomised versions of the subgraphs [18].
is significantly smaller for the real E. coli network than for the randomised
networks (error bars are smaller than the symbol size).
Figure 4 repeats this analysis for each of the six largest strong components
in the network. These strong components capture distinct functional units
associated, respectively, to (A) predominantly fatty acid metabolism, (B) the
transcription network around σ factors, (C) PTS-sugar transport, (D) ABC
transporters, (E) the FeII and FeIII transport system and (F) the chemotaxis
module. Overall, we see that the cost within each module is fairly similar to
the random expectation.
We have shown that the molecular network of E. coli is designed in a way which
facilitates local signaling. On longer distances, signal transmission is a priori
nearly impossible, but we find statistical evidence for signal pathways in terms
of a lower signaling “cost” when we measure this by the number of co-factors
needed to transmit a given signal. The fact that the E. coli network has a
lower than randomly expected cost of signaling for paths longer than two steps
shows that it contains many linear chains which have few incoming branches.
That is, the real network is “stringy,” while the randomised network is more
“bushy,” having relatively many more branched pathways. Topologically, a
low cost is equivalent to less cross talk, which is indeed desirable [3, 19].
This picture of a stringy network of long linear chains applies to the large
scale: the place where the real network optimizes specific signaling is between
strong component modules, rather than within them. A final intriguing point
is that at small scales, within modules, the network has widely different design
features, as seen from Fig. 4. Some modules (C,F) are dominated by complex
formation reactions, and others (D,E) by linear pathways, while the remaining
(A,B) are densely interconnected.
Obviously, signaling is not only limited by the topology of the network, but
also by the type of chemical reactions that facilitate the signals. For example,
in pure protein-protein interaction networks, Refs. [20, 21] show that proteins
with high concentrations propagate signals to proteins at low concentrations,
but not vice versa. Further, when most of a protein is present in an unbound
form, rather than in a complex with other proteins, it inhibits propagation of
signals through that node of the network. Thus, the overall picture of signaling
in biological networks is that one needs careful engineering of both topology
and protein binding chemistry in order to facilitate signal propagation over
more than one or two reactions.
3 Feedback
Figure 5 shows a number of feedback loops. Each node in each loop receives
signals (perturbations) from the previous node and sends them on to the next
node in the cycle. When the signal travels all the way around the loop, it will
Fig. 5. Examples of positive and negative feedback loops. An ordinary arrow indi-
cates activation, a barred arrow indicates inhibition. (a)–(d) Negative feedback loops
found involving proteins important for, respectively, development [29], apoptosis [30],
lactose consumption [31, 32] and the immune system [33]. (e)–(g) Positive feedback
loops involving proteins important for, respectively, λ phage lysis-lysogeny decision
and induction [34], and import of extracellular lactose [31, 32].
The simplest negative feedback loop is, of course, a protein which represses
itself (Fig. 5a). There are many examples of such proteins: the main regulator
of the E. coli response to UV damage, LexA, represses its own production
[36]; Hes1, involved in development in mammalian cells, also represses tran-
scription of its own gene [29]. A well-known synthetic negative feedback loop
is the repressilator, which consists of three proteins, each repressing the next in a cycle
[37] (the same structure as Fig. 5c). Here we will concentrate on the negative
feedback loop shown in Fig. 5d containing the transcription factor, NF-κB,
which is one of the central regulators of the immune system in mammalian
cells. The NF-κB family of proteins is one of the most studied, being involved
in a variety of cellular processes including immune response, inflammation
and development. NF-κB can be activated by a number of external stimuli
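\[
\frac{dN_n}{dt} = A\,\frac{1 - N_n}{\varepsilon + I} - B\,\frac{I N_n}{\delta + N_n}, \tag{1}
\]
\[
\frac{dI_m}{dt} = N_n^2 - I_m, \tag{2}
\]
\[
\frac{dI}{dt} = I_m - C\,\frac{(1 - N_n)\,I}{\varepsilon + I}. \tag{3}
\]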
The saturated degradation is the second term in the last equation. Other
terms in the equations model processes like nuclear import and export of
NF-κB, production of IκB, etc. (see [11] for more details).
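The three-variable model can be explored numerically with a few lines of code. The sketch below is an illustration rather than the authors' implementation. It uses the parameter values quoted in the caption of Fig. 6, writes the small constant in the denominators as `EPS`, and advances the state with a fixed-step fourth-order Runge-Kutta scheme. The step size and initial condition are assumptions, and because the equations are stiff, a long run would be better served by an adaptive or implicit solver.

```python
# Parameter values from the caption of Fig. 6.
A, B, C, DELTA, EPS = 0.007, 954.5, 0.035, 0.029, 2e-5


def nfkb_rhs(state):
    """Right-hand side for state = (Nn, Im, I): nuclear NF-kB, IkB mRNA, IkB."""
    Nn, Im, I = state
    dNn = A * (1 - Nn) / (EPS + I) - B * I * Nn / (DELTA + Nn)
    dIm = Nn ** 2 - Im
    dI = Im - C * (1 - Nn) * I / (EPS + I)   # second term: saturated degradation
    return (dNn, dIm, dI)


def rk4_step(f, state, dt):
    """One classical Runge-Kutta step of size dt."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2 * b + 2 * c + d) / 6
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(state, dt, n_steps):
    """Advance the model by n_steps steps of size dt (both chosen by the user)."""
    for _ in range(n_steps):
        state = rk4_step(nfkb_rhs, state, dt)
    return state
```

Starting from (0, 0, 0), a call such as `integrate((0.0, 0.0, 0.0), 1e-5, 1000)` follows the initial rapid entry of NF-κB into the nucleus. The small step is needed because the import term changes on a time scale of order ε/A.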
An obvious question is why the cell requires oscillations in NF-κB in re-
sponse to inflammation. This is a subject of much debate currently, and there
is no clear answer. However, our model of NF-κB provides a possible clue:
One property of the oscillations of nuclear NF-κB (in Fig. 6) that stands out
is that they are extremely spiky. The spikiness is extremely robust to changes
Fig. 6. (Left) Oscillations of nuclear NF-κB (Nn ) (black curve) and cytoplasmic IκB
(grey curve) for simulations of the model with A = 0.007, B = 954.5, C = 0.035,
δ = 0.029 and ε = 2 × 10^−5 (these parameter values are derived from the ones used
in Ref. [10], see [11]). In order to facilitate comparison with the experimental plot
(right, obtained from Ref. [38]), the x-axis has been limited to 600 minutes, but the
oscillations are sustained.
Fig. 7. Sensitivity to IKK. (Left) Spike duration, the fraction of time Nn spends above
its mean value, as a function of IKK concentration. (Right) Spike peak, the maximum
concentration of nuclear NF-κB, as a function of IKK concentration. In both plots,
the black dot shows the IKK value used in Fig. 6, which separates regions of spiky and
soft oscillations [11].
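The two measures plotted in Fig. 7 can be computed directly from a simulated time series of nuclear NF-κB. The following sketch assumes the series is sampled at equal time intervals, so that the fraction of samples above the mean equals the fraction of time; the function names are hypothetical.

```python
def spike_duration(samples):
    """Fraction of (equally spaced) samples lying above the mean of the series."""
    mean = sum(samples) / len(samples)
    return sum(1 for x in samples if x > mean) / len(samples)


def spike_peak(samples):
    """Maximum value reached in the series."""
    return max(samples)
```

A series that sits near zero and rises briefly gives a small spike duration, whereas a smooth, symmetric oscillation gives a value close to one half. For instance, nine samples at 0 followed by one at 10 have mean 1 and a spike duration of 0.1.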
that is much larger than that obtained by other typical mechanisms which do
not involve oscillations [40, 41]. Thus, oscillations could be a by-product of
designing the system to have a very high sensitivity to small changes in the
external stimulus.
Cells carry information handed down from their ancestors and are able to pass
on information to their descendants. In many cases this “memory” is epige-
netic—not stored in the DNA sequence—allowing cells with identical DNA to
maintain distinct properties. Epigenetic cell memory implies alternative states
that are stable over time and are inherited through cell division.
One proposed mechanism for epigenetic cell memory invokes positive feed-
back loops in nucleosome modification [42]. Nucleosomes are protein com-
plexes that package eukaryotic DNA, with a density of about one nucleosome
per 200 base pairs (bp). The core nucleosome is composed of two molecules
each of four core histone proteins. Nucleosomes may carry various chemi-
cal modifications (e.g. acetylation and methylation) at different amino acid
positions on the different histones, conferring a large potential information
capacity on each nucleosome. Specific additions and removals of these nucle-
osome modifications are carried out by classes of enzymes, including histone
acetyltransferases (HATs), histone methylases (HMTs), histone deacetylases
(HDACs) and histone demethylases (HDMs). At least some of these modifi-
cations affect the activity of nearby genes, in part because the modifications
can alter the binding of regulatory proteins to the DNA.
Positive feedbacks are present in this system because nucleosomes that
carry a particular modification may recruit (directly or indirectly) the en-
zymes that catalyse similar modification of neighbouring nucleosomes. Thus,
a cluster of nucleosomes may be able to maintain itself stably in a particular
modification state. These states can be inherited through DNA replication
because nucleosomes on the parental DNA strand are distributed to both
daughter strands [43], and the enzymes recruited by these parental nucle-
osomes may then establish the parental modification pattern on the newly
deposited nucleosomes.
A specific case in which positive feedbacks in nucleosome modification
result in multiple stable states occurs in the mating-type system of the eu-
karyote S. pombe (fission yeast) [44]. A ∼20 kbp region of S. pombe DNA
containing two mating-type cassettes is normally in a stable “silenced” state,
with the mating-type genes not expressed. In certain mutants where part of
the silenced region is modified, the system is bistable, flipping between states
where the ura4 gene is either expressed (active) or not (silenced). Each state
is stable and heritable, with transitions occurring at roughly equal frequencies
of ≈ 5 × 10−4 per cell division [44]. Switching appears to be stochastic and is
determined by factors associated with the region itself. In the silenced state,
but not the active state, the region is dominated by nucleosomes that are
Fig. 8. Illustration of basic ingredients of the model: Each oval represents a nucleosome
that can be methylated (M), unmodified (U) or acetylated (A). Enzymatic transitions
(solid arrows) between the three states are in part random (controlled by a noise level
1 − α), and in part autoregulated by recruitment (dotted lines) of enzymes (open
symbols) by nucleosomes in the M or A state [45].
M or A state, the nucleosome n1 is changed one step toward this state. For
example, when nucleosome n2 is an M: if n1 is an A, then it is changed to U
and if n1 is a U it is changed to M. If nucleosome n1 and n2 are in the same
state, or if n2 is a U, then no changes are made.
(b) With probability 1 − α one attempts a change of the selected nucleosome
n1: a U is changed to an M with probability 1/3, or to an A with probability
1/3, whereas an A or an M is changed to a U with probability 1/3.
One may view process (a) as occurring due to the action of enzymes re-
cruited by nucleosomes in the region within the isolating boundaries, whereas
(b) reflects extrinsic noise caused by unrecruited enzymes. Thus, a lower α
value indicates a higher noise level.
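One attempted update of this model can be sketched as follows. The code follows steps (a) and (b) as stated; the rule for choosing the second nucleosome n2 is not given in the passage above, so drawing it uniformly at random from the whole region is an assumption, as are the function and variable names.

```python
import random

# One step along the ladder A - U - M, keyed by (state of n1, state of n2).
STEP_TOWARD = {
    ("A", "M"): "U", ("U", "M"): "M",
    ("M", "A"): "U", ("U", "A"): "A",
}


def update(chain, alpha, rng):
    """Attempt one update of a randomly selected nucleosome n1 (in place)."""
    n1 = rng.randrange(len(chain))
    if rng.random() < alpha:
        # (a) Recruited conversion: move n1 one step toward the state of n2.
        n2 = rng.randrange(len(chain))  # assumption: uniform choice within the region
        # No change if n2 is U or if n1 and n2 are in the same state.
        chain[n1] = STEP_TOWARD.get((chain[n1], chain[n2]), chain[n1])
    else:
        # (b) Noisy conversion with the probabilities given in the text.
        u = rng.random()
        if chain[n1] == "U":
            if u < 1 / 3:
                chain[n1] = "M"
            elif u < 2 / 3:
                chain[n1] = "A"
        elif u < 1 / 3:
            chain[n1] = "U"


def sweep(chain, alpha, rng):
    """One time unit: as many attempted updates as there are nucleosomes."""
    for _ in range(len(chain)):
        update(chain, alpha, rng)
```

Calling `sweep` repeatedly on a list of N = 60 states and recording the counts of M, U and A after each sweep gives time courses of the kind plotted for this model. With α = 1 a uniformly methylated region never changes, which shows that switching requires the noise term.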
In Fig. 9 we illustrate the dynamics of the model. One observes a fluc-
tuating number of the three kinds of nucleosomes. In the upper panel α is
small (noise is high) and the system has only one stable state, in which the
nucleosome modifications are distributed randomly along the chain. In the
lower panel, with a higher α, the system exists either in a state dominated
by methylated nucleosomes or a state dominated by acetylated nucleosomes,
with occasional switches between the two states. As α is increased further
(i.e., noise is reduced) the states become more stable, and the switching oc-
curs less often. However, the fact that the epigenetic states in the mutant S.
pombe have a finite stability demonstrates that noise in the form of disordered
methylation-acetylation events plays a crucial role.
Fig. 9. Time development of the standard model [45] for a system consisting of N = 60
nucleosomes with respectively α = 0.40 (upper figure) and α = 0.64 (lower figure). The
light grey curve shows the number of methylated, dark grey the number of acetylated
and black the number of unmodified nucleosomes. Time t is measured in number of
attempted nucleosome updates per nucleosome.
Fig. 10. Schematic illustration of molecular processes in a two-loop motif. This motif
is found in the regulation of uptake and metabolism of, for example, maltose and ara-
binose [50, 49]. σ, s denote, respectively, extracellular and intracellular concentrations
of the small molecule. The molecule binds to the regulator, R, forming the complex
{Rs} which activates production of transport proteins, T , and metabolic enzymes, E.
γ is a parameter controlling the metabolic rate per enzyme [13].
the transport proteins (T ) facilitating the influx of the small molecule, while
the other controls transcription of enzymes (E) responsible for the metabolism
of s. The signs show the logic of each feedback loop: positive (+) or negative
(–). Each motif can then be described by a notation of two signs, e.g. (+ –),
which means that the transport loop is positive and the metabolism loop neg-
ative. Thus, there are four logical structures: the socialist (– –), the consumer
(+ –), the fashion (– +) and the collector (+ +) [13]. Each can, in turn,
be implemented in two distinct but logically equivalent ways, depending on
whether s inhibits or activates R. This we denote using the notation (+ – i)
or (+ – a), where the i (respectively, a) indicates inhibition (activation) of R
by s. The i- and a-motifs with the same logic behave very similarly, so here we
will concentrate on only the a-motifs.
The socialist motif. We call the (– –) motif the socialist because at low levels
of extracellular s (low σ) it increases transport and reduces the metabolism,
while at high levels of extracellular s, it does the opposite. Thus, the two
negative feedback loops help maintain s robustly within a small concentration
range. Such behaviour would be ideal for a system responsible for maintaining
homeostasis. And indeed, a regulatory system with this logic is found in the
iron homeostasis system in mammals [51]: iron activates the ferric uptake
regulator (Fur), which represses transcription initiation of iron uptake genes,
and enhances production of iron-using proteins. For most organisms iron is
essential for several proteins, but is poisonous at high concentrations. There,
Fig. 11. Behaviour of four entangled feedback loop motifs. Plots show the steady state
values of s (middle column) and influx (σT = γEs + s, right column) as a function
of σ. In all plots, the black curve shows the behaviour for the two-loop motif. The two
other curves show the behaviour when only the transport loop is active (E = 1) and
when only the metabolism loop is active (T = 1) [13].
the (– –) motif maintains the loosely bound iron within a narrow concentration
range, and at the same time allows a high consumption of iron molecules by
certain proteins that bind iron strongly.
The consumer motif. The (+ –) motif we term the consumer, because
any amount of extracellular small molecule results in the increase of both
transport and metabolism. Thus, it is ideal for food molecules. This logic
is in fact typical for sugar transport and metabolism in prokaryotes. The
gal [52] and lac [31, 32] operons in E. coli are the most well studied of such
systems. They both use the sugar molecule to inhibit the transcription factor
Signaling and Feedback in Biological Networks 89
loop) and also decreases their chance to exercise (positive metabolism feedback
loop), thus forming a collector motif. The bistable behaviour of the collector
motif would then contribute to a broadening of the weight distribution in
human populations [60].
Figure 11 also shows the behaviour of individual loops in these motifs, ob-
tained by keeping either E or T fixed, thereby cutting feedback in one of the
loops. The near constant value of s in (– –) comes from the metabolic loop’s
ability to constrain s for low σ, and the transport loop’s ability to constrain s
at high σ. Thus, the functionality of (– –) is dominated by the sub-motif that
best prevents large variation of s and flux. The (+ –) obtains a steady increase
in s and a step-like increase in flux with σ by using the negative metabolic
loop’s ability to “smooth out” the bistability associated to the positive trans-
port loop. The (– +) motif exhibits a remarkable non-monotonic behaviour of
flux, which cannot be obtained from any of the sub-motifs. The (+ +) motif
maximizes bistability, by extending it to the extreme of the two bistable re-
gions of its sub-motifs. Overall, we can conclude that whole two-loop motifs
are more than a simple sum of their parts.
Our analysis of two entangled feedback loops creates a framework for analysing
small molecule regulatory circuits composed of multiple entangled feedback
loops. For instance, the regulation of iron in E. coli, while being dominated
by interactions that form a socialist motif [56, 57], also contains a positive
feedback on the metabolism side involving usage of iron in FeS clusters [58].
An investigation of this three-loop motif suggests that two metabolism loops,
connected like this in “parallel” (as opposed to the “series” connection between
a transport and metabolism loop), are additive in behaviour [13, 59]. Due to
this additiveness, iron regulation in E. coli is able to minimise variation of
both the concentration of iron (a property of the socialist part) as well as the
flux (a property of the fashion part) [56]. This indicates that an interesting
direction to extend these ideas might be to try to formulate “design principles”
for combinations of parallel and serially connected feedback loops.
5 Concluding Remarks
in between? There is, of course, no one answer to this question. In the examples
above we have looked at a wide range of scales, from the entire E. coli network,
to three or four component sub-networks, down to nucleosomes on DNA. On
all these scales the dynamical behaviour is, however, constrained first by the
available communication channels, and second by the logical properties of
feedback loops in the network. To summarise, we extract the following main
“lessons” from our case studies:
• The E. coli protein network is highly modular.
• The real E. coli network is more “stringy” than the randomised version,
and this reduces constraints on signal propagation.
• Most feedback loops go through small molecules; there are very few in the
transcription network.
• Biological function is coupled to the logic (positive/negative) of the feed-
back.
• Entangled feedback loops are “more” than a simple sum of their parts.
Acknowledgments
We thank our collaborators, with whom much of the work described here was
done: J. Axelsen, I. Dodd, M. Micheelsen, S. Pigolotti, S. Semsey, G. Thon
and G. Tiana. We acknowledge support from The Danish National Research
Foundation and the Villum Kann Rasmussen Foundation.
References
1. S. Bornholdt and H.G. Schuster, eds., Handbook of Graphs and Networks: From
the Genome to the Internet, Wiley-VCH, Weinheim (2002).
2. E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai and A.-L. Barabasi, Science,
297, 1551–1555 (2002).
3. S. Maslov and K. Sneppen, Science, 296, 910–913 (2002).
4. K. Sneppen, A. Trusina and M. Rosvall, Europhys. Lett., 69, 853 (2005).
5. A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen, Phys. Rev. Lett., 92, 178702
(2004).
6. J. B. Axelsen, S. Bernhardsson and K. Sneppen, BMC Systems Biology, 2, 25
(2008).
7. S.S. Shen-Orr, R. Milo, S. Mangan and U. Alon, Nat. Genetics, 31, 64–68 (2002).
8. A. Samal, S. Singh, V. Giri, S. Krishna, N. Raghuram and S. Jain, BMC Bioin-
formatics, 7, 118 (2006).
9. S. Singh, A. Samal, V. Giri, S. Krishna, N. Raghuram and S. Jain, Eur. Phys. J.
B, 57, 75–80 (2007).
10. A. Hoffmann, A. Levchenko, M.L. Scott and D. Baltimore, Science, 298, 1241–
1245 (2002).
11. S. Krishna, M.H. Jensen and K. Sneppen, Proc. Natl. Acad. Sci. USA, 103, 10840–
10845 (2006).
12. E. Aurell, S. Brown, J. Johansen and K. Sneppen, Phys. Rev. E, 65, 51914 (2002).
13. S. Krishna, S. Semsey and K. Sneppen, Proc. Natl. Acad. Sci. USA, 104, 20815–
20819 (2007).
14. K.B. Arnvig, S. Pedersen and K. Sneppen, Phys. Rev. Lett., 84, 3005 (2000).
15. G. Tiana, M.H. Jensen and K. Sneppen, Eur. Phys. J. B, 29, 135 (2002).
16. M.H. Jensen, G. Tiana and K. Sneppen, FEBS Lett., 541, 176 (2003).
17. P.D. Karp et al., Nucl. Acids Res., 35, 7577–7590 (2007).
18. J.B. Axelsen, S. Krishna and K. Sneppen, J. Stat. Mech., P01018 (2008).
19. L.H. Hartwell, J.J. Hopfield, S. Leibler and A.W. Murray, Nature, 402(6761),
C47–52 (1999).
20. S. Maslov, K. Sneppen and I. Ispolatov, New J. Phys., 9, 273 (2007).
21. S. Maslov and I. Ispolatov, Proc. Natl. Acad. Sci. USA, 104, 13655–13660 (2007).
22. S. Krishna, A.M.C. Andersson, S. Semsey and K. Sneppen, Nucl. Acids Res.,
34, 2455 (2006).
23. R. Thomas, Quantum noise, Springer Series in Synergetics 9, Ed. Gardiner,
Springer, Berlin, pp. 180–193 (1981).
24. E.H. Snoussi, J. Biol. Syst., 6, 3–9 (1998).
25. J.L. Gouzé, J. Biol. Syst., 6, 11–15 (1998).
26. J.E. Ferrell Jr., Curr. Opin. Cell Biol., 14, 140–148 (2002).
27. D. Angeli, J.E. Ferrell and E.D. Sontag, Proc. Natl. Acad. Sci. USA, 101, 1822–
1827 (2004).
28. F.J. Isaacs, J. Hasty, C.R. Cantor and J.J. Collins, Proc. Natl. Acad. Sci. USA,
100, 7714–7719 (2003).
29. H. Hirata, S. Yoshiura, T. Ohtsuka, Y. Bessho, T. Harada, K. Yoshikawa and
R. Kageyama, Science, 298, 840–843 (2002).
30. S.L. Harris and A.J. Levine, Oncogene, 24, 2899–2908 (2005).
31. F. Jacob and J. Monod, J. Mol. Biol., 3, 318–356 (1961).
32. P. Wong, S. Gladney and J.D. Keasling, Biotechnol. Prog., 13, 132–143 (1997).
33. H.L. Pahl, Oncogene, 18, 6853–6866 (1999).
34. M. Ptashne, A Genetic Switch: Phage Lambda Revisited, Cold Spring Harbor
Laboratory Press, Cold Spring Harbor (2004).
35. S. Pigolotti, S. Krishna and M.H. Jensen, Proc. Natl. Acad. Sci. USA, 104, 6533–
6537 (2007).
36. M. Schnarr et al., Biochimie, 73, 423–431 (1991).
37. M.B. Elowitz and S. Leibler, Nature, 403, 335–338 (2000).
38. D.E. Nelson, A.E.C. Ihekwaba, M. Elliott, J.R. Johnson, C.A. Gibney,
B.E. Foreman, G. Nelson, V. See, C.A. Horton, D.G. Spiller et al., Science, 306,
704–708 (2004).
39. G. Tiana, S. Krishna, S. Pigolotti, M. H. Jensen and K. Sneppen, Phys. Biol., 4,
R1 (2007).
40. C.Y. Huang and J.E. Ferrell Jr., Proc. Natl. Acad. Sci. USA, 93, 10078–10083
(1996).
41. A. Goldbeter and D.E. Koshland, Proc. Natl. Acad. Sci. USA, 78, 6840–6844
(1981).
42. G. Felsenfeld and M. Groudine, Nature, 421, 448 (2003).
43. A.T. Annunziato, J. Biol. Chem., 280, 12065 (2005).
44. G. Thon and T. Friis, Genetics, 145, 685 (1997).
45. I.B. Dodd, M.A. Micheelsen, K. Sneppen and G. Thon, Cell, 129, 813–822 (2007).
46. G. Thon, P. Bjerling, C.M. Brunner and J. Verhein-Hansen, Genetics, 161, 611
(2002).
47. B.H. Zimm, Proc. Natl. Acad. Sci. USA, 45, 1601 (1959).
48. H.A. Scheraga, Pure and Applied Chemistry, 36, 1 (1972).
49. R. Schleif, Trends Genet., 16, 559–565 (2000).
50. E. Richet and O. Raibaud, EMBO J., 8, 981–987 (1989).
51. E. Massé and M. Arguin, Trends Biochem. Sci., 30, 462–468 (2005).
52. M.J. Weickert and S. Adhya, Mol. Microbiol., 10, 245–251 (1993).
53. E.M. Ozbudak, M. Thattai, H.N. Lim, B.I. Shraiman and A. van Oudenaarden,
Nature, 427, 737–740 (2004).
54. W.P. Smits, O.P. Kuipers and J.W. Veening, Nat. Rev. Microbiol., 4, 259–271
(2006).
55. R. Donangelo and K. Sneppen, Physica A, 316, 581–591 (2002).
56. S. Semsey, A.M.C. Andersson, S. Krishna, M.H. Jensen, E. Massé and K. Sneppen,
Nucl. Acids Res., 34, 4960–4967 (2006).
57. N. Mitarai, A.M.C. Andersson, S. Krishna, S. Semsey and K. Sneppen, Phys. Biol.,
4, 164–171 (2007).
58. F.W. Outten, O. Djaman and G. Storz, Mol. Microbiol., 52, 861–872 (2004).
59. M. Werner, S. Semsey, K. Sneppen and S. Krishna, preprint (2008).
60. U.S. EPA Exposure Factors Handbook, 1997, http://www.epa.gov/ncea/efh/
Topographic Spreading Analysis
of an Empirical Sex Workers’ Network
1 Introduction
approach.) The network consists of the FSWs themselves, plus their sex
partners (paid and unpaid), as well as any partners of these partners which
were known to the FSW. This method, beginning with 49 interviewed FSWs,
gave a highly connected network of 553 nodes [10]. Furthermore, STI (sexu-
ally transmitted infection) status was obtained for many of these nodes. In
particular, two of the nodes were identified as being HIV-positive, while 11
other nodes have either gonorrhea, chlamydia, or both.
From the collected network data we build an adjacency matrix, where
element aij = 1 if i has a link to j, and is zero elsewhere. (In the case of a
weighted graph, element aij equals the strength of the link from node i to
node j.) The principal eigenvector of the adjacency matrix is a measure of a
node’s centrality in the graph and is called the eigenvector centrality or EVC.
The EVC scores for the nodes in the (weighted or unweighted) network give
the starting point for our approach: they are used for assigning the nodes
to regions, and for predicting the spreading of disease within and between
regions.
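As an illustration of how EVC scores can be obtained from the adjacency matrix, the following Python sketch uses power iteration. It is a minimal example rather than the implementation used for this study: it assumes a symmetric (undirected) matrix with non-negative entries, and the function name, the tolerance, and the use of A + I in place of A are choices made here for the sketch.

```python
def eigenvector_centrality(adj, iterations=1000, tol=1e-12):
    """Approximate the principal eigenvector of a symmetric, non-negative matrix.

    adj[i][j] is 1 (or the link strength) if i has a link to j, and 0 otherwise.
    Returns a list of scores normalised to sum to 1.
    """
    n = len(adj)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Each node's new score is its own score plus the weighted scores of
        # its neighbours. Iterating with A + I instead of A leaves the
        # principal eigenvector unchanged but avoids the oscillation that
        # plain power iteration shows on bipartite graphs.
        y = [x[j] + sum(adj[i][j] * x[i] for i in range(n)) for j in range(n)]
        total = sum(y)
        y = [v / total for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Example: a star graph with node 0 at the hub and three leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = eigenvector_centrality(star)
```

For a weighted graph the same routine applies unchanged, with the link strengths as matrix entries.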
The aims of this work are several. One goal is to extend our earlier topo-
graphic approach to a graph with weighted links. As we will see, this seem-
ingly small change can have very large effects; but we will also see that the
validity of our approach is confirmed, in spite of these large effects. This is
because the modified approach (presented here for the first time) is consistent:
we use the link weights to modify the graph’s adjacency matrix, and hence
the nodes’ EVC values; and we use them again when we define the regions
via the steepest-ascent graph (SAG).
A second aim of this work is to try to exploit the insights gained from the
topographic analysis, in order to find novel suggestions for preventive actions
to hinder the spread of the disease in question. We find that our progress to-
wards this second goal is considerably more modest than that towards the first
goal. We will show “thought experiments,” based on the empirical graph topol-
ogy and link strengths, for which our analysis is extremely useful. However,
we will not find practical suggestions which are immediately promising for the
given Vancouver FSW graph. There are several reasons for this. First, the HIV
graph is so thoroughly protected by condom use that we find little to add in
terms of ideas for preventive measures. Second, the graphs for gonorrhea and
for chlamydia are so thoroughly well connected, and also so well infected, that
we do not find small topological changes which can make a large difference.
We note that our approach treats the network as static; hence any effects
of network dynamics are not taken into account. We believe, however, that
our qualitative results are fairly robust to the likely dynamics of this network,
since its overall structure is thought to be fairly stable over time. Also, our
analysis (once the network is mapped out—which can be time consuming!) is
not computationally demanding, and so may be performed in essentially zero
time compared to the time scale of epidemic spreading. Hence any suggestions
resulting from the analysis may be implemented in something approaching
real time.
Topographic Spreading Analysis 99
Fig. 1. Regions visualization of the FSW network, with all links set to equal strength.
Only the links in the SAG are shown here for visual clarity. The most central node in
each region is enlarged. The three largest regions, which will be discussed further in
the text, are labeled R (red), G (grey), and B (blue). Node symbols distinguish male and female nodes.
and (c) automatically excluded from being a Center. Thus bipartiteness will
tend to favor one gender over another. By the same token, highly central M
nodes are never neighbors of other M Centers, and so are candidate Centers
themselves. Hence there may be a tendency for more, and smaller, regions.
Points (iv)–(vi) tell us that this network is highly prone to infection: the
many regions are not well isolated from one another, because of their common
connection to the dense, infectious red region. Also, the two start nodes are
in or near the central part of the red region, where spreading is fast.
Fig. 2. Same visualization as Fig. 1, except that all links are shown. The arrows mark
the known HIV-positive nodes.
Fig. 3. HIV spreading simulation without (top) and with (bottom) measures to isolate
the grey region from the red region. In each plot, there are four growth curves, showing
the total growth of the infection (‘Sum’), and the growth for the red (R), grey (G),
and blue (B) regions (the largest regions in the network).
Fig. 4. Same simulation as Fig. 3, except that the infection starts from a peripheral
node in the grey region.
For each link we want a single weight (number) which gives the probability
per unit time of transmission from an infected node to an uninfected node.
This probability is based on a number of factors which must be estimated
from limited data. We list these factors schematically as follows:
Now we discuss each factor in turn. For each disease (HIV, gonorrhea or
‘NG’, and chlamydia or ‘CT’) we estimate (unprotected probability/contact)
from Ref. [6]. See Table 1. To correct for condom use, we must know the
frequency of condom use for each link (condom use prevalence). For 256 links
(about 17% of them) we have an estimate for (condom use prevalence) from
survey data [10]. We know very little about the remaining links, except for
              NG      CT      HIV
Unprotected   0.43    0.10    0.05
Protected     0.16    0.074   0
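To illustrate how the tabulated per-contact probabilities might be turned into a single weight per link, the following Python sketch may be helpful. It is not the estimation procedure used in this work. The per-contact values are those of Table 1; the linear mixing by condom use prevalence, the hypothetical contacts_per_day input, and the treatment of contacts as independent trials are all assumptions made only for this example.

```python
# Per-contact transmission probabilities as given in Table 1.
PER_CONTACT = {
    "NG": {"unprotected": 0.43, "protected": 0.16},
    "CT": {"unprotected": 0.10, "protected": 0.074},
    "HIV": {"unprotected": 0.05, "protected": 0.0},
}


def per_contact_probability(disease, condom_use):
    """Mix protected and unprotected probabilities by condom use prevalence.

    condom_use is the fraction of contacts on this link that are protected.
    ASSUMPTION: a simple linear mixture of the two table values.
    """
    if not 0.0 <= condom_use <= 1.0:
        raise ValueError("condom_use must lie between 0 and 1")
    p = PER_CONTACT[disease]
    return condom_use * p["protected"] + (1.0 - condom_use) * p["unprotected"]


def link_weight(disease, condom_use, contacts_per_day):
    """Probability of at least one transmission per day over one link.

    ASSUMPTION: contacts_per_day is a hypothetical input, and contacts are
    treated as independent trials.
    """
    p = per_contact_probability(disease, condom_use)
    return 1.0 - (1.0 - p) ** contacts_per_day
```

Under these assumptions a link with 100% condom use has zero weight for HIV, while the corresponding gonorrhea and chlamydia weights remain positive.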
104 J. Bjelland et al.
3.2 SAG∗
Now we address another complication arising from the use of weighted links:
we must reconsider the definition of the steepest-ascent graph (SAG), which is
used both for assigning region membership and for visualization purposes. Our
point here is simple, namely that the definition of steepest ascent should take
account of the link strength. This rather obvious point has not been addressed
in our earlier use of the SAG [1, 2, 3], because these earlier studies were applied
to unweighted graphs. Hence we offer a brief account here of the modification
used for weighted links.
We recall that region membership is assigned by in essence asking each
node to find the steepest path to the “top”, i.e., to the “nearest” local maximum
of the EVC. The notion of local maximum is independent of link strength.
Suppose, however, that a node N has two local maxima (Centers, C1 and C2)
as neighbors: which region do we place N in? Since we want steepest-ascent
paths to represent most likely spreading, it seems reasonable that a neighbor
C1 with a very weak link to N should not be assigned the steepest-ascent
path—even if it is somewhat higher (in EVC) than C2 . In other words, if we
retain the notion that steepest ascent gives the right answer, then we clearly
want to define the slope as being
The SAG∗ for our weighted HIV graph is shown in Fig. 5. We see immediately
that the contrast with Fig. 1 is enormous.
In particular, the 17 regions of Fig. 1 have multiplied many times. In
addition (which is not so easily seen in the figure) some nodes are completely
disconnected due to the zero-weight links, and hence do not appear in the
figure at all. The apparently isolated nodes in the corner of the figure are
one-node regions; such regions occur typically on the periphery of a graph,
where all EVC values are small.
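The assignment of nodes to regions by steepest ascent can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the figures: in particular, the slope used below (link strength times EVC difference) is a stand-in chosen for this sketch, and the function and argument names are hypothetical. Zero-weight links are assumed to have been removed from the neighbor lists beforehand.

```python
def steepest_ascent_regions(evc, neighbors, weight=None):
    """Assign each node to the Center reached by repeated steepest ascent.

    evc:       dict node -> EVC score
    neighbors: dict node -> iterable of neighboring nodes
    weight:    optional dict (node, neighbor) -> link strength (default 1.0)
    Returns (parent, center): parent[n] is the node n points to in the
    steepest-ascent graph (None for a Center); center[n] is n's Center.
    """
    def slope(n, m):
        w = 1.0 if weight is None else weight.get((n, m), 1.0)
        # ASSUMPTION: slope = link strength times EVC difference.
        return w * (evc[m] - evc[n])

    parent = {}
    for n in evc:
        uphill = [m for m in neighbors.get(n, ()) if slope(n, m) > 0]
        parent[n] = max(uphill, key=lambda m: slope(n, m)) if uphill else None

    center = {}
    for n in evc:
        top = n
        while parent[top] is not None:  # EVC rises at every step, so this ends
            top = parent[top]
        center[n] = top
    return parent, center


# Node N has two Centers as neighbors; C1 is higher but only weakly linked.
evc = {"N": 0.2, "C1": 0.5, "C2": 0.4}
nbrs = {"N": ["C1", "C2"], "C1": ["N"], "C2": ["N"]}
_, unweighted = steepest_ascent_regions(evc, nbrs)
_, weighted = steepest_ascent_regions(
    evc, nbrs, weight={("N", "C1"): 0.1, ("N", "C2"): 1.0}
)
```

In this toy example N joins the region of C1 when all links are equal, and the region of C2 once the weak N to C1 link is taken into account. Because any positive weight leaves the sign of the slope unchanged, the set of local maxima does not depend on link strength.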
What is even more striking is that adding all non-zero links to the SAG∗
picture of Fig. 5 makes very little change; that is, there are only six non-zero
links which are not shown in the figure (four connecting the one-node regions
to one other node each, and two other inter-region links). Hence we do not
show the full graph: it is essentially that of Fig. 5. This means in turn that
HIV spreading—while seemingly unstoppable in the picture obtained from
Fig. 2—is in fact not a problem for this FSW network. In particular, the
Fig. 5. Regions analysis for the HIV graph, corrected with the transmission probability
on each link. Note that the graph breaks into very many small regions, due to the
(assumed) zero transmission probability for reliable condom use. The two enlarged
nodes are known to be HIV-infected; the four nodes in the upper left corner are single-
node regions in the weighted graph.
two HIV-positive (male) nodes (marked with large squares in Fig. 5) are each
confined to an effective two-node network, consisting of themselves and their
nonclient partner. Hence our expected picture of condom use for this empirical
network implies that HIV spreading will be limited to the non-client partner
relationships of the two infected nodes, and so has effectively zero probability
of reaching the rest of this dense sexual network.
Because the effective graph is so fragmented, and also because the HIV-
infected nodes are effectively isolated, we have not performed spreading sim-
ulations on the weighted HIV graph. We note that the largest region in Fig. 5
has 24 nodes, with a FSW as the most central node in the region. In fact the
strongly bipartite picture obtained from the unweighted graph (Fig. 1) has
also broken down here: both male and female Centers of the many regions are
found. This is however not so surprising, given the fragmented nature of the
effective graph.
3.4 Gonorrhea
Figure 6 shows the steepest-ascent (SAG∗ ) graph when we use link strengths
appropriate to gonorrhea. Since 100% condom use does not give 100% protection
[5], the effective gonorrhea graph has all the same links as were present
Fig. 6. Region (SAG∗ ) visualization for the gonorrhea network NG. The enlarged
nodes are known to be STI-infected.
in Fig. 2; but they are reweighted. We see that the reweighting has still had a
dramatic effect. In particular, the 17 regions found for the unweighted graph
are now a single region for the weighted graph. Also, the Center of this one
region (and so of the entire graph) is an FSW.
An interesting aspect of the gonorrhea SAG∗ is that one of the few existing
homosexual (FSW ⇐⇒ FSW) links plays a very central role in the graph: the
link between the Center and the head of the large red subregion is homosexual.
This means that the two women involved are highly central in the weighted
graph, and also that the link strength between them (transmission probability
for gonorrhea) is not too small. One might then propose to remove this link—
which (as it is certainly requested and paid for by a male customer) should
be possible. However as we will see below, removal of this link—or any single
link—has little or no beneficial effect. (This conclusion is perhaps intuitively
grasped from the fully linked visualization of Fig. 7 below.)
SAGs of either type are strict hierarchical structures—that is, they are
directed trees, with links pointing strictly towards the root (Center). This
means that, for any given region, one can readily define subregions in terms
of branches of the tree. We have picked out the five largest branches of the
gonorrhea SAG∗ and color coded them. We see that it is visually meaningful
to think in terms of subregions for this region.
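Because the steepest-ascent graph is a directed tree, subregions can be read off by following each node's links towards the Center. The Python sketch below assumes the tree is available as a hypothetical parent mapping (node to the node it points to, with None for a Center); the function names are illustrative only.

```python
def subregions(parent):
    """Group the nodes of a steepest-ascent tree by branch.

    parent: dict node -> parent node (None for a Center, i.e. a root).
    Returns dict head -> set of nodes in that branch, where a head is a
    node linked directly to a Center. Centers belong to no branch.
    """
    branches = {}
    for node in parent:
        if parent[node] is None:
            continue  # a Center is the root, not part of any branch
        head = node
        # Climb until the next step up would be the Center itself.
        while parent[parent[head]] is not None:
            head = parent[head]
        branches.setdefault(head, set()).add(node)
    return branches


def largest_branches(parent, k=5):
    """Return the k largest branches as (head, size) pairs, largest first."""
    sizes = [(head, len(nodes)) for head, nodes in subregions(parent).items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:k]


# Toy tree: Center C with two branches, headed by h1 and h2.
tree = {"C": None, "h1": "C", "h2": "C", "x": "h1", "y": "x", "z": "h2"}
branches = subregions(tree)
```

Calling largest_branches with k=5 mirrors the choice made above of picking out the five largest branches for color coding.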
Fig. 7. Same layout as in Fig. 6, but with all non-zero links displayed.
Figure 7 shows the NG-graph again, but with all links displayed. We note
that presently infected nodes are enlarged and marked yellow (lighter grey
in printed version) in Fig. 6 and in Fig. 7. From Fig. 6 we see two infected
nodes lying at the heads of their (large) respective subregions, and hence only
one hop from the Center. Also we see that every major subregion is already
infected. This immediately suggests that preventing the further spreading of
gonorrhea on this graph will be quite difficult.
This pessimistic prognosis is also supported by the visualization of Fig. 7.
Here we see that all the major subregions are well connected to one another,
with infected nodes lying in the heart of a dense cloud of links. We will
test (and confirm) this pessimistic prediction via stochastic simulations—see
Section 4.
3.5 Chlamydia
In Fig. 8 we show the SAG∗ visualization of the chlamydia graph. Qualitatively
we see much the same picture as for the NG graph: a single region, with an
FSW at the Center of the region. In fact, the homosexual dyad that we found
lying centrally in the NG graph is also central here—with the one difference
that here the two FSWs have exchanged roles (Center and subregion head).
Our SAG∗ visualizations suggest that the CT graph is perhaps even more
well connected than the NG graph—in that there are very few subregions,
Fig. 8. Region (SAG∗ ) visualization for the chlamydia network CT. Enlarged nodes
are known STI-infected nodes.
and they are very large. And since (again) every major subregion is infected,
we arrive at the same qualitative prognosis for this graph: it will be difficult
to hinder the further spreading of the disease.
We have also plotted the analog of Fig. 7 for chlamydia—that is, the full
graph with all non-zero links. The result is again qualitatively like that of
Fig. 7; hence we do not show it here.
of time is one day, have values which vary from a few percent down to about
10^-4. With these small values we can increment the simulator with a time
step of one day, and get smooth results.
Our simulations differ from one another in three ways: (i) the choice of
“start” nodes which are infected at t = 0; (ii) the choice of a set of “immune”
nodes which cannot be infected; and (iii) sometimes, the choice of links which
are to be blocked from transmission (removed). Choices (ii) and (iii) allow us
to test various strategies for hindering spreading. In the real world of human
sexual behavior, accomplishing either of these effects may be quite difficult;
but we test them here simply to see what can be achieved.
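The three choices listed above map directly onto the inputs of a simple stochastic SI simulator. The following Python sketch shows one way such a simulator could be written, with a time step of one day; it is an illustration under our own assumptions (synchronous updating, hypothetical function and argument names), not the simulator used to produce the figures.

```python
import random


def simulate_si(weights, start, immune=(), blocked=(), days=400, seed=None):
    """Stochastic SI spreading with a time step of one day.

    weights: dict (i, j) -> probability per day that infected i infects j
             (list both (i, j) and (j, i) for an undirected link)
    start:   nodes infected at t = 0
    immune:  nodes that can never be infected
    blocked: links (i, j) removed from transmission, in either direction
    Returns a list with the number of infected nodes at t = 0, 1, ..., days.
    """
    rng = random.Random(seed)
    immune = set(immune)
    blocked = set(blocked) | {(j, i) for i, j in blocked}
    infected = set(start) - immune
    counts = [len(infected)]
    for _ in range(days):
        newly = set()
        for (i, j), p in weights.items():
            if (i, j) in blocked or j in immune:
                continue
            # Only links from an infected node to a susceptible node matter.
            if i in infected and j not in infected and rng.random() < p:
                newly.add(j)
        infected |= newly  # synchronous update: new cases spread from next day
        counts.append(len(infected))
    return counts


# Toy chain a - b - c with certain transmission on every link.
chain = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
as_is = simulate_si(chain, start=["a"], days=3)
with_immune = simulate_si(chain, start=["a"], immune=["b"], days=3)
with_block = simulate_si(chain, start=["a"], blocked=[("b", "c")], days=3)
```

In the toy chain, immunizing the middle node or blocking the second link stops the infection from reaching the far end, which is the kind of comparison the strategies in choices (ii) and (iii) are meant to test.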
First we simulate the reference case, in which those nodes which are known
to be infected are the start nodes (see again Figs. 6 and 7), and we immunize
no nodes or links. We find (Fig. 9) that the infection takes off very fast—as
anticipated in Section 3.4. Specifically, we see that the takeoff time is very
short—just a few days. This is consistent with the fact that the infection has
already reached three very central (as defined by EVC) nodes. This latter
fact is consistent with two interpretations: either (i) the infection has recently
come to this dense network, and it is on the verge of taking off, or (ii) the
infection has been present for a long time, and has reached an equilibrium
(and rather low) level.
[Plot for Fig. 9: infected nodes (0 to 600) against time (0 to 400); curves labeled As-is, Center, Within 1 hop from center, STI red region + head of region, and 50 random.]
Fig. 9. Spreading simulations for gonorrhea, based on the SI model, and using various
prevention strategies. “As-is” = known infected start nodes and no strategy; the other
scenarios involve immunizing various nodes, as described in the text. The unit of time
is one day.
[Plots for Fig. 10. Upper panel: infected nodes (0 to 600) against time (0 to 400); curves labeled Central node, Medium Central, and Not central. Lower panel: mean EVC of infected nodes (0 to 0.2) against time (0 to 400).]
Fig. 10. Three spreading simulations, based on three chosen scenarios, each with a
single start node. We see that distance from the Center node (in a metric defined by
the SAG∗ ) correlates strongly with time to takeoff. The lower part of the figure shows
the average EVC of the newly infected nodes.
[Plot for Fig. 11: time (10^1 to 10^3) against closeness (10^-12 to 10^0), both on logarithmic axes.]
Fig. 11. Time needed for a single start node to infect 300 nodes, as a function of that
start node’s “closeness” to the graph’s Center (averaged over 10 experiments for each
start node). Closeness is measured entirely in terms of the modified steepest-ascent
graph SAG∗ . We see a thorough statistical corroboration of the results of Fig. 10.
Simulations on the gonorrhea graph gave results much like those on the
unweighted FSW graph: the graph was very well connected, and the already-
infected nodes had rather central positions. The result was that we were unable
to find simple topological fixes, inspired by our analysis, which could sig-
nificantly retard spreading. However, we were able to find strong evidence
confirming the basic applicability of our analysis to spreading. Specifically,
we showed that our own notion of a node’s distance from the Center of the
graph correlated strongly with the time needed for that node to infect the
graph.
We emphasize that this is the first application of the topographic approach
to a weighted graph. Performing this analysis has required generalizing our
earlier definition [3] of steepest ascent. The results we obtain here, based on
this new, generalized definition, are very promising. Hence—even as we fail
to come up with promising, concrete suggestions for hindering the spread
of STIs in the Vancouver sex network—we feel that our results confirm the
applicability of our approach to understanding spreading in the real-world
case of a network with non-uniformly weighted links.
We see a clear need for two obvious extensions of this work. First, it would
be useful to reconnect the HIV graph, by assigning small but non-zero proba-
bilities to the 100%-condom-use links. This would allow for a more meaningful
regions analysis and the accompanying testing by simulations (perhaps over
a long time scale).
Second, our approach is most simply understood and applied for diseases
for which SI spreading is appropriate (such as HIV). The application to gon-
orrhea or chlamydia would be greatly strengthened if one could generalize
the method to the SIS and/or SIR case. This is an interesting challenge for
future work.
The data used derive from self-reported infection status [10]. To validate
our model, empirically collected retrospective data on actual prevalence and
incidence of the infections could be obtained. This is also recommended for
future work.
Finally, we remind the reader of the motivation for this work. We believe
that the topographic analysis, based on EVC, is extremely useful for under-
standing epidemic spreading on a coarse scale. The analysis itself is not com-
putationally demanding; hence it can be performed in essentially real time.
Thus, we hope that our approach can be useful for disease prevention, in those
cases for which the network can be mapped in reasonably short time—that
is, short compared to both the time scale for infectious spreading, and the
time scale for significant topology changes. The results presented here do not
offer any immediate solution to the problem of STIs in the Vancouver FSW
network, but they do add further support to our belief that this approach may
be useful for this problem, and for others.
Acknowledgments
GC and KEM acknowledge partial support from the Future and Emerg-
ing Technologies unit of the European Commission through Project DELIS
(IST-2002-001907). VPR acknowledges the financial and in-kind support,
respectively, of the BC Medical Services Fdn and HIV/STI Prevention and
Control, BC Centre for Disease Control.
References
1. G. Canright and K. Engø-Monsen. Roles in networks. Science of Computer Pro-
gramming, pages 195–214, 2004.
2. G. Canright and K. Engø-Monsen. Epidemic spreading over networks: a view from
neighbourhoods. Telektronikk, 101:65–85, 2005.
3. G. Canright and K. Engø-Monsen. Spreading on networks: a topographic view.
In Proceedings, European Conference on Complex Systems, 2005.
4. G. S. Canright and K. Engø-Monsen. Some relevant aspects of network analysis
and graph theory. In J. Bergstra and M. Burgess, editors, Handbook of Network
and Systems Administration. Elsevier, Amsterdam, 2007.
5. K. Holmes, R. Levine, and M. Weaver. Effectiveness of condoms in preventing
sexually transmitted infections. Bull World Health Organ, 82:454–461, 2004.
6. A. M. Jolly, M. E. Moffatt, M. V. Fast, and R. C. Brunham. Sexually transmitted
disease thresholds in Manitoba, Canada. Ann Epidemiol, 15:781–788, 2005.
7. M. Kretzschmar, Y. T. P. H. van Duynhoven, and A. J. Severijnen. Modeling
prevention strategies for gonorrhea and chlamydia using stochastic network
simulations. American Journal of Epidemiology, 144:306–317, 1996.
8. M. Newman. The structure and function of complex networks. SIAM Review,
45:167–256, 2003.
9. R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks.
Phys Rev Lett, 86:3200–3203, 2001.
10. V. P. Remple, D. M. Patrick, C. Johnston, M. W. Tyndall, and A. Jolly. Clients of
indoor commercial sex workers: Heterogeneity in patronage patterns and implica-
tions for HIV and STI propagation through sexual networks. Sexually Transmitted
Diseases, May 2007.
Spectral Characterization of Network
Structures and Dynamics
1 Introduction
2 Growing Networks
Empirical networks usually do not spring into existence, but rather grow to
their present or final state from smaller beginnings. Naturally, such a growth
process involves the sequential addition of nodes and links (connections). Usu-
ally, nodes are added at random, but their link formation with other nodes
(already present in the network) is often not entirely random. This link for-
mation will follow some rule that typically is still stochastic but also involves
properties of those nodes that are candidates for receiving a link. When that
rule gives nodes that already have many connections a higher chance of
receiving links than nodes with fewer connections,
we have some form of preferential attachment. Such a rule is known to lead to
a scale-free degree distribution of the nodes in the network; that is, the number
of nodes in the final network that have $k$ links behaves like a power $k^{-\alpha}$
for some positive exponent $\alpha$. The first such rule was proposed by Simon [44],
and it directly stipulated that those nodes that have more connections also
have a higher chance of receiving additional ones (“the-rich-get-richer” prin-
ciple). This rule and the effects resulting from it were then systematically
investigated by Barabási–Albert [2, 11], and subsequently, many empirical
networks were found to exhibit such a power-law degree distribution.
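As a rough illustration of the rich-get-richer rule, the following sketch grows a graph in which each new node links to existing nodes with probability proportional to their degree. It is not the exact construction of Simon or of Barabási and Albert; the function names, the parameter `links_per_node`, and the choice of a small complete graph as the starting point are assumptions made for this example.

```python
import random
from collections import Counter

def grow_preferential(n_nodes, links_per_node=2, seed=0):
    """Grow a graph node by node; each new node links to existing nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a small complete graph so that every node has positive degree.
    m0 = links_per_node + 1
    edges = {(i, j) for i in range(m0) for j in range(i + 1, m0)}
    # Each node appears in `stubs` once per incident link, so a uniform
    # draw from `stubs` is a degree-proportional draw of a node.
    stubs = [v for e in edges for v in e]
    for new in range(m0, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.add((t, new))
            stubs.extend((t, new))
    return edges

def degree_counts(edges):
    """Map each degree k to the number of nodes that have k links."""
    degree = Counter(v for e in edges for v in e)
    return Counter(degree.values())

edges = grow_preferential(2000)
hist = degree_counts(edges)
```

Plotting `hist` on doubly logarithmic axes is the usual way to inspect whether the counts fall off roughly like a power of k; a single run of this size gives only a noisy impression of the exponent.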
It would be, however, premature to draw systematic consequences about
other network properties from such a power-law degree distribution. In fact,
there are many rules for network growth that are plausible in many areas
of application that indirectly lead to such a kind of preferential attachment,
but can lead to networks with properties that are otherwise rather different
from those of the schemes of Simon and Barabási–Albert. For instance, Jost–
Joy [28] investigated the “make-friends-with-the-friends-of-your-friends” rule
where a new node first forms one link with a randomly selected node in the
network and then preferentially makes further links with neighbors of that
node. Since the chance of a node being a neighbor of some randomly chosen
node depends on its degree, these subsequent links then also constitute some
preferential attachment, and the resulting degree distribution will follow a
power law. However, other properties of that network are rather different from
those obtained by the direct preferential attachment scheme. In particular,
because of the preference for local connections, the network diameter will be
typically much larger. Even the opposite scheme, where a node preferentially
forms additional links with nodes from which it has a large distance, does not
lead to a network with a very small diameter. For creating a network with a
small diameter, it is rather more efficient that nodes directly use preferential
attachment, that is, preferentially form links with other nodes that have a
high degree and are therefore well connected in the network. Of course, the
most efficient way to achieve a small diameter in a sparse network is to connect
every node to one single central node.
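The growth rule of Jost–Joy and the remark about the central node can be tried out with a small sketch. It simplifies the rule: the new node always takes its further links from the neighbors of the first contact, whereas the text only says it does so preferentially. The names `grow_friends_of_friends`, `extra_links`, `diameter`, and `star` are hypothetical helpers introduced here.

```python
import random
from collections import deque

def grow_friends_of_friends(n_nodes, extra_links=1, seed=0):
    """Each new node links to a uniformly random node and then to up to
    `extra_links` randomly chosen neighbors of that node."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    for new in range(2, n_nodes):
        first = rng.randrange(new)        # uniformly random existing node
        candidates = sorted(adj[first])   # friends of the first contact
        rng.shuffle(candidates)
        chosen = [first] + candidates[:extra_links]
        adj[new] = set()
        for t in chosen:
            adj[new].add(t)
            adj[t].add(new)
    return adj

def diameter(adj):
    """Largest shortest-path distance in a connected graph, found by
    breadth-first search from every node."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

def star(n_nodes):
    """Sparse graph in which every node is linked to the single center 0."""
    adj = {0: set(range(1, n_nodes))}
    for v in range(1, n_nodes):
        adj[v] = {0}
    return adj

fof = grow_friends_of_friends(300)
fof_diameter = diameter(fof)
star_diameter = diameter(star(50))
```

The star always has diameter 2, whatever its size, which is the sense in which a single central node is the most efficient sparse solution. The diameter of the friends-of-friends graph depends on the random seed and the size, so no particular value should be read into one run.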
Another crucial difference between a “make-friends-with-the-friends-of-
your-friends” network and a “the-rich-get-richer” network is that the first
eigenvalue of the make friends network will be much smaller, implying for
instance that dynamics on such a network are much more difficult to synchro-
nize, as will be explained below. In fact, spectral properties like the behavior
of the first eigenvalue of scale-free networks were analyzed in [3, 4], and it
was pointed out that the scaling exponent and the first eigenvalue are essen-
tially independent parameters for a network. Of course, when networks are
produced by a certain stochastic scheme or drawn from some probability dis-
tribution on the space of networks, then that scheme or distribution will also
lead to some typical spectral behavior, as systematically investigated in [29].
However, when we only know whether a network is scale free, we should be
careful about inferring other network properties. It might be a wiser strategy
to find out more about the underlying network evolution rule, like the above
“make-friends-with-the-friends-of-your-friends” principle, the Cameo princi-
ple of Blanchard–Krüger [12], or whatever is plausible in the given empirical
domain. One important class of rules for which there is much evidence in
various domains is the one of node duplications. That means that instead of
randomly attaching a new external node, we take some node i already present
in the network and double it in the sense that we create a new node i′ that
forms links with all or some of the neighbors of i. It may or may not also
form a link with i itself. Again, since the chance of another node j being a
neighbor of the randomly chosen node i, and therefore receiving a new connection
from i′, depends on the degree of j, we do get a preferential attachment
scheme. Again, however, as we shall see below, such a node duplication leads
to some specific spectral properties that are not shared by networks arising
from different schemes.
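A minimal sketch of one duplication step may help to see why the rule acts like preferential attachment. The helper names and the parameters `keep_prob` and `link_to_original` are assumptions introduced for this example; they stand for "all or some of the neighbors" and for the optional link between the original node and its copy.

```python
import random

def duplicate_node(adj, i, keep_prob=1.0, link_to_original=False, rng=random):
    """Add a copy of node i that inherits each neighbor of i with
    probability keep_prob; optionally link the copy to i itself."""
    new = max(adj) + 1
    inherited = {j for j in adj[i] if rng.random() < keep_prob}
    if link_to_original:
        inherited.add(i)
    adj[new] = set(inherited)
    for j in inherited:
        adj[j].add(new)
    return new

def ways_to_gain_a_link(adj, j):
    """Number of choices of i whose full duplication gives j a new link."""
    return sum(1 for i in adj if j in adj[i])

# Small example: a path 0 - 1 - 2 - 3 with an extra leaf 4 attached to 1.
adj = {0: {1}, 1: {0, 2, 4}, 2: {1, 3}, 3: {2}, 4: {1}}
gains = {j: ways_to_gain_a_link(adj, j) for j in adj}
copy_of_1 = duplicate_node(adj, 1)
```

Here `gains[j]` equals the degree of j: when the node to be duplicated is chosen uniformly at random, j gains a link exactly when one of its own neighbors is chosen, so its chance is proportional to its degree.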
There also exist other distinctions within the class of scale-free networks.
An important one is whether the nodes of high degree are assortative, i.e., pre-
fer connections with other high degree nodes, or disassortative, i.e., avoid con-
nections with high degree nodes and rather form links with low degree nodes.
\[ C := \frac{3 \times \text{number of triangles}}{\text{number of connected triples of nodes}}. \tag{1} \]
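The ratio in (1) can be evaluated directly from an adjacency structure. The sketch below is one straightforward way to do so; the function name `clustering_coefficient` (the name commonly used for this quantity) and the dictionary-of-sets representation are choices made for this example.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Evaluate C = 3 * (number of triangles) / (number of connected triples)."""
    closed = 0    # connected triples whose two end nodes are also linked
    triples = 0   # all connected triples, counted by their center node
    for v, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                closed += 1
    # Every triangle is found once from each of its three corners,
    # so `closed` already equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle_with_tail = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
```

For the triangle with one pendant node there are five connected triples, three of which are closed, so the ratio is 3/5.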
This operator is different from the algebraic graph Laplacian $Lv(i) := n_i v(i) - \sum_{j,\, j \sim i} v(j)$; see, e.g., [13, 14, 20, 32, 35]. In particular, the spectrum of Δ is
different from that of L; Δ, however, has the same spectrum as the Laplacian
investigated in [15] (in fact, the two operators are equivalent, differing only by
a multiplier). The normalized Laplacian is the operator underlying random
walks and conservative diffusion processes on graphs. Therefore, it seems to be
the more natural operator from a geometric or physical perspective. However,
the algebraic Laplacian does possess certain nice algebraic properties that
are not shared by the normalized Laplacian, like a trace formula, see [22].
Nevertheless, in our empirical studies, we have found that the Laplacian con-
sidered here seems to be a better tool for distinguishing different classes of
graphs by spectral properties.
We now recall some elementary properties, see, e.g., [15, 26]. The Laplacian
is symmetric for the product
\[ (u, v) := \sum_{i \in V} n_i\, u(i)\, v(i) \tag{3} \]
\[ \Delta u + \lambda u = 0. \tag{4} \]
\[ \lambda_k > 0 \tag{5} \]
\[ \lambda_0 = 0 < \lambda_1 \le \dots \le \lambda_{N-1}. \]
For the largest eigenvalue, we have
\[ \lambda_{N-1} \le 2. \tag{6} \]
In particular, the spectrum of Δ is always confined to the interval [0, 2], re-
gardless of the size of the graph. This is not true for the algebraic graph
Laplacian L, and this property of Δ allows for an easy comparison of the
spectra of graphs irrespective of their sizes.
We have equality in (6) iff the graph is bipartite. Thus, a single eigenvalue
determines the global property of bipartiteness. More generally, a graph is
bipartite iff whenever λ is an eigenvalue, then so is 2 − λ. Thus, the character-
istic spectral property of a bipartite graph is that its spectrum is symmetric
about 1.
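The symmetry of a bipartite spectrum about 1 can be checked by hand on a small example. The sketch assumes that the normalized Laplacian acts as (Δu)(i) = (1/n_i) Σ_{j∼i} u(j) − u(i), which is consistent with the eigenvalue convention Δu + λu = 0 and with a spectrum in [0, 2]; the helper names and the example graph (a path on four vertices) are choices made here. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

def laplacian(adj, u):
    """Apply (Δu)(i) = (1/n_i) * sum over neighbors j of u(j), minus u(i)."""
    return {i: sum(u[j] for j in adj[i]) / Fraction(len(adj[i])) - u[i]
            for i in adj}

def is_eigenpair(adj, u, lam):
    """Check Δu + λu = 0 exactly at every vertex, for a nonzero u."""
    du = laplacian(adj, u)
    return any(u.values()) and all(du[i] + lam * u[i] == 0 for i in adj)

# Path 0 - 1 - 2 - 3; it is bipartite with classes {0, 2} and {1, 3}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
half = Fraction(1, 2)

u = {0: Fraction(1), 1: half, 2: -half, 3: Fraction(-1)}
# Flipping the sign on one class turns the eigenvalue λ into 2 - λ.
flipped = {i: (-u[i] if i in (1, 3) else u[i]) for i in u}

constant = {i: Fraction(1) for i in path}
alternating = {i: Fraction(1 if i in (0, 2) else -1) for i in path}
```

With these functions, `u` belongs to the eigenvalue 1/2 and `flipped` to 3/2, while the constant and the alternating function belong to 0 and 2; together these four values make up a spectrum that is symmetric about 1.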
For instance, for a complete graph of N vertices,
\[ \lambda_1 = \dots = \lambda_{N-1} = \frac{N}{N-1}, \tag{7} \]
that is, there is only one nontrivial eigenvalue, $\frac{N}{N-1}$, occurring with multiplicity
$N - 1$. Among all graphs with $N$ vertices, this is the largest possible
value for $\lambda_1$ and the smallest possible value for $\lambda_{N-1}$. Thus, the characteristic
spectral property of complete graphs is that there is this eigenvalue with the
highest possible multiplicity.
Many qualitative properties of graphs can be characterized by inequalities
or other relationships between their eigenvalues. For instance, Monasson [36]
carried out a systematic investigation of the spectrum of a small-world graph
as the superposition of a regular ring and a random graph. Also, [23] develops
a method for (re)constructing a graph from its spectrum. We should point
out, however, that in general it is not possible to uniquely determine a graph
from its spectrum. In fact, there exist isospectral graphs, that is, different
graphs with the same eigenvalues. For instance, all complete bipartite graphs
with the same number N of vertices have the same eigenvalues. Actually, they
possess the eigenvalues 0 and 2 with multiplicity 1 and the eigenvalue 1 with
multiplicity N − 2. Any graph with that spectrum is a complete bipartite
graph, but the two vertex classes may have any sizes N1, N2 with
N1 + N2 = N, so complete bipartite graphs with different class sizes are isospectral.
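The isospectrality of complete bipartite graphs can be illustrated by writing down explicit eigenfunctions. The sketch uses the averaged form of the eigenvalue equation, (1/n_i) Σ_{j∼i} u(j) = (1 − λ) u(i); the helper names and the particular choice of eigenfunctions for the eigenvalue 1 (differences of two vertices within one class) are assumptions of this example.

```python
from fractions import Fraction

def complete_bipartite(n1, n2):
    """Return the adjacency sets and the two vertex classes of K_{n1,n2}."""
    side_a = list(range(n1))
    side_b = list(range(n1, n1 + n2))
    adj = {i: set(side_b) for i in side_a}
    adj.update({j: set(side_a) for j in side_b})
    return adj, side_a, side_b

def is_eigenfunction(adj, u, lam):
    """Check (1/n_i) * sum_{j~i} u(j) == (1 - lam) * u(i) at every vertex."""
    return all(Fraction(sum(u[j] for j in adj[i]), len(adj[i])) == (1 - lam) * u[i]
               for i in adj)

def difference_functions(side, vertices):
    """u = +1 at one vertex, -1 at the next vertex of the same class, 0 elsewhere."""
    result = []
    for a, b in zip(side, side[1:]):
        u = {v: 0 for v in vertices}
        u[a], u[b] = 1, -1
        result.append(u)
    return result

def check(n1, n2):
    adj, side_a, side_b = complete_bipartite(n1, n2)
    constant = {v: 1 for v in adj}
    signed = {v: (1 if v in side_a else -1) for v in adj}
    ones = difference_functions(side_a, adj) + difference_functions(side_b, adj)
    ok = (is_eigenfunction(adj, constant, 0)
          and is_eigenfunction(adj, signed, 2)
          and all(is_eigenfunction(adj, u, 1) for u in ones))
    return ok, len(ones)

star_result = check(1, 3)     # K_{1,3}, N = 4
square_result = check(2, 2)   # K_{2,2}, N = 4
```

Both graphs have N = 4 vertices, and in both cases the construction yields N − 2 = 2 linearly independent eigenfunctions for the eigenvalue 1, in addition to those for 0 and 2.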
We now rewrite the eigenvalue equation (4) as
\[ \frac{1}{n_i} \sum_{j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i. \tag{8} \]
The converse also holds, except for the case λ = 1 when (9) holds at all points
regardless of whether the eigenfunction vanishes there or not.
We now consider motifs, that is, small subgraphs of Γ of a particular type,
and analyze what happens to the spectrum when performing some natural
operations with motifs. As our motif, we take some graph Λ.
We start with motif joining: Here, the motif Λ is a graph that is inde-
pendent of Γ . Let j0 be a vertex of Λ. We assume that Λ has eigenvalue λ
and an eigenfunction uλ that vanishes at j0 , i.e., uλ (j0 ) = 0. We then form
a graph Γ̄ by identifying the vertex j0 with an arbitrary vertex i of Γ . The
new graph then also possesses the eigenvalue λ, with an eigenfunction that
agrees with uλ on Λ and vanishes at the other vertices, that is, those coming
from Γ . Thus, a motif Λ can be joined to an existing graph with a preserved
eigenvalue and a localized eigenfunction when the joining occurs at one (or
several) vertices where that eigenfunction vanishes.
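A small worked case of motif joining, checked with exact arithmetic: the motif Λ is a path a – j0 – b, which has the eigenvalue 1 with an eigenfunction vanishing at j0, and Γ is a triangle. The labels, the helper names `join_at` and `is_eigenfunction`, and the choice of graphs are assumptions of this example; the eigenvalue condition is used in the averaged form (1/n_i) Σ_{j∼i} u(j) = (1 − λ) u(i).

```python
from fractions import Fraction

def is_eigenfunction(adj, u, lam):
    """Check (1/n_i) * sum_{j~i} u(j) == (1 - lam) * u(i) at every vertex."""
    return any(u.values()) and all(
        Fraction(sum(u[j] for j in adj[i]), len(adj[i])) == (1 - lam) * u[i]
        for i in adj)

def join_at(gamma, motif, i, j0):
    """Glue `motif` onto `gamma` by identifying motif vertex j0 with vertex i.
    Apart from j0, the two vertex sets are assumed to be disjoint."""
    def rename(v):
        return i if v == j0 else v
    joined = {v: set(nbrs) for v, nbrs in gamma.items()}
    for v, nbrs in motif.items():
        joined.setdefault(rename(v), set()).update(rename(w) for w in nbrs)
    return joined

gamma = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}             # triangle
motif = {"a": {"j0"}, "j0": {"a", "b"}, "b": {"j0"}}  # path a - j0 - b
u_motif = {"a": 1, "j0": 0, "b": -1}                  # vanishes at j0

joined = join_at(gamma, motif, i=0, j0="j0")
# Extend u_motif by zero on all vertices coming from gamma.
u_joined = {v: u_motif.get(v, 0) for v in joined}
```

The triangle alone has only the eigenvalues 0 and 3/2, so the eigenvalue 1 of the joined graph comes from the motif, and its eigenfunction is supported on the two vertices a and b.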
We next consider motif duplication: Here, the motif Λ is a subgraph of Γ ,
with vertices j1 , . . . , jm . Let the function u on the vertex set of Λ satisfy
\[ \frac{1}{n_i} \sum_{j \in \Lambda,\, j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i \in \Lambda \text{ and some } \lambda, \tag{10} \]
h(Γ ) ≤ 4 (18)
Here, f : [0, 1] → [0, 1] is some function; the functions we have in mind are
those whose iteration generates some chaotic dynamics, like the logistic map
\[ \frac{1 - e^{-\mu_0}}{\lambda_1} < \epsilon < \frac{1 + e^{-\mu_0}}{\lambda_{N-1}}. \tag{24} \]
In practice, the left inequality, the one involving λ1 , is the crucial one here. In
particular, when the eigenvalues satisfy appropriate conditions, we can have
a stable synchronized solution that is chaotic (μ0 > 0).
Note that the first eigenvalue even determines the synchronization of dy-
namics with transmission delays between the nodes, see [5].
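To see how the two eigenvalues enter condition (24), the following sketch evaluates both bounds for two small graphs. It takes the middle quantity in (24) to be a coupling strength and μ0 to be the Lyapunov exponent of the map f; both readings, the function name `coupling_window`, and the choice of the logistic map f(x) = 4x(1 − x), whose Lyapunov exponent is ln 2, are assumptions of this example.

```python
import math

def coupling_window(lambda_1, lambda_max, mu_0):
    """Return the open interval (low, high) of coupling strengths allowed by
    the two-sided condition, or None when that interval is empty."""
    low = (1 - math.exp(-mu_0)) / lambda_1
    high = (1 + math.exp(-mu_0)) / lambda_max
    return (low, high) if low < high else None

mu_0 = math.log(2)  # Lyapunov exponent of the logistic map f(x) = 4x(1 - x)

# Complete graph on 4 vertices: lambda_1 = lambda_{N-1} = 4/3.
complete_4 = coupling_window(4 / 3, 4 / 3, mu_0)

# Path on 4 vertices: lambda_1 = 1/2 and lambda_{N-1} = 2.
path_4 = coupling_window(1 / 2, 2, mu_0)
```

For the complete graph the window is roughly (0.375, 1.125), whereas for the path the lower bound exceeds the upper one and no admissible value remains. This matches the remark that a small first eigenvalue makes synchronization difficult.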
[Figure panels (a)–(d); horizontal axis from 0 to 2]
[Fig. 2, panels (a)–(d); horizontal axis from 0 to 2]
Fig. 2. (a) Foodweb network from “Florida bay in wet season”. Data downloaded
from http://vlado.fmf.uni-lj.si/pub/networks/data (main data resource: Chesapeake
Biological Laboratory. Web link: http://www.cbl.umces.edu/). [Download date 21
Dec. 2006]. Network size 128. (b) Foodweb network from “Ythan estuary”. Data
downloaded from http://www.cosinproject.org. [Download Date 21 Dec. 2006]. Net-
work size 135. (c) The network of hyperlinks between weblogs on US politics,
recorded in 2005 by Adamic and Glance [1]. Network size 1222. Data down-
loaded from http://www-personal.umich.edu/∼mejn/netdata [Download date: 23
April 2007]. (d) Neuronal connectivity of Caenorhabditis elegans. Network size 297.
Data used in [49, 50]. Data Source: http://cdg.columbia.edu/cdg/datasets [Down-
load date: 18 Dec. 2006]. (e) E-mail interchanges between members of the
University Rovira i Virgili (Tarragona) [21]. Network size 1133. Data downloaded from
http://deim.urv.cat/∼aarenas/data/welcome.htm [Download date: 21 March, 2007].
[Fig. 2, panel (e); horizontal axis from 0 to 2]
Fig. 2. (Continued)
[Fig. 3, panels (a)–(d); horizontal axis from 0 to 2]
Fig. 3. (a) Topology of the Western states power grid of the United States [49]. Net-
work size 4941. Data downloaded from http://cdg.columbia.edu/cdg/datasets [Down-
load date: 1 March 2007]. (b) Jazz band network. Nodes represent jazz bands. Two
bands are connected if the same musician played in both. Network size 198.
Data downloaded from http://deim.urv.cat/∼aarenas/data/welcome.htm [Download
date: 17 March 2008]. Data used in [19]. (c) Co-authorships between scientists posting
preprints on the High-Energy Theory E-Print Archive, http://arxiv.org/archive/hepth
between 1 Jan. 1995 and 31st Dec. 1999 [37]. Network size 5835. (d) Co-authorships of
scientists working on network theory and experiment [38]. Network size 379. (c,d)
Data downloaded from http://www-personal.umich.edu/∼mejn/netdata [Download
date: 23 April 2007].
[Fig. 4, panels (a)–(c); horizontal axis from 0 to 2]
Fig. 4. Electronic circuits. (a) With size = 122. (b) With size = 252. (c) With
size = 512. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon [Down-
load date: 15 March 2005]. Data used in [33].
References
1. L.A. Adamic and N. Glance, The political blogosphere and the 2004 US election:
Divided they blog, in Proceedings of the WWW-2005 Workshop on the Weblogging
Ecosystem (2005)
2. R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of
Modern Physics 74, 2002, 47–97
3. F.M. Atay, T. Bıyıkoğlu, J. Jost, Synchronization of networks with prescribed
degree distributions, IEEE Trans. Circuits and Systems I 53(1), 2006, 92–98
4. F.M. Atay, T. Bıyıkoğlu, J. Jost, Network synchronization: Spectral versus statis-
tical properties, Phys. D 224, 2006, 35–41
5. F.M. Atay, J. Jost, A. Wende, Delays, connection topology, and synchronization
of coupled chaotic maps, Phys. Rev. Lett. 92(14), 2004, 144101
33. R. Milo et al., Network motifs: Simple building blocks of complex networks, Science
298, 2002, 824–827
34. R. Milo et al., Superfamilies of evolved and designed networks, Science 303, 2004,
1538–1542
35. B. Mohar, Some applications of Laplace eigenvalues of graphs, in: G. Hahn,
G. Sabidussi (eds.), Graph Symmetry: Algebraic Methods and Applications, 227–
277, Springer, Berlin, 1997
36. R. Monasson, Diffusion, localization and dispersion relations on “small-world”
lattices, Europ. Phys. J. B 12, 1999, 555–567
37. M.E.J. Newman, The structure of scientific collaboration networks, Proc. Natl.
Acad. Sci. USA 98, 2001, 404–409
38. M.E.J. Newman, Finding community structure in networks using the eigenvectors
of matrices, Phys. Rev. E 74, 2006, 036104
39. M. Newman, The structure and function of complex networks, SIAM Review 45,
2003, 167–256
40. S. Ohno, Evolution by Gene Duplication, Springer, Berlin, 1970
41. L.M. Pecora, T.L. Carroll, Synchronization in chaotic systems, Phys. Rev. Lett.
64, 1990, 821–824
42. A. Pikovsky, M. Rosenblum, J. Kurths, Synchronization – A Universal Concept
in Nonlinear Science, Cambridge University Press, Cambridge, 2001
43. G. Rangarajan, M.Z. Ding, Stability of synchronized chaos in coupled dynamical
systems, Phys. Lett. A 296, 2002, 204–212
44. H. Simon, On a class of skew distribution functions, Biometrika 42, 1955, 425–440
45. R. Solé et al., A model of large scale proteome evolution, Adv. Compl. Syst. 5,
2002, 43–54
46. A. Vazquez et al., Modelling of protein interaction networks, ComPlexUs 1, 2003,
38–44
47. A. Wagner, How the global structure of protein interaction networks evolves, Proc.
Roy. Soc. B 270, 2003, 457–466
48. A. Wagner, Evolution of gene networks by gene duplications — A mathematical
model and its implications on genome organization, Proc. Nat. Acad. Sciences
USA 91(10), 1994, 4387–4391
49. D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature
393, 1998, 440–442
50. J.G. White et al., The structure of the nervous system of the nematode Caenorhab-
ditis elegans, Phil. Trans. Royal Soc. of London Series B-Bio. Sc. 314, 1986, 1–340
51. P. Zhu, R.C. Wilson, A study of graph spectra for comparing graphs. In Proc. of
British Machine Vision Conf. (BMVC), Sep 2005
52. K.H. Wolfe, D.C. Shields, Molecular evidence for an ancient duplication of the
entire yeast genome, Nature 387(6634), 1997, 708–713
Dynamics of Social Complex Networks: Some
Insights into Recent Research
Sergi Lozano
ETH Zurich, Swiss Federal Institute of Technology, UNO D11, Universitätstr. 41,
8092 Zurich, Switzerland; slozano@ethz.ch
that shape the network itself (i.e., the sudden emergence of certain
structural features is observed when an external parameter exceeds a
certain threshold value).
Regarding the fulfillment of this list of requirements by social networks,
Vega-Redondo refers to the results of previous studies about social structure
to confirm that social networks satisfy the first two. Following the same rea-
soning, we notice that the other two requirements (covering dynamic aspects)
are repeatedly recognized in social phenomena, for instance, collective behav-
ior and social mobilization [8, 9] (third point), or the emergence of hierarchical
social structures from interactions at an individual level [10, 11] (fourth point).
Having confirmed that social networks are indeed complex networks, in this
chapter we will focus on the dynamic aspects of this complexity (the last two
points in the checklist above). More concretely, we will overview some of the
recent research that addresses dynamics on and of social networks from the
perspective of complex systems. The rest of the chapter is structured as fol-
lows. The second section is devoted to works dealing, as separate topics, with
the analysis of social phenomena over static social networks and with the time
evolution of the social structure. The third section focuses on the coevolution
of social structure and phenomena, stressing the importance of this interplay
from the complexity viewpoint. Finally, the last section summarizes the whole
chapter and points out some ideas about the future evolution of the field.
the influence of social networks and the individual’s role in the evolution
of different social phenomena. A good example of this can be found in the
research devoted to diffusion of innovations [13, 14].
This perspective has resulted in an in-depth knowledge of the most im-
portant structural characteristics of social networks and their influence on
the behavior of the social actors, as has been recognized by scholars recently
entering the field from complexity science [6, 7] (although some ‘traditional’
social network analysts claim that this effort by the ‘newcomers’ is not quite
appreciable [1]). The incorporation of these ‘newcomers’ has not changed this
orientation, but has reinforced it by contributing new analyses and modeling
methodologies.
The works ensuing from this combination of tools and perspectives have
uncovered very relevant results. Some of them, for example, have related the
emergence and resilience of cooperation in social groups with certain structural
features of its social network, such as the degree heterogeneity [15, 16, 17]
or the community structure [18]. Others have shown that scale-freeness and
the small-world phenomenon can influence the consensus time of opinions
in a population [19, 20] and even force scenarios with coexisting domains
of opposite opinions [21]. In order to further understand the various tools
and perspectives developed for explaining and modeling social networks, it is
useful to resort to the exhaustive recent reviews on game theory [22], opinion
dynamics [12, 23], language dynamics [12] or spreading phenomena [5].
Finally, as a sample of work addressing social dynamics on networks, the
first chapter of Part in this book presents a work centered on the study of
epidemic spreading [24]. In this work, the authors apply a mesoscopic (neither
individual nor global, but intermediate) structural approach to predict and
understand the spreading of an incurable disease (like HIV) over an empiri-
cal static network. First, they study the division into subnetworks or regions
of a real social network of sexual contacts obtained by means of interviews.
They also deduce qualitative predictions about infection spreading from the
observed topological features. Second, they use a computational model to
test these predictions numerically and to design possible protection strategies
suitable in this particular case. This work represents an important contribution
to the literature on disease spreading, since it highlights the analysis
and visualization possibilities of mesoscopic approximations.
The second separate approach that we are going to consider in this section
is based on the study of network processes, that is, “series of events that
create, sustain and dissolve social structures” [25]. Logically, this sort of study
requires the use of time in addition to the structural description. However,
in the past, social network analysis mainly focused on the study of static
social networks and their influence over individual and collective behavior [6].
Borgatti [26] argues that one of the reasons behind such an orientation was the
difficulty of obtaining longitudinal empirical data. As has been pointed out in the
Introduction, this scenario has changed lately with the increasing availability
of large social datasets obtained from different information and communi-
cation technologies (email traffic, mobile phone calls, activities within peer-
to-peer systems, social media and social networking websites, etc.). Taking
advantage of this new availability, scholars have developed different method-
ologies to understand the evolution of social networks using these data as
input [25, 27, 28].
The (generally) large size of these datasets has given rise to especially in-
teresting applications from a complex network perspective. On one side, we
find works that try to deduce the basic mechanisms ruling social network pro-
cesses. To do that, the authors analyze the evolution of these social datasets
from a statistical point of view (macroscopic level) [4], focusing on their mod-
ular structure (mesoscopic or intermediate level) [29, 30, 31], or addressing
key individual properties such as centrality (microscopic level) [32].
On the other side, following the example of seminal works by Watts and
Strogatz [33] and Barabási and Albert [34], datasets are also used to validate
simple models based on single mechanisms that forge complex social-like
features. In these works, empirical data are compared with the models’
simulations in terms of structural parameters at different topological scales. For
example, some of these works present extensions of Barabási’s preferential
attachment models and are focused on the degree distribution [35, 36]. Oth-
ers present variants of the seceder model (where the mechanism conditioning
topological evolution is based on each agent’s efforts to differentiate from the
crowd) [37]. Finally, in Ref. [38] the authors propose a model where each agent
is assigned a set of social values (representing different social attributes), and
ties are established as a function of the social distances among agents
(differences between their social attributes) and α, a parameter quantifying the
homophily in the system (the individuals’ preference to establish and maintain
links with other individuals they feel similar to). Interestingly, for different
values of α the resulting social network presents different modular structures,
while preserving general topological features of social networks (such as as-
sortativity or high clustering).
these actions (like cooperation and diffusion of habits, for example). On the
other side, the stronger the friendship relation between two people, the more
probable it is that they introduce each other to new friends, modifying their
mutual ‘friendship local neighborhood’ and, consequently, the whole structure of
the network. In general, networks exhibiting such a feedback loop are called
coevolutionary or adaptive networks [40].
This interdependency has clear implications from a complexity point of
view. If structural patterns of social networks can induce nonlinearity in social
phenomena evolving over them and, likewise, social network processes forge
the emergence of complex structural features, a coevolutive scheme has to lead,
necessarily, to scenarios exhibiting extremely rich behaviors. In their recent
review on adaptive networks, Gross and Blasius [40] support this assertion by
reporting a list of four ‘hallmarks’ typically presented by adaptive networks
in general (and social networks in particular):
• Self-organization towards a dynamical critical state.
• Emergence of ‘specialized’ roles from an initially homogeneous population.
• Formation of complex global topologies (even from very simple local rules).
• Highly complex macroscopical dynamics due to the interaction of local
states and topological complexity.
In the following, we will review some recent works that have addressed
interesting sociological topics from an adaptive networks’ perspective. We will
also identify some of the preceding hallmarks in the referred examples.
In Ref. [41], Skyrms and Pemantle claim to “(..) create models that are more
true to life (..)” by incorporating coevolution between structure and strategies
in evolutionary game theory models. Since then, some authors have proposed
models where players’ strategies depend on the structure but, at the same
time, they can modify the connectivity in their local neighborhood in order to
maximize the payoff of a certain strategy (modifying, as an aggregated effect,
the whole topology at the macroscopic level).
Cooperation among individuals and, more concretely, the evolution of one-
shot versions of the Prisoner’s Dilemma played over adaptive networks, have
been intensively studied. In the results of these works, we can find some of the
four ‘complexity hallmarks’ in coevolving networks listed previously. For ex-
ample, in some cases the authors identify the formation of scale-free topologies
(which present a power-law distribution) [42, 43] and the emergence of differ-
entiated roles and hierarchies [42, 44, 45]. Moreover, regarding system dynam-
ics, Ebel and Bornholdt [46], Eguiluz and co-workers [42] and Zimmermann
and Eguiluz [44] report large avalanches of strategy changes when the system
approaches the final state, identifying a sort of self-organized critical behavior.
As a particular case, [47] analyzes a scenario where topological changes
occur much faster than changes of individuals’ strategies. The authors find
Opinion and cultural dynamics are other important social topics which have
been addressed from a coevolutive viewpoint.
We also mention the work presented in [65], where the authors propose an
innovative coevolutionary model of HIV infection spreading through the use
of dynamic complex networks. On one hand, the state of each individual (her
health situation) is determined by means of a Markov process that takes into
account both topological data (such as the number of infected neighbours)
and information regarding the HIV infections (probability of infection and
progression from HIV to AIDS, for instance). On the other hand, the social
structure of the population is defined at each time step as a function of certain
statistical features and the state of nodes (nodes with AIDS are removed from
the network). The authors find a good correspondence between simulation
results and real demographic historical epidemiological data from the United
States. Moreover, this epidemiological prediction model could be integrated
in related decision support systems (regarding anti-drug policy, for instance).
4 Conclusions
References
1. Freeman, L.C.: The Development of Social Network Analysis: A Study in the
Sociology of Science. Empirical Press, Vancouver (BC Canada) (2004).
2. Scott, J.: Social Network Analysis: A Handbook. SAGE Publications, London
(2000).
3. Wasserman, S., Faust, K.: Social Networks Analysis: Methods and Applications.
Cambridge University Press, New York (1994).
4. Holme, P., Edling, C.R., Liljeros, F.: Structure and time evolution of an Internet
dating community. Social Networks 26, 155–174 (2004).
5. Vega-Redondo, F.: Complex Social Networks. Cambridge University Press, New
York (2007).
6. Watts, D.J.: Six Degrees: The Science of a Connected Age. W. W. Norton &
Company Inc., New York (2003).
7. Barabási, A.-L.: Linked: The New Science of Networks. Perseus Publishing,
Cambridge (USA) (2002).
8. Coleman, J.: Foundations of Social Theory. Harvard University Press, Cambridge,
MA (1990).
9. Gould, R.V.: Collective action and network structure. American Sociological Re-
view 58 (2), 182–196 (1993).
10. Gould, R.V.: The origins of status hierarchies: A formal theory and empirical test.
American Journal of Sociology 107 (5), 1143–1178 (2002).
11. Epstein, J.M.: Generating classes without conquest. In: Generative Social Science:
Studies in Agent-Based Computational Modeling. Princeton University Press,
Princeton, NJ (2007).
12. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics.
Reviews of Modern Physics (Accepted) 348 (2008).
13. Rogers, E.M.: Diffusion of Innovations (5th ed.). Free Press, New York (2003).
14. Valente, T.W.: Models and methods for innovation diffusion. In: Carrington, P.,
Scott, J., Wasserman, S. (ed) Models and Methods in Social Network Analysis.
Cambridge University Press, New York (2005).
15. Abramson, G., Kuperman, M.: Social games in a social network. Phys. Rev. E 63,
030901 (2001).
16. Duran, O., Mulet, R.: Evolutionary prisoners dilemma in random graphs. Physica
D 208 (3–4), 257–265 (2005).
17. Santos, F.C., Pacheco, J.M., Lenaerts, T.: Evolutionary dynamics of social dilem-
mas in structured heterogeneous populations. Proc. Natl. Acad. Sci. 103, 3490–
3494 (2006).
18. Lozano, S., Arenas, A., Sanchez, A.: Mesoscopic structure conditions the emer-
gence of cooperation on social networks. PLoS ONE 3(4): e1892 doi: 10.1371/
journal.pone.0001892 (2008).
19. Castellano, C., Loreto, V., Barrat, A., Cecconi, F., Parisi, D.: Comparison of voter
and Glauber ordering dynamics on networks. Phys. Rev. E 71 (6), 066107 (2005).
20. Sood, V., Redner, S.: Voter model on heterogeneous graphs. Phys. Rev. Lett. 94
(17), 178701 (2005).
21. Castellano, C., Vilone, D., Vespignani, A.: Incomplete ordering of the voter model
on small-world networks. Europhys. Lett. 63 (1), 153–158 (2003).
22. Szabó, G., Fáth, G.: Evolutionary games on graphs. Phys. Rep. 446 (4–6), 97–216
(2007).
46. Ebel, H., Bornholdt, S.: Coevolutionary games on networks. Phys. Rev. E 66,
056118 (2002).
47. Pacheco, J.M., Traulsen, A., Nowak, M.A.: Coevolution of strategy and structure
in complex networks with dynamical linking. Phys. Rev. Lett. 97, 258103 (2006).
48. Rosvall, M., Sneppen, K.: Dynamics of opinions and social structures.
arXiv:0708.0368v2 [physics.soc-ph] (2007).
49. Marsili, M., Vega-Redondo, F., Slanina, F.: The rise and fall of a networked society:
A formal model. Proc. Nat. Acad. Sci. 101, 1439–1442 (2004).
50. Ehrhardt, G.C.M.A, Marsili, M., Vega-Redondo, F.: Phenomenological models of
socioeconomic network dynamics. Phys. Rev. E 74, 036106 (2006).
51. Holme, P., Ghoshal, G.: Dynamics of networking agents competing for high cen-
trality and low degree. Phys. Rev. Lett. 96, 098701 (2006).
52. König, M.D, Battiston, S., Napoletano, M., Schweitzer, F.: On algebraic graph
theory and the dynamics of innovation networks. Networks and Heterogeneous
Media 3 (2) 201–220 (2007).
53. Rosvall, M., Sneppen, K.: Modeling self-organization of communication and topol-
ogy in social networks. Phys. Rev. E 74, 016108 (2006).
54. Centola, D., González-Avella, J.C., Eguíluz, V.M., San Miguel, M.: Homophily,
cultural drift, and the co-evolution of cultural groups. J. of Conflict Resolution 51
(6), 905–929 (2007).
55. Axelrod, R.: The dissemination of culture: A model with local convergence and
global polarization. The Journal of Conflict Resolution 41 (2), 203–226 (1997).
56. Benczik, I.J., Benczik, S.Z., Schmittmann, B., Zia, V.: Lack of consensus in social
systems. EPL 82, 48006 (2007).
57. Vázquez, F., Eguíluz, V.M., San Miguel, M.: Generic absorbing transition in
co-evolution dynamics. Phys. Rev. Lett. 100, 108702 (2007).
58. Zanette, D.H., Gil, S.: Opinion spreading and agent segregation on evolving net-
works. Phys. D 224, 156–165 (2006).
59. Gil, S., Zanette, D.H.: Coevolution of agents and networks: Opinion spreading and
community disconnection. Phys. Lett. A 356, 89–95 (2006).
60. Liggett, T.M.: Interacting Particle Systems. Springer, New York (1985).
61. Holme, P., Newman, M.E.J.: Nonequilibrium phase transition in the coevolution
of networks and opinions. Phys. Rev. E 74, 056108 (2006).
62. Gross, T., D’Lima, C.J.D., Blasius, B.: Epidemic dynamics on an adaptive net-
work. Phys. Rev. Lett. 96, 208701 (2006).
63. Gross, T., Kevrekidis, I.G.: Coarse-graining adaptive coevolutionary network dy-
namics via automated moment closure. arXiv:nlin/0702047v1 [nlin.AO] (2007).
64. Zanette, D.: Coevolution of agents and networks in an epidemiological model.
arXiv:0707.1249v2 [physics.soc-ph] (2007).
65. Sloot, P.M.A., Ivanov, S.V., Boukhanovsky, A.V., Vijver, D., Boucher, C.A.:
Stochastic simulation of HIV population dynamics through complex network mod-
eling, Int. J. of Computer Mathematics 85 (8), 1175–1187 (2008).
66. Borgatti, S.P., Molina, J.L.: Toward ethical guidelines for network research in
organizations. Social Networks. 27 (2), 107–117 (2005).
67. Birnbaum, M.H.: Methodological and ethical issues in conducting social psychol-
ogy research via the Internet. In: Sansone, C., Morf, C.C., Panter, A.T. (ed) Hand-
book of Methods in Social Psychology. Sage, Thousand Oaks, CA (2004).
The Structure and Dynamics of Linguistic
Networks
1 Introduction
Human beings are unique in the biological world: they are the only organisms
known to be capable of thinking, communicating and preserving a potentially
infinite number of ideas that form the pillars of modern civilization. This
ability is a consequence of the complex and powerful human languages,
characterized by their recursive syntax and compositional semantics [40]. It
has been argued that language is a dynamic complex adaptive system that has
evolved through self-organization to serve human communication needs [80].
The complexity of human languages has long attracted the attention of
physicists, who have tried to explain several linguistic phenomena through
models of physical systems (see, e.g., [32, 42]).
Like any physical system, a linguistic system (i.e., a language) can be
viewed from three different perspectives [52]. On one extreme, a language is a
collection of utterances that are produced by the speakers of a linguistic com-
munity during the course of their interactions with other speakers of the same
community. This is analogous to the microscopic view of a thermodynamic
system, where every utterance and its corresponding context contributes to
the identity of the language, i.e., the grammar. On the other extreme, a lan-
guage can be characterized by a set of grammar rules and a vocabulary. This
is analogous to a macroscopic view. Sandwiched between these two extremes,
one can also conceive of a mesoscopic view of language, where linguistic enti-
ties, such as the letters, words or phrases are the basic units and the grammar
is an emergent property of the interactions among them.
Complex networks provide a suitable framework to model and study the
structure and dynamics of linguistic systems from a mesoscopic perspec-
tive. Although multi-agent simulation is the preferred modeling paradigm for
microscopic studies in linguistics (see e.g., [15, 80]), there have been some
works where networks are also involved. For instance, in [67], the interaction
patterns between the agents are modeled as a social network, and the diffusion
of linguistic innovations (which are key to language change) is studied on
various network topologies. This survey is confined to works on linguistic
networks at the mesoscopic level only.
There has been a plethora of works on linguistic networks with various
motivations and at various levels of linguistic structure. On the basis of the
primary goal of the research, the work in this area can be broadly classified
into two categories: (1) those which investigate the structural properties of
language from the perspective of language evolution and, thereby, explain
the emergence of certain universal characteristics of languages, and (2) those
which try to exploit the network-based representations to develop certain
useful practical systems such as machine translation, information retrieval
and summarization systems. This article focuses on the former works, but a
brief overview of the latter is also presented in Section 5.
The survey is organized from the perspective of linguistic structure.
Section 2 describes lexical networks, where the nodes are words and edges
represent the lexical relationship between two words such as phonetic and
semantic similarity. In Section 3 we present an overview of various networks
where again the nodes are the words, but unlike the case of lexical networks,
the edges represent their co-occurrences in similar context. These networks
are representations of the interactions among words as governed by the gram-
mar rules of a language. Section 4 describes the phonological networks, where
the nodes are sub-lexical units such as phonemes or syllables. Applications
of linguistic networks in natural language processing (NLP) and information
retrieval (IR) are discussed in Section 5. Section 6 concludes the survey by
enumerating some open problems in the area of linguistic networks.
2 Lexical Networks
The phrase “mental lexicon” (ML) usually refers to the repository of word
forms that is assumed to reside in the human brain. The average size of the
receptive vocabulary for a normal high school student has been found to be
more than 100,000 [63]. Quite surprisingly, speakers are capable of navigating
this huge lexicon very efficiently: judging whether a word form is legitimate
takes less than 100 milliseconds. Consequently, two important questions arise
about ML: (a) how the words are stored in long-term memory, i.e., how ML is
organized, and (b) how these words are retrieved from ML. Note that these
questions are highly interrelated: to predict the organization, one can
investigate how words are retrieved from ML, and vice versa.
One of the earliest attempts to model the organization of ML was made
in [13]. In this work, the authors propose a hierarchical structure of ML, where
the concepts are arranged in the form of a tree and the attributes of a partic-
ular concept in this tree can be inherited by all the child concepts. Figure 1
shows a representative example formed from the concepts “animal”, “mam-
mal” and “fish”. While early studies like [13] focused mainly on representation
of the local structure of ML, its global structure remained largely unexplored.
Recently, researchers have also started to investigate the global structure of
ML primarily within the framework of complex systems and, more specifically,
complex networks (see [36, 45, 77, 83, 86] for reference). In all of these studies
ML is modeled as a web of interconnected nodes, where each node corresponds
to a word form and the interconnections may be based on any one (or more)
of the following:
• Phonological similarity (e.g., the words banana, bear and bean may be
connected since they start with the same phoneme),
• Semantic similarity (e.g., the words banana, apple and pear may be con-
nected since all of them are names of fruits),
• Frequency of usage,
• Age at which the word forms are acquired,
• Parts of speech, and
• Orthographic properties.
In the rest of this section we review one representative study each (refer-
ring, wherever applicable, to the other relevant ones) of such complex networks
constructed based on (a) phonological, (b) semantic, and (c) orthographic
similarities of the word forms. Syntactic similarity-based networks will be dis-
cussed in detail in the next section.
Phonological similarity among the word forms has been extensively studied
in the past to infer the structure of ML and, consequently, the nature of a
linguistic system [4, 35, 71, 81]. This large-scale phonological ML has also
been studied in the framework of complex networks in which the word forms
represent the nodes and two nodes (read words) are connected by an edge
if they differ only by the addition, deletion or substitution of one or more
phonemes [36, 45, 83, 86]. [45] reports one of the most popular studies, where
148 M. Choudhury and A. Mukherjee
and semantic relationships between them are represented through the edges.
In [77] the authors analyze the structure of the nouns in the English WordNet
database (version 1.6). The semantic relationships between the nouns can
be primarily of four types: (i) hypernymy/hyponymy (e.g., animal/cat), (ii)
antonymy (e.g., day/night), (iii) meronymy/holonymy (e.g., trunk/tree) and
(iv) polysemy (e.g., the concepts “the main stem of a tree”, “the body exclud-
ing the head and neck and limbs”, “a long flexible snout as of an elephant”
and “luggage consisting of a large strong case used when traveling or for stor-
age” are connected to each other due to the polysemous word “trunk” which
can mean all of these). Some of the important findings of this work are as
follows.
• Semantic relationships are scale invariant.
• The hypernymy tree forms the skeleton of the network.
• Inclusion of polysemy reorganizes the network into a small world.
• The nodes with the most traffic (i.e., nodes with the maximum number
of paths passing through them) correspond to those concepts which are
expressed by the most polysemous words. They are also found to have very
high clustering coefficients.
• In the presence of polysemous edges, the distance between two nodes across
the network is not in correspondence with the depth at which they are
found in the hypernymy tree.
Further references to the studies on such semantic relationship-based networks
can be found in [1, 82]. Although there are several works attempting to analyze
the structure of the semantic network of words, one hardly finds any study
explaining the emergence of these topological properties through models of
network synthesis. It would be very interesting to study the correlates of
semantic acquisition and symbol grounding with the model parameters.
Like phonological similarity networks, one can also construct networks based
on orthographic similarity, where the nodes are the words and the edit distance
between two words defines the edge weight between the nodes corresponding
to them. Such networks have been studied in order to investigate the diffi-
culties involved in spelling error detection and correction [11]. In this work
the authors construct such networks (SpellNet) for three different languages
(Bengali, Hindi and English) and analyze them to show the following.
• For a particular language, the probability of real word errors can be
equated to the average weighted degree of SpellNet.
• The difficulty of non-word error correction correlates to the average clus-
tering coefficient for a language.
• The basic topological properties are invariant in nature for all the lan-
guages; for instance, the authors find that the SpellNet for all of the three
It does not follow, however, from the collocation networks that a word
with high degree is indeed a word with high usage frequency (unless the word
co-occurrences are completely independent in nature, which essentially is not
the case). In a separate study, Cancho and Solé [25] have shown that the
rank-degree distribution of the words in a very large corpus also follows a
two-regime power law, supporting their claim regarding the presence of a
core lexicon whose size is about 5000 words. In order to explain the two-
regime power law in word collocation networks, Dorogovtsev and Mendes [18]
proposed a preferential attachment-based growth model. At every time step t,
a new word (i.e., a node) enters the language (i.e., the network) and connects
itself preferentially to one of the pre-existing nodes. Simultaneously, ct (where
c is a positive constant) new edges are grown between pairs of old nodes that
are chosen preferentially. Through mathematical analysis and simulations, the
authors establish that this model gives rise to a two-regime power law with
exponents very close to those observed in [24].
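The growth rule described above can be sketched in a few lines of Python. This is only an illustrative simulation, not the authors' code: the function name `grow_word_web`, the two-node seed network, the rounding of ct down to an integer, and the decision to skip self-loops while allowing repeated edges are all assumptions of the sketch.

```python
import random


def grow_word_web(steps, c=0.01, seed=0):
    """Grow a network following the rule sketched in the text.

    Returns a list of (u, v) edges; nodes are the integers 0 .. steps + 1.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]   # assumed seed network: two connected words
    ends = [0, 1]      # all edge endpoints; a uniform draw is degree-proportional
    n_nodes = 2
    for t in range(1, steps + 1):
        # The new word will attach preferentially to one existing word.
        target = rng.choice(ends)
        # About c*t new edges appear between preferentially chosen old words.
        # Rounding down to an integer is an assumption of this sketch.
        for _ in range(int(c * t)):
            u, v = rng.choice(ends), rng.choice(ends)
            if u != v:                      # skip self-loops
                edges.append((u, v))
                ends.extend((u, v))
        # Only now add the new word, so it cannot act as an "old" node above.
        new = n_nodes
        n_nodes += 1
        edges.append((new, target))
        ends.extend((new, target))
    return edges


if __name__ == "__main__":
    web = grow_word_web(2000, c=0.01)
    print(len(web), "edges after 2000 steps")
```

Keeping a flat list of edge endpoints makes preferential selection a single uniform draw, since a node of degree k appears k times in that list.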
There have been studies on the properties of collocation networks for lan-
guages other than English, including Russian [46] and many others [41]. The
basic topological properties of the networks (e.g., scale-free, small-world,
assortative) are similar across languages, which suggests that, like Zipf’s
law, these characteristics are linguistic universals that call for a
non-trivial psycholinguistic account of their emergence and existence.
Fig. 2. Example of a dependency tree. The arrows are labeled by the type of depen-
dency relation and run from the dependent to the head words.
1 While it is true that syntactic dependencies have a tendency to avoid crossing,
there are systematic exceptions to that generalization in languages with relatively free
constituent order. In German, for example, about one-third of all relative clauses are
extraposed, thus creating cross dependencies.
4 Phonological Networks
In the earlier sections, we have seen how complex networks can be used to
study the different types of interactions (phonological, syntactic and semantic)
between the words of a language. In this section, we shall review some of
the works where the networks are constructed from linguistic units that are
smaller than words, e.g., phonemes and syllables.
The most basic units of human languages are the speech sounds. The repertoire
of sounds that makes up the sound inventory of a language is not chosen
arbitrarily, even though speakers are capable of perceiving and producing
a plethora of them. On the contrary, the inventories show exceptionally regular
patterns across the languages of the world, which is arguably an outcome of
the self-organization that goes on in shaping their structure. In fact, numer-
ous computational models have been proposed in the literature in order to
explain the self-organization of the vowel inventories [15, 47, 51, 76]. A few
attempts have also been made in linguistics to account for the observed
patterns across the consonant inventories. Most of these works confine
themselves to explaining certain individual principles rather than formulating
a general theory describing the pattern emergence. However, complex networks
have been recently used quite successfully to explain the self-organization of
the consonant inventories. In [65] the authors construct a bipartite network
called PlaNet, or the Phoneme-Language Network, in which one of the par-
titions consists of nodes representing the languages while the other partition
consists of nodes representing the consonants. There is an edge between the
nodes of these two partitions if a particular consonant occurs in a particular
language. The authors further construct PhoNet (Phoneme-Phoneme Network),
which is the one-mode projection of PlaNet onto the consonant nodes, i.e., a
network of consonants in which the weight of the link between two nodes is the
number of language inventories in which they co-occur. The data used for
constructing the above networks are drawn from the UCLA Phonological Segment
Inventory Database (UPSID) [54], which consists of 317 languages and 541
consonants found across these languages. Several important observations made
from the study of PlaNet and PhoNet are noted below.
From the study of PlaNet [65]
• The degree distribution of the consonant nodes in PlaNet roughly follows
a power law with an exponential cut-off towards the tail.
• A synthesis model based on preferential attachment (a language node
attaches itself to a consonant node depending on the current degree ($k$)
of the consonant node) can explain the emergence of the degree
distribution of PlaNet. The results match the empirical data more accurately
if the attachment kernel is super-linear (i.e., the attachment probability is
proportional to $k^{\alpha}$, where $\alpha > 1$).
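A minimal Python sketch of such a synthesis process is given below. It is an illustration under stated assumptions rather than the model of [65] itself: the function name `synthesize_planet`, the fixed list of inventory sizes, and the smoothing constant `eps` (added so that consonants with degree zero can still be chosen) are hypothetical choices.

```python
import random


def synthesize_planet(inventory_sizes, n_consonants, alpha=1.5, eps=0.5, seed=0):
    """Let languages pick consonants with probability ~ (degree + eps) ** alpha.

    inventory_sizes: one entry per language; each must be <= n_consonants.
    Returns (list of consonant sets, final consonant degrees).
    """
    rng = random.Random(seed)
    degree = [0] * n_consonants
    planet = []
    for size in inventory_sizes:
        available = list(range(n_consonants))
        chosen = []
        for _ in range(size):
            # Super-linear kernel when alpha > 1; eps is an assumed smoothing term.
            weights = [(degree[c] + eps) ** alpha for c in available]
            pick = rng.choices(available, weights=weights, k=1)[0]
            available.remove(pick)   # a language holds each consonant at most once
            chosen.append(pick)
        for c in chosen:
            degree[c] += 1           # degrees are updated once the language is complete
        planet.append(set(chosen))
    return planet, degree


if __name__ == "__main__":
    planet, degree = synthesize_planet([20] * 300, n_consonants=500)
    print(sorted(degree, reverse=True)[:10])
```

Varying `alpha` around 1 in such a simulation is one way to see how the kernel shapes the tail of the consonant degree distribution.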
From the study of PhoNet [64, 65]
• The degree distribution of the consonant nodes in PhoNet also roughly
indicates a power-law behavior with exponential cut-offs.
• The clustering coefficient of PhoNet (=0.89) is significantly higher than
that of a random graph with the same number of nodes and edges (=0.08).
• Community structure analysis of PhoNet can capture the strong patterns
of co-occurrence of consonants that are prevalent across the languages of
the world.
• The driving force that leads to the emergence of these communities is
feature economy, which states that languages tend to use a small number
of distinctive features and maximize their combinatorial possibilities to
generate a large number of consonants.
• The emergence of the degree distribution and the clustering coefficient of
PhoNet can be explained through a synthesis model that is based on both
preferential attachment and triad (i.e., fully connected triplet) formation.
While the preferential part of the model reproduces the degree distribution
of the network, the triad formation part imposes a large number of triangles
onto the generated network, thereby increasing the clustering coefficient.
• The emergence of feature economy can be explained by having a synthesis
model, which is a linear combination of two different parts, one driven
by the usual degree-dependent preference and the other by a factor that
favors the choice of those consonants that share many features with the
already chosen ones.
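To make the construction of PhoNet concrete, the following Python sketch projects a toy language-consonant table onto the consonants and computes an average clustering coefficient. The three toy inventories and the function names are invented for illustration, and the clustering coefficient is the standard unweighted one, which may differ from the definition used in [64, 65].

```python
from collections import Counter
from itertools import combinations


def project_onto_consonants(inventories):
    """One-mode projection: edge weight = number of languages sharing both consonants."""
    weights = Counter()
    for inventory in inventories.values():
        for a, b in combinations(sorted(inventory), 2):
            weights[(a, b)] += 1
    return weights


def average_clustering(weights):
    """Average local clustering coefficient, ignoring edge weights."""
    nbrs = {}
    for a, b in weights:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    total = 0.0
    for ns in nbrs.values():
        k = len(ns)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute zero
        links = sum(1 for x, y in combinations(ns, 2) if y in nbrs[x])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nbrs)


# Hypothetical toy data: three "languages" and their consonant inventories.
toy = {"lang1": {"p", "t", "k"}, "lang2": {"p", "t", "m"}, "lang3": {"t", "k"}}
phonet = project_onto_consonants(toy)

if __name__ == "__main__":
    print(dict(phonet))
    print(average_clustering(phonet))
```

In the toy example, p and t co-occur in two inventories, so their edge carries weight 2, while k and m never co-occur and remain unlinked.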
The authors postulate that the physical significance of the synthesis models is
grounded in the process of language change. Language change is a collective
phenomenon that functions at the level of a population of speakers [80]. They
also conjecture that it is possible to explain the significance of the models
at the level of an individual, primarily in terms of the process of language
acquisition. Further, they argue that there are two orthogonal preferences:
(a) the occurrence frequency of a consonant, and (b) the feature-dependent
preference (that increases the ease of learning), which are instrumental in
the acquisition of the inventories. The synthesis model is essentially a linear
combination of these two mutually orthogonal factors.
often quite different. The works on linguistic networks discussed in the last
three sections were primarily targeted to the statistical physics community,
and the objective was to unfurl the structure of languages and their dynamics.
In this section, we will survey some equally interesting and significant works,
which use the same set of mathematical tools, but the objective is to develop
practical applications concerning languages.
One of the earliest and recurrent applications of networks in NLP has been
in automatic induction of syntactic and semantic categories based on the
distributional hypothesis [39]. The distributional hypothesis states that words
of similar syntactic (semantic) category are found in similar contexts [39]. To
illustrate this concept, consider two unknown words X and Y that occur in
the following sentences:
(1) The red X is very beautiful.
(2) If you Y then I shall punish you.
Even though we do not know what X and Y are, it is easy to infer that the
former is a noun and the latter is a verb. We can draw such inferences about
the syntactic categories (in this case the parts of speech) of words based on
our knowledge that nouns, but not verbs, can be preceded by articles (the) and
adjectives (red). The distributional hypothesis is equally relevant for
semantic categories: words belonging to the same domain tend to occur
together. Thus, the word student is expected to occur in the vicinity of the
word school, rather than market.
Measuring to what extent two words appear in similar contexts defines
their similarity [62]. The general methodology [12, 27, 31, 72, 74, 75] for
inducing word class information can be outlined as follows.
1. Define the context of a word as a vector. It could be just the set of words
which occur in the same sentence, or only the immediate neighbors of the
words. For syntactic class induction, usually the word order is preserved
during construction of the vectors and the context vectors are defined only
in terms of the function words (such as is, of, the and a).
2. Collect global context vectors for the words by summing up the local con-
texts.
3. Construct a weighted network, where the nodes are the words and the
weight of the edge between two words is the distance (or similarity) between
their context vectors. There are several ways to compare the vectors.
Some of the common measures are Euclidean distance, cosine similarity
and correlation coefficients.
4. Apply a clustering algorithm on these networks to obtain the word classes.
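Steps 1 to 3 of this methodology can be sketched in Python as follows. The sketch is deliberately small and makes several assumptions: the tiny `FUNCTION_WORDS` list stands in for the high-frequency words used in the literature, the context of a word is limited to its immediate left and right neighbours, and cosine similarity is used as the edge weight. Step 4 would then apply any graph clustering algorithm to the resulting network.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for the most frequent words of a corpus (an assumption).
FUNCTION_WORDS = {"the", "is", "a", "of", "you", "if", "then", "i"}


def context_vectors(sentences):
    """Steps 1-2: sum local left/right function-word contexts into global vectors."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            if word in FUNCTION_WORDS:
                continue
            # Word order is preserved by tagging features as left (L) or right (R).
            if i > 0 and tokens[i - 1] in FUNCTION_WORDS:
                vectors[word][("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens) and tokens[i + 1] in FUNCTION_WORDS:
                vectors[word][("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def similarity_network(vectors):
    """Step 3: weighted network with one edge weight per word pair."""
    words = sorted(vectors)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


if __name__ == "__main__":
    corpus = ["the cat is here", "the dog is here", "you run then i rest"]
    for pair, weight in similarity_network(context_vectors(corpus)).items():
        print(pair, round(weight, 3))
```

In the toy corpus, cat and dog share identical contexts and receive a similarity close to 1, whereas cat and run share none and receive 0.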
In the syntactic category induction literature, the 150–250 words with the
highest frequency are considered as function words, and the context vectors
are defined based on them. Some authors employ a much larger number of
features and reduce the dimensions of the resulting matrix using singular value
decomposition [72, 74]. [27] uses the Spearman rank correlation coefficient and
hierarchical clustering, [74, 75] use the cosine of the angle between vectors
and buckshot clustering, [31] uses cosine on mutual information vectors for
hierarchical agglomerative clustering and [12] applies Kullback–Leibler
divergence in his CDC algorithm.
[28] does not sum up the contexts of each word in a context vector, but uses
the most frequent instances of four-word windows in a co-clustering algorithm
[16]: rows and columns (here words and contexts) are clustered simultaneously.
Two-step clustering is undertaken by [74]: clusters from the first step are
used as features in the second step. More recently, Biemann [6] proposed the
Chinese Whispers algorithm for clustering, which is fast and does not require
any parameters to be specified. [7] reports the application of Chinese Whispers
to part-of-speech (POS) induction in English, Finnish and German; the algorithm
has also been applied very recently to Bengali [66]. In the latter work, the
authors also investigate the topological properties of the word networks so
constructed and report a scale-free degree distribution, high clustering
coefficient and power-law cluster size distribution.
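The label-propagation idea behind Chinese Whispers, as it is commonly described, can be sketched in Python as follows. This is a simplified illustration, not the reference implementation of [6]: the function name, the fixed number of iterations and, in particular, the deterministic tie-breaking by smallest label (ties are usually broken at random) are assumptions of the sketch.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=10, seed=0):
    """Cluster a weighted undirected graph given as (u, v, weight) triples.

    Every node starts in its own class; nodes are then visited in random
    order and adopt the class with the largest total edge weight among
    their neighbours.
    """
    rng = random.Random(seed)
    nbrs = defaultdict(dict)
    for u, v, w in edges:
        nbrs[u][v] = w
        nbrs[v][u] = w
    nodes = sorted(nbrs)
    label = {node: i for i, node in enumerate(nodes)}
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            score = defaultdict(float)
            for other, w in nbrs[node].items():
                score[label[other]] += w
            # Strongest class wins; ties go to the smallest label (a simplification).
            label[node] = min(score, key=lambda c: (-score[c], c))
    return label


if __name__ == "__main__":
    toy_edges = [
        ("cat", "dog", 1.0), ("dog", "cow", 1.0), ("cat", "cow", 1.0),
        ("run", "eat", 1.0), ("eat", "go", 1.0), ("run", "go", 1.0),
    ]
    print(chinese_whispers(toy_edges))
```

On the toy graph of two disconnected triangles, the three nouns end up in one class and the three verbs in another; the number of classes is never specified in advance.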
Widdows and Dorow [87] propose an unsupervised incremental clus-
ter building approach for acquisition of semantic classes. There are also
graph-based algorithms to infer semantic classes (sets of synonyms, to be
specific) from the lexicons (see, e.g., [17, 43]).
Identification of syntactic or semantic classes is of great importance to NLP
and IR. For instance, POS tagging is the first step towards parsing. However,
the supervised machine learning techniques for POS tagging demand a large
amount of human annotated data, which is expensive as well as non-existent
for most of the languages. Since automatic induction of POS tags through
graph clustering does not require annotated data, it might turn out to be a
very useful technique in NLP for resource-poor languages. Similarly, semantic
clustering of the words is useful for search and IR.
Word sense disambiguation (WSD) refers to the task of assigning the appropri-
ate sense or meaning to a word in a given context (i.e., sentence or paragraph)
out of the several possibilities. For example, the English word bank has two
different meanings as a noun: (1) a river bank and (2) a financial institution.
However, as shown in the following sentences, in a given context only one of
the senses is appropriate.
(1) They were walking down the bank enjoying the cool river breeze.
(2) She went to the bank to cash her check.
There are several ways in which graph-based techniques have been applied
for WSD. Examples include lexical chaining [29], semantic relatedness
measures based on path lengths and random walks on semantic networks [57,
61] and lexicon graphs [50]. For reasons of space, we discuss in detail only
one of the approaches, HyperLex [85], which relies on word co-occurrence
graphs.
Fig. 3. Example of HyperLex: (a) the network of words for disambiguation of the word
“light”; (b) the minimal spanning tree obtained after introduction of the word “light”.
The hubs are shown in bold font.
Consider the problem of automatically identifying and disambiguating the
various senses of the word light. The HyperLex algorithm works as follows.
A sub-corpus consisting of all the paragraphs featuring at least one occurrence
of the word light is extracted from a raw text corpus. A word co-occurrence
graph is constructed from this sub-corpus, where the nodes are the content
words except for the word light. Two words are connected by an edge if they
co-occur in a paragraph more than a preset number of times. The weight of
an edge decreases as the number of times the words co-occur increases. It
has been found that word co-occurrence graphs built in this manner exhibit
small-world properties.
In this co-occurrence network, nodes with very high degree are identified
as hubs. The word light, for which we want to build the disambiguator, is then
introduced to the network and connected to the hubs. A minimal spanning
tree is constructed from the co-occurrence graph, where light is the root node
and the first level consists of the hubs. Figure 3 illustrates this process. Each
node in the spanning tree can be thought of as a sense. Thus, the hubs denote
the basic senses and, as we move further down the tree, we have more refined
senses of the word. This tree can then be used for disambiguating the sense
of the target word (here light) in a particular context.
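The procedure described above can be approximated by the following Python sketch. It is a rough illustration, not the HyperLex implementation of [85]: the edge weight 1/n for a pair co-occurring in n paragraphs is merely one decreasing function chosen for simplicity, hubs are taken to be the highest-degree nodes without any further filtering, and all function and parameter names are hypothetical.

```python
import heapq
from collections import Counter, defaultdict
from itertools import combinations


def hyperlex_tree(paragraphs, target, stopwords, min_cooc=1, n_hubs=2):
    """Build a co-occurrence graph around `target` and a spanning tree rooted at it.

    Returns (hubs, parent), where parent maps each reachable word to its
    parent in the tree and parent[target] is None.
    """
    cooc = Counter()
    for para in paragraphs:
        words = set(para.lower().split())
        if target not in words:
            continue  # the sub-corpus keeps only paragraphs featuring the target
        content = sorted(words - stopwords - {target})
        cooc.update(combinations(content, 2))
    graph = defaultdict(dict)
    for (a, b), n in cooc.items():
        if n > min_cooc:
            # Assumed weight: decreases as the co-occurrence count grows.
            graph[a][b] = graph[b][a] = 1.0 / n
    # Simplified hub selection: the n_hubs nodes of highest degree.
    hubs = sorted(graph, key=lambda w: (-len(graph[w]), w))[:n_hubs]
    for hub in hubs:
        graph[target][hub] = graph[hub][target] = 0.0
    # Prim's algorithm for a minimal spanning tree rooted at the target word.
    parent = {target: None}
    heap = [(w, target, nb) for nb, w in graph[target].items()]
    heapq.heapify(heap)
    while heap:
        _, u, v = heapq.heappop(heap)
        if v in parent:
            continue
        parent[v] = u
        for nb, w in graph[v].items():
            if nb not in parent:
                heapq.heappush(heap, (w, v, nb))
    return hubs, parent


def sense_of(word, parent, target):
    """Return the hub (basic sense) under which a non-target tree word sits."""
    while parent[word] != target:
        word = parent[word]
    return word


if __name__ == "__main__":
    paragraphs = [
        "the light lamp bulb switch",
        "light lamp bulb switch on",
        "light weight feather heavy",
        "a light weight feather heavy",
        "market price",
    ]
    hubs, parent = hyperlex_tree(paragraphs, "light", {"the", "a", "on"})
    print(hubs, sense_of("lamp", parent, "light"), sense_of("weight", parent, "light"))
```

Because the target word is linked to the hubs with zero weight, the hubs necessarily form the first level of the tree, and every other word hangs below the hub to which it is most strongly connected.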
6 Conclusion
So far we have seen that there has been a substantial amount of work to under-
stand the structure and dynamics of languages at the mesoscopic level within
the framework of complex networks. A parallel thread of research in the field
of NLP and IR tries to achieve a different goal, but uses very much the same
means. Nevertheless, mesoscopic models of language as well as network-based
approaches to NLP are in a nascent state, especially when compared to similar
lines of research in the fields of biology, economics and other social sciences
(refer to the surveys in this volume). On the other hand, there seems to be a
great potential for application of complex network theory to a variety of open
problems in linguistics and language engineering.
One of the fundamental problems of linguistics is characterization and
explanation of linguistic universals, i.e., properties that are common to all
human languages. Differences among the languages, on the other hand, are
restricted by the typologies and implicational hierarchies [14]. We have seen
that, like Zipf’s law, there are many linguistic universals observable in the lin-
guistic networks. For example, the SDNs as well as word collocation networks
of all languages exhibit scale-free degree distributions and the small-world
property. A systematic investigation of topological universals of linguistic net-
works can substantially improve our understanding of languages. At the same
time, there are properties for which the linguistic networks vary across
languages. For example, the average degree of the SpellNet for English is very
different from that for Hindi or Bengali. This difference has been attributed
to the different writing systems used by English (which is alphabetic) and the
two Indo-Aryan languages (which use abugidas). Typological variations
have also been predicted in the topological properties of syllable networks.
Thus, it would be interesting to have a typological theory of languages based
on the structure of the linguistic networks.
Another question of great importance for any linguistic network concerns the
emergence of its structural properties. It is far from clear why the word
collocation networks should display small-world and scale-free properties. Even
though the Dorogovtsev and Mendes model [18] can reproduce the two-regime
power law observed in the collocation networks, it does not by itself establish
the validity or the physical significance of preferential attachment. In other
words, the phenomenon of preferential attachment at the mesoscopic level needs
an independent microscopic explanation in terms of psycholinguistic factors,
because words cannot voluntarily link to other words. Similar microscopic
explanations are required for
the non-trivial topological properties of the other linguistic networks, such as
ML, SDN, PhoNet and SpellNet. This is presumably a hard problem, but any
mesoscopic explanation is incomplete without a corresponding microscopic
model.
In the context of NLP and IR applications, network-based models are
mostly ad hoc and this reduces their credibility and, thereby, the popularity,
as compared to the more principled Bayesian approaches. A network-based
language model can bridge this gap and provide us with a more systematic
way of solving the NLP problems within this framework. Although there have
been some initiatives in this direction [44], this area is largely unexplored
and presents numerous challenging problems. Another relatively unexplored,
but potentially fecund, area of research is processes “on” linguistic networks.
Navigation of the ML can be modeled as guided random walks on the ML
network; similarly, typographical errors can be modeled as walks on SpellNet.
The exact nature of such guided walks is still to be explored and can provide
a strong understanding of underlying cognitive principles.
In the previous sections we have seen several ways to define networks
where the nodes represent words. One can conceive of a universal word net-
work obtained through superimposition of these partial representations of a
linguistic system into a multi-tier network where the nodes are the words and
two nodes can be connected by several labeled edges signifying their pho-
netic, collocational, syntactic, orthographic, semantic and various other kinds
of similarities. Studies on such a network can reveal a holistic picture of the
interaction patterns between the words, thereby providing a unified model of
grammar at different levels of linguistic structure.
References
1. A. E. Motter, A. P. S. de Moura, Y. C. Lai, and P. Dasgupta. Topology of the
conceptual network of language. Physical Review E, 65(065102):1–4, 2002.
2. A. Agarwal, S. Chakrabarti, and S. Aggarwal. Learning to rank networked entities.
In Proceedings of KDD, 2006.
3. A. Akmajian. Linguistics. An introduction to Language and Communication. MIT
Press, Cambridge, MA, 1995.
4. A. Albright and B. Hayes. Rules vs. analogy in English past tenses: A
computational/experimental study. Cognition, 90:119–161, 2003.
5. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science,
286:509–512, 1999.
6. C. Biemann. Chinese whispers - an efficient graph clustering algorithm and its ap-
plication to natural language processing problems. In Proceedings of TextGraphs:
the Second Workshop on Graph Based Methods for Natural Language Process-
ing, pages 73–80, New York, NY, June 2006. Association for Computational
Linguistics.
7. C. Biemann. Unsupervised part-of-speech tagging employing efficient graph
clustering. In Proceedings of the COLING/ACL 2006 Student Research Work-
shop, pages 7–12, Sydney, Australia, July 2006. Association for Computational
Linguistics.
8. C. Biemann, I. Matveeva, R. Mihalcea, and D. Radev, editors. Proceedings of the
Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language
Processing. Association for Computational Linguistics, Rochester, NY, 2007.
9. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine.
CNIS, 30(1–7):107–117, 1998.
10. N. Chomsky. The Minimalist Program. MIT Press, Cambridge, MA, 1995.
11. M. Choudhury, M. Thomas, A. Mukherjee, A. Basu, and N. Ganguly. How difficult
is it to develop a perfect spell-checker? A cross-linguistic analysis through complex
network approach. In Proceedings of the Second Workshop on TextGraphs: Graph-
Based Algorithms for Natural Language Processing, pages 81–88, Rochester, NY,
2007. Association for Computational Linguistics.
12. A. Clark. Inducing syntactic categories by context distribution clustering. In
C. Cardie, W. Daelemans, C. Nédellec, and E. T. K. Sang, editors, Proceedings of
the Fourth Conference on Computational Natural Language Learning and of the
Second Learning Language in Logic Workshop, Lisbon, 2000, pages 91–94. Asso-
ciation for Computational Linguistics, Somerset, NJ, 2000.
13. A. M. Collins and M. R. Quillian. Retrieval time from semantic memory. Journal
of Verbal Learning and Verbal Memory, 8:240–247, 1969.
14. W. Croft. Typology and Universals. Cambridge University Press, Cambridge, MA,
1990.
15. B. de Boer. Self-organisation in vowel systems. Journal of Phonetics, 28(4):
441–465, 2000.
16. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In
Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2003), pages 89–98, 2003.
17. W. B. Dolan, L. Vanderwende, and S. Richardson. Automatically deriving struc-
tured knowledge base from on-line dictionaries. In Proceedings of the Pacific As-
sociation for Computational Linguistics, 1993.
Linguistic Networks 163
73. M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: Machine learning for
static ranking. In Proceedings of WWW, pages 707–715, 2006.
74. H. Schütze. Part-of-speech induction from scratch. In Proceedings of the 31st An-
nual Meeting on Association for Computational Linguistics, pages 251–258, Mor-
ristown, NJ, 1993. Association for Computational Linguistics.
75. H. Schütze. Distributional part-of-speech tagging. In Proceedings of the 7th Con-
ference on European Chapter of the Association for Computational Linguistics,
pages 141–148, San Francisco, CA, 1995. Morgan Kaufmann Publishers Inc.
76. J.-L. Schwartz, L.-J. Boë, N. Vallée, and C. Abry. The dispersion-focalization
theory of vowel systems. Journal of Phonetics, 25:255–286, 1997.
77. M. Sigman and G. A. Cecchi. Global organization of the wordnet lexicon. Proceed-
ings of the National Academy of Science, 99(3):1742–1747, 2002.
78. M. M. Soares, G. Corso, and L. S. Lucena. The network of syllables in Portuguese.
Physica A: Statistical Mechanics and its Applications, 355(2-4): 678–684, 2005.
79. Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised learning of natural
languages. Proceedings of National Academy of Sciences, 102(33):11629–11634,
2005.
80. L. Steels. Language as a complex adaptive system. In Proceedings of PPSN VI,
pages 17–26, 2000.
81. D. Steriade. Knowledge of similarity and narrow lexical override. BLS, 29: 583–598,
2004.
82. M. Steyvers and J. B. Tenenbaum. The large-scale structure of semantic networks:
Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):
41–78, 2005.
83. M. Tamariz. Exploring the Adaptive Structure of the Mental Lexicon. Ph.D. the-
sis, Department of Theoretical and Applied Linguistics, Univerisity of Edinburgh,
Scotland, 2005.
84. K. Toutanova, C. D. Manning, and A. Y. Ng. Learning random walk models for
inducing word dependency distributions. In ICML ’04: Proceedings of the Twenty-
First International Conference on Machine Learning, page 103, New York, NY,
2004.
85. J. Véronis. HyperLex: Lexical cartography for information retrieval. Computer
Speech and Language, 18(3):223–252, 2004.
86. M. S. Vitevitch. Phonological neighbors in a small world (network): What can
graph theory tell us about the mental lexicon? Departmental Colloquy co-sponsored
by the Linguistics and Psychology Departments, Rice University, January 27, 2006.
87. D. Widdows and B. Dorow. A graph model for unsupervised lexical acquisition.
In Proceedings of COLING, 2002.
Networks Generated from Natural
Language Text
1 Introduction
few hundred words make the bulk of words in a text allows one to use only
these words as contextual features with only a minor loss in text coverage.
Knowing that word co-occurrence networks possess the scale-free small world
property has implications for clustering these networks.
An interesting aspect is whether these characteristics are only inherent to
real natural language data or whether they can be produced with generators
of linear sequences in a much simpler way than our intuition about language
complexity would suggest. In other words, we shall see how distinctive these
characteristics are with respect to tests deciding whether a given sequence is
natural language or not.
G. K. Zipf [31, 32] described the following phenomenon: if all words in a corpus
of natural language are arranged in decreasing order of frequency, then the
relation between a word’s frequency and its rank in the list follows a power
law. Since then, a significant amount of research has been devoted to the
question of how this property emerges and what kinds of processes generate
such Zipfian distributions. Hence, some datasets related to language will be
presented that exhibit a power law on their rank-frequency distribution. For
this discussion, basic units of language will be examined.
The relation between the frequency of a word at rank r and its rank is given
by $f(r) \sim r^{-z}$, where z is the exponent of the power law that corresponds
to the slope of the curve in a log-log plot. The exponent z was assumed to
be exactly 1 by Zipf. In natural language data, slightly differing exponents
in the range of about 0.7 to 1.2 are also observed [30]. B. Mandelbrot [21]
provided a formula that more closely approximates the frequency distributions
in language data after noticing that Zipf’s law holds only for the medium range
of ranks, whereas the curve is flatter for very frequent words and steeper for
high ranks. Figure 1 displays the word rank-frequency distributions of corpora
of different languages taken from the Leipzig Corpora Collection (LCC; footnote 1).
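The power-law relation above can be made concrete with a small sketch. The code below builds a rank-frequency list from a token sequence and estimates z as the negated least-squares slope in the log-log plot. The function names are hypothetical, and a plain least-squares fit over all ranks is only one simple estimator, chosen here for illustration; it is not claimed to be the procedure used for the figures in this chapter.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies in decreasing order; position 0 is rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_zipf_exponent(frequencies):
    """Estimate z in f(r) ~ r^(-z) by a least-squares line in log-log space.

    `frequencies` must hold at least two positive values ordered by rank.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is -z, so the sign is flipped before returning.
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic frequencies that follow f(r) = 1000 / r, i.e. z = 1.
    ideal = [1000.0 / rank for rank in range(1, 201)]
    print(fit_zipf_exponent(ideal))
```

On real corpora the fit would normally be restricted to the medium range of ranks, since the curve is flatter for very frequent words and steeper for high ranks.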
There exist several exhaustive collections of research capitalising on Zipf’s
law and related distributions (footnote 2), ranging over a wide area of datasets;
here, only findings related to natural language will be reported. A related
distribution is the lexical spectrum [16], which gives the probability of choosing
a word from the vocabulary with a given frequency. For natural language, the lexical
spectrum follows a power law with slope $\gamma = \frac{1}{z} + 1$, where z is the exponent
1 LCC, see http://www.corpora.uni-leipzig.de [July 7th, 2007].
2 e.g. http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].
[Figure 1: log-log plot of word frequency against rank for corpora of several languages; x-axis: rank.]
Fig. 1. Zipf’s law for various corpora. The numbers next to the language give the
corpus size in sentences. Enlarging the corpus does not affect the slope of the curve,
but merely moves it upwards in the plot. Most lines are almost parallel to the ideal
power-law curve with z = 1. Finnish exhibits a lower slope of γ ≈ 0.8, akin to higher
morphological productivity.
3 http://www.natcorp.ox.ac.uk/ [April 1, 2007]
170 C. Biemann and U. Quasthoff
[Figure 2: log-log plot; x-axis: rank.]
Fig. 2. Rank-frequency distributions for letter N-grams for the first 10,000 sentences
in the BNC. Letter N-gram rank-frequency distributions do not exhibit power laws on
the full scale, but increasing N results in a larger power-law regime for low ranks.
[Figure 3: log-log plot; axes: rank (x), frequency (y).]
Fig. 3. Rank-frequency plots for letter bigrams, for a text generated from letter
unigram probabilities and for the BNC sample.
[Figure 4: two log-log plots, left and right; axes: rank (x), frequency (y).]
Fig. 4. Left: Rank-frequency distributions for word N-grams for the first one million
sentences in the BNC. Word N-gram rank-frequency distributions exhibit power laws.
Right: Rank-frequency plots for word bigrams, for a text generated from word unigram
probabilities and for the BNC sample.
For word N-grams, the relation between rank and frequency follows a power
law, just as in the case for words (unigrams). Figure 4 (left) shows the
rank-frequency plots up to N = 4, based on the first 1 million sentences of the
BNC. As more different word combinations are possible with increasing N,
the curves become flatter as the same total frequency is shared amongst more
units, as previously observed (e.g. [27, 18]). Testing concatenation restrictions
quantitatively as above for letters, it might at first seem surprising that the
curve for a text generated with word unigram frequencies differs only very
little from the word bigram curve, as Fig. 4 (right) shows. Small differences
are only observable for low ranks: more top-rank generated bigrams reflect
that words are usually not repeated in the text. More low-ranked and fewer
high-ranked real bigrams indicate that word concatenation takes place not
entirely without restrictions, yet is subject to much more variety than letter
concatenation. This coincides with the intuition that it is, for a given word
pair, almost always possible to form a correct English sentence in which these
words are neighbours. Regarding quantitative (as opposed to syntactic or
semantic) aspects, the frequency distribution of word bigrams can be produced
by a generation process based on word unigram probabilities.
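The comparison between real and generated text described above can be sketched as follows. The code counts N-gram frequencies by rank and draws a synthetic token sequence from the unigram distribution of a given text, so that both bigram curves can be compared. Function names and the fixed random seed are hypothetical choices for illustration; the sketch makes no claim about how the BNC experiments were actually implemented.

```python
import random
from collections import Counter


def ngram_rank_frequency(tokens, n):
    """Frequencies of all n-grams in `tokens`, sorted in decreasing order."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return sorted(Counter(grams).values(), reverse=True)


def generate_from_unigrams(tokens, length, seed=0):
    """Draw `length` tokens independently, weighted by unigram frequency.

    Units are concatenated without any restriction, which is the null
    model the real N-gram curve is compared against.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    vocabulary = list(counts)
    weights = [counts[unit] for unit in vocabulary]
    return rng.choices(vocabulary, weights=weights, k=length)


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the cat".split()
    real = ngram_rank_frequency(text, 2)
    generated = ngram_rank_frequency(generate_from_unigrams(text, len(text)), 2)
    print(real)
    print(generated)
```

Passing a list of characters instead of words gives the letter N-gram variant of the same experiment.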
[Figure 5: log-log plot; axes: rank (x), frequency (y).]
Fig. 5. Rank-frequency plot for sentence frequencies in the full BNC, following a
power law with γ ≈ 0.9, but with a high fraction of sentences occurring only once.
[Figure 6: log-log plot; axes: rank (x), frequency (y).]
Fig. 6. Rank-frequency plot for AltaVista search queries, following a power law with
γ ≈ 0.75.
distribution would be found, but such an analysis has not been carried out
and would require access to the index of a web search engine. Further power
laws occur in language-related areas; some are mentioned here briefly
to illustrate their omnipresence.
• Web page requests follow a power law, which was employed for a caching
mechanism in [17].
• Related to this, frequencies of web search queries during a fixed time span
also follow a power law, as exemplified in Fig. 6 for a 7-million-query log
of AltaVista as used by Lempel and Moran [19].
• The number of authors of Wikipedia articles was found to follow a power
law with γ ≈ 2.7 for a large regime in [29]. The same paper further
discusses various other power-law relationships.
in the graph. Some of the graphs discussed here can be classified as scale-free
SWGs; others have different characteristics and represent other, but related,
graph classes.
$- (n - n_A) \log (n - n_A) - (n - n_B) \log (n - n_B)$
that exist due to random noise and keeping almost exclusively those edges that
reflect a true association between their endpoints. Graphs that contain all sig-
nificant co-occurrences of a corpus, with edge weights set to the significance
value between their endpoints, are called significant co-occurrence graphs in
the remainder. For convenience, no singletons in the graph are allowed, i.e. if a
vertex is not contained in any edge because none of the co-occurrences for the
corresponding word is significant, then the vertex is excluded from the graph.
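A significant co-occurrence graph of this kind can be sketched as below. Only the tail of the significance formula survives in this text, so the sketch assumes the standard log-likelihood ratio in the style of Dunning [13], written with the counts n (sentences), n_A and n_B (sentences containing each word) and n_AB (sentences containing both). The thresholds s = 6.63 and t = 2 are borrowed from the caption of Fig. 12 and are interpreted here, as an assumption, as a minimum significance and a minimum co-occurrence count. The restriction to pairs that co-occur more often than expected is likewise an added assumption, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(n, n_a, n_b, n_ab):
    """Log-likelihood ratio (assumed Dunning-style form) for a word pair."""
    return 2.0 * (
        xlogx(n) - xlogx(n_a) - xlogx(n_b) + xlogx(n_ab)
        + xlogx(n - n_a - n_b + n_ab)
        + xlogx(n_a - n_ab) + xlogx(n_b - n_ab)
        - xlogx(n - n_a) - xlogx(n - n_b)
    )


def significant_cooccurrence_graph(sentences, s=6.63, t=2):
    """Map sorted word pairs to their significance; weak pairs are dropped.

    Words without any significant pair never appear as a key, so the
    resulting graph contains no singletons.
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    graph = {}
    for (a, b), n_ab in pair_count.items():
        if n_ab < t:
            continue
        # Assumption: keep only pairs that co-occur more often than chance.
        if n_ab * n <= word_count[a] * word_count[b]:
            continue
        significance = log_likelihood(n, word_count[a], word_count[b], n_ab)
        if significance >= s:
            graph[(a, b)] = significance
    return graph
```

For two words that occur independently, for example n = 100, n_A = n_B = 50 and n_AB = 25, the measure evaluates to zero, so such a pair would not become an edge.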
As observed previously [15, 24], word co-occurrence graphs exhibit the
scale-free small world property. This is in line with co-occurrence graphs
reflecting human associations [25] and human associations in turn forming
SWGs [28]. The claim is confirmed here on an exemplary basis with the
graph for the Leipzig Corpora Collection’s (LCC’s) 1-million-sentence corpus for
German. Figure 7 gives the degree distributions and graph characteristics for
various co-occurrence graphs.
The shape of the distribution is dependent on the language, as Fig. 8 shows.
Some languages—here English and Italian—have a hump-shaped distribution
in the log-log plot where the first regime follows a power law with a lower expo-
nent than the second regime, as observed in [15]. For the Finnish and German
corpora examined here, this effect could not be found in the data. This prop-
erty of two power-law regimes in the degree distribution of word co-occurrence
graphs motivated the Dorogovtsev-Mendes (DM)-model, see [12]. There, the
[Figure residue: panel titles "de1M neighbour-based graphs degree distribution" and "de1M sentence-based graphs degree distribution".]
[Figure 9: two log-log panels, each titled "degree distribution with window size 2"; axes: degree (x), # of vertices for degree (y). Left legend: Icelandic window 2, German window 2, power-law gamma=2. Right legend: Italian window 2, English BNC window 2, power-law gamma=1.6, power-law gamma=2.6.]
Fig. 9. Degree distributions in word co-occurrence graphs for window size 2. Left: The
distribution for German and Icelandic is approximated by a power law with γ = 2.
Right: For English (BNC) and Italian, the distribution is approximated by two
power-law regimes.
[Figure 10: two log-log panels; x-axis: degree.]
Fig. 10. Degree distributions in word co-occurrence graphs for distance 1 and
distance 2 for English (BNC) and Italian. The hump-shaped distribution is much more
distinctive for distance 2.
that two power-law regimes in word co-occurrence graphs with window size 2
are not a language universal, but only hold for some languages.
To examine the hump-shaped distributions further, Fig. 10 displays the
degree distribution for the neighbour-based word co-occurrence graphs and
for the word co-occurrence graphs that connect only words appearing at a
distance of 2. As the plots make clear, the hump-shaped distribution
is mainly caused by words co-occurring at distance 2, whereas the
neighbour-based graph shows only a slight deviation from a single power law. Together
with the observations from sentence-based co-occurrence graphs of different
languages in Figure 8, it becomes clear that a hump-shaped distribution with
two power-law regimes is caused by long-distance relationships between words,
if present at all.
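The distinction between neighbour-based graphs, distance-2 graphs and window-size-2 graphs can be illustrated with a short sketch. It assumes that "window size 2" means the union of the distance-1 and distance-2 edges, which is how the discussion above reads but is not stated explicitly here. No significance filtering is applied, and the function names are hypothetical.

```python
from collections import Counter


def cooccurrence_edges(sentences, distance):
    """Undirected edges between words exactly `distance` positions apart."""
    edges = set()
    for sentence in sentences:
        for i in range(len(sentence) - distance):
            a, b = sentence[i], sentence[i + distance]
            if a != b:
                edges.add((min(a, b), max(a, b)))
    return edges


def window_edges(sentences, window):
    """Union of the edge sets for distances 1 .. window (assumed meaning)."""
    edges = set()
    for distance in range(1, window + 1):
        edges |= cooccurrence_edges(sentences, distance)
    return edges


def degree_distribution(edges):
    """Map each degree to the number of vertices that have this degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


if __name__ == "__main__":
    corpus = [["a", "b", "c", "d"]]
    print(degree_distribution(cooccurrence_edges(corpus, 1)))
    print(degree_distribution(cooccurrence_edges(corpus, 2)))
    print(degree_distribution(window_edges(corpus, 2)))
```

Plotting the three distributions on log-log axes for a real corpus would show whether the hump stems from the distance-2 edges.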
Word co-occurrence statistics are an established standard and have been used
in many language processing systems. The authors have used co-occurrences
in practical applications like bilingual dictionary acquisition [4, 11], semantic
lexicon extension [8] and visualisation of concept trails [7]. The aim of this
chapter is to underpin their applications with a theoretical foundation.
[Figure 11: two word-neighbourhood graphs, one centred on jazz and one on rock.]
Fig. 11. Neighbourhoods of jazz and rock in the significant sentence-based word co-
occurrence graph as displayed on LCC’s English corpus website. Both neighbourhoods
contain album, music, singer and band, which leads to an edge weight of 4 in the
second-order graph.
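The edge weight mentioned in the caption of Fig. 11 can be reproduced with a small sketch: two words are connected in the second-order graph with a weight equal to the number of neighbours they share in the first-order graph. The neighbourhoods below are a hypothetical subset of the words shown in the figure, and the parameter min_shared is an illustrative threshold, not one of the parameters named in this chapter.

```python
from itertools import combinations


def second_order_graph(neighbours, min_shared=1):
    """Connect words by the number of first-order neighbours they share.

    `neighbours` maps each word to the set of its first-order neighbours.
    All pairs are compared, which is quadratic in the vocabulary size and
    therefore only suitable for small illustrations.
    """
    graph = {}
    for a, b in combinations(sorted(neighbours), 2):
        shared = len(neighbours[a] & neighbours[b])
        if shared >= min_shared:
            graph[(a, b)] = shared
    return graph


if __name__ == "__main__":
    # Hypothetical excerpt of the two neighbourhoods from Fig. 11.
    first_order = {
        "jazz": {"album", "music", "singer", "band", "saxophonist", "blues"},
        "rock": {"album", "music", "singer", "band", "coal", "strata"},
    }
    print(second_order_graph(first_order))
```

Applying the same function to the neighbourhoods of the second-order graph would give a third-order graph in the same spirit.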
[Figure 12: two log-log panels; x-axis: degree.]
Fig. 12. Degree distributions of word co-occurrence graphs of higher order. The
first-order graph is the sentence-based word co-occurrence graph of LCC’s 1-million German
sentence corpus (s = 6.63, t = 2). Left: N = 2 for max_2 = 3, max_2 = 10 and
max_2 = ∞. Right: N = 3 for t_2 = 3, t_3 = ∞, using the second-order graph with
max_2 = 3.
[Figure 13: six log-log panels titled BA order 2, BA order 3, ST order 2, ST order 3, DM order 2 and DM order 3; axes: degree (x), vertices per degree interval (y).]
Fig. 13. Second- and third-order graph degree distributions for BA-model, ST-model
and DM-model graphs.
In summary, all random graph models differ clearly from word
co-occurrence networks with respect to the higher-order transformation. The
ST-model shows maxima depending on the average degree of the first-order
graph. The BA-model’s power law is decreased with higher orders, but is
able to explain a degree distribution with power-law exponent 2. The full
DM model exhibits the same two power-law regimes in the second order as
observed for German sentence-based word co-occurrences in the third order.
In [6] and [20], the utility of word co-occurrence graphs of higher orders is
examined for lexical semantic acquisition. The highest potential for extracting
paradigmatic semantic relations can be attributed to second- and third-order
word co-occurrences. In [9], second-order graphs are evaluated against lexical
semantic resources.
Using words as internal features, the similarity of two sentences can be mea-
sured by the number of common words they share. Since the few top frequency
words are contained in most sentences as a consequence of Zipf’s law, their
influence should be downweighted or they should be excluded to arrive at
a useful measure for sentence similarity. Here, the sentence similarity graph
of sentences sharing at least two common words is examined, with the max-
imum frequency of these words bounded by 100. This maximum frequency
threshold was arbitrarily chosen and could be replaced by a weighting scheme
that attributes more weight to less frequent words. However, a hard thresh-
old reduces the computational cost significantly. The corpus of examination
is here LCC’s 3-million sentences of German. Figure 14 shows the component
size distribution for this sentence similarity graph, Figure 15 shows the degree
distributions for the entire graph and for its largest component.
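The construction described above can be sketched as follows: only words up to a maximum corpus frequency serve as features, and two sentences are connected if they share at least a minimum number of such words. The inverted index, the function name and the toy corpus are illustrative assumptions; the defaults mirror the thresholds named in the text (frequency bound 100, two common words).

```python
from collections import Counter, defaultdict
from itertools import combinations


def sentence_similarity_graph(sentences, max_freq=100, min_shared=2):
    """Map pairs of sentence ids to the number of shared feature words.

    A word counts as a feature only if its corpus frequency is at most
    `max_freq`; sentences without any edge do not appear in the result.
    """
    frequency = Counter(word for sentence in sentences for word in sentence)
    # Inverted index: feature word -> ids of the sentences containing it.
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for word in set(sentence):
            if frequency[word] <= max_freq:
                index[word].add(sentence_id)
    shared = Counter()
    for sentence_ids in index.values():
        # Each feature word links at most max_freq sentences, which keeps
        # the number of generated pairs per word bounded.
        for pair in combinations(sorted(sentence_ids), 2):
            shared[pair] += 1
    return {pair: count for pair, count in shared.items() if count >= min_shared}


if __name__ == "__main__":
    toy = [
        ["the", "cat", "sat", "mat"],
        ["the", "cat", "mat", "ran"],
        ["the", "dog", "ran"],
    ]
    print(sentence_similarity_graph(toy, max_freq=2))
```

The hard threshold is what keeps the pair enumeration cheap: a weighting scheme would have to consider the very frequent words as well, and those link almost all sentences.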
The degree distribution of the entire graph follows a power law with γ close
to 1 for low degrees and decays faster for high degrees; the largest component’s
degree distribution plot is flatter for low degrees. This can be attributed to
limited sentence length: as sentences are not arbitrarily long, they cannot
be similar to an arbitrarily high number of other sentences with respect to
the measure discussed here, as the number of sentences per feature word is
bounded by the word frequency limit. However, the extremely high values
for transitivity and clustering coefficient and the low γ values for the degree
distribution for low degree vertices and comparably long average shortest path
lengths indicate that the sentence similarity graph belongs to a different graph
class than all other graphs discussed above.
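Component sizes such as those reported for the sentence similarity graph can be computed with a union-find structure, as in the following sketch. It is a generic technique chosen here for illustration and is not claimed to be the method used for the figures; the function name is hypothetical.

```python
from collections import Counter


def component_sizes(vertices, edges):
    """Return the sizes of all connected components, largest first."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    sizes = Counter(find(v) for v in vertices)
    return sorted(sizes.values(), reverse=True)


if __name__ == "__main__":
    sizes = component_sizes(range(6), [(0, 1), (1, 2), (3, 4)])
    print(sizes)
    # Size distribution: how many components exist for each size.
    print(Counter(sizes))
```

Passing only vertices that occur in at least one edge excludes singletons, as is done for the graphs in this chapter.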
[Figure 14: log-log plot; axes: component size (x), # of vertices (y).]
Fig. 14. Component size distribution for the sentence similarity graph of LCC’s
3-million sentence German corpus. The component size distribution follows a power
law with γ ≈ 2.7 for small components, the largest component comprises 211,447 out
of 416,922 total vertices. The component size distribution complies with the theoretical
results of [2].
[Figure 15: log-log plot; axes: degree (x), # of vertices (y).]
Fig. 15. Degree distribution for the sentence similarity graph of LCC’s 3-million
sentence German corpus and its largest component. An edge between two vertices
representing sentences is drawn if the sentences share at least two words with corpus
frequency <101; singletons are excluded.
A similar measure is used in [5] for document similarity and obtains well-
correlated results when evaluated against a given document classification.
A precision-recall tradeoff arises when lowering the frequency threshold for
feature words or increasing the minimum number of common feature words
two documents must have in order to be connected in the graph: both improve
precision but result in many singleton vertices, which lowers the total number
of documents that are considered.
The preceding examples confirm the claim that graphs built on various aspects
of natural language data often exhibit the scale-free small world property or
similar characteristics. Experiments with generated text corpora suggest that
this is mainly due to the power-law frequency distribution of language units.
The slopes of the power law approximating the degree distributions can often
not be produced using the random graph generation models. Specifically, all
previously discussed generation models fail to explain the properties of word
co-occurrence graphs, where γ ≈ 2 was observed as the power-law exponent
of the degree distribution. Of the generation models inspired by language
data, the ST-model exhibits γ = 3, whereas the universality of the DM-
model to capture word co-occurrence graph characteristics can be doubted
after examining data from different languages.
References
1. Adamic, L. A. (2000). Zipf, power-law, pareto – a ranking tutorial. Technical
report, Information Dynamics Lab, HP Labs, Palo Alto, CA 94304.
2. Aiello, W., Chung, F., and Lu, L. (2000). A random graph model for massive
graphs. In STOC ’00: Proceedings of the Thirty-Second Annual ACM Symposium
on Theory of Computing, pages 171–180, New York, NY, USA. ACM Press.
3. Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks.
Science, 286, 509.
4. Biemann, C. and Quasthoff, U. (2005). Dictionary acquisition using parallel text
and co-occurrence statistics. In Proceedings of NODALIDA’05 , Joensuu, Finland.
5. Biemann, C. and Quasthoff, U. (2007). Similarity of documents and document col-
lections using attributes with low noise. In Proceedings of the Third International
Conference on Web Information Systems and Technologies (WEBIST-07), pages
130–135, Barcelona, Spain.
6. Biemann, C., Bordag, S., and Quasthoff, U. (2004a). Automatic acquisition of
paradigmatic relations using iterated co-occurrences. In Proceedings of the Fourth
International Conference on Language Resources and Evaluation (LREC-04),
Lisbon, Portugal.
7. Biemann, C., Böhm, C., Heyer, G., and Melz, R. (2004b). Automatically building
concept structures and displaying concept trails for the use in brainstorming
sessions and content management systems. In Proceedings of Innovative Internet
Community Systems (IICS-2004), Springer LNCS, Guadalajara, Mexico.
8. Biemann, C., Shin, S.-I., and Choi, K.-S. (2004c). Semiautomatic extension of
corenet using a bootstrapping mechanism on corpus-based co-occurrences. In
Proceedings of the 20th International Conference on Computational Linguistics
(COLING-04), Morristown, NJ, USA. Association for Computational Linguistics.
9. Bordag, S. (2007). Elements of Knowledge-free and Unsupervised Lexical Acquisi-
tion. Ph.D. thesis, University of Leipzig.
10. Burnard, L. (1995). Users Reference Guide for the British National Corpus. Oxford
University Computing Service, Oxford, U.K.
11. Cysouw, M., Biemann, C., and Ongyerth, M. (2007). Using Strong’s numbers
in the Bible to test an automatic alignment of parallel texts. Special issue of
Sprachtypologie und Universalienforschung (STUF), pages 66–79.
12. Dorogovtsev, S. N. and Mendes, J. F. F. (2001). Language as an evolving word
web. Proceedings of The Royal Society of London. Series B, Biological Sciences,
268(1485), 2603–2606.
13. Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coinci-
dence. Computational Linguistics, 19(1), 61–74.
14. Evert, S. (2004). The Statistics of Word Co-occurrences: Word Pairs and Collo-
cations. Ph.D. thesis, University of Stuttgart.
15. Ferrer-i-Cancho, R. and Solé, R. V. (2001). The small world of human language.
Proceedings of The Royal Society of London. Series B, Biological Sciences,
268(1482), 2261–2265.
16. Ferrer-i-Cancho, R. and Solé, R. V. (2002). Zipf’s law and random texts. Advances
in Complex Systems, 5(1), 1–6.
17. Glassman, S. (1994). A caching relay for the world wide web. Computer Networks
and ISDN Systems, 27(2), 165–173.
18. Ha, L. Q., Sicilia-Garcia, E. I., Ming, J., and Smith, F. J. (2002). Extension of
Zipf’s law to words and phrases. In Proceedings of the 19th International Con-
ference on Computational Linguistics (COLING-02), pages 1–6, Morristown, NJ,
USA. Association for Computational Linguistics.
19. Lempel, R. and Moran, S. (2003). Predictive caching and prefetching of query
results in search engines. In Proceedings of the 12th International Conference on
World Wide Web (WWW-03), pages 19–28, New York, NY, USA. ACM Press.
20. Mahn, M. and Biemann, C. (2005). Tuning co-occurrences of higher orders for
generating ontology extension candidates. In Proceedings of the ICML-05 Work-
shop on Ontology Learning and Extension using Machine Learning Methods, Bonn,
Germany.
21. Mandelbrot, B. B. (1953). An information theory of the statistical structure of
language. In Proceedings of the Symposium on Applications of Communications
Theory. Butterworths.
22. Miller, G. A. (1957). Some effects of intermittent silence. American Journal of
Psychology, 70, 311–313.
23. Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events.
In D. Lin and D. Wu, editors, Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP-04), pages 333–340, Barcelona, Spain.
Association for Computational Linguistics.
24. Quasthoff, U., Richter, M., and Biemann, C. (2006). Corpus portal for search
in monolingual corpora. In Proceedings of the Fifth International Conference on
Language Resources and Evaluation (LREC-06), pages 1799–1802, Genoa, Italy.
25. Rapp, R. (1996). Die Berechnung von Assoziationen: ein korpuslinguistischer
Ansatz . Olms, Hildesheim.
26. Sigurd, B., Eeg-Olofsson, M., and van de Weijer, J. (2004). Word length, sentence
length and frequency – Zipf revisited. Studia Linguistica, 58(1), 37–52.
27. Smith, F. J. and Devine, K. (1985). Storing and retrieving word phrases. Inf.
Process. Manage., 21(3), 215–224.
28. Steyvers, M. and Tenenbaum, J. B. (2005). The large-scale structure of semantic
networks: Statistical analyses and a model of semantic growth. Cognitive Science,
29(1), 41–78.
29. Voss, J. (2005). Measuring Wikipedia. In P. Ingwersen and B. Larsen, editors,
ISSI2005 , volume 1, pages 221–231, Stockholm. International Society for Sciento-
metrics and Informetrics.
30. Zanette, D. H. and Montemurro, M. A. (2005). Dynamics of text generation with
realistic Zipf’s distribution. Journal of Quantitative Linguistics, 12(1), 29–40.
31. Zipf, G. K. (1935). The Psycho-Biology of Language. Houghton Mifflin, Boston.
32. Zipf, G. K. (1949). Human Behavior and the Principle of Least-Effort. Addison-
Wesley, Cambridge, MA.
Efficiency of Navigation in Indexed Networks
Petter Holme (1, 2, 3)
1 Department of Computer Science, University of New Mexico, Albuquerque,
NM 87131, USA
2 School of Computer Science and Communication, Royal Institute of Technology,
10044 Stockholm, Sweden
3 Department of Physics, Umeå University, 90187 Umeå, Sweden;
petter.holme@physics.umu.se
1 Introduction
The interplay between network structure and search dynamics has emerged as
a busy subfield of statistical network studies (see e.g. Refs. [1, 9, 10, 13, 14]).
Consider a simple graph G = (V, E) (where V is a set of n vertices and E is
a set of m edges—unordered pairs of vertices). Assume information packets
travel from a source vertex s to a destination t. We assume the packets are
myopic agents (at a given time step they have access to information about the
vertices in their neighborhood, but not more), and have memory (so they can
e.g. perform a depth-first search) but no previous knowledge of the network.
Let τ (p) be the time for a packet p to travel between its source and destination.
One commonly studied quantity of search efficiency is the expectation value of
τ , τ̄ , for randomly chosen s and t. In this chapter we attempt to find efficient
ways to index V (with numbers from 1 to n) and utilize these indices for
packet navigation. In other words, we try to find ways to compress the global
information about the network into numbers 1, . . . , n so that the information
can be used by packets to find short paths to their destinations.
We propose two schemes of indexing the vertices, and corresponding meth-
ods for packet navigation. These schemes, along with two depth-first search
methods (not using vertex indices for more than remembering the path) are
examined on four network models. We will first present the indexing and
search schemes, then the network models for testing the algorithms and
finally the numerical results.
The numbers 1, . . . , n can be arranged into a search tree [3] such that the
expected value of τ scales like log n. In Fig. 1(a) we give an example of a search
tree. To go from source s to destination t a packet first moves to the root r
by going to the neighbor with the lowest index value. From the root to the
destination, the packet moves to the neighbor with the largest index smaller
than, or equal to, t. Our strategy for the ASD indexing and search scheme is
to construct a spanning tree T (G) for the network, index the tree to make it a
search tree, and use the algorithm above to navigate from s to t. The problem
is, however, that real networks are not trees. Imagine adding edges between
vertices of the same heights and branches to the tree in Fig. 1(a)—the tree
will still be a spanning tree, but the packets may not take the same path from
s to t any more. As we will see, with certain ways of constructing the tree and
indexing the vertices, the search either from s to r or r to t will be optimal.
We construct T (G) in the following way.
1. Let the root r be a vertex of smallest eccentricity (maximal distance to
another vertex).
2. Construct the tree such that the distances to the root are the same in T(G)
and G. In other words, construct it such that all edges in T go between
different neighborhoods $\Gamma_l(r) = \{i \in V : d(i, r) = l\}$ and $\Gamma_{l+1}(r)$ for some
level $0 \le l \le h$, where h is the height of the tree (by the choice of r, h is
also the radius of the graph). Such a tree can be constructed by finding
the set of followed edges in a breadth-first search [3] starting from r.
When it is not clear which vertex, or edge, to choose in the above construc-
tion, we choose one at random from all the possible candidates. When T is
constructed, let the indices be a preordering of the vertices in T (i.e. the order
of first occurrence of the vertex in a depth-first search of the graph) [3].
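The construction and the two search phases described above can be sketched as follows. The sketch assumes a connected simple graph given as an adjacency dictionary and, unlike the text, breaks ties by iteration order rather than at random so that the result is reproducible. All function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def asd_index(adj):
    """Return (index, root) where index maps each vertex to 1..n."""
    # Step 1: the root is a vertex of smallest eccentricity.
    root = min(adj, key=lambda v: max(bfs_distances(adj, v).values()))
    # Step 2: the edges followed by a breadth-first search form the tree.
    children = {v: [] for v in adj}
    seen = {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                children[u].append(v)
                queue.append(v)
    # The indices are a preordering of the tree (iterative depth-first walk).
    index = {}
    stack = [root]
    while stack:
        u = stack.pop()
        index[u] = len(index) + 1
        stack.extend(reversed(children[u]))
    return index, root


def asd_route(adj, index, s, t):
    """Path taken by a packet: up to the root, then down to t."""
    path = [s]
    # Upward phase: always move to the neighbor with the lowest index.
    while index[path[-1]] != 1:
        path.append(min(adj[path[-1]], key=index.get))
    # Downward phase: move to the neighbor with the largest index <= index of t.
    while path[-1] != t:
        candidates = [v for v in adj[path[-1]] if index[v] <= index[t]]
        path.append(max(candidates, key=index.get))
    return path


if __name__ == "__main__":
    # Small example: a tree with one extra edge between vertices 3 and 4.
    adj = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1, 4], 4: [1, 3], 5: [2]}
    index, root = asd_index(adj)
    print(index, root)
    print(asd_route(adj, index, 4, 5))
```

The upward phase terminates because the tree parent of every non-root vertex has a smaller index than the vertex itself, so the lowest-index neighbor always lowers the current index until the root is reached.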
[Figure 1: six panels (a)-(f) showing example trees and networks with indexed vertices; labels mark the root r = 1, the source s and the destination t.]
Fig. 1. Illustration of the ASD (panels (a)–(c)) and ASU (panels (d)–(f)) indexing
and search schemes. (a) shows a search tree where a local search algorithm can find
the shortest path from one vertex to another fast. (b) shows a network indexed by the
ASD scheme. The tree used in the construction is identical to the one shown in (a).
Panel (c) shows an ASD search from s to t (with τ = 4). On the way from s to r
the packet chooses the neighbor (of the current vertex) with lowest index, which here
gives a longer route than the optimal {(9, 10), (10, 1)}. (d) shows a possible partition
of branches of non-root vertices into classes of as similar size as possible (as done in
the ASU indexing scheme). (e) shows a possible indexing based on the partition in (d).
Panel (f) displays a search from s to t with τ = 6. The shortest path from t to r is
accurately found, but a detour to 6 makes the search from r to t suboptimal.
Now we prove that this indexing and the search algorithm always give the
shortest paths from the root to a vertex t. Let E_T be the edges of T and let
T_i be the maximal subtree with i as root. By construction, all vertices in T_i
have indices in [i, i + |T_i|] (where | · | denotes the cardinality of a subgraph).
Let i′ be the largest index in i’s neighborhood smaller than t. Assume there
is an edge (i, j) ∈ E \ E_T that the search will follow, i.e. that i′ < j < t. This
means that j ∈ T_{i′}. By construction, i′ is the only vertex in T_{i′} at a distance
d(r, i′) (the distance from the rest of T_{i′} to the root is at least d(r, i′) + 1).
Since d(r, i′) = d(r, i) + 1, we have d(r, j) ≥ d(r, i) + 2, which contradicts the
existence of an edge (i, j) ∈ E. Thus, searches from r to t will always follow
the edges of T , which also means the r–t-searches will be as short as possible.
Searching upwards, from i to r, in a graph indexed as above is harder. We
know that one shortest path goes via a vertex j with smaller index than i, but
there might exist suboptimal paths via indices i′ in the intervals r < i′ < j and
j < i′ < i, and there might also be paths via vertices of index larger than j
192 P. Holme
that are optimal. For example, assume the search tree in Fig. 1(a) comes from
a graph with the additional edges (5, 9), (8, 9) and (9, 10) (see Fig. 1(b)). Then,
the shortest path from 9 to r via a vertex of lower index is {(9, 7), (7, 1)}, but
there is an equally long path via a vertex of larger index, {(9, 10), (10, 1)}, and
longer paths via vertices both smaller and larger than 7 but smaller than 9.
Thus, there is no general way of finding the shortest way from s to r. Instead,
we always choose the vertex with the smallest index in the neighborhood.
By this strategy a packet will come closer to r, in index space, for every step.
Furthermore, in tree-like parts of the graph, the search will follow the shortest
paths. An illustration of the ASD search is shown in Fig. 1(c).
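As a rough sketch of the two search phases described above, the following Python function walks first towards the root via the lowest-index neighbor and then downwards via the largest neighbor index not exceeding t. The function name asd_search is hypothetical, the adjacency data are one reading of Fig. 1(b) (vertex labels are taken to be the indices), and the shortcut "jump to t as soon as it is a neighbor" follows the remark that a packet knows its neighborhood.

```python
def asd_search(adj, s, t, root=1):
    """Return the vertices visited by an ASD-style search from s to t.

    adj maps each index to the set of neighboring indices; the indices are
    assumed to be a preordering of a BFS spanning tree rooted at `root`.
    """
    path = [s]
    # Upward phase: move to the neighbor with the lowest index until the root.
    while path[-1] != root and path[-1] != t:
        u = path[-1]
        path.append(t if t in adj[u] else min(adj[u]))
    # Downward phase: move to the neighbor with the largest index <= t.
    while path[-1] != t:
        u = path[-1]
        if t in adj[u]:
            path.append(t)
        else:
            path.append(max(v for v in adj[u] if v <= t))
    return path

# Hypothetical reading of Fig. 1(b): tree edges plus (5, 9), (8, 9), (9, 10).
edges = [(1, 2), (1, 3), (3, 4), (3, 5), (1, 6), (1, 7), (7, 8), (7, 9),
         (1, 10), (5, 9), (8, 9), (9, 10)]
adj = {v: set() for v in range(1, 11)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

path = asd_search(adj, 9, 6)
print(path, "tau =", len(path) - 1)
```

For s = 9 and t = 6 the walk takes four steps, which agrees with the value τ = 4 quoted in the caption of Fig. 1(c); the upward part goes via 5 and 3 and is longer than the optimal route through 10.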
levels randomly. This construction scheme is illustrated in Figs. 1(d) and (e).
An illustration of the ASU search scheme is shown in Fig. 1(f).
As a reference, we also run simulations for two depth-first search methods that
do not utilize indices [1]. One of them, Rnd, is a regular depth-first search
where the neighbors are traversed in random order. In the other, Deg, the
neighbors are chosen in order from high to low degree. Just as in the ASU and
ASD methods, a packet is assumed to have knowledge about its neighborhood.
If the destination is in the neighborhood of a vertex, then the search will be
over the next time step.
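The two reference methods can be sketched as a single depth-first walk with a switch for the neighbor order. This is an illustration only: the function name dfs_search, the toy graph, and the choice to count every backtracking move as a time step are assumptions, since the text does not spell out how steps are counted.

```python
import random

def dfs_search(adj, s, t, order="rnd", rng=None):
    """Depth-first walk from s to t; returns every position of the packet."""
    if rng is None:
        rng = random.Random(0)
    walk, stack, visited = [s], [s], {s}
    while stack[-1] != t:
        u = stack[-1]
        if t in adj[u]:
            nxt = t                              # destination seen in the neighborhood
        else:
            fresh = sorted(v for v in adj[u] if v not in visited)
            if not fresh:
                stack.pop()                      # dead end: step back to the previous vertex
                if not stack:
                    raise ValueError("t is not reachable from s")
                walk.append(stack[-1])
                continue
            if order == "deg":
                nxt = max(fresh, key=lambda v: len(adj[v]))   # highest degree first
            else:
                nxt = rng.choice(fresh)                       # random order
        visited.add(nxt)
        stack.append(nxt)
        walk.append(nxt)
    return walk

# Toy graph (hypothetical): a hub (vertex 1) with three leaves, and a short path 0-5-6.
edges = [(0, 1), (1, 2), (1, 3), (1, 4), (0, 5), (5, 6)]
adj = {v: set() for v in range(7)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

deg_walk = dfs_search(adj, 0, 6, order="deg")
rnd_walk = dfs_search(adj, 0, 6, order="rnd", rng=random.Random(1))
print(len(deg_walk) - 1, len(rnd_walk) - 1)
```

In this toy graph the Deg walk is drawn into the hub first and needs ten steps to reach vertex 6, which shows that preferring high-degree vertices helps only when the destination is likely to be visible from a hub.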
3 Network Models
The efficiency of our indexing and search schemes is more or less directly
affected by the network structure. To investigate this relationship we test the
search schemes on four different types of network models: modified Erdős–Rényi
(ER) graphs [5], square lattices, Barabási–Albert (BA) [2] networks and
Holme–Kim (HK) [8] networks. To facilitate comparison, we use the same
average degree, four (dictated by the square grid), in all networks.
The ER model is the simplest model for randomly generating simple graphs
with n vertices and m edges. The edges are added one by one to randomly
chosen vertex pairs (the only restriction being that loops or multiple edges are
not allowed). A problem for our purpose is that ER graphs are not necessarily
connected (something required to measure τ̄ ). To remedy this we propose a
scheme to make networks connected.
1. Detect the connected components.
2. Go through the connected components sequentially. Denote the current
component C_I.
a) Pick a component C_J randomly.
b) Pick a random edge (i, j) whose removal would not fragment C_J. If no
such edge exists, go to step 2.
c) Pick a random vertex i′ of C_I.
d) Replace (i, j) by (i′, j). If the edge (i′, j) would exist already (an unlikely
event), go to step 2a. If there is no vertex i′ ∈ C_I such that (i′, j) does
not already exist, then go to 3.
3. If the network is still disconnected, go to step 1.
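A simplified version of this rewiring idea is sketched below. It is not a line-by-line transcription of the steps above: instead of sweeping through the components sequentially, it repeatedly picks a random "donor" component that contains an edge whose removal does not fragment it, and moves one end of that edge to a random vertex of another component. All function names are hypothetical, and the total number of edges is kept fixed.

```python
import random

def components(adj):
    """List the connected components of an undirected graph as sets of vertices."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps

def is_removable(adj, i, j):
    """True if j can still be reached from i without using the edge (i, j)."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in seen or {u, v} == {i, j}:
                continue
            seen.add(v)
            stack.append(v)
    return j in seen

def make_connected(adj, rng):
    """Rewire edges in place until the graph is connected; edge count is unchanged."""
    while True:
        comps = components(adj)
        if len(comps) == 1:
            return adj
        donors = []                       # components with at least one removable edge
        for comp in comps:
            removable = [(i, j) for i in sorted(comp) for j in sorted(adj[i])
                         if is_removable(adj, i, j)]
            if removable:
                donors.append((comp, removable))
        if not donors:
            raise ValueError("too few edges to connect the graph by rewiring")
        c_j, removable = rng.choice(donors)
        c_i = rng.choice([c for c in comps if c is not c_j])
        i, j = rng.choice(removable)      # edge (i, j) inside C_J
        i_new = rng.choice(sorted(c_i))   # vertex i' of C_I
        adj[i].discard(j)
        adj[j].discard(i)
        adj[i_new].add(j)                 # replace (i, j) by (i', j)
        adj[j].add(i_new)

# Toy example: two triangles and an isolated vertex (7 vertices, 6 edges).
adj = {v: set() for v in range(7)}
for a, b in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]:
    adj[a].add(b)
    adj[b].add(a)
make_connected(adj, random.Random(42))
print(len(components(adj)), sum(len(nb) for nb in adj.values()) // 2)
```

Each rewiring merges two components, so the loop ends after at most (number of components − 1) rounds whenever enough cycle edges are available.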
In practice, even for our largest system sizes, the preceding algorithm con-
verges in a few iterations. The number of edges needed to be added never
exceeds a few percent of m, and this addition is made with the greatest pos-
sible randomness; hence, we believe the essential network structure of the ER
model is conserved.
3.3 BA Model
3.4 HK Model
4 Numerical Results
We study the search schemes on the four different network topologies numer-
ically. We use 100 independent networks and 100 different s–t-pairs for every
network. The network sizes range from n = 16 to n = 16,384.
In Fig. 2 we display the average search times as a function of system size
for our simulations. The most conspicuous feature is that the ASD scheme is
always, by far, the most efficient. While ASU and Deg are close to the least
efficient method (Rnd), ASD is rather close to the theoretical limit (equal to
the average distance d̄, the upper border of the shaded areas in Fig. 2). To be
more precise, for ASD the ratio τ̄/d̄ is quite constant: τ̄ is about twice the average
distance. The other search schemes (ASU, Deg and Rnd) follow faster increasing
functional forms. For the square lattice, these three schemes increase, approx-
imately proportional to n (the analytical value for two-dimensional random
[Figure 2 plot: four panels (modified ER, square lattice, BA, HK) showing transit time τ̄ versus network size N for the ASU, ASD, Deg and Rnd schemes.]
Fig. 2. The average search time τ̄ as a function of the graph sizes n. In all panels,
we display data for the different indexing and search schemes. The shaded areas are
unreachable (corresponding to τ̄ values smaller than the theoretical minimum, the
average distance d̄). The different panels correspond to the modified ER model, square
grid, BA model and HK model networks, respectively. Error bars would have been
smaller than the symbol sizes.
walks) whereas for ASD, τ̄ scales like distances in square grids, n^(1/2). One way
of interpreting this result is to say that while ASD manages to find the root
as fast as it finds the destination from the root, ASU fails to find t faster than
a random search. The slow downward performance of ASU is not unexpected.
The r–t-search in ASU only differs from a random depth-first search in that it
Fig. 3. A worst-case scenario for navigating from s to r with the ASD indexing and
search scheme. A packet from n − 2 to 1 will travel along the perimeter to 3 and then
move towards the center.
does not search further than the level of the destination, and that it restricts
the search space to half its original size by dividing the vertices into odd and
even indices. The fast upward search of ASD is more surprising. In Fig. 3
we show a network where ASD performs badly. The average time to search
upwards is (n² + 20n − 13)/8n → n/8 as n → ∞. The downward search takes
3(n − 1)/2n ∼ 3/2, giving a total expected value of τ̄ ∼ n/8. This can be
compared to the average distance d̄ = 3 − 21/4n + 2/n² ∼ 3. For this example,
τ̄ and d̄ diverge in a way not seen in the network models. Why is the search so
much faster in the model networks? One point is that the worst-case indexing
seen in Fig. 3 is very unlikely. Since the spokes would be sampled randomly,
the chance that a vertex at the perimeter does not find r in two steps is 1/2,
the probability that it finds r in 3 steps is 1/4, and so on. Continuing this
calculation, a vertex at the perimeter reaches r in 2 Σ_k k 2^(−k) + 2 ∼ 6 time steps,
giving τ̄ ∼ 5, not too far from the observed τ̄/d̄ ∼ 2. We note, however, that
for the model networks many other factors that are not present in the wheel
graph of Fig. 3 affect τ̄ . For example, the high density of short triangles in the
HK model networks will introduce many edges between vertices of the same
level in T (G), which will affect the search efficiency.
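The estimate of about six time steps for a perimeter vertex can be checked numerically. The snippet below assumes that the expression in the text is 2 Σ_k k 2^(−k) + 2 with k running over the positive integers; that reading is an assumption, and the variable names are illustrative.

```python
# Partial sum of sum_{k>=1} k * 2**(-k); the infinite series converges to 2.
series = sum(k * 2.0 ** (-k) for k in range(1, 200))

# Expected number of time steps for a perimeter vertex under the assumed formula.
steps_to_root = 2 * series + 2
print(round(steps_to_root, 6))
```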
τ̄ is approximately linear in n for the ASU, Deg and Rnd schemes on all network
models. The slopes of these curves are, however, a little different. First, the
Deg method is more efficient (compared to ASU and Rnd) for BA networks
than for the modified ER model. This observation (also made in Ref. [1])
can be explained by the skewed degree distribution in the BA-network—the
packet reaches high-degree vertices quickly. The packet can see a large part
of the network from these hubs, and is therefore more likely to see t. More
interesting, perhaps, is the observation that ASU is more efficient for the
networks with a higher density of short cycles (the square lattice and HK
models). A rough explanation is that the partition procedure of ASU cuts off
many edges between vertices at the same distance from r. Since there are many
such edges in these network models, the network will effectively be sparser
(without changing G’s diameter), which results in a better performance.
5 Discussion
Acknowledgments
References
1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. Search in
power-law networks. Phys. Rev. E, 64:046135, 2001.
2. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science,
286:509–512, 1999.
Fig. 1. The “5-core” of Apache on November 2005. Each node is a function, with size
indicating its relative length in lines of code. Each directed edge is a function call.
Fig. 2. (Left) Evolution of the number of functions N (left-hand axis) and the number
of function calls E (right-hand axis) during the 50 month period. (Right) E as a
function of N since the first stable release of Apache 2.0 in May 2002 through Nov
2005 (months 8–50). Dots are individual data points. The line is the best fit, E ∼ N^1.18.
(left) shows their evolution over the entire 50-month period. Our first evidence
for a restructuring of the code is observed during the fourth and the
fifth months, when there is a dramatic decrease in N, of approximately 250
functions, accompanied by a much smaller decrease in E, of approximately 75
function calls. Thus the average degree (E/N) increases dramatically during
this period. Investigating the Apache release history [16], we find that this period
(from 2002-1-1 to 2002-2-1) marks the transition from the second to the
third beta release of Apache 2.0. According to the release logs, approximately
130 changes were made to the code, with ten of these changes being the addi-
tion of new features. The bulk of the remaining changes were “bug” fixes along
with a few performance improvements. The functionality of the system was
enhanced during a period where the number of functions decreased. We as-
sume redundancy in functions was eliminated, while “functionality” (perhaps
more closely related to number of edges) was preserved and enhanced.
The first stable (non-beta) releases of Apache 2.0 were issued shortly there-
after, in April and May 2002. From there on, the relationship between E and
N is extremely consistent as shown in Fig. 2 (right). We find that E ∼ N^1.18.
Remarkably, Valverde and Solé find almost identical scaling, of E ∼ N^1.17,
for a collection of 80 object-oriented systems [17], where N is the number of
classes and E is the total number of edges, with each edge representing a
relationship between classes. This suggests some universal trend in software
systems.
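A scaling exponent of this kind is commonly estimated by a straight-line fit in log-log space. The sketch below shows that procedure on synthetic, noise-free data; the text does not say which fitting method was used, so the method, the function name fit_power_law and the prefactor 0.65 are all assumptions made for illustration, not the Apache measurements.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log c + a * log x; returns (a, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    # Slope of the regression line in log-log space is the scaling exponent.
    a = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    c = math.exp(my - a * mx)
    return a, c

# Synthetic data generated from E = 0.65 * N**1.18 (illustrative values only).
N = [2400, 2500, 2600, 2700, 2800, 2900, 3000]
E = [0.65 * n ** 1.18 for n in N]
a, c = fit_power_law(N, E)
print(round(a, 3), round(c, 3))
```

On real data the recovered exponent depends on the range of N and on the noise, so a fit over months 8–50 would be expected to scatter around the reported value.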
Fig. 3. (Left) In-degree distribution and (right) out-degree distribution for the first
month (2001-10-1) and final month (2005-11-1). The dashed line is the best fit functional form for the final
month: (left) p(k) = 0.55 · k^(−1.84), and (right) p(k) = (2πσ²k²)^(−1/2) exp[−(ln k − μ)²/(2σ²)], with
μ = 0.75 and σ = 0.93.
\[ Q = \sum_{i=1}^{k} \frac{1}{h_i}\,\bigl[h_i - f(x_i)\bigr]^2 \]
with a smaller Q being better. We find that, for in-degree, a power law pro-
vides the best fit for each of the 50 months with Q ≈ 0.04. Fitting a log-
normal distribution to in-degree gives Q ≈ 0.08, and stretched-exponential
gives Q ≈ 0.15. For out-degree, log-normal provides the best fit for all 50
months with Q ≈ 0.02. A stretched-exponential distribution gives the next
best fit with Q ≈ 0.06, and a power-law fit is the worst, with Q ≈ 0.16.
There are small, almost indiscernible changes to the distributions over
the 50 months. For in-degree we find the exponent of the best fit power law
204 H. Wen et al.
slowly decreases from γ ≈ 1.9 to γ ≈ 1.84, reflecting that the maximum values
of in-degree slowly increase with time. For out-degree, the mean out-degree of
the best fit log-normal distribution slowly increases from μ ≈ 0.64 to μ ≈ 0.75.
However, the shapes of both the in- and out-degree distributions (power-
law and log-normal, respectively) are global properties which are established
before our data sampling begins and remain invariant throughout.
Call graph edges: A → B, A → D, B → C, C → D, D → E, E → F.
Dependency Matrix
    A  B  C  D  E  F
A   0  1  0  1  0  0
B   0  0  1  0  0  0
C   0  0  0  1  0  0
D   0  0  0  0  1  0
E   0  0  0  0  0  1
F   0  0  0  0  0  0
Fig. 4. (Left) A simple call graph. (Right) The equivalent dependency matrix.
Evolution of Apache Open Source Software 205
Fig. 5. V(4), the visibility matrix up to path length d = 4 for the simple call graph
in Fig. 4 (Left).
The propagation cost (PC) was introduced in [11] as a scalar value to quantify
the extent of indirect dependencies in a network. It is defined as the number
of 1’s in V(4) divided by N^2 (the total number of 1’s possible). In other
words, PC is the number of pairs of functions connected by a path of length
less than or equal to 4, divided by the number of all possible pairs. We find
that changes in PC (a global variable) can be useful indicators of important
small-scale changes in the code base. Note that we also analyze PC for V(5),
but get almost identical results.
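As a small worked example, the propagation cost of the six-function call graph of Fig. 4 can be computed directly from its dependency matrix. The sketch below is an illustration with hypothetical function names; whether a function counts as visible to itself (path length 0) is not stated in the text, so it is exposed as the assumed parameter include_self and switched off by default.

```python
def visibility_matrix(adj, max_len=4, include_self=False):
    """Boolean matrix: entry [i][j] is True if j is reachable from i in <= max_len hops."""
    n = len(adj)
    visible = [[include_self and i == j for j in range(n)] for i in range(n)]
    power = [[i == j for j in range(n)] for i in range(n)]   # paths of length 0
    for _ in range(max_len):
        # Boolean matrix product: paths one hop longer than in the previous round.
        power = [[any(power[i][k] and adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                visible[i][j] = visible[i][j] or power[i][j]
    return visible

def propagation_cost(adj, max_len=4, include_self=False):
    """Number of True entries in the visibility matrix divided by N^2."""
    n = len(adj)
    visible = visibility_matrix(adj, max_len, include_self)
    return sum(sum(row) for row in visible) / n ** 2

# Dependency matrix of Fig. 4 (rows and columns in the order A, B, C, D, E, F).
dependency = [
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
pc = propagation_cost(dependency)
print(round(pc, 4))
```

For this small graph all reachable pairs are already connected by paths of length at most 4, so using V(5) instead of V(4) gives the same value.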
Figure 6 shows the evolution of PC, along with that of N , for the 50
months of Apache data. The baseline behavior indicates an inverse relation-
ship (as N increases PC decreases and vice versa). There is only one re-
gion that violates this trend, encompassing months 24 to 33. Removing these
months from consideration, we see an extremely consistent relation between
PC and N, as shown in Fig. 6 (right), that PC ∼ N^(−0.70). The first anomalous
event which does not conform to this scaling relationship is month 24
(September 2003), when N decreases slightly yet PC jumps disproportion-
ately. The second anomalous event is from months 33 to 34 (June 2004 to July
2004), when PC drops dramatically while N remains essentially constant.
No other global property discussed herein shows marked changes in this
time frame, not even during the second anomaly which is most dramatic. N
and E are both essentially invariant (see Fig. 2). The degree distribution is
invariant, and the average clustering coefficient is invariant.
We attempt to isolate what changes in the details of Apache are responsible
for these two anomalous events. Motivated by findings in [10], which suggest
that functions with simultaneously high in- and high out-degree are particu-
larly problematic, we isolate functions whose in- or out-degree changed dur-
ing the time frame of interest. Functions with simultaneously high in-degree
and out-degree have a tremendous amount of upstream and downstream de-
pendencies. They are simultaneously information consumers and information
Fig. 6. (Left) Propagation cost (left-hand axis) and N (right-hand axis) as functions
of time. (Right) PC as a function of N since the first stable release of Apache 2.0, with
anomalous months (23 through 34) removed. We find that PC ∼ N^(−0.70).
Fig. 7. (Left) Scatter plot of in-degree and out-degree, using log-log scale, with only
functions whose degree changed in this time period (2004-6-1 to 2004-7-1) shown. (Right) Propagation cost
over time. Top line is for the entire system. Bottom line is the resulting PC if the two
functions indicated in (left) are removed, denoted “w/o 2” in the legend.
These functions (apr_thread_mutex_lock and apr_thread_mutex_unlock)
are members of the Apache Portable Runtime layer that implements function-
ality related to multithreading. Investigating the detailed commit logs written
by developers [21], we find that on August 7, 2003 (between months 23 and
24) attempted “bug” fixes to these two functions were made, with accompa-
nying comments indicating a history of problems with these two functions. On
June 4, 2004 (between months 33 and 34) these two “racy/broken” functions
were dropped from the code entirely and replaced with lower-level system
library calls.
A simple example call graph is given in Fig. 4 (left). There are directed paths
connecting various functions. For instance, function A is connected to func-
tion F via two paths, one of length 3 and one of length 5, where length is
measured by number of hops in the call graph. The path of length 3 is ob-
viously the shortest path connecting A and F . We consider all such pairs of
functions which are connected by a directed path and calculate the short-
est path between them. The fraction of shortest paths of a specified length
(i.e., the normalized distribution) is shown in Fig. 8 (left), for the first month
(October 2001) and the final month (November 2005) of our study. Similar
distributions result for all 50 months, with the typical shortest path of length
between 4 and 5, and the largest shortest path (i.e., the graph diameter) of
length 14.
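The shortest-path statistics described here can be reproduced on the example of Fig. 4 with a breadth-first search from every function. This is a minimal sketch with hypothetical function names; the successor lists encode the edges of the Fig. 4 dependency matrix.

```python
from collections import Counter, deque

def shortest_path_lengths(succ, src):
    """Directed BFS from src; returns {target: number of hops}."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_distribution(succ):
    """Fraction of connected ordered pairs whose shortest path has each length."""
    counts = Counter()
    for s in succ:
        for t, d in shortest_path_lengths(succ, s).items():
            if t != s:
                counts[d] += 1
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}

# Call graph of Fig. 4: each function maps to the functions it calls.
succ = {"A": ["B", "D"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"], "F": []}

print(shortest_path_lengths(succ, "A")["F"])
distribution = path_length_distribution(succ)
print(distribution)
```

Only ordered pairs that are actually connected by a directed path enter the normalization, matching the definition given in the text.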
We compare this distribution of shortest paths to those resulting from two
different random graph growth processes. First we consider an ensemble of 20
realizations of Erdős–Rényi random graphs [22, 23] with N = 2909 nodes and
E = 4142 undirected edges (equivalent to the N = 2909 nodes and E = 8284
Fig. 8. (Left) Normalized shortest paths in Apache, first month and last month.
(Right) Normalized shortest paths averaged over 20 realizations of random networks
with the exact in- and out-degree distributions of Apache on November 2005. The
vertical axis “frequency” means the fraction of shortest paths having that length.
directed edges in the November 2005 Apache call graph). Here we find the
typical shortest path is of length 7 or 8, much larger than for the Apache call
graphs. However, the diameter is comparable, ranging from length 14 to 16.
The degree distributions of the Apache call graphs (see Fig. 3) are much
broader and more heterogeneous than the Poisson distribution which char-
acterizes Erdős–Rényi random graphs [22, 23]. Thus we next compare the
Apache graphs to random graphs constructed to match exactly the Apache
degree distribution by extending the ideas in [24, 25] to directed graphs. We
begin with N = 2909 nodes and map each one to a distinct node in Apache.
We assign to each of these new nodes the in- and out-degree of their cor-
responding Apache node. We do not yet specify the connectivity, only the
final degree. In other words, we assign unconnected half-edges. We next per-
form a random matching and pair up each in-degree half-edge with a different
out-degree half-edge chosen at random. We construct an ensemble of 20 such
random graphs. The resulting normalized shortest path distribution, averaged
over the full ensemble, is shown in Fig. 8 (right). Note that the typical path
length is much larger than for Apache, peaking at length 10, and the max-
imum shortest path is around 30. Matching degree distribution alone is not
enough to reproduce the shortest path lengths observed for Apache.
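The half-edge matching described above can be sketched as follows. The function name is hypothetical, and the sketch makes one simplifying assumption that the text does not address: self-loops and repeated edges produced by the random matching are kept as they are. The degree sequences in the example are those of the small call graph in Fig. 4.

```python
import random
from collections import Counter

def random_graph_with_degrees(in_deg, out_deg, rng):
    """Randomly match out-stubs to in-stubs; returns a list of directed edges (u, v)."""
    if sum(in_deg.values()) != sum(out_deg.values()):
        raise ValueError("in- and out-degrees must sum to the same number of edges")
    # One half-edge ("stub") per unit of degree.
    out_stubs = [v for v, d in out_deg.items() for _ in range(d)]
    in_stubs = [v for v, d in in_deg.items() for _ in range(d)]
    rng.shuffle(in_stubs)                 # uniformly random matching
    return list(zip(out_stubs, in_stubs))

# In- and out-degrees of the call graph in Fig. 4.
out_deg = {"A": 2, "B": 1, "C": 1, "D": 1, "E": 1, "F": 0}
in_deg = {"A": 0, "B": 1, "C": 1, "D": 2, "E": 1, "F": 1}

edges = random_graph_with_degrees(in_deg, out_deg, random.Random(7))
got_out = Counter(u for u, _ in edges)
got_in = Counter(v for _, v in edges)
print(edges)
```

Every realization has exactly the prescribed in- and out-degree of each node, while the wiring between nodes is random; averaging a statistic over many such realizations gives the ensemble value.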
“Small world” networks are characterized by small diameters and large
clustering. We have established the small diameter above. Throughout the 50-
month period the average clustering coefficient, C, fluctuates in the range
0.09 < C < 0.099. Calculating C over an ensemble of corresponding Erdős–
Rényi random graphs yields C = 0.0018, and for the ensemble of random
graphs with the Apache degree distribution C = 0.023. The Apache call graphs
thus have the “small world” characteristics of short average path length and
relatively large clustering coefficient when compared to a comparable random
graph. Note that to measure C we temporarily assume the edges are undi-
rected. A more thorough treatment is presented in the next section, where
“transitive” triads are distinguished from “cyclic” triads. (Cyclic triads are
rarely seen in software, though transitive ones occur frequently.)
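For readers who want to reproduce a number like C, the sketch below computes an average clustering coefficient after discarding edge directions. It assumes the common definition (the mean over vertices of the fraction of neighbor pairs that are themselves linked, with vertices of degree below 2 contributing zero); the text does not state which variant was used, and the function names and toy graph are hypothetical.

```python
from itertools import combinations

def undirected_adjacency(directed_edges):
    """Drop edge directions and return {vertex: set_of_neighbors}."""
    adj = {}
    for u, v in directed_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def average_clustering(adj):
    """Mean local clustering coefficient over all vertices."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue                      # assumed convention: contributes 0
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Toy call graph: a triangle 0 -> 1 -> 2 -> 0 with one extra call 2 -> 3.
adj = undirected_adjacency([(0, 1), (1, 2), (2, 0), (2, 3)])
C = average_clustering(adj)
print(round(C, 4))
```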
Here, we describe formally the ERGM statistical framework for modeling net-
works, in particular as it pertains to modeling software call graphs. Let X
be a random variable representing the adjacency matrix of a software net-
work. The pdf for this random variable, P (X = x), tells us the probability
that an observed graph, x, was drawn from X. Unfortunately, the pdf of
X is unknown and cannot be directly calculated. To estimate this pdf, let
z(x) = (z1 (x), z2 (x), . . . , zr (x)) be a vector of explanatory variables, where
each explanatory variable can be any function of the observed data. We pos-
tulate that there exists θ = (θ1 , θ2 , . . . , θr ) such that
This is the standard log linear probability model that is used in a wide range
of fields from the social sciences to biology [28, 29].
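The display equation referred to later as Eq. (1) is not reproduced in this copy. For orientation only, the log-linear form that is standard in the ERGM literature, and that matches the description above, can be written as follows; the normalizing constant κ(θ) is a symbol introduced here as an assumption, and the sum runs over all graphs y on the same node set.

```latex
P_\theta(X = x) \;=\; \frac{\exp\{\theta^{\mathsf T} z(x)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) \;=\; \sum_{y} \exp\{\theta^{\mathsf T} z(y)\},
\qquad
\theta^{\mathsf T} z(x) \;=\; \sum_{i=1}^{r} \theta_i\, z_i(x).
```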
To create an ERGM, a set of explanatory variables (virtually any function
from the observed graph to the real numbers) is chosen by the modeler. The
choice of variables is based on the pertinent features of the graph under study,
or on a set of desired features, if the graphs are being simulated. An example,
Table 1. Exponential random graph models are extremely flexible. This table shows
several example explanatory variables, identifying the variables by their names in the
statnet package for R [30].
Variable     Description
istar(k)     The number of k-tuples of edges that point to the same node in the network.
ctriad       The number of 3-cycles in the network.
ttriad       The number of two-edge paths for which there is a one-edge shortcut in the network.
triangle     The sum of ctriad and ttriad for the network.
idegree(k)   The number of nodes with exactly k incoming edges in the network.
odegree(k)   The number of nodes with exactly k outgoing edges in the network.
gwidegree    The sum of the counts of each in-degree, weighted by the geometric sequence (1 − e^(−θ_k))^i, where θ_k is a decay parameter.
edges        The number of edges in the graph.
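To make some of the descriptions in Table 1 concrete, the following sketch counts a few of these statistics for a small directed graph by brute force. It implements the verbal descriptions in the table, not the statnet package itself, so the exact conventions of the R implementation may differ; the function name and the toy edge list are hypothetical.

```python
from itertools import permutations
from math import comb

def ergm_statistics(edges):
    """Count a few Table 1 statistics for a directed graph given as (caller, callee) pairs."""
    E = set(edges)
    nodes = sorted({u for e in E for u in e})
    in_deg = {v: 0 for v in nodes}
    for _, v in E:
        in_deg[v] += 1
    # Two-edge paths i -> j -> k that also have the one-edge shortcut i -> k.
    ttriad = sum(1 for i, j, k in permutations(nodes, 3)
                 if (i, j) in E and (j, k) in E and (i, k) in E)
    # Directed 3-cycles; each cycle is found from its three starting points.
    ctriad = sum(1 for i, j, k in permutations(nodes, 3)
                 if (i, j) in E and (j, k) in E and (k, i) in E) // 3
    return {
        "edges": len(E),
        "ttriad": ttriad,
        "ctriad": ctriad,
        "triangle": ttriad + ctriad,
        "istar(2)": sum(comb(d, 2) for d in in_deg.values()),
        "idegree(2)": sum(1 for d in in_deg.values() if d == 2),
    }

# Toy graph: one transitive triad (0, 1, 2) and one cyclic triad (2, 3, 4).
stats = ergm_statistics([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 2)])
print(stats)
```

The brute-force loops are cubic in the number of nodes and are meant only for checking small examples by hand.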
Table 2. The AIC for a sample of fitted models. Note: For space and readability, the
notation we use here to describe the models omits the θi parameter coefficient from
Eq. (1). Each term (separated by +) is a separate model predictor variable with its
own coefficient.
Model AIC
edges+gwidegree 104090
edges+gwidegree+ctriad 104088
edges+gwidegree+ttriad 101473
edges+gwidegree+ttriad+odegree(2) 100065
edges+gwidegree+ttriad+istar(3) 97723
edges+gwidegree+ttriad+idegree(2) 97589
edges+gwidegree+ttriad+istar(2) 94383
edges+gwidegree+ttriad+idegree(2)+idegree(3)+istar(2) 91017
edges+gwidegree+ttriad+idegree(2)+idegree(3)+istar(2)+istar(3) 89491
Table 2 allows us to see the variables that are important to the AIC and,
hence, are better at predicting the topology of the Apache call graph. For
example, it is interesting that the out-degree of a function is less important to
the global topology than the in-degree, indicating that the emergent structure
of the call graph is more dependent on how many times each function is called
than on how many dependencies they have, which is in line with the findings
in Section 3.
Next, we perform a longitudinal, 50-month study of the Apache call graph
using a few of the best-fitting models from the one-month study. This exper-
iment lets us see if the relative importance of explanatory variables changed
throughout the Apache development process. The ranking by AIC of the
models we fit remains constant across all 50 months, but the values of the
parameters do not. Figure 9 shows a plot of the coefficient values over time
for ttriad, idegree(2,3) and istar(2,3). These variables were chosen be-
cause they were contained in our best-fitting model (as determined by AIC)
from Table 2, and we chose not to study any variables (such as odegree)
from other, less well-fitting models. Our exploratory procedure eliminated the
other variables that we considered because they did not contribute as large
an improvement to the AIC as the variables from the final model.
All of the variables that we’ve measured relating to in-degree (istar(2,3),
idegree(2,3) and gwidegree) are generally negative in this model. On
the other hand, the transitivity variable ttriad is consistently positive
throughout the development cycle. This indicates that there are functions
in Apache that call their callee’s callees (perhaps due to the standard library
functions being included in the Apache call graph).
Interestingly, over the 50-month period, idegree(2) is almost perfectly
anti-correlated with idegree(3) (as seen in Fig. 9). One explanation is that
these two variables are measuring two aspects of the same phenomenon (how
Fig. 9. Plots of several interesting coefficients across all 50 months. Top: ttriad.
Middle: idegree(2,3). Bottom: istar(2,3).
many functions are called approximately twice), and, hence, the importance of
the two variables to the model is correlated. Similarly, edges and gwidegree
(not shown) are strongly anti-correlated, perhaps because they both measure
aspects of network density.
Acknowledgments
We are indebted to Christian Bird for supplying the call graph data which
is central to our analysis and to Premkumar Devanbu for many useful dis-
cussions. This work was funded in part by the National Science Foundation
under Grant No. IIS-0613949.
References
1. E. S. Raymond. The Cathedral & the Bazaar. O’Reilly and Associates, Sebastopol,
CA, 1999.
2. T. O’Reilly. Lessons from open source software development. Communications of
the ACM, 42(4), 1999.
3. P. Ball. Openness makes software better sooner. Nature, June 25, 2003.
4. D. Challet and Y. Le Du. Microscopic model of software bug dynamics: Closed
source versus open source. International Journal of Reliability, Quality and Safety
Engineering, 12(6), 2005.
5. M. Fowler. Refactoring: Improving the Design of Existing Programs. Addison-
Wesley, Reading, MA, 1999.
6. A. A. Gorshenev and Yu. M. Pis’mak. Punctuated equilibrium in software evolu-
tion. Phys. Rev. E, 70(6):067103, 2004.
7. http://httpd.apache.org.
8. Software Maintenance Costs and references therein, http://www.cs.jyu.fi/~koskinen/smcosts.htm.
9. S. Valverde, R. Ferrer Cancho, and R. V. Solé. Scale-free networks from optimal
design. Europhys. Lett., 60(4):512–517, 2002.
10. C. R. Myers. Software systems as complex networks: Structure, function, and
evolvability of software collaboration graphs. Phys. Rev. E, 68:046116, 2003.
11. A. MacCormack, J. Rusnak, and C. Y. Baldwin. Exploring the structure of com-
plex software designs: An empirical study of open source and proprietary code.
Management Science, 52(7), 2006.
12. Z. M. Saul, V. Filkov, P. T. Devanbu, and C. Bird. Recommending random walks.
In Proceedings ESEC/SIGSOFT FSE, pages 15–24, 2007.
13. http://www.grammatech.com/products/codesurfer/overview.html.
14. S. B. Seidman. Network structure and minimum degree. Social Networks, 5:269–
287, 1983.
15. B. Bollobas. The evolution of sparse graphs. In Graph Theory and Combinatorics,
pages 35–57. Academic Press, New York, 1984.
16. http://www.apacheweek.com/features/ap2#rh.
17. S. Valverde and R. V. Solé. Hierarchical small worlds in software architecture. In
Dynamics of Continuous Discrete and Impulsive Systems: Series B; Applications
and Algorithms, volume 14, pages 1–11, 2007.
18. G. Baxter, M. Frean, J. Noble, M. Rickerby, H. Smith, M. Visser, H. Melton,
and E. Tempero. Understanding the shape of Java software. In OOPSLA ’06:
Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented
Programming Systems, Languages, and Applications, pages 397–412, New York,
NY, USA, 2006. ACM.
Gourab Ghoshal
1 Introduction
The study and analysis of complex networks has in recent times sparked
widespread attention from the scientific community [1, 2, 3]. This interest
has been spurred partly by researchers recognizing networks as useful rep-
resentations of real-world complex systems, and also due to the widespread
availability of computing resources, enabling them to gather and analyze data
on a scale much larger than before. Studies have ranged from large-scale empir-
ical analysis of the World Wide Web, social networks and biological systems,
to the development of theoretical models and tools to explore the various
properties of these systems [4, 5].
A topic that has garnered significant interest is the subject of growing
networks, inspired by real-world examples such as that of the Internet, the
World Wide Web and scientific citation networks [6, 7, 8]. The particular
case of the World Wide Web has led to what is perhaps the best-known body
of work on this topic: the preferential attachment model [9, 10], in which
vertices are added to a network with edges that attach to pre-existing vertices
with probabilities depending on those vertices’ degrees. When the attachment
probability is precisely linear in the degree of the target vertex, the resulting
degree sequence has a power-law tail, in the limit of large network size. The
appearance of the power-law tail is what first led to the popularity of growth
models as a method to describe network evolution, as most real-world net-
works appear to have degree distributions that are approximately power laws.
The preferential attachment model, though a good starting point, is insuf-
ficient for describing networks such as the World Wide Web. One can imagine
a variety of processes taking place in addition to the mere deposition of ver-
tices and edges. In particular, it is a matter of common experience that web
pages are sometimes permanently or temporarily removed from the web along
with their links to other web pages. Consequently, there is plenty of room to
build on these models, which are principally growth based, and add another
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_13,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
218 G. Ghoshal
level of complexity by including processes where vertices and edges are also
removed from the network. It is also possible to extend the model to study
general growth and deletion processes, and not just preferential attachment.
Indeed, in the last couple of years, there has been some activity in this
regard [11, 12, 13]. A notable example is the work done in [12], where, among
other things, the authors extended the preferential attachment model to
include the deletion of vertices, potentially at a rate different from that of
the addition of new vertices. They demonstrated that networks still retain
their power-law tail when the rate of vertex accrual outstrips vertex deletion
(with an exponent dependent on the relative rate). However, when the rates are equal,
the exponent diverges and the degree distribution transitions into a stretched
exponential (Weibull distribution). This could have interesting consequences,
for example, in the future character of the web where web pages would share
a more even load than at present.
Most growth models are designed to solve for the degree distribution of
the network. Although one can design a model which describes the evolution
of a number of other network properties, say the clustering coefficient, those
that deal with degree distributions are the most attractive, chiefly for two
reasons. The first is in terms of practicality; the degree distribution is rela-
tively straightforward to deal with mathematically, and thus one can calcu-
late a number of properties exactly. The second reason is that the degrees of
vertices typically have a strong effect on the overall behavior of the network;
therefore, they are a useful guide in determining its characteristics. In fact,
the traditional use of growth models has been mostly in this regard, where
researchers define some evolution process and then solve for the equilibrium
degree distribution of the network.
However, it is certainly possible to think of alternative applications.
Instead of defining a set of processes under which a network evolves and then
determining its final structure, one can turn the question around: specify a
final structure, and then solve for the rules which give rise to that structure.
To see when and why this is applicable, consider the following. Generally, we
can divide evolving networks into broadly three classes. There are those that
evolve naturally, in the sense that they are driven by dynamical processes
not under our control; representative examples are social, biological and in-
formation networks. This class is suited to the more traditional use of growth
models. At best, we can measure the degree distribution of these networks,
guess a set of rules that govern their evolution and then check our calcula-
tions against the measurements, to see if we made a reasonable choice. There
is a different class, mostly infrastructure related, such as the transportation
and power grids, communication networks such as the telephone and Internet,
that are designed by a centrally controlled authority. Since the rules for the
evolution process defined in growth models are mostly local in nature, we are
hard pressed to find a suitable application for them in this particular class.
Finally, there is a relatively new class of networks which falls in between these
two types, the classic example being peer-to-peer file-sharing networks. These
Some New Applications of Network Growth Models 219
2 The Model
In this section we will define our model for growing a network. Our approach
will be based on the attachment kernel introduced in [23] in addition to a
general deletion kernel. We will assume that nodes join/leave the network
at intervals, and on doing so form/lose connections with other pre-existing
nodes in the network. For the networks that we consider, we will make the
assumption that, on the typical time scales over which nodes enter or leave
the network, the size of the network n does not change substantially. We will,
however, not assume that our networks are uncorrelated (in terms of degree
correlations) and will take this into account explicitly in our evolution process.
The reasons for doing so will be clarified in Section 4.
To start off, let us define $p_k$ to be the fraction of nodes in the network
that at a given time have degree $k$. Alternatively, one can think of it as the
probability of a randomly chosen node to have degree $k$. Then by definition,
it satisfies the normalization condition
$$\sum_{k=0}^{\infty} p_k = 1. \tag{1}$$
Let us now define the process by which a newly arriving node chooses to attach
to others extant in the network and how a node is removed from the same.
Let $\pi_k$ be the probability that a given edge from a new node is connected to
a given node of degree $k$, multiplied by the total number of nodes $n$. Since
the total number of nodes in the network with degree $k$ is $n p_k$, this implies
that $\pi_k p_k$ is the probability that an edge from a new node is connected to any
node of degree $k$. Similarly, let $a_k$ be the probability that a given node with
degree $k$ fails or is attacked during one node removal, again multiplied by the
total number of nodes $n$. Then $a_k p_k$ is the total probability to remove a node
with degree $k$ during one node removal. Since each newly attached edge goes
to some node with degree $k$, we have the following normalization conditions:
$$\sum_k \pi_k p_k = 1, \tag{2}$$
$$\sum_k a_k p_k = 1. \tag{3}$$
Finally, let us also allow ourselves to choose the number of edges of the newly
joining nodes. Let $m_k$ be the distribution from which the edges of these nodes
are drawn, with the constraint $\sum_k k m_k = c$; in other words, the average
degree of incoming vertices is $c$.
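The normalization conditions (2) and (3) can be imposed on any non-negative set of raw weights by rescaling. The sketch below is only an illustration: the helper name `normalize_kernel`, the example distribution `p` and the example kernels are all hypothetical.

```python
def normalize_kernel(weights, p):
    """Scale raw weights w_k so that sum_k w_k * p_k = 1 (Eqs. 2 and 3)."""
    total = sum(w * pk for w, pk in zip(weights, p))
    if total <= 0:
        raise ValueError("kernel has no weight on the support of p")
    return [w / total for w in weights]

# Hypothetical degree distribution on k = 0..4.
p = [0.1, 0.3, 0.3, 0.2, 0.1]
ks = range(len(p))

pi = normalize_kernel([k + 1 for k in ks], p)   # attachment weights growing with k
a = normalize_kernel([1.0 for _ in ks], p)      # uniform deletion, a_k = 1

# Degrees of arriving nodes: any distribution m_k, here with mean c = 2.
m = [0.0, 0.5, 0.0, 0.5, 0.0]
c = sum(k * mk for k, mk in zip(ks, m))
```

Because both kernels are defined as per-node probabilities multiplied by $n$, uniform attachment or uniform deletion corresponds to a kernel equal to 1 for every $k$.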
Armed with the given definitions, we are now in a position to write a rate
equation governing the evolution of the degree distribution. For a network of
$n$ nodes at a given unit of time, the total number of nodes with degree $k$ is
$n p_k$. After one unit of time we add a node and take away another, so the new
number with degree $k$ is now $n p'_k$, where $p'_k$ is the new value of $p_k$. Therefore,
we have
$$n p'_k = n p_k + c\pi_{k-1} p_{k-1} - c\pi_k p_k + \sum_j e_{k+1|j}\, j a_j p_j - \sum_j e_{k|j}\, j a_j p_j - a_k p_k + m_k, \tag{4}$$
which renders Eq. (4) independent of $e_{k|j}$ and thus independent of any degree
correlations. Random deletion thus closes the rate equation for $p_k$, enabling
us to seek a solution for the degree distribution for a given $m_k$ and $\pi_k$. If we
now assume that $p_k$ has an asymptotic form in the limit of large time, which
amounts to setting $p'_k$ equal to $p_k$, we get the following equation:
If we then multiply Eq. (6) by $z^k$ and sum over the index $k$ with the convention
$p_{-1} = 0$, we find that the generating functions satisfy the following differential
equation:
$$(1 - z)\frac{dG}{dz} - G(z) - c(1 - z)F(z) + M(z) = 0. \tag{10}$$
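The step from the steady-state rate equation to Eq. (10) can be sketched as follows. Two things are assumed here because they are not reproduced above: that Eq. (6) is the steady state of Eq. (4) under uniform deletion ($a_k = 1$, for which $\sum_j e_{k|j}\, j p_j = k p_k$), and that $G$, $F$ and $M$ are the generating functions of $p_k$, $\pi_k p_k$ and $m_k$ respectively.

```latex
% Assumed form of Eq. (6): steady state of Eq. (4) with a_k = 1
c\pi_{k-1}p_{k-1} - c\pi_k p_k + (k+1)p_{k+1} - k p_k - p_k + m_k = 0 .

% Assumed generating functions
G(z) = \sum_k p_k z^k, \qquad
F(z) = \sum_k \pi_k p_k z^k, \qquad
M(z) = \sum_k m_k z^k .

% Multiply by z^k and sum over k, using p_{-1} = 0:
\sum_k c\pi_{k-1}p_{k-1} z^k = c z F(z), \qquad
\sum_k c\pi_k p_k z^k = c F(z),

\sum_k (k+1)p_{k+1} z^k = \frac{dG}{dz}, \qquad
\sum_k k p_k z^k = z\frac{dG}{dz} .

% Collecting terms reproduces Eq. (10):
(1 - z)\frac{dG}{dz} - G(z) - c(1 - z)F(z) + M(z) = 0 .
```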
Our task will be to solve for a set of rules that generate/preserve the degree
distribution of a network that is specified beforehand. In other words, given
a $G(z)$, our aim is to solve for the attachment kernel $F(z)$. We can rearrange
Eq. (10) to get $F(z)$ in terms of the other two distributions,
$$F(z) = \frac{1}{c}\left[\frac{dG}{dz} + \frac{M(z) - G(z)}{1 - z}\right]. \tag{11}$$
we can simply read the coefficient of $z^k$ on either side of Eq. (12) to give
$$\pi_k = \frac{1}{c\,p_k}\left[(k + 1)p_{k+1} + P_{k+1} - M_{k+1}\right], \tag{15}$$
where $P_k$ is the cumulative distribution of the degrees of nodes in the network,
and $M_k$ is the cumulative distribution of the degrees of nodes added,
$$P_k = \sum_{l=k}^{\infty} p_l, \qquad M_k = \sum_{l=k}^{\infty} m_l. \tag{16}$$
We have a number of options for solving Eq. (15); given (almost) any
choice of the distribution mk of the degrees of added vertices, we can find
the corresponding πk that will give the desired final degree distribution of
the network. A particularly convenient choice would be to make the degree
distribution of the added vertices the same as the desired degree distribution,
so that $M_k = P_k$. Then,
$$\pi_k = \frac{q_k}{p_k} = \frac{(k + 1)p_{k+1}}{c\,p_k}. \tag{17}$$
In other words, if we have some desired degree distribution pk for our network,
one way to achieve it is to add vertices with exactly that degree distribu-
tion and then arrange the attachment process so that the degree distribution
remains preserved thereafter, even as vertices and edges are added to and
removed from the network. Equation (17) tells us the choice of attachment
kernel that will achieve this.
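Equation (15) is straightforward to evaluate numerically. The following Python sketch is an illustration under stated assumptions: distributions are finite lists indexed by degree and taken to be zero beyond their last entry, the function name `attachment_kernel` is hypothetical, and $\pi_k$ is set to zero wherever $p_k = 0$ as a convention of this example.

```python
def attachment_kernel(p, m, c):
    """Eq. (15): pi_k = [(k+1) p_{k+1} + P_{k+1} - M_{k+1}] / (c p_k),
    where P_k and M_k are the tail sums of p and m (Eq. 16)."""
    K = max(len(p), len(m))
    p = list(p) + [0.0] * (K + 1 - len(p))   # pad so that index k + 1 exists
    m = list(m) + [0.0] * (K + 1 - len(m))
    # Tail sums P[k] = sum_{l >= k} p_l, built from the top down.
    P = [0.0] * (K + 2)
    M = [0.0] * (K + 2)
    for k in range(K, -1, -1):
        P[k] = P[k + 1] + p[k]
        M[k] = M[k + 1] + m[k]
    pi = []
    for k in range(K):
        if p[k] == 0.0:
            pi.append(0.0)                   # convention: no nodes of this degree
        else:
            pi.append(((k + 1) * p[k + 1] + P[k + 1] - M[k + 1]) / (c * p[k]))
    return pi

p = [0.1, 0.3, 0.3, 0.2, 0.1]                # hypothetical target distribution
c = sum(k * pk for k, pk in enumerate(p))    # mean degree of added nodes
pi = attachment_kernel(p, p, c)              # the convenient choice m_k = p_k
```

With `m` equal to `p` the tail sums cancel and the result reduces to Eq. (17); other choices of `m` may produce negative values, which would signal that no valid kernel exists for that choice.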
For example, say we want to generate a Poisson network with degrees
distributed according to
$$p_k = e^{-\mu}\frac{\mu^k}{k!}, \tag{18}$$
where $\mu$ is the average degree of the network. Substituting Eq. (18) into
Eq. (17) gives $\pi_k = \mu/c = 1$, since the added vertices have mean degree
$c = \mu$. Equation (17) thus tells us that all
we have to do is to introduce nodes with degrees distributed according to
Eq. (18) and attach them uniformly at random to the pre-existing vertices
in the network. Figure 1 shows the degree distribution of a Poisson network
generated using the method described above.
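The growth mechanism for the Poisson case can be sketched in a few lines of Python. This is not the code used to produce Figure 1; the starting condition (isolated nodes), the parameter values, and the order of deletion and addition within a time step are assumptions of the example.

```python
import math
import random

def poisson_sample(mu, rng):
    """Knuth's multiplication method; adequate for modest mu."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def grow_poisson_network(n, mu, steps, seed=0):
    """Fixed-size network: each step removes one random node and adds one
    node with Poisson degree, attached uniformly at random (pi_k = 1)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}   # assumed start: n isolated nodes
    nodes = list(adj)                    # list copy for uniform sampling
    next_id = n
    for _ in range(steps):
        # Uniform deletion: remove a random node together with its edges.
        i = rng.randrange(len(nodes))
        nodes[i], nodes[-1] = nodes[-1], nodes[i]
        dead = nodes.pop()
        for u in adj.pop(dead):
            adj[u].discard(dead)
        # Addition: degree drawn from Eq. (18), targets chosen uniformly.
        k = min(poisson_sample(mu, rng), len(nodes))
        targets = rng.sample(nodes, k)
        adj[next_id] = set(targets)
        for u in targets:
            adj[u].add(next_id)
        nodes.append(next_id)
        next_id += 1
    return [len(nbrs) for nbrs in adj.values()]

degrees = grow_poisson_network(n=2000, mu=4.0, steps=40000)
mean = sum(degrees) / len(degrees)
var = sum((d - mean) ** 2 for d in degrees) / len(degrees)
```

After many more steps than there are nodes, the initial condition is forgotten, and both `mean` and `var` should be close to `mu`, as expected for a Poisson distribution.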
Having built our mathematical framework, we are now free to move on to
specific applications.
[Figure 1: plot of probability $p_k$ versus degree $k$.]
Fig. 1. The degree distribution for a network of fixed size n = 50,000 generated using
the growth mechanism described in the text, with c = 10. The points represent the
simulation results and the solid line is the distribution Eq. (18).
Interest in this has been inspired partly by the emergence of networked dis-
tributed databases such as peer-to-peer file-sharing networks. In such net-
works the structure of the network and the distribution of the items stored
on it typically change rapidly and frequently, which means that searches must
be performed in real time. In peer-to-peer networks searches typically con-
sist of queries that are forwarded from one vertex to another until the target
item is found. Real-time searches place heavy demands on computer power
and bandwidth, and there is interest in finding efficient search strategies to
decrease these costs.
As mentioned in the Introduction, direct measurements of real peer-to-peer
networks have shown that typically the degree distribution of these networks
follows a power law, which has led some authors to propose search strategies
that exploit this power-law form to improve efficiency. Here we describe an
alternative approach to the problem: instead of tailoring our algorithm to the
observed network, we instead tailor the structure of the network to optimize
the performance of the search algorithm. We will start by defining our al-
gorithm and then outline the properties of interest. We will then consider a
candidate network with a structure that optimizes those properties. The ideas
of this section have been discussed in detail in [24].
computer, which would make searching easier, but we will not assume this to
be the case. Computers are connected together in a virtual network, meaning
that each computer is designated as a neighbor of some number of other
computers. These connections between computers are purely notional: every
computer can communicate with every other directly over the Internet or
other physical network. The virtual network is used only to limit the amount
of information that computers have to keep about their peers. Each computer
maintains a directory of the data items held by its network neighbors, but
not by any other computers in the network. Searches for items are performed
by passing a request for a particular item from computer to computer until it
reaches one in whose directory that item appears, meaning that one of that
computer’s neighbors holds the item. The identity of the computer holding the
item is then transmitted back to the origin of the search, and the origin and
target computers communicate directly thereafter to negotiate the transfer
of the item. This basic model is essentially the same as that used by other
authors [15] as well as by many actual peer-to-peer networks in the real world.
Note that it achieves efficiency by the use of relatively large directories at
each node of the network, which inevitably use up memory resources on the
computers. However, with standard hash-coding techniques and for databases
of the typical sizes encountered in practical situations (hundreds of thousands
or millions of items) the amounts of memory involved are quite modest by
modern standards.
The two metrics of search performance that we consider are search time
and bandwidth, both of which should be low in a good algorithm. We define
the search time to be the number of steps taken by a propagating search
query before the desired target item is found. We define the bandwidth for a
node as the average number of queries that pass through that node per unit
time. Bandwidth is a measure of the actual communications bandwidth that
vertices must expend to keep the network as a whole running smoothly, but it
is also a rough measure of the CPU time they must devote to searches. Since
these are limited resources, it is crucial that we do not allow the bandwidth
to grow too quickly as vertices are added to the network; otherwise, the size
of the network will be constrained.
where $k_j$ is the degree of node $j$ and $A_{ij}$ is an element of the adjacency matrix,
$$A_{ij} = \begin{cases} 1 & \text{if there is an edge joining vertices } i, j, \\ 0 & \text{otherwise.} \end{cases} \tag{20}$$
$$p_i = \frac{k_i}{2m}, \tag{21}$$
where m is the total number of edges in the network. That is, the random
walk visits nodes with probability proportional to their degrees.
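Equation (21) can be checked numerically on a small graph. The sketch below assumes that the walker moves to a uniformly chosen neighbor at each step, so that the occupation probabilities evolve as $p_i(t+1) = \sum_j A_{ij}\, p_j(t)/k_j$; the example graph and the function name are hypothetical.

```python
def stationary_by_iteration(adj, iters=2000):
    """Iterate the random-walk update on an undirected graph given as
    adjacency lists. A lazy step (stay put with probability 1/2) is used
    so the iteration also converges on bipartite graphs; the lazy walk
    has the same stationary distribution as the plain walk."""
    n = len(adj)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [0.5 * x for x in p]
        for j, nbrs in enumerate(adj):
            share = 0.5 * p[j] / len(nbrs)   # probability sent to each neighbor
            for i in nbrs:
                new[i] += share
        p = new
    return p

# Small connected example graph (hypothetical), as adjacency lists.
adj = [[1, 2, 3], [0, 2], [0, 1, 3], [0, 2, 4], [3]]
two_m = sum(len(nbrs) for nbrs in adj)            # 2m = sum of degrees
predicted = [len(nbrs) / two_m for nbrs in adj]   # Eq. (21)
computed = stationary_by_iteration(adj)
```

The two lists `predicted` and `computed` should agree to within numerical precision.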
When our random walker arrives at a previously unvisited node of de-
gree ki , it “learns” from that node’s directory about the items held by all of
its immediate neighbors, of which there are ki −1 excluding the one we arrived
from (whose items by definition we already know about). Thus, at every step
the walker gathers more information about the network. The average number
of nodes it learns about upon making a single step is $\sum_i p_i (k_i - 1)$, with $p_i$
given by (21), and hence the total number it learns about after $\tau$ steps is
$$\frac{\tau}{2m}\sum_i k_i (k_i - 1) = \tau\left(\frac{\langle k^2\rangle}{\langle k\rangle} - 1\right), \tag{22}$$
where $\langle k\rangle$ and $\langle k^2\rangle$ represent the mean and mean-square degrees in the network
respectively and we have made use of $2m = n\langle k\rangle$.
The time taken for the walker to find the desired item, of course, depends
on how many instances of the target exist in the network. In many cases of
practical interest, copies of items exist on a fixed fraction of the nodes in
the network, which makes for quite an easy search. Here we will consider the
much harder problem in which copies of the target item exist on only a fixed
number of nodes, where that number could potentially be just 1. In this case,
the walker will need to learn about the contents of O(n) nodes in order to
find the target, and hence the average time to find the target is given by
$$\tau = A\,\frac{n}{\langle k^2\rangle/\langle k\rangle - 1}, \tag{23}$$
for some constant A.
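For a concrete degree sequence, Eq. (23) can be evaluated directly. The function below is a small illustration; its name is hypothetical, and the constant $A$ is left as a parameter defaulting to 1 since the text does not fix its value.

```python
def search_time_estimate(degrees, A=1.0):
    """Eq. (23): tau = A * n / (<k^2>/<k> - 1) for a given degree sequence."""
    n = len(degrees)
    k1 = sum(degrees) / n                  # mean degree <k>
    k2 = sum(k * k for k in degrees) / n   # mean-square degree <k^2>
    gain = k2 / k1 - 1.0                   # nodes learned about per step, Eq. (22)
    if gain <= 0:
        raise ValueError("walker learns nothing new per step")
    return A * n / gain

# A regular network in which every node has degree 5: gain = 25/5 - 1 = 4.
tau_regular = search_time_estimate([5] * 1000)   # 1000 / 4 = 250.0
```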
3.3 Bandwidth
Bandwidth is the mean number of queries reaching a given node per unit
time. Equation (21) tells us that the probability that a particular current
query reaches node $i$ at a particular time is $k_i/2m$, and assuming as discussed
above that the number of queries initiated per unit time is proportional to
the total number of vertices, the bandwidth for node $i$ is
$$\beta_i = B n \frac{k_i}{2m} = B\,\frac{k_i}{\langle k\rangle}, \tag{24}$$
where B is another constant.
This implies that high-degree nodes will be overloaded in comparison with
low-degree ones, which means that networks with power-law or other highly
right-skewed degree distributions may be undesirable, resulting in bottlenecks
around the nodes of highest degree that could in principle harm the perfor-
mance of the entire network. If we wish to distribute load evenly among the
computers in our network, then a network with a tightly peaked degree dis-
tribution is desirable.
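The effect of degree heterogeneity on load can be made concrete with Eq. (24). In the sketch below the two degree sequences are invented for illustration and share the same mean degree; the constant $B$ is set to 1.

```python
def bandwidth(degrees, B=1.0):
    """Eq. (24): beta_i = B * k_i / <k> for every node i."""
    mean_k = sum(degrees) / len(degrees)
    return [B * k / mean_k for k in degrees]

# Two hypothetical degree sequences with the same mean degree of 4.
peaked = [4] * 100                 # every node has the same degree
skewed = [1] * 90 + [31] * 10      # a few hubs, many low-degree nodes

peak_load_peaked = max(bandwidth(peaked))   # 4 / 4 = 1.0
peak_load_skewed = max(bandwidth(skewed))   # 31 / 4 = 7.75
```

The average load is the same in both cases, but in the skewed sequence the busiest nodes carry several times that average.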
We wish to choose a structure for our network that gives low search times and
modest bandwidth demands, keeping in mind that the structure we consider
must be realizable in practice. In peer-to-peer networks users continually exit
the network whenever they want. Since we as designers have limited control
over this aspect of the network dynamics, we will assume that nodes are
effectively deleted at random. With this in mind, we are ideally placed to use
our model from Section 2.
A simple and attractive choice for our network is the Poisson distributed
network. For a Poisson degree distribution with mean $\mu$ we have $\langle k\rangle = \mu$ and
$\langle k^2\rangle = \mu^2 + \mu$. Then, using Eq. (23), the average search time is
$$\tau = A\,\frac{n}{\mu}. \tag{25}$$
Now if we allow $\mu$ to grow as some power of the size of the entire network,
i.e. $\mu \propto n^{\alpha}$ with $0 \le \alpha \le 1$, then $\tau \propto n^{1-\alpha}$. For smaller values of $\alpha$ searches
will take longer, but the nodes’ degrees are lower on average, meaning that
each vertex will have to devote less memory resources to maintaining its di-
rectory. Conversely, for larger α, searches will be completed more quickly at
the expense of greater memory usage. In the limiting case α = 1, searches
are completed in constant time, independent of the network size, despite the
simple nature of the random walk search algorithm. The price we pay for this
good performance is that the network becomes dense, having a number of
edges scaling as $n^{1+\alpha}$. However, remember that this is a virtual network, in
which the edges are a purely notional construct whose creation and mainte-
nance carry essentially zero cost. There is a cost associated with the directories
maintained by nodes, which for α = 1 will contain information on the items
held by a fixed fraction of all the nodes in the network. For instance, each
node might be required to maintain a directory of 1% of all items in the
network. Because of the nature of modern computer technology, however, we
do not expect this to create a significant problem. User time (for performing
searches) and CPU time and bandwidth are scarce resources that must be
carefully conserved, but memory space on hard disks is cheap, and the tens or
even hundreds of megabytes needed to maintain a directory is considered in
most cases to be a small investment. By making the choice α = 1 we can trade
cheap memory resources for essentially optimal behavior in terms of search
time, and this is normally a good deal for the user.
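The trade-off between search time and directory size can be tabulated with a few lines of Python. This is only a sketch of the scaling argument: the proportionality constant `frac` in $\mu = \mathrm{frac} \cdot n^{\alpha}$ and the value $A = 1$ are assumptions, chosen so that with $\alpha = 1$ each node is linked to 1% of the network.

```python
def scaling_tradeoff(n, alpha, frac=1.0, A=1.0):
    """Mean degree, search time (Eq. 25) and edge count for mu = frac * n**alpha."""
    mu = frac * n ** alpha     # mean degree, i.e. directory size in neighbors
    tau = A * n / mu           # Eq. (25)
    edges = n * mu / 2         # m = n <k> / 2
    return mu, tau, edges

# alpha = 1 with directories covering 1% of the network.
rows = [(n, scaling_tradeoff(n, alpha=1.0, frac=0.01)) for n in (10**3, 10**4, 10**5)]
```

In every row the search time is the same, while the mean degree grows linearly and the edge count quadratically with `n`.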
As a test of our proposed search scheme, we have performed simulations
of the procedure on Poisson networks generated using the methods described
in Section 2. Figure 2 shows as a function of network size the average time τ
taken by a random walker to find an item placed at a single randomly chosen
node in the network. As we can see, the value of τ does indeed tend to a
constant (about 100 steps in this case) as the network size becomes large.
[Figure 2: plot of search time $\tau$ versus network size $n$.]
Fig. 2. The time τ for the random walk search to find an item deposited at a random
vertex, as a function of the number of vertices n.
While we have described here the theoretical ideas to grow a network with
a desired degree distribution, within the constraints outlined above, we have
not provided a realistic way to place edges between nodes with the desired
attachment kernel $\pi_k$. If each node entering the network knew the identities
and degrees of all the others, this would be easy; we would simply select a
degree k at random in proportion to πk pk , and then select a node uniformly
at random with that degree. In the real world, however, and particularly in
peer-to-peer networks, no node knows the identity of all others. Typically,
computers only know the identities (such as IP addresses) of their immediate
network neighbors. One way around this problem is to use biased random
walks to generate the network. Since the purpose of this paper is to discuss
ideas rather than implementation, we will not describe the method here.
For a detailed discussion of the
practical implementation, along with other details such as data replication, we
refer the interested reader to [24], and instead move on to our next application.
We now turn our attention to quite a different topic: the field of network
resilience. Quite a lot of work has been done in this regard, though most studies
have focused on the effects of disruption on static networks. Typically, authors have
studied networks where the nodes and edges are progressively removed in some
fashion, and then measured the effect of these removals against the existence
of a giant component. The giant component constitutes the largest set of
nodes in the network, of size O(n), where n is the size of the network, that
are connected to each other by at least one path. The network is considered
static in that no compensatory measures, such as the (re)-introduction of new
edges or nodes, are permitted.
There is indeed good reason to study the resilience of networks. In the real
world, networks suffer from a variety of disruptions, stemming from failure
of key components, continuous addition/removal of nodes and edges and in-
tentional attacks such as Denial of Service, among other things. Since these
disruptions affect the structure of networks and structure is directly related
to performance, it is important to understand how the networks are affected.
However, we can do better than that. We can try to restore some or all of
the structure of the network by allowing it to react to the disruptions through
the introduction of new nodes and edges. As evidenced by the previous section,
considerable effort can be expended in tailoring a network to have structures
that optimize properties of interest, and it is worthwhile to try to maintain
that structure in the face of varied disruptions. Note that, in the context of this paper, when
we talk about the structure of the network, we limit ourselves to the degree
distribution.
For the purposes of our study we assume that the designers of the network
are only aware of the statistical properties of the removed nodes and have
no ability to influence the existing network beyond the introduction of new
nodes along with their corresponding edges. They thus have two processes
under their control to compensate for the attack. The first is the degree of
the introduced vertices, and the second is the process by which a newly in-
troduced node chooses to attach to a previously extant one on the network.
Consequently, failure is compensated by adding nodes and edges chosen from
an appropriate degree distribution and attaching them to the network via
specially tailored schemes.
As mentioned before, a variety of models have been proposed to simulate
network evolution and growth where vertices are both added and deleted, but
these have concentrated on the relatively simple case of uniform deletion. We
have already shown in Section 2 that, under uniform failures, the appearance
of degree correlations that typically arise as a result of growth processes can
be neglected. For the case of non-uniform deletion, correlations cannot be
ignored. Here we will proceed by demonstrating how to preserve an initially
uncorrelated network throughout the evolution process, with the introduction
of an additional rate equation for the degree correlations; consequently, our
focus will be on the currently neglected case of non-uniform failures. The
results of this section are based on the work of [26].
Before we move on to our method for repairing networks, we provide a brief de-
scription of the types of attacks or failures that most networks are subject to.
Random failures are the most generally studied schemes in both static and
evolving networks, because they lend themselves to relatively simple analysis.
These types of failures may be representative, say, of disruption of power lines
or transformers in a power grid owing to extraneous factors such as weather.
However, the functionality of most networks often depends on the performance
of higher-degree nodes; consequently, non-uniform attack schemes focus on
these. For example, in a peer-to-peer network, a high-degree node could be a
central user with large amounts of data. High degree could also be indicative
of the amount of load on a node during its operation, or on the public visibility
of a person in a social network. It is reasonable to assume that a malicious
entity such as a computer virus is more likely to strike these important nodes.
We can simulate these kinds of attacks using preferential failures ak ∝ k, that
sample nodes in proportion to their number of connections, and through an
outright attack on the highest-degree nodes represented by ak ∝ θ(k − kmin ),
where θ(x) is the Heaviside step function.
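Both attack kernels can be written down explicitly once the proportionality constants are fixed by the normalization condition (3). The sketch below is illustrative: the function names and the example distribution are hypothetical, and the step function is taken to equal 1 at $k = k_{\min}$, which the text does not specify.

```python
def preferential_kernel(p):
    """a_k proportional to k, scaled so that sum_k a_k p_k = 1 (Eq. 3)."""
    mean_k = sum(k * pk for k, pk in enumerate(p))
    return [k / mean_k for k in range(len(p))]

def targeted_kernel(p, k_min):
    """a_k proportional to theta(k - k_min): only degrees k >= k_min are hit.
    The value of theta at 0 is taken to be 1 here (an assumption)."""
    tail = sum(pk for k, pk in enumerate(p) if k >= k_min)
    if tail <= 0:
        raise ValueError("no nodes with degree >= k_min")
    return [1.0 / tail if k >= k_min else 0.0 for k in range(len(p))]

p = [0.1, 0.3, 0.3, 0.2, 0.1]            # hypothetical degree distribution
a_pref = preferential_kernel(p)
a_targ = targeted_kernel(p, k_min=3)
```

In both cases the list of products $a_k p_k$ is the distribution of the degree of the removed node.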
Our method of compensation will involve control over two processes: the
first where our newly incoming/repaired node chooses a degree for itself drawn
from some distribution mk , and second, the process by which this node decides
to attach to any other in the network, governed by the attachment kernel πk .
confront this issue, we proceed as follows. First we will find choices for mk and
πk that satisfy the solutions to the rate equation for a given pk in a network
that is uncorrelated. We will then demonstrate that a special subset of those
solutions for mk and πk is an uncorrelated fixed point of the rate equation for
the degree correlations. Our goal here is to solve for the attachment kernel
πk , that will preserve the original probability distribution pk , subject to a
deletion kernel ak for some choice of mk .
Before we move on, we need to make a slight modification to Eq. (6). In
the earlier instance, we exploited the simplification that arose from uniform
deletion. We will assume here that the initial network is uncorrelated; however,
we will retain the general form of the deletion kernel $a_k$. Let $\langle k\rangle_a$ be the mean
degree of nodes removed from the network (i.e. $\langle k\rangle_a = \sum_k k a_k p_k$), and $\langle k\rangle$
the mean degree of the original degree distribution $p_k$. Then we have
$$c\pi_{k-1}p_{k-1} - c\pi_k p_k + \frac{\langle k\rangle_a}{\langle k\rangle}(k + 1)p_{k+1} - \frac{\langle k\rangle_a}{\langle k\rangle}\,k p_k - a_k p_k + m_k = 0. \tag{26}$$
Once again, it can be easily shown from Eq. (26) that the average degree of
nodes removed is $\langle k\rangle_a = c$. Introducing the cumulative distribution for the
attacked and newly added nodes, $A_k$ and $M_k$ respectively,
$$A_k = \sum_{l=k}^{\infty} a_l p_l, \qquad M_k = \sum_{l=k}^{\infty} m_l, \tag{27}$$
This equation represents the set of possible solutions for the attachment kernel
that will lead to the desired degree distribution, given that the final network
is uncorrelated. The correct choice of solution from the above set must obey
the consistency condition that, when inserted into the rate equation for the
degree correlations, the correlations vanish. The following ansatz chosen from
the above set is such a choice:
$$m_k = a_k p_k, \qquad \pi_k = \frac{q_k}{p_k} = \frac{(k + 1)p_{k+1}}{\langle k\rangle\, p_k}. \tag{30}$$
The reason behind this choice will be made more clear in the next section.
Note the similarity with Eq. (17) which was derived in the context of uniform
deletion. Here we see that it holds true even for non-uniform deletion, albeit
with some caveats that we will see shortly. There are basically two conditions
for the existence of a solution given by this equation: $a_k p_k$ must be a valid
probability distribution, and $\langle k\rangle$ must be finite. These are not very stringent
conditions and are typically satisfied by most degree distributions. In other
words, barring some pathological cases, it is always possible to find a solution
of the above form.
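That the ansatz of Eq. (30) satisfies the steady-state condition of Eq. (26) can be confirmed numerically. The following sketch is only a sanity check under stated assumptions: the degree distribution is a truncated exponential chosen for illustration, the attack is preferential, and all sequences are treated as zero outside the range of listed degrees.

```python
import math

def residual_eq26(p, a, m, pi, c):
    """Left-hand side of Eq. (26) for each k; inputs are lists over k."""
    K = len(p)

    def get(seq, k):
        return seq[k] if 0 <= k < K else 0.0   # zero outside the listed range

    mean_k = sum(k * p[k] for k in range(K))
    mean_ka = sum(k * a[k] * p[k] for k in range(K))
    ratio = mean_ka / mean_k
    out = []
    for k in range(K):
        out.append(c * get(pi, k - 1) * get(p, k - 1) - c * pi[k] * p[k]
                   + ratio * (k + 1) * get(p, k + 1) - ratio * k * p[k]
                   - a[k] * p[k] + m[k])
    return out

# Hypothetical example: truncated exponential p_k under preferential attack.
K = 60
raw = [math.exp(-0.4 * k) for k in range(K)]
Z = sum(raw)
p = [x / Z for x in raw]
mean_k = sum(k * pk for k, pk in enumerate(p))
a = [k / mean_k for k in range(K)]                   # a_k proportional to k
m = [a[k] * p[k] for k in range(K)]                  # Eq. (30), first part
pi = [(k + 1) * p[k + 1] / (mean_k * p[k]) if k + 1 < K else 0.0
      for k in range(K)]                             # Eq. (30), second part
c = sum(k * mk for k, mk in enumerate(m))            # mean degree of added nodes
worst = max(abs(r) for r in residual_eq26(p, a, m, pi, c))
```

The largest residual `worst` should vanish up to floating-point rounding, and `m` should sum to 1, in line with the requirement that $a_k p_k$ be a valid probability distribution.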
We are now in a position to effect our repair on the network. Given the
original degree distribution pk and the form of the attack ak , Eq. (30) gives
us the precise recipe for recovering the degree distribution. We need to sam-
ple the degrees of the newly introduced nodes in proportion to the product of
the deletion kernel and the degree distribution, and then attach these edges
in proportion to the excess degree distribution of the network. To test our
repair method, we provide two examples for initially uncorrelated networks
with 10,000 nodes generated using the configuration model [25]. In the config-
uration model, only the degrees of vertices are specified. Apart from this sole
constraint the connections between vertices are made at random.
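A bare-bones version of the configuration model is sketched below. It follows the usual stub-matching idea rather than any specific implementation from [25]; in particular, keeping self-loops and repeated edges instead of rejecting them is a simplifying assumption, and the example degree sequence is invented.

```python
import random

def configuration_model(degrees, seed=0):
    """Pair up edge stubs uniformly at random. Self-loops and repeated
    edges can occur; they are kept here rather than rejected."""
    if sum(degrees) % 2:
        raise ValueError("the degrees must sum to an even number")
    rng = random.Random(seed)
    # Vertex v contributes degrees[v] stubs (half-edges).
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list are joined into edges.
    return list(zip(stubs[::2], stubs[1::2]))

degrees = [3, 2, 2, 1, 1, 1]
edges = configuration_model(degrees)
```

Counting how often each vertex appears among the edge ends recovers the prescribed degrees, with a self-loop contributing two to the degree of its vertex.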
We employ two types of attack kernels, preferential attack represented
by ak ∝ k and a targeted attack only on high-degree nodes represented by
ak ∝ θ(k − kmin ) on our two example networks. Our first network has links
distributed according to a power law with an exponential cutoff,
$$p_k = \begin{cases} C k^{-\gamma} e^{-k/\kappa} & k > 0, \\ 0 & k = 0, \end{cases} \tag{31}$$
where $C$ is a normalization constant.
Our second choice of network has an exponential degree distribution,
$$p_k = \left(1 - e^{-\lambda}\right) e^{-\lambda k}. \tag{32}$$
In Fig. 3 we show the resulting degree distribution for the power-law network
where nodes were attacked preferentially, while Fig. 4 shows the results for the
exponentially distributed network undergoing targeted attack. Both figures
indicate that the initial and final networks are in excellent agreement.
[Figure 3: plot of probability $p_k$ versus degree $k$.]
Fig. 3. Log-binned degree distribution of a power-law network ($10^4$ nodes) with
exponent γ = 3 and exponential cutoff κ = 100, under preferential attack $a_k \propto k$ using
$\pi_k$ from Eq. (30) after setting $m_k = a_k p_k$. The data points are averaged over multiple
realizations of the network, each subject to $10^5$ iterations of addition and deletion.
The points along with corresponding error bars represent the final degree distribution,
whereas the solid line represents the initial network.
[Figure 4: plot of probability $p_k$ versus degree $k$.]
Fig. 4. Degree distribution of an exponential network ($10^4$ nodes) with λ = 0.4 under
targeted attack $a_k \propto \theta(k - 5)$ using $\pi_k$ from Eq. (30) after setting $m_k = a_k p_k$.
To demonstrate the validity of our results, we must prove that our initially
uncorrelated networks remain uncorrelated under our repair scheme. Here we
give a brief sketch of the idea; for full details, see [26].
We start off by defining a rate equation for the correlations. The rate
equation describes the evolution of the expected number of edges in the network
with ends of degree $k$ and $l$. Let the number of such edges in the network be
$$m e_{l,k}, \tag{33}$$
where $m = n\langle k\rangle/2$, and $e_{l,k}$ is the probability that a randomly selected edge
has degree $k$ at one end and degree $l$ at the other. The expected number of
edges after one time step where we add $c$ and take away $\langle k\rangle_a$ edges is then
$$[m + c - \langle k\rangle_a]\, e'_{l,k} = m e_{l,k} + \Delta, \tag{34}$$
where Δ represents all other edge addition and removal processes.
We have already established that in the steady state case, $\langle k\rangle_a = c$
irrespective of the degree distribution, so our goal is equivalent to showing that
$\Delta$ is equal to zero for an uncorrelated network generated/repaired with our
special choices of $\pi_k$ and $m_k$. As a result $e'_{k,l} = e_{k,l}$, implying that the degree
correlations (if any) remain constant over time.
So according to Eq. (34) there exists a set of solutions such that an initially
uncorrelated network will not develop correlations as a consequence of the
evolution process. The attachment kernel Eq. (30) that was employed in the
network evolution process is one of these solutions. The repair method can
therefore be employed while maintaining negligible correlations in the network.
To briefly summarize, we have demonstrated that if a network with a
certain degree structure is subjected to an attack that aims to destabilize
that structure, one can recover it by manipulating the rules by which
new or repaired vertices are (re)introduced into the network. The
rules that we employ in our repair method are dependent on the types of
attacks on our networks.
5 Conclusion
In this paper we have discussed some interesting alternative applications of
network growth models. Traditionally these models have been used to deter-
mine the processes via which networks in the real world form. However, the
mathematical framework can be adapted to other uses. Here we have provided
two examples.
In the first example, we have considered the problem of designing networks
by trying to manipulate the rules by which they evolve. For a certain class of
networks, such as peer-to-peer networks, the limited control that this manipu-
lation gives us over network structure may be sufficient to generate significant
improvements in network performance. Using generating function methods,
we have shown that it is possible to create networks with a desired degree
distribution by appropriate choice of the attachment kernel that governs how
newly added vertices connect to the network. We studied in detail one partic-
ularly simple case of a Poisson network that can be realized in straightforward
fashion and allows us to perform decentralized searches in constant time, and
makes only constant bandwidth demands per node, even in the limit where
the database becomes arbitrarily large.
In the second example, we have shown how to preserve a network’s degree
distribution from various forms of attack or failures by allowing it to react to
Some New Applications of Network Growth Models 235
the disruptions via the introduction of new nodes and edges. Recent empirical
studies [27] have suggested that node removal, for example, in the World Wide
Web, is typically non-uniform in nature. Unfortunately as we have seen, non-
uniform removal leads to the creation of degree correlations in the network,
which makes analysis difficult. To deal with the special case of non-uniform
deletion we have introduced a rate equation for the evolution of degree cor-
relations and have used that in combination with the equation for the degree
distribution to work around this problem. The structure of many networks
in the real world is crucially related to their performance, and consequently,
loss of these properties can lead to severe constraints on their performance.
In view of this, it is crucial for researchers to come up with effective solutions
to try and manage these types of disruptions.
The ideas in this paper have been presented chiefly to demonstrate the
use and versatility of network evolution models. There remains much oppor-
tunity for other applications than those discussed here, as well as for ways to
execute them in the real world. We hope that this will stimulate the imagina-
tion of researchers working in the field and look forward to new and exciting
developments.
Acknowledgments
The author thanks Mark Newman and Brian Karrer for illuminating discus-
sions. This work was funded by the James S. McDonnell Foundation.
References
1. R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
2. S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. 51, 1079 (2002).
3. M. E. J. Newman, SIAM Review 45, 167 (2003).
4. D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
5. R. J. Williams and N. D. Martinez, Nature 404, 180 (2000).
6. R. Albert, H. Jeong and A.-L. Barabási, Nature 401, 130 (1999).
7. D. J. de S. Price, Science 149, 510 (1965).
8. S. Redner, Eur. Phys. J. B 4, 131 (1998).
9. D. J. de S. Price, J. Amer. Soc. Inform. Sci. 27, 292 (1976).
10. A.-L. Barabási and R. Albert, Science 286, 509 (1999).
11. E. Ben-Naim and P. L. Krapivsky, J. Phys. A: Math. Theor. 40, 8607 (2007).
12. C. Moore, G. Ghoshal and M. E. J. Newman, Phys. Rev. E 74, 036121 (2006).
13. J. Saldaña, Phys. Rev. E 75, 027102 (2007).
14. T. Hong, in Peer-to-Peer, Harnessing the Benefits of a Disruptive Technology,
edited by Andy Oram (O’Reilly, Sebastopol, CA, 2001), Chap. 14, pp. 203–241.
15. L. A. Adamic, R. M. Lukose, A. R. Puniyani and B. A. Huberman, Phys. Rev. E
64, 046135 (2001).
16. N. Sarshar, P. O. Boykin and V. P. Roychowdhury, Fourth International Confer-
ence on Peer-to-Peer Computing, pp. 2–9, Washington, D.C. (2004).
17. G. Paul, T. Tanizawa, S. Havlin and H. E. Stanley, Eur. Phys. J. B 38, 187
(2004).
18. R. Cohen, K. Erez, D. ben-Avraham and S. Havlin, Phys. Rev. Lett. 85, 4626
(2000).
19. D. S. Callaway, M. E. J. Newman, S. H. Strogatz and D. J. Watts, Phys. Rev.
Lett. 85, 5468 (2000).
20. M. E. J. Newman and G. Ghoshal, Phys. Rev. Lett. 100, 138701 (2008).
21. B. Rezai, N. Sarshar, V. Roychowdhury and P. O. Boykin, Physica A 381, 497
(2007).
22. A. E. Motter, Phys. Rev. Lett. 93, 098701 (2004).
23. P. L. Krapivsky and S. Redner, Phys. Rev. E 63, 066123 (2001).
24. G. Ghoshal and M. E. J. Newman, Eur. Phys. J. B 58, 175 (2007).
25. M. Molloy and B. Reed, Random Struct. Algorithms 6, 161 (1995).
26. B. Karrer and G. Ghoshal, Eur. Phys. J. B 62, 239 (2008).
27. J. S. Kong and V. P. Roychowdhury, e-print arXiv:0711.3263v2.
The Big Friendly Giant: The Giant Component
in Clustered Random Graphs
1 Introduction
Network theory is a powerful tool for describing and modeling complex systems,
with applications in widely differing areas including epidemiology [16],
neuroscience [34], ecology [20] and the Internet [26]. In the field's early days, one
often compared an empirically given network, whose nodes are the elements
of the system and whose edges represent their interactions, with an ensemble
having the same number of nodes and edges, the most popular example
being the random graphs introduced by Erdős and Rényi [11]. As the field
matured, it became clear that this naive model needed to be refined,
because real-world networks often differ significantly from
the Erdős–Rényi random graphs in having a highly heterogeneous non-Poisson
degree distribution [5, 15] and in possessing a high level of clustering [33].
Methods for generating random networks with arbitrary degree distribu-
tions and for calculating their statistical properties are now well understood.
This is usually achieved with the aid of the configuration model [6] and by
employing an analysis of a certain branching process based on generating
functions [24]. However, clustering, the other property that characterizes
real-world networks, remains far less understood. Clustering refers to the relative
number of triangles in a network, and is commonly measured by the coefficient
introduced in [24] as $C = 3 N_\triangle / N_3$. Here $N_\triangle$ is the total number of triangles in
the network, while $N_3$ is the number of connected triples of nodes. This
definition has the advantage that C is also the probability that two nodes which
connect to a mutual node are connected themselves, thereby forming a triangle
whereby “a friend of a friend is also a friend.”
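The definition of C can be made concrete with a small computation. The following is a minimal sketch, assuming the network is stored as a dictionary mapping each node to the set of its neighbors; the function name and the toy graph are illustrative and do not come from the chapter.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Return C = 3 * (number of triangles) / (number of connected triples)."""
    closed_triples = 0  # each triangle is counted once per corner, i.e. 3 times
    triples = 0         # connected triples, counted by their central node
    for node, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for u, v in combinations(nbrs, 2):
            if v in adj[u]:
                closed_triples += 1
    return closed_triples / triples if triples else 0.0

# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(adj)
```

Counting closed triples at every corner already supplies the factor of 3 in the numerator, so no explicit multiplication is needed.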
The main difficulty when studying clustered networks is that the branching
processes, which are at the heart of the generating function formalism of [24],
no longer seem applicable due to the formation of short loops, namely trian-
gles. The lack of obvious analytical tools [16] and techniques for incorporating
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_14,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
238 Y. Berchenko et al.
triangles into random graph models with an arbitrary degree distribution [21]
has led researchers to pursue several different avenues. One should mention
several of these attempts:
• Giving up on analytic predictions, and conducting instead descriptive
studies [30], where various clustering indices are defined and measured
for a given real-world network. Resorting to simulations is also quite
common [33].
• Considering special cases, which are amenable to analysis. For example,
constructing a one-mode projection of a bipartite graph [14, 22], or
the framework of [25, 29], which generates exponential random graphs,
or Markov random graphs [32], which are flexible but more difficult to
analyze.
• Adopting results and criteria from the unclustered case and wrongly
applying them to clustered graphs, a common but somewhat naive practice.
A relevant example concerns the emergence of the giant component (GC):
it was shown [24] that in the usual, unclustered, case there is a GC if the
mean number of nodes at a distance two (z2) is larger than the mean number
of nodes at a distance one (z1). This result is often (wrongly) taken as the
criterion for clustered networks [22, 31], thereby initiating the quest to
calculate z2 in the presence of clustering [22, 23, 31].
Here we suggest constructing a branching process that is applicable for net-
works with triangles [7, 28]. This recent approach seems very promising, and
we will pay attention to it, using the formalism of [7], rather than that found
in [28]. The latter relies on the restrictive assumption that any two triangles
in a network will never share an edge. Even in this limited setting, the results
are only applicable for relatively low levels of clustering (C), and the concepts
are difficult to interpret and broaden.
In Section 2 we review the application of generating functions for unclus-
tered (C = 0) random networks [24] (2.1), and describe the novel free-excess
degree formalism for clustered networks [7] (2.2). In Section 3 we discuss crit-
icality in random clustered graphs. Most of this section is devoted to the
emergence of the GC, as indeed is the bulk of the literature, but we will also
briefly discuss the second critical point (3.2), where the graph becomes
connected, which has an impact on processes such as synchronization in networks
[17]. In Section 4 we show how to estimate the size of the GC, following [7];
then we broaden the setting to study the robustness and resilience of the GC,
i.e., bond, site and joint bond+site percolation (4.2). In Section 5 we describe
our simulations and compare the theory with data from real-world networks.
We discuss our findings in Section 6.
2 Generating Functions
A generating function is a clothesline on which we hang up a sequence of
numbers for display. [35]
For an excellent introduction to generating functions the reader is referred
to the book generatingfunctionology by Wilf [35]. Here we use the terminology
and notation used by Newman and colleagues [24] as it has been adapted for
network theory.
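The generating function for the degree distribution is
$G_0(x) = \sum_{k=0}^{\infty} p_k x^k$, (1)
where $p_k$ is the probability that a randomly chosen node on the graph has
degree k. The distribution $p_k$ is assumed to be normalized, so that $G_0(1) = 1$.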
The same will be true for all generating functions considered here. Because
the probability distribution is normalized and positive definite, G0 (x) is also
convergent for all |x| ≤ 1, and hence has no singularities in this region. The
function G0 (x), and indeed any probability generating function, has a number
of properties that will prove useful in subsequent developments.
Moments. The average over the probability distribution generated by a gen-
erating function, for instance, the average degree z1 of a node in the case of
G0 (x), is given by
$z_1 = \langle k\rangle = \sum_k k p_k = G_0'(1)$. (2)
Thus, if we know the values of the coefficients of a generating function, we
can calculate the mean of the probability distribution which it generates.
Powers. If the distribution of a property k of an object is generated by a
given generating function, then the distribution of the total of k summed over
m independent realizations of the object is generated by the m-th power of
that generating function. For example, if we choose m vertices at random from
a large graph, then the distribution of the sum of the degrees of those vertices
is generated by $[G_0(x)]^m$.
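The moments and powers properties can be checked numerically when the degree distribution has finite support, since the generating function is then a polynomial whose coefficient list is the distribution itself. The sketch below is illustrative only: the distribution `p` and the helper names are assumptions, not taken from the chapter.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_pow(a, m):
    """Coefficients of the m-th power of the polynomial a."""
    result = [1.0]
    for _ in range(m):
        result = poly_mul(result, a)
    return result

def mean(p):
    """Derivative of the generating function at x = 1, i.e. sum_k k * p_k."""
    return sum(k * pk for k, pk in enumerate(p))

p = [0.2, 0.5, 0.3]        # hypothetical p_0, p_1, p_2
z1 = mean(p)               # mean degree, G0'(1)
sum_dist = poly_pow(p, 2)  # distribution of the summed degree of two random nodes
```

Here `sum_dist[s]` is the probability that the two degrees add up to s, and its mean is twice `z1`, as the powers property requires.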
Another quantity that will be important to us is the distribution of the
degree of the vertices that we arrive at by following a randomly chosen edge.
Such an edge arrives at a node with probability proportional to the degree
of that node, and the node therefore has a probability distribution of degree
proportional to $k p_k$. The correctly normalized distribution is generated by
$\frac{\sum_k k p_k x^k}{\sum_k k p_k} = x \frac{G_0'(x)}{G_0'(1)}$. (3)
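As a small illustration of Eq. (3), the degree distribution of a node reached by following a random edge can be computed directly from $p_k$. This is a sketch under the assumption of a finite-support distribution stored as a list; the function name and the numbers are hypothetical. The last line uses the standard step from [24] of dropping the edge we arrived along, which shifts the distribution down by one and gives the distribution $q_k$ of remaining edges.

```python
def edge_end_degree_distribution(p):
    """Distribution k * p_k / z1 of the degree of a node reached via a random edge."""
    z1 = sum(k * pk for k, pk in enumerate(p))
    return [k * pk / z1 for k, pk in enumerate(p)]

p = [0.2, 0.5, 0.3]                  # hypothetical p_0, p_1, p_2
r = edge_end_degree_distribution(p)  # coefficients of x * G0'(x) / G0'(1)
q = r[1:]                            # remaining edges: q_k = (k + 1) * p_{k+1} / z1
```

Nodes of degree zero can never be reached along an edge, which is why the first entry of `r` vanishes.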
This is just the probability that of the i outgoing edges of $v_1$, i − k are connected
in a triangular formation that includes $v_0$, while the other k edges are not. Here,
as before, C is just the probability of a triangular formation. When $d(v_1)$ is
not known, from (8) we obtain
$e_k := \sum_{i=0}^{\infty} q_i \binom{i}{k} (1 - C)^k C^{i-k} = \sum_{i=0}^{\infty} q_i C^i \binom{i}{k} \left(\frac{1 - C}{C}\right)^k$. (9)
Let us remark that in deriving (8)–(13), it is possible to use any other clus-
tering index, such as c(k)—the degree-dependent clustering coefficient used
in [28]. However, it might be hard, if not impossible, to obtain a solution with
such a simple closed form.
As an example of how (13) may be useful, it is possible to determine the
mean free-excess degree:
$\sum_i i e_i = \left.\frac{dG_c(x)}{dx}\right|_{x=1} = (1 - C) G_1'(1) = (1 - C) z_e$. (14)
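Equations (9) and (14) can be cross-checked numerically. The following minimal sketch assumes an excess-degree distribution with finite support; the particular values of `q` and `C` are made up for illustration. It builds the free-excess distribution by the binomial sum of Eq. (9) and compares its mean with $(1 - C) z_e$.

```python
from math import comb

def free_excess_distribution(q, C):
    """e_k = sum_i q_i * binom(i, k) * (1 - C)**k * C**(i - k), as in Eq. (9)."""
    n = len(q)
    return [
        sum(q[i] * comb(i, k) * (1 - C) ** k * C ** (i - k) for i in range(k, n))
        for k in range(n)
    ]

q = [0.1, 0.4, 0.3, 0.2]  # hypothetical excess-degree distribution q_0..q_3
C = 0.2                   # hypothetical clustering coefficient
e = free_excess_distribution(q, C)

z_e = sum(i * qi for i, qi in enumerate(q))        # mean excess degree, G1'(1)
mean_free = sum(k * ek for k, ek in enumerate(e))  # mean free-excess degree
```

Each of the i outgoing edges is kept independently with probability 1 − C, so the construction is a binomial thinning of $q_i$; this is why the mean is simply scaled by 1 − C.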
Similarly, it will prove useful to calculate the mean number of edges emanating
outwards from nodes at a distance one to nodes at a distance two, beginning
from some arbitrary source node (note that this is not the mean number of
nodes at a distance two, because there is a positive probability
that two edges reach the same node at a distance two). As in (6) and
(7), the mean is
$\left.\frac{dG_0(G_c(x))}{dx}\right|_{x=1} = G_0'(1) G_1'(1) \cdot (1 - C) = (1 - C) z_1 z_e$. (15)
This parameter was also calculated in [23] by a different technique, but as will
be discussed shortly, its importance appears to have been overlooked.
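To see how the quantity in Eq. (15) might be used, the sketch below compares the mean number of edges reaching the second layer with the mean number reaching the first, which is how criterion B is phrased in the Discussion. The function names are hypothetical. For a Poisson degree distribution the mean excess degree equals the mean degree, so $z_e = z_1$ is used in the example, and the critical value $(1 - C)^{-1}$ is the one quoted in the caption of Fig. 3.

```python
def edges_to_second_layer(z1, z_e, C):
    """Mean number of edges from distance one to distance two, Eq. (15)."""
    return (1.0 - C) * z1 * z_e

def criterion_b_margin(z1, z_e, C):
    """Positive above the critical point, zero at it, negative below it."""
    return edges_to_second_layer(z1, z_e, C) - z1

def critical_mean_degree_poisson(C):
    """Critical mean degree for a Poisson network, where z_e equals z1."""
    return 1.0 / (1.0 - C)

# Poisson example from the caption of Fig. 2: z1 = 1.25, C = 0.2.
margin = criterion_b_margin(1.25, 1.25, 0.2)
```

The margin vanishes for these values, consistent with the statement that this network sits at the critical point according to criterion B.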
In their seminal paper, Molloy and Reed [18] introduced the parameter
$Q := \sum_i i p_i (i - 2)$, which identifies the phase transition in random graphs,
i.e., the point where a GC is born. Their procedure utilizes a method for
constructing a random graph, which may be viewed as “walking through a
graph” (Fig. 1a) and assessing the number of unknown nodes encountered
along the way. Suppose one follows a random edge to a node v having degree k.
How does this change the number of unknown nodes? First of all, by arriv-
ing at v the number of unknown nodes decreases by one. However, because
v itself has degree k, then this leads to an increase of (k − 1) in the number
of unknown nodes. The net effect is that the number of unknown nodes in-
creases by (k − 2). In order to calculate the expected change, the probability
3 p is usually a function of N, p(N).
Fig. 1. Graphical illustration of the exposure procedure. Choose a node at random, say
V0, and start diffusing from it and counting the nodes encountered on the way. a) When
C = 0 and the network is tree-like (see footnote 1), after counting the new nodes
(a1–a4) we pick one of them at random, say a1, and count its new neighboring nodes
(b1–b3), which are distributed according to $\{q_i\}_{i=0}^{\infty}$. In the next step, we randomly
choose one of the nodes (a2–a4, b1–b3) and continue until the entire component
is exposed. b) When C > 0, two modifications are required to deal with cycles due
to triangles (the dashed edges): we use $\{e_i\}_{i=0}^{\infty}$ and diffuse depthwise. After counting
a1–a4, when we count the neighbors of a1 we avoid overcounting a2 because $\{e_i\}_{i=0}^{\infty}$
governs the distribution of the solid-black edges. In the next step if we go from a1 to
b3 in order to count the neighbors of b3, again we avoid overcounting a2 (because it is
connected to a1). The depthwise exposure, which is a permissible scheme [18], is used
to avoid dependencies.
Fig. 2. The difference between the new criterion B and the conventional criterion A.
a) Consider the following example: a typical node has a neighborhood similar to V0, with 3
nodes at a distance one in the first layer, L1, and 2 nodes at a distance two in the
second layer, L2, but 4 edges to the second layer (from L1 to L2). Criterion B predicts
a GC, while criterion A fails to predict a GC. b) The size of the largest component
plotted vs. N for Poisson networks having mean degree z1 = 1.25 and C = 0.2 (i.e.,
at the critical point according to criterion B). Indeed the size at the critical point
correctly scales as $\sim N^{2/3}$, as is known for the case z1 = 1, C = 0 (see references in [8]).
Note that criterion A would wrongly predict this regime to be below the critical point
(since z2 ≈ 1.19 < z1) and would suggest that all components should scale as O(log N).
Fig. 3. The critical mean degree $z_1^*$ for the formation of a GC, plotted as a function
of C. a) Poisson degree distribution. Predictions of criterion A (grey line; z2 estimated
as in [31]). Predictions of criterion B (black line; $z_1^* = (1 - C)^{-1}$ (see text)). Empirical
estimates of $z_1^*$ (circles) were obtained through the following procedure in order to
overcome finite size effects: first the value of the size of the largest component was
found for networks with C = 0 at the known threshold $z_1^* = 1$ (b; dashed line). This
value was used to identify the critical threshold in comparable networks with C > 0.
c) SF degree distribution. Symbols as in a. Black and grey lines, which practically
overlap, are based on expressions for z1 and ze for SF networks [24].
Although the transition to complete connectivity is less well studied, the fol-
lowing example makes clear the need for further work in this area, particularly
for clustered networks.
In a recent series of papers [12, 17], the effect of clustering on a network
of coupled phase oscillators was examined. These authors made the plausible
assumption that a network with a very high mean degree
would be connected. When they [17] found groups of oscillators,
each group oscillating at a different frequency, they named them “dynamical
clusters,” in order to distinguish them from the topological clusters (i.e.,
connected components).
[Figure: GC vs. C, curves for N = 100 and N = 200.]
In order to find the size of the GC, Andersson [3] examined the probability of
extinction in a two-phase branching process that mimics the construction of a
random graph (with C = 0). In this branching process the source node has a
number of direct descendants distributed according to $\{p_i\}_{i=0}^{\infty}$ (the first phase),
while each of its descendants has a number of direct descendants distributed
according to $\{q_i\}_{i=0}^{\infty}$ (the second phase). First, consider the probability u for
a lineage of a single branch that arrives at some node, $v_1$, to eventually die
out. This necessitates that all k branches leaving $v_1$ die out, an event that
occurs with probability $u^k$. Since the degree k of $v_1$ is unspecified, we obtain
the self-consistency condition $u = \sum_{k=0}^{\infty} q_k u^k = G_1(u)$, which can be solved
to find u.
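One simple way to solve the self-consistency condition is fixed-point iteration starting from u = 0, which converges to the smallest root in [0, 1]. The sketch below is an illustration rather than the procedure used by the authors; it assumes a Poisson degree distribution with a hypothetical mean degree z = 2, for which $G_0(x) = G_1(x) = e^{z(x-1)}$. The last line uses the standard result of [24] that the fraction of nodes in the GC is $1 - G_0(u)$, stated here as an assumption.

```python
from math import exp

def solve_extinction(G1, tol=1e-12, max_iter=100000):
    """Iterate u <- G1(u) from u = 0 until successive values differ by < tol."""
    u = 0.0
    for _ in range(max_iter):
        new_u = G1(u)
        if abs(new_u - u) < tol:
            return new_u
        u = new_u
    return u  # not converged within max_iter (e.g. exactly at the critical point)

z = 2.0                               # hypothetical mean degree

def G(x):
    """Poisson generating function; for Poisson networks G0 and G1 coincide."""
    return exp(z * (x - 1.0))

u = solve_extinction(G)               # probability that a single branch dies out
S = 1.0 - G(u)                        # fraction of nodes in the GC (assumed form)
```

Below the critical point the iteration converges to u = 1 and the GC fraction is zero; very close to the critical point convergence becomes slow, so a different root finder may be preferable there.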
The second step takes into consideration that the branching process begins
from some arbitrary source node. Because all branches originating from the
source must die out in order for the process to become extinct, the probability
where A is the graph adjacency matrix and D is a diagonal matrix with the degree
of node j at the Djj -th entry; the multiplicity of the eigenvalue 0 is the number of
connected components.
Another related question concerns the size of the GC in the presence of dilution,
i.e., when a fraction r of the nodes or edges (or a combination of nodes
and edges) has been randomly removed (see footnote 5).
This is understood to be related to the robustness and resilience of a network
against breakdowns of its units, the classic example being the World
Wide Web. Although the naive identification of functionality with the existence
of the GC is sometimes considered problematic (see footnote 6), this formalism does
have important applications as in, for example, the study of epidemic
outbreaks [10].
We can take the same approach as in the previous section and ask again for
the probability u that a lineage of a single branch that arrives at some node, $v_1$,
eventually dies out. In the case of node removal, following an edge in the
branching process we reach a node that is unoccupied (was removed) with
probability $r_n$. Therefore, the lineage will die out with probability $r_n$ plus
$1 - r_n$ times the probability that all of the lineages of the outgoing edges from
$v_1$ eventually die out (found via the self-consistency condition). Thus, step
(a) becomes: solve for u such that $r_n + (1 - r_n) G_1(u) = u$. By a similar consideration
of edge removal with probability $r_e$, replacing the $\{q_i\}_{i=0}^{\infty}$ with the free-excess
probabilities $\{e_i\}_{i=0}^{\infty}$ (or $G_1$ with $G_c$) and demanding that all branches originating
from the source eventually die out, we get the size of the GC in clustered
networks after joint edge+node removal:
5 Also known respectively as site, bond and joint site+bond percolation.
6 Durrett [10] gives a nice critique of the claim that “the internet is robust... after
dilution (in a certain parameter regime) we still get a GC.” In the regime referred to,
“if all 6 billion people were initially connected then after the removal only 36 people
can check their email.”
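The node-removal condition $r_n + (1 - r_n) G_1(u) = u$ described in this section can be solved by the same kind of fixed-point iteration used for the undiluted case. The following is a sketch under stated assumptions, not the authors' implementation: a Poisson degree distribution with hypothetical parameters, and the closed form $G_c(x) = G_1(C + (1 - C)x)$, which follows from Eq. (9) by the binomial theorem. Only the single-branch probability u is computed.

```python
from math import exp

def solve_u_with_node_removal(G, r_n, tol=1e-12, max_iter=100000):
    """Iterate u <- r_n + (1 - r_n) * G(u), starting from u = 0."""
    u = 0.0
    for _ in range(max_iter):
        new_u = r_n + (1.0 - r_n) * G(u)
        if abs(new_u - u) < tol:
            return new_u
        u = new_u
    return u  # not converged within max_iter

z, C, r_n = 2.0, 0.2, 0.2             # hypothetical parameters

def G1(x):
    """Poisson excess-degree generating function."""
    return exp(z * (x - 1.0))

def Gc(x):
    """Free-excess generating function, G1 evaluated at C + (1 - C) * x."""
    return G1(C + (1.0 - C) * x)

u_plain = solve_u_with_node_removal(G1, r_n)      # unclustered network, C = 0
u_clustered = solve_u_with_node_removal(Gc, r_n)  # G1 replaced by Gc
```

With these numbers the clustered network has the larger extinction probability, since edges tied up in triangles do not lead to new nodes.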
Clustered networks were generated by three different methods, all giving sim-
ilar results, each having its own advantages in terms of efficiency. In all the
methods, a degree sequence was generated by sampling from a desired distri-
bution. In two of the methods, a network was constructed according to the
generated degree sequence by using a fill algorithm [13]. In one case we then
selectively switched links [4] to reach a desired degree of clustering. In the
second case, we selectively reconnected links to nodes at distance two, which
led to an increase in the number of triangles. The third method was based
on distributing triangles in an empty network under the restrictions of the
degree sequence, and later filling in additional links using a fill algorithm [13].
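The link-switching step can be illustrated with a plain degree-preserving rewiring. The sketch below is not the algorithm of [4] and omits the selective part: the acceptance rule that steers the network toward a target clustering is not specified in this text, so every valid swap is accepted here. The function names and the ring example are hypothetical.

```python
import random
from collections import Counter

def switch_links(edges, n_swaps, seed=0):
    """Attempt n_swaps rewirings (a,b),(c,d) -> (a,d),(c,b), keeping all degrees."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would create a self-loop or change nothing
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def degree_sequence(edges):
    """Map each node to its degree."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

ring = [(i, (i + 1) % 8) for i in range(8)]  # hypothetical starting network
rewired = switch_links(ring, n_swaps=50)
```

A selective variant would recompute the clustering coefficient after each proposed swap and keep the swap only if it moves C toward the desired value.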
In Fig. 5 we plot simulations against theory for the size of the GC for a
variety of parameters. Figure 5a shows the size of the GC vs. the mean degree
for different values of C, rn and re, where rn and re are the fractions of nodes
and edges removed, respectively. In order to isolate the effect of clustering,
we have also plotted in Fig. 5b the size of the GC vs. C for a fixed mean degree.
The most revealing plot is that of the case rn = re = 0 (top line in
Fig. 5b), where there is good agreement at the lower values of C (i.e. C < 0.3),
Fig. 5. The size of the GC after dilution. a) As a function of the mean degree for
networks with Poisson degree distribution. A fraction rn and re of the nodes/edges
were removed randomly, for C = 0.1 and C = 0.2. b) As a function of C for networks
with Poisson degree distribution and z1 = 2. A fraction rn and re of the nodes/edges
were removed randomly. Black lines: our prediction for each case.
Fig. 6. The size of the GC after dilution in real-world networks. Grey: simulations with
bars at a width of one std, black: our predictions, broken line: the naive predictions
which do not consider C (i.e., C = 0). a) Node removal for the C. elegans neural
network. N = 453, C = 0.124, z1 = 8.9. b) Edge removal for the yeast protein-protein
interaction network. N = 2112, C = 0.055, z1 = 2.1. c) Joint node+edge removal for
the network of Zachary’s Karate club. N = 34, C = 0.255, z1 = 4.4.
6 Discussion
Perhaps the most far-reaching result presented here is our criterion B for the
existence of the GC. This simple and intuitive criterion (Is the mean number
of edges going to the second layer larger than the one going to the first?) is a
natural generalization of the well-established Molloy–Reed condition (Is the
mean number of nodes at the second layer larger than the one at the first?),
Acknowledgments
MT and YB are grateful for the support of the EC (project MATHfSS 15661)
and DIP (project Compositionality F 1.2). LS and YAR are grateful for
the support of the James S. McDonnell Foundation and the Israeli Science
Foundation.
References
1. Aiello, W., Chung, F., Lu, L.: A random graph model for massive graphs, Proc.
of the 32nd Annu. ACM Symposium on Theory of Computing (2000)
2. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du
Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK
User’s Guide, 3rd edition, SIAM, Philadelphia (1999)
3. Andersson, H.: Limit theorems for a random graph epidemic model, Ann. Appl.
Probab. 8, 1331–1349 (1998)
4. Artzy-Randrup, Y., Stone, L.: Generating uniformly distributed random networks,
Phys. Rev. E 72, 056708 (2005)
5. Barabasi, A.-L., Albert, R.: Emergence of scaling in random networks, Science
286, 509–512 (1999)
6. Bender, E. A., Canfield, E. R.: The asymptotic number of labeled graphs with
given degree sequences, J. Combin. Theory A 24, 296–307 (1978)
7. Berchenko, Y., Artzy-Randrup, Y., Teicher, M., Stone, L.: The emergence and
the size of the giant component in clustered random graphs with a given degree
distribution, submitted.
8. Bollobas, B.: Random Graphs, 2nd edition, Academic Press, New York (2001)
9. Callaway, D. S., Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Network ro-
bustness and fragility: Percolation on random graphs, Phys. Rev. Lett. 85, 5468
(2000)
10. Durrett, R.: Random Graph Dynamics, Cambridge U. Press, Cambridge, UK
(2006)
11. Erdős, P., Rényi, A.: On the evolution of random graphs, Publications of the
Mathematical Institute of the Hungarian Academy of Sciences 5, 17–61 (1960)
12. Gomez-Gardenes, J., Moreno, Y., Arenas, A.: Paths to synchronization on complex
networks, Phys. Rev. Lett. 98, 034101 (2007)
13. Gotelli, N. J., Entsminger, G. L.: Swap and fill algorithms in null model analysis:
Rethinking the Knight’s Tour, Oecologia 129, 281–291 (2001)
14. Guillaume, J. L., Latapy, M.: A realistic model for complex networks, (2003) cond-
mat/0307095.
15. Jeong, H., Mason, S., Barabasi, A.-L., Oltvai, Z. N.: Lethality and centrality in
protein networks, Nature 411, 41–42 (2001)
16. Keeling, M. J.: The effects of local spatial structure on epidemiological invasion.
Proc. R. Soc. London B 266, 859–867 (1999)
17. McGraw, P. N., Menzinger, M.: Analysis of nonlinear synchronization dynamics
of oscillator networks by Laplacian spectral methods, Phys. Rev. E 75, 027104
(2007)
18. Molloy, M., Reed, B.: A critical point for random graphs with a given degree
sequence, Random Structures and Algorithms 6, 161–179 (1995)
19. Molloy, M., Reed, B.: The size of the giant component of a random graph with a
given degree sequence, Combin. Probab. Comput. 7, 295 (1998)
20. Montoya, J. M., Sole, R. V.: Small world patterns in food webs, J. Theor. Bio.,
214, 405–412 (2002)
21. Newman, M. E. J.: The structure and function of complex networks, SIAM Review
45, 167 (2003)
22. Newman, M. E. J.: Properties of highly clustered networks, Phys. Rev. E 68,
026121 (2003)
23. Newman, M. E. J.: Random graphs as models of networks. In: Bornholdt, S.,
Schuster, H. G. (eds.) Handbook of Graphs and Networks, Wiley-VCH, Berlin
(2003)
24. Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Random graphs with arbitrary
degree distributions and their applications, Phys. Rev. E. 64, (2001)
25. Park, J., Newman, M. E. J.: Solution for the properties of a clustered network,
Phys. Rev. E 72, 026136 (2005)
26. Pastor-Satorras, R., Vazquez, A., Vespignani, A.: Dynamical and correlation
properties of the internet, Phys. Rev. Lett. 87, 258701 (2001)
27. Redner, S.: A Guide to First-Passage Processes, Cambridge University Press,
New York (2001)
28. Serrano, M. A., Boguna, M.: Percolation and epidemic thresholds in clustered
networks, Phys. Rev. Lett. 97, 088701 (2006)
29. Strauss, D.: On a general class of models for interaction, SIAM Review 28, 513–527
(1986)
30. Vazquez, A.: Growing networks with local rules: Preferential attachment, cluster-
ing hierarchy and degree correlations, cond-mat/0211528 (2002)
31. Volz, E.: Networks with tunable degree distribution and clustering, Phys. Rev. E
70, 056115 (2003)
32. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social
networks: I. An introduction to Markov random graphs and p*, Psychometrika 61,
401–426 (1996)
33. Watts, D. J., Strogatz, S. H.: Collective dynamics of small-world networks, Nature
393, 440–442 (1998)
34. White, J. G., Southgate, E., Thompson, J. N., Brenner, S.: Structure of the nervous
system of the nematode C. elegans, Phil. Trans. R. Soc. London 314, 1–340 (1986)
35. Wilf, H. S.: generatingfunctionology, 2nd edition, Academic Press, London (1994)
36. Zachary, W.: An information flow model for conflict and fission in small groups,
Journal of Anthropological Research 33, 452–473 (1977)
Technological Networks
Bivas Mitra
1 Introduction
The study of networks in the form of mathematical graph theory is one of
the fundamental pillars of discrete mathematics. However, recent years have
witnessed a substantial new movement in network research. The focus of the
research is shifting away from the analysis of small graphs and the properties
of individual vertices or edges to consideration of statistical properties of large
scale networks. This new approach has been driven largely by the availability
of technological networks like the Internet [12], World Wide Web network [2],
etc. that allow us to gather and analyze data on a scale far larger than pre-
viously possible. At the same time, technological networks have evolved as a
socio-technological system, as the concepts of social systems that are based on
self-organization theory have become unified in technological networks [13].
In today’s society, we have simple and universal access to great amounts
of information and services. These information services are based upon the
infrastructure of the Internet and the World Wide Web. The Internet is the
system composed of ‘computers’ connected by cables or some other form of
physical connections. Over this physical network, it is possible to exchange
e-mails, transfer files, etc. On the other hand, the World Wide Web (commonly
shortened to the Web) is a system of interlinked hypertext documents
accessed via the Internet, where nodes represent web pages and links represent
hyperlinks between the pages. Peer-to-peer (P2P) networks [26] have also
recently become a popular medium through which huge amounts of data can
be shared. P2P file sharing systems, where files are searched and downloaded
among peers without the help of central servers, have emerged as a major
component of Internet traffic. An important advantage in P2P networks is
that all clients provide resources, including bandwidth, storage space, and
computing power. In this chapter, we discuss these technological networks in
detail. The review is organized as follows. Section 2 presents an introduction
to the Internet and different protocols related to it. This section also specifies
the socio-technological properties of the Internet, like scale invariance, the
small-world property, network resilience, etc. Section 3 describes the P2P net-
works, their categorization, and other related issues like search, stability, etc.
Section 4 concludes the chapter.
2 The Internet
The Internet is a global network connecting millions of computers in a decentralized
form. Each Internet computer, called a host, is independent, and its operator can
choose any of the commercial Internet service providers (ISPs). Many computer
scientists regard the Internet as a “prime example of a large-scale, highly engineered,
yet highly complex system” (Fig. 1). The Internet is extremely heterogeneous; for
instance, data transfer rates and physical characteristics of connections vary widely.
In addition, the Internet evolves through large-scale self-organization. Technically,
the Internet can be defined as the network of networks that communicate using the
Transmission Control Protocol (TCP)/Internet Protocol (IP). This definition treats
the Internet as a purely technological system and overlooks the fact that knowledgeable
human activity makes the Internet work. Hence, more accurately, the Internet is a
global socio-technological system based on a technological structure and a set of
protocols [13]. Some of the important Internet-based services are e-mail, the World
Wide Web, remote access, and Internet telephony.
Fig. 2. IP addressing.
256 B. Mitra
The topology of the Internet is studied at two different levels. At the router
level, the nodes are the routers, and edges are the physical connections be-
tween them. At the interdomain (or autonomous system) level, each domain,
composed of hundreds of routers and computers, is represented by a single
node, and an edge is drawn between two domains if there is at least one route
that connects them. The topology of large-scale networks like the Internet is
characterized by the degree distribution $p_k$, defined as the fraction of
nodes in the network having degree k. In 1999, Faloutsos et al. [12] studied the
Internet at both levels, concluding that in each case the degree distribution
follows a power law (Fig. 3), i.e., $p_k \sim k^{-\gamma}$. The interdomain topology of the
Internet, captured at three different dates between 1997 and the end of 1998,
resulted in degree exponents between γ = 2.15 and γ = 2.2. The 1995 survey
of the Internet topology at the router level, containing 3888 nodes, found
γ = 2.48. In 2000, Govindan and Tangmunarunkit [15] mapped the connectivity
of nearly 150,000 router interfaces and nearly 200,000 router adjacencies,
confirming the power-law scaling with γ = 2.3. It is widely believed that the
scale invariance property of the Internet is related to the self-organization
property of the participating nodes. The preferential attachment tendency of
the nodes to join the network [42] stabilizes the degree distribution as the size
of the Internet becomes very large.
Fig. 3. The first data file holds link directions corresponding to the traceroute
directions, while the second file is an undirected version of the first file. There are a
total of 192,244 nodes, 636,643 directed links, and 609,066 undirected links. The average
and maximum node degrees (undirected) are 6.34 and 1071 respectively, and the node
degree distribution is plotted.
Technological Networks 257
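To make the definition of the degree distribution concrete, the following is a minimal Python sketch that computes $p_k$ from an edge list and estimates the exponent γ by a least-squares fit of log $p_k$ against log k. The function names are ours and purely illustrative; the log-log fit is a crude estimator chosen for brevity, and it is not the procedure used in the studies cited above. Isolated nodes do not appear in an edge list, so degree 0 is not represented.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {k: c / n for k, c in sorted(counts.items())}


def fit_power_law_exponent(pk):
    """Estimate gamma in p_k ~ k^(-gamma) from the slope of log p_k vs log k."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return -covariance / variance


# A star with one hub and three leaves: three nodes of degree 1, one of degree 3.
star = [(0, 1), (0, 2), (0, 3)]
print(degree_distribution(star))  # {1: 0.75, 3: 0.25}
```

On measured data the tail of $p_k$ is noisy, so in practice a binned or maximum-likelihood estimate is usually preferred over this direct fit.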
Internet as small world. An accurate characterization of the emergent
topological properties of the Internet and a better understanding of the un-
derlying processes that yield these characteristics are crucial for proper eval-
uation of network protocols and systems. In that vein, recent works [20, 5]
have shown the prevalence of small-world phenomena [24, 44] in the Internet.
Small-world graphs exhibit a high degree of clustering, yet have typically short
path lengths between arbitrary vertices. Yook [47] and Pastor-Satorras [32]
have studied the Internet at the domain/autonomous system level between
1997 and 1999 and found that its clustering coefficient ranges between 0.18
and 0.3, compared to the clustering coefficient 0.001 for random networks of
similar parameters. Moreover, the average path length of the Internet at the domain
level ranges between 3.70 and 3.77, and at the router level it is around 9, indicating
its small-world character. Small-world behavior in the Internet can be attributed to two
possible causes: first, the high variability of the node degree distribution and,
second, the preference of vertices for local connections [20]. With the
high variability of the node degree distribution, it is likely that two interconnected
vertices, say u and v, will share a neighbor w, particularly
when w is a node with extremely large degree. In that case u, v, and w
form a triangle. Such a pattern contributes directly to the computation of the
clustering coefficients of u, v, and w (i.e., $C_u$, $C_v$, and $C_w$) and results in a
larger overall average clustering coefficient C of the network. Thus, C grows
with the variability of vertex degree. Also, notice that with highly variable
vertex degrees, the average distance between two vertices (L) is short. This
happens because the shortest path is usually through those extremely popular
vertices. That is, highly popular vertices serve as good navigators through the
graph. On the other hand, preference for the local connectivity also results
in small-world behavior. The reason behind this is that, with a non-negligible
probability of a local connection, if a node u is connected to v and w, then
it is likely that v and w are also close to each other. As a result, there is a
non-negligible probability that a triangle will form among these vertices, re-
sulting in a higher clustering coefficient. Meanwhile, since there are still many
long-range connections, it is easy to find a short path between two randomly
chosen nodes.
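The two quantities used above to diagnose small-world behavior, the average clustering coefficient C and the average path length L, can be computed directly on a small graph. The sketch below is illustrative only: the function names are ours, the graph is assumed to be undirected and connected, and vertices of degree less than two are assigned a clustering coefficient of zero by convention.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)


def average_path_length(adj):
    """Mean shortest-path distance over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
print(average_clustering(adj), average_path_length(adj))
```

In this example the triangle gives vertices 0 and 1 a clustering coefficient of 1, vertex 2 has 1/3, and the pendant vertex has 0, so C = 7/12, while L = 4/3.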
In addition, researchers from Stanford University [37] found that as net-
works grow very large, they become very efficient in the number of steps
a data packet takes to get from one node to another node. The number of
steps grows logarithmically with the size of the network, which means that
for 10,000 nodes we need five steps, but for 100 million the number grows only
to 6.5. These networks also exhibit a clustering property, i.e., the relationships among
nodes are not randomly distributed but are grouped. Short path links mean
that some very short paths are sprinkled throughout the network and
may directly link one group to another. This conforms to Watts and Strogatz’s
model [44], where a low-dimensional regular lattice is transformed into a
small-world network.
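The lattice-to-small-world transformation can be sketched in a few lines. The code below is a simplified illustration in the spirit of the Watts and Strogatz construction, assuming a ring lattice in which every node is linked to its k nearest neighbors (k even) and each lattice edge is rewired with probability p; the function name, the seed handling, and the rule for choosing a new endpoint are our assumptions, not details taken from [44].

```python
import random


def watts_strogatz(n, k, p, seed=0):
    """Ring lattice of n nodes (k nearest neighbors each), rewired with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Regular ring lattice: link each node to k/2 neighbors on each side.
    for v in range(n):
        for j in range(1, k // 2 + 1):
            u = (v + j) % n
            adj[v].add(u)
            adj[u].add(v)
    # Visit every lattice edge once and rewire its far end with probability p.
    for v in range(n):
        for j in range(1, k // 2 + 1):
            u = (v + j) % n
            if u in adj[v] and rng.random() < p:
                candidates = [w for w in range(n) if w != v and w not in adj[v]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[v].discard(u)
                    adj[u].discard(v)
                    adj[v].add(w)
                    adj[w].add(v)
    return adj


lattice = watts_strogatz(20, 4, 0.0)      # p = 0 leaves the regular lattice intact
small_world = watts_strogatz(20, 4, 0.3)  # a few long-range shortcuts appear
```

Rewiring never creates self-loops or duplicate edges here, so the number of edges stays at nk/2 for every value of p; only the placement of the links changes.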
The Internet and other communication networks display a high degree of ro-
bustness: while key components regularly malfunction, local failures rarely
lead to the loss of the global information-carrying ability of the network [3].
It has been observed that network topology plays an important role in the
robustness of the Internet. Consider an arbitrary connected graph of N nodes,
and assume that a fraction f of the nodes has been removed. This leads
to important questions: What is the probability that the resulting subgraph
is connected, and how does it depend on the removal probability f?
For a broad class of graphs there exists a threshold probability $f_c$ such that
if $f < f_c$ the resulting subgraph is connected, but if $f > f_c$ the subgraph
becomes disconnected (Fig. 4). Here $f_c$ is termed the percolation threshold.
In the following discussion, we call a network fault tolerant (or robust) if
it contains a giant component comprising most of the nodes even after a
fraction of its nodes is removed.
The topology of the Internet and the failure probability of nodes can be
characterized by the probability distributions $p_k$ and $f_k$, respectively. Here $p_k$ is
the degree distribution, i.e., the probability that a randomly chosen node
has degree k. Similarly, $f_k$ is the probability that a vertex of degree k is
removed from the network. Nodes leave the Internet either because they fail [8]
or because an attack is mounted on important nodes [9]. Based
upon these basic parameters, an analytical framework has been derived to
examine the stability of the Internet (or any other network) whose vertices
undergo some dynamics [28]. The framework can be expressed
with the help of the following equation:
\[
\sum_{k=0}^{\infty} k\,p_k \bigl( k(1 - f_k) - (1 - f_k) - 1 \bigr) = 0. \tag{1}
\]
Equation (1) states the critical condition for the stability of the Internet
(characterized by $p_k$) undergoing any type of failure or attack (characterized
by $f_k$).
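As a worked special case of Eq. (1), consider uniform random failure, in which every node is removed with the same probability $f_k = f$. The short derivation below is our own illustration; it assumes that the first two moments $\langle k \rangle$ and $\langle k^2 \rangle$ of the degree distribution are finite, and the symbol $\kappa$ is introduced here only as shorthand.

```latex
\begin{align*}
\sum_{k} k\,p_k \bigl( k(1-f) - (1-f) - 1 \bigr)
  &= (1-f)\bigl( \langle k^2 \rangle - \langle k \rangle \bigr) - \langle k \rangle = 0 \\
\Longrightarrow \quad
f_c &= 1 - \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}
     = 1 - \frac{1}{\kappa - 1},
\qquad \kappa = \frac{\langle k^2 \rangle}{\langle k \rangle}.
\end{align*}
```

For example, a network in which every node has degree 4 has $\langle k \rangle = 4$ and $\langle k^2 \rangle = 16$, so $f_c = 1 - 4/12 = 2/3$. When $\langle k^2 \rangle$ grows without bound, as it does for a power-law degree distribution with $\gamma \le 3$ on an infinite network, $f_c$ approaches 1, which is consistent with the resilience to random breakdown reported in [8].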
Stability analysis of networks under different node disturbance
schemes. The existing empirical and theoretical results indicate that complex
networks can be divided into two major classes based on their degree distribution
$p_k$. In the first class of networks, $p_k$ peaks at an average degree $\langle k \rangle$
and decays exponentially for large k. The most investigated examples of such
exponential networks are the random graph model of Erdős and Rényi [11]
and the small-world model of Watts and Strogatz [44], both leading to a fairly
homogeneous network. In contrast, results on the Internet, World Wide Web,
and other large networks indicate that many systems belong to a class of
inhomogeneous networks, referred to as scale-free networks, for which $p_k$ decays
as a power law, i.e., $p_k \sim k^{-\gamma}$ [8]. While nodes with a very
large number of connections ($k \gg \langle k \rangle$) are practically absent in exponential
networks, highly connected nodes are statistically significant in scale-free
networks. In this review, we concentrate on scale-free networks, as this kind
of network is widely used to model the Internet.
In this section, we consider two node removal schemes. The first
scheme removes randomly selected nodes. In this case, the probability
of removal of a node of degree k is $f_k = f$, independent of k [8]. In the second
scheme, the most highly connected nodes are removed at each step, which emulates an
intentional attack on the network [9]. Formally, $f_k = 0$ when $k \le k_{max}$ and
$f_k = 1$ when $k > k_{max}$, i.e., all nodes in the network having degree greater
than $k_{max}$ are removed.
Next we discuss the stability of scale-free networks in the face of failure
and attack. The stability is measured by the change in the size of the giant
component S and the average path length l after removal of the fraction of
nodes. The maximum reduction in the size of the giant component indicates
the breakdown of the network.
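The measurement described above can be sketched as a small simulation: remove a set of nodes, then report the size of the largest surviving connected component relative to the original network size. The code is a minimal illustration with names of our choosing. As a simplifying assumption, the attack ranks nodes once by their initial degree rather than recomputing degrees after every removal.

```python
import random
from collections import deque


def giant_component_fraction(adj, removed):
    """Largest connected component among surviving nodes, divided by the original N."""
    alive = set(adj) - set(removed)
    seen = set()
    best = 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w in alive and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(adj)


def random_failure(adj, f, seed=0):
    """Random failure: every node is equally likely to be removed (f_k = f)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    return set(rng.sample(nodes, int(f * len(nodes))))


def targeted_attack(adj, f):
    """Attack: remove the fraction f of nodes with the highest initial degree."""
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    return set(ranked[: int(f * len(adj))])


# A star with hub 0 and nine leaves: an extreme case of a heterogeneous network.
star = {0: set(range(1, 10))}
star.update({leaf: {0} for leaf in range(1, 10)})
print(giant_component_fraction(star, {5}))                         # 0.9
print(giant_component_fraction(star, targeted_attack(star, 0.1)))  # 0.1
```

Removing a single leaf barely changes S, whereas removing the hub, which is also a single node, shatters the network. This is the qualitative contrast between failure and attack discussed in the text.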
Stability against random failure. We start by investigating the stability
of scale-free networks against random removal of nodes, looking at the changes in
the relative size of the giant component S and the average path length l [8]. In
a scale-free network, the size of the giant component S decreases slowly from
S = 1 as the fraction of nodes removed f increases (see Fig. 5). In random
failure, most of the removed nodes in the network have low degree; hence,
they have little impact upon the size of the giant component S. Eventually,
Fig. 5. The size of the giant component S and average path length l of an initially
connected network when a fraction f of the nodes are removed. Scale-free network
generated by the scale-free model with N = 10,000 and $\langle k \rangle = 4$. Squares indicate
random node removal, while circles correspond to preferential removal of the most
connected nodes [3].
Computer viruses and worms are posing serious challenges to the network
research community. In computer science jargon, ‘virus’ refers to malicious
software that spreads from computer to computer and can halt or hin-
der operations at numerous businesses and other organizations, disrupt
cash-dispensing machines, delay airline flights, and even affect emergency call
centers [41, 23, 4]. The structure of contact networks affects the rate and
extent of spreading of computer viruses, just as it does for human diseases;
understanding this structure is a key element in the control of infection. Thus,
recent works in epidemiological models have emphasized the effects of the
virus spread in scale-free networks, in which the degree distribution follows a
power law [16].
Various epidemic models available in the literature can be
used to formalize the spread of viruses in a network [33]. In these models,
susceptible (S) individuals do not have the disease but can contract
it if they come into contact with infected (I) individuals.
Infected individuals may gain permanent or temporary immunity
after some time and become recovered (R). Recovered individuals do not
take part in disease transmission. Various epidemic dynamics such as SI, SIS,
SIR, and SIRS exist in the literature [35, 36]. In SI dynamics, the number of infected
individuals increases until all S individuals become infected. If the I individuals in SI
dynamics become susceptible again after some time, SIS dynamics
results [34]. Computer viruses mostly fall into this category; they can be
‘cured’ by antivirus software, but without a permanent virus-checking program
the computer has no way to fend off subsequent attacks by the same
virus. Let us assume that any susceptible individual has a uniform probability
β per unit time of being infected by any infected individual, and that
infected individuals recover and become immune at some stochastically
constant rate γ. Then s, i, and r, the fractions of nodes in the states S,
I, and R respectively, are governed by the following differential equations:
\[
\frac{ds}{dt} = -\beta i s, \qquad \frac{di}{dt} = \beta i s - \gamma i. \tag{2}
\]
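The behavior of Eq. (2) can be explored numerically. The sketch below integrates the two equations with a forward Euler step; the third rate, dr/dt = γi, does not appear in Eq. (2) as printed and is added here on the assumption that the fractions are conserved, s + i + r = 1. The function name, step size, and parameter values are illustrative choices of ours.

```python
def integrate_epidemic(beta, gamma, i0, t_end, dt=0.01):
    """Forward Euler integration of Eq. (2); returns a list of (t, s, i, r)."""
    s, i, r = 1.0 - i0, i0, 0.0
    history = [(0.0, s, i, r)]
    steps = int(round(t_end / dt))
    for n in range(1, steps + 1):
        ds = -beta * i * s
        di = beta * i * s - gamma * i
        dr = gamma * i  # assumed from conservation: s + i + r = 1
        s, i, r = s + dt * ds, i + dt * di, r + dt * dr
        history.append((n * dt, s, i, r))
    return history


# Illustrative parameters: beta = 0.5, gamma = 0.1, 1% of nodes initially infected.
trajectory = integrate_epidemic(beta=0.5, gamma=0.1, i0=0.01, t_end=50.0)
t, s, i, r = trajectory[-1]
print(f"t = {t:.1f}: s = {s:.3f}, i = {i:.3f}, r = {r:.3f}")
```

Forward Euler is used only because it is the shortest scheme to write down; for quantitative work a higher-order integrator with error control would be the safer choice.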
The classical SIS model can be applied to a networked system in which the
infection probability is not constant but varies between the nodes
of the network depending on their degree. The quantity βi represents the
average rate at which a susceptible individual becomes infected by its neighbors.
If λ is the rate of infection via contact with a single infective node
and θ(λ) is the probability that a neighbor of a susceptible node of degree k is
infective, then the average rate of infection of a susceptible node of degree k
becomes βi = kλθ(λ). An implicit expression for θ(λ) is obtained in [35]:
\[
\frac{\lambda}{z} \sum_{k} \frac{k^2 p_k}{1 + k\lambda\theta(\lambda)} = 1, \tag{3}
\]
where z is the average degree and pk is the degree distribution. For particular
choices of pk , this equation can be solved for θ(λ) either exactly or approxi-
mately. For instance, for a power-law degree distribution, Pastor-Satorras and
Vespignani [34] solve it by making an integral approximation, and hence show
that there is no non-zero epidemic threshold for the SIS model in the power-
law case, i.e. the disease will always persist, regardless of the value of the
infection rate parameter. They have also generalized the solution to a num-
ber of other cases, including other degree distributions, finite-sized networks,
and models that include vaccination of some fraction of individuals [35, 36].
In the latter case, they tackle both random vaccination and vaccination tar-
geted at the vertices with highest degree. The results have shown that the
propagation of the disease turns out to be relatively robust against random
vaccination, at least in networks with right-skewed degree distributions, but
highly susceptible to vaccination of the highest-degree individuals.
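For a finite degree distribution, Eq. (3) can also be solved numerically. The sketch below uses bisection, relying on the fact that the left-hand side of Eq. (3) decreases monotonically in θ; a non-zero solution exists only when λ⟨k²⟩/z > 1, where ⟨k²⟩ denotes the second moment of the degree distribution. The function name and the tolerance are our assumptions.

```python
def solve_theta(pk, lam, tol=1e-12):
    """Solve Eq. (3) for theta(lambda) by bisection; pk maps degree k to p_k."""
    z = sum(k * p for k, p in pk.items())
    second_moment = sum(k * k * p for k, p in pk.items())

    def residual(theta):
        total = sum(k * k * p / (1.0 + k * lam * theta) for k, p in pk.items())
        return (lam / z) * total - 1.0

    # Below the threshold lam = z / <k^2> only the trivial solution theta = 0 exists.
    if lam * second_moment / z <= 1.0:
        return 0.0
    low, high = 0.0, 1.0  # residual(0) > 0 and residual(1) < 0 in this regime
    while high - low > tol:
        mid = 0.5 * (low + high)
        if residual(mid) > 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


# Every node has degree 4; Eq. (3) then gives theta = 1 - 1 / (4 * lam) in closed form.
print(solve_theta({4: 1.0}, lam=0.5))  # approximately 0.5
print(solve_theta({4: 1.0}, lam=0.2))  # 0.0, below the threshold lam = 0.25
```

The threshold used in the code, λ = z/⟨k²⟩, shrinks toward zero as ⟨k²⟩ grows, which is how the absence of a non-zero epidemic threshold in the power-law case shows up numerically.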
3 Peer-to-Peer Networks
Overlay networks. Peers in the P2P networks are typically connected via
ad hoc overlay connections. If a participating peer knows the location of an-
other peer in the network, then there is a link from the former node to the
latter in the overlay network. Based on how the nodes in the overlay network
are linked to each other, current P2P architectures can be classified into
three types [43]: centralized; decentralized and structured; and decentralized
but unstructured.
1. Centralized: All object index items are kept on a centralized server in
the form of object key, node address, etc. Each arriving node must
actively notify this server about the objects it holds. Therefore, the
querying node only needs to consult the central server to obtain the address
of a peer holding the searched object. To download the
object, the querying node directly establishes a connection
with that peer and downloads the item. This type of P2P architecture is
very simple and easy to deploy, but it suffers from a single point
of failure, although several parallel servers can be used. An example of this
network type is Napster [31].
2. Decentralized and structured: A structured P2P network employs a
globally consistent protocol to ensure that any node can efficiently route
a search query to a peer that has the desired file. Most of the struc-
tured P2P networks are based on the distributed hash table (DHT), in
which a variant of consistent hashing is used to assign ownership of each
file to a particular peer [27]. A DHT is a hash table whose table entries
are distributed among different peers located in arbitrary locations. Each
data item is hashed to a unique numeric key. Each node is also hashed
to a unique ID in the same key space. Each node is responsible for a
certain number of keys; that is, the responsible node stores the key and
a pointer to the data item with that key. Keys are mapped to their re-
sponsible nodes. The searching and routing algorithms support two basic
operations: lookup(key) and put(key); lookup(k) is used to find the loca-
tion of the node that is responsible for the key k, and put(k) is used to
store a data item (or a pointer to the data item) with the key k in the
node responsible for k. Searches in structured systems follow
well-defined neighbor links; hence, these systems
guarantee that existing data is found within a bounded number of overlay hops. However,
the strict network structure imposes high overhead for handling the dynamics
of P2P networks caused by peer churn. Some well-known DHT-based
structured P2P networks are Chord, Pastry, Tapestry, CAN, and Tulip.
3. Decentralized and unstructured: An unstructured and decentralized
P2P network is formed when the overlay links are established arbitrarily.
As no special network structure needs to be maintained, unstructured P2P
systems are extremely resilient to peer churn. Searching in unstructured
networks is often based on flooding or its variation because there is no
control over data storage [26]. The main disadvantage with such networks
is that the queries may not always be resolved. Popular content is likely to
be available at several peers, but if a peer is looking for rare data shared
by only a few other peers, then it is highly unlikely that the search will
be successful [10]. Since there is no correlation between a peer and the
content managed by the peer, there is no guarantee that flooding will find
a peer that has the desired data. However, due to the high dynamicity of
peers, robustness is given the topmost priority. Most of the popular P2P
networks such as Gnutella and FastTrack are unstructured in nature [14].
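The key-to-node mapping behind the structured (DHT-based) architecture in item 2 can be illustrated with a toy example. The sketch below hashes node names and keys into the same identifier space and assigns each key to the first node whose identifier is not smaller than the key's, wrapping around the ring. It is a simplified illustration, not the protocol of any particular system: the class and method names are ours, `put` takes a value in addition to the key, and `lookup` consults a global view of the ring, whereas a real DHT routes the request hop by hop through per-node routing tables.

```python
import hashlib
from bisect import bisect_left

KEY_BITS = 32  # size of the toy identifier space (an arbitrary choice)


def hash_id(name):
    """Hash a node name or object key into the shared identifier space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** KEY_BITS)


class ToyDHT:
    def __init__(self, node_names):
        # Nodes sorted by identifier form the ring.
        self.ring = sorted((hash_id(name), name) for name in node_names)
        self.store = {name: {} for name in node_names}

    def lookup(self, key):
        """Return the node responsible for key (its successor on the ring)."""
        ids = [node_id for node_id, _ in self.ring]
        index = bisect_left(ids, hash_id(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key, value):
        """Store value (e.g., a pointer to the data item) at the responsible node."""
        node = self.lookup(key)
        self.store[node][key] = value
        return node

    def get(self, key):
        return self.store[self.lookup(key)].get(key)


dht = ToyDHT(["peer-a", "peer-b", "peer-c"])
owner = dht.put("song.mp3", "held by peer-b")
print(owner, dht.get("song.mp3"))
```

Because ownership depends only on the hash values, any peer that knows the ring can compute the responsible node for a key without flooding, which is the property that distinguishes structured from unstructured overlays.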
Searching is one of the most important services and utilities provided by the
P2P networks where users try to locate the desired object in the network.
Existing P2P systems support the simple object lookup by key or identi-
fier. Some existing P2P systems can handle more complex keyword queries,
which find documents containing keywords in queries. Searching techniques
are primarily forwarding based. Starting with the requesting node, a query is
forwarded or routed until the node which has the desired object is reached. To
forward query messages, each node must keep information about some other
nodes called neighbors. The information of these neighbors constitutes the
routing table of a node. The desired features of searching algorithms in P2P
systems include high-quality query results, minimal query packet overhead,
high routing efficiency, load balance, resilience to node failures, and support
of complex queries. The quality of query results is application dependent.
Generally, it is measured by the number of results and relevance. The query
packet overhead signifies the number of packets generated in the network to
satisfy a specific search query. The routing efficiency is generally measured
by the number of overlay hops per query. Different searching techniques make
different trade-offs between these desired characteristics.
Searching in structured P2P networks follows the well-defined neighboring
links to locate some specific object. This provides guarantees on finding exist-
ing data and bounds data lookup efficiency in terms of the number of overlay
hops. But it shows poor performance in the dynamic condition where peers
join and leave the network quite frequently. Searching in the unstructured
P2P systems is more challenging, as the overlay network does not follow any
structure dependent on the data storage. Searching techniques in unstruc-
tured networks can be classified as either flooding based or random walker
Technological Networks 265
groups of approximately the same size. The source sends query packets to its
neighbors in the first group, starting the first round of flooding. These neigh-
bors faithfully broadcast the query packets (but not back to the source). The
source also sets a limit on the scope of these broadcasting query packets, e.g.,
by using a TTL value. The first round of flooding may have a very narrow
scope with small TTL. This round of flooding may not return the desired re-
sult. Then the source sends query packets to its neighbors in the second group,
with a larger limit on the scope of the flooding. This process repeats until the
source obtains the desired result. It has been shown that Hurricane flooding
reduces the search cost to arbitrarily close to a lower bound for any search
algorithm and bounds the search latency by a logarithmic function of
the location of the target.
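The idea of repeating a flood with a progressively larger scope can be sketched as follows. This is a simplified expanding-TTL flood and not the Hurricane flooding algorithm itself: it does not partition the neighbors into groups, each round floods from all neighbors of the source, and the TTL schedule is an arbitrary assumption. A message is counted every time a node forwards the query to a neighbor other than the one it received the query from.

```python
def flood(adj, source, target, ttl):
    """TTL-limited flooding; returns (found, number_of_messages_sent)."""
    if source == target:
        return True, 0
    visited = {source}
    frontier = [(source, None)]  # (node, neighbor the query arrived from)
    messages = 0
    found = False
    for _ in range(ttl):
        next_frontier = []
        for node, parent in frontier:
            for neighbor in adj[node]:
                if neighbor == parent:
                    continue  # never send the query straight back
                messages += 1
                if neighbor == target:
                    found = True
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append((neighbor, node))
        if found:
            return True, messages
        frontier = next_frontier
    return False, messages


def expanding_flood(adj, source, target, ttls=(1, 2, 4, 8)):
    """Repeat the flood with growing TTL; returns (successful_ttl, total_messages)."""
    total = 0
    for ttl in ttls:
        found, messages = flood(adj, source, target, ttl)
        total += messages
        if found:
            return ttl, total
    return None, total


# A four-node chain 0 - 1 - 2 - 3; the object is held by node 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(expanding_flood(chain, 0, 3))  # (4, 6)
```

The example shows the trade-off: nearby objects are found cheaply with a small TTL, while a distant object pays for the unsuccessful early rounds (1 + 2 messages) before the round that reaches it (3 messages).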
3.2.1 Outcomes
Fig. 7. The impact of random failure upon the stability of superpeer networks
(y-axis: percolation threshold fr; x-axis: fraction of peers r).
Random failure. The analysis in [30] shows that superpeer networks
are quite robust against churn (Fig. 7). Since churn affects peers and
superpeers in proportion to their fractions in the network, peers are
affected much more than superpeers. The removal of a significant number of
low-degree peers along with a few high-degree superpeers has little impact upon
the stability of the network. Practical experience also confirms that superpeer
networks exhibit high robustness in the face of churn. Another significant
observation is that a low fraction of superpeers in the network (specifically,
below 5%) results in a sharp fall in the percolation threshold; that
is, the vulnerability of the network increases drastically when the fraction of
superpeers is below 5%.
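A rough numerical illustration of this robustness follows from the stability condition of Eq. (1). Setting $f_k = f$ for every degree gives the random-failure threshold f = 1 − ⟨k⟩/(⟨k²⟩ − ⟨k⟩). The sketch below evaluates it for a two-valued degree distribution in which a fraction r of the nodes are peers of degree kl and the rest are superpeers of degree km. The bimodal form, the function names, and the numbers are our assumptions for illustration; they are not taken from [30].

```python
def random_failure_threshold(pk):
    """Critical fraction of randomly removed nodes, from Eq. (1) with f_k = f."""
    mean = sum(k * p for k, p in pk.items())
    second_moment = sum(k * k * p for k, p in pk.items())
    if second_moment <= 2.0 * mean:
        return 0.0  # no giant component even before any removal
    return 1.0 - mean / (second_moment - mean)


def superpeer_distribution(r, kl, km):
    """Assumed bimodal network: fraction r of peers (degree kl), rest superpeers (km)."""
    return {kl: r, km: 1.0 - r}


# Hypothetical values: 95% peers of degree 3, 5% superpeers of degree 25.
pk = superpeer_distribution(r=0.95, kl=3, km=25)
print(round(random_failure_threshold(pk), 4))  # 0.8852
```

With these illustrative numbers, roughly 88% of the nodes would have to fail at random before the giant component disappears, because the few high-degree superpeers dominate ⟨k²⟩.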
Degree-dependent failure. Figure 8 shows that as the
superpeer degree $k_m$ increases, the critical attack exponent $\gamma_c$
at which the network percolates decreases. This increases the fraction
of superpeers that must be removed to break down the network. Since
increasing $k_m$ increases the fraction of peers r, the removal of most of the
low-degree peers along with a fraction of superpeers increases the percolation
threshold $f_d$. It is also interesting to observe that the percolating $\gamma_c$ remains
quite low, below 0.1, for the entire range of $k_m$. The reason is that
small values of $\gamma_c$ result in the removal of a higher fraction of superpeer
nodes from the network. Since degree-dependent failure mainly removes
the lower-degree nodes, which contribute little to breaking the network down,
removal of a significant number of superpeers becomes necessary.
Fig. 8. Change in critical attack exponent $\gamma_c$ and percolation threshold $f_d$ with respect
to superpeer degree $k_m$ for superpeer networks undergoing degree-dependent failure.
Here the mean degree $\langle k \rangle$ varies from 8 to 16. The x-axis represents the superpeer
degree ($k_m$), and the y-axis represents the corresponding $\gamma_c$ and $f_d$.
Degree-dependent attack. Reference [29] analyzes the behavior of superpeer networks
against degree-dependent attack, where $k_l$ and $k_m$ are the degrees of peers and
superpeers respectively and r is the fraction of peers in the network. In [29],
the authors have established the critical condition for the stability of the
network against degree-dependent attack:
The inequality gives the set of solutions for the critical exponent $\gamma_c$ and
subsequently the normalizing constant C, which determines the fraction of peers
and superpeers to be attacked. The nature of the solution set $S_{\gamma_c}$ of the
inequality has a profound impact upon the fraction of peers and superpeers
that must be removed and upon the percolation threshold $f_c$. The breakdown of
the network can be due to one of the following three situations.
Case A. Removal of all the superpeers along with a fraction of peers. Networks
having a bounded solution set $S_{\gamma_c}$, where $0 \le \gamma_c \le \gamma_c^{bd}$, exhibit this kind
of behavior at the maximum value of the solution, $\gamma_c = \gamma_c^{bd}$.
Case B. Removal of only a fraction of superpeers. Networks having an
unbounded solution set $S_{\gamma_c}$, where $0 \le \gamma_c \le +\infty$, exhibit this kind of behavior as
$\gamma_c \to \infty$.
Case C. Removal of some fraction of both superpeers and peers. An intermediate
critical exponent $\gamma_c \in S_{\gamma_c}$ signifies the fractional removal of both peers and
superpeers.
Figure 9 shows that the solution set $S_{\gamma_c}$ of the networks remains bounded up to a
threshold superpeer fraction $sp_{th}$ ($sp_{th} = 0.19$ and 0.41 for $k_l = 3$ and $k_l = 4$,
respectively). Hence, the removal of all the superpeers along with a fraction of the
peers is necessary to disintegrate the network (Fig. 9). It also
shows some instances of case B, where only some fraction of the superpeers
needs to be removed.
Fig. 9. [Plot residue: boundary γc (γcbd) for peer degrees kl = 3 and kl = 4; percolation
threshold (fc), peer fraction removed (fp), and superpeer fraction removed (fsp) for kl = 3.]
4 Conclusion
In this chapter, we have presented a comprehensive study of various aspects of
technological networks. We have considered two different technological networks:
the Internet and P2P networks. The protocols used in
the Internet have been discussed briefly along with their services. Empirical
studies of the different topological properties of the Internet, such as scale
invariance and the small-world property, have been reviewed. The fault
tolerance of the Internet has been discussed in the light of general stability
analysis. The spread of computer viruses has been modeled by network-aware
epidemic models. We have also shed some light on recent advancements in,
and classifications of, P2P networks. As search is one of the most important
services provided by P2P systems, different search techniques and
a comparative study of them have been provided. The stability of P2P networks
in the face of churn and attack has also been discussed as a continuation of
the Internet fault tolerance analysis.
The advancement of the Internet has also posed some serious challenges
to the network research community. One of the significant problems
is modeling the widely varying Internet traffic. An appropriate model of
the Internet is often useful for measuring the efficiency of routing algorithms and
the quality of service (QoS) of different web applications. Maintaining a specific
QoS in a faulty environment is another major research issue. There is
always substantial uncertainty when making network management decisions.
A decision maker is limited not only by possessing only partial information
due to decentralized control but also by the impossibility
of predicting the future in terms of traffic demand and/or network topology
status. Hence, managing the large-scale Internet is also a non-trivial issue.
Understanding the assortative or disassortative relations among different
participating nodes and their impact upon the complex structural properties is
also a major research problem.
References
1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, B. A. Huberman, Search in power-
law networks, Physical Review E, 64, 046135, 2001.
2. R. Albert, H. Jeong, A.-L. Barabasi, Diameter of the world wide web, Nature, 401,
130–131, 1999.
3. R. Albert, H. Jeong, A.-L. Barabasi, Error and attack tolerance of complex net-
works, Nature, 406, 2000.
4. N. Berger, C. Borgs, T. Chayes, A. Saberi, On the spread of viruses on the Internet,
Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA),
301–310, 2005.
5. T. Bu, D. Towsley, On distinguishing between Internet power law topology gen-
erators, Proceedings of INFOCOM, New York, NY, USA, 2002.
6. D. Clark, Face-to-face with peer-to-peer networking, IEEE Computer, 34 (1),
pp. 18–21, January 2001.
7. Clip2 Company, Gnutella. http://www.clip2.com/gnutella.html.
8. R. Cohen, K. Erez, D. Avraham, S. Havlin, Resilience of the Internet to random
breakdown, Physical Review Letters, 85 (21), 2000.
9. R. Cohen, K. Erez, D. Avraham, S. Havlin, Resilience of the Internet under in-
tentional attack, Physical Review Letters, 86 (16), 2001.
10. Q. Deng, H. Lv, Analyzing unstructured peer-to-peer search networks with
QIL, Proceedings of the IEEE International Conference on Services Computing,
pp. 547–550, Shanghai, China, 2004.
11. P. Erdős, A. Rényi, On random graphs I, Publicationes Mathematicae Debrecen, 6,
290–297, 1959.
12. M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the Internet
topology, Computer Communications Review, 29, 251–262, 1999.
13. C. Fuchs, The Internet as a self-organizing socio-technological system, Cybernetics
and Human Knowing, 12 (31), pp. 37–81, 2005.
14. Gnutella: www.gnutellaforums.com.
15. R. Govindan, H. Tangmunarunkit, Heuristics for internet map discovery, Proceed-
ings of IEEE Infocom, 2000.
16. C. Griffin, R. Brooks, A note on the spread of worms in scale-free networks, IEEE
Transactions on Systems, Man, and Cybernetics, Part B, Feb. 2006.
17. L. Guo, S. Jiang, X. Zhang, H. Wang, LightFlood: Minimizing redundant messages
and maximizing scope of peer-to-peer search, IEEE Transactions on Parallel and
Distributed Systems (TPDS) 19 (5), pp. 601–614, May 2008.
Fernando Peruani¹,²
¹ CEA-Service de Physique de l’Etat Condensé, Centre d’Etudes de Saclay,
91191 Gif-sur-Yvette, France
² Institut des Systèmes Complexes de Paris Île-de-France, 57/59, rue Lhomond,
F-75005 Paris, France; fernando.peruani@iscpif.fr
1 Introduction
An exhaustive and comprehensive review of the theory of complex networks
would nowadays be a titanic task, and it would result in a lengthy work
containing plenty of technical details of arguable relevance. Instead, this chapter
addresses very briefly the ABC of complex network theory, visiting only
the hallmarks of its theoretical foundations, to finally focus on two of the most
interesting and promising current research problems: the study of dynamical
processes on transportation networks and the identification of communities in
complex networks.
Fig. 1. The figure shows two network topologies: (a) network with exponential degree
distribution and (b) network with power-law (scale-free) degree distribution. Figure
taken from Ref. [12].
relevant properties of a node, its degree, which indicates the number of edges
attached to it, and ... denotes the average over all nodes of the system.
Though $\langle k \rangle$ is a useful and informative quantity, by itself it
cannot characterize the structure of the network, and typically a good
characterization also requires higher order moments, such as
$\langle k^2 \rangle$, $\langle k^3 \rangle$, etc. How many moments do we need
to know to unequivocally characterize the network? All the information about
the moments is contained in the degree probability distribution of the network
$p_k$ (see Fig. 1). $p_k$ is the probability of picking a node at random and
observing that its degree is $k$. The moments are computed as
$\langle k^n \rangle = \sum_k k^n p_k$. If the network is such that the
vertices (nodes) are statistically independent, that is, the connections are
completely at random, then the degree probability distribution unequivocally
determines the properties of the network. If this is not the case and there
are correlations among nodes, the characterization of the network will require
the use of a degree-degree probability distribution, or an even higher
n-points probability distribution, etc. Let us assume for the moment that
vertices are statistically independent. There are three types of degree
distributions which, due to their ubiquity and simplicity, deserve to be
specially mentioned: a) the Poisson distribution, defined as
$p_k = e^{-\langle k \rangle} \langle k \rangle^k / k!$, which is the degree
distribution of a classical random graph; b) the exponential distribution,
defined as $p_k \sim e^{-k/\langle k \rangle}$ (see Fig. 1(a)); and c) the
power-law distribution (see Fig. 1(b)), which has the form
$p_k \sim k^{-\gamma}$, with $\gamma > 0$, and has (for infinite networks) all
moments of order $m > \gamma - 1$ diverging (for this reason these
distributions are referred to as scale-free). For distributions like a) and
b), the first moment of the distribution, i.e., $\langle k \rangle$,
unequivocally characterizes the network topology, but in general higher
moments are required to unequivocally determine the network topology.
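As a minimal sketch of the moment formula $\langle k^n \rangle = \sum_k k^n p_k$, the following Python snippet evaluates the first two moments of a Poisson degree distribution. The function names and the cutoff `k_max` are illustrative assumptions, not part of the chapter; the cutoff only truncates a negligible tail.

```python
import math

def moment(pk, n):
    """Return <k^n> = sum_k k^n p_k for a distribution given as {k: p_k}."""
    return sum((k ** n) * p for k, p in pk.items())

def poisson_pk(mean_k, k_max):
    """Poisson degree distribution, truncated at k_max (an assumed cutoff)."""
    return {k: math.exp(-mean_k) * mean_k ** k / math.factorial(k)
            for k in range(k_max + 1)}

pk = poisson_pk(4.0, 80)
first = moment(pk, 1)    # close to <k> = 4
second = moment(pk, 2)   # close to <k> + <k>^2 = 20
```

For the Poisson case the second moment is fixed by the first, which illustrates why $\langle k \rangle$ alone suffices for distribution a).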
Advances in the Theory of Complex Networks 277
The network diameter is defined as the maximal distance between any pair of
nodes. The above definition strictly works for fully connected networks; how-
ever, by redefining the diameter as the maximum distance among all fully con-
nected components (clusters) of the system, the definition is applicable to all
kinds of networks. Assuming that the network has a sort of tree-like structure,
a simple rough estimation can be obtained by equating $\langle k \rangle^d$ with $N$ as follows:
\[ d \sim \frac{\ln(N)}{\ln(\langle k \rangle)}. \tag{4} \]
It has been shown that Eq. (4) predicts the correct scaling of $d$ with $N$ and
$\langle k \rangle$ for random networks. Note that when $\langle k \rangle > \ln(N)$, a random network
278 F. Peruani
has a high probability of being totally connected [1]. The concept of network
diameter is closely related to another important quantity, the average path
length, which is the average distance between any pair of nodes.
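The two quantities just defined can be measured directly with breadth-first search. The sketch below assumes a connected, unweighted graph stored as a dictionary of neighbor lists (a representation chosen here for illustration), and also evaluates the rough tree-like estimate of Eq. (4).

```python
from collections import deque
import math

def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average_path_length(adj):
    """Diameter and average path length; assumes the graph is connected."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, total / pairs

def tree_like_estimate(n_nodes, mean_degree):
    """Rough estimate d ~ ln(N) / ln(<k>) of Eq. (4)."""
    return math.log(n_nodes) / math.log(mean_degree)

# A ring of 6 nodes as a small test graph.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
diam, apl = diameter_and_average_path_length(ring)
```

A ring is far from tree-like, so Eq. (4) is not expected to describe it; it is used here only because its distances are easy to check by hand.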
The network topology can also be studied through the adjacency matrix $A$,
which is an $N \times N$ symmetric matrix whose elements $A_{i,j}$ represent
the connections among the nodes of the network. If nodes $i$ and $j$ are
connected, then $A_{i,j} = 1$; otherwise, $A_{i,j} = 0$. The spectrum of the
network is the set of eigenvalues $\lambda_j$ of $A$, and since $A$ has $N$
eigenvalues, the spectral density takes the form
\[ \rho(\lambda) = \frac{1}{N} \sum_{j=1}^{N} \delta(\lambda - \lambda_j). \tag{5} \]
In the limit $N \to \infty$, $\rho(\lambda)$ becomes a continuous function.
Interestingly, the topology of the network is related to the spectral density
through
\[ \int d\lambda \, \lambda^k \rho(\lambda) = \frac{1}{N} \sum_{j} (\lambda_j)^k = \frac{1}{N} \sum_{i_1, i_2, \ldots, i_k} A_{i_1,i_2} A_{i_2,i_3} \cdots A_{i_k,i_1}. \tag{6} \]
Equation (6) counts, per node, the paths of $k$ steps that return to the node
they started from. One of the most remarkable results connected to this kind
of approach is Wigner's law, which applies to infinite random networks with a
connectivity $p \sim N^{-\xi}$. When $0 < \xi < 1$, Wigner's law predicts that
the spectral density is a semicircular distribution
\[ \rho(\lambda) = \frac{\sqrt{4Np(1-p) - \lambda^2}}{2\pi N p(1-p)} \]
for $|\lambda| < 2\sqrt{Np(1-p)}$ and is vanishing for larger values of
$|\lambda|$, except for the principal eigenvalue, which is isolated from the
bulk and increases with network size. For $\xi > 1$ the spectral density
deviates from Wigner's law and its odd moments vanish (i.e.,
$\langle \lambda^{2m+1} \rangle = 0$), indicating that the only way a path can
come back to the original node is by retracing the nodes previously visited,
i.e., there are no closed loops [4–9, 26].
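The right-hand side of Eq. (6) is the trace of the $k$-th power of the adjacency matrix divided by $N$. The following sketch computes it with plain nested lists, which is an implementation choice made here for self-containment, and compares a triangle with a three-node chain.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def closed_walks_per_node(adj_matrix, k):
    """(1/N) * trace(A^k), the right-hand side of Eq. (6), for k >= 1."""
    n = len(adj_matrix)
    power = adj_matrix
    for _ in range(k - 1):
        power = mat_mul(power, adj_matrix)
    return sum(power[i][i] for i in range(n)) / n

triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]

loops_triangle = closed_walks_per_node(triangle, 3)  # 3-step returns exist
loops_chain = closed_walks_per_node(chain, 3)        # tree: no odd returns
```

The chain has no closed loops, so its third moment vanishes, in line with the remark on odd moments above.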
There are equilibrium and non-equilibrium random networks. These terms are
associated to the way in which the network was grown. In this subsection we
briefly review how a network can be built.
Given a fixed number N of nodes and a fixed number M of edges, the network
is built by taking, for each edge, a randomly selected pair of nodes and
inserting an edge between them.
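The construction just described can be sketched in a few lines of Python. The rejection of self-loops and repeated edges is an assumption made here so that the result is a simple graph; the function name and the `seed` parameter are likewise illustrative.

```python
import random

def random_network(n_nodes, n_edges, seed=None):
    """Return a set of n_edges distinct undirected edges among n_nodes nodes."""
    max_edges = n_nodes * (n_nodes - 1) // 2
    if n_edges > max_edges:
        raise ValueError("too many edges for a simple graph")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        # Pick two distinct nodes uniformly at random (no self-loops).
        i, j = rng.sample(range(n_nodes), 2)
        # Store each edge in sorted order so duplicates are discarded.
        edges.add((min(i, j), max(i, j)))
    return edges

edges = random_network(20, 30, seed=1)
mean_degree = 2 * len(edges) / 20   # <k> = 2M / N
```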
Recall that the connection between the generating function and the probabil-
ity distribution it generates is given by
\[ p_k = \lim_{x \to 0} \frac{1}{k!} \frac{d^k G(x)}{dx^k}. \tag{12} \]
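As a short worked check of Eq. (12), consider the Poisson distribution $p_k = e^{-\langle k \rangle} \langle k \rangle^k / k!$. The closed form of its generating function below is a standard result and is used here only as an illustration, assuming $G(x) = \sum_k p_k x^k$.

```latex
\begin{align*}
G(x) &= \sum_{k=0}^{\infty} e^{-\langle k \rangle}
        \frac{\langle k \rangle^k}{k!} x^k
      = e^{\langle k \rangle (x - 1)}, \\
\frac{d^k G(x)}{dx^k} &= \langle k \rangle^k e^{\langle k \rangle (x - 1)}, \\
\lim_{x \to 0} \frac{1}{k!} \frac{d^k G(x)}{dx^k}
     &= e^{-\langle k \rangle} \frac{\langle k \rangle^k}{k!} = p_k .
\end{align*}
```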
Now we look for the generating function $H_1(x)$ of the distribution of cluster
sizes of surviving nodes that are reached by randomly choosing an edge and
following it to one of its ends. If we choose an edge that leads us to a removed
node, regardless of the degree of the node, we say that the cluster size we find
is zero. The probability of following the randomly chosen edge and finding a
surviving node of degree zero is zero, the probability of finding a surviving
node of degree one is $p_1 q_1 / z$, the probability of finding a surviving
node of degree two is $2 p_2 q_2 / z$, and so on. So, the probability of finding
a surviving node, regardless of its degree, is
$\sum_{k=0}^{\infty} k p_k q_k / z = F_1(1)$. In consequence, the probability
of finding an edge that leads to a removed node is $1 - F_1(1)$. Clearly, this
is also the probability of following a randomly chosen edge that leads to a
zero size component, and so also the coefficient $s_0$ that accompanies $x^0$
in $H_1(x)$. To find the full expression of $H_1(x)$, we still have to look for
the probabilities that accompany non-zero size components, i.e., $x^k$ with
$k > 0$. This can be computed from the probability $s_1$ of finding, by
following a randomly chosen edge, a component of size 1. This is nothing other
than the sum of the probabilities of following an edge and finding a surviving
node of degree $k$ which has its other $k - 1$ edges connected to removed nodes:
\[ s_1 = \sum_{k=1}^{\infty} \frac{k p_k q_k}{z} (1 - F_1(1))^{k-1} = F_1(H_1(0)). \tag{16} \]
$F_1(H_1(0))$, while the second derivative evaluated at $x = 0$ is $2F_1'(H_1(0))H_1'(0)$.
This suggests a self-consistency equation for $H_1(x)$ of the form
It can be easily verified that Eq. (18) leads to the correct expressions of
$s_0, s_1, \ldots, s_n$ by applying the definition given by Eq. (12). Along
similar lines, we can obtain the generating function $H_0(x)$ of the
distribution of the component size to which a randomly chosen node belongs.
The main difference is that instead of determining the probability of finding
a randomly chosen edge attached to a component of size $s$, we now randomly
choose a node and want to determine the probability of finding this node
belonging to a cluster of size $s$. For this reason, instead of using $P(k)$ as
before, we use $p_k q_k$ and its corresponding generating function $F_0(x)$.
The expression for $H_0(x)$ takes the form
\[ H_0(x) = (1 - F_0(1)) + x F_0(H_1(x)). \tag{19} \]
Finally from Eq. (19), we can obtain the average size of the components:
Equation (21) defines the critical condition for the stability of an uncorrelated
infinite network under an arbitrary attack. For failure, i.e., when the attack
does not depend on the degree k of the node, qk = q and from Eq. (21) the
classical percolation threshold for failure [13, 10] is retrieved as follows:
\[ q_c = 1 - \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}. \tag{22} \]
Notice that Eq. (22) defines the percolation threshold for infinite networks.
The critical qc strongly depends on system size and thus Eq. (22) fails to de-
scribe the stability of finite networks [17]. Also notice that a basic assumption
behind Eq. (21) is that the original network is uncorrelated. Expressions for
the percolation threshold of finite and/or correlated networks are still missing.
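To make Eq. (22) concrete, the sketch below evaluates its right-hand side for a given degree sequence. It only transcribes the formula; the function name is hypothetical, and, as the text stresses, the result refers to infinite uncorrelated networks, so applying it to a short list of degrees is purely illustrative.

```python
def critical_q(degrees):
    """Right-hand side of Eq. (22): 1 - <k> / (<k^2> - <k>)."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    if mean_k2 == mean_k:
        # Happens when every node has degree 0 or 1; Eq. (22) is undefined.
        raise ValueError("<k^2> - <k> vanishes; Eq. (22) is undefined")
    return 1.0 - mean_k / (mean_k2 - mean_k)

homogeneous = critical_q([3, 3, 3, 3, 3])      # 1 - 3 / (9 - 3)
heterogeneous = critical_q([1, 1, 1, 1, 8])    # broad degree sequence
```

A broader degree sequence increases $\langle k^2 \rangle$ and therefore pushes the value of Eq. (22) towards 1.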
In this section we address two current hot problems in complex networks:
dynamics on transportation networks and community identification in complex
networks. Part of the future advances of complex network theory will clearly
be along the lines of the problems reviewed in this section. However, we
warn the reader that this selection gathers only a small number of timely
issues on networks which are particularly attractive to the author. The number
of relevant open problems in the fast-evolving area of network theory far
exceeds the small selection presented here.
For the moment we assume that there is only one species diffusing in the
system. A metapopulation description of the transport process can be obtained
by thinking in terms of the mean occupation number $\tilde{n}_k(t)$ of nodes of
degree $k$ at time $t$, which by definition reads as
\[ \tilde{n}_k(t) = \frac{1}{N_k} \sum_{k^{(i)} = k} n^{(i)}(t), \tag{23} \]
where the sum runs over all nodes whose degree is $k$, $N_k$ refers to the total
number of nodes with degree $k$, and $n^{(i)}(t)$ denotes the occupation number
(= number of individuals) at node $i$. It is assumed that there is a diffusion
rate $d(k, k')$ that controls the migration of individuals from a subpopulation
with degree $k$ to another of degree $k'$. In consequence, the probability per
unit time $L_k$ for an individual at a node of degree $k$ of leaving the node is
$L_k = k \sum_{k'} p(k'|k) d(k, k')$, where $p(k'|k)$ is the conditional
probability that an edge departing from a node of degree $k$ points to a node
of degree $k'$. Thus, the (mean-field) time evolution of $\tilde{n}_k(t)$ can be
expressed as
\[ \partial_t \tilde{n}_k(t) = -L_k \tilde{n}_k(t) + k \sum_{k'} p(k'|k) d(k', k) \tilde{n}_{k'}(t). \tag{24} \]
The reasoning behind Eq. (24) is very simple. The first term on the right-hand
side accounts for the number of individuals that initially are in a node of
degree $k$ and then leave it, while the second term considers the increase of
individuals in $k$-degree nodes due to the migration of individuals from
subpopulations of degree $k'$ to $k$. For uncorrelated networks, $p(k'|k)$ takes
the form $p(k'|k) = k' p_{k'} / \langle k \rangle$ and Eq. (24) reduces to
\[ \partial_t \tilde{n}_k(t) = -L_k \tilde{n}_k(t) + \frac{k}{\langle k \rangle} \sum_{k'} k' p_{k'} d(k', k) \tilde{n}_{k'}(t). \tag{25} \]
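The mean-field equation for uncorrelated networks can be integrated numerically. The sketch below uses an explicit Euler scheme and assumes the diffusion rate $d(k, k') = p/k$ that the chapter adopts later, so that $L_k = p$; the degree distribution, time step, and function names are illustrative choices, not values from the text.

```python
def diffusion_step(n_k, p_k, rate, dt):
    """One explicit Euler step of the uncorrelated mean-field equation,
    assuming d(k', k) = rate / k', which gives L_k = rate for every k."""
    mean_k = sum(k * w for k, w in p_k.items())
    # sum over k' of k' * p_k' * d(k', k) * n_k'  with d(k', k) = rate / k'
    inflow = sum(k2 * p_k[k2] * (rate / k2) * n_k[k2] for k2 in p_k)
    return {k: n_k[k] + dt * (-rate * n_k[k] + (k / mean_k) * inflow)
            for k in p_k}

def relax(n_k, p_k, rate=1.0, dt=0.01, steps=3000):
    """Iterate the Euler step; returns the occupation numbers at the end."""
    for _ in range(steps):
        n_k = diffusion_step(n_k, p_k, rate, dt)
    return n_k

p_k = {1: 0.5, 2: 0.3, 4: 0.2}        # assumed degree distribution
start = {1: 10.0, 2: 10.0, 4: 10.0}   # same occupation on every node
final = relax(start, p_k)
mean_occupation = sum(p_k[k] * final[k] for k in p_k)
```

Under this assumption the mean occupation per node is conserved and the occupation numbers relax towards values proportional to the degree $k$.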
In the following discussion we assume that there are multiple species travel-
ing across the network which interact among themselves. We consider three
interacting species: susceptible, infected, and recovered individuals which fol-
low the classical Susceptible-Infected-Recovered (SIR) dynamics (see Fig. 2).
For a single population (node), an epidemic outbreak can occur depending
on the basic reproductive number R0 , which accounts for the number of sec-
ondary infected cases generated by a primary infected individual. The basic
reproductive number is defined as
\[ R_0 = \frac{\beta}{\mu}, \tag{27} \]
\[ \lambda_{k,k'} = \frac{d(k, k') \alpha N_k}{\mu}. \tag{28} \]
Now we can derive a simple approximate evolution equation for the number
of infected subpopulations $D_k^n$ of degree $k$ at generation $n$ for a random
graph in which each subpopulation has the same degree $k$ (so that the index
$k$ can be dropped and we write $D_n$),
\[ D_n = D_{n-1} (k - 1) \left[ 1 - \left( \frac{1}{R_0} \right)^{\lambda_{kk}} \right] \left( 1 - \frac{D_{n-1}}{N} \right). \tag{29} \]
The reasoning behind Eq. (29) is the following. Each of the $D_{n-1}$ infected
populations at generation $n - 1$ will seed during the next generation a number
of subpopulations proportional to $k - 1$ times the probability that the
neighboring subpopulations are not infected (i.e., $1 - D_{n-1}/N$), times the
probability that the new infected individuals cause a local outbreak (this
probability is proportional to $1 - R_0^{-\lambda_{kk}}$ since the probability
that a single individual will not transmit the disease is $R_0^{-1}$ [27]).
Assuming, as before, that $d(k, k') = p/k$, so that
$\lambda_{kk} = p N_0 \alpha / (\mu k)$ (where $N_0 = N_k$), and in addition
that $R_0 \simeq 1$, such that
$1 - R_0^{-\lambda_{kk}} \sim \lambda_{kk} (R_0 - 1)$, Eq. (29) reduces, as
long as $D_{n-1}/N \ll 1$, to
\[ D_n = \frac{k - 1}{k} p N_0 \alpha \mu^{-1} (R_0 - 1) D_{n-1}. \tag{30} \]
From Eq. (30) it is easy to observe that a macroscopic outbreak can only
occur if
\[ R_* = \frac{k - 1}{k} p N_0 \alpha \mu^{-1} (R_0 - 1) > 1. \tag{31} \]
Thus, the global invasion threshold is defined by Eq. (31). This implies that
to observe global spread the mobility rate has to be such that
\[ p \geq \frac{\mu k}{N_0 \alpha (k - 1)(R_0 - 1)}. \tag{32} \]
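The threshold condition can be explored numerically. The sketch below transcribes Eq. (31) and solves it for the mobility rate at which $R_* = 1$, keeping the factor $N_0$ explicit; function names and parameter values are hypothetical and chosen only for illustration.

```python
def invasion_number(k, p, n0, alpha, mu, r0):
    """R_* of Eq. (31) for a network in which every node has degree k."""
    return (k - 1) / k * p * n0 * alpha / mu * (r0 - 1)

def critical_mobility(k, n0, alpha, mu, r0):
    """Mobility rate p at which R_* = 1, obtained by solving Eq. (31) for p."""
    return mu * k / (n0 * alpha * (k - 1) * (r0 - 1))

# Illustrative parameter values (assumptions, not taken from the chapter).
k, n0, alpha, mu, r0 = 4, 1000, 0.5, 0.2, 1.1
p_c = critical_mobility(k, n0, alpha, mu, r0)
at_threshold = invasion_number(k, p_c, n0, alpha, mu, r0)
above = invasion_number(k, 2 * p_c, n0, alpha, mu, r0)
```

Because $R_*$ is linear in $p$, doubling the mobility rate above its critical value doubles $R_*$.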
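\[ D_k^n = (R_0 - 1) \frac{k p(k)}{\langle k \rangle} \sum_{k'} D_{k'}^{n-1} (k' - 1) \lambda_{k'k}. \tag{34} \]
\[ \lambda_{k',k} = \frac{p \alpha N_{k'}}{\mu k'}. \tag{35} \]
Consequently, the evolution equation for $D_k^n$ reads:
\[ D_k^n = (R_0 - 1) \frac{k p(k) p N_0 \alpha}{\mu \langle k \rangle^2} \sum_{k'} D_{k'}^{n-1} (k' - 1). \tag{36} \]
Multiplying both sides by $(k - 1)$ and taking the sum over $k$ on both sides,
Eq. (36) can be expressed as
\[ R_* = \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2} \frac{p N_0 \alpha}{\mu} (R_0 - 1) > 1. \tag{38} \]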
Equation (38) defines the global invasion threshold for a heterogeneous net-
work.
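A small numerical sketch shows how degree heterogeneity enters Eq. (38). The function below evaluates $R_*$ from a degree distribution given as a dictionary; the name, the parameter values, and the two example distributions are assumptions made for illustration.

```python
def invasion_number_heterogeneous(p_k, p, n0, alpha, mu, r0):
    """R_* of Eq. (38) for a degree distribution given as {k: p(k)}."""
    mean_k = sum(k * w for k, w in p_k.items())
    mean_k2 = sum(k * k * w for k, w in p_k.items())
    return (mean_k2 - mean_k) / mean_k ** 2 * p * n0 * alpha / mu * (r0 - 1)

params = dict(p=0.01, n0=1000, alpha=0.5, mu=0.2, r0=1.1)  # illustrative
regular = invasion_number_heterogeneous({4: 1.0}, **params)
broad = invasion_number_heterogeneous({1: 0.5, 7: 0.5}, **params)
```

Both distributions have $\langle k \rangle = 4$. For the regular case the prefactor reduces to $(k - 1)/k$, recovering the homogeneous threshold, while the broader distribution yields a larger $R_*$ at the same mobility rate.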
Though in recent years we have observed important progress related to the
dynamics on transportation networks, there are still many open questions to
be answered. For example, the degree of a subpopulation has been considered
so far decoupled from the subpopulation size. However, we know that, in
many cases, as in an airline transportation network which connects cities of
different sizes, degree and subpopulation size are strongly correlated. In fact, a
satisfactory network growth model for transportation networks is still lacking.
Regarding the dynamics on the nodes, typically death and birth processes
are ignored, even though small size nodes could experience large fluctuations
which in turn could dramatically affect global flow on the network. Bottleneck
effects due to limitation in the transportation channel, as well as limitation
in node capacity, are important problems that deserve to be investigated.
where the sum runs over the $m$ modules of the network, $l_s$ is the number of
links inside module $s$, and $d_s$ is the total degree of the nodes in module
$s$. The term $l_s/L$ denotes the fraction of links connecting pairs of nodes
belonging to module $s$, while $(d_s/2L)^2$ represents the fraction of links
that one would find in the module if links were placed at random in the
network, under the constraint of respecting the degree distribution of the
original network. If $q_s$ is such that
\[ q_s = \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \geq 0, \tag{40} \]
the module is well defined, in the sense that the module presents more links
than expected by random chance. The greater $q_s$, the better defined the
module. The identification method implies the maximization of $Q$, which in
turn involves sampling over all possible partitions of the network.
Unfortunately,
the number of possible subsets grows exponentially with the network size, and
the modularity optimization is an NP-complete problem [29]. Typically, the
ambitious goal of finding the true optimum of the measure is not attainable.
However, approximations of the maximum can be obtained by applying
optimization algorithms such as simulated annealing, extremal optimization, or
spectral division. Other drawbacks of the Newman–Girvan method are that it
cannot scan the network below some scale, leaving small modules undetected,
and that it may be affected by the time evolution of the network, i.e., by
network size.
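Evaluating the modularity of one given partition is cheap; only the search over partitions is hard. The following sketch assumes that $Q$ is the sum over modules of the term $l_s/L - (d_s/2L)^2$ described above, with $L$ the total number of links; the edge-list representation and the function name are illustrative choices.

```python
def modularity(edges, membership):
    """Q = sum over modules s of [ l_s / L - (d_s / (2 L))^2 ].

    edges: list of undirected links (u, v); membership: node -> module label.
    """
    total_links = len(edges)
    links_inside = {}   # l_s: links with both ends in module s
    total_degree = {}   # d_s: summed degree of the nodes in module s
    for u, v in edges:
        mod_u, mod_v = membership[u], membership[v]
        total_degree[mod_u] = total_degree.get(mod_u, 0) + 1
        total_degree[mod_v] = total_degree.get(mod_v, 0) + 1
        if mod_u == mod_v:
            links_inside[mod_u] = links_inside.get(mod_u, 0) + 1
    return sum(links_inside.get(s, 0) / total_links
               - (d / (2 * total_links)) ** 2
               for s, d in total_degree.items())

# Two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the graph into its two triangles gives a positive $Q$, whereas putting all nodes into one module gives $Q = 0$.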
[Figure 4. Top panel, bipartite network: team nodes A, B, C, D and actor
nodes 1, 2, 3, 4. Bottom panel, unipartite projection: actor nodes 1, 2, 3, 4.]
Fig. 4. The figure shows a scheme of a growing bipartite network. The team node
D represents a new incoming node. The scheme at the bottom indicates the resulting
unipartite projection of actor nodes (see text).
where $c_{aa'}$ is the actual number of movies in which $a$ and $a'$ have
co-starred. Notice that the identification of modules through the optimization
of $Q_B$ leads to the same type of problems present in the Newman–Girvan
method: the method leaves small modules undetected and strongly depends on
network size.
The identification of communities in complex networks is extremely
important, since it can reveal functional relationships between nodes. So far
the available methods for modularity identification are purely
phenomenological and they cannot guarantee the correct identification of the
community structure. A theoretical foundation for modularity identification is
still lacking. Due to the relevance of the problem, we expect to observe
important theoretical progress in this direction in the near future.
4 Concluding Remarks
The complex network community has been growing for years. Every day we
see new articles on complex networks, and the evolution of the field seems
limitless. In such a dynamic research field, any prediction about the future
of complex network theory is extremely risky. The two hot topics selected
in this chapter, dynamical processes on transportation networks and
identification of communities in complex networks, are certainly areas that
will experience important progress in the near future. Very important progress
is also expected in many other areas, for example, in dynamical networks of
moving agents. In the coming years we will witness substantial new progress
in network research.
References
1. R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
2. S.N. Dorogovtsev and J.F.F. Mendes, Evolution of Networks: From Biological Nets
to the Internet and WWW, Oxford University Press, Oxford, UK (2003).
3. F. Chung and L. Lu, Adv. Appl. Math. 26, 257 (2001).
4. E.P. Wigner, Ann. Math. 62, 548 (1955).
5. E.P. Wigner, Ann. Math. 65, 203 (1957).
6. E.P. Wigner, Ann. Math. 67, 325 (1958).
7. M.L. Mehta, Random Matrices, 2nd ed., Academic Press, New York (1991).
\[ g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}. \tag{1} \]
Clique: Cliques are complete graphs where all nodes are connected to
all other nodes.
\[ C_i = \frac{|\{e_{jk}\}|}{k_i (k_i - 1)} : v_j, v_k \in N_i, \; e_{jk} \in E. \tag{3} \]
Geodesic Path: The geodesic path between two vertices is the shortest
path between them.
Hyperedges: The edges in the network that join more than two nodes
together.
Incidence Matrix: The incidence matrix of a graph gives the (0, 1)-matrix
which has a row for each vertex and column for each edge, and (v, e) = 1 iff
edge e is incident on vertex v.
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{6} \]
k-core: A k-core is defined as the maximal subset where each node is con-
nected to at least k members.
k-plex: In a k-plex, all the nodes have degree at least (n − k). 1-plex
represents a clique.
n-clique: An n-clique is the maximal subset of the nodes where the dis-
tance between any two nodes u and v is less than or equal to n:
Network Motif: Network motifs are patterns that occur in different parts
of a network at frequencies much higher than those found in randomized net-
works.
Glossary of Essential Terms 299
Zipf's Law: Zipf's law states that given some corpus of natural language
utterances, the frequency of any word is inversely proportional to its rank in
the frequency table.
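Two of the glossary entries above translate directly into short routines: the ratio $J(A, B)$ of Eq. (6) and the k-core. The sketch below is only illustrative. It assumes an undirected simple graph stored as a dictionary of neighbor sets, returns 0 for two empty sets by convention, and finds the k-core by repeatedly pruning nodes with fewer than k remaining neighbors, which is one common way to compute it.

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|, as in Eq. (6)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0   # convention chosen here for two empty sets
    return len(a & b) / len(union)

def k_core(adj, k):
    """Nodes of the k-core, found by pruning nodes with fewer than k
    neighbors among the nodes that are still present."""
    alive = set(adj)
    degree = {u: len(adj[u]) for u in adj}
    stack = [u for u in adj if degree[u] < k]
    while stack:
        u = stack.pop()
        if u not in alive:
            continue          # already removed via an earlier entry
        alive.discard(u)
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                if degree[v] < k:
                    stack.append(v)
    return alive

# A triangle (0, 1, 2) with one pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
core2 = k_core(graph, 2)
core3 = k_core(graph, 3)
overlap = jaccard(graph[0], graph[1])   # neighbor sets {1, 2} and {0, 2}
```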
Index