Академический Документы
Профессиональный Документы
Культура Документы
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Lee Giles Marc Smith John Yen
Haizheng Zhang (Eds.)
Advances in Social
Network Mining
and Analysis
13
Volume Editors
Lee Giles
John Yen
Pennsylvania State University, College of Information Science and Technology
University Park, PA 16802, USA
E-mail: {giles, jyen}@ist.psu.edu
Marc Smith
Microsoft Research, One Microsoft Way, Redmond, WA 98002, USA
E-mail: masmith@microsoft.com
Haizheng Zhang
Amazon.com., Seattle, WA, USA
E-mail: haizhengzhang@gmail.com
ISSN 0302-9743
ISBN-10 3-642-14928-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14928-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
Preface
This year’s volume of Advances in Social Network Analysis contains the pro-
ceedings for the Second International Workshop on Social Network Analysis
(SNAKDD 2008). The annual workshop co-locates with the ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining (KDD). The
second SNAKDD workshop was held with KDD 2008 and received more than
32 submissions on social network mining and analysis topics. We accepted 11
regular papers and 8 short papers. Seven of the papers are included in this
volume.
In recent years, social network research has advanced significantly, thanks to
the prevalence of the online social websites and instant messaging systems as
well as the availability of a variety of large-scale offline social network systems.
These social network systems are usually characterized by the complex network
structures and rich accompanying contextual information. Researchers are in-
creasingly interested in addressing a wide range of challenges residing in these
disparate social network systems, including identifying common static topolog-
ical properties and dynamic properties during the formation and evolution of
these social networks, and how contextual information can help in analyzing the
pertaining social networks. These issues have important implications on commu-
nity discovery, anomaly detection, trend prediction and can enhance applications
in multiple domains such as information retrieval, recommendation systems, se-
curity and so on.
The second SNAKDD workshop focused on knowledge discovery and data
mining in social networks, such as contextual community discovery, link analysis,
the growth and evolution of social networks, algorithms for large-scale graphs,
techniques that can be used for recovering and constructing social networks
from online social systems, search on social networks, multi-agent-based social
network simulation, trend prediction of social network evolution, and related
applications in other domains such as information retrieval and security. The
workshop was concerned with inter-disciplinary and cross-domain studies span-
ning a variety of areas in computer science including graph and data mining,
machine learning, computational organizational and multi-agent studies, infor-
mation extraction and retrieval, and security, as well as other disciplines such as
information science, and social science.
In the first paper “Leveraging Label-Independent Features for Classification
in Sparsely Labeled Networks: An Empirical Study,” Brian Gallagher and Tina
Eliassi-Rad study the problem of within-network classification in sparsely labeled
networks. The authors present an empirical study and show that the use of LI
features produces classifiers that are less sensitive to specific label assignments
and can lead to significant performance improvement.
VI Preface
identifies the aspect of each approach that is responsible for the resolution limit
and proposes a variant, SGE, that addresses this limitation. The paper demon-
strates on three artificial data sets that (1) SGE does not exhibit a resolution
limit on graphs in which other approaches do, and that (2) modularity and the
compression-based algorithms, including SGE, behave similarly on graphs not
subject to the resolution limit.
We would like to thank the authors of all submitted papers for both the joint
workshop and this proceedings volume. We are further indebted to the Program
Committee members for their rigorous and timely reviewing. They allowed us
to make this workshop a major success.
Lee Giles
Marc Smith
John Yen
Haizheng Zhang
Organization
Program Chairs
Lee Giles Pennsylvania State University, USA
Marc Smith Microsoft, USA
John Yen Pennsylvania State University, USA
Haizheng Zhang Amazon.com, USA
Program Committee
Lada Adamic
Aris Anagnostopoulos
Arindam Banerjee
Tanya Berger-Wolf
Yun Chi
Aaron Clauset
Isaac Councill
Tina Eliassi-Rad
Lise Getoor
Mark Goldberg
Larry Holder
Andreas Hotho
Gueorgi Kossinets
Kristina Lerman
Wei Li
Yi Liu
Ramesh Nallapati
Jennifer Neville
Cheng Niu
Dou Shen
Bingjun Sun
Jie Tang
Andrea Tapia
Alessandro Vespigani
Xuerui Wang
Michael Wurst
Xiaowei Xu
X Organization
Referees
Vladimir Barash Sanmay Das Evan Xiang
Mustafa Bilgic Anirban Dasgupta Xiaowei Xu
Matthias Broecheler Robert JÃschke Limin Yao
Guihong Cao Liu Liu Jing Zhang
Bin Cao Galileo Namata Yi Zhang
Table of Contents
1 Introduction
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 1–19, 2010.
c Springer-Verlag Berlin Heidelberg 2010
2 B. Gallagher and T. Eliassi-Rad
Fig. 1. Portion of the MIT Reality Mining call graph. We know the class labels for the
black (dark) nodes, but do not have labels for the yellow (light) nodes.
of known legitimate users, but for the vast majority of users, we do not know the
correct label. For such applications, it is reasonable to expect that we may have
access to labels for fewer than 10%, 5%, or even 1% of the nodes. In addition,
cell phone networks are generally anonymized. That is, nodes in these networks
often contain no attributes besides class labels that could be used to identify
them. It is this kind of sparsely labeled, anonymized network that is the focus
of this work. Put another way, our work focuses on univariate within-network
classification in sparsely labeled networks.
Relational classifiers have been shown to perform well on network classification
tasks because of their ability to make use of dependencies between class labels
(or attributes) of related nodes [1]. However, because of their dependence on
class labels, the performance of relational classifiers can substantially degrade
when a large proportion of neighboring instances are also unlabeled. In many
cases, collective classification provides a solution to this problem, by enabling
the simultaneous classification of a number of related instances [2]. However,
previous work has shown that the performance of collective classification can
also degrade when there are too few labels available, eventually to the point
where classifiers perform better without it [3].
In this paper, we explore another source of information present in networks
that does not depend on the availability or accuracy of node labels. Such infor-
mation can be represented using what we call label-independent (LI ) features.
The main contribution of this paper is an in-depth examination of the effects
of label-independent features on within-network classification. In particular, we
address the following questions:
2. Can LI features provide information above and beyond that provided by the
class labels? Answer: Yes.
3. How do LI features improve classification performance? Answer: Because
they are less sensitive to the specific labeling assigned to a graph, classifiers
that use label-independent features produce more consistent results across
prediction tasks.
4. Which LI features are the most useful ? Answer: A combination of a few di-
verse network-based structural characteristics (such as node and link counts
plus betweenness) is the most informative.
Section 2 covers related work. Section 3 describes our approach for modeling
label-independent characteristics of networks. Sections 4 and 5, respectively,
present our experimental design and results. We conclude the paper in Section 6.
2 Related Work
In recent years, there has been a great deal of work on models for learning and
inference in relational data (i.e., statistical relational learning or SRL) [3,4,5,6,7].
All SRL techniques make use of label-dependent relational information. Some
use label-independent information as well.
Relational Probability Trees (RPTs) [8] use label-independent degree-based
features (i.e., neighboring node and link counts). However, existing RPT studies
do not specifically consider the impact of label-independent features on classifier
performance.
Perlich and Provost [9] provide a nice study on aggregation of relational at-
tributes, based on a hierarchy of relational concepts. However, they do not con-
sider label-independent features.
Singh et al. [10] use descriptive attributes and structural properties (i.e., node
degree and betweenness centrality) to prune a network down to its ‘most infor-
mative’ affiliations and relationships for the task of attribute prediction. They
do not use label-independent features directly as input to their classifiers.
Neville and Jensen [11] use spectral clustering to group instances based on
their link structure (where link density within a group is high and between
groups is low). This group information is subsequently used in conjunction with
attribute information to learn classifiers on network data.
There has also been extensive work on overcoming label sparsity through tech-
niques for label propagation. This work falls into two research areas: (1) collective
classification [2,3,7,12,13,14] and (2) graph-based semi-supervised learning (SSL)
[15,16].
Previous work confirms our observation that the performance of collective
classification can suffer when labeled data is very sparse [3]. McDowell et al. [14]
demonstrate that “cautious” collective classification procedures produce better
classification performance than “aggressive” ones. They recommend only prop-
agating information about the top-k most confident predicted labels.
4 B. Gallagher and T. Eliassi-Rad
4 Experimental Design
4.1 Classifiers
ROC curve (AUC) for each fold and then obtain an average AUC score for each
classifier, AU CLD and AU CLI . We then set w as follows:
AU CLD
w= (2)
AU CLD + AU CLI
nLB+ICA uses the nLB classifier, but performs collective classification using
the ICA algorithm described in Section 4.2.
nLBLI+ICA uses the nLBLI classifier, but performs collective classification
using the ICA algorithm described in Section 4.2.
wvRN is the weighted-vote relational neighbor classifier [19,7]. It is a simple
non-learning classifier. Given a node i and a set of neighboring nodes, N , the
wvRN classifier calculates the probability of each class for node i as:
1 wi,j if Ci = c
P (Ci = c|N ) = (3)
Li 0 otherwise
j∈N
where wi,j is the number of links between nodes i and j and Lj is the number of
links connecting node i to labeled nodes. When node i has no labeled neighbors,
we use the prior probabilities observed in the training data.
wvRNLI combines the LI features with wvRN in the same way that nLBLI
does with nLB (i.e., using a weighted sum of wvRN and logLI).
wvRN+ICA uses the wvRN classifier, but performs collective classification
using the ICA algorithm described in Section 4.2.
wvRNLI+ICA uses wvRNLI, but performs collective classification using the
ICA algorithm described in Section 4.2.
GRF is the semi-supervised Gaussian Random Field approach of Zhu et al. [15].
We made one modification to accommodate disconnected graphs. Zhu computes
the graph Laplacian as L = D − cW , where c = 1. We set c = 0.9 to ensure
that L is diagonally dominant and thus invertible. We observed no substantial
impact on performance in connected graphs due to this change.
GRFLI combines the LI features with GRF as nLBLI does with nLB (i.e., using
a weighted sum of GRF and logLI). We also tried the approach of Zhu et al.
[15], where one attaches a “dongle” node to each unlabeled node and assigns
it a label using the external LI classifier. The transition probability from node
i to its dongle is η and all other transitions from i are discounted by 1 − η .
This approach did not yield any improvements. So, we use the weighted sum
approach (i.e., Equation 1) for consistency.
performs well on a variety of tasks, and (3) it tends to converge more quickly than
other approaches. We also performed experiments using relaxation labeling (RL)
[7]. Our results are consistent with previous research showing that the accuracy
of wvRN+RL is nearly identical to GRF, but GRF produces higher AUC values
[7]. We omit these results due to the similarity to GRF. For a comparison of
wvRN+RL and GRF on several of the same tasks used here, see Gallagher et
al. [17]. Overall, ICA slightly outperforms RL for the nLB classifier.
Several of our data sets have large amounts of unlabeled data since ground
truth is simply not available. In these cases, there are two reasonable approaches
to collective classification: (1) perform collective classification over the entire
graph and (2) perform collective classification over the core set of nodes only
(i.e., nodes with known labels).
In our experiments, attempting to perform collective classification over the
entire graph produced results that were often dramatically worse than the non-
collective base classifier. We hypothesize that this is due to an inadequate propa-
gation of known labels across vast areas of unlabeled nodes in the network. Note
that for some of our experiments, fewer than 1% of nodes are labeled. Other
researchers have also reported cases where collective classification hurts perfor-
mance due to a lack of labeled data [3,11]. We found that the second approach
(i.e., using a network of only the core nodes) outperformed the first approach in
almost all cases, despite disconnecting the network in some cases. Therefore, we
report results for the second approach only.
Each data set has a set of core nodes for which we know the true class labels.
Several data sets have additional nodes for which there is no ground truth avail-
able. Classifiers have access to the entire graph for both training and testing.
However, we hide labels for 10% − 90% of the core nodes. Classifiers are trained
on all labeled core nodes and evaluated on all unlabeled core nodes.
For each proportion labeled, we run 30 trials. For each trial, we choose a
class-stratified random sample containing 100 × (1.0 − proportionlabeled)% of
the core nodes as a test set and the remaining core nodes as a training set.
Note that a single node will necessarily appear in multiple test sets. However,
we carefully choose test sets to ensure that each node in a data set occurs in
the same number of test sets over the course of our experiments; and therefore,
carries the same weight in the overall evaluation. Labels are kept on training
nodes and removed from test nodes. We use identical train/test splits for each
classifier. For more on experimental methodologies for relational classification,
see Gallagher and Eliassi-Rad [20].
We use the area under the ROC curve (AUC) to compare classifiers because it
is more discriminating than accuracy. In particular, since most of our tasks have a
large class imbalance (see Section 4.4), accuracy cannot adequately differentiate
between classifiers.
8 B. Gallagher and T. Eliassi-Rad
We present results on four real-world data sets: political book purchases [21],
Enron emails [22], Reality Mining (RM) cellphone calls [23], and high energy
physics publications (HEP-TH) from arXiv [24]. Our five tasks are to identify
neutral political books, Enron executives, Reality Mining students, Reality Min-
ing study participants, and HEP-TH papers with the topic “Differential Geom-
etry.” Table 1 summarizes the prediction tasks. The Sample column describes
the method used to obtain a data sample for our experiments: use the entire set
(full ), use a time-slice (time), or sample a continuous subgraph via breadth-first
search (BFS ). The Task column indicates the class label we try to predict. The
|V |, |L|, and |E| columns indicate counts of total nodes, labeled nodes, and total
edges in each network. The P (+) column indicates the proportion of labeled
nodes that have the positive class label (e.g., 12% of the political books are
neutral). For Enron, Reality Mining students, and HEP-TH, we have labels for
only a subset of nodes (i.e., the “core” nodes) and can only train and test our
classifiers on these nodes. However, unlabeled nodes and their connections to
labeled nodes are exploited to calculate LI features of the labeled nodes.
5 Experimental Results
In this section, we discuss our results. We assess significance using paired t-tests
(p-values ≤ 0.05 are considered significant).1
Figures 2 and 3 show results for statistical relational learning and semi-supervised
learning approaches on all of our classification tasks. Supervised learning
approaches, like nLB, use labeled nodes as training data to build a dependency
model over neighboring class labels. The non-learning wvRN and GRF assume
1
It is an open issue whether the standard significance tests for comparing classifiers
(e.g., t-tests, Wilcoxon signed-rank) are applicable for within-network classification,
where there is typically some overlap in test sets across trials. It remains to be seen
whether the use of such tests produces a bias and the extent of any errors caused by
such a bias. This is an important area for future study that will potentially affect a
number of published results.
Leveraging Label-Independent Features 9
0.8 0.8
0.75 0.75
AUC
AUC
0.7 0.7
0.65 0.65
0.6 0.6
0.55 0.55
0.5 0.5
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
Enron Executives
0.9 0.9
0.8 0.8
AUC
UC
0.7 0.7
A
0.6 0.6
0.5 0.5
0.4 0.4
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
0.9 0.9
0.85 0.85
0.8 0.8
AUC
AUC
0.75 0.75
0.7 0.7
0.65 0.65
0.6 0.6
0.55 0.55
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
AUC
AUC
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
0.8 0.8
0.75 0.75
0.7 0.7
AUC
AUC
0.65 0.65
0.6 0.6
0.55 0.55
0.5 0.5
0.45 0.45
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
Proportion of Core Nodes Labeled Proportion of Core Nodes Labeled
Proportion of Core Nodes Labeled Proportion of Core Nodes Labeled
Fig. 2. Classification results for statistical relational learning approaches on our data
sets. For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty
of classification tasks, the y-axis scales are not consistent across tasks. However, for
a particular classification task, the y-axis scales are consistent across the algorithms
shown both in this figure and in Figure 3.
10 B. Gallagher and T. Eliassi-Rad
0.7 0.7
0.65
0.6
0.6
0.5
0.55
0.5 0.4
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
0.75 0.5
0.4
0.7
0.3
0.65
0.2
0.6 0.1
0.55 0
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
0.8
0.75
0.7
AUC
0.65
0.6
0.55
0.5
0.45
0.1 0.3 0.5 0.7 0.9
Fig. 3. Classification results for semi-supervised learning approaches on our data sets.
For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty
of classification tasks, the y-axis scales are not consistent across tasks. However, for
a particular classification task, the y-axis scales are consistent across the algorithms
shown both in this figure and in Figure 2.
that class labels of neighboring nodes tend to be the same (i.e., high label consis-
tency). GRF performs well on the Enron and RM student tasks, which have high
label consistency between neighbors. On the RM study task, where neighboring la-
bels are inversely correlated (i.e., low label consistency), wvRN and GRF perform
poorly, whereas nLB can learn the correct dependencies.
Leveraging Label-Independent Features 11
Lableing Sensitivity
(50% Labeled)
0.04
HepthArea
0.035
EnronTitle
Variance in AUC
0.03 PoliticalBook
0.025 RealityMiningInStudy
0.02 RealityMiningStudent
0.015
0.01
0.005
0
nLB nLB+ICA nLBLI GRF GRFLI
Classifier
Fig. 4. Sensitivity of classifiers to specific assignments of 50% known labels across data
sets
0.9
0.8 All
Node count
0.7
AUC
Link count
0.6 Betweenness
Clust. coef.
0.5
0.4
Enron HEP-TH P. Books RM Students RM Study
Data Set
Figure 6 shows the increase in AUC due to adding the specified feature to a
classifier that already has access to all other LI features. The y-axis is the AUC
2
Degree-based features are node (or neighbor) count and link (or edge) counts. Non-
degree-based features are betweenness and clustering coefficient.
Leveraging Label-Independent Features 13
0.3
0.25 Degree-based
Increase in AUC
0.2 Non-degree
0.15 Node count
0.1 Link count
0.05 Betweenness
0 Clust. coef.
-0.05
-0.1
Enron HEP-TH P. Books RM Students RM Study
Data Set
of a classifier that uses all LI features minus the AUC of a classifier that uses all
except the specified feature. This demonstrates the power of each feature when
combined with the others.
All features appear to be useful for some tasks. Clustering coefficient is the
least useful overall, improving AUC slightly on two tasks and degrading AUC
slightly on three. For all tasks, a combination of at least three features yields
the best results. Interestingly, features that perform poorly on their own can
be combined to produce good results. On the RM student task, node count, be-
tweenness, and clustering coefficient produce AUCs of 0.57, 0.49, and 0.48 alone,
respectively. When combined, these three produce an AUC of 0.78. Betweenness,
which performs worse than random (AUC < 0.5) on its own, provides a boost
of 0.32 AUC to a classifier using node count and clustering coefficient.
For most tasks, performance improves due to using all four LI features. On
Enron, however, clustering coefficient appears to mislead the classifier to the
point where it is better to use either node or link count individually than to
Enron Executives
0.8
0.75
0.7
Logistic
AUC
0.65
Rand Forest
0.6
0.55
0.5
0.1 0.3 0.5 0.7 0.9
Fig. 7. Comparison of logistic regression and random forest classifiers with all four LI
features
14 B. Gallagher and T. Eliassi-Rad
use all features. This is one case where we might benefit from a more selective
classifier. Figure 7 compares logistic regression with a random forest classifier
[25], both using the same four LI features. As expected, the random forest is
better able to make use of the informative features without being misled by the
uninformative ones.
To get a feel for why some LI features make better predictors than others, we
examine the distribution of each feature by class for each prediction task. Table 2
summarizes these feature distributions by their mean and standard deviation. In
general, we expect features that cleanly separate the classes to provide the most
predictive power. As mentioned previously, clustering coefficient appears to be
the least powerful feature overall for our set of prediction tasks. One possible
explanation for clustering coefficient’s general poor performance is that it does
not vary enough from node to node; therefore, it does not help to differentiate
among instances of different classes.
Table 2. Mean and standard deviation (SD) of feature values by class and data set.
The larger mean value for each feature (i.e., row) is shown in bold.
Data Set/Feature Mean (SD) for the ‘+’ Class Mean (SD) for the ‘-’ Class
Political Books Neutral Other
Node Count 5.8 (3.3) 8.8 (5.6)
Link Count 5.8 (3.3) 8.8 (5.6)
Betweenness 0.027 (0.030) 0.019 (0.029)
Clust. Coef. 0.486 (0.25) 0.489 (0.21)
Enron Executive Other
Node Count 22 (27) 9.6 (20)
Link Count 61 (100) 25 (66)
Betweenness 0.0013 (0.0037) 0.00069 (0.0025)
Clust. Coef. 0.91 (0.77) 1.75 (4.5)
RM Student Student Other
Node Count 19 (27) 22 (38)
Link Count 471 (774) 509 (745)
Betweenness 0.027 (0.050) 0.022 (0.056)
Clust. Coef. 15 (22) 8.0 (7.0)
RM Study In-study Out-of-study
Node Count 18 (30) 1.4 (2.8)
Link Count 418 (711) 30 (130)
Betweenness 0.022 (0.048) 0.00086 (0.022)
Clust. Coef. 10 (17) 5.8 (51)
HEP-TH Differential Geometry Other
Node Count 14 (9.0) 21 (26)
Link Count 14 (9.0) 21 (26)
Betweenness 0.000078 (0.00010) 0.0011 (0.0056)
Clust. Coef. 0.42 (0.19) 0.40 (0.23)
Leveraging Label-Independent Features 15
Feature Variability
12
10
Coefficient of Variation
8 PoliticalBook
EnronTitle
6 RealityMiningInStudy
RealityMiningPosition
4 HepthArea
0
t
nt
s
un
un
es
cie
Co
Co
nn
ffi
ge
or
ee
oe
hb
ed
tw
gC
ig
be
in
ne
er
st
clu
Fig. 8. Degree of variability for each LI feature on each prediction task
Figure 8 shows the degree of variability of each LI feature across the five
prediction tasks. To measure variability, we use the coefficient of variation, a
normalized measure of the dispersion of a probability distribution. The coefficient
of variation is defined as:
σ
cv(dist) = (4)
μ
where μ is the mean of the probability distribution dist and σ is the standard
deviation. A higher coefficient of variation indicates a feature with more varied
values across instances in the data set.
The variability of the clustering coefficient appears comparable to the degree
features (i.e., node and link count) (see Figure 8). We even observe that the de-
gree of variability of the clustering coefficient for the Enron task is higher than
the degree of variability for the neighbor count feature, even though neighbor
count provides much more predictive power (see Figure 5). So, clustering coeffi-
cient appears to have sufficient variability over the nodes in the graph. However,
it is possible that the clustering coefficient exhibits similar variability for nodes
of both classes; and thus, still fails to adequately distinguish between nodes of
different classes. Therefore, we wish to quantify the extent to which the feature
distributions can be separated from one another by class.
Figure 9 shows how well each LI feature separates the two classes for each
prediction task. We measure class separation by calculating the distance between
the empirical distributions of the LI feature values for each class. Specifically,
we use the Kolmogorov-Smirnov statistic (K-S) to measure the distance between
two empirical (cumulative) distribution functions:
Class Separation
0.8
Kolmogorov-Smirnov Distance
0.7
0.6
PoliticalBook
0.5 EnronTitle
0.4 RealityMiningInStudy
RealityMiningPosition
0.3
HepthArea
0.2
0.1
0
t
nt
t
s
un
un
es
cie
Co
Co
nn
ffi
ge
or
ee
oe
hb
ed
tw
gC
ig
be
in
ne
er
st
clu
Fig. 9. Degree of class separation for each LI feature on each prediction task
the values do not differ consistently based on class. Therefore, clustering coeffi-
cient has a hard time distinguishing between instances of different classes, and
exhibits poor predictive power overall. The exception is on the Reality Mining
study-participant task, where we observe a high K-S distance (Figure 9) and
a correspondingly high classification performance (Figure 5). In fact, the K-S
distances in Figure 9 generally correspond quite well to the classification perfor-
mance we observe in Figure 5.
6 Conclusion
We examined the utility of label-independent features in the context of within-
network classification. Our experiments revealed a number of interesting findings:
(1) LI features can make up for large amounts of missing class labels; (2) LI fea-
tures can provide information above and beyond that provided by class labels
alone; (3) the effectiveness of LI features is due, at least in part, to their consistency
and their stabilizing effect on network classifiers; (4) no single label-independent
feature dominates, and there is generally a benefit to combining a few diverse LI
features. In addition, we observed a benefit to combining LI features with label
propagation, although the benefit is not consistent across tasks.
Our findings suggest a number of interesting areas for future work. These
include:
– Combining attribute-based (LD) and structural-based (LI) features of a net-
work to create new informative features for node classification. For instance,
will the number of short paths to nodes of a certain label or the average path
length to such nodes improve classification performance?
– Exploring the relationship between attributes and network structure in time-
evolving networks, where links appear and disappear and attribute values
change over time. For example, in such a dynamic network, could we use
a time-series of LI feature values to predict the values of class labels at a
future point in time?
Acknowledgments
We would like to thank Luke McDowell for his insightful comments. This work
was performed under the auspices of the U.S. Department of Energy by Lawrence
Livermore National Laboratory under contract No. W-7405-ENG-48 and No.
DE-AC52-07NA27344 (LLNL-JRNL-411529).
18 B. Gallagher and T. Eliassi-Rad
References
1. Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational
data. In: Proceedings of the 18th Conference on Uncertainty in AI, pp. 485–492
(2002)
2. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI Magazine 29(3), 93–106 (2008)
3. Neville, J., Jensen, D.: Relational dependency networks. Journal of Machine Learn-
ing Research 8, 653–692 (2007)
4. Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning probabilistic models of
link structure. Journal of Machine Learning Research 3, 679–707 (2002)
5. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th Interna-
tional Conference on Machine Learning, pp. 496–503 (2003)
6. Neville, J., Jensen, D., Gallagher, B.: Simple estimators for relational bayesian
classifiers. In: Proceedings the 3rd IEEE International Conference on Data Mining,
pp. 609–612 (2003)
7. Macskassy, S., Provost, F.: Classification in networked data: A toolkit and a uni-
variate case study. Journal of Machine Learning Research 8, 935–983 (2007)
8. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees.
In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 625–630 (2003)
9. Perlich, C., Provost, F.: Aggregation-based feature invention and relational concept
classes. In: Proceedings of the 9th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 167–176 (2003)
10. Singh, L., Getoor, L., Licamele, L.: Pruning social networks using structural prop-
erties and descriptive attributes. In: Proceedings of the 5th IEEE International
Conference on Data Mining, pp. 773–776 (2005)
11. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group
models. In: Proceedings the 5th IEEE International Conference on Data Mining,
pp. 322–329 (2005)
12. Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hy-
perlinks. In: Proceedings of ACM SIGMOD International Conference on Manage-
ment of Data, pp. 307–318 (1998)
13. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational
classification. In: Proceedings of the 10th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 593–598 (2004)
14. McDowell, L., Gupta, K., Aha, D.: Cautious inference in collective classification. In:
Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pp. 596–601
(2007)
15. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using gaussian
fields and harmonic functions. In: Proceedings of the 20th International Conference
on Machine Learning, pp. 912–919 (2003)
16. Zhu, X.: Semi-supervised learning literature survey. Technical Report
CS-TR-1530, University of Wisconsin, Madison, WI (December 2007),
http://pages.cs.wisc.edu/~ jerryzhu/pub/ssl_survey.pdf
17. Gallagher, B., Tong, H., Eliassi-Rad, T., Faloutsos, C.: Using ghost edges for clas-
sification in sparsely labeled networks. In: Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 256–264
(2008)
Leveraging Label-Independent Features 19
18. Newman, M.: The structure and function of complex networks. SIAM Review 45,
167–256 (2003)
19. Macskassy, S., Provost, F.: A simple relational classifier. In: Notes of the 2nd
Workshop on Multi-relational Data Mining at KDD 2003 (2003)
20. Gallagher, B., Eliassi-Rad, T.: An examination of experimental methodology for
classifiers of relational data. In: Proceedngs of the 7th IEEE International Confer-
ence on Data Mining Workshops, pp. 411–416 (2007)
21. Krebs, V.: Books about U.S. politics (2004),
http://www.orgnet.com/divided2.html
22. Cohen, W.: Enron email data set, http://www.cs.cmu.edu/~ enron/
23. Eagle, N., Pentland, A.: Reality mining: sensing complex social systems. Journal
of Personal and Ubiquitous Computing 10(4), 255–268 (2006),
http://reality.media.mit.edu
24. Jensen, D.: Proximity HEP-TH database,
http://kdl.cs.umass.edu/data/hepth/hepth-info.html
25. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
Community Detection Using a Measure of
Global Influence
1 Introduction
Communities and social networks have long interested researchers [5,13]. How-
ever, one of the main problems faced by the early researchers was the difficulty
of collecting empirical data from human subjects [5]. The advent of the internet
and the growing popularity of online social networks changed that, giving re-
searchers access to huge amount of social interactions data. This, coupled with
the ever increasing computation speed, storage capacity and data mining capa-
bilities, led to the reemergence of interest in the social networks in general, and
community detection specifically.
Many existing community finding algorithms look for regions of the network
that are better connected internally and have fewer connections to nodes outside
the community [4]. Graph partitioning methods [7,27], for example, attempt to
minimize the number of edges between communities. Modularity maximization-
based methods, on the other hand, identify groups of nodes that have higher
than expected number of edges within them [22,21,24,23]. We believe, however,
that edges do not give the true measure of network connectivity. We generalize
the notion of network connectivity to be the number of paths, of any length, that
exist between two nodes (Section 2). We argue that this metric, called influence
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 20–35, 2010.
c Springer-Verlag Berlin Heidelberg 2010
Community Detection Using a Measure of Global Influence 21
by sociologists [13], because it measures the ability of one node to affect (e.g.,
send information to) another, gives a better measure of connectivity between
nodes. We use the influence metric to partition a (directed or undirected) network
into groups or communities by looking for regions of the network where nodes
have more influence over each other than over nodes outside the community. In
addition to discovering natural groups within a network, the influence metric
can also help identify the most influential nodes within the network, as well as
the “weak ties” who bridge different communities. We formalize our approach by
describing a general mathematical framework for representing network structure
(Section 3). We show that the metric used for detecting communities in random
walk models, modularity-based approaches, and influence-based modularity are
special cases of this general framework. We evaluate our approach (in Section 4)
on the standard data sets used in literature, and find performance at least as
good as that of the edge-based modularity algorithm.
n+1
n-hop path is Πi=1 αi . The total influence of b on c thus depends on the number
of (attenuated) channels between b and c, or the sum of all the weighted paths
from b to c. This definition of influence makes intuitive sense, because the greater
the number of paths between b and c, the more opportunities there are for b to
transmit messages to c or to affect c.
For ease computation we simplify this model by taking α1 = β and αi = α,
∀i = 1. β is called the direct attenuation factor and is the probability of trans-
mission of effect directly between adjacent nodes. α is the indirect attenuation
factor and is the probability of transmission of effect through intermediaries. If
α = β, i.e., the probability of transmission of effect through all links is the same,
then this index reduces to the metric used to find the Katz status score [13].
n /
The number of paths from i to j with n intermediaries, i j , is given
n+1 times
by An = A · A · · · A = A(n−1) · A. Adding weights to take into account the
attenuation of effect, we get the weighted total capacity of i to affect j as
i / j = β i 0 / j + · · · + βαn i n / j + · · ·. We represent this weighted
total capacity to influence by the influence matrix P :
As mentioned by Katz[13], the equation holds while α < 1/λ, where λ is the
largest characteristic root of A [6].
We use the influence matrix to help find community structure in a network.
We claim (without much theoretical or empirical support) that a community is
composed of individuals who have a greater capacity to influence others within
their community than outsiders. As a result, actions of community members will
tend to become correlated with time, whether by adopting a new fashion trend,
Community Detection Using a Measure of Global Influence 23
Hence the null model has the same number of vertices N as the original model,
and in it the expected influence of the entire network equals to the actual influ-
ence of the original network. We further restrict the choice of null model to that
24 R. Ghosh and K. Lerman
where the expected influence on a vertex j, Wjin , is equal to the actual influence
on the corresponding vertex in the real network.
Wjin = P¯ij = Pij (4)
i i
Similarly, we also assume that in the null model, the expected capacity of a
vertex i to influence others, Wiout , is equal to the actual capacity to influence of
the corresponding vertex in the real network
Wiout = P¯ij = Pij . (5)
j j
Therefore the adjacency matrix A, Aij = (q1ij ), shows if two simplices are
zero-near to one another in a 0-hop path. The product A2 =A × A gives the
value of q2ij such that A2 ij = q2ij , i.e., vertex i and vertex j when separated
by a one-hop path are q-near each other with q = q2ij − 1. In the same way,
A3 = A × A × A = (A3ij ) = (q3ij ) shows that vertices i and j connected by
a two-hop path are q3ij − 1 near from each other. We then take the length of
the sequence into account to calculate the expected q-nearness of one vertex to
another by taking the weighted average of q-nearness of varying length of paths.
The expected value of qij between two elements i and j, such that they are
expected to be qij − 1 near each other, with (qkij ) = Akij is:
This expected value can be used to find out how connected two vertices are to
each other, taking paths of all lengths into account. Note that Wi can be a scalar
or a vector.
This formulations allows us to generalize different network models for com-
munity detection and scoring like the random walk model [28,29,31], the Katz
model [13] of status score, and the influence-based model. In random walk mod-
els, a particle starts a random walk from node i. The particle iteratively tran-
sitions to its neighbors with probability proportional to the corresponding edge
weights. Also at each step, the particle returns to node i with some restart
probability (1 − c). The proximity score from node i to node j is defined as
the steady-state probability ri,j that the particle will be on node j [29]. These
models can be shown to be special cases of the formulations of the expected
q-nearness (without loss of generality we assume that T is an n × n matrix):
k−1 −(k−1)
1. If Wk =cn · D where c is a constant and D is an n × n matrix with
Dij = j=1 Aij if i = j and 0 otherwise; then, the expected q-nearness score
reduces to proximity score in random walk model [28,29].
i
2. If Wi = Πj=1 αj , where the scalar αj is the attenuation factor of a (j − 1)-th
hop in a (i−1) hop path, then the expected q-nearness reduces to metric used
to find the influence score and represented by the influence matrix. For ease
of computation of the influence matrix, we have taken α1 = β and αi = α
∀i = 1. As stated before, α < 1/λ where λ is the largest characteristic root of
A. Gershgorins Circle Theorem (1931) gives the simple sufficient condition
α < 1/maxi (Dii ).
3. When β = α, this in turn reduces to the metric used to find the Katz status
score [13] with α as the attenuation factor.
4. When α1 = 1 and α2 = . . . = αn = . . . = 0, the expected q-nearness is the
q-nearness of the 0-hop path which is metric used to calculate similarity in
edge-based modularity approaches [22].
In summary, the capacity to influence is a measure of the expected q-nearness
between vertices. Liben-Nowell and Kleinberg [20] have shown that Katz measure
is the most effective measure for link prediction task. The influence score, which
Community Detection Using a Measure of Global Influence 27
4 Evaluation
We applied influence-based community finding method to small networks studied
previously in literature, as well as the friendship network extracted from the
social photosharing site Flickr. On all the data sets we studied, the performance
of the influence-based modularity optimization algorithm was at least as good as
that of the edge-based modularity (α = 0 case). In several cases, the influence-
based approach led to purer groups.
Fig. 2. Zachary’s karate club data. Circles and squares represent the two actual fac-
tions, while colors stand for discovered communities as the strength of ties increases:
(a) α = 0 (b) 0 < α < 0.14 (c) 0.14 ≤ α ≤ 0.29.
28 R. Ghosh and K. Lerman
We also ran our approach on the US College football data from Girvan et al. [10].1
The network represents the schedule of Division 1 games for the 2000 season
where the vertices represent teams and the edges represent the regular season
game between the two teams they connect. The teams are divided into “confer-
ences” (or communities) containing 8 to 12 teams each. Games are more frequent
between members of the same conference than members of different conferences.
Inter-conference games, however, are not uniformly distributed, with teams that
are geographically closer likely to play more games with one another than teams
separated by geographic distances. However, some conferences have teams play-
ing nearly as many games against teams in other conferences as teams within
their own conference. This leads to the intuition, that conferences may not be
the natural communities, but the natural communities may actually be bigger
in size than conferences, with teams playing as many games against others in
the same conferences being put into the same community.
Fig. 3. The graph showing the purity of communities predicted with different values
of α and β in the (a) college football and (b) political books data sets. We see that
purity increases with α, and is independent of β. When α = 0, the method reduces to
eigenvector based modularity maximization method postulated by Newman [23].
From the set of users in each topic, we identified four (eight for the wildlife
topic) who were interested in the topics we identified: i.e., wildlife for tiger and
beetle query terms, portraiture for the newborn query, and technology for the
apple query. We studied each user’s profile to confirm that the user was indeed
interested in that topic. Specifically, we looked at group membership and user’s
most common tags. Thus, groups such as “Big Cats”, “Zoo”, “The Wildlife
Photography”, etc. pointed to user’s interest in the wildlife topic. In addition
to group membership, tags that users attached to their images could also help
identify their interests. For example, users who used tags nature and macro were
probably interested wildlife rather than technology. Similarly, users interested in
human, rather than animal, portraiture tagged their images with baby and family.
We used the Flickr API to retrieve the contacts of each of the users we identified,
as well as their contacts’ contacts. We labeled users by the topic through which
they were discovered. In other words, users who uploaded one of the 500 most
interesting images retrieved by the query tiger, were labeled wildlife, whether or
not they were interested in wildlife photography. The contacts and contacts’s
contacts of the four users within this set identified as being interested in wildlife
photography were also labeled wildlife. Although we did not verify that all the
labeled users were indeed interested in the topic, we use these soft labels to
evaluate the discovered communities.
Once we retrieved the social networks of target set of users, we reduced it to
an undirected network containing mutual contacts only. In other words, every
link in the network between two nodes, say A and B, implies that A lists B
as contact and vice versa. This resulted in a network of 5747 users. Of these,
1620 users were labeled technology, 1337 and 2790 users were labeled portraiture
and wildlife respectively. We ran our community finding algorithm for different
values of α on this data set. For α = 0, we found four groups, while for higher
values of α (α < 0.01), we found three groups. Figure 4 shows composition
of the discovered groups in terms of soft labels. Group 1 is composed mainly of
technology users, group 2 mainly wildlife users, and group 3 is almost exclusively
portraiture. The fourth group found at α = 0.0 has 932 members, of which 497
are labeled wildlife, 242 technology, and 193 members portraiture. Except for the
portraiture group (group 3), groups become purer as α increases.
5 Related Research
Fig. 4. Composition of groups discovered in the Flickr social network for different
values of α
32 R. Ghosh and K. Lerman
the motifs. The method we propose, on the other hand, imposes no such limit
on proximity. On the contrary, it considers the correlation between nodes in a
more global sense. The measure of global correlation evaluated using the influ-
ence metric would be equal to the weighted average of correlations when motifs
of different sizes are taken. The influence matrix enables the calculation of this
complex term in a quick and efficient manner.
Resolution limit is one of the main limitations of the original modularity de-
tection approach [8]. It can account for the comment by Leskovec et al. [19]
that they “observe tight but almost trivial communities at very small scales, the
best possible communities gradually ‘blend in’ with rest of the network and thus
become less ‘community-like’.” However, that study is based on the hypothesis
that communities have “more and/or better-connected ‘internal edges’ connect-
ing members of the set than ‘cut edges’ connecting to the rest of the world.”
Hence, like most graph partitioning and modularity-based approaches to com-
munity detection, their process depends on the local property of connectivity
of nodes to neighbors via edges and is not dependent on the structure of the
network on the whole. Therefore, it does not take into account the characteris-
tics of node types, that is ‘who’ are the nodes that a node is connected to and
how influential these nodes are. In their paper on motif-based community detec-
tion, Arenas et al.[1] state that the extended quality functions for-motif based
modularity also obey the principle of the resolution limit. But this limit is now
motif-dependent and then several resolution of substructures can be achieved by
changing the motif. However, it would be difficult to verify which resolution of
substructures is closest to natural communities. In influence-based modularity,
on the other hand, the resolution limit would depend on the probability of trans-
mission of the effect between nodes, i.e., the strength of ties. The probability of
transmission of effect can indeed be calculated from the graph, by say observing
the dynamics of spread of idea within a graph at different times.
As stated before, Liben-Nowell and Kleinberg [20] have shown that Katz mea-
sure is the most effective measure for the link prediction task, better than hitting
time, PageRank [26] and its variants. Thus we use influence score, which is a
generalization of the Katz score, to detect communities and compute rankings
of individuals.
Recently researchers have used probabilistic models, e.g., mixture models, for
community discovery. These models can probabilistically assign a node to more
than one community, as it has been observed “objects can exhibit several dis-
tinct identities in their relational patterns” [15]. This indeed may be true, but
whether the nodes in the network is to be divided into distinct communities or
probabilities with which each node belongs to community is to be discovered,
really depends on the specific application. By this, we mean that if the appli-
cation we are interested in is finding the natural communities say in the karate
club data, and if we use a probabilistic method (say [15]), we would be assigning
the nodes into groups into which their probability of belonging is the highest,
and the communities thus formed do not necessarily portray the division of the
network into natural communities observed.
Community Detection Using a Measure of Global Influence 33
Acknowledgements
This research is based on work supported in part by the National Science Foun-
dation under Award Nos. IIS-0535182, BCS-0527725 and IIS-0413321.
References
1. Arenas, A., Fernandez, A., Fortunato, S., Gomez, S.: Motif-based communities in
complex networks. Mathematical Systems Theory 41, 224001 (2008)
2. Atkin, R.: From cohomology in physics to q-connectivity in social science. Inter-
national Journal of Man-Machines Studies 4, 341–362 (1972)
3. Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z., Wagner,
D.: On modularity clustering. IEEE Trans. on Knowl. and Data Eng. 20(2), 172–
188 (2008)
34 R. Ghosh and K. Lerman
27. Pothen, A., Simon, H., Liou, K.P.: Partitioning sparse matrices with eigenvectors
of graphs. SIAM J. Matrix Anal. Appl. 11, 430–452 (1990)
28. Tong, H., Faloutsos, C., Pan, J.: Fast random walk with restart and its applications.
In: Sixth International Conference on Data Mining, ICDM 2006, pp. 613–622 (2006)
29. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on time-
evolving bipartite graphs. In: SDM, pp. 704–715. SIAM, Philadelphia (2008)
30. Zachary, W.W.: An information ow model for conict and ssion in small groups.
Journal of Anthropological Research 33, 452–473 (1977)
31. Zhou, H.: Network landscape from a brownian particles perspective. Physical Re-
view E 67 (2003)
Communication Dynamics of Blog Networks
1 Introduction
The structure of large social networks, such as the WWW, the Internet, and the
Blogosphere, has been the focus of intense research during the last decade (see
[1], [7], [8], [12], [17], [19], [20], [21], [22]. One of the main foci of this research
has been the development of dynamic models of network creation ([2], [11], [22],
[18]) which incorporates two fundamental elements: network growth, with nodes
arriving one at a time; and some form of preferential attachment in which an
arriving node is more likely to attach itself to a more prominent existing node
than a less prominent one (the rich get richer).
Once a network has grown and stabilized in size, how does it evolve? Such
an evolution is governed by the communication dynamics of the network: links
being broken and formed as social groups form, evolve and disappear. The com-
munication dynamics of these networks have been studied much less, partially
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 36–54, 2010.
c Springer-Verlag Berlin Heidelberg 2010
Communication Dynamics of Blog Networks 37
because the typical networks studied (the WWW, the Internet, collaboration
networks) mainly exhibit growth dynamics and not communication dynamics.
Clearly, as a network matures, the growth (addition of new users) becomes a
minor ingredient of the total change (see Figure 1). Further, links in a socially
dynamic network such as the Blogosphere should not be interpreted as static.
The posts made by a blogger a week ago may not be reflective of his/her current
interests and social groups. In fact, blog networks display extreme communica-
tion dynamics. Over the 20 week period shown in Figure 1, in a typical week,
510,000 pairs of bloggers communicated via blog comments. Out of those about
380,000 are between pairs of bloggers who did not communicate the week before,
i.e. over 70% of the communications are new. What models adequately describe
the dynamics of the communications in such networks which have more or less
stabilized in terms of growth?
0.9
vertex growth
0.8 new edges
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
51 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Fig. 1. Edge and vertex dynamics. Clearly the rate of growth is decreasing however
the fraction of new edges which appear in a week remains approximately constant at
over 70%.
To begin to address this question, one must first develop methods for testing
the validity of a model. In such an environment of extreme stochastic dynamics,
one cannot hope to replicate the dynamics of the individual communications; this
explains our focus on the evolution of interesting macroscopic properties of the
communication dynamics. Particularly interesting ones are those which are time
invariant. We refer to such properties as stable statistics. As we demonstrate,
even in such an active environment, certain statistics are remarkably stable. For
example: the power-law coefficient for the in-degree distribution, the clustering
coefficient, and the size of the giant component (see Table 1).
2 Clusters
The notion of a social community is crucial to our model of a Blog network. The
underlying idea of our model is that every user selects the nodes to visit (to leave
a comment) from the set of nodes that belong to a relatively small “area” of a
node. Our experiments with different definitions of the local area of the node
show that the best approximation to the observed statistics is achieved if the
area is taken as the union of clusters containing a given node. Our definition of
network clusters is borrowed from [4], [5], [6] with an important specification of
the notion of the density of a set of nodes in a network.
Definition. Given a graph G(V.E) let function D, called the density, be defined
on the set of all subsets of V . Then, a set C ⊆ V is called a cluster if it is locally
maximal w.r.t. D in the following sense: for every vertex x ∈ C (resp. x ∈ C),
removing x from C (resp. adding x to C) creates a set whose density is smaller
than D(C).
The idea of the definition matches the common understanding of a social
community as a set of members that forge more communication links within the
40 M. Goldberg et al.
set than that with the outside the set. The function D is not specified by the
definition, but its precise formulation is crucial in “catching” the nature of social
communities. The density function considered in [3] is as follows:
win
D(C) = , (1)
win + wout
where win is the number of edges xy with x, y ∈ C and wout is the number of
edges xy with either x ∈ C & y ∈ C or x ∈ C & y ∈ C (to allow for directed
graphs). The main deficiency of the definition of a cluster as a computational rep-
resentation of a social community is that it is easy to find examples of networks
that permits very large and loosely connected clusters, that intuitively are not
representing any community. The idea of our modification of 1 is to introduce an
additional parameter which represents the edge probability in the set
win 2win
D(C) = +λ , (2)
win + wout |C|(|C| − 1)
where the parameter λ depends on the specific network under the consideration,
and is supposed to be selected by the researcher. For our experiments, we selected
λ = 0.125.
3 Data
Fig. 2. Blogograph generation example. Vertices are placed for every blogger who
posted or commented, the edges are placed from the author of the comment to the
author of the post (the blog owner). Parallel edges and loops are not allowed.
Table 1. Statistics for observed blogograph: order of the graph (|V |), graph size (|E|),
fraction of vertices that are part of giant component (GC size), clustering coefficient
(C), average separation (d), power law exponent (α)
week |V | |E| GC C d α
49 155,615 530,160 95.88% 0.0639 5.333 2.63
50 156,026 532,189 95.91% 0.0644 5.327 2.66
51 155,093 527,364 95.62% 0.0635 5.316 2.65
52 151,559 516,483 95.62% 0.0635 5.316 2.71
1 118,979 327,356 93.55% 0.0573 5.777 2.92
2 142,478 444,457 95.14% 0.0587 5.392 2.68
3 159,436 559,506 96.16% 0.0629 5.268 2.68
4 158,429 550,436 95.60% 0.0631 5.224 2.67
5 156,144 534,917 95.49% 0.0627 5.293 2.72
6 156,301 526,194 95.70% 0.0615 5.338 2.72
7 154,846 523,235 95.44% 0.0622 5.337 2.69
8 156,064 528,363 95.59% 0.0609 5.320 2.69
9 156,362 524,441 95.58% 0.0602 5.377 2.68
10 154,820 523,304 95.48% 0.0593 5.368 2.68
11 155,267 516,280 95.13% 0.0600 5.356 2.68
12 156,872 514,269 95.20% 0.0590 5.367 2.63
13 155,338 510,070 95.42% 0.0601 5.342 2.71
14 155,099 506,892 95.19% 0.0607 5.309 2.73
15 153,440 504,850 95.32% 0.0601 5.303 2.73
16 154,012 512,094 95.34% 0.0599 5.298 2.60
17 151,427 503,802 95.30% 0.0611 5.288 2.75
our screen-scraping program visits the page of a post after it has been pub-
lished for two weeks and collects the comment threads. We then generate the
communication graph.
We have focused on the Russian section of LiveJournal as it is reasonably
but not excessively large (currently close to 580,000 bloggers out of the total 15
million) and almost self contained. We identify Russian blogs by the presence of
Cyrillic characters in the posts. Technically this also captures the posts in other
42 M. Goldberg et al.
450000
400000
350000
300000
250000
200000
150000
01/19 02/02 02/16 03/01 03/15 03/29
30000
25000
20000
15000
10000
5000
0
Mon Tue Wed Thu Fri Sat Sun
Fig. 3. Number of comments per day that appeared between January 14, 2008 and
April 6th, 2008 and number of comments per hour during a week between March
24th, 2008 and March 30th. The periodic drops in the number of comments per day
correspond to Saturdays and Sundays.
languages with a Cyrillic alphabet, but we found that the vast majority of the
posts are in Russian. The network of Russian bloggers is very active. On average,
32% of all posts contain Cyrillic characters. LiveJournal blogging has become
a cultural phenomenon in Russia. Discussion threads often contain intense and
interesting discussions which encourage communication through commenting.
Our work is based on data collected between December 2007 and April 2008.
The basic statistics about the size of obtained data are presented in Table 1. A
simpler set of statistics on a smaller set of observed data is presented in [16].
4 Stable Statistics
The observed communication graph has interesting properties. The graph is
very dynamic on the level of nodes and edges but has stable aggregated statis-
tics. About 75% of active bloggers will also be active in the next week. Further,
about 28% of edges that existed in a week will also be found in the next week. A
large part of the network changes weekly, but a significant part is preserved. The
stability of various statistics of the blogograph is presented in Table 1. The giant
component (GC) is the largest connected (not necessarily strongly connected)
Communication Dynamics of Blog Networks 43
1 0.5
0.4
0.1
0.3
0.01
0.2
0.001
0.1
0.0001 0
0 5 10 15 20 0 2 4 6 8 10 12 14 16
1
observed distribution
α = 2.70
0.1
0.01
P(x = k)
0.001
0.0001
1e-05
1e-06
1 10 100 1000 10000
k
Fig. 5. Average in-degree distribution in the blogograph observed over 21 weeks from
Dec. 03, 2007 and Apr. 28, 2008
Table 2. 19 weeks of communities from the Russian section of Live Journal. |C| is
the number of communities, δavg is the average density, and ep is the average edge
probability within the communities.
week |C| avg size δavg ep week |C| avg size δavg ep
51 19631 10.0183 0.456677 0.253212 9 20136 9.95401 0.473693 0.252607
52 19520 10.0615 0.453763 0.252101 10 19670 9.71678 0.45449 0.255778
1 23187 10.0915 0.473676 0.248130 11 20212 9.66842 0.456908 0.256098
2 20970 9.98412 0.458161 0.251843 12 20415 9.70331 0.461118 0.255819
3 17986 9.86184 0.448757 0.254203 13 20030 9.78058 0.455676 0.254681
4 18510 9.71891 0.453578 0.257481 14 19893 9.74936 0.455234 0.254384
5 18808 9.88255 0.455823 0.254305 15 19392 9.73407 0.455365 0.254687
6 19318 9.79242 0.454656 0.253901 16 19113 9.74787 0.454531 0.254721
7 19343 9.80381 0.456364 0.255236 17 18737 9.72333 0.455775 0.255658
8 19796 9.83113 0.453577 0.252818
presented distribution (the variation at each point is less then 2%). As the figure
suggests, the majority of communicating vertices were less or at 3 hops away in
the network on the previous time cycle. This provides evidence for the strong
locality of communication that occurs in the observed network.
In addition to looking for stability in structural statistics, it is also useful
to examine stable community behavior. Using the notion of clusters discussed
previously in this text, we find locally optimal communities using each edge in
the graph as a seed. Once all seeds are optimized, duplicates and clusters of size
2 are removed. Statistics of the remaining clusters are showing in Table 2. A size
vs density plot is also given in Figure 6. The general shape and scale of this plot
is replicated across all observed weeks.
Communication Dynamics of Blog Networks 45
0.8
Density 0.6
0.4
0.2
0
0 10 20 30 40 50 60 70 80 90 100
Size
Fig. 6. A size vs density plot for week 5 of the observed data. The x-axis is a measure
of the community size while the y-axis shows the value of δ. Each point represents a
community.
5 Modeling
As previously stated, networks with such strong communication dynamics have
not been well modeled. Much of the previous work aims to replicate the growth
phase of a network’s life-cycle, ignoring the evolution of communication once
the network’s size stabilizes. Models which replicate these dynamics would be
useful as a sand-box within which social hypotheses on information diffusion, the
emergence of leaders, and group formation and dissolution can be tested. To be
considered useful, any model should create a set of graphs whose statistics come
as close as possible to mirroring the statistics of the observed data presented
previously.
Before delving into the creation of a new model, let us first consider the
modification of a previously existing one. The simplest method of producing
a set of evolving graphs is to grow each week’s graph using a known network
growth algorithm. Vertices can be assigned an out-degree based on the observed
data and connected to each other via preferential attachment for each of the
weeks. If done correctly, this would yield a set of graphs whose in-degree and
out-degree distributions come close to matching the observed data’s power law
distributions.
Despite this initial positive result, examining the rest of the statistics demon-
strates that the model is insufficient. Relational statistics such as edge stability,
edge history, and clustering coefficient all significantly depart from the observed
values, which we will show in detail further in the paper. This model’s inability do
recreate these statistics is expected, since it generates each graph independently.
Below, we propose a model which performs its edge connection within some
locality in an effort to more closely mirror the edge stability, edge history, clus-
tering coefficient, and community based statistics of the network.
46 M. Goldberg et al.
distribution pit , where pit (v) specifies the probability for node vi to attach to node
v for v ∈ V . The probability distribution pit may depend on Ait and Gt−1 (e.g.
higher
degree nodes may get higher probabilities). In particular, we assume that
i
v∈At pt (v) = 1, which corresponds to the assumption that every nodes expends
i
all its communication energy within its local area. Since we do not allow parallel
edges, if kti > |Ait |, it is not possible for node vi to expend all its communication
energy within its local area Ait . In this case, we assume that kti − Ait edges are
attached uniformly at random to nodes outside its area and the remaining edges
are attached within its area. The precise algorithm for distributing the edges
given the probability distribution pit is given in Algorithm 2.
The evolution model is illustrated in Figure 7. In more detail, the model
first obtains the out-degrees (which are exogenously specified). From Gt−1 , it
computes Ait and pit for all nodes vi ∈ V . For all nodes, it then attaches edges
according to Algorithm 2. This entire process is iterated for a user specified
number of time steps. The entire process is given in Algorithm 1. The inputs to
the model are the procedure OutDeg which specifies the out-degrees (assumed
48 M. Goldberg et al.
Assign Form
out−degrees new graph
Place edges
w/in area
to be exogenous), the procedure Area which identifies the local areas of the
nodes given the previous graph, and the procedure Prob which specifies the
attachment probabilities according to the attachment model. We will now discuss
some approaches to defining the areas and the attachment probabilities. When
testing our model, we will also need the procedure for obtaining the out-degrees,
which will be discussed in Section 7.
Given the local area Ait of the node vi at time t, the attachment model describes
the probability pit+1 (vj ) of occurrence of an edge (vi , vj ) at time t + 1 for vj ∈ V .
We propose the following attachment modes:
1
pit (vj ) = (3)
|Ait |
and for vj ∈
/ Ait , pit (vk ) = 0.
2. Preferential Attachment. Node vi attaches to any vj ∈ Ait with proba-
bility
The combination of the locality model and attachment model specifies the evo-
lution model that, given the out-degree distribution, will produce a series of
graphs that represent the blogograph at different time periods.
observed weeks. Figure 8 compares the observed degree distribution to the ones
generated by some of the best area/attachment combinations. As defined in
Section 4, edge history conveys information about how close the end points of
the observed edge were in the previous time cycle and therefore measures the
significance of locality in the communications. Figure 8 compares the observed
edge history with the edge histories produces by the best models.
First, we consider the model with global area where vertices are aware of and
can connect to any other vertex in the network.
In the case of a uniform attachment, the resulting model is very similar to
the Erdös-Rényi model. The in-degree distribution and other parameters gen-
erated by such model are predictably very different from the power law degree
distribution in the observed graph.
Global area with preferential attachment strictly proportional to the in-degree
of the vertices in the graph of the previous iteration results in a formation of
a power house - small set of vertices with very high in-degree that attract all
of out-degree. This effect is caused directly by preferential attachment; since
vertices with zero in-degree will never be attached to, any vertex that receives
no incoming vertices at some iteration will not receive any incoming vertices in
any of the following iterations. Clearly, the graph with small set of vertices that
attract all of the in-degree is very different from the observed graph.
The combination of global area and preferential attachment proportional to the
out-degree of the vertices in the graph of the previous iteration produced results
that were more similar to the observed network than the other global models, but
the results were also significantly worse compared to models with other area defini-
tions (k-neighborhood and union of clusters). Since this model allows for random
selection of the end points of edges from the whole graph, the edge history (Figure
8) is very different from the one observed in the real-life network.
1 1
observed Observed
0.95 Global + P.A. (out) 0.9 Global + P.A. (out)
Clusters + P.A. (in) Clusters + P.A. (in)
0.9 3-Neighb + P.A. (out) 0.8 3-Neighb + P.A. (out)
0.85 0.7
0.8
P(d>=k)
0.6
0.75 0.5
0.7
0.4
0.65
0.3
0.6
0.2
0.55
0.1
0.5
1 10 100 1000 10000 0
k 2 4 6 8 10 12 14 16
Fig. 8. In-degree and edge history distributions for various models and observed net-
work
Models with this area definition were the only ones that produced non-trivial
edge stability defined as the likelihood of a repetition of a recently observed
edge. To evaluate this stability, we consider the number of edges that appeared
more then once in 21 iterations of the model after stabilization. Models with
global and 3-neighborhood area definitions that did not result in the formation
of power houses produced a set of graphs such that less then 1% of edges that
appeared more the once in all of the graphs. Models with an area defined by the
union of clusters produced a set of graphs in which, on average, 14% of edges
appear more then once in 21 iterations. In particular, a combination of this area
definition with preferential attachment proportional to the in-degree produced
a sequence of graphs with 18% of edges appear more than once, while in the
observed network, 40% of edges (see Figure 4) appear more then once during 21
observed weeks.
After considering all of the parameters of the models, we determined the com-
bination of an area defined by the union of clusters with preferential attachment
proportional to the in-degrees of the vertices to be the best model to describe
the dynamics of communication in the observed network.
8 Conclusion
We have presented a set of statistics which display strong stability even for
a dynamic network such as the blogosphere. Our list of stable statistics is not
Communication Dynamics of Blog Networks 53
References
8. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stat, R.,
Tomkins, A., Wiener, J.: Graph structure in the web. Computer Networks 33(1-6),
309–320 (2000)
9. Chung, F., Lu, L.: Connected components in random graphs with given degree
sequence. Annals of Combinatoreics 6, 125–1456 (2002)
10. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical
data (2007)
11. Doreian, P., Stokman, E.F.N.: Evolution of social networks. Gordon and Breach
(1997)
12. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the in-
ternet topology. In: SIGCOMM, pp. 251–252 (1999)
13. Gkantsidi, C., Mihail, M., Zegura, E.: The markov chain simulatiopn methods for
generating connected power law random graphs. In: Proc. of ALENEX 2003, pp.
16–50 (2003)
14. Goh, K.-I., Eom, Y.-H., Jeong, H., Kahng, B., Kim, D.: Structure and evolution
of online social relationships: Heterogeneity in unrestricted discussions. Physical
Review E (Statistical, Nonlinear, and Soft Matter Physics) 73(6), 66123 (2006)
15. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: A locality model for
the evolution of blog networks. In: IEEE Information and Security Informatics, ISI
(2008)
16. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: Stable statistics of
the blogograph. In: Interdisciplinary Studies in Information Privacy and Security
(2008)
17. Kleinberg, J.M., Lawrence, S.: The structure of the web. Science, 1849–1850 (2001)
18. Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Sci-
ence 311, 88–90 (2006)
19. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of
blogospace. Communications of the ACM 33(1-6), 309–320 (2004)
20. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social net-
works. In: KDD 2006 (2006)
21. Newman, M.: The structure and function of complex networks. SIAM Review 45(2),
167–256 (2003)
22. Newman, M., Barabási, A.-L., Watts, D.: The structure and dynamics of networks.
Princeton University Press, Princeton (2006)
23. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl.
Acad. Sci. USA 98, 404 (2001)
24. Stauffer, A.O., Barbosa, V.C.: A study of the edge-switching markov chain method
for the generation of random graphs. arxiv: cs. DM/0512105 (2006)
25. West, D.B.: Introduction to graph theory. Prentice Hall, Upper Saddle River (2003)
Finding Spread Blockers in Dynamic Networks
1 Introduction
How can we stop a process spreading through a social network? This problem
has applications to diverse areas such as preventing or inhibiting the spread of
diseases [7, 26, 40], computer viruses1 [8, 22], rumors, and undesirable fads or
risky behaviors [23, 24, 37, 38]. A common approach to spread inhibition is to
(No last name). Work supported in part by the Fulbright fellowship.
Work performed in part while being a visiting student at the University of New
Mexico.
Work supported in part by the NSF grant IIS-0705822 and NSF CAREER Award
0747369.
†
Work supported in part by the NSF grant IIS-0705822, NSF CAREER Award
0644058, and an AFO MURI award.
1
In particular, we are concerned with computer malware that spreads through social
networks, such as email viruses and worms, cell-phone viruses, and other related
malware such as the recent MySpace worm.
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 55–76, 2010.
c Springer-Verlag Berlin Heidelberg 2010
56 Habiba et al.
identify key individuals whose removal will most dampen the spread. In the
context of the spread of a disease, it is a question of finding individuals to be
quarantined, inoculated, or vaccinated so that the disease is prevented from
becoming an epidemic. We call this set of key individuals the blockers of the
spreading process.
There has been significant previous work related to studying and controlling
the spread of dynamic processes in a network [9,10,11,16,18,22,23,26,35,40,43,
44, 46, 47, 51, 54, 57, 59, 60, 67]. Unfortunately, these results have three properties
rendering them ineffective for identifying good blockers in large networks. First,
many proposed algorithms focus on a slightly different objective: they aim to
identify nodes that will be most effective in starting the spread of a process rather
than blocking it [44, 47]; or alternatively, nodes that would be most effective in
sensing that a process has started to spread, and where the process initiated
[9, 10, 11]. In this paper, we are focused specifically on identifying those nodes
that are good blockers. Second, algorithms proposed in previous work all require
computationally expensive calculations of some global properties over the entire
network, or rely on expensive, repeated stochastic simulations of the spread of a
dynamic process. In this paper, we present heuristics that identify good blockers
quickly, based only on local information.
Finally, perhaps the most critical problem in previous work is the omission of
the dynamic nature of social interactions. The very nature of a spreading process
implies an explicit time axis [52]. For example, the flow of information through a
social network depends on who starts out with the information when, and which
individuals are in contact at the starting point with the information carrier [43].
In this paper, we consider explicitly dynamic networks, defined in Section 3.1.
In these networks, we study the social interactions over a finite period of time,
measured in discrete time steps.
The main contributions of this paper are summarized below.
2 Related Work
Dynamic phenomena such as opinions, information, fads, behavior, and disease
spread through a network by contacts and interactions among the entities of the
network. Such spreading phenomena have been studied in a number of domains
including epidemiology [22,26,40,51,54,57,59], diffusion of technological innova-
tions and adoption of new products [7, 16, 18, 23, 24, 38, 35, 44, 46, 60, 67], voting,
strikes, rumors [36, 37, 53, 68], as well as spread of contaminants in distribution
networks [8, 9, 10, 11, 46] and numerous others.
One of the fundamental questions about dynamic processes is: Which indi-
viduals, if removed from the network, would block the spread of such process?
Several previous results have addressed the problem of identifying such indi-
viduals [26, 40, 43]. Eubank et al. [26] experimentally show that global graph
theoretic measures like expansion factor and overlap ratio are good indicators
for devising vaccination strategies in static networks. Cohen et al. [21] propose
another immunization strategy based on the aggregate network model. In partic-
ular, they propose an efficient method of picking high degree nodes in a network
to immunize, thus inhibiting the spread of disease. Kempe et al. [43] show that
a variant of the blocker identification problem is NP-hard. While these problems
and suggested approaches are similar to finding good blockers in a network, un-
fortunately, there are critical differences that make these results inappropriate
for our formulation. First of all our objective is to minimize the expected extent
of spread in a network. We do not make any assumption about the source of the
spread. Second, almost all the above methods simplify the spreading process by
ignoring the time ordering of interactions.
There has also been significant related work on the problem of determining
where to place a small number of detectors in a network so as to minimize the time
required to detect the spread of a dynamic process, and, ideally, also the location
at which the spread began. Berger-Wolf et al. [9] give algorithms for the problem
of minimizing the size of the infected population before an outbreak is detected.
Berry et al. [10,11] give algorithms to strategically place sensors in utility distribu-
tion networks to minimize worst case time until detection. In [47], Leskovec et al.
demonstrate that many objectives of the detection problem exhibit the property
of submodularity. They exploit this fact to develop efficient and elegant algorithms
for placing detectors in a network. While the detection problem is related to the
58 Habiba et al.
3 Definitions
Populations of individuals interacting over time are often represented as net-
works, or graphs, where the nodes correspond to individuals and a pairwise in-
teraction is represented as an edge between the corresponding individuals. The
idea of representing societies as networks of interacting individuals dates back to
Lewin’s earlier work of group behavior [48]. Typically, there is a single network
representing all interactions that have happened during the entire observation
period. We call this representation an aggregate network (Section 3.2). In this
paper we use an explicitly dynamic network representation (Section 3.1) that
takes the history of interactions into account.
(e)
Fig. 1. Example of several dynamic networks that have the same unweighted aggregate
network representation. Figures (a)–(d) show a dynamic networks of three individuals
interacting over four time steps. The solid line edges represent interactions among
individuals in a time step. Empty circles are individuals observed during a time step.
While at any given time step some individuals may be unobserved, the particular
example shows all the individuals being observed at all time steps. Figure (e) shows an
unweighted aggregate network that has the same interactions as every dynamic network
in the example. Figures (a)–(c) have the multiplicity two of each edge while figure (d)
has the multiplicity four for every edge in the aggregate representation.
Using this aggregate network model, the structure and properties of many social
networks have been studied from different perspectives [6,13,12,15,41]. However,
as we have mentioned, this and other similar models do not explicitly consider
the temporal aspect of the network.
Thus, finding the best blockers in the network is equivalent to finding the (set
of) individuals whose removal from the network minimizes the expected extent
of spread.
kBl(G) = Spread(G) − min Spread(G \ X). (5)
X⊆V,|X|=k
Finding Spread Blockers in Dynamic Networks 61
|E|
D(G) = |V | . (6)
2
In the example in Figure 1, the density of the aggregate network in (e) is 2/3.
However, the dynamic density of the networks (a), (b), and (c) is 1/3 while the
dynamic density of (d) is 2/3.
Path between a pair of nodes u, v is a sequence of distinct nodes u = v 1 , v 2 , . . . , v p
= v with every consecutive pair of nodes connected by an edge(v i , v i+1 ) ∈ E.
Temporal Path between u, v is a time respecting path in a dynamic network.
It is a sequence of nodes u = v 1 , . . . , v p = v where each (v i , v i+1 ) ∈ E is an
edge in Et for some t. Also, for any i, j such that i + 1 < j, if v i ∈ Vt and
v j ∈ Vs then t < s. The length of a temporal path is the number of time
steps it spans. Note, that this definition allows only immediate neighborhood
of a node to be reached within one time step.
In the example in Figure 1, while there is a path from c to a in the aggregate net-
work (e), there is no temporal path from c to a in the dynamic network (b). All
the temporal paths from a to c in the dynamic networks (a)–(d) are of length 2.
Diameter is the length of the longest shortest path. In dynamic networks, it is
the length of the longest shortest temporal path.
Note, that here we consider a friend to be “new” if it was not a friend in the
previous time step. The definition is easily extended to incorporate a longer
term memory of friendship. The dynamic degree captures the gregariousness
of an individual, an important quality from a spreading perspective.
2
Here denotes the symmetric difference of the sets.
Finding Spread Blockers in Dynamic Networks 63
Dynamic Average Degree is the average over all time steps of the interac-
tions of an individuals in each time step:
1
AV G-DEG(u) = DEG(ut ). (10)
T
1≤t≤T
The dynamic degree, unlike its standard aggregate version, carries the informa-
tion of the timing of interactions and is sensitive to the order, concurrency and
delay among the interactions. For example, in Figure 1, the degree of the node b
in the aggregate network (e) is 2. However, its dynamic degree in (a) is 3, in (b)
is 1, and in (c) and (d) is 0. The dynamic average degree, on the other hand does
not change when the order of interactions in a dynamic network is perturbed.
It just tells us the average connectivity of an individual in the observed time
period. In all the dynamic networks (a)–(c) the average dynamic degree of b is
1, while in (d) it is 2.
CF (ut )
CCT (u) = . (13)
|N (ut )|(|N (ut )| − 1)
0≤t<T
Consider the example in Figure 2. The clustering coefficient of all three nodes in
the static network is the same and equals to 1. However, the situation in the two
dynamic networks is completely different. In network (a) the dynamic clustering
coefficient of nodes a and c is 0 while that of the node b is 1. In network (b),
on the other hand, the dynamic clustering coefficient of all the nodes is 0 since
when b meets a and c they still don’t know each other.
Apart from the measures defined above we also compute PageRank [14] of
nodes.
Fig. 2. Example of two dynamic networks (a) and (b) that have the same aggregate
network representation (c)
Finding Spread Blockers in Dynamic Networks 65
others. However in the dynamic network model as defined above, the active
individuals never become latent during the spreading process. For this paper, we
only consider the progressive case in which an individual converts from inactive
to active but never reverses (no recovery in the epidemiological model). It is
a particularly important case in the context of identifying blockers since the
blocking action is typically done before any recovery.
4 Experimental Setup
For each measure and for each dynamic network dataset, we follow the following
steps:
We compare the power of each measure to serve as a proxy indicator for the
blocking ability of an individual based on the number of individuals that had
to be removed in the ordering imposed by that measure in order to achieve this
reduction to 10%.
5 Datasets
We now describe the datasets used in the experiments.
Grevy’s: Populations of Grevy’s zebras (Equus grevyi) were observed by bi-
ologists [29, 30, 61, 63] over a period of June–August 2002 in the Laikipia
region of Kenya. Predetermined census loops were driven on a regular basis
(approximately twice per week) and individuals were identified by unique
stripe patterns. Upon sighting, an individual’s GPS location was taken. In
the resulting dynamic network, each node represents an individual animal
and two animals are interacting if their GPS locations are the same. The
dataset contains 28 individuals interacting over a period of 44 time steps.
68 Habiba et al.
The following table provides a summary of the statistics of the networks we use
in our experiments.
4
Available with a full description at http://www.cs.cmu.edu/~enron/
5
Available with a full description at http://kdl.cs.umass.edu/data/msn/msn-info.
html
Finding Spread Blockers in Dynamic Networks 69
V E T D DT d dT p pT r rT
Grevy’s 28 779 44 0.30 0.52 4 36 1.84 4.81 518 432
Onagers 29 402 82 0.36 0.24 3 74 1.66 7.51 756 617
DBLP 1374 2262 38 0.002 0.09 15 37 5.54 5.12 900070 58146
Enron 147 7406 701 0.04 0.14 6 618 2.66 461.24 19620 16474
MIT 96 67107 2940 0.68 0.18 2 315 1.32 4.21 9120 9114
UMass 20 2664 693 0.72 0.35 2 8 1.28 3.71 380 374
For each of the datasets we have evaluated all the structural network measures to
determine how effectively they serve to identify good blockers. To recap, we rank
nodes by each measure and remove them from the network in that order. After
removing each node we measure the expected extent of spread in the network
using simulations. We compare the effect of each measure’s ordering to that of
a random ordering and the brute force best blockers ordering. Figure 3 shows
results for two datasets, Onagers and Enron, that are representative of the results
on all the datasets. The results for the other datasets are omitted due to space
limitations. For all the plots, the x-axis is the number of individuals removed
and the y-axis shows the corresponding extent of spread. The lower the extent
of spread after removal, the better is the blocking capacity of the individuals
removed. Thus, the curves lower on the plot correspond to measures that serve
as better indicators of individuals’ blocking power.
Fig. 3. [Best viewed in color.] Comparison of the reduction of extent of spread after
removal of nodes ranked by various measures in Onagers and Enron datasets
70 Habiba et al.
The comparison of all the measures showed that four measures performed con-
sistently well as blocker indicators: degree in aggregate network, the number of
edges in the immediate aggregate neighborhood (local density), dynamic average
degree, and dynamic clustering coefficient. This is good news from the practical
point of view of designing epidemic response strategies since all the measures are
simple, local, and easily scalable. Figure 4 shows the results of the comparison of
those four best measures, as well as the best possible and random orderings, for
all the datasets. Surprisingly, while the local density and the dynamic cluster-
ing coefficients seem to be good indicators, the aggregate clustering coefficient
turned out to be the worst, often performing worse than a random ordering. Be-
tweenness and closeness measures performed inconsistently. PageRank did not
perform well in the only dataset with directed interactions (Enron)6 .
As seen in Figure 4, the ease of blocking the spread depends very much on the
structure of the dynamic network. In the two bluetooth datasets, MIT Reality
Mining and UMass, all orderings, including the random, performed similarly.
Those are well connected networks, as evident by the large difference between
the dynamic diameter and the average shortest temporal path. The only way to
reduce the extent of spread to below 10% of the original population seems to be
trivially removing nearly 90% of the individual population. On the other hand,
Enron and DBLP, the sparsely connected datasets, show the opposite trend of
being easily blockable by a good ranking measure.
When rankings of different measures result in a similar blocking ability we ask
whether it is due to the fact that the measures rank individuals in a similar way
Fig. 4. [Best viewed in color.] Comparison of the reduction of the extent of spread
after removal of nodes ranked by the best 4 measures. The x-axis shows the number of
individuals removed and the y-axis shows the average spread size after the removal of
individuals.
6
On undirected graphs, PageRank is equivalent to degree in aggregate network.
Finding Spread Blockers in Dynamic Networks 71
Table 2. Average rank difference between the rankings induced by every two of the
best four measures
AveDEG vs DynCC
AvgDEG vs ENN1
AvgDEG vs DEG
DynCC vs ENN1
Best vs AvgDEG
DynCC vs DEG
Best vs DynCC
DEG vs ENN1
Best vs ENN1
Best vs DEG
Dataset
Grevy’s 4.5 4.64 4.79 3.86 4.5 2.86 2.64 5.57 5 1.14
Onagers 3.59 4.48 3.31 3.52 4.69 4.14 2.97 6.07 6 2
DBLP - - - - 430.76 71.3 78.49 434.21 428.25 77.22
Enron 21.95 50.01 27.29 21.02 46.37 22.56 21.93 44.35 44.95 25.32
MIT - - - - 4.88 14.4 14.48 14.33 14.27 2.25
UMass 4.6 4.6 3 2.7 0 3.3 3.1 3.3 3.1 1
and, thus, identify the same set of good blockers or, rather, different measures
identify different sets of good blockers. To answer this question, we compared
the sets of the top ranked blockers identified by the four best measures as well
as the best possible ordering. We compute the average rank difference between
the sets of individuals ranked top by every two measures. Table 2 shows the
pairwise difference in ranks. In general, there is little correspondence between the
rankings imposed by various measures. The only strong relationship, as expected,
is between the number of edges in the neighborhood of a node and its degree in
the aggregate network.
We further explore the difference in the sets of the top ranked individuals
by computing the size of the common intersection of all the top sets ranked
by the four measures and the best possible ranking. We use the size of the set
determined by the best possible ordering as the reference set size for all measures.
Table 3 shows the size of the common intersection for all datasets. Again, we see
Table 3. The size of the common intersection of all the top sets ranked by the four
measures and the best ranking. Set size is the size of the sets determined by the best
blocking ordering. The size of the intersection is the number of the individuals in the
intersection and the Intersection fraction is the fraction of the intersection of the size
of the set.
a strong effect of the structure of the network. The MIT Reality Mining and the
UMass datasets have the largest intersection size. On the other hand, in DBLP
the four measures produced very different top ranked sets, yet all four measures
were extremely good indicators of the blockers. In other networks, while there
are some individuals that are clearly good blockers according to all measures,
there is a significant difference among the measures. Overall, these results lead to
two future directions: 1) investigating the effect of the overall network structure
on the “blockability” of the network; and 2) designing consensus techniques that
combine rankings by various measures into a possible better list of blockers.
We have also compared the sets of nodes ranked at the top by various mea-
sures. Interestingly, in the networks in which it was difficult to block dynamic
spread, all the measures resulted in very similar rankings of individuals. In con-
trast, in the networks where the removal of a small set of individuals was sufficient
to reduce the spread significantly, the best measures gave very different rankings
of individuals. Thus, there seems to be a dichotomy in the real-world networks
we studied. On one hand, there are dense networks (e.g. MIT Reality Mining
and UMass datasets) in which it is inherently challenging to block a spreading
process and all measures perform similarly badly. On the other hand, there are
sparse networks where it seems to be easy to stop the spread and there are many
ways to do it. In future work, we will investigate the specific global structural
attributes of a network that delineate this difference between networks for which
it is hard or easy to identify good blockers.
The comparison of the top ranked sets also shows that while there may be
some common nodes ranked high by all measures, there is a significant difference
among the measures. Yet, all the rankings perform comparably well. Thus, there
is a need to test a consensus approach that combines the sets ranked top by
various measures into one set of good candidate blockers. This is similar to
combining the top k lists returned as a web search result [27].
This paper focused on the practical approaches to identifying good blockers.
However, the theoretical structure of the problem is not well understood and
so far has defied good approximation algorithms. Recent developments in the
analysis of non-monotonic submodular functions [28, 64] may be applicable to
variants of the problem and may result in good approximation guarantees.
References
1. Adibi, J.: Enron email dataset, http://www.isi.edu/~adibi/Enron/Enron.htm
2. Anderson, R.M., May, R.M.: Infectious Diseases of Humans: Dynamics and Control.
Oxford University Press, Oxford (1992)
3. Anthonisse, J.: The rush in a graph. Mathematische Centrum, Amsterdam (1971)
4. Aspnes, J., Chang, K., Yampolskiy, A.: Inoculation strategies for victims of viruses
and the sum-of-squares partition problem. J. Comput. Syst. Sci. 72(6), 1077–1093
(2006)
5. Asur, S., Parthasarathy, S., Ucar, D.: An event-based framework of characterizing
the evolutionary behavior of interaction graphs. In: Proceedings of the Thirteenth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(2007)
6. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution
of the social network of scientific collaborations. Physica A: Statistical Mechanics
and its Applications 311(3-4), 590–614 (2002)
7. Berger, E.: Dynamic monopolies of constant size. J. Combin. Theory Series B 83,
191–200 (2001)
8. Berger, N., Borgs, C., Chayes, J.T., Saberi, A.: On the spread of viruses on the
internet. In: SODA 2005: Proceedings of the sixteenth annual ACM-SIAM sym-
posium on Discrete algorithms, Philadelphia, PA, USA, pp. 301–310. Society for
Industrial and Applied Mathematics (2005)
74 Habiba et al.
9. Berger-Wolf, T., Hart, W., Saia, J.: Discrete sensor placement problems in distri-
bution networks. Mathematical and Computer Modelling (2005)
10. Berry, J., Fleischer, L., Hart, W., Phillips, C., Watson, J.: Sensor placement in
municipal water networks. Journal of Water Resources Planning and Manage-
ment 131(3) (2005a)
11. Berry, J., Hart, W., Phillips, C., Uber, J.G., Watson, J.: Sensor placement in munic-
ipal water networks with temporal integer programming models. Journal of Water
Resources Planning and Management 132(4), 218–224 (2006)
12. Börner, K., Dall’Asta, L., Ke, W., Vespignani, A.: Studying the emerging global
brain: Analyzing and visualizing the impact of co-authorship teams. Complexity,
Special issue on Understanding Complex Systems 10(4), 57–67 (2005)
13. Börner, K., Maru, J., Goldstone, R.: The simultaneous evolution of author and
paper networks. PNAS 101(suppl. 1), 5266–5273 (2004)
14. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
In: WWW7: Proceedings of the 7th International Conference on World Wide Web
7, pp. 107–117. Elsevier Science Publishers B. V., Amsterdam (1998)
15. Broido, A., Claffy, K.: Internet topology: connectivity of IP graphs. In: Proceedings
of SPIE ITCom (2001)
16. Carley, K.: Communicating new ideas: The potential impact of information and
telecommunication technology. Technology in Society 18(2), 219–230 (1996)
17. Carreras, I., Miorandi, D., Canright, G., Engøo-Monsen, K.: Eigenvector central-
ity in highly partitioned mobile networks: Principles and applications. Studies in
Computational Intelligence (SCI) 69, 123–145 (2007)
18. Chen, L., Carley, K.: The impact of social networks in the propagation of computer
viruses and countermeasures. IEEE Trasactions on Systems, Man and Cybernetics
(forthcoming)
19. Chen, N.: On the approximability of influence in social networks. In: ACM-SIAM
Symposium on Discrete Algorithms (SODA), pp. 1029–1037 (2008)
20. Clauset, A., Eagle, N.: Persistence and periodicity in a dynamic proximity network
(unpublished manuscript)
21. Cohen, R., Havlin, S., ben Avraham, D.: Efficient immunization strategies for com-
puter networks and populations. Physical Review Letters (2003)
22. Dezsö, Z., Barabási, A.-L.: Halting viruses in scale-free networks. Physical Review
E 65(055103(R)) (2002)
23. Domingos, P.: Mining social networks for viral marketing. IEEE Intelligent Sys-
tems 20, 80–82 (2005)
24. Domingos, P., Richardson, M.: Mining the network value of customers. In: Seventh
International Conference on Knowledge Discovery and Data Mining (2001)
25. Eagle, N., Pentland, A.: Reality mining: Sensing complex social systems. Journal
of Personal and Ubiquitous Computing (2006)
26. Eubank, S., Guclu, H., Kumar, V., Marathe, M., Srinivasan, A., Toroczkai, Z.,
Wang, N.: Modelling disease outbreaks in realistic urban social networks. Na-
ture 429, 180–184 (2004) (supplement material)
27. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: SODA 2003: Proc.,
14th ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp.
28–36. Society for Industrial and Applied Mathematics (2003)
28. Feige, U., Mirrokni, V., Vondrák.: Maximizing non-monotone submodular func-
tions. In: Foundations of Computer Science, FOCS (2007)
Finding Spread Blockers in Dynamic Networks 75
29. Fischhoff, I.R., Sundaresan, S.R., Cordingley, J., Larkin, H.M., Sellier, M.-J.,
Rubenstein, D.I.: Social relationships and reproductive state influence leadership
roles in movements of plains zebra (Equus burchellii). Animal Behaviour 73(5),
825–831 (2007)
30. Fischhoff, I.R., Sundaresan, S.R., Cordingley, J., Rubenstein, D.I.: Habitat use and
movements of plains zebra (Equus burchelli) in response to predation danger from
lions. Behavioral Ecology 18(4), 725–729 (2007)
31. Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40,
35–41 (1977)
32. Freeman, L.C.: Centrality in social networks: I. conceptual clarification. Social
Networks 1, 215–239 (1979)
33. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proc. Natl. Acad. Sci. 99, 8271–8276 (2002)
34. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems
look at the underlying process of word-of-mouth. Marketing Letters 12(3), 211–
223 (2001)
35. Goldenberg, J., Libai, B., Muller, E.: Using complex systems analysis to advance
marketing theory development. Academy of Marketing Science Review (2001)
36. Granovetter, M.: The strength of weak ties. American J. Sociology 78(6), 1360–
1380 (1973)
37. Granovetter, M.: Threshold models of collective behavior. American J. Sociol-
ogy 83(6), 1420–1443 (1978)
38. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through
blogspace. In: WWW 2004: Proc. 13th Intl Conf on World Wide Web, pp. 491–501.
ACM Press, New York (2004)
39. Habiba, C.T., Berger-Wolf, T.Y.: Betweenness centrality in dynamic networks.
Technical Report 2007-19, DIMACS (2007)
40. Holme, P.: Efficient local strategies for vaccination and network attack. Europhys.
Lett. 68(6), 908–914 (2004)
41. Hopcroft, J., Khan, O., Kulis, B., Selman, B.: Natural communities in large linked
networks. In: Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and
Data Mining, pp. 541–546 (2003)
42. Jordán, F., Benedek, J., Podani, Z.: Quantifying positional importance in food
webs: A comparison of centrality indices. Ecological Modelling 205, 270–275 (2007)
43. Kempe, D., Kleinberg, J., Kumar, A.: Connectivity and inference problems for
temporal networks. J. Comput. Syst. Sci. 64(4), 820–842 (2002)
44. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and
Data Mining (2003)
45. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification re-
search. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML
2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004)
46. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing.
In: EC 2006: Proceedings of the 7th ACM conference on Electronic commerce, pp.
228–237. ACM Press, New York (2006)
47. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J.: Cost-effective
outbreak detection in networks. In: Proc. 13th ACM SIGKDD Intl. Conf. on Knowl-
edge Discovery and Data Mining (2007)
48. Lewin, K.: Principles of Topological Psychology. McGraw Hill, New York (1936)
49. Ley, M.: Digital bibliography & library project (DBLP) (December 2005); A digital
copy of the databse has been provided by the author, http://dblp.uni-trier.de/
76 Habiba et al.
50. Liljeros, F., Edling, C., Amaral, L.N.: Sexual networks: Implication for the trans-
mission of sexually transmitted infection. Microbes and Infection (2003)
51. May, R.M., Lloyd, A.L.: Infection dynamics on scale-free networks. Physical Review
E 64(066112) (2001)
52. Moody, J.: The importance of relationship timing for diffusion. Social Forces (2002)
53. Moreno, Y., Nekovee, M., Pacheco, A.F.: Dynamics of rumor spreading in
complex networks. Physical Review E (Statistical, Nonlinear, and Soft Matter
Physics) 69(6), 066130 (2004)
54. Morris, M.: Epidemiology and social networks:modeling structured diffusion. Soci-
ological Methods and Research 22(1), 99–126 (1993)
55. Mossel, E., Roch, S.: On the submodularity of influence in social networks. In: The
Annual ACM Symposium on Theory of Computing(STOC) (2007)
56. Newman, M.: The structure and function of complex networks. SIAM Review 45,
167–256 (2003)
57. Newman, M.E.: Spread of epidemic disease on networks. Physical Review
E 66(016128) (2002)
58. Newman, M.E.J.: Scientific collaboration networks. i. network construction and
fundamental results. Physical Review E 64, 016131 (2001)
59. Pastor-Satorras, R., Vespignani, A.: Epidemic spreading in scale-free networks.
Phys. Rev. Lett. 86(14), 3200–3203 (2001)
60. Rogers, E.M.: Diffusion of Innovations, 5th edn. Simon & Shuster, Inc., New York
(2003)
61. Rubenstein, D.I., Sundaresan, S., Fischhoff, I., Saltz, D.: Social networks in wild
asses: Comparing patterns and processes among populations. In: Stubbe, A.,
Kaczensky, P., Samjaa, R., Wesche, K., Stubbe, M. (eds.) Exploration into the
Biological Resources of Mongolia, vol. 10, pp. 159–176. Martin-Luther-University,
Halle-Wittenberg (2007)
62. Sabidussi, G.: The centrality index of a graph. Psychometrika 31, 581–603 (1966)
63. Sundaresan, S.R., Fischhoff, I.R., Dushoff, J., Rubenstein, D.I.: Network metrics
reveal differences in social organization between two fission-fusion species, Grevy’s
zebra and onager. Oecologia 151, 140–149 (2007)
64. Vredeveld, T., Lenstra, J.: On local search for the generalized graph coloring prob-
lem. Operations Research Letters 31, 28–34 (2003)
65. Watts, D.: A simple model of global cascades on random networks. PNAS 99,
5766–5771 (2002)
66. Watts, D., Strogatz, S.: Collective dynamics of small-world networks. Nature 393,
440–442 (1998)
67. Young, H.P.: Innovation diffusion and population heterogeneity, Working paper
(2006)
68. Zanette, D.H.: Dynamics of rumor propagation on small-world networks. Phys.
Rev. E 65(4), 041908 (2002)
Social Network Mining with Nonparametric
Relational Models
1 Introduction
Social network mining has gained in importance due to the growing availability
of data on novel social networks, e.g. citation networks (DBLP, Citeseer), SNS
websites (Facebook), and social media websites (Last.fm). Social networks usu-
ally consist of rich collections of objects, which are linked into complex networks.
Generally, social network data can be graphically represented as a sociogram as
illustrated in Fig. 1 (left). In this simple social network, there are persons, person
profiles (e.g., gender), and these persons are linked together via friendships. Some
interesting applications in social network mining include community discovery,
relationship prediction, social recommendation, etc.
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 77–96, 2010.
c Springer-Verlag Berlin Heidelberg 2010
78 Z. Xu et al.
Fig. 1. Left: A simple sociogram. Right: A probabilistic model for the sociogram. Each
edge is associated with a random variable that determines the state of the edge. The
directed arcs indicate direct probabilistic dependencies.
In the simple relational model of social network, the friendship is locally pre-
dicted by the profiles of the involved objects: whether a person is a friend of
another person is only dependent on the profiles of the two persons. Given that
the parameters are fixed, and given the parent attributes, all friendships are
independent of each other such that correlations between friendships, i.e., the
collaborative effect, cannot be taken into account. To solve this limitation, struc-
tural learning might be involved to obtain non-local dependencies but structural
learning in complex relational networks is considered a hard problem [9]. Non-
local dependencies can also be achieved by introducing for each person a hidden
variable as proposed in [24]. The state of the hidden variable represents unknown
attributes of the person, e.g. the particular habit of making friends with certain
Social Network Mining with Nonparametric Relational Models 79
persons. The hidden variable of a person is now the only parent of its profiles and
is one of the parents of the friendships in which the person potentially partici-
pates. Since the hidden variables are of central importance, this model is referred
to as the hidden relational model (HRM). In relational domains, different classes
of objects generally require a class-specific complexity in the hidden representa-
tion. Thus, it is sensible to work with a nonparametric method, Dirichlet process
(DP) mixture model, in which each object class can optimize its own representa-
tional complexity in a self-organized way. Conceptionally, the number of states
in the hidden variables in the HRM model becomes infinite. In practice, the
DP mixture sampling process only occupies a finite number of components. The
combination of the hidden relational model and the DP mixture model is the
infinite hidden relational model (IHRM) [24].
The IHRM model has been first presented in [24]. This paper is an extended
version of [25] and we explore social network modeling and analysis with IHRM
for community detection, link prediction, and product recommendation. We
present two inference methods for efficient inference: one is the blocked Gibbs
sampling with truncated stick-breaking (TSB) construction, the other is the
mean-field approximation with TSB. We perform empirical analysis on three so-
cial network datasets: the Sampson’s monastery data, the Bernard & Killworth
data, and the MovieLens data. The paper is organized as follows. In the next sec-
tion, we perform analysis of modeling complex social network data with IHRMs.
In Sec. 3 we describe a Gibbs sampling method and a mean-field approximation
for inference in the IHRM model. Sec. 4 gives the experimental analysis on so-
cial network data. We review some related work in Sec. 5. Before concluding, an
extension to IHRMs is discussed in Sec. 6.
2 Model Description
Based on the analysis in Sec. 1, we will give a detailed description of the IHRM
model for social network data. In this section, we first introduce the finite hidden
relational model (HRM), and then extend it to an infinite version (IHRM). In
addition, we provide a generative model describing how to generate data from
an IHRM model.
of the person 3, i.e. predict the relationship R2,3 . The probability is computed
on the evidence about: (1) the attributes of the immediately related persons,
i.e. G2 and G3 , (2) the known relationships associated with the persons of inter-
est, i.e. the friendships R2,1 and R2,4 about the person 2, and the friendships R1,3
and R3,4 about the person 3, (3) high-order information transferred via hidden
variables, e.g. the information about G1 and G4 propagated via Z1 and Z4 . If
the attributes of persons are informative, those will determine the hidden states
of the persons, therefore dominate the computation of predictive probability of
relationship R2,3 . Conversely, if the attributes of persons are weak, then hidden
state of a person might be determined by his relationships to other persons and
the hidden states of those persons. By introducing hidden variables, information
can globally distribute in the ground network defined by the relational structure.
This reduces the need for extensive structural learning, which is particularly dif-
ficult in relational models due to the huge number of potential parents. Note that
a similar propagation of information can be observed in hidden Markov models
used in speech recognition or in the hidden Markov random fields used in image
analysis [26]. In fact, the HRM can be viewed as a directed generalization of
both for relational data.
Additionally, the HRM provides a cluster analysis of relational data. The state
of the hidden variable of an object corresponds to its cluster assignment. This
can be regarded as a generalization of co-clustering model [13]. The HRM can
be applied to domains with multiple classes of objects and multiple classes of
relationships. Furthermore, relationships can be of arbitrary order, i.e. the HRM
is not constraint to only binary and unary relationships[24]. Also note that the
sociogram is quite related to the resource description framework (RDF) graph
used as the basic data model in the semantic web [3] and the entity relationship
graph from database design.
We now complete the model by introducing the variables and parameters in
Fig. 2. There is a hidden variable Zi for each person. The state of Zi speci-
fies the cluster of the person i. Let K denote the number of clusters. Z fol-
lows
a multinomial distribution with parameter vector π = (π1 , . . . , πK ) (πk >
0, k πk = 1), which specifies the probability of a person belonging to a cluster,
Social Network Mining with Nonparametric Relational Models 81
Since hidden variables play a key role in the HRM model, we would expect that
the HRM model might require a flexible number of states for the hidden vari-
ables. Consider again the sociogram example. With little information about past
friendships, all persons might look the same; with more information available,
one might discover certain clusters in persons (different habits of making friends);
but with an increasing number of known friendships, clusters might show increas-
ingly detailed structure ultimately indicating that everyone is an individual. It
thus makes sense to permit an arbitrary number of clusters by using a Dirichlet
process mixture model. This permits the model to decide itself about the optimal
number of clusters and to adopt the optimal number with increasing data. For
our discussion it suffices to say that we obtain an infinite HRM by simply letting
the number of clusters approach infinity, K → ∞. Although from a theoretical
point of view there are indeed an infinite number of components, a sampling
procedure would only occupy a finite number of components.
The graphical representations of the IHRM and HRM models are identical,
shown as Fig. 2. However, the definitions of variables and parameters are dif-
ferent. For example, hidden variables Z of persons have infinite states, and
thus parameter vector π is infinite-dimensional. The parameter is not generated
from a Dirichlet prior, but from a stick breaking construction Stick(·|α0 ) with a
82 Z. Xu et al.
Symbol Description
C number of object classes
B number of relationship classes
N c number of objects in a class c
αc0 concentration parameter of an object class c
eci an object indexed by i in a class c
Aci an attribute of an object eci
c
θk mixture component indexed by a hidden state k in an object class c
Gc0 base distribution of an object class c
βc parameters of a base distribution Gc0
b
Ri,j relationship of class b between objects i, j
φbk, correlation mixture component indexed by hidden states k for ci and for
cj , where ci and cj are object classes involved in a relationship class b
Gb0 base distribution of a relationship class b
βb parameters of a base distribution Gb0
2. For each relationship class b between two object classes ci and cj , draw
φbk, ∼ Gb0 i.i.d. with component indices k for ci and for cj .
3. For each object eci in a class c,
(a) Draw cluster assignment Zic ∼ Mult(·|π c );
(b) Draw object attributes Aci ∼ P (·|θc , Zic ).
c c
4. For eci i and ej j with a relationship of class b, draw Ri,j
b
∼ P (·|φb , Zici , Zj j ).
The basic property of SBC is that: the distributions of the parameters (θkc
and φb ) are sampled, e.g., the distribution of θkc can be represented as Gc =
∞ k,c c
k=1 πk δθk , where δθk is a distribution with a point mass on θk . In terms of
c c
this property, SBC can sample objects independently; thus it might be efficient
when a large domain is involved.
3 Inference
The key inferential problem in the IHRM model is computing posterior of unob-
servable variables given the data, i.e. P ({π c, Θc , Z c }c , {Φb }b |D, {αc0 , Gc0 }c , {Gb0 }b ).
Unfortunately, the computation of the joint posterior is analytically intractable,
thus we consider approximate inference methods to solve the problem.
where Aci and Ri,j
b
denotes the known attributes and relationships about
c (t)
i. cj denotes the class of the object j , Zj j denotes hidden variable
of j at the last iteration t. Intuitively, the equation represents to what
extent the cluster k agrees with the data Dic about the object i.
84 Z. Xu et al.
1
W +w
b
≈ P (Rnew,j |D, {Z c(t) , π c(t) , Θc(t) }C
c=1 , {Φ
b(t) B
}b=1 )
W t=w+1
c
1
W +w K
b b(t) c(t) c(t) b b (t)
∝ P (Rnew,j |φk, ) πk P (Acnew |θk ) P (Rnew,j |φ
k, ),
W t=w+1 k=1 b j
where and denote the cluster assignments of the objects j and j , respectively.
The equation is quite intuitive. The prediction is a weighted sum of predictions
b(t)
b
P (Rnew,j |φk, ) over all clusters. The weight of each cluster is the product of
the last three terms, which represents to what extent this cluster agrees with
the known data (attributes and relationships) about the new object. Since the
blocked method also samples parameters, the computation is straightforward.
ξ denote a set of unknown quantities, and D denote the known data. The KL
divergence between q(ξ) and P (ξ|D) is defined as:
KL(q(ξ)||P (ξ|D)) = q(ξ) log q(ξ) − q(ξ) log P (ξ|D). (4)
ξ ξ
The smaller the divergence, the better is the fit between the true and the ap-
proximate distributions. The probabilistic inference problem (i.e. computing the
posterior) now becomes: to minimize the KL divergence with respect to the
variational distribution. In practice, the minimization of the KL divergence is
formulated as the maximization of the lower bound of the log-likelihood:
log P (D) ≥ q(ξ) log P (D, ξ) − q(ξ) log q(ξ). (5)
ξ ξ
c c ⎡ B K ci K cj ⎤
C
N
K
q(Zic |ηic ) q(Vkc |λck )q(θkc |τkc ) ⎣ q(φbk, |ρbk, )⎦ , (6)
c i k b k
where ci and cj denote the object classes involved in the relationship class b.
k and denote the cluster indexes for ci and cj . Variational parameters in-
clude {ηic , λck , τkc , ρbk, }. q(Zic |ηic ) is a multinomial distribution with parameters
ηic . Note, that there is one ηic for each object eci . q(Vkc |λck ) is a Beta distribution.
q(θkc |τkc ) and q(φbk, |ρbk, ) are respectively with the same forms as Gc0 and Gb0 .
We substitute Equ. 6 into Equ. 5 and optimize the lower bound with a coor-
dinate ascent algorithm, which generates the following equations to iteratively
update the variational parameters until convergence:
c c c
N
N
K
λck,1 = 1 + c
ηi,k , λck,2 = αc0 + c
ηi,k , (7)
i=1 i=1 k =k+1
c c
c
N
N
τk,1 = β1c + c
ηi,k T(Aci ), c
τk,2 = β2c + c
ηi,k , (8)
i=1 i=1
ci c
ci c
ρbk,,1 = β1b + ηi,k j
ηj, b
T(Ri,j ), ρbk,,2 = β2b + ηi,k j
ηj, , (9)
i,j i,j
c
k−1
ηi,k ∝ exp Eq [log Vkc ] + Eq [log(1 − Vkc )] + Eq [log P (Aci |θkc )]
k =1
cj b
+ ηj, Eq [log P (Ri,j |φbk, )] , (10)
b j
86 Z. Xu et al.
where λck denotes parameters of Beta distribution q(Vkc |λck ), λck is a two-
dimensional vector λck = (λck,1 , λck,2 ). τkc denotes parameters of the exponential
family distribution q(θkc |τkc ). We decompose τkc such that τk,1 c
contains the first
dim(θk ) components and τk,2 is a scalar. Similarly, β1 contain the first dim(θkc )
c c c
components and β2c is a scalar. ρbk,,1 , ρbk,,2 , β1b and β2b are defined equivalently.
T(Aci ) and T(Ri,j b
) denote the sufficient statistics of the exponential family dis-
tributions P (Ai |θk ) and P (Ri,j
c c b
|φbk, ), respectively.
It is clear that Equ. 7 and Equ. 8 correspond to the updates for variational
parameters of object class c, and they follow equations in [6]. Equ. 9 represents
the updates of variational parameters for relationships, which is computed on the
involved objects. The most interesting updates are Equ. 10, where the posteriors
of object cluster-assignments are coupled together. These essentially connect the
c
DPs together. Intuitively, in Equ. 10 the posterior updates for ηi,k include a
prior term (first two expectations), the likelihood term about object attributes
(third expectation), and the likelihood terms about relationships (last term). To
calculate the last term we need to sum over all the relationships of the object
cj
eci weighted by ηj, that is variational expectation about cluster-assignment of
the other object involved in the relationship.
Once the procedure reaches stationarity, we obtain the optimized variational
parameters, by which we can approximate the predictive distribution
b
P (Rnew,j |D, {αc0 , Gc0 }C
c=1 , {G0 }b=1 ) of the relationship Rnew,j between a new
b B b
cj
object enew and a known object ej with q(Rnew,j |D, λ, η, τ, ρ) proportional to:
c b
c c
j
K K
b c c
q(Rnew,j |ρbk, )q(Zj j = |ηj j )q(Znew
c
= k|λc )
k
c c
× q(Acnew |τkc ) q(Zj j = |ηj j )q(Rnew,j
b b
|ρk, ). (11)
b j
4 Experimental Analysis
The first experiment is performed on the Sampson’s monastery dataset [19] for
community discovery. Sampson surveyed social relationships between 18 monks
in an isolated American monastery. The relationships between monks included
esteem/disesteem, like/dislike, positive influence/negative influence, praise and
blame. Breiger et al. [7] summarized these relationships and yielded a single
Social Network Mining with Nonparametric Relational Models 87
10
12
14
16
18
2 4 6 8 10 12 14 16 18
Fig. 3. Left: The matrix displaying interactions between Monks. Middle: A sociogram
for three monks. Right: The IHRM model for the monastery sociogram.
Cluster Members
1 Peter, Bonaventure, Berthold, Ambrose, Louis, Victor, Ramuald
2 John, Gregory, Mark, Winfrid, Hugh, Boniface, Albert
3 Basil, Elias, Simplicius
4 Amand
88 Z. Xu et al.
In the second experiment, we perform link analysis with IHRM on the Bernard
& Killworth data [5]. Bernard and Killworth collected several data sets on hu-
man interactions in bounded groups. In each study they obtained measures of
social interactions among all actors, and ranking data based on the subjects’
memory of those interactions. Our experiments are based on three datasets. The
BKFRAT data is about interactions among students living in a fraternity at
a West Virginia college. All subjects had been residents in the fraternity from
three months to three years. The data consists of rankings made by the subjects
of how frequently they interacted with other subjects in the observation week.
The BKOFF data concern interactions in a small business office. Observations
were made as the observer patrolled a fixed route through the office every fifteen
minutes during two four-day periods. The data contains rankings of interaction
frequency as recalled by employees over the two-week period. The BKTEC data
is about interactions in a technical research group at a West Virginia university.
It contains the personal rankings of the remembered frequency of interactions.
In the experiments, we randomly select 50% (60%, 70%, 80%) interactions as
known and predict the left ones. The experiments are repeated 20 times for each
setting. The average prediction accuracy is reported in Table 3. We compare our
model with the Pearson collaborative filtering method. It shows that the IHRM
model provides better performance on all the three datasets. Fig. 4 illustrates
the link prediction results on the BKOFF dataset with 70% known links. The
predicted interaction matrix is quite similar with the real one.
Table 3. Link prediction on the Bernard & Killworth data with the IHRM
Fig. 4. Left: Interaction matrix on the BKOFF data. Right: The predicted one, which
is quite similar with the real situation.
Social Network Mining with Nonparametric Relational Models 89
Fig. 5. Top: A sociogram for movie recommendation system, illustrated with 2 users
and 3 movies. For readability, only two attributes (user’s occupation and movie’s genre)
show in the figure. Bottom: The IHRM model for the sociogram.
90 Z. Xu et al.
Fig. 6. Left: The traces of the number of user clusters for the runs of two Gibbs
samplers. Middle: The trace of the change of the variational parameter η u for mean
field method. Right: The sizes of the largest user clusters of the three inference methods.
The prediction results are shown in Table 4. All IHRM inference methods
under consideration achieve comparably good performance; the best results are
achieved by the Gibbs samplers. To verify the performance of the IHRM, we
also implement Pearson-coefficient collaborative filtering (CF) method [18] and
a SVD-based CF method [20]. It is clear that the IHRM outperforms the tra-
ditional CF methods, especially when there are few known ratings for the test
users. The main advantage of the IHRM is that it can exploit attribute infor-
mation. If the information is removed, the performance of the IHRM becomes
close to the performance of the SVD approach. For example, after ignoring all
attribute information, the TSBMF generates the predictive results: 64.55% for
Given5, 65.45% for Given10, 65.90% for Given15, and 66.79% for Given20.
The IHRM provides cluster assignments for all objects involved, in our case
for the users and the movies. The rows #C.u and #C.m in Table 4 denote
the number of clusters for users and movies, respectively. The Gibbs samplers
converge to 46-60 clusters for the users and 44-78 clusters for the movies. The
mean field solution have a tendency to converge to a smaller number of clusters,
depending on the value of α0 . Further analysis shows that the clustering results
of the methods are actually similar. First, the sizes of most clusters generated
by the Gibbs samplers are very small, e.g., there are 72% (75.47%) user clusters
with less than 5 members in CRPGS (TSBGS). Fig. 6 (right) shows the sizes
of the 20 largest user clusters of the 3 methods. Intuitively, the Gibbs samplers
tend to assign the outliers to new clusters. Second, we compute the rand index
(0-1) of the clustering results of the methods, the values are 0.8071 between
CRPGS and TSBMF, 0.8221 between TSBGS and TSBMF, which demonstrates
the similarity of the clustering results.
Fig. 7 gives the movies with highest posterior probability in the 4 largest
clusters generated from TSBMF. In cluster 1 most movies are very new and
popular (the data set was collected from September 1997 through April 1998).
Also they tend to be action and thriller movies. Cluster 2 includes many old
movies, or movies produced by the non-USA countries. They tend to be drama
movies. Cluster 3 contains many comedies. In cluster 4 most movies include
relatively serious themes. Overall we were quite surprised by the good inter-
pretability of the clusters. Fig. 8 (top) shows the relative frequency coefficient
(RFC) of the attribute Genre in these movie clusters. RFC of a genre s in a
cluster k is calculated as (fk,s − fs )/σs , where fk,s is the frequency of the genre
s in the movie cluster k, fs is mean frequency, and σs is standard deviation of
frequency. The labels for each cluster specify the dominant genres in the cluster.
For example, action and thriller are the two most frequent genres in cluster 1. In
general, each cluster involves several genres. It is clear that the movie clusters
are related to, but not just based on, the movie attribute Genre. The clustering
effect depends on both movie attributes and user ratings. Fig. 8 (bottom) shows
RFC of the attribute Occupation in user clusters. Equivalently, the labels for
each user cluster specify the dominant occupations in the cluster.
Note that in the experiments we predicted a relationship attribute R indicat-
ing the rating of a user for a movie. The underlying assumption is that in prin-
ciple anybody can rate any movie, no matter whether that person has watched
the movie or not. If the latter is important, we could introduce an additional at-
tribute Exist to specify if a user actually watched the movie. The relationship R
would then only be included in the probabilistic model if the movie was actually
watched by a user.
Fig. 8. Top: The relative frequency coefficient of the attribute Genre in different movie
clusters, Bottom: that of the attribute Occupation in different user clusters
Social Network Mining with Nonparametric Relational Models 93
5 Related Work
The work on infinite relational model (IRM) [15] is similar to the IHRM, and has
been developed independently. One difference is that the IHRM can specify any
reasonable probability distribution for an attribute given its parent, whereas the
IRM would model an attribute as a unary predicate, i.e. would need to transform
the conditional distribution into a logical binary representation. Aukia et al. also
developed a DP mixture model for large networks [4]. The model associates an
infinite-dimensional hidden variable for each link (relationship), and the objects
involved in the link are drawn from a multinomial distribution conditioned on the
hidden variable of the link. The model is applied to the community web data
with promising experimental results. The latent mixed-membership model [1]
can be viewed as a generalization of LDA model on relational data. Although
it is not nonparametric, the model exploits hidden variables to avoid the ex-
tensive structure learning and provides a principled way to model the relational
networks. The model associates each object with a membership probability-like
vector. For each relationship, cluster assignments of the involved objects are gen-
erated with respect to their membership vectors, and then relationship is drawn
conditioned on the cluster assignments.
There are some other important SRL research works for complex relational
networks. The probabilistic relational model (PRM) with class hierarchies [10]
specializes distinct probabilistic dependency for each subclass, and thus obtains
refined probabilistic models for relational data. A group-topic model is proposed
in [23]. It jointly discovers latent groups in a network as well as latent topics
of events between objects. The latent group model in [16] introduces two latent
variables ci and gi for an object, and ci is conditioned on gi . The object attributes
depends on ci and relations depend on gi of the involved objects. The limitation
is that only relations between members in the same group are considered. These
models demonstrate good performance in certain applications. However, most are
restricted to domains with simple relationships. These models demonstrate good
performance in certain applications. However, most are restricted to domains
with simple relationships.
Fig. 9. A conditional IHRM model for a simple sociogram. The main difference from
the IHRM model in Fig. 2 is that attributes G are not indirect influence over relations
R via object clusters Z, but are direct conditions of relations.
The main difference to the IHRM model in Fig. 2 is that relationship attributes
are conditioned on both the states of the latent variables and features derived
from attributes. A simple conditional model is based on logistic regression of the
form
log P (Ri,j |Zi = k, Zj = , F (Gi , Gj )) = σ(ωk, , xi,j ),
where xi,j = F (Gi , Gj ) denotes a vector describing features derived from all
attributes of i and j. ωk, is a weight vector, which determines how much a
particular attribute contributes to a choice of relation and can implicitly imple-
ment feature selection. Note that there is one weight vector for each cluster pair
(k, ). ·, · denotes an inner product. σ(·) is a real-valued function with any form
σ : R → R. The joint probability of the conditional model is now written as:
P (R, Z|G) = P (Zi ) P (Ri,j |Zi , Zj , F (Gi , Gj )), (12)
i i,j
where P (Zi ) is still defined as a stick breaking construction (Equ. 1). The prelim-
inary experiments show promising results, and we will report the further results
in future work.
7 Conclusions
This paper presents a nonparametric relational model IHRM for social network
modeling and analysis. The IHRM model enables expressive knowledge represen-
tation of social networks and allows for flexible probabilistic inference without
the need for extensive structural learning. The IHRM model can be applied to
community detection, link prediction, and product recommendation. The em-
pirical analysis on social network data showed encouraging results with inter-
pretable clusters and relation prediction. For the future work, we will explore
discriminative relational models for better performance. It will also be interest-
ing to perform analysis on more complex relational structures in social network
systems, such as domains including hierarchical class structures.
Social Network Mining with Nonparametric Relational Models 95
Acknowledgments
This research was supported by the German Federal Ministry of Economy and
Technology (BMWi) research program THESEUS, the EU FP7 project LarKC,
and the Fraunhofer ATTRACT fellowship STREAM.
References
1. Airoldi, E.M., Blei, D.M., Xing, E.P., Fienberg, S.E.: A latent mixed-membership
model for relational data. In: Proc. ACM SIGKDD Workshop on Link Discovery
(2005)
2. Aldous, D.: Exchangeability and related topics. In: Ecole d’Ete de Probabilites de
Saint-Flour XIII 1983, pp. 1–198. Springer, Heidelberg (1985)
3. Antoniou, G., van Harmelen, F.: A Semantic Web Primer. MIT Press, Cambridge
(2004)
4. Aukia, J., Kaski, S., Sinkkonen, J.: Inferring vertex properties from topology in
large networks. In: NIPS 2007 workshop on statistical models of networks (2007)
5. Bernard, H., Killworth, P., Sailer, L.: Informant accuracy in social network data
iv. Social Networks 2 (1980)
6. Blei, D., Jordan, M.: Variational inference for dp mixtures. Bayesian Analysis 1(1),
121–144 (2005)
7. Breiger, R.L., Boorman, S.A., Arabie, P.: An algorithm for clustering relational
data with applications to social network analysis and comparison to multidimen-
sional scaling. Journal of Mathematical Psychology 12 (1975)
8. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001)
9. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational
models. In: Dzeroski, S., Lavrac, N. (eds.) Relational Data Mining, Springer, Hei-
delberg (2001)
10. Getoor, L., Koller, D., Friedman, N.: From instances to classes in probabilistic re-
lational models. In: Proc. ICML 2000 Workshop on Attribute-Value and Relational
Learning (2000)
11. Getoor, L., Taskar, B. (eds.): Introduction to Statistical Relational Learning. MIT
Press, Cambridge (2007)
12. Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social
networks. Journal of the Royal Statistical Society 170 (2007)
13. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: Proc.
16th International Joint Conference on Artificial Intelligence (1999)
14. Ishwaran, H., James, L.: Gibbs sampling methods for stick breaking priors. Journal
of the American Statistical Association 96(453), 161–173 (2001)
15. Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning sys-
tems of concepts with an infinite relational model. In: Proc. 21st Conference on
Artificial Intelligence (2006)
16. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group
models. In: Proc. 4th international workshop on Multi-relational mining, pp. 49–
55. ACM Press, New York (2005)
17. Raedt, L.D., Kersting, K.: Probabilistic logic learning. SIGKDD Explor.
Newsl. 5(1), 31–48 (2003)
96 Z. Xu et al.
18. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: An open
architecture for collaborative filtering of netnews. In: Proc. of the ACM 1994 Con-
ference on Computer Supported Cooperative Work, pp. 175–186. ACM, New York
(1994)
19. Sampson, F.S.: A Novitiate in a Period of Change: An Experimental and Case
Study of Social Relationships. PhD thesis (1968)
20. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality re-
duction in recommender systems–a case study. In: WebKDD Workshop (2000)
21. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommender
algorithms for e-commerce. In: Proc. ACM E-Commerce Conference, pp. 158–167.
ACM, New York (2000)
22. Sethuraman, J.: A constructive definition of dirichlet priors. Statistica Sinica 4,
639–650 (1994)
23. Wang, X., Mohanty, N., McCallum, A.: Group and topic discovery from relations
and text. In: Proc. 3rd international workshop on Link discovery, pp. 28–35. ACM,
New York (2005)
24. Xu, Z., Tresp, V., Yu, K., Kriegel, H.-P.: Infinite hidden relational models. In: Proc.
22nd UAI (2006)
25. Xu, Z., Tresp, V., Yu, S., Yu, K.: Nonparametric relational learning for social
network analysis. In: Proc. 2nd ACM Workshop on Social Network Mining and
Analysis, SNA-KDD 2008 (2008)
26. Yedidia, J., Freeman, W., Weiss, Y.: Constructing free-energy approximations and
generalized belief propagation algorithms. IEEE Transactions on Information The-
ory 51(7), 2282–2312 (2005)
Using Friendship Ties and Family Circles
for Link Prediction
1 Introduction
There is a growing interest in social media and in data mining methods which
can be used to analyze, support and enhance the effectiveness and utility of
social media sites. The analysis methods being developed build on traditional
methods from the social network analysis community, extend them to deal with
the heterogeneity and growing size of the data being generated and use tools
from graph mining, statistical relational learning and methods for information
extraction from unstructured and semi-structured text.
Traditionally, social network analysis has focused on actors and ties (or rela-
tionships) between them, such as friendships or kinships. The two most common
types of networks are (1) unimodal networks, where the nodes are actors and the
edges represent ties such as friendships, and (2) affiliation networks which can be
represented as bipartite graphs, where there are two types of nodes, the actors
and organizations, and the edges represent the affiliations between actors and
organizations. Most of the existing work has focused on networks that exhibit a
single relationship type, either friendship or affiliation.
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 97–113, 2010.
c Springer-Verlag Berlin Heidelberg 2010
98 E. Zheleva et al.
prediction [4,1,5,8], link completion [10], and anomalous link discovery [1,9]
which are covered in more depth in Section 7.
Link prediction in social networks is useful for a variety of tasks. The most
straightforward use is for making data entry easier – a link-prediction system
can propose links, and users can select the friendship links that they would
like to include, rather than users having to enter the friendship links manually.
Link prediction is also a core component of any system for dynamic network
modeling—the dynamic model can predict which actors are likely to gain popu-
larity, and which are likely to become central according to various social network
metrics.
Link prediction is challenging for a number of reasons. When it is posed as
a pair-wise classification problem, one of the fundamental challenges is dealing
with the large outcome space; if there are n actors, there are n2 possible re-
lations. In addition, because most social networks are sparsely connected, the
prior probability of any link is extremely small, thus we have to contend with
a large class skew problem. Furthermore, because the number of links is poten-
tially so large, the number of the negative instances will be huge, so constructing
a representative training set is challenging.
In our approach to link prediction in multi-relational social networks, we ex-
plore the use of both attribute and structural features, and, in particular, we
study how group membership (in our case, family membership) can significantly
aid in accurate link (here, friendship) prediction.
Fig. 1. Actors in the same tightly-knit group often exhibit structural equivalence, i.e.,
they have the same connections to all other nodes. Using the original network (a), and
a structural equivalence assumption, one can construct a network with new predicted
links (b).
The descriptive attributes are attributes of nodes in the social network that
do not consider the link structure of the network. These features vary across
domains. They provide semantic insight into the inherent properties of each
node in a social network, or compare the values of the same inherent attributes
for a pair of nodes.
We define two classes of descriptive attributes for multi-relational social net-
works:
The next set of features that we introduce describe features of network structure.
The first is a structural features for a single node, ai , while the remaining describe
structural attributes of pairs of nodes, ai and aj .
1. Actor features. These features describe the link structure around a node.
Number of friends. The degree, or number of friends, of an actor ai : |ai .F |.
2. Actor-pair features. These features describe how interconnected two nodes
are. They measure the sets of friends that two actors have ai .F and aj .F .
Number of common friends. The number of friends that the pair of nodes
have in common in the network: |ai .F ∩ aj .F |.
Jaccard coefficient of the friend sets. The Jaccard coefficient over the friend
sets of two actors describes the ratio of the number of their common
friends to their total number of friends:
|ai .F ∩aj .F |
Jaccard(ai , aj ) = |ai .F ∪aj .F | .
The third category of features that we consider are based on group membership;
in the networks we studied, the groups are families. These are the features that
overlay friendship and affiliation networks.
102 E. Zheleva et al.
1. Actor features. These are features that describe the groups to which an actor
belongs.
Family Size. This is the simplest attribute and describes the size of an
actor’s family: |ai .M |.
2. Actor-pair features. There are two types of features for modeling these inter-
family relations based on the overlapping friend and family sets of two actors
ai .F and aj .M :
Number of friends in the family. The first feature describes the number of
friends ai has in the family of aj : |ai .F ∩ aj .M |. This feature allows one
to reason about the relationship between an actor and a group of other
actors, where the latter is semantically defined over the network through
the family relations.
Portion of friends in the family. The second feature on inter-family relations
describes the ratio between the number of friends that ai has in aj ’s
family (the same as the above feature) and the size of aj ’s family. The
rationale behind this feature is that the higher this ratio is, the more
likely it is that aj is close to ai in the network since more of its family
members are friends with ai .
The idea behind the group features is based on the notion of structural equiv-
alence of nodes within a group. Two nodes are structurally equivalent if they
have the same links to all other actors. If we can detect tightly-knit groups in a
social network and we assume that the nodes in each group are likely to behave
similarly, then new links can be predicted by projecting links such that the nodes
in the group become structurally equivalent. In our networks, such groups are
the family cliques. In a weighted graph, a tight group could map to a clique of
nodes with highly-weighted edges.
Figure 1 shows an example of how a structural equivalence assumption can
help in predicting new links. For example, if one of the actors from Group A is
friends with an actor from Group B, as shown on the original network (a), then
it may be more likely that there is a link between the other actor from Group
A and the actor from Group B, shown as a dashed line in (b).
6 Experimental Evaluation
6.1 Social Media Data Sets
This research is based upon using networks that have two sets of connections:
friendship links and family ties. We performed our experiments on three novel
datasets describing petworks: Dogster, Catster, and Hamsterster1 . On these
sites, profiles include photos, personal information, characteristics, as well as
membership in community groups. Members also maintain links to friends and
family members. As of February 2007, Dogster has approximately 375,000 mem-
bers. Catster is based on the same platform as Dogster and contains about
150,000 members. Hamsterster has a different platform, but it contains similar
information about its members. It is much smaller than Dogster and Catster -
about 2,000 members.
These sites are the only three of the hundreds we visited that publicly share
both family and friendship connections2 . However, these are networks where both
types of connections are realistic and representative of what we would expect to
see in other social networks if they collected this data. The family connections
are representative of real life, since family links are only made between profiles
of pets created by the same owner. The friendship linking behavior is in line
with patterns seen in other social networks [11].
1. Actor features:
Breed. This is the pet breed such as golden retriever or chihuahua. A pet
can have more than one breed value.
1
At http://www.dogster.com, http://www.catster.com, and
http://www.hamsterster.com.
2
For a full list, see http://trust.mindswap.org/SocialNetworks
104 E. Zheleva et al.
Breed category. Each breed belongs to a broader category set. For example
in Dogster, the major breed categories we identified are working, herding,
terrier, toy, sporting, non-sporting, hound, and other, a catchall for the
other breeds that appear in a the site, but not as frequently as the
previous ones. When a dog has multiple breeds, its breed category is
mixed.
Single Breed. This boolean feature describes whether a pet has a single
breed or whether it has multiple breed characteristics.
Purebred. This is a boolean feature which specifies whether a dog owner
considers its pet to be purebred or not.
2. Actor-pair features. All of the above features describe characteristics of a
single user in the network.
Same breed. This boolean feature is true if two profiles have at least one
common breed.
We have obtained a random sample of 10, 000 profiles each from Dogster and
Catster, and all 2059 profiles registered with Hamsterster. Each instance in the
test data contained the features for a pair of profiles where some of the features
were individual node features. To construct the test data, we chose the pairs of
Using Friendship Ties and Family Circles for Link Prediction 105
nodes for which there was an existing friendship link, and we sampled from the
space of node pairs which did not have a link. We computed the descriptive,
structural and group features for each of the profiles.
For each pair of profiles in the test data, we computed the features from the
three classes described in Section 4. A test instance for a pair of profiles ai and
aj includes both the individual actor features and the actor-pair features. It has
the form
where class is the binary class which denotes whether a friendship link exists
between the actors.
For Dogster, the sample of 10,000 dogs had around 17,000 links among them-
selves, and we sample from the non-existing links at a 10:1 ratio (i.e., the non-
existing links are 10 times more than the existing links). For Catster, the 10,000
cats had 43,000 links, and for the whole Hamsterster dataset, the number of links
was around 22,000. We sampled from the non-existing links in these datasets at
the same 10:1 ratio.
We used three well-known classifiers, namely Naı̈ve Bayes, logistic regression and
decision trees for our experiments. The goal was to perform binary classification
on the test instances and predict friendship links. The implementations of these
classifiers were from the latest version of Weka (v3.4.12) from http://www.cs.
waikato.ac.nz/ml/weka/. We allocated a maximum of 2GB of memory for each
classifier we ran. We measured prediction accuracy by computing precision, recall,
and their harmonic mean, F1 score, using 10-fold cross-validation.
Table 1. Comparison of F1 values in the three datasets, with the feature types from
our taxonomy
Feature Type Dogster Catster Hamsterster
Descriptive 37.6% 0.4% 19.8%
Structural 76.1% 83.1% 59.9%
Group 90.8% 95.2% 89.2%
Descriptive and structural 78.6% 83.0% 60.3%
Descriptive, structural, and group 94.8% 97.9% 90.5%
106 E. Zheleva et al.
Fig. 3. a) Recall, precision, and F1 score for Dogster using descriptive and struc-
tural attributes; b) F1 score across datasets. Using descriptive attributes together with
structural attributes leads to a better F1 score in Dogster but not in Catster and
Hamsterster.
Fig. 4. Link-prediction accuracy using all feature classes: descriptive, structural and
group features. a) Recall, precision, and F1 score for Dogster; b) F1 score across
datasets. Group features are highly predictive, yet adding the other features provided
benefit too.
Fig. 6. Prediction accuracy when links are treated equal, with and without group
affiliations. As the results from the affiliation overlays suggest, group features are the
main contributor to the high link-prediction accuracy.
types: the link-prediction accuracy was the same. However, in the case when
the affiliations were not given, it was better to compute the structural features
using both types of relationships but treat them as one type. When family links
were treated as friendship links, the accuracy of the predictions made by the
structural attributes improved by 6% to 20%. This may be due to the fact that
the overlap between friends and family links in the data was very small, and
using both types of links when computing the structural features was beneficial.
Using the affiliation information and computing all features on the data led to
the best accuracy, and the accuracy was the same both in the different-link and
same-link cases. These experiments also confirmed the previous results: group
affiliation was the main contributor to the high link-prediction accuracy.
7 Related Work
In general, link-prediction algorithms process a set of features in order to learn
and predict whether it is likely that two nodes in the data are linked. Sometimes,
these features are hand-constructed by analyzing the problem domain, the at-
tributes of the actors, and the relational structure around those actors [12,4,5,9].
Other times, they are automatically generated, i.e., the prediction algorithm first
learns the best features to use and then predicts new links [8]. In this section,
we discuss the existing work that is most relevant to the link-prediction problem
in multi-relational social networks.
The link-prediction techniques that are based on feature-construction are clos-
est to our work [12,4,1,5,9]. As most of the relational domains can be represented
as a network model, the constructed features not only include the attributes of
the actors, but also the characteristics of the structure. Most of this work ex-
amines co-authorship and citation networks [4,5,8,9] whereas we validate our
method using online social networks. Some of the approaches use machine learn-
ing techniques for classification [4,13,8,14], and others rely on ranking the feature
values [12,5,9].
110 E. Zheleva et al.
8 Discussion
When studying other large social networks, family information is not always
relevant or available. However, groups and affiliations are often available, or
communities can be discovered.
The networks used here had a binary relationships - friend or family - but a similar
effect can be achieved in networks where relationships are weighted. For example,
112 E. Zheleva et al.
Acknowledgments
This work was partially supported by NSF under Grants No. 0746930 and No.
0423845.
References
1. Huang, Z., Zeng, D.: A Link Prediction Approach to Anomalous Email Detection.
In: IEEE International Conference on Systems, Man, and Cybernetics (2006)
2. Klimt, B., Yang, Y.: The enron corpus: A new dataset for email classification
research. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML
2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004)
3. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution
of the social network of scientific collaborations. PHYSICA A 311, 3 (2002)
4. Hasan, M., Chaoji, V., Salem, S., Zaki, M.: Link Prediction using Supervised Learn-
ing. In: Proceedings of the Workshop on Link Analysis, Counter-terrorism and
Security (with SIAM Data Mining Conference) (2006)
5. Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks.
In: Proceedings of the 12th International Conference on Information and Knowl-
edge Management (CIKM) (2003)
6. Newman, M.: Who is the best connected scientist? a study of scientific coauthorship
networks. Working Papers 00-12-064, Santa Fe Institute (December 2000),
http://ideas.repec.org/p/wop/safiwp/00-12-064.html
Using Friendship Ties and Family Circles for Link Prediction 113
L. Karl Branting
1 Introduction
Many complex networks, such as the Internet, metabolic pathways, and social
networks, are characterized by a community structure that groups related ver-
tices together. Traditional clustering techniques group vertices based on some
metric for attribute similarity [2]. More recent research has focused on detection
of community structure from graph topology. Under this approach, the input to
a community-detection algorithm is a graph in which vertices correspond to indi-
viduals (e.g., URLs, molecules, or people) and edges correspond to relationships
(e.g., hyperlinks, chemical reactions, or marital and business ties). The output
consists of a partition of the graph in which subgraphs correspond to meaningful
groupings (e.g., web communities, families of molecules, or social clans).1
Community detection algorithms can be viewed as comprising two compo-
nents: a utility function that expresses the quality of any given partition of a
1
Some communities, such as social clubs and families, can overlap. Membership in
such communities is better modeled as attributes of vertices rather than through
a partition of the graph [3]. The focus of this paper, however, as in the bulk of
community detection research, is on partition-based community structure.
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 114–130, 2010.
c Springer-Verlag Berlin Heidelberg 2010
Information Theoretic Criteria for Community Detection 115
Table 1. Utility functions and search strategies for various community-detection algo-
rithms. DHC represents divisive hierarchical clustering, ADHH represents agglomera-
tive hierarchical clustering, and MDL represents “Minimum Description length.”
graph; and a search strategy that specifies a procedure for finding a partition
that optimizes the utility function. Table 1 sets forth utility functions and search
strategies of eight recent community-detection algorithms, showing that utility
functions have been paired with a variety of different search strategies.
The utility function most prevalent in recent community detection research is
the modularity function introduced in [1]:
Q= (w(Dii )/l − (li /l)2 ) (1)
1<i≤m
where i is the index of the communities, w(Dii ) is the number of edges in the
graph that connect pairs of vertices within community i, li = j≤i w(Dij ), i.e.,
the number of edges in the graph that are incident to at least one vertex in
community i, and l is the total number of edges in the entire graph. Modularity
formalizes the intuition that communities consist of groups of entities having
more links with each other than with members of other groups.
Because of the shortage of real-world data sets with known community struc-
ture, maximum modularity has sometimes even been equated with correct com-
munity structure. However, two important weaknesses have been identified in
modularity as a community-structure criterion.
First, the group structure that optimizes modularity within a given subgraph
can depend on the number of edges in the entire graph in which the subgraph is
embedded. Specifically, modularity is characterized by an intrinsic scale under√
which Q is maximized when pairs of distinct groups having fewer than 2l
edges (where l is the total number of edges in the graph) are combined into
single groups [4]. This phenomenon is apparent in ring graphs, i.e., connected
graphs that consist of identical subgraphs each connected to exactly two other
subgraphs by a single link. For example, in the graph shown in Figure 1 consisting
of a ring of 15 squares, modularity is greater when adjacent squares are grouped
together than when each square is a separate group.
116 L. Karl Branting
A second weakness of modularity is that even when the resolution limit is not
exceeded, modularity exhibits a bias towards groups of similar size. Intuitively,
the sum of the square terms, (li /l)2 , representing the expected number of intra-
group edges within community i under the null model, is minimized, and Q
therefore maximized, when all li are as nearly equal in size as possible.
One approach to the resolution limit of modularity is to apply modularity
recursively, so that the coarse structure found at one level is refined at lower
levels [5].2 An alternative approach is to substitute a different community-quality
criterion for modularity.
One such alternative criterion for community quality that has recently been
proposed, based on information theory, is minimizing description length [7,8,9].
In this approach, the quality of a given partition of a graph is a function of the
complexity of the community structure together with the mutual information
between the community structure and the graph as a whole. The best commu-
nity structure is one that minimizes the sum of (1) the number of bits needed
to represent the community structure plus (2) the number of bits needed to
2
See [6] for recent approach that addresses resolution limits by using an absolute
evaluation of community structure rather than comparison to a null model.
Information Theoretic Criteria for Community Detection 117
represent the entire graph given the community structure. Under this approach,
the task of community detection consists of finding the community structure that
leads to the minimum description length (MDL) representation of the graph,
where description length is measured in number of bits.
The structure of the paper is as follows: Section 2 of this paper compares the
compression approach used in two previous approaches to information-theoretic
community detection and identifies a feature common to both that can lead
to a bias toward combining distinct communities in large sparse graphs. An
alternative encoding, termed SGE (Sparse Graph Encoding) that addresses this
bias is proposed in Section 3. Section 4 describes the design of an empirical
evaluation comparing the previous information-theoretic utility functions, SGE,
and modularity on three classes of artificial data. The results of this experiment
are set forth in Section 5.
The intuition behind the minimum description length (MDL) criterion for com-
munity structure is that a partition of a graph that permits a more concise
description of the graph is more faithful to the actual community structure than
a partition leading to a less concise description. The best partition is the one
that lends itself to the most concise description, that is, the encoding of the par-
tition and of the graph given with the partition in the fewest bits. However, the
minimum description length (MDL) criterion does not in itself specify how to
encode either the community structure or the graph given the community struc-
ture. Indeed, the close connection between MDL and Kolmogorov complexity
[10], which is undecidable, suggests that MDL may itself be undecidable.
The encoding algorithms of Rosvall and Bergstrom [7] (hereinafter “RB”) and
Chakrabarti [8] (hereinafter “AP,” standing for “AutoPart”) use quite different
approaches to measuring the description length of community structures and
graphs. However, RB and AP have in common that both are characterized by a
resolution limit similar to that observed in modularity.
RB and AP decompose the task of encoding a graph and its community
structure into similar steps, but they calculate the bits in each term differently.
For the purposes of this comparison, the following notation will be followed:
– P (Dij ) - for a square matrix Dij , the density of 1’s ignoring the diagonal
– H(Dij ) = −P (Dij ) log(P (Dij ))−(1−P (Dij )) log(1−P (Dij )), i.e., the mean
entropy of Dij
– H (Dij ) = −P (Dij ) log(P (Dij )) − (1 − P (Dij )) log(1 − P (Dij )), i.e., the
mean entropy of Dij if values on the diagonal of Dij are ignored
– B - a matrix representing for each pair of groups whether the pair is con-
nected, i.e., Bij = 1 ⇐⇒ w(Dij ) > 0
1. Bits needed to represent the number of vertices in the graph. Since this term
doesn’t vary with differing community structure, it is irrelevant to the choice
between different community structures and can be ignored.
2. Bits needed to represent the number of groups.
– RB. Not explicitly represented.
– AP. log∗ (m). log∗ (x) = log2 (x) + log2 log2 (x) + ... where only positive
terms are included in the sum. This series is apparently intended to
represent the mean coding length of integers given that the probability
of an integer of a given length is a monotonically decreasing function
of the integer’s length, i.e., longer integers are less probable, but no
maximum length is known [17].
3. Bits needed to represent the association between vertices and groups
– RB. n log(m). The rationale appears to be that for each of the n vertices,
log(m) bits are needed to identify the group to which the vertex belongs.
– AP. If the groups are placed in decreasing order of length, i.e., a1 ≥ a2 ≥
... ≥ am ≥ 1,
m−1
log(ai )
i=1
m
where ai = ( t=1 at ) − m + i.
4. Bits needed for the group adjacency matrix, i.e., the number of edges between
pairs of groups.
– RB. 12 m(m+1) log(l). The first term ( 12 m(m+1)) represents the number
of pairs of groups, and the second term (log(l)) the number of bits needed
to specify the number of edges between any pair of groups.
– AP.
log(ai aj + 1)
1<i,j<m
This expression sums for every pair of groups sufficient bits to represent
the number of edges between that pair.
5. Bits needed to represent the full adjacency matrix for vertices, given the
group structure represented in terms 2-4.
– RB.
m
ai (ai − 1)/2 ai a j
log( )
w(Dii ) w(Dij )
i=1 i<j
Information Theoretic Criteria for Community Detection 119
The expression following the first product sign represents the number of
ways to choose the actual pairs that are connected within a single group
from the set of all possible pairs. The expression following the second
product sign is the number of ways to choose the actual pairs between
vertices in two different groups from the set of possible edges between
vertices in those groups.
– AP. m m
ai aj H(Dij )
i=1 j=1
For each pair of groups, the entropy of the adjacency matrix for that
pair, i.e., the size of the matrix times its the mean entropy.
mean entropy is less than 1.0, and the total entropy is therefore less than the
square of the number of groups.
Moreover, the number of bits needed to represent B can be further reduced by
noting that the value of B’s diagonal need not be explicitly represented because
it can be determined from the number of nodes in each group. Singleton groups
have no within-group edges (assuming that self-loops are prohibited) and groups
with more than one element must have at least one within-group edge (if there
are no within-group edges, the density of within-group edges cannot be higher
than the density of between-group edges, the basic characteristic of a group).
The bits needed to represent B are therefore:
m(m − 1)H (B) (2)
where H (B) = −P (B) log(P (B)) − (1 − P (B)) log(1 − P (B)) and P (B) is
the density of 1’s in B, ignoring the diagonal.
The second term contains, for each connected pair, the number of bits needed
to represent the number of edges between that pair (the second sum is needed
if, as we assume, edges from a vertex to itself are forbidden):
log(ai aj ) + log(ai (aj − 1)) (3)
i=j∧w(Dij )≥0 i=j∧w(Dij )>0
5. Bits needed to represent the full adjacency matrix for vertices given the
group structure represented in terms 2-4. This consists, for every pair of
groups i and j, of size of the i, j adjacency matrix, ai aj , times the entropy
per entry in the corresponding binary matrix, H(Dij ). This is equivalent to
the AP calculation, shown above:
m
m
ai aj H(Dij )
i=1 j=1
4 Empirical Evaluation
The previous section suggested that a graph encoding in which the calculation of
the number bits required to represent a group adjacency matrix was reduced from
an expression that grows as the square of the number of groups, as in RB and AP,
to an expression that grows in proportion to the number of pairs of connected
groups, as in SGE, would reduce or eliminate any resolution limit in sparsely
connected graphs. This hypothesis was tested by comparing the communities
found by optimizing RB, AP, SGE, and modularity on three different artificial
data sets.
To avoid conflating the effect of a utility function with the behavior of a search
strategy, it was necessary to compare alternative utility functions using a single
common search strategy. Accordingly, a single search function was applied to all
for utility functions in the experimental evaluation: the greedy divisive hierar-
chical clustering algorithm of Newman & Girvan (2004) [11]. In the Newman &
Girvan procedure, the edge with the highest betweenness centrality is iteratively
removed, and the partition in the resulting sequence having the optimal value
under the utility function was returned as the community structure. Using a sin-
gle search strategy removes the potentially confounding disparity of the search
algorithms used in published descriptions of RB, AP, and modularity.
including the Rand index [23], the adjusted Rand index [24], and f-measure.
There is no consensus regarding the most informative objective function. In this
evaluation, f-measure was selected since its use in information retrieval has made
it familiar to a wide range of researchers.
The intuition underlying the use of f-measure is that group structure can be
expressed as a relation c(G) = {vi , vj | ∃g ∈ G ∧ vi , vj ∈ g}, that is, the
community structure can be represented by specifying for each pair of vertices
whether that pair is in the same group. The similarity between the proposed
group structure and the actual group structure can be evaluated by comparing
c(proposed) with c(actual). One way to make the comparison is to view each
pair in c(proposed) that is also in c(actual) as a true positive, whereas each pair
in c(proposed) that is not in c(actual) is a false positive. Under this view, recall
and precision can be defined as follows:
|c(proposed)| |c(actual)|
– Recall =
|c(actual)|
|c(proposed)| |c(actual)|
– Precision =
|c(proposed)|
Three experiments were performed, each with a different type of artificial graph.
The first, ring graphs, are characterized by the sparsity of connections between
groups observed in many large-scale real-world graphs [20]. The second, uniform
random graphs, has been used in a number of evaluations of community-detection
algorithms. The third, embedded Barabasi-Albert Graphs, consists of communities
generated by preferential attachment [20] embedded in a random graph. Fifty
trials were performed under each experimental condition for uniform random
and EBA graphs. There is no randomness in the construction of ring graphs, so
a single trial was sufficient.
– SGE. The partition having the optimal (lowest) SGE had the correct parti-
tion (i.e., no separate communities were conflated) in every graph except for
R4,3 and R13,3 In other words, the correct community structure was found
in 89 out of 91 ring graphs.
– RB and AP. No community structure was found by optimizing either RB
or AP. The partition having the optimal (lowest) value for RB and AP
contained at least one pair of communities that were grouped together in
every ring graph tested.
– Modularity. Optimizing modularity led to incorrect community structure
for rings of more than 8 triangles, more than 10 squares, more than 11 pen-
tagons, or more that 13 hexagons or heptagons. In other words, the correct
partitions were obtained with modularity only for rings and communities of
the following sizes:
Fig. 2. A uniform random graph with 32 vertices, 4 groups, size ratio 1.25, and io ratio
0.67
124 L. Karl Branting
• R4,3 − R8,3
• R4,4 − R10,4
• R4,5 − R11,5
• R4,6 − R13,6
• R4,7 − R13,7
• R4,8 − R16,8
• R4,9 − R16,9
This evaluation confirmed empirically the existence of the resolution limit for
modularity derived formally in [4]. The evaluation also showed the surprising
result that optimizing RB and AP leads to even more conflation of distinct
communities than does modularity. The observation that optimizing SGE led to
the correct community structure provides confirmation for the hypothesis that
the conflation of communities in RB and AP arises from term 4, which uses more
bits than necessary to represent the number of edges connecting groups in sparse
graphs. Substituting rings of cliques for rings of graphs that are themselves rings
leads to almost identical results to those described here.
Fig. 4. F-measure for uniform random graphs with i=0.6 (weak community structure)
126 L. Karl Branting
Fig. 5. F-measure for uniform random graphs with i=0.75 (moderate community struc-
ture)
Fig. 6. F-measure for uniform random graphs with i=0.9 (strong community structure)
Information Theoretic Criteria for Community Detection 127
Fig. 7. F-measure for embedded Barabasi-Albert graph with 2–4 edges added per time
step
128 L. Karl Branting
5 Conclusion
Acknowledgments
This work was funded under contract number CECOM Wl5P7T-08-C-F600. The
MITRE Corporation is a nonprofit Federally Funded Research and Development
Center chartered in the public interest.
References
1. Newman, M.E.J.: Fast algorithm for detecting community structure in networks.
Physical Review E 69, 066133 (2004)
2. Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A.L., French, J.C.: Clustering
large datasets in arbitrary metric spaces. In: Proceedings of the 15th IEEE Inter-
national Conference on Data Engineering, Sydney, pp. 502–511 (1999)
Information Theoretic Criteria for Community Detection 129
22. Clauset, A., Shalizi, C., Newman, M.: Power-law distributions in empirical data.
SIAM Review 51(4), 661–703 (2009)
23. Rand, W.M.: Objective Criteria for the Evaluation of Clustering Methods. Journal
of the American Statistical Association 66(336), 846–850 (1971)
24. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218
(1985)
Author Index