
Lecture Notes in Computer Science 5498

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Lee Giles, Marc Smith, John Yen,
and Haizheng Zhang (Eds.)

Advances in Social
Network Mining
and Analysis

Second International Workshop, SNAKDD 2008


Las Vegas, NV, USA, August 24-27, 2008
Revised Selected Papers

Volume Editors

Lee Giles
John Yen
Pennsylvania State University, College of Information Science and Technology
University Park, PA 16802, USA
E-mail: {giles, jyen}@ist.psu.edu

Marc Smith
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
E-mail: masmith@microsoft.com

Haizheng Zhang
Amazon.com, Seattle, WA, USA
E-mail: haizhengzhang@gmail.com

Library of Congress Control Number: 2010931753

CR Subject Classification (1998): H.3, H.4, I.2, H.2.8, C.2, H.2

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-642-14928-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14928-3 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
Preface

This year’s volume of Advances in Social Network Mining and Analysis contains
the proceedings of the Second International Workshop on Social Network Mining
and Analysis (SNAKDD 2008). The annual workshop is co-located with the ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). The
second SNAKDD workshop was held with KDD 2008 and received more than
32 submissions on social network mining and analysis topics. We accepted 11
regular papers and 8 short papers. Seven of the papers are included in this
volume.
In recent years, social network research has advanced significantly, thanks to
the prevalence of online social websites and instant messaging systems, as
well as the availability of a variety of large-scale offline social network systems.
These systems are usually characterized by complex network structures
and rich accompanying contextual information. Researchers are increasingly
interested in addressing a wide range of challenges posed by these disparate
social network systems, including identifying common static topological
properties and dynamic properties during the formation and evolution of
these social networks, and determining how contextual information can help in
analyzing the corresponding social networks. These issues have important
implications for community discovery, anomaly detection, and trend prediction,
and can enhance applications in multiple domains such as information retrieval,
recommendation systems, and security.
The second SNAKDD workshop focused on knowledge discovery and data
mining in social networks, such as contextual community discovery, link analysis,
the growth and evolution of social networks, algorithms for large-scale graphs,
techniques that can be used for recovering and constructing social networks
from online social systems, search on social networks, multi-agent-based social
network simulation, trend prediction of social network evolution, and related
applications in other domains such as information retrieval and security. The
workshop was concerned with inter-disciplinary and cross-domain studies span-
ning a variety of areas in computer science including graph and data mining,
machine learning, computational organizational and multi-agent studies, infor-
mation extraction and retrieval, and security, as well as other disciplines such as
information science and social science.
In the first paper “Leveraging Label-Independent Features for Classification
in Sparsely Labeled Networks: An Empirical Study,” Brian Gallagher and Tina
Eliassi-Rad study the problem of within-network classification in sparsely labeled
networks. The authors present an empirical study and show that the use of LI
features produces classifiers that are less sensitive to specific label assignments
and can lead to significant performance improvement.

In the second paper, “Community Detection Using a Measure of Global
Influence,” Rumi Ghosh and Kristina Lerman define “influence” as the number
of paths, of any length, that exist between two nodes and argue that this gives
a better measure of network connectivity. The authors use the influence metric
to partition a network into groups or communities by looking for regions of
the network where nodes have more influence over each other than over nodes
outside the community.
The third paper, “Communication Dynamics of Blog Networks,” by Mark
Goldberg, Malik Magdon-Ismail, Stephen Kelley, Konstantin Mertsalov, and
William Wallace, studies the communication dynamics of blog networks by
exploring the Russian section of LiveJournal. The paper addresses two
fundamental questions: (1) what models adequately describe such dynamic
communication behavior; and (2) how does one detect changes in the nature
of the communication dynamics. The paper leverages stable statistics to
reveal the dynamics of the networks and to characterize their locality
properties.
In the fourth paper, “Finding Spread Blockers in Dynamic Networks,” Habiba,
Yintao Yu, Tanya Berger-Wolf, and Jared Saia extend standard structural net-
work measures to dynamic networks. The authors also compare the blocking
ability of individuals when ranked by the new dynamic measures. They found
that, overall, simple ranking according to a node’s static degree, or the dynamic
version of a node’s degree, performed consistently well. Surprisingly, the
dynamic clustering coefficient seems to be a good indicator, while its static
version performs worse than random ranking.
The fifth paper, “Social Network Mining with Nonparametric Relational
Models,” by Zhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kerst-
ing, discusses how the infinite hidden relational model (IHRM) can be used to
model and analyze social networks, where each edge is associated with a random
variable and the probabilistic dependencies between variables are specified by
the model based on the relational structure. The hidden variables are able to
transport information such that non-local probabilistic dependencies can be ob-
tained. The IHRM provides effective relationship prediction and cluster analysis
for social networks. The experiments demonstrate that this model can provide
good prediction accuracy and capture the inherent relations among social actors.
In the sixth paper, “Using Friendship Ties and Family Circles for Link Pre-
diction,” the authors (Elena Zheleva, Lise Getoor, Jennifer Golbeck, and Ugur
Kuter) investigate how networks can be overlaid and propose a feature taxonomy
for link prediction. The authors show that the accuracy of link prediction can
be improved when there are tightly-knit family circles in a social network. Their
experiments demonstrate significantly higher prediction accuracy compared to
using traditional features such as descriptive node attributes and structural
features.
The last paper, “Information Theoretic Criteria for Community Detection”
by Karl Branting, studies the resolution limit problem with two compression-
based algorithms that were designed to overcome such limits. The author
identifies the aspect of each approach that is responsible for the resolution limit
and proposes a variant, SGE, that addresses this limitation. The paper demon-
strates on three artificial data sets that (1) SGE does not exhibit a resolution
limit on graphs in which other approaches do, and that (2) modularity and the
compression-based algorithms, including SGE, behave similarly on graphs not
subject to the resolution limit.
We would like to thank the authors of all submitted papers for both the joint
workshop and this proceedings volume. We are further indebted to the Program
Committee members for their rigorous and timely reviewing. They allowed us
to make this workshop a major success.

Lee Giles
Marc Smith
John Yen
Haizheng Zhang
Organization

Program Chairs
Lee Giles Pennsylvania State University, USA
Marc Smith Microsoft, USA
John Yen Pennsylvania State University, USA
Haizheng Zhang Amazon.com, USA

Program Committee
Lada Adamic
Aris Anagnostopoulos
Arindam Banerjee
Tanya Berger-Wolf
Yun Chi
Aaron Clauset
Isaac Councill
Tina Eliassi-Rad
Lise Getoor
Mark Goldberg
Larry Holder
Andreas Hotho
Gueorgi Kossinets
Kristina Lerman
Wei Li
Yi Liu
Ramesh Nallapati
Jennifer Neville
Cheng Niu
Dou Shen
Bingjun Sun
Jie Tang
Andrea Tapia
Alessandro Vespignani
Xuerui Wang
Michael Wurst
Xiaowei Xu

Referees
Vladimir Barash Sanmay Das Evan Xiang
Mustafa Bilgic Anirban Dasgupta Xiaowei Xu
Matthias Broecheler Robert Jäschke Limin Yao
Guihong Cao Liu Liu Jing Zhang
Bin Cao Galileo Namata Yi Zhang
Table of Contents

Leveraging Label-Independent Features for Classification in Sparsely
Labeled Networks: An Empirical Study . . . . . . . . . . . . . . . . . . . . 1
   Brian Gallagher and Tina Eliassi-Rad

Community Detection Using a Measure of Global Influence . . . . . . . . . . 20
   Rumi Ghosh and Kristina Lerman

Communication Dynamics of Blog Networks . . . . . . . . . . . . . . . . . . 36
   Mark Goldberg, Stephen Kelley, Malik Magdon-Ismail,
   Konstantin Mertsalov, and William (Al) Wallace

Finding Spread Blockers in Dynamic Networks . . . . . . . . . . . . . . . . 55
   Habiba, Yintao Yu, Tanya Y. Berger-Wolf, and Jared Saia

Social Network Mining with Nonparametric Relational Models . . . . . . . . . 77
   Zhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kersting

Using Friendship Ties and Family Circles for Link Prediction . . . . . . . . 97
   Elena Zheleva, Lise Getoor, Jennifer Golbeck, and Ugur Kuter

Information Theoretic Criteria for Community Detection . . . . . . . . . . . 114
   L. Karl Branting

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131


Leveraging Label-Independent Features for
Classification in Sparsely Labeled Networks:
An Empirical Study

Brian Gallagher and Tina Eliassi-Rad

Lawrence Livermore National Laboratory


P.O. Box 808, L-560, Livermore, CA 94551, USA
{bgallagher,eliassi}@llnl.gov

Abstract. We address the problem of within-network classification in
sparsely labeled networks. Recent work has demonstrated success with
statistical relational learning (SRL) and semi-supervised learning (SSL)
on such problems. However, both approaches rely on the availability of
labeled nodes to infer the values of missing labels. When few labels are
available, the performance of these approaches can degrade. In addition,
many such approaches are sensitive to the specific set of nodes labeled.
So, although average performance may be acceptable, the performance
on a specific task may not. We explore a complementary approach to
within-network classification, based on the use of label-independent (LI )
features – i.e., features calculated without using the values of class labels.
While previous work has made some use of LI features, the effects of these
features on classification performance have not been extensively studied.
Here, we present an empirical study in order to better understand these
effects. Through experiments on several real-world data sets, we show
that the use of LI features produces classifiers that are less sensitive to
specific label assignments and can lead to performance improvements of
over 40% for both SRL- and SSL-based classifiers. We also examine the
relative utility of individual LI features; and show that, in many cases, it
is a combination of a few diverse network-based structural characteristics
that is most informative.

Keywords: Statistical relational learning; semi-supervised learning; social
network analysis; feature extraction; collective classification.

1 Introduction

In this paper, we address the problem of within-network classification. We are
given a network in which some of the nodes are “labeled” and others are “un-
labeled” (see Figure 1). Our goal is to assign the correct labels to the unlabeled
nodes from among a set of possible class labels (i.e., to “classify” them). For exam-
ple, we may wish to identify cell phone users as either ‘fraudulent’ or ‘legitimate.’
Cell phone fraud is an example of an application where networks are often
very sparsely labeled. We may have a handful of known fraudsters and a handful

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 1–19, 2010.

© Springer-Verlag Berlin Heidelberg 2010

Fig. 1. Portion of the MIT Reality Mining call graph. We know the class labels for the
black (dark) nodes, but do not have labels for the yellow (light) nodes.

of known legitimate users, but for the vast majority of users, we do not know the
correct label. For such applications, it is reasonable to expect that we may have
access to labels for fewer than 10%, 5%, or even 1% of the nodes. In addition,
cell phone networks are generally anonymized. That is, nodes in these networks
often contain no attributes besides class labels that could be used to identify
them. It is this kind of sparsely labeled, anonymized network that is the focus
of this work. Put another way, our work focuses on univariate within-network
classification in sparsely labeled networks.
Relational classifiers have been shown to perform well on network classification
tasks because of their ability to make use of dependencies between class labels
(or attributes) of related nodes [1]. However, because of their dependence on
class labels, the performance of relational classifiers can substantially degrade
when a large proportion of neighboring instances are also unlabeled. In many
cases, collective classification provides a solution to this problem, by enabling
the simultaneous classification of a number of related instances [2]. However,
previous work has shown that the performance of collective classification can
also degrade when there are too few labels available, eventually to the point
where classifiers perform better without it [3].
In this paper, we explore another source of information present in networks
that does not depend on the availability or accuracy of node labels. Such infor-
mation can be represented using what we call label-independent (LI ) features.
The main contribution of this paper is an in-depth examination of the effects
of label-independent features on within-network classification. In particular, we
address the following questions:

1. Can LI features make up for a lack of information due to sparsely labeled
data? Answer: Yes.
2. Can LI features provide information above and beyond that provided by the
class labels? Answer: Yes.
3. How do LI features improve classification performance? Answer: Because
they are less sensitive to the specific labeling assigned to a graph, classifiers
that use label-independent features produce more consistent results across
prediction tasks.
4. Which LI features are the most useful? Answer: A combination of a few di-
verse network-based structural characteristics (such as node and link counts
plus betweenness) is the most informative.

Section 2 covers related work. Section 3 describes our approach for modeling
label-independent characteristics of networks. Sections 4 and 5, respectively,
present our experimental design and results. We conclude the paper in Section 6.

2 Related Work

In recent years, there has been a great deal of work on models for learning and
inference in relational data (i.e., statistical relational learning or SRL) [3,4,5,6,7].
All SRL techniques make use of label-dependent relational information. Some
use label-independent information as well.
Relational Probability Trees (RPTs) [8] use label-independent degree-based
features (i.e., neighboring node and link counts). However, existing RPT studies
do not specifically consider the impact of label-independent features on classifier
performance.
Perlich and Provost [9] provide a nice study on aggregation of relational at-
tributes, based on a hierarchy of relational concepts. However, they do not con-
sider label-independent features.
Singh et al. [10] use descriptive attributes and structural properties (i.e., node
degree and betweenness centrality) to prune a network down to its ‘most infor-
mative’ affiliations and relationships for the task of attribute prediction. They
do not use label-independent features directly as input to their classifiers.
Neville and Jensen [11] use spectral clustering to group instances based on
their link structure (where link density within a group is high and between
groups is low). This group information is subsequently used in conjunction with
attribute information to learn classifiers on network data.
There has also been extensive work on overcoming label sparsity through tech-
niques for label propagation. This work falls into two research areas: (1) collective
classification [2,3,7,12,13,14] and (2) graph-based semi-supervised learning (SSL)
[15,16].
Previous work confirms our observation that the performance of collective
classification can suffer when labeled data is very sparse [3]. McDowell et al. [14]
demonstrate that “cautious” collective classification procedures produce better
classification performance than “aggressive” ones. They recommend only prop-
agating information about the top-k most confident predicted labels.

The problem of within-network classification can be viewed as a semi-supervised
learning problem. The graph-based approaches to semi-supervised learning are
particularly relevant here. In their study, Macskassy and Provost [7] compare the
SSL Gaussian Random Field (GRF) model [15] to an SRL weighted-vote relational
neighbor (wvRN) model that uses relaxation labeling for collective classification
(wvRN+RL). They conclude that the two models are nearly identical in terms of
accuracy, although GRF produces slightly better probability rankings. Our results
with wvRN+RL and GRF are consistent with this conclusion. The “ghost edge”
approach of Gallagher et al. [17] combines aspects of both SRL and SSL, and com-
pares favorably with both wvRN+RL and GRF.

3 Label-Dependent vs. Label-Independent Features


Relational classifiers leverage link structure to improve performance. Most fre-
quently, links are used to incorporate attribute information from neighboring
nodes. However, link structure can also be used to extract structural statistics
of a node (e.g., the number of adjacent links). We can divide relational features
into two categories: label-dependent and label-independent.
Label-dependent (LD) features use both structure and attributes (or labels)
of nodes in the network. The most commonly used LD features are aggregations
of the class labels of nodes one link away (e.g., the number of neighbors with the
class label ‘fraudulent’). LD features are the basis for incorporating relational
information in many SRL classifiers.
Label-independent (LI) features are calculated using network structure, but
not attributes or class labels of nodes. An example of a simple LI feature is
the degree of a node (i.e., the number of neighboring nodes). Of course, we
assume that there is an underlying statistical dependency between the class
label of a node and its LI features. Otherwise, LI features would be of no value
in predicting a node’s class. However, because they are calculated based only on
network structure, LI feature values do not directly depend on the current class
label assignments of nodes in a network. This means that, unlike LD features, LI
features may be calculated with perfect accuracy regardless of the availability of
class label information and are impervious to errors in class label assignments.

3.1 Extracting Label-Independent Features


We consider four LI features on nodes: (1) the number of neighboring nodes,
(2) the number of incident links, (3) betweenness centrality, and (4) clustering
coefficient. Features 1 and 2, respectively, are node-based and link-based mea-
sures of degree. Note that in multigraphs, these two are different. Betweenness
centrality measures how “central” a node is in a network, based on the number
of shortest paths that pass through it. Clustering coefficient measures neighbor-
hood strength, based on how connected a node’s neighbors are to one another.
For details, we refer the reader to a study by Mark Newman [18].
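As a concrete illustration, the four LI features can be computed directly from an edge list. The sketch below is a dependency-free toy implementation under the standard definitions (parallel edges are collapsed for the clustering and betweenness computations); the names `adj`, `link_count`, and the example graph are ours, not the paper’s.

```python
# Toy computation of the four LI features on an undirected multigraph.
from collections import defaultdict, deque
from itertools import combinations

edges = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]

adj = defaultdict(set)           # feature 1: distinct neighboring nodes
link_count = defaultdict(int)    # feature 2: incident links, counting parallels
for u, v in edges:
    adj[u].add(v); adj[v].add(u)
    link_count[u] += 1; link_count[v] += 1

def clustering(n):
    """Feature 4: fraction of neighbor pairs that are themselves linked."""
    nbrs = adj[n]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

def sp_counts(s):
    """BFS distances and shortest-path counts from s (parallel edges collapsed)."""
    dist, sigma, q = {s: 0}, defaultdict(float), deque([s])
    sigma[s] = 1.0
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(n):
    """Feature 3: shortest s-t paths through n, summed over pairs s, t != n."""
    dn, sn = sp_counts(n)
    total = 0.0
    for s, t in combinations([x for x in adj if x != n], 2):
        ds, ss = sp_counts(s)
        if n in ds and t in ds and ds[n] + dn[t] == ds[t]:
            total += ss[n] * sn[t] / ss[t]
    return total
```

In practice, a graph library such as networkx provides these measures out of the box; the point here is only to make the definitions concrete.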
The success of network-based structural characteristics as predictors of class
relies on two assumptions. First, members of different classes play different roles
in a network. Second, these roles can be differentiated by structural
characteristics. The second assumption is met in many cases. For instance, “popular” nodes
can be identified by degree and “important” nodes can be identified by central-
ity measures. Whether the first assumption is met depends on the class label.
Suppose that executives tend to be more popular and central than an average
employee in a company’s communication network, and that employees with a
particular job title tend to have similar popularity and centrality, regardless of
department. Then, we would expect structural features to be more useful for
identifying executives than members of a particular department.

4 Experimental Design

We have designed our experiments to answer the following questions:

1. Can LI features make up for a lack of information due to sparsely labeled
data?
2. Can LI features provide information above and beyond that provided by the
class labels?
3. How do LI features improve classification performance?
4. Which LI features are the most useful?

To avoid confounding effects as much as possible, we focus on univariate binary
classification tasks, and extend simple classifiers to incorporate label-independent
classification tasks, and extend simple classifiers to incorporate label-independent
features.

4.1 Classifiers

On each classification task, we ran ten individual classifiers: four variations of a
link-based classifier [5], four variations of a relational neighbor classifier [19,7],
and two variations of the Gaussian Random Field classifier [15]. We describe
each of them below.
nLB is the network-only link-based classifier [5]. It uses logistic regression to model
a node’s class given the classes of neighboring nodes. To generate features, a node’s
neighborhood is summarized by the link-weighted count of each class label. For
example, given a binary classification task, two features will be generated: (1) the
link-weighted count of a node’s neighbors with the positive class and (2) the link-
weighted count of a node’s neighbors with the negative class.
nLBLI is composed of two logistic regression models: (1) nLB with its two LD
features and (2) logLI, which uses the four LI features (see Section 3.1). The
nLBLI classifier calculates the probability of each class as:

P(C) = w · P_nLB(C) + (1 − w) · P_logLI(C)    (1)

where w is calculated based on the individual performance of nLB and logLI
over 10-fold cross validation on the training data. We calculate area under the
ROC curve (AUC) for each fold and then obtain an average AUC score for each
classifier, AUC_LD and AUC_LI. We then set w as follows:

    w = AUC_LD / (AUC_LD + AUC_LI)    (2)
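As a sketch, Equations (1) and (2) amount to a two-line blend. The AUC and probability values below are hypothetical placeholders, not results from the paper:

```python
# Blend a label-dependent model with the LI-only model, weighting by
# cross-validated AUC (Equations 1 and 2).

def blend_weight(auc_ld, auc_li):
    """Equation 2: weight on the label-dependent model."""
    return auc_ld / (auc_ld + auc_li)

def blended_prob(p_nlb, p_logli, w):
    """Equation 1: P(C) = w * P_nLB(C) + (1 - w) * P_logLI(C)."""
    return w * p_nlb + (1 - w) * p_logli

# Hypothetical 10-fold CV averages for the two component models
w = blend_weight(auc_ld=0.75, auc_li=0.60)
p = blended_prob(p_nlb=0.9, p_logli=0.4, w=w)
```

The weight simply shifts trust toward whichever component generalized better on the training folds.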
nLB+ICA uses the nLB classifier, but performs collective classification using
the ICA algorithm described in Section 4.2.
nLBLI+ICA uses the nLBLI classifier, but performs collective classification
using the ICA algorithm described in Section 4.2.
wvRN is the weighted-vote relational neighbor classifier [19,7]. It is a simple
non-learning classifier. Given a node i and its set of neighboring nodes, N, the
wvRN classifier calculates the probability of each class for node i as:

    P(C_i = c | N) = (1/L_i) · Σ_{j∈N} w_i,j · 1[C_j = c]    (3)

where w_i,j is the number of links between nodes i and j, L_i is the number of
links connecting node i to labeled nodes, and 1[·] equals 1 when its argument
holds and 0 otherwise. When node i has no labeled neighbors, we use the prior
probabilities observed in the training data.
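A minimal reading of Equation (3) in code; the names (`wvrn`, `link_weight`) and the toy numbers are illustrative, not from the paper:

```python
# wvRN: a node's class distribution is the link-weighted vote of its
# labeled neighbors, falling back to training priors when none exist.
from collections import Counter

def wvrn(neighbors, labels, link_weight, prior):
    """neighbors: ids adjacent to node i; labels: node -> known label (partial);
    link_weight[j]: number of links between i and j; prior: class -> prob."""
    votes = Counter()
    for j in neighbors:
        if j in labels:                  # unlabeled neighbors cast no vote
            votes[labels[j]] += link_weight[j]
    total = sum(votes.values())          # L_i: links to labeled neighbors
    if total == 0:
        return dict(prior)               # fall back to training priors
    return {c: votes[c] / total for c in votes}

known = {"b": "+", "c": "-"}             # node "d" is unlabeled
p = wvrn(["b", "c", "d"], known, {"b": 2, "c": 1, "d": 3}, {"+": 0.5, "-": 0.5})
```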
wvRNLI combines the LI features with wvRN in the same way that nLBLI
does with nLB (i.e., using a weighted sum of wvRN and logLI).
wvRN+ICA uses the wvRN classifier, but performs collective classification
using the ICA algorithm described in Section 4.2.
wvRNLI+ICA uses wvRNLI, but performs collective classification using the
ICA algorithm described in Section 4.2.
GRF is the semi-supervised Gaussian Random Field approach of Zhu et al. [15].
We made one modification to accommodate disconnected graphs. Zhu computes
the graph Laplacian as L = D − cW, where c = 1. We set c = 0.9 to ensure
that L is diagonally dominant and thus invertible. We observed no substantial
impact on performance in connected graphs due to this change.
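A sketch of how the modified solve might look. The paper states only the change to the Laplacian; solving L_uu f_u = c · W_ul f_l below follows Zhu et al.’s harmonic formulation with that Laplacian, and is our reading rather than the authors’ code:

```python
# GRF with L = D - cW, c = 0.9: unlabeled scores from labeled ones.
import numpy as np

def grf(W, labeled, f_l, c=0.9):
    """W: symmetric weight matrix; labeled: indices with known scores f_l.
    Returns {unlabeled index: harmonic score}."""
    n = W.shape[0]
    unlabeled = [i for i in range(n) if i not in labeled]
    D = np.diag(W.sum(axis=1))
    L = D - c * W                         # diagonally dominant for c < 1
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    W_ul = W[np.ix_(unlabeled, labeled)]
    f_u = np.linalg.solve(L_uu, c * W_ul @ np.asarray(f_l, float))
    return dict(zip(unlabeled, f_u))

# Path graph 0-1-2; node 0 labeled 1.0, node 2 labeled 0.0
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
scores = grf(W, labeled=[0, 2], f_l=[1.0, 0.0])
```

With c = 1 this reduces to Zhu’s harmonic solution (the middle node would score 0.5); c = 0.9 shrinks scores slightly toward zero in exchange for invertibility on disconnected graphs.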
GRFLI combines the LI features with GRF as nLBLI does with nLB (i.e., using
a weighted sum of GRF and logLI). We also tried the approach of Zhu et al.
[15], where one attaches a “dongle” node to each unlabeled node and assigns
it a label using the external LI classifier. The transition probability from node
i to its dongle is η and all other transitions from i are discounted by 1 − η.
This approach did not yield any improvements. So, we use the weighted sum
approach (i.e., Equation 1) for consistency.

4.2 Collective Classification

To perform collective classification, we use the iterative classification algorithm
(ICA) [7], with up to 1000 iterations. We chose ICA because (1) it is simple, (2) it
performs well on a variety of tasks, and (3) it tends to converge more quickly than
other approaches. We also performed experiments using relaxation labeling (RL)
[7]. Our results are consistent with previous research showing that the accuracy
of wvRN+RL is nearly identical to GRF, but GRF produces higher AUC values
[7]. We omit these results due to the similarity to GRF. For a comparison of
wvRN+RL and GRF on several of the same tasks used here, see Gallagher et
al. [17]. Overall, ICA slightly outperforms RL for the nLB classifier.
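A minimal ICA skeleton consistent with this description (bootstrap from known labels, then iterate to a fixed point or the iteration cap); the function names and the toy majority-vote classifier are illustrative, not the authors’ implementation:

```python
# Iterative classification: re-label each unlabeled node from its
# neighbors' current labels until labels stabilize or max_iter is hit.
def ica(nodes, neighbors, known, classify, max_iter=1000):
    """known: node -> fixed label; classify(node, labels) -> predicted label."""
    labels = dict(known)
    for n in nodes:                       # bootstrap from known labels only
        if n not in labels:
            labels[n] = classify(n, known)
    for _ in range(max_iter):
        changed = False
        for n in nodes:
            if n in known:
                continue
            new = classify(n, labels)     # uses current neighbor labels
            if new != labels[n]:
                labels[n], changed = new, True
        if not changed:
            break
    return labels

# Toy majority-vote classifier on a path a-b-c-d with ends labeled "+"
nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
vote = lambda n, lab: max(["+", "-"],
                          key=lambda c: sum(lab.get(j) == c for j in nbrs[n]))
out = ica(["a", "b", "c", "d"], nbrs, {"a": "+", "d": "+"}, vote)
```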
Several of our data sets have large amounts of unlabeled data since ground
truth is simply not available. In these cases, there are two reasonable approaches
to collective classification: (1) perform collective classification over the entire
graph and (2) perform collective classification over the core set of nodes only
(i.e., nodes with known labels).
In our experiments, attempting to perform collective classification over the
entire graph produced results that were often dramatically worse than the non-
collective base classifier. We hypothesize that this is due to an inadequate propa-
gation of known labels across vast areas of unlabeled nodes in the network. Note
that for some of our experiments, fewer than 1% of nodes are labeled. Other
researchers have also reported cases where collective classification hurts perfor-
mance due to a lack of labeled data [3,11]. We found that the second approach
(i.e., using a network of only the core nodes) outperformed the first approach in
almost all cases, despite disconnecting the network in some cases. Therefore, we
report results for the second approach only.

4.3 Experimental Methodology

Each data set has a set of core nodes for which we know the true class labels.
Several data sets have additional nodes for which there is no ground truth avail-
able. Classifiers have access to the entire graph for both training and testing.
However, we hide labels for 10%–90% of the core nodes. Classifiers are trained
on all labeled core nodes and evaluated on all unlabeled core nodes.
For each proportion labeled, we run 30 trials. For each trial, we choose a
class-stratified random sample containing 100 × (1.0 − proportion labeled)% of
the core nodes as a test set and the remaining core nodes as a training set.
Note that a single node will necessarily appear in multiple test sets. However,
we carefully choose test sets to ensure that each node in a data set occurs in
the same number of test sets over the course of our experiments; and therefore,
carries the same weight in the overall evaluation. Labels are kept on training
nodes and removed from test nodes. We use identical train/test splits for each
classifier. For more on experimental methodologies for relational classification,
see Gallagher and Eliassi-Rad [20].
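One trial’s split can be sketched as follows; the helper name and the rounding choice are our assumptions, not the paper’s exact procedure:

```python
# Class-stratified split: hold out 100*(1 - proportion_labeled)% of the
# core nodes per class as the test set; the rest become training nodes.
import random

def stratified_split(core_labels, proportion_labeled, seed):
    """core_labels: node -> class. Returns (train_nodes, test_nodes)."""
    rng = random.Random(seed)
    by_class = {}
    for n, c in core_labels.items():
        by_class.setdefault(c, []).append(n)
    test = []
    for c, members in sorted(by_class.items()):
        rng.shuffle(members)
        k = round(len(members) * (1 - proportion_labeled))
        test.extend(members[:k])          # per-class holdout preserves P(+)
    test_set = set(test)
    train = [n for n in core_labels if n not in test_set]
    return train, test

# 100 core nodes, 10% positive, 30% labeled -> 70 test nodes, 7 positive
labels = {i: ("+" if i < 10 else "-") for i in range(100)}
train, test = stratified_split(labels, proportion_labeled=0.3, seed=0)
```

Stratifying per class keeps the class skew of each test set equal to the data set’s overall P(+), which matters given the heavy imbalance in Table 1.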
We use the area under the ROC curve (AUC) to compare classifiers because it
is more discriminating than accuracy. In particular, since most of our tasks have a
large class imbalance (see Section 4.4), accuracy cannot adequately differentiate
between classifiers.
8 B. Gallagher and T. Eliassi-Rad

4.4 Data Sets

We present results on four real-world data sets: political book purchases [21],
Enron emails [22], Reality Mining (RM) cellphone calls [23], and high energy
physics publications (HEP-TH) from arXiv [24]. Our five tasks are to identify
neutral political books, Enron executives, Reality Mining students, Reality Min-
ing study participants, and HEP-TH papers with the topic “Differential Geom-
etry.” Table 1 summarizes the prediction tasks. The Sample column describes
the method used to obtain a data sample for our experiments: use the entire set
(full ), use a time-slice (time), or sample a continuous subgraph via breadth-first
search (BFS ). The Task column indicates the class label we try to predict. The
|V |, |L|, and |E| columns indicate counts of total nodes, labeled nodes, and total
edges in each network. The P (+) column indicates the proportion of labeled
nodes that have the positive class label (e.g., 12% of the political books are
neutral). For Enron, Reality Mining students, and HEP-TH, we have labels for
only a subset of nodes (i.e., the “core” nodes) and can only train and test our
classifiers on these nodes. However, unlabeled nodes and their connections to
labeled nodes are exploited to calculate LI features of the labeled nodes.

Table 1. Summary of Data Sets and Prediction Tasks

Data Set         Sample  Task                     |V|   |L|    |E|   P(+)
Political Books  Full    Neutral?                 105   105    441   0.12
Enron            Time    Executive?               9K    1.6K   50K   0.02
Reality Mining   BFS     Student?                 1K    84     32K   0.62
Reality Mining   BFS     In Study?                1K    1K     32K   0.08
HEP-TH           BFS     Differential Geometry?   3K    284    36K   0.06

5 Experimental Results

In this section, we discuss our results. We assess significance using paired t-tests
(p-values ≤ 0.05 are considered significant).¹
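For reference, the paired t statistic on matched per-trial scores can be computed as below (a generic sketch, not tied to the paper's exact trial setup):

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """t statistic of a paired t-test on matched scores (e.g., AUC of two
    classifiers over the same trials): mean difference / its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# three matched trials; classifier A beats B by 1, 2, and 3 points
t = paired_t([3, 5, 7], [2, 3, 4])
print(round(t, 3))  # 3.464, i.e., 2*sqrt(3)
```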

5.1 Effects of Learning Label Dependencies

Figures 2 and 3 show results for statistical relational learning and semi-supervised
learning approaches on all of our classification tasks. Supervised learning
approaches, like nLB, use labeled nodes as training data to build a dependency
model over neighboring class labels. The non-learning wvRN and GRF assume
¹ It is an open issue whether the standard significance tests for comparing classifiers
(e.g., t-tests, Wilcoxon signed-rank) are applicable for within-network classification,
(e.g., t-tests, Wilcoxon signed-rank) are applicable for within-network classification,
where there is typically some overlap in test sets across trials. It remains to be seen
whether the use of such tests produces a bias and the extent of any errors caused by
such a bias. This is an important area for future study that will potentially affect a
number of published results.
Leveraging Label-Independent Features 9

[Figure 2: ten panels of AUC vs. proportion of core nodes labeled, one row per task (Political Books, Enron Executives, Reality Mining Students, Reality Mining Study Participants, HEP-TH Differential Geometry Papers), with learning algorithms (nLB, nLB+ICA, nLBLI, nLBLI+ICA) on the left and non-learning algorithms (wvRN, wvRN+ICA, wvRNLI, wvRNLI+ICA) on the right.]

Fig. 2. Classification results for statistical relational learning approaches on our data
sets. For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty
of classification tasks, the y-axis scales are not consistent across tasks. However, for
a particular classification task, the y-axis scales are consistent across the algorithms
shown both in this figure and in Figure 3.

[Figure 3: five panels of AUC vs. proportion of core nodes labeled for the semi-supervised algorithms GRF and GRFLI, one panel per task: Political Books, Enron Executives, Reality Mining Students, Reality Mining Study Participants, and HEP-TH Differential Geometry Papers.]

Fig. 3. Classification results for semi-supervised learning approaches on our data sets.
For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty
of classification tasks, the y-axis scales are not consistent across tasks. However, for
a particular classification task, the y-axis scales are consistent across the algorithms
shown both in this figure and in Figure 2.

that class labels of neighboring nodes tend to be the same (i.e., high label consis-
tency). GRF performs well on the Enron and RM student tasks, which have high
label consistency between neighbors. On the RM study task, where neighboring la-
bels are inversely correlated (i.e., low label consistency), wvRN and GRF perform
poorly, whereas nLB can learn the correct dependencies.
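The label-consistency assumption behind wvRN can be made concrete with a minimal sketch (our simplification of the relational-neighbor idea in [19]; real implementations weight edges and handle inference more carefully):

```python
def wvrn(adj, known, iters=50):
    """weighted-vote Relational Neighbor: an unlabeled node's positive-class
    probability is the mean of its neighbors' current probabilities; nodes
    with known labels stay clamped to their value (1.0 or 0.0)."""
    probs = {v: known.get(v, 0.5) for v in adj}
    for _ in range(iters):
        probs = {v: known[v] if v in known else
                    sum(probs[u] for u in adj[v]) / len(adj[v])
                 for v in adj}
    return probs

# chain a - b - c with a positive and c negative: b ends up undecided
probs = wvrn({"a": ["b"], "b": ["a", "c"], "c": ["b"]},
             known={"a": 1.0, "c": 0.0})
print(probs["b"])  # 0.5
```

When neighboring labels are inversely correlated, as on the RM study task, this averaging pulls predictions in exactly the wrong direction, which is why a learned dependency model such as nLB wins there.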

5.2 Effects of Label-Independent Features

Figures 2 and 3 illustrate several effects of LI features. In general, the
performance of the LI classifiers degrades more slowly than that of the corresponding
base classifiers as fewer nodes are labeled. At ≤ 50% labeled, the LI features
produce a significant improvement in 36 of 45 cases. The exceptions mainly oc-
cur for GRF on Enron, RM Student, and HEP-TH, where (in most cases) we
have a statistical tie. In general, the information provided by the LI features is
able to make up, at least in part, for information lost due to missing labels. Note
that there are three separate effects that lower performance as the number of
labels decreases. (1) Fewer labels available for inference lead to lower quality LD
features at inference time, but do not impact the quality of LI features. (2) Fewer
labels at training time mean that (labeled) training examples have fewer labeled
neighbors. This impacts the quality of the LD features available at training time
and the quality of the resulting model. LI features are not affected. (3) Fewer
labels mean less training data. This impacts model quality for both LD and LI
features. Note that wvRN and GRF are affected only by (1), since they do not
rely on training data.
In general, the LI models outperform the corresponding base models, leading
to significant improvements in 49 out of 75 cases across all proportions of labeled
data. There is only one case where the use of LI features significantly degrades per-
formance: using GRF on the Enron task at 0.3 labeled. The GRF classifier does
so well on this task that the LI features simply add complexity without additional
predictive information. However, the degradation here is small compared to gains
on other tasks.
Another effect demonstrated in Figures 2 and 3 is the interaction between LI
features and label propagation (i.e., ICA or GRF). In several cases, combining
the two significantly outperforms either on its own (e.g., GRFLI on political
books and the RM tasks). However, the benefit is not consistent across all tasks.
The improved performance due to LI features on several tasks at 90% labeled
(i.e., political books, both RM tasks) suggests that LI features can provide infor-
mation above and beyond that provided by class labels. Recall that political books
and RM study are the only data sets fully labeled to begin with. This indicates
that LI features may have more general applicability beyond sparsely labeled data.
Figure 4 shows the sensitivity of classifiers to the specific nodes that are ini-
tially labeled. For each classifier and task, we measure variance in AUC across
30 trials. For each trial, a different 50% of nodes is labeled. ICA has very little
impact on the sensitivity of nLB to labeling changes. However, the LI features
decrease the labeling sensitivity of nLB dramatically for all but one data set.
The results for wvRN are qualitatively similar. LI features also decrease sensi-
tivity for GRF in most cases. Since GRF has low sensitivity to begin with, the
improvements are less dramatic. The observed reduction in label sensitivity is
not surprising since LI features do not rely on class labels. However, it suggests
that LI features make classifiers more stable. So, even in cases where LI features
do not increase average classifier performance, we expect them to improve
worst-case performance.

[Figure 4: variance in AUC at 50% labeled for the classifiers nLB, nLB+ICA, nLBLI, GRF, and GRFLI on each data set (HEP-TH, Enron, Political Books, RM In-Study, RM Student).]

Fig. 4. Sensitivity of classifiers to specific assignments of 50% known labels across data
sets

5.3 Performance of Specific LI Features


To understand which LI features contribute to the observed performance gains,
we re-ran our experiments using subsets of the LI features. We used logistic
regression with different combinations of the four LI features: each alone (4
classifiers), leave one feature out (4 classifiers), degree-based features only (1
classifier), non-degree-based features only (1 classifier), and all features (1
classifier).² This yields 11 classifiers. We present results for 50% of nodes labeled.
Results for other proportions labeled are similar.
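The eleven feature subsets can be enumerated mechanically (a sketch; the feature names are ours):

```python
from itertools import combinations

features = ("node count", "link count", "betweenness", "clust. coef.")
degree_based = {"node count", "link count"}

subsets = [frozenset([f]) for f in features]                  # each alone (4)
subsets += [frozenset(c) for c in combinations(features, 3)]  # leave-one-out (4)
subsets.append(frozenset(degree_based))                       # degree-based only
subsets.append(frozenset(set(features) - degree_based))       # non-degree only
subsets.append(frozenset(features))                           # all four
print(len(subsets))  # 11
```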
Figure 5 shows AUC using each LI feature alone vs. all features together. This
demonstrates the utility of each feature in the absence of any other information.

[Figure 5: AUC at 50% labeled for each LI feature alone (node count, link count, betweenness, clustering coefficient) and for all four together, on each data set (Enron, HEP-TH, Political Books, RM Students, RM Study).]

Fig. 5. Performance of LI features in isolation

Figure 6 shows the increase in AUC due to adding the specified feature to a
classifier that already has access to all other LI features. The y-axis is the AUC
² Degree-based features are node (or neighbor) count and link (or edge) count.
Non-degree-based features are betweenness and clustering coefficient.

[Figure 6: increase in AUC at 50% labeled from adding each feature or feature group (degree-based, non-degree, node count, link count, betweenness, clustering coefficient) on each data set (Enron, HEP-TH, Political Books, RM Students, RM Study).]

Fig. 6. Performance of LI features in combination. Degree-based features are node and link count. Non-degree features are betweenness and clustering coefficient.

of a classifier that uses all LI features minus the AUC of a classifier that uses all
except the specified feature. This demonstrates the power of each feature when
combined with the others.
All features appear to be useful for some tasks. Clustering coefficient is the
least useful overall, improving AUC slightly on two tasks and degrading AUC
slightly on three. For all tasks, a combination of at least three features yields
the best results. Interestingly, features that perform poorly on their own can
be combined to produce good results. On the RM student task, node count, be-
tweenness, and clustering coefficient produce AUCs of 0.57, 0.49, and 0.48 alone,
respectively. When combined, these three produce an AUC of 0.78. Betweenness,
which performs worse than random (AUC < 0.5) on its own, provides a boost
of 0.32 AUC to a classifier using node count and clustering coefficient.
For most tasks, performance improves due to using all four LI features. On
Enron, however, clustering coefficient appears to mislead the classifier to the
point where it is better to use either node or link count individually than to
use all features. This is one case where we might benefit from a more selective
classifier. Figure 7 compares logistic regression with a random forest classifier
[25], both using the same four LI features. As expected, the random forest is
better able to make use of the informative features without being misled by the
uninformative ones.

[Figure 7: AUC vs. proportion of core nodes labeled on the Enron Executives task for logistic regression and random forest, both using all four LI features.]

Fig. 7. Comparison of logistic regression and random forest classifiers with all
four LI features
To get a feel for why some LI features make better predictors than others, we
examine the distribution of each feature by class for each prediction task. Table 2
summarizes these feature distributions by their mean and standard deviation. In
general, we expect features that cleanly separate the classes to provide the most
predictive power. As mentioned previously, clustering coefficient appears to be
the least powerful feature overall for our set of prediction tasks. One possible
explanation for clustering coefficient’s generally poor performance is that it does
not vary enough from node to node; therefore, it does not help to differentiate
among instances of different classes.

Table 2. Mean and standard deviation (SD) of feature values by class and data set.
The larger mean value for each feature (i.e., row) is shown in bold.

Data Set/Feature Mean (SD) for the ‘+’ Class Mean (SD) for the ‘-’ Class
Political Books Neutral Other
Node Count 5.8 (3.3) 8.8 (5.6)
Link Count 5.8 (3.3) 8.8 (5.6)
Betweenness 0.027 (0.030) 0.019 (0.029)
Clust. Coef. 0.486 (0.25) 0.489 (0.21)
Enron Executive Other
Node Count 22 (27) 9.6 (20)
Link Count 61 (100) 25 (66)
Betweenness 0.0013 (0.0037) 0.00069 (0.0025)
Clust. Coef. 0.91 (0.77) 1.75 (4.5)
RM Student Student Other
Node Count 19 (27) 22 (38)
Link Count 471 (774) 509 (745)
Betweenness 0.027 (0.050) 0.022 (0.056)
Clust. Coef. 15 (22) 8.0 (7.0)
RM Study In-study Out-of-study
Node Count 18 (30) 1.4 (2.8)
Link Count 418 (711) 30 (130)
Betweenness 0.022 (0.048) 0.00086 (0.022)
Clust. Coef. 10 (17) 5.8 (51)
HEP-TH Differential Geometry Other
Node Count 14 (9.0) 21 (26)
Link Count 14 (9.0) 21 (26)
Betweenness 0.000078 (0.00010) 0.0011 (0.0056)
Clust. Coef. 0.42 (0.19) 0.40 (0.23)

[Figure 8: coefficient of variation of each LI feature (neighbor count, edge count, betweenness, clustering coefficient) on each prediction task (Political Books, Enron Title, RM In-Study, RM Position, HEP-TH Area).]
Fig. 8. Degree of variability for each LI feature on each prediction task

Figure 8 shows the degree of variability of each LI feature across the five
prediction tasks. To measure variability, we use the coefficient of variation, a
normalized measure of the dispersion of a probability distribution. The coefficient
of variation is defined as:
    cv(dist) = σ/μ                                            (4)
where μ is the mean of the probability distribution dist and σ is the standard
deviation. A higher coefficient of variation indicates a feature with more varied
values across instances in the data set.
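Equation (4) in code (a one-line sketch using the population standard deviation):

```python
import statistics

def coeff_of_variation(values):
    """cv = sigma / mu: standard deviation normalized by the mean, so features
    on very different scales (e.g., link count vs. betweenness) are comparable."""
    return statistics.pstdev(values) / statistics.fmean(values)

print(coeff_of_variation([2, 2, 2, 2]))  # 0.0 -- no variability
print(coeff_of_variation([1, 3]))        # 0.5 -- sigma = 1, mu = 2
```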
The variability of the clustering coefficient appears comparable to the degree
features (i.e., node and link count) (see Figure 8). We even observe that the de-
gree of variability of the clustering coefficient for the Enron task is higher than
the degree of variability for the neighbor count feature, even though neighbor
count provides much more predictive power (see Figure 5). So, clustering coeffi-
cient appears to have sufficient variability over the nodes in the graph. However,
it is possible that the clustering coefficient exhibits similar variability for nodes
of both classes; and thus, still fails to adequately distinguish between nodes of
different classes. Therefore, we wish to quantify the extent to which the feature
distributions can be separated from one another by class.
Figure 9 shows how well each LI feature separates the two classes for each
prediction task. We measure class separation by calculating the distance between
the empirical distributions of the LI feature values for each class. Specifically,
we use the Kolmogorov-Smirnov statistic (K-S) to measure the distance between
two empirical (cumulative) distribution functions:

    D(Fn(x), Gn(x)) = max(|Fn(x) − Gn(x)|)                    (5)
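Equation (5) evaluated on empirical samples (a self-contained sketch; scipy.stats.ks_2samp computes the same statistic):

```python
def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical cumulative distribution functions."""
    def ecdf(sample, t):
        return sum(1 for v in sample if v <= t) / len(sample)
    # the maximum gap occurs at one of the observed values
    candidates = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in candidates)

print(ks_distance([0, 0, 0], [5, 5, 5]))  # 1.0 -- fully separated classes
print(ks_distance([1, 2, 3], [1, 2, 3]))  # 0.0 -- identical distributions
```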

We observe from Figure 9 that, on average, the per-class distributions of values
for the clustering coefficient are more similar to one another than for other LI
features. So, although the clustering coefficient does vary from node to node,

[Figure 9: Kolmogorov-Smirnov distance between the per-class distributions of each LI feature (neighbor count, edge count, betweenness, clustering coefficient) on each prediction task (Political Books, Enron Title, RM In-Study, RM Position, HEP-TH Area).]
Fig. 9. Degree of class separation for each LI feature on each prediction task

the values do not differ consistently based on class. Therefore, clustering coeffi-
cient has a hard time distinguishing between instances of different classes, and
exhibits poor predictive power overall. The exception is on the Reality Mining
study-participant task, where we observe a high K-S distance (Figure 9) and
a correspondingly high classification performance (Figure 5). In fact, the K-S
distances in Figure 9 generally correspond quite well to the classification perfor-
mance we observe in Figure 5.

5.4 Observations about Our Problem Domains and Data Sets


Table 2 highlights a number of interesting characteristics of our data sets. An
examination of these characteristics provides insight into the underlying problem
domains. We describe several such insights here.
More politically extreme books tend to have higher degree (neighboring nodes
and adjacent edges) and clustering coefficient, but lower betweenness than the
neutral books. This tells us that there are two very different types of readers
represented in our network: (1) party loyalists that tend to have more extreme
viewpoints, strong agreement with others inside their party, and strong dis-
agreement with outsiders and (2) political moderates who are more inclined to
consider multiple differing perspectives on an issue.
Enron executives tend to have higher degree and betweenness, but lower clus-
tering coefficients than others. So, as we would expect, executives maintain more
relationships than other employees and are positioned in a place of maximal
control over information flow. The lower clustering coefficient suggests that ex-
ecutives maintain ties to multiple communities within the company and are less
closely tied to a particular community.
Reality Mining students tend to have higher betweenness and clustering co-
efficient, but lower degree than others. This indicates that students tend to be
more cliquey, with many of their communications made within a single strong

community. It also indicates that students play an important role in keeping


information flowing between more distant parts of the network.
Reality Mining study participants tend to have higher degree, betweenness, and
clustering coefficient than non-participants. These findings may reveal more about
how the data were collected than about the underlying problem domain, but they
are interesting nonetheless. Because their phones were instrumented with special
software, we have information on all calls of study participants. However, we have
only a partial view of the calls made and received by non-participants. More specif-
ically, the only calls we observe for non-participants are those to or from a study
participant. The result of this is that we end up with a central community of in-
terconnected study participants, surrounded by a large, but diffuse periphery of
non-participants. Thus, the participants appear to have more neighbors, higher
centrality, and a more closely knit surrounding community.
In HEP-TH, differential geometry papers tend to have higher clustering coefficient,
but lower degree and betweenness, than papers on other topics. This indicates that
differential geometry papers play a relatively isolated and peripheral role among
high-energy physics papers, at least in our subset of the arXiv data.

6 Conclusion
We examined the utility of label-independent features in the context of within-
network classification. Our experiments revealed a number of interesting findings:
(1) LI features can make up for large amounts of missing class labels; (2) LI fea-
tures can provide information above and beyond that provided by class labels
alone; (3) the effectiveness of LI features is due, at least in part, to their consistency
and their stabilizing effect on network classifiers; (4) no single label-independent
feature dominates, and there is generally a benefit to combining a few diverse LI
features. In addition, we observed a benefit to combining LI features with label
propagation, although the benefit is not consistent across tasks.
Our findings suggest a number of interesting areas for future work. These
include:
– Combining attribute-based (LD) and structural-based (LI) features of a net-
work to create new informative features for node classification. For instance,
will the number of short paths to nodes of a certain label or the average path
length to such nodes improve classification performance?
– Exploring the relationship between attributes and network structure in time-
evolving networks, where links appear and disappear and attribute values
change over time. For example, in such a dynamic network, could we use
a time-series of LI feature values to predict the values of class labels at a
future point in time?
Acknowledgments
We would like to thank Luke McDowell for his insightful comments. This work
was performed under the auspices of the U.S. Department of Energy by Lawrence
Livermore National Laboratory under contract No. W-7405-ENG-48 and No.
DE-AC52-07NA27344 (LLNL-JRNL-411529).

References

1. Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational
data. In: Proceedings of the 18th Conference on Uncertainty in AI, pp. 485–492
(2002)
2. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI Magazine 29(3), 93–106 (2008)
3. Neville, J., Jensen, D.: Relational dependency networks. Journal of Machine Learn-
ing Research 8, 653–692 (2007)
4. Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning probabilistic models of
link structure. Journal of Machine Learning Research 3, 679–707 (2002)
5. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th Interna-
tional Conference on Machine Learning, pp. 496–503 (2003)
6. Neville, J., Jensen, D., Gallagher, B.: Simple estimators for relational Bayesian
classifiers. In: Proceedings of the 3rd IEEE International Conference on Data Mining,
pp. 609–612 (2003)
7. Macskassy, S., Provost, F.: Classification in networked data: A toolkit and a uni-
variate case study. Journal of Machine Learning Research 8, 935–983 (2007)
8. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees.
In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 625–630 (2003)
9. Perlich, C., Provost, F.: Aggregation-based feature invention and relational concept
classes. In: Proceedings of the 9th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 167–176 (2003)
10. Singh, L., Getoor, L., Licamele, L.: Pruning social networks using structural prop-
erties and descriptive attributes. In: Proceedings of the 5th IEEE International
Conference on Data Mining, pp. 773–776 (2005)
11. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group
models. In: Proceedings of the 5th IEEE International Conference on Data Mining,
pp. 322–329 (2005)
12. Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hy-
perlinks. In: Proceedings of ACM SIGMOD International Conference on Manage-
ment of Data, pp. 307–318 (1998)
13. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational
classification. In: Proceedings of the 10th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 593–598 (2004)
14. McDowell, L., Gupta, K., Aha, D.: Cautious inference in collective classification. In:
Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pp. 596–601
(2007)
15. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using gaussian
fields and harmonic functions. In: Proceedings of the 20th International Conference
on Machine Learning, pp. 912–919 (2003)
16. Zhu, X.: Semi-supervised learning literature survey. Technical Report
CS-TR-1530, University of Wisconsin, Madison, WI (December 2007),
http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
17. Gallagher, B., Tong, H., Eliassi-Rad, T., Faloutsos, C.: Using ghost edges for clas-
sification in sparsely labeled networks. In: Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 256–264
(2008)

18. Newman, M.: The structure and function of complex networks. SIAM Review 45,
167–256 (2003)
19. Macskassy, S., Provost, F.: A simple relational classifier. In: Notes of the 2nd
Workshop on Multi-relational Data Mining at KDD 2003 (2003)
20. Gallagher, B., Eliassi-Rad, T.: An examination of experimental methodology for
classifiers of relational data. In: Proceedings of the 7th IEEE International Confer-
ence on Data Mining Workshops, pp. 411–416 (2007)
21. Krebs, V.: Books about U.S. politics (2004),
http://www.orgnet.com/divided2.html
22. Cohen, W.: Enron email data set, http://www.cs.cmu.edu/~enron/
23. Eagle, N., Pentland, A.: Reality mining: sensing complex social systems. Journal
of Personal and Ubiquitous Computing 10(4), 255–268 (2006),
http://reality.media.mit.edu
24. Jensen, D.: Proximity HEP-TH database,
http://kdl.cs.umass.edu/data/hepth/hepth-info.html
25. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
Community Detection Using a Measure of
Global Influence

Rumi Ghosh and Kristina Lerman

USC Information Sciences Institute, Marina del Rey, California 90292


{ghosh,lerman}@isi.edu

Abstract. The growing popularity of online social networks gave researchers
access to large amounts of network data and renewed interest
in methods for automatic community detection. Existing algorithms, in-
cluding the popular modularity-optimization methods, look for regions
of the network that are better connected internally, e.g., have higher than
expected number of edges within them. We believe, however, that edges
do not give the true measure of network connectivity. Instead, we argue
that influence, which we define as the number of paths, of any length,
that exist between two nodes, gives a better measure of network con-
nectivity. We use the influence metric to partition a network into groups
or communities by looking for regions of the network where nodes have
more influence over each other than over nodes outside the community.
We evaluate our approach on several networks and show that it often
outperforms the edge-based modularity algorithm.

Keywords: community, social networks, influence, modularity.

1 Introduction
Communities and social networks have long interested researchers [5,13]. How-
ever, one of the main problems faced by the early researchers was the difficulty
of collecting empirical data from human subjects [5]. The advent of the internet
and the growing popularity of online social networks changed that, giving
researchers access to huge amounts of social interaction data. This, coupled with
the ever increasing computation speed, storage capacity and data mining capa-
bilities, led to the reemergence of interest in the social networks in general, and
community detection specifically.
Many existing community finding algorithms look for regions of the network
that are better connected internally and have fewer connections to nodes outside
the community [4]. Graph partitioning methods [7,27], for example, attempt to
minimize the number of edges between communities. Modularity maximization-
based methods, on the other hand, identify groups of nodes that have a higher
than expected number of edges within them [22,21,24,23]. We believe, however,
that edges do not give the true measure of network connectivity. We generalize
the notion of network connectivity to be the number of paths, of any length, that
exist between two nodes (Section 2). We argue that this metric, called influence

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 20–35, 2010.

© Springer-Verlag Berlin Heidelberg 2010

by sociologists [13], because it measures the ability of one node to affect (e.g.,
send information to) another, gives a better measure of connectivity between
nodes. We use the influence metric to partition a (directed or undirected) network
into groups or communities by looking for regions of the network where nodes
have more influence over each other than over nodes outside the community. In
addition to discovering natural groups within a network, the influence metric
can also help identify the most influential nodes within the network, as well as
the “weak ties” who bridge different communities. We formalize our approach by
describing a general mathematical framework for representing network structure
(Section 3). We show that the metric used for detecting communities in random
walk models, modularity-based approaches, and influence-based modularity are
special cases of this general framework. We evaluate our approach (in Section 4)
on the standard data sets used in literature, and find performance at least as
good as that of the edge-based modularity algorithm.

2 A Measure of Global Influence

If a network has N nodes and E links, it can be graphically represented by
G(N, E), where N is the number of vertices and E is the number of edges. Edges
are directed; however, if there exists an edge from vertex i to j and also from j
to i, it is represented as an undirected edge. A path p is an n-hop path from i
to j if there are n vertices between i and j along the path. We allow paths
to be non-self-avoiding, meaning that the same edge may be traversed more
than once. The graph G(N, E) can be represented by an adjacency matrix A,
whose elements are defined as Aij = 1 if ∃ an edge from vertex i to j; otherwise,
Aij = 0. A is symmetric for undirected graphs.
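Counting paths with the adjacency matrix can be sketched in a few lines of numpy (the tiny directed graph is hypothetical): the (i, j) entry of A·A···A with n factors counts the directed walks of length n from i to j.

```python
import numpy as np

# directed graph on 4 nodes with edges 0->1, 0->2, 1->2, 2->3
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

A2 = np.linalg.matrix_power(A, 2)  # counts walks of length 2
print(A2[0, 2])  # 1  (0 -> 1 -> 2)
print(A2[0, 3])  # 1  (0 -> 2 -> 3)
print(A2[0, 1])  # 0  (no two-step route to node 1)
```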
The Oxford English Dictionary defines influence as “the capacity to have an
effect on the character, development, or behavior of someone or something, or
the effect itself.” The measure of influence that we adopt is along the lines of
Pool and Kochen [5], who state that “influence in large part is the ability to reach a
crucial man through the right channels, and the more the channels in reserve the
better.” This metric depends not only on direct edges between nodes, but also
on the number of ways an effect or a message can be transmitted through other
nodes. Therefore, the capacity of node i to influence node j can be measured
by the weighted sum of the number of n-hop paths present from the i to j. The
underlying hypothesis is that the more the number of paths from one node to
another, the greater is the capacity to influence. This model is analogous to the
independent cascade model of information spread [11,14].
The strength of the effect via longer paths is likely to be lower than via shorter
paths. We model the attenuation using parameters αi where αi (0 ≤ αi ≤ 1)
is the probability of transmission of effect in the (i−1)th hop. Let us consider
transmitting an effect or a message from node b to node c in the network shown in
Figure 1. The probability of transmission via the immediate neighbors of c such
as e to c or g to c is α1 . The probability of transmission over 1-hop paths such
as b to c via e is α1 α2 . In general, the probability of a transmission along an
n-hop path is Π_{i=1..n+1} αi. The total influence of b on c thus depends on the number
of (attenuated) channels between b and c, or the sum of all the weighted paths
from b to c. This definition of influence makes intuitive sense, because the greater
the number of paths between b and c, the more opportunities there are for b to
transmit messages to c or to affect c.

Fig. 1. A directed graph representing a network
For ease of computation we simplify this model by taking α1 = β and αi = α,
∀i ≠ 1. β is called the direct attenuation factor and is the probability of
transmission of effect directly between adjacent nodes. α is the indirect attenuation
factor and is the probability of transmission of effect through intermediaries. If
α = β, i.e., the probability of transmission of effect through all links is the same,
then this index reduces to the metric used to find the Katz status score [13].
The number of paths from i to j with n intermediaries is given by the (i, j)
entry of An = A · A · · · A (n + 1 factors) = A(n−1) · A. Adding weights to take
into account the attenuation of effect, the weighted total capacity of i to
influence j is the (i, j) entry of βA0 + βαA1 + · · · + βα^n An + · · ·, with
A0 = A. We represent this weighted total capacity to influence by the influence
matrix P:

    P = βA0 + βαA1 + · · · + βα^n An + · · ·
      = βA(I − αA)^−1 ,                                       (1)

As mentioned by Katz [13], the equation holds as long as α < 1/λ, where λ is the
largest characteristic root of A [6].
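Equation (1) reduces to a single matrix inversion in practice. The following is a minimal NumPy sketch; the toy adjacency matrix and the choice α = 0.9/λ are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy directed graph (illustrative only; not from the paper).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

lam = max(abs(np.linalg.eigvals(A)))  # largest characteristic root of A
beta = 1.0                            # direct attenuation factor
alpha = 0.9 / lam                     # indirect attenuation; must satisfy alpha < 1/lambda

# Influence matrix of Eq. (1): P = beta * A * (I - alpha*A)^{-1}
P = beta * A @ np.linalg.inv(np.eye(len(A)) - alpha * A)
```

As a sanity check, the closed form agrees term by term with the truncated series βA + βαA^2 + βα^2 A^3 + · · · whenever α < 1/λ.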
We use the influence matrix to help find community structure in a network.
We claim (without much theoretical or empirical support) that a community is
composed of individuals who have a greater capacity to influence others within
their community than outsiders. As a result, actions of community members will
tend to become correlated with time, whether by adopting a new fashion trend,
Community Detection Using a Measure of Global Influence 23
vocabulary, watching a movie, or buying a product. Armed with this alternative
definition of community, we adapt the modularity maximization-based approach to
identifying communities.
2.1 Influence-Based Modularity
The objective of the algorithms proposed by Newman and coauthors is to dis-
cover “community structure in networks — natural divisions of network nodes
into densely connected subgroups” [25]. They proposed modularity as a measure
for evaluating the strength of the discovered community structure. Algorithmi-
cally, their approach is based on finding groups with higher than expected edges
within them and lower than expected edges between them [22,21,23]. The mod-
ularity Q optimized by the algorithm is given by:
Q = (fraction of edges within community) − (expected fraction of such edges).
Thus, Q is used as a numerical index to evaluate a division of the network.
The underlying idea, therefore, is that connectivity of nodes within a commu-
nity is greater than that of nodes belonging to different communities, and they
take the number of edges as the measure of connectivity. However, we claim that
path-based, rather than edge-based, connectivity is the true measure of network
connectivity. Consider again the graph in Figure 1, where there exists an edge
from a to c but not from b to c. Clearly, however, c is not unconnected from b, as
there are several distinct paths from b to c. We use the influence matrix, which
gives the global connectivity of the network, to identify communities.
We redefine modularity as Q =(connectivity within the community) - ( ex-
pected connectivity within the community) and adopt the influence matrix P
as the measure of connectivity. This definition implies that in the best division
of the network, the influence of nodes within their community is greater than
their influence outside their community. A division of the network into communi-
ties, therefore, maximizes the difference between the actual capacity to influence
and the expected capacity to influence, given by the capacity to influence in an
equivalent random graph.
Let us denote the expected capacity to influence by an N × N matrix P̄ . We
round off the values of Pij to the nearest integer values. Modularity Q can then
be expressed as

Q = Σ_{ij} [Pij − P̄ij ] δ(si , sj )    (2)

where si is the index of the community i belongs to and δ(si , sj ) = 1 if si = sj ;
otherwise, δ(si , sj ) = 0. When all the vertices are placed in a single group, then
axiomatically, Q = 0. Therefore Σ_{ij} [Pij − P̄ij ] = 0. Hence, the total capacity to
influence W is

W = Σ_{ij} P̄ij = Σ_{ij} Pij    (3)

Hence the null model has the same number of vertices N as the original model,
and in it the expected influence of the entire network equals the actual influence
of the original network. We further restrict the choice of null model to that
where the expected influence on a vertex j, W_j^in , is equal to the actual influence
on the corresponding vertex in the real network:

W_j^in = Σ_i P̄ij = Σ_i Pij    (4)

Similarly, we also assume that in the null model, the expected capacity of a
vertex i to influence others, W_i^out , is equal to the actual capacity to influence of
the corresponding vertex in the real network:

W_i^out = Σ_j P̄ij = Σ_j Pij .    (5)
In order to compute the expected influence, we reduce the original graph G to
a new graph G′ that has the same number of nodes as G and total number of
edges W , such that each edge has weight 1 and the number of edges between
nodes i and j in G′ is Pij . The expected influence between nodes i and j in
graph G can then be taken as the expected number of edges between nodes i
and j in graph G′ , and the actual influence between nodes i and j in graph G
can be taken as the actual number of edges between node i and node j in graph
G′ . An equivalent random graph G′′ is used to find the expected number of
edges from node i to node j. In this graph the edges are placed at random
subject to the constraints:
– The total number of edges in G′′ is W .
– The out-degree of a node i in G′′ = out-degree of node i in G′ = W_i^out .
– The in-degree of a node j in graph G′′ = in-degree of node j in graph G′ =
W_j^in .
Thus in G′′ the probability that an edge emanates from a particular vertex i
is dependent only on the out-degree of that vertex; the probability that an edge
is incident on a particular vertex j is dependent only on the in-degree of that
vertex; and the probabilities of the two vertices being the two ends of a single
edge are independent of each other. In this case, the probability that an edge
exists from i to j is given by P (emanates from i) · P (incident on j ) =
(W_i^out /W )(W_j^in /W ). Since the total number of edges in G′′ is W , the expected
number of edges between i and j is W · (W_i^out /W )(W_j^in /W ) = W_i^out W_j^in /W = P̄ij ,
the expected influence between i and j in G.
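Under these constraints the expected influence works out to P̄ij = W_i^out W_j^in / W, and Q of Eq. (2) then follows directly. A minimal sketch, with a toy influence matrix and partition of our own choosing:

```python
import numpy as np

# A toy influence matrix P (as would come from Eq. 1) and a candidate partition.
P = np.array([[0.0, 2.0, 0.1],
              [2.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
s = np.array([0, 0, 1])             # community index of each node

W_out = P.sum(axis=1)               # actual capacity of each node to influence
W_in = P.sum(axis=0)                # actual influence on each node
W = P.sum()                         # total capacity to influence (Eq. 3)

P_bar = np.outer(W_out, W_in) / W   # expected influence in the null model
delta = (s[:, None] == s[None, :])  # delta(s_i, s_j)
Q = ((P - P_bar) * delta).sum()     # influence-based modularity (Eq. 2)
```

With all vertices placed in a single group, δ(si, sj) ≡ 1 and Q collapses to Σ_{ij}[Pij − P̄ij] = 0, exactly as the derivation requires.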
2.2 Detecting Community Structure
Once we have derived Q, we have to select an algorithm to divide the network
into communities that optimize Q. Brandes et al. [3] have shown that the de-
cision version of modularity maximization is NP-complete. Like others [23,17],
we use the leading eigenvector method to obtain an approximate solution.
In [9] we applied this approach to the standard data sets used in the literature,
and found performance at least as good as that of the edge-based modularity
algorithm. As can be mathematically derived from the formulation, we find that
the communities detected are independent of the value of β. So, henceforth,
without loss of generality, we shall assume β = 1. In Section 4 we use this
approach to partition several example networks into communities.
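A sketch of one bisection step in this spirit: take the leading eigenvector of the symmetrized matrix B = P − P̄ and split nodes by its sign. This is our own simplified rendering of the leading eigenvector method, not the authors' implementation:

```python
import numpy as np

def bisect_by_leading_eigenvector(P):
    """One bisection of a network with influence matrix P: split nodes by
    the sign of the leading eigenvector of B = P - P_bar (symmetrized)."""
    W_out, W_in, W = P.sum(axis=1), P.sum(axis=0), P.sum()
    B = P - np.outer(W_out, W_in) / W    # influence-based modularity matrix
    B = (B + B.T) / 2                    # symmetrize so the spectrum is real
    vals, vecs = np.linalg.eigh(B)
    leading = vecs[:, np.argmax(vals)]   # eigenvector of the largest eigenvalue
    return leading >= 0                  # two groups, split by sign
```

Repeating the bisection on each part, until no further split increases Q, yields the full division.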
3 A Generalized Model of Influence
In this section, we present a mathematical framework that generalizes the notion
of influence. In algebraic topology a k-simplex, with k ≥ 0, is the convex hull σ
of k + 1 linearly independent points v0 , v1 , . . . , vk and has dimension k. The
points vi are called the vertices of σ. Let σ = {v0 , v1 , . . . , vk } be a k-simplex
and let ω = {w0 , . . . , wl } be a nonempty subset of σ, where wi ≠ wj if i ≠ j.
Then ω = {w0 , w1 , . . . , wl } is called an l-dimensional face of σ. A simplicial
complex K is a finite collection of simplices in some Rn satisfying:
– If σ ∈ K, then all faces of σ belong to K.
– If σ1 , σ2 ∈ K, then either σ1 ∩ σ2 = Ø or σ1 ∩ σ2 is a common face of σ1
and σ2 .
The dimension of K is defined to be −1 if K = Ø and the maximum of the
dimensions of its simplices otherwise. An undirected graph can then be viewed as a
simplicial complex with a single-element set per vertex and a two-element set per edge.
Suppose we are given finite sets X = {x1 , x2 , . . . , xn } and Y = {y1 , y2 , . . . , ym },
and a binary relation γ ⊆ X × Y between elements of X and elements of Y (X
and Y could be the same). Then the relation γ may be expressed as an n × m
incidence matrix A(γ) = (Aij ) where Aij = 1 if (xi , yj ) ∈ γ and 0 otherwise.
Each row in the incidence matrix A(γ) may be viewed as a simplex in the
following way: Let Y be a set of vertices. The i-th row of A(γ) can be identified
with a k-dimensional simplex {yj1 , yj2 , . . . , yj(k+1) } = σk (xi ) on the vertices Y
(where Aij = 1). Thus each xi ∈ X determines (with γ) a row of A(γ), and each
row of A(γ) can be identified with a simplex. The set of simplices is a simplicial
complex denoted by KX (γ, Y ). Since an arbitrary element xi is γ-related to
exactly k + 1 elements yj , σk (xi ) is distinguished as a named simplex. If we let d
denote the maximum dimension of KX (γ, Y ), we immediately see that d ≤ m − 1.
Let σ and τ be two simplices in KX (γ, Y ). Then σ and τ are q-near if they
have a common q-face, i.e., their intersection contains at least q + 1 elements.
(This q-face need not be an element of the simplex family.) Then τ and σ are
q-connected if there exists a sequence σ1 , σ2 , . . . , σp of simplices in KX (γ, Y ),
such that σ1 = σ, σp = τ , and σi is q-near to σi+1 for all 1 ≤ i ≤ p − 1. Thus
q-connectivity is the transitive closure of q-nearness. Q analysis using q-nearness
and q-connectivity was used by Atkin [2] to deal with pairs of sets and sets of
contextual relations.
As Legrand [16] points out, q-nearness and q-connectivity are not necessarily
a true measure of how similar the vertices are to each other, for which the
length of sequence of q-connectivity should be the true indicator. We therefore
take the length of sequences into account by calculating how q-near vertex i is
from vertex j, making it dependent on the length of the path between them.
Therefore the adjacency matrix A, with Aij = q1ij , shows whether two simplices
are zero-near to one another via a 0-hop path. The product A2 = A × A gives
the value of q2ij such that (A2 )ij = q2ij , i.e., vertex i and vertex j, when
separated by a one-hop path, are q-near each other with q = q2ij − 1. In the
same way, A3 = A × A × A gives (A3 )ij = q3ij , showing that vertices i and j
connected by a two-hop path are (q3ij − 1)-near each other. We then take the
length of the sequence into account to calculate the expected q-nearness of one
vertex to another by taking the weighted average of the q-nearness over paths of
varying lengths.
The expected value of qij between two elements i and j, such that they are
expected to be (qij − 1)-near each other, with qkij = (A^k )ij , is:

E(qij ) = (W1 · q1ij + W2 · q2ij + . . . + Wn · qnij + . . .) / Σ_{i=1}^{∞} Wi    (6)
This expected value can be used to find out how connected two vertices are to
each other, taking paths of all lengths into account. Note that Wi can be a scalar
or a vector.
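For scalar weights, Eq. (6) can be evaluated by accumulating powers of A; the truncation to finitely many path lengths and the function name below are our own:

```python
import numpy as np

def expected_q_nearness(A, weights):
    """E(q_ij) of Eq. (6): weighted average of q_{k,ij} = (A^k)_ij over
    paths of increasing length, truncated to len(weights) terms."""
    E = np.zeros_like(A, dtype=float)
    Ak = np.eye(len(A))
    for w in weights:          # w = W_1, W_2, ... in order
        Ak = Ak @ A            # A^k holds the q-nearness for (k-1)-hop paths
        E += w * Ak
    return E / sum(weights)
```

With weights = [1] (i.e., α1 = 1 and all other weights zero) this returns A itself, the edge-based special case; with geometric weights it is proportional to the Katz sum.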
This formulation allows us to generalize different network models for com-
munity detection and scoring like the random walk model [28,29,31], the Katz
model [13] of status score, and the influence-based model. In random walk mod-
els, a particle starts a random walk from node i. The particle iteratively tran-
sitions to its neighbors with probability proportional to the corresponding edge
weights. Also at each step, the particle returns to node i with some restart
probability (1 − c). The proximity score from node i to node j is defined as
the steady-state probability ri,j that the particle will be on node j [29]. These
models can be shown to be special cases of the formulation of the expected
q-nearness (without loss of generality we assume that T is an n × n matrix):
1. If Wk = c^{k−1} · D^{−(k−1)} , where c is a constant and D is an n × n diagonal
matrix with Dii = Σ_{j=1}^{n} Aij ; then the expected q-nearness score
reduces to the proximity score in the random walk model [28,29].
2. If Wi = Π_{j=1}^{i} αj , where the scalar αj is the attenuation factor of the (j − 1)-th
hop in an (i − 1)-hop path, then the expected q-nearness reduces to the metric
used to find the influence score and represented by the influence matrix. For ease
of computation of the influence matrix, we have taken α1 = β and αi = α
∀i ≠ 1. As stated before, α < 1/λ, where λ is the largest characteristic root of
A. Gershgorin's Circle Theorem (1931) gives the simple sufficient condition
α < 1/maxi (Dii ).
3. When β = α, this in turn reduces to the metric used to find the Katz status
score [13] with α as the attenuation factor.
4. When α1 = 1 and α2 = . . . = αn = . . . = 0, the expected q-nearness is the
q-nearness of the 0-hop path, which is the metric used to calculate similarity in
edge-based modularity approaches [22].
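The Gershgorin bound quoted in item 2 can be checked numerically; the random test matrix below is our own:

```python
import numpy as np

# Gershgorin's circle theorem: for a zero-diagonal A, every eigenvalue lies
# within max_i sum_j |A_ij| of the origin, so alpha < 1/max_i(D_ii) is a
# simple sufficient condition for alpha < 1/lambda.
rng = np.random.default_rng(0)
A = (rng.random((20, 20)) < 0.2).astype(float)  # random directed graph
np.fill_diagonal(A, 0.0)

D_max = A.sum(axis=1).max()            # max_i D_ii (largest row sum)
lam = max(abs(np.linalg.eigvals(A)))   # largest characteristic root
assert lam <= D_max + 1e-9             # hence 1/D_max <= 1/lambda
```

The condition based on row sums is cheaper to compute than the spectral radius, at the cost of being more conservative.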
In summary, the capacity to influence is a measure of the expected q-nearness
between vertices. Liben-Nowell and Kleinberg [20] have shown that the Katz measure
is the most effective measure for the link prediction task. The influence score, which
is a generalization of the Katz score, can then be used to find communities as
described in Section 2.1.
4 Evaluation
We applied the influence-based community finding method to small networks
studied previously in the literature, as well as to the friendship network extracted from the
social photosharing site Flickr. On all the data sets we studied, the performance
of the influence-based modularity optimization algorithm was at least as good as
that of the edge-based modularity (α = 0 case). In several cases, the influence-
based approach led to purer groups.
4.1 Zachary’s Karate Club
The karate club data represents the friendship network of members of a karate
club studied by Zachary [30]. During the course of the study, a disagreement
developed between the administrator and the club’s instructor, resulting in the
division of the club into two factions, represented by circles and squares in Fig-
ure 2. We used this data to study the communities detected for different values
of α and compare the performance of the influence-based modularity approach
to Newman's community-finding algorithm (which is the special case where
α = 0) [21].
Using α < 1/λ (Section 2) we get the upper bound α ≤ 0.29. When
both Newman's edge-based modularity maximization approach and our method
(0 ≤ α ≤ 0.29) are used to bisect the network into just two communities, we
recover the two factions observed by Zachary (Figure 2(c)). However, when the
algorithms are run until a termination condition is reached (no more bisections
are possible), different values of α lead to different results, as shown in Figure 2.
As stated, when α = 0, the method reduces to Newman's edge-based modularity
maximization approach [21], and we get four communities (Figure 2(a)). For
0 < α < 0.14 the number of communities reduces to three (Figure 2(b)). As α is
increased further (0.14 ≤ α ≤ 0.29) we get two communities (Figure 2(c)), which
are the same as the factions found in Zachary's study.
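This experiment can be reproduced with NetworkX's built-in copy of Zachary's data; the sketch below uses our own choice of α just below the convergence bound, so it illustrates a single bisection rather than the exact boundary values reported above:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()               # Zachary's karate club (34 nodes)
A = nx.to_numpy_array(G)

lam = max(abs(np.linalg.eigvals(A)))
alpha, beta = 0.9 / lam, 1.0             # alpha < 1/lambda; beta = 1 WLOG
P = beta * A @ np.linalg.inv(np.eye(len(A)) - alpha * A)

# One leading-eigenvector bisection of the influence-based modularity matrix.
B = P - np.outer(P.sum(axis=1), P.sum(axis=0)) / P.sum()
B = (B + B.T) / 2
vals, vecs = np.linalg.eigh(B)
group = vecs[:, np.argmax(vals)] >= 0    # two discovered communities
```

The discovered bisection can then be compared against the faction labels ('Mr. Hi' vs. 'Officer') that NetworkX stores on each node.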
Fig. 2. Zachary’s karate club data. Circles and squares represent the two actual fac-
tions, while colors stand for discovered communities as the strength of ties increases:
(a) α = 0 (b) 0 < α < 0.14 (c) 0.14 ≤ α ≤ 0.29.
4.2 College Football
We also ran our approach on the US College football data from Girvan et al. [10].1
The network represents the schedule of Division 1 games for the 2000 season
where the vertices represent teams and the edges represent the regular season
game between the two teams they connect. The teams are divided into “confer-
ences” (or communities) containing 8 to 12 teams each. Games are more frequent
between members of the same conference than members of different conferences.
Inter-conference games, however, are not uniformly distributed, with teams that
are geographically closer likely to play more games with one another than teams
separated by large geographic distances. However, some conferences have teams
playing nearly as many games against teams in other conferences as against teams
within their own conference. This leads to the intuition that conferences may not
be the natural communities; the natural communities may actually be bigger
than conferences, with teams that play many games against one another being
put into the same community.
Fig. 3. The graph showing the purity of communities predicted with different values
of α and β in the (a) college football and (b) political books data sets. We see that
purity increases with α, and is independent of β. When α = 0, the method reduces to
eigenvector based modularity maximization method postulated by Newman [23].
We measure the quality of discovered communities in terms of purity. The
purity of a community is the fraction of all pairs of teams that were assigned
to that community that actually belong to the same conference. The quality
of a network division produced by an algorithm is the average purity of the
discovered communities. Figure 3 shows the purity of the discovered communities
as α is varied. Purity is independent of β, the weight of direct edges, but increases
with α, reaching ∼ 90% near α = 0.1 (the upper bound on α is determined by
the reciprocal of the largest eigenvalue of the adjacency matrix). When α = 0 and
the modularity reduces to the edge-based modularity studied by Newman [23], the
purity is around 72%. The number of predicted groups changes from eight at α = 0
to four at higher values of α.
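The pair-based purity described above can be written down in a few lines; the function names and toy labels are our own:

```python
from itertools import combinations

def community_purity(members, label):
    """Fraction of pairs assigned to one community whose members share
    the same ground-truth label (e.g., the same conference)."""
    pairs = list(combinations(members, 2))
    if not pairs:                      # singleton community: trivially pure
        return 1.0
    same = sum(label[a] == label[b] for a, b in pairs)
    return same / len(pairs)

def division_purity(communities, label):
    """Quality of a division: average purity of the discovered communities."""
    return sum(community_purity(c, label) for c in communities) / len(communities)
```

For example, a community mixing two members of one conference with one outsider has purity 1/3, since only one of its three pairs shares a conference.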
1 http://www-personal.umich.edu/∼mejn/netdata/
4.3 Political Books
We evaluated our approach on the political books data compiled by V. Krebs.2
In this network the nodes represent books about US politics sold by the online
bookseller Amazon. Edges represent frequent co-purchasing of books by the same
buyers, as indicated by the “customers who bought this book also bought these
other books” feature of Amazon. This feature influences the book purchasing
decisions of customers. The nodes were given labels liberal, neutral, or conserva-
tive by Mark Newman on a reading of the descriptions and reviews of the books
posted on Amazon.3 49 of the books were marked as conservative, 43 books were
marked as liberal and 13 books were marked as neutral. We use our algorithm
to find the existing community structure in the network by varying parameters
as shown in Figure 3. Purity is independent of the value of β, and similarly to
the football data, as α increases, the number of communities decreases (from
four at α = 0 to two at α = 0.08). Also the purity of the communities increases
from 60% at α = 0 to 92% at α = 0.08. Again, α = 0 corresponds to Newman’s
modularity method. Another observation is that at α = 0.08, which leads to the
formation of two groups, only the neutral books are split between the groups,
indicating that some of the 13 neutral books may have been conservatively inclined
and some liberally.
4.4 Flickr Social Network
We also ran our algorithm on the social network data collected from Flickr
for the image search personalization study [18]. Flickr is a social photosharing
site that allows users to upload images, tag them with descriptive keywords,
known as tags, and to join social networks by adding other users as contacts. We
believe that the network structure, created by independent decisions to add another
photographer as a contact, captures social knowledge, including knowledge about
users’ photography interests. Thus, users who are interested in a particular topic
are more likely to be connected than users interested in different topics.
Since the actual social network on Flickr is rather vast, we sampled it by
identifying users who were broadly interested in one of three topics: child and
family portraiture, wildlife photography and technology. For each topic, we used
the Flickr API to perform a tag search using a keyword relevant to that topic,
and retrieved 500 ‘most interesting’ images. We then extracted the names of
users who submitted these images to Flickr. These users were added to our data
set. The keywords used for image search were (a) newborn for the portraiture
topic, (b) tiger and beetle for the wildlife topic, and (c) apple for the technology
topic. Each keyword is ambiguous. Tiger, for example, could mean a big cat of
the panthera genus, but also a flower (Tiger lily), Mac operating system (OS X
Tiger), or a famous golfer (Tiger Woods), while beetle could describe a bug or a
car. The keyword newborn could refer to human babies just as well as to kittens
and puppies, while apple could mean the computer maker or a fruit.
2 http://www.orgnet.com/
3 Available at http://www-personal.umich.edu/∼mejn/netdata/
From the set of users in each topic, we identified four (eight for the wildlife
topic) who were genuinely interested in that topic: i.e., wildlife for the tiger and
beetle query terms, portraiture for the newborn query, and technology for the
apple query. We studied each user’s profile to confirm that the user was indeed
interested in that topic. Specifically, we looked at group membership and user’s
most common tags. Thus, groups such as “Big Cats”, “Zoo”, “The Wildlife
Photography”, etc. pointed to user’s interest in the wildlife topic. In addition
to group membership, tags that users attached to their images could also help
identify their interests. For example, users who used tags nature and macro were
probably interested in wildlife rather than technology. Similarly, users interested in
human, rather than animal, portraiture tagged their images with baby and family.
We used the Flickr API to retrieve the contacts of each of the users we identified,
as well as their contacts’ contacts. We labeled users by the topic through which
they were discovered. In other words, users who uploaded one of the 500 most
interesting images retrieved by the query tiger were labeled wildlife, whether or
not they were interested in wildlife photography. The contacts and contacts'
contacts of the four users within this set identified as being interested in wildlife
photography were also labeled wildlife. Although we did not verify that all the
labeled users were indeed interested in the topic, we use these soft labels to
evaluate the discovered communities.
Once we retrieved the social networks of the target set of users, we reduced them
to an undirected network containing mutual contacts only. In other words, every
link in the network between two nodes, say A and B, implies that A lists B
as contact and vice versa. This resulted in a network of 5747 users. Of these,
1620 users were labeled technology, 1337 and 2790 users were labeled portraiture
and wildlife respectively. We ran our community finding algorithm for different
values of α on this data set. For α = 0, we found four groups, while for higher
values of α (0 < α < 0.01), we found three groups. Figure 4 shows the composition
of the discovered groups in terms of soft labels. Group 1 is composed mainly of
technology users, group 2 mainly wildlife users, and group 3 is almost exclusively
portraiture. The fourth group found at α = 0.0 has 932 members, of which 497
are labeled wildlife, 242 technology, and 193 members portraiture. Except for the
portraiture group (group 3), groups become purer as α increases.
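The reduction to mutual contacts keeps an undirected edge only when both directed links exist; a minimal sketch (the dictionary-of-sets representation is our own assumption about the data layout):

```python
def mutual_contact_network(contacts):
    """Reduce a directed contact relation to an undirected network of mutual
    contacts: keep {a, b} only if a lists b as a contact and b lists a.
    `contacts` maps each user to the set of users they list."""
    edges = set()
    for a, listed in contacts.items():
        for b in listed:
            if a != b and a in contacts.get(b, set()):
                edges.add(frozenset((a, b)))
    return edges
```

One-directional links, such as a user following a popular photographer who does not follow back, are dropped by this filter.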
5 Related Research
There has been some work on motif-based communities in complex networks
[1] which, like our work, extends the traditional notion of modularity introduced by
Girvan and Newman [10]. The underlying motivation for motif-based commu-
nity detection is that “the high density of edges within a community determines
correlations between nodes going beyond nearest-neighbours,” which is also our
motivation for applying the influence-based modularity metric to community
detection. Though the motivation of this method is to determine the correla-
tions between nodes beyond nearest neighbors, it does impose a limit on the
proximity of neighbors to be taken into consideration, dependent on the size of
Fig. 4. Composition of groups discovered in the Flickr social network for different
values of α
the motifs. The method we propose, on the other hand, imposes no such limit
on proximity. On the contrary, it considers the correlation between nodes in a
more global sense. The measure of global correlation evaluated using the influ-
ence metric would be equal to the weighted average of correlations when motifs
of different sizes are taken. The influence matrix enables the calculation of this
complex term in a quick and efficient manner.
The resolution limit is one of the main limitations of the original modularity de-
tection approach [8]. It can account for the comment by Leskovec et al. [19]
that they “observe tight but almost trivial communities at very small scales, the
best possible communities gradually ‘blend in’ with rest of the network and thus
become less ‘community-like’.” However, that study is based on the hypothesis
that communities have “more and/or better-connected ‘internal edges’ connect-
ing members of the set than ‘cut edges’ connecting to the rest of the world.”
Hence, like most graph partitioning and modularity-based approaches to com-
munity detection, their process depends on the local property of connectivity
of nodes to neighbors via edges and is not dependent on the structure of the
network on the whole. Therefore, it does not take into account the characteris-
tics of node types, that is ‘who’ are the nodes that a node is connected to and
how influential these nodes are. In their paper on motif-based community detec-
tion, Arenas et al. [1] state that the extended quality functions for motif-based
modularity also obey the principle of the resolution limit. But this limit is now
motif-dependent, and several resolutions of substructures can be achieved by
changing the motif. However, it would be difficult to verify which resolution of
substructures is closest to natural communities. In influence-based modularity,
on the other hand, the resolution limit would depend on the probability of trans-
mission of the effect between nodes, i.e., the strength of ties. The probability of
transmission of effect can indeed be calculated from the graph, by, say, observing
the dynamics of the spread of an idea within the graph at different times.
As stated before, Liben-Nowell and Kleinberg [20] have shown that the Katz mea-
sure is the most effective measure for the link prediction task, better than hitting
time, PageRank [26] and its variants. Thus we use influence score, which is a
generalization of the Katz score, to detect communities and compute rankings
of individuals.
Recently researchers have used probabilistic models, e.g., mixture models, for
community discovery. These models can probabilistically assign a node to more
than one community, as it has been observed “objects can exhibit several dis-
tinct identities in their relational patterns” [15]. This may indeed be true, but
whether the nodes in the network are to be divided into distinct communities, or
the probabilities with which each node belongs to each community are to be
discovered, depends on the specific application. By this, we mean that if the
application we are interested in is finding the natural communities, say in the
karate club data, and we use a probabilistic method (say [15]), we would be
assigning the nodes to the groups to which their probability of belonging is the
highest, and the communities thus formed do not necessarily portray the observed
division of the network into natural communities.
6 Conclusion and Future Work
We have proposed a new definition of community in terms of the capacity of
nodes to influence each other. We gave a mathematical formulation of this effect
in terms of the number of paths of any length that link two nodes, and redefined
modularity in terms of the influence metric. We use the new definition of mod-
ularity to partition a network into communities. We applied this framework to
networks well-studied in literature and found that it produces results at least as
good as the edge-based modularity approach.
Although the formulation developed in this paper applies equally well to di-
rected graphs, we have only implemented it on undirected ones. Hence future
work includes implementation of the algorithm on directed graphs, which are
common on social networking sites, as well as applying it to bigger networks. The
influence matrix approximates the capacity to influence along the lines of the
independent cascade model of information spread. Future work includes approxi-
mation of the capacity to influence under other models of information spread, such
as the threshold influence model. Leskovec et al. [19] state that they “observe tight but almost
trivial communities at very small scales, the best possible communities gradu-
ally ‘blend in’ with rest of the network and thus become less ‘community-like’.”
However the hypothesis that they employ to detect communities is that commu-
nities have “more and/or better-connected ‘internal edges’ connecting members
of the set than ‘cut edges’ connecting to the rest of the world.” Hence, like most
graph partitioning and modularity based approaches to community detection,
their process depends on the local property of connectivity of nodes to neighbors
via edges and is not dependent on the structure of the network on the whole.
Besides, it also does not take into account the heterogeneity of node types, that
is ‘who’ are the nodes that a node is connected to and how influential these
nodes are. Therefore, we argue that a global property, such as the measure of
influence, is a better approach to community detection. It remains to be seen
whether communities will similarly ‘blend in’ with the larger network if one uses
the influence metric to discriminate them.
Acknowledgements
This research is based on work supported in part by the National Science Foun-
dation under Award Nos. IIS-0535182, BCS-0527725 and IIS-0413321.
References
1. Arenas, A., Fernandez, A., Fortunato, S., Gomez, S.: Motif-based communities in
complex networks. Journal of Physics A: Mathematical and Theoretical 41, 224001 (2008)
2. Atkin, R.: From cohomology in physics to q-connectivity in social science. Inter-
national Journal of Man-Machine Studies 4, 341–362 (1972)
3. Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z., Wagner,
D.: On modularity clustering. IEEE Trans. on Knowl. and Data Eng. 20(2), 172–
188 (2008)
4. Clauset, A.: Finding local community structure in networks. Physical Review E
(Statistical, Nonlinear, and Soft Matter Physics) 72(2) (2005)
5. de Sola Pool, I., Kochen, M.: Contacts and influence. Social Networks 1(1), 5–51
(1978–1979)
6. Ferrar, W.L.: Finite Matrices. Oxford University Press, Oxford (1951)
7. Fiedler, M.: Algebraic connectivity of graphs. Czech. Math. J. 23, 298–305 (1973)
8. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc.
Natl. Acad. Sci. USA 104, 36 (2007)
9. Ghosh, R., Lerman, K.: Community detection using a measure of global influence.
In: Proc. of the 2nd KDD Workshop on Social Network Analysis, SNAKDD 2008
(2008)
10. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proc. Natl. Acad. Sci. USA 99, 7821 (2002)
11. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems look
at the underlying process of word-of-mouth. Marketing Letters (2001)
12. Granovetter, M.: The strength of weak ties. The American Journal of Sociology
(May 1973)
13. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18,
39–43 (1953)
14. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through
a social network. In: KDD 2003: Proceedings of the ninth ACM SIGKDD inter-
national conference on Knowledge discovery and data mining, pp. 137–146. ACM,
New York (2003)
15. Koutsourelakis, P.S., Eliassi-Rad, T.: Finding mixed-memberships in social net-
works. In: AAAI Spring Symposium Social Information Processing (2008)
16. Legrand, J.: How far can q-analysis go into social systems understanding? In: Fifth
European Systems Science Congress (2002)
17. Leicht, E.A., Newman, M.E.J.: Community structure in directed networks. Physical
Review Letters 100, 118703 (2008)
18. Lerman, K., Plangprasopchok, A., Wong, C.: Personalizing results of image search
on flickr. In: AAAI Workshop on Intelligent Techniques for Web Personalization
(2007)
19. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of
community structure in large social and information networks. In: Proceedings of
the World Wide Web Conference (2008)
20. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks.
J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
21. Newman, M.E.J.: Detecting community structure in networks. The European Phys-
ical Journal B 38, 321–330 (2004)
22. Newman, M.E.J.: Fast algorithm for detecting community structure in networks.
Physical Review E 69, 066133 (2004)
23. Newman, M.E.J.: Finding community structure in networks using the eigenvectors
of matrices. Physical Review E 74, 036104 (2006)
24. Newman, M.E.J.: Modularity and community structure in networks. Proc. Natl.
Acad. Sci. USA 103, 8577 (2006)
25. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in net-
works. Physical Review E 69, 026113 (2004)
26. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web. Technical report, Stanford Digital Library Technologies
Project (1998)
Community Detection Using a Measure of Global Influence 35

27. Pothen, A., Simon, H., Liou, K.P.: Partitioning sparse matrices with eigenvectors
of graphs. SIAM J. Matrix Anal. Appl. 11, 430–452 (1990)
28. Tong, H., Faloutsos, C., Pan, J.: Fast random walk with restart and its applications.
In: Sixth International Conference on Data Mining, ICDM 2006, pp. 613–622 (2006)
29. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on time-
evolving bipartite graphs. In: SDM, pp. 704–715. SIAM, Philadelphia (2008)
30. Zachary, W.W.: An information ow model for conict and ssion in small groups.
Journal of Anthropological Research 33, 452–473 (1977)
31. Zhou, H.: Network landscape from a brownian particles perspective. Physical Re-
view E 67 (2003)
Communication Dynamics of Blog Networks

Mark Goldberg¹, Stephen Kelley¹, Malik Magdon-Ismail¹,
Konstantin Mertsalov¹, and William (Al) Wallace²

¹ CS Department, RPI, 110 8th Street, Troy, NY
{goldberg,kelles,magdon,mertsk2}@cs.rpi.edu
² DSES Department, RPI, 110 8th Street, Troy, NY
wallaw@rpi.edu

Abstract. We study the communication dynamics of Blog networks,


focusing on the Russian section of LiveJournal as a case study. Com-
munication (blogger-to-blogger links) in such online communication net-
works is very dynamic: over 60% of the links in the network are new from
one week to the next, though the set of bloggers remains approximately
constant. Two fundamental questions are: (i) what models adequately
describe such dynamic communication behavior; and (ii) how does one
detect the phase transitions, i.e. the changes that go beyond the stan-
dard high-level dynamics? We approach these questions through the no-
tion of stable statistics. We give strong experimental evidence to the fact
that, despite the extreme amount of communication dynamics, several
aggregate statistics are remarkably stable. We use stable statistics to test
our models of communication dynamics postulating that any good model
should produce values for these statistics which are both stable and close
to the observed ones. Stable statistics can also be used to identify phase
transitions, since any change in a normally stable statistic indicates a
substantial change in the nature of the communication dynamics. We
describe models of the communication dynamics in large social networks
based on the principle of locality of communication: a node’s communi-
cation energy is spent mostly within its own “social area,” the locality
of the node.

1 Introduction
The structure of large social networks, such as the WWW, the Internet, and the
Blogosphere, has been the focus of intense research during the last decade (see
[1], [7], [8], [12], [17], [19], [20], [21], [22]). One of the main foci of this research
has been the development of dynamic models of network creation ([2], [11], [22],
[18]), which incorporate two fundamental elements: network growth, with nodes
arriving one at a time; and some form of preferential attachment in which an
arriving node is more likely to attach itself to a more prominent existing node
than a less prominent one (the rich get richer).
Once a network has grown and stabilized in size, how does it evolve? Such
an evolution is governed by the communication dynamics of the network: links
being broken and formed as social groups form, evolve and disappear. The com-
munication dynamics of these networks have been studied much less, partially

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 36–54, 2010.

© Springer-Verlag Berlin Heidelberg 2010

because the typical networks studied (the WWW, the Internet, collaboration
networks) mainly exhibit growth dynamics and not communication dynamics.
Clearly, as a network matures, the growth (addition of new users) becomes a
minor ingredient of the total change (see Figure 1). Further, links in a socially
dynamic network such as the Blogosphere should not be interpreted as static.
The posts made by a blogger a week ago may not be reflective of his/her current
interests and social groups. In fact, blog networks display extreme communica-
tion dynamics. Over the 20 week period shown in Figure 1, in a typical week,
510,000 pairs of bloggers communicated via blog comments. Of those, about
380,000 are between pairs of bloggers who did not communicate the week before,
i.e. over 70% of the communications are new. What models adequately describe
the dynamics of the communications in such networks which have more or less
stabilized in terms of growth?

[Figure 1: two curves per week (weeks 51 through 17), vertex growth and fraction of new edges; y-axis from 0 to 0.9]

Fig. 1. Edge and vertex dynamics. Clearly the rate of growth is decreasing; however,
the fraction of new edges which appear in a week remains approximately constant at
over 70%.

To begin to address this question, one must first develop methods for testing
the validity of a model. In such an environment of extreme stochastic dynamics,
one cannot hope to replicate the dynamics of the individual communications; this
explains our focus on the evolution of interesting macroscopic properties of the
communication dynamics. Particularly interesting ones are those which are time
invariant. We refer to such properties as stable statistics. As we demonstrate,
even in such an active environment, certain statistics are remarkably stable. For
example: the power-law coefficient for the in-degree distribution, the clustering
coefficient, and the size of the giant component (see Table 1).

1.1 Our Contributions


In this paper, we first demonstrate by experimentation that a certain set of statis-
tics is stable in the graph of the Blogosphere. The stability implies that these
(and conceivably some other) statistics can be used to validate models, charac-
terize networks and identify their phases. Second, we present several models for
communication dynamics that are based on the principle of locality of communication. We show via simulation that our models stabilize to an equilibrium in
which the aggregate statistics of the communication dynamics are stable. Fur-
thermore, among the set of models we tested, we select the one whose stable
statistics best reproduce the statistics of the observed network.
Stable Statistics. Our case study was the Russian section of LiveJournal. Dur-
ing the observed period of 20 weeks, close to 153,000 users were active in any one
week period. The size of this set is quite stable (typically changing by 1–2%),
although the makeup of the set changes drastically from week to week. Surprisingly,
many aggregated statistics computed for the Blogograph show strong
the out-degrees of the nodes; the (overlapping) coalition distribution as described
by the cluster density and size; and the size of the giant component.
The nodes of the Blogograph represent bloggers, and the directed edges between
them represent all pairs (A, B) where blogger A visited the blog of B
during the week in question and left a comment on a specific post already in the
blog. We consider the following four types of stable statistics:
(i) Individual Statistics: properties of individual nodes such as the in-degree
and out-degree distributions for the graph
(ii) Relational Statistics: properties of edges in the graph such as the per-
sistence of edges and clustering coefficients
(iii) Global Statistics: properties reflecting global information such as the size
and diameter of the largest component and total density
(iv) Community Statistics: properties relating to group structure such as
the community size and density distributions
The purpose of collecting these statistics is two-fold. First, they create a baseline
which describes the normal behavior of individuals, communities, and the network
as a whole. Once this base has been established, anomalous behavior at each of
these levels can be identified and investigated further. Second, stable statistics
can be used for testing any model of the network dynamics, as any model which
attempts to replicate the communication dynamics must, in particular, be able to
reproduce these statistics. Furthermore, the quality of a model can be measured
by how well the statistics computed from the network generated by the model (in
equilibrium) replicate those observed in the real-life network.
Locality based Models of Communication Dynamics. Existing growth-
based models fail to adequately replicate the observed stable statistics, as they
do not capture communication dynamics. We consider models for communication
dynamics which take as input: (a) the current (observed) communication graph;
and, (b) each user’s out-degree (communication energy) at the next time step
(or a distribution over the user's out-degree). These two inputs are standard
for existing growth models (such as the preferential attachment growth model).
Such models are only applicable when the communications are open (observable
to all nodes). The output is the communication graph at the next time step,
based on the model for probabilistic attachment of each node’s out-edges.
We discuss intuitive extensions of growth models for modeling communication
dynamics and illustrate that these extensions are inadequate for modeling
the observed stable statistics. We present a locality based model which relies on
two fundamental principles to more accurately reflect the observed communi-
cation dynamics. First, our concept of locality reduces the set of nodes a node
can attach to in the next time step (a week in our case). This locality is based
on structural properties of the current (observable to all) communication graph.
The locality represents a semi-stable set of “neighbor” nodes that an individual
is highly likely to connect to, and can be interpreted as that individual's view
of the communities she belongs to. We test various structural (graph theoretic)
definitions of a node’s social locality, ranging from trivial localities such as the
entire graph to notions of a node’s neighborhood (e.g. the 2-neighborhood; the
clusters to which a node belongs). Second, after obtaining a node’s locality, one
must specify the attachment mechanism, the mechanism used by the individual
to select the nodes in her locality to which she will connect at the next time
step. We test a number of different attachment mechanisms which one could
consider, ranging from uniform attachment to some form of preferential attach-
ment. Thus, we present results using each of the various choices for the locality
and attachment mechanism.
Such probabilistic models are Markov chains, and we test a model’s perfor-
mance by comparing the values it produces for the stable statistics after it has
equilibrated. We find experimentally that the mixing times are small and the
equilibrium statistics are independent of the starting state (the chains are er-
godic), hence the equilibrium distribution is unique. Our results indicate that
our locality based model with locality defined as the union of clusters to which a
node belongs and a preferential attachment mechanism produces the best values
for the stable statistics.

2 Clusters
The notion of a social community is crucial to our model of a Blog network. The
underlying idea of our model is that every user selects the nodes to visit (to leave
a comment) from the set of nodes that belong to a relatively small “area” of a
node. Our experiments with different definitions of the local area of the node
show that the best approximation to the observed statistics is achieved if the
area is taken as the union of clusters containing a given node. Our definition of
network clusters is borrowed from [4], [5], [6] with an important specification of
the notion of the density of a set of nodes in a network.
Definition. Given a graph G(V, E), let a function D, called the density, be defined
on the set of all subsets of V. Then, a set C ⊆ V is called a cluster if it is locally
maximal w.r.t. D in the following sense: for every vertex x ∈ C (resp. x ∉ C),
removing x from C (resp. adding x to C) creates a set whose density is smaller
than D(C).
The idea of the definition matches the common understanding of a social
community as a set of members that forge more communication links within the
set than with the outside of the set. The function D is not specified by the
definition, but its precise formulation is crucial in “catching” the nature of social
communities. The density function considered in [3] is as follows:
D(C) = w_in / (w_in + w_out),   (1)

where w_in is the number of edges xy with x, y ∈ C, and w_out is the number of
edges xy with either x ∈ C & y ∉ C or x ∉ C & y ∈ C (to allow for directed
graphs). The main deficiency of this definition of a cluster as a computational
representation of a social community is that it is easy to find examples of networks
that permit very large and loosely connected clusters, which intuitively do not
represent any community. The idea of our modification of (1) is to introduce an
additional term, weighted by a parameter, which represents the edge probability in the set:

D(C) = w_in / (w_in + w_out) + λ · 2 w_in / (|C|(|C| − 1)),   (2)

where the parameter λ depends on the specific network under the consideration,
and is supposed to be selected by the researcher. For our experiments, we selected
λ = 0.125.
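As a concrete illustration, the density of equation (2) and the local-maximality test from the definition above can be sketched in Python (our own naming; the graph is assumed to be given as a list of directed (u, v) edge pairs, and sets of size below 2 are assigned zero density to avoid division by zero):

```python
def density(C, edges, lam=0.125):
    """Modified density of node set C (equation 2): internal-edge fraction
    plus lam times the internal edge probability 2*w_in / (|C|(|C|-1))."""
    C = set(C)
    w_in = sum(1 for u, v in edges if u in C and v in C)
    w_out = sum(1 for u, v in edges if (u in C) != (v in C))
    if w_in + w_out == 0 or len(C) < 2:
        return 0.0
    return w_in / (w_in + w_out) + lam * 2.0 * w_in / (len(C) * (len(C) - 1))

def is_cluster(C, nodes, edges, lam=0.125):
    """C is a cluster if it is locally maximal w.r.t. density: removing any
    member, or adding any outside node, yields a set of lower density."""
    d = density(C, edges, lam)
    for x in C:
        if density(set(C) - {x}, edges, lam) >= d:
            return False
    for x in set(nodes) - set(C):
        if density(set(C) | {x}, edges, lam) >= d:
            return False
    return True
```

For example, in a graph consisting of a triangle {a, b, c} with a pendant node d attached to a, the triangle alone is not a cluster, since adding d increases the density.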

3 Data

We define the blogograph as a directed, unweighted graph representing the com-


munication of the blog network within a fixed time-period. There is a vertex in
the blogograph representing each blogger and a directed edge from the author of
any comment to the owner of the blog where the comment was made during the
observed time period. Parallel edges are not allowed and a comment is ignored
if the corresponding edge is already present in the graph. Loops, comments on a
blogger's own blog, are ignored as well. To study the communication dynamics,
we consider consecutive weekly snapshots of the network; the communication
graph contains the bloggers that either posted or commented during a week and
the edges represent the comments that appeared during the week. We chose to
split graphs into one-week periods due to the highly cyclic nature of activity in
the blogosphere (see Figure 3). An illustration of the blogograph's construction
is given in Figure 2.
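A minimal sketch of this construction (our own code; comment records are assumed to arrive as (commenter, blog owner) pairs for one week):

```python
def build_blogograph(comments):
    """Build the weekly blogograph from (commenter, blog_owner) pairs.
    A set collapses parallel edges to one edge; loops (comments on a
    blogger's own blog) are dropped."""
    vertices, edges = set(), set()
    for commenter, owner in comments:
        vertices.update((commenter, owner))
        if commenter != owner:               # ignore loops
            edges.add((commenter, owner))    # set membership kills parallels
    return vertices, edges
```

On the two threads of Figure 2, this yields exactly the five edges B→A, C→A, A→B, C→B, D→B.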
The data used for our research was collected from the popular blogging ser-
vice LiveJournal. As of May 2008, there were more than 15 million users for the
whole network; the number of posts during a 24 hour period was approximately
191,000 (see http://www.livejournal.com/stats). Much of the communication in
LiveJournal is public, which allows for open access. LiveJournal provides a real
time RSS update feature that publishes all open posts that appear on any of
the hosted blogs. We record the permament addresses of the posts and wait for
the comments to accumulate. In our experience, the overwhelming majority of
comments appear on these posts within two weeks of the posting date. Thus,

[Figure 2: a thread on Alice's blog (Alice posted; Bill, Alice, and Cory commented) yields edges B→A, C→A; a thread on Bill's blog (Bill posted; Alice, Cory, and Dave commented) yields edges A→B, C→B, D→B]

Fig. 2. Blogograph generation example. Vertices are placed for every blogger who
posted or commented; edges are placed from the author of the comment to the
author of the post (the blog owner). Parallel edges and loops are not allowed.

Table 1. Statistics for observed blogograph: order of the graph (|V |), graph size (|E|),
fraction of vertices that are part of giant component (GC size), clustering coefficient
(C), average separation (d), power law exponent (α)

week |V | |E| GC C d α
49 155,615 530,160 95.88% 0.0639 5.333 2.63
50 156,026 532,189 95.91% 0.0644 5.327 2.66
51 155,093 527,364 95.62% 0.0635 5.316 2.65
52 151,559 516,483 95.62% 0.0635 5.316 2.71
1 118,979 327,356 93.55% 0.0573 5.777 2.92
2 142,478 444,457 95.14% 0.0587 5.392 2.68
3 159,436 559,506 96.16% 0.0629 5.268 2.68
4 158,429 550,436 95.60% 0.0631 5.224 2.67
5 156,144 534,917 95.49% 0.0627 5.293 2.72
6 156,301 526,194 95.70% 0.0615 5.338 2.72
7 154,846 523,235 95.44% 0.0622 5.337 2.69
8 156,064 528,363 95.59% 0.0609 5.320 2.69
9 156,362 524,441 95.58% 0.0602 5.377 2.68
10 154,820 523,304 95.48% 0.0593 5.368 2.68
11 155,267 516,280 95.13% 0.0600 5.356 2.68
12 156,872 514,269 95.20% 0.0590 5.367 2.63
13 155,338 510,070 95.42% 0.0601 5.342 2.71
14 155,099 506,892 95.19% 0.0607 5.309 2.73
15 153,440 504,850 95.32% 0.0601 5.303 2.73
16 154,012 512,094 95.34% 0.0599 5.298 2.60
17 151,427 503,802 95.30% 0.0611 5.288 2.75

our screen-scraping program visits the page of a post after it has been pub-
lished for two weeks and collects the comment threads. We then generate the
communication graph.
We have focused on the Russian section of LiveJournal as it is reasonably
but not excessively large (currently close to 580,000 bloggers out of the total 15
million) and almost self contained. We identify Russian blogs by the presence of
Cyrillic characters in the posts. Technically this also captures the posts in other

[Figure 3: (a) comments per day, January 19 through March 29 (y-axis 150,000–450,000); (b) comments per hour over one week, Monday through Sunday (y-axis 0–30,000)]

Fig. 3. Number of comments per day between January 14, 2008 and April 6, 2008,
and number of comments per hour during the week of March 24–30, 2008. The
periodic drops in the number of comments per day correspond to Saturdays and
Sundays.

languages with a Cyrillic alphabet, but we found that the vast majority of the
posts are in Russian. The network of Russian bloggers is very active. On average,
32% of all posts contain Cyrillic characters. LiveJournal blogging has become
a cultural phenomenon in Russia. Discussion threads often contain intense and
interesting discussions which encourage communication through commenting.
Our work is based on data collected between December 2007 and April 2008.
The basic statistics about the size of obtained data are presented in Table 1. A
simpler set of statistics on a smaller set of observed data is presented in [16].

4 Stable Statistics
The observed communication graph has interesting properties. The graph is
very dynamic on the level of nodes and edges but has stable aggregated statis-
tics. About 75% of active bloggers will also be active in the next week. Further,
about 28% of edges that existed in a week will also be found in the next week. A
large part of the network changes weekly, but a significant part is preserved. The
stability of various statistics of the blogograph is presented in Table 1. The giant
component (GC) is the largest connected (not necessarily strongly connected)
subgraph of the undirected blogograph. A giant component of similar size has
been observed in other large social networks [18], [14]. The clustering coefficient
(C) refers to the probability that the neighbors of a node are connected. The
clustering coefficient of a node with degree k is the ratio of the number of edges
between its neighbors and k(k − 1). The clustering coefficient of the graph is
defined to be the average of the node clustering coefficients. The observed clus-
tering coefficient is stable over multiple weeks and significantly different from the
clustering coefficient in a random graph with the same out-degree distribution,
which is 0.00029. The average separation (d) is the average shortest path be-
tween two randomly selected vertices of the graph. We computed it by sampling
10,000 random pairs of nodes and finding the undirected shortest path between
them. The observed value in the blogograph is similar to what has been found
in many other social networks ([18], [23]).
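For concreteness, these three global statistics can be computed from an undirected adjacency dictionary along the following lines (our own sketch; the paper samples 10,000 pairs for the average separation, here the sample size is a parameter, and we use the standard undirected form of the node clustering coefficient):

```python
import random
from collections import deque

def giant_component_fraction(adj):
    """Fraction of vertices in the largest connected component of an
    undirected graph given as {node: set(neighbors)}."""
    seen, best = set(), 0
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best / len(adj)

def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbor pairs that are
    themselves connected (node labels must be comparable)."""
    total = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_separation(adj, samples=1000, rng=None):
    """Average undirected shortest-path length over sampled node pairs."""
    rng = rng or random.Random(0)
    nodes = list(adj)
    total = found = 0
    for _ in range(samples):
        s, t = rng.sample(nodes, 2)
        dist, q = {s: 0}, deque([s])
        while q and t not in dist:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if t in dist:
            total += dist[t]
            found += 1
    return total / found if found else float("inf")
```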
Many large social networks ([2], [14]) display a power law in the degree distribution,
P(k) ∝ k^(−α), where P(k) is the probability that a node has degree k. Figure
5 shows the mean in-degree distribution of the collected blogographs. In these
graphs, we observed a power-law tail with exponent α ≈ 2.70, which is stable
from week to week. This value was computed using the maximum likelihood method
described in [10] and Matlab code made available by Aaron J. Clauset.
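The maximum likelihood estimator has a simple closed form; a minimal sketch (our own code, using the standard discrete approximation α ≈ 1 + n / Σ ln(k / (k_min − 1/2)) for a given lower cutoff k_min, whereas [10] additionally selects k_min itself):

```python
import math

def power_law_alpha(degrees, k_min):
    """Approximate discrete MLE for the power-law exponent:
    alpha = 1 + n / sum of ln(k / (k_min - 0.5)) over all k >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)
```

Degrees below k_min are excluded, so adding low-degree nodes does not change the estimate.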
To evaluate the dynamics in the observed communication, we consider the
change in the set of links, or edges, from one week to another. Figure 4 shows
the distribution of the number of weeks a particular pair of bloggers communicated.
It is evident from this plot that the vast majority of communication does not
re-occur. Yet, some links reappear every week. We also look at the past
relationship between the bloggers who communicated. We define the history of the
edge (i, j) that appeared in time cycle t to be the shortest undirected distance
between i and j in the graph of the time cycle t − 1. Figure 4 presents the
distribution of the edge histories of all observed edges of all time cycles. The
edge history distribution of the particular observed weeks is very close to the

[Figure 4: (a) edge stability, log-scale distribution over 0–20 weeks; (b) edge history, distribution over distances 0–16]


Fig. 4. Edge stability: distribution of the number of weeks an edge appeared in; 60%
of all edges appeared only once. Edge history: the distribution of the shortest undirected
distance between the end points of an edge in the previous time cycle.

[Figure 5: log-log plot of the in-degree distribution, P(x = k) versus k, showing the observed distribution and a power-law fit with α = 2.70]
Fig. 5. Average in-degree distribution in the blogograph observed over 21 weeks
between Dec. 03, 2007 and Apr. 28, 2008

Table 2. 19 weeks of communities from the Russian section of Live Journal. |C| is
the number of communities, δavg is the average density, and ep is the average edge
probability within the communities.

week |C| avg size δavg ep
51 19631 10.0183 0.456677 0.253212
52 19520 10.0615 0.453763 0.252101
1 23187 10.0915 0.473676 0.248130
2 20970 9.98412 0.458161 0.251843
3 17986 9.86184 0.448757 0.254203
4 18510 9.71891 0.453578 0.257481
5 18808 9.88255 0.455823 0.254305
6 19318 9.79242 0.454656 0.253901
7 19343 9.80381 0.456364 0.255236
8 19796 9.83113 0.453577 0.252818
9 20136 9.95401 0.473693 0.252607
10 19670 9.71678 0.45449 0.255778
11 20212 9.66842 0.456908 0.256098
12 20415 9.70331 0.461118 0.255819
13 20030 9.78058 0.455676 0.254681
14 19893 9.74936 0.455234 0.254384
15 19392 9.73407 0.455365 0.254687
16 19113 9.74787 0.454531 0.254721
17 18737 9.72333 0.455775 0.255658

presented distribution (the variation at each point is less than 2%). As the figure
suggests, the majority of communicating vertices were at most 3 hops away in
the network on the previous time cycle. This provides evidence for the strong
locality of communication that occurs in the observed network.
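The edge-history computation reduces to a breadth-first search in the previous week's graph; a sketch (our own naming; the previous week's graph is assumed to be an undirected adjacency dictionary):

```python
from collections import deque

def edge_history(edge, prev_adj):
    """Shortest undirected distance between the endpoints of an edge
    (i, j) in the previous week's graph; None if unreachable or if an
    endpoint was absent last week."""
    i, j = edge
    if i not in prev_adj or j not in prev_adj:
        return None
    dist, q = {i: 0}, deque([i])
    while q:
        u = q.popleft()
        if u == j:
            return dist[u]
        for v in prev_adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None
```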
In addition to looking for stability in structural statistics, it is also useful
to examine stable community behavior. Using the notion of clusters discussed
previously in this text, we find locally optimal communities using each edge in
the graph as a seed. Once all seeds are optimized, duplicates and clusters of size
2 are removed. Statistics of the remaining clusters are shown in Table 2. A size
vs density plot is also given in Figure 6. The general shape and scale of this plot
is replicated across all observed weeks.

[Figure 6: scatter plot of community size (x-axis, 0–100) versus density δ (y-axis, 0–0.8); one point per community]

Fig. 6. A size vs density plot for week 5 of the observed data. The x-axis is a measure
of the community size while the y-axis shows the value of δ. Each point represents a
community.

5 Modeling
As previously stated, networks with such strong communication dynamics have
not been well modeled. Much of the previous work aims to replicate the growth
phase of a network’s life-cycle, ignoring the evolution of communication once
the network’s size stabilizes. Models which replicate these dynamics would be
useful as a sand-box within which social hypotheses on information diffusion, the
emergence of leaders, and group formation and dissolution can be tested. To be
considered useful, any model should create a set of graphs whose statistics come
as close as possible to mirroring the statistics of the observed data presented
previously.
Before delving into the creation of a new model, let us first consider the
modification of a previously existing one. The simplest method of producing
a set of evolving graphs is to grow each week’s graph using a known network
growth algorithm. Vertices can be assigned an out-degree based on the observed
data and connected to each other via preferential attachment for each of the
weeks. If done correctly, this would yield a set of graphs whose in-degree and
out-degree distributions come close to matching the observed data’s power law
distributions.
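This baseline can be sketched as follows (our own code; each week's graph is generated independently, and we assume an "in-degree + 1" smoothing so that nodes with no in-edges can still be chosen, a common convention rather than a detail taken from the paper):

```python
import random

def weekly_pa_graph(out_degrees, rng=None):
    """Generate one week's graph independently: node i emits
    out_degrees[i] edges, each attaching preferentially to nodes with
    high current in-degree; parallel edges and loops are excluded."""
    rng = rng or random.Random(0)
    nodes = list(out_degrees)
    indeg = {v: 0 for v in nodes}
    edges = set()
    for i in nodes:
        targets = [v for v in nodes if v != i]          # no loops
        for _ in range(out_degrees[i]):
            choices = [v for v in targets if (i, v) not in edges]
            if not choices:
                break
            weights = [indeg[v] + 1 for v in choices]   # smoothed PA
            v = rng.choices(choices, weights=weights)[0]
            edges.add((i, v))
            indeg[v] += 1
    return edges
```

Because each week is generated from scratch, statistics that compare consecutive weeks (edge stability, edge history) come out wrong, which is exactly the inadequacy described above.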
Despite this initial positive result, examining the rest of the statistics demon-
strates that the model is insufficient. Relational statistics such as edge stability,
edge history, and clustering coefficient all significantly depart from the observed
values, which we will show in detail further in the paper. This model's inability to
recreate these statistics is expected, since it generates each graph independently.
Below, we propose a model which performs its edge connection within some
locality in an effort to more closely mirror the edge stability, edge history, clus-
tering coefficient, and community based statistics of the network.

6 A Locality Based Model


The goal of our model is to produce a sequence of graphs which simulate the
connection and reconnection of vertices. Our model specifies how nodes update
their edges in response to the observed communication activity. In specifying this
model of evolution, we take as input the out-degree distribution of the blogo-
graph. The justification for this is that, while the out-degree distribution would
be an interesting object to model, it mainly reflects the individual properties of
the users in the network such as the level of energy and involvement of the user.
Such quantities tend to be innate to a user. Different people have different social
habits; some manage to communicate with hundreds of people while others in-
teract with only a small group. Hence, out-degrees should be specified either ab
initio (e.g. from social science theory) or extracted directly from the observed
data. We take the latter approach to specifying the out-degree distribution when
it comes to testing our model. An early version of this model with preliminary
results is presented in [15].
Given the out-degrees for all nodes, the task is now to specify how to attach
the out-edges of the nodes and to obtain the in-degree distribution. It is the
in-degree distribution that characterizes the global communication structure of
the network (for example, who is considered by others to be important). Clearly,
the out-degree distribution of a graph alone does not determine its in-degree
distribution. Algorithms for generating undirected random graphs with a pre-
scribed degree distribution are well known (see [9], [13], [24]). However, even if
those algorithms are expanded to the domain of directed graphs, they will still
be insufficient for our purpose of modeling evolution, which requires repeated
generation of the next graph given the previous one.
To summarize, we are interested in models which reproduce the observed
evolution given the out-degrees of the nodes. Thus, all our locality models assume
that a node, when deciding where to attach its communication links, has some
fixed budget of emanating edges which it can attach. The main task of our model
is to develop an evolution mechanism that re-creates an in-degree distribution
close to the observed one.
We use standard graph theory terminology in describing our model (see for
example [25]). The sequence of blogographs are represented by directed graphs
G0 , G1 , G2 , . . ., where at every time step t, Gt = (V, Et ). V is the common vertex
set of all known bloggers, V = {v1 , ..., vn }. An edge (vi , vj ) is in the edge set Et if
blogger vi commented on a post by vj during the time period t. One time period
covers one week, which appears to be the natural time scale in the blogosphere.
The input to the model is the set of out-degrees at time t for each vertex,
{k_t^1, . . . , k_t^n}, and G_{t−1}, the blogograph at time t − 1. The output of the model is
G_t, the blogograph at time t. Our model is locality based. At time t, every node
v_i identifies its area and assigns its out-edges to destinations in its area.
More formally, denote the area of v_i at time t by A_t^i ⊆ V. A_t^i represents the
locality of node v_i at time t. Typically, a node's locality at time t will depend on
G_{t−1}, the blogograph at time t − 1. The attachment mechanism is probabilistic
for each node. Node v_i attaches its k_t^i out-edges according to its own probability

Algorithm 1. Evolution Model

1: Function Model(T, OutDeg, Area, Prob)
2: // Output: blogographs G_1, . . . , G_T
3: {k_0^1, . . . , k_0^n} ← OutDeg
4: Initialize G_0 (e.g., to a random graph)
5: for t = 1 to T do
6:    E_t ← ∅; {k_t^1, . . . , k_t^n} ← OutDeg
7:    for i = 1 to n do
8:       A_t^i ← Area(i, G_{t−1}); p_t^i ← Prob(i, A_t^i, G_{t−1})
9:       E_t^i ← Attach(i, A_t^i, p_t^i, k_t^i); E_t ← E_t ∪ E_t^i
10:   end for
11:   G_t ← (V, E_t)
12: end for

Algorithm 2. Edge attachment algorithm

1: Function Attach(i, A_t^i, p_t^i, k_t^i)
2: // Output: E_t^i — the edges in G_t originating at i
3: while k_t^i > 0 do
4:    if Σ_{v ∈ A_t^i} p_t^i(v) > 0 then
5:       select node v ∈ A_t^i with probability p_t^i(v)
6:       p_t^i(v) ← 0; renormalize p_t^i
7:    else
8:       select node v ∈ V \ A_t^i uniformly at random
9:    end if
10:   k_t^i ← k_t^i − 1
11:   E_t^i ← E_t^i ∪ {(i, v)}
12: end while

distribution p_t^i, where p_t^i(v) specifies the probability that node v_i attaches to node
v, for v ∈ V. The probability distribution p_t^i may depend on A_t^i and G_{t-1} (e.g.,
higher-degree nodes may receive higher probabilities). In particular, we assume that
Σ_{v ∈ A_t^i} p_t^i(v) = 1, which corresponds to the assumption that every node expends
all its communication energy within its local area. Since we do not allow parallel
edges, if k_t^i > |A_t^i|, it is not possible for node v_i to expend all its communication
energy within its local area A_t^i. In this case, we assume that k_t^i − |A_t^i| edges are
attached uniformly at random to nodes outside its area and the remaining edges
are attached within its area. The precise algorithm for distributing the edges
given the probability distribution p_t^i is given in Algorithm 2.
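The attachment step of Algorithm 2 can be sketched in Python. This is a minimal illustration, not the authors' implementation; the function name `attach` and the argument layout (a set for the area, a dict for the probabilities) are choices made here for clarity.

```python
import random

def attach(i, area, probs, k, all_nodes):
    """Sketch of Algorithm 2: draw k out-edges for node i.

    area:      set of nodes in i's local area A_t^i
    probs:     dict mapping each node in area to p_t^i(v),
               assumed to sum to 1 initially
    k:         number of out-edges k_t^i to place
    all_nodes: the full vertex set V (assumes k <= |V| - 1)
    """
    probs = dict(probs)                  # work on a copy; entries get zeroed
    edges = set()
    outside = list(set(all_nodes) - area - {i})
    while k > 0:
        total = sum(probs.values())
        if total > 0:
            # sample v from the area proportionally to the remaining mass;
            # zeroing p(v) and sampling against the new total is the
            # "remove and renormalize" step, so no parallel edges arise
            r, acc, v = random.uniform(0, total), 0.0, None
            for u, p in probs.items():
                acc += p
                if r <= acc:
                    v = u
                    break
            probs[v] = 0.0
        else:
            # area exhausted: attach uniformly at random outside the area
            v = random.choice(outside)
            outside.remove(v)
        edges.add((i, v))
        k -= 1
    return edges
```

With an area of two nodes and k = 3, the first two edges land inside the area and the third goes to a uniformly chosen outside node, mirroring the overflow rule described above.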
The evolution model is illustrated in Figure 7. In more detail, the model
first obtains the out-degrees (which are exogenously specified). From G_{t-1}, it
computes A_t^i and p_t^i for all nodes v_i ∈ V. For all nodes, it then attaches edges
according to Algorithm 2. This entire process is iterated for a user-specified
number of time steps; the full procedure is given in Algorithm 1. The inputs to
the model are the procedure OutDeg, which specifies the out-degrees (assumed
48 M. Goldberg et al.
Fig. 7. Model execution flow: an initial random graph is generated with fixed out-degrees; then, at each iteration, every node determines its local area and is assigned its out-degrees, edges are placed within each node's area, and the new graph is formed.
to be exogenous), the procedure Area which identifies the local areas of the
nodes given the previous graph, and the procedure Prob which specifies the
attachment probabilities according to the attachment model. We will now discuss
some approaches to defining the areas and the attachment probabilities. When
testing our model, we will also need the procedure for obtaining the out-degrees,
which will be discussed in Section 7.

6.1 Locality Models

A node expends the majority of its communication energy within its local area.
This captures the intuition that people mostly communicate within a small
group that contains friends, family, colleagues, etc. We propose the following
definitions of an area:
1. Global. Every node v_i is aware of the whole network; the local area of v_i is
A_t^i = V at every time period t.
2. k-neighborhood. The local area A_t^i of node v_i at time t consists of all v_j
such that the undirected shortest distance δ_t(v_i, v_j) ≤ k.
3. Clusters. This definition is based on the notion of a cluster, defined as
follows (see also [5], [4], [6]). First, the notion of the density of a set is
introduced, which in general can be any function defined on a subset of the nodes
of the graph (e.g., the ratio of the number of edges within the subset to the
number of edges with at least one endpoint in the subset). For every density function, a
cluster is defined to be any subset of nodes that is locally maximal w.r.t.
its density. Our definition of a cluster permits clusters to overlap. In fact, our
experiments show that clusters overlap quite frequently. This is expected
in a graph of a social network, where the same member may belong to more
than one community represented by clusters. Finally, using the definition of
a cluster presented earlier in this paper, we define the local area of a blogger
as the union of all clusters of which she is a member. Intuitively, this restricts
a blogger's activity to the set of individuals in groups in which she has previously shown
interest.
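The k-neighborhood locality model (definition 2 above) amounts to a bounded breadth-first search over the undirected version of G_{t-1}. The sketch below is illustrative: the function name and the adjacency-dict representation are choices made here, and the node itself is excluded from its own area since self-loops play no role in the attachment step.

```python
from collections import deque

def k_neighborhood(adj, i, k):
    """Nodes at undirected shortest distance 1..k from node i.

    adj: dict mapping each node to its set of neighbors in the
         undirected version of G_{t-1}.
    Returns the area A_t^i under the k-neighborhood model,
    excluding node i itself."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == k:          # do not expand beyond radius k
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v, d in dist.items() if 0 < d <= k}
```

On the path 1-2-3-4-5, the 2-neighborhood of node 1 is {2, 3}, matching the distance-based definition.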

6.2 Attachment Models

Given the local area A_t^i of node v_i at time t, the attachment model describes
the probability p_{t+1}^i(v_j) of the occurrence of an edge (v_i, v_j) at time t + 1, for v_j ∈ V.
We propose the following attachment models:

1. Uniform. Node v_i attaches to any v_j ∈ A_t^i with equal probability

   p_t^i(v_j) = 1 / |A_t^i|,    (3)

   and p_t^i(v_j) = 0 for v_j ∉ A_t^i.
2. Preferential Attachment. Node v_i attaches to any v_j ∈ A_t^i with probability

   p_t^i(v_j) ∝ indeg_{t-1}(v_j) + γ,    (4)

   where indeg_{t-1}(v_j) is the in-degree of vertex v_j in graph G_{t-1} and γ is a
   constant.
3. Markov Chain. To obtain the attachment probabilities for vertex v_i, we
   simulate a particle traveling over the undirected edges of graph G_{t-1}: starting
   from node v_i, the particle randomly selects edges to travel over until it first
   arrives at a node v_e ∉ A_t^i. Every time the particle arrives at some node v_j ∈ A_t^i,
   a counter c_j^i is incremented. After this simulation has been repeated a number
   of times without resetting the counters c_j^i, v_j ∈ A_t^i, we determine the
   attachment probability

   p_t^i(v_j) ∝ c_j^i.    (5)
4. Inverse distance. Node v_i attaches to node v_j ∈ A_t^i with probability

   p_t^i(v_j) ∝ 1 / δ_{t-1}(i, j)^ρ,    (6)

   where δ_{t-1}(i, j) is the shortest undirected distance between vertices v_i and v_j in
   graph G_{t-1} and ρ is a constant.
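The attachment distributions of Equations (3), (4), and (6) can be sketched as small probability builders. The function names and data layout are illustrative only; each returns a distribution normalized over the area, as required by the assumption that the in-area probabilities sum to 1.

```python
def uniform_probs(area):
    """Uniform attachment (Eq. 3): equal probability over the area."""
    return {v: 1.0 / len(area) for v in area}

def preferential_probs(area, indeg, gamma=1.0):
    """Preferential attachment (Eq. 4): p(v) proportional to
    indeg_{t-1}(v) + gamma, normalized over the area."""
    w = {v: indeg.get(v, 0) + gamma for v in area}
    total = sum(w.values())
    return {v: wv / total for v, wv in w.items()}

def inverse_distance_probs(area, dist, rho=1.0):
    """Inverse-distance attachment (Eq. 6): p(v) proportional to
    1 / delta_{t-1}(i, v)**rho, normalized over the area.
    dist maps each area node to its undirected distance from i."""
    w = {v: 1.0 / (dist[v] ** rho) for v in area}
    total = sum(w.values())
    return {v: wv / total for v, wv in w.items()}
```

Any of these dictionaries can then be fed directly into the attachment step of Algorithm 2.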

The combination of the locality model and attachment model specifies the evo-
lution model that, given the out-degree distribution, will produce a series of
graphs that represent the blogograph at different time periods.

7 Experiments and Results

In this section we present the results of executing a few of the models and the
evaluation of their performance.

To evaluate the performance of the models, we compare the sequence of graphs
produced by the model to the sequence of graphs produced by the observed
communication in LiveJournal. In particular, we compare the clustering coefficient,
the size of the giant component, the average separation between two nodes,
the in-degree distributions, and the community size and density distributions. To
compare the in-degrees, we compute the point-wise difference of the normalized
distributions. Formally, for each graph G_i we compute the normalized distribution
D_i(k) = |V_i^k| / |V_i|, where V_i^k is the set of vertices of degree k and |V_i| is the
number of vertices in the graph. The difference between the distributions of the
observed graph G_o and the generated graph G_g is

E = Σ_d |D_o(d) − D_g(d)|.    (7)

Note that E ∈ [0, 2], and a lower value of E corresponds to a closer match.
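The error E of Equation (7) is a point-wise L1 distance between two empirical degree distributions. A minimal sketch (the function name is chosen here, not taken from the paper):

```python
from collections import Counter

def degree_distribution_error(degrees_obs, degrees_gen):
    """Point-wise L1 difference E (Eq. 7) between the normalized
    degree distributions of the observed and generated graphs.
    Inputs are lists of node degrees; E lies in [0, 2], and a
    lower value means a closer match."""
    def normalize(degs):
        counts = Counter(degs)
        n = len(degs)
        return {k: v / n for k, v in counts.items()}
    Do, Dg = normalize(degrees_obs), normalize(degrees_gen)
    # sum over every degree value occurring in either distribution
    return sum(abs(Do.get(d, 0.0) - Dg.get(d, 0.0))
               for d in set(Do) | set(Dg))
```

Identical degree lists give E = 0, and lists with no degree value in common give the maximum E = 2, consistent with the stated range.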
To compare community structure we compute g_err, the average error between
the size and density distributions that represent communities in the generated and
observed graphs. This value is computed by splitting the size-density plot into
bins, where the bin width on the density axis is 0.05 and on the size axis is 5. The
resulting bin sizes are normalized with respect to the number of communities
in the graph. The value of g_err is the sum of the differences of the sizes of
corresponding bins in the size-density distributions of the observed and generated data.
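The g_err computation just described can be sketched as follows. The bin widths (5 on the size axis, 0.05 on the density axis) come from the text; the function name and the representation of a community as a (size, density) pair are choices made here.

```python
from collections import Counter

def g_err(communities_obs, communities_gen,
          size_bin=5, density_bin=0.05):
    """Sketch of g_err: each community is a (size, density) pair.
    Communities are binned on a size/density grid, bin counts are
    normalized by the number of communities, and g_err is the sum
    of absolute differences between corresponding bins."""
    def binned(comms):
        counts = Counter((int(s // size_bin), int(d // density_bin))
                         for s, d in comms)
        n = len(comms)
        return {b: v / n for b, v in counts.items()}
    B_obs, B_gen = binned(communities_obs), binned(communities_gen)
    return sum(abs(B_obs.get(b, 0.0) - B_gen.get(b, 0.0))
               for b in set(B_obs) | set(B_gen))
```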
To evaluate a particular model, we execute enough iterations to let the model
stabilize. We determine stabilization by inspecting plots of the major parameters
(including the in-degree distribution, clustering coefficient, etc.). Then we
compare the sequence of graphs produced by the model after stabilization
to the sequence mined from LiveJournal.
Table 3 contains the results of executing models with various combinations
of local area and attachment mechanisms, compared to the average parameters of
the graphs of different observed weeks. Note that for the observed data, the average
parameter E is computed by comparing the distributions of graphs corresponding to various
Table 3. The stable parameters of graphs generated by various models compared to the parameters of the observed data

Area        Attach        GC        C         d        E        gerr
Observed 0.9545 0.0613 5.34 0.0289 0.00144
Global Uniform 0.9867 5.2 × 10−6 7.86 1.075 0.04215
Global P.A. (in) - - - - -
Global P.A. (out) 0.9688 0.00018 5.21 0.427 0.01189
3-Neighb. uniform 0.8939 0.00045 5.30 0.4331 0.01792
3-Neighb. P.A.(in) - - - - -
3-Neighb. P.A.(out) 0.9776 0.00133 4.53 0.1412 0.03504
Clusters uniform 0.9646 0.00252 6.73 0.7267 0.03484
Clusters P.A. (in) 0.9643 0.00149 6.88 0.1713 0.03811
Clusters P.A. (out) 0.9523 0.03156 6.56 0.5320 0.02034

observed weeks. Figure 8 compares the observed degree distribution to the ones
generated by some of the best area/attachment combinations. As defined in
Section 4, the edge history conveys information about how close the end points of
an observed edge were in the previous time cycle, and therefore measures the
significance of locality in the communications. Figure 8 also compares the observed
edge history with the edge histories produced by the best models.

7.1 Global Area Model

First, we consider the model with the global area, where vertices are aware of,
and can connect to, any other vertex in the network.
In the case of uniform attachment, the resulting model is very similar to
the Erdős–Rényi model. The in-degree distribution and other parameters generated
by such a model are, predictably, very different from the power-law degree
distribution of the observed graph.
The global area with preferential attachment strictly proportional to the in-degrees
of the vertices in the graph of the previous iteration results in the formation of
a "power house": a small set of vertices with very high in-degree that attracts all
of the out-degree. This effect is caused directly by preferential attachment: since
vertices with zero in-degree will never be attached to, any vertex that receives
no incoming edges at some iteration will not receive any incoming edges in
any of the following iterations. Clearly, a graph with a small set of vertices that
attract all of the in-degree is very different from the observed graph.
The combination of the global area and preferential attachment proportional to the
out-degrees of the vertices in the graph of the previous iteration produced results
that were more similar to the observed network than the other global models, but
still significantly worse than models with other area definitions
(k-neighborhood and union of clusters). Since this model allows the end points of
edges to be selected at random from the whole graph, the edge history (Figure 8)
is very different from the one observed in the real-life network.

7.2 k-Neighborhood Area Model

We experimented with different values of k (k ∈ {2, 3, 4}) and determined that
k = 3 produced the best models.
The combination of the 3-neighborhood area and uniform attachment produced a
model that showed mediocre results when compared to the observed parameters.
The combination of the 3-neighborhood area and preferential attachment proportional
to the in-degree produced a graph with a small power house in just a few
iterations. The 3-neighborhood area with preferential attachment proportional to the
out-degree produced a model that generated graphs with in-degree distributions
very similar to the observed graph; in particular, the power-law tail resembled
the tail of the observed graph. The edge history of this model was quite different
from the observed one, since most of the end points for new edges are selected such
that their distance in the previous iteration's graph was 3.

7.3 Union of Clusters Area Model

An area constructed via the union of clusters, in combination with preferential
attachment proportional to the in-degree, produced in-degree distributions more
similar to the observed ones than the other attachment mechanisms combined with this
area definition. This model also produced a sequence of graphs with an edge
history that was closest to the observed one, as is evident from Figure 8.

Fig. 8. In-degree and edge history distributions for various models and the observed network: (a) in-degree distribution, plotting P(d ≥ k) against k; (b) edge history distribution. Curves shown: Observed, Global + P.A. (out), Clusters + P.A. (in), and 3-Neighb + P.A. (out).

Models with this area definition were the only ones that produced non-trivial
edge stability, defined as the likelihood of a repetition of a recently observed
edge. To evaluate this stability, we consider the number of edges that appeared
more than once in 21 iterations of the model after stabilization. Models with
global and 3-neighborhood area definitions that did not result in the formation
of power houses produced sets of graphs in which less than 1% of the edges
appeared more than once. Models with an area defined by the union of clusters
produced sets of graphs in which, on average, 14% of the edges appeared more
than once in 21 iterations. In particular, the combination of this area definition
with preferential attachment proportional to the in-degree produced a sequence of
graphs in which 18% of the edges appeared more than once, while in the observed
network 40% of the edges appeared more than once during the 21 observed weeks
(see Figure 4).
After considering all of the parameters of the models, we determined the com-
bination of an area defined by the union of clusters with preferential attachment
proportional to the in-degrees of the vertices to be the best model to describe
the dynamics of communication in the observed network.

8 Conclusion
We have presented a set of statistics which display strong stability even for
a dynamic network such as the blogosphere. Our list of stable statistics is not

exhaustive. However, they represent a comprehensive set of interesting properties
of a network that any model for communication dynamics should capture.
Our experiments have shown that the communication dynamics of large social
networks are best explained as a result of local communication, where the
majority of members communicate within their social locality, a relatively small
set of nodes reflective of their interests or communities. The best approximation
to this locality, among the models we evaluated on LiveJournal data, was the
one determined by the union of the clusters a node belonged to. Our notion of a
cluster is a set of nodes that locally maximizes a cluster density. This notion
of a cluster has the important property that it allows clusters to overlap, which
is important if a cluster is to represent a social community or coalition.
Many possibilities exist for enhancing the definitions of locality and the attach-
ment mechanisms. One direction which we intend to pursue as future research
is the combination of local with global attachment mechanisms.

Acknowledgment. This material is based upon work partially supported by the
U.S. National Science Foundation (NSF) under Grant Nos. IIS-0621303, IIS-0522672,
IIS-0324947, CNS-0323324, and IIS-0634875, by the U.S. Office of Naval Research
(ONR) Contract N00014-06-1-0466, and by the U.S. Department of Homeland Security
(DHS) through the Center for Dynamic Data Analysis for Homeland Security,
administered through ONR grant number N00014-07-1-0150 to Rutgers University. The content
of this paper does not necessarily reflect the position or policy of the U.S. Government,
and no official endorsement should be inferred or implied.

References

1. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Reviews of
Modern Physics 74, 47–97 (2002)
2. Barabási, A.-L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution
of the social network of scientific collaborations. Physica A 311, 590–614 (2002)
3. Baumes, J., Chen, H.-C., Francisco, M., Goldberg, M., Magdon-Ismail, M., Wallace,
W.: Dynamics of bridging and bonding in social groups, a multi-agent model. In:
Third conference of the North American Association for Computational Social and
Organizational Science (NAACSOS 2005), Notre-Dame, Indiana, June 26-28 (2005)
4. Baumes, J., Goldberg, M., Krishnamoorthy, M., Magdon-Ismail, M., Preston, N.:
Finding communities by clustering a graph into overlapping subgraphs. In: Pro-
ceedings of IADIS International Conference, Applied Computing 2005, pp. 97–104
(2005)
5. Baumes, J., Goldberg, M., Magdon-Ismail, M.: Efficient identification of overlap-
ping communities. In: IEEE International Conference on Intelligence and Security
Informatics (ISI), May 2005, pp. 27–36 (2005)
6. Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Identification of hid-
den groups in communications. In: Handbooks in Information Systems, National
Security, vol. 2 (2007)
7. Berger-Wolf, T.Y., Saia, J.: A framework for analysis of dynamic social networks.
DIMACS Technical Report 28 (2005)

8. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R.,
Tomkins, A., Wiener, J.: Graph structure in the web. Computer Networks 33(1-6),
309–320 (2000)
9. Chung, F., Lu, L.: Connected components in random graphs with given expected
degree sequences. Annals of Combinatorics 6, 125–145 (2002)
10. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical
data (2007)
11. Doreian, P., Stokman, F.N.: Evolution of social networks. Gordon and Breach
(1997)
12. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the
Internet topology. In: SIGCOMM, pp. 251–262 (1999)
13. Gkantsidis, C., Mihail, M., Zegura, E.: The Markov chain simulation method for
generating connected power law random graphs. In: Proc. of ALENEX 2003, pp.
16–25 (2003)
14. Goh, K.-I., Eom, Y.-H., Jeong, H., Kahng, B., Kim, D.: Structure and evolution
of online social relationships: Heterogeneity in unrestricted discussions. Physical
Review E (Statistical, Nonlinear, and Soft Matter Physics) 73(6), 066123 (2006)
15. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: A locality model for
the evolution of blog networks. In: IEEE Information and Security Informatics, ISI
(2008)
16. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: Stable statistics of
the blogograph. In: Interdisciplinary Studies in Information Privacy and Security
(2008)
17. Kleinberg, J.M., Lawrence, S.: The structure of the web. Science 294, 1849–1850 (2001)
18. Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Sci-
ence 311, 88–90 (2006)
19. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of
blogspace. Communications of the ACM 47(12), 35–39 (2004)
20. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social net-
works. In: KDD 2006 (2006)
21. Newman, M.: The structure and function of complex networks. SIAM Review 45(2),
167–256 (2003)
22. Newman, M., Barabási, A.-L., Watts, D.: The structure and dynamics of networks.
Princeton University Press, Princeton (2006)
23. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl.
Acad. Sci. USA 98, 404 (2001)
24. Stauffer, A.O., Barbosa, V.C.: A study of the edge-switching Markov-chain method
for the generation of random graphs. arXiv:cs.DM/0512105 (2006)
25. West, D.B.: Introduction to graph theory. Prentice Hall, Upper Saddle River (2003)
Finding Spread Blockers in Dynamic Networks

Habiba 1,*, Yintao Yu 2,**, Tanya Y. Berger-Wolf 1,***, and Jared Saia 3,†

1 University of Illinois at Chicago, {hhabib3,tanyabw}@uic.edu
2 University of Illinois at Urbana-Champaign, yintao@uiuc.edu
3 University of New Mexico, saia@cs.unm.edu

Abstract. Social interactions are conduits for various processes spreading
through a population, from rumors and opinions to behaviors and diseases.
In the context of the spread of a disease or undesirable behavior, it is
important to identify blockers: individuals that are most effective in stopping
or slowing down the spread of a process through the population. This
problem has so far resisted systematic algorithmic solutions. In an effort to
formulate practical solutions, in this paper we ask: are there structural network
measures that are indicative of the best blockers in dynamic social networks?
Our contribution is two-fold. First, we extend standard structural
network measures to dynamic networks. Second, we compare the blocking
ability of individuals in the order of their ranking by the new dynamic measures.
We found that, overall, simple ranking according to a node's static
degree, or the dynamic version of a node's degree, performed consistently
well. Surprisingly, the dynamic clustering coefficient seems to be a good
indicator, while its static version performs worse than random ranking.
This provides simple, practical, and locally computable algorithms for
identifying key blockers in a network.

1 Introduction
How can we stop a process spreading through a social network? This problem
has applications to diverse areas such as preventing or inhibiting the spread of
diseases [7, 26, 40], computer viruses1 [8, 22], rumors, and undesirable fads or
risky behaviors [23, 24, 37, 38]. A common approach to spread inhibition is to

* No last name. Work supported in part by the Fulbright fellowship.
** Work performed in part while being a visiting student at the University of New
Mexico.
*** Work supported in part by the NSF grant IIS-0705822 and NSF CAREER Award
0747369.
† Work supported in part by the NSF grant IIS-0705822, NSF CAREER Award
0644058, and an AFOSR MURI award.
1 In particular, we are concerned with computer malware that spreads through social
networks, such as email viruses and worms, cell-phone viruses, and other related
malware such as the recent MySpace worm.

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 55–76, 2010.
© Springer-Verlag Berlin Heidelberg 2010

identify key individuals whose removal will most dampen the spread. In the
context of the spread of a disease, it is a question of finding individuals to be
quarantined, inoculated, or vaccinated so that the disease is prevented from
becoming an epidemic. We call this set of key individuals the blockers of the
spreading process.
There has been significant previous work related to studying and controlling
the spread of dynamic processes in a network [9,10,11,16,18,22,23,26,35,40,43,
44, 46, 47, 51, 54, 57, 59, 60, 67]. Unfortunately, these results have three properties
rendering them ineffective for identifying good blockers in large networks. First,
many proposed algorithms focus on a slightly different objective: they aim to
identify nodes that will be most effective in starting the spread of a process rather
than blocking it [44, 47]; or alternatively, nodes that would be most effective in
sensing that a process has started to spread, and where the process initiated
[9, 10, 11]. In this paper, we are focused specifically on identifying those nodes
that are good blockers. Second, algorithms proposed in previous work all require
computationally expensive calculations of some global properties over the entire
network, or rely on expensive, repeated stochastic simulations of the spread of a
dynamic process. In this paper, we present heuristics that identify good blockers
quickly, based only on local information.
Finally, perhaps the most critical problem in previous work is the omission of
the dynamic nature of social interactions. The very nature of a spreading process
implies an explicit time axis [52]. For example, the flow of information through a
social network depends on who starts out with the information when, and which
individuals are in contact at the starting point with the information carrier [43].
In this paper, we consider explicitly dynamic networks, defined in Section 3.1.
In these networks, we study the social interactions over a finite period of time,
measured in discrete time steps.
The main contributions of this paper are summarized below.

– We formally define dynamic networks in Section 3.1. This representation of
networks encompasses the traditional "aggregate" view of networks defined in
Section 3.2 and adds an explicit temporal component to the interactions.
The time axis is necessary since most spreading processes take place on
networks that evolve over time.
– We formally define the problem of identifying key spread blockers in networks
in Section 3.3.
– We modify various network measures, such as the centrality measures and the
clustering coefficient, to incorporate the dynamic nature of the networks
(Section 3.4).
– We compare the reduction in the extent of spread when individuals are removed
from a network in the ranking order imposed by various network
measures. We identify measures that consistently give a good approximation
of the best spread blockers.
– We compare the difference in the sets of top blockers identified by various
measures.

– We extensively evaluate our methods on real networks (Section 5). We use
the Enron email network dataset, the MIT Reality Mining dataset, the DBLP
co-authorship network, and animal population networks of Grevy's zebras, Plains
zebras, and onagers.
Ultimately, we show that the dynamics of interactions matters and, moreover,
that simple local measures, such as degree, are highly indicative of an individual's
capacity to prevent the spread of a phenomenon in a population. The implication
of our results is that there are practical, scalable heuristics for identifying
quarantine and vaccination targets in order to prevent an epidemic.

2 Related Work
Dynamic phenomena such as opinions, information, fads, behavior, and disease
spread through a network by contacts and interactions among the entities of the
network. Such spreading phenomena have been studied in a number of domains
including epidemiology [22,26,40,51,54,57,59], diffusion of technological innova-
tions and adoption of new products [7, 16, 18, 23, 24, 38, 35, 44, 46, 60, 67], voting,
strikes, rumors [36, 37, 53, 68], as well as spread of contaminants in distribution
networks [8, 9, 10, 11, 46] and numerous others.
One of the fundamental questions about dynamic processes is: which individuals,
if removed from the network, would block the spread of such a process?
Several previous results have addressed the problem of identifying such indi-
viduals [26, 40, 43]. Eubank et al. [26] experimentally show that global graph
theoretic measures like expansion factor and overlap ratio are good indicators
for devising vaccination strategies in static networks. Cohen et al. [21] propose
another immunization strategy based on the aggregate network model. In partic-
ular, they propose an efficient method of picking high degree nodes in a network
to immunize, thus inhibiting the spread of disease. Kempe et al. [43] show that
a variant of the blocker identification problem is NP-hard. While these problems
and suggested approaches are similar to finding good blockers in a network, un-
fortunately, there are critical differences that make these results inappropriate
for our formulation. First of all, our objective is to minimize the expected extent
of spread in a network. We do not make any assumption about the source of the
spread. Second, almost all the above methods simplify the spreading process by
ignoring the time ordering of interactions.
There has also been significant related work on the problem of determining
where to place a small number of detectors in a network so as to minimize the time
required to detect the spread of a dynamic process, and, ideally, also the location
at which the spread began. Berger-Wolf et al. [9] give algorithms for the problem
of minimizing the size of the infected population before an outbreak is detected.
Berry et al. [10,11] give algorithms to strategically place sensors in utility distribu-
tion networks to minimize worst case time until detection. In [47], Leskovec et al.
demonstrate that many objectives of the detection problem exhibit the property
of submodularity. They exploit this fact to develop efficient and elegant algorithms
for placing detectors in a network. While the detection problem is related to the

problem of blocking a process, it is only concerned with detecting a spreading
process once, whereas a good blocker prevents multiple spreading paths. Moreover,
the algorithms proposed for the detection problem all require global information
and work only for a stable, relatively unchanging network.
Another related problem is that of identifying nodes in a network that are most
critical for spreading a dynamic process. Kempe et al. [44] show that identifying
key spreaders – individuals that help to propagate the spread most – is NP-hard,
but admits a simple greedy (1−1/e)-approximation. Later, Mossel and Roch [55]
showed that the general case of finding a set of nodes with the largest “influence”
is NP-hard, and has a (1 − 1/e − ε) approximation algorithm. Unfortunately, this
approximation algorithm is computationally intensive. Strong inapproximability
results for several variants of identifying nodes with high influence in social
networks have been shown in [19]. Asur et al. in [5] present an event based
characterization of critical behavior in interaction graphs for the purposes of
modeling evolution, link prediction, and influence maximization.
Finally, Aspnes et al. [4] have studied the inoculation problem from a graph
theoretic perspective. They show that finding an optimum inoculation strategy
is related to the sum-of-squares partition problem. Moreover, they show that the
social welfare of an inoculation strategy found when each node is a selfish agent
can be significantly less than the social welfare of an optimal inoculation strategy.

3 Definitions
Populations of individuals interacting over time are often represented as net-
works, or graphs, where the nodes correspond to individuals and a pairwise in-
teraction is represented as an edge between the corresponding individuals. The
idea of representing societies as networks of interacting individuals dates back to
Lewin's early work on group behavior [48]. Typically, there is a single network
representing all interactions that have happened during the entire observation
period. We call this representation an aggregate network (Section 3.2). In this
paper we use an explicitly dynamic network representation (Section 3.1) that
takes the history of interactions into account.

3.1 Dynamic Network

We represent a dynamic network as a series ⟨G1, . . . , GT⟩ of static networks, where
each Gt is a snapshot of the individuals and their interactions at time t. For this
work, we assume that the time during which the individuals are observed is
finite. For simplicity, we also assume that the time period is divided into discrete
steps {1, . . . , T}. The nontrivial problem of appropriate time discretization is
beyond the scope of this paper. We assume that an interaction between a pair
of individuals takes place within one time step.

Definition 1. Let {1, . . . , T} be a finite set of discrete time steps. Let V =
{1, . . . , n} be a set of individuals. Let Gt = (Vt, Et) be a graph representing the
snapshot of the network at time t, where Vt ⊆ V is the subset of the individuals V
observed at time t. An edge (ut, vt) ∈ Et if individuals u and v have interacted at
time t. Further, for all v ∈ V and t ∈ {1, . . . , T − 1}, the edges (vt, vt+1) ∈ E are
directed self-edges of individuals across time steps.
A dynamic network GD = ⟨G1, . . . , GT⟩ is the graph GD = (V, E) of the time
series of graphs Gt such that V = ∪t Vt and E = (∪t Et) ∪ (∪t<T {(vt, vt+1)}).

The definition is equivalent to an undirected multigraph representation in [43].
Figure 1 shows an example of several dynamic networks that have the same
unweighted aggregate network representation.


Fig. 1. Example of several dynamic networks that have the same unweighted aggregate
network representation. Figures (a)–(d) show dynamic networks of three individuals
interacting over four time steps. The solid edges represent interactions among
individuals in a time step, and empty circles are individuals observed during a time step.
While at any given time step some individuals may be unobserved, in this particular
example all the individuals are observed at all time steps. Figure (e) shows an
unweighted aggregate network that has the same interactions as every dynamic network
in the example. In figures (a)–(c) each edge of the aggregate representation has
multiplicity two, while in figure (d) every edge has multiplicity four.
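Definition 1's time-expanded graph can be sketched directly from a list of snapshots. This is an illustrative construction (names and data layout chosen here, not the authors'); as a simplification, consecutive copies of an individual are linked only when the individual is observed in both time steps.

```python
def dynamic_graph(snapshots):
    """Build the time-expanded graph G_D of Definition 1.

    snapshots: list [(V_1, E_1), ..., (V_T, E_T)], where V_t is a
    set of individuals and E_t a set of undirected interaction
    pairs. Returns (V, E): V contains node copies (v, t); E holds
    within-step interaction edges plus directed self-edges linking
    (v, t) to (v, t+1)."""
    V, E = set(), set()
    T = len(snapshots)
    for t, (Vt, Et) in enumerate(snapshots, start=1):
        V |= {(v, t) for v in Vt}
        E |= {((u, t), (v, t)) for u, v in Et}
    # directed self-edges across consecutive time steps (only when
    # the individual is observed in both snapshots - a simplification)
    for t in range(1, T):
        Vt, Vnext = snapshots[t - 1][0], snapshots[t][0]
        E |= {((v, t), (v, t + 1)) for v in Vt & Vnext}
    return V, E
```

On this representation, a time-respecting path must either stay within a snapshot or follow a self-edge forward in time, which is what makes spread computations order-aware.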

3.2 Aggregate Network

The aggregate network is the graph GA = (V, E) of individuals V and their
interactions E observed over a period of time. In this representation, an edge exists
between a pair of individuals if they have ever interacted during the observed
time period. Multiple interactions between a pair of individuals over time are
represented as a single, possibly weighted, edge or as multiple edges between them.
This representation provides an aggregate view of the population in which the
information about the timing and order of interactions is discarded. In this work we
represent aggregate networks as multigraphs.
Definition 2. Let {1, . . . , T } be a finite set of discrete time steps. Let Vt be the
set of individuals observed at time t and let Et be the set of interactions among
individuals Vt at t. Then the aggregate graph GA = (V, E) of such a network is
the set of individuals V and interactions E such that V = ∪t Vt and (u, v) ∈ E
if ∃(ut , vt ) ∈ Et at some time step t ∈ {1, . . . , T }.
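The aggregation of Definition 2 amounts to a union over the per-step edge sets; a minimal sketch (function name ours), keeping multiplicities as edge weights in place of an explicit multigraph:

```python
def aggregate(snapshots):
    """Collapse a time series of edge sets into one weighted edge dict:
    a pair is in the aggregate if the two individuals ever interacted,
    and the weight records the multiplicity. Timing is discarded."""
    weights = {}
    for edges in snapshots:
        for e in edges:
            key = frozenset(e)
            weights[key] = weights.get(key, 0) + 1
    return weights

# One pair interacting twice, another once:
agg = aggregate([{("a", "b")}, {("b", "c")}, {("a", "b")}])
```

Dropping the weights yields the unweighted aggregate network of Figure 1(e).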

Using this aggregate network model, the structure and properties of many social
networks have been studied from different perspectives [6,13,12,15,41]. However,
as we have mentioned, this and other similar models do not explicitly consider
the temporal aspect of the network.

3.3 Spread Blockers


We now formalize the notions of processes spreading in a network and individuals
blocking this spread.
Spread(.) is a function that gives the overall average extent of spread in
a network, that is, the expected number of individuals affected by a stochas-
tic spreading process within a specified number of time steps. The estimate of
the spread is dependent on the model of the spreading process, the structure
of the network, and, of course, the number of time steps under consideration.
Spreadv (.) is the expected spread in a network, when the spreading process is
initiated by the individual v. Given a model of a spreading process M and a dis-
tribution of the probability of infection X : E → [0, 1], we define the spreading
functions as follows:
Spreadv : {Networks × Spread Models × Probability × Time} → R+

Spread(G, M, X, T) = (1/|V|) ∑v∈V Spreadv(G, M, X, T).     (1)
The limit equilibrium state of spread is denoted by

Spread(G, M, X) = Spread(G, M, X, ∞).     (2)

For a fixed spread model, probability distribution, and time period we will use
the overloaded shorthand notation Spread(G).
We define BlX (.) as a function that measures the reduction in the expected
spread size after removing the set X of individuals from the network. Hence, the
blocking capacity of a single individual v, Blv (.), is the reduction in expected
spread size after removing individual v from the network.
BlX : {Networks × Spread Models × Probability × Time} → R+

BlX(G) = BlX(G, M, X, T) = Spread(G) − Spread(G \ X).     (3)

kBl(.) is the function that finds the maximum possible reduction in spread in
a network when a set of individuals of size k is removed from the network. Notice
that the value of this function is always at least k. The argmax of this function
finds the best blocker(s) in a network.

kBl(G) = max_{X⊆V, |X|=k} BlX(G).     (4)

Thus, finding the best blockers in the network is equivalent to finding the (set
of) individuals whose removal from the network minimizes the expected extent
of spread:

kBl(G) = Spread(G) − min_{X⊆V, |X|=k} Spread(G \ X).     (5)
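As a minimal sketch of Equations (3)–(4) (helper names ours; the spread estimator below is a toy stand-in for the stochastic simulation of Section 3.5), the blocking value of a set X and the exhaustive kBl reduce to:

```python
from itertools import combinations

def remove(snapshots, X):
    """G \\ X: drop individuals in X and all their interactions."""
    X = set(X)
    return [{e for e in es if not (set(e) & X)} for es in snapshots]

def blocking_value(spread, snapshots, X):
    """Bl_X(G) = Spread(G) - Spread(G \\ X), for any spread estimator."""
    return spread(snapshots) - spread(remove(snapshots, X))

def k_block(spread, snapshots, individuals, k):
    """kBl(G): exhaustive maximum over all size-k sets (Eq. 4)."""
    return max(blocking_value(spread, snapshots, X)
               for X in combinations(individuals, k))

# Toy spread estimator for illustration: count endpoints still interacting.
toy_spread = lambda snaps: len({w for es in snaps for e in es for w in e})
```

With spread replaced by Monte Carlo estimates of the Independent Cascade extent, the argmax over the same candidate sets yields the best blockers; the exhaustive search is of course exponential in k.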

This definition of the individuals’ blocking capacity by removal corresponds, in
the disease spread context, to the quarantine action. Vaccination or inoculation
leaves the node in the network but deactivates its ability to propagate the spread.
For the Independent Cascade model of spread (Section 3.5) the two actions are
equivalent at the abstract level of estimating the spread and identifying blockers
in networks.
Since no good analytical approaches are known for identifying blockers in
networks, in this paper we focus on examining the possibility of using structural
network measures as practical indicators of nodes’ blocking ability. We next
briefly define the structural measures used in this paper.

3.4 Network Structural Measures

In network analysis various properties of the graph representing the population


are studied as proxies of the properties of the individuals, their interactions,
and the population itself. For example, the degree, various centrality measures,
clustering coefficients, or the eigenvalues (PageRank) of the nodes have been
used to determine the relative importance of the individuals, e.g., [17, 42]. Be-
tweenness centrality has been used to identify cohesive communities [33] and the
distributions of shortest path lengths employed to measure the “navigability” of
the network [66]. These and many other graph theoretic measures have been
translated to many social properties [50, 56, 57].
The ability of an individual to block a process spreading over a network can
be seen as another such social property. Graph measures such as clustering and
assortative mixing coefficients have been used to design local vaccination strate-
gies [40]. However, it is not clear that those are the best network measures to be
used as an indicator of a node which is a good blocker. In this paper we evaluate
the power of all the standard network measures of a node to indicate the block-
ing ability of the corresponding individual. Moreover, we extend the standard
static measures to reflect the dynamic nature of the underlying network. We
examine the following measures: degree, average degree, betweenness and closeness
centralities, and clustering coefficient. We modify those to incorporate the time
ordering of the interactions.
We use the following terms interchangeably in this paper: individuals or nodes
are the vertices of the network, and interactions are edges that can be either directed
or undirected. Neighbors of a node, N (.), are the set of nodes adjacent to it. The
subscript T with a function name indicates the dynamic variant of the function.
We now state the standard network measures for aggregate networks and
define corresponding measures for dynamic networks. We focus first on the global
measures that summarize the entire network and then address local measures
that characterize a node.

Global Structural Properties

Density is the proportion of the number of edges |E| present in a network
relative to the possible number of edges (|V| choose 2) = |V|(|V| − 1)/2:

D(G) = |E| / (|V| choose 2).     (6)

Dynamic Density is the average density of an observed time snapshot:

DT(G) = (1/T) ∑1≤t≤T D(Gt).     (7)

In the example in Figure 1, the density of the aggregate network in (e) is 2/3.
However, the dynamic density of the networks (a), (b), and (c) is 1/3 while the
dynamic density of (d) is 2/3.
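Both quantities are straightforward to compute; a sketch (function names ours) that reproduces the Figure 1 values:

```python
def density(edges, n):
    """D(G) = |E| / (n choose 2) for n individuals."""
    return len(edges) / (n * (n - 1) / 2)

def dynamic_density(snapshots, n):
    """D_T(G): the average density over the observed snapshots."""
    return sum(density(es, n) for es in snapshots) / len(snapshots)

# Aggregate network (e): two of the three possible edges are present.
d_agg = density({("a", "b"), ("b", "c")}, 3)              # 2/3
# Networks (a)-(c): one edge in each of the four snapshots.
d_dyn = dynamic_density([{("a", "b")}, {("b", "c")},
                         {("a", "b")}, {("b", "c")}], 3)  # 1/3
```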
Path between a pair of nodes u, v is a sequence of distinct nodes u = v1 , v2 , . . . , vp
= v with every consecutive pair of nodes connected by an edge (vi , vi+1 ) ∈ E.
Temporal Path between u, v is a time-respecting path in a dynamic network.
It is a sequence of nodes u = v1 , . . . , vp = v where each (vi , vi+1 ) ∈ E is an
edge in Et for some t. Also, for any i, j such that i + 1 < j, if vi ∈ Vt and
vj ∈ Vs then t < s. The length of a temporal path is the number of time
steps it spans. Note that this definition allows only the immediate neighborhood
of a node to be reached within one time step.
In the example in Figure 1, while there is a path from c to a in the aggregate net-
work (e), there is no temporal path from c to a in the dynamic network (b). All
the temporal paths from a to c in the dynamic networks (a)–(d) are of length 2.
Diameter is the length of the longest shortest path. In dynamic networks, it is
the length of the longest shortest temporal path.
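Shortest temporal paths under this one-hop-per-step rule can be found with a single forward sweep over the snapshots; a sketch (function name ours):

```python
def earliest_arrival(snapshots, source):
    """Earliest time step at which each individual is reachable from
    `source` along a temporal path, moving at most one hop per step.
    Individuals with no temporal path from source are absent."""
    reached = {source: 0}
    for t, edges in enumerate(snapshots, start=1):
        newly = {}
        for e in edges:
            u, v = tuple(e)
            for a, b in ((u, v), (v, u)):
                if a in reached and b not in reached and b not in newly:
                    newly[b] = t        # first reached at step t
        reached.update(newly)           # update only after the whole step,
                                        # so a node cannot relay in the
                                        # same step it was reached
    return reached
```

For instance, over the snapshots [{(a,b)}, {(b,c)}] a cascade started at a reaches c at step 2, so the temporal path a–b–c has length 2; the dynamic diameter is then the maximum finite arrival time over all source nodes.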

Local Node Properties


Degree of a node is the number of its unique neighbors. It is perhaps the sim-
plest measure of the influence of an individual: the more neighbors one has,
the higher the chances of reaching a larger proportion of a population.
Dynamic Degree is the change in the neighborhood of an individual over time,
the rate at which new friends are gained. Let N (ut ) be the neighborhood
of individual u at time step t. The relative change in the neighborhood,
weighted by the current degree, is then:2

(|N(ut−1) △ N(ut)| / |N(ut−1) ∪ N(ut)|) · |N(ut)|.     (8)

The Dynamic Degree DEGT of u is the total accumulated rate of friend
addition:

DEGT(u) = ∑1<t≤T (|N(ut−1) △ N(ut)| / |N(ut−1) ∪ N(ut)|) · |N(ut)|.     (9)

Note that here we consider a friend to be “new” if it was not a friend in the
previous time step. The definition is easily extended to incorporate a longer
term memory of friendship. The dynamic degree captures the gregariousness
of an individual, an important quality from a spreading perspective.
2
Here △ denotes the symmetric difference of the sets.
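A direct transcription of Equation (9) from the sequence of neighborhoods N(u1), …, N(uT) (function name ours; the per-step neighbor sets below are chosen to reproduce the degree values quoted for node b in Figure 1, not read off the figure itself):

```python
def dynamic_degree(neighborhoods):
    """DEG_T(u): accumulate, over consecutive steps, the relative
    neighborhood turnover |N(u_{t-1}) sym-diff N(u_t)| / |N(u_{t-1}) union N(u_t)|
    weighted by the current degree |N(u_t)| (Eq. 9)."""
    total = 0.0
    for prev, cur in zip(neighborhoods, neighborhoods[1:]):
        union = prev | cur
        if union:                      # skip 0/0 steps with no neighbors
            total += len(prev ^ cur) / len(union) * len(cur)
    return total

# Node b has aggregate degree 2, but:
b_in_a = dynamic_degree([{"a"}, {"c"}, {"a"}, {"c"}])   # 3.0, as in network (a)
b_in_d = dynamic_degree([{"a", "c"}] * 4)               # 0.0, as in network (d)
```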

Dynamic Average Degree is the average over all time steps of the number of
interactions of an individual in each time step:

AVG-DEG(u) = (1/T) ∑1≤t≤T DEG(ut),     (10)

where DEG(ut ) is the size of the neighborhood of u at time step t.

The dynamic degree, unlike its standard aggregate version, carries the informa-
tion of the timing of interactions and is sensitive to the order, concurrency and
delay among the interactions. For example, in Figure 1, the degree of the node b
in the aggregate network (e) is 2. However, its dynamic degree in (a) is 3, in (b)
is 1, and in (c) and (d) is 0. The dynamic average degree, on the other hand, does
not change when the order of interactions in a dynamic network is perturbed.
It just tells us the average connectivity of an individual in the observed time
period. In all the dynamic networks (a)–(c) the average dynamic degree of b is
1, while in (d) it is 2.

Nodes in Neighborhood (NNk) is the number of nodes in the local k-neighborhood
of an individual. The number of nodes in the 1-neighborhood is
precisely the degree of an individual. We extend the measure by considering
the 2- and 3-neighborhoods of each individual.
Edges in Neighborhood (ENk) is the number of edges in the local k-neighborhood
of an individual. We compute the edges in neighborhood for the 1-, 2-,
and 3-hop neighborhoods of each individual. This measure loosely captures
the local density of the neighborhood of an individual.
Betweenness of an individual is the sum of fractions of all shortest paths be-
tween all pairs of individuals that pass through this individual. It is a pa-
rameter that measures the importance of individuals in a network based on
their position on the shortest paths connecting pairs of non-adjacent indi-
viduals [3, 31, 32].
Dynamic Betweenness of an individual is the fraction of all shortest temporal
paths that pass through it. Intuitively, the edges in a temporal path appear
in the increasing time order. This concept of betweenness incorporates the
measure of a delay between interactions as well as the individual being at
the right place at the right time. We present in detail different flavors of
the traditional betweenness centrality concept in dynamic networks, based
on position, time, and duration of interactions among individuals, in [39]. In
this paper, for technical reasons, we use the concept of temporal betweenness.
Let gst be the number of shortest temporal paths between s and t, gst (u) of
which pass through u. Then the temporal betweenness centrality, BT (u), of
a node u is the sum of fractions of all s–t shortest temporal paths passing
through the node u:

BT(u) = ∑s≠t≠u gst(u)/gst.     (11)

Closeness of an individual is the average (geodesic) distance of the individual
to any other individual in the network [32, 62].
Dynamic Closeness of an individual is the average time it takes that individual
to reach any other individual in the network. Dynamic closeness
is based on shortest temporal paths and the geodesic is defined as the time
duration of such paths. Let dT (u, v) be the length of the shortest temporal
path from u to v. Following the definition in [62] we define dynamic closeness
as follows:

CT(u) = 1 / ∑v∈V\{u} dT(u, v).     (12)

Clustering Coefficient of an individual is the fraction of its neighbors who
are neighbors among themselves [58].
Dynamic Clustering Coefficient is the sum of the fractions of an individual’s
neighbors who have been friends among themselves in previous time
steps. That is, the dynamic clustering coefficient measures how many of
your friends are already friends. Let CF (ut ) be the number of friends of u
that are already friends among themselves by time step t. Then the dynamic
clustering coefficient is defined as follows:

CCT(u) = ∑0≤t<T CF(ut) / (|N(ut)|(|N(ut)| − 1)).     (13)

Consider the example in Figure 2. The clustering coefficient of all three nodes in
the static network is the same and equal to 1. However, the situation in the two
dynamic networks is completely different. In network (a) the dynamic clustering
coefficient of nodes a and c is 0 while that of node b is 1. In network (b),
on the other hand, the dynamic clustering coefficient of all the nodes is 0 since
when b meets a and c they do not yet know each other.
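A sketch of Equation (13) (function name ours), counting at each step the pairs of u's current neighbors that interacted in strictly earlier steps, with the ordered-pair denominator |N(ut)|(|N(ut)| − 1); the interaction order below mirrors our reading of Figure 2(a), where a and c meet before b meets both:

```python
def dynamic_clustering(snapshots, u):
    """CC_T(u): per step, the fraction of u's neighbor pairs that were
    already friends in earlier steps, summed over all steps (Eq. 13)."""
    seen = set()          # unordered pairs that have interacted so far
    total = 0.0
    for edges in snapshots:
        pairs = {frozenset(e) for e in edges}
        nbrs = sorted({w for p in pairs if u in p for w in p if w != u})
        k = len(nbrs)
        if k > 1:
            already = sum(1 for i in range(k) for j in range(i + 1, k)
                          if frozenset((nbrs[i], nbrs[j])) in seen)
            total += 2 * already / (k * (k - 1))   # ordered-pair count
        seen |= pairs     # only strictly earlier steps count as "already"
    return total

# a and c meet first; then b meets both, closing the triangle.
snaps = [{("a", "c")}, {("a", "b"), ("b", "c")}]
```

Reversing the two snapshots gives 0 for every node, matching the network (b) discussion above.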
Apart from the measures defined above we also compute PageRank [14] of
nodes.

Fig. 2. Example of two dynamic networks (a) and (b) that have the same aggregate
network representation (c)

3.5 Spreading Model

A propagation process in a network can be described formally using many models
of transmission over the edges in that network. The fundamental assumption
of all such models is that the phenomenon is spreading over and only over the
edges in the network and, thus, the topology of the network defines the dynamics
of the spread. For this paper we use the Independent Cascade model of diffu-
sion in networks. Independent Cascade is one variant of the conditional decision
model [34, 65]. The spreading phenomenon cascades through the network based on
the simplifying assumption that each individual bases its decision to adopt
or reject the spreading phenomenon on the status of each of its neighbors
independently. The independent cascade model was first introduced in [34, 35] in
the context of word-of-mouth marketing. This is also the most commonly used
simple model to study disease transmission in networks [22, 51, 54, 57, 59] and
is closely related to the simplest Susceptible-Infectious-Recovered (SIR) models
from epidemiology [2]. In the Independent Cascade model, transmission from
one individual to another happens independently of interactions those individu-
als have with all the other individuals.
The Independent Cascade model describes a spreading process in terms
of two types of individuals, active and inactive. The process unfolds in discrete
synchronized time steps. In each time step, each active individual attempts to
activate each of its inactive neighbors. The activation of each inactive neighbor
is determined by a probability of success. If an active individual succeeds in
affecting any of its neighbors, those neighbors become active in the next time
step. Each attempt of activation is independent of all previous attempts as well
as the attempts of any other active individual to activate a common neighbor.
More formally, let GD = (V, E) be a dynamic network, A0 ⊆ V be a set
of active individuals, and puv be the probability of influence of u on v. For
simplicity, we assume p is uniform for all V and remains fixed for the entire
period of simulation. The uniform probability values also ensure that we test how
the blocking ability of individuals depends solely on the structure of the network,
controlling for other parameters that may affect this ability. An active individual
ut ∈ A0 at time step t tries to activate each of its currently inactive neighbors
vt with a probability p, independent of all the other neighbors. If ut succeeds in
activating vt at time step t, then vt will be active in step t + 1, whether or not
(ut+1 , vt+1 ) ∈ Et+1 . If ut fails in activating vt , and at any subsequent time step
ut+i gets reconnected to the still inactive vt+i , it will again try to activate vt+i .
The process runs for a finite number of time steps T . We denote by σ(A0 ) = AT
the correspondence between the initial set A0 and the resulting set of active
individuals AT . We call the size of the set AT , |AT |, the extent of spread.
The spreading process in the independent cascade model in a dynamic network
is different from the aggregate network in one important aspect. In the aggregate
case, each individual u uses all its attempts of activating each of its inactive
neighbors v with the same probability p in one time step t. This is the time
step right after the individual u itself becomes active. After that single attempt
the active individual becomes latent: that is, it is active but unable to activate

others. However in the dynamic network model as defined above, the active
individuals never become latent during the spreading process. For this paper, we
only consider the progressive case in which an individual converts from inactive
to active but never reverses (no recovery in the epidemiological model). It is
a particularly important case in the context of identifying blockers since the
blocking action is typically done before any recovery.
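The dynamic-network variant of the model described above can be sketched as follows (function names ours; uniform probability p as in the experiments):

```python
import random

def independent_cascade(snapshots, seeds, p, rng=None):
    """Progressive Independent Cascade on a dynamic network: at each
    step t every active individual independently tries, with probability
    p, to activate each currently inactive neighbor in E_t; successes
    become active at t + 1 and, being progressive, never deactivate."""
    rng = rng or random.Random()
    active = set(seeds)
    for edges in snapshots:
        newly = set()
        for e in edges:
            u, v = tuple(e)
            for src, dst in ((u, v), (v, u)):
                # each attempt is independent, including repeated attempts
                # on a common neighbor by different active individuals
                if src in active and dst not in active and rng.random() < p:
                    newly.add(dst)
        active |= newly                 # activations take effect at t + 1
    return active                       # A_T; |A_T| is the extent of spread

def expected_spread(snapshots, individuals, p, runs=3000, rng=None):
    """Spread(G): average extent over cascades seeded at each node."""
    rng = rng or random.Random(0)
    total = sum(len(independent_cascade(snapshots, {v}, p, rng))
                for v in individuals for _ in range(runs))
    return total / (len(individuals) * runs)
```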

4 Experimental Setup

We evaluate the effectiveness of each of the network structural measures as
indicators of an individual’s blocking capacity under the Independent Cascade
spreading model.

4.1 The Protocol

For each measure and for each dynamic network dataset, we follow the following
steps:

1. Order the individuals 0, . . . , |V | − 1 according to the ranking imposed by the
   measure.
2. For i = 0 to |V | − 1 do:
   (a) Remove node i from G = (V, E).
   (b) Estimate the extent of spread in G \ {i} by averaging over stochastic
       simulations of the Independent Cascade model initiated at each node in
       turn, 3000 iterations for each starting node.3
   (c) If the extent of spread is less than 10% of the nodes in the original
       network then STOP.
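The steps above can be sketched as a removal loop (names ours; estimate_spread stands for the Monte Carlo estimate of step 2(b)):

```python
def removal_curve(ranked, estimate_spread, snapshots, n, stop_frac=0.10):
    """Remove individuals in the order given by `ranked` until the
    estimated extent of spread falls below stop_frac of the original
    n individuals; return the removed prefix and the spread curve."""
    removed, curve = [], []
    current = snapshots
    for v in ranked:
        current = [{e for e in es if v not in e} for es in current]
        removed.append(v)
        spread = estimate_spread(current)
        curve.append(spread)
        if spread < stop_frac * n:
            break
    return removed, curve
```

The number of removals needed, len(removed), is the score by which the measures are then compared.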

We compare the power of each measure to serve as a proxy indicator for the
blocking ability of an individual based on the number of individuals that had
to be removed in the ordering imposed by that measure in order to achieve this
reduction to 10%.

4.2 Probability of Activation

We conducted the Independent Cascade spreading experiments on a variety of


networks with diverse global structural properties such as density, diameter,
and average path length. In each network, we assigned a different probability of
activation based on the structure of the network. A probability value that for
some networks allowed the spread to propagate to only a small portion of the
nodes resulted, for other networks, in immediate spread to the entire network.
The following is the procedure we used to find a meaningful probability of
activation for a given network.
3
This is more than sufficient for convergence.

1. For a given G = (V, E), run the Independent Cascade spreading process
   with p = 1. Note that this is a deterministic process.
2. Calculate the average extent of spread S in G = (V, E). This is the average
   size of a connected component in G.
3. Rerun the spreading process while setting p < 1. Calculate the average extent
   of spread in the network. Repeatedly reduce p until the average extent of
   spread is half of S.
4. Set the probability of activation for G equal to p.
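A sketch of this calibration loop (names ours; estimate_spread maps an activation probability to the average extent of spread on the fixed network):

```python
def calibrate_p(estimate_spread, step=0.05):
    """Lower p from 1.0 in small decrements until the average extent of
    spread drops to half the deterministic (p = 1) extent S."""
    S = estimate_spread(1.0)    # average connected-component size
    p = 1.0
    while p > step and estimate_spread(p) > S / 2:
        p -= step
    return p
```

In practice each estimate_spread(p) call is itself a Monte Carlo average, so a coarser grid, or bisection over p, may be preferable to a fixed linear sweep.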
We use the following measures for comparison: dynamic and aggregate versions
of degree, betweenness, closeness centralities, and clustering coefficient, as well
as the average dynamic degree (turnover rate). For the global measures of be-
tweenness and closeness we locally approximate them within 1-, 2-, and 3-hop
neighborhoods. For the datasets with directed interactions we also use PageRank
and approximate it within 1-, 2-, and 3-neighborhoods as well. We also rank
individuals by the number of nodes and edges within their 1-, 2-, and 3-hop
neighborhoods. Overall, we experimented with 26 different measures.
We compare the structural measures to a random ordering of nodes as an
upper bound and the best blockers identified by an exhaustive search as the
lower bound.

4.3 Lower Bound: Best Blockers


We identify the best blockers one at a time using exhaustive search over all the
individuals. To find one best blocker, we remove each individual, in turn, from
the network and estimate the extent of spread using stochastic simulations of
the Independent Cascade model in the remaining network. The best blocker,
then, is the individual whose removal results in the minimum extent of spread.
We then repeat the process with the remaining individuals. This
process imposes another ranking on the nodes.
Optimally, one needs to identify the set of top k blockers. However, this prob-
lem is computationally hard and an exhaustive search is infeasible. We have
conducted limited experiments on the datasets considered in this paper and in
all cases the set of iterative best k blockers equals the set of top k blockers.
This preliminary result warrants future investigation and rigorous evaluation.
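The iterative search can be sketched as follows (names ours; estimate_spread again stands for the Monte Carlo spread estimate):

```python
def greedy_best_blockers(estimate_spread, snapshots, individuals, k):
    """Pick, k times in a row, the individual whose removal minimizes
    the estimated extent of spread in the current (already reduced)
    network; each pick imposes the next position in the ranking."""
    remaining = set(individuals)
    current = snapshots
    blockers = []
    for _ in range(k):
        def without(v, net=current):
            return [{e for e in es if v not in e} for es in net]
        best = min(remaining, key=lambda v: estimate_spread(without(v)))
        blockers.append(best)
        remaining.discard(best)
        current = without(best)
    return blockers
```

Each round costs |V| spread estimates, versus the C(|V|, k) estimates an exhaustive top-k search would need.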

5 Datasets
We now describe the datasets used in the experiments.
Grevy’s: Populations of Grevy’s zebras (Equus grevyi) were observed by bi-
ologists [29, 30, 61, 63] over a period of June–August 2002 in the Laikipia
region of Kenya. Predetermined census loops were driven on a regular basis
(approximately twice per week) and individuals were identified by unique
stripe patterns. Upon sighting, an individual’s GPS location was taken. In
the resulting dynamic network, each node represents an individual animal
and two animals are interacting if their GPS locations are the same. The
dataset contains 28 individuals interacting over a period of 44 time steps.

Onagers: Populations of wild asses (Equus hemionus), also known as onagers,


were observed by biologists [61, 63] in the Little Rann of Kutch, a desert in
Gujarat, India, during January–May 2003. These data are also obtained from
visual scans, as in the Grevy’s zebra case. The dataset contains 29 individuals
over 82 time steps.
DBLP: This data set is a sample of the Digital Bibliography and Library Project
[49]. This is a bibliographic dataset of publications in Computer Science. We
use a cleaned version of the data from 1967–2005. In the dynamic network
each node represents an individual author and two authors are interacting if
they are co-authors on a paper. A year is one time step. The sample we used
contains 1374 individuals and 38 time steps. We use this dataset to compare
the dynamic and the static networks.
The DBLP dataset is sparse, with many small connected components. In
fact, the average size of a connected component (using temporal paths) is
0.03 × |V |. Thus, the expected extent of spread in this network cannot exceed
3%. For DBLP we set the stopping criterion for removing blockers from the
network at 1% of the population being affected, rather than the 10% used
for other datasets.
Reality Mining: The Reality Mining experiment is one of the largest mobile
phone projects attempted in academia. The data were collected by the MIT
Media Lab [25]. They captured communication, proximity, location, and
activity information from 100 subjects at MIT over the course of the
2004–2005 academic year. These data represent over 350,000 collective
hours (∼40 years) of human behavior.
Reality Mining data were collected by recording the bluetooth scans of each
device every five minutes. We have quantized the data into 4-hour intervals for
the dynamic network representation, based on the analysis by [20].
Enron: The Enron e-mail corpus is a publicly available database of e-mails sent
by and to employees of the now defunct Enron corporation4. Timestamps,
senders and lists of recipients were extracted from message headers for each
e-mail on file. We chose a day as the time step, and a directed interaction is
present if an e-mail was sent between two individuals.
We used the version of the dataset restricted to the 150 employees of
the Enron organization who were actually subpoenaed. The raw Enron corpus
contains 619,446 messages belonging to 158 users [1, 45].
UMass: Co-location of individuals in a population of students at the University
of Massachusetts Amherst; data collected via portable motes5 .

The following table provides a summary of the statistics of the networks we use
in our experiments.

4
Available with a full description at http://www.cs.cmu.edu/~enron/
5
Available with a full description at http://kdl.cs.umass.edu/data/msn/msn-info.html

Table 1. Dynamic network dataset statistics. Here V is the number of individuals,
E is the number of edges, T is the number of time steps, D is density and DT is
dynamic density, d is the diameter within a connected component and dT is the
dynamic diameter, p is the average shortest path length and pT is the average
temporal shortest path length, and r is the number of reachable pairs and rT is
the number of temporally reachable pairs.

V E T D DT d dT p pT r rT
Grevy’s 28 779 44 0.30 0.52 4 36 1.84 4.81 518 432
Onagers 29 402 82 0.36 0.24 3 74 1.66 7.51 756 617
DBLP 1374 2262 38 0.002 0.09 15 37 5.54 5.12 900070 58146
Enron 147 7406 701 0.04 0.14 6 618 2.66 461.24 19620 16474
MIT 96 67107 2940 0.68 0.18 2 315 1.32 4.21 9120 9114
UMass 20 2664 693 0.72 0.35 2 8 1.28 3.71 380 374

6 Results and Discussion

For each of the datasets we have evaluated all the structural network measures to
determine how effectively they serve to identify good blockers. To recap, we rank
nodes by each measure and remove them from the network in that order. After
removing each node we measure the expected extent of spread in the network
using simulations. We compare the effect of each measure’s ordering to that of
a random ordering and the brute force best blockers ordering. Figure 3 shows
results for two datasets, Onagers and Enron, that are representative of the results
on all the datasets. The results for the other datasets are omitted due to space
limitations. For all the plots, the x-axis is the number of individuals removed
and the y-axis shows the corresponding extent of spread. The lower the extent
of spread after removal, the better the blocking capacity of the individuals
removed. Thus, curves lower on the plot correspond to measures that serve
as better indicators of individuals’ blocking power.

Fig. 3. [Best viewed in color.] Comparison of the reduction of extent of spread after
removal of nodes ranked by various measures in Onagers and Enron datasets

The comparison of all the measures showed that four measures performed con-
sistently well as blocker indicators: degree in aggregate network, the number of
edges in the immediate aggregate neighborhood (local density), dynamic average
degree, and dynamic clustering coefficient. This is good news from the practical
point of view of designing epidemic response strategies since all the measures are
simple, local, and easily scalable. Figure 4 shows the results of the comparison of
those four best measures, as well as the best possible and random orderings, for
all the datasets. Surprisingly, while the local density and the dynamic cluster-
ing coefficients seem to be good indicators, the aggregate clustering coefficient
turned out to be the worst, often performing worse than a random ordering. Be-
tweenness and closeness measures performed inconsistently. PageRank did not
perform well in the only dataset with directed interactions (Enron)6 .
As seen in Figure 4, the ease of blocking the spread depends very much on the
structure of the dynamic network. In the two bluetooth datasets, MIT Reality
Mining and UMass, all orderings, including the random, performed similarly.
Those are well-connected networks, as evidenced by the large difference between
the dynamic diameter and the average shortest temporal path.
reduce the extent of spread to below 10% of the original population seems to be
trivially removing nearly 90% of the individual population. On the other hand,
Enron and DBLP, the sparsely connected datasets, show the opposite trend of
being easily blockable by a good ranking measure.
When rankings of different measures result in a similar blocking ability we ask
whether it is due to the fact that the measures rank individuals in a similar way

Fig. 4. [Best viewed in color.] Comparison of the reduction of the extent of spread
after removal of nodes ranked by the best 4 measures. The x-axis shows the number of
individuals removed and the y-axis shows the average spread size after the removal of
individuals.

6
On undirected graphs, PageRank is equivalent to degree in aggregate network.

Table 2. Average rank difference between the rankings induced by every two of the
best four measures (and the best possible ordering). A dash indicates a comparison
that was not computed for that dataset.

Dataset | AvgDEG vs DynCC | AvgDEG vs ENN1 | AvgDEG vs DEG | DynCC vs ENN1 | Best vs AvgDEG | DynCC vs DEG | Best vs DynCC | DEG vs ENN1 | Best vs ENN1 | Best vs DEG
Grevy’s | 4.5 | 4.64 | 4.79 | 3.86 | 4.5 | 2.86 | 2.64 | 5.57 | 5 | 1.14
Onagers | 3.59 | 4.48 | 3.31 | 3.52 | 4.69 | 4.14 | 2.97 | 6.07 | 6 | 2
DBLP | - | - | - | - | 430.76 | 71.3 | 78.49 | 434.21 | 428.25 | 77.22
Enron | 21.95 | 50.01 | 27.29 | 21.02 | 46.37 | 22.56 | 21.93 | 44.35 | 44.95 | 25.32
MIT | - | - | - | - | 4.88 | 14.4 | 14.48 | 14.33 | 14.27 | 2.25
UMass | 4.6 | 4.6 | 3 | 2.7 | 0 | 3.3 | 3.1 | 3.3 | 3.1 | 1

and, thus, identify the same set of good blockers or, rather, different measures
identify different sets of good blockers. To answer this question, we compared
the sets of the top-ranked blockers identified by the four best measures as well
as the best possible ordering. We compute the average rank difference between
the sets of individuals ranked at the top by each pair of measures. Table 2 shows
the pairwise difference in ranks. In general, there is little correspondence between the
rankings imposed by various measures. The only strong relationship, as expected,
is between the number of edges in the neighborhood of a node and its degree in
the aggregate network.
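The paper does not spell out the exact formula for the rank difference; one plausible reading, sketched here with illustrative names, averages the absolute difference in rank positions over the individuals ranked in the top k by the first measure:

```python
def average_rank_difference(order_x, order_y, top_k):
    """Mean |rank_x(v) - rank_y(v)| over the top_k individuals of
    order_x, where each order is a list from best- to worst-ranked."""
    rank_x = {v: i for i, v in enumerate(order_x)}
    rank_y = {v: i for i, v in enumerate(order_y)}
    return sum(abs(rank_x[v] - rank_y[v]) for v in order_x[:top_k]) / top_k
```

Under this reading, a value near 0 means two measures order the top blockers almost identically.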
We further explore the difference in the sets of the top ranked individuals
by computing the size of the common intersection of all the top sets ranked
by the four measures and the best possible ranking. We use the size of the set
determined by the best possible ordering as the reference set size for all measures.
Table 3 shows the size of the common intersection for all datasets. Again, we see

Table 3. The size of the common intersection of all the top sets ranked by the four
measures and the best ranking. Set size is the size of the sets determined by the best
blocking ordering. Intersection size is the number of individuals in the common
intersection, and intersection fraction is the size of the intersection as a fraction
of the set size.

Dataset         Set size  Inter. size  Inter. frac

Grevy’s         5         2            0.40
Onagers         9         3            0.33
DBLP            16        0            0
Enron           13        4            0.31
Reality Mining  60        48           0.80
UMass           12        10           0.83

a strong effect of the structure of the network. The MIT Reality Mining and the
UMass datasets have the largest intersection size. On the other hand, in DBLP
the four measures produced very different top ranked sets, yet all four measures
were extremely good indicators of the blockers. In other networks, while there
are some individuals that are clearly good blockers according to all measures,
there is a significant difference among the measures. Overall, these results lead to
two future directions: 1) investigating the effect of the overall network structure
on the “blockability” of the network; and 2) designing consensus techniques that
combine rankings by various measures into a possibly better list of blockers.

7 Conclusions and Future Work

In this paper we have investigated the task of preventing a dynamic process,
such as disease or information, from spreading through a network of social
interactions. We have formulated the problem of identifying good blockers: nodes
teractions. We have formulated the problem of identifying good blockers: nodes
whose removal results in the maximum reduction in the extent of spread in the
network. In the absence of good computational techniques for finding such nodes
efficiently, we have focused on identifying structural network measures that are
indicative of whether or not a node is a good blocker. Since the timing and
order of interactions is critical in propagating many spreading phenomena, we
focused on explicitly dynamic networks. We, thus, extended many standard net-
work measures, such as degree, betweenness, closeness, and clustering coefficient,
to the dynamic setting. We also approximated global network measures locally
within a node’s neighborhood. Overall, we considered 26 different measures as
candidate proxies for the blocking ability of a node.
We conducted experiments on six dynamic network datasets spanning a range
of contexts, sizes, density, and other parameters. We compared the extent of
spread while removing one node at a time according to the ranking of nodes
imposed by each measure. Overall, four structural measures performed consis-
tently well in all datasets and were close to identifying the overall best blockers.
These four measures were node degree, number of edges in node’s neighborhood,
dynamic average degree, and dynamic clustering coefficient. The traditional ag-
gregate clustering coefficient and dynamic closeness performed the worst, often
worse than a random ordering of nodes. All four best measures are local, simple,
and scalable, and thus can potentially be used to design good practical epidemic
prevention strategies. However, before such policy decisions are made, we need
to verify that our results hold true in other, larger and more complete datasets
and for realistic disease spread models.
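The evaluation protocol — remove top-ranked nodes, then re-measure the extent of spread — can be sketched with a toy deterministic susceptible-infected process on a timestamped edge list. This is only an illustration under simplifying assumptions; the paper's actual spread model and datasets are not reproduced here:

```python
# Deterministic SI spread on a timestamped edge list. Removing a node drops
# all of its interactions; the extent of spread is the number infected.

def spread_extent(edges, seed, removed=frozenset()):
    """edges: list of (time, u, v) interactions, processed in time order.
    An infected endpoint of an interaction infects the other endpoint."""
    if seed in removed:
        return 0
    infected = {seed}
    for _, u, v in sorted(edges):
        if u in removed or v in removed:
            continue  # a removed blocker cannot interact
        if u in infected or v in infected:
            infected.update((u, v))
    return len(infected)

edges = [(1, "a", "b"), (2, "b", "c"), (3, "c", "d"), (4, "d", "e")]
print(spread_extent(edges, "a"))                 # 5: the chain infects everyone
print(spread_extent(edges, "a", removed={"c"}))  # 2: removing c blocks the chain
```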
The striking disparity between the performance of the dynamic and aggre-
gate clustering coefficient indicates the necessity of taking the dynamic nature
of interactions explicitly into consideration in network analysis. Moreover, this
disparity justifies the extension of traditional network measures and methods
to the dynamic setting. In future work, we plan to further investigate the in-
formativeness of a range of dynamic network measures in various application
contexts.
Finding Spread Blockers in Dynamic Networks 73

We have also compared the sets of nodes ranked at the top by various mea-
sures. Interestingly, in the networks in which it was difficult to block dynamic
spread, all the measures resulted in very similar rankings of individuals. In con-
trast, in the networks where the removal of a small set of individuals was sufficient
to reduce the spread significantly, the best measures gave very different rankings
of individuals. Thus, there seems to be a dichotomy in the real-world networks
we studied. On one hand, there are dense networks (e.g. MIT Reality Mining
and UMass datasets) in which it is inherently challenging to block a spreading
process and all measures perform similarly badly. On the other hand, there are
sparse networks where it seems to be easy to stop the spread and there are many
ways to do it. In future work, we will investigate the specific global structural
attributes of a network that delineate this difference between networks for which
it is hard or easy to identify good blockers.
The comparison of the top ranked sets also shows that while there may be
some common nodes ranked high by all measures, there is a significant difference
among the measures. Yet, all the rankings perform comparably well. Thus, there
is a need to test a consensus approach that combines the top-ranked sets of the
various measures into one set of good candidate blockers. This is similar to
combining the top-k lists returned as a web search result [27].
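As an illustration, the following sketch merges several measure rankings with a simple Borda count. This is only one possible consensus scheme, not the top-k aggregation methods of [27]:

```python
# Borda-count consensus: a node at position p in a ranking of n nodes
# scores n - p points; nodes are ordered by total score (ties by name).
from collections import defaultdict

def borda_consensus(rankings):
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, node in enumerate(ranking):
            scores[node] += n - pos
    return sorted(scores, key=lambda node: (-scores[node], node))

rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
print(borda_consensus(rankings))  # ['a', 'b', 'c']
```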
This paper focused on the practical approaches to identifying good blockers.
However, the theoretical structure of the problem is not well understood and
so far has defied good approximation algorithms. Recent developments in the
analysis of non-monotonic submodular functions [28, 64] may be applicable to
variants of the problem and may result in good approximation guarantees.

References
1. Adibi, J.: Enron email dataset, http://www.isi.edu/~adibi/Enron/Enron.htm
2. Anderson, R.M., May, R.M.: Infectious Diseases of Humans: Dynamics and Control.
Oxford University Press, Oxford (1992)
3. Anthonisse, J.: The rush in a graph. Mathematische Centrum, Amsterdam (1971)
4. Aspnes, J., Chang, K., Yampolskiy, A.: Inoculation strategies for victims of viruses
and the sum-of-squares partition problem. J. Comput. Syst. Sci. 72(6), 1077–1093
(2006)
5. Asur, S., Parthasarathy, S., Ucar, D.: An event-based framework of characterizing
the evolutionary behavior of interaction graphs. In: Proceedings of the Thirteenth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(2007)
6. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution
of the social network of scientific collaborations. Physica A: Statistical Mechanics
and its Applications 311(3-4), 590–614 (2002)
7. Berger, E.: Dynamic monopolies of constant size. J. Combin. Theory Series B 83,
191–200 (2001)
8. Berger, N., Borgs, C., Chayes, J.T., Saberi, A.: On the spread of viruses on the
internet. In: SODA 2005: Proceedings of the sixteenth annual ACM-SIAM sym-
posium on Discrete algorithms, Philadelphia, PA, USA, pp. 301–310. Society for
Industrial and Applied Mathematics (2005)

9. Berger-Wolf, T., Hart, W., Saia, J.: Discrete sensor placement problems in distri-
bution networks. Mathematical and Computer Modelling (2005)
10. Berry, J., Fleischer, L., Hart, W., Phillips, C., Watson, J.: Sensor placement in
municipal water networks. Journal of Water Resources Planning and Manage-
ment 131(3) (2005a)
11. Berry, J., Hart, W., Phillips, C., Uber, J.G., Watson, J.: Sensor placement in munic-
ipal water networks with temporal integer programming models. Journal of Water
Resources Planning and Management 132(4), 218–224 (2006)
12. Börner, K., Dall’Asta, L., Ke, W., Vespignani, A.: Studying the emerging global
brain: Analyzing and visualizing the impact of co-authorship teams. Complexity,
Special issue on Understanding Complex Systems 10(4), 57–67 (2005)
13. Börner, K., Maru, J., Goldstone, R.: The simultaneous evolution of author and
paper networks. PNAS 101(suppl. 1), 5266–5273 (2004)
14. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
In: WWW7: Proceedings of the 7th International Conference on World Wide Web
7, pp. 107–117. Elsevier Science Publishers B. V., Amsterdam (1998)
15. Broido, A., Claffy, K.: Internet topology: connectivity of IP graphs. In: Proceedings
of SPIE ITCom (2001)
16. Carley, K.: Communicating new ideas: The potential impact of information and
telecommunication technology. Technology in Society 18(2), 219–230 (1996)
17. Carreras, I., Miorandi, D., Canright, G., Engø-Monsen, K.: Eigenvector central-
ity in highly partitioned mobile networks: Principles and applications. Studies in
Computational Intelligence (SCI) 69, 123–145 (2007)
18. Chen, L., Carley, K.: The impact of social networks in the propagation of computer
viruses and countermeasures. IEEE Transactions on Systems, Man and Cybernetics
(forthcoming)
19. Chen, N.: On the approximability of influence in social networks. In: ACM-SIAM
Symposium on Discrete Algorithms (SODA), pp. 1029–1037 (2008)
20. Clauset, A., Eagle, N.: Persistence and periodicity in a dynamic proximity network
(unpublished manuscript)
21. Cohen, R., Havlin, S., ben Avraham, D.: Efficient immunization strategies for com-
puter networks and populations. Physical Review Letters (2003)
22. Dezsö, Z., Barabási, A.-L.: Halting viruses in scale-free networks. Physical Review
E 65(055103(R)) (2002)
23. Domingos, P.: Mining social networks for viral marketing. IEEE Intelligent Sys-
tems 20, 80–82 (2005)
24. Domingos, P., Richardson, M.: Mining the network value of customers. In: Seventh
International Conference on Knowledge Discovery and Data Mining (2001)
25. Eagle, N., Pentland, A.: Reality mining: Sensing complex social systems. Journal
of Personal and Ubiquitous Computing (2006)
26. Eubank, S., Guclu, H., Kumar, V., Marathe, M., Srinivasan, A., Toroczkai, Z.,
Wang, N.: Modelling disease outbreaks in realistic urban social networks. Na-
ture 429, 180–184 (2004) (supplement material)
27. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: SODA 2003: Proc.,
14th ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp.
28–36. Society for Industrial and Applied Mathematics (2003)
28. Feige, U., Mirrokni, V., Vondrák, J.: Maximizing non-monotone submodular func-
tions. In: Foundations of Computer Science, FOCS (2007)

29. Fischhoff, I.R., Sundaresan, S.R., Cordingley, J., Larkin, H.M., Sellier, M.-J.,
Rubenstein, D.I.: Social relationships and reproductive state influence leadership
roles in movements of plains zebra (Equus burchellii). Animal Behaviour 73(5),
825–831 (2007)
30. Fischhoff, I.R., Sundaresan, S.R., Cordingley, J., Rubenstein, D.I.: Habitat use and
movements of plains zebra (Equus burchelli) in response to predation danger from
lions. Behavioral Ecology 18(4), 725–729 (2007)
31. Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40,
35–41 (1977)
32. Freeman, L.C.: Centrality in social networks: I. conceptual clarification. Social
Networks 1, 215–239 (1979)
33. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proc. Natl. Acad. Sci. 99, 8271–8276 (2002)
34. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems
look at the underlying process of word-of-mouth. Marketing Letters 12(3), 211–
223 (2001)
35. Goldenberg, J., Libai, B., Muller, E.: Using complex systems analysis to advance
marketing theory development. Academy of Marketing Science Review (2001)
36. Granovetter, M.: The strength of weak ties. American J. Sociology 78(6), 1360–
1380 (1973)
37. Granovetter, M.: Threshold models of collective behavior. American J. Sociol-
ogy 83(6), 1420–1443 (1978)
38. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through
blogspace. In: WWW 2004: Proc. 13th Intl Conf on World Wide Web, pp. 491–501.
ACM Press, New York (2004)
39. Habiba, C.T., Berger-Wolf, T.Y.: Betweenness centrality in dynamic networks.
Technical Report 2007-19, DIMACS (2007)
40. Holme, P.: Efficient local strategies for vaccination and network attack. Europhys.
Lett. 68(6), 908–914 (2004)
41. Hopcroft, J., Khan, O., Kulis, B., Selman, B.: Natural communities in large linked
networks. In: Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and
Data Mining, pp. 541–546 (2003)
42. Jordán, F., Benedek, J., Podani, Z.: Quantifying positional importance in food
webs: A comparison of centrality indices. Ecological Modelling 205, 270–275 (2007)
43. Kempe, D., Kleinberg, J., Kumar, A.: Connectivity and inference problems for
temporal networks. J. Comput. Syst. Sci. 64(4), 820–842 (2002)
44. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and
Data Mining (2003)
45. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification re-
search. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML
2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004)
46. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing.
In: EC 2006: Proceedings of the 7th ACM conference on Electronic commerce, pp.
228–237. ACM Press, New York (2006)
47. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J.: Cost-effective
outbreak detection in networks. In: Proc. 13th ACM SIGKDD Intl. Conf. on Knowl-
edge Discovery and Data Mining (2007)
48. Lewin, K.: Principles of Topological Psychology. McGraw Hill, New York (1936)
49. Ley, M.: Digital bibliography & library project (DBLP) (December 2005); A digital
copy of the database has been provided by the author, http://dblp.uni-trier.de/

50. Liljeros, F., Edling, C., Amaral, L.N.: Sexual networks: Implication for the trans-
mission of sexually transmitted infection. Microbes and Infection (2003)
51. May, R.M., Lloyd, A.L.: Infection dynamics on scale-free networks. Physical Review
E 64(066112) (2001)
52. Moody, J.: The importance of relationship timing for diffusion. Social Forces (2002)
53. Moreno, Y., Nekovee, M., Pacheco, A.F.: Dynamics of rumor spreading in
complex networks. Physical Review E (Statistical, Nonlinear, and Soft Matter
Physics) 69(6), 066130 (2004)
54. Morris, M.: Epidemiology and social networks: modeling structured diffusion. Soci-
ological Methods and Research 22(1), 99–126 (1993)
55. Mossel, E., Roch, S.: On the submodularity of influence in social networks. In: The
Annual ACM Symposium on Theory of Computing (STOC) (2007)
56. Newman, M.: The structure and function of complex networks. SIAM Review 45,
167–256 (2003)
57. Newman, M.E.: Spread of epidemic disease on networks. Physical Review
E 66(016128) (2002)
58. Newman, M.E.J.: Scientific collaboration networks. i. network construction and
fundamental results. Physical Review E 64, 016131 (2001)
59. Pastor-Satorras, R., Vespignani, A.: Epidemic spreading in scale-free networks.
Phys. Rev. Lett. 86(14), 3200–3203 (2001)
60. Rogers, E.M.: Diffusion of Innovations, 5th edn. Simon & Shuster, Inc., New York
(2003)
61. Rubenstein, D.I., Sundaresan, S., Fischhoff, I., Saltz, D.: Social networks in wild
asses: Comparing patterns and processes among populations. In: Stubbe, A.,
Kaczensky, P., Samjaa, R., Wesche, K., Stubbe, M. (eds.) Exploration into the
Biological Resources of Mongolia, vol. 10, pp. 159–176. Martin-Luther-University,
Halle-Wittenberg (2007)
62. Sabidussi, G.: The centrality index of a graph. Psychometrika 31, 581–603 (1966)
63. Sundaresan, S.R., Fischhoff, I.R., Dushoff, J., Rubenstein, D.I.: Network metrics
reveal differences in social organization between two fission-fusion species, Grevy’s
zebra and onager. Oecologia 151, 140–149 (2007)
64. Vredeveld, T., Lenstra, J.: On local search for the generalized graph coloring prob-
lem. Operations Research Letters 31, 28–34 (2003)
65. Watts, D.: A simple model of global cascades on random networks. PNAS 99,
5766–5771 (2002)
66. Watts, D., Strogatz, S.: Collective dynamics of small-world networks. Nature 393,
440–442 (1998)
67. Young, H.P.: Innovation diffusion and population heterogeneity, Working paper
(2006)
68. Zanette, D.H.: Dynamics of rumor propagation on small-world networks. Phys.
Rev. E 65(4), 041908 (2002)
Social Network Mining with Nonparametric
Relational Models

Zhao Xu^1, Volker Tresp^2, Achim Rettinger^3, and Kristian Kersting^1

^1 Fraunhofer IAIS, Germany
   1stname.2ndname@iais.fraunhofer.de
^2 Siemens Corporate Technology, Germany
   volker.tresp@siemens.com
^3 Technical University of Munich, Germany
   achim.rettinger@cs.tum.edu

Abstract. Statistical relational learning (SRL) provides effective tech-
niques to analyze social network data with rich collections of objects and
complex networks. Infinite hidden relational models (IHRMs) introduce
nonparametric mixture models into relational learning and have been
successful in many relational applications. In this paper we explore the
modeling and analysis of complex social networks with IHRMs for com-
munity detection, link prediction and product recommendation. In an
IHRM-based social network model, each edge is associated with a ran-
dom variable and the probabilistic dependencies between these random
variables are specified by the model, based on the relational structure.
The hidden variables, one for each object, are able to transport informa-
tion such that non-local probabilistic dependencies can be obtained. The
model can be used to predict entity attributes, to predict relationships
between entities and it performs an interpretable cluster analysis. We
demonstrate the performance of IHRMs with three social network appli-
cations. We perform community analysis on Sampson’s monastery data
and link analysis on the Bernard & Killworth data. Finally, we apply
IHRMs to the MovieLens data to predict user preferences for movies and
to analyze user clusters and movie clusters.

Keywords: Statistical Relational Learning, Social Network Analysis,


Nonparametric Mixture Models, Dirichlet Process, Variational Inference.

1 Introduction
Social network mining has gained in importance due to the growing availability
of data on novel social networks, e.g. citation networks (DBLP, Citeseer), SNS
websites (Facebook), and social media websites (Last.fm). Social networks usu-
ally consist of rich collections of objects, which are linked into complex networks.
Generally, social network data can be graphically represented as a sociogram as
illustrated in Fig. 1 (left). In this simple social network, there are persons, person
profiles (e.g., gender), and these persons are linked together via friendships. Some
interesting applications in social network mining include community discovery,
relationship prediction, social recommendation, etc.

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 77–96, 2010.
© Springer-Verlag Berlin Heidelberg 2010
78 Z. Xu et al.

Statistical relational learning (SRL) [8,17,11] is an emerging area of machine
learning research, which attempts to combine expressive knowledge representation
formalisms with statistical approaches to perform probabilistic inference
and learning on relational networks. Fig. 1 (right) shows a simple SRL model for
the above sociogram example. For each potential edge, a random variable (RV) is
introduced that describes the state of the edge. For example, there is a RV asso-
ciated with the edge between person 1 and person 2. The binary variable
is Yes if the two persons are friends and No otherwise. The edge between an
object (e.g., person 1) and object property (e.g., Male) is also associated with
a RV, whose value describes the person’s profile. In the running example, all
variables are binary. To infer the quantities of interest, e.g., whether the person
1 and the person 2 are friends, we need to learn the probabilistic dependencies
between the random variables. Here we assume that friendship is conditioned
on the profiles (gender) of the involved persons, as shown in Fig. 1 (right). The
directed arcs, for example the ones between G1 and R1,2 and between G2 and
R1,2, specify that the probability that person 1 and person 2 are friends depends
on their respective profiles. Given the probabilistic model, we can learn
the parameters and predict the relationships of interest.

Fig. 1. Left: A simple sociogram. Right: A probabilistic model for the sociogram. Each
edge is associated with a random variable that determines the state of the edge. The
directed arcs indicate direct probabilistic dependencies.

In this simple relational model of a social network, friendship is locally pre-
dicted by the profiles of the involved objects: whether a person is a friend of
another person depends only on the profiles of the two persons. Given that
the parameters are fixed, and given the parent attributes, all friendships are
independent of each other, so correlations between friendships, i.e., the
collaborative effect, cannot be taken into account. To overcome this limitation, struc-
tural learning might be employed to obtain non-local dependencies, but structural
learning in complex relational networks is considered a hard problem [9]. Non-
local dependencies can also be achieved by introducing for each person a hidden
variable as proposed in [24]. The state of the hidden variable represents unknown
attributes of the person, e.g. the particular habit of making friends with certain

persons. The hidden variable of a person is now the only parent of its profiles and
is one of the parents of the friendships in which the person potentially partici-
pates. Since the hidden variables are of central importance, this model is referred
to as the hidden relational model (HRM). In relational domains, different classes
of objects generally require a class-specific complexity in the hidden representa-
tion. Thus, it is sensible to work with a nonparametric method, the Dirichlet process
(DP) mixture model, in which each object class can optimize its own representa-
tional complexity in a self-organized way. Conceptually, the number of states
of the hidden variables in the HRM model becomes infinite. In practice, the
DP mixture sampling process only occupies a finite number of components. The
combination of the hidden relational model and the DP mixture model is the
infinite hidden relational model (IHRM) [24].
The IHRM model was first presented in [24]. This paper is an extended
version of [25] and we explore social network modeling and analysis with IHRM
for community detection, link prediction, and product recommendation. We
present two methods for efficient inference: one is blocked Gibbs sampling with
a truncated stick-breaking (TSB) construction, the other is a mean-field
approximation with TSB. We perform empirical analysis on three social network
datasets: Sampson’s monastery data, the Bernard & Killworth data, and
the MovieLens data. The paper is organized as follows. In the next section,
we analyze the modeling of complex social network data with IHRMs.
In Sec. 3 we describe a Gibbs sampling method and a mean-field approximation
for inference in the IHRM model. Sec. 4 gives the experimental analysis on so-
cial network data. We review some related work in Sec. 5. Before concluding, an
extension to IHRMs is discussed in Sec. 6.

2 Model Description
Based on the analysis in Sec. 1, we will give a detailed description of the IHRM
model for social network data. In this section, we first introduce the finite hidden
relational model (HRM), and then extend it to an infinite version (IHRM). In
addition, we provide a generative model describing how to generate data from
an IHRM model.

2.1 Hidden Relational Model


A hidden relational model (HRM) for a simple sociogram is shown in Fig. 2.
The basic innovation of the HRM model is to introduce a hidden variable,
denoted as Z in the figure, for each object (here: person). These can be thought of
as unknown attributes of the persons. We then assume that the attributes of a person only
depend on the hidden variable of the person, and a relationship only depends on
the hidden variables of the persons involved in the relationship. This implies that
if the hidden variables were known, both person attributes and relationships
could be predicted well.
Given the HRM model shown in Fig. 2, information can propagate via inter-
connected hidden variables. Let us predict whether the person 2 will be a friend

Fig. 2. A hidden relational model (HRM) for a simple sociogram

of the person 3, i.e. predict the relationship R2,3. The probability is computed
based on the following evidence: (1) the attributes of the immediately related persons,
i.e. G2 and G3 , (2) the known relationships associated with the persons of inter-
est, i.e. the friendships R2,1 and R2,4 about the person 2, and the friendships R1,3
and R3,4 about the person 3, (3) high-order information transferred via hidden
variables, e.g. the information about G1 and G4 propagated via Z1 and Z4. If
the attributes of persons are informative, they will determine the hidden states
of the persons and therefore dominate the computation of the predictive probability of
relationship R2,3. Conversely, if the attributes of persons are weak, the hidden
state of a person might be determined by his relationships to other persons and
the hidden states of those persons. By introducing hidden variables, information
can be distributed globally in the ground network defined by the relational structure.
This reduces the need for extensive structural learning, which is particularly dif-
ficult in relational models due to the huge number of potential parents. Note that
a similar propagation of information can be observed in hidden Markov models
used in speech recognition or in the hidden Markov random fields used in image
analysis [26]. In fact, the HRM can be viewed as a directed generalization of
both for relational data.
Additionally, the HRM provides a cluster analysis of relational data. The state
of the hidden variable of an object corresponds to its cluster assignment. This
can be regarded as a generalization of the co-clustering model [13]. The HRM can
be applied to domains with multiple classes of objects and multiple classes of
relationships. Furthermore, relationships can be of arbitrary order, i.e. the HRM
is not constrained to binary and unary relationships [24]. Also note that the
sociogram is closely related to the resource description framework (RDF) graph
used as the basic data model in the Semantic Web [3] and to the entity-relationship
graph from database design.
We now complete the model by introducing the variables and parameters in
Fig. 2. There is a hidden variable Z_i for each person. The state of Z_i specifies
the cluster of the person i. Let K denote the number of clusters. Z follows
a multinomial distribution with parameter vector π = (π_1, . . . , π_K) (π_k > 0,
Σ_k π_k = 1), which specifies the probability of a person belonging to a cluster,
i.e. P(Z_i = k) = π_k. π is sometimes referred to as the mixing weights. It is drawn
from a conjugate Dirichlet prior with hyperparameters α_0.
All person attributes are assumed to be discrete and multinomial variables
(resp., binary and Bernoulli). Thus a particular person attribute G_i is a sample
drawn from a multinomial (resp., Bernoulli) distribution with parameters θ_k,
where k denotes the cluster assignment of the person. θ_k is sometimes referred
to as a mixture component, which is associated with the cluster k. For all persons,
there are in total K mixture components Θ = (θ_1, . . . , θ_K). Each person in the
cluster k inherits the mixture component, thus we have: P(G_i = s | Z_i = k, Θ) =
θ_{k,s} (θ_{k,s} > 0, Σ_s θ_{k,s} = 1). These mixture components are independently drawn
from a prior G_0. For computational efficiency, we assume that G_0 is a conjugate
Dirichlet prior with hyperparameters β.
We now consider the variables and parameters concerning the relationships
(FriendOf). The relationship R is assumed to be discrete with two states. A
particular relationship R_{i,j} between two persons (i and j) is a sample drawn
from a binomial distribution with a parameter φ_{k,ℓ}, where k and ℓ denote the
cluster assignments of person i and person j, respectively. There are in total K ×
K parameters φ_{k,ℓ}, and each φ_{k,ℓ} is independently drawn from the prior G_0^r. For
computational efficiency, we assume that G_0^r is a conjugate Beta distribution
with hyperparameters β^r.
From a mixture model point of view, the most interesting term in the HRM
model is φ_{k,ℓ}, which can be interpreted as a correlation mixture component. If a
person i is assigned to a cluster k, i.e. Z_i = k, then the person inherits not only
θ_k, but also φ_{k,ℓ}, ℓ = 1, . . . , K.
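The finite HRM just described can be summarized as a generative process. The sketch below, with an illustrative function name and arbitrarily chosen hyperparameter values, samples binary person attributes and a binary FriendOf relation (it is an interpretation of the model, not the authors' code):

```python
# Generate from a finite HRM: pi ~ Dirichlet, theta_k and phi_{k,l} ~ Beta,
# then cluster assignments Z, attributes G, and pairwise friendships R.
import random

def sample_hrm(n_persons, K, alpha0=1.0, beta=1.0, beta_r=1.0, rng=random):
    # Mixing weights pi ~ Dirichlet(alpha0, ..., alpha0), via gamma samples
    raw = [rng.gammavariate(alpha0, 1.0) for _ in range(K)]
    pi = [x / sum(raw) for x in raw]
    # Attribute components: theta_k = P(G_i = 1 | Z_i = k) ~ Beta(beta, beta)
    theta = [rng.betavariate(beta, beta) for _ in range(K)]
    # Correlation components: phi[k][l] ~ Beta(beta_r, beta_r)
    phi = [[rng.betavariate(beta_r, beta_r) for _ in range(K)] for _ in range(K)]
    # Cluster assignments, attributes, and friendships inherited from clusters
    Z = [rng.choices(range(K), weights=pi)[0] for _ in range(n_persons)]
    G = [int(rng.random() < theta[z]) for z in Z]
    R = {(i, j): int(rng.random() < phi[Z[i]][Z[j]])
         for i in range(n_persons) for j in range(i + 1, n_persons)}
    return Z, G, R

Z, G, R = sample_hrm(n_persons=5, K=2)
```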

2.2 Infinite Hidden Relational Model

Since hidden variables play a key role in the HRM model, we would expect that
the HRM model might require a flexible number of states for the hidden vari-
ables. Consider again the sociogram example. With little information about past
friendships, all persons might look the same; with more information available,
one might discover certain clusters in persons (different habits of making friends);
but with an increasing number of known friendships, clusters might show increas-
ingly detailed structure ultimately indicating that everyone is an individual. It
thus makes sense to permit an arbitrary number of clusters by using a Dirichlet
process mixture model. This permits the model to decide itself about the optimal
number of clusters and to adopt the optimal number with increasing data. For
our discussion it suffices to say that we obtain an infinite HRM by simply letting
the number of clusters approach infinity, K → ∞. Although from a theoretical
point of view there are indeed an infinite number of components, a sampling
procedure would only occupy a finite number of components.
The graphical representations of the IHRM and HRM models are identical, as
shown in Fig. 2. However, the definitions of variables and parameters are dif-
ferent. For example, the hidden variables Z of persons have infinitely many states,
and thus the parameter vector π is infinite-dimensional. The parameter is not generated
from a Dirichlet prior, but from a stick-breaking construction Stick(·|α_0) with a

hyperparameter α_0 (more details in the next section). Note that α_0 is a positive
real-valued scalar referred to as the concentration parameter in DP mixture
modeling. It determines the tendency of the model to use either a large or a
small number of states in the hidden variables [2]. If α_0 is chosen to be
small, only a few clusters are generated. If α_0 is chosen to be large, the coupling
is loose and more clusters are formed. Since there are an infinite number of clus-
ters, there are an infinite number of mixture components θ_k, each of which is
still independently drawn from G_0. G_0 is referred to as the base distribution in DP
mixture modeling.

2.3 Generative Model


Now we describe the generative model for the IHRM. There are mainly
two methods to generate samples from a Dirichlet process (DP) mixture model,
i.e. the Chinese restaurant process (CRP) [2] and the stick-breaking construction
(SBC) [22]. We will discuss how the SBC can be applied to the IHRM model (see
[24] for the CRP-based generative model). Notation is summarized in Table 1.

Table 1. Notation used in this paper

Symbol      Description
C           number of object classes
B           number of relationship classes
N^c         number of objects in a class c
α_0^c       concentration parameter of an object class c
e_i^c       an object indexed by i in a class c
A_i^c       an attribute of an object e_i^c
θ_k^c       mixture component indexed by a hidden state k in an object class c
G_0^c       base distribution of an object class c
β^c         parameters of a base distribution G_0^c
R_{i,j}^b   relationship of class b between objects i, j
φ_{k,ℓ}^b   correlation mixture component indexed by hidden states k for c_i and ℓ for
            c_j, where c_i and c_j are the object classes involved in a relationship class b
G_0^b       base distribution of a relationship class b
β^b         parameters of a base distribution G_0^b

The stick-breaking construction (SBC) [22] is a representation of DPs, by
which we can explicitly sample random distributions of attribute parameters
and relationship parameters. In the following we describe the generative model
of the IHRM in terms of the SBC.
1. For each object class c,
   (a) Draw mixing weights π^c ∼ Stick(·|α_0^c), defined as

       V_k^c ∼iid Beta(1, α_0^c);  π_1^c = V_1^c,  π_k^c = V_k^c ∏_{k'=1}^{k−1} (1 − V_{k'}^c),  k > 1.   (1)

   (b) Draw i.i.d. mixture components θ_k^c ∼ G_0^c, k = 1, 2, . . .



2. For each relationship class b between two object classes c_i and c_j, draw
   φ_{k,ℓ}^b ∼ G_0^b i.i.d. with component indices k for c_i and ℓ for c_j.
3. For each object e_i^c in a class c,
   (a) Draw cluster assignment Z_i^c ∼ Mult(·|π^c);
   (b) Draw object attributes A_i^c ∼ P(·|θ^c, Z_i^c).
4. For e_i^{c_i} and e_j^{c_j} with a relationship of class b, draw R_{i,j}^b ∼ P(·|φ^b, Z_i^{c_i}, Z_j^{c_j}).
The basic property of the SBC is that the distributions of the parameters (θ_k^c
and φ_{k,ℓ}^b) are sampled; e.g., the distribution of θ_k^c can be represented as
G^c = Σ_{k=1}^∞ π_k^c δ_{θ_k^c}, where δ_{θ_k^c} is a distribution with a point mass on θ_k^c.
Owing to this property, the SBC can sample objects independently; thus it might be efficient
when a large domain is involved.
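A sketch of the truncated version of this construction, with the last stick variable set to 1 as in the TSB inference scheme of Sec. 3.1 (the function name and parameter values are illustrative):

```python
# Truncated stick-breaking weights (cf. Eq. 1): V_k ~ Beta(1, alpha0) and
# pi_k = V_k * prod_{k' < k} (1 - V_{k'}); setting V_K = 1 truncates at K.
import random

def stick_breaking(alpha0, K, rng=random):
    v = [rng.betavariate(1.0, alpha0) for _ in range(K)]
    v[-1] = 1.0  # truncation: V_K = 1 makes the weights sum to one
    pi, remaining = [], 1.0
    for vk in v:
        pi.append(vk * remaining)   # break off a fraction vk of the stick
        remaining *= 1.0 - vk       # the rest of the stick
    return pi

weights = stick_breaking(alpha0=2.0, K=10)
print(round(sum(weights), 6))  # 1.0 by construction
```

A small alpha0 tends to put most of the mass on the first few weights, matching the remark above that a small concentration parameter yields few occupied clusters.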

3 Inference
The key inferential problem in the IHRM model is computing the posterior of the
unobservable variables given the data, i.e. P({π^c, Θ^c, Z^c}_c, {Φ^b}_b | D, {α_0^c, G_0^c}_c, {G_0^b}_b).
Unfortunately, the computation of the joint posterior is analytically intractable,
thus we consider approximate inference methods to solve the problem.

3.1 Inference with Gibbs Sampling


Markov chain Monte Carlo (MCMC) sampling has been used to approximate the
posterior distribution with a DP mixture prior. In this section, we describe
efficient blocked Gibbs sampling (GS) with a truncated stick breaking
representation [14] for the IHRM model. The advantage is that given the posterior
distributions, we can independently sample the hidden variables in a block, which
greatly accelerates the computation. The Markov chain is thus defined not only
on hidden variables, but also on parameters.
The truncated stick breaking construction (TSB) fixes a value K^c for each class
of objects by setting V_{K^c}^c = 1, which means the mixing weights π_k^c are equal
to 0 for k > K^c (see Eq. 1). The number of clusters is thus reduced to K^c.
Note that K^c is an additional parameter in the inference method.
At each iteration, we first update the hidden variables conditioned on the
parameters sampled in the last iteration, and then update the parameters con-
ditioned on the hidden variables. In detail:
1. For each class of objects,
   (a) Update each hidden variable Z_i^{c(t+1)} with probability proportional to:

       π_k^{c(t)} P(A_i^c | Z_i^{c(t+1)} = k, Θ^{c(t)}) ∏_b ∏_{j'} P(R_{i,j'}^b | Z_i^{c(t+1)} = k, Z_{j'}^{c_{j'}(t)}, Φ^{b(t)}),   (2)

       where A_i^c and R_{i,j'}^b denote the known attributes and relationships of
       object i, c_{j'} denotes the class of the object j', and Z_{j'}^{c_{j'}(t)}
       denotes the hidden variable of j' at the last iteration t. Intuitively, the
       equation represents to what extent the cluster k agrees with the data D_i^c
       about the object i.
84 Z. Xu et al.

   (b) Update π^{c(t+1)} as follows:
       i. Sample v_k^{c(t+1)} from Beta(λ_{k,1}^{c(t+1)}, λ_{k,2}^{c(t+1)}) for k = 1, . . . , K^c − 1, with

          λ_{k,1}^{c(t+1)} = 1 + Σ_{i=1}^{N^c} δ_k(Z_i^{c(t+1)}),   λ_{k,2}^{c(t+1)} = α_0^c + Σ_{k'=k+1}^{K^c} Σ_{i=1}^{N^c} δ_{k'}(Z_i^{c(t+1)}),   (3)

          and set v_{K^c}^{c(t+1)} = 1. δ_k(Z_i^{c(t+1)}) equals 1 if Z_i^{c(t+1)} = k and 0
          otherwise.
       ii. Compute π^{c(t+1)} as: π_k^{c(t+1)} = v_k^{c(t+1)} ∏_{k'=1}^{k-1} (1 − v_{k'}^{c(t+1)}) for k > 1
          and π_1^{c(t+1)} = v_1^{c(t+1)}.
2. Update θ_k^{c(t+1)} ∼ P(·|A^c, Z^{c(t+1)}, G_0^c) and φ_{k,ℓ}^{b(t+1)} ∼ P(·|R^b, Z^{(t+1)}, G_0^b). The
   parameters are drawn from their posterior distributions conditioned on the
   sampled hidden states. Again, since we assume conjugate priors as the base
   distributions (G_0^c and G_0^b), the simulation is tractable.
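Step 1(b) above (Eq. 3 followed by the stick product of Eq. 1) can be sketched as follows for a single object class; the function name and data layout are illustrative assumptions, not from the paper:

```python
import numpy as np

def update_mixing_weights(z, K, alpha0, rng=None):
    """One blocked Gibbs update of pi given sampled assignments z (Eqs. 3 and 1).

    z : int array with hidden states Z_i in {0, ..., K-1}.
    """
    rng = rng or np.random.default_rng(0)
    counts = np.bincount(z, minlength=K).astype(float)  # sum_i delta_k(Z_i)
    lam1 = 1.0 + counts                                 # lambda_{k,1}
    lam2 = alpha0 + (counts.sum() - np.cumsum(counts))  # lambda_{k,2}: counts in clusters k' > k
    v = rng.beta(lam1, lam2)
    v[-1] = 1.0                                         # V_{K^c} = 1
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return pi

z = np.array([0, 0, 1, 2, 1, 0])
pi = update_mixing_weights(z, K=4, alpha0=1.0)
assert abs(pi.sum() - 1.0) < 1e-9
```

Because every v_k is drawn in one vectorized call given the counts, the whole block is updated at once, which is the source of the speedup over per-object sampling discussed later.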
After convergence, we collect the last W samples to make predictions for the
relationships of interest. Note that in blocked Gibbs sampling, the MCMC sequence
is defined by hidden variables and parameters, including Z^{c(t)}, π^{c(t)}, Θ^{c(t)}, and
Φ^{b(t)}. The predictive distribution of a relationship R_{new,j}^b between a new object
e_{new}^c and a known object e_j^{c_j} is approximated as

P(R_{new,j}^b | D, {α_0^c, G_0^c}_{c=1}^C, {G_0^b}_{b=1}^B)
  ≈ (1/W) Σ_{t=w+1}^{w+W} P(R_{new,j}^b | D, {Z^{c(t)}, π^{c(t)}, Θ^{c(t)}}_{c=1}^C, {Φ^{b(t)}}_{b=1}^B)
  ∝ (1/W) Σ_{t=w+1}^{w+W} Σ_{k=1}^{K^c} P(R_{new,j}^b | φ_{k,ℓ}^{b(t)}) π_k^{c(t)} P(A_{new}^c | θ_k^{c(t)}) ∏_{b'} ∏_{j'} P(R_{new,j'}^{b'} | φ_{k,ℓ'}^{b'(t)}),

where ℓ and ℓ' denote the cluster assignments of the objects j and j', respectively.
The equation is quite intuitive: the prediction is a weighted sum of the predictions
P(R_{new,j}^b | φ_{k,ℓ}^{b(t)}) over all clusters. The weight of each cluster is the product of
the last three terms, which represents to what extent this cluster agrees with
the known data (attributes and relationships) about the new object. Since the
blocked method also samples parameters, the computation is straightforward.

3.2 Inference with Variational Approximation


The IHRM model has multiple DPs which interact through relationships; thus
blocked Gibbs sampling is still slow, due to the slow exchange of information
between DPs. To address this problem, we outline an alternative solution based
on variational inference. The main strategy is to convert the probabilistic
inference problem into an optimization problem, and then to solve it with known
optimization techniques. In particular, the method assumes a distribution
q, referred to as a variational distribution, that approximates the true posterior P
as closely as possible. The difference between the variational distribution q and
the true posterior P can be measured via the Kullback-Leibler (KL) divergence. Let

ξ denote a set of unknown quantities, and D denote the known data. The KL
divergence between q(ξ) and P(ξ|D) is defined as:

    KL(q(ξ) || P(ξ|D)) = Σ_ξ q(ξ) log q(ξ) − Σ_ξ q(ξ) log P(ξ|D).   (4)

The smaller the divergence, the better the fit between the true and the
approximate distributions. The probabilistic inference problem (i.e., computing the
posterior) now becomes: minimize the KL divergence with respect to the
variational distribution. In practice, the minimization of the KL divergence is
formulated as the maximization of a lower bound on the log-likelihood:

    log P(D) ≥ Σ_ξ q(ξ) log P(D, ξ) − Σ_ξ q(ξ) log q(ξ).   (5)
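The relation between Eq. 4 and Eq. 5 can be checked numerically on a toy discrete latent variable: the gap between log P(D) and the lower bound is exactly the KL divergence, so maximizing the bound minimizes the divergence. The numbers below are arbitrary toy values, not data from the paper:

```python
import numpy as np

p_joint = np.array([0.3, 0.1, 0.2])   # toy P(D, xi) for three values of xi
p_data = p_joint.sum()                # P(D), marginalizing out xi
posterior = p_joint / p_data          # P(xi | D)

q = np.array([0.5, 0.25, 0.25])       # an arbitrary variational distribution

elbo = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))   # right-hand side of Eq. 5
kl = np.sum(q * np.log(q)) - np.sum(q * np.log(posterior))   # Eq. 4 with P(xi|D)

assert np.log(p_data) >= elbo                       # the lower bound holds
assert abs(np.log(p_data) - elbo - kl) < 1e-9       # the gap is exactly the KL divergence
```

The bound becomes tight precisely when q equals the true posterior, which is why coordinate ascent on Eq. 5 is a valid surrogate for posterior inference.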

A mean-field method was explored in [6] to approximate the posterior of
unobservable quantities in a DP mixture model. The main challenge of using
mean-field inference for the IHRM model is that there are multiple DP mixture
models coupled together by relationships and correlation mixture components.
In the IHRM model, the unobservable quantities include Z^c, π^c, Θ^c and
Φ^b. Since the mixing weights π^c are computed from V^c (see Eq. 1), we can replace
π^c with V^c in the set of unobservable quantities. To approximate the posterior
P({V^c, Θ^c, Z^c}_c, {Φ^b}_b | D, {α_0^c, G_0^c}_c, {G_0^b}_b), we define a variational distribution
q({Z^c, V^c, Θ^c}_{c=1}^C, {Φ^b}_{b=1}^B) as:

    ∏_c^C [ ∏_i^{N^c} q(Z_i^c | η_i^c) ∏_k^{K^c} q(V_k^c | λ_k^c) q(θ_k^c | τ_k^c) ] [ ∏_b^B ∏_k^{K^{c_i}} ∏_ℓ^{K^{c_j}} q(φ_{k,ℓ}^b | ρ_{k,ℓ}^b) ],   (6)

where c_i and c_j denote the object classes involved in the relationship class b,
and k and ℓ denote the cluster indices for c_i and c_j. The variational parameters
are {η_i^c, λ_k^c, τ_k^c, ρ_{k,ℓ}^b}. q(Z_i^c | η_i^c) is a multinomial distribution with
parameters η_i^c; note that there is one η_i^c for each object e_i^c. q(V_k^c | λ_k^c) is a Beta
distribution. q(θ_k^c | τ_k^c) and q(φ_{k,ℓ}^b | ρ_{k,ℓ}^b) have the same forms as G_0^c and G_0^b, respectively.
We substitute Eq. 6 into Eq. 5 and optimize the lower bound with a coordinate
ascent algorithm, which yields the following equations to iteratively
update the variational parameters until convergence:

    λ_{k,1}^c = 1 + Σ_{i=1}^{N^c} η_{i,k}^c,   λ_{k,2}^c = α_0^c + Σ_{i=1}^{N^c} Σ_{k'=k+1}^{K^c} η_{i,k'}^c,   (7)

    τ_{k,1}^c = β_1^c + Σ_{i=1}^{N^c} η_{i,k}^c T(A_i^c),   τ_{k,2}^c = β_2^c + Σ_{i=1}^{N^c} η_{i,k}^c,   (8)

    ρ_{k,ℓ,1}^b = β_1^b + Σ_{i,j} η_{i,k}^{c_i} η_{j,ℓ}^{c_j} T(R_{i,j}^b),   ρ_{k,ℓ,2}^b = β_2^b + Σ_{i,j} η_{i,k}^{c_i} η_{j,ℓ}^{c_j},   (9)

    η_{i,k}^c ∝ exp( E_q[log V_k^c] + Σ_{k'=1}^{k-1} E_q[log(1 − V_{k'}^c)] + E_q[log P(A_i^c | θ_k^c)]
                + Σ_b Σ_j Σ_ℓ η_{j,ℓ}^{c_j} E_q[log P(R_{i,j}^b | φ_{k,ℓ}^b)] ),   (10)

where λ_k^c denotes the parameters of the Beta distribution q(V_k^c | λ_k^c); λ_k^c is a
two-dimensional vector λ_k^c = (λ_{k,1}^c, λ_{k,2}^c). τ_k^c denotes the parameters of the
exponential family distribution q(θ_k^c | τ_k^c). We decompose τ_k^c such that τ_{k,1}^c
contains the first dim(θ_k^c) components and τ_{k,2}^c is a scalar. Similarly, β_1^c
contains the first dim(θ_k^c) components and β_2^c is a scalar. ρ_{k,ℓ,1}^b, ρ_{k,ℓ,2}^b,
β_1^b and β_2^b are defined equivalently. T(A_i^c) and T(R_{i,j}^b) denote the sufficient
statistics of the exponential family distributions P(A_i^c | θ_k^c) and P(R_{i,j}^b | φ_{k,ℓ}^b),
respectively.
It is clear that Eq. 7 and Eq. 8 correspond to the updates for the variational
parameters of object class c, and they follow the equations in [6]. Eq. 9 represents
the updates of the variational parameters for relationships, which are computed on
the involved objects. The most interesting update is Eq. 10, where the posteriors
of the object cluster assignments are coupled together; this is what essentially
connects the DPs. Intuitively, in Eq. 10 the posterior update for η_{i,k}^c includes a
prior term (the first two expectations), a likelihood term for the object attributes
(the third expectation), and likelihood terms for the relationships (the last term). To
calculate the last term we need to sum over all the relationships of the object
e_i^c, weighted by η_{j,ℓ}^{c_j}, the variational expectation of the cluster assignment of
the other object involved in the relationship.
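Since Eq. 10 only defines η_{i,k}^c up to proportionality, an implementation typically normalizes the unnormalized log-potentials per object with the log-sum-exp trick, which keeps the exponentiation stable. A minimal sketch (our own helper, not from the paper):

```python
import numpy as np

def normalize_eta(log_scores):
    """Turn unnormalized log-potentials (Eq. 10, before normalization) into eta.

    Rows index objects, columns index clusters; subtracting the row maximum
    before exponentiating avoids overflow/underflow.
    """
    m = log_scores.max(axis=1, keepdims=True)
    w = np.exp(log_scores - m)
    return w / w.sum(axis=1, keepdims=True)

eta = normalize_eta(np.array([[-1.0, -2.0, -3.0],
                              [-0.5, -0.5, -5.0]]))
assert np.allclose(eta.sum(axis=1), 1.0)
```

Each row of the result is a proper multinomial parameter vector, which is exactly what q(Z_i^c | η_i^c) requires.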
Once the procedure reaches stationarity, we obtain the optimized variational
parameters, with which we can approximate the predictive distribution
P(R_{new,j}^b | D, {α_0^c, G_0^c}_{c=1}^C, {G_0^b}_{b=1}^B) of the relationship R_{new,j}^b between a new
object e_{new}^c and a known object e_j^{c_j} with q(R_{new,j}^b | D, λ, η, τ, ρ) proportional to:

    Σ_k^{K^c} Σ_ℓ^{K^{c_j}} q(R_{new,j}^b | ρ_{k,ℓ}^b) q(Z_j^{c_j} = ℓ | η_j^{c_j}) q(Z_{new}^c = k | λ^c)
    × q(A_{new}^c | τ_k^c) ∏_{b'} ∏_{j'} Σ_{ℓ'} q(Z_{j'}^{c_{j'}} = ℓ' | η_{j'}^{c_{j'}}) q(R_{new,j'}^{b'} | ρ_{k,ℓ'}^{b'}).   (11)

The prediction is a weighted sum of the predictions q(R_{new,j}^b | ρ_{k,ℓ}^b) over all
clusters. The weight consists of two parts: one is to what extent the cluster ℓ
agrees with the object e_j^{c_j} (the second term), the other is to what extent the
cluster k agrees with the new object (the product of the last three terms). The
computations of the two parts differ because e_j^{c_j} is a known object, for which
we have already optimized the variational parameters η_j^{c_j} of its cluster
assignment.

4 Experimental Analysis

4.1 Monastery Data

The first experiment is performed on Sampson's monastery dataset [19] for
community discovery. Sampson surveyed social relationships between 18 monks
in an isolated American monastery. The relationships between monks included
esteem/disesteem, like/dislike, positive influence/negative influence, praise and
blame. Breiger et al. [7] summarized these relationships and yielded a single


Fig. 3. Left: The matrix displaying interactions between Monks. Middle: A sociogram
for three monks. Right: The IHRM model for the monastery sociogram.

relationship matrix, which reflected interactions between monks, as shown in
Fig. 3 (left).
After observing the monks in the monastery for several months, Sampson pro-
vided a description of the factions among the monks: the loyal opposition (Peter,
Bonaventure, Berthold, Ambrose and Louis), the young turks (John Bosco, Gre-
gory, Mark, Winfrid, Hugh, Boniface and Albert) and the outcasts (Basil, Elias
and Simplicius). The other three monks (Victor, Ramuald and Amand) wavered
between the loyal opposition and the young turks, and were identified as the
fourth group, the waverers. Sampson’s observations were confirmed by the event
that the young turks group resigned after the leaders of the group were expelled
over religious differences. The task of the experiment is to cluster the monks.
Fig. 3 (middle) shows a sociogram with 3 monks. The IHRM model for the
monastery network is illustrated as Fig. 3 (right). There is one hidden variable
for each monk. The relationships between monks are conditioned on the hidden
variables of the involved monks. The mean field method is used for inference.
We initially assume that each monk is in his own cluster. After convergence, the
cluster number is optimized as 4, which is exactly the same as the number of
the groups that Sampson identified. The clustering result is shown as Table 2. It
is quite close to the real groups. Cluster 1 corresponds to the loyal opposition.
Cluster 2 is the young turks, and cluster 3 is the outcasts. The waverers
are split. Amand is assigned to cluster 4, Victor and Ramuald are assigned
to the loyal opposition. Actually, previous research analysis has questioned the
distinction of the waverers, e.g., [7,12] clustered Victor and Ramuald into the
loyal opposition, which coincides with the result of the IHRM model.

Table 2. Clustering result of the IHRM model on Sampson’s monastery data

Cluster Members
1 Peter, Bonaventure, Berthold, Ambrose, Louis, Victor, Ramuald
2 John, Gregory, Mark, Winfrid, Hugh, Boniface, Albert
3 Basil, Elias, Simplicius
4 Amand

4.2 Bernard & Killworth Data

In the second experiment, we perform link analysis with IHRM on the Bernard
& Killworth data [5]. Bernard and Killworth collected several data sets on hu-
man interactions in bounded groups. In each study they obtained measures of
social interactions among all actors, and ranking data based on the subjects’
memory of those interactions. Our experiments are based on three datasets. The
BKFRAT data is about interactions among students living in a fraternity at
a West Virginia college. All subjects had been residents in the fraternity from
three months to three years. The data consists of rankings made by the subjects
of how frequently they interacted with other subjects in the observation week.
The BKOFF data concern interactions in a small business office. Observations
were made as the observer patrolled a fixed route through the office every fifteen
minutes during two four-day periods. The data contains rankings of interaction
frequency as recalled by employees over the two-week period. The BKTEC data
is about interactions in a technical research group at a West Virginia university.
It contains the personal rankings of the remembered frequency of interactions.
In the experiments, we randomly select 50% (60%, 70%, 80%) of the interactions as
known and predict the remaining ones. The experiments are repeated 20 times for
each setting. The average prediction accuracy is reported in Table 3. We compare
our model with the Pearson collaborative filtering method. It shows that the IHRM
model provides better performance on all three datasets. Fig. 4 illustrates
the link prediction results on the BKOFF dataset with 70% known links. The
predicted interaction matrix is quite similar to the real one.

Table 3. Link prediction on the Bernard & Killworth data with the IHRM

                          Prediction Accuracy (%)
         50%             60%             70%             80%
         IHRM  Pearson   IHRM  Pearson   IHRM  Pearson   IHRM  Pearson
BKFRAT   66.50 61.82     67.63 64.56     68.26 66.91     68.69 67.41
BKOFF    66.21 57.32     67.89 59.45     69.20 60.58     69.82 61.54
BKTEC    65.47 58.85     66.79 62.04     68.31 63.61     69.58 64.46

Fig. 4. Left: Interaction matrix on the BKOFF data. Right: The predicted one, which
is quite similar to the real situation.

4.3 MovieLens Data


We also evaluate the IHRM model on the MovieLens data [21]. There are in
total 943 users and 1680 movies, and we obtain 702 users and 603 movies after
removing infrequent ones. Each user has about 112 ratings on average. The
model is shown in Fig. 5. There are two classes of objects (users and movies)
and one class of relationships (Like). The task is to predict preferences of users.
The users have attributes Age, Gender, Occupation, and the movies have at-
tributes Published-year, Genres and so on. The relationships have two states,
where R = 1 indicates that the user likes the movie and 0 otherwise. The user
ratings in MovieLens are originally on a five-star scale, so we convert each
rating to a binary value, with R = 1 if the rating is higher than the user's
average rating and R = 0 otherwise. The performance of the IHRM model is
analyzed from two perspectives: prediction accuracy and clustering effect. To
evaluate the prediction performance, we perform 4 sets of experiments which
respectively select 5, 10, 15 and 20 ratings for each test user as the known
ratings, and predict the remaining ratings. These experiments are referred to as
given5, given10, given15 and given20 in the following. For testing, a relationship
is predicted to exist (i.e., R = 1) if the predictive probability is larger than a
threshold ε = 0.5.
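The per-user binarization of the five-star ratings described above can be sketched as follows; the toy ratings and the treatment of ties (mapped to R = 0) are our own assumptions:

```python
# Binarize 5-star ratings per user: R = 1 if a rating exceeds that
# user's own average rating, else 0.
ratings = {  # toy data: user -> {movie: stars}
    "u1": {"m1": 5, "m2": 2, "m3": 3},
    "u2": {"m1": 4, "m2": 4, "m3": 1},
}

binary = {}
for user, stars in ratings.items():
    avg = sum(stars.values()) / len(stars)
    binary[user] = {m: int(s > avg) for m, s in stars.items()}

# u1's average is (5+2+3)/3 ≈ 3.33, so only m1 maps to R = 1
assert binary["u1"] == {"m1": 1, "m2": 0, "m3": 0}
```

Using each user's own average as the threshold removes per-user rating bias (some users rate everything high, others low) before the relational model sees the data.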
We implement the following three inference methods: Chinese restaurant process
Gibbs sampling (CRPGS), truncated stick-breaking Gibbs sampling (TSBGS),
and the corresponding mean field method TSBMF. The truncation parameters
K^c for TSBGS and TSBMF are initially set to the number of entities.
For TSBMF we consider α_0 = {5, 10, 100, 1000}, and obtain the best prediction
with α_0 = 100. For CRPGS and TSBGS, α_0 is 100. For the variational inference,
the change of the variational parameters between two iterations is monitored
to determine convergence. For the Gibbs samplers, convergence was

Fig. 5. Top: A sociogram for a movie recommendation system, illustrated with 2 users
and 3 movies. For readability, only two attributes (user's occupation and movie's genre)
are shown in the figure. Bottom: The IHRM model for the sociogram.

Fig. 6. Left: The traces of the number of user clusters for the runs of the two Gibbs
samplers. Middle: The trace of the change of the variational parameter η^u for the mean
field method. Right: The sizes of the largest user clusters of the three inference methods.

analyzed by three measures: the Geweke statistic on the likelihood, the Geweke
statistic on the number of components for each class of objects, and autocorrelation. Fig. 6
(left) shows the trace of the number of user clusters in the 2 Gibbs samplers.
Fig. 6 (middle) illustrates the change of the variational parameters η^u in the
variational inference. For CRPGS, the first w = 50 iterations (6942 s) are discarded
as the burn-in period, and the last W = 1400 iterations are collected to
approximate the predictive distributions. For TSBGS, we have w = 300 (5078 s) and
W = 1700. Although the number of iterations for the burn-in period is much
smaller for CRPGS than for the blocked Gibbs sampler, each iteration is
approximately a factor of 5 slower. The reason is that CRPGS samples the hidden
variables one by one, which incurs two additional time costs. First, the
expectations of the attribute parameters and relational parameters have to be
updated when sampling each user/movie. Second, the posteriors of the hidden
variables have to be computed one by one, so we cannot use fast matrix
multiplication techniques to accelerate the computation. Therefore, if we include
the time required to collect a sufficient number of samples for inference, CRPGS
is slower by a factor of 5 (the row Time(s) in Table 4) than the blocked sampler.
The mean field method is in turn a factor of around 10 faster than the blocked
Gibbs sampler and thus almost two orders of magnitude faster than CRPGS.

Table 4. Performance of the IHRM model on MovieLens data

            CRPGS    TSBGS   TSBMF   Pearson   SVD
Given5      65.13    65.51   65.26   57.81     63.72
Given10     65.71    66.35   65.83   60.04     63.97
Given15     66.73    67.82   66.54   61.25     64.49
Given20     68.53    68.27   67.63   62.41     65.13
Time(s)     164993   33770   2892    -         -
Time/iter.  109      17      19      -         -
#C.u        47       59      9       -         -
#C.m        77       44      6       -         -

The prediction results are shown in Table 4. All IHRM inference methods
under consideration achieve comparably good performance; the best results are
achieved by the Gibbs samplers. To verify the performance of the IHRM, we
also implement the Pearson-coefficient collaborative filtering (CF) method [18] and
an SVD-based CF method [20]. It is clear that the IHRM outperforms the
traditional CF methods, especially when there are few known ratings for the test
users. The main advantage of the IHRM is that it can exploit attribute infor-
mation. If the information is removed, the performance of the IHRM becomes
close to the performance of the SVD approach. For example, after ignoring all
attribute information, the TSBMF generates the predictive results: 64.55% for
Given5, 65.45% for Given10, 65.90% for Given15, and 66.79% for Given20.
The IHRM provides cluster assignments for all objects involved, in our case
for the users and the movies. The rows #C.u and #C.m in Table 4 denote
the number of clusters for users and movies, respectively. The Gibbs samplers
converge to 46-60 clusters for the users and 44-78 clusters for the movies. The
mean field solution has a tendency to converge to a smaller number of clusters,
depending on the value of α_0. Further analysis shows that the clustering results
of the methods are actually similar. First, the sizes of most clusters generated
by the Gibbs samplers are very small, e.g., there are 72% (75.47%) user clusters
with less than 5 members in CRPGS (TSBGS). Fig. 6 (right) shows the sizes
of the 20 largest user clusters of the 3 methods. Intuitively, the Gibbs samplers
tend to assign the outliers to new clusters. Second, we compute the Rand index
(0-1) between the clustering results of the methods; the values are 0.8071 between
CRPGS and TSBMF and 0.8221 between TSBGS and TSBMF, which demonstrates
the similarity of the clustering results.
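The Rand index used here counts the fraction of object pairs on which two clusterings agree (both place the pair together, or both keep it apart). A straightforward, if quadratic, sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index in [0, 1] between two flat clusterings of the same objects."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# identical partitions agree on every pair, regardless of label names
assert rand_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because only pairwise co-membership matters, the index is invariant to relabeling of clusters, which makes it suitable for comparing the samplers' partitions with the mean-field one.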
Fig. 7 gives the movies with the highest posterior probability in the 4 largest
clusters generated by TSBMF. In cluster 1 most movies are very new and

Fig. 7. The major movie clusters generated by TSBMF on MovieLens data



popular (the data set was collected from September 1997 through April 1998).
Also they tend to be action and thriller movies. Cluster 2 includes many old
movies, or movies produced by the non-USA countries. They tend to be drama
movies. Cluster 3 contains many comedies. In cluster 4 most movies include
relatively serious themes. Overall we were quite surprised by the good inter-
pretability of the clusters. Fig. 8 (top) shows the relative frequency coefficient
(RFC) of the attribute Genre in these movie clusters. The RFC of a genre s in a
cluster k is calculated as (f_{k,s} − f_s)/σ_s, where f_{k,s} is the frequency of the genre
s in the movie cluster k, f_s is the mean frequency, and σ_s is the standard deviation
of the frequency. The labels for each cluster specify the dominant genres in the cluster.
For example, action and thriller are the two most frequent genres in cluster 1. In
general, each cluster involves several genres. It is clear that the movie clusters
are related to, but not just based on, the movie attribute Genre. The clustering
effect depends on both movie attributes and user ratings. Fig. 8 (bottom) shows
RFC of the attribute Occupation in user clusters. Equivalently, the labels for
each user cluster specify the dominant occupations in the cluster.
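The RFC computation is a per-genre standardization of cluster frequencies. A minimal sketch, assuming the mean f_s and standard deviation σ_s are taken across clusters (the toy frequencies are invented for illustration):

```python
import numpy as np

def relative_frequency_coefficient(freq):
    """RFC_{k,s} = (f_{k,s} - f_s) / sigma_s.

    freq[k, s] is the frequency of genre s within cluster k; the mean and the
    standard deviation are taken per genre, across clusters.
    """
    f_mean = freq.mean(axis=0)
    f_std = freq.std(axis=0)
    return (freq - f_mean) / f_std

# toy frequencies: 3 clusters x 2 genres
freq = np.array([[0.6, 0.1],
                 [0.2, 0.4],
                 [0.1, 0.1]])
rfc = relative_frequency_coefficient(freq)
assert rfc[0, 0] > 0  # genre 0 is over-represented in cluster 0
```

A positive RFC marks a genre (or occupation) that is over-represented in a cluster relative to its overall frequency, which is how the dominant-label annotations in Fig. 8 can be read.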
Note that in the experiments we predicted a relationship attribute R indicat-
ing the rating of a user for a movie. The underlying assumption is that in prin-
ciple anybody can rate any movie, no matter whether that person has watched
the movie or not. If the latter is important, we could introduce an additional at-
tribute Exist to specify if a user actually watched the movie. The relationship R
would then only be included in the probabilistic model if the movie was actually
watched by a user.

Fig. 8. Top: The relative frequency coefficient of the attribute Genre in different movie
clusters, Bottom: that of the attribute Occupation in different user clusters

5 Related Work

The work on the infinite relational model (IRM) [15] is similar to the IHRM, and has
been developed independently. One difference is that the IHRM can specify any
reasonable probability distribution for an attribute given its parent, whereas the
IRM would model an attribute as a unary predicate, i.e., it would need to transform
the conditional distribution into a logical binary representation. Aukia et al. also
developed a DP mixture model for large networks [4]. The model associates an
infinite-dimensional hidden variable for each link (relationship), and the objects
involved in the link are drawn from a multinomial distribution conditioned on the
hidden variable of the link. The model is applied to the community web data
with promising experimental results. The latent mixed-membership model [1]
can be viewed as a generalization of LDA model on relational data. Although
it is not nonparametric, the model exploits hidden variables to avoid the ex-
tensive structure learning and provides a principled way to model the relational
networks. The model associates each object with a membership probability-like
vector. For each relationship, cluster assignments of the involved objects are
generated with respect to their membership vectors, and then the relationship is
drawn conditioned on the cluster assignments.
There are some other important SRL research works for complex relational
networks. The probabilistic relational model (PRM) with class hierarchies [10]
specializes distinct probabilistic dependency for each subclass, and thus obtains
refined probabilistic models for relational data. A group-topic model is proposed
in [23]. It jointly discovers latent groups in a network as well as latent topics
of events between objects. The latent group model in [16] introduces two latent
variables c_i and g_i for an object, where c_i is conditioned on g_i. The object
attributes depend on c_i, and relations depend on g_i of the involved objects. The
limitation is that only relations between members of the same group are considered.
These models demonstrate good performance in certain applications. However, most
are restricted to domains with simple relationships.

6 Extension: Conditional IHRM


We have presented the IHRM model and an empirical analysis of social network
data. As a generative model, the IHRM models both object attributes and
relational attributes as random variables conditioned on clusters of objects. If
the goal is to predict relationship attributes, one might expect improved
prediction performance from a model trained conditioned on the attributes. As
part of ongoing work we study the extension of the IHRM model to discriminative
learning. A conditional IHRM directly models the posterior probability of
relations given features derived from the attributes of the objects. Fig. 9
illustrates the conditional IHRM model with a simple sociogram example.

Fig. 9. A conditional IHRM model for a simple sociogram. The main difference from
the IHRM model in Fig. 2 is that the attributes G do not influence the relations R
indirectly via the object clusters Z, but condition the relations directly.

The main difference to the IHRM model in Fig. 2 is that the relationship attributes
are conditioned on both the states of the latent variables and features derived
from attributes. A simple conditional model is based on logistic regression of the
form

    log P(R_{i,j} | Z_i = k, Z_j = ℓ, F(G_i, G_j)) = σ(⟨ω_{k,ℓ}, x_{i,j}⟩),

where x_{i,j} = F(G_i, G_j) denotes a vector of features derived from all
attributes of i and j. ω_{k,ℓ} is a weight vector, which determines how much a
particular attribute contributes to the choice of a relation and can implicitly
implement feature selection. Note that there is one weight vector for each cluster
pair (k, ℓ). ⟨·, ·⟩ denotes an inner product. σ(·) is a real-valued function of any
form σ : R → R. The joint probability of the conditional model is now written as:

    P(R, Z | G) = ∏_i P(Z_i) ∏_{i,j} P(R_{i,j} | Z_i, Z_j, F(G_i, G_j)),   (12)

where P(Z_i) is still defined via the stick breaking construction (Eq. 1).
Preliminary experiments show promising results, and we will report further
results in future work.
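Under the common choice of the logistic sigmoid for σ (an assumption; the text above leaves σ general), the per-cluster-pair link probability can be sketched as:

```python
import math

def link_probability(w, x):
    """P(R_ij = 1 | Z_i = k, Z_j = l, features): sigmoid of <w_{k,l}, x_{i,j}>."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))

W = {(0, 1): [2.0, -1.0]}  # hypothetical weight vector for cluster pair (k, l) = (0, 1)
x = [1.0, 0.5]             # features F(G_i, G_j) derived from the two objects' attributes
p = link_probability(W[(0, 1)], x)
assert 0.0 < p < 1.0
```

Keeping one weight vector per cluster pair lets each pair of latent groups weight the same attribute-derived features differently, which is what allows the implicit feature selection mentioned above.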

7 Conclusions
This paper presents a nonparametric relational model IHRM for social network
modeling and analysis. The IHRM model enables expressive knowledge represen-
tation of social networks and allows for flexible probabilistic inference without
the need for extensive structural learning. The IHRM model can be applied to
community detection, link prediction, and product recommendation. The em-
pirical analysis on social network data showed encouraging results with inter-
pretable clusters and relation prediction. For the future work, we will explore
discriminative relational models for better performance. It will also be interest-
ing to perform analysis on more complex relational structures in social network
systems, such as domains including hierarchical class structures.

Acknowledgments

This research was supported by the German Federal Ministry of Economy and
Technology (BMWi) research program THESEUS, the EU FP7 project LarKC,
and the Fraunhofer ATTRACT fellowship STREAM.

References

1. Airoldi, E.M., Blei, D.M., Xing, E.P., Fienberg, S.E.: A latent mixed-membership
model for relational data. In: Proc. ACM SIGKDD Workshop on Link Discovery
(2005)
2. Aldous, D.: Exchangeability and related topics. In: Ecole d’Ete de Probabilites de
Saint-Flour XIII 1983, pp. 1–198. Springer, Heidelberg (1985)
3. Antoniou, G., van Harmelen, F.: A Semantic Web Primer. MIT Press, Cambridge
(2004)
4. Aukia, J., Kaski, S., Sinkkonen, J.: Inferring vertex properties from topology in
large networks. In: NIPS 2007 workshop on statistical models of networks (2007)
5. Bernard, H., Killworth, P., Sailer, L.: Informant accuracy in social network data
iv. Social Networks 2 (1980)
6. Blei, D., Jordan, M.: Variational inference for Dirichlet process mixtures. Bayesian Analysis 1(1),
121–144 (2005)
7. Breiger, R.L., Boorman, S.A., Arabie, P.: An algorithm for clustering relational
data with applications to social network analysis and comparison to multidimen-
sional scaling. Journal of Mathematical Psychology 12 (1975)
8. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001)
9. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational
models. In: Dzeroski, S., Lavrac, N. (eds.) Relational Data Mining, Springer, Hei-
delberg (2001)
10. Getoor, L., Koller, D., Friedman, N.: From instances to classes in probabilistic re-
lational models. In: Proc. ICML 2000 Workshop on Attribute-Value and Relational
Learning (2000)
11. Getoor, L., Taskar, B. (eds.): Introduction to Statistical Relational Learning. MIT
Press, Cambridge (2007)
12. Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social
networks. Journal of the Royal Statistical Society 170 (2007)
13. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: Proc.
16th International Joint Conference on Artificial Intelligence (1999)
14. Ishwaran, H., James, L.: Gibbs sampling methods for stick breaking priors. Journal
of the American Statistical Association 96(453), 161–173 (2001)
15. Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning sys-
tems of concepts with an infinite relational model. In: Proc. 21st Conference on
Artificial Intelligence (2006)
16. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group
models. In: Proc. 4th international workshop on Multi-relational mining, pp. 49–
55. ACM Press, New York (2005)
17. Raedt, L.D., Kersting, K.: Probabilistic logic learning. SIGKDD Explor.
Newsl. 5(1), 31–48 (2003)

18. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: An open
architecture for collaborative filtering of netnews. In: Proc. of the ACM 1994 Con-
ference on Computer Supported Cooperative Work, pp. 175–186. ACM, New York
(1994)
19. Sampson, F.S.: A Novitiate in a Period of Change: An Experimental and Case
Study of Social Relationships. PhD thesis (1968)
20. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality re-
duction in recommender systems–a case study. In: WebKDD Workshop (2000)
21. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommender
algorithms for e-commerce. In: Proc. ACM E-Commerce Conference, pp. 158–167.
ACM, New York (2000)
22. Sethuraman, J.: A constructive definition of Dirichlet priors. Statistica Sinica 4,
639–650 (1994)
23. Wang, X., Mohanty, N., McCallum, A.: Group and topic discovery from relations
and text. In: Proc. 3rd international workshop on Link discovery, pp. 28–35. ACM,
New York (2005)
24. Xu, Z., Tresp, V., Yu, K., Kriegel, H.-P.: Infinite hidden relational models. In: Proc.
22nd UAI (2006)
25. Xu, Z., Tresp, V., Yu, S., Yu, K.: Nonparametric relational learning for social
network analysis. In: Proc. 2nd ACM Workshop on Social Network Mining and
Analysis, SNA-KDD 2008 (2008)
26. Yedidia, J., Freeman, W., Weiss, Y.: Constructing free-energy approximations and
generalized belief propagation algorithms. IEEE Transactions on Information The-
ory 51(7), 2282–2312 (2005)
Using Friendship Ties and Family Circles
for Link Prediction

Elena Zheleva^1, Lise Getoor^1, Jennifer Golbeck^2, and Ugur Kuter^1

^1 Department of Computer Science and Institute for Advanced Computer Studies,
University of Maryland, College Park, Maryland 20742, USA
{elena,getoor,ukuter}@cs.umd.edu
^2 College of Information Studies,
University of Maryland, College Park, Maryland 20742, USA
jgolbeck@umd.edu

Abstract. Social networks can capture a variety of relationships among
the participants. Both friendship and family ties are commonly studied,
but most existing work studies them in isolation. Here, we investigate
how these networks can be overlaid, and propose a feature taxonomy
for link prediction. We show that when there are tightly-knit family
circles in a social network, we can improve the accuracy of link prediction
models. This is done by making use of the family circle features based on
the likely structural equivalence of family members. We investigated the
predictive power of overlaying friendship and family ties on three real-
world social networks. Our experiments demonstrate significantly higher
prediction accuracy (between 15% and 30% more accurate) compared to
using more traditional features such as descriptive node attributes and
structural features. The experiments also show that a combination of all
three types of attributes results in the best precision-recall trade-off.

1 Introduction

There is a growing interest in social media and in data mining methods which
can be used to analyze, support and enhance the effectiveness and utility of
social media sites. The analysis methods being developed build on traditional
methods from the social network analysis community, extend them to deal with
the heterogeneity and growing size of the data being generated, and use tools
from graph mining, statistical relational learning, and methods for information
extraction from unstructured and semi-structured text.
Traditionally, social network analysis has focused on actors and ties (or rela-
tionships) between them, such as friendships or kinships. The two most common
types of networks are (1) unimodal networks, where the nodes are actors and the
edges represent ties such as friendships, and (2) affiliation networks which can be
represented as bipartite graphs, where there are two types of nodes, the actors
and organizations, and the edges represent the affiliations between actors and
organizations. Most of the existing work has focused on networks that exhibit a
single relationship type, either friendship or affiliation.

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 97–113, 2010.

© Springer-Verlag Berlin Heidelberg 2010

In this paper, we investigate the power of combining friendship and affiliation
networks. We use the notion of structural equivalence, in which two actors are
considered similar because they participate in equivalent relationships; this
notion is fundamental to finding groups in social networks. Our approach is an attempt to bridge
approaches based on structural equivalence and community detection, where
densely connected groups of actors are clustered together into communities. We
show how predictive models, based on descriptive, structural, and community
features, perform surprisingly well on challenging link-prediction tasks.
We validate our results on a trio of social media websites describing friendships
and family relationships. We show that our models are able to predict links
accurately, in this case friendship relationships, in held-out test data. This is
typically a very challenging prediction problem. With our results, we also hope
to motivate further research in discovering closely-knit groups in social networks
and using them to improve link-prediction performance.
Our link-prediction approach can be applied in a variety of domains. The
important properties of the data that we use are that there are actors, links be-
tween them and closely-knit groups such as families, housemates or officemates.
In some data, groups are given; in other datasets, it may be necessary to first
cluster the nodes in a meaningful manner. For example, in email communication
networks, such as Enron [1,2], groups could be cliques of people that email each
other frequently. In the widely studied co-authorship networks [3,4,5,6,7,8,9],
affiliation groups may be cliques of authors that collaborate on many papers
together. In these domains, the link-prediction task translates to finding people
who are likely to communicate with each other [1] or authors who are likely to
collaborate in the future [5,8].
Our contributions include the following:
– We propose a general framework for combining social and affiliation net-
works.
– We show how to instantiate it for overlaying friendship and family networks.
– We show how features of the overlaid networks can be used to accurately
predict friendship relationships.
– We validate our results on three social media websites.
In Section 2, we describe the link prediction problem that we focus on in
this paper, and in Section 3, the social network model. Section 4 addresses the
taxonomy of the descriptive, structural, and group features that we used for
link prediction in our overlaid networks. We then compare our network overlay
method with two alternatives in Section 5. We describe exper-
imental results in Section 6, related work in Section 7, the generality of our
approach in Section 8, and discuss conclusions and future work in Section 9.

2 Link Prediction Problem


In this paper we study the problem of predicting friendship links in multi-
relational social networks. This problem is closely related to problems of link

prediction [4,1,5,8], link completion [10], and anomalous link discovery [1,9]
which are covered in more depth in Section 7.
Link prediction in social networks is useful for a variety of tasks. The most
straightforward use is for making data entry easier – a link-prediction system
can propose links, and users can select the friendship links that they would
like to include, rather than users having to enter the friendship links manually.
Link prediction is also a core component of any system for dynamic network
modeling—the dynamic model can predict which actors are likely to gain popu-
larity, and which are likely to become central according to various social network
metrics.
Link prediction is challenging for a number of reasons. When it is posed as
a pair-wise classification problem, one of the fundamental challenges is dealing
with the large outcome space; if there are n actors, there are n² possible
relations. In addition, because most social networks are sparsely connected, the
prior probability of any link is extremely small, thus we have to contend with
a large class skew problem. Furthermore, because the number of links is poten-
tially so large, the number of the negative instances will be huge, so constructing
a representative training set is challenging.
In our approach to link prediction in multi-relational social networks, we ex-
plore the use of both attribute and structural features, and, in particular, we
study how group membership (in our case, family membership) can significantly
aid in accurate link (here, friendship) prediction.

3 Social Network Model


Social networks describe actors and their relationships. The actors can have
properties or attributes such as age and income. Relationships can represent
dyadic (binary) relationships or group memberships (cliques or hyperedges); in
addition, relationships can be directed, undirected and/or weighted. Here we
consider both dyadic and group-membership relationships. Specifically, we con-
sider friendship relationships and family group memberships. In our domain,
these are undirected, unweighted relationships.
More formally, the networks we consider consist of the following:
actors: a set of actors A = {a1, ..., an},
and a grouping or partitioning of the actors into non-overlapping groups:
groups: a group of individuals connected through a common affiliation. The
affiliations group the actors into sets G = {G1, ..., Gm}.
Affiliation groups can be a partitioning of the actors, or overlapping groups of
actors. In this work, families of actors are such affiliation groups.
We consider the following relationships:
friends: F(ai, aj) denotes that ai is friends with aj, and
family: M(ai, Gk) denotes that ai is a member of family Gk.
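These definitions can be rendered directly as data structures. The following is a minimal Python sketch (class and function names are ours, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Actor:
    """An actor a_i with descriptive attributes, a friend set (a_i.F),
    and a family-member set (a_i.M)."""
    name: str
    attrs: dict = field(default_factory=dict)
    friends: set = field(default_factory=set)
    family: set = field(default_factory=set)

def add_friendship(a, b):
    """F(a_i, a_j): an undirected, unweighted friendship tie."""
    a.friends.add(b.name)
    b.friends.add(a.name)

def add_family_group(group):
    """M(a_i, G_k): every actor in group G_k is a family member of the others."""
    names = {a.name for a in group}
    for a in group:
        a.family |= names - {a.name}

# A toy network: one friendship and one two-member family group.
alice, bob, carol = Actor("alice"), Actor("bob"), Actor("carol")
add_friendship(alice, bob)
add_family_group([bob, carol])
```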

Fig. 1. Actors in the same tightly-knit group often exhibit structural equivalence, i.e.,
they have the same connections to all other nodes. Using the original network (a), and
a structural equivalence assumption, one can construct a network with new predicted
links (b).

Actors can have attributes; if b is an attribute, then we use ai.b to denote the
b attribute of actor ai. We denote the set of friends of actor ai by ai.F, and the
set of family members of the same actor as ai.M.
Figure 1(a) shows an example network of eight actors and five groups. Each
node represents an actor, and a group is shown as a circle around the actors.
The thick lines inside a group mark family relationships, and the thin black lines
denote friendship relationships. Every actor belongs to at least one group. There
are single-member groups, and there are actors without friends.

4 A Feature Taxonomy for Multi-relational Social Networks

We identified three classes of features that describe characteristics of potential
links in a multi-relational social network:
– Descriptive attributes are attributes inherent to the nodes, and they do not
consider the structure of the network.
– Structural attributes include characteristics of the networks based on the
friendship relationships such as node degree.
– Group attributes are based on structural properties of the network when both
types of relationships, friendship and family, are considered. The groups in
this case are the cliques of family members.
Each feature within a class can be assigned to an actor or to a pair of actors (cor-
responding to a potential edge). The following sections describe our taxonomy
of the features in more detail.

4.1 Descriptive Attributes

The descriptive attributes are attributes of nodes in the social network that
do not consider the link structure of the network. These features vary across
domains. They provide semantic insight into the inherent properties of each
node in a social network, or compare the values of the same inherent attributes
for a pair of nodes.
We define two classes of descriptive attributes for multi-relational social net-
works:

1. Actor features. These are inherent characteristics of an actor.
2. Actor-pair features. The actor-pair features compare the values of the same
node attribute for a pair of nodes.

4.2 Structural Features

The next set of features describes the structure of the network. The first is a
structural feature for a single node, ai, while the remaining features describe
structural attributes of pairs of nodes, ai and aj.

1. Actor features. These features describe the link structure around a node.
Number of friends. The degree, or number of friends, of an actor ai: |ai.F|.
2. Actor-pair features. These features describe how interconnected two nodes
are. They measure the friend sets of the two actors, ai.F and aj.F.
Number of common friends. The number of friends that the pair of nodes
have in common in the network: |ai.F ∩ aj.F|.
Jaccard coefficient of the friend sets. The Jaccard coefficient over the friend
sets of two actors describes the ratio of the number of their common
friends to their total number of friends:
Jaccard(ai, aj) = |ai.F ∩ aj.F| / |ai.F ∪ aj.F|.

The Jaccard coefficient is a standard metric for measuring the similarity
of two sets. Unlike the feature number of common friends, it considers
the size of the friendship circle of each actor.
Density of common friends. For the set of common friends, the density
is the number of friendship links between the common friends over the
number of all possible friendship links in the set. The density of common
friends of two nodes describes the strength in the community of common
friends. Density is also known as clustering coefficient.
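As a sketch, these pair features can be computed over friend sets stored as Python sets, with the friendship graph given as an adjacency dict (a representation we choose for illustration; it is not from the paper):

```python
def num_common_friends(F_i, F_j):
    """|a_i.F ∩ a_j.F|: number of common friends."""
    return len(F_i & F_j)

def jaccard(F_i, F_j):
    """|a_i.F ∩ a_j.F| / |a_i.F ∪ a_j.F|, or 0 for two friendless actors."""
    union = F_i | F_j
    return len(F_i & F_j) / len(union) if union else 0.0

def density_of_common_friends(F_i, F_j, adj):
    """Friendship links among the common friends, over all possible
    links in that set (the clustering coefficient of the common set)."""
    common = sorted(F_i & F_j)
    n = len(common)
    if n < 2:
        return 0.0
    links = sum(1 for x in range(n) for y in range(x + 1, n)
                if common[y] in adj[common[x]])
    return links / (n * (n - 1) / 2)

# Toy adjacency: a and b share friends c and d, who are themselves friends.
adj = {"a": {"c", "d"}, "b": {"c", "d"},
       "c": {"a", "b", "d"}, "d": {"a", "b", "c"}}
```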

4.3 Group Features

The third category of features that we consider are based on group membership;
in the networks we studied, the groups are families. These are the features that
overlay friendship and affiliation networks.

1. Actor features. These are features that describe the groups to which an actor
belongs.
Family Size. This is the simplest attribute and describes the size of an
actor's family: |ai.M|.
2. Actor-pair features. There are two types of features for modeling inter-family
relations, based on the overlapping friend and family sets of two actors,
ai.F and aj.M:
Number of friends in the family. The first feature describes the number of
friends ai has in the family of aj: |ai.F ∩ aj.M|. This feature allows one
to reason about the relationship between an actor and a group of other
actors, where the latter is semantically defined over the network through
the family relations.
Portion of friends in the family. The second feature on inter-family relations
describes the ratio between the number of friends that ai has in aj's
family (the same as the above feature) and the size of aj's family. The
rationale behind this feature is that the higher this ratio is, the more
likely it is that aj is close to ai in the network, since more of its family
members are friends with ai.
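The two pair features above can be sketched over Python sets (names and example values are ours):

```python
def friends_in_family(F_i, M_j):
    """|a_i.F ∩ a_j.M|: number of friends a_i has in a_j's family."""
    return len(F_i & M_j)

def portion_of_friends_in_family(F_i, M_j):
    """The same count, normalized by the size of a_j's family."""
    return len(F_i & M_j) / len(M_j) if M_j else 0.0

F_i = {"p", "q", "r"}       # a_i's friends
M_j = {"q", "r", "s", "t"}  # a_j's family
```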

The idea behind the group features is based on the notion of structural equiv-
alence of nodes within a group. Two nodes are structurally equivalent if they
have the same links to all other actors. If we can detect tightly-knit groups in a
social network and we assume that the nodes in each group are likely to behave
similarly, then new links can be predicted by projecting links such that the nodes
in the group become structurally equivalent. In our networks, such groups are
the family cliques. In a weighted graph, a tight group could map to a clique of
nodes with highly-weighted edges.
Figure 1 shows an example of how a structural equivalence assumption can
help in predicting new links. For example, if one of the actors from Group A is
friends with an actor from Group B, as shown on the original network (a), then
it may be more likely that there is a link between the other actor from Group
A and the actor from Group B, shown as a dashed line in (b).
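One simplified reading of this projection: for every existing friendship (u, v), propose the same tie for u's and v's fellow group members. The sketch below is our own reading of Figure 1, not the paper's classifier:

```python
def project_links(friendships, groups):
    """Propose links that would make members of each group structurally
    equivalent: if (u, v) is a friendship and w shares a group with u,
    propose (w, v), unless that link already exists."""
    existing = {frozenset(e) for e in friendships}
    member_of = {}
    for members in groups.values():
        for m in members:
            member_of.setdefault(m, set()).update(members)
    predicted = set()
    for u, v in friendships:
        for a, b in ((u, v), (v, u)):
            for w in member_of.get(a, {a}) - {a, b}:
                cand = frozenset((w, b))
                if cand not in existing:
                    predicted.add(cand)
    return predicted

# Group A = {a1, a2}, Group B = {b1}; a1 is already friends with b1,
# so the projection proposes the link (a2, b1), as in Figure 1(b).
preds = project_links({("a1", "b1")}, {"A": {"a1", "a2"}, "B": {"b1"}})
```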

5 Alternative Network Representations


The traditional approach to studying networks is to treat all relationships as equal.
In the previous section, we described overlaying networks with different link types
in a way that distinguishes between these types, and uses information about affili-
ation groups. In other words, our link-prediction approach uses information about
the actors A, the groups G, the friendship relationships F , and the family relation-
ships M. We call our representation the different-link and affiliation overlay. A
natural question, then, is whether there are benefits to treating links as different,
and whether affiliation groups really make a difference in link prediction. Our
claim is that affiliations are important and that they can have predictive value.
To illustrate the benefit of our approach as compared to the traditional one, we
compare it to two alternative representations of the network.

In the first alternative representation, which we call same-link and no affiliation
overlay, the family and friendship links are treated the same, and affiliation
groups are not given. More formally, in this representation, the graph consists
groups are not given. More formally, in this representation, the graph consists
of these components: actors A, and a set of edges to which we refer as implied
friendships Fimplied = F ∪ M . We can compute the descriptive and structural
features in this alternative overlay, and use them for link prediction. In our ex-
periments, we investigate whether this alternative overlay can offer the same or
better link-prediction accuracy as the different-link and affiliation overlay.
Even if the first alternative overlay does not offer better accuracy, we still
need to check whether the predictive value of the different-link and affiliation
overlay comes from treating the links as different or from the fact that we are
given the affiliation groups. To investigate that, we look at a second alternative
overlay, the same-link and affiliation overlay, in which the family and friendship
links are treated the same, and affiliation groups are given. In this overlay, the
graph consists of these components: actors A, groups G, and implied friendships
Fimplied . We can compute all classes of features in this alternative overlay, and
use them for link prediction.
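Schematically, the three representations differ only in which components they keep. A sketch, with our own function names:

```python
def different_link_affiliation_overlay(A, F, M, G):
    """Our representation: friendship and family links kept distinct,
    affiliation groups given."""
    return {"actors": A, "friends": F, "family": M, "groups": G}

def same_link_no_affiliation_overlay(A, F, M):
    """Alternative 1: implied friendships F ∪ M, no groups."""
    return {"actors": A, "friends": F | M}

def same_link_affiliation_overlay(A, F, M, G):
    """Alternative 2: implied friendships F ∪ M, groups kept."""
    return {"actors": A, "friends": F | M, "groups": G}

# Toy components: edges are unordered, hence frozensets.
A = {"a", "b", "c"}
F = {frozenset(("a", "b"))}   # friendship links
M = {frozenset(("b", "c"))}   # family links
G = {"family1": {"b", "c"}}
overlay = same_link_no_affiliation_overlay(A, F, M)
```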

6 Experimental Evaluation
6.1 Social Media Data Sets
This research relies on networks that have two sets of connections:
friendship links and family ties. We performed our experiments on three novel
datasets describing pet social networks: Dogster, Catster, and Hamsterster¹. On these
sites, profiles include photos, personal information, characteristics, as well as
membership in community groups. Members also maintain links to friends and
family members. As of February 2007, Dogster has approximately 375,000 mem-
bers. Catster is based on the same platform as Dogster and contains about
150,000 members. Hamsterster has a different platform, but it contains similar
information about its members. It is much smaller than Dogster and Catster -
about 2,000 members.
These sites are the only three of the hundreds we visited that publicly share
both family and friendship connections.² However, these are networks where both
types of connections are realistic and representative of what we would expect to
see in other social networks if they collected this data. The family connections
are representative of real life, since family links are only made between profiles
of pets created by the same owner. The friendship linking behavior is in line
with patterns seen in other social networks [11].

The descriptive attributes we use for these sites include:
1. Actor features:
Breed. This is the pet breed such as golden retriever or chihuahua. A pet
can have more than one breed value.
¹ At http://www.dogster.com, http://www.catster.com, and http://www.hamsterster.com.
² For a full list, see http://trust.mindswap.org/SocialNetworks

Fig. 2. Sample profile on Dogster which includes family and friends

Breed category. Each breed belongs to a broader category set. For example
in Dogster, the major breed categories we identified are working, herding,
terrier, toy, sporting, non-sporting, hound, and other, a catchall for the
breeds that appear on the site but not as frequently as the
previous ones. When a dog has multiple breeds, its breed category is
mixed.
Single Breed. This boolean feature describes whether a pet has a single
breed or whether it has multiple breed characteristics.
Purebred. This is a boolean feature which specifies whether a dog owner
considers its pet to be purebred or not.
2. Actor-pair features. The features above describe a single profile; the
following compares a pair of profiles.
Same breed. This boolean feature is true if two profiles have at least one
common breed.
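For illustration, this actor-pair check reduces to a set intersection over breed lists (a sketch; the paper does not give code):

```python
def same_breed(breeds_i, breeds_j):
    """True if two profiles share at least one breed value
    (a profile may list more than one breed)."""
    return bool(set(breeds_i) & set(breeds_j))
```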

6.2 Data Description

We have obtained a random sample of 10,000 profiles each from Dogster and
Catster, and all 2059 profiles registered with Hamsterster. Each instance in the
test data contained the features for a pair of profiles where some of the features
were individual node features. To construct the test data, we chose the pairs of

nodes for which there was an existing friendship link, and we sampled from the
space of node pairs which did not have a link. We computed the descriptive,
structural and group features for each of the profiles.
For each pair of profiles in the test data, we computed the features from the
three classes described in Section 4. A test instance for a pair of profiles ai and
aj includes both the individual actor features and the actor-pair features. It has
the form

< ai features, aj features, (ai, aj)-pair features, class >

where class is the binary class which denotes whether a friendship link exists
between the actors.
For Dogster, the sample of 10,000 dogs had around 17,000 links among them-
selves, and we sampled from the non-existing links at a 10:1 ratio (i.e., ten
times as many non-existing links as existing links). For Catster, the 10,000
cats had 43,000 links, and for the whole Hamsterster dataset, the number of links
was around 22,000. We sampled from the non-existing links in these datasets at
the same 10:1 ratio.
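The construction above can be sketched as follows: all existing links become positive instances, and absent pairs are sampled at the stated 10:1 ratio (a sketch; the paper does not give its sampling code):

```python
import random
from itertools import combinations

def build_instances(actors, links, ratio=10, seed=0):
    """Positives: all existing links. Negatives: a random sample of
    non-linked pairs, `ratio` times the number of positives."""
    linked = {frozenset(e) for e in links}
    non_links = [p for p in combinations(sorted(actors), 2)
                 if frozenset(p) not in linked]
    k = min(len(non_links), ratio * len(linked))
    negatives = random.Random(seed).sample(non_links, k)
    return ([(u, v, 1) for u, v in links] +
            [(u, v, 0) for u, v in negatives])

# 8 actors and 2 links -> 2 positives plus 20 sampled negatives.
instances = build_instances(set("abcdefgh"), [("a", "b"), ("c", "d")])
```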

6.3 Experimental Setup

We used three well-known classifiers, namely Naïve Bayes, logistic regression,
and decision trees, for our experiments. The goal was to perform binary
classification on the test instances and predict friendship links. The
implementations of these classifiers were from the latest version of Weka
(v3.4.12), from http://www.cs.waikato.ac.nz/ml/weka/. We allocated a maximum
of 2GB of memory for each classifier we ran. We measured prediction accuracy by
computing precision, recall, and their harmonic mean, the F1 score, using
10-fold cross-validation.
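These metrics follow directly from the confusion counts; a minimal pure-Python sketch (Weka computes them internally):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One true link missed, one non-link wrongly predicted.
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```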

6.4 Link-Prediction Results

We report only the results from decision-tree classification because it consis-
tently had the highest accuracy among the three classifiers. Table 1 summarizes
our results. Adding group features to the descriptive and structural features in-
creased accuracy by 15% to 30%. We discuss the results in more detail in the
subsequent subsections.

Table 1. Comparison of F1 values in the three datasets, with the feature types from
our taxonomy

Feature Type                         Dogster  Catster  Hamsterster
Descriptive                           37.6%     0.4%      19.8%
Structural                            76.1%    83.1%      59.9%
Group                                 90.8%    95.2%      89.2%
Descriptive and structural            78.6%    83.0%      60.3%
Descriptive, structural, and group    94.8%    97.9%      90.5%

Fig. 3. a) Recall, precision, and F1 score for Dogster using descriptive and struc-
tural attributes; b) F1 score across datasets. Using descriptive attributes together with
structural attributes leads to a better F1 score in Dogster but not in Catster and
Hamsterster.

Descriptive attributes can be useful in combination with structural
attributes. In these experiments, we investigated the predictive power
of the simplest features, i.e., the descriptive attributes versus the impact of the
structural attributes. Figure 3 shows the accuracy results from the decision-tree
classifier. When we use only descriptive attributes, the link-prediction accuracy
varies across datasets. In Dogster, there is some advantage to using descriptive
attributes, yet the accuracy (F1 score) is relatively low (37.6%). In Catster and
Hamsterster, building the complete decision trees led to 0.4% and 19.8% accu-
racy, respectively (using Weka’s default pruning parameter, the trees were empty,
and the accuracies were 0%). This confirms that, in general, link prediction is a
challenging prediction task.
When we used the structural features (such as the number of friends that two pro-
files share), the link-prediction accuracy increased to 76.1% in Dogster. This sug-
gests that the structural features are much more predictive than simple descriptive
attributes. This effect was even more pronounced for Catster and Hamsterster.
In Dogster, combining the node attributes and the structural features leads
to further improvement. Using descriptive attributes together with structural
attributes leads to a better F1 score (78.6%) as compared to using either category
alone (37.6% and 76.1%, respectively) in Dogster, as shown in Figure 3. For
Catster and Hamsterster, the difference was less than 0.4%.

Family group features are highly predictive. As the previous experiments
showed, structural attributes are stronger predictors than the descriptive at-
tributes alone. Next, we investigate the predictive power of the group features
in our taxonomy. In Dogster, Catster and Hamsterster, the group features in-
volve the families and friends of the users. Figure 4 shows our comparisons. Our
results suggested that family groups are strong predictors for friendship links
(F1 = 90.8% for Dogster). We also ran experiments where we used not only
family cliques, but also the structural and descriptive features. In these experi-
ments, the results show that the accuracy (F1) improves by 4% in Dogster, 0.6%
in Catster, and 1.3% in Hamsterster.

Fig. 4. Link-prediction accuracy using all feature classes: descriptive, structural and
group features. a) Recall, precision, and F1 score for Dogster; b) F1 score across
datasets. Group features are highly predictive, yet adding the other features provided
benefit too.

Computing more expensive structural attributes is not highly ben-
eficial. Some structural features in our taxonomy were more computationally
expensive to construct than others. For example, the feature that described the
number of friends is easy to compute, whereas the feature that described the den-
sity of common friends for each pair of profiles is the hardest. Using a database,
computing density of common friends for all pairs of profiles requires several
joins of large tables. In order to investigate the trade-off between computing ex-
pensive features and their predictive impact on our results, we have performed
the following experiments.
We have designed experiments in which we add more expensive structural
features one by one, and assess the link-prediction accuracy at each step. We
used the following combinations of features: (1) number of friends only, (2)
number of friends and number of common friends, (3) number of friends, number
of common friends, and Jaccard coefficient, and finally (4) number of friends,
number of common friends, Jaccard coefficient, and density of common friends.
We report on the results of these four sets of structural
features together with the descriptive attributes since we showed in the previous
subsection that using descriptive attributes can sometimes be beneficial. We also
report on the setting in which group features were used.
Surprisingly, it turned out that computing the more expensive features added
very little benefit. Figure 5 shows the results of the experiments. For example, in
the Dogster case, adding the number of common friends of two nodes improved
accuracy (F1 score) by 2% over the individual number of friends. Computing the
most expensive feature density of common friends pays off slightly (improves F1
score by 0.4%) only when there are no group attributes. Computing the more
expensive Jaccard coefficient did not pay off over using the simpler feature number
of common friends. In the Catster and Hamsterster cases, the improvement
was less than 0.5%. Our results also support the claim made in the preferential
attachment model [3] that the number of friends of a node (node degree) plays a
role in the process of new nodes linking to it. They contradict the link-prediction
results in co-authorship networks [5], where the Jaccard coefficient and the number
of common friends consistently outperformed the metric based on number of
friends. This may be inherent to the types of networks discussed.

Fig. 5. Link-prediction accuracy using structural features of increasing computation
cost (number of friends, number of common friends, Jaccard coefficient of common
friends, density of common friends). Computing more expensive structural attributes
is not highly beneficial, especially in the presence of group information.

Alternative network representations. In the next set of experiments, we
used the alternative network overlays to test whether there was an advantage to
keeping the different types of links and the affiliation groups. We compare our
proposed different-link and affiliation overlay to the alternative representations
same-link and no affiliation overlay and same-link and affiliation overlay (see
Section 5). We compute only the descriptive and structural features in the overlay
with no affiliation information, and compute all classes of features in the overlays
where affiliation information was given.
Fig. 6. Prediction accuracy when links are treated equal, with and without group
affiliations. As the results from the affiliation overlays suggest, group features are the
main contributor to the high link-prediction accuracy.

The results in Figure 6 show that when family affiliations were given, it
did not matter whether the links were treated as the same type or different
types: the link-prediction accuracy was the same. However, in the case when
the affiliations were not given, it was better to compute the structural features
using both types of relationships but treat them as one type. When family links
were treated as friendship links, the accuracy of the predictions made by the
structural attributes improved by 6% to 20%. This may be due to the fact that
the overlap between friends and family links in the data was very small, and
using both types of links when computing the structural features was beneficial.
Using the affiliation information and computing all features on the data led to
the best accuracy, and the accuracy was the same both in the different-link and
same-link cases. These experiments also confirmed the previous results: group
affiliation was the main contributor to the high link-prediction accuracy.

7 Related Work
In general, link-prediction algorithms process a set of features in order to learn
and predict whether it is likely that two nodes in the data are linked. Sometimes,
these features are hand-constructed by analyzing the problem domain, the at-
tributes of the actors, and the relational structure around those actors [12,4,5,9].
Other times, they are automatically generated, i.e., the prediction algorithm first
learns the best features to use and then predicts new links [8]. In this section,
we discuss the existing work that is most relevant to the link-prediction problem
in multi-relational social networks.
The link-prediction techniques that are based on feature-construction are clos-
est to our work [12,4,1,5,9]. As most of the relational domains can be represented
as a network model, the constructed features not only include the attributes of
the actors, but also the characteristics of the structure. Most of this work ex-
amines co-authorship and citation networks [4,5,8,9] whereas we validate our
method using online social networks. Some of the approaches use machine learn-
ing techniques for classification [4,13,8,14], and others rely on ranking the feature
values [12,5,9].

Adamic and Adar [12] use a similarity-based approach to predicting friendships
amongst students. They gather data from university student websites and mail-
ing lists, and construct a vector of features for each student such as website text,
in-links, out-links, and mailing lists the students belong to. Their approach uses
descriptive features whereas ours also considers structural and group features.
It has been shown that there is "useful information contained in the network
topology alone" [5]. Liben-Nowell and Kleinberg use a variety of structural fea-
tures such as shortest path, (a variant of) number of friends, number of common
friends, Jaccard coefficient, and more elaborate structural features based on all
paths between two nodes in co-authorship networks. Their experiments compare
the link-prediction accuracy of each feature in isolation. They rank the node-
pairs by each feature value and pick the top pairs as their predicted links. Their
results suggest that simple features such as the number of common friends
perform well compared to others. In our work, besides structural features, we
also constructed descriptive and group features, and instead of using the features
in isolation, we combined them.
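This rank-by-feature-value scheme can be sketched on a toy friendship network (the network below is an illustrative assumption, not data from [5] or [12]):

```python
from itertools import combinations

def rank_candidate_links(adj):
    """Rank non-adjacent node pairs by common-neighbor count and
    Jaccard coefficient, as in feature-in-isolation link prediction."""
    scores = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked; not a candidate
        common = adj[u] & adj[v]
        union = adj[u] | adj[v]
        jaccard = len(common) / len(union) if union else 0.0
        scores.append((len(common), jaccard, u, v))
    # highest-scoring pairs are the predicted links
    return sorted(scores, reverse=True)

# Toy friendship network as an adjacency dict of sets (illustrative).
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}
ranked = rank_candidate_links(adj)
top = ranked[0]
print(top[2], top[3], top[0], round(top[1], 2))  # prints: b d 2 0.67
```

Here the pair (b, d) is predicted first because it has the most common friends; picking the top pairs of such a ranking is exactly the feature-in-isolation protocol described above.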
Rattigan and Jensen [9] recognize that the extremely large class skew asso-
ciated with the link-prediction task makes it very challenging. They look at a
related problem, anomalous link discovery, in which instead of discovering new
links, they are interested in learning properties of the existing links. They use
structural features in co-authorship networks and rank the most and least likely
collaborations based on an expensive structural feature, the Katz score. Another
work that uses link prediction for anomaly discovery is the work of Huang and
Zeng [1], in which they rank anomalous emails in the Enron dataset.
The work described so far uses descriptive and structural attributes in isola-
tion. Hasan et al. [4] use both. Their work studies classification for link prediction
based on hand-constructed features in co-authorship networks. They report pre-
diction accuracy (F score), precision, and recall results from a range of classifiers
such as decision trees, k-nearest neighbor, multilayer perceptron, and support-
vector machines. The novelty in our work compared to theirs is that we study
link prediction in richer social network settings and we explore the use of group
features and alternate representations.
The link-prediction problem has also been studied in the domain of citation
networks for scientific publications [8]. The authors posed the link-prediction
problem as a binary classification problem, and used logistic regression to solve
it. Their features are database queries such as most cited author, and thus they
are similar to both the descriptive and structural features we have discussed so
far. Their work describes a statistical learning approach for feature generation.
In particular, it extends the traditional Inductive Logic Programming (ILP) to
reason about probabilities, and uses this extension to learn new features from the
problem domain both statistically and inductively. The experiments in this work
suggest that the ratio of existing to non-existing links in the test data mattered,
and the fewer non-existing link examples were included, the better the precision-
recall curve was. However, testing with more non-existing link examples would
give a better estimate of the probability of a randomly picked pair of nodes in the
Using Friendship Ties and Family Circles for Link Prediction 111

network to be classified correctly. Another statistical learning approach to link
prediction was presented by Taskar et al. [14]. The authors use relational Markov
networks to define a probabilistic model over the entire link graph. Their features
are both descriptive and relational. They apply their method to two domains:
linked university websites and a student online social network.
Another automated feature-generation method has been presented by Kubica
et al. [13] who described a learning method for the task of friend identification
which is similar to anomalous link discovery. Their method, called cGraph, learns
an approximate graph model of the actual underlying link data given noisy linked
data. Then, this graph model is used to predict pairwise friendship information
regarding the nodes in the network. The types of features that they use are
descriptive, structural and group. The difference from our work is that we use
these features for link prediction rather than ranking the links.
Link completion is a problem related to link prediction. Given the arity of a
relationship and all but one entity participating in it, the goal is to predict the
missing entity, as opposed to classifying the missing link itself. Goldenberg et al.
[10] present a comparison of several classification algorithms such as Naive Bayes,
Bayesian networks, cGraph (mentioned above), logistic regression, and nearest
neighbor. This study used several real-world datasets from different domains,
including co-authorship networks, and data collected from the Internet Movie
Database site. It suggested that logistic regression performs well in general in
the datasets above; in our study on real-world social networks, logistic regression
usually performed worse than a decision-tree classifier in terms of accuracy.
There has also been interest in learning group features in social networks.
Kubica et al. [15] describe a group-detection algorithm that uses descriptive fea-
tures and links. First, they perform clustering based on the descriptive features
(clustering) and find the groups. They allow group overlap and assume that
group memberships are conditionally independent of each other given descrip-
tive features. Then, their algorithm assigns a probability of a link between two
actors based on the similarity of their groups, and it can answer ranking queries
similar to the ones in the anomalous link discovery work. One of the issues with
the proposed algorithm is that it is slow [13].
The work of Friedland and Jensen [16] studies the problem of identifying
groups of actors in a social network that exhibit a common behavior over time.
The authors focused on networks of employees in large organizations, and in-
vestigated the employee histories to identify the employees who worked together
intentionally from those who simply shared frequently occurring employment
patterns in the industry.

8 Discussion
When studying other large social networks, family information is not always
relevant or available. However, groups and affiliations are often available, or
communities can be discovered.
The networks used here had binary relationships (friend or family), but a similar
effect can be achieved in networks where relationships are weighted. For example,
co-authorship networks are widely studied as social networks [3,4,5,6,7,8,9],
edges can be weighted by the number of articles a pair of authors have authored
together. In email communication networks - the Enron email corpus [1,2], for ex-
ample - the number of messages between two senders can be used as a weight. To
mimic the strong family-type relationship we used in this article, a threshold weight
can be set. Any edge with a weight over that threshold can be treated as a “strong”
relationship (like our family relationship). Clusters of nodes connected with strong
ties would represent the equivalent of a family unit.
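The thresholding scheme described above can be sketched with a union-find over strong edges (the weights and the threshold below are illustrative assumptions, not values from the article):

```python
def strong_tie_clusters(weighted_edges, threshold):
    """Group nodes connected by edges whose weight meets the threshold,
    treating each connected component of strong ties as a family-like unit."""
    # Union-find over nodes joined by strong edges.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for u, v, w in weighted_edges:
        if w >= threshold:
            parent[find(u)] = find(v)
    # Nodes touched only by weak edges form no strong-tie cluster here.
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Co-authorship weights = number of joint papers (illustrative numbers).
edges = [("ann", "bob", 5), ("bob", "cid", 4), ("cid", "dan", 1), ("dan", "eve", 6)]
print(strong_tie_clusters(edges, threshold=3))
```

With a threshold of 3 joint papers, {ann, bob, cid} and {dan, eve} emerge as "strong" units, while the weak cid-dan edge is ignored, mirroring the family-versus-friend distinction.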

9 Conclusions and Future Work


Link prediction is a notoriously difficult problem. In this research, we found
that overlaying friendship and affiliation networks was very effective. For the
networks used in our study, we found that family relationships were very useful
in predicting friendship links. Our experiments show that we can achieve sig-
nificantly higher prediction accuracy (between 15% and 30% more accurate) as
compared to using more traditional features such as descriptive node attributes
and structural features. Family groups helped not only because they represent
a clique of actors, but because the family relationship itself was indicative of
structural equivalence. As future work, we plan to investigate the use of edge
weights and thresholds to define strongly connected clusters, and see if it works
as well in link prediction as the family groups did here.

Acknowledgments
This work was partially supported by NSF under Grants No. 0746930 and No.
0423845.

References
1. Huang, Z., Zeng, D.: A Link Prediction Approach to Anomalous Email Detection.
In: IEEE International Conference on Systems, Man, and Cybernetics (2006)
2. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification
research. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML
2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004)
3. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution
of the social network of scientific collaborations. Physica A 311, 3 (2002)
4. Hasan, M., Chaoji, V., Salem, S., Zaki, M.: Link Prediction using Supervised Learn-
ing. In: Proceedings of the Workshop on Link Analysis, Counter-terrorism and
Security (with SIAM Data Mining Conference) (2006)
5. Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks.
In: Proceedings of the 12th International Conference on Information and Knowl-
edge Management (CIKM) (2003)
6. Newman, M.: Who is the best connected scientist? A study of scientific coauthorship
networks. Working Papers 00-12-064, Santa Fe Institute (December 2000),
http://ideas.repec.org/p/wop/safiwp/00-12-064.html
7. Newman, M.: Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences 101, 5200–5205 (2004)
8. Popescul, A., Ungar, L.H.: Statistical relational learning for link prediction. In:
Proceedings of the Workshop on Learning Statistical Models from Relational Data
at IJCAI 2003 (2003)
9. Rattigan, M.J., Jensen, D.: The case for anomalous link discovery. SIGKDD Explor.
Newsl. 7(2), 41–47 (2005)
10. Goldenberg, A., Kubica, J., Komarek, P., Moore, A., Schneider, J.: A Comparison
of Statistical and Machine Learning Algorithms on the Task of Link Completion.
In: KDD Workshop on Link Analysis for Detecting Complex Behavior (2003)
11. Golbeck, J.: The dynamics of web-based social networks: Membership, relation-
ships, and change. First Monday 12 (2007)
12. Adamic, L., Adar, E.: Friends and neighbors on the web. Social Networks 25(3),
211–230 (2003)
13. Kubica, J.M., Moore, A., Cohn, D., Schneider, J.: cGraph: A Fast Graph-Based
Method for Link Analysis and Queries. In: Proceedings of the 2003 IJCAI Text-
Mining & Link-Analysis Workshop (2003)
14. Taskar, B., Wong, M.F., Abbeel, P., Koller, D.: Link Prediction in Relational Data.
In: Advances in Neural Information Processing Systems, NIPS 2003 (2003)
15. Kubica, J., Moore, A., Schneider, J., Yang, Y.: Stochastic Link and Group Detec-
tion. In: Proceedings of the Eighteenth National Conference on Artificial Intelli-
gence, AAAI 2002 (2002)
16. Friedland, J., Jensen, D.: Finding Tribes: Identifying Close-Knit Individuals from
Employment Patterns. In: Proceedings of Knowledge Discovery and Data Mining
(KDD 2007) (2007)
Information Theoretic Criteria for Community
Detection

L. Karl Branting

The MITRE Corporation,
7525 Colshire Drive
McLean, VA 22102, USA
lbranting@mitre.org

Abstract. Many algorithms for finding community structure in graphs
search for a partition that maximizes modularity. However, recent work
has identified two important limitations of modularity as a community
quality criterion: a resolution limit; and a bias towards finding equal-sized
communities. Information-theoretic approaches that search for partitions
that minimize description length are a recent alternative to modularity.
This paper shows that two information-theoretic algorithms are them-
selves subject to a resolution limit, identifies the component of each
approach that is responsible for the resolution limit, proposes a vari-
ant, SGE (Sparse Graph Encoding), that addresses this limitation, and
demonstrates on three artificial data sets that (1) SGE does not exhibit a
resolution limit on sparse graphs in which other approaches do, and that
(2) modularity and the compression-based algorithms, including SGE,
behave similarly on graphs not subject to the resolution limit.

1 Introduction

Many complex networks, such as the Internet, metabolic pathways, and social
networks, are characterized by a community structure that groups related ver-
tices together. Traditional clustering techniques group vertices based on some
metric for attribute similarity [2]. More recent research has focused on detection
of community structure from graph topology. Under this approach, the input to
a community-detection algorithm is a graph in which vertices correspond to indi-
viduals (e.g., URLs, molecules, or people) and edges correspond to relationships
(e.g., hyperlinks, chemical reactions, or marital and business ties). The output
consists of a partition of the graph in which subgraphs correspond to meaningful
groupings (e.g., web communities, families of molecules, or social clans).1
Community detection algorithms can be viewed as comprising two compo-
nents: a utility function that expresses the quality of any given partition of a
1 Some communities, such as social clubs and families, can overlap. Membership in
such communities is better modeled as attributes of vertices rather than through
a partition of the graph [3]. The focus of this paper, however, as in the bulk of
community detection research, is on partition-based community structure.

L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 114–130, 2010.

c Springer-Verlag Berlin Heidelberg 2010
Information Theoretic Criteria for Community Detection 115

Table 1. Utility functions and search strategies for various community-detection algo-
rithms. DHC represents divisive hierarchical clustering, AHC represents agglomera-
tive hierarchical clustering, and MDL represents “minimum description length.”

utility function search strategy algorithm


modularity DHC/betweenness centrality Newman & Girvan (2004) [11]
modularity AHC Newman (2004) [1]
modularity Genetic Algorithm Tasgin & Bingol (2006) [12]
modularity DHC/network structure index Rattigan et al. (2007) [13]
modularity AHC/spectral division Donetti & Munoz (2004) [14]
log-likelihood fixed-point iteration Zhang et al. (2007) [15]
MDL simulated annealing Rosvall & Bergstrom (2007) [16]
MDL iterated hill climbing Chakrabarti (2004) [8]

graph; and a search strategy that specifies a procedure for finding a partition
that optimizes the utility function. Table 1 sets forth utility functions and search
strategies of eight recent community-detection algorithms, showing that utility
functions have been paired with a variety of different search strategies.
The utility function most prevalent in recent community detection research is
the modularity function introduced in [1]:

Q = \sum_{1 \le i \le m} \left( w(D_{ii})/l - (l_i/l)^2 \right) \qquad (1)

where i is the index of the communities, w(D_{ii}) is the number of edges in the
graph that connect pairs of vertices within community i, l_i = \sum_{j} w(D_{ij}), i.e.,
the number of edges in the graph that are incident to at least one vertex in
community i, and l is the total number of edges in the entire graph. Modularity
formalizes the intuition that communities consist of groups of entities having
more links with each other than with members of other groups.
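Eq. (1) can be transcribed literally as a sketch (the toy graph of two triangles joined by one edge is an illustrative assumption; l_i is computed directly as the number of edges incident to at least one vertex of community i, per the definition above):

```python
def modularity(edges, communities):
    """Modularity Q per Eq. (1): Q = sum_i (w(D_ii)/l - (l_i/l)^2),
    where w(D_ii) counts within-community edges and l_i counts edges
    incident to at least one vertex of community i."""
    l = len(edges)
    group = {v: i for i, comm in enumerate(communities) for v in comm}
    q = 0.0
    for i, _comm in enumerate(communities):
        within = sum(1 for u, v in edges if group[u] == i and group[v] == i)
        incident = sum(1 for u, v in edges if group[u] == i or group[v] == i)
        q += within / l - (incident / l) ** 2
    return q

# Two triangles joined by a single bridge edge (illustrative).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(round(modularity(edges, [{0, 1, 2}, {3, 4, 5}]), 3))  # 0.204
print(round(modularity(edges, [{0, 1, 2, 3, 4, 5}]), 3))    # 0.0
```

As the intuition in the text predicts, the partition into the two densely linked triangles scores higher than the trivial single-community partition.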
Because of the shortage of real-world data sets with known community struc-
ture, maximum modularity has sometimes even been equated with correct com-
munity structure. However, two important weaknesses have been identified in
modularity as a community-structure criterion.
First, the group structure that optimizes modularity within a given subgraph
can depend on the number of edges in the entire graph in which the subgraph is
embedded. Specifically, modularity is characterized by an intrinsic scale under
which Q is maximized when pairs of distinct groups having fewer than \sqrt{2l}
edges (where l is the total number of edges in the graph) are combined into
single groups [4]. This phenomenon is apparent in ring graphs, i.e., connected
graphs that consist of identical subgraphs each connected to exactly two other
subgraphs by a single link. For example, in the graph shown in Figure 1 consisting
of a ring of 15 squares, modularity is greater when adjacent squares are grouped
together than when each square is a separate group.
116 L. Karl Branting

Fig. 1. Ring graph R15,4 consisting of 15 communities, each containing 4 vertices

A second weakness of modularity is that even when the resolution limit is not
exceeded, modularity exhibits a bias towards groups of similar size. Intuitively,
the sum of the square terms, (li /l)2 , representing the expected number of intra-
group edges within community i under the null model, is minimized, and Q
therefore maximized, when all li are as nearly equal in size as possible.
One approach to the resolution limit of modularity is to apply modularity
recursively, so that the coarse structure found at one level is refined at lower
levels [5].2 An alternative approach is to substitute a different community-quality
criterion for modularity.
One such alternative criterion for community quality that has recently been
proposed, based on information theory, is minimizing description length [7,8,9].
In this approach, the quality of a given partition of a graph is a function of the
complexity of the community structure together with the mutual information
between the community structure and the graph as a whole. The best commu-
nity structure is one that minimizes the sum of (1) the number of bits needed
to represent the community structure plus (2) the number of bits needed to
2 See [6] for a recent approach that addresses resolution limits by using an absolute
evaluation of community structure rather than comparison to a null model.
represent the entire graph given the community structure. Under this approach,
the task of community detection consists of finding the community structure that
leads to the minimum description length (MDL) representation of the graph,
where description length is measured in number of bits.
The structure of the paper is as follows: Section 2 of this paper compares the
compression approach used in two previous approaches to information-theoretic
community detection and identifies a feature common to both that can lead
to a bias toward combining distinct communities in large sparse graphs. An
alternative encoding, termed SGE (Sparse Graph Encoding) that addresses this
bias is proposed in Section 3. Section 4 describes the design of an empirical
evaluation comparing the previous information-theoretic utility functions, SGE,
and modularity on three classes of artificial data. The results of this experiment
are set forth in Section 5.

2 Minimum Description Length Encodings

The intuition behind the minimum description length (MDL) criterion for com-
munity structure is that a partition of a graph that permits a more concise
description of the graph is more faithful to the actual community structure than
a partition leading to a less concise description. The best partition is the one
that lends itself to the most concise description, that is, the encoding of the par-
tition and of the graph given with the partition in the fewest bits. However, the
minimum description length (MDL) criterion does not in itself specify how to
encode either the community structure or the graph given the community struc-
ture. Indeed, the close connection between MDL and Kolmogorov complexity
[10], which is undecidable, suggests that MDL may itself be undecidable.
The encoding algorithms of Rosvall and Bergstrom [7] (hereinafter “RB”) and
Chakrabarti [8] (hereinafter “AP,” standing for “AutoPart”) use quite different
approaches to measuring the description length of community structures and
graphs. However, RB and AP have in common that both are characterized by a
resolution limit similar to that observed in modularity.
RB and AP decompose the task of encoding a graph and its community
structure into similar steps, but they calculate the bits in each term differently.
For the purposes of this comparison, the following notation will be followed:

– n - the number of vertices in the graph
– m - the number of groups
– ai - the number of vertices in group i
– l - the total number of edges in the entire graph
– li - the number of edges incident to group i
– Dij - a binary adjacency matrix between groups i and j
– n(Dij) - the number of elements in adjacency matrix Dij
– w(Dij) - the number of 1's in Dij, i.e., the number of edges between groups
i and j
– P(Dij) - the density of 1's in Dij, i.e., w(Dij)/n(Dij)
– P′(Dij) - for a square matrix Dij, the density of 1's ignoring the diagonal
– H(Dij) = −P(Dij) log(P(Dij)) − (1 − P(Dij)) log(1 − P(Dij)), i.e., the mean
entropy of Dij
– H′(Dij) = −P′(Dij) log(P′(Dij)) − (1 − P′(Dij)) log(1 − P′(Dij)), i.e., the
mean entropy of Dij if values on the diagonal of Dij are ignored
– B - a matrix representing for each pair of groups whether the pair is con-
nected, i.e., Bij = 1 ⇐⇒ w(Dij) > 0

The encoding schemes used in RB and AP are as follows:

1. Bits needed to represent the number of vertices in the graph. Since this term
doesn’t vary with differing community structure, it is irrelevant to the choice
between different community structures and can be ignored.
2. Bits needed to represent the number of groups.
– RB. Not explicitly represented.
– AP. log∗ (m). log∗ (x) = log2 (x) + log2 log2 (x) + ... where only positive
terms are included in the sum. This series is apparently intended to
represent the mean coding length of integers given that the probability
of an integer of a given length is a monotonically decreasing function
of the integer’s length, i.e., longer integers are less probable, but no
maximum length is known [17].
3. Bits needed to represent the association between vertices and groups
– RB. n log(m). The rationale appears to be that for each of the n vertices,
log(m) bits are needed to identify the group to which the vertex belongs.
– AP. If the groups are placed in decreasing order of length, i.e., a1 ≥ a2 ≥
... ≥ am ≥ 1,

\sum_{i=1}^{m-1} \log(\bar{a}_i)

where \bar{a}_i = (\sum_{t=i}^{m} a_t) − m + i.
4. Bits needed for the group adjacency matrix, i.e., the number of edges between
pairs of groups.
– RB. (1/2)m(m+1) log(l). The first term ((1/2)m(m+1)) represents the number
of pairs of groups, and the second term (log(l)) the number of bits needed
to specify the number of edges between any pair of groups.
– AP.
\sum_{1 \le i, j \le m} \log(a_i a_j + 1)

This expression sums for every pair of groups sufficient bits to represent
the number of edges between that pair.
5. Bits needed to represent the full adjacency matrix for vertices, given the
group structure represented in terms 2-4.
– RB.
\log\left( \prod_{i=1}^{m} \binom{a_i(a_i-1)/2}{w(D_{ii})} \prod_{i<j} \binom{a_i a_j}{w(D_{ij})} \right)
The expression following the first product sign represents the number of
ways to choose the actual pairs that are connected within a single group
from the set of all possible pairs. The expression following the second
product sign is the number of ways to choose the actual pairs between
vertices in two different groups from the set of possible edges between
vertices in those groups.
– AP.
\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j H(D_{ij})

For each pair of groups, the entropy of the adjacency matrix for that
pair, i.e., the size of the matrix times its mean entropy.

RB and AP clearly calculate each term quite differently. In general, RB uses
encodings that are much larger than those used in AP. However, a key similarity
is in term 4, the bits needed to encode the number of edges between pairs of
groups. In both RB and AP at least one bit is required for each pair of groups
regardless of how few groups are actually connected (i.e., how few pairs of groups
have at least one edge from a vertex in one to a vertex in the other). The number
of bits arising from this term therefore increases with the square of the number
of groups, regardless of the sparsity of their interconnections. One would expect
that for sufficiently large graphs with sparse community structure the savings
in term 4 from combining groups would be greater than the added cost in term
5 of specifying the vertex adjacencies for the resulting relatively sparse group,
and that this would lead to conflation of distinct groups similar to that observed
when modularity is used as a community quality function. As discussed in the
evaluation below, this conflation is in fact observed. For example, the number
of bits required to encode the graph shown in Figure 1 is lower under both the
RB and AP procedures if some pair of adjacent groups is combined, yielding
14 communities, than if it is divided into 15 equal communities.
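The quadratic growth of this group-pair term can be illustrated numerically for ring graphs R_{m,4}, where each community contributes 4 within-edges and one between-edge (so l = 5m) and only m pairs of groups are connected. This is a back-of-the-envelope sketch of RB's term 4 only, not a full description-length computation:

```python
import math

def rb_group_pair_bits(m, l):
    """RB's term 4: one log2(l)-bit edge count for each of the
    m(m+1)/2 group pairs, connected or not."""
    return 0.5 * m * (m + 1) * math.log2(l)

def connected_group_pairs_in_ring(m):
    """In a ring of m communities, only m pairs of adjacent groups share an edge."""
    return m

# For ring graph R_{m,4}: l = 4m within-edges + m between-edges = 5m.
for m in (15, 30, 60):
    l = 5 * m
    print(m, round(rb_group_pair_bits(m, l)), connected_group_pairs_in_ring(m))
```

Doubling m roughly quadruples the term-4 cost while the number of connected group pairs only doubles, which is the pressure toward merging groups described above.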

3 Sparse Graph Encoding (SGE)

The observations that RB and AP (1) assign at least one bit per pair of com-
munities, regardless of how few are actually connected and (2) conflate distinct
groups in large sparse graphs (as shown experimentally below) suggests the hy-
pothesis that an encoding in which the bits required to encode the number of
edges between pairs of groups grow more slowly than the square of the number
of groups would be less prone to the resolution limit. Sparse Graph Encoding
(SGE) is an encoding scheme designed to test this hypothesis.
The key idea is to encode the group adjacency matrix using two terms. The
first term encodes, for each pair of groups, whether the groups are connected.
The number of bits required for this is equal to the entropy of B, the binary
matrix representing for each pair of groups whether those groups are connected.
The mean entropy of B is at most 1.0, if each group is randomly connected to
exactly half the others. If few, or most, groups are connected to one another, the
120 L. Karl Branting

mean entropy is less than 1.0, and the total entropy is therefore less than the
square of the number of groups.
Moreover, the number of bits needed to represent B can be further reduced by
noting that the value of B’s diagonal need not be explicitly represented because
it can be determined from the number of nodes in each group. Singleton groups
have no within-group edges (assuming that self-loops are prohibited) and groups
with more than one element must have at least one within-group edge (if there
are no within-group edges, the density of within-group edges cannot be higher
than the density of between-group edges, the basic characteristic of a group).
The bits needed to represent B are therefore:
m(m − 1) H′(B) \qquad (2)

where H′(B) = −P′(B) log(P′(B)) − (1 − P′(B)) log(1 − P′(B)) and P′(B) is
the density of 1's in B, ignoring the diagonal.
The second term contains, for each connected pair, the number of bits needed
to represent the number of edges between that pair (the second sum is needed
if, as we assume, edges from a vertex to itself are forbidden):
 
\sum_{i \neq j \wedge w(D_{ij}) > 0} \log(a_i a_j) + \sum_{i = j \wedge w(D_{ij}) > 0} \log(a_i(a_j − 1)) \qquad (3)

If the cost of representing the group adjacency matrix is calculated as expression
(2) plus expression (3), the cost will grow with the number of connected pairs rather
than with the total number of pairs.
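A sketch of Eq. (2), the first of the two SGE adjacency terms (the group-connectivity matrix B for a ring of four groups is an illustrative assumption):

```python
import math

def off_diagonal_entropy_bits(B):
    """Bits for the group-connectivity matrix B per Eq. (2): m(m-1) * H'(B),
    where H'(B) is the mean entropy of B's off-diagonal entries."""
    m = len(B)
    off = [B[i][j] for i in range(m) for j in range(m) if i != j]
    p = sum(off) / len(off)  # density of 1's, ignoring the diagonal
    if p in (0.0, 1.0):
        h = 0.0  # a uniform matrix costs no bits per entry
    else:
        h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return m * (m - 1) * h

# Ring of 4 groups: each group connected to exactly two neighbors.
B = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
print(round(off_diagonal_entropy_bits(B), 2))  # 11.02
```

Because the diagonal is skipped and the cost is driven by the entropy of the connectivity pattern, sparse (or near-complete) group connectivity keeps this term well below one bit per group pair.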
SGE employs several additional minor modifications to further reduce the
description length. The entire calculation is as follows:
1. Bits needed to represent the number of vertices in the graph. As with RB
and AP, these bits are ignored.
2. Bits needed to represent the number of groups. The log* function of [17]
used in AP is predicated on the assumption that no maximum integer size is
known a priori. Here, however, the maximum number of groups is bounded
by both the machine word size and the virtual memory size of the machine
on which the algorithm is executed. Therefore, SGE uses instead RB’s cal-
culation:
log(m)
3. Bits needed to represent the association between vertices and groups. No
group can contain more than n − m + 1 vertices (since each group must have
at least one vertex). Accordingly, the following expression contains sufficient
bits to represent the number of vertices in all m groups:
m log(n − m + 1)
4. Bits needed for the group adjacency matrix, i.e., the number of edges between
pairs of groups. As discussed above, the number of bits is:
 
m(m − 1) H′(B) + \sum_{i \neq j \wedge w(D_{ij}) > 0} \log(a_i a_j) + \sum_{i = j \wedge w(D_{ij}) > 0} \log(a_i(a_j − 1))
5. Bits needed to represent the full adjacency matrix for vertices given the
group structure represented in terms 2-4. This consists, for every pair of
groups i and j, of the size of the i, j adjacency matrix, ai aj, times the entropy
per entry in the corresponding binary matrix, H(Dij ). This is equivalent to
the AP calculation, shown above:
\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j H(D_{ij})

In summary, the relationship between SGE, RB, and AP is as follows:

1. Bits needed to represent the number of vertices in the graph. Ignored as in
RB and AP.
2. Bits needed to represent the number of groups. Follows RB.
3. Bits needed to represent the association between vertices and groups. Uses
an expression with fewer bits than that used in RB, and that is simpler than
that used in AP.
4. Bits needed for the group adjacency matrix. The primary novelty of SGE,
in that for sparse adjacency matrices this term grows more slowly than the
square of the number of groups.
5. Bits needed to represent the full adjacency matrix for vertices. Follows AP.

4 Empirical Evaluation
The previous section suggested that a graph encoding in which the calculation of
the number of bits required to represent a group adjacency matrix was reduced from
an expression that grows as the square of the number of groups, as in RB and AP,
to an expression that grows in proportion to the number of pairs of connected
groups, as in SGE, would reduce or eliminate any resolution limit in sparsely
connected graphs. This hypothesis was tested by comparing the communities
found by optimizing RB, AP, SGE, and modularity on three different artificial
data sets.
To avoid conflating the effect of a utility function with the behavior of a search
strategy, it was necessary to compare alternative utility functions using a single
common search strategy. Accordingly, a single search function was applied to all
for utility functions in the experimental evaluation: the greedy divisive hierar-
chical clustering algorithm of Newman & Girvan (2004) [11]. In the Newman &
Girvan procedure, the edge with the highest betweenness centrality is iteratively
removed, and the partition in the resulting sequence having the optimal value
under the utility function is returned as the community structure. Using a sin-
gle search strategy removes the potentially confounding disparity of the search
algorithms used in published descriptions of RB, AP, and modularity.
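One way to realize this shared search strategy is sketched below with networkx (an assumption of this sketch, not the tooling used in the paper). Modularity stands in for the pluggable utility function; a description-length criterion such as RB, AP, or SGE would be plugged in as its negation:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

def best_partition(G, utility):
    """Greedy divisive search of Newman & Girvan: repeatedly remove the
    edge of highest betweenness centrality, then keep the partition in
    the resulting sequence that scores best under the given utility."""
    candidates = [tuple(nx.connected_components(G))]  # the trivial split
    candidates.extend(girvan_newman(G))               # 2, 3, ..., n groups
    return max(candidates, key=utility)

# Two 4-cliques joined by one bridge edge (illustrative).
G = nx.barbell_graph(4, 0)
part = best_partition(G, lambda comms: modularity(G, comms))
print(sorted(sorted(c) for c in part))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because the utility is evaluated only on the partitions produced by betweenness-based edge removal, swapping in a different utility changes which partition is selected without changing the search itself, which is exactly the confound the experimental design removes.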

4.1 Evaluation Criteria

Various objective functions have been proposed for evaluating the quality of a
proposed community structure given the actual correct community structure,
including the Rand index [23], the adjusted Rand index [24], and f-measure.
There is no consensus regarding the most informative objective function. In this
evaluation, f-measure was selected since its use in information retrieval has made
it familiar to a wide range of researchers.
The intuition underlying the use of f-measure is that group structure can be
expressed as a relation c(G) = {⟨vi, vj⟩ | ∃g ∈ G : vi, vj ∈ g}, that is, the
community structure can be represented by specifying for each pair of vertices
whether that pair is in the same group. The similarity between the proposed
group structure and the actual group structure can be evaluated by comparing
c(proposed) with c(actual). One way to make the comparison is to view each
pair in c(proposed) that is also in c(actual) as a true positive, whereas each pair
in c(proposed) that is not in c(actual) is a false positive. Under this view, recall
and precision can be defined as follows:

– Recall = |c(proposed) ∩ c(actual)| / |c(actual)|

– Precision = |c(proposed) ∩ c(actual)| / |c(proposed)|

F-measure is the harmonic mean of recall and precision:

– F-measure = (2 · recall · precision) / (recall + precision)
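This pairwise construction can be sketched directly (the two partitions below are illustrative):

```python
from itertools import combinations

def copairs(partition):
    """The relation c(G): all unordered vertex pairs placed in the same group."""
    return {frozenset(p) for group in partition for p in combinations(sorted(group), 2)}

def pairwise_f_measure(proposed, actual):
    """F-measure over same-group pairs, as used in the evaluation."""
    cp, ca = copairs(proposed), copairs(actual)
    if not cp or not ca:
        return 0.0
    tp = len(cp & ca)              # pairs grouped together in both structures
    recall = tp / len(ca)
    precision = tp / len(cp)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

actual = [{1, 2, 3}, {4, 5, 6}]
proposed = [{1, 2}, {3, 4, 5, 6}]
print(round(pairwise_f_measure(proposed, actual), 3))  # 0.615
```

Note that a conflated partition is penalized through precision (it proposes many same-group pairs that are not in c(actual)), while an over-split partition is penalized through recall.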

4.2 Experimental Procedure

Three experiments were performed, each with a different type of artificial graph.
The first, ring graphs, are characterized by the sparsity of connections between
groups observed in many large-scale real-world graphs [20]. The second, uniform
random graphs, has been used in a number of evaluations of community-detection
algorithms. The third, embedded Barabasi-Albert (EBA) graphs, consists of communities
generated by preferential attachment [20] embedded in a random graph. Fifty
trials were performed under each experimental condition for uniform random
and EBA graphs. There is no randomness in the construction of ring graphs, so
a single trial was sufficient.

Experiment 1: Ring graphs. Ring graph Rm,c comprises m communities,
each consisting of a ring of c vertices, connected to two other communities,
each by a single link, such that all communities are connected. Ring graphs are
similar to the clique rings of [4] but differ in that the individual communities are
themselves rings rather than cliques. For example, Figure 1 depicts ring graph
R15,4 .
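The construction can be sketched as follows. The vertex numbering (community j owns vertices j·c through j·c + c − 1, and the inter-community link leaves the first vertex of each community) is an assumption of this sketch, not specified in the text:

```python
def ring_graph(m, c):
    """Edge list for ring graph R(m,c): m communities, each a cycle of c
    vertices, joined into one large ring by a single edge between each
    pair of adjacent communities."""
    edges = []
    for j in range(m):
        base = j * c
        for k in range(c):                       # intra-community cycle
            edges.append((base + k, base + (k + 1) % c))
        edges.append((base, ((j + 1) % m) * c))  # single link to next community
    return edges
```

R(m,c) thus has m·c vertices and m·(c + 1) edges: c cycle edges plus one outgoing inter-community edge per community.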
The evaluation compared RB, AP, SGE, and modularity on 91 ring graphs for
which (m, c) ∈ {4 . . . 16} × {3 . . . 9}.³ Strikingly different behavior was observed
³
Note that for c > 3 ring graphs contain no triangles. Therefore, community de-
tection techniques based on the clustering coefficient, e.g., [18], are ineffective for
finding communities in such ring graphs.
Information Theoretic Criteria for Community Detection 123

among the four community-structure utility functions. Optimizing SGE led to
the correct partitions in all but two ring graphs, but RB and AP found no correct
partitions. Optimizing modularity led to correct partitions only for those graphs
below the resolution threshold identified by [4].

– SGE. The partition having the optimal (lowest) SGE was the correct parti-
tion (i.e., no separate communities were conflated) in every graph except
R4,3 and R13,3 . In other words, the correct community structure was found
in 89 out of 91 ring graphs.
– RB and AP. No community structure was found by optimizing either RB
or AP. The partition having the optimal (lowest) value for RB and AP
contained at least one pair of communities that were grouped together in
every ring graph tested.
– Modularity. Optimizing modularity led to incorrect community structure
for rings of more than 8 triangles, more than 10 squares, more than 11 pen-
tagons, or more than 13 hexagons or heptagons. In other words, the correct
partitions were obtained with modularity only for ring graphs of the following
sizes:

• R4,3 − R8,3
• R4,4 − R10,4
• R4,5 − R11,5
• R4,6 − R13,6
• R4,7 − R13,7
• R4,8 − R16,8
• R4,9 − R16,9

Fig. 2. A uniform random graph with 32 vertices, 4 groups, size ratio 1.25, and
in/out ratio 0.67
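This resolution effect can be reproduced with the standard Newman–Girvan modularity [11], Q = Σ over groups g of (e_g − a_g²), where e_g is the fraction of edges inside g and a_g the fraction of edge endpoints attached to g. The sketch below is a textbook implementation, not the code used in the experiments:

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity Q of a partition; `community` maps each
    vertex to its group label.  Q = sum_g (e_g - a_g**2), with e_g the
    fraction of edges whose two endpoints both lie in group g and a_g
    the fraction of all edge endpoints that lie in g."""
    m = len(edges)
    e = defaultdict(float)  # within-group edge fraction per group
    a = defaultdict(float)  # endpoint fraction per group
    for u, v in edges:
        gu, gv = community[u], community[v]
        a[gu] += 1.0 / (2 * m)
        a[gv] += 1.0 / (2 * m)
        if gu == gv:
            e[gu] += 1.0 / m
    return sum(e[g] - a[g] ** 2 for g in a)
```

For example, on two triangles joined by a single bridge edge, the natural two-community split scores Q = 6/7 − 1/2 ≈ 0.357, while the trivial one-community partition scores Q = 0.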

This evaluation confirmed empirically the existence of the resolution limit for
modularity derived formally in [4]. The evaluation also showed the surprising
result that optimizing RB and AP leads to even more conflation of distinct
communities than does modularity. The observation that optimizing SGE led to
the correct community structure provides confirmation for the hypothesis that
the conflation of communities in RB and AP arises from term 4, which uses more
bits than necessary to represent the number of edges connecting groups in sparse
graphs. Substituting rings of cliques for rings whose communities are themselves
rings leads to results almost identical to those described here.

Fig. 3. An Embedded Barabasi-Albert (EBA) graph with 4 communities, each with
5 initial vertices per community, 3 new edges per time step, 10 time steps, and 25
singleton-group vertices

Experiment 2: Uniform random graphs. A common data set for testing
community-extraction algorithms consists of random networks of 128 vertices
divided into 4 communities with average vertex degree of 16 [11,16,19]. In this
experiment, the relative size of the communities was controlled by a size ratio
parameter s such that, if the communities are placed in ascending order of size,
|ai+1 |/|ai | = s, where ai is the ith community. The connections among the vertices were
determined by the average vertex degree d and in/out ratio i such that the
average number of within-community edges incident to each vertex was i ∗ d
and the average number of cross-community edges incident to each vertex was
(1 − i) ∗ d. For example, Figure 2 shows a uniform random graph with s = 1.25
and i = 0.6. Tests were performed for each combination of n = 32, m = 4, d =
6, s ∈ {1.0, 1.25, 1.5, 1.75, 2.0}, and i ∈ {0.6, 0.75, 0.9}.
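A generator matching this description can be sketched as follows. Details not fixed by the text, such as the exact rounding of community sizes and the rejection-sampling of edges, are assumptions:

```python
import random

def uniform_random_graph(n, m, d, s, i, seed=0):
    """Sketch of the uniform-random-graph generator described above:
    n vertices in m communities whose sizes grow by ratio s, average
    vertex degree d, and fraction i of edges placed within communities."""
    rng = random.Random(seed)
    # community sizes proportional to s**0, s**1, ..., s**(m-1)
    weights = [s ** j for j in range(m)]
    sizes = [max(2, round(n * w / sum(weights))) for w in weights]
    sizes[-1] += n - sum(sizes)                 # absorb rounding error
    groups, comm_of, start = [], {}, 0
    for g, size in enumerate(sizes):
        members = list(range(start, start + size))
        groups.append(members)
        comm_of.update((v, g) for v in members)
        start += size
    target = n * d // 2                         # total number of edges
    n_in = round(target * i)                    # within-community edges
    edges = set()
    while len(edges) < n_in:                    # sample within-community edges
        grp = groups[rng.randrange(m)]
        u, v = rng.sample(grp, 2)
        edges.add((min(u, v), max(u, v)))
    while len(edges) < target:                  # then cross-community edges
        u, v = rng.sample(range(n), 2)
        if comm_of[u] != comm_of[v]:
            edges.add((min(u, v), max(u, v)))
    return groups, sorted(edges)
```

With n = 32, m = 4, d = 6, s = 1.25, and i = 0.75, this yields 96 edges, of which exactly 72 are within-community.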
Figures 4, 5, and 6 show the results of the 4 algorithms on uniform graphs
for i ∈ {0.6, 0.75, 0.9} respectively. For i ∈ {0.75, 0.9}, in which the community
structure is relatively distinct, all four algorithms led to similar results except
when the size ratio s was equal to 2.0 (i.e., the sizes of the groups were highly
skewed). Under these circumstances, modularity led to much lower f-measure
than the other algorithms. When i was equal to 0.6 (i.e., the community structure
was relatively unclear) modularity was best and AP worst for low size ratio, and
RB and AP were best for high size ratio. These results are consistent with [7],
which showed better performance for RB than modularity for skewed community
sizes, but comparable performance when community sizes were equal.

Fig. 4. F-measure for uniform random graphs with i=0.6 (weak community structure)

Fig. 5. F-measure for uniform random graphs with i=0.75 (moderate community structure)

Fig. 6. F-measure for uniform random graphs with i=0.9 (strong community structure)

Experiment 3: Embedded Barabasi-Albert Graphs. A wide range of nat-
urally occurring graphs, including those mentioned in the introduction (the
Internet, biochemical pathways, social networks) exhibit a power-law degree
distribution that is not present in uniform random graphs [20,21,22]. However, few
such “scale-free” graphs are annotated with correct community structure. The
third data set consisted of communities with scale-free structure embedded in
a sparse random graph. Each graph consists of m communities generated by
the Jung 1.74 implementation of the Barabasi-Albert preferential attachment
algorithm, each starting with i initial vertices in each community, with e new
edges per time step following the preferential attachment rule of [20] for each of
t time steps, together with c singleton-group vertices. The singleton-group ver-
tices were connected to 1 . . . e vertices randomly selected from the entire graph,
i.e., including both community and singleton-group vertices. The graphs used for
testing had 4 communities, 4 initial vertices per community, 2–4 edges added per
time step, 20 time steps, and 25 singleton-group vertices. For example, Figure
3 depicts an EBA graph with 3 edges added per time step. In evaluating EBA
graphs, singleton-group vertices were ignored, regardless of whether they were
grouped into new communities or added to existing communities.
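The EBA construction can be sketched in a standalone form. The paper used the Jung implementation of Barabasi-Albert growth; the version below is an approximation, and the choice of a ring as the seed for each community is an assumption:

```python
import random

def eba_graph(m, i, e, t, c, seed=0):
    """Sketch of the embedded Barabasi-Albert construction: m communities,
    each grown from i initial vertices for t time steps with e new edges
    per step chosen by preferential attachment, plus c singleton-group
    vertices each wired to 1..e vertices drawn from the whole graph."""
    rng = random.Random(seed)
    edges, labels, nxt = [], {}, 0
    for g in range(m):
        seeds = list(range(nxt, nxt + i))       # i initial vertices
        nxt += i
        for v in seeds:
            labels[v] = g
        for k in range(i):                      # seed ring (an assumption)
            edges.append((seeds[k], seeds[(k + 1) % i]))
        # sampling uniformly from the endpoint multiset is equivalent to
        # choosing a vertex with probability proportional to its degree
        ends = [v for pair in edges[-i:] for v in pair]
        for _ in range(t):
            v, nxt = nxt, nxt + 1
            labels[v] = g
            targets = set()
            while len(targets) < e:             # e distinct attachment targets
                targets.add(rng.choice(ends))
            for u in targets:
                edges.append((v, u))
                ends += [v, u]
    for _ in range(c):                          # singleton-group vertices
        v, nxt = nxt, nxt + 1
        labels[v] = None                        # no true community
        for u in rng.sample(range(v), rng.randint(1, e)):
            edges.append((v, u))
    return edges, labels
```

Each community contributes i + t·e edges deterministically, so with 4 communities, 4 initial vertices, 3 edges per step, and 20 steps the graph has 4·(4 + 60) fixed edges plus between 1 and e extra edges per singleton vertex.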
As shown in Figure 7, the behavior of all four algorithms was quite similar
when the number of edges added per time step was 3 or 4, which leads to
relatively densely connected graphs. When only 2 edges were added per time
step (i.e., the communities were quite sparse), AP's performance was much
worse, and SGE’s somewhat worse, than that of the other two algorithms.

Fig. 7. F-measure for embedded Barabasi-Albert graph with 2–4 edges added per time
step

5 Conclusion

The empirical evaluation demonstrated that RB and AP conflate distinct
communities in ring graphs, and that changing the calculation of the number of bits
needed to represent the group adjacency matrix eliminated this conflation over
the range of ring graphs tested. Ring graphs are artifacts not likely to occur
in many real-world graphs of interest, but many real-world graphs are like ring
graphs in having very sparse group adjacency matrices (i.e., communities with
links to few other communities). The ring-graph experiment suggests that RB
and AP may perform even more poorly than modularity in such graphs.
SGE’s description length calculation did not entirely eliminate resolution lim-
its in clustering. For example, SGE combines adjacent communities in extremely
large rings, such as R100,4 . Moreover, SGE combines adjacent communities in
R4,3 and R13,3 . Thus, it appears that SGE's bit encoding is not optimal even in
sparse graphs.
No one algorithm consistently outperformed the others in EBA or uniform
random graphs, but modularity was consistently worse than the MDL algo-
rithms on highly skewed uniform random graphs, and AP and SGE had lower
performance than the others on sparse EBA graphs. Neither uniform random
graphs nor EBA graphs have the sparse group adjacency matrices that char-
acterize ring graphs, so most errors consist of assigning a vertex to the wrong
community rather than combining two communities that should remain distinct.
Under these circumstances, SGE’s representation of the group adjacency matrix
confers no particular benefit.
While MDL is clearly a powerful tool for identifying community structure,
there are many options for MDL encodings, and the consequences of each choice
can be difficult to anticipate. SGE demonstrates that the resolution limits of RB
and AP in graphs with sparse group adjacency matrices can be easily addressed,
but the fact that SGE did not outperform RB or AP on other types of graphs
suggests that considerable subtlety is required to identify the MDL encoding
most effective over a wide range of graph and community types.

Acknowledgments

This work was funded under contract number CECOM W15P7T-08-C-F600. The
MITRE Corporation is a nonprofit Federally Funded Research and Development
Center chartered in the public interest.

References
1. Newman, M.E.J.: Fast algorithm for detecting community structure in networks.
Physical Review E 69, 066133 (2004)
2. Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A.L., French, J.C.: Clustering
large datasets in arbitrary metric spaces. In: Proceedings of the 15th IEEE Inter-
national Conference on Data Engineering, Sydney, pp. 502–511 (1999)

3. Koutsourelakis, P., Eliassi-Rad, T.: Finding mixed-memberships in social networks.
In: Papers from the 2008 AAAI Spring Symposium on Social Information Processing,
Technical Report WW-08-06, pp. 48–53. AAAI Press, Menlo Park (2008)
4. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc.
Natl. Acad. Sci. USA 104(1), 36–41 (2007)
5. Ruan, J., Zhang, W.: Identifying network communities with a high resolution.
Physical Review E (2007)
6. Ronhovde, P., Nussinov, Z.: An improved Potts model applied to community de-
tection. arXiv preprint, physics.soc-ph (2008)
7. Rosvall, M., Bergstrom, C.: An information-theoretic framework for resolving com-
munity structure in complex networks. Proc. Natl. Acad. Sci. USA 104(18), 7327–
7331 (2007)
8. Chakrabarti, D.: Autopart: Parameter-free graph partitioning and outlier detec-
tion. In: Proceedings of the European Conference on Machine Learning and Prac-
tice of Knowledge Discovery in Databases, pp. 112–124 (2004)
9. Sun, J., Faloutsos, C., Papadimitriou, S., Yu, P.: Graphscope: parameter-free min-
ing of large time-evolving graphs. In: KDD 2007: Proceedings of the 13th ACM
SIGKDD international conference on Knowledge discovery and data mining, pp.
687–696. ACM, New York (2007)
10. Wallace, C.S., Dowe, D.L.: Minimum message length and Kolmogorov complexity.
The Computer Journal 42(4), 270–283 (1999)
11. Newman, M.E., Girvan, M.: Finding and evaluating community structure in net-
works. Physical review. E, Statistical, nonlinear, and soft matter physics 69(2 Pt
2) (February 2004)
12. Tasgin, M., Bingol, H.: Community detection in complex networks using genetic
algorithm. In: ECCS 2006: Proc. of the European Conference on Complex Systems
(2006)
13. Rattigan, M.J., Maier, M., Jensen, D.: Graph clustering with network structure in-
dices. In: ICML 2007: Proceedings of the 24th international conference on Machine
learning, pp. 783–790. ACM, New York (2007)
14. Donetti, L., Muoz, M.: Detecting net-work communities: a new systematic and ef-
ficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 10012,
1–15 (2004)
15. Zhang, H., Giles, C.L., Foley, H.C., Yen, J.: Probabilistic community discovery
using hierarchical latent gaussian mixture model. In: AAAI 2007: Proceedings of
the 22nd national conference on Artificial intelligence, pp. 663–668. AAAI Press,
Menlo Park (2007)
16. Rosvall, M., Bergstrom, C.T.: An information-theoretic framework for resolving
community structure in complex networks. PNAS 104(7327) (2007)
17. Rissanen, J.: A universal prior for integers and estimation by minimum description
length. The Annals of Statistics 11(2), 416–431 (1983)
18. Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale
social networks. In: WebKDD/SNA-KDD 2007: Proceedings of the 9th WebKDD
and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pp.
16–25. ACM, New York (2007)
19. Bagrow, J.: Evaluating local community methods in networks. J. Stat.
Mech. 2008(05), P05001 (2008)
20. Barabasi, A., Albert, R.: Emergence of scaling in random networks. Science 286,
509–512 (1999)
21. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical
data. arXiv:0706.1062 (2007), http://www.santafe.edu/~aaronc/powerlaws/

22. Clauset, A., Shalizi, C., Newman, M.: Power-law distributions in empirical data.
SIAM Review 51(4), 661–703 (2009)
23. Rand, W.M.: Objective Criteria for the Evaluation of Clustering Methods. Journal
of the American Statistical Association 66(336), 846–850 (1971)
24. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218
(1985)
Author Index

Berger-Wolf, Tanya Y. 55
Branting, L. Karl 114
Eliassi-Rad, Tina 1
Gallagher, Brian 1
Getoor, Lise 97
Ghosh, Rumi 20
Golbeck, Jennifer 97
Goldberg, Mark 36
Habiba 55
Kelley, Stephen 36
Kersting, Kristian 77
Kuter, Ugur 97
Lerman, Kristina 20
Magdon-Ismail, Malik 36
Mertsalov, Konstantin 36
Rettinger, Achim 77
Saia, Jared 55
Tresp, Volker 77
Wallace, William (Al) 36
Xu, Zhao 77
Yu, Yintao 55
Zheleva, Elena 97
