Академический Документы
Профессиональный Документы
Культура Документы
Yu-Ru Lin1 Yun Chi2 Shenghuo Zhu2 Hari Sundaram1 Belle L. Tseng3
1
Arts, Media and Engineering Program, Arizona State University, Tempe, AZ 85281, USA
2
NEC Laboratories America, 10080 N. Wolfe Rd, SW3-350, Cupertino, CA 95014, USA
3
YAHOO! Inc., 2821 Mission College Blvd, Santa Clara, CA 95054 USA
1
{yu-ru.lin,hari.sundaram}@asu.edu, 2 {ychi,zsh}@sv.nec-labs.com, 3 belle@yahoo-inc.com
ABSTRACT 1. INTRODUCTION
We discover communities from social network data, and an- Data from many social network datasets, including pa-
alyze the community evolution. These communities are in- per co-authorship networks and the blogosphere, is a graph
herent characteristics of human interaction in online social where nodes represent individuals (e.g., club members, au-
networks, as well as paper citation networks. Also, commu- thors, and bloggers) and edges represent the relationship
nities may evolve over time, due to changes to individuals’ and interactions among individuals (e.g., interactions in a
roles and social status in the network as well as changes club, co-authorship, and hyperlinks in blogs). In such so-
to individuals’ research interests. We present an innovative cial networks, individuals form communities by building re-
algorithm that deviates from the traditional two-step ap- lationships and interactions with each other. The analysis
proach to analyze community evolutions. In the traditional of these communities (membership, structure and temporal
approach, communities are first detected for each time slice, dynamics) is an important research issue.
and then compared to determine correspondences. We ar- Traditional analysis of social networks treats the network
gue that this approach is inappropriate in applications with as as a static graph, where the static graph is either derived
noisy data. In this paper, we propose FacetNet for analyzing from aggregation of data over all time or taken as a snapshot
communities and their evolutions through a robust unified of data at a particular time. These studies range from well-
process. In this novel framework, communities not only gen- established social network analysis [22] to recent successful
erate evolutions, they also are regularized by the temporal applications such as HITS and PageRank [7, 15]. However,
smoothness of evolutions. As a result, this framework will this research omits one important feature of communities in
discover communities that jointly maximize the fit to the ob- networked data— the temporal evolution of communities.
served data and the temporal evolution. Our approach relies By ignoring community evolution, prior works have over-
on formulating the problem in terms of non-negative ma- looked a key aspect of online communities.
trix factorization, where communities and their evolutions More recently, there has been a growing body of work on
are factorized in a unified way. Then we develop an itera- the analysis of communities and their temporal evolution
tive algorithm, with proven low time complexity, which is in dynamic networks [1, 8, 9, 10, 11, 16, 19, 21]. How-
guaranteed to converge to an optimal solution. We perform ever, a common weakness in these studies, as we will dis-
extensive experimental studies, on both synthetic datasets cuss in detail in related work, is that communities and their
and real datasets, to demonstrate that our method discov- evolutions are studied separately—usually community struc-
ers meaningful communities and provides additional insights tures are independently extracted at consecutive timesteps
not directly obtainable from traditional methods. and then in retrospect, evolutionary characteristics are in-
troduced to explain the difference between these commu-
nity structures over time. Such a two-stage approach may
Categories and Subject Descriptors make sense when the community structure is unambiguous
H.2.8 [Database Management]: Database Applications— (e.g., when the community affiliation is available). How-
Data mining; H.3.3 [Information Storage and Retrieval]: ever, more often than not, data from real-world networks
Information Search and Retrieval—Information filtering; J.4 are ambiguous and subject to noise. Under such scenar-
[Computer Applications]: Social and Behavioral Sciences— ios, if an algorithm extracts community structure for each
Economics timestep independently of other timesteps, it often results in
community structures with high temporal variation. Con-
General Terms sequently, undesirable evolutionary characteristics may have
Algorithms, Experimentation, Measurement, Theory to be introduced in order to explain the high variation in the
community structures. Therefore, we argue that a more ap-
Keywords propriate approach is to analyze communities and their evo-
lutions in a unified framework where the community struc-
Community, Evolution, Soft Membership, Non-negative Ma- ture provides evidence about community evolutions and at
trix Factorization, Community Net, Evolution Net the same time, the evolutionary history offers hints on what
Copyright is held by the International World Wide Web Conference Com- community structure is more appropriate. For example, a
mittee (IW3C2). Distribution of these papers is limited to classroom use, community structure that introduces dramatic evolutions in
and personal use by others. a very short period of time is less desirable.
WWW 2008, April 21–25, 2008, Beijing, China.
ACM 978-1-60558-085-2/08/04.
685
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
686
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
stability, sociability, influence and popularity for communi- 2.2 Basic Formulation
ties and individuals. Sun et al. [20] proposed a parameter- As mentioned in the introduction, we want to analyze
free algorithm, GraphScope, to mine time-evolving graphs communities and their evolutions in a unified process. That
where the Minimum Description Length (MDL) principle is is, at time t, we prefer a community structure so that the
employed to extract communities and to detect community community evolution from t-1 to t is not unreasonably dra-
changes. Mei et al. [12] extracted latent themes from text matic. To achieve this goal, we propose to use the commu-
and used the evolution graph of themes for temporal text nity structure at time t-1 (already extracted) to regularize
mining. All these studies, however, have a common weak the community structure at current time t (to be extracted).
point—community extraction and community evolution are To incorporate such a regulation, we introduce a cost func-
analyzed in two separated stages. That is, when commu- tion to measure the quality of community structure at time
nities are extracted at a given timestep, historic commu- t, where the cost consists of two parts—a snapshot cost and
nity structure, which contains valuable information related a temporal cost:
to current community structure, is not taken into account.
There are some recent studies on evolutionary embedding cost = α · CS + (1 − α) · CT (1)
and clustering that are closely related to our work. In [17],
This cost function is first proposed by Chakrabarti et al. [2]
Sarkar et al. proposed a dynamic method that embeds nodes
in the context of evolutionary clustering. In this cost func-
into latent spaces where the locations of the nodes at con-
tion, the snapshot cost CS measures how well a community
secutive timesteps are regularized so that dramatic change is
structure fits W , the observed interactions at time t. The
unlikely. In [2], Chakrabarti et al. proposed the first evolu-
temporal cost CT measures how consistent the community
tionary clustering methods where the cluster membership at
structure is with respect to historic community structure (at
time t is influenced by the clusters at time t-1. Chi et al. [3]
time t-1 ). The parameter α is set by the user to control the
extended similar ideas and proposed the first evolutionary
level of emphasis on each part of the total cost.
spectral clustering algorithms. They used graph cut as a
metric for measuring community structures and community 2.2.1 Snapshot Cost
evolutions. All these studies differ from our work in that
they regularize the current community membership at time A community structure at time t should fit W well, where
t by using historic community membership indirectly. In W is the observed interaction (similarity) matrix at time t.
Chakrabarti et al.’s evolutionary hierarchical clustering al- This requirement is reflected in the snapshot cost CS in the
gorithm, historic community structure affects the tree-node cost function Eq. (1). We first describe how we model the
merging step in the current time. In their evolutionary k- community structure and how we define the snapshot cost.
means clustering algorithm, historic centroids affect the k- Assume there exist m communities at time t. We further
mean process at the current time. In Chi et al.’s algorithms, assume that the interaction (similarity) wij is a combined
certain eigenvectors, instead of the community structure, effect due to all the m communities.
P That is, we approximate
are regularized over time. In the work of Sarkar et al., al- wij using a mixture model wij ≈ m k=1 pk ·pk→i ·pk→j , where
though the relationship among nodes in latent spaces is pre- pk is the prior probability that the interaction wij is due to
served over time, the issue of communities are not directly the k-th community, pk→i and pk→j are the probabilities
addressed. In contrast, in our proposed framework, the com- that an interaction in community k involves node vi and vj ,
munity membership itself is directly regularized over time. respectively. Written in a matrix form, we have W ≈ XΛX T
n×m
wherePX ∈ R+ is a non-negative matrix with xik = pk→i
and i xik = 1. In addition, Λ is an m × m non-negative
2. FORMULATION diagonal matrix with λk = pk , where λk is short for λkk .
Matrices X and Λ (or equivalently, their product XΛ) fully
2.1 Notation characterize the community structure in the mixture model.
First, a note on notations. In this paper, we use lower-case This model was first proposed by Yu et al. in [24].
letters, e.g., x, to represent scalars, vector-formed letters,
e.g., ~v , to represent vectors, and upper-case letters, e.g., W , v2 v1 v2 v1 v2 v1
to represent matrices. Both wij and (W )ij represent the
element at the i-th row and j-th column of W . We use c1 x31 c1 λ1
vec (W ) to denote the vectorization of W , i.e., stacking the v3 v6 v3 v6 v3 x32 v6
columns of W into a column vector. A subscript t on a c2 x41 c2 λ2
x42
variable, e.g., Wt or wt;ij , denotes the value of that variable v4 v5 v4 v5 v4 v5
at time t. However, to avoid notation clutter, we try not to w34 ≈ λ1x31x41
use the subscript t unless it is needed for clarity. (a) (b) +λ2x32x42 (c)
We assume that edges in the networked data are asso-
ciated with discrete timesteps. We use a snapshot graph Figure 1: Schematic illustration of soft communities:
Gt (Vt , Et ) to model interactions at time t where in Gt , each (a) the original graph, (b) the bipartite graph with
node vi ∈ Vt represents an individual and each edge eij ∈ Et two communities, and (c) how to approximate an
denotes the presence of interactions between vi and vj . As- edge (w34 )
suming Gt has n nodes, we use a matrix W ∈ Rn×n + (which
is short for Wt ) to represent the similarity between nodes in In Fig. 1, we use a toy example with 6 nodes and 2 com-
Gt , where wij > 0 if eij ∈ Et and otherwise P wij = 0. With- munities to illustrate this model of community structure.
out loss of generality, we assume that i,j wij = 1. Over For a general graph (a), we use a special bipartite graph (b)
time, the interaction history is captured by a sequence of to approximate (a). Note that (b) has two more nodes, i.e.,
snapshot graphs hG1 , · · · , Gt , · · · i indexed by time. c1 and c2 , corresponding to the two communities. However,
687
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
because (b) is a bipartite graph (i.e., an edge can only occur time t-1 to t will be more dramatic, and therefore the tem-
between a node v and a community c), it has less degree of poral cost CT will be higher.
freedom and so it is a more parsimonious explanation of (a).
In (c), we show how an edge w34 is generated in the mixture 2.3.2 A Probabilistic Interpretation
model as the sum of λ1 x31 x41 and λ2 x32 x42 . Equivalently, Next we provide a probabilistic interpretation of our frame-
we are approximating W , which has rank n, by a product in work by using a first-order Markov generative model. The
the form of XΛX T , which has rank m. Based on this model, basic ideas are that (1) the currently observed data Wt is
we define the snapshot cost CS as the error introduced by generated from the current community structure following
such an approximation, i.e., a certain distribution and (2) the current community struc-
CS = D(W kXΛX T ) ture at time t is generated by using the community structure
at time t-1 as the prior distribution.
P aij
where D(AkB) = i,j (aij log bij − aij + bij ) is the KL- Let Ut be the community parameters at time t and Ut−1
divergence between A and B. So the snapshot cost is high be those at time t-1. The goal is then to estimate the unseen
when the approximate community structure XΛX T fails to parameters Ut , given Wt and Ut−1 , i.e.,
fit the observed data W well.
U ∗ = arg max log P (Wt , Ut |Ut−1 )
Ut
2.2.2 Temporal Cost
In the cost function Eq. (1), the temporal cost CT is used We further assume a first-order Markov model, as illustrated
to regularize the community structure so that it is less prob- in Fig. 2, and we have P (Wt , Ut |Ut−1 ) = P (Wt |Ut )P (Ut |Ut−1 ).
able for unreasonably dramatic community evolution from Therefore the log-likelihood function can be written as
time t-1 to t. We propose to achieve this regularization
by defining CT as the difference between the community L(Ut ) = log P (Wt |Ut ) + log P (Ut |Ut−1 ) (3)
structure at time t and that at time t-1. Recall that the
community structure is captured by XΛ. Therefore, with
. Dirichlet
Y = Xt−1 Λt−1 , the temporal cost is defined as
U0 Ut−1 Ut
CT = D(Y kXΛ) multinomial
where D is the KL-divergence as defined before. So the
temporal cost CT is high when there is a dramatic change
of community structure from time t-1 to t. W0 Wt−1 Wt
2.2.3 Putting It Together
Putting the snapshot cost CS and the temporal cost CT Figure 2: Schematic illustration of the probabilistic
together, we have an optimization problem as to find the model, where U = (X, Λ)
best community structure at time t, expressed by X and Λ,
that minimizes the following total cost In our model, Ut−1 = (Xt−1 , Λt−1 ) and Ut = (Xt , Λt ).
cost = α · D(W kXΛX T ) + (1 − α) · D(Y kXΛ) (2) Under the above probabilistic model, we have the following
theorem, whose proof is given in the Appendix.
n×m P
subject to X ∈ R+ , i xik = 1, and Λ being an m-by-
m non-negative diagonal matrix. Solving this optimization Theorem 1. Under the assumptions that
problem is the core of our FacetNet framework.
(1) vec (Wt ) follows a multinomial distribution with param-
2.3 Justification eter θt , where θt = vec Xt Λt XtT , and
688
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
The proof for the correctness and the convergence of the Figure 3: Schematic illustration of communities and
above update rules is skipped due to space limit. their evolutions: (a) two bipartite graphs at time t-1
and time t (merged by vi ’s), (b) the Community Net
2.4.2 Time Complexity at time t induced by the bipartite graph at time t,
We now show the time complexity for each iteration of and (c) the Evolution Net from t-1 to t induced by
the updates in Theorem 2. The most time-consuming part the two bipartite graphs
is to compute (XΛX T )ij for all i, j ∈ {1, . . . , n}. However,
it turns out that we do not have to compute (XΛX T )ij for
each pair of (i, j), thanks to the sparseness of W . In W , the this bipartite graph, Λt XtT Dt−1 Xt Λt gives a marginal distri-
number of non-zero elements is the number of edges in the bution on the subgraph with nodes {c1 , c2 , c3 } (Fig. 3(b))
snapshot graph, which we denote by ℓ. Then for each non- and this is exactly the community structure we are looking
zero wij , we compute the corresponding (XΛX T )ij , which for. We call this induced subgraph on the community nodes
takes O(m) time with m being the number of communities. (i.e.,{c1 , c2 , c3 }) a Community Net. Note that to induce the
As a result, the total complexity is O(ℓm). If we consider community net, each node vi participates in all the commu-
m, the number of communities, to be a constant and if the nities, with different levels. This is more reasonable than
degree of nodes in the snapshot graph is bounded by another traditional methods in which each node can only contribute
constant, then the complexity is reduced to O(n), i.e., linear to a single community.
in the number of nodes in the snapshot graph.
2.5.3 Evolution Net
2.5 Communities and Their Evolutions To derive the community evolutions, we align the two bi-
After obtaining the solution to Eq. (2) by using our algo- partite graphs, that at time t-1 and that at time t, side by
rithm in Theorem 2, here are the ways we utilize the solution side by merging the corresponding network nodes vi ’s, as
to analyze communities and their evolutions. illustrated in Fig. 3(a). Then a natural definition of com-
munity evolution (from ct−1;i at time t-1 to ct;j at time t) is
2.5.1 Community Membership
the probability of starting from ct−1;i , walking through the
Assume we have computed the result at time t-1, i.e., merged bipartite graphs, and reaching ct;j . Such a walking
(Xt−1 , Λt−1 ), and the result at time t, i.e., (Xt , Λt ). In process produces what we call the Evolution Net to represent
addition, we define a diagonal matrix Dt , whose P diagonal community evolutions, as illustrated in Fig. 3(c). A simple
elements are the row sums of Xt Λt , i.e., dt;ii = j (Xt Λt )ij . T
derivation shows that P (ct−1;i , ct;j ) = (Λt−1 Xt−1 Dt−1 Xt Λt )ij
Then we claim that the i-th row of Dt−1 Xt Λt indicates the T −1
and P (ct;j |ct−1;i ) = (Xt−1 Dt Xt Λt )ij . Again, each node
soft community memberships of vi at time t. We illustrate and each edge contribute to the evolution from ct−1;i to ct;j .
this by using an example shown in Fig. 3. Recall that in the That is, all individuals and all interactions are related to all
bipartite graph at time t (the right side of Fig. 3(a)), the the community evolutions, with different levels. We believe
weights of edges connecting vi to c1 , c2 , and c3 represent this is more reasonable than how community evolutions are
the joint probability P (vi , c1 ), P (vi , c2 ), and P (vi , c3 ). The derived in traditional methods. In tradition methods, usu-
Dt−1 part normalizes this joint probability to get P (c1 |vi ), ally the intersection and union of community members at
P (c2 |vi ), and P (c3 |vi ), i.e., the conditional probability that different time are used alone to compute community evolu-
vi belongs to c1 , c2 , and c3 , respectively. And this condi- tions, with a questionable assumption that all members in
tional probability is exactly the soft community membership a community should be treated with identical importance.
we are looking for. Furthermore, we can see that the i-th
diagonal element of Dt provides information about the level
of activity of vi at time t. 3. EXTENSIONS
In this section we introduce two extensions to our basic
2.5.2 Community Net framework in order to handle inserting and removing of in-
The community structure itself, on the other hand, is ex- dividuals and to determine the number of communities in a
pressed by Λt XtT Dt−1 Xt Λt . For this we again look at the dynamic network over time.
bipartite graph at time t (the right side of Fig. 3(a)). In-
duced from this bipartite graph, Xt ΛXtT gives a marginal 3.1 Inserting and Removing Nodes
distribution on the subgraph with nodes {v1 , . . . , v6 } in or- In real applications, it occurs very often that some new
der to approximate Wt . In a dual fashion, also induced from individuals join a dynamic network (e.g., a new author in a
689
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
paper co-authorship network) and existing ones leave (e.g., 3.2.2 Extended Formulas
a blogger who stops blogging). We provide the following If we allow different community numbers at time t and t-1,
heuristic techniques in our algorithm to handle such insert- then we have to revise Eq. (2) accordingly because in Eq. (2),
ing and removing of nodes. the term D(Y kXΛ) requires Y (the community structure at
Assume that at time t, out of the n nodes in the net- t-1 ) and XΛ (the community structure at t) to have the
work, n1 existing nodes are removed from and n2 new nodes same number of columns and therefore the same number
are inserted into the network. We first handle the n1 re- .
of communities. To solve this issue, we first define Z =
moved nodes by removing the corresponding n1 rows from T
Xt−1 Λt−1 Xt−1 and then revise the cost function to be
Y in Equations (4) and (5) to get Y ′ . Next, we scale Y ′ to
get Y ′′
so that Y ′′ is a valid joint distribution, i.e., Y ′′ = cost = α · D(W kXΛX T ) + (1 − α) · D(ZkXΛX T ) (8)
′ P ′
Y / ij yij . The basic idea behind this heuristic is that we
The basic idea is that when the community numbers are dif-
assume the n1 nodes are randomly selected, independent of
ferent at time t and t-1, instead of regularizing the commu-
their community membership. Under such an assumption,
nity structure itself, we regularize the marginal distribution
Y ′′ is a conditional distribution, conditioning on the remain-
induced by the community structure at time t (i.e., XΛX T ,
ing n − n1 nodes in the network. To add the n2 nodes, we
which approximates Wt ) so that it is not too far away from
pad n2 rows of zeros to Y ′′ to get Ŷ . This heuristic is actu- T
that at time t-1 (Xt−1 Λt−1 Xt−1 ). For the cost function
ally equivalent to assuming that these n2 nodes have already
given in Eq. (8), the following update rules are used.
existed at time t-1 but as isolated nodes.
3.2 Changing Community Numbers Theorem 4. The following update rules will monotoni-
cally decrease the cost function defined in Eq. (8) and there-
So far we have assumed that the number of communi-
fore converge to an optimal solution to the objective function.
ties, m, is given beforehand by the user. However, such an
assumption will limit the scope of application of our frame- X (α · wij + (1 − α) · zij ) · λk · xjk
xik ← xik · (9)
work. In this subsection we try to answer two questions: (XΛX T )ij
j
how to automatically determine the number of communities X
at a given time t and how to revise our framework to allow then normalize such that xik = 1, ∀k
the number of communities to change in different timesteps. i
X (α · wij + (1 − α) · zij ) · xik · xjk
3.2.1 Soft Modularity λk ← λk · (10)
ij
(XΛX T )ij
In [13], Newman et al. introduces an elegant concept, X
the modularity Q, to measure the goodness of a community then normalize such that λk = 1.
partition Pm where Q is defined as k
m
" 2 #
X A(Vk , Vk ) A(Vk , V ) The proof for the correctness and the convergence of the
Q(Pm ) = − (6)
A(V, V ) A(V, V ) above update rules is skipped due to space limit.
k=1
P
with A(Vp , Vq ) = i∈Vp ,j∈Vq wij . Basically, Q measures the
4. EXPERIMENTAL STUDIES
deviation between the chance for edges among communities
In this section, we use several synthetic datasets, a blog
to be generated due to the community structure and the
dataset, and a paper co-authorship dataset to study the per-
chance for the edges to be generated randomly. Extensive
formance of our FacetNet framework.
experimental results [13, 23] have demonstrated that Q is an
effective measure for the community quality, where a maxi- 4.1 Synthetic Datasets
mal Q is a good indicator of the best community structure
and therefore the best community number m. 4.1.1 Synthetic Dataset # 1
Here we extend the concept of modularity to handle soft
membership by defining a Soft Modularity Qs : We start with the first synthetic dataset, which is a static
h i network, to illustrate some good properties of our frame-
Qs =T r (D−1 XΛ)T W (D−1 XΛ) work. This dataset was first studied by White et al. [23] and
is shown in Fig. 4(a). The network contains 15 nodes which
− ~1T W T (D−1 XΛ)(D−1 XΛ)T W ~1 (7) roughly form 3 communities—C1, C2, and C3— where edges
tend to occur between nodes in the same community. We
where ~1 is a vector whose elements are all ones. Qs has first check our soft modularity measure. We apply our al-
the following nice property, whose proof is given in the Ap- gorithm to the network with various community numbers m
pendix. and the resulting Qs values are plotted in Fig. 4(b). In addi-
Theorem 3. The Qs defined in Eq. (7) has the same tion, in Fig. 4(b) we also show the modularity values Q that
probabilistic interpretation as the Q defined in Eq. (6), but are reported by White et al. in [23]. As can be seen from
in the context of soft community membership. In addition, the plot, both Qs and Q show distinct peaks when m = 3,
Qs is a generalized modularity in that Qs is identical to Q which corresponds to the correct community number.
when D−1 XΛ becomes a hard community membership (i.e., Next, after our algorithm correctly partitions the 15 nodes
each row of D−1 XΛ has one 1 and m-1 zeros). into three communities, we illustrate the soft community
membership by studying two communities among the three—
So to detect the best community number m at time t, we C1 = {6, 7, 8, 9, 10} and C2 = {11, 12, 13, 14, 15}. In Fig. 4(a),
run our algorithm for a range of candidates for m and pick we use the same circle shape to represent these 10 nodes
the best one determined by Qs . but use different gray levels to indicate their community
690
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
(a) z = 3 (b) z = 5
Modularity Q 1.4
1 Soft Modularity Qs EvolSpec EvolSpec
Mutual Information
Mutual Information
0.5
FacetNet FacetNet
2 3
1.38 0.25
5 NCut NCut
4 C2 0.4 SNMF SNMF
C3 12 14
1.36
0.3 0.2
1.34
13
10 15 0.2
11
6 1.32
8 0.15
0.1
7 9
1.3
C1 0
1 2 3 4 5 6
1.28 0.1
Community Number 2 4 6 8 10 2 4 6 8 10
691
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
(a) Soft Modularity Qs (b) Mutual Information four communities stay rather stable over all the timesteps.
0.23 This effect is partially due to the way these blogs are selected
1 by our focused crawler—our crawler chose to crawl a densely
0.225
0.8 connected subgraph of the blogosphere where each node in
0.22 0.6
the subgraph has large number of links and high level of in-
teraction intensities. Therefore, most of the selected blogs
0.4
0.215 α = 0.1 belong to some well-known bloggers and they seldom move
0.2 α = 0.5
α = 0.9
around between communities. We apply our FacetNet algo-
0.21
2 4 6 8
0
2 4 6 8 10 12
rithm on the data with different α. And we compute the
Community Number Timestep mutual information between the extracted communities at
each timestep and the four communities shown in Fig. 8.
Fig. 7(b) shows the results under α=0.1, 0.5, and 0.9. As
Figure 7: (a) Soft modularity and (b) mutual infor-
can be seen, as α increases, our algorithm emphasizes less
mation under different α for the NEC dataset
on the temporal smoothness and as a result, the community
We start with analyzing the overall picture of the dataset. structure has higher variation over time. In addition, as α
We first aggregate all the edges over all timesteps into a sin- increases, the communities at each timestep deviate further
gle network and apply our algorithm to compute the soft from the communities obtained from the aggregated data.
modularity score Qs under different community numbers. These results on one hand justify our arguments in the in-
As can be seen in Fig. 7(a), a clear peak shows when the troduction section and on the other hand demonstrate that
community number is 4. We draw the aggregated graph in our FacetNet framework is capable of controlling the trade-
Fig. 8, according to the main community each blog most off between the snapshot cost and the temporal cost in the
likely belongs to. In addition, in Table 1 we list the top key- cost function Eq. (1).
words, measured by the tf-idf score, that occur in the posts
of these four communities. It seems that C1 focuses on tech- Aug 05 c3 c1 c4 c2 Aug 05 c3 c1 c4 c2
0.34 0.42 0.45 0.42 0.69 0.88
nology, C2 on politics, C3 on international entertainment, 0.57 0.36 0.71 0.24 0.59 0.87
Sep 05 c3 c1 c4 c2
Sep 05 c3
and C4 on digital libraries. 0.59 0.31 0.69 0.24 0.53 0.75
c1 c4 c2
@
R
Oct 05 c3 c1 c4 c2
0.45 0.38 0.76 0.41 0.25 0.87
0.59 0.34 0.20 0.71 0.40 0.38 0.86
Nov 05 c3 c1 c4 c2
Nov 05 c3 c1 c4 c2
0.34 0.51 0.76 0.48 0.43 0.22 0.67
0.58 0.33 0.20 0.72 0.27 0.61 0.79
Dec 05 c3 c1 c4 c2
Dec 05 c3 c1 c4 c2
0.49 0.39 0.75 0.23 0.53 0.80 0.59 0.33 0.23 0.70 0.22 0.53 0.79
Jan 06 c3 c1 c4 c2 Jan 06 c3 c1 c4 c2
0.33 0.41 0.24 0.77 0.28 0.57 0.84 0.55 0.33 0.22 0.69 0.60 0.84
Feb 06 c3 c1 c4 c2 Feb 06 c3 c1 c4 c2
692
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
693
WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
[4] F. R. K. Chung. Spectral Graph Theory. American Methods and Applications. Cambridge University
Mathematical Society, 1997. Press, 1994.
[5] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: [23] S. White and P. Smyth. A spectral clustering approach
spectral clustering and normalized cuts. In Proc. of to finding communities in graph. In SDM, 2005.
the 10th ACM SIGKDD Conference, 2004. [24] K. Yu, S. Yu, and V. Tresp. Soft clustering on graphs.
[6] G. Flake, S. Lawrence, and C. Giles. Efficient In NIPS, 2005.
identification of web communities. In Proc. of the 6th [25] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D.
ACM SIGKDD Conference, 2000. Simon. Spectral relaxation for k-means clustering. In
[7] J. M. Kleinberg. Authoritative sources in a NIPS, 2001.
hyperlinked environment. J. of the ACM, 46(5), 1999.
[8] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. APPENDIX
On the bursty evolution of blogspace. In Proc. of the
Proof for Theorem 1
12th WWW Conference, 2003.
Proof. Given that Xt ∈ Rn×m + , each column of which
[9] R. Kumar, J. Novak, and A. Tomkins. Structure and
sums up to one, and Λt is an m-by-m diagonal matrix and
evolution of online social networks. In Proc. of the
sum to one, we have
12th ACM SIGKDD Conference, 2006.
[10] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs P (Ut |Ut−1 ) = P (Yt |Ut−1 ),
over time: densification laws, shrinking diameters and
where Yt = Xt Λt . Because of the constraint, for any given
possible explanations. In Proc. of the 11th ACM
Yt , there is a unique pair of Xt and Λt such that Yt = Xt Λt ,
SIGKDD Conference, 2005.
and Xt and Λt satisfy the constraint. Thus, Yt uniquely
[11] Y. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. determines Ut , which implies
Tseng. Blog community discovery and evolution based
on mutual awareness expansion. In Proc. of the Int. P (Ut |Ut−1 ) = P (Yt |Ut−1 ).
Conf. on Web Intelligence, 2007. The logarithm of MAP is
[12] Q. Mei and C. Zhai. Discovering evolutionary theme
patterns from text: an exploration of temporal text L(Ut ) = log P (Wt |Ut )P (Ut |Ut−1 )
mining. In Proc. of the 11th ACM SIGKDD = log P (Wt |Ut )P (Xt Λt |Ut−1 )
Conference, 2005. = log multinom(vec (Wt ) ; θt )
[13] M. E. J. Newman and M. Girvan. Finding and
+ log Dirichlet(vec (Xt Λt ) ; φt )
evaluating community structure in networks. Phys. P
Rev. E, 2004. ( ij Wt;ij )! Y Wt;ij
= log Q θt;ij
[14] H. Ning, W. Xu, Y. Chi, Y. Gong, and T. Huang. ij Wt;ij ! ij
Incremental spectral clustering with application to
1 Y νYt−1;ik
monitoring of evolving blog communities. In SIAM + log Yt;ik
Int. Conf. on Data Mining, 2007. B(φ)
ik
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The =
X
Wt;ij log θt;ij +
X
νYt−1;ik log Yt;ik + const.
PageRank citation ranking: Bringing order to the ij ik
web. In Technical report, Stanford Digital Library P P P P
Technologies Project, Stanford University, Stanford, Since ij θt;ij = k Λt;kk = 1 and ik Yt;ik = k Λt;kk =
CA, USA, 1998. 1, we can further derive
[16] G. Palla, A.-L. Barabasi, and T. Vicsek. Quantifying L(Ut ) = − D(Wt kθt;ij ) − νD(Yt−1 kYt ) + const.
social group evolution. Nature, 446, 2007.
1
[17] P. Sarkar and A. W. Moore. Dynamic social network = − [αD(Wt kXt Λt XtT )
α
analysis using latent space models. SIGKDD Explor.
Newsl., 7(2), 2005. + (1 − α)D(Yt−1 kXt Λt )] + const.
[18] J. Shi and J. Malik. Normalized cuts and image Thus, maximizing L(Ut ) is equivalent to minimizing the cost
segmentation. IEEE Trans. on Pattern Analysis and function in Eq. (2).
Machine Intelligence, 22(8), 2000.
Proof for Theorem 3
[19] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and
Proof. In the standard Q formula Eq. (6), the first term
R. Schult. Monic: modeling and monitoring cluster A(Vk ,Vk )
transitions. In Proc. of the 12th ACM SIGKDD A(V,V )
is the empirical probability that a randomly se-
Conference, 2006. lected edge has both ends in community P k. For our case,
[20] J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. this empirical probability should be i,j wij P (k|i)P (k|j).
k ,V )
GraphScope: parameter-free mining of large For the second term in Eq. (6), A(V A(V,V )
is the empirical
time-evolving graphs. In Proc. of the 13th ACM probability that a randomly selected edge is related to (i.e.,
SIGKDD Conference, 2007. has at least one end in) community k. For P our case, this
[21] M. Toyoda and M. Kitsuregawa. Extracting evolution
P
empirical probability should be i P (k|i) j wij . By sum-
of web communities from a series of web archives. In ming these two terms over all k’s and noticing that P (k|i) =
HYPERTEXT ’03: Proc. of the 14th ACM conference (D−1 XΛ)ik , we get formula Eq. (7). In addition, it is straight-
on hypertext and hypermedia, 2003. forward to verify that Qs is equal to Q when D−1 XΛ be-
[22] S. Wasserman and K. Faust. Social Network Analysis: comes a 0/1 indicator matrix.
694