
algorithms for mining large graphs

Aristides Gionis
Yahoo! Research, Barcelona
yahoo! research barcelona
research areas and internships!

research areas

distributed systems and search


multimedia retrieval
semantic web
social media
social network analysis
web mining

http://barcelona.research.yahoo.net/internships
this lecture

graph mining algorithms


slides will be available
a few questions as optional homework
introduction to graphs and networks
graphs: a simple model

entities – set of vertices


pairwise relations among vertices
– set of edges
can add directions, weights,. . .
graphs can be used to model many real
datasets
people who are friends
computers that are interconnected
web pages that point to each other
proteins that interact
graph theory

graph theory started in the 18th century, with Leonhard Euler
the problem of the Königsberg bridges
since then, graphs have been studied
extensively
analysis of graph datasets in the past

graph datasets have been studied in the past


e.g., networks of highways, social networks
usually these datasets were small
visual inspection can reveal a lot of information
analysis of graph datasets now

more and larger networks appear


products of technological advancement
e.g., internet, web
result of our ability to collect more, better-quality, and
more complex data
e.g., gene regulatory networks
networks of thousands, millions, or billions of nodes
impossible to visualize
the internet map
types of networks

social networks
knowledge and information networks
technology networks
biological networks
social networks

links denote a social interaction


networks of acquaintances
collaboration networks
actor networks
co-authorship networks
director networks
phone-call networks
e-mail networks
IM networks
sexual networks
knowledge and information networks

nodes store information, links


associate information
citation network (directed
acyclic)
the web (directed)
peer-to-peer networks
word networks
networks of trust
software graphs
bluetooth networks
home page/blog networks
technological networks

networks built for distribution of a commodity


the internet, power grids, telephone networks
airline networks, transportation networks
US power grid
biological networks

biological systems represented as networks


protein-protein interaction networks
gene regulation networks
gene co-expression networks
metabolic pathways
the food web
neural networks
photo-sharing site
what is the underlying graph?

nodes: photos, tags, users, groups, albums, sets,


collections, geo, query, . . .
edges: upload, belong, tag, create, join, contact, friend,
family, comment, fave, search, click, . . .
also many interesting induced graphs
tag graph: based on photos
tag graph: based on users
user graph: based on favorites
user graph: based on groups

which graph to pick — not an easy choice


recurring theme

social media, user-generated content


user interaction is composed of many atomic actions
post, comment, like, mark, join, fave, thumbs-up, . . .
generates all kind of interesting graphs to mine
now what?

the world is full of networks


what do we do with them?
understand their topology and measure their properties
study their evolution and dynamics
create realistic models
create algorithms that make use of the network structure
concepts to study

paths and connectivity


path lengths and diameter — small-world phenomena
degree distributions and degree correlations
communities and clusters
flows and cuts
processes
evolution, random walks, information cascades,
epidemics, . . .
the name of the game

many graph mining problems can be solved with


polynomial-time algorithms
even low-degree polynomials
however, for very large graphs this is not enough
need algorithms in almost linear time
resort to approximation, streaming models, sampling,
sketching, etc...
random graphs
random graphs

Erdős-Rényi random graph model

Paul Erdős (1913–1996) and Alfréd Rényi (1921–1970)
random graphs

the G (n, p) model:


n : the number of vertices
0 ≤ p ≤ 1 : probability
for each pair (u, v ), independently generate the edge
(u, v ) with probability p
G(n, p) is a family of graphs, in which a graph with m
edges appears with probability p^m (1 − p)^(C(n,2) − m)

the G (n, m) model: related, but not identical
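
as a concrete illustration, a minimal Python sketch of sampling from G(n, p)
(not from the original deck; function names are our own):

    # sketch: sample a graph from the G(n, p) model
    # vertices are 0..n-1; edges are returned as a set of pairs
    import random
    from itertools import combinations

    def gnp(n, p, seed=None):
        rng = random.Random(seed)
        # each of the C(n, 2) pairs becomes an edge independently with probability p
        return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

    edges = gnp(1000, 0.01, seed=42)
    print(len(edges))   # expected number of edges: p * n * (n - 1) / 2 ≈ 4995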


properties of random graphs

a property P holds almost surely (for almost every graph) if

lim_{n→∞} Pr[G has P] = 1

which properties hold as p increases?


threshold phenomena : many properties appear suddenly
there exists a probability pc such that
for p < pc the property does not hold a.s.
for p > pc the property holds a.s.
the giant component

let z = np be the average degree


if z < 1 the largest component has size O(log n) a.s.
if z > 1 the largest component has size Θ(n) a.s.;
the second largest component has size O(log n) a.s.
if z = ω(log n) the graph is connected a.s.
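
the threshold at z = 1 can be seen in a small self-contained simulation
(a sketch; names are our own):

    # sketch: largest-component size of G(n, z/n) as z crosses 1
    import random
    from itertools import combinations

    def largest_component_size(n, p, rng):
        adj = [[] for _ in range(n)]
        for u, v in combinations(range(n), 2):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
        seen, best = [False] * n, 0
        for s in range(n):
            if seen[s]:
                continue
            seen[s] = True
            stack, size = [s], 0
            while stack:
                u = stack.pop()
                size += 1
                for w in adj[u]:
                    if not seen[w]:
                        seen[w] = True
                        stack.append(w)
            best = max(best, size)
        return best

    rng = random.Random(0)
    for z in (0.5, 1.0, 2.0):
        print(z, largest_component_size(2000, z / 2000, rng))
    # z = 0.5 gives a tiny largest component, z = 2.0 a constant fraction of n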
phase transition

if z = 1 there is a phase transition


the largest component has size O(n^(2/3))
the sizes of the components follow a power-law
random graphs and real datasets

degree distribution : binomial

Ck = C(n, k) p^k (1 − p)^(n−k) ≈ z^k e^(−z) / k!

highly concentrated around the mean (z = np)


probability of very high degree is exponentially small

clustering coefficient
probability that two vertices are connected is
independent of the local structure
random graphs and real datasets

a beautiful and elegant theory studied exhaustively


have been used as idealized generative models
unfortunately, they don’t capture reality. . .
structure of real-world graphs
properties of real graphs

diverse collections of graphs arising from different phenomena

are there any typical patterns?

at which level should we look for commonalities?


microscopic : degree distribution
mesoscopic : communities
macroscopic : small diameters
degree distribution

Ck = number of vertices with degree k

problem : find the probability distribution that fits best


the observed data
degree distribution

Ck = number of vertices with degree k, then

Ck = c k^(−γ)

with γ > 1, or

ln Ck = ln c − γ ln k

so, plotting ln Ck versus ln k gives a straight line with


slope −γ
heavy-tail distribution : there is a non-negligible fraction
of nodes that has very high degree (hubs)
scale free : average is not informative
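
a sketch of the straight-line fit on synthetic data (least squares on the
log-log plot; illustrative only, maximum-likelihood estimation of γ is
preferred in practice):

    # sketch: estimate the slope of ln C_k versus ln k by least squares
    import math
    import random
    from collections import Counter

    def loglog_slope(degrees):
        counts = Counter(degrees)   # C_k for every observed degree k
        pts = [(math.log(k), math.log(c)) for k, c in counts.items() if k > 0]
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        num = sum((x - mx) * (y - my) for x, y in pts)
        den = sum((x - mx) ** 2 for x, _ in pts)
        return num / den            # ≈ −γ for power-law data

    # synthetic power-law degrees: Pr[X ≥ x] ∝ x^(−1.3), so γ ≈ 2.3
    rng = random.Random(1)
    degrees = [int(rng.paretovariate(1.3)) for _ in range(100000)]
    print(loglog_slope(degrees))    # roughly −2.3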
power-law degree distribution
power-law degree distribution

[Newman, 2003]
contrast — degree distribution of random graphs
maximum degree

for random graphs, the maximum degree is highly


concentrated around the average degree z
for power-law graphs

dmax ≈ n^(1/(γ−1))

hand-waving argument: solve n Pr[X ≥ d] = 1; for a power law
Pr[X ≥ d] ∝ d^(−(γ−1)), which gives d ≈ n^(1/(γ−1))


community structure

intuitively a subset of vertices that are more connected to


each other than to other vertices in the graph
a proposed measure is clustering coefficient
C1 = (3 × number of triangles in the network) /
     (number of connected triples of vertices)
captures “transitivity of clustering”
if u is connected to v and
v is connected to w , it is also likely that
u is connected to w
community structure at local level
community structure

alternative definition
local clustering coefficient
Ci = (number of triangles connected to vertex i) /
     (number of triples centered at vertex i)
global clustering coefficient

C2 = (1/n) Σi Ci

community structure is captured by large values of


clustering coefficient
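
both definitions in one short sketch over adjacency sets
(plain Python, names are our own):

    # sketch: global (C1) and average-local (C2) clustering coefficients
    from itertools import combinations

    def clustering_coefficients(adj):   # adj: dict vertex -> set of neighbors
        tri = triples = 0
        local = {}
        for u, nu in adj.items():
            # triangles through u = connected pairs of u's neighbors
            t = sum(1 for v, w in combinations(sorted(nu), 2) if w in adj[v])
            d = len(nu)
            tri += t                        # each triangle is counted 3 times
            triples += d * (d - 1) // 2     # connected triples centered at u
            local[u] = 2 * t / (d * (d - 1)) if d > 1 else 0.0
        C1 = tri / triples                  # = 3 × triangles / connected triples
        C2 = sum(local.values()) / len(adj)
        return C1, C2

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(clustering_coefficients(adj))     # one triangle: C1 = 3/5, C2 = 7/12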
small-world phenomena

small worlds : graphs with short paths

Stanley Milgram (1933-1984)


“The man who shocked the world”
obedience to authority (1963)
small-world experiment (1967)
small-world experiments

letters were handed out to people in Nebraska to be sent


to a target in Boston
people were instructed to pass on the letters to someone
they knew on first-name basis
the letters that reached the destination (64 / 296)
followed paths of length around 6
six degrees of separation : (play by John Guare)
also:
the Kevin Bacon game
the Erdős number
small-world project:
http://smallworld.columbia.edu/index.html
small diameter

proposed measures
diameter : largest shortest-path over all pairs
effective diameter : the smallest distance d such that at
least 90% of the pairs of vertices are within distance d
average shortest path : average of the shortest paths over
all pairs of vertices
characteristic path length : median of the shortest paths
over all pairs of vertices
hop-plots : plot of |Nh (u)|, the number of neighbors of u
at distance at most h, as a function of h
[M. Faloutsos, 1999] conjectured that it grows
exponentially and considered the hop exponent
measurements on real graphs

                      n        m            γ    C1    C2    ℓ
film actors           449 913  25 516 482   2.3  0.20  0.78  3.48
internet              10 697   31 992       2.5  0.03  0.39  3.31
protein interactions  2 115    2 240        2.4  0.07  0.07  6.80

[Newman, 2003]
other properties

degree correlations
distribution of size of connected components
resilience
eigenvalues
distribution of motifs
properties of evolving graphs

[Leskovec et al., 2005] discovered two interesting and


counter-intuitive phenomena
densification power law

|Et| ∝ |Vt|^α, 1 ≤ α ≤ 2

diameter is shrinking
algorithmic tools
efficiency considerations

data in the web and social-media are typically of


extremely large scale (easily reaching billions)
how to compute simple statistics?
how to locate similar objects fast?
how to cluster objects?
hashing and sketching

probabilistic/approximate methods
sketching : create sketches that summarize the data and
make it possible to estimate simple statistics in small space
hashing : hash objects in such a way that similar objects
have a larger probability of being mapped to the same value
than non-similar objects
estimator theorem
consider a set of items U
a fraction ρ of them have a specific property
estimate ρ by sampling

how many samples N are needed?


N ≥ (4 / (ε² ρ)) log(2/δ)

for an ε-approximation with probability at least 1 − δ
notice: it does not depend on |U| (!)
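
a quick empirical check (a sketch; the property is simulated by a biased coin
with bias ρ):

    # sketch: the estimator theorem in action
    import math
    import random

    rho, eps, delta = 0.1, 0.2, 0.05
    N = math.ceil(4 / (eps ** 2 * rho) * math.log(2 / delta))
    print(N)    # ≈ 3689 samples, independent of |U|

    rng = random.Random(1)
    failures = 0
    for _ in range(1000):
        hits = sum(rng.random() < rho for _ in range(N))
        if abs(hits / N - rho) > eps * rho:     # outside (1 ± ε)ρ
            failures += 1
    print(failures / 1000)  # should be comfortably below δ = 0.05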
homework

use the Chernoff bound to derive the estimator theorem


computing statistics on data streams

X = (x1 , x2 , . . . , xm ) a sequence of elements


each xi is a member of the set N = {1, . . . , n}
mi = |{j : xj = i}| the number of occurrences of i
define

Fk = Σ_{i=1..n} mi^k

F0 is the number of distinct elements


F1 is the length of the sequence
F2 index of homogeneity, size of self-join, and other
applications
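
for concreteness, the exact moments of a tiny stream (a sketch of the
quantities the sketches below approximate):

    # sketch: exact frequency moments F_k of a stream
    from collections import Counter

    def frequency_moment(stream, k):
        counts = Counter(stream)            # m_i for each distinct element i
        return sum(m ** k for m in counts.values())

    X = [1, 2, 1, 3, 1, 2]
    print(frequency_moment(X, 0))   # F0 = 3 distinct elements
    print(frequency_moment(X, 1))   # F1 = 6, the length of the sequence
    print(frequency_moment(X, 2))   # F2 = 3² + 2² + 1² = 14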
computing statistics on data streams

How to compute the frequency moments using less than


O(n log m) space?

sketching: create a sketch that takes much less space and


gives an estimation of Fk

[Alon et al., 1999]


estimating the number of distinct values (F0)

[Flajolet and Martin, 1985]


consider a bit vector of length O(log n)
upon seeing xi, set:
the 1st bit with probability 1/2
the 2nd bit with probability 1/4
...
the i-th bit with probability 1/2^i
important: bits are set deterministically for each xi
let R be the index of the largest bit set
return Y = 2^R
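
a sketch of a single counter; a hash function supplies the deterministic
random bits, so the same element always sets the same bit:

    # sketch: a single Flajolet-Martin counter
    import hashlib

    def fm_estimate(stream):
        R = 0
        for x in stream:
            h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
            # position of the lowest set bit of h (1-indexed): i w.p. 1/2^i
            r = (h & -h).bit_length()
            R = max(R, r)
        return 2 ** R

    print(fm_estimate(range(1000)))   # within a small constant factor of 1000

    # note: a single counter is noisy; in practice many counters are combined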
estimating number of distinct values (F0)

Theorem. For every c > 2, the algorithm computes a


number Y using O(log n) memory bits, such that the
probability that the ratio between Y and F0 is not between
1/c and c is at most 2/c.
locality sensitive hashing

a family H is called (R, cR, p1 , p2 )-sensitive if for any two


objects p and q

if d(p, q) ≤ R, then PrH [h(p) = h(q)] ≥ p1

if d(p, q) ≥ cR, then PrH [h(p) = h(q)] ≤ p2

interesting case when p1 > p2


locality sensitive hashing: example

objects in a Hamming space {0, 1}d – binary vectors


H : {0, 1}^d → {0, 1}, sample the i-th bit:
H = {h(x) = xi | i = 1, . . . , d}
for two vectors x and y with distance r , it is
Pr_H [h(x) = h(y)] = 1 − r/d
thus p1 = 1 − R/d and p2 = 1 − cR/d

gap between p1 and p2 too small


probability amplification
locality sensitive hashing: Hamming distance

01000110111010111

00011 11010 01101


locality sensitive hashing: Hamming distance

Probability of collision
Pr[h(x) = h(y)] = 1 − (1 − (1 − r/d)^k)^l
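
the amplified collision probability is easy to tabulate (a sketch; k is the
block size and l the number of blocks):

    # sketch: collision probability after amplification,
    # Pr = 1 - (1 - (1 - r/d)^k)^l
    def p_collide(r, d, k, l):
        p_block = (1 - r / d) ** k      # all k sampled bits agree within a block
        return 1 - (1 - p_block) ** l   # at least one of the l blocks agrees

    d, k, l = 100, 10, 10
    for r in (5, 10, 30, 50):
        print(r, round(p_collide(r, d, k, l), 3))
    # near-1 probability for close vectors, near-0 for distant ones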
homework

how to apply locality-sensitive hashing to vectors of


integers, not just binary vectors?

vectors x = {x1 , . . . , xd }
Pd
L1 distance ||x − y||1 = i=1 |xi − yi |
Jaccard coefficient

for two sets A, B ⊆ U define J(A, B) = |A ∩ B| / |A ∪ B|
measure of similarity of the sets

can we design a locality sensitive hashing family for


Jaccard?
min-wise independent permutations

π : U → U a random permutation of U
h(A) = min{π(x) | x ∈ A}
then
Pr[h(A) = h(B)] = J(A, B) = |A ∩ B| / |A ∪ B|

amplify the probability as before:


repeat many times,
concatenate into blocks
consider objects similar if they collide in at least one
block
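
a sketch using linear hash functions as approximate permutations
(the sets and constants are illustrative):

    # sketch: min-hash signatures and block-based amplification
    import random

    P = (1 << 61) - 1   # a large prime

    def signature(A, hashes):
        # one min-hash value per (approximate) permutation pi(x) = (a*x + b) mod P
        return [min((a * x + b) % P for x in A) for a, b in hashes]

    rng = random.Random(7)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(100)]

    A, B = set(range(0, 60)), set(range(20, 80))    # J(A, B) = 40/80 = 0.5
    sa, sb = signature(A, hashes), signature(B, hashes)
    print(sum(x == y for x, y in zip(sa, sb)) / 100)    # ≈ 0.5

    # amplification: concatenate into blocks of 5; candidates if a block matches
    blocks = lambda s: [tuple(s[i:i + 5]) for i in range(0, len(s), 5)]
    print(any(x == y for x, y in zip(blocks(sa), blocks(sb))))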
homework

show that for h(A) = min{π(x) | x ∈ A} with π a random


permutation

Pr[h(A) = h(B)] = J(A, B) = |A ∩ B| / |A ∪ B|
homework

design a locality-sensitive hashing scheme for vectors according


to the cosine similarity measure

vectors x = {x1 , . . . , xd }
distance 1 − cos(x, y) = 1 − (x · y) / (||x||_2 ||y||_2)
... applications of the algorithmic tools to
real scenarios
clustering coefficient and triangles
clustering coefficient

C = (3 × number of triangles in the network) /
    (number of connected triples of vertices)

how to compute it?


how to compute the number of triangles in a graph?
assume that the graph is very large, stored in disk
[Buriol et al., 2006]
count triangles, when the graph is seen as a data stream
two models:
edges are stored in any order
edges in order — all edges incident to one vertex are
stored sequentially
counting triangles

brute-force algorithm is checking every triple of vertices


obtain an approximation by sampling triples
sampling algorithm for counting triangles

how many samples are required?

let T be the set of all triples and


Ti the set of triples that have i edges, i = 0, 1, 2, 3
by the estimator theorem, to get an ε-approximation, with
probability 1 − δ, the number of samples should be

N ≥ O( (|T| / |T3|) (1/ε²) log(1/δ) )

but |T | can be very large compared to |T3 |


counting triangles

incidence model : all edges incident to each vertex appear


in order in the stream
sample connected triples
sampling algorithm for counting triangles

incidence model
consider sample space S = {b-a-c | (a, b), (a, c) ∈ E }
|S| = Σi di(di − 1)/2

1: sample X ⊆ S (paths b-a-c)


2: estimate fraction of X for which edge (b, c) is present
3: scale by |S|

gives an (ε, δ) approximation


counting triangles — incidence stream model

SampleTriangle [Buriol et al., 2006]


1st pass
count the number of paths of length 2 in the stream
2nd pass
uniformly choose one path (a, b, c)
3rd pass
if ((b, c) ∈ E ) β = 1 else β = 0
return β
we have E[β] = 3|T3| / (|T2| + 3|T3|),
with |T2| + 3|T3| = Σu du(du − 1)/2, so

|T3| = E[β] · Σu du(du − 1) / 6

and space needed is O((1 + |T2|/|T3|) (1/ε²) log(1/δ))
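
a sketch of the estimator, with the three passes simulated over an
adjacency-list representation (names are our own):

    # sketch: SampleTriangle-style estimate in the incidence model
    import random

    def estimate_triangles(adj, samples, rng):   # adj: dict node -> set of neighbors
        # 1st pass: |S| = sum_u d_u (d_u - 1) / 2 length-2 paths
        S = sum(len(nu) * (len(nu) - 1) // 2 for nu in adj.values())
        beta_sum = 0
        for _ in range(samples):
            # 2nd pass: pick one path b-a-c uniformly at random
            r = rng.randrange(S)
            for a, nu in adj.items():
                paths = len(nu) * (len(nu) - 1) // 2
                if r < paths:
                    b, c = rng.sample(sorted(nu), 2)
                    break
                r -= paths
            # 3rd pass: beta = 1 iff the closing edge (b, c) exists
            beta_sum += int(c in adj[b])
        # |T3| = E[beta] * sum_u d_u (d_u - 1) / 6 = E[beta] * |S| / 3
        return (beta_sum / samples) * S / 3

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(estimate_triangles(adj, 5000, random.Random(0)))   # ≈ 1 triangle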
properties of the sampling space

it should be possible to
estimate the size of the sampling space
sample an element uniformly at random
homework

1 compute triangles in 3 passes when edges


appear in arbitrary order
2 compute triangles in 1 pass when edges
appear in arbitrary order
3 compute triangles in 1 pass in the incidence model
triangle sparsifiers

[Tsourakakis et al., 2011]


start with graph G (V , E )
use sparsification parameter p
pick a random subset E 0 of edges
each edge is selected with probability p
T3' = # triangles on graph G'(V, E')
return T3 = T3'/p^3

T3 is highly concentrated around the true number


of triangles
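
a sketch of the sparsifier; each triangle survives with probability p^3,
hence the scaling:

    # sketch: triangle sparsifier — count on a p-sample, scale by 1/p^3
    import random
    from itertools import combinations

    def count_triangles(edges):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        # each triangle is found once per edge, hence divide by 3
        return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

    def sparsified_count(edges, p, rng):
        kept = [e for e in edges if rng.random() < p]   # keep each edge w.p. p
        return count_triangles(kept) / p ** 3

    edges = list(combinations(range(30), 2))    # K_30 has C(30, 3) = 4060 triangles
    print(sparsified_count(edges, 0.5, random.Random(3)))   # around 4060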
counting graph minors
counting other minors

count all minors in a very large graph


connected subgraphs
size 3 and 4
directed or undirected graphs
why?
modeling networks, “signature” structures, e.g., copying
model
anomaly detection, e.g., spam link farms
[Alon, 2007, Bordino et al., 2008]
counting minors in large graphs

characterize a graph by the distribution of its minors

all undirected minors of size 4

all directed minors of size 3


sampling algorithm for counting triangles

incidence model
consider sample space S = {b-a-c | (a, b), (a, c) ∈ E }
|S| = Σi di(di − 1)/2

1: sample X ⊆ S (paths b-a-c)


2: estimate fraction of X for which edge (b, c) is present
3: scale by |S|

gives an (ε, δ) approximation


adapt the algorithm to count all minors of size 3 and 4
and directed and undirected graphs
adapting the algorithm

sampling spaces:

3-node directed

4-node undirected

are the sampling space properties satisfied?


datasets

graph class type # instances


synthetic un/directed 39
wikipedia un/directed 7
webgraphs un/directed 5
cellular directed 43
citation directed 3
food webs directed 6
word adjacency directed 4
author collaboration undirected 5
autonomous systems undirected 12
protein interaction undirected 3
US road undirected 12
clustering of undirected graphs

assigned to 0 1 2 3 4 5 6
AS graph 12 0 0 0 0 0 0
collaboration 0 0 3 2 0 0 0
protein 1 0 0 1 0 0 1
road-graph 0 12 0 0 0 0 0
wikipedia 0 0 0 0 2 5 0
synthetic 11 0 0 0 0 0 28
webgraph 2 0 0 1 0 0 0
clustering of directed graphs

feature class                           accuracy vs. ground truth

standard topological properties (81)    0.74
minors of size 3                        0.78
minors of size 4                        0.84
minors of size 3 and 4                  0.91
local statistics
compute local statistics in large graphs

our goal: compute triangle counts for all vertices


local clustering coefficient and related statistics
motivation
motifs can be used to characterize network
families [Alon, 2007, Bordino et al., 2008]
analysis of social or biological networks
thematic relationships in the web
web spam

applications: spam detection and content quality analysis


in social media
semi-streaming model

[Feigenbaum et al., 2004]


data stream model too restrictive
graph stored in secondary memory as adjacency or edge
list
✗ no random access possible
O(N log N) bits available in main memory
limited amount of information per vertex
✗ not enough to store edges in main memory
limited (constant or O(log N)) number of passes
compute counts for all vertices concurrently
two algorithms

1 external memory
keep a counter for each vertex (main memory)
keep a counter for each edge (secondary memory)

2 main memory
keep a counter for each vertex
number of triangles for edges and nodes

neighbors: N(u) = {v : (u, v ) ∈ E }


degree: d(u) = |N(u)|
edge triangles: Tuv = |N(u) ∩ N(v )|
vertex triangles: T(u) = (1/2) Σ_{v∈N(u)} Tuv

all possible triangles: (1/2) d(u)(d(u) − 1)


clustering coefficient

global clustering coefficient


C1 = 2 Σu T(u) / Σu d(u)(d(u) − 1)

C2 = (1/|V|) Σ_{u∈V} 2 T(u) / (d(u)(d(u) − 1))

local clustering coefficient

C(u) = 2 T(u) / (d(u)(d(u) − 1))
computing triangles : idea
consider the Jaccard coefficient between
two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|

if we knew J(N(u), N(v)) = J, then (solving
J = Tuv / (|N(u)| + |N(v)| − Tuv) for Tuv):

Tuv = |N(u) ∩ N(v)| = (J / (J + 1)) (|N(u)| + |N(v)|)

and then:

T(u) = (1/2) Σ_{v∈N(u)} Tuv
estimating Jaccard coefficient : basic technique

assume family F of permutations π : [n] → [n]


F is min-wise independent [Broder et al., 1997]
if for every A, B ⊆ [n], π(·) chosen u.a.r. in F

Pr[min π(A) = min π(B)] = J(A, B) = |A ∩ B| / |A ∪ B|

estimation of J(A, B) (Jaccard coefficient)


In practice...
exponential space (Ω(n) bits) needed to represent
min-wise families [Broder et al., 1998]
π(x) = (ax + b) mod p, a, b chosen u.a.r., p a large
prime [Bohman et al., 2000]
computing triangles : idea
we want:

Tuv = |N(u) ∩ N(v)| = (J / (J + 1)) (|N(u)| + |N(v)|)

approximate the Jaccard coefficient:

m independent trials
Zuv : # times that min π(N(u)) = min π(N(v))

use the estimator:

T̄uv = (Zuv / (Zuv + m)) (|N(u)| + |N(v)|)
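
putting the pieces together for a single edge (a sketch with linear
permutations; the sets and constants are illustrative):

    # sketch: estimate T_uv for one edge (u, v) with m min-hash trials
    import random

    P = (1 << 61) - 1   # a large prime

    def estimate_Tuv(Nu, Nv, m, rng):
        z = 0
        for _ in range(m):
            a, b = rng.randrange(1, P), rng.randrange(P)
            # minima collide with probability ~ J(N(u), N(v))
            if min((a * x + b) % P for x in Nu) == min((a * x + b) % P for x in Nv):
                z += 1
        return z / (z + m) * (len(Nu) + len(Nv))

    Nu, Nv = set(range(0, 40)), set(range(20, 60))   # true |N(u) ∩ N(v)| = 20
    print(estimate_Tuv(Nu, Nv, 500, random.Random(5)))   # close to 20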
external-memory algorithm

semi-stream model
keep vertex min-hash values (in memory)
keep edge counters (on disk)
use edge counters to estimate number of triangles and
local clustering coefficient
external-memory algorithm
1: Z = 0
2: for i: 1 . . . m do {independent trials}
3: for u : 1 . . . |V | do {assign labels}
4: li (u) = hashi (u) {Min-wise linear permutation}
5: end for
6: for u : 1 . . . |V | do {compute fingerprints}
7: Fi (u) = min_{v∈N(u)} li (v)
8: end for{1 scan of G }
9: for u : 1 . . . |V | do {update counters}
10: for v ∈ N(u) do
11: if (Fi (u) = Fi (v )) then {minima are equal}
12: Zuv = Zuv + 1 {Zuv ’s stored on disk}
13: end if
14: end for
15: end for
16: end for
implementation

hashi (x) is, e.g., a linear hash function (ai x + bi mod p)


for every i, the Fi (u)’s can be kept in main memory
the Zuv ’s must be stored on disk
for every i, updating Zuv requires access to disk
computing counters most expensive operation
main-memory algorithm
replace:

T̄uv = (Zuv / (Zuv + m)) (|N(u)| + |N(v)|)

by the estimator for |N(u) ∩ N(v)|:

T̃uv = (2/3) (Zuv / m) (|N(u)| + |N(v)|)

and estimator for T(u):

T̃(u) = (1/(3m)) Σ_{v∈N(u)} Zuv (|N(u)| + |N(v)|) = Zu / (3m)

Zu sums d(u) + d(v) if min π(N(u)) = min π(N(v))


only one counter per node
main-memory algorithm
1: Z = 0
2: for i: 1 . . . m do {Independent trials}
3: for u : 1 . . . |V | do {Assign labels}
4: li (u) = hashi (u)
5: end for
6: for u : 1 . . . |V | do {Compute fingerprints}
7: Fi (u) = min_{v∈N(u)} li (v)
8: end for{1 scan of G }
9: for u : 1 . . . |V | do {Update counters}
10: for v ∈ N(u) do
11: if Fi (u) == Fi (v ) then {Minima are equal}
12: Zu = Zu + d(u) + d(v ) {Zu ’s in main memory}
13: end if
14: end for
15: end for
16: end for
experimental results

graph     nodes   edges   Algorithm 1 (ext. mem.)   Algorithm 2 (main mem.)
WB-2001   118M    1.7G    10 hr 20 min              3 hr 40 min
IT-2004   41M     2.1G    8 hr 20 min               5 hr 30 min
UK-2006   77M     5.3G    20 hr 30 min              13 hr 10 min
quality of approximation

[figure: IT-2004: average relative error vs. number of passes (10 to 70),
for Algorithm 1 (ext. mem.) and Algorithm 2 (main mem.)]
quality of approximation

[figure: IT-2004: Pearson's correlation coefficient vs. number of passes
(10 to 70), for Algorithm 1 (ext. mem.), Algorithm 2 (main mem.), and the
naive approximation d(d−1)/2]
applications : spam detection

[figure: exact method: histogram of triangle counts (fraction of hosts vs.
number of triangles, 1 to 100000) for normal and spam hosts]
Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection

[figure: external-memory algorithm: histogram of triangle counts (fraction
of hosts vs. number of triangles) for normal and spam hosts]
Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection

[figure: main-memory approximation: histogram of triangle counts (fraction
of hosts vs. number of triangles) for normal and spam hosts]
Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection

the number-of-triangles feature is ranked 60th out of 221 for


spam detection
applications : content quality in yahoo! answers

Yahoo! answers, a question-answering portal


consider the graph with edges (u, v) if user u has answered
a question of user v
consider as "high quality" users those who have given a best
answer to a random sample of questions
predict high-quality users based on their local structure
applications : content quality in yahoo! answers

[figure: histograms of the fraction of best answers for average-quality
and high-quality users]

Separation of users who have provided questions/answers of high
quality from users who have provided questions/answers of normal
quality, in terms of the fraction of best answers
applications : content quality in yahoo! answers

[figure: histograms of the local clustering coefficient for average-quality
and high-quality users]

Separation of users who have provided questions/answers of high
quality from users who have provided questions/answers of normal
quality, in terms of the local clustering coefficient
graph distance distributions
small-world phenomena

small worlds : graphs with short paths

Stanley Milgram (1933-1984)


“The man who shocked the world”
obedience to authority (1963)
small-world experiment (1967)
Milgram’s experiment

300 people (starting population) are asked to dispatch a


parcel to a single individual (target)
the target was a Boston stockbroker
the starting population is selected as follows:
100 were random Boston inhabitants (group A)
100 were random Nebraska stockholders (group B)
100 were random Nebraska inhabitants (group C)
Milgram’s experiment

rules of the game :


parcels could be directly sent only to someone the sender
knows personally
453 intermediaries happened to be involved in the
experiments (besides the starting population and the
target)
Milgram’s experiment

questions Milgram wanted to answer:


1. how many parcels will reach the target?
2. what is the distribution of the number of hops required to
reach the target?
3. is this distribution different for the three starting
subpopulations?
Milgram’s experiment

answers to the questions


1. how many parcels will reach the target?
29%
2. what is the distribution of the number of hops required to
reach the target?
average was 5.2
3. is this distribution different for the three starting
subpopulations?
yes: average for groups A/B/C was 4.6/5.4/5.7
chain lengths
Milgram’s popularity

“Six degrees of separation” entered the world of popular


imagination:

“Six degrees of separation” is a play by John Guare...


...a movie by Fred Schepisi...
...a song sung by dolls in their national costume at
Disneyland in a heart-warming exhibition celebrating the
connectedness of all people
Milgram’s criticisms

“Could it be a big world after all? (The six degrees of


separation myth)”
—Judith S. Kleinfeld, 2002

the vast majority of chains were never completed


extremely difficult to reproduce
measuring what?

but what did Milgram's experiment reveal, after all?

1. that the world is small

2. that people are able to exploit this smallness
graph distance distribution

obtain information about a large graph,


e.g., a social network
macroscopic level
distance distribution
mean distance
median distance
diameter
effective diameter
...
graph distance distribution

given a graph, d(x, y ) is the length of the shortest path


from x to y
defined as ∞ if one cannot go from x to y
for undirected graphs, d(x, y ) = d(y , x)
for every t, count the number of pairs (x, y ) such that
d(x, y ) = t
the fraction of pairs at distance t is a distribution
exact computation

how can one compute the distance distribution?

weighted graphs: Dijkstra (single-source: O(m log n)),


Floyd-Warshall (all-pairs: O(n^3))
in the unweighted case:
a single BFS solves the single-source version of the
problem: O(m)
if we repeat it from every source: O(nm)
sampling pairs

sample at random pairs of nodes (x, y )


compute d(x, y ) with a BFS from x
(possibly: reject the pair if d(x, y ) is infinite)
sampling pairs

for every t, the fraction of sampled pairs that were found
at distance t is an estimator of the value of the
probability mass function
takes a BFS for every sampled pair — O(m) each
sampling sources

sample at random a source t


compute a full BFS from t
sampling sources

it is an unbiased estimator only for undirected and


connected graphs
it still uses BFS...
...which is not cache friendly
...and not compression friendly
idea : diffusion

[Palmer et al., 2002]

let Bt (x) be the ball of radius t around x


(the set of nodes at distance ≤ t from x)
clearly B0 (x) = {x}
moreover B_{t+1}(x) = ∪_{(x,y)∈E} B_t(y) ∪ {x}
so computing Bt+1 from Bt just takes a single
(sequential) scan of the graph
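
a sketch of the exact diffusion; the sets B_t(x) are precisely what ANF
later replaces with probabilistic counters:

    # sketch: neighborhood function by diffusion, one edge scan per step
    def neighborhood_function(nodes, edges, tmax):
        B = {x: {x} for x in nodes}                 # B_0(x) = {x}
        for t in range(1, tmax + 1):
            nxt = {x: set(b) for x, b in B.items()}
            for x, y in edges:                      # sequential scan of the graph
                nxt[x] |= B[y]                      # B_{t+1}(x) absorbs B_t(y)
                nxt[y] |= B[x]                      # undirected: both directions
            B = nxt
            print(t, sum(len(b) for b in B.values()))   # pairs within distance t
        return B

    neighborhood_function(range(5), [(0, 1), (1, 2), (2, 3), (3, 4)], 4)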
easy but costly

every set requires O(n) bits, hence O(n^2) bits overall
too many!
what about using approximated sets?
we need probabilistic counters, with just two primitives:
add and size
very small!
estimating the number of distinct values (F0)

[Flajolet and Martin, 1985]


consider a bit vector of length O(log n)
upon seeing xi, set:
the 1st bit with probability 1/2
the 2nd bit with probability 1/4
...
the i-th bit with probability 1/2^i
important: bits are set deterministically for each xi
let R be the index of the largest bit set
return Y = 2^R
ANF

probabilistic counter for approximating the number of


distinct values [Flajolet and Martin, 1985]
ANF algorithm [Palmer et al., 2002]
uses the original probabilistic counters
HyperANF algorithm [Boldi et al., 2011]
uses HyperLogLog counters [Flajolet et al., 2007]
HyperANF

HyperLogLog counter [Flajolet et al., 2007]


with 40 bits you can count up to 4 billion with a standard
deviation of 6%
remember: one set per node
implementation tricks

[Boldi et al., 2011]

use broadword programming to compute unions efficiently
systolic computation for on-demand updates of counters
exploit micro-parallelization of multicore architectures
performance

HADI, a Hadoop-conscious implementation of ANF


[Kang et al., 2011]
takes 30 minutes on a 200K-node graph
(on one of the world's 50 largest supercomputers)
HyperANF does the same in 2.25min on a workstation
(20 min on a laptop).
experiments on facebook

[Backstrom et al., 2011]


considered only active users

it : only italian users


se : only swedish users
it + se : only italian and swedish users
us : only US users
the whole facebook (750m nodes)

based on users' current geo-IP location


distance distribution (it)
distance distribution (se)
distance distribution (fb)
average distance

2008 2012
it 6.58 3.90
se 4.33 3.89
it+se 4.90 4.16
us 4.74 4.32
fb 5.28 4.74

fb 2012 : 92% pairs are reachable!


effective diameter

2008 2012
it 9.0 5.2
se 5.9 5.3
it+se 6.8 5.8
us 6.5 5.8
fb 7.0 6.2
actual diameter

2008 2012
it > 29 = 25
se > 16 = 25
it+se > 21 = 27
us > 17 = 30
fb > 17 > 58
breaking the news
another application : spid

[Boldi et al., 2011]

spid : shortest-paths index of dispersion


the ratio between variance and average in the distance
distribution
spid < 1 : the distribution is subdispersed
spid > 1 : the distribution is superdispersed
web graphs and social networks have different spid!
spid plot
the spid conjecture

[Boldi et al., 2011] conjecture that spid is able to tell


social networks from web graphs
average distance alone would not suffice: it is very
changeable and depends on the scale
spid, instead, seems to have a clear cutpoint at 1
what is facebook spid? 0.093
indexing distances in large graphs
shortest-path distances in large graphs

input: consider a graph G = (V , E )


and nodes s and t in V
goal: compute the shortest-path distance d(s, t)
from s to t
do it very fast
well-studied problem

different strategies
lazy
compute shortest path at query time
Dijkstra, bfs
no precomputation
bfs takes O(m)
too expensive for large graphs
eager
precompute all-pairs shortest paths
FloydWarshall, matrix multiplication
O(n3 ) precomputation, O(n2 ) storage
too large to store
applications of shortest-path queries
searching in graphs — I. context-sensitive search

"chili peppers"

music → RHCP
food → mexican cuisine
searching in graphs — I. context-sensitive search

customize search results to the user’s current page or


recent history of pages they have visited
increasing relevance of answers
disambiguation
suggesting links to wikipedia editors
searching in graphs — II. social search

consider more information than just contacts


preferences
geographical information
comments
favorites
tags
etc.
machine-learning approach

learn a ranking function that combines a large number


of features
content-based features:
tf/idf, bm25, etc., as in traditional ir and web search
content similarity between the querying node and a
target node
link-based features:
PageRank
shortest-path distance from the querying node to a
target node
spectral distance from the querying node to a target
node
graph-based similarity measures
context-specific PageRank
well-studied problem

different strategies
lazy
compute shortest path at query time
Dijkstra, bfs
no precomputation
bfs takes O(m)
too expensive for large graphs
eager
precompute all-pairs shortest paths
FloydWarshall, matrix multiplication
O(n3 ) precomputation, O(n2 ) storage
too large to store
anything in between?

is there a smooth tradeoff between

⟨O(1), O(m)⟩ and ⟨O(n^2), O(1)⟩


distance oracles

[Thorup and Zwick, 2005]

given a graph G = (V , E )
an (α, β)-approximate distance oracle
is a data structure S that
for a query pair of nodes (u, v ), S returns dS (u, v ) s.t.

d(u, v ) ≤ dS (u, v ) ≤ α d(u, v ) + β

α called stretch or distortion


consider the preprocessing time, the required space, and
the query time
distance oracles

[Thorup and Zwick, 2005]

given k, construct an oracle with


storage O(k n^(1+1/k)), query time O(k), stretch 2k − 1

k =1
⇒ APSP

k = log n
⇒ storage O(n log n), query time O(log n), stretch
O(log n)
distance oracles — preprocessing

[Das Sarma et al., 2010]

1 r = ⌊log |V|⌋
2 sample r + 1 sets of sizes 1, 2, 2^2, 2^3, . . . , 2^r
3 call the sampled sets S0, S1, . . . , Sr
4 for each node u and each set Si compute (wi, δi),
where δi = d(u, wi) = min_{v∈Si} {d(u, v)}
5 sketch[u] = {(w0 , δ0 ), . . . , (wr , δr )}
6 repeat k times
distance oracles — query processing

[Das Sarma et al., 2010]


given query (u, v )

1 obtain sketch[u] and sketch[v ]


2 find the set of common nodes w in sketch[u] and
sketch[v ]
3 for each common node w, look up d(u, w) and d(w, v)
from the sketches
4 return the minimum of d(u, w) + d(w, v),
taken over all common nodes w
5 if no common w is present, then return ∞
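
a sketch of both phases for unweighted undirected graphs, with BFS as the
distance routine (the k repetitions are omitted; names are our own):

    # sketch: Das Sarma et al.-style sketches via multi-source BFS
    import math
    import random
    from collections import deque

    def nearest_seed(adj, seeds):
        # multi-source BFS: node -> (distance to the seed set, seed attaining it)
        best = {s: (0, s) for s in seeds}
        q = deque(seeds)
        while q:
            u = q.popleft()
            du, wu = best[u]
            for v in adj[u]:
                if v not in best:
                    best[v] = (du + 1, wu)
                    q.append(v)
        return best

    def build_sketches(adj, rng):
        nodes = list(adj)
        r = int(math.log2(len(nodes)))
        sketch = {u: {} for u in nodes}
        for i in range(r + 1):              # seed sets of size 1, 2, 4, ..., 2^r
            S = rng.sample(nodes, min(2 ** i, len(nodes)))
            for u, (d, w) in nearest_seed(adj, S).items():
                sketch[u][w] = min(d, sketch[u].get(w, math.inf))   # (w_i, delta_i)
        return sketch

    def query(sketch, u, v):
        common = sketch[u].keys() & sketch[v].keys()
        return min((sketch[u][w] + sketch[v][w] for w in common), default=math.inf)

    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    sk = build_sketches(adj, random.Random(2))
    print(query(sk, 0, 4))   # an upper bound on d(0, 4) = 4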
landmark-based approach

precompute: distance from each node to a fixed


landmark l
then

|d(s, l) − d(t, l)| ≤ d(s, t) ≤ d(s, l) + d(l, t)

precompute: distances to d landmarks, l1 , . . . , ld

max_i |d(s, li) − d(t, li)| ≤ d(s, t) ≤ min_i (d(s, li) + d(li, t))

obtain a range estimate in time O(d) (i.e., constant)
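
a sketch of the constant-time bounds, given BFS distances precomputed from
each landmark (names are our own):

    # sketch: landmark-based lower/upper bounds on d(s, t)
    from collections import deque

    def bfs_distances(adj, source):
        dist = {source: 0}
        q = deque([source])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    def landmark_bounds(dist_maps, s, t):
        lo = max(abs(d[s] - d[t]) for d in dist_maps)   # triangle-inequality lower bound
        hi = min(d[s] + d[t] for d in dist_maps)        # route-through-landmark upper bound
        return lo, hi

    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    dist_maps = [bfs_distances(adj, l) for l in (0, 4)]  # precomputed once, O(m) each
    print(landmark_bounds(dist_maps, 1, 3))              # (2, 4); true distance is 2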


landmark-based approach

motivated by indexing general metric spaces


used for estimating latency in the internet
[Ng and Zhang, 2008]
typically randomly chosen landmarks
theoretical results

[Kleinberg et al., 2004]

random landmarks can provide distance estimates with


distortion (1 + δ) for a fraction of at least (1 − ε) of pairs
number of landmarks required depends on ε, δ, and the
doubling dimension k of the metric space
approximation guarantee in practice

what does a logarithmic approximation guarantee mean in a


small-world graph?
the landmark selection problem

how to choose good landmarks in practice?


good landmarks

if l lies on a shortest path from s to t,
then d(s, t) = d(s, l) + d(l, t) (the upper bound is tight)

if t lies on a shortest path from s to l,
then |d(s, l) − d(t, l)| = d(s, t) (the lower bound is tight)
good (upper-bound) landmarks

a landmark l covers a pair (s, t) if l is on a shortest path


from s to t
problem definition: find a set L ⊆ V of k landmarks that
cover as many pairs (s, t) ∈ V × V as possible

NP-hard
for k = 1: the node with the highest betweenness centrality
for k > 1: apply a “natural” set-cover approach
(but O(n3 ))
landmark selection heuristics

high-degree nodes
high-centrality nodes
“constrained” versions
once a node is selected none of its neighbors is selected
“clustered” versions
cluster the graph and select one landmark per cluster
select landmarks on the “borders” between clusters
datasets

        # nodes   # edges   median     effective   clustering
                            distance   diameter    coefficient
flickr  801 K     8 M       5          8           0.11
dblp    226 K     716 K     9          13          0.47


flickr-implicit — distance error
[figure: Flickr Implicit dataset: distance estimation error vs. number of
seeds (1 to 1000), for the Rand, Centr/1, High/1, and Border strategies]
dblp — precision @ 5
[figure: DBLP dataset: precision @ 5 vs. number of seeds (1 to 1000),
for the Rand, Centr/1, High/1, and Border strategies]
triangulation task

[Kleinberg et al., 2004]

[figure: number of queries vs. L/U ratio on the DBLP and Y!IM datasets,
for the Rand, Degree/P, and Border strategies]
comparing with exact algorithm

[Goldberg and Harrelson, 2005]


landmarks (10%)   Fl.-E   Fl.-I   Wiki     DBLP     Y!IM
Method            Cent    Cent    Cent/P   Bord/P   Bord/P
Landmarks used    20      100     500      50       50
Nodes visited     1       1       1        1        1
Operations        20      100     500      50       50
CPU ticks         2       10      50       5        5

ALT (exact)       Fl.-E   Fl.-I   Wiki     DBLP     Y!IM
Method            Ikeda   Ikeda   Ikeda    Ikeda    Ikeda
Landmarks used    8       4       4        8        4
Nodes visited     7245    10337   19616    2458     2162
Operations        56502   41349   78647    19666    8648
CPU ticks         7062    10519   25868    1536     1856
summary and outlook

graphs are simple models that can represent a wide


variety of datasets
real graphs from different application domains have many
similar properties but also differences
scaling simple computations on very large graphs is
challenging
use of probabilistic and approximation techniques
just scratched the surface
model and understand processes in networks: influence,
information cascades
deal with heterogeneous data, combine different sources,
respect user privacy
thank you!
acknowledgements

a number of slides have been adapted from presentations by
Panayiotis Tsaparas and Paolo Boldi
references I

Alon, N., Matias, Y., and Szegedy, M. (1999).


The space complexity of approximating the frequency moments.
J. Comput. Syst. Sci., 58(1):137–147.

Alon, U. (2007).
Network motifs: theory and experimental approaches.
Nature Reviews Genetics.
Backstrom, L., Boldi, P., Rosa, M., Ugander, J., and Vigna, S.
(2011).
Four degrees of separation.
CoRR, abs/1111.4570.

Bohman, T., Cooper, C., and Frieze, A. M. (2000).


Min-wise independent linear permutations.
Electronic Journal of Combinatorics, 7.
references II

Boldi, P., Rosa, M., and Vigna, S. (2011).


HyperANF: approximating the neighborhood function of very large
graphs on a budget.
In WWW.
Bordino, I., Donato, D., Gionis, A., and Leonardi, S. (2008).
Mining large networks with subgraph counting.
In ICDM.
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M.
(1998).
Min-wise independent permutations.
In STOC, New York, NY, USA.
references III
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G.
(1997).
Syntactic clustering of the web.
In Selected papers from the sixth international conference on World
Wide Web, pages 1157–1166, Essex, UK. Elsevier Science Publishers
Ltd.
Buriol, L. S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A.,
and Sohler, C. (2006).
Counting triangles in data streams.
In PODS ’06: Proceedings of the twenty-fifth ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, pages 253–262, New York, NY, USA. ACM Press.

Das Sarma, A., Gollapudi, S., Najork, M., and Panigrahy, R. (2010).
A sketch-based distance oracle for web-scale graphs.
In WSDM, pages 401–410.
references IV

Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., and Zhang, J.
(2004).
On graph problems in a semi-streaming model.
In 31st International Colloquium on Automata, Languages and
Programming.

Flajolet, P., Fusy, É., Gandouet, O., and Meunier, F. (2007).


Hyperloglog: the analysis of a near-optimal cardinality estimation
algorithm.
In Proceedings of the 13th conference on analysis of algorithm
(AofA).

Flajolet, P. and Martin, G. N. (1985).


Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
references V

Goldberg, A. and Harrelson, C. (2005).


Computing the shortest path: A* search meets graph theory.
In SODA.
Kang, U., Tsourakakis, C. E., Appel, A. P., Faloutsos, C., and
Leskovec, J. (2011).
HADI: Mining radii of large graphs.
ACM TKDD, 5.

Kleinberg, J., Slivkins, A., and Wexler, T. (2004).


Triangulation and embedding using small sets of beacons.
In FOCS.
references VI

Leskovec, J., Kleinberg, J., and Faloutsos, C. (2005).


Graphs over time: densification laws, shrinking diameters and
possible explanations.
In KDD ’05: Proceeding of the eleventh ACM SIGKDD international
conference on Knowledge discovery in data mining, pages 177–187,
New York, NY, USA. ACM Press.

Faloutsos, M., Faloutsos, P., and Faloutsos, C. (1999).


On power-law relationships of the internet topology.
In SIGCOMM.
Newman, M. E. J. (2003).
The structure and function of complex networks.
SIAM Review, 45(2):167–256.
references VII

Ng, E. and Zhang, H. (2008).


Predicting internet network distance with coordinate-based
approaches.
In INFOCOM.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 81–90, New York,
NY, USA. ACM Press.

Thorup, M. and Zwick, U. (2005).


Approximate distance oracles.
JACM, 52(1):1–24.
references VIII

Tsourakakis, C., Kolountzakis, M., and Miller, G. (2011).


Triangle sparsifiers.
Journal of Graph Algorithms and Applications, 15(6).
