
algorithms for mining large graphs

Aristides Gionis
Yahoo! Research, Barcelona
yahoo! research barcelona
research areas and internships!

research areas

distributed systems and search


multimedia retrieval
semantic web
social media
social network analysis
web mining

http://barcelona.research.yahoo.net/internships
this lecture

graph mining algorithms


slides will be available
a few questions as optional homework
introduction to graphs and networks
graphs: a simple model

entities – set of vertices


pairwise relations among vertices
– set of edges
can add directions, weights,. . .
graphs can be used to model many real
datasets
people who are friends
computers that are interconnected
web pages that point to each other
proteins that interact
graph theory

graph theory started in the 18th century, with Leonhard Euler
the problem of the Königsberg bridges
since then, graphs have been studied
extensively
analysis of graph datasets in the past

graph datasets have been studied in the past


e.g., networks of highways, social networks
usually these datasets were small
visual inspection can reveal a lot of information
analysis of graph datasets now

more and larger networks appear


products of technological advancement
e.g., internet, web
result of our ability to collect more, better-quality, and
more complex data
e.g., gene regulatory networks
networks of thousands, millions, or billions of nodes
impossible to visualize
the internet map
types of networks

social networks
knowledge and information networks
technology networks
biological networks
social networks

links denote a social interaction


networks of acquaintances
collaboration networks
actor networks
co-authorship networks
director networks
phone-call networks
e-mail networks
IM networks
sexual networks
knowledge and information networks

nodes store information, links


associate information
citation network (directed
acyclic)
the web (directed)
peer-to-peer networks
word networks
networks of trust
software graphs
bluetooth networks
home page/blog networks
technological networks

networks built for distribution of a commodity


the internet, power grids, telephone networks
airline networks, transportation networks
US power grid
biological networks

biological systems represented as networks


protein-protein interaction networks
gene regulation networks
gene co-expression networks
metabolic pathways
the food web
neural networks
photo-sharing site
what is the underlying graph?

nodes: photos, tags, users, groups, albums, sets,


collections, geo, query, . . .
edges: upload, belong, tag, create, join, contact, friend,
family, comment, fave, search, click, . . .
also many interesting induced graphs
tag graph: based on photos
tag graph: based on users
user graph: based on favorites
user graph: based on groups

which graph to pick — not an easy choice


recurring theme

social media, user-generated content


user interaction is composed of many atomic actions
post, comment, like, mark, join, fave, thumbs-up, . . .
generates all kind of interesting graphs to mine
now what?

the world is full of networks


what do we do with them?
understand their topology and measure their properties
study their evolution and dynamics
create realistic models
create algorithms that make use of the network structure
concepts to study

paths and connectivity


path lengths and diameter — small-world phenomena
degree distributions and degree correlations
communities and clusters
flows and cuts
processes
evolution, random walks, information cascades,
epidemics, . . .
the name of the game

many graph mining problems can be solved with


polynomial-time algorithms
even low-degree polynomials
however, for very large graphs this is not enough
need algorithms in almost linear time
resort to approximation, streaming models, sampling,
sketching, etc...
random graphs
random graphs

Erdős-Rényi random graph model

Paul Erdős (1913–1996) and Alfréd Rényi (1921–1970)
random graphs

the G (n, p) model:


n : the number of vertices
0 ≤ p ≤ 1 : probability
for each pair (u, v ), independently generate the edge
(u, v ) with probability p
G(n, p) is a family of graphs, in which a graph with m
edges appears with probability p^m (1 − p)^(C(n,2) − m)

the G (n, m) model: related, but not identical
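
as a concrete illustration, a minimal Python sketch of sampling from G(n, p)
(not from the original deck; function names are our own):

    # sketch: sample a graph from the G(n, p) model
    # vertices are 0..n-1; edges are returned as a set of pairs
    import random
    from itertools import combinations

    def gnp(n, p, seed=None):
        rng = random.Random(seed)
        # each of the C(n, 2) pairs becomes an edge independently with probability p
        return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

    edges = gnp(1000, 0.01, seed=42)
    print(len(edges))   # expected number of edges: p * n * (n - 1) / 2 ≈ 4995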


properties of random graphs

a property P holds almost surely (for almost every graph) if

lim_{n→∞} Pr[G has P] = 1

which properties hold as p increases?


threshold phenomena : many properties appear suddenly
there exists a probability pc such that
for p < pc the property does not hold a.s.
for p > pc the property holds a.s.
the giant component

let z = np be the average degree


if z < 1 the largest component has size O(log n) a.s.
if z > 1 the largest component has size Θ(n) a.s.;
the second largest component has size O(log n) a.s.
if z = ω(log n) the graph is connected a.s.
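
the threshold at z = 1 can be seen in a small self-contained simulation
(a sketch; names are our own):

    # sketch: largest-component size of G(n, z/n) as z crosses 1
    import random
    from itertools import combinations

    def largest_component_size(n, p, rng):
        adj = [[] for _ in range(n)]
        for u, v in combinations(range(n), 2):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
        seen, best = [False] * n, 0
        for s in range(n):
            if seen[s]:
                continue
            seen[s] = True
            stack, size = [s], 0
            while stack:
                u = stack.pop()
                size += 1
                for w in adj[u]:
                    if not seen[w]:
                        seen[w] = True
                        stack.append(w)
            best = max(best, size)
        return best

    rng = random.Random(0)
    for z in (0.5, 1.0, 2.0):
        print(z, largest_component_size(2000, z / 2000, rng))
    # z = 0.5 gives a tiny largest component, z = 2.0 a constant fraction of n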
phase transition

if z = 1 there is a phase transition


the largest component has size O(n^(2/3))
the sizes of the components follow a power-law
random graphs and real datasets

degree distribution : binomial

Ck = C(n, k) p^k (1 − p)^(n−k) ≈ z^k e^(−z) / k!

highly concentrated around the mean (z = np)


probability of very high degree is exponentially small

clustering coefficient
probability that two vertices are connected is
independent of the local structure
random graphs and real datasets

a beautiful and elegant theory studied exhaustively


have been used as idealized generative models
unfortunately, they don’t capture reality. . .
structure of real-world graphs
properties of real graphs

diverse collections of graphs arising from different phenomena

are there any typical patterns?

at which level should we look for commonalities?


microscopic : degree distribution
mesoscopic : communities
macroscopic : small diameters
degree distribution

Ck = number of vertices with degree k

problem : find the probability distribution that fits best


the observed data
degree distribution

Ck = number of vertices with degree k, then

Ck = c k^(−γ)

with γ > 1, or

ln Ck = ln c − γ ln k

so, plotting ln Ck versus ln k gives a straight line with


slope −γ
heavy-tail distribution : there is a non-negligible fraction
of nodes that has very high degree (hubs)
scale free : average is not informative
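
a sketch of the straight-line fit on synthetic data (least squares on the
log-log plot; illustrative only, maximum-likelihood estimation of γ is
preferred in practice):

    # sketch: estimate the slope of ln C_k versus ln k by least squares
    import math
    import random
    from collections import Counter

    def loglog_slope(degrees):
        counts = Counter(degrees)   # C_k for every observed degree k
        pts = [(math.log(k), math.log(c)) for k, c in counts.items() if k > 0]
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        num = sum((x - mx) * (y - my) for x, y in pts)
        den = sum((x - mx) ** 2 for x, _ in pts)
        return num / den            # ≈ −γ for power-law data

    # synthetic power-law degrees: Pr[X ≥ x] ∝ x^(−1.3), so γ ≈ 2.3
    rng = random.Random(1)
    degrees = [int(rng.paretovariate(1.3)) for _ in range(100000)]
    print(loglog_slope(degrees))    # roughly −2.3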
power-law degree distribution
power-law degree distribution

[Newman, 2003]
contrast — degree distribution of random graphs
maximum degree

for random graphs, the maximum degree is highly


concentrated around the average degree z
for power-law graphs

dmax ≈ n^(1/(γ−1))

hand-waving argument: solve n Pr[X ≥ d] = 1; for a power law
Pr[X ≥ d] ∝ d^(−(γ−1)), which gives d ≈ n^(1/(γ−1))


community structure

intuitively a subset of vertices that are more connected to


each other than to other vertices in the graph
a proposed measure is clustering coefficient
C1 = (3 × number of triangles in the network) /
     (number of connected triples of vertices)
captures “transitivity of clustering”
if u is connected to v and
v is connected to w , it is also likely that
u is connected to w
community structure at local level
community structure

alternative definition
local clustering coefficient
Ci = (number of triangles connected to vertex i) /
     (number of triples centered at vertex i)
global clustering coefficient

C2 = (1/n) Σi Ci

community structure is captured by large values of


clustering coefficient
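
both definitions in one short sketch over adjacency sets
(plain Python, names are our own):

    # sketch: global (C1) and average-local (C2) clustering coefficients
    from itertools import combinations

    def clustering_coefficients(adj):   # adj: dict vertex -> set of neighbors
        tri = triples = 0
        local = {}
        for u, nu in adj.items():
            # triangles through u = connected pairs of u's neighbors
            t = sum(1 for v, w in combinations(sorted(nu), 2) if w in adj[v])
            d = len(nu)
            tri += t                        # each triangle is counted 3 times
            triples += d * (d - 1) // 2     # connected triples centered at u
            local[u] = 2 * t / (d * (d - 1)) if d > 1 else 0.0
        C1 = tri / triples                  # = 3 × triangles / connected triples
        C2 = sum(local.values()) / len(adj)
        return C1, C2

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(clustering_coefficients(adj))     # one triangle: C1 = 3/5, C2 = 7/12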
small-world phenomena

small worlds : graphs with short paths

Stanley Milgram (1933-1984)


“The man who shocked the world”
obedience to authority (1963)
small-world experiment (1967)
small-world experiments

letters were handed out to people in Nebraska to be sent


to a target in Boston
people were instructed to pass on the letters to someone
they knew on first-name basis
the letters that reached the destination (64 / 296)
followed paths of length around 6
six degrees of separation : (play by John Guare)
also:
the Kevin Bacon game
the Erdős number
small-world project:
http://smallworld.columbia.edu/index.html
small diameter

proposed measures
diameter : largest shortest-path over all pairs
effective diameter : the smallest distance d such that at
least 90% of the pairs of vertices are within distance d
average shortest path : average of the shortest paths over
all pairs of vertices
characteristic path length : median of the shortest paths
over all pairs of vertices
hop-plots : plot of |Nh (u)|, the number of neighbors of u
at distance at most h, as a function of h
[M. Faloutsos, 1999] conjectured that it grows
exponentially and considered the hop exponent
measurements on real graphs

                      n        m            γ    C1    C2    ℓ
film actors           449 913  25 516 482   2.3  0.20  0.78  3.48
internet              10 697   31 992       2.5  0.03  0.39  3.31
protein interactions  2 115    2 240        2.4  0.07  0.07  6.80

[Newman, 2003]
other properties

degree correlations
distribution of size of connected components
resilience
eigenvalues
distribution of motifs
properties of evolving graphs

[Leskovec et al., 2005] discovered two interesting and


counter-intuitive phenomena
densification power law

|Et| ∝ |Vt|^α, 1 ≤ α ≤ 2

diameter is shrinking
algorithmic tools
efficiency considerations

data in the web and social-media are typically of


extremely large scale (easily reaching billions)
how to compute simple statistics?
how to locate similar objects fast?
how to cluster objects?
hashing and sketching

probabilistic/approximate methods
sketching : create sketches that summarize the data and
make it possible to estimate simple statistics in small space
hashing : hash objects in such a way that similar objects
have a larger probability of being mapped to the same value
than non-similar objects
estimator theorem
consider a set of items U
a fraction ρ of them have a specific property
estimate ρ by sampling

how many samples N are needed?


N ≥ (4 / (ε² ρ)) log(2/δ)

for an ε-approximation with probability at least 1 − δ
notice: it does not depend on |U| (!)
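
a quick empirical check (a sketch; the property is simulated by a biased coin
with bias ρ):

    # sketch: the estimator theorem in action
    import math
    import random

    rho, eps, delta = 0.1, 0.2, 0.05
    N = math.ceil(4 / (eps ** 2 * rho) * math.log(2 / delta))
    print(N)    # ≈ 3689 samples, independent of |U|

    rng = random.Random(1)
    failures = 0
    for _ in range(1000):
        hits = sum(rng.random() < rho for _ in range(N))
        if abs(hits / N - rho) > eps * rho:     # outside (1 ± ε)ρ
            failures += 1
    print(failures / 1000)  # should be comfortably below δ = 0.05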
homework

use the Chernoff bound to derive the estimator theorem


computing statistics on data streams

X = (x1 , x2 , . . . , xm ) a sequence of elements


each xi is a member of the set N = {1, . . . , n}
mi = |{j : xj = i}| the number of occurrences of i
define

Fk = Σ_{i=1..n} mi^k

F0 is the number of distinct elements


F1 is the length of the sequence
F2 index of homogeneity, size of self-join, and other
applications
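
for concreteness, the exact moments of a tiny stream (a sketch of the
quantities the sketches below approximate):

    # sketch: exact frequency moments F_k of a stream
    from collections import Counter

    def frequency_moment(stream, k):
        counts = Counter(stream)            # m_i for each distinct element i
        return sum(m ** k for m in counts.values())

    X = [1, 2, 1, 3, 1, 2]
    print(frequency_moment(X, 0))   # F0 = 3 distinct elements
    print(frequency_moment(X, 1))   # F1 = 6, the length of the sequence
    print(frequency_moment(X, 2))   # F2 = 3² + 2² + 1² = 14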
computing statistics on data streams

How to compute the frequency moments using less than


O(n log m) space?

sketching: create a sketch that takes much less space and


gives an estimation of Fk

[Alon et al., 1999]


estimating the number of distinct values (F0)

[Flajolet and Martin, 1985]


consider a bit vector of length O(log n)
upon seeing xi, set:
the 1st bit with probability 1/2
the 2nd bit with probability 1/4
...
the i-th bit with probability 1/2^i
important: bits are set deterministically for each xi
let R be the index of the largest bit set
return Y = 2^R
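
a sketch of a single counter; a hash function supplies the deterministic
random bits, so the same element always sets the same bit:

    # sketch: a single Flajolet-Martin counter
    import hashlib

    def fm_estimate(stream):
        R = 0
        for x in stream:
            h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
            # position of the lowest set bit of h (1-indexed): i w.p. 1/2^i
            r = (h & -h).bit_length()
            R = max(R, r)
        return 2 ** R

    print(fm_estimate(range(1000)))   # within a small constant factor of 1000

    # note: a single counter is noisy; in practice many counters are combined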
estimating number of distinct values (F0)

Theorem. For every c > 2, the algorithm computes a


number Y using O(log n) memory bits, such that the
probability that the ratio between Y and F0 is not between
1/c and c is at most 2/c.
locality sensitive hashing

a family H is called (R, cR, p1 , p2 )-sensitive if for any two


objects p and q

if d(p, q) ≤ R, then PrH [h(p) = h(q)] ≥ p1

if d(p, q) ≥ cR, then PrH [h(p) = h(q)] ≤ p2

interesting case when p1 > p2


locality sensitive hashing: example

objects in a Hamming space {0, 1}d – binary vectors


H : {0, 1}^d → {0, 1}, sample the i-th bit:
H = {h(x) = xi | i = 1, . . . , d}
for two vectors x and y with distance r , it is
Pr_H [h(x) = h(y)] = 1 − r/d
thus p1 = 1 − R/d and p2 = 1 − cR/d

gap between p1 and p2 too small


probability amplification
locality sensitive hashing: Hamming distance

01000110111010111

00011 11010 01101


locality sensitive hashing: Hamming distance

Probability of collision
Pr[h(x) = h(y)] = 1 − (1 − (1 − r/d)^k)^l
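
the amplified collision probability is easy to tabulate (a sketch; k is the
block size and l the number of blocks):

    # sketch: collision probability after amplification,
    # Pr = 1 - (1 - (1 - r/d)^k)^l
    def p_collide(r, d, k, l):
        p_block = (1 - r / d) ** k      # all k sampled bits agree within a block
        return 1 - (1 - p_block) ** l   # at least one of the l blocks agrees

    d, k, l = 100, 10, 10
    for r in (5, 10, 30, 50):
        print(r, round(p_collide(r, d, k, l), 3))
    # near-1 probability for close vectors, near-0 for distant ones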
homework

how to apply locality-sensitive hashing to vectors of


integers, not just binary vectors?

vectors x = {x1 , . . . , xd }
Pd
L1 distance ||x − y||1 = i=1 |xi − yi |
Jaccard coefficient

for two sets A, B ⊆ U define J(A, B) = |A ∩ B| / |A ∪ B|
measure of similarity of the sets

can we design a locality sensitive hashing family for


Jaccard?
min-wise independent permutations

π : U → U a random permutation of U
h(A) = min{π(x) | x ∈ A}
then
Pr[h(A) = h(B)] = J(A, B) = |A ∩ B| / |A ∪ B|

amplify the probability as before:


repeat many times,
concatenate into blocks
consider objects similar if they collide in at least one
block
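
a sketch using linear hash functions as approximate permutations
(the sets and constants are illustrative):

    # sketch: min-hash signatures and block-based amplification
    import random

    P = (1 << 61) - 1   # a large prime

    def signature(A, hashes):
        # one min-hash value per (approximate) permutation pi(x) = (a*x + b) mod P
        return [min((a * x + b) % P for x in A) for a, b in hashes]

    rng = random.Random(7)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(100)]

    A, B = set(range(0, 60)), set(range(20, 80))    # J(A, B) = 40/80 = 0.5
    sa, sb = signature(A, hashes), signature(B, hashes)
    print(sum(x == y for x, y in zip(sa, sb)) / 100)    # ≈ 0.5

    # amplification: concatenate into blocks of 5; candidates if a block matches
    blocks = lambda s: [tuple(s[i:i + 5]) for i in range(0, len(s), 5)]
    print(any(x == y for x, y in zip(blocks(sa), blocks(sb))))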
homework

show that for h(A) = min{π(x) | x ∈ A} with π a random


permutation

Pr[h(A) = h(B)] = J(A, B) = |A ∩ B| / |A ∪ B|
homework

design a locality-sensitive hashing scheme for vectors according


to the cosine similarity measure

vectors x = {x1 , . . . , xd }
distance 1 − cos(x, y) = 1 − (x · y) / (||x||_2 ||y||_2)
... applications of the algorithmic tools to
real scenarios
clustering coefficient and triangles
clustering coefficient

C = (3 × number of triangles in the network) /
    (number of connected triples of vertices)

how to compute it?


how to compute the number of triangles in a graph?
assume that the graph is very large, stored in disk
[Buriol et al., 2006]
count triangles, when the graph is seen as a data stream
two models:
edges are stored in any order
edges in order — all edges incident to one vertex are
stored sequentially
counting triangles

brute-force algorithm is checking every triple of vertices


obtain an approximation by sampling triples
sampling algorithm for counting triangles

how many samples are required?

let T be the set of all triples and


Ti the set of triples that have i edges, i = 0, 1, 2, 3
by the estimator theorem, to get an ε-approximation, with
probability 1 − δ, the number of samples should be

N ≥ O( (|T| / |T3|) (1/ε²) log(1/δ) )

but |T | can be very large compared to |T3 |


counting triangles

incidence model : all edges incident to each vertex appear


in order in the stream
sample connected triples
sampling algorithm for counting triangles

incidence model
consider sample space S = {b-a-c | (a, b), (a, c) ∈ E }
|S| = Σi di(di − 1)/2

1: sample X ⊆ S (paths b-a-c)


2: estimate fraction of X for which edge (b, c) is present
3: scale by |S|

gives an (ε, δ) approximation


counting triangles — incidence stream model

SampleTriangle [Buriol et al., 2006]


1st pass
count the number of paths of length 2 in the stream
2nd pass
uniformly choose one path (a, b, c)
3rd pass
if ((b, c) ∈ E ) β = 1 else β = 0
return β
we have E[β] = 3|T3| / (|T2| + 3|T3|),
with |T2| + 3|T3| = Σu du(du − 1)/2, so

|T3| = E[β] · Σu du(du − 1) / 6

and space needed is O((1 + |T2|/|T3|) (1/ε²) log(1/δ))
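
a sketch of the estimator, with the three passes simulated over an
adjacency-list representation (names are our own):

    # sketch: SampleTriangle-style estimate in the incidence model
    import random

    def estimate_triangles(adj, samples, rng):   # adj: dict node -> set of neighbors
        # 1st pass: |S| = sum_u d_u (d_u - 1) / 2 length-2 paths
        S = sum(len(nu) * (len(nu) - 1) // 2 for nu in adj.values())
        beta_sum = 0
        for _ in range(samples):
            # 2nd pass: pick one path b-a-c uniformly at random
            r = rng.randrange(S)
            for a, nu in adj.items():
                paths = len(nu) * (len(nu) - 1) // 2
                if r < paths:
                    b, c = rng.sample(sorted(nu), 2)
                    break
                r -= paths
            # 3rd pass: beta = 1 iff the closing edge (b, c) exists
            beta_sum += int(c in adj[b])
        # |T3| = E[beta] * sum_u d_u (d_u - 1) / 6 = E[beta] * |S| / 3
        return (beta_sum / samples) * S / 3

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(estimate_triangles(adj, 5000, random.Random(0)))   # ≈ 1 triangle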
properties of the sampling space

it should be possible to
estimate the size of the sampling space
sample an element uniformly at random
homework

1 compute triangles in 3 passes when edges


appear in arbitrary order
2 compute triangles in 1 pass when edges
appear in arbitrary order
3 compute triangles in 1 pass in the incidence model
triangle sparsifiers

[Tsourakakis et al., 2011]


start with graph G (V , E )
use sparsification parameter p
pick a random subset E 0 of edges
each edge is selected with probability p
T3' = # triangles on graph G'(V, E')
return T3 = T3'/p^3

T3 is highly concentrated around the true number


of triangles
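
a sketch of the sparsifier; each triangle survives with probability p^3,
hence the scaling:

    # sketch: triangle sparsifier — count on a p-sample, scale by 1/p^3
    import random
    from itertools import combinations

    def count_triangles(edges):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        # each triangle is found once per edge, hence divide by 3
        return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

    def sparsified_count(edges, p, rng):
        kept = [e for e in edges if rng.random() < p]   # keep each edge w.p. p
        return count_triangles(kept) / p ** 3

    edges = list(combinations(range(30), 2))    # K_30 has C(30, 3) = 4060 triangles
    print(sparsified_count(edges, 0.5, random.Random(3)))   # around 4060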
counting graph minors
counting other minors

count all minors in a very large graph


connected subgraphs
size 3 and 4
directed or undirected graphs
why?
modeling networks, “signature” structures, e.g., copying
model
anomaly detection, e.g., spam link farms
[Alon, 2007, Bordino et al., 2008]
counting minors in large graphs

characterize a graph by the distribution of its minors

all undirected minors of size 4

all directed minors of size 3


sampling algorithm for counting triangles

incidence model
consider sample space S = {b-a-c | (a, b), (a, c) ∈ E }
|S| = Σi di(di − 1)/2

1: sample X ⊆ S (paths b-a-c)


2: estimate fraction of X for which edge (b, c) is present
3: scale by |S|

gives an (ε, δ) approximation


adapt the algorithm to count all minors of size 3 and 4
and directed and undirected graphs
adapting the algorithm

sampling spaces:

3-node directed

4-node undirected

are the sampling space properties satisfied?


datasets

graph class type # instances


synthetic un/directed 39
wikipedia un/directed 7
webgraphs un/directed 5
cellular directed 43
citation directed 3
food webs directed 6
word adjacency directed 4
author collaboration undirected 5
autonomous systems undirected 12
protein interaction undirected 3
US road undirected 12
clustering of undirected graphs

assigned to 0 1 2 3 4 5 6
AS graph 12 0 0 0 0 0 0
collaboration 0 0 3 2 0 0 0
protein 1 0 0 1 0 0 1
road-graph 0 12 0 0 0 0 0
wikipedia 0 0 0 0 2 5 0
synthetic 11 0 0 0 0 0 28
webgraph 2 0 0 1 0 0 0
clustering of directed graphs

feature class                           accuracy vs. ground truth

standard topological properties (81)    0.74
minors of size 3                        0.78
minors of size 4                        0.84
minors of size 3 and 4                  0.91
local statistics
compute local statistics in large graphs

our goal: compute triangle counts for all vertices


local clustering coefficient and related statistics
motivation
motifs can be used to characterize network
families [Alon, 2007, Bordino et al., 2008]
analysis of social or biological networks
thematic relationships in the web
web spam

applications: spam detection and content quality analysis


in social media
semi-streaming model

[Feigenbaum et al., 2004]


data stream model too restrictive
graph stored in secondary memory as adjacency or edge
list
✗ no random access possible
O(N log N) bits available in main memory
limited amount of information per vertex
✗ not enough to store edges in main memory
limited (constant or O(log N)) number of passes
compute counts for all vertices concurrently
two algorithms

1 external memory
keep a counter for each vertex (main memory)
keep a counter for each edge (secondary memory)

2 main memory
keep a counter for each vertex
number of triangles for edges and nodes

neighbors: N(u) = {v : (u, v ) ∈ E }


degree: d(u) = |N(u)|
edge triangles: Tuv = |N(u) ∩ N(v )|
vertex triangles: T(u) = (1/2) Σ_{v∈N(u)} Tuv

all possible triangles: (1/2) d(u)(d(u) − 1)


clustering coefficient

global clustering coefficient


C1 = 2 Σu T(u) / Σu d(u)(d(u) − 1)

C2 = (1/|V|) Σ_{u∈V} 2 T(u) / (d(u)(d(u) − 1))

local clustering coefficient

C(u) = 2 T(u) / (d(u)(d(u) − 1))
computing triangles : idea
consider the Jaccard coefficient between
two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|

if we knew J(N(u), N(v)) = J, then (solving
J = Tuv / (|N(u)| + |N(v)| − Tuv) for Tuv):

Tuv = |N(u) ∩ N(v)| = (J / (J + 1)) (|N(u)| + |N(v)|)

and then:

T(u) = (1/2) Σ_{v∈N(u)} Tuv
estimating Jaccard coefficient : basic technique

assume family F of permutations π : [n] → [n]


F is min-wise independent [Broder et al., 1997]
if for every A, B ⊆ [n], π(·) chosen u.a.r. in F

Pr[min π(A) = min π(B)] = J(A, B) = |A ∩ B| / |A ∪ B|

estimation of J(A, B) (Jaccard coefficient)


In practice...
exponential space (Ω(n) bits) needed to represent
min-wise families [Broder et al., 1998]
π(x) = (ax + b) mod p, a, b chosen u.a.r., p a large
prime [Bohman et al., 2000]
computing triangles : idea
we want:

Tuv = |N(u) ∩ N(v)| = (J / (J + 1)) (|N(u)| + |N(v)|)

approximate the Jaccard coefficient:

m independent trials
Zuv : # times that min π(N(u)) = min π(N(v))

use the estimator:

T̄uv = (Zuv / (Zuv + m)) (|N(u)| + |N(v)|)
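
putting the pieces together for a single edge (a sketch with linear
permutations; the sets and constants are illustrative):

    # sketch: estimate T_uv for one edge (u, v) with m min-hash trials
    import random

    P = (1 << 61) - 1   # a large prime

    def estimate_Tuv(Nu, Nv, m, rng):
        z = 0
        for _ in range(m):
            a, b = rng.randrange(1, P), rng.randrange(P)
            # minima collide with probability ~ J(N(u), N(v))
            if min((a * x + b) % P for x in Nu) == min((a * x + b) % P for x in Nv):
                z += 1
        return z / (z + m) * (len(Nu) + len(Nv))

    Nu, Nv = set(range(0, 40)), set(range(20, 60))   # true |N(u) ∩ N(v)| = 20
    print(estimate_Tuv(Nu, Nv, 500, random.Random(5)))   # close to 20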
external-memory algorithm

semi-stream model
keep vertex min-hash values (in memory)
keep edge counters (on disk)
use edge counters to estimate number of triangles and
local clustering coefficient
external-memory algorithm
1: Z = 0
2: for i: 1 . . . m do {independent trials}
3: for u : 1 . . . |V | do {assign labels}
4: li (u) = hashi (u) {Min-wise linear permutation}
5: end for
6: for u : 1 . . . |V | do {compute fingerprints}
7: Fi (u) = min_{v∈N(u)} li (v)
8: end for{1 scan of G }
9: for u : 1 . . . |V | do {update counters}
10: for v ∈ N(u) do
11: if (Fi (u) = Fi (v )) then {minima are equal}
12: Zuv = Zuv + 1 {Zuv ’s stored on disk}
13: end if
14: end for
15: end for
16: end for
implementation

hashi (x) is, e.g., a linear hash function (ai x + bi mod p)


for every i, the Fi (u)’s can be kept in main memory
the Zuv ’s must be stored on disk
for every i, updating Zuv requires access to disk
computing counters most expensive operation
main-memory algorithm
replace:

T̄uv = (Zuv / (Zuv + m)) (|N(u)| + |N(v)|)

by the estimator for |N(u) ∩ N(v)|:

T̃uv = (2/3) (Zuv / m) (|N(u)| + |N(v)|)

and estimator for T(u):

T̃(u) = (1/(3m)) Σ_{v∈N(u)} Zuv (|N(u)| + |N(v)|) = Zu / (3m)

Zu sums d(u) + d(v) if min π(N(u)) = min π(N(v))


only one counter per node
main-memory algorithm
1: Z = 0
2: for i: 1 . . . m do {Independent trials}
3: for u : 1 . . . |V | do {Assign labels}
4: li (u) = hashi (u)
5: end for
6: for u : 1 . . . |V | do {Compute fingerprints}
7: Fi (u) = min_{v∈N(u)} li (v)
8: end for{1 scan of G }
9: for u : 1 . . . |V | do {Update counters}
10: for v ∈ N(u) do
11: if Fi (u) == Fi (v ) then {Minima are equal}
12: Zu = Zu + d(u) + d(v ) {Zu ’s in main memory}
13: end if
14: end for
15: end for
16: end for
experimental results

graph     nodes   edges   Algorithm 1 (ext. mem.)   Algorithm 2 (main mem.)
WB-2001   118M    1.7G    10 hr 20 min              3 hr 40 min
IT-2004   41M     2.1G    8 hr 20 min               5 hr 30 min
UK-2006   77M     5.3G    20 hr 30 min              13 hr 10 min
quality of approximation

[figure: IT-2004: average relative error vs. number of passes (10 to 70),
for Algorithm 1 (ext. mem.) and Algorithm 2 (main mem.)]
quality of approximation

[figure: IT-2004: Pearson's correlation coefficient vs. number of passes
(10 to 70), for Algorithm 1 (ext. mem.), Algorithm 2 (main mem.), and the
naive approximation d(d−1)/2]
applications : spam detection

[figure: exact method: histogram of triangle counts (fraction of hosts vs.
number of triangles, 1 to 100000) for normal and spam hosts]
Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection

[figure: external-memory algorithm: histogram of triangle counts (fraction
of hosts vs. number of triangles) for normal and spam hosts]
Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection

[figure: main-memory approximation: histogram of triangle counts (fraction
of hosts vs. number of triangles) for normal and spam hosts]
Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection

the number-of-triangles feature is ranked 60th out of 221 for


spam detection
applications : content quality in yahoo! answers

Yahoo! answers, a question-answering portal


consider the graph with edges (u, v) if user u has answered
a question of user v
consider as "high quality" users those who have given a best
answer to a random sample of questions
predict high-quality users based on their local structure
applications : content quality in yahoo! answers

[figure: histograms of the fraction of best answers for average-quality
and high-quality users]

Separation of users who have provided questions/answers of high
quality from users who have provided questions/answers of normal
quality, in terms of the fraction of best answers
applications : content quality in yahoo! answers

[figure: histograms of the local clustering coefficient for average-quality
and high-quality users]

Separation of users who have provided questions/answers of high
quality from users who have provided questions/answers of normal
quality, in terms of the local clustering coefficient
graph distance distributions
small-world phenomena

small worlds : graphs with short paths

Stanley Milgram (1933-1984)


“The man who shocked the world”
obedience to authority (1963)
small-world experiment (1967)
Milgram’s experiment

300 people (starting population) are asked to dispatch a


parcel to a single individual (target)
the target was a Boston stockbroker
the starting population is selected as follows:
100 were random Boston inhabitants (group A)
100 were random Nebraska stockholders (group B)
100 were random Nebraska inhabitants (group C)
Milgram’s experiment

rules of the game :


parcels could be directly sent only to someone the sender
knows personally
453 intermediaries happened to be involved in the
experiments (besides the starting population and the
target)
Milgram’s experiment

questions Milgram wanted to answer:


1. how many parcels will reach the target?
2. what is the distribution of the number of hops required to
reach the target?
3. is this distribution different for the three starting
subpopulations?
Milgram’s experiment

answers to the questions


1. how many parcels will reach the target?
29%
2. what is the distribution of the number of hops required to
reach the target?
average was 5.2
3. is this distribution different for the three starting
subpopulations?
yes: average for groups A/B/C was 4.6/5.4/5.7
chain lengths
Milgram’s popularity

“Six degrees of separation” entered the world of popular


imagination:

“Six degrees of separation” is a play by John Guare...


...a movie by Fred Schepisi...
...a song sung by dolls in their national costume at
Disneyland in a heart-warming exhibition celebrating the
connectedness of all people
Milgram’s criticisms

“Could it be a big world after all? (The six degrees of


separation myth)”
—Judith S. Kleinfeld, 2002

the vast majority of chains were never completed


extremely difficult to reproduce
measuring what?

but what did Milgram's experiment reveal, after all?

1. that the world is small

2. that people are able to exploit this smallness
graph distance distribution

obtain information about a large graph,


e.g., a social network
macroscopic level
distance distribution
mean distance
median distance
diameter
effective diameter
...
graph distance distribution

given a graph, d(x, y ) is the length of the shortest path


from x to y
defined as ∞ if one cannot go from x to y
for undirected graphs, d(x, y ) = d(y , x)
for every t, count the number of pairs (x, y ) such that
d(x, y ) = t
the fraction of pairs at distance t is a distribution
exact computation

how can one compute the distance distribution?

weighted graphs: Dijkstra (single-source: O(m log n)),


Floyd-Warshall (all-pairs: O(n^3))
in the unweighted case:
a single BFS solves the single-source version of the
problem: O(m)
if we repeat it from every source: O(nm)
sampling pairs

sample at random pairs of nodes (x, y )


compute d(x, y ) with a BFS from x
(possibly: reject the pair if d(x, y ) is infinite)
sampling pairs

for every t, the fraction of sampled pairs that were found
at distance t is an estimator of the value of the
probability mass function
takes a BFS for every sampled pair — O(m) each
sampling sources

sample at random a source t


compute a full BFS from t
sampling sources

it is an unbiased estimator only for undirected and


connected graphs
it still uses BFS...
...which is not cache friendly
...and not compression friendly
idea : diffusion

[Palmer et al., 2002]

let Bt (x) be the ball of radius t around x


(the set of nodes at distance ≤ t from x)
clearly B0 (x) = {x}
moreover B_{t+1}(x) = ∪_{(x,y)∈E} B_t(y) ∪ {x}
so computing Bt+1 from Bt just takes a single
(sequential) scan of the graph
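
a sketch of the exact diffusion; the sets B_t(x) are precisely what ANF
later replaces with probabilistic counters:

    # sketch: neighborhood function by diffusion, one edge scan per step
    def neighborhood_function(nodes, edges, tmax):
        B = {x: {x} for x in nodes}                 # B_0(x) = {x}
        for t in range(1, tmax + 1):
            nxt = {x: set(b) for x, b in B.items()}
            for x, y in edges:                      # sequential scan of the graph
                nxt[x] |= B[y]                      # B_{t+1}(x) absorbs B_t(y)
                nxt[y] |= B[x]                      # undirected: both directions
            B = nxt
            print(t, sum(len(b) for b in B.values()))   # pairs within distance t
        return B

    neighborhood_function(range(5), [(0, 1), (1, 2), (2, 3), (3, 4)], 4)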
easy but costly

every set requires O(n) bits, hence O(n^2) bits overall
too many!
what about using approximated sets?
we need probabilistic counters, with just two primitives:
add and size
very small!
estimating the number of distinct values (F0)

[Flajolet and Martin, 1985]


consider a bit vector of length O(log n)
upon seeing xi, set:
the 1st bit with probability 1/2
the 2nd bit with probability 1/4
...
the i-th bit with probability 1/2^i
important: bits are set deterministically for each xi
let R be the index of the largest bit set
return Y = 2^R
ANF

probabilistic counter for approximating the number of


distinct values [Flajolet and Martin, 1985]
ANF algorithm [Palmer et al., 2002]
uses the original probabilistic counters
HyperANF algorithm [Boldi et al., 2011]
uses HyperLogLog counters [Flajolet et al., 2007]
HyperANF

HyperLogLog counter [Flajolet et al., 2007]


with 40 bits you can count up to 4 billion with a standard
deviation of 6%
remember: one set per node
implementation tricks

[Boldi et al., 2011]

use broadword programming to compute unions efficiently
systolic computation for on-demand updates of counters
exploit micro-parallelization of multicore architectures
performance

HADI, a Hadoop-conscious implementation of ANF


[Kang et al., 2011]
takes 30 minutes on a 200K-node graph
(on one of the world's 50 largest supercomputers)
HyperANF does the same in 2.25min on a workstation
(20 min on a laptop).
experiments on facebook

[Backstrom et al., 2011]


considered only active users

it : only italian users


se : only swedish users
it + se : only italian and swedish users
us : only US users
the whole facebook (750m nodes)

based on users' current geo-IP location


distance distribution (it)
distance distribution (se)
distance distribution (fb)
average distance

2008 2012
it 6.58 3.90
se 4.33 3.89
it+se 4.90 4.16
us 4.74 4.32
fb 5.28 4.74

fb 2012 : 92% pairs are reachable!


effective diameter

2008 2012
it 9.0 5.2
se 5.9 5.3
it+se 6.8 5.8
us 6.5 5.8
fb 7.0 6.2
actual diameter

2008 2012
it > 29 = 25
se > 16 = 25
it+se > 21 = 27
us > 17 = 30
fb > 17 > 58
breaking the news
another application : spid

[Boldi et al., 2011]

spid : shortest-paths index of dispersion


the ratio between variance and average in the distance
distribution
spid < 1 : the distribution is subdispersed
spid > 1 : the distribution is superdispersed
web graphs and social networks have different spid!
spid plot
the spid conjecture

[Boldi et al., 2011] conjecture that spid is able to tell


social networks from web graphs
average distance alone would not suffice: it is very
changeable and depends on the scale
spid, instead, seems to have a clear cutpoint at 1
what is facebook spid? 0.093
indexing distances in large graphs
shortest-path distances in large graphs

input: consider a graph G = (V , E )


and nodes s and t in V
goal: compute the shortest-path distance d(s, t)
from s to t
do it very fast
well-studied problem

different strategies
lazy
compute shortest path at query time
Dijkstra, bfs
no precomputation
bfs takes O(m)
too expensive for large graphs
eager
precompute all-pairs shortest paths
FloydWarshall, matrix multiplication
O(n3 ) precomputation, O(n2 ) storage
too large to store
applications of shortest-path queries
searching in graphs — I. context-sensitive search

"chili peppers"

music → RHCP
food → mexican cuisine
searching in graphs — I. context-sensitive search

customize search results to the user’s current page or


recent history of pages they have visited
increasing relevance of answers
disambiguation
suggesting links to wikipedia editors
searching in graphs — II. social search

consider more information than just contacts


preferences
geographical information
comments
favorites
tags
etc.
machine-learning approach

learn a ranking function that combines a large number


of features
content-based features:
tf/idf, bm25, etc., as in traditional ir and web search
content similarity between the querying node and a
target node
link-based features:
PageRank
shortest-path distance from the querying node to a
target node
spectral distance from the querying node to a target
node
graph-based similarity measures
context-specific PageRank
well-studied problem

different strategies
lazy
compute shortest path at query time
Dijkstra, bfs
no precomputation
bfs takes O(m)
too expensive for large graphs
eager
precompute all-pairs shortest paths
FloydWarshall, matrix multiplication
O(n3 ) precomputation, O(n2 ) storage
too large to store
anything in between?

is there a smooth tradeoff between

⟨O(1), O(m)⟩ and ⟨O(n^2), O(1)⟩


distance oracles

[Thorup and Zwick, 2005]

given a graph G = (V , E )
an (α, β)-approximate distance oracle
is a data structure S that
for a query pair of nodes (u, v ), S returns dS (u, v ) s.t.

d(u, v ) ≤ dS (u, v ) ≤ α d(u, v ) + β

α called stretch or distortion


consider the preprocessing time, the required space, and
the query time
distance oracles

[Thorup and Zwick, 2005]

given k, construct an oracle with


storage O(k n^(1+1/k)), query time O(k), stretch 2k − 1

k =1
⇒ APSP

k = log n
⇒ storage O(n log n), query time O(log n), stretch
O(log n)
distance oracles — preprocessing

[Das Sarma et al., 2010]

1 r = ⌊log |V|⌋
2 sample r + 1 sets of sizes 1, 2, 2^2, 2^3, . . . , 2^r
3 call the sampled sets S0, S1, . . . , Sr
4 for each node u and each set Si compute (wi, δi),
where δi = d(u, wi) = min_{v∈Si} {d(u, v)}
5 sketch[u] = {(w0 , δ0 ), . . . , (wr , δr )}
6 repeat k times
distance oracles — query processing

[Das Sarma et al., 2010]


given query (u, v )

1 obtain sketch[u] and sketch[v ]


2 find the set of common nodes w in sketch[u] and
sketch[v ]
3 for each common node w, look up d(u, w) and d(w, v)
from the sketches
4 return the minimum of d(u, w) + d(w, v),
taken over all common nodes w
5 if no common w is present, then return ∞
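
a sketch of both phases for unweighted undirected graphs, with BFS as the
distance routine (the k repetitions are omitted; names are our own):

    # sketch: Das Sarma et al.-style sketches via multi-source BFS
    import math
    import random
    from collections import deque

    def nearest_seed(adj, seeds):
        # multi-source BFS: node -> (distance to the seed set, seed attaining it)
        best = {s: (0, s) for s in seeds}
        q = deque(seeds)
        while q:
            u = q.popleft()
            du, wu = best[u]
            for v in adj[u]:
                if v not in best:
                    best[v] = (du + 1, wu)
                    q.append(v)
        return best

    def build_sketches(adj, rng):
        nodes = list(adj)
        r = int(math.log2(len(nodes)))
        sketch = {u: {} for u in nodes}
        for i in range(r + 1):              # seed sets of size 1, 2, 4, ..., 2^r
            S = rng.sample(nodes, min(2 ** i, len(nodes)))
            for u, (d, w) in nearest_seed(adj, S).items():
                sketch[u][w] = min(d, sketch[u].get(w, math.inf))   # (w_i, delta_i)
        return sketch

    def query(sketch, u, v):
        common = sketch[u].keys() & sketch[v].keys()
        return min((sketch[u][w] + sketch[v][w] for w in common), default=math.inf)

    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    sk = build_sketches(adj, random.Random(2))
    print(query(sk, 0, 4))   # an upper bound on d(0, 4) = 4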
landmark-based approach

precompute: distance from each node to a fixed


landmark l
then

|d(s, l) − d(t, l)| ≤ d(s, t) ≤ d(s, l) + d(l, t)

precompute: distances to d landmarks, l1 , . . . , ld

max_i |d(s, li) − d(t, li)| ≤ d(s, t) ≤ min_i (d(s, li) + d(li, t))

obtain a range estimate in time O(d) (i.e., constant)
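
a sketch of the constant-time bounds, given BFS distances precomputed from
each landmark (names are our own):

    # sketch: landmark-based lower/upper bounds on d(s, t)
    from collections import deque

    def bfs_distances(adj, source):
        dist = {source: 0}
        q = deque([source])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    def landmark_bounds(dist_maps, s, t):
        lo = max(abs(d[s] - d[t]) for d in dist_maps)   # triangle-inequality lower bound
        hi = min(d[s] + d[t] for d in dist_maps)        # route-through-landmark upper bound
        return lo, hi

    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    dist_maps = [bfs_distances(adj, l) for l in (0, 4)]  # precomputed once, O(m) each
    print(landmark_bounds(dist_maps, 1, 3))              # (2, 4); true distance is 2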


landmark-based approach

motivated by indexing general metric spaces


used for estimating latency in the internet
[Ng and Zhang, 2008]
typically randomly chosen landmarks
theoretical results

[Kleinberg et al., 2004]

random landmarks can provide distance estimates with


distortion (1 + δ) for a fraction of at least (1 − ε) of pairs
number of landmarks required depends on ε, δ, and the
doubling dimension k of the metric space
approximation guarantee in practice

what does a logarithmic approximation guarantee mean in a


small-world graph?
the landmark selection problem

how to choose good landmarks in practice?


good landmarks

if l lies on a shortest path from s to t,
then d(s, t) = d(s, l) + d(l, t) (the upper bound is tight)

if t lies on a shortest path from s to l,
then |d(s, l) − d(t, l)| = d(s, t) (the lower bound is tight)
good (upper-bound) landmarks

a landmark l covers a pair (s, t) if l is on a shortest path


from s to t
problem definition: find a set L ⊆ V of k landmarks that
cover as many pairs (s, t) ∈ V × V as possible

NP-hard
for k = 1: the node with the highest betweenness centrality
for k > 1: apply a “natural” set-cover approach
(but O(n3 ))
landmark selection heuristics

high-degree nodes
high-centrality nodes
“constrained” versions
once a node is selected none of its neighbors is selected
“clustered” versions
cluster the graph and select one landmark per cluster
select landmarks on the “borders” between clusters
datasets

        # nodes   # edges   median     effective   clustering
                            distance   diameter    coefficient
flickr  801 K     8 M       5          8           0.11
dblp    226 K     716 K     9          13          0.47


flickr-implicit — distance error
[figure: Flickr Implicit dataset: distance estimation error vs. number of
seeds (1 to 1000), for the Rand, Centr/1, High/1, and Border strategies]
dblp — precision @ 5
[figure: DBLP dataset: precision @ 5 vs. number of seeds (1 to 1000),
for the Rand, Centr/1, High/1, and Border strategies]
triangulation task

[Kleinberg et al., 2004]

[figure: number of queries vs. L/U ratio on the DBLP and Y!IM datasets,
for the Rand, Degree/P, and Border strategies]
comparing with exact algorithm

[Goldberg and Harrelson, 2005]


landmarks (10%)   Fl.-E   Fl.-I   Wiki     DBLP     Y!IM
Method            Cent    Cent    Cent/P   Bord/P   Bord/P
Landmarks used    20      100     500      50       50
Nodes visited     1       1       1        1        1
Operations        20      100     500      50       50
CPU ticks         2       10      50       5        5

ALT (exact)       Fl.-E   Fl.-I   Wiki     DBLP     Y!IM
Method            Ikeda   Ikeda   Ikeda    Ikeda    Ikeda
Landmarks used    8       4       4        8        4
Nodes visited     7245    10337   19616    2458     2162
Operations        56502   41349   78647    19666    8648
CPU ticks         7062    10519   25868    1536     1856
summary and outlook

graphs are simple models that can represent a wide


variety of datasets
real graphs from different application domains have many
similar properties but also differences
scaling simple computations on very large graphs is
challenging
use of probabilistic and approximation techniques
just scratched the surface
model and understand processes in networks: influence,
information cascades
deal with heterogeneous data, combine different sources,
respect user privacy
thank you!
acknowledgements

a number of slides have been adapted from presentations by
Panayiotis Tsaparas and Paolo Boldi
references I

Alon, N., Matias, Y., and Szegedy, M. (1999).


The space complexity of approximating the frequency moments.
J. Comput. Syst. Sci., 58(1):137–147.

Alon, U. (2007).
Network motifs: theory and experimental approaches.
Nature Reviews Genetics.
Backstrom, L., Boldi, P., Rosa, M., Ugander, J., and Vigna, S.
(2011).
Four degrees of separation.
CoRR, abs/1111.4570.

Bohman, T., Cooper, C., and Frieze, A. M. (2000).


Min-wise independent linear permutations.
Electronic Journal of Combinatorics, 7.
references II

Boldi, P., Rosa, M., and Vigna, S. (2011).


HyperANF: approximating the neighborhood function of very large
graphs on a budget.
In WWW.
Bordino, I., Donato, D., Gionis, A., and Leonardi, S. (2008).
Mining large networks with subgraph counting.
In ICDM.
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M.
(1998).
Min-wise independent permutations.
In STOC, New York, NY, USA.
references III
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G.
(1997).
Syntactic clustering of the web.
In Selected papers from the sixth international conference on World
Wide Web, pages 1157–1166, Essex, UK. Elsevier Science Publishers
Ltd.
Buriol, L. S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A.,
and Sohler, C. (2006).
Counting triangles in data streams.
In PODS ’06: Proceedings of the twenty-fifth ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, pages 253–262, New York, NY, USA. ACM Press.

Das Sarma, A., Gollapudi, S., Najork, M., and Panigrahy, R. (2010).
A sketch-based distance oracle for web-scale graphs.
In WSDM, pages 401–410.
references IV

Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., and Zhang, J.
(2004).
On graph problems in a semi-streaming model.
In 31st International Colloquium on Automata, Languages and
Programming.

Flajolet, P., Fusy, É., Gandouet, O., and Meunier, F. (2007).


Hyperloglog: the analysis of a near-optimal cardinality estimation
algorithm.
In Proceedings of the 13th conference on analysis of algorithm
(AofA).

Flajolet, P. and Martin, G. N. (1985).


Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
references V

Goldberg, A. and Harrelson, C. (2005).


Computing the shortest path: A* search meets graph theory.
In SODA.
Kang, U., Tsourakakis, C. E., Appel, A. P., Faloutsos, C., and
Leskovec, J. (2011).
HADI: Mining radii of large graphs.
ACM TKDD, 5.

Kleinberg, J., Slivkins, A., and Wexler, T. (2004).


Triangulation and embedding using small sets of beacons.
In FOCS.
references VI

Leskovec, J., Kleinberg, J., and Faloutsos, C. (2005).


Graphs over time: densification laws, shrinking diameters and
possible explanations.
In KDD ’05: Proceeding of the eleventh ACM SIGKDD international
conference on Knowledge discovery in data mining, pages 177–187,
New York, NY, USA. ACM Press.

Faloutsos, M., Faloutsos, P., and Faloutsos, C. (1999).


On power-law relationships of the internet topology.
In SIGCOMM.
Newman, M. E. J. (2003).
The structure and function of complex networks.
SIAM Review, 45(2):167–256.
references VII

Ng, E. and Zhang, H. (2008).


Predicting internet network distance with coordinate-based
approaches.
In INFOCOM.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 81–90, New York,
NY, USA. ACM Press.

Thorup, M. and Zwick, U. (2005).


Approximate distance oracles.
JACM, 52(1):1–24.
references VIII

Tsourakakis, C., Kolountzakis, M., and Miller, G. (2011).


Triangle sparsifiers.
Journal of Graph Algorithms and Applications, 15(6).
