Aristides Gionis
Yahoo! Research, Barcelona
research areas and internships
http://barcelona.research.yahoo.net/internships
this lecture
social networks
knowledge and information networks
technology networks
biological networks
social networks
C_k = (n choose k) p^k (1 − p)^{n−k} ≈ z^k e^{−z} / k!
clustering coefficient
probability that two vertices are connected is
independent of the local structure
random graphs and real datasets
C_k = c k^{−γ}
with γ > 1, or
ln C_k = ln c − γ ln k
[Newman, 2003]
contrast — degree distribution of random graphs
maximum degree
d_max ≈ n^{1/(α−1)}
alternative definition
local clustering coefficient
C_i = (number of triangles connected to vertex i) / (number of triples centered at vertex i)
global clustering coefficient
C_2 = (1/n) Σ_i C_i
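The two definitions can be checked directly on a toy graph; a minimal Python sketch (the graph and function names are illustrative):

```python
from itertools import combinations

def local_clustering(adj, u):
    """C_i = (# edges among neighbors of u) / (# pairs of neighbors of u)."""
    nbrs = adj[u]
    d = len(nbrs)
    if d < 2:
        return 0.0  # common convention when u has fewer than 2 neighbors
    links = sum(1 for v, w in combinations(sorted(nbrs), 2) if w in adj[v])
    return 2.0 * links / (d * (d - 1))

def global_clustering(adj):
    """C_2 = average of the local coefficients over all vertices."""
    return sum(local_clustering(adj, u) for u in adj) / len(adj)

# triangle 0-1-2 plus a pendant vertex 3 attached to 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
```

Here C_0 = 1/3 (one closed pair out of three), C_1 = C_2 = 1, and C_3 = 0.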
proposed measures
diameter : largest shortest-path over all pairs
effective diameter : the minimum distance within which 90% of
the pairs of vertices are connected
average shortest path : average of the shortest paths over
all pairs of vertices
characteristic path length : median of the shortest paths
over all pairs of vertices
hop-plots : plot of |Nh (u)|, the number of neighbors of u
at distance at most h, as a function of h
[M. Faloutsos, 1999] conjectured that |Nh (u)| grows
exponentially with h, and considered the hop exponent
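The quantities |N_h(u)| for all h come out of a single breadth-first search; a minimal sketch for an unweighted adjacency-list graph (names illustrative):

```python
from collections import deque

def hop_counts(adj, u):
    """Return c where c[h] = |N_h(u)|: vertices within distance
    at most h of u (u itself included)."""
    dist = {u: 0}
    q = deque([u])
    while q:                         # standard BFS
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    max_h = max(dist.values())
    counts = [0] * (max_h + 1)
    for d in dist.values():          # vertices at distance exactly h
        counts[d] += 1
    for h in range(1, max_h + 1):    # prefix sums: at distance <= h
        counts[h] += counts[h - 1]
    return counts

# path graph 0-1-2-3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Plotting `hop_counts(adj, u)` against h gives the hop-plot of u.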
measurements on real graphs
network                      n           m      γ    C1    C2     ℓ
film actors            449 913  25 516 482    2.3  0.20  0.78  3.48
internet                10 697      31 992    2.5  0.03  0.39  3.31
protein interactions     2 115       2 240    2.4  0.07  0.07  6.80
[Newman, 2003]
other properties
degree correlations
distribution of size of connected components
resilience
eigenvalues
distribution of motifs
properties of evolving graphs
diameter is shrinking
algorithmic tools
efficiency considerations
probabilistic/approximate methods
sketching : create sketches that summarize the data and
allow simple statistics to be estimated in small space
hashing : hash objects so that similar objects have a larger
probability of being mapped to the same value than
dissimilar objects
estimator theorem
consider a set of items U
a fraction ρ of them have a specific property
estimate ρ by sampling: O((1/ρ)(1/ε²) log(1/δ)) uniform samples
suffice to estimate ρ within relative error ε with probability at least 1 − δ
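The sampling estimator itself is one line; a minimal sketch (item set and property are illustrative; the Chernoff-bound argument gives the sample-size guarantee):

```python
import random

def estimate_fraction(items, has_property, n_samples, seed=0):
    """Estimate the fraction rho of items with a property
    by uniform sampling with replacement."""
    rng = random.Random(seed)
    hits = sum(has_property(rng.choice(items)) for _ in range(n_samples))
    return hits / n_samples

items = list(range(1000))           # 1000 items
is_even = lambda x: x % 2 == 0      # true fraction rho = 0.5
est = estimate_fraction(items, is_even, 10_000)
```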
locality sensitive hashing: Hamming distance

01000110111010111
each hash function reads k randomly chosen bit positions, e.g., 00011 or 11010

probability of collision, for x, y at Hamming distance r in d dimensions,
using l independent hash functions of k sampled bits each:

Pr[h(x) = h(y)] = 1 − (1 − (1 − r/d)^k)^l
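A simulation of this scheme, with each hash function reading k bit positions chosen uniformly with replacement (all names illustrative), tracks the collision formula:

```python
import random

def make_hash(d, k, rng):
    """One LSH hash: read k bit positions chosen u.a.r. (with replacement)."""
    pos = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in pos)

def lsh_collides(x, y, d, k, l, rng):
    """True if x and y agree under at least one of l hash functions."""
    return any(h(x) == h(y) for h in (make_hash(d, k, rng) for _ in range(l)))

d, k, l = 20, 4, 5
rng = random.Random(1)
x = [rng.randint(0, 1) for _ in range(d)]
y = list(x)
y[0] ^= 1
y[1] ^= 1                       # Hamming distance r = 2

trials = 2000
hits = sum(lsh_collides(x, y, d, k, l, rng) for _ in range(trials))
p_theory = 1 - (1 - (1 - 2/20)**4)**5   # formula with r=2, d=20, k=4, l=5
```

The empirical collision rate `hits / trials` should be close to `p_theory`.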
locality sensitive hashing: Hamming distance
homework
vectors x = (x_1 , . . . , x_d)
L1 distance ||x − y||_1 = Σ_{i=1}^{d} |x_i − y_i|
Jaccard coefficient
for two sets A, B ⊆ U define J(A, B) = |A ∩ B| / |A ∪ B|
measure of similarity of the sets
π : U → U a random permutation of U
h(A) = min{π(x) | x ∈ A}
then
Pr[h(A) = h(B)] = J(A, B) = |A ∩ B| / |A ∪ B|
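In practice the random permutation is replaced by a seeded hash function; a minimal min-hash sketch (names illustrative):

```python
import random

def minhash_signature(s, hash_seeds):
    """One min-hash value per seed: the minimum over the set
    of a seeded hash (a stand-in for min of a random permutation)."""
    return [min(hash((seed, x)) for x in s) for seed in hash_seeds]

def estimate_jaccard(a, b, n_hashes=200, seed=0):
    """Fraction of agreeing min-hash values estimates J(a, b)."""
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(n_hashes)]
    sa = minhash_signature(a, seeds)
    sb = minhash_signature(b, seeds)
    return sum(x == y for x, y in zip(sa, sb)) / n_hashes

A = set(range(0, 60))     # |A ∩ B| = 30, |A ∪ B| = 90
B = set(range(30, 90))    # so J(A, B) = 1/3
est = estimate_jaccard(A, B)
```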
homework
vectors x = (x_1 , . . . , x_d)
cosine distance 1 − cos(x, y) = 1 − (x · y) / (||x||_2 ||y||_2)
... applications of the algorithmic tools to
real scenarios
clustering coefficient and triangles
clustering coefficient
N ≥ O( (|T| / |T_3|) (1/ε²) log(1/δ) )
incidence model
consider sample space S = {b-a-c | (a, b), (a, c) ∈ E }
|S| = Σ_i d_i (d_i − 1)/2
and space needed is O( (1 + |T_2|/|T_3|) (1/ε²) log(1/δ) )
properties of the sampling space
it should be possible to
estimate the size of the sampling space
sample an element uniformly at random
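Both requirements hold for the incidence model, so triangles can be estimated by sampling wedges b-a-c and testing whether (b, c) ∈ E; each triangle closes exactly 3 wedges. A minimal sketch (names illustrative):

```python
import random

def estimate_triangles(adj, n_samples=5000, seed=0):
    """Wedge sampling over S = {b-a-c}: |S| = sum_a d_a(d_a-1)/2,
    and #triangles = (closed fraction) * |S| / 3."""
    rng = random.Random(seed)
    centers = [a for a in adj if len(adj[a]) >= 2]
    weights = [len(adj[a]) * (len(adj[a]) - 1) // 2 for a in centers]
    size_S = sum(weights)
    closed = 0
    for _ in range(n_samples):
        a = rng.choices(centers, weights=weights)[0]  # center w.p. ~ #wedges
        b, c = rng.sample(sorted(adj[a]), 2)          # random neighbor pair
        closed += c in adj[b]                         # is the wedge closed?
    return closed / n_samples * size_S / 3

# K4: every wedge is closed, and there are 4 triangles
adj = {u: {v for v in range(4) if v != u} for u in range(4)}
```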
homework
incidence model
consider sample space S = {b-a-c | (a, b), (a, c) ∈ E }
|S| = Σ_i d_i (d_i − 1)/2
sampling spaces:
3-node directed
4-node undirected
                assigned to
                 0    1    2    3    4    5    6
AS graph        12    0    0    0    0    0    0
collaboration    0    0    3    2    0    0    0
protein          1    0    0    1    0    0    1
road-graph       0   12    0    0    0    0    0
wikipedia        0    0    0    0    2    5    0
synthetic       11    0    0    0    0    0   28
webgraph         2    0    0    1    0    0    0
clustering of directed graphs
1 external memory
keep a counter for each vertex (main memory)
keep a counter for each edge (secondary memory)
2 main memory
keep a counter for each vertex
number of triangles for edges and nodes
C_2 = (1/|V|) Σ_{u∈V} 2T(u) / (d(u)(d(u) − 1))

C(u) = 2T(u) / (d(u)(d(u) − 1))
computing triangles : idea
consider the Jaccard coefficient between
two sets A and B:
J(A, B) = |A ∩ B| / |A ∪ B|

T_uv = |N(u) ∩ N(v)| = (J / (J + 1)) (|N(u)| + |N(v)|)

and then:

T(u) = (1/2) Σ_{v∈N(u)} T_uv
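The identity can be checked exactly on a small graph; a minimal sketch (graph and names illustrative):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def t_edge(adj, u, v):
    """Triangles through edge (u, v) via the Jaccard identity
    T_uv = J/(J+1) * (|N(u)| + |N(v)|)."""
    j = jaccard(adj[u], adj[v])
    return j / (j + 1) * (len(adj[u]) + len(adj[v]))

def t_vertex(adj, u):
    """T(u) = 1/2 * sum of T_uv over neighbors v of u."""
    return 0.5 * sum(t_edge(adj, u, v) for v in adj[u])

# triangle 0-1-2 with an extra edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

For edge (0, 1): J = 1/3, so T_01 = (1/4)(2 + 2) = 1, the single triangle through that edge.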
estimating Jaccard coefficient : basic technique
Pr[min π(A) = min π(B)] = J(A, B) = |A ∩ B| / |A ∪ B|
semi-stream model
keep vertex min-hash values (in memory)
keep edge counters (on disk)
use edge counters to estimate number of triangles and
local clustering coefficient
external-memory algorithm
1:  Z_uv = 0 for all edges (u, v)
2:  for i = 1 . . . m do {independent trials}
3:    for u = 1 . . . |V| do {assign labels}
4:      l_i(u) = hash_i(u) {min-wise linear permutation}
5:    end for
6:    for u = 1 . . . |V| do {compute fingerprints}
7:      F_i(u) = min_{v ∈ N(u)} l_i(v)
8:    end for {1 scan of G}
9:    for u = 1 . . . |V| do {update counters}
10:     for v ∈ N(u) do
11:       if F_i(u) = F_i(v) then {minima are equal}
12:         Z_uv = Z_uv + 1 {Z_uv's stored on disk}
13:       end if
14:     end for
15:   end for
16: end for
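An in-memory sketch of the same idea (all names illustrative): after m trials, Z_uv/m estimates J(N(u), N(v)) for every edge, which plugs into the T_uv formula above.

```python
import random

def estimate_edge_jaccard(adj, m=400, seed=0):
    """Z[u,v]/m estimates J(N(u), N(v)) for every edge (u, v), u < v."""
    rng = random.Random(seed)
    nodes = list(adj)
    Z = {}
    for _ in range(m):                                 # independent trials
        label = {u: rng.random() for u in nodes}       # random labels
        # fingerprint: minimum label in the neighborhood
        F = {u: min(label[v] for v in adj[u]) for u in nodes}
        for u in nodes:
            for v in adj[u]:
                if u < v and F[u] == F[v]:             # minima agree
                    Z[u, v] = Z.get((u, v), 0) + 1
    return {e: z / m for e, z in Z.items()}

# triangle 0-1-2: J(N(0), N(1)) = |{2}| / |{0,1,2}| = 1/3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
est = estimate_edge_jaccard(adj)
```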
implementation
Graph     Nodes   Edges   Algorithm 1 (ext. mem.)   Algorithm 2 (main mem.)
WB-2001   118M    1.7G    10 hr 20 min               3 hr 40 min
IT-2004    41M    2.1G     8 hr 20 min               5 hr 30 min
UK-2006    77M    5.3G    20 hr 30 min              13 hr 10 min
quality of approximation
[figure: average relative error vs. number of passes (10–70) on IT-2004,
for Algorithm 1 (ext. mem.) and Algorithm 2 (main mem.)]
quality of approximation
[figure: Pearson's correlation coefficient vs. number of passes (10–70) on IT-2004]
applications : spam detection
[figure: exact method — histogram of the number of triangles (1 to 100 000)
vs. fraction of hosts, for normal and spam hosts]

Separation of non-spam and spam hosts in the histogram of
triangles
applications : spam detection
[figure: histogram over the fraction of best answers (0.0–0.5),
average quality vs. high quality]

[figure: histogram over the clustering coefficient (0.00–0.06),
average quality vs. high quality]
average distance
        2008   2012
it      6.58   3.90
se      4.33   3.89
it+se   4.90   4.16
us      4.74   4.32
fb      5.28   4.74
        2008   2012
it      9.0    5.2
se      5.9    5.3
it+se   6.8    5.8
us      6.5    5.8
fb      7.0    6.2
actual diameter
        2008    2012
it      > 29    = 25
se      > 16    = 25
it+se   > 21    = 27
us      > 17    = 30
fb      > 17    > 58
breaking the news
another application : spid
different strategies
lazy
compute shortest path at query time
Dijkstra, bfs
no precomputation
bfs takes O(m)
too expensive for large graphs
eager
precompute all-pairs shortest paths
Floyd-Warshall, matrix multiplication
O(n³) precomputation, O(n²) storage
too large to store
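The lazy strategy in its simplest form answers one query with a single O(m) BFS; a minimal sketch for an unweighted graph (names illustrative):

```python
from collections import deque

def bfs_distance(adj, s, t):
    """Lazy strategy: one O(m) breadth-first search per distance query."""
    if s == t:
        return 0
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == t:              # stop as soon as t is reached
                    return dist[v]
                q.append(v)
    return None                          # t unreachable from s

# 4-cycle with a chord 1-3
adj = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}
```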
applications of shortest-path queries
searching in graphs — I. context-sensitive search

query "chilly peppers": RHCP or mexican cuisine?
with context "food" → mexican cuisine
with context "music" → RHCP
searching in graphs — I. context-sensitive search
anything in between?
given a graph G = (V, E)
an (α, β)-approximate distance oracle
is a data structure S that,
for a query pair of nodes (u, v), returns dS(u, v) s.t.
d(u, v) ≤ dS(u, v) ≤ α d(u, v) + β

k = 1 ⇒ APSP
k = log n ⇒ storage O(n log n), query time O(log n), stretch O(log n)
distance oracles — preprocessing
1. r = ⌊log |V|⌋
2. sample r + 1 sets of sizes 1, 2, 2², 2³, . . . , 2^r
3. call the sampled sets S_0, S_1, . . . , S_r
4. for each node u and each set S_i compute (w_i, δ_i),
   where δ_i = d(u, w_i) = min_{v∈S_i} d(u, v)
5. sketch[u] = {(w_0, δ_0), . . . , (w_r, δ_r)}
6. repeat k times
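A minimal in-memory sketch of this preprocessing for an unweighted, connected graph (names illustrative): each sampled set S_i is processed with one multi-source BFS, which yields the nearest seed w_i and distance δ_i for every node at once.

```python
import random
from collections import deque

def multi_source_bfs(adj, seeds):
    """For every node: distance to, and identity of, its nearest seed."""
    dist, near = {}, {}
    q = deque()
    for s in seeds:
        dist[s], near[s] = 0, s
        q.append(s)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], near[v] = dist[u] + 1, near[u]
                q.append(v)
    return dist, near

def build_sketch(adj, seed=0):
    """sketch[u] = [(w_0, delta_0), ..., (w_r, delta_r)], one per seed set."""
    rng = random.Random(seed)
    nodes = list(adj)
    r = max(len(nodes).bit_length() - 1, 0)       # floor(log2 |V|)
    sketch = {u: [] for u in nodes}
    for i in range(r + 1):
        S = rng.sample(nodes, min(2 ** i, len(nodes)))
        dist, near = multi_source_bfs(adj, S)
        for u in nodes:
            sketch[u].append((near[u], dist[u]))
    return sketch

# path 0-1-2-3-4; with |V| = 5 the seed sets have sizes 1, 2, 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
sketch = build_sketch(path)
```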
distance oracles — query processing
if l lies on a shortest path from s to t, then
d(s, t) = d(s, l) + d(l, t); in general d(s, t) ≤ d(s, l) + d(l, t) (upper bound)

if s (or t) lies on a shortest path from the other to l, then
|d(s, l) − d(t, l)| = d(s, t); in general |d(s, l) − d(t, l)| ≤ d(s, t) (lower bound)
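Combining the two cases over a set of landmarks gives bounds L ≤ d(s, t) ≤ U; a minimal sketch using exact BFS distances from each landmark (graph and names illustrative):

```python
from collections import deque

def bfs_all(adj, s):
    """Exact distances from s (BFS, unweighted)."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def landmark_bounds(adj, landmarks, s, t):
    """Tightest triangulation bounds over a set of landmarks."""
    lo, hi = 0, float("inf")
    for l in landmarks:
        d = bfs_all(adj, l)
        lo = max(lo, abs(d[s] - d[t]))   # lower bound
        hi = min(hi, d[s] + d[t])        # upper bound
    return lo, hi

# 6-cycle; true distance d(1, 4) = 3
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
lo, hi = landmark_bounds(adj, [0, 2], 1, 4)
```

In practice only the precomputed landmark distances are stored, so each query is answered from the sketch alone.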
good (upper-bound) landmarks
NP-hard
for k = 1: the node with the highest betweenness centrality
for k > 1: apply a “natural” set-cover approach (but O(n³))
landmark selection heuristics
high-degree nodes
high-centrality nodes
“constrained” versions
once a node is selected none of its neighbors is selected
“clustered” versions
cluster the graph and select one landmark per cluster
select landmarks on the “borders” between clusters
datasets
[figure: approximation error vs. number of seeds (10⁰–10³)]
dblp — precision @ 5
[figure: precision @ 5 on the DBLP dataset vs. number of seeds (10⁰–10³),
for the Rand, Centr/1, High/1, and Border landmark-selection strategies]
triangulation task
[figure: histograms of the number of queries vs. the ratio L/U
of lower to upper bound]
comparing with exact algorithm
Alon, U. (2007).
Network motifs: theory and experimental approaches.
Nature Reviews Genetics.
Backstrom, L., Boldi, P., Rosa, M., Ugander, J., and Vigna, S.
(2011).
Four degrees of separation.
CoRR, abs/1111.4570.
Das Sarma, A., Gollapudi, S., Najork, M., and Panigrahy, R. (2010).
A sketch-based distance oracle for web-scale graphs.
In WSDM, pages 401–410.
references IV
Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., and Zhang, J.
(2004).
On graph problems in a semi-streaming model.
In 31st International Colloquium on Automata, Languages and
Programming.