Graph Mining: Anuraj Mohan 13MZ01, CSED

What Are Graphs Good For?

Most existing data mining algorithms are based on a transaction representation, i.e., sets of items.
Datasets with structure, layers, hierarchy, and/or geometry often do not fit well in this transaction setting. For example:
  3D protein structures
  Chemical compounds
  Generic XML files
Why Graph Mining?

Graphs are everywhere
  Chemical compounds (cheminformatics)
  Protein structures, biological pathways/networks (bioinformatics)
  Program control flow, traffic flow, and workflow analysis
  XML databases, the Web, and social network analysis
Graph is a general model
  Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
  Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many problems are of high complexity (NP-complete or NP-hard)
Graphs, Graphs, Everywhere
[Figure: four example graphs: aspirin molecule, the Internet, yeast protein interaction network, co-author network. Credit: H. Jeong et al., Nature 411, 41 (2001)]
Modeling Data With Graphs

Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements.

Data Instance                         Graph Instance
Element                               Vertex
Element's Attributes                  Vertex Label
Relation Between Two Elements         Edge
Type of Relation                      Edge Label
Relation Between a Set of Elements    Hyper Edge

Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
Terminology-I
A graph is said to be connected if there is a path between every pair of vertices.
A graph Gs(Vs, Es) is a subgraph of another graph G(V, E) iff Vs is a subset of V and Es is a subset of E.
Two graphs G1(V1, E1) and G2(V2, E2) are isomorphic if they are topologically identical:
there is a one-to-one mapping from V1 onto V2 such that each edge in E1 is mapped to a single edge in E2 and vice versa.
Example of Graph Isomorphism
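As an illustration of the definition, here is a minimal brute-force sketch that tries every vertex mapping. It assumes small, simple, undirected, unlabeled graphs with vertices numbered 0..n-1; the function name `are_isomorphic` and the edge-list encoding are choices made for this example, not part of the slides.

```python
from itertools import permutations

def are_isomorphic(n1, edges1, n2, edges2):
    """Brute-force isomorphism test for small simple undirected graphs.

    Vertices are assumed to be 0..n-1 and edges are (u, v) pairs.
    The running time is factorial in n, so this is only a teaching aid.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    if n1 != n2 or len(e1) != len(e2):
        return False
    for perm in permutations(range(n2)):
        # perm[u] is the vertex of G2 that vertex u of G1 is mapped to.
        # The mapping is one-to-one and the edge counts are equal, so
        # "every edge of G1 lands on an edge of G2" implies the converse.
        if all(frozenset((perm[u], perm[v])) in e2 for u, v in edges1):
            return True
    return False

cycle_a = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_b = [(0, 2), (2, 1), (1, 3), (3, 0)]   # the same 4-cycle, relabeled
star = [(0, 1), (0, 2), (0, 3)]
path = [(0, 1), (1, 2), (2, 3)]
print(are_isomorphic(4, cycle_a, 4, cycle_b))
print(are_isomorphic(4, star, 4, path))
```

The two 4-cycles differ only in vertex names, while the star and the path have the same number of edges but different shapes.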
Terminology-II
Subgraph isomorphism problem
Given two graphs G1(V1, E1) and G2(V2, E2), find an isomorphism between G2 and a subgraph of G1:
a one-to-one mapping from V2 into V1 such that each edge in E2 is mapped to a single edge in E1.
NP-complete problem
  Shown by reduction from the max-clique or Hamiltonian cycle problem
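A minimal brute-force sketch of the subgraph isomorphism search follows. It assumes small, simple, undirected, unlabeled graphs with vertices numbered from 0, and it looks for a non-induced embedding (extra edges in G1 are allowed). The name `find_subgraph_embedding` is hypothetical.

```python
from itertools import permutations

def find_subgraph_embedding(n1, edges1, n2, edges2):
    """Search for G2 inside G1 by trying every injective vertex mapping.

    Returns a tuple `image`, where image[u] is the G1 vertex assigned to
    G2 vertex u, or None if no embedding exists. Exponential time.
    """
    e1 = {frozenset(e) for e in edges1}
    for image in permutations(range(n1), n2):
        # Every edge of G2 must land on an edge of G1.
        if all(frozenset((image[u], image[v])) in e1 for u, v in edges2):
            return image
    return None

# G1: a triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
g1_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
triangle = [(0, 1), (1, 2), (0, 2)]
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(find_subgraph_embedding(4, g1_edges, 3, triangle) is not None)
print(find_subgraph_embedding(4, g1_edges, 4, square))
```

The triangle is found, while the 4-cycle is not, because vertex 3 of G1 has only one neighbor.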
Frequent Subgraph Mining

Given
  D: a set of undirected, labeled graphs
  σ: support threshold; 0 < σ <= 1
Find all connected, undirected graphs that are subgraphs of at least σ · |D| of the input graphs
Containment is checked by subgraph isomorphism
Frequent Subgraph Mining
Frequent subgraphs
  A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.
Applications of graph pattern mining:
  Mining biochemical structures
  Program control flow analysis
  Mining XML structures or Web communities
  Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Finding Frequent Subgraphs: Input and Output
Input
  Database of graph transactions.
  Undirected simple graphs (no loops, no multiple edges).
  Each graph transaction has labels associated with its vertices and edges.
  Transactions may not be connected.
  Minimum support threshold σ.
Output
  Frequent subgraphs that satisfy the minimum support constraint.
  Each frequent subgraph is connected.
[Figure: input graph transactions and the output frequent connected subgraphs, with supports of 100%, 66%, and 66%]
Different Approaches for FSM

Apriori approach: FSG, path-based
DFS approach: gSpan
Greedy approach: Subdue
FSG Algorithm
[M. Kuramochi and G. Karypis. Frequent subgraph discovery. ICDM 2001]
Notation: a k-subgraph is a subgraph with k edges.

Init: Scan the transactions to find F1 and F2, the sets of all frequent 1-subgraphs and 2-subgraphs, together with their counts;
For (k = 3; Fk-1 ≠ ∅; k++)
  1. Candidate generation: build Ck, the set of candidate k-subgraphs, from Fk-1, the set of frequent (k-1)-subgraphs;
  2. Candidate pruning: a necessary condition for a candidate to be frequent is that each of its (k-1)-subgraphs is frequent;
  3. Frequency counting: scan the transactions to count the occurrences of the subgraphs in Ck;
  4. Fk = { c ∈ Ck | c has count no less than minSup }
Return F1 ∪ F2 ∪ ... ∪ Fk (= F)
FSG: Basic Flow of the Algo.

Enumerate all single- and double-edge subgraphs
Repeat
  Generate all candidate subgraphs of size (k+1) from size-k subgraphs
  Count the frequency of each candidate
  Prune subgraphs which don't satisfy the support constraint
Until (no frequent subgraphs at (k+1))
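The first step of the flow, enumerating frequent single-edge subgraphs, can be sketched as follows. This is only an illustration of the initial scan, not the FSG implementation: the transaction encoding (a vertex-label dictionary plus a list of labeled edges) and the name `frequent_single_edges` are assumptions made for this example.

```python
from collections import Counter

def frequent_single_edges(transactions, min_sup):
    """Count single-edge patterns across graph transactions.

    Each transaction is (vertex_labels, edges), where vertex_labels maps
    a vertex to its label and edges is a list of (u, v, edge_label).
    min_sup is a fraction in (0, 1]. A pattern is counted at most once
    per transaction, matching the transaction-based notion of support.
    """
    counts = Counter()
    for vertex_labels, edges in transactions:
        patterns = set()
        for u, v, edge_label in edges:
            # Sort the endpoint labels so (C, O) and (O, C) coincide.
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            patterns.add((a, edge_label, b))
        counts.update(patterns)
    threshold = min_sup * len(transactions)
    return {p: c for p, c in counts.items() if c >= threshold}

# Three toy transactions with atom-like vertex labels (made up for the demo).
transactions = [
    ({0: "C", 1: "O", 2: "C"}, [(0, 1, "single"), (0, 2, "double")]),
    ({0: "O", 1: "C"}, [(0, 1, "single")]),
    ({0: "C", 1: "N"}, [(0, 1, "single")]),
]
print(frequent_single_edges(transactions, 0.6))
```

With a threshold of 0.6, only the C-O single bond, which occurs in two of the three transactions, is kept.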
FSG: Candidate Generation I
Join two frequent size-k subgraphs to get a (k+1) candidate
  A common connected subgraph of size (k-1) is necessary
Problem
  A given size-k graph can have up to k different size-(k-1) subgraphs
  If we consider all possible subgraphs, we will end up
    Generating the same candidates multiple times
    Generating candidates that are not downward closed
    With a significant slowdown
Apriori doesn't suffer from this problem due to the lexicographic ordering of itemsets
FSG: Candidate Generation II

Joining two size-k subgraphs may produce multiple distinct graphs of size (k+1)
CASE 1: The difference can be a vertex with the same label
[Figure: candidate generation example, joining two subgraphs]
Candidate pruning: downward closure property

Every (k-1)-subgraph must be frequent.
For all the (k-1)-subgraphs of a given k-candidate, check if the downward closure property holds.
[Figure: frequent 1-subgraphs, frequent 2-subgraphs, 3-candidates, frequent 3-subgraphs, 4-candidates, frequent 4-subgraphs, ...]
Trivial operations are complicated with graphs
Candidate generation
  To determine two candidates for joining, we need to check for graph isomorphism.
Candidate pruning
  To check the downward closure property, we need graph isomorphism.
Frequency counting
  Subgraph isomorphism is needed to check containment of a frequent subgraph.
Graph Search
Querying graph databases:
Given a graph database and a query graph, find all the graphs containing this query graph
[Figure: a query graph and a graph database]
Graph Classification
Substructure based: basic idea
  Extract graph substructures F = {g1, ..., gn}
  Represent a graph with a feature vector x = {x1, ..., xn}, where xi is the frequency of gi in that graph
  Build a classification model
Graph kernel based
  Can be applied to any complex structure, provided you can define a kernel function on it
  Basic idea:
    Map each graph to some significant set of patterns
    Define a kernel on the corresponding sets of patterns
Graph clustering
Decompose a network into subnetworks based on some topological properties
Usually we look for dense subnetworks
Why? For example, to find protein complexes in a PPI network
k-Spanning Tree based Clustering

[Figure: a minimum spanning tree is cut into a k-spanning tree, giving k groups of non-overlapping vertices]
STEPS:
  Obtain the Minimum Spanning Tree (MST) of the input graph G
  Remove k-1 edges from the MST
  This results in k clusters
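The steps above can be sketched with Kruskal's algorithm and a small union-find. This is a minimal sketch under stated assumptions: the graph is connected, vertices are numbered 0..n-1, edge weights represent distance, and the k-1 removed edges are the heaviest MST edges. The function name `k_spanning_tree_clusters` and the `(weight, u, v)` edge encoding are made up for this example.

```python
def k_spanning_tree_clusters(n, weighted_edges, k):
    """Cluster a connected graph by cutting its MST into k pieces.

    weighted_edges is a list of (weight, u, v) tuples over vertices 0..n-1.
    """
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")

    def find(parent, x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 1: Kruskal's algorithm builds the minimum spanning tree.
    parent = list(range(n))
    mst = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))

    # Step 2: drop the k-1 heaviest MST edges.
    kept = sorted(mst)[: len(mst) - (k - 1)]

    # Step 3: the connected components of what is left are the clusters.
    parent = list(range(n))
    for _, u, v in kept:
        parent[find(parent, u)] = find(parent, v)
    clusters = {}
    for x in range(n):
        clusters.setdefault(find(parent, x), []).append(x)
    return sorted(clusters.values())

edges = [(1, 0, 1), (2, 1, 2), (7, 2, 3), (1, 3, 4), (8, 0, 4)]
print(k_spanning_tree_clusters(5, edges, 2))  # [[0, 1, 2], [3, 4]]
print(k_spanning_tree_clusters(5, edges, 3))  # [[0, 1], [2], [3, 4]]
```

If the edge weights represent similarity instead of distance, the same sketch would need a maximum spanning tree and removal of the lightest edges.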
Minimum Spanning Tree (MST)

The spanning tree of a graph with the minimum possible sum of edge weights, if the edge weights represent distance
Note: use the maximum possible sum of edge weights if the edge weights represent similarity
[Figure: graph G and example spanning trees with total weights 11, 13, and 17]
k-Spanning Tree
Remove the k-1 edges with the highest weight from the Minimum Spanning Tree
Note: k is the number of clusters
E.g., k=3 gives 3 clusters
[Figure: a minimum spanning tree and the 3 clusters left after removing its 2 heaviest edges]
Shared Nearest Neighbor Clustering

[Figure: a Shared Nearest Neighbor (SNN) graph is clustered into groups of non-overlapping vertices]
STEPS:
  Obtain the Shared Nearest Neighbor (SNN) graph of the input graph G
  Remove edges from the SNN graph with weight less than a threshold τ
What is Shared Nearest Neighbor?

Shared Nearest Neighbor is a proximity measure that denotes the number of neighbor nodes common to a given pair of nodes
Shared Nearest Neighbor (SNN) Graph
Given the input graph G, weight each edge (u,v) with the number of shared nearest neighbors between u and v
E.g., Node 0 and Node 1 have 2 neighbors in common: Node 2 and Node 3
[Figure: input graph G and its SNN graph with edge weights]
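A minimal sketch of the SNN weighting follows, assuming an undirected graph stored as a dictionary from each vertex to the set of its neighbors. The function name `snn_graph` and the toy graph are made up for this example; the toy graph is chosen so that nodes 0 and 1 share neighbors 2 and 3, as in the slide's example.

```python
def snn_graph(adj):
    """Weight every edge (u, v) by the number of neighbors u and v share.

    adj maps each vertex to the set of its neighbors (undirected graph).
    Returns a dict {(u, v): weight} with u < v for each edge.
    """
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                weights[(u, v)] = len(adj[u] & adj[v])
    return weights

adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
weights = snn_graph(adj)
print(weights[(0, 1)])  # 2: nodes 2 and 3 are shared
print(weights[(3, 4)])  # 0: no common neighbor
```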
Shared Nearest Neighbor Clustering
Jarvis-Patrick Algorithm
Start from the SNN graph of the input graph G
If u and v share τ or more neighbors, place them in the same cluster
E.g., τ=3
[Figure: SNN graph of G and the clusters obtained with τ=3]
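The rule above can be sketched as: keep only the edges whose endpoints share at least τ neighbors, then read off the connected components as clusters. This is a simplified, graph-based reading of Jarvis-Patrick as described on the slide; the function name `jarvis_patrick_clusters`, the parameter name `tau`, and the toy graph are assumptions made for this example.

```python
def jarvis_patrick_clusters(adj, tau):
    """Cluster an undirected graph by thresholding shared-neighbor counts.

    adj maps each vertex to the set of its neighbors. Two adjacent
    vertices end up in the same cluster when they share >= tau neighbors.
    """
    # Keep only the "strong" edges of the SNN graph.
    strong = {u: set() for u in adj}
    for u in adj:
        for v in adj[u]:
            if len(adj[u] & adj[v]) >= tau:
                strong[u].add(v)

    # Connected components over the strong edges.
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            x = stack.pop()
            component.append(x)
            for y in strong[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        clusters.append(sorted(component))
    return clusters

adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
print(jarvis_patrick_clusters(adj, 2))  # [[0, 1, 2, 3], [4]]
print(jarvis_patrick_clusters(adj, 3))  # [[0], [1], [2], [3], [4]]
```

Raising τ makes the clusters tighter: in this toy graph no pair shares three neighbors, so τ=3 leaves every vertex on its own.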
Betweenness Centrality based Clustering
Betweenness centrality quantifies the degree to which a vertex (or edge) occurs on the shortest paths between all other pairs of nodes
Two types:
  Vertex Betweenness
  Edge Betweenness
Vertex Betweenness
The number of shortest paths in the graph G that pass through a given node S
E.g., Sharon is likely a liaison between NCSU and DUKE, and hence many connections between DUKE and NCSU pass through Sharon
Edge Betweenness
The number of shortest paths in the graph G that pass through a given edge (S, B)
E.g., Sharon and Bob both study at NCSU and they are the only link between the NY DANCE and CISCO groups
Vertices and edges with high betweenness form good starting points to identify clusters
The betweenness of a vertex v in a graph with n vertices is computed as follows:
  For each pair of vertices (s,t), compute the shortest paths between them.
  For each pair of vertices (s,t), determine the fraction of shortest paths that pass through the vertex in question (here, vertex v).
  Sum this fraction over all pairs of vertices (s,t).
More compactly, the betweenness can be represented as:
  C_B(v) = \sum_{s \neq v \neq t} \sigma_{st}(v) / \sigma_{st}
where \sigma_{st} is the total number of shortest paths from s to t and \sigma_{st}(v) is the number of those paths that pass through v.
The betweenness may be normalised by dividing through by the number of pairs of vertices not including v, which for undirected graphs is (n-1)(n-2)/2.
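The computation can be sketched for a small unweighted, undirected graph as follows. Instead of enumerating all shortest paths explicitly, the sketch uses Brandes' accumulation scheme, a swapped-in technique that yields the same sums; the function name `vertex_betweenness` and the adjacency-dictionary encoding are assumptions made for this example.

```python
from collections import deque

def vertex_betweenness(adj, normalized=True):
    """Betweenness of every vertex in an unweighted, undirected graph.

    adj maps each vertex to an iterable of its neighbors.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating path fractions.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    for v in bc:
        bc[v] /= 2  # each unordered pair (s, t) was visited from both ends
        if normalized and n > 2:
            bc[v] /= (n - 1) * (n - 2) / 2
    return bc

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
bc = vertex_betweenness(path)
print(round(bc[1], 2))  # 0.67: pairs (0,2) and (0,3) out of 3 possible pairs
```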
Vertex Betweenness Clustering
Given input graph G, compute the betweenness for each vertex
Repeat until the highest vertex betweenness falls below a chosen threshold:
  Select the vertex v with the highest betweenness (e.g., vertex 3 with value 0.67)
  1. Disconnect the graph at the selected vertex (e.g., vertex 3)
  2. Copy the vertex to both components
Edge-Betweenness Clustering
Girvan and Newman Algorithm
Given input graph G, compute the betweenness for each edge
Repeat until the highest edge betweenness falls below a chosen threshold:
  Select the edge with the highest betweenness (e.g., edge (3,4) with value 0.571)
  Disconnect the graph at the selected edge (e.g., (3,4))
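A minimal sketch of this loop for a small unweighted, undirected graph follows. Edge betweenness is computed with a Brandes-style accumulation and left unnormalized (a raw count of vertex pairs), so the `threshold` parameter here is on that raw scale; this differs from the normalized values shown on the slide. The function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Raw (unnormalized) betweenness of every edge; adj maps vertex -> set."""
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist, preds, order = {s: 0}, {v: [] for v in adj}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                eb[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2 for e, b in eb.items()}  # each pair was counted twice

def components(adj):
    """Connected components as sorted vertex lists."""
    seen, result = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        result.append(sorted(comp))
    return result

def girvan_newman(adj, threshold):
    """Remove the highest-betweenness edge until the maximum is < threshold."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while True:
        eb = edge_betweenness(adj)
        if not eb:
            break
        edge, best = max(eb.items(), key=lambda kv: kv[1])
        if best < threshold:
            break
        u, v = tuple(edge)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge (2, 3).
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(edge_betweenness(g)[frozenset((2, 3))])  # 9.0: all 3 x 3 cross pairs
print(girvan_newman(g, threshold=4))           # [[0, 1, 2], [3, 4, 5]]
```

The bridge carries every shortest path between the two triangles, so it is removed first and the two triangles remain as clusters.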
Social Network Analysis
Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers, or other information/knowledge processing entities.
The nodes in the network are the people and groups, while the links show relationships or flows between the nodes.
Social Network Analysis

We measure a social network in terms of:
1. Degree Centrality: The number of direct connections a node has. What really matters is where those connections lead to.
2. Betweenness Centrality: A node with high betweenness has great influence over what flows in the network, indicating important links and single points of failure.
3. Closeness Centrality: The degree to which an individual is near all other individuals in a network (directly or indirectly). It reflects the ability to access information through the network.
Interpretation of measures

Centrality measure    Interpretation in social networks
Degree                How many people can this person reach directly?
Betweenness           How likely is this person to be the most direct route between two people in the network?
Closeness             How fast can this person reach everyone in the network?
Eigenvector           How well is this person connected to other well-connected people?
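Two of the measures in the table, degree and closeness, can be sketched in a few lines for an unweighted, undirected graph. The closeness formula used here, (number of reachable nodes) divided by (sum of shortest-path distances), is one common convention and is an assumption of this example, as are the function names and the toy star network.

```python
from collections import deque

def degree_centrality(adj):
    """Number of direct connections of each node."""
    return {v: len(adj[v]) for v in adj}

def closeness_centrality(adj):
    """Closeness as reachable-node count over total BFS distance."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

# A star: person 0 knows everyone, the others only know person 0.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(degree_centrality(star))        # {0: 3, 1: 1, 2: 1, 3: 1}
print(closeness_centrality(star)[0])  # 1.0
print(closeness_centrality(star)[1])  # 0.6
```

The hub reaches everyone in one step, so its closeness is 1.0, while each leaf needs two steps to reach the other leaves.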
The web as a graph

Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage
10-20 links/page on average
Power-law degree distribution
Web graph structure

The Web may be considered to have four major components, plus a set of unconnected pages
  Central core: the strongly connected component (SCC), pages that can reach one another along directed links, about 30% of the Web
  IN group: can reach the SCC but cannot be reached from it, about 20%
  OUT group: can be reached from the SCC but cannot reach it, about 20%
  Tendrils: cannot reach the SCC and cannot be reached by it, about 20%
  Unconnected: about 10%
The Web is hierarchical in nature. The Web has a strong locality feature: almost two thirds of all links are to sites within the enterprise domain, and only one third of the links are external.
Bow-tie Structure
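The bow-tie decomposition can be sketched on a tiny directed graph using forward and backward reachability. This is a quadratic-time illustration, not a method for Web-scale data; the function name `bow_tie`, the `succ` encoding (page to set of linked pages), and the toy graph are assumptions, and tendrils and unconnected pages are lumped together as "OTHER" for brevity.

```python
def bow_tie(succ):
    """Split a directed graph into SCC core, IN, OUT, and everything else.

    succ maps each page to the set of pages it links to.
    """
    def reach(starts, graph):
        # All nodes reachable from `starts` by following `graph` edges.
        seen = set(starts)
        stack = list(starts)
        while stack:
            x = stack.pop()
            for y in graph.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    nodes = set(succ) | {y for ys in succ.values() for y in ys}
    pred = {v: set() for v in nodes}
    for x, ys in succ.items():
        for y in ys:
            pred[y].add(x)

    # The SCC of v is (reachable from v) intersected with (can reach v).
    core, assigned = set(), set()
    for v in nodes:
        if v in assigned:
            continue
        scc = reach([v], succ) & reach([v], pred)
        assigned |= scc
        if len(scc) > len(core):
            core = scc

    downstream = reach(core, succ)
    upstream = reach(core, pred)
    return {
        "SCC": core,
        "IN": upstream - core,
        "OUT": downstream - core,
        "OTHER": nodes - downstream - upstream,
    }

web = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "o"},  # a, b, c form the core
    "i": {"a", "t"},                          # i links into the core
    "u": set(),                               # u is unconnected
}
parts = bow_tie(web)
print(sorted(parts["SCC"]), sorted(parts["IN"]), sorted(parts["OUT"]))
```

Here page "t" is a tendril hanging off the IN group and "u" is unconnected, so both land in "OTHER".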
What can the Web graph tell us?
Distinguish important pages from unimportant ones
  PageRank
Discover communities of related pages
  HITS
  Hubs and Authorities
Detect web spam
  TrustRank
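To make the first item concrete, here is a minimal power-iteration sketch of PageRank on a toy link graph. The slides only name the technique, so the details are assumptions of this example: a damping factor of 0.85, a fixed number of iterations, and dangling pages spreading their rank evenly. The function name `pagerank` and the `succ` encoding are hypothetical.

```python
def pagerank(succ, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank.

    succ maps each page to the collection of pages it links to.
    """
    nodes = sorted(set(succ) | {y for ys in succ.values() for y in ys})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = succ.get(v, ())
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

links = {"a": {"c"}, "b": {"c"}, "c": {"a"}}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))  # ['c', 'a', 'b']
```

Page "c" is linked from two pages and ranks highest, "a" inherits importance from "c", and "b" has no incoming links.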
THANK YOU
