
Graph Mining

Anuraj Mohan
13MZ01, CSED

What Are Graphs Good For?

Most existing data mining algorithms are based on a transaction representation, i.e., sets of items.
Datasets with structure, layers, hierarchy, and/or geometry often do not fit well in this transaction setting, e.g.:
3D protein structures
Chemical compounds
Generic XML files

Why Graph Mining?


Graphs are everywhere
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis

Graph is a general model


Trees, lattices, sequences, and items are degenerate graphs

Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges & vertices),
weighted, with angles & geometry (topological vs. 2-D/3-D)

Complexity of algorithms: many problems are of high complexity (NP-complete or NP-hard)

From H. Jeong et al., Nature 411, 41 (2001)

Graphs, Graphs, Everywhere

Aspirin

Internet

Yeast protein interaction network

Co-author network

Modeling Data With Graphs


Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements.

Data Instance                          Graph Instance
Element                                Vertex
Element's Attributes                   Vertex Label
Relation Between Two Elements          Edge
Type of Relation                       Edge Label
Relation Between a Set of Elements     Hyper Edge

Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.

Terminology-I
A graph is said to be connected if there is a path between every pair of vertices.
A graph Gs(Vs, Es) is a subgraph of another graph G(V, E) iff Vs is a subset of V and Es is a subset of E.

Two graphs G1(V1, E1) and G2(V2, E2) are isomorphic if they are topologically identical:
there is a one-to-one mapping from V1 onto V2 such that each edge in E1 is mapped to a single edge in E2 and vice versa.
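The isomorphism definition above can be checked directly on very small graphs. The following is a minimal brute-force sketch, assuming undirected, unlabeled simple graphs given as a vertex list and an edge list; the function name `are_isomorphic` and the sample graphs are made up for illustration. It tries every one-to-one vertex mapping, so it is only practical for a handful of vertices.

```python
from itertools import permutations

def are_isomorphic(v1, e1, v2, e2):
    """Brute-force isomorphism test for small undirected, unlabeled simple graphs."""
    edges1 = {frozenset(e) for e in e1}
    edges2 = {frozenset(e) for e in e2}
    if len(v1) != len(v2) or len(edges1) != len(edges2):
        return False
    v1 = list(v1)
    # Try every one-to-one mapping from V1 onto V2.
    for image in permutations(v2):
        mapping = dict(zip(v1, image))
        mapped = {frozenset((mapping[a], mapping[b])) for a, b in e1}
        if mapped == edges2:  # every edge maps to an edge, and vice versa
            return True
    return False

# Hypothetical sample graphs: a 4-cycle, a relabeled 4-cycle, and a triangle with a tail.
square = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
relabeled = (["a", "b", "c", "d"], [("a", "c"), ("c", "b"), ("b", "d"), ("d", "a")])
paw = ([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3), (3, 4)])

same = are_isomorphic(*square, *relabeled)
different = are_isomorphic(*square, *paw)
```

Here `same` is true because both graphs are 4-cycles, while `different` is false even though the vertex and edge counts match.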

Example of Graph
Isomorphism

Terminology-II
Subgraph isomorphism problem
Given two graphs G1(V1, E1) and G2(V2, E2): find an isomorphism between G2 and a subgraph of G1
That is, find a one-to-one mapping from V2 into V1 such that each edge in E2 is mapped to a single edge in E1

NP-complete problem
Reduction from the max-clique or Hamiltonian cycle problem
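A minimal sketch of the subgraph isomorphism test, assuming small undirected, unlabeled simple graphs and the (non-induced) subgraph definition from Terminology-I. The name `contains_subgraph` and the sample graphs are hypothetical. The search enumerates all injective vertex assignments, which reflects the exponential worst case noted above.

```python
from itertools import permutations

def contains_subgraph(g1_vertices, g1_edges, g2_vertices, g2_edges):
    """Return True if G2 is isomorphic to some subgraph of G1 (brute force)."""
    big = {frozenset(e) for e in g1_edges}
    g2_vertices = list(g2_vertices)
    # Try every injective assignment of G2's vertices to G1's vertices.
    for image in permutations(g1_vertices, len(g2_vertices)):
        mapping = dict(zip(g2_vertices, image))
        if all(frozenset((mapping[a], mapping[b])) in big for a, b in g2_edges):
            return True
    return False

# Hypothetical sample graphs.
paw = ([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3), (3, 4)])      # triangle with a tail
square = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])   # 4-cycle
triangle = (["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")])

in_paw = contains_subgraph(*paw, *triangle)
in_square = contains_subgraph(*square, *triangle)
```

The triangle is found inside the paw graph but not inside the 4-cycle.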

Frequent Subgraph Mining


Given
D: a set of undirected, labeled graphs
σ: support threshold; 0 < σ ≤ 1

Find all connected, undirected graphs that are subgraphs of at least σ · |D| of the input graphs
Subgraph isomorphism

Frequent Subgraph Mining

Frequent subgraphs
A (sub)graph is frequent if its support
(occurrence frequency) in a given dataset is no
less than a minimum support threshold
Applications of graph pattern mining:
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification,
clustering, compression, comparison, and
correlation analysis

Finding Frequent Subgraphs: Input and Output
Input: Graph Transactions

Input
Database of graph transactions.
Undirected simple graphs
(no loops, no multiple edges).
Each graph transaction has labels
associated with its vertices and
edges.
Transactions may not be
connected.
Minimum support threshold σ.

Output
Frequent subgraphs that satisfy
the minimum support constraint.
Each frequent subgraph is
connected.

Output: Frequent Connected Subgraphs

Support = 100%

Support = 66%

Support = 66%

Different Approaches for FSM


Apriori Approach
FSG
Path Based

DFS Approach
gSpan

Greedy Approach
Subdue

FSG Algorithm
[M. Kuramochi and G. Karypis. Frequent subgraph discovery. ICDM 2001]

Notation: a k-subgraph is a subgraph with k edges.

Init: Scan the transactions to find F_1 and F_2, the sets of all frequent 1-subgraphs and 2-subgraphs, together with their counts;
For (k = 3; F_{k-1} ≠ ∅; k++)
1. Candidate generation - generate C_k, the set of candidate k-subgraphs, from F_{k-1}, the set of frequent (k-1)-subgraphs;
2. Candidate pruning - a necessary condition for a candidate to be frequent is that each of its (k-1)-subgraphs is frequent;
3. Frequency counting - scan the transactions to count the occurrences of the subgraphs in C_k;
4. F_k = { c ∈ C_k | c has count no less than #minSup }
Return F_1 ∪ F_2 ∪ ... ∪ F_k (= F)
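The Init step can be illustrated for 1-subgraphs only. This is a rough sketch, not the FSG implementation: it assumes each transaction is a pair (vertex-label dict, list of (u, v, edge label) triples), and the names `frequent_single_edges` and `transactions` are made up for the example. A pattern is counted at most once per transaction, matching the support definition used in these slides.

```python
from collections import Counter

def frequent_single_edges(transactions, min_support):
    """Return {edge pattern: support count} for patterns meeting min_support.

    min_support is a fraction of the number of transactions (0 to 1).
    """
    counts = Counter()
    for vertex_labels, edges in transactions:
        patterns = set()
        for u, v, edge_label in edges:
            # Sort endpoint labels so (C, O) and (O, C) are one undirected pattern.
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            patterns.add((a, edge_label, b))
        counts.update(patterns)  # each transaction counts a pattern at most once
    threshold = min_support * len(transactions)
    return {p: c for p, c in counts.items() if c >= threshold}

# Hypothetical graph transactions (toy molecule-like graphs).
transactions = [
    ({1: "C", 2: "C", 3: "O"}, [(1, 2, "single"), (2, 3, "double")]),
    ({1: "C", 2: "O", 3: "N"}, [(1, 2, "double"), (1, 3, "single")]),
    ({1: "C", 2: "C", 3: "N"}, [(1, 2, "single"), (2, 3, "triple")]),
]
f1 = frequent_single_edges(transactions, 0.6)
```

With a 60% threshold, only the C-C single bond and the C-O double bond survive, each supported by two of the three transactions. Larger patterns would additionally need the isomorphism checks discussed later in the deck.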

FSG: Basic Flow of the Algorithm

Enumerate all single- and double-edge subgraphs
Repeat
Generate all candidate subgraphs of size (k+1) from size-k subgraphs
Count the frequency of each candidate
Prune subgraphs which don't satisfy the support constraint
Until (no frequent subgraphs at size (k+1))

FSG: Candidate Generation I

Join two frequent size-k subgraphs to get a size-(k+1) candidate
A common connected subgraph of size (k-1) is necessary

Problem
There are up to k different size-(k-1) subgraphs for a given size-k graph
If we consider all possible subgraphs, we will end up
Generating the same candidates multiple times
Generating candidates that are not downward closed
Significant slowdown

Apriori doesn't suffer from this problem, due to the lexicographic ordering of itemsets

FSG: Candidate Generation II


Joining two size-k subgraphs may produce multiple distinct graphs of size (k+1)
CASE 1: The difference can be a vertex with the same label


Candidate generation example

[Figure: two frequent subgraphs are joined (+) to produce candidates]

Candidate pruning: downward closure property

Every (k-1)-subgraph must be frequent.
For all the (k-1)-subgraphs of a given k-candidate, check if the downward closure property holds.

[Figure: 3-candidates and 4-candidates. Level-wise flow: frequent 1-subgraphs -> frequent 2-subgraphs -> 3-candidates -> frequent 3-subgraphs -> 4-candidates -> frequent 4-subgraphs -> ...]

Trivial operations are complicated with graphs
Candidate generation
To determine two candidates for joining, we
need to check for graph isomorphism.

Candidate pruning
To check downward closure property, we need
graph isomorphism.

Frequency counting
Subgraph isomorphism for checking
containment of a frequent subgraph.

Graph Search
Querying graph databases:
Given a graph database and a query graph, find
all the graphs containing this query graph

query graph

graph database


Graph Classification
Substructure-based: basic idea
Extract graph substructures $F = \{g_1, \ldots, g_n\}$
Represent a graph with a feature vector $x = \{x_1, \ldots, x_n\}$, where $x_i$ is the frequency of $g_i$ in that graph
Build a classification model
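A minimal sketch of the feature-vector step. For illustration only, it assumes every substructure g_i is a single edge identified by the labels of its two endpoints; real substructure-based classifiers typically use larger mined patterns. The names `feature_vector`, `substructures`, and the sample graph are hypothetical.

```python
from collections import Counter

def feature_vector(vertex_labels, edges, substructures):
    """Return [x_1, ..., x_n], where x_i counts occurrences of substructure g_i.

    Each substructure here is a sorted pair of endpoint labels (an assumption
    made to keep the example short).
    """
    counts = Counter(
        tuple(sorted((vertex_labels[u], vertex_labels[v]))) for u, v in edges
    )
    return [counts[g] for g in substructures]

# Hypothetical substructure set F and one labeled graph.
substructures = [("C", "C"), ("C", "O"), ("C", "N")]
labels = {1: "C", 2: "C", 3: "O", 4: "O"}
edges = [(1, 2), (2, 3), (2, 4)]
x = feature_vector(labels, edges, substructures)
```

The resulting vector `x` can then be passed to any standard classifier together with the vectors of the other graphs.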

Graph Kernel Based


Can be applied to any complex structure provided you can define
a kernel function on them

Basic idea:
Map each graph to some significant set of patterns
Define a kernel on the corresponding sets of
patterns

Graph clustering
Decompose a network into
subnetworks based on some
topological properties
Usually we look for dense
subnetworks


Graph clustering
Why?
Protein complexes in a PPI network


k-Spanning Tree based Clustering


[Figure: Minimum Spanning Tree -> k-Spanning Tree -> k groups of non-overlapping vertices]

STEPS:
Obtain the Minimum Spanning Tree (MST) of input graph G
Remove k-1 edges from the MST
Results in k clusters
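The three steps above can be sketched as follows. This is a minimal illustration, assuming a connected input graph whose edge weights are distances, given as (weight, u, v) triples with comparable vertex ids; it uses Kruskal's algorithm for the MST, which is one common choice and not necessarily the one behind these slides. The function names and the sample graph are made up.

```python
def kruskal_mst(vertices, weighted_edges):
    """Return the MST edges of a connected graph as (weight, u, v) triples."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    mst = []
    for w, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two different trees
            parent[root_u] = root_v
            mst.append((w, u, v))
    return mst

def k_spanning_tree_clusters(vertices, weighted_edges, k):
    """Drop the k-1 heaviest MST edges and return the connected components."""
    mst = sorted(kruskal_mst(vertices, weighted_edges))
    kept = mst[: max(len(mst) - (k - 1), 0)]
    adj = {v: set() for v in vertices}
    for _, u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    clusters, seen = [], set()
    for start in vertices:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Hypothetical weighted graph: (weight, u, v).
vertices = [1, 2, 3, 4, 5]
weighted_edges = [(1, 1, 2), (5, 2, 3), (1, 3, 4), (6, 4, 5), (7, 1, 3)]
clusters = k_spanning_tree_clusters(vertices, weighted_edges, k=3)
```

For this sample, the MST has total weight 13; removing its two heaviest edges (weights 6 and 5) leaves the clusters {1, 2}, {3, 4}, and {5}.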


Minimum Spanning Tree (MST)


The spanning tree of a graph with the minimum possible sum of edge weights, if the edge weights represent distance.

Note: use the maximum possible sum of edge weights if the edge weights represent similarity.

[Figure: graph G with example spanning trees; total weights shown: 11, 17, 13]

k-Spanning Tree

Remove the k-1 edges with the highest weight from the Minimum Spanning Tree.
Note: k is the number of clusters.

[Figure: Minimum Spanning Tree; e.g., k=3 gives 3 clusters]

Shared Nearest Neighbor Clustering


[Figure: Shared Nearest Neighbor Graph (SNN) -> Shared Nearest Neighbor Clustering -> groups of non-overlapping vertices]

STEPS:
Obtain the Shared Nearest Neighbor Graph (SNN) of input graph G
Remove edges from the SNN with weight less than the threshold τ


What is Shared Nearest Neighbor?


Shared Nearest Neighbor is a proximity measure and denotes
the number of neighbor nodes common between any given
pair of nodes


Shared Nearest Neighbor (SNN) Graph

Given input graph G, weight each edge (u,v) with the number of shared nearest neighbors between u and v
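A minimal sketch of the edge weighting just described, assuming an undirected graph stored as a dict that maps each node to the set of its neighbors. The name `snn_weights` and the sample graph are hypothetical, and edges whose endpoints share no neighbor are kept here with weight 0.

```python
def snn_weights(adjacency):
    """Weight each edge (u, v) of G by the number of neighbors u and v share."""
    weights = {}
    for u, neighbors in adjacency.items():
        for v in neighbors:
            edge = tuple(sorted((u, v)))  # one key per undirected edge
            weights[edge] = len(adjacency[u] & adjacency[v])
    return weights

# Hypothetical input graph G: nodes 0-3 form a clique, node 4 hangs off node 3.
adjacency = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
snn = snn_weights(adjacency)
```

In this sample, nodes 0 and 1 share the two neighbors 2 and 3, so the edge (0, 1) gets weight 2, while the edge (3, 4) gets weight 0.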

[Figure: input graph G and its SNN graph]

Node 0 and Node 1 have 2 neighbors in common: Node 2 and Node 3

Shared Nearest Neighbor Clustering
Jarvis-Patrick Algorithm

[Figure: SNN graph of input graph G]

If u and v share at least τ neighbors (τ being the chosen threshold), place them in the same cluster

E.g., τ = 3

[Figure: clusters obtained for this threshold]
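The clustering rule can be sketched as follows, assuming the input graph is a dict of neighbor sets and that only pairs joined by an edge in G are compared. The function name `jarvis_patrick`, the parameter name `tau`, and the sample graph are chosen for illustration. Edges whose endpoints share fewer than `tau` neighbors are dropped, and the remaining connected components are the clusters.

```python
def jarvis_patrick(adjacency, tau):
    """Cluster nodes: keep edges with at least tau shared neighbors, then
    return the connected components of what remains."""
    strong = {u: set() for u in adjacency}
    for u, neighbors in adjacency.items():
        for v in neighbors:
            if len(adjacency[u] & adjacency[v]) >= tau:
                strong[u].add(v)
    clusters, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(strong[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Hypothetical input graph: a 4-clique (0-3) with a pendant node 4.
adjacency = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
clusters = jarvis_patrick(adjacency, tau=2)
```

With `tau=2`, the clique nodes stay together and node 4 ends up alone; raising `tau` to 3 splits every node into its own cluster for this sample.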

Betweenness Centrality based Clustering
Betweenness centrality quantifies the degree to which a
vertex (or edge) occurs on the shortest path between
all the other pairs of nodes
Two types:
Vertex Betweenness
Edge Betweenness


Vertex Betweenness
The number of shortest paths in the graph G that pass through a given node S

E.g., Sharon is likely a liaison between NCSU and DUKE, and hence many connections between DUKE and NCSU pass through Sharon

Edge Betweenness
The number of shortest paths in the graph G that pass through a given edge (S, B)

E.g., Sharon and Bob both study at NCSU, and they are the only link between the NY DANCE and CISCO groups

Vertices and edges with high betweenness form good starting points to identify clusters

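The betweenness of a vertex v in a graph with n vertices is computed as follows:
For each pair of vertices (s,t), compute the shortest paths between them.
For each pair of vertices (s,t), determine the fraction of shortest paths that pass through the vertex in question (here, vertex v).
Sum this fraction over all pairs of vertices (s,t).
More compactly the betweenness can be represented as:

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the total number of shortest paths from s to t and $\sigma_{st}(v)$ is the number of those paths that pass through v.

The betweenness may be normalised by dividing through by the number of pairs of vertices not including v, which for undirected graphs is (n-2)(n-1)/2.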
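The computation above can be sketched for small unweighted, undirected graphs. This is a direct, unoptimized illustration of the pairwise definition (Brandes' algorithm is the usual faster alternative); the names `bfs_counts` and `vertex_betweenness` and the sample graphs are made up, and the normalised variant assumes at least three vertices.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distance and number of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]  # every shortest path to u extends to w
    return dist, sigma

def vertex_betweenness(adj, normalized=False):
    nodes = list(adj)
    info = {s: bfs_counts(adj, s) for s in nodes}
    score = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:  # each unordered pair (s, t) once
            if t not in dist_s:
                continue
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    if normalized:
        n = len(nodes)
        pairs = (n - 1) * (n - 2) / 2  # assumes n >= 3
        score = {v: b / pairs for v, b in score.items()}
    return score

# Hypothetical sample graphs: a path 1-2-3-4 and a 4-cycle.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
path_scores = vertex_betweenness(path)
path_normalized = vertex_betweenness(path, normalized=True)
cycle_scores = vertex_betweenness(cycle)
```

On the path, the two inner vertices each lie on two of the three possible shortest paths between other vertices, giving a normalised value of 2/3. On the cycle, each opposite pair has two shortest paths, so every vertex collects a fraction of 0.5.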

Vertex Betweenness Clustering

Given input graph G
Compute the betweenness for each vertex
Select the vertex v with the highest betweenness (e.g., vertex 3 with value 0.67)
1. Disconnect the graph at the selected vertex (e.g., vertex 3)
2. Copy the vertex to both components
Repeat until the highest vertex betweenness is no greater than a chosen threshold


Edge-Betweenness Clustering
Girvan and Newman Algorithm

Given input graph G
Compute the betweenness for each edge
Select the edge with the highest betweenness (e.g., edge (3,4) with value 0.571)
Disconnect the graph at the selected edge (e.g., (3,4))
Repeat until the highest edge betweenness is no greater than a chosen threshold
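A minimal sketch of this loop for small unweighted, undirected simple graphs. It uses raw (unnormalised) edge betweenness counts, so the `threshold` parameter is on a different scale from the 0.571 shown on the slide; all function names and the sample graph are hypothetical, and the pairwise computation is chosen for readability rather than speed.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distance and shortest-path counts from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def edge_betweenness(adj):
    nodes = list(adj)
    info = {s: bfs_counts(adj, s) for s in nodes}
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are in different components
            dist_t, sigma_t = info[t]
            for edge in score:
                u, v = tuple(edge)
                for a, b in ((u, v), (v, u)):
                    # Edge a-b is on a shortest s-t path if distances add up.
                    if a in dist_s and dist_s[a] + 1 + dist_t[b] == dist_s[t]:
                        score[edge] += sigma_s[a] * sigma_t[b] / sigma_s[t]
    return score

def components(adj):
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        result.append(comp)
    return result

def girvan_newman(adj, threshold):
    """Remove the highest-betweenness edge until none exceeds threshold."""
    adj = {u: set(neighbors) for u, neighbors in adj.items()}  # work on a copy
    while True:
        score = edge_betweenness(adj)  # recomputed after every removal
        heavy = [e for e in score if score[e] > threshold]
        if not heavy:
            return components(adj)
        u, v = tuple(max(heavy, key=score.get))
        adj[u].discard(v)
        adj[v].discard(u)

# Hypothetical graph: two triangles joined by the bridge (3, 4).
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
bridge_score = edge_betweenness(graph)[frozenset((3, 4))]
clusters = girvan_newman(graph, threshold=5)
```

All nine shortest paths between the two triangles cross the bridge (3, 4), so it is removed first; afterwards no edge exceeds the threshold and the two triangles are returned as clusters.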


Social Network Analysis

Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.

The nodes in the network are the people and groups, while the links show relationships or flows between the nodes.

Social Network Analysis


We measure a social network in terms of:
1. Degree Centrality: The number of direct connections a node has. What really matters is where those connections lead to.
2. Betweenness Centrality: A node with high betweenness has great influence over what flows in the network, indicating important links and single points of failure.
3. Closeness Centrality: The degree to which an individual is near all other individuals in a network (directly or indirectly). It reflects the ability to access information through the network.
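Degree and closeness centrality can be sketched in a few lines for an unweighted, undirected graph stored as a dict of neighbor sets. The closeness formula used here (number of reachable nodes divided by the sum of hop distances to them) is one common convention and is an assumption, since the slides do not give a formula; the function names and the sample graph are illustrative.

```python
from collections import deque

def degree_centrality(adj):
    """Number of direct connections of each node."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def closeness_centrality(adj):
    """(reachable nodes) / (sum of hop distances to them), per node."""
    closeness = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        closeness[source] = (len(dist) - 1) / total if total else 0.0
    return closeness

# Hypothetical star network: node 0 is connected to everyone else.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
degree = degree_centrality(star)
closeness = closeness_centrality(star)
```

In the star, the hub has degree 3 and closeness 1.0, while each leaf has degree 1 and closeness 0.6 because it reaches the other leaves only through the hub.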

Interpretation of measures

Centrality measure    Interpretation in social networks
Degree                How many people can this person reach directly?
Betweenness           How likely is this person to be the most direct route between two people in the network?
Closeness             How fast can this person reach everyone in the network?
Eigenvector           How well is this person connected to other well-connected people?


The web as a graph


Pages = nodes, hyperlinks = edges
Ignore content
Directed graph

High linkage
10-20 links/page on average
Power-law degree distribution

Web graph structure


The Web may be considered to have four major components
Central core: strongly connected component (SCC), pages that can reach one another along directed links; about 30% of the Web
IN group: can reach the SCC but cannot be reached from it; about 20%
OUT group: can be reached from the SCC but cannot reach it; about 20%

Web graph structure


Tendrils: cannot reach the SCC and cannot be reached by it; about 20%
Unconnected: about 10%
The Web is hierarchical in nature. The Web has a strong locality feature: almost two-thirds of all links are to sites within the enterprise domain, and only one-third of the links are external.
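The component definitions above translate into two reachability searches. The sketch below assumes a directed graph stored as a dict from page to the pages it links to, and a page already known to lie in the central core; `bow_tie`, `core_page`, and the toy link data are hypothetical. Tendrils and unconnected pages are lumped together as "other", since separating them needs extra analysis not shown here.

```python
def reachable(graph, start):
    """Set of nodes reachable from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(links, core_page):
    """Split pages into (SCC, IN, OUT, other) relative to core_page's SCC."""
    pages = set(links) | {v for targets in links.values() for v in targets}
    reverse = {}
    for u, targets in links.items():
        for v in targets:
            reverse.setdefault(v, set()).add(u)
    forward = reachable(links, core_page)     # pages the core can reach
    backward = reachable(reverse, core_page)  # pages that can reach the core
    scc = forward & backward
    in_group = backward - scc
    out_group = forward - scc
    other = pages - forward - backward        # tendrils and unconnected pages
    return scc, in_group, out_group, other

# Hypothetical toy Web graph.
links = {
    "in": ["a"],
    "a": ["b"],
    "b": ["a", "out"],
    "out": [],
    "island": [],
}
scc, in_group, out_group, other = bow_tie(links, core_page="a")
```

For the toy data, pages a and b form the core, "in" can reach it, "out" is reached from it, and "island" falls outside all three.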


Bow-tie Structure

What can the Web graph tell us?

Distinguish important pages from unimportant ones
PageRank

Discover communities of related pages
HITS
Hubs and Authorities

Detect web spam
TrustRank

THANK YOU
