
PRESENTED BY: SWAGATIKA SWAIN 211CS3302

GRAPH: a set of nodes, pairs of which may be connected by edges.
STRUCTURE MINING: the process of finding and extracting useful information from semi-structured data sets.
Graph mining is a special case of structure mining. It can also be thought of as detecting patterns in graphs.

Graphs have become increasingly important in modeling complicated structures such as circuits, chemical compounds, protein structures, biological networks, and social networks. Hence, with the increasing demand for analysis of large amounts of data, graph mining has become an active and important theme in data mining. Among the various kinds of graph patterns, frequent substructures are the most basic patterns that can be discovered in a collection of graphs.

Our objective here is to find frequent substructures in a collection of graphs. They are useful for characterizing graph sets, discriminating between different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating similarity search in graph databases. Recent studies have developed several graph mining methods and applied them to the discovery of interesting patterns in various applications.

For example, there have been reports of the discovery of active chemical structures in HIV screening data sets by contrasting the support of frequent graphs between different classes.

[Figure: a sample graph data set of chemical structures over the atoms S, C, N and O, and the frequent graphs found in it: one over the atoms S, C, C, O, N with frequency = 2 and one over the atoms C, C, N with frequency = 3.]

Vertex set of a graph g: V(g).
Edge set of a graph g: E(g).
A label function, L, maps a vertex or an edge to a label.
A graph g is a subgraph of another graph g' if there exists a subgraph isomorphism from g to g'.
Given a graph data set D = {G1, G2, ..., Gn}, support(g) (or frequency(g)) is the percentage of graphs in D of which g is a subgraph.
A frequent subgraph is a graph whose support is no less than a minimum support threshold, min_sup.
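As a minimal sketch of these definitions, the Python below computes support(g) over a small graph set by brute-force subgraph isomorphism. The graph representation (a dict of vertex labels plus a dict of labeled undirected edges), the names is_subgraph and support, and the toy data are assumptions made for illustration. The exhaustive search is only practical for very small graphs.

```python
from itertools import permutations

def is_subgraph(pattern, graph):
    """Return True if `pattern` maps into `graph` with all labels preserved."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    # Undirected lookup: store every edge of the target graph in both directions.
    lookup = {}
    for (u, v), label in g_edges.items():
        lookup[(u, v)] = label
        lookup[(v, u)] = label
    p_vertices = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_vertices):
            continue
        # Every pattern edge must exist in the graph with the same label.
        if all(lookup.get((mapping[u], mapping[v])) == label
               for (u, v), label in p_edges.items()):
            return True
    return False

def support(pattern, database):
    """Fraction of graphs in `database` that contain `pattern` as a subgraph."""
    return sum(is_subgraph(pattern, g) for g in database) / len(database)

# Hypothetical data set: (vertex labels, labeled undirected edges).
g1 = ({0: "C", 1: "C", 2: "N", 3: "S"}, {(0, 1): "-", (1, 2): "-", (3, 0): "-"})
g2 = ({0: "C", 1: "C", 2: "N"}, {(0, 1): "-", (1, 2): "-"})
g3 = ({0: "C", 1: "N", 2: "O"}, {(0, 1): "-", (1, 2): "-"})
D = [g1, g2, g3]

pattern = ({0: "C", 1: "C", 2: "N"}, {(0, 1): "-", (1, 2): "-"})
min_sup = 0.6
print(support(pattern, D) >= min_sup)  # the C-C-N path occurs in g1 and g2
```

Here the match is non-induced: the data graph may have extra edges between matched vertices, which is the usual convention in frequent subgraph mining.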

Frequent substructure discovery consists of two steps. In the first step, we generate frequent substructure candidates. In the second step, the frequency of each candidate is checked. Most studies on frequent substructure discovery focus on optimizing the first step, because the second step involves a subgraph isomorphism test whose computational complexity is excessively high (the problem is NP-complete). There are two basic approaches to this problem: the Apriori-based approach and the pattern-growth approach.

The Apriori-based approach is similar to Apriori-based frequent itemset mining. It starts with graphs of small size and proceeds in a bottom-up manner, generating candidates that have an extra vertex, edge, or path. It adopts a level-wise mining methodology: the frequency of each new candidate is checked, and those found frequent are used to generate larger candidates in the next round. E.g., merging (abc) and (bcd) gives the new candidate (abcd).

S_1 = frequent single elements in the data set;
Call AprioriGraph(D, min_sup, S_1);

Procedure AprioriGraph(D, min_sup, S_k)
1) S_{k+1} ← ∅;
2) for each frequent g_i ∈ S_k do
3)   for each frequent g_j ∈ S_k do
4)     for each size-(k+1) graph g formed by the merge of g_i and g_j do
5)       if g is frequent in D and g ∉ S_{k+1} then
6)         insert g into S_{k+1};
7) if S_{k+1} ≠ ∅ then
8)   AprioriGraph(D, min_sup, S_{k+1});
9) return;
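The level-wise loop can be sketched on a deliberately simplified stand-in. In the Python below, a pattern is a string of vertex labels (a simple labeled path), "contains" means substring matching rather than subgraph isomorphism, and two size-k patterns merge when they overlap in k-1 labels, as in the (abc) and (bcd) example. The function name apriori_paths and the toy database are assumptions, and this is an illustration of the control flow, not an implementation of AGM or FSG.

```python
def apriori_paths(database, min_sup):
    """Level-wise mining of frequent label paths (toy stand-in for graphs)."""
    def support(pattern):
        # Fraction of database strings that contain the pattern.
        return sum(pattern in s for s in database) / len(database)

    # S_1: frequent single elements.
    level = sorted({ch for s in database for ch in s if support(ch) >= min_sup})
    frequent = []
    while level:                      # stop when S_{k+1} is empty
        frequent.extend(level)
        next_level = set()            # S_{k+1}
        for gi in level:
            for gj in level:
                # Merge only if gi and gj share a common (k-1)-length core.
                if gi[1:] == gj[:-1]:
                    candidate = gi + gj[-1]
                    if support(candidate) >= min_sup:
                        next_level.add(candidate)
        level = sorted(next_level)
    return frequent

result = apriori_paths(["abcd", "abce", "xbcd"], min_sup=0.6)
print(result)  # ['a', 'b', 'c', 'd', 'ab', 'bc', 'cd', 'abc', 'bcd']
```

The candidate (abcd) is generated from (abc) and (bcd) but rejected, because it occurs in only one of the three strings.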

Apriori-based algorithms include AGM, FSG, and a path-join method. AGM uses a vertex-based candidate generation method. Here the size of a graph is the number of vertices in it. Two size-k frequent graphs are joined only if they share the same size-(k-1) subgraph.

FSG adopts an edge-based candidate generation strategy. Here the size of a graph is the number of edges in it. Two size-k patterns are merged if and only if they share the same subgraph having k-1 edges, which is called the core. The new candidate includes the core and the two additional edges from the size-k patterns.

In the edge-disjoint path method, graphs are classified by the number of disjoint paths they have. Two paths are edge-disjoint if they do not share any common edge. A substructure pattern with k+1 disjoint paths is generated by joining substructures with k disjoint paths. Most Apriori-based algorithms have considerable overhead when joining two size-k frequent substructures to generate size-(k+1) candidates. The pattern-growth methodology is used to avoid this overhead.

While the Apriori-based approach has to use BFS for searching, the pattern-growth approach can use either BFS or DFS. A graph g is extended by adding a new edge e. The newly formed graph is denoted by g ◇x e. If e introduces a new vertex, the new graph is denoted by g ◇xf e; otherwise, by g ◇xb e. This algorithm is simple but not efficient, because the same graph can be discovered many times. A graph that is discovered a second time is called a duplicate graph.

S ← ∅;
Call PatternGrowthGraph(g, D, min_sup, S);

Procedure PatternGrowthGraph(g, D, min_sup, S)
1) if g ∈ S then return;
2) else insert g into S;
3) scan D once, find all the edges e such that g can be extended to g ◇x e;
4) for each frequent g ◇x e do
5)   PatternGrowthGraph(g ◇x e, D, min_sup, S);
6) return;

gSpan adopts DFS to traverse a graph. One graph can have many DFS trees. gSpan works on labeled simple graphs, i.e. graphs whose vertices and edges both carry labels. The visiting sequence of the vertices forms a linear order. We use subscripts to record this order, where i < j means that v_i is visited before v_j. A graph G subscripted with a DFS tree T is written as G_T. The starting vertex is called the root, and the last visited vertex is called the rightmost vertex. The straight path from the root to the rightmost vertex is called the rightmost path.

The gSpan algorithm is a more sophisticated extension of PatternGrowth. The new method restricts the extension as follows:
A new edge e can be added between the rightmost vertex and another vertex on the rightmost path (backward extension).
A new edge e can introduce a new vertex and connect it to a vertex on the rightmost path (forward extension).
Both kinds of extension are called rightmost extension, denoted by G ◇r e.
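A minimal sketch of this restriction: given the rightmost path and the existing edges of a subscripted graph, the Python below lists the vertex pairs at which a rightmost extension may attach. The function name, the edge representation as (i, j) subscript pairs, and the four-vertex example are assumptions for illustration. Labels are left out, so each pair stands for a family of candidate edges, one per label combination.

```python
def rightmost_extensions(rightmost_path, num_vertices, edges):
    """Return (backward, forward) candidate vertex pairs for rightmost extension.

    rightmost_path: vertex subscripts from the root to the rightmost vertex.
    num_vertices:   number of vertices already in the graph (next new subscript).
    edges:          existing edges as (i, j) subscript pairs.
    """
    rightmost = rightmost_path[-1]
    present = {frozenset(e) for e in edges}
    # Backward: rightmost vertex to another vertex on the rightmost path,
    # skipping pairs that are already connected (simple graphs only).
    backward = [(rightmost, v) for v in rightmost_path[:-1]
                if frozenset((rightmost, v)) not in present]
    # Forward: a new vertex attached to any vertex on the rightmost path.
    forward = [(v, num_vertices) for v in reversed(rightmost_path)]
    return backward, forward

# Hypothetical subscripted graph: DFS tree edges (0,1), (1,2), (1,3) plus (2,0).
edges = [(0, 1), (1, 2), (2, 0), (1, 3)]
backward, forward = rightmost_extensions([0, 1, 3], 4, edges)
print(backward)  # [(3, 0)]
print(forward)   # [(3, 4), (1, 4), (0, 4)]
```

Vertex 2 is not on the rightmost path, so no extension may touch it. This is what cuts down the number of duplicate graphs compared with plain PatternGrowth.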

Our task now is to find the base subscripting of a graph and to conduct rightmost extension only on that subscripting.

Each subscripted graph can be transformed into an edge sequence, called a DFS code. Using DFS codes, we can build an order among these sequences. The goal is to select the subscripting that generates the minimum sequence as the base subscripting. There are two kinds of order in this transformation process: 1) edge order and 2) sequence order.

Edge order: the DFS tree defines the discovery order of the forward edges. In the example, the forward edges are visited in the order (0,1), (1,2), (1,3). Now we put the backward edges into the order: given a vertex v, all of its backward edges should appear just before its forward edges. If v does not have any forward edge, we put its backward edges after the forward edge in which v is the second vertex. The complete sequence for the graph is (0,1), (1,2), (2,0), (1,3).
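The placement rule can be sketched in a few lines of Python: walk the forward edges in discovery order and, right after the forward edge that discovers a vertex j, emit the backward edges that start at j. The function name and the list-of-pairs representation are assumptions for illustration.

```python
def dfs_edge_sequence(forward_edges, backward_edges):
    """Interleave backward edges into the forward-edge discovery order.

    forward_edges:  (i, j) pairs with i < j, in DFS discovery order.
    backward_edges: (i, j) pairs with i > j.
    """
    sequence = []
    for i, j in forward_edges:
        sequence.append((i, j))
        # Backward edges of j follow the forward edge whose second vertex is j,
        # so they precede any forward edge that starts at j.
        sequence.extend(sorted(e for e in backward_edges if e[0] == j))
    return sequence

sequence = dfs_edge_sequence([(0, 1), (1, 2), (1, 3)], [(2, 0)])
print(sequence)  # [(0, 1), (1, 2), (2, 0), (1, 3)]
```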

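Based on this ordering, the three subscriptings of the example graph give three different DFS codes: γ0, γ1, and γ2. An edge is represented by a 5-tuple, (i, j, l_i, l_(i,j), l_j), where l_i and l_j are the labels of v_i and v_j and l_(i,j) is the label of the edge connecting them. There is a one-to-one mapping between a subscripted graph and its DFS code.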

edge   γ0             γ1             γ2
e0     (0,1,X,a,X)    (0,1,X,a,X)    (0,1,Y,b,X)
e1     (1,2,X,a,Z)    (1,2,X,b,Y)    (1,2,X,a,X)
e2     (2,0,Z,b,X)    (1,3,X,a,Z)    (2,3,X,b,Z)
e3     (1,3,X,b,Y)    (3,0,Z,b,X)    (3,1,Z,a,X)

Sequence order: the order among edge sequences. Since a single graph has several DFS codes, we build an order among these codes and select one code to represent the graph. Since we are dealing with labeled graphs, the label information should be one of the ordering factors: if two edges tie on their subscripts (i, j), the labels of the vertices and the edge are used to break the tie. Priority is given in the order l_i, l_(i,j), l_j.

Accordingly, we have γ0 < γ1 < γ2. This type of ordering is called DFS lexicographic ordering.
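The ordering of the three codes in the table can be checked with a short Python sketch. It relies on a simplification that is an assumption of this example: plain tuple comparison agrees with the DFS lexicographic order whenever the first edges that differ share the same subscripts (i, j), which is the case for all three pairs here. The full order also has rules for comparing edges with different subscripts, which this sketch does not implement.

```python
# The three DFS codes from the table, one 5-tuple (i, j, l_i, l_ij, l_j) per edge.
code0 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Z"),
         (2, 0, "Z", "b", "X"), (1, 3, "X", "b", "Y")]
code1 = [(0, 1, "X", "a", "X"), (1, 2, "X", "b", "Y"),
         (1, 3, "X", "a", "Z"), (3, 0, "Z", "b", "X")]
code2 = [(0, 1, "Y", "b", "X"), (1, 2, "X", "a", "X"),
         (2, 3, "X", "b", "Z"), (3, 1, "Z", "a", "X")]

def first_difference(a, b):
    """Index of the first edge at which two equal-length codes differ, or None."""
    for k, (ea, eb) in enumerate(zip(a, b)):
        if ea != eb:
            return k
    return None

# code0 vs code1 differ at edge 1: same (1, 2), and label "a" < "b".
print(first_difference(code0, code1), code0 < code1)  # 1 True
# code1 vs code2 differ at edge 0: same (0, 1), and label "X" < "Y".
print(first_difference(code1, code2), code1 < code2)  # 0 True

minimum = min([code2, code0, code1])
print(minimum is code0)  # True
```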

Based on the DFS lexicographic ordering, the minimum DFS code of a given graph G is written as dfs(G). The subscripting that generates the minimum DFS code is called the base subscripting. Two graphs G and G' are isomorphic if and only if dfs(G) = dfs(G').

S ← ∅;
Call gSpan(s, D, min_sup, S);

Procedure gSpan(s, D, min_sup, S)
1) if s ≠ dfs(s) then
2)   return;
3) insert s into S;
4) set C to ∅;
5) scan D once, find all the edges e such that s can be rightmost extended to s ◇r e; insert s ◇r e into C and count its frequency;
6) sort C in DFS lexicographic order;
7) for each frequent s ◇r e in C do
8)   gSpan(s ◇r e, D, min_sup, S);
9) return;

This algorithm implements a depth-first search. Breadth-first search works too, but DFS consumes less memory than BFS, while their performance is almost the same.

Graphs represent a more general class of structures than sets, sequences, lattices, and trees. Hence, graph mining is used to mine frequent graph patterns and to perform characterization, discrimination, classification, and cluster analysis over large graph data sets. Graph mining has a broad spectrum of applications in chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, and Web analysis.

Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques, 1st ed. Morgan Kaufmann Publishers, pp. 535-545.
Yan, X. and Han, J. 2002. gSpan: Graph-based substructure pattern mining. Technical report.