
Journal of Computational Information Systems 7:5 (2011) 1575-1581. Available at http://www.Jofcis.com

A Hierarchical Clustering Algorithm Based on Dynamic Programming for Categorical Sequences


Jiadong REN 1,2,*, Shiyuan CAO 1, Changzhen HU 2

1 College of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China

2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

* Corresponding author. Email address: jdren@ysu.edu.cn (Jiadong REN).

1553-9105 / Copyright 2011 Binary Information Press, May 2011

Abstract
Sequence mining has been attracting increasing attention. In this paper, a new clustering algorithm for categorical sequences is proposed. Because sequences may have unequal lengths, we introduce a similarity measure suited to data with categorical and sequential attributes. The measure is derived from regular sequence alignment and is based on the idea of dynamic programming: the relative distance between element pairs is used to compute the similarity value of two sequences. This sequence similarity measure is then applied within a traditional hierarchical clustering algorithm to cluster sequences. Using a splice dataset and synthetic datasets, we demonstrate the quality of the clusters generated by the proposed approach and the scalability of the algorithm.

Keywords: Clustering; Categorical Sequences; Dynamic Programming

1. Introduction

Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web logs, in which a sequential order exists between items. To analyze such sequence datasets, Hay et al. [1] presented a clustering algorithm that uses an edit distance to measure the similarity between sequences, while Wang and Zaiane [2] proposed a clustering method that measures similarity by sequence alignment. Sequence clustering has been widely applied in sequence mining [3]. For example, a sequence-based clustering algorithm for Web usage mining was proposed in [4], together with an evaluation framework in which algorithms are compared on whether clusters are correctly identified, using a replicated clustering approach.

As one of the traditional clustering algorithms, hierarchical clustering has been applied to sequence analysis. Two text clustering algorithms, named Clustering based on Frequent Word Sequences and Clustering based on Frequent Word Meaning Sequences, were proposed in [5]; a word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the test database. Two dynamic hierarchical algorithms, called dynamic hierarchical compact and dynamic hierarchical star, are proposed for document clustering in [6].



Both methods aim to construct a cluster hierarchy while dealing with dynamic data sets.

Dynamic programming, like the divide-and-conquer method, solves problems by combining the solutions to subproblems. The basic dynamic programming method for sequence alignment forms the core of pair-wise and multiple sequence and structure comparison, as well as of sequence database searching [7]. In this paper, a dynamic programming (DP) based algorithm for comparing categorical sequences is proposed. It is obtained by normalizing the sequence comparison score derived from the sequence alignment. The new sequence similarity measure is then applied in the traditional hierarchical clustering algorithm to generate sequence clusters.

The remainder of this paper is organized as follows. In Section 2, methods of sequence alignment using dynamic programming techniques are presented. In Section 3, our similarity measure is proposed and our sequence clustering algorithm is introduced. In Section 4, we analyze the efficiency of the proposed method through several experimental results. Finally, the paper is concluded in Section 5.

2. Dynamic Programming

In this section, we describe the comparison method for categorical sequences based on the dynamic programming (DP) algorithm. A sequence is an ordered collection of elements in which an element may occur multiple times. As a running example, let the two sequences be Si = {A, B, G, C, C, B, A, E, C, F} and Sj = {C, B, C, D, D, E, E, B, D, C, F, F}.

2.1. Sequence Alignment

To explicitly describe the word-order difference between a source and a target language, Tillmann and Ney [8] introduced an alignment concept in which a source position j is mapped to exactly one target position i (as shown in Fig. 1):
Regular alignment: j → i = a_j

Fig. 1 Regular Alignment

The alignment method for sequences should be symmetrical; we therefore propose a sequence alignment that has no direction, as shown in Fig. 2. This sequence alignment presents the correspondence relationship between the elements of the two sequences:

Sequence alignment: i -- j = a_ij

Definition 1. (Element Pair) Given two sequences Si and Sj, an element pair is composed of the i-th occurrence of an element e in sequence Si and the i-th occurrence of e in sequence Sj, where e ∈ Si ∩ Sj. For instance, for the example sequences of Section 2, the second C of Si and the second C of Sj form the element pair C2-C2.

2.2. Sequence Comparison Score

We define the element pair score (EPS) as the contribution of an element pair to the similarity of the two sequences. It is determined by the pair's position relative to the nearest previous element pair in the sequences. The search for the nearest previous element pair obeys the following rules.


When (x_pre, y_pre) is a previous element pair of (x, y), x_pre < x and y_pre < y must be satisfied. The nearest previous element pair is the one that minimizes the expression (x - x_pre)(y - y_pre).

Definition 2. (Element Pair Score) Suppose the position of the current element pair is (x, y) and the position of its nearest previous element pair is (x_pre, y_pre). The element pair score is computed by the following equation:

$$\mathrm{EPS}(x, y) = \frac{1}{(x - x_{pre})(y - y_{pre})}$$

If no previous element pair exists, (x_pre, y_pre) is taken to be (0, 0).
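For instance, taking the values of pair C2-C2 from Table 1 below: the current pair lies at positions (5, 3) and its nearest previous pair C1-C1 at (4, 1), so EPS = 1/((5 - 4)(3 - 1)) = 1/2.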

Definition 3. (Sequence Comparison Score) The sequence comparison score is defined as the sum of the EPSs of all element pairs:

$$\mathrm{SCS} = \sum_{(x, y) \in E} \mathrm{EPS}(x, y)$$

where E is the set of element pairs between the two sequences.

3. Clustering Algorithm for Categorical Sequences

3.1. Sequence Similarity Measure

The more common elements two sequences share, the more similar they are. To capture this, we define the similarity between sequences as follows.

Definition 4. (Sequence Similarity) Our similarity measure between two sequences Si and Sj is defined as:
$$Sim(S_i, S_j) = \frac{\mathrm{SCS}}{|S_i \cup S_j|}$$

Fig. 2 Sequence Alignment (the undirected correspondence between the common elements of Si = {A, B, G, C, C, B, A, E, C, F} and Sj = {C, B, C, D, D, E, E, B, D, C, F, F})

The intersection of the two sequences is, in effect, computed by the SCS. Dividing by the union size |Si ∪ Sj| is the scaling factor that keeps the similarity between 0 and 1; that is, the union is used to normalize the similarity measure. The union size is simply the total number of elements of the two sequences counted as a multiset, as illustrated in Fig. 3.

Fig. 3 The Union of the Two Sequences (elements appearing only in Si: A, G, A; common elements: B, C, C, B, E, C, F; elements appearing only in Sj: D, D, E, D, F; hence |Si ∪ Sj| = 3 + 7 + 5 = 15)

The similarity measure is symmetrical, that is, Sim(Si, Sj) = Sim(Sj, Si). Identical sequences have the largest similarity, 1, while the similarity is 0 when no element is shared between the two sequences. The method for calculating the similarity is shown in Algorithm 1.
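Since the paper's algorithms were implemented in Visual C++, a minimal C++ sketch of the union-size computation may be useful; the helper name unionSize and the char element type are illustrative assumptions, not part of the paper:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// |Si ∪ Sj| as a multiset union: for each distinct element, take the
// larger of its occurrence counts in the two sequences.
int unionSize(const std::vector<char>& si, const std::vector<char>& sj) {
    std::map<char, int> ci, cj;
    for (char e : si) ++ci[e];
    for (char e : sj) ++cj[e];
    int size = 0;
    for (const auto& kv : ci) size += std::max(kv.second, cj[kv.first]);
    for (const auto& kv : cj) if (ci.count(kv.first) == 0) size += kv.second;
    return size;
}
```

For the example sequences of Section 2 this returns 3 + 7 + 5 = 15, matching Fig. 3.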
Algorithm 1: getSim(Si, Sj)
Input: the sequences Si and Sj;
Output: the similarity measure of Si and Sj;
1:  int prex = 0, prey = 0, sum = 0; SCS = 0;
2:  for (int curx = 1; curx <= |Si|; curx++)
3:      for (int cury = prey + 1; cury <= |Sj|; cury++)
4:          if Sj.cury has no mark and Si.curx == Sj.cury
5:              search the nearest previous element pair of (curx, cury), denoted (prex, prey);
6:              SCS = SCS + EPS(curx, cury);
7:              mark Sj.cury as processed;
8:              sum++;
9:          end if
10:     end for
11: end for
12: return Sim = SCS / (|Si| + |Sj| - sum);

(Line 12 normalizes by |Si ∪ Sj| = |Si| + |Sj| - sum, in accordance with Definition 4, since sum counts the matched element pairs.)

Example. Compute the similarity between the two sequences Si and Sj of Section 2; the element pair scores are listed in Table 1.
Step 1. Compute the EPS for each element pair.
Step 2. Sum them to obtain the sequence comparison score: SCS = 1/4 + 1/4 + 1/2 + 1/5 + 1/9 + 1/4 + 1 ≈ 2.56.
Step 3. Sim(Si, Sj) = SCS / |Si ∪ Sj| = 2.56 / 15 ≈ 0.17.
Table 1 The EPS for the Element Pairs of Si and Sj

Element pair    Nearest previous EP    EPS
C1-C1           null                   1/4
B1-B1           null                   1/4
C2-C2           C1-C1                  1/2
B2-B2           C2-C2                  1/5
E-E             C2-C2                  1/9
C3-C3           E-E                    1/4
F-F             C3-C3                  1
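To make the procedure concrete, the following is a hedged C++ sketch of Algorithm 1 under our reading of it: equal elements are matched greedily from left to right, each matched pair is scored against its nearest previous pair, and the score is normalized by the union size (unionSize is the helper sketched above). The inner-loop bounds differ slightly from the pseudocode, but the sketch reproduces Table 1 and the worked example exactly:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

int unionSize(const std::vector<char>& si, const std::vector<char>& sj); // sketched in Section 3.1

double getSim(const std::vector<char>& si, const std::vector<char>& sj) {
    std::vector<std::pair<int, int>> pairs;    // matched element pairs, 1-based positions
    std::vector<bool> used(sj.size(), false);  // marks processed positions of Sj
    double scs = 0.0;
    for (int x = 1; x <= (int)si.size(); ++x) {
        for (int y = 1; y <= (int)sj.size(); ++y) {
            if (used[y - 1] || si[x - 1] != sj[y - 1]) continue;
            // Nearest previous pair: xpre < x, ypre < y, minimising
            // (x - xpre)(y - ypre); with no previous pair this degenerates
            // to x * y, i.e. (x - 0)(y - 0) as in Table 1.
            long best = (long)x * y;
            for (const auto& p : pairs)
                if (p.first < x && p.second < y)
                    best = std::min(best, (long)(x - p.first) * (y - p.second));
            scs += 1.0 / best;                 // EPS of this element pair
            pairs.push_back({x, y});
            used[y - 1] = true;
            break;                             // proceed with the next element of Si
        }
    }
    return scs / unionSize(si, sj);            // Definition 4
}
```

Run on Si = ABGCCBAECF and Sj = CBCDDEEBDCFF, this returns 2.56/15 ≈ 0.17, as in the example.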

3.2. The Clustering Algorithm

In our approach, we use an agglomerative hierarchical clustering algorithm for clustering sequences. Initially, each object is assigned to its own cluster; pairs of clusters are then repeatedly merged until a certain stopping criterion is met. The criterion function used in this paper is:

Maximize
$$Cf = \sum_{r=1}^{k} \frac{1}{n_r} \sum_{S_i, S_j \in C_r} Sim(S_i, S_j),$$

where n_r is the number of sequences in cluster C_r and k is the number of clusters.

Consider an n-sequence dataset and the clustering solution that has been computed after performing l merging steps. This solution contains exactly n-l clusters, since each merging step reduces the number of clusters by one. Given this (n-l)-way solution, the pair of clusters selected to be merged next is the one that leads to the (n-l-1)-way solution optimizing the criterion function. That is, each of the (n-l)(n-l-1)/2 possible merges is evaluated, and the one that leads to a clustering solution with the maximum value of the criterion function is selected.


Depending on the desired solution, this process continues until either only k clusters remain or the entire agglomerative tree has been obtained. Our proposed clustering algorithm is presented in Algorithm 2.
Algorithm 2: DPSClustering
Input: the set of sequences S, the number of clusters k;
Output: the set of clusters Clusters;
1:  for each Si in the database S
2:      Clusters.Ci ← Si;
3:  end for
4:  int clusterNum = |S|; Cf = 0;
5:  while clusterNum > k do
6:      m = n = 1;
7:      for each pair Ci, Cj ∈ Clusters
8:          compute the value of the criterion function after merging Ci and Cj:
            Cf' = Cf - Σ_{r∈{i,j}} (1/n_r) Σ_{Sa,Sb∈Cr} getSim(Sa, Sb) + (1/(n_i + n_j)) Σ_{Sa,Sb∈Ci∪Cj} getSim(Sa, Sb);
9:          if Cf' > Cf
10:             Cf = Cf'; m = i; n = j;
11:         end if
12:     end for
13:     Cnew ← merge(Cm, Cn); replace Cm and Cn in Clusters by Cnew; clusterNum--;
14: end while
15: return Clusters;
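For completeness, here is a condensed C++ sketch of the agglomerative loop of Algorithm 2, assuming the pairwise similarities have been precomputed into a matrix with getSim; the 1/n_r weighting follows our reconstruction of the criterion function above, and the helper names are illustrative:

```cpp
#include <vector>

using Cluster = std::vector<int>;  // indices into the sequence set S

// Contribution of one cluster to the criterion function:
// (1/n_r) * sum of pairwise similarities inside the cluster.
double clusterScore(const Cluster& c, const std::vector<std::vector<double>>& sim) {
    double s = 0.0;
    for (int a = 0; a < (int)c.size(); ++a)
        for (int b = a + 1; b < (int)c.size(); ++b)
            s += sim[c[a]][c[b]];
    return s / (double)c.size();
}

std::vector<Cluster> dpsClustering(const std::vector<std::vector<double>>& sim, int k) {
    std::vector<Cluster> clusters;
    for (int i = 0; i < (int)sim.size(); ++i) clusters.push_back({i});  // singletons
    while ((int)clusters.size() > k) {
        int m = 0, n = 1;
        double bestGain = -1e300;
        for (int i = 0; i < (int)clusters.size(); ++i)
            for (int j = i + 1; j < (int)clusters.size(); ++j) {
                Cluster merged = clusters[i];
                merged.insert(merged.end(), clusters[j].begin(), clusters[j].end());
                // Change in the criterion if Ci and Cj are replaced by Ci ∪ Cj.
                double gain = clusterScore(merged, sim)
                            - clusterScore(clusters[i], sim)
                            - clusterScore(clusters[j], sim);
                if (gain > bestGain) { bestGain = gain; m = i; n = j; }
            }
        clusters[m].insert(clusters[m].end(), clusters[n].begin(), clusters[n].end());
        clusters.erase(clusters.begin() + n);  // Cm now plays the role of Cnew
    }
    return clusters;
}
```

This evaluates all (n-l)(n-l-1)/2 candidate merges at each step, as described in Section 3.2; a production implementation would cache cluster scores rather than recompute them.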

4. Experimental Results

All of our experiments are conducted on a PC with an Intel Pentium IV 1.6 GHz processor and 2 GB of DDR RAM, running the Windows XP Professional Edition operating system. The algorithms are implemented in Microsoft Visual C++. We compare the performance of our algorithm with that of an edit distance (ED) based approach.

4.1. Clustering Quality Evaluation

We use a splice dataset to examine the quality of our algorithm. The splice dataset is distributed as part of the UCI KDD Archive [9]. Each sequence in the dataset is assigned a class label, either exon/intron boundary (EI) or intron/exon boundary (IE); the dataset contains 767 EI and 768 IE sequences. Table 2 shows the results of applying the ED-based algorithm and our clustering algorithm to the splice dataset.


Because the edit distance seeks a global alignment while ignoring local alignments, the cluster quality produced by the ED-based method is very poor. As Table 2 illustrates, in the ED-based algorithm the majority of sequences fall into cluster 1, whereas in our proposed algorithm one cluster contains most of the EIs and the other contains the majority of the IEs. This improvement in clustering quality can be attributed to our new similarity measure.
Table 2 Clustering Quality Comparison

                 DPSClustering          ED
Cluster No.      EI        IE           EI        IE
1                565       233          616       658
2                202       535          151       110

4.2. Scalability Test

We evaluate the scalability of our algorithm with respect to the number of sequences and the number of clusters. Four different datasets are generated using the synthetic data generator GEN from the Quest project [10]. Each dataset is a market-basket database, and the transaction size parameter follows a Poisson distribution with an average value of 15. Of the sequences, 1% are outliers, while the others belong to one of five clusters. As shown in Fig. 4(a), the response time of the proposed algorithm scales linearly with the number of sequences. To evaluate scalability with respect to the number of clusters, we generated four synthetic datasets, each consisting of 100,000 sequences and containing 10, 20, 30, and 40 clusters, respectively. Fig. 4(b) shows the execution results.
Fig. 4 Scalability Evaluation: (a) execution time (s) versus the number of sequences (1000-5000); (b) execution time (s) versus the number of clusters (10, 20, 30, 40)

As shown, the execution time also increases linearly with the number of clusters. The response time of the proposed algorithm thus grows only linearly in both the dataset size and the number of clusters.

5. Conclusion

Sequence clustering has become a research focus. To cluster categorical sequences, we proposed a hierarchical clustering algorithm. A similarity measure is introduced for sequences with


categorical and sequential attributes. Derived from regular sequence alignment, the similarity value is computed according to the relative distance between element pairs of the two sequences. Based on the new similarity measure, the traditional hierarchical clustering algorithm is adapted to cluster categorical sequences. The experimental results demonstrate that our algorithm achieves higher clustering quality and good scalability.

Acknowledgement

This work is supported by the National High Technology Research and Development Program ("863" Program) of China (No. 2009AA01Z433) and the Natural Science Foundation of Hebei Province, P.R. China (No. F2010001298). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

References
[1] B. Hay, G. Wets, K. Vanhoof. Clustering navigation patterns on a website using a sequence alignment method. In: Proc. 2001 Internat. Joint Conf. on Artificial Intelligence, 2001.
[2] W. Wang, O.R. Zaiane. Clustering web sessions by sequence alignment. In: Proc. 13th Internat. Workshop on Database and Expert Syst. Applications, France, 2002.
[3] L. Szilágyi, L. Medvés, S.M. Szilágyi. A modified Markov clustering approach to unsupervised classification of protein sequences. Neurocomputing (2010), doi:10.1016/j.neucom.2010.02.023.
[4] S. Park, N.C. Suresh, B.K. Jeong. Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm. Data & Knowledge Engineering, pages 512-543, 2008.
[5] Y. Li, S.M. Chung, J.D. Holt. Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering, 64: 381-404, 2008.
[6] R. Gil-García, A. Pons-Porrata. Dynamic hierarchical algorithms for document clustering. Pattern Recognition Letters, 31: 469-477, 2010.
[7] A. Šali, T.L. Blundell. Definition of general topological equivalence in protein structures: A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. Journal of Molecular Biology, 212(2): 403-428, 1990.
[8] C. Tillmann, H. Ney. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1): 97-133, 2003.
[9] C.L. Blake, C.J. Merz. UCI Repository of Machine Learning Databases, 1998.
[10] R. Agrawal, M. Mehta, J. Shafer, R. Srikant, A. Arning, T. Bollinger. The Quest data mining system. In: Proc. 2nd Internat. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996.
