
2010 International Conference on Computational Intelligence and Security

Online segmentation algorithm for time series based on BIRCH clustering features
Yu Tu, Yubao Liu, and Zhijie Li
Department of Computer Science of Sun Yat-Sen University Guangzhou, China tuyukuangren@163.com, liuyubao@mail.sysu.edu.cn, gdgzlzj@gmail.com
Abstract—Online time series representation is one of the important problems in time series data mining. Adjacent points of a time series are inherently dependent and hence have similar clustering features. Based on BIRCH clustering features, we present a new algorithm, OSBC, for online time series segmentation. Using clustering features, OSBC can easily find the changing patterns of a time series and achieves better segmentation results. The time complexity of OSBC is linear and its space complexity is also small. Experimental results on a time series benchmark show the effectiveness of our method.

Keywords—time series; BIRCH clustering features; online segmentation algorithm

I. INTRODUCTION

Time series are widely employed in commerce, economics, geology, bio-medicine, space exploration and many other scientific and industrial fields. However, time series data usually have high dimensionality, noise and volatility. How to manage and use such data effectively, and how to discover hidden rules and knowledge from them, is an interesting problem. Mining raw time series directly is relatively difficult, time-consuming and inefficient; sometimes the accuracy and reliability of the mining results degrade. A pattern representation of a time series is an abstract summary of the series and a high-level feature description of it. Many pattern representations of time series have been proposed, including the Discrete Fourier Transform [1], Discrete Wavelet Transform [2], Piecewise Aggregate Approximation [3], Piecewise Linear Approximation [4], Adaptive Piecewise Constant Approximation [5], and Symbolic Aggregate Approximation [6]. Adjacent points of a time series are inherently dependent and hence have similar clustering features. In this paper we propose OSBC, an online segmentation algorithm for time series based on BIRCH clustering features. OSBC views each time subsequence as a cluster. Using cluster features, OSBC can easily find the changing patterns of a time series and achieve better segmentation results. The time complexity of OSBC is linear and its space complexity is also small. Experimental results on a time series benchmark show the effectiveness of our method. Other related work includes discord detection in time series streams [9], time series subsequence matching [10], time series privacy [11] and online text data clustering [12].

The rest of the paper is organized as follows. The basic background concepts are introduced in Section II. The OSBC algorithm is given in Section III. The experimental results are reported in Section IV, and conclusions are drawn in Section V.

II. BASIC BACKGROUNDS
978-0-7695-4297-3/10 $26.00 © 2010 IEEE DOI 10.1109/CIS.2010.19

A. Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering [7], using a bottom-up strategy, starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or certain termination conditions are satisfied. Fig. 1 shows the application of AGNES (AGglomerative NESting). Initially, AGNES places each object into a cluster of its own. The clusters are then merged step by step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters. This is a single-linkage approach, in which the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The merging process repeats until all of the objects are eventually merged into one cluster.

Figure 1. Agglomerative hierarchical clustering on data objects {a, b, c, d, e}
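The merging process of Fig. 1 can be sketched in a few lines of code. This is an illustrative sketch only: the 1-D positions assigned to the objects a–e are assumptions chosen so that the merge order matches the figure, not values from the paper.

```python
# Minimal single-linkage agglomerative clustering (AGNES-style) sketch.
# The 1-D positions of objects a..e are illustrative assumptions.
points = {"a": 1.0, "b": 1.2, "c": 3.0, "d": 3.1, "e": 6.0}

def single_link_distance(c1, c2):
    # Distance between the closest pair of objects from different clusters.
    return min(abs(points[p] - points[q]) for p in c1 for q in c2)

clusters = [{name} for name in points]  # each object starts in its own cluster
merge_order = []
while len(clusters) > 1:
    # Find the pair of clusters with minimum single-linkage distance.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] | clusters[j]
    merge_order.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

With these positions, {c, d} merge first, then {a, b}, then the two pairs, and finally e joins, mirroring the bottom-up dendrogram of Fig. 1.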

B. BIRCH Algorithm
The BIRCH algorithm [7] integrates hierarchical clustering at the initial micro-clustering stage with other clustering methods, such as iterative partitioning, at the later macro-clustering stage. BIRCH introduces two concepts, the clustering feature and the clustering feature tree, which are used to summarize cluster representations. Given n d-dimensional data objects or points in a cluster, the centroid x0 and radius R of the cluster are defined as follows.

x_0 = \frac{\sum_{i=1}^{n} x_i}{n}

R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}
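The centroid, the radius, and the additivity of clustering features described in this section can be illustrated concretely. The sketch below is not from the paper; it uses hypothetical 1-D data and the algebraic identity R = sqrt(SS/n − (LS/n)²), which follows from the definitions above.

```python
import math

def cf(points):
    # Clustering feature of a 1-D cluster: (n, linear sum, square sum).
    n = len(points)
    ls = sum(points)
    ss = sum(x * x for x in points)
    return (n, ls, ss)

def centroid(cf_vec):
    n, ls, _ = cf_vec
    return ls / n

def radius(cf_vec):
    # R = sqrt(SS/n - (LS/n)^2): average distance from members to the centroid.
    n, ls, ss = cf_vec
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def add(cf1, cf2):
    # CFs are additive: merging two disjoint clusters just sums the components.
    return tuple(a + b for a, b in zip(cf1, cf2))

c1 = cf([1.0, 2.0, 3.0])
c2 = cf([7.0, 8.0])
merged = add(c1, c2)
```

The additivity check is that `merged` equals the CF computed directly from the union of the two point sets, so centroid and radius of the merged cluster are available without revisiting the raw points.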

Here, R is also the average distance from member objects to the centroid. A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster of objects. The CF of a cluster is defined as CF = (n, LS, SS), where n is the number of points in the cluster, LS is the linear sum of the n points, and SS is the square sum of the data points. A clustering feature is essentially a summary of the statistical information for the given cluster. Clustering features are additive. For example, suppose that we have two disjoint clusters, C1 and C2, with clustering features CF1 and CF2, respectively. The clustering feature of the cluster formed by merging C1 and C2 is simply CF1 + CF2. Clustering features are sufficient for calculating all of the measurements needed to make clustering decisions.

C. Sliding Window Algorithm (SW)
The Sliding Window (SW) algorithm [8] works as follows. It first anchors the left endpoint of a potential segment at the first data point of the time series and then attempts to approximate the data to the right with increasingly long segments. When, at some point i, the error of the potential segment exceeds the user-specified threshold, the points from the anchor to i−1 are transformed into a segment and the anchor is moved to location i. The process repeats until the entire time series has been transformed into a piecewise linear approximation. SW is attractive because of its simplicity and intuitiveness and, in particular, because it is an online algorithm. However, SW lacks a global view of the time series features and may produce poor results.

III. THE PROPOSED OSBC ALGORITHM

A. Basic Definitions
Definition 1. Given a time series X = <x1, x2, …, xn>, X can be partitioned into k subsequences <X1, X2, …, Xk> by k+1 dividing points (including the start point and the end point).
Definition 2. Given a time series X = <x1, x2, …, xn> and a parameter C, we can place C adjacent points into a cluster; X is then divided into n/C subsequences <X1, X2, …, Xn/C> with a window of size C. This process is called micro clustering.
Definition 3. Each subsequence Xi (1 ≤ i ≤ n/C) can be treated as a cluster. The clustering feature CFi, centroid x_{i,0} and radius Ri of the subsequence Xi are defined as follows:

CF_i = (n_i, LS_i, SS_i)

x_{i,0} = \frac{\sum_{j=t_i}^{t_{i+1}-1} x_j}{n_i} = \frac{LS_i}{n_i}

R_i = \sqrt{\frac{\sum_{j=t_i}^{t_{i+1}-1} (x_j - x_{i,0})^2}{n_i}} = \sqrt{\frac{SS_i}{n_i} - \left(\frac{LS_i}{n_i}\right)^2}

Here, ni is the number of points in the subsequence, LSi is the linear sum of the ni points, SSi is the square sum of the data points, ti is the start subscript of subsequence Xi, and t_{i+1} − 1 is the end subscript of subsequence Xi. A clustering feature is essentially a summary of the statistical information for the given subsequence. Clustering features are additive. For example, suppose that we have two disjoint adjacent subsequences, X1 and X2, with clustering features CF1 and CF2, respectively. The clustering feature of the subsequence formed by merging X1 and X2 is simply CF1 + CF2.

After micro clustering, a time series X = <x1, x2, …, xn> is divided into n/C subsequences <X1, X2, …, Xn/C> with clustering features <CF1, CF2, …, CFn/C>. The adjacent subsequences are then merged step by step. Adjacent subsequences Xi and Xi+1 can be merged if the distance between them is smaller than the maximum of their radii Ri and Ri+1, where the distance is defined as D(Xi, Xi+1) = |x_{i,0} − x_{i+1,0}|, the absolute difference of their centroids. The merging process repeats until the online time series ends, or until it reaches the last subsequence for an offline time series. This process is called clustering subsequences. When clustering subsequences is finished, the subsequences <X1, X2, …, Xn/C> have been merged into k clusters <X1, X2, …, Xk>. Adjacent subsequences Xj and Xj+1 may then be merged if the distance between them is the minimum over all distances between adjacent subsequences; this merging repeats until certain conditions are satisfied, and is called merging subsequences. Clustering subsequences and merging subsequences are the two key processes of the OSBC algorithm.

B. The Description of the OSBC Algorithm
The OSBC algorithm works as follows. OSBC first anchors the left subsequence of a potential cluster at the first subsequence and attempts to cluster it with its adjacent subsequence to the right. When, at some subsequence i, the distance between the potential cluster and subsequence i exceeds the maximum of their radii, the subsequences from the anchor to i−1 are merged into one cluster and the anchor is moved to subsequence i. The process repeats until the online time series ends, or until the anchor reaches the last subsequence for an offline time series. Suppose the subsequences <X1, X2, …, Xn/C> have been merged into k clusters when clustering subsequences is finished. If the number of clusters k is greater than the user-specified threshold K, the algorithm runs the merging-subsequences process. Finally, each cluster is transformed into a piecewise linear approximation by its mean. In short, OSBC first calls the clustering-subsequences process and then the merging-subsequences process. The OSBC algorithm is described in Fig. 2, where X is a given time series and C and K are user-specified parameters. Clustering_Subsequence(X, C) corresponds to the clustering-subsequences process: given a time series X = <x1, x2, …, xn>, it merges the subsequences into k clusters <X1, X2, …, Xk>. The result, Init_CF, holds the clustering features of these clusters.
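As a worked illustration of the merging test, consider three adjacent windows of size C = 4 from a hypothetical series. This sketch is an assumption-laden reading of the paper: D(Xi, Xi+1) is taken as the absolute difference of the subsequence centroids, and the data values are invented for the example.

```python
import math

def cf_of(points):
    # Clustering feature of a subsequence: (n, linear sum, square sum).
    return (len(points), sum(points), sum(x * x for x in points))

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def should_merge(cf_a, cf_b):
    # Merge when the centroid distance is below the larger radius.
    d = abs(centroid(cf_a) - centroid(cf_b))
    return d < max(radius(cf_a), radius(cf_b))

# Hypothetical adjacent windows (C = 4).
x1 = cf_of([1.0, 1.2, 0.9, 1.1])   # hovers around 1
x2 = cf_of([1.1, 0.8, 1.0, 1.2])   # still around 1 -> should merge with x1
x3 = cf_of([4.0, 4.2, 3.9, 4.1])   # level jump -> should start a new cluster

merged = tuple(a + b for a, b in zip(x1, x2)) if should_merge(x1, x2) else None
```

The first two windows merge (their centroids differ by far less than their radii), while the jump to the third window fails the test, so a new cluster is anchored there, which is exactly the change-point behaviour the algorithm exploits.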

Merging_Subsequence(Init_CF, K) corresponds to the merging-subsequences process: the subsequences <X1, X2, …, Xk> are merged into K clusters. The result, CF, holds the clustering features of the final clusters and is saved as a list. Each cluster can then be transformed into a piecewise linear approximation by its mean using CF.

Algorithm 1: OSBC(X, C, K)
Inputs: time series X, parameters C and K
Output: CF
Method:
    Init_CF = Clustering_Subsequence(X, C)
    If (length(Init_CF) > K)
        CF = Merging_Subsequence(Init_CF, K)
    Else
        CF = Init_CF
    End

Figure 2. The description of the OSBC algorithm

The Clustering_Subsequence algorithm is shown in Fig. 3. Given a time series X = <x1, x2, …, xn>, suppose it can be divided into n/C subsequences <X1, X2, …, Xn/C>, each with C points. Clustering_Subsequence works as follows. First, the process anchors a potential cluster at X1 and calculates its clustering feature. It then calculates the clustering feature of the adjacent subsequence X2. The potential cluster and X2 are merged if the distance between them is smaller than the maximum of their radii, in which case the clustering feature of the potential cluster is updated. When, at some subsequence j, the distance between the potential cluster and subsequence j exceeds the maximum of their radii, the subsequences from the anchor to j−1 are merged into one cluster and the anchor is moved to subsequence j. The process repeats until the online time series ends, or until the anchor reaches the last subsequence for an offline time series.

Algorithm 2: Clustering_Subsequence(X, C)
Inputs: time series X, parameter C
Output: Init_CF
Method:
    Init_CF = ∅, i = 1, j = 2
    While (j <= n/C)
        If (D(Xi, Xj) < max(Ri, Rj))
            CFi = CFi + CFj
        Else
            Init_CF = union(Init_CF, CFi)
            i++
            CFi = CFj
        End
        j++
    End
    Init_CF = union(Init_CF, CFi)

Figure 3. The description of the Clustering_Subsequence algorithm

The Merging_Subsequence algorithm is shown in Fig. 4. When Clustering_Subsequence finishes, the subsequences <X1, X2, …, Xn/C> have been merged into k clusters <X1, X2, …, Xk>. The algorithm first calculates the distances between all pairs of adjacent clusters. Suppose the distance between Xj and Xj+1 is the minimum; then Xj and Xj+1 are merged into Xj. Xj+1 is deleted, the clustering feature of Xj is renewed, and the distances D(Xj−1, Xj) and D(Xj, Xj+1) are updated. The merging process repeats until k is no larger than K.

Algorithm 3: Merging_Subsequence(Init_CF, K)
Inputs: clustering features Init_CF, parameter K
Output: CF
Method:
    CF = Init_CF, k = length(Init_CF)
    For i = 1 to k − 1
        Calculate D(Xi, Xi+1)
    End
    While (k > K)
        Select D(Xj, Xj+1) = min over i of D(Xi, Xi+1)
        CFj = CFj + CFj+1
        Delete CFj+1 from CF
        Update D(Xj−1, Xj) and D(Xj, Xj+1)
        k−−
    End

Figure 4. The description of the Merging_Subsequence algorithm

C. Algorithm Complexity Analysis
Given a time series X = <x1, x2, …, xn> and a parameter C (C < n), X is merged into k clusters by Clustering_Subsequence. If k is greater than the user-specified threshold K, the k clusters are further merged into K clusters. Clustering_Subsequence performs one linear scan over the entire time series, so its time complexity is O(n). Merging_Subsequence performs a constant number of linear scans over the k clusters, so its time complexity is O(k). The overall time complexity of OSBC is therefore O(n). When the anchor is moved to some subsequence in Clustering_Subsequence, only that subsequence and its successor need to be read into memory, so the space complexity of this step is O(C). In addition, O(k) space is needed for the clustering features of the k clusters that form the result of Clustering_Subsequence, so the space complexity of Clustering_Subsequence is O(C + k). In Merging_Subsequence, O(k) space holds all distances between adjacent clusters and O(K) space holds the clustering features of the K resulting clusters, so its space complexity is O(k + K). The space complexity of OSBC is O(C + k + K). However, the entire time series should be read into memory

for a traditional method, whose space complexity is therefore O(n). In general, the parameters C, k and K are much smaller than the length n of the time series, so O(C + k + K) is much smaller than O(n), and the space complexity of OSBC is much smaller than that of a traditional method.

IV. THE EXPERIMENTAL RESULTS
In this section we conduct two kinds of experiments. We first examine the running time of our algorithm, and then compare the segmentation results of the existing SW algorithm with those of our OSBC algorithm. The experimental environment is a 3.0 GHz Celeron CPU with 512 MB of memory and an 80 GB hard disk, running Windows XP. All algorithms are implemented in MATLAB. The test dataset, random_walk, contains 65536 time series (http://www.cs.ucr.edu/~eamonn/TSDMA/datasets.html). The first experiment uses eight datasets generated from random_walk, containing 25, 50, 75, 100, 125, 150, 175 and 200 time series, respectively. The length of each time series is set to 1000 points, and the threshold parameters are C = 4 and K = 100. The running time (in seconds) of OSBC on the eight cases is summarized in Fig. 5. We conclude that the running time of our algorithm increases with the number of time series, which is consistent with the time complexity analysis of OSBC.
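For concreteness, the three procedures of Section III (Figs. 2–4) can be sketched end-to-end in Python. This is a minimal one-dimensional reading of the pseudocode, not the authors' MATLAB implementation: the distance D is taken as the absolute difference of centroids, and the merging step recomputes adjacent distances each round instead of updating them incrementally as the paper does.

```python
import math

def cf_of(points):
    # Clustering feature of a window: [n, linear sum, square sum].
    return [len(points), sum(points), sum(x * x for x in points)]

def centroid(cf):
    return cf[1] / cf[0]

def radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def dist(cf_a, cf_b):
    # Assumed distance: absolute difference of cluster centroids.
    return abs(centroid(cf_a) - centroid(cf_b))

def clustering_subsequence(x, c):
    # Micro clustering: cut x into windows of size c, then grow an anchored
    # cluster while the next window stays within the larger of the two radii.
    windows = [cf_of(x[i:i + c]) for i in range(0, len(x) - len(x) % c, c)]
    init_cf, cur = [], windows[0]
    for w in windows[1:]:
        if dist(cur, w) < max(radius(cur), radius(w)):
            cur = [a + b for a, b in zip(cur, w)]  # CFs are additive
        else:
            init_cf.append(cur)
            cur = w
    init_cf.append(cur)
    return init_cf

def merging_subsequence(init_cf, k_max):
    # Repeatedly merge the pair of adjacent clusters at minimum distance.
    cfs = [list(cf) for cf in init_cf]
    while len(cfs) > k_max:
        j = min(range(len(cfs) - 1), key=lambda i: dist(cfs[i], cfs[i + 1]))
        cfs[j] = [a + b for a, b in zip(cfs[j], cfs[j + 1])]
        del cfs[j + 1]
    return cfs

def osbc(x, c, k_max):
    init_cf = clustering_subsequence(x, c)
    if len(init_cf) > k_max:
        return merging_subsequence(init_cf, k_max)
    return init_cf

# Demo on a synthetic step series: each final cluster is represented
# by its mean, giving a piecewise-constant approximation.
series = [0.0] * 8 + [5.0] * 8 + [0.1] * 8
segments = osbc(series, c=2, k_max=3)
means = [centroid(cf) for cf in segments]
```

On the step series the three recovered segment means match the three underlying levels, which is the piecewise approximation behaviour evaluated in this section.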
Figure 5. Running time of OSBC

In the second experiment, we generated two time series based on random_walk, with lengths of 100 and 500 points, respectively. We first set the threshold parameters of our algorithm to C = 4 and K = 8 and adjusted the parameter of SW to reach the same compression ratio. Fig. 6 shows the segmentation result of SW for the first time series, and Fig. 7 shows the segmentation result of OSBC for the first time series.

Figure 6. Segmentation result with SW for the first time series

Figure 7. Segmentation result with OSBC for the first time series

Then we set the threshold parameters of our algorithm to C = 4 and K = 50 and adjusted the parameter of SW to reach the same compression ratio. Fig. 8 shows the segmentation result of SW for the second time series, and Fig. 9 shows the segmentation result of OSBC for the second time series.

Figure 8. Segmentation result with SW for the second time series

We can see from these figures that the segmentation results of our algorithm are better than those of SW. SW does not place together subsequences that should be in one cluster, whereas OSBC places such adjacent subsequences into one cluster. Both SW and OSBC are simple and intuitive, and in particular both are online algorithms. However, SW lacks a global view of the time series features, cannot find the changing patterns hidden in the time series, and may produce poor segmentation results. In contrast, OSBC stores the partition information as cluster features and can easily find the changing patterns of the time series while scanning it.


Figure 9. Segmentation result with OSBC for the second time series

V. CONCLUSIONS

Online time series representation is one of the important problems in time series data mining. Based on BIRCH clustering features and agglomerative hierarchical clustering, this paper proposes OSBC, an online segmentation algorithm for time series. OSBC can easily find the changing patterns of a time series and achieve better segmentation results. The experimental results on a time series benchmark, compared with the classic SW algorithm, show the effectiveness of our algorithm. Further improving the segmentation of time series is our future work.

ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China (NSFC) under grants No. 60703111, 61070005, 60773198 and 61033010, and by the Science and Technology Planning Project of Guangdong Province of China under grant No. 2010B080701062.

REFERENCES

[1] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD '94), ACM Press, May 1994, pp. 419-429.
[2] K. Chan, A. W. Fu, "Efficient Time Series Matching by Wavelets," Proc. 15th IEEE International Conference on Data Engineering (ICDE '99), IEEE Computer Society Press, March 1999, pp. 126-133.
[3] E. Keogh, K. Chakrabarti, M. Pazzani, et al., "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases," Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD '01), ACM Press, May 2001, pp. 151-162.
[4] T. Pavlidis, S. L. Horowitz, "Segmentation of Plane Curves," IEEE Transactions on Computers, vol. 23, Aug. 1974, pp. 860-870.
[5] P. Geurts, "Pattern Extraction for Time Series Classification," Proc. European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '01), Springer Press, September 2001, pp. 115-127.
[6] J. Lin, E. Keogh, S. Lonardi, et al., "A Symbolic Representation of Time Series, with Implications for Streaming Algorithms," Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2003), ACM Press, June 2003, pp. 2-11.
[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques (Second Edition), China Machine Press, 2006.
[8] E. Keogh, S. Chu, D. Hart, M. Pazzani, "An Online Algorithm for Segmenting Time Series," Proc. IEEE International Conference on Data Mining (ICDM '01), IEEE Computer Society Press, November 2001, pp. 289-296.
[9] Yubao Liu, Xiuwei Chen, Fei Wang, Jian Yin, "Efficient Detection of Discords for Time Series Stream," Proc. Advances in Data and Web Management, Joint International Conferences (APWeb/WAIM 2009), Springer Press, April 2009, pp. 629-634.
[10] Huanmei Wu, Betty Salzberg, Gregory C. Sharp, Steve B. Jiang, Hiroki Shirato, David Kaeli, "Subsequence Matching on Structured Time Series Data," Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD '05), ACM Press, June 2005, pp. 682-693.
[11] Spiros Papadimitriou, Feifei Li, George Kollios, Philip S. Yu, "Time Series Compressibility and Privacy," Proc. International Conference on Very Large Data Bases (VLDB '07), ACM Press, September 2007, pp. 459-470.
[12] Yubao Liu, Jiarong Cai, Jian Yin, Ada Wai-Chee Fu, "Clustering Text Data Streams," Journal of Computer Science and Technology, vol. 23, Jan. 2008, pp. 112-128.

