International Journal of Computational Intelligence and Information Security, December 2012, Vol. 3, No. 10, ISSN: 1837-7823

A Study of Grid-based Algorithms for Spatial Data Clustering


Ch. N. Santhosh Kumar1, Dr. K. Nageswara Rao2, Dr. A. Govardhan3, V. Sitha Ramulu4, K. Sudheer Reddy5

1 Research Scholar, Dept. of CSE, JNTU Hyderabad, India.
2 Professor & Head, Computer Science & Engineering, PVPSIT Vijayawada, A.P., India.
3 Professor in CSE & Director of Evaluation, JNTU Hyderabad, A.P., India.
4 Research Scholar, Dept. of CSE, Acharya Nagarjuna University, A.P., India.
5 Research Scholar, Dept. of CSE, Acharya Nagarjuna University, Guntur, A.P., India.

Abstract
Spatial data mining is the process of discovering interesting, previously unknown, and potentially useful patterns from large spatial datasets. Extracting such patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data because of the complexity of spatial data types, spatial relationships, and spatial autocorrelation. Clustering is one of the important methods in spatial data mining and is the subject of active research in many fields, such as statistics, pattern recognition, and machine learning. Clustering is the process of dividing data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but it achieves simplification: the data are modeled by their clusters. There are various methods for spatial data clustering, including partitioning methods, grid-based methods, and density-based methods. In this paper we present an overview of grid-based algorithms for spatial data clustering.

Keywords: Clustering, Grid, Spatial data.

I. INTRODUCTION
Spatial data mining [1] methods can be applied to extract interesting and regular knowledge from large spatial databases. In particular, they can be used for understanding spatial data, discovering relationships between spatial and non-spatial data, constructing spatial knowledge bases, query optimization, data reorganization in spatial databases, and capturing general characteristics in a simple and concise manner. These methods have wide applications in Geographic Information Systems (GIS), remote sensing, image database exploration, medical imaging, robot navigation, and other areas where spatial data are used. Knowledge discovered from spatial data can take various forms, such as characteristic and discriminant rules, extraction and description of prominent structures or clusters, and spatial associations.

Clustering [2] is one of the important methods in spatial data mining and is the subject of active research in many fields, such as statistics, pattern recognition, and machine learning. Clustering is the process of dividing data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but it achieves simplification: the data are modeled by their clusters. There are various methods for spatial data clustering, including partitioning methods, grid-based methods, and density-based methods [3]. In this paper we present an overview of grid-based methods.

A grid-based method [4] quantizes the object space into a finite number of square cells, forming a grid structure. All clustering operations are then performed on the grid structure (i.e., the quantized space). The main advantage of this approach is its efficiency and speed: the processing time is usually independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.

II. STING METHOD
The STatistical INformation Grid-based method (STING) [5] is a hierarchical, statistical-information, grid-based approach to spatial data mining. The idea is to capture the statistical information associated with spatial cells in such a manner that whole classes of queries and clustering problems can be answered without recourse to the individual objects. The spatial area is divided into rectangular cells organized in a hierarchical structure. Let the root of the hierarchy be at level 1, its children at level 2, and so on. A cell at level i corresponds to the union of the areas of its children at level i + 1.

With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries. For each query, we begin by examining cells in a high-level layer. Note that it is not necessary to start with the root; we may begin from an intermediate layer. Starting with the root, we calculate the likelihood that this cell is relevant to the query at some confidence level using the parameters of this cell. This likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions. After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level. When we finish examining the current layer, we proceed to the next lower level of cells and repeat the same process, examining only the children of the relevant cells.

The algorithm for the STING method is as follows:
1. Determine a layer to begin with.
2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. Go down the hierarchy structure by one level. Go to Step 2 for those cells that form the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
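The top-down traversal in Steps 1 to 5 can be illustrated with a minimal Python sketch. It is not the STING implementation: it assumes two-dimensional points in the unit square, a hierarchy in which every cell has four children, and cell parameters reduced to a count, a minimum, and a maximum of one attribute. The confidence-interval test is replaced by a much simpler rule (a cell is relevant if its stored maximum reaches the query threshold). The names `build_layers` and `relevant_cells` are hypothetical.

```python
def build_layers(points, depth):
    """Build a hierarchy of grid layers over the unit square.

    points: list of (x, y, value) with 0 <= x, y <= 1.
    Returns a list of layers; layers[0] is the root (one cell) and
    layers[-1] is the bottom layer. Each layer maps a cell index
    (ix, iy) to the tuple (count, min_value, max_value).
    """
    side = 2 ** (depth - 1)
    bottom = {}
    for x, y, v in points:
        key = (min(int(x * side), side - 1), min(int(y * side), side - 1))
        n, lo, hi = bottom.get(key, (0, v, v))
        bottom[key] = (n + 1, min(lo, v), max(hi, v))
    layers = [bottom]
    # Parent statistics are derived from the children, not from the objects.
    for _ in range(depth - 1):
        parent = {}
        for (ix, iy), (n, lo, hi) in layers[0].items():
            key = (ix // 2, iy // 2)
            pn, plo, phi = parent.get(key, (0, lo, hi))
            parent[key] = (pn + n, min(plo, lo), max(phi, hi))
        layers.insert(0, parent)
    return layers


def relevant_cells(layers, threshold, start=0):
    """Return bottom-layer cells that may contain a value >= threshold.

    Simplified relevance rule (an assumption of this sketch): a cell is
    relevant if its stored maximum is at least the threshold.
    """
    current = [k for k, (_, _, hi) in layers[start].items() if hi >= threshold]
    for level in range(start + 1, len(layers)):
        next_cells = []
        for ix, iy in current:
            # Only children of relevant cells are examined.
            for dx in (0, 1):
                for dy in (0, 1):
                    child = (2 * ix + dx, 2 * iy + dy)
                    stats = layers[level].get(child)
                    if stats is not None and stats[2] >= threshold:
                        next_cells.append(child)
        current = next_cells
    return sorted(current)


points = [(0.10, 0.10, 5.0), (0.15, 0.12, 9.0), (0.80, 0.80, 2.0), (0.90, 0.10, 1.0)]
layers = build_layers(points, depth=3)
print(relevant_cells(layers, threshold=8.0))
```

Because every layer stores only summary statistics, the query never touches the individual objects until the relevant bottom-layer cells have been found, which is the property that makes the processing time depend on the number of cells rather than on the number of objects.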

III. OPTIGRID METHOD
Optimal Grid (OPTIGRID) is based on constructing an optimal grid partitioning of the data. The optimal grid partitioning is determined by calculating the best partitioning hyperplanes for each dimension using certain projections of the data.

The OPTIGRID algorithm works recursively. In each step it partitions the current data set into a number of subsets, if possible. The subsets that contain at least one cluster are treated recursively. The partitioning is done using a multidimensional grid defined by at most q cutting planes. Each cutting plane is orthogonal to at least one projection. The point density at a cutting plane is bounded by the density of the orthogonal projection of the cutting plane in the projected space. The q cutting planes are chosen to have minimal point density. The recursion stops for a subset if no good cutting plane can be found any more.

The OPTIGRID algorithm is given below.

OPTIGRID(data set D, q, min_cut_score)
1. Determine a set of contracting projections P = {p0, p1, ..., pk}.
2. Calculate all projections of the data set D: p0(D), ..., pk(D).
3. Initialize a list of cutting planes BEST_CUT ← ∅, CUT ← ∅.
4. For i = 0 to k do
   (a) CUT ← best_local_cuts(pi(D)).
   (b) CUT_SCORE ← score(best_local_cuts(pi(D))).
   (c) Insert all cutting planes with a score >= min_cut_score into BEST_CUT.
5. If BEST_CUT = ∅ then return D as a cluster.
6. Determine the q cutting planes with the highest score from BEST_CUT and delete the rest.
7. Construct a multidimensional grid G defined by the cutting planes in BEST_CUT and insert all data points x ∈ D into G.
8. Determine clusters, i.e., determine the highly populated grid cells in G and add them to the set of clusters C.
9. Refine(C).
10. For each cluster Ci ∈ C do OPTIGRID(Ci, q, min_cut_score).
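The recursive structure of the algorithm can be sketched in Python under strong simplifying assumptions. The sketch uses only the axis-parallel projections, estimates the density of each projection with an equal-width histogram, and keeps at most one cut per projection. The scoring rule (one minus the ratio of the density at the cut to the lower of the two flanking peaks) and the parameters `bins` and `min_peak` are assumptions of this sketch and are not taken from the paper; every non-empty grid cell is recursed into, and the Refine step is omitted.

```python
def best_cut(values, bins=10, min_peak=3):
    """Best cut of a 1-D projection as (score, position), or None.

    A cut is placed in the centre of a histogram bin that is lower than
    the highest bin on each side; both peaks must hold at least
    min_peak points (sketch-specific assumption).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return None
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    best = None
    for b in range(1, bins - 1):
        peak = min(max(hist[:b]), max(hist[b + 1:]))
        if peak >= min_peak and peak > hist[b]:
            score = 1.0 - hist[b] / peak  # low density at the cut scores high
            if best is None or score > best[0]:
                best = (score, lo + (b + 0.5) * width)
    return best


def optigrid(points, q=1, min_cut_score=0.5, bins=10, min_peak=3):
    """Recursively partition points (tuples of floats) into clusters."""
    best_cuts = []
    for d in range(len(points[0])):
        cut = best_cut([p[d] for p in points], bins, min_peak)
        if cut is not None and cut[0] >= min_cut_score:
            best_cuts.append((cut[0], d, cut[1]))
    if not best_cuts:
        return [points]  # no good cutting plane: the subset is a cluster
    best_cuts.sort(reverse=True)
    best_cuts = best_cuts[:q]  # keep the q highest-scoring planes
    grid = {}
    for p in points:
        # A grid cell is identified by the side of each plane the point is on.
        key = tuple(p[d] > pos for _, d, pos in best_cuts)
        grid.setdefault(key, []).append(p)
    clusters = []
    for subset in grid.values():
        clusters.extend(optigrid(subset, q, min_cut_score, bins, min_peak))
    return clusters


data = [(0.10, 0.10), (0.12, 0.15), (0.15, 0.10),
        (0.90, 0.90), (0.88, 0.92), (0.92, 0.85)]
for cluster in optigrid(data):
    print(sorted(cluster))
```

Every accepted cut has points on both sides, so each recursive call receives a strictly smaller subset and the recursion terminates.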

IV. COSINE BASED GRID METHOD
The Cosine Cluster [6] method is based on the cosine transform. It detects clusters of arbitrary shape and is insensitive to outliers (noise) and to the order of the input data. Using the multi-resolution property of the cosine transform, arbitrarily shaped clusters can be effectively identified at different degrees of accuracy. Given a set of spatial objects Oi, 1 <= i <= N, the goal of the algorithm is to detect clusters and assign labels to the objects based on the cluster to which they belong. The main idea in Cosine Cluster is to transform the original feature space by applying the cosine transform and then to find the dense regions in the new space. It yields sets of clusters at different resolutions and scales, which can be chosen based on the user's needs.

The Cosine clustering algorithm is as follows:
Input: Feature vectors of multidimensional data objects
Output: Clustered objects
1. Quantize the feature space, then assign objects to the units.
2. Apply the cosine transform on the feature space.
3. Find the connected components (clusters) in the transformed feature space, at different levels.
4. Assign labels to the units.
5. Make the lookup table.
6. Map the objects to the clusters.

In the first step, each dimension i of the d-dimensional feature space is divided into mi intervals; this process is called quantization of the feature space. In the second step, the discrete cosine transform is applied to the quantized feature space: applying it to the units Mj results in a new feature space and hence new units Tk. In the third step, Cosine Cluster detects the connected components in the transformed feature space. Each connected component is a cluster, that is, a set of units Tk. In the fourth step, Cosine Cluster labels the feature space units that are included in a cluster with the number of that cluster.
In the fifth step, Cosine Cluster makes a lookup table LT to map the units in the transformed feature space to the units in the original feature space. Finally, Cosine Cluster assigns the label of each unit in the feature space to all the objects whose feature vector is in that unit, and thus the clusters are determined.

V. DEFLECTED GRID-BASED METHOD
The main idea of the deflected grid algorithm is to deflect the original grid structure in each dimension of the data space after the clusters generated from the original grid structure have been obtained. The deflected grid structure is then used to find new significant cells [7]. Next, the nearby significant cells are grouped to form new clusters. Finally, these newly generated clusters are used to revise the originally generated clusters.

In outline, the algorithm first builds the grid structure and clusters the data on it. It then deflects the cell margins by half a cell width in each dimension to obtain a new grid structure, clusters the data again, and combines the two sets of clusters into the final result. The deflected grid algorithm is as follows:
1. Generate the grid structure.
2. Calculate the density of each cell to find the significant cells, whose densities exceed a predefined threshold.
3. Group the nearby significant cells that are connected to each other into clusters. The set of these clusters is denoted as S1.
4. Deflect the original grid structure by distance d in each dimension of the data space.
5. Repeat Step 2 and Step 3 using the deflected grid structure to generate a set of new clusters, denoted as S2.
6. Use the clusters generated from the deflected grid structure to revise the originally obtained clusters.
7. After all clusters of S1 have been revised, S1 is the set of final clusters.
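The seven steps can be illustrated with a short Python sketch for two-dimensional points. Several details are assumptions of the sketch rather than statements of the method: cell density is taken to be the number of points in a cell, cells are connected when they share an edge, the deflection distance d is half the cell width, and the revision step merges every S1 cluster that shares at least one point with the same S2 cluster. The function names are hypothetical.

```python
from collections import defaultdict


def grid_clusters(points, width, offset, threshold):
    """Cluster points on one grid; returns a list of sets of point indices.

    offset shifts the cell margins in both dimensions (0 for the
    original grid, d for the deflected grid).
    """
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(int((x - offset) // width), int((y - offset) // width))].append(i)
    # Significant cells: density (point count) exceeds the threshold.
    significant = {c for c, members in cells.items() if len(members) > threshold}
    clusters, seen = [], set()
    for start in significant:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:  # flood fill over edge-adjacent significant cells
            cx, cy = stack.pop()
            members.update(cells[(cx, cy)])
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in significant and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(members)
    return clusters


def deflected_grid(points, width, threshold):
    """Cluster on the original grid, then revise with the deflected grid."""
    s1 = grid_clusters(points, width, 0.0, threshold)
    s2 = grid_clusters(points, width, width / 2, threshold)
    final = [set(c) for c in s1]
    for new in s2:
        touching = [c for c in final if c & new]
        if not touching:
            continue  # assumption: only existing S1 clusters are revised
        merged = set(new)
        for c in touching:
            merged |= c
        final = [c for c in final if not (c & new)] + [merged]
    return final


pts = [(0.8, 0.8), (0.9, 0.8), (0.8, 0.9),   # dense group just below (1, 1)
       (1.1, 1.1), (1.2, 1.1), (1.1, 1.2),   # dense group just above (1, 1)
       (5.2, 5.2), (5.3, 5.2), (5.2, 5.3)]   # separate dense group
for cluster in deflected_grid(pts, width=1.0, threshold=2):
    print(sorted(cluster))
```

In this example the first two groups fall into diagonally adjacent cells of the original grid and are therefore reported as separate clusters in S1; the deflected grid places them in a single cell, so the revision step merges them.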

VI. AMR GRID BASED METHOD
Adaptive Mesh Refinement (AMR) is a type of multiscale algorithm that achieves high resolution in localized regions of dynamic, multidimensional numerical simulations [8][9]. It has been successfully applied to model large-scale scientific applications in a range of disciplines, such as computational fluid dynamics, astrophysics, meteorological simulations, structural dynamics, magnetics, and thermal dynamics. Basically, the algorithm places very high resolution grids precisely where they are needed, so that the high computational cost is incurred only in those regions. Its adaptability allows simulation at multiple scales of resolution that are out of reach with methods using a global uniform fine grid.

The motivation for bringing the AMR concept into clustering comes from the observation that a grid-based clustering algorithm that employs a single uniform mesh can require a very fine mesh when the data distribution is highly irregular or concentrated. A fine mesh results in high computation cost and, in some cases, the number of mesh cells can even exceed the number of data objects. The AMR clustering algorithm connects the grid-based and density-based approaches through the AMR technique and hence preserves the advantages of both types of algorithms.

The algorithm consists of two stages: constructing the AMR tree and data clustering. The AMR tree construction is a top-down process starting from the root node, which covers the entire problem volume. The data clustering stage is a bottom-up process that starts at a given tree level (depth) and works toward the root.

The algorithm for constructing the AMR tree is as follows:

AMR(grid, level)
1. create a new subgrid containing this cell only
2. for each particle contained in this grid
3.     find the mesh cell id in which the particle is located
4.     accumulate the particle to the density of cell id
5. for each mesh cell
6.     if the density of this cell is greater than the threshold
7.         mark this cell to be refined
8.         connect this cell to its neighbor that is also marked
9.         if no such neighbor exists
10.            create a new subgrid containing this cell only
11. for each subgrid
12.     call AMR(subgrid, level+1)
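The tree construction stage can be sketched in Python for two-dimensional data. This is an illustration under stated assumptions, not the authors' implementation: each grid is covered by a `mesh` x `mesh` uniform mesh, density is the number of particles in a mesh cell, marked cells are connected when they share an edge, a subgrid is the bounding box of a connected group of marked cells, and the recursion is stopped at a `max_level` parameter that does not appear in the listing above. The bottom-up clustering stage is not shown.

```python
def amr(points, box, level=0, max_level=3, mesh=4, threshold=2):
    """Build an AMR tree node for the particles inside box.

    points: list of (x, y); box: (x0, y0, x1, y1).
    Returns a dict with the keys box, level, count, and children.
    """
    x0, y0, x1, y1 = box
    node = {"box": box, "level": level, "count": len(points), "children": []}
    if level == max_level:  # stopping rule assumed by this sketch
        return node
    wx, wy = (x1 - x0) / mesh, (y1 - y0) / mesh

    def cell_index(value, origin, width):
        # Clamp so that particles on the upper boundary stay inside the mesh.
        return max(0, min(int((value - origin) / width), mesh - 1))

    # Accumulate each particle into the density of its mesh cell.
    density = {}
    for x, y in points:
        cid = (cell_index(x, x0, wx), cell_index(y, y0, wy))
        density.setdefault(cid, []).append((x, y))

    # Mark cells whose density is greater than the threshold.
    marked = {c for c, members in density.items() if len(members) > threshold}

    seen = set()
    for start in sorted(marked):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:  # connect marked cells to their marked neighbours
            cx, cy = stack.pop()
            group.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in marked and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        # The subgrid is the bounding box of the connected group of cells.
        ix0, ix1 = min(c[0] for c in group), max(c[0] for c in group) + 1
        iy0, iy1 = min(c[1] for c in group), max(c[1] for c in group) + 1
        sub_box = (x0 + ix0 * wx, y0 + iy0 * wy, x0 + ix1 * wx, y0 + iy1 * wy)
        sub_points = [p for c in group for p in density[c]]
        node["children"].append(
            amr(sub_points, sub_box, level + 1, max_level, mesh, threshold))
    return node


particles = [(0.05, 0.05), (0.10, 0.10), (0.15, 0.15), (0.20, 0.20), (0.90, 0.90)]
tree = amr(particles, (0.0, 0.0, 1.0, 1.0), max_level=2)
print(tree["count"], [child["box"] for child in tree["children"]])
```

Only the dense corner of the domain is refined in this example, so the finer mesh is created where the data are concentrated and the sparse region stays at the coarse resolution of the root grid.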

VII. CONCLUSIONS
Spatial data mining is the application of data mining techniques to spatial data. It is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. Clustering is an important concept in spatial data mining: cluster analysis divides data into meaningful or useful clusters. There are many methods for spatial clustering, including partitioning methods, grid-based methods, and density-based methods. A grid-based method quantizes the object space into a finite number of square cells, forming a grid structure, and all clustering operations are performed on the grid structure (i.e., the quantized space).

The Statistical Information Grid-based method (STING) is a hierarchical, statistical-information, grid-based approach to spatial data mining. The idea is to capture the statistical information associated with spatial cells in such a manner that whole classes of queries and clustering problems can be answered without recourse to the individual objects. Optimal Grid (OPTIGRID) is based on constructing an optimal grid partitioning of the data, determined by calculating the best partitioning hyperplanes for each dimension using certain projections of the data. The cosine-based grid method applies the cosine transform to the feature space of the spatial data, which helps in detecting arbitrarily shaped clusters at different scales. The deflected grid-based algorithm deflects the original grid structure in each dimension of the data space after the clusters generated from the original structure have been obtained. The AMR method partitions the problem domain into regions that are represented by grids in a hierarchical tree. Each grid represents the data in a spatial subdomain, and grids at different levels of the tree employ meshes of different granularity.

REFERENCES
[1] M. Hemalatha and N. Naga Saranya. A Recent Survey on Knowledge Discovery in Spatial Data Mining. IJCI International Journal of Computer Science, Vol. 8, Issue 3, No. 2, May 2011.
[2] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB Conference, Santiago, Chile, 1994.
[3] X. Hu and I. Yoo. Cluster Ensemble and Its Application in Gene Expression Analysis. Proceedings, The Second Conference on Asia-Pacific Bioinformatics, Vol. 29, Dunedin, New Zealand, pp. 297-302, 2004.
[4] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. Salvo. A Grid-Based Clustering Method for Mining Frequent Trips from Large-Scale, Event-Based Telematics Datasets. The 2009 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, USA, October 2009, pp. 2996-3001.
[5] W. Wang, J. Yang, and R. R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. Technical Report No. 970006, Computer Science Department, UCLA, February 1997.
[6] Safdar Ali Syed Abedi. Exploring Discrete Cosine Transform for Multi-Resolution Analysis. Under the direction of Saeid Belkasim.
[7] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.
[8] M. Berger and P. Colella. Local Adaptive Mesh Refinement for Shock Hydrodynamics. Journal of Computational Physics, 82(1):64-84, May 1989.
[9] M. Berger and J. Oliger. Adaptive Mesh Refinement for Hyperbolic Partial Differential Equations. Journal of Computational Physics, 53:484-512, Mar. 1984.
