Академический Документы
Профессиональный Документы
Культура Документы
5, September 2013
467
1. Introduction
In recent times, majority of the data available
throughout the world are warehoused in databases.
Data mining that has received immense attention from
the research community because of its importance is
the process of detecting patterns from extremely huge
quantities of data collection [1]. Knowledge Discovery
in Databases (KDD) is the other name for data mining
which has been identified as a potential field for
database research [53]. Classification or bunching of
these data into a set of categories or clusters is one of
the essential methods in manipulating these data [10,
63]. Clustering is a delineative task that attempts to
detect similar category of objects based on the
implications of their features dimensions [26, 28]. One
can detect the predominant distribution patterns and
interesting correlations that exist among data attributes
by clustering which can determine dense and sparse
areas [4, 5].
Easier and faster data collection due to technology
advances has given rise to larger and more complicated
datasets with several objects and dimensions.
Adaptations to existing algorithms are necessary to
maintain cluster quality and speed as the datasets
become larger and more diverse. By considering all of
the dimensions of an input dataset, conventional
clustering algorithms attempt to find out as much as
possible about each described object. But, normally
majority of the dimensions are irrelevant in high
dimensional data. These irrelevant dimensions can hide
468
The International Arab Journal of Information Technology, Vol. 10, No. 5, September 2013
469
470
The International Arab Journal of Information Technology, Vol. 10, No. 5, September 2013
2. Apply COP-KMEANS
dataset.
algorithm on reduced
C=
471
1
N
UU *
(1)
472
The International Arab Journal of Information Technology, Vol. 10, No. 5, September 2013
1 Nc
X cc
N i =1
d
Clustering Error, C e = 1 CA
(3)
(4)
a
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
b
5
5
3
6
4
8
1
2
2
4
3
2
2
5
1
c
1
4
1
8
1
0
1
1
5
2
1
1
1
5
2
d
1
4
1
8
1
10
1
2
1
1
1
21
1
3
2
e
1
5
19
1
3
8
1
1
21
7
1
1
1
3
1
f
2
7
2
3
2
7
2
2
6
2
1
2
2
2
2
g
1
0
2
4
1
10
10
1
1
1
3
1
1
13
3
h
3
3
3
5
3
9
3
3
1
2
3
2
2
4
3
i
1
2
1
7
1
17
1
1
1
1
1
1
1
4
1
j
1
6
1
1
7
1
1
3
5
1
1
1
5
1
3
Eigen Value
0.08940
0.66158
0.86841
2.9814
5.2712
6.7445
20.77251
22.95552
39.38671
49.2402
1
0
2
4
1
10
10
1
1
1
3
1
1
13
3
h
3
3
3
5
3
9
3
3
1
2
3
2
2
4
3
i
1
2
1
7
1
17
1
1
1
1
1
1
1
4
1
j
1
6
1
1
7
1
1
3
5
1
1
1
5
1
3
Ionosphere Dataset
Reduced no.
Original no.
of attributer
of attribute
using PCA
34
7
Ionosphere Dataset
Original
Proposed
COP- K
Approach
means
71.51
68.67
Proposed
Approach
Ionosphere Dataset
Original
COP- K
means
28.49
31.33
Proposed
Approach
8. Conclusions
Here, we have performed COP-KMEANS algorithm to
the original high dimension dataset proved that
clustering the raw high dimensional data sets did not
produce precise clusters due to intrinsic sparse of high
dimensional data. Hence, in our proposed approach, we
have carried out two steps for clustering high
dimension dataset. Initially, we have performed the
dimensionality reduction on the high dimension dataset
using PCA as a preprocessing step to data clustering.
Later, we have integrated the COP-KMEANS
clustering algorithm to the reduced dataset to produce
precise clusters. The performance of the proposed
approach is evaluated with high dimension UCI
datasets such as Parkinsons dataset and Ionosphere
dataset. The experimental results showed that the
proposed approach is very effective in producing
precise clusters compared with the original COPKMEANS algorithm.
473
References
[1]
474
The International Arab Journal of Information Technology, Vol. 10, No. 5, September 2013
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[51]
[52]
[53]
[54]
[55]
[56]
[57]
[58]
[59]
[60]
[61]
[62]
[63]
475
476
The International Arab Journal of Information Technology, Vol. 10, No. 5, September 2013