K. M. Hammouda
Department of Systems Design Engineering
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

I. INTRODUCTION
A. K-means Clustering
K-means clustering, also known as Hard C-means clustering, is an algorithm that finds clusters in a data set such that a cost function (or objective function) of a dissimilarity (or distance) measure is minimized [1]. In most cases this dissimilarity measure is chosen as the Euclidean distance.
A set of $n$ vectors $x_j$, $j = 1, \ldots, n$, is to be partitioned into $c$ groups $G_i$, $i = 1, \ldots, c$. The cost function, based on the Euclidean distance between a vector $x_k$ in group $i$ and the corresponding cluster center $c_i$, can be defined by:
$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \Big( \sum_{k,\, x_k \in G_i} \| x_k - c_i \|^2 \Big), \qquad (1)$$

where $J_i = \sum_{k,\, x_k \in G_i} \| x_k - c_i \|^2$ is the cost function within group $i$.

The partitioned groups are defined by a $c \times n$ binary membership matrix $U$, where the element $u_{ij}$ is 1 if the $j$-th data point $x_j$ belongs to group $i$, and 0 otherwise:

$$u_{ij} = \begin{cases} 1 & \text{if } \|x_j - c_i\|^2 \le \|x_j - c_k\|^2 \text{ for each } k \ne i, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

The optimal center that minimizes (1) is the mean of all vectors in group $i$:

$$c_i = \frac{1}{|G_i|} \sum_{k,\, x_k \in G_i} x_k, \qquad (3)$$

where $|G_i|$ is the size of group $i$.
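For concreteness, one iteration of updates (2) and (3) can be written in a few lines of MATLAB; the following is a minimal sketch on a toy data set, separate from the full implementation listed in the Appendix.

% One K-means iteration on a toy data set (illustrative sketch)
x = [0 0; 0 1; 5 5; 6 5];            % n = 4 vectors in 2 dimensions
c = [0 0; 5 5];                      % initial centers, c = 2 clusters
nc = size(c,1);
d2 = zeros(size(x,1), nc);
for i = 1:nc                         % squared distance from every vector to center i
    d2(:,i) = sum((x - repmat(c(i,:), size(x,1), 1)).^2, 2);
end
[~, g] = min(d2, [], 2);             % eq. (2): assign each vector to its nearest center
for i = 1:nc
    c(i,:) = mean(x(g==i,:), 1);     % eq. (3): recompute each center as the group mean
end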
B. Fuzzy C-means Clustering

Fuzzy C-means (FCM) clustering allows each data point to belong to every cluster with a membership degree between 0 and 1, with the constraint that the membership degrees of each vector sum to unity:

$$\sum_{i=1}^{c} u_{ij} = 1, \qquad j = 1, \ldots, n. \qquad (4)$$

The cost function for FCM is

$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2}, \qquad (5)$$

where $u_{ij}$ is between 0 and 1; $c_i$ is the cluster center of fuzzy group $i$; $d_{ij} = \|c_i - x_j\|$ is the Euclidean distance between the $i$-th cluster center and the $j$-th data point; and $m \in [1, \infty)$ is a weighting exponent. The cluster centers and membership degrees that minimize (5) are given by

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}, \qquad (6)$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij}/d_{kj} \right)^{2/(m-1)}}. \qquad (7)$$
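Updates (6) and (7) can likewise be sketched in vectorized MATLAB; the snippet below is an illustration only (it assumes no data point coincides exactly with a center) and is not the Appendix implementation.

% One FCM iteration (illustrative sketch); x is n-by-dim, u is c-by-n
x = [0 0; 0 1; 5 5; 6 5];  n = size(x,1);
u = [0.9 0.8 0.1 0.2; 0.1 0.2 0.9 0.8];   % memberships; each column sums to 1
m_exp = 2;                                % weighting exponent m
um = u.^m_exp;
c = (um*x) ./ repmat(sum(um,2), 1, size(x,2));    % eq. (6): fuzzy centers
d = zeros(size(u));
for i = 1:size(c,1)                       % d(i,j) = ||c_i - x_j||
    d(i,:) = sqrt(sum((x - repmat(c(i,:), n, 1)).^2, 2))';
end
w = d.^(-2/(m_exp-1));                    % eq. (7): new membership degrees
u = w ./ repmat(sum(w,1), size(c,1), 1);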
C. Mountain Clustering

Mountain clustering builds a grid over the data space and computes a mountain function, a measure of data density, at every grid point $v$:

$$m(v) = \sum_{i=1}^{n} \exp\left( -\frac{\|v - x_i\|^2}{2\sigma^2} \right), \qquad (8)$$

where $\sigma$ is an application-specific constant. The grid point with the greatest mountain value becomes the first cluster center $c_1$; subsequent centers are found after destructing the mountain of each selected center:

$$m_{new}(v) = m(v) - m(c_1)\exp\left( -\frac{\|v - c_1\|^2}{2\beta^2} \right). \qquad (9)$$

D. Subtractive Clustering

Subtractive clustering treats the data points themselves as candidate cluster centers, computing a density measure at each data point $x_i$:

$$D_i = \sum_{j=1}^{n} \exp\left( -\frac{\|x_i - x_j\|^2}{(r_a/2)^2} \right), \qquad (10)$$

where $r_a$ is a neighborhood radius. The point with the highest density is selected as the first cluster center $x_{c_1}$, after which the density of every point is revised:

$$D_i = D_i - D_{c_1}\exp\left( -\frac{\|x_i - x_{c_1}\|^2}{(r_b/2)^2} \right), \qquad (11)$$

where $r_b$ defines a neighborhood of reduced density and is typically chosen somewhat greater than $r_a$.
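To illustrate (10) and (11), the following MATLAB fragment computes the first two subtractive-clustering centers on random data; it is a minimal sketch (with r_b = 1.5 r_a, a typical choice), not the script used in the experiments.

% Subtractive clustering: first two centers (illustrative sketch)
x = rand(50,3);                        % 50 data points in 3 dimensions
ra = 0.5;  rb = 1.5*ra;                % rb > ra discourages closely spaced centers
n = size(x,1);  D = zeros(n,1);
for i = 1:n                            % eq. (10): density measure at each point
    dist2 = sum((x - repmat(x(i,:), n, 1)).^2, 2);
    D(i) = sum(exp(-dist2 / (ra/2)^2));
end
[Dc1, i1] = max(D);  c1 = x(i1,:);     % first center: point of highest density
dist2 = sum((x - repmat(c1, n, 1)).^2, 2);
D = D - Dc1 * exp(-dist2 / (rb/2)^2);  % eq. (11): revise densities around c1
[~, i2] = max(D);  c2 = x(i2,:);       % second center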
Table 1. K-means clustering results of 10 test runs.

| Performance measure | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| No. of iterations | 11 | - | - | - | - | - | - | - | - | - |
| RMSE | 0.469 | 0.469 | 0.447 | 0.469 | 0.632 | 0.692 | 0.692 | 0.447 | 0.447 | 0.469 |
| Accuracy | 78.0% | 78.0% | 80.0% | 78.0% | 60.0% | 52.0% | 52.0% | 80.0% | 80.0% | 78.0% |
| Regression line slope | 0.564 | 0.564 | 0.6 | 0.564 | 0.387 | 0.066 | 0.057 | 0.6 | 0.6 | 0.564 |
A. K-means Clustering
As mentioned in the previous section, K-means clustering finds the cluster centers by minimizing a cost function J. It alternates between updating the membership matrix using (2) and updating the cluster centers using (3) until no further improvement in J is achieved.
[Figure: Cost function J of the K-means algorithm versus iteration.]
Table 2. FCM clustering results for different values of the weighting exponent m.

| Performance measure | m = 1.1 | 1.2 | 1.5 | - | - | - | - | 12 |
|---|---|---|---|---|---|---|---|---|
| No. of iterations | 18 | 19 | 17 | 19 | 26 | 29 | 33 | 36 |
| RMSE | 0.469 | 0.469 | 0.479 | 0.469 | 0.458 | 0.479 | 0.479 | 0.479 |
| Accuracy | 78.0% | 78.0% | 77.0% | 78.0% | 79.0% | 77.0% | 77.0% | 77.0% |
| Regression line slope | 0.559 | 0.559 | 0.539 | 0.559 | 0.579 | 0.539 | 0.539 | 0.539 |
[Figure: Regression analysis of the cluster output A against the target T, showing the data points, the best linear fit, and the A = T reference line; R = 0.6.]
Table 3. Mountain clustering results of 10 test runs with randomly selected variables.

| Performance measure | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE | 0.566 | 0.469 | 0.566 | 0.49 | 0.548 | 0.566 | 0.566 | 0.529 | 0.693 | 0.469 |
| Accuracy | 68.0% | 78.0% | 68.0% | 76.0% | 70.0% | 68.0% | 68.0% | 72.0% | 52.0% | 78.0% |
| Regression line slope | 0.351 | 0.556 | 0.343 | 0.515 | 0.428 | 0.345 | 0.345 | 0.492 | 0.026 | 0.551 |
[Figure: Accuracy and number of iterations of FCM clustering versus the weighting exponent m.]
So for the problem at hand, with input data of 13 dimensions, 200 training inputs, and a grid size of 10 per dimension, the required number of mountain function calculations is approximately 2.01 x 10^15. In addition, the value of the mountain function needs to be stored at every grid point for later use in finding subsequent clusters, which requires g^n storage locations; for our problem this would be 10^13 storage locations. This is clearly impractical for a problem of this dimensionality.

In order to test this algorithm, the dimensionality of the problem has to be reduced to a reasonable number, e.g. 4 dimensions. This is achieved by randomly selecting 4 of the original 13 input variables and performing the test on those variables. Several tests with differently selected random variables were conducted to gain a better understanding of the results. Table 3 lists the results of 10 such test runs. The accuracy achieved ranged between 52% and 78%, with an average of 70% and an average RMSE of 0.546. These results are quite discouraging compared to those achieved with K-means and FCM clustering, because not all of the input variables contribute to the clustering process; only 4 are chosen at random to make the tests feasible. Even with only 4 attributes, however, mountain clustering required far more time than any other technique in the tests, because the mountain function must be evaluated at every grid point, a cost that grows exponentially with the number of dimensions.
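To make the growth explicit, the arithmetic can be reproduced in a few MATLAB lines (illustrative only):

g = 10;  n_tr = 200;  d = 13;    % grid size, training vectors, dimensions
grid_pts = g^d;                  % 1e13 grid points to evaluate and store
evals = grid_pts * n_tr;         % 2e15 exponential terms for eq. (8) alone
evals_4d = g^4 * n_tr;           % reduced 4-dimensional problem: 2e6 terms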
Table 4. Subtractive clustering results for different values of the neighborhood radius ra.

| Performance measure | ra = 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| RMSE | 0.67 | 0.648 | 0.648 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.648 |
| Accuracy | 55.0% | 58.0% | 58.0% | 75.0% | 75.0% | 75.0% | 75.0% | 75.0% | 58.0% |
| Regression line slope | 0.0993 | 0.1922 | 0.1922 | 0.5074 | 0.5074 | 0.5074 | 0.5074 | 0.5074 | 0.1922 |
D. Subtractive Clustering

[Figure: Accuracy and RMSE of subtractive clustering versus the neighborhood radius ra.]
Table 5. Comparison of the four clustering techniques.

| Comparison aspect | K-means | FCM | Mountain | Subtractive |
|---|---|---|---|---|
| RMSE | 0.447 | 0.469 | 0.469 | 0.5 |
| Accuracy | 80.0% | 78.0% | 78.0% | 75.0% |
| Regression line slope | 0.6 | 0.559 | 0.556 | 0.5074 |
| Time (sec) | 0.9 | 2.2 | 118.0 | 3.6 |
V. CONCLUSION
Four clustering techniques have been reviewed in this paper, namely: K-means clustering, Fuzzy C-means clustering, Mountain clustering, and Subtractive clustering. These approaches solve the problem of categorizing data by partitioning a data set into a number of clusters based on some similarity measure, so that the similarity within each cluster is greater than the similarity among clusters. The four methods were implemented and tested against a data set for medical diagnosis of heart disease. The comparative study done here is concerned with the accuracy of each algorithm, with attention also paid to computational efficiency and other performance measures. The medical problem presented has a high number of dimensions, which might involve complicated relationships between the variables in the input data. It was obvious that mountain clustering is not well suited to problems of such high dimensionality, since its computational cost grows exponentially with the number of dimensions and it had to be run on a reduced, randomly chosen subset of the variables, which degraded its accuracy.
VI. REFERENCES
[1] Jang, J.-S. R., Sun, C.-T., Mizutani, E., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall.
[2] Azuaje, F., Dubitzky, W., Black, N., Adamson, K., "Discovering Relevance Knowledge in Data: A Growing Cell Structures Approach," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 30, No. 3, June 2000, pp. 448.
[3] Lin, C., Lee, C., Neural Fuzzy Systems, Prentice Hall, NJ, 1996.
[4] Tsoukalas, L., Uhrig, R., Fuzzy and Neural Approaches in Engineering, John Wiley & Sons, Inc., NY, 1997.
[5] Nauck, D., Kruse, R., Klawonn, F., Foundations of Neuro-Fuzzy Systems, John Wiley & Sons Ltd., NY, 1997.
[6] Hartigan, J. A., Wong, M. A., "A k-means clustering algorithm," Applied Statistics, 28:100-108, 1979.
[7] The MathWorks, Inc., Fuzzy Logic Toolbox for Use with MATLAB, The MathWorks, Inc., 1999.
Appendix
K-means Clustering (MATLAB script)
% K-means clustering
% ------------------- CLUSTERING PHASE -------------------
% Load the Training Set
TrSet = load('TrainingSet.txt');
[m,n] = size(TrSet);   % (m samples) x (n dimensions)
for i = 1:m
    % the output (last column) values (0,1,2,3) are mapped to (0,1)
    if TrSet(i,end)>=1
        TrSet(i,end)=1;
    end
end
% find the range of each attribute (for normalization later)
for i = 1:n
    range(1,i) = min(TrSet(:,i));
    range(2,i) = max(TrSet(:,i));
end
x = Normalize(TrSet, range);   % normalize each attribute (helper function)
x(:,end) = [];                 % drop the output column; clustering uses inputs only
[m,n] = size(x);
nc = 2; % number of clusters = 2
% Initialize cluster centers to random points
c = zeros(nc,n);
for i = 1:nc
    rnd = floor(rand*m) + 1;   % select a random vector index in 1..m
    c(i,:) = x(rnd,:);         % assign this vector value to cluster (i)
end
% Clustering Loop
delta = 1e-5;      % convergence threshold on the cost function
max_iter = 1000;   % maximum number of iterations
iter = 1;
while (iter < max_iter)
    % Determine the membership matrix U:
    % u(i,j) = 1 if euc_dist(x(j),c(i)) <= euc_dist(x(j),c(k)) for each k ~= i
    % u(i,j) = 0 otherwise
    for i = 1:nc
        for j = 1:m
            d = euc_dist(x(j,:),c(i,:));
            u(i,j) = 1;
            for k = 1:nc
                if k~=i
                    if euc_dist(x(j,:),c(k,:)) < d
                        u(i,j) = 0;
                    end
                end
            end
        end
    end
    % Compute the cost function J, eq. (1)
    J(iter) = 0;
    for i = 1:nc
        JJ(i) = 0;
        for k = 1:m
            if u(i,k)==1
                JJ(i) = JJ(i) + euc_dist(x(k,:),c(i,:));
            end
        end
        J(iter) = J(iter) + JJ(i);
    end
Fuzzy C-means Clustering (MATLAB script)
nc = 2; % number of clusters = 2
% Initialize the membership matrix with random values between 0 and 1
% such that the summation of membership degrees for each vector equals unity
u = zeros(nc,m);
for i = 1:m
    u(1,i) = rand;
    u(2,i) = 1 - u(1,i);
end
% Clustering Loop
m_exp = 12;        % weighting exponent m
prevJ = 0;
J = 0;
delta = 1e-5;
max_iter = 1000;
iter = 1;
while (iter < max_iter)
    % Calculate the fuzzy cluster centers, eq. (6)
    for i = 1:nc
        sum_ux = 0;
        sum_u = 0;
        for j = 1:m
            sum_ux = sum_ux + (u(i,j)^m_exp)*x(j,:);
            sum_u = sum_u + (u(i,j)^m_exp);
        end
        c(i,:) = sum_ux ./ sum_u;
    end
    % Compute the cost function J, eq. (5)
    J(iter) = 0;
    for i = 1:nc
        JJ(i) = 0;
        for j = 1:m
            JJ(i) = JJ(i) + (u(i,j)^m_exp)*euc_dist(x(j,:),c(i,:));
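        end
        J(iter) = J(iter) + JJ(i);
    end
    % --- minimal sketch of the remainder of the loop (not in the original
    % listing): update the membership matrix via eq. (7), assuming euc_dist
    % returns the distance d_ij of eq. (5), then test for convergence ---
    for i = 1:nc
        for j = 1:m
            s = 0;
            dij = euc_dist(x(j,:),c(i,:));
            for k = 1:nc
                s = s + (dij/euc_dist(x(j,:),c(k,:)))^(2/(m_exp-1));
            end
            u(i,j) = 1/s;
        end
    end
    if iter > 1
        if abs(J(iter) - prevJ) < delta
            break;
        end
    end
    prevJ = J(iter);
    iter = iter + 1;
end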
Mountain Clustering (MATLAB script)
    if Mnew(i) > max_m
        max_m = Mnew(i);
        max_v = v;
        max_i = i;
    end
    % report progress
    if mod(i,5000)==0
        str = sprintf('vector %.0d/%.0d; Mnew(v)=%.2f', i, cur(1,end), Mnew(i));
        disp(str);
    end
end
c(2,:) = max_v;
str = sprintf('Cluster 2:');
disp(str);
str = sprintf('%4.1f', c(2,:));
disp(str);
str = sprintf('M=%.3f', max_m);
disp(str);
%------------------------------------------------------------------
% Evaluation
%------------------------------------------------------------------
% Load the evaluation data set
EvalSet = load('EvaluationSet.txt');
[m,n] = size(EvalSet);
for i = 1:m
    if EvalSet(i,end)>=1
        EvalSet(i,end)=1;
    end
end
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);
% drop the attributes corresponding to the ones dropped in the training set
% (assumes dropped(:) holds column indices sorted in descending order, so
% earlier deletions do not shift the later indices)
for i = 1:n_dropped
    x(:,dropped(i)) = [];
end
[m,n] = size(x);
% Assign every test vector to its nearest cluster
for i = 1:2
    for j = 1:m
        d = euc_dist(x(j,:),c(i,:));
        evu(i,j) = 1;
        for k = 1:2
            if k~=i
                if euc_dist(x(j,:),c(k,:)) < d
                    evu(i,j) = 0;
                end
            end
        end
    end
end
% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
% keep the cluster-to-class assignment with the lower RMSE
if rmse(1) < rmse(2)
    r = 1;
else
    r = 2;
end
Subtractive Clustering (MATLAB script)
    if Dnew(i) > max_d
        max_d = Dnew(i);
        max_x = x(i,:);
        max_i = i;
    end
    % report progress
    if mod(i,50)==0
        str = sprintf('vector %.0d/%.0d; Dnew(v)=%.2f', i, m, Dnew(i));
        disp(str);
    end
end
c(2,:) = max_x;
str = sprintf('Cluster 2:');
disp(str);
str = sprintf('%4.1f', c(2,:));
disp(str);
str = sprintf('D=%.3f', max_d);
disp(str);
%------------------------------------------------------------------
% Evaluation
%------------------------------------------------------------------
% Load the evaluation data set
EvalSet = load('EvaluationSet.txt');
[m,n] = size(EvalSet);
for i = 1:m
    if EvalSet(i,end)>=1
        EvalSet(i,end)=1;
    end
end
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);
% Assign every test vector to its nearest cluster
for i = 1:2
    for j = 1:m
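        % --- minimal sketch of the truncated remainder (not in the original
        % listing), mirroring the evaluation block of the K-means script ---
        d = euc_dist(x(j,:),c(i,:));
        evu(i,j) = 1;
        for k = 1:2
            if k~=i
                if euc_dist(x(j,:),c(k,:)) < d
                    evu(i,j) = 0;
                end
            end
        end
    end
end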