Академический Документы
Профессиональный Документы
Культура Документы
Pages: 444-453
444
p, q
upq [0,1]
n
0 u pq n
q 1`
pq
445
1
(upq logupq upq )
p1 p q1
J m (U , Y ) u pqEpq
p1 q1
(1)
2
where E
pq
-y
u pq e
and y
p
p, q (2)
q 1
u pq / u pq
p . (3)
q 1
h ( xi , v j )
M (x
ik
, v jk ) (4)
Dor(x,x)=1
0 <DOR(xi ,vj)1.
DOR(xi ,vj) DOR(vj, xi).
k 1
446
III. METHODS
The main idea of the proposed algorithm is to avoid
the use of cluster centers which makes the cluster
boundaries take a pre-specified geometric shape
depending on the used similarity measure. The
proposed solution of the problem resides in
modifying the objective function in equ. (1) so that
it minimizes (maximizes) the average distance
(similarity) between every pair of data points
within the cluster. Based on equ.(1) The objective
function to be maximized is
k
J m (U )
n
i 1
p 1
i1
u pi u pj
pi
u pj S ( x i , x j )
ji
ji
( u pi ln u pi u pi ) )
i 1
(6)
Where S(xi,xj) is the similarity between
subsequence i and subsequence j and represents an
entry (i,j) in the similarity matrix S. And upiis the
possibilistic membership of subsequence i in
cluster p and represents an entry (i,p) in the
membership matrix M.
n
Jm
upi
u S(x , x )
pj
ji
n
pj
u u
pi pj
i1 ji
ji
n n
upiupj
i1 ji
procedureinitproc
u u S(x , x )
pi pj
i1 ji
(7)
output:
p, i (8)
where
n
E pi
c : number of clusters
u pj
ji
i 1
u pi u pj
ji
i 1
u pi u pj S ( x i , x j )
j i
2.
ji
i 1
pi
) forj=1,2,...n
( )
3SettotCountto n, Setk to 1
u pj
j i
(9)
In section III(C), equations (8) , (9) are used as
formulas for recalculating the membership
functions. Initial M, Sand are given as input to
the algorithm described in fig. 4.
A. The Proposed Initialization Procedure
Both the proposed algorithm and the c-medoids
algorithms require pre-specifying the number of
( )=
u pj S ( x i , x j )
begin [initproc]
1. For each pattern xi, calculate the
input :
S :Similarity matrix of size nn
ln(upi) 11 0
u pi e
5. while (totSum/totCount< )
begin
6.1Remove = argmin Sum( )
6.2Maskxjand update totSum, Sum
6.3DecementtotCount
end
6. if(totalCount<prcnt*n)
label patterns in Ck as potential outliers, Clear Ck
447
else
.
6
1.
5
2.
1
7.
2
.
5
.
4
.
8
.
6
.
7
Go To Step 1
.
4
.
8
0.
8
1.
7
incrementk
Procedure EstimateAlphaBeta
endif
input:
end [initproc]
Su
m
.
8
.
7
.
9
2.
4
1.
(b)
else
(a)
end if
su
m
.
9
0.
9
output:
S : Similarity matrix of size nn
: threshold on the average similarity
:parametersof the objective function in equ. (6)
begin [EstimateAlphaBeta]
1. Compute the Similarity matrix S using Dor
2. Select randomly very large number (hundreds of
thousands) of subsets of input subsequences.
3. Compute 1000 bins histogram for the average
similaritiesfor each subset selected by step 2.
4. Get the number of first bins fbins such that the
sum of
the count of them is near 0.001 of the total count.
5. Compute = min. value + fbins * step (a value
close to
the lower end of the range of the average
similarities).
6. Compute average Epi for random sample of
objects in the
identified subsets in step 5 (counted in the lower
end)
7. Compute using equ. (10)
end [EstimateAlphaBeta]
Fig. 3. The proposed procedure for estimating
and
By Identifying subsets that having average
448
(10)
if (Epi is minimum)
upi= e
E pi /
for each(qp)
if((upi - uqi)< ))
beginuqi=w
Eqi /
;upi = w
E pi /
else
uqi=0
end if
endif
Until
(|| U U ' || )
Output U
End [RPLINK]
Fig. 4.Rough Possibilistic Complete Link
Clustering Algorithm (RPLINK)
end
end
449
B. Quality Measures
C. Performance Evaluation
(1 / c ) (1 / n c ) DOR ( x j , v i )
i 1
max
i, j
j 1
1
( DOR ( v j , v i ) DOR ( v i , v j ))
2
C k C
(max
Cl C
sim(C k , C l )
min C m C diam(C
Table 1
m
1 k
si
k j 1 xiC j
(14)
(15)
(13)
GSu =
Performance of RPLINK
compared to all above listed Algorithms on
NP_057849
Algorit
hm
Para
m.
Para
m.
RPLIN
K
0.7
92
0.8
53
0.8
26
RFCM
dd
0.7
36
0.9
14
0.8
01
0.8
19
FCMdd
0.7
19
0.9
14
0.7
46
0.8
28
0.6
12
0.9
38
0.6
35
0.8
29
HCMd
d
0.6
07
0.9
38
0.6
21
0.8
27
MI
0.6
11
0.9
44
0.6
25
0.9
13
GAFR
0.6
09
0.9
62
0.6
18
0.9
02
RPLIN
K
0.8
06
0.8
05
RFCM
dd
0.8
01
0.8
21
FCMdd
0.7
46
RCMd
d
HCMd
RCMd
d
C=13
C=26
C=27
0.8
91
0.7
99
0.6
72
0.8
36
0.6
81
0.8
37
0.7
67
0.7
01
0.6
32
0.8
36
0.6
51
0.7
51
0.6
0.8
0.6
0.7
C=36
450
18
44
43
51
MI
0.6
24
0.9
13
0.6
37
0.8
54
GAFR
0.6
16
0.9
02
0.6
46
0.8
72
Table 3
Execution Time in (ms) for different number of
clusterscompared to RFCMdd on NP_057849 and
NP_057850
NP_057849
Cluster
Count
Silhouette Width
RPLINK
0.321574
0.362815
RFCMdd
0.305913
0.280974
FCMdd
0.291722
0.289233
RCMdd
0.264625
0.228046
HCMdd
0.24615
0.211839
3) Runtime Analysis
The results in table 3 represents the average
execution time for RPLINK and RFCMddIn this
experiment the corresponding algorithms when
they are applied to the dataset NP_057849 and
NP_057850 for different number of clusters.The
runtime in Table 3.It is clear that RPLINK is
slightly slower than RFCMdd on NP_057849 and
comparable to it for the smaller dataset
NP_057850
RPLI
NK
RFCM
dd
RPLI
NK
RFCM
dd
04122
05213
0771
0882
13
14223
13612
1233
1423
26
28312
24063
3677
3114
36
40802
32713
4122
3821
50
61507
44329
7231
5754
Table 2
Algorithm
NP_057849
RPLINK
40802
RFCMdd
32713
FCMdd
240834
RCMdd
160563
HCMdd
4379
451
of
Available:
452
453