Вы находитесь на странице: 1из 9

Acarids Association

EcoMathTeam∗

Technical Reports

1 Cluster analysis
Cluster analysis is the organization of a collection of objects into cluster based on similarity. Intuitively,
objects belonging to same cluster are more similar to each other than they are to an object from a
different cluster. The clustering process involves the following steps:
Object representation and feature selection. The feature selection is a process of identifying
the most relevant features to be used in the cluster analysis. Let N be the number of the objects and
L be the number of selected features. Let us denote by X α , α = 1, N the objects, by O the set of the
object and by Y the data matrix whose elements yiα quantify the values of i-feature of the α-object.
One can think each object X α as a point in a mathematical L-dimensional space.
Object proximity. The similarity is measured by a function defined on pairs of objects

S : O × O → R,

sα,β define the index of the similarity of the two objects X α and X β . The dissimilarity of two objects
is given by dαβ = 1 − sαβ and the proximity matrix contains the dissimilarity between each pair of
objects.
The data matrix Y and the proximity matrix D are the base of the clustering process.
Grouping step. Next step in cluster analysis is to select a method to partition the set of the object
O into subsets. Agglomerative hierarchical clustering algorithms produce a nested series of partitions,
called dendrogram, based on a criterion for merging cluster by using similarity. There are three main
hierarchical clustering algorithms: single-link [7], complete-link [3], and minimum-variance [8] and [6].
The first two differ in the way they characterize the similarity between two clusters. If in the single-
link method, the distance between two clusters is given by the minimum of the distances between all
pairs of objects belonging the two different clusters in the complete-link method the distance between
the two clusters is given by the maximum of distance of all pairs of objects.
The results of the hierarchical cluster algorithm is a tree of clusters and the similarity levels at which
grouping change. Each dendrogram can be translated into a cophenetic matrix [5]. For any two objects
X α and X β the cophenetic distance hαβ equals the level at which the two objects belong to same
cluster for the first time.

”Gheorghe Mihoc-Caius Iacob” Institute of Statistical Mathematics and Applied Mathematics

1
EcoMathTeam

Validation of the clusters. Any clustering algorithm has a cluster structure as output for any data
set. Consequently, is fundamental two know if such structure is valid or not and how many clusters
are there. Cluster validity is a procedure of evaluating the results of a cluster algorithm. The quality
of the structure is quantified by a cluster validity index that give an information about the possibility
that the structure have occurred by chance or is an artifact of a clustering algorithm. There are three
main methods to investigate cluster validity, external criteria methods, internal criteria methods and
relative criteria methods. The internal examination of validity evaluates the result of the clustering
using only the quantities and feature existent in the data matrix.
There exists a lot of internal validity index see [4], [2], for example, we restrict our general presentation
to the cophenetic correlation coefficient. Let hαβ be the cophenetic matrix associate to a dendrogram
and dαβ the dissimilarity matrix. The cophenetic index is the Mantel normalized statistic of the two
matrices P  
hαβ − h dαβ − d
αβ,α<β
Z(h, d) = v ! !
u
u P 2 P 2
t hαβ − h dαβ − d
αβ,α<β αβ,α<β

where the summation is over α and β with α ranging from one to N − 1, h and d are the means of
the matrices h and d respectively. .
We accept a dendrogram as valid if the cophenetic index is closer to one and its value is statistical
significant at a given level p0 .
We test the statistical significance of the cophenetic index Z(h, d) by using the Monte Carlo method.
We generate a number,n, of random dendrograms and for each such dendrogram we calculated the
cophenetic matrix hi and the cophenetic index Z(hi , d). Then we calculated the fraction of the
cophenetic indices greater than Z(h, d)
|{i|Z(hi , d) > Z(h, d), i = 1, n}|
p=
n
where for any finite set A, |A| means the number of the elements of the set A. p measures the
probability to obtain a value of cophenetic index greater then Z(h, d) by chance.

1.1 Acarid Application


We consider now the ecological data obtained for acarid. The investigations refer to three ecological
systems ”brad”, ”molid” and ”fag” and for each ecological system two living media, OLF and OH,
was analyzed. There was 14 sites distributed in each living media and the time period of observations
was two years. At each site and each moment of time we counted the number of the individuals and
the species.
The first issue concern the time association of the species for every living media. For that we proceed
as follows. For each living media the set of the objects consists in all species founded at least one site
and at least one time.
The data matrix is given by the total number of individuals belonging to the same species recorded
in the all sites at given moment of time. More precise if one denote by {ti }i=1,24 the sequences of

2
Mathematical Modelling in Life Sciences Mites Association

the moments of time observation and by {za }a=1,14 the sites of observation and by yiαa the number of
individuals of the α-species recorded at the time ti and the sites za then
X
yiα = yiαa .
a

The dissimilarity of the two objects is given by:


v !2
yiα yiβ
u
uX
dα β = t − /yi (1)
yα yβ
i

where y α = α α
P P
i yi and yi = α yi

1.2 Stable associations


A subset M ⊂ O is named stable association with respect to some perturbed process if for any
perturbation the set is a cluster of the hierarchical structure.
Taking into account that the few species are all time present, see figure 1.1, we investigate the stable
associations by considering variable number of species. Let m be an integer number greater than 1
and lower to the number of time observations.
Denotes by
Om = {X α |yiα > 0; for at least m values of i}
and Ym the corresponding data matrix.
It is obviously that the Om is a descending sequence with respect to m, Om ⊂ Om−1 ⊂ ... ⊂ O1 = O.
We analyzed the stable association by considering three set objects O6 , O1 2, and O1 8. As hierarchical
cluster algorithm we used the single-link method. The list of stable cluster is given in the table 1.2.

3
EcoMathTeam

Table 1: The stable clusters


OLF OH
Pergamasus laetus Leptogamasus parvulus
Veigaia nemorensis Veigaia nemorensis
Neopodocinum mrciaki Neopodocinum mrciaki
”Brad”
Hypoaspis aculeifer Pachylaelaps furcifer
Eviphis ostrinus Pachyseius humeralis
Prozercon kochi Hypoaspis aculeifer
Leptogamasus parvulus
Leptogamasus tectegynellus
Veigaia nemorensis Leptogamasus parvulus
Veigaia exigua Leptogamasus tectegynellus
”Molid”
Neopodocinum mrciaki Veigaia nemorensis
Pachyseius humeralis Pachyseius humeralis
Eviphis ostrinus
Zercon triangularis
Lysigamasus neoruncatellus
Veigaia nemorensis
Veigaia cerva Veigaia nemorensis
”Fag” Neopodocinum mrciaki Neopodocinum mrciaki
Macrocheles decoloratus Pachylaelaps furcifer
Macrocheles montanus
Pachylaelaps furcifer

4
Mathematical Modelling in Life Sciences Mites Association

Figure 1: Distributions of the species by number of presences on time period of observation. ”Brad”,
”Molid” and ”Fag” ecosystems form top to bottom and OLF habitat, left, OH habitat, right. The
plots show only the species founded at least six time.

5
EcoMathTeam

Figure 2: Dendrograms and histograms of the species which was recorded as present at least 12 months.
”Brad”, ”Molid” and ”Fag” ecosystems form top to boo tom and OLF habitat. The histogram plot the
frequencies, dashed bars, and abundance mediated per frequencies solid bars. The dot on dendrograms
marks the stable association, for labels see table 1.2

6
Mathematical Modelling in Life Sciences Mites Association

Figure 3: Same as in figure 1.2 but for OH habitat

7
EcoMathTeam

Species

Table 2: Species founded in all habitats and their labels


1 Epicrius bureschi 52 Zerconopsis remiger
2 Epicrius resinae 53 Arctoseius cetratus
3 Holoparasitus excipuliger 54 Zercon peltadoides
4 Leptogamasus parvulus 55 Pergamasus barbarus
5 Paragamasus sjmilis 56 Epicriopsis rivus
6 Paragamasus (Aclerogamasus) motasi 57 Leitneria granulata
7 Lysigamasus neoruncatellus 58 Vulgarogamasus remberti
8 Leptogamasus tectegynellus 59 Veigaia paradoxa
9 Lysigamasus truncus 60 Lasioseius lawrencei
10 Pergamasus quisquiliarum 61 Prozercon sellnicki
11 Pergamasus laetus 62 Leptogamasus sp.n.
12 Pergamasus athiasae 63 Olopachys vysotskajae
13 Porrhostaspis lunulata 64 Asca bicornis
14 Parasitus furcatus 65 Pachylaelaps imitans
15 Vulgarogamasus kraepelini 66 Olopachys scutatus
16 Vulgarogamasus oudemansi 67 Zercon carpathicus
17 Vulgarogamasus zschokkei 68 Zercon arcuatus
18 Veigaia nemorensis 69 Leptogamasus variabilis
19 Veigaia exigua 70 Holoparasitus minimus
20 Veigaia cerva 71 Arctoseius brevichelis
21 Veigaia propinqua 72 Pachylaelaps magnus
22 Leioseius magnanalis 73 Epicrius mollis
23 Melichares juradeus 74 Leptogamasus obesus
24 Dendrolaelaps rotundus 75 Zercon triangularis
25 Dendrolaelaps foveolatus 76 Holoparasitus rotulifer
26 Rhodacarellus kreuzi 77 Paragamasus (Adinogamasus) sp.n.
27 Neopodocinum mrciaki 78 Dendrolaelaps willmanni
28 Geholaspis longispinosus 79 Proctolaelaps pomorum
29 Macrocheles decoloratus 80 Prozercon sellnicki
30 Macrocheles montanus 81 Gamasolaelaps sp.
31 Pachylaelaps furcifer 82 Gamasolaelaps multidentatus
32 Pachyseius humeralis 83 Protogamasellus sp.
33 Hypoaspis aculeifer 84 Cheroseius sp.
34 Hypospis nolli 85 Eugamasus monticolus
35 Hypoaspis oblonga 86 Veigaia kochi
36 Eviphis ostrinus 87 Macrocheles insignitus
37 Zercon fageticola 88 Iphidozercon venustulus
38 Zercon romagniolus 89 Dendrolaelaps samsinaki
39 Prozercon kochi 90 Porrhostaspis lunulata
40 Arctoseius semiscissus 91 Paragamasus similis
41 Arctoseius eremitus 92 Rhodacarellus silesiacus
42 Amblyseius sp. 93 Pachylaelaps latior
43 Geholaspis mandibularis 94 Lysigamasus vagabundus
44 Zercon peltatus 95 Holoparasitus excisus
45 Zercon pinicola 96 Pergamasus alpinus
46 Prozercon traegardhi 97 Pachylaelaps pectinifer
47 Prozercon fimbriatus 98 Zercon tatrensis
48 Amblygamasus mirabilis 99 Zercon athiasi
49 Eugamasus magnus 100 Pachylalaps latior
50 Gamasodes spiniger 101 Hypospis montana
51 Veigaia transisalae 102 Leptogamasus doinae

8
Mathematical Modelling in Life Sciences Mites Association

References
[1] Raffaele Giancarlo, David Scaturo, and Filippo Utra, A Tutorial on Computational Cluster Anal-
ysis with Application to Pattern Discovery in Microarray Data,Math.comput.sci.,1 (2008), pp.
655–672.

[2] Frederic Cao, Julie Delon, Agnes Desolneux, Pablo Muse, Frederic Sur, An a contrario approach
to hierarchical clustering validity assessment,CMLA report, 2004-16, November 2004.

[3] King, B., Step-wise clustering procedures, J. Am. Stat. Assoc., 69(1967), pp. 86–101.

[4] Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, On Clustering Validation Techniques,
Journal of Intelligent Information Syastem, 17:2/3 (2001), pp. 107–145.

[5] Francois-Joseph Lapoint and Pierr Legendre, A Statistical Framework to Test the Consensus of
Two Nested Clasifications, Syst. Zool.,39(1)(1990), pp. 1–13.

[6] Murtagh, F., A survey of recent advances in hierarchical clustering algorithms which use cluster
centers, Comput. J., 26(1984), pp. 354–359.

[7] Sneath, P.H.A. and Sokal, R.R., Numerical Taxonomy, Freeman, London, UK, 1973.

[8] Ward, J. H. Jr., Hierarchical grouping to optimize an objective function, J. Am. Stat. assoc.,
58(1963), pp. 236–244.

[9] Real, R. and Vargas, J. M., The probabilistic basis of Jaccard’s index of similarity. Syst. Biol.,
45(1996),pp. 380–385.

[10] Real, R., Tables of significant values of Jaccard’s index of similarity. Misc. Zool., 22.1(1996),pp.
29–40.

Вам также может понравиться