
Forschungsbericht CS-2006-5

M. Templ, P. Filzmoser, and C. Reimann

Institut für Statistik und Wahrscheinlichkeitstheorie, Wien, AUSTRIA

December 2006

Email: P.Filzmoser@tuwien.ac.at

http://www.statistik.tuwien.ac.at

DATA: PROBLEMS AND POSSIBILITIES

Matthias Templ (1,2), Peter Filzmoser (1) and Clemens Reimann (3)

(1) Hauptstr. 8-10, A-1040 Wien, Austria.
    Email: P.Filzmoser@tuwien.ac.at, Tel.: +43 1 58801 10733

(2) Department of Register, Classification and Methodology, Statistics Austria,
    Guglgasse 13, A-1040 Wien, Austria.
    Email: Matthias.Templ@statistik.gv.at, Tel.: +43 1 71128 7327

(3) Geological Survey of Norway, N-7491 Trondheim, Norway.
    Email: Clemens.Reimann@ngu.no, Tel.: +47 73 904 321

ABSTRACT

A large regional geochemical data set of O-horizon samples from a 188,000 km2 area in the

European Arctic, analysed for 38 chemical elements, pH, electrical conductivity (both in a

water extraction) and loss on ignition (LOI, 480 °C), was used to test the influence of

different variants of cluster analysis on the results obtained. Due to the nature of regional

geochemical data (neither normal nor log-normal, strongly skewed, often multi-modal data

distributions), cluster analysis results usually strongly depend on the clustering algorithm

selected. Deleting or adding just one element (variable) in the input matrix can also

drastically change the results of cluster analysis. Different variants of cluster analysis can lead

to surprisingly different results even when using exactly the same input data. Given that

selection of elements is often based on availability of analytical packages (or detection limits)

rather than on geochemical reasoning, this is a disturbing result. Cluster analysis can be used

to group samples and to develop ideas about the multivariate geochemistry of the data set at

hand. It should not be misused as a statistical "proof" of certain relationships in the data. The

use of cluster analysis as an exploratory data analysis tool requires a powerful program

system, able to present the results in a number of easy-to-grasp graphics. In the context of this

work, such a tool has been developed as a package for the R statistical software.

KEY WORDS: Kola Peninsula, O-horizon, cluster analysis, exploratory data analysis, R

1. INTRODUCTION

The principal aim of cluster analysis is to partition observations into a number of groups. A

good outcome of cluster analysis will result in a number of clusters where the observations

within a cluster are as similar as possible while the differences between the clusters are as

large as possible. Cluster analysis must thus determine the number of classes as well as the

memberships of the observations to the groups. To determine the group membership most

clustering methods use a measure of similarity between the observations. The similarity is

usually expressed by distances between the observations in the p-dimensional space of the

variables.

Cluster analysis was developed in taxonomy. The aim was originally to get away from the

high degree of subjectivity when single taxonomists performed a grouping. Since the

introduction of cluster analysis techniques there has been controversy about its merits (see

Davis, 1973 or Rock, 1988 and references there). It was soon discovered that diverse

techniques can yield different groupings, even when using exactly the same data.

Furthermore the addition (or deletion) of just one variable in a cluster analysis can lead to

completely different results. Workers may thus be tempted to experiment with different

techniques and the selection of variables entered until the results of a cluster analysis fits their

preconceived ideas. Cluster analysis is still a popular technique, in part because as a

complicated statistical technique it appears to add a scientific component to a publication.

Readers of papers using cluster analysis should be very aware of these problems. Cluster

analysis can be applied as an "exploratory data analysis tool" to better understand the

multivariate behaviour of a data set. It can, however, never be a "statistical proof" of a certain

relationship between the variables or observations.

While factor analysis (Reimann et al., 2002) uses the correlation matrix for extracting

common "factors" from a given data set most cluster analysis techniques use distance

measures to assign observations to a number of groups. Correlation coefficients lie between

−1 and +1, with 0 indicating linear independence. Distance coefficients lie between 0 and +∞,

with 0 indicating complete identity (Rock, 1988). The use of correlation coefficients requires

not only a normal, but even a multivariate normal, distribution for all the input data (Reimann

et al., 2002). This condition is almost never fulfilled when working with geochemical data

(Reimann and Filzmoser, 2000). Furthermore geochemical data are "closed" data

(compositional data expressed in units like wt.-% or mg/kg, summing up to a constant (100,

1000, 1,000,000)) and multivariate statistical methods may thus deliver biased results (Le

Maitre, 1982, Aitchison, 1986). The use of distance coefficients does not a priori make any

statistical assumptions about the data (except if the data are of categorical order), theoretically

an ideal situation when working with geochemical data. Distance measures will also be

essential for cluster validation, i.e. measuring the quality of a clustering. In theory, it should

be ideal to first use cluster analysis on a large geochemical dataset to extract more

homogeneous data subsets (groups) and to then perform factor analysis or discriminant

analysis on these homogeneous data subsets to study their multivariate data structure.

Especially for data sets with many variables, it has been suggested (e.g. Everitt, 1974) to first

use principal component analysis to reduce the dimensionality of the data and to then perform

cluster analysis on the first few principal components. This approach was criticised because

clusters embedded in a high-dimensional variable space will not be properly represented by a

smaller number of orthogonal components (e.g. Yeung and Ruzzo, 2001).

Clustering methods also exist that are not based on distance measures, like model-based

clustering (Fraley and Raftery, 1998). These techniques usually find the clusters by

optimising a maximum likelihood function. The implicit assumption is that the data points

forming the single clusters are multivariate normally distributed, and the algorithm tries to

estimate the parameters from the normal distribution as well as the membership of each

observation to each cluster.

With geochemical data cluster analysis can be used in two different ways: it can be used to

cluster the variables (e.g. to detect geochemical relations between the variables) and it can be

used to cluster the observations (e.g. to assign soil samples to certain parent materials) to

come to more homogeneous data subsets for further data analysis.

Here we will apply a variety of different methods of cluster analysis to geochemical data

from a large regional scale geochemical data set containing 617 observations and 40

variables. The objective of this study was to investigate:

- Has the point where cluster analysis can (and should) be applied to such a high-dimensional

data set been reached? If yes, what are the prerequisites for applying cluster analysis to such

data sets?

- What are the results of cluster analysis when such a massive data set is investigated?

- What is the influence of the actual method used and is there an ideal method for regional

geochemical data?

- Is there an objective way to determine the optimum number of clusters extracted?

- Is an objective decision on the number and choice of elements entered into the cluster

analysis possible?

- Which distance measures are most suitable for distance based clustering methods?

- Is there a graphical way to evaluate the stability of clusters?

- Can objective, reliable, statistically significant results be obtained that can provide proof of a

hypothesis or explain the multivariate relation between the elements or observations, or is

cluster analysis rather an exploratory data analysis tool that should only be used to generate

ideas (the proof needs to come from elsewhere)?

The paper is organised as follows: Section 2 gives a detailed description of the example data

set. Data problems for cluster analysis are discussed in Section 3. Sections 4-8 are devoted to

the methodology of cluster analysis. For rapid cluster analysis and plotting of the results the

package clustTool running under R was developed (see Section 9). All algorithms are thus

easily available via the internet (e.g. at the R project site: www.r-project.org). Results and

possibilities for graphical presentations of the results are shown in Section 10. The final

Section 11 concludes.

2. MATERIAL AND METHODS

THE KOLA PROJECT

From 1992-1998 the Geological Surveys of Finland (GTK) and Norway (NGU) and Central

Kola Expedition (CKE), Russia, carried out a large, international multi-media, multi-element

geochemical mapping project, covering 188,000 km2 north of the Arctic Circle. The entire

area between 24° and 35.5°E up to the Barents Sea coast (Fig. 1) was sampled during the

summer of 1995. Results of the Kola Ecogeochemistry project are documented on a web

site (http://www.ngu.no/Kola) and in a geochemical atlas (Reimann et al., 1998). One sample

material for the project was the O-horizon developed on top of Podsol profiles, representing

the interplay of pedosphere, atmosphere and biosphere and as such reflecting surface

processes ranging from natural (i.e. input of sea salts, influence of vegetation zones) to

industrial contamination. The average sample density was 1 site per 300 km2. Detailed maps

of the geology, quaternary geology, topography, vegetation zones and climatic conditions in

the survey area can be found in Reimann et al. (1998).

Figure 1: General location map of the study area for the Kola Project (Reimann et al., 1998).

Locations named in the text are given.

While the western part of the project area (N-Finland and Norway) is almost pristine, the

Russian part is heavily industrialised. This includes several important mines, e.g. Cu/Ni ores

are mined near Zapoljarnij, Fe-ores near Olenegorsk, Apatite near Apatity (Fig. 1), and

related mineral processing plants. In terms of environmental impact the most important ore

roasters and smelters, responsible for major emissions of Cu, Ni, Co, V and many other

metals (Reimann et al., 1998), are situated near Zapoljarnij, Nikel and Monchegorsk (see Fig.

1).

SAMPLING, SAMPLE PREPARATION AND ANALYSES

A detailed description of sample site selection criteria and sampling methods is given in Äyräs

and Reimann (1995) and in Reimann et al. (1998). The O-horizon was sampled from the

uppermost 3 cm of the organic horizon (usually litter), avoiding living vegetation, using a

special tool (see Reimann et al., 1998) as a composite sample from a 50 x 50 m area

surrounding a complete podzol soil profile. A field duplicate of all samples was taken, some

100 m distant, at every 15th site.

Analytical procedures and all analytical results are detailed in Reimann et al. (1998). Quality

control procedures followed the methods suggested in Reimann and Wurzer (1986) and

results are documented in Reimann et al. (1998). The O-horizon was collected at 617 sites. A

summary of the elements of the O-horizon is given in Table 1. The samples were air dried

and sieved to < 2 mm using nylon screening. Carbon, hydrogen and nitrogen were determined

using a CHN-analyser according to ISO standard 10694. Electrical conductivity and pH were

determined in a water extraction. To obtain total element concentrations in the organic

fraction 0.4 g of sample were digested with 10 ml of concentrated nitric acid. This extract was

analysed by ICP-AES, inductively coupled plasma mass spectrometry (ICP-MS) and GFAAS

for 36 elements (see Niskavaara, 1995).

Table 1: Elements and summary statistics (minimum (MIN), median (MED), maximum

(MAX) and spread (expressed as median absolute deviation MAD) for the Kola O-horizon

data set used here (from Reimann et al., 1998). In addition the detection limit (DL) and the

number of samples below detection (expressed in %) are given.

3. POSSIBLE DATA PROBLEMS IN THE CONTEXT OF CLUSTER ANALYSIS

MIXING MAJOR, MINOR AND TRACE ELEMENTS

In multi-element analysis of geological materials one usually deals with elements occurring in

very different concentrations. In rock geochemistry, the chemical elements are divided into

"major", "minor" and "trace" elements. Major elements are measured in % or tens of %,

minor elements are measured in about 1 % amounts, and trace elements are measured in ppm,

or even ppb. This may become a problem in multivariate techniques that consider all

variables simultaneously because the variable with the greatest variance will have the greatest

influence on the outcome. Variance is obviously related to absolute magnitude. As one

consequence, one should not mix variables quoted in different units in one and the same

multivariate analysis (Rock, 1988). Transferring all elements to just one unit (e.g. mg/kg) is

not an easy solution to this problem, as the major elements occur in much greater amounts

than the trace elements. To enter geochemical raw data, including major, minor and trace

elements into cluster analysis does not make sense because it can be predicted that the minor

and trace elements would have almost no influence on the result. The same even applies to

the major elements: if C (or LOI as a proxy for "organic content") is entered together with the

other major elements it will completely govern the clustering just because of its much greater

concentrations. The data matrix will thus need to be "prepared" for cluster analysis using

appropriate data transformation and standardisation techniques.

DATA OUTLIERS

Regional geochemical data sets practically always contain outliers. The outliers should not

simply be ignored but they have to be accommodated because they contain important

information about data quality and unexpected behaviour in the region of interest. In fact,

finding data outliers that may be indicative of mineralisation (in exploration geochemistry) or

of contamination (in environmental geochemistry) is one of the major aims of geochemical

surveys. Outliers can have a severe influence on cluster analysis, because they can affect

proximity measures and obscure clustering tendencies. Outliers should thus be removed prior

to entering a cluster analysis or statistical clustering methods capable of handling outliers

should be used. This is rarely done. Finding data outliers is not a trivial task, especially in

high dimensions. One way of identifying such outliers is to compute robust Mahalanobis

distances, i.e. Mahalanobis distances on the basis of robust estimates of location and scatter

(Filzmoser et al., 2005).
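The robust Mahalanobis approach can be sketched briefly in code. The authors work in R; the following is an illustrative Python stand-in using scikit-learn's Minimum Covariance Determinant (MCD) estimator for the robust location and scatter, with simulated data and a common 97.5% chi-square cutoff (both assumptions, not from the paper):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))      # 200 simulated "clean" observations, p = 5
X[:10] += 6.0                      # shift the first 10 rows to act as outliers

mcd = MinCovDet(random_state=0).fit(X)      # robust location/scatter via MCD
d2 = mcd.mahalanobis(X)                     # squared robust Mahalanobis distances
cutoff = chi2.ppf(0.975, df=X.shape[1])     # chi-square cutoff (an assumed convention)
outliers = d2 > cutoff
print(int(outliers[:10].sum()))             # the shifted rows are flagged
```

Observations flagged this way could then be removed (or down-weighted) before entering a cluster analysis, as the text recommends.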

CENSORED DATA

There is a further problem that often occurs when working with geochemical data: the

detection limit problem. For some determinations a proportion of all results are below the

lower limit of detection of the analytical method, i.e. the data are censored. For statistical

analysis these results are often set to a value of the detection limit. However, a sizeable

proportion of all data with an identical value can seriously influence any cluster analysis

procedure. For the study datasets several variables had more than 25% of the data below

detection. It is very questionable as to whether or not such elements should be included at all

in a cluster analysis. Unfortunately it is often the elements of greatest interest that contain the

highest number of censored data (e.g., Se; see Tab. 1); the temptation to include these in a

cluster analysis is thus high. Here all elements with more than 5% of all values below

detection have been omitted from cluster analysis (Be and Se; see Tab. 1).

DATA TRANSFORMATION AND STANDARDISATION

Cluster analysis in general does not require that the data be normally distributed. However, it

is advisable that heavily skewed data are first transformed to a more symmetric distribution.

If a good cluster structure exists for a variable, we can expect a distribution, which has two or

more modes. A transformation to more symmetry will preserve the modes but remove large

skewness.

Most geochemical textbooks still claim that for geochemical data a log-transformation is most

suitable. Recently Reimann and Filzmoser (2000) have shown that very few geochemical

variables will indeed follow a (log)-normal distribution. Each single variable needs,

unfortunately, to be considered for transformation and different transformations, with the

Box-Cox transformation (Box and Cox, 1964) being the most universal choice, need to be

considered. The most practical decision guide whether to transform or not and how to

transform should be the data distribution: it should be close to symmetry prior to entering

cluster analysis. Even Box-Cox transformations of all single variables do not guarantee

symmetry of the resulting multivariate data distribution, but more closeness to symmetry (or

removal of strong skewness) will in general improve the cluster results.

An additional standardisation is needed if the variables show a striking difference in the

amount of variability (see discussion above, major, minor and trace elements). Different

methods, all having advantages and disadvantages, exist to accommodate this requirement.

The most universal method is the z-transformation, which builds on the mean and standard

deviation of the data. When working with geochemical data a robustified version, using

median and MAD, should be preferred.
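These two preparation steps can be sketched as follows. This is an illustrative Python example (not the authors' R code) on simulated skewed data: each variable is Box-Cox transformed towards symmetry, then standardised robustly with median and MAD:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.lognormal(mean=1.0, sigma=0.8, size=(300, 3))   # strongly right-skewed, positive

# Box-Cox transform each variable towards symmetry (lambda chosen by maximum likelihood)
Xt = np.column_stack([stats.boxcox(X[:, j])[0] for j in range(X.shape[1])])

# robust z-transformation: centre by the median, scale by the MAD
med = np.median(Xt, axis=0)
mad = stats.median_abs_deviation(Xt, axis=0, scale="normal")
Z = (Xt - med) / mad

print(np.abs(stats.skew(Z, axis=0)))   # skewness close to zero after transformation
```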

4. DISTANCE MEASURES

A key issue in most cluster analysis techniques is how best to measure distance between the

observations (or variables). Note that "distance" in cluster analysis has nothing to do with

geographical distance between two observations but is rather a measure of similarity between

observations in the multivariate space defined by the entered variables. Many different

distance measures exist (Bandemer and Näther, 1992). Modern software implementations of

cluster algorithms can accommodate a variety of different distance measures because the

distances rather than the data matrix are taken as input, and the algorithm is applied to the

given input.

For clustering the observations the Euclidean distance or the Manhattan distance is the most

frequent choice. The latter measures the distance parallel to the variable axes, rather than

directly (Euclidean), and the cluster results are sometimes more stable. Usually both distance

measures lead to comparable results. Other distance measures like the Gower distance

(Gower, 1966), the Canberra distance (Lance and Williams, 1966), correlation based distance

measures or a distance measure based on the random forest proximity measure (Breiman,

2001) can give completely different cluster results.
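How much the choice of measure matters can be seen on even a tiny data matrix. A minimal sketch using scipy (an illustrative example with made-up observations; the Gower and random forest distances are not in scipy and are omitted here):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])   # three toy observations

# pairwise distance between the first two observations under each measure
dists = {m: squareform(pdist(X, metric=m))[0, 1]
         for m in ("euclidean", "cityblock", "canberra")}   # cityblock = Manhattan
print(dists)   # euclidean 5.0 (direct), cityblock 7.0 (parallel to the axes), canberra 2.0
```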

To demonstrate the effect of the distance measure used for clustering geochemical data the

average linkage clustering algorithm (see below) was applied to the Kola O-horizon data,

using all 40 variables with less than 5% of values below the detection limit, log-transformed

and standardised. The results were retained for a fixed number of clusters (here 6 clusters were

always sought) for reasons of comparability. It is desirable that a similar data set will give

approximately the same cluster result. Therefore,

1. Bootstrap samples from the original data (sample with replacement) are drawn;

2. The bootstrap samples are clustered with the same method, and the same number of

clusters is extracted;

3. The results are compared to the cluster results obtained from the original data using

the adjusted Rand index (Hubert and Arabie, 1985) as a measure of similarity.
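The three steps above can be sketched in code. This is an illustrative Python version on simulated data; k-means stands in for the paper's average linkage only because it offers a predict step for mapping each bootstrap clustering back onto all observations (both substitutions are assumptions, not the authors' setup):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# simulated stand-in data with a clear group structure (assumption)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.7, random_state=0)
k = 4
base = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))              # 1. draw a bootstrap sample
    boot = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])  # 2. recluster it
    labels = boot.predict(X)                           #    map the result back to all points
    scores.append(adjusted_rand_score(base, labels))   # 3. compare via the adjusted Rand index

print(np.median(scores))   # close to 1 for a stable clustering
```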

Figure 2 shows boxplots of the resulting Rand index of the clustered bootstrap samples. A

high value of the Rand index indicates very similar results, a low value means completely

different results. For this example, the Euclidean, Manhattan, Gower and Canberra distances

lead to stable cluster results whereas the random forest distance yields highly unstable

clusters.

Figure 2: Average linkage clustering for 40 variables of the O-horizon data (log-transformed,

standardised). The cluster results for different methods and a fixed number of clusters are

compared with the corresponding results for bootstrap samples of the data using the Rand

index. The boxplots show the resulting Rand indices.

Similar simulations were also undertaken for different numbers of clusters, for other

clustering algorithms, and for other distance measures. The conclusion was that Euclidean

and Manhattan distance measures gave the most stable clusters.

5. CLUSTERING OBSERVATIONS

One of the main problems with cluster analysis is that a multitude of different clustering

methods exists. The observations need to be grouped into classes (clusters). If each

observation is allocated into only one (of several possible) cluster(s), this is called

"partitioning". Partitioning will result in a pre-defined (user-defined) number of clusters. It is

also possible to construct a hierarchy of partitions, i.e. group the observations into 1 to n

clusters (n = number of observations). This is called hierarchical clustering. Hierarchical

clustering always delivers n cluster solutions, and based on these solutions the user has to

decide which result is most appropriate.

Two principally different procedures exist. An observation can be allocated to just one cluster

(hard clustering) or be distributed among several clusters (fuzzy clustering). Fuzzy clustering

allows that one observation belongs to a certain degree to several groups. In terms of applied

geochemistry this procedure will often deliver the more interesting results because it reveals

if one observation is influenced by several factors. The cluster solution will then show to

what degree the observations are influenced by the different factors. Here the factors or

processes are represented by observations that are clustered together in the data space.

HIERARCHICAL METHODS

Input to most hierarchical clustering algorithms is a distance matrix (distances between the

observations). The widely used agglomerative techniques start with single object clusters

(each observation forms its own cluster) and enlarge the clusters stepwise. The

computationally more intensive reverse procedure starts with one cluster containing all

observations and splits the groups step by step. This procedure is called divisive clustering.

At the beginning of an agglomerative algorithm each observation forms its own class, leading

to n single object clusters. The number of clusters is reduced by one by combining (linking)

the most similar classes at each step of the algorithm. The similarity of the combined pair, a

new class, can be measured to all other classes, and the next two most similar classes linked,

and so on. At the end of the process there is only one single cluster left, containing all

observations. A number of different methods are available for linking two clusters. Best

known are average linkage, complete linkage and single linkage. The average linkage method

considers the averages of all pairs of distances between the observations of two clusters. The

two clusters with the minimum average distance are combined into one new cluster.

Complete linkage looks for the maximum distance between the observations of two clusters.

The clusters with the smallest maximum distance are combined. Single linkage considers the

minimum distance between all observations of two clusters. The clusters with the smallest

minimum distance are linked. Single linkage will result in cluster chains because for linkage

it is sufficient that only two objects of different clusters are close together. Complete linkage

will result in very homogeneous clusters in the first stages of agglomeration; however, the

resulting clusters will be small. Average linkage is a compromise between the two other

methods and usually performs best in typical applied geosciences applications.

Because the cluster solutions grow tree-like (starting with the roots and ending upwards with

the trunk) results are often displayed in a graphic called the dendrogram (see Fig. 4, 5 for

clustering variables). Horizontal lines indicate the linkage of two objects or clusters, and thus

the vertical axis presents the associated height or similarity as a measure of distance. The

objects are arranged in such a way that the branches of the tree do not overlap. Linking of two

groups at a large height indicates strong dissimilarity (and vice versa). Therefore, a clear

cluster structure would be indicated if observations are linked at a very low height, and the

distinct clusters are linked at a considerably higher value (long roots of the tree). Cutting the

dendrogram at the height corresponding to this visible number of clusters allows assigning

the objects to the clusters. Visual inspection of a dendrogram is often helpful in obtaining an

initial idea of the number of clusters to be generated by a partitioning method.
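The agglomerative procedure and the dendrogram cut described above can be sketched with scipy (an illustrative Python example on simulated groups, using average linkage as recommended in the text):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
# three well-separated simulated groups in two dimensions (illustrative data)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])

Z = linkage(pdist(X, metric="euclidean"), method="average")  # agglomerative, average linkage
labels = fcluster(Z, t=3, criterion="maxclust")              # cut the dendrogram into 3 clusters
print(np.unique(labels))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree described in the text
```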

PARTITIONING METHODS

In contrast to hierarchical clustering methods, partitioning methods require that the number of

resulting clusters be pre-determined. As noted above, when nothing is known about the

observations it can be useful to first carry out a hierarchical clustering. The other possibility is

to partition the data into different numbers of clusters and evaluate the results (see below).

For regionalised data a more subjective but still reasonable approach to evaluation is to


visually inspect the location of the resulting clusters in a map. This exploratory approach can

often reveal interesting data structures.

A very popular partitioning algorithm is the k-means algorithm. It attempts to minimise the

average squared distance between the observations and their cluster centres or centroids.

Starting from k initial cluster centroids (e.g. random initialisation by k observations), the

algorithm assigns the observations to their closest centroids (using e.g. Euclidean distances),

recomputes the cluster centres, and iteratively reallocates the data points to the closest

centroid. Several algorithms exist for this purpose, those of Hartigan (1975) and MacQueen

(1967) are the most popular. There are also some modifications of the k-means algorithm.

Manhattan distances are used for k-medians and the centroids are the medians of each cluster.

Hard competitive learning works by randomly drawing an observation from the data and

moving the closest centre towards that point (e.g., Ripley, 1996). Martinetz et al. (1993) have

introduced "neural gas"; this method is similar to hard competitive learning, but in addition to

the closest centroid the second closest centroid is also moved at each iteration. A new, highly

extensible toolbox for centroid clustering was recently implemented in R (Leisch, 2006).

Here the user can easily try out almost any arbitrary distance measure and centroid

computations for data partitioning.
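A minimal k-means run can look as follows (an illustrative Python sketch with scikit-learn on simulated data, not one of the specific implementations cited above). The `n_init` parameter repeats the algorithm from several random initialisations and keeps the solution with the smallest total within-cluster sum of squares, which anticipates the initialisation problem discussed below:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.6, random_state=7)

# 25 random starts; the best solution (lowest within-cluster sum of squares) is kept
km = KMeans(n_clusters=5, n_init=25, random_state=7).fit(X)
print(km.cluster_centers_.shape, km.inertia_)   # 5 centroids; minimised criterion value
```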

Kaufmann and Rousseeuw (1990) proposed several clustering methods which are

implemented in a number of software packages. The partitioning method PAM (partitioning

around medoids) minimises the average distances to the cluster medoids. It is thus similar to

the k-medians method but allows the use of different distance measures. A similar method

called CLARA (Clustering Large Applications) is based on random sampling. It saves

computation time and is particularly appropriate for larger data sets.

The result of all these algorithms depends on the initial cluster centres, which are often a

random selection of k of the observations. If bad initial cluster centres are selected, the

iterative partitioning algorithms can lead to a local optimum that can be far away from the

global optimum. This can be avoided by applying the algorithms with different random

initialisations, and then selecting the best (according to a validity measure, see below) or most

stable result.

Another way to approximate the global optimum is bootstrap aggregation, called bagging

(Breiman, 1996). This bootstrap method generates new data sets from the available data set of

the same size by a random selection of observations with replacement from the data set. The


central idea of the bagged clustering algorithm bclust (Leisch, 1998, 1999) is to repeatedly

apply a clustering algorithm (e.g. k-means) on bootstrap data sets, combine the resulting

centroids to a new data set, run a hierarchical clustering algorithm on this new data set and cut

the resulting dendrogram to get a partition into k clusters. The observations are then assigned

to the closest centre.
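The bagged-clustering idea can be sketched step by step. This is an illustrative Python reimplementation of the scheme described above (the authors use the R function bclust; the simulated data and parameter choices are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [7, 0], [0, 7]],
                  cluster_std=0.8, random_state=5)
k, B = 3, 15
rng = np.random.default_rng(5)

# k-means on B bootstrap samples; pool all resulting centroids into one data set
cents = np.vstack([
    KMeans(n_clusters=k, n_init=5, random_state=b)
    .fit(X[rng.integers(0, len(X), len(X))]).cluster_centers_
    for b in range(B)
])

# hierarchically cluster the pooled centroids and cut the dendrogram into k groups
groups = fcluster(linkage(pdist(cents), method="average"), t=k, criterion="maxclust")

# assign each observation the group of its closest pooled centroid
labels = groups[np.argmin(cdist(X, cents), axis=1)]
print(len(np.unique(labels)))
```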

MODEL-BASED METHODS

A method that is not based on distances between the observations but on certain models

describing the shape of the clusters is called model-based clustering (Fraley and Raftery,

2002). The Mclust algorithm selects the cluster models (e.g. elliptical cluster shape) and the

number of clusters and determines the cluster memberships of all observations. The

estimation is achieved using the Expectation-Maximization (EM) algorithm (Dempster et al,

1977). The EM algorithm is executed on several numbers of clusters and with several sets of

constraints on the covariance matrices of the clusters. Finally, the combination of model and

number of groups that leads to the highest BIC (Bayesian Information Criterion) value can be

chosen as the optimal model (Fraley and Raftery, 1998). The BIC value can also be computed

for each cluster separately.
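The model/number-of-groups search can be sketched with scikit-learn's GaussianMixture standing in for Mclust (an illustrative assumption; note that sklearn's `bic()` is defined so that lower is better, whereas Mclust reports BIC with the opposite sign):

```python
import itertools
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=1.0, random_state=2)

# fit EM for 1..5 groups under several covariance constraints; keep the best BIC
best = min(
    (GaussianMixture(n_components=g, covariance_type=cov, random_state=0).fit(X)
     for g, cov in itertools.product(range(1, 6), ("spherical", "diag", "full"))),
    key=lambda m: m.bic(X),
)
print(best.n_components, best.covariance_type)   # the BIC-optimal combination
```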

FUZZY METHODS

In fuzzy clustering, the observations are not clearly allocated to one of the clusters, but they

are distributed in certain amounts among all clusters. Thus, for each observation a

membership coefficient to all clusters is determined, providing information on how strong the

observation is associated with each cluster. The membership coefficients are usually

transformed to the interval [0,1], and they can be visualised for example by using a grey

scale. A popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm, developed

by Dunn (1973) and improved by Bezdek (1981), which calculates the prototypes, most

typical group characteristics, of the clusters and membership coefficients for each observation

to the clusters. Another fuzzy clustering algorithm is the Gustafson-Kessel (GK) algorithm

(Gustafson and Kessel, 1979). While FCM identifies clusters that tend to be rather spherical,

GK is able to detect elliptically shaped clusters, because the FCM algorithm replaces the

Euclidean distance by the Mahalanobis distance. The Gath-Geva (GG) algorithm (Gath and

Geva, 1989), also called the Gaussian mixture decomposition algorithm, is even more

flexible. It is an extension of the GK algorithm which can also deal with different cluster

sizes and densities. The GK and the GG algorithms are freely available at http://www.fuzzyclustering.de. Just as for partitioning methods, the number of clusters resulting from fuzzy

clustering needs to be chosen by the user.
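The alternating membership and prototype updates at the core of FCM can be sketched as follows; this is a minimal pure-Python illustration with the usual fuzzifier m = 2 and a simple deterministic initialisation (toy coordinates invented for the example), not a reference implementation:

```python
import math

def fcm(points, k, m=2.0, iters=100):
    """Fuzzy c-means: returns prototypes and a membership matrix u, where
    u[i][j] in [0, 1] says how strongly point i belongs to cluster j."""
    n, dim = len(points), len(points[0])
    # simple deterministic start: k points spread over the data set
    centers = [list(points[(i * n) // k]) for i in range(k)]
    u = [[0.0] * k for _ in range(n)]
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        for i, p in enumerate(points):
            d = [max(1e-12, math.dist(p, c)) for c in centers]
            for j in range(k):
                u[i][j] = 1.0 / sum((d[j] / dl) ** (2.0 / (m - 1.0)) for dl in d)
        # prototype update: mean weighted by the memberships to the power m
        for j in range(k):
            w = [u[i][j] ** m for i in range(n)]
            tot = sum(w)
            centers[j] = [sum(w[i] * points[i][t] for i in range(n)) / tot
                          for t in range(dim)]
    return centers, u

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, u = fcm(pts, k=2)
# each row of u sums to 1; points deep inside a cluster get membership near 1
print([round(max(row), 2) for row in u])
```

GK differs from this sketch mainly in the distance: it uses a cluster-specific Mahalanobis distance instead of `math.dist`, which is what lets it find elliptical clusters.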


6. CLUSTERING VARIABLES

Instead of clustering the observations it is also possible to cluster the variables in order to find

groups of variables that show similar behaviours. All of the methods discussed above can be

used for clustering the variables. One of the best methods to display the results of clustering

variables is the dendrogram (see Fig. 4, 5), calling for hierarchical clustering.
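Clustering variables instead of observations only means computing distances between the columns of the data matrix. A minimal pure-Python sketch (naive average linkage on Manhattan distances between standardised columns, with invented toy data rather than the Kola variables):

```python
def standardise(col):
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5 or 1.0
    return [(v - mean) / sd for v in col]

def average_linkage(dist, k):
    """Naive agglomerative clustering: repeatedly merge the two closest
    groups (average linkage) until k groups remain."""
    groups = [[i] for i in range(len(dist))]
    def d(a, b):  # average linkage: mean pairwise distance between groups
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(groups) > k:
        a, b = min(((a, b) for a in range(len(groups))
                    for b in range(a + 1, len(groups))),
                   key=lambda ab: d(groups[ab[0]], groups[ab[1]]))
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# toy data: columns 0/1 move together, columns 2/3 move together
rows = [(1.0, 1.1, 10.0, 9.0), (2.0, 2.1, 8.0, 7.5), (3.0, 2.9, 6.0, 6.2),
        (4.0, 4.2, 4.0, 4.1), (5.0, 4.8, 2.0, 2.2)]
cols = [standardise([r[j] for r in rows]) for j in range(4)]
# Manhattan distance between the standardised variables (columns)
dist = [[sum(abs(x - y) for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
groups = average_linkage(dist, k=2)
print(sorted(sorted(g) for g in groups))  # prints [[0, 1], [2, 3]]
```

Recording the merge heights instead of stopping at k groups would give the information needed to draw the dendrogram.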

7. EVALUATION OF CLUSTER VALIDITY

Because there is no universal definition of clustering, there is no universal measure with

which to compare clustering results. However, evaluating cluster quality is essential since any

clustering algorithm will produce some clustering for every dataset. Validity measures should

support the decision as to the number of natural clusters, and they should also be helpful for

evaluating the quality of the individual clusters. Therefore, validity measures should provide

a value for each single cluster, and they should also return a value for judging the quality of

the overall clustering result.

As mentioned above, a rather simple method of evaluating quality of clustering for

regionalised data is to check the distribution of the resulting clusters on a map. The

distribution of the clusters can then be evaluated against known properties of the survey area.

Clusters resulting in geographically homogeneous subgroups are also more

likely to have a meaning than clusters resulting in "geographical noise".

There are many different statistical cluster validity measures. Two different concepts of

validity criteria, external and internal, need to be considered.

External criteria compare the partition found with clustering with a partition that is known a

priori. The most popular external cluster validity indices are the Rand, Jaccard, Fowlkes and

Mallows, and the Hubert indices (see e.g., Gordon, 1999, or Halkidi et al., 2002).
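The Rand index, for instance, is simply the fraction of point pairs on which the two partitions agree; a short Python sketch (labels invented for illustration):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs that are treated the same way by
    both partitions (together in both, or separated in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

found = [0, 0, 1, 1, 2, 2]      # partition found by a clustering algorithm
known = [0, 0, 1, 1, 1, 2]      # partition known a priori, e.g. from geology
print(rand_index(found, known))  # prints 0.8; 1.0 would mean identical partitions
```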

Internal criteria are cluster validity measures which evaluate the clustering result of an

algorithm by using only quantities and features inherent in the data set. Most of the internal

validity criteria are based on within cluster sum of squares and between cluster sum of

squares. Well known indices are the Calinski-Harabasz index (Calinski and Harabasz, 1974),


Hartigan's indices (Hartigan, 1975), or the Average Silhouette Width of Kaufman and

Rousseeuw (1990).

From a practical point of view, an optimal value of the validity measure does not imply that

the resulting clusters are meaningful. Some of these criteria evaluate only the allocation of the

data to the clusters. Other criteria evaluate the form of the clusters or how well the clusters

are separated. The resulting clusters only correspond to the best partition according to the

validity measure selected. The measures deliver good results when a very clear cluster

structure exists in the data. Unfortunately, when working with geochemical data such good

clusters are rare. Thus cluster quality measures fail time and again when working with such

data, and the best approach to evaluating cluster quality is often by just looking at the results

in a map.

Instead of visual inspection of the cluster results in a map, the resulting clusters could be

evaluated according to the geographical coordinates. Since the validity measure would be

optimised with compact clusters, the compactness of the clusters in the map would be judged.

However, our experience with this approach mostly results in the selection of the same

optimal number of clusters as with the decision when calculating the validity measures on the

data used for clustering. Nevertheless, when calculating validity measures using geographical

coordinates the choice of the optimal number of clusters is easier in most of the cases,

because the characteristics of the validity measures are more distinctive.

Figure 3 shows a plot of validity measures resulting from clustering the log-transformed and

scaled O-horizon data (40 variables). A variety of clustering algorithms with different

distance measures were applied. The evaluation of the performance of the different

algorithms was made with the simplest validity measure, namely the average within

cluster sum of squares divided by the average between cluster sum of squares. Figure 3 shows

plots of this measure against the number of clusters. Small values are preferable because this

indicates homogeneous and well separated clusters. Typically, the optimal number of clusters

is indicated by a knee in the plot. One should select that cluster number before or after the

validity measure increases significantly. For example, according to Figure 3 the optimal

number of clusters for k-means clustering using the Euclidean distance is 5. Note that the

algorithm Mclust (Fig. 3 right) is not distance based, therefore only one graph can be plotted.
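This validity measure is easy to compute. The following toy example (a simple Lloyd k-means with deterministic initialisation and invented coordinates, an illustration rather than the actual computation behind Figure 3) evaluates the average within- to between-cluster sum of squares ratio for several numbers of clusters:

```python
import math

def kmeans(points, k, iters=50):
    # simple Lloyd k-means with a deterministic start
    centers = [list(points[(i * len(points)) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

def ss_ratio(points, labels, centers):
    """Average within-cluster sum of squares divided by the average
    between-cluster sum of squares; small values are preferable."""
    k = len(centers)
    overall = [sum(c) / len(points) for c in zip(*points)]
    within = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels)) / k
    between = sum(labels.count(j) * math.dist(centers[j], overall) ** 2
                  for j in range(k)) / k
    return within / between

# three tight, well separated groups: the measure should drop sharply at k = 3
pts = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (5, 9)]
       for dx, dy in [(0, 0), (0.3, 0), (0, 0.3), (-0.3, -0.2)]]
ratios = {}
for k in (2, 3, 4):
    labels, centers = kmeans(pts, k)
    ratios[k] = ss_ratio(pts, labels, centers)
    print(k, round(ratios[k], 4))
```

Plotting `ratios[k]` against k and looking for the knee is exactly the reading of Figure 3 described in the text.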

Figure 3: Resulting validity measures of clustering the 40 variables of the O-horizon data

(log-transformed, standardised) with different methods and based on different distance


measures (rf is the random forest distance). The validity measure changes with varying

number of clusters. The optimal number of clusters is indicated by a knee before or after the

measure increases significantly.

It is obvious that the graphs in Figure 3 do not give an unequivocal answer on the optimal

number of clusters. Even for a single clustering algorithm the optimal number of clusters

strongly depends on the distance measure chosen. When inspecting the resulting clusters on a

map, interesting structures often appear, even in instances where a number of clusters other than

the indicated optimum was chosen. This somewhat disappointing result demonstrates that cluster

analysis should only be used in an exploratory manner, by varying parameters in the

algorithm and looking at the results on maps.

When using a multivariate technique, variable selection is often employed in order to reduce

the dimensionality of a data set or to learn something about the internal structure between the

variables and/or observations. Often it may appear desirable to perform cluster analysis with

all available observations and variables. However, the addition of only one or two irrelevant

variables can have drastic consequences in identifying the clusters. The inclusion of only one

irrelevant variable may be enough to hide the real clustering in the data (Gordon, 1999). The

selection of the variables to enter into a cluster analysis is thus of considerable importance

when working with applied earth science data sets containing a multitude of variables.

Another reason for variable selection may be a desire to focus the analysis on a specific

geochemical process. Such a process is usually expressed by a combination of variables, and

using these variables for clustering permits identifying those observations or areas where the

process is either present or non-existent. The variables could simply be chosen based on

expert knowledge. It is also possible to apply variable clustering (see above) and select

variables which are in close relation (one branch of the cluster tree), either to highlight a certain

process or to study it in more detail.

Variable clustering can of course also be used to select single key variables from each

important cluster to simply reduce dimensionality for clustering observations.


As an example for variable clustering the 40 elements of the Kola O-horizon data set were

clustered. Following a log-transformation and standardisation of the data, average linkage

based on the Manhattan distances between the variables was applied resulting in the

dendrogram in Figure 4. Variables that are in close relation form a tree in the dendrogram.

Thus, when focusing on certain processes, variables in a tree can be selected for the further

clustering of observations. For example, the elements Cu and Ni, followed by Co are the

three main elements emitted by the Russian smelters. Traces of As and Mo are also emitted

and a sizable Mo-anomaly is surrounding Monchegorsk (Reimann et al., 1998). For As and

Mo the emissions from industry appear to be the main process determining their regional

distribution in the survey area and they thus link on the Cu, Ni and Co-branch. Interestingly

V, which is also an important part of the industrial emissions, links on the Cr and Al, Fe, Sc,

La, Y, Be, Th, U-branch which signifies the content of minerogenic material in the organic

soil. Here two important processes appear in the survey area: the input of dust from industrial

emissions and the input of geogenic dust. On the regional scale the latter is the more

important process than the anthropogenic impact even for an element like V. In contrast the

regional distribution of Mg, B, Na, Ca, Sr and the pH is dominated by the input of sea spray

along the coast of the Barents Sea, while C, H, LOI, S, N, P and Hg are all related to the

amount of organic material in the sample. That they appear linked to the sea spray elements at

a high level may be caused by the fact that the organic material decays more slowly near the

coast due to the wet climate. The Mn, Zn, electrical conductivity, K, Rb-branch is primarily a

plant nutrient status indicator. Plants near Monchegorsk and Nikel/Zapoljarnij have a poor

nutrient status and this may be the reason that this branch links to the "contaminants" branch.

The regional distribution of Pb, Bi, Tl, Cd, Ag, Sb, Ba is dominated by the three main

vegetation zones observed in the survey area (subarctic tundra, subarctic birch forest, boreal

forest) and some of these elements are also emitted at the smelters, explaining the location of

this branch near plant nutrients and major contaminants.

Figure 4: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)

using average linkage based on the Euclidean distance between the variables. The result is

shown by a dendrogram. Variables on a tree can be selected for further clustering of the

objects.

However, if single linkage is used instead of average linkage the dendrogram in Figure 5

emerges, which is substantially different from the dendrogram in Figure 4 and would suggest

a number of different relations. Although some of the key results are the same (e.g., the

existence of a contamination (Cu, Ni and Co) and a sea spray (Mg, B, Na, pH) branch),


several elements enter other branches than in Figure 4. The dendrogram displayed in Figure 5

would thus lead to a substantially altered interpretation of the behaviour of some of the

elements. For example, Pb, Bi, Tl and Rb now enter the contamination branch; the result

in the average linkage diagram, in contrast, is backed by the long working experience of the authors with the

Kola dataset. If somebody wanted to "prove" that these elements are an important ingredient

of the contamination spectrum of the smelters, the single linkage dendrogram would probably

be used. A reader (or reviewer) of a paper using cluster analysis has practically no chance to

judge whether the authors have just "played" long enough with cluster analysis until a result

emerged that fitted preconceived conceptions. The example demonstrates that cluster analysis

can never be a statistical proof of the existence of relations; it can, however, be used in a

truly exploratory data analysis sense to detect structures in the data that are worth

further investigation.

Figure 5: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)

using single linkage based on the Euclidean distance between the variables. The result is quite

different from Figure 4.

For reducing the number of elements entered into cluster analysis one could then select just one

or two elements from each of the major branches of a dendrogram, or study the elements on

just one branch in more detail.

It has been shown above that the cluster results can change dramatically with the choice of the

clustering method, the selected variables, the distance measure, and the number of clusters.

Moreover, depending on the selected validity measure, different solutions result for the

optimal number of clusters. Despite the variety of cluster results, each partition could still be

informative and valuable. The results can give an interesting insight into the multivariate data

structure, even if the validity measure does not suggest the chosen number of clusters is

optimal. Thus, it is desirable to perform cluster analysis in an exploratory context, by

changing the cluster parameters and visually inspecting the results.

For this purpose, a statistical tool has been developed in R (freely available at http://cran.r-project.org) as the contributed package clustTool. Besides the selection of data, a background

map (optional), variables and coordinates, different parameters, like the distance measure, the


clustering method, the number of clusters, and the validity measure can be selected.

Depending on the selection, the clusters are presented in maps (see, e.g., Fig. 7) and plots of

the cluster centres are provided (see, e.g., Fig. 8). Additionally, a summary is provided and

information about the clustering is saved in an object in the R workspace. Figure 6 shows the

main menu of the tool clustTool.

Figure 6: Main menu of the tool clustTool for an exploratory use of cluster analysis. Cluster

results are presented in maps, and plots of the cluster centres are provided.

The algorithm Mclust with 6 clusters was applied to 40 elements of the Kola O-horizon data.

The validity measure BIC was used for evaluating each individual cluster. Higher values of

the BIC indicate more informative clusters. Therefore, the BIC value is used for assigning

grey scales to the observations in the maps. Figure 7 shows the resulting clusters in 6 maps.

Cluster 3 shows low BIC values, therefore the observations are visualised as light grey points.

Cluster 5 shows the input of sea spray along the Norwegian coast. Cluster 6 identifies the

core areas of contamination surrounding Monchegorsk and Nikel/Zapoljarnij.

Figure 7: Mclust for the log-transformed and standardised 40 elements of the O-horizon data.

6 clusters were chosen and each cluster is evaluated with the BIC measure, resulting in

different grey scales for the observations in the maps.

In general, not only the location of the single clusters on the maps is of interest, but also the

geochemical composition of the clusters. For this purpose, a plot of the cluster centres is

presented in Figure 8, which aids the interpretation of the processes behind the clusters. The

cluster centre is the element-wise mean of all observations of a cluster. Therefore, for each

cluster all elements used for clustering are presented. In Figure 8 the resulting means for all 6

clusters presented in Figure 7 are horizontally arranged. Since the variables used for

clustering were standardised, they each make the same contribution to the cluster analysis. If

single elements show very high or low means for a cluster, they are highly influential for that

cluster. For example, Figure 8 shows high means of the elements Cu, Ni, Co, As and Mo for

cluster 6, identifying the Russian nickel industry.
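Computing a cluster centre as described is straightforward: it is the element-wise mean over the observations of a cluster. A minimal sketch with hypothetical standardised values for two variables (cluster labels and numbers invented for illustration):

```python
def cluster_centres(data, labels):
    """Cluster centre = element-wise mean of all observations in a cluster."""
    centres = {}
    for lab in set(labels):
        rows = [r for r, l in zip(data, labels) if l == lab]
        centres[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    return centres

# standardised toy data: two variables (say Cu and Ni), high in one cluster
data = [[2.1, 1.9], [1.8, 2.2], [-0.5, -0.6], [-0.7, -0.4]]
labels = ["smelter", "smelter", "background", "background"]
for lab, centre in sorted(cluster_centres(data, labels).items()):
    print(lab, [round(v, 2) for v in centre])
# prints: background [-0.6, -0.5]
#         smelter [1.95, 2.05]
```

Because the variables are standardised, centre values far from zero immediately flag the elements that drive a cluster, which is what the panels of Figure 8 display.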


Figure 8: Plot of the cluster centres for the cluster analysis presented in Figure 7 (40 variables

of the O-horizon data, log-transformed, standardised). High or low values of the elements

suggest high influence of these elements on the observations in the corresponding cluster.

It can be interesting to carry out cluster analysis with a reduced number of variables, e.g. the

variables on the "sea spray" branch of the dendrogram shown in Figure 4 (Na, B, Mg, Ca, Sr,

pH) in order to better understand their specific influence on the observations. Figure 9 shows

the results of cluster analysis using the Mclust algorithm for 8 clusters. Cluster 7, which is

dominated by B, Mg and Na (Fig. 10), is clearly the sea spray cluster. Cluster 8, which is

dominated by Sr, adds new insight. It plots on top of the alkaline intrusions near Apatity. A

small cluster near Apatity (Cluster 4) is dominated by Ca, Sr and pH. It is probably directly

related to alkaline dust from the processing plant in Apatity. Thus there exist clearly two

different processes that determine the spatial distribution of the elements that were

interpreted as sea spray related: a true "sea spray component" and an "alkaline rock dust

component".

Figure 9: Mclust with 8 clusters for the selected sea spray elements B, Ca, Mg, Na, pH and Sr.

Figure 10: Plot of the cluster centres for the results presented in Figure 9. Different chemical

processes are visible in the different clusters.

Carrying out the same cluster analysis with one cluster less results in the maps presented in

Figure 11 and the cluster centres shown in Figure 12. Although a "sea spray cluster" (cluster

6) is identified and one cluster relating to the alkaline intrusions (cluster 7), results are clearly

different from the previous example, demonstrating that very different results can be obtained

when changing the number of selected clusters.

Figure 11: Mclust with 7 clusters for the selected sea spray elements B, Ca, Mg, Na, pH and Sr. The

resulting clusters are quite different from Figure 9.

Figure 12: Plot of the cluster centres for the results presented in Figure 11.

In a final example the results of fuzzy clustering on selected elements of the Kola O-horizon

data are shown. The variables B, Co, Cu, Mg, Na, Ni, indicative of two of the main processes

in the survey area (sea spray and industry), were log-transformed and standardised. Based on

the Euclidean distances, the FCM algorithm with 4 clusters is applied. The resulting


membership coefficients are shown in grey scales in Figure 13: higher membership of an

observation to a particular cluster is visualised by a darker point in the corresponding map.

The plot in Figure 14 with the cluster centres allows a better understanding of the resulting

clusters: Cluster 1 is a "sea spray" cluster, and Cluster 4 is a contamination cluster. Cluster 2

appears to indicate an outer rim of contamination, while all background observations

accumulate in Cluster 3.

Figure 13: Results of fuzzy clustering (FCM algorithm based on Euclidean distances) with 4

clusters of the elements B, Co, Cu, Mg, Na and Ni of the Kola O-horizon data (log-transformed, standardised). The membership coefficients of the observations to the clusters

are shown by grey scale in the maps.

Figure 14: This plot with the cluster centres supports an interpretation of the clusters

visualised in Figure 13.

Further clustering results for the Kola Project data as well as for other geochemical data sets

can be found in Templ (2003). This thesis also investigates and demonstrates the sensitivity

of cluster analysis methods to data preparation.

11. CONCLUSIONS

Like many other multivariate statistical methods, cluster analysis should be helpful to obtain

an overview of data sets with many observations and variables. It can be used both to

structure the variables and to group the observations. Which of the available variables are used

for cluster analysis and how they are prepared is crucial to the outcome. Using selected

elements or selected observations (e.g. sub-regions in the map) will in general give very

different results, some of them allow a clearer, and some a less clear, understanding of the

structure or classification of the data. As a general rule, symmetrisation of the data

distribution (e.g. by using a log-transformation for each variable) and standardisation is a

necessary part of data preparation before applying cluster analysis. Depending on the

clustering method, outliers can heavily affect the results and should be removed from the data

prior to analysis.
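The recommended data preparation amounts to two steps per variable; a minimal Python sketch (concentration values invented for illustration, and assuming strictly positive measurements so the logarithm is defined):

```python
import math

def prepare(columns):
    """Log-transform each variable to symmetrise its distribution,
    then standardise it to mean 0 and standard deviation 1."""
    prepared = []
    for col in columns:
        logged = [math.log(v) for v in col]          # values must be > 0
        n = len(logged)
        mean = sum(logged) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in logged) / n)
        prepared.append([(v - mean) / sd for v in logged])
    return prepared

# strongly right-skewed concentrations, as is typical for e.g. Cu in mg/kg
cu = [2.7, 5.1, 9.7, 48.0, 4080.0]
prep = prepare([cu])[0]
print([round(v, 2) for v in prep])
```

Standardising after the log-transformation is what makes every variable contribute equally to the distances used by the clustering algorithms.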

It is difficult to give a general recommendation concerning the cluster method to use. In our

case interesting results were obtained with model based clustering (algorithm Mclust), but


also other simpler algorithms led to useful interpretations and maps. If observations are

clustered (and not variables), the visualisation of the clusters in maps is most helpful; together

with a plot of the cluster centres, an immediate impression of the cluster characteristics is

provided. These are often far more helpful than plots of validity measures or dendrograms,

which are difficult to read for applications with many observations.

We recommend the use of cluster analysis as an exploratory method. For this purpose, the

software package clustTool that runs under R has been developed. The user can choose data,

coordinates, background maps, variables, different distance measures, various cluster

algorithms, determine the number of clusters, and look at the results in plots. The visual

impression of the results, together with a pre-chosen validity measure, is then helpful for

deciding on the parameter selection for clustering the data. Interesting results are not

necessarily obtained by tuning the parameters for cluster analysis in a statistically optimal

way. Expert knowledge also should be used for this purpose, e.g. for variable selection or

cluster evaluation. A flexible software tool used by experts can combine both strategies.

ACKNOWLEDGEMENT

The authors are grateful to Dr. Robert Garrett (Geological Survey of Canada) for many

stimulating discussions and for his suggestions leading to a significant improvement of an earlier

version of this manuscript.

REFERENCES

Aitchison, J., 1986. The statistical analysis of compositional data. Wiley, New York, 416pp.

Äyräs, M. and Reimann, C., 1995. Joint ecogeochemical mapping and monitoring in the scale

of 1:1 mill. in the West Murmansk Region and contiguous areas of Finland and Norway 1994-1996. Field Manual. Nor.Geol.Unders.Rep. 95.111, 33pp.

Bandemer, H. and Näther, W., 1992. Fuzzy Data Analysis. Kluwer Academic Publishers,

360pp.

Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum

Press, New York, 256pp.


Box, G.E.P. and Cox, D.R., 1964. An analysis of transformations. Journal of the Royal

Statistical Society (B) 26, 211-252.

Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.

Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5-32.

Calinski, T. and Harabasz, J., 1974. A dendrite method for cluster analysis. Communications

in Statistics 3, 1-27.

Davis, J.C., 1973. Statistics and data analysis in geology. John Wiley & Sons, New York,

550pp.

Dempster, A.P, Laird, N.M, and Rubin, D.B., 1977. Maximum likelihood from incomplete

data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B

39, 1-38.

Dunn, J.C., 1973. A Fuzzy Relative of the ISODATA process and its use in detecting

compact well-separated clusters. Journal of Cybernetics 3, 32-57.

Everitt, B., 1974. Cluster Analysis. Heinemann Educational, London, 1974, 248pp.

Filzmoser, P., Reimann, C., and Garrett, R.G., 2005. Multivariate outlier detection in exploration

geochemistry. Computers and Geosciences 31, 579-587.

Fraley, C. and Raftery, A., 1998. How many clusters? Which clustering method? Answers via

model-based cluster analysis. The Computer Journal 41, 578-588.

Fraley, C. and Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and

density estimation. Journal of the American Statistical Association 97, 611-631.

Gath, I. and Geva, A., 1989. Unsupervised optimal fuzzy clustering, IEEE Trans. on Pattern

Analysis and Machine Intelligence 11(7), 773-781.

Gower, J.C., 1966. Some distance properties of latent root and vector methods used in

multivariate analysis. Biometrika 53, 325-338.


Gordon, A.D., 1999. Classification. Chapman & Hall/CRC, Boca Raton, 2nd edition, 256pp.

Gustafson, D.E. and Kessel, W., 1979. Fuzzy clustering with a fuzzy covariance matrix. Proc.

IEEE-CDC 2, 761-766.

Halkidi, M., Batistakis, Y., and Vazirgiannis, M., 2002. Cluster validity methods. SIGMOD

Record 31, 40-45.

Hartigan, J., 1975. Clustering Algorithms. John Wiley and Sons, New York, 351pp.

Hubert, L. and Arabie, P., 1985. Comparing partitions. Journal of Classification 2, 193-218.

Kaufman, L. and Rousseeuw, P.J., 1990. Finding Groups in Data. John Wiley & Sons, Inc.,

New York, 368pp.

Lance, G.N. and Williams, W.T., 1966. Computer programs for classification. Proceedings of

the Australian National Committee on Computing and Automatic Control Conference, Canberra.

Paper 12/3, 304-306.

Le Maitre, R.W., 1982. Numerical Petrology. Elsevier, Amsterdam, 281pp.

Leisch, F., 1998. Ensemble methods for neural clustering and classification. PhD thesis,

Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik, Technische

Universität Wien, Austria, 130pp.

Leisch, F., 1999. Bagged clustering. Working Paper 51, SFB Adaptive Information Systems

and Modeling in Economics and Management Science, Wirtschaftsuniversität Wien,

Austria, 11pp.

Leisch, F., 2006. A toolbox for k-centroids cluster analysis. Computational Statistics and Data

Analysis, 2006. Accepted for publication.

MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations.

In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,

eds L. M. Le Cam & J. Neyman, 1, Berkeley, CA: University of California Press, 281-297.


Martinetz, T., Berkovich, S., and Schulten, K., 1993. Neural-gas network for vector quantization

and its application to time-series prediction. IEEE Transactions on Neural Networks 4 (4), 558-569.

Niskavaara, H., 1995. A comprehensive scheme of analysis for soils, sediments, humus and

plant samples using inductively coupled plasma atomic emission spectrometry (ICP-AES). In:

Autio, S. (ed.): Geological Survey of Finland, Current Research 1993-1994. Geological

Survey of Finland, Espoo, Special Paper 20, 167-175.

Reimann, C. and Filzmoser, P., 2000. Normal and lognormal data distribution in

geochemistry: death of a myth. Consequences for the statistical treatment of geochemical and

environmental data. Environmental Geology 39/9, 1001-1014.

Reimann, C. and Wurzer, F., 1986. Monitoring accuracy and precision - improvements by

introducing robust and resistant statistics. Mikrochimica Acta 1986 II, No.1-6, 31-42.

Reimann, C., Äyräs, M., Chekushin, V.A., Bogatyrev, I., Boyd, R., Caritat, P. de, Dutter, R.,

Finne, T.E., Halleraker, J.H., Jæger, Ø., Kashulina, G., Niskavaara, H., Lehto, O., Pavlov, V.,

Räisänen, M.L., Strand, T., and Volden, T., 1998. Environmental Geochemical Atlas of the

Central Barents Region. NGU-GTK-CKE special publication. - Geological Survey of

Norway, Trondheim, Norway, 745pp.

Reimann, C., Filzmoser, P., and Garrett, R.G., 2002. Factor analysis applied to regional

geochemical data: problems and possibilities. Applied Geochemistry 17, 185-206.

Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 416pp.

Rock, N.M.S., 1988. Numerical Geology. Lecture Notes in Earth Sciences 18, Springer

Verlag, New York-Berlin-Heidelberg, 427pp.

Templ, M., 2003. Cluster Analysis applied to Geochemical Data. Diploma Thesis, Vienna

University of Technology, Vienna, Austria, 137pp.

Yeung, K. and Ruzzo, W., 2001. An empirical study on principal component analysis

for clustering gene expression data. Bioinformatics 17(9), 763-774.


TABLES

ELEMENT   DL      %<DL    MIN      MED      MAX      MAD

Ag        0.02    0       0.025    0.2      4.79     0.16
Al        0.2     0       372      1890     20600    1201
As        0.05    0       0.364    1.16     43.5     0.46
B         0.8     0.2     <0.8     2.15     13       0.7
Ba        0.05    0       13.9     76       290      30.3
Be        0.02    25.1    <0.02    0.04     1.87     0.04
Bi        0.02    0       0.029    0.159    1.12     0.08
C         1000    0       153000   450000   508000   3710
Ca        5       0       460      2960     25400    786
Cd        0.02    0       0.07     0.3      1.39     0.11
Co        0.03    0       0.21     1.57     96       1.11
Cr        0.4     0       0.39     2.91     109      1.75
Cu        0.01    0       2.7      9.7      4080     5.14
Fe        10      0       430      1970     44800    1245
H         1000    0       22000    61000    71000    444
Hg        0.04    0       0.094    0.227    0.974    0.05
K         200     0       300      1000     5700     297
La        0.7     4.5     <0.7     2.3      139      1.78
Mg        10      0       240      750      23800    297
Mn        1       0       11.1     126      5470     108
Mo        0.01    0       0.086    0.258    5.45     0.1
N         1000    0       5000     13000    20000    300
Na        10      3.4     <10      60       2350     29.7
Ni        0.3     0       1.5      9.2      2880     7.74
P         15      0       192      930      9280     208
Pb        0.04    0       4.1      19       1110     7.41
Rb        0.5     0       0.68     5.8      33       2.76
S         15      0       400      1530     3830     297
Sb        0.01    0       0.016    0.183    0.962    0.08
Sc        0.1     0.5     <0.1     0.5      4.1      0.3
Se        0.8     88      <0.8     <0.8     7.4      0.4
Si        20      0       290      530      940      74.1
Sr        0.2     0       6.1      29       1430     13.6
Th        0.04    0       0.06     0.35     15.4     0.25
Tl        0.01    0       0.02     0.09     0.56     0.05
U         0.004   0       0.008    0.099    14.3     0.07
V         0.02    0       1.1      4.9      49       2.39
Y         0.1     0       0.2      0.9      69       0.59
Zn        0.4     0       12       46       198      15.1

other parameters

pH        0.1     0       3.2      3.85     5.6      0.22
EC        0.1     0       5.53     13       23       2.92
LOI       0.1     0       33.5     89.8     98.8     6.52

Table 1: Elements and summary statistics (minimum (MIN), median (MED), maximum

(MAX) and spread (expressed as the median absolute deviation, MAD)) for the Kola O-horizon

data set used here (from Reimann et al., 1998). In addition the detection limit (DL) and the

number of samples below detection (expressed in %) are given.


FIGURES

Figure 1: General location map of the study area for the Kola Project (Reimann et al., 1998).

Locations named in the text are given.


Figure 2: Average linkage clustering for 40 variables of the O-horizon data (log-transformed,

standardised). The cluster results for different methods and a fixed number of clusters are

compared with the corresponding results for bootstrap samples of the data using the Rand

index. The boxplots show the resulting Rand indices.


Figure 3: Resulting validity measures of clustering the 40 variables of the O-horizon data

(log-transformed, standardised) with different methods and based on different distance

measures (rf is the random forest distance). The validity measure changes with varying

number of clusters. The optimal number of clusters is indicated by a knee before or after the

measure increases significantly.


Figure 4: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)

using average linkage based on the Euclidean distance between the variables. The result is

shown by a dendrogram. Variables on a tree can be selected for further clustering of the

objects.


Figure 5: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)

using single linkage based on the Euclidean distance between the variables. The result is quite

different from Figure 4.


Figure 6: Main menu of the tool clustTool for an exploratory use of cluster analysis. Cluster

results are presented in maps, and plots of the cluster centres are provided.


[Figure 7 image: six maps, one per Mclust cluster, each annotated with its BIC value and number of observations (in panel order): BIC = -9289.88, obs = 110; BIC = -4167.61, obs = 41; BIC = -29646.24, obs = 308; BIC = -4156.59, obs = 30; BIC = -7838.78, obs = 71; BIC = -6345.8, obs = 57.]

Figure 7: Mclust for the log-transformed and standardised 40 elements of the O-horizon data.

6 clusters were chosen and each cluster is evaluated with the BIC measure, resulting in

different grey scales for the observations in the maps.


[Figure 8 image: parallel panels of the standardised cluster means of all 40 variables, one per cluster (sizes 110, 41, 308, 30, 71 and 57); y-axis: cluster means, x-axis: cluster number.]

Figure 8: Plot of the cluster centres for the cluster analysis presented in Figure 7 (40 variables

of the O-horizon data, log-transformed, standardised). High or low values of the elements

suggest high influence of these elements on the observations in the corresponding cluster.
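The cluster centres plotted in Figure 8 are simply the per-cluster means of the standardised variables. A minimal sketch with synthetic data and hypothetical cluster labels (not the Kola clusters):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))           # 60 samples, 5 "elements"
labels = rng.integers(0, 3, size=60)   # hypothetical cluster assignment

# standardise each element, then average within each cluster
Z = (X - X.mean(0)) / X.std(0)
centres = np.vstack([Z[labels == k].mean(0) for k in range(3)])
print(centres.shape)  # (3, 5): one row of element means per cluster
```

Because the variables are standardised, a centre value far from zero marks an element that is unusually high or low in that cluster relative to the whole data set.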


[Figure omitted from extraction; panel labels: Mclust, eight clusters with BIC = -4386.34 (obs = 358), BIC = -157.26 (obs = 8), BIC = -419.24 (obs = 29), BIC = -66.92 (obs = 3), BIC = -405.29 (obs = 30), BIC = -679.37 (obs = 62), BIC = -1479.12 (obs = 110), BIC = -306.88 (obs = 17).]

Figure 9: Mclust with 8 clusters for the selected pollution elements B, Ca, Mg, Na, pH and Sr.


[Figure omitted from extraction; one panel per cluster, element labels B, Ca, Mg, Na, pH and Sr; x-axis: cluster number, y-axis: cluster means.]

Figure 10: Plot of the cluster centres for the results presented in Figure 9. Different chemical processes are visible in the different clusters.


[Figure omitted from extraction; panel labels: Mclust, seven clusters with BIC = -2005.32 (obs = 188), BIC = -1350.99 (obs = 104), BIC = -450.13 (obs = 32), BIC = -300.06 (obs = 26), BIC = -1081.65 (obs = 145), BIC = -557.52 (obs = 68), BIC = -1306.9 (obs = 54).]

Figure 11: Mclust with 7 clusters for the selected pollution elements B, Ca, Mg, Na, pH and Sr. The resulting clusters are quite different from Figure 9.


[Figure omitted from extraction; one panel per cluster (sizes 104, 32, 26, 145, 68, 54, 188), element labels B, Ca, Mg, Na, pH and Sr; y-axis: cluster means.]

Figure 12: Plot of the cluster centres for the results presented in Figure 11.


[Figure omitted from extraction; panel labels: FCM, Euclidean; cluster sizes 108, 188, 224, 97.]

Figure 13: Results of fuzzy clustering (FCM algorithm based on Euclidean distances) with 4 clusters of the elements B, Co, Cu, Mg, Na and Ni of the Kola O-horizon data (log-transformed, standardised). The membership coefficients of the observations to the clusters are shown by grey scale in the maps.
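The FCM algorithm behind Figures 13 and 14 alternates between updating cluster centres and fuzzy membership coefficients, so that each observation belongs to every cluster with a weight between 0 and 1. The following is a minimal, self-contained sketch of standard fuzzy c-means (synthetic two-dimensional data, not the Kola elements; the paper's analyses use the R tool clustTool):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means with Euclidean distances.
    Returns the membership matrix U (n x c, rows sum to 1)
    and the cluster centres (c x p). m > 1 is the fuzzifier."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(1, keepdims=True)                    # random fuzzy start
    for _ in range(iters):
        W = U ** m
        centres = (W.T @ X) / W.sum(0)[:, None]     # weighted means
        # squared Euclidean distances of every point to every centre
        d2 = ((X[:, None, :] - centres[None]) ** 2).sum(-1) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))              # membership update
        U = inv / inv.sum(1, keepdims=True)
    return U, centres

# two well-separated point clouds standing in for the Kola data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
U, centres = fcm(X, c=2)
print(U.sum(1))  # each observation's memberships sum to 1
```

The rows of U are exactly the membership coefficients that the maps in Figure 13 encode as grey scales; points near a centre get memberships close to 1 for that cluster, points between clusters get intermediate values.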


[Figure omitted from extraction; one panel per cluster (sizes 108, 188, 224, 97), element labels B, Co, Cu, Mg, Na and Ni; x-axis: cluster number, y-axis: cluster means.]

Figure 14: This plot of the cluster centres supports an interpretation of the clusters visualised in Figure 13.

