Edited by
Pierre Legendre
Département de Sciences biologiques
Universite de Montreal, C.P. 6128, Succ. A
Montreal, Quebec H3C 3J7, Canada
Louis Legendre
Département de Biologie, Université Laval
Ste-Foy, Quebec G1K 7P4, Canada
Springer-Verlag
Berlin Heidelberg New York London Paris Tokyo
Published in cooperation with NATO Scientific Affairs Division
Proceedings of the NATO Advanced Research Workshop on Numerical Ecology
held at the Station marine de Roscoff, Brittany, France, June 3-11, 1986
Library of Congress Cataloging in Publication Data. NATO Advanced Research Workshop on Numerical
Ecology (1986: Station marine de Roscoff) Developments in numerical ecology. (NATO ASI series. Series
G, Ecological sciences; vol. 14) "Proceedings of the NATO Advanced Research Workshop on Numerical
Ecology held at the Station marine de Roscoff, Brittany, France, June 3-11, 1986"-T.p. verso. "Published
in cooperation with NATO Scientific Affairs Division." Includes Index. 1. Ecology-Mathematics-
Congresses. 2. Ecology-Statistical methods-Congresses. I. Legendre, Pierre, 1946- . II. Legendre,
Louis. III. North Atlantic Treaty Organization. Scientific Affairs Division. IV. Title. V. Series. QH541.15.M34N38
1986 574.5'0724 87-16337
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in other ways, and storage in data banks. Duplication of this publication or
parts thereof is only permitted under the provisions of the German Copyright Law of September 9, 1965, in
its version of June 24, 1985, and a copyright fee must always be paid. Violations fall under the prosecution
act of the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1987
Softcover reprint of the hardcover 1st edition 1987
Table of Contents
I. Invited Lectures
Scaling techniques
John C. Gower
Introduction to ordination techniques .............................. 3
J. Douglas Carroll
Some multidimensional scaling and related procedures devised at Bell
Laboratories, with ecological applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Yves Escoufier
The duality diagram: a means for better practical applications . . . . . . . . . . . . . . 139
Jan de Leeuw
Nonlinear multivariate analysis with optimal scaling . . . . . . . . . . . . . . . . . . . . 157
Willem J. Heiser
Joint ordination of species and sites: the unfolding technique . . . . . . . . . . . . . . 189
James C. Bezdek
Some non-standard clustering algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Pierre Legendre
Constrained clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Leonard P. Lefkovitch
Species associations and conditional clustering ....................... 309
Fractal theory
Serge Frontier
Applications of fractal theory to ecology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Spatial analysis
Brian Ripley
Spatial point pattern analysis in ecology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
Robert R. Sokal and James D. Thomson
Applications of spatial autocorrelation in ecology. . . . . . . . . . . . . . . . . . . . . . 431
1 - Michele Scardi, 2 - Marie-Josee Fortin, 3 - Willem J. Heiser, 4 - Leonard P. Lefkovitch, 5 - Pierre Legendre, 6-
Louis Legendre, 7 - J. Douglas Carroll, 8 - Pierre Lasserre, 9 - Bruno Scherrer, 10 - Shmuel Amir, 11 - Frederic
Ibanez, 12 - Fortunato A. Ascioti, 13 - Serge Dallot, 14 - Jean-Luc Dupouey, 15 - Jordi Flos, 16 - Richard L.
Haedrich, 17 - Alain Laurec, 18 - David W. Tonkyn, 19 - Julie Sokal, 20 - Steve H. Cousins, 21 - Robert R.
Sokal, 22 - Daniel Simberloff, 23 - Carol D. Collins, 24 - Rebecca Goldburg, 25 - John G. Field, 26 - Clarice M.
Yentsch, 27 - Serge Frontier, 28 - John C. Gower, 29 - Marta Estrada, 30 - James C. Bezdek, 31 - Janet W.
Campbell, 32 - Daniel Wartenberg, 33 - Marinus J. A. Werger, 34 - Marc Troussellier, 35 - Robert Gittins, 36 -
Eugenio Fresi, 37 - Peter Schwinghamer, 38 - Richard A. Park, 39 - Manfred Bölter, 40 - Brian H. McArdle, 41 -
S. Edward Stevens, Jr., 42 - Philippe Gros, 43 - Paul Berthet, 44 - Francisco A. de L. Andrade, 45 - Vincent Boy.
Not pictured: Michel Amanieu, Jan de Leeuw, Yves Escoufier, Roger H. Green, Jean-Marie Hubac, Michael Meyer,
Brian Ripley.
Foreword
During the Sixties and the Seventies, most community ecologists joined the general trend
of collecting information in a quantitative manner. This was mainly driven by the need for testing
implicit or explicit ecological models and hypotheses, using statistical techniques. It rapidly
became obvious that simple univariate or bivariate statistics were often inappropriate, and that
community ecologists should resort to multivariate statistical analyses. In addition, some methods
that are not traditionally considered as statistical (e.g., clustering) were sometimes used
as alternatives to, or in conjunction with, statistical techniques. The first attempts were not always
conclusive, because straightforward applications of both statistical and nonstatistical multivariate
methods often led to unsatisfactory or trivial ecological results. This was either due to the fact that
ecologists did not fully grasp the complexities of the numerical techniques they used, or more
often because the specific nature of ecological data was not taken into account in the course of the
numerical analysis.
As numerical ecology progressively developed, during the last decade, it proposed various
ways of integrating several multivariate techniques into analytical schemes, and it specified sets of
rules that state how conventional methods should be used within the context of community
ecology. Some methods were also modified to better fit multivariate ecological data sets. In the
last few years, however, it has become apparent that existing approaches in numerical ecology
often could not answer the increasingly complex questions raised by community ecologists, and
that a large body of ecological information remained unexploited by lack of appropriate numerical
methods. This was the main incentive for organizing a NATO Advanced Research
Workshop on Numerical Ecology, where community ecologists could meet with proponents
of new methods for the analysis of numerical data, and explore with them how these could be
applied to community ecology.
As stated above, numerical ecology typically combines several numerical methods and
models, of complementary character, to probe data sets describing processes that occur within
ecosystems. New mathematical models (e.g., fractals and fuzzy sets) and methods (generalized
scalings, nonlinear multivariate analyses, spatial analyses, etc.) have recently been developed by
mathematicians, or by statisticians and methodologists working in related fields (e.g.,
psychometrics). The first purpose of the Workshop was to bring methodologists and community
ecologists to the same conference room. The Workshop was designed as follows. Mathematicians
and methodologists presented their theories during morning sessions: Scaling techniques (I, II and
III); Clustering with models, including fuzzy sets; Fractal theory; Qualitative path analysis; Spatial
analysis. During the afternoons, six working groups representing various branches of community
ecology met with the methodologists to discuss the applicability of these methods to the following
fields of specialization: Micro-organisms; Benthic communities; Pelagic communities; Dynamic
biological oceanography and limnology; Terrestrial vegetation; Terrestrial fauna. The Workshop
was also one of the first opportunities offered to numerical ecologists from the various disciplines
(aquatic and terrestrial; botany, microbiology, and zoology) to meet and work towards a common
goal.
The NATO Advanced Research Workshop on Numerical Ecology took place at the Station
marine de Roscoff, France, from 2 to 11 June 1986. There were 51 participants (listed at the end
of the book), originating from 14 countries: Australia, Belgium, Canada, France, Federal
Republic of Germany, Israel, Italy, the Netherlands, New Zealand, Portugal, South Africa,
Spain, the United Kingdom, and the United States of America. The International Organising
Committee for the Workshop was: Pierre Legendre and Louis Legendre (co-directors, Canada),
Michel Amanieu (France), John G. Field (South Africa), Jordi Flos (Spain), Serge Frontier
(France), John C. Gower (United Kingdom), Pierre Lasserre (France), and Robert R. Sokal
(USA).
This book of proceedings comprises the invited lectures, as well as the working group
reports. Lectures contributed by the participants are not included and will eventually appear
elsewhere. The published versions of the papers are often quite different from the oral
presentations in Roscoff, because the authors took into account the discussions that followed their
lectures, as well as criticisms and suggestions by external peer reviewers. As editors, we are
pleased to stress the good spirit and collaboration from all the authors during this critical phase of
paper improvement.
The meeting was sponsored and funded by the Scientific Affairs Division of the North
Atlantic Treaty Organization (NATO). France provided additional financial support, through the
PIREN and PIROcean programs of the Centre national de la Recherche scientifique (grants to
Prof. Michel Amanieu), and the Ministere des Affaires etrangeres (grant to Prof. Pierre Lasserre);
the Station marine de Roscoff also contributed significant non-monetary support. We are sure that
the participants would want us to express their particular thanks to Prof. Pierre Lasserre and his
staff, for local arrangements and superb food, and to Marie-Josee Fortin who very ably assisted
the co-directors with administrative matters before, during and after the meeting, in addition to
being herself an active scientific participant.
In addition to the Editors, several colleagues listed below refereed manuscripts for this
book of proceedings: J. Douglas Carroll, Serge Dallot, William H. E. Day, Yves Escoufier, Scott
D. Ferson, Eugenio Fresi, Robert Gittins, Leonard P. Lefkovitch, Benoit B. Mandelbrot, Brian
H. McArdle, F. James Rohlf, Michele Scardi, Bruno Scherrer, Peter Schwinghamer, Daniel
Simberloff, Robert R. Sokal, Marc Troussellier and Daniel Wartenberg. Their assistance is
gratefully acknowledged.
Scaling techniques
INTRODUCTION TO ORDINATION TECHNIQUES
John C. Gower
Rothamsted Experimental Station
Harpenden, Herts. AL5 2JQ, UK
1. INTRODUCTION
In this paper I shall review the more common ordination techniques that
have found applications in ecology, together with related techniques, mainly
developed by psychometricians and generally termed multidimensional scaling,
that are of potential use to ecologists. Some of the methods covered are
developed in detail by other contributors to this volume. In the interest of
giving a cohesive account, I shall include some introductory comments on such
methods but refer the reader to subsequent chapters for more detailed
expositions. Examples of ecological applications of the methods illustrate the
various techniques; these examples have been drawn entirely from a
forthcoming book "Multivariate Analysis of Ecological Communities" by Digby
and Kempton (1986) and I am grateful to them and to their publisher, Chapman
and Hall, for giving permission. The reader is referred to the book for details
of background information and for many further examples.
Just as a scatter diagram gives a useful graphical representation of a
bivariate sample that allows salient features such as outliers, clusters and
collinearities to be picked out by eye, ordination methods aim to exhibit the
main features of multivariate samples in a few dimensions - ideally two. Thus
the emphasis is on informal graphical displays and not on problems of
inference. Formal inferential procedures are not usually available for the
methods discussed and indeed in my experience are rarely of interest in this
context. However when the effects of sampling variation are deemed relevant
the data-analytic techniques of jack-knifing and boot-strapping will usually be
available and will suffice to give an indication of the stability of displays and
associated confidence in their utility.
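One simple way to obtain such an indication can be sketched as follows (Python with numpy, on made-up data; this illustrates the bootstrap idea applied to the proportion of dispersion captured by a two-dimensional display, and is not a procedure prescribed in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 5))         # made-up multivariate sample

def pc_variance_share(X, k=2):
    """Proportion of the total dispersion captured by the first k components."""
    Y = X - X.mean(axis=0)
    lam = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]
    return lam[:k].sum() / lam.sum()

# Bootstrap: resample the rows with replacement and recompute the share;
# the spread of the resampled values indicates the stability of the display.
shares = [pc_variance_share(X[rng.integers(0, len(X), len(X))])
          for _ in range(200)]
print(round(min(shares), 2), "-", round(max(shares), 2))
```

A narrow interval suggests the two-dimensional summary is stable under resampling; a wide one counsels caution in interpreting the display.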
Underlying all graphical displays are informal or implicit models that allow
the coordinates of the points to be estimated and plotted. I shall try to make
clear the nature of these informal models. There is, of course, no claim that
the parameters of these models have any special ecological significance; they
are merely a mathematical contrivance that allows the data to be presented
conveniently. Occasionally patterns perceived in a display will suggest the
operation of some biological/ecological process that can be modelled more
formally. When this happens the classical statistical theories of estimation and
inference come into their own.
My aim is to describe the various ordination techniques in general terms,
indicating the assumptions made and how to interpret the graphical results.
This will entail using a little algebra from time to time but this will be kept to
an essential minimum. It is certainly not my aim to explain how to do the
calculations or how to construct suitable algorithms and thus develop computer
programs. For most of the methods discussed software is internationally
available and the provenance of specialised programs is given in the text; the
other methods are readily accommodated by good general-purpose statistical
languages and packages such as Genstat (Alvey et al. 1983).
For the most part we shall be concerned with data, as in Table 1, whose
n rows refer to species and whose p columns refer to sites. It is tempting for
mathematicians to refer to such a table as an nxp matrix X and then to ignore
its detailed structure. In this way crucial information may be ignored. Thus
in Table 1, the sites are plots which have each had different fertilizer
treatments and some of which have been limed and others not. In ecology the
sites are often spatially contiguous or they may fall into groups from
geographically different regions. The same species may have been repeatedly
sampled so that data for each species may occur in several rows of the table.
The whole table may have been sampled on several occasions or the different
sites may refer to the same site successively resampled. Such structural
information is vital to any sensible interpretation of the data.
Of equal importance is the type of information given in the body of the
table. In Table 1, a variable "relative abundance of plot species" is given.
This is a quantitative variable whose values, by definition, sum to 100% for
every plot (i.e. for every column). Apart from abundance, typical quantitative
variables of interest to ecologists are measurements (e.g. length of some plant
characteristic in centimetres, total biomass per site in grams per square metre)
and counts, such as number of petals. As well as quantitative variables,
qualitative variables also are important. A typical qualitative variable may take
on one of a finite number of disjoint categories (e.g. black, white, green or
blue); the terms categorical, nominal and meristic variable also are used to
describe qualitative variables. Some qualitative variables may be ordinal
having an underlying notion of a natural ordering (e.g. smooth, textured,
rough). Of special importance are binary qualitative variables that take two
values (e.g. black/white, or presence/absence). In the latter example, absence
has a different logical status from presence and it may be wise to take
cognisance of the fact.
With quantitative variables we have already noted that some may be
counts, and hence dimensionless, while others are measured on scales that
carry with them definite units of measurement. These are of two principal
kinds, ratio-scales and interval-scales. Weight is an example of a ratio-scale,
where all weights are expressed as multiples of a standard kilogram kept in
Paris; ratio-scales have a well-defined zero. Interval-scales are exemplified by
temperature, where two points on the scale are identified (e.g. the melting
point of ice and the boiling point of water) and the interval between them is
divided into a number of equal steps; interval-scales do not have a well-defined
zero (e.g. zero Fahrenheit and zero Celsius are not equivalent). Weaker information is also of
importance in certain fields such as psychometrics. Thus with
paired-comparisons it is known only that one item is preferred to another; with
similarity data it is known that item A is more similar to item B than it is to
item C; with confusion data it may be recorded that the ordered sequence A,B
was identified nAB times and that this differs from nBA.
The above merely hints at some of the problems addressed in the major
discipline of the theory of measurement. However I hope it will suffice to
indicate their importance and that there are problems that ecologists should
think about before embarking on what may seem to be routine statistical
calculations. We have seen that a single variable may be exhibited in a
two-way table such as a species x sites table but in a more typical multivariate
sample the columns of the table/matrix X each refer to a different variable and
these different variables will often comprise a mixture of qualitative and
quantitative types, the qualitative variables being of differing numbers of
levels and the quantitative variables measured in different units. The
problems outlined above are thus compounded and the different interpretations
to be associated with a matrix X are extended.
In the following we shall see some of the more simple ways of handling
the difficulties associated with different types of data and different structures
of data. Ecologists have long recognised that the raw data will often require
some form of transformation or pre-processing before progress can be made.
Thus with the Braun-Blanquet scale, percentage cover is approximately
transformed to an additive scale in the range 1-5. By contrast the
Hult-Sernander-Du Rietz scale is a logarithmic one where 1 corresponds to less
than 6% cover, 2 to 6-12% cover, 3 to 12-25% cover, 4 to 25-50% cover and 5
to 50-100% cover. For insects Lowe (1984) has suggested another logarithmic
scale where 1 corresponds to one individual, 2 to 2-3 individuals, 3 to 4-7
individuals, 4 to 8-15 individuals and so on. These scales are chosen so that
particularly high abundance should not dominate subsequent analyses. When
working with computers it is probably more straightforward to do a logarithmic
transformation rather than to use such scales. The reasons for transformations
include the following:
To ensure independence from scales of measurement
To ensure independence from arbitrary zeros
To eliminate size effects
To eliminate abundance effects.
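The cover-class scales described above are simple to code directly; a minimal sketch (Python; `hsdr_class` is an illustrative name, with the class boundaries taken from the Hult-Sernander-Du Rietz description in the text):

```python
import math

def hsdr_class(cover):
    """Hult-Sernander-Du Rietz cover class for a percentage cover in (0, 100]:
    1 (<6%), 2 (6-12%), 3 (12-25%), 4 (25-50%), 5 (50-100%)."""
    return 1 + sum(cover >= b for b in (6, 12, 25, 50))

print([hsdr_class(c) for c in (3, 10, 20, 40, 90)])   # [1, 2, 3, 4, 5]

# When working with computers it is usually simpler, as the text notes,
# to apply a logarithmic transformation directly:
print(round(math.log(90), 3))
```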
[Table 2: basic transformations of xij (garbled in extraction); legible fragments include (ii) xij − x.j (centring by site means) and (v) a proportion of the jth site total, often expressed as a percentage.]
Writing xij for a typical entry in X, xi. for the mean of the ith species
and x.j for the mean at the jth site, then Table 2 lists some basic
transformations that are sometimes useful. Some of these transformations will
occur naturally in the following and will be discussed in their proper place.
However numbers (vii), (viii) and (ix) need some immediate comment. Number
(vii) is particularly attractive for ratio scales because the result of the
transformation is unaffected by the values of other items of data for the same
variable. This is not so for (viii) and (ix) unless rj is chosen as the a priori
range rather than the range in the sample. When rj is chosen as the sample
standard deviation, which is a very common choice, there are difficulties.
These arise because most ecological samples are likely to embrace mixtures of
several biological populations and the value of rj then depends on the mixing
proportions, so it is not an estimate of any identifiable statistic. If the samples
are from a homogeneous population they probably have little interest; there is
also the problem that the usual formula for evaluating standard errors,
although unbiased, will with long-tailed distributions normally give gross
underestimates, balanced by occasional gross overestimates. In such cases
some preliminary transformation such as a logarithm or square-root is
indicated.
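A small numerical sketch of this point (made-up values; `standardise` is an illustrative helper, not a function from any package): with the sample range, the transformed value of an observation changes whenever some other observation changes, which cannot happen with an a priori range.

```python
def standardise(values, r):
    """Divide each observation of one variable by a range r."""
    return [v / r for v in values]

sample  = [2.0, 4.0, 8.0]
sample2 = [2.0, 4.0, 18.0]     # only the last observation differs

# A priori range fixed in advance: the value for 2.0 is unaffected.
print(standardise(sample, 10.0)[0], standardise(sample2, 10.0)[0])   # 0.2 0.2

# Sample range: the transformed value of 2.0 now depends on the others.
print(standardise(sample,  max(sample)  - min(sample))[0])    # 2/6
print(standardise(sample2, max(sample2) - min(sample2))[0])   # 2/16
```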
that Σ(i,j) dij² = Σ(i,j) δij² + n Σ(i,j) rij², where the latter term is the minimised
residual sum-of-squares. This shows that the criterion
[Figure 2: projection of a point X1 on the x1-axis into the plane of principal axes I and II; graphic lost in extraction.]
Consider any point X1 on the x1-axis (see figure 2). X1 may be projected
into Y1 in the I-II plane, in exactly the same way as were Pi and Pj projected
into Qi and Qj. Indeed any point on the x1-axis will project into a point on
the line joining O and Y1, so that this line, labelled y1 in figure 2, may be
taken to approximate the x1-axis in the principal component plane. Points that
represent samples with positive (negative) deviations from the mean and close
to an x-axis will project into points with positive (negative) deviations from
the mean and close to the corresponding y-axis. Any sample that has values
of x2 and x3 close to the means of these variables, but a value of x1
substantially different from its mean will project into a point close to the
y1-axis and this can, with caution, be used to aid interpretation. The caution
is necessary because any point on X1Y1 will project into Y1 so that although it
Instead of, or as well as, plotting the rows of L as coordinates we may plot
those of LΛ^½. The ith row of this matrix, when regarded as a coordinate of a
point Ri, does not lie on the yi-axes, although its first k dimensions do lie in
the space of the first k principal axes. Thus although the points Ri could be
plotted in the same ordination as those containing the projections Qi and Y1,
Y2, Y3, ..., it is best to plot them separately. The interest in the plot of the
points Ri arises from the algebraic identity

    Y'Y = (LΛ^½)(LΛ^½)' = LΛL'                                  (4)

which shows that what is being approximated is now not a distance but an
inner-product. Geometrically the (i,j)th element of Y'Y is approximated in a
k-dimensional representation by d(ORi)d(ORj)cos(RiORj). The approximation is
again optimal in the least-squares sense (see Section 3.4, below) that no other
k-dimensional representation will have a smaller sum-of-squares than
Trace(Y'Y - LΛkL')² = Σ(i=k+1 to p) λi², where Λk differs from Λ in having zero
diagonal values λi when i > k. Note that now the sum-of-squares accounted for is
expressed in terms of sums-of-squares of the original eigenvalues, rather than
their sums as previously. Normally Y' Y is the corrected sample
variance-covariance matrix of X, but when X has been normalised to eliminate
the effects of differing measurement scales by using the transformation (viii)
of Table 2, with rj set to the standard deviation of the jth variable, Y'Y will be
the product-moment correlation matrix of X. The inner-products will then
approximate the correlations between the variates, and the distances of each
point from the origin should all approximate the common unit variance. Thus
when examining such plots one should be looking for orthogonal pairs ORi, ORj
(suggesting zero correlation between xi and xj) or coincident directions ORi,
ORj (unit correlation, but Ri and Rj should coincide and be close to the
desired unit distance from 0). Additional to the usual caveats concerning
caution when interpreting projections, extra caution is needed with
correlations. Correlations have well-defined meanings in linear situations such
as arise when data can be considered approximately multinormal. However this
is rarely the case with ecological samples. It should not be forgotten that
even an exact non-linear relationship need not give a high correlation; thus
the absence of correlation should not be taken to imply the absence of an
interesting relationship. My advice is that rather than examine plots of LΛ^½, it
is often better to examine all the pairwise scatter plots of xi with xj.
The distance approximated by d(RiRj) in the plots of LΛ^½, when Y'Y is
a correlation matrix with elements rij, is √(2(1−rij)). Clearly this form of
analysis may be regarded as an analysis of correlation. However it is
misleading to view the fundamental plots of Components Analysis in this light,
for the Pythagorean distance (1) takes no account whatsoever of the possible
correlations between the variables.
and s = Min(n,p). The term orthogonal is that usually used in the current
context but more properly the term orthonormal should be used to indicate
that U'U and UU' are both unit matrices, and similarly for V. The
non-negative quantities σi are termed the singular values of Y. Thus from (5)
Y'Y = VΣ'ΣV' and we may identify the previous orthogonal matrix L with V and
the diagonal matrix Λ with Σ'Σ (i.e. λi = σi²). Thus the previous expression for
the component scores, YL, may be written as UΣV'V = UΣ. It follows that the
singular value decomposition may be written Y = (UΣ)V' = (YL)L',
simultaneously giving the component scores and loadings. Further LΛ^½
corresponds exactly with VΣ. The decomposition (5) is important, for a result
proved by Eckart and Young (1936) states that Yr, the best rank r
approximation to Y (i.e. the one that minimises Σ(i=1 to n) Σ(j=1 to p) (yij − ŷij)²,
where ŷij is the (i,j)th element of Yr), is obtained by replacing Σ by Σr, where Σr is the
same as Σ except that σi = 0 for all i > r.
With this change only the first r columns of U and V are effective. Whereas we
may write Y = Σ(i=1 to s) σi ui vi', we have that Yr = Σ(i=1 to r) σi ui vi', where
ui and vi are the vectors that are the ith columns of U and V respectively.
Clearly the residual sum of squares after fitting Yr to Y is given by
Σ(i=r+1 to s) σi².
The Eckart-Young theorem shows that the equivalence of (YL)L' to the
singular value decomposition of Y implies that in Components Analysis the inner
product between component scores and the plots of L gives an approximation
to the data Y. That is yij ≈ d(OQi)d(OYj)cos(QiOYj). Also because Y'Y is a
symmetric matrix with non-negative eigenvalues, (4) gives its singular value
decomposition and the Eckart-Young theorem shows why taking the first k
columns of LΛ^½ gives the best k-dimensional approximation to the correlation,
or covariance, matrix.
This shows that YL = M (say) gives the eigenvectors of the nxn matrix YY'
and that diag(Λ) again gives the eigenvalues. Because of the previous
normalisation L'L = I, and from equation (2), the normalisation of the
eigenvectors M is given by M'M = L'Y'YL = L'LΛ = Λ, i.e. the ith column of M
is scaled to have sum-of-squares λi. Thus finding the eigenvectors of YY' and
scaling them as indicated, the component scores are found immediately; the
vectors L may then be determined by premultiplying M by (Y'Y)⁻¹Y'. The
operation on the nxn matrix YY' is sometimes referred to as a Q-technique, as
opposed to the R-technique of operating on the pxp matrix Y'Y. The two
approaches give the same results and should be viewed as alternative methods
of computation. Usually p is much smaller than n so the R-technique will be
preferable.
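The equivalence of the Q- and R-techniques can be checked numerically; the following sketch (numpy, arbitrary made-up data) recovers the component scores from the eigenvectors of YY' scaled as described:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))        # n=10 rows, p=4 columns, made up
Y = X - X.mean(axis=0)                  # column-centred data

# R-technique: eigen-analysis of the p x p matrix Y'Y.
lam_R, L = np.linalg.eigh(Y.T @ Y)      # eigenvalues in ascending order
scores_R = Y @ L                        # component scores YL

# Q-technique: eigen-analysis of the n x n matrix YY'.
lam_Q, M = np.linalg.eigh(Y @ Y.T)

# The non-zero eigenvalues agree,
print(np.allclose(np.sort(lam_Q)[-4:], np.sort(lam_R)))      # True
# and scaling each corresponding column of M to have sum-of-squares
# lambda_i reproduces the component scores, up to sign.
M4 = M[:, -4:] * np.sqrt(lam_Q[-4:])
print(np.allclose(np.abs(M4), np.abs(scores_R)))             # True
```

With p much smaller than n, the p×p eigenproblem of the R-technique is clearly the cheaper route, matching the closing remark above.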
Occasionally, there is no clear distinction whereby the variables can be
associated with one direction (the rows) and the units with another direction
(the columns). Then we may wish to regard the points Pi as p points referred
to n coordinate axes. The best fitting plane then passes through the point
representing the row-means (species means in Table 1) and X has to be
replaced not by (I-N)X but by X(I-P) where P is pxp with all elements equal to
1/p. This too generates both a Q-technique and an equivalent R-technique but
the distance dij is now defined between columns and not between rows and will
therefore generate a different analysis from the one discussed above. When
the columns refer to well-defined variables, the evaluation of row means is
invalid, for it implies summing quantities with disparate units of measurement.
Some alleviation of this difficulty can be achieved by normalising the variates
to dimensionless forms, as in transformations (vii) and (viii) of Table 2, but I
do not believe that such transformations are sufficient to legitimise the
process. In Section 7.3 a model is discussed where rows and columns have
equal standing.
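The two centrings can be checked directly; a numpy sketch (taking N, as earlier in the text, to be the n×n matrix with all elements 1/n, analogous to P — an assumption, since N's definition falls outside this excerpt):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))                  # made-up 6 x 4 table
n, p = X.shape
N = np.full((n, n), 1.0 / n)                     # assumed form of N
P = np.full((p, p), 1.0 / p)

# (I - N)X subtracts each column mean (taken over the n rows)...
print(np.allclose((np.eye(n) - N) @ X, X - X.mean(axis=0)))                  # True
# ...whereas X(I - P) subtracts each row mean (taken over the p columns).
print(np.allclose(X @ (np.eye(p) - P), X - X.mean(axis=1, keepdims=True)))   # True
```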
Legendre and Legendre (1983) suggest that the mode of sampling may be
used to distinguish the units from the variables. The sampling units are then
the units of a Components Analysis and the descriptors of the samples are the
variables. Thus we have to consider carefully how a table, like Table 1, has
been compiled. There are three possibilities:
(i) individual plants are sampled, in which case species
name would be one (categorical) variable, pH a
quantitative variable etc. When all variables are
quantitative we have the classical set-up for
Components Analysis; when all variables are categorical
[Figure 3: plot of the first two principal components; PCP-I accounts for 40.4% of the dispersion. Points are grass species, including Holcus lanatus, Poa pratensis, Agrostis tenuis, Poa trivialis, Anthoxanthum odoratum, Arrhenatherum elatius, Alopecurus pratensis, Dactylis glomerata, Helictotrichon pubescens and Festuca rubra; graphic lost in extraction.]
The latter has been done using the data of Table 1, first transformed to
logarithms of relative abundance and then with row and column means removed
(see Table 2 (iii)). This transformation reduces the effect of the more
abundant species, which would otherwise dominate the analysis. The space of
the first two components is given in figure 3. The names of the most
abundantly occurring grasses have been underlined. The two-dimensional
space accounts for only 53% of the total dispersion but a third dimension
increases this to 64%. This third dimension is given in figure 4, where it is
seen that only Festuca rubra contributes significantly to the enlarged space.
Figure 3 may be converted into a biplot by superimposing the vectors given in
figure 5.
Figure 4. As for figure 3 but showing first and third principal axes and
only the six dominant grass species underlined in figure 3.
[Figure 5: site vectors labelled by liming treatment, e.g. "continuously limed" and "continuously limed plus recent boost"; graphic lost in extraction.]
The directions given in figure 5 refer to sites but, because in the Park Grass
experiment sites receive fertilizer treatments, it is more informative to label the
vectors by the treatment names. We note that plots with treatment N2 PK and
liming seem to be associated with Arrhenatherum elatius and Alopecurus
pratensis while recent liming is associated with Holcus lanatus. Unmanured
plots are most closely associated with Festuca rubra of the dominant grasses
and with herbaceous species that are unnamed in the figures. The direction of
the first component is associated with increased abundance of species per plot
so that the effect of liming and fertilizers is to decrease the number of species
and increase productivity. This latter point may be examined in more detail by
doing a Components Analysis on the sites. This still uses logarithm of relative
abundance but Pythagorean distance is now defined between sites rather than
between species; it should be recalled that the two forms of analysis are not
simply related. In figure 6 the points plotted refer to sites and these have
again been labelled by their treatments and joined in pairs by directed
lines indicating those sites with increasing levels of liming. It can readily be
seen that pH increases in a roughly NE/SW direction (in the figure, not in the
field). Also plotted on figure 6 are contours defining regions of increasing
biomass (dry matter in tonnes per hectare). Productivity increases in a
direction roughly running from SE to NW. This interpretation is that of Digby
and Kempton (1986) and indicates how Components Analysis may be usefully
enhanced by adding relevant information not directly used in the analysis.
[Figure 6: ordination of sites labelled by treatments (e.g. "Unmanured"), with directed lines indicating increasing liming and contours of biomass.]
                    Species j
                    present   absent    Total
Species i present      a         b       a+b
          absent       c         d       c+d
          Total       xj       p-xj       p

Simple Matching    (a+d)/p
Jaccard            a/(a+b+c)
Sorensen           2a/(2a+b+c)
Ochiai             a/√((a+b)(a+c))
Pearson's φ        (ad−bc)/√((a+b)(a+c)(d+b)(d+c))
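The coefficients above are simple functions of the 2×2 match/mismatch counts a, b, c and d; the following short Python sketch (the code and names are ours, not part of the original chapter) computes them:

```python
import math

def binary_coefficients(a, b, c, d):
    """Similarity coefficients from the 2x2 counts of two sites scored over
    p binary species variables: a = joint presences, b and c = mismatches,
    d = joint absences, p = a+b+c+d."""
    p = a + b + c + d
    return {
        "simple_matching": (a + d) / p,
        "jaccard": a / (a + b + c),
        "sorensen": 2 * a / (2 * a + b + c),
        "ochiai": a / math.sqrt((a + b) * (a + c)),
        "pearson_phi": (a * d - b * c)
            / math.sqrt((a + b) * (a + c) * (d + b) * (d + c)),
    }

coeffs = binary_coefficients(a=4, b=2, c=1, d=3)
```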
dij + dik ≥ djk holds for all triplets (i,j,k). The metric property is weaker than
that of Euclidean distance - all distances are metrics but not all metrics are
distances. When the triangle inequality is valid for all triplets then all the
triangles may be drawn, but higher dimensional Euclidean representations need
not exist. This is most easily seen by considering a tetrahedron whose base
ABC is an equilateral triangle with side 2 units and whose apex D is
equidistant d units from A, B and C. When d=1 all triangle inequalities are
valid (with equality except for ABC) and D has to lie simultaneously at the
mid-points of AB, BC and AC. This is clearly impossible in a Euclidean space.
As d increases D moves away from the mid-points but must still occupy three
positions simultaneously, until a true Euclidean representation occurs when
d = 2/√3 and D coincides with the centroid of ABC. As d increases further, D
moves out of the plane of ABC to give a normal three-dimensional Euclidean
representation of the tetrahedron.
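The tetrahedron argument can be checked numerically: a matrix of squared distances has a Euclidean representation exactly when the doubly centred matrix −½JD²J is positive semi-definite. A Python illustration (our code, applying that standard criterion to the example above):

```python
import numpy as np

def is_euclidean(D2, tol=1e-9):
    """Squared distances D2 embed in Euclidean space iff B = -1/2 J D2 J is
    positive semi-definite, where J centres rows and columns."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    return bool(np.linalg.eigvalsh(B).min() >= -tol)

def tetra(d):
    """Squared distances for an equilateral base ABC of side 2 with an apex
    D at distance d from each of A, B and C."""
    D = np.array([[0, 2, 2, d],
                  [2, 0, 2, d],
                  [2, 2, 0, d],
                  [d, d, d, 0]], float)
    return D ** 2

flat = is_euclidean(tetra(1.0))          # d = 1: metric but not Euclidean
edge = is_euclidean(tetra(2 / 3**0.5))   # d = 2/sqrt(3): D at the centroid
full = is_euclidean(tetra(1.5))          # d > 2/sqrt(3): 3-D representation
```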
Consider the similarity coefficients

    Tθ = a/(a + θ(b+c))   and   Sθ = (a+d)/(a+d + θ(b+c)),   θ ≥ 0.

The family Tθ excludes negative matches and the family Sθ includes them.
Many coefficients commonly used in ecology are defined for specific values of θ
(e.g. θ = 1, 1/2, 2). It can be shown (Gower and Legendre 1986) that as θ
increases from zero, the dissimilarity coefficients 1−Tθ, √(1−Tθ), 1−Sθ and
√(1−Sθ) pass thresholds θM and θE at or above which the coefficients always
give, respectively, metric and Euclidean dissimilarity matrices. The explicit
results are as follows:

                θM        θE
    1−Tθ         1
    √(1−Tθ)     1/2
    1−Sθ
    √(1−Sθ)     1/3        1

These bounds define regions where the matrices are always metric or always
Euclidean. They do not imply, for example, that when θ < 1/3 all
matrices of √(1−Sθ) are not metric or not Euclidean; the only claim is that
matrices cannot be guaranteed metric when θ < θM and cannot be guaranteed
Euclidean when θ < θE.
Another interesting property of both families arises from noting that if
1 2 3 … 19 20
is worth noting that the information given by all (20 choose 2) tables like Table 3
Table 6 lists a few of the many suggestions that have been made for defining
dissimilarities when all variables are quantitative. Once again the question
arises as to whether or not n×n matrices of these coefficients are metric,
Euclidean or neither. An additional complication is whether or not to admit
negative values for xij. Although negative quantities are rarely, if ever,
observed in ecology, they may easily arise as the result of preliminary
transformations such as some of those listed in Table 2. Gower and Legendre
(1986) list the properties of ten different coefficients defined for quantitative
variables, both when negative values are allowed and when they are not.
5. METRIC SCALING
[Figure: PCO ordination, horizontal axis PCO-I (33.3%).]
[Figure 8: plot of observed dissimilarities (vertical axis) against fitted distance (horizontal axis), with the monotone regression line.]
The relationship between dij and δij is not exactly monotonic, so a best-
fitting monotonic regression of dij against δij has been plotted in figure 8.
In this regression we are especially interested in the residuals from the
monotone line parallel to the δij-direction. Corresponding to the point (δij,
dij) is the value (δ̂ij, dij) fitted by the monotone regression, so the relevant
residual is δij − δ̂ij and the quantity to be minimised is Σ(δ̂ij − δij)², which is the
modified form of Stress often used with monotonic transformations. Weights
may be introduced if desired, and by replacing dij, δij and δ̂ij by dij², δij² and
δ̂ij² a modified form of Sstress can be found. By defining the residuals from a
monotonic regression, it is clear from examining figure 8 that the modified
forms of Stress and Sstress are invariant to monotonic transformations of the
observed dissimilarities dij.
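The monotone regression underlying this modified Stress can be computed with the standard pool-adjacent-violators algorithm; the following Python sketch (ours; a simplified, unweighted and unnormalised version) illustrates the idea:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    vals, cnts = [], []
    for yi in map(float, y):
        vals.append(yi); cnts.append(1)
        # merge blocks while the non-decreasing requirement is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            tot = cnts[-2] + cnts[-1]
            vals[-2] = (cnts[-2] * vals[-2] + cnts[-1] * vals[-1]) / tot
            cnts[-2] = tot
            vals.pop(); cnts.pop()
    return np.repeat(vals, cnts)

def modified_stress(d_obs, delta_fit):
    """Sum of squared residuals of the fitted distances (delta) from their
    monotone regression on the observed dissimilarities d_obs."""
    order = np.argsort(d_obs)
    delta = np.asarray(delta_fit, float)[order]
    resid = pava(delta) - delta
    return float(resid @ resid)

fitted = pava([1.0, 3.0, 2.0, 4.0])                # pools the 3, 2 violation
s = modified_stress([1, 2, 3], [0.5, 1.0, 2.0])    # already monotone
```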
[Figures: a non-metric ordination with species labelled (e.g. Holcus lanatus), and a plot of observed dissimilarities against fitted distance.]
Figure 11. The shaded region is the plane x1 + x2 + x3 = 1 in which all points
must lie when the closure constraint is satisfied. This plane may
be exhibited in general as a regular simplex, and when p=3 as an
equilateral triangle, as above, where (λxi1, λxi2, λxi3) are known
as barycentric coordinates.
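The barycentric construction of figure 11 amounts to mapping each closed composition onto a weighted average of the triangle's vertices; a small Python sketch (illustrative code, not from the text):

```python
import numpy as np

# Vertices of an equilateral triangle used to display closed (compositional)
# three-part data: each row (x1, x2, x3) with x1 + x2 + x3 = 1 maps to a point.
VERTS = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.5, 3**0.5 / 2]])

def barycentric_to_xy(comp):
    """Map compositions (rows summing to 1) into the triangle: the plotted
    point is the composition-weighted average of the three vertices."""
    return np.asarray(comp, float) @ VERTS

pts = barycentric_to_xy([[1, 0, 0], [0, 0, 1], [1/3, 1/3, 1/3]])
```

A pure composition lands on its vertex, and the equal mixture lands at the triangle's centroid.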
[Figure: barycentric plot of blood-group gene frequencies with vertices MM (100%), MN (100%) and NN (100%); samples labelled Aborigines, Inuit, American Indians, Indians and Chinese, with the Hardy-Weinberg equilibrium curve (p² : 2pq : q²).]
7.2. Horseshoes
The previous example shows one way in which horseshoes or arches may arise.
A simple ecological example giving another way is derived from the following
table.
Site                     Species
        1  2  3  4  5  6  7  8  9 10 11 12
  1     1  1  1  1  1
  2        1  1  1  1  1
  3           1  1  1  1  1
  4              1  1  1  1  1
  5                 1  1  1  1  1
  6                    1  1  1  1  1
  7                       1  1  1  1  1
  8                          1  1  1  1  1
In Table 8 there is no overlap of species between sites more than four steps
apart, and thus any similarity coefficient will be zero (and any derived
dissimilarity maximal) for pairs of sites such
as (1, 6), (1, 7) and (1, 8), and also (8, 1), (8, 2) and (8, 3). The ordination
therefore has to set all these distances as equal to the maximum allowable
value. The inevitable effect is that points 6, 7 and 8 are close to a circle
centred at point 1 and points 1, 2 and 3 are close to a circle centred at point
8. To accommodate all such constraints the horseshoe effect appears.
Ecologists do not seem to be satisfied with the ordering implicit in
ordinations such as that of figure 12 and, regarding data like that in Table 8
as representing a linear gradient, expect a linear ordination. They have
developed linearisation methods such as Detrended Correspondence Analysis
(Hill and Gauch, 1980) and the Step-Across method of Williamson (1978).
Transformations to straighten horseshoes are discussed by Heiser, this volume,
in the context of Multidimensional Unfolding.
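The structure of Table 8, and the vanishing overlap beyond four steps that produces the horseshoe, can be reproduced directly; a Python sketch (our reconstruction of the banded table):

```python
import numpy as np

# Table 8 reconstructed: site i contains species i..i+4 (a moving window),
# so sites more than four steps apart share no species.
sites = np.zeros((8, 12), int)
for i in range(8):
    sites[i, i:i + 5] = 1

def jaccard_sim(u, v):
    """Jaccard similarity of two presence/absence vectors."""
    a = int(np.sum((u == 1) & (v == 1)))
    b = int(np.sum((u == 1) & (v == 0)))
    c = int(np.sum((u == 0) & (v == 1)))
    return a / (a + b + c)

# overlap shrinks with separation and vanishes beyond four steps
sims = [jaccard_sim(sites[0], sites[j]) for j in range(1, 8)]
```

Any ordination must then place sites 6, 7 and 8 at the same maximal distance from site 1, which is what bends the configuration into an arch.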
to fit

    xij = μ + αi + βj + γiδj.

The least-squares solution for the additive terms is exactly the same as when
the multiplicative terms are absent. Thus

    μ̂ = x.. ,   α̂i = xi. − x.. ,   β̂j = x.j − x.. .

The residuals

    zij = xij − μ̂ − α̂i − β̂j = xij − xi. − x.j + x..

may be found and assembled into a p×q matrix

    Z = (I−P)X(I−Q)                                            (9)

where P has elements 1/p and Q has elements 1/q. The least-squares estimates
of γ and δ are then obtained from the singular-value decomposition Z =
UΣV′ (cf. equation 5), where γ is set to be proportional to the first column of
U and δ is set to be proportional to the first column of V. The factors of
proportionality are not arbitrary but must have product σ1, the first singular
value of Z. Because there is generally no reason to favour rows rather than
columns, it is usual to set

    γ̂ = √σ1 u1   and   δ̂ = √σ1 v1.

Further multiplicative terms may be included in the model and estimated by
√σr ur and √σr vr from the rth singular value and the rth columns of U and V.
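The biadditive fit reduces to double-centring followed by a singular-value decomposition; a Python sketch of the estimation steps (illustrative, on simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 5
X = rng.normal(size=(p, q))

# additive fit: mu = x.., alpha_i = xi. - x.., beta_j = x.j - x..
mu = X.mean()
alpha = X.mean(axis=1) - mu
beta = X.mean(axis=0) - mu

# residuals Z = (I-P) X (I-Q), with P = 11'/p and Q = 11'/q
Ip = np.eye(p) - np.ones((p, p)) / p
Iq = np.eye(q) - np.ones((q, q)) / q
Z = Ip @ X @ Iq

# multiplicative terms from the SVD of Z, split symmetrically:
# gamma = sqrt(sigma1) u1, delta = sqrt(sigma1) v1
U, s, Vt = np.linalg.svd(Z)
gamma = np.sqrt(s[0]) * U[:, 0]
delta = np.sqrt(s[0]) * Vt[0, :]

fit = mu + alpha[:, None] + beta[None, :] + np.outer(gamma, delta)
```

By the Eckart-Young theorem the outer product γδ′ is the best rank-one approximation to Z, so the residual norm equals the root sum of the remaining squared singular values.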
This is perhaps the most simple form of analysis in which the data-matrix
has structure imposed on the units. The units are supposed to belong to
known populations and it is convenient to assume that they have been
arranged so that the first nl units are samples from population number 1, the
second n2 units are samples from population number 2, …, and the last nk
units are samples from population number k. Thus we may study variation
both between and within the k populations. Suppose B is the sample
between-population dispersion matrix and Wi is the sample dispersion matrix
within the ith population. When the populations have homogeneous dispersions
we may assume that the Wi are all estimates of the same dispersion and
combine the separate dispersions to form a pooled within-population dispersion
matrix W given by:

    W = Σ_{i=1}^{k} (ni − 1) Wi / (n − k).
In Canonical Variates Analysis, the principal interest is in an ordination of the
k populations rather than in the n samples. An optimal measure of distance
between populations i and j is given by the Mahalanobis D-statistic whose
square is:

    Dij² = (x̄i − x̄j) W⁻¹ (x̄i − x̄j)′,

where x̄i is the row-vector giving the p means of the variables in the ith
population.
Samples of Population 1:   D11  D12  D13
Samples of Population 2:   D21  D22  D23
Samples of Population 3:   D31  D32  D33
Figure 13. Between-units squared distance matrix in a form blocked for three
populations. The elements of the 3×3 matrix on the
right-hand side are formed by averaging the elements within the
corresponding blocks on the left-hand side. The quantity
−½(D̄ii + D̄jj − 2D̄ij) gives the squared distance between the centroids of
the points in the ith and jth populations.
presented in blocked form so that all the units within each population occur in
consecutive rows/columns. Figure 13 shows the situation for three populations.
The n sample units may be imagined as generating a cloud of points in a
Euclidean space. The ni points of population number i will then have a
centroid Gi. The squared distance Δ²(Gi, Gj) between Gi and Gj may be obtained
as follows. First form D̄pq, the average of the np nq elements of Dpq, the
matrix giving the squared distances between all members of population p and
all members of population q. For the diagonal block-matrix Dpp this averaging
process includes the zero elements on its own diagonal. A k×k symmetric
matrix D̄ is formed with elements D̄pq. Then

    Δ²(Gi, Gj) = −½(D̄ii + D̄jj − 2D̄ij).
Thus a Principal Coordinate Analysis of D gives an ordination of the population
centroids. It is then a simple matter to add the individual samples to the
ordination display. Something like a confidence region can then be formed for
the points in the ordination that represent the ith population. This can be
done either by calculating convex hulls or minimal covering ellipses (Green
1981, Silverman and Titterington 1980).
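The block-averaging identity for centroid distances can be verified numerically; the following Python sketch (ours) forms the block averages from a blocked squared-distance matrix and checks Δ² = −½(D̄ii + D̄jj − 2D̄ij) against centroids computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
# three populations of 2-D points
pops = [rng.normal(loc=m, size=(n, 2)) for m, n in [(0, 5), (3, 7), (6, 4)]]
pts = np.vstack(pops)
sizes = [len(P) for P in pops]
idx = np.cumsum([0] + sizes)
k = len(sizes)

# squared distances between all units, then block averages Dbar
D2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
Dbar = np.array([[D2[idx[a]:idx[a + 1], idx[b]:idx[b + 1]].mean()
                  for b in range(k)] for a in range(k)])

# Delta^2(Gi, Gj) = -1/2 (Dbar_ii + Dbar_jj - 2 Dbar_ij)
cen2 = -0.5 * (Dbar.diagonal()[:, None] + Dbar.diagonal()[None, :] - 2 * Dbar)

# direct check against the centroids themselves
G = np.array([P.mean(axis=0) for P in pops])
direct = ((G[:, None, :] - G[None, :, :]) ** 2).sum(-1)
```

Note that the diagonal block averages include the zero diagonal, exactly as the text requires; the identity then holds exactly for squared Euclidean distances.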
7.5. Asymmetry
[Figure 14: loci sketched for the three cases (distance, inner-product and skew-symmetry), each shown relative to the origin O and points Pi and Pj.]
Figure 14. -·-·-·-· Locus of all points that have the same relationship
with Pi as does Pj.
·········· Locus of all points having a null relationship
with Pi (for distance this is only Pi itself).
-------- Locus of all points that have the same relationship
with Pi as does Pj, but with opposite sign.
part in the interpretation; however, scaling the axes differentially badly distorts
the distance interpretation.
For the inner-product Δ(OPi)Δ(OPj)cos(PiOPj) to be constant as Pj varies
requires Δ(OPj)cos(PiOPj) to be constant. This is merely the projection of Pj
onto OPi, showing that the locus of points with equal relationships with Pi is
the line through Pj orthogonal to OPi. To be zero the locus must pass
through the origin. Thus in this case there are many points with a null
relationship with Pi, and the origin plays a central role. Now, however, the axes
may be rescaled without affecting interpretation, provided that the scaling of
one axis is balanced by the inverse scaling of the other. This follows from the
simple formula xiyi + xjyj for the inner-product in terms of the coordinates
Pi(xi, yi) and Pj(xj, yj). Clearly xiyi + xjyj = (λxi)(yi/λ) + (λxj)(yj/λ).
Another thing to remember is that negative values can occur. The locus of
points with the same magnitude but opposite sign to that given by the locus of
Pj is a parallel line equidistant on the other side of the origin.
If the Eckart-Young theorem is used to give an optimal rank-r fit to a
matrix A, then it is the inner-product interpretation that generates the
least-squares estimates sij of the individual elements aij of A. Included in
these estimates are the diagonal elements, and therefore from the cosine
formula we have that Δ²(PiPj) = sii + sjj − 2sij. Thus, provided the approximations
are good, Δ(PiPj) itself approximates (aii + ajj − 2aij)^½, but not in a direct
least-squares sense. This argument shows how certain distances as well as
inner-products are approximated in the same diagram.
With skew-symmetry it is the area of the triangle OPiPj that gives the
approximation. The locus of Pj that keeps area constant is a line through Pj
parallel to OPi. Zero areas are given by all the points on OPi. The axes may
be scaled as for inner-products, and negative skew-symmetry is given by the
locus parallel to OPi, equidistant but on the opposite side to Pj.
Thus, although ordinations may look alike superficially, one has to be
clear about the exact form of approximation being used and bear in mind the
interpretive tools outlined above. Although in a good approximation "close" Pi
and Pj can safely be interpreted as representing similar points, "distant"
cannot safely be interpreted as being dissimilar. Indeed, with skew-symmetry
Δ(PiPj) approximates the distance between the ith and jth rows of the
skew-symmetric matrix N. This distance can be small only if nik − njk is small
for all k, which implies that nij must itself be small. Thus when Pi and Pj
are distant points on a line through the origin it can be deduced that nij is
small and that nik differs significantly from njk for at least one value of k.
    R⁻¹Xq = σp   and   p′XC⁻¹ = σq′                             (13)

The equations (13) may be used as the basis for calculating the values of p
and q by iterating on initial values until convergence; this algorithm is
termed reciprocal averaging. From (13) it is clear that R^½p and C^½q are
singular vectors of Y. The first singular value σ1 = 1 corresponds to R^½p = R^½1
and C^½q = C^½1, i.e. p = 1, q = 1, which contain no useful information. The scores
are therefore obtained from the second singular vectors to give p = R^(−½)u2 and
q = C^(−½)v2. Subsequent vectors may be similarly determined, leading to the
simultaneous plotting of (σi R^(−½)ui, σi C^(−½)vi). Now the squared distance between
the ith and jth row points is
[Figure 15 scatter: labelled species include Holcus lanatus, Poa pratensis, Anthoxanthum odoratum, Agrostis tenuis, Poa trivialis, Alopecurus pratensis, Arrhenatherum elatius, Dactylis glomerata, Festuca rubra and Helictotrichon pubescens; horizontal axis CA-I (.803).]
Figure 15. Correspondence Analysis of the Park Grass data. The same
species are labelled as in figure 3. This diagram gives the
ordination of species; figure 16 gives the ordination of sites.
[Figure 16 scatter: sites labelled by treatments (e.g. N3PK); horizontal axis CA-I (.803).]
Figure 16. Correspondence Analysis of the Park Grass data. The labelling is
as in figure 6. This diagram gives the ordination of sites; figure
15 gives the ordination of species.
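The SVD route to Correspondence Analysis scores sketched above can be written compactly; a Python illustration (our code; here the trivial dimension with unit singular value and constant scores is removed by subtracting the outer product of the margins rather than by discarding the first singular vector):

```python
import numpy as np

rng = np.random.default_rng(2)
N = rng.integers(1, 20, size=(7, 5)).astype(float)   # a small abundance table

P = N / N.sum()
r = P.sum(axis=1)                                    # row masses
c = P.sum(axis=0)                                    # column masses

# standardised residuals; subtracting the outer product r c' removes the
# trivial dimension (singular value 1, constant scores)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

row_scores = (U / np.sqrt(r)[:, None]) * s           # principal coordinates
col_scores = (Vt.T / np.sqrt(c)[:, None]) * s
```

The mass-weighted means of the scores are zero on every axis, reflecting the removal of the trivial solution.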
                          Variable
Sample      Sex       Age group       Nationality     Total
  1         0 1       0 1 0 0          1 0 0            3
  2         0 1       1 0 0 0          1 0 0            3
  3         1 0       1 0 0 0          0 0 1            3
  4         0 1       0 0 0 1          0 1 0            3
  5         1 0       0 0 1 0          0 0 1            3
Total       2 3       2 1 1 1          2 1 2           15
Indicator matrices like that of Table 9 are the raw material of Multiple
Correspondence Analysis. Note that quantitative information, such as
age-group in Table 9, may be presented in categorical form. This can be
useful when the effects of quantitative variables are believed to operate in
non-linear form.
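Building an indicator matrix like Table 9 from raw categorical records is mechanical; a Python sketch (the function, variable names and example levels are ours):

```python
import numpy as np

def indicator_matrix(column):
    """Expand one categorical variable into 0/1 dummy columns,
    one per observed level (levels taken in sorted order)."""
    levels = sorted(set(column))
    return np.array([[int(v == lev) for lev in levels] for v in column])

# a small sample in the spirit of Table 9: sex, age group, nationality
sex = ["F", "M", "F", "M", "M"]
age = ["20-29", "30-39", "20-29", "40-49", "30-39"]
nat = ["X", "Y", "X", "Z", "Y"]

G = np.hstack([indicator_matrix(v) for v in (sex, age, nat)])
```

Each row of the result sums to the number of variables, as in Table 9, so the grand total is n times the number of variables.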
constant and, because of the constraint x′Cx = 1, are equal to 1/√(np). With
constant scores, within-unit variation is zero and hence fully homogeneous. As
with Correspondence Analysis the vector C^½x corresponding to the second
largest eigenvalue may then be selected, and because of the trivial vector C^½1
it will satisfy 1′Cx = 0. This is sometimes considered as an additional constraint,
but it may also be viewed as a consequence of the criterion being optimised.
Other choices of constraints are discussed in the test-score literature (Healy &
Goldstein, 1976). For example it may be desired that the mean score for each
test be zero, which for Table 9 would imply the constraints x1+x2 = x3+x4+x5+x6
= x7+x8+x9 = 0. Another possibility is that the scores for the lowest and highest
levels may be required to be zero and one, which for Table 9 yields x1 = x3 = x7 = 0
and x2 = x6 = x9 = 1. For an example of Multiple Correspondence Analysis set in the
Homogeneity Analysis framework see de Leeuw, this volume.
9. COMPARISON OF ORDINATIONS
Very often two or more ordinations may be done on the same samples
(sites, species, etc.) using either different ordination methods or different
variables, or both. Clearly there is an interest in asking to what extent the
different ordinations are giving similar information. If they differ, can the
differences be identified as mainly arising from some subset of samples or are
they of a more general nature? In this section methods are discussed that
address questions of this kind. Suppose we have matrices Y1, Y2, …, Yk whose
rows give coordinates of points arising from k different ordinations of n
samples. Here Yj is of order n×rj, and without loss of generality it can be
assumed that r1 = r2 = … = rk = r (say); if this is not initially so, zero columns may
be appended to those Yj for which rj < max(r1, r2, …, rk) = r. It is important that
the ith row of every Yj corresponds to the same ith sample. Alternatively we
may be given k distance matrices D1, D2, …, Dk from which the matrices
Yi (i = 1, 2, …, k) may be derived by some form of ordination, or which may be
operated on directly.
Here there are just two ordinations Y1 and Y2, which may be regarded as
two sets of points P1, P2, …, Pn and Q1, Q2, …, Qn in r-dimensional Euclidean space.
In Orthogonal Procrustes Analysis the aim is to fit Y2 to Y1 using the "rigid
body motions" of translation and rotation in such a way that

    m² = Σ_{i=1}^{n} Δ²(PiQi)

is minimised. Translation is readily handled by requiring the centroids of the two
configurations to be superimposed. This is equivalent to subtracting the
column-means from Y1 and Y2, and it will be assumed that this has been done.
Rotations are represented mathematically by orthogonal matrices, which also
may allow for reflections. Thus to minimise the criterion m², an orthogonal
matrix H must be found that minimises Trace((Y1−Y2H)(Y1−Y2H)′). The solution
to this problem turns out to be related to the singular value decomposition of
the matrix Y1′Y2 = UΣV′. In fact H = VU′;
see Schönemann and Carroll (1970) and Gower (1971b). The residual sum-of-
squares is then given by:

    m² = Trace(Y1Y1′) + Trace(Y2Y2′) − 2 Trace(Σ),

which does not depend on the order of matching. To examine the contributions
of each sample to the total m², one has only to examine the individual
residuals Δ(PiQi). This provides a complete solution to the Orthogonal
Procrustes problem.
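The solution H = VU′ is easy to verify numerically; a Python sketch (ours), recovering a known rotation after an arbitrary translation:

```python
import numpy as np

def procrustes(Y1, Y2):
    """Rotate (and possibly reflect) the centred configuration Y2 to fit Y1:
    H = V U' from the SVD Y1' Y2 = U S V', minimising ||Y1 - Y2 H||^2."""
    Y1 = Y1 - Y1.mean(axis=0)
    Y2 = Y2 - Y2.mean(axis=0)
    U, s, Vt = np.linalg.svd(Y1.T @ Y2)
    H = Vt.T @ U.T
    return Y2 @ H, H

rng = np.random.default_rng(3)
Y1 = rng.normal(size=(10, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y2 = Y1 @ R + np.array([5.0, -2.0])   # rotated and translated copy of Y1

fitted, H = procrustes(Y1, Y2)        # fitted coincides with centred Y1
```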
[Figure: ordination with points labelled P1 to P5, C1, C2 and N1.]
points G1, G2, …, Gn may be taken to estimate an average ordination; I have used
the term consensus ordination to describe this average, but it has been pointed
out that this is an abuse of language. Because each Yi is fitted to these
centroid positions, the order in which the matrices are fitted is irrelevant, and
scaling does not present the problem it does with the two-matrix Orthogonal
Procrustes problem; indeed it can be shown that when k=2, Generalized
Procrustes Analysis is equivalent to scaling Y1 and Y2 each to unit sum-of-
squares as described in Section 9.1. The residual sum-of-squares criterion
may be written as

    Σ_{i=1}^{n} Σ_{s=1}^{k} Δ²(PisGi).

To examine the contributions of the
different ordinations to this criterion, examine the residuals Δ(PisGi). This can
be split up in two ways: (i) for ordination s (fixed), examine the contributions
of the different samples; and (ii) for sample i (fixed), examine the contributions
of the different ordinations.
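Generalised Procrustes Analysis can be sketched as an iteration that alternately rotates each configuration onto the current centroid configuration and then updates the centroids; the following simplified Python version (ours; rotations only, without the scaling step sometimes included) illustrates the idea:

```python
import numpy as np

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def gpa(configs, iters=20):
    """Rotate each centred configuration onto the running centroid
    configuration G, then update G; returns G, the aligned configurations
    and the residual sum of squared distances to the centroids."""
    Ys = [Y - Y.mean(axis=0) for Y in configs]
    G = Ys[0].copy()
    for _ in range(iters):
        aligned = []
        for Y in Ys:
            U, s, Vt = np.linalg.svd(G.T @ Y)   # fit Y to G as in Section 9.1
            aligned.append(Y @ Vt.T @ U.T)
        Ys = aligned
        G = sum(Ys) / len(Ys)
    resid = sum(((Y - G) ** 2).sum() for Y in Ys)
    return G, Ys, resid

rng = np.random.default_rng(4)
base = rng.normal(size=(8, 2))
# three ordinations that differ only by rotation and translation
configs = [base @ rot(a) + t for a, t in [(0.0, 0.0), (1.0, 2.0), (2.0, -1.0)]]
G, Ys, resid = gpa(configs)   # congruent inputs align exactly, resid ~ 0
```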
[Figure 18: ordination of 14 sites labelled by letters, with the six year-points shown around sites A, D and K.]
Figure 18. Generalised Procrustes Analysis for 14 sites and 6 years. The
year information is shown only for sites A, D and K. The letters
give the ordination of the 14 sites averaged over the ordinations
for the six separate years.
In figure 18 ordinations for 14 sites were done for each of six years. Thus
n=14 and k=6. The centroids Gi of the six year-points for each site are
labelled by letters; to avoid overloading the figure the individual year-values
are given only for sites A, D and K. From figure 18 it can be seen that there
is much more year to year variation at site D than there is at sites A and K.
The centroids summarise the six original ordinations in a single average.
We now focus on methods that operate directly on the matrices D1, D2, …, Dk,
which may be imagined as holding either distances or squared distances. Just
as Generalised Procrustes Analysis gives an average ordination whose
coordinates may be held in an n×r matrix, so do all the methods described in
this section. They differ in how the differences from this average are
modelled and in the criteria that have to be optimised in the fitting process.
Suppose that the (i,j)th element of Ds is dijs; then the aim is to fit
(squared) distances δijs, in some specified (and usually small) number of
dimensions.
In Individual Differences Scaling (Carroll and Chang 1970) the model is

    δijs² = (xi − xj) Ws (xi − xj)′,

where xi is the ith row of X and Ws is an r×r diagonal matrix of positive
weights associated with the sth ordination. In the associated computer
program, INDSCAL, this model is fitted by a version of the Strain criterion
(see Section 5). Thus Ds (now with values −½dijs²) is replaced by its centred
form, Bs = (I−N)Ds(I−N), and

    Trace Σ_{s=1}^{k} (Bs − XWsX′)²

is minimised. The numerical
methods for minimisation are outside the present scope but are discussed, with
examples, by Carroll, this volume. The basic things to note are that the
calculations can be done, that X (here termed the Group Average) gives an
average ordination, and that for the sth ordination the axes of X are weighted
by the values estimated in Ws. Thus in the sth ordination one axis may get
unusually high weight compared to the weighting of the same axis in other
ordinations. It is interesting to plot the values of Ws (s = 1, 2, …, k) as a set of k
points in r dimensions (usually r=2) as a way of comparing the ordinations.
The axes given by X are uniquely defined and may not be rotated.
The ALSCAL program (Takane et al. 1977) minimises a version of the
Sstress criterion

    Σ (dijk² − δij²)²,

where δij are the distances generated by the group average X. Here dijk may,
if required, be transformed (perhaps monotonically) as described in Section 5.
One advantage of this criterion over that of Strain is that it easily
accommodates missing values.
A criterion of the Stress family is used in SMACOF-I (Heiser and de
Leeuw 1979). Now it is

    Σ_{i,j,k} wijk (dijk − δij)²

that is minimised, where wijk are optionally specified weights, not to be
confused with those of the Individual Differences Scaling model. The usual
partition of a sum-of-squares gives:
10. CONCLUSION
distance and the method of analysis. Thus Components Analysis should ideally
be used only with quantitative variables and when Pythagorean distance is
accepted; similarly Correspondence Analysis is concerned with categorical, or
categorised, variables and chi-square distance. We have seen that both
categorical and quantitative variables can generate other forms of distance
which can be used with other forms of metric or non-metric scaling of which
Components Analysis and Correspondence Analysis are special cases. The
choice among metric methods is not so much governed by scientific objective
but more by how the chosen distance is to be approximated in few dimensions.
Non-metric methods rely only on ordinal information and therefore assume less
than is required for metric methods and hence would seem to be preferable.
In practice metric and non-metric ordinations of the same data often differ
very little, but this is not always so. Non-metric methods are computationally
much more expensive, and of the metric methods, Principal Coordinates Analysis
is the cheapest and hence is always worth trying and often gives all that is
required; it also shares with Correspondence Analysis the advantage over all
other methods, both metric and non-metric, of avoiding the possibility of
finding sub-optimal solutions. A full assessment of the relative merits of the
various forms of metric and non-metric ordination is much needed.
It has been shown that a two-way array may sometimes be interpreted as
a multivariate sample and sometimes as a table constructed from two categorical
variables (a two-way contingency table) or from two categorical variables and
one quantitative variable (a two-way table). In ecological contexts the
distinction between the three possibilities can sometimes be blurred but
consideration of the logical structure of the table can guide one to an
appropriate form of analysis or, at least, exclude certain methods as being
inappropriate. Computers have not helped here as all two-way arrays are the
same to a computer and users are easily tempted to use unsuitable methods for
the analysis of their data. Similarly all ordinations may look alike but their
proper geometrical interpretation depends on whether distances, inner-products
or skew-symmetry is being approximated.
Just as Components Analysis is the basic method for analysing a
multivariate sample of quantitative variables, so is Multiple Correspondence
Analysis the basic method for analysing a multivariate sample of qualitative
variables. If one feels that the relationship between the values of quantitative
variables and their ecological effects is non-linear then Multiple
Correspondence Analysis in the form of Homogeneity Analysis offers a way
forward by categorising the quantitative values into disjoint groups and
REFERENCES
J. Douglas Carroll
AT &T Bell Laboratories
Murray Hill, New Jersey 07974, USA
INTRODUCTION
In this paper are presented descriptions of some of the major models, methods, and
computer algorithms for multidimensional scaling (MDS) and related techniques
developed at Bell Laboratories. Most of the computer programs implementing the
procedures described in this paper are available on one of two tapes available at a
nominal cost from the AT&T Bell Labs Computer Information Library. These two
tapes are referred to as the MDS-1 and MDS-2 tapes. These programs are all written in
FORTRAN. Most of those on the MDS-1 tape are written for IBM equipment, while
those on the MDS-2 tape should be machine independent. (It should be emphasized
that no guarantee is implied that any of these programs will continue to be distributed
on this basis by AT&T Bell Laboratories.) All of the programs discussed here, except
SINDSCAL and PREFMAP-3, are on the MDS-1 tape (which has already been very
widely distributed). SINDSCAL is on the MDS-2 tape. It is hoped that PREFMAP-3
will soon be available.
While this paper is explicitly limited to procedures for which programs are (or are
hoped soon to be) available through the Bell Labs computer information library, a large
number of other MDS and related procedures have been developed at Bell Labs which
are not so available (and thus are not described here). A supplementary bibliography
citing papers relevant to such other procedures developed (totally or in part) at Bell
Labs is available by request from the author. Space limitations also require omission of
many of the programs included in the Bell Labs package of MDS programs. These
include a procedure for maximum likelihood nonmetric 2-way MDS appropriate for
proximity data collected by a certain ranking process, called MAXSCAL4.1 (Takane
and Carroll 1980, 1981); SIMULES (SIMultaneous Linear Equation Scaling) (Carroll
and Chang 1972b, Chang and Carroll 1972c); MONANOVA (MONotonic ANalysis Of
VAriance, which implements a procedure for fitting an additive conjoint measurement
model to data from a factorial design) (Kruskal 1965, Kruskal and Carmone 1968);
Categorical Conjoint Measurement (CCM) (Carroll 1969, Chang 1971); CANCOR
(Generalized CANonical CORrelation Analysis) (Carroll 1968, Chang 1971); PROFIT
(PROperty FITting) (Carroll and Chang 1964, Chang and Carroll 1968); PARAMAP
(PARametric MAPping of nonlinear data structures) (Carroll 1965, Shepard and
Carroll 1966, Chang 1968); POLYFAC (POLYnomial FACtor Analysis) (Carroll 1969);
HICLUS (HIerarchical CLUStering via ultrametric tree models) (Johnson 1967),
MAPCLUS (A MAthematical Programming method for fitting the ADCLUS
overlapping CLUStering model) (Arabie and Carroll 1980a,b); INDCLUS (INdividual
Differences CLUStering) (Carroll and Arabie 1982, 1983); and others. (Most of the
programs on the MDS-1 tape, including all of those just named with the exceptions of
MAXSCAL4.1, MAPCLUS and INDCLUS, which are on the MDS-2 tape, and
MONANOVA, are synopsized and described briefly in Chang, 1971. This paper also
includes brief synopses of early versions of MDSCAL, as well as INDSCAL,
INDSCALS, NINDSCAL, MDPREF and PREFMAP, all of which are discussed in the
body of this paper.) We focus here on two-way and three-way (or Individual
Differences) MDS methods for proximity data, and on methods for individual differences
preference (or other dominance) data. (For a general discussion of MDS, including
many of those models and methods not discussed in detail in the present paper, see
Carroll and Arabie 1980.)
The procedures to be discussed here are organized under three general headings. These
are: I. Two-Way (Nonmetric or metric) Multidimensional Scaling (MDS) procedures;
II. Three-Way Multidimensional Scaling (MDS) procedures; III. MDS Analysis of
Preference (or other Dominance) Data.
A complete outline of the text of this paper follows, including names of programs and
their authors.
I.A. MDSCAL-5 (Kruskal and Carmone 1969) and KYST, KYST2 and
KYST-2A (Kruskal, Young and Seery 1973).
II.A. INDSCAL (Carroll and Chang 1969, 1970; Chang and Carroll 1969) and
SINDSCAL (Pruzansky 1975).
II.B. IDIOSCAL (Carroll and Chang 1972a; Chang and Carroll 1972a; Carroll
and Wish 1974a).
II.C. An application of three-way MDS to the ecological data on sea worms due
to Fresi et al.
III. MDS and Multidimensional Analysis of Preference (or other Dominance) Data.
III.A. MDPREF (Carroll and Chang 1964; Chang and Carroll 1968).
III.B. PREFMAP and PREFMAP-2 (Carroll and Chang 1967; Chang and
Carroll 1972b) and PREFMAP-3 (Meulman, Heiser and Carroll 1986).
III.C. MDPREF analysis of the Fresi et al. seaworm data, and relation to
previous analyses via KYST-2A and SINDSCAL.
Let dij denote the distance from xi to xj. Let X be the matrix whose ith row is xi;
thus, X ≡ (xit), for i = 1, 2, …, n (objects) and t = 1, 2, …, r (dimensions).
The criterion is that of minimizing the function called "Stress", given by one of two
alternate formulas. The one now known as Formula 2 is:

    STRESS FORMULA 2 = √[ Σij (dij − d̂ij)² / Σij (dij − d̄)² ],          (I.A.1)

where d̄ is the average of all the dij's. The one known as Formula 1 is:

    STRESS FORMULA 1 = √[ Σij (dij − d̂ij)² / Σij dij² ].                (I.A.2)
The problem, then, can be expressed as that of finding the matrix X such that the
dij's best match the d̂ij's. The d̂ij's are a set of numerical values chosen to be as close to
their dij counterparts as possible, subject to being monotone with the original δij's.
The two formulas above will be abbreviated here as S2 and S1, respectively. S2 is,
in MDSCAL5, the "normal" or default option. In the various versions of KYST, S1 is
the default option. It should be mentioned that the two Stress formulas differ only in
the normalizing factor in the denominator. In all cases the Σij implies summation over
all values of i and j for which there are data. For example, if a half-matrix option with
diagonal absent is used, the sum would be only over that off-diagonal half-matrix, while
if, say, the whole-matrix option with diagonal present is used, summation is over all n²
values. If there are missing cells the summation skips these cells. Furthermore, in the
case of S2, d̄ is the average over these same values of i and j.
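That the two formulas differ only in their denominators is easy to see in code; a Python sketch (ours; d_hat plays the role of the disparities d̂ij):

```python
import numpy as np

def stress(d, d_hat, formula=1):
    """Kruskal's Stress: Formula 1 normalises by the sum of squared distances,
    Formula 2 by the sum of squared deviations of d from its mean."""
    d, d_hat = np.asarray(d, float), np.asarray(d_hat, float)
    num = ((d - d_hat) ** 2).sum()
    den = (d ** 2).sum() if formula == 1 else ((d - d.mean()) ** 2).sum()
    return float(np.sqrt(num / den))

d = np.array([1.0, 2.0, 3.0, 4.0])       # configuration distances
d_hat = np.array([1.2, 1.8, 3.1, 3.9])   # disparities (monotone fits)
s1, s2 = stress(d, d_hat, 1), stress(d, d_hat, 2)
```

Because the S2 denominator is never larger than the S1 denominator, S2 values are at least as large as S1 values for the same data.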
The procedure used for obtaining the x's is the method of steepest descent. Briefly
stated, the method involves improving the starting configuration a bit by moving it
slightly in the direction of the negative gradient, or direction of steepest descent. The
direction of steepest descent is the direction in the configuration space (the space defined
by all n· r parameters of the X matrix) along which stress is decreasing most rapidly.
This direction corresponds to the (negative) gradient which is defined by evaluating the
partial derivatives of the function S (S 1 or S 2, depending on which option is used).
The n · r components of this vector can be "packed" into a matrix G of the same row and column order as the X matrix; thus G ≡ [−∂S/∂x_it]. On each iteration a step size α is defined in a way described in Kruskal's original paper (1964b), and α times G is added to X to get an "improved" estimate of X. Using a subscript l for the l-th iteration, the
2. Compute α_l (as described in the above-cited Kruskal paper), and then,
A second option involves a more fully random configuration ("filling" the space more
completely) which can be used by providing a "seed" number for a random number
generator. The configuration, in this case, is generated by choosing points randomly
from a spherical multivariate normal distribution. By choosing different seeds for the
random number generator, of course, different random starts can be used.
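The random spherical start and a single steepest-descent improvement can be sketched as follows. This is a toy: the function names are illustrative, the gradient is taken numerically rather than from the analytic formula, and the dissimilarities themselves are used as the disparities (a purely metric simplification):

```python
import numpy as np

def random_start(n, r, seed=0):
    """Random configuration: n points from a spherical multivariate normal in r dims."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, r))

def stress1(X, delta):
    """Stress Formula 1, with the dissimilarities standing in for the disparities."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)
    return np.sqrt(np.sum((D[iu] - delta[iu]) ** 2) / np.sum(D[iu] ** 2))

def descent_step(X, delta, alpha=0.1, h=1e-6):
    """One steepest-descent step: move alpha along the (numerical) negative gradient."""
    G = np.zeros_like(X)
    s0 = stress1(X, delta)
    for idx in np.ndindex(*X.shape):
        Xp = X.copy()
        Xp[idx] += h
        G[idx] = -(stress1(Xp, delta) - s0) / h   # negative gradient component
    return X + alpha * G

delta = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
X = random_start(3, 2, seed=1)
X_new = descent_step(X, delta)
```

Choosing different seeds gives different random starts, exactly as described above.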
A third option is for the user to provide a starting configuration. This may be a
"rational" start provided by using some other procedure, an a priori configuration of
some kind, or one provided by a previous run of the same program which requires
additional iterations.
All of the options listed above are available in both MDSCAL-5 and in the various
versions of KYST. KYST, KYST2, and KYST-2A have an additional option for the
starting configuration, which is probably the most important algorithmic distinction
between the KYST and the earlier MDSCAL family of MDS programs. This option
entails using an adaptation of the classical metric MDS technique associated with
Torgerson (1958) or Gower (1966) to derive a starting configuration. This starting
configuration is similar to, but not quite identical with, that in programs by F. W.
Young called (generically) TORSCA (Young and Torgerson 1967; Young 1968). In
the variant of this "TORSCA" starting configuration used in KYST, KYST2 and
KYST-2A, a linear transformation of the data is implemented to assure the data values
are all positive and that the ratio between the smallest and largest values has a
reasonable value. (This provides a practical solution to what is sometimes called the
"additive constant" problem in metric MDS methods')
The MDSCAL-5 and KYST programs can cope with a variety of problems arising in
the original dissimilarities data. We shall discuss them in this section.
Missing Data - the program can be set to identify missing observations by reading in a
cut-off value below which data will be treated as missing. The Stress function is
modified by simply omitting, both in the numerator and denominator, the terms which
correspond to the missing cells.
Ties - two approaches are possible for resolving ties between dissimilarities (a tie arises whenever δ_ij = δ_kl). These are called the primary and secondary approaches.
In the primary approach, when δ_ij = δ_kl no restriction is placed on the corresponding d̂'s. Thus, if δ_ij = δ_kl, d_ij may be greater than, less than, or equal to d_kl, without a necessary penalty in the Stress function (since d̂_ij may be greater than, less than, or equal to d̂_kl).
Non-Euclidean Distances - the user of the MDSCAL-5 or KYST programs can choose any Minkowski-p metric, by specifying the value of p (≥ 1.0), thus causing the program to use the following formula for computing d_ij:

d_ij = [ Σ_{t=1}^{r} |x_it − x_jt|^p ]^{1/p} .   (I.A.3)

This option enables one to use this specific class of non-Euclidean distances. The Stress and gradient formulas are changed accordingly. (While p is usually restricted to be ≥ 1.0, values between 0 and 1 can in fact be used, and may be meaningful in some circumstances. If the "1/p" power is omitted, this formula does, in fact, yield a metric. For discussion of this, see Carroll and Wish 1974a.)
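The Minkowski-p formula of Eq. (I.A.3) in executable form (a minimal sketch, not program code from MDSCAL or KYST):

```python
import numpy as np

def minkowski_distance(xi, xj, p):
    """Eq. (I.A.3): d_ij = ( sum_t |x_it - x_jt|^p )^(1/p)."""
    return float(np.sum(np.abs(np.asarray(xi) - np.asarray(xj)) ** p) ** (1.0 / p))

print(minkowski_distance([0, 0], [3, 4], 2))  # 5.0  (Euclidean)
print(minkowski_distance([0, 0], [3, 4], 1))  # 7.0  (city-block)
```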
Definition of Gradient

g_it = −∂S_a/∂x_it ,   (I.A.4)

which, for the Minkowski-p distances of Eq. (I.A.3), evaluates to

g_it = [ 1 / (S_a Σ_jk (d_jk − d̄_a)²) ] Σ_{k: δ_ik present} [ d_ik − d̂_ik − S_a² (d_ik − d̄_a) ] |x_kt − x_it|^{p−1} sgn(x_kt − x_it) d_ik^{1−p} ,   (I.A.5)

where a = 1 or 2 indexes the Stress formula, with d̄_2 = d̄ and d̄_1 = 0.
Both the definition of Stress and of the gradient are necessarily different for the
various "split data" options, which will be described in the section on splitting data
below.
Four basic options exist for performing the regression of d_ij on δ_ij. These are:
Initial Configuration - the user may supply a starting configuration for scaling the
objects. If not, two varieties of a random start can be used, as discussed above. Also, as
discussed earlier, other options exist if solutions in more than one dimensionality are
obtained. Finally, as discussed earlier, in the KYST programs, the "TORSCA"-like
start is another option.
Splitting Data - four options exist for using parts of the data as separate sublists and
then performing separate regressions for each of these sublists. They are:
1. Split by rows
2. Split by groups
3. Split by decks
4. Split no more (a control phrase used to indicate that no more "split" options are to
be specified).
The first three options make each row of every data deck, each group of rows (see
Kruskal and Carmone 1969, for explanation of this) or each data deck a separate sublist,
respectively. The "split no more" option is relevant only when several data decks are
used. It causes all subsequent data decks to be joined into a single sublist until further
indication.
In case any of the "split data" options are used, it is necessary to redefine Stress as follows:

S*_a = [ (1/B) Σ_{b=1}^{B} S_ab² ]^{1/2} ,   (I.A.6)

where b stands for a data "block" (which may be a row, group, or deck, depending on the split option used).
The gradient can be defined easily. Dropping the "a" from S and S*, the overall gradient is simply:

G* = (1/(B S*)) Σ_{b=1}^{B} S_b G_b ,   (I.A.7)

where G_b is the gradient matrix corresponding to block b.
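Assuming Eq. (I.A.6) combines the per-sublist values as a root mean square (an assumption about the exact normalization), the combination step is a one-liner:

```python
import numpy as np

def combined_stress(block_stresses):
    """Overall Stress as the root mean square of the per-sublist Stress values."""
    s = np.asarray(block_stresses, float)
    return float(np.sqrt(np.mean(s ** 2)))

print(combined_stress([0.1, 0.3]))  # sqrt((0.01 + 0.09) / 2) ≈ 0.2236
```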
Data Saving - it is possible to use the same data for performing different methods of
scaling by using the option called "Save Data."
Weighting of Data - the MDSCAL-5 and KYST programs allow for differential
weighting of the original data values. This can be done either by supplying a matrix of
weights in the same way as the data are laid out or by using a FORTRAN subroutine
for generating weights internally. The standard weights are taken as 1.0 for each
observation. Further details on this and other aspects of these programs can be found in
Kruskal and Carmone (1969) or Kruskal, Young and Seery (1973). More information
and a general introductory overview of "two-way" multidimensional scaling generally (as
well as a brief summary of "three-way" MDS) can be found in Kruskal and Wish
(1978).
Some ecological data collected over a period of two years at 5 sites in the harbor of
Ischia in the Bay of Naples are described in detail in a later section (III.C) of this paper.
Also described in that section is the computation of a number of different proximity
(derived dissimilarity) matrices (one for each of the 5 sites, a number for various time
periods, and an "overall" dissimilarity matrix).
While leaving details of this measure for later, we will describe briefly here the
results of applying KYST-2A to the "overall" dissimilarity measure calculated for the
Fresi et al. data. Before KYST-2A could be applied to these data, a subset of the seaworm species had to be eliminated. The reason for this is that our version of KYST-2A would handle only 60 objects (in the present case, the species of seaworms). Inspection of the original data in the Fresi et al. (1983) paper indicated that 33 species were observed only twice in the entire study (i.e., at any one of the sites in any one of four time periods). Thus these 33 were eliminated, leaving a total of 55 species to be analyzed by KYST-2A.
Table 1. Biological names of 88 seaworm species in data from Fresi et al. (1983). Those marked with asterisks were the 55 most frequent species in that data, which were analyzed via KYST-2A.

* 34 Syllidae gen.                                    * 78 Pomatoceros triqueter (Linneo)
* 35 Platynereis dumerilii Audouin et Milne-Edwards   * 79 Hydroides pseudouncinata Zibrowius
  36 Platynereis coccinea Delle Chiaje                  80 Hydroides elegans (Haswell)
* 37 Nereis zonata Malmgren                           * 81 Hydroides dianthus (Verrill)
* 38 Nereis persica Fauvel                            * 82 Serpula concharum Langerhans
* 39 Nereis sp.                                       * 83 Vermiliopsis striaticeps (Grube)
  40 Ceratonereis costae (Grube)                        84 Vermiliopsis sp.
* 41 Perinereis macropus (Claparède)                    85 Filograna implexa Berkeley
* 42 Perinereis cultrifera Grube                        86 Spirobranchus polytrema (Philippi)
  43 Nereidae gen. sp. 1                                87 Protula sp.
* 44 Nereidae gen. sp. 2                                88 Serpulidae gen. sp.
Table 1 indicates the names of all 88 seaworm species analyzed in this paper. The sequential numerical code on the left is actually used in the various plots in this paper. Asterisks indicate the 55 most frequent species analyzed by KYST-2A. The "regression ascending" option was used, with the primary option for ties, and Stress Formula 1. Analyses were done in 6 down to 1 dimension(s).
[Figure 1: Stress (Formula 1) plotted against dimensionality, for the solutions in 6 down to 1 dimensions.]
Figure 1 shows this plot. One often looks for a clear "elbow" in such a plot; that is, a
dimensionality after which STRESS falls off only minimally (and more or less linearly)
with dimensionality. While inspection of Figure 1 does not yield an absolutely clear
"elbow," it was decided that the most appropriate dimensionality was four.
For reasons to be discussed later, the four dimensions were plotted in two planes, the plane defined by dimensions one and three (in Figure 2) and that defined by dimensions two and four (in Figure 3).
[Figure 2: The 55 seaworm species plotted in the plane defined by dimensions one and three, labeled by the sequential codes of Table 1.]
[Figure 3: The 55 seaworm species plotted in the plane defined by dimensions two and four, labeled by the sequential codes of Table 1.]
The 55 seaworm species included in this analysis are shown in these figures, using the
sequential coding indicated in Table 1. Since the present author is not a biologist, and
has no knowledge whatever about these particular species of seaworms, we leave
substantive interpretations of these (and other dimension plots to be seen later) to
subject matter experts.
We also often speak of the number of ways of a data array, as when we refer to two-way, three-way, or higher-way models and methods, for two-way, three-way or higher-way data. The simplest "way" (pardon the ambiguity!) to think of this use of the term "way" is that it is the number of indices, or subscripts, necessary to index the data. The Fresi et al. data to be described in detail shortly can be viewed as four-way data (species x sites x months x years) since we would need four indices to keep track of these four different modalities. If, however, one were to argue (as one well may) that months and years should readily be thought of as a single mode, and thus a single way of the data array (indexed by only one subscript, ranging systematically - say sequentially in time - over all month-year combinations) then it might as easily be formalized as a three-way data array.
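The point can be made concrete with a small array sketch (the month and year counts below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical counts, for illustration only: 88 species, 5 sites, 12 months, 2 years.
data4 = np.zeros((88, 5, 12, 2))   # four-way: species x sites x months x years

# Treating month-year combinations as a single, sequentially ordered "time" way
# gives a three-way array with 12 * 2 = 24 time levels.
data3 = data4.reshape(88, 5, 24)
print(data3.shape)  # (88, 5, 24)
```

The same numbers are stored either way; only the indexing scheme, and hence the number of "ways", changes.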
Our point here is that the number (and nature) of "ways" in a data array is largely
"in the mind of the beholder" (or, more to the point, is dependent on the aims of and/or
conceptual structure imposed by the data analyst/researcher trying to understand a
particular batch of data).
Another term often used in reference to data arrays, already alluded to tangentially above, is "modes". A data mode is a type or category of entity (e.g., the "species" mode, "time" mode, "site" mode, or "variable" mode) which may or may not correspond to the "ways" of the data array. In general, the number of "ways" will be at least as great as the number of "modes", but may be greater (because two or more different "ways" of the data may correspond to the same "mode"). The best example of the latter phenomenon is the case (already considered in Section I) of a two-way, but one-mode n x n (usually, but not necessarily, symmetric) matrix of proximities (similarity or other proximity measures among seaworm species, for example, or correlation coefficients among variables).
Another case in which the number of "ways" exceeds that of "modes" - which we
shall soon encounter - will be a data set that is two-mode (seaworm species - the
"objects" - by "data sources" derived, as will be described, from data corresponding to
various combinations of sites, months and years), but three-way (species x species x data
source). As will be seen in detail shortly, we shall begin with a data array that is either
four-mode and four-way, or three-mode and three-way (depending on whether one feels
"month" and "year" should be treated as separate modes/ways or as a single
mode/way), and derive from this another data array that can be conceived as being
two-mode, three-way data of proximities among the 88 seaworm species ("objects") for
14 different "data sources".
II.A INDSCAL
The INDSCAL program provides two options: a) INDSCAL analysis - fitting the weighted Euclidean model often called the INDSCAL model; and b) CANDECOMP analysis - for scaling stimuli (or other objects) for which (for example) measurements are available on a number of variables (i.e., the input matrices are, in general, rectangular and non-symmetric) in a number of different conditions (e.g., observational contexts, experimental variations, times, sites, or other "modes" or scenarios distinguishing the various object x variable matrices). The CANDECOMP part of the algorithm uses Carroll and Chang's method of canonical decomposition of N-way tables. INDSCAL analysis (option a) actually corresponds to using symmetric CANDECOMP with pre- and post-processing, to be described below.
Assumption 2 - the similarity judgments for each individual are related in a simple way to a "modified" Euclidean distance in the group stimulus space. In particular, the relationship is assumed to be linear (in the metric version) or monotone (in a quasi-nonmetric version). We shall describe the metric version, which is the one used predominantly. (The quasi-nonmetric version is implemented in a program called NINDSCAL, available on the MDS-1 tape, but this will not be discussed further here.)
We assume that the dissimilarity measure, δ_jk^{(i)}, provided by the i-th individual for the pair of stimuli j and k, is related to a modified or weighted Euclidean distance, d_jk^{(i)}, by:

L^{(i)}[δ_jk^{(i)}] ≈ d_jk^{(i)} ,   (II.A.1)

where L^{(i)} is a linear function with positive slope. The subscripts j and k (for stimuli or other objects) range from 1, 2, ..., n and the superscript i (for individuals or other data sources) ranges from 1, 2, ..., m.
The "modified" Euclidean distance for the i-th subject is given by the formula:
(II.A.2)
This formula differs from the usual Euclidean distance formula only in the presence of
the weights Wit, which represent the saliences or "perceptual importances" for the i-th
individual of the 1-th dimension of the group perceptual space, represented by the matrix
X. Another way to express the d}~ 's are as ordinary Euclidean distances computed in a
"private" space for individual i whose coordinates are:
(j) _ 112 (II.A.3)
Yjt - Wit Xjt·
This is a space that is like the X-space except that the configuration has been expanded
or contracted (differentially) in directions corresponding to the coordinate axes. This
can be seen to be a linear transformation with the transformation matrix restricted to be
diagonal (the diagonals being square roots of the w's). This class of transformations is
sometimes referred to as a "strain."
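A short sketch of Eqs. (II.A.2) and (II.A.3), checking that the weighted distance equals the ordinary Euclidean distance in the "private" (strained) space; the function names are this sketch's own:

```python
import numpy as np

def weighted_distance(X, w, j, k):
    """Eq. (II.A.2): weighted Euclidean distance between stimuli j and k."""
    diff = X[j] - X[k]
    return np.sqrt(np.sum(w * diff ** 2))

def private_space(X, w):
    """Eq. (II.A.3): stretch each dimension of the group space X by sqrt(w_t)."""
    return np.sqrt(w)[None, :] * X

X = np.array([[0.0, 0.0], [1.0, 1.0]])   # two stimuli in a 2-D group space
w = np.array([4.0, 1.0])                 # one subject's dimension weights
Y = private_space(X, w)
# The weighted distance equals the plain Euclidean distance in the private space:
print(weighted_distance(X, w, 0, 1), np.linalg.norm(Y[0] - Y[1]))  # both sqrt(5)
```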
The same basic model, but without a method for fitting the model to data, was
proposed independently by Horan (1969). Alternative methods of fitting the INDSCAL
model (sometimes called simply the "weighted Euclidean model") to data have been
proposed by Bloxom (1968), Takane et al. (1977) and Ramsay (1977).
Estimation of Parameters
We now briefly discuss the procedures by which the parameters of the model,
namely, the n x r elements of the X-matrix and the m x r elements of the matrix
W ≡ (w_it) are estimated from dissimilarity judgments on all possible n(n − 1)/2 distinct
pairs of stimuli by m individuals.
The first step in the method of estimation is to convert the dissimilarities into
distance estimates. In view of the linearity assumptions made above, this is done using
the standard procedure described in Torgerson (1958). This method entails estimation
of an additive constant which converts the comparative distances (i.e., the original
dissimilarity judgments) into absolute distances between pairs of stimuli. The method
estimates the smallest value of the constant which guarantees satisfaction of the triangle
inequality for all triples of points. This can easily be shown to be

c_min^{(i)} = max_{j,k,l} [ δ_jl^{(i)} − δ_jk^{(i)} − δ_kl^{(i)} ] .

This constant guarantees that the triangle inequality will be satisfied for all triples of points, with the inequality being a precise equality for at least one triple (the one for which the expression above attains its maximum). It is as though these three points lie precisely on a straight line in the multidimensional space. This is why this scheme is sometimes called the "one-dimensional subspace" method of estimating the additive constant. Any constant larger than c_min^{(i)} would certainly suffice also, but c_min^{(i)} is, as its name implies, the smallest constant guaranteeing this. While there are a number of other schemes of estimating the so-called additive constant (see Torgerson 1958), this one is one of the simplest (both conceptually and numerically) and most assumption-free. Having estimated c^{(i)} in this way, distance estimates, d̂_jk^{(i)}, are calculated as d̂_jk^{(i)} = δ_jk^{(i)} + c^{(i)}.
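The one-dimensional-subspace estimate can be sketched as a brute-force search over triples (an illustrative toy for one subject's symmetric dissimilarity matrix, not the INDSCAL program's code):

```python
import numpy as np

def additive_constant(delta):
    """Smallest c such that delta + c satisfies the triangle inequality for all
    triples (the "one-dimensional subspace" estimate), by brute force."""
    n = delta.shape[0]
    c = -np.inf
    for j in range(n):
        for k in range(n):
            for l in range(n):
                if j != k and k != l and j != l:
                    c = max(c, delta[j, l] - delta[j, k] - delta[k, l])
    return c

delta = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])
print(additive_constant(delta))  # 1.0: after adding it, 4 = 2 + 2 for one triple
```

For this example the three points lie exactly on a line after the constant is added, illustrating the "one-dimensional subspace" interpretation.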
The distance estimates are then converted for each subject to scalar products between the points represented as vectors issuing from an origin at the centroid of all the points. This is done by double centering the matrix whose entries are −(1/2)[d̂_jk^{(i)}]². The resulting numbers, b̂_jk^{(i)}, can be regarded as the estimated scalar products between the vectors y_j^{(i)} ≡ (y_j1^{(i)}, y_j2^{(i)}, ..., y_jr^{(i)}) and y_k^{(i)}. This step is the same as in the "metric" phase of the TORSCA (Young, 1968) algorithm, and in generating the "TORSCA" starting configuration in KYST, KYST2 and KYST-2A (Kruskal, Young and Seery, 1973).
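The double-centering step can be sketched as follows (a generic illustration; writing it with the centering matrix is a standard convenience, not necessarily the program's internal computation):

```python
import numpy as np

def scalar_products(dhat):
    """Double-center the matrix -1/2 * dhat^2, giving scalar products about
    an origin at the centroid of the points."""
    A = -0.5 * dhat ** 2
    n = A.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return J @ A @ J

# Distances among three collinear points at coordinates 0, 1, 3:
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
B = scalar_products(D)
# B equals y y' for the centered coordinates y = (-4/3, -1/3, 5/3)'.
```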
The derivation below shows that these numbers are, in fact, estimated scalar
products. (Readers not interested in this derivation are advised to skip to the section
entitled "Scalar Product Form of INDSCAL Model.")
assume:

Σ_{j=1}^{n} x_jt = 0   for all t = 1, 2, ..., r .   (II.A.5)

(We may do this without loss of generality, since the origin of the x space is arbitrary, and this just fixes it at the centroid of all n points.) Expanding (II.A.4),

(II.A.6)

where

(II.A.7)

and

(II.A.8)

Because of (II.A.5),

b_·k = b_j· = b_·· = 0 .   (II.A.9)
(II.A.12)

where

ē²_· = (1/n) Σ_{j=1}^{n} e_j² .   (II.A.13)
Note that we didn't have to know anything about geometry to derive this result. The law of cosines, for example, was never mentioned.
Note also that this is an exact result for deriving exact scalar products (about an origin at the centroid) from exact Euclidean distances. In practice, of course, we derive estimated scalar products b̂_jk^{(i)}, which, if the INDSCAL model holds, satisfy

b̂_jk^{(i)} ≈ Σ_{t=1}^{r} w_it x_jt x_kt .   (II.A.15)
Thus, the three-way matrix of individuals by stimulus pairs, whose general entries are the values of b̂_jk^{(i)} derived from the dissimilarity data, can, if the INDSCAL model holds, be decomposed into the trilinear form in equation (II.A.15). The problem now is one of estimating values of the X-matrix and the W-matrix whose elements enter into the right-hand side of equation (II.A.15). This estimation (in a least squares sense) can be achieved by a procedure called "canonical decomposition of N-way tables" (now usually abbreviated CANDECOMP). In this particular case, N = 3, since there are three ways, two for stimuli and one for individuals. The CANDECOMP procedure, for the general
case of N = 3, fits the trilinear model

z_ijk ≈ Σ_{t=1}^{r} a_it b_jt c_kt ,   (II.A.16)

where z_ijk represents data, the a's, b's and c's are parameters to be estimated and "≈" here implies least squares estimation. The CANDECOMP procedure provides least squares estimates of these parameters (the a's, b's and c's) via what is now called an Alternating Least Squares procedure but was originally called a NILES (Nonlinear Iterative Least Squares) or NIPALS (Nonlinear Iterative PArtial Least Squares) procedure (see Carroll and Chang, 1970).
In the INDSCAL case the correspondence is:

z_ijk = b̂_jk^{(i)} ,
a_it = w_it ,   (II.A.17)
b_jt ≡ c_jt = x_jt .
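The alternating least squares idea behind CANDECOMP can be sketched as a bare-bones fit of the trilinear model for z_ijk; this is an illustrative reimplementation under modern conventions, not the Carroll-Chang program:

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Khatri-Rao (matching Kronecker) product."""
    return np.stack([np.kron(U[:, t], V[:, t]) for t in range(U.shape[1])], axis=1)

def candecomp_als(Z, r, iters=25, seed=0):
    """Least-squares fit of z_ijk ~ sum_t a_it b_jt c_kt by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = Z.shape
    A = rng.standard_normal((I, r))
    B = rng.standard_normal((J, r))
    C = rng.standard_normal((K, r))
    for _ in range(iters):
        # Solve for each factor matrix in turn, holding the other two fixed.
        A = Z.reshape(I, J * K) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = np.transpose(Z, (1, 0, 2)).reshape(J, I * K) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = np.transpose(Z, (2, 0, 1)).reshape(K, I * J) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Recover an exact rank-1 three-way array:
a = np.array([1.0, 2.0]); b = np.array([1.0, -1.0, 2.0]); c = np.array([3.0, 1.0])
Z = np.einsum('i,j,k->ijk', a, b, c)
A, B, C = candecomp_als(Z, r=1)
```

Each of the three updates is an ordinary linear least squares problem, which is what makes the alternating scheme simple.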
In the algorithm for the INDIFF part of the program (the part that does the actual
INDSCAL analysis using CANDECOMP as a subroutine), the original data and final
solutions are normalized. In the case of the original data, the scalar product matrices
are normalized such that the sum of squares of the scalar product matrix is set equal to
unity for each subject (or "data source"). In the case of INDSCAL analysis, the final
stimulus space is normalized such that the variance of projections of stimuli on the
coordinate axes is equal to unity and the centroid is at the origin. The appropriate
companion normalization is applied to the subject-matrix.
The combination of these two different procedures has one interesting outcome: the
square of the Euclidean distance of a subject's point from the origin can be
(approximately) interpreted as the (proportion of) total variance accounted for in the
scalar products data for that subject. If the dimensions of the stimulus space are
orthogonal, then the square of the Euclidean distance of the subject's point will exactly
equal the proportion of variance accounted for. No normalization of the data is done for
the CANDECOMP option; there is, however, a normalization of the solution.
Specifically, all matrices but the first are normalized to have unit sums of squares for
each dimension. All the differences in sums of squares are then absorbed in the final
matrix. When using CANDECOMP the origins of the various spaces are not
constrained at all.
Input Parameters
The various input parameters of the INDSCAL program are enumerated below:
Data Input Options - these are controlled by a parameter called IRDATA. Eight alternatives are provided in the program, corresponding to integer values of 0 to 7 for IRDATA.
An option exists for not setting matrix 2 equal to matrix 3. In the general
CANDECOMP analysis this option must always be chosen since, in general, the input
matrices are different. In the case of INDSCAL analysis, however, matrix 2 is set equal
to matrix 3 since, by symmetry, these input matrices should be equal. When done in the
latter fashion, we refer to the CANDECOMP analysis (say of the derived scalar
products) as symmetric CANDECOMP.
The INDSCAL program can also be used in solving for the weights assigned by
subjects to a prespecified configuration. The program also has the ability to use a
prespecified configuration as a rational start even in the case in which all matrices are to
be solved for.
More complete details of how to use the INDSCAL program can be found in Chang
and Carroll (1969).
SINDSCAL
The method of analysis used in SINDSCAL is essentially the same as the method of Carroll and Chang (1970) used in INDSCALS (Chang 1971). Therefore, the final stimulus and weights configurations should be identical (except for possible differences due to different convergence criteria, starting configurations, or other numerical details). The principal differences between SINDSCAL, INDSCALS or INDSCAL used with three-way "INDIFF" options lie in the computational procedure and user options.
These changes along with the use of the global optimization feature of the Fortran-
IV compiler result in significant savings in computer charges. Additional savings may
result because SINDSCAL uses dynamic storage allocation. Small data sets may be run
with proportionately smaller computer memory and, therefore, some savings in cost.
(4) sufficient printout throughout the computation so that most of the information
from a run can be recovered if the program gets cut before completion,
(5) no limitation on the number or size of the input matrices due to the use of
dynamic storage allocation,
Since most features of SINDSCAL have already been described in the discussion of
INDSCAL above, we highlight only those features in which it is most distinct from that
earlier program/procedure.
(2) covariances or scalar product matrices in the form of lower-half matrices with
diagonals;
(3) full symmetric matrices of similarities or dissimilarities. The program ignores the
values on the diagonal. In this case, although the upper half of each matrix is
(redundantly) provided as input, only a half matrix is stored, thus allowing the
greater efficiency in memory storage and computation which is the principal
hallmark of SINDSCAL.
for this option in order to prevent the program from stopping before convergence has
been reached.
Plot Options
The program generates plots of all possible planes (defined by pairs of SINDSCAL
coordinates) of the final group stimulus space and weights space. The points may be
numbered or the user may supply either the stimulus or subject labels or both sets of
labels. It is also possible to suppress all plotting.
Relaxation Factor
II.B IDIOSCAL
(To simplify matters, let us suppose for now that all agreed that there were exactly two
dimensions of intelligence.) One school proposed a first (primary) dimension (often
called "G") corresponding to "General Intelligence", with a second (and secondary)
dimension contrasting verbal with quantitative ability. A second school countered that -
quite to the contrary (they felt) - there were two independent, sovereign and equally
theoretically valid dimensions - one a dimension of verbal and a second of quantitative
intelligence! From the perspective of our modern sophisticated multivariate point of
view, replete with manifold degrees of rotational freedom, we see quite clearly that these
two schools were arguing, quite vociferously as it happens, about nothing more than
different rotations of coordinate systems describing the same space of intellectual
"objects" (e.g., specific "abilities" measured by equally specific tests; or, in a dual
manner, specific individuals exemplifying different degrees of these abilities, as measured
by their respective "factor scores"). To derive the IDIOSCAL model as a description of
the perceptual structure of intelligence for these different educational psychologists, we
need only add the assumption that, within each of these "schools" different adherents
attached different saliences, or "perceptual importances", to the two dimensions
characterizing the particular "school" to which that particular scholar subscribed. In
practice, the IDIOSCAL model means that each individual is allowed a generalized
Euclidean metric defined by a positive definite quadratic form. Another (seemingly
different, but mathematically equivalent) interpretation of this quadratic form is possible
in terms of different "subjective intercorrelations" of the same set of coordinate axes.
This latter interpretation is favored by Tucker, Harshman and others.
The model includes as special cases Tucker's (1972) Three-Mode Scaling, based on
three-mode factor analysis, the PARAFAC-2 model and method of R. Harshman
(1972), and a generalization of INDSCAL proposed by Sands and Young (1980). The
method of solution is closely related to one proposed by P. Schönemann (1972), based on earlier work of Meredith (1964).
define a rotation of axes, we define a kind of composite (different from the arithmetic
average) of the actual subjects, which is used to determine a more nearly optimal
orientation. This seems to work well in cases of both real and artificial (errorful) data.
Thus the three phases of IDIOSCAL are very closely analogous to the first three
phases of PREFMAP (for PREFerence MApping of stimulus spaces) which will be
discussed at a later point in this paper. To carry this analogy further, approximate F
tests have been incorporated, as in PREFMAP, which may be useful for distinguishing
between models, and may even help in judging dimensionality.
In each case we begin description of the relevant model, assuming we have already (via assumption and/or appropriate preprocessing) obtained data values we believe to be approximate squared Euclidean distances between stimuli (objects) j and k, for each subject (data source) i, which we shall call [δ_jk^{(i)}]², and state the model assumption for these values.
"Phase I" of IDIOSCAL - The general model.
d_jk^{(i)} = [ Σ_{t=1}^{r} (y_jt^{(i)} − y_kt^{(i)})² ]^{1/2} ,   (II.B.1)

where

y_j^{(i)} = x_j T_i ,   (II.B.2)

so that

[d_jk^{(i)}]² = (x_j − x_k) R_i (x_j − x_k)' ,   (II.B.3)

where

R_i = T_i T_i' ,   (II.B.4)
with

R̄ = (1/m) Σ_{i=1}^{m} R_i .   (II.B.6)

We may, without loss of generality, normalize so that

R̄ = I ,   (II.B.7)

so that the average over subjects of the squared distances is the ordinary squared Euclidean distance in the group space:

(1/m) Σ_i [d_jk^{(i)}]² = (x_j − x_k)(x_j − x_k)' .   (II.B.8)
(see Section II.A for details). Writing Eq. (II.B.3) in summational notation, we have

[d_jk^{(i)}]² = Σ_{(tt')} r*_i^{(tt')} Δ_{(jk)(tt')} ,   (II.B.9)

where the sum is over the r(r+1)/2 combinations (tt') with t ≤ t',

r*_i^{(tt')} = (2 − δ_tt') r_i^{(tt')} ,   (II.B.10)

and

Δ_{(jk)(tt')} = (x_jt − x_kt)(x_jt' − x_kt')   (II.B.11)

(while δ_tt' is the "Kronecker delta"; δ_tt' = 1 if t = t', 0 otherwise). Let
r*_i ≡ [r*_i^{(tt')}] , a column vector of the r(r+1)/2 values for subject i,   (II.B.12)

and

Δ ≡ [Δ_{(jk)(tt')}] ,   (II.B.13)

an (n choose 2) x (r+1 choose 2) matrix, while

d_i[2] ≡ ( [d_12^{(i)}]², [d_13^{(i)}]², [d_23^{(i)}]², ..., [d_(n−1)n^{(i)}]² )' .   (II.B.14)

So, d_i[2] is a column vector of (n choose 2) components. Eq. (II.B.9) can be written in matrix form as:

d_i[2] = Δ r*_i ,   (II.B.15)

where d_i[2] and Δ are known, and r*_i is to be solved for. The least squares solution is

r*_i = (Δ'Δ)^{−1} Δ' d_i[2] .   (II.B.16)
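The regression of Eqs. (II.B.9)-(II.B.16) can be sketched directly; `fit_R` and its layout of the (tt') pairs are this sketch's own conventions, with the factor (2 − δ_tt') folded into the design matrix:

```python
import numpy as np
from itertools import combinations

def fit_R(X, D2):
    """Least-squares estimate of a subject's symmetric R from squared distances,
    following the regression idea of Eqs. (II.B.9)-(II.B.16). X is n x r; D2 is n x n."""
    n, r = X.shape
    pairs = list(combinations(range(n), 2))
    tt = [(t, u) for t in range(r) for u in range(t, r)]
    Delta = np.zeros((len(pairs), len(tt)))
    for p, (j, k) in enumerate(pairs):
        diff = X[j] - X[k]
        for q, (t, u) in enumerate(tt):
            # Off-diagonal (t < u) products occur twice in the quadratic form.
            Delta[p, q] = (1.0 if t == u else 2.0) * diff[t] * diff[u]
    d2 = np.array([D2[j, k] for j, k in pairs])
    coef, *_ = np.linalg.lstsq(Delta, d2, rcond=None)
    # "Unpack" the solution vector into a square symmetric R.
    R = np.zeros((r, r))
    for q, (t, u) in enumerate(tt):
        R[t, u] = R[u, t] = coef[q]
    return R

# Synthetic check: distances generated from a known R are recovered.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
T0 = rng.standard_normal((2, 2))
R0 = T0 @ T0.T
D2 = np.array([[(X[j] - X[k]) @ R0 @ (X[j] - X[k]) for k in range(5)] for j in range(5)])
R_hat = fit_R(X, D2)
```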
Having solved for r*_i, the entries can be "unpacked" in the appropriate way into R_i (a square symmetric matrix). R_i can then be factored into T_i T_i'. One way is to factor

(II.B.17)

R_i = T W_i T' ,   (II.B.20)

for some T (in general, nonorthogonal), and with W_i diagonal. The essence of Schönemann's (1972) analytic solution seems to be that, if Eq. (II.B.20) holds for any two i (say i = 1 and 2) with W_i nondegenerate (that is, all diagonals nonzero) for both, we can solve exactly for T (that is guaranteed at least to "fit" those two). This is because two square symmetric matrices are always simultaneously diagonalizable by a matrix T, which is not, however, orthogonal (in general). Since clearly T is only defined up to post-multiplication by a diagonal, we may, without loss of generality, assume T to be so defined that

W_1 = I .   (II.B.21)

Thus

R_1 = T T' .   (II.B.22)

T can be decomposed as

T = U P V' ,   (II.B.23)

with U and V orthogonal and P diagonal. Thus, factoring R_1 (= U P² U') yields U and P² (and thus P). We may then define

R*_2 = P^{−1} U' R_2 U P^{−1} = P^{−1} U' T W_2 T' U P^{−1}   (II.B.25)

(since U'U = I, and thus P^{−1} U' T = V'). Thus, factoring R*_2 yields V (and, incidentally, W_2, although that is of no real interest). Having thus obtained U, P, and V, they may be put together, according to Eq. (II.B.23), to define T (which may be further post-multiplied by a diagonal matrix, if desired, for normalization purposes). Schönemann
chooses the two subjects, in effect, to be the "average subject" whose R matrix is the
average of those for the real subjects, i.e.,
R̄ = (1/m) Σ_{i=1}^{m} R_i ,   (II.B.26)
plus one of the "real" subjects (apparently arbitrarily chosen). Using the average subject is sensible, from a statistical point of view, and is also correct mathematically since it is easy to show that, if Eq. (II.B.20) holds for all i, then

R̄ = T W̄ T' ,   (II.B.27)

showing that Eq. (II.B.20) also holds for this average subject, with W̄ replacing W_i.
The weakness of Schönemann's (1972) solution from a statistical point of view is the choice of the second subject as some arbitrary real subject. Bad choice of this second subject could result in a very bad solution. Our modification of Schönemann's procedure
rests essentially on a more representative choice of the two subjects (or pseudosubjects,
since both are composites of the "real" ones).
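The simultaneous-diagonalization construction of T for two (pseudo)subjects can be sketched as follows, assuming exact, noise-free R matrices; the variable names are illustrative:

```python
import numpy as np

def solve_T(R1, R2):
    """Given R1 = T T' and R2 = T W2 T' (W2 diagonal, both R's symmetric and
    positive definite), recover a T fitting both, via simultaneous diagonalization."""
    evals, U = np.linalg.eigh(R1)        # R1 = U diag(P^2) U'
    P = np.sqrt(evals)
    Rstar2 = (U / P).T @ R2 @ (U / P)    # P^{-1} U' R2 U P^{-1}
    _, Q = np.linalg.eigh(Rstar2)        # factor the transformed second matrix
    return (U * P) @ Q                   # T = U P Q, so T T' = R1

# Check on synthetic, noise-free input:
rng = np.random.default_rng(0)
T0 = rng.standard_normal((3, 3))
W2 = np.diag([0.5, 1.0, 2.0])
T = solve_T(T0 @ T0.T, T0 @ W2 @ T0.T)
# T reproduces R1 exactly, and diagonalizes R2 (up to column order and sign).
```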
We have pursued two approaches to this. The first is to use a kind of crude
clustering procedure to group the subjects into two groups of about equal size so that the
profiles in the two groups are maximally different. "Average subjects" are then defined
for each of the two groups, and those two group averages are used as the basis for
finding the appropriate T.
In the second approach the first subject is, as in Schönemann's approach, the "average subject," as defined in (II.B.26). Using the U and P found for that subject, we define the matrices R*_i as in (II.B.25). Since R*_i = V W_i V' (with V orthogonal and W_i diagonal), it follows that
so that

Q ≡ Σ_i k_i R*_i = V W̃ V' ,   (II.B.29)

where

W̃ = Σ_i k_i W_i .   (II.B.30)
Q, then, defines the second "pseudosubject". Note that Q is of the same general form as R*_i, so that factoring it should yield V exactly (in the exact case) or approximately (in the more usual case of noisy data). Q, however, provides a composite of all the subjects, but a different one than provided by R̄.
We have tried two different ways of defining Q, differing in the definition of the weights.
One is essentially the unweighted case, in which k_i = 1/m for all i. In the other case, k_i
was defined as:

k_i = r_i² / tr (R*_i)²    (II.B.31)

where r_i is the correlation between d²'s and predicted d²'s in "Phase I" (in which the
general IDIOSCAL model of Eq. (II.B.3) is fit).
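The construction of Q can be sketched as follows (again with hypothetical, noise-free matrices): in the exact case, the eigendecomposition of the symmetric composite Q recovers the columns of V up to sign and ordering.

```python
import numpy as np

rng = np.random.default_rng(1)
r, m = 3, 4

# Hypothetical orthogonal V and diagonal W_i (exact, noise-free case)
V, _ = np.linalg.qr(rng.standard_normal((r, r)))
W = [np.diag(rng.uniform(0.5, 3.0, r)) for _ in range(m)]
R_star = [V @ Wi @ V.T for Wi in W]          # R*_i = V W_i V'

# Unweighted case: k_i = 1/m for all i, so Q = V (mean of the W_i) V'
k = np.full(m, 1.0 / m)
Q = sum(ki * Ri for ki, Ri in zip(k, R_star))

# Factoring Q (eigendecomposition of a symmetric matrix) recovers V
# exactly here; with noisy data it would do so only approximately.
evals, evecs = np.linalg.eigh(Q)
P = np.abs(evecs.T @ V)     # should be a permutation matrix: V recovered
assert np.allclose(P @ P.T, np.eye(r), atol=1e-6)   # up to sign and order
```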
Finding the weights for the INDSCAL model. Once the T yielding the correct
orientation of axes is found (or a hopefully reasonable approximation thereto, as
described above), we may find the INDSCAL weights as outlined below. (The x's below
have presumably been defined by use of this T and so correspond to the "correct"
dimensions). Recalling the INDSCAL model:

(II.B.32)

where d²_i is defined as before (in Eq. II.B.14), δ²_i is the analogous column vector with
[δ²] replacing [d²], while w_i is the row vector (of r components) with general entry w_it,
with

(II.B.35)
Estimation with and without constant term. The estimation schemes above have
involved no additive constant terms for the d²'s. It is conceivable, however, that better
fits could be obtained by adding such constant terms. This means that Eq. (II.B.3) is
modified to become:

(II.B.37)

(II.B.38)

It is straightforward to alter the regression schemes for estimating the R_i's or the
w_it's, as the case may be, to incorporate such a constant. This is done by simply adding
an extra independent pseudovariable (whose values are all 1) to the regression scheme.
This will, of course, change the estimates of the R's and w's to some degree. Inclusion of
this constant has advantages for interpretation of the F ratios to be described later. As
will be seen subsequently, it also seems to improve the fit in Phase II (corresponding to
the INDSCAL approximation).
or
(II.B.40)
where the X matrix is the one derived for the average subject.
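The device of an all-ones pseudovariable can be sketched as follows (hypothetical data; numpy's least-squares routine stands in for whatever regression code is actually used):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = rng.standard_normal((n, 4))       # regression pseudovariables (hypothetical)
true_w = np.array([1.5, -0.7, 2.0, 0.3])
d2 = X @ true_w + 4.0 + 0.05 * rng.standard_normal(n)   # d²'s with additive constant

# Incorporating a constant term is just one extra independent
# pseudovariable whose values are all 1:
X_aug = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X_aug, d2, rcond=None)
intercept, weights = coef[0], coef[1:]   # intercept estimates the constant term
```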
Approximate F tests for comparing the three phases. Since the models in the three
phases are fit by using least-squares linear regressions (with appropriately defined
pseudo-variables), it is possible to define approximate F tests for comparing models in
the three phases (as well as for assessing goodness of fit in each independently). This is
very closely analogous to similar approximate F tests in the PREFMAP, PREFMAP-2
and PREFMAP-3 procedures, for those familiar with these (Sec. III.B). This is most
appropriate when the constant term has been included, since otherwise the residual mean
square is not an unbiased estimate of error variance. The approximate Fs and their
degrees of freedom are defined below (see Table 2).
The "Fs" must, of course, be taken with a large grain of salt since, first of all, the
required normality assumptions cannot be taken seriously, and, secondly, the
configurations (which define the "independent" pseudovariables) have been fitted to the
data. Since, however, these Fs are computed for each subject separately, and since each
subject plays only a small part in determining the configuration in each case, this second
objection can presumably be ignored as the number of subjects grows "large". Possibly
some adjustment of degrees of freedom would correct for it when the number of subjects
is small. Presumably a "jackknife" procedure could be used, even for small numbers of
subjects, but this would be expensive computationally. Analogous approximate F ratios
(called Pseudo-Fs) could be calculated to test "significance" of added dimensions in
INDSCAL or IDIOSCAL. This could conceivably lead to a way of objectively assessing
dimensionality in individual differences scaling. (A somewhat related approach based on
a "leave one out" procedure has recently been investigated by Weinberg and Carroll
1986.)
Table 2. Pseudo-F's for assessing and comparing models fitted in the IDIOSCAL
procedure.

Effect      Pseudo-F                          df1         df2
Phase I     [df2/df1] r_I²/(1 − r_I²)         r(r+1)/2    n(n−1)/2 − r(r+1)/2 − 1
Phase II    [df2/df1] r_II²/(1 − r_II²)       r           n(n−1)/2 − r − 1
Phase III   [df2/df1] r_III²/(1 − r_III²)     1           n(n−1)/2 − 2

NOTE: r_I, r_II and r_III represent correlations (between d² and d̂²) calculated by
IDIOSCAL for a particular individual in phases I, II and III, respectively.
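The pseudo-F of Table 2 is simple to compute. The sketch below (illustrative values only, using the Phase I degrees of freedom) assumes the general form F = (df2/df1) · r²/(1 − r²):

```python
from math import comb

def pseudo_f(r2, df1, df2):
    """Approximate F ratio from a squared correlation, as in Table 2:
    F = (df2/df1) * r**2 / (1 - r**2)."""
    return (df2 / df1) * r2 / (1.0 - r2)

# Illustrative (hypothetical) values: n stimuli, r dimensions, Phase I fit
n, r = 10, 2
df1 = r * (r + 1) // 2            # parameters fitted in Phase I
df2 = comb(n, 2) - df1 - 1        # residual degrees of freedom
F_phase1 = pseudo_f(0.8, df1, df2)
```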
ranged from 1 (for over a dozen of the species) to well over a thousand. Because of this,
our first impulse was to normalize these data by converting to relative frequencies so that
the normalized data for all species would sum to one. While this seemed to us like a
wise first step in normalizing these data, it did not lead to readily interpretable results in
any of the further analyses we attempted. We therefore abandoned this normalization of
the data, and attempted instead an alternate transformation of these data, suggested by
Pierre Legendre, which should have the effect of somewhat more nearly equalizing the
total weight of resulting data values for the various species, as well as reducing the
skewness of these distributions. This transformation was of the form:

z_jlmp = log (f_jlmp + 1)    (II.C.1)

where f_jlmp is the frequency of seaworm species j at site l for month m in year p, and
z_jlmp is the corresponding transformed value. Data transformations are discussed in
some detail in section 2 of Gower's paper in this volume. After this initial
transformation, we then further normalized the data to have zero mean and unit
variance within each site x month x year, so that the final "normalized" data were of the
form

y_jlmp = (z_jlmp − z̄_lmp) / s_lmp    (II.C.2)

where z̄_lmp is the mean and s_lmp is the standard deviation of the z's over all 88 species
for site l in month m in year p.
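The two-step normalization just described can be sketched as follows (hypothetical frequency data; the log(f + 1) form of the first transformation is assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical frequency array: 88 species x 20 (site x month x year) samples
f = rng.poisson(5.0, size=(88, 20)).astype(float)

# Initial transformation z = log(f + 1), reducing skewness and more
# nearly equalizing the total weight of abundant vs. rare species
z = np.log(f + 1.0)

# Normalize to zero mean and unit variance within each site x month x
# year, i.e., over the 88 species in each column (Eq. II.C.2)
y = (z - z.mean(axis=0)) / z.std(axis=0)

assert np.allclose(y.mean(axis=0), 0.0)
assert np.allclose(y.std(axis=0), 1.0)
```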
Number of matrices:
  1  (overall)
  5  (Site l'), l' = S1, …, S5
  2  (Month m'), m' = F, J
  2  (Year p'), p' = 75, 76
  4  (Time t'), t' = F75, J75, F76, J76, with d_jk(t') = √( Σ_l (y_jlm_t'p_t' − y_klm_t'p_t')² )
(where m_t' and p_t' are the month and year, respectively, associated with time period t').
In the above, the sites are encoded simply as S1 through S5, the months as F (for
February) and J (for July), the years as 75 (for 1975) and 76 (for 1976), and the 4
"times" as corresponding combinations of the month and year codes. Obviously, these
various matrices are far from independent of one another. (In fact, just to take one
example, the square of the overall dissimilarity for j and k is just the sum of the squares
of the five site dissimilarities, or of the four "time" dissimilarities.) However, as a "first
start" on an exploratory data analysis for these data, we used the resulting 14 matrices
[1 overall + 5 sites + 2 months + 2 years + 4 times (months x years)] as input to an
INDSCAL analysis, using the SINDSCAL program. In this case the sea worm species
comprised the "stimuli," and the 14 derived dissimilarity measures defined the data
sources. Since each of these dissimilarity matrices was, in fact, defined as a Euclidean
distance matrix, we used the option in SINDSCAL specifying that the data were
Euclidean distances. Thus the total input to SINDSCAL comprised 14 matrices, each a
symmetric half matrix of Euclidean distances among the 88 worms (so each of the 14
matrices had 88·87/2 = 3828 entries, for a total of 14· 3828 = 53592 data values).
Needless to say, this was a rather large data array, at least for so computationally
intensive a procedure as SINDSCAL! Analyses were done in 1 through 6 dimensions.
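The construction of these derived distance matrices, and the additivity of their squares noted above, can be sketched as follows (hypothetical normalized data; only the overall, site, and time matrices are formed here, omitting the month and year matrices for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)
n_species, n_sites, n_times = 88, 5, 4
# Hypothetical normalized data y[j, l, t] (species x site x time period)
y = rng.standard_normal((n_species, n_sites, n_times))

def euclid(profiles):
    """Symmetric matrix of Euclidean distances among species profiles."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_all = euclid(y.reshape(n_species, -1))                # 1 overall matrix
d_site = [euclid(y[:, l, :]) for l in range(n_sites)]   # 5 site matrices
d_time = [euclid(y[:, :, t]) for t in range(n_times)]   # 4 time matrices

# The overall squared distance is the sum of the five site (or four
# time) squared distances, so the matrices are far from independent:
assert np.allclose(d_all ** 2, sum(d ** 2 for d in d_site))
assert np.allclose(d_all ** 2, sum(d ** 2 for d in d_time))
```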
The fit measures (VAF in the derived scalar products) are given in Table 3. Based on
Table 3. Fit measures (variance accounted for in derived scalar products) for
SINDSCAL solutions in 6 down to 1 dimensions for 14 dissimilarity matrices derived
from Fresi et al. data.

Dimensionality   Total Variance
1                .594
2                .778
3                .833
4                .872
5                .900
6                .922

Table 4. Variance accounted for in dimensions from KYST2-A and MDPREF analyses
when mapped into four-dimensional SINDSCAL solution.

Code    R² (VAF) by four SINDSCAL dimensions
k4-1    .988
k4-2    .990
k4-3    .809
k4-4    .603
k3-1    .985
k3-2    .985
k3-3    .718
k2-1    .976
k2-2    .972
k1-1    .840
md1     1.000
md2     .999
md3     .941
md4     .880
the pattern of these fit measures, and on inspection of the results, it was decided to
report the 4-dimensional INDSCAL solution (although it was somewhat debatable
whether the 4 or the 5-dimensional solution should be chosen).
Interpreting these results, since the present author knows little of the biology of the
88 species of seaworms, we focused on the pattern of the 14 different data sources in the
"subject (or source) space." The weights for the 14 data sources on the four dimensions
are displayed graphically in Figures 4 and 5. (As can be seen in the Figures, some of
the source weights are slightly negative, a condition which should not occur in
INDSCAL, since all subject or source weights should be zero or positive. These are only
very slightly negative, however, and so can probably be plausibly interpreted as
essentially zero weights, which have become slightly negative due to error in the data.
We will henceforth interpret these, in fact, as though they are zero.)
Figure 4 shows the dimension one vs. two plane of the source space. Dimension one
has very high weight for sites 4 and 5, and for the 1975 time periods. Since only sites 4
and 5 have very large weights on this dimension and the 1975 time periods have large
weights on it, we may label dimension one (sites 4 and 5; 1975). The corresponding
Figure 4. Dimension one-two plane of source space for SINDSCAL
solution for Fresi et al. data.
variable seems to be one that was particularly prevalent in sites 4 and 5, somewhat in
site 3, and not at all in sites 1 and 2, and quite salient in 1975, but very weakly present
in 1976. Dimension two, on the other hand, seems to be very strongly weighted in site 3,
very slightly in sites 4 and 5, but not at all in sites 1 or 2.
Whatever dimension two taps was especially prevalent in 1976. Thus dimension two
will be labeled (site 3; 1976). One interesting point in these results is that sites 4 and 5
seem to occupy essentially the same location in all four dimensions. Thus these two sites
were, insofar as these analyses are concerned, virtually indistinguishable.
We now look at the plane defined by the remaining 2 dimensions, dimensions 3 and
4, in Figure 5. What "jumps out" at us in this plane is that site 2 has high weight on
dimension 3 and virtually zero weight on dimension 4, while site 1 reverses this pattern,
having almost identically zero weight on dimension three but a very large weight on
dimension four. Sites 3, 4 and 5 have essentially zero weights on both these dimensions,
while the matrices relating to the various time periods (as well as the "all" matrix
corresponding to overall dissimilarities over all sites x time periods) generally have
moderate weights on both. Thus dimension three seems to correspond to whatever
Figure 5. Dimension three-four plane of source space for
SINDSCAL solution for Fresi et al. data.
distinguishes site 2 from the others, while dimension four corresponds to the variable
most prevalent in site 1.
While they must be taken with a fairly large "grain of salt," distances among the
source points have a certain interpretation in these INDSCAL subject (source) space
plots. Without actually doing the computation we can see from inspection of these two
planes that sites 4 and 5 are exceedingly close, and in turn are relatively closer to site 3
than to either sites 1 or 2. Conversely, sites 1 and 2 are by far closer to one another
than to any of the other three sites.
points shown in these figures, for the benefit of those knowledgeable about these species.
(It should be commented that we have reflected some of these dimensions so that the
positive values always tend to imply greater frequency.)
In Figures 6 and 7 vectors are shown indicating the direction best corresponding to
the dimensions derived from the one through four dimensional KYST2-A solutions
shown in Figures 2 and 3. Since there were a total of ten such dimensions (4 + 3 + 2 + 1
for the four through one dimensional solutions, respectively) there are a total of ten
vectors indicated corresponding to these. These are encoded "kr-t" where "kr" stands
for the KYST r-dimensional solution, while t indicates the tth dimension in that solution.
Since, in the case of KYST, the solutions for different dimensionalities do not have any
necessary correspondence, these ten dimensions are all distinct, although it will be noted
that the t-th dimension in solutions for different dimensionalities does tend to correspond
fairly closely, though certainly not perfectly. In addition to these ten dimensions from
the various KYST solutions, four other vectors, labelled md1 through md4, are also
shown. These are four dimensions from another MDS analysis, called MDPREF, which
will be described in section III.
Table 4 gives figures which indicate how well these fourteen dimensions (ten from the
one through four dimensional KYST solutions plus four from the four dimensional
MDPREF analysis) from the other analyses "fit" into the four dimensional SINDSCAL
space. The values in Table 4 are squared multiple correlations (R²'s), which can be
interpreted as proportions of variance accounted for in these fourteen dimensions from
KYST and MDPREF via the four SINDSCAL dimensions. (Since the KYST analyses
had to be done on only a subset of 55 of the seaworm species, these R²'s were
necessarily based only on this subset of 55 of the total 88 species, however.) In fact, the
procedure used for determining these best fitting directions (or vectors) was the
PREFMAP-3 procedure described also in section III. The particular analysis done in
these cases involved fitting the vector model, with linear regression options. In the case
of this particular set of options, PREFMAP is equivalent to the use of multiple linear
regression. Thus we may view these vector directions as being defined by the regression
coefficients from a multiple linear regression. In fact, the projections of these vectors
onto the SINDSCAL coordinate axes are, in the present case, proportional to the Beta
coefficients for these regressions.
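The mapping of an external dimension into the SINDSCAL space by multiple regression can be sketched as follows (hypothetical coordinates and external dimension; the vector-model/linear option of PREFMAP reduces to exactly this computation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 55, 4
X = rng.standard_normal((n, r))          # SINDSCAL coordinates (hypothetical)
w_true = np.array([1.0, -0.5, 0.8, 0.3])
k_dim = X @ w_true + 0.05 * rng.standard_normal(n)   # one external (KYST) dimension

# Fit the external dimension by multiple linear regression; the
# coefficients beta[1:] define the best-fitting vector direction
X_aug = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_aug, k_dim, rcond=None)
fitted = X_aug @ beta

# Squared multiple correlation: proportion of variance accounted for
r2 = 1.0 - ((k_dim - fitted) ** 2).sum() / ((k_dim - k_dim.mean()) ** 2).sum()
```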
The point to be drawn from these PREFMAP/multiple regression analyses is that the
four SINDSCAL dimensions capture quite well essentially all the dimensions emerging
from the other MDS analyses. The fact that the vector directions best representing
these other dimensions do not coincide directly with the coordinate axes indicates,
however, that the SINDSCAL dimensions do not correspond in a simple one-to-one fashion
with these KYST and MDPREF dimensions. Rather, each of these alternative
dimensions corresponds to a different linear combination of the SINDSCAL dimensions.
Since the SINDSCAL dimensions have a unique orientation, while those in the other
solutions are defined only up to arbitrary rotation (or linear transformation), we feel it
appropriate to treat the SINDSCAL solution as defining the "reference space" in terms
of which the others are defined. As already seen, the SINDSCAL dimensions do have a
particularly simple association with the various derived dissimilarity matrices -
particularly with those defined for the five different sites. This suggests that these
dimensions may correspond to variables characterizing the species having especially
meaningful relations to the geographic variables distinguishing sites (as well as,
secondarily, to variables related to the four different time periods).
To correct for the frequency effect we present another pair of plots in which the
following transformations have been effected.
(1) The origin of the space was first translated so that all the coordinate values were
non-negative (by subtracting the smallest algebraic coordinate value on each
dimension from all the coordinates).

(2) After this translation to a "more or less" rational origin (such that essentially all
the very low frequency species are at or very close to that origin) we now multiply
all coordinates of each species point by the reciprocal of its marginal value on the
"z = log (f + 1)" scale. This tends to convert all values to something
approximating a relative frequency scale. The coordinate value on each
dimension after this transformation can be interpreted as the relative value of the
species on that dimension (relative to its overall frequency in the samples taken
from all 20 sites x time periods comprising these data). While these plots, shown
in Figures 8 and 9, are no more interpretable to us than were the earlier figures
(6 and 7) we hope they may help ecologists or other biologists in interpreting
these dimensions.
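The two transformations just listed can be sketched as follows (hypothetical coordinates and species marginals):

```python
import numpy as np

rng = np.random.default_rng(6)
n_species, r = 88, 2
coords = rng.standard_normal((n_species, r))   # hypothetical species coordinates
marginal = rng.uniform(0.5, 5.0, n_species)    # marginal on the z = log(f+1) scale

# (1) Translate the origin: subtract the smallest coordinate value on
#     each dimension, so all coordinates become non-negative
shifted = coords - coords.min(axis=0)

# (2) Multiply each species point by the reciprocal of its marginal,
#     approximating a relative-frequency scale
relative = shifted / marginal[:, None]

assert (shifted >= 0).all()
```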
For general discussions of two- and three-way MDS and related models and methods
for proximity data, we refer the reader to Carroll and Wish (1974a,b), Wish and Carroll
(1974), Kruskal and Wish (1978), Carroll and Arabie (1980), Shepard (1980), Carroll
and Pruzansky (1980, 1986) and Arabie, Carroll and DeSarbo (in press).
Theoretical Discussion
The model assumes that stimulus (object) points are projected onto subject (variable)
vectors, with preference (degree of dominance) being determined by the relative size of
these projected values (the larger value being preferred). Let x_j = (x_j1, …, x_jr)
represent an r-dimensional stimulus point for the j-th stimulus and y_i = (y_i1, …, y_ir)
represent the vector for subject i in the same r-dimensional space. (For simplicity, we
now speak simply of preference of subjects for stimuli; the reader can make the
necessary substitution of terminology if desired.) Then ŝ_ij, the estimated preference
of subject i for stimulus j, is given by:

ŝ_ij = Σ_{t=1}^{r} y_it x_jt = y_i x_j'    (III.A.1)

(the expression on the right being the scalar product in matrix notation). This can be
written, more generally, in matrix notation as follows: let X ≡ (x_jt) be the n × r matrix
of stimulus coordinate values and Y ≡ (y_it) be the m × r matrix of the termini of subject
vectors; then

Ŝ ≡ (ŝ_ij) = Y X' .    (III.A.2)
The problem is to determine the matrices Y and X' from the set of paired comparison
judgments such that Ŝ accounts for the paired comparisons data as well as possible in
some statistically well-defined sense (realized by minimizing an "objective function"
embodying the statistical criterion to be optimized). Carroll and Chang (1964b)
describe procedures - one iterative and one utilizing an Eckart-Young (1936)
decomposition - that accomplish this task. [In more modern terminology, the "Eckart-Young
decomposition" is frequently called, or closely related to, the "singular value
decomposition" (SVD).] It is the latter that is implemented by MDPREF, and that is
described below.
If the input data are already scale values of preference (this matrix S is called the
"first score matrix") the program proceeds to decompose S by the Eckart-Young
procedure, which involves computing eigenvalues and eigenvectors of the matrix S'S or
SS' (whichever is smaller). If the input data are paired comparisons, they are first
converted to a "first score matrix" of scale values by summing over rows and/or columns
of each paired comparisons matrix. Monte Carlo analyses by Carroll and Chang have
indicated that the simpler, Eckart-Young, procedure works as well with errorful data as
the iterative one. This is the reason MDPREF utilizes only the Eckart-Young
procedure. This overall procedure can be shown to have certain least squares properties.
Among other properties, in the case in which the original data are paired comparisons, it
provides a least squares fit in a certain sense to the original paired comparisons data,
schematized as a matrix of plus and minus ones (and possibly some zeros). See Carroll
(1972, 1980) for details.
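The Eckart-Young (SVD) factorization underlying MDPREF can be sketched as follows (hypothetical score matrix; numpy's SVD stands in for the program's own routine). The truncated SVD gives the least-squares best rank-r factorization Ŝ = Y X' of Eq. (III.A.2):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, r = 12, 20, 2                     # subjects, stimuli, dimensions

# Hypothetical "first score matrix" S (subjects x stimuli), row-centered
# as under the first normalization option
S = rng.standard_normal((m, n))
S -= S.mean(axis=1, keepdims=True)

# Eckart-Young / SVD: best rank-r least-squares approximation S_hat = Y X'
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
Y = U[:, :r] * sing[:r]                 # termini of subject vectors (m x r)
X = Vt[:r].T                            # stimulus coordinates (n x r)
S_hat = Y @ X.T                         # "second score" matrix, Eq. (III.A.2)
```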
113
Input Options
As noted earlier, MDPREF has two input options, namely, paired comparisons and
direct judgments of preference scale values. In the case of paired comparisons, options
exist for reading in weight matrices specific to each subject and for handling missing
data. In the case of direct preference judgments (e.g., rankings) two options exist for
normalization - either: a) subtracting row means or b) subtracting row means and then
dividing entries by the standard deviation of values for that row.
Output Details
The following are the major output categories entailed in a typical run of MDPREF:
1. First score matrix normalized according to alternative chosen from above options.
5. Estimates of the first score matrix after factorization. (This is sometimes called
the "second score" matrix.)
7. Plots of some or all pairs of dimensions, including both stimuli and subjects.
(Many different versions of MDPREF exist, with different details regarding this
and other options). See Chang and Carroll (1968) for further details on the
specific version of MDPREF available on the MDS-1 tape.
Theoretical Discussion
2. Further, the preference value for the j-th stimulus of any individual, say the i-th,
is (at least) monotonically related to the squared "distance" between the individual's
ideal point and the location of the stimulus in space. Let the matrix S ≡ (s_ij)
In the metric version of the PREFMAP algorithm, it is assumed that the scale values
of preference are linearly related to squared distance, that is, that F_i is linear.
Assuming F_i has nonzero slope, we may invert it and write:

(III.B.1)

where a and b are constants (a > b) and ≈ denotes approximate equality (except for
error terms not expressed).
Let x_j = (x_j1, …, x_jr) represent the row vector of coordinates of the j-th stimulus
(j = 1, 2, …, n) and y_i = (y_i1, …, y_ir) represent the vector of coordinates of the
ideal point for the i-th individual (i = 1, 2, …, m). Given the above relationship and
input data for x_j and s_ij, the PREFMAP method solves, for each individual, for
estimates of the coordinate values of the vector y_i, and, depending on the model, possibly
for additional parameters associated with individuals.
In model IV the squared distances are defined in a special way which corresponds to
the special case when the ideal point is infinitely distant from the stimuli, so that only its
direction matters. In this special case, the squared distance is actually defined by a
linear equation, and can also be viewed as equivalent to projection on a vector in the
appropriate direction; thus the name "vector model". This equivalence of the linear, or
vector, model to the unfolding model with ideal points at infinity is demonstrated in
Carroll (1972, 1980).
Four alternative models for relating preference data to a given stimulus space, called
models I, II, III and IV, are included in the hierarchy proposed by Carroll and Chang.
The four models correspond, in the obvious fashion, to the four phases of PREFMAP, in
a decreasing order of complexity. Phase I fits a highly generalized unfolding model of
preference (model I); Phase II utilizes a more restrictive model assuming weighted
Euclidean distances analogous to those assumed in the INDSCAL model discussed
earlier; Phase III is the "simple" or Coombsian unfolding model in which ordinary
(unweighted) Euclidean distances are assumed; and Phase IV is the linear, or "vector",
model. Phases I, II and III differ in the way the term d²_ij is formulated, i.e., in the
definition of the metric, while Phase IV can be viewed as putting certain restrictions on
ideal point locations, as discussed earlier.
Phase I
One way to describe the model assumed in Phase I is to assume that both x_j and y_i
are operated on by an orthogonal transformation matrix T_i - which is idiosyncratic for
each subject - and weighted squared distances are then computed from the transformed
values. Thus, one defines:

(III.B.2)

and

(III.B.3)

and then computes the (weighted) Euclidean squared distances d²_ij by the formula:

d²_ij = Σ_{t=1}^{r} w_it (x̃_jt − ỹ_it)² .    (III.B.4)
Phase II
Phase II differs from Phase I in that it does not assume a different orthogonal
transformation for each individual, although it allows differential weighting of
dimensions, so that squared distances are computed simply by
d²_ij = Σ_{t=1}^{r} w_it (x_jt − y_it)² .    (III.B.5)
Phase III
Phase III is the "simple" unfolding model, but it allows the possibility that some or
all of the dimensions have negative weight, making Phase III equivalent to Phase II
with weights w_it = ±1 for each individual. To be precise, the weights w_it = a_t, where
each a_t = ±1.
Phase IV
Phase IV utilizes the vector model in which preference values are related to
coordinates of the stimulus space by an equation (excluding the error term) of the form:

s_ij ≈ Σ_{t=1}^{r} b_it x_jt + c_i .    (III.B.6)

This equation contains only linear terms, so least squares estimates of the b_it's can be
derived immediately by multiple linear regression procedures. Having estimated the
coefficients b̂_i1, b̂_i2, …, b̂_ir, the direction cosines of the vector for the i-th individual
are obtained by dividing each b̂_it by √(Σ_t b̂_it²). Parameters of the other models are also
fit by regression procedures, although these are more complex. The reader is referred to
Carroll (1972, 1980) for a more detailed exposition of this.
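Fitting the vector model of Eq. (III.B.6) for a single subject is ordinary multiple regression; a sketch (hypothetical stimulus coordinates and preference values):

```python
import numpy as np

rng = np.random.default_rng(8)
n, r = 15, 3
X = rng.standard_normal((n, r))                  # stimulus coordinates x_jt
b_true = np.array([2.0, -1.0, 0.5])
s_i = X @ b_true + 0.3                           # one subject's preferences (exact)

# Eq. (III.B.6): s_ij = sum_t b_it x_jt + c_i  --  a linear regression,
# with the constant c_i fit via a column of ones
X_aug = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(X_aug, s_i, rcond=None)
b_i, c_i = coef[:r], coef[r]

# Direction cosines: divide each b_it by the length of the b_i vector
cosines = b_i / np.linalg.norm(b_i)
assert np.isclose(np.linalg.norm(cosines), 1.0)
```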
It may be recalled that the nonmetric version of PREFMAP fits monotonic functions
relating the preference scale values and the squared Euclidean distances between a
subject's ideal point and the stimulus points. This is accomplished by the procedure
described below.
2. Estimate the monotone function M_i^(1) for subject i that best predicts the
estimates (the ŝ_ij^(1)'s) from the original s_ij's, using the procedure described by Kruskal
(1964b) for least squares monotone regression. Define s_ij^(1) ≡ M_i^(1)(s_ij).

3. Replace s_ij with s_ij^(1) to compute a new set of predicted values, ŝ_ij^(2).

4. Using the new set of s_ij's, compute a new monotone function M_i^(2) and a new set
of ŝ_ij's, namely ŝ_ij^(2).

5. Continue this iterative procedure until the process converges (i.e., until no more
changes occur in the monotone function or regression coefficients). Specifically, the
process is terminated by reference to a parameter called CRIT. If the sum of squares of
differences in the predicted ŝ_ij's for the l-th and (l−1)-st iterations is less than CRIT, the
process stops at the l-th iteration.
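Kruskal's least squares monotone regression is conventionally computed with the pool-adjacent-violators algorithm; the sketch below (illustrative values, not taken from PREFMAP itself) shows one monotone step of the iteration:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: the least-squares monotone (non-decreasing)
    regression used in Kruskal's (1964b) procedure."""
    blocks = [[v, 1.0] for v in y]           # [block mean, block weight]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:  # order violation: pool blocks
            m1, w1 = blocks[i]
            m2, w2 = blocks.pop(i + 1)
            blocks[i] = [(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2]
            if i > 0:
                i -= 1                       # re-check against previous block
        else:
            i += 1
    out = []
    for mean, weight in blocks:
        out.extend([mean] * int(weight))
    return out

# One monotone step: fit the function M_i that best predicts the model
# estimates from the original scale values (illustrative numbers)
s = np.array([3.0, 1.0, 4.0, 2.0, 5.0])        # original s_ij for one subject
s_hat = np.array([2.8, 1.5, 3.5, 2.9, 4.9])    # model-predicted values
order = np.argsort(s)
s1 = np.empty_like(s_hat)
s1[order] = pava(list(s_hat[order]))           # s^(1) = M_i^(1)(s_ij)
```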
Input Parameters
In all the PREFMAP programs, the preference data can be expressed in one of two
ways: a) smaller values indicating higher preferences or b) larger values indicating
higher preferences. The programs can start with any prespecified phase and can work
their way down to any model of lower complexity. PREFMAP-3 actually allows
different models to be fit for different subjects in the same analysis.
Other options include: a) normalization of original scale values versus leaving them
as initially defined and b) computing each subject's scale values for each new phase or,
alternatively, using the estimates of the previous phase as the original values for the
following phase. There are also various options concerning whether or not the canonical
rotation and/or weights are computed prior to entering a particular phase.
Output Details
1. Listing of all input parameters selected and the original configuration of stimuli.
2. For each subject the printout of the original scale values, regression coefficients
and estimates of d²_ij (or ŝ_ij, where ŝ_ij = a_i d²_ij + b_i, or equals the projection of
stimulus j on the vector for subject i in the case of the "vector model") for each phase
and for each iteration in the case of the monotone (or nonmetric) version.
3. For Phase I (only) the direction cosines of each subject's idiosyncratic rotation.
4. Coordinates (or direction cosines for Phase IV) of ideal point and weights of the
dimensions specific to each subject. In Phase I, the orthogonal rotation matrix may also
be printed for each subject. Depending on options selected, the canonical rotation matrix
and/or canonical weights may also be provided as output.
5. Plot showing the relationship between the monotone transform of the scale values
and original scale values (optional).
6. Plot showing the positions for ideal points or vector directions of all subjects as
well as stimulus positions.
7. A summary table showing the correlation coefficients for each subject by each
phase and corresponding F-ratios, including F-ratios for testing the statistical
significance of the improvement in fit associated with moving from a simple to a more
complex method. Such an F is associated with every pair of models (IV versus III, II or
I; III versus II and I; and II versus I). In each case, it can be taken as assessing
whether the more complex model (with a lower Roman numeral) fits the data
significantly better than the simpler (higher numbered) model. These tests are possible
because of the hierarchical embeddedness (or nested structure) of these models; that is,
the fact that each "simpler" model is a special case of each more complex one. In terms
of the algebraic structure of the models, each more complex model includes all the
parameters of any simpler model, plus additional parameters. The situation is formally
It would seem in principle to be very interesting to apply the entire family of models
in the PREFMAP hierarchy to the Fresi et al. data. For example, it would seem quite
appropriate to fit model III (the simple unfolding, or "ideal point" model), using each of
the site x time period variables as a pseudo-subject, seeking an ideal point in the four-dimensional
space of seaworm species determined by INDSCAL/SINDSCAL such that
the frequency of species for that site x time period is inversely related to distance from
that ideal point. One could think of this "ideal point" as the species of sea worm most
ideally suited to that particular site/time period combination. Time constraints did not
allow for a thorough analysis of these data via the PREFMAP hierarchy of models,
however. We therefore opted for an internal analysis of the site x time period variables,
using the MDPREF vector model approach. MDPREF, as discussed earlier,
simultaneously determines a space for the "stimuli" (species in this case) and the
"subjects" (sites x time periods) in terms of a vector model. A vector model can
actually be thought of as an unfolding or "ideal point" model with the ideal points all
infinitely distant (or, in practice, very far) from the stimuli (species), so that the vector
direction simply corresponds to the direction of the ideal point from the centroid of the
stimuli (species). It is of interest both to see how well MDPREF accounts for these
data, and also how the structure of the species space relates to that determined by the
three-way INDSCAL/SINDSCAL analysis.
We thus applied MDPREF to these data, treating the seaworm species as "stimuli"
and the 20 sites x months x years as "subjects." The "total and marginal" variance
accounted for (VAF) for dimensionalities from 1 through 20 are displayed in Table 5.
Since we are focusing, in our attempt to interpret these solutions, on the structure of
the variables (sites x time periods), we present the positions of the vectors for these 20
variables separately from the species points in Figures 8 and 10. In these Figures we use
the same coding for these variables as in the Fresi et al. paper; a three symbol (number,
letter, number) code. The first number (1-5) denotes the site, the letter denotes the
month (F = February, L = July), while the third number denotes the year (5 = 1975,
6 = 1976). (We used an "L" rather than a "J" here to encode "July" to maintain
consistency with the coding used by Fresi et al.). MDPREF, unlike
INDSCAL/SINDSCAL, does not produce unique dimensions, so that rotation of coordinate axes
is usually necessary to attain an optimally interpretable set of dimensions. In the present
case, however, perhaps fortuitously, the orientation of axes originally obtained appears to
lead to a quite interpretable structure (without rotation) for these 20 variables. (This is
not entirely a happenstance, no doubt; the principal axis orientation in which MDPREF
dimensions emerge is certainly more likely than a purely random orientation to yield
interpretable structure.)
[Figure: the dimension one versus dimension three plane of the 20 site x time variable vectors (codes such as 2F5, 4L5, 3F6), together with the accompanying species-point plot; only scattered point labels survived the scan.]
site x time variables. Dimension three is more interesting, however. Note that almost
all the variables involving the year 1975 (those whose code ends with "5") weight
positively on that dimension, while those involving 1976 tend to exhibit negative weights.
In fact almost all the variables with a final "5" are in the upper right quadrant, and
almost all those with a final "6" in the lower right quadrant. The most glaring exception
is "1L5" (site 1, in July 1975), which appears just below "1L6" in the lower right-hand
quadrant. We have no definite explanation for this anomaly, although a partial
explanation may be that there is something special about site 1 as a whole on this
dimension. We note that, in general, the variables involving site 1 for a given time
period seem to have systematically lower values on this dimension than do those for the
other four sites. For example, 1F5 has a much lower value than do 2F5, 3F5, 4F5 and
5F5, all of which are at the extreme positive end of dimension 3, while 1F5 is almost at
the zero point. Whatever dimension three corresponds to in its effect on the 88 species
of sea worms, it is a factor that was positive (tended to increase the abundance of those
species at the positive pole of that dimension) in 1975, and negative in 1976. A more
explicitly descriptive way of stating the same thing is that those species at the positive
end tended to be relatively more abundant in 1975, those at the negative end to be
relatively more so in 1976.
[Figures: the dimension two versus dimension four plane of the 20 site x time variable vectors (codes such as 3L6, 2F6, 1F5), and a species-point plane with the fitted KYST vectors k4-2, k4-3 and k4-4; only scattered point labels survived the scan.]
may also more closely resemble that of 1 and 2 than does that of sites 4 and 5, which lie
more distinctly in the harbor area. Figure 13 shows the dimension two-four plane of the
stimulus (species) space, indicating how the seaworm species array themselves on these
dimensions separating the various sites. (Again, it should be noted that overall
frequency of the species has not been normalized here.)
It might be noted, by comparing Figures 1 and 2 to Figures 11 and 13, that the
dimensions emerging from the KYST-2A analysis of the "Overall" dissimilarity matrix
are essentially the same as those (for the seaworm species) in the unrotated MDPREF
analyses. This is true despite the fact that the KYST-2A analysis omitted 33 of the 88
species, and also despite the marked difference in types of analysis. KYST-2A is a
nonmetric technique aimed at accounting for rank orders of these derived dissimilarities,
while MDPREF is a metric technique aimed at accounting for the values of the 88
species on the 20 site x time variables. This congruence of the dimensions in these two
analyses is shown directly by using PREFMAP-3, in a manner essentially identical to
that described in section II.C, to "map" the dimension from the four dimensional
KYST-2A solution into this MDPREF species space. The four vectors representing
these four KYST dimensions (k4-1, k4-2, k4-3 and k4-4), respectively correspond very
closely, as can be seen, to the corresponding dimensions (one through four, respectively)
of the MDPREF solutions. The VAF's (or squared multiple correlations) were: .989,
.991, .806 and .854 respectively. It is not unusual, however, for these two quite different
analyses to produce highly comparable results. The reasons for this probably are
twofold:
(1) The theoretically nonmetric KYST analysis is, in fact, essentially equivalent to a
metric one, since the function relating input dissimilarities (distances) to recovered
distances is almost perfectly linear and, in fact, goes very nearly through the origin,
indicating that the input distances are very nearly ratio scale estimates of the derived
distances. It should be emphasized, as spelled out in more detail below, that this might
not have happened!
(2) The KYST-2A solution is rotated to principal components orientation, while the
MDPREF solution is essentially a principal components solution.
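Mapping an external dimension into a stimulus space with a vector model, as PREFMAP-3 does in its vector phase, amounts to an ordinary multiple regression whose squared multiple correlation is the VAF reported above. A minimal sketch (synthetic stand-in coordinates, not the actual KYST-2A or MDPREF solutions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in coordinates: 88 species in a 4-dimensional "MDPREF" space (hypothetical data).
X = rng.normal(size=(88, 4))
# One external "KYST" dimension, constructed here to lie nearly in that space.
z = X @ np.array([0.9, 0.3, -0.2, 0.1]) + 0.05 * rng.normal(size=88)

# Fit z as a linear combination of the space's dimensions (vector model); the squared
# multiple correlation measures how well the space contains the external dimension.
Xc = np.column_stack([np.ones(88), X])
beta, *_ = np.linalg.lstsq(Xc, z, rcond=None)
z_hat = Xc @ beta
r2 = 1 - ((z - z_hat) ** 2).sum() / ((z - z.mean()) ** 2).sum()
```

The fitted coefficient vector gives the direction of the external dimension in the space, and r2 plays the role of the VAF values quoted in the text.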
The only seemingly important difference between these two solutions vis-à-vis the
"worm" stimuli is in the scaling of these dimensions. Even this is not of any real
significance, however. It merely reflects the fact that, in MDPREF, the stimulus
(seaworm species) space is arbitrarily scaled to unit variance on all dimensions (and
zero covariance, i.e., a "spherical" distribution), while the differential VAF (variance
accounted for) is absorbed in the vectors, whereas in KYST the differential VAF is
reflected in the scaling of the stimulus (worm) dimensions. Thus, in this case at least,
the simple metric MDPREF analysis has recovered essentially the same structure for the
sea worm species as did the more complex and sophisticated KYST-2A procedure, while
MDPREF has also extracted information about the "subjects" (sites x times) in the
form of the 20 vector locations, such that projection of stimulus points onto subject
vectors yields approximations to the original dominance data.
It should be stressed, however, that this simple relationship between these two types
of analysis will not always be exhibited. Particularly in the case of strong nonlinearities
in the data, KYST-2A can yield a lower dimensional, more parsimonious representation
of the stimuli (or other objects) than MDPREF (or other principal components/factor
analytic type models and methods).
As mentioned, MDPREF does not yield unique dimensions, but rather is subject to
rotational indeterminacies. In fact, more generally, a linear transformation of the
stimulus space can be effected, as long as the appropriate companion transformation,
given by the "inverse adjoint" transformation, is applied to the subject vectors.
However, we shall restrict ourselves in the present case to orthogonal transformations,
with possible overall dilations, or scale transformations. Since the inverse adjoint of an
orthogonal transformation is the same orthogonal transformation, this leads to a
particularly simple form (which has other advantages as well). Since the stimulus
spaces in both MDPREF and SINDSCAL are scaled to have equal variances of
projections of stimuli (species) on coordinate axes, restricting the class of
transformations to be orthogonal seems appropriate in this case.
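The rotation to optimal congruence used in what follows (Cliff's variant, closely related to Gower's orthogonal Procrustes analysis) has a closed-form solution through the singular value decomposition. A minimal sketch with synthetic configurations (an overall dilation could be added by rescaling after rotation):

```python
import numpy as np

def procrustes_rotate(A, B):
    """Orthogonal T minimizing ||A @ T - B|| in the least squares (Frobenius) sense."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(2)
B = rng.normal(size=(88, 4))                      # target configuration (e.g. SINDSCAL)
R_true = np.linalg.qr(rng.normal(size=(4, 4)))[0] # some orthogonal matrix
A = B @ R_true.T                                  # same configuration in a rotated basis

T = procrustes_rotate(A, B)
A_rot = A @ T                                     # A after rotation to optimal congruence
```

Here the recovered configuration matches the target exactly because A is an exact rotation of B; with real data the residual arrow lengths (as in Figures 14 and 15) measure the remaining discrepancy.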
Figure 14 shows the dimension one versus two plane of the transformed species space
superimposed on the same planes of the SINDSCAL space. In this representation the
two points representing the same species are connected with one arrow. The terminus
(arrowhead) of the arrow shows the position of the species point in the SINDSCAL
representation, while the origin (shown by an asterisk) shows the point in the MDPREF
representation after rotation to optimal congruence with the SINDSCAL representation.
In this case, the SINDSCAL configuration provides the "target," and the MDPREF
solution is rotated to best congruence in a certain least squares sense (specifically, so that
the sum of squares of the arrow lengths is minimized). The specific procedure used was a
variant of one originally proposed by Cliff (1966), which is closely related to the
"orthogonal procrustes" approach described by Gower in section 9.1 of his paper in this
volume. Figure 15 shows a similar plot for the dimension three-four plane. (It should
be kept in mind that the dimensions referred to here are those from the SINDSCAL
solution, so the one-two plane should be taken as corresponding to those dimensions from
the SINDSCAL analysis, not from the MDPREF solution first described. Since the
MDPREF coordinate system has been completely transformed in this process, there is no
necessary one-one correspondence with those dimensions.) Figures 16 and 17 show the
rotated MDPREF solution in those same two planar projections, but this time with the
(rotated) vectors shown simultaneously (and, in fact, with lines connecting them to the
origin to make their vectorial nature more evident). In Figure 16, showing the
dimension one-two plane of this rotated MDPREF joint representation, we see that all
the vectors for sites 3, 4 and 5 when projected into that plane have substantial lengths,
while those for sites 1 and 2 have lengths, when projected into this plane, that are very
near zero. Thus these two dimensions are accounting for virtually all the reliable
variance for sites 3, 4 and 5, and essentially none for sites 1 and 2. This is consistent
with the fact that, in the SINDSCAL representation, the derived dissimilarity matrices
Figure 17. Same as Figure 16, except that three-four plane of joint
MDPREF representation, after rotation to optimal congruence with
SINDSCAL solution, is plotted showing both species points and
variable vectors.
for sites 3, 4 and 5 had high, clearly non-zero, weights on the corresponding INDSCAL
dimensions, while those derived for sites 1 and 2 had near zero weights. The opposite
pattern shows up in the plane for dimension three and four of this rotated MDPREF
representation shown in Figure 17; the lengths of the vectors for sites 1 and 2 projected
in this plane are substantial, while those for sites 3, 4 and 5 are near zero. Again, this is
consistent with the INDSCAL results. We can also see in this three, four plane a clear
separation between sites 1 and 2, with site 1 having higher weights on dimension four
than three, and site 2 the opposite pattern. In the one, two plane we can see some, but
not as clear, differentiation of site 3 from site 4 and 5. These three sites are much more
"mixed up" in this representation than in others we have seen. There is some hint of the
differentiation based on year (1975 versus 1976) in the vectors for sites 3, 4 and 5 in this
plane, however.
Almost all the vectors lie in the positive quadrants of these two planes, so the weights
on these four dimensions are almost all positive or zero. This suggests that the use we have made of SINDSCAL
in this case may provide a very effective basis for rotation of an MDPREF type
representation to a special kind of generalized "simple structure."
It now only remains for ecologists and biologists to "interpret" the dimensions in
terms of their effects on the seaworm species. We happily defer that privilege to these
experts. To aid such experts in this creative endeavor, however, we provide a final table,
Table 6, in which the coordinates for the 88 seaworm species on the dimensions of the
four different configurations discussed in this paper are presented.
Acknowledgments. Invaluable help in conducting the data analyses reported and other
technical help in preparing this paper were provided by Rhoda T. Iosso and Barbara B.
Hollister. Thanks are also due to Martina Bose and to Karen Golday for word
processing and other technical assistance. Finally, I am greatly indebted to
Pierre Legendre and Joseph B. Kruskal, plus two anonymous reviewers, for careful
readings of the paper at various stages of its development, leading to enormous
improvements in its content.
REFERENCES
New York.
Young, F. W. 1968. TORSCA-9: A FORTRAN IV program for nonmetric
multidimensional scaling. Behavioral Science 13: 343-344.
Young, F. W., and W. S. Torgerson. 1967. TORSCA, a FORTRAN IV program for
Shepard-Kruskal multidimensional scaling analysis. Behavioral Science 12: 498.
THE DUALITY DIAGRAM:
Y. Escoufier
Unite de Biometrie
ENSA-INRA-USTL
9, place Pierre Viala
F-34060 Montpellier Cedex, France
∀ j = 1, ..., p:   x̄j = (1/n) Σ_{i=1}^n xij

and the centered data matrix

X̂ = (I_{n×n} − (1/n) 1n 1n') X.

With Vq = Σ_{a=1}^q λa ua ua' built from the first q eigenvalues λa and eigenvectors ua of V,

Tr((V − Vq)²) = Σ_{a=q+1}^p λa²     (1.5)

where Tr represents the trace of the matrices, i.e. the sum of their
diagonal elements. If the sum Σ_{a=q+1}^p λa² is sufficiently small, the
covariances and variances of the p variables can be visually
appreciated.
This leads us to investigate the matrix W = XX'/n that
plays the same role for the units as V does for the variables.
We then set

W = (1/n) Σ_{a=1}^p ψ*a ψ*a'     (1.8)

where the ψ*a = X ua satisfy (1/n) ψ*a' ψ*a = λa. Let

Wq = (1/n) Σ_{a=1}^q ψ*a ψ*a'     (1.9)

It has been shown (note after expression 1.14) that for every n×n
matrix Aq of rank q < p,

Tr((W − Aq)²) ≥ Tr((W − Wq)²) = Σ_{a=q+1}^p λa²     (1.10)
142
From V = Σ_{a=1}^p λa ua ua' and W = (1/n) Σ_{a=1}^p ψ*a ψ*a' we can also
conclude that

Vkk = Σ_{a=1}^p λa u²ak     (1.12)

Hence λa u²ak is the participation of the ψ*a axis
in the reconstruction of the variable k,
in actual fact the reconstruction of the
variance Vkk.
Note: Expressions 1.5, 1.10 and 1.14 come from the well-known
result by Eckart and Young (1936). It is of importance to remark
that Vq, Wq and X̂q are not only optimal for the least squares
criterion given here by Tr(.), but also for an infinity of other
criteria (Rao 1980; Sabatier et al. 1984).
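The Eckart and Young (1936) result invoked here is easy to check numerically. A minimal sketch (random data, q = 2) verifying that the rank-q truncation of the eigendecomposition of V attains the residual of expression 1.5:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
X = X - X.mean(axis=0)                       # centered data
n, p = X.shape
V = X.T @ X / n                              # covariance matrix V = X'X/n

lam, U = np.linalg.eigh(V)                   # ascending order; reverse to lam_1 >= ... >= lam_p
lam, U = lam[::-1], U[:, ::-1]

q = 2
V_q = (U[:, :q] * lam[:q]) @ U[:, :q].T      # rank-q approximation of V
residual = np.trace((V - V_q) @ (V - V_q))   # Tr((V - V_q)^2)
assert np.isclose(residual, (lam[q:] ** 2).sum())   # expression 1.5
```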
If we denote by D = (1/n) I_{n×n} the diagonal matrix with elements 1/n, then
X̂ = (I_{n×n} − 1n 1n' D) X     (II.1)

V = X̂' D X̂     (II.2)

[Of the remaining Section II expressions only fragments survive the scan: the residual sum Σ_{a=q+1}^p λa² (II.10), the unit participations (ψ*ai)² Dii (II.11'), and the sum Σ_{a=q+1}^p λa (II.14).]
∀ a = 1, ..., p:   M'VM ua = λa ua   with ua'ua' = δaa'.

For each a = 1, ..., p, consider φa defined by ua = M'φa, and set

M M' = Q

We have M'VQ φa = λa M'φa with φa' M M' φa' = δaa', i.e.

VQ φa = λa φa   with φa' Q φa' = δaa'     (III.3)

From M'VM = Σ_{a=1}^p λa ua ua' = Σ_{a=1}^p λa M'φa φa' M, hence

VQ = Σ_{a=1}^p λa φa φa' Q     (III.4)

Moreover we have

Tr((M'VM − Σ_{a=1}^q λa ua ua')²) = Tr((VQ − Σ_{a=1}^q λa φa φa' Q)²) = Σ_{a=q+1}^p λa²     (III.5)

with ψ*a = X Q φa and

ψ*a' D ψ*a = φa' Q X'D X Q φa = λa     (III.7)
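A PCA of a triplet (X, Q, D) along these lines can be sketched numerically (our illustration, with identity metric Q and uniform weights D for simplicity; a Cholesky factor M with Q = MM' turns the VQ eigenproblem into a symmetric one):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = rng.normal(size=(n, p))
d = np.full(n, 1.0 / n)                     # unit weights (uniform here, summing to 1)
D = np.diag(d)
Q = np.eye(p)                               # metric on the variable space (identity here)

Xc = X - d @ X                              # D-weighted centering, as in II.1
V = Xc.T @ D @ Xc                           # V = X'DX  (II.2)

# Solve VQ phi = lambda phi via the symmetric problem M'VM u = lambda u with Q = MM'.
M = np.linalg.cholesky(Q)
lam, Uu = np.linalg.eigh(M.T @ V @ M)
lam, Uu = lam[::-1], Uu[:, ::-1]            # descending eigenvalues
Phi = np.linalg.solve(M.T, Uu)              # phi_a = (M')^{-1} u_a, hence Q-orthonormal

psi = Xc @ Q @ Phi                          # principal components psi*_a = X Q phi_a (III.7)
```

The checks below confirm the two normalizations stated above: the axes are Q-orthonormal and the components have D-weighted variances equal to the eigenvalues.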
Σ_{j=1}^p xij ej,

where (e1, ..., ep) is a system of p linearly independent vectors
of E, i.e. a basis of E.
Symmetrically the j-th variable is considered as a
vector of F = Rn. It will be written as

Σ_{i=1}^n xij fi,

where (f1, ..., fn) is the basis of F.
i=l
Q 1
E* -------:X,.....----4)
TD
F=Rn
IV - ON APPLICATIONS CONCERNING D
IV.1. Special centering
Since the duality diagram just described coincides
exactly with sections I and II using the matrix X, the weights D
can be included as follows:
[Two duality diagrams shown in the original, with the centered matrix (I_{n×n} − 1n 1n' D) X in place of X; the printed diagrams could not be recovered from the scan.]
Let X3 = (1n | X2) and P3 = X3 (X3' D X3)⁻¹ X3' D. Based on the orthogonality of 1n and
the columns of X̂2 = (I_{n×n} − 1n 1n' D) X2,

(I_{n×n} − X̂2 (X̂2' D X̂2)⁻¹ X̂2' D)(I_{n×n} − 1n 1n' D) = I_{n×n} − P3
ii) W D = (I_{n×n} − P3) X Q X' (I_{n×n} − P3') D

Because P3 is idempotent and P3 1n = 1n, we have 1n' D W = 0, and the
principal components satisfy 1n' D ψa = 0; the principal components are centered for D.

[Duality diagram for the triplet with weight matrix (1/n) C⁻¹ on the units; the printed diagram could not be recovered from the scan.]
where C is the serial correlation matrix with entries Cij = ρ^|i−j|:

C =
  1        ρ        ρ²      ...  ρ^{n-1}
  ρ        1        ρ       ...  ρ^{n-2}
  ...      ...      ...     ...  ...
  ρ^{n-1}  ρ^{n-2}  ...     ρ    1
with inverse

C⁻¹ = 1/(1−ρ²) ×
  1     −ρ     0      ...   0
  −ρ    1+ρ²   −ρ     ...   0
  ...   ...    ...    ...   ...
  0     ...    −ρ     1+ρ²  −ρ
  0     ...    0      −ρ    1
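Both the matrix C and the stated form of its inverse are easy to check numerically; a small sketch (n and ρ arbitrary):

```python
import numpy as np

def ar1_corr(n, rho):
    """Correlation matrix C with entries rho**|i-j| (serially correlated units)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_inv(n, rho):
    """The tridiagonal inverse of C, with the common factor 1/(1 - rho**2)."""
    A = np.zeros((n, n))
    np.fill_diagonal(A, 1.0 + rho ** 2)
    A[0, 0] = A[-1, -1] = 1.0                # the two end diagonal entries are 1
    i = np.arange(n - 1)
    A[i, i + 1] = A[i + 1, i] = -rho         # off-diagonal entries are -rho
    return A / (1.0 - rho ** 2)

C = ar1_corr(5, 0.6)
assert np.allclose(C @ ar1_inv(5, 0.6), np.eye(5))
```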
o
lfA (+~
o • . . + D
p
and /:;
which goes back, from the point of view of the statistical units
under consideration, to the representation given by the PCA of
(Y (Y'D Y)⁻¹ Y'D X, Q, D).
VI - CONCLUSION
A deeper mathematical understanding of the steps taken
in a normal PCA program based upon the variance matrix opens up
numerous paths for theoretical and practical work.
This does not challenge the usual methods of data
analysis, which are still a reasonable compromise between current
knowledge and what the user is willing to do in terms of cost,
whether it be the cost of the mathematical training necessary for
understanding, or for computations.
REFERENCES
ANDERSON, T.W. 1958. An introduction to multivariate statistical
analysis. John Wiley & Sons, New York, NY.
ARAGON, Y., and H. CAUSSINUS. 1980. Une analyse en composantes
principales pour des unites statistiques correlees,
p. 121-131. In E. Diday et al. [ed.] Data analysis and
informatics. North Holland Publ. Co., New York, NY.
BESSE, Ph., and J.O. RAMSAY. 1986. Principal components analysis of
sampled functions. Psychometrika (in press).
BONIFAS, L., Y. ESCOUFIER, P.L. GONZALEZ, and R. SABATIER. 1984.
Choix de variables en analyse en composantes principales.
Revue de Statistique Appliquee, Vol. XXXII, no. 2: 5-15.
CAILLIEZ, F., and J.P. PAGES. 1976. Introduction a l'analyse des
donnees. SMASH, 9, rue Duban, Paris 75010.
ECKART, C., and G. YOUNG. 1936. The approximation of one matrix by
another of lower rank. Psychometrika, Vol. 1, no. 3: 211-218.
ESCOUFIER, Y. 1982. L'analyse des tableaux de contingence simples
et multiples. Metron, Vol. XL, no. 1-2: 53-77.
ESCOUFIER, Y. 1985. L'analyse des correspondances : ses proprietes
et ses extensions. Institut International de Statistique.
Amsterdam: 28.2.1-28.2.16.
ESCOUFIER, Y., and P. ROBERT. 1979. Choosing variables and metrics
by optimizing the RV-coefficient, p. 205-219. In J.S. Rustagi
[ed.J Optimizing methods in statistics. Academic Press Inc.
GOODMAN, L.A., and W.H. KRUSKAL. 1954. Measures of association for
cross-classifications. J. Amer. Stat. Ass., Vol. 49: 732-764.
KAZMIERCZAK, J.B. 1985. Une application du principe de Yule:
l'analyse logarithmique. Quatriemes Journees Internationales
Analyse des donnees et informatique. Versailles, France.
(Document provisoire: 393-403).
LAURO, N., and L. D'AMBRA. 1983. L'analyse non symetrique des
correspondances, p. 433-446. In E. Diday et al. [ed.] Data
analysis and informatics III. Elsevier Science Publ. BV,
North Holland.
LEBART, L., A. MORINEAU, and J.P. FENELON. 1979. Traitement des
donnees statistiques. Dunod.
MORRISON, D.F. 1967. Multivariate statistical methods. McGraw-Hill
Book Co.
PAGES, J.P., F. CAILLIEZ, and Y. ESCOUFIER. 1979. Analyse factorielle:
un peu d'histoire et de geometrie. Revue de Statistique
Appliquee, Vol. XXVII, no. 1: 6-28.
RAO, C.R. 1980. Matrix approximations and reduction of dimensionality
in multivariate statistical analysis, p. 3-22. In P.R. Krishnaiah
[ed.] Multivariate analysis V. North-Holland Publ. Co.
SABATIER, R., Y. JAN, and Y. ESCOUFIER. 1984. Approximations
d'applications lineaires et analyse en composantes principales,
p. 569-580. In E. Diday et al. [ed.] Data analysis and
informatics III. Elsevier Science Publ. BV, North Holland.
NONLINEAR MULTIVARIATE ANALYSIS WITH OPTIMAL
SCALING
Jan de Leeuw
Department of Data Theory FSW
University of Leiden
Middelstegracht 4
2312 TW Leiden, The Netherlands
INTRODUCTION
It has already been pointed out by many authors that multivariate analysis is the
natural tool to analyze ecological data structures. Gauch summarizes the reasons for
this choice in a clear and concise way. "Community ecology concerns assemblages
of plants and animals living together and the environmental and historical factors
with which they interact. ... Community data are multivariate because each sample
site is described by the abundances of a number of species, because numerous
environmental factors affect communities, and so on. ... The application of
multivariate analysis to community ecology is natural, routine, and fruitful."
(Gauch 1982, p. 1). Legendre and Legendre discuss the ecological hyperspace
implicit in Hutchinson's concept of a fundamental niche. "Ecological data sets are
for the most part multidimensional: the ecologist samples along a number of axes
which, depending on the case, are more or less independent, with the purpose of
finding a structure and interpreting it." (Legendre and Legendre 1983, p. 3).
A number of possible ecological applications of multivariate techniques are
mentioned in the following quotation from the recent book by Gittins (1985).
"Ecology deals with relationships between plants and animals and between them and
the places where they live. Consequently, many questions of interest to ecologists
call for the investigation of relationships between variables of two distinct but
associated kinds. Such relationships may involve those, for example, between the
plant and animal constituents of a biotic community. They might also involve, as in
plant ecology, connections between plant communities and their component species,
on the one hand, and characteristics of their physical environment on the other. As
another example, comparative relationships among a number of affiliated species or
populations with respect to a particular treatment regime in a designed experiment
might be studied. In more general terms, the question which arises calls for the
exploration of relationships between any two or more sets of variables of ecological
interest." (l.c., page 1).
It is of some importance to observe that Gittins gives a somewhat limited
description of the possibilities of multivariate analysis here. The reason being, of
course, that his book is about canonical analysis, a rather specific class of
multivariate techniques. We can study relationships between sets of variables, as in
the various forms of canonical analysis, but also relationships within a single set of
variables, as in the various forms of clustering and component analysis. In
classification and ordination, for example, we usually deal with a single set of
variables. Each species in the study defines a variable, assigning abundance numbers
to a collection of sites. It may seem natural to relate sets of variables if we want to
study abundance or behaviour of species in relation to the environment, but it would
be more appropriate to analyze the within-structure of a single set if we describe the
structure of a single community or location. And if we want to study the interaction
between members of a community, under various circumstances, it may be even
more appropriate to use techniques derived from multidimensional scaling, for
which the basic data are square interaction or association matrices and the basic
units are pairs of individuals.
it is even clear that the usual assumptions are not satisfied at all. Multivariate
normality and complete independence are quite rare in practice. Thus instead of
starting with a model and trying to fit in the data, we start with the data and we try to
find a structure or model that can describe or summarize the data. These two
approaches correspond, of course, with the age-old distinction between induction
and deduction, between empiricism and rationalism. In recent discussions the
concepts of exploration and confirmation, and of description and inference, are
often contrasted. Data analysts generally feel that the models of classical statistics
are much too strong and too unrealistic to give good descriptions of the data. And,
of course, mathematical statisticians feel that the techniques of data analysis very
often lead to unstable results, that are difficult to integrate with existing prior
knowledge. It will not come as a surprise, that we think that both approaches have
their value. If there is strong and reliable prior knowledge, then it must be
incorporated in the data analysis, because it will make the results more stable and
easier to interpret. But if this prior knowledge is lacking, it must not be invented
just for the purpose of being able to use standard statistical methodology. And,
certainly, we must not make assumptions which we know to be not even
approximately true. Finally there are many situations in which good statistical
procedures can in principle be applied, on the basis of firm prior knowledge, but in
which there simply are not enough data to make practical application possible. In
such situations a data analytical compromise is needed too.
There are some interesting problems in the application of various multivariate
analysis techniques to ecology. They have been admirably reviewed by Noy-Meir
and Whittaker (1978). We mention them briefly here, but we shall also encounter
them again in our more formal development below. The distinction between R and
Q techniques has been discussed extensively by psychometricians such as Cattell and
Stephenson. It is based on the fact that we can think of the species as ordering the
samples, but also of the samples as ordering the species. In a given data structure we
have to decide what the variables are, and what the units are on which the variables
are defined. Sometimes the choice is clear and unambiguous, sometimes the
situation is more complicated. As a second problem Noy-Meir and Whittaker
mention data transformation and the choice of similarity measures. We could
generalize this somewhat to the problem of data definition and expression. This has
as special cases the choice of centering and standardization, but also taking logarithms
or using any of the other reexpression techniques discussed by Legendre and
Legendre (1983, p. 11-18). The nonlinear multivariate techniques explained in our
paper take a radical point of view, by assuming that the expression of the variable in
the data matrix is essentially conventional, merely a coding. Thus the reexpression
problem does not have to be solved before the technique is applied, but it is an
MULTIVARIABLES
is based on descriptors. In this text the term descriptor will be used for the
attributes, variables, or characters (also called items in the social sciences) that
describe or compare the objects of the study. The objects that the ecologists
compare are the samples, locations, quadrats, observations, sampling units or
subjects which are defined a priori by the sampling design, before making the
observations." (Legendre and Legendre 1983, p. 8). For variables we use the
familiar notation φ : Ω → Γ. Here Ω is the domain of the variable, consisting of the
objects, and Γ is its target, containing the possible values of the variable. Elements
of the target are also called the categories of a variable. A variable φ associates with
each ω ∈ Ω a category φ(ω) ∈ Γ. In practical applications and in actual data analysis
the domain Ω will be a finite set {ω1, ..., ωn}. For theoretical purposes the domain
can be infinite. If Ω is a probability space, for instance, and φ is measurable, then
our variable is a random variable. Targets can be finite or infinite as well. In many
cases the target is the reals or the integers, i.e. Γ = ℝ = ]−∞,+∞[, or Γ = ℕ =
{0,1,2,...}. But it is also possible that Γ = {short grass, short grass/thicket, tall
grass with thicket} or Γ = {close, moderate, distant}.
Table 1.5 from Legendre and Legendre (1983, p. 9), that we copy here, shows
the types of targets we can expect to encounter. Most of the terminology will
probably be clear, but we refer to Legendre and Legendre (1983, p. 10-11) for
further explanation.
      1  2  3  4  5
01    W  P  O  B  C
02    W  P  O  B  C
03    W  P  O  B  C
04    Y  P  O  B  C
05    Y  P  O  B  C
06    Y  P  B  B  B
07    Y  P  O  B  B
08    Y  A  O  B  C
09    W  P  B  B  C
10    Y  P  O  Y  C
11    Y  P  O  Y  C
12    Y  A  O  Y  C

Table 1. Bird data from Mayr.
sometimes debatable.
The next example is also representative, but a bit more problematical. It is taken
from Legendre and Legendre (1983, p. 191). Five ponds are characterized by the
abundances of different species of zooplankton, given on a scale of relative
abundance varying from 0 to 5. It is clear that this matrix is also based on
aggregation, of the same sort as in the Gittins example. But we can also use it to
illustrate transposition, or the choice between Q and R. In this example we can take
the species as units, and the ponds as variables. Each pond maps the eight species into
the target {0,1,2,3,4,5}. It is also possible to interpret the ponds as units and the
species as variables, again with the same target {0,1,2,3,4,5}. We can also treat the
example as bivariate. The grand-total of the data matrix is 52. These 52 'abundance
credits' are used as the units, and the two variables are SPECIES and PONDS. Thus
there are three credits with species-value 1 and pond-value 212, and four credits
with species-value 5 and pond-value 214, and so on. The data matrix is, in this
interpretation, the cross table of the two variables. And finally we can use the 40
ponds and species combinations as units, and interpret our results as measurements
on a variable that maps these 40 combinations into {0,1,2,3,4,5}. Two other
variables can be defined on these units. The first one is POND, with five values in
its target, and the second one is SPECIES, with eight values. In this last
interpretation there are consequently 40 units, and three variables. There are no
clear a priori reasons for preferring one interpretation over the other. The choice
must be made by the investigator, in combination with the choice of the data analysis
technique.
                 Ponds
Species   212  214  233  431  432
   1       3    3    0    0    0
   2       0    0    2    2    0
   3       0    2    3    0    2
   4       0    0    4    3    3
   5       4    4    0    0    0
   6       0    2    0    3    3
   7       0    0    0    1    2
   8       3    3    0    0    0

Table 2. Zooplankton data of Legendre.
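The alternative readings of Table 2 described above are easy to make concrete; a small sketch using only the counts in the table:

```python
import numpy as np

# Zooplankton abundances of Table 2: rows = species 1..8, columns = ponds.
ponds = [212, 214, 233, 431, 432]
A = np.array([[3, 3, 0, 0, 0],
              [0, 0, 2, 2, 0],
              [0, 2, 3, 0, 2],
              [0, 0, 4, 3, 3],
              [4, 4, 0, 0, 0],
              [0, 2, 0, 3, 3],
              [0, 0, 0, 1, 2],
              [3, 3, 0, 0, 0]])

# Last interpretation: 40 species x pond combinations as units, each carrying the
# three variables SPECIES, POND and ABUNDANCE.
units = [(s + 1, ponds[p], A[s, p]) for s in range(8) for p in range(5)]
assert len(units) == 40

# Bivariate interpretation: 52 "abundance credits" as units; the data matrix is then
# the SPECIES x PONDS cross table of those credits.
assert A.sum() == 52
```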
[Figure 1. Quantification diagram: the variable φ : Ω → Γ (domain → target), the quantification Γ → ℝ (reals), and their composition, the quantified variable Ω → ℝ.]
Let us look at the second part of Table 3. This contains the same information as
the first three columns, but coded differently. In the terminology of De Leeuw
(1973) we call the codings of the variables indicator matrices, but in other contexts
they are also called dummies. One interpretation is that SPECIES, for instance, is
now coded as a set of eight different binary variables. The total number of
variables, in this interpretation, is now equal to 19, which is the total number of
categories of SPECIES, POND, and ABUNDANCE. The important property of
indicator matrices, for our purposes, is that each possible quantification of the
variables is a linear combination of the columns of the indicator matrix of that
variable. Or, if there are n objects, we can say that the columns of the indicator
matrix form a basis for the subspace of ℝn defined by the quantifications of the
variable. The columns span the space of possible quantifications.
Suppose G t is the indicator matrix of variable t. Assume that there are n objects
and that variable t has kt categories. Then G t has n rows and kt columns. The matrix
D t = Gt'G t is diagonal, i.e. the columns of G t are orthogonal (the categories of a
variable are exclusive). And the rows of G t sum to unity (the categories are
exhaustive). A quantification of the categories maps the kt-element set Γt into the
reals, and is thus a kt-element vector. Write it as Yt. Then qt, the quantified variable,
is given by the product qt = GtYt. Given vectors Yt of category quantifications we
can construct quantified variables, and given quantified variables we can construct
the correlation matrix R(Y). We limit our attention to normalized quantifications. If
u is used for a vector with all elements equal to +1, the number of elements of u
depending on the context, then we want u'qt = u'GtYt = u'DtYt = 0 and qt'qt =
Yt'DtYt = n. If sand t are two variables, with corresponding indicators and
normalized quantifications, then the correlation between the quantified variables is
given by rst = n⁻¹ Ys'CstYt, where Cst =df Gs'Gt is the cross-table of variables s and
t. Observe that Dt = Ctt. Our formulation of the quantification problem in terms of
vectors and matrices shows that the correlations rst are functions of the bivariate
frequencies, collected in the cross-tables Cst, and the category quantifications Yt.
For a given problem, i.e. a given coding of a fixed data set, the Cst are constant and
known, but varying the Yt will give varying correlation coefficients. The
comparison of integer scaling and criterion scaling in the previous section was a
first example of this.
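A minimal numerical sketch of this machinery (toy category codes, not data from the text): indicator matrices, the diagonal Dt, the cross-table Cst, and the correlation rst of two normalized quantifications:

```python
import numpy as np

def indicator(codes, k):
    """Indicator (dummy) matrix G: n rows, one column per category, rows sum to 1."""
    G = np.zeros((len(codes), k))
    G[np.arange(len(codes)), codes] = 1.0
    return G

codes_s = np.array([0, 1, 1, 2, 0, 2, 1, 0])     # variable s, 3 categories (toy data)
codes_t = np.array([1, 0, 1, 1, 0, 0, 1, 0])     # variable t, 2 categories
n = len(codes_s)
Gs, Gt = indicator(codes_s, 3), indicator(codes_t, 2)

Ds, Dt = Gs.T @ Gs, Gt.T @ Gt                    # diagonal: category frequencies
Cst = Gs.T @ Gt                                  # cross-table of s and t

def normalize(y, D):
    """Center and scale y so that u'Dy = 0 and y'Dy = n."""
    y = y - (D.diagonal() @ y) / n
    return y * np.sqrt(n / (y @ D @ y))

ys = normalize(np.array([1.0, 2.0, 4.0]), Ds)    # some category quantification for s
yt = normalize(np.array([-1.0, 1.0]), Dt)
r_st = (ys @ Cst @ yt) / n                       # correlation of the quantified variables
# The same number via the quantified variables themselves, q = G y:
qs, qt = Gs @ ys, Gt @ yt
assert np.isclose(r_st, (qs @ qt) / n)
```

Varying ys and yt while holding Cst fixed reproduces the point made above: the correlations are functions of the bivariate frequencies and the chosen quantifications.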
We now take a further step. The correlations vary with the choice of the
quantifications, and consequently all statistics depending on the correlations will
also vary. Suppose K(R(Y)) is such a (real-valued) statistic, interpreted as a function
of the scalings. We are interested in the variation of this statistic, and in many cases
in the largest and/or smallest possible value, under choice of quantifications. It is
possible, for instance, to look for the quantifications of the variables which
maximize or minimize a specific correlation. Or, if we have a number of predictors
and a single variable which must be predicted, we can choose scalings for optimal
prediction, i.e. with maximum multiple correlation coefficient. If the purpose of the
multivariate technique is ordination or some other form of dimension reduction,
then we can choose quantifications in such a way that a maximum amount of
dimension reduction is possible. In a principal components context this could mean
that we maximize the largest eigenvalue, or the sum of the p largest eigenvalues, of
the correlation matrix R(A). In fact we can look through the books on linear
multivariate analysis and find many other criteria that are used to evaluate results of
multivariate techniques. There are canonical correlations, likelihood ratio criteria
in terms of determinants, largest root criteria, variance ratios, and so on. For each
of these criteria we can study their variation under choice of quantifications, and we
can look for the quantifications that make them as large (or as small) as possible.
Before we give some examples, we briefly discuss the mathematical structure of
such optimal scaling problems. If we restrict ourselves to the case of n units of
observation, coded with indicator matrices, then the stationary equations for an
extreme value of criterion K over normalized quantifications are

    Σ_{t≠s} π_st C_st y_t = λ_s D_s y_s    (s = 1, ..., m),

where π_st = ∂K/∂r_st. This assumes, obviously, that the partial derivatives exist.
Consequently we restrict our attention to criteria that are differentiable functions of
the correlation coefficients. The stationary equations suggest the algorithm
For s = 1 to m:
    A1: compute q̂_s = Σ_{t≠s} π_st G_t y_t,
    A2: compute ỹ_s = D_s⁻¹ G_s' q̂_s,
    A3: compute y_s by normalizing ỹ_s,
next s.
Observe that the algorithm can be used for any criterion K. The criterion influences
the algorithm only through the form of the partial derivatives π_st. It is not
guaranteed that it works, i.e. converges, for all criteria. A detailed mathematical
analysis is given by De Leeuw (1986), who shows that the algorithm does indeed
work for some of the more usual criteria used in nonlinear multivariate analysis,
such as the ones we have mentioned above.
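Steps A1-A3 can be sketched for the simplest criterion, K = r_st for a single pair of variables, in which case π_st = 1 and the algorithm alternates a regression step with a normalization for each variable in turn. The toy data below are ours, and convergence safeguards are omitted; this is a sketch, not the authors' code.

```python
import numpy as np

def indicator(codes):
    """Indicator matrix G (n x k) of a variable coded 0..k-1."""
    k = codes.max() + 1
    G = np.zeros((len(codes), k))
    G[np.arange(len(codes)), codes] = 1.0
    return G

def normalize(y, d, n):
    """Impose u'Dy = 0 and y'Dy = n, with d the diagonal of D."""
    y = y - d @ y / n
    return y * np.sqrt(n / (y @ (d * y)))

def max_correlation(s_codes, t_codes, iters=100):
    """Steps A1-A3 for the criterion K = r_st, where pi_st = 1."""
    Gs, Gt = indicator(s_codes), indicator(t_codes)
    ds, dt = Gs.sum(axis=0), Gt.sum(axis=0)   # diagonals of Ds and Dt
    Cst = Gs.T @ Gt                           # cross-table of s and t
    n = len(s_codes)
    ys = normalize(np.arange(Gs.shape[1], dtype=float), ds, n)  # integer start
    yt = normalize(np.arange(Gt.shape[1], dtype=float), dt, n)
    for _ in range(iters):
        ys = normalize((Cst @ yt) / ds, ds, n)    # A1-A3 for variable s
        yt = normalize((Cst.T @ ys) / dt, dt, n)  # A1-A3 for variable t
    return ys, yt, (ys @ Cst @ yt) / n

s = np.array([0, 0, 1, 1, 2, 2, 2, 1])
t = np.array([0, 1, 0, 1, 2, 2, 1, 1])
ys, yt, r = max_correlation(s, t)
```

Each half-step is the optimal update for one variable given the other, so the correlation is nondecreasing over iterations; the fixed point is the largest nontrivial canonical correlation of the cross-table.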
Let us now look at an example. If we want to apply optimal scaling to the
example of Mayr, in Table 1, then we get into trouble. Because all variables are
binary, the possible scalings are completely determined by the normalization
conditions. For binary variables, there is only one possible scaling, and in that sense
they are the same as numerical variables. We could create variables with more
than two categories by using interactive coding, but the example is so small and
delicate that this would probably not be worthwhile.
We thus apply the algorithm, with various different criteria, to the zooplankton
example. The results are collected in Table 4. Column A contains the criterion
scaling technique mentioned in the previous section. We use integer scaling for
ABUNDANCE, and scale POND and SPECIES by
maximizing the sum of the correlations between ABUNDANCE and POND and
SPECIES. The quantifications are given in Table 4, for the correlations we find
r(S,A) = .29 and r(P,A) = .16. In column B we maximize the correlation r(S,A) by
scaling both SPECIES and ABUNDANCE. Of course this gives no quantification
for POND. The optimal correlation is r(S,A) = .59. In column C the same is done
for r(P,A), which can be increased to .36. Column D is more interesting. It
optimizes r(S,A) + r(P,A) over all three quantifications. This gives r(S,A) = .58 and
r(P,A) = .33. In this solution 44% of the variance in (scaled) ABUNDANCE is
'explained' by (scaled) SPECIES and POND.
We shall make no attempt to give an ecological interpretation of the scalings
found by the techniques. The example is meant only for illustrative purposes. It
seems, by comparing columns B, C, and D, that the optimal transformations are not
very stable over choice of criterion, which is perhaps not surprising in such a small
example. The optimal correlations are much more stable. So is the fact that the
categories of ABUNDANCE are scaled in the correct order, except for the zero
category which moves to the middle of the abundance scale.
Column E in Table 4 is quite different from the others. This is because it
interprets the data as a single bivariate distribution, with 52 'abundance credits' as
the units. If we now scale SPECIES and POND optimally, maximizing the
correlation in the bivariate distribution, then we find the quantifications in column
E, and the optimal correlation equal to .89. Again we give no interpretation, but we
point out that the solution in column E can be used to reorder the rows and columns
of Table 2 by using the order of the optimal quantifications. In this reordered
version of the table the elements are nicely grouped along the diagonal. For more
information about such optimal ordering aspects of nonlinear multivariate analysis
techniques we refer to Heiser (1986).
In the book by Gifi (1981) special attention is paid to a particular class of
criteria, that could be called generalized canonical analysis criteria. Also compare
Van der Burg, De Leeuw, and Verdegaal (1984, 1986) for an extensive analysis of
these criteria, plus a description of alternating least squares methods for optimizing
them. In generalized canonical analysis the variables are partitioned into sets of
variables. In ordinary canonical correlation analysis (Gittins 1985) there are only
two sets. In some of the special cases of ordinary canonical analysis, such as multiple
regression analysis and discriminant analysis, the second set contains only a single
variable. In principal component analysis the number of sets is equal to the number
of variables, i.e. each set contains a single variable. The partitioning of the variables
into sets induces a partitioning of the dispersion matrix of the variables into
dispersion matrices within sets and dispersion matrices between sets. Suppose S is
the dispersion matrix of all variables, and T is the direct sum of the within-set
dispersions, i.e. T is a block-matrix with on the diagonal the within-set dispersions,
and outside the diagonal blocks of zeroes. In ordinary canonical correlation analysis
T consists of two blocks along the diagonal that are nonzero, and two zero blocks
outside the diagonal. In principal component analysis T is the diagonal matrix of the
variances of the variables. Van der Burg et al. (1984, 1986) define the generalized
canonical correlations as the eigenvalues of m⁻¹T⁻¹S, where m is the number of
sets. In principal component analysis the generalized canonical correlations are the
eigenvalues of the correlation matrix, in ordinary canonical analysis they are
linearly related to the usual canonical correlation coefficients. Gifi (1981)
concentrates on techniques that maximize the sum of the p largest generalized
canonical correlation coefficients. These are, of course, functions of the correlation
coefficients between the variables. This means that we are dealing with a special case
of the previous set-up. But this special case is exceedingly important, because the
usual linear multivariate analysis techniques are all forms of generalized canonical
analysis.
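The definition can be checked numerically. In the sketch below (random data of our own, not from the text), four standardized variables are split into two sets; for two sets the generalized canonical correlations come out as (1 ± ρ_i)/2, i.e. linearly related to the ordinary canonical correlations ρ_i, as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # four standardized variables
S = (X.T @ X) / len(X)                       # dispersion matrix of all variables

# two sets, {0,1} and {2,3}: T is the direct sum of the within-set dispersions,
# i.e. block-diagonal with zero blocks outside the diagonal
T = np.zeros_like(S)
T[:2, :2], T[2:, 2:] = S[:2, :2], S[2:, 2:]
m = 2                                        # number of sets

# generalized canonical correlations: eigenvalues of m^-1 T^-1 S
gcc = np.sort(np.linalg.eigvals(np.linalg.solve(T, S) / m).real)[::-1]
# for two sets: (1 + rho_1)/2, (1 + rho_2)/2, (1 - rho_2)/2, (1 - rho_1)/2
```

In the one-variable-per-set case T is the diagonal of variances, so the same code returns the eigenvalues of the correlation matrix, matching the principal component case described in the text.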
MEASUREMENT LEVEL
In the examples we have discussed so far only two possible scalings of the
variables were mentioned. Either the quantification of the categories is known,
which is the case for measured or numerical variables, or the quantification is
completely unknown, and must be found by optimizing the value of the criterion.
Binary variables are special, because the quantification is unknown, but irrelevant.
The two cases 'completely known' and 'completely unknown' are too extreme in
many applications. We may be reasonably sure, for example, that the
transformation we are looking for is monotonic with the original ordering of the
target, which must be an ordered set in this case. Or we may decide that we are not
really interested in nonmonotonic transformations, because they would involve a
shift of meaning in the interpretation of the variable. If we predict optimally
transformed yield, for instance, and the optimal transformation has a parabolic
form, then we could say that we do not predict 'yield' but 'departure from average
yield'. In such cases it may make sense to restrict the transformation to be
increasing. The zooplankton example has shown that often monotonicities in the
data appear even when we do not explicitly impose monotonicity restrictions.
It is one of the major advantages of our algorithm that it generalizes very easily
to optimal scaling with ordinal or monotonic restrictions. It suffices to insert a
monotone regression operator MR(.) in step A2. Thus
For s = 1 to m:
    A1: compute q̂_s = Σ_{t≠s} π_st G_t y_t,
    A2: compute ỹ_s = MR(D_s⁻¹ G_s' q̂_s),
    A3: compute y_s by normalizing ỹ_s,
next s.
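The operator MR(.) is monotone (isotonic) regression, commonly computed with the pool-adjacent-violators algorithm. A minimal weighted sketch (our own implementation; the weights play the role of the diagonal of D_s):

```python
def monotone_regression(y, w):
    """Weighted pool-adjacent-violators: the nondecreasing sequence minimizing
    the weighted sum of squares sum_i w_i * (y_i - fit_i)^2."""
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); sizes.append(1)
        # pool adjacent blocks as long as they violate monotonicity
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wv = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wv
            wts[-2] = wv
            sizes[-2] += sizes[-1]
            vals.pop(); wts.pop(); sizes.pop()
    out = []
    for v, sz in zip(vals, sizes):
        out.extend([v] * sz)   # expand each pooled block back to full length
    return out

# e.g. the violating pair (3, 2) is pooled into its weighted mean 2.5:
# monotone_regression([1.0, 3.0, 2.0, 4.0], [1, 1, 1, 1]) -> [1.0, 2.5, 2.5, 4.0]
```

Inserting this operator in step A2 projects the unconstrained update onto the cone of nondecreasing quantifications before normalization, which is all the ordinal restriction requires.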
The situation is in some respects quite similar to the zooplankton example, because
there we also had two orthogonal variables SPECIES and POND that were used to
predict ABUNDANCE. The nature of the variables is quite different, however, in
this larger example. SPECIES is a nominal (or multi-state unordered) variable,
and NITRO, the amount of nitrogen, is a numerical (or measured) variable. But
NITRO takes on only the five discrete values 1, 9, 27, 81, and 243, and in this
respect it differs from the numerical variable YIELD, which can in principle take
on a continuum of possible values. In the Legendre and Legendre classification
NITRO is discontinuous quantitative, while YIELD is continuous quantitative. This
implies that the indicator matrix for YIELD is not very useful. Because of the
continuity of the variable each value will occur only once, and the indicator matrix
will be a permutation matrix, with the number of categories equal to the number of
observations. This will make it possible to predict any quantification of YIELD
exactly and trivially, and thus the result of our optimal scaling will be arbitrary and
not informative. If we want to apply indicator matrices to continuous variables, then
we have to group their values into intervals, that is we have to discretize them.
Discretizing can be done in many different ways, and consequently has some
degree of arbitrariness associated with it. Moreover if we plot the original variable
against the optimal quantified variable, then we always find a step function, because
by definition data values in the same interval of the discretization get the same
quantified value. Step functions are not very nice representations of continuous
functions. It is very difficult to recognize the shape of a function from its step
function approximation. On the other hand polynomials are far too rigid for
satisfactory approximation. This is the main reason for using splines in nonlinear
multivariate analysis. In order to define a spline we must first choose a number of
knots on the real line, which have a similar function as the discretization points for
step functions. We then fix the degree p of the spline. Given the knots and the degree
a spline is any function which is a polynomial of degree p between knots, and which
has continuous derivatives of degree p - 1 at the knots. Thus a spline can be a
different polynomial in each interval, but not arbitrarily different because of the
smoothness constraints at the knots, i.e. the endpoints of the intervals. For p = 0 this
means that the splines are identical with the step functions, that have steps at each of
the knots. For p = 1 splines are piecewise linear, and the pieces are joined
continuously at the knots. For p = 2 splines are piecewise quadratic, and
continuously differentiable at the knots, and so on. Thus step functions are special
splines. If we choose the knots in such a way that all data values are in one interval,
then we see that polynomials are also special cases. Thus SR(.), spline regression, has step functions
and polynomials as special cases, and MSR(.), which is monotone spline
regression, includes ordinary monotone regression and monotone polynomials.
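One concrete way to write down such splines is the truncated power basis: for degree p ≥ 1 with given knots, the functions 1, x, ..., x^p together with (x − k)_+^p for each knot k span exactly the piecewise polynomials of degree p with p − 1 continuous derivatives at the knots, while p = 0 requires interval indicators (step functions). The basis choice and data below are our own illustration.

```python
import numpy as np

def spline_basis(x, knots, p):
    """Truncated power basis for splines of degree p with the given knots.
    For p >= 1: 1, x, ..., x^p, (x - k)_+^p  (p-1 continuous derivatives at knots).
    For p = 0: interval indicators, i.e. step functions with steps at the knots."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        cols = [np.ones_like(x)] + [(x >= k).astype(float) for k in knots]
    else:
        cols = [x**d for d in range(p + 1)]
        cols += [np.clip(x - k, 0.0, None)**p for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 11)
B = spline_basis(x, knots=[0.3, 0.7], p=2)   # quadratic spline, two interior knots
# with an empty knot list the basis spans the polynomials of degree p,
# and with p = 0 it spans the step functions, matching the special cases above
```

Regressing a quantified variable on such a basis gives the operator SR(.) in its simplest form; in practice B-splines are preferred numerically, but the spanned space is the same.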
By combining the various criteria with the various options for measurement
levels we get a very large number of multivariate analysis techniques. Nevertheless
there are some very common techniques, which are still not covered by our
developments. The major example is multiple correspondence analysis (also known
as homogeneity analysis).
[Figure 3. Nitrogen data: optimal NITRO transformations for eight species, plotted against the original data values and against the category numbers.]
Cx = mμDx.
Here C is the supermatrix containing all cross tables Cst. This optimal scaling
problem was originally formulated and solved by Guttman (1941). Matrix C is
called the Burt table in the French correspondence analysis literature. Matrix D is
the diagonal of C, and m is the number of variables. The category quantifications Yt
are found by normalizing the m subvectors of the eigenvector x corresponding with
the dominant nontrivial eigenvalue. In the zooplankton example C is of order 25,
because there are five variables with five categories each. The largest eigenvalue,
which was 3.41 with integer scaling, goes up to 3.70 with optimal scaling. The
percentage variance 'explained' goes from 68% to 74%. Table 5a gives the optimal
quantifications for the five variables. They are quite regular and close to
monotonic, but distinctly nonlinear.
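The eigenproblem Cx = mμDx can be sketched directly from the indicator matrices. The fragment below uses a miniature data set of our own (not the zooplankton table) to form the Burt table C and its diagonal D; the eigenvalue 1 is the trivial solution, and the next eigenvector carries the optimal category quantifications.

```python
import numpy as np

def indicator(codes):
    """Indicator matrix G (n x k) of a variable coded 0..k-1."""
    k = codes.max() + 1
    G = np.zeros((len(codes), k))
    G[np.arange(len(codes)), codes] = 1.0
    return G

# three categorical variables on n = 6 objects (toy data)
data = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 0],
                 [1, 2, 0], [2, 2, 2], [2, 0, 2]])
G = np.hstack([indicator(data[:, j]) for j in range(data.shape[1])])
C = G.T @ G                      # Burt table: all cross-tables Cst
D = np.diag(np.diag(C))          # D is the diagonal of C
m = data.shape[1]                # number of variables

# generalized eigenproblem Cx = m*mu*Dx, solved via D^-1 C / m
mu, X = np.linalg.eig(np.linalg.solve(D, C) / m)
order = np.argsort(-mu.real)
mu, X = mu.real[order], X.real[:, order]
# mu[0] = 1 is the trivial solution (constant quantifications);
# the m subvectors of X[:, 1], normalized per variable, are the optimal scalings
x = X[:, 1]
```

Using further columns of X reproduces the multiple correspondence analysis solutions discussed below, each eigenvector inducing its own correlation matrix.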
There are now at least three ways in which the problem can be made
multidimensional. In the first place we can compute the induced correlation matrix
R, and find its subsequent eigenvalues and eigenvectors as in ordinary metric
component analysis. This is straightforward. In the second place we can change the
criterion to a multidimensional one. Thus we can maximize the sum of the first two,
or the sum of the first three eigenvalues of the correlation matrix. In general this
will give different correlation matrices, and different eigenvalue distributions. We
illustrate this for the sum of the first two eigenvalues in the zooplankton example. In
the previous solution, which optimized the largest eigenvalue, the first two
eigenvalues 'explained' 74% and 14%. If we optimize the sum of the two largest
eigenvalues we find 'explained' variances of 56% and 44%. The optimal
quantifications in Table 5b make the transformed data matrix exactly of rank two.
In order to obtain this perfect fit, the technique transforms variables 3 and 4 in a
somewhat peculiar way.
The third way of finding a multidimensional solution is quite different. It simply
computes additional eigenvalues and eigenvectors of the pair (C,mD). This defines
multiple correspondence analysis. The technique was introduced in psychometrics
by Guttman and Burt (Guttman 1941, 1950, 1953, Burt 1950, 1953). Each
eigenvector now defines a vector of category quantifications, which induces a
correlation matrix. In Table 5c, for example, we give the quantifications
corresponding with the second eigenvalue of (C,mD), which is 2.55. The
correlation matrix that goes with these quantifications has a dominant eigenvalue
'explaining' 51% of the variance, and a subdominant one 'explaining' 35%. The
quantifications in Table 5c look peculiar. We could go on, of course, by using
additional eigenvalues of (C,mD).
If one thinks about this a little bit, then it is somewhat disconcerting.
It is nice to have a number of principles and technical tools that can be used to
create very general nonlinear multivariate analysis techniques. But it is perhaps
even nicer to know that some of the possible options have already been combined
into various series of computer programs, and that these programs are readily
available. The ALSOS series of programs comprises programs for analysis of
variance, multiple regression, principal component analysis, factor analysis, and
multidimensional scaling. An overview is given by Young (1981). The GIFI series
has programs for correspondence analysis, multiple correspondence analysis,
principal component analysis, canonical correlation analysis, path analysis, and
multiple-set canonical analysis. Gifi (1981) has the necessary references. A relative
newcomer is the ACE series, discussed in Breiman and Friedman (1985). There are
programs for multiple regression, discriminant analysis, time series analysis, and
principal component analysis.
The three series of nonlinear multivariate analysis programs differ in many
respects, even if they really implement the same technique. The various possibilities
of choosing the regression operators differ, the algorithms differ, and the input and
output can also be quite different. But it is of course much more important to
emphasize what they have in common. All three series generalize existing linear
multivariate analysis techniques by combining them with the notion of optimal
scaling or transformation. Thus they make them more nonparametric and less
model-based, more exploratory and less confirmatory, more data analytic and less
inferential.
discussed by De Leeuw and Van der Burg (1986). Although these techniques for
analyzing stability are often expensive computationally, we think that in almost all
cases the extra computations are quite worthwhile. A confidence band around a
nonlinear transformation, or a confidence ellipsoid around a plane projection give
useful additional information, even if the random sampling assumptions do not seem
to apply.
Books such as Legendre and Legendre (1983), Gauch (1982), and Gittins (1985)
have already shown to ecologists that linear multivariate analysis techniques, if
applied carefully, and by somebody having expert knowledge of the subject area in
question, can be extremely helpful and powerful tools. It seems to us that combining
multivariate exploration with automatic reexpression of variables is an even more
powerful tool, which has already produced interesting results in many different
scientific disciplines. We think that they show great promise for ecology too, but we
must emphasize that perhaps even more care, and an even more expert knowledge
of the ecological problems, is required. Attacking very simple problems with very
powerful tools is usually unwise and sometimes dangerous. One does not rent a
truck to move a box of matches, and one does not use a chain saw to sharpen a
pencil. The techniques we have discussed in this paper are most useful in dealing
with large, relatively unstructured, data sets, in which there is not too much prior
information about physical or causal mechanisms. In other cases, often better
techniques are available. But these other cases occur far less frequently than the
standard mathematical statistics or multivariate analysis texts suggest.
REFERENCES
AGRESTI, A. 1983. Analysis of Ordinal Categorical Data. John Wiley & Sons,
Inc., New York, NY.
ANDERSON, T.W. 1984. An Introduction to Multivariate Statistical Analysis.
(second edition). John Wiley & Sons, Inc., New York, NY.
BEKKER, P. 1986. A Comparison of Various Techniques for Nonlinear Principal
Component Analysis. DSWO-Press, Leiden, The Netherlands.
BENZECRI, J. P. ET AL. 1973. L'Analyse des Donnees. (2 vols). Dunod, Paris,
France.
BENZECRI, J.P. ET AL. 1980. Pratique de l'Analyse des Donnees. (3 vols).
Dunod, Paris, France.
BISHOP, Y.M.M., S.E. FIENBERG, AND P.W. HOLLAND. 1975. Discrete
Multivariate Analysis: Theory and Practice. MIT-Press, Cambridge, MA.
BREIMAN, L., AND J.H. FRIEDMAN. 1985. Estimating Optimal Transformations
for Multiple Regression and Correlation. J. Am. Statist. Assoc. 80: 580-619.
BURT, C. 1950. The Factorial Analysis of Qualitative Data. British J. Psychol.
Willem J. Heiser
Department of Data Theory
University of Leiden
Abstract - Several different methods of gradient analysis, including correspondence analysis and
Gaussian ordination, can be characterized as unfolding methods. These techniques are applicable
whenever single-peaked response functions are at issue, either with respect to known environment-
al characteristics or else with respect to data driven reorderings of the sites. Unfolding gives a joint
representation of the site/species relationships in terms of the distance between two types of
points, the location of which can be constrained in various ways. A classification based on loss
functions is given, as well as a convergent algorithm for the weighted least squares case.
1. INTRODUCTION
Ordination and clustering methods all rely on the concept of distance and some kind of
reduction principle in order to facilitate the analysis of structures in data. Usually, this requires the
choice of some measure of ecological resemblance as a first step, either between objects
(individuals, samples), or between attributes (species, descriptors). Then in ordination the aim is
finding a reduced space that preserves distance, i.e. reduction of dimensionality, and in cluster
analysis the aim is allocating the units of analysis to a reduced number of (possibly hierarchically
organised) classes, i.e. reduction of within-group distance with respect to between-group distance.
This paper will be centered on a third type of method, also based on distance and reduction,
but not relying on derived associations or derived dependencies. It is particularly suited for the
analysis of species x samples presence-absence or abundance data; or, perhaps somewhat more
generally, for any ecological data matrix that is dimensionally homogeneous (Legendre and
Legendre 1983), and non-negative. In psychology, where its early developments took place in the
context of the analysis of individual choice behavior and differential preference strength, the group
of methods is called unfolding (Coombs 1950, 1964). Since the word "unfolding" aptly describes
the major aim of the technique, it will be used as a generic name throughout this paper.
In order to outline the objectives of unfolding in ecological terms, the first thing to notice is
that the basic notion of ecological resemblance need not be confined to distance defined on pairs of
units from a single set. If it is assumed that for each species there is a unique combination of the
levels or states of the environmental variables that optimizes its possibilities to survive, perhaps to
be called its ideal niche, and that the sampling sites approximate these ideal circumstances to
different degrees, then species abundance might be supposed to level off monotonically with the
distance of a sampling site from the ideal niche. Here distance could be understood as concrete,
geographical distance, or as distance in some abstract space. In the latter case the samples are to be
arranged in an orderly fashion, along a gradient, reflecting the gradual changes in environmental or
community characteristics. Now the unfolding technique seeks to find precisely those gradients that
yield single-peaked response functions, i.e. it seeks a reduction to (low-dimensional) unimodality.
Psychologists study objects called stimuli, want to arrange them along stimulus scales, and one of
the major response classes available to them is preference. In these terms, the unfolding technique
aims at finding those stimulus scales that yield single-peaked preference functions.
Coombs developed his form of unfolding in an attempt to resolve a notorious problem in
psychology, i.e. the problem of defining a psychological unit of measurement (Coombs 1950).
How can we quantify human judgement without recourse to an arbitrary grade-point system? The
ecological equivalent of this issue would be: how can we quantify the differential reactions of
species to the environment without capitalizing on the pseudo-exact numerical aspects of
abundance? The answer unfolding has to offer is through the study of consistency (or scalability)
of the behavioral reactions under the condition of single-peakedness.
The first goal of this paper is to convince the reader that the unfolding technique is the natural
general-purpose first candidate for gradient analysis. However, there exists plenty of scope for
making more specific assumptions than has been done so far, and hence several rather different
methods are to be considered as members of the family. Therefore, a second goal is to try to
organize the field a little by comparing the various loss functions on which these methods are
based, and by showing the interrelations between various special cases. The third goal is to present
explicit computational formulas for a convergent unfolding algorithm, and to sketch a few open
problems and lines of development.
The importance of single-peaked, or unimodal, response curves and surfaces stems from a
diversity of scientific areas, ecology being one of the richest sources. Frequently a linear analysis
of contingencies showed unexpected nonlinearities, or sometimes regression plots of abundance or
cover against carefully chosen a priori gradients were unmistakenly bell-shaped. Ihm and van
Groenewoud (1984) summarize the early evidence from vegetation studies as follows: "Already
Goodall (1954) in one of the first applications of PCA to the analysis of vegetation data noted the
problem caused by the nonlinearity of quantitative species relationships in the interpretation of the
principal components. Knowledge about the non-linearity of gradient response was, however, not
new. Braun-Blanquet and Jenny (1926) investigated the pH-value of soils in which several species,
e.g. Carex curvula (L) and others, were growing in the Swiss Alps and England. They found
normal frequency curves for these pH-values. Making the assumption of a uniform distribution of
the pH-values - at least in the range of growth of the species studied - one could conclude that also
the gradient response was Gaussian. It appears the bell-shaped gradient response curves were first
suggested by Igoshina (1927). Gause (1930) studied the abundance of certain species as related to
ecological conditions and found that they followed the law of Gauss. The ordination work by
Curtis and Mcintosh (1951), Bray and Curtis (1957), Cottam and Curtis (1956), Whittaker (1948)
and many others all showed the non-linearity of species-site factor relationships. Especially the
published examples of gradient responses clearly show the unimodal type of the response curves."
(l.c., p. 13). For many additional references, see Gauch (1982) and Whittaker (1978).
The first articulated unimodal response model in psychology was proposed by Thurstone
(1927), building upon nineteenth century work on sensory discrimination. He claimed wider
applicability, e.g. as a model for attitude and opinion, but later on abandoned the subject. Hovland,
Harvey and Sherif (1957) undertook additional experimental work, and provided convincing
evidence for single-peakedness in human evaluative responses. In factor analyses of personality
tests one frequently found nonlinearities called - for lack of a full understanding - 'difficulty
factors'. Coombs and Smith (1973) and Davison et al. (1980) studied unimodal developmental
processes, and a classic example of single-peaked behavior is preference for family compositions in
terms of number of children and bias towards boys or girls (e.g., Coxon 1974). Yet the phenom-
enon is not very actively studied anymore in psychology, not nearly as much as its special case:
monotonicity.
At this point, it might be useful to emphasize that it is not unimodality alone, but the fact that
the peaks of the curves are shifted with respect to each other which makes the situation special. For
imagine a number of unimodal curves precisely on top of each other, then any transformation of the
gradient would provide the same information; thus one could make the curves more skewed,
double-peaked, monotonically increasing, or indeed of any conceivable shape by suitable re-
expressions of the values against which they are plotted. When the curves are shifted along the
gradient, this freedom of simultaneous change of shape is reduced enormously.
The early contributions to ordination by the famous archaeologist Flinders Petrie, source of
inspiration for Kendall (1963) and much subsequent work in archaeological seriation (cf. Hodson
et al. 1971), were typically not tailored to the precise shape of the artifact distributions, but
primarily to the fact that they should form an overlapping sequence of 'present' counts if the sites
were properly ordered (presumably in time). Roberts (1976, section 3.4) has given an interesting
graph-theoretical characterization of this ordering problem.
1970; Noy-Meir and Austin 1970). Because these distortions can have widely different forms -
depending on such things as the dimensionality of the gradient, the homogeneity of the species and
sample variances, and the variability of maximum abundance - it is hazardous to rely on the
standard PCA approach, and there is clearly a need for specialized nonlinear methods.
Instead of bringing in nonlinearity at the data side, it can be introduced in the functional
structure of the model. McDonald (1962, 1967) and Carroll (1969, 1972) have advocated this
general approach. Deviation from linearity - although a heterogeneous phenomenon by its very
nature - can always be modelled by a sufficiently rich family of polynomials. Carroll's polynomial
factor analysis model has the following form:
    f_ij ≅ Σ_{r=1}^q a_ir z_rj        (1)

with

    z_rj = φ_r(y_j1, ..., y_jp)        (2)
Here, as in the sequel, fij denotes the abundance of species i in sample j, or, in the more general
terminology of Legendre and Legendre (1983), the value of descriptor i for object j. The symbol ≅
is used for approximation in the least squares sense, and the indices run as i = 1,...,n, j = 1,...,m,
and r = 1,...,q. So in its full generality, there are p sample gradients, or a p-dimensional space of
sample points, with coordinates y_js. Then there are q elementary polynomial functions φ_r that have
to be specified on an a priori basis. Thus to obtain a quadratic response surface, for example, one
would have to specify:
    φ_1(·): z_1j = 1,
    φ_2(·): z_2j = y_j1,
    φ_3(·): z_3j = y_j2,
    φ_4(·): z_4j = y_j1²,
    φ_5(·): z_5j = y_j2²,
    φ_6(·): z_6j = y_j1 y_j2.
It is easily verified that if only the first three of these are chosen, (1) and (2) reduce to the familiar
bilinear form of the PCA model.
Carroll used a steepest descent method for finding optimal values for the parameter sets {a_ir}
and {y_js}. There is little experience with the procedure, however. It is quite heavily loaded with
parameters, and does not give a particularly simple parametrization of the species. It has a great
many special cases. Perhaps it should better be called a program for research, rather than a model.
When the {y_js} are fixed to known values, e.g. environmental measurements such as soil
pH, soil moisture, elevation and so on, the set-up (1) and (2) becomes formally equivalent to a
multiple regression analysis problem (Draper and Smith 1966; Gittins 1985). Note that although
nonlinear predictors are used, the model is now linear in the parameters, and can be fitted by
standard methods. Also note that in fact we have n independent regression problems, one for each
species or row of the data matrix. The last two remarks remain true if the definition of <l>r is
extended to include logarithmic, exponential or other simple functions. Carroll (1972) has given
explicit reparametrizations, constituting the so-called PREFMAP hierarchy of models, to obtain a
description of the species response curves or surfaces in terms of the location of the peak, the
importance of the relative contributions of the gradient factors, and possibly their interaction.
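Since the model is linear in the parameters once the {y_js} are fixed, each species can be fitted by ordinary least squares. The sketch below runs such a direct gradient analysis with the quadratic basis listed above; the sample coordinates, the simulated species response, and the peak-recovery step are all hypothetical illustrations, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 30
Y = rng.uniform(0.0, 1.0, size=(n_samples, 2))   # known sample coordinates y_js

# basis functions phi_1..phi_6 for a quadratic response surface
Z = np.column_stack([np.ones(n_samples), Y[:, 0], Y[:, 1],
                     Y[:, 0]**2, Y[:, 1]**2, Y[:, 0] * Y[:, 1]])

# one hypothetical species: unimodal abundance peaking near (0.4, 0.6), plus noise
f = np.exp(-8.0 * ((Y[:, 0] - 0.4)**2 + (Y[:, 1] - 0.6)**2)) \
    + 0.05 * rng.standard_normal(n_samples)

a, *_ = np.linalg.lstsq(Z, f, rcond=None)        # least squares fit of the a_ir

# location of the fitted peak: set the gradient of the quadratic surface to zero
H = np.array([[2 * a[3], a[5]], [a[5], 2 * a[4]]])
peak = np.linalg.solve(H, -a[1:3])
```

Each species (row of the data matrix) gives one such independent regression, and the fitted coefficients yield exactly the kind of peak-location description of the response surface that the PREFMAP reparametrizations formalize.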
Fixing the space of sample points or objects and then studying the regression is only one way
to simplify the general polynomial model, and is called direct gradient analysis (Whittaker 1967), or
external analysis of preferences (Carroll 1972). These terms are used in contrast to indirect gradient
analysis or internal analysis of preferences, in which some optimal quantification of the gradient
has to be found as well. As we shall see shortly, there is also the possibility of an analysis between
these two extremes, whenever there is partial knowledge on the gradient (e.g., a ranking of the sites
with respect to moisture status, instead of exact numerical measurements). But first a few additional
remarks are in order, regarding the reasons for concentrating on unimodal models.
It was remarked earlier: linearity has the virtue of being uniquely defined, but deviation from
linearity can have many appearances. From a statistical point of view, it seems wise to progress
slowly from very simple to increasingly complex models, and to examine the deviations from the
model along the way. In fact, the bilinear model of PCA is already a second type of approximation,
the first one being the hypothesis that all abundances are equal, up to row- and/or column effects.
However ignorant or even indecent this may sound in a field that studies diversity, we may
occasionally need to have statistical assurance that we deal with genuine interaction between species
and sites. If the abundance data are considered to be a contingency table, for instance, the chi-
squared test value under the hypothesis of independence should be very large.
The shifted single-peaked model is a further approximation of the second type, and it has the
virtue of having one defining characteristic as well. It is more complex in form than the bilinear
model, but not necessarily in terms of number of parameters. The situation is depicted in Figure 1.
[Fig. 1. Hierarchy of models: one linear component or one set of shifted single-peaked curves, each of which can be extended to two or more components.]

When moving to the right the number of parameters is increased, so a better fit will always be obtained, but one set of curves might be enough where multiple components would be needed. Of
course, other nonlinear models might turn out to be even more appropriate, but in general there is
little hope in trying an exhaustive search.
It is difficult to accept that, when two models describe the same data about equally well, one
of them is "true" and the other one is "false". Let us consider Figure 2 in the light of this remark.
The Figure gives an idealized example of one of those notorious curved chains of sample points
from a PCA of abundance data. In addition, however, it gives two directions representing species A
and B, selected arbitrarily from the whole range of possible species directions. The advantage of
making this so-called joint plot or biplot (Gabriel 1971) is that it enables the demonstration of a
very elementary fact, which is often - if not always - overlooked in the literature.

[Fig. 2. Idealized curved chain ("horseshoe") of sample points from a PCA of abundance data, with two arbitrarily selected species directions A and B.]

[Fig. 3. Abundance as a function of position along the horseshoe (peak A corresponds with direction A of Figure 2, and peak B with direction B).]

The PCA model implies that, in order to reconstruct the abundances for species A, the sample points should be orthogonally projected onto direction A. If this is actually done, and for direction B likewise, and if the curved chain is straightened out, or "unfolded" into a straight line, locally preserving the distances among the sample points, the projections plotted against the "unfolded" chain get the appearance of Figure 3: shifted single-peaked curves!
Any direction in between A and B in Figure 2 would yield a curve with its peak in between the peaks of A and B in Figure 3, and more extreme
directions (to the left of B, and to the right of A) would get curves with more extremely shifted
peaks. This shows that there is no real contradiction between the two ways of representing the data,
provided they are interpreted with an open mind. For single-peaked surfaces the PCA representation
will be a curved manifold in three dimensions, much less easily recognizable. Under single-peaked-
ness the data themselves already form a curved manifold in m dimensions, which has to be
"unfolded" to display its simplicity. Of course, these observations are not sufficient for getting a
practical method. The occurrence of deviations from the model, including random errors, as well as
the possible need to work in high dimensionality, urges us to use and further develop specialized
unfolding methods.
A curve or surface of any shape could in principle be modelled by means of the general
polynomial model. This relatively blind approach implies that many parameters have to be estimated
(often repeatedly under different specifications of the model), many of which are unlikely to be
readily interpretable. Under shifted single-peakedness the parametrization can be solely in terms of
the location of the peaks, and possibly also with respect to remaining aspects of shape: tolerance or
species dispersion (range of the responses along the gradient), correlated density in the higher-dimensional case, and (lack of) symmetry. Any unfolding method is based on the assumption that
abundance is inversely related to the distance of a sample point from the estimated peak location of
the species response function, frequently called the ideal point. The name "unfolding" refers to the
following metaphor: suppose the model is known, and imagine the sample points painted on a
handkerchief. Pick the handkerchief up at the ideal point of species i and fold it, for instance by
pulling it through a ring. Then observe that the sample points will appear in the order of the
magnitude of the abundances as given in the i-th row of the data matrix (or of the raw observations
if these are recorded, for each species i, as a list of samples from most abundant down to least
abundant, or absent). Because the analysis technique must construct the model starting from the
data, this process must be reversed; hence the name.
Two major approaches to unfolding can be discerned: one based on dissimilarity approxi-
mation, the other on distance or squared distance minimization. As shall become evident shortly,
there is an important sense in which the latter - formally equivalent to correspondence analysis - is a
special case of the former. The discussion starts with the problem of external unfolding, where the
location of the sample points is fixed in advance, and the ideal points must be determined.
Suppose the coordinates of m points in p-dimensional space are available in the mxp matrix
Y, the j'th row of which is denoted with Yj. Now consider n unknown additional points, indexed
by i, with coordinates xi collected in the rows of the n×p matrix X. The Euclidean distance d(xi,yj)
is defined by writing its square as:
d²(xi,yj) = Σs=1,...,p (xis - yjs)² .   (3)
In order to construct a loss function that measures the departure of the model distances from the
data, some definition of dissimilarity - the empirical counterpart of distance - has to be agreed upon.
Just to make a start, suppose this is done in the following way. Since the total number of
occurrences of a species is often of little interest, at least not in the study of species × environment
interaction, it is advisable to work with the species-specific proportions
pij = fij / fi+ , with fi+ = Σj fij ,   (4)
or some other standardization factor, such as maximal species abundance, to make the distributions
row-wise comparable. Now the species-sample dissimilarity δij and the associated weights wij may
be defined as:

δij = -log pij if fij > 0 , δij = 1 if fij = 0 ;   (5a)
wij = 1 if fij > 0 , wij = 0 if fij = 0 .   (5b)
Other choices will be encountered later. In (5a) and (5b) the weights are merely used to indicate
presence or absence; non-occurrence gets an arbitrary unit dissimilarity, and will not cause any
increase in loss (because wij = 0). Note that, indeed, dissimilarity is a decreasing function of
relative abundance; if pij approaches zero, then δij approaches infinity, and if pij = 1 then δij = 0.
The interpretation of the latter case depends on the data standardization; under (4) it implies that δij
only becomes zero if a species occurs in only one sample (in any frequency).
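This coding of the data amounts to a few array operations; a minimal numpy sketch, with an invented frequency matrix:

```python
import numpy as np

# Invented species-by-sample frequency matrix F.
F = np.array([[4.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])

# (4): species-specific proportions p_ij = f_ij / f_i+.
P = F / F.sum(axis=1, keepdims=True)

# (5a), (5b): presence gives dissimilarity -log p_ij with weight 1;
# absence gets an arbitrary unit dissimilarity with weight 0, so it
# never contributes to the loss.
present = F > 0
delta = np.ones_like(P)
delta[present] = -np.log(P[present])
w = present.astype(float)

print(delta[0])   # [0.223..., 1.609..., 1.0]
```

Note how the rare occurrence (p = 0.2) receives a larger dissimilarity than the common one (p = 0.8), reflecting the decreasing relationship just described.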
The basic unfolding loss function is now defined as the weighted least squares criterion
σR² = Σi Σj wij {δij - d(xi, yj)}² ,   (6)
the "rectangular" or "off-diagonal" version of Kruskal's so-called raw STRESS (Kruskal was the
first who explicitly proposed to use least squares distance modelling, in his (1964a, 1964b)
papers). Depending on the alterations in the definition of wij and δij, as well as on the choice of
domain Ω over which σR is to be minimized, we get different unfolding methods.
For the problem of this section Ω is the set of all n×p matrices, but in addition a provision
has to be made for ensuring that δ and d match in scale (assuming that the coordinates of the given
set of points are on an arbitrary scale). Because the distance function is homogeneous, i.e. α
d(xi,yj) = d(αxi,αyj) for any nonnegative α, adjusting the scale of the coordinates and adjusting
the scale of the distances amounts to the same thing. However, we can also adjust the scale of the
dissimilarities by just extending their definition so as to include an unknown scaling constant:
δij(α) = α δij ,   (7)

where the notation δij(α) is used to make the dependence on α fully explicit. Whatever choice is
made, the scale adjustment would leave σR dependent on the arbitrary scale of the given set of
points; this is undesirable, so σR has to be normalized. As shown by Kruskal and Carroll (1969),
various ways of normalization only affect the scale of the loss function, not the argument for which
a minimum is attained. De Leeuw and Heiser (1977) have argued that normalization on the
distances makes the computational problem considerably more complicated in a number of
important special cases. Therefore the external unfolding problem - as defined here - becomes:

minimize σN²(X, α) over X ∈ Ω and α ,   (8a)

with

σN²(X, α) = Σi Σj wij {δij(α) - d(xi, yj)}² / Σi Σj wij δij²(α) .   (8b)
This optimization problem (and the one that will follow shortly) has no closed-form solution, it is
not related to any eigenvalues and eigenvectors, nor to projection from some high-dimensional
space to a p-dimensional one; it has to be solved iteratively. A convergent algorithm for finding at
least a local minimum shall be discussed in some detail now, because it offers the opportunity to
illustrate a number of interesting features of this type of algorithm. It is based on the general
algorithm model proposed by De Leeuw and Heiser (1977, 1980), called SMACOF (an acronym
highlighting its prime technical characteristic: Scaling by MAximizing a COnvex Function, or, as is
preferred nowadays, Scaling by MAjorizing a COmplicated Function).
The minimization of σN can be done by repeatedly solving two subproblems. There is a
normalized regression problem, in this case finding the optimal value of α for fixed distances, and
a relocation problem, i.e. finding new locations X+ starting from some initial guess X~ and keeping
the rescaled dissimilarities constant at their current values. As to the former, it can be shown that,
writing dij for the fixed distances, the optimal choice of α is

α+ = Σi Σj wij δij dij / Σi Σj wij δij² .   (9a)

The quantities

d̂ij = α+ δij ,   (9b)

sometimes called the pseudo-distances, or dhats, or disparities, all names referring to the
characteristic of distance approximation by a function of the data, can be substituted in (8b),
thereby reducing it to the basic form (6) with uniformly rescaled weights, due to the normalization
factor. This settles the regression part for now.
The relocation part is more difficult. One of the objections to a relatively straightforward
steepest descent method, such as the one used by Kruskal (1964b), is that the partial derivatives of
σR do not exist at points where d(xi,yj) becomes zero. In this context it is of some interest to note
that the very same problem emerges in the classic Fermat or generalized Weber problem (Kuhn
1967), also called the location problem, which is to locate a point xi among m known points in
such a way that the weighted sum of distances

Σj wij d(xi, yj)   (10)

is minimized.
Furthermore, the weights are collected in W = {wij}, the quantities

aij = wij d̂ij / d(xi, yj) if d(xi, yj) > 0 ,   (12a)
aij = 0 if d(xi, yj) = 0 ,   (12b)

are collected in the matrix A = {aij}, and the diagonal matrices P and R are defined as:

P = diag(A em) ,   (13a)
R = diag(W em) ,   (13b)

where em denotes an m-vector of ones. The SMACOF algorithm for external unfolding uses the
following two operations:

X~ = P X - A Y ,   (14a)
X+ = R⁻¹ (X~ + W Y) .   (14b)
Here X~ is a preliminary, unconstrained update, and X+ is the successor configuration suitable for
the present case of fixed column points. Note that in the equally weighted case the last operation
(14b) amounts to a uniform rescaling and an adjustment of the centroid. The first operation (14a)
carries the burden of the iterative relocation of the species points, because A and P contain
information on the size of the current distances dij, on what they should be (d̂ij), and on how strongly
an improvement is desired (wij). Let us have a closer look by writing (14a) row-wise as a single
weighted summation:
x~i = Σk∈K wik d̂ik (xi - yk) / d(xi, yk) ,   (15)
where K is the subset of the first m integers for which (12a) holds. Thus the preliminary updates
are a weighted sum, with weights wik d̂ik, of unit-length difference vectors pointing from the fixed
column points towards the current location of i. If the current location of i coincides with a column
point, then (12b) comes into effect; the zero difference vector cannot be normalized and is omitted
from the summation. Sample sites where species i is absent - or at least where wij = 0, perhaps due
to another reason - do not contribute either.
The relocation step is illustrated in Figure 4, starting from an arbitrary configuration of three
x-points and two y-points, with unit weights and the dissimilarities as given in the Figure caption.
Thus there are 6 difference vectors, and the concentric circles around the origin represent the size of
the dissimilarities. The circles are used for adjusting the length of the difference vectors, and are
expanded or contracted during the iterations (this is a uniform expansion or contraction for the
present case of linear regression without an intercept, 9a and 9b; it would become a more involved
stretching and shrinking when other forms of regression are introduced). The x~i are now simply
obtained by vector addition. Next their length has to be divided by 2, the number of y-points, and
their origin must be shifted towards y0, the centroid of y1 and y2, thus accomplishing (14b). For
x+1 the latter step is explicitly shown, while the other auxiliary lines are omitted for clarity. By
visual inspection alone it can be verified that the new distances are closer to the dissimilarities than
the old ones. Finally note the fact that each point is relocated independently from the others, in
much the same way as there were n independent regression problems under the general polynomial
model.
A summary of all steps is given in the following skeleton algorithm for external unfolding:
* START: choose an initial configuration X
* (i) compute new locations X+ from (14a) and (14b)
* (ii) compute the new distances d(xi, yj)
* (iii) compute the optimal rescaling factor α+ from (9a) and the rescaled dissimilarities d̂ij from (9b)
* (iv) if the loss (8b) has decreased negligibly, STOP; otherwise return to (i)
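As an illustration, the skeleton fits in a few dozen lines of code. The sketch below assumes the standard SMACOF relocation step, in which each preliminary row point is a weighted sum of unit-length difference vectors, followed by rescaling towards the weighted centroid of the fixed points; the function name and the toy data are invented for the example.

```python
import numpy as np

def external_unfolding(delta, W, Y, n_iter=200, seed=0):
    """Minimal sketch of SMACOF external unfolding: the column points Y
    are fixed, the row points X are relocated iteratively."""
    rng = np.random.default_rng(seed)
    n, m = delta.shape
    X = rng.normal(size=(n, Y.shape[1]))          # initial guess
    R = W.sum(axis=1)                             # row marginals of W
    for _ in range(n_iter):
        D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
        # step (iii): regression without an intercept, giving alpha+
        alpha = (W * delta * D).sum() / (W * delta**2).sum()
        dhat = alpha * delta                      # pseudo-distances
        # step (i): relocation; zero distances contribute nothing
        with np.errstate(divide="ignore", invalid="ignore"):
            A = np.where(D > 0, W * dhat / D, 0.0)
        X_tilde = A.sum(axis=1)[:, None] * X - A @ Y  # preliminary update
        X = (X_tilde + W @ Y) / R[:, None]            # rescale and recenter
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    stress = (W * (alpha * delta - D)**2).sum()
    return X, stress

# Toy run: 4 fixed sites, 3 species, hypothetical dissimilarities.
rng = np.random.default_rng(1)
Y = rng.normal(size=(4, 2))
delta = rng.uniform(0.5, 2.0, size=(3, 4))
W = np.ones((3, 4))
X, stress = external_unfolding(delta, W, Y)
print(stress)
```

Because each iteration combines a majorization step (relocation for fixed pseudo-distances) with an exact regression step, the raw stress never increases from one iteration to the next.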
As a first extension to this scheme we shall now consider the situation in which the sample points
are not a priori given, but have to be located as well.
In internal unfolding analysis two sets of points have to be located with respect to each other;
hence the term 'reciprocal relocation'. As a consequence, the relocations are not independent
anymore. It does eliminate the need to rescale the data: the rescaling factor can be absorbed in the
unknown coordinates. Therefore, the normalized loss function σN² becomes functionally equivalent
to the unnormalized one σR², i.e. the same up to a constant, and the problem becomes:
minimize σR²(X, Y) over X and Y .   (16)
The skeleton algorithm of the previous section need not be changed very much. We can skip step
(iii) (not for long; it will be reintroduced soon). Only step (i), calculation of the new locations, must
really be adjusted. Two additional matrices are required:

Q = diag(A' en) ,   (17a)
C = diag(W' en) ,   (17c)

where en denotes an n-vector of ones. Then, analogous to (14a), a preliminary update for the
sample points is found from
Y~ = Q Y - A' X .   (18)
The companion operation (14b) is no longer correct. Instead, the successor configurations X+ and
Y+ must be computed from the system of linear equations:

R X+ = X~ + W Y+ ,   (19a)
C Y+ = Y~ + W' X+ .   (19b)
The interested reader may consult section 3.6 at this point for finding out how these equations come
about. How to solve the system most efficiently depends on the size of n and m. Suppose n > m
(the other case runs analogously). Then we should first solve

(C - W' R⁻¹ W) Y+ = Y~ + W' R⁻¹ X~ ,   (20a)
which determines Y+ up to a shift of origin because the matrix C - W'R⁻¹W is generally of rank
m-1 (its null space is spanned by the vector em, due to the definition of W, C, and R). Next, any
solution of (20a) can be used to determine X+ from

X+ = R⁻¹ (X~ + W Y+) .   (20b)
Finally, although this is not really necessary, X+ and Y+ can be simultaneously centered so that
their joint centroid is in the origin. This settles the relocation part for internal unfolding.
Now consider a slight generalization in the regression part. Some species might cover a wider
range of sites than others, independent of the location of their peaks. If the frequencies are
normalized on the sum, this will tend to make the minus log proportions uniformly larger, which
might be considered undesirable. This effect can be removed by introducing a scaling parameter for
each species as a generalization of (7):
δij(αi) = αi δij ,   (21)
Note that all that would have to be done for including (21) in the external unfolding algorithm
would be to execute it for each species separately, because that would make (9) effectively row-
specific, and the row-point movements were done independently anyhow. For internal unfolding,
however, the loss function has to be adjusted explicitly:
σC² = Σi [ Σj wij {δij(αi) - d(xi, yj)}² / Σj wij δij²(αi) ] ,   (22)

where the subscript C in σC is used to indicate the conditionality of the regression and normalization
on the rows (the loss function is "split by rows", cf. Kruskal and Carroll 1969). Yet the algorithm
does not become very much more complicated. Keeping the distances fixed, the normalized
regression (9) must simply be done on each row separately, giving αi+. Next new weights can be
defined as

wij~ = wij / Σk wik δik²(αi+) ,   (23)
which shows that minimizing (22) becomes equivalent to the basic unconditional problem (16),
with row-wise rescaled data and row-wise rescaled weights.
Summarizing the steps again in a skeleton algorithm for row-conditional internal unfolding
we get:
* START: choose initial configurations X and Y
* (i) compute new locations X+ and Y+ from (18), (20a) and (20b)
* (ii) compute the new distances d(xi, yj)
* (iii) perform the normalized regression (9) on each row separately, giving αi+, and rescale the data and the weights as in (23)
* (iv) if the loss (22) has decreased negligibly, STOP; otherwise return to (i)
The algorithm is now illustrated for a classical set of single-peaked ecological data.
The original data (from Brown and Curtis 1952) are the "importance values" of seventeen tree
species in 55 woodland stands. Importance value is a compound measure of species abundance, it
being the sum of relative frequency, relative density, and relative dominance of any species in a
given stand. The data were standardized species-wise as indicated in (4), with a factor of 105% of
the maximum importance values, and coded as (5a) and (5b). This way one obtains small, but non-
zero dissimilarity in the maximum abundance cells. To keep the analysis simple, species-specific
free scaling parameters were omitted. The discussion in Kershaw and Looney (1985) has served as
background; they explain how Brown and Curtis obtained single-peaked importance curves for the
species, the way in which a climax adaptation number was assigned to each species, and give other
details on the original analysis. The species involved here, and their climax adaptation numbers, are
given in Table 1. The climax concept implies that the vegetation has developed to a state of
equilibrium with the environment, but its intricacies are definitely beyond the scope of the present
paper. The adaptation numbers are simply used to label the results of the unfolding analysis (see
Figure 5).
Again for reasons of simplicity, the algorithm was executed in two dimensions. Apparently
the horizontal axis, ranging from Pinus banksiana to Acer saccharum, closely resembles the climax
number arrangement (product-moment correlation: 0.97). This is a first, rather strong indication for
the validity of the model. But there is plenty of variation to account for in addition to that. For
instance, Pinus resinosa and Quercus ellipsoidalis almost never occur together in the same stand,
even though they differ by only one unit in climax number. The two-dimensional unfolding
[Fig. 5. Two-dimensional unfolding solution for the Brown and Curtis data: the seventeen tree species (labelled with their climax adaptation numbers) and the 55 stands, the latter labelled with the importance values of Pinus strobus.]
analysis shows this by giving them a large separation in the vertical direction, as is also the case for
Betula lutea and Ulmus americana, and, although less strongly, for other pairs.
The model fits the data reasonably (σR = .2254, which is not entirely satisfactory according
to the current standards, indicating that a three-dimensional model could be called for, or, alter-
natively, optimal rescaling of the species profiles). In order to present more concrete evidence
for the quality of fit, the sites in Figure 5 are labeled with the original importance values of Pinus
strobus, which shows the approximate single-peakedness clearly (Pinus strobus is absent in the
unlabelled sites). Reconstructions of similar quality can be obtained for the other tree species.
[Fig. 6. Alternate labelling of the sites: calcium values (10's lb. per acre).]
Since we now have an ordination of the stands along with the optimal tree locations, various
stand characteristics can be examined to gain further understanding of the species-environment
interaction. In Figure 6 the stands are labelled with their calcium values. These tend to increase
when we move from the lower left to the upper right corner. It is especially the area around Ulmus
americana and Ostrya virginiana that has characteristically high calcium values. A numerical
assessment of the strength of relationships like this could be obtained by multiple regression
analysis with the point coordinates serving as predictor variables.
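Such an assessment is a plain multiple regression; a short numpy sketch, with coordinates and calcium values invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented example: 55 stand coordinates from some two-dimensional
# ordination, and a stand characteristic (calcium) that trends along
# one direction of the configuration.
coords = rng.normal(size=(55, 2))
calcium = 100 + 30 * coords[:, 0] + 15 * coords[:, 1] \
          + rng.normal(scale=5.0, size=55)

# Multiple regression with the point coordinates as predictors.
Z = np.column_stack([np.ones(55), coords])
b, *_ = np.linalg.lstsq(Z, calcium, rcond=None)
resid = calcium - Z @ b
r2 = 1 - (resid**2).sum() / ((calcium - calcium.mean())**2).sum()
print(round(r2, 3))
```

The squared multiple correlation then quantifies how strongly the stand characteristic follows a direction in the ordination plane.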
Now that the two basic ways of unfolding via dissimilarity approximation have been
discussed, external when one of the two sets of points is fixed in advance, and internal when both
sets are free to vary, it will be instructive to reconsider the specification of dissimilarities and
weights. Suppose that, instead of (5a) and (5b), it is specified that:

δij = 0 ,   (24a)
wij = fij ,   (24b)

where the second one is not really a change, but the first one says that a species point should
coincide with any site where it occurs, with frequency of occurrence used as weight. When these
specifications are substituted in the basic unfolding loss function (6) one obtains:

σCA² = Σi Σj fij d²(xi, yj) ,   (25)
because the weighted sum of squared dissimilarities and the weighted sum of squared cross
products vanish, due to the special structure in (24a) and (24b). The remaining part of the loss
function, (25), closely resembles the location problem as defined in (10), but aims at squared
distance minimization. Squared distance minimization is interesting for a number of reasons.
First, note that the SMACOF algorithm breaks down immediately under this specification,
because the matrices A (cf. (12a) and (12b)), and thus P (13a) and Q (17a) all vanish. So the
specification is at least incomplete, it has to be supplemented by a strong form of normalization or
a radical type of restriction. A good example of the latter is of course the external approach, which
now has an easy solution. To see this, it is convenient to write loss function (25) in matrix
notation, using the same symbols R and C as before (cf. (13b) and (17c)) for the marginal totals of
the matrix F = {fij}, and writing "tr" for the trace of a matrix:
σCA² = tr X'RX - 2 tr X'FY + tr Y'CY .   (26)
For fixed Y the stationary equations for a minimum of σCA² over X are (setting the partial
derivatives with respect to X equal to zero):

X+ = R⁻¹FY ,   (27a)
Y+ = C⁻¹F'X .   (27b)
Comparing (27a) with the external unfolding result (14a), it turns out that the solution to squared
distance minimization merely involves taking a weighted average of the fixed points, not a
transform of some previous estimate such as X~. The best location of a species ideal point now is
the centre-of-gravity of the sites it occurs in. When the species points are fixed, the best location of
a site is the centre-of-gravity of the species it is covered with.
The internal approach is conceptually somewhat problematical from the present point of view.
First, we have to keep away from the trivial solution X = Y = 0, which certainly would minimize
(26). In a one-dimensional analysis, this is usually done by requiring that one of the sets of scores
is standardized in the metric of the marginal totals, e.g. en'Rx = 0 and x'Rx = n (where the
notation x and y is used for the vectors of one-dimensional species- and site scores, whereas xi
and yj denote the p-dimensional species- and site points). The first requirement can be formulated
as JRx = x, and can be inserted in the loss function; here JR is the projection operator
JR = I - en (en' R en)⁻¹ en' R ,   (28)
that centers all n-vectors, with weights R. The second one can be handled by introducing a
Lagrangian multiplier λ, so that the adjusted minimization problem for the simultaneous estimation
of x and y becomes

minimize σCA²(JRx, y) - λ(x'Rx - n) over x and y ,   (29)
from which it follows in the usual way that x* and y* are a solution whenever they satisfy (using
the relationships JR'R = RJR and R⁻¹JR' = JRR⁻¹):

x* = JRR⁻¹Fy* λ⁻¹ ,   (30a)
y* = C⁻¹F'x* .   (30b)
These are the well-known reciprocal averaging, dual scaling, or transition formulas of
correspondence analysis (e.g., Nishisato 1980). So under the specifications (24a) and (24b) of
trying to minimize the distance between a species and a site in the degree of their abundance,
correspondence analysis is a special way of performing internal unfolding.
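The transition formulas lend themselves to a compact power-method sketch: alternate weighted averaging of the two sets of scores, removing the trivial solution by weighted centering and keeping x'Rx = n. The function name and the abundance table below are invented for illustration.

```python
import numpy as np

def reciprocal_averaging(F, n_iter=500):
    # Reciprocal averaging for the first non-trivial axis: alternate
    # weighted averages, then center and normalize the row scores in
    # the metric of the row marginals (en'Rx = 0, x'Rx = n).
    r = F.sum(axis=1)                      # diagonal of R
    c = F.sum(axis=0)                      # diagonal of C
    n = F.shape[0]
    x = np.arange(n, dtype=float)          # arbitrary start
    for _ in range(n_iter):
        y = (F.T @ x) / c                  # y = C^-1 F'x
        x = (F @ y) / r                    # x = R^-1 F y
        x = x - (r @ x) / r.sum()          # remove the trivial solution
        x = x * np.sqrt(n / (r @ x**2))    # enforce x'Rx = n
    return x, y

# Invented abundance table with a band ("seriation") structure.
F = np.array([[5.0, 2.0, 0.0],
              [2.0, 4.0, 2.0],
              [0.0, 2.0, 5.0]])
x, y = reciprocal_averaging(F)
print(x)
```

For a banded table like this one, the recovered row scores separate the two extreme rows with opposite signs, the classic seriation behaviour of the first correspondence analysis axis.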
In order to obtain a solution of dimensionality greater than one, a third normalization
condition must be imposed to avoid repetition of the first solution in the columns of X and Y
(because that would actually give the smallest value of the loss function). How to do this is not free
from arbitrariness under the present rationale of the method. Usually one requires in addition that
the coordinates of the higher dimensions are R- or C-orthogonal with respect to the earlier ones.
This gives the stationary equations of a higher-dimensional correspondence analysis. The formulas
are omitted here (but see section 3.6). Healy and Goldstein (1976) have argued that the "usual"
normalization conditions are in fact restrictions, and they presented an alternative solution based on
linear restrictions that can be freshly chosen in any particular application. Whether the freedom
gained should be considered an asset or a liability is difficult to say.
Even within the confines of the usual normalization conditions there remains an awkward
arbitrariness with regard to the species-site distances in a joint plot. We can just as well normalize y
and leave x free, thereby obtaining the same value of the loss function. There is also the possibility
to "distribute λ" among x and y. Although in all cases the weighted mean squared distance (25)
remains equal, the actual Euclidean distances between species points and site points may change
considerably, especially when λ is small. This was one of the reasons for Legendre and Legendre
(1983, p. 278) to warn against making biplots; for who can withdraw from considering distances
while looking at a configuration of points! Also note that the "folding" interpretation of picking the
representation up at a species point i in order to obtain an approximate reconstruction of the i'th
row of the data matrix will give different results under different normalizations.
Finally, we may substitute (30a) in (30b), or vice versa, from which an eigenvalue-
eigenvector problem in only one of the sets remains. So in contrast to the general unfolding
problem, correspondence analysis "has no memory" for the previous locations of the same set
when solved iteratively by alternating between (30a) and (30b); in fact one of the sets of points is
superfluous for solving the problem! Therefore the recognition that it is formally a special case of
unfolding has limited value. It is often preferable to view correspondence analysis - or, for that
matter, principal components analysis - as a way to perform two related, "dual" multidimensional
scaling problems, in which one tries to fit the so-called chi-squared distances among the rows or
columns of the data matrix. This specific viewpoint is more fully explained in Heiser and Meulman
(1983a) and Fichet (1986). An up-to-date, comprehensive account of the method was provided by
Greenacre (1984), who was also the first who seriously compared correspondence analysis with
unfolding in his 1978 dissertation. The use of (24a) and (24b) in connection with the standard
unfolding loss function was suggested by De Leeuw (personal communication) and more fully
worked out in Heiser (1981). Hayashi (1952, 1954, 1956, 1974) based his "theory of
quantification" almost entirely on (25), and dealt with many of the possible appearances the matrix
F can have.
In one of his early papers on multidimensional scaling, Shepard (1958) adduced evidence for
an exponential decay function relating frequency of substitution behaviour to psychological
distance. Transferring this idea, we could model expected frequency E(fij) as:

E(fij) = βi exp(-d(xi, yj)/σi) ,   (31)

with βi a positive number representing the maximum of the function (attained when the species
point xi coincides with the site point yj), and σi a positive number representing the dispersion or
tolerance of the species distribution. From (31) it follows that log expected frequency is linear in
the distances:

log E(fij) = log βi - d(xi, yj)/σi .   (32)

Under this model, then, we could still use the SMACOF algorithm by generalizing the definition of
δij again a little, writing

δij = μi - σi log fij ,   (33)

where μi = σi log βi. In fact, this model inspired the earlier definition of δij, (5a), where μi could
be omitted by fixing βi equal to one ("to make the curves comparable"). Using (33) instead implies
that we no longer have to use a standardization factor like fi+ (4) prior to the analysis, but can try to
find values that optimize the fit to the data. For the skeleton algorithm it would entail step (iii) to be
a linear regression including an intercept term. The price is n degrees of freedom and, as experience
which was studied in ecology by Ihm and van Groenewoud (1975), Austin (1976), Kooijman
(1977), Gauch and Chase (1974), Gauch et al. (1974), and others. Also see Schonemann and
Wang (1972). Under the Gaussian decay function it is again the species-site distance that plays the
central part. But now log expected frequency is linear in the squared distances, and this suggests
that we can use (33) in combination with the alternate loss function
σS² = Σi Σj wij {δij² - d²(xi, yj)}² ,   (35)
which is called SSTRESS by Takane et al. (1977), who proposed it as a general MDS loss function,
and which was studied in detail for the unfolding case by Greenacre (1978) and Browne and
Greenacre (1986). Here, as in the SMACOF algorithm, δij may be a fixed set of dissimilarities, or
some function of the original frequencies like (33). The regression principle remains the same.
Minimizing (35) would form a feasible and efficient alternative for the maximum likelihood
methods of Johnson and Goodall (1980) or Ihm and van Groenewoud (1984), or the least squares
method of Gauch et al. (1974). In the latter methods it is not the data that is transformed, but the
distances. The STRESS and SSTRESS methods are based on optimal rescaling to achieve reduction
of structural complexity, the same data analytic principle on which the nonlinear multivariate
analysis and path analysis methods are based that are discussed by De Leeuw (1987a, 1987b) in
this volume.
It is possible to relate SSTRESS and STRESS in the following way (Heiser and De Leeuw
1979):
the approximation being better if the dissimilarities and distances match well. So we can simulate
SSTRESS solutions with the SMACOF algorithm by using an additional square root transformation
and choosing the dissimilarities as weights. This form of weighting will tend to give less emphasis
to local relationships, in favour of getting the large distances right.
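The relation can be made concrete. If ε denotes a dissimilarity compared against a squared distance, then ε - d² = (√ε - d)(√ε + d) ≈ 2√ε(√ε - d) when the two match well, so Σ(ε - d²)² ≈ 4 Σ ε(√ε - d)²: a square-root transformation with the dissimilarities as weights. The following is a numerical sketch of that first-order identity with illustrative values, not a reproduction of the original equation:

```python
import numpy as np

# First-order link between an SSTRESS-type and a weighted STRESS-type loss:
#   (eps - d^2)^2 = ((sqrt(eps) - d)(sqrt(eps) + d))^2  ~  4 * eps * (sqrt(eps) - d)^2
# when the distances d match sqrt(eps) well. All values are illustrative.
rng = np.random.default_rng(0)
eps = rng.uniform(1.0, 9.0, size=200)                    # dissimilarities
d = np.sqrt(eps) * (1.0 + rng.normal(0.0, 0.02, 200))    # well-matching distances

sstress_like = np.sum((eps - d**2) ** 2)
weighted_stress = np.sum(4.0 * eps * (np.sqrt(eps) - d) ** 2)

rel_err = abs(sstress_like - weighted_stress) / sstress_like
assert rel_err < 0.05
```

The weights ε grow with the dissimilarities, which is why this weighting favours getting the large distances right at the expense of local relationships.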
Ihm and van Groenewoud (1984), Ter Braak (1985), and Ter Braak and Barendregt (1986)
recently compared maximum likelihood estimation under the Gaussian response model with
correspondence analysis, as we have seen, a technique also based on the squared Euclidean distance
function. The results are encouraging for correspondence analysis, especially if the species
dispersions are homogeneous.
Kershaw (1968) used a square root transformation of the abundances to make them less
heterogeneous. It is, of course, one of the usual statistical ways to stabilize the variance. Now
suppose we take the inverse square root as an alternative definition of dissimilarity, and the
frequencies themselves as weights:
δij = 1/√fij ,  wij = fij ,  (i,j) ∈ P     (37a)

Then the basic loss function σ²R transforms into (P denotes all pairs present in (37a))

σ²R = Σ(i,j)∈P fij {1/√fij - d(xi, yj)}²     (38)
Thus loss is measured in terms of the ratio of distance and dissimilarity (for a defense of using
these relative deviations, see Greenacre and Underhill 1982), and we now obviously give more
weight to the small dissimilarities. It is interesting to compare this weighting structure with yet
another loss function, proposed by Ramsay (1977). He similarly argued that dissimilarity
measurements in psychology are frequently lognormally distributed. The lognormal arises from the
product of many independent and (nearly) identically distributed random variables. It has been
frequently applied as a model for the variation of nonnegative quantities (Aitchison and Brown
1957; Derman et al. 1973), indeed also for abundances (Grundy 1951). If dissimilarity is assumed
to be lognormally distributed we should work with
(39)
which forms the basis of Ramsay's MULTISCALE algorithm. In order to relate it to the standard
loss, we can use the first order approximation
(40)
(41)
So Ramsay's loss function can be approximated by using the inverse squared dissimilarities as
weights in the standard loss function. The same reasoning is present in (37a), which led to (38).
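Both weightings can be verified directly. The sketch below, with illustrative values only, checks (i) that the inverse-square-root dissimilarity with frequency weights in (38) measures squared relative deviations, and (ii) the first-order approximation of the lognormal loss by inverse squared dissimilarities as weights, as in (40)-(41):

```python
import numpy as np

# Illustrative check of the two weighting arguments in the text.
rng = np.random.default_rng(1)
f = rng.integers(1, 20, size=100).astype(float)    # abundances, used as weights
delta = 1.0 / np.sqrt(f)                           # inverse-square-root dissimilarity
d = delta * (1.0 + rng.normal(0.0, 0.05, 100))     # distances near the dissimilarities

# (38): f * (1/sqrt(f) - d)^2 is exactly the squared relative deviation (1 - d/delta)^2
lhs = f * (delta - d) ** 2
rhs = (1.0 - d / delta) ** 2
assert np.allclose(lhs, rhs)

# (40)-(41): the lognormal loss (log delta - log d)^2 is approximated, to first order,
# by (delta - d)^2 / delta^2, i.e. inverse squared dissimilarities as weights
log_loss = (np.log(delta) - np.log(d)) ** 2
approx = (delta - d) ** 2 / delta ** 2
assert np.max(np.abs(log_loss - approx)) < 0.05
```

In both cases the weights 1/δ² are largest for the small dissimilarities, which is the down-weighting of large distances described above.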
The choice between so many possible types of transformation of the raw data can be
circumvented by defining a radically extended class of transformations as
(42)
So dissimilarity should increase whenever abundance decreases, for each species separately. This
specification would form the basis of a row-conditional, nonmetric unfolding algorithm. The idea
of posing mere monotonicity (42) as the basis of the technique is due to Coombs (1950). He did not
provide a working algorithm, however; it was not until the sixties that Shepard, Kruskal, Guttman
and others developed general nonmetric MDS algorithms (Kruskal 1977; De Leeuw and Heiser
1982). Technically, our skeleton algorithm only needs alteration in step (iii), where the type of
regression performed should be of the monotonic, or isotonic, variety (Kruskal 1964a, 1964b;
Barlow et al. 1972). Yet the nonmetric unfolding case always remained something of a problem,
due to a phenomenon called degeneration: a tendency to collapse many points, or, anyhow, to make
all distances equal (cf. section 4.3). These problems, and proposals to resolve them (although not
fully satisfactorily), are explained in Kruskal and Carroll (1969) and in Heiser (1981), who argued
that it is necessary to put bounds on the regression. Subsequently Heiser (1985, 1986) proposed a
smoothed form of monotonic regression in order to obtain a better behaving algorithm, and this
refinement might make standard application of nonmetric unfolding feasible.
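The monotonic regression required in step (iii) can be sketched briefly. The following is a minimal pool-adjacent-violators routine for the unweighted, nondecreasing case; it illustrates the plain isotonic fit of Kruskal and Barlow et al., not the smoothed or bounded variants just mentioned:

```python
def isotonic_regression(y):
    """Pool Adjacent Violators: least-squares fit of a nondecreasing
    sequence to y (unweighted case)."""
    blocks = []  # each block holds [sum, count] of pooled values
    for v in y:
        blocks.append([float(v), 1])
        # merge backwards while block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

# pseudo-distances ordered by the data; violations are averaged out
print(isotonic_regression([1.0, 3.0, 2.0, 4.0, 3.0, 5.0]))
# [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
```

Note that nothing in this plain fit prevents long runs of equal fitted values, which is exactly the degeneracy toward equal (pseudo-)distances discussed below.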
The one-dimensional case of any STRESS minimizing algorithm deserves special care.
Guttman (1968) already pointed out its special status, and De Leeuw and Heiser (1977), also see
Heiser (1981), showed that the SMACOF algorithm does not really resolve the combinatorial
complications that arise in this case. Quite independently, Wilkinson (1971) made some insightful
observations on a form of one-dimensional unfolding, and showed the connection with the so-
called travelling salesman problem. Poole (1984) analysed the situation along the lines of the
graphical version of the algorithm in Figure 4, and proposed an improvement for the one-
dimensional case. Fortunately we now also have Hubert and Arabie (1986), who provided a
globally convergent, dynamic programming algorithm for one-dimensional MDS, extending the
work of Defays (1978). Little is known about its performance in the unfolding situation, but it
surely marks an exciting step forward.
In this section the major tools are described for restricting the locations of either the species
points, or the site points, or both. This is done first for the SMACOF algorithm, next for correspon-
dence analysis. Remember the SMACOF algorithm always starts with the preliminary updates X~
and Y~, as defined in (14a) and (18). These provide the basic corrections necessary to obtain a
better fit to the dissimilarities. From the general results of De Leeuw and Heiser (1980) it then
follows that the remaining task is to find
where Ω is the domain of minimization, or feasible region. When X and Y are completely free, Ω
is the set of all (combined) n×p and m×p matrices, and from equating the partial derivatives to zero
one obtains the system of linear equations (19a) and (19b) for the unrestricted internal unfolding
problem. In De Leeuw and Heiser (1980) it is also shown that it is not at all necessary to solve
problem (43) completely; it suffices to move from a feasible point in the right direction for
minimizing it. The algorithm will still converge to at least a local minimum. This important fact
opens the possibility to use alternating least squares, i.e. to split the parameter set into subsets, and
to alternate among the subset minimizations. The obvious candidate for a first split in the unfolding
situation is into X and Y, and accordingly (43) can be split into two subproblems (again writing X~
and Y~ for fixed matrices, and after some rearrangement of terms):
then it may be verified that the correspondence analysis loss function transforms into
The second term on the right-hand side of (46) is constant, so we again end up with a projection
problem in the metric R, in which X~ rather than R⁻¹(X~ + WY) must be projected onto the
feasible region. All the possibilities of restrictions mentioned for the SMACOF algorithm are now
open to us for correspondence analysis. Historically, it is not quite fair to say this, because a lot of
them were used earlier in the developing Gifi system (cf. Gifi 1981). Still, the formulation
presented here is new, and especially putting together (44a) and (46) clarifies the similarities and
differences between unfolding and correspondence analysis a great deal. Ter Braak (1986a, 1986b)
has further developed the case in which the site locations are linear combinations of environmental
variables, under the name "canonical correspondence analysis".
A special example of restrictions in correspondence analysis is Hill and Gauch's (1980)
method of detrended correspondence analysis. They don't compute all dimensions simultaneously,
but work successively. Their aim is to remove the horseshoe effect, and other nonlinearities in
higher dimensions. To bring it in the present fonnulation, suppose Xl is the first set of scores,
satisfying - as explained in section 3.3 - JRxI = Xl and xI'RxI = n. Then, instead of requiring R-
orthogonality of x2, i.e. x2'RxI = 0, the idea is to have x2locally centered. To do this. an nxkG
matrix G can be fonned on the basis of xl> indicating a partitioning into kG blocks of species that
are close together on X l' Thus G is binary and G 'G is diagonal. The projection matrix
JG = I - G(G'G)⁻¹G'     (47)
is the required block-wise centering operator, and the new requirement becomes JGx2 = x2. This
can be inserted in (46), which shows that we have to solve
(48)
The weak point in this method is that it does not provide a unique, convincing definition of G, as a
result of which it may sometimes detrend too much, sometimes too little. This objection is
comparable to the earlier remark on the specificity of Healy and Goldstein's (1976) restrictions.
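The block-wise centering of (47) can be sketched numerically. In the example below the partitioning matrix G is simply assumed (two blocks of three species each), which is precisely the arbitrariness criticized above:

```python
import numpy as np

# Sketch of the detrending projector JG = I - G(G'G)^{-1}G' of (47).
# The partitioning into blocks is an assumption, not given by the method.
n, kG = 6, 2
G = np.zeros((n, kG))
G[:3, 0] = 1.0   # block 1: first three species, adjacent on x1
G[3:, 1] = 1.0   # block 2: last three species

JG = np.eye(n) - G @ np.linalg.inv(G.T @ G) @ G.T

x2 = np.array([2.0, 4.0, 6.0, 1.0, 5.0, 9.0])
centered = JG @ x2

# JG removes the mean within each block ...
assert np.allclose(centered[:3].sum(), 0.0)
assert np.allclose(centered[3:].sum(), 0.0)
# ... and is idempotent, as a projector should be
assert np.allclose(JG @ JG, JG)
```

Since G is binary with G'G diagonal, G(G'G)⁻¹G' simply replaces each value by its block mean, so JG subtracts block means; a different assumed partitioning would detrend differently.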
Homogeneity analysis is the key method of the Gifi system of nonlinear multivariate analysis
(De Leeuw 1984, 1987a). It employs indicator matrices as a basis for all nonlinear transformations
of a given set of variables, and selects precisely those transformations that are as homogeneous as
possible. If the data matrix F in correspondence analysis is chosen as the set of concatenated
indicator matrices, we obtain solutions that are essentially equivalent to those of homogeneity
analysis. An extended discussion on the details of this connection can be found in Heiser (1981,
chapter 4). There, as well as in Heiser (1985), it was argued that in the case of shifted single-
peaked variables the homogeneity approach should not be followed without restraint. If we think it
is characteristic for species to have distributions that are shifted with respect to each other, we
should not center them (which is part of making them as homogeneous as possible). If, moreover,
the variables are thought to give an asymmetrical type of information, i.e. high abundance indicates
similarity of sites and low abundance dissimilarity, then we should not try to give equally dissimilar
sites as much as possible the same quantification.
Homogeneity analysis in a generalized sense can still be used, provided the right kind of
change of variables, or variable coding, is chosen. One possibility is to use conjoint coding
(Heiser 1981, p. 123), which associates a nested sequence of sites to each species. The rationale of
conjoint coding is to assume that we deal with only one multinomial variable, species composition,
with the n species as its categories, and separately established for each site. Reliance on the exact
numerical values of abundance can be avoided by considering K level sets, from "exceptionally
abundant" via "moderately abundant" to "not absent" (note that the level sets are cwnulative). In
conjoint coding K binary m x n matrices are defined, the k'th of which indicates the presence, in
site j, of species i at level of abundance at least k. These are not ordinary indicator matrices, as they
do not have mutually exclusive columns, nor row sums equal to one, but they can be submitted to a
correspondence analysis just as well. All sites corresponding to the 'ones' in any column should be
as closely as possible together, and the weighted mean scores of the columns should be as far as
possible apart. The description here deviates from Heiser (l.c.), but only to the effect that a
different order of columns is used. This method was proposed earlier by Wilkinson (1971) and,
independently, by Hill et al. (1975), who called it the "method of pseudo-species" (see also Hill
1977).
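The construction of the K cumulative level sets can be made concrete with a small sketch; the abundance table and the cut points for the K levels below are assumptions chosen for illustration:

```python
import numpy as np

# Sketch of conjoint (pseudo-species) coding: from an m x n abundance table F,
# build K binary m x n matrices, the k'th marking abundance at level >= k.
F = np.array([[0, 5, 1],
              [2, 0, 3],
              [7, 1, 0]])          # 3 sites x 3 species (illustrative values)
cuts = [1, 3, 5]                   # K = 3 cumulative abundance levels (assumed)

levels = [(F >= c).astype(int) for c in cuts]

# the level sets are cumulative (nested): presence at level k+1 implies level k
for a, b in zip(levels, levels[1:]):
    assert np.all(b <= a)

# concatenated side by side they form the m x (K*n) table submitted to
# correspondence analysis; columns are not mutually exclusive indicators
coded = np.hstack(levels)
assert coded.shape == (3, 9)
```

The nesting of the levels is what distinguishes these matrices from ordinary indicator matrices: their columns overlap and row sums exceed one.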
A second possibility is to use convex coding (Heiser, 1981, section 5.3), which is especially
tailored to the situation where there are more species or individuals than sites, because it uses the
geometrical property that the site space can be partitioned into so-called isotonic regions. Convex
coding does work with ordinary indicator matrices. Since these alternative ways of coding have not
yet been used a great deal, their data analytic value is uncertain.
techniques will find the correct ordering as their first dimension (see Guttman 1950, and Hill 1974,
for somewhat less general statements; Heiser (1981, section 3.2) proved the proposition in the
form stated here; see Schriever 1985, for a comprehensive discussion of such ordering properties).
One would of course like to be able to say that each unfolding method shares this property,
but it is an open question under what conditions anyone unfolding technique can be said to achieve
an optimal rearrangement in the above sense. Perhaps it is necessary to assume symmetry of the
single-peaked functions. A second open question is how to devise an efficient method that directly
optimizes the single-peakedness condition. Wilkinson (1971) proposed a combinatorial method to
find a permutation yielding consecutive ones, but little is known about its effectiveness.
4.3. Horseshoes
solution appear to yield an unacceptably poor monotone fit and/or substantive interpretation."
Meanwhile there has been considerable technical progress for the one-dimensional case (cf. section
3.5). Also, it seems likely that the MDS-horseshoe frequently arises from the occurrence of large
tie-blocks of large dissimilarities, for instance when they are derived from presence-absence data.
In such cases it is advisable to down-weight the large distances, which also forms the basis of the
so-called parametric mapping technique (Shepard and Carroll 1966). In many of the specifications
in the previous sections down-weighting was used as well.
Finally, there is a typical horseshoe effect for unfolding, due to regression to the mean. If the
regression part in the unfolding algorithm is not selected carefully, for instance a straightforward
monotonic regression is inserted, then the technique capitalizes on a general property of many kinds
of regression to yield regressed values that are more homogeneous than the regressants. The
unfolding technique is attracted to the extreme case of (nearly) equal pseudo-distances, because it
can so easily find a configuration with equal distances: all points of one set collapsed at a single
location, all points of the other set distributed on part of a circle or sphere around it. Linear or
polynomial regression without an intercept, and restricted forms of monotonic regression seem to
provide the best safe-guards against this type of degeneration (cf. section 3.5).
In conclusion, the horseshoe effect is something to be avoided in most cases, and it can be
avoided by an adequate choice of dimensionality, by using the right kind of nonlinear model,
and/or by well-considered transformations of the observations.
Acknowledgements
I would like to acknowledge gratefully the suggestions of the reviewers, F. James Rohlf and
Robert Gittins, and the comments of Daniel Wartenberg and Cajo J.F. ter Braak on an earlier draft.
REFERENCES
AITCHISON, J. AND J.A.C. BROWN. 1957. The Lognormal Distribution. Cambridge University
Press, New York, NY.
AUSTIN, M.P. 1976. On non-linear species response models in ordination. Vegetatio 33: 33-41.
AUSTIN, T.L.jr. 1959. An approximation to the point of minimum aggregate distance. Metron 19:
10-21.
BARLOW, R.E., D.J. BARTHOLOMEW, J.M. BREMNER, AND H.D. BRUNK. 1972. Statistical
Inference under Order Restrictions. Wiley, New York, NY.
BRAUN-BLANQUET, J. AND H. JENNY. 1926. Vegetationsentwicklung und Bodenbildung in der
alpinen Stufe der Zentralalpen. Neue Denkschr. Schweiz. Naturforsch. Ges. 63: 175-349.
BRAY, J.R. AND J.T. CURTIS. 1957. An ordination of the upland forest communities of Southern
Wisconsin. Ecol. Monogr. 27: 325-349.
BROWN, R.T. AND J.T. CURTIS. 1952. The upland conifer-hardwood forests of northern
Wisconsin. Ecol. Monogr. 22: 217-234.
BROWNE, M.W. AND M.J. GREENACRE. 1986. An efficient alternating least squares algorithm to
perform multidimensional unfolding. Psychometrika 51: in press.
CARROLL, J.D. 1969. Polynomial factor analysis. Proc. 77th Annual Convention of the APA 4:
103-104.
CARROLL, J.D. 1972. Individual differences and multidimensional scaling, p. 105-155. In R.N.
Shepard et al. [ed.] Multidimensional Scaling, Vol I: Theory. Seminar Press, New York, NY.
COOMBS, C.H. 1950. Psychological Scaling without a unit of measurement. Psych. Rev. 57: 148-
158.
COOMBS, C.H. 1964. A Theory of Data. Wiley, New York, NY.
COOMBS, C.H. AND J.E.K. SMITH. 1973. On the detection of structure in attitudes and develop-
mental processes. Psych. Rev. 80: 337-351.
COTTAM, G. AND J.T. CURTIS. 1956. The use of distance measures in phytosociological
sampling. Ecology 37: 451-460.
COXON, A.P.M. 1974. The mapping of family-composition preferences: A scaling analysis. Social
Science Research 3: 191-210.
CURTIS, J.T. AND R.P. MCINTOSH. 1951. An upland continuum in the prairie-forest border
region of Wisconsin. Ecology 32: 476-496.
DAVISON, M.L., P.M. KING, K.S. KITCHENER, AND C.A. PARKER. 1980. The stage sequence
concept in cognitive and social development. Developm. Psych. 16: 121-131.
DE LEEUW, J. 1977. Applications of convex analysis to multidimensional scaling, p. 133-145. In
J.R. Barra et al. [ed.] Recent Developments in Statistics. North-Holland, Amsterdam.
DE LEEUW, J. 1982. Nonlinear principal component analysis, p. 77-89. In H. Caussinus et al.
[ed.] COMPSTAT 1982. Physica Verlag, Vienna.
DE LEEUW, J. 1984. The Gifi system of nonlinear multivariate analysis, p. 415-424. In E. Diday et
al. [ed.] Data Analysis and Informatics, III. North-Holland, Amsterdam.
DE LEEUW, J. 1987a. Nonlinear multivariate analysis with optimal scaling. In this volume.
DE LEEUW, J. 1987b. Nonlinear path analysis with optimal scaling. In this volume.
DE LEEUW, J. AND W.J. HEISER. 1977. Convergence of correction matrix algorithms for multi-
dimensional scaling, p. 735-752. In J. Lingoes [ed.] Geometric representations of relational
data. Mathesis Press, Ann Arbor, Mich.
DE LEEUW, J. AND W.J. HEISER. 1980. Multidimensional scaling with restrictions on the con-
figuration, p. 501-522. In P.R. Krishnaiah [ed.] Multivariate Analysis, Vol V. North-Holland,
Amsterdam.
DE LEEUW, J. AND W.J. HEISER. 1982. Theory of multidimensional scaling, p. 285-316. In P.R.
Krishnaiah and L.N. Kanal [ed.] Handbook of Statistics, Vol 2. North-Holland, Amsterdam.
DEFAYS, D. 1978. A short note on a method of seriation. Brit. J. Math. Stat. Psych. 31: 49-53.
DERMAN, C., L.J. GLESER, AND I. OLKIN. 1973. A Guide to Probability Theory and Application.
Holt, Rinehart and Winston, New York, NY.
DRAPER, N.R. AND H. SMITH. 1966. Applied Regression Analysis. Wiley, New York, NY.
FICHET, B. 1986. Distances and Euclidean distances for presence-absence characters and their
application to factor analysis. In J. de Leeuw et al. [ed.] Multidimensional Data Analysis.
DSWO Press, Leiden, in press.
GABRIEL, K.R. 1971. The biplot graphic display of matrices with application to principal
component analysis. Biometrika 58: 453-467.
GAUCH, H.G. 1982. Multivariate analysis in community ecology. Cambridge University Press,
Cambridge.
GAUCH, H.G. AND G.B. CHASE. 1974. Fitting the Gaussian curve to ecological data. Ecology 55:
1377-1381.
GAUCH, H.G., G.B. CHASE, AND R.H. WHITTAKER. 1974. Ordination of vegetation samples by
Gaussian species distributions. Ecology 55: 1382-1390.
GAUSE, C.F. 1930. Studies of the ecology of the orthoptera. Ecology 11: 307-325.
GIFI, A. 1981. Nonlinear Multivariate Analysis. Department of Data Theory, University of Leiden,
Leiden.
GITTINS, R. 1985. Canonical Analysis: A Review with Applications in Ecology. Physica Verlag,
Berlin.
GOODALL, D.W. 1954. Objective methods for the classification of vegetation, III. An essay in the
use of factor analysis. Aust. J. Bot. 2: 304-324.
GREENACRE, M.J. 1978. Some objective methods of graphical display of a data matrix. Special
Report, Dept. of Statistics and Operations Research, University of South-Africa, Pretoria.
GREENACRE, M.J. 1984. Theory and Applications of Correspondence Analysis. Academic Press,
London.
GREENACRE, M.J. AND L.G. UNDERHILL. 1982. Scaling a data matrix in a low-dimensional
Euclidean space, p. 183 - 268. In D.M. Hawkins [ed.] Topics in Applied Multivariate Analysis,
Cambridge University Press, Cambridge.
GREIG-SMITH, P. 1983. Quantitative Plant Ecology, 3rd Ed. Blackwell Scient. Publ., London.
GRUNDY, P.M. 1951. The expected frequencies in a sample of an animal population in which the
abundances of species are lognormally distributed, I. Biometrika 38: 427-434.
GUTTMAN, L. 1950. The principal components of scale analysis. In S.A. Stouffer et al. [ed.]
Measurement and Prediction. Princeton University Press, Princeton, NJ.
GUTTMAN, L. 1968. A general nonmetric technique for finding the smallest coordinate space for a
configuration of points. Psychometrika 33: 469-506.
HAYASHI, C. 1952. On the prediction of phenomena from qualitative data and the quantification of
qualitative data from the mathematico-statistical point of view. Ann. Inst. Statist. Math. 2: 93-
96.
HAYASHI, C. 1954. Multidimensional quantification - with applications to analysis of social
phenomena. Ann. Inst. Stat. Math. 5: 121-143.
HAYASHI, C. 1956. Theory and example of quantification, II. Proc. Inst. Stat. Math. 4: 19-30.
HAYASHI, C. 1974. Minimum dimension analysis MDA. Behaviormetrika 1: 1-24.
HEALY, M.J.R AND H. GOLDSTEIN. 1976. An approach to the scaling of categorised attributes.
Biometrika 63: 219-229.
HEISER, W.J. 1981. Unfolding Analysis of Proximity Data. Ph.D.Thesis, University of Leiden,
Leiden, The Netherlands.
HEISER, W.J. 1985a. Undesired nonlinearities in nonlinear multivariate analysis. In E. Diday et al.
[ed.] Data Analysis and Informatics, IV. North-Holland, Amsterdam, in press.
HEISER, W.J. 1985b. Multidimensional scaling by optimizing goodness-of-fit to a smooth
hypothesis. Internal Report RR-85-07, Dept. of Data Theory, University of Leiden.
HEISER, W.J. 1986. Order invariant unfolding analysis under smoothness restrictions. Internal
Report RR-86-07, Dept. of Data Theory, University of Leiden.
HEISER, W.J. AND J. DE LEEUW. 1979. How to use SMACOF-I (2nd edition). Internal Report,
Dept. of Data Theory, University of Leiden.
HEISER, W.J. AND J. MEULMAN. 1983a. Analyzing rectangular tables by joint and constrained
multidimensional scaling. J. Econometrics 22: 139-167.
HEISER, W.J. AND J. MEULMAN. 1983b. Constrained multidimensional scaling, including confir-
mation. Applied Psych. Meas. 22: 139-167.
HILL, M.O. 1974. Correspondence analysis: a neglected multivariate method. Applied Statistics 23:
340-354.
HILL, M.O. 1977. Use of simple discriminant functions to classify quantitative phytosociological
data, p. 181-199. In E. Diday et al. [ed.] Data Analysis and Informatics, I. INRIA, Le Chesnay,
France.
HILL, M.O., RG.H. BUNCE, AND M.W. SHAW. 1975. Indicator species analysis, a divisive
polythetic method of classification, and its application to a survey of native pinewoods in
Scotland. J. Ecol. 63: 597-613.
HILL, M.O. AND H.G. GAUCH. 1980. Detrended correspondence analysis: an improved ordination
technique. Vegetatio 42: 47-58.
HODSON, F.R et al. [ed.] 1971. Mathematics in the Archaeological and Historical Sciences.
Edinburgh University Press, Edinburgh.
HOVLAND, C.I., O.J. HARVEY, AND M. SHERIF. 1957. Assimilation and contrast effects in re-
actions to communication and attitude change. J. Abnorm. Soc. Psych. 55: 244-252.
HUBERT, L. AND PH. ARABIE. 1986. Unidimensional scaling and combinatorial optimization. In J.
de Leeuw et al. [ed.] Multidimensional Data Analysis. DSWO Press, Leiden (in press).
IGOSHINA, K.N. 1927. Die Pflanzengesellschaften der Alluvionen der Flusse Kama und
Tschussowaja (in Russian with German summary). Trav. de l'Inst. Biol. à l'Univ. de Perm 1:
1-117.
IHM, P. AND H. VAN GROENEWOUD. 1975. A multivariate ordering of vegetation data based on
Gaussian type gradient response curves. J. Ecol. 63: 767-777.
IHM, P. AND H. VAN GROENEWOUD. 1984. Correspondence analysis and Gaussian ordination.
TER BRAAK, C.J.F. AND L.G. BARENDREGT. 1986. Weighted averaging of species indicator
values: its efficiency in environmental calibration. Math. Biosciences 78: 57-72.
THURSTONE, L.L. 1927. A law of comparative judgment. Psych. Rev. 34: 278-286.
VAN RIJCKEVORSEL, J.L.A. 1986. About horseshoes in multiple correspondence analysis, p. 377-
388. In W. Gaul and M. Schader [ed.] Classification as a tool of research. North-Holland,
Amsterdam.
WHITTAKER, R.H. 1948. A vegetation analysis of the Great Smoky Mountains. Ph.D. Thesis,
University of Illinois, Urbana.
WHITTAKER, R.H. 1967. Gradient analysis of vegetation. Biol. Rev. 42: 207-264.
WHITTAKER, R.H. 1978. Ordination of Plant Communities. Dr. W. Junk Publ., The Hague.
WHITTAKER, R.H. AND H.G. GAUCH. 1978. Evaluation of ordination techniques, p. 277-336. In
R.H. Whittaker [ed.] Ordination of Plant Communities. Dr. W. Junk Publ., The Hague.
WILKINSON, E.M. 1971. Archaeological seriation and the travelling salesman problem, p. 276-
283. In F.R Hodson et al. [ed.] Mathematics in the Archaeological and Historical Sciences.
Edinburgh University Press, Edinburgh.
Clustering under a priori models
SOME NON-STANDARD CLUSTERING ALGORITHMS
James C. Bezdek
Computer Science Department
University of South Carolina
Columbia, South Carolina 29208 USA
1. INTRODUCTION
It has been twenty-one years since Zadeh (1965) introduced fuzzy set theory
as a vehicle for the representation and manipulation of non-
statistical uncertainty. Since that time the theory of fuzzy sets and their applica-
tions in various disciplines have often been controversial, usually colorful, and
always interesting (cf. Arbib 1977, Tribus 1979, Lindley 1982). At this writing
there are perhaps 10000 researchers (worldwide) actively pursuing some facet of
the theory or an application; there is an international fuzzy systems society
(IFSA); many national societies (e.g., NAFIPS, IFSA-Japan, IFSA-China, etc.);
and at least five journals (Int. Jo. Fuzzy Sets and Systems, Int. Jo. Man-Machine
Studies, Fuzzy Mathematics (in Chinese), BUSEFAL, and the newly announced
Int. Jo. of Approximate Reasoning) devoted in large part to communications on
fuzzy methodologies. A survey of even one aspect of this immense body of
work is probably already beyond our grasp. The purpose herein is to briefly
characterize the development of fuzzy techniques in cluster analysis, one of the
earliest application areas for fuzzy sets. In view of my previous remarks, it is
clear that many papers which might be important landmarks will be overlooked;
for these oversights (which are, of course, unintentional, and due to my own lim-
ited perspective of the field) I apologize a priori.
Section 2 presents a brief description of pattern recognition systems. Sec-
tion 3 contains an overview of the two axiomatic structures that support most of
the fuzzy clustering methodologies that seem to persist - viz., the fuzzy partition
of a finite data set; and the fuzzy similarity relation between two finite sets of
objects. These two structures are isomorphic in the crisp (i.e., non-fuzzy) case,
but do not readily lend themselves to direct connections in the more general set-
ting. Section 4 is devoted to clustering algorithms designed to produce fuzzy
partitions. Algorithms are grouped into five categories: relational criteria, object
criteria, decompositions, numerical transitive closures, and generalized k-nearest
neighbor rules.
scientific endeavors involve (one or more of) the elements in the definition above
in some form or another. Figure 1 depicts the four major elements in a (numeri-
cal) PRS: data, feature analysis, clustering, and classification. Note especially
that all four components are "interactive"; each affects (and is affected by)
choices in one or more of the other factors.
[Figure 1. The components of a (numerical) pattern recognition system: design data and test data; feature analysis (nomination, extraction, selection, display); cluster analysis (exploration, validity, display), with relational data R as a satellite input; and classifier design (error rates, prediction, display, control).]
First and foremost in our illustration are the data, which we usually assume
to be represented by points in a numerical vector space. To be concrete, let X =
{x1, x2, ..., xn} denote a set of n feature vectors (or objects) xk in feature space
ℝ^s. Thus, xkj ∈ ℝ is the k-th (measured) observation of feature
j, 1 ≤ j ≤ s, 1 ≤ k ≤ n. We assume that xk denotes an (s×1) column vector, while its
transpose (xk') is a (1×s) row vector. X is often presented in the form of an
(s×n) data array, whose columns are the n xk's; and whose rows are (s) n-
vectors in "item space" ℝ^n. The object data in Figure 1 are divided into two
sets: training (or design) data, and test data. Test data are presumably needed for
error rate prediction when a classifier has been designed; design or training data
are used to parametrize the classifier - i.e., find decision functions that subse-
quently label each point x ∈ ℝ^s. The other key components in our PRS are:
Feature Analysis (which includes nomination, extraction, selection, and display);
Cluster Analysis (which includes cluster validity); and Classifier Design (which
includes performance evaluation and error estimates). There are many other
activities that might be variously connected with one or more components of
Figure 1. In the main, however, the modules in Figure 1 accurately represent
the major constituents of a typical PRS. There is one additional component in
Figure 1 that should be mentioned here - viz., the "relational data " module
shown as a satellite to cluster analysis. It may happen that, instead of object
data (X ⊂ ℝ^s), one collects relational data, in the form of an (n×n) numerical
relation matrix R. Data in this form are common, e.g., in numerical taxonomy,
where the item of interest may be relationships between pairs of (implicitly)
defined objects. Thus, rjk, the jk-th element of R, is taken to be the extent to
which the implied objects (xj, xk) in X×X enjoy some relationship, such as similar-
ity, dissimilarity, etc. If we have an object data set X, R is often constructed by
computing {rjk = ρ(xj, xk)}; e.g., if ρ = d is a metric on ℝ^s, then R is a dissimi-
larity relational data matrix. When X is given, all of the elements of Figure 1
apply. When R is given and X is only implicit, however, a much narrower
range of problems is presented. Specifically, clusters of the object set X can be
sought, but feature analysis and classifier design are much vaguer propositions.
On the other hand, the objects that are responsible for R may be anything
(species, models, types, categories), and need not have numerical object
representations as vectors xk E JRs. From this point of view clustering in R
becomes a very general problem!
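The construction of R from object data, as described above, can be sketched in a few lines; the metric d is taken to be Euclidean and the data X are illustrative:

```python
import numpy as np

# Sketch: when object data X are available, a dissimilarity relational
# matrix R can be built as r_jk = d(x_j, x_k) for a metric d (here Euclidean).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])   # n = 3 objects in R^2 (illustrative values)

diff = X[:, None, :] - X[None, :, :]     # pairwise coordinate differences
R = np.sqrt((diff ** 2).sum(axis=-1))    # (n x n) Euclidean distance matrix

assert R.shape == (3, 3)
assert np.allclose(np.diag(R), 0.0)      # d(x, x) = 0
assert np.allclose(R, R.T)               # symmetry
assert np.isclose(R[0, 1], 5.0)          # the 3-4-5 triangle
```

When only R is given and X is implicit, these lines cannot be run in reverse: the objects behind R need not have any vector representation at all, which is what makes clustering in R so general.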
construct a σ-algebra with the usual Boolean operations, so fuzzy sets do not
rest on the same axiomatic premises as probability theory; Goodman (1982) and
others have tried to show that fuzzy sets and random sets (Matheron 1975)
amount to the same thing - so far, this work has been pretty esoteric and incon-
clusive.
Second, let us think of an experiment. Let x be, say, an (unobserved) can
of motor oil, let A be the set of potable liquids, and suppose you have available
two pieces of information: uA(x), the membership of x in A (i.e., a number in
[0,1] which represents the extent to which x is a potable liquid), and pA(x), the
probability that x ∈ A. If you could have either uA(x) or pA(x) - but not both -
and needed to decide whether or not to drink x, which number would you prefer
to have? Now uA(x) and pA(x) might both be, say, 0.35 before observation of
x. But upon discovering that x is indeed motor oil, pA(x) = 0, whereas uA(x)
remains fixed. The point is that uA and pA convey different types and amounts
of information about x. Based on the above arguments, it seems incontrovertible
that fuzzy sets are not somehow masquerading as probability theory.
(Q4). Can't one use a probabilistic model wherever a fuzzy model seems to
apply?
The answer to (Q4) is almost certainly yes! The point is, however, that
mathematical models are devised to portray some physical process, and are
chosen, at least partially, for their natural ability to "represent the action". It is
as unnatural to imagine a fuzzy model of, say, the binomial experiment, as it is
to propose that the extent to which "x is potable" is a matter of chance. Both
rationales have their place; we should use any model that improves our ability to
represent, analyze, predict, and control a process. Indeed, many processes are
well-modeled by a combination of structures. Thus, it seems better to ask "what
is useful?" rather than to ask "what is right?" There is a beautiful diagram of
this situation which was proposed by Blockley et al. (1983) which is repro-
duced as Figure 2 below. The "conjecture" represented by Figure 2 is best
described in their original words:
There are literally dozens of papers that address questions such as (Ql)-
(Q4). Blockley et ru:s paper wi1l1ead one to this literature: since this has been a
bit of a digression from our main task, we pass now to fuzzy partition spaces.
[Figure 2 (after Blockley et al. 1983): probabilistic inference and fuzzy inference,
separated by a transition region, arrayed against an axis labelled "Randomness";
the original graphic is not reproduced.]
0 ≤ u_ik ≤ 1 ∀ i, k ;   (1a)

Σ_i u_ik = 1 ∀ k ;   (1b)

0 < Σ_k u_ik < n ∀ i .   (1c)
It is both natural and convenient to array the values {u_ik} as a (c×n) matrix
U = [u_ik] in the vector space V_cn of all (c×n) real matrices. Upon doing so, we
are able to make these definitions:
M_fcn = {U ∈ V_cn | u_ik satisfies (1a), (1b), (1c) ∀ i, k} ;   (2b)

N_fc = conv(N_c) .   (3b)

N_c is the usual orthonormal basis of R^c; N_fc is its convex hull. N_c (N_fc) are
the crisp (fuzzy) label vectors that comprise each column of U ∈ M_cn (M_fcn).
Figure 3 illustrates these sets graphically, and shows their relationship to [0,1]^c,
the c-fold Cartesian product of [0,1] with itself.
Now imagine the n-fold Cartesian products of the three sets shown in Figure
3. Obviously we have

[N_c]^n ⊂ [N_fc]^n = [conv(N_c)]^n ⊂ [0,1]^{cn} .   (4)
M_cn is not quite [N_c]^n, nor is M_fcn as large as [N_fc]^n, because constraint (1c)
binds columns of U. We want the lower (upper) constraint in (1c) because it
insures that each u_i is non-empty (no u_i is exhaustive), but this desire forces us
to add degenerate c-partitions of X to M_fcn to get the convex hull in (4). Thus,
we relax the constraints at (1c) by putting

L_fcn = {U ∈ V_cn | 0 ≤ Σ_k u_ik ≤ n ∀ i ; (1a), (1b)} .   (5a)
Each column of U ∈ L_fcn is a label vector from N_fc, and the columns are
independent. Consequently, it is appropriate to call L_cn (L_fcn) the crisp (fuzzy)
label matrices of X. And because

M_cn ⊂ L_cn   (6a)

and

M_fcn ⊂ L_fcn ,   (6b)
L_cn (L_fcn) have often been called the crisp (fuzzy) degenerate c-partitions of X.
The effect of this uncoupling is that M_fcn can now be written as a convex hull.
Indeed, it is easy to show that

M_fcn = conv(L_cn) .   (7)

Thus, we need at most n(c−1)+1 vertices of M_fcn (U's in L_cn) for convex
decompositions of U. These results, together with the fact that M_fcn is a con-
vex polytope whose centroid is the unique "fuzziest" or most uncertain state
(namely, U = [1/c]), determine the geometry of M_fcn.
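These constraints are easy to examine numerically. The following minimal sketch (Python/NumPy; the helper name `in_Mfcn` is ours, chosen for illustration) tests a candidate matrix against (1a)-(1c) and confirms that the centroid U = [1/c] qualifies:

```python
import numpy as np

def in_Mfcn(U, tol=1e-9):
    """Test a (c x n) matrix against the fuzzy c-partition constraints:
    entries in [0,1] (1a), columns summing to 1 (1b), and row sums
    strictly between 0 and n (1c), so no cluster is empty or exhaustive."""
    c, n = U.shape
    in_unit_interval = np.all((U >= -tol) & (U <= 1 + tol))          # (1a)
    columns_sum_to_one = np.allclose(U.sum(axis=0), 1.0)             # (1b)
    row_sums = U.sum(axis=1)
    nondegenerate = np.all((row_sums > tol) & (row_sums < n - tol))  # (1c)
    return bool(in_unit_interval and columns_sum_to_one and nondegenerate)

# The centroid U = [1/c] -- the unique "fuzziest" state -- is in M_fcn:
assert in_Mfcn(np.full((3, 5), 1.0 / 3.0))
# A matrix with an all-zero row (an empty cluster) is degenerate:
assert not in_Mfcn(np.array([[1.0, 1.0], [0.0, 0.0]]))
```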
(NSR) in case

0 ≤ r_ij ≤ 1 ∀ i, j ;   (9a)

I_n ≤ R (reflexive) ;   (9b)

R = R^T (symmetric) .   (9c)
When the elements of R are all crisp and (∗ = ∧), properties (9b) - (9d) yield an
equivalence relation (ER) on X. We denote the crisp ER's on X as

E_n = {R ∈ V_nn | (9b), (9c), (9d), (∗ = ∧), r_ij ∈ {0,1} ∀ i, j} .   (11)

M_cn and E_n are isomorphic: crisp clusters or subsets in X define unique (up to
arrangements) equivalence classes in X and vice-versa. More generally,

E_n* = {R ∈ V_nn | (9) holds} .   (12)
There have been many extensions of this notion. Of these, we mention that
(∨Δ) transitivity is equivalent to the property of pseudo-metricity; and (∨∧) to
ultra-metricity. Let d : X×X → [0,1] be defined by d_ij = 1 − r_ij. Then

R ∈ E_{∨Δ} ⟺ d is a pseudo-metric ; and   (14a)
R ∈ E_{∨∧} ⟹ d is an ultra-metric .   (14b)

(14a) shows that (∨Δ) transitivity is essentially equivalent to the triangle ine-
quality. Proofs of (14a) and (14b) are presented in Bezdek and Harris (1978)
and Zadeh (1971), respectively. Another fact derived in Bezdek and Harris
(1978) is that conv(E_n) and E_{∨Δ} are identical at n = 3:

conv(E_n) = E_{∨Δ}   (n = 3) .   (15)
[Figure 4: a chart of imbeddings relating M_cn ↔ E_n; L_cn and E_{∨∧};
conv(L_cn) = M_fcn and conv(E_n); and E_{∨Δ}; the original graphic is not
reproduced.]
Figure 5 depicts graphically the difference between (∨∧) and (∨Δ) transi-
tivity for the relation matrix
R(β) = [ 1    0.8  0.7
         0.8  1    β
         0.7  β    1  ] .
One may check that R(β) ∈ E_{∨∧} ⟺ β = 0.7; whereas R(β) ∈ E_{∨Δ} for all β ∈
[0.5, 0.9]. Thus, (∨∧) transitivity occurs only if β = 0.7, and hence needs maxi-
mal mutual bonding of object 3 to object 2, requiring all "0.7 relatives" to be
shared with intermediate object 1. On the other hand, (∨Δ) transitivity is
achieved by any β in the range [0.5, 0.9], so allows for the least possible bond-
ing. In other words, E_{∨∧} contains "pessimistic" chains, whereas E_{∨Δ} allows
more optimistic alignments.
[Figure 5: linkage diagrams contrasting the (∨∧) and (∨Δ) transitive chains
among the three objects; the original graphic is not reproduced.]
A more practical question: how does one choose a T-norm for (∗)? This
seems to depend on the application at hand. There are several studies that
describe various problems that may arise as a result of, e.g., the discontinuity of
(∧) and (Δ); see Bandler and Kohout (1984) for a nice discussion of the theory.
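The contrast between the two transitivities for R(β) above is easy to verify numerically. In this sketch (Python/NumPy; the function names are ours), Δ denotes the Lukasiewicz T-norm T(a,b) = max(0, a+b−1):

```python
import numpy as np

def is_vstar_transitive(R, tnorm):
    """Check (v*)-transitivity: r_ik >= max_j tnorm(r_ij, r_jk) for all i, k."""
    n = R.shape[0]
    for i in range(n):
        for k in range(n):
            if R[i, k] < max(tnorm(R[i, j], R[j, k]) for j in range(n)) - 1e-12:
                return False
    return True

t_min = lambda a, b: min(a, b)                  # the (^) norm
t_delta = lambda a, b: max(0.0, a + b - 1.0)    # the (delta) Lukasiewicz norm

def R_beta(beta):
    return np.array([[1.0, 0.8, 0.7],
                     [0.8, 1.0, beta],
                     [0.7, beta, 1.0]])

# (v ^) transitivity holds only at beta = 0.7:
assert is_vstar_transitive(R_beta(0.7), t_min)
assert not is_vstar_transitive(R_beta(0.8), t_min)
# (v delta) transitivity holds throughout [0.5, 0.9]:
assert all(is_vstar_transitive(R_beta(b), t_delta) for b in (0.5, 0.7, 0.9))
assert not is_vstar_transitive(R_beta(0.4), t_delta)
```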
4. CLUSTERING ALGORITHMS
The algorithms discussed below are grouped into five categories: Section (4A) is con-
cerned with relational-criterion function methods that produce fuzzy c-partitions
U ∈ M_fcn from R; (4B) considers object-criterion methods that produce
U ∈ M_fcn from X; (4C) contains convex decomposition algorithms for produc-
ing crisp clusters from fuzzy partitions and relations; in (4D) we discuss crisp
clustering based on numerical transitive closures of R; and in (4E) methods based on
generalized nearest neighbor rules are briefly reviewed. We shall indicate which
form the data are in for a particular method by exhibiting them as arguments of
the clustering criterion wherever possible.
in Ruspini (1969), by defining an objective function J_R that seeks U ∈ M_fcn -
fuzzy c-partitions of n objects - given a relational data matrix R. Let
Ruspini assumed that α was a real constant and that R was a dissimilarity meas-
ure; thus, r_jk measured the extent to which the pair of (possibly) implicit objects
(j, k) were in some sense unalike. Consequently, we call J_R a relational cri-
terion (a function of object-pair relationships in R), as opposed to an object cri-
terion (a function of object vectors in X).
Optimal partitions were taken as local minima of J_R over M_fcn. Iterative
optimization was used to estimate local solutions; Ruspini (1969) contains
several examples of this technique. Minimizing JR was cumbersome, slow, and
solutions were hard to interpret because JR does not measure an obvious pro-
perty of "good" clusters in X. These objections aside, the method was impor-
tant because it was the first fuzzy objective function method, and it paved the
way for further research.
Surprisingly enough, there have been very few fuzzy relational criterion
algorithms since Ruspini published his seminal work. It is not clear whether this
is because most researchers usually acquire object data (X ⊂ R^s) rather than
relational data (R ∈ V_nn); or what seems more likely, that it is very difficult to
where U is in M_fcn; and W is a (c×n) matrix with entries w_ij ∈ [0,1] and
row sums Σ_j w_ij = 1. We call M'_fcn the set of such matrices. In the
Assignment-Prototype (AP) algorithm U is the desired partition on n possibly
implicit objects (X); W is a set of "prototype weights"; and r_jk is again a dis-
similarity measure. The interpretation provided for J_AP follows from a crisp
special case; viz., when U and W are hard. In this case, U has c hard clus-
ters {u_i}, and one imagines that each u_i contains a "prototype" (albeit implicit),
say x_{l_i}, which is pointed to by W; w_ij = 1 iff x_j = x_{l_i}, and is zero otherwise.
Letting r_{k,l_i} have its obvious meaning, we can rewrite J_AP for this special case as
J_AP(U, W) = Σ_{i=1}^{c} Σ_{x_k ∈ u_i} r_{k,l_i} ,   (19)
which sums all dissimilarities of points within u_i to its most prototypical object.
Good partitions are taken as local minima of (19), the optimization extending
over M_fcn × M'_fcn. Windham presents necessary conditions, discusses convergence,
convergence rates, storage, and initialization; and illustrates the algorithm
with the IRIS data and an artificial data set devised to illustrate the shortcomings
     A   B   C   D   E   F   G   H   I   J   K
A    0   6   3   6  11  25  44  72  69  72 100
B        0   3  11   6  14  28  56  47  44  72
C            0   3   3  11  25  47  44  47  69
D                0   6  14  28  44  47  56  72
E                    0   3  11  28  25  28  44
F                        0   3  14  11  14  25
G                            0   6   3   6  11
H                                0   3  11   6
I                                    0   3   3
J                                        0   6
K                                            0

      A    B    C    D    E    F    G    H    I    J    K
u1  .92  .90  .95  .90  .86  .50  .14  .10  .05  .10  .08
u2  .08  .10  .05  .10  .14  .50  .86  .90  .95  .90  .92
J_RCM(U; R) = Σ_{i=1}^{c} [ Σ_{j=1}^{n} Σ_{k=1}^{n} (u_ij u_ik)^m r_jk / (2 Σ_{k=1}^{n} (u_ik)^m) ] ,   (20)

where U ∈ M_fcn and R ∈ V_nn is unconstrained. Were it not for the denominator
in (20), J_RCM would, except for the exponent on (u_ik u_ij), be J_RB. Unlike
Windham's (AP) method, no direct theoretical conditions are known yet for local
optima of J_RCM. However, (20) can be iteratively minimized using a variation
of the method of coordinate descent described at length in Bezdek, Hathaway,
Howard, Wilson and Windham (1986). A glance at (20) hardly suggests a ready
interpretation of the property "good" U's have when derived as local minima of
J_RCM. There is a nice interpretation of this algorithm, but it depends on under-
standing a related object-criterion algorithm called fuzzy c-means (FCM). We
shall return to (20) after describing (FCM) in some detail below.
d²_ikA = ||x_k − v_i||²_A = (x_k − v_i)^T A (x_k − v_i) .   (22)
Next, let V_ri(v_i; b_i1, b_i2, ..., b_ir) be the linear variety in R^s of dimension r,
1 ≤ r ≤ s, through the point v_i ∈ R^s and spanned by the independent vec-
tors {b_ij}.
The acronyms stand for fuzzy c-means (FCM), fuzzy c-lines (FCL), etc. When
the vectors {b_ij} are orthonormal, the projection theorem enables us to calculate
the squared (A-orthogonal) distance from a point x_k ∈ R^s to
V_ri, 1 ≤ i ≤ c, as

D²_ikA = ||x_k − v_i||²_A − Σ_{j=1}^{r} ⟨x_k − v_i, b_ij⟩²_A .   (24)

Note that (24) reduces to (22) for r = 0. With (24) we define the fuzzy c-
varieties (FCV) object-criterion function

J_FCVm(U, V; X) = Σ_{k=1}^{n} Σ_{i=1}^{c} (u_ik)^m D²_ikA ,   1 ≤ m < ∞ .   (25)
b_ij = A^{−1/2} p_ij , where {p_ij} are the first (r)   (26c)
principal eigenvectors of S_i ;

J_FCEm(U, V; X) = μ J_FCLm(U, V_1; X) + (1 − μ) J_FCMm(U, V_0; X) .   (27)
In (27) the arguments of the three functions, from the left, are (U, V; X),
(U, V_1; X) and (U, V_0; X) respectively, where V_1 = (v_11, v_12, ..., v_1c) are
(c) lines in R^s; and V_0 = (v_01, v_02, ..., v_0c) are (c) points in R^s. The symbol V
represents (c) sets in R^s which are neither lines nor planes, but curved surfaces
that were called "elliptotypes" in Bezdek et al. (1981b). The parameter μ in
some sense controls the "degree of curvature" of the fitting surfaces. It is a
remarkable fact that v_0i is exactly the point which translates V_1i away from the
origin in R^s. That is, for any functional like (27), one need only span the
variety of highest dimension, say r_i, with the (r_i) vectors from (26c); the lower
dimensional varieties will always be spanned by subsets of this set. The only
change in (26) needed for (27) is that D²_ikA must be replaced by the "general-
ized distance"

D²_ikA(μ) = ||x_k − v_i||²_A − μ Σ_{j=1}^{r} ⟨x_k − v_i, b_ij⟩²_A .   (28)
Figure 7 depicts the geometry of the ik-th term of J_FCEm, which can be written
in terms of the slant distance z in that illustration as

(29)

If μ = 0, (FCE) = (FCM) and hence assesses central tendencies, i.e., the propen-
sity for structure in X to cluster in hyperellipsoids (shaped by A) about the
points {v_0i}. When μ = 1, (FCE) = (FCL), so linearity of substructure is
emphasized.
By far the best known and well studied special case of the (FCV) families is
(FCM), obtained by setting r = 0 in (25) or μ = 0 in (27). In this case the fitting
varieties become points {v_0i} = {v_i} ⊂ R^s, which are typically thought of as
(c) "prototypes" of the (n) x_k's in X. Equations (25) and (26) take the simpler
forms, with v = (v_1, v_2, ..., v_c):

J_m(U, v; X) = Σ_k Σ_i (u_ik)^m d²_ikA ;   (30)

v_i = Σ_k (u_ik)^m x_k / Σ_k (u_ik)^m ;   (31a)

u_ik = ( Σ_j (d_ikA / d_jkA)^{2/(m−1)} )^{−1} .   (31b)

Finally, when A = I_s, the identity on R^s, and m = 1, (30) and (31) become,
respectively, the familiar, conventional least squared errors or minimum variance
object criterion with Euclidean distance, and the Basic ISODATA or hard c-
means procedure, which has been extensively studied and used by virtually hun-
dreds of investigators.
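For concreteness, the alternating optimization of (30) via (31a) and (31b) can be sketched as below (Python/NumPy; a minimal illustration with A = I, not the original authors' software). The exit rule stops when the largest membership change falls below a tolerance:

```python
import numpy as np

def fcm(X, c, m=2.0, tol=0.01, max_iter=100, seed=0):
    """Alternating optimization of J_m: update centers via (31a), then
    memberships via (31b), until max |u_ik^(p+1) - u_ik^(p)| < tol.
    X is (n, s) object data; returns (U, V) with U a (c, n) partition."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                      # enforce (1b): columns sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)              # (31a)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)  # A = I
        d2 = np.fmax(d2, 1e-12)             # guard against x_k == v_i
        U_new = d2 ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0)          # normalized form of (31b)
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V

# two well-separated groups are recovered with near-crisp memberships
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
U, V = fcm(X, c=2)
labels = U.argmax(axis=0)
assert labels[0] == labels[1] == labels[2] != labels[3] == labels[4] == labels[5]
```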
There are at present perhaps seventy papers that concern themselves with
some aspect of the theory or an application of (FCM). It would be impossible to
review all of these here. Rather, interested readers are directed towards some of
these (FCM) papers by categories as follows:
Theoretical Aspects
Bezdek (1976b)
Bezdek (1980)
Gunderson (1983)
Selim and Ismail (1984)
Ismail and Selim (1986)
Cannon, Dave and Bezdek (1986)
Medical Data
Bezdek (1976a)
Bezdek and Fordon (1978)
Geological Data
Nutritional Data
Engineering Systems
Image Processing
Classifier Design
Miscellaneous
In most of the application papers (FCM) outputs have been compared to one or
more clustering algorithms which are based on statistical, deterministic or heuris-
tic techniques. We discuss but one example in somewhat greater detail (the one
I found closest to numerical ecology!). McBratney and Moore (1985) report on
the usage of FCM (with A = Is, m=2) to cluster two sets of climatic data from
Australia and China; and compare their results to more classical approaches
taken by previous meteorologists. They argue that the continuity of climate
demands the flexibility of continuous assignments to various classes, so fuzzy c-
partitions of climatic data are a very plausible model. Their paper concludes by
itemizing three advantages fuzzy classifications appear to have over their crisp
predecessors:
1. Fuzzy partitions are (physically) more realistic;
2. Fuzzy partitions are more flexible; and
3. Fuzzy partitions provide better information transfer.
There is also a very interesting side issue discussed by McBratney and Moore,
namely, how one chooses values for (c) and (m) in (30). The authors discuss
and illustrate a method for choosing optimal combinations of (c ,m) jointly by
inspecting plots of (m) versus (−dJ_m/dm). This cluster validity functional
recognizes explicitly the joint dependency of "good" solutions on (c) and
(m) - a new and apparently useful idea.
To illustrate the use of the FCM clustering algorithm in the context of
numerical ecology, we processed a set of data provided by Prof. Pierre Legendre
which consists of population counts of 88 species of Polychaetes (marine
worms) which were gathered at 5 different stations at 4 different times. These
data have been previously analyzed by Fresi, Colognola, Gambi, Giangrande,
and Scardi (1983), and will be further analyzed by other authors in this volume.
This writer makes no pretense at understanding the biological intricacies of the
data, so our remarks below are offered in the true spirit of exploratory data
analysis.
First, we array the data as an 88 × 20 matrix, say X = [x_ij], with each row
x_i = (x_i,1, x_i,2, ..., x_i,20) ∈ R^20 being a mixed time/space vector of observations
on species i, 1 ≤ i ≤ 88. More specifically, each x_i has coordinates
arrayed sequentially as follows:
x_i,1  to x_i,4  : species i; station 1; times 1,2,3,4
x_i,5  to x_i,8  : species i; station 2; times 1,2,3,4
x_i,9  to x_i,12 : species i; station 3; times 1,2,3,4
x_i,13 to x_i,16 : species i; station 4; times 1,2,3,4
x_i,17 to x_i,20 : species i; station 5; times 1,2,3,4
The species and data are arrayed this way in Fresi et al. (1983). We "com-
pleted" the array by filling with zeroes. It seems reasonable to cluster subsets
of this data in various ways; for example, across stations at each fixed time, or
across times at each fixed station, etc. However, an extensive analysis of the
data is left to a future investigation. In order to save space, we present below
only the results of clustering the 88 species simultaneously over all 20 variables.
The expected result of processing the data this way is an overall "coarse" clus-
tering of species - if one exists - over all times and stations. Subsequent proces-
sing of time-constrained and/or space-constrained subsets of X would then
yield a more detailed breakdown of possible substructures across space and time.
Computing protocols for the outputs in Tables 2-5 are as follows (refer to
equations (30) and (31)): m = 2.00; A = I, the identity matrix for R^20; loop-
ing through equations (31) was terminated when the maximum absolute
difference between the p-th and (p+1)-st estimate of U was less than 0.01
(i.e., max_{i,k} |u^(p+1)_ik − u^(p)_ik| ≤ 0.01). The number of clusters was
considered unknown. Part of the results of clustering these data with FCM as
described appear in Table 2 for c = 2 and c = 5. Specifically, Table 2 con-
tains the fuzzy membership matrices found by FCM when the exit condition was
satisfied. Space prohibits the exhibition of U_fcm for c = 3 and 4; however, we
discuss the outputs for each of these cases below. Our discussion makes use of
the idea of α-cuts of U, which are covered more fully in Section 4.C. Briefly,
U_α is a hard partial labeling of X for 0 ≤ α ≤ 1 derived from any U ∈ M_fcn
whenever we replace each column of U by the vertex e_i in N_c such that
u_ik ≥ α. Note that some columns of U may not have a row with u_ik ≥ α, in
which case the k-th column of U_α is a column of zeroes. Thus, U_α is not
necessarily in L_cn, much less M_cn. Below, we fix α = 0.85, which, practically
speaking, is a very strong membership threshold.
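The α-cut construction just described can be sketched as follows (Python/NumPy; `alpha_cut` is our name for the operation, and we assume α > 0.5 so at most one row can qualify per column):

```python
import numpy as np

def alpha_cut(U, alpha):
    """Hard partial labeling U_alpha: replace column k of U by the crisp
    vertex e_i whenever u_ik >= alpha; otherwise the column becomes zeroes.
    Assumes alpha > 0.5, so the qualifying row (if any) is the column argmax."""
    Ua = np.zeros_like(U)
    for k in range(U.shape[1]):
        i = U[:, k].argmax()
        if U[i, k] >= alpha:
            Ua[i, k] = 1.0
    return Ua

U = np.array([[0.92, 0.50, 0.10],
              [0.08, 0.50, 0.90]])
# at alpha = 0.85, the middle (maximally fuzzy) column receives no label:
print(alpha_cut(U, 0.85))   # [[1. 0. 0.]
                            #  [0. 0. 1.]]
```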
Aside: Although we convert the fuzzy labels in Table 2 to hard ones via U_α for
this discussion, this is somewhat contrary to the whole point of using fuzzy
memberships - if possible, one prefers to leave the results in the form shown in
Table 2. Our discussion begins with the case c = 2.
Table 2. FCM/Cluster Memberships of each Species for c=2, c=5.
across the three clusters at various (lower) levels of distribution. Indeed, the
highest membership in the new cluster over all 88 species is #27 (Exogone gem-
mifera Pagenstecher), at 0.61. This suggests that the new cluster is much less
distinct and perhaps less well justified. The third cluster has one species in
U_0.85: #71 at 0.88.

c = 4:  Something very interesting happens: only two additional species drop
out of U_0.85! Apparently the main cluster is quite stable for the 60 species now
identified as belonging together at c = 4, α = 0.85. Of the remaining 28
species, only #71 still remains, with a membership of 0.97, in (essentially) its
own cluster.
c = 5:  The last five columns of Table 2 contain the membership matrix for
X at c = 5. There is essentially no change in the main cluster! All 60 species
which appeared in U_0.85 at c = 4 remain there at c = 5. This is quite remark-
able, suggesting an extremely stable core of species in the main cluster (column
6 in Table 2). Moreover, 51 of these 60 species still have membership ≥ 0.99
in this cluster. Note that species 71 now has membership of 0.99 (in column 4);
and that species 29 has established a cluster via the membership 0.90 in column
7. The maximum membership in column 5 is 0.63 (species 61), and in column
8, it is 0.46 (species 27). Thus, a total of 62 of the 88 species are in U_0.85 at
c = 5; and the remaining 26 species have memberships that are - in the main -
distributed across two fuzzy clusters (columns 5 and 8) that are relatively
inseparable.
It is interesting to track the memberships of species 24, 29, 35 and 71, the 4
species not in the main group at c = 2, as c increases from 2 to 5. Table 3
exhibits these memberships. The boldface numbers in Table 3 are the maximum
memberships at each c.
Table 3. Memberships of species 24, 29, 35 and 71 as c varies from 2 to 5.

         c=2          c=3               c=4                    c=5
24   .84 .16    .60 .18 .22    .34 .21 .19 .26    .26 .18 .16 .21 .19
29   .33 .67    .14 .35 .51    .03 .10 .10 .78    .01 .03 .03 .90 .03
35   .26 .74    .10 .35 .55    .03 .13 .12 .72    .04 .17 .16 .44 .19
71   .86 .14    .87 .06 .07    .97 .01 .01 .01    .99 .00 .00 .01 .00
Note that species 24 begins strongly distinct from the main cluster with max-
imum membership 0.84 at c = 2; and then its maximum membership decreases
monotonically with c. At c = 5 this species nearly has memberships
(1/c = 0.20), which is the fuzziest possible state at c = 5. At the other
extreme, the maximum membership of species 71 shows a steady upwards pro-
gression from 0.86 (c = 2) to 0.99 (c = 5); this suggests that species 71
"wants" its own cluster - it has very distinct membership at c = 5, as is evi-
dent in column 4 of Table 2. Note also that species 29 works its way non-
monotonically up to 0.90 at c = 5, thereby demanding a cluster, while species
35 seems unsure of itself, much as species 24. The behavior of memberships as
(c) varies is one of the keys to cluster validity; these numbers can be used - in a
very qualitative way - to evaluate the relative attractiveness of various numbers
of clusters.
Finally, Table 4 exhibits the (truncated) cluster centers {v_i} associated
with the matrices U in Table 2 at c = 2 and c = 5. To interpret the values con-
textually, we must truncate (or round) the v_ij's so that they are integers; subse-
quently, each v_ij may be taken as a non-statistical estimate of the population
count to be expected at each time and station. For example, Table 5 lists the
(truncated) cluster center v_1, which is essentially composed of "99 percent
of" species 71, contaminated, if we may, by "26 percent of" species 24 (cf.
Table 2), and very little else. And next to v_1 is the data for species 71 (row 71
of X):
            c=2                    c=5
Coord.    v1    v2    v1    v2    v3    v4    v5
1 2 0 2 0 0 5 0
2 43 10 2 16 2 233 29
3 53 5 28 9 1 184 15
4 34 4 13 7 1 99 12
5 47 2 60 4 0 70 6
6 517 5 697 12 1 159 25
7 78 2 125 4 0 47 8
8 399 3 508 7 0 122 19
9 10 0 5 2 0 2 6
10 138 6 27 24 1 19 54
11 105 2 19 9 0 9 21
12 14 0 2 0 0 1 2
13 21 2 3 7 0 2 11
14 73 9 11 53 1 10 43
15 521 1 8 7 0 4 11
16 161 2 25 14 0 15 27
17 27 4 4 12 0 20 28
18 7 3 1 14 0 3 12
19 43 2 6 8 0 4 18
20 0 0 0 0 0 0 0
  v1   species 71   time   station
   2        9         1       1
   2        0         2       1
  28       26         3       1
  13       11         4       1
  60       63         1       2
 697      718         2       2
 125      132         3       2
 508      520         4       2
   5        4         1       3
  27        6         2       3
  19        3         3       3
   2        0         4       3
   3        0         1       4
  11        0         2       4
   8        0         3       4
  25        0         4       4
   4        0         1       5
   1        0         2       5
   6        0         3       5
   0        0         4       5
It is clear from Table 5 that v_1 is dominated by the occurrence of species 71 at
station 2. Further, this ostensibly explains the reason for species 71 wanting
"its own" cluster. Note how nicely the memberships in U (Table 2) mirror this
fact. It is also clear from the listing in Table 5 that v_1 is not a particularly
effective predictor of population count (nor do we expect it to be) - nonetheless,
one hopefully gains an understanding of the role played by the cluster centers in
FCM by studying this example. Note from Table 4 that the cluster center of the
main cluster (labelled v_2 at c = 2, v_3 at c = 5) is very close to the origin (of
R^20). Apparently this cluster characterizes those species that are found only
rarely in space and time.
So, what has been learned about X? I would guess that the 60 species
clustered together in Table 2 have some physical, chemical or biological relation-
ship that separates them from the other 26 (as previously suggested, perhaps
their main similarity is rarity). Moreover, that species 29 and 71 are somehow
quite distinct, both from the aforementioned group of 60, as well as the remain-
ing 24 marine worms. I expect to hear from marine ecologists about my conjec-
tures, right or wrong! In any case, I hope this example illustrates the main
strengths (and weaknesses) of clustering with FCM.
Beyond the direct use of the (FCM) clustering algorithms as presented
above, there have been many variations and extensions which are designed to
accommodate some feature of the data being studied. For example, Gustafson
and Kessel (1978) also recognized the need to allow each cluster in X to seek
different hyperellipsoidal shapes, and suggested the functional

J_GKm(U, A, v; X) = Σ_k Σ_i (u_ik)^m d²_ikA_i   (32)

as a means for accommodating this problem. In (32) the variable
A = (A_1, A_2, ..., A_c) is a set of (c) (s×s) positive definite matrices; distances to
points in each cluster are measured with different norms. Since the eigenstruc-
ture of A_i determines the hyperellipsoidal shape of clusters that match the i-th
term of J_GK well, local minima of J_GK possess the desired property: each cluster
may have a different (hyperellipsoidal) shape. Necessary conditions (31a) and
(31b) are augmented with

A_i = (ρ_i det(S_i))^{1/s} (S_i)^{−1} , where   (33a)

S_i is the matrix at (26b), and   (33b)

det A_i = ρ_i for all i ,   (33c)
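The update (33a) can be sketched numerically as below (Python/NumPy; the function name is ours, and ρ is treated as a user-supplied constant). Note that the fixed-volume constraint (33c) holds by construction:

```python
import numpy as np

def gk_norm(S, rho=1.0):
    """Gustafson-Kessel norm inducer A_i = (rho det S_i)^(1/s) S_i^{-1}  (33a).
    By construction det(A_i) = rho, the fixed-volume constraint (33c)."""
    s = S.shape[0]
    return (rho * np.linalg.det(S)) ** (1.0 / s) * np.linalg.inv(S)

S = np.array([[2.0, 0.0],
              [0.0, 0.5]])      # an illustrative (s x s) fuzzy scatter matrix
A = gk_norm(S, rho=1.0)
assert np.isclose(np.linalg.det(A), 1.0)   # (33c) holds
```

The scatter matrix here is a stand-in for S_i of (26b); any positive definite S gives det(A) = ρ.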
imposed on the model, viz., finding local shape norms that adjust themselves to
local substructures. There is probably a combination of algorithmic parameters
(μ, m, A) for (FCE) that provides (roughly) the same solutions as (m, A) do for
many data sets. The point to be made is that both J_FCEm and J_GKm seek
"locally" hyperellipsoidal substructure by varying the norm-induced topology
from cluster to cluster; the former fixes one A and varies each shape by stretch-
ing in direction (b_i) with "strength" μ; while the latter alters all (s) directions
in each cluster via different A_i's, but with fixed volume. Thus, local shapes
with J_GK can be much more diverse (from each other) than with J_FCE, but must
all occupy the same volume. Figure 8 illustrates these differences graphically.
The difficulty with all this is, of course, that one cannot know, for s > 3, whether
X contains this sort of structure.
in place of (31b) - recall that A = I_s and m = 2. Pedrycz also presents two gen-
eralizations of (34), one involving weights for the two terms of, respectively,
(1/t) and (1/q); and secondly, localized Mahalanobis-like distances induced by
replacing A = I_s with (c) matrices (C_i)^{−1} which combine the information in W
with the S_i's in (26b). The methodology is illustrated with two sets of data:
Gustafson and Kessel's cross (Gustafson and Kessel 1978); and a set of EKG
data. This is a very interesting extension of FCM because it integrates local
shape modifications (like J_GK) with previous information (W). This area will
experience further growth.
Yet another avenue of variation from the basic (FCV) methodology is
represented by the (RCM) algorithm described briefly in Section 4A. Indeed,
substitution of (26a) into (22) with A = I_s reduces (30) to (20) when r_kj is the
squared distance between x_k and x_j in X. Consequently, the relational criterion
J_RCM is, for a special choice of R, equivalent to J_FCM. In this case one can in
principle obtain the same U ∈ M_fcn by minimizing either J_FCM(U, v; X) or
J_RCM(U; R) as long as R = [r_jk] = [|x_j − x_k|²]. The point of (20) is, of course,
that J_RCM is well-defined and can be used for any R, not just [|x_k − x_j|²]; in
this more general situation, U's gotten by (RCM) may be interpreted as parti-
tions that might have been found by (FCM) if X had been available and con-
verted to R as above.
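The conversion mentioned here is simply the matrix of squared pairwise distances; a one-function sketch (Python/NumPy, our naming):

```python
import numpy as np

def to_relational(X):
    """R = [ |x_j - x_k|^2 ], the relational data under which minimizing
    J_RCM(U; R) can in principle recover the same U as J_FCM(U, v; X)."""
    diff = X[:, None, :] - X[None, :, :]   # (n, n, s) array of differences
    return (diff ** 2).sum(axis=-1)        # squared Euclidean distances

X = np.array([[0.0, 0.0], [3.0, 4.0]])
R = to_relational(X)
assert R[0, 1] == R[1, 0] == 25.0 and R[0, 0] == 0.0
```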
There are literally dozens of other papers that deal with object-criterion
fuzzy partitioning algorithms. I will conclude this section by pointing interested
readers towards (some) of these: Backer (1978), Bock (1984), Diday and Simon
(1976), Roubens (1982), and Kent and Mardia (1986). The last reference per-
tains to an interesting connection between statistical and fuzzy methodologies.
There are, of course, entire families of algorithms which generate matrices
U ∈ M_fcn that are not interpreted as fuzzy c-partitions of X. Specifically,
parametric estimation techniques such as the method of maximum likelihood to
decompose mixtures of probability density functions generate a matrix
P = [p_ik] ∈ M_fcn, where p_ik is the posterior probability that x_k came from
class i given x_k. Columns of P are label vectors in N_fc which advocates of sta-
tistical decision theory would call "probabilistic" labels for X. This obviously
yields the same sort of outputs for X that fuzzy clustering does; the difference
lies in one's belief about the data: are they really drawn from a statistical mix-
ture? There are hundreds of papers about this technique; interested readers will
get an excellent start in this direction with Everitt and Hand (1981), or the recent
survey by Redner and Walker (1984). Another school of thought not
represented here that is very active in generating "probabilistic" P's ∈ M_fcn is
the methodology of relaxation labelling, which has been vigorously pursued by
Rosenfeld and his students. For an introduction to this area see Peleg (1981).
"throwing away" some of the very information that fuzzy models presume to
capture. In this section we discuss convex decompositions of U and R. A fun-
damental distinction to be made at the outset is that (7) and (8) make this possi-
ble for U E Mfen produced by any method whatsoever; whereas no algorithms
exist that yield R E conv(En ). We begin with convex decompositions of U.
Equations (7) and (8) guarantee that each U E Mfen has at least one convex
decomposition into (n (c -1)+ 1) (possibly degenerate) U' s in Len. To see that
we must use Len instead of Men note that
~a·
~ I
= 1·
'
(38b)
(38c)
U_1 = [ 0 1 1 0 0
        1 0 0 1 1 ] = U_mm = U_α (α = 0.60) .
It turns out that U_mm is always the dominant term for both the (MM) and (R)
algorithms. The interesting aspect of decomposition of U as opposed to thres-
holding on U lies with the "remainders." Simply discarding the information in
U not preserved by thresholding seems to offset the advantages of using M_fcn in
the first place. Decompositions, on the other hand, may provide added insights
about object substructure that are otherwise lost. In Table 6, for example, the
(MM), (R) and (F) decompositions all have for their second term the crisp parti-
tion

U_2 = [ 1 1 1 1 0
        0 0 0 0 1 ] , i.e.,

{1,2,3,4} ∪ {5}. Thus, (U_1, U_2) account for 80 (70) percent of the member-
ship in U via MM (R or F); this affords investigators a very different sub-
structural interpretation than that provided by thresholding on U ∈ M_fcn. Note
that both the number of and the specific U_i's in the terms after U_2 in Table 6 are
quite different for the three decompositions; and that (MM) and (R) have only
U_i's in M_cn, whereas U_6 for the (F) decomposition is in L_cn (degenerate). The
last term in (F) suggests that there is some slight (.05) possibility that all (5)
U_i               MM     R     F

[0 0 0 1 0]
[1 1 1 0 1]      .10    **    **

[0 0 1 1 1]
[1 1 0 0 0]       **   .05    **

[1 0 0 0 0]
[0 1 1 1 1]       **    **   .10

[0 0 0 0 1]
[1 1 1 1 0]       **    **   .05

[0 0 0 0 0]
[1 1 1 1 1]       **    **   .05
objects be grouped together (c = 1), whereas (MM) and (R) yield successively
less attractive possibilities at (c = 2). It is shown in Bezdek and Harris (1979)
that the coefficient vector (a_1, ..., a_q) for (MM) is lexicographically larger than
any other convex decomposition, i.e., (MM) coefficients will always account for
the largest percentage of U in the same number of terms. Conjectured there is
that (MM) decomposition also is the shortest (minimal length) decomposition,
and is always non-degenerate (U_i ∈ M_cn ∀ i). An example in Bezdek and
Harris shows that the maximum membership matrix (U_mm), which here appears
as the dominant term in all three decompositions of U exhibited in Table 6, may
in fact not even appear in Σ a_i U_i. However, no decomposition produces a
larger (a_1) than (MM). Thus, the crisp equivalence relation R in E_n isomorphic
to U_mm presumably implicates a_1 as the maximal bonding strength enjoyed by
objects that are partitioned by U.
Questions about the method of convex decomposition in M_fcn abound;
uniqueness, minimality, relation to E_n and conv(E_n), physical interpretation of
the {a_i}; all are good research topics. Furthermore, there are many algorithms
that convert object data (X) or relational data (R) into U ∈ M_fcn, so this method,
which provides a very different means of interpreting U than thresholding,
deserves further study.
The intent of Figure 4 was to illustrate that M_fcn cannot be easily (if at all!)
identified with any of the imbeddings of E_n shown in the chart. Given a fuzzy
similarity relation R in the hierarchy E_{∨∧} ⊂ conv(E_n) ⊂ E_{∨Δ}, e.g., how shall
we proceed to interpret R in terms of crisp clusters on n objects? When R ∈
conv(E_n), one may proceed as above, to seek convex decompositions of R of the
form

R = Σ_i c_i R_i , R_i ∈ E_n ∀ i ,   (40a)

with c_i ≥ 0 ∀ i and Σ_i c_i = 1 .   (40b)
For example (upper triangles shown),

          1 0 0 0          1 0 1 0          1 1 1 0
R = .40     1 1 0   + .30    1 0 0   + .30    1 1 0
              1 0              1 0              1 0
                1                1                1

where

       1  .3  .6   0
R =        1  .7   0
               1   0
                   1
There are at least three important differences between the convex decompo-
sitions of U E Men and R E conv (En) shown above:
satisfy (9a)-(9c). For such an R, the (∨ *) transitive closure C*(R) is cal-
culated as follows:

   C*(R) = R ∨ R^(k-1) , k = 2, ..., n , where   (42a)

   R² = R (∨ *) R as in (9d).                    (42b)

In (42) (*) may be any T-norm; in what follows we discuss only the
T1, T2, and T3 norms exhibited at (10). Zadeh (1971) proved that for T2 and T3
(42) indeed terminated in at most (n - 1) steps; it is easy to see that the same is
true for any (T = *) that is bounded above by T3 (in particular, the (∨ Δ) transi-
tive closure CΔ(R) of R can be constructed with T1 this way).
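Assuming the max-min composition (* = ∧), the construction in (42) can be sketched in a few lines of pure Python; the function names and the 4-object example relation are illustrative, not taken from the papers cited.

```python
def max_min_compose(a, b):
    """(A o B)[i][j] = max_k min(a[i][k], b[k][j]) -- the (v, ^) product."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(r):
    """Iterate C <- C v (C o C) until stable; at most n - 1 effective steps."""
    c = [row[:] for row in r]
    while True:
        cc = max_min_compose(c, c)
        nxt = [[max(x, y) for x, y in zip(ci, cci)] for ci, cci in zip(c, cc)]
        if nxt == c:
            return c
        c = nxt

def beta_cut(c, beta):
    """Crisp relation retaining the pairs related at level >= beta."""
    return [[1 if v >= beta else 0 for v in row] for row in c]

# A reflexive, symmetric fuzzy similarity relation on 4 objects
r = [[1.0, 0.3, 0.6, 0.0],
     [0.3, 1.0, 0.7, 0.0],
     [0.6, 0.7, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
c = transitive_closure(r)   # c[0][1] rises from .3 to .6 via the path 1-3-2
```

Cutting the closure at successive levels then yields the nested crisp partitions discussed below.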
The construction of C*(R) via (42) is not very efficient; using matrix mul-
tiplication as in (42b) is O(n⁴). Kandel and Yelowitz (1974) presented an O(n³)
generalization of Warshall's algorithm for C*(R). Both algorithms were dis-
cussed for T3 (* = ∧); the complexity is unchanged for T2 and T1. Equations
(42) continue to appear in reported applications, probably because users do not
have data for which n is large, and also because computer speed seems to
increase faster than our ability to utilize it. Dunn's (1974) paper gives us two
things: an even more economical (O(n²)) method for constructing C*(R); and
much more importantly, a proof that the method to be described below is none
other than the well-known single linkage algorithm when (* = ∧). In order to
appreciate this, we next describe the method itself.
Zadeh (1971) established that every β-cut of C∧(R) yields a hard
equivalence relation, say R∧β ∈ En. Because En ≅ Mcn, R∧β induces a unique
crisp partition U∧β on the n objects represented by R. Consequently, one may
generate a nested sequence of crisp object clusters by taking β-cuts of C∧ at
β's separating each pair of distinct elements. Specifically,

   β1 > β2 ⇒ R∧β1 ⊂ R∧β2 ,   (43)

which leads to the sequence {R∧β} ⇔ {U∧β}, and hence to nested crisp clus-
ters. We shall discover below that when (* ≠ ∧) the same clusters are gen-
erated, but not always sequentially.
β = .29:   1 1 1 0    1 1 1 0    1 1 1 0
             1 1 0      1 1 0      1 1 0
               1 0        1 0        1 0
                 1          1          1

β = .31:   1 1 1 0    1 1 1 0    1 0 1 0
             1 1 0      1 1 0      1 1 0
               1 0        1 0        1 0
                 1          1          1

β = .43:   1 1 1 0    1 0 1 0    1 0 1 0
             1 1 0      1 1 0      1 1 0
               1 0        1 0        1 0
                 1          1          1

(upper triangles of the three β-cut relations at each level)
Ci = columns of Ud for indices in Ii = [ci1, ci2, ..., cik], and   (44c)
Readers interested in further discussion along these lines may begin with Keller
and Givens (1985), Keller, Gray and Givens (1985), Jozwik (1983), and Duin
(1982).
5. CONCLUSIONS
There are, of course, many fuzzy clustering algorithms that have not been
reviewed above. Some are ostensibly quite interesting and useful -- others seem
preposterous! On the other hand, any scheme that really solves a problem or
provides useful insights to data deserves a place in the literature. I hope that the
above review constitutes at least a glimpse of the major structures and clustering
models now being pursued by the "Fuzzy sets" community.
Perhaps the best single piece of advice that can be given to potential users
of (any) clustering algorithm is this: try two or three different algorithms on your
data. If the results are stable, interpretation of the data using these results gains
credibility; but widely disparate results suggest one of two other possibilities:
either the data has no cluster substructure, or the algorithms tried so far are not
well matched to existent but as yet undetected substructure. The algorithms
described above have enjoyed varying degrees of success with a wide cross sec-
tion of data types. There is every reason to expect that in some cases clusters
obtained using, e.g., FCM, with ecological data will provide very serviceable
interpretations of the ecosystem under study. I encourage readers in the applica-
tions community to try one or more of the fuzzy algorithms discussed above -
the results might be very surprising! On this note my survey concludes.
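The closing invitation can be taken up directly; below is a minimal fuzzy c-means (FCM) sketch in pure Python, an illustrative implementation of the standard alternating updates (prototypes, then memberships), not the author's code. All names and defaults are mine.

```python
import math
import random

def fcm(data, c=2, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: alternate the prototype and membership updates."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    # random initial fuzzy partition: each row (one object) sums to 1
    u = []
    for _ in range(n):
        w = [rng.random() for _ in range(c)]
        s = sum(w)
        u.append([wi / s for wi in w])
    for _ in range(iters):
        # prototypes: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        v = []
        for i in range(c):
            num, den = [0.0] * p, 0.0
            for k in range(n):
                w = u[k][i] ** m
                den += w
                for j in range(p):
                    num[j] += w * data[k][j]
            v.append([x / den for x in num])
        # memberships: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        for k in range(n):
            d = [math.dist(data[k], vi) for vi in v]
            if min(d) < 1e-12:          # datum sits on a prototype
                u[k] = [1.0 if di < 1e-12 else 0.0 for di in d]
            else:
                u[k] = [1.0 / sum((d[i] / dj) ** (2.0 / (m - 1.0)) for dj in d)
                        for i in range(c)]
    return u, v

# two well-separated blobs
data = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
u, v = fcm(data, c=2)
```

With two well-separated blobs the maximum memberships cleanly separate the two groups; with overlapping data the intermediate membership values themselves become the interesting output.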
REFERENCES
ANDERBERG, M. R. 1983. Cluster analysis for researchers, Academic Press,
New York.
ANDERSON, I., BEZDEK, J., AND DAVE, R. 1982. Polygonal shape descriptions
of plane boundaries, in Systems science and science, vol. 1, pp. 295-301,
SGSR Press, Louisville.
ARBIB, M. 1977. Book reviews, Bull. AMS, vol. 83, no. 5, pp. 946-951.
(Arbib provides scathing reviews of three fuzzy sets books).
BACKER, E. 1978. Cluster analysis by optimal decomposition of induced fuzzy
sets, Delft Univ. Press, Delft.
BANDLER, W., AND KOHOUT, L. 1984. The four modes of inference in fuzzy
expert systems, Cyber. and Sys. Res., vol. 2, pp. 581-586.
BELLMAN, R., KALABA, R., AND ZADEH, L. A. 1966. Abstraction and pattern
classification, Jo. Math. Anal. and Appl., vol. 13, pp. 1-7.
BEZDEK, J. C. 1974. Numerical taxonomy with fuzzy sets, Jo. Math. Bio, vol.
1, no. 1, pp. 57-71.
BEZDEK, J. C., AND DUNN, J. C. 1975. Optimal fuzzy partitions: a heuristic for
estimating the parameters in a mixture of normal distributions, IEEE Tran-
sactions on Computers, vol. 24, no. 8, pp. 835-838.
BEZDEK, J. C. 1976a. Feature selection for binary data: medical diagnosis with
fuzzy sets, Proc. 1976 NCC, AFIPS (45), pp. 1057-1068, AFIPS Press,
Montvale.
BEZDEK, J. C. 1976b. A physical interpretation of fuzzy ISODATA, IEEE
Trans. SMC, vol. 6, no. 5, pp. 387-389.
BEZDEK, J. C., AND CASTELAZ, P. 1977. Prototype classification and feature
selection with fuzzy sets, IEEE Trans. SMC, vol. 7, no. 2, pp. 87-92.
BEZDEK, J. C., AND HARRIS, J. D. 1978. Fuzzy relations and partitions: an
axiomatic basis for clustering, Fuzzy Sets and Systems, vol. 1, pp. 111-127.
BEZDEK, J. C., AND FORDON, W. 1978. Analysis of hypertensive patients by
the use of the fuzzy ISODATA clustering algorithms, Proc. 1978 Joint
Automatic Control Conference, pp. 349-355, ISA Press, Pittsburgh.
BEZDEK, J. C. 1978. Fuzzy algorithms for particulate morphology, in Proc.
1978 int'l powder and bulk solids conf., pp. 143-150, ISCM Press, Chicago.
BEZDEK, J. C., AND HARRIS, J. D. 1979. Convex decompositions of fuzzy par-
titions, Jo. Math. Anal. and Appl., vol. 67, no. 2, pp. 490-512.
BEZDEK, J. C., AND FORDON, W. A. 1979. The application of fuzzy set theory
to medical diagnosis, in Advances in fuzzy set theory and applications, pp.
445-461, North Holland, Amsterdam.
BEZDEK, J. C. 1980. A convergence theorem for the fuzzy ISODATA cluster-
ing algorithms, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. PAMI-2, no. 1, pp. 1-8.
BEZDEK, J. C. 1981a. Pattern recognition with fuzzy objective function algo-
rithms, Plenum Press, New York.
BEZDEK, J. C., CORAY, C., GUNDERSON, R., AND WATSON, J. 1981b. Detec-
tion and characterization of cluster substructure: I. linear structure: fuzzy
c-lines, SIAM Jo. Appl. Math, vol. 40, no. 2, pp. 339-357.
BEZDEK, J. C., CORAY, C., GUNDERSON, R., AND WATSON, J. 1981. Detec-
tion and characterization of cluster substructure: II. fuzzy c-varieties and
convex combinations thereof, SIAM Jo. Appl. Math, vol. 40, no. 2, pp.
358-372.
BEZDEK, J. C., AND SOLOMON, K. 1981. Simulation of implicit numerical
characteristics using small samples, in Proc. ICASRC, ed. G. E. Lasker, vol.
VI, pp. 2773-2784, Pergamon, New York.
BEZDEK, J. C., AND ANDERSON, I. 1984. Curvature and tangential deflection of
discrete arcs, IEEE Trans. PAMI, vol. 6, no. 1, pp. 27-40.
BEZDEK, J. C., AND ANDERSON, I. 1985. An application of the c-varieties clus-
tering algorithms to polygonal curve fitting, IEEE Trans. SMC, vol. 15, no.
5, pp. 637-641.
BEZDEK, J. C., HATHAWAY, R. J., AND HUGGINS, V. J. 1985. Parametric esti-
mation for normal mixtures, Pattern Recognition Letters, vol. 3, pp. 79-84.
BEZDEK, J. C., GRIMBALL, N., CARSON, J., AND ROSS, T. 1986. Structural
failure determination with fuzzy sets, in press, Civil Engr. Sys.
BEZDEK, J. C., BISWAS, G., AND HUANG, L. 1986. Transitive closures of fuzzy
thesauri for information retrieval systems, in press, IJMMS.
BEZDEK, J. C., CHUAH, S., AND LEEP, D. 1986. Generalized k-nearest neigh-
bor rules, Fuzzy Sets and Systems, vol. 18, pp. 237-256.
BEZDEK, J. C., AND HATHAWAY, R. J. 1986. Clustering with relational c-
means partitions from pairwise distance data, in press, Jo. Math Modeling.
BEZDEK, J. C., HATHAWAY, R. J., HOWARD, R. E., WILSON, C. E., AND
WINDHAM, M. P. 1986. Local convergence analysis of a grouped variable
version of coordinate descent, in press, Jo. Optimization Theory.
BISWAS, G., JAIN, A. K., AND DUBES, R. C. 1981. Evaluation of projection
algorithms, IEEE Trans. PAMI, vol. 3, no. 6, pp. 701-708.
BLOCKLEY, D. I., PILSWORTH, G. W., AND BALDWIN, J. F. 1983. Measures of
uncertainty, Civil Eng. Sys, vol. 1, pp. 3-9.
BOCK, H. H. 1984. Statistical testing and evaluation methods in cluster analysis,
Proc. lSI, pp. 116-146, Calcutta.
BOISSONADE, A., DONG, W., LIU, S., AND SHAH, H. C. 1984. Use of pattern
recognition and Bayesian classification for earthquake intensity and damage
estimation, Int. Jo. Soil Dynamics & Earth. Engr., vol. 3, no. 3, pp. 145-149.
BONISSONE, P., AND DECKER, K. 1985. Selecting uncertainty calculi and
granularity: an experiment in trading-off precision and complexity, GE
TR85.5C38, Schenectady.
Pierre Legendre
Departement de Sciences biologiques
Universite de Montreal
C.P. 6128, Succursale A
Montreal, Quebec H3C 3J7, Canada
Abstract - Results of cluster analysis usually depend to a large extent on the choice of a clustering
method. Clustering with constraint (time or space) is a way of restricting the set of possible
solutions to those that make sense in terms of these constraints. Time and space contiguity are so
important in ecological theory that their imposition as an a priori model during clustering is
reasonable. This paper reviews various methods that have been proposed for clustering with
constraint, first in one dimension (space or time), then in two or more dimensions (space). It is
shown, using autocorrelated simulated data series, that if patches do exist, constrained clustering
always recovers a larger fraction of the information than the unconstrained equivalent. The
comparison of autocorrelated to uncorrelated data series also shows that one can tell, from the
results of agglomerative constrained clustering, whether the patches delineated by constrained
clustering are real. Finally, it is shown how constrained clustering can be extended to domains
other than space or time.
INTRODUCTION
Clustering with constraint is one way of imposing a model onto the data analysis process,
whose end result otherwise would depend greatly on the clustering algorithm used. The model
consists of a set of relationships that we wish the clustering results to preserve, in addition to the
information contained in the resemblance matrix (or, for some clustering methods, in the raw data:
Lefkovitch 1987). These relationships may consist of geographic information, placement along a
time series, or may be of other types, as we will see. In any case, imposing a constraint or a set
of constraints onto a data-analytic method is a way of restricting the set of possible solutions to
those that are meaningful in terms of this additional information.
In this paper, we will first describe various forms of constrained clustering. Then we will
examine the questions of whether constrained clustering is necessary to get meaningful results,
and how to determine if the patches found by constrained clustering are real. Finally, we will
suggest that the concept of constrained clustering can be extended to models other than space or
time.
Ecologists are primarily interested in two types of natural constraints: space and time.
Ecological sampling programs are usually designed along these physical axes, so that information
about the position of ecological samples in space and in time is almost always known.
Furthermore, various parts of ecological theory tell us that elements of an ecosystem that are
closer in space or in time are more likely to be under the influence of the same generating process
(competition theory, predator-prey interactions, succession theory), while other parts of
ecological theory tell us that the discontinuities between such patches in space or in time are
important for the structure (succession, species-environment relations) or for the dynamics of
ecosystems (ergoclines).
These reasons are so compelling as to legitimize a clustering approach where the clusters
will be considered valid only if they are made of contiguous elements. From this point of view,
clusters of noncontiguous elements, such as can be obtained from the usual unconstrained
clustering algorithms, are seen as an artifact resulting from the artificial aggregation of effects from
different but converging generating processes. We will come back to this point later on.
ONE-DIMENSIONAL CONSTRAINT
Several other proposals have been reviewed by Wartenberg (manuscript). Among these, let
us mention the method of Webster (1973), a soil scientist who needed to partition multivariate
sequences corresponding to a space transect or to a core. Moving a window along the series,
Webster compared the two halves of the segment covered by the window, either with Student's t
or Mahalanobis' D², and he placed boundaries at points of maximum value of the statistic. While
the results obtained depend in part on the window length, Webster's method is interesting in that it
looks for points of maximal changes between regions.
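The moving-window idea can be sketched as follows (univariate case, with a plain two-sample t statistic; the window size and all names are illustrative, not Webster's code):

```python
import math

def split_window_stat(series, half=5):
    """|t| statistic comparing the two halves of a window of length 2*half,
    for each candidate boundary position i (between samples i-1 and i).
    Boundaries are suggested at local maxima of the returned statistic."""
    stats = {}
    for i in range(half, len(series) - half + 1):
        left, right = series[i - half:i], series[i:i + half]
        ml, mr = sum(left) / half, sum(right) / half
        vl = sum((x - ml) ** 2 for x in left) / (half - 1)
        vr = sum((x - mr) ** 2 for x in right) / (half - 1)
        se = math.sqrt((vl + vr) / half) or 1e-12   # guard zero variance
        stats[i] = abs(ml - mr) / se
    return stats

# a series with one sharp step between positions 9 and 10
series = [0.0] * 10 + [5.0] * 10
stats = split_window_stat(series)
boundary = max(stats, key=stats.get)
```

As noted in the text, the positions of the maxima depend in part on the chosen window length, so in practice one would inspect several values of `half`.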
The dual approach to this problem is to look for maximal homogeneity within segments.
This was the point of view adopted by Hawkins and Merriam who proposed a method for
segmenting a univariate (1973) or a multivariate (1974) data series into homogeneous units, using
a dynamic programming algorithm. This method was advocated by Ibanez (1984) for the study of
successional steps in ecosystems.
Using the hierarchical clustering approach, Gordon and Birks (1972, 1974) and Gordon
(1973) included the time constraint in a variety of algorithms to study pollen stratigraphy. They
used constrained single linkage, constrained average linkage, and a constrained binary division
algorithm. Their purpose was to define zones of pollen and spores that are homogeneous within
zones and different between zones. They compared their various techniques, which led by and
large to the same result. As we will see below, this was probably due to the predominant
influence of the constraint on the results.
Legendre et al. (1985) used a very similar approach to study ecological successions
through time. The basis of their method, called "chronological clustering", is proportional-link
linkage hierarchical clustering with a constraint of time contiguity. This means that only
time-adjacent groups are considered contiguous and are assessed for clustering. There is one
important addition to the ideas of Gordon and his co-workers, however: this algorithm is
supplemented with a statistical test of cluster fusion whose hypotheses correspond to the
ecological model of a succession evolving by steps.
Prior to this analysis, a distance matrix among samples has been computed, using a
dissimilarity function appropriate to the problem at hand (ecological succession, or other).
Considering two groups (1) that are contiguous and (2) that are proposed for fusion by the
clustering algorithm, a one-tailed test is made of the null hypothesis that the "large distances" in
the submatrix are distributed at random within and among these two groups. The test is
performed by randomization; this test could actually be re-formulated as a special form of the
Mantel test (1967). The above-mentioned paper shows the true probability of a type I error to be
equal to the nominal significance level of the test. When the null hypothesis is accepted at the
given confidence level, the two groups are fused. The computer program also allows for the
elimination of aberrant samples that can form singletons and prevent the fusion of their
neighboring groups, and it offers complementary tests of the similarity of non-adjacent groups.
The end result is a nonhierarchical partition of the samples into a set of internally contiguous
groups, the number of which is not fixed in advance by the user.
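The flavour of such a randomization test can be sketched as follows. This is a loose, Mantel-style reimplementation for illustration only, not the published chronological clustering test; all names are mine.

```python
import random
from itertools import combinations

def fusion_test(dist, group_a, group_b, n_perm=999, seed=0):
    """Mantel-flavoured randomization sketch: is the mean among-group distance
    larger than expected when objects are relabelled at random?  A small
    p-value argues against fusing the two groups."""
    rng = random.Random(seed)
    objs = list(group_a) + list(group_b)

    def mean_among(a_set):
        vals = [dist[i][j] for i, j in combinations(objs, 2)
                if (i in a_set) != (j in a_set)]
        return sum(vals) / len(vals)

    observed = mean_among(set(group_a))
    hits = sum(1 for _ in range(n_perm)
               if mean_among(set(rng.sample(objs, len(group_a)))) >= observed)
    return (hits + 1) / (n_perm + 1)        # approximate one-tailed p-value

# two tight, well-separated groups of three samples each
dist = [[0, 1, 1, 10, 10, 10],
        [1, 0, 1, 10, 10, 10],
        [1, 1, 0, 10, 10, 10],
        [10, 10, 10, 0, 1, 1],
        [10, 10, 10, 1, 0, 1],
        [10, 10, 10, 1, 1, 0]]
p = fusion_test(dist, [0, 1, 2], [3, 4, 5])
```

With such small groups the permutation distribution is coarse, so the attainable p-values are bounded away from zero; with realistic sample sizes the test becomes much sharper.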
Fig. 1. [Chronological clustering of the 78-sample series (1966-1968); the folded series, with sample numbers and circled group identifiers, is not reproduced.]
I shall illustrate time-constrained clustering with this method. The example consists of a
series of 78 samples of Mediterranean zooplankton (chaetognaths) obtained from 1966 to 1968
and analyzed by Legendre et al. (1985). In Figure 1, the series is folded to allow representation of
the relationships among clusters; these relationships have been computed by a posteriori testing,
using the test of cluster fusion described above. The ecological significance of the group breaks is
discussed in the above-mentioned paper.
This data set was also subjected to chronological clustering using several values of
connectedness during the proportional-link linkage agglomeration. Without the constraint, low
values of connectedness have a space-contracting effect while high values cause an effect
equivalent to an expansion of the reference space (Lance and Williams 1967). As shown in Figure
2, the results are quite stable through a range of connectedness values. This illustrates the
predominant effect of the constraint during the clustering process, as previously noted by Gordon
and Birks (op. cit.). Clustering the same data set by unconstrained proportional-link linkage
produced scrambled, uninterpretable results (Legendre et al. 1985).
Fig. 2. Comparison of four connectedness levels (Co), keeping α fixed at 0.25. Same data as
in Figure 1. Full horizontal lines: clusters of contiguous samples, with blanks representing
significant breaks in the series. Stars: singletons. From Legendre et al. (1985), Figure 3.
Chronological clustering, which was developed with reference to the problem of species
succession in ecosystems, could be applied to other problems where one hypothesizes sharp
breaks within the data series. Besides the examples in Legendre et al. (1985), the method has
been applied to a variety of other problems, which include the successional dynamics of bacteria
through time in sewage treatment lagoons (Legendre et al. 1984), the study of fish communities in
a coral reef transect (Galzin and Legendre 1987) and of a stratigraphic sequence of fossil fish (Bell
and Legendre 1987).
Often, the spatially distributed data of interest to the ecologist are not sampled from a
transect, but are spread across a surface or, in some instances, a volume. If the spatial
relationships among samples are to be taken into account during the clustering process, it is
important to define clearly what is meant by "contiguous samples".
If the data represent sub-units of the area under study, with these smaller surfaces
touching one another, then a simple and natural way is to define as contiguous two surfaces that
share a common border.
On the contrary, if the data can be seen as attached to points in space that are distant from
one another, then there are various ways of defining the connection network among these points.
(a) The easiest way is to use the minimum spanning tree among points in geographic space. This
method is also the least efficient in that it uses only a small fraction of the geographic information.
(b) Among the various types of connection networks, one that is often used is the Gabriel graph
(Gabriel and Sokal 1969). In this graph, two points A and B are connected if no other point is
found inside the circle whose diameter is the line joining A and B; in other words, connect A and
B when D²AB < D²AC + D²BC for every triplet of points A, B, C under study.
(c) Another commonly used type of connection network is the Delaunay triangulation. This is a
way of dividing the whole plane into triangles without crossing edges. The algorithm proposed
by Green and Sibson (1978) also allows the user to remove those long edges that form along the
perimeter of the surface as "border effects". A Gabriel graph is a subset of a Delaunay
triangulation (Matula and Sokal 1980).
(d) When the points form a regular grid (or when the surface is divided into squares or
rectangles), it is a simple matter to connect them in 4 directions if they form a square lattice, or in
8 directions by adding diagonal edges. They could also be connected in 6 directions if they are
positioned in staggered rows.
These connecting schemes can be extended to three dimensions if the points come from a
volume of space, or if the volume is divided into regular or irregular blocks.
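The Gabriel-graph criterion of scheme (b) translates directly into code; this is a brute-force sketch (function names are mine):

```python
def gabriel_graph(points):
    """Connect A and B when D2(A,B) < D2(A,C) + D2(B,C) for every other
    point C, i.e. when no third point lies inside the circle whose
    diameter is the segment AB (brute force, O(n^3))."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    n = len(points)
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            dab = d2(points[a], points[b])
            if all(dab < d2(points[a], points[c]) + d2(points[b], points[c])
                   for c in range(n) if c not in (a, b)):
                edges.append((a, b))
    return edges

# three collinear points: the middle one blocks the long edge
edges = gabriel_graph([(0, 0), (1, 0), (2, 0)])
```

Because the squared-distance test works in any number of coordinates, the same routine extends unchanged to the three-dimensional case mentioned above.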
Using one or another of these connecting schemes, authors have constrained many of the
usual clustering algorithms: linkage clustering, UPGMA, minimum-variance method, hierarchical
binary division, and so on. Others used the geographic information a posteriori, selecting among
the set of possible partitions those that are consistent with the spatial constraints. Wartenberg
(manuscript) has reviewed these developments, which go back to Ray and Berry (1966).
Tests of various kinds have been developed, either as a part of constrained clustering
algorithms, or to assess the interest of the results.
(a) Howe (1979) used a test of the difference between the means of adjacent groups, during
pairwise agglomeration. In the same line of thought, Gabriel and Sokal (1969) developed a
significance test of the homogeneity of a whole partition based on the sum of squares criterion.
Given what we know now about the influence of spatial autocorrelation on statistical tests, and
especially on analysis of variance (e.g., Cliff and Ord 1981, ch. 7), these tests should be used
with caution.
(b) Ray and Berry (1966) evaluated the various agglomeration levels by plotting the changes of the
within-group and the among-group variances as a function of the number of groups. Changes in
the slope of these curves indicate the best partition.
(c) Okabe (1981) developed an index for the difference between the constrained and the
unconstrained solution, which he tested for significance by randomization. His index is based on the
number of point displacements that are necessary to transform one solution into the other, but the
Jaccard or the Rand index (described below), or information measures such as Rajski's metric
(1961), could be used for the same purpose.
Fig. 3. One of the maps from the constrained clustering study of Legendre and
Legendre (1984). This map represents clustering level S = 0.70 of the
proportional-link linkage agglomeration, with connectedness of 50%. Each group of
quadrats formed at this level is represented by a different letter or number. Longitude
(W) and latitude (N) are shown outside the frame. [Character map not reproduced.]
I will illustrate constrained clustering on a surface using results from our program
(BIOGEO), which is a constrained proportional-link linkage agglomerative algorithm that can
handle large data sets; this property comes from the fact that, in a constrained situation, the search
for the next pair to join is limited to adjacent groups only, as previously noted by Openshaw
(1974) and by Lebart (1978). The program can use either (a) points in a regular grid, or (b) a list
of connections obtained for instance from a Delaunay triangulation. It presents the advantage of
producing directly a series of maps, each corresponding to a clustering level, instead of the usual
dendrogram. These maps are drawn either for the regular grid, or using the X and Y coordinates
of the points. Figure 3 shows one such map, from a biogeographic study of freshwater fishes in
the Quebec peninsula (Legendre and Legendre 1984), based upon the presence/absence of 109
species in 289 units of territory. Figure 8 shows a pair of such maps for points positioned by
their X and Y coordinates.
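The search restriction just noted, considering only adjacent groups at each fusion, is the heart of constrained agglomeration. Here is a sketch of it using single linkage rather than BIOGEO's proportional-link linkage, purely to keep the code short; all names are illustrative.

```python
def constrained_single_linkage(dist, links, n_groups):
    """Agglomerate until n_groups remain, fusing at each step the closest
    (single-linkage) pair of groups that share at least one contiguity link."""
    groups = [{i} for i in range(len(dist))]

    def adjacent(g, h):
        return any((i, j) in links or (j, i) in links for i in g for j in h)

    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                if not adjacent(groups[a], groups[b]):
                    continue                 # the constraint in action
                d = min(dist[i][j] for i in groups[a] for j in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:                     # no contiguous pair remains
            break
        _, a, b = best
        groups[a] |= groups[b]
        del groups[b]
    return groups

# toy transect: four stations along a line, contiguity = chain 0-1-2-3
values = [0.0, 0.1, 5.0, 5.1]
dist = [[abs(x - y) for y in values] for x in values]
groups = constrained_single_linkage(dist, {(0, 1), (1, 2), (2, 3)}, n_groups=2)
```

The `links` set would come from any of the connecting schemes above (chain, Gabriel graph, Delaunay triangulation, lattice).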
When constrained clustering has been completed, distant groups could be tested a
posteriori to determine if recurrent group structures exist through space. See Cliff and Ord (1981)
for tests of the difference among means in the presence of spatial autocorrelation.
The question has been raised, whether constrained clustering represents a methodological
advance. Could the same results be obtained without the constraint? A constraint is after all
difficult to imbed into computer programs. I would like to argue that if one assumes the existence
of an ecological process generating autocorrelation along the sampling axes (space or time), then
one is more likely to miss uncovering the corresponding ecological structure if the clustering is
carried out without constraint. This property of clustering algorithms will be demonstrated for
agglomerative methods; divisive or nonhierarchical methods would likely lead to the same result.
For the sake of clarity, let us limit this discussion to spatially autocorrelated phenomena,
although the results apply as well to autocorrelation along the time axis. In community ecology,
one can often hypothesize generating processes related either to the abiotic environment, or to
some form of contagious biological growth. If, for the scale of sampling under consideration, the
generating process has produced a gradient, the existence of such a gradient can be demonstrated
by spatial autocorrelation analysis (univariate autocorrelation analysis: Cliff and Ord 1981;
multivariate Mantel correlogram: Sokal et al. 1987), while the gradient itself can be described
adequately by ordination analysis (scaling). On the other hand, if the generating process has
produced locally homogeneous community structures within some larger area subjected to
sampling, then the description of these structures becomes a clustering problem. Since one is then
interested in forming connected clusters of objects, there is no question as to the appropriateness
of constrained clustering, since this is exactly what this family of methods does: it produces
clusters of spatially connected points. On the contrary, clustering without constraint would open
the door to clusters possibly formed by grouping objects whose apparent similarity is the result of
different mechanisms that converged to produce somewhat similar effects on the community
structure; these clusters would present a blurred picture, as noted by Monestiez (1978).
Wartenberg (manuscript) gives a similar example from the health sciences, where lung ailments
may be due to a variety of causes: occupational (i.e., from coal mining), ambient (such as near
industrial areas), or personal habits (tobacco consumption), all of which can lead to light or
severe lung conditions; unconstrained clustering would group the samples by severity of cases
while spatially constrained clustering is more likely to delineate areas with similar types of causes.
The same rule applies to community ecology, where it is better to form the regional clusters first,
and to find the relationship among clusters in a second step.
To demonstrate that constrained clustering is not only appropriate, but also necessary, we
will rely on Monte Carlo simulations. Analyzing known conditions will show that one is less
likely to get a meaningful answer after unconstrained clustering than if a constraint has been used,
in cases where a generating process has produced patchiness.
Five groups of equal size (30 objects each) have been generated by an autocorrelated
process with random components. To make them easier to picture, the groups are made to form
for the moment a one-dimensional array of 150 objects. Within each group, one of the objects is
selected at random to become the nucleus of the generating process giving rise to the group. A
value is given to each of these nuclei; this value is drawn from a normal random distribution with
mean 0 and variance VAR. The rest of each group is made to grow out of its nucleus by a
contagious process, that consists of giving to a point located at distance n from the nucleus, the
value of the point located at distance (n-1), plus a N(0,1) random normal deviate. Such
autocorrelated Monte Carlo series have been generated with group nuclei variances VAR = 1, 5,
10, 15, 20, 25 and 30, as well as for the intermediate integer values of VAR between 1 and 10;
the amount of variance added at random to the contagious within-group growth process is kept
constant. The data sets, 150 objects long, are univariate; this should not affect the generality of
the conclusions. Spatial autocorrelation analysis was performed on these series to verify that the
data are indeed autocorrelated; significant positive autocorrelation extended to about distance 20 in
each of these data series. Five of them are shown in Figure 4; the seed of the random number
generator was the same for all runs.
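The generating process just described can be sketched as follows; this is a loose reimplementation for illustration, not the original simulation code, and all names and defaults are mine.

```python
import random

def autocorrelated_series(n_groups=5, size=30, var=10.0, seed=0):
    """Patchy series: each group grows outward from a randomly placed nucleus;
    nucleus ~ N(0, var), each neighbour = adjacent value + N(0, 1) deviate."""
    rng = random.Random(seed)
    series = []
    for _ in range(n_groups):
        group = [0.0] * size
        nucleus = rng.randrange(size)
        group[nucleus] = rng.gauss(0.0, var ** 0.5)
        for k in range(nucleus + 1, size):      # contagious growth, rightward
            group[k] = group[k - 1] + rng.gauss(0.0, 1.0)
        for k in range(nucleus - 1, -1, -1):    # contagious growth, leftward
            group[k] = group[k + 1] + rng.gauss(0.0, 1.0)
        series.extend(group)
    return series

series = autocorrelated_series(var=10.0)
```

Fixing the seed reproduces the same series across runs, mirroring the design choice reported in the text.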
After computing a (150 x 150) Euclidean distance matrix among objects, agglomerative
clustering is performed using constrained as well as unconstrained clustering. Both of the
algorithms used are based upon proportional-link linkage clustering, and a connectedness value
of 50% was used throughout for the sake of uniformity.
Since the "truth" is known from the generating process (five equal groups of 30 objects
each), it can be used to assess the efficiency of each clustering model. To achieve this, a (150 x
150) half-matrix is first computed for any given partition level of the hierarchical classification,
containing a "1" to describe two objects that are members of the same group at the said level, and
"0" otherwise. Another such half-matrix is built for the reference classification of the objects into
five groups. Milligan (1983) recommends using both the Jaccard and the Rand index to compare
the two partitions:
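The index formulas themselves fall on the page break. In the usual pair-counting notation (a = pairs joined in both partitions, b and c = pairs joined in only one, d = pairs separated in both), Jaccard = a/(a+b+c) and Rand = (a+d)/(a+b+c+d); a self-contained sketch over group-label vectors:

```python
from itertools import combinations

def pair_counts(part1, part2):
    """part1, part2: group label for each object, e.g. [0, 0, 1, 1]."""
    a = b = c = d = 0
    for i, j in combinations(range(len(part1)), 2):
        same1, same2 = part1[i] == part1[j], part2[i] == part2[j]
        if same1 and same2:
            a += 1      # together in both partitions
        elif same1:
            b += 1      # together in the first only
        elif same2:
            c += 1      # together in the second only
        else:
            d += 1      # apart in both partitions
    return a, b, c, d

def jaccard(part1, part2):
    a, b, c, _ = pair_counts(part1, part2)
    return a / (a + b + c)

def rand(part1, part2):
    a, b, c, d = pair_counts(part1, part2)
    return (a + d) / (a + b + c + d)
```

Note that only the Rand index rewards d, which is exactly the term invoked below to explain its behaviour on nearly-unclustered partitions.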
Fig. 4. Five autocorrelated Monte Carlo series, generated with different values (VAR) of group
nuclei variance (panels labelled Var-1 through Var-30). Ordinate: the value attributed to each
sample along the series; abscissa: sample number (1 to 150). The seed of the random number
generator was the same for these five runs. Group breaks are materialized by dashes. [Scatter
panels not reproduced.]
The results obtained with the Jaccard index are clear (Fig. 5). For any amount of variance
among group nuclei, constrained clustering recovers more of the original classification's
information than does unconstrained clustering.
The results obtained with the Rand index are the same, although the Rand criterion, at low
VAR values (VAR ≤ 5) and only in the unconstrained case, regularly picked out as optimal
partition levels where very few points had been clustered, all the others being treated as singletons
(one-member clusters). The Rand index could pick out these partition levels because the quantity
to be maximized involves d, the number of pairs pertaining to unlike groups in both
classifications.
These simulations lead to the conclusion that one should always use constrained
clustering, when working under the assumption that the phenomenon under study is spatially (or
temporally) autocorrelated.
What if one uses constrained clustering while there is no spatial structure, despite the
assumptions to that effect? Of course, one could have ascertained first that there is a patchy
spatial structure, by spatial autocorrelation analysis (Sokal and Thomson 1987). Spatial
correlograms, however, can only recognize patchiness when it is somewhat regular; they
may fail to give a significant answer if the patches are greatly variable in size. So, constrained
clustering may be needed even if spatial autocorrelation analysis has not demonstrated the
existence of regular patches. Can we use the results of the clustering itself to tell us whether the
patches obtained with constraint are real entities?
[Figure 5: two panels plotting the fraction of recovered information against group nuclei variance (VAR, 5-30); panel A: Jaccard index, ordinate 0.00-1.00; panel B: Rand index, ordinate 0.80-1.00.]
Fig. 5. Fraction of the group structure information recovered using constrained (open circles) and
unconstrained (closed circles) clustering, according to (A) the Jaccard index, and (B) the Rand
index, for groups generated with various amounts of variance among group nuclei (abscissa).
This can indeed be done. Let us compare what should happen during constrained
clustering, in the absence or in the presence of patchiness. Let us consider first an example where
the values to be clustered are the result of a strictly random process. In that case, the probability
that two neighbors (groups, or single objects) will be the next most similar pair is equal among
pairs of neighbors, and its value is (1/number of possible pairs) in ideal cases. It varies with
group size in the case of space-contracting (like single linkage) or space-dilating (like complete
linkage) clustering methods (Lance and Williams 1967); this point deserves further investigation.

[Figure 6: two panels plotting the number of clusters (ordinate) against clustering steps (abscissa, 30-120); top panel: decrease = 40 steps; bottom panel: decrease = 15 steps.]

Fig. 6. Spatially autocorrelated data, from Figure 4 (VAR = 10), produce a longer zone of
decrease (top) than 150 random points (bottom panel). The ordinate of each graph represents the
number of groups, other than single-object clusters, that are present at the corresponding level
(abscissa).
In any case, one expects the random agglomeration mechanism to produce at first a large number
of small patches, that grow according to some random model, while near the end of the clustering
process, we can expect the quick formation of very large patches (within a few clustering steps),
before the final formation of a single group. If there is a spatially autocorrelated structure, the
beginning of the agglomeration should follow essentially the same pattern, since the points that
cluster correspond at first to random within-group variations; near the end of the agglomerative
process, the differences among groups should translate into extra steps in the larger distance
classes, contrary to the no-structure case.
Actual experiments show that this is indeed what happens (Fig. 6 and 7). When the data
series is one-dimensional (circles in Fig. 7), the difference in length of the zone of decline is very
large at all values of connectedness, from 1% to 100%, used in the proportional-link linkage
agglomeration.

[Figure 7: two panels plotting the length of the zone of decrease (ordinate, 0.0-1.0) against % connectedness (abscissa, 25-100), for autocorrelated series and for random numbers.]

Fig. 7. Length of the zone of decrease, as a function of the connectedness (Co) used during
linkage agglomeration, for autocorrelated series (150 points) and for random numbers (150
points). The zone of decrease is measured (A) as a proportion of the total number of steps, or (B)
as a fraction of the range of distances where the number of groups decreases, over the total range
of distances where agglomeration took place.

When the series are made to form a two-dimensional grid of 5 lines and 30
columns (chosen to agree with the autocorrelated group structure that we created), the difference is
not as large, but it is still significant (sign test). In the lower panel of Figure 7, the ordinate value
0.85 seems to form a line separating the two processes; further statistical investigation of this
property is obviously needed, either by Monte Carlo methods, or by studying the theoretical
distribution of these statistics for constrained group formation.
One step further up the scale of abstraction consists of using constrained clustering to test
the hypothesis that a variable or a set of multivariate data forms clusters that are autocorrelated in
some other space than geography or time.
[Figure 8: two maps of the sampling stations in the edaphic principal-coordinate plane; top: S = 0.32323, 8 groups; bottom: S = 0.2 (remaining digits illegible), 4 groups.]
Fig. 8. Two of the steps during constrained agglomerative clustering of the forest vegetation
data. Each step is represented by a map whose abscissa is principal coordinate I and ordinate is
principal coordinate II of the edaphic space. The clustering similarity level is shown on each map.
Each group of sampling stations is represented by a different letter (without order).
One should wonder first if the relationship is real between community structure and the
edaphic space that we have constructed by principal coordinate analysis; studying the length of the
zone of decrease of the number of clusters shows that the decrease occupies 0.390 of the total
number of steps, and 0.442 along the distance scale; these figures fall in the "random numbers"
zone of Figure 7, for a connectedness of 50%. So, instead of pursuing the interpretation of these
results, one should conclude that the tree community structure data do not lead to significant
clusters in the edaphic space, given the way it was created with the data and by the method
described above.
CONCLUSION
Our experience with clustering methods that impose a constraint of contiguity through
space or time is that the results obtained through a wide range of clustering methods -- linkage
clustering, from single to complete linkage -- are much more similar to one another than without
the constraint. This is because constraining the clustering process also constrains the set of
solutions, eliminating a number of solutions that are compatible with the resemblance matrix, but
that do not make much sense in view of the spatial or temporal relationships existing among the
samples under study.
From the descriptive point of view, constrained clustering is one of the few ways available
for synthetically representing multivariate data onto a map. With many ecological problems, this
type of mapping is far more interesting than separate maps of the variables forming the
multivariate data set. On the other hand, theories about the importance of dispersal routes for
individual species or for whole biotic communities could be tested by comparing the unconstrained
to the space-constrained classification of sites; many other hypotheses of contagiousness of
ecological processes through space or time could be tested in the same way.
A number of constrained clustering programs have been written and are available to other
users. This is the case at least with the present author's programs used in the examples presented
above, as well as the program of Lebart (1978, for two-dimensional constraint), whose paper
includes the program listing. De Soete et al. (1987) present algorithms for deriving constrained
classifications in a more general context than that of the present paper, and they review the
psychometric literature on the subject.
Compute C = A + wB, where w is a scalar weight. Cluster C for different values of w and pick
the result with the smallest w where all clusters are internally contiguous. This method can also be
used to obtain constrained ordinations.
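The weighting scheme just described can be sketched as follows (a hypothetical Python illustration; the clustering routine itself is left abstract, and the cited programs are in Fortran):

```python
def combine(A, B, w):
    # C = A + w*B: blend the resemblance matrix A with the
    # geographic distance matrix B using the scalar weight w
    n = len(A)
    return [[A[i][j] + w * B[i][j] for j in range(n)] for i in range(n)]

def is_contiguous(cluster, adjacency):
    # a cluster is internally contiguous when its members form a
    # connected subgraph of the spatial adjacency relation
    members = set(cluster)
    seen = {next(iter(members))}
    stack = list(seen)
    while stack:
        v = stack.pop()
        for u in adjacency[v]:
            if u in members and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == members

adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a transect 0-1-2-3
# One would cluster combine(A, B, w) for increasing w (with any clustering
# routine) and keep the smallest w whose clusters all pass is_contiguous.
```

On the transect above, the cluster {0, 1, 2} is contiguous while {0, 2} is not, since object 1 separates its members.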
In the future, constrained clustering programs, if they are agglomerative, should be made
to include some measure of the information content of the various clustering levels, and also
perhaps a measure of "patchiness" such as the one developed in one of the previous sections.
Since clustering with constraint includes, in the data analysis process, some a priori knowledge
that is pertinent to many of the theories the ecologists are dealing with, it may be viewed by these
same ecologists as an interesting method both for descriptive purposes and for hypothesis
testing.
REFERENCES
Lebart, L. 1978. Programme d'agrégation avec contraintes (C.A.H. contiguïté). Cah. Anal.
Données 3: 275-287.
Lefkovitch, L. P. 1987. Species associations and conditional clustering: clustering with or
without pairwise resemblances. This volume.
Legendre, P., B. Baleux, and M. Troussellier. 1984. Dynamics of pollution-indicator and
heterotrophic bacteria in sewage treatment lagoons. Appl. Environ. Microbiol. 48: 586-593.
Legendre, P., S. Dallot, and L. Legendre. 1985. Succession of species within a community:
chronological clustering, with applications to marine and freshwater zooplankton. Am. Nat.
125: 257-288.
Legendre, P., and V. Legendre. 1984. Postglacial dispersal of freshwater fishes in the Quebec
peninsula. Can. J. Fish. Aquat. Sci. 41: 1781-1802.
Mantel, N. 1967. The detection of disease clustering and a generalized regression approach.
Cancer Res. 27: 209-220.
Matula, D. W., and R. R. Sokal. 1980. Properties of Gabriel graphs relevant to geographic
variation research and the clustering of points in the plane. Geogr. Anal. 12: 205-222.
Milligan, G. W. 1983. Characteristics of four external criterion measures, p. 167-173. In J.
Felsenstein [ed.] Numerical taxonomy. NATO Advanced Study Institute Series G
(Ecological Sciences), No.1. Springer-Verlag, Berlin.
Monestiez, P. 1978. Méthodes de classification automatique sous contraintes spatiales, p.
367-379. In J. M. Legay and R. Tomassone [ed.] Biométrie et écologie. Société française de
Biométrie, Paris.
Motyka, J. 1947. O zadaniach i metodach badan geobotanicznych. Sur les buts et les méthodes
des recherches géobotaniques. Ann. Univ. Mariae Curie-Sklodowska Sect. C, Suppl. I. viii
+ 168 p.
Okabe, A. 1981. Statistical analysis of the pattern similarity between 2 sets of regional clusters.
Environment and Planning A 13: 547-562.
Openshaw, S. 1974. A regionalisation program for large data sets. Computer Appl. 3-4:
136-160.
Rajski, C. 1961. Entropy and metric space, p. 44-45. In C. Cherry [ed.] Information theory.
Butterworths, London.
Ray, D. M., and B. J. L. Berry. 1966. Multivariate socioeconomic regionalization: a pilot study
in central Canada, p. 75-130. In S. Ostry and T. Rymes [ed.] Papers on regional statistical
studies. Univ. of Toronto Press, Toronto.
Sokal, R. R., N. L. Oden, and J. S. F. Barker. 1987. Spatial structure in Drosophila buzzatii
populations: simple and directional spatial autocorrelation. Am. Nat. 129: 122-142.
Sokal, R. R., and J. D. Thomson. 1987. Applications of spatial autocorrelation in ecology. This
volume.
Ward, J. H. Jr. 1963. Hierarchical grouping to optimize an objective function. J. Amer. Stat.
Assoc. 58: 236-244.
Wartenberg, D. E. Regional analysis: describing multivariate data distributions using geographic
information. Manuscript (cited with permission of the author).
Webster, R. 1973. Automatic soil-boundary location from transect data. J. Int. Assoc. Math.
Geology 5: 27-37.
SPECIES ASSOCIATIONS AND CONDITIONAL CLUSTERING:
CLUSTERING WITH OR WITHOUT PAIRWISE RESEMBLANCES
L.P. Lefkovitch
Engineering and Statistical Research Centre
Agriculture Canada
Ottawa, Ontario, Canada K1A 0C6
I - INTRODUCTION
II - THE CONSTRAINTS
step function, categorize the data for each inter-step class; if there is
no evidence of steps, i.e. the data seem not to exhibit polymodality,
there is good reason to exclude this attribute from consideration.
Assume that attribute j has been so categorized; then the procedures in 2a or
2b can be used, as appropriate.
4. Frequency data. In ecology, empirical data sometimes consist of the
proportion of either some fixed number of samples or of the total flora or
fauna for each of n species at each of m sites (Table 2a). One possibility
for such data is to choose some threshold value, e.g. 0.5, and define the A
matrix accordingly. This arbitrary choice can be avoided, however, by a simple
extension of the binary data model as follows. Define the matrix B to consist
of the probabilities of occurrence of each species in each site, and let these
be estimated by the proportions. It is clear that A can be regarded as a
special case of B in which the probabilities are either 0 or 1.
II (c) With similarity coefficients. Relational data are often obtained in
psychometric contexts, in antibody/antigen studies, in crossing experiments,
and are often estimated from attribute data by use of some measure of
similarity (see Gower and Legendre 1986, for a recent review). Without loss
of generality, it is assumed that the pairwise relationships have been
converted to dissimilarities (which need not be a metric). The objective of
this section is to summarize the procedure given in Lefkovitch (1982) to form
an A matrix, which is essentially the first phase of conditional clustering.
Its motivation is the question: if a particular subset of objects is
postulated, what other objects should be included? The requirement is that the
answer should satisfy two conditions. First, if the postulated objects
consist only of the pair with dissimilarity zero, then no others should be
included unless they also have dissimilarity zero with the pair and each
other; and, second, if the postulated objects include the pair of maximum
dissimilarity, then all objects should be included. If the maximum
dissimilarity in a subset is equated to the interval between the lower and
upper characteristic values of extreme value theory, the following procedure
generates a family of subsets of interest.
Let E be the adjacency matrix of the relative neighbourhood graph
(Toussaint 1980) of the objects based on the dissimilarities, D = {d_ij},
i.e.

    e_ij = 1, if d_ij ≤ max(d_ik, d_jk), for all k ≠ i, j; i ≠ j;
    e_ij = 0, otherwise.
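This definition translates directly into code. The following Python sketch (an illustration under the definition above, not the author's program) builds the adjacency matrix E from a dissimilarity matrix D:

```python
def rng_adjacency(D):
    # E[i][j] = 1 when no third object k is closer to both i and j
    # than they are to each other (Toussaint's relative neighbourhood graph)
    n = len(D)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if all(D[i][j] <= max(D[i][k], D[j][k])
                   for k in range(n) if k != i and k != j):
                E[i][j] = E[j][i] = 1
    return E
```

For three collinear objects at unit spacing, D = [[0,1,2],[1,0,1],[2,1,0]], the graph links the adjacent pairs (0,1) and (1,2) but not the pair (0,2), since object 1 is closer to both than they are to each other.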
go to step 3;
else if S_t ∉ A, include S_t in A;
go to step 2.
The heuristics described in Lefkovitch (1982) to restrict the number of
initial subsets which need be considered without changing the optimal
covering solution can be shown to be unnecessary since they are dominated by
the pairs adjacent on the relative neighbourhood graph. This graph has O(n)
edges (a sometimes achieved sharp lower bound is n - 1, since the minimum
spanning tree is a subgraph of E; an upper bound has empirically been found to
be less than 3.5n in random graphs (Lefkovitch 1984) and appreciably less in
those with obvious groupings). The generation of subsets from each of the
edges is O(n²), generating the graph requires arithmetic of O(n³) and so
the subset generation phase is O(n³).
The number of initial pairs may be further constrained if there are other
known relationships among the objects. For example, given the geographical
distribution of the objects, the initial pairs can be confined to those which
are adjacent on the (geographical) Gabriel graph or Dirichlet tessellation,
with the condition that the candidates for inclusion must form a connected
subgraph with the current members (Lefkovitch 1980), even though the primary
decisions are based on the dissimilarities. In the special case that the
objects form a linear sequence (e.g. a pollen core, a line transect), the
number of initial pairs is precisely n - 1. Some other classes of constraints
are considered by Legendre (this volume).
III (a) The information available is contained within A; this will now be
exploited with as few assumptions as possible. A represents a set of
predicates, each of which is either true or false, about each object (e.g.
object i shows presence for attribute j), about each attribute (e.g. the
jth column shows presence for object i), and compound predicates about the
objects, the attributes and the object/attribute combinations. All these
predicates constitute the evidence, and it is propositions of the form "the
objects showing presence in attribute j represent an association of interest"
which are being considered to determine a minimal number of recurrent
associations; thus a measure of the extent by which the evidence supports
these assertions is sought. If the sole evidence were to be that there are m
attributes (i.e. without knowledge of the elements of A), this necessarily
leads to a statement that the evidence in support of the jth attribute being
in the optimal solution is equal to that of any other. After evaluation of
the predicates in A, the evidence may suggest otherwise; for example, if
column j' consists entirely of unities, then the evidence is overwhelming that
the whole set of objects can and do show the same attribute state, and that
perhaps the most reasonable course is to define just one association, so that
the evidence in favour of the remaining attributes drops to zero. This
extreme example shows that the evaluation of the evidence in A may lead to
unequal degrees of support for the attributes as potential candidates for
definition of an association. If the degree of support is assigned a
numerical non-negative value, for which zero indicates certainty that the
attribute does not participate in the optimal object solution, and if complete
support is assigned a value of unity, then these (posterior) degrees of
support have the basic formal properties of a finitely additive probability,
and are logical probabilities in the Carnap sense (Fine 1973). Informally,
therefore, such a probability represents the degree of support for a
hypothesis (e.g. the objects showing presence for attribute j are an
association in the optimal solution) given the evidence in the set of
predicates explicit and implicit in A.
With the interpretation of probability just given
The proof of this theorem, given informally by Lefkovitch (1982) and more
formally by Lefkovitch (1985), depends on two components, first, that the set
of all coverings of the objects is a sigma-algebra on the columns of A, and,
second, on the equivalence in information of the complementary dual problem
(that of set representation in A*), namely to determine the relative
importance, q_i, of object i as an indicator of which subsets are in the
optimal covering. (In determining the probabilities, rather than forming
especially since the elements of A and A*, being either 0 or 1, imply that
additions and subtractions can replace multiplications (see Appendix 1).)
III (b) It is rare that the states of all attributes are known for all
objects, so that it is not always possible to specify that aij = 1 or that
aij = 0, because data are missing. In these circumstances, there are
potential difficulties in obtaining the probabilities and the constraints on
the covering solution. While it is possible to exclude objects or attributes
to obtain a partial solution, the following proposal omits reference to the
missing elements in obtaining both p and q. Let K be the number of elements
of A equal to unity or which are missing values, and let IL(L), L=1...K be
their row indices, JL(L), L=1...K their column indices, where JL(L) is positive
if a_ij is unity, and is negative if it is missing. In the Fortran
subroutine given in Appendix I, it is apparent that the missing elements are
omitted in both passes of the iterative procedure, but that excluding
these elements from IL and JL would equate missing values with absence, which
is clearly incorrect.
For frequency data (see above), the arguments leading to finding p from B
are identical with those of A, and lead to the following extension of the
theorem in Lefkovitch (1985).
The constraint matrix for the least cost set covering is then given by

    a_ij = 1, if b_ij > t;
    a_ij = 0, if b_ij ≤ t.
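The thresholding rule can be stated in a line of Python (a hypothetical helper name, for illustration only):

```python
def threshold_matrix(B, t):
    # a_ij = 1 when the estimated probability of occurrence b_ij
    # exceeds the threshold t, and 0 otherwise
    return [[1 if b > t else 0 for b in row] for row in B]

A = threshold_matrix([[0.9, 0.2], [0.5, 0.7]], 0.5)
```

Note that a b_ij exactly equal to t maps to 0, matching the strict inequality in the rule above.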
IV (a) Having obtained the probabilities, all that remains is to use them to
make a final choice of attributes, and to interpret the solution obtained.
Any subset of the columns of A can be indicated by the binary vector x, and
evaluated as a conjectured covering. If x fails to satisfy the constraint
Ax ≥ 1, it is immediately disqualified by lemma 1; if it satisfies the
constraint, the joint probability of the chosen subsets given the hypothesis
represented by x is clearly ∏_j p_j^(x_j). Clearly, x chosen to satisfy the
constraints and to maximize the joint probability is an optimal solution.
Formally, this problem is equivalent to least-cost set-covering; if c_j is
the "cost" of including subset j, which here is defined as -log p_j, the
optimal choice is given by the vector x for which
summarised as
    min( -Σ_j x_j log p_j | Ax ≥ 1, x_j ∈ {0,1} )
and requires a different solution procedure. Bounds on the solution to this
problem are given by the solutions to the first two possibilities, and will
indicate whether an exact solution is worth the seeking.
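For small problems the covering optimization can be done exhaustively. The sketch below (illustrative only; the reductions and bounds discussed in the text are what make realistic problems tractable) enumerates binary vectors x, keeps those satisfying Ax ≥ 1, and minimizes -Σ x_j log p_j:

```python
from itertools import product
from math import log

def optimal_covering(A, p):
    # A: n-by-m incidence matrix (rows = objects, columns = candidate subsets)
    # p: covering probabilities of the m columns
    # returns the feasible x maximizing the joint probability prod p_j^{x_j},
    # i.e. minimizing -sum x_j log p_j subject to every object being covered
    n, m = len(A), len(A[0])
    best, best_cost = None, float("inf")
    for x in product((0, 1), repeat=m):
        if all(any(A[i][j] and x[j] for j in range(m)) for i in range(n)):
            cost = -sum(log(p[j]) for j in range(m) if x[j])
            if cost < best_cost:
                best, best_cost = x, cost
    return best
```

With A = [[1,1,0],[1,0,1]] and p = (0.6, 0.3, 0.1), the first column alone covers both objects at the lowest cost, so the optimum is x = (1, 0, 0).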
V - HYPOTHESIS TESTING
    L1 = n! ∏_i p_i^(n_i) / n_i!

where n is the number of objects, n_i the number in each subset, and Σ_i p_i = 1.
If the subsets form a covering, it is clear that Σ_i n_i > n and so L1 is not
applicable. Suppose (for the moment) that the intersection of three (and
more) distinct subsets is empty and let n_ij denote |I ∩ J|, where I, J denote
the objects in two distinct subsets, and let p_ij = p_i p_j. The problem,
therefore, is to adjust L1 for these intersections. In particular, the
numerator should be reduced by the size of the weighted probability against
    (1 - p_ij...m)^(n_ij...m)
2. If for t subscripts all intersections are empty, then so will be those for
t+1, t+2, ..., m.
3. A good approximation can be made to Lm by using L2 (or perhaps L3).
This follows from the following lemma.
VI - NUMERICAL EXAMPLES
Using the first of two (artificial) examples given by Andre (1984), the
incidence matrix corresponding to his figure 1 is given in table 1a together
with the computed probabilities of the lists; the optimal covering was
obtained by the reductions and is given in table 1c. As noted above, the dual
problem can also be solved by the methods of this paper, namely, which sites
should be grouped. In the absence of spatial contiguity information, the
cost-free reductions yielded the unique solution to be sites {1-12}, and {11,
13-25}, with indicator species a and h respectively. Only site 11 is in
Species
Sites a b c d e f g h i j Probabilities
1 1 0.0119
2 1 0.0119
3 1 0.0119
4 1 1 0.0198
5 1 1 1 1 1 0.0382
6 1 1 1 1 1 0.0502
7 1 1 1 1 1 0.0382
8 1 1 1 1 1 0.0502
9 1 1 1 1 1 0.0382
10 1 1 1 1 1 1 0.0582
11 1 1 1 1 1 0.0458
12 1 1 1 1 1 0.0502
13 1 1 1 1 1 1 0.0428
14 1 1 1 1 1 1 1 1 0.0635
15 1 1 1 1 1 1 1 1 0.0635
16 1 1 1 1 1 1 1 1 0.0635
17 1 1 1 1 1 1 0.0592
18 1 1 1 1 1 0.0402
19 1 1 1 1 1 1 0.0521
20 1 1 1 1 1 1 0.0521
21 1 1 1 1 1 1 0.0521
22 1 1 1 0.0258
23 1 1 1 0.0258
24 1 1 0.0258
25 1 1 0.0170
Species
Sites a c f j Probabilities
4 1 0.0198
6 1 1 1 0.0502
11 1 1 0.0458
12 1 0.0428
14 1 1 0.0635
17 1 1 0.0592
18 1 0.0402
Agrostis canina. 1 13 7 36
A. tenuis. 2 11 8 23 9 20 38 28
Anthoxanthum odoratum. 3 2 8
Blechnum spicant. 4 1
Calluna vulgaris. 5 2 3
Carex binervis. 6 64
Deschampsia flexuosa. 7 94 71 95 95 81 99 100 99 100 48 86 66 86 94 98 70 17 74 94 98 84 93 12 24 22
Festuca ovina. 8 95 92 59 66 53 47 18 23 54 30 26 51 12 57 26 30 67 24 100
Galium saxatile. 9 13 11 13 13 11 24 1 17 4 34 6 6
Holcus lanatus. 10 2
Juncus squarrosus. 11 2 3 1 1 4
Luzula campestris. 12 1
Nardus stricta. 13 5 1 8 31 33 27 36 13 3 16 59 80 81 59 53 46 4 2
Potentilla erecta. 14 2
Pteridium aquilinum. 15 2 17 17 7 1 2 2 13 1 14 17
Rumex acetosa. 16 28
Vaccinium myrtillus. 17 100 100 100 100 99 100 100 100 99 79 99 98 97 100 98 100 1 45 10 39 31 27 79
V. vitis-idaea. 18
common.
Two further analyses to smooth the data were made using the Jaccard and
Russell/Rao similarity coefficients for estimating the similarity among the
species. Using phase 1 of conditional clustering (see section II(c)), eight
subsets were generated from each, of which the same three remained after the
cost-free reductions. All three (table 1d) were mandatory with respect to the
constraints, and so probabilities did not need to be estimated. It can be
seen that while the optimal solutions for the analyses have much in common,
the groupings obtained from the similarity coefficients suggest more species
as characteristic than in the direct analysis of the incidence table, and
also, somewhat surprisingly, places j in a group by itself. Since the
presence of species j implies that species e is present, it seems more
reasonable that these two should belong in the same group, as in table 1c.
The second example uses frequency data for 18 species in 25 sites given
by Dale (1971) and reproduced in table 2a. Estimates of the site
probabilities both from the matrix B and from A formed as a presence/absence
array, are given in table 2b. Both arrays suggested the same eight species
associations (table 2c), of which the first six are mandatory. There was
considerable overlap among the associations. The analyses were repeated using
a threshold to eliminate infrequent species; all values in table 2a exceeding
25% were retained, and others replaced by zero. As a result, only three
associations were obtained (table 2d).
The indirect method was also used, i.e. computing dissimilarities as the
sum of the absolute values of the difference in frequencies, and using the
subset generating procedures described in section II(c) to smooth the
relationships. 12 subsets were generated, none of which were mandatory.
There were four subsets in the optimal covering (table 2c). Comparing tables
2c, d, e suggests that while there are differences, there are also some
apparent recurrent species associations.
Remembering that the role of clustering is to provide candidate groupings
of objects for further evaluation, this diversity of result demonstrates the
need for further ecological investigations to determine which if any
association is more than random.
VII - DISCUSSION
interest to decide if there is any merit in the present proposal compared with
others previously made. Traditional procedures seem to be as follows:
1. determine a relational measure among each pair of objects;
2. by some clustering method, determine subsets of the objects.
There are many different relational measures which have been proposed (Gower
and Legendre 1986). Each has its arbitrariness and hidden assumptions; there
does not seem to be any single relational measure which is superior in all
circumstances, or which by assuming a particular probability distribution for
the observed incidence matrix, does not impose more structure on the data than
they themselves have. Nevertheless, the conversion of attribute data into
similarities, and the (re-)generation of an incidence table by the algorithm
of section II (c), can be regarded as a smoothing process, which may lead to
simpler solutions. In the present proposal, although A^T A* can be regarded as
a relational measure among the attributes, there is none among the objects
which replaces the data; these remain as lists of objects which exhibit the
same attribute state.
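The quantity A^T A* mentioned here can be computed directly. In this sketch (a hypothetical function name, for illustration), A* is taken as the elementwise complement of A, so that entry (j, k) counts the objects showing attribute j but not attribute k:

```python
def attribute_relation(A):
    # R = A^T A*, with A* the complement of A (a*_ij = 1 - a_ij);
    # R[j][k] = number of objects with attribute j but without attribute k
    n, m = len(A), len(A[0])
    return [[sum(A[i][j] * (1 - A[i][k]) for i in range(n))
             for k in range(m)] for j in range(m)]
```

The diagonal of R is always zero, since no object can show an attribute and simultaneously lack it.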
The two assumptions which are made are :
1. the principle of indifference, which is used to obtain the probabilities;
here, this is equivalent to the maximum entropy principle to obtain a
probability distribution just consistent with the structure of the data
without imposing further structure (such as that arising from the
assumption of a Poisson, binomial etc. distribution); and the principle
2. of maximum joint probability, which in the present formulation is
equivalent both to that of minimum cross-entropy and of maximum
likelihood (Lefkovitch 1985); this is used to select the attributes from
among those seen. It has been shown that any choice has to agree with
this principle (Shore and Johnson, 1980) if consistency is desired; this
contrasts sharply with traditional clustering procedures, whose
assumptions are rarely known (or are even knowable) in the context of
consistency.
An open question is whether the probabilities should be obtained from the
original A or from this array after duplicate attributes (i.e. identical
columns in A) have been eliminated. The numerical values, after allowing for
the different standardization, can be very different if the duplication is
considerable for some attributes. A decision to retain duplicate columns
clearly depends on the original sampling procedure for their choice; if it was
random (see also below), it seems preferable to use the original A. In any
case, it is not difficult to obtain the probabilities from both arrays, and to
come to some decision based on both solutions; cluster analysis, after all, is
a hypothesis generating procedure and not an evaluation.
The second component of traditional group-forming procedures is the use
of a clustering algorithm; this requires a choice from the plethora currently
available, since each has requirements about the metric, and makes somewhat
arbitrary even if plausible definitions of relationships among compound
subsets, as well as in the initial definition of dissimilarity. The end
result of many of these methods is usually a dendrogram, so that it is
necessary to make further assumptions to obtain the subsets of the objects.
In the present proposal, the dissimilarity, clustering and reconstruction
phases are avoided, since the incidence matrix itself gives the candidate
subsets; the only problem is to choose from among these. The choice is based
on the logic of implication, on the duality of the information in the rows of
a table with that in the columns (see the proof of the theorem in Lefkovitch
1985), and on the classical principle of maximizing the joint probability.
The only component which is somewhat unfamiliar is the meaning of this
probability, since it is not a frequency nor is it subjective, but has a
logical interpretation in the sense of Carnap.
Although linear least-cost set covering is NP-complete, its special
structure makes it one of the easiest of integer programs to solve, primarily
because of the reductions which are possible. In the present context, because
the costs are a function of the constraints, the problem is even further
simplified, and arguments based on worst case performance can be neglected.
It is also conjectured that because of the definition of the probabilities,
(p_j > p_k) ⇔ (J ⊃ K), which implies that reduction rules 1-3 can be performed
eigenvector of
yv
which clearly differs from
    A^T A* p = (A^T 1 1^T - A^T A) p = λ p
of the present paper.
In the present model, the rows (= objects), columns (= attributes) and
elements of A are not regarded as being random. There is a superficially
similar set of circumstances, arising from item analysis (see Rasch 1960;
Andersen 1980; Tjur 1982), which by contrast, assumes that the aij are
independent Bernoulli random variables, with
    Pr(a_ij = 1) = α_i / (α_i + β_j)
where α_i is a row parameter which increases with the increasing 'ability' of
object i to show the suite of attributes under consideration, and β_j is a
column parameter which decreases with the increasing 'difficulty' of attribute
j to be shown by the objects under consideration. The objective of the
analysis, which is to estimate α_i, is different from that of the present
paper, which is to identify recurrent sets of individuals. The Rasch model
leads to determining the set representation probabilities of A, given that
they balance the covering probabilities of A*, and so a.t is equivalent to
qi of the present paper. It should emphasized, however, that in the set
covering model, there is no probabilistic interpretation of the elements of A,
and that p and q have meaning only with respect to providing evidence relevant
to propositions about the grouping of objects. Were either the rows, columns
or elements of A to be regarded as random samples from populations of rows or
columns, then it is apparent that the Rasch model would be of interest, and
advantage could be taken of any hypothesis tests which may be relevant.
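The Bernoulli model above can be sketched in a few lines (a hypothetical illustration, not part of the original paper; the parameter values are invented):

```python
import random

random.seed(42)

# invented row 'ability' and column 'difficulty' parameters
alpha = [2.0, 1.0, 0.5]
beta = [0.5, 1.0, 2.0]

# each aij is an independent Bernoulli draw with Pr(aij = 1) = alpha_i / (alpha_i + beta_j)
A = [[1 if random.random() < a / (a + b) else 0 for b in beta]
     for a in alpha]
```

Rows with large alpha tend to show many attributes, and columns with large beta are rarely shown; this randomness is exactly the contrast with the fixed incidence matrix A of the set covering model.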
ACKNOWLEDGEMENTS
REFERENCES
      SUBROUTINE COVPRB(N,M,K,IL,JL,P,Q,Y,TOL)
C
C THIS SUBROUTINE OBTAINS BOTH THE SET COVERING AND
C SET REPRESENTATION (COMPLEMENTARY PROBLEM) PROBABILITIES
C
C N IS THE NUMBER OF ROWS
C M IS THE NUMBER OF COLUMNS
C K IS THE NUMBER OF ELEMENTS
C IL IS A VECTOR OF LENGTH K CONTAINING THE ROW
C INDICES OF THE ELEMENTS OF A
C JL CONTAINS THE CORRESPONDING COLUMN INDICES
C WHICH IF NEGATIVE INDICATE MISSING (NOT ABSENT) DATA
C P WILL CONTAIN THE COVERING PROBABILITIES
C Q WILL CONTAIN THE REPRESENTATION PROBABILITIES
C Y IS A WORK VECTOR OF LENGTH M
C TOL IS A CONVERGENCE CRITERION
C
C THIS SUBROUTINE IS NOT PROTECTED AGAINST N, M, K, IL OR
C JL BEING ZERO ETC. ON INPUT OR FOR Z = 0.0
C DURING THE CALCULATIONS
C
      DIMENSION P(M),Y(M),Q(N)
C
C IT IS SUGGESTED THAT P,Q,Y,Z,V,TOL BE DOUBLE PRECISION
C REMOVE THE C IN COLUMN 1 FROM THE NEXT CARD,
C     DOUBLE PRECISION P,Q,Y,Z,V,TOL
C AND REPLACE ABS BY DABS, 0.0 BY 0.D0, 1.0 BY 1.D0
C WHERE APPROPRIATE
C
      INTEGER*2 IL(K),JL(K)
C
C INITIALIZE P
C
      Z=1.0/FLOAT(M)
      DO 1 J=1,M
    1 P(J)=Z
C
C INITIALIZE A NEW ITERATION
C
  100 DO 5 I=1,N
    5 Q(I)=1.0
      DO 10 J=1,M
   10 Y(J)=0.0
C
C NOW DO AN ITERATION
C
      DO 15 L=1,K
      J=JL(L)
      IF(J.LT.0) GO TO 15
      I=IL(L)
      Q(I)=Q(I)-P(J)
   15 CONTINUE
      Z=0.0
      DO 20 L=1,K
      J=JL(L)
      IF(J.LT.0) GO TO 20
      I=IL(L)
      Y(J)=Y(J)+Q(I)
      Z=Z+Q(I)
   20 CONTINUE
C
C NEW VALUES OBTAINED; CHECK FOR CONVERGENCE
C
      W=0.0
      DO 25 J=1,M
      V=Y(J)/Z
      W=W+ABS(P(J)-V)
   25 P(J)=V
      IF(W.GT.TOL) GO TO 100
C
C STANDARDIZE Q; P IS ALREADY STANDARDIZED
C
      Z=0.0
      DO 30 I=1,N
   30 Z=Z+Q(I)
      DO 35 I=1,N
   35 Q(I)=Q(I)/Z
      RETURN
      END
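For readers without a FORTRAN compiler, the iteration of COVPRB can be transcribed into Python as follows (a sketch of the same algorithm, not part of the original listing; names follow the subroutine):

```python
def covprb(elements, n, m, tol=1e-9, max_iter=1000):
    """Iterate the set covering probabilities p (over columns) and the set
    representation probabilities q (over rows), as in subroutine COVPRB.
    `elements` lists the (i, j) pairs, 0-based, for which a_ij = 1;
    missing data (negative JL in the FORTRAN version) are simply omitted."""
    p = [1.0 / m] * m
    q = [1.0] * n
    for _ in range(max_iter):
        # q_i = 1 - sum of p_j over the attributes shown by object i
        q = [1.0] * n
        for i, j in elements:
            q[i] -= p[j]
        # redistribute the q_i over the columns and renormalize
        y = [0.0] * m
        z = 0.0
        for i, j in elements:
            y[j] += q[i]
            z += q[i]
        new_p = [yj / z for yj in y]
        converged = sum(abs(a - b) for a, b in zip(p, new_p)) <= tol
        p = new_p
        if converged:
            break
    s = sum(q)
    return p, [qi / s for qi in q]

# symmetric 3 x 3 example: each row and each column holds two ones
elems = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2)]
p, q = covprb(elems, 3, 3)
# by symmetry both distributions are uniform: p = q = [1/3, 1/3, 1/3]
```

Like the FORTRAN version, this sketch is not protected against a zero normalizing constant z during the iterations.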
      SUBROUTINE SETCOV(MVRS,NCON,A,COEF,BETA,EPS,MITR,ISEED,PC,
     1 ULMT,XBEST,FBEST,X,XX,T)
C
C INPUT
C ***************
C MVRS   INTEGER   NUMBER OF COLUMNS OF A
C NCON   INTEGER   NUMBER OF ROWS OF A
C A      LOGICAL*1 BINARY CONSTRAINT MATRIX
C COEF   REAL*4    THE (NON-NEGATIVE) FUNCTION COEFFICIENTS (CHANGED
C                  BY THE SUBROUTINE TO SUM TO UNITY)
C BETA   REAL*4    POSITIVE: TO CONTROL CONVERGENCE (APPROX 5.0)
C EPS    REAL*4    POSITIVE: TO TEST CONVERGENCE (E.G. 0.001)
C MITR   INTEGER   MAXIMUM NUMBER OF CANDIDATE SOLUTIONS
C                  (E.G. MVRS*NCON)
C ISEED  INTEGER   TO INITIALIZE THE RANDOM NUMBER GENERATOR
C PC     REAL*4    0.5 < PC < 1.0 TO DETERMINE NEIGHBOURING SOLUTIONS
C ULMT   REAL*4    UPPER LIMIT ON COEF FOR INCLUSION
C
C OUTPUT
C ***************
C XBEST  LOGICAL*1 THE SOLUTION ARRAY
C FBEST  REAL*4    THE FUNCTION VALUE AT THE OPTIMUM
   55 FNOW=FNOW+COEF(J)
      IF(.NOT.COVER) GO TO 10
C
C FEASIBLE COVERING FOUND. UPDATE C
C DETERMINE IMPROVEMENTS OVER THE BEST AND LAST
C
      C=C/(1.0+BETA*C)
C
C IF BEST SO FAR, KEEP
C
      IF(FNOW.GT.FBEST) GO TO 80
      DO 5 J=1,MVRS
      XX(J)=X(J)
    5 XBEST(J)=X(J)
      FLAST=FNOW
      FBEST=FNOW
      GO TO 100
C
C IF BETTER THAN THE LAST, REPLACE
C OR IF WORSE, THEN 'HEAT UP' RANDOMLY
C
   80 Z=AMAX1((FNOW-FLAST)/C,-1000.)
      IF(Z.GE.0.0.OR.EXP(-Z).GT.RAN(ISEED)) GO TO 100
      DO 85 J=1,MVRS
   85 XX(J)=X(J)
      FLAST=FNOW
  100 CONTINUE
C
C END OF ITERATIONS
C
  130 RETURN
      END
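The acceptance rule printed above (keep the best cover found, replace the last cover if better, otherwise "heat up" randomly) is in the spirit of simulated annealing. A minimal, self-contained Python sketch of that strategy for least-cost set covering (the names, starting point, and cooling schedule are illustrative assumptions, not a transcription of SETCOV):

```python
import math
import random

def anneal_set_cover(columns, costs, n_rows, iters=2000, seed=0):
    """Randomized least-cost set covering in the spirit of SETCOV's
    accept/'heat up' rule. columns[j] is the set of rows covered by
    column j; a candidate solution is a set of selected columns."""
    rng = random.Random(seed)
    m = len(columns)

    def cost(sel):
        return sum(costs[j] for j in sel)

    def feasible(sel):
        covered = set()
        for j in sel:
            covered |= columns[j]
        return len(covered) == n_rows

    best = last = frozenset(range(m))   # the full cover is always feasible
    temperature = 1.0
    for _ in range(iters):
        j = rng.randrange(m)            # flip one column in or out
        cand = last ^ {j}
        if not feasible(cand):
            continue
        delta = cost(cand) - cost(last)
        # keep improvements; accept a worse cover with probability exp(-delta/T)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            last = cand
            if cost(last) < cost(best):
                best = last
        temperature *= 0.999            # slow cooling ('heat' decreases)
    return set(best), cost(best)

cols = [{0, 1}, {1, 2}, {0, 2}, {0, 1, 2}]
sel, c = anneal_set_cover(cols, [1.0, 1.0, 1.0, 1.5], 3)
```

The returned selection is always a feasible cover, and its cost never exceeds that of the full cover used as the starting point.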
Fractal theory
APPLICATIONS OF FRACTAL THEORY TO ECOLOGY
Serge Frontier
Laboratoire d'Ecologie numerique
Universite des Sciences et Techniques de Lille Flandres Artois
F-59655 Villeneuve d'Ascq Cedex, France,
and Station marine de Wimereux
B.P. 68, F-62930 Wimereux, France
Abstract - Forms with fractal geometric properties are found in ecosystems. Fractal geometry
seems to be a basic space occupation property of biological systems. The surface area of the
contact zones between interacting parts of an ecosystem is considerably increased if it has a fractal
geometry, resulting in enhanced fluxes of energy, matter, and information. The interface structure
often develops into a particular type of ecosystem, becoming an "interpenetration volume" that
manages the fluxes and exchanges. The physical environment of ecosystems may also have a
fractal morphology. This is found for instance in the granulometry of soils and sediments, and in
the phenomenon of turbulence. On the other hand, organisms often display patchiness in space,
which may be a fractal if patches are hierarchically nested.
A statistical fractal geometry appears along trips and trajectories of mobile organisms.
This strategy diversifies the contact points between organisms and a heterogeneous environment,
or among individuals in predator-prey systems. Finally, fractals appear in abstract representational
spaces, such as the one in which strange attractors are drawn in population dynamics, or in the
case of species diversity. The "evenness" component of diversity seems to be a true fractal
dimension of community structure. Species distributions, at least at some scales of observation,
often fit a Mandelbrot model fr = f0 (r + β)^-γ, where fr is the relative frequency of the species
of rank r, and 1/γ is the fractal dimension of the distribution of individuals among species.
Fractal theory is likely to become of fundamental interest for the global analysis and
modelling of ecosystems in the future.
INTRODUCTION
The importance of fractal geometry in the morphology of living beings has often been
stressed, for scales of observation ranging from intracellular organelles (mitochondria) to entire
organisms (trees) to vegetation physiognomies. Fractal geometry not only is an attempt to search
for an order in the inextricable morphology of living beings, but seems to point out some property
that is essential for the functioning of life. Indeed, life is made of ceaseless interactions, and
incessant fluxes of matter, energy and information through interfaces, which at first sight look like
surfaces. As a matter of fact, it is at the level of these interfaces that the geometry becomes
inextricable, suggesting an interpenetration volume instead of a smooth surface, between two
adjacent interacting elements. Actually, they are neither surfaces nor volumes, but fractals.
When an organism grows in size (without any change of form), surface areas increase less
rapidly than volumes. In order for the surfaces to grow at the same rate as the volume, a
particular highly folded morphology has to develop, which strongly reminds one of fractal objects.
At scales larger than organs and organisms, there first appears the population, then the
ecosystem, which is an interacting system of various populations and the environment. Ecology is
the science of these interactions, which are produced by fluxes of energy and matter and by
information exchanges. Once more, a fractal organization is visible here. Since little has been
written about "fractal ecology" up to now, my purpose is to review what can be considered as
fractals at the scale of the ecosystem. Such an inventory has to include the following:
- Forms characterising the contact between organisms, between organisms and the
environment, between communities and the environment, and among ecosystems. In developing
these forms, fractal structures seem to be part of the biological strategy at all scales of observation.
- Size frequency distributions, which often have a fractal dimension (Mandelbrot 1977,
1982; Section 1 and Fig. 3 below).
- Spatial distributions of organisms (patchy distributions, and so on).
These first three items describe a strategy of space occupation. There are also strategies of
time-and-space occupation:
- Paths or trajectories make it possible for organisms to increase the number of their
contacts with a heterogenous environment, or among populations; all these increase the rates of
interaction.
The uninitiated reader can refer to the Appendix, where definitions, elements of fractal
theory, and methods of computation are presented.
[Note: references to Mandelbrot without a date concern the 1977 or the 1982 editions of his book;
both contain an extensive bibliography].
1. FRACTAL FORMS IN ECOLOGY
The fractal shapes (morphologies) observed at scales greater than the individual are just
a continuation of those observed inside the cells and organisms. For instance, the fractal
dimension of the external surface of mitochondria is 2.09, that of the internal surface 2.53, and
that of the inner membrane of the human lung 2.17. Intuitively speaking, the fractal dimension indicates a
certain degree of occupation of the physical space by a contorted, ramified, or fragmented surface,
where some exchanges occur. The histological structure of any tissue appears as a kind of
sponge, whose fractal dimension is between 2 and 3. Moreover, tissues are connected with their
biological environment (inside and outside the organism) thanks to an organized circulation of
substances, which may have the form of an arborescence of canals (invaginations); or, the tissue
may have its external surface ramified (evagination). For example, the branching out of the
bronchioles inside the lung has a fractal dimension which is slightly less than 3. As a matter of
fact, the circulation of substances is one of the main factors coupling two living webs or
organisms, and it cannot be dissociated from energy flow; according to Morowitz's (1968)
principle, any flux of energy is associated with (at least) one cycle of matter in dissipative
systems. Figure 1 shows two isomers A and B in a state of energy equilibrium. An energy flow
crosses the system and is coupled with transport mechanisms between points of different energy
levels; there is either turbulent diffusion or organized channels (both being represented in Fig. 1),
resulting in a cycle of matter. A fractal geometry is a logical requirement for the wall
configuration and the transport system, in order to accelerate the energy flow through the system,
as well as the cycling of matter. That particular geometry can be directly observed in the
morphology of trees, for example, where the canopy allows sufficient contact between the
atmosphere and the chlorophyll web, and the beard of roots and rootlets allows an intimate contact
with the nutrients in the soil. Other examples of contact between organisms and the environment
are given by animal lungs and gills, filtering apparatuses, and so on. Figure 2 shows various
fractals evoking canopies, root beards or bronchioles, tremendously increasing the contact
between the organism and the medium, as the black and white parts do.
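In practice, fractal dimensions such as those quoted above are estimated by box counting (a method also treated in the Appendix). As an illustration, not taken from the original text, the box-counting dimension of a Sierpinski carpet can be recovered numerically; this sponge-like plane figure, with dimension between 1 and 2, plays the role of the tissue "sponges" of dimension between 2 and 3 described above:

```python
import math

def carpet_cells(level):
    """Cells of a Sierpinski carpet after `level` subdivisions,
    as (row, col) indices on a 3**level by 3**level grid."""
    cells = {(0, 0)}
    for _ in range(level):
        nxt = set()
        for r, c in cells:
            for dr in range(3):
                for dc in range(3):
                    if (dr, dc) != (1, 1):        # drop the centre square
                        nxt.add((3 * r + dr, 3 * c + dc))
        cells = nxt
    return cells

def box_dimension(cells, level):
    """Least-squares slope of log N(s) against log(1/s), counting
    occupied boxes of size s = 3**-k for k = 1 .. level."""
    xs, ys = [], []
    for k in range(1, level + 1):
        factor = 3 ** (level - k)                 # coarsen to a 3**k grid
        boxes = {(r // factor, c // factor) for r, c in cells}
        xs.append(k * math.log(3.0))              # log(1/s)
        ys.append(math.log(len(boxes)))           # log N(s)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)

d = box_dimension(carpet_cells(4), 4)
# theoretical value: log 8 / log 3, about 1.89
```

Because every subdivision keeps 8 of the 9 sub-squares, the fitted slope recovers the exact dimension log 8 / log 3.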
In other cases, the fractal geometry responsible for the efficiency of the system is more
subtle. Sometimes biomass uses the fractal geometry of its physical environment instead of
Fig. 1. Energy flow and matter cycling through a fractal wall geometry. Inside the system, A and
B are two isomers whose equilibrium depends on the energy level. They are transported from
one wall to the other either by diffusion, or by a spatially organized transport mechanism. The
broad arrows symbolize energy flows. The dashed lines represent matter cycling. Modified from
Morowitz (1968).
organizing itself in a fractal form. For example, in the aquatic environment, the enhancement of
contact surfaces is obtained by parcelling out the biomass into isolated cells; this is the strategy
followed by bacteria and phytoplankton cells, where the renewing of contact surfaces with water
is produced by turbulence: Mandelbrot demonstrated that the geometry of turbulence is fractal, for
it is composed of eddies, which dissipate into smaller and smaller ones -- a typical fractal process
-- up to the scale of viscosity. The fractal dimension of dissipation is approximately 2.6. The
fractal dimension of boundaries of wakes and clouds is 2.3.
The importance of turbulence for pelagic production, and of contacts and shears
between complementary water bodies and currents, is well known (Legendre 1981; Legendre and
Demers 1984). It is important both for primary production, and for the exploitation of this
primary production by consumers. Moreover, turbulence is sometimes induced by organisms,
when either they shake the surrounding water, or constitute a roughness that increases the velocity
of eddies within a previously regular current (Frechette 1984).
The environmental fractal geometry used by organisms is also seen in the soil and in
sediments, where organisms are moving and growing. Any sediment or soil is characterized by a
particular distribution of grain sizes, which is their granulometry. Smaller grains are lying
between the larger ones, resulting in a picture that can be schematized as an arrangement of
spheres (Fig. 3). This arrangement can be studied for its fractal geometry. Some general
properties of soils, related to percolation and water retention by surfaces, depend on this fractal
geometry. It would be interesting to see whether the size distribution of organisms, from the
tiniest ones (bacteria) to the biggest (vertebrates), also has a fractal-type regularity, and whether its
fractal dimension is linked to that of the medium, which can be made of solid particles or be
aquatic and turbulent.
At another scale of observation, limnologists have long known of relations between the
morphology of ponds and lakes and such biological properties as overall productivity
(Hutchinson 1957; Ryder 1965; Wetzel 1975; Adams and Oliver
1977). Lake morphology, as well as the "morphoedaphic index", has always been expressed in
terms of a ratio between the length of the shoreline and the volume of water, but we know today
that the shoreline is a fractal and that its "length" is not uniquely defined, depending on the stride
length (or "yard-stick" of Mandelbrot) that has been used to measure it. Figure 4 indicates the
fractal dimension of a lake shoreline, following Kent and Wong (1982). It follows that it is not
the length/volume ratio, but its fractal dimension, that ought to be correlated with ecosystem
properties. This has been pointed out by Kent and Wong, but without any deep investigation of
the relationship, which they only assumed to exist; the process can be seen in the fact that the
littoral zone of lakes (the extent of which depends on the fractal dimension of the shoreline) brings
together the primary producers and the decomposers, thus accelerating the cycling of matter. It
seems necessary to persevere in this way, renewing entirely the notion of "morphoedaphic index"
in the light of fractal theory.
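The yard-stick effect follows Mandelbrot's scaling law L(s) ∝ s^(1-d): the measured "length" grows without bound as the stride s shrinks, and the slope of the log-log plot gives the dimension. A sketch using exact Koch-curve counts in place of real shoreline data (illustrative only; d = log 4 / log 3 ≈ 1.26 for the Koch curve):

```python
import math

# For the Koch curve a stride s_k = 3**-k measures exactly 4**k segments,
# so the apparent length is L_k = 4**k * 3**-k = (4/3)**k.
ks = range(1, 8)
log_s = [k * math.log(1.0 / 3.0) for k in ks]
log_L = [k * math.log(4.0 / 3.0) for k in ks]

# least-squares slope of log L against log s; the scaling law gives slope = 1 - d
n = len(log_s)
ms, mL = sum(log_s) / n, sum(log_L) / n
slope = sum((x - ms) * (y - mL) for x, y in zip(log_s, log_L)) \
    / sum((x - ms) ** 2 for x in log_s)
d = 1.0 - slope
# d recovers log 4 / log 3, about 1.26
```

With real shoreline coordinates the counts would come from walking a divider of opening s along the curve, but the fitting step is the same.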
More precisely, a shoreline is a contact zone between two ecosystems, an aquatic and a
terrestrial one (including soil and vegetation). Often the limit between them cannot be stated
precisely, because a particular ecosystem (contact or interface ecosystem) develops in the vicinity
of the water-soil contact line: interpenetration area, reed-belts and their fauna, intensified
exchanges, etc. The shallow coastal stretch is very important in the economy of the whole lake,
and also of the surrounding terrestrial ecosystem (both ecosystems may "exploit" it). The surface
area of that contact ecosystem is then important, and it depends upon the "length" of the theoretical
boundary -- or, more precisely, on its fractal dimension. Figure 4c explains that "law", starting
from the assumption that the interface ecosystem can only develop within a distance L from the
geometrical boundary.
Fig. 4. Shoreline of Gull Lake, Ontario, Canada. a: Map of the lake. b: The length of the
shoreline is a decreasing function of the length of the "yard stick" used for measuring it. The
fractal dimension can be inferred from the slope of the line (see text); it could provide a new type
of "morphological index" for lakes. Modified from Kent and Wong (1982). c: Two different
fractal dimensions d of the shoreline, resulting in two different areas of land/water
interpenetration surface; this surface is here defined as the set of points located within a maximum
distance L from any point of the fractal shoreline.
Similar indexes could certainly be described for the contact zones between other pairs
of ecosystems (Frontier 1978), such as a forest-savanna contact zone, a coral reef, and so on.
They should include both the structure of the multispecies living community and the fractal
morphology of the landscape, as represented for instance in Mandelbrot's recent landscape
models.
Let me now discuss the limitations of the fractal geometric model in biology and
ecology. Fractal theory is, to my knowledge, the first mathematical theory that explicitly uses the
notion of observation scale, for in building up a fractal object, it states that the same generative
process repeatedly acts from scale to scale, following a so-called "cascade". Nevertheless, the
reality of the scale in a mathematical fractal is, so to speak, immediately obscured by the generative
process itself because, when looking at a fractal picture, it is impossible to infer at what scale it is
actually considered; all scales are equivalent and indiscernible from the form itself. For example,
a theoretical tree is branching out ad infinitum, any tree being a branch of a larger one, and so
on, following the rule of self-similarity. Consequently, the very question of the scale at which we
are looking at a particular branching has no mathematical meaning.
On the other hand, a real biological object, such as a living web, does not look the
same at different observation scales. For example, when looking at a histological preparation
under a microscope, with a little knowledge of histology one can infer the scale from the structure
seen, even without knowing what magnification is being used.
It has been shown that the fractal dimension of the shape of a coral reef changes for
different intervals of the observation scale (Bradbury et al. 1983, 1984; Mark 1984), being
approximately 1 if measured with 20 cm to 2 m steps, and a little more than 1.1 outside that
interval; transitions are sharp. We could say that a biological object, in which a fractal geometry
can be recognized, actually "mimics" a fractal over some range of scales. The lung branches out
Fig. 5. Two fractal figures from Mandelbrot (1982, with permission). a: Model evoking a spruce
tree forest; its fractal dimension is 1.88. b: Model evoking the Roscoff coastline (location of this
NATO Workshop); its fractal dimension is 1.61. c: The generator of figure (b).
23 times, a fish gill 4 times, and so on; beyond these limits, organs belong to other fractals. That
"fractality" of the living matter represents a developmental strategy by which living matter is able
to conduct the volume of exchanges that are necessary for the biomass to remain alive, and which
imply a sufficient surface/volume ratio. So the fractal view of the object is only a mathematical
model, pertinent at one observation scale or between two scales, that describes the developmental
strategy at those scales, in the same way as a mathematical smooth surface describes a leaf or a lung
surface at a particular observation scale. We do not have to expect any "real" (mathematical)
fractal to stand out in nature, no more than a "real" plane; this is also true for any artificial object,
for the smoothest technological object has a very rugged surface, when examined at high
magnification.
Rather than calculating only the fractal dimension within an interval of scales, it is
perhaps more interesting to look for those scales of observation where the fractal dimension is
changing, because at these critical scales, the constraints of the environment that act upon the
biomass are changing too.
Properties of non-living matter also depend on the observation scale. For instance, the
same fractal dimension can be observed over a very broad range of scales, as in "breaking
surfaces" (ten orders of magnitude: Mandelbrot, pers. comm.) or in turbulence. The breaking of
stony material is bounded between the planet scale and that of atoms, while turbulence is bounded
between the planet scale again and the scale of molecules, where it turns out to be viscosity. At
intermediate scales, we can recognize viscosity, lapping, waves, local currents, and geostrophic
currents. From the point of view of the living organisms or of the ecosystems, these are not the
same phenomenon at all, since organisms and ecosystems have to adapt themselves in different
ways according to the scale, resulting in different morphologies, behaviours or fractal
dimensions.
If a tree grew indefinitely, a problem of sap supply to the leaves would arise.
Conversely, if it branched out infinitely, the result would be a clogged felting, which
would hinder both air circulation along the tissues, and sap circulation inside them because of
viscosity. Hence branching out cannot be infinite either towards huge or towards small sizes.
For the contact between air and sap to be efficient, the foliage chlorophyllous tissues have to be
organized as a porous sponge -- another fractal structure. The choice of a limited number of
branching steps appears to be an optimizing choice for the transfer of matter and energy.
Another example, which clearly shows that a fractal geometry has to be truncated
instead of going on infinitely, is in the utilization of soil by organisms. Not only are the latter
moving and growing inside it, but a liquid charged with dissolved nutrients, organic molecules
and gas has to be able to circulate within the soil. Remember the fractal model of the set of
spheres with various diameters (Fig. 3), more numerous in proportion as the diameter decreases.
At each step along the observation scale, smaller spheres fill in the holes left by larger ones. If the
process were repeated indefinitely, the sediment would be completely compact. Even before the
sediment could be completely sealed, it would block the water because of viscosity and surface
tension. So, to maintain a sufficient level of porosity, the rate of grain fragmentation into smaller
and smaller ones has to decrease, at least at the level of the smallest grains; that is, the fractal is
necessarily truncated. Adsorbent surfaces are also very important in soil ecology, and a fractal
geometry enhances these surfaces. Since free volumes are also necessary, the soil quality
depends upon a balance between surfaces and volumes. Burrough (1981, 1983) has shown that
granulometry, as well as other properties of soils, exhibits variability in fractal dimension. On the
other hand, the percolation properties of porous materials are presently being thoroughly investigated
by fuel engineers; this was revealed, together with the role of fractal "surfaces" in catalytic
reactions, through the papers presented during the colloquium "Journée application des fractales",
sponsored by the petroleum company Elf-Aquitaine (Paris, 21 November 1985). I suggest that
investigations should be carried out, relating the biological properties of soil with its fractal
structure, in the same way as benthologists are relating benthic communities to the roughness of
the substratum (E. Bourget, in prep.). To summarize, the "fractality" of a living object has to be
described by means of a succession of fractal models, or perhaps an infinity of models if the
fractal dimension changes progressively.
through simulations by Villermaux et al. (1986a, 1986b). They built a "Devil's comb" (Fig. 6)
made of a handle bearing a number of teeth, these teeth bearing smaller teeth, and so on. The
structure of the object, represented in black in the picture, is hollow, so that a substance can
diffuse inside its tubing. The authors modeled the diffusion of a gas up to the very end of the
teeth pattern. While they had thought initially that the molecules would take an infinite amount of
time to reach the ultimate teeth, since it is an infinite process, the result is actually the opposite: the
amount of time required converges to a finite value. Moreover (and what is still more important),
the time is almost the same to fill up the first 4 or 5 sets of teeth, or the entire structure. Finally,
assuming that the internal surface was covered with a catalyst, an efficiency close to maximum is
obtained as soon as 4 or 5 steps of the fractal structure are covered. This is of great importance in
the design of an industrial catalytic apparatus, for it shows that it is not necessary to build more
than 4 or 5 steps. Knowing that, the cost of such an object can be minimized, since the object
becomes more expensive as the amount of detail increases.
Fig. 6. The Devil's comb of Villermaux et al. (1986a, 1986b). Every tooth bears
8 teeth that are 4 times smaller; the fractal dimension is then log 8 / log 4 = 1.5.
The generating process is repeated indefinitely.
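The convergence described above can be shown with a toy computation (the ratio r is an arbitrary assumption for illustration, not Villermaux's diffusion model): if each further set of teeth takes a fixed fraction r of the previous filling time, the total is a convergent geometric series, and the first 4 or 5 stages account for nearly all of it:

```python
# toy model: filling stage k takes time r**k, with r = 0.3 assumed
r = 0.3
stage_times = [r ** k for k in range(50)]

total = sum(stage_times)            # near the limit 1 / (1 - r) of the series
first_five = sum(stage_times[:5])
fraction = first_five / total       # over 99% of the total after 5 stages
```

This is the economy that makes a truncated fractal nearly as efficient as the infinite one.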
In our field of interest, biology, this allows us to understand why trees, or lungs, or
mitochondria, have a fractal morphology with only a limited number of steps (4 to 23); it is
because the chief properties of such a morphology are obtained after a few steps, and it is not
useful for the living object to continue its fragmenting process beyond, at the cost of too delicate
and expensive a morphology. Possibly also, an organism cannot maintain a structure beyond a
certain degree of complexity and delicacy, which would be another reason for living fractals to be
truncated. Any real object with a fractal form is then trying to optimize a life condition in a given
range of scales, and not at others. Let me add that fractal geometry by no means provides an
explanation of forms, but only a description; our astonishment is not to be diminished when
observing living forms, since their morphogenesis has still to be explained.
Fig. 7. Statistical fractal of dimension 1.2, that indefinitely models the scattering of biological
organisms in space.
levels are not limited by membranes or walls. Limits are fuzzy, hence the intervals between
patches created by the process are less evident. The analysis of this type of form requires another
method, which was developed by Fournier d'Albe (1907) and used by Mandelbrot for studying
the distribution of galaxies in the sky; galaxies are separated from living organisms only by "a
few" orders of magnitude, say 15 or 16.
As a matter of fact, it can be observed that the behaviour of predators is complex and
stratified (hierarchical). As soon as a prey species is located, the very broad exploration of the
hunting area is replaced by a more specific behaviour within a smaller spatial range. That is a
response to the patchiness of the prey population because, by definition of an aggregated
distribution, the probability of a prey item existing at a point is enhanced by the presence of other
prey in the vicinity, and conversely. Hence the predator alternately displays a scanning
behaviour, including straight travels from one patch to another, and a more Brownian motion
inside a patch, probably within a hierarchical pattern due to the hierarchical distribution of patches.
It can be conceived (but has yet to be proved) that such a "cascade" or intermittency of behaviours
occurs in conformity with the fractal pattern of the prey distribution. The predator trajectory
resembles a Brownian motion which would be divided hierarchically, following a cascade of
levels. The movement cannot be a perfectly Brownian one in the detail because, in animal
trajectories, the direction at one instant is positively correlated with that of the previous moment,
since sharp changes of direction are costly. Mandelbrot developed a "fractional Brownian"
model of a trip; Figure 9b presents an example of such a motion, with a fractal dimension of 1.11.
A true Brownian motion has a dimension of 2 (Fig. 9a), that is, each point of the plane is likely to
be occupied once by the travelling molecule, which is not the case for the hunting predator.
Fig. 9. a: Classical Brownian motion, d = 2. Dots are the successive positions of the particle.
Segments are interpolated trajectories between two dots. b: Fractional Brownian motion of
dimension d = 1.11. From Mandelbrot (1982), with permission.
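The contrast between pure Brownian wandering and a trajectory whose direction is correlated from one instant to the next can be sketched as a correlated random walk, a standard construction in movement ecology (illustrative only; it is not Mandelbrot's fractional Brownian model): each new heading is the previous one plus a random turning angle, so a small turning variance gives near-straight scanning between patches while a large variance gives near-Brownian searching within a patch.

```python
import math
import random

def correlated_walk(n_steps, turn_sd, seed=0):
    """Unit-step walk in the plane whose heading at each instant is the
    previous heading plus a Gaussian turn of standard deviation turn_sd."""
    rng = random.Random(seed)
    x = y = heading = 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        heading += rng.gauss(0.0, turn_sd)
        x += math.cos(heading)
        y += math.sin(heading)
        path.append((x, y))
    return path

scanning = correlated_walk(200, 0.1)    # nearly straight travel between patches
wandering = correlated_walk(200, 2.0)   # nearly Brownian search inside a patch
```

With the same number of unit steps, the weakly turning walk typically travels much farther from its origin than the strongly turning one.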
obtained varies according to the value of d , which is a fractal dimension, and a whole range of
values of d give a plausible representation of the natural patterns (Fig. 10).
It would be interesting to analyze the trails followed by animals, from that fractal point
of view, as well as the conformity of that fractal line to the fractal pattern of distribution of the
prey in space; I believe it may reflect a fundamental process in the management of biomass and
energy in ecosystems. It is now accepted in ecology that displacements of matter have a
fundamental importance in ecosystems; considering aquatic ecosystems, water movements and the
passive or active movements of biomass have complementary effects. Primary productivity is
enhanced at the level of an interface, or "ergocline" (Legendre and Demers 1985), for turbulence
puts phytoplankton cells in contact with nutrients and with light. Furthermore, animals consume
that primary production, and they are consumed in turn following the trophic chain, which is
usually associated with a set of migration behaviours. The larger organisms eat the smaller ones
while at the same time they undertake longer migrations, so that an overall migration of the
biomass occurs, from the zones with high primary productivity to the zones of low productivity,
Fig. 10. Rayleigh flights of fractal dimension 1.0 and 1.5. From Mandelbrot (1977), with
permission.
Fig. 11. Diagram of a hydrological front, associated primary production, and exploitation by
trophic chains. Production is increased at the interfaces between warm and cold waters
(ergoclines). Contact surfaces are enhanced by turbulence, which possesses a fractal geometry
enhancing the complementarity of the water masses. Primary production is exploited by animal
biomass through a fractal "cascade" of sizes of organisms and of trajectories. Full lines are
interfaces between the two water masses. Dashed lines represent biomass transport; the
trajectories are also fractal curves, as in Fig. 9b and Fig. 10.
as schematized in Figure 11. This is another aspect of the fractal organization of ecosystems: as
the size distribution of organisms is a fractal, the set of movements (at all scales of magnitude)
probably is another one, which is linked to the distribution of sizes. Both are aspects of a strategy
of space-and-time occupation. We can hypothesize that this strategy tends to optimize the flows
of matter and energy. Such an organization is no longer a physical fractal, for no physical form
(systematic or statistical) is measured here, but only the size of the spatio-temporal domain
involved in a trajectory. This is then an abstract case of fractal geometry. I will now look at even
more abstract fractal objects.
These fractals aim at modelling processes; they can be graphed in a phase space. Or,
they can be useful in the abstract description of networks of interacting elements. Let me give
some ecological examples.
An interesting feature is that these strange attractors, that look like "forms in the fog",
are fractal objects. So, the mixture of order and chaos, or of determinism and indeterminism, that
characterizes this mixture of biological or ecological evolutionary solutions, has, as a matter of
fact, a fractal structure. That evidence will probably play an important role in future ecosystem
modelling, although the functional significance of the fractal structure of ecosystem evolution is
still unknown; perhaps it is again a fractal occupation of space-and-time.
Fig. 12. Strange attractor. From Ekeland (1984), with permission.
a: Trajectories in the phase space. T is a stationary attractor. T' is the location of a strange
attractor that produces denser trajectories in certain regions of the phase space.
b: When intersected by a plane, it results in a "strange" picture with crisp and fuzzy parts.
c: Magnifying a part of (b) results in another "strange" picture at another scale of observation. A
mixture of fuzzy and crisp parts is observed again. A part of (c), magnified, reproduces the
pattern seen in (b), thus showing that the strange attractor has a fractal geometry.
The theory of strange attractors has been developed for rather complex physical
systems, but these attractors may be obtained from a very simple set of equations. These
equations cannot be solved analytically, but only stepwise, using a computer. The example in
Figure 12 corresponds to a very simple system of equations, called "Henon's formula" (following
Ekeland 1984):
x_{n+1} = x_n cos α − (y_n − x_n²) sin α
y_{n+1} = x_n sin α + (y_n − x_n²) cos α
After a large number of iterations, made possible thanks to the power of present-day
computers, a set of drawings appears in any plane that may intersect the set of rings of the
multidimensional trajectory. The picture becomes more and more distinct as the iterations are
pursued. After a while, it can be observed that the cloud of points is a fractal: any part of the
cloud contains a miniature model of the whole. For the time being, its fractal dimension cannot be
computed analytically, but only observed. It accounts for a certain degree of occupation of the
phase space by the trajectories.
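The iteration itself is easy to reproduce. The sketch below is illustrative only: the angle α and the starting point are arbitrary choices (a small starting amplitude keeps the orbit bounded), not values from the text.

```python
import math

def henon_orbit(x0, y0, alpha, n_iter):
    """Iterate Henon's area-preserving map:
       x' = x cos(alpha) - (y - x^2) sin(alpha)
       y' = x sin(alpha) + (y - x^2) cos(alpha)
    and return the list of visited points."""
    c, s = math.cos(alpha), math.sin(alpha)
    pts = []
    x, y = x0, y0
    for _ in range(n_iter):
        # both coordinates are updated simultaneously from the old (x, y)
        x, y = x * c - (y - x * x) * s, x * s + (y - x * x) * c
        pts.append((x, y))
    return pts

# arbitrary illustrative parameters; plotting the points reveals the
# ring structure sketched in Figure 12
orbit = henon_orbit(0.1, 0.1, math.acos(0.24), 5000)
```

Intersecting such an orbit with a plane, as in Figure 12b, is what produces the "strange" crisp-and-fuzzy pictures.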
Ibanez and Etienne (submitted) applied a method due to Grassberger and Procaccia
(1983) in order to assign a fractal dimension to a series of 1200 chlorophyll observations along a
transect in the sea. For each observation x(t), they considered the points with coordinates x(t),
x(t − τ), x(t − 2τ), ..., x(t − kτ), τ being equal to an integer multiple of the sampling step, and
k varying from 2 to 9. So k is the Euclidean dimension of a phase space, in which the cloud of
points can be described by a fractal dimension if the chlorophyll record shows any stochastic
regularity. It was observed that the fractal dimension of the attractor increases up to 2 as k varies
from 2 to 6, after which it remains constant. According to Ibanez and Etienne, this means that six
degrees of freedom are sufficient to describe the fractal regularity (of fractal dimension 2) of this
sequence of 1200 observations. It remains to investigate what these 6 degrees of freedom are,
and to search for their ecological significance. It would be interesting to compare this result with
that of a Fourier spectral analysis, or other data treatments.
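The procedure is simple to sketch: delay-embed the record in a k-dimensional phase space, then measure how the fraction of point pairs closer than r scales with r (the slope of log C(r) against log r estimates the correlation dimension of Grassberger and Procaccia). The code below is an outline with a placeholder series — a smooth periodic signal whose attractor is a closed curve — not the Ibanez and Etienne data.

```python
import math

def delay_embed(series, k, tau):
    """Build k-dimensional points (x(t), x(t - tau), ..., x(t - (k-1) tau))."""
    start = (k - 1) * tau
    return [tuple(series[t - j * tau] for j in range(k))
            for t in range(start, len(series))]

def correlation_integral(points, r):
    """C(r): fraction of point pairs at Euclidean distance < r."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(points[i], points[j]) < r)
    return 2.0 * close / (n * (n - 1))

def correlation_dimension(points, r_small, r_large):
    """Crude two-radius estimate of the slope of log C(r) versus log r."""
    c1 = correlation_integral(points, r_small)
    c2 = correlation_integral(points, r_large)
    return (math.log(c2) - math.log(c1)) / (math.log(r_large) - math.log(r_small))

# placeholder record: the embedded attractor is a closed curve, so the
# estimated dimension should come out close to 1
series = [math.sin(0.07 * t) for t in range(400)]
points = delay_embed(series, k=3, tau=5)
```

In the actual analysis one repeats this for k = 2, ..., 9 and watches where the estimated dimension stops increasing, as described above.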
Mandelbrot (cited in Landman and Russo, 1971) gave the following technical example.
Consider a number C of internal elements or "components" of a computer unit, and a number T
of connections with its environment, or "terminals". Computer engineers have observed the
H = − Σ_{i=1}^{S} f_i log₂ f_i
where S is the number of species and f_i is the relative frequency of the i-th species. This
expression represents the average amount of information per individual, knowing that each
individual brings, when determined, a quantity of information that is larger when its
frequency is smaller. It is easy to show that the maximum value of H is obtained when all
species are equally frequent; when the distribution of individuals among species is uniform,
H_max = log₂S. The ratio J = H/H_max is called the evenness. The diversity index can then be
written as (H/H_max) · H_max = J · log₂S; in other words, diversity is the product of its two
components, evenness and number of species (on a log scale). All that is very classical.
H is the mathematical expectation E(−log₂ f_i), when calculated over the set of species
considered. The sample is described by its species frequency distribution, for convenience. We
can write H = E(−log₂ f_i) = log₂A, or A = 2^H, where A is the fictitious number of species
which would give the same diversity index, if these species were equifrequent. It follows that
the evenness J has the form of a fractal dimension, that is, a ratio of two logarithms: J =
log₂A / log₂S, or A = S^J. The latter equation expresses the rate of increase of the
diversity-equivalent number A of equifrequent species when the real number of species S
increases, the evenness remaining the same. These considerations will take their full meaning in
the discussion of the theory of species distributions that follows.
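These relations between H, H_max, the evenness J, and the equivalent number of equifrequent species A = 2^H = S^J can be checked numerically; the frequency vector below is an arbitrary example, not data from the text.

```python
import math

def shannon_diversity(freqs):
    """H = -sum f_i log2 f_i, in bits per individual; freqs sum to 1."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

freqs = [0.5, 0.25, 0.125, 0.0625, 0.0625]   # arbitrary example, S = 5
S = len(freqs)
H = shannon_diversity(freqs)                 # 1.875 bits per individual
H_max = math.log2(S)                         # maximum, reached for equal freqs
J = H / H_max                                # evenness
A = 2 ** H                                   # equivalent number of equifrequent species
# consistency check: A = S**J, since J = log2(A) / log2(S)
```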
5.2.3 - The distribution of individuals among species. One may wish to have species
diversity given by a synthetic description which would be a little more informative than a simple
one-number index. The distribution of individuals among species can be used as a synthetic
parameter; it can take the form of a histogram describing the proportion of frequent, less frequent,
and rare species, divided into a number of classes. Various well-known models have been
proposed in the literature to fit such distributions, as for example the log-normal distribution of
Preston, etc. (Pielou 1975; Legendre and Legendre 1983). Of course, the empirical distribution
can also be used, without being fitted to any model, as a mere synthetic description of species
distribution within a sample.
The distribution often cannot be represented by a histogram because the total number of
species is too small; then, the number of species belonging to each class of abundance is low, and
the histogram becomes very irregular and uninformative. The distribution can, in that case, be
represented as a function of ranks: it is the rank-frequency method (Frontier 1976, 1985).
Species are ordered in decreasing frequencies, as they appear in the community or in the sample.
Each species is represented by a point in a diagram, with rank on the abscissa, and frequency on
the ordinate. The scatter diagram, or "Rank-Frequency Diagram", is monotonically decreasing by
construction, but the shape of the decrease (either linear, or convex, or concave, with steps, etc.)
gives a lot of information about the distribution. This representation is exactly equivalent to a
retro-cumulated frequency function, for it is equivalent to saying that the species with rank r has
the frequency f_r, or that for a fraction r/S of the species, their frequency is larger than or equal
to f_r.
Fig. 13. Some examples of rank-frequency diagrams. The ranks of the species are on the
abscissa while their relative frequencies in the sample are on the ordinate, both on a log scale.
a: Marine benthos, along a pollution gradient (modified from Hily 1983). b: Lake phytoplankton
along a seasonal ecological succession (modified from Devaux 1980). c: Euphausiids in an
East-West transect along the Pacific Equatorial Current (modified from Frontier 1985).
Besides ecology, these diagrams have been used in the past in several other fields, also
dealing with complex interaction systems. Characteristic distributions have been described and
analyzed in socio-economics (Pareto 1896, 1965) and in linguistics (Zipf 1949-1965). Observed
frequency distributions have been fitted to a family of curves, given by the Zipf model, which was
not very well-known to ecologists until recently:
f_r = f_1 · r^(−γ)
Later, the Mandelbrot model, which is a generalization of the Zipf model, was used for the same
purpose:
f_r = f_0 (r + β)^(−γ)
where β and γ are parameters, and f_0 is chosen such that the sum of all f_r values predicted by
the model is 1; the f_r values are relative frequencies. Convergence is possible only when γ > 1.
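For a finite species list, the normalization f_0 is simply the reciprocal of the sum of the unnormalized weights. A minimal sketch (the parameter values are arbitrary illustrations):

```python
def zipf_mandelbrot(S, beta, gamma):
    """Relative frequencies f_r = f0 * (r + beta)**(-gamma) for ranks r = 1..S,
    with f0 chosen so that the S frequencies sum to 1.
    (For an unbounded number of species the sum converges only if gamma > 1.)"""
    weights = [(r + beta) ** -gamma for r in range(1, S + 1)]
    f0 = 1.0 / sum(weights)
    return [f0 * w for w in weights]

freqs = zipf_mandelbrot(S=30, beta=2.0, gamma=1.3)   # arbitrary parameters
# beta = 0 recovers the plain Zipf model, f_r = f_1 * r**(-gamma)
```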
The first interpretation of these curves, made by Mandelbrot (1953), refers to the notion
of a "cost" of an element in an information system, in the framework of information theory. It
does not specify the nature of this cost, nor does it give it a precise value, for the rank-frequency
distribution is very robust in that respect. The distribution of the frequency of words in a
language may respond to a psycho-physiological cost, or perhaps to a sociological one linked with
the amount of time required to assimilate a new notion. Without specifying this cost, a good fit of
the model to data can be observed in the case of real languages, but not in artificial languages such
as Esperanto, nor in the language of young children. Analyzing the diversification of signals in a
code, Mandelbrot demonstrated that the above equation corresponds to an optimum in information
transfer, namely: the costlier signals also have to be the rarest (obviously without disappearing
completely), and the maximum efficiency occurs for a particular distribution of frequencies, which
has precisely the form given above, with parameters β and γ which, then, have a meaning.
In ecology, the "cost of a species" is linked with the amount of assimilated energy that it
requires; for example, it is more costly in terms of energy for an ecosystem to produce and
maintain a carnivore than a primary producer, because of the loss of energy at each trophic level.
The "cost of a species" can also be related to other kinds of expenditures, expressed in terms of
accumulated information. A specialized species, for instance, has to wait for some particular
conditions to be present, or for the state of the ecosystem that allows it to appear. This introduces
a historical aspect in ecosystem theory, and leads to thinking of this "cost" in terms of required
past history.
The rank-frequency diagrams and the Mandelbrot models associated with them do not
provide proofs for these philosophical considerations. It is nevertheless very exciting to explore
the properties of the model, and to investigate possible ways of generating such distributions. Let
us come back to fractals for a moment, since Mandelbrot has recently specified a way of
generating this kind of distribution. He expressed this in the context of the analysis of a
"lexicographic tree", so-called because once again it initially dealt with languages, but it can easily
be translated into ecological terms.
"
/1\ .1\/\ /\
: \: \ \' '.
Fig. 15. A lexicographic tree, following Mandelbrot (1982). The a_i, b_j and c_k are previous
conditions required by species S_1, S_2, S_3, ... to appear. See text.
Let us suppose that the occurrence of a species depends on the previous realization of a
number of conditions in its physical, chemical and biotic environment. The nature of these
conditions is not specified; one condition can even be the previous appearance of some other
species in the community. Let a_i, b_j, c_k, ... designate these previous conditions that are
required by species S_r. The probability of this species is:
Pr(S_r) = Pr(a_i) · Pr(b_j) · Pr(c_k) · ... · Pr(S_r | a_i, b_j, c_k, ...)
if these conditions are independent from one another. The sequence of events can be as follows
(Fig. 15):
- A ubiquitous species S_1 appears as soon as a restricted number of conditions are
realized; let us represent this first set by a single condition a_1, so that
Pr(S_1) ∝ Pr(a_1)
- If the second species requires conditions a_2 and b_1, then
Pr(S_2) ∝ Pr(a_2) · Pr(b_1) < Pr(a_1)
since all the probabilities are assumed to be small and of the same order of magnitude.
- For the third species to be allowed to occur, let us suppose that the conditions are a_2,
at any given time, it is always possible to expect one more to appear in the future. The only
condition that was stated, in Mandelbrot's demonstration, is that the probabilities for the
occurrence of the "previous conditions" be small, compared to the probability for the species to
occur when the previous conditions are met. With these very broad conditions, the probability of
a species Sr is a function of its rank r in the frequency distribution, of the form
Pr(S_r) = P_0 (r + β)^(−γ)
where P_0, β and γ are the same parameters as above. In the course of Mandelbrot's
demonstration, it appears that the parameters β and γ have a functional importance. Directly
transposing his words to ecology, β is linked with the diversity of the environment, that is, with
the average number of modalities of type a_i, or b_j, or c_k, etc. at each level. On the other hand,
1/γ is linked with the predictability of the community, that is, the probability of a species to
appear when the conditions that it requires have been met. This is of great interest, because
environmental diversity and predictability of the organic assemblage are two important elements
determining the composition of a community, as is well known in ecosystem theory.
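Mandelbrot's generative argument can be imitated on a deliberately simplified tree: give every node the same number m of branches ("modalities") and every condition the same small probability p. A species at depth k then has probability proportional to p^k, there are m^k such species, and ranking them reproduces a power law of exponent γ = log(1/p) / log m. This toy version (the uniform m and p are my simplification, not Mandelbrot's general case) can be checked numerically:

```python
import math

def tree_species_probabilities(m, p, depth):
    """Unnormalized species probabilities on a uniform lexicographic tree:
    m branches per node, each 'previous condition' of probability p.
    The m**k species at depth k each get probability p**k."""
    probs = []
    for k in range(1, depth + 1):
        probs.extend([p ** k] * (m ** k))
    return sorted(probs, reverse=True)

probs = tree_species_probabilities(m=3, p=0.2, depth=6)
gamma = math.log(1 / 0.2) / math.log(3)   # predicted exponent, about 1.46
```

Plotting log(probs) against log(rank) gives a staircase whose overall slope approaches −γ as the tree deepens.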
Finally, it can be stated that 1/γ is a fractal dimension (< 1): it is the dimension of a
fractal representing the set of species abundances as forecasted by the model; in other words, it is
the fractal dimension of the "species distribution", or distribution of the individuals among
species, studied as to its diversity. Diversity is then a fractal property of the biomass. The
demonstration that 1/γ is a fractal dimension rests on Cantor sets (Mandelbrot 1977, 1982). On
the other hand, it has been shown (Frontier 1985) that 1/γ is strongly and almost linearly
correlated with the evenness measure J = H/H_max; this supports the idea of using the latter as a
fractal dimension. The equation A = S^J of section 5.2.2 then becomes A ≈ S^(1/γ).
different kinds of systems may signify that they are describing optimal conditions of general
information and dynamic systems, whatever the physical support of the information is.
In community samples observed in real ecosystems, a great variety of shapes has been
found in rank-frequency diagrams, according to the degree of complexity of the community, its
stage of evolution, its stress, the observation scale, etc. Few are found to conform exactly to a
Mandelbrot model, at least at the level of the single sample. The sampling process introduces a
statistical irregularity, of which we get an idea by superposing a number of curves describing
individual samples from the same community. The width of the bundle of curves so obtained
indicates something about the random variability. In Figure 16 for example, the population
consists of young fish of various species, coexisting in a littoral nursery sampled at various times
during a year. Superposing two sets of curves coming from two different years is an approximate
statistical test showing, in this case, that no significant difference exists between the two sets.
Fig. 16. Superposed rank-frequency curves (relative frequencies, in percent, on a log scale).
In this example, no Mandelbrot model can easily be fitted, due to the fact that the curves
do not show much evidence of an asymptotic behaviour, so that the slope −γ cannot be estimated
precisely. It seems justified to fit a Mandelbrot model only in cases where the existence of an
asymptotic line is supported by the graph. In most cases, such a model will be found by
cumulating a number of samples over an ecologically homogeneous area and/or time span. As a
matter of fact, at too small a scale, the patchiness of the spatial distributions of the various species
is biasing the overall species distribution, for in a very limited site, a small number of species are
dominant, while at some other site, other species may dominate. It follows that at a given site,
and consequently also in a sample, we often observe a concave or a convex rank-frequency
diagram, the ordering of the species varying from sample to sample. Summing the number of
individuals sampled, species by species, over a set of samples, results in a curve more extended
towards the right; then a Mandelbrot-like distribution is found, as in curve b of Figure 17, that has
γ = 3.54 and β ≈ 12. On the contrary, summing the numbers of individuals rank by rank,
Fig. 17. Two ways of summing frequencies to get a "mean" rank-frequency diagram. a:
Summing by ranks, i.e., total of individuals of species of rank 1, whatever the species name is in
the various samples; then, total of individuals of species of rank 2; etc. That produces an average
of the individual sample curves, without increasing the number of species. b: Summing species
by species; species are ranked after summing their abundances over the set of samples. This
increases the total number of species, so that the shape of the rank-frequency diagram is
different. From Safran (in press).
independently from the actual species names, provides an "average" curve (Fig. 17, curve a) that
passes through the center of the bundle of sample curves, and cannot be fitted to a Mandelbrot
model.
CONCLUSION
I have presented in this paper many more working hypotheses and questions than
results. Up to now, fractal geometry has been applied very little to ecological problems;
nevertheless, it seems to offer perspectives that are not trivial. Our short exploration through
forms, spatial distributions, movements of organisms, size distributions, strange attractors,
species diversity and species distributions indicates that fractal properties go far beyond
morphological analysis, which calls only upon fractals in physical space. We have to rephrase the
discussion in terms of the dynamics of the interactions of a system, made by a biomass divided
into various populations, size classes, trophic levels, and so on, with its physical environment.
These interactions imply a fractal geometry of surfaces and of sets of contact points. An
ecosystem could not exist if it were made only of lines (D=l), surfaces (D=2) and volumes (D=3),
as engines are made because we made them, and as the Greek philosophers tried to describe the
world. Interactions imply a "fractal" kind of complexity in time and in space. In that sense,
fractal geometry provides a new tool, and a new paradigm, for analyzing that mixture of order and
chaos that classical science had up to now generally avoided, but that numerical ecology can now
grasp.
Fractal theory has been introduced by Mandelbrot, first in a book in French in 1975,
"Les objets fractals: forme, chance et dimension" (Flammarion, Paris), then in English, "Fractals:
form, chance, and dimension" (Freeman and Co., San Francisco, 1977), with a second edition
in 1982 entitled "The fractal geometry of nature". The fundamentals of fractal theory are brought
together in these books, which summarize the papers of the author and of others on the subject.
What is a fractal? Initially, the term designates a geometrical object with a non-integer
dimension. Such an expression may be astonishing, for we usually describe real and conceptual
spaces in terms of points (dimension = 0), lines and curves (dimension = 1), surfaces (dimension
= 2) and volumes (dimension = 3). Furthermore, multivariate analysis and phase space analysis
have accustomed us to speak about Euclidean spaces with 4, 5, ... N dimensions, N being always
an integer.
Fractal geometry then allows one to describe conceptual or concrete objects that realize
"a certain degree" of occupation of a bi- or tri-dimensional Euclidean space, somewhere between a
curve and a surface, or between a surface and a volume. The "fractal dimension" has to be
considered as a measure of that degree of occupation, following a mathematical rule that identifies
the properties of the index with those of a "dimension" in the usual sense. An integer dimension
turns out to be a particular case of a generalized fractional dimension. This mathematical theory
had already been developed by previous mathematicians such as Hausdorff (1919) and
Besicovitch and Ursell (1937). Mandelbrot used and deepened these previous theories in order to
make them applicable to the description of the real world, and this attempt was extraordinarily
fruitful since it allowed one to describe the various states of fragmenting and branching out of
living and non-living matter.
How can we talk about a "dimension"? Let us remember the usual meaning of a
dimension 1, 2 or 3 of a geometric object. Dividing a segment of length 1 metre into N equal
Fig. 18. Initiator, generator, and successive steps in the construction of a self-similar curve.

N = n^d , or d = log N / log n
Fig. 19. a: Square, d = log 4 / log 2 = 2.0. b: Cube, d = log 8 / log 2 = 3.0.
In the case of the Koch curve, self-similarity is obvious for, after indefinite generation
of the form, any part is a miniature model of the whole. Each element contains 4 elements 3 times
smaller, so that the dimension is:

d = log 4 / log 3 = 1.2619
This is a fractional, or "fractal", dimension. Another example of a fractal line is a "tree" (Fig. 20),
whose ecological significance is described in the main part of this paper. Starting again with a
straight segment of length 1, two branches are added, branching out from the middle point of the
Fig. 20. a: Geometric fractal tree, d = log 3 / log 2 = 1.585. The partial trees (or "branches",
that are miniature models of the tree) are surrounded by dashed lines, and are self-similar.
b: Statistical fractal tree, d ≈ 1.6; partial trees are statistically self-similar.
Fig. 21. Cantor dusts. a: On a line, d = log 2 / log 3 = 0.631. b: In the plane, d = log 4 / log 7
= 0.712.
previous segment, giving three branches, each of length 1/2, plus one stem. Then each of the
three branches is submitted to the same generator, each one giving three sub-branches of length
1/4, and so on. At each step of the generation, each of the terminal segments is replaced by one
trunk and three branches, so that its total length is multiplied by 2. Since the remaining part of the
tree remains the same, the total length of the ramified object tends to infinity. The final object is
self-similar because, after indefinitely branching out, any branch or sub-branch is a miniature
model of the whole "tree". The fractal dimension is found by considering that, at each step, one
tree bears 3 sub-trees whose linear size is twice smaller, hence
d = log 3 / log 2 = 1.5850
That represents a higher degree of occupancy of a portion of the plane, by a fractal curve, than in
the case of the Koch curve. For other examples, refer to the books of Mandelbrot, that provide a
wide variety of fractal patterns, with dimensions between 1 and 2.
A fractal dimension less than I can be obtained with a generator rather close to that of
the Koch curve. The middle segment of the three is removed, at each step, without being replaced
(Fig. 21a). At the limit, there remains an infinite set of points (or "Cantor dust") showing couples
of points, couples of couples, and so on. At each step of the generating process, any segment is
replaced by 2 segments 3 times smaller, so that the fractal dimension is
d = log 2 / log 3 = 0.6309
The fractal picture represents a rather low degree of occupancy of a line by an infinite set of
points. The total length of the set of points is obviously zero.
In the plane, a Cantor dust can be built in two dimensions, for example (Fig. 21b), by
constructing groups of 4 squares, each one containing 4 squares 7 times smaller (in linear size).
The final picture is self-similar with dimension

d = log 4 / log 7 = 0.7124
The total length and area are, of course, zero. If we had 4 squares 4 times smaller instead, d
would be equal to log 4/ log 4 = 1, although the object is not a line. This shows that a fractal
dimension can happen to be an integer.
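All of the self-similar examples above follow the single rule d = log N / log n, for an object made of N copies each n times smaller; a one-line helper reproduces the four values:

```python
import math

def similarity_dimension(n_copies, scale):
    """d = log N / log n for N self-similar copies, each n times smaller."""
    return math.log(n_copies) / math.log(scale)

print(similarity_dimension(4, 3))   # Koch curve, 1.2619...
print(similarity_dimension(3, 2))   # fractal tree, 1.5850...
print(similarity_dimension(2, 3))   # Cantor dust on a line, 0.6309...
print(similarity_dimension(4, 7))   # Cantor dust in the plane, 0.7124...
```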
A bounded (finite) object of integer dimension d has a measure of zero with respect to a
higher dimension, infinite with respect to a smaller one, and it has a finite measure only in its own
dimension d. For example, an area has a volume 0, a length ∞ (that is, the length of a line filling
the whole area), and it has a finite measure in square metres only.
For a fractal object of fractional dimension d, its measure is 0 in any dimension larger
than d, and ∞ in any dimension smaller than d; it is a finite number only in the fractal dimension
d. A Koch curve built up starting from a 1 metre segment has an infinite length (in metres), and a
0 m² or 0 m³ area or volume. The following table illustrates that rule, for d varying from 0.5 to 3:

   d         length (m)   area (m²)   volume (m³)   measure in m^d
   0.5       0            0           0             finite
   1         finite       0           0             finite
   1.2619    ∞            0           0             finite
   2         ∞            finite      0             finite
   3         ∞            ∞           finite        finite
Fig. 22. Fractal dimension of a rocky shoreline. a: Statistical self-similarity. b: Computation of
the fractal dimension. If one segment of length ℓ is replaced by 20 segments of length ℓ' = ℓ/10,
then d = log 20 / log 10 = 1.301.
2 - Statistical fractals. Another way of constructing fractals consists of adding a random
element to the generator. Hence, from one step to the next, only the statistical or stochastic
characteristics of the fragmenting process are maintained. The object shows a much greater
resemblance to a natural, physical object. For example, a rocky coastline (Fig. 22a) can be
374
described by considering that the roughness has the same statistical characteristics at all
observation scales. An approximate description of the coastline is given by a broken line made of
equal segments of length ℓ (Fig. 22b). In order to estimate the length of that coast as it appears at
that observation scale, we add the lengths of all the segments necessary to cover the whole coast.
When we want to detail the coastline, replacing each straight segment by a rugged line, the coast
appears longer. Choosing a unit segment N times smaller than the previous one, one has to
insert more than N small segments, because of the contorted shape of the coast; that is true at
every observation scale. At each step of the decreasing scale, if we assume for example that one
segment has to be replaced, on the average, by 20 segments 10 times smaller, then the fractal
dimension of the coastline is estimated as
d = log 20 / log 10 = 1.3010
The final length is obviously infinite, as more details are taken into account at each step. So, the
usual concept of the "length of a coastline" is a non-concept, because the real length is always
infinite. The length of a coast, as measured from a map, is arbitrary and depends on the
cartographic scale; it can be indefinitely enlarged, as more and more details of the coastline are
taken into account.
Physical phenomena often evoke a fractal generating process with a random component,
so that a fractal dimension can often be assigned to them. A classical example is the Brownian
motion. When observing at time intervals the displacement of a particle on a plane, we see the
movement as a broken line; observing the same movement at intermediate times, each of the
straight line segments previously seen is replaced by a finer broken line, whose length is greater
(Fig. 9a). The trajectory clearly appears as a fractal line; it can be calculated that its fractal
dimension is 2, that is, an integer, meaning that the particle is equally likely to be found at any
point of the plane.
Fig. 23. Length L of a boundary as a function of the length ℓ of the "yardstick" used to
measure it. The slope of the line in a log-log graph is a = −0.2, so that the fractal dimension is
d = 1 − a = 1.2. The conversion of measurements from km^1.2 to m^1.2 is done as follows:
X(m^1.2) = 1000^1.2 · Y(km^1.2), or 3981.07 · Y. For example, 18838 m^1.2 = 4.732 km^1.2.
A Cantor dust can also be randomized, as seen in Figure 7. As such it could model
either the dispersion of galaxies in the sky, or of plankton in the sea.
With many real fractals, there is no geometric generator that would allow one to calculate a
fractal dimension through self-similarity considerations, since they have a statistical component.
In that case, the fractal dimension has to be inferred by observing the increase of (for
example) the length of a line between two points, as the unit of measure decreases. The greater
the fractal dimension of a coastline -- that is, the more pronounced its roughness -- the faster the
measured length will increase when the unit segments used to cover the curve decrease in length.
Precisely, if the length of the unit segment is ℓ and the number of segments covering the fractal
line is N, then the length measured at that step is L = N·ℓ. Choosing another unit segment, of
length ℓ/k, the number of segments gets multiplied by k^d, so that the new length is
N·k^d · (ℓ/k) = L·k^(d−1); now, ℓ being inversely proportional to k, L is proportional to
ℓ^(1−d). Then, putting L and ℓ on a graph with log-log scale, we obtain a straight line of slope
(1−d), from which the unknown fractal dimension can immediately be inferred. For example in Figure 23, a
slope of −0.2 is observed, hence the fractal dimension is 1.2. A fractal measure of the line has to
be expressed in m^1.2 ("metres to the 1.2"), or km^1.2, or cm^1.2 ... Since 1 km = 1000 m, the
measure in m^1.2 is equal to 1000^1.2 ≈ 3981 times the measure in km^1.2.
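In practice the slope is fitted by least squares over several yardstick lengths. The sketch below recovers the dimension from synthetic measurements obeying L ∝ ℓ^(1−d); the coastline data are invented for illustration, not taken from Figure 23.

```python
import math

def dimension_from_yardsticks(rulers, lengths):
    """Fit log L = a * log(ruler) + b by least squares; return d = 1 - a."""
    xs = [math.log(r) for r in rulers]
    ys = [math.log(L) for L in lengths]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 - slope

# invented measurements following L = 10 * ruler**(1 - 1.3)
rulers = [1.0, 0.5, 0.25, 0.125]
lengths = [10.0 * r ** (1 - 1.3) for r in rulers]
d = dimension_from_yardsticks(rulers, lengths)   # recovers d = 1.3
# unit conversion of a fractal measure: 1000**1.2 is about 3981.07,
# the factor that turns a measure in km^1.2 into one in m^1.2
```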
In the "fractal tree" (with or without a random component), at each step a given number
of self-similar smaller trees appear, plus a stem (or fractal "residue"), that increases the total
length. For that reason, the length measured at any step increases more rapidly when the
branching-out goes on, than predicted by the mere self-similarity rule, as seen in Figure 24. The
curve is asymptotic to a straight line of slope (1−d), giving again the fractal dimension d.
For a Cantor dust, an estimation of the fractal dimension can be made from the decrease
of the density of points inside spheres of increasing diameters, as explained in section 3 above and
in Figure 8. The slope of the line describing the decrease in log-log scale gives, here again, the
dimension of the fractal object.
diameter, and finally observe the decrease of the mean density of points per unit volume.
(b) Lexicographic trees (Fig. 15), used by Mandelbrot for linguistic analysis, may also
be applied to ecology, as shown in Section 5.2.
REFERENCES
Adams, G.F., and C.H. Oliver. 1977. Yield properties and structure of boreal percid
communities in Ontario. J. Fish. Res. Bd. Canada 34: 1613-1625.
Besicovitch, A.S., and H.D. Ursell. 1937. Sets of fractional dimensions (V): On dimensional
numbers of some continuous curves. J. London Math. Soc. 12: 18-25.
Bradbury, R.H., and R.E. Reichelt. 1983. Fractal dimension of a coral reef at ecological scales.
Mar. Ecol. Progr. Ser. 10: 169-171.
Bradbury, R.H., R.E. Reichelt, and D.G. Green. 1984. Fractals in ecology: methods and
interpretation. Mar. Ecol. Progr. Ser. 14: 295-296.
Burrough, P.A. 1981. Fractal dimensions of landscapes and other environmental data. Nature
(Lond.) 294: 240-242.
Burrough, P.A. 1983. Multiscale sources of spatial variation in soil. I. The application of fractal
concepts to nested levels of soil variation. J. Soil Science 34: 577-597.
Devaux, J. 1980. Structure des populations phytoplanctoniques dans trois lacs du Massif Central:
successions ecologiques et diversite. Acta Oecol./Oecol. Gener. 1: 11-26.
Ekeland, I. 1984. Le calcul, l'imprevu. Seuil, Paris. 170 p.
Fournier d'Albe, E.E. 1907. Two new worlds: I The infra world; II The supra world. Longmans
Green, London.
Frechette, M. 1984. Interactions pelago-benthiques et flux d'energie dans une population de
moules bleues, Mytilus edulis L., de l'estuaire du Saint-Laurent. These de Ph.D.,
Universite Laval, Quebec. viii + 172 p.
Frontier, S. 1976. Utilisation des diagrammes rang-frequence dans l'analyse des ecosystemes. J.
Rech. oceanogr. 1: 35-48.
Frontier, S. 1978. Interfaces entre deux ecosystemes. Exemples dans le domaine pelagique.
Ann. Inst. oceanogr., Paris 54: 96-106.
Frontier, S. 1985. Diversity and structure in aquatic ecosystems. Oceanogr. mar. Biol. ann.
Rev. 23: 253-312.
Goodman, D. 1975. The theory of diversity-stability relationship in ecology. Quart. Rev. Biol.
50: 237-266.
Grassberger, P., and I. Procaccia. 1983. Characterization of strange attractors. Phys. Rev. Lett.
50: 346-349.
Hausdorff, F. 1919. Dimension und äußeres Maß. Mathematische Annalen 79: 157-179.
Hily, C. 1983. Modifications de la structure ecologique d'un peuplement a Mellina palmata. Ann.
Inst. oceanogr. Paris 59: 37-56.
Hutchinson, G.E. 1957. A treatise on limnology. Wiley and Sons, New York.
Ibanez, F., and M. Etienne. The fractal dimension of a chlorophyll record. (Submitted).
Kent, C., and J. Wong. 1982. An index of littoral zone complexity and its measurement. Can.
J. Fish. Aquat. Sci. 39: 847-853.
Landman, B.S., and R.L. Russo. 1971. On a pin versus block relationship for partition of logic
graphs. I.E.E.E. Tr. on Computers 20: 1469-1479.
Legendre, L. 1981. Hydrodynamic control of marine phytoplankton production. In J. Nihoul
[ed.] Ecohydrodynamics. Elsevier Scient. Publ. Co., Amsterdam.
Legendre, L., and S. Demers. 1984. Towards dynamic biological oceanography and limnology.
Can. J. Fish. Aquat. Sci. 41: 2-9.
Legendre, L., and S. Demers. 1985. Auxiliary energy, ergoclines and aquatic biological
production. Naturaliste can. (Rev. Ecol. Syst.) 112: 5-14.
Legendre, L., and P. Legendre. 1983. Numerical ecology. Developments in Environmental
Modelling, 3. Elsevier Scient. Publ. Co., Amsterdam. xvi + 419 p.
Mandelbrot, B. 1953. Contribution a la theorie mathematique des jeux de communication. These
de Doctorat d'Etat, Univ. Paris. Publ. Inst. Stat. Univ. Paris 2: 1-121.
Mandelbrot, B. 1974. Intermittent turbulence in selfsimilar cascades: divergence of high
moments and dimension of the carrier. J. Fluid Mech. 62: 331-358.
Mandelbrot, B. 1975. Les objets fractals: forme, chance et dimension. Flammarion, Paris.
[Second edition in 1984.]
Mandelbrot, B. 1977. Fractals. Form, chance, and dimension. Freeman & Co., San Francisco.
365 p.
Mandelbrot, B. 1982. The fractal geometry of nature. Freeman & Co., San Francisco. 468 p.
Margalef, R. 1980. La biosfera. Ediciones Omega, Barcelona. 236 p.
Mark, D.M. 1984. Fractal dimension of a coral reef at ecological scales: a discussion. Mar.
Ecol. Progr. Ser. 14: 293-296.
May, R.M. 1974. Stability and complexity in model ecosystems. 2nd ed. Princeton Univ.
Press. 265 p.
May, R.M. 1975. Deterministic models with chaotic dynamics. Nature (London) 256:
165-166.
May, R.M. 1981. Nonlinear phenomena in ecology and epidemiology. Ann. N.Y. Acad. Sci.
357: 267-281.
Meyer, J.A. 1980. Sur la dynamique des systemes ecologiques non lineaires. J. Physique
(Colloque C5, 1978: suppl. au no 8) 38: C5.29-C5.37.
Meyer, J.A. 1981. Sur la stabilite des systemes ecologiques plurispecifiques. 335-351 in B.E.
Paulre [ed.] System dynamics and analysis of chance. North Holland Publ. Co.
Morowitz, H.J. 1968. Energy flow in biology. Acad. Press, New York. 179 p.
Nicolis, C., and G. Nicolis. 1984. Is there a climatic attractor? Nature (London) 311: 529-532.
Pareto, V. 1896, 1965. Cours d'economie politique. Reimprime dans un volume d' "Oeuvres
Completes", Droz, Geneve.
Pielou, E.C. 1975. Ecological diversity. Wiley Interscience, New York. viii + 165 p.
Platt, T., and K.L. Denman. 1977. Organization in the pelagic ecosystem. Helgoland Wiss.
Meeresunters. 30: 575-581.
Platt, T., and K.L. Denman. 1978. The structure of pelagic marine ecosystems. Rapp. P.-v.
Reun. ClEM 173: 60-65.
Ripley, B.D. 1981. Spatial statistics. John Wiley & Sons, New York. x + 252 p.
Ryder, R.A. 1965. A method for estimating the potential fish production of north-temperate
lakes. Trans. Amer. Fish. Soc. 94: 214-218.
Safran, P. Etude d'une nurserie littorale a partir des peches accessoires d'une pecherie artisanale
de crevettes grises (Crangon crangon). Oceanol. Acta (in press).
Villermaux, J., D. Schweich, and J.R. Hautelin. 1986a. Le peigne du diable, un modele
d'interface fractale bidimensionnelle. C. R. hebd. Seances Acad. Sci., Paris. In press.
Villermaux, J., D. Schweich, and J.R. Hautelin. 1986b. Transfert et reaction a une interface
fractale representee par le peigne du diable. C. R. hebd. Seances Acad. Sci., Paris. In
press.
Wetzel, R.G. 1975. Limnology. Saunders, Toronto.
Zipf, G.K. 1949-1965. Human behavior and the principle of least-effort. Addison-Wesley,
Cambridge, Mass.
Path analysis for mixed variables
PATH ANALYSIS WITH OPTIMAL SCALING
Jan de Leeuw
Department of Data Theory FSW, University of Leiden
Middelstegracht 4
2312 TW Leiden, The Netherlands
Abstract - In this paper we discuss the technique of path analysis, its extension to
structural models with latent variables, and various generalizations using optimal
scaling techniques. In these generalizations nonlinear transformations of the
variables are possible, and consequently the techniques can also deal with nonlinear
relationships. The precise role of causal hypotheses in this context is discussed. Some
applications to community ecology are treated briefly, and indicate that the method
is a promising one.
INTRODUCTION
In this paper we shall discuss the method of path analysis, with a number of
extensions that have been proposed in recent years. The first part discusses path
analysis in general, because the method is not very familiar to ecologists. In fact we
have been able to find only very few papers using path analysis in the literature of
community ecology. With the help of Pierre and Louis Legendre we located Harris
and Charleston (1977), Chang (1981), Schwinghamer (1983), Gosselin et al.
(1986), and Troussellier et al. (1986).
In this paper we combine classical path analysis models, first proposed by
Wright (1921, 1934), with the notion of latent variables, due to psychometricians
such as Spearman (1904) and to econometricians such as Frisch (1934). This
produces a very general class of models. If we combine these models with the
notion of least squares optimal scaling (or quantification, or transformation),
explained in De Leeuw (1987), we obtain a very general class of techniques.
Now in many disciplines, for example in sociology, these path analysis
techniques are often discussed under the name causal analysis. It is suggested,
thereby, that such techniques are able to discover causal relationships that exist
between the variables in the study. This is a rather unfortunate state of affairs (De
Leeuw 1985). In order to discuss it more properly, we must start the paper with
some elementary methodological discussion.
One of the major purposes of data analysis, in any of the sciences, is to arrive at a
convenient description of the data in the study. By 'convenient' we mean that the
data are described parsimoniously, in terms of a relatively small number of
equal footing with the laws of the physical sciences. Pearson eloquently argued that
correlation is the more fundamental scientific category, because causality is merely
a degenerate special case, which does not really occur in practice. Again this point
of view is not inherently wrong, provided we broaden the definition of correlation
sufficiently.
This is related to the fact that lawlike relationships in the social sciences and the
life sciences are usually described as probabilistic instead of deterministic. If we
have ten kettles, and we put them on the fire, then the water will boil in six or seven
of them. But this difference is mainly a question of choosing the appropriate unit. A
probabilistic relationship between individual units is a deterministic relationship, in
fact a functional relationship, between the random variables defined on these units.
A linear regression between status and income is a deterministic relationship
between averages, even though it does not make it possible to predict each
individual income precisely from a known status-value. If we call a law-like
relationship between the parameters of multivariate probability distributions a
correlation, then Pearson's point of view about causality makes sense. Of course we
must again be careful, because another far more specific meaning of the word
'correlation', also connected with the name of Pearson, is around too. Compare
Tukey (1954) for more discussion on this point.
Up to now we have concentrated on data analysis as a method of description. We
summarize our data, preferably in the context of a known or conjectured model
which incorporates the prior information we have. At the same time we also
investigate if the model we use describes the data sufficiently well. But science does
not only consist of descriptions, we also need to make predictions. It is not enough to
describe the data at hand, we must also make statements about similar or related data
sets, or about the behaviour of the system we study in the future. In fact it is
perfectly possible that we have a model which provides us with a very good
description, for example because it has many parameters, but which is useless for
prediction. If there are too many parameters they cannot be estimated in a stable
way, and we have to extrapolate on a very uncertain basis. Or, to put it differently,
we must try to separate the stable components of the situation, which can be used for
prediction, from the unstable disturbances which are typical for the specific data set
we happen to have.
We end this brief methodological discussion with a short summary. The words
'correlation' and 'causality' have been used rather loosely by statisticians, certainly
in the past. Causal terminology has sometimes been used by social scientists as a
means of making their results sound more impressive than they really are, and this
is seriously misleading. It is impossible, by any form of scientific reasoning or
activity, to prove that a causal connection exists, if we interpret 'causal' as
We shall now define formally what we mean by a path model. In the first place
such a model has a qualitative component, presented mathematically by a graph or
arrow diagram. In such a graph the variables in our study are the corners, the
relationships between these variables are the edges. In the path diagrams the
variables are drawn as boxes; if there is an arrow from variable V1 to variable V2
then we say that V1 is a direct cause of V2 (and V2 is a direct effect of V1).
Figure 1. Path diagram.
Compare Figure 1, for example. Observe that we use causal terminology without
hesitation, but we follow the Simon-Wold example and give a precise definition of
causes and effects in terms of graph theory. If there is a path from a variable V1 to
another variable V2, then we say that V1 is a cause of V2 (and V2 is an effect of
V1). In Figure 1, for instance, V1 is a cause of V6 and V7, although not a direct
cause.
Table 1.
Causal relations in Figure 1.

        level   causes       direct causes   predecessors
Var 1   0       ****         ****            ****
Var 2   0       ****         ****            ****
Var 3   1       {1,2}        {1,2}           {1,2}
Var 4   1       {1}          {1}             {1,2}
Var 5   1       {2}          {2}             {1,2}
Var 6   2       {1,4}        {4}             {1,2,3,4,5}
Var 7   2       {1,4}        {4}             {1,2,3,4,5}
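The columns of Table 1 follow mechanically from the arrow diagram. A small sketch, assuming the direct-cause sets read off the table itself (3 from {1,2}, 4 from {1}, 5 from {2}, 6 and 7 from {4}):

```python
# Direct-cause sets of the path diagram of Figure 1 (our reading of Table 1).
dcause = {1: set(), 2: set(), 3: {1, 2}, 4: {1}, 5: {2}, 6: {4}, 7: {4}}

def causes(v):
    """Transitive closure of the direct-cause relation."""
    out = set(dcause[v])
    frontier = set(dcause[v])
    while frontier:
        frontier = set().union(*(dcause[u] for u in frontier)) - out
        out |= frontier
    return out

def level(v):
    """Exogenous variables have level 0; otherwise 1 + max level of direct causes."""
    return 0 if not dcause[v] else 1 + max(level(u) for u in dcause[v])

def predecessors(v):
    """All variables of strictly lower level."""
    return {u for u in dcause if level(u) < level(v)}

print(causes(6), level(6), sorted(predecessors(6)))
```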
is. We have defined our notion of causality in terms of the path diagram. Other
notions which are important in path analysis will be discussed below.
The assumptions we make about the disturbance terms εj are critical. These
assumptions are in terms of uncorrelatedness, for which we use the symbol ⊥. First
assume for each j that εj is uncorrelated with dcause(xj). Thus

ε3 ⊥ {x1,x2},  (2a)
ε4 ⊥ {x1},     (2b)
ε5 ⊥ {x2},     (2c)
ε6 ⊥ {x4},     (2d)
ε7 ⊥ {x4}.     (2e)
Now model (1)(2) describes any data set of seven variables perfectly. To see this it
suffices to project each Xj on the space spanned by its direct causes, i.e. to perform a
Assumption (3) is much stronger than (2), and not all sets of seven variables satisfy
(1) and (3). Because ε4 ⊥ {x1,x2}, for example, regression of x4 on x1 and x2 will
give β42 = 0 if (1)(3) is true, and this is clearly restrictive. Thus model (1)(3) can
be a poor descriptor as well as a poor predictor. It is clear, by the way, that a model
which is a good predictor is automatically a good descriptor.
For the causal interpretation the following argument is useful. It extends to all
transitive models. We have ε6 ⊥ {x1,x2} and ε6 ⊥ ε3. Thus, from (1a), ε6 ⊥ x3. In
the same way ε6 ⊥ x4 and ε6 ⊥ x5. Thus ε6 ⊥ {x1,x2,x3,x4,x5}, which implies that
proj(x6 | x1,x2,x3,x4,x5) = proj(x6 | x4), with proj(y | x1,...,xm) denoting least
squares projection of y on the space spanned by x1,...,xm. In words this says that the
projection of x6 on the space spanned by its predecessors is the projection of x6 on
the space spanned by its direct causes. The interpretation is that, given the direct
causes, a variable is independent of its other predecessors. Thus the strong
orthogonality assumptions in transitive models imply a (weak) form of conditional
independence.
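The projection identity can be checked numerically. In the sketch below all coefficients and the sample size are illustrative assumptions; the disturbance of x6 is made exactly orthogonal to the predecessors, after which the projection of x6 on all predecessors coincides with its projection on the direct cause x4 alone.

```python
import numpy as np

# Numerical sketch of proj(x6 | x1,...,x5) = proj(x6 | x4) when
# epsilon6 is orthogonal to the span of the predecessors.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
x3 = 0.5 * x1 + 0.5 * x2 + rng.standard_normal(n)
x4 = 0.8 * x1 + rng.standard_normal(n)
x5 = 0.7 * x2 + rng.standard_normal(n)
X = np.column_stack([x1, x2, x3, x4, x5])

# Construct a disturbance exactly orthogonal to span{x1,...,x5}.
e6 = rng.standard_normal(n)
e6 -= X @ np.linalg.lstsq(X, e6, rcond=None)[0]
x6 = 0.9 * x4 + e6

proj_all = X @ np.linalg.lstsq(X, x6, rcond=None)[0]
proj_dir = x4 * np.linalg.lstsq(x4[:, None], x6, rcond=None)[0]
print(np.allclose(proj_all, proj_dir))
```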
We shall now treat some more or less familiar models in which description is
perfect. These models are consequently saturated. The structural equations defining
the model can be solved uniquely, and the model describes the data exactly. The
first, and perhaps simplest, example is the multiple regression model. An example is
given in Figure 2.
Figure 2. Multiple regression model.
If we compare this with Figure 1 we see some differences which are due to the
fact that we have made the model quantitative. In the first place the arrows now have
values, the regression coefficients. In the second place it is convenient to use curved
loops indicating the correlations between the exogenous variables. The curved loops
can also be used to represent correlated disturbances. This becomes perhaps more
clear if we add dummy equations xj = εj for each of the exogenous variables,
which is consistent with the idea that exogenous variables have no causes; exogenous
variables are, in this sense, identical with disturbances. The strong orthogonality
assumptions on disturbances can now be stated more briefly, because they reduce to
the single statement εj ⊥ { εk | lev(xk) ≠ lev(xj) }. Arrows are also drawn in
Figure 2 to represent uncorrelated disturbance terms.
In Figure 2, and in multiple regression in general, there is only one endogenous
variable, often called the dependent variable. There are several exogenous
variables, often called predictors or independent variables. The linear structural
model is

y = β1x1 + ... + βmxm + ε.  (4)

The weak orthogonality assumptions here coincide with the strong orthogonality
assumptions, because dcause(y) are exactly the exogenous variables. Thus (4) is a
saturated model. If we project the dependent variable on the space spanned by the
predictors, then the residual is automatically uncorrelated with each of the
predictors. The description is perfect, although the prediction may be lousy. We
measure quality of prediction by the multiple correlation coefficient R² = 1 −
VAR(ε), in this context also known as the coefficient of determination.
Figure 3 shows a somewhat less familiar model. Its linear structure is

x2 = β21x1 + ε2,
x3 = β31x1 + β32x2 + ε3.  (5)

The weak orthogonality assumptions, which make (5) a saturated model, are ε2 ⊥
{x1} and ε3 ⊥ {x1,x2}. It follows from this that ε2 is the residual after projection of
x2 on x1. Thus β21 is equal to the correlation between x1 and x2, and ε2 = x2 −
β21x1 is a linear combination of x1 and x2. This implies that ε3 ⊥ ε2, and
consequently the strong orthogonality assumptions are true as well. Although we
did not require it, we automatically get uncorrelatedness of the disturbance terms.
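The automatic uncorrelatedness of the disturbances can be verified on simulated data. In this sketch the structural coefficients and sample size are our own illustrative choices; the least squares residuals are orthogonal to the regressors by construction, and the two residuals come out orthogonal to each other as well.

```python
import numpy as np

# Sketch of the saturated recursive model of Figure 3 on simulated data.
rng = np.random.default_rng(1)
n = 500
x1 = rng.standard_normal(n)
x2 = 0.6 * x1 + rng.standard_normal(n)
x3 = 0.4 * x1 + 0.5 * x2 + rng.standard_normal(n)

# Project x2 on x1, and x3 on {x1, x2}.
b21 = np.linalg.lstsq(x1[:, None], x2, rcond=None)[0]
e2 = x2 - x1 * b21
X12 = np.column_stack([x1, x2])
e3 = x3 - X12 @ np.linalg.lstsq(X12, x3, rcond=None)[0]

# Weak orthogonality holds by construction; e3 is also orthogonal to e2,
# because e2 lies in the span of {x1, x2}.
print(abs(e2 @ x1), abs(e3 @ x1), abs(e3 @ x2), abs(e3 @ e2))
```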
Figure 3. A simple saturated recursive model.

Figure 4. Path models with 18, 24, and 12 arrows.
In quantifying any path model we can simply use the path diagram to write down
the linear structural equations. We also have to assume something about the
disturbances in terms of their correlation with each other and with the Xj. The
weak orthogonality assumptions can be applied in all cases. They make the model
saturated, and have as a consequence that consistent estimation of the regression
coefficients is possible by projecting a variable on the space spanned by its direct
causes. In all transitive models, saturated or not, the strong orthogonality conditions
follow from the weak orthogonality conditions and the linear structure. Thus the
causal interpretation in terms of conditional independence is available.
The notion of a linear structural model is more general than the notion of a
transitive model, of course. If we assume a structural model, such as (1), then we
can make alternative assumptions about the residuals, for instance that they are all
uncorrelated. In fact we can easily build linear structural models which are not
transitive at all. Simply write down the model from the path diagram, one equation
for each endogenous variable, and make some sort of assumption about the
disturbances. By allowing for correlations between the disturbances we can create
saturated nontransitive models, and we can also get into problems with
identifiability. For these identification problems we refer to the econometric
literature, for instance to Hsiao (1983) or Bekker (1986). Observe that
nontransitive models cannot be translated into conditional independence statements,
which has caused some authors to say that nontransitive models are not causal.
For a small ecological example we use a part of the correlation matrix given by
Legendre and Legendre (1983, Table 5.6). The data have to do with primary
production, and were collected in 1967 in the Baie des Chaleurs (Quebec). There
are 40 measurements on four variables. These are:
K: the biological attenuation coefficient which represents
the relative primary production,
C: the concentration of chlorophyll a,
S: the degree of salinity,
T: the temperature.
The correlation matrix, and some simple path models, are given in Table 2.
Model (a) is the saturated model which has T and S as exogenous variables (level 0),
has C as a variable of level 1, and K as the innermost variable of level 2. Model (b) is
not saturated, because the paths from T and S directly to K are eliminated. All
effects of T and S on K go through C, or, to put it differently, K is independent of T
and S, given C. Model (c) is also saturated, but no choice is made about the causal
priority of C or K. Thus C and K have correlated errors, because they both have
level 1. In the part of Table 2 that gives the fitted coefficients we see that the
covariance of the errors in (c) is .721. Because of this covariance variable K has a
much larger error variance in model (c).
Table 2.
Correlations, Baie des Chaleurs.

       K       C       T
C   +.842
T   +.043   +.236
S   -.146   -.369   -.925
Models (a) and (c) give a perfect description of the correlations, so the choice
between them must be made purely on the basis of prior notions the investigator has.
We are not familiar with the problems in question, so we cannot make a sensible
choice. Model (b) is restrictive. If we compare it with (a) we still see that its
description is relatively good. If we want to decide whether to prefer it to (a) we can
use statistics, and see if the description is 'significantly' worse; or we can
use (a) and (b) predictively, and see which one is better. Our guess is that on
both counts (b) is the more satisfactory model.
strong reasons to prefer the particular model in the study over other competing
models. And this happens only if we already have a pretty good idea about the
mechanisms that are at work in the situation we are studying. If the sociologist says
that father's income only has an indirect effect on the career of the child, this is
either just a figure of speech, or a statement that a particular partial correlation
coefficient is small. In Chang (1981), and Troussellier et al. (in press), it is shown
that the decomposition of the correlation coefficients in direct and indirect
contributions (with respect to a particular path model) can lead to useful
interpretations in community ecology.
LATENT VARIABLES
Now consider the path models in Figures 5 and 6. They are different from the
ones we have seen before, because they involve latent or unobserved variables. In
the diagrams we indicate these latent variables by using circles instead of squares.
First we give the causal interpretation of Figure 5. If we project the observed
variables on the space spanned by the unobserved variables then the residuals are
uncorrelated. Thus the observed variables are independent given the unobserved
variable. All relationships between the observed variables can be 'explained' by the
latent variable, which is their common factor. In somewhat more intuitive terms a
good fit of this common factor model to the data means that the variables all
measure essentially the same property. A good fit, and small residuals, means that
they all measure this property in a precise way. Again we see that the model can be a
good description of the data without being a good predictor. Uncorrelated
variables, for instance, are described perfectly by the model, but cannot be
predicted at all.
The structural equations describing the model are

xj = λjξ + εj.  (6)

The εj are assumed to be uncorrelated with ξ. Model (6) is saturated and transitive,
but it has the peculiar property that the exogenous variable is not measured. In De
Leeuw (1984) it was suggested that latent variables are just another example of
variables about which not everything is known. We have nominal variables, ordinal
variables, polynomial variables, splinical variables, and we also have latent
variables. About latent variables absolutely nothing is known, except for their place
in the model. Thus the basic optimal scaling idea that transformations and
quantifications must be chosen to optimize prediction also applies to latent variables.
Consequently latent variables fit very naturally into the optimal scaling approach to
path analysis.
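The common factor structure behind model (6) can be made concrete with a small sketch. The loadings below are invented for illustration; the point is that every off-diagonal correlation in the implied matrix factors as a product of two loadings, so all relationships between the observed variables are 'explained' by the single common factor.

```python
import numpy as np

# Sketch of the one-factor model (6) with assumed loadings.
lam = np.array([0.8, 0.7, 0.6])           # illustrative loadings lambda_j
psi = 1.0 - lam ** 2                      # unique (residual) variances
R = np.outer(lam, lam) + np.diag(psi)     # implied correlation matrix

# Every off-diagonal entry is lam_j * lam_k.
for j in range(3):
    for k in range(3):
        if j != k:
            assert np.isclose(R[j, k], lam[j] * lam[k])
print(R)
```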
Figure 5. One-factor model.

Figure 6. MIMIC model.

Figure 7. MIMIC model, Legendre data (path coefficients .567, .620, .062).
Figure 7 illustrates an application of the MIMIC model to the Baie des Chaleurs
data of Legendre and Legendre. The values of the path coefficients and the error
variances are given in the diagram. The model provides a reasonably good
description, compared with the transitive models in Table 2. The causal
interpretation of Figure 7 is that temperature and salinity determine the unmeasured
variable ξ, which in its turn determines primary production and chlorophyll
concentration.
In our experience some people find it difficult to accept the concept of a latent
variable. But there are several reasons why we still think that such a concept is
useful. In the first place in many of the sciences measurement errors cannot be
neglected. This means that the observed variable is an indicator of the latent 'true'
variable. The concept of an indicator can be generalized considerably, and this has
happened mainly in psychometrics and in sociological methodology. It is not
possible to measure 'intelligence' directly, but it is possible to measure a large
number of indicators for intelligence. If the common factor model is acceptable,
We now briefly indicate where the theory of optimal scaling comes in. We have
seen in De Leeuw (1987) that optimal scaling (or transformation, or quantification)
can be used to optimize criteria defined in terms of the correlation matrix of the
variables. In path analysis the obvious criteria are the coefficients of determination,
i.e. the multiple correlation coefficients. In De Leeuw (1987) we already analyzed
an example in which the multiple correlation between predictors SPECIES and
NITRO and dependent variable YIELD was optimized. In path analysis we deal with
nested multiple regressions, and we can choose which one (or which combination)
of the multiple correlations we want to optimize. If there is no prior knowledge
dictating otherwise, then it seems to make most sense to maximize the sum of the
coefficients of determination of all the endogenous variables. But in other cases we
may prefer to maximize the sum computed only over all variables of the highest
level.
In general nontransitive models the methods of optimal scaling can be used
exactly as in transitive models. We have one coefficient of determination for each
endogenous variable, and we can scale the variables in such a way that the sum of
these coefficients is optimized. This amounts to finding transformations or
quantifications optimizing the predictive power of the model. Moreover it is
irrelevant for our approach if the model contains latent variables or not. We have
seen that latent variables are simply variables with a very low measurement level,
and that they can be scaled in exactly the same way as ordinal or nominal variables.
This point of view, due to De Leeuw (1984), makes our approach quite general. It is
quite similar to the NIPALS approach of Wold, described most fully in Joreskog
and Wold (1982) and Lohmoller (1986).
It is of some interest that we do not necessarily optimize the descriptive efficiency
at the same time. Optimizing predictive power is directed towards the weak
orthogonality assumptions. It is possible, at least in principle, that a model with
optimized coefficients of determination has a worse fit to the strong orthogonality
assumptions. Scaling to optimize predictability does not guarantee an improved fit
in this respect. This has as a consequence that there is a discrepancy between the
least squares and the maximum likelihood approach to fitting nontransitive path
models. We do not go into these problems, but refer the interested reader to
Dijkstra (1981), Joreskog and Wold (1982), and De Leeuw (1984) for extensive
discussions.
We now outline the algorithm that we use in nonlinear path analysis in somewhat
more detail. We minimize the sum

σ = Σj ‖ xj − Σl βjl xl ‖²,  (8)

over both the regression coefficients βjl and the quantifications (or transformations)
of the variables. The outer summation, over j, is over all endogenous variables; the
inner summation, over l, is over all variables that are direct causes of variable j. The
algorithm we use is of the alternating least squares type (Young, 1981). This means
that the parameters of the problem are partitioned into sets, and that each stage of
the algorithm minimizes the loss function over one of the sets, while keeping the
other sets fixed at their current values. By cycling through the sets of parameters we
obtain a convergent algorithm. In this particular application of the general
alternating least squares principle each variable defines a set of parameters, and the
regression coefficients define another set.
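A minimal sketch of the alternating least squares idea, in the spirit of loss (8), for the simplest possible case: optimally scaling a single nominal predictor in one regression equation. The data, category values and noise level are invented for illustration; PATHALS itself cycles over all variables and all equations of the path model in the same way.

```python
import numpy as np

# Toy alternating least squares: scale a 3-category nominal predictor x
# to minimize ||y - beta * z(x)||^2 over beta and the quantification z.
rng = np.random.default_rng(2)
n = 300
x = rng.integers(0, 3, size=n)            # nominal predictor, 3 categories
true_q = np.array([-1.0, 2.0, 0.5])       # hidden nonlinear category effect
y = true_q[x] + 0.3 * rng.standard_normal(n)
y = (y - y.mean()) / y.std()              # standardized dependent variable

z = np.array([0.0, 1.0, 2.0])             # naive integer coding to start
for _ in range(5):
    s = z[x]
    s = (s - s.mean()) / s.std()          # standardize the transformed variable
    beta = (s @ y) / (s @ s)              # (i) regression step, z fixed
    # (ii) scaling step: category values minimizing the loss, beta fixed
    z = np.array([y[x == c].mean() for c in range(3)]) / beta

s = (z[x] - z[x].mean()) / z[x].std()
r2 = ((s @ y) / n) ** 2                   # squared correlation after scaling
print(round(r2, 3))
```

Each stage minimizes the loss over one parameter set with the other fixed; for a single predictor the loop converges to the quantification proportional to the category means of y.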
We give an ecological illustration of this nonlinear PATHALS algorithm. The
data are taken from Van der Aart and Smeenk-Enserink (1975), who reported
abundance data for 12 species of hunting spiders in a dune area in the Netherlands.
A total of 28 sites was studied, and the sites were also described in terms of a
number of environmental variables. We have used a selection and coding from these
data made by Ter Braak (1986a). He used the six environmental variables:
WC Water content, percentage dry weight,
BS Percentage bare sand,
CM Percentage covered by moss layer,
For a more detailed discussion and interpretation of the data we refer to Van der
Aart and Smeenk-Enserink (1975) and to Ter Braak (1986a), who both performed
forms of canonical analysis. Actually Ter Braak used canonical correspondence
analysis, a form of nonlinear canonical analysis, also discussed in Ter Braak
(1986b). We merely point out some 'technical' aspects of our analysis, and we
[Figure: optimal transformations of the environmental variables (Water Content,
Bare sand, Cover moss, Light reflection, Fallen twigs, Covered herbs), plotted as
transformed values against category numbers.]
compare the linear and nonlinear solutions. It is clear that the 'explained' variances
of the transformed abundance variables increase considerably. The table does not
give the 'explained' variance of the two latent variables. For the metric analysis the
residuals are .06 and .14, for the nonmetric analysis they are .01 and .01. Thus the
latent variables in the nonmetric analysis are almost completely in the space of the
transformed environmental variables, which implies that our method is very close
to a nonmetric redundancy analysis.
The interpretation of the latent variables is facilitated, as is usual in forms of
canonical analysis, by correlating the latent variables with the transformed
variables. This gives canonical loadings. If we do this we find, for example, that
the first latent variable correlates -.75 with both Water Content and Cover Herbs,
while the second one correlates +.80 with Light Reflection and -.80 with Fallen
Twigs. The analysis clearly shows some of the advantages of nonlinear multivariate
analysis. By allowing for transformations of the variables we need fewer
dimensions to account for a large proportion of the variance. Much of the
remaining variation after a linear analysis is taken care of by the transformations,
and instead of interpreting high-dimensional linear solutions we can interpret
low-dimensional nonlinear solutions, together with the transformations computed
by the technique. Using transformations allows for simple nonlinear relationships in
the data, and the optimal transformations often give additional useful information
about the data.
CONCLUSIONS
The nonlinear extensions of path analysis discussed in this paper allow for even
more flexibility. Not only can we choose the overall structure of the analysis by
choosing a suitable path model, but within the model we can also choose the
measurement level of each of the variables separately. Or, if one prefers this
terminology, we can define a suitable class of transformations for each variable
from which an optimal one must be chosen. The use of transformations can greatly
increase the explanatory power of path models, at least for the data set in question.
Whether the transformations we obtain are indeed stable, and also increase the quality of
the predictions, is quite another matter. This must be investigated by a detailed
analysis of the stability and the cross-validation properties of the estimates, which is
a very important component of any serious data analysis.
Thus we can say that this paper adds a number of very powerful and flexible
tools to the toolbox of the ecologist, with the logical and inevitable consequence that
these new tools can lead to more serious forms of misuse than the standard tools,
which are more rigid and less powerful. The major hazard is chance capitalization,
i.e. instability, and the user of these tools must take precautions against this danger.
But if suitable precautions are taken, the path analysis methods and the
generalizations discussed in this paper provide us with a convenient and useful way
to formalize scientific theories in situations in which there is no precise knowledge
of the detailed mechanisms, or in which there are too many factors influencing the
system to make a precise deterministic description possible.
REFERENCES
FRISCH, R. 1934. Statistical confluence analysis by means of complete regression
systems. Economic Institute, University of Oslo, Norway.
GITTINS, R. 1985. Canonical analysis. Springer, Berlin, BRD.
GOODMAN, L.A. 1978. Analyzing qualitative categorical data. Abt, Cambridge,
Ma.
GOSSELIN, M., L. LEGENDRE, J.-C. THERRIAULT, S. DEMERS, AND M.
ROCHET. 1986. Physical control of the horizontal patchiness of sea-ice
microalgae. Marine Ecology Progress Series 29: 289-298.
HARRIS, R.E., AND W.A.G. CHARLESTON. 1977. An examination of the marsh
microhabitats of Lymnaea tomentosa and L. columella (Mollusca: Gastropoda)
by path analysis. New Zealand Journal of Zoology 4: 395-399.
HSIAO, C. 1983. Identification. In Z. Griliches, and M.T. Intriligator [eds.]
Handbook of Econometrics I. North Holland Publishing Co., Amsterdam, The
Netherlands
JASPARS, J.M.F., AND J. DE LEEUW. 1980. Genetic-environment covariation in
human behaviour genetics. In L.J.Th. van der Kamp et al. (eds.) Psychometrics
for Educational Debates. John Wiley and Sons, New York, NY.
JÖRESKOG, K.G., AND A.S. GOLDBERGER. 1975. Estimation of a model with
multiple indicators and multiple causes of a single latent variable. Journal of the
American Statistical Association 70: 631-639.
JÖRESKOG, K.G., AND H. WOLD. 1982. Systems under indirect observation.
North Holland Publishing Co., Amsterdam, The Netherlands.
KIIVERI, H., AND T.P. SPEED. 1982. Structural analysis of multivariate data. In
S. Leinhardt (ed.) Sociological Methodology. Jossey-Bass, San Francisco, CA.
LEGENDRE, L., AND P. LEGENDRE. 1983. Numerical ecology. Elsevier
Scientific Publishing Company, Amsterdam, The Netherlands.
LOHMÖLLER, J.B. 1986. Die Partialkleinstquadratmethode für Pfadmodelle mit
latenten Variablen und das Programm LVPLS. In L. Hildebrand et al. (eds.)
Kausalanalyse in der Umweltforschung. Campus, Frankfurt, BRD.
PEARSON, K. 1911. The grammar of science. Third Edition.
SCHWINGHAMER, P. 1983. Generating ecological hypotheses from biomass
spectra using causal analysis: a benthic example. Marine Ecology Progress
Series 13: 151-166.
SIMON, H.A. 1953. Causal ordering and identifiability. In W.C. Hood, and T.C.
Koopmans (eds.) Studies in Econometric Method. John Wiley and Sons, New
York, NY.
SPEARMAN, C. 1904. "General intelligence," objectively determined and measured.
American Journal of Psychology 15: 201-299.
TER BRAAK, C.J.F. 1986a. Canonical correspondence analysis: a new eigenvector
technique for multivariate direct gradient analysis. Ecology, in press.
TER BRAAK, C.J.F. 1986b. The analysis of vegetation-environment relationships
by canonical correspondence analysis. Vegetatio, in press.
TROUSSELIER, M., P. LEGENDRE, AND B. BALEUX. 1986. Modeling of the
evolution of bacterial densities in an eutrophic ecosystem (sewage lagoons).
Microbial Ecology 12: 355-379.
TUKEY, J.W. 1954. Causation, regression, and path analysis. In O. Kempthorne
(ed.) Statistical Methods in Biology. Iowa State University Press, Ames, Iowa.
VAN DER AART, P.J.M., AND N. SMEEK-ENSERINK. 1975. Correlation
between distributions of hunting spiders (Lycosidae, Ctenidae) and
environmental characteristics in a dune area. Netherlands Journal of Zoology
25: 1-45.
WOLD, H. 1954. Causality and econometrics. Econometrica 22: 162-177.
WRIGHT, S. 1921. Correlation and causation. Journal of Agricultural Research 20:
557-585.
WRIGHT, S. 1934. The method of path coefficients. Annals of Mathematical
Statistics 5: 161-215.
YOUNG, F.W. 1981. Quantitative analysis of qualitative data. Psychometrika 46:
347-388.
Spatial analysis
SPATIAL POINT PATTERN ANALYSIS IN ECOLOGY
B.D. Ripley
Department of Mathematics
University of Strathclyde, Glasgow G1 1XH, U.K.
1. SOME HISTORY
Spatial statistics has a long history in fields related to
ecology. Forestry examples go back at least to Hertz (1909),
and ecologists have been proposing new methods since the pioneer-
ing work of Greig-Smith (1952) and Skellam (1952). The concerns
in those early days were principally to census populations and to
detect "scales" of pattern in plant communities. These problems
are still alive today, and many methods have been proposed.
(Unfortunately the statistical problems are subtle and by no means
all these methods are statistically valid.) Some specialist
techniques such as those for enumerating game from transect counts
have a history of thirty years or more.
It seems that the computer revolution has yet to make much
3. QUADRAT SAMPLING
A traditional way to sample grassland is to use quadrats.
These are small (Scm-1m) metal squares used to select a sampling
region. Three types of sampling are in common use:
λ̂ = x/A
Under random sampling this is unbiased but its variance depends on
the spatial pattern. Intuitively, the variance will be low if
the pattern is rather regular, and high if the individuals occur
in small (relative to the quadrat) clumps. The unbiasedness of
this estimator makes it a good choice for censusing populations
whenever it is feasible.
The benchmark for spatial point patterns is the Poisson
process, the mathematical model for complete randomness. The
number of points in non-overlapping subregions are independent.
In a region of area A the total number has a Poisson distribution
of mean λA. Thus

E λ̂ = λ,   var λ̂ = λ/A
This can be used to give confidence limits on the total population
size but will be optimistic for clustered patterns.
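As a sketch of the above (with purely hypothetical counts and quadrat sizes), the unbiased estimate λ̂ = x/A and its Poisson-based confidence limits can be computed as:

```python
import math

# Hypothetical counts of plants in m = 10 quadrats of 0.25 m^2 each
counts = [3, 1, 4, 2, 0, 5, 2, 3, 1, 4]
quadrat_area = 0.25
x = sum(counts)                 # total count
A = quadrat_area * len(counts)  # total area sampled
lam_hat = x / A                 # unbiased: E(lam_hat) = lam

# Under the Poisson (complete randomness) model var(lam_hat) = lam / A,
# giving approximate 95% limits -- optimistic if the pattern is clustered
se = math.sqrt(lam_hat / A)
lo, hi = lam_hat - 1.96 * se, lam_hat + 1.96 * se
print(round(lam_hat, 2), round(lo, 2), round(hi, 2))  # → 10.0 6.08 13.92
```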
Some workers have tried to turn the dependence of var(xᵢ) on
the pattern to advantage. Many indices have been developed which
are combinations of x̄ and s², the sample mean and variance of the
4. BLOCKS OF QUADRATS
5. DISTANCE METHODS
The basis of distance methods for estimating intensity is
that if the points are densely packed the distances from each
point to its nearest neighbour will be small. Let d denote this
distance. Then dimensional considerations show that
λ̂ = m / (π Σ dᵢ²)
recommended estimator is

λ̂ = m / √[(Σ uᵢ)(Σ vᵢ)]
E n = 2λLμ

so

λ̂ = n / (2Lμ)

where

μ = ∫₀^∞ g(y) dy < ∞
var λ̂ ≈ λ(1 + σ²) / (2μL)

where n var f̂(0) ≈ σ² depends critically on independence. Such
surveys seem more useful for assessing trends in population
numbers than for estimating absolute population sizes.
Other specialized methods are available for, for example,
7. MAPPED POPULATIONS
E d = ½ √(A/N)
with

K̂(t) = (A/N²) Σ k(x, y)

the sum being over ordered pairs (x, y) of points no more than distance t apart.
Here k(x, y) is a weighting factor to allow for edge effects;
1/k(x, y) = proportion of the circle centred at x and passing through y
which is within the study region.
For a Poisson pattern ("randomness")
E K(t) = πt²
L(t) to stray from t more than 1.5/N at any t-value. This gives
a very sensitive formal significance test of randomness, but the
plot of L vs t is more useful in describing the ecologically
significant features of the pattern.
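A minimal sketch of the K̂(t) and L(t) computation follows; the point coordinates are hypothetical, and the edge-correction factor k(x, y) is deliberately omitted for brevity, so the estimate is biased low near the boundary:

```python
import math

# Hypothetical mapped pattern: (x, y) positions in a 1 m square
pts = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.7), (0.82, 0.72),
       (0.4, 0.9), (0.5, 0.5), (0.55, 0.52), (0.9, 0.1)]

def K_hat(points, t, area=1.0):
    """Naive K(t): (area/N^2) times the number of ordered pairs of
    distinct points no more than t apart.  The edge weighting k(x, y)
    is omitted here, so values near the boundary are underestimated."""
    n = len(points)
    close = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= t:
                close += 1
    return area * close / (n * n)

t = 0.1
K = K_hat(pts, t)
L = math.sqrt(K / math.pi)  # for a Poisson pattern E K(t) = pi t^2, so L(t) = t
print(round(K, 4), round(L, 4))
```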
Some examples of this analysis are shown in Figures 1-4.
All the examples are within a metre square, and all distances
are in metres. Figure 1a is a "random" pattern, a sample of a
Poisson process. Its L-plot in Figure 1b shows conformity to
L(t) = t. Figure 2 is a regular pattern, of points restrained
from being closer than 40cm apart, a feature which is seen quite
clearly in Figure 2b. The pattern in Figure 3a could be either
heterogeneity or clustering; Figure 3b indicates "clustering" at
a scale of 25cm. Finally, Figure 4a is the type of pattern
which defeats the indices referred to in section 3. As Figure 4b
shows, there is regularity, clustering and regularity at
successively increasing scales.
Biological case studies in the use of K are given by Ripley
(1981, 1985) for nest spacings, Ripley (1977) (see also Diggle
1983) for redwood seedlings and biological cells, Diggle (1983)
for bramble canes, and Pedro et al. (1984) and Appleyard et al.
(1985) for features in membranes of muscle fibres.
These summaries can be used both to suggest suitable models
for the patterns under study and to help fit such models. For
example, the studies of birds' nests concluded with a model that
inhibited pairs of nests closer than a certain distance and, for
some species, a less rigorous exclusion for slightly larger
distances. This provides both a biologically useful summary of
the pattern and reassurance that there is nothing significant in
the data not explained by such a simple description.
[Figures 1-4. Point patterns within a metre square (panels a) with their corresponding L-plots (panels b); distances in metres.]
                         species B
                      present   absent
species A   present      a        b
            absent       c        d

Under independence the expected value of a is

a = (a+b+c+d) × (a+b)/(a+b+c+d) × (a+c)/(a+b+c+d) = (a+b)(a+c)/(a+b+c+d)
and similar formulae for b,c and d. These reduce to the single
condition ad = bc. The cross-product ratio

ψ = log(ad/bc)

measures association. For ψ = 0 there is no association. If
ψ > 0 species A and B tend to occur together, whereas if ψ < 0
then segregation occurs. Another indicator of association is
the χ²-test statistic

χ² = N(ad − bc)² / [(a+b)(c+d)(a+c)(b+d)],   N = a+b+c+d
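These quantities can be computed directly from a 2 × 2 table; the counts below are hypothetical:

```python
import math

# Hypothetical presence/absence counts for two species over N quadrats
a, b, c, d = 30, 10, 15, 45  # a = both present, d = both absent
N = a + b + c + d

# Expected value of a under independence: (a+b)(a+c)/N
a_exp = (a + b) * (a + c) / N

# Log cross-product ratio: 0 means no association,
# > 0 joint occurrence, < 0 segregation
psi = math.log((a * d) / (b * c))

# Chi-squared statistic for the 2 x 2 table
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(a_exp, 2), round(psi, 2), round(chi2, 2))  # → 18.0 2.2 24.24
```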
                 neighbour
                  A     B
point     A       a     b
          B       c     d
9. EPILOGUE
REFERENCES
ANDERSON, D.R., K.P. BURNHAM, G.C. WHITE, and D.L. OTIS. 1983.
Density estimation of small-mammal populations using a
trapping web and distance sampling methods. Ecology 64:
674-680.
USHER, M.B. 1969. The relation between mean square and block
size in the analysis of similar patterns. J. Ecology 57:
505-514.
INTRODUCTION
THE METHOD
EXAMPLES
[Figure: study transect, scale 0-100 metres.]
quadrats of the block. When fruits were nearly ripe but not yet
abscised, the infructescences were harvested. Fecundity was
calculated as the number of fruits divided by the number of
flowers. The unbagged infructescences were attacked heavily by
animals, so that analyses involving fecundity consider only the
inner 64 quadrats. Since 20 of these quadrats contained only
males, fecundity could be defined for only 44 quadrats. The
variables analyzed were Aralia density (numbers of male plus
female ramets), percent female per quadrat, and three habitat
variables, density of Clintonia borealis (Ait.) Raf.
(Liliaceae), development of bracken (and shrubs), and canopy
449
Male-Male
Meters
Transect 10 20 30 40 50 60 70 80 90 100
1 + +
2 + +
3 + +
4 + +
5 + +
Female-Female
Meters
Transect 10 20 30 40 50 60 70 80 90 100
1 +
2 + +
3 + +
4 +
5 + + + +
Male-Female
Meters
Transect 10 20 30 40 50 60 70 80 90 100
1 + + + + + +
2 + + + + +
4 + + + + +
5 + + + + + + + +
Note: Entries in the table show the signs of deviations significant at P < 0.05.
[Figure: spatial correlograms of Moran's I (−0.25 to 0.50) against distance for the habitat variables BR and CA; panels a and b.]
Table 2. Spatial autocorrelation coefficients I for three flower census variables in A. hispida on
three dates in 1984.
Distance classes in m
4 8 12 16 20 24 28 32 36 40
Number of male flowers in bloom
10 July .19*** .01 .00 -.04* -.02 .00 -.04 .01 .00 -.01
18 July .17*** -.04* -.02 .00 -.04** .03 .01 .02 .00 -.02
18 July .17*** -.01 .01 -.01 -.02 -.02 -.03* -.03 .01 .00
Percent female flowers in bloom
10 July .28*** .10*** -.01 -.06** -.03 -.03 -.05 -.03 -.02 -.04
18 July .14*** .00 .05*** .05** -.06** -.05* -.06* -.06* -.05 -.04
ACKNOWLEDGEMENTS
REFERENCES
BARRETT, S.C.H., AND K. HELENURM. 1981. Floral sex ratios
and life history in Aralia nudicaulis (Araliaceae).
Evolution 35:752-762.
BARRETT, S.C.H., AND J.D. THOMSON. 1982. Spatial pattern,
floral sex ratios, and fecundity in dioecious Aralia
nudicaulis (Araliaceae). Can. J. Bot. 60:1662-1670.
BAWA, K.S., C.R. KEEGAN, AND R.H. VOSS. 1982. Sexual
dimorphism in Aralia nudicaulis L. (Araliaceae). Evolution
36:371-378.
TOBLER, W.R. 1975. Linear operators applied to areal data, p. 14-37. In J.C. Davis and M.J. McCullagh [ed.] Display and analysis of spatial data. John Wiley, London.
INTRODUCTION
MULTIDIMENSIONAL SCALING
CONDITIONAL CLUSTERING
CONSTRAINED CLUSTERING
FRACTAL THEORY
PATH ANALYSIS
SPATIAL AUTOCORRELATION
CONCLUDING REMARKS
REFERENCES
INTRODUCTION
In discussing the use of techniques, it is first necessary to note the aims of the potential
users of those techniques, in order to judge whether they are applicable. Some of the main aims
of benthic community ecologists include the following:
Aims (3) and (4) are in the forefront of benthic ecology at present, but are only discussed briefly
in passing, since most of the techniques dealt with at the workshop are more relevant to the first
two. In considering aims (1) and (2), there are three alternative approaches in relating biotic and
environmental data:
a) Analyse patterns in the biotic data first, then relate these patterns to environmental factors;
b) Analyse patterns in the environmental data, then relate these to changes in the biotic data
(common in pollution studies);
c) Analyse the patterns and relationships within and between biotic and environmental data
simultaneously.
All three approaches have been used in benthic ecology for some 20 years. Conventional
clustering and classical scaling (ordination) have been used for analysing patterns in both biotic
and environmental data (Legendre and Legendre 1983), whereas canonical correlations have been
used rather rarely to analyse patterns in both biotic and environmental data simultaneously. In
this report, we present an overview of methods that appear to be of potential use to benthic
ecologists, although they may have only been tested so far in other fields, such as psychometrics.
Sampling
Benthic sampling methods to a large extent dictate the kind of data collected and therefore
the type of analysis that might be appropriate. Data collected in the littoral region, and by
photography or underwater by SCUBA may be quantitative, and the exact position may be
mapped by co-ordinates. Since relatively immobile organisms are collected, and they are
essentially in a two-dimensional surface, this type of sampling may utilize grid, quadrat, or
transect techniques; the methods and data are to a large extent similar to those collected by plant
ecologists. In deeper waters, benthic ecologists are often forced to sample "blind", using grabs
and/or cores from ships to collect quantitative samples at roughly positioned locations, or using
cruder dredges and trawls dragged over an unmeasured area to collect at best semi-quantitative
data. The scale of observation is an important consideration in interpreting ecological structures.
Most benthic data are obtained from biased sampling. The bias is typically in one direction
(under-estimation) and thus the bias cannot be "averaged out" by sampling in different ways.
Benthic ecologists often wish to compare things (sites, times, conditions) which are estimated
with different biases (e.g., comparing communities on sand versus mud using a grab, which will
of course penetrate differently in sand and mud).
The sampling design needs to take into account the numerical methods which are to follow.
This is critical for meaningful results and interpretation. Furthermore, the high cost of obtaining
raw benthic data usually prevents feedbacks from the analysis to the sampling design and
analytical procedures.
Data pre-treatment
When one is looking for structure in the biotic data, it is often advisable to transform data
in order to stabilize variances, and there are good arguments for recommending either Y =
log(X + c), where 0.2 < c < 1 (logarithmic transform), or Y = X^0.25 (fourth-root transform).
The value of c appears to have little influence on the ability of the transformation to stabilise the
variance. Both transformations are special cases of the general Taylor's power variance-to-mean
relationship.
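As an illustration with hypothetical grab counts, either transformation markedly reduces the spread of skewed count data, and the choice of c matters little:

```python
import math

# Hypothetical species counts from replicate benthic grabs
counts = [0, 2, 5, 17, 130, 4, 0, 61, 3, 9]

c = 0.5  # any 0.2 < c < 1 behaves similarly
log_t = [math.log10(x + c) for x in counts]  # Y = log(X + c)
root_t = [x ** 0.25 for x in counts]         # Y = X^0.25

def spread(v):
    """Sample variance, as a crude measure of how stretched the values are."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

print(round(spread(counts), 1), round(spread(log_t), 2), round(spread(root_t), 2))
```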
NUMERICAL METHODS
Table 1 summarises the main aims of benthic community ecologists in the columns, with
some of the numerical techniques discussed during the workshop as rows. The columns describe
categories of ecological questions to be investigated, from analysing biotic distribution patterns in
space only (sites x species data), to consideration of both space and time (sites x species x times),
to relating the 2- to 3-dimensional biotic features to environmental ones, and finally to questions
concerned with modelling or analysing how the systems function. The techniques (rows) are
approximately arranged from simpler to more complex under each heading, and at the same time,
in general from more to less dependent upon assumptions.
Some of the main features of the techniques and their potential for benthic ecology are
highlighted below.
Principal Components Analysis (PCA; Gower, this volume): This should be restricted to
analysing the correlation or covariance structure among variables (e.g., species) and care should
be taken since it may be sensitive to non-linearity and non-normality in data. With Principal
Co-ordinates Analysis (PCO), one can achieve the same solution starting from a matrix of
inter-object (e.g., site) distances, with the advantage that one can choose different measures of
inter-site distance. Classical metric scaling is equivalent to Principal Co-ordinates Analysis.
Correspondence Analysis (of contingency table count data) differs in that one is tied to the
chi-squared distance measure. Detrended Correspondence Analysis is not recommended, since
"horseshoes", if they occur, show real relationships in the data.
Non-metric Scaling (Carroll, this volume): Here one is finding a reduced space solution
that preserves the rank order of inter-object distances (monotonicity), as opposed to the linear
relationship of classical (metric) scaling. Non-metric scaling has the advantage of robustness in
that it is not sensitive to outliers (e.g., chance occurrence of one individual of massive biomass in a
site).
themselves with this technique. The framework and its methods should be explored by
experienced ecologists and the methods compared.
Asymmetric Matrix Analysis (de Leeuw, this volume): The resolution of matrices into
two, one symmetric (for example where interactions are reversible) and the other asymmetric
(e.g., irreversible interactions), may have applications in showing successional and competitive
phenomena in benthic ecology. There are no known published benthic examples to date.
Unfolding (Heiser, this volume): Unlike other scaling techniques, it applies directly to a
rectangular matrix (e.g., sites vs species distances, or species affinities for different sites). It aims
at producing a geometric representation in a subspace of reduced dimension maximising
conservation of rank-order relationships of distances among species, among sites, and between
sites and species. A behavioral analogue is given by a rectangular matrix of boy-girl relationships,
from which unfolding may infer two triangular matrices, one of girl-girl relationships and another
of boy-boy relationships. It produces a true joint-space (as opposed to a projection), unlike other
techniques such as PCA. It has great potential, but there are no ecological examples except that of
Heiser (this volume) and it needs exploring.
Path Analysis (de Leeuw, this volume): This is a way of testing the fit of an a priori
model of a causal structure, by means of generalized least squares (e.g., as an interpretation of a
matrix of correlations among variables). Non-linear path analysis is the non-parametric
equivalent. In both cases the structure is expressed as a web of arrows joining the variables.
Current methods are capable of handling unobserved latent variables in the causal structure, a
potentially useful feature. The path analysis structure diagram may be a useful complement to
regression and contingency table techniques already in use, but there are limitations to its use in
systems with feed-backs, such as many ecological ones.
Procrustes Analysis (Gower, this volume): With species sampled at different times,
Procrustes Analysis can be used to measure the relative variability of each species with time.
489
Similarly, within-site variability can be compared from site to site if replicate samples are taken at
each site. Different sampling devices or techniques can also be compared. Another application
would be to compare matrices based on biological and environmental data. It has been applied to
marine ecological data by Fasham and Foxton (1979) who compared various environmental
hypotheses for goodness of fit to the biotic data. It appears to have great potential in benthic
ecology.
Individual Distance Scaling (INDSCAL) (de Leeuw, this volume): This is a metric
method for comparing Euclidean distance matrices. There are no known benthic examples, and
the method needs exploring. The non-metric version has degenerate solutions.
3-Way Unfolding (Heiser, this volume): A three-way version exists but has not been
tested. It has potential in benthic ecology but the large amount of data required may limit its
application in practice.
Clustering
Fuzzy Sets (Bezdek, this volume): The idea of fuzzy sets is intellectually appealing, since
there is no reason to believe that benthic communities are discrete and disjunct. The concept of
fuzzy sets is intermediate between those of clustering and ordination. The techniques for
delineating fuzzy sets involve easy algorithms, and one should try to use several of them to gauge
the stability of the solutions with each particular data set. In particular, it is worth exploring the
C-means algorithm for use on benthic data, and using output from this to speed up the more
time-consuming maximum-likelihood function for fuzzy sets.
Constrained clustering (P. Legendre, this volume): This is useful for tracing successional
data, and for exploring the historical and spatial evolution of dispersion. One should try both
constrained and un-constrained analyses on the same data. The technique has been used in
ecology and needs further application. It may also be possible to test a null hypothesis such as
that there is no spatial auto-correlation (no patches) against a specific alternative hypothesis, in
order to investigate the processes underlying patch formation. One may be able to test the
clustering of the (biotic) x-variables in the environment space by setting up a connection matrix on
the basis of similarity of environmental variables. However, further investigation into the logical
validity of such hypothesis testing is needed.
Spatial analyses
Fractal theory (Frontier, this volume): This describes how a structure may occupy a
space of dimension greater than the structure itself (e.g., surface or volume). It may be of use in
describing the physical dimension of a niche such as the rugosity of hard substrata, or in
predicting the surface area available as an environment at the appropriate scale for particular
organisms (the area available for larval settlement, or growth, or photosynthesis). Changes in
fractal dimension might account for scale transitions which imply changes in structural or
functional properties of the object/system (e.g., transition from a physical to a biological scale). Its
utility in describing soft sediments is unclear at present and examples are needed.
Kriging (Matheron 1969, 1970; Scardi et al. 1986): This is an interpolation technique
useful for mapping and contouring single variables (e.g., species densities, biomasses, sediment
parameters). Kriging also provides an estimate of the interpolation error for each point, which
may indicate where more sampling is needed or where spatial patterns are very irregular. It
appears to be an improvement on trend-surface analysis. Since Kriging is based on variograms, it
should be regarded as a complex and powerful tool for spatial analysis rather than as a simple
interpolator.
Spatial Autocorrelation (Sokal and Thomson, this volume): The correlogram is useful for
revealing spatial patterns of a single variable (e.g., density, or a compound or a discontinuous
variable). It can be used to show patterns such as clines, isotropy and anisotropy. It has been
successfully used in benthic ecology to demonstrate the scale of variation of single species.
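A minimal sketch of one point of such a correlogram, using hypothetical site coordinates and densities, computes Moran's I for a single distance class:

```python
import math

# Hypothetical densities at 5 sites with coordinates in metres
sites = [(0, 0), (4, 0), (8, 0), (12, 0), (16, 0)]
z = [10.0, 12.0, 11.0, 3.0, 2.0]

def morans_I(coords, values, d_lo, d_hi):
    """Moran's I for the distance class (d_lo, d_hi]: weights w_ij = 1
    when the inter-site distance falls in the class, 0 otherwise."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0
    W = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            if d_lo < d <= d_hi:
                num += dev[i] * dev[j]
                W += 1
    den = sum(e * e for e in dev) / n
    return (num / W) / den

print(round(morans_I(sites, z, 0, 4), 2))  # neighbours 4 m apart
```

Positive values in the first distance class, as here, indicate that nearby sites carry similar densities, i.e. patchiness at that scale.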
The Mantel Test (Sokal and Thomson, this volume): This test is useful for comparing
distance matrices. It has been used successfully for analysing spatial and spatio-temporal
relationships. It appears to have much potential for more general use, e.g. for comparing biotic
and environmental dissimilarity matrices.
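A sketch of a simple Mantel test on two small hypothetical distance matrices, using the cross-product statistic with a permutation null distribution:

```python
import random

# Hypothetical 4 x 4 biotic and environmental distance matrices
bio = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 2], [5, 4, 2, 0]]
env = [[0, 2, 5, 6], [2, 0, 4, 5], [5, 4, 0, 1], [6, 5, 1, 0]]

def mantel_stat(A, B, order):
    """Sum of cross-products of the upper triangles, with the rows and
    columns of B permuted by `order`."""
    n = len(A)
    return sum(A[i][j] * B[order[i]][order[j]]
               for i in range(n) for j in range(i + 1, n))

n = len(bio)
observed = mantel_stat(bio, env, list(range(n)))

# Permutation null distribution: permute objects of one matrix
random.seed(1)
perms = 999
ge = sum(1 for _ in range(perms)
         if mantel_stat(bio, env, random.sample(range(n), n)) >= observed)
p = (ge + 1) / (perms + 1)
print(observed, p)
```

With only four objects the permutation distribution is tiny; real applications use many more sites, where the permutation p-value becomes meaningful.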
Point Pattern Analysis (Ripley, this volume): In contrast to spatial autocorrelation, this is
used to analyse spatial patterns described by co-ordinates in space (as opposed to continuous
variables with values at each point). The K(t) method depends on having all the organisms
mapped and counting the average number of organisms within a radius t of each organism in turn.
Distances need not be exact; one needs to know the positions to about 1/3 of the distance between
points (preserving rank order). It would be useful for describing univariate patterns of
aggregation and dispersion, when mapping of the benthic species is possible.
DISCUSSION
Description and analysis have been emphasized, rather than hypothesis testing.
"Significance testing" can be a good screening method preceding descriptive multivariate analysis
(NOT to validate "significance"!). For example, one can perform a test of sphericity (H₀: |R| = 1)
and IF the null hypothesis is rejected, then proceed to describe the correlation structure. If the null
hypothesis is not rejected, then there is no evidence of any correlation structure to describe, and
the analysis should be abandoned. Another example is provided by contingency table data,
including multiway tables. Beginning with log-linear models, one tests the highest level
interactions first and the main effects last, in the normal way. The table should be collapsed over
dimensions not involved in significant interactions. Correspondence analysis can be performed on
this reduced table; in effect this is an approach which describes a sufficient model representation
of the data. It is, in a sense, a testing procedure for descriptive multivariate methods such as
clustering and ordination, in that one has found evidence that there is structure present to be
described.
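As an illustration of this screening step, Bartlett's test of sphericity for H₀: |R| = 1 can be sketched as follows (the correlation matrix and sample size are hypothetical):

```python
import math

# Hypothetical correlation matrix R for p = 3 variables at n = 50 sites
R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
n, p = 50, 3

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Bartlett's test of sphericity, H0: |R| = 1 (no correlation structure)
chi2 = -(n - 1 - (2 * p + 5) / 6) * math.log(det3(R))
df = p * (p - 1) // 2
print(round(chi2, 1), df)  # compare with chi-squared on df degrees of freedom
```

A large statistic rejects H₀ and licenses a descriptive analysis of the correlation structure; a small one suggests there is nothing to describe.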
The most promising methods for benthic ecology are also promising in other areas of
ecology. At present each group or school has its favourite techniques and computer programs,
and tends to put data through them. It is not yet clear to what extent the more traditional
techniques such as PCA, metric scaling (PCO), and canonical correlation analysis give distorted
results when data are increasingly heterogeneous, full of zeros, and assumptions about linearity do
not apply. The newer non-metric techniques such as non-linear non-metric scaling, asymmetric
matrix analysis and unfolding are all very attractive because of their generality and lack of
assumptions about the data. Their generality and approximate (loose) nature may make them
particularly suited to analysing ecological data which has much in common with psychometric data
with regard to their approximate nature. However, it is worth noting that, as with non-parametric
statistical methods, one loses in power and rigor what one gains in generality; this is especially
true if one wishes to turn the description into some sort of predictive model afterwards. At the
same time, benthic ecologists have developed some expertise at more conventional techniques,
which in general have given interpretable results, and it will have to be demonstrated that the gain
in robustness and flexibility is worth the effort of learning to use new sets of techniques with
many variants.
In exactly the same way, traditional clustering techniques have become part of the
standard tool box of benthic community ecologists. Conditional clustering, fuzzy sets and
constrained clustering are to a large extent untested and hold much promise for the future.
The spatial analysis techniques all have applications in benthic ecology. Perhaps the most
exciting is the Mantel test, which has applications on all types of data, including relationships
between species, space, time, and environmental factors (Table 1). In particular, this may be
combined with the descriptive technique of constrained or weighted scaling.
Table 1. Relationships of some principal aims of benthic ecologists to numerical techniques. See
text for details. Key: • = applies, NR = not recommended, (1) = univariate analysis, blank =
inappropriate.
QUESTIONS / AIMS
Metric
PCA and Biplot
PCO
Correspondence A.
Detrend. corresp. A. NR NR
Non-metric
N-MScaling
NL N-M Scaling
Asymm. Matrix A.
Unfolding
Path Analysis
n-WAY SCALING
INDSCAL
Canonical Corr.
Multiple Corresp. A.
Constrained Scaling
Procrustes
Unfolding (3-Way)
CLUSTERING
Conventional ?
Conditional
Fuzzy Sets
Constrained
SPATIAL ANALYSIS
Fractals
Kriging (1) (1)
Autocorrelation (1)
Mantel
Point pattern (1)
The two techniques of asymmetric matrix analysis and path analysis are the only methods
considered at the workshop which spill over directly into the important area of generating and
testing hypotheses about how benthic systems function. The new methods of approximate
reasoning (Bezdek, this volume; L. Legendre et al., this volume) also have exciting possibilities
for generating and testing ecological hypotheses.
It is clearly very important that traditional and newly available techniques be evaluated and
compared using different types of data by experienced ecologists and data analysts working
together. This evaluation procedure may be referred to as gauging (see also de Leeuw, this
volume). It is proposed that a gauging workshop be held. Both aspects of gauging are important:
a) Varying the techniques, coefficients and, where appropriate, distance measures on common
data; and,
b) Analysing different types of real or artificial data (more or fewer empty cells, semi-quantitative,
quantitative, continuous and contingency data) using a common technique.
In particular, the traditional scaling techniques need to be compared with the many
variants available from the Gifi School of Leiden (de Leeuw, this volume; Heiser, this volume)
and traditional and newer clustering techniques need to be compared with the fuzzy set algorithms
(Bezdek, this volume). This should result in the production of a guide to the suitability of the
techniques to each purpose and type of data, so that appropriate data may be collected in the first
place. Only after such exercises will it be possible to recommend confidently which of the old
and which of the exciting newly available techniques are most appropriate for which type of data,
and which are robust or sensitive, and to what. It is very likely that benthic ecologists will still be
advised to perform several analyses on each data set, with most confident interpretation of the
patterns and relationships when the results of several techniques agree.
REFERENCES
Fasham, M.J.R., and P. Foxton. 1979. Zonal distribution of pelagic Decapoda (Crustacea) in
the eastern North Atlantic and its relation to the physical oceanography. J. exp. mar. Biol.
Ecol. 37: 225-253.
Green, R.H. 1979. Sampling design and statistical methods for environmental biologists.
Wiley, New York. 257 p.
Legendre, L., and P. Legendre. 1983. Numerical ecology. Elsevier, Amsterdam. 419 p.
Matheron, G. 1969. Le krigeage universel. Cah. Cent. Morphol. Math. 1: 1-83.
Matheron, G. 1970. La théorie des variables régionalisées et ses applications. Cah. Cent.
Morphol. Math. 5: 1-212.
Scardi, M., E. Fresi, and G.D. Ardizzone. In press. Cartographic representation of sea-grass
beds: application of a stochastic interpolation technique (kriging). In C.F. Boudouresque, A.
Jeudi de Grissac and J. Olivier [ed.] 2nd International Workshop on Posidonia oceanica
beds. G.I.S. Posidonie Publ., France.
DATA ANALYSIS IN PELAGIC COMMUNITY STUDIES
Jordi Flos* (Chairman), Fortunato A. Ascioti, J. Douglas Carroll, Serge Dallot, Serge Frontier,
John C. Gower, Richard L. Haedrich, and Alain Laurec
Fig. 1. Scheme of the way in which data are obtained from the environment. [Diagram: physical processes ("continuous") and biological processes ("discrete"), together with external information, generate dissipative structure, from which the DATA are obtained.]
[Figure: data tables for a TEMPORAL PROCESS (variables × time, with sequential sampling times t(n+1) = t(n) + Δt(n)) and a SPATIAL STRUCTURE (variables × sites); accompanying ordination diagrams show "horseshoe" configurations of samples in planes of the first principal components.]
The two axes reflect almost banal information but the analysis
reduces dimensionality from 8 hydrographic variables to 2
components, which is useful for geographical description of the
hydrographic situation. To extract more information, three groups
[Figures: vertical profiles (0-1000 m) of principal component scores at stations E12 and E13, and trajectories of samples in the PC1-PC2 plane, annotated with depth in metres, at several stations (including Stations 5 and 2).]
CLUSTERING TECHNIQUES
FRACTALS
PATH ANALYSIS
SPATIAL ANALYSIS
CONCLUSIONS
REFERENCES
INTRODUCTION
Numerical techniques used in biological oceanography and limnology must take into
account ecological hypotheses to be tested and the specific nature of aquatic data. These
techniques can be used either for examining data sets (exploratory data analysis) or for estimating
population or subpopulation characteristics from samples (inferential statistics). These two
approaches now tend to be considered as complementary, and there will be continuing need for
both exploratory and critical/confirmatory methods (Mallows and Tukey 1982). To some extent,
through appropriate sampling design and experimental planning, the nature of the data can be
controlled to accommodate constraints of the numerical methods. Nevertheless, it often occurs
that aquatic data do not meet the basic assumptions of the numerical techniques (see below).
Despite these problems numerical methods can provide new ecological insights, which is precisely
the aim of numerical ecology.
Scaling techniques discussed in this volume, together with other feature extraction and
display methods (e.g., linear projection pursuit, Sammon mapping, and triangulation: Friedman
and Tukey 1974; Biswas et al. 1981; Huber 1985), are in an area of rapidly advancing computer
data analysis aimed at the visual perception of high dimensional data (> 3d). While such aims are
not new, the field has been reactivated by recent advances in computer technology that allow the
analyst a high level of rapid, dynamical interaction with the data (Becker and Chambers 1984).
For example 3-d data can be visualized as a rotating cloud of points, using computers capable of
rapidly displaying 2-d projections. A fourth dimension can be depicted with the use of colour
(Breiman and Friedman 1982; McDonald 1982). For a more extensive discussion of feature
analysis as used for graphical interpretation of multidimensional data, refer to section 2.B of
Bezdek (this volume).
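The rotating-cloud display can be sketched without graphics hardware: a minimal numpy illustration (data and all names hypothetical) in which a 3-d cloud is rotated and projected onto two dimensions. At plotting time, a fourth variable would be mapped to colour.

```python
import numpy as np

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))        # 100 points in 3-d

def rotate_and_project(points, angle):
    """Rotate a 3-d cloud about the z axis and project onto the x-y plane."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return (points @ rot.T)[:, :2]       # keep the first two coordinates

# a sequence of 2-d projections approximates the "rotating cloud" display
frames = [rotate_and_project(cloud, a) for a in np.linspace(0, np.pi, 8)]
```

Displaying these frames in rapid succession gives the depth cue the text describes.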
Scaling techniques can be used with two types of data: measurements and assessments.
Principal component or correspondence analyses are limited to the first type of data, while
multidimensional scaling and related methods (Carroll, this volume) can be applied to both types of data.
On one hand, quantitative or qualitative variables observed for each object (site, sample, etc.) may
be used to compute similarity or distance matrices. On the other hand, similarities or distances
between objects can be defined by the observer as based on global appreciation. For example, the
relative strength of vertical mixing at various stations is easier to infer than to measure directly.
The same is true for ecological niches. However, to extract useful information, data must be
organized a priori in an ecologically meaningful way. Analyses always give information, and
even the result that the data set cannot be approximated by a given analytical model is useful
information in itself.
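The first route, computing a distance matrix from variables observed at each object, can be sketched as follows (a minimal numpy example; the table of site measurements is invented for illustration):

```python
import numpy as np

# hypothetical measurements: 5 sites x 3 variables (e.g. temperature,
# salinity, chlorophyll), standardized so the variables are commensurate
X = np.array([[14.0, 35.1, 0.8],
              [13.5, 35.0, 1.2],
              [9.0, 34.2, 2.5],
              [8.8, 34.1, 2.7],
              [11.0, 34.6, 1.6]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance matrix between objects (sites)
D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
```

Any of the many similarity or distance coefficients could replace the Euclidean distance used here.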
One problem common to biological oceanography and limnology is that aquatic data are
in general strongly autocorrelated in both space and time. This means that values observed at
given points in space and/or time are to some extent functions of values observed at other points.
This can obscure relationships among the observed variables, so that sampling constraints may
become as severe or more severe than numerical problems. In principle, standard scaling
techniques do not give optimal representations when applied to autocorrelated data. Autocorrelation can, however, be taken into account in scaling techniques, although this is not routinely done in numerical ecology. One possibility could be to compute a correlation matrix on the basis of the
spatio-temporal coordinates of the objects, followed by the method of Aragon and Caussinus
(1980) for principal component analysis with correlated statistical units. Another possibility could
be to use Procrustes analysis (Gower, this volume) or PCARIV (principal component analysis with respect
to instrumental variables: Rao 1965; Bonifas et al. 1984), to compare the scaling of the objects
based on their spatio-temporal coordinates to that computed from measured variables. When there
is more than one subset of variables (e.g., comparing environmental to biological variables),
methods analogous to partial correlation but for Procrustes or PCARIV could be useful in
exploring the relationships between the scalings of these subsets while controlling for the
spatio-temporal scaling. Such methods remain to be developed, following for example the
generalized procedures of Carroll and Chang (1970), Gower (1975) or Escoufier (1980) (see also
the June 1985 issue of Statistique et analyse des donnees). One could also use spatio-temporal
coordinates as additional variables to put restrictions on the ordination when applying techniques
such as unfolding (Heiser, this volume).
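The Procrustes comparison of two scalings can be illustrated with a minimal orthogonal-Procrustes sketch (numpy only; the configurations are synthetic and all names hypothetical):

```python
import numpy as np

def procrustes_residual(A, B):
    """Orthogonal Procrustes: rotate configuration B onto A and return the
    residual sum of squares (0 means the two scalings agree perfectly)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    U, _, Vt = np.linalg.svd(B.T @ A)    # solves min_R ||A - B R||
    R = U @ Vt
    return ((A - B @ R) ** 2).sum()

rng = np.random.default_rng(1)
conf = rng.normal(size=(20, 2))          # e.g. a spatio-temporal scaling
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
same = conf @ rot                        # the same configuration, rotated
other = rng.normal(size=(20, 2))         # an unrelated scaling
```

A small residual indicates that the two ordinations carry essentially the same configuration.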
Some scaling techniques, such as unfolding and correspondence analysis, allow dual
projection of variables and objects in the ordination space. This may be useful in biological
oceanography and limnology, in order to relate objects and variables. One must realize, however,
that unfolding represents the proximities between objects and variables but not the proximities
among objects. This contrasts with some other scaling techniques (e.g., principal component
analysis or multidimensional scaling), where proximities among objects and also among variables
can be represented but where proximities between objects and variables are not immediately
accessible.
CLUSTERING TECHNIQUES
and conditional clustering (Lefkovitch, this volume). These are not mutually exclusive methods:
i.e., constraints such as those discussed by Legendre (this volume) can be applied to data that are
clustered using any method, including the fuzzy sets approach. In the aquatic context, where
samples are often spatio-temporally autocorrelated, it appears that clustering under constraints
leads to more realistic subsets. Constrained clustering has the additional advantage of reducing
the number of pairwise comparisons, thus facilitating rapid processing of large data sets such as
satellite images or flow cytometric records.
Fuzzy set algorithms give a relative measure of association of each object or variable to
each cluster, thus defining inliers and outliers. In the special case when there is insufficient
information to properly assign objects or variables to any one cluster, these become outliers of
fuzzy sets. It is unrealistic in biological oceanography and limnology to assume that each object
or variable should be a member of one and only one cluster. Conditional clustering offers the
possibility for any object or variable to become a member of two or several overlapping clusters.
It is unclear for the time being which of the above clustering approaches will lead to more
ecologically meaningful results, and this may well vary from case to case. In all instances it would be
advisable to cluster the data with algorithms from several different families, so as to judge the
robustness of the resulting clusterings. In the case of robust ecological structures, different
methods should lead to relatively similar results.
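The fuzzy-membership idea can be illustrated with a bare-bones fuzzy c-means sketch (Bezdek's algorithm in outline only; the data and all names are hypothetical, and the ambiguous object deliberately sits between the two patches):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means: returns a membership matrix U (n objects x c
    clusters) whose rows sum to 1, and the cluster centres."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))    # standard membership update
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centres

# two hypothetical patches of samples plus one ambiguous object in between
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 4, [[2.0, 2.0]]])
U, centres = fuzzy_cmeans(X, c=2)
```

The ambiguous object receives intermediate memberships in both clusters, exactly the behaviour that defines an outlier of fuzzy sets.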
FRACTALS
It appears that, in lakes and oceans, several physical and biological structures have fractal
dimensions (e.g., ergocline structures such as fronts, pycnocline, ice-water and water-sediment
interfaces: Legendre et al. 1986). These structures may be characterized by complex geometry,
changing species diversity, patchiness, high biological production, and so on. Considering the
complexity of these phenomena occurring at different scales, and the lack of models, it is
presently very difficult to appropriately sample such environments. Fractal theory (Frontier, this
volume) may provide the framework for modelling complex aquatic ecosystems, and designing
new sampling approaches and new techniques for numerical data analysis (e.g., Ibanez 1986;
Ibanez and Etienne, submitted).
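One common way to estimate a fractal dimension from sampled points is box counting; a rough numpy sketch (not taken from the works cited; the smooth test curve should have a dimension near 1):

```python
import numpy as np

def box_count_dimension(points, eps_list):
    """Estimate the box-counting dimension: count occupied boxes N(eps) at
    several scales and fit the slope of log N against log(1/eps)."""
    counts = []
    for eps in eps_list:
        boxes = {tuple(idx) for idx in np.floor(points / eps).astype(int)}
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1.0 / np.array(eps_list)), np.log(counts), 1)
    return slope

# sanity check on a densely sampled smooth curve (dimension close to 1)
t = np.linspace(0.0, 1.0, 20000)
curve = np.column_stack([t, np.sin(2 * np.pi * t)])
dim = box_count_dimension(curve, [0.1, 0.05, 0.025, 0.0125])
```

For a genuinely fractal structure (e.g. a sampled front or interface), the fitted slope would fall between the topological and embedding dimensions.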
PATH ANALYSIS
One of the major objectives of biological oceanography and limnology is to uncover cause
and effect relationships in aquatic ecosystems. In all the available numerical techniques, there are
implicit and/or explicit hypotheses concerning the causal relationships among variables. Available
techniques allow different levels of input from the scientists, different levels of precision of
causality, and different levels of model rigidity. For example, multiple linear regression requires
limited input from the scientists, assumes very precise causal relationships, and is therefore a very
rigid model. As a consequence, multiple linear regression may be poorly adapted to modelling
ecological relationships, and may thus give the investigator a false sense of confidence concerning
his quantitative understanding.
Path analysis is a more sophisticated method, where the scientist defines the causal
relationships from a priori knowledge by specifying the paths among variables. These variables
may also include "latent variables", which are not observed by the investigator but whose paths can
be included in the model. In path analysis, only the direction of the causality is specified, which
makes it a less rigid model than for instance multiple linear regression. "Nonlinear path analysis"
(de Leeuw, this volume) can be applied to quantitative, semi-quantitative and qualitative ecological
data (e.g., PATHALS algorithm). In ecology, path analysis is used to explore the consequences
of hypotheses concerning causal relationships among variables, given the computed regression
and correlation coefficients.
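The computational side of path analysis can be sketched minimally: path coefficients are standardized partial regression coefficients along the specified paths (the causal scheme and all variable names below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# invented causal scheme: light -> biomass, and light + biomass -> production
light = rng.normal(size=n)
biomass = 0.8 * light + rng.normal(scale=0.6, size=n)
production = 0.5 * light + 0.4 * biomass + rng.normal(scale=0.5, size=n)

def standardize(v):
    return (v - v.mean()) / v.std()

Z = np.column_stack([standardize(light), standardize(biomass)])
zy = standardize(production)

# path coefficients into "production" = least-squares coefficients
# computed on the standardized variables
paths, *_ = np.linalg.lstsq(Z, zy, rcond=None)
# the single path light -> biomass is simply their correlation
path_lb = np.corrcoef(light, biomass)[0, 1]
```

Comparing the fitted coefficients with those implied by a hypothesized causal scheme is how the consequences of the hypothesis are explored.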
volume). Path analysis and approximate reasoning show promise for biological oceanography
and limnology.
SPATIAL ANALYSIS
In the field of spatial pattern analysis, a distinction is made between two situations.
(1) Point patterns (Ripley, this volume), in which the data are the spatial coordinates of the
objects, e.g., the location of spawning sites or whale sightings. (2) Cases where variables
changing continuously over space and time are sampled at discrete points whose coordinates are
determined by the observer (surface patterns). In general, biological oceanographers and
limnologists are mainly concerned with the latter situation.
Methods of spatial analysis (Sokal and Thomson, this volume), in particular those that
treat anisotropic environments, can be readily applied in oceanography to satellite images.
Normally, field data result from the interaction of two spatio-temporal patterns: that of the
measured variables (natural pattern) and that of the sampling design (sampling pattern). This
means that, for the same natural pattern, different spatio-temporal sampling patterns may give
different results (Ibanez 1973a, 1976). This problem is magnified in aquatic environments, as
natural patterns change rapidly in both space and time. This is obviously a fundamental problem
in limnology and oceanography, which will require further advances in the methods of
spatio-temporal analysis. It appears that methods such as partial Mantel analysis (Dow and
Cheverud 1985; Hubert 1985; Smouse et al. 1986) may be a step towards controlling for the
spatial and temporal organization of the samples.
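A partial Mantel statistic of the kind cited can be sketched as the correlation between two distance matrices after the linear effect of a third has been removed, with significance assessed by permutation (a simplified sketch, not the exact procedure of the cited authors; the data are synthetic):

```python
import numpy as np

def distmat(X):
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

def offdiag(M):
    return M[np.triu_indices(len(M), k=1)]

def partial_mantel(A, B, C, n_perm=199, seed=0):
    """Correlate distance matrices A and B after removing the linear effect
    of C, assessed by permuting the objects of A."""
    def residual(x, z):
        return x - np.polyval(np.polyfit(z, x, 1), z)
    a, b, c = offdiag(A), offdiag(B), offdiag(C)
    r_obs = np.corrcoef(residual(a, c), residual(b, c))[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(len(A))
        ap = offdiag(A[np.ix_(p, p)])
        if np.corrcoef(residual(ap, c), residual(b, c))[0, 1] >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
coords = rng.random((12, 2))             # sampling positions
env = rng.random((12, 1))                # a hypothetical environmental variable
species = env + rng.normal(scale=0.05, size=(12, 1))  # tracks the environment
r, p = partial_mantel(distmat(env), distmat(species), distmat(coords))
```

A small permutation p-value here indicates an environment-species association that is not accounted for by the spatial configuration of the samples.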
REFERENCES
Aragon, Y., and H. Caussinus. 1980. Une analyse en composantes principales pour des unites
statistiques correlees, p. 121-131. In E. Diday et al. [ed.] Data analysis and informatics.
North-Holland Publ. Co., New York.
Becker, R. A., and J. M. Chambers. 1984. S: an interactive environment for data analysis and
graphics. Wadsworth Adv. Book Program, Belmont, CA. 550 p.
Biswas, G., A. K. Jain, and R. C. Dukes. 1981. Evaluation of projection algorithms. IEEE
Trans. PAMI 3: 701-708.
Bonifas, L., Y. Escoufier, P. L. Gonzalez, and R. Sabatier. 1984. Choix de variables en analyse
en composantes principales. Revue de Statistique appliquee 32 (2): 5-15.
Bonissone, P., and K. Decker. 1985. Selecting uncertainty calculi and granularity: an experiment
in trading-off precision and complexity. GE TR85.5C38, Schenectady, N.Y.
Breiman, L., and J. H. Friedman. 1982. Estimating optimal transformations for multiple
regression and correlation. Dept. Stat., Stanford Univ., Calif., 81 p.
1. INTRODUCTION
Studies of terrestrial plant communities are generally directed towards describing the
vegetation of some suitably circumscribed area and then interpreting the features described in
terms of environmental, biological or historical processes or events. More specifically, the usual
objectives are (a) to study spatial and temporal patterns in the occurrence and representation of
terrestrial plants; (b) to elucidate the causes of such patterns in terms of environmental factors and
species interactions; and (c) to establish whether recognisable communities exist, and, if so, to
describe them and to account for the dynamics in their species composition. Numerical ecology
offers one avenue of approach towards the attainment of these goals. Quantitative studies of
vegetation begin with observations or measurements in the field and proceed by means of
algebraic manipulation of the data to graphical expression or display. Such displays, in
themselves, are succinct descriptions of the vegetation which, in addition, may provide insight
into the nature and relative importance of underlying ecological controls.
Terrestrial plant communities are composed of individuals that belong to numerous
co-existing species. It is the higher-plant species which are the focus of attention in most, though
not all, studies of terrestrial vegetation. These species constitute the variables of interest. As a
rough guide, the number of higher-plant species, p, likely to be encountered in most studies can
be taken to be bounded by 50 and 500, that is 50 ≤ p ≤ 500. Thus, a salient characteristic of
terrestrial plant ecology is that the domain with which it deals is multivariate. Put rather
differently, we may say that field observations in community ecology are vector-valued.
Terrestrial vegetation is just one component of a larger entity - the ecosystem, the remaining
components of which are the fauna, fungi, climate and soils of the area in question, together with
the interactions and reactions which bind these constituents together. Definitive studies of
terrestrial vegetation are therefore inseparable from the study of the ecosystem as a whole, or at
any rate of some significant part of it. Accordingly, a second characteristic of terrestrial ecology is
that it is relations between variables of two or more distinct but associated kinds which ultimately
1 This report represents Dr. Gittins' summary of the debates of the Working Group. It was not submitted for the
other members' approval before publication, for lack of time. [Editor]
are the primary focus of interest in most if not all studies. A third characteristic of terrestrial
ecology is associated with the heterogeneity of terrestrial vegetation. The meaning of heterogeneity
here has been well-described by Webb (1954), who observed that variation in random samples of
vegetation hovers in a tantalising way between the continuous and the discontinuous. That is to
say, units comprising the sample are likely to have been drawn from some unknown mixture of
underlying plant communities or p-variate distributions and thus are unlikely to consist of
identically distributed representatives of a single homogeneous community or distribution.
Together, it is the high-dimensionality of the data, the complexity of the network of
interrelations among species and between them and environmental variables, and the heterogeneity
of the samples which combine to make the description and comprehension of vegetation the
challenge that it is. These features explain why narrative accounts of vegetation, in themselves,
should have proved insufficient in efforts to place terrestrial plant ecology on a firm scientific
footing. Vector-valued observations are simply not amenable to verbal description though they are
readily handled algebraically.
Ecology by definition deals with relations between variables of distinct but associated
kinds (e.g., Walter and Breckle 1985; Begon et al. 1986). It follows as a consequence that the
data matrix with which we deal in terrestrial plant ecology is a partitioned matrix of the form A =
[A1 | A2 | ... | Am], where A1 (N × p), A2 (N × q), ..., Am (N × z) are submatrices whose
columns correspond to biological, physical, chemical, spatial or other variables. The case A = [A1
| A2] is usual in applications, although reports of investigations for which m = 3 or m = 4 are not
unknown. The partitioned nature of A is frequently overlooked. In the ecological literature, for
example, the matrix A = [A1 | A2] is all too frequently misspecified as the non-partitioned matrix
A = [A1], where the distinction between variables of different kinds is either entirely disregarded
or else one of the two sets is neglected, in the initial stages of analysis at least. Misspecification
results in the inappropriate analysis of A, for example by principal components or correspondence
analysis. Studies of vegetation in which interest is strictly confined to variables of just one kind,
namely the species comprising the vegetation, are encountered from time to time. Since relations
between variables of different kinds are not involved, such studies are by definition (above)
outside the domain of ecology. If desired, they might be brought within the confines of the subject
by treating them as a special case where the data matrix A = [A1] is genuinely non-partitioned and
so corresponds to a degenerate case. It would be a serious error, however, if a degenerate case of
this kind were allowed to obscure the partitioned nature of A in ecological studies generally.
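The partitioned form of A is trivial to preserve in practice; a small numpy sketch (dimensions and contents are hypothetical), in which the partition is kept explicit rather than discarded:

```python
import numpy as np

N = 6
A1 = np.random.default_rng(4).random((N, 3))   # e.g. species abundances
A2 = np.random.default_rng(5).random((N, 2))   # e.g. environmental variables
A = np.hstack([A1, A2])                        # the partitioned matrix [A1 | A2]
parts = (slice(0, 3), slice(3, 5))             # remember where the partition lies
```

Carrying `parts` alongside A makes it impossible to misspecify A as non-partitioned by accident.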
In addition to the partitioned nature of the data matrix, the rows (sample-units) of A are
rarely identically distributed, in terrestrial plant ecology. Rather, the topology of the implied
structure in A corresponds to an unknown mixture of multivariate distributions which differ in
location and which may or may not overlap, but whose characteristics are otherwise unspecified.
In other words, the sample is heterogeneous. An account of one such empirical ecological data
distribution has been described with admirable clarity by Noy-Meir (1971). The partitioned form
of A together with the distributional features noted serve to distinguish the data matrix usually
encountered in terrestrial ecology rather sharply from its counterpart in classical multivariate
analysis and in experimental psychology, fields from which the impetus for the development of
methodology appropriate for analyzing multiresponse data has largely been derived. Evidently,
what is required above all in terrestrial plant ecology are procedures that will render relations
between m ≥ 2 vector-valued variables tractable even where the sample in question is
heterogeneous.
The principal aim of this report is to evaluate, as far as possible, the opportunities afforded
by new or recently developed numerical methods for attaining research goals in terrestrial plant
ecology. Little or nothing in the way of a coherent body of ecological theory exists which might be
used to structure or guide such an endeavour. Accordingly, having stated above what we consider
to be the principal methodological challenges posed by terrestrial vegetation data, we have opted to
proceed to evaluate methods in terms of their ability or promise to address one or more of the
challenges identified. The principal sections of the paper focus on problems associated with
high-dimensional data and heterogeneity. Sections 2 and 3 are devoted to different classes of
techniques for the reduction of dimensionality in high-dimensional data, and section 4 to
procedures for dealing with heterogeneity. Finally, in section 5, our assessment of the
opportunities for the productive use of numerical methods in terrestrial plant ecology is expressed.
2. REDUCTION OF DIMENSIONALITY
Classical multivariate analysis provides several methods that lead directly or indirectly to
linear reduction of dimensionality. By classical multivariate analysis we mean that part of the
subject in which the multivariate normal distribution plays a prominent role. We shall assume
initially that m = 1, but will remove this restriction later. For a single set of variables (m = 1), a
useful set of summary statistics where the data distribution is p-variate normal is based on the
means, variances and covariances of the variables in question. Suppose that there is interesting
structure in A. Then an important consequence of p-variate normality is that the configuration
generated by A in p-space will be a p-dimensional ellipsoid with linear axes. Further, the
covariance matrix, S = (N-1)^(-1) A^t A, where A is the data matrix in deviation form, A = (A - 1a^t),
and a^t is the row vector of column means, will capture this structure with remarkable efficiency.
The covariance matrix, or some related matrix of scalar products, is the starting point for most
classical multivariate analyses. The solutions such methods provide very often result in a reduction
of dimensionality, a reduction which is linear in the sense that the coordinates of sample-points in
the reduced space are linear functions of the original coordinates in p-space. We sketch three
methods of this kind together with a closely related method (principal coordinates analysis), all of
which have application in terrestrial plant ecology.
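The deviation-form covariance computation above can be sketched directly in numpy (the data are synthetic; the result should match a library covariance routine):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 4))                 # N x p data matrix

a = A.mean(axis=0)                           # row vector of column means, a^t
Ad = A - a                                   # deviation form, A - 1 a^t
S = Ad.T @ Ad / (len(A) - 1)                 # covariance matrix (N-1)^(-1) A^t A
```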
Principal component analysis. Principal component analysis is the classical method for the
linear reduction of dimensionality in an unstructured sample. Though the plane (or hyperplane) is
best fitting, in the sense that it minimizes the sum of squares of the residuals, this does not assure
that it will necessarily yield the most useful view of the sample for practical purposes. Thus, an
outlying sample-point might dominate one of the first two dimensions and result in a poor fit to
many other points. Or, where sample-points lie close to some nonlinear manifold in p-space rather
than a linear one, the principal components will at best provide only a poor approximation of the
sample.
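A minimal principal component analysis can be carried out via the singular value decomposition of the centred data matrix (synthetic data; only the linear scores and the proportions of variance are computed):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)

# principal components from the SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                       # sample coordinates on the components
explained = s ** 2 / (s ** 2).sum()  # proportion of variance per component
```

The first two columns of `scores` give the best-fitting plane referred to in the text.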
Principal coordinates analysis. The starting point for principal coordinates analysis (Gower
1966, and this volume) is a symmetric matrix of distances or of distance-like quantities between
samples, rather than a covariance matrix between variables. The distances may be observed
directly or may be computed from the data matrix A. Principal coordinates analysis of a distance
matrix D (N × N) provides a mapping of N sample-points in a reduced t-dimensional space such
that the distances between points in the plot approximate the observed or calculated distances in D.
Principal coordinates analysis may be applied to matrices whose elements are computed from one
of the many coefficients that measure similarity or dissimilarity, Euclidean or otherwise. For a
review of such coefficients and their properties, the interested reader is referred to Gower and
Legendre (1986). The freedom of choice regarding the elements of D represents an important
advantage over principal component analysis, as the distance used in components analysis
(Euclidean distance) is arbitrary and is rarely appropriate in practice. In this sense, principal
coordinates analysis represents a significant generalization and improvement over components
analysis. On the other hand, unlike principal component analysis, principal coordinates analysis
provides no information on the role of the variables in the analysis. In applications, such
considerations as these are pertinent to the matter of selecting a method which is appropriate to a
specified ecological goal.
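The principal coordinates computation can be sketched as double-centring the squared distances and extracting the leading eigenvectors (Gower 1966, in outline only; the data are synthetic). For a Euclidean distance matrix derived from 2-d points, the two-dimensional solution reproduces the original inter-point distances:

```python
import numpy as np

def pcoa(D, t=2):
    """Principal coordinates analysis sketch: double-centre the squared
    distances and scale the leading eigenvectors."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return vecs[:, :t] * np.sqrt(np.maximum(vals[:t], 0.0))

rng = np.random.default_rng(8)
pts = rng.normal(size=(15, 2))
D = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))
Y = pcoa(D, t=2)
```

With a non-Euclidean dissimilarity, some trailing eigenvalues may be negative; the `np.maximum` guard simply drops their contribution.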
We consider next two Euclidean mappings, both extensions of principal component
analysis, which may be useful where relations between species and samples are important to the
success of an investigation. These mappings are the biplot and correspondence analysis. Each
yields a graphical representation of the sample that discloses relations within the set of
sampling-units, within the set of species, and also relations between the sample-units and the
species.
The biplot. Let A = (a_ji) be an N × p matrix consisting of N realizations of a p-valued
variable expressed as deviations from the variate means. Starting from A, the biplot (Gabriel
1971, 1982) provides a display in which rows and columns are simultaneously represented in a
low-dimensional vector space with the aim of obtaining more insight into the data than could be
obtained from separate inspections of samples and variables. It is usual to represent samples
(rows) by points and variables (columns) by vectors emanating from the origin which is at the
centroid. Relations among samples are proportional to the distance between pairs of
sample-points. With respect to the variables, the length of a vector is proportional to the standard
deviation of the variable in question, while the cosine of the angle between any two vectors is the
correlation between the corresponding variables. Relations between samples and variables are
interpretable in terms of the angular separation of the row and column markers or symbols
involved. More specifically, the scalar product between the jth sample-point and the ith variable
vector approximates the (j,i)th element of A, a_ji. These between-set relations are especially valuable
for interpreting the configuration of sample points. In terrestrial ecology all too often one comes
across attempts to interpret the sample in terms of its principal components - in effect, that is,
attempts are made to interpret the components themselves in ecological terms. Though useful for
plotting, however, it is far from clear that the components are necessarily useful for interpretation.
Interpretation of the sample is more likely to be productive in terms of the original variables, hence
the appeal of the biplot. The biplot is a flexible tool, many variants and extensions of which have
been described. For an introduction to this work, see Bradu and Gabriel (1978), Gabriel (1981),
Cox and Gabriel (1982), Greenacre (1984, p. 341).
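A rank-2 biplot of the kind described reduces to a truncated SVD; a sketch follows (synthetic data; the marker scaling shown is one common convention, not the only one):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(25, 4))
Ac = A - A.mean(axis=0)

# rank-2 biplot: row (sample) and column (variable) markers from the SVD
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
G = U[:, :2]                 # sample markers (points)
H = Vt[:2].T * s[:2]         # variable markers (vectors), carrying the scale

approx = G @ H.T             # scalar products approximate the elements of Ac
```

Under this scaling, the scalar product G[j] . H[i] approximates a_ji, which is the property the text exploits for interpretation.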
Correspondence analysis. Correspondence analysis (Gower, Escoufier, and de Leeuw,
this volume; Greenacre 1984) provides a simultaneous, two-dimensional display of the rows and
columns of a data matrix A which is useful where row and column entities are commensurate.
Rows and columns are each represented by points in the display. The matrix A is first
standardized to R^(-1/2) A C^(-1/2), where R (N × N) is the diagonal matrix of row totals and C (p × p)
the diagonal matrix of column totals. When A is a contingency table this standardization results in
displays with appealing metric properties. Ecological affinities between row-entities are inversely
proportional to the distance between row-points in the display; similarly, affinities between
column-entities are inversely proportional to the distance between column-points. In contrast, the
specifically, transformations are sought that disentangle relations (correlations) within each set of
variables while simultaneously emphasizing relations (correlations) between sets. The correlation
matrix of the x's and y's, R(x,y), is in fact reduced to a form, R(u,v), that involves in most
cases only two or three nonzero quantities.
This remarkable result enormously simplifies the task of comprehending relations among
the original x's and y's. What canonical analysis does is to identify variables (the u's and v's) that
preserve and clearly reveal the internal structure of R(x,y). The new variables are associated in
conjugate pairs (u_k, v_k), for k = 1, 2, ..., s where s = min(p, q), and are known as canonical
variates. Not all pairs (u_k, v_k) are equally useful, and in applications it is often found that all but
the two or three most highly correlated pairs can be discarded with little loss of important
information. In this way a reduction in dimensionality is achieved. Interpretation of results is
based on the point configuration that results on mapping sample-points into the low-dimensional
vector-space associated with the retained canonical variates. The resulting configuration is then
examined for its substantive implications, with the distance between points again being the
primary interpretive device. Thus, we have seen that canonical analysis is a particular form of
scaling which is very specifically oriented towards clarifying the correlation structure of
multiresponse data where the variables in question fall naturally into classes of different kinds.
The notion of arriving at a low-dimensional spatial representation of the sample, however, was not
explicit in Hotelling's original formulation.
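For the basic case m = 2, the canonical correlations can be sketched as the singular values of the whitened cross-covariance matrix (synthetic data; variable names and the shared signal are invented for illustration):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlation sketch: whiten each set of variables, then take
    the singular values of the cross-covariance of the whitened sets."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = len(X)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T
    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(K, compute_uv=False)    # the canonical correlations

rng = np.random.default_rng(10)
n = 200
shared = rng.normal(size=(n, 1))                 # signal common to both sets
X = np.hstack([shared + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Y = np.hstack([shared + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 1))])
rho = canonical_correlations(X, Y)
```

Only the first pair carries the shared signal, so, as the text anticipates, all but the leading canonical variates could be discarded with little loss.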
The results sketched above for m = 2 have been generalized to the case m > 2. Where all m
sets of variables are on an equal footing, solutions have been proposed by Horst (1961),
Kettenring (1971), van de Geer (1984) and Verdegaal (1986). Generalizations of quite different
kinds have also been proposed. It is usual in canonical analysis to assume, for example, that the
relationship between x and y is symmetric. The theory has however been extended to situations in
which the relationship is asymmetric (van den Wollenberg 1977; Tyler 1982; Israels 1984). Other
extensions that allow the efficient analysis of nonmetric data have been proposed by van der Burg
and de Leeuw (1983), van der Burg (1985), and Verdegaal (1986). The practical significance of
these and similar developments for terrestrial plant ecology is that canonical analysis clearly
emerges as a method of remarkable flexibility and of correspondingly wide applicability.
Convincing ecological applications of canonical analysis are nevertheless few.
Concluding remarks. Applications of methods such as principal components analysis, the
biplot, correspondence analysis, and canonical analysis, which depend on the data only through
the covariance matrix or some similar matrix, will be most satisfactory when the sample is
homogeneous and the data distribution elliptically symmetric, not too long-tailed and free from
contamination by outlying or extraneous observations. Thus, before embarking on linear reduction
of dimensionality, one is well-advised to examine the data and to proceed only where these are
shown to conform to the specifications mentioned. Procedures for assessing multiresponse data
for this purpose have been described by Gnanadesikan (1977). In addition, it is well to bear in
mind that all real problems are nonlinear. Linear methods are therefore at best only first-level
approximations to problems of greater complexity. Yet, where they are applicable, linear methods
have the merits of being sufficiently simple to be mathematically tractable and sufficiently realistic
to allow sensible interpretation of results. Algebraically, the methods lead to straightforward
matrix decompositions with closed-form solutions for which efficient, numerically stable
algorithms are widely available. For these reasons, methods of classical linear multivariate
analysis are likely to remain the methods of choice for aiding the comprehension of
high-dimensional data whenever the data meet the specifications set out above.
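The closed-form character of these decompositions is easily illustrated. The following sketch (in Python, on simulated data; the samples-by-species reading of the matrix is our illustration, not part of the text) obtains principal components directly from the spectral decomposition of the covariance matrix, with no iteration involved:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in for a samples-by-species data matrix (30 samples, 5 species).
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)                 # centre each variable on its mean

S = np.cov(Xc, rowvar=False)            # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(S)    # closed-form spectral decomposition
order = np.argsort(eigvals)[::-1]       # re-order: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :2]            # sample coordinates on first two components
explained = eigvals[:2].sum() / eigvals.sum()   # proportion of variance retained
```

The sample coordinates on the retained components, and the proportion of variance they account for, follow at once from the eigenvalues and eigenvectors; this is the numerically stable, closed-form solution referred to above.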
In selecting a method for linear reduction of dimensionality, several matters have to be
weighed. Where relations between two or more sets of variables are the focus of attention,
canonical analysis in one form or another is almost always likely to be the preferred method. In
studies involving variables of just one kind, principal component analysis offers an uncomplicated
approach and one which in addition provides insight as to the role of the species in the analysis.
Yet, the distance coefficient implicitly used is arbitrary and often difficult to justify in practice.
Principal coordinates analysis overcomes the latter difficulty, but at the expense of forfeiting
information on the species. The biplot and correspondence analysis are available where samples
and species jointly are the focus of attention. Of the two, the biplot has the greater flexibility and
yields a precise representation of species/sample relations. Correspondence analysis may be
preferred where the data are of a particular kind, namely commensurate, where the declared
ecological objective unequivocally calls for a quite specific standardization of the data, and where
an exact specification of species/samples relations is not of overriding importance.
Although the theory of classical multivariate analysis has existed for something like fifty
years, and despite its straightforward algebraic basis and computational requirements, the impact
of classical multivariate analysis on our understanding and appreciation of terrestrial vegetation has
scarcely been beneficial. Convincing ecological applications in which due care and attention have
been paid to the selection of a method, to the implementation of the chosen method, and to the
interpretation and reporting of results are few and far between. It is clear from the ecological
literature that the methods themselves are generally poorly understood by terrestrial plant
ecologists. Three requirements for incisive analysis are almost always overlooked: the need to
ensure that a chosen method is properly matched to the declared ecological goal, the need to
carefully consider the effects of the implicitly or explicitly chosen origin and unit of scale on the
outcome of the analysis, and the need for care and sound judgement in implementing a method if
analysis is to be productive. We shall defer consideration of the reasons for this state of affairs and
its implications until section 5.
Where the data distribution is other than elliptically symmetric, important structures may be
present that cannot be captured adequately by linear associations or correlations between variables
- in other words, by a covariance or correlation matrix. Such features include the tendency for
sample-points to be concentrated close to certain kinds of curved, t-dimensional surfaces or to be
aggregated into clusters, either discrete or overlapping. In contrast to linear effects, nonlinear
structures take many shapes and forms. Linear methods in these cases will be
deficient and there is a need for methods that are sensitive to effects of the kinds described, even
though it is impossible to specify all the many possibilities in advance. Before proceeding to
consider methods for nonlinear reduction of dimensionality, we make some general observations
regarding these methods, contrasting them in the process with procedures for linear reduction of
dimensionality.
We remark first of all that the starting point for classical, linear multivariate analysis is a
model which is assumed to describe the distribution of the data to be analyzed. Distributional
specifications figure prominently, the model being fitted under the assumption that the
distributional specifications are in fact satisfied. Nonlinear scaling methods, in contrast, begin
with the data rather than a model, the analysis being directed towards finding a structure or model
that describes the data. The tightly-specified distributions of classical statistics are thus entirely
circumvented. In short, nonlinear reduction of dimensionality is data analytic rather than
confirmatory in character, exploratory rather than inferential. Second, the search for coordinates in
a reduced dimensional space in nonlinear scaling is not restricted to coordinates which are linear
functions of the original coordinates of the data-points. This freedom imparts a degree of elasticity
to the shape of the configurations that nonlinear scaling can profitably fit, an elasticity not
shared by methods for linear reduction of dimensionality.
A third point concerns the measurement level of the data for analysis. The methods of
classical, linear multivariate analysis were proposed with metric (interval or ratio scaled) data
primarily in mind. Nonmetric (nominal or ordinal) data are for the most part less amenable to
incisive analysis by classical methods. Nonlinear scaling in contrast is directed primarily towards
the analysis of nonmetric data. The preoccupation with nonmetric data is justified on several
grounds. For example, the apparent precision of metric or quantitative data in terrestrial plant
ecology is all too often spurious, the extent of measurement error being such that the data contain
little reliable information beyond their rank order. A second point is that nonmetric data are
comparatively resistant to the effects of outlying observations or other peculiarities in distribution
to which ecological data are all too prone. Further, nonmetric data are almost always speedier and
cheaper to acquire than metric data. Finally, we shall see that nonmetric data can be transformed in
such a way that nonlinear structure in the data can frequently be linearized and hence captured
parsimoniously if analyzed appropriately. We find these arguments in favour of nonmetric data
compelling and advocate their widespread adoption in ecological studies of terrestrial plant
communities. Indeed, the unreliability of some data collected, and assumed to be metric for the
purpose of data analysis, together with the availability of efficient means for analyzing nonmetric
data, suggest that it would often be advantageous to replace these data by their ranks and then to
analyze (or re-analyze) them accordingly. In much of what follows we shall be primarily, although
not exclusively, concerned with nonmetric data.
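The replacement of doubtful metric data by their ranks is straightforward to carry out. The following sketch (Python; the cover values are invented for illustration) ranks observations within each variable, assigning mid-ranks to ties:

```python
import numpy as np
from scipy.stats import rankdata

# Invented cover values for one species over six quadrats; the exact figures
# are doubtful, but their rank order is trusted.
cover = np.array([12.5, 3.1, 3.1, 40.0, 0.5, 18.2])
ranks = rankdata(cover)          # tied values 3.1, 3.1 share the mid-rank 2.5

# The same operation applied column-wise to a small samples-by-species matrix.
X = np.array([[12.5, 1.0],
              [ 3.1, 0.2],
              [40.0, 5.5]])
X_ranked = np.apply_along_axis(rankdata, 0, X)   # rank within each column
```

The ranked matrix can then be analyzed in place of the original metric data by any of the nonmetric methods discussed below.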
The principles used to guide and inform the analysis of nonmetric data have been
well-summarized by Takane (1985) and by Carroll (this volume). A key notion in this area is that
nonmetric data are nonlinear transforms of metric data. Thus, if appropriate transformations of
initial, nonmetric data can be found, the transformed data can be analyzed by some such
quantitative procedure as multidimensional scaling. Unlike other procedures that require data
transformations, the form of any particular transformation in nonlinear reduction of dimensionality
does not have to be specified in advance; optimization of a suitable index or loss function will
yield both the best transformations and the best parameter estimates for a given model within the
least squares framework.
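A single transformation step of the kind described can be sketched as follows (Python; the ordinal scores and model values are invented, and isotonic regression stands in for the transformation step of a full alternating least squares algorithm). Given the values a model currently predicts, the least-squares re-scoring of the ordinal data that preserves their rank order is found by pool-adjacent-violators; nothing about the form of the transformation is specified in advance:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Invented ordinal abundance scores (a 1-5 scale) and the values a model
# currently predicts for the same seven samples.
ordinal = np.array([1, 1, 2, 3, 3, 4, 5])
model   = np.array([0.2, 0.4, 0.3, 1.1, 0.9, 1.8, 2.5])

# One optimal-scaling step: the least-squares transformation of the ordinal
# data that is monotone in their rank order (pool-adjacent-violators).
transformed = IsotonicRegression().fit_transform(ordinal, model)
```

In a full alternating least squares procedure this step would alternate with re-estimation of the model parameters until the loss function converges.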
An important special case of multidimensional scaling is represented by multidimensional
unfolding (Heiser, this volume). Both are scaling methods used to represent dissimilarity data, the
distinction being that whereas multidimensional scaling involves a single set of objects O, in
multidimensional unfolding the objects are partitioned into two finite subsets, O1 and O2. In many
applications O1 corresponds to the row-objects (e.g., samples) of a rectangular data matrix A (N
x p) and O2 to the column-objects (e.g., variables) of the same matrix. Multidimensional scaling
and unfolding are likely to be most useful for nonlinear reduction of dimensionality where the
structure in the data is largely continuous.
Multidimensional scaling (Carroll, this volume) consists of a family of methods for the
spatial description of relations among objects which are free from both the distributional
restrictions and the requirement for metric data associated with classical, variance-based
multivariate analysis. Multidimensional scaling therefore significantly widens the domain of
applicability of multivariate methods in areas such as terrestrial ecology. Nevertheless,
characteristics of many ecological data sets, such as the partitioning of variables into subsets and
the occurrence of heterogeneous data structures, have not featured prominently in the development
of multidimensional scaling methodology. As a result, a sizeable gap exists between the
methodological requirements of terrestrial plant ecology, on the one hand, and the scaling methods
available to meet these needs on the other. Nevertheless, the range of substantive goals that can be
addressed by multidimensional scaling and the very general conditions under which scaling
methods are applicable, will in themselves suffice to ensure that multidimensional scaling will
come to make a significant contribution to numerical ecology.
The basic methods of two-way metric and nonmetric scaling have been shown to have
much to contribute to large-scale investigations of vegetation (e.g., Kenkel 1986; Kenkel and
Orlóci 1986). In studies where temporal as well as spatial variation is of interest, and in a variety
of other situations, three-way scaling provides a wealth of opportunities. Three-way scaling with
linear constraints, for example, extends the applicability of multidimensional scaling to factorially
designed analysis of variance experiments. Accordingly, three-way scaling enables time trends,
treatment effects, individual differences, and numerous comparable conditions to be explored
directly within the context of multidimensional scaling. Opportunities of a different kind are
afforded by constrained multidimensional scaling. Constraints on the coordinates of a
configuration matrix X, for example, may be useful where relations between objects are generated
by seasonal or other cyclical processes, not necessarily time-related. Or, constraints may be used
implementation.
The robustness of multidimensional scaling against disturbances of different kinds has
received attention. Classical scaling has been shown to be little affected by random error (Kruskal
1977; Sibson 1979; Sibson et al. 1981). Scaling is also robust under variation in the method used,
except perhaps in certain unusual circumstances. Indeed, robustness against variation in method is
such that for most interesting sets of data, metric and nonmetric methods tend to yield
configurations which are remarkably similar, despite the algebraic and computational differences
involved (Carroll and Kruskal 1978). The results of multidimensional scaling do, however,
depend on the domain from which the objects of interest are drawn, particularly on its composition
and extent. That the composition of the domain should be influential in this way is perhaps not
altogether surprising. Fortunately, graphical procedures for assessing the extent of
sample-dependence or sample-specificity are available (Heiser and Meulman 1983a, 1983b;
Weinberg et al. 1984; de Leeuw and Meulman 1986). The effect of extent or breadth is more
subtle. As breadth increases so also do opportunities for the data distribution to become
increasingly heterogeneous. With heterogeneity, the likelihood that important or interesting
structures will go undetected is greatly increased. Where the data structure is neither strictly
continuous nor strictly discontinuous, but is somewhere between the two, as is common in
terrestrial ecology, the result of scaling is open to domination by chance features of the data. What
is more, no warning is given that a configuration has been so determined. In other words, where a
data distribution is not well-behaved, the results of scaling, while having every appearance of
normality and of being acceptable at face value, are more than capable of leading one astray. Thus,
it seems to us that in the context of large-scale vegetation surveys, the lack of robustness of
multidimensional scaling as the breadth of the domain in question increases represents perhaps its
most vulnerable characteristic. Yet it is very much part of the flavour of multidimensional scaling
to neglect the data distribution altogether.
Remedial measures are nevertheless possible. Thus, a simple scatterplot of dissimilarity
values against fitted distances may be used to diagnose heterogeneity following analysis. An even
stronger case exists for examining the form of the data distribution itself before embarking on
scaling. As the dissimilarity matrix Δ in terrestrial plant ecology is almost invariably derived from
an N x p matrix A of multiresponse observations, such a step would be perfectly feasible. A
quantile-quantile (Q-Q) probability plot constructed from the rows, ai, of A would shed light on
the shape of the data distribution, and in particular on its coherence (Gnanadesikan 1977; Campbell
1980). Where coherence is demonstrated, Δ can be calculated and scaling undertaken in the usual
way. Otherwise, one or more dissimilarity matrices, each corresponding to a substantially
coherent subset of the rows of A, might be calculated, and each then scaled in turn. The procedure
would unavoidably fragment the analysis, but would have the considerable merit of being less
likely to lead one astray than an analysis based on A as a whole. A complementary step would be
to employ a loss function other than a least-squares one, since the sensitivity of least-squares
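A Q-Q plot of the kind proposed might be constructed as follows (a Python sketch on simulated multinormal data): ordered squared Mahalanobis distances of the rows of A are set against chi-square quantiles, and a coherent, well-behaved sample plots close to the 45-degree line (Gnanadesikan 1977):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, p = 50, 3
A = rng.multivariate_normal(np.zeros(p), np.eye(p), size=N)  # simulated N x p data

# Squared Mahalanobis distance of each row from the multivariate mean.
resid = A - A.mean(axis=0)
S_inv = np.linalg.inv(np.cov(A, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', resid, S_inv, resid)

# Q-Q coordinates: ordered distances against chi-square(p) quantiles.
theory = stats.chi2.ppf((np.arange(1, N + 1) - 0.5) / N, df=p)
empirical = np.sort(d2)
```

Marked curvature or isolated points in the plot of `empirical` against `theory` would signal the heterogeneity or outliers warned of above, and would argue for splitting the sample before scaling.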
We use the term nonlinear multivariate analysis to refer to a class of methods which are
invariant under nonlinear transformations of the variables, in the sense of Gifi
(1981) and de Leeuw (1984, 1987a, 1987b). The ALSOS-system of Young et al. (1980) is a
related approach which overlaps with Gifi's system (de Leeuw 1987a). For present purposes,
distinctions between the two approaches (Meulman 1986) are not important and are neglected.
Fundamental to both conceptualizations of multivariate analysis are the notions that all data
irrespective of measurement level are qualitative, and that qualitative data are nonlinear
transformations of metric data.
The idea that all data are qualitative is justified by appeal to the finite precision of the
measurement process (Takane et al. 1977; Young 1981). In terrestrial plant ecology, the precision
of measurements is generally low, so that data collected and assumed for the purpose of analysis
to be metric, commonly do not meet this standard. For Gifi, the well-known principal
measurement levels of Stevens (1962), that is nominal, ordinal, interval and ratio, are regarded not
as a property of data but as a set of restrictions that may or may not be imposed in any subsequent
manipulation of the data. The notion that qualitative data are nonlinear transformations of metric
data draws empirical support from the common observation that nonlinear configurations in
geometric representations of multiresponse data are more frequent with qualitative than with metric
data. It follows that if suitable transformations of nonmetric data can be found, the transformed
data will be metric and therefore amenable to analysis by any appropriate method of standard linear
multivariate analysis. In contrast to other situations that call for the use of re-expression, the form
of a particular transformation in nonlinear multivariate analysis does not have to be pre-specified;
optimization of a suitable loss function will yield the best transformations and the best parameter
estimates for the model in question, within the least squares framework.
The emphasis in nonlinear multivariate analysis is very much on the analysis of linear
relations among nonmetric data, and not at all on nonlinear approximation. Study of linear
relations between any functions, f(x) and g(y), of two variables x and y, is precisely the study of
nonlinear relations between x and y. Accordingly, the nonlinear multivariate analysis problem is to
find optimal transformations - transformations of qualitative variables for which the transformed
variables are as linearly related as possible. Thus the nonlinear part of nonlinear multivariate
analysis concerns the transformations of the variables. Equivalently, the problem can be stated as
that of finding transformations that lead to the optimization of some pertinent criterion or loss
function. The general objective of nonlinear multivariate analysis is very much the same as that of
multidimensional scaling, namely to arrive at a spatial representation of a data matrix that conveys
as many useful relations as possible. At the same time, linear reduction of dimensionality is an
essential ingredient of nonlinear multivariate analysis, and in this respect nonlinear multivariate
analysis resembles classical multivariate analysis. In the nonlinear case, however, reduction of
dimensionality follows nonlinear transformation of variables.
Nonlinear multivariate analysis provides a very general framework for the quantitative
analysis of qualitative data. In facilitating the application of classical linear multivariate methods to
nominal and ordinal data, the new methodology represents a significant development. We turn
now to consider several implications of nonlinear multivariate analysis for terrestrial plant ecology.
In the first place, the metric or quantitative data still widely used in terrestrial ecology can
be dispensed with entirely. Metric field data are decidedly time consuming and expensive to
acquire. Nonmetric data, in contrast, are far easier and cheaper to obtain, so enabling larger
samples to be obtained for a given expenditure of time and effort. We have seen that nonlinear
multivariate analysis opens the way for the efficient analysis of nonmetric data. There is, however,
a price to be paid for this facility. The price is that two sets of parameter estimates are required
(those of the optimal scaling and model parameters) as opposed to a single set of estimates (model)
for the usual linear analyses. The precision of parameter estimates is very much a function of
sample size. It so happens that the larger samples, which become feasible with the acquisition of
nonmetric data, very nicely offset the reduced precision of estimates which would otherwise result
in the nonlinear case because of the additional estimates required. Second, nonlinearities due to the
coding of nonmetric variables are circumvented by finding optimal transformations. The practical
significance of this feature is that an improvement in the fit of a given model could reasonably be
anticipated. In other words, more informative graphical summaries of the plant communities or
ecosystems of interest could be expected than would result from a linear analysis of the same data.
A third aspect of nonlinear multivariate analysis is that questions of data expression do not
have to be solved before a chosen method is applied. Data expression includes as special cases
centering, standardization and the choice of a similarity measure. In practice, a rationale for
choosing between the various options is sometimes lacking and arbitrary choices which lack
theoretical justification are often made. In nonlinear multivariate analysis, however, the expression
of a variable in a data matrix is regarded as essentially a convention, merely a coding. As a
consequence, the question of re-expression does not have to be solved before a technique is
applied; rather, it is an important part of the methodology of nonlinear multivariate analysis to find
appropriate re-expressions. In other words, optimal scaling removes the arbitrariness from
re-expression. Fourth, nonlinear multivariate analysis is very much oriented towards statistical
data analysis as distinct from statistical inference. In this respect, the new methodology accords
closely with the realities of ecological data, where departure from p-variate normality or even
elliptical symmetry is the rule rather than the exception. It is for precisely this reason that the
models of classical multivariate analysis all too often prove to be too tightly-specified to be used
responsibly in studies of terrestrial vegetation. Nonlinear multivariate analysis, in contrast, is
comparatively free from distributional constraints. More realistic analyses and sounder
conclusions are to be expected. Lastly, nonlinear multivariate analysis with optimal scaling can
under appropriate conditions be used to generate a surprisingly wide class of techniques, as de
Leeuw (1987a) has shown. We have already seen that where a least squares procedure for
analyzing metric data is known, then that procedure can also be used to analyze qualitative data
simply by alternating the procedure with optimal scaling. In fact, nonlinear multivariate analysis
subsumes and generalizes all linear multivariate methods to yield a unified framework for linear
and nonlinear multivariate analysis, and so it provides a common approach to a diversity of
ecological goals.
Having mentioned some of the ecological benefits to be expected from nonlinear
multivariate analysis, we draw attention to precautions which for best results need to be observed
in applying nonlinear methods. These are (a) the desirability of assessing the joint distribution of
the optimally scaled variables before fitting a linear model; (b) the need for care in model fitting;
and (c) the importance of assessing the stability of the results obtained.
Joint distribution of optimally scaled variables. Nonlinear multivariate analysis can be
viewed as a class of methods that have as their common starting point a correlation matrix (de
Leeuw 1987a, 1987b). Our remarks in this section are made with this observation very much in
mind.
It is very much part of the flavour of nonlinear multivariate analysis that statements about
the joint distribution of the data are almost entirely avoided. The resulting freedom from
distributional constraints will be appreciated in community ecology, where it is a matter of
common experience that the distributional requirements of linear multivariate analysis are
sometimes restrictive. Yet, in ecological applications of nonlinear multivariate analysis, it seems to
us that it would be unwise to disregard the data distribution entirely. In a previous section, the
view was expressed that the kind of data distributions with which we deal in terrestrial plant
ecology consist of some unknown mixture of multivariate distributions which differ in location
and which may or may not overlap, but whose characteristics are otherwise unspecified. It is
pertinent at this point to enquire as to the effect of such a distribution on nonlinear multivariate
analysis. Just how far can a data distribution depart from a homogeneous, p-dimensional ellipsoid
of the kind implied by multinormality and yet yield sensible results? Given the sensitivity of
second-order linear multivariate methods to disturbances of the kind described (Devlin et al.
1981), it seems to us that the consequences of heterogeneity, at least, on the structure of the
correlation matrix merit attention in nonlinear multivariate analysis.
The sensitivity of least-squares based methods to disturbances in the data distribution is
leading, in careful applications of linear multivariate methods, to the fitting of a model only where
the data distribution has first been examined and shown to be well-behaved. Thus, the first step is
to probe the data distribution, and to proceed to the next stage (fitting) only where evidence
justifying this step can be adduced. Good examples of this practice are provided by Campbell
(1980) and by Smith et al. (1983). Such measures do not seem to be part of the new
methodology. Instead, in nonlinear multivariate analysis, the impression is conveyed that the
optimal transformations will, in themselves, suffice to bring the joint distribution to an acceptable
form, irrespective of the presence of gross heterogeneity or other more subtle disturbances. There
is evidence in support of the robustness of nonlinear methods against such features. Nominal and
ordinal data are certainly less sensitive than metric data to any peculiarities which may be present
in the data. Further, the stability of the data transformations themselves has been clearly
demonstrated in certain applications (e.g., van der Burg and de Leeuw 1983). The use of loss
functions other than unweighted least squares functions might further strengthen robustness.
Nevertheless, it would seem to us unwise in ecological applications to disregard the data structure
altogether.
In the absence of some such evidence as a coherent and substantially linear Q-Q probability
plot of the data prior to fitting a model, the question of the impact of the data on the outcome of
analysis must remain equivocal. From the alternating least squares method of algorithm
construction, which at the time of writing is an integral part of nonlinear multivariate analysis, it is
plain that the data distribution cannot be examined in the usual way before fitting a model. What
can, however, be done is to first obtain the matrix of optimally transformed variables, using for
example the step-one HOMALS procedure of Gifi (1981, sect. 3.8.2). An even better alternative
might be to obtain the optimally transformed variables from a preliminary run of the nonlinear
analysis in question, performed solely for this purpose, as the scaling of optimally transformed
variables is dependent on the criterion (loss function) minimized. The joint distribution of the
transformed variables could then be examined, using a Q-Q plot for the purpose, and, where
found to be well-behaved, the analysis completed by the routine application of a standard classical
linear multivariate method. Where the joint distribution proves to be other than homogeneous and
elliptically symmetric, steps to bring the distribution to a more acceptable form might be feasible.
In place of the matrix of transformed variables, the induced correlation matrix of the optimally
transformed variables (van der Burg 1985, p. 38; de Leeuw 1987a) might be used to shed light on
the data distribution.
The detection of disturbances in a data distribution by a rather different procedure from that
described above has been described by van der Burg (1985, p. 41). Van der Burg's procedure is
as follows. Where the presence of outlying observations is disclosed by a nonlinear analysis,
either the implicated variables are re-coded, or the samples in question are otherwise dealt with.
The entire analysis is then re-run. The procedure proposed above, however, is both more
systematic and more revealing as to the shape of the data distribution as a whole than van der
Burg's method. A price would have to be paid for the added refinement. Thus, every nonlinear
analysis would involve at least three steps: a preliminary analysis to obtain the optimally
transformed variables, a probability plot to disclose the joint distribution of the transformed
variables, followed by a classical linear multivariate analysis to fit some suitable model to the
optimally scaled variables where the shape of the distribution of these is shown to be acceptable
for this purpose.
Model fitting. As in multidimensional scaling, fitting a nonlinear multivariate model is an
iterative procedure, with all the attendant sensitivity to a variety of factors. Iteration normally
commences by taking as starting values the actual values of the data to be analyzed, having first
re-expressed quantitative data in discrete form where necessary. Other starting values, however,
are admissible. Fitting is effected by optimizing an appropriate criterion or loss function with
respect to a solution in some pre-specified dimensionality. Convergence may be to a local rather
than a global optimum, an outcome which is not revealed by inspection of the loss function itself.
Trial and error using different starting values and different choices of dimensionality may be
helpful in distinguishing local from global optima. Unlike the equivalent linear method, the
solution in nonlinear multivariate analysis is not nested. That is to say, the coordinates of a
solution in t dimensions are not equal to the first t coordinates of the (t+1)-dimensional solution.
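The trial-and-error strategy can be made routine by refitting from several random starting configurations and retaining the solution with the lowest loss, as in this sketch (Python; multidimensional scaling of simulated data stands in for a full nonlinear multivariate analysis):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 5))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Refit from several random starts and keep the solution with lowest stress;
# rough agreement among the best few stress values is evidence (not proof)
# that a global rather than merely local optimum has been reached.
stresses = []
for seed in range(5):
    fit = MDS(n_components=2, dissimilarity="precomputed",
              n_init=1, random_state=seed).fit(D)
    stresses.append(fit.stress_)
best = min(stresses)
```

Repeating the exercise at several choices of dimensionality extends the same safeguard to the choice of t.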
Stability. Results in nonlinear multivariate analysis are dependent on characteristics of the
particular sample analyzed, to an even greater extent than in classical linear multivariate analysis,
since results in nonlinear analysis tend to be sample-specific. As a consequence, while the results
and any conclusions drawn from them may well be valid for the sample actually analyzed, there is
an element of uncertainty or vagueness about any wider validity which the results and conclusions
may possess. This feature of nonlinear multivariate analysis is a direct consequence of the large
number of parameter estimates required. Invariably, there are two sets of these (estimates for
optimal scaling parameters and for model parameters), compared with just one set in linear
multivariate analysis (model parameters). Sample-specificity is very much a function of the total
number of parameter estimates required relative to sample size. In view of the implications of
sample-specificity, it is always worthwhile to assess the extent of this condition in applications.
Stability refers to the extent to which the results of an analysis are resistant to small
perturbations in the data. Small perturbations might reasonably be expected in taking one or more
additional samples, with the same specifications and from the same domain as the original sample.
Or, comparable additional samples may be obtained by resampling from the original. The
jackknife and the bootstrap are both computer-intensive re-sampling schemes which have been
used to examine the stability of results in nonlinear multivariate analysis (van der Burg 1985; van
Rijckevorsel et al. 1985). The use of some such procedure to assess the sample-specificity of
results is best regarded as an integral part of any nonlinear multivariate analysis.
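A bootstrap assessment of stability might be sketched as follows (Python; for brevity, the leading eigenvalue of the correlation matrix stands in for the results of a full nonlinear analysis). The spread of the statistic over resamples indicates how far the result is sample-specific:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))             # simulated samples-by-variables data
n = X.shape[0]

def statistic(data):
    # Stand-in for the result of a full analysis: the leading eigenvalue
    # of the correlation matrix.
    return np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[-1]

# Bootstrap: resample rows with replacement, recompute, inspect the spread.
boot = np.array([statistic(X[rng.integers(0, n, size=n)]) for _ in range(200)])
se = boot.std(ddof=1)                    # small standard error => stable result
```

In an actual application the resampled statistic would be the configuration or the parameter estimates themselves, compared across resamples as in van der Burg (1985) and van Rijckevorsel et al. (1985).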
Concluding remarks. The essential point about nonlinear multivariate analysis is that it
extends the applicability of all the methods of classical linear multivariate analysis to nonmetric
data. Better fitting models, yet more parsimonious than would otherwise be the case, result.
Nonlinear multivariate analysis also subsumes methods that are well-suited to the analysis of
relationships between sets of variables of different kinds, a problem that we have seen is pertinent
in a large class of ecological endeavours. There are other benefits. Nonlinear methods are
well-suited to dealing with large, relatively unstructured (i.e., homogeneous) data sets for which
there is little in the way of prior information about physical or causal mechanisms. A wealth of
new opportunities for productive analysis is therefore provided.
Despite the flexibility that nonlinear multivariate analysis brings to data analysis, the
approach has its limitations and it is only proper that these be examined. It is well to recognize, for
example, that nonlinear methods comprise a very restricted class of multivariate techniques; they
are methods that depend on the data only through second-order moments and product-moments.
More specifically, nonlinear multivariate analysis is confined strictly to methods that have a
correlation matrix as their starting point (de Leeuw 1987a, 1987b). This represents a very severe
cutting operation. Furthermore, the essential role of the correlation matrix immediately raises two
issues, namely a need to consider the consequences of (a) the implied centering and scaling of the
data; and (b) the likely impact of the data themselves on the outcome of the analysis. Now,
centering and standardization may each be called for where there are compelling ecological
grounds and where the sample is substantially homogeneous. If the sample is not homogeneous,
operations involving sample means and sample standard deviations are not well-founded. The
point to be emphasized here is that in nonlinear multivariate analysis, freedom to allow substantive
or other pertinent considerations to guide and inform the crucial issue of choice of scale unit and
origin is lost. Even where a sample is substantially homogeneous, it is necessary to bear in mind
that a correlation matrix has a breakdown point of zero percent. The sensitivity of the correlation
matrix to possible disturbances in the data distribution accordingly has to be taken very seriously.
Evidently, in ecological studies, it could be prudent if not mandatory to examine global and local
features of the data structure before embarking on nonlinear multivariate analysis. The Q-Q
probability plot would provide one convenient means of obtaining insight both into the coherence
of the sample as a whole and into the symmetry and tail characteristics of the joint
distribution. Other procedures for the same purpose are also available (e.g., Friedman and Rafsky
1981). Where the data on examination prove to be free from irregularities, one might proceed to fit
some appropriate nonlinear model. Otherwise, steps to bring the data distribution into closer
conformity with the desirable norms for any standard linear analysis first deserve to be
contemplated.
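The zero breakdown point of the correlation matrix is easy to demonstrate numerically. In the artificial fragment below, twenty observations lie exactly on a line, so the correlation is exactly one; replacing a single observation with a gross outlier drives the correlation to a strongly negative value.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# twenty observations lying exactly on the line y = x: r is 1
xs = [float(i) for i in range(20)]
ys = list(xs)
r_clean = pearson(xs, ys)

# corrupt a single observation with a gross outlier
xs[0], ys[0] = 1000.0, -1000.0
r_outlier = pearson(xs, ys)

print(r_clean, r_outlier)   # the second value is strongly negative
```

One bad value out of twenty is enough to reverse the sign of the coefficient, which is why screening the data distribution before any correlation-based analysis matters.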
We have also seen that nonlinear methods have a strong tendency to capitalise on chance
characteristics of the sample analyzed. Accordingly, it is good practice to assess the extent of
sample-dependence by analysis. Jackknife and bootstrap analyses are available for this purpose.
Further, as model fitting in nonlinear multivariate analysis is iterative, considerably more skill is
called for implementing a chosen method than is the case for a standard linear procedure. In short,
it is plain that the methods comprising nonlinear multivariate analysis are delicate tools, to be used
with sound judgement and care, if trustworthy results are to be obtained. It seems therefore that if
nonlinear methods are to be exploited to ecological advantage, users will first have to acquire the
necessary insight and skill.
4. HETEROGENEITY
data distributions encountered in terrestrial plant ecology are generally of this sort, being
especially characteristic of large-scale vegetation surveys. Heterogeneous data sets pose
difficulties for statistical data analysis. With most scaling methods, for example, there is an
implicit or explicit requirement that, for sensible interpretation of results, sample-units should not
deviate too far from being identically distributed, at least. Clustering procedures are free from any
such requirement but are all too likely to impose structure on the data, and thus to destroy features
that may have ecological significance. Scaling methods generally, as we shall see, while much
less drastic in their effects, are no less misleading when applied to large sets of heterogeneous
data. Evidently, there is a need for methods that are sensitive to data structures, which, in the
words of Webb (1954), hover in a tantalising way between the continuous and the discontinuous.
Most, if not all, scaling methods in common use are centered or, what amounts to the same
thing, are applied to centered data. The lack of robustness of centered scaling methods generally
(classical multivariate analysis and multidimensional scaling) to discontinuities in data arises as
follows. The first dimension of any centered scaling method applied to heterogeneous data is
likely to divide the sample into two (or perhaps more) subgroups, not necessarily of equal size or
coherence. The second dimension in such cases is open to being unduly influenced by chance
characteristics of one or another subgroup, such as a difference in size or scatter, or by some
uneasy compromise of properties of the two, rather than by characteristics of the sample as a
whole. In either case, the dimension extracted will poorly represent the total sample;
consequently, the second axis is all too likely to be uninformative if not actually misleading.
These remarks apply with even greater force to the third and higher dimensions, extraction of
which will only confound an already confused situation. No warning that an analysis may have
been affected in this way is signalled, the results having the appearance of being normal in every
respect. We stress that it is not the interpretation of dimensions in ecological terms that is at issue
here; we regard the axes simply as a convenient coordinate system in relation to which to study
the sample after projection. The point of general interest here is that scaling methods applied to
heterogeneous data after centering are likely to yield misleading or incorrect results and hence to
lead to confusion rather than to insight.
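The tendency of a centered first axis to record a between-group split can be simulated directly. The sketch below uses artificial data and plain power iteration; it makes no claim about any specific method discussed above. Two well-separated clusters are generated, the leading principal axis of the pooled, centered sample is extracted, and the scores on that axis fall into two bands, one per cluster.

```python
import math
import random

def leading_axis(data):
    """Leading principal axis of centered 2-D data, by power iteration
    on the 2x2 covariance matrix; returns the axis and the axis-1 scores."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(200):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    scores = [x * v[0] + y * v[1] for x, y in centered]
    return v, scores

rng = random.Random(1)
# a heterogeneous sample: two internally tight clusters, far apart
group_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
group_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
v, scores = leading_axis(group_a + group_b)

# the first-axis scores fall into two well-separated bands, one per
# cluster: the axis records the split, not structure within groups
print(f"axis = ({v[0]:.2f}, {v[1]:.2f}); score gap: "
      f"{sorted(scores)[49]:.1f} .. {sorted(scores)[50]:.1f}")
```

Once the first axis is absorbed by the split, the second and higher axes can only describe whichever subgroup happens to dominate, which is the hazard described above.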
Observe that the notion of correcting for row or column effects (or both), that is, of centering, is
not well-founded for other than a homogeneous sample. Evidently, a very strong
case exists in terrestrial ecology for probing the data structure prior to the application of any
scaling method, and for centering the data and proceeding only where the homogeneity of the
sample is beyond question. We note also that scaling methods that have proved useful with
complex data distributions of the kind in question do exist, and we turn now to consider briefly two
of these.
Non-centered principal component analysis. Effects of centering and non-centering on the
principal component analysis of heterogeneous ecological data have been studied by Noy-Meir
(1971, 1973). Noy-Meir was able to demonstrate that, by non-centering and varimax rotation of
widens the applicability of the analysis. With these modifications, canonical variate analysis may
be useful in connection with vegetation surveys and investigations of vegetation succession where
the discrete element in the data is clearly dominant and where global rather than local variation -
variation between groups as distinct from within groups - is of overriding interest.
Two further developments in the spirit of Campbell's suggestions for widening the
applicability of canonical variate analysis deserve mention. Digby and Gower (1981) have
described a robust form of canonical variate analysis, called canonical coordinate analysis (see
Gower, this volume), which is both distribution-free and for which the requirement for a relatively
stable covariance structure is entirely dispensed with. In this robust version, canonical variate
analysis is applicable under very general conditions. The second development is due to Hawkins
and his co-workers (Hawkins and Merriam 1974, Hawkins and Ten Krooden 1979) and is
directed towards extending the use of the method to situations where discrete communities or other
comparable sample-groups cannot be recognised at the outset. Their proposal uses constrained
cluster analysis to create discrete groups of neighbouring samples in abstract or geographical
space, to which canonical variate analysis or a robust variant thereof may be applied in the usual
way. The appeal of this development will be self-evident in the context of data distributions that
are neither strictly continuous nor strictly discontinuous but are somewhere between the two.
Heterogeneity may be dealt with in yet other ways. We mention finally a two-stage
procedure for its analysis in which clustering and scaling each have a role, and which in some
ways is reminiscent of an extreme form of Noy-Meir's procedure. The data are first clustered by
means of a suitable standard clustering procedure (e.g., Hartigan 1975). Some at least of the
resulting clusters may be expected to be substantially homogeneous, a point that is readily checked
by means of projection pursuit or of a Q-Q probability plot. The internal structure of at least the
larger of the homogeneous clusters thus established may then be examined by means of any
scaling method appropriate to the problem in hand. This two-stage strategy, though more
cumbersome than either procedure described above, does provide a convenient and informative
means of dealing with large, heterogeneous data sets, which might otherwise prove difficult to deal
with.
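A minimal rendering of the two-stage strategy, with a bare two-group k-means standing in for the "suitable standard clustering procedure" (the choice of k-means, the initialization rule, and all parameters are assumptions made for illustration): stage one clusters the sample; stage two would then treat each sufficiently homogeneous cluster as a separate candidate for scaling.

```python
import math
import random

def two_group_kmeans(points, iters=20):
    """Minimal two-group k-means (stage one of the two-stage strategy),
    with centres started at the two diagonal extremes of the sample."""
    centers = [min(points, key=lambda p: p[0] + p[1]),
               max(points, key=lambda p: p[0] + p[1])]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            d0 = (p[0] - centers[0][0]) ** 2 + (p[1] - centers[0][1]) ** 2
            d1 = (p[0] - centers[1][0]) ** 2 + (p[1] - centers[1][1]) ** 2
            labels[i] = 0 if d0 <= d1 else 1
        for j in (0, 1):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels, centers

rng = random.Random(7)
data = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
        + [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(40)])

labels, centers = two_group_kmeans(data)
sep = math.hypot(centers[0][0] - centers[1][0], centers[0][1] - centers[1][1])
# stage two would now check each cluster for homogeneity (e.g. with a
# Q-Q plot) and scale the acceptably homogeneous ones separately
print(f"cluster sizes: {labels.count(0)}, {labels.count(1)}; "
      f"centre separation = {sep:.1f}")
```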
5. CONCLUSIONS
Recent advances in scaling theory promise to sharpen the analysis of ecological data.
Conceptual advances in the comprehension of vegetation may follow. Salient characteristics of
ecological data are their high dimensionality, variables that fall naturally into two or more sets, and
distributions that are composites of some unknown mixture of component multivariate
distributions. Characteristically, the data matrix is a partitioned matrix A = [A1 | A2 | ... | Am]
whose constituent observations (rows) are not identically distributed. Our attention has been
confined to three related families of techniques for analyzing arrays of this kind: methods for linear
reduction of dimensionality, for nonlinear reduction of dimensionality, and for nonlinear
multivariate analysis. The approaches represented are united by a common theme, namely that they
are all concerned, directly or indirectly, with scaling high-dimensional data. Methodological
developments of several other kinds were presented at the workshop. These include conditional
and constrained clustering, fractal theory, spatial analysis, qualitative path analysis, and the duality
diagram, some or all of which seem likely to be useful for analyzing ecological data. Our
discussion focused principally on scaling methods because it is these which in our view seem
likely to have the greatest impact on terrestrial plant ecology in the foreseeable future and because
the benefits and limitations of these methods are the most tangible at present. Unquestionably,
however, there is an important place for approaches other than scaling.
Of the three families of methods, classical multivariate analysis provides techniques for
linear reduction of dimensionality which are conceptually simple and computationally
straightforward. But with these methods are also associated restrictions on the data for analysis
which all too often prove stringent or even unrealistic in practice. Multidimensional scaling
provides varied opportunities under much more general conditions for nonlinear reduction of
dimensionality, for analyzing three-way arrays, and for revealing relations between samples,
species and external variables of several kinds. Nonlinear multivariate analysis opens the way for
the scaling of nonmetric data by means of classical linear and bilinear methods, so dispensing with
the need for quantitative data in terrestrial ecology. Classical and nonlinear multivariate analysis
both contain methods which specifically address the question of the relatedness of variables of
different kinds, and for this reason more closely match the prime substantive goal of a large and
important class of ecological endeavours than multidimensional scaling. Indeed, variables almost
always play a more prominent role in multivariate analysis than in multidimensional scaling, a
point of some importance in selecting a method for a given purpose. Nonlinear multivariate
analysis and multidimensional scaling are both appreciably more demanding computationally than
classical multivariate analysis. Moreover, none of the three families is conspicuously rich in
methods for dealing with data distributions that are complex mixtures of several underlying
multivariate distributions.
Much of the new methodology discussed rests heavily on notions or techniques of classical
multivariate analysis. Thus, the robust scaling procedures of Campbell (1982, 1984) and of
Digby and Gower (1981) can each be thought of as direct extensions of the classical method of
canonical variate analysis. The noncentered principal component analysis of Noy-Meir (1971,
1973) is another case of the same general kind. In the same spirit also is the nonlinear multivariate
analysis of Gifi (1981), which amounts to nothing less than a generalization of virtually the whole
of classical multivariate analysis to encompass data whose measurement-level characteristics are
very general indeed. There are also close affinities between multidimensional scaling and classical
multivariate analysis, though the derivation of a particular technique in each case is usually quite
different. Meulman (1986) gives a clear exposition of the relationships from a distance-geometric
point of view. The impetus for the above and other similar developments has been a need for
methods that are free from the tightly-specified restrictions associated with standard methods,
which are often unrealistic or inconvenient in practice. In this way, the realities of ecological data
are respected and the applicability of classical methods widened. The simple device of
noncentering in the context of the hitherto somewhat intractable problem posed by heterogeneity,
illustrates just how much can be accomplished in this direction. A need for scaling methods
suitable for partitioned, heterogeneous data sets, remains.
The second-order linear methods of classical multivariate analysis constitute an extremely
restricted class of procedures; they are methods whose dependence on the data is channeled
entirely through the covariance matrix. Algebraically, they represent little more than different
aspects of a single matrix operation, namely singular value decomposition. Furthermore, they are
usually unable to handle outliers or other disturbances in the data distribution. It would be
unreasonable to expect such a narrowly-based class of methods to provide solutions to all or
indeed most problems posed by multiresponse data. Thus, it is likely to be advantageous to
develop a variety of other, perhaps quite different approaches. Several of the methodological
developments mentioned at the outset of the present section are of this kind. In contrast to
methods whose development can be traced to the 1930's as variations on a single theme, methods
which would be inconceivable without the aid of the high-speed computer would be especially
appealing. There is relevant work, some of which is directed towards the provision of
information-rich graphical displays similar in spirit to those of multivariate analysis and
multidimensional scaling (Friedman and Tukey 1974, Friedman and Rafsky 1983).
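The dependence of the classical second-order methods on the covariance matrix alone, and their common root in singular value decomposition, can be seen in miniature for two variables, where the eigendecomposition of the 2×2 covariance matrix has a closed form. The sketch below is illustrative only; the data are artificial.

```python
import math
import random

def pca_2d(data):
    """Principal axes of two-variable data from its covariance matrix,
    via the closed-form eigendecomposition of a symmetric 2x2 matrix
    [[a, b], [b, c]]; the SVD of the centered data matrix yields the
    same axes, with squared singular values n times the eigenvalues."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    a = sum((x - mx) ** 2 for x, _ in data) / n        # var(x)
    c = sum((y - my) ** 2 for _, y in data) / n        # var(y)
    b = sum((x - mx) * (y - my) for x, y in data) / n  # cov(x, y)
    mid, half = (a + c) / 2, math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half, mid - half                # eigenvalues
    if abs(b) > 1e-12:
        v = (lam1 - c, b)                              # eigenvector for lam1
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (lam1, lam2), (v[0] / norm, v[1] / norm)

rng = random.Random(3)
data = []
for _ in range(200):
    x = rng.gauss(0, 2)
    data.append((x, x + rng.gauss(0, 0.5)))   # y tracks x closely

(lam1, lam2), axis = pca_2d(data)
print(f"eigenvalues: {lam1:.2f}, {lam2:.2f}; leading axis = "
      f"({axis[0]:.2f}, {axis[1]:.2f})")
```

Nothing about the individual observations survives except the two variances and the covariance, which is exactly the narrowness, and the outlier sensitivity, described above.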
Reference to centering, above, serves as a reminder of a serious drawback of scaling
methods generally: they are scale-dependent. Where substantive or other considerations provide
clear guidelines, the dependence can be used to good advantage. In practice, however, arbitrary
choices for the unit and origin of the scale often have to be made and it is particularly important
then to bear the consequences of these choices in mind when interpreting results.
In turning to consider the likely impact of scaling methods generally on terrestrial plant
ecology, it will be instructive to reflect first on the success actually achieved by the use of
classical, linear multivariate analysis in this sphere. The underlying theory has been well
understood for something like fifty years, while the methods themselves have been
computationally feasible for 25 years or so. Moreover, linear methods have a fairly extensive
history of application in terrestrial plant ecology. Thus, there is an adequate foundation on which
assessment can be based. Regrettably, it is plain from the ecological literature that classical
multivariate analysis is poorly understood by terrestrial ecologists. Familiarity with the subject
rarely extends beyond a superficial acquaintance with one or two methods. As a result, all too
often a method, chosen in relation to some specified ecological purpose, fails to match the declared
purpose. Indeed, considerations such as the purpose of an investigation frequently seem to play
little or no part in the selection process. Even where a plausible method is used, deficiencies in its
implementation and in the interpretation and reporting of results are often apparent. These are very
disturbing matters. Further, they are often exacerbated in practice by uncritical reliance on widely
distributed computer programs or program packages. Convincing applications of classical
multivariate analysis in terrestrial plant ecology are accordingly very few.
Altogether, there is an alarming gap between theory and practice in this area, so much so
that the impact of multivariate analysis can scarcely be regarded as beneficial. Similar views have
been expressed by many workers (e.g., Jeffers 1972, Innis 1979, Levin 1980, Van Valen 1985,
Freeman 1987). Lindley (1984) has remarked that the success of multivariate analysis in
applications generally has been small in relation to the body of theory. Gnanadesikan and
Kettenring (1984) suggest reasons as to why this should be so. Such shortcomings in the use of
statistical methods are by no means confined to multivariate analysis or to terrestrial ecology; they
are simply one aspect of a much more pervasive malaise (see Underwood 1981, Preece 1982,
1986, Gnanadesikan and Kettenring 1984, Hamill 1985). The spread of methodological
innovations from the research laboratory where they were developed to the applied scientist is
known to be slow (Jeffers 1971, Bentler 1986). Gani (1985) has estimated the time lag in
question to be of the order of 20 to 30 years, which in the light of ecological experience may be a
modest underestimate. If this argument is accepted, then it would be prudent to recognise that a
lengthy delay is in order before classical, linear methods come to be employed effectively in
community ecology in a general way. Yet it seems to us that if terrestrial plant ecology is to
become a science it must be more than a collection of anecdotes. This is precisely where
mathematics in our view has a contribution to make. Algebraic models provide the only way of
dealing adequately with the complexity of plant communities and ecosystems; they are able to
abstract the essential elements of a problem and to identify the minimum number of dimensions or
parameters necessary to describe such complex systems.
Before turning to consider the likely impact of recent developments in multidimensional
scaling and nonlinear multivariate analysis, it is worth recalling the elementary and unified
algebraic foundation of classical linear multivariate analysis (Krzanowski 1971) and its
comparatively simple computational requirements. At the same time it is well to be aware that
linear models are delicate tools to be applied with care and good sense if trustworthy results are to
be achieved. Multidimensional scaling and nonlinear multivariate analysis are both less
straightforward algebraically and more demanding computationally than classical multivariate
analysis. Further, unlike classical multivariate analysis, multidimensional scaling and nonlinear
multivariate analysis have been generally available for perhaps only ten years or less, and their
impact on terrestrial plant ecology to date has been negligible. Using the record of classical
multivariate analysis in terrestrial plant ecology, we shall argue that the new methodology carries
with it stringent responsibilities if it is to be used productively. De Leeuw (1987a) has remarked
in connection with nonlinear multivariate analysis that nonlinear methods require even more care
and even more expert knowledge than standard linear methods. This observation is no less true of
multidimensional scaling (cf. Section 2.2 above). Given that classical multivariate analysis has yet
to be generally applied to useful advantage in terrestrial ecology, we have no assurance that the
still more demanding new methodology will be properly used. The existence of pertinent methods
is no guarantee of their sensible use, far from it. Success in endeavours of this kind is to be
expected only where methodological innovations are accompanied by a commensurate effort on
the part of practitioners to make sure that they understand the nature and properties of the methods
available and to acquire the skills necessary for their successful implementation. Unless the effort
is made, there is every danger that the opportunities afforded by the new methodology, far from
leading to new ecological insights, will paradoxically only further widen the already alarming gap
between theory and practice in vegetation ecology.
The often suggested course of seeking the advice of a professional statistician
(Gnanadesikan and Kettenring 1984) does not work well in practice; statisticians with the requisite
expertise, time and interest are simply too few and too far between. This certainly has been the
case with ecological applications of classical, linear methods, as the record clearly shows. There
are no grounds for supposing that matters will improve in the foreseeable future. In any case,
while a statistician may be asked for guidance, he cannot be expected to make good fundamental
deficiencies in the content and structure of ecological research programs. It is for ecologists to
work out how this is to be done. Steps which in our view would go some way towards rectifying
shortcomings in current applications of scaling methods in terrestrial ecology include the
following. First, to recognise that field observations in ecology are vector-valued quantities
(strictly, partitioned vectors). It follows immediately from the nature of field observations that the
description and analysis of vegetation are largely algebraic matters. Second, the need to check
data for irregular features before embarking on scaling. A variety of measures are available for
this purpose (Gnanadesikan 1977, Friedman and Stuetzle 1982). Third, the cardinal importance
of selecting a method of data analysis that is suitable for the purpose in hand. There are two
aspects to this question. Ensuring (a) that a chosen method is properly matched to the declared
substantive goal; and (b) that assumptions explicit and implicit in the method concerning estimates
of fitted quantities are satisfied by the data for analysis. Fourth, the need to carefully consider the
unit and origin of the scale of measurement and their effect on the outcome of the analysis.
Choices should be based on ecological considerations whenever possible. Noy-Meir (1973) and
Noy-Meir et al. (1975) provide useful guidelines. There are numerous other aspects of good
statistical data analysis: appreciation that realistic research goals can be set only by familiarity with
and reference to the range of methods available for their attainment, that much can be done in the
design stage to ensure that subsequent data analysis will be manageable and efficient, that large
samples are advantageous in increasing the precision of parameter estimates and that the
measurement level of the data to be collected is best decided with this point in mind. Data
re-expression before or during analysis and computer-intensive resampling schemes following
analysis can do much to sharpen both the results and the conclusions drawn from them. Until
principles of statistical data analysis such as these are widely used to guide and inform the use of
scaling methods in terrestrial ecology, misgivings about the worth of much numerical work in this
field are bound to persist.
In short, a large and growing body of scaling methods for aiding the comprehension of
terrestrial vegetation exists. There is nevertheless room for further methods whose development is
guided by greater attention to the salient characteristics of ecological data. On the other hand, the
availability of the new methodology has unfortunately not been matched by a commensurate
increase in our understanding of vegetation. Far from it. In our view, the impact of scaling
methods generally in community ecology has not been beneficial, principally because of the
existence of an alarming gap between theory and practice in this area. Progress in describing and
comprehending vegetation will result, in our view, only when plant ecologists are equipped to use
existing and emerging methods with insight and ingenuity.
REFERENCES
de Leeuw, J. 1987a. Nonlinear multivariate analysis with optimal scaling, p. 157-187. In this
volume.
de Leeuw, J. 1987b. Path analysis with optimal scaling, p. 381-404. In this volume.
de Leeuw, J., and J. Meulman. 1986. Principal component analysis and restricted
multidimensional scaling. In W. Gaul and M. Schader [eds.] Classification as a tool of
research. North Holland, Amsterdam.
Devlin, S.J., R. Gnanadesikan, and J.R. Kettenring. 1981. Robust estimation of dispersion
matrices and principal components. J. Amer. Stat. Ass. 76: 354-362.
Digby, P.G.N., and J.C. Gower. 1981. Ordination between- and within-groups applied to soil
classification, p. 63-75. In D.F. Merriam [ed.] Down to earth statistics: Solutions looking
for geological problems. Syracuse University Geological Contributions.
Fisher, R.A. 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen.
Lond. 7: 179-188.
Freeman, G.H. 1987. Letter to the editor. Nature 325: 656.
Friedman, J.H., and L.C. Rafsky. 1981. Graphics for the multivariate two-sample problem. J.
Amer. Stat. Ass. 76: 275-295.
Friedman, J.H., and L.C. Rafsky. 1983. Graph-theoretic measures of multivariate association
and prediction. Ann. Stat. 11: 377-391.
Friedman, J.H., and W. Stuetzle. 1982. Projection pursuit methods for data analysis, p. 123-147.
In R.L. Launer and A.F. Siegel [eds.] Modem data analysis. Academic Press, New York.
Friedman, J.H., and J.W. Tukey. 1974. A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers C-23: 881-890.
Gabriel, K.R. 1971. The biplot graphic display of matrices with application to principal
component analysis. Biometrika 58: 453-467.
Gabriel, K.R. 1981. Biplot display of multivariate matrices for inspection of data and diagnosis,
p. 147-173. In V. Barnett [ed.] Interpreting multivariate data. Wiley, Chichester.
Gabriel, K.R. 1982. Biplot, p. 263-271. In S. Kotz and N.L. Johnson [eds.] Encyclopaedia of
the statistical sciences, Vol. 1. Wiley, New York.
Gani, J. 1985. In L. Rode and T. Speed [eds.] Teaching of statistics in the computer age.
Chartwell Bratt, Bromley.
Geer, J.P. van der. 1984. Relations among k sets of variables. Psychometrika 49: 79-94.
Gifi, A. 1981. Nonlinear multivariate analysis. Department of Data Theory, University of Leiden,
Leiden.
Gittins, R., and J. Ogden. 1977. A reconnaissance survey of lowland tropical rain forest in
Guyana. [Unpublished manuscript]
Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate observations. Wiley,
New York.
Gnanadesikan, R., and J.R. Kettenring. 1984. A pragmatic review of multivariate methods in
applications, p. 309-337. In H.A. David and H.T. David [eds.] Statistics: an appraisal. Iowa
State Univ. Press, Ames.
Gower, J.C. 1966. Some distance properties of latent root and vector methods used in
multivariate analysis. Biometrika 53: 325-338.
Gower, J.C., and P. Legendre. 1986. Metric and Euclidean properties of dissimilarity
coefficients. J. Class. 3: 5-48.
Greenacre, M.J. 1984. Theory and applications of correspondence analysis. Academic Press,
New York.
Greenacre, M.J., and L.G. Underhill. 1982. Scaling a data matrix in a low-dimensional Euclidean
space, p. 183-268. In D.M. Hawkins [ed.] Topics in applied multivariate analysis.
Cambridge Univ. Press, Cambridge.
Hamill, L. 1985. On the persistence of error in scholarly communication: the case of landscape
aesthetic. Can. Geographer 29: 270-273.
Hartigan, J.A. 1975. Clustering algorithms. Wiley, New York.
Hawkins, D.M., and D.F. Merriam. 1974. Zonation of multivariate sequences of digitized
geologic data. J. Int. Assoc. Math. Geology 6: 263-269.
Hawkins, D.M., and J.A. Ten Krooden. 1979. Zonation of sequences of heteroscedastic
multivariate data. Computers and Geosciences 5: 189-194.
Heiser, W.J., and J. Meulman. 1983a. Analyzing rectangular tables by joint and constrained
multidimensional scaling. Journal of Econometrics 22: 139-167.
Heiser, W.J., and J. Meulman. 1983b. Constrained multidimensional scaling, including
confirmation. Applied Psychological Measurement 7: 381-404.
Horst, P. 1961. Relations among m sets of measures. Psychometrika 26: 129-150.
Hotelling, H. 1935. The most predictable criterion. J. Educ. Psychol. 26: 139-142.
Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28: 321-377.
Innis, G.S. 1979. Letter to the editor. Science 204: 242.
Israels, A. 1984. Redundancy analysis for qualitative variables. Psychometrika 49: 331-346.
Jeffers, J.N.R. 1971. The challenge of modern mathematics. In J.N.R. Jeffers [ed.] Mathematical
models in ecology. Blackwell, Oxford.
Jeffers, J.N.R. 1972. The statisticians' role in the environmental sciences. The Statistician 21:
3-17.
Kenkel, N.C. 1986. Structure and dynamics of jack pine stands near Elk Lake, Ontario: a
multivariate approach. Can. J. Bot. 64: 486-497.
Kenkel, N.C., and L. Orlóci. 1986. Applying metric and nonmetric multidimensional scaling to
ecological studies: some new results. Ecology 67: 919-928.
Kettenring, J.R. 1971. Canonical analysis of several sets of variables. Biometrika 58: 433-451.
Kruskal, J.B. 1977. The relationship between multidimensional scaling and clustering, p. 17-44.
In J. van Ryzin [ed.] Classification and clustering. Academic Press, New York.
Kruskal, J.B., and J.D. Carroll. 1969. Geometric models and badness-of-fit functions. In P.R.
Krishnaiah [ed.] Multivariate analysis, Vol. II. Academic Press, New York.
Krzanowski, W.J. 1971. The algebraic basis of classical multivariate methods. The Statistician
20: 51-61.
Lebart, L., A. Morineau, and K.M. Warwick. 1984. Multivariate descriptive analysis. Wiley,
New York.
Levin, S.A. 1980. Mathematics, ecology and ornithology. The Auk 97: 422-425.
Lindley, D.V. 1984. Prospects for the future: the next 50 years. J. Roy. Stat. Soc., Ser. A 147:
359-367.
Meulman, J.J. 1986. A distance approach to nonlinear multivariate analysis. DSWO Press,
Leiden.
Noy-Meir, I. 1971. Multivariate analysis of the semi-arid vegetation in southeastern Australia:
nodal ordination by component analysis, p. 159-193. In N.A. Nix [ed.] Quantifying
ecology, Proc. Ecol. Soc. Aust. 6.
Noy-Meir, I. 1973. Data transformations in ecological ordination. I. Some advantages of
non-centering. J. Ecol. 61: 329-341.
Noy-Meir, I. 1974a. Multivariate analysis of the semiarid vegetation in southeastern Australia. II.
Vegetation catenae and environmental gradients. Aust. J. Bot. 22: 115-140.
Noy-Meir, I. 1974b. Catenation: quantitative methods for the definition of coenoclines. Vegetatio
29: 89-99.
Noy-Meir, I., D. Walker, and W.T. Williams. 1975. Data transformations in ecological
ordination. II. On the meaning of data standardization. J. Ecol. 63: 779-800.
Preece, D.A. 1982. The design and analysis of experiments: what has gone wrong? Utilitas
Mathematica 21A: 201-244.
Preece, D.A. 1986. Illustrative examples: illustrative of what? The Statistician 35: 33-44.
Rao, C.R. 1948. The utilization of multiple measurements in problems of biological classification.
J. Roy. Stat. Soc., Ser. B 10: 159-203.
Sibson, R. 1979. Studies in the robustness of multidimensional scaling: perturbational analysis of
classical scaling. J. Roy. Statist. Soc., Ser. B 41: 217-229.
Sibson, R., R. Bowyer, and C. Osmond. 1981. Studies in the robustness of multidimensional
scaling: Euclidean models and simulation studies. J. Stat. Comp. Simul. 13: 273-296.
Smith, R.E., N.A. Campbell, and J.L. Perdrix. 1983. Identification of some Western Australian
Cu-Zn and Pb-Zn gossans by multi-element geochemistry, p. 109-126. In R.E. Smith [ed.]
Geochemical exploration in deeply weathered terrain. C.S.I.R.O. Institute of Energy and
Earth Sciences, Division of Mineralogy, Floreat Park.
Stevens, S.S. 1962. Mathematics, measurement and psychophysics. In S.S. Stevens [ed.]
Handbook of experimental psychology. Wiley, New York.
Takane, Y. 1985. The nonmetric data analysis, p. 314-318. In S. Kotz and N.L. Johnson [eds.]
Encyclopaedia of statistical sciences, Vol. 6. Wiley, New York.
Takane, Y., F.W. Young, and J. de Leeuw. 1977. Nonmetric individual differences
multidimensional scaling: an alternative least squares method with optimal scaling features.
Psychometrika 42: 7-67.
Tyler, D.E. 1982. On the optimality of the simultaneous redundancy transformations.
Psychometrika 47: 77-86.
Underwood, A.J. 1981. Techniques of analysis of variance in experimental marine biology and
ecology. Oceanogr. Mar. Biol. Ann. Rev. 19: 513.
van den Wollenberg, A.L. 1977. Redundancy analysis. An alternative for canonical correlation
analysis. Psychometrika 42: 207-219.
van Rijckevorsel, J., B. Bettonvil, and J. de Leeuw. 1985. Recovery and stability in nonlinear
principal component analysis. Internal Report RR-85-21, Department of Data Theory,
University of Leiden.
Van Valen, L.M. 1985. Letter to the editor. Nature 314: 230.
Verdegaal, R. 1986. OVERALS. Department of Data Theory, University of Leiden.
Walter, H., and S.-W. Breckle. 1985. Ecological systems of the geobiosphere. I. Ecological
principles in global perspective. Springer-Verlag, Berlin.
Webb, D.A. 1954. Is the classification of plant communities either possible or desirable? Botanisk
Tidsskrift 51: 362-370.
Weinberg, S.L., J.D. Carroll, and H.S. Cohen. 1984. Confidence regions for INDSCAL using
the jackknife and bootstrap techniques. Psychometrika 49: 475-491.
Wish, M., and J.D. Carroll. 1982. Multidimensional scaling and its applications. In P.R.
Krishnaiah and L.N. Kanal [eds.] Handbook of statistics, Vol. 2. North Holland,
Amsterdam.
Young, F.W. 1981. Quantitative analysis of qualitative data. Psychometrika 46: 347-388.
Young, F.W., J. de Leeuw, and Y. Takane. 1980. Quantifying qualitative data. In E.D.
Lantermann and H. Feger [eds.] Similarity and choice. Hans Huber Verlag, Wien and Bern.
NOVEL STATISTICAL ANALYSES IN TERRESTRIAL ANIMAL ECOLOGY:
DIRTY DATA AND CLEAN QUESTIONS
INTRODUCTION
ORDINATION
CLUSTERING
FRACTALS
PATH ANALYSIS
SPATIAL ANALYSIS
REFERENCES
Marc TROUSSELLIER,
Laboratoire d'Hydrobiologie marine,
Universite des Sciences et
Techniques du Languedoc,
Place Eugene Bataillon,
F-34060 Montpellier Cedex,
France.
EARN: HAIR@FRMOP11
Subject Index
algorithm (see also analysis; computer programs and packages)
- annealing a., 318, 329
- clustering a., 225-287, 291, 294, 325
- constrained clustering a., 291, 294, 295
- cutting plane a., 318
- dynamic programming a., 291
- nonlinear path analysis a., 398
- unfolding a., 201, 203
alternating least squares procedure (ALS), 86, 172, 213, 398
analysis
- ACE-method of nonlinear multivariate a., 174
- asymmetric matrix a., 488, 492, 493, 494
- autocorrelation a.: see autocorrelation
- canonical a., 172, 183
- canonical coordinate analysis, 42, 472
- canonical correlation a., 173, 183, 401, 472, 485, 488, 493, 534, 535, 536
- canonical correspondence a., 214
- canonical decomposition of N-way tables, 81, 85
- canonical variates a., 41, 125, 154, 166, 183, 472, 549, 551, 565
- Chernoff faces, 231
- classical scaling: see analysis (principal coordinates a.)
- cluster a.: see clustering
- common factor a., 397
- confirmatory data a., 183, 537
- constrained scaling, 305, 489, 493
- contingency table a., 488, 561, 570
- correspondence a., 47, 56, 153, 161, 179, 181, 183, 196, 206, 208, 209, 210, 212, 213, 215, 216, 325, 487, 492, 493, 522, 533, 535, 536, 566
- detrended correspondence a., 39, 161, 214, 487, 493
- discriminant a.: see analysis (canonical variates a.)
- distance methods, 412-414
- dual scaling: see correspondence a.
- exploratory data a., 103, 183, 230, 521, 537, 560, 572
- factor a., 183, 230
- feature a., 228, 230
- generalized canonical a., 172, 181
- generalized canonical correlation a., 66, 183
- generalized Procrustes a., 57, 565
- Guttman's principal components of scale a., 179
- homogeneity a., 53, 179, 214, 215
- individual differences scaling a. (INDSCAL), 59, 80, 489, 493, 505
- individual differences in orientation scaling (IDIOSCAL), 91
- item a., 326
- linear projection pursuit, 521
- loglinear a., 159, 166, 386, 492
- maximum likelihood nonmetric 2-way MDS, 66
- metric scaling, 487, 493, 539
- monotonic analysis of variance, 66
- multidimensional preferences scaling, 111
- multidimensional scaling a.: see analysis (nonmetric multidimensional scaling)
- multidimensional unfolding a.: see analysis (unfolding a.)
- multiple correlation a., 397
- multiple correspondence a., 52, 176, 180, 183, 472, 489, 493, 504, 523
- multiple regression a.: see regression
- multiple-set canonical a., 183
- multiplicative analysis of a two-way table, 39, 43
- multivariate a., ix, 157, 158, 163, 401, 531, 537
- nonlinear iterative least squares, 86
- nonlinear iterative partial least squares, 86
- non-linear mapping, 32
- nonlinear multivariate a. with optimal scaling (see also under the specific entry), 157-187, 210, 214, 401, 474, 506, 537, 541, 551, 553
- nonlinear ordination with optimal scaling, 183, 487, 493
- nonlinear path analysis (see also analysis, path a. with optimal scaling), 210, 386, 398, 479, 525
- nonlinear principal component a., 180, 181, 183
- nonmetric multidimensional scaling (MDS), 32, 43, 56, 65-138, 158, 183, 209, 216, 230, 471, 473, 487, 492, 493, 522, 523, 538, 539, 541, 542, 551, 553, 562, 567
- nonmetric unfolding a., 212
- of partial covariances, 150
- of three-way data matrix, 59
stress, 31, 32, 33, 60, 68, 74, 77, 197, 210, 212, 213, 216
- diagram, 77, 475
stretching of coordinate system, 116
structure
- data s., 5, 41
- ecological s., 296
subgraph (connected), 314
succession theory, 290, 291, 361, 473, 490, 565, 568
surface pattern: see analysis (surface pattern a.)
T-square method, 413
table (data): see matrix
target: see variable
Taylor's power law, 486
terrestrial ecosystem, 469
test (statistical; for specific tests, see under the specific entry)
- randomization t., 292, 295
Thomson, James D., 431
three-way, three-mode analysis: see analysis
ties (in MDS), 71
time (see also constraint), 289, 534
- scale, 477
- series (see also autocorrelation), 289
trajectories of organisms, 336
transect, 290, 293, 314
transformation (of data, of variables): see variable
transitivity, 241, 242
trees, 337, 343, 345, 347, 370, 413
trend surface analysis, 479, 490
triangle inequality, 83
trophic level, 560
turbulence, 338, 345, 349, 351, 352, 354, 374
two-way analysis: see analysis
typology, 567
unimodality, 190, 191
units: see objects
variable
- binary v., 6, 23, 162, 310, 312
- categorical v.: see variable (qualitative v.)
- category quantification of a v., 166, 169
- continuous v.: see variable (quantitative v.)
- definition of a v., 161, 162
- distinction between object and variable, 16, 162
- dummy v., 162, 169
- endogenous v., 385, 397, 398
- exogenous v., 385
- indicator v., 396
- latent v., 381, 394, 396, 401
- metric v.: see variable (quantitative v.)
- mixed-type v., 474
- nominal v.: see variable (qualitative v.)
- numerical v.: see variable (quantitative v.)
- ordered v., ordinal v.: see variable (quantitative v., semi-quantitative v.)
- qualitative v., 6, 26, 47, 162, 174, 312, 436, 446, 479, 522, 523, 525, 535, 537, 542-546
- quantification of a v., 163, 166, 169, 381, 397
- quantitative v., 5, 26, 162, 174, 312, 433, 522, 523, 525, 537, 542-544
- semi-quantitative (ordinal, rank-ordered) v., 6, 162, 174, 312, 479, 486, 523, 525, 535, 537, 542-546
- standardization, 197
- state attribute (see also variable, binary v.), 312
- summary variable, 521
- target of a v., 162
- transformation of v., 7, 38, 102, 160, 163, 166, 192, 381, 397, 401, 471, 486, 537, 542-544
- unordered s-state v.: see variable (qualitative v.)
variance, analysis of (ANOVA): see analysis
variate: see variable
vegetation, 5, 18, 30, 34, 50, 164, 191, 204, 303, 322, 340, 343, 407, 409, 413, 424, 439, 446, 447, 455, 529-558, 559, 564-566
- nitrogen treatments of grass, 174
viscosity, 338, 345, 346
ways (definition of number of w., in a model or a method), 79
weather, 255
Weber problem (generalized), 199
weighting, 75, 143, 148, 474
working groups, x, 467-
worms: see Polychaetes
zooplankton, 165, 168, 171, 173, 179, 292, 359, 360, 479