
METHODS OF ENVIRONMENTAL

DATA ANALYSIS
Environmental Management Series

Edited by

Prof. J. Cairns, Jr, University Center for Environmental and Hazardous


Materials Studies, Virginia Polytechnic Institute, USA

and

Prof. R.M. Harrison, Institute of Public and Environmental Health,


University of Birmingham, UK

This series has been established to meet the need for a set of in-depth volumes
dealing with environmental issues, particularly with regard to a sustainable future.
The series provides a uniform and quality coverage, building up to form a library
of reference books spanning major topics within this diverse field.

The level of presentation is advanced, aimed primarily at a research/consultancy
readership. Coverage includes all aspects of environmental science and engineering
relevant to evaluation and management of the natural and human-modified
environment, as well as topics dealing with the political, economic, legal and social
considerations pertaining to environmental management.

Previously published titles in the Series include:

Biomonitoring of Trace Aquatic Contaminants


D.J.H. Phillips and P.S. Rainbow (1993, reprinted 1994)

Global Atmospheric Chemical Change


C.N. Hewitt and W.T. Sturges (eds) (1993, reprinted 1995)

Atmospheric Acidity: Sources, Consequences and Abatement


M. Radojevic and R.M. Harrison (eds) (1992)

Methods of Environmental Data Analysis


C. N. Hewitt (ed.) (1992, reprinted 1995)

Please contact the Publisher or one of the Series' Editors if you would
like to contribute to the Series.

Dr R.C.J. Carling, Senior Editor, Environmental Sciences, Chapman & Hall,
2-6 Boundary Row, London SE1 8HN, UK (email: bob.carling@chall.co.uk)

Prof. Roy Harrison, The Institute of Public and Environmental Health,
School of Chemistry, University of Birmingham, Edgbaston, B15 2TT, UK
(email: r.m.harrison.ipe@bham.ac.uk)

Prof. John Cairns, Jr, University Center for Environmental and Hazardous
Materials Studies, Virginia Polytechnic Institute and State University,
Blacksburg, Virginia 24061-0414, USA (email: cairnsb@vt.edu)
METHODS OF
ENVIRONMENTAL
DATA ANALYSIS

Edited by

C.N. HEWITT
INSTITUTE OF ENVIRONMENTAL & BIOLOGICAL
SCIENCES, LANCASTER UNIVERSITY,
LANCASTER LA1 4YQ, UK

CHAPMAN & HALL


London · Glasgow · Weinheim · New York · Tokyo · Melbourne · Madras
Published by Chapman & Hall, 2-6 Boundary Row, London SE1 8HN

Chapman & Hall, 2-6 Boundary Row, London SE1 8HN, UK


Blackie Academic & Professional, Wester Cleddens Road, Bishopbriggs,
Glasgow G64 2NZ, UK
Chapman & Hall GmbH, Pappelallee 3, 69469 Weinheim, Germany
Chapman & Hall USA, 115 Fifth Avenue, New York, NY 10003, USA

Chapman & Hall Japan, ITP-Japan, Kyowa Building, 3F, 2-2-1


Hirakawacho, Chiyoda-ku, Tokyo 102, Japan

Chapman & Hall Australia, 102 Dodds Street, South Melbourne,


Victoria 3205, Australia
Chapman & Hall India, R. Seshadri, 32 Second Main Road, CIT East,
Madras 600 035, India

First published by Elsevier Science Publishers Ltd 1992

© 1992 Chapman & Hall

Typeset by Alden Multimedia Ltd, Northampton

ISBN 0 412 73990 9

Apart from any fair dealing for the purposes of research or private
study, or criticism or review, as permitted under the UK Copyright
Designs and Patents Act, 1988, this publication may not be
reproduced, stored, or transmitted, in any form or by any means,
without the prior permission in writing of the publishers, or in the case
of reprographic reproduction only in accordance with the terms of the
licences issued by the Copyright Licensing Agency in the UK, or in
accordance with the terms of licences issued by the appropriate
Reproduction Rights Organization outside the UK. Enquiries concerning
reproduction outside the terms stated here should be sent to the
publishers at the London address printed on this page.
The publisher makes no representation, express or implied, with
regard to the accuracy of the information contained in this book and
cannot accept any legal responsibility or liability for any errors or
omissions that may be made.

A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data available

Special regulations for readers in the USA

This publication has been registered with the Copyright Clearance


Center Inc. (CCC), Salem, Massachusetts. Information can be obtained
from the CCC about conditions under which photocopies of parts of
this publication may be made in the USA. All other copyright questions,
including photocopying outside the USA, should be referred to the
publisher.
Foreword

ENVIRONMENTAL MANAGEMENT SERIES

The current expansion of both public and scientific interest in environ-


mental issues has not been accompanied by a commensurate production
of adequate books, and those which are available are widely variable in
approach and depth.
The Environmental Management Series has been established with a view
to co-ordinating a series of volumes dealing with each topic within the field
in some depth. It is hoped that this Series will provide a uniform and
quality coverage and that, over a period of years, it will build up to form
a library of reference books covering most of the major topics within this
diverse field. It is envisaged that the books will be of single, or dual
authorship, or edited volumes as appropriate for respective topics.
The level of presentation will be advanced, the books being aimed
primarily at a research/consultancy readership. The coverage will include
all aspects of environmental science and engineering pertinent to manage-
ment and monitoring of the natural and man-modified environment, as
well as topics dealing with the political, economic, legal and social con-
siderations pertaining to environmental management.

J. CAIRNS JNR and R.M. HARRISON


Preface

In recent years there has been a dramatic increase in public interest and
concern for the welfare of the planet and in our desire and need to
understand its workings. The commensurate expansion in activity in the
environmental sciences has led to a huge increase in the amount of data
gathered on a wide range of environmental parameters. The arrival of
personal computers in the analytical laboratory, the increasing automation
of sampling and analytical devices and the rapid adoption of remote
sensing techniques have all aided in this process. Many laboratories and
individual scientists now generate thousands of data points every month
or year.
The assimilation of data of any given variable, whether they be straight-
forward, as for example, the annual average concentrations of a pollutant
in a single city, or more complex, say spatial and temporal variations of
a wide range of physical and chemical parameters at a large number of
sites, is itself not useful. Raw numbers convey very little readily assimilated
information: it is only when they are analysed, tabulated, displayed and
presented that they can serve the scientific and management functions for
which they were collected.
This book aims to aid the active environmental scientist in the process
of turning raw data into comprehensible, visually intelligible and useful
information. Basic descriptive statistical techniques are first covered, with
univariate methods of time series analysis (of much current importance as
the implications of increasing carbon dioxide and other trace gas concen-
trations in the atmosphere are grappled with), regression, correlation and
multivariate factor analysis following. Methods of analysing and deter-
mining errors and detection limits are covered in detail, as are graphical
methods of exploratory data analysis and the visual representation of


data. The final chapter describes in detail the management procedures


necessary to ensure the quality and integrity of environmental chemical
data. Numerous examples are used to illustrate the way in which particular
techniques can be used.
The authors of these chapters have been selected to ensure that an
authoritative account of each topic is given. I sincerely hope that a wide
range of readers, including undergraduates, researchers, policy makers
and administrators, will find the book useful and that it will help scientists
produce information, not just numbers.

NICK HEWITT
Lancaster
Contents

Foreword                                                              v
Preface                                                             vii
List of Contributors                                                 xi

Chapter 1  Descriptive Statistical Techniques
           A.C. BAJPAI, I.M. CALUS and J.A. FAIRLEY                   1
Chapter 2  Environmetric Methods of Nonstationary Time-Series
           Analysis: Univariate Methods
           P.C. YOUNG and T. YOUNG                                   37
Chapter 3  Regression and Correlation
           A.C. DAVISON                                              79
Chapter 4  Factor and Correlation Analysis of Multivariate
           Environmental Data
           P.K. HOPKE                                               139
Chapter 5  Errors and Detection Limits
           M.J. ADAMS                                               181
Chapter 6  Visual Representation of Data Including Graphical
           Exploratory Data Analysis
           J.M. THOMPSON                                            213
Chapter 7  Quality Assurance for Environmental Assessment
           Activities
           A.A. LIABASTRE, K.A. CARLBERG and M.S. MILLER            259
Index                                                               301
List of Contributors

M.J. ADAMS
School of Applied Sciences, Wolverhampton Polytechnic, Wulfruna
Street, Wolverhampton WV1 1SB, UK
A.C. BAJPAI
Department of Mathematical Sciences, Loughborough University of
Technology, Loughborough LE11 3TU, UK
I.M. CALUS
72 Westfield Drive, Loughborough LE11 3QL, UK
K.A. CARLBERG
29 Hoffman Place, Belle Mead, New Jersey 08502, USA
A.C. DAVISON
Department of Statistics, University of Oxford, 1 South Parks Road,
Oxford OX1 3TG, UK
J.A. FAIRLEY
Department of Mathematical Sciences, Loughborough University of
Technology, Loughborough LE11 3TU, UK
A.A. LIABASTRE
Environmental Laboratory Division, US Army Environmental Hygiene
Activity-South, Building 180, Fort McPherson, Georgia 30330-5000,
USA
M.S. MILLER
Automated Compliance Systems, 673 Emory Valley Road, Oak Ridge,
Tennessee 37830, USA


J.M. THOMPSON
Department of Biomedical Engineering and Medical Physics,
University of Keele Hospital Centre, Thornburrow Drive, Hartshill,
Stoke-on-Trent, Staffordshire ST4 7QB, UK. Present address:
Department of Medical Physics and Biomedical Engineering, Queen
Elizabeth Hospital, Birmingham B15 2TH, UK
P.C. YOUNG
Institute of Environmental and Biological Sciences, Lancaster
University, Lancaster, Lancashire LA1 4YQ, UK
T. YOUNG
Institute of Environmental and Biological Sciences, Lancaster
University, Lancaster, Lancashire LA1 4YQ, UK. Present address:
Maths Techniques Group, Bank of England, Threadneedle Street,
London EC2R 8AH, UK
Chapter 1

Descriptive Statistical
Techniques
A.C. BAJPAI,ᵃ IRENE M. CALUSᵇ and J.A. FAIRLEYᵃ
ᵃ Department of Mathematical Sciences, Loughborough University of
Technology, Loughborough, Leicestershire LE11 3TU, UK; ᵇ 72 Westfield
Drive, Loughborough, Leicestershire LE11 3QL, UK

1 RANDOM VARIATION

The air quality in a city in terms of, say, the level of sulphur dioxide
present, cannot be adequately assessed by a single measurement. This is
because air pollutant concentrations in the city do not have a fixed value
but vary from one place to another. They also vary with respect to time.
Similar considerations apply in the assessment of water quality in a river
in terms of, say, the level of nitrogen or number of faecal coliforms
present, or in assessing the activity of a radioactive pollutant. In such
situations, while it may be that some of the variation can be attributed to
known causes, there still remains a residual component which cannot be
fully explained or controlled and must be regarded as a matter of chance.
It is this random variation that explains why, for instance, two samples of
water, of equal volume, taken at the same point on the river at the same
time give different coliform counts, and why, in the case of a radioactive
source, the number of disintegrations in, say, a 1-min time interval varies
from one interval to another.
Random variation may be caused, wholly or in part, by errors in
measurement or it may simply be inherent in the nature of the variable
under consideration. When a die is cast, no error is involved in counting
the number of dots on the uppermost face. The score is affected by a
multitude of factors-the force with which the die is thrown, the angle at
which it is thrown, etc.-which combine to produce the end result.

At the other extreme, variation in the results of repeated determinations


of nitrogen concentrations in the same sample of water must be entirely
due to error. Just as the die is never thrown in exactly the same way on
successive occasions, so repeated determinations are not exact repetitions
even though made under apparently identical conditions. Lee & Lee¹ point
out that in a laboratory there may be slight changes in room temperature,
pressure or humidity, fluctuations in the mains electricity supply, or
variation in the level to which a pipette or standard flask is filled. In a
titrimetric analysis, the two burette readings and the judging of the end
point are amongst possible sources of variation. All such factors, of which
the observer is unaware, combine to produce the random error which is
causing the random variation.
Between the two extremes is the situation where random error partly,
but not wholly, explains the variation. An example of this would be given
by determinations of nitrogen made on a number of samples of water
taken at the same time at a gauging station. While random error would
contribute to the variation in results, there would also be sample-to-
sample variation in the actual amount of nitrogen present, as river water
is unlikely to be perfectly homogeneous.
The score when a die is thrown and the observation made on the amount
of nitrogen present in a water sample are both examples of a random
variable or variate, but are of two different types. The distinction between
these two types is relevant when a probability model is used to describe the
pattern of variation. When the variable can take only certain specific
values, and not intermediate ones, it is said to be discrete. This would
apply if it is the result of a counting process, where only integer values can
result. Thus the score when a die is thrown, the number of emissions from
a radioactive source in a 30-s interval and the number of coliforms in 1 ml
of water are all examples of a discrete variate. When a variable can take
any value within the range covered it is said to be continuous. Values of a
variable of this kind are obtained by measurement along a continuous
scale as, for instance, when dealing with length, mass, volume or time.
Measurements of the level of nitrogen in water, lead in blood or sulphur
dioxide in air fall into this category.
In situations such as those which have been described here, a single
observation would be inadequate for providing the information required.
Hence sets of data must be dealt with and the remainder of this chapter
will be devoted to ways of presenting and summarising them.

2 TABULAR PRESENTATION

2.1 The frequency table


A mass of numerical data gives only a confused impression of the situa-
tion. Reorganising it into a table can make it more informative, as illu-
strated by the following examples.

Example 1. A 'colony' method for counting bacteria in liquids entails,


as a first step, the dilution with sterile water of the liquid under examina-
tion. Then 1 ml of diluted liquid is placed in a nutrient medium in a dish
and incubated. The colonies of bacteria which have formed by the end of
the incubation period are then counted. This gives the number of bacteria
originally present, assuming that each colony has grown from a single
bacterium. Recording the number of colonies produced in each of 40
dishes might give the results shown here.
2 3 4 3 0 3 2 4
2 2 0 2 3 2 5 3 2
4 0 2 4 5 2 0

4 2 2 2 0 3 3
In Table 1 the data are presented as a frequency table. It shows the
number of dishes with no colony, the number with one colony, and so on.
To form this table you will probably find it easiest to work your way
systematically through the data, recording each observation in its appro-
priate category by a tally mark, as shown. In fact, these particular data
may well be recorded in this way in the first place.
The variate here is 'number of colonies' and the column headed 'number
of dishes' shows the frequency with which each value of the variate

TABLE 1
Frequency table for colony counts

Number of colonies    Tally              Number of dishes

0                     ||||/                      5
1                     ||||/ ||||                 9
2                     ||||/ ||||/ ||            12
3                     ||||/ ||                   7
4                     ||||/                      5
5                     ||                         2
                                                40

occurred. Sometimes it is useful to give the frequencies as a proportion of


the total, i.e. to give the relative frequency. Corresponding to 0, 1, 2, ...
colonies, the relative frequencies would be respectively 5/40, 9/40, 12/40, ...
Using f to denote the frequency with which a value x of the variate
occurred, the corresponding relative frequency is f/N, where N is the total
number of observations.
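For readers who prefer to let a computer do the tallying, a minimal Python sketch follows (standard library only; variable names are illustrative). It rebuilds Table 1 from a list of raw counts and appends the relative frequency f/N for each value; the list is generated here from the Table 1 frequencies rather than typed in the order the dishes were examined.

```python
from collections import Counter

# The 40 colony counts of Example 1, reconstructed from the Table 1
# frequencies: five 0s, nine 1s, twelve 2s, seven 3s, five 4s, two 5s.
observations = [0] * 5 + [1] * 9 + [2] * 12 + [3] * 7 + [4] * 5 + [5] * 2

tally = Counter(observations)        # frequency f of each variate value x
N = sum(tally.values())              # total number of observations

print("x   f    f/N")
for x in sorted(tally):
    f = tally[x]
    print(f"{x}  {f:2d}  {f / N:.3f}")
print("Total", N)
```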
Example 2. Random error in measurement, of which mention has
already been made, would be the cause of the variation in the results,
shown here, of 40 replicate determinations of nitrate ion concentration (in
µg/ml) in a water specimen.
0·49 0·45 0·48 0·48 0·49 0·48 0·46 0·48 0·51 0·52
0·51 0·49 0·50 0·49 0·47 0·48 0·50 0·50 0·47 0·50
0·50 0·51 0·48 0·50 0·50 0·47 0·50 0·48 0·49 0·47
0·52 0·50 0·51 0·49 0·48 0·51 0·46 0·49 0·50 0·49
In Example 1 we were dealing with a discrete variate, which could take
only integer values. The variate in this example has the appearance of
being discrete, in that it only takes values 0·45, 0·46, 0·47, etc., and not
values in between, but this is because recordings have been made to only
2 decimal places. Nitrate ion concentration is measured on a continuous
scale and should therefore be regarded as a continuous variate. The
formation of a frequency table follows along the same lines as in Example
1. Table 2 shows the frequency distribution obtained.
The variation in this case is entirely attributable to random error and

TABLE 2
Frequency table for nitrate ion concentration
measurements

Nitrate ion concentration    Frequency
(µg/ml)

0·45                              1
0·46                              2
0·47                              4
0·48                              8
0·49                              8
0·50                             10
0·51                              5
0·52                              2
                                 40

only 8 different values were taken by the variate. If, however, the deter-
minations had been made on different water specimens, there would have
been a greater amount of variation. Giving the frequency corresponding
to each value taken by the variate would then make the table too unwieldy
and grouping becomes advisable. This is illustrated by the next example,
where a similar situation exists.

Example 3. The lead concentrations (in µg/m³) shown here represent
recordings made on 50 weekday afternoons at an air monitoring station
near a US freeway.

5·4  6·8  6·1 10·6  7·0  5·2  4·9  6·5  8·3  7·1
6·0  5·0  7·8  5·9  6·0  8·7  6·0  6·2  6·0 10·1
6·4  7·2  6·4  6·4  8·0  8·3  8·0  8·1  9·9  6·8
5·3  2·1  7·2  7·6  7·3  3·9 10·9  6·1  6·8  9·3
5·0  9·2  7·9  8·6  3·2  6·9  8·6  9·5 11·2  6·4
The smallest value is 2·1 and the largest 11·2. It is desirable that groups of
equal width should be chosen. In Table 3, values from 2·0 up to and
including 2·9 have been placed in the first group, values from 3·0 up to and
including 3·9 in the second group, and so on. The data then being covered
by 10 groups, a table of reasonable size is obtained. If the number of
groups is too small, too much information is lost. Too many groups would

TABLE 3
Frequency table for lead concentration measurements

Lead concentration    True group       Frequency
(µg/m³)               boundaries

2·0-2·9               1·95-2·95            1
3·0-3·9               2·95-3·95            2
4·0-4·9               3·95-4·95            1
5·0-5·9               4·95-5·95            6
6·0-6·9               5·95-6·95           16
7·0-7·9               6·95-7·95            8
8·0-8·9               7·95-8·95            8
9·0-9·9               8·95-9·95            4
10·0-10·9             9·95-10·95           3
11·0-11·9             10·95-11·95          1
                                          50

mean the retention of too much detail and show little improvement on the
original data.
As with the measurements of nitrate ion concentration in Example 2, we
are here dealing with a variate which is, in essence, continuous. In this
case, readings were recorded to 1 decimal place. Thus 2·9 represents a
value between 2·85 and 2·95, 3·0 represents a value between 2·95 and 3·05,
and so on. Hence there is not really a gap between the upper end of one
group and the lower end of the next. The true boundaries of the first group
are 1·95 and 2·95, of the next 2·95 and 3·95, and so on, as shown in Table
3. Notice that no observation falls on these boundaries and hence no
ambiguity arises when allocating an observation to a group. If the boun-
daries were 2·0, 3·0, 4·0, etc., then the problem would arise of whether 3·0,
say, should be allocated to the group 2·0-3·0 or to the group 3·0-4·0.
Various conventions exist for dealing with this problem but it can be
avoided by a judicious choice of boundaries, as seen here.
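The grouping of Example 3 can be reproduced numerically with the short sketch below (Python with NumPy assumed available). Using the true group boundaries 1·95, 2·95, ..., 11·95 ensures, as noted above, that no reading falls on a boundary.

```python
import numpy as np

# The 50 lead concentrations (µg/m³) of Example 3.
lead = [5.4, 6.8, 6.1, 10.6, 7.0, 5.2, 4.9, 6.5, 8.3, 7.1,
        6.0, 5.0, 7.8, 5.9, 6.0, 8.7, 6.0, 6.2, 6.0, 10.1,
        6.4, 7.2, 6.4, 6.4, 8.0, 8.3, 8.0, 8.1, 9.9, 6.8,
        5.3, 2.1, 7.2, 7.6, 7.3, 3.9, 10.9, 6.1, 6.8, 9.3,
        5.0, 9.2, 7.9, 8.6, 3.2, 6.9, 8.6, 9.5, 11.2, 6.4]

boundaries = np.arange(1.95, 12.05, 1.0)           # 1.95, 2.95, ..., 11.95
freq, edges = np.histogram(lead, bins=boundaries)  # frequency in each group

for lower, upper, f in zip(edges[:-1], edges[1:], freq):
    print(f"{lower:5.2f}-{upper:5.2f}  {f:2d}")
print("Total", int(freq.sum()))
```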
2.2 Table of cumulative frequencies
It may be of interest to know on how many days the lead concentration
was below a stated level. From Table 3 it is readily seen that there were
no observations below 1·95, 1 below 2·95, 3 (= 1 + 2) below 3·95,
4 (= 1 + 2 + 1) below 4·95 and so on. The complete set of cumulative
frequencies thus obtained is shown in Table 4.
In the case of a frequency table, it has been mentioned that it may be
more useful to think in terms of the relative frequency, i.e. the proportion
of observations falling into each category. Similar considerations apply in

TABLE 4
Table of cumulative frequencies for data in Table 3

Lead concentration    Cumulative    % Cumulative
(µg/m³)               frequency     frequency

1·95                    0               0
2·95                    1               2
3·95                    3               6
4·95                    4               8
5·95                   10              20
6·95                   26              52
7·95                   34              68
8·95                   42              84
9·95                   46              92
10·95                  49              98
11·95                  50             100

the case of cumulative frequencies. Thus, in the present example, it may


be more useful to know the proportion of days on which the recorded lead
concentration was below a stated level. In Table 4 this is given as a
percentage, a common practice with cumulative frequencies.
We have considered here the 'less than' cumulative frequencies as this
is more customary but the situation could obviously be looked at from an
opposite point of view. For example, saying that there were 10 days with
recordings below 5·95 µg/m³ is equivalent to saying that there were 40 days
when recordings exceeded 5·95 µg/m³. In percentage terms, a 20% 'less
than' cumulative frequency corresponds to an 80% 'more than' cumula-
tive frequency.
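The cumulative and percentage cumulative frequencies of Table 4 follow directly from the group frequencies, as the sketch below shows (NumPy assumed; the frequencies are typed in from Table 3).

```python
import numpy as np

freq  = np.array([1, 2, 1, 6, 16, 8, 8, 4, 3, 1])   # Table 3 frequencies
upper = np.arange(2.95, 12.05, 1.0)                 # upper group boundaries

cum = np.cumsum(freq)                 # 'less than' cumulative frequencies
pct = 100 * cum / cum[-1]             # percentage cumulative frequencies

for b, c, p in zip(upper, cum, pct):
    print(f"below {b:5.2f}: {c:2d}  ({p:.0f}%)")
```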

3 DIAGRAMMATIC PRESENTATION

For some people a diagram conveys more than a table of figures, and ways
of presenting data in this form will now be considered.

3.1 Dot diagram


In a dot diagram, each observation is marked by a dot on a horizontal axis
which represents the scale of measurement. An example will show how this
is done.

Example 4. Values of biochemical oxygen demand (BOD), which is a


measure of biodegradable organic matter, were recorded on water spe-
cimens taken throughout 1987 and 1988 from the River Clyde at Station
12A (Addersgill) downstream of Carbarns sewage treatment works. The
results (in mg/litre) were as follows:
1987: 3·2 2·9 2·1 4·3 2·9 3·8 4·6 2·4 4·5 3·9 4·5 2·0 4·2
1988: 3·7 2·8 2·6 2·9 3·8 5·1 3·2 5·0 2·3 2·8 2·2 2·8 2·2
The dot diagram in Fig. 1(a) displays the 1988 data. Although 5·0 and 5·1
appear as outlying values, they seem less exceptional in Fig. 1(b) where the
combined results for the two years are displayed. Obviously this very
simple form of diagram would not be suitable when the number of
observations is much larger.

3.2 Line diagram


The frequency distribution in Table 1 can be displayed as shown in
Fig. 2. The variate (number of colonies) is shown along the base axis and
the heights of the lines (or bars) represent the frequencies. This type of
diagram is appropriate here as it emphasises the discrete nature of the
variate.

Fig. 1. Dot diagrams for data recorded at Station 12A on River Clyde:
(a) 1988; (b) 1987 and 1988.

Fig. 2. Line diagram for data in Table 1.
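A line diagram such as Fig. 2 can be drawn in a few lines of Python (matplotlib assumed installed); the frequencies are those of Table 1.

```python
import matplotlib.pyplot as plt

colonies = [0, 1, 2, 3, 4, 5]      # values of the discrete variate
dishes   = [5, 9, 12, 7, 5, 2]     # frequencies from Table 1

# Vertical lines whose heights represent the frequencies.
plt.stem(colonies, dishes)
plt.xlabel("Number of Colonies")
plt.ylabel("Number of Dishes")
plt.show()
```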

3.3 Histogram
For data in which the variate is of the continuous type, as in Examples 2
and 3, the frequency distribution can be displayed as a histogram. Each
frequency is represented by the area of a rectangle whose base, on a
horizontal scale representing the variate, extends from the lower to the
upper boundary of the group. Taking the distribution in Table 3 as an
example, the base of the first rectangle should extend from 1·95 to 2·95, the
next from 2·95 to 3·95, and so on, with no gaps between the rectangles, as
shown in Fig. 3.
In the case of the distribution of nitrate ion concentration readings in
Table 2, 0·45 is considered as representing a value between 0·445 and
0·455, 0·46 as representing a value between 0·455 and 0·465, and so on.
Thus, when the histogram is drawn, the bases of the rectangles would have
these boundaries as their end points. Again, there must be continuous
coverage of the base scale representing nitrate ion concentration.

Fig. 3. Histogram for frequency distribution in Table 3.
It must be emphasised that it is really the areas of the rectangles that
represent the frequencies, and not, as is often thought, the heights. There
are two reasons why it is important to make this clear. Firstly, the heights
are proportional to the frequencies only when the groups are all of equal
width. That is the most common situation and one therefore becomes
accustomed to measuring the heights on a scale representing frequency
when drawing a histogram. Thus in Fig. 3, after choice of a suitable
vertical scale, heights of 1, 2, 1, 6, ... units were measured off when
constructing the rectangles. However, occasionally tables are encountered
in which the group widths are not all equal (e.g. wider intervals may have
been used at the ends of the distribution where observations are sparse).
In such cases, the bases of the rectangles will not be of equal width and the
heights will not therefore be proportional to the frequencies if the histo-
gram is drawn correctly. Suppose, for instance, that in Table 3 the final
group had extended from 9·95 to 11·95. Its frequency would then have
been 4, the same as for the adjacent group (8·95 to 9·95). For the rectangles
representing these equal frequencies to have equal areas, they cannot be of
equal height, as Fig. 4 shows.
The second reason is that if one proceeds to the stage of using a
probability model to describe the distribution of a continuous variate, it
is an area, not a height, that represents relative frequency and therefore,
also, frequency. The true interpretation of the height of a rectangle in the
histogram is that it represents frequency density, i.e. frequency per unit
along the horizontal axis.

Fig. 4. Construction of histogram where groups are of unequal width.
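The point that area, not height, carries the frequency is easily respected in software by plotting frequency density. The sketch below (matplotlib assumed installed) draws the situation of Fig. 4: the Table 3 groups with the final group widened to 9·95-11·95 and frequency 4, so the wider bar comes out lower than its equally frequent neighbour although their areas are equal.

```python
import matplotlib.pyplot as plt

# Group boundaries and frequencies, with the last Table 3 group widened
# to 9.95-11.95 (frequency 4), as in the discussion around Fig. 4.
edges = [1.95, 2.95, 3.95, 4.95, 5.95, 6.95, 7.95, 8.95, 9.95, 11.95]
freq  = [1, 2, 1, 6, 16, 8, 8, 4, 4]

widths  = [b - a for a, b in zip(edges[:-1], edges[1:])]
density = [f / w for f, w in zip(freq, widths)]   # frequency per µg/m³

plt.bar(edges[:-1], density, width=widths, align="edge", edgecolor="black")
plt.xlabel("Lead Concentration (µg/m³)")
plt.ylabel("Frequency density")
plt.show()
```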

3.4 Cumulative frequency graph


A distribution of cumulative frequencies can be displayed diagrammatic-
ally by plotting a graph with relative cumulative frequency (usually given
as a percentage) on the vertical scale and the variate on the horizontal
scale. Figure 5 shows the plot for the distribution in Table 4. The vertical
scale represents the percentage of days when the lead concentration was
below the value given on the horizontal scale. Thus, for instance, 2% on
the vertical scale corresponds to 2·95 µg/m³ on the horizontal scale.
The term 'ogive' is often given to a cumulative frequency graph, though
it properly applies only in the case when the frequency distribution is
symmetrical with a central peak and the cumulative frequency curve then
has the shape of an elongated 'S'.

Fig. 5. Plot of percentage cumulative frequencies in Table 4.



The value on the horizontal scale corresponding to P% on the vertical


scale is the Pth percentile of the distribution. Thus, in the present example,
5·95 µg/m³ is the 20th percentile, i.e. 20% of recordings gave lead con-
centrations below 5·95 µg/m³. The Global Environment Monitoring
System (GEMS)² report on the results of health-related environmental
monitoring gives 90th percentile values for various aspects of the water
quality of rivers. For example, for the 190 rivers monitored for biochemi-
cal oxygen demand (BOD) the 90th percentile was 6·5 mg/litre. Or, to put
it another way, for 10% of rivers, i.e. 19 of them, the level of BOD
exceeded 6·5 mg/litre. The levels of pesticide residue and industrial chemi-
cals in foods are also described by showing the 90th percentile in each case.
For instance, for DDT in meat the 90th percentile value was 100 µg/kg and
thus in 10% of participating countries this level was exceeded.
World Health Organization (WHO) guidelines established for urban air
quality (e.g. levels of sulphur dioxide and suspended particulate matter)
are expressed in terms of the 98th percentile, meaning that this level should
not be exceeded more than 2% of the time or approximately seven days
in a year. A GEMS³ report on air quality in urban areas presents fre-
quency distributions for data in terms of the 10th, 20th, ... and 90th
percentiles (sometimes referred to as deciles). This form of presentation
looks at cumulative frequencies from the reverse viewpoint to that used in
Table 4. There, percentage cumulative frequencies were given for selected
values of the variate, whereas the GEMS report gives values of the variate
corresponding to selected percentage cumulative frequencies.
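Reading a percentile off the cumulative frequency curve can be imitated numerically by linear interpolation along the Table 4 values, as in the following sketch (NumPy assumed); the 20th percentile comes out at 5·95 µg/m³, as quoted above.

```python
import numpy as np

boundaries = np.arange(1.95, 12.05, 1.0)      # 1.95, 2.95, ..., 11.95
pct_cum = np.array([0, 2, 6, 8, 20, 52, 68, 84, 92, 98, 100])  # Table 4

def percentile(p):
    """Lead concentration below which p% of the recordings fall."""
    return np.interp(p, pct_cum, boundaries)

for p in (20, 50, 90):
    print(f"{p}th percentile: {percentile(p):.2f} µg/m³")
```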

4 MEASURES OF LOCATION (TYPICAL VALUE)

We shall now look at ways of summarising a set of data by calculating


numerical values which measure various aspects of it. The first measures
to be considered will be ones which aim to give a general indication of the
size of values taken by the variate. These are sometimes termed 'measures
of location' because they are intended to give an indication of where the
values of the variate are located on the scale of measurement. To illustrate
this idea, let us suppose that, over a period of 24 h, recordings were made
each hour of the carbon monoxide (CO) concentration (in parts per
million) at a point near a motorway. The first 12 results, labelled 'day',
were obtained in the period beginning at 08.00 h. The second 12 results,
labelled 'night', were obtained in the remaining part of the period, i.e.
beginning at 20.00 h.

Day:   5·8 6·9 6·7 6·7 6·3 5·8 5·5 6·1 6·8 7·0 7·4 6·4
Night: 5·0 3·8 3·5 3·3 3·1 2·4 1·8 1·5 1·3 1·3 2·0 3·4

Fig. 6. Dot diagrams showing recordings of CO concentration.
From a glance at the two sets of data it is seen that higher values were
recorded during the daytime period. (This is not surprising because the
density of traffic, which one would expect to have an effect on CO con-
centration, is higher during the day.) This difference between the two sets
of data is highlighted by the two dot diagrams shown in Fig. 6.

4.1 The arithmetic mean

4. 1. 1 Definition
One way of indicating where each set of data is located on the scale of
measurement is to calculate the arithmetic mean. This is the measure of
location in most common use. Often it is simply referred to as the mean,
as will sometimes be done in this chapter. There are other types of mean,
e.g. the geometric mean of which mention will be made later. However, in
that case the full title is usually given so there should be no misunderstand-
ing. In common parlance the term 'average' is also used for the arithmetic
mean although, strictly speaking, it applies to any measure of location:
Arithmetic mean = Sum of all observations / Total number of observations     (1)

Applying eqn (1) to the daytime recordings gives the mean CO concen-
tration as

(5·8 + 6·9 + ... + 7·4 + 6·4)/12 = 77·4/12 = 6·45 ppm

Similarly, for the night recordings the mean is 32·4/12, i.e. 2·7 ppm.

Generalising, if x₁, x₂, x₃, ..., xₙ are n values taken by a variate x,
their mean x̄ is given by

x̄ = (x₁ + x₂ + ... + xₙ)/n

Using Σ to denote 'the sum from i = 1 to i = n of', this can be written
as

x̄ = (1/n) Σxᵢ     (2)

Where there is no possibility of misunderstanding, the use of the suffix i
on the right-hand side can be dropped and the sum simply written as Σx.

Example 5. An important aspect of solid waste is its physical composi-


tion. Knowledge of this is required, for instance, in the design and opera-
tion of a municipal incinerator or in the effective use of landfilling for
disposal. An estimate of the composition can be obtained by manually
separating portions of the waste into a number of physical categories and
calculating the percentage of waste in each category. Let us suppose that
in a preliminary study 6 portions, each weighing approximately 150 kg, are
taken from waste arriving at a disposal site. Worker A separates 4 portions
and reports the mean percentage (by weight) in the food waste category as
8·8. The corresponding 2 percentages obtained by Worker B in the separa-
tion of the remaining portions have a mean of 5·2. In order to obtain, for
all 6 portions, the mean percentage in the food waste category, we return
to the basic definition in eqn (1).
For Worker A, sum of observations = 4 × 8·8 = 35·2
For Worker B, sum of observations = 2 × 5·2 = 10·4
Combined sum of observations = 35·2 + 10·4 = 45·6
Combined number of observations = 4 + 2 = 6
For all 6 observations, mean percentage = 45·6/6 = 7·6

Note that the same answer would not be obtained by taking the mean of
8·8 and 5·2, as this would not take account of these two means being based
on different numbers of observations. The evaluation of the overall mean
was given by

(4 × 8·8 + 2 × 5·2)/(4 + 2)
and in this form you can see how 8·8 and 5·2 are weighted in accordance
with the number of observations on which they are based.
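Both calculations of this section are a single line each in Python (standard library only), as the sketch below shows for the CO recordings and for the weighted combination of the two workers' means in Example 5.

```python
from statistics import mean

day   = [5.8, 6.9, 6.7, 6.7, 6.3, 5.8, 5.5, 6.1, 6.8, 7.0, 7.4, 6.4]
night = [5.0, 3.8, 3.5, 3.3, 3.1, 2.4, 1.8, 1.5, 1.3, 1.3, 2.0, 3.4]
print(mean(day), mean(night))        # 6.45 ppm and 2.7 ppm, as in the text

# Example 5: combine the two means by weighting each with the number
# of observations on which it is based.
n_a, mean_a = 4, 8.8                 # Worker A
n_b, mean_b = 2, 5.2                 # Worker B
print((n_a * mean_a + n_b * mean_b) / (n_a + n_b))   # 7.6
```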

4.1.2 Properties of the arithmetic mean


Two properties of the arithmetic mean may now be noted. An apprecia-
tion of them will lead to a better understanding of procedures met with
later on. The first concerns the values of x − x̄, i.e. the deviations of
individual observations from their mean. As an example, consider the 4
values of x given here. Their mean, x̄, is 4. Hence the deviations are as
shown, and it will be obvious that their sum is zero, the negative and
positive deviations cancelling each other out.

x        1    5    8    2
x − x̄   −3    1    4   −2
Try it with any other set of numbers and you will find that the same thing
happens. A little simple algebra shows why this must be so. For n
observations x₁, x₂, ..., xₙ,

Σ(x − x̄) = (x₁ − x̄) + (x₂ − x̄) + ... + (xₙ − x̄)
         = (x₁ + x₂ + ... + xₙ) − nx̄
         = nx̄ − nx̄
         = 0
the arithmetic mean is always zero.
The second property concerns the values of the squares of the deviations
from the mean, i.e. (x − x̄)². Continuing with the same set of 4 values of
x, squaring the deviations and summing gives

Σ(x − x̄)² = (−3)² + 1² + 4² + (−2)² = 30

Now let us look at the sum of squares of deviations from a value other
than x̄. Choosing 2, say, gives

Σ(x − 2)² = (1 − 2)² + (5 − 2)² + (8 − 2)² + (2 − 2)² = 46

Choosing 10 gives

Σ(x − 10)² = (1 − 10)² + (5 − 10)² + (8 − 10)² + (2 − 10)² = 174
Both these sums are larger than 30, as also would have been the case if
numbers other than 2 and 10 had been chosen. This illustrates the general
result that the sum of the squares of the deviations from the arithmetic mean
is less than the sum of the squares of the deviations taken from any other

value. An algebraic proof of this is given by Bajpai et al.⁴ Here, another
approach will be adopted by beginning with the question: 'What value of
a will make Σ(xᵢ − a)² a minimum?'
For convenience, let

S = Σ(xᵢ − a)² = (x₁ − a)² + (x₂ − a)² + ... + (xₙ − a)²

For a particular set of n values of x, S will vary with a. From differential
calculus it is known that, for S to be a minimum, dS/da = 0. Now

dS/da = −2(x₁ − a) − 2(x₂ − a) − ... − 2(xₙ − a)

Putting this equal to zero then gives

(x₁ − a) + (x₂ − a) + ... + (xₙ − a) = 0

i.e.

(x₁ + x₂ + ... + xₙ) − na = 0

whence

a = (x₁ + x₂ + ... + xₙ)/n = x̄

As d²S/da² = 2n, which is positive, a minimum value of S is confirmed.


If x₁, x₂, ..., xₙ are replicate measurements obtained in the chemical
analysis of a specimen, as was the case with the nitrate ion concentration
data from which Table 2 was formed, it is a common practice to use their
mean x̄ as an estimate of the actual concentration present. In so doing, an
estimator is being used which satisfies the criterion that the sum of the
squares of the deviations of the observations from it should be a minimum.
A criterion of this kind is the basis of the method of least squares. An
important instance of its application occurs in the estimation of the
equations of regression lines, described in another chapter.
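The least-squares property can also be checked numerically; the sketch below (standard-library Python) reproduces the sums of squares 30, 46 and 174 obtained above for the four values 1, 5, 8 and 2.

```python
x = [1, 5, 8, 2]
x_bar = sum(x) / len(x)              # arithmetic mean, 4.0

def sum_sq_dev(a):
    """Sum of squared deviations of the data from the value a."""
    return sum((xi - a) ** 2 for xi in x)

print(sum_sq_dev(x_bar))   # 30.0  -- the minimum
print(sum_sq_dev(2))       # 46.0
print(sum_sq_dev(10))      # 174.0
```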

4.1.3 Calculation of the mean from a frequency table


We shall now consider how to interpret the definition of the arithmetic
mean, as given in eqn (1), when the data are presented in the form of a
frequency table. The colony counts in Example 1 will be used as an
illustration. By reference to Table 1 it will be seen that the sum of the
original 40 observations will, in fact, be the sum of five 0s, nine 1s, twelve
2s, seven 3s, five 4s and two 5s. Hence the value of the numerator in eqn

TABLE 5
Calculation of arithmetic mean for data in Table 1

Number of colonies (x) Frequency (f) fx

0 5 0
1 9 9
2 12 24
3 7 21
4 5 20
5 2 10
40 84

(1) is given by

5 × 0 + 9 × 1 + 12 × 2 + 7 × 3 + 5 × 4 + 2 × 5 = 84

The calculations can be set out as shown in Table 5.

Mean number of colonies per dish = 84/40 = 2·1
It will be noted that Σfx is the sum of all the observations (in this case the
total number of colonies observed) and Σf is the total number of observa-
tions (in this case the total number of dishes).
An extension to the general case is easily made. If the variate takes
values x₁, x₂, ..., xₙ with frequencies f₁, f₂, ..., fₙ respectively, the
arithmetic mean is given by

x̄ = (f₁x₁ + f₂x₂ + ... + fₙxₙ)/(f₁ + f₂ + ... + fₙ)     (3)

  = Σfx / Σf

On examination of eqn (3), it will be seen that x₁, x₂, ..., xₙ are weighted
according to the frequencies f₁, f₂, ..., fₙ with which they occur. The
mean for the data in Example 2 can be found by a straightforward
application of this formula to Table 2, thus giving 0·49. In Examples 1 and
2 no information was lost by putting the data into the form of a frequency
table. In each case it would be possible, from the frequency table, to say
what the original 40 observations were. The same is not true of the
frequency table in Example 3. By combining values together in each class
interval, some of the original detail was lost and it would not be possible

to reproduce the original data from Table 3. To calculate the mean from
such a frequency table, the values in each class interval are taken to have
the value at the mid-point of the interval. Thus, in Table 3, observed values
in the first group (1·95-2·95) are taken to be 2·45, in the next group
(2·95-3·95) they are taken to be 3·45, and so on. This is, of course, an
approximation, but an unavoidable one. It may be of interest to know that
the mean calculated in this way from Table 3 is 7·15, whereas the original
50 recordings give a mean of 7·08, so the approximation is quite good.
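The grouped-data approximation is easily verified; the following sketch (standard-library Python) weights the Table 3 mid-points by their frequencies and recovers the value of 7·15 quoted above.

```python
midpoints = [2.45, 3.45, 4.45, 5.45, 6.45, 7.45, 8.45, 9.45, 10.45, 11.45]
freq      = [1, 2, 1, 6, 16, 8, 8, 4, 3, 1]         # Table 3 frequencies

grouped_mean = sum(f * x for f, x in zip(freq, midpoints)) / sum(freq)
print(round(grouped_mean, 2))    # 7.15, against 7.08 from the raw data
```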

4.2 The mode


Although the arithmetic mean is the measure of location most often used,
it has some drawbacks. In the case of a discrete variate, for instance, it will
often turn out to be a value which cannot be taken by the variate. This
happened with the colony counts in Table 5 where the mean was found to
be 2·1, a value that would never result from counting the number of
colonies in a dish. Nevertheless, in this particular situation, although the
figure of 2·1 may appear to be nonsensical, it can be given an interpreta-
tion. The colony count, it will be recalled, was assumed to give the number
of bacteria present in 1 ml of diluted liquid. Now, although the notion of
2·1 bacteria may appear ridiculous, it becomes meaningful when 2·1
bacteria per ml is converted into 2100 bacteria per litre.
Suppose, however, that a builder wants to decide what size of house will
be in greatest demand. The fact that the mean size of household in the
region is 3·2 persons is not helpful, as there will not be any potential
customers of that kind. It might be more useful, in that case, to know the
size of household that occurs most frequently. This is the mode. For a
discrete variate it is the value where peak frequency occurs. Reference to
Table 1 shows that for the colony counts the mode is 2. For a continuous
variate, the mode occurs where the frequency density has a peak value. All
the frequency distributions that have been considered in this chapter have
a single peak and are therefore said to be unimodal. Sometimes, a bimodal
distribution, with two peaks, may occur. This could happen if, for in-
stance, the distribution is really a combination of two distributions.

4.3 The median


Another drawback of the arithmetic mean is that a few, or even just one,
extreme recorded values can make it unrepresentative of the data as a
whole, as an example will now show.
Values of conductivity (in µS/cm) were recorded for water specimens
taken at Station 20 (Tidal Weir) on the River Clyde at intervals of 3 to 4

weeks in 1988. The results, in chronological order, were as follows:


240 289 290 308 279 380 574 488
3590 17200 235 260 323 318 188
The mean is 1664, but 13 of the 15 results are way below this. Clearly the
value of the mean has been inflated by the two very high results, 3590 and
17200, recorded in July and August. (In fact, they arose from the tidal
intrusion of sea water into the lower reaches of the river. The river board's
normal procedure is to avoid taking samples when this occurs, but oc-
casionally there is a residue of sea water in the samples after the tide has
turned.) If these two values are excluded the mean of the remaining 13 is
found to be 321.
The median is a more truly middle value, in that half of the observations
are below it and half above. It is, in fact, the 50th percentile. Thus when
the GEMS² report states that, for the 190 rivers monitored for biochemical
oxygen demand (BOD), the median was 3 mg/litre, it indicates that 95
rivers reported a BOD level above 3 mg/litre and 95 reported a value
below.
To calculate the value of the median from a set of data, it is necessary
to consider the recorded values rearranged in order from the smallest to
the largest (or the largest to the smallest, though this is less usual). For the
River Clyde data, such rearrangement gives
188 235 240 260 279 289 290 308
318 323 380 488 574 3590 17200
There are 15 values so the middle one is the 8th and the median conductiv-
ity is therefore 308 µS/cm. Where the number of observations is even there
will be two middle values and the median is taken to be midway between
them. Thus for the 190 rivers in the GEMS survey, the median would have
been midway between the 95th and 96th. Sorting such a large amount of
data into order is easily done with the aid of a computer.
When incomes of people in a particular industry or community are
being described, it is often the median income that is specified rather than
the mean. This is because when just a few people have very high incomes
compared with the rest, the effect of those high values may be to raise the
mean to a value that is untypical, as happened with the conductivity data.
The mean would then give a false impression of the general level of
income. In that situation the distribution of incomes would be of the type
depicted in the histogram in Fig. 7(b). A similar pattern of distribution has
been found to apply to the carbon monoxide emissions (in mass per unit
distance travelled) of cars, and to blood lead concentrations.⁵⁻⁷ Such a
distribution is said to be skew, in contrast to the symmetric distribution
depicted in Fig. 7(a).

Fig. 7. Histograms showing distributions that are (a) symmetric; (b) skew.
While perfect symmetry can occur in a theoretical distribution used as
a model, a set of data will usually show only approximate symmetry.
Where variation is entirely attributable to random error, as with the
repeated determinations of nitrate ion concentration in Example 2, the
distribution is, in theory, symmetric. When perfect symmetry exists as in
Fig. 7(a), it follows that
Mode = Median = Mean
When a distribution is skewed as in Fig. 7(b), with a long right-hand tail,
Mode < Median < Mean
A skew distribution can also be the reverse of that in Fig. 7(b), having the
long tail at the left-hand end, though this situation is a less common
occurrence. Whereas previously it was seen that a few high values resulted
in the mean being untypically large, now the tail of low values reduces the
mean to an untypically low level. The relation between the three measures
of location is now
Mode > Median > Mean
A common practice in use by analytical chemists is to perform several
determinations on the specimen under examination and then use the mean
of the results to estimate the required concentration. Let us suppose that
a titrimetric method was involved and that in four titrations the following
volumes (in ml) were recorded:
25·06 24·89 25·03 25·01
Suspicion at once falls on the value 24·89 which appears to be rather far
away from the other three readings. An observation in a set of data which
seems to be inconsistent with the remainder of the set is termed an outlier.
The question arises: 'Should it be included or excluded in subsequent

calculations?' Finding an answer becomes less important if the median is


to be used instead of the mean. As has previously been noted, the median
is affected much less by one or two extreme values than is the mean. Thus,
taking the titration data as an example, rearrangement in order of mag-
nitude gives:
24·89 25·01 25·03 25·06
The median is ½(25·01 + 25·03), i.e. 25·02. Now, it might be that 24·89
was the result of wrongly recording 24·98 (transposing numbers in this
way is a common mistake). However, although such a mistake affects the
value of the mean, it has no effect on the median. In general, the effect an
outlier can have on the median is limited and this is an argument in favour
of using the median, instead of the mean, in this kind of situation.
There are also situations in which, compared with the median, the mean
is less practical to calculate or perhaps even impossible. As an example, the
toxicity of an insecticide might be measured by observing how quickly
insects are killed by it. To calculate the mean survival time of a batch of
insects after having been exposed to the insecticide under test, it would be
necessary to record the time of death of each insect. On the other hand,
obtaining the median survival time would be just a matter of recording
when half of the batch had been killed off and, moreover, it would not be
necessary to wait until every insect had died. The 'half-life' of a radioactive
element is another example. It is the time after which half the atoms will
have disintegrated and is thus the median life of the atoms present, i.e. half
of them will have a life less than the 'half-life' and half will have a life
which exceeds it.
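The contrast between mean and median is quickly demonstrated on the Clyde conductivity readings with the sketch below (standard-library Python).

```python
from statistics import mean, median

conductivity = [240, 289, 290, 308, 279, 380, 574, 488,
                3590, 17200, 235, 260, 323, 318, 188]   # µS/cm

print(round(mean(conductivity)))        # about 1664, inflated by two values
print(median(conductivity))             # 308

without_tidal = [v for v in conductivity if v < 3000]   # drop the two outliers
print(round(mean(without_tidal)))       # about 321
```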
4.4 The geometric mean
A change of variable can transform a skew distribution of the type
shown in Fig. 7(b) to a symmetric distribution which is more amenable to
further statistical analysis. Such an effect could be produced by a logari-
thmic transformation which entails changing from the variable x to
a new variable y, where y = logx (using any convenient base). Then
if the distribution of x is as in Fig. 7(b), the distribution of y would
show symmetry. For a set of data x(, x 2 , ••• , x n , such a transformation
would produce a new set of values y(, 12, ... ,Yn where Y( = logx(,
12 = logx2' ... , Yn = logxn • The meany of the new values is then given by
I
Y = - (y(
n
+ Y2 + ... + Yn)
The geometric mean (GM) of the original data is the value of x that
DESCRIPTIVE STATISTICAL TECHNIQUES 21

transforms to y, i.e. is such that its logarithm is y. Thus


I
10gGM = Y = -n (logxl + logx,
-
+ ... + logxn)

Rewriting this as

10gGM

leads to

GM = ':jX I X 2 • •• Xn

Both the GEMS³ report on air quality and the UK Blood Lead Monitor-
ing Programme⁵⁻⁷ include geometric means in their presentation of results.
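The computation follows the definition exactly: average the logarithms, then transform back. The sketch below (standard-library Python) applies it, purely as an illustration, to the Clyde conductivity readings used earlier; the chapter itself quotes no geometric mean for these data.

```python
import math

conductivity = [240, 289, 290, 308, 279, 380, 574, 488,
                3590, 17200, 235, 260, 323, 318, 188]

log_mean = sum(math.log10(v) for v in conductivity) / len(conductivity)
gm = 10 ** log_mean
print(round(gm))   # about 473: far below the arithmetic mean of about 1664
```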

5 MEASURES OF DISPERSION

In addition to a measure that indicates where the observations are located


on the measurement scale, it is also useful to have a measure that provides
some indication of how widely they are scattered. Let us suppose that the
percentage iron content of a specimen is required and that two laborato-
ries, A and B, each make five determinations, with the following results:

Laboratory A: 13·99 14·15 14·28 13·93 14·30


Laboratory B: 14·12 14·10 14·15 14·11 14·17

Both sets of data give a mean of 14·13. (This would be unlikely to happen
in practice, but the figures have been chosen to emphasise the point that
is being made here.) It can, however, be seen at a glance that B's results
show a much smaller degree of scatter, and thus better precision, than
those obtained by A. Although the terms precision and accuracy tend to
be used interchangeably in everyday speech, the theory of errors makes a
clear distinction between them. A brief discussion of the difference
between these two features of a measuring technique or instrument will be
useful at this point.
Mention has already been made of the occurrence of random error in
measurements. Another type of error that can occur is a systematic error
(bias). Possible causes, cited by Lee & Lee,¹ are instrumental errors such
as the zero incorrectly adjusted or incorrect calibration, reagent errors
such as the sample used as a primary standard being impure or made up

to the wrong concentration of solution, or personal errors arising from


always viewing a voltmeter from the side in an identical way or from the
individual's judgment in detecting a colour change. Such a systematic
error is present in every measurement. If it is constant in size, it is the
amount by which the mean of an infinitely large number of repeated
measurements would differ from the true value (of what is being mea-
sured). The accuracy of a measuring device is determined by the systematic
error. If none exists, the accuracy is perfect. The greater the systematic
error, the poorer the accuracy.
Accuracy, then, is concerned with the position of the readings on the
measurement scale, the existence of the systematic error causing a general
shift in one direction. This, of course, is the aspect of data which is
described by a measure of location such as the mean. Precision, however,
is concerned with the closeness of agreement between replicate test results,
i.e. with the variation due to the presence of random error. Therefore, in
describing precision, a measure of dispersion is appropriate. The greater
the scatter, the poorer is the precision.
Although the need to measure dispersion has been introduced here in
relation to the precision of a test procedure, the amount of variation is an
important feature of most sets of data. For instance, the consistency of the
quality of a product may matter to the consumer as well as the general
level of quality.

5.1 The range


This is the simplest measure of spread, easy to understand and easy to
calculate, being given by:
Range = Largest observation - Smallest observation
Thus returning to the %Fe determinations, we have
Laboratory A: Range = 14·30 − 13·93 = 0·37
Laboratory B: Range = 14·17 − 14·10 = 0·07

With calculators and computers now readily available, ease of calculation


has become less important. This advantage of the range is now outweighed
by the disadvantage that it is based on just two of the observations, the
only account taken of the others being that they are somewhere in
between. One observation which is exceptionally large or small can exert
a disproportionate influence on it and give a false impression of the general
amount of scatter in the data.

5.2 The interquartile range


The influence of extreme values is removed by a measure that ignores the
highest 25% and the lowest 25% of the observations. It is the range of the
remaining middle 50%. You have already seen how the total frequency is
divided into two equal parts by the median. Now we are considering the
total frequency being divided into four equal parts by the quartiles. The
lower quartile, Q₁, has 25% of the observations below it and is therefore
the 25th percentile. The upper quartile, Q₃, is exceeded by 25% of the
observations and is therefore the 75th percentile. The subdivision is com-
pleted by the middle quartile, Q₂, which is, of course, the median. To
summarise, there are thus:

25% of observations less than Q₁
25% of observations between Q₁ and Q₂
25% of observations between Q₂ and Q₃
25% of observations above Q₃

Using the range of the middle 50% of observations as a measure of spread:

Interquartile range = Upper quartile − Lower quartile = Q₃ − Q₁

The exceptionally high value at the upper extreme of the Clyde conductiv-
ity data (for which the median was found in Section 4.3) would have a
huge effect on the range, but none at all on the interquartile range, which
will now be calculated.
In the calculation of the median from a set of data there is a universal
convention that when there is no single middle value, the median is taken
to be midway between the two middle values. There is, however, no
universally agreed procedure for finding the quartiles, or, indeed, percen-
tiles generally. It can be argued that, in the same way that the median is
found from the ordered data by counting halfway from one extreme to the
other, the quartiles should be found by counting halfway from each
extreme to the median. Applying this to the conductivity data, there are
two middle values, the 4th and 5th, which are halfway between the 1st and
8th (the median). They are 260 and 279, so the lower quartile would
then be taken as (260 + 279)/2, i.e. 269·5. Similarly, the upper quartile
would then be (380 + 488)/2, i.e. 434, giving the interquartile range as
434 - 269·5 = 164·5.
Another approach is based on taking the Pth percentile of n ordered
observations to be the (P/100)(n + 1)th value. For the conductivity data,
n = 15 and the quartiles (P = 25 and P = 75) would be taken as the 4th

and 12th values, i.e. 260 and 488. For a larger set of data, the choice of
approach would have less effect.
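Both quartile conventions are compared in the sketch below (standard-library Python), using the 15 ordered conductivity readings.

```python
data = sorted([240, 289, 290, 308, 279, 380, 574, 488,
               3590, 17200, 235, 260, 323, 318, 188])

# Convention 1: quartiles midway between each extreme and the median,
# i.e. the average of the 4th/5th and of the 11th/12th ordered values.
q1 = (data[3] + data[4]) / 2      # (260 + 279)/2 = 269.5
q3 = (data[10] + data[11]) / 2    # (380 + 488)/2 = 434.0
print(q3 - q1)                    # interquartile range 164.5

# Convention 2: Pth percentile = (P/100)(n + 1)th ordered value; with
# n = 15 the quartiles are the 4th and 12th values.
print(data[3], data[11])          # 260 and 488
```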

5.3 The mean absolute deviation


Although the interquartile range may be an improvement on the range, it
still does not use to the full all the information given by the data. Greater
use of the values recorded would be made by a measure which takes into
account how much each value deviates from the arithmetic mean. Taking
the mean of such deviations would, however, be of no avail. It has already
been noted in Section 4.1.2 that their sum is always zero. One way round
this would be to ignore the negative signs, i.e. take all deviations as
positive. Their mean is then the mean absolute deviation (more often
loosely referred to as the mean deviation, though that can be a misleading
description). For n observations x₁, x₂, ..., xₙ whose mean is x̄, it would
be

(|x₁ − x̄| + |x₂ − x̄| + ... + |xₙ − x̄|)/n = (1/n) Σ|x − x̄|

This measure of spread is, however, of limited usefulness and is not often
met with nowadays.

5.4 Variance and standard deviation


Another way of dealing with the problem of the negative and positive
deviations cancelling each other out is to square them. This idea fits in
more neatly with the mathematical theory of statistics than does the use
of absolute values. Taking the mean of the squares of the deviations gives
the variance. Thus:
Variance = Mean of squares of deviations from the mean
However, this will not be in the original units. For example, if the data are
in mm, the variance will be in mm². Usually, it is desirable to have a
measure of spread expressed in the same units as the data themselves.
Hence the positive square root of the variance is taken, giving the standard
deviation, i.e.
Standard deviation = √Variance
The standard deviation is an example of a root-mean-square (r.m.s.) value.
This is a general concept which finds application when the 'positives' and
'negatives' would cancel each other out, as for instance, in finding the
mean value of an alternating current or voltage over a cycle. Thus the
figure of 240 V stated for the UK electricity supply is, in fact, the r.m.s.
value of the voltage.
The calculations required when finding a standard deviation are more
complicated than for other measures of dispersion, but this disadvantage
has been reduced by the aids to computation now available. A calcu-
lator with the facility to carry out statistical calculations usually offers
a choice of two possible values. To explain this we must now make a slight
digression.

5.4.1 Population parameters and sample estimates


One of the main problems with which statistical analysis has to deal is that,
of necessity, conclusions have to be made about what is called the popula-
tion from only a limited number of values in it, called the sample.
A manufacturer of electric light bulbs who states that their mean length
of life is 1000 h is making a statement about a population-the lengths of
life of all the bulbs being manufactured by the company. The claim is
made on the basis of a sample-the lengths of life of a small number of
bulbs which have been tested in the factory's quality control department.
In introducing measures of dispersion, the results of 5 determinations of
%Fe content made by each of two laboratories, A and B, were considered.
What has to be borne in mind, when interpreting such data, is that if a
further 5 determinations were made by, say, Laboratory A, it is most
unlikely that they would give the same results as before. Such a set of 5
measurements, therefore, represents a sample from the population of all
the possible measurements that might result from determinations by
Laboratory A of the %Fe content of the specimen. If no bias is present in
the measurement process, the mean of this population would be the actual
%Fe content, which is, of course, what the laboratory is seeking to
determine.
As a further illustration, let us take the solid waste composition study
referred to in Example 5. The values for the percentage of food waste in
the six 150-kg lots taken for separation can be regarded as a sample from
the population of the values that would be obtained from the separation
of all the 150-kg lots into which the waste arriving at the site could be
divided. Obviously, it is the composition of the waste as a whole that the
investigators would have wanted to know about, not merely the composi-
tion of the small amounts taken for separation.
In all these three cases, the situation is the same. Information about the
popUlation is what is required, but only information about a sample is
available. Where values of population parameters, such as the mean and
standard deviation are required, estimates based on a sample will have to
be used. The convention of denoting population parameters by Greek
letters is well established and will be followed here, with μ for the mean
and σ for the standard deviation. The sample mean x̄ can be used as an
estimate of the population mean μ. Different samples would produce a
variety of values of x̄. Some would underestimate μ, some would over-
estimate it but, on average, there is no bias in either direction. In other
words, x̄ provides an unbiased estimate of μ.
For a population of size N, the variance σ² is given by

    σ² = (1/N) Σ(x − μ)²

where the summation is taken over all N values of x. Mostly, however, a


set of data represents a sample from a population, not the population
itself. It might be thought that Σ(x − x̄)²/n should be the estimate of σ²
provided by a sample of size n. However, just as different samples give
a variety of values of x̄, so also they will give a variety of values for
Σ(x − x̄)²/n. While some of these underestimate σ² and some overestimate
it, there is, on average, a bias towards underestimation. An unbiased
estimate of σ² is given by Σ(x − x̄)²/(n − 1). (Explanation of this, in
greater detail, can be found in Ref. 4.) Obviously, the larger the value of
n, the less it matters whether n − 1 or n is used as the denominator. When
n = 100, the result of dividing by n − 1 will differ little from that obtained
when dividing by n, but when n = 4 the difference would be more
appreciable. The estimate will be denoted here by s². Although this notation
is widely adopted, it is not a universal convention so one should be
watchful in this respect. For the unbiased estimate of the population
variance, then, we have

    s² = (1/(n − 1)) Σ(x − x̄)²    (4)

and the corresponding estimate of the population standard deviation is

    s = √[(1/(n − 1)) Σ(x − x̄)²]    (5)

5.4.2 Calculating s from the definition


Even though, in practice, a calculator may be used in finding s, setting out
the calculations in detail here will help to clarify what the summation
process in eqns (4) and (5) involves. This is done in Table 6 for the
determinations of %Fe made by Laboratory B. (Ignore the column
headed x² for the time being.)

                              TABLE 6
       Calculation of s for Laboratory B's measurements of %Fe

       x         x − x̄       (x − x̄)²          x²

    14·12        −0·01        0·0001        199·3744
    14·10        −0·03        0·0009        198·8100
    14·15         0·02        0·0004        200·2225
    14·11        −0·02        0·0004        199·0921
    14·17         0·04        0·0016        200·7889

    70·65                     0·0034        998·2879

    x̄ = 70·65/5 = 14·13     s² = 0·0034/4 = 0·00085     s = 0·029

Similar calculations give, for Laboratory A's results, s = 0·167, the
larger value reflecting the greater amount of spread.

5.4.3 A shortcut method


The formula for s can be converted into an alternative form which cuts out
the step of calculating deviations from the mean. By simple algebra, it can
be shown that

    Σ(x − x̄)² = Σx² − (Σx)²/n    (6)

The %Fe results obtained by Laboratory B are again chosen for illustra-
tion.
From Table 6, Σx = 70·65 and Σx² = 998·2879. Substitution in eqn (6)
gives

    Σ(x − x̄)² = 998·2879 − 70·65²/5
              = 998·2879 − 998·2845
              = 0·0034

s is then found as before.
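The following short Python sketch, using Laboratory B's five %Fe values from Table 6, illustrates eqns (4)-(6): it computes s both from the deviations and by the shortcut formula, and the two routes agree.

    import math

    # Laboratory B's five %Fe determinations (Table 6).
    x = [14.12, 14.10, 14.15, 14.11, 14.17]
    n = len(x)
    xbar = sum(x) / n

    # Equation (5): definition via deviations from the mean.
    s_def = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

    # Equation (6): the shortcut form using sums of x and x squared.
    sum_sq_dev = sum(xi ** 2 for xi in x) - (sum(x) ** 2) / n
    s_short = math.sqrt(sum_sq_dev / (n - 1))

    print(round(xbar, 2), round(s_def, 3), round(s_short, 3))   # 14.13 0.029 0.029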

5.4.4 The use of coding


If you examine the calculations that have just been carried out in applying
eqn (6), you will see that they involve a considerable increase in the
number of digits involved. While the original items of data involved only
4 digits, the process of squaring and adding nearly doubled this number.
The example will serve to show a hazard that may exist, not only in the
use of the shortcut formula but in similar situations when a computer or
calculator is used.
In the previous calculations all the figures were carried in the working
and there was no rounding off. Now, let us look at the effect of working
to a maximum of 5 significant figures. The values of x² would then be taken
as 199·37, 198·81, 200·22, 199·09 and 200·79, giving Σx² = 998·28. Sub-
stitution in eqn (6) gives Σ(x − x̄)² = 998·28 − 998·28 = 0, leading to
s = O. What has happened here is that the digits discarded in the rounding
off process are the very ones that produce the correct result. This occurs
when the two terms on the right-hand side of eqn (6) are near to each other
in value. Hence, in the present example, 7 significant figures must be
retained in the working in order to obtain 2 significant figures in the
answer.
Calculators and computers carry only a limited number of digits in their
working. Having seen what can happen when just 5 observations are
involved, only a little imagination is required to realise that, with a larger
amount of data, overflow of the capacity could easily occur. The conse-
quent round-off errors can then lead to an incorrect answer, of which the
unsuspecting user of the calculator will be unaware (unless an obviously
ridiculous answer like s = 0 is obtained).
The number of digits in the working can be reduced by coding the data,
i.e. making a change of origin and/or size of unit. For Laboratory B's %Fe
data, subtracting 14·1 from each observation (i.e. moving the origin to
x = 14·1) gives the coded values 0·02, 0·00, 0·05, 0·01 and 0·07. The
spread of these values is just the same as that of the original data and they
will yield the same value of s. The number of digits in the working will,
however, be drastically reduced. This illustrates how a convenient number
can be subtracted from each item of data without affecting the value of s.
Apart from reducing possible risk arising from round-off error, fewer
digits mean fewer keys to be pressed, thus saving time and reducing
opportunities of making wrong entries. In fact, the %Fe data could be
even further simplified by making 0·01 the unit, so that the values become
2, 0, 5, 1 and 7. It would then be necessary, when s has been calculated,
to convert back to the original units by multiplying by 0·01. A more
detailed explanation of coding is given in Ref. 4.
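The round-off hazard and the coding remedy described above can be reproduced in a few lines of Python; the rounding to 5 significant figures mimics the worked example in the text.

    # Illustration of the round-off hazard in eqn (6) and the coding remedy.
    x = [14.12, 14.10, 14.15, 14.11, 14.17]
    n = len(x)

    # Squares deliberately rounded to 5 significant figures, as in the text.
    rounded_squares = [round(xi ** 2, 2) for xi in x]   # 199.37, 198.81, 200.22, 199.09, 200.79
    sum_sq_dev = sum(rounded_squares) - round(sum(x) ** 2 / n, 2)
    print(sum_sq_dev)    # effectively zero: the rounded sums cancel and the spread is lost

    # Coding: subtract 14.1 and work in units of 0.01.
    coded = [round((xi - 14.1) / 0.01) for xi in x]     # 2, 0, 5, 1, 7
    ssd_coded = sum(c ** 2 for c in coded) - sum(coded) ** 2 / n
    s = (ssd_coded / (n - 1)) ** 0.5 * 0.01             # decode back to %Fe units
    print(round(s, 3))                                   # 0.029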

                              TABLE 7
       Calculation of s for frequency distribution in Table 1

    x      f      fx     x − x̄     (x − x̄)²     f(x − x̄)²

    0      5       0     −2·1       4·41         22·05
    1      9       9     −1·1       1·21         10·89
    2     12      24     −0·1       0·01          0·12
    3      7      21      0·9       0·81          5·67
    4      5      20      1·9       3·61         18·05
    5      2      10      2·9       8·41         16·82

          40      84                             73·60

    x̄ = 84/40 = 2·1        s = √(73·60/39) = 1·37

5.4.5 Calculating s from a frequency table


As with the arithmetic mean, calculating s from data given in the form of
a frequency table is just a matter of adapting the original definition. Thus
eqn (5) now becomes
    s = √[Σf(x − x̄)²/(Σf − 1)]    (7)

The appropriate form of eqn (6) is now

    Σf(x − x̄)² = Σfx² − (Σfx)²/Σf    (8)

Applying eqn (7) to the colony count data in Table 1 gives the calculations
shown in Table 7. Applying eqn (8) to the same data,

    Σfx² = 0 + 9 + 48 + 63 + 80 + 50 = 250

and hence

    Σf(x − x̄)² = 250 − 84²/40 = 73·60
Calculation of s then proceeds as before.
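As a minimal sketch, the colony count frequencies of Table 7 can be processed directly from eqns (7) and (8) in Python:

    import math

    # Colony count frequency distribution (Table 7): values x with frequencies f.
    x = [0, 1, 2, 3, 4, 5]
    f = [5, 9, 12, 7, 5, 2]

    n = sum(f)                                          # total frequency, 40
    fx = sum(fi * xi for fi, xi in zip(f, x))           # 84
    xbar = fx / n                                       # 2.1

    # Equation (8): sum of f(x - xbar)^2 without forming the deviations.
    ssd = sum(fi * xi ** 2 for fi, xi in zip(f, x)) - fx ** 2 / n
    s = math.sqrt(ssd / (n - 1))                        # eqn (7)
    print(round(xbar, 1), round(s, 2))                  # 2.1 1.37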

5.5 Coefficient of variation (relative standard deviation)


An error of 1 mm would usually be of much greater consequence when
measuring a length of 1 cm than when a length of 1 m is being measured.
Although the actual error is the same in both cases, the relative error, i.e.
the error as a proportion of the true value, is quite different. Similar
remarks apply to a measure of spread. Hence, when using the standard
deviation as a measure of precision, an analytical chemist will often prefer


to express it in relative terms. The coefficient of variation (CV) does this
by expressing the standard deviation as a proportion (usually a percen-
tage) of the true value as estimated by the mean. Thus, for a set of data,
    CV = (s/x̄) × 100
This is also known as the percentage relative standard deviation (RSD). It
is independent of the size of unit in which the variate is measured. For
example, if data in litres were converted into millilitres, both x̄ and s would
be multiplied by the same factor and thus their ratio would remain
unchanged. The value of the CV is, however, not independent of the
position of the origin on the measurement scale. Thus the conversion of
data from °C to of would affect its value, because the two temperature
scales do not have the same origin. Although Marriott8 warns of its
sensitivity to error in the mean, the CV enjoys considerable popularity.

6 MEASURES OF SKEWNESS

Another feature of a distribution, to which reference has already been


made, is its skewness. Measures of location and dispersion have been dealt
with at length because of their widespread use. Measures of skewness,
however, are encountered less frequently and so only a brief mention is
given here. The effect that lack of symmetry in a distribution has on the
relative positions of the mean, median and mode was noted earlier.
Measures of skewness have been devised which make use of this, by
incorporating either the difference between mean and mode or the differ-
ence between mean and median. Another approach follows along the lines
of the mean and variance in that it is based on the concept of moments.
This is a mathematical concept which it would not be appropriate to
discuss in detail here. Let it suffice to say that the variance, which involves
the squares of the deviations from the mean, is a second moment. The
coefficient of skewness uses a third moment which involves the cubes of the
deviations from the mean.
Whichever measure of skewness is chosen, its value is:
-positive when the distribution has a long right tail as in Fig. 7(b),
-zero when the distribution is symmetric,
-negative when the distribution has a long tail at the left-hand end.

7 EXPLORATORY DATA ANALYSIS

Various ways of summarising a set of data have been described in this


chapter. Some of them are incorporated in two visual forms of display that
have been developed in recent years and are now widely used in explorat-
ory data analysis (EDA), in which J.W. Tukey has played a leading role.
As its name suggests this may be used in the initial stages of an investiga-
tion in order to indicate what further analysis might be fruitful. A few
simple examples of the displays will be given here, by way of introduction
to the ideas involved. A more detailed discussion is provided by Tukey,9
and also by Erickson & Nosanchuk.10

7.1 Stem-and-leaf displays


We have already seen that one way of organising a set of data is to form
a frequency table but that when this entails grouping values together some
information about the original data is lost. In a stem-and-leaf display,
observations are classified into groups without any loss of information
occurring. To demonstrate how this is done, we shall use the following
values of alkalinity (as calcium carbonate in mg/litre) recorded for water
taken from the River Clyde at Station 20 (Tidal Weir) at intervals through-
out 1987 and 1988.
68 90 102 74 68 108 122 85 115 62 66 89 66 70 112
50 67 60 66 60 92 117 126 88 133 60 68 76 82 42
A frequency table could be formed from the data by grouping together
values from 40 to 49, 50 to 59, 60 to 69 and so on. All observations in the
first group would then begin with 4, described by Tukey as the 'starting
part'. This is now regarded as the stem. For the other groups the stems are
thus 5, 6, etc. All observations in a particular group have the same stem.
The remaining part of an observation, which varies within the group, is the
leaf. Thus 68 and 62 have the same stem, 6, but their leaves are 8 and 2
respectively. The way that the stem and its leaves are displayed is shown
in Table 8. If desired a column giving the frequencies can be shown
alongside the display.
In the present example, the stem represented 'tens' and the leaf 'units'.
If, however, the data on lead concentration in Example 3 were to be
displayed, the stem would represent units and the leaf the first place of
decimals. Thus, for the first reading, 5·4, the stem would be 5 and the leaf
4. In other situations a two-digit leaf might be required, in which case each
pair of digits is separated from the next by a comma.

TABLE 8
Stem-and-leaf display for River Clyde (Station 20)
alkalinity data

Stem Leaf Frequency

4 2 1
5 0 1
6 88266706008 11
7 406 3
8 5982 4
9 02 2
10 28 2
11 527 3
12 26 2
13 3 1

A further refinement of Table 8 can be achieved by placing the leaves


in order of magnitude, as shown in Table 9. It then becomes easier to
identify such measures as the quartiles, for instance.
Two sets of data can be compared using a back-to-back stem-and-leaf
display. With a common stem, the leaves for one set of data are shown on
the left and the leaves for the other set on the right. Table 10 shows the
alkalinity data for Station 20 on the River Clyde, previously displayed in
Table 8, on the right. The leaves on the left represent values of alkalinity
recorded during the same period at Station 12A.

TABLE 9
Ordered stem-and-leaf display
for data in Table 8

Stem Leaf

4 2
5 0
6 00026667888
7 046
8 2589
9 02
10 28
11 257
12 26
13 3

TABLE 10
Back-to-back stem-and-leaf display for River Clyde
alkalinity data

Station 12A Station 20

9 2
4 3
688 4 2
9885234 5 0
20205 6 88266706008
21 7 406
780 8 5982
62 9 02
6 10 28
40 11 527
12 26
13 3
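For readers who wish to generate such displays by program, the following Python sketch reproduces the ordered stem-and-leaf display of Table 9 from the Station 20 alkalinity values (the '|' separator is simply a convenient way of printing each stem beside its leaves):

    # Ordered stem-and-leaf display for the Station 20 alkalinity data (Table 9).
    values = [68, 90, 102, 74, 68, 108, 122, 85, 115, 62, 66, 89, 66, 70, 112,
              50, 67, 60, 66, 60, 92, 117, 126, 88, 133, 60, 68, 76, 82, 42]

    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)   # stem = tens, leaf = units

    for stem in sorted(stems):
        print(f"{stem:3d} | {''.join(str(leaf) for leaf in stems[stem])}")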

7.2 Box-and-whisker displays (box plots)


Another form of display is one which features five values obtained from
the data-the three quartiles (lower, middle and upper) and two extremes
(one at each end). For reasons which will become obvious, it is called a
box-and-whisker plot. You may also find it referred to as a box-and-dot
plot or, more simply, as a box plot.
The box represents the middle half of the distribution. It extends from
the lower quartile to the upper quartile, and a line across it indicates the
position of the median. The 'whiskers' extend to the extremities of the
data. Taking again, as an example, the alkalinity data recorded for the
River Clyde at Station 20, the lowest value recorded was 42 and the highest
133. There were 30 observations so, when they are placed in order, the
median is halfway between the 15th and 16th, i.e. 75. Taking the lower
quartile as the 8th, i.e. 66, and the upper quartile as the 23rd, i.e. 102, the
box-and-whisker display is as shown in Fig. 8.
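A minimal Python sketch of this five-number summary is given below; rounding (P/100)(n + 1) to the nearest whole number is one simple way of arriving at the 8th and 23rd ordered values used above, and is an illustrative choice rather than a prescription of this chapter.

    # Five-number summary for the Station 20 alkalinity data.
    data = sorted([68, 90, 102, 74, 68, 108, 122, 85, 115, 62, 66, 89, 66, 70, 112,
                   50, 67, 60, 66, 60, 92, 117, 126, 88, 133, 60, 68, 76, 82, 42])
    n = len(data)

    median = (data[n // 2 - 1] + data[n // 2]) / 2      # mean of the 15th and 16th values
    q1 = data[round(0.25 * (n + 1)) - 1]                # 8th ordered value
    q3 = data[round(0.75 * (n + 1)) - 1]                # 23rd ordered value

    print(data[0], q1, median, q3, data[-1])            # 42 66 75.0 102 133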
Here the box is aligned with a horizontal scale but, if preferred, it can
be positioned vertically as in Fig. 9. Where a larger amount of data is
involved, the bottom and top 5% or 10% of the distribution may be
ignored when drawing the whiskers, thus cutting out any freak values that
might have occurred. This is done in the reports of the UK Blood Lead
Monitoring Programme6,7 where the extremities of the whiskers are the 5th
and 95th percentiles. These reports provide an example of a particularly
effective use of box plots, in that they can be shown alongside one another

[Box-and-whisker plot drawn against a horizontal axis: Alkalinity (as CaCO3 in mg/litre), from 40 to 140.]

Fig. 8. Box-and-whisker display for River Clyde (Station 20) alkalinity data.

to compare various data sets, e.g. blood lead concentrations in men and
women, in different years or in various areas where surveys were carried
out. Figure 9 gives an indication of how this can be done, enabling a rapid
visual assessment of any differences between sets of data to be made quite
easily.

[Four box-and-whisker plots side by side, for MEN and WOMEN in YEAR 1 and YEAR 2, drawn against a vertical blood lead concentration scale from 0 to 20.]
Fig. 9. Box-and-whisker displays comparing blood lead concentrations in
men and women in successive years.

ACKNOWLEDGEMENT

The authors are indebted to Desmond Hammerton, Director of the Clyde


River Purification Board, for supplying data on water quality and for
giving permission for its use in illustrative examples in this chapter.

REFERENCES

1. Lee, J.D. & Lee, T.D., Statistics and Numerical Methods in BASIC for Biolo-
gists. Van Nostrand Reinhold, Wokingham, 1982.
2. GEMS: Global Environment Monitoring System, Global Pollution and
Health. United Nations Environment Programme and World Health Or-
ganization, London, 1987.
3. GEMS: Global Environment Monitoring System, Air Quality in Selected
Urban Areas 1975-1976. World Health Organization, Geneva, 1978.
4. Bajpai, A.C., Calus, I.M. & Fairley, J.A., Statistical Methods for Engineers
and Scientists. John Wiley, Chichester, 1978.
5. Department of the Environment, UK Blood Lead Monitoring Programme
1984-1987 Results for 1984. Pollution Report No 22. Her Majesty's Stationery
Office, London, 1986.
6. Department of the Environment, UK Blood Lead Monitoring Programme
1984-1987 Results for 1985. Pollution Report No 24. Her Majesty's Stationery
Office, London, 1987.
7. Department of the Environment, UK Blood Lead Monitoring Programme
1984-1987 Results for 1986. Pollution Report No 26. Her Majesty's Stationery
Office, London, 1988.
8. Marriott, F.H.C., A Dictionary of Statistical Terms. 5th edn, Longman
Group, UK, Harlow, 1990.
9. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA,
1977.
10. Erickson, B.H. & Nosanchuk, T.A., Understanding Data. Open University
Press, Milton Keynes, 1979.
Chapter 2

Environmetric Methods of
Nonstationary Time-Series
Analysis: Univariate Methods
PETER YOUNG and TIM YOUNG*
Centre for Research in Environmental Systems, Institute of
Environmental and Biological Sciences, Lancaster University, Lancaster,
Lancashire, LA1 4YQ, UK

1 INTRODUCTION

By 'environmetrics', we mean the application of statistical and systems


methods to the analysis and modelling of environmental data. In this
chapter, we consider a particular class of environmetric methods; namely
the analysis of environmental time-series. Although such time-series can
be obtained from planned experiments, they are more often obtained by
passively monitoring the environmental variables over long periods of
time. Not surprisingly, therefore, the statistical characteristics of such
series can change considerably over the observation interval, so that the
series can be considered nonstationary in a statistical sense. Figure l(a), for
example, shows a topical and important environmental time-series-the
variations of atmospheric CO2 measured at Mauna Loa in Hawaii over the
period 1974 to 1987. This series exhibits a clear upward trend, together
with pronounced annual periodicity. The trend behaviour is a classic
example of statistical nonstationarity of the mean, with the local mean
value of the series changing markedly over the observation interval. The

*Present address: Maths Techniques Group, Bank of England, Threadneedle
Street, London EC2R 8AH.

[Figure 1 comprises three panels: (a) the Mauna Loa CO2 series plotted against sample number; (b) the annual sunspot series plotted against year; (c) the percentage extinctions series plotted against sample number, with a geologic time scale (Permian, Triassic, Jurassic, Cretaceous, Tertiary) running from about 250 to 50 My BP.]

Fig. 1. Three examples of typical nonstationary environmental time series: (a)
the Mauna Loa CO2 series (1974-1987); (b) the Waldmeier sunspot series
(1700-1950); (c) the Raup-Sepkoski 'Extinctions' series (between the
Permian (253 My BP) and the Tertiary (11·3 My BP) periods of geologic time).

nature of the periodicity (or seasonality), on the other hand, shows no


obvious indications of radical change although, as we shall see, there is
some evidence of mild nonstationarity even in this aspect of the series.
The well-known Waldmeier annual sunspot data (1700-1950) shown in
Fig. l(b) are also quite clearly nonstationary, although here it is the
amplitude modulation of the periodic component which catches the eye,
with the mean level remaining relatively stable. However, here we see also
that the periodic variations are clearly distorted as they fluctuate around
the 'mean' value, with the major amplitude variation appearing in the
upper maxima. This suggests more complex behavioural characteristics
and the possibility of nonlinearity as well as nonstationarity in the data.
Finally, Fig. l(c) shows 70 samples derived from the 'Extinction' series
compiled by Raup & Sepkoski.1 This series is based on the percentage of
families of fossil marine animals that appear to become extinct over the
period between the Permian (253 My BP) and the Tertiary (11·3 My BP)
periods of geologic time (My = 1000 000 years). The series in Fig. l(c)
was obtained from the original, irregularly spaced Raup & Sepkoski series
by linear interpolation at a sampling interval of 3·5 My. Using a simple
analysis, which gave equal weighting to all the observed peaks, Raup &
Sepkoski1 noted that the series appeared to have a cyclic component with
a period in the region of 26 My. However, any evaluation of this cyclic
pattern must be treated with caution since it will be affected by the
shortness of the record and, as in the case of the sunspot data, by the great
variation in amplitude and associated asymmetry of the main periodic
component. In other words there are, once again, clear indications of both
nonlinearity and nonstationarity in the data, both of which are not
handled well by conventional techniques of time-series analysis.
The kinds of nonstationarity and/or nonlinearity observed in the three
series shown in Fig. I are indicative of changes in the underlying statistical
characteristics of the data. As a result, we might expect that any mathema-
tical models used to characterise these data should be able to represent this
nonstationarity or nonlinearity if they are to characterise the time-series
in an acceptable manner. Box & Jenkins,2 amongst many other statisti-
cians, have recognised this problem and have proposed various methods
of tackling it. In their case, they propose simple devices, such as differenc-
ing the data prior to model identification and estimation, in order to
remove the trend,2 or nonlinear transformation prior to analysis in order
to purge the series of its nonlinearity. But what if, for example, we do not
wish to difference the data, since we feel that this will amplify high
frequency components in the series? Can we account for the nonstationar-
ity of the mean in other ways? Or again, if we do not choose to perform


prior nonlinear transformation, or feel that even after such transformation
the seasonality is still distorted in some manner, then how can we handle
this situation? Similarly, what if we find that, on closer inspection, the
amplitude of the seasonal components of the CO2 data set is not, in fact,
exactly constant: can we develop a procedure for estimating these varia-
tions so that the estimates can be useful in exercises such as 'adjusting' the
data to remove the seasonal effects?
In this chapter, we try to address such questions as these by considering
some of the newest environmetric tools available for evaluating non-
stationary and nonlinear time-series. There is no attempt to make the
treatment comprehensive; rather the authors' aim is to stimulate interest
in these new and powerful methodological tools that are just emerging and
appear to have particular relevance to the analysis of environmental data.
In particular, a new recursive estimation approach is introduced to the
modelling, forecasting and seasonal adjustment of nonstationary time-
series and its utility is demonstrated by considering the analysis of both
Mauna Loa CO2 and the Extinctions data.

2 PRIOR EVALUATION OF THE TIME-SERIES

We must start at the beginning: the first step in the evaluation of any data
set is to look at it carefully. This presumes the availability of a good
computer filing system and associated plotting facilities. Fortunately,
most scientists and engineers have access to microcomputers with appro-
priate software; either the IBM-PC-AT/IBM PS2 and their compatibles,
or the Macintosh SE/II family. In this chapter, we will use mainly the
micro CAPTAIN program developed at Lancaster, which is designed for
the IBM-type machines but will be extended to the Macintosh in the near
future. Other programs, such as StatGraphics®, can also provide similar
facilities for the basic analysis of time-series data, but they do not provide
the recursive estimation tools which are central to the micro CAPTAIN
approach to time-series analysis.
Visual appraisal of time-series data is dependent very much on the
background and experience of the analyst. In general, however, factors
such as nonstationarity of the mean value and the presence of pronounced
periodicity will be quite obvious. Moreover, the eye is quite good at
analysing data and perceiving underlying patterns, even in the presence of
background noise: in other words, the eye can effectively 'filter' off the
effects of stochastic (random) influences from the data and reveal aspects
of the data that may be of importance to their understanding within the
context of the problem under consideration. The comments above on the
CO 2 and sunspot data are, for example, typical of the kind of initial
observations the analyst might make on these two time-series.
Having visually appraised the data, the next step is to consider their
more quantitative statistical properties. There are, of course, many dif-
ferent statistical procedures and tests that can be applied to time-series
data and the reader should refer to any good text on time-series analysis
for a comprehensive appraisal of the subject. Here, we will consider only
those statistical procedures that we consider to be of major importance in
day-to-day analysis; namely, correlation analysis in the time-domain, and
spectral analysis in the frequency-domain.
A discrete time-series is a set of observations taken sequentially in time;
thus N observations, or samples, taken from a series y(t) at times t1,
t2, ..., tk, ..., tN, may be denoted by y(t1), y(t2), ..., y(tk), ..., y(tN).
In this chapter, however, we consider only sampled data observed at some
fixed interval δt: thus we then have N successive values of the series
available for analysis over the observation interval of N samples, so that we
can use y(1), y(2), ..., y(k), ..., y(N) to denote the observations made
at equidistant time intervals t0, t0 + δt, t0 + 2δt, ..., t0 + kδt, ...,
t0 + Nδt. If we adopt t0 as the origin and δt as the sampling interval, then
we can regard y(k) as the observation at time t = tk.
A stationary time-series is one which can be considered in a state of
statistical equilibrium; while a strictly stationary time-series is one in which
its statistical properties are unaffected by any change in the time origin to.
Thus for a strictly stationary time-series, the joint distribution of any set
of observations must be unaffected by shifting the observation interval. A
nonstationary time-series violates these requirements, so that its statistical
description may change in some manner over any selected observation
interval.
What do we mean by 'statistical description'? Clearly a time-series can
be described by numerous statistical measures, some of which, such as the
sample mean ȳ and the sample variance σy², are very well known, i.e.

    ȳ = (1/N) Σ(k=1 to N) y(k);        σy² = (1/N) Σ(k=1 to N) [y(k) − ȳ]²
If we are to provide a reasonably rich description of a fairly complex
time-series, however, it is necessary to examine further the temporal
patterns in the data and consider other, more sophisticated but still
conventional statistical measures. It is necessary to emphasise, however,


that these more sophisticated statistics are all defined for stationary pro-
cesses, since it is clearly much more difficult to consider the definition and
computation of statistical properties that may change over time. We will
leave such considerations, therefore, until Section 3 and restrict discussion
here to the conventional definitions of these statistics.

2.1 Statistical properties of time-series in the time domain: correlation analysis

The mean and variance provide information on the level and the spread
about this level: they define, in other words, the first two statistical moments
of the data. For a time-series, however, it is also useful to evaluate the
covariance properties of the data, as defined by the sample covariance
matrix whose elements are the lag covariances. The covariance at lag n, cn,
is defined as

    cn = (1/N) Σ(k=1 to N−n) [y(k) − ȳ][y(k + n) − ȳ];    n = 0, 1, 2, ..., r − 1

which provides a useful indication of the average statistical relationship of


the time-series separated, or lagged, by n samples. Two other related
time-domain statistical measures that are of particular importance in
characterising the patterns in time-series data are the sample autocorrela-
tion function and the sample partial autocorrelation function, which are
exploited particularly in the time-series analysis procedures proposed by
Box and Jenkins in their classic text on time-series analysis, forecasting
and control.2
For a stationary series, the sample autocorrelation function (AF), r(n),
at lag n is simply the normalised covariance function, defined as,

    r(n) = cn/c0;    n = 0, 1, 2, ..., r − 1
Clearly, there is perfect correlation for lag zero and, by definition,
r(0) = 1·0. If r(n) is insignificantly different from zero for any other lag,
then there is no significant relationship at lag n; if r(n) is close to unity, then
there is a significant positive autocorrelation at lag n; and if r(n) is close to
minus unity, then there is a significant negative autocorrelation at lag n.
Thus for a strongly seasonal series of period p samples, we would expect
the lag p autocorrelation to be highly significant and positive; and the
lag p/2 autocorrelation to be highly significant and negative. Statistical
significance in these terms is discussed, for example, in Ref. 2.
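As a minimal sketch of these definitions, the following Python fragment computes the lag covariances and the sample autocorrelation function; the synthetic seasonal series is purely illustrative and is not one of the data sets discussed in this chapter.

    import numpy as np

    # Illustrative seasonal series: period 12 plus noise (not real data).
    rng = np.random.default_rng(0)
    k = np.arange(120)
    y = np.sin(2 * np.pi * k / 12) + 0.3 * rng.standard_normal(120)

    N = len(y)
    ybar = y.mean()

    def c(n):
        # Lag-n sample covariance, as defined in the text.
        return np.sum((y[:N - n] - ybar) * (y[n:] - ybar)) / N

    r = [c(n) / c(0) for n in range(25)]     # sample autocorrelation function
    print(round(r[12], 2), round(r[6], 2))   # strongly positive at lag p, negative at lag p/2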
The partial autocorrelation function is not such an obvious measure
and derives directly from modelling the time-series as an autoregressive
process of order n, i.e. by assuming that sequential samples from the series
can be related by the following expression,
    y(k) + a1 y(k − 1) + a2 y(k − 2) + ... + an y(k − n) = e(k)

or, in vector terms,

    y(k) = zᵀ(k) a + e(k)

where,

    a = [a1, a2, ..., an]ᵀ;
    z(k) = [−y(k − 1), −y(k − 2), ..., −y(k − n)]ᵀ

in which the superscript T denotes the vector-matrix transpose; ai, i = 1,
2, ..., n are unknown constant coefficients or autoregressive parameters;
and e(k) is a zero mean, serially uncorrelated sequence of random vari-
ables with variance σ², i.e. discrete 'white noise'. This AR(n) model
suggests that there is a linear relationship between y(k) and its n previous
past values y(k - 1), ... ,y(k - n), which is also affected by a stochas-
tic influence in the form of the white noise input e(k); in other words, y(k)
is a weighted linear aggregate of its past values and a 'random shock' e(k).
If we introduce the backward shift operator z⁻ʳ of order r, i.e.
z⁻ʳ y(k) = y(k − r), then this AR(n) process can be written in the transfer
function form,

    y(k) = [1/(1 + a1 z⁻¹ + a2 z⁻² + ... + an z⁻ⁿ)] e(k)
which can be represented in the following block diagram terms,

    e(k) → [ Autoregressive Filter:  1/(1 + a1 z⁻¹ + a2 z⁻² + ... + an z⁻ⁿ) ] → y(k)

where we see that y(k) can be interpreted as the output of an autoregressive


filter which shapes the 'white' noise input e(k) to produce 'colour' or
temporal pattern in y(k). This temporal pattern, or autocorrelation, can be
evaluated quantitatively by the statistical identification of the order n of the
AR process, and the estimation of the associated model parameters ai,
i = 1, 2, ..., n, and σ² (see, e.g. Refs. 2 and 5). Here identification is
normally accomplished by reference to a statistical structure (or order)
identification procedure, of which the best known is that proposed by
Akaike;6 while estimation of the parameters characterising this identified
model structure is normally accomplished by least squares or recursive
least squares procedures, as discussed in Section 3.
The partial autocorrelation function (PAF) exploits the fact that,
whereas it can be shown that the autocorrelation function for an AR(n)
process is theoretically infinite in extent, it can clearly be represented in
terms of a finite number of coefficients, namely the n parameters of the AR
model. This topic is discussed fully in Ref. 2 but, in effect, the PAFs are
obtained by estimating successively AR models of order 1, 2, 3, ... , p by
least squares estimation. The PAF values are then defined as the sequence
of estimates of the last (i.e. nth) coefficient in the AR model at each
successive step, with the 'lag' n = 1, 2, 3, ... , p, where p is chosen as a
sufficiently large integer by the analyst. For a time-series described ade-
quately by an AR(n) process, the PAF should be significantly non-zero for
p ≤ n but will be insignificantly different from zero for p > n: in other
words, the PAF of an nth order autoregressive process should have a
'cut-off' after lag n. In practice, however, this cut-off is often not sufficient-
ly sharp to unambiguously define the appropriate AR order and order
identification criteria such as the AIC seem preferable in practice
(although it can be argued that the AIC tends to over-identify the order).
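The successive-regression construction of the PAF can be sketched in a few lines of Python; note that the sign convention follows the chapter's AR model, in which the coefficients appear on the left-hand side of the equation, and the simulated AR(2) series is illustrative only.

    import numpy as np

    def ar_design(y, m):
        # Regression matrix and target for the AR(m) model of the text:
        # y(k) = z(k)'a + e(k), with z(k) = [-y(k-1), ..., -y(k-m)].
        N = len(y)
        Z = np.column_stack([-y[m - i:N - i] for i in range(1, m + 1)])
        return Z, y[m:]

    def paf(y, p):
        # Successive AR(m) fits, m = 1..p; the PAF value at 'lag' m is the
        # least squares estimate of the last coefficient at each step.
        y = np.asarray(y, dtype=float)
        values = []
        for m in range(1, p + 1):
            Z, target = ar_design(y, m)
            a_hat, *_ = np.linalg.lstsq(Z, target, rcond=None)
            values.append(a_hat[-1])
        return values

    # Quick check on a simulated AR(2) series: the PAF should 'cut off' after lag 2.
    rng = np.random.default_rng(1)
    e = rng.standard_normal(500)
    y = np.zeros(500)
    for k in range(2, 500):
        y[k] = 0.6 * y[k - 1] - 0.3 * y[k - 2] + e[k]
    print([round(float(v), 2) for v in paf(y, 5)])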

2.2 Statistical properties of time-series in the frequency-domain: spectral analysis

Time-domain methods of statistical analysis, such as those discussed
above, are often easier to comprehend than their relatives in the fre-
quency-domain: after all, we observe the series and plot them initially in
temporal terms, so that the first patterns we see in the data are those which
are most obvious in a time-domain context. An alternative way of analys-
ing a time-series is on the basis of Fourier-type analysis, i.e. to assume that
the series is composed of an aggregation of sine and cosine waves with
different frequencies. One of the simplest and most useful procedures
which uses this approach is the periodogram introduced by Schuster in the
late nineteenth century.2
The intensity of the periodogram I(fi) is defined by

    I(fi) = (2/N) {[Σ(k=1 to N) y(k) cos(2πfi k)]² + [Σ(k=1 to N) y(k) sin(2πfi k)]²};
                                                          i = 1, 2, ..., q
where q = (N − 1)/2 for odd N and q = N/2 for even N. The periodo-
gram is then the plot of I(fi) against fi, where fi = i/N is the ith harmonic of
the fundamental frequency 1/N, up to the Nyquist frequency of 0·5 cycles
per sampling interval (which corresponds to the smallest identifiable wave-
length of two samples). Since I(fi) is obtained by multiplying y(k) by sine
and cosine functions of the harmonic frequency, it will take on relatively
large values when this frequency coincides with a periodicity of this
frequency occurring in y(k). As a result, the periodogram maps out the
spectral content of the series, indicating how its relative power varies over
the range of frequencies between fi = 0 and 0·5. Thus, for example,
pronounced seasonality in the series with period T = 1/fi samples will
induce a sharp peak in the periodogram at fi cycles/sample; while if the
seasonality is amplitude modulated or the period is not constant then the
peak will tend to be broader and less well defined.
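A direct, if computationally naive, Python implementation of the periodogram definition is sketched below; the 12-sample sinusoid is illustrative only.

    import numpy as np

    def periodogram(y):
        # Periodogram intensities I(f_i) at the harmonic frequencies f_i = i/N,
        # following the definition in the text.
        y = np.asarray(y, dtype=float)
        N = len(y)
        q = (N - 1) // 2 if N % 2 else N // 2
        k = np.arange(1, N + 1)
        freqs, intensities = [], []
        for i in range(1, q + 1):
            f = i / N
            c = np.sum(y * np.cos(2 * np.pi * f * k))
            s = np.sum(y * np.sin(2 * np.pi * f * k))
            freqs.append(f)
            intensities.append((2.0 / N) * (c ** 2 + s ** 2))
        return np.array(freqs), np.array(intensities)

    # A series with a strong period of 12 samples shows a sharp peak near f = 1/12.
    k = np.arange(1, 145)
    y = 10 + np.sin(2 * np.pi * k / 12)
    f, I = periodogram(y)
    print(round(float(f[np.argmax(I)]), 3))    # approximately 0.083 cycles/sample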
The sample spectrum is simply the periodogram with the frequency fi
allowed to vary continuously over the range 0 to 0·5 cycles, rather than
restricting it to the harmonics of the fundamental frequency (often, as in
later sections of the present chapter, the sample spectrum is also referred
to as the periodogram). This sample spectrum is, in fact, related directly
to the autocovariance function by the relationship,

    I(fi) = 2 {c0 + 2 Σ(k=1 to N−1) λk ck cos(2πfi k)};    0 ≤ fi ≤ 0·5

with λk = 1·0 for all k. In other words, the sample spectrum is the Fourier
cosine transform of the estimate of the autocovariance function. It is
clearly a very useful measure of the relative power of the y(k) at different
frequencies and provides a quick and easy method of evaluating the series
in this regard. It is true that the sample spectrum obtained in this manner
has high variance about the theoretical 'true' spectrum, which has led to
the computation of 'smoothed' estimates obtained by choosing λk in the
above expression to have suitably chosen weights called the lag window.
However, the raw sample spectrum remains a fundamentally important
statistical characterisation of the data which is complementary in the
frequency-domain with the autocovariance or autocorrelation function in
the time-domain.
There is also a spectral representation of the data which is complemen-
tary with the partial autocorrelation function, in the sense that it depends
directly on autoregression estimation: this is the autoregressive spectrum.
Having identified and estimated an AR model for the time-series data in
the time-domain, its frequency-domain characteristics can be inferred by
noting that the spectral representation of the backward shift operator, for
a sampling interval δt, is given by,

    z⁻ʳ = exp(−jrfiδt) = cos(rfiδt) − j sin(rfiδt);    0 ≤ fi ≤ 0·5

so that by substituting for z⁻ʳ, r = 1, 2, ..., n, in the AR transfer
function, it can be represented as a frequency-dependent complex number
of the form A(fi) + jB(fi). The spectrum associated with this represen-
tation is then obtained simply by plotting the squared amplitude A(fi)² +
B(fi)², or its logarithm, either against fi in the range 0 to 0·5 cycles/sample
interval, or against the period 1/fi in samples. This spectrum, which is
closely related to the maximum entropy spectrum,4 is much smoother than
the sample spectrum and appears to resolve spectral peaks rather better
than the more directly smoothed versions of sample spectrum.
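The following Python sketch evaluates such an AR spectrum from a given set of AR coefficients, using the standard substitution z⁻¹ = exp(−j2πf) with f in cycles per sample; the coefficients shown are those of the illustrative AR(2) model used earlier and are not taken from the chapter.

    import numpy as np

    def ar_spectrum(a, freqs):
        # Squared amplitude of the AR transfer function 1/(1 + a1 z^-1 + ... + an z^-n)
        # evaluated at z^-1 = exp(-j 2 pi f), for frequencies f in cycles/sample.
        a = np.asarray(a, dtype=float)
        spec = []
        for f in freqs:
            z_inv = np.exp(-2j * np.pi * f * np.arange(1, len(a) + 1))
            denom = 1.0 + np.sum(a * z_inv)
            spec.append(1.0 / abs(denom) ** 2)
        return np.array(spec)

    f = np.linspace(0.01, 0.5, 100)
    peak = f[np.argmax(ar_spectrum([-0.6, 0.3], f))]
    print(round(float(peak), 2))    # frequency at which this AR(2) spectrum peaks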

3 RECURSIVE ESTIMATION

The AF, PAF, sample spectrum and AR spectrum computations reduce


a stationary time-series to a set of easily digestible plots which describe
very well its statistical properties and provide a suitable platform for
subsequent time-series analysis and modelling. It seems likely, however,
that if we wish to develop a general procedure for modelling nonstationary
time-series, then we should utilise estimation techniques that are able to
handle models, such as the AR process discussed in the last section, with
parameters which are not constant, as in the conventional approach to
time-series analysis, but which may vary over time. This was one of the
major motivations for the development of recursive techniques for Time
Variable Parameter (TVP) estimation, in which the object is to 'model the


parameter variations'7-9 by some form of stochastic state-space model.
Such TVP models have been in almost continual use in the control and
system's field since the early 1960s, when Kopp and Orford,10 and Lee11
pioneered their use in the wake of Kalman's seminal paper on state
variable filtering and estimation theory.12 The present first author made
liberal use of this same device in the 1960s within the context of self
adaptive control7,8,13-16 and, in the early 1970s, reminded a statistical
audience of the extensive system's literature on recursive estimation (see
Refs. 17 and 18; also the comments of W.D. Ray on Ref. 19). Since the
early 1970s, time-varying parameter models have also been proposed and
studied extensively in the statistical econometrics literature. Engle et al.
have presented a brief review of this literature and discuss an interesting
application to electricity sales forecasting, in which the model is a time-
variable parameter regression plus an adaptive trend. 20
The best known recursive estimation procedure is the recursive least
squares (RLS) algorithm.5,9,21 It is well known that the least squares
estimates of the parameters in the AR(n) model are obtained by choosing
that estimate â of the parameter vector a which minimises the least squares
cost function J, where

    J = Σ(k=n+1 to N) [y(k) − zᵀ(k)â]²

and that this minimisation, obtained simply by setting the gradient of J
with respect to â equal to zero in the usual manner, i.e.

    ∂J/∂â = −2 Σ(k=n+1 to N) z(k)[y(k) − zᵀ(k)â] = 0

results in an estimate â(N), for an N sample observation interval, which
is obtained by the solution of the normal equations of least squares
regression analysis,

    [Σ(k=n+1 to N) z(k)zᵀ(k)] â(N) = Σ(k=n+1 to N) z(k)y(k)

Alternatively, at an arbitrary kth sample within the N observations, the
RLS estimate â(k) can be obtained from the following algorithm:

    â(k) = â(k − 1) + g(k){y(k) − zᵀ(k)â(k − 1)}    (1)

where,

    g(k) = P(k − 1)z(k)[σ² + zᵀ(k)P(k − 1)z(k)]⁻¹
and,

    P(k) = P(k − 1) − P(k − 1)z(k)[σ² + zᵀ(k)P(k − 1)z(k)]⁻¹ zᵀ(k)P(k − 1)    (2)
In this algorithm, y(k) is the kth observation of the time-series data; â(k)
is the recursive estimate at the kth recursion of the autoregressive para-
meter vector a, as defined for the AR model; and P(k) is a symmetric,
n × n matrix which provides an estimate of the covariance matrix associ-
ated with the parameter estimates. A full derivation and description of this
algorithm, the essence of which can be traced back to Gauss at the
beginning of the nineteenth century, is given in Ref. 9. Here, it will suffice
to note that eqn (1) generates an estimate â(k) of the AR parameter vector
a at the kth instant by updating the estimate â(k − 1) obtained at the previous
(k − 1)th instant in proportion to the prediction error e(k/k − 1), where

    e(k/k − 1) = y(k) − zᵀ(k)â(k − 1)

is the error between the latest sample y(k) and its predicted value
zᵀ(k)â(k − 1), conditional on the estimate â(k − 1) at the (k − 1)th instant.
This update is controlled by the vector g(k), which is itself a function of
the covariance matrix P(k). As a result, the magnitude of the recursive
update is seen to be a direct function of the confidence that the algorithm
associates with the parameter estimates at the (k − 1)th sampling instant:
the greater the confidence, as indicated by a P(k − 1) matrix with
elements having low relative values, the smaller the attention paid to the
prediction error, since this is more likely to be due to the random noise
input e(k) and less likely to be due to estimation error on â(k − 1).
In the RLS algorithm shown above, there is an implicit assumption that
the parameter vector a is time-invariant. In the recursive TVP version of
the algorithm, on the other hand, this assumption is relaxed and the
parameter may vary over the observation interval to reflect some changes
in the statistical properties of the time-series y(k), as described by an
assumed stochastic model of the parameter variations. In the present
chapter, we make extensive use of recursive TVP estimation. In particular,
we exploit the excellent spectral properties of certain recursive TVP esti-
mation and smoothing algorithms to develop a practical and unified
approach to adaptive time-series analysis, forecasting and seasonal adjust-
ment, i.e. where the results of TVP estimation are used recursively to
update the forecasts or seasonal adjustments to reflect any nonstationary
or nonlinear characteristics in y(k).
The approach is based around the well-known 'structural' or 'com-
ponent' time-series model and, like previous state-space solutions,22-24 it


employs the standard Kalman filter-type12 recursive algorithms. (The term
'structural' has been used in other connections in both the statistical and
economics literatures and so we will employ the latter term.) Except in
the final forecasting and smoothing stages of the analysis, however, the
justification for using these algorithms is not the traditional one based on
'optimality' in a prediction error or maximum likelihood (ML) sense.
Rather, the algorithms outlined here are utilised in a manner which allows
for straightforward and effective spectral decomposition of the time series
into quasi-orthogonal components. A unifying element in this analysis is
the modelling of nonstationary state variables and time variable para-
meters by a stochastic model in the form of a class of second order random
walk equations. As we shall see, this simple device not only facilitates the
development of the spectral decomposition algorithms but it also injects
an inherent adaptive capability which can be exploited in both forecasting
and seasonal adjustment.

4 THE TIME-SERIES MODEL

Although the analytical procedures proposed in this paper can be applied


to multivariable (vector) processes,25 we will restrict the discussion, for
simplicity of exposition, to the following component model of a univariate
(scalar) time-series y(k),
y(k) = t(k) + p(k) + e(k) (3)
where, t(k) is a low frequency or trend component; p(k) is a perturbational
component around the long period trend which may be either a zero mean,
stochastic component with fairly general statistical properties; or a sus-
tained periodic or seasonal component; and, finally, e(k) is a zero mean,
serially uncorrelated, discrete white noise component, with variance σ².
Component models such as eqn (3) have been popular in the literature
on econometrics and forecasting 26,27 but it is only in the last few years that
they have been utilised within the context of state-space estimation. Prob-
ably the first work of this kind was by Harrison and Stevens19,22 who
exploited state-space methods by using a Bayesian interpretation applied
to their 'Dynamic Linear Model', which is related to models we shall
discuss here. More recent papers which exemplify this state-space
approach and which are particularly pertinent to the present paper, are
those of Jakeman & Young,28 Kitagawa & Gersch,29 and Harvey.24
In the state-space approach, each of the components t(k) and p(k) is


modelled in a manner which allows the observed time series y(k) to be
represented in terms of a set of discrete-time state equations. These state
equations then form the basis for recursive state estimation, forecasting
and smoothing based on the recursive filtering equations of Kalman.12 In
order to exemplify this process, within the present context, we consider
below a simple TVP, linear regression model for the sum of t(k) + p(k).
It should be noted, however, that this model is a specific example of the
more general component models discussed in Refs. 30-33.

4.1 The dynamic linear regression (DLR) model


In the general DLR model, it is assumed that y(k) in eqn (3) can be
expressed in the form of a linear regression with time-variable coefficients
ci(k), i = 0, 1, 2, ..., n, i.e.

    y(k) = c0(k) + c1(k)x1(k) + c2(k)x2(k) + ... + cn(k)xn(k) + e(k)    (4)

where c0(k) = t(k) represents the time-variable trend component; ci(k),
i = 1, 2, ..., n, are time-variable parameters or regression coefficients,
with the associated regression variables xi(k) selected so that the sum
of the terms ci(k)xi(k) provides a suitable representation of the pertur-
bational component p(k). For example, if xi(k) = y(k − i), then the
model for p(k) is a dynamic autoregression (DAR) in y(k), i.e. an auto-
regression model with time-variable parameters.
Clearly, an important and novel aspect of this model is the presence of
the time-variable parameters. In order to progress further with the iden-
tification and estimation of the model, therefore, it is necessary to make
certain assumptions about the nature of the time-variation. Here, we
assume that each of the n + 1 coefficients ci(k) can be modelled by the
following stochastic, second order, generalised random walk (GRW)
process,

    xi(k) = Fi xi(k − 1) + Gi ηi(k − 1)    (5)

where,

    xi(k) = [ci(k)  di(k)]ᵀ  and  ηi(k) = [ηi1(k)  ηi2(k)]ᵀ

and,

    Fi = | 1  β |        Gi = | 1  0 |
         | 0  γ |             | 0  1 |

Here, di(k) is a second state variable, the function of which is discussed
below; while ηi1(k) and ηi2(k) represent zero mean, serially uncorrelated,
discrete white noise inputs, with the vector ηi(k) normally characterised by
a covariance matrix Qi, i.e.

    E{ηi(k) ηi(j)ᵀ} = Qi δkj;     δkj = 1 for k = j,  0 for k ≠ j

where, unless there is evidence to the contrary, Qi is assumed to be
diagonal in form with unknown diagonal elements qi11 and qi22, respectively.
This GRW model subsumes, as special cases,9 the very well-known
random walk (RW: β = γ = 0; ηi2(k) = 0); and the integrated random
walk (IRW: β = γ = 1; ηi1(k) = 0). In the case of the IRW, we see that
ci(k) and di(k) can be interpreted as level and slope variables associated
with the variations of the ith parameter, with the random disturbance
entering only through the di(k) equation. If ηi1(k) is non-zero, however,
then both the level and slope equations can have random fluctuations
defined by ηi1(k) and ηi2(k), respectively. This variant has been termed the
'Linear Growth Model' by Harrison & Stevens.19,22
The advantage of these random walk models is that they allow, in a very
simple manner, for the introduction of nonstationarity into the regression
model. By introducing a parameter variation model of this type, we are
assuming that the time-series can be characterised by a stochastically
variable mean value, arising from c0(k) = t(k) and a perturbational com-
ponent with potentially very rich stochastic properties deriving from the
TVP regression terms. The nature of this variability will depend upon the
specific form of the GRW chosen: for instance, the IRW model is par-
ticularly useful for describing large smooth changes in the parameters;
while the RW model provides for smaller scale, less smooth variations.
As we shall see below, these same models can also be used to handle large,
abrupt changes or discontinuities in the level and slope of either the trend
or the regression model coefficients.
The state space representation of this dynamic regression model is
obtained simply by combining the GRW models for the n + 1 parameters
into the following composite state space form,

    x(k) = F x(k − 1) + G η(k − 1)    (6a)
    y(k) = H x(k) + e(k)    (6b)

where the composite state vector x(k) is composed of the ci(k) and di(k)
parameters, i.e.

    x(k) = [c0(k) d0(k) c1(k) d1(k) ... cn(k) dn(k)]ᵀ
the stochastic input vector η(k) is composed of the disturbance terms to the
GRW models for each of the time-variable regression coefficients, i.e.

    η(k)ᵀ = [η01(k) η02(k) η11(k) η12(k) ... ηn1(k) ηn2(k)]

while the state transition matrix F, the input matrix G and the observation
matrix H are defined as,

    F = | F0  0   ...  0  |        G = | G0  0   ...  0  |
        | 0   F1  ...  0  |            | 0   G1  ...  0  |
        | .   .   .    .  |            | .   .   .    .  |
        | 0   0   ...  Fn |            | 0   0   ...  Gn |

    H = [1  0  x1(k)  0  x2(k)  0  ...  xn(k)  0]
In other words, the observation eqn (6b) represents the regression model
(4); with the state equations in (6a) describing the dynamic behaviour of
the regression coefficients; and the disturbance vector η(k) in eqn (6a)
defined by the disturbance inputs to the constituent GRW sub-models.
We have indicated that one obvious choice for the definition of the
regression variables xi(k) is to set them equal to the past values of y(k), so
allowing the perturbations to be described by a TVP version of the AR(n)
model. This is clearly a sensible choice, since we have seen in Section 2 that
the AR model provides a most useful description of a stationary stochastic
process, and we might reasonably assume that, in its TVP form, it provides
a good basis for describing nonstationary time-series. If the perturbational
component is strongly periodic, however, the spectral analysis in Section 2
suggests an alternative representation in the form of the dynamic harmonic
regression (DHR) model.35 Here t(k) is defined as in the DAR model but
p(k) is now defined as the linear sum of sine and cosine variables in F
different frequencies, suitably chosen to reflect the nature of the seasonal
variations, i.e.

    p(k) = Σ(i=1 to F) a1i(k) cos(2πfi k) + a2i(k) sin(2πfi k)    (7)

where the regression coefficients aji(k), j = 1, 2 and i = 1, 2, ..., F are
assumed to be time-variable, so that the model is able to handle any
nonstationarity in the seasonal phenomena. The DHR model is then in the
form of the regression eqn (4), with c0(k) = t(k), as before, and appro-
priate definitions for the remaining ci(k) coefficients, in terms of aji(k). The
integer n, in this case, has to be set equal to 2F, so that the regression
variables xi(k), i = 1, 2, ..., 2F can be defined in terms of the sine and
cosine components, sin(2πfi k) and cos(2πfi k), i = 1, 2, ..., F, respec-
tively, i.e.

    H = [1 0 cos(2πf1 k) 0 sin(2πf1 k) 0 ... cos(2πfF k) 0 sin(2πfF k) 0]

Finally, it should be noted that, since there are two parameters associated
with each frequency component, the changes in the amplitude Ai(k) of each
component, as defined by

    Ai(k) = √[a1i(k)² + a2i(k)²]


provides a useful indication of the estimated amplitude modulation on
each of the frequency components. In many practical situations, the
overall periodic behaviour can be considered in terms of a primary fre-
quency and its principal harmonics, e.g. a 12-month annual cycle and its
harmonics at periods of 6, 4, 3, 2·4 and 2 months, respectively. As a result,
Ai(k), i = 1, 2, ..., F will represent the amplitude of these components
and will provide information on how the individual harmonics vary over
the observation interval.
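As an illustrative sketch only, the following Python fragment assembles the DHR observation vector H for a 12-sample cycle and its principal harmonics, and evaluates the amplitude of one component from a pair of invented coefficients.

    import numpy as np

    # Fundamental 12-sample cycle and its principal harmonics (periods 12, 6, 4, 3, 2.4, 2).
    periods = [12, 6, 4, 3, 2.4, 2]
    freqs = [1.0 / p for p in periods]          # f_i in cycles/sample
    F = len(freqs)

    def dhr_observation_vector(k):
        # H(k) = [1 0 cos(2*pi*f1*k) 0 sin(2*pi*f1*k) 0 ... cos(2*pi*fF*k) 0 sin(2*pi*fF*k) 0]
        h = [1.0, 0.0]
        for f in freqs:
            h += [np.cos(2 * np.pi * f * k), 0.0, np.sin(2 * np.pi * f * k), 0.0]
        return np.array(h)

    print(dhr_observation_vector(3).round(2))   # length 2 + 4F = 26

    # Amplitude of one harmonic from its two (here invented) coefficients a_1i(k), a_2i(k).
    a1, a2 = 0.8, -0.4
    print(np.sqrt(a1 ** 2 + a2 ** 2))           # A_i(k)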

5 THE RECURSIVE FORECASTING AND SMOOTHING ALGORITHMS

In this chapter, recursive forecasting and smoothing are achieved using
algorithms based on optimal state-space (Kalman) filtering and fixed-
interval smoothing equations. The Kalman filtering algorithm12 is, of
course, well known and can be written most conveniently in the following
general 'prediction-correction' form,9
Prediction:

    x̂(k/k-1) = F x̂(k-1)
    P(k/k-1) = F P(k-1) F^T + G Q_r G^T                               (8)

Correction:

    x̂(k) = x̂(k/k-1) + P(k/k-1) H^T [1 + H P(k/k-1) H^T]^{-1}
           × {y(k) - H x̂(k/k-1)}
    P(k) = P(k/k-1) - P(k/k-1) H^T [1 + H P(k/k-1) H^T]^{-1} H P(k/k-1)   (9)
In these equations, we use x̂(k) to denote the estimate of the composite
state vector x of the complete state-space model (eqn (6)). The reader
should note the similarity between the correction eqns (9) and the RLS
algorithm discussed in Section 3. This is no coincidence: the RLS algo-
rithm is a special case of the Kalman filter in which the unknown 'states'
are considered as time-invariant parameters.9
Since the models used here are all characterised by a scalar observation
equation, the filtering algorithm (8) and (9) has been manipulated into the
well-known form (see, e.g. Ref. 9) where the observation noise variance is
normalised to unity and Q_r represents a 'noise variance ratio' (NVR)
matrix. This matrix and the P(k) matrix are then both defined in relation
to the white measurement noise variance σ², i.e.

    P(k) = P*(k)/σ²                                                   (10)

where P*(k) is the error covariance matrix associated with the state
estimates. In the RW and IRW models, moreover, there is only a single
white noise disturbance input term to the state equations for each
unknown parameter, so that only a single, scalar NVR value has to be
specified by the analyst in these cases.
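A minimal numpy sketch of one prediction-correction step in this NVR-normalised form (eqns (8)-(10)) is given below; the function name is illustrative, and the scalar-observation simplification discussed in the text is assumed.

```python
import numpy as np

def kalman_step(x, P, y_k, F, G, H, Qr):
    """One prediction-correction step of eqns (8) and (9).

    P is the NVR-normalised covariance P(k) = P*(k)/sigma^2 of eqn (10),
    so the observation noise variance is implicitly unity and Qr is the
    noise variance ratio (NVR) matrix.
    """
    # Prediction (eqn 8)
    x_pred = F @ x
    P_pred = F @ P @ F.T + G @ Qr @ G.T

    # Correction (eqn 9); the observation is scalar, so [1 + H P H^T]
    # is a scalar and no matrix inversion is required
    innovation = y_k - H @ x_pred
    s = 1.0 + H @ P_pred @ H
    gain = P_pred @ H / s
    x_new = x_pred + gain * innovation
    P_new = P_pred - np.outer(gain, H @ P_pred)
    return x_new, P_new

# Typical use: iterate over the data, x, P = kalman_step(x, P, y[k], F, G, H, Qr)
```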
In fixed interval smoothing, an estimate x̂(k/N) of x(k) is obtained
which provides an estimate of the state at the kth sampling instant based
on all N samples over the observation interval (in contrast to the filtered
estimate x̂(k) = x̂(k/k), which is only based on the data up to the kth
sample). There are a variety of algorithms for off-line, fixed interval
smoothing but the one we will consider here utilises the following back-
wards recursive algorithm, subsequent to application of the above
Kalman filtering forwards recursion,9,34
    x̂(k/N) = F^{-1}[x̂(k+1/N) + G Q_r G^T L(k-1)]                      (11a)

where,

    L(N) = 0

N is the total number of observations (the 'fixed interval'); and

    L(k) = [I - P(k+1) H^T H]^T
           × {F^T L(k+1) - H^T [y(k+1) - H F x̂(k)]}                   (11b)

is an associated backwards recursion for the 'Lagrange Multiplier' vector
L(k) required in the solution of this two-point boundary value problem.
An alternative update equation to eqn (11a) is the following,

    x̂(k/N) = x̂(k/k) - P(k) F^T L(k)                                   (11c)
where we see that x̂(k/N) is obtained by reference to the estimate x̂(k/k)
generated during the forward recursive (filtering) algorithm. Finally, the
covariance matrix P*(k/N) = σ²P(k/N) for the smoothed estimate
is obtained by reference to P(k/N) generated by the following matrix
recursion,

    P(k/N) = P(k) + P(k) F^T [P(k+1/k)]^{-1}
             × {P(k+1/N) - P(k+1/k)} [P(k+1/k)]^{-1} F P(k)           (12)

while the smoothed estimate of the original series y(k) is given simply by,

    ŷ(k/N) = H x̂(k/N)                                                 (13)
i.e. the appropriate linear combination of the smoothed state variables.
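For reference, the following sketch implements fixed-interval smoothing in the standard Rauch-Tung-Striebel form, which shares the covariance recursion of eqn (12); it is an equivalent alternative to the Lagrange-multiplier recursion (11a)-(11b), not a transcription of it, and the stored-array layout is an assumption for illustration.

```python
import numpy as np

def fixed_interval_smooth(x_filt, P_filt, x_pred, P_pred, F):
    """Backward smoothing pass over stored forward-filter output.

    x_filt[k], P_filt[k] : x_hat(k/k), P(k) from the forward (Kalman) pass
    x_pred[k], P_pred[k] : x_hat(k/k-1), P(k/k-1) from the prediction step
    Returns x_hat(k/N), P(k/N) for k = 0, ..., N-1.
    """
    N = len(x_filt)
    x_sm = [None] * N
    P_sm = [None] * N
    x_sm[-1], P_sm[-1] = x_filt[-1], P_filt[-1]           # x_hat(N/N), P(N)
    for k in range(N - 2, -1, -1):
        # Smoother gain P(k) F^T [P(k+1/k)]^{-1}, as in eqn (12)
        A = P_filt[k] @ F.T @ np.linalg.inv(P_pred[k + 1])
        x_sm[k] = x_filt[k] + A @ (x_sm[k + 1] - x_pred[k + 1])
        P_sm[k] = P_filt[k] + A @ (P_sm[k + 1] - P_pred[k + 1]) @ A.T
    return x_sm, P_sm
```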
Provided we have estimates for the unknown parameters in the state
space model (6), then the procedures for smoothing and forecasting of y(k)
now follow by the straightforward application of these state-space
filtering/smoothing algorithms. This allows directly for the following
operations:
(1) Forecasting. The f-step-ahead forecasts of the composite state vector
x(k) in eqn (6) are obtained at any point in the time-series by repeated
application of the prediction eqns (8) which, for the complete model, yields
the equation,

    x̂(k+f/k) = F^f x̂(k)                                               (14)

where f denotes the forecasting period. The associated forecast of y(k) is
provided by,

    ŷ(k+f/k) = H x̂(k+f/k)                                             (15)

with the variance of this forecast computed from

    var{ỹ(k+f/k)} = σ²[1 + H P(k+f/k) H^T]                            (16)

where ỹ(k+f/k) is the f-step-ahead prediction error, i.e.

    ỹ(k+f/k) = y(k+f) - ŷ(k+f/k)
In relation to more conventional alternatives to forecasting, such as those
of Box & Jenkins,2 the present state-space approach, with its inherent
structural decomposition, has the advantage that the estimates and fore-
casts of individual component state variables can also be obtained simply
as by-products of the analysis. For example, it is easy to recover the
estimate and forecast of the trend component, which provides a measure
of the underlying 'local' trend at the forecasting origin.
(2) Forward interpolation. Within this discrete-time setting, the process
of forward interpolation, in the sense of estimating the series y(k) over a
section of missing data, based on the data up to that point, follows
straightforwardly; the missing data points are accommodated in the usual
manner by replacing the observation y(k) by its conditional expectation
(or predicted value) ŷ(k/k-1) and omitting the correction eqns (9).
(3) Smoothing. Finally, the smoothed estimate ŷ(k/N) of y(k) for all
values of k is obtained directly from eqn (13); and associated smoothed
estimates of all the component states are available from eqn (11). Smooth-
ing can, of course, provide a superior interpolation over gaps in the data,
in which the interpolated points are now based on all of the N samples.
A short numerical sketch of the forecasting and interpolation operations
is given below.
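The sketch below is illustrative only (H is treated as constant for brevity, whereas for the DHR model it varies with the sample index); it shows how the f-step-ahead forecast of eqns (14)-(16) and the missing-data interpolation of operation (2) both reduce to running the prediction step without the correction.

```python
import numpy as np

def forecast(x_k, P_k, F, G, H, Qr, f, sigma2):
    """f-step-ahead forecast of y (eqns 14-16) from the filtered state at time k."""
    x, P = x_k.copy(), P_k.copy()
    for _ in range(f):                        # repeated prediction, no correction
        x = F @ x
        P = F @ P @ F.T + G @ Qr @ G.T
    y_hat = H @ x                             # eqn (15)
    y_var = sigma2 * (1.0 + H @ P @ H)        # eqn (16)
    return y_hat, y_var

def filter_with_gaps(y, x0, P0, F, G, H, Qr):
    """Forward pass in which missing observations (NaN) skip the correction step."""
    x, P = x0, P0
    estimates = []
    for y_k in y:
        x_pred = F @ x
        P_pred = F @ P @ F.T + G @ Qr @ G.T
        if np.isnan(y_k):                     # operation (2): use the prediction only
            x, P = x_pred, P_pred
        else:                                 # otherwise apply the correction (eqn 9)
            s = 1.0 + H @ P_pred @ H
            gain = P_pred @ H / s
            x = x_pred + gain * (y_k - H @ x_pred)
            P = P_pred - np.outer(gain, H @ P_pred)
        estimates.append(H @ x)               # one-step estimate of y at each sample
    return np.array(estimates)
```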

6 IDENTIFICATION AND ESTIMATION OF THE TIME-SERIES MODEL

The problems of structure identification and subsequent parameter esti-
mation for the complete state space model (6) are clearly non-trivial. From
a theoretical standpoint, the most obvious approach is to formulate the
problem in Maximum Likelihood (ML) terms. If the stochastic distur-
bances in the state-space model are normally distributed, the likelihood
function for the observations may then be obtained from the Kalman
Filter via 'prediction error decomposition'.36 For a suitably identified
model, therefore, it is possible, in theory, to maximise the likelihood with
respect to any or all of the unknown parameters in the state-space model,
using some form of numerical optimisation.
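As a concrete illustration of prediction error decomposition (a sketch under the scalar, NVR-normalised observation assumption used above; the function name is invented), the Gaussian log-likelihood can be accumulated from the one-step-ahead innovations of the forward Kalman pass.

```python
import numpy as np

def prediction_error_log_likelihood(y, x0, P0, F, G, H, Qr, sigma2):
    """Gaussian log-likelihood built from the innovations of the forward pass."""
    x, P = x0, P0
    loglik = 0.0
    for y_k in y:
        x = F @ x                                    # prediction (eqn 8)
        P = F @ P @ F.T + G @ Qr @ G.T
        v = y_k - H @ x                              # one-step-ahead prediction error
        s = sigma2 * (1.0 + H @ P @ H)               # its variance (cf. eqn 16, f = 1)
        loglik += -0.5 * (np.log(2 * np.pi * s) + v ** 2 / s)
        gain = P @ H / (1.0 + H @ P @ H)             # correction (eqn 9)
        x = x + gain * v
        P = P - np.outer(gain, H @ P)
    return loglik
```

In practice this function would be handed to a numerical optimiser over the unknown NVR (and any other) parameters, which is precisely the step that the spectral decomposition approach described below is designed to avoid.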
This kind of maximum likelihood approach has been tried by a number
of research workers but their results (e.g. Ref. 37, which provides a good
review of competing methods of optimisation) suggest that it can be quite
complex, even in the case of particularly simple models such as those
proposed here. In addition, it is not easy to solve the ML problem in
practically useful and completely recursive terms, i.e. with the parameters
being estimated recursively, as well as the states.
The alternative spectral decomposition (SD) approach used in this
chapter has been described fully in Refs. 30-33 and 38; the reader is
referred to these papers for detailed information on the method. It is based
on the application of the state-space, fixed interval smoothing algorithms
discussed in Section 5, as applied to the component models such as eqn (3).
In particular, it exploits the excellent spectral properties of the smoothing
algorithms derived in this manner to decompose the time-series into
quasi-orthogonal estimates of the trend and perturbational components.
This considerably simplifies the solution, albeit at the cost of strict opti-
mality in a ML sense, and yields an analytical procedure that is completely
recursive in form and robust in practical application. It also allows for
greater user-interaction than other alternatives. As we shall see, this
procedure is well suited for adaptive implementations of both on-line
forecasting and off-line smoothing.
In relation to more conventional approaches to data analysis, we can
consider that SD replaces better known, and more ad hoc, filtering tech-
niques, * by a unified, more sophisticated and flexible approach based on
the recursive smoothing algorithms.

6.1 The IRWSMOOTH algorithm


A full discussion of the spectral properties of the fixed interval smoothing
algorithms discussed in this chapter is given in the various references cited
above. In order to exemplify the approach, however, it is interesting to
consider how the smoothing algorithms can be used in a sub-optimal
fashion to obtain a smoothed estimate t̂(k/N) of the trend component,
t(k). This is accomplished quite simply by applying the forward-pass
filtering and backward-pass smoothing equations directly to the model (3)
with the p(k) term removed. In this manner, y(k) is modelled simply as the
sum of a trend, represented by a GRW process, and the white noise
disturbance e(k). This IRWSMOOTH algorithm will normally be sub-
optimal in a minimum variance sense, since it seems unlikely that y(k)
could, in general, be described by such a simplistic model: for most real
environmental data, the perturbations about the long term trend, i.e.
b(k) = y(k) - t(k), are likely to be highly coloured or periodic in form
and white noise perturbations seem most unlikely. Nevertheless, the algo-
rithm is a most useful one in practical terms and, while sub-optimal in the
context of the minimum variance Kalman filter, it can be justified in terms
of its more general low-pass filtering properties.
*Such as the use of centralised moving averaging or smoothing spline functions for
trend estimation; and constant parameter harmonic regression (HR) or equivalent
techniques for modelling the seasonal cycle; see Ref. 39 which uses such procedures
with the HR based on the first two harmonic components associated with the
annual CO2 cycle.

Fig. 2. Spectral properties of the recursive smoothing algorithms: (a) ampli-
tude spectrum of the recursive IRWSMOOTH algorithm; (b) amplitude
spectrum of the recursive DHRSMOOTH algorithm (amplitude plotted
against frequency, in cycles per interval, for a range of NVR values).

These filtering properties are illustrated in Fig. 2. The amplitude spec-
trum for the IRWSMOOTH algorithm is shown in Fig. 2(a) which demon-
strates that the behaviour of the algorithm is controlled by the selected
NVR value, which is simply a scalar in this case because the IRW model
has only a single white noise input disturbance. It is clear that this scalar
NVR, which is the only parameter to be specified by the analyst, defines
the 'bandwidth' of the smoothing filter. The phase characteristics are not
shown, since the algorithm is of the 'two-pass' smoothing type and so
exhibits zero phase lag for all frequencies. We see from Fig. 2(a) that the
IRWSMOOTH algorithm is a very effective 'low-pass' filter, with par-
ticularly sharp 'cut-off' properties for low values of the NVR. The rela-
tionship between log10(F50), where F50 is the 50% cut-off frequency,
and log10(NVR) is approximately linear over the useful range of NVR
values, so that the NVR which provides a specified cut-off frequency can
be obtained from the following approximate relationship,35

    NVR = 1605[F50]^4                                                 (17)
In this manner, the NVR which provides specified low-pass filtering
characteristics can be defined quite easily by the analyst. For an inter-
mediate value of NVR = 0·0001 (cut-off frequency = 0·016 cycles/sample),
for example, the estimated trend reflects the low frequency movements in
the data while attenuating higher frequency components; for NVR = 0
the bandwidth is also zero and the algorithm yields a linear trend with
constant slope; and for large values greater than 10, the estimated 'trend'
almost follows the data and the associated derivative estimate d̂(k) pro-
vides a good smoothed numerical differentiation of the data. The band-
pass nature of the DHR recursive smoothing algorithm (DHRSMOOTH)
is clear from Fig. 2(b) and a similarly simple relationship once again exists
between the bandwidth and the NVR value. These convenient bandwidth-
NVR relationships for IRWSMOOTH and DHRSMOOTH are useful in
the proposed procedure for spectral decomposition discussed below.
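In code, eqn (17) and its inverse are one-liners (the helper names are illustrative):

```python
def nvr_for_cutoff(f50):
    """NVR giving a specified 50% cut-off frequency (cycles/sample), eqn (17)."""
    return 1605.0 * f50 ** 4

def cutoff_for_nvr(nvr):
    """Inverse of eqn (17): 50% cut-off frequency for a given NVR."""
    return (nvr / 1605.0) ** 0.25

print(nvr_for_cutoff(0.016))       # roughly 1e-4, as quoted in the text
print(cutoff_for_nvr(0.000005))    # roughly 0.0075 cycles/sample
```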
Clearly, smoothing algorithms based on other simple random walk and
TVP models can be developed: for instance, the double integrated random
walk (DIRW, see Refs. 9 and 40) smoothing algorithm has even sharper
cut-off characteristics than the IRW, but its filtering characteristics exhibit
much higher levels of distortion at the ends of the data set. 40

6.2 Parametric nonstationarity and variance intervention


As has been pointed out, the response characteristics of environmental
systems are clearly subject to change with the passage of time. As a result,
the time-series obtained by monitoring such systems will be affected by
these changes and any models that are based on the analysis of these
time-series will need to reflect the nature of this nonstationarity, for
example by changes in the model parameters. Similarly, many environ-
mental systems exhibit nonlinear dynamic behaviour and linear, small
perturbation models of such systems will need to account for the nonlinear
dynamics by allowing for time-variable parameters. However, the nature
of such parametric variation is difficult to predict. While changes in
environmental systems are often relatively slow and smooth, more rapid
and violent changes do occur from time-to-time and lead to similarly rapid
changes, or even discontinuities, in the related time series. One approach
to this kind of problem is 'intervention analysis',41 but an alternative,
recursively based approach is possible which exploits the special nature of
the GRW model.
If the variances q_{i11} and q_{i22} are assumed constant, then the GRW
model, in its RW or IRW form, can describe a relatively wide range of
smooth variation in the associated trend or regression model parameters.
Moreover, if we allow these variances to change over time, then an even
wider range of behaviour can be accommodated. In particular, large, but
otherwise arbitrary, instantaneous changes in q_{i11} and q_{i22} (e.g. increases to
values ≥ 10²) introduced at selected 'intervention' points, can signal to the
associated estimation algorithm the possibility of significant changes in the
level or slope, respectively, of the modelled variable (i.e. the trend or
parameter estimates in the present context) at these same points. The
sample number associated with such intervention points can be identified
either objectively, using statistical detection methods;42 or more subjectively
by the analyst. 43
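In implementation terms, variance intervention simply inflates the relevant diagonal entries of the NVR matrix Q_r at the chosen sample numbers; a minimal sketch (the function and argument names are invented) is given below.

```python
import numpy as np

def qr_with_intervention(Qr_nominal, k, interventions, boost=100.0):
    """Return the NVR matrix to use at sample k.

    interventions : dict mapping sample number -> indices on the diagonal of
    Q_r (i.e. which q_i11/q_i22 terms) to inflate at that sample.
    """
    Qr = Qr_nominal.copy()
    if k in interventions:
        for idx in interventions[k]:
            Qr[idx, idx] *= boost          # large instantaneous increase
    return Qr

# Example: allow a level change in the trend (state index 0) at sample 57
# Qr_k = qr_with_intervention(Qr, k, {57: [0]})
```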
It is interesting to note that this same device, which is termed 'variance
intervention',43 can be applied to any state-space or TVP model:
Young,7,8,14-16 for example, has used a similar approach to track the
significant and rapid changes in the level of the model parameters of an
aerospace vehicle during a rocket boost phase.

6.3 The DHR model and seasonal adjustment


Recursive identification and estimation of the DHR model is straight-
forward. In the case where the regression parameters are assumed con-
stant, the normal recursive least squares (RLS) algorithm can be used.
When the parameters are assumed time-variable, then it is simply neces-
sary to represent the variations by the GRW model, with or without
variance intervention as appropriate, and use the recursive least squares
filtering and fixed interval smoothing algorithms outlined in Section 5.
Recursive state-space forecasting then follows straightforwardly by appli-
cation of the estimation and forecasting eqns (7), (8) and (14)-(16).
If off-line analysis of the nonstationary time-series is required, then the
recursive fixed interval smoothing eqns (11)-(13) can be used to provide
smoothed estimates of the model components and any associated time-
variable parameters. The main effect of allowing the parameters and,
therefore, the amplitude and phase of the identified seasonal components
to vary over time in this manner, is to include in the estimated seasonal
component, components of other frequencies with periods close to the
principal period. As pointed out above, the chosen NVR then controls the
band of frequencies that are taken into account by the DHRSMOOTH
algorithm. If it is felt that the amplitude variations in the seasonal com-
ponent are related to some known longer period fluctuations, then such
prior knowledge can be used to influence the choice of the NVR.35
If the DHR model is identified and estimated for all the major periodic
components characterising the data (i.e. those components that are associ-
ated with the peaks in the periodogram or AR spectrum) then the
DHRSMOOTH algorithm can be used to construct and remove these
nonstationary 'seasonal' components in order to yield a 'seasonally
adjusted' data set. Such an 'adaptive' seasonal adjustment procedure is, of
course, most important in the evaluation of business and economic data,
where existing SA methods, such as the Census X-11 procedure,44 which
uses a procedure based on centralised moving average (CMA) filters, are
well established. The proposed DHR-based approach offers various
advantages over techniques such as X-11: by virtue of variance interven-
tion, it can handle large and sudden changes in the dynamic characteristics
of the series, including amplitude and phase changes; it is not limited to
the normally specified seasonal periods (i.e. annual periods of 12 months
or 4 quarters); and it is more objective and simpler to apply in prac-
tice.31,32,35,40 In this sense, it provides an excellent vehicle for seasonal
adjustment of environmental data.
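Once smoothed estimates of the DHR coefficients are available, the adaptive seasonal adjustment itself is a single subtraction; a minimal sketch (illustrative names, with the smoothed coefficients assumed stored as arrays over the sample index) is:

```python
import numpy as np

def seasonally_adjust(y, theta1, theta2, freqs):
    """Subtract the estimated nonstationary seasonal component from y.

    theta1[i], theta2[i] : smoothed DHR coefficient series for frequency freqs[i].
    """
    k = np.arange(len(y))
    seasonal = sum(theta1[i] * np.cos(2 * np.pi * f * k) +
                   theta2[i] * np.sin(2 * np.pi * f * k)
                   for i, f in enumerate(freqs))
    return y - seasonal
```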

7 PRACTICAL EXAMPLES

In this section, we consider in detail the analysis of two of the three
environmental time-series discussed briefly in the Introduction to this
chapter; namely the atmospheric CO2 and Raup-Sepkoski Extinctions
data. Similar analyses have been conducted on the Sunspot data 35 and these
reveal various interesting aspects of the series, including a possible link
between the long term modulation of the cyclic component and central
England temperature record for the period 1659-1973, compiled by
Manley.45
7.1 Analysis of the Mauna Loa CO 2 data


Following the establishment of the World Climate Programme by the
8th Meteorological Congress, the World Meteorological Organisation
(WMO) Project on Research and Monitoring of Atmospheric Carbon
Dioxide was incorporated into the World Climate Research Programme
(WCRP) and, in 1979, the Executive Committee recognised the potential
threat to future climate posed by increasing amounts of atmospheric CO2.
In the ensuing decade, the importance of atmospheric CO 2 levels has been
stressed, not only in scientific journals, but also in the more general media.
Discussion of the potential perils of the 'Greenhouse Effect' have become
ever more prevalent during this period as the public have perceived the
possibility of a link between factors such as atmospheric CO 2 and extreme
climatic events throughout the world.
One important aspect of the various scientific programmes on atmos-
pheric CO2 , currently being carried out in many countries, is the analysis
and interpretation of monitored CO2 data from different sites in the world.
Indeed, one of the first events organised under the auspices of the WCRP
was the WMO/ICSU/UNEP Scientific Conference on this subject held at
Bern in September 1981; and the interested reader is directed to Ref. 46 for
an introduction to the topic. References 39, 47 and 48 are of particular
relevance to the analysis reported here. The CO2 data shown in Fig. 1(a)
are monthly figures based on an essentially continuous record made at
Mauna Loa in Hawaii between 1974 (May) and 1987 (Sept.). Here these
data are used to exemplify the kind of analysis made possible by the
recursive procedures described in previous sections: no attempt is made to
interpret the results, which are part of a joint research programme between
Lancaster and the UK Meteorological Office.
The more traditional time-series analysis of the CO 2 data, such as that
reported in the above references, uses various approaches, both to the
estimation and removal of the low frequency trend in the data and the
analysis of the seasonality in the resulting detrended series. These include
the use of smoothing spline functions for trend estimation and constant
parameter harmonic regression (HR) for modelling the seasonal cycle. 48
Here, we first consider the amplitude periodogram of the CO 2 series
shown in Fig. 3. The scale is chosen to expose some of the lower magnitude
peaks in the spectrum: we see that the 12- and 6-monthly period harmonics
dominate the periodogram, with only a very small peak at 4 months and,
apparently, no detectable harmonics at 3, 2·4 and 2 months. A small peak
worth noting for future reference, however, is the one marked with an
asterisk at about 40 months; this mode of behaviour will be commented
on later in this Section.

Fig. 3. Amplitude periodogram of the Mauna Loa CO2 series (amplitude
plotted against frequency in cycles/sample).
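For readers wishing to reproduce this type of plot, an amplitude periodogram can be computed directly from the discrete Fourier transform; the short numpy sketch below uses one common normalisation, which is an assumption here and not necessarily that used for Fig. 3.

```python
import numpy as np

def amplitude_periodogram(y):
    """Amplitude periodogram of a series sampled at unit interval."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                      # remove the mean (zero-frequency term)
    N = len(y)
    spectrum = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(N)            # cycles per sample, 0 to 0.5
    amplitude = 2.0 * np.abs(spectrum) / N
    return freqs, amplitude

# freqs, amp = amplitude_periodogram(co2_monthly)   # e.g. the Mauna Loa series
```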
The low frequency trend, so obvious in the CO2 data, is estimated and
removed by the IRWSMOOTH algorithm with NVR = 0·000005. The
results are given in Fig. 4, which shows: (a) the estimated trend super-
imposed on the data; (b) the estimated slope or derivative of the trend; and
(c) the detrended data. The value of the NVR is selected interactively so
that the trend derivative contains only a very slight leakage of the higher
frequency components associated with the annual, 12-monthly cycle. This
NVR corresponds to a 50% cut-off frequency F50 = 0·00747 cycles/
sample, which induces 50% attenuation of periods less than about 12
years; and 95% attenuation of components with periods less than about
5 years.
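As a check on these figures, inverting eqn (17) for the chosen NVR gives

    F50 = (NVR/1605)^{1/4} = (0·000005/1605)^{1/4} ≈ 0·0075 cycles/sample

i.e. a 50% cut-off period of roughly 134 months, or about 11 years, consistent with the values quoted above.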
It is worth commenting upon the estimated slope or derivative of the
trend in Fig. 4(b), which reveals evidence of what appears to be very long
term oscillatory characteristics. Little can be concluded about this mode
of behaviour, which appears to have a periodicity, if repeated, of some
10-11 years (a possible link with the sun-spot cycle?), since the obser-
vational period is too short. However, the results help to demonstrate the
potential usefulness of the IRWSMOOTH algorithm in allowing the
analyst to expose such characteristics in the data. In the analysis of
numerous sets of socio-economic data, for instance, the authors have
found that this estimated slope series provides strong evidence for the
existence of underlying quasi-periodic 'economic cycles'.

Fig. 4. Initial analysis of the Mauna Loa CO2 series: (a) the IRWSMOOTH
estimated trend superimposed on the data; (b) the IRWSMOOTH estimated
CO2 slope or derivative of the trend; (c) the detrended CO2 series data.
Let us now continue the analysis of the CO2 data by considering AR
modelling of the detrended data shown in Fig. 4(c). The AIC indicates that
an AR(15) model is most appropriate: this yields a coefficient of deter-
mination R² = 0·9818 (i.e. 98·18% of the data explained by the model)
and AIC = -2·274. However, further examination indicates that a
subset AR(15) model with the 3rd, 5th and 7th parameters constrained to
zero provides a superior AIC = -2·304, with R² only marginally less than
the full AR(15) at 0·9816. Moreover, a subset AR(15) with all intermediate
parameters from 3rd to 9th constrained to zero provides a quite reason-
able model with R² = 0·980 and minimum AIC = -2·25. Here, the
insignificant parameters were identified by the usual significance tests and
reference to the convergence of the recursive estimates.9 The subset
AR(15) spectrum, which reveals clearly the nature of the periodicity with
the dominant 12 and 6 month components already noted in the periodo-
gram of Fig. 3, also provides some evidence of higher order harmonics
(see below).
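A basic version of this order search can be written with ordinary least squares; the sketch below is illustrative only, fitting full AR(p) models by regression on lagged values and using one common form of the AIC, and it does not reproduce the recursive estimation or the subset-constraint step described above.

```python
import numpy as np

def fit_ar_aic(y, p):
    """Least-squares AR(p) fit of a (detrended) series; returns coefficients and AIC."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    X = np.column_stack([y[p - j - 1:N - j - 1] for j in range(p)])   # lagged regressors
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coeffs
    sigma2 = np.mean(resid ** 2)
    aic = np.log(sigma2) + 2.0 * p / len(target)    # one common form of the AIC
    return coeffs, aic

# Order identification: choose the p with the smallest AIC
# best_p = min(range(1, 21), key=lambda p: fit_ar_aic(detrended_co2, p)[1])
```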
The forecasting and smoothing results given in Fig. 5(a) and (b) are
generated by an adaptive DHR model incorporating the fundamental
component and the first harmonic (i.e. 12 and 6 months period, respec-
tively), plus an IRW model for the trend, with an NVR = 0·000005. (The
full details of this DHR analysis (forecasting, smoothing and seasonal
adjustment) are given in Ref. 38.) The DHR model coefficients are
modelled here by an RW with NVR = 0·001. One step ahead forecasts
are given up to a forecasting origin (FO) of 100 samples and, thereafter,
multistep ahead forecasts are generated for the remaining 60 samples (five
years). We see that the multistep forecasts are very good, with small
standard errors over the whole 5-year period, even though the algorithm
is not referring at all to the data over this period. Note that a small
negative forecasting error develops over this period, indicating that the
atmospheric CO2 has risen slightly less than might be expected from the
prior historical record. Figure 5(b) shows how, subsequent to the forecast-
ing analysis, it is a simple matter to continue on to the backward smooth-
ing pass: note the improvement in the residuals in this case compared with
the forward, filtering pass. Smoothed estimates of the A_i(k) amplitudes of
the seasonal components are also available from the analysis, if required.

Fig. 5. Time variable parameter (TVP) forecasting, smoothing and seasonal
adjustment of the CO2 series based on dynamic harmonic regression (DHR)
analysis: (a) forecasting results based on IRW trend and two component DHR
model (one step ahead up to forecasting origin (FO) at 100 samples; then
multiple step ahead from FO); (b) smoothed estimates obtained by backwards
smoothing recursions from the 100 sample FO in (a); (c) seasonally adjusted
series compared with the original CO2 series; (d) the residual or 'anomaly'
series; (e) amplitude periodogram of the anomaly series.
The seasonally adjusted CO 2 series, as obtained using the proposed
method of DHR analysis, is compared with the original CO 2 measure-
ments in Fig. 5(c), while Figs 5(d) and (e) show, respectively, the residual
or 'anomaly' series (i.e. the series obtained by subtracting the trend and
total seasonal components from the original data), together with its associ-
ated amplitude periodogram. Here, the trend was modelled as an IRW
process with NVR = 0·0001 and the DHR parameters (associated, in this
case, with all the possible principal harmonics at 12, 6, 4, 3, 2·4 and
2 month periods) were also modelled as IRW processes with NVR =
0·001. These are the 'default' values used for the 'automatic' seasonal
analysis option in the microCAPTAIN program; they are used here to
show that the analysis is not too sensitive to the selection of the models
and their associated NVR values.
The residual component in Fig. 5(d) can be considered as an 'anomaly'
series in the sense that it reveals the movements of the seasonally adjusted
series about the long term 'smooth' trend. Clearly, even on visual apprai-
sal, this estimated anomaly series has significant serial correlation, which
can be directly associated with the interesting peak at 40 months period on
its amplitude periodogram. This spectral characteristic obviously corres-
ponds to the equivalent peak in the spectrum of the original series, which
we noted earlier on Fig. 3. The seasonal adjustment process has nicely
revealed the potential importance of the relatively low, but apparently
significant power in this part of the spectrum. In this last connection, it
should be noted that Young et al. 38 discuss further the relevance of the CO 2
anomaly series and consider a potential dynamic, lagged relationship
between this series and a Pacific Ocean Sea Surface Temperature (SST)
anomaly series using multivariate (or vector) time-series analysis.

7.2 Analysis of the Raup-Sepkoski Extinctions data


The variance periodogram of the Extinctions data, as shown in Fig. 6,
indicates that most of the variance can be associated with the very low
frequency component which arises mainly from the elevated level of
extinctions at either end of the series. The next most significant peak
occurs at a period of 30·65 My. The positions of the maxima associated
with the least squares best fit cycle of this wavelength are marked by
vertical dashed lines in Fig. 1(c), and we see that, as might be expected,
they also correspond closely to the large peaks at either end of the data.
Fig. 6. Variance periodogram of the Extinctions series (percentage variance
plotted against cycles per interval).

In order to correct some of the amplitude asymmetry, the log-transformed
Extinctions series is shown in Fig. 7 together with the IRWSMOOTH-
estimated long term trend. This is obtained with NVR = 0·001, which
ensures that the estimate contains negligible power around those frequen-
cies associated with the quasi-periodic behaviour.

Fig. 7. IRWSMOOTH estimated trend superimposed on the log-transformed
Extinctions series.

Fig. 8. Variance periodogram of the log-transformed Extinctions series.

It is interesting also to note that the trend, as estimated here, appears to
be quite sinusoidal in
form, with an almost complete period covering the 242 My of the record:
this is consistent with the speculative suggestion by Rampino & Stothers49
of a 260-My period in comet impacts, arising from the orbit of the solar
system about the galactic centre. The variance periodogram of the
detrended and log-transformed series shown in Fig. 8 shows that, follow-
ing removal of the low frequency power, the power is now seen to be
concentrated in two major peaks, at periods of 30·65 and 40·87 My, with
a smaller peak at the 17·52-My period (which appears to be a combination
of the harmonic wavelengths associated with the two lower frequency
peaks).
Because the Extinctions series is so short, there are only 36 ordinates
over the spectral range. In order both to determine more precisely the
position and nature of the spectral peaks, and also to investigate the
possibility of nonstationarity, DAR spectral analysis can be applied to the
detrended and log-transformed series. For reference, Fig. 9 shows the
AR(7) variance spectrum in the stationary case; i.e. with assumed time-
invariant AR parameters. Figure 10 is a contour spectrum obtained from
the DAR(7) analysis using an IRW process to model the parameter
variations, with the NVR = 0·007 for all parameters. This contour spec-
trum is obtained by computing the AR(7) spectrum at all 63 sample points
between 8 and 70, based on the smoothed estimates of the AR parameter
values at each sample obtained from the DAR analysis.

Fig. 9. AR(7) spectrum of the log-transformed Extinctions series.

Fig. 10. Contour spectrum based on the smoothed DAR estimates of the
AR(7) parameter values at each of the 63 sample points between 8 and 70.

Because of the
extremely sharp spectral peaks resolved by the spectral model, the
contours are plotted in binary orders of magnitude, rather than linearly,
in order to give fewer contours around the peaks where the slope is steep.
(Clearly, a three-dimensional plot of the spectrum against frequency would
be an alternative, and possibly more impressive, way of presenting these
results but the authors feel this contour plot allows for more precise and
easy measurement.)
This rather novel contour spectrum clearly illustrates the very different
nature of the series in the middle section, where there are few high peaks;
in fact, we see that this region is characterised by two peaks at periods of
around 18 and 39 My, which is in contrast to either end of the series, where
there is a dominant peak of between 30 and 32 My, lengthening to around
36 My over the last few samples. Obviously, this kind of DAR analysis,
based on only 70 data points, should be assessed carefully since it is
difficult to draw any firm conclusions about the significance of estimated
spectral peaks based on such a small sample size, particularly when using
this novel form of nonstationary time-series analysis. These and other
important aspects of this analysis are, however, discussed in Ref. 35, to
which the interested reader is referred for a more detailed discussion of the
methods and results.
Finally, the high frequency variation in the middle of the series can be
effectively attenuated by smoothing the log-transformed and detrended
series using the IRWSMOOTH algorithm with NVR = 0·1, as shown in
Fig. 11. The aim here is to reduce the power at frequencies greater than
those under major investigation and it shows how the six peaks in the
Jurassic and Cretaceous periods are smoothed to produce three lower
magnitude peaks of longer wavelength. This leaves eight smoothed peaks
in all (including the two high points at the ends of the series), with a mean
separation of 34·5 My. This result is reasonably insensitive to the smooth-
ness of the line and is in reasonable agreement with the treatment of this
part by Rampino & Stothers49 who, by rejecting peaks below 10% extinc-
tions, remove two of the peaks in the mid-region.
Fig. 11. Additional smoothing of the detrended, log-transformed series using
the IRWSMOOTH algorithm with NVR = 0·1.

The analysis in this example has produced spectral estimates of the
changes in the quasi-periodic behaviour which characterises the Extinc-
tions data, based on the assumption that the series may be nonstationary
in statistical terms. The results indicate that the period of the cyclic
component is at least 30 My, and may be as large as 34·5 My. It should be
stressed, however, that these results should be regarded as simply provid-
ing further critical evaluation of this important series, rather than provid-
ing an accurate estimate of the cyclic period. The small number of com-
plete cycles in the record, coupled with the uncertainty about the geologic
dating of the events, unavoidably restricts our ability to reach unam-
biguous conclusions about the statistical properties of the series. The
authors hope, however, that the example helps to illustrate the extra
dimension to time-series analysis provided by recursive time variable
parameter estimation.

8 CONCLUSIONS

This chapter has discussed the analysis of nonstationary environmental
data and introduced a new approach to environmental time-series analysis
and modelling based on sophisticated recursive filtering and smoothing
algorithms. The two practical examples show how it can be used quite
straightforwardly to investigate nonstationarity in typical environmental
series. The approach can also be extended to the analysis of multivariable
(vector) time-series25,32 and we have seen that it allows for the development
of various time-variable parameter (TVP) modelling procedures. These
include the dynamic (TVP) forms of the common regression models, i.e.
the dynamic linear, auto and harmonic regression models discussed here,
as well as TVP versions of the various transfer function (TF) modelling
procedures.9,50-53 Such transfer function modelling techniques are not well
known in environmental data analysis but the topic is too extensive to be
covered in the present chapter. Additional information on these pro-
cedures within an environmental context can, however, be found in Refs.
5, 9, 54-60 and from the many references cited therein.
Finally, the new procedure can be considered as a method of analysing
certain classes of nonlinear dynamic systems by the development of TVP58
or the closely related state dependent model (SDM: see Refs. 59 and 61)
approximations. This is discussed fully in Ref. 62.

ACKNOWLEDGEMENTS

Some parts of this chapter are based on the analysis of the Mauna Loa
CO2 data carried out by one of the authors (P.Y.) while visiting the
Institute of Empirical Macroeconomics at the Federal Reserve Bank of
Minneapolis, as reported in Ref. 38; the author is grateful to the Institute
for its support during his visit and to the Journal of Forecasting for
permission to use this material.

REFERENCES

1. Raup, D.M. & Sepkoski, J.J., Periodicity of extinctions in the geologic past.
Proc. USA Nat. Acad. Sci., 81 (1984) 801-05.
2. Box, G.E.P. & Jenkins, G.M., Time Series Analysis, Forecasting and Control.
Holden-Day, San Francisco, 1970.
3. Young, P.e. & Benner, S., microCA PTA IN Handbook: Version 2.0, Lancaster
University, 1988.
4. Priestley, M.B., Spectral Analysis and Time Series. Academic Press, London,
1981.
5. Young, P.e., Recursive approaches to time-series analysis. Bull. of Inst. of
Math. and its Applications, 10 (1974) 209-24.
6. Akaike, H., A new look at statistical model identification. IEEE Trans. on Aut.
Control, AC-19 (1974) 716-22.
7. Young, P.C., The Differential Equation Error Method of Process Parameter
Estimation. PhD Thesis, University of Cambridge, UK, 1969.
8. Young, P.C., The use of a priori parameter variation information to enhance
the performance of a recursive least squares estimator. Tech. Note 404-90,
Naval Weapons Center, China Lake, CA, 1969.
9. Young, P.e., Recursive Estimation and Time-Series Analysis. Springer-Verlag,
Berlin, 1984.
10. Kopp, R.E. & Orford, R.J., Linear regression applied to system identification
for adaptive control systems. AIAA Journal, 1 (1963) 2300-06.
11. Lee, R.C.K., Optimal Identification, Estimation and Control. MIT Press, Cam-
bridge, MA, 1964.
12. Kalman, R.E., A new approach to linear filtering and prediction problems.
ASME Trans., J. Basic Eng., 83-D (1960) 95-108.
13. Young, P.e., On a weighted steepest descent method of process parameter
estimation. Control. Div., Dept. of Eng., Univ. of Cambridge.
14. Young, P.C., An instrumental variable method for real time identification of
a noisy process. Automatica, 6 (1970) 271-87.
15. Young, P.e. A second generation adaptive pitch autostabilisation system for
a missile or aircraft. Tech. Note 404- 109, Naval Weapons Centre, China Lake,
CA, 1971.
16. Young, P.e., A second generation adaptive autostabilisation system for air-
borne vehicles. Automatica, 17 (1981), 459-69.
17. Young, P.e., Comments on 'Dynamic equations for economic forecasting
with GDP-unemployment relation and the growth in the GDP in the U.K. as
an example'. Royal Stat. Soc., Series A, 134 (1971) 167-227.
18. Young, P.e., Comments on 'Techniques for assessing the constancy of a
regression relationship over time'. J. Royal Stat. Soc., Series B, 37 (1975)
149-92.
19. Harrison, P.J. & Stevens, e.F., Bayesian forecasting. J. Royal Stat. Soc.,
Series B., 38 (1976) 205-47.
20. Engle, R.F., Brown, S.J. & Stem, G., A comparison of adaptive structural
forecasting models for electricity sales. J. of Forecasting, 7 (1988) 149-72.
21. Young, P.e., Applying parameter estimation to dynamic systems, Parts I and
II, Control Engineering, 16 (10) (1969) 119-25 and (11) (1969) 118-24.
22. Harrison, P.J. & Stevens, e.F., A Bayesian approach to short-term forecast-
ing. Operational Res. Quarterly, 22 (1971) 341-62.
23. Kitagawa, G., A non-stationary time-series model and its fitting by a recursive
filter. J. of Time Series, 2 (1981) 103-16.
24. Harvey, A.e., A unified view of statistical forecasting procedures. J. of Fore-
casting, 3 (1984) 245-75.
25. Ng, e.N., Young, P.e. & Wang, e.L., Recursive identification, estimation
and forecasting of multivariate time-series. Proc. IFAC Symposium on Iden-
tification and System Parameter Estimation, ed. H.F. Chen, Pergamon Press
Oxford, 1988, pp. 1349-53.
26. Nerlove, M., Grether, D.M. & Carvalho, l.L., Analysis of Economic Time
Series: A Synthesis. Academic Press, New York, 1979.
27. Bell, W.R. & Hillmer, S.e., Issues involved with the seasonal adjustment of
economic time series. J. Bus. and £Con. Stat., 2 (1984) 291-320.
28. Jakeman, A.J. & Young, P.e., Recursive filtering and the inversion of ill-
posed causal problems. Utilitas Mathematica, 35 (1984) 351-76. (Appeared
originally as Report No. AS/R28/1979, Centre for Resource and Environmen-
tal Studies, Australian National University, 1979.)
29. Kitagawa, G. & Gersch, W., A smoothness priors state-space modelling of
time series with trend and seasonality. J. American Stat. Ass., 79 (1984)
378-89.
30. Young, P.e., Recursive extrapolation, interpolation and smoothing of non-
stationary time-series. Proc. IFAC Symposium on Identification and System
Parameter Estimation, ed. H.F. Chen, Pergamon Press, Oxford, 1988,
pp.33-44.
31. Young, P.e., Recursive estimation, forecasting and adaptive control. In
Control and Dynamic Systems, vol. 30, ed. e.T. Leondes. Academic Press, San
Diego, 1989, pp. 119-66.
32. Young, P.e., Ng, e.N. & Armitage, P., A systems approach to economic
forecasting and seasonal adjustment. International Journal on Computers and
Mathematics with Applications, 18 (1989) 481-501.
33. Ng, e.N. & Young, P.e., Recursive estimation and forecasting of nonstation-
ary time-series. J. of Forecasting, 9 (1990) 173-204.
34. Norton, J.P., Optimal smoothing in the identification of linear time-varying
systems. Proc. lEE, 122 (1975) 663-8.
35. Young, TJ., Recursive Methods in the Analysis of Long Time Series in Meteo-
rology and Climatology. PhD Thesis, Centre for Research on Environmental
Systems, University of Lancaster, UK, 1987.
36. Schweppe, F., Evaluation of likelihood function for Gaussian signals. IEEE
Trans. on In! Theory, 11 (1965) 61-70.
37. Harvey, A.e. & Peters, S., Estimation procedures for structural time-series
models. London School of Economics, Discussion Paper No. A28 (1984).
38. Young, P.e., Ng, e.N., Lane, K. & Parker, D., Recursive forecasting,
smoothing and seasonal adjustment of nonstationary environmental data. J.
of Forecasting, 10 (1991) 57-89.
39. Bacastow, R.B. & Keeling, e.D., Atmospheric CO 2 and the southern oscil-
lation effects associated with recent EI Nino events. Proceedings of the WMOj
ICSUjUNEP Scientific Conference on the Analysis and Interpretation of
Atmospheric CO 2 Data. Bern, Switzerland, WCP-14, 14-18 Sept. 1981, World
Meteorological Organisation, pp. 109-12.
40. Ng, C.N., Recursive Identification, Estimation and Forecasting of Non-
Stationary Time-Series. PhD Thesis, Centre for Research on Environmental
Systems, University of Lancaster, UK.
41. Box, G.E.P. & Tiao, G.e., Intervention analysis with application to economic
and environmental problems. J. American Stat. Ass., 70 (1975) 70-9.
42. Tsay, R.S., Outliers, level shifts and variance changes in time series. J. of
Forecasting, 7 (1988) 1-20.
43. Young, P.e. & Ng, e.N., Variance intervention. J. of Forecasting, 8 (1989)
399-416.
44. Shiskin, J., Young, A.H. & Musgrave, J.e., The X-11 variant of the Census
Method II seasonal adjustment program. US Dept of Commerce, Bureau of
Economic Analysis, Tech. Paper No. 15.
45. Manley, G., Central England temperatures: monthly means: 1659-1973.
Quart. J. Royal Met. Soc., 100 (1974) 387-405.
46. WCRP, Proceedings of the WMOjICSUjUNEP Scientific Conference on the
Analysis and Interpretation of Atmospheric CO 2 Data, Bern, Switzerland,
WCP-14, 14-18 Sept., World Meteorological Organisation, 1981.
47. Schnelle et al. (1981) In Proceedings of the WMOjlCSUjUNEP Scientific
Conference on the Analysis and Interpretation of Atmospheric CO 2 Data. Bern,
Switzerland, WCP-14, 14-18 Sept., World Meteorological Organisation, 1981,
pp. 155-62.
48. Bacastow, R.B., Keeling, e.D. & Whorf, T.P., Seasonal amplitude in atmos-
pheric CO 2 concentration at Mauna Loa, Hawaii, 1959-1980. Proceedings of
the WMOjlCSUjUNEP Scientific Conference on the Analysis and Interpre-
tation of Atmospheric CO 2 Data. Bern, Switzerland, WCP-14, 14-18 Sept.
1981, World Meteorological Organisation, pp. 169-76.
49. Rampino, M.R. & Stothers, R.B., Terrestrial mass extinctions, cometry
impacts and the sun's motion perpendicular to the galactic plane. Nature, 308
(1984) 709-12.
50. Young, P.e., Some observations on instrumental variable methods of time-
series analysis. Int. J. of Control, 23 (1976) 593-612.
51. Ljung, L. & Soderstrom, T., Theory and Practice of Recursive Estimation. MIT
Press, Cambridge, MA, 1983.
52. Young, P.e., Recursive identification, estimation and control. In Handbook of
Statistics 5: Time Series in the Time Domain, ed. EJ. Hannan, P.R. Krishnaiah
& M.M. Rao. North Holland, Amsterdam, 1985, pp. 213-55.
53. Young, P.e., The instrumental variable method: a practical approach to
identification and system parameter estimation. In Identification and System
Parameter Estimation 1985, vol. 1 and 2, ed. H.A. Barker and P.e. Young.
Pergamon Press, Oxford, 1985, pp. 1-16.
54. Young, P.e., The validity and credibility of models for badly defined systems.
In Uncertainty and Forecasting of Water Quality, ed. M.B. Beck & G. Van
Straten. Springer-Verlag, Berlin, 1983, pp. 69-100.
55. Young, P.e., Time-series methods and recursive estimation in hydrological
systems analysis. In River Flow Modelling and Forecasting, ed. D.A. Kraijen-
hoff & J.R. Moll. D. Reidel, Dordrecht, 1986, pp. 129-80.
56. Young, P.e. & Wallis, S.G., Recursive estimation: a unified approach to
identification, estimation and forecasting for hydrological systems. J. of App.
Maths. and Computation, 17 (1985) 299-334.
57. Wallis, S.G., Young, P.e. & Beven, KJ., Experimental investigation of the
aggregated dead zone model for longitudinal solute transport in stream
channels. Proc. Ins. Civ. Engrs.: Part 2,87 (1989) 1-22.
58. Young, P.e., A general theory of modelling for badly defined dynamic
systems. In Modeling, Ident!fication and Control in Environmental Systems, ed.
G.e. Vansteenkiste. North-Holland, Amsterdam, 1978, pp. 103-35.
59. Priestley, M.B., State-dependent models: A general approach to nonlinear
time-series analysis. J. of Time Series Analysis, 1 (1980) 47-71.
60. Young, P.e. & Wallis, S.G., The Aggregated Dead Zone (ADZ) model for
dispersion in rivers. BHRA Int. Conf. on Water Quality Modelling in the
Inland Natural Environment, BHRA, Cranfield, Bedfordshire, 1986,
pp.421-33.
61. Haggan, V., Heravi, S.M. & Priestly, M.B., A study of the application of
state-dependent models in nonlinear time-series analysis. J. of Time Series
Analysis, 5 (1984) 69-102.
62. Young, P.e. & Runkle, D.E., Recursive estimation and the modelling of
nonstationary and nonlinear time-series. In Adaptive Systems in Control and
Signa/Processing, Vol. 1. Institute of Measure and Control, London, 1989,
pp.49-64.
Chapter 3

Regression and Correlation


A.C. DAVISON
Department of Statistics, University of Oxford, 1 South Parks Road,
Oxford OX1 3TG, UK

1 BASIC IDEAS

1.1 Systematic and random variation


Statistical methods are concerned with the analysis of data with an appreci-
able random component. Whatever the reason for an analysis, there is
almost always pattern in the data, obscured to a greater or lesser extent by
random variation. Both pattern and randomness can be summarized by
statistical methods, but their relative importance and the eventual
summary depend on the aim of the analysis. At one extreme are techniques
intended solely to explore and describe the data, and at the other extreme
is the analysis of data for which there is an accepted probability model
based on biological or physical mechanisms. Regression methods stretch
between these extremes, and are among the most widely used techniques
in statistics.
There are various ways to classify data. Typically there are a number of
individuals or units on each of which a number of variables are measured.
Often in environmental data the units correspond to evenly-spaced points
in time, and if there is the possibility of dependence between measurements
on successive units the techniques of time series analysis must be con-
sidered, especially if interest is focused primarily on a single variable. In
this chapter we suppose that this dependence is at most weak, or that a
major part of any apparent serial dependence is due to changes in another
variable for which data are available.
The variables themselves may be continuous or discrete, and bounded
or unbounded. Discrete variables may be measurements, or counts, or
proportions of counts, or they may represent a classification of other
variables. Various classifications are possible. Examples are ordinal data,
where categories 1, 2, ..., k represent categories for which the order is
important, or nominal categories, for which there is no natural ordering.
In many problems there is a variable of central interest, a response. The
variables on which the response is thought to depend are explanatory
variables which, though possibly random themselves, are regarded as fixed
for the purpose of analysis. Some authors use the terms dependent and
independent for response and explanatory variables. The collection of
response and explanatory variables for a unit is sometimes called a case.
In regression problems the aim is to explain variation in the response,
in terms of the other variables. Even though different variables may be
chosen as responses at different stages of an analysis, it helps to focus
discussion and simplifies interpretation if there is a single response at each
stage. A central aspect in formulating a regression model is therefore to
determine which variable is to be taken as the response. Very often, though
not invariably, this is dictated by the purpose of the analysis. Sometimes
several variables measuring essentially the same quantity may usefully be
combined into a single derived variable to be used as the response.
Once a response has been identified, the question arises of how to model
its dependence on explanatory variables. There are two aspects to this.
The first is to specify a plausible form for the systematic variation. A linear
relation is often chosen for its simplicity of interpretation, and this may
adequately summarize the data. Consistency with any known limiting
behaviour is important, however, in order to avoid such absurdities as
negative predictions for positive quantities. Formulations which encompass
known features of the relationship between the response and explanatory
variables, such as asymptotes, are to be preferred.
A second aspect is the form chosen to describe departures from the
systematic structure. Sometimes a method of fitting such as least squares
is regarded as self-justifying, as when data subject to virtually no measure-
ment error are summarized in terms of a fitted polynomial. More often
scatter about the systematic structure is described in terms of a suitable
probability distribution. The normal distribution is often used to model
measurement error, but other distributions may be more suitable for
modelling positive skewed responses, proportions, or counts. Examples
are given below. In some cases it is necessary only to specify the mean and
variance properties of the model, so that full specification of a probability
distribution is not required.
TABLE 1
Annual maximum sea levels (cm) in Venice, 1931-1981

103  78 121 116 115 147 119 114  89 102
 99  91  97 106 105 136 126 132 104 117
151 116 107 112  97  95 119 124 118 145
122 114 118 107 110 194 138 144 138 123
122 120 114  96 125 124 120 132 166 134
138

The two most widely used methods of fitting are least squares and
maximum likelihood. Routines for least squares have been widely avail-
able for many years and the method provides generally satisfactory
answers over a range of settings. Some loss of sensitivity can result from
their use, however, and with the computational facilities now available,
maximum likelihood estimates (sometimes previously avoided as being
harder to calculate) are used for their good theoretical properties. The
methods coincide in an important class of situations (Section 3.1).
In many situations interest is focused primarily on one variable, whose
variation is to be explained. In problems of correlation, however, the
relationships between different variables to be treated on an equal footing
are of interest. Perhaps as a result, examples of successful correlation
analyses are rarer than successful examples of regression, where the focus
on a single variable gives a potentially more incisive analysis. Another
reason may be the relative scarcity of flexible and analytically tractable
distributions with which to model the myriad forms of multivariate
dependence which can arise in practice. In some cases, however, variables
must be treated on an equal footing and an attempt made to unravel their
joint structure.

1.2 Examples
Some examples are given below to elaborate on these general comments.

Example 1. The data in Table 1 consist of annual maximum sea levels
in centimetres in Venice for the years 1931-1981.1 The plot of the data in
Fig. 1 shows a marked increase in sea levels over the period in question,
superimposed upon which there are apparently random fluctuations.
There are different possible interpretations of the systematic trend, but
clearly it would be foolish not to take it into account in predictions for the
future. In this case the trend appears to be linear, and we might express
the annual maximum sea level y_t in year t as

    y_t = β_0 + β_1(t - 1931) + ε_t                                    (1)
where β_0 (cm) and β_1 (cm/year) are unknown parameters representing
respectively the mean sea level in 1931 and the rate of increase per year.
The random component ε_t is a random variable which represents the
scatter in the data about the regression line. In many cases it would be
assumed that the ε_t are uncorrelated with mean zero and constant vari-
ance, but a stronger assumption would be that the ε_t were independently
normally distributed. A further elaboration of the systematic part of
eqn (1) would be to suppose that the trend was polynomial rather than
purely linear, or that it contained cyclical behaviour, which could be
represented by adding trigonometric functions of t to the right-hand side
of eqn (1).

Fig. 1. Annual maximum sea levels measured at Venice, 1931-1981.
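A minimal ordinary least squares fit of the model (1), with the annual maxima transcribed from Table 1, could be written as follows; this is a sketch only and is not the analysis developed later in the chapter.

```python
import numpy as np

# Annual maximum sea levels (cm) at Venice, 1931-1981 (Table 1)
sea_levels = np.array([
    103,  78, 121, 116, 115, 147, 119, 114,  89, 102,
     99,  91,  97, 106, 105, 136, 126, 132, 104, 117,
    151, 116, 107, 112,  97,  95, 119, 124, 118, 145,
    122, 114, 118, 107, 110, 194, 138, 144, 138, 123,
    122, 120, 114,  96, 125, 124, 120, 132, 166, 134,
    138], dtype=float)
years = np.arange(1931, 1982)

# Least-squares estimates of beta_0 and beta_1 in eqn (1)
X = np.column_stack([np.ones(len(years)), years - 1931])
beta, *_ = np.linalg.lstsq(X, sea_levels, rcond=None)
print(beta)        # [beta_0 (cm), beta_1 (cm/year)]
```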

Data on river flows or sea levels are continuous: measurements could in
principle be made and recorded to any desired accuracy. In other cases
data take a small number of discrete values.

Example 2. Table 2 gives data on levels of ozone at two sites in east
Texas.2 For each of 48 consecutive months the table gives the number of
days on which ambient ozone levels exceeded 0·08 parts per million (ppm),
the number of days per month on which data were recorded, and the
average daily maximum temperature for the month. Figure 2 displays a

TABLE 2
Ozone data for two locations in Texas, 1981-1984

                 1981            1982            1983            1984
            A   B     C     A   B     C     A   B     C     A   B     C
Beaumont
January     0  30  14·61    0  31  15·50    0  13  14·56    5  31  12·72
February    1  25  16·89    1  28  14·44    0   0  16·17    2  28  17·67
March       0  24  20·22    1  27  22·33    0   5  19·83    3  31  19·94
April       0  24  25·50    4  29  23·89    0  18  20·39    5  24  24·22
May         0  31  26·78    4  25  27·39    9  28  26·89   10  31  27·11
June        0  30  30·78    8  22  31·50    3  21  29·28    4  24  29·00
July        3  30  32·11    0   3  32·39    0   0  31·67    5  25  30·33
August      8  25  32·17    6  20  32·33    0   0  30·94    6  29  31·28
September  11  26  29·56    3  24  30·06    4   9  27·50    2  29  28·56
October     2  31  25·28    5  25  25·06    6  26  25·83    2  30  26·61
November    4  29  21·17    1  26  20·50    0  27  21·44    0  30  19·22
December    0  31  15·72    1  24  19·06    1  29  12·11    0  30  21·67

North Port Arthur
January     0   0      0    0  23  16·50    2  17  15·00    5  31  12·39
February    0   0      0    2  28  16·61    1  25  17·78    1  28  17·33
March       0   0      0    1  31  21·94    2  29  22·00    3  27  19·89
April       0   0      0    2  28  24·17    2  20  24·00    3  22  24·56
May         0   0      0    5  31  28·94    1  14  29·94    4  31  27·89
June        4  18  31·06    6  28  31·94    0   0  30·00    0   3  30·33
July        6  19  31·94    0  10  32·28    1   5  31·67    4  26  31·56
August      3  13  32·11    4  11  36·28    8  19  35·67    7  30  32·61
September   9  23  29·44    3  15  34·67    6  20  30·28    3  30  29·06
October     2  18  25·28    4  26  27·61    6  27  26·94    0  17  27·28
November    3  24  21·83    0  30  20·50    0  28  21·61    1  28  20·89
December    2  30  16·44    0  23  18·28    0  29  12·17    0  31  21·67

A  Number of days per month with exceedances over 0·08 ppm ozone.
B  Number of days of data per month.
C  Daily maximum temperature (°C) for the month.

Figure 2 displays a plot of the proportion of days on which 0·08 ppm was
exceeded against temperature. There is clearly a relation between the two,
but the discreteness of the data and the large number of zero proportions
mean that a simple relation of the same form as eqn (1) is not appropriate.

Fig. 2. Proportion of days per month on which ozone levels exceed 0·08 ppm
for the years 1981-1984 at Beaumont and North Port Arthur, Texas, plotted
against average daily maximum temperature.

Example 3. Table 3 gives annual peak discharges of four rivers in
Texas for the years 1924-1982. Data were collected for the San Saba
and Colorado Rivers at and near San Saba, data from Llano and near
Castell were combined for the Llano River, and data from near Johnson
City and Spicewood were combined for the Pedernales River (see Fig. 3).
If prevention of flooding downstream at Austin is of interest, it will be
important to explain joint variation in the peak discharges, which will
have a combined effect at Austin if they arise from the same rainfall
episode.
There are many possible joint distributions which might be fitted to the
data. About the simplest is the multivariate normal distribution, whose
probability density function is

f(y; \mu, \Omega) = (2\pi)^{-p/2} |\Omega|^{-1/2} \exp\{ -\tfrac{1}{2} (y - \mu)^T \Omega^{-1} (y - \mu) \}    (2)

where y is the p × 1 vector of observations, μ is the p × 1 vector of their
means, and the p × p non-negative definite matrix Ω is the covariance
matrix of Y; thus the diagonal elements of Ω are the variances of the
individual components, its off-diagonal elements are their covariances,
and so on. In this example the vector observation for each year consists
of the peak discharges for the p = 4 rivers. There are n = 59 independent
copies of this if we assume that discharges are independent in different
years.
TABLE 3
Annual peak discharges (cusecs × 10³) for four Texas rivers, 1924-1982

Llano
 59·50  21·40  14·40  33·90  27·30  49·00 122·00  92·50  22·80  19·30
388·00 130·00   3·72 110·00  55·50  28·20  26·70  23·40  50·60  10·10
  8·50  18·20   8·60 108·00  14·60   7·77  13·90  23·20  16·50   3·46
 72·00   1·85  47·20  83·70  35·60 103·00  57·60   1·70   1·81  67·20
  2·48  15·40  27·40  44·40   4·52 154·00  28·00  24·50  11·40 154·00
 19·40  61·50  67·50 139·00  25·80 210·00  32·90 116·00   5·98

Pedernales
  1·36  28·30  16·40   6·94 155·00  36·60  13·90  18·50   2·02  11·40
105·00  85·30  10·00  14·80   2·39  42·90  21·10  26·60  27·00 104·00
 25·50   9·68  10·20   8·38   8·17  29·10  11·80 441·00  32·20   5·34
 13·60   0·16 125·00  50·20  47·00 142·00  15·60   8·55   5·27  10·60
 32·30   7·55  58·30  27·90  12·70  28·90   9·07  35·70  21·40  44·40
 90·10  16·80  98·00 127·00  64·20   2·79  49·60  32·30  62·60

San Saba
  6·50   4·66   8·64   8·64   8·46   7·46   8·64  44·80  34·00   7·35
 27·20  64·00  67·20   5·11 203·00   2·19   5·57  27·20  25·20  20·40
  4·67   4·78  14·70   2·49   4·66   6·29   2·72  12·50  70·40   1·50
  7·15  41·30  35·60  27·50  32·00   3·16  10·30  11·50  10·20   0·68
 20·20  20·40   1·92   4·67  17·40   4·25  36·70  25·60   5·99   1·05
 40·50   3·20  10·30  10·50  27·00   1·81  40·70   3·56   3·59

Colorado
 17·40  30·30  29·60 189·00  27·20  35·00  31·60  78·90  39·80  26·60
 45·30  86·00 179·00 115·00 224·00  20·40  23·40  42·60  25·00  23·20
 19·20  32·30  16·30  19·40  34·10  32·80   8·01  22·40  69·00  20·70
 24·90  57·20  54·10  66·20  44·40  20·90  43·00  23·40  15·60  12·60
 29·90  42·40  16·00  13·20  34·80  15·00  44·50  30·90  11·80   7·08
 46·20  13·30  10·70  18·10  28·10   9·11  36·00   4·36  21·40
Fig. 3. Sketch of central Texas rivers showing position of Austin relative to
Llano, Pedernales, San Saba and Colorado rivers.

The fitting of this distribution to the data is discussed in Section 2.
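As a brief illustration of eqn (2), the following minimal sketch (in Python, with a placeholder array standing in for the log peak discharges of Table 3) computes the maximum likelihood estimates of μ and Ω, which are the sample mean vector and the sample covariance matrix with divisor n, and evaluates the corresponding log density.

```python
# Sketch only: fitting a multivariate normal to a 59 x 4 array of (log)
# peak discharges.  `flows` is a placeholder; the real values are in Table 3.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
flows = rng.lognormal(mean=3.0, sigma=1.0, size=(59, 4))   # placeholder data
logy = np.log(flows)

mu_hat = logy.mean(axis=0)                          # ML estimate of mu
omega_hat = np.cov(logy, rowvar=False, bias=True)   # ML estimate of Omega (divisor n)

# log of the density (2), summed over the n = 59 independent yearly vectors
loglik = multivariate_normal(mean=mu_hat, cov=omega_hat).logpdf(logy).sum()
print(mu_hat, loglik)
```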

The examples above are observational studies. They are typical of


environmental data where usually there is no prospect of controlling other
factors thought to influence the variables of central interest. The situation
is different in designed experiments as used in, for example, agricultural
and some biological studies, where the experimenter has a good deal of
control over the allocation of, say, fertilizer treatments to plots of land.
Random allocation of plots to treatments helps reduce bias and may make
causal statements possible. That is, the experimenter may at the end of the
day be able to say that the average effect of this treatment is to cause that
response. Sometimes randomization is possible in environmental studies,
but more often it is not, and this limits the strength of inferences that can
be made. Generally inferences regarding association may be drawn, so
that it may be possible to say, for example, that a 1°C rise in temperature
is associated with a rise in the rate at which some ozone level is exceeded.
The further implication of a causal relation must come from consider-
ations external to the data, such as a knowledge of underlying physical,
biological or chemical mechanisms, or from additional data from suitably
designed investigations.

2 CORRELATION

Two variables are said to be correlated if there is an association between
them. The strength and direction of association between independent data
pairs (X₁, Y₁), ..., (X_n, Y_n) may be informally assessed by a scatterplot
of the values of the Y_j against the X_j. A numerical measure of linear
association is the (product moment) sample correlation coefficient,

R = \frac{\sum_{j=1}^{n} (X_j - \bar X)(Y_j - \bar Y)}{\{\sum_{j=1}^{n} (X_j - \bar X)^2 \sum_{j=1}^{n} (Y_j - \bar Y)^2\}^{1/2}}    (3)

where X̄ = n⁻¹ Σ X_j and Ȳ = n⁻¹ Σ Y_j are the averages of the X_j and Y_j.
It is possible to show that −1 ≤ R ≤ 1 and that |R| is location and scale
invariant, i.e. unaltered by changes of the form X → a + bX, Y → c + dY,
provided b and d are non-zero. The implication is that R measures association
only and does not depend on the units in which the data are
measured. If R = ±1, the pairs lie on a straight line; if R = 0 the
variables are said to be uncorrelated.
If the pairs (X_j, Y_j) have a bivariate normal distribution (eqn (2)) with
p = 2 and population covariance matrix

\begin{pmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \end{pmatrix}

the population correlation coefficient is ρ = ω₁₂/{ω₁₁ω₂₂}^{1/2}, and the
distribution of R is tabulated. If tables are unavailable a good approximation
is that Z = ½ log{(1 + R)/(1 − R)} is approximately normal
with mean tanh⁻¹(ρ) + ρ/{2(n − 1)} and variance 1/(n − 3). If ρ = 0 then
ω₁₂ = 0, and it follows from the factorization of eqn (2) into a product of
separate probability density functions for x and y that the distributions of
X and Y are independent.
It is important to realise that R measures only linear association
between X and Y. The (x, y) pairs (−2, 4), (−1, 1), (0, 0), (1, 1) and (2, 4)
are strongly and non-randomly related, with y = x², but their correlation
coefficient is zero. A scatterplot of data pairs is therefore an essential
preliminary to calculation and subsequent interpretation of their sample
correlation coefficient. Apparently strong correlation may be due to a
single extreme pair, for example, but the numerical value of R alone will
not convey this information.
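The following sketch (Python, with arbitrary illustrative numbers) evaluates the sample correlation coefficient of eqn (3) and the Fisher z approximation just described, and then reproduces the quadratic example for which R = 0.

```python
# Sketch only: the sample correlation coefficient of eqn (3) and an
# approximate confidence interval based on the Fisher z transformation.
import numpy as np

x = np.array([2.1, 3.4, 1.8, 4.6, 3.9, 2.7, 5.2, 4.1])
y = np.array([1.4, 2.9, 1.1, 3.8, 3.0, 2.2, 4.5, 3.1])
n = len(x)

r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))    # eqn (3)

z = 0.5 * np.log((1 + r) / (1 - r))        # Fisher z, approximately normal
se = 1 / np.sqrt(n - 3)                    # with variance 1/(n - 3)
ci = np.tanh(z + np.array([-1.96, 1.96]) * se)   # back-transformed 95% interval
print(r, ci)

# The quadratic example in the text: perfectly related, yet R = 0.
u = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.corrcoef(u, u**2)[0, 1])
```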

Example 3 (contd). Figure 4 shows a scatterplot matrix for the Texas
rivers data and for the year in which the measurements were observed.

Fig. 4. Scatterplot matrix for data on Texas rivers.

This consists of all possible scatterplots of pairs of variables of the
original data, together with plots of the data against time. Thus the cell
second from the left in the bottom row consists of values for Llano on the
y-axis plotted against values for Pedernales on the x-axis. The diagonal
cells consist of data plotted against themselves and are not informative.
The data vary over several orders of magnitude and have very skewed
distributions. Both these features are removed by a logarithmic transformation
of the peak discharges. There seem to be high peak discharges of
the rivers, especially the Colorado, in early years. Table 4 gives the matrix
of correlations for the original data and for the natural logarithm of the
peak discharges. Thus 0·29 is the sample correlation coefficient for the
original Llano and Pedernales data. The correlation between a vector of
observations and itself is 1, and the diagonal is usually omitted from a
correlation matrix, which is of course symmetric.
Figure 5 shows a scatterplot matrix for the data after log transformation.
The correlations between the rivers are very similar to those for
the untransformed data, but the skewness of the distributions has been
removed and the signs and sizes of the sample correlation coefficients are
clearly supported by the entire datasets. The appreciable correlation
between the San Saba and Colorado rivers is seen clearly.
TABLE 4
Correlation matrices for the Texas rivers data: original data and data after
log transformation

Original data
              Llano   Pedernales   San Saba
Pedernales     0·29
San Saba      −0·14       0·00
Colorado      −0·08      −0·14       0·70

Data after log transformation
              Llano   Pedernales   San Saba
Pedernales     0·41
San Saba      −0·12       0·05
Colorado      −0·10      −0·08       0·75

Fig. 5. Scatterplot matrix for Texas rivers data after log transformation.

The correlation between the Llano and Pedernales rivers would be
strengthened by removal of the lowest observation, which is highlighted.
The other plots show little correlation between the variables. There seems
to be a downward trend with time for the Colorado and a possible slight
upward trend for the Pedernales.
In this example transformation to remove skewness makes a summary
of the data in terms of a multivariate normal distribution sensible for some
purposes. The largest correlation for the transformed data is 0·70 for the
San Saba and Colorado rivers. The only other correlation significantly
different from zero at the 95% level is 0·29 between the Llano and
Pedernales rivers. Taken together with Fig. 5, this suggests that after
transformation the data can be summarized in terms of two independent
bivariate normal distributions, one for the Llano and Pedernales rivers
and another for the San Saba and Colorado rivers.
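A minimal sketch of the calculation behind Table 4, assuming the peak discharges are held in a 59 × 4 NumPy array (a placeholder array is used here); the rough significance threshold quoted is based on the Fisher z approximation described earlier in this section.

```python
# Sketch only: correlation matrices before and after a log transformation,
# in the spirit of Table 4.  `flows` stands in for the 59 x 4 array of annual
# peak discharges (columns: Llano, Pedernales, San Saba, Colorado).
import numpy as np

rng = np.random.default_rng(1)
flows = rng.lognormal(mean=3.0, sigma=1.2, size=(59, 4))      # placeholder data

r_raw = np.corrcoef(flows, rowvar=False)                      # raw-data correlations
r_log = np.corrcoef(np.log(flows), rowvar=False)              # after log transform

# Rough 95% significance threshold for a single correlation with n = 59,
# from the Fisher z approximation: |r| > tanh(1.96 / sqrt(n - 3)).
threshold = np.tanh(1.96 / np.sqrt(59 - 3))
print(np.round(r_log, 2), threshold)
```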

Discussion of other parametric measures of correlation is deferred to
Chapter 4. There are a number of non-parametric measures of association
between two variables, generally based on comparisons of the rankings of
the (X, Y) pairs. Let D_j, for example, be the difference between the rank
of X_j and that of Y_j. Spearman's S is defined as Σ_j D_j², and is obviously
zero if there is a monotonic increasing relation between the measurements.
This and other statistics based on ranks do not suffer from the disadvantage
of measuring linear association only. Percentage points of S under the
hypothesis of independence between rankings are available.3
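A short sketch of Spearman's statistic, using arbitrary illustrative values; the rescaling of S into a rank correlation coefficient is a standard identity for untied data and is not part of the text above.

```python
# Sketch only: Spearman's statistic S = sum of squared rank differences.
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([1.2, 3.4, 2.2, 5.0, 4.1, 0.7, 2.9, 3.8])
y = np.exp(x) + 0.1                  # a monotonic increasing function of x

d = rankdata(x) - rankdata(y)        # rank differences D_j
S = np.sum(d**2)                     # zero here, since the rankings agree

# Spearman's rank correlation is a rescaling of S: rho = 1 - 6S/{n(n^2 - 1)}
n = len(x)
rho = 1 - 6 * S / (n * (n**2 - 1))
print(S, rho, spearmanr(x, y)[0])
```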
Although the multivariate normal distribution is the most widely used,
others may be appropriate in analysing data on multivariate survival
times,5 discrete multivariate data,6 mixtures of discrete and continuous
multivariate data,7 and other situations.8 We take as an example the study
of multivariate extremes.

Example 4. The need to model sample minima and maxima arises in


many environmental contexts, such as hydrology, meteorology and ocean-
ography. Suppose that X and Y represent the annual minimum daily levels
of two nearby rivers, and that data are available for a number of years.
The assumption that the annual minima are independent may obviously
be incorrect, as a drought might affect both rivers at once. It may therefore
be necessary to model the joint behaviour of X and Y simultaneously,
either to obtain more accurate predictions of how each varies individually,
or in order to model the behaviour of some function of both X and Y.
One class of joint distributions for such minima is specified by9,10

pr(X > x, Y > y) = \exp\left\{ -(x + y) + \frac{\theta x y}{x + y} \right\},   0 \le \theta \le 1,  (x, y > 0)    (4)
The population correlation coefficient between X and Y is then

(1 - \tfrac{1}{4}\theta)^{-3/2} \theta^{-1/2} \left\{ \sin^{-1}(\tfrac{1}{2}\theta^{1/2}) - \tfrac{1}{2}\theta^{1/2} (1 - \tfrac{1}{4}\theta)^{1/2} (1 - \tfrac{1}{2}\theta) \right\}
The variables X and Y are independent if θ = 0, when eqn (4) factorizes
into the product e^{-x} e^{-y}. Tawn9 discusses this and related models in the
context of modelling extreme sea levels.

3 LINEAR REGRESSION

3.1 Basics
Consider again the data on sea levels in Venice pictured in Fig. 1. If we
suppose that model (1) is appropriate, we need to estimate the parameters
β₀ and β₁. We suppose that the errors ε_t are uncorrelated with mean zero
and variance σ², also to be estimated. We write eqn (1) in matrix form as

\begin{pmatrix} Y_{1931} \\ Y_{1932} \\ Y_{1933} \\ \vdots \\ Y_{1981} \end{pmatrix} =
\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & 50 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} +
\begin{pmatrix} \epsilon_{1931} \\ \epsilon_{1932} \\ \epsilon_{1933} \\ \vdots \\ \epsilon_{1981} \end{pmatrix}    (5)
or

Y = X\beta + \epsilon    (6)

in an obvious notation.
More generally we take a model written in form (6) in which Y is an
n × 1 vector of responses, X is an n × p matrix of explanatory variables
or covariates, β is a p × 1 vector of parameters to be estimated, with
p < n, and ε is an n × 1 vector of unobservable random disturbances.
The sum of squares which corresponds to any value β is

(Y - X\beta)^T (Y - X\beta) = \sum_{j=1}^{n} (Y_j - x_j^T \beta)^2    (7)
where superscript T denotes matrix transpose, Y_j is the jth element of Y,
and x_j^T is the jth row of X. The sum of squares (eqn (7)) measures the
squared distance, and hence the discrepancy, between the summary of the
data given by the model with parameters β and the data themselves. The
model value corresponding to a given set of explanatory values and
parameters is x_j^T β, from which the observation Y_j differs by Y_j − x_j^T β; the
overall discrepancy for the entire dataset is eqn (7).
As their name suggests, the least squares estimates of the parameters
minimize the sum of squares (eqn (7)). Provided that the matrix X has rank
p, so that the inverse (XᵀX)⁻¹ exists, the least squares estimate of β is the
p × 1 vector

\hat\beta = (X^T X)^{-1} X^T Y    (8)
The fitted value corresponding to Y_j is Ŷ_j = x_j^T β̂. The n × 1 vector of
residuals

e = Y - \hat Y = Y - X\hat\beta = \{I - X(X^T X)^{-1} X^T\} Y = (I - H) Y    (9)

say, can be thought of as an estimate of the vector of errors ε. The n × n
matrix H is called the 'hat' matrix because it 'puts hats' on Y: Ŷ = HY.
The interpretation of e as the estimated error suggests that the residuals
e be used to estimate the variance σ² of the ε_j, and it can be shown that,
provided the model is correct, the residual sum of squares

\sum_{j=1}^{n} (Y_j - \hat Y_j)^2 = e^T e    (10)

has expected value (n − p)σ². Therefore

s^2 = \frac{1}{n - p} \sum_{j=1}^{n} (Y_j - x_j^T \hat\beta)^2    (11)

is unbiased as an estimate of σ². The divisor n − p can be thought of as
compensating for the estimation of p parameters from n observations and
is called the degrees of freedom of the model.
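The least squares quantities of eqns (8)-(11) are easily computed directly. The sketch below uses a synthetic response with the same dimensions as the Venice example; the numbers are placeholders, not the data of Table 1.

```python
# Sketch only: least squares estimation as in eqns (8)-(11), written out
# with explicit matrix algebra for a model with n = 51 and p = 2.
import numpy as np

n, p = 51, 2
X = np.column_stack([np.ones(n), np.arange(n)])       # intercept and t - 1931
rng = np.random.default_rng(2)
Y = 105 + 0.6 * np.arange(n) + rng.normal(scale=19, size=n)   # synthetic response

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # eqn (8), avoiding an explicit inverse
fitted = X @ beta_hat
residuals = Y - fitted                          # eqn (9)
s2 = residuals @ residuals / (n - p)            # eqn (11), unbiased estimate of sigma^2
H = X @ np.linalg.solve(X.T @ X, X.T)           # 'hat' matrix, used again in Section 3.4
print(beta_hat, s2, np.allclose(fitted, H @ Y))
```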
The estimators β̂ and s² have many desirable statistical properties.11 The
estimators can be derived under the second-order assumptions made above,
which concern only the means and covariance structure of the errors.
These assumptions are that

E(\epsilon_j) = 0,  var(\epsilon_j) = \sigma^2,  cov(\epsilon_j, \epsilon_k) = 0  (k \neq j)

A stronger set of assumptions for the linear model are the normal-theory
assumptions. These are that the ε_j have independent normal distributions
with mean zero and variance σ², i.e. Y_j is normally distributed with mean
x_j^T β and variance σ². Thus the probability density function of Y_j is

f(Y_j; \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (Y_j - x_j^T \beta)^2 \right\}    (12)

An alternative derivation of β̂ is then as follows. The Y_j are independent,
and so the joint density of the observations Y_1, ..., Y_n is the product of
the densities (eqn (12)) for the individual Y_j:

\prod_{j=1}^{n} f(Y_j; \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n} (Y_j - x_j^T \beta)^2 \right\}    (13)

For a given set of data, i.e. having observed values of Y_j and x_j, this can
be regarded as a function, the likelihood, of the unknown parameters β and
σ². We now seek the value of β which makes the data most plausible in the
sense of maximizing eqn (13). It is numerically equivalent but algebraically
more convenient to maximize the log likelihood

L(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{j=1}^{n} (Y_j - x_j^T \beta)^2

which is clearly equivalent to minimizing the sum of squares (eqn (7)), and
so the maximum likelihood estimate of β equals the least squares estimate
β̂. The maximum likelihood estimate of σ² is

\hat\sigma^2 = n^{-1} \sum_{j=1}^{n} (Y_j - x_j^T \hat\beta)^2

This is a biased estimate, and s² is used in most applications. This
derivation of β̂ and σ̂² as maximum likelihood estimates has the advantage
of generalizing naturally to more complicated situations, as we shall see in
Section 5.
Under the normal-theory assumptions, confidence intervals for the
parameters β are based on the fact that the joint distribution of the
estimates β̂ is multivariate normal with dimension p, mean β and
covariance matrix σ²(XᵀX)⁻¹. If we denote the rth diagonal element
of (XᵀX)⁻¹ by u_rr, β̂_r has the normal distribution with mean β_r and
variance σ²u_rr. If σ² is known, a (1 − 2α) × 100% confidence interval for
the true value of β_r is

\hat\beta_r \pm z_\alpha \sigma u_{rr}^{1/2}

where Φ(z_α) = α and Φ(·) is the cumulative distribution function of the
standard normal distribution, namely

\Phi(z) = (2\pi)^{-1/2} \int_{-\infty}^{z} e^{-u^2/2} \, du

Usually in applications σ² is unknown, and then it follows, since
(n − p)s²/σ² has a chi-squared distribution on n − p degrees of freedom
independent of β̂, that a (1 − 2α) × 100% confidence interval for β_r is

\hat\beta_r \pm s u_{rr}^{1/2} t_{n-p}(\alpha)

where t_{n−p}(α) is the α × 100% point of the Student t distribution on
n − p degrees of freedom. Values of z_α and t_{n−p}(α) are widely tabulated.
Example 1 (contd). For the Venice data and model (1), there are
p = 2 elements of the parameter vector β. The matrices X and Y are given
in eqn (5). The least squares estimates are β̂₀ = 105·4 and β̂₁ = 0·567, the
residual sum of squares is 16 988, and s² = 16 988/(51 − 2) = 346·7. The
matrix s²(XᵀX)⁻¹ is

\begin{pmatrix} 26·41 & -0·7844 \\ -0·7844 & 0·03138 \end{pmatrix}

and the standard errors of β̂₀ and β̂₁ are respectively 5·159 and 0·1771. The
value of σ² is unknown, and t₄₉(0·025) ≈ 2·01. A 95% confidence interval
for β₁ is (0·211, 0·923), so the hypothesis of no trend, β₁ = 0, is decisively
rejected at the 95% level. The straight fitted line shown in Fig. 6 seems to
be an adequate summary of the trend, although there is considerable
scatter about the line. The interpretation of the parameter values is that
the estimated level in 1931 was 105·4 cm, and the increase in mean annual
maximum sea level is about 0·57 cm per year.
One use to which the fitted line could be put is extrapolation. The
estimated average sea level in 1990, for example, is β̂₀ + (1990 − 1931)β̂₁
= β̂₀ + 59β̂₁ = 138·85, and this has a standard error of

var(\hat\beta_0 + 59\hat\beta_1)^{1/2} = \{ var(\hat\beta_0) + 2 \times 59 \times cov(\hat\beta_0, \hat\beta_1) + 59^2 var(\hat\beta_1) \}^{1/2}

which is 6·56. Comparison with the natural variability in Fig. 6 shows that
this is not the likely variability of the level to be observed in 1990, but the
variability of the point on the fitted line for that year. The actual level
to be observed can be estimated as Y₁₉₉₀ = β̂₀ + 59β̂₁ + ε₁₉₉₀,
where ε₁₉₉₀ is independent of previous years and has zero mean and
estimated variance s². Thus

var(Y_{1990}) = var(\hat\beta_0 + 59\hat\beta_1) + var(\epsilon_{1990})
and this has an estimated value 389·78; the standard error of Y₁₉₉₀ is 19·74.
Thus most of the prediction uncertainty for the future value is due to the
intrinsic variability of the maximum sea levels. An approximate 95%
predictive confidence interval is given by 138·85 ± 1·96 × 19·74 =
(100·2, 177·5). This is very wide.

Fig. 6. Linear and cubic polynomials fitted to Venice sea level data.
Apart from uncertainty due to the variability of estimated model par-
ameters, and intrinsic variability which would remain in the model even if
the parameters were known, there is phenomenological uncertainty which
makes it dangerous to extrapolate the model outside the range of the data
observed. The annual change in the annual maxima may change over the
years 1981-1990 for reasons that cannot be known from the data. This
makes a prediction based on the data alone possibly meaningless and
certainly risky.
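The calculations of this example can be reproduced with a few lines of matrix algebra. The sketch below uses a placeholder response in place of the Table 1 values, so its numerical output will only roughly resemble the figures quoted above.

```python
# Sketch only: slope confidence interval and 1990 prediction interval in the
# style of Example 1, using a placeholder series for the annual maxima.
import numpy as np
from scipy import stats

t = np.arange(51, dtype=float)                     # years since 1931
rng = np.random.default_rng(3)
sea_level = 105 + 0.57 * t + rng.normal(scale=19, size=51)    # placeholder

X = np.column_stack([np.ones_like(t), t])
beta = np.linalg.lstsq(X, sea_level, rcond=None)[0]
resid = sea_level - X @ beta
s2 = resid @ resid / (len(t) - 2)
cov_beta = s2 * np.linalg.inv(X.T @ X)

# 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, df=len(t) - 2)
slope_ci = beta[1] + np.array([-1, 1]) * t_crit * np.sqrt(cov_beta[1, 1])

# Prediction for 1990 (x0 = 59): variance of the fitted point plus sigma^2
x0 = np.array([1.0, 59.0])
pred = x0 @ beta
pred_var = x0 @ cov_beta @ x0 + s2
pred_interval = pred + np.array([-1, 1]) * 1.96 * np.sqrt(pred_var)
print(beta, slope_ci, pred, pred_interval)
```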

3.2 Decomposition of variability
The sum of squares (eqn (7)) has an important geometric interpretation.
If we think of Y and Xβ as being points in an n-dimensional space, then
minimization of (Y − Xβ)ᵀ(Y − Xβ) amounts to choice of the value of
β which minimizes the squared distance between Y and Xβ. The data
vector Y lies in the full n-dimensional space, but the fitted point Xβ̂ lies in
the p-dimensional subspace of the full space spanned by the columns of X.
Fig. 7. The geometry of least squares. The x-y plane is spanned by the columns
of the covariate matrix X, and the least squares estimate minimizes the distance
between the fitted value, Ŷ, which lies in the x-y plane, and the data, Y. The
x-axis is spanned by a column of ones, and the overall average, Ȳ, minimizes
the distance between that axis and the fitted value Ŷ. The extra variation
accounted for by the model beyond that in the x-direction is orthogonal to the
x-axis.

In Fig. 7 the x-y plane is spanned by the columns of X, and the x-axis is
spanned by a vector of ones. Since the distance from Y to Ŷ is minimized,
Ŷ must be orthogonal to Y − Ŷ. Likewise Ȳ, which is an n × 1 vector all
of whose elements are the overall average n⁻¹ Σ Y_j and which therefore lies
along the x-axis, is orthogonal to Ŷ − Ȳ. The role of H is to project the
n-dimensional space onto the p-dimensional subspace of it spanned by the
columns of X, which is the x-y plane in the figure. Now
Y = Ŷ + (Y − Ŷ), and by Pythagoras' theorem,

Y^T Y = \hat Y^T \hat Y + (Y - \hat Y)^T (Y - \hat Y)

Thus the total sum of squares of the data splits into a component due to
the model and one due to the variation unaccounted for by the model.
The variability due to the model can be further decomposed. If we write
Ŷ as Ȳ + (Ŷ − Ȳ), it follows from the orthogonality of Ȳ and Ŷ − Ȳ that

Y^T Y = \bar Y^T \bar Y + (\hat Y - \bar Y)^T (\hat Y - \bar Y) + (Y - \hat Y)^T (Y - \hat Y)

or equivalently

\sum_j Y_j^2 = n \bar Y^2 + \sum_j (\hat Y_j - \bar Y)^2 + \sum_j (Y_j - \hat Y_j)^2    (14)
TABLE 5
Analysis of variance table for linear regression model

Source                           df       Sum of squares                Mean square
Regression (adjusted for mean)   p − 1    SS_Reg = Σ_j (Ŷ_j − Ȳ)²       SS_Reg/(p − 1)
Residual                         n − p    SS_Res = Σ_j (Y_j − Ŷ_j)²     SS_Res/(n − p)
Total (adjusted for mean)        n − 1    Σ_j (Y_j − Ȳ)²
Mean                             1        nȲ²
Total                            n        Σ_j Y_j²
The degrees of freedom (df) of a sum of squares is the number of
parameters to which it corresponds, or equivalently the dimension of the
subspace spanned by the corresponding columns of X. The degrees of
freedom of the terms on the right-hand side of eqn (14) are respectively 1,
p − 1, and n − p.
These sums of squares can be laid out in an analysis of variance table as
shown in Table 5. The sum of squares for the regression can be further
broken down into reductions due to individual parameters or sets of them.
An analysis of variance (ANOVA) table displays concisely the relative
contributions to the overall variability accounted for by the inclusion of
the model parameters and their corresponding covariates. The last two
rows of an analysis of variance table are usually omitted, on the grounds
that the reduction in sum of squares due to an overall mean is rarely of
interest.
A good model is one in which the fitted value Ŷ is close to the observed
Y, so that a high proportion of the overall variability is accounted for by
the model and the residual sum of squares is small relative to the regression
sum of squares. The ratio of the regression sum of squares to the adjusted
total sum of squares can be used to express what proportion of the overall
variability is accounted for by the fitted regression model. This is numerically
equal to the square of the correlation coefficient between the fitted
values and the data.
TABLE 6
Analysis of variance for fit of cubic regression model to Venice sea level data

Source                                      df    Sum of squares    Mean square
Linear                                       1         3552             3552
Quadratic and cubic (adjusted for linear)    2            9·2              4·6
Regression (adjusted for mean)               3         3561·2           1187·1
Residual                                    47        16979             361·26
Total (adjusted for mean)                   50        20540

When the observations have independent normal distributions with
common variance σ², the residual sum of squares SS_Res has a chi-squared
distribution with mean (n − p)σ² and n − p degrees of freedom, and this
implies that the residual mean square SS_Res/(n − p) is an estimate of σ². If
there were in fact no regression on the columns of X after adjusting for the
mean, i.e. if in the model

Y_j = \beta_0 + \beta_1 x_{j1} + \cdots + \beta_{p-1} x_{j,p-1} + \epsilon_j

the parameters β₁, ..., β_{p−1} were all equal to zero, the regression sum of
squares SS_Reg would have a chi-squared distribution on p − 1 degrees of
freedom, with mean (p − 1)σ². If any of these parameters were non-zero
the regression sum of squares would tend to be larger, since more of the
overall variability would be accounted for by the model. A statistical
implication of the orthogonality of Y − Ŷ and Ŷ − Ȳ is that the corresponding
sums of squares are independent, and so if the regression parameters
are zero, the statistic

\frac{SS_{Reg}/(p - 1)}{SS_{Res}/(n - p)}

has an F-distribution on p − 1 and n − p degrees of freedom. This
can be used to determine if a reduction in sum of squares is too large to
have occurred by chance and so is possibly due to non-zero regression
parameters.
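A minimal sketch of the F calculation just described, using a placeholder response and a polynomial design matrix; only the construction of the statistic, not the data, follows the example.

```python
# Sketch only: the F statistic for the overall regression, built from the
# analysis-of-variance decomposition (eqn (14)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 51, 4                                    # e.g. cubic polynomial in time
t = np.arange(n, dtype=float)
X = np.column_stack([t**k for k in range(p)])   # columns 1, t, t^2, t^3
Y = 105 + 0.57 * t + rng.normal(scale=19, size=n)

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
fitted = X @ beta
ss_res = np.sum((Y - fitted) ** 2)
ss_reg = np.sum((fitted - Y.mean()) ** 2)       # regression SS, adjusted for the mean

F = (ss_reg / (p - 1)) / (ss_res / (n - p))
p_value = stats.f.sf(F, p - 1, n - p)
print(F, p_value)
```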

Example 1 (contd). The analysis of variance table when the cubic
model

\beta_0 + \beta_1 (t - 1930) + \beta_2 (t - 1930)^2 + \beta_3 (t - 1930)^3

is fitted to the Venice data is given in Table 6.
The estimate of σ² based on the regression is 1187, as compared to
361·26 based on the residual sum of squares. The corresponding F-statistic
is 1187/361·26 = 3·29, which is just not significant at the 2·5% level, and
gives strong evidence that the average sea level changes with the passage
of time. The decomposition of the regression sum of squares, however,
shows that the regression effect is due almost wholly to the linear term; the
F-statistic for the extra terms is 0·013 = 4·6/361·26, which is not significant
at any reasonable level. Figure 6 shows the fitted values for the cubic
model. These differ very little from those for the linear model, thus
explaining the result of the analysis of variance. The proportion of total
variation accounted for by the model is 3561·2/20 540 × 100% = 17·3%,
corresponding to a correlation coefficient of √0·173 = 0·42 between
observations and fitted values. This is not very large but the line seems to
be an adequate summary of the trend in Fig. 6. Perhaps more striking than
the trend is the large degree of scatter about the fitted line.

3.3 Model selection


In larger problems than Example 1, there are typically many explanatory
variables. The problem then arises of how to select a suitable model from
the many that are possible. The simplest model is the minimal model,
which summarizes the response purely in terms of its average, a single
number. This is the most concise summary of the response possible, but
it makes no allowance for any systematic variation which might be
explained by explanatory variables. The least concise summary is to use
the data themselves as the summary. This uses n numbers to summarize
n numbers, so there is no reduction in complexity. The problem of model
choice is to find a middle path where as much systematic variation as
possible is explained by dependence of the response on the covariates, and
the remaining variation is summarized in the random component of the
model.
In many situations there is prior knowledge about which covariates are
likely to prove important; indeed the sign and the approximate size of a
regression coefficient may be known. Sometimes there is a natural order
in which covariates should be fitted. With polynomial covariates, as in
Example 1, it would make no sense to fit a model with linear and cubic
terms but no quadratic term unless there were strong prior grounds for the
belief that the quadratic coefficient is zero, perhaps based on theoretical
knowledge of the relationship. If a polynomial term is fitted, then all terms
of lower order should normally also be fitted. A similar rule applies also
to qualitative terms, for reasons illustrated in Section 5.5.

If several covariates measure essentially the same thing, it may be
sensible to use their average, or some other function of them, as a covariate,
instead of trying to fit them separately. Regardless of this it is essential to
plot the response variable against any likely covariates, in order to verify
prior assumptions, indicate covariates unlikely to be useful, and to screen
the data for unusual observations. It is valuable to check if any pairs of
covariates are highly correlated, or collinear, as this can lead to trouble.
For example, collinear covariates that are individually significant may
both be insignificant if included in the same model. A model with collinear
covariates may have very wide confidence intervals for parameter estimates,
predictions, and so forth because the matrix XᵀX is close to
singular. Weisberg12 gives a good discussion of this. Collinearity often
arises in models containing polynomial terms, in which case orthogonal
polynomials should be used.
There are broadly three ways to proceed in regression model selection.
Forward selection begins by fitting each covariate individually. The model
with the intercept and the covariate leading to the largest reduction in
residual sum of squares is then regarded as the base model. Each of the
other covariates is then added separately to this model, and the residual
sum of squares for each model with the intercept, the first covariate
chosen, and the other covariates taken separately is recorded. Whichever
of these models that has the smallest residual sum of squares is then taken
as the base model and each remaining covariate is then added separately
to this. The procedure is repeated until a suitable stopping rule applies.
Often the stopping rule used is that the residual sum of squares is not
reduced significantly by the addition of any covariate not in the model.
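The forward selection procedure just described can be sketched in a few lines. The routine below is illustrative only: the F-based stopping rule, the 5% threshold and the function names are choices made here, not part of any particular package.

```python
# Sketch only: a bare-bones forward selection loop of the kind described above.
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares after a least squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

def forward_select(candidates, y, alpha=0.05):
    """candidates: dict mapping covariate names to columns of length n."""
    n = len(y)
    selected = []                                # names of chosen covariates
    X = np.ones((n, 1))                          # start from the intercept-only model
    current_rss = rss(X, y)
    while True:
        best = None
        for name, col in candidates.items():
            if name in selected:
                continue
            X_new = np.column_stack([X, col])
            new_rss = rss(X_new, y)
            # F statistic for the reduction in RSS due to the added column
            F = (current_rss - new_rss) / (new_rss / (n - X_new.shape[1]))
            p_val = stats.f.sf(F, 1, n - X_new.shape[1])
            if p_val < alpha and (best is None or new_rss < best[2]):
                best = (name, X_new, new_rss)
        if best is None:                         # stopping rule: no significant addition
            return selected
        selected.append(best[0])
        X, current_rss = best[1], best[2]

# Example use with hypothetical candidate columns:
# forward_select({'temp': temp, 'wind': wind, 'rain': rain}, y)
```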
Backwards elimination starts with all available covariates fitted, and
drops the least significant covariate. The model without this covariate is
then fitted, and the least significant covariate dropped; the procedure is
repeated until a stopping rule applies. A common stopping rule is that
deletion of any remaining covariate leads to a significant increase in the
residual sum of squares for the model.
In both forward selection and backward elimination the choice of
significant covariates is usually made by reference to tables of the F or t
distributions. The two methods may not lead to the same eventual model,
and consequently a combination of them known as stepwise regression is
available in a number of regression packages. In this procedure four
options are considered at each stage: add a variable, delete a variable,
swap a variable in the model for one not in the model, or stop. This more
complicated algorithm is often used in practice, but like forward selection

and backwards elimination it will not necessarily produce a satisfactory or
even a sensible model. Examples are known where automatic procedures
find complicated models to fit data that are known to have no systematic
structure whatever. The best guide to a satisfactory model is the knowledge
the investigator has of his field of application, and automatic output
from any regression package should be examined critically.

3.4 Model checking


There are many ways that a regression model can be wrongly specified.
Sometimes one or a few observations are misrecorded, and so appear
unusual compared to the rest. A problem like this can be called an isolated
discrepancy. A systematic discrepancy arises when, for example, an explan-
atory variable is included in the model but inclusion of its logarithm would
be more appropriate, or when the random component of the model is
incorrectly specified-by supposing correlated responses to be indepen-
dent, for example. A range of statistical techniques has been developed to
detect and make allowance for such difficulties, and we discuss some of
these below. Many of these model-checking techniques are based on the
use of residuals.
We saw in Section 3.1 that the vector e = Y − Ŷ of differences between
the data and fitted values can be thought of as estimating the unobservable
vector ε of errors. Since assumptions about the ε_j concern the random
component of the model, it is natural to use the e_j to check those assumptions.
One problem with direct use of the e_j, however, is that although they
have zero means, unlike the true errors they do not have constant variance.
Standardized residuals are therefore defined as

R_j = \frac{Y_j - \hat Y_j}{s (1 - h_{jj})^{1/2}}

where h_jj is the jth diagonal element of the hat matrix H = X(XᵀX)⁻¹Xᵀ.
If the model is correct, the R_j should have zero means and approximately
unit variance, and should display no forms of non-randomness, the most
usual of which are likely to be
(i) the presence of outliers, sometimes due to a misrecorded or
mistyped data value which may show up as lying out of the pattern
of the rest, and sometimes indicating a region of the space of
covariates in which there are departures from the model. Single
outliers are likely to be detected by any of the plots described below,
whereas multiple outliers may lead to masking difficulties in which
each outlier is concealed by the presence of others;

(ii) omission of a further explanatory variable, detected by plotting
residuals against that variable;
(iii) incorrect form of dependence on an explanatory variable, for
example a linear rather than nonlinear relation, detectable by plotting
residuals against the variable;
(iv) correlation between residuals, as in time series data, detected by
scatterplots between lagged residuals; and
(v) incorrect assumptions regarding the distribution of the errors ε_j,
detected by plots of residuals against fitted values Ŷ_j to detect
systematic variation of their means or variances with the mean
response, or of ordered residuals against expected normal order
statistics to detect non-normal errors.

Figure 8 shows some possible patterns and their causes. Parts (a)-(d)
should show random scatter, although allowance should be made for
apparently non-random scatter caused by variable density of points along
the abscissa. Plots (e) and (f) are designed to check for outliers and
non-normal errors. The idea is that if the R_j are roughly a random sample
from the normal distribution, a plot of the ordered R_j against approximate
normal order statistics Φ⁻¹{(j − 3/8)/(n + 1/2)} should be a straight line
of unit gradient through the origin. Outliers manifest themselves as
extreme points lying off the line, and skewness of the errors shows up
through a nonlinear plot.

Fig. 8. Examples of residual plots: (a) nonlinear relation between response
and x; (b) variance of response increasing with y; (c) null plot; (d) nonlinearity
and increasing variance; (e) null normal order statistics plot with possible
outlier; and (f) normal order statistics plot showing skewed error distribution.
The value of x_j may give a case unduly high leverage or influence. The
distinction is a somewhat subtle one. An influential observation is one
whose deletion changes the model greatly, whereas deletion of an observation
with high leverage changes the accuracy with which the model is
determined. Figure 9 makes the distinction clearer. The covariates for a
point with high leverage lie outwith the covariates for the other observations.
The measure of leverage in a linear model is the jth diagonal
element of the hat matrix, h_jj, which has average value p/n, so an observation
with leverage much in excess of this is worth examining.
One measure of influence is the overall change in fitted values when an
observation is deleted from the data. Let Ŷ_{(j)} denote the vector of fitted
values when Y_j is deleted from the data. Then one simple measure of the
change from Ŷ to Ŷ_{(j)} is Cook's distance,13 defined as

C_j = \frac{1}{p s^2} (\hat Y - \hat Y_{(j)})^T (\hat Y - \hat Y_{(j)})

where Ŷ_{(j)} = X β̂_{(j)}, and subscript (j) denotes a quantity calculated
without the jth observation. An alternative, more easily calculated, form in
which to express C_j is

C_j = \frac{R_j^2 h_{jj}}{p (1 - h_{jj})}
Fig. 9. The relation between the leverage and the influence of a point. The
light line shows the fitted regression with the point x included, and the heavy
line shows the fitted regression with it excluded. In (a) the point has little
leverage but some influence on the intercept, though not on the estimate of
slope; in (b) the point has high leverage but little influence; and in (c) both the
leverage and the influence of the point are high.
It is hard to give guidance as to when an observation is unduly influential,
but if one or two values of C_j are large relative to the rest it is worth
refitting the model without such observations in order to see if there are
major changes in the interpretation or strength of the regression.
Even though it is an outlier, an observation with high leverage may have
a small standardized residual because the regression line passes close to it.
This problem can be overcome by use of jackknifed residuals

R'_j = \frac{Y_j - x_j^T \hat\beta_{(j)}}{s_{(j)} (1 - h_{jj})^{-1/2}} = R_j \left( \frac{n - p - 1}{n - p - R_j^2} \right)^{1/2}

which measure the discrepancy between the observation Y_j and the model
obtained when Y_j is not included in the fitting. The R'_j are more useful than
the ordinary residuals R_j for detecting outliers.
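The diagnostic quantities of this section follow directly from their definitions, as the sketch below illustrates with placeholder data.

```python
# Sketch only: leverages, standardized residuals, Cook's distances and
# jackknifed residuals computed from their definitions.
import numpy as np

rng = np.random.default_rng(5)
n, p = 51, 2
X = np.column_stack([np.ones(n), np.arange(n)])
Y = 105 + 0.57 * np.arange(n) + rng.normal(scale=19, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                      # hat matrix
h = np.diag(H)                             # leverages h_jj (average value p/n)
e = Y - H @ Y                              # ordinary residuals
s2 = e @ e / (n - p)

R = e / np.sqrt(s2 * (1 - h))              # standardized residuals
C = R**2 * h / (p * (1 - h))               # Cook's distances
R_jack = R * np.sqrt((n - p - 1) / (n - p - R**2))   # jackknifed residuals

print(h.mean(), p / n)                     # the two averages agree
print(np.argmax(C), C.max())               # most influential observation
```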

Example 1 (contd). Figures 10-13 display diagnostic plots for the
model of linear trend fitted to the sea level data. Figure 10 shows the
jackknifed residuals R'_j plotted against the fitted values Ŷ_j. There is one
outstandingly large residual, for the year 1966, but the rest seem reasonable
compared to a standard normal distribution. Figure 11 shows the
values of Cook's distance C_j plotted against case number. The largest
value is for observation 36, which corresponds to 1966, but cases 6 (1936)
and 49 (1979) also seem to have large influence on the fitted model.
Reinspection of Fig. 10 shows that these observations have the largest
positive residuals.

Fig. 10. Plot of jackknifed residuals, R'_j, against fitted values, Ŷ_j, for Venice
data.

Fig. 11. Plot of Cook distances, C_j, against case numbers, j, for Venice data.

Fig. 12. Plot of leverages, h_jj, against case numbers, j, for Venice data.
Figure 12 shows a strong systematic pattern in the values of the measures
of leverage, h_jj. On reflection, the reason for this is clear, namely that in
a model Y_j = x_j^T β = β₀ + β₁z_j, the matrix (XᵀX)⁻¹ has the form

\frac{1}{n \sum_j (z_j - \bar z)^2} \begin{pmatrix} \sum_j z_j^2 & -\sum_j z_j \\ -\sum_j z_j & n \end{pmatrix}

where z̄ = n⁻¹ Σ_j z_j, and so h_kk, which is the kth diagonal element of the
matrix X(XᵀX)⁻¹Xᵀ and so equals x_k^T(XᵀX)⁻¹x_k, can be written in the
form

h_{kk} = x_k^T (X^T X)^{-1} x_k = \frac{1}{n} + \frac{(z_k - \bar z)^2}{\sum_j (z_j - \bar z)^2}

Thus h_kk consists of a constant plus a term representing the squared
distance between z_k and the average value of the z_j, which gives rise to the
quadratic shape of Fig. 12. These two terms may be interpreted as the
leverage an observation at z_k has on the mean level of the regression line
and on the estimate of slope, respectively; an observation which is more
extreme in terms of its value of z has a larger leverage on the estimate of
slope. In this case the plot of the leverages is uninformative, but in more
complicated problems it can be valuable to examine the h_jj.
Figure 13 shows ordered standardized residuals plotted against normal
order statistics. If the errors ε_j were independent and normally distributed
with constant variance, and the systematic part of the model was correct,
this plot would be expected to be a straight line of unit gradient through
the origin. However, the plot shows a clear curvature in its upper tail,
which implies that the errors arise from a distribution which is skewed to
the right. Reinspection of Fig. 10 with hindsight tells the same story. We
conclude that the assumption of normal errors is unsuitable for this set of
data. This finding calls into question the previous suggestion that the
observation for 1966 is an outlier. Though it is unlikely to have arisen from
the standard normal distribution, the observation might arise from a
positively skewed distribution.

Fig. 13. Plot of ordered residuals against normal order statistics for Venice data.

The most important assumption made in regression modelling is usually
that the errors ε_j are independent. Independence is also the hardest
assumption to check. The direct effect of correlated errors on the estimates
β̂ may be small, provided the systematic part of the model is correctly
specified, but the standard errors of the β̂_r may be seriously affected. It
may be clear on general grounds that the errors may be regarded as
independent, but in cases such as Examples 1 and 2 above where the errors
may not be independent, scatter plots of consecutive residuals may reveal
serial correlation. A formal test of autocorrelation of residuals is based on
the Durbin-Watson statistic, which is often available from regression
packages.
There can be problems of interpretation if discrepancies arise due to a
single observation or a small subset of the data. If there is access to the
original data records, it may be possible to account for unusual observations
in terms of faulty recording or transcription of the data. The
obvious remedy is then to delete the offending observation(s). The situation
where there is no external evidence about the validity of an observation
is less straightforward. Then it is often sensible to report analyses
with and without the observation(s), especially if conclusions are substan-
tially altered. Evidence from related studies may be useful. Systematic
failure of a model at the highest or lowest responses is particularly import-
ant, and may indicate severe limitations on its applicability that make
extrapolation more than usually unwise.
Reliable conclusions cannot be drawn from statistical methods, how-
ever sophisticated, applied to data for which they are unsuitable. A
discussion of model adequacy is therefore an essential component in a
statistical analysis. Nowadays residuals and the other measures of fit
described above are often calculated automatically in regression packages,
so the checks described above are easily performed.

3.5 Transformations
One requirement of a successful model is consistency with any known
asymptotic behaviour of the phenomenon under investigation. This is
especially important if the model is to be used for prediction or forecast-
ing. Many quantities are necessarily non-negative, for example, so a linear
model for them can lead to logically impossible negative predictions. One
remedy for this is to study the behaviour of the data after a suitable
transformation. Even where considerations such as these do not apply, it
may be sensible to investigate the possibility of transformation to satisfy
model assumptions more closely.
The interpretation and possible usefulness of a linear model are com-
plicated by such factors as

(a) non-constant variance;
(b) asymmetrically distributed errors;
(c) non-linearity in one or more covariates; or
(d) interactions between explanatory variables.

A transformation can sometimes overcome one or more of these dif-


ficulties. In many applications a suitable transformation will be obvious
on general grounds, or after inspection of the data, or from previous
experience of related sets of data. In less clear-cut situations a more formal
approach may be helpful, and one possibility is described below.
One class of transformations for positive observations is the power
family. In one version the transformed value of an observation y is14

y^{(\lambda)} = \begin{cases} (y^\lambda - 1)/\lambda & (\lambda \neq 0) \\ \log y & (\lambda = 0) \end{cases}

which changes continuously from power to log forms at λ = 0. The idea
now is to use the data themselves to determine a value of λ which makes
them close to normality. That is, we aim to find the value of λ which makes
the model

Y^{(\lambda)} = X\beta + \epsilon

most plausible in the sense of maximizing the likelihood. Here Y^{(λ)} =
(Y_1^{(λ)}, Y_2^{(λ)}, ..., Y_n^{(λ)})ᵀ is the vector of transformed observations.
The density of Y_j is

f(y_j; \beta, \sigma^2, \lambda) = (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (y_j^{(\lambda)} - x_j^T \beta)^2 \right\} y_j^{\lambda - 1}    (15)

where the Jacobian y_j^{λ−1} is needed for eqn (15) to be a density for y_j rather
than y_j^{(λ)}. The optimum value of λ is chosen by maximizing the log
likelihood

L(\beta, \sigma^2, \lambda) = \sum_{j=1}^{n} \log f(y_j; \beta, \sigma^2, \lambda)

Let ẏ = (\prod_{j=1}^{n} y_j)^{1/n} denote the geometric mean of the data, and define
z_j^{(λ)} to be equal to y_j^{(λ)}/ẏ^{λ−1}. Then the profile log likelihood for λ is

L_p(\lambda) = \max_{\beta, \sigma^2} L(\beta, \sigma^2, \lambda) = -\frac{n}{2} \log SS_{Res}(\lambda) + \text{constant}

where SS_Res(λ) is the residual sum of squares when the vector Z^{(λ)} is
regressed on the columns of X. We now plot L_p(λ) as a function of λ, and
aim to choose a value of λ which is easily interpreted but close to the
maximum of the profile log likelihood.
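A minimal sketch of the profile log likelihood calculation, assuming a positive response y and design matrix X (placeholders below); the normalized transform z^(λ) is the one defined above.

```python
# Sketch only: profile log likelihood for the power transformation over a
# grid of lambda values.
import numpy as np

rng = np.random.default_rng(6)
n = 51
X = np.column_stack([np.ones(n), np.arange(n)])
y = np.exp(rng.normal(loc=4.7, scale=0.2, size=n))    # positive placeholder response

def profile_loglik(lam, y, X):
    gm = np.exp(np.mean(np.log(y)))                   # geometric mean of the data
    if abs(lam) < 1e-8:
        z = gm * np.log(y)
    else:
        z = (y**lam - 1) / (lam * gm**(lam - 1))      # normalized transform z^(lambda)
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    ss_res = np.sum((z - X @ beta) ** 2)
    return -0.5 * n * np.log(ss_res)                  # -(n/2) log SS_Res(lambda) + const

grid = np.linspace(-2, 2, 81)
Lp = np.array([profile_loglik(l, y, X) for l in grid])
print(grid[np.argmax(Lp)])                            # lambda maximizing the profile
```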

Example 1 (contd). Figure 14 shows the profile log likelihood for λ
for the Venice data. It suggests that the inverse transformation λ = −1
or the log transformation λ = 0 would be more sensible than the identity
transformation λ = 1. The interpretation of the parameters β₀ and β₁
would change entirely if either of these scales were used. If the logarithmic
scale were used the mean sea level would increase by a multiplicative factor
e^{β₁} each year, for example, which seems harder to interpret physically than
a constant additive change. Analysis of transformed data is unsatisfactory
in this example because of these difficulties of interpretation.

Fig. 14. Profile log likelihood for power transformation for Venice data.

A traditional use of transformations is in the analysis of counts. Small
counts are often modelled as Poisson random variables, with density
function

f(y; \mu) = \frac{\mu^y e^{-\mu}}{y!},   y = 0, 1, ...    (16)

the mean and variance of which are both μ. In the case of non-constant
variance we might aim to find a transformation h(Y) whose variance is
constant. Taylor series expansion gives h(Y) ≈ h(μ) + (Y − μ)h′(μ), so
E{h(Y)} ≈ h(μ) and var{h(Y)} ≈ h′(μ)² var(Y) = μ h′(μ)². If this is to
be constant, we must have h(Y) ∝ Y^{1/2}. A more refined calculation shows
that h(Y) = (Y + 1/4)^{1/2} has approximate variance 1/4. The procedure
would now be to analyse the transformed counts as variables with known
variance. Difficulties of interpretation like those experienced for the
Venice data arise, however, because a linear model for the square root of
the count suggests that the count itself depends quadratically on the
explanatory variables through the linear part of the model. This would
seem highly artificial in most circumstances, and a more satisfactory
approach is given in Section 5.
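The variance-stabilizing property of the square root transformation is easy to check by simulation, as in the brief sketch below.

```python
# Sketch only: numerical check that h(Y) = (Y + 1/4)^{1/2} has approximately
# constant variance 1/4 for Poisson counts with a range of means.
import numpy as np

rng = np.random.default_rng(7)
for mu in [2.0, 5.0, 20.0, 50.0]:
    y = rng.poisson(mu, size=200_000)
    print(mu, np.var(y), np.var(np.sqrt(y + 0.25)))   # second column ~ mu, third ~ 1/4
```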

3.6 Weighted least squares
One immediate generalization of the results of Section 3.1 is to a situation
where different observations Y_j have different variances. We suppose that
var(Y_j) = σ²/w_j, where w_j is the weight ascribed to Y_j. The least squares
estimate of β is then β̂ = (XᵀWX)⁻¹XᵀWY, where W is the matrix with
jth diagonal element w_j and zeros elsewhere. The covariance matrix of β̂
is σ²(XᵀWX)⁻¹, and the estimate of σ² is given by

s^2 = \frac{1}{n - p} \sum_{j=1}^{n} w_j (Y_j - x_j^T \hat\beta)^2
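A minimal sketch of the weighted least squares formulas above; the weights here are taken, purely for illustration, to be hypothetical replicate counts of the kind discussed next.

```python
# Sketch only: weighted least squares with weights w_j, here hypothetical
# numbers of observations underlying each averaged response.
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 2
X = np.column_stack([np.ones(n), np.linspace(0, 10, n)])
m = rng.integers(2, 50, size=n)                       # hypothetical sample sizes
Y = X @ np.array([1.0, 0.5]) + rng.normal(scale=2.0 / np.sqrt(m))  # var = sigma^2 / m_j

W = np.diag(m.astype(float))                          # weight matrix
XtWX = X.T @ W @ X
beta_w = np.linalg.solve(XtWX, X.T @ W @ Y)           # (X'WX)^{-1} X'WY
resid = Y - X @ beta_w
s2 = (m * resid**2).sum() / (n - p)                   # weighted estimate of sigma^2
cov_beta = s2 * np.linalg.inv(XtWX)
print(beta_w, np.sqrt(np.diag(cov_beta)))
```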

One case where such a model is appropriate is when Y_j is an average of m_j
observations, so var(Y_j) = σ²/m_j. An example with other instructive
features is as follows.

Example 5.15 In studies of pollutant dispersal from a point source, data
on pollutant levels may be available at a number of spatial locations.
Exposures at each individual location may be summarized by fitting a
suitable distribution,16 so that a vector of parameter estimates for that
distribution, γ̂_j say, summarizes the m_j observations at the jth site. In
almost all cases the estimates will have a joint asymptotic covariance
matrix of the form m_j⁻¹ M(γ_j). Variation in the entries of M is often small
compared to variation in the m_j, which will be heavily dependent on the
distance of the location from the source. Thus the uncertainty attached to
the fitted distribution may differ substantially from one location to
another.
Although the values of the γ̂_j may be of interest in themselves, it will
often be useful to summarize how the distributions depend on such factors
as distance from the source, meteorological factors, and pollutant characteristics.
Variation in the estimates γ̂_j may then be of only indirect interest.
The Weibull distribution, for example, is often parametrized in terms of
two parameters γ₁, γ₂ > 0 for which its density is defined on y > 0, but in
applications the parameters of direct interest are usually the mean and
variance, κ₁ and κ₂ say,
of the distribution of exposures; these are functions of γ₁ and γ₂ that
involve the gamma function

\Gamma(z) = \int_0^\infty u^{z-1} e^{-u} \, du

Since the mean and variance κ₁ and κ₂ are generally more capable of
physical interpretation than γ₁ and γ₂, it makes sense to model their
variation directly. A form whereby κ₁ and κ₂ depend on the covariates
may be suggested by exploration of the data, but more satisfactory
covariates are likely to be derived if a heuristic argument for their form
based on physical considerations can be found. Once suitable combinations
of covariates are found, the estimated means and variances κ̂_{j1} and
κ̂_{j2} can be regressed on them with weights m_j. A more refined approach
would use the estimated covariance matrices and multivariate regression17
for the pairs (κ̂_{j1}, κ̂_{j2}).
This example illustrates two general points. The first is that in combining
estimates from different samples with differing sample sizes it is
necessary to take the sample sizes into account. The second is that when
empirical distributions are replaced by fitted probability models it is wise
to present results in a parametrization of the model which is readily
interpreted on subject-matter grounds, and this may not coincide with a
mathematically convenient parametrization.

Section 5 deals with further generalizations of the linear model.

4 ANALYSIS OF VARIANCE

The idea of decomposition of the variability in a dataset arose in Section 3


in the context of linear regression. This section discusses the correspond-
ing decomposition when the explanatory variables represent qualitative
effects rather than quantitative covariates. The simplest example is the
one-way layout. Suppose that m independent measurements have been
taken at each of p different sites, so that n = mp observations are available
in all. If the variances of the observations at the sites are equal, the jth
observation at the rth site may be written

Y_{jr} = \beta_r + \epsilon_{jr},   j = 1, ..., m,  r = 1, ..., p    (17)

where the ε_jr are supposed to be independent errors with zero means and
variances σ², and the parameter β_r represents the mean at the rth site. In

terms of matrices this model may be expressed as

\begin{pmatrix} Y_{11} \\ \vdots \\ Y_{m1} \\ Y_{12} \\ \vdots \\ Y_{m2} \\ \vdots \\ Y_{1p} \\ \vdots \\ Y_{mp} \end{pmatrix} =
\begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} +
\begin{pmatrix} \epsilon_{11} \\ \vdots \\ \epsilon_{m1} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{m2} \\ \vdots \\ \epsilon_{1p} \\ \vdots \\ \epsilon_{mp} \end{pmatrix}    (18)

which has the same linear structure, Y = Xβ + ε, as eqn (6). Here


however the columns of the n x p matrix X are dummy variables indicat-
ing the presence or absence of a parameter in the linear expression for the
corresponding response variable. The qualitative nature of the columns
of the covariate matrix in eqn (18) contrasts with eqn (5), where the
co variates represent the quantitative effect of time on the mean response.
The qualitative nature of the explanatory variables has no effect on the
algebra of least squares estimation, however, and estimates of the /3r are
obtained as before. Thus P = (XT X)-I X T Y, and in fact the estimate of
/3r for the one-way layout is the obvious, namely the average observed at
the rth site,)if = m- I :Ej Yjr' The results in Section 3 on confidence inter-
vals, tests, model-checking, and so forth apply directly because the linear
structure of the model is unaffected by the nature of the covariates.
The model which asserts that the mean observation is the same for each
site contains a single parameter β representing the overall mean, and is a
restriction of eqn (17) that insists that β₁ = β₂ = ... = β_p = β. The
matrix of covariates for this model consists of a single column of ones, and
the estimate of β is the overall average ȳ_{··} = n⁻¹ Σ_{j,r} Y_{jr}. This model is
said to be nested within the model represented by eqn (18), which reduces
to it when the parameters of eqn (18) are constrained to be equal. The
difference in degrees of freedom between the models is p − 1, and the
analysis of variance table is given in Table 7.
TABLE 7
Analysis of variance table for a one-way layout

Source                             df          Sum of squares                        Mean square
Between sites (adjusted for mean)  p − 1       SS_Reg = Σ_{j,r} (ȳ_{·r} − ȳ_{··})²    SS_Reg/(p − 1)
Within sites                       p(m − 1)    SS_Res = Σ_{j,r} (Y_{jr} − ȳ_{·r})²    SS_Res/{p(m − 1)}
Total (adjusted for mean)          mp − 1      Σ_{j,r} (Y_{jr} − ȳ_{··})²

In this table SS_Reg represents variability due to differences between sites
adjusted for the overall mean, and SSRes represents variability at sites
adjusted for their different means. This corresponds to the decomposition

Yj' = Y. + (y., - yJ + (Yj, - y,)


in which the terms on the right-hand side correspond successively to the
overall mean, the difference between the site mean and the overall mean,
and the difference of the jth observation at the rth site from the mean
there. If the differences between sites are non-zero, SSReg will be inflated
relative to its probable value if there were no differences. Assessment of
whether between-site differences are important is based on

$$ \frac{\mathrm{SSReg}/(p-1)}{\mathrm{SSRes}/\{p(m-1)\}} $$

which would have an F distribution with p − 1 and p(m − 1) degrees of


freedom if there were no differences between sites and the errors were
independent and normal.
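This F ratio is easy to compute directly. The short sketch below, in Python with NumPy and SciPy, carries out the one-way analysis of variance of Table 7 on a small invented data set; the array y and the choices of p, m and the site means are illustrative assumptions, not data from the text.

```python
# A minimal sketch of the one-way analysis of variance in Table 7.
# The data array `y` (p sites by m observations) is invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p, m = 4, 6                                   # p sites, m observations per site
y = rng.normal(loc=[[10.0], [12.0], [9.0], [11.0]], scale=2.0, size=(p, m))

grand_mean = y.mean()
site_means = y.mean(axis=1)                   # the estimates of beta_r

ss_reg = m * np.sum((site_means - grand_mean) ** 2)    # between sites
ss_res = np.sum((y - site_means[:, None]) ** 2)        # within sites

f_stat = (ss_reg / (p - 1)) / (ss_res / (p * (m - 1)))
p_value = stats.f.sf(f_stat, p - 1, p * (m - 1))
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```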
Equation (17) represents one way to write down the model of different
means at each site. Another possibility is

$$ Y_{jr} = \alpha + \gamma_r + \varepsilon_{jr}, \qquad j = 1, \ldots, m, \quad r = 1, \ldots, p \qquad (19) $$

in which the overall mean is represented by α and the parameters γ_1, ..., γ_p
represent the differences between the site means and the overall mean. This
formulation of the problem contains p + 1 parameters, whereas eqn (17)
contains p parameters. Plainly p + 1 parameters cannot be estimated
from data at p sites, and at first sight it seems that eqns (17) and (19) are
incompatible. This difficulty is resolved by noting that only certain linear
combinations of parameters can be estimated from the data; such com-
binations are called estimable. In eqn (17) the parameters β_1, ..., β_p are
estimable, but in eqn (19) only the quantities α + γ_1, α + γ_2, ..., α + γ_p
are estimable, so that estimates of the parameters α, γ_1, ..., γ_p cannot all
be found without some constraint to ensure that the estimates are unique.
Possible constraints that can be imposed are:

(a) α̂ = 0, which gives the same estimates as eqn (17);
(b) γ̂_1 = 0, so that α̂ represents the mean at site 1, and γ̂_r represents the
difference between the mean at site r and at site 1; and
(c) Σ_r γ̂_r = 0, so that α̂ is the average of the site means and γ̂_r is the
difference between the mean at the rth site and the overall mean.

Equation (19) leads to an n × (p + 1) matrix of covariates

$$
X =
\begin{pmatrix}
1 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{pmatrix}
\qquad (20)
$$

which has rank p because its first column equals the sum of the other
columns. Thus the (p + 1) × (p + 1) matrix X^T X is not invertible and
β̂ = (X^T X)^{-1} X^T Y cannot be found. This difficulty can be overcome by
the use of generalized inverse matrices, but it is more satisfactory in
practice to force X to have rank p by dropping one of its columns.
Constraints (a) and (b) above correspond respectively to dropping the
first and second columns of eqn (20).
It is important to appreciate that the parametrization in which a model
is expressed has no effect on the fitted values, which
are the same whatever parametrization is used. Thus the fitted value for
an observation at the rth site in the example above is the average at that
site, ȳ_r, for parametrization (17) or (19) and any of the constraints (a), (b)
or (c). The parametrization used affects only the interpretation of the
resulting estimates and not any aspect of model fit.
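This point can be checked numerically. The sketch below, using invented data, builds the dummy-variable matrix of eqn (18) and the design matrix obtained under constraint (b), and confirms that the two least squares fits give identical fitted values; all names and numbers in it are illustrative assumptions.

```python
# A small sketch, with invented data, showing that different parametrizations
# of the one-way layout give identical fitted values.  X1 uses the dummy
# columns of eqn (18); X2 uses constraint (b): a column of ones plus dummies
# for sites 2, ..., p.
import numpy as np

rng = np.random.default_rng(0)
p, m = 3, 5
site = np.repeat(np.arange(p), m)            # site label for each observation
y = rng.normal(10 + 2 * site, 1.0)

X1 = (site[:, None] == np.arange(p)).astype(float)          # eqn (18) form
X2 = np.column_stack([np.ones(p * m),                       # overall level
                      (site[:, None] == np.arange(1, p)).astype(float)])

fit1 = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
fit2 = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
print(np.allclose(fit1, fit2))               # True: the fitted values coincide
```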

Example 6. As part of a much larger investigation into the atmospheric


chemistry of vehicle emissions, data were collected on a number of sub-
stances in the atmosphere at the side of a London road. Table 8 gives
hourly average levels of toluene (ppb) for the hours 8-9, 9-10, ... , 15-16,
measured on each day over a 2 week period. Here we shall be interested
in modelling temporal changes in toluene levels, without regard to other
pollutants, although in practice the levels of different pollutants are closely
related. Some of the observations are missing due to machine breakdown
and other effects.

TABLE 8
Hourly average toluene levels (ppb) in London street

Start of hour

8 9 10 11 12 13 14 15

Week 1
Sunday 4·56 4·05 4·22 9·82 4·84 6·59 4·95 5·41
Monday 19·55 9·48 7·24
Tuesday 11·86 15·13 12·13 7·67 7·08 9·55 7·21 8·31
Wednesday 21·68 22·13 18·40 12·16 13·35 11·42 13·80 10·13
Thursday 20·01 28·15 19·59 24·39 25·18 20·44 23·06 26·96
Friday 45·42 57·78 42·45 74·78 75·16 61·90 46·25 52·38
Saturday 4·23 8·65 15·64 14·29 17·02 23·22 18·35 14·14

Week 2
Sunday 4·46 5·59 6·48 5·86 6·27 3·68 3·93 6·29
Monday 5·41 5·41 6·37 7·51 6·40 6·01 6·01 6·22
Tuesday 10·80 11·27 10·34 11·62 11·41 11·32 12·43 15·60
Wednesday 5·90 29·20 21·81 32·61 18·19 18·95 19·05 19·41
Thursday 12·62 14·33 10·24 8·66 14·69 10·50 14·65 14·83
Friday 8·46 11·61 11·90 11·94 10·12 11·41 9·28 11·51
Saturday 15·88 14·37 12·81 12·17 12·97 13·09 12·72

One approach to detailed analysis of such data would be via time series
analysis, but this is problematic here since the data consist of a number
of short series with large gaps intervening. The aim instead is to sum-
marize the sources of variability by investigating the relative import-
ance of hourly variation and daily variation. Hourly variation might be
due to variable traffic intensity, as well as diurnal changes in temperature
and other meteorological variables, whereas day-to-day variation may be
due to different traffic intensity on different days of the week as well as the
weather.
Some possible models for the data in Table 8 are that Y_{wdt}, the toluene
level observed at time t of day d of week w, has an expected value given
by

$$
\begin{aligned}
&\alpha \\
&\alpha + \beta_d \\
&\alpha + \beta_d + \gamma_t \\
&\alpha + \beta_{wd} + \gamma_t
\end{aligned}
\qquad (21)
$$

TABLE 9
Analysis of variance for toluene data

Source                      df    Sum of squares    Mean square

Day of week                  6          8263            1373
Time of day                  7           196·7            28·1
Days                         7          9254            1322
Residual                    85          2116·8            24·9

Total (adjusted for mean)  105         19830

The parameter α corresponds to an overall mean, β_d to the difference
between the mean on day d and the overall mean, γ_t to the difference
between the mean at time t and the overall mean, and β_{wd} to the difference
between the mean on day d of week w and the overall mean. The first
model in (21) is that the data all have the same mean. The second is that there
is a day-of-the-week effect but no variation between hours on the same
day; note that this implies that variation is due to different traffic patterns
on different days of the week, but that all weeks have the same pattern, and
that there is no variation between hours on the same day. The third model
is that there is a day-of-the-week effect, and also variation between times
of the day. The fourth model is that there is variation between times, but
that the between-day variation is due not to causes which remain unvary-
ing from week to week, but to other causes, possibly overall meteorologi-
cal changes. This would not imply that every Monday, Tuesday, and so
forth had the same average toluene levels.
Table 9 shows the analysis of variance when the models in (21) are
successively fitted to the data. The seven extra degrees of freedom for days
correspond to the extra day parameters which are fitted in the final model
in (21) but not in the previous one. The table shows large differences due
to days, but no significant effect for times of the day once differences
between days are taken into account. The between-day variation is not due
solely to a day-of-the-week effect, as the final model in (21) gives a big
reduction in overall sum of squares. The between-day mean squares in the
table are very large compared to the residual mean square, which gives an
estimate of σ² of 24·9 on 85 degrees of freedom.
Figure 15 shows the average levels of toluene for the different days. The
two lowest values are for Sundays, which supports the contention that the
levels are related to traffic intensity, but the variability between days is very
substantial, and no definite weekly pattern emerges. Considerably more
data would be required to discern a day-of-the-week effect with any
confidence. The estimates of the hour-of-the-day effects are all ±2 ppb or
so, and are very small compared to the variation between days.

Fig. 15. Daily average toluene level (ppb). Sundays (days 1 and 8) have the
lowest values.
Figure 16 shows a plot of standardized residuals against fitted values for
the final model in (21). There is an increase of variability with fitted value,
and this implies that the assumption of constant variance is inappropriate.
This example is returned to in Section 5.

5 GENERALIZED LINEAR MODELS

5.1 The components of a generalized linear model


Linear regression is the most popular tool in applied statistics. However,
there are difficulties in using it with data in the form of counts or propor-
tions, not because it cannot be applied to the data, perhaps suitably
transformed, but because of difficulties in the interpretation of parameter
estimates. An important extension of linear models is the class of
generalized linear models, in which this problem may be overcome. 6,18
Fig. 16. Residual plot (standardized residuals against fitted values ŷ_j) for fit of
the linear model with normal errors to the toluene data. The tendency for
variability to increase with fitted value suggests that the assumption of constant
variance is inappropriate.

The normal linear model with normal-theory assumptions outlined in
Section 3.1 can be thought of as having three components:
(1) the observation Y has a normal density with mean μ and variance σ²;
(2) a linear predictor η = x^T β through which covariates enter the
model; and
(3) a function linking μ and η, which in this case is μ = η.
In a generalized linear model, (1) and (3) are replaced by:

(1') the observation Y has a density function of form


$$ f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \qquad (22) $$

from which it can be shown that the mean and variance of Y are
μ = b'(θ) and a(φ)b''(θ); and
(3') the mean μ of Y is connected to the linear predictor η by η = g(μ),
where the link function g(·) is a differentiable monotonic increasing
function.
The Poisson, binomial, gamma and normal distributions are among
those which can be expressed in the form (22). These allow for the
modelling of data in the form of counts, proportions of counts, positive
continuous measurements, or unbounded continuous measurements
respectively.
We write a(φ) = φ/w, where φ is the dispersion parameter and w is a
weight attached to the observation. The quantity a(c/J) is related to the
second parameter which appears in the binomial, normal, and gamma

distributions. If Y_j is the average of m_j independent normal variables each
with variance σ², for example, we write a(φ) = φ/w for the jth obser-
vation, with φ = σ² and w = m_j, to take account of the fact that Y_j has
weight m_j.
The parameter θ in eqn (22) is related implicitly to the linear predictor
η = x^T β. Since μ = b'(θ) and the linear predictor is η = g(μ) in terms of
the mean, μ, we see that η = g{b'(θ)}. If we invert this relationship we can
write θ as a function of η. Hence eqn (22) can be expressed as a function
of β and x, which are of primary interest, rather than of θ, which enters
mainly for mathematical convenience. Thus the logarithm of eqn (22) can
be written as

$$ \ell(\eta, \phi) = \frac{y\,\theta(\eta) - b\{\theta(\eta)\}}{a(\phi)} + c(y, \phi) \qquad (23) $$

in terms of η = x^T β, and θ(η) denotes that θ is regarded here as a function
of η and through η of β. This expression arises below in connection with
estimation of the parameters fl.
For each density which can be written in form of eqn (22), there is one
link function for which the model is particularly simple. This, the canonical
link function, for which θ(η) = η, is of some theoretical importance.
The value of the link function is that it can remove the need for the data
to be transformed in order for a linear model to apply. Consider data
where the response consists of counts, for example. Such data usually have
variance proportional to their mean, which suggests that suitable models
may be based on the Poisson distribution (eqn (16)). Direct use of the
linear model (eqn (6)) would, however, usually be inappropriate for two
reasons. Firstly, if the counts varied substantially in size the assumption
of constant variance would not apply. This could of course be overcome
by use of a variance-stabilizing transformation, such as the square root
transformation derived in Section 3.5, but the difficulties of interpretation
mentioned there would arise. Second, a linear model fitted to the counts
themselves would lead to the unsatisfactory possibility of negative fitted
means. The link function can remove these difficulties by use of the
Poisson distribution with mean μ = e^η, which is positive whatever the
value of η. This model corresponds to the logarithmic link function, for
which η = log μ. This is the canonical link function for the Poisson
distribution. When this link function is used, the effect of increasing η by
one unit is to increase the mean value of the response by a factor e. Such
a model is known as a log-linear model. Since the Poisson probability

density function (16) can be written as


$$ f(y) = \exp(y \log \mu - \mu - \log y!) $$

we see by comparison with eqn (22) that θ = log μ, b(θ) = e^θ, a(φ) = 1,
and c(y, φ) = −log y!. The mean and variance of Y both equal μ = e^θ.
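As a quick numerical check of this identification, the sketch below evaluates the exponential-family form (22) with θ = log μ, b(θ) = e^θ, a(φ) = 1 and c(y, φ) = −log y!, and compares it with a standard Poisson log probability routine; the value of μ is an arbitrary illustrative choice.

```python
# A quick check that the exponential-family form (22), with theta = log(mu),
# b(theta) = exp(theta), a(phi) = 1 and c(y, phi) = -log(y!), reproduces the
# Poisson log probability.
import numpy as np
from scipy import stats
from scipy.special import gammaln

mu = 3.7                                     # arbitrary illustrative mean
y = np.arange(0, 10)

theta = np.log(mu)
log_f = y * theta - np.exp(theta) - gammaln(y + 1)       # eqn (22) with a(phi) = 1
print(np.allclose(log_f, stats.poisson.logpmf(y, mu)))   # True
```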
It has already been pointed out that distributions whose density func-
tion is of form eqn (22) have mean μ = b'(θ) and variance a(φ)b''(θ). It
follows that since θ can be expressed as b'^{-1}(μ), the variance can be
expressed in terms of the mean μ as a(φ)b''{b'^{-1}(μ)} = a(φ)V(μ), say,
where V(μ) is the variance function, an important characteristic of the
distribution. The mean and variance of the Poisson distribution are equal,
so that this distribution has V(μ) = μ. Similarly V(μ) = 1 for the normal
distribution, for which there is no relation between the mean and variance.
However data arise with a variety of mean-variance relationships, and it
is useful that several of these are encompassed by models of form eqn (22).
It is frequently observed that the ratio of the variance to the square of
the mean is roughly constant across related samples of positive continuous
data. A distribution with this property is the gamma distribution. Its
probability density function is

$$ f(y; \mu, \nu) = \frac{1}{\Gamma(\nu)}\,\frac{\nu^{\nu}}{\mu}\left(\frac{y}{\mu}\right)^{\nu-1}\exp(-\nu y/\mu), \qquad (y > 0;\ \mu, \nu > 0) $$

The two parameters μ and ν are respectively the mean and shape parameter
of the distribution, which is flexible enough to take a variety of shapes. For
ν < 1 it has the shape of a reversed 'J', but with a pole at y = 0; for
ν = 1, the distribution is exponential; and when ν > 1 the distribution is
peaked, approaching the characteristic bell-shaped curve of the normal
distribution for large ν. The gamma distribution has variance μ²/ν, so its
variance function is quadratic, i.e. V(μ) = μ². Comparison with eqn (22)
shows that a(φ) = 1/ν, θ = −1/μ, b(θ) = −log(−θ), and c(y; φ) =
ν log(νy) − log y − log Γ(ν). Various link functions are possible, of
which the logarithmic, for which g(μ) = log μ = η, or the canonical link,
the inverse, with g(μ) = 1/μ = η, are most common in practice.

Example 6 (contd). We saw in Section 4 that the variance of the


toluene data apparently depends on the mean. Figure 17 shows a plot of
log variance against log mean for each of the 14 days. The slope is close
to two, which suggests that the variance is proportional to the square of
the mean and that the gamma distribution may be suitable. The con-
clusion that between-day variation is important but that between-time
variation is not important is again reached when the models with gamma
error distribution, log link function and linear predictors in eqn (21) are
fitted to the data. Residual plots show that this model gives a better fit to
the data than a model with normal errors.

Fig. 17. Plot of log variance against log average for each day for toluene data.
The linear form of the plot shows a relation between the average and variance.
The slope is about 2, indicating that the coefficient of variation is roughly
constant.
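The slope-of-two behaviour in Fig. 17 is what would be expected if the data were gamma distributed with a common shape parameter. The following simulation sketch illustrates the quadratic variance function V(μ) = μ²; the means, shape parameter and sample sizes are invented, and are not the toluene data.

```python
# A simulation sketch showing that samples from gamma distributions with a
# common shape parameter nu have log(variance) increasing with slope about 2
# in log(mean), i.e. V(mu) = mu^2.  All values are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
nu = 4.0                                        # common shape parameter
means = np.array([5.0, 10.0, 20.0, 40.0, 80.0])

sample_means, sample_vars = [], []
for mu in means:
    y = rng.gamma(shape=nu, scale=mu / nu, size=500)   # mean mu, variance mu^2/nu
    sample_means.append(y.mean())
    sample_vars.append(y.var(ddof=1))

slope = np.polyfit(np.log(sample_means), np.log(sample_vars), 1)[0]
print(f"slope of log variance on log mean: {slope:.2f}")   # close to 2
```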
Apart from the normal, Poisson and gamma distributions, the most
common generalized linear models used in practice are based on the
binomial distribution, an application of which is described in Section 5.5.

5.2 Estimation
The parameters β of a generalized linear model are usually estimated by
the method of maximum likelihood. The log likelihood is

$$ L(\beta) = \sum_{j=1}^{n} \log f\{y_j; \theta_j(\beta), \phi\} \qquad (24) $$

where f is defined in eqn (22). If we assume that φ is known, a maximum
of the likelihood is determined by the likelihood equation

$$ \frac{\partial L(\beta)}{\partial \beta} = 0 \qquad (25) $$

which must be solved iteratively. For this problem and many others,

Newton's method can be rewritten as a weighted least squares algorithm19


in which the weights are updated at each iteration. The derivation, given
below, involves more matrix algebra than other sections of this chapter
and may be omitted at a first reading.
The fitting algorithm is derived as follows. First, note that eqn (25) can
be written as

$$ \frac{\partial \eta^{T}}{\partial \beta}\,\frac{\partial L}{\partial \eta} = 0 \qquad (26) $$

evaluated at the overall maximum likelihood estimates β̂. The jth element
of the n × 1 vector of linear predictors, η, is x_j^T β, so η = Xβ, and there-
fore the p × n matrix ∂η^T/∂β equals X^T. Also the n × 1 vector ∂L/∂η has
jth element

$$ \frac{\partial \log f\{y_j; \theta_j(\eta_j), \phi\}}{\partial \eta_j} = \frac{\partial \theta_j}{\partial \eta_j}\,\frac{y_j - b'(\theta_j)}{a(\phi)} \qquad (27) $$

Newton's method applied to eqn (25) involves a first-order Taylor series
expansion of ∂L/∂β about an initial value of β. Thus

$$ 0 = \frac{\partial L}{\partial \beta}(\hat\beta) \approx \frac{\partial L}{\partial \beta}(\beta) + \frac{\partial^2 L(\beta)}{\partial \beta\,\partial \beta^{T}}(\hat\beta - \beta) \qquad (28) $$

where ∂²L(β)/∂β∂β^T is the p × p matrix whose (r, s) element is

$$ \frac{\partial^2 L(\beta)}{\partial \beta_r\,\partial \beta_s} = \sum_{j=1}^{n} x_{jr} x_{js}\,\frac{\partial^2 \log f\{y_j; \theta_j(\eta_j), \phi\}}{\partial \eta_j^{2}} $$

and x_{jr} is the rth element of the p × 1 covariate vector x_j, and so forth.
In the derivation of Newton's method, the right-hand side of eqn (28) is
now rearranged by writing

$$ \hat\beta \approx \beta - \left\{\frac{\partial^2 L(\beta)}{\partial \beta\,\partial \beta^{T}}\right\}^{-1} \frac{\partial L}{\partial \beta}(\beta) \qquad (29) $$

whose right-hand side depends only on quantities evaluated at β.


However, there can be numerical difficulties associated with the use in the
algorithm of the p × p matrix of second derivatives

$$ -\frac{\partial^2 L}{\partial \beta\,\partial \beta^{T}} $$

and it is replaced by its expected value, which can be written in the form
X^T W X, where W is the n × n matrix with zeros off the diagonal and jth

diagonal element

$$ W_j = E\left[-\frac{\partial^2 \log f\{y_j; \theta_j(\eta_j), \phi\}}{\partial \eta_j^{2}}\right] $$

For distributions whose density function can be written in the form
eqn (22), it turns out that W_j equals (∂μ_j/∂η_j)²/{a(φ)V(μ_j)} in terms of the
mean and variance function μ_j and V(μ_j) of the jth observation.
From eqns (26), (27) and (29) we see that

$$ \hat\beta \approx \beta + (X^{T}WX)^{-1}X^{T}\frac{\partial L}{\partial \eta} = (X^{T}WX)^{-1}\left(X^{T}WX\beta + X^{T}\frac{\partial L}{\partial \eta}\right) = (X^{T}WX)^{-1}X^{T}Wz \qquad (30) $$

where

$$ z = X\beta + W^{-1}\frac{\partial L}{\partial \eta} = \eta + W^{-1}(y - \mu)/a(\phi) $$

is an n × 1 vector known as the adjusted dependent variable. We see that
β̂ is obtained as the result of applying the weighted least squares algorithm
of Section 3.6 with matrix of covariates X, weight matrix W, and response
variable z. The solution to eqn (25) is not usually obtained in a single step,
and the value of β̂ is obtained by repeated application of eqn (30). This
iterative weighted least squares algorithm can be set out as follows:
(1a) First time through, calculate initial values for the mean vector μ
and the linear predictor vector η based on the observed y. Go to (2).
(1b) If not the first time through, calculate μ and η from the current β̂.
(2) Calculate the weight matrix W and the adjusted dependent variable
z from μ and η.
(3) Regress z on the columns of X using weights W, to obtain a current
vector of parameter estimates β̂.
(4) Decide whether to stop based on the change in L(β̂) or in β̂ from
the previous iteration; if the decision is to continue, go to (1b) and repeat
until the change in L(β̂) or in β̂ between two successive iterations
is sufficiently small.
This algorithm gives a flexible and rapid method for maximum like-
lihood estimation in a wide variety of regression problems.19
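A minimal sketch of this iterative weighted least squares algorithm is given below for the special case of a Poisson log-linear model with the canonical link, for which a(φ) = 1, W_j = μ_j and the adjusted dependent variable reduces to z_j = η_j + (y_j − μ_j)/μ_j. The covariate and counts are invented for illustration, and the final line computes the approximate covariance matrix (X^T W X)^{-1} used for the confidence intervals discussed next.

```python
# A minimal sketch of the iterative weighted least squares algorithm of
# Section 5.2 for a Poisson log-linear model (canonical link, a(phi) = 1).
# The covariate x and counts y are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])            # intercept and slope
y = rng.poisson(np.exp(0.5 + 1.5 * x))

beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta                                   # linear predictor
    mu = np.exp(eta)                                 # inverse log link
    W = mu                                           # (dmu/deta)^2 / V(mu) = mu
    z = eta + (y - mu) / mu                          # adjusted dependent variable
    WX = X * W[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # weighted least squares step
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

cov = np.linalg.inv(X.T @ (X * np.exp(X @ beta)[:, None]))   # approx. cov of beta-hat
print(beta, np.sqrt(np.diag(cov)))
```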

Estimation of the dispersion parameter φ, where necessary, is discussed
in Ref. 6. The dispersion parameter is known for the Poisson and binomial
distributions, and for normal data φ = σ² is estimated, as usual, using the
residual sum of squares. For the gamma distribution, estimation of φ is
equivalent to estimation of the shape parameter ν.
Confidence intervals for the components of β are based on the result
that in large samples β̂_r is approximately normally distributed with mean
β_r and variance v_rr, where v_rr is the rth diagonal element of the p × p
matrix (X^T W X)^{-1}, evaluated at β̂. Similarly the approximate covariance
of β̂_r and β̂_s is the (r, s) element of (X^T W X)^{-1} evaluated at β̂. These results
generalize those in Section 3.1.

5.3 The deviance and model fit


In a linear model with normal errors the residual sum of squares is used
to measure the effect of adding covariates to the model. For a generalized
linear model the corresponding quantity for a model with parameter
estimates β̂ is the deviance, which is defined as

$$ D(\hat\beta) = 2\phi \sum_{j=1}^{n} \{\tilde\ell_j - \ell_j(\hat\beta)\} \qquad (31) $$

Here ℓ_j(β̂) = log f{y_j; θ_j(β̂), φ}, from eqn (24), φ is the dispersion parameter,
and ℓ̃_j is the biggest possible log likelihood attainable by the jth obser-
vation. The deviance is non-negative, and for a normal linear model it
equals the residual sum of squares for the model, SSRes.
used to compare models, and to judge the overall adequacy of a model.
As in Section 4, model M_0 is said to be nested within model M_1 if M_1 can
be reduced to M_0 by restrictions on the values of the parameters. Consider
for example a model with Poisson error distribution, log link function, and
linear predictor η_0 = β_0 + β_1 z_1. This is nested within the model with
linear predictor η_1 = β_0 + β_1 z_1 + β_2 z_2, which reduces to η_0 if β_2 = 0.
However η_1 cannot be reduced to η_2 = β_0 + β_3 z_3 by restrictions on β_0, β_1
and β_2. The corresponding models are not nested, although η_2 has fewer
parameters than η_1.
If there are p_0 unknown parameters in M_0 and p_1 in M_1 and the models
are nested, the degrees of freedom n − p_0 of M_0 exceeds the degrees of
freedom n − p_1 of M_1. General statistical theory then indicates that for
binomial and Poisson data the difference in deviances D(M_0) − D(M_1),
which is necessarily non-negative, has a chi-squared distribution on
p_1 − p_0 degrees of freedom if model M_0 is in fact adequate for the data.
A difference D(M_0) − D(M_1) which is large relative to that distribution

may indicate that M_1 fits the data substantially better than M_0. For
normal models, inference proceeds based on F-statistics, as outlined in
Section 3.
The forward selection, backwards elimination, and stepwise regression
algorithms described in Section 3.4 can be used for selection of a suitable
generalized linear model, though the caveats there continue to apply. The
role of the residual sum of squares is taken by the deviance, and reductions
in deviance are judged to be significant or not, relative to the appropriate
chi-squared distribution.
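The sketch below illustrates such a comparison for two nested Poisson log-linear models, using the statsmodels package as one readily available GLM implementation (an illustrative choice, not a package discussed in the text); the data and the irrelevant covariate z2 are invented.

```python
# A sketch of comparing nested Poisson log-linear models by their change in
# deviance.  The data are invented, and z2 is a covariate with no real effect.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 100
z1 = rng.uniform(size=n)
z2 = rng.uniform(size=n)
y = rng.poisson(np.exp(0.2 + 1.0 * z1))           # z2 genuinely irrelevant

X0 = sm.add_constant(np.column_stack([z1]))       # model M0
X1 = sm.add_constant(np.column_stack([z1, z2]))   # model M1 (M0 nested within it)

fit0 = sm.GLM(y, X0, family=sm.families.Poisson()).fit()
fit1 = sm.GLM(y, X1, family=sm.families.Poisson()).fit()

drop = fit0.deviance - fit1.deviance              # D(M0) - D(M1) >= 0
p_value = stats.chi2.sf(drop, df=1)               # chi-squared on p1 - p0 = 1 df
print(f"deviance change = {drop:.2f}, p = {p_value:.3f}")
```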
Under some circumstances the deviance has a χ²_{n−p_1} distribution if model
M_1 is in fact adequate for the data. For Poisson data the deviance has an
approximate chi-squared distribution if a substantial number of the indi-
vidual counts are fairly large. For binomial data the distribution is
approximately chi-squared if the denominators mj are fairly large and the
binomial numerators are not all very close to zero or to the denominator.
For normal models with known σ² the deviance has an exact chi-squared
distribution when the model is adequate, and the distribution is approxi-
mately chi-squared for gamma data when v is known.

5.4 Residuals and diagnostics


The residuals and measures of leverage and influence outlined in Section
3.4 can be extended to generalized linear models. Thorough surveys are
given in Refs 6 and 20.
The most useful general definition of a residual for a generalized linear
model is based on the idea of comparing the deviances for the same model
with and without an observation. This is analogous to the jackknifed
residual described in Section 3.4. The change in deviance when the jth
observation is deleted from the model is approximately the square of the
jackknifed deviance residual
$$ R_j = \mathrm{sign}(y_j - \hat\mu_j)\left(d_j^{2} + h_{jj}\, r_{Pj}^{2}\right)^{1/2} \qquad (32) $$

In eqn (32) d_j² is the contribution to the deviance from the jth obser-
vation, i.e. d_j² = 2φ{ℓ̃_j − ℓ_j(β̂)}, and h_jj is given by the jth diagonal element
of H = W^{1/2}X(X^T W X)^{-1}X^T W^{1/2}, where W is the diagonal matrix of

weights Wj evaluated at the current model. The quantity


$$ r_{Pj} = \frac{\partial \log f\{y_j; \theta_j(\eta_j), \phi\}/\partial \eta_j}{\{W_j(1 - h_{jj})\}^{1/2}} = \frac{y_j - \hat\mu_j}{\{a(\phi)V(\hat\mu_j)(1 - h_{jj})\}^{1/2}} \qquad (33) $$

is known as a standardized Pearson residual.
The R_j can be calibrated by reference to a normal distribution with
unit variance and mean zero. Observations whose residuals have values
that are unusual relative to the standard normal distribution merit
the close scrutiny that such observations would get in a linear model.
For most purposes R_j may be used to construct the plots described in
Section 3.4.
A useful measure of approximate leverage in a generalized linear model
is h_jj as defined above, and the measure of influence is the approximate
Cook statistic

$$ C_j = \frac{h_{jj}}{p(1 - h_{jj})}\, r_{Pj}^{2} $$

where p is the dimension of the parameter vector β. Both h_jj and C_j may
be used as described in Section 3.4.
The definitions of R_j, h_jj and C_j given here reduce to those given in
Section 3.4 for normal linear models.
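The following sketch computes these quantities for a Poisson log-linear model fitted to invented data: the leverages h_jj from H, the standardized Pearson residuals of eqn (33), the approximate jackknifed deviance residuals of eqn (32) and the approximate Cook statistics. The Poisson deviance contributions d_j² = 2{y_j log(y_j/μ̂_j) − (y_j − μ̂_j)} are used, and statsmodels is used only to obtain the fitted means; everything else in the example is an assumption made for illustration.

```python
# A sketch (invented data, Poisson log-linear model) of the diagnostics of
# Section 5.4: leverages h_jj, standardized Pearson residuals, approximate
# jackknifed deviance residuals R_j and Cook statistics C_j.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(size=60)
X = sm.add_constant(x)
y = rng.poisson(np.exp(1.0 + 1.0 * x))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = fit.fittedvalues
W = mu                                              # Poisson, log link: W_j = mu_j
p = X.shape[1]

WX = X * np.sqrt(W)[:, None]                        # W^{1/2} X
H = WX @ np.linalg.inv(WX.T @ WX) @ WX.T
h = np.diag(H)                                      # leverages h_jj

r_P = (y - mu) / np.sqrt(mu * (1 - h))              # standardized Pearson residual
with np.errstate(divide="ignore", invalid="ignore"):
    d2 = 2 * (np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu))  # deviance contributions
R = np.sign(y - mu) * np.sqrt(d2 + h * r_P**2)      # eqn (32)
C = h * r_P**2 / (p * (1 - h))                      # approximate Cook statistic

print(R[:5], C[:5])
```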

5.5 An application
The data in Table 2 are in the form of counts r_j of the numbers of days out
of m_j on which ozone levels exceed 0·08 ppm. One model for this is that the
r_j have a binomial distribution with probability π_j and denominator m_j:

$$ \Pr(R_j = r_j) = \binom{m_j}{r_j}\,\pi_j^{\,r_j}(1 - \pi_j)^{m_j - r_j}, \qquad r_j = 0, 1, \ldots, m_j, \quad 0 < \pi_j < 1 \qquad (34) $$

The mean of this distribution is m_jπ_j and its variance is m_jπ_j(1 − π_j). The
idea underlying the use of this distribution is that the probability that the
ozone level exceeds 0·08 ppm is π_j on each day of the jth month. Exceedances
are assumed to occur independently on different days, and the number of
days with exceedances in a month for which m_j days of data are recorded
then has the binomial distribution eqn (34). The probabilities π_j are likely
to depend heavily on time of year since there are marked seasonal vari-

ations in ozone levels, and a possible form for the dependence is suggested
by the following argument.
If exceedances over the threshold during month j occur at random as a
Poisson process with a positive constant rate λ_j exceedances per day, the
number of exceedances Z_j occurring on any day that month has the
Poisson distribution (16) with mean λ_j. Thus the probability of one or
more exceedances that day is

$$ \pi_j = \Pr(Z_j \geq 1) = 1 - \exp(-\lambda_j) $$

If λ_j = exp(x_j^T β), corresponding to a log-linear model for the rate of the
Poisson process, we find that the expected number of days per month on
which there are exceedances is

$$ \mu_j = m_j\pi_j = m_j[1 - \exp\{-\exp(x_j^T \beta)\}] $$

which corresponds to the complementary log-log link function, η =
x_j^T β = log{−log(1 − μ_j/m_j)}. This model automatically makes allow-
ance for the different numbers of days of recorded data in each month, and
would enable the data analyst to impute numbers of days with exceedances
even for months for which very little or even no data are recorded.2
Other link functions that are useful for the binomial distribution are the
probit, for which η = Φ^{-1}(μ/m), and the logistic, for which η = log
{μ/(m − μ)}. The canonical link function is the logistic, which is widely
used in applications but does not have the clear-cut interpretation that can
be given to the complementary log-log model in this context.
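The correspondence between the Poisson rate and the complementary log-log link can be verified in a few lines; the linear predictor values and the month length in the sketch below are arbitrary illustrative choices, not quantities from the ozone analysis.

```python
# A small numerical illustration of the argument above: a Poisson rate
# lambda_j = exp(x_j' beta) of exceedances per day gives a daily exceedance
# probability pi_j = 1 - exp(-lambda_j), and the complementary log-log link
# recovers the linear predictor exactly.
import numpy as np

eta = np.array([-3.0, -2.0, -1.0, 0.0])      # hypothetical linear predictors x_j' beta
lam = np.exp(eta)                            # exceedance rate per day
pi = 1.0 - np.exp(-lam)                      # pr(one or more exceedances in a day)
m = 30                                       # days of recorded data in the month
mu = m * pi                                  # expected days with exceedances

eta_back = np.log(-np.log(1.0 - mu / m))     # complementary log-log link
print(np.allclose(eta_back, eta))            # True
```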
The ozone levels in this example are thought to depend on several
effects. There is an overall effect due to differences between sites, which
vary in their proximity to sources of pollutants. There is possibly a similar
effect due to differences between years, which can be explained on the basis
of different weather patterns from one year to another. There may be
interaction between sites and years, corresponding to the overall ozone
levels showing different patterns at the two sites over the four years
involved. There is likely to be a strong effect of temperature, here used as
a surrogate for insolation. Furthermore an effect for the 'ozone season',
the months May-September, may also be required. Of these effects only
that due to temperature is quantitative; the remainder are qualitative
factors of the type discussed in Section 4.
Table 10 shows the contributions of these different terms towards
explaining the variation observed in the data. The table is analogous to an
analysis of variance table, but with the distinction that the deviance
explained by a term would be approximately distributed as chi-squared on

TABLE 10
Texas ozone data: deviance explained by model terms when entered in the
order given

Model term Degrees of freedom Deviance

Site                              1       0·17
Year                              3       1·00
Site-year interaction             3      13·34
Daily maximum temperature         1     103·16
Year-ozone season interaction     4      18·83

Residual                         74     143·55

the corresponding degrees of freedom if that term were not needed in the
model, and otherwise would tend to be too large. If the data showed no
effect of daily maximum temperature, for example, the deviance explained
in the table would be randomly distributed as chi-squared on one degree
of freedom. In fact 103·16 is very large relative to that distribution:
overwhelming evidence of a strong effect of daily maximum temperature.
The deviance explained by each set of terms except for differences
between sites and years is statistically significant at less than the 5% level.
The site and year effects are retained because of the interaction of site and
year: it makes little sense to allow overall rates of exceedance to vary with
year and site but to require that overall year effects and overall site effects
must necessarily be zero.
The coefficient of the monthly daily maximum temperature is approxi-
mately 0·07, which indicates that the rate at which exceedances occur
increases by a factor e^{0·07t} for each rise in temperature of t°C. Since the
temperature variation over the course of a year is about 15°C, the rate of
exceedances varies by a factor 2·7 or so due to annual fluctuations in air
temperature, regarded in this context as a surrogate for insolation. Other
parameter estimates can be interpreted in a similar fashion.
The size of the residual deviance compared to its degrees of freedom
suggests that the model is not adequate for the data. Standard theory
would suggest that if the model was adequate, the residual deviance would
have a chi-squared distribution on 74 degrees of freedom, but the observed
residual deviance of 143·55 is very large compared to that distribution.
However, in this example there is doubt about the adequacy of a chi-
squared approximation to the distribution of the residual deviance,
because many of the observed counts rj are rather small. Furthermore
some lack of fit is to be expected, because of the gross simplifications made
in deriving the model. There is evidently variation in the rates λ_j within
months, as well as a tendency for days with exceedances to occur together
due to short-term persistence of the weather conditions which generate
high ozone levels. These effects will generate data that are over-dispersed
relative to the model, and hence have a larger deviance than would be
expected.

Fig. 18. Fitted exceedance rate and observed frequency of exceedance for
Beaumont, 1981-1984. The fitted rate, λ̂_j (solid curve), smooths out fluctu-
ations in the frequencies, r_j/m_j (dots).
There is also the possibility that individual observations are poorly
fitted by the model. Figures 18 and 19 show plots of the estimated rates
based on the model, λ̂_j, and the observed proportions of days with
exceedances, r_j/m_j, for the two sites. There are clearly some months where
there are more exceedances than would be expected under the model, and
examination of residuals confirms this. In particular there are substantially
more days with exceedances at Beaumont in August and September 1981,
in September 1983, and in January and May 1984; and at North Port
Arthur in January 1984. Some of these outliers can be traced to specific
pollution incidents. Plots of residuals and measures of influence show the
characteristics, and the deviance drops considerably when they are deleted
from the data.
Fig. 19. Fitted exceedance rate and observed frequency of exceedance for
North Port Arthur, 1981-1984. The fitted rate, λ̂_j (solid curve), smooths out
fluctuations in the frequencies, r_j/m_j (dots).

5.6 More general models


The models described above can be broadened in several ways. One
generalization is to some types of correlated data, which in general leads
to a weight matrix W which is not diagonal.
A second possibility stems from the observation that the iterative
weighted least squares algorithm derived in Section 5.2 extends readily to
distributions which are not of form eqn (22) and is best illustrated by
example.

Example 1 (contd). We saw in Section 3 that the assumption of


normal errors is not suitable for the Venice sea level data, whose errors are
skewed to the right. The distribution of the maximum of a large number
of independent and identically distributed observations may be modelled
by the generalized extreme-value distribution

$$ H(y; \eta, \psi, k) = \exp\left[-\left\{1 - k\,\frac{y - \eta}{\psi}\right\}^{1/k}\right] \qquad (35) $$

over the range of y for which k(y − η) < ψ, where ψ > 0 and k and η are
arbitrary. The case k = 0 is interpreted as the limit k → 0, i.e.

$$ H(y; \eta, \psi, 0) = \exp\left[-\exp\left\{-\frac{y - \eta}{\psi}\right\}\right] $$

the Gumbel distribution. This distribution arises as one of the limiting stable
distributions of extreme value theory, and is widely used in engineering,
hydrological and meteorological contexts when it is required to model the
behaviour of the extremes of a sample. The relatively simple Gumbel
distribution is an obvious initial point from which to attempt to model the
Venice data, and we assume as before that a linear model

$$ Y_t = \beta_0 + \beta_1(t - 1931) + \varepsilon_t $$

holds for the data, but that the 'errors' ε_t have the Gumbel distribution.
The log likelihood for this model is

$$ L(\beta) = \sum_{t=1931}^{1981} \left\{-\log \psi - (y_t - \eta_t)/\psi - e^{-(y_t - \eta_t)/\psi}\right\} $$

where η_t = β_0 + β_1(t − 1931) is the linear predictor for year t. Apart


from the presence of ψ, the structure of this model is analogous to that of
eqn (24) and the derivation of the fitting algorithm given in Section 5.2 can
be carried through in a very similar manner. The value of ψ, which plays
a role like that of the unknown variance in the linear model with normal
errors, is estimated by a single extra step in the algorithm.19,21
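In the absence of GLIM-style software, essentially the same fit can be obtained by direct numerical maximization of the log likelihood above. The sketch below does this with scipy.optimize for an invented series standing in for the Venice data of Table 1; the simulated trend, scale and random seed are assumptions made purely for illustration.

```python
# A sketch of fitting a linear trend with Gumbel-distributed errors by direct
# maximization of the log likelihood given above.  The series `y` is invented;
# it is not the Venice sea-level data of Table 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
years = np.arange(1931, 1982)
t = years - 1931
y = 100 + 0.6 * t + rng.gumbel(loc=0.0, scale=12.0, size=t.size)

def neg_log_lik(params):
    b0, b1, log_psi = params
    psi = np.exp(log_psi)                     # keep the scale parameter positive
    eta = b0 + b1 * t
    z = (y - eta) / psi
    return -np.sum(-np.log(psi) - z - np.exp(-z))

start = np.array([y.mean(), 0.0, np.log(y.std())])
res = minimize(neg_log_lik, start, method="Nelder-Mead")
b0_hat, b1_hat, psi_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(b0_hat, b1_hat, psi_hat)
```

Because the Gumbel distribution does not have mean zero, the estimated intercept from such a fit is shifted relative to a least squares fit, which is the behaviour noted for the real data below.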
When this model is fitted to the data in Table 1, the estimates of β_0 and
β_1 are 97·15 and 0·574, compared to the values 105·4 and 0·567 obtained
previously. The estimates of slope are very similar, and the estimates of the
intercept differ partly because the mean of the Gumbel distribution is not
zero. However, examination of residuals shows that the Gumbel distri-
bution, unlike the normal, gives an adequate fit to the data. There is no
reason to suppose on the basis of the data that the more complicated form
eqn (35) is needed. Figure 20 shows the normal order statistics plot of
residuals defined by extending eqn (32). The use of the Gumbel distri-
bution has removed the skewness that was evident in Fig. 13.
One message from this example is that regression parameter estimates
are usually fairly insensitive to the choice of error distribution. If interest
was focused on the prediction of extreme sea levels from these data,
however, the symmetry of the normal distribution might lead to substan-
tial underprediction of extreme levels in any particular year. For purposes
such as this it would be important to use an error distribution which
matches the asymmetry of the data, and this precludes the use of the
normal distribution.

Fig. 20. Plot of ordered residuals against normal order statistics for fit of model
with Gumbel errors to Venice data.

6 COMPUTING PACKAGES

Almost every statistical package has facilities for fitting the linear regression
models described in Section 3, and most now provide some form of
diagnostic aids for the modeller. At the time of writing, only two packages,
GLIM and GENSTAT, have facilities for direct fitting of generalized
linear models. There is no space here for a comprehensive discussion of
computing aspects of linear models. Some general issues to consider when
choosing a package are:
(a) the flexibility of the package in terms of interactive use, plotting
facilities, model fitting, and control over the level of output;
(b) the ease with which standard and non-standard models can be fitted;
(c) the ease with which models can be checked; and
(d) the quality of documentation and the level of help available, both in
terms of on-line help and in the form of local gurus.
As ever, there is likely to be a trade-off between immediate ease of use

and the power of a package. Whatever decision is made, there is no excuse


for wasting valuable scientific time by writing computer programs to fit
regression models instead of using an off-the-shelf package whenever
possible.

7 BIBLIOGRAPHIC NOTES AND DISCUSSION

There is a vast literature on regression models. Weisberg12 is a good
introduction to applied regression analysis, and Seber11 is good on the
more theoretical aspects. Another standard reference is Draper & Smith.22
The standard book on generalized linear models is McCullagh & Nelder;6
Aitkin et al.18 give an account based on the use of the package GLIM.
Seber & Wild 23 deal specifically with nonlinear models with normal errors.
Aspects of model-checking are covered by Atkinson,24 who pays special
attention to graphical methods and transformations, and by Cook &
Weisberg,25 and Carroll & Ruppert. 26
Two chapters in Hinkley et al. give short accounts of generalized linear
models27 and residuals and diagnostics. 20
Reference was made to overdispersion in Section 5.5. Overdispersion is
widespread in applications, and often arises because a model cannot hope
to capture all the variability which arises in data. A response will often
depend on factors that were not recorded, as well as on those that were,
and this will lead to data that are overdispersed, apparently at random,
relative to the postulated model. One remedy is to broaden the applicability
of generalized linear models by making model assumptions only involving
the mean-variance relationship and hence the variance function. The
relation of these to the generalized linear models described in Section 5 is
the same as that of the second-order assumptions to the normal-theory
assumptions described in Section 3.1. The topic is discussed fully in
Ref. 6.
This chapter has concentrated on the analysis of single sets of data,
rather than the automatic analysis of many similar sets. When there are
many sets of data to be analysed in an automatic way and there is the
possibility of gross errors in some observations, it may be useful to
consider robust or resistant methods of analysis. Robust methods are
insensitive to changes in the assumptions which underlie a statistical
model, whereas resistant methods are insensitive to large changes in a few
observations. If a suitable model is known to apply, there are obvious
advantages to the use of such methods. The difficulty is that usually

models are uncertain, and it is then valuable to detect the ways in which
the model departs from the data. Robust and resistant methods can make
this harder precisely because of their insensitivity to changes in assump-
tions or in the data. Such methods are considered in more detail by
Green,19 Li,28 Rousseeuw & Leroy,29 and Hampel et al. 30
The full parametric assumption that η = x^T β for all values of x and β
can be relaxed in various ways. One possibility is non-parametric smooth-
ing techniques, which aim to fit a smooth curve to the data. This avoids
the full parametric assumptions made above, while giving tractable and
reasonably fast methods of curve-fitting. Hastie & Tibshirani31 give a full
discussion of the topic.

ACKNOWLEDGEMENTS

This work was supported by a grant from the Nuffield Foundation. The
author is grateful to L. Tierney for a copy of his statistical package
XLISP-STAT.

REFERENCES

1. Smith, R.L., Extreme value theory based on the r largest annual events. J.
Hydrology, 86 (1986) 27-43.
2. Davison, A.C. & Hemphill, M.W., On the statistical analysis of ambient ozone
data when measurements are missing. Atmospheric Environment, 21 (1987)
629-39.
3. Lindley, D.V. & Scott, W.F., New Cambridge Elementary Statistical Tables.
Cambridge University Press, Cambridge, 1984.
4. Pearson, E.S. & Hartley, H.O., Biometrika Tables for Statisticians, 3rd edn,
vols 1 and 2. Biometrika Trust, University College, London, 1976.
5. Crowder, M.J., A multivariate distribution with Weibull connections. J. Roy.
Statist. Soc. B, 51 (1989) 93-107.
6. McCullagh, P. & Nelder, J.A., Generalized Linear Models, 2nd edn. Chapman
& Hall, London, 1989.
7. Edwards, D., Hierarchical interaction models (with Discussion). J. Roy.
Statist. Soc. B, 52 (1990) 3-20, 51-72.
8. Jensen, D.R., Multivariate distributions. In Encyclopedia of Statistical Sci-
ences, vol. 5, ed. S. Kotz, N.L. Johnson & C.B. Read. Wiley, New York, 1985,
pp. 43-55.
9. Tawn, J.A., Bivariate extreme value theory: Models and estimation. Bio-
metrika, 75 (1988) 397-415.
10. Tawn, J.A., Modelling multivariate extreme value distributions. Biometrika,
77 (1990) 245-53.

11. Seber, G.A.F., Linear Regression Analysis. Wiley, New York, 1977.
12. Weisberg, S., Applied Linear Regression, 2nd edn. Wiley, New York, 1985.
13. Cook, R.D., Detection of influential observations in linear regression. Tech-
nometrics, 19 (1977) 15-18.
14. Box, G.E.P. & Cox, D.R., An analysis of transformations (with Discussion).
J. Roy. Statist. Soc. B, 26 (1964) 211-52.
15. ApSimon, H.M. & Davison, A.C., A statistical model for deriving probability
distributions of contamination for accidental releases. Atmospheric Environ-
ment, 20 (1986) 1249-59.
16. Holland, D.M. & Fitz-Simons, T., Fitting statistical distributions to air
quality data by the maximum likelihood method. Atmospheric Environment,
16 (1982) 1071-6.
17. Seber, G.A.F., Multivariate Observations. Wiley, New York, 1985.
18. Aitkin, M., Anderson, D., Francis, B. & Hinde, J., Statistical Modelling in
GLIM. Clarendon Press, Oxford, 1989.
19. Green, P.J., Iteratively reweighted least squares for maximum likelihood
estimation and some robust and resistant alternatives (with Discussion). J.
Roy. Statist. Soc. B, 46 (1984) 149-92.
20. Davison, A.C. & Snell, E.J., Residuals and diagnostics. In Statistical Theory
and Modelling: In Honour of Sir David Cox, ed. D.V. Hinkley, N. Reid & E.J.
Snell. Chapman & Hall, London, 1990, pp. 83-106.
21. Jørgensen, B., The delta algorithm and GLIM. Int. Statist. Rev., 52 (1984)
283-300.
22. Draper, N. & Smith, H., Applied Regression Analysis, 2nd edn. Wiley, New
York, 1981.
23. Seber, G.A.F. & Wild, C.J., Nonlinear Regression. Wiley, New York, 1989.
24. Atkinson, A.C., Plots, Transformations, and Regression. Clarendon Press,
Oxford, 1985.
25. Cook, R.D. & Weisberg, S., Residuals and Influence in Regression. Chapman
& Hall, London, 1982.
26. Carroll, R.J. & Ruppert, D., Transformation and Weighting in Regression.
Chapman & Hall, London, 1988.
27. Firth, D., Generalized linear models. In Statistical Theory and Modelling: In
Honour of Sir David Cox, ed. D.V. Hinkley, N. Reid & E.J. Snell. Chapman
& Hall, London, 1990, pp. 55-82.
28. Li, G., Robust regression. In Exploring Data Tables, Trends, and Shapes, ed.
D.C. Hoaglin, F. Mosteller & J.W. Tukey. Wiley, New York, 1985, pp.
281-343.
29. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection.
Wiley, New York, 1987.
30. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A., Robust
Statistics. Wiley, New York, 1986.
31. Hastie, T. & Tibshirani, R.J., Generalized Additive Models. Chapman & Hall,
London, 1990.
Chapter 4

Factor and Correlation


Analysis of Multivariate
Environmental Data
PHILIP K. HOPKE
Department of Chemistry, Clarkson University, Potsdam,
New York 13699, USA

1 INTRODUCTION

In studies of the environment, many variables are measured to charac-


terize the system. However, not all of the variables are independent of one
another. Thus, it is essential to have mathematical techniques that permit
the study of the simultaneous variation of multiple variables. One such
analysis is based on examining the relationships between pairs of varia-
bles. This correlation analysis, however, does not provide a clear view of
the multiple interactions in the data. Thus, various forms of eigenvector
analysis are used to convert the correlation data into multivariate informa-
tion. Factor analysis is the name given to one of the variety of forms of
eigenvector analysis. It was originally developed and used in psychology
to provide mathematical models of psychological theories of human
ability and behavior. 1 However, eigenvector analysis has found wide
application throughout the physical and life sciences. Unfortunately, a
great deal of confusion exists in the literature in regard to the terminology
of eigenvector analysis. Various changes in the way the method is applied
have resulted in it being called factor analysis, principal components
analysis, principal components factor analysis, empirical orthogonal func-
tion analysis, Karhunen-Loeve transform, etc., depending on the way the
data are scaled before analysis or how the resulting vectors are treated
after the eigenvector analysis is completed.

All of the eigenvector methods have the same basic objective; the
compression of data into fewer dimensions and the identification of the
structure of interrelationships that exist between the variables measured or
the samples being studied. In many chemical studies, the measured proper-
ties of the system can be considered to be the linear sum of terms
representing the fundamental effects in that system times appropriate
weighting factors. For example, the absorbance at a particular wavelength
of a mixture of compounds for a fixed path length, z, is considered to be
a sum of the absorbances of the individual components

$$ A_\lambda = \sum_{i} \varepsilon_{i\lambda} c_i z \qquad (1) $$

where ε_{iλ} is the molar extinction coefficient for the ith compound at wave-
length λ, and c_i is the corresponding concentration. Thus, if the absorb-
ances of a mixture of several absorbing species are measured at m various
wavelengths, a series of equations can be obtained:

$$ A_j = \sum_{i} \varepsilon_{ij} c_i z, \qquad j = 1, \ldots, m \qquad (2) $$

If we know what components are present and what the molar extinction
coefficients are for each compound at each wavelength, the concentrations
of each compound can be determined using a multiple linear regression fit
to these data. However, in many cases neither the number of compounds
nor their absorbance spectra may be known. For example, several com-
pounds may elute from an HPLC column at about the same retention time
so that a broad elution peak or several poorly resolved peaks containing
these compounds may be observed. Thus, at any point in the elution curve,
there would be a mixture of the same components but in differing propor-
tions. If the absorbance spectrum of each of these different mixtures could
be measured such as by using a diode array system, then the resulting data
set would consist of a number of absorption spectra for a series of n
different mixtures of the same compounds.

$$ A_{jk} = \sum_{i} \varepsilon_{ij} c_{ik} z, \qquad j = 1, \ldots, m; \quad k = 1, \ldots, n \qquad (3) $$
For such a data set, factor analysis can be employed to identify the number
of components in the mixture, the absorption spectra of each component,
and the concentration of each compound for each of the mixtures. Similar
problems are found throughout analytical and environmental chemistry

where there are mixtures of unknown numbers of components and the


properties of each pure component are not known a priori.
Another use for such methods is in physical chemistry where the mea-
sured property can also be related to a linearly additive sum of indepen-
dent causative processes. For example, the effects of solvents on the
proton NMR shifts for non-polar solutes can be expressed in the form

$$ \Delta_{i\alpha} = \sum_{j=1}^{p} \Delta_{ij} f_{j\alpha} \qquad (4) $$

where Δ_{iα} is the chemical shift of solute i in solvent α, Δ_{ij} refers to the jth
solute factor of the ith solute, and f_{jα} refers to the jth solvent factor of
the αth solvent, with the summation over all of the physical factors that
might give rise to the measured chemical shifts. Similar examples have
been found for a variety of chemical problems and are described in
Ref. 2.
Finally, similar problems arise in the resolution of environmental mix-
tures to their source contributions. For example, a sample of airborne
particulate matter collected at a specific site is made up of particles of soil,
motor vehicle exhaust, secondary sulfate particles, primary emissions from
industrial point sources, etc. It may be of interest to determine how much
of the total collected mass of particles comes from each source. It is then
assumed that the measured ambient concentration of some species, x_i,
where i = 1, ..., m measured elements, is a linear sum of contributions from
p independent sources of particles. These species are normally elemental
concentrations such as lead or silicon and are given in μg of element per
cubic meter of air. Each kth source emits particles that have a profile of
elemental concentrations, a_{ik}, and the mass contribution per unit volume
of the kth source is f_k. When the compositions are measured for a number
of samples, an equation of the following form is obtained:

$$ x_{ij} = \sum_{k=1}^{p} a_{ik} f_{kj} \qquad (5) $$

The use of factor analysis for this type of study is reviewed in Ref. 3.
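The common structure of eqns (1)-(5) is a data matrix expressed as the product of a profile matrix and a contribution matrix. The sketch below, in Python, builds such a matrix from invented source profiles and contributions; the dimensions and values are illustrative assumptions only.

```python
# A sketch of the receptor-modelling structure in eqn (5): a data matrix of m
# measured species over n samples built as the product of an m x p matrix A of
# source profiles and a p x n matrix F of source contributions.  All numbers
# below are invented for illustration.
import numpy as np

rng = np.random.default_rng(8)
m, n, p = 10, 50, 3                      # species, samples, sources

A = rng.uniform(0.0, 1.0, size=(m, p))   # a_ik: profile of source k
F = rng.uniform(0.0, 5.0, size=(p, n))   # f_kj: contribution of source k to sample j
X = A @ F                                # x_ij = sum_k a_ik f_kj, as in eqn (5)

print(X.shape, np.linalg.matrix_rank(X)) # (10, 50) and rank p = 3
```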
Thus, a factor analysis can help compress multivariate data to sufficiently
few dimensions to make visualization possible and assist in identifying the
interrelated variables. Depending on the approach used, the results can be
interpreted statistically or they can be directly related to the structural
variables that describe the system being studied. Examples of both types
of results will be presented in this chapter. However, in all cases, the
investigator must then interpret the interrelationships determined through

the analysis within the context of his problem to provide the more detailed
understanding of the system being studied.

2 EIGENVECTOR ANALYSIS

2.1 Dispersion matrix


The initial step in the analysis of the data requires the calculation of a
function that can indicate the degree of interrelationships that exists
within the data. Functions exist that can provide this measure between the
two variables when calculated over all of the samples or between the
samples when calculated over all of the variables. The most well-known of
these functions is the product-moment correlation coefficient. To be more
precise, this function should be referred to as the correlation about the
mean. The 'correlation coefficient' between two variables, x_i and x_k, over
all n samples is given by

$$ r_{ik} = \frac{\sum_{j=1}^{n} (x_{ij} - \bar x_i)(x_{kj} - \bar x_k)}{\left(\sum_{j=1}^{n} (x_{ij} - \bar x_i)^2\right)^{1/2}\left(\sum_{j=1}^{n} (x_{kj} - \bar x_k)^2\right)^{1/2}} \qquad (6) $$

The original variables can be transformed by subtracting the mean value


and dividing by the standard deviation,

$$ z_{ij} = \frac{x_{ij} - \bar x_i}{s_i} \qquad (7) $$

Using the standardized variables, eqn (6) can be simplified to


$$ r_{ik} = \frac{1}{n}\sum_{j=1}^{n} z_{ij} z_{kj} \qquad (8) $$

The standardized variables have several other benefits to their use. Each
standardized variable has a mean value of zero and a standard deviation
of 1. Thus, each variable carries 1 unit of system variance and the total
variance for a set of measurements of m variables would be m.
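A short numerical sketch of this standardization, using an invented data matrix with the variables as rows and the samples as columns, shows that forming ZZ′/n from the standardized variables reproduces the ordinary correlation matrix of eqn (8).

```python
# A sketch of eqn (8): standardizing each variable and forming Z Z'/n
# reproduces the correlation matrix.  The data matrix is invented, with
# m = 3 variables (rows) measured on n = 200 samples (columns).
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 200
X = rng.normal(size=(m, n))
X[1] += 0.8 * X[0]                      # make two of the variables correlated

Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
R = Z @ Z.T / n                         # correlation about the mean, eqn (8)
print(np.allclose(R, np.corrcoef(X)))   # True
```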
There are several other measures of interrelationship that can also be
utilized. These measures include covariance about the mean as defined by
$$ c_{ik} = \sum_{j=1}^{n} d_{ij} d_{kj} \qquad (9) $$

where

$$ d_{ij} = x_{ij} - \bar x_i \qquad (10) $$

are called the deviations, and x̄_i is the average value of the ith variable. The
covariance about the origin is defined by

$$ c^{0}_{ik} = \sum_{j=1}^{n} x_{ij} x_{kj} \qquad (11) $$

and the correlation about the origin by

$$ r^{0}_{ik} = \frac{\sum_{j=1}^{n} x_{ij} x_{kj}}{\left(\sum_{j=1}^{n} x_{ij}^{2}\right)^{1/2}\left(\sum_{j=1}^{n} x_{kj}^{2}\right)^{1/2}} \qquad (12) $$

The matrix of either the correlations or covariances, called the dis-
persion matrix, can be obtained from the original or transformed data
matrices. The data matrix contains the data for the m variables measured
over the n samples. The correlation about the mean is given by

$$ R_m = ZZ' \qquad (13) $$

where Z' is the transpose of the standardized data matrix Z. The correla-
tion about the origin is

$$ R_0 = Z_0 Z_0' = (XV)(XV)' \qquad (14) $$

where

$$ z^{0}_{ij} = \frac{x_{ij}}{\left(\sum_{j=1}^{n} x_{ij}^{2}\right)^{1/2}} \qquad (15) $$

which is a normalized variable still referenced to the original variable
origin, and V is a diagonal matrix whose elements are defined by

$$ V_{ik} = \delta_{ik}\bigg/\left(\sum_{j=1}^{n} x_{ij}^{2}\right)^{1/2} \qquad (16) $$

This normalized variable also carries a variance of 1, but the mean value
is not zero. The covariance about the mean is given as

$$ C_m = DD' \qquad (17) $$

where D is the matrix of deviations from the mean whose elements are
calculated using eqn (10) and D' is its transpose. The covariance about the
origin is

$$ C_0 = XX' \qquad (18) $$

the simple product of the data matrix by its transpose. As written, these
product matrices would be of dimension m by m and would represent the
pairwise interrelationships between variables. If the order of the multi-
plication is reversed, the resulting n by n dispersion matrices contain the
interrelationships between samples.
The relative merits of these functions to reflect the total information
content contained in the data have been discussed in the literature.
Rozett & Petersen3 argue that since many types of physical and chemical
variables have a real zero, the information regarding the location of the
true origin is lost by using the correlation and covariance about the mean
that include only differences from the variable means. The normalization
made in calculating the correlations from the covariances causes each
variable to have an identical weight in the subsequent analysis. In mass
spectrometry where the variables consist of the ion intensities at the
various m/e values observed for the fragments of a molecule, the nor-
malization represents a loss of information because the variable metric is
the same for all of the m/e values. In environmental studies where mea-
sured species concentrations range from the trace level (sub part per
million) to major constituents at the per cent level, the use of covariance
may weight the major constituents too heavily in the subsequent analy-
ses. The choice of dispersion function depends heavily on the nature of the
variables being measured.
Another use of the correlation coefficient is that it can be interpreted in
a statistical sense to test the null hypothesis as to whether a linear relation-
ship exists between the pair of variables being tested. It is important to
note that the existence of a correlation coefficient that is different from
zero does not prove that a cause and effect relationship exists. Also, it is
important to note that the use of probabilities to determine if correlation
coefficients are 'significant' is very questionable for environmental data. In
the development of those probability relationships, explicit assumptions
are made that the underlying distributions of the variables in the correla-
tion analysis are normal. For most environmental variables, normal distri-
butions are uncommon. Generally, the distributions are positively skewed
and heavy tailed. Thus, great care should be taken in making probability
arguments regarding the significance of pairwise correlation coefficients
between variables measured in environmental samples.
Another problem with interpreting correlation coefficients is that en-
vironmental systems are often truly multivariate systems. Thus, there may
be more than two variables that covary because of the underlying nature
of the processes being studied. Although there can be very strong correla-

tions between two variables, the correlation may arise through a causal
factor in the system that cannot be detected.
For each of the equations previously given in this section, the resulting
dispersion matrix provides a measure of the interrelationship between the
measured variables. Thus, in the use of a matrix of correlations between
the pairs of variables, each variable is given equal weight in the subsequent
eigenvector analysis. This form of factor analysis is commonly referred to
as an R-mode analysis. Alternatively, the order of multiplication could be
reversed to yield covariances or correlations between the samples obtained
in the system. The eigenvector analysis of these matrices would be referred
to as a Q-analysis. The differences between these two approaches will be
discussed further after the approach to eigenvector analysis has been
introduced.

2.2 Eigenvector calculation


The primary goal of eigenvector analysis is to represent a data matrix as
a product of two other matrices containing specific information regarding
the sources of the variation observed in the data. It can be shown 5 that any
matrix can be expressed as a product of two matrices
X_nm = A_np F_pm     (19)
where the subscripts denote the dimensions of the respective matrices.
There will be an infinite number of different A and F matrices that satisfy
this equation.
To divide the matrix into two cofactor matrices as in eqn (19), a
question is raised about the minimum value p can have and still yield a
solution. This value is the 'rank' of matrix X (Ref. 5, p. 335). The rank
clearly cannot be greater than a matrix's smaller dimension and the rank
of a product moment matrix cannot be greater than the smaller of the
number of columns or the number of rows. The rank of the product
moment matrix must be the same as that of the matrix from which it was
formed.
Associated with the idea of matrix rank is the concept of linear in-
dependence of variables. We can look at the interrelationships between
columns (or rows) of a matrix and determine if columns (or rows) are
linearly independent of one another. To understand linear independence
let us examine the relationship between the two vectors in Fig. 1. We can find the vector t such that

s = r - t     (20)

Fig. 1. Illustration of the interrelationship between two vectors.

The vector t can then be generalized to be the resultant of the sum of r and s with coefficients a_r and a_s:

t = a_r r + a_s s     (21)


If t = 0, then r and s are said to be collinear or linearly dependent vectors. Thus, a vector y is linearly dependent on a set of vectors, V_1, V_2, ..., V_m, if

y = a_1 V_1 + a_2 V_2 + ... + a_m V_m     (22)

and at least one of the coefficients, a_j, is non-zero. If all of the a_j values in eqn (22) are zero, then y is linearly independent of the set of vectors V_j. The
number of linearly independent column vectors in a matrix defines the
minimum number of dimensions needed to contain all of the vectors. The
idea of the rank or true dimensionality of a data matrix is an important
concept in receptor modeling as it defines the number of separately identi-
fiable, independent sources contributing to the system under study. Thus
finding the rank of a data matrix will be an important task. In addition,
the ability to resolve sources of material with similar properties or the
resolution of various receptor models needs to be carefully examined. To
examine this question, several additional mathematical concepts need to
be discussed.
A given data matrix can be reproduced by one of an infinite number of
sets of independent column vectors or basis vectors that will describe the
axes of the reduced dimensionality space. The rank of the matrix can be
determined and a set of linearly independent basis vectors can be de-
veloped by the use of an eigenvalue analysis. In this discussion, only the
analysis of real, symmetric matrices such as those obtained as the minor
or major product of a data matrix will be discussed. Suppose there exists
a real, symmetric matrix R that is to be analyzed for its rank, remembering

that the rank of a product moment matrix is the same as the data matrix
from which it is formed.
An eigenvector of R is a vector u such that

Ru = uλ     (23)

where λ is an unknown scalar. The problem then is to find a vector u so that the vector Ru is proportional to u. This equation can be rewritten as

Ru - uλ = 0     (24)

or

(R - λI)u = 0     (25)

implying that u is a vector that is orthogonal to all of the row vectors of (R - λI). This vector equation can be considered as a set of p equations where p is the order of R:

u_1(1 - λ) + u_2 r_12 + u_3 r_13 + ... + u_p r_1p = 0
u_1 r_21 + u_2(1 - λ) + u_3 r_23 + ... + u_p r_2p = 0
...                                                     (26)
u_1 r_p1 + u_2 r_p2 + u_3 r_p3 + ... + u_p(1 - λ) = 0

Unless u is a null vector, eqn (25) can only hold if the matrix (R - λI) is singular.     (27)
There is a solution to this set of equations only if the determinant of the left side of the equation is zero:

|R - λI| = 0     (28)

This equation yields a polynomial in λ of degree p. It is then necessary to obtain the p roots of this equation, λ_i, i = 1, ..., p. For each λ_i there is an associated vector u_i such that

Ru_i - u_i λ_i = 0     (29)

If these λ_i values are placed as the elements of a diagonal matrix Λ, and the eigenvectors are collected as the columns of the matrix U, then we can express eqn (25) as

RU = UΛ     (30)

The matrix U is square and orthonormal, so that

U'U = UU' = I     (31)

Postmultiplying eqn (30) by U' yields

R = UΛU'     (32)

Thus, any symmetric matrix R may be represented in terms of its eigenvalues and eigenvectors:

R = Σ_{i=1}^{p} λ_i u_i u_i'     (33)

so that R is a weighted sum of matrices u_i u_i', each of order p by p and of rank 1. Each term is orthogonal to all other terms so that for i ≠ j

u_i' u_j = 0     (34)

and

(u_i u_i')(u_j u_j') = 0     (35)

Premultiplying eqn (30) by U' yields

U'RU = Λ     (36)

so that U is a matrix that reduces R to a diagonal form. The eigenvalues have a number of useful properties:6
(1) Trace Λ = trace R; the sum of the eigenvalues equals the sum of the elements in the principal diagonal of the matrix.
(2) Π_{i=1}^{p} λ_i = |R|; the product of the eigenvalues equals the determinant of the matrix. If one or more of the eigenvalues is zero, then the determinant is zero and the matrix R is called a singular matrix. A singular matrix cannot be inverted.
(3) The number of non-zero eigenvalues equals the rank of R.
Therefore, if for a matrix R of order p there are m zero eigenvalues, the rank of R is (p - m); the (p - m) eigenvectors corresponding to the non-zero eigenvalues are linearly independent vectors of R.
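These properties can be checked numerically. The following sketch builds a correlation matrix from a synthetic data matrix of rank 2 (the data are arbitrary, not those discussed later in this chapter) and verifies the trace, determinant and rank relationships with an eigenvalue decomposition:

import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(size=(2, 50))              # two underlying factors
loadings = np.array([[1.0, 0.0],
                     [0.5, 0.5],
                     [0.0, 1.0]])
X = loadings @ scores                          # 3 variables of rank 2, no noise
R = np.corrcoef(X)                             # 3-by-3 correlation matrix

eigenvalues, U = np.linalg.eigh(R)             # eigh handles real symmetric matrices
eigenvalues, U = eigenvalues[::-1], U[:, ::-1] # sort into descending order

print("eigenvalues:", np.round(eigenvalues, 6))
print("trace R:", np.trace(R), " sum of eigenvalues:", eigenvalues.sum())
print("det R:", np.linalg.det(R), " product of eigenvalues:", eigenvalues.prod())
print("rank of R:", int(np.sum(eigenvalues > 1e-10)))   # count of non-zero eigenvalues
print(np.round(U.T @ R @ U, 6))                # U'RU is diagonal, eqn (36)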
Another approach can be taken to examine the basic structure of a
matrix. This method is called a singular value decomposition of an arbi-
trary rectangular matrix. A very detailed discussion of this process is given
in Ref. 7. According to the singular value decomposition theorem, any
matrix can be uniquely written as
R = UDV' (37)
where R is an n by m data matrix, U is an n by n orthogonal matrix, V is
an m by m orthogonal matrix, and D is an n by m diagonal matrix. All of
the diagonal elements are non-negative and exactly k of them are strictly

positive. These elements are called the singular values of R. The columns of the U matrix are the eigenvectors of XX'. The columns of the V matrix are the eigenvectors of X'X. Zhou et al.8 show that the R- and Q-mode factor solutions are interrelated as follows:

X = UDV' = (UD)(V') = A_Q F_Q
         = (U)(DV') = F_R A_R     (38)
Although there has been discussion of the relative merits of R- and
Q-mode analyses in the literature,3,9 the direction of multiplication is not
the factor that alters the solutions obtained. Different solutions are ob-
tained depending on the direction in which the scaling is performed. Thus,
different vectors are derived depending on whether the data are scaled by
row, by column, or both. Zhou et al. 8 discuss this problem in more detail.
By making appropriate choices of A and F in eqn (38), the singular value
decomposition is one method to partition any matrix. The singular value
decomposition is also a key diagnostic tool in examining collinearity
problems in regression analysis.10 The application of the singular value
decomposition to regression diagnostics is beyond the scope of this
chapter.
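A short numpy sketch of the decomposition in eqn (37) and of the R-/Q-mode groupings as reconstructed in eqn (38) is given below; the assignment of the two groupings to the two modes follows that reconstruction and should be checked against Ref. 8, and the data are arbitrary:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))                # n = 6 samples, m = 4 variables

U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(s)                             # diagonal matrix of singular values

A_Q, F_Q = U @ D, Vt                       # one grouping: X = (UD)(V')
F_R, A_R = U, D @ Vt                       # the other grouping: X = (U)(DV')

assert np.allclose(X, A_Q @ F_Q)
assert np.allclose(X, F_R @ A_R)
# The squared singular values are the eigenvalues of X'X (and of XX')
print(np.round(s**2, 4))
print(np.round(np.linalg.eigvalsh(X.T @ X)[::-1], 4))

The same three matrices appear in both solutions; only the way they are grouped, and hence the scaling carried by the loadings and scores, differs.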
In the discussion of the dispersion matrix, it becomes necessary to
discuss some of the semantic problems that arise in 'factor' analysis. If
one consults the social science literature on factor analysis, a major
distinction is made between factor analysis and principal components
analysis. Because there are substantial problems in developing quantita-
tive models analogous to those given in the introduction to this chapter in
eqns (1) to (5), the social sciences want to obtain 'factors' that have
significant values for two or more of the measured variables. Thus, they
are interested in factors that are common to several variables. The model
then being applied to the data is of the form:
z_ij = Σ_{k=1}^{p} a_ik f_kj + d_i u_ij     (39)

where the standardized variables, z_ij, are related to the product of the common factor loadings, a_ik, by the common factor scores, f_kj, plus the
unique loading and score. The system variance is therefore partitioned
into the common factor variance, the specific variance unique to the

particular variable, and the measurement error.


System Variance = Common Variance + Specific Variance + Error
True Variance = Common Variance + Specific Variance
Unique Variance = Specific Variance + Error     (40)
In order to make this separation, an estimation is made of the partition-
ing of the variance between the common factors and the specific factors.
A common approach to this estimation is to replace each '1' on the diagonal of the correlation matrix with an estimate of the 'communality' defined by

h_i² = Σ_{k=1}^{p} a_ik²     (41)

The multiple correlation coefficients for each variable against all of the
remaining variables are often used as initial estimates of the com-
munalities. Alternatively, the eigenvector analysis is made and the com-
munalities for the initial solution are then substituted into the diagonal
elements of the correlation matrix to produce a communality matrix. This
matrix is then analyzed and the process repeated until stable communality
values are obtained.
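A schematic numpy sketch of this iterative communality estimation is given below; the use of squared multiple correlations as starting values, the convergence tolerance and the fixed number of retained factors are all assumptions of the sketch rather than prescriptions:

import numpy as np

def iterate_communalities(R, p, tol=1e-6, max_iter=200):
    """Iteratively refine communality estimates for a principal-axis style
    factoring of the correlation matrix R, retaining p factors. Each cycle
    replaces the diagonal of R with the current communalities, re-factors
    the resulting communality matrix, and recomputes the communalities."""
    R = np.asarray(R, dtype=float)
    R_inv = np.linalg.inv(R)
    h2 = 1.0 - 1.0 / np.diag(R_inv)          # squared multiple correlations
    for _ in range(max_iter):
        R_h = R.copy()
        np.fill_diagonal(R_h, h2)            # communality matrix
        vals, vecs = np.linalg.eigh(R_h)
        vals, vecs = vals[::-1][:p], vecs[:, ::-1][:, :p]
        loadings = vecs * np.sqrt(np.clip(vals, 0.0, None))
        h2_new = (loadings**2).sum(axis=1)   # communalities, eqn (41)
        if np.max(np.abs(h2_new - h2)) < tol:
            return loadings, h2_new
        h2 = h2_new
    return loadings, h2

# Example with an arbitrary correlation matrix and two retained factors
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 100))
X[1] += X[0]; X[3] += X[2]                   # induce some correlation structure
loadings, h2 = iterate_communalities(np.corrcoef(X), p=2)
print(np.round(h2, 3))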
The principal components analysis simply decomposes the correlation
matrix and leads to the model outlined in eqn (39) without the d_i u_ij term.
It can produce components that have a strong relationship with only one
variable. The single variable component could also be considered to be the
unique factor. Thus, both principal components analysis and classical
factor analysis really lead to similar solutions although reaching these
solutions by different routes. Since it is quite reasonable for many environ-
mental systems to show factors that produce such single variable behavior,
it is advisable to use a principal components analysis and extend the
number of factors to those necessary to reproduce the original data within
the error limits inherent in the data set.
Typically, this approach to making the eigenvector analysis compresses
the information content of the data set into as few eigenvectors as possible.
Thus, in considering the number of factors to be used to describe the
system, it is necessary to carefully examine the problems of reconstructing
both the variability within the data and reconstructing the actual data
itself.

2.3 Number of retained factors


Following the diagonalization of the correlation or covariance matrix, it
is necessary to make the difficult choice of the number of factors, p, to use
in the subsequent analysis. This problem occurs in any application of an
eigenvector analysis of data containing noise. In the absence of error, the
eigenvalues beyond the true number of sources become zero except for
calculational error. The choice becomes more difficult depending on the
error in the data. Several approaches have been suggested.4,11
A large relative decrease in the magnitude of the eigenvalues is one
indicator of the correct number of factors. It can often be useful to plot
the eigenvalues as a function of factor number and look for sharp breaks
in the slope of the line. 12 If the eigenvalue is a measure of the information
content of the corresponding eigenvector, then only sufficiently 'large'
eigenvalues need to be retained in order to reproduce the variation initially
present in the data. One of the most commonly used and abused criteria
for selecting the number of factors to retain is retaining only those eigen-
values greater than 1.13 The argument is made that the normalized varia-
bles each carry one unit of variance. Thus, if an eigenvalue is less than one,
then it carries less information than one of the initial variables and is
therefore not needed. However, Kaiser & Hunka14 make a strong argument that although an eigenvalue greater than one does set a lower limit on
the number of factors to be retained, it does not set a simultaneous upper
bound. Thus, there must be at least as many factors as there are eigen-
values greater than one, but there can be more than that number that are
important to the understanding of the system's behavior.
Hopke15 has suggested a useful empirical criterion for choosing the
number of retained eigenvectors. In a number of cases of airborne par-
ticulate matter composition source identification problems, Hopke found
that choosing the number of factors containing variance greater than one
after an orthogonal rotation provided a stable solution. Since the eigen-
vector analysis artificially compresses variance into the first few factors,
reapportioning the variance using the rotations described in the next
section will result in more factors with total variance greater than one than
there are eigenvalues greater than one. In many cases, this number of
factors will stay the same even after rotating more factors.
For a different type of test, the original data are reproduced using only
the first factor and compared point-by-point with the original data.

Several measures of the quality of fit are calculated, including chi-squared

χ² = Σ_{j=1}^{n} Σ_{i=1}^{m} (x_ij - x̂_ij)² / σ_ij²     (42)

where x̂_ij is the reconstructed data point using p factors and σ_ij is the uncertainty in the value of x_ij. The Exner function16 is a similar measure and is calculated by

EP = [ Σ_{j=1}^{n} Σ_{i=1}^{m} (x_ij - x̂_ij)² / Σ_{j=1}^{n} Σ_{i=1}^{m} (x_ij - x̄)² ]^{1/2}     (43)

where x̄ is a grand ensemble average value. The empirical indicator function suggested by Malinowski17 can be used for this purpose and is calculated as follows:

RSD = [ Σ_{j=p+1}^{m} λ_j / (n(m - p)) ]^{1/2}    for n > m     (44)

IND = RSD / (m - p)²     (45)

RSD = [ Σ_{j=p+1}^{n} λ_j / (m(n - p)) ]^{1/2}    for n < m     (46)

IND = RSD / (n - p)²     (47)

where λ_j are the eigenvalues from the diagonalization. This function has proven very successful with spectroscopy results.17,17a However, it has not proven to be as useful with other types of environmental data.18 Finally,
the root-mean-square error and the arithmetic average of the absolute
values of the point-by-point errors are also calculated. The data are next
reproduced with both the first and second factors and again compared
point-by-point with the original data. The procedure is repeated, each time
with one additional factor, until the data are reproduced with the desired
precision. If p is the minimum number of factors needed to adequately
reproduce the data, then the remaining n - p factors can be eliminated
from the analysis. These tests do not provide unequivocal indicators of the
number of factors that should be retained. Judgement becomes necessary
in evaluating all of the test results and deciding upon a value of p. In this
manner the dimension of the A and F matrices is reduced from n to p.
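A short numpy sketch of two of these retention criteria, applied to synthetic data with three underlying sources, is given below; the use of the correlation matrix and the exact scaling of the indicator function are assumptions of the sketch:

import numpy as np

def retention_diagnostics(X):
    """Diagnostics for choosing the number of retained factors from an
    n-sample by m-variable data matrix X: the eigenvalue-greater-than-one
    count and a Malinowski-style indicator function following the form of
    eqns (44)-(45) for n > m. A sketch only."""
    n, m = X.shape
    R = np.corrcoef(X, rowvar=False)              # m-by-m correlation matrix
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

    kaiser = int(np.sum(eigvals > 1.0))           # eigenvalues greater than one

    ind = []
    for p in range(1, m):
        rsd = np.sqrt(eigvals[p:].sum() / (n * (m - p)))
        ind.append(rsd / (m - p) ** 2)
    p_ind = int(np.argmin(ind)) + 1               # the minimum of IND is one suggested choice

    return eigvals, kaiser, p_ind

rng = np.random.default_rng(4)
scores = rng.normal(size=(200, 3))                # three underlying sources
mixing = rng.uniform(size=(3, 10))
X = scores @ mixing + 0.05 * rng.normal(size=(200, 10))
eigvals, kaiser, p_ind = retention_diagnostics(X)
print(np.round(eigvals, 3))
print("eigenvalues > 1:", kaiser, "  IND minimum at p =", p_ind)

As emphasized above, such indices are guides rather than unequivocal answers, and the final choice of p still requires judgement.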
The compression of variance into the first factors will improve the ease
with which the number of factors can be determined. However, their

Fig. 2. Illustration of the rotation of a coordinate system (x1, x2) to a new system (y1, y2) by an angle θ.

nature has now been mixed by the calculational method. Thus, once the
number of factors has been determined, it is often useful to rotate the axes
in order to provide a more interpretable structure.

2.4 Rotation of factor axes


The axis rotations can retain the orthogonality of the eigenvectors or they
can be oblique. Depending on the initial data treatment, the axes rotations
may be in the scaled and/or centered space or in the original variable scale
space. The latter approach has proved quite useful in a number of chemi-
cal applications 2 and in environmental systems. 19 To begin the discussion
of factor rotation, it is useful to describe how one set of coordinate system axes can be transformed into a new set. Suppose that it is necessary to change from the coordinate system x1, x2 to the system y1, y2 by rotating through an angle θ as shown in Fig. 2. For this two-dimensional system, it is easy to see that

y1 = cos θ x1 + sin θ x2
y2 = -sin θ x1 + cos θ x2     (48)

In matrix form, this equation could be written as

y' = x'T     (49)

where T is the transformation matrix.
In order to obtain a more general case, it is useful to define the angles
of rotation in the manner shown in Fig. 3. Now the angle θ_ij is the angle

Fig. 3. Illustration of the rotation of a coordinate system (x1, x2) to a new coordinate system (y1, y2) showing the definitions of the full set of angles used to describe the rotation.

between the ith original reference axis and the jth new axis. Assuming that this is a rotation that maintains orthogonal axes (rigid rotation), then, for two dimensions,

θ_12 = θ_11 + 90°
θ_21 = θ_11 - 90°     (50)
θ_22 = θ_11

There are also trigonometric relationships that exist,

cos θ_21 = cos(θ_11 - 90°) = sin θ_11
cos θ_12 = cos(θ_11 + 90°) = -sin θ_11     (51)

so that eqn (48) can be rewritten as

y1 = cos θ_11 x1 + cos θ_21 x2
y2 = cos θ_12 x1 + cos θ_22 x2     (52)
This set of equations can then be easily expanded to n orthogonal axes

yielding

y1 = cos θ_11 x1 + cos θ_21 x2 + ... + cos θ_n1 xn
...                                                     (53)
yn = cos θ_1n x1 + cos θ_2n x2 + ... + cos θ_nn xn

A transformation matrix T can then be defined such that its elements are

t_ij = cos θ_ij     (54)
Then, for a collection of N row vectors in a matrix X with n columns,
Y = XT (55)
and Y has the coordinates for all N row vectors in terms of the n rotated
axes. For the rotation to be rigid, T must be an orthogonal matrix. Note
that the column vectors in Y can be thought of as new variables made up
by linear combinations of the variables in X with the elements of T being
the coefficients of those combinations. Also a row vector of X gives the
properties of a sample in terms of the original variables while a row vector
of Y gives the properties of a sample in terms of the transformed variables.
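A brief numpy sketch of eqns (48)-(55) for the two-dimensional case, building the transformation matrix from direction cosines and confirming that the rotation is rigid, is given below (the angle and the sample points are arbitrary):

import numpy as np

theta = np.deg2rad(30.0)

# t_ij = cos(theta_ij), eqn (54); for a rigid rotation in two dimensions
# theta_12 = theta_11 + 90 deg, theta_21 = theta_11 - 90 deg, theta_22 = theta_11
T = np.array([[np.cos(theta),            np.cos(theta + np.pi / 2)],
              [np.cos(theta - np.pi / 2), np.cos(theta)]])

assert np.allclose(T.T @ T, np.eye(2))   # T is orthogonal, so the rotation is rigid

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])               # three sample points as row vectors
Y = X @ T                                # eqn (55): coordinates on the rotated axes
print(np.round(Y, 3))
# Vector lengths are preserved by the rigid rotation
assert np.allclose((Y**2).sum(axis=1), (X**2).sum(axis=1))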

3 APPLICATIONS

3.1 Data screening


To illustrate this approach to data, a set of geological data from the
literature has been perturbed in various ways to illustrate the utility of a
factor analysis. 20 These data for 29 elements were obtained by neutron
activation analysis of 15 samples taken from Borax Lake, California
where a lava flow had once occurred. 21 The samples obtained were formed
by two different sources of lava that combined in varying amounts and,
thus, formed a mixture of two distinct mineral phases. The data was
acquired to identify the elemental profiles of each source mineral phase.

3.1.1 Decimal point errors


Decimal point errors are common mistakes made in data handling.
Therefore, to simulate a decimal point error, the geological data was
changed so that one of the 15 thorium values was off by a factor of ten.
The data was processed by classical factor analysis to produce factor
loadings and factor scores for 2-10 causal factors. The factor loadings

TABLE 1
Factor loadings of the geological data with a decimal point error

Element Factor-1 Factor-2 Factor-3

Th 0·030 0·046 0·972
Sm 0·899 0·360 0·155
U 0·959 0·264 0·090
Na 0·778 0·362 0·342
Sc -0·971 -0·221 -0·063
Mn -0·970 -0·226 -0·075
Cs 0·962 0·247 0·093
La 0·904 0·286 0·181
Fe -0·970 -0·226 -0·064
Al -0·937 -0·155 0·034
Dy -0·002 0·828 0·208
Hf 0·708 0·288 0·481
Ba -0·262 -0·834 0·095
Rb 0·939 0·074 0·204
Ce 0·956 0·220 0·128
Lu 0·771 0·266 0·173
Nd 0·878 0·142 0·211
Yb 0·925 0·084 -0·119
Tb 0·923 0·024 0·032
Ta 0·837 0·303 0·334
Eu -0·972 -0·212 -0·068
K 0·908 0·263 -0·092
Sb 0·621 0·674 0·024
Zn -0·927 -0·247 -0·123
Cr -0·976 -0·191 -0·058
Ti -0·972 -0·211 -0·039
Co -0·971 -0·216 -0·073
Ca -0·963 -0·119 -0·121
V -0·948 -0·208 -0·091

were then examined starting with the two factor solution to determine the
nature of the identified factors. For the two factor solution, one factor of
many highly correlated variables was identified in addition to a second
factor with high correlations by Dy, Ba and Sb. For the three factor
solution, the above factors were again observed in addition to a factor
containing most of the variance in thorium and little of any other variable
(see Table 1). This factor, which reflects a high variance for only a single variable, shows the typical identifying characteristics of a possible data
error. To further investigate the nature of this factor, the factor scores of
the three factor solution are calculated (Table 2). As can be seen, there is

TABLE 2
Factor scores of geological data with a decimal point error

Sample Factor-1 Factor-2 Factor-3

1 -0·179 0·101 3·51
2 -0·240 2·63 -0·591
3 0·037 1·06 -0·027
4 0·846 -0·517 -0·462
5 0·753 -0·560 0·011
6 0·782 -0·520 -0·029
7 0·900 -0·333 -0·426
8 0·645 0·522 -0·387
9 1·23 -1·53 -0·162
10 0·804 0·662 -0·074
11 -2·08 -1·02 -0·601
12 -1·58 -0·236 0·148
13 -1·12 -0·376 -0·500
14 -0·679 -0·486 -0·290
15 -0·109 0·613 -0·120

a very large contribution of factor-3 for sample-1, indicating that most of the variance was due to sample-1, the sample with the altered thorium
value. Having made these observations, one then must go back to the raw
data and decide if an explanation can be found for this abnormality and
then take the appropriate corrective action.
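This screening test can be reproduced schematically with synthetic data (not the Borax Lake measurements); in the sketch below a single value is made ten times too large, and the extra component this produces should load mainly on the altered variable, while the corresponding score should point to the affected sample:

import numpy as np

rng = np.random.default_rng(5)
n_samples, n_elements = 15, 8
source = rng.uniform(size=(n_samples, 1))                 # single common source
profiles = rng.uniform(0.5, 1.5, size=(1, n_elements))
data = source @ profiles * rng.lognormal(0.0, 0.05, size=(n_samples, n_elements))
data[0, 3] *= 10.0                                        # simulated decimal point error

Z = (data - data.mean(axis=0)) / data.std(axis=0)         # standardize each element
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))
scores = Z @ eigvecs[:, order]

print("factor 2 loadings:", np.round(loadings[:, 1], 2))  # expected: large mainly for element 3
print("factor 2 scores:  ", np.round(scores[:, 1], 2))    # expected: large mainly for sample 0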

3.1.2 Random variables


If there was no physical explanation for the unique thorium contribution
in the preceding example and the factor scores did not identify a specific
sample, then the error may have been distributed throughout the data for
the variable. An example of this would be the presence of a noisy variable
in the set of data. To demonstrate how to identify a variable dominated
by random noise, an additional variable, X1, was added to the geological data set. The values for X1 were generated using a normally-distributed
random number generator to produce values with a mean of 5·0 and
standard deviation 0·5. Using the procedure described above, factor an-
alysis identified the random variable as can be seen in Table 3. Again, the
same basic characteristics were observed in the factor, i.e. a large loading
for variable X1 and little contribution for the other variables. If one were
to look at the factor scores for this case, the distribution of the values for
factor-3 would not identify a specific sample. Thus, the problem with
variable X1 is distributed over all the samples. If there is no physical

TABLE 3
Factor loadings of the geological data with a random variable X1

Element Factor-1 Factor-2 Factor-3

Th 0·966 0·245 -0·045
Sm 0·903 0·389 -0·002
U 0·955 0·286 -0·058
Na 0·832 0·340 0·153
Sc -0·968 -0·236 0·048
Mn -0·966 -0·243 0·060
Cs 0·958 0·270 -0·063
La 0·915 0·311 0·005
Fe -0·966 -0·242 0·056
Al -0·888 -0·223 0·296
Dy 0·048 0·757 0·341
Hf 0·728 0·362 -0·127
Ba -0·231 -0·841 -0·50
Rb 0·963 0·078 -0·011
Ce 0·948 0·260 -0·122
Lu 0·710 0·408 -0·439
Nd 0·909 0·137 0·041
Yb 0·865 0·146 -0·343
Tb 0·944 -0·010 0·058
Ta 0·847 0·370 -0·059
Eu -0·968 -0·226 0·074
K 0·880 0·277 -0·097
Sb 0·596 0·690 -0·100
Zn -0·943 -0·241 -0·032
Cr -0·975 -0·200 0·042
Ti -0·968 -0·223 0·049
Co -0·969 -0·233 0·050
Ca -0·958 -0·154 0·116
V -0·941 -0·232 0·109
X1 -0·006 0·290 0·856

reason for this variable being unique, then the investigator should explore
his analytical method for possible errors.

3.1.3 Interfering variables


In some cases, especially in spectral analysis, two variables may interfere
with each other. For example, in neutron activation analysis, Mn and Mg
have overlapping peaks. If there is a systematic error in separating these
two peaks, errors in analytical values will be obtained. To demonstrate
this, two correlated random variables, Y1 and Y2, were generated and added to the geological data set. The Y1 variables were produced as the

TABLE 4
Factor loadings of the geological data with two correlated random variables,
Y1 and Y2

Element Factor-1 Factor-2 Factor-3

Y1 0·133 -0·160 0·888
Y2 -0·116 -0·010 0·857
Th 0·976 0·202 -0·004
Sm 0·916 0·352 -0·003
U 0·968 0·242 -0·009
Na 0·813 0·382 -0·145
Sc -0·978 -0·193 0·027
Mn -0·978 -0·200 0·018
Cs 0·972 0·224 -0·011
La 0·920 0·288 0·067
Fe -0·977 -0·198 0·027
Al -0·933 -0·117 -0·143
Dy 0·031 0·858 -0·066
Hf 0·742 0·361 0·229
Ba -0·277 -0·783 0·044
Rb 0·956 0·066 -0·095
Ce 0·967 0·208 0·057
Lu 0·785 0·270 0·303
Nd 0·898 0·136 -0·048
Yb 0·917 0·020 0·050
Tb 0·924 -0·009 -0·136
Ta 0·862 0·340 0·189
Eu -0·980 -0·182 0·049
K 0·904 0·210 -0·004
Sb 0·646 0·616 -0·273
Zn -0·938 -0·234 -0·022
Cr -0·982 -0·163 -0·025
Ti -0·977 -0·180 0·013
Co -0·979 -0·191 0·017
Ca -0·972 -0·103 0·041
V -0·956 -0·188 -0·033

X1 variables described above. The Y2 variables were generated by dividing the Y1 variable by 3 and adding a normally distributed random error with a standard deviation of 0·15 to it. After factor analysis, the factor loadings identified the nature of the problem in the data, as can be seen in Table 4. The large correlation between variables Y1 and Y2 for factor-3, along with the absence of any other contributing variable, was very obvious.
The question now arises, what if the two interfering variables are also
correlated with other variables in the data? To investigate this, the con-

TABLE 5
Factor loadings of the geological data with errors present in two previously
uncorrelated variables, Th and Eu

Element Factor-1 Factor-2 Factor-3

Th 0·965 0·234 0·091
Sm 0·899 0·383 0·093
U 0·956 0·275 0·087
Na 0·821 0·391 -0·109
Sc -0·969 -0·226 -0·067
Mn -0·967 -0·233 -0·077
Cs 0·958 0·259 0·093
La 0·904 0·309 0·129
Fe -0·968 0·232 -0·069
Al -0·897 -0·165 -0·290
Dy 0·032 0·827 -0·198
Hf 0·712 0·370 0·253
Ba -0·231 -0·820 -0·102
Rb 0·964 0·091 -0·021
Ce 0·947 0·242 0·160
Lu 0·717 0·324 0·498
Nd 0·899 0·156 0·024
Yb 0·882 0·081 0·264
Tb 0·944 0·009 -0·095
Ta 0·831 0·360 0·256
Eu 0·035 -0·178 0·935
K 0·887 0·248 0·099
Sb 0·626 0·669 -0·082
Zn -0·932 -0·253 -0·070
Cr -0·976 -0·194 -0·059
Ti -0·968 -0·213 -0·076
Co -0·969 -0·223 -0·075
Ca -0·968 -0·137 -0·056
V -0·942 -0·219 -0·118

centrations of two uncorrelated elements, thorium and europium, were


altered to simulate an interference. Europium values were altered by
adding 10% of the thorium value plus a normally distributed random
error about thorium with a 5% standard deviation. The thorium values
were altered similarly except that 10% of the europium value was subtract-
ed rather than added. As before, the problem in the data set was identified
(see Table 5) in factor-3 by the same characteristics previously described.
Again, this factor establishes the possibility of a problem and it is up to
the researcher to identify the nature of the difficulty.

TABLE 6
Factor loadings of the geological data with errors present in two previously
correlated variables, Th and Sm

Element Factor-1 Factor-2 Factor-3


Th 0·854 0·472 0·206
Sm 0·414 0·810 -0·208
U 0·860 0·428 0·268
Na 0·831 0·132 0·412
Sc -0·883 -0·405 -0·224
Mn -0·879 -0·412 -0·230
Cs 0·862 0·431 0·252
La 0·811 0·427 0·297
Fe -0·880 -0·408 -0·229
Al -0·732 -0·616 -0·134
Dy 0·106 -0·178 0·857
Hf 0·624 0·421 0·340
Ba -0·108 -0·340 -0·796
Rb 0·927 0·273 0·104
Ce 0·835 0·483 0·226
Lu 0·486 0·779 0·257
Nd 0·861 0·279 0·165
Yb 0·720 0·597 0·054
Tb 0·936 0·187 0·034
Ta 0·706 0·516 0·330
Eu -0·890 -0·391 -0·217
K 0·783 0·435 0·242
Sb 0·542 0·282 0·678
Zn -0·877 -0·339 -0·257
Cr -0·896 -0·389 -0·194
Ti -0·882 -0·408 -0·211
Co -0·882 -0·411 -0·220
Ca -0·875 -0·409 -0·132
V -0·846 -0·436 -0·210

Now, consider if the two interfering variables are correlated as well as


interfering with each other. Again the data set was altered, as described
above except that the samarium values were substituted for the europium
values. The resulting factor loadings are given in Table 6. The problem
appears in factor-2, but it is not obvious that it is a difficulty since there
are high values for some other elements. For these data, the researcher
might be able to identify the problem concerning samarium if he had
sufficient insight into the true nature of the data. In this case it is known
there are only two mineral phases. The existence of factor-3 indicates that

either an error has been made in assuming only two phases or there are
errors in the data set. Thus, knowledge of the nature of the system under
study would be needed to find this type of error. The two variables
involved could also be an indication of a problem. If two variables that
would not normally be thought to be interrelated appear together in a
factor, it could indicate a correlated error.
In both of the previous two examples, there were two interfering varia-
bles present, either thorium and samarium or thorium and europium.
Potential problems in the data were recognized after using factor analysis,
i.e. samarium and europium. However, no associated problem was noticed
with thorium in either case because of the relative concentrations of the
two variables. Since thorium was much less sensitive to a change (because
of its large magnitude relative to samarium or europium), the added error
in thorium was interpreted as unnoticeable increases in the variance of the
thorium values. For an actual common interference, consider the Mn-Mg
problem in neutron activation analysis. If the spectral analysis program
used was having a problem properly separating the Mn and Mg peaks, the
factor analysis would usually identify the problem as being in the Mg
values since the sensitivity of Mg to neutron activation analysis is so much
less than that of Mn. However, if the actual levels of Mn were so low that
the Mg peak was relatively the same size as the Mn peak, then the problem
could show up in both variables.

3.1.4 Summary and conclusions


The approach to classical factor analysis described in this section, i.e. doing the analysis for varying numbers of factors without prior assumptions as to the number of factors, prevents one from obtaining erroneous results through implicit computer code assumptions. Identification of a factor containing most of
the variance of one variable with little variance of other variables, pin-
points a possible difficulty in the data, if the singularity has no obvious
physical significance. Examination of the factor scores will determine
whether the problem is isolated to a few samples or over all the samples.
Having this information, one may then go back to this raw data and take
the appropriate corrective action.
Classical factor analysis has the ability to identify several types of errors
in data after it has been generated. It is then ideally suited for scanning
large data sets. The ease of the identification technique makes it a benefi-
cial tool to use before reduction and analysis of large data sets and should,
in the long run, save time and effort.

3.2 Data interpretation


To illustrate the use of factor analysis to assist in the interpretation of
multivariate environmental data, it will be applied to a set of physical and
chemical data resulting from the analysis of surficial sediment samples
taken from a shallow, eutrophic lake in western New York State, Chau-
tauqua Lake.
Chautauqua Lake is a water resource with a long history of management
attempts of various types. The outfall of the lake, the Chadakoin River,
is controlled by a dam at Jamestown, New York, so that the lake volume
can be adjusted. Various chemical treatment techniques have been em-
ployed to control biological activity in this eutrophic lake; copper sulfate
applied in the late 1930s and early 1940s to control algae, and sodium
arsenite added to control rooted macrophytes in 1953 with intensive
spraying from 1955 to 1963. Since 1963, various organic herbicides have
been employed.
Chautauqua Lake is a 24-km long, narrow lake in southwestern New
York State. It is similar in surface configuration to the Finger Lakes of
New York. However, in contrast to the Finger Lakes, Chautauqua Lake
is a warm, shallow lake. Beginning in 1971, this lake has been the subject
of an intensive multidisciplinary study by the State University College at
Fredonia (Lake Erie Environmental Studies, unpublished). In the early
studies of the lake, Lis & Hopke 22 found unusually high arsenic concen-
trations in the waters of Chautauqua Lake.
One possible source of this arsenic was suggested as the release from the
bottom sediments of arsenic residues from the earlier chemical treatment
of the lake. In an effort to investigate this possibility, the abundance of
arsenic23 along with 14 other elements was determined by neutron activa-
tion analysis for grab samples of the bottom sediments which had been
taken for sediment particle size analysis. 24 The concentrations of 15
elements were determined quantitatively and are given in Ref. 25. Ruppert
et al. 23 reported the particle size distribution as characterized by per cent
sand, per cent silt, and per cent clay, as well as the per cent organic matter
and water depth above the sample. In addition, parameters describing the
particle size distribution as determined by Clute 24 include measures of the
average grain size, mean grain size, median grain size, and parameters
describing the shape of the distribution, sorting (standard deviation),
skewness, kurtosis and normalized kurtosis. These values are calculated
using the methods described in Ref. 26.
Several additional variables were also calculated. These variables were
the square of the median and mean grain sizes in millimeters and the

reciprocal of the median and mean grain sizes in millimeters. The squared
diameters provide a measure of the available surface area assuming that
the particles are spherical. The reciprocals of the diameter should help
indicate the surface area per unit volume of the particles. Seventy-nine
samples were available for which there were complete data for all of the
variables.
An interpretation can be made of the factor loading matrix (Table 7) in
terms of both the physical and chemical variables. The variables describ-
ing particle diameter in millimeters and per cent sand, have high positive
loadings for factor one. This factor could be thought of as the amount of
coarse grain source material in the sediment. The amount of sand (coarse-
grained sediment) is positively related only to this first factor. Of the
elements determined, only sodium and hafnium have positive coefficients
for this factor.
The second factor is related to the available specific surface area of the
sedimental particles. The amount of surface area per unit volume would
be given by
R = 4πr² / ((4/3)πr³) = 3/r = 6/d

where d is the particle diameter. Thus the surface to volume ratio is


proportional to the inverse of the particle radius. It is the inverse diameter
that has the highest dependence on the second factor. The clay fraction
contains the smallest particles which are the ones with the largest surface
to volume ratio and, thus, it is reasonable that the per cent clay is also
strongly dependent on the second factor. The elements that have positive
loadings are adsorbing onto the surface of the particles. This factor has a
significant value for As, Br, K and La.
The third factor represents the glacial till source material. The source for
sediments found in the lake consists of Upper Devonian siltstones and
shales overlain by Pleistocene glacial drift.27,28 In order to compare the
average elemental abundances of the lake sediments with the source
material, two samples of surface soil were obtained from forested areas
that had been left dormant for more than 50 years. Two samples were
obtained of shale typical of the area. These samples show quite similar
elemental concentrations to those samples taken from the center area of
the lake where there has been little active sedimentation. It is suggested
that factor-3 accounts for the presence of this glacial till in the sedimental
material and explains the high correlation coefficients (> 0·8) reported by
Hopke et al. 25 between Sb, Cs, Sc and Ta.

The only variable having a high loading for factor-4 is the per cent silt.
Since the silt represents a mixture of size fractions that can be carried by
streams, this factor may represent active sedimentation at or near stream
deltas. The silty material is deposited on the delta and then winnowed
away by the action of lake currents and waves.
The final factor has a high loading for sorting. The sorting parameter
is a measure of the standard deviation of the particle size distribution. A
large standard deviation implies that there is a wide mixture of grain sizes
in the sample. Wave action sorts particles by carrying away fine-grained
particles and leaving the coarse-grained. Therefore, the fifth factor may
represent the wave and current action that transport the sedimental mat-
erial within the lake.

3.3 Receptor modeling


The first receptor modeling applications of classical factor analysis were
by Prinz & Stratmann29 and Blifford & Meeker. 30 Prinz & Stratmann
examined both the aromatic hydrocarbon content of the air in 12 West
German cities and data from Colucci & Begeman31 on the air quality of
Detroit. In both cases they found three factor solutions and used an
orthogonal varimax rotation to give more readily interpretable results.
Blifford & Meeker30 used a principal component analysis with both
varimax and a non-orthogonal rotation to examine particle composition
data collected by the National Air Sampling Network (NASN) during
1957-1961 in 30 US cities. They were generally not able to extract much
interpretable information from their data. Since there are a very wide
variety of particle sources among these 30 cities and only 13 elements were
measured, it is not surprising that they were not able to provide much
specificity to their factors.
The factor analysis approach was then reintroduced by Hopke et al. 25
and Gaarenstroom et al. 32 for their analysis of particle composition data
from Boston, MA and Tucson, AZ, respectively. In the Boston data for
90 samples at a variety of sites, six common factors were identified that
were interpreted as soil, sea salt, oil-fired power plants, motor vehicles,
refuse incineration and an unknown manganese-selenium source. The six
factors accounted for about 78% of the system variance. There was also
a high unique factor for bromine that was interpreted to be fresh automo-
bile exhaust. Since lead was not determined, these motor vehicle related
factor loading assignments remain uncertain. Large unique factors for
antimony and selenium were found. These factors represent emission of
volatile species whose concentrations do not covary with other elements
TABLE 7
Varimax-rotated maximum likelihood factor matrix for Chautauqua Lake sediment

Variable Factor-1 Factor-2 Factor-3 Factor-4 Factor-5 h²

Arsenic -0·3446 0·3794 0·2824 0·0426 0·4099 0·5123
Bromine -0·2552 0·3883 0·3767 -0·0036 0·3810 0·5030
Cesium -0·2115 0·2661 0·9106 0·0575 0·2278 0·9999
Europium -0·4300 0·2567 0·1856 0·0869 0·2087 0·3363
Iron -0·2276 0·1747 0·4452 0·0495 0·3400 0·3986
Gallium -0·1034 0·2065 0·2147 0·0626 0·0994 0·1132
Hafnium 0·2603 -0·2164 0·3318 -0·2404 -0·1461 0·3038
Potassium -0·0706 0·4068 0·4062 0·1228 0·4017 0·5119
Lanthanum -0·4579 0·3777 0·3797 0·1805 0·4553 0·7364
Manganese -0·2092 0·1733 0·1289 0·1643 0·2809 0·1963
Sodium 0·2971 -0·3436 -0·3540 -0·1368 -0·5410 0·6430
Antimony -0·1217 0·1677 0·9321 0·0185 0·0578 0·9154
Scandium -0·1094 0·2535 0·8657 -0·0188 0·1611 0·8625
Tantalum -0·1097 0·0536 0·7240 0·0491 0·1009 0·5517
Terbium -0·1290 0·1073 0·4483 0·0192 0·2587 0·2964
% Sand 0·7607 -0·2507 -0·2248 -0·3705 -0·4070 0·9950
% Silt -0·8331 -0·1036 0·0872 0·5029 0·1782 0·9971
% Clay -0·3729 0·7066 0·2704 0·0840 0·5236 0·9955
% Organic matter -0·2670 0·1018 0·3310 -0·0249 0·4020 0·3534
Depth (m) -0·2290 0·3135 0·3716 -0·0203 0·2623 0·3580
Median grain size (mm) 0·9282 -0·2265 -0·1732 -0·1148 -0·1925 0·9931
Mean grain size (mm) 0·9293 -0·1446 -0·1661 -0·0122 -0·2828 0·9922
(Median grain size)² 0·9639 -0·1087 -0·1188 0·0619 -0·0690 0·9636
(Mean grain size)² 0·9307 -0·0357 -0·1105 0·1900 -0·1615 0·9419
Median grain size (φ) -0·6532 0·5788 0·2711 0·1909 0·3442 0·9901
Mean grain size (φ) -0·6925 0·4754 0·2631 0·1842 0·4306 0·9941
Sorting -0·1680 -0·0328 0·0874 -0·0008 0·6647 0·4788
Skewness -0·1075 -0·7110 -0·1268 0·0847 0·1005 0·5504
Kurtosis 0·4975 -0·3011 -0·2123 -0·1736 -0·3754 0·5543
Normalized kurtosis 0·4322 -0·4230 -0·2549 -0·1618 -0·4844 0·6915
(Median grain size)⁻¹ -0·2184 0·9163 0·1567 0·0715 0·1241 0·9324
(Mean grain size)⁻¹ -0·3226 0·8463 0·2490 0·1103 0·3114 0·9914

emitted by the same source. Subsequent studies by Thurston et al. 33 where


other elements including sulfur and lead were measured showed a similar
result. They found that the selenium was strongly correlated with sulfur
for the warm season (6 May to 5 November). This result is in agreement
with the Whiteface Mountain results34 and suggests that selenium is an
indicator of long range transport of coal-fired power plant effluents to the
northeastern US. They found lead to be strongly correlated with bromine
and readily interpreted as motor vehicle emissions.
In the study of Tucson, AZ,32 whole filter data were analyzed separately
at each site. Factors were identified as soil, automotive, several secondary
aerosols such as (NH4)2S04 and several unknown factors. Also discovered
was a factor that represented the variation of elemental composition in
their aliquots of their neutron activation standard containing Na, Ca, K,
Fe, Zn and Mg. This finding illustrates one of the important uses of factor analysis: screening the data for noisy variables or analytical artifacts.
One of the valuable uses of this type of analysis is in screening large data
sets to identify errors. 20 With the use of atomic and nuclear methods to
analyze environmental samples for a multitude of elements, very large
data sets have been generated. Because of the ease in obtaining these
results with computerized systems, the elemental data acquired are not
always as thoroughly checked as they should be, leading to some, if not
many, bad data points. It is advantageous to have an efficient and effective
method to identify problems with a data set before it is used for further
studies. Principal component factor analysis can provide useful insight
into several possible problems that may exist in a data set including
incorrect single values and some types of systematic errors.
Gatz35 used a principal components analysis of aerosol composition and
meteorological data for St Louis, MO, taken as part of project MET-
ROMEX.36,37 Nearly 400 filters collected at 12 sites were analyzed for up
to 20 elements by ion-induced X-ray fluorescence. Gatz used additional
parameters in his analysis including day of the week, mean wind speed, per
cent of time with the wind from NE, SE, SW or NW quadrants or variable,
ventilation rate, rain amount and duration. At several sites the inclusion
of wind data permitted the extraction of additional factors that allowed
identification of motor vehicle emissions in the presence of specific point
sources of lead such as a secondary copper smelter. An important advan-
tage of this form of factor analysis is the ability to include parameters such
as wind speed and direction or particle size in the analysis.
In the early applications of factor analysis to particulate compositional
data, it was generally easy to identify a fine particle mode lead/bromine

factor that could be assigned as motor vehicle emissions. In many cases,


a calcium factor sometimes associated with lead could be found in the
coarse mode analysis and could be assigned as road dust. However, there
is a problem of diminishing lead concentrations in gasoline. As the lead
and related bromine concentrations diminish, the clearly distinguishable
covariance of these two elements is disappearing. In a study of particle
sources in southeast Chicago based on samples from 1985 and 1986,
much lower lead levels were observed and the lead/bromine correlation
was quite weak. 38 Thus, the identification of highway emissions through
factor analysis based on lead or lead and bromine is becoming more
and more difficult and other analyte species are going to be needed in the
future.
A problem that exists with these forms of factor analysis is that they do
not permit quantitative source apportionment of particle mass or of
specific elemental concentrations. In an effort to find an alternative
method that would provide information on source contributions when
only the ambient particulate analytical results are available,18,38-40 target transformation factor analysis (TTFA) has been developed, in which uncentered but standardized data are analyzed. In this analysis, resolution similar to that obtained from a chemical mass balance (CMB) analysis can be obtained. However, a CMB analysis can be made on a single sample if the source data are known, while TTFA requires a series of samples with varying impacts by the same sources, but does not require a priori knowledge of the source characteristics. The objectives of TTFA are (1) to
determine the number of independent sources that contribute to the
system, (2) to identify the elemental source profiles, and (3) to calculate the
contribution of each source to each sample.
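A highly simplified sketch of the target-transformation step is given below: a candidate source profile (test vector) is projected onto the retained factor axes, and the closeness of the projection to the test vector measures the overlap. The data, the least-squares projection and the overlap measure are illustrative assumptions of the sketch, not the published FANTASIA implementation:

import numpy as np

def target_transformation(factor_axes, test_vector):
    """Project a candidate source profile onto the space spanned by the
    retained factor axes (columns of factor_axes) and return the projection
    together with a simple overlap measure (correlation between the test
    vector and its projection). A sketch only."""
    coeffs, *_ = np.linalg.lstsq(factor_axes, test_vector, rcond=None)
    projection = factor_axes @ coeffs
    overlap = np.corrcoef(test_vector, projection)[0, 1]
    return projection, overlap

rng = np.random.default_rng(6)
true_profiles = rng.uniform(size=(2, 12))            # two sources, 12 species
contributions = rng.uniform(size=(40, 2))            # 40 samples
X = contributions @ true_profiles                    # noise-free mixture

_, _, Vt = np.linalg.svd(X, full_matrices=False)
axes = Vt[:2].T                                      # two retained factor axes

good_test = true_profiles[0]                         # a profile actually present
bad_test = rng.uniform(size=12)                      # an arbitrary profile
for name, vec in [("true source", good_test), ("arbitrary vector", bad_test)]:
    _, overlap = target_transformation(axes, vec)
    print(f"{name}: overlap = {overlap:.3f}")

A profile that lies in the factor space reproduces itself almost exactly, while an arbitrary vector does not; iterative refinement of simple test vectors builds on the same projection step.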
One of the first applications of TTFA was to the source identification of urban street dust.11 A sample of street dust was physically fractionated by particle size, density and magnetic susceptibility to produce 30 subsamples. Each subsample was analyzed by instrumental neutron activation
analysis and atomic absorption spectroscopy to yield analytical results for
35 elements. The number of sources is determined by performing an
eigenvalue analysis on the matrix of correlations between the samples. A
target transformation determines the degree of overlap between an input
source profile and one of the calculated factor axes. The input source
profiles, called test vectors, are developed from existing knowledge of the
emission profiles of various sources or by an iterative technique from
simple test vectors. 41 The identified source profiles are then used in a simple

weighted least-squares determination of the mass contributions of the


sources.42
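A bare-bones sketch of such a weighted least-squares apportionment for a single sample is shown below; the profile and uncertainty values are illustrative only, and the non-negativity constraints and error propagation used in practice are omitted:

import numpy as np

def weighted_least_squares_contributions(profiles, ambient, sigma):
    """Estimate source mass contributions for one sample by weighted least
    squares: ambient ~ profiles @ contributions, with each species weighted
    by the inverse of its measurement uncertainty. A sketch of the general
    approach only."""
    W = 1.0 / np.asarray(sigma)            # per-species weights
    A = profiles * W[:, None]              # weighted species-by-sources matrix
    b = ambient * W
    contributions, *_ = np.linalg.lstsq(A, b, rcond=None)
    return contributions

# Illustrative species-by-source profile matrix (hypothetical values only)
profiles = np.array([[120.0, 10.0, 5.0],
                     [30.0, 0.5, 2.0],
                     [0.0, 250.0, 20.0],
                     [1.0, 3.0, 150.0]])
true = np.array([4.0, 18.0, 2.5])          # hypothetical source masses
ambient = profiles @ true                  # synthetic ambient concentrations
sigma = 0.05 * ambient + 1.0               # assumed measurement uncertainties
print(np.round(weighted_least_squares_contributions(profiles, ambient, sigma), 2))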
In the analysis of the street dust, six sources were identified including
soil, cement, tire wear, direct automobile exhaust, salt and iron particles.
The lead concentration of the motor vehicle source was found to be 15%
with a lead to bromine ratio of 0·39. This ratio is in good agreement with the values obtained by Dzubay et al.43 for Los Angeles freeways and in the range presented by Harrison & Sturges44 in their extensive review of the literature. A comparison of the actual mass fractions with those calculated from the TTFA results shows that the TTFA provided a good reproduction of the mass distribution and source apportionments of the street dust, and suggests that a substantial fraction of the urban roadway dust is anthropogenic in origin.
One of the principal advantages of TTFA is that it can identify the
source composition profiles as they exist at the receptor site. There can be
changes in the composition of the particles in transit from the source to
the receptor and approaches that provide these modified source profiles
should improve the receptor model results. Chang et al. 45 have applied
TTFA to an extensive set of data from St Louis, MO, to develop source
composition profiles based on a subset selection process developed by
Rheingrover & Gordon. 46 They select samples from a data base such as the
one obtained in the Regional Air Pollution Study (RAPS) of St Louis,
MO, that were heavily influenced by major sources of each element. These
samples were identified according to the following criteria:
(1) Concentration of the element in question X > X̄ + Zσ, where X̄ is the average concentration of that particular element for each station and size fraction (coarse or fine particle size fraction), Z is typically set at about three for most elements, and σ is the standard deviation of the concentration of that element.
(2) The standard deviation of the 6 or 12 hourly average wind directions
for most samples, or minute averages for 2-h samples, taken during
intensive periods is less than 20°.
Samples that are strongly affected by emissions from a source were
identified through observation of clustering of mean wind directions for
the sampling periods selected with angles pointing toward the source.
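A simple sketch of this selection logic, with hypothetical concentrations and wind-direction spreads and with the thresholds stated above treated as adjustable parameters, might look like:

import numpy as np

def select_impacted_samples(conc, wind_dir_sd, z=3.0, max_dir_sd=20.0):
    """Flag sampling periods heavily impacted by a source of one element,
    following the two criteria above: concentration greater than the station
    mean plus z standard deviations, and a wind-direction standard deviation
    for the period below max_dir_sd degrees. A sketch of the logic only."""
    conc = np.asarray(conc, dtype=float)
    wind_dir_sd = np.asarray(wind_dir_sd, dtype=float)
    threshold = conc.mean() + z * conc.std()
    return (conc > threshold) & (wind_dir_sd < max_dir_sd)

# Hypothetical concentrations (ng/m3) and wind-direction spreads (degrees)
rng = np.random.default_rng(7)
conc = rng.normal(50.0, 10.0, size=120)
wind_sd = rng.uniform(5.0, 60.0, size=120)
conc[[10, 45, 80]] = [400.0, 350.0, 500.0]     # three heavily impacted periods
wind_sd[[10, 45, 80]] = [12.0, 18.0, 35.0]     # only two of them had steady winds
print(np.flatnonzero(select_impacted_samples(conc, wind_sd)))   # expected: [10 45]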
The RAPS data of about 35000 individual ambient aerosol samples
were collected at 10 selected sampling sites in the vicinity of St Louis, MO,
and were screened according to the criteria stated above. With wind
trajectory analysis, specific emission sources could be identified even in

cases where the sources were located very close together. 46 A compilation
of the selected impacted samples was made so that target transformation
factor analysis could be employed to obtain elemental profiles for these
sources at the various receptor sites. 45
Thus, TTFA may be very useful in determining the concentration of lead in motor vehicle emissions as the mix of leaded fuel continues to
change. Multivariate methods can thus provide considerable information
regarding the sources of particles including highway emissions from only
the ambient data matrix. The TTFA method represents a useful approach when source information for the area is lacking or suspect and if there is uncertainty as to the identification of all of the sources contributing to the measured concentrations at the receptor site. TTFA has been performed using FANTASIA.18a,40
Further efforts have recently been made by Henry & Kim 47 on extending
eigenvector analysis methods. They have been examining ways to incor-
porate the explicit physical constraints that are inherent in the mixture
resolution problem into the analysis. Through the use of linear program-
ming methods, they are better able to define the feasible region in which
the solution must lie. There exists a limited region in the solution space
because the elements of the source profiles must all be greater than or
equal to zero (non-negative source profiles) and the mass contributions of
the identified sources must also be greater than or equal to zero. Although
there have been only limited applications of this expanded method, it offers
an important additional tool to apply to those systems where a priori
source profile data are not available. These methods provide a useful
parallel analysis with CMB to help ensure that the profiles used are
reasonable representations of the sources contributing to a given set of
samples.

3.4 Illustrative example


In order to demonstrate the use of target transformation factor analysis
for the resolution of sources of urban aerosols, TTFA will be applied to
a compositional data set obtained from aerosol samples collected during
the RAPS program in St Louis, Missouri.

3.4. 1 Description of data


In the RAPS program, automated dichotomous samplers were operated
over a 2-year period at 10 sites in the St Louis metropolitan area. Ambient
aerosol samples were collected in fine (< 2·4 µm) and coarse (2·4 to 20 µm) fractions and were analyzed at the Lawrence Berkeley Laboratory

for total mass by beta-gauge measurements and for 27 elements by X-ray


fluorescence. The data for the samples collected during July and August
1976 from station 112 were selected for the TTFA process.
Station 112 was located near Francis Field, the football stadium on the
campus of Washington University, west of downtown St Louis. During
the 62 days of July and August, filters were changed at 12-h intervals,
producing a total of 124 samples in each of the fine and coarse fractions.
Data were missing for 24 pairs of samples leaving a total of 100 pairs of
coarse and fine fraction samples. Of the 27 elements determined for each
sample, a majority of the determinations of 10 elements had values below
the detection limits. Since a complete and accurate data set is required to
perform a factor analysis, these 10 elements were eliminated from the
analysis. For example, arsenic was excluded because almost all of the
values were below the detection limits. Arsenic determinations by X-ray
fluorescence are often unreliable because of an interference between the
arsenic K X-ray and the lead L X-ray. A neutron activation analysis of
these samples would produce better arsenic determinations. Reliable data
for arsenic may be important to the differentiation of coal flyash and
crustal material; two materials with very similar source profiles. The low
percentage of measured elements can lead to distortions in the scaling
factors produced by the multiple regression analysis. The remaining mass
consists primarily of hydrogen, oxygen, nitrogen and carbon. Although no
measurements of carbon are included in the RAPS data, that portion of
the sample mass must still be accounted for by the resolved sources. In
order to produce the best possible source resolutions, it is vital to have
accurate measurements of the mass of total suspended particles (TSP) as
well as determinations for as many elements as possible.
The fine and the coarse samples were analyzed separately and only the
fine fraction results will be reported here. In this target transformation
analysis, a set of potential source profiles was assembled from the litera-
ture to use as initial test vectors. In addition, the set of unique vectors was
also tested.

3.4.2 Results
The eigenvector analysis provided the results presented in Table 8. Exami-
nation of the eigenvectors suggests the presence of four major sources,
possibly two weak sources, and noise. To begin the analysis, a four-vector
solution was obtained. The iteratively refined source profiles are given in
Table 9. The first three vectors can be easily identified as motor vehicles

TABLE 8
Results of eigenvector analysis of July and August 1976 fine fraction data at
site 112 in St Louis, Missouri

Factor Eigenvalue Chi square Exner % Error

1 90·0 210 0·324 204


2 5·0 156 0·214 164
3 1·7 65 0·141 129
4 1·3 63 0·064 93
5 0·16 55 0·047 72
6 0·09 26 0·034 68
7 0·03 24 0·027 67
8 0·02 24 0·021 58
9 0·02 15 0·016 49

(Pb, Br), regional sulfate, and soil/flyash (Si, Al) based on their apparent
elemental composition.
However, the fourth vector showed high K, Zn, Ba and Sr and was not
initially obvious as to its origin. The resulting mass loadings were then
calculated and the only significant values were for the sampling periods of

TABLE 9
Refined source profiles for the four source solution at RAPS Site 112, July-
August 1976

Element Motor vehicle Flyash/sulfate Soil Fireworks

Al 3·0 0·9 62·0 60·0
Si 0·0 2·8 140·0 0·0
S 0·0 232·0 14·0 26·0
Cl 5·2 1·6 0·31 19·0
K 0·0 0·06 43·0 580·0
Ca 12·0 0·006 17·0 0·27
Ti 2·8 1·8 2·3 0·0
Mn 1·5 0·1 0·8 3·6
Fe 5·8 3·8 38·0 9·0
Ni 0·2 0·06 0·05 0·3
Cu 1·9 0·2 0·03 4·6
Zn 9·8 1·4 0·0 24·0
Se 0·1 0·1 0·0 0·01
Br 26·0 0·0 2·7 2·0
Sr 0·0 0·0 0·9 12·0
Ba 1·45 0·3 0·8 15·0
Pb 105·0 8·0 3·8 0·0

TABLE 10
Comparison of data with and without samples from 4 and 5 July (ng/m3)
RAPS Station 112, July and August 1976, fine fraction

Element With 4 and 5 July (mean) Without 4 and 5 July (mean)

Al 220 ± 30 200 ± 30
Si 440 ± 60 450 ± 60
S 4370 ± 310 4360 ± 320
Cl 90 ± 10 80 ± 9
K 320 ± 130 150 ± 9
Ca 110 ± 10 110 ± 10
Ti 63 ± 13 64 ± 13
Mn 17 ± 3 17 ± 3
Fe 220 ± 20 220 ± 20
Ni 2·3 ± 0·2 2·3 ± 0·2
Cu 16 ± 3 15 ± 3
Zn 78 ± 8 75 ± 8
Se 2·7 ± 0·2 2·7 ± 0·2
Br 140 ± 9 130 ± 8
Sr 5 ± 4 1·1 ± 0·1
Ba 19 ± 5 15 ± 4
Pb 730 ± 50 720 ± 50

noon to midnight on 4 July and midnight to noon on 5 July. This was 4


July 1976 and there was a bicentennial fireworks display at this location.
Thus, these two highly influenced samples change the whole analysis.
To illustrate this further, Table 10 gives the average values of the
elemental composition of the fine fraction samples for the samples with
and without the 4 and 5 July samples included. It can be seen that these
two samples from 4 and 5 July from the IOO-sample set have changed the
average value of K by a factor of 2 and the average Sr by a factor of 5.
Thus, TTF A can find strong, unusual events in a large complex data set.
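The scale of this effect can be checked by simple arithmetic on the two sets of means. The short Python sketch below back-calculates the implied potassium level in the two excluded samples; it assumes, as stated above, 100 samples in the full set and 98 after exclusion, and uses the rounded means from Table 10, so the result is only approximate.

# Back-calculate the mean K content of the two 4-5 July samples from the
# with/without means reported in Table 10 (rounded values, ng/m3).
mean_with, n_with = 320.0, 100        # all samples
mean_without, n_without = 150.0, 98   # excluding 4 and 5 July

excluded_total = mean_with * n_with - mean_without * n_without
excluded_mean = excluded_total / (n_with - n_without)
print(f"implied mean K in the two excluded samples: {excluded_mean:.0f} ng/m3")
# About 8650 ng/m3, more than 50 times the mean of the remaining samples,
# which is why just two samples double the overall potassium average.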
After dropping the samples from 4 and 5 July, the analysis was repeated
and the results are presented in Table 11. Now there are 3 strong factors,
2 weaker ones, and more of a continuum. Thus, a 5-factor solution was
sought. These results are presented in Table 12.
TABLE 11
Results of eigenvector analysis of July and August 1976 fine fraction data at
Site 112 in St Louis, Missouri, excluding 4 and 5 July data

Factor    Eigenvalue    Chi square    Exner    Average % error

1         87·0          210           0·304    197
2         4·9           152           0·304    197
3         2·0           57            0·070    123
4         0·2           42            0·050    98
5         0·1           26            0·037    73
6         0·1           25            0·029    69
7         0·02          26            0·023    69
8         0·02          17            0·019    67
9         0·01          16            0·015    53

TABLE 12
Refined source profiles (mg/g), RAPS Station 112, July and August 1976, fine
fraction without 4 and 5 July data

Element    Motor vehicle    Sulfate    Soil/flyash    Paint    Refuse

Al         5·0              1·1        53·0           0·0      0·0
Si         0·0              1·9        130·0          0·0      7·0
S          0·2              240·0      19·0           6·0      0·0
Cl         2·4              1·1        0·0            4·6      22·0
K          1·4              1·6        15·0           5·7      48·0
Ca         11·0             0·0        16·0           34·0     1·2
Ti         0·0              0·7        2·5            110·0    0·0
Mn         0·0              0·0        0·7            4·8      8·6
Fe         0·0              1·1        36·0           90·0     36·0
Ni         0·08             0·04       0·042          0·011    0·7
Cu         0·6              0·01       0·0            0·0      8·7
Zn         0·8              0·0        0·0            3·7      65·0
Se         0·1              0·1        0·001          0·2      0·2
Br         30·0             0·3        2·5            0·0      0·05
Sr         0·09             0·01       0·15           0·1      0·001
Ba         0·7              0·035      0·07           28·0     0·5
Pb         107·0            6·5        5·0            0·0      46·0

The target transformation analysis for the fine fraction without 4 and
5 July data indicated the presence of a motor vehicle source, a sulfate
source, a soil or flyash source, a paint-pigment source and a refuse source.
The presence of the sulfate, paint-pigment and refuse factors was deter-
mined by the uniqueness test for the elements sulfur, titanium, and zinc,
respectively. In the paint-pigment factor, titanium was found to be asso-
ciated with the elements sulfur, calcium, iron and barium. This plant used
iron titanate as its input material and the profile obtained in this analysis
appears to be realistic. The zinc factor, associated with the elements
chlorine, potassium, iron and lead, is attributed to refuse-incinerator
emissions. This factor might also represent particles from zinc and/or lead
smelters, although a high chlorine concentration is usually associated with
particles from refuse incinerators.48,49
The results of this analysis provide quite reasonable fits to the elemental
concentrations and to the fine mass concentrations for this system. Thus,
the TTFA provided a resolution of source types and concentrations that
appear plausible, although specific sources are not identified and quantita-
tively apportioned. From other studies with other data sets, it appears
TTFA is typically able to identify 5 to 7 source types as long as they are
reasonably distinct from one another.

4 SUMMARY

The purpose of this chapter has been to introduce a number of ways in


which eigenvector analysis can be used to reduce multivariate, environmen-
tal data to manageable and interpretable proportions. These methods
have been made much more accessible with the quite sophisticated statisti-
cal packages that are now available for microcomputers. It is now quite
easy to perform analyses that previously required mainframe computer
capabilities. The key to utilizing eigenvector analysis is the recognition
that the structure it finds in the data may arise from many different
sources, reflecting both real causal processes in the system being studied
and errors in the sampling, analysis, or data transcription. There is
often reluctance on the part of many investigators to use such 'complex'
data analysis tools. However, they have often been surprised at how much
useful information can be gleaned from large data sets using eigenvector
methods. The key ingredients are learning enough about the methods to
know what assumptions are being made and applying some healthy skep-
ticism with regards to the interpretation of the results to be certain that
they make sense within the context of the problem under investigation.
The data are generally trying to convey information. The critical problem
is to properly interpret the message even if it is that these data are wrong.
It is hoped that this chapter has provided some assistance in understand-
ing eigenvector methods and their application to environmental data
analysis.

ACKNOWLEDGEMENTS

Many of the studies reported here have been performed by students or


post-doctoral associates in the author's group and their substantial contri-
butions to the results presented here must be acknowledged. The work
could not have been conducted without the support of the US Department
of Energy under contract DE-AC02-80EV10403, the US Environmental
Protection Agency under Grant No. R808229 and Cooperative Agree-
ment R806236 and the National Science Foundation under Grants ATM
85-20533, ATM 88-10767 and ATM 89-96203.

REFERENCES

1. Harman, H.H., Modern Factor Analysis, 3rd edn. University of Chicago
Press, Chicago, 1976.
2. Malinowski, E.R. & Howery, D.G., Factor Analysis in Chemistry. J. Wiley &
Sons, New York, 1980.
3. Rozett, R.W. & Petersen, E.M., Methods of factor analysis of mass spectra.
Anal. Chem., 47 (1975) 1301-08.
4. Duewer, D.L., Kowalski, B.R. & Fasching, J.L., Improving the reliability of
factor analysis of chemical data by utilizing the measured analytical uncer-
tainty. Anal. Chem., 48 (1976) 2002-10.
5. Horst, P., Matrix Algebra for Social Scientists. Holt, Rinehart and Winston,
New York, 1963.
6. Joreskog, K.G., Klovan, J.E. & Reyment, R.A., Geological Factor Analysis.
Elsevier Scientific Publishing Company, Amsterdam, 1976.
7. Lawson, C.L. & Hanson, R.J., Solving Least Squares Problems. Prentice-
Hall, Englewood Cliffs, NJ, 1974.
8. Zhou, D., Chang, T. & Davis, J.C., Dual extraction of R-mode and Q-mode
factor solutions. Int. J. Math. Geol. Assoc., 15 (1983) 581-606.
9. Hwang, C.S., Severin, K.G. & Hopke, P.K., Comparison of R- and Q-mode
factor analysis for aerosol mass apportionment. Atmospheric Environ., 18
(1984) 345-52.
10. Belsley, D.A., Kuh, E. & Welsch, R.E., Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity. Wiley, New York, 1980.
11. Hopke, P.K., Lamb, R.E. & Natusch, D.F., Multielemental characterization
of urban roadway dust. Environ. Sci. Technol., 14 (1980) 164-72.
12. Cattell, R.B., Handbook of Multivariate Experimental Psychology. Rand
McNally, Chicago, 1966, pp. 174-243.
13. Guttman, L., Some necessary conditions for common factor analysis. Psy-
chometrika, 19 (1954) 149-61.
14. Kaiser, H.F. & Hunka, S., Some empirical results from Guttman's stronger
lower bound for the number of common factors. Educational and Psychological
Measurement, 33 (1973) 99-102.
15. Hopke, P.K. Comments on 'Trace element concentrations in summer

aerosols at rural sites in New York State and their possible sources' by P.
Parekh and L. Husain and 'Seasonal variations in the composition of
ambient sulfur-containing aerosols' by R. Tanner and B. Leaderer. Atmo-
spheric Environ., 16 (1982) 1279-80.
16. Exner, O., Additive physical properties. Collection of Czech. Chem.
Commun., 31 (1966) 3222-53.
17. Malinowski, E. R., Determination of the number of factors and the experi-
mental error in a data matrix. Anal. Chem., 49 (1977) 612-17.
17a. Malinowski, E.R. & McCue, M., Qualitative and quantitative determination
of suspected components in mixtures by target transformation factor analysis
of their mass spectra. Anal. Chem., 49 (1977) 284-7.
18. Hopke, P.K., Target transformation factor analysis as an aerosol mass
apportionment method: A review and sensitivity analysis. Atmospheric
Environ., 22 (1988a) 1777-92.
18a. Hopke, P.K., FANTASIA, A Program for Target Transformation Factor
Analysis; for program availability, contact P.K. Hopke (1988b).
19. Hopke, P.K., Receptor Modeling in Environmental Chemistry. J. Wiley &
Sons, New York, 1985.
20. Roscoe, B.A., Hopke, P.K., Dattner, S.L. & Jenks, J.M., The use of principal
components factor analysis to interpret particulate compositional data sets.
J. Air Pollut. Control Assoc., 32 (1982) 637-42.
21. Bowman, H.R., Asaro, F. & Perlman, I., On the uniformity of composition
in obsidians and evidence for magmatic mixing. J. Geology, 81 (1973) 312-27.
22. Lis, S.A. & Hopke, P.K., Anomalous arsenic concentrations in Chautauqua
Lake. Env. Letters, 5 (1973) 45-55.
23. Ruppert, D.F., Hopke, P.K., Clute, P.R., Metzger, W.J. & Crowley, D.J.,
Arsenic concentrations and distribution of Chautauqua Lake sediments. J.
Radioanal. Chem., 23 (1974) 159-69.
24. Clute, P.R., Chautauqua Lake sediments. MS Thesis, State University
College at Fredonia, NY, 1973.
25. Hopke, P.K., Ruppert, D.F., Clute, P.R., Metzger, W.J. & Crowley, D.J.,
Geochemical profile of Chautauqua Lake sediments. J. Radioanal. Chem., 29
(1976) 39-59.
25a. Hopke, P.K., Gladney, E.S., Gordon, G.E., Zoller, W.H. & Jones, A.G., The
use of multivariate analysis to identify sources of selected elements in the
Boston urban aerosol. Atmospheric Environ., 10 (1976) 1015-25.
26. Folk, R.L., A review of grain-size parameters. Sedimentology, 6 (1964) 73-93.
27. Tesmer, I.H., Geology of Chautauqua County, New York Part I: Strati-
graphy and Paleontology (Upper Devonian). New York State Museum and
Science Service Bull. no. 391: Albany, University of the State of New York,
State Education Department, 1963, p. 65.
28. Muller, E.H., Geology of Chautauqua County, New York, Part II: Pleis-
tocene geology. New York State Museum and Science Service Bull. no. 392:
Albany, the University of the State of New York, The State Education
Department, 1963, p. 60.
29. Prinz, B. & Stratmann, H., The possible use of factor analysis in investigating
air quality. Staub-Reinhalt Luft, 28 (1968) 33-9.

30. Blifford, I.H. & Meeker, G.O., A factor analysis model of large scale pol-
lution. Atmospheric Environ., 1 (1967) 147-57.
31. Colucci, J.M. & Begeman, C.R., The automotive contribution to air-borne
polynuclear aromatic hydrocarbons in Detroit. J. Air Pollut. Control Assoc.,
15 (1965) 113-22.
32. Gaarenstrom, P.D., Perone, S.P. & Moyers, J.P., Application of pattern
recognition and factor analysis for characterization of atmospheric par-
ticulate composition in southwest desert atmosphere. Environ. Sci. Technol.,
11 (1977) 795-800.
33. Thurston, G.D. & Spengler, J.D. A quantitative assessment of source contri-
butions to inhalable particulate matter pollution in metropolitan Boston.
Atmospheric Environ., 19 (1985) 9-26.
34. Parekh, P.P. & Husain, L., Trace element concentrations in summer aerosols
at rural sites in New York State and their possible sources. Atmospheric
Environ., 15 (1981) 1717-25.
35. Gatz, D.F., Identification of aerosol sources in the St Louis area using factor
analysis. J. Appl. Met., 17 (1978) 600-08.
36. Changnon, S.A., Huff, F.A., Schickedanz, P.T. & Vogel, J.L., Summary of
METROMEX, Vol. I: Weather Anomalies and Impacts, Illinois State Water
Survey Bulletin 62, Urbana, IL, 1977.
37. Ackerman, B., Changnon, S.A., Dzurisin, G., Gatz, D.F., Grosh, R.C.,
Hilberg, S.D., Huff, F.A., Mansell, J.W., Ochs, H.T., Peden, M.E., Schick-
edanz, P.T., Semonin, R.G. & Vogel, J.L., Summary of METROMEX, Vol.
2: Causes of Precipitation Anomalies. Illinois State Water Survey Bulletin 63,
Urbana, IL, 1978.
38. Hopke, P.K., Wlaschin, W., Landsberger, S., Sweet, C. & Vermette, S.J., The
source apportionment of PM10 in South Chicago. In PM-10: Implementation
of Standards, ed. C.V. Mathai & D.H. Stonefield. Air Pollution Control
Association, Pittsburgh, PA, 1988, pp. 484-94.
39. Alpert, D.J. & Hopke, P.K., A quantitative determination of sources in the
Boston urban aerosol. Atmospheric Environ., 14 (1980) 1137-46.
39a. Alpert, D.J. & Hopke, P.K., A determination of the sources of airborne
particles collected during the regional air pollution study. Atmospheric
Environ., 15 (1981) 675-87.
40. Hopke, P.K., Alpert, D.J. & Roscoe, B.A., FANTASIA-A program for
target transformation factor analysis to apportion sources in environmental
samples. Computers & Chemistry, 7 (1983) 149-55.
41. Roscoe, B.A. & Hopke, P.K., Comparison of weighted and unweighted
target transformation rotations in factor analysis. Computers & Chem., 5
(1981) 5-7.
42. Severin, K.G., Roscoe, B.A. & Hopke, P.K., The use of factor analysis in
source determination of particulate emissions. Particulate Science and Tech-
nology, 1 (1983) 183-92.
43. Dzubay, T.G., Stevens, R.K. & Richards, L.W., Composition of aerosols
over Los Angeles freeways. Atmospheric Environ., 13 (1979) 653-9.
44. Harrison, R.M. & Sturges, W.T. The measurement and interpretation of
Br/Pb ratios in airborne particles. Atmospheric Environ., 17 (1983) 311-28.
45. Chang, S.N., Hopke, P.K., Gordon, G.E. & Rheingrover, S.W., Target

transformation factor analysis of airborne particulate samples selected by


wind-trajectory analysis. Aerosol Sci. Technol., 8 (1988) 63-80.
46. Rheingrover, S.G. & Gordon, G.E., Wind-trajectory method for determining
compositions of particles from major air pollution sources. Aerosol Sci.
Technol., 8 (1988) 29-61.
47. Henry, R.C. & Kim, B.M., Extension of self-modeling curve resolution to
mixtures of more than three components. Part 1. Finding the basic feasible
region. Chemometrics and Intelligent Laboratory Systems, 8 (1990) 205-16.
48. Greenberg, R.R., Zoller, W.H. & Gordon, G.E., Composition and size
distribution of particles released in refuse incineration. Environ. Sci. Technol.,
12 (1978) 566-73.
49. Greenberg, R.R., Gordon, G.E., Zoller, W.H., Jacko, R.B., Neuendorf,
D.W. & Yost, K.J., Composition of particles emitted from the Nicosia
municipal incinerator. Environ. Sci. Technol., 12 (1978) 1329-32.
50. Heidam, N. & Kronborg, D., A comparison of R- and Q-modes in target
transformation factor analysis for resolving environmental data. Atmospher-
ic Environment, 19 (1985) 1549-53.
51. Mosteller, F., The jackknife. Rev. Int. Stat. Inst., 39 (1971) 363-8.
Chapter 5

Errors and Detection Limits


M.J. ADAMS
School of Applied Sciences, Wolverhampton Polytechnic, Wulfruna
Street, Wolverhampton, WV1 1SB, UK

1 TYPES OF ERROR

It is important to appreciate that in all practical sciences any measurement


we make will be subject to some degree of error, no matter how much care
is taken. As this measurement error can influence the subsequent conclu-
sions drawn from the analysis of the data, it is essential to identify, reduce
and minimise the effects of the errors wherever possible. This procedure
requires that we have a thorough knowledge of the measurement process
and implies a knowledge of the cause and source of errors in our measure-
ments. For their treatment and analysis, experimental errors are assigned
to one of three types: gross, random or systematic errors.

1.1 Gross errors


Gross errors are, hopefully, rare in a well organised laboratory. Their
occurrence does not fit into any common pattern and the analytical data
following a gross error should be rejected as providing little or no analyti-
cal information. For example, if a powder sample is spilled and contami-
nated prior to examination or a solution is incorrectly diluted, then any
subsequent analysis is likely to provide erroneous results and it would be
wrong to infer any conclusions or base any meaningful decision on the
resultant analytical data. It is worth noting here that gross errors can
easily arise outside of the laboratory. If a soil or plant sample is taken from
the wrong site then, no matter how well the laboratory analysis is performed,
any conclusions drawn from the results will be subject to a gross error.
The presence of previously unidentified gross errors in any data set can

often be inferred by the presence of outliers which are most easily iden-
tified by graphing or producing some pictorial representation of the data.
It is important to realise that gross errors can occur in the best planned
laboratories and a watch should always be kept for their presence.

1.2 Random errors


The most common and easily analysed errors are random; their occurr-
ence is irregular and individual errors are not predictable. Random errors
are present in all measurement operations and they provide the observed
variability in any repeated determination. Random errors in an analysis
can be, and always should be, subject to quantitative statistical analysis to
provide a measure of their magnitude. This chapter is largely concerned
with the analysis and testing of random errors in analytical measurements.

1.3 Systematic errors


By definition, systematic errors are not random; the presence of a fixed
systematic error affects equally each determination in a series of measure-
ments. The source of a systematic error may be human, for example
always reading from the top of a burette meniscus during a titration, or
mechanical or instrumental. A faulty balance may result in all analyses in
that laboratory producing results too high, a positive bias due to the fixed
systematic error. Whilst the presence of systematic errors does not give rise
to increased variability within a set or batch of data, it can increase
variability between batches. Thus, if a student repeatedly weighs the same
sample on several balances then the results using a particular balance will
indicate the random error for that student and the differences between
balances may indicate systematic errors in the balances. Systematic errors
are frequently encountered in modern instrumental methods of analysis
and their effects are minimised by the analysis of standards and subse-
quent correction of the results by calibration. These procedures are dis-
cussed in more detail below. The nature and effects of these types of errors
on a series of analytical results may be best appreciated by an illustrative
example.

Example. In the determination of lead in a standard sample of dried


stream sediment, four independent analysts provided the series of results
shown in Table 1, where each result is expressed in mg of Pb per kg sample.
The variability in these data can be seen clearly in the blob chart of Fig.
1. The average or mean of the 10 determinations from each analyst is
shown along with the known true result as provided with the standard

TABLE 1
The concentration of lead in stream sediment samples, mg Pb kg-1, as deter-
mined by four analysts

                          Analyst
                     A        B        C        D

Results (mg kg-1)    49·2     48·6     55·2     57·1
                     49·4     47·4     54·8     58·6
                     51·2     51·4     54·6     56·4
                     50·5     49·7     55·8     57·8
                     51·5     51·8     56·1     55·3
                     48·7     52·8     54·8     53·9
                     49·6     47·8     55·3     57·9
                     50·2     47·4     55·9     55·4
                     49·1     49·2     56·1     53·5
                     51·6     57·6     55·2     58·2

Sum (mg kg-1)        501·0    497·7    553·8    564·1
Mean (mg kg-1)       50·1     49·77    55·38    56·41
Variance ((mg kg-1)2) 1·122   4·013    0·315    3·294
s (mg kg-1)          1·059    2·003    0·561    1·815
CV (%)               2·114    4·025    1·014    3·218

Known value = 50·00 mg kg-1 Pb.

sample. From this data we are able to assign quantitative, statistical,


values to the variability in each of the analyst's set of results and to the
difference between their mean values and the known value. Evaluation of
these measures leads us naturally to the important analytical concepts of
accuracy and precision. Accuracy is usually considered as a measure of the
correctness of a result. Visual inspection of the data in Table 1 and Fig.
1 can provide a qualitative conclusion that the mean results from analysts
A and B are similarly close to the expected true value and they both can
be described as providing accurate data. Analyst B, however, exhibits
much greater variability or scatter in the results than analyst A, and B
therefore is considered less precise. In the case of analysts C and D, both
show considerable deviation in their average result from the known true
value, both are inaccurate, but C is more precise than D. The terms
accuracy and precision are not synonymous in analytical and measure-
ment science. The accuracy of a measurement is its closeness to some
known, true value and is dependent on the presence and extent of any
systematic errors. On the other hand, precision indicates the variability in

Fig. 1. The variability in the lead concentration data, from Table 1, as deter-
mined by four analysts A, B, C and D. The correct value is illustrated by the
dotted line. Analyst A is accurate and precise, analyst B is accurate but less
precise, analyst C is inaccurate but of high precision and the data from analyst
D has low accuracy and poor precision.

a data set and is a measure of the random errors. It is important that this
distinction is appreciated. No matter how many times analysts C and D
repeat their measurements, because of some inherent bias in their pro-
cedure they cannot improve their accuracy. The precision might be im-
proved by exercising greater care to reduce random errors.
Regarding precision and the occurrence of random errors, two other
terms are frequently encountered in analytical science, repeatability and
reproducibility. If analyst B had performed the measurements sequentially
on a single occasion then a measure of precision would reflect the repeata-
bility of the analysis, the within-batch precision. If the tests were run over,
say, 2 days, then an analysis of data from each occasion would provide the
between-batch precision or reproducibility.

2 DISTRIBUTION OF ERRORS

Whilst the qualitative merits of each of our four analysts can be inferred
immediately from the data, to provide a quantitative assessment a statisti-
cal analysis is necessary and should always be performed on data from any
quantitative analytical procedure. This quantitative treatment requires
that some assumptions are made concerning the measurement process.
Any measure of a variable, mass, size, concentration, etc., is expected to

approximate the true value but it is not likely to be exactly equal to it.
Similarly, repeated measurements of the same variable will provide further
discrepancies between the observed results and the true value, as well as
differences between each practical measure, due to the presence of random
errors. As more repeated measurements are made, a pattern to the scatter
of the data will emerge; some values will be too high and some too low
compared with the known correct result, and, in the absence of any
systematic error, they will be distributed evenly about the true value. If an
infinite number of such measurements could be made then the true distri-
bution of the data about the known value would be known. Of course, in
practice this exercise cannot be accomplished but the presence of some
such parent distribution can be hypothesised. In analytical science this
distribution is assumed to approximate the well-known normal form and
our data are assumed to represent a sample from this idealised parent
population. Whilst this assumption may appear to be taking a big step in
describing our data, many scientists over many years of study have de-
monstrated that in a variety of situations repeat measurements do ap-
proximate well to the normal distribution and the normal error distribu-
tion is accepted as being the most important for use in statistical studies
of data.

2.1 The normal distribution


The shape of the normal distribution curve is illustrated in Fig. 2. Ex-
pressed as a probability function, the curve is defined by

p = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right]     (1)

where (x - μ) represents the measured deviation of any value x from the
population mean, μ, and σ is the standard deviation which defines the
spread of the curve. The formula given in eqn (1) need not be remembered;
its importance is in defining the shape of the normal distribution curve. As
drawn in Fig. 2, the total area under the curve is one for all values of μ and
σ by standardising the data. This standard normal transform is achieved
by subtracting the mean value from each data value and dividing
by the standard deviation,

z = \frac{x - \mu}{\sigma}     (2)

This operation produces a distribution with a mean of zero and unit


standard deviation. As the function described by eqn (1) expresses the

Fig. 2. The normal distribution curve, standardised to a mean value of 0 and
a standard deviation of 1. The cumulative probability function values, F(z), are
shown for standardised values z of 0, 1 and 2 standard deviations from the
mean and are available from statistical tables.

idealised case of the assumed parent population, in practice the sample


mean, x̄, and standard deviation, s, are substituted and employed to
characterise our analytical sample, the subset of measured data. These
sample measures and the related quantity, the variance, are given by:
mean,

\bar{x} = \frac{\sum x_i}{n}     (3)

variance,

V = \frac{\sum (x_i - \bar{x})^{2}}{n - 1}     (4)

and standard deviation,

s = \sqrt{V}     (5)
where n is the number of sample data values, i.e. the number of repeated
measures.
The application of these formulae to the data from the four analysts is
shown in Table 1 and confirms the qualitative conclusions from the visual
inspection of the data.
A point to note is that the divisor in the calculation of the variance, and,

hence, standard deviation, is (n - 1) and not n. This arises because V and
s are estimates of the parent distribution and the use of n would serve to
underestimate the deviation in the data set.
The standard deviation and variance are directly related (see eqns (4)
and (5)) and in practice the standard deviation of a set of measures is
usually reported, as this is in the same units as the original results. Another
frequently reported measure is the relative standard deviation or coefficient
of variation, CV, which is defined by

CV = \frac{s}{\bar{x}} \times 100     (6)

CV is thus a percentage spread of the data relative to the mean value.
Armed with the assumption that the analytical data is taken from a
parent normal distribution as described by eqn (1), then the properties of
the normal function can be used to infer further information from the
data. As shown in Fig. 2, the normal curve is symmetrical about the mean
and its shape depends on the standard deviation. The larger σ, or s, the
greater the spread of the distribution. For all practical purposes there is
little interest in the height of the normal curve; we are more concerned with
the area under sections of the curve and the cumulative distribution
function. Whatever the actual values of the mean and standard deviation
describing the curve, approximately 68% of the data will lie within one
standard deviation of the mean, 95% will be within two standard devia-
tions and less than 1 observation in 300 will lie more than three standard
deviations from the mean. These values are obtained from the cumulative
distribution function which is derived by integrating the standard normal
curve, with a mean of zero and unit standard deviation. This integral is
evaluated numerically and the results are usually presented in tabular
form. This table is available from many statistical texts, but for our
immediate purposes a few useful values are presented below:

standardised variable, z:                 0      1      2       3
cumulative probability function, F(z):    0·5    0·84   0·977   0·9987
and are shown in Fig. 2.
Returning to the results submitted by analyst A (Table 1), we can now
calculate the proportion of the results produced by A which would be
expected to be above some selected value, say 52 mg Pb kg-1. To use the
cumulative standard normal tables, the analyst's data must first be stan-
dardised. From eqn (2) this is achieved by z = (x - x̄)/s which, as
required, moves the mean of the data to zero and provides unit standard

deviation. For our example,

z = (52·0 - 50·1)/1·059 = 1·8

the value of z is 1·8 standard deviations above the mean and, from the data
given above, about 2·5% of determinations can be expected to exceed this
value.
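These calculations are easily reproduced in a few lines of code. The following is a minimal sketch in Python, using only the standard library, for analyst A's results; the exact tail area it reports (about 3·6%) is slightly larger than the rough figure quoted above, which was obtained from the short table of F(z) values with z rounded to 2.

# Mean, standard deviation, CV and the expected proportion of results above
# 52 mg kg-1 for analyst A (data from Table 1).
import math

results_A = [49.2, 49.4, 51.2, 50.5, 51.5, 48.7, 49.6, 50.2, 49.1, 51.6]  # mg Pb kg-1

n = len(results_A)
mean = sum(results_A) / n                                      # eqn (3)
variance = sum((x - mean) ** 2 for x in results_A) / (n - 1)   # eqn (4), n - 1 divisor
s = math.sqrt(variance)                                        # eqn (5)
cv = 100.0 * s / mean                                          # eqn (6)

z = (52.0 - mean) / s                                          # eqn (2), about 1.8
tail = 0.5 * math.erfc(z / math.sqrt(2))                       # P(Z > z) from the normal CDF

print(f"mean = {mean:.1f}, s = {s:.3f}, CV = {cv:.2f}%")
print(f"z = {z:.2f}, P(result > 52) = {100 * tail:.1f}%")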

2.2 The central limit theory


Providing there is no systematic error or bias in our measurements then
the mean result of a set of values provides us with an estimate of the true
value. It is unlikely, however, that the determined sample mean will be
exactly the same as the true value and the mean should be recorded and
presented along with some measure of confidence limits, some indication
of our belief that the mean is close to the correct value. The range of values
denoted by such limits will depend on the precision of the determinations
and the number of the measurements. It is intuitive that the more measure-
ments that are made, the more confidence we will have in the recorded
mean value. In our example above, analyst A may decide to perform 30
tests to provide a more reliable estimate and indication of the true value.
It is unlikely in practice that 30 repeated measurements would be per-
formed in a single batch experiment, a more common arrangement might
be to conduct the tests on 6 batches of 5 samples each. The mean values
determined from each of the 6 sets of data can be considered as being
derived from some parent population distribution and it is a well esta-
blished and important property of this distribution of means that it tends
to the normal distribution as the number of determinations increases, even
if the original data population is not normally distributed. This is the
central limit theorem and is important as most statistical tests are under-
taken using the means of repeated experiments and assume normality. The
mean of the sample means distribution will be the same as the mean value
of the original data, of course, but the standard deviation of the means is
smaller and is given by σ/√n, which is often termed the standard error of
the sample mean.
The central limit theory can be applied and used to define confidence
limits to accompany a declared analytical result. Given that the parent
distribution is normal, then from the cumulative normal distribution
function, 95% of the standard normal distribution lies between ± 1·96 of
the mean and, therefore, the 95% confidence limits can be defined as
x̄ ± 1·96σ/√n. If the mean concentration of lead in 30 samples is deter-

mined as being 50·1 mg kg-1 and the standard deviation is 10 mg kg-1,
then the 95% confidence interval for the analysis is given by
50·1 ± 1·96σ/√n = 50·1 ± 0·682. Because σ is unknown, being derived from the idea-
lised, infinite parent distribution, the estimate of standard deviation, s,
must be used and then the 95% confidence interval is given by x̄ ± ts/√n,
where the value of the factor t tends to 1·96 as n increases. The smaller
the sample size, the larger is t. Above n = 25 an approximate value of
t = 2·0 is often used. For 99% confidence limits, a t value of 3·0 can be
used. Other values of t for any confidence interval can be obtained from
standard statistical tables; they are derived from the t-distribution, a
symmetric distribution with zero mean, the shape of which is dependent
on the degrees of freedom of the sample data. In the above example, where
s is calculated from a sample size of n measures, there are (n - 1) degrees
of freedom.
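The calculation of a confidence interval is illustrated in the short Python sketch below; the mean, standard deviation and sample size used are illustrative values only, not figures from the example above.

# 95% confidence interval for a mean from n results, large-n approximation.
import math
from statistics import NormalDist

xbar, s, n = 50.1, 1.9, 30               # illustrative mean, sd and sample size (mg kg-1)
z95 = NormalDist().inv_cdf(0.975)        # 1.96 for a two-sided 95% interval
half_width = z95 * s / math.sqrt(n)      # z times the standard error of the mean
print(f"95% CI: {xbar:.1f} ± {half_width:.2f} mg kg-1")
# For small n the factor 1.96 should be replaced by the appropriate t value
# with n - 1 degrees of freedom, taken from statistical tables.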

2.3 Propagation of errors


Most analytical procedures are not simple single measures but are com-
prised of a number of discrete steps, e.g. dissolution and dilution. If any
analytical process involves measuring and combining the results from
several actions, then the cumulative effects of the individual errors asso-
ciated with each stage in the procedure must be considered, quantified and
combined to give the total experimental error.

2.3.1 Summed combinations


If the final determined analytical result, y, is the sum of several indepen-
dent variables, x_i, such that

y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n     (7)

where a_i represents some constant coefficients, then it is a property of the
variance that

V_y = a_1^{2} V_1 + a_2^{2} V_2 + \cdots + a_n^{2} V_n     (8)
The fact that the linear combination of errors can be determined by
summing the individual variances associated with each stage of the analy-
sis provides for a relatively simple calculation of the total experimental
error.

2.3.2 Multiplicative combination of errors


Combining measures of precision in a non-linear multiplication or divi-
sion calculation is much more complex than in the simple linear case

shown above. An approximate but more useful and simple procedure
makes use of the relative standard deviation, the coefficient of variation.
Given that the final result, y, is obtained from two measurements, x_1 and
x_2, by y = x_1/x_2, then rather than combining the variance values, which
involves complex formulae, the CV values can be used:

(CV_y)^2 = (CV_{x_1})^2 + (CV_{x_2})^2     (9)

CV_y = \sqrt{(CV_{x_1})^2 + (CV_{x_2})^2}     (10)

and the standard deviation, s_y, associated with the final result can be
obtained by rearranging eqn (6) such that

s_y = \frac{CV_y \times y}{100}     (11)
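The two propagation rules are illustrated in the short Python sketch below, using made-up values for two measured quantities.

# Propagation of errors for a sum and for a quotient of two measurements.
import math

x1, s1 = 25.0, 0.4     # hypothetical measurement and its standard deviation
x2, s2 = 5.0, 0.1

# Summed combination (eqns (7) and (8)) for y = x1 + x2:
y_sum = x1 + x2
s_sum = math.sqrt(s1**2 + s2**2)           # variances add for a linear combination

# Multiplicative combination (eqns (9)-(11)) for y = x1 / x2:
y_div = x1 / x2
cv1, cv2 = 100 * s1 / x1, 100 * s2 / x2    # coefficients of variation, eqn (6)
cv_y = math.sqrt(cv1**2 + cv2**2)          # eqn (10)
s_div = cv_y * y_div / 100                 # eqn (11)

print(f"sum: {y_sum} ± {s_sum:.2f}")
print(f"quotient: {y_div} ± {s_div:.2f}")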

3 SIGNIFICANCE TESTS

As stated above, the characteristics of the normal distribution are well


known from theoretical considerations and countless experimental tests. If
the variance of a parent normal population is known then the probability
of any sampled data element occurring within the population can be
calculated from the cumulative probability distribution function. Similar-
ly, any sample datum can be considered as not belonging to the parent
population if its measured value is beyond some selected and specified
distance from the population mean. Such assumptions and resultant
calculations provide experimentalists with the powerful, yet simple to use,
tools referred to as significance tests. A simple example will illustrate the
value and basic stages involved in such an analysis.

Example. If an identified river catchment pool is extensively (infinite-


ly) sampled and analysed for, say, aluminium over a long period of time
then an accurate and established distribution curve for the aluminium
content of the water will be available. If at some subsequent stage an
unspecified water sample is analysed for aluminium, a test of significance
can indicate whether the unknown sample was derived from a source
different from that providing our parent population of standard samples.
To answer such a question, the problem is usually expressed as, 'Do the
two samples, the unknown and the standard set, have the same mean
values for their aluminium concentration?' In statistics this statement is

referred to as the null hypothesis and it is usually written in the form

H0: x̄ = μs

where x̄ is the mean of replicate analyses of the unknown sample and μs
is the mean of the standard, known source population. The alternative
hypothesis is that the mean values of the two sets of data are significantly
different, i.e.

H1: x̄ ≠ μs
If the mean value of the unknown sample set is identified as belonging
to that area of the population curve corresponding to a region of signifi-
cantly low probability then a safe conclusion is that the unknown sample
did not come from our identified source and the null hypothesis is rejected.
If, in our example, the mean and standard deviation of the parent popula-
tion of aluminium concentration values are known to be 5·2 mg kg-1 and
0·53 mg kg-1 respectively, and the mean concentration of five replicate
analyses of the unknown sample is 7·8 mg kg-1, then the test statistic is
given by

z = \frac{\bar{x} - \mu_s}{\sigma/\sqrt{n}}     (12)
The level of significance, the risk of drawing a wrong conclusion that we are
willing to accept, must be chosen and 5% (0·05 probability) is common. This
implies that we are willing to risk rejecting the null hypothesis when it is
in fact correct 1 time in 20. For our example,
z = (7·8 - 5·2)/(0·53/√5) = 2·19
We are not interested in whether the mean of the unknown sample is
significantly less than or more than that of the parent population; both will
lead to rejection of the null hypothesis. From the standardised normal
distribution curve, therefore, we must determine the critical region con-
taining 5% of the area of the curve, i.e. the extreme 2·5% either side of the
mean. The critical region is illustrated in Fig. 3, and from tables of
cumulative probabilities for the standardised normal distribution the critical
test value is ±1·96. The calculated test value, z = 2·19, is greater than
1·96 and, therefore, the null hypothesis is rejected and we assume the
sample to come from a different source than our parent, standard popula-
tion.
Note that we could not prove that the samples were from the same
source but if the value of the test statistic was less than the critical value

Fig. 3. The standardised normal distribution curve and the critical regions
containing the extreme 2·5% and 5% of the curve area. Cumulative probabili-
ties, from statistical tables, are also shown.

of 1·96, then statistics would indicate that there was no reason to assume
any difference.
In this example a so-called two-tailed test was performed: we are not
interested in whether the new sample was significantly more or less con-
centrated than the standards, just different. To indicate that the unknown
sample was significantly more concentrated, then a one-tailed test would
have been appropriate to reject the null hypothesis and, at the 5% signifi-
cance level, the critical test value would have been 1·64 (see Fig. 3).
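The mechanics of such a test are easily coded. The following Python sketch carries out a two-tailed z-test of the kind described above; the parent mean and standard deviation follow the aluminium example, but the sample mean used is an arbitrary illustrative value rather than the figure from the worked example.

# Two-tailed z-test against a known parent population.
import math
from statistics import NormalDist

mu_s, sigma = 5.2, 0.53        # parent population mean and sd (mg kg-1)
xbar, n = 5.7, 5               # hypothetical mean of n replicate analyses

z = (xbar - mu_s) / (sigma / math.sqrt(n))     # eqn (12)
z_crit = NormalDist().inv_cdf(0.975)           # two-tailed 5% critical value, 1.96

if abs(z) > z_crit:
    print(f"z = {z:.2f} > {z_crit:.2f}: reject the null hypothesis")
else:
    print(f"z = {z:.2f} <= {z_crit:.2f}: no evidence of a difference")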
3.1 t-Test
In the above example it was assumed that all possible samples of the
known water source had been analysed-a clearly impossible situation. If
the true standard deviation of the parent population distribution is not
known, as is usually the case, then the test statistic calculation proceeds in
a similar fashion but depends on the number of samples analysed. If a
large number of samples are analysed (n > 25) then the sample deviation,
s, is considered a good estimate of σ and then the test statistic, now
denoted as t, is given by

t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}     (13)
When n is less than 25, s may be a poor estimate of σ and the test statistic
is compared not with the normal curve but with the t-distribution curve
which is more broad and spread out. The t-distribution is symmetric and
similar to the normal distribution but its wider spread is dependent on the

TABLE 2
The concentration of copper determined from ten soil samples by AAS

Sample no.                 1    2    3    4    5    6     7    8     9    10

Copper content (mg kg-1)   72   71   78   98   98   116   76   104   84   96

Sum = 893 mg kg-1; mean = 89·3 mg kg-1; s² = 232·46 (mg kg-1)²; s =
15·246 mg kg-1; t = 1·93.

number of samples examined. For an infinite sample size (n = ∞) the


t-distribution and the normal curve are identical. As with the normal
distribution, critical values of the t-distribution are tabulated and avail-
able in standard statistical texts. A value is selected by the user according
to the level of significance required and the number of degrees of freedom
in the experiment. In simple statistical tests, such as discussed here, the
number of degrees of freedom is generally one less than the number of
samples or observations (n - 1).

Example. Copper is routinely determined in soils, following extrac-


tion with EDTA, by atomic absorption spectrometry. At levels greater
than 80 mg kg-1 in soil, copper toxicity in crops may occur. In Table 2 the
results of 10 soil samples from a site analysed for copper are presented. Is
this site likely to suffer from copper toxicity?
In statistical terms we are testing the null hypothesis,

H0: x̄ ≤ μ0

against the alternative hypothesis

H1: x̄ > μ0
The null hypothesis states that our 10 samples are from some parent
population with a mean equal to or less than 80 mg kg-1. The alternative
hypothesis is that the parent population has a mean copper concentration
greater than 80 mg kg-1. The test statistic, t, is given by eqn (13) in which
s approximates the standard deviation of the parent population, x̄ is the
calculated mean of the analysed samples and μ0 is assumed to be the mean
of our parent population, 80 mg kg-1. From Table 2 and eqn (13), t =
(89·3 - 80)/(15·25/√10) = 1·93. From statistical tables, the value of t
must exceed 1·83 for a significance level of 5% and nine degrees of freedom
(n - 1). This is indeed the case: our mean result lies in the critical region
beyond the 5% level and the null hypothesis is rejected. Thus the conclu-

sion is that the soil samples do arise from a site, the copper content of
which is greater than the level of 80 mg kg-1 and copper toxicity may be
expected.
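The copper example can be checked with the short Python sketch below; the critical value of t is still taken from statistical tables.

# One-sample, one-tailed t-test on the copper data of Table 2.
import math
from statistics import mean, stdev

copper = [72, 71, 78, 98, 98, 116, 76, 104, 84, 96]   # mg kg-1
mu0 = 80.0                                            # toxicity threshold

n = len(copper)
xbar, s = mean(copper), stdev(copper)
t = (xbar - mu0) / (s / math.sqrt(n))                 # eqn (13)
print(f"xbar = {xbar:.1f}, s = {s:.2f}, t = {t:.2f}")
# t is about 1.93; with 9 degrees of freedom the one-tailed 5% critical value
# is 1.83 (from tables), so the null hypothesis is rejected.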

3.1.1 Comparing means


A common application of the t-test in analytical science is comparing the
means of two sets of sample data. We may wish to compare two samples
for similarity or to check whether two analytical methods or analysts
provide similar results. In applying the t-test it is assumed that both sets
of observations being compared are normally distributed, that the parent
populations have the same variance and that the measurements are in-
dependent. Our null hypothesis in such cases can then be expressed as

H0: μ1 = μ2

against the alternative

H1: μ1 ≠ μ2

In comparing two sets of data it is evident that the greater the difference
between their mean values, x̄1 and x̄2, the less likely it is that the two samples
or results are the same, i.e. from the same parent population. The test
statistic is therefore obtained by dividing the difference (x̄1 - x̄2) by its
standard error. Since the variance associated with the mean x̄1 is given by

\mathrm{variance}(\bar{x}_1) = \frac{\sigma^2}{n_1}     (14)

and the variance associated with the mean x̄2 is given by

\mathrm{variance}(\bar{x}_2) = \frac{\sigma^2}{n_2}     (15)

and the combined variance of the sum or difference of two independent
variables is equal to the sum of the variances of the two samples (see
Section 2.3.1), then the variance associated with the difference of the two
means is given by

\mathrm{variance}(\bar{x}_1 - \bar{x}_2) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)     (16)

The standard error can be obtained from the square root of this variance.
The test statistic is therefore given by

z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{1/n_1 + 1/n_2}}     (17)

If the variance of the underlying population distribution is not known
then σ is replaced by the sample standard deviation, s, and the t-test
statistic is used,

t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{1/n_1 + 1/n_2}}     (18)

where s is derived from the standard deviations of each sample set by

s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}     (19)

and in the special case of n_1 = n_2, then

s^2 = \frac{s_1^2 + s_2^2}{2}     (20)

If the null hypothesis is true, i.e. the two means are equal, then the
t-statistic can be compared with the t-distribution with (n_1 + n_2 - 2)
degrees of freedom at some selected level of significance.

Example. Two samples of an expanded polyurethane foam are sus-


pected of coming from the same source and to have been prepared using
chlorofluorocarbons (CFCs). Using GC-mass spectrometry, the CFC has
been identified and the quantitative results for 10 determinations on each
sample are shown in Table 3. Is there a significant difference between the
two sample means?
From eqn (20), the combined estimate of the variance is

s^2 = (s_1^2 + s_2^2)/2 = 44·66,  s = 6·68

and the test statistic is, from eqn (18),

t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{1/10 + 1/10}} = 1·27

A two-tailed test is appropriate as we are not concerned whether any
one sample contains more or less CFC, and from statistical tables for a
t-distribution with 18 degrees of freedom (n_1 + n_2 - 2), t_{0·025,18} = 2·101.

TABLE 3
The analysis of two expanded polyurethane foams, A and B, for CFC content
expressed as mg CFC m-3 foam

                         Sample A    Sample B

CFC content (mg m-3)     66          78
                         74          70
                         70          72
                         82          84
                         68          69
                         70          78
                         80          92
                         76          75
                         64          68
                         74          76

Sum (mg m-3)             724         762
Mean (mg m-3)            72·4        76·2
s² ((mg m-3)²)           34·49       54·84
s (mg m-3)               5·87        7·41

Our result (t = 1·27) is less than this critical value, therefore we have
no evidence that the two samples came from different sources and the null
hypothesis is accepted.
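The same comparison is carried out by the Python sketch below, which pools the two variances according to eqn (19) and uses the CFC data of Table 3.

# Two-sample t-test on the CFC data, assuming equal population variances.
import math
from statistics import mean, variance

foam_A = [66, 74, 70, 82, 68, 70, 80, 76, 64, 74]   # mg CFC m-3
foam_B = [78, 70, 72, 84, 69, 78, 92, 75, 68, 76]

n1, n2 = len(foam_A), len(foam_B)
s2 = ((n1 - 1) * variance(foam_A) + (n2 - 1) * variance(foam_B)) / (n1 + n2 - 2)  # eqn (19)
t = (mean(foam_A) - mean(foam_B)) / math.sqrt(s2 * (1 / n1 + 1 / n2))             # eqn (18)

print(f"pooled s^2 = {s2:.2f}, t = {t:.2f}")
# |t| is about 1.27, below the tabulated two-tailed value of 2.101 for 18
# degrees of freedom, so the null hypothesis is not rejected.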

3. 1.2 Paired experiments


If two analytical methods are being compared they may be tested on a
wide range of sample types and analyte concentrations. Use of eqn (18) in
such a case would be inappropriate because the differences in the analyti-
cal results between the sample types might be greater than the observed
differences between the two methods. In paired experiments of this kind,
therefore, the differences between the results of the paired analyses are
tested.

Example. Two students are required to determine iron by colorimetric


analysis in a range of water samples. The results are given in Table 4. Is
there a significant difference between the two students' results?
The null hypothesis is that the differences between the pairs of results
come from a parent population with zero mean. Thus, from eqn (13), the
test statistic to be calculated is

t = \frac{\bar{d}}{s_d/\sqrt{n}}     (21)

TABLE 4
The comparison of two students' results for the determination of iron in water
samples

Sample      Student A    Student B    Difference

1           2·74         2·12          0·62
2           3·52         3·72         -0·20
3           0·82         0·62          0·20
4           3·47         3·07          0·40
5           10·82        9·21          1·61
6           16·92        15·60         1·32

Sum = 3·95 mg kg-1; mean = 0·658 mg kg-1; s² = 0·472 (mg kg-1)²; s =
0·687 mg kg-1; t = 2·35.

and from Table 4, t = d̄/(s_d/√6) = 2·35. Using a two-tailed test at the
10% significance level, this value exceeds the tabulated value of
t_{0·05,5} = 2·01. Therefore, we have evidence that the results from the two
students are significantly different at the 10% level.
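The paired calculation is shown in the Python sketch below, using the values from Table 4.

# Paired t-test on the two students' iron results.
import math
from statistics import mean, stdev

student_A = [2.74, 3.52, 0.82, 3.47, 10.82, 16.92]
student_B = [2.12, 3.72, 0.62, 3.07, 9.21, 15.60]

d = [a - b for a, b in zip(student_A, student_B)]   # paired differences
n = len(d)
t = mean(d) / (stdev(d) / math.sqrt(n))             # eqn (21)
print(f"mean difference = {mean(d):.3f}, t = {t:.2f}")
# t is about 2.35, exceeding the tabulated two-tailed value of 2.01 for
# 5 degrees of freedom at the 10% significance level.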

3.2 F-Test
In applying the t-test an assumption is made that the two sets of sample
data have the same variance. Whether this assumption is valid or not can
be evaluated using a further statistical measure, the F-test. This test, to
determine the equality of variance of samples, is based on the so-called
F-distribution. This is a distribution calculated from the ratios of all
possible sample variances from a normal population. As the sample
variance is poorly defined if only a small number of observations or trials
are made then the F-distribution, like the t-distribution discussed above,
is dependent on the sample size, i.e. the number of degrees of freedom. As
with t-values, F-test values are tabulated in standard texts and in this case
the critical value selected is dependent on two values of degrees of
freedom, one associated with each sample variance to be compared, and
the level of significance to be used.
In practice, we assume our two samples are drawn from the same parent
population; the variance of each sample is calculated and the F-ratio of
variances evaluated. The null hypothesis is

H0: σ1² = σ2²

against

H1: σ1² ≠ σ2²

and we can determine the probability that, by chance, the observed F-ratio
came from two samples taken from a single population distribution.

Example. Returning to the CFC data in Table 3, an F-test can indicate


if the variation in CFC concentration is similar for the two sets of samples.
If we are willing to use a 5% level of significance then we accept a 1 in 20
chance of concluding that the concentrations are different when they are
the same.
The F-ratio is used to compare variances and is calculated from

F = \frac{s_1^2}{s_2^2}     (22)

with the larger variance placed in the numerator. Substituting the data
from Table 3, F = 54·84/34·49 = 1·59. From statistical tables, the critical
value for F with 9 degrees of freedom associated with each data set is
F_{9,9,0·05} = 3·18. Our value is less than this so the null hypothesis is
accepted and we can conclude that the variances of the two sample sets
are the same and the application of the t-test is valid.
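The F-ratio calculation is shown in the short Python sketch below, again using the Table 3 data.

# F-test for equality of the two CFC variances.
from statistics import variance

foam_A = [66, 74, 70, 82, 68, 70, 80, 76, 64, 74]
foam_B = [78, 70, 72, 84, 69, 78, 92, 75, 68, 76]

v_A, v_B = variance(foam_A), variance(foam_B)
F = max(v_A, v_B) / min(v_A, v_B)     # eqn (22), larger variance in the numerator
print(f"s_A^2 = {v_A:.2f}, s_B^2 = {v_B:.2f}, F = {F:.2f}")
# F is about 1.59, below the tabulated value of 3.18 for 9 and 9 degrees of
# freedom at the 5% level, so equal variances can be assumed.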

3.3 Analysis of variance


Many experiments involve comparing analytical results from more than
two samples and the repeated application of the t-test may not be appro-
priate. In these cases the problem is examined by the statistical techniques
called the analysis of variance.
Suppose the phosphate content of a soil is to be determined to assess its
fertility for crop growth. If, say, six samples of soil are taken as being
representative of the plot then we need to determine if the phosphate
concentration is similar in each. Phosphate is determined colorimetrically
using acidified ammonium molybdate and ascorbic acid.
It is important in such analyses that the order of analysis in the laborat-
ory be randomised to reduce any systematic error. This can be achieved
by assigning to each analytical sample a sequential number taken from a
random number table. Randomising the samples in this way mixes up the
various sources of experimental error over all the replicates and the errors
are said to be confounded. The equivalency of the six soil samples for
phosphate is determined using a one-way analysis of variance. The null
hypothesis is

H0: μ1 = μ2 = ... = μ6

and the alternative,

H1: at least one mean is different

The data for this exercise are presented in Table 5.

TABLE 5
The concentration of phosphate, determined colorimetrically in six soil
samples, using five replications on each sample. The results are expressed as
mg kg-1 phosphate in dry soil

Replicate no.   Soil 1   Soil 2   Soil 3   Soil 4   Soil 5   Soil 6

1               38·2     36·7     24·5     40·3     38·9     44·4
2               36·7     28·3     28·3     44·5     49·3     32·2
3               42·3     40·2     16·7     34·6     34·6     32·5
4               32·5     34·6     22·4     36·4     40·2     38·0
5               34·3     38·3     18·5     30·9     36·4     38·1
For the one-way analysis it is necessary to split the total variance into
two sources, the experimental variance observed within each set of repli-
cates and the variance between the soil samples. A common method of
performing an analysis of variance is to complete an ANOVA table. This
is shown in Table 6.
The total variance for all the analyses is given by SS_T,

SS_T = \sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}^{2} - \frac{\left(\sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}\right)^{2}}{m \times n}     (23)

where x_{ij} represents the ith replicate of the jth sample. The variance

TABLE 6
An ANOVA table^a

Source of          Sum of     Degrees of    Mean squares    F-test
variation          squares    freedom       (variance)

Between groups     SS_B       m - 1         s_B^2           s_B^2/s_W^2
Within groups      SS_W       m(n - 1)      s_W^2
Total variation    SS_T       mn - 1

^a Where n is the number of replicates and m is the number of different samples
(soils).

between the samples is given by SS_B,

SS_B = \sum_{j=1}^{m}\left[\frac{\left(\sum_{i=1}^{n} x_{ij}\right)^{2}}{n}\right] - \frac{\left(\sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}\right)^{2}}{m \times n}     (24)

and the within-group variance can be found by difference,

SS_W = SS_T - SS_B     (25)

The second term in eqns (23) and (24) occurs in many ANOVA calcula-
tions and is often referred to as the correction factor, CF:

CF = \frac{\left(\sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}\right)^{2}}{m \times n}     (26)

From the data in Table 5, CF is calculated by summing all the values,
squaring the result and dividing by the number of analyses performed,

CF = 36248

Using eqn (23) we can determine the total variance, which is obtained
from summing every squared value and subtracting CF,

SS_T = \sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}^{2} - 36248 = 1558

and similarly for the between-group variance, from totalling the squares of
the sums for each soil and dividing by the number of replicates,

SS_B = \sum_{j=1}^{m}\left[\frac{\left(\sum_{i=1}^{n} x_{ij}\right)^{2}}{n}\right] - 36248 = 1009

and finally, the within-group sum of squares by difference,

SS_W = SS_T - SS_B = 1558 - 1009 = 549

and we can complete the ANOVA table (Table 7).
At the 1% level of significance the value of the F-ratio, from tables, for
5 and 24 degrees of freedom is F_{0·01,5,24} = 3·90. The calculated value of
8·81 for the soils analysis exceeds this and we can confidently conclude that
there is a significant difference in the means of the replicate analysis; the
soil samples are not similar in their phosphate content.
The analysis of variance is an important topic in the statistical examina-

TABLE 7
The completed ANOVA table

Source of          Sum of     Degrees of    Mean squares    F-test
variation          squares    freedom       (variance)

Between groups     1009       5             201·8           8·81
Within groups      549        24            22·9
Total variation    1558       29

tion of experimental data and there are many specialised texts on the
subject.
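The sums-of-squares arithmetic of eqns (23)-(26) can be laid out compactly in code. The Python sketch below uses the phosphate values as tabulated here; the figures it returns are close to, though not necessarily identical with, the worked values quoted above.

# One-way analysis of variance on the phosphate data of Table 5.
soils = [
    [38.2, 36.7, 42.3, 32.5, 34.3],   # Soil 1
    [36.7, 28.3, 40.2, 34.6, 38.3],   # Soil 2
    [24.5, 28.3, 16.7, 22.4, 18.5],   # Soil 3
    [40.3, 44.5, 34.6, 36.4, 30.9],   # Soil 4
    [38.9, 49.3, 34.6, 40.2, 36.4],   # Soil 5
    [44.4, 32.2, 32.5, 38.0, 38.1],   # Soil 6
]

m, n = len(soils), len(soils[0])
grand_total = sum(sum(col) for col in soils)
CF = grand_total ** 2 / (m * n)                          # eqn (26)
SS_T = sum(x ** 2 for col in soils for x in col) - CF    # eqn (23)
SS_B = sum(sum(col) ** 2 / n for col in soils) - CF      # eqn (24)
SS_W = SS_T - SS_B                                       # eqn (25)
F = (SS_B / (m - 1)) / (SS_W / (m * (n - 1)))            # between/within mean squares

print(f"SS_B = {SS_B:.0f}, SS_W = {SS_W:.0f}, F = {F:.2f}")
# F is far above the tabulated 1% critical value of 3.90 for 5 and 24
# degrees of freedom, so the soils differ in their phosphate content.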

4 ANALYTICAL MEASURES

In recent years instrumental methods of analysis have largely superseded


the traditional 'wet chemical' techniques and even the most modest analy-
tical laboratory will contain some example of modern instrumentation,
e.g. spectrophotometers, HPLC, electrochemical cells, etc. An important
characteristic and figure of merit for any instrumental measuring system
is its detection limit for an analyte. Using the basic statistical techniques
developed in previous sections we can now consider the derivation of the
detection limit for an analytical method. As the limit of detection for any
measurement is dependent on the inherent noise or random error associ-
ated with the measure, the form of noise and the concept of the signal-to-
noise ratio will be considered first.

4.1 Noise and signal-to-noise ratio


The electrical signals produced by instruments used in scientific measure-
ments are carriers of encoded information about some chemical or physi-
cal quantity. Such signals consist of the desirable component related to the
quantity of interest and an undesirable component, which is termed noise
and which can interfere with the accurate measurement of the required
signal. There are numerous sources of noise in all instruments and the
interested reader is recommended to seek further details from one of the
excellent texts on electronic measurements and instrumentation. Briefly,
whatever its source, the noise produced by an instrument will be a com-
bination of three distinct types: white noise, flicker noise and interference
noise. White noise is of nearly equal amplitude at all frequencies and it can
be considered as a mixture of signals of all frequencies with random

amplitudes and phases. It is this random, white noise that will concern us
in these discussions. Flicker noise, or 1/f noise, is characterised by a power
spectrum which is pronounced at low frequencies and is minimised by a.c.
detection and signal processing. Many instruments will also display inter-
ference noise due to pick-up usually from the 50-Hz or 60-Hz power lines.
Most instruments will operate detector systems well away from these
frequencies to minimise such interference. One of the aims of instrument
manufacturers is to produce instruments that can extract an analytical
signal as effectively as possible. However, because noise is a fundamental
characteristic of all instruments, complete freedom from noise can never
be realised in practice. A figure of merit to describe the quality of a
measurement is the signal-to-noise ratio, S/N, which is defined as

S/N = \frac{\text{average signal magnitude}}{\text{rms noise}}     (27)

The rms (root mean square) noise is defined as the square root of the
average squared deviation of the signal, S, from its mean value, S̄, i.e.

\text{rms noise} = \sqrt{\frac{\sum (S - \bar{S})^{2}}{n}}     (28)

or, if the number of measurements is small,

\text{rms noise} = \sqrt{\frac{\sum (S - \bar{S})^{2}}{n - 1}}     (29)

Comparison with eqns (4) and (5) illustrates the equivalence of the rms
value and the standard deviation of the signal, σ_S. The signal-to-noise
ratio, therefore, can be defined as S̄/σ_S.
The S/N can be measured easily in one of two ways. One method is to
repeatedly measure the analytical signal, determine the mean value and
calculate the rms noise using eqn (29). A second method of estimating S/N
is to record the analytical signal on a strip-chart recorder. Assuming the
noise is random, white noise, then it is 99% likely that the deviations in
the signal lie within ±2·5σ_S of the mean value. The rms value can thus be
determined by measuring the peak-to-peak deviation of the signal from
the mean and dividing by 5. With both methods it is important that the
signal be monitored for sufficient time to obtain a reliable estimate of the
standard deviation. Note also that it has been assumed that at low analyte
concentrations the noise associated with the analyte signal is the same as
that when no analyte signal is present, the blank noise σ_B, i.e. at low
concentrations the measurement error is independent of the concentra-

Fig. 4. An amplified trace of an analytical signal recorded at low response (the
signal amplitude is close to the background level). The mean signal response
is denoted by S̄, the standard deviation of the signal by σ_S and the peak-to-peak
noise is given by 5σ_S.

tion. This assumption is continued throughout this section. It is further


assumed that the mean signal from the blank measurement (no analyte)
will be zero, i.e. μ_B = 0.
The ideas associated with S/N are illustrated in Fig. 4.
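In code, the first method amounts to no more than a mean and a standard deviation, as the minimal Python sketch below shows; the detector readings used are invented for illustration.

# Estimating S/N from repeated signal readings.
from statistics import mean, stdev

readings = [10.2, 9.7, 10.5, 9.9, 10.1, 10.4, 9.8, 10.0]   # hypothetical detector output

S_bar = mean(readings)
rms_noise = stdev(readings)          # eqn (29): n - 1 divisor for few measurements
print(f"S/N = {S_bar / rms_noise:.1f}")

# Alternatively, from a strip-chart trace the rms noise can be approximated
# as the peak-to-peak excursion divided by 5.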

4.2 Detection limits


The detection limit for any analytical method is a measure frequently
quoted by instrument manufacturers and in the scientific literature. Unfor-
tunately, it is not always clear as to the particular method by which this
figure has been determined. A knowledge of detection limits and their
calculation is important in evaluating and comparing instrument perfor-
mance as it relates directly to the sensitivity of an analytical method and
the noise associated with the instrument.
The concept of a detection limit for an analysis implies, in the first
instance, that we can make a qualitative decision about the presence or
absence of the analyte in a sample. In arriving at such a decision there are
two errors of decision that could arise. The first, a so-called Type I error,
is made if we conclude that the analyte is present in the sample when in
fact it is absent, and the second, a Type II error, is made if we decide
the analyte to be absent when it is in fact present. To be of practical value,
our definition of the detection limit must minimise both types of decision
error.
Assuming the noise associated with our analysis is random, then the
distribution of the electrical signal will approximate the normal form, as

Fig. 5. (a) The normal distribution with the 5% critical region marked; (b) two
normally distributed signals overlapping, with the mean of one located at the
5% point of the second; the so-called decision limit; (c) two normally distributed
signals overlapping at their 5% points with their means separated by 3·29σ, the
so-called detection limit; (d) two normally distributed signals with equal
variance and their means separated by 10σ; the so-called determination limit.

illustrated in Fig. 1. From tables of the cumulative normal distribution
function and using a one-tailed test as discussed above, 95% of this
random signal will occur below the critical value of μ_B + 1·65σ_B. This case
is illustrated in Fig. 5(a). If we are willing to accept a 5% chance of
committing a Type I error, a reasonable value, then any average signal
detected as being greater than μ_B + 1·65σ_B can be assumed to indicate the
presence of the analyte. This measure has been referred to in the literature
as the decision limit and is defined as the concentration producing the
signal at which we may decide whether or not the result of the analysis
indicates detection, i.e.

    Decision Limit = z_{0·95} σ_B = 1·65σ_B    (30)

given that μ_B = 0 for a well-established noise measurement when σ_B is
known.
If the noise or error estimate is calculated from relatively few measurements
then the t-distribution should be used and the definition is now given by

    Decision Limit = t_{0·95} s_B    (31)

where s_B is an estimate of σ_B as before (see Section 2.2); the value of
t depends on the degrees of freedom, i.e. the number of measurements, and
t_{0·95} approaches z_{0·95} as the number of measurements increases.
This definition of the decision limit addresses our concern with Type I
errors but says nothing about the effects of Type II errors. If an analyte
producing a signal equivalent to the decision limit is repeatedly analysed,
then the distribution of results will appear as illustrated in Fig. 5(b). Whilst
the mean value of this analyte signal, μ_s, is equivalent to the decision limit,
50% of the results are below this critical value and no analyte will be
reported present in 50% of the measurements. This decision limit
therefore must be modified to take account of these Type II errors so as
to obtain a more practical definition of the limit of detection.
If we are willing to accept a 5% chance of committing a Type II error,
the same probability as for a Type I error, then the relationship between
the blank measurements and the sample reading is as indicated in Fig. 5(c).
In such a case
    Detection Limit = 2 z_{0·95} σ_B    (32)

or if s_B is an estimate of σ_B,

    Detection Limit = 2 t_{0·95} s_B    (33)
Under these conditions we have a 5% chance of reporting the presence
of the analyte in a blank solution and a 5% chance of missing the analyte
in a true sample. Before we accept this definition of detection limit, it is
worth considering the precision of measurements made at this level.
Repeated measurements on an analytical sample at the detection limit will
lead to the analyte being reported as below the detection limit 50% of the
time. In addition, from eqn (6) the relative standard deviation, or CV, of
the sample measurement is defined as
    CV = 100 σ_s/μ_s = 100/3·29 = 30·4%
compared with 60% at the decision limit. Thus while quantitative mea-
surements can be made at these low concentrations they do not constitute
the accepted degree of precision for quantitative analysis in which the
relative error should be below 20% or, better still, 10%.
If a minimum relative standard deviation of, say, 10%, is required from
the analysis then a further critical value must be defined. This is sometimes
referred to as the determination limit and for a 10% CV is defined as


    Determination Limit = 10σ_B    (34)
This case is illustrated in Fig. 5(d).
In summary, we now have three critical values to indicate the lower
limits of analysis. It is recommended that analyses with signals less than
the decision limit should be reported as 'not detected' and for all results
above this value, even if below the detection limit, the result with appro-
priate confidence limits should be recorded.
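The three critical values of eqns (30)-(34) can be computed directly from replicate blank readings. The Python sketch below is illustrative only: the blank values are invented, and the one-tailed 95% t value is taken from tables for the appropriate number of degrees of freedom. The limits obtained are in signal units; dividing by the calibration sensitivity (the slope of the working curve in Section 5) converts them to concentration.

    import statistics

    # Hypothetical replicate blank readings (instrument response units)
    blank = [0.010, 0.012, 0.009, 0.011, 0.013, 0.010, 0.012, 0.011]

    s_B = statistics.stdev(blank)      # estimate of the blank noise, sigma_B
    t_095 = 1.895                      # one-tailed 95% t for n - 1 = 7 degrees of freedom

    decision_limit      = t_095 * s_B          # eqn (31)
    detection_limit     = 2 * t_095 * s_B      # eqn (33)
    determination_limit = 10 * s_B             # eqn (34), for a 10% CV

    print(decision_limit, detection_limit, determination_limit)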

5 CALIBRATION

Unlike the traditional volumetric and gravimetric methods of analysis


which are based on stoichiometric reactions and which can provide ab-
solute measures of analyte in a sample, modern instrumental techniques
usually provide results which are relative to some known standard sample.
Thus the instrumental method must be calibrated prior to an analysis
being undertaken. This procedure typically involves the preparation of a
suitable range of accurately known standard samples, against which the
instrumental response is monitored, followed by the recording of the
results for the unknown samples; the concentration of the analyte is then
determined by interpolation against the standard results.

5.1 Linear calibration


In most analytical procedures a linear relationship between the instrument
response and the analyte concentration is sought and a straight-line cali-
bration graph fitted. For example, colorimetric analysis is extensively
employed for the determination of a wide range of cations and anions in
a variety of sample types and, from the Beer-Lambert Law, there is a linear
relationship between absorbance by the sample and the analyte concentra-
tion. In Table 8 the results are presented of the absorbance of a set of
standard aqueous complexed copper solutions, recorded at a wavelength
of 600 nm. These data are plotted in Fig. 6.
Visual inspection of the data suggests that a straight line can be fitted
to the data and the construction of the calibration graph, or working
curve, is undertaken by fitting the best straight line of the form
    Absorbance, A_i = a + b x_i    (35)
where a and b are constants denoting the A intercept and slope of the fitted
TABLE 8
Analytical data of concentration and absorbance obtained from AAS for the
determination of copper in aqueous media using standard solutions

Copper             Absorbance    d1 =       d1²      d2 =       d1 d2
concentration      (600 nm)      (x - x̄)             (A - Ā)
(mg kg⁻¹)

 1·0               0·122         -4·2       17·64    -0·167     0·7014
 3·0               0·198         -2·2        4·84    -0·091     0·2002
 5·0               0·281         -0·2        0·04    -0·008     0·0016
 7·0               0·374          1·8        3·24     0·085     0·1530
10·0               0·470          4·8       23·04     0·181     0·8688

Mean  x̄ = 5·2      Ā = 0·289
Sum                                         48·8                1·925


Fig. 6. The calibration curves produced by plotting the data from Table 8
using the least-squares regression technique. The upper line uses the original
data and the lower line uses the same data following a correction for a blank
analysis.
line respectively and Xi represents the concentration of the standard solu-


tions. In our treatment of this data set we will assume that the dependent
variable, the absorbance readings, is subject to scatter but that the
independent variable, the concentrations of the standard solutions, is not.
A common method of estimating the best straight line through data is
by the method of least squares. Using the fitted line, any concentration Xi
of copper will be expected to produce an absorbance value of (a + bXi)
and this will deviate from the true measured absorbance, Ai, for this
solution by some error, ei ,
    e_i = A_i - (a + b x_i)    (36)
The least squares estimates of a and b are calculated so as to minimise
the sum of the squares of these errors. If this sum of squares is denoted by
S, then

    S = \sum_{i=1}^{n} [A_i - (a + b x_i)]^2 = \sum_{i=1}^{n} (A_i - a - b x_i)^2    (37)

S is a function of the two unknown parameters a and b and its minimum
value can be evaluated by simple differentiation of eqn (37) with respect
to a and b, setting both partial derivatives to zero (the condition for a
minimum) and solving the two simultaneous equations:

    \partial S/\partial a = \sum 2(-1)(A_i - a - b x_i) = 0
    \partial S/\partial b = \sum 2(-x_i)(A_i - a - b x_i) = 0

which upon rearrangement give

    \sum A_i = n a + b \sum x_i
    \sum x_i A_i = a \sum x_i + b \sum x_i^2

and solving for a and b

    b = \sum (x_i - \bar{x})(A_i - \bar{A}) / \sum (x_i - \bar{x})^2    (38)

    a = \bar{A} - b \bar{x}    (39)
The calculated values are presented in Table 8 and for these data the
least squares estimates for a and b are
b = 1·925/48·8 = 0·0395
a = 0·289 - 5·2b = 0·0839
and the best straight line is given by
A = 0·0839 + 0·0395x (40)
which is illustrated in Fig. 6. From rearranging eqn (40), or the working


curve, a value of x from an unknown sample can be obtained from its
measured absorbance value.
In practice, the absorbance values recorded, as presented in Table 8, will
be the mean results of several readings. In such cases, the fitted line
connecting these mean values is referred to as the regression line of A on
x. Assuming that the errors associated with measuring absorbance are
normally distributed and that the standard deviation of the errors is
independent of concentration of analyte, then we can calculate confidence
intervals for our estimates of the intercept and slope of the fitted line.
The residual sum of squares is defined by eqn (37) as the difference
between the measured and predicted values from the fitted line and this
allows us to define a residual standard deviation, s_R, as

    s_R = \sqrt{S/(n - 2)} = \sqrt{\sum (A_i - a - b x_i)^2 / (n - 2)}    (41)
and, using a two-tailed test with the t-distribution, the 95% confidence
limits for the slope b are defined by

    b ± t_{0·025, n-2} s_R / \sqrt{\sum (x_i - \bar{x})^2}

and for the intercept,

    a ± t_{0·025, n-2} s_R \sqrt{\sum x_i^2 / (n \sum (x_i - \bar{x})^2)}

and for the fitted line at some value x_0,

    (a + b x_0) ± t_{0·025, n-2} s_R \sqrt{1/n + (x_0 - \bar{x})^2 / \sum (x_i - \bar{x})^2}

For our example, therefore, the characteristics of the fitted line and their
95% confidence limits are (t_{0·025,3} = 3·182),

    slope, b = 0·0395 ± 6·2 × 10⁻⁶
    intercept, a = 0·0839 ± 2·6 × 10⁻⁴
If an unknown solution is subsequently analysed and gives an average
absorbance reading of 0·182 then rearranging eqn (40),
    x = (A - 0·0839)/0·0395 = 2·48 mg kg⁻¹
TABLE 9
Analysis for potassium using the method of standard additions (all volumes in cm³)

                            Solution no.
                            1      2      3      4      5
Sample volume              20     20     20     20     20
Water volume                5      4      3      2      1
Standard K volume           0      1      2      3      4
Emission response (mV)     36     45     57     69     80

and the 95% confidence limits for the fitted value are

    2·48 ± 3·182 s_R \sqrt{(1/5) + (2·48 - 5·2)^2/48·8} = 2·48 ± 1·80 × 10⁻⁴ mg kg⁻¹
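The whole least-squares calibration of eqns (38)-(41), and the interpolation of an unknown from eqn (40), can be sketched in a few lines of Python using the Table 8 data (the code and variable names are purely illustrative):

    # Least-squares calibration line (eqns 38-41) for the Table 8 copper data
    x = [1.0, 3.0, 5.0, 7.0, 10.0]           # concentration, mg/kg
    A = [0.122, 0.198, 0.281, 0.374, 0.470]  # absorbance at 600 nm

    n = len(x)
    x_bar = sum(x) / n
    A_bar = sum(A) / n

    b = sum((xi - x_bar) * (Ai - A_bar) for xi, Ai in zip(x, A)) / \
        sum((xi - x_bar) ** 2 for xi in x)   # slope, eqn (38)
    a = A_bar - b * x_bar                    # intercept, eqn (39)

    # Residual standard deviation, eqn (41)
    s_R = (sum((Ai - a - b * xi) ** 2 for xi, Ai in zip(x, A)) / (n - 2)) ** 0.5

    # Interpolating an unknown with mean absorbance 0.182 (rearranging eqn 40)
    conc = (0.182 - a) / b
    print(f"a = {a:.4f}, b = {b:.4f}, s_R = {s_R:.4f}, unknown = {conc:.2f} mg/kg")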

5.2 Using a blank solution


It is usual in undertaking analyses of the type discussed above to include
with the standard solutions a blank solution, i.e. a sample similar to the
standards but known to contain no analyte. The measured instrument
response from this blank is subsequently subtracted from each of the
measured standard response values. When this procedure is undertaken it
may be more pertinent to use as a mathematical model to fit the best
straight line an equation of the form,
A = b'x (42)
and assume that the intercept with the axis of the dependent variable is
zero, i.e. the line passes through the origin. Proceeding as above, the error
in the line can be described by

    e_i = A_i - b' x_i

and the sum of squared deviations by

    S = \sum (A_i - b' x_i)^2

Following differentiation with respect to b' and setting the derivative
equal to zero to find the minimum error,

    b' = \sum A_i x_i / \sum x_i^2    (43)
If, in our example data from Table 8, the blank value is measured as

Fig. 7. The determination of potassium in an acetic acid soil extract solution,
using the method of standard additions, by flame emission photometry.

0·081 absorbance units and this is subtracted from each response, then the
value of b' is calculated as
b' = 7·333/184 = 0·0399
and the new line is shown in Fig. 6. For our unknown solution, of
corrected absorbance (0·182 - 0·081 = 0·102),
    x = A/b' = 0·102/0·0399 = 2·56 mg kg⁻¹
The results are similar and the use of the blank solution simplifies the
calculations and removes the effect of bias or systematic error due to the
background level from the sample and the instrument.
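A corresponding sketch of the blank-corrected, through-the-origin fit of eqn (43), again with the Table 8 data and the 0·081 blank reading (illustrative only):

    # Through-the-origin calibration (eqn 43) after subtracting a blank reading
    x = [1.0, 3.0, 5.0, 7.0, 10.0]
    A = [0.122, 0.198, 0.281, 0.374, 0.470]
    blank = 0.081

    A_corr = [Ai - blank for Ai in A]
    b_prime = sum(Ai * xi for Ai, xi in zip(A_corr, x)) / sum(xi ** 2 for xi in x)

    # Unknown solution with an uncorrected absorbance of 0.182
    conc = (0.182 - blank) / b_prime
    print(f"b' = {b_prime:.4f}, unknown = {conc:.2f} mg/kg")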

5.3 Standard additions


The construction and use of a calibration graph implies that the standard
solutions employed and the subsequently analysed solutions are of similar
composition, with the exception of the concentration of the analyte. If this
is not the case then it would be unwise to rely on the calibration graph to
provide accurate results. It would obviously be incorrect to use a range of
aqueous standards to calibrate an atomic absorption spectrometer for the
analysis of petroleum products. The problem of badly matched standards
and samples can give rise to a severe systematic error in an analysis. One
technique to overcome the problem is to use the method of standard
additions. By this method the sample to be analysed is split into several
sub-samples and to each is added small volumes of a known standard. A
simple example will serve to illustrate the technique and subsequent cal-
culations.

Example. Flame photometry is a common method of analysis for the


determination of the alkali metals in solution. The technique employs
inexpensive instrumentation and a typical analysis is the measurement of
extractable potassium in soils. A sample of the soil (5 g) is extracted with
200 ml of 0·5 M acetic acid and the resultant solution, containing many
elements other than potassium, can be analysed directly. From previous
studies it is known that the potassium concentration is likely to be about
100 mg kg⁻¹ dry soil. Five 20-ml aliquots of the sample solution are taken
and each is made up to 25 ml using distilled water and a 20 mg kg⁻¹
standard potassium solution in the proportions shown in Table 9, and the
emission intensity is measured using the flame photometer.
The results are plotted in Fig. 7. The intercept of the fitted regression
line on the concentration axis indicates the concentration of potassium in
the original sample to be 0·062 mg. This is in 20 ml of solution, therefore
0·62 mg in the 200-ml extractant from 5 g of soil which indicates a soil
concentration of 124 mg kg⁻¹. Confidence intervals can be determined as
for the general calibration graph discussed above.
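A minimal Python sketch of the standard-additions arithmetic for the Table 9 data; it assumes that 1 cm³ of the 20 mg kg⁻¹ (approximately 20 mg l⁻¹) standard contributes 0·02 mg of potassium, and it reproduces the x-axis intercept read from Fig. 7 to within rounding.

    # Standard additions: fit emission against mg K added, then take the x-intercept
    mg_added = [0.00, 0.02, 0.04, 0.06, 0.08]   # from the standard volumes in Table 9
    emission = [36, 45, 57, 69, 80]             # mV

    n = len(mg_added)
    x_bar = sum(mg_added) / n
    y_bar = sum(emission) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(mg_added, emission)) / \
            sum((x - x_bar) ** 2 for x in mg_added)
    intercept = y_bar - slope * x_bar

    mg_in_aliquot = intercept / slope                   # magnitude of the x-intercept
    soil_conc = mg_in_aliquot * (200 / 20) / 5 * 1000   # 20 ml -> 200 ml extract -> per kg of the 5 g sample
    print(f"K in aliquot = {mg_in_aliquot:.3f} mg, soil K = {soil_conc:.0f} mg/kg")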
The statistical treatment and evaluation of analytical data is of para-
mount importance in interpreting the results. This chapter has attempted
to highlight some of the more important and more pertinent techniques.
The interested reader is advised and encouraged to proceed with further
reading and study on this fascinating subject.

BIBLIOGRAPHY

Bevington, P.R., Data Reduction and Error Analysis for the Physical Sciences.
McGraw-Hill, New York, 1969.
Caulcutt, R. & Boddy, R., Statistics for Analytical Chemists. Chapman and Hall,
London, 1983.
Chatfield, C., Statistics for Technology. Chapman and Hall, London, 1975.
Davis, J.C., Statistics and Data Analysis in Geology. J. Wiley and Sons, New York,
1973.
Malmstadt, H.V., Enke, C.G., Crouch, S.R. & Horlick, G., Electronic Measure-
ments for Scientists. W.A. Benjamin, California, 1974.
Chapter 6

Visual Representation of Data


Including Graphical
Exploratory Data Analysis
JOHN M. THOMPSON*
Department of Biomedical Engineering and Medical Physics, University
of Keele Hospital Centre, Thornburrow Drive, Hartshill, Stoke-on-Trent,
Staffordshire ST4 7QB, UK

1 INTRODUCTION

1.1 Uses and misuses of visual representation


Reliance on simple number summaries, such as correlation coefficients,
without plotting the data used to derive the coefficients, can lead one to
misinterpret their real meaning.1-3 An excellent example of this is shown
in Fig. 1, which shows a series of plots of data sets, all of which have
correlation coefficients, some of which apparently indicate reasonably
strong correlation. However, the plots reveal clearly the dangers of relying
only on number summaries. They also demonstrate the value of graphical
displays in understanding data behaviour and identifying influential ob-
servations.
Figure 2 shows two plots on different scales of the same observations
used in studies of subjective elements in exploratory data analysis. 4 It was
found that the rescaling of plots had most effect on subjective assessment

*Present address: Department of Medical Physics and Biomedical Engineering,


Queen Elizabeth Hospital, Birmingham B15 2TH, UK.

[Figure 1 comprises four scatter plots, (a)-(d), with correlation coefficients r = 0·855, 0·666, 0·980 and 0·672 respectively; both axes run from 0 to 10.]

Fig. 1. Scatter plots of bivariate data, demonstrating that a favourable correla-


tion coefficient is not sufficient evidence that variables are correlated.


Fig. 2. Effect of scale of plot on perception of plot; both plots have correlation
coefficients of 0·8. (Reprinted with permission from W.S. Cleveland, P. Dia-
conis & R. McGill Science 216 1138-1141, copyright 1982 by the American
Association for the Advancement of Science.)
of data sets with correlation coefficients between 0·3 and 0·8. The per-
ceived degree of association was found to shift by 10-15% as a result of
rescaling.
The purpose of visual representation of data is to provide the scientist/
technologist as a data analyst with insights into data behaviour not readily
obtained by nonvisual methods. However, one must be vigilant about the
psychological problems of experimenter bias resulting from anchoring of
one's interpretation onto preconceived notions as to the outcome of an
investigation.5,6 The importance of designing the visual presentation of
data so as to avoid bias in perception of either the presenter or the receiver
of such displays has been emphasized by several groups.7-10 These percep-
tual problems are active areas of collaborative research between statisti-
cians and psychologists. Awareness of the importance of subjective, per-
ceptual elements in the design of effective visual representations and a
willingness to take account of these in data analysis, and in the com-
munication of the results of such analysis, should now be considered as an
essential part of statistical good practice.

1.2 Exploratory data analysis


In the early development of statistical analysis, data exploration was
necessarily an important feature but, as sophisticated parametric statisti-
cal tools were developed for confirmatory analysis and came into general
use, the exploratory approach was eclipsed and rather neglected. Pioneer-
ing work in recent decades by Tukey11,12 and others (see Bibliography) has
resulted in new and powerful tools for visual representation of data and
exploratory/graphical data analysis. Many of these tools have an elegant
simplicity which makes them readily approachable by any person of
reasonable arithmetic competence.
The four main themes of exploratory data analysis have been described
by Hoaglin et al. 13 as resistance, residuals, re-expression and revelation.
These themes are of fundamental importance to the visual representation
of data. Resistance is the provision of insensitivity to localized misbeha-
viour of data. Data from real world situations rarely match the idealized
models of parametric statistics, so resistance to quirkiness is a valuable
quality in any tool for data analysis. Residuals are what remain after we
have performed some data analysis and have tried to fit the data to a
model. In exploratory data analysis, careful analysis of the residuals is an
essential part of the strategy. The use of resistant data analysis helps to
ensure that the residuals contain both the chance variations and the more
unusual departures from the main pattern. Unusual observations may
distort nonresistant analyses, so that the true extent of their influence may
be grossly distorted, and therefore not apparent when examining the
residuals. Some brief glimpses at various aspects of exploratory data
analysis (EDA) are outlined below and aspects relevant to the theme of
this chapter are discussed in more detail in the appropriate sections. The
reader is encouraged to delve further into this subject by studying texts
listed in the Bibliography and References.

1.2.1 Tabular displays


Tabular displays can either be simple, showing various key number sum-
maries, or complex but designed in such a way as to highlight various
patterns. Several tabular display tools have evolved from EDA and are
described in Sections 4.3, 4.4, 5.2.1, 6.5.1 and 6.5.2.

1.2.2 Graphical displays


Many different kinds of graphical display have been developed enabling
us to look at different features of the data. The most important point to
bear in mind is that one should use several tools to examine various
aspects of the data, so as to avoid missing key features. Examples will be
given in later sections which will illustrate this particular issue.

1.2.3 Diagnosing types of data behaviour


Many statistical software packages are now available for use on personal
computers. A black box approach to their use in data analysis has its
drawbacks. A salutary example of this was given by Ames & Szonyi,14 who
cited a study in which they wished to evaluate the influence of an additive
in improving the quality of a manufactured product. When applying the
standard parametric tests, the additive apparently did not improve
product quality, which did not make much sense physically. Data explora-
tion was then undertaken to evaluate the underlying distributions of the
data from product with and without additive. Both were found to be
non-Gaussian and subsequent application of more appropriate non-
parametric tests demonstrated clear differences.

1.2.4 Identifying outliers for further investigation


An important role for the visual representation of data is in the identifica-
tion of unusual (outlier) data for further investigation, not for rejection at
this stage. Fisher15 advised that 'a point is never to be excluded on
statistical grounds alone'. One should then establish whether the outlier
arises from transcription or transmission errors, faulty measurement,
calibration, etc. If the outlier still appears genuine, then it may well
become the starting point for new experiments or surveys.

2 TYPES OF DATA

Environmental data can be classified in a variety of ways which have an


influence on the choice of visual representations that may be useful, either
in analysis, or for communication of ideas or information to others.
Continuous data arise from observations that are made on a measure-
ment scale which from an experimental viewpoint is infinitely divisible.
Measurements involving counting of animals, plants, radioactive particles
or photons give us discrete data. Proportional data are in the form of
ratios, such as percentages.
Spatial and time-dependent (temporal) data present special display
problems, but are obviously of major concern to environmental scientists
and technologists, and various techniques will be discussed in appropriate
sections.
Data may be in the form of a single variable or many variables and the
discussion on visual representation in this chapter starts with single varia-
bles, proceeding then to two variable displays, following on with mul-
tivariate displays and maps. The chapter ends with a brief look at some of
the software available to assist in visual representation.

3 TYPES OF DISPLAY

Although not conventionally considered as visual representation, many


modern tabular display techniques provide us with useful tools for
showing patterns/relationships amongst data in a quite striking visual
way.
A wide range of ways of displaying data using shade, pattern, texture
and colour on maps have been developed, which can be used singly, or in
combination, to illustrate the geographical distribution of environmental
variables and their interactions.
Graphs are the classical way of displaying data and modern computer
graphics have extended the range of possibilities, especially in the projec-
tion of three or more dimensional data onto a two dimensional display.
However, as we will see later in this chapter, even very simple graphical
techniques can provide the user with powerful ways of understanding and

Fig. 3. Histogram of air pollution data showing the distribution of time-


weighted average occupational exposures of histopathology laboratory
technicians to xylene vapour (mg m- 3 ) (data of the author and R. Sitham-
paranadarajah; generated using Stata).

analysing data. The development of dynamic displays and animation in


computer graphics has enhanced the presentation of time dependent
phenomena.
Hybrid displays combining different display tools, e.g. the use of graphs
and statistical symbols on maps, are an interesting development. We are all
familiar with the use of weather symbols on maps, but the use of symbolic
tools in the visual representation of multivariate data has developed
considerably and the wide range of such tools is discussed in Section 6.3.

4 STUDYING SINGLE VARIABLES AND THEIR


DISTRIBUTIONS

4.1 Histograms and frequency plots


The histogram is an often used tool for displaying distributional informa-
tion and a typical histogram of a time-weighted average air pollutant
concentration (occupational exposures of histopathology technicians to
xylene vapour) in a specific laboratory is shown in Fig. 3. Each bar of the
histogram shows the frequency of occurrence of a given concentration
range of xylene vapour exposures. Another alternative is to plot the

Fig. 4. (a) Cluttered one-dimensional scatter plot using vertical bars to repre-
sent individual observations of xylene vapour exposure of histopathology
technicians. It is not obvious from this plot that several observations overlap
(generated using Stata). (b) Alternative means of reducing clutter using jitter-
ing of the same data used for Figs. 3 and 4(a) but this method of plotting
reveals that there are several observations with the same value (generated
using Stata).

histogram of the cumulative frequency. Here the histogram may be sim-


plified by merely plotting the tops of the bars. This enables us to compare
two histograms on the same plot (see also Section 5.2.2). Thus the cumula-
tive frequency distribution of the observations can be compared with a
model distribution of the same median and spread, in order to judge their
similarities. The largest gap between the cumulative plots is the test
statistic for the Kolmogorov-Smirnov one sample test. 16
If the frequencies are plotted as points which are then joined by lines,
we have a frequency plot and, as with the histogram, cumulative frequency
plots are also useful.

4.2 Scatter plots


Despite their simplicity, scatter plots are very useful for highlighting
patterns in data and even for comparisons or revealing the presence of
outliers.
Plotting a single variable can either be done horizontally or vertically in
one dimension only, as in Fig. 4(a), in which each observation is represent-
ed as a vertical bar. It is not obvious from this particular plot that
observations overlap. In order to show this, one may either stack points,
or use the technique of jittering, as in Fig. 4(b). The jittering is achieved
by plotting on the vertical axis u_i (i = 1 to n) versus the variable to be
jitter plotted, x_i, where u_i is the series of integers 1 to n, in random order.17
The range of the vertical axis is kept small relative to the horizontal axis.
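A small Python sketch of this jittering scheme (the data and the print-out are purely illustrative; any plotting package could be used to display the resulting pairs):

    import random

    # One-dimensional data to be jitter-plotted (hypothetical exposure values)
    x = [71, 105, 118, 118, 124, 124, 124, 144, 144, 153, 201, 305]

    n = len(x)
    u = list(range(1, n + 1))
    random.shuffle(u)            # u_i: the integers 1 to n in random order

    # Plotting u against x on a compressed vertical axis separates coincident
    # observations without altering their horizontal positions
    points = list(zip(x, u))
    print(points)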
4.3 Stem and leaf displays


A very useful alternative to the histogram is the stem and leaf display. 18
Not only can this display convey more information than the conventional
histogram, showing how each bar is built up, but it can also be used to
determine the median, fourths or hinges (similar to quartiles; see Section
4.4 below), and other summary statistics in a quick and simple way.
Sometimes the display may be too widely spread out and we need to find
a more compact form. On other occasions the display is too compact and
needs to be spread out by splitting the stems. Many software packages do
this automatically. For those without access to such packages, or wishing
to experiment with suitable layouts for the display, various rules have been
proposed based upon different functions of the number of observations in
the data set. 19 These produce different numbers of lines for any given
number of observations. The 1 + log₂ n rule makes no allowance for
outliers, skewed data and multiple clumps separated by gaps, and is
regarded as poor for information transfer. Emerson & Hoaglin19 recom-
mend using the 2n^{1/2} rule below 100 observations and the 10 log₁₀ n rule
above 100 observations. Extreme values may cause the display to be
distorted but this can be avoided by placing these in two categories, 'hi'
and '10', and then sorting the data for the stem and leaf display without
the extremes.
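For illustration, the three line-count rules just mentioned can be compared with a few lines of Python (a sketch only; practical stem and leaf routines also round to convenient stem widths):

    import math

    def lines_log2(n):     # 1 + log2(n) rule
        return math.ceil(1 + math.log2(n))

    def lines_sqrt(n):     # 2 * n**(1/2) rule, suggested below 100 observations
        return math.ceil(2 * math.sqrt(n))

    def lines_log10(n):    # 10 * log10(n) rule, suggested above 100 observations
        return math.ceil(10 * math.log10(n))

    for n in (53, 100, 500):
        print(n, lines_log2(n), lines_sqrt(n), lines_log10(n))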

4.4 Box and whisker plots


Various kinds of box and whisker plots 20 provide useful displays of the key
number summaries: median, fourths or hinges (which are closely related
to the quartiles) and extremes. They enable us to show key features of a
set of data, to gain an overall impression of the shape of the data distribu-
tion and to identify potential outliers.

4.4.1 Simple box and whisker plots


The simple version of this plot is illustrated in Fig. 5. The upper and lower
boundaries of the box are, respectively, the upper and lower fourths (also
known as hinges), which are derived as follows: 21
depth of fourth = [(depth of median) + 1]/2
The depth of a data value is defined as the smaller of its upward and
downward ranks. Upward ranks are derived from ordering data from the
smallest value upwards; downward ranks start from the largest observa-
tion as rank 1. The numerical values of the observations at the fourths
determine the upper and lower box boundaries. In-between these is a line
Fig. 5. Box and whisker plot. Fig. 6. Notched box and whisker plot.

representing the position of the median. Beyond these boundaries stretch


the whiskers. The whiskers extend out as far as the most remote points
within outlier cutoffs defined as follows:22
    upper cutoff  = upper fourth + 1·5 (fourth spread)
    lower cutoff  = lower fourth - 1·5 (fourth spread)
    fourth spread = upper fourth - lower fourth
Beyond the outlier cutoffs, data may be considered as outliers and each
such outlying observation is plotted as a distinct and separate point. The
position of the median line, relative to the fourths, indicates the presence
or absence of skewness in the data and the direction of any skewness. The
lengths of the whiskers give us an indication of whether a set of data is
heavy or light tailed. If the data is distributed in a Gaussian fashion, then
only 0·7% of the data lie outside the outlier cutoffs. 23
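A Python sketch of the fourths-and-fences arithmetic described above, using the depth rule of Section 4.4.1 (packages differ in their exact quartile definitions, as discussed in Section 4.4.2; the small data set here is invented for illustration):

    def fourths(data):
        # Median and fourths (hinges) via the depth rule of Section 4.4.1
        x = sorted(data)
        n = len(x)

        def value_at(depth):
            # depth counted in from either end; may end in .5
            lo = x[int(depth) - 1]
            hi = x[int(depth + 0.5) - 1]
            return (lo + hi) / 2

        depth_median = (n + 1) / 2
        depth_fourth = (int(depth_median) + 1) / 2
        return (value_at(depth_median),
                value_at(depth_fourth),
                value_at(n + 1 - depth_fourth))

    data = [71, 102, 105, 118, 124, 125, 144, 153, 170, 201, 224, 264, 305]
    median, lower_f, upper_f = fourths(data)
    spread = upper_f - lower_f
    cutoffs = (lower_f - 1.5 * spread, upper_f + 1.5 * spread)   # outlier cutoffs (inner fences)
    outliers = [v for v in data if v < cutoffs[0] or v > cutoffs[1]]
    print(median, lower_f, upper_f, spread, cutoffs, outliers)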

4.4.2 Extended versions


The outlier cutoffs are also termed inner fences by Tukey24 and observa-
tions beyond the inner fences are then termed outside values. Tukey also
defines outer fences as:


upper outer fence = upper fourth + 3 (fourth spread)
lower outer fence = lower fourth - 3 (fourth spread)
Observations beyond the outer fences are then termed far out values. Thus
one now has a way of identifying and classifying outliers according to their
extremeness from the middle zone of the distribution.
Recently, Frigge et al. 25 have highlighted the problem that a number of
statistical software packages produce box plots according to different
definitions of quartiles and fences. They have offered recommendations
for a single standard form of the box plot. Of the eight different definitions
for the quartile that they list, they suggest that, for the time being,
definition 6 (currently in Minitab and Systat) should be used as the
standard. They do suggest, however, that definition 7 may eventually
become the standard. These two definitions (of the lower quartile Q₁, in
terms of the ordered observations x₁ ≤ x₂ ≤ x₃ ≤ ... ≤ x_n) are listed
below:
- definition 6: standard fourths or hinges
    Q₁ = (1 - g)x_j + g x_{j+1}
  where [(n + 3)/2]/2 = j + g and g = 0 or g = 1/2; n is the number
  of observations, j is an integer;
- definition 7: ideal or machine fourths
    Q₁ = (1 - g)x_j + g x_{j+1}
  where n/4 + 1/2 = j + g.
The multiplier of the fourth spread used in calculating the fences, as
described above and in Section 4.4.1, also varies. Frigge et al.25 suggest
that the use of a multiplier of 1·0 now seems too small, on the basis of
accumulated experience, and 1·5 would seem to them to be a more satisfac-
tory value to use for estimating the inner fences for exploratory purposes.
The use of 3·0 as the multiplier for the outer fences (already discussed
earlier in this section) is regarded as a useful option.

4.4.3 Notched box and whisker plots


The box plot may be modified to convey information on the confidence in
the estimated median by placing a notch in the box at the position of the
median. 26 The size of the notch indicates the confidence interval around
the median. A commonly used confidence interval is 95%. The upper and
lower confidence bounds are calculated by a distribution-free procedure
TABLE 1
Relationship between letter values, their tags, tail areas and Gaussian letter
spreads

Tag    Tail area    Upper Gaussian letter value    Letter spread

M      1/2
F      1/4          0·6745                         1·349
E      1/8          1·1503                         2·301
D      1/16         1·5341                         3·068
C      1/32         1·8627                         3·725
B      1/64         2·1539                         4·308
A      1/128        2·4176                         4·835
Z      1/256        2·6601                         5·320
Y      1/512        2·8856                         5·771
X      1/1024       3·0973                         6·195
W      1/2048       3·2972                         6·594

based on the Sign Test26 in Minitab, whereas Velleman & Hoaglin22


describe an alternative, in which the notches are placed at:
    median ± 1·58 (fourth spread)/n^{1/2}
Figure 6 illustrates the notched box plot. The applications of this
display method in multivariate analysis are discussed in Section 6.1.2.
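The notch positions quoted above from Velleman & Hoaglin reduce to a one-line calculation, sketched here in Python (the median and fourths would come from a routine such as the one in Section 4.4.1; the numerical values are those of the xylene summaries shown in Figs 7 and 8):

    import math

    def notch_bounds(median, lower_fourth, upper_fourth, n):
        # Notches at median +/- 1.58 * (fourth spread) / sqrt(n)
        half_width = 1.58 * (upper_fourth - lower_fourth) / math.sqrt(n)
        return median - half_width, median + half_width

    print(notch_bounds(153.0, 125.0, 201.0, 53))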

4.5 Letter value displays


So far we have dealt with two of the number summaries known as letter
values: the median and the fourths or hinges. In Section 4.4.1 the calcula-
tion of the depth of the fourth from the depth of the median was shown.
This may be generalized to the following: 27
depth of letter value = [(previous depth) + 1]/2
so that the next letter values in this sequence are the eighths, the sixteenths,
the thirty-seconds, etc. For convenience these number summaries are
given letters as tags or labels; hence they have become known as letter
values. Table I shows the relationship between letter values, their associ-
ated tags and the fraction of the data distribution remaining outside boun-
daries defined by the letter values (the tail areas). Tables of letter values
can range from simple 5 or 7 number summaries as in Fig. 7, to much more
comprehensive letter value displays, as in Fig. 8. The latter shows other
information obtainable from the letter values, including mid summaries
(abbreviated to mid), spreads and pseudosigmas. These are discussed in
(a) 5-number summary

    # 53
    Tag    Depth    Lower     Mid      Upper
    M      27                 153·0
    F      14       125·0              201·0
            1        71·0              305·0

(b) 7-number summary

    # 53
    Tag    Depth    Lower     Mid      Upper
    M      27                 153·0
    F      14       125·0              201·0
    E       7·5     116·5              224·0
            1        71·0              305·0

Fig. 7. (a) Five and (b) seven number summaries: simple letter value displays
of the xylene vapour exposure data.

218 174 125 170 145 180 124 135 115 148 264
305 107 144 202 239 106 171 201 137 216 224
153 102 173 125 119 154 186 118 204 141 226
105 150 194 129 118 233 224 227 170 128 173
71 185 144 209 155 124 108 124 144

          Depth     Lower      Upper      Mid        Spread

    N     53
    M     27·0     153·000    153·000    153·000
    H     14·0     125·000    201·000    163·000     76·000
    E      7·5     116·500    224·000    170·250    107·500
    D      4·0     106·000    233·000    169·500    127·000
    C      2·5     103·500    251·500    177·500    148·000
    B      1·5      86·500    284·500    185·500    198·000
           1        71·000    305·000    188·000    234·000

Fig. 8. Comprehensive letter value display of xylene vapour exposure data


generated using Minitab, showing the depth of each letter value in from each
end of the ranked data, the lower and upper letter values, the spread between
them and the mean of the lower and upper letter values known as the 'mid'.
Tabulated above the display is the original data set.
detail in Refs. 24, 27 and 28, and will be briefly discussed here. The
mid summary is the average of a pair of letter values, so we can have the
midfourth, the mideighth, etc., all the way through to the midextreme or
midrange. The median is already a midsummary. The spread is the differ-
ence between the upper and lower letter values. The pseudosigma is
calculated by dividing the letter spread by the Gaussian letter spread.
Letter values and the derived parameters can be used to provide us with
a powerful range of graphical methods for exploring the shapes of data
distributions, especially for the analysis of skewness and elongation. 28
These techniques will be discussed in more detail in Section 4.7.
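The letter-value calculations themselves (depths, letter values, mids, spreads and pseudosigmas) can be sketched compactly in Python; the Gaussian letter spreads are obtained from the tail areas of Table 1, and the short data set is invented for illustration.

    import math
    from statistics import NormalDist

    def letter_values(data):
        # Depths, letter values, mids, spreads and pseudosigmas (Sections 4.4-4.5)
        x = sorted(data)
        n = len(x)

        def at_depth(d, from_top=False):
            # value at a (possibly half-integer) depth, counted from either end
            lo = x[-int(d)] if from_top else x[int(d) - 1]
            hi = x[-int(d + 0.5)] if from_top else x[int(d + 0.5) - 1]
            return (lo + hi) / 2

        rows, depth, tail = [], (n + 1) / 2, 0.5
        for tag in "MFEDCBAZYXW":
            label = tag if depth > 1 else "1"
            lower, upper = at_depth(depth), at_depth(depth, from_top=True)
            mid, spread = (lower + upper) / 2, upper - lower
            gauss_spread = 2 * NormalDist().inv_cdf(1 - tail)   # Gaussian letter spread (Table 1)
            pseudo = spread / gauss_spread if depth > 1 and gauss_spread > 0 else float("nan")
            rows.append((label, depth, lower, upper, mid, spread, round(pseudo, 2)))
            if depth <= 1:
                break
            depth, tail = (math.floor(depth) + 1) / 2, tail / 2
        return rows

    for row in letter_values([71, 102, 105, 118, 124, 125, 144, 153, 170, 201, 224, 264, 305]):
        print(row)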

4.6 Graphical assessment of assumptions about data


distributions
It is often a prerequisite to the use of a particular parametric statistical
test, that the assumptions about the distribution that the data is thought
to follow are tested. There are many useful graphical tools with which to
test such assumptions, whether one is dealing with discrete or continuous
data.

4.6.1 Looking at discrete data distributions


Since much environmental/ecological data is of the discrete type, it is
useful to have available some suitable graphical means of calculating
distributional parameters and deciding on which discrete distribution best
describes the data. Amongst the most widely known discrete data distribu-
tions that are identifiable by graphical methods are the following:29

Poisson:
    p_k = e^{-L} L^k / k!;    k = 0, 1, 2, ...;  L > 0

Binomial:
    p_k = \binom{N}{k} p^k (1 - p)^{N-k};    k = 0, 1, 2, ..., N;  0 < p < 1

Negative binomial:
    p_k = \binom{k + m - 1}{m - 1} p^m (1 - p)^k;    k = 0, 1, 2, ...;  0 < p < 1,  m > 0
[Figure 9 sketches the straight lines obtained on Ord's plot, labelled: negative binomial (b > 0); Poisson (b = 0); binomial (b < 0); log series.]

Fig. 9. Ord's procedure for graphical analysis of discrete data. (Reproduced


with permission from S.C.H. du Toit, A.G.W. Steyn and R.H. Stumpf, 'Graphical
Exploratory Data Analysis', p. 38, Fig. 3.1. Copyright 1986 Springer-Verlag, New
York Inc.)

Logarithmic series:
    p_k = -φ^k / [k ln(1 - φ)];    k = 1, 2, ...;  0 < φ < 1

where p_k is the relative number of observations in group k in the sample,
n_k is the number of observations in group k in the sample, L is the mean
of the Poisson distribution, p is the proportion of the population with the
particular attribute, and φ is the logarithmic series parameter.

4.6.1.1 Ord's procedure for large samples. Ord's procedure for large
samples30 involves calculating the following parameter:

    u_k = k p_k / p_{k-1}

for all observed k values and then plotting u_k versus k for all n_{k-1} > 5.
If this plot appears reasonably linear, one of the models listed in Section
4.6.1 can be chosen by comparison with those in Fig. 9, but this procedure
is regarded as suitable only for use with large samples of data.30 More
appropriate procedures for smaller samples are described briefly in the
next three sections; they are discussed in detail in Ref. 29.
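A Python sketch of Ord's statistic applied to a hypothetical frequency table (counts n_k of observations taking the value k); the data are invented:

    # Ord's procedure: u_k = k * p_k / p_(k-1) = k * n_k / n_(k-1)
    counts = {0: 30, 1: 42, 2: 33, 3: 18, 4: 7, 5: 3}   # hypothetical n_k values

    u = {}
    for k in sorted(counts):
        if k == 0:
            continue
        n_prev = counts.get(k - 1, 0)
        if n_prev > 5:                   # only retain points with n_(k-1) > 5
            u[k] = k * counts[k] / n_prev

    # Plotting u[k] against k: a roughly flat line suggests the Poisson,
    # a rising line the negative binomial and a falling line the binomial (Fig. 9)
    print(u)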

4.6.1.2 Poissonness plots. Poissonness plots were originally pro-


posed by Hoaglin31 and were subsequently modified by Hoaglin &
Tukey,29 in order to simplify comparisons for frequency distributions with
different total counts, N. We plot

    log_e(k! n_k / N)  versus  k
where k indexes the classes into which the distribution is divided and
n_k is the number of observations in each class. The slope of this plot is
log_e(L) and the intercept is -L. The slope may be estimated by resistantly
fitting a line to the points and then estimating L from L = e^b. This
procedure works even if the Poisson distribution being fitted is truncated
or is a 'no-zeroes' Poisson distribution. Discrepant points in the plot do
not affect the position of other points, so the procedure is reasonably
resistant. A further improvement, which Hoaglin & Tukey29 introduced,
'levels' the Poissonness plot by plotting

    log_e(k! n_k / N) + [L_0 - k log_e(L_0)]  versus  k

where L_0 is a rough value for the Poisson parameter L, and this new plot
would have a slope of log_e(L) - log_e(L_0) and intercept L_0 - L. If the
original L_0 was a reasonable estimate then this plot will be nearly as flat
as it is possible to achieve. If there is systematic curvature in the Poisson-
ness plot then the distribution is not Poisson.
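A Python sketch of the (unlevelled) Poissonness plot quantities; the counts are hypothetical, and an ordinary least-squares slope stands in here for the resistant line fit recommended in the text:

    import math

    # Hypothetical frequency table: n_k observations of the count value k
    counts = {0: 30, 1: 42, 2: 33, 3: 18, 4: 7, 5: 3}
    N = sum(counts.values())

    # Poissonness plot ordinate: phi(k) = ln(k! * n_k / N)
    phi = {k: math.log(math.factorial(k) * n_k / N) for k, n_k in counts.items() if n_k > 0}

    # Straight-line fit of phi(k) against k (least squares, for brevity only)
    ks = sorted(phi)
    k_bar = sum(ks) / len(ks)
    p_bar = sum(phi[k] for k in ks) / len(ks)
    slope = sum((k - k_bar) * (phi[k] - p_bar) for k in ks) / \
            sum((k - k_bar) ** 2 for k in ks)

    L_hat = math.exp(slope)      # the slope estimates ln(L), so L = e^b
    print(f"estimated Poisson parameter L = {L_hat:.2f}")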

4.6.1.3 Confidence interval plots. Sometimes an isolated point


appears to have deviated from an otherwise linear Poissonness plot and it
would be useful to judge how discrepant that point is. A confidence
interval plot enables us to determine the extent of the discrepancy.
However, in such a plot, the use of confidence intervals, based on the
significance levels for individual categories, will mean that there is a
substantial likelihood of one category appearing discrepant, even when it
is not. To avoid this trap, we can use simultaneous significance levels.
Hoaglin & Tukey29 suggest using both individual and simultaneous confi-
dence intervals on a levelled plot so that the discrepant points are more
readily identifiable.

4.6.1.4 Plots for other discrete data distributions. The same ap-
proach can be used for checking binomialness, negative binomialness or
tendency to conform to the logarithmic or the geometric series. The
negative binomialness of a distribution of many contagious insect popula-
tions is a measure of the dispersion parameter and Southwood32 discusses
the problems of estimating the characteristic parameter of the negative
binomial using the more traditional approach. It would be useful to
re-assess data, previously assessed using the methods discussed by South-
wood,32 with the methods proposed in Hoaglin & Tukey.29 However, space
and time constraints limit the author to alerting the interested reader to
studying Ref. 29 on these very useful plots for estimating discrete distribu-
228 J .M. THOMPSON

tion parameters, and for hunting out discrepancies in a way that is resis-
tant to misclassification.

4.6.2 Continuous data distributions


Whilst discrete data are important in environmental and ecological inves-
tigations, continuous data are obviously a major interest and the main
thrust of research on graphical and illustrative methods remains in this
area.

4.6.2.1 Theoretical quantile-quantile or probability plots. In explor-


ing the data to establish conformity to a particular distribution, one
approach is to sort the data into ascending order; obtain quantiles of the
distribution of interest; and finally to plot the sorted data against the
theoretical quantiles, thus producing a theoretical quantile-quantile plot.
A 'quantile' such as the 0·76 quantile of a set of data is that number in the
ordered set below which lies a fraction 0·76 of the observations and above
which we find a fraction 0·24 of the data. Such plots can be done for both
discrete and continuous data. Chambers et al. 33 give useful guidance on
appropriate parameters to plot as ordinate and abscissa and how to add
variability information into the plot. They also give help on dealing with
censored and grouped data; problems which are common in environmen-
tal studies. Clustering of data or absence of data in specific zones of the
ordered set can produce misleading distortions of such plots, as can the
natural variability of the data. Hoaglin34 has developed a simplified
version of the quantile-quantile plot using letter values.

4.6.2.2 Hanging histobars and suspended rootograms. An interest-


ing way of checking the fit of data to a distribution consists of hanging the
bars of a histogram from a plot of the model distribution; this is the
hanging histobars plot;35 an example of such a comparison with a Gaussian
distribution is shown in Fig. 10. This kind of plot is available in
Statgraphics PC.
An alternative plot which enables one to check the fit to a Gaussian is
the suspended rootogram devised by Velleman & Hoaglin.36 So, instead of
using the actual count of observations in each bin (or bar) of the histo-
gram, we use a function based on the square root of the bin count, in order
to stabilize variance. The comparison is made with a Gaussian curve fitted
using the median (as an indicator of the middle of the data) and the fourths
as characteristics of the spread of the data (to protect the fit against
outliers and unusual bin values). The function used is the double root

Fig. 10. Hanging histobars plot of the xylene vapour exposure data (gener-
ated using Statgraphics PC).

residual, which is calculated as follows:36

    DRR = [2 + 4(observed)]^{1/2} - [1 + 4(fitted)]^{1/2},   if observed > 0
        = 1 - [1 + 4(fitted)]^{1/2},                         if observed = 0
This kind of plot is available in Minitab as ROOTOGRAM.
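The double root residual is easily sketched in Python; the observed and fitted bin counts below are invented, and the sign conventions follow the formula above:

    import math

    def double_root_residual(observed, fitted):
        # Double root residual for one histogram bin (suspended rootogram)
        if observed > 0:
            return math.sqrt(2 + 4 * observed) - math.sqrt(1 + 4 * fitted)
        return 1 - math.sqrt(1 + 4 * fitted)

    # Hypothetical bin counts and the counts expected from a fitted Gaussian
    observed_counts = [2, 5, 11, 9, 4, 0]
    fitted_counts   = [1.8, 6.2, 10.4, 9.7, 5.1, 1.4]

    drr = [double_root_residual(o, f) for o, f in zip(observed_counts, fitted_counts)]
    print([round(r, 2) for r in drr])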

4.7 Diagnostic plots for studying distribution shape


Rather than testing the data to see whether it adheres to a particular model
distribution, an alternative approach is to examine features of the actual
distribution of the observations in an exploratory fashion. In this case, we
are looking at features such as skewness and light or heavy tails. The letter
values, described earlier, provide us with a suitable set of summary data
with which we can explore visually these kinds of feature and from which
we can calculate other parameters for graphical analysis of shape. The five
plots discussed below are described in more detail in Ref. 37.

4.7.1 Upper versus lower letter value plots


If the data distribution is not skewed, plotting the upper letter values versus
the corresponding lower letter values should give a straight line of slope
-1. To check deviations from the line of slope -1, this line should be
explicitly plotted on the graph as well; then we can more clearly see the
direction of the deviations.

4.7.2 Mid versus spread plots


An improvement on the upper versus lower plot, enabling deviations to be
more clearly seen, is obtained by plotting the midsummaries of the letter
values (mids) versus the letter value spreads. If the plotted points curve
steadily upwards to the right, away from the horizontal line through the
median, this indicates right skewness. On the other hand, if the points
curve downwards, we have left skewness.

4.7.3 Mid versus Z2 plots


A problem with the mid versus spread plot is that the more extreme
spreads are adversely affected by outliers and, as a consequence, the plot
will be strung out over an unnecessarily large range. To overcome that, we
can plot the mids versus the square of the corresponding Gaussian quan-
tile (z²). A distribution with no skewness again gives a horizontal line.

4.7.4 Pseudosigma versus Z2 plots


If pseudosigmas are calculated from the corresponding letter values and
the data are normally distributed, the pseudosigmas will not differ greatly
from one another but if the data distribution is elongated, the pseudosig-
mas will increase with increasing letter value. With data distributions less
elongated than Gaussian, the pseudosigmas will decrease. Thus plotting
pseudosigmas against z² will enable us to diagnose elongation of a data
distribution.

4.7.5 Pushback analysis and flattened letter values versus z plots


Sometimes a data distribution will be elongated to differing extents in each
tail. In order to study this, we need to subtract a Gaussian shape from the
distribution, essentially pushing back or flattening the distribution and the
letter values. A quick and resistant way of obtaining a scale factor with
which to do the flattening of the letter values is to find the median s of the
pseudosigmas. Multiplying s by the standard Gaussian quantiles z for the


upper and lower letter values and subtracting that product from the actual
letter values, yields a set of flattened letter values. A plot of these versus
z will reveal the behaviour of each tail separately. If the data are normally
distributed, the plot is of a straight line. If more than one straight line
segment is found, then the distribution may be built up from more than
one Gaussian. Deviations from linearity indicate more complex beha-
viour.
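A sketch of the flattening step in Python, assuming the letter values have already been computed (for instance with a routine such as the one sketched in Section 4.5); the numerical letter values used here are those of the Fig. 8 xylene display, and the Gaussian quantiles are derived from the Table 1 tail areas.

    from statistics import NormalDist, median

    # Pre-computed letter values: (tag, tail area, lower, upper)
    letter_values = [
        ("F", 1/4,  125.0, 201.0),
        ("E", 1/8,  116.5, 224.0),
        ("D", 1/16, 106.0, 233.0),
        ("C", 1/32, 103.5, 251.5),
    ]

    # Pseudosigma for each letter value: spread divided by the Gaussian letter spread
    pseudosigmas = [(upper - lower) / (2 * NormalDist().inv_cdf(1 - tail))
                    for _, tail, lower, upper in letter_values]
    s = median(pseudosigmas)            # resistant scale factor

    # Flattened letter values: subtract s*z from the upper and s*(-z) from the lower
    for (tag, tail, lower, upper) in letter_values:
        z = NormalDist().inv_cdf(1 - tail)
        print(tag, round(lower + s * z, 1), round(upper - s * z, 1))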

4.7.6 Diagnostic plots to quantitate skewness and elongation


Hoaglin 38 describes graphical methods for obtaining numerical summaries
of shape using further functions calculated from the letter values. These
summaries disentangle skewness from elongation. If data are collected
from a skewed distribution, we may attribute elongation to the longer tail
which may have arisen purely as a function of skewness. Hoaglin's g and
h functions enable us to extract from the data that part of any tail
elongation arising from skewness and separately extract that part arising
from pure elongation. After calculating the skewness function, its influ-
ence is removed by adjustment of the letter value half-spreads. Separate
plots for the elongation analysis of the left and right tails are then con-
structed of the natural logarithms of the upper and lower adjusted letter
value half-spreads versus Z2, the slopes of which are elongation factors for
each tail.

5 REPRESENTING BIVARIATE DATA

5.1 Graphics for regression and correlation


As was pointed out in Section 1 it is very important not to rely solely on
number summaries, such as correlation and regression coefficients, when
investigating relationships between variables. There are a wide variety of
graphical tools available to enable one to present bivariate data and
explore relationships. As our understanding of the limitations of some of
the earlier tools has grown, new methods have evolved which provide new
insights. Some of the limitations and new tools are discussed below.

5.1.1 Two dimensional scatter plots


Simply plotting a two dimensional scatter plot can provide useful insight
into the relationship between two variables but it should be only the
starting point for data exploration. One can go further by plotting a
regression line and confidence envelopes on the scatter plot but there are
numerous ways of performing regression and of calculating confidence
envelopes. In environmental studies, much data is collected from observa-
tion in the field or by measurements made on field-collected samples as
opposed to laboratory experiments. When regressing such variables
against one another, ordinary least squares regression is inappropriate
because the x variable is assumed to be error free; clearly this is not the
case in practice. Deming39 and Mandel40 have discussed ways of allowing
for errors in x within the framework of ordinary least squares. If outliers
and/or non-Gaussian behaviour are present, any such extreme data will
have a disproportionate effect on the position of the regression line even
for Deming/Mandel regression. Various exploratory, robust and non-
parametric methods are available for regression which are protective
against such behaviour. 41-43 No one method will give a unique best line but
methods are now appearing which enable the calculation of confidence
envelopes around resistant lines to be made.44,45 Many of these resistant
and robust regression methods are likely to be in reasonable agreement
with one another. Using such methods, one can plot useful regression lines
and confidence envelopes. The scatter plot is also essential as a diagnostic
tool in the examination of regression residuals in guiding us on whether
the data need transformation to improve the regression.

5.1.1.1 Problems with plotting local high data densities. Some-


times, in plotting data from large samples one observes such local high
densities of data points that it is difficult to convey visually the density of
points. Minitab overcomes this by plotting numbers representing the
numbers of points it is trying to plot at particular coordinates. Other
alternative ways of dealing with this problem include the use of symbols
indicating the number of overlapping points (Section 5.1.1.2) and the use
of sharpened plots (Section 5.1.1.3).

5.1. 1.2 Use of sunflower symbols. Where there is substantial overlap


on scatter plots, this may be alleviated by means of symbols indicating the
amount of local high density, which were devised by Cleveland & McGill46
and which they called sunflowers. A dot means one observation, a dot with
two lines means two observations, a dot with three lines means three
observations, etc. Dividing the plot up into square cells and counting the
number of observations per cell, enables us to plot a bivariate histogram
with a sunflower at the centre of each cell. The process of dividing the plot
into cells was termed cellulation by Tukey & Tukey.47

Fig. 11. Comparing cumulative frequency plots.

5.1.1.3 Sharpened scatter plots. Chambers et al. 48 describe ways of


revealing detail in scatter plots by a technique which they term sharpening,
where a series of plots is produced in which each successive plot in the
series has had stripped out of it regions of successively higher relative data
density. The plots give us an impression of contours in the data without
needing to draw them in. They give algorithms for sharpened scatter plots
applicable even when data scales are different on each axis.

5.1.2 Strip box plots


It is sometimes useful to simplify a detailed bivariate plot by dividing it
into a series of vertical stripes and summarising the distribution of y values
in each strip as a box plot.

5.2 Diagnostic plots for bivariate regression analysis

5.2.1 Back-to-back stem and leaf displays


We saw the use of stem and leaf displays in Section 4.3 and we can extend
its use to the comparison of the distributions of two variables by placing
their stem and leaf displays back-to-back allowing the visual appreciation
of differences in the location and spread of each distribution.

5.2.2 Cumulative frequency plots


In a similar way to comparing the distributions of two sets of observations
via the back-to-back stem and leaf display, by plotting each set separately
on a common scale we may plot two separate cumulative frequency
distributions, as in Fig. 11. We use the largest difference between the two
plots as the test statistic for the two sample Kolmogorov-Smirnov test for
comparing two empirical distributions (cf. Section 4.1).

Fig. 12. Patterns of residuals from linear regressions.

5.2.3 Empirical quantile-quantile plots


In this kind of display, we plot the median of the y variable versus the
median of the x variable, the upper y quartile versus the upper x quartile
and so on; each point on the plot corresponding to the same quantile in
each distribution. If we plot the line of identity (y = x) on this plot, we
can readily see how the two distributions compare and in what manner
they may differ.
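
A short sketch of an empirical quantile-quantile plot is given below; the chosen quantile levels and the two synthetic samples are assumptions made purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.5, size=120)
y = rng.lognormal(mean=0.2, sigma=0.6, size=90)

# Corresponding quantiles of each sample, plotted against one another
probs = np.linspace(0.05, 0.95, 19)
qx = np.quantile(x, probs)
qy = np.quantile(y, probs)

plt.plot(qx, qy, 'o')
lim = [min(qx.min(), qy.min()), max(qx.max(), qy.max())]
plt.plot(lim, lim, 'k--', label='line of identity, y = x')
plt.xlabel('quantiles of x')
plt.ylabel('quantiles of y')
plt.legend()
plt.show()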

5.2.4 Diagnosing the need to transform variables: residuals plots


When exploring the relationship between two variables, we usually first
test whether the relationship is linear. It is then helpful to test the reliability
of the hypothesis that there is a linear relationship by various graphical
methods, because these frequently reveal deviations which would not be
revealed by simply relying on the correlation coefficient. Tiley2
demonstrated how easily one can be misled into believing that a
relationship is strongly linear: in a series of model experiments of
supposedly increasing 'precision', the correlation coefficient improved
with each increase in experimental 'precision', yet a residuals plot
revealed the nonlinearity in the data.
The difference between the observed and fitted y values, at any given x
value, is the residual of that corresponding y value. These residual y values
may be plotted against the x values and this is termed a residuals plot. The
shape of this plot is a useful indicator of whether we need to transform the
y values using, for example, the logarithmic transformation. Figure 12

shows several different residuals plots characteristic of both linear and


nonlinear relationships. The unstructured cloud of points shows us that
the relationship is linear; the sloping band indicates the need for addition
of a linear term; the curved band suggests a nonlinear dependence; the
wedge shaped band indicates increasing variability of y with increasing x
values. Such plots should be used both before and after any transforma-
tion of data as a first stage in demonstrating whether the transformation
is satisfactory.
A useful brief introduction to residuals plots is given in Ref. 49 and
a more detailed discussion is given in the excellent introductory mono-
graph by Atkinson50 on graphical methods of diagnostic regression
analysis.
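
As a simple illustration, the sketch below fits a straight line by ordinary least squares and plots the residuals against x; the mildly nonlinear synthetic data are an assumption of the example, chosen so that the residuals form a curved band of the kind described above.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 60)
y = 2.0 * x + 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)   # mildly nonlinear

slope, intercept = np.polyfit(x, y, 1)        # ordinary least squares straight line
residuals = y - (slope * x + intercept)       # observed minus fitted y

plt.axhline(0.0, color='grey', linewidth=0.8)
plt.plot(x, residuals, 'o')
plt.xlabel('x')
plt.ylabel('residual (observed y - fitted y)')
plt.show()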

5.2.5 Identifying outliers and points of high influence or leverage


An important role for visual representations of data is in the identification
and classification of unusual data within a sample. With univariate data
analysis, we have already seen useful ways of identifying and classifying
unusual data features. Those methods can be applied to each variable
separately in a multivariate analysis but with the warning that variables
may influence others and it is vital to hunt for unusual behaviour under
those conditions as well. In order to do so we must understand some basic
concepts relating to unusual behaviour in bivariate regression. We can
also make use of these concepts again in the discussion of exploratory
graphical multivariate analysis.
The first point to emphasize is the necessity of using robust and resistant
regression methods because ordinary least squares regression is too easily
influenced by unusual points. Identification of outliers in such circumstan-
ces is fraught with difficulty because outliers may mask one another's
influence (this is quite definitely not the same as inferring that they are
cancelling out each other's influence). Graphical exploration combined
with subjective assessment helps to protect against mislabelling points
as outliers and against missing other points which may well be outliers,
but unfortunately it is not a foolproof safety net.
Another important concept is that points may actually exert a large
influence without appearing to be outliers at all, even in a bivariate scatter
plot. Such points may be ones with a high 'leverage' on the regression and
a number of diagnostic measures of influence and leverage have been
proposed which can be usefully employed in graphical identification of
unusual points. Many of these diagnostics have been incorporated into
Minitab, Stata and other PC-based software. As Hampel et al. 51 and

Rousseeuw & Leroy52 make clear, the purpose of such plots is identifica-
tion not rejection. If a point is clearly erroneous, as a result of faulty
measurement, recording or transcription, then we have sound reasons for
rejection. Otherwise the identification of outliers should be a starting point
for the search for the reasons for the unusual behaviour and/or refinement
of the hypothesized model.
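
One simple numerical companion to such graphical exploration is the hat-matrix diagonal, whose element h_ii measures the leverage of the ith point on a least squares fit. The sketch below computes it directly from the design matrix for a synthetic data set containing one isolated point; the 2p/n reference line is a common rule of thumb, not a prescription taken from the texts cited above.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = np.append(rng.uniform(0, 10, 30), 25.0)   # one isolated, high-leverage point

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
hat = X @ np.linalg.inv(X.T @ X) @ X.T        # hat (projection) matrix
leverage = np.diag(hat)                       # h_ii depends only on the x values

plt.stem(x, leverage)
plt.axhline(2 * X.shape[1] / x.size, color='grey', linestyle='--',
            label='2p/n rule of thumb')
plt.xlabel('x')
plt.ylabel('leverage h_ii')
plt.legend()
plt.show()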

5.3 Robust and resistant methods


The lack of resistance and robustness of ordinary least squares regression
and related methods to unusual data behaviour has prompted many
statisticians and data analysts to search for better methods of assessment
of data. The subject is now a vast one and graphical methods play a key
role. The bibliography at the end of this chapter provides a suitable list for
interested readers wishing to learn about these areas.

6 STUDYING MULTIVARIATE DATA

6.1 Multiple comparisons and graphical one-way analysis of variance

Often in environmental studies, it is of interest to determine whether the
concentrations of a pollutant found in specimens of an indicator species
at various sites differ significantly from one site to another. This is a
problem of one-way analysis of variance and such problems can be ap-
proached graphically. Such an approach may not only be useful to the
data analyst but may be helpful in communicating the analysis to others.

6.1.1 Multiple one-dimensional scatter plots


In a recent study53 of the exposure of operating theatre staff to volatile
anaesthetic agent contamination of the theatre air, the time-weighted
average exposures for individuals in different occupational and task
groups in a theatre were measured for several operating sessions. Jittered
scatter plots of these exposures are shown in Fig. 13 and they may be used
to explore differences in group exposure to this occupational contaminant.
The qualitative impression of these differences was confirmed by Kruskal-
Wallis one-way analysis of variance by ranks and subsequent multiple
comparisons. Thus the multiple scatter plot served a useful role in support-
ing the formal analysis of variance.
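
The sketch below shows how such a display might be produced in Python; the three occupational groups and their values are invented stand-ins for the halothane data of Fig. 13, and the Kruskal-Wallis test from scipy is included only to echo the formal confirmation mentioned above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kruskal

rng = np.random.default_rng(6)
groups = {
    'anaesthetist': rng.lognormal(3.5, 0.8, 25),
    'surgeon':      rng.lognormal(3.0, 0.7, 25),
    'scrub nurse':  rng.lognormal(2.6, 0.9, 25),
}

# One horizontal strip per group, with a small vertical jitter so that
# coincident points remain visible
for level, (name, values) in enumerate(groups.items()):
    jitter = rng.uniform(-0.15, 0.15, size=values.size)
    plt.plot(values, np.full(values.size, level) + jitter, 'x')
    plt.text(values.max(), level + 0.3, name, ha='right')

res = kruskal(*groups.values())   # formal back-up to the visual comparison
plt.xscale('log')
plt.xlabel('exposure (arbitrary units)')
plt.yticks([])
plt.title(f'Kruskal-Wallis H = {res.statistic:.2f}, p = {res.pvalue:.3f}')
plt.show()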

[Fig. 13: one jittered strip of points per group (anaesth, surgeon, scrbnurs,
othrnurs, auxnurs, thtrtech), each plotted on a common scale from 10.7 to 454.7]

Fig. 13. Multiple jittered scatter plots of occupational exposure to the volatile
anaesthetic agent, halothane, in a poorly ventilated operating theatre, mg m-3
(data of the author and R. Sithamparanadarajah; generated using Stata).

6.1.2 Multiple notched box and whisker plots


The same data may be plotted as box plots. This can have the advantage
of reducing clutter in the plot and focussing the attention on two main
aspects of each data set: the medians and the spreads. This makes the
comparison more useful but it is still on somewhat shaky foundations in
relying on subjective analysis. This may be improved upon by using
notched boxes and looking for overlap or lack of overlap of the notches,
as in Fig. 14. Where notches do not overlap, this is an indication that the
medians may be significantly different. The further apart the notches are,
the more reliable that conclusion is. Thus, we have a simple form of
graphical one-way analysis of variance, including a multiple comparisons
test.
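
A minimal sketch of this graphical one-way analysis of variance, using matplotlib's notched box plots on three synthetic groups, is given below; the precise construction of the notches varies between implementations, so the comparison remains an informal one.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
data = [rng.lognormal(mu, 0.6, 30) for mu in (3.4, 3.0, 2.5)]   # three synthetic groups

# Notched box and whisker plots: non-overlapping notches suggest that the
# corresponding medians differ
plt.boxplot(data, notch=True)
plt.xticks([1, 2, 3], ['group 1', 'group 2', 'group 3'])
plt.ylabel('concentration (arbitrary units)')
plt.show()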


Fig. 14. Graphical one-way analysis of variance, using multiple notched box
and whisker plots, of the same data as in Fig. 13 (generated using Statgraphics
PC).

6.2 Solving the problems of illustrating multivariate data in two dimensions

Multivariate analysis is often helped by the use of graphical and/or sym-
bolic plots, which can sometimes provide quite striking illustrations of the
distributions of variables through the range of specimens examined. We
must be very wary though of becoming fascinated by the symbolism and
missing the purpose of the display. Tufte54 has drawn attention to the
ever-present dangers of poor use of graphics and warns us to be careful not
to fall into the trap of inadvertently and unintentionally 'lying' or convey-
ing the wrong message, by using displays that are poorly designed in terms
of the use of symbols, labelling, graphical 'junk' (such as distracting lines,
etc.), shading and colour. Although this is a problem with univariate and
bivariate data, it is much more acute when dealing with multivariate data.

Much attention is being devoted to the devising of clear displays and there
is much to be gained from the collaboration between statisticians/data
analysts and psychologists in this area. Kosslyn55 has usefully reviewed
some recent monographs on graphical display methods from the percep-
tual psychologist's viewpoint.

6.2.1 One dimensional views


When beginning to look at multivariate data, it is sensible to start to look
at individual variables by the methods already discussed for univariate
samples. Thus one builds up a picture of the shapes of the distributions of
each variable before 'plunging in at the deep end' with the immediate use
of multivariate methods. Adopting the exploratory approach, one is
alerted to some of the potential problems at the start and a strategy can
then be evolved to deal with them in a systematic way.

6.2.2 Two dimensional views for three variables


The next useful stage is the examination of plots for all the pairs of
variables in the set. This can be illustrated with sets of three variables.

6.2.2.1 Draftsman's plots. We can arrange the three possible pairwise


plots of a set of three variables so that adjacent plots share an axis in
common. As an example, we can look at data from a study of occupational
exposure to air contaminants in an operating theatre. The time-weighted
average exposure to halothane is compared to the concentration in the
blood and expired air from anaesthetists in Fig. 15. This approach is
similar to that which a draftsman takes in providing 'front', 'top' and 'side'
views of an object. This analogy led Tukey & Tukey47.56 to call such a
display a draftsman's display (or scatterplot matrix, as it is called in Stata).
Relationships between variables can be explored qualitatively using this
approach and can aid the data analyst in the subsequent multivariate
analysis. If several groups are present in the data, then the use of different
symbols for the different groups helps in the visual exploration of the data.
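
For readers working in Python rather than in the packages named above, the following sketch builds a three-variable draftsman's display with the pandas scatter_matrix helper; the variables are synthetic stand-ins for the exposure, blood and expired-air measurements of Fig. 15.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(8)
exposure = rng.lognormal(3.0, 0.6, 40)
blood = 0.05 * exposure + rng.normal(scale=0.3, size=40)
expired_air = 0.02 * exposure + rng.normal(scale=0.2, size=40)

df = pd.DataFrame({'exposure': exposure, 'blood': blood, 'expired_air': expired_air})
scatter_matrix(df, diagonal='hist', figsize=(6, 6))   # all pairwise scatter plots
plt.show()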

6.2.2.2 Casement plots. An alternative approach is to partition the


data into various subsets according to the values of one of the variables.
This is rather like taking slices through the cube in which the three
variables might be plotted. The cubical (i.e. three dimensional) scatter plot
is shown in Fig. 16 and the slices through the cube in Fig. 17. Tukey &
Tukey56 call this a casement display and the process by which the display
is built up was illustrated schematically by Chambers et al.57 Each of these

Fig. 15. Draftsman's plot of occupational exposures to halothane and the
corresponding expired air and venous blood concentrations from the same
individuals (generated using Statgraphics PC from the data of the author and
R. Sithamparanadarajah).

kinds of display is useful in drawing one's attention to features of the data


that may not be so apparent in the other.
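
The sketch below imitates a casement display by slicing synthetic data into four intervals of one variable and drawing a separate panel for each slice; the variable names and the equal-width slices are assumptions of the example.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
exposure = rng.uniform(0, 80, 120)
blood = 0.5 * exposure + rng.normal(scale=5.0, size=120)
expired = 0.2 * exposure + rng.normal(scale=2.0, size=120)

edges = np.linspace(blood.min(), blood.max(), 5)      # four slices of 'blood'
fig, axes = plt.subplots(1, 4, sharey=True, figsize=(10, 3))
for k, ax in enumerate(axes):
    lo, hi = edges[k], edges[k + 1]
    mask = (blood >= lo) & ((blood < hi) if k < 3 else (blood <= hi))
    ax.plot(expired[mask], exposure[mask], '+')
    ax.set_title(f'blood in [{lo:.0f}, {hi:.0f})', fontsize=8)
    ax.set_xlabel('expired air')
axes[0].set_ylabel('exposure')
plt.tight_layout()
plt.show()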

6.2.3 Generalization to four or more variables


Both the draftsman's and the casement plots can be extended to apply to
four or more variables and then the main difficulty arises in trying to
analyse such complex displays, especially as the number of dimensions
studied increases. An example of a draftsman's display or scatterplot
matrix for a many-variable problem is shown in Fig. 18, which shows the
results of trace-element analysis (for 6 of 13 assayed trace metals) of
various stream sediments taken from streams draining an old mining
district in Sweden.58 Chambers et al.57 discuss the role of these plots in


Fig. 16. Three dimensional scatter plot (same data as Fig. 15; generated using
Statgraphics PC).

multivariate analysis, including the assessment of the effects of various


data transformations in achieving useful data models.
Becker & Cleveland 59 have developed a dynamic, interactive method of
highlighting various features in many-variable draftsman's plots to
perform various tasks including: single-point and cluster linking, con-
ditioning on single and on two variables, subsetting with categorical
variables, and stationarity probing of a time series. More recently, Mead60
has introduced a new technique of graphical exploratory data analysis of
multivariate data which may be all continuous, all binary or a mixture of
these, which he calls the sorted binary plot. He developed the plot origin-
ally to enable pattern analyses to be made of mass spectrometric measure-
ments on process-effluent samples, in order to evaluate suspected fluctua-
tions in the process. The first step is to prepare a table of residuals by
taking the median values for each variable from the data vectors for the

[Fig. 17: casement plot by levels of halothane blood concentration, with panels
(0,10), (10,20), (20,30), (30,40) and (40,50); x axis: halothane in expired breath
concentrations; y axis: halothane blood concentration]

Fig. 17. Casement display of same data as Fig. 15 (generated using
Statgraphics PC).

individual samples. Only the signs of the residuals are used to make the
plot. Mead describes simple sorting routines and illustrates these ideas
with various data sets. 60

6.3 Use of multicode symbols: the alternative approach for multidimensional
data displays

Several distinctive multicode symbols have been developed which enable
one to display the values of the variables in an individual data vector and
simultaneously to display several such symbols together, in order to show
the variations from one individual sample to another. Because we are
adept at recognizing shape changes, this opens up the possibility of graphi-
cal cluster or discriminant analysis, for example.

6.3.1 Profile symbols


The simplest way to represent a many-dimensional observation is by the
use of vertical bars or by the use of lines joining the midpoints of the bar
tops61 (see Fig. 19). The appearance of this plot depends on the ordering


Fig. 18. Scatter-plot matrix or draftsman's display of the concentrations of six


trace metals in sediments from streams draining an old mining district in
Sweden (generated using Statgraphics PC with data from Ref. 58).

Fig. 19. Profile symbol plot.



Fig. 20. Star symbol plots of the concentrations of various trace metals in 20
sediments collected from streams draining an old mining district in Sweden
(generated using Stata with data from Ref. 58).

of the variables, so this must be kept unvarying from one data vector to
another. The plot is termed a 'profile symbol' and many profiles may be
plotted on the same graph for comparison.

6.3.2 Star symbols


Star symbols may be regarded as profile symbols in polar coordinates62
and a typical form is shown in Fig. 20. This form is much easier to use
visually and is available in Statgraphics PC in two slightly different
versions: Star Symbol and Sun Ray plots. Star plots are also available in
SYSTAT and Stata.
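
A rough equivalent can be produced with matplotlib's polar axes, as in the sketch below; the four data vectors and the six trace-metal labels are invented, and each variable is assumed to have been rescaled to the interval [0, 1] beforehand.

import numpy as np
import matplotlib.pyplot as plt

variables = ['Ba', 'Co', 'Cr', 'Cu', 'Ni', 'Pb']
rng = np.random.default_rng(10)
samples = rng.uniform(0, 1, size=(4, len(variables)))   # already scaled to [0, 1]

# Each data vector becomes a closed polygon in polar coordinates, one ray
# per variable, with the same ordering of variables in every symbol
angles = np.linspace(0, 2 * np.pi, len(variables), endpoint=False)
fig, axes = plt.subplots(1, 4, subplot_kw={'projection': 'polar'}, figsize=(10, 3))
for ax, row in zip(axes, samples):
    ax.plot(np.append(angles, angles[0]), np.append(row, row[0]))
    ax.set_xticks(angles)
    ax.set_xticklabels(variables, fontsize=7)
    ax.set_yticks([])
plt.show()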

6.3.3 Kleiner-Hartigan trees


As with profiles, stars depend critically on the ordering of the variables
and changing this may radically change the appearance. To overcome this
problem, Kleiner & Hartigan63 used a tree symbol in which each variable
is represented by a branch, branches are connected to limbs which are
connected to the trunk. Using hierarchical cluster analysis to group

Fig. 21. Kleiner-Hartigan trees.

variables together enables us to put clustered variables together on the same
limb (at the outset, one must be very clear whether one is using agglomera-
tive or divisive clustering). Going higher up the dendrogram (see Section
6.5.4) we can see which groups of variables themselves form clusters; these
can in turn be represented by limbs joining together before joining the
trunk. Thus, closely correlated variables will cluster together on the same
limb and less closely correlated groups may appear as limbs joined togeth-
er. The structure of the tree is determined by the dendrogram from which
it is derived. The dendrogram represents the population average picture of
all the data vectors, whereas the trees represent individual data vectors.
Several benefits ensue from the use of trees:
(1) groups of variables that are closely correlated can reinforce one
another's influence on the lengths of shared limbs;
(2) tree symbols are also sensitive to deviations from joint behaviour of
closely linked variables.
An example of Kleiner-Hartigan trees is given in Fig. 21.

6.3.4 Anderson's glyphs


In 1960 Anderson64 suggested the use of glyphs in which each variable is
represented as a ray drawn outwards from the top of a circle, but unlike
profiles and stars, the rays are not joined together. Anderson provided
some guidelines for the use of glyphs:
(1) seven variables are the most that can be represented;
(2) divide the range of each variable into categories, e.g. quartiles or
deciles, so that all variables are normalized to the same rank scale;
(3) correlated variables should as far as possible be associated with
adjacent rays;
(4) if two or more types of case (e.g. different sex or species) are being
analysed together, the circles of the glyphs could be coloured or
shaded to distinguish them.
An example of the glyph representation is shown in Fig. 22.

6.3.5 Andrews' curves


Fig. 22. Anderson's glyphs.

The use of a Fourier function to generate a curve representing a data


vector was suggested by Andrews in 1972.65 For a p-variable data vector
$\mathbf{x}' = (x_1, x_2, \ldots, x_p)$, the Andrews curve is calculated using the function

$$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots$$

This function is plotted over the range $-\pi < t < \pi$, and it has the
following interesting and useful properties:
(1) it preserves the means, so that at each point t, the function corres-
ponding to the mean vector is the mean of the n functions corres-
ponding to the n observations;
(2) it preserves Euclidean distances, i.e. the Euclidean distance between
two observation vectors is directly proportional to the Euclidean
distance between the two functions which represent those observa-
tions.
An example of an Andrews curve plot is shown in Fig. 23. These plots
are available in SYSTAT under the heading of Fourier plots.

Fig. 23. Andrews' curves.
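
For readers without access to these packages, the function given above is easy to implement directly; the sketch below does so in Python for ten synthetic five-variable observations, which would normally be standardized before plotting.

import numpy as np
import matplotlib.pyplot as plt

def andrews_curve(x, t):
    # f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ...
    terms = [np.full_like(t, x[0] / np.sqrt(2.0))]
    for k, value in enumerate(x[1:], start=1):
        freq = (k + 1) // 2
        terms.append(value * (np.sin(freq * t) if k % 2 == 1 else np.cos(freq * t)))
    return np.sum(terms, axis=0)

rng = np.random.default_rng(11)
data = rng.normal(size=(10, 5))            # 10 observations of 5 variables
t = np.linspace(-np.pi, np.pi, 400)
for row in data:
    plt.plot(t, andrews_curve(row, t), linewidth=0.8)
plt.xlabel('t')
plt.ylabel('f_x(t)')
plt.show()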



Fig. 24. Chernoff faces.

6.3.6 Chernoff faces


Originally introduced by Chernoff66 in 1973, the idea of using faces to
represent data vectors was considered useful because we are all used to
distinguishing between faces with different features. The procedure is
simple and adaptable to suit the needs of different data sets, so that, for
example:
- x1 could be associated with the size of the nose,
- x2 with the left eyebrow,
- x3 with the left eye socket, etc.
Chernoff's implementation was suitable for use with up to 18 variables and
required a plotter for good quality representations. Flury & Riedwyl67-69
have developed the method to produce asymmetrical faces in order to
handle up to 36 variables on a CalComp plotter and Schupbach70,71 has
written Basic and Pascal versions of this asymmetrical faces routine for use
on IBM compatible PCs linked to plotters. The SYSTAT and Solo pack-
ages also implement Chernoff faces routines and Schupbach has written an
SAS macro for asymmetrical faces.
It is necessary to experiment with assignment of features to variables;
the most important variables being assigned to the most prominent fea-
tures. A simple version of the Chernoff face is illustrated in Fig. 24.

6.3.7 Which symbols help us most?


The value of different symbols in analysing multivariate data depends to
some extent upon the intended use of the analysis. If one is primarily

interested in clustering, the Andrews' curves are regarded by du Toit et
al.,72 Tufte73 and Krzanowski54 as useful for small data sets, whereas, for
large data sets, Chernoff faces and Kleiner-Hartigan trees are much more
suitable. Stars and glyphs are also quite valuable, but profile plots are
regarded by du Toit et al.72 as the least useful.

6.4 Spatial data displays


Spatial distributions of variables or sets of variables, with or without
superposition of time variations, present us with additional problems,
especially with the huge amounts of data available from aerial and satellite
remote sensing. The problems associated with the high data density in
geographical information systems will not be tackled in detail in this
chapter.

6.4.1 Contour, shade and colour maps


Contour, shade and colour are the traditional methods of displaying data
on maps but considerable enhancement has been achieved with computer
aided techniques. Although the eye can separate tiny features, achievable
separability will depend upon the visual context, including the size, shape,
orientation, shading, colour, intensity, etc., of the feature and of its
background. A useful review of the problems involved, and of some
monographs offering guidance, was published recently by Kosslyn. 55
An example of a highly sophisticated environmental data mapping
system, which uses colour and shade, is the Geochemical Interactive
Systems Analysis (GISA) service, funded by the UK Department of Trade
and Industry and based at the National Remote Sensing Centre, Farnbo-
rough, Hampshire, England. Amongst the information sources, it uses the
data bank of the British Geological Survey's Geochemical Survey Pro-
gramme (GSP), as well as images from Landsat satellites, to construct
colour maps which essentially enable visual multivariate analysis to be
done for such purposes as geochemical prospecting.

6.4.2 Use of coded symbols on maps


The combination of symbols such as Kleiner-Hartigan trees, Chernoff
faces or Star symbols with maps enables the spatial distribution of mul-
tivariate observations to be displayed. Schmid74 urges caution in the use
of symbols in this way, as such displays may possess the characteristics of
a complex and difficult puzzle, and so frustrate the objective of clear
communication. Ideally, the symbols should be readily interpretable when
used in combination with maps.

Provided that the density of display of such symbols is not too high,
this should be a useful approach to spatial cluster analysis. It
could also be quite an effective way of demonstrating or illustrating, for
example, regional variations of species diversity or water quality in a
drainage basin.
A well-known multivariate coded symbol is the weather vane, which
combines quantitative and directional information in a well-tried way.
Varying the size of a symbol to convey quantitative information is not
a straightforward solution, because our perception of such symbols does
not always follow the designer's intentions. For example, using circles of
different areas to convey quantity is not a reliable approach, because in
our perception of circle size, we seem unable to judge their relative sizes
at all accurately, and our judgement of sphere symbols appears to fare
even worse. 75

6.4.3 Combining graphs and maps


The alternative to the use of coded symbols in presenting complex infor-
mation on a map is the use of small graphs, such as histograms or
frequency polygons, in different map regions to illustrate such features as
regional variations in monthly rainfall or other climatic information.
Sometimes, this presentation can be enhanced and made easier to interpret
by the addition of brief annotation, in order to draw one's attention to
spatial patterns.

6.5 Graphical techniques of classification


A major goal of much ecological and environmental work is the classifica-
tion of sites into areas with similar species composition, patterns of
pollution or geochemical composition, for example.
Techniques of cluster analysis are important tools of classification and
can be divided into two main groups: hierarchical and nonhierarchical.
Both groups rely on graphical presentation to display the results of the
analysis.
Within the hierarchical group, a further subdivision is possible into
agglomerative and divisive methods. Seber61 lists 11 different agglomera-
tive methods, so that it is absolutely necessary to clearly identify which
technique is being used. Rock 76 has vividly demonstrated the problem of
choosing a suitable agglomerative hierarchical method by analysing a set
of data obtained from analyses of systematically collected samples from 6
isolated clusters of limestone outcrops within otherwise limestone-free
metamorphic terrains in the Grampian Highlands in Scotland. He used

[Fig. 25: dendrogram with a % similarity scale running from 100 to 80; the leaves
are the species Leontodon, Poterium, Lolium, Trifolium, Ranunculus, Plantago,
Helictotrichon, Anthriscus, Taraxacum, Heracleum, Lathyrus, Dactylis, Poa
trivialis, Poa pratensis, Holcus, Alopecurus, Arrhenatherum, Agrostis,
Anthoxanthum and Festuca]
Fig. 25. Dendrogram of results of cluster analysis of the log abundances of
major Park Grass species (reproduced with permission from P.G.N. Digby and
R.A. Kempton, 'Multivariate Analysis of Ecological Communities', p. 140, Fig.
5.5, Chapman and Hall, 1987. Copyright 1987 P.G.N. Digby and R.A.
Kempton).

the same agglomerative, polythetic algorithm, but nine different choices of
input matrix, similarity measure and linkage method. The result was a set
of nine completely different dendrograms showing how the limestones
might be related! These different methods are compared in detail in Refs
54 and 61. Dendrograms may also be constructed from divisive hierarchical
clustering methods.
Data may not always have a hierarchical structure, in which case a
nonhierarchical method is preferable and two basic approaches are pos-
sible: nested and fuzzy clusters. Rock76 considers fuzzy clustering as more
appropriate for real geological data.
Dendrograms are used to show overall groups, as in Fig. 25, but they
may mask individual similarities. So, an individual may be shown as
belonging to a particular group merely by virtue of its similarity to one
other individual within that group, and yet it may be vastly different from
the rest. Such anomalous behaviour is best discovered by inspecting the

Fig. 26. Artificial similarity matrix showing its representation as a shaded
matrix with progressively darker shading as the similarity increases (re-
produced with permission from P.G.N. Digby and R.A. Kempton, 'Multivariate
Analysis of Ecological Communities', p. 141, Fig. 5.6, Chapman and Hall, 1987.
Copyright 1987 P.G.N. Digby and R.A. Kempton).

shaded similarity matrix, in which the individual units are re-ordered to
match the order in the dendrogram, and numerical similarity values are
replaced by shading in each cell of the matrix. The darkness of the shading
increases as the similarity value increases (see, for example, Fig. 26).
Dendrograms and shaded similarity matrices may be combined to give a
more complete picture of relationships, as in Fig. 27.
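
A minimal sketch of this combination, using the agglomerative clustering routines in scipy, is given below; the synthetic abundance table, the Euclidean distance, the average-linkage method and the crude similarity transformation 1/(1 + d) are all assumptions of the example rather than the choices made for the Park Grass data.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(12)
abundances = rng.lognormal(size=(12, 6))              # 12 units, 6 variables

dist = pdist(abundances, metric='euclidean')          # pairwise distances
link = linkage(dist, method='average')                # one of many possible choices

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
dendrogram(link, ax=ax1)                              # hierarchical grouping of units
order = leaves_list(link)                             # re-order units to leaf order
similarity = 1.0 / (1.0 + squareform(dist))           # convert distance to similarity
ax2.imshow(similarity[np.ix_(order, order)], cmap='Greys')   # darker = more similar
ax2.set_title('shaded similarity matrix (re-ordered)')
plt.show()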

7 SOFTWARE FOR ILLUSTRATIVE TECHNIQUES

Whilst most of the techniques of illustration described in this chapter


can be implemented manually, complex or large data sets are best pro-
cessed by computerized methods, with the proviso that the computerized
[Fig. 27: species listed in dendrogram order from Agrostis to Arrhenatherum,
with matrix cells shaded by similarity band: >92.5%, 85-92.5% and 75-85%]

Fig. 27. Combined display of a hanging dendrogram and a shaded similarity
matrix for the Park Grass species data (reproduced with permission from
P.G.N. Digby and R.A. Kempton, 'Multivariate Analysis of Ecological Com-
munities', p. 143, Fig. 5.7, Chapman and Hall, 1987. Copyright 1987 P.G.N.
Digby and R.A. Kempton).

processing will often need improvement by manual techniques, particularly


for annotation and titling.

7.1 General purpose and business graphics


Many general purpose packages, such as Lotus 123 and Symphony,
Supercalc, dBase, etc., have quite useful simple graphics routines, which
can be readily adapted to a number of the display tasks discussed above.
They may also be interfaced with a variety of other graphical/illustrative
packages, such as Harvard Graphics, Gem Paint or Freelance, in order to
enhance the quality of the display.

7.2 Statistical software and computerized mapping


Several packages are available for use on personal computers and many
have a wide range of the display tools discussed above. Amongst the
leaders in this area (not in any specific order of merit) are the following:
- Stata from Computing Resource Center, 1640 Fifth Street, Santa
Monica, California 90401, USA.
- Systat from Systat, Inc., 1800 Sherman Avenue, Evanston, Illinois
60201-3973, USA or Eurostat Ltd, Icknield House, Eastcheap, Letch-
worth SG6 3DA, UK.
- Minitab (preferably with a graphplotter) from Minitab, Inc., 3081
Enterprise Drive, State College, Pennsylvania 16801, USA or Cle.Com
Ltd., The Research Park, Vincent Dr., Birmingham B15 2SQ, UK.
- Solo and PC90 from BMDP Statistical Software, 1440 Sepulveda
Blvd., Los Angeles, California 90025, USA or at Cork Technology
Park, Model Farm Road, Cork, Eire.
- Statgraphics PC from STSC, Inc., 2115 East Jefferson Street, Rock-
ville, Maryland 20852, USA or Mercia Software Ltd., Aston Science
Park, Love Lane, Birmingham B7 4BJ, UK.
- SPSS/PC+ 4.0 from 444 North Michigan Avenue, Chicago, Illinois
60611, USA.
- CSS/Statistica from StatSoft, 2325 East 13th St., Tulsa, Oklahoma
74104, USA or Eurostat Ltd., Icknield House, Eastcheap, Letchworth
SG6 3DA, UK.
Many special purpose statistical graphics software packages are also
available and news and reviews of these may be found in the following
journals: The American Statistician, Technometrics, The Statistician,
Applied Statistics, Journal of Chemometrics, etc.
In the UK, a very useful source of information is the CHEST Directory

published by the UK NISS (National Information on Software and Ser-


vices), whose publications are available via most university and polytech-
nic computer centres.
A powerful Geographic Information System (GIS) called ARC/INFO
is available for use on mainframes, minicomputers, workstations and even
on IBM PC/AT or AT-compatibles and PS/2. This offers the capability of
advanced cartographic facilities with a relational data base management
system and extensive possibilities for multi-variable dynamic environmen-
tal modelling, together with data analysis and display both in graphical
and map formats. The software is available from the Environmental
Systems Research Institute, 380 New York Street, Redlands, California
92373, USA or one of its distributors.
Other GIS packages for use on PCs or Macintoshes were recently
reviewed by Mandel77 and some of these have considerable data analysis
capability combined with their mapping facilities.

REFERENCES

1. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods
for Data Analysis. Duxbury Press, Boston, MA, 1983, pp. 76-80.
2. Tiley, P.F., The misuse of correlation coefficients. Chemistry in Britain. (Feb.
1985) 162-3.
3. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA,
1977, Chapter 5.
4. Cleveland, W.S., Diaconis, P. & McGill, R., Variables on scatterplots look
more highly correlated when the scales are increased. Science, 216 (1982)
1138-41.
5. Diaconis, P., Theories of data analysis: From magical thinking through
classical statistics. In Exploring Data Tables, Trends, and Shapes, ed. D.C.
Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985,
Chapter 1.
6. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P., Graphical Methods
of Data Analysis. Duxbury Press, Boston, MA, 1983, Chapter 8.
7. Chernoff, H., Graphical representations as a discipline. In Graphical Repre-
sentation of Multivariate Data, ed. P.C. Wang. Academic Press, New York,
1978, pp. 1-11.
8. Ross, L. & Lepper, M.R., The perseverance of beliefs: empirical and norma-
tive considerations. In New Directions for Methodology of Behavioural Scien-
ces: Fallible Judgement in Behavioral Research. Jossey-Bass, San Francisco,
1980.
9. Kahneman, D., Slovic, P. & Tversky, A. (ed.), Judgement under Uncertainty:
Heuristics and Biases. Cambridge University Press, Cambridge, 1982.
10. Schmid, C.F., Statistical Graphics. John Wiley & Sons, New York, 1983.

11. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA,


1977.
12. Tukey, J.W. & Mosteller, F.W., Data Analysis and Regression: A Second
Course in Statistics. Addison-Wesley, Reading, MA, 1977.
13. Hoaglin, D.C., Mosteller, F. & Tukey, J.W., Understanding Robust and Ex-
ploratory Data Analysis. John Wiley & Sons, New York, 1983, pp. 1-4.
14. Ames, A.E. & Szonyi, G., How to avoid lying with statistics. In Chemometrics:
Theory and Applications. ACS Symposium Series 52, American Chemical
Society, Washington, DC, 1977, Chapter 11.
15. Fisher, R.A., On the mathematical foundations of theoretical statistics. Phil.
Trans. Royal Soc., 222A (1922) 322.
16. Gibbons, J.D., Nonparametric Methods for Quantitative Analysis. Holt, Rine-
hart and Winston, New York, 1976, pp. 56-77.
17. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical
Methods for Data Analysis. Wadsworth International Group, Belmont, CA.
1983, pp. 19-21.
18. Velleman, P.F. & Hoaglin, D.C., Applications, Basics and Computing of Ex-
ploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 1.
19. Emerson, J.D. & Hoaglin, D.C., Stem-and-leaf displays. In Understanding
Robust and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W.
Tukey. John Wiley & Sons, New York, 1983, Chapter 1.
20. McGill, R., Tukey, J.W. & Larsen, W.A., Variations of box plots. The
American Statistician, 32 (1978) 12-16.
21. Mosteller, F. & Tukey, J.W., Data Analysis and Regression: A Second Course
in Statistics, Addison-Wesley, Reading, MA, 1977, Chapter 3.
22. Velleman, P.F. & Hoaglin, D.C., Applications, Basics and Computing of Ex-
ploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 3.
23. Emerson, J.D. & Strenio, J., Boxplots and batch comparison. In Understand-
ing Robust and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller &
J.W. Tukey. John Wiley & Sons, New York, 1983, Chapter 3.
24. Tukey, J.W., Exploratory Data Analysis, Addison-Wesley, Reading, MA.
1977, Chapter 2.
25. Frigge, M., Hoaglin, D.C. & Iglewicz, B., Some implementations of the
boxplot. The American Statistician, 43 (1989) 50-4.
26. Minitab, Inc. Minitab Reference Manual, release 6, January 1988. Minitab,
Inc., State College, PA, pp. 234-6.
27. Hoaglin, D.C., Letter values: a set of order statistics. In Understanding Robust
and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J .W. Tukey.
John Wiley & Sons, New York, 1983, Chapter 2.
28. Velleman, P.F. & Hoaglin, D.C., Applications, Basics and Computing of Ex-
ploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 2.
29. Hoaglin, D.C. & Tukey, J.W., Checking the shape of discrete distributions. In
Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller &
J.W. Tukey, John Wiley & Sons, New York, 1985, Chapter 9.
30. Ord, J.K., Graphical methods for a class of discrete distributions. J. Roy.
Statist. Assoc. A, 130 (1967) 232-8.
31. Hoaglin, D.C., A Poissonness plot. The American Statistician, 34 (1980)
146-9.

32. Southwood, T.R.E., Ecological Methods with Particular Reference to the Study
of Insect Populations. Chapman and Hall, London, 1978, Chapter 2.
33. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical
Methods for Data Analysis. Wadsworth International Group, Belmont, CA,
1983, Chapter 6.
34. Hoaglin, D.C., In Exploring Data Tables, Trends, and Shapes, ed. D.C.
Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985,
pp.437-8.
35. Statistical Graphics Corporation. STATGRAPHICS Statistical Graphics
System. User's Guide, Version 2.6. STSC, Inc., Rockville, MD, 1987, pp. 11-13
to 11-14.
36. Velleman, P.F. & Hoaglin, D.C., Applications, Basics, and Computing of
Exploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 9.
37. Hoaglin, D.C., Using quantiles to study shape. In Exploring Data Tables.
Trends. and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey, John Wiley
& Sons, New York, 1985, Chapter 10.
38. Hoaglin, D.C., Summarizing shape numerically: the g-and-h distributions. In
Exploring Data Tables. Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller &
J.W. Tukey, John Wiley & Sons, New York, 1985, Chapter II.
39. Deming, W.E., Statistical Adjustment of Data. John Wiley & Sons, New York,
1943, pp. 178-82.
40. Mandel, J., The Statistical Analysis of Experimental Data. John Wiley & Sons,
New York, 1964, pp. 288-92.
41. Theil, H., A rank invariant method of linear and polynomial regression
analysis, parts I, II and III. Proc. Kon. Nederl. Akad. Wetensch., A, 53 (1950)
386-92, 521-5, 1397-412.
42. Tukey, J.W., Exploratory Data Analysis, Limited Preliminary Edition, Ad-
dison-Wesley, Reading, MA, 1970.
43. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection.
John Wiley & Sons, New York, 1987, Chapter 2.
44. Lancaster, J.F. & Quade, D., A nonparametric test for linear regression based
on combining Kendall's tau with the sign test. J. Amer. Statist. Assoc., 80
(1985) 393-7.
45. Thompson, J.M., The use of a robust resistant regression method for personal
monitor validation with decay of trapped materials during storage. Analytica
Chimica Acta, 186 (1986), 205-12.
46. Cleveland, W.S. & McGill, R., The many faces of a scatterplot. J. Amer.
Statist. Assoc., 79 (1984) 807-22.
47. Tukey, J.W. & Tukey, P.A., Some graphics for studying four-dimensional
data. Computer Science and Statistics: Proceedings of the 14th. Symposium on
the Interface. Springer-Verlag, New York, 1983, pp. 60-6.
48. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical
Methods for Data Analysis. Duxbury Press, Belmont, CA, 1983, pp. 110-21.
49. Goodall, C., Examining residuals. In Understanding Robust and Exploratory
Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley &
Sons, New York, 1983, Chapter 7.
50. Atkinson, A.C., Plots, Transformations and Regression. An Introduction to

Graphical Methods of Diagnostic Regression Analysis. Oxford University


Press, Oxford, 1985.
51. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. Robust
Statistics. The Approach Based on Influence Functions. John Wiley & Sons,
New York, 1986, Chapter 1.
52. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection.
John Wiley & Sons, New York, 1987, Chapter 1.
53. Thompson, J.M., Sithamparanadarajah, R., Robinson, J.S. & Stephen, W.J.,
Occupational exposure to nitrous oxide, halothane and isopropanol in operat-
ing theatres. Health & Hygiene, 8 (1987) 60-8.
54. Krzanowski, W.J., Principles of Multivariate Analysis. A User's Perspective.
Oxford University Press, Oxford, 1988, Chapter 2.
55. Kosslyn, S.M., Graphics and human information processing. A review of five
books. J. Amer. Statist. Assoc., 80 (1985) 499-512.
56. Tukey, P.A. & Tukey, J.W., Graphical display of data sets in 3 or more
dimensions. In Interpreting Multivariate Data, ed. V. Barnett. John Wiley &
Sons, New York, 1981, Chapters 10, 11 & 12.
57. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical
Methods for Data Analysis. Duxbury Press, Boston, MA, 1983, Chapter 5.
58. Davis, J.C., Statistics and Data Analysis in Geology. John Wiley & Sons, New
York, 1986, p. 488.
59. Becker, R.A. & Cleveland, W.S., Brushing scatterplots. Technometrics, 29
(1987) 127-42.
60. Mead, G.A., The sorted binary plot: A new technique for exploratory data
analysis. Technometrics, 31 (1989) 61-7.
61. Seber, G.A.F., Multivariate Observations. John Wiley & Sons, New York,
1984, Chapter 4.
62. Newton, C.M., Graphics: From alpha to omega in data analysis. In Graphical
Representation of Multivariate Data, ed. P.C. Wang. Academic Press, New
York, 1978, pp. 59-92.
63. Kleiner, B. & Hartigan, J.A., Representing points in many dimensions by trees
and castles. J. Amer. Statist. Assoc., 76 (1981) 260-76.
64. Anderson, E., A semigraphical method for the analysis of complex problems.
Technometrics, 2 (1960) 387-91.
65. Andrews, D.F., Plots of high-dimensional data. Biometrics, 28 (1972) 125-36.
66. Chernoff, H., Using faces to represent points in K-dimensional space graphic-
ally. J. Amer. Statist. Assoc., 68 (1973) 361-8.
67. Flury, B. & Riedwyl, H., Multivariate Statistics. A Practical Approach.
Chapman and Hall, London, 1988. Chapter 4.
68. Flury, B., Construction of an asymmetrical face to represent multivariate data
graphically. Technische Bericht No. 3, Institut für Mathematische Statistik
und Versicherungslehre, Universität Bern, 1980.
69. Flury, B. & Riedwyl, H., Some applications of asymmetrical faces. Technische
Bericht No. 11, Institut für Mathematische Statistik und Versicherungslehre,
Universität Bern, 1983.
70. Schupbach, M., ASYMFACE: Asymmetrical faces on IBM and Olivetti PC.
Technische Bericht No. 16, Institut für Mathematische Statistik und Versi-
cherungslehre, Universität Bern, 1984.

71. Schupbach, M., ASYMFACE: Asymmetrical faces in Turbo Pascal. Tech-
nische Bericht No. 25, Institut für Mathematische Statistik und Versicherungs-
lehre, Universität Bern, 1987.
72. du Toit, S.H.C., Steyn, A.W.G. & Stumpf, R.H., Graphical Exploratory Data
Analysis. Springer-Verlag, New York, 1986, Chapter 4.
73. Tufte, E.R., The Visual Display of Quantitative Information. Graphics Press,
Cheshire, CT, 1983.
74. Schmid, C.F., Statistical Graphics. John Wiley & Sons, New York, 1983, pp.
188-90.
75. Dickenson, G.C., Statistical Mapping and the Presentation of Statistics, 2nd
edn. Edward Arnold, London, 1973, Chapter 5.
76. Rock, N.M.S., Numerical Geology. A Source Guide, Glossary and Selective
Bibliography to Geological Uses of Computers and Statistics. Springer-Verlag,
Berlin, 1988, Topic 21, pp. 275-96.
77. Mandel, R., The world according to Micros. Byte, 15 (1990) 256-67.
78. Monmonier, M.S., Computer-Assisted Cartography: Principles and Prospects.
Prentice-Hall, Englewood Cliffs, NJ, 1982, p. 90.
79. Seber, G.A.F., Multivariate Observations. John Wiley & Sons, New York,
1984. Chapter 7.

BIBLIOGRAPHY

In addition to the literature listed in the References Section, as well as references
within those sources, the following books may prove useful to the reader.

Cleveland, W.S. & McGill, M.E. (ed.), Dynamic Graphics for Statistics. Wads-
worth, Belmont, CA, 1989.
Davis, J.C., Statistics and Data Analysis in Geology. John Wiley & Sons, New
York, 1973.
Eddy, W.F. (ed.), Computer Science and Statistics: Proceedings of the 13th Sym-
posium on the Interface. Springer-Verlag, New York, 1981.
Green, W.R., Computer-aided Data Analysis. A Practical Guide. John Wiley &
Sons, New York, 1985.
Haining, R., Spatial Data Analysis in the Social and Environmental Sciences.
Cambridge University Press, Cambridge, 1990.
Isaaks, E.H. & Srivastava, R.M., An Introduction to Applied Geostatistics. Oxford
University Press, New York, 1989.
Ripley, B.D., Spatial Statistics. J. Wiley & Sons, New York, 1981.
Ripley, B.D., Statistical Inference for Spatial Processes. Cambridge University
Press, Cambridge, 1988.
Upton, G. & Fingleton, B., Spatial Data Analysis by Example. Volume 1, Point
Pattern and Quantitative Data and Volume 2, Categorical and Directional Data.
John Wiley & Sons, New York, 1985, 1989.
Chapter 7

Quality Assurance for


Environmental Assessment
Activities
ALBERT A. LIABASTRE
US Army Environmental Hygiene Activity-South, Building 180, Fort
McPherson, Georgia 30330-5000, USA

KATHLEEN A. CARLBERG
29 Hoffman Place, Belle Mead, New Jersey 08502, USA

MITZI S. MILLER
Automated Compliance Systems. 673 Emory Valley Road, Oak Ridge,
Tennessee 37830, USA

1 INTRODUCTION

1.1 Background
Environmental assessment activities may be viewed as being comprised of
four parts: establishment of Data Quality Objectives (DQO); design of the
Sampling and Analytical Plan; execution of the Sampling and Analytical
Plan; and Data Assessment.
During the last 20 years, numerous environmental assessments have
been conducted, many of which have not met the needs of the data users.
In an effort to resolve many of these problems, the National Academy of
Sciences of the United States was requested by the US Environmental
Protection Agency (EPA) to review the Agency's Quality Assurance (QA)
Program.1

This review and the efforts of the EPA Quality Assurance Management
Staff have led to the use of both Total Quality Principles and the DQO
concept. Data quality objectives are interactive management tools used to
interpret and communicate the data users' needs to the data supplier such
that the supplier can develop the necessary objectives for QA and appro-
priate levels of quality control. In the past, it was not considered important
that data users convey to data suppliers what the quality of the data
should be.
Data use objectives are statements relating to why data are needed, what
questions have to be answered, what decisions have to be made, and/or
what decisions need to be supported by the data. It should be recognized
that the highest attainable quality may be unrelated to the quality of data
adequate for a stated objective. The suppliers of data should know what
quality is achievable, respond to the users' quality needs and eliminate
unnecessary costs associated with providing data of much higher quality
than needed, or the production of data of inadequate quality.
Use of the DQO process in the development of the assessment plan
assures that the data adequately support decisions, provides for cost
effective sampling and analysis, prevents unnecessary repetition of work
due to incorrect quality specifications, and assures that decisions which
require data collection are considered in the planning phase.
The discussions presented here assume the DQO process is complete
and the data use objectives are set, thus allowing the sampling and analyti-
cal design process to begin.
The sampling and analytical design process should minimally address
several basic issues including: historical information; purpose of the
sample collection along with the rationale for the number, location, type
of samples and analytical parameters; field and laboratory QA pro-
cedures.
Once these and other pertinent questions are addressed, a QA program
may be designed that is capable of providing the quality specifications
required for each project. By using this approach, the QA program ensures
a project capable of providing the data necessary for meeting the data use
objectives.

1.2 The quality assurance program


The QA program is a system of activities aimed at ensuring that the
information provided in the environmental assessment meets the data
users' needs. It is designed to provide control of field and laboratory
operations, traceable documentation, and reporting of results of sample
collection and analytical activities.
The organizations involved in providing sampling and analytical ser-
vices must demonstrate their qualifications to perform the work. The
qualification to perform the work involves assuring that adequate person-
nel, facilities and equipment/instrumentation are available.
One way of demonstrating qualifications is to develop, maintain and
register2 a documented Quality Management System (QMS)-one that
conforms to the generic quality systems model standards for QA of either
the American Society for Quality Control (Q-90 Series)3-7 or the Interna-
tional Organization for Standardization (ISO Guide 9000 Series).
The organizations involved must demonstrate that personnel are quali-
fied based on education, experience, and/or training to perform their job
functions. In addition, the organizations must also demonstrate that: both
the equipment and facilities are available to perform the work and the staff
are trained in the operation and maintenance of the equipment and
facilities. The term 'facilities' as used in this text includes either a per-
manent laboratory or a temporary field facility which provide for sample
handling. A controlled environment appropriate for the work to be
performed must also be provided.
Field and laboratory operations may be controlled by following written
standard operating procedures (SOP). It is essential that the SOPs be
documented, controlled, maintained and followed without exception
unless authorized by a change order. Such standardization of procedures
assures that no matter who performs the work, the information provided
will be reproducible. Among the quality control procedures that may be
necessary are those that identify and define control samples, data accept-
ance criteria, and corrective actions.
Documentation and reporting are essential in providing the traceability
necessary for legally defensible data. The quality program must address
and include detailed documentation requirements.

2 ORGANIZATIONAL AND DOCUMENTATION REQUIREMENTS

It is essential that environmental assessment activities provide information


that has been developed using an adequate QA program or a plan that
provides documentation of the process employed.
A number of organizations, including the EPA,8-10 the International
Standards Organization,11 the American Association for Laboratory Ac-
creditation12,13 and the American Society for Testing and Materials,14-16
have developed requirements and criteria that field sample collection
organizations and/or laboratories should meet in order to provide quality
environmental assessment information.
The following sections detail minimal policies and procedures that must
be developed, implemented and maintained within a field sampling or-
ganization and/or laboratory, in order to provide accurate and reliable
information to meet environmental DQOs.
Many of the recommendations presented here are applicable to both
field and laboratory organizations. Where there are specialized require-
ments, the organization will be identified. These recommendations are
written in this manner to emphasize the similarity and interdependence of
the requirements for field and laboratory organizations. It is well recog-
nized that the sample is the focal point of all environmental assessments.
It is essential that the sample, to the extent possible, actually represents the
environment being assessed. The laboratory can only evaluate what is in
the sample. Thus it is imperative that an adequate QA plan be in place.
In most situations, field sample collection and laboratory operations are
conducted by separate organizations under different management.
Therefore, it is necessary that communication exists between field, laborat-
ory and project managers to ensure that the requirements of the QA plan
are met. Where deviations from the plan occur, they must be discussed
with project managers and documented.

2.1 Organization and personnel


The field sample collection group and the laboratory must be legally
identifiable and have an organizational structure, including quality
systems, that enable them to maintain the capability to perform quality
environmental testing services.

2.1.1 Organization
The field and laboratory organizations should be structured such that each
member of the organization has a clear understanding of their role and
responsibilities. To accomplish this, a table of organization indicating
lines of authority, areas of responsibility, and job descriptions for key
personnel should be available. The organization's management must
promote QA ideals and provide the technical staff with a written QA
policy. This policy should require development, implementation, and
maintenance of a documented QA program. To further demonstrate their
QUALITY ASSURANCE FOR ENVIRONMENTAL ASSESSMENT ACTIVITIES 263

capability, both field and laboratory organizations should maintain a


current list of accreditations from governmental agencies and/or private
associations.

2.1.2 Personnel
Current job descriptions and qualification requirements, including educa-
tion, training, technical knowledge and experience, should be maintained
for each staff position. Training shall be provided for each staff member
to enable proper performance of their assigned duties. All such training
must be documented and summarized in the individual's training file.
Where appropriate, evidence demonstrating that the training met minimal
acceptance criteria is necessary to establish competence with testing, sam-
pling, or other procedures. The proportion of supervisory to non-supervis-
ory staff should be maintained at the level needed to ensure production of
quality environmental data. In addition, back-up personnel should be
designated for all senior technical positions, and they should be trained to
provide for continuity of operations in the absence of the senior technical
staff member. There should be sufficient personnel to provide timely and
proper conduct of analytical sampling and analysis in conformance with
the QA program.

2.1.2.1 Management. The management is responsible for establish-


ing and maintaining organizational, operational and quality policies, and
also for providing the personnel and necessary resources to ensure that
methodologies used are capable of meeting the needs of the data users. In
addition, management should inspire an attitude among the staff of con-
tinuing unconditional commitment to implementation of the QA plan. A
laboratory director and field manager should be designated by manage-
ment. Each laboratory and field sampling organization should also have
a QA Officer and/or Quality Assurance Unit.

2.1.2.2 Laboratory director/field manager. The director/manager is


responsible for overall management of the field or laboratory organiza-
tion. These responsibilities include oversight of personnel selection, de-
velopment and training; development of performance evaluation criteria
for personnel; review, selection and approval of methods of analysis
and/or sample collection; and development, implementation and mainte-
nance of a QA program. It is important that management techniques
employed rely on a vigorous program of education, training and modern
methods of supervision, which emphasize commitment to quality at all
levels rather than numerical goals or test report inspection17 alone.

The director/manager should be qualified to assume administrative,


organizational, professional and educational responsibility and must be
involved in the day-to-day management of the organization. The laborat-
ory director should have at least a BS degree in chemistry and 6 years
experience in environmental chemistry or an advanced degree in chemistry
and 3 years experience in environmental chemistry. The field manager
should have at least a BS degree in environmental, chemical or civil
engineering or natural sciences and 6 years experience in environmental
assessment activities.

2.1.2.3 Technical staff. The technical staff should consist of person-
nel with the education, experience and training to conduct the range of
duties assigned to their position. Every member of the technical staff must
demonstrate proficiency with the applicable analytical, sampling, and
related procedures prior to analyzing or collecting samples. The criteria
for acceptable performance shall be those specified in the procedure, best
practice, or if neither of these exists, the criteria shall be determined and
published by the laboratory director or field manager. Development of
data acceptance criteria should be based on: consideration of precision,
accuracy, quantitation limit, and method detection limit information
published for the particular procedure; information available from Perfor-
mance Evaluation Study data, previously demonstrated during satisfactory
operation; or the requirements of the data user. Many of the more
complex laboratory test technologies (gas chromatography, gas chroma-
tography-mass spectroscopy, inductively coupled plasma-atomic emis-
sion spectroscopy and atomic absorption spectroscopy, for example),
require specific education and/or experience, which should be carefully
evaluated. As a general rule,10,13 specific requirements for senior laboratory
technical personnel should include a BS degree in chemistry or a degree in
a related science with 4 years experience as minimum requirements. In
addition, each analyst accountable for supervisory tasks in the following
areas should meet minimum experience requirements: general chemistry
and instrumentation, 6 months; gas chromatography and mass spectro-
scopy, 1 year; atomic absorption and emission spectroscopy, 1 year; and
spectra interpretation, 2 years.

2.1.2.4 Support staff. The support staff is comprised of personnel
who perform sampling, laboratory and administrative support functions.
These activities include: cleaning of sample containers, laboratory ware
and equipment; transportation and handling of samples and equipment;
and clerical and secretarial services. Each member of the support staff
must be provided with on-the-job training to enable them to perform the
job in conformance with the requirements of the job description and the
QA program. Such training must enable the trainee to meet adopted
performance criteria and it must also be documented in the employee's
training file.

2.1.2.5 Quality assurance function. The organization must have a
QA Officer and/or a QA Unit that is responsible for monitoring the field
and laboratory activities. The QA staff assures management that facilities,
equipment, personnel, methods, practices and records are in conformance
with the QA plan. The QA function should be separate from and indepen-
dent of the personnel directly involved in environmental assessment ac-
tivities. The QA Unit should: (1) inspect records to assure that sample
collection and testing activities were performed properly and within the
sample holding times or specified turn around times; (2) maintain and
distribute copies of the laboratory QA plan along with any project specific
QA plan dealing with sampling and/or testing for which the laboratory is
responsible; (3) perform assessments of the organization to ensure adher-
ence to the QA plan; (4) periodically submit to management written status
reports, noting any problems and corrective actions taken; (5) assure that
deviations from the QA plan or SOPs were properly authorized and
documented; and (6) assure that all data and sampling reports accurately
describe the methods and SOPs used and that the reported results ac-
curately reflect the raw data. The responsibilities, procedures, records of
archiving applicable to the QA Unit should be documented and main-
tained in a manner reflecting the current practices.

2.1.3 Subcontractors
All subcontractors should be required to meet the same quality standards
as the primary laboratory or sample collection organization. Subcontrac-
tors should be audited to meet the same criteria as the in-house labora-
tories by using quality control samples, such as double-blind samples, and
by site visits conducted by the QA Officer.

2.2 Facilities
Improperly designed and poorly maintained laboratory facilities can have
a significant effect on the results of analyses, the health, safety and morale
of the analysts, and the safe operation of the facilities. Although the
emphasis here is the production of quality data, proper facility design
serves to protect personnel from chemical exposure health hazards, as well
as to protect them from fires, explosions and other hazards.18-23 Further-
more, poorly maintained facilities detract from the production of quality
data. An additional consideration is facility and equipment requirements
for field measurements, which should be addressed in the field QA plan.
These requirements should include consideration of the types of measure-
ments, an appropriate area to perform the measurements and address
ventilation, climate control, power, water, gases, and safety needs.

2.2.1 General
Each laboratory should be sized and constructed to facilitate the proper
conduct of laboratory analyses and associated operations. Adequate
bench space or working area per analyst should be provided; 4·6-
7·6 meters of bench space or 14-28 square meters of floor space per analyst
has been recommended. 21 Lighting requirements may vary depending
upon the tasks being performed. Lighting levels in the range of 540-
1075 lumens per square meter are usually adequate. 24 A stable and reliable
source of power is essential to proper operation of many laboratory
instruments. Surge suppressors are required for computers and other
sensitive instruments, and uninterruptible power supplies, as well as isolated
ground circuits, may be required. The actual requirements depend on the
equipment or apparatus utilized, power line supply characteristics and the
number of operations that are to be performed (many laboratories have
more work stations than analysts) at one time. The specific instrumenta-
tion, equipment, materials and supplies required for performance of a test
method are usually described in the approved procedure. If the laboratory
intends to perform a new test, it must acquire the necessary instrumenta-
tion and supplies, provide the space, and conduct the training necessary
to demonstrate competence with the new test before analyzing any routine
samples.

2.2.2 Laboratory environment


The facility should be designed so that no one activity will have an adverse
effect on another activity. In addition, separate space for laboratory
operations and ancillary support should be provided. The laboratory
should be well-ventilated, adequately lit, free of dust and drafts, protected
from extremes of temperature and humidity, and have access to a stable
and reliable source of power. The laboratory should be minimally equipped
with the following safety equipment: fire extinguishers of the appropriate
number and class; eye wash and safety shower facilities; eye, face, skin and
respiratory personal protective equipment; and spill control materials
appropriate to each laboratory's chemical types and volumes. Labora-
tories may also have need of specialized facilities, such as a perchloric acid
hood, glovebox, special ventilation requirements, and sample disposal
area. It is frequently desirable and often essential to separate types of
laboratory activities. Glassware cleaning and portable equipment storage
areas should be convenient, but separated from the laboratory work area.
When conducting analyses over a wide concentration range, caution must
be exercised to prevent cross contamination among samples in storage and
during analysis. Sample preparation areas where extraction, separation,
clean-up, or digestion activities are conducted must be separate from
instrumentation rooms to reduce hazards and avoid sample contamination
and instrument damage. Documentation must be available demonstrating
contamination is not a problem where samples, calibration standards and
reference materials are not stored separately. Field and mobile labora-
tories may also have special requirements based on function or location.

2.2.3 Ventilation system


The laboratory ventilation system should: provide a source of air for both
breathing and input to local ventilation devices; ensure that laboratory air
is continually replaced to prevent concentrating toxic substances during
the work day; provide air flow from non-laboratory areas into the laborat-
ory areas and direct exhaust air flow above the low pressure envelope
around the building created by prevailing windflow. The best way to
prevent contamination or exposure due to airborne substances, is to
prevent their escape into the working atmosphere by using hoods or other
ventilation devices. Laboratory ventilation should be accomplished using
a central air conditioning system which can: filter incoming air to reduce
airborne contamination; provide stable operation of instrumentation and
equipment through more uniform temperature maintenance; lower hum-
idity which reduces moisture problems associated with hygroscopic sub-
stances and reduces instrument corrosion problems; provide an adequate
supply of make up air for hoods and local exhaust devices. It has been
recommended that a laboratory hood with 0·75 meter of hood width per
person be provided for every two workers 20 (it is a rare occurrence for a
hood not to be required). Each hood should be equipped with a con-
tinuous monitoring device to afford convenient confirmation of hood
performance prior to use (typical hood face velocity should be in the range
of 18·5-30·5 meters per minute).18,20,22 In addition, other local ventilation
devices, such as ventilated storage cabinets, canopy hoods, and snorkels
should be provided where needed. These devices shall be exhausted
separately to the exterior of the building. The laboratory must also have
adequate and well ventilated storerooms (4-12 room air changes per
hour).20,23,25 The laboratory ventilation system, hoods and local exhaust
ventilation devices should be evaluated on installation, monitored every
3 months where continuous monitors are not installed and must be
reevaluated when any changes in the system are made.
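
As a simple illustration, the recommended ranges quoted above can be embodied in a routine monitoring check; the following Python sketch assumes a measured hood face velocity and a measured storeroom exhaust flow, and the numerical values shown are invented for the example.

# Minimal sketch of the ventilation checks implied by the figures quoted
# above: hood face velocity within 18.5-30.5 meters per minute and
# storeroom ventilation of 4-12 air changes per hour.  Measured values
# are illustrative assumptions.
def hood_face_velocity_ok(face_velocity_m_per_min):
    return 18.5 <= face_velocity_m_per_min <= 30.5

def storeroom_ventilation_ok(exhaust_m3_per_h, room_volume_m3):
    air_changes_per_hour = exhaust_m3_per_h / room_volume_m3
    return 4.0 <= air_changes_per_hour <= 12.0

print(hood_face_velocity_ok(24.0))            # True: within the recommended range
print(storeroom_ventilation_ok(450.0, 60.0))  # True: 7.5 air changes per hour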

2.2.4 Sample handling, shipping, receiving and storage area


The attainment of quality environmental data depends not only on collect-
ing representative samples, but also on ensuring that those samples remain
as close to their original condition as possible until analyzed. When
samples cannot be analyzed upon collection, they must be preserved and
stored as required for the analytes of interest and shipped to a laboratory
for analysis. Shipping of samples to the laboratory must be done in a
manner that provides sufficient time to perform analyses within specified
holding times. To ensure safe storage and prevent contamination and/or
misidentification, there must be adequate facilities for handling, shipping,
receipt, and secure storage of samples.
In addition, adequate storage facilities are required for reagents, stan-
dards, and reference materials to preserve their identity, concentration,
purity and stability.

2.2.5 Chemical and waste storage areas


Facilities adequate for the collection, storage and disposal of waste chemi-
cals and samples must be provided at the field sampling site, as well as in
the laboratory. These facilities are to be operated in a manner that mini-
mizes the chance of environmental contamination, and complies with all
applicable regulations.
Where practical, it is advisable to either recover, reuse, or dispose of
wastes in-house. Sample disposal presents complex problems for laborato-
ries. In many situations, these materials can be returned to the originator;
however, in those situations where this is not possible, other alternatives
must be considered, such as recovery, waste exchange, incineration, sol-
idification, lab pack containerization and/or disposal to the sewer system
or landfill site. Procedures outlining disposal practices must be available,
and staff members must be trained in disposal practices applicable to their
area.

2.2.6 Data and record storage areas


Space must be provided for the storage and retrieval of all documents
related to field sampling and laboratory operations and reports. Such
space must also include an archive area for older records. The environ-
ment of these storage areas must be maintained to ensure the integrity of
all records and materials stored. Access to these areas must be controlled
and limited to authorized personnel.

2.3 Equipment and instrumentation


There should be well-defined and documented purchasing guidelines for
the procurement of equipment.2-6,11-13 These guidelines should ensure that the
data quality needs of the users are met.

2.3.1 General
The laboratory and field sampling organizations should have available all
items of equipment and instrumentation necessary to correctly and ac-
curately perform all the testing and measuring services that it provides to
its users. For the field sampling activities, the site facilities should be
examined prior to beginning work to ensure that appropriate facilities,
equipment, instrumentation, and supplies are available to accomplish the
objectives of the QA plan. Records should be maintained on all major
items of equipment and instrumentation which should include the name of
the item, the manufacturer and model number, serial number, date of
purchase, date placed in service, accessories, any modifications, updates or
upgrades that have been made, current location of the equipment, as well
as related accessories and manuals, and all details of maintenance. Items
of equipment used for environmental testing must meet certain minimum
requirements. For example, analytical balances must be capable of weigh-
ing to 0·1 mg; pH meters must have scale graduations of at least 0·01 pH
units and must employ a temperature sensor or thermometer; sample
storage refrigerators must be capable of maintaining the temperature in
the range 2-5°C; the laboratory must have access to a certified National
Institute of Standards and Technology (NIST) traceable thermometer; ther-
mometers used should be calibrated and have graduations no larger than
that appropriate to the testing or operating procedure; and probes for
conductivity measurements must be of the appropriate sensitivity.

2.3.2 Maintenance and calibration4,11


The organization must have a program to verify the performance of newly
installed instrumentation to manufacturers' specifications. The program
must provide clearly defined and written maintenance procedures for each
measurement system and the required support equipment. The program
must also detail the records required for each maintenance activity to
document the adequacy of maintenance schedules and the parts inventory.
All equipment should be properly maintained to ensure protection from
corrosion and other causes of deterioration. A proper maintenance pro-
cedure should be available for those items of equipment which require
periodic maintenance. Any item of equipment which has been subject to
mishandling, which gives suspect results, or has been shown by calibration
or other procedure to be defective, should be taken out of service and
clearly labelled until it has been repaired. After repair, the equipment must
be shown by test or calibration to be performing satisfactorily. All actions
taken regarding calibration and maintenance should be documented in a
permanent record.
Calibration of equipment used for environmental testing usually falls in
either the operational or periodic calibration category. Operational cali-
bration is usually performed prior to each use of an instrumental measure-
ment system. It typically involves developing a calibration curve and
verification with a reference material. Periodic calibration is performed
depending on use, or at least annually on items of equipment or devices
that are very stable in operation, such as balances, weights, thermometers
and ovens. Typical calibration frequencies for instrumentation are either
prior to use, daily or every 12 h. All such calibrations should be traceable
to a recognized authority such as NIST. The calibration process involves:
identifying equipment to be calibrated; identifying reference standards
(both physical and chemical) used for calibration; identifying, where
appropriate, the concentration of standards; use of calibration pro-
cedures; use of performance criteria; use of a stated frequency of calibra-
tion; and the appropriate records and documentation to support the
calibration process.

2.4 Standard operating procedures28.39


All organizations need to have documented SOPs which clearly delineate
the steps to be followed in performing a given operation.
It is essential that SOPs be clearly written and easily understood. Some
excellent advice for writing clearly was given by Sir Ernest Gowers who
said: 'It is not enough to write with such precision that a person reading
with good will may understand; it is necessary to write with such precision
that a person reading with bad will cannot misunderstand.' A standard
format should be adopted and employed in writing SOPs for all pro-
cedures and methods, including sample collection and analysis, laboratory
functions and ancillary functions. Published industry accepted procedures
often serve as the foundation for these written in-house analytical pro-
cedures and these in-house protocols must include the technical restric-
tions of the applicable regulatory approved methods.
Each page of final drafts of SOPs should be annotated with a document
control heading containing procedure number, revision number, date of
revision, implementation date and page number, and the total number of
pages in the SOP. The cover page of the SOP should indicate whether it
is an uncontrolled copy or a controlled copy (with the control number
annotated on the page) and must include endorsing signatures of approval
by appropriate officers of the organization. A controlled copy is one that
is numbered and issued to an individual, the contents of which are updated
after issue; an uncontrolled copy is current at the time of issue but no
attempt is made to update it after issuance. Whenever an SOP is modified,
it should be rewritten with changes and contain a new revision number and
a new implementation date. A master copy of each SOP should be kept
in a file under the control of the QA Officer. The master copy should
contain the signatures of the Laboratory Director/Field Manager and the
Laboratory/Field QA Officer or their designees. Records should be main-
tained of the distribution of each SOP and its revisions. Use of a com-
prehensive documentation procedure allows tracking and identification of
which procedure was being employed at any given time.
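
To illustrate the document control information described above, the following Python sketch gathers the required fields into a single record; the field names, types and example values are assumptions made for the illustration and do not represent a prescribed format.

# Minimal sketch of an SOP document-control record: procedure number,
# revision number, dates, page count and controlled/uncontrolled status,
# as described in the text.  Field names and values are illustrative.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SOPDocumentControl:
    procedure_number: str              # hypothetical identifier, e.g. 'SOP-014'
    revision_number: int
    revision_date: date
    implementation_date: date
    total_pages: int
    controlled_copy: bool              # controlled copies are numbered and updated after issue
    copy_number: Optional[int] = None  # assigned only to controlled copies

master_copy = SOPDocumentControl('SOP-014', 3, date(1991, 6, 1),
                                 date(1991, 7, 1), 12,
                                 controlled_copy=True, copy_number=7)
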
It is essential that appropriate SOPs be available to all personnel in-
volved in field sampling and laboratory operations and that these SOPs
address the following areas: (1) sample control including sample collec-
tion and preservation, sample storage and chain of custody; (2) standard
and reagent preparation; (3) instrument and equipment maintenance; (4)
procedures to be used in the field and laboratory; (5) analytical methods;
(6) quality control SOPs; (7) corrective actions; (8) data reduction and
validation; (9) reporting; (10) records management; (11) chemical and
sample disposal; and (12) health and safety.

2.5 Field and laboratory records 26-28


All information relating to field and laboratory operations should be
documented, providing evidence related to project activities and support-
ing the technical interpretations and judgments. The information usually
contained in these records relates to: project planning documents, includ-
ing maps and drawings; all QA plans involved in the project; sample
management, collection and tracking; all deviations from procedural and
planning documents; all project correspondence; personnel training and
qualifications; equipment calibration and maintenance; SOPs, including
available historical information about the site; traceability information
relating to calibrations, reagents, standards and reference materials; origi-
nal data, including raw data and calculated results for field, QC samples
and standards; method performance, including detection and quantitation
limits, precision, bias, and spike/surrogate recovery; and final report. It
is essential that these records be maintained as though intended for use in
regulatory litigation. These records must be traceable to provide historical
evidence required for reconstruction of events and review and analysis.
Clearly defined SOPs should be available providing guidance for re-
viewing, approving and revising field and laboratory records. These
records must be legible, identifiable, retrievable and protected against
damage, deterioration or loss. All records should be written or typed with
black waterproof ink. For laboratory operations each page of the note-
book, logbook or each bench sheet must be signed and dated by the
person who entered the data. Documentation errors should be corrected
by drawing a line through the incorrect entry, initialing and dating the
deletion, and writing the correct entry adjacent to the deletion. The reason
for the change should be indicated.
Field records minimally consist of: bound field notebooks with prenum-
bered pages; sampling point location maps; sample collection forms;
sample analysis request forms; chain of custody forms; equipment mainte-
nance and calibration logs; and field change authorization forms.
Laboratory records minimally consist of: bound notebooks and
logbooks with prenumbered pages; bench sheets; graphical and/or com-
puterized instrument output; files and other sample tracking or data entry
forms; SOPs; and QA plans.

2.6 Document storage11,27


Procedures should be established providing for safe storage of the docu-
mentation necessary to recreate sampling, analyses and reporting informa-
tion. These documents should minimally include planning documents,
SOPs, logbooks, field and laboratory data records, sample management
records, photographs, maps, correspondence and final reports.
Field and laboratory management should have a written policy specify-
ing the minimum length of time that documentation is to be retained. This
policy should be in conformance with the more stringent of regulatory
requirements, project requirements or organizational policy. Deviations
from the policy should be documented and available to those responsible
for maintaining the documentation. No records should be disposed of
without written notice being provided to the client stating a disposition
date which provides an opportunity for them to obtain the records. If the
testing organization or an archive contract facility goes out of business
prior to expiration of the documentation storage time specified in the
policy, all documentation should be transferred to the archives of the
client involved.
The documentation should be stored in a facility capable of maintaining
the integrity and minimizing deterioration of the records for the length of
time they are retained.
Control of the records involved is essential in providing evidence of their
traceability and integrity. These records should be identified, readily re-
trievable, and organized to prevent loss. Access to the archived records
should be restricted so that unauthorized personnel do not have free and open access. An
authorized access list should be maintained naming personnel authorized
to access the archived information. All accesses to the archives should be
documented. The documentation involved should include: name of the
individual; date and reason for accessing the data; and all changes, de-
letions or withdrawals that may have occurred.

3 MEASUREMENT SYSTEM CONTROL21,29-31

An essential part of a QA system involves continuous and timely monitor-
ing of measurement system control. Organizations may be able to
demonstrate the capability to conduct field and laboratory aspects of
environmental assessments, but this alone does not provide the ongoing
day-to-day performance checks necessary to guarantee and document
performance. Performance check procedures must be in place to ensure
that the quality of work conforms to the quality specifications on an
ongoing basis.
The first step in instituting performance checks is to evaluate the mea-
surement systems involved in terms of sensitivity, precision, linearity and
bias to establish conditional measurement system performance charac-
teristics. This is accomplished by specifying: instrument and method cali-
bration criteria; acceptable precision and bias of the measurement system;
and method detection and reporting limits. The performance characteris-
tics may serve as the basis for acceptance criteria used to validate data. If
published performance characteristics are to be used, they must be verified
as reasonable and achievable with the instrumentation available and
under the operating conditions employed.

3.1 Calibration
Calibration criteria should be specified for each test technology and/or
analytical instrument and method in order to verify measurement system
performance. The calibration used should be consistent with the data use
objectives. Such criteria should specify the number of standards necessary
to establish a calibration curve, the procedures to employ for determining
linear fit and linear range of the calibration, and acceptance limits for
(continuing) analysis of calibration curve verification standards. All SOPs
for field and laboratory measurement system operation should specify
system calibration requirements, acceptance limits for analysis of calibra-
tion curve verification standards, and detailed steps to be taken when
acceptance limits are exceeded.
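
By way of illustration only, a minimal Python sketch of such a calibration check is given below; the standard concentrations, the linearity criterion (a correlation coefficient of at least 0·995) and the 90-110% recovery window for the continuing calibration verification standard are assumptions made for the example, and the actual criteria must come from the approved method and the QA plan.

# Minimal sketch: least-squares calibration fit, linearity check and
# continuing calibration verification (CCV).  Acceptance limits and data
# are illustrative placeholders, not method requirements.
import numpy as np

conc = np.array([0.0, 1.0, 2.0, 5.0, 10.0])            # standard concentrations, mg/L
resp = np.array([0.002, 0.101, 0.198, 0.505, 1.012])   # instrument responses

slope, intercept = np.polyfit(conc, resp, 1)            # linear calibration curve
r = np.corrcoef(conc, resp)[0, 1]                       # linearity (correlation coefficient)
if r < 0.995:
    raise ValueError('Calibration not acceptably linear; recalibrate')

def measured_conc(response):
    """Convert an instrument response to concentration via the curve."""
    return (response - intercept) / slope

# A mid-range standard re-analysed during the run must agree with its
# true value within the stated acceptance limits.
ccv_true, ccv_response = 5.0, 0.512
recovery = 100.0 * measured_conc(ccv_response) / ccv_true
if not 90.0 <= recovery <= 110.0:
    print('CCV outside acceptance limits: stop, correct and re-analyse the batch')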

3.2 Precision and bias32-34


The precision and bias of a measurement system must be determined prior
to routine use using solutions of known concentration. The degree to
which a measurement is reproducible is frequently measured by the stan-
dard deviation or relative per cent difference of replicates. Bias, the deter-
mination of how close a measurement is to the true value, is usually
calculated in terms of its complement, the per cent recovery of known
concentrations of analytes from reference or spiked samples. Precision and
bias data may be used to establish acceptance criteria, also called control
limits, which define limits for acceptable performance. Data that fall
inside established control limits are judged to be acceptable, while data
lying outside of the control interval are considered suspect. Quality
control limits are often two-tiered: warning limits are established at ± 2
standard deviation units around the mean; and control limits are esta-
blished at ± 3 standard deviation units around the mean. All limits should
be updated periodically in order to accurately reflect the current state of
the measurement system.
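
A minimal Python sketch of the statistics described above is given below; the historical recovery values and the new result are invented solely to illustrate the use of two-tiered limits.

# Minimal sketch: relative per cent difference of duplicates, per cent
# recovery of a known concentration, and warning/control limits at +/-2
# and +/-3 standard deviations about the mean.  Data are illustrative.
import statistics

def relative_percent_difference(x1, x2):
    """Precision of duplicate results, as a percentage of their mean."""
    return 100.0 * abs(x1 - x2) / ((x1 + x2) / 2.0)

def percent_recovery(measured, true_value):
    """Bias expressed through its complement, the recovery."""
    return 100.0 * measured / true_value

# Historical control-sample recoveries used to establish the limits.
recoveries = [98.2, 101.5, 99.7, 96.4, 102.8, 100.9, 97.5, 99.1]
mean = statistics.mean(recoveries)
sd = statistics.stdev(recoveries)

warning_limits = (mean - 2 * sd, mean + 2 * sd)
control_limits = (mean - 3 * sd, mean + 3 * sd)

new_result = 104.2
if not control_limits[0] <= new_result <= control_limits[1]:
    print('Out of control: data suspect, corrective action required')
elif not warning_limits[0] <= new_result <= warning_limits[1]:
    print('Within control limits but outside warning limits: watch for trends')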

3.3 Method detection and reporting limits


Prior to routine use, the sensitivity of the measurement system must be
determined. There are many terms that have been used to describe sensitiv-
ity including: detection limit (DL),35 limit of detection (LOD),36 method
detection limit (MDL),37 instrument detection limit (IDL),38 method quan-
titation limit (MQL),39 limit of quantitation (LOQ),36 practical quantita-
tion limit (PQL),39.40 contract required detection limit (CRDL),38 and
criteria of detection (COD)33 (see Table 1).
In an analytical process, the IDL is generally used to describe instru-
ment sensitivity and the MDL is generally used to describe overall sensitiv-
ity of the measurement system, including sample preparation and analysis.
Frequently, either the IDL or MDL is designated as the lowest reportable
concentration level, and analytical data are quantitatively reported at or
above the IDL or MDL. As seen from Table 1, there is a great deal of
disagreement and confusion regarding definitions and reporting limits.
The definition of the quantitation limit presumes that samples at that con-
centration can be quantitated with virtual assurance, which implies the use
of '< MDL' as the reporting threshold. Results reported as '< LOQ',
however, are not quantitative, which negates the definition of LOQ; this
is a bad and unnecessary practice.
It is essential that the sensitivity of the measurement system and report-
ing limits be established prior to routine use of the measurement system.
Although MDLs may be published in analytical methods, it is necessary
for the laboratory to generate its own sensitivity data to support its own
performance.
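
A minimal Python sketch of an MDL determination of the kind summarized in Table 1 is given below; the seven replicate results are illustrative only.

# Minimal sketch of a method detection limit (MDL) determination: seven
# replicate spikes at 1-5 times the expected detection limit, with
# MDL = standard deviation x Student t (3.14 for seven replicates at the
# 99% confidence level).  The replicate values are illustrative.
import statistics

replicates = [0.52, 0.47, 0.55, 0.49, 0.51, 0.46, 0.53]   # ug/L, n = 7
t_99 = 3.14                  # Student t, 99% confidence, 6 degrees of freedom

mdl = t_99 * statistics.stdev(replicates)
print(f'MDL = {mdl:.3f} ug/L')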

3.4 Quality control samples41-45


The continuing performance of the measurement system is verified by
using Quality Control (QC) samples. The QC samples are used to evaluate
measurement system control and the effect of the matrix on the data
generated. Control samples are of known composition and measured by
the field or laboratory organization on an ongoing basis. Data from these
control samples are compared to established acceptance criteria to deter-
mine whether the system is 'in a state of statistical control'. Control
samples evaluate measurement system performance independent of matrix
effects.
The effect of the matrix on measurement system performance is evalu-
ated by collecting and analyzing environmental samples from the site or
location being evaluated. These environmental samples are typically col-
lected and/or analyzed in duplicate to assess precision, or spiked with
known concentrations of target analytes to assess accuracy.
It is essential that both system control and matrix effects be evaluated
to get a true picture of system performance and data quality. However, the
types of QC samples, and frequency of usage, depend on the end-use of the
data. Prior to beginning a sampling effort, the types of QC samples
required should be determined. Data validation procedures should specify
how the results of analysis of these samples will be used in evaluating the
data.

TABLE 1
Definition of detection limit terms

Detection limit (DL): the concentration which is distinctly detectable
above, but close to, a blank. Determined by analysis of replicate
standards; calculated as two times the standard deviation.35

Limit of detection (LOD): the lowest concentration that can be determined
to be statistically different from a blank. Determined by analysis of
replicate samples; calculated as three times the standard deviation.36

Method detection limit (MDL): the minimum concentration of a substance
that can be identified, measured and reported with 99% confidence that
the analyte concentration is greater than zero. Determined by analysis of
a minimum of seven replicates spiked at 1-5 times the expected detection
limit; calculated as the standard deviation times the Student t-value at
the desired confidence level (for seven replicates, the value is 3·14).37

Instrument detection limit (IDL): the smallest signal above background
noise that an instrument can detect reliably. Determined by analysis of
three replicate standards at concentrations of 3-5 times the detection
limit; calculated as three times the standard deviation.38

Method quantitation limit (MQL): the minimum concentration of a
substance that can be measured and reported. Determined by analysis of
replicate samples; calculated as five times the standard deviation.39

Limit of quantitation (LOQ): the level above which quantitative results
may be obtained with a specified degree of confidence. Determined by
analysis of replicate samples; calculated as ten times the standard
deviation.36

Practical quantitation limit (PQL): the lowest level that can be reliably
determined within specified limits of precision and accuracy during
routine laboratory operating conditions. Determined by interlaboratory
analysis of check samples; calculated as (1) ten times the MDL,39 or
(2) the value where 80% of laboratories are within 20% of the true
value.40

Contract required detection limit (CRDL): reporting limit specified for
laboratories under contract to the EPA for Superfund activities;
determination and calculation not specified.38

Common types of field and laboratory QC samples are discussed in the
following sections, along with how they are commonly used in evaluating
data quality.

3.4.1 Field QC samples14,34


Field QC samples typically consist of the following kinds of samples:

(1) Field blank: a sample of analyte-free media similar to the sample
matrix that is transferred from one vessel to another or is exposed to or
passes through the sampling device or environment at the sampling site.
This blank is preserved and processed in the same manner as the associat-
ed samples. A field blank is used to document contamination in the
sampling and analysis process.
(2) Trip blank: a sample of analyte-free media taken from the laboratory
to the sampling site and returned to the laboratory unopened. A trip blank
is used to document contamination attributable to shipping, field handling
procedures and sample container preparation. The trip blank is particular-
ly useful in documenting contamination of volatile organics samples and
is recommended when sampling for volatile organics.
(3) Equipment blank: a sample of analyte-free media that has been used
to rinse sampling equipment (also referred to as an equipment rinsate). It
is collected after completion of decontamination and prior to sampling.
An equipment blank is useful in documenting adequate decontamination
of sampling equipment.
(4) Material blank: a sample of construction materials such as those
used in monitoring wells or other sampling point construction/installa-
tion, well development, pump and flow testing, and slurry wall construc-
tion. Samples of these materials are used to document contamination
resulting from the use of construction materials.
(5) Field duplicates: independent samples collected as close as possible
to the same point in space and time and which are intended to be identical;
they are carried through the entire analytical process. Field duplicates are
used to indicate the overall precision of the sampling and analytical
process.
(6) Background sample: a sample taken from a location on or proximate
to the site of interest, also known as a site blank. It is generally taken from
an area thought to be uncontaminated in order to document baseline or
historical information.

3.4.2 Laboratory QC samples11-14,21,31,33,34


Laboratory QC samples typically include the following kinds of samples:
(1) Method blank (also referred to as laboratory blank): an analyte-free
media to which all reagents are added in the same volumes or proportions
as used in the sample processing. The method blank is carried through the
entire sample preparation and analysis procedure and is used to document
contamination resulting from the analytical process. A method blank is
generally analyzed with each group of samples processed.
(2) Laboratory control sample: a sample of known composition spiked
with compound(s) representative of target analytes, which is carried
through the entire analytical process. The results of the laboratory control
sample(s) analyses are compared to laboratory acceptance criteria (control
limits) to document that the laboratory is in control during an individual
analytical episode. The laboratory control sample must be appropriately
batched with samples to assure that the quality of both the preparation
and analysis of the batch is monitored.
(3) Reference material: a sample containing known quantities of target
analytes in solution or in a homogeneous matrix. Reference materials are
generally provided to an organization through external sources. Reference
materials are used to document the accuracy of the analytical process.
(4) Duplicate samples: two aliquots of sample taken from the same
container after sample homogenization and intended to be identical.
Matrix duplicates are analyzed independently and are used to assess the
precision of the analytical process. Duplicates are used to assess precision
when there is a high likelihood that the sample contains the analyte of
interest.
(5) Matrix spike: an aliquot of sample (natural matrix) spiked with a
known concentration of target analyte(s) which is carried through the
entire analytical process. A matrix spike is used to document the effect of
the matrix on the accuracy of the method, when compared to an aliquot
of the unspiked sample.
(6) Matrix spike duplicates: two aliquots of sample (natural matrix)
taken from the same container and spiked with identical concentrations of
target analytes. Matrix spike duplicates are analyzed independently and
are used to assess the effect of the matrix on the precision of the analytical
process. Matrix spike duplicates are used to assess precision when there is
little likelihood the sample contains compounds of interest.
(7) Surrogate standard: a compound, usually organic, which is similar
to the target analyte(s) in chemical composition and behavior in the
analytical process, but which is generally not found in environmental
samples. Surrogate standards are usually added to environmental samples
prior to sample extraction and analysis and are used to monitor the effect
of the matrix on the accuracy of the analytical process.

4 QUALITY SYSTEMS AUDITS3-6,11-14,46-50

The effectiveness of quality systems is evaluated through the audit
process. The audit process should be designed to provide for both consis-
tent and objective evaluation of the item or element undergoing evalua-
tion.
The QA plan should provide general guidance regarding the frequency
and scope of audits; rationale for determining the need for additional
audits; recording and reporting formats; guidance for initiation and train-
ing; and corrective action plans including implementation and tracking of
corrective actions.
The audit process provides the means for continuously monitoring the
effectiveness of the QA program. The audit process can involve both
internal and third party auditors. Where internal auditors are involved, it
is essential that the auditors have no routine involvement in the activities
they are auditing.
Audit results should be reported to management in a timely manner.
Management should respond to the auditor by outlining its plan for
correcting the quality system deficiencies through the implementation of
corrective actions.

4.1 Internal QC audit


The field and laboratory organization should develop written guidelines
for conducting internal QC audits. The written guidelines should include
specifications of reporting formats and checklist formats to ensure the
audits are performed in a consistent manner.
The internal QC audits should be conducted by the QA officer or a
trained audit team. These audits must be well-planned to ensure minimiza-
tion of their impact on laboratory or field operations. Audit planning
elements should include guidance on: (1) scheduling and notification; (2)
development of written SOPs for conduct of audits; (3) development of
standard checklists and reporting formats; and (4) corrective action. The
auditors should be trained in the areas they audit with emphasis on
consistently recording information collected during audits. To be most
effective, internal QC audits should be timely, thorough, and accurate.
The results of the audits should be reported to the staff both verbally and
in writing in a timely manner.

4.2 Third party QC audits


Third party QC audits are used by the organizations responsible for
conducting the sampling and analysis and the group that contracted the
work to verify the validity of the QA plan and the internal auditing
process.
Third party audits are important because internal QC audits do not
necessarily result in the implementation of the QC procedures specified in
the QA plan; third party audits also provide an independent evaluation of
the quality system.

4.3 Types of audits46,47


There are several types of audits that are commonly conducted by both
internal and third party auditors: system, performance, data quality, and
contract/regulatory compliance audits. Frequently, an audit may involve
some combination of any or all of these audit types to obtain a more
complete understanding of an organization's performance.

4.3.1 System audit


A system audit is performed on a scheduled, periodic (usually semi-annu-
ally) basis by an independent auditor. These audits are usually conducted
by auditors external to the auditee organization, and the results are
reported to the auditee management.
The system audit involves a thorough overview of implementation of
the QA plan within the auditee organization and includes inspection and
evaluation of: (1) facilities; (2) staff; (3) equipment; (4) SOPs; (5) sample
management procedures; and (6) QA/QC procedures.

4.3.2 Performance audit


A performance audit involves a detailed inspection of specific areas within
an organization and its implementation of the QA program including: (1)
sample maintenance; (2) calibration; (3) preventive maintenance; (4)
receipt and storage of standards, chemicals and gases; (5) analytical
methods; (6) data verification; and (7) records management.
The performance audit may also be more restrictive and simply test the
ability of the organization to correctly test samples of known composition.
This involves the use of performance evaluation (PE) samples which may
be submitted as blind or double blind samples by either an internal or
external source. The data from PE sample analyses are compared to
acceptance limits in order to identify problems with qualitative identifica-
tion or quantitative analysis. The organizations involved are usually asked
to provide both an explanation for any data outside acceptance limits and
a listing of corrective actions taken.
This audit is performed on an ongoing basis within field and laboratory
organizations by the QA officer or staff. The results of these audits are
reported to the management of the organization.

4.3.3 Data quality audit


The data quality audit involves an assessment of the precision, bias
(accuracy), representativeness and completeness of the data sets obtained.
This audit is performed on representative data sets produced by the
organization involved. These audits are usually conducted on a project-
specific basis. As with other audits, the results are reported to manage-
ment.

4.3.4 Contract/regulatory compliance audit


This audit is conducted to evaluate the effectiveness of the QA plan and
protocols in ensuring contract/regulatory compliance by the organiza-
tion involved. The contract/regulatory compliance audit generally revol-
ves around ensuring that correct protocols are followed and the resulting
reports are in the contractually prescribed format.

4.4 Field and laboratory checklists10,34,49,51


A checklist should be developed for each audit activity, ensuring that all
areas of the field and laboratory organizations' operations are systematic-
ally addressed. The actual content of the checklist depends on the objective
of the audit but should minimally address QC procedures, the sampling
and analysis plan and conformance to the quality plan. These checklists
are helpful in ensuring that the audits are as objective, comprehensive and
consistent as possible.

4.5 Corrective action


The result of the auditing process, whether by internal auditors or third
party auditors, is the audit report which serves as the basis for the develop-
ment of corrective actions. These corrective actions are implemented to
eliminate any deficiencies uncovered during the audit. The final step in this
process is the evaluation of the effectiveness of the corrective actions in
eliminating the deficiencies. All corrective actions and their effectiveness
must be documented.

5 DATA ASSESSMENT34.52

Data assessment involves determining whether the data meet the require-
ments of the QA plan and the needs of the data user. Data assessment is
a three-part process involving assessment of the field data, the laboratory
data and the combined field and laboratory data. Both the field and
laboratory assessments involve comparison of the data obtained to the
specifications stated in the QA plan, whereas the combined, or overall,
assessment involves determining the data usability. The data usability
assessment is the determination of the data's ability to meet the DQOs and
whether they are appropriate for the intended use.

5.1 Field data assessment34,46,52


This aspect of assessment involves verification of the documentation and
evaluation of both quantitative and qualitative validation of field data from
measurements such as pH, conductivity, flow rate and direction, as well as
information such as soil stratigraphy, groundwater well installation and
sample management records, observations, anomalies and corrective
actions.

5.1.1 Field data completeness


The process of reviewing the field data for completeness ensures that the
QA plan requirements for record traceability and procedure documenta-
tion have been implemented. The documentation must be sufficiently
detailed to permit recreation of the field episode if necessary. Incomplete
records should be identified and should include data qualifiers that specify
the usefulness of the data in question.

5.1.2 Field data validity


Review of field data for validity ensures that problems affecting accuracy
and precision of quantitative results as well as the representativeness of
samples are identified so that the data may be qualified as necessary.
Problems of this type may include improper sample preservation and well
screening, instability of pH or conductivity, or collection of volatile
organic samples adjacent to sources of contamination.

5.1.3 Field data comparisons


Review of the data should include comparison/correlation of data ob-
tained by more than one method. This review will allow identification of
anomalous field test data such as groundwater samples with pH several
units higher than those from similar wells in the same aquifer.

5.1.4 Field laboratory data validation


Field laboratory data validation involves using data validation techniques
which mirror those used in fixed laboratory operations. Details of these
requirements are given in Section 5.2.

5.1.5 Field quality plan variances


This data review involves documenting all quality plan variances. Such
variances must be documented and include the rationale for the changes
that were made. The review involves evaluation of all failures to meet data
acceptance criteria and corrective actions implemented and should include
appropriate data qualification where necessary.

5.2 Laboratory data assessment34,46,52


The laboratory data assessment involves a review of the methods em-
ployed, conformance with QA plan requirements and a review of records.
The discussion presented here is limited to the technical data requirements
because contract requirements and technical data conformance require-
ments are usually not the same.

5.2.1 Laboratory data completeness


Laboratory data completeness is usually defined as the percentage of valid
data obtained. It is necessary to define procedures for establishing data
validity (Sections 5.2.2-5.2.7).
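
Expressed in this way, the calculation reduces to a simple percentage, as in the following Python sketch; the counts are illustrative.

# Minimal sketch: completeness as the percentage of valid results
# obtained relative to the number of results planned.  Counts are
# illustrative assumptions.
planned_results = 240
valid_results = 229      # results that passed the validity checks below

completeness = 100.0 * valid_results / planned_results
print(f'Completeness = {completeness:.1f}%')   # compare with the QA plan objective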

5.2.2 Evaluation of laboratory QC samples26-28,31,44,45


(1) Laboratory blank: the results of these blank sample analyses should
be compared with the sample analyses per analyte to determine whether the
source of any analyte present is due to laboratory contamination or actual
analyte in the sample.
(2) QC sample: the results of control sample analyses are usually ev-
aluated statistically in order to determine the mean and standard devia-
tion. These data can be used to determine data acceptance criteria (control
limits), and to determine precision and accuracy characteristics for the
method. The data can be statistically analyzed to determine outliers, bias
and trends. The control limits from these data should be used on a
real-time basis to demonstrate measurement system control and sample
data validity.
(3) Performance evaluation sample: the results of PE sample analyses are
used to evaluate and compare different laboratories. In order for such
comparisons to be valid, the PE sample results must have associated
control limits that are statistically valid.

5.2.3 Evaluation and reporting of low level data33


Sample data should be evaluated on an analyte basis in determining
whether values to be reported are above detection or reporting limits.
Since there are numerous definitions (see Section 3.3) of these terms, it is
essential that the QA plan clearly defines the terms and specifies the
procedures for determining their values for each analyte/procedure.

5.2.4 Evaluation of matrix effects


Since each environmental sample has the potential for containing a dif-
ferent matrix, it may be necessary to document the effect of each matrix
on the analyte of interest. Matrix effects are usually evaluated by analyzing
the spike recovery from an environmental sample spiked with the analyte
of interest prior to sample preparation and analysis. Surrogate spike
recoveries and matrix spike recoveries may be used to evaluate matrix
effects. If a method shows significant bias or imprecision with a particular
matrix then data near the regulatory limit (or an action level) should be
carefully evaluated.
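
A minimal Python sketch of such an evaluation is given below; the 75-125% recovery window and the 20% limit on the relative per cent difference of matrix spike duplicates are placeholder acceptance criteria for the example, not values prescribed by any method.

# Minimal sketch: matrix spike and matrix spike duplicate recoveries,
# corrected for the native analyte, and their relative per cent
# difference.  Results and acceptance limits are illustrative.
def spike_recovery(spiked_result, unspiked_result, spike_added):
    return 100.0 * (spiked_result - unspiked_result) / spike_added

ms = spike_recovery(spiked_result=18.4, unspiked_result=6.1, spike_added=12.5)
msd = spike_recovery(spiked_result=17.6, unspiked_result=6.1, spike_added=12.5)
rpd = 100.0 * abs(ms - msd) / ((ms + msd) / 2.0)

if not (75.0 <= ms <= 125.0 and 75.0 <= msd <= 125.0) or rpd > 20.0:
    print('Possible matrix effect: qualify data near the regulatory or action level')
else:
    print('No significant matrix effect indicated')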

5.2.5 Review of sample management data


Review of sample management information/data involves records that
address sample traceability, storage conditions and holding times.
The field and laboratory organizations must evaluate whether the infor-
mation contained in the records is complete and whether it documents
improper preservation, exceeded holding times or improper sample
storage procedures.
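
A holding-time check of the kind implied above reduces to a simple comparison of dates, as in the following Python sketch; the times and the 14-day holding period are illustrative only.

# Minimal sketch: verify that analysis began within the holding time
# measured from sample collection.  Dates and the holding period are
# illustrative assumptions.
from datetime import datetime, timedelta

collected = datetime(1991, 5, 3, 10, 30)
analysed = datetime(1991, 5, 16, 9, 15)
holding_time = timedelta(days=14)

if analysed - collected > holding_time:
    print('Holding time exceeded: qualify or reject the result')
else:
    print('Holding time met')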

5.2.6 Calibration
Instrument calibration information including sensitivity checks, instru-
ment calibration and continuing calibration checks should be evaluated
and compared to historical information and acceptance criteria. Control
charts are a preferred method for documenting calibration performance.
Standard and reference material traceability must be evaluated and doc-
umented along with verifying and documenting their concentration/
purity.

5.3 Assessment of combined data34,52


At some point after the field and laboratory organizations have validated
and reported their data, these reported data are combined to determine the
data's usability. This assessment of data usability occurring after com-
pletion of the data collection activity involves evaluating the documenta-
tion of information and performance against established criteria.

5.3.1 Combined data assessment


The final data assessment involves integration of all data from the field
and laboratory activities. The data are evaluated for their conformance
with the DQOs in terms of their precision, accuracy, representativeness,
completeness and comparability. The precision and accuracy are evalu-
ated in terms of the method. Representativeness expresses the degree to
which the data represents the sample population. Completeness is
expressed as the percentage of valid field and laboratory data. Compar-
ability, based on the degree of standardization of methods and procedures,
attempts to express how one data set compares to another data set.
Comparisons of the field, rinsate, and trip blanks, which usually occur
only after the field and laboratory data have been combined, are accom-
plished by employing procedures similar to those used in evaluating
laboratory blanks. Sample management records should be assessed to
assure that sample integrity was maintained and not compromised.

5.3.2 Criteria for classification of data


The criteria for data classification must be specified in the QA plan and
based on the documentation of specific information and evaluation of the
data in specific quality terms. Where multiple data usability levels are
permitted, the minimum criteria for each acceptable level of data usability
must be specified in the QA plan. To be categorized as fully usable, the
data must be supported (documented) by the specified (as described in the
QA plan) minimum informational requirements.

5.3.3 Classification of the data


The assessment process should evaluate the data using procedures and
criteria documented in the QA plan. A report should be prepared which
describes each of the usability levels of the data for the DQOs. This report
should also delineate the reason(s) data have been given a particular
usability level classification.

6 FIELD AND LABORATORY STANDARD OPERATING PROCEDURES26,53

The following paragraphs specify recommended minimal levels of detail
for SOPs used in obtaining environmental data.

6.1 Sample control28


Sample control SOPs detail sample collection, management, receipt, hand-
ling, preservation, storage and disposal requirements, including chain of
custody procedures. The purpose of these procedures is to permit trace-
ability from the time samples are collected until they are released for final
disposition.
The field manager and laboratory director should appoint a sample
custodian with the responsibility for carrying out the provisions of the
sample control SOPs.

6.1.1 Sample collection and field management14,54


There are a large number of sample collection procedures necessary to
provide for proper collection of the wide variety of environmental
samples. Since sample collection is the most critical part of the environ-
mental assessment, it is essential that these sample collection protocols
(SOPs) be implemented as written. These SOPs should not be simple
references to a published method unless the method is to be implemented
exactly as written. Sample collection SOPs should include at least the
following sections: (1) purpose; (2) scope; (3) responsibility; (4) references;
(5) equipment required; (6) detailed description of the procedure; (7) QA
data acceptance criteria; (8) reporting and records; and (9) health and
safety considerations.
Field sample management SOPs should describe the numbering and
labeling system; sample collection point selection methods; chain of
custody procedures; the specification of holding times; sample volumes and
preservatives required for analysis; and sample shipping
requirements.
Samples should be identified by attaching a label or tag to the container
usually prior to sample collection. The label or tag used should be firmly
attached to the sample container, made of waterproof paper, and filled in
with waterproof ink. The label or tag should uniquely identify the sample
and include information such as: (1) date and time of sample collection;
(2) location of sample point; (3) sample type and preservative, if added; (4)
safety considerations; (5) company name and project identification; and
(6) observations, remarks, and name and signature of person recording the
data.

6.1.2 Sample receipt, handling and preservation54,55


These SOPs describe procedures to be followed in: opening sample ship-
ment containers; verifying chain of custody maintenance; examining
samples for damage; checking for proper preservatives and temperature;
assignment to the testing program; and logging samples into the laborat-
ory sample stream.
Samples should be inspected to determine the condition of the sample
and custody seal, if used. If any sample has leaked or any custody seal is
broken, the condition is noted and the custodian, along with the super-
visor responsible, must decide if it is a valid sample. Sample documenta-
tion is verified for agreement between chain of custody/shipping record,
sample analysis request form, and sample label. Any discrepancies must be
resolved prior to sample assignment for analysis.
The results of all inspections and investigations of discrepancies are
noted on the sample analysis request forms, as well as in the laboratory
sample logbook. All samples are assigned a unique laboratory sample
number and logged into the laboratory sample logbook or a computerized
sample management system.
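As a sketch of the login step just described, the snippet below assigns a unique laboratory sample number and records the receipt checks. The numbering scheme and the names of the checks are assumptions made for illustration, not taken from any cited standard.

```python
import itertools
from datetime import date

_lab_counter = itertools.count(1)

def log_in_sample(chain_of_custody_ok: bool, seal_intact: bool,
                  preservative_ok: bool, temperature_ok: bool) -> dict:
    """Assign a unique laboratory number and record receipt-inspection results."""
    lab_number = f"{date.today():%Y%m%d}-{next(_lab_counter):04d}"
    checks = {
        "chain_of_custody_ok": chain_of_custody_ok,
        "seal_intact": seal_intact,
        "preservative_ok": preservative_ok,
        "temperature_ok": temperature_ok,
    }
    return {
        "lab_number": lab_number,
        **checks,
        # Any discrepancy must be resolved by the custodian and the
        # responsible supervisor before the sample is assigned for analysis.
        "hold_for_review": not all(checks.values()),
    }
```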

6.1.3 Sample storage^54,55


Sample storage SOPs describe storage conditions required for all samples,
procedures used to verify and document storage conditions, and pro-
cedures used to ensure that sample custody has been maintained from
sample collection to sample disposal.
Sample storage conditions address holding times and preservation and
refrigeration requirements. The SOP dealing with storage conditions
should specify that a log be maintained and entries be made at least twice
a day, at the beginning and end of the workday. Such entries should note,
at minimum, the security condition, the temperature of the sample storage
area and any comments. The required sampling chain of custody pro-
cedures should minimally meet standard practices such as those described
in Ref. 28.
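A twice-daily storage log entry of the kind required above can be reduced to a very small record; the holding-time check is a generic illustration, with the limit itself coming from the method-specific SOP.

```python
from datetime import datetime, timedelta

def storage_log_entry(secure: bool, temperature_c: float, comments: str = "") -> dict:
    """One entry, made at the start and at the end of each workday."""
    return {
        "timestamp": datetime.now().isoformat(timespec="minutes"),
        "secure": secure,                # security condition of the storage area
        "temperature_c": temperature_c,  # temperature of the sample storage area
        "comments": comments,
    }

def within_holding_time(collected: datetime, analysed: datetime,
                        holding_time_days: float) -> bool:
    """True if analysis falls inside the method-specified holding time."""
    return analysed - collected <= timedelta(days=holding_time_days)
```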

6.1.4 Sample disposal^56


Sample disposal SOPs should describe the process of sample release and
disposal. The authority for sample release should be assigned to one
individual, the sample custodian. Laboratory and field management
should designate an individual responsible for waste disposal, with
adequate training, experience and knowledge to deal with the regulatory
aspects of waste disposal. Typical procedures used are listed in Section
2.2.5 above.

6.2 Standard and reagent preparation^57


Standard and reagent preparation SOPs detail the procedures used to
prepare, verify and document standard and reagent solutions, including
reagent-grade water and dilution water used in the laboratory. These
SOPs should include: (1) information concerning specific grades of mat-
erials used; (2) appropriate laboratory ware and containers for prepara-
tion and storage; (3) labeling and record-keeping for stocks and dilutions;
(4) procedures used to verify concentration and purity; and (5) safety
precautions to be taken.
For environmental analyses, certain minimal requirements must be met:
(1) use of reagent water meeting a standard equivalent to ASTM Type II
(ASTM D1193); (2) chemicals of ACS grade or better; (3) glass and plastic
laboratory ware, volumetric flasks and transfer pipets shall be Class A,
precision grade and within tolerances established by the National Institute
of Standards and Technology (NIST); (4) balances and weights shall be
calibrated at least annually (traceable to NIST); (5) solution storage
should include light-sensitive containers if required, storage conditions for
solutions should be within the range 15-30°C unless special requirements
exist, and determination of shelf life if not known; (6) purchased and in-house
calibration solutions should be assayed periodically; (7) labeling should
include preparation date, name and concentration of analyte, lot number,
assay and date; and (8) the preparation of calibration solutions, titrants
and assay results should be documented.
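The record-keeping items above can be supported by a small helper. The dilution arithmetic follows the usual C1V1 = C2V2 relation; the record fields are illustrative assumptions rather than a prescribed format.

```python
from datetime import date, timedelta

def dilution_concentration(stock_conc: float, aliquot_ml: float,
                           final_volume_ml: float) -> float:
    """Concentration of a working standard from C1*V1 = C2*V2."""
    return stock_conc * aliquot_ml / final_volume_ml

def standard_record(analyte: str, concentration: float, units: str,
                    lot_number: str, shelf_life_days: int) -> dict:
    """Label/logbook entry covering items (7) and (8) above (illustrative)."""
    prepared = date.today()
    return {
        "analyte": analyte,
        "concentration": f"{concentration:g} {units}",
        "lot_number": lot_number,
        "prepared": prepared.isoformat(),
        "expires": (prepared + timedelta(days=shelf_life_days)).isoformat(),
    }

# Example: a 1000 mg/L stock diluted 10 mL to 1 L gives a 10 mg/L working standard
print(dilution_concentration(1000.0, 10.0, 1000.0))
```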

6.3 Instrument/equipment maintenance^4,11


Instrument/equipment maintenance SOPs describe the procedures neces-
sary to ensure that equipment and instrumentation are operating within
the specifications required to provide data of acceptable quality. These
SOPs specify: (1) calibration and maintenance procedures; (2) perfor-
mance schedules for these functions; (3) record-keeping requirements,
including maintenance logs; (4) service contracts and/or service arrange-
ments for all non-in-house procedures; and (5) the availability of spare
parts. The procedures employed should include the manufacturer's recom-
mendations, to ensure that the performance of the equipment remains within
specifications.
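A schedule of the kind described in items (1)-(3) can be checked mechanically. The sketch below assumes a simple interval-based schedule, which is only one of the record-keeping approaches an SOP might adopt.

```python
from datetime import date, timedelta
from typing import Optional

def maintenance_due(last_serviced: date, interval_days: int,
                    today: Optional[date] = None) -> bool:
    """True when an instrument is due for scheduled calibration or maintenance."""
    today = today or date.today()
    return today - last_serviced >= timedelta(days=interval_days)

# Example: check whether an annual calibration has come due
print(maintenance_due(date(2024, 5, 1), interval_days=365))
```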

6.4 General field and laboratory procedures^4,8,9,11-14,21,26


General field and laboratory SOPs detail all essential operations or re-
quirements not detailed in other SOPs. These SOPs include: (1) prepara-
tion and cleaning of sampling site preparation devices; (2) sample collec-
tion devices; (3) sample collection bottles and laboratory ware; (4) use of
weighing devices; (5) dilution techniques; (6) use of volumetric glassware;
and (7) documentation of thermally controlled devices.
Many of the cleaning and preparation procedures have special require-
ments necessary to prevent contamination problems associated with spe-
cific analytes. Therefore, individual procedures are required for metals
analyses, various types of organic analyses, total phosphorus analysis,
and ammonia analysis, for example.^35,39,58

6.5 Analytical methods^35,37,39,58-61


Analytical method SOPs detail the procedures followed in the field or
laboratory to determine a chemical or physical parameter and should
describe how the analysis is actually performed. They should not simply
be a reference to standard methods, unless the analysis is to be performed
exactly as described in the reference method.
Whenever possible, reference method sources used should include those
published by the US Environmental Protection Agency (EPA), American
Public Health Association (APHA), American Society for Testing and
Materials (ASTM), Association of Official Analytical Chemists (AOAC),
Occupational Health and Safety Administration (OSHA), National In-
stitute for Occupational Safety and Health (NIOSH), or other recognized
organizations.
A test method that involves selection of options which depend on
conditions or the sample matrix requires that the SOP be organized into
subsections dealing with each optional path, with each optional path
treated as a separate test method. The laboratory must validate each SOP
by verifying that it can obtain results which meet the minimum require-
ments of the published reference methods. For test methods, this involves
obtaining results with precision and bias comparable to the reference
method and documenting this performance.
Each analyst should be provided with a copy of the procedure currently
being employed. Furthermore, the analyst should be instructed to follow
the procedure exactly as written. Where the test method allows for
options, the particular option employed should be reported with the final
results.
A standardized format for SOPs^26 should be adopted and employed.
This format should include: title, principal reference, application,
summary of the test method, interferences, sample handling, apparatus,
chemicals and reagents, safety, procedure, sample storage, calculation,
data management, QA and QC, references and appendices.
Of particular importance to this process are those method-specific QA
and QC procedures necessary to provide data of acceptable quality. These
method-specific procedures should describe the QA elements that must be
routinely performed. These should include equipment and reagent checks,
instrument standardization, linear standardization range of the method,
frequency of analysis of calibration standards, recalibrations, check stan-
dards, calibration data acceptance criteria and other system checks as
appropriate. Necessary statistical QC parameters should be specified
along with their frequency of performance and include the use of: QC
samples; batch size; reference materials; and data handling, validation and
acceptance criteria. The requirements for reporting low-level data for all
analytes in the method should be addressed and include limits of detection
and reporting level (see Section 3.3 above). Precision and accuracy of data
should be estimated and tabulated by the laboratory for the specific test
method, by both matrix and concentration. Finally, appropriate referen-
ces to any applicable QC SOPs should be included.
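As one way of tabulating precision and accuracy by matrix and concentration, the sketch below computes mean recovery and relative standard deviation from replicate spike results; the grouping keys and data layout are assumptions made for illustration.

```python
import statistics
from collections import defaultdict

def precision_accuracy_table(results):
    """results: iterable of (matrix, spike_conc, measured_conc) tuples.

    Returns {(matrix, spike_conc): (mean_recovery_%, rsd_%)} for every group
    with at least two replicates.
    """
    groups = defaultdict(list)
    for matrix, spike_conc, measured in results:
        groups[(matrix, spike_conc)].append(100.0 * measured / spike_conc)
    table = {}
    for key, recoveries in groups.items():
        if len(recoveries) >= 2:
            mean_r = statistics.mean(recoveries)
            rsd = 100.0 * statistics.stdev(recoveries) / mean_r
            table[key] = (round(mean_r, 1), round(rsd, 1))
    return table

# Example (hypothetical data): two matrices spiked at 5 ug/L
data = [("water", 5.0, 4.8), ("water", 5.0, 5.1), ("soil", 5.0, 4.2), ("soil", 5.0, 4.6)]
print(precision_accuracy_table(data))
```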

6.6 Quality control SOPs^29-31,48


Quality control SOPs address QC procedures necessary to assure that data
produced meet the needs of the users. The procedures should detail
minimal QC requirements to be followed unless otherwise specified in the
analytical method SOP. The topics addressed by these SOPs should
include: standardization/calibration requirements; use of QC samples;
statistical techniques for the method performance characteristic deter-
minations, detection limit values, precision and bias/recovery estimates;
and development of data acceptance criteria.
A useful approach to developing minimal QC requirements has been
recommended by A2LA^13 and is based on controlling the testing technol-
ogy involved; it is intended to augment, where necessary, the method-
specific requirements.
The SOPs concerning the use of QC samples should address the types,
purpose and frequency of employment. These SOPs provide information
and data for demonstrating measurement system control, estimating
matrix effects, evaluating method recovery/bias and precision, demon-
strating the effectiveness of cleaning procedures, and evaluating the reagent
and other blanks used to assess contamination.
The SOPs dealing with method performance characteristics should
detail the statistical procedure necessary to determine the value and uncer-
tainty associated with the laboratory measurement using validated
methods. It should be noted that there are a number of different and often
conflicting detection limit definitions and procedures for their determina-
tion. This topic needs to be clarified. However, it appears that currently
the soundest approach may be to use the method detection limit (MDL)
procedure proposed by the USEPA (see Table 1).^37
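For orientation only: as commonly summarized, the procedure of Ref. 37 takes at least seven replicate analyses of a low-level spike and multiplies their standard deviation by the one-sided 99% Student's t value for n - 1 degrees of freedom. The sketch below follows that outline and should be checked against the current text of the regulation before use.

```python
import statistics

# One-sided 99% Student's t values for n - 1 degrees of freedom (n = 7..10)
T_99 = {7: 3.143, 8: 2.998, 9: 2.896, 10: 2.821}

def method_detection_limit(replicates):
    """MDL = t(n-1, 0.99) * s for n replicate analyses of a low-level spike."""
    n = len(replicates)
    if n < 7:
        raise ValueError("the procedure requires at least seven replicates")
    if n not in T_99:
        raise ValueError("add the t value for this number of replicates to T_99")
    return T_99[n] * statistics.stdev(replicates)

# Example (hypothetical concentrations in ug/L)
print(round(method_detection_limit([1.9, 2.1, 2.0, 2.3, 1.8, 2.2, 2.0]), 2))
```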

6.7 Corrective actions


Corrective action SOPs describe procedures used in identifying and cor-
recting non-conformances associated with out-of-control situations occur-
ring within the testing process. These SOPs should describe specific steps
to be followed in evaluating and correcting loss of system control, such as:
reanalysis of the reference material, preparation of new standards,
reagents and/or reference materials, recalibration/restandardization of
equipment, evaluation of effectiveness of the corrective action taken,
reanalysis of samples, and/or recommending retraining of laboratory
analysts in the use of the affected procedures.
The evaluations of non-conformance situations should be documented
using a corrective action report which describes: (1) the non-conformance;
(2) the samples affected by the non-conformance; (3) the corrective action
taken and evaluation of its effectiveness; and (4) the date of corrective
action implementation.
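The four reporting elements above map directly onto a simple record; the field names here are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class CorrectiveActionReport:
    nonconformance: str          # (1) description of the non-conformance
    affected_samples: List[str]  # (2) samples affected by the non-conformance
    action_and_evaluation: str   # (3) action taken and evaluation of its effectiveness
    implemented_on: date         # (4) date of corrective action implementation
```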

6.8 Data reduction and validation^21,29-33,53,62


Data reduction and validation SOPs describe the procedures used in
reviewing and validating sample and QC data. They should include pro-
cedures used in computing, evaluating and interpreting results from analy-
ses of QC samples; and procedures used in certifying intrasample and/or
intersample consistency among multiparameter analyses and/or sample
batches.
Furthermore, these SOPs should also include those elements necessary
to establish and monitor precision and bias/recovery characteristics asso-
ciated with the analysis of the full range of QC samples: (1) blank samples
(field, trip and reagent); (2) calibration standards; (3) check standards; (4)
control standards; (5) reference standards; (6) duplicate samples; (7)
matrix spike samples; and (8) surrogate recovery samples.
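Two of the routine calculations behind these SOPs, percent recovery for spiked samples and relative percent difference for duplicates, are sketched below as they are commonly defined; the exact formulas and acceptance limits should be taken from the governing SOP.

```python
def percent_recovery(spiked_result: float, unspiked_result: float,
                     amount_spiked: float) -> float:
    """Spike recovery, % = 100 * (spiked - unspiked) / amount spiked."""
    return 100.0 * (spiked_result - unspiked_result) / amount_spiked

def relative_percent_difference(x1: float, x2: float) -> float:
    """RPD of duplicate results, % = 100 * |x1 - x2| / mean(x1, x2)."""
    return 100.0 * abs(x1 - x2) / ((x1 + x2) / 2.0)

print(percent_recovery(9.6, 1.2, 8.0))        # 105.0
print(relative_percent_difference(4.1, 4.5))  # ~9.3
```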

6.9 Reporting^11-13,33,36,63


Reporting SOPs describe the process for reporting testing results and
should clearly and unambiguously present the test results and all other
relevant information. These SOPs should include the procedures for: (1)
summarizing the testing results and QC data; (2) presentation format and
content, and the report review process; and (3) issuing and amending test reports.
Each testing report should minimally contain the following informa-
tion:^11-13 (1) name and address of the testing organization; (2) unique
identification of the report, and of each page of the report; (3) name and
address of the client; (4) description and identification of the test item; (5)
date of receipt of test item and date(s) of performance of the test; (6)
identification of the testing procedure; (7) description of the sampling
procedure, where relevant; (8) any deviations, additions to or exclusions
from the testing procedure, and any other information relevant to the
specific test; (9) disclosure of any non-standard testing procedure used;
(10) measurements, examinations and derived results, supported by tables,
graphs, sketches and photographs as appropriate; (11) where appropriate,
a statement on measurement uncertainty to include precision and bias/
recovery; (12) signature and title of person(s) accepting technical respon-
sibility for the testing report and date of report issue; and (13) a statement
that only a complete copy of the testing report may be made.
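Because the thirteen items above form a fixed checklist, completeness of a draft report can be verified mechanically. The sketch below simply flags required fields that are missing or empty; the field names are assumptions standing in for a project-specific report structure.

```python
REQUIRED_REPORT_FIELDS = [
    "testing_organization", "report_id", "client", "test_item",
    "dates_received_and_tested", "testing_procedure", "sampling_procedure",
    "deviations", "nonstandard_procedures", "results",
    "measurement_uncertainty", "authorization", "copy_statement",
]

def missing_report_fields(report: dict) -> list:
    """Return the required fields that are absent or empty in a draft report."""
    return [field for field in REQUIRED_REPORT_FIELDS if not report.get(field)]
```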

6.10 Records management^27,34


Records management SOPs describe the procedures for generating, con-
trolling and archiving laboratory records. These SOPs should detail the
responsibilities for record generation and control, and policies for record
retention, including type, time, security, and retrieval and disposal auth-
orities.
Records documenting overall laboratory and project specific operations
should be maintained in conformance with laboratory and project policy
and any regulatory requirements. These records include correspondence,
chain of custody, request for testing, DQOs, QA plans, notebooks, equip-
ment performance and maintenance logs, calibration records, testing data,
QC samples, software documentation, control charts, reference/control
material certification, personnel files, SOPs and corrective action reports.

6.11 Chemical and sample disposal^14,55


Chemical and sample disposal SOPs describe the policies and procedures
necessary to properly dispose of chemicals, standard and reagent solu-
tions, process wastes and unused samples. Disposal of all chemicals and
samples must be in conformance with applicable regulatory requirements.


These SOPs should detail appropriate disposal and pretreatment/recovery
methods. They must take into consideration the discharge requirements of
the publicly-owned treatment works and landfills. The pretreatment/
recovery SOPs should include recovery, reuse, dilution, neutralization,
oxidation, reduction, and controlled reactions/processes.

6.12 Health and safety^18-23,25,64


Health and safety SOPs describe policies and procedures necessary to meet
health and safety regulatory requirements in providing a safe and healthy
working environment for field and laboratory personnel engaged in en-
vironmental sample collection and testing operations. The SOPs should be
work-practice oriented, that is, detailing how to accomplish the task safely
and in conformance with appropriate regulatory requirements. For
example, in the UK the specific requirements of the Control of Substances
Hazardous to Health (COSHH) regulations must be fully addressed.
The SOPs should detail the procedures necessary for operation and
maintenance of laboratory safety devices including fume hoods, glove
boxes, miscellaneous ventilation devices, eye washes, safety showers, fire
extinguishers, fire blankets and self-contained breathing apparatus.
In addition, the SOPs should fully describe emergency procedures for
contingency planning, evacuation plans, appropriate first aid measures,
chemical spill plans, proper selection and use of protective equipment,
hazard communication, chemical hygiene plans, and information and
training requirements.

6.13 Definitions and terminology^14,34,39,65-70


The QA plan as well as associated SOPs and other documentation must
be supported by a set of standardized and consistently used definitions and
terminology. The quality system should have an SOP that describes and
lists the terms, acronyms and symbols that are acceptable for use.

ACKNOWLEDGEMENT

The author would like to thank James H. Scott, Senior Environmental
Analyst, Georgia Power Company, Atlanta, Georgia for his suggestions
and assistance with the preparation and proof-reading of this manuscript.

REFERENCES*

*The majority of the references cited are periodically reviewed and updated, the
reader is advised to consult the latest edition of these documents.

1. National Research Council, Final report on quality assurance to the Environ-
mental Protection Agency. Report to Environmental Protection Agency,
National Academy Press, Washington, DC, 1988.
2. American Association for Registration of Quality Systems, Program Specific
Requirements, 656 Quince Orchard Road, Gaithersburg, MD 20878, 1990.
3. American National Standards Institute, Quality management and quality
assurance standards-Guidelines for selection and use. American Society for
Quality Control, Designation ANSI/ASQC Q90, Milwaukee, WI, 1987.
4. American National Standards Institute, Quality systems-Model for quality
assurance in design/development, production, installation, and servicing. Am-
erican Society for Quality Control, Designation ANSI/ASQC Q91, Mil-
waukee, WI, 1987.
5. American National Standards Institute, Quality systems-Model for quality
assurance in production and installation. American Society for Quality
Control, Designation ANSI/ASQC Q92, Milwaukee, WI, 1987.
6. American National Standards Institute, Quality systems-Model for quality
assurance in final inspection and test. American Society for Quality Control,
Designation ANSI/ASQC Q93, Milwaukee, WI, 1987.
7. American National Standards Institute, Quality Management and quality
system elements-Guidelines. American Society for Quality Control, Desig-
nation ANSI/ASQC Q94, Milwaukee, WI, 1987.
8. US Environmental Protection Agency, Interim Guidelines and Specifications
for Preparing Quality Assurance Project Plans. QAMS-005/80, EPA-600/4-83-
004, USEPA, Quality Assurance Management Staff, Washington, DC, 1983.
9. US Environmental Protection Agency, Guidance for Preparation of Combined
Work/Quality Assurance Project Plans for Environmental Monitoring. OWRS
QA-l, USEPA, Office of Water Regulations and Standards, Washington, DC,
1984.
10. US Environmental Protection Agency, Manual for the Certification of Labora-
tories Analyzing Drinking Water. EPA-570/9-90-008, USEPA, Office of Drink-
ing Water, Washington, DC, 1990.
11. International Standards Organization, General requirements for the technical
competence of testing laboratories. ISO/IEC Guide 25, Switzerland, 1990.
12. American Association for Laboratory Accreditation, General Requirements
for Accreditation. A2LA, Gaithersburg, MD, 1991.
13. American Association for Laboratory Accreditation, Environmental Program
Requirements. A2LA, Gaithersburg, MD, 1991.
14. American Society for Testing and Materials, Standard practice for the genera-
tion of environmental data related to waste management activities. ASTM
Designation ES16, Philadelphia, PA, 1990.
15. American Society for Testing and Materials, Standard practice for prepara-
tion of criteria for use in the evaluation of testing laboratories and inspection
bodies. ASTM Designation E548, Philadelphia, PA, 1984.
16. American Society for Testing and Materials, Standard guide for laboratory
accreditation systems. ASTM Designation E994, Philadelphia, PA, 1990.
17. Locke, J.W., Quality, productivity, and the competitive position of a testing
laboratory. ASTM Standardization News, July (1985) 48-52.
18. American National Standards Institute, Fire protection for laboratories using
chemicals. National Fire Protection Association, Designation ANSI/NFPA
45, Quincy, MA, 1986.
19. Koenigsberg, J., Building a safe laboratory environment. American Laborat-
ory, 19 June (1987) 96-105.
20. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and
Health Standards, Subpart Z, Toxic and Hazardous Substances, Section
.1450, Occupational exposure to hazardous chemicals in laboratories, OSHA,
1990, pp 373-89.
21. American Society for Testing and Materials, Standard guide for good laborat-
ory practices in laboratories engaged in sampling and analysis of water.
ASTM Designation D3856, Philadelphia, PA, 1988.
22. Committee on Industrial Ventilation, Industrial Ventilation: A Manual of
Recommended Practice, 20th edn. American Conference of Governmental
Industrial Hygienists, Cincinnati, OH, 1988.
23. American National Standards Institute, Flammable and combustible liquids
code. National Fire Protection Association, Designation ANSI/NFPA 30,
Quincy, MA, 1987.
24. Kaufman, J.E. (ed.), IES Lighting Handbook: The Standard Lighting Guide,
5th edn. Illuminating Engineering Society, New York, 1972.
25. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and
Health Standards, Subpart H, Hazardous Materials, Section .106, Flammable
and combustible liquids, OSHA, 1990, pp. 242-75.
26. American Society for Testing and Materials, Standard guide for documenting
the standard operating procedure for the analysis of water. ASTM Designa-
tion D5172, Philadelphia, PA, 1991.
27. American Society for Testing and Materials, Standard guide for records
management in spectrometry laboratories performing analysis in support of
nonclinical laboratory studies. ASTM Designation E899, Philadelphia, PA,
1987.
28. American Society for Testing and Materials, Standard practice for sampling
chain of custody procedures. ASTM Designation D4840, Philadelphia, PA,
1988.
29. American Society for Testing and Materials, Standard guide for accountabil-
ity and quality control in the chemical analysis laboratory. ASTM Designa-
tion E882, Philadelphia, PA, 1987.
30. American Society for Testing and Materials, Standard guide for quality
assurance of laboratories using molecular spectroscopy. ASTM Designation
E924, Philadelphia, PA, 1990.
31. US Environmental Protection Agency, Handbook for Analytical Quality
Control in Water and Wastewater, Laboratories. EPA-600/4-79-019, USEPA,
Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1979.
32. American Society for Testing and Materials, Standard practice for determina-
tion of precision and bias of applicable methods of Committee D-19 on water.
ASTM Designation D2777, Philadelphia, PA, 1986.
33. American Society for Testing and Materials, Standard practice for intralabo-
ratory quality control procedures and a discussion on reporting low-level data.
ASTM Designation D4210, Philadelphia, PA, 1989.
34. US Environmental Protection Agency, Guidance Document for Assessment of
RCRA Environmental Data Quality. DRAFT, USEPA, Office of Solid Waste
and Emergency Response, Washington, DC, 1987.
35. US Environmental Protection Agency, Methods for Chemical Analysis of
Water and Wastes. EPA/600/4-79-020, Environmental Monitoring and
Support Laboratory, Cincinnati, OH, revised 1983.
36. Keith, L.H., Crummett, W., Deegan, J., Jr., Libby, R.A., Taylor, J.K. &
Wentler, G., Principles of environmental analysis. Analytical Chemistry, 55
(1983) 2210-18.
37. US Code of Federal Regulations, Title 40, Part 136, Guidelines establishing
test procedures for the analysis of pollutants, Appendix B-Definition and
procedure for the determination of the method detection limit, Revision 1.11,
USEPA, 1990, pp. 537-9.
38. US Environmental Protection Agency, User's Guide to the Contract Laborat-
ory Program: and Statements of Work for Specific Types of Analysis. Office of
Emergency and Remedial Response, USEPA 9240.0-1, December 1988,
Washington, DC.
39. US Environmental Protection Agency, Test Methods for Evaluating Solid
Waste, SW-846, 3rd edn, Office of Solid Waste (RCRA), Washington, DC,
1990.
40. US Code of Federal Regulations, Title 40, Protection of Environment, Part 141
-National Primary Drinking Water Regulations, Subpart C: Monitoring and
Analytical Requirements, Section .24, Organic Chemicals other than total
trihalomethanes, sampling and analytical requirements, USEPA, 1990, pp.
574-9.
41. Britton, P., US Environmental Protection Agency, Estimation of generic
acceptance limits for quality control purposes in a drinking water laboratory.
Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1989.
42. Britton, P., US Environmental Protection Agency, Estimation of generic
quality control limits for use in a water pollution laboratory. Environmental
Monitoring and Support Laboratory, Cincinnati, OH, 1989.
43. Britton, P. & Lewis, D., US Environmental Protection Agency, Statistical
basis for laboratory performance evaluation limits. Environmental Monitor-
ing and Support Laboratory, Cincinnati, OH, 1986.
44. US Environmental Protection Agency, Data Quality Objectives for Remedial
Response Activities Example Scenario: RI/FS Activities at a Site with Contami-
nated Soils and Ground Water. EPA/540/G-87/004, USEPA, Office of Emer-
gency and Remedial Response and Office of Waste Programs Enforcement,
Washington, DC, 1987.
45. US Environmental Protection Agency, Field Screening Methods Catalog:
User's Guide. EPA/540/2-88/005, USEPA, Office of Emergency and Remedial
Response, Washington, DC, 1988.
46. US Environmental Protection Agency, Guidance Document for the Preparation
of Quality Assurance Project Plans. Office of Toxic Substances, Office of
Pesticide and Toxic Substances; Battelle Columbus Division, Contract No.
68-02-4243, Washington, DC, 1987.
47. Worthington, J.C. & Lincicome, D., Internal and third party quality control
audits: more important now than ever. In Proceedings of the Fifth Annual
Waste Testing and QA Symposium, ed. D. Friedman. USEPA, Washington,
DC, 1989, pp. 1-7.
48. US Environmental Protection Agency, NPDES Compliance Inspection
Manual. EN-338, USEPA, Office of Water Enforcement and Permits,
Washington, DC, 1984.
49. Carlberg, K.A., Miller, M.S., Tait, S.R., Beiro, H. & Forsberg, D., Minimal
QA/QC criteria for field and laboratory organizations generating environmen-
tal data. In Proceedings of the Fifth Annual Waste Testing and QA Symposium,
ed. D. Friedman. USEPA, Washington, DC, 1989, p. 321.
50. Liabastre, A.A., The A2LA accreditation approach: Lab certification from an
assessor's point of view. Environmental Lab., 2 (Oct. 1990) 24-5.
51. American Association for Laboratory Accreditation, Environmental field of
testing checklists for potable water, nonpotable water, and solid/hazardous
waste. A2LA, Gaithersburg, MD, 1990.
52. US Environmental Protection Agency, Report on Minimum Criteria to Assure
Data Quality. EPA/530-SW-90-021, USEPA, Office of Solid Waste, Washing-
ton, DC, 1990.
53. Ratliff, T.A., The Laboratory Quality Assurance System, Van Nostrand Rein-
hold, New York, 1990.
54. US Environmental Protection Agency, Handbook for Sampling and Sample
Preservation of Water and Wastewater. Document No. EPA-600/4-82-029,
Environmental Monitoring and Support Laboratory, Cincinnati, OH.
55. American Society for Testing and Materials, Standard practice for estimation
of holding time for water samples containing organic and inorganic con-
stituents. ASTM Designation D4841, Philadelphia, PA, 1988.
56. American Society for Testing and Materials, Standard guide for disposal of
laboratory chemicals and samples. ASTM Designation D4447, Philadelphia,
PA, 1990.
57. American Society for Testing and Materials, Standard practice for the prep-
aration of calibration solutions for spectrophotometric and for spectroscopic
atomic analysis. ASTM Designation E1330, Philadelphia, PA, 1990.
58. American Public Health Association, American Water Works Association
and Water Pollution Control Federation, Standard Methods for the Examina-
tion of Water and Wastewater, 16th edn. American Public Health Association,
Washington, DC.
59. US Department of Labor, Occupational Safety and Health Administration,
Official analytical method manual. OSHA Analytical Laboratory, Salt Lake
City, UT, 1985.
60. US Department of Health and Human Services, National Institute for Oc-
cupational Safety and Health, NIOSH manual of analytical methods, 3rd edn.
NIOSH Publication No. 84-100, Cincinnati, OH, 1984.
61. Horwitz, W. (ed.), Official Methods of Analysis of the Association of Official
Analytical Chemists, 14th edn. Association of Official Analytical Chemists,
Washington, DC, 1985.
62. American Society for Testing and Materials, Standard practice for the verifi-
cation and the use of control charts in spectrochemical analysis. ASTM
Designation E1329, Philadelphia, PA, 1990.
63. American Society for Testing and Materials, Standard method of reporting
results of analysis of water. ASTM Designation D596, Philadelphia, PA, 1983.
64. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and
Health Provisions, OSHA, 1990.
65. International Standards Organization, Quality-Vocabulary, ISO Designa-
tion 8402, Switzerland, 1986.
66. American Society for Testing and Materials, Standard terminology for statis-
tical methods. ASTM Designation E456, Philadelphia, PA, 1990.
67. American Society for Testing and Materials, Standard terminology relating to
chemical analysis of metals. ASTM Designation E1227, Philadelphia, PA,
1988.
68. American Society for Testing and Materials, Standard practice for use of the
terms precision and bias in ASTM test methods. ASTM Designation E177,
Philadelphia, PA, 1986.
69. American Society for Testing and Materials, Standard definitions of terms
relating to water. ASTM Designation D1129, Philadelphia, PA, 1988.
70. American Society for Testing and Materials, Standard specification for
reagent water. ASTM Designation D1193, Philadelphia, PA, 1983.
Index

Absolute deviation, mean, 24 Backwards elimination, 100


Accuracy, 21, 283-4 Beer-Lambert law, 206
see also Precision Bias, 21, 274
Adjusted dependent variables, 125 Bimodal distribution, 17
AF, see Autocorrelation function Bivariate data representation, 231-6
Analysis of variance table, see graphics for correlation/regression,
ANOVA 231-3
Analytical methods, 290-1 regression analysis, diagnostic plots,
Anderson's glyphs, 245 233-5
Andrew's curves, 245-6 robust/resistant methods, 236
ANOVA, 97-8,198-201 Blanks, 278-9
Approximate Cook statistic, 128 Blank solution, 210-11
Arithmetic mean, 12-7 Box plots, 33-4
definition of, 12-23 Box and whisker displays, 33-4
frequency tables and, 15-7 extended versions, 221-2
properties of, 14-5 notched box, 222
see also Mean simple, 220-1
Audits, Business graphics, 251
checklists, 282
corrective action and, 282-3
internal, 280-1
third party, 281 Calibration, 206-12, 269-70, 274,
types of, 281-2 285-6
Autocorrelation function, 42 blanks, 210-11
Autoregressive filter, 44 linear, 206-10
Autoregressive process, 43 standard deviation, 211-12
Canonical link function, 121
Carbon dioxide, and time-series, 62-8
Casement plots, 239
Background samples, 278 Cellulation, 232
Back-to-back stem-and-leaf display, Central limit theory, 188-9
32-3,233 Centralised moving average filters, 61


Checklists, 282 Data assessment, 283-6


Chemical disposal, 293-4 combined, 286
Chemicals, 268 field, 283-4
Chernoff faces, 246-7 laboratory, 284-6
Classification Data behaviour types, 216
of data, 286 Data classification criteria, 286
graphical, 249-50 Data distribution, graphical
Cluster analysis, 249-50 assessment, 225-8
Coded symbols, on maps, 248 confidence interval plots, 227
Coding, and standard deviation, 27-9 continuous data, 227-8
Coefficient of variation, see Relative discrete data distribution, 225
standard deviation Ord's procedure, 226
Collection, of samples, 287-8 Poissonness plots, 226-7
Collinearity, 100 quantile-quantile plots, 228
Combined (field/laboratory) data suspended rootograms, 228-9
assessment, 286 Data distribution shape, 229-31
Complementary log-log link function, mid vs. spread plots, 230
129 mid vs. Z2 plots, 230
Computer software, for illustrative pseudosigma vs. ; plots, 230
techniques, 250-4 push back analysis, 230
business, 251 skewness and, 231
general purpose, 251 upper vs. lower letter value plots,
mapping, 251, 253 229
see also specific names of Data interpretation, 163-5
Confidence interval plots, 227 Data quality audit, 282
Confounded error, 198 Data quality objectives, 259-60
Continuous data distribution, 227 Data representation, see Visual data
Contour, 247-8 representation
Contract audit, 282 Data screening, in factor/correlation
Cook statistic, 128 analysis, 155-62
Correction, 200 decimal point errors, 155-7
Corrective actions, 292 interfering variables, 158-62
audits, 282-3 random variables, 157-8
Correlation, 87-91 Data storage, 269, 271-2
analysis, 42-44 document, 272-3
coefficient, 87 Data types, 217
Correlative analysis, 139-80 Decimal point errors, 155-7
application, 155-76 Decision limit, 204
Eigenvector analysis, 142-55 Decomposition of variability, 95-9
interpretation of data, 163-5 Degrees of freedom, 92, 97
receptor modelling, 165-71 Demming-Mandel regression, 231-2
screening data, 155-62 Dendrograms, 249-50
Critical region, and significance Depth, of data, 220
testing, 191 Descriptive statistics, 1-35
Cumulative frequency graphs, 10-11 diagrams, 7-11
Cumulative frequency plots, 233 exploratory data analysis, 31-4
Cumulative tables, 6-7 measures of, 11-30
CV, see Relative standard deviation dispersion, 21-30

Descriptive statistics-contd. Equipment blanks, 278


measures of-contd. Error, 181-90
location, 11-21 central limit theory, 188-9
skewness, 30 confounded, 198
random variation, 1-2 distribution, 184-5
tables, 3-7 normal, 185-8
Designed experimentation, 86 propagation of, 189-90
Detection limits, 274-5 types of, 181-4
analytical methods, 203-6 I/II, 203-6
definitions of, 276-7 gross, 181-2
Determination limit, 206 random, 182
Deviance, 126-7 systematic, 182-4
DHRSMOOTH, 58, 59, 61 Error bias, 21-2
Diagrams, 7-11 Error propagation, 189-90
cumulative frequency graphs, 10-11 multiplicative combination, 189-90
histograms, 8-10 summed combinations, 189
line, 7-8 Estimable combinations, 115
Discrete data distributions, 225-7 Explanatory variables, 80
Discrete variables, 2 Exploratory data analysis, 31-4,
continuous, and, 2 215-6
Dispersion, 121 box-and-whisker displays, 33-4,
measures of, 21-30 220-1
interquartile range, 23-4 stem-and-Ieaf displays, 33-4,
mean absolute deviation, 24 219-20
range, 22
standard deviation, 24-9
variance, 24-9 Factor analysis, 139-80
Dispersion matrix, 142-5 applications of, 155-76
Display, see specific types of data interpretation, 163-5
DLM, see Dynamic Linear Model data screening, 155-62
Document storage, 272-3 receptor modelling, 165-71
Dot diagram, 7 Factor axes rotation, 153-5
Draftsman's plots, 239 FANTASIA,171
Duplicate samples, 279 Far out values, 221
Dynamic autoregression, 50 Field blanks, 278
Dynamic harmonic model, 52 Field data assessment, 283-4
Dynamic Linear Model, 49-53 comparisons, 284
completeness of, 283
quality plan variances, 284
EDA, see Exploratory data analysis validity of, 283, 284
Eigenvector analysis, 142-55 Field duplicates, 278
calculation in, 145-50 Field quality-control samples, 278
dispersion matrix, 142-5 Field managers, 263-4
factor axes rotation, 153-5 Field records, 271-2
number of retained factors, 151-3 Flattened letter values, 230
Elongation, 231 Flicker noise, 202
Entropy spectrum, 46 Forward interpolation, 56
Environmetrics, definition of, 37 Forward selection, 100

Four-variable, two-dimensional views, IRWSMOOTH--contd.


239-42 Raup Sepkoski extinction data,
Frequency-domain time-series, see and,68-73
Spectral analysis Isolated discrepancy, 102
Frequency plots, 218-9 Iterative weighted least squares, 125
Frequency tables, 3-6
F-test, 197-8
Jackknifed deviance residuals, 127-8
Jackknifed residuals, 104-5
Gaussian letter spread, 223 Jittered scatter plots, 236-7
Generalized linear model, 119-23 'Junk', graphical, 237-8
deviance and, 126-7
diagnostics, 127-8
estimation, 123-6 Kalman filtering algorithm, 53-6
parts comprising, 119-23 Kleiner-Hartigan trees, 243-4
residuals and, 127-8 Kolmogorov-Smirnov test, 219, 233
Generalized random walk, 50-3
GENSTAT,133-4
Geometric mean, 20-1
GLIM, 133-4 Laboratories, 266-7
Glyphs, 245 safety in, 293-4
Graphical classification, 249-80 Laboratory blanks, 279, 284
Graphical one-way analysis, 236-7 Laboratory control samples, 279
Graphs, 216, 217 Laboratory data assessment, 284-6
Gross error, 181-2 Laboratory directors, 263-4
Gumbel distribution, 132 Laboratory quality-control samples,
279-80
Laboratory records, 271-2
Handling, of samples, 268, 288 Lag window, 46
Hanging histobars, 228-9 'Lagrange Multiplier' vector, 54
Leaf, see Stem-and-leaf displays
Harmonic regression, 62
Health & safety, 266-8, 293-4 Least squares
estimates, 92
Hierarchical cluster analysis, 249-50
method,208
Histobars, hanging, 228-9
weighted, 111-2
Histograms, 8-10, 218-9
Hybrid displays, 217-8
see also Recursive estimation
Letter value displays, 223-5
Letter value plots, upper vs. lower,
229
Illustrative techniques, software for, Leverage, 102, 235
250-4 Likelihood, 94
Inner fences, 221 equation, 124
Instrumentation, 269-70 Limits of detection, 274-5
Interfering variables, 158-62 Linear calibration, 206-10
Internal audits, 280-1 Linear correlation, 87
Interquartile range, 23-4 Linear regression, 91-112
IRWSMOOTH,57-9 basics of, 91-5
Mauna Loa CO2 data, and, 62-8 decomposition, 95-9

Linear regression-contd. Mode, 17


model checking, 101-8 Moments, 30, 42
model selection, 99-101 Multicode symbols, 241-7
weighted least squares, 111-2 Anderson's glyphs, 245
Line diagrams, 7-8 Andrew's curves, 245-6
Link functions, 120 assessment, 247
Local high data density plotting, 232 Chernoff faces, 246-7
Location, measures of, 11-21 Kleiner-Hartigan trees, 243-4
mean, 12-7; see also Arithmetic profile, 242
mean star, 242
median, 17-20 Multidimensional data display,
geometric, 20-1 multicode symbols, 241-7
mode, 17 Multiple notched box plots, 236-7
Logistic, 129 Multiple one-dimensional scatter
Log likelihood, 93 plots, 236
Low level data, 285 Multiple whisker plots, 236-7
Multiplicative combination, of errors,
189-90
Multivariate environmental data,
139-80, 236-50
Maintenance, 289 graphical classification, 249-80
instrumentation, of, 269-70 graphical one-way analysis, 236-7
Malinowski empirical indicator multicode symbols, 241-7
function, 152 spatial display, 247-9
Maps, 247-8 two dimensions, in, 237-41
coded symbols, 248
graphs, combined, and, 249
Masking, 101
Material blanks, 278 Nesting, 126
Matrix effects, 285 Noise, see Signal-to-noise ratio
Matrix spike, 279 'Noise variance ratio' matrix, 54
Mauna Loa CO 2 data, 62-8 Nonhierarchical cluster analysis,
Maximum entropy spectrum, 46 249-50
Maximum likelihood, 56-7 Nonstationarity, 37
Mean Nonstationary time-series analysis,
nonstationarity of, 37, 39 37-77
standard error of, 188-9 model, 49-53
t-test and, 194-6 model identification/estimation,
see also Arithmetic mean and also 56-61
Geometric mean and Location, practical examples, 61-73
measures of prior evaluation, 40-6
Mean absolute deviation, 24 recursive forecasting, 53-6
Measurement system control, 273-80 smoothing algorithms, 53-6
Median, 17-20 Normal distribution, 184-8
Method blanks, 279 Notched box, multiple, 236-7
microCaptain, 40 Notched box plots, 222
Midplots, 230 Null hypothesis, 190-1
MINITAB, 222 Nyquist frequency, 45

Observation, 86 Quality assurance-contd,


Ogive, 10 program, 260-1
One-way layout, 112 records, 271-2
Ord's procedure, 226 SOPs, 270-1, 287-94
Outer fences, 221 support staff, 264-5
Outlier identification, 216 technical staff, 264
Outliers, lOl, 235 use of subcontractors, 265
Quality systems audits, see Audits
Quantile-quantile plots, 228
empirical, 233
Paired experimentation, and I-test,
Quartiles, 23-4
196-7
lower, 23
Parametric nonstationarity, 59-60
middle, 23
Parent distribution, 185
upper, 23
Partial autocorrelation function, 44
Pearson residual, 128
Percentage relative standard
Random errors, 182
deviation, 30
Random variables, 157-8
Performance audits, 281-2
Random variation, 1-2, 79-86
Periodograms, 45
Range, 22
Personnel, and QA, 263-5
interquartile, 23-4
support staff, 264-5 Raup Sepkoski extinctions data,
technical staff, 264 68-73
Poissonness plots, 226-7 Receptor modelling, 165-71
Population parameters, 25-6 Records, 271-2
Precision, 21, 183-4, 274 management, 269, 271-3, 293
Prior evaluation, in time-series, 40-6 storage, 269
correlation analysis, 42-4 Recursive estimation, 46-9
spectral analysis, 44-6 Recursive forecasting, 53-6
Probability plots, 228 forecasting, 55
Probit, 129 forward interpolation, 56
Profile symbols, 242 see also Smooth algorithms
Pseudosigma, 230 Reference materials, 279
Push back analysis, 230 Regression, 91-112, 209
see also Linear regression
Regression lines, 209
Qualifications, of staff, 263-5 Regulatory compliance audit, 282
Quality assurance, 259-99 Relative frequency, 4
audits, 280-3 Relative standard deviation, 29-30,
data assessment, 283-6 187
document storage, 272-3 Repeatability, 184
equipment, 269-70 Reporting
facilities, 265-9 limits, 274-5
function, 265 SOPs, of, 293
management, 263-4 Reproducibility, 184
measurement system control, Residual plots, 234-5
273-80 Residuals, 102, 215
organization, 261-5 Residual standard deviation, 209

Residual sum of squares, 92, 209 Smoothing algorithm-contd.


Resistance, 215 forward interpolation, 56
Resistant methods, 236 Kalman filtering, 53-4
Retained factors, in Eigenvector S/N, see Signal-to-noise ratio
analysis, 151-3 Software, 251, 253
RLS, see Recursive estimation SOPs, 261, 270-1, 287-94
Robust methods, 236 analytical methods, 290-1
Root-mean-square, 24 chemicals, 293-4
ROOTOGRAM, 228-9 corrective action, 292
data reduction, 292
definition, 294
Safety, 294 health & safety, 294
chemicals, 268 instrument maintenance, 289
handling, 268 quality control, 291-2
laboratory, 266-7 records management, 293
ventilation, 267-8 reporting, 293
waste, 268, 293-4 sample control, 287-9
Samples standard preparation, 289
autocorrelation function, 42 terminology, 294
collection of, 287-8 validation, 292
covariance matrix, 42 wastes, 293-4
estimates, 25-6 Sorted binary plot, 240-1
handling of, 268 Spatial data display, 247-9
large, and Ord's procedure, 226 coded symbols, 248
partial autocorrelation function, 42 colour and, 247-8
quality control, 275-80 graph-map combinations, 249
field,278 Spectral analysis, 44-6
laboratory, 279-80 Spectral decomposition, 56-7
Scatter plots, 87-91, 214, 219 Spiking, 279
jittered, 236-7 Standard addition, 211-2
multiple one-dimensional, 236 Standard deviation
sharpened, 232 calculations, 26-7
two-dimensional, 231-2 frequency tables, 29
Schuster periodograms, 45 shortcut, 27
Shade, 247-8 coding, 27-9
Sharpened scatter plots, 232 definition, calculation, 26-7
Shipping, 268 frequency, see Relative standard
Signal-to-noise ratio, 201-3 deviation
Significance testing, 190-201 population parameters, 25-6
F-test, 197-8 residual, 285, 209
t-test, 192-7 sample estimates, 25-6
means comparison, 194-6 shortcut calculation, 27
paired experimentation, 196-7 Standard error, of sample mean, 188
variance analysis, see ANOVA Standard operating procedures, see
Sign Test, 222 SOPs
Skewness, 30, 231 Standard preparation, 289
Smoothing algorithm, 53-6 Standardized Pearson residual, 128
forecasting, 55 Standardized residuals, 101

Star symbols, 242 Time-series-contd.


StatGraphics™,40 frequency domain, in, see Spectral
Stationarity, time-series, 41 analysis
Stationary time-series, 41 time domain, in, see Correlation
Statistical software, 251, 253 analysis
Stem-and-leaf display, 31-3, 219-20, Time-series models, 49-53
233 DLR,50-3
back-to-back, 32-3, 233 seasonal adjustment, 60-1
Stepwise regression, 100 identification/estimation, 56-7
Storage, 268, 288 IRWSMOOTH, 57-9
data, of, see Data storage parametric nonstationarity, 59-60
documents, of, 272-3 Time variable parameters, 46-9
see also under Health & Safety DLM,49-53
Strip box plots, 233 Transformations, 108-10
Student's t-test, see t-test Trees, see Kleiner-Hartigan trees
Subcontractors, 265 Trip blanks, 278
Summed combinations, in error t-test, 192-7
propagation, 189 means comparison, 194-6
Sunflower symbols, 232 paired experimentation, 196-7
Support staff, 264-5 Tukey analysis, 31-4
Surrogate standards, 279-80 Two-dimensional multivariate data
Suspended rootograms, 228-9 illustrations, 237-41
SYSTAT, 222, 246 casement plots, 239
System audits, 281 draftsman's plots, 239
Systematic discrepancy, 101 four or more variables, 239-41
Systematic errors, 21, 282-4 one-dimensional views, 238-9
Systematic variation, 79-86 three variables, 239
Two-dimensional scatter plots, 231-2
Type I/II errors, 203-6

Tables, 216
presentation of, 3-7
cumulative, 6-7
frequency, 3-6, 15-7,29 Uncorrelated variables, 87
mean, of, 15-7 Unimodal distribution, 17
standard deviation, of, 29 Univariate methods, see
Target transformation factor analysis Nonstationary, time-series
(TTFA), 169-76 analysis
t-distribution, 192-3
Technical staff, 264
Third party audits, 281
Three-variable, two-dimensional Validation, 283-4, 292
views, 239 Variability, decomposition, 95-9
Time-domain time-series, see Variable transformation, 234-5
Correlation analysis Variables, see specific types of
Time-series Variance, 24-9, 112-9
analysis, see Nonstationary see also Standard deviation
time-series analysis Variance analysis, see ANOVA

Variance function, 122 software for illustrations, 250-4


Variance intervention, 59-60 uses/misuses, 213, 215
Variates, 2
Variation, 79-86
see also Random variation and also Waldemeir sunspot data, 38-9
Relative standard deviation Waste, 268
Vectors, see Eigenvector analysis disposal, 288-9, 293-4
Ventilation systems, 267-8 Weighted least squares, 111-2
Visual data representation, 213-58 Weibull distribution, III
bivariate data, 231-6 Whisker plots, 222
data types, 217 multiple, 236-7
display, 217-8 see also Box-and-whisker displays
exploratory data analysis, 215-7
multivariate data, 236-50
single variable distributions, 218-31 Z2 plots, 230
