
LCGC Europe Online Supplement

statistics and data analysis

Understanding the Structure of Scientific Data
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This is the first in a series of articles that aims to promote the better use of
statistics by scientists. The series intends to show everyone from bench
chemists to laboratory managers that the application of many statistical
methods does not require the services of a statistician or a mathematician
to convert chemical data into useful information. Each article will be a concise introduction to a small subset of methods. Wherever possible, diagrams
will be used and equations kept to a minimum; for those wanting more theory, references to relevant statistical books and standards will be included.
By the end of the series, the scientist should have an understanding of the
most common statistical methods and be able to perform the test while
avoiding the pitfalls that are inherent in their misapplication.

In this article we look at the initial steps in


data analysis (i.e., exploratory data analysis),
and how to calculate the basic summary
statistics (the mean and sample standard
deviation). These two processes, which
increase our understanding of the data
structure, are vital if the correct selection of
more advanced statistical methods and
interpretation of their results are to be
achieved. From that base we will progress to
significance testing (t-tests and the F-test).
These statistics allow a comparison between
two sets of results in an objective and
unbiased way. For example, significance
tests are useful when comparing a new
analytical method with an old method or
when comparing the current day's
production with that of the previous day.
Exploratory Data Analysis
Exploratory data analysis is a term used to
describe a group of techniques (largely
graphical in nature) that sheds light on the
structure of the data. Without this
knowledge the scientist, or anyone else,
cannot be sure they are using the correct
form of statistical evaluation.
The statistics and graphs referred to in this
first section are applicable to a single
column of data (i.e., univariate data), such
as the number of analyses performed in a
laboratory each month. For small amounts
of data (<15 points), a blob plot (also

known as a dot plot) can be used to


explore how the data set is distributed
(Figure 1). Blob plots are constructed
simply by drawing a line, marking it off
with a suitable scale and plotting the data
along the axis.
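The construction described above can be sketched in a few lines of Python. This is only an illustration, not part of the original article: the function name `blob_plot`, the text-mode rendering and the example values are all assumptions made for the sketch.

```python
def blob_plot(values, lo, hi, width=50):
    """Return a text blob (dot) plot: one 'o' per result above a scaled axis.

    `lo` and `hi` are the ends of the scale; `width` is the axis length in
    characters. All names here are hypothetical, chosen for this sketch.
    """
    if not lo < hi:
        raise ValueError("lo must be smaller than hi")
    if any(v < lo or v > hi for v in values):
        raise ValueError("all values must lie on the scale [lo, hi]")

    # Step 1: mark the scale off into columns and place each result on it.
    counts = {}
    for v in values:
        column = round((v - lo) / (hi - lo) * width)
        counts[column] = counts.get(column, 0) + 1

    # Step 2: stack repeated results vertically so none are hidden.
    rows = []
    for level in range(max(counts.values()), 0, -1):
        row = "".join("o" if counts.get(c, 0) >= level else " "
                      for c in range(width + 1))
        rows.append(row.rstrip())

    # Step 3: draw the axis line and label both ends of the scale.
    rows.append("-" * (width + 1))
    rows.append(f"{lo:<{width // 2}}{hi:>{width - width // 2 + 1}}")
    return "\n".join(rows)


# Hypothetical data: nine results, one of them well away from the rest.
example = [0.21, 0.22, 0.22, 0.25, 0.27, 0.30, 0.31, 0.33, 0.58]
print(blob_plot(example, lo=0.1, hi=0.7))
```

Even in this crude form the isolated point on the right stands out, which is the purpose of the plot.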
A stem-and-leaf plot is yet another
method for examining patterns in the data
set. These are complex to describe and
perceived as old fashioned, especially with
the modern graphical packages available
today. For the sake of completeness they
are described in Box 1.
For larger data sets, frequency
histograms (Figure 2(a)) and Box and
Whisker plots (Figure 2(b)) may be better
options to display the data distribution.
Once the data set is entered, or as is more
usual with modern instrumentation,
electronically imported, most modern PC
statistical packages can construct these

graph types with a few clicks of the


mouse. All of these plots can give an
indication of the presence or absence of
outliers (1). The frequency histogram, stem-and-leaf plot and blob plot can also
indicate the type of distribution the data belong to. Remember that
if the data set comes from a non-normal (2) distribution (Figure 2(a) and possibly
Figure 1(a)), what looks like an outlier may in fact be a good piece of
information. The outliers are the most
extreme points on the right-hand side of
Figures 1(a) and 2(a). Note: Outliers, outlier
tests and robust methods will be the
subject of a later article.
Assuming there are no obvious outliers, we still have to do one more plot to make sure we understand the data structure. The individual results should be plotted against a time index (i.e., the order in which the data were obtained). If any systematic trends are observed (Figures 3(a)–3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)); if this is not the case, the interpretation of the statistics can be erroneous.

[Figure 1: Blob plots of the raw data, panels (a) and (b), each drawn against a scale with the mean marked.]
Summary Statistics
Summary statistics are used to make sense of large amounts of data. Typically, the mean, sample standard deviation, range, confidence intervals, quantiles (1), and measures of the skewness and spread/peakedness (kurtosis) of the distribution are reported (2). The mean and sample standard deviation are the most widely used and are discussed below, together with how they relate to the confidence intervals for normally distributed data.

Box 1: Stem-and-leaf plot
A stem-and-leaf plot is another method of examining patterns in the data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting each data value into the stem (the leading digits; in the plot below, from 0.1 to 0.6) and the leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 as 3|5. Note that the decimal places are truncated, not rounded, in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on that line and on the lines between it and the nearer end of the range). Thus, there are 13 points that lie between 0.40 and 0.63. The line containing the middle value is marked differently: its count (the number of items on that line) is enclosed in parentheses.

Stem-and-leaf plot
Units = 0.1    1|2 = 0.12    Count = 42

   5    1|22677
  14    2|112224578
 (15)   3|000011122333355
  13    4|0047889
   6    5|56669
   1    6|3

The Mean
The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. This statistic is easily found using a calculator or spreadsheet and simply involves summing the individual results (x1, x2, x3, ..., xn) and dividing by the number of results (n):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$$

Unfortunately, the mean is often reported as an estimate of the true value (μ) of whatever is being measured without considering the underlying distribution. This is a mistake. Before any statistic is calculated, the raw data should be carefully scrutinized and plotted as described above. An outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

[Figure 2: Frequency histogram and Box and Whisker plot. (a) Frequency histogram; the vertical axis is the frequency (number of data points in each bar). (b) Box and Whisker plot, labelled with the median, the lower and upper quartile values, the interquartile range, the 1.5 × interquartile range limits and an outlier (*). The interquartile range is the range that contains the middle 50% of the data when it is sorted into ascending order.]

[Figure 3: Time-indexed plots, panels (a) to (d), each showing magnitude (0 to 10) against time. Summary statistics printed with the panels: n = 7, mean = 6, standard deviation = 2.16; n = 9, mean = 6, standard deviation = 2.65; n = 9, mean = 6, standard deviation = 1.80; n = 9, mean = 6, standard deviation = 2.06.]

The Standard Deviation (3)
The standard deviation is a measure of the spread of data (dispersion) about the mean and can again be calculated using a calculator or spreadsheet. There is, however, a slight added complication: if you look at a typical scientific calculator you will notice there are two types of standard deviation (denoted by the symbols σn and σn-1, or σ and s). The correct one to use depends upon how the problem is framed. For example, each batch of a chemical contains 10 sub-units. You are asked to analyse each sub-unit, in a single batch, for mercury contamination and report the mean mercury content and standard deviation. Now, if the mean and standard deviation are to be used solely with this analysed batch, then the 10 results represent the whole population (i.e., all are tested) and the correct standard deviation to use is the one for a population (σn):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

If, however, the intended use of the results is to estimate the mercury contamination for several batches of the chemical, the 10 results then represent a sample from the whole population and the correct standard deviation to use is that for a sample (σn-1):

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

If you are using a statistical package you should always check that the correct standard deviation is being calculated for your particular problem.

[Figure 4: The relationship between the normal distribution curve, the mean and standard deviation. The horizontal axis is standard deviations from the mean; the bands marked 68%, 95% and 99.7% are centred on the mean.]

[Figure 5: Comparison of different data sets. Each panel shows two data sets, (i) and (ii).
(a) Probably not different and would 'pass' the t-test (tcrit > tcalculated value).
(b) Probably different and would 'fail' the t-test (tcrit < tcalculated value).
(c) Could be different but not enough data to say for sure (i.e., would 'pass' the t-test [tcrit > tcalculated value]).
(d) Practically identical means, but with so many data points there is a small but statistically significant ('real') difference and so would 'fail' the t-test (tcrit < tcalculated value).
(e) Spread in the data as measured by the variance is similar; would 'pass' the F-test (Fcrit > Fcalculated value).
(f) Spread in the data as measured by the variance is different; would 'fail' the F-test (Fcrit < Fcalculated value) and hence (i) gives more consistent results than (ii).
(g) Could be a different spread but not enough data to say for sure; would 'pass' the F-test (Fcrit > Fcalculated value).]

Interpreting the mean and standard deviation
If the distribution is normal (i.e., when the data are plotted they approximate to the curve shown in Figure 4) then the mean is located at the centre of the distribution. Sixty-eight per cent of the results will be contained within 1 standard deviation of the mean, 95% within 2 standard deviations and 99.7% within 3 standard deviations.
Using the above facts it is possible to estimate a standard deviation from a stated confidence interval and, vice versa, a confidence interval from a standard deviation. For example, if a mean value of 0.72 ± 0.02 g/L at the 95% confidence level is quoted then it follows that the standard deviation = 0.02/2 or 0.01 g/L. If the same figure was quoted at the 99.7% confidence level the standard deviation would be 0.02/3 or 0.0067 g/L.
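The difference between the two standard deviations, and the interval arithmetic above, can be checked with Python's standard `statistics` module. This is a minimal sketch: the ten mercury results are invented for illustration, and `sd_from_interval` is a hypothetical helper name, not something defined in the article.

```python
import math
import statistics

# Hypothetical mercury results for the 10 sub-units of one batch
# (illustrative values only; they do not come from the article).
results = [1.02, 0.98, 1.05, 1.01, 0.97, 1.03, 1.00, 0.99, 1.04, 1.01]
n = len(results)

mean = statistics.fmean(results)

# Population standard deviation (divides by n): use when the 10 results
# ARE the whole population, i.e. only this batch is of interest.
sd_population = statistics.pstdev(results)

# Sample standard deviation (divides by n - 1): use when the 10 results
# are a sample used to estimate the behaviour of other batches.
sd_sample = statistics.stdev(results)

# The two differ only by the factor sqrt(n / (n - 1)).
assert math.isclose(sd_sample, sd_population * math.sqrt(n / (n - 1)))


def sd_from_interval(half_width, k):
    """Standard deviation implied by an interval quoted as mean +/- half_width,
    where k is the number of standard deviations the interval spans
    (2 for the 95% level, 3 for the 99.7% level, as in the text)."""
    return half_width / k


print(f"mean = {mean:.3f}")
print(f"population sd = {sd_population:.4f}, sample sd = {sd_sample:.4f}")
print(f"0.72 +/- 0.02 at 95%   -> sd = {sd_from_interval(0.02, 2):.4f} g/L")
print(f"0.72 +/- 0.02 at 99.7% -> sd = {sd_from_interval(0.02, 3):.4f} g/L")
```

The sample standard deviation is always the larger of the two, and the gap matters most for small n, which is why the choice deserves a deliberate check.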


Significance Testing
Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, by simply looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. The difference between two sets of data is, however, not so clear in many situations. The application of significance tests gives us a more systematic way of assessing the results, with the added advantage of allowing us to express our conclusion with a stated degree of confidence.

What does significance mean?
In statistics the words 'significant' and 'significance' have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences that are unlikely to have occurred because of purely random variation.
As previously mentioned, deciding whether one set of results is significantly different from another depends not only on the magnitude of the difference in the means but also on the amount of data available and its spread. For example, consider the blob plots shown in Figure 5. For the two data sets shown in Figure 5(a), the means for set (i) and set (ii) are numerically different. From the limited amount of information available, however, they are from a statistical point of view the same. For Figure 5(b), the means for set (i) and set (ii) are probably different, but when fewer data points are available, Figure 5(c), we cannot be sure with any degree of confidence that the means are different even if they are a long way apart. With a large number of data points, even a very small difference can be significant (Figure 5(d)). Similarly, when we are interested in comparing the spread of results, for example, when we want to know if method (i) gives more consistent results than method (ii), we have to take note of the amount of information available (Figures 5(e)–(g)).
It is fortunate that tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance. These are critical t-values for differences between means, and critical F-values for differences between the spread of results (4).
Note: Significance is a function of sample size. Comparing very large samples will nearly always lead to a significant difference, but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?

Table 1: Definitions of statistical terms used in significance testing.

Jargon: Definition

Alternate hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed] or mean1 is ≥ mean2 [see one-tailed]).

Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level against which the result of applying a significance test is compared.

Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).

One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ≤, >, <, or ≥ is used in the alternate hypothesis). In these cases the calculation to determine the t-value is the same as that for the two-tailed t-test but the critical value is different.

Population: A large group of items or measurements under investigation (e.g., 2500 lots from a single batch of a certified reference material).

Sample: A group of items or measurements taken from the population (e.g., 25 lots of a certified reference material taken from a batch containing 2500 lots).

Two-tailed: A two-tailed t-test is performed if the analyst is interested in any change. For example, is method A different from method B (i.e., ≠ is used in the alternate hypothesis)? Under most circumstances two-tailed t-tests should be performed.
What is a t-test?
A t-test is a statistical procedure that can
be used to compare mean values. A lot of
jargon surrounds these tests (see Table 1
for definition of the terms used below) but
they are relatively simple to apply using the
built-in functions of a spreadsheet like
Excel or a statistical software package.
Using a calculator is also an option but you
have to know the correct formula to apply
(see Table 2) and have access to statistical
tables to look up the so-called critical
values (4).
Three worked examples are shown in
Box 2 (5) to illustrate how the different
t-tests are carried out and how to interpret
the results.
What is an F-test?
An F-test compares the spread of results in two data sets to determine if they could reasonably be considered to come from the same parent distribution. The test can, therefore, be used to answer questions such as 'are two methods equally precise?' The measure of spread used in the F-test is variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., the variance of one set of data is divided by the variance of the other) to get the test value

$$F = \frac{s_1^2}{s_2^2}$$

This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1 − 1) and (n2 − 1) degrees of freedom, at the appropriate level of confidence. [Note: it is usual to arrange s1 and s2 so that F > 1.] If the standard deviations are to be considered to come from the same population then Fcrit > F. As an example we use the data in Example 2 (see Box 2):

$$F = \frac{2.750^2}{1.471^2} = 3.49$$

Fcrit = 9.605 for (5 − 1) and (5 − 1) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated we can conclude that the spread of results in the two data sets is not significantly different and it is, therefore, reasonable to combine the two standard deviations as we have done.
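The same comparison can be written as a short Python sketch. The function name `f_ratio` is hypothetical; the two standard deviations and the critical value of 9.605 are the ones quoted above, with the critical value simply typed in from the tables rather than computed.

```python
def f_ratio(s_a, s_b):
    """Ratio of the two variances, arranged so that the result is >= 1."""
    var_a, var_b = s_a ** 2, s_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


# Sample standard deviations of the two selenium methods (Example 2, Box 2).
s_method_1 = 1.471
s_method_2 = 2.750

# Critical value quoted in the text for (5 - 1) and (5 - 1) degrees of
# freedom at the 97.5% confidence level (looked up, not calculated here).
F_CRIT = 9.605

f_calc = f_ratio(s_method_1, s_method_2)
spreads_differ = f_calc > F_CRIT

print(f"F calculated = {f_calc:.2f}, F critical = {F_CRIT}")
print("spreads significantly different" if spreads_differ
      else "no significant difference in spread; pooling is reasonable")
```

Taking the larger variance as the numerator inside the function means the caller does not have to remember the convention that F should be greater than 1.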


Using statistical software (what is a p-value?)
When you use statistical software packages and some spreadsheet functions, the results of performing a significance test are often summarized as a p-value. The p-value represents an inverse index of the reliability of the statistic (i.e., the probability of error in accepting the observed result as valid). Thus, if we are comparing two means to see if they are different, a p-value of 0.10 is equivalent to saying we are 90% certain that the means are different; 0.05 is equivalent to saying we are 95% certain that the means are different; and 0.01 that we are 99% certain that the means are different, i.e., [(1 − p) × 100%]. It is usual when analysing chemical data (but somewhat arbitrary) to say that p-levels ≤ 0.05 are statistically significant.
Some assumptions behind significance testing
In most statistical tests it is
assumed that the sample correctly
represents the population and that the
population follows a normal distribution.
Although these assumptions are never
complied with precisely, in a large number
of situations where laboratory data is being
used they are not grossly violated.
Conclusions
• Always plot your data and understand the patterns in it before calculating any statistic, even the arithmetic mean.
• Make sure the correct standard deviation is calculated for your particular circumstance. This will nearly always be the sample standard deviation (σn-1).
• Significance tests are used to compare, in an unbiased way, the means or spread (variance) of two data sets.
• The tests are easily performed using statistical routines in spreadsheets and statistical packages.
• The p-value is a measure of confidence in the result obtained when applying a significance test.
Acknowledgement
The preparation of this paper was
supported under a contract with the UK
Department of Trade and Industry as part
of the National Measurement System Valid
Analytical Measurement Programme
(VAM) (6).
References
(1) ISO 3534 part 1: Statistics Vocabulary and Symbols. Part 1: Probability and General Statistical Terms (1993).
(2) BS 2846 part 7: Tests for Departure from Normality (1984).
(3) BS 2846 part 4 (ISO 2854): Techniques of Estimation Relating to Means and Variances (1976).
(4) D.V. Lindley and W.F. Scott, New Cambridge Elementary Statistical Tables (ISBN: 0 521 48485 5), Cambridge University Press (1995).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry (1997).
(6) M. Sargent, VAM Bulletin, Issue 13, 4–5 (Laboratory of the Government Chemist, Teddington, UK), Autumn 1995.

Shaun Burke currently works in the Food


Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).

Bibliography
1. G.B. Wetherill, Elementary Statistical
Methods, Chapman and Hall, London,
UK.
2. J.C. Miller and J.N. Miller, Statistics for
Analytical Chemistry, Ellis Horwood PTR
Prentice Hall, London, UK.
3. J. Tukey, Exploratory Data Analysis,
Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the
Analytical Scientist: A Bench Guide
(ISBN: 085 404 4426), Royal Society of
Chemistry, London, UK (1997).

Table 2: Summary of statistical formulae.

t-test to use when comparing: Equation

The long-term average (population mean, μ) with a sample mean:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

The difference between two means (e.g., two analytical methods):

For a two-tailed test

$$t = \frac{|\bar{d}| \sqrt{n}}{s_d}$$

For a one-tailed test (the sign is important)

$$t = \frac{\bar{d} \sqrt{n}}{s_d}$$

Difference between independent sample means with equal variances:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_c \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Difference between independent sample means with unequal variances:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where: x̄ is the sample mean, μ is the population mean, s is the standard deviation for the sample, n is the number of items in the sample, |d̄| is the absolute mean difference between pairs, d̄ is the mean difference between pairs, sd is the sample standard deviation for the pairs, x̄1 and x̄2 are two independent sample means, n1 and n2 are the number of items making up each sample, and sc is the combined standard deviation found using

$$s_c = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$$

where s1 and s2 are the sample standard deviations.

Note: The degrees of freedom (ν) used for looking up the critical t-value for independent sample means with unequal variances is given by

$$\frac{1}{\nu} = \frac{s_1^4}{k^2 n_1^2 (n_1 - 1)} + \frac{s_2^4}{k^2 n_2^2 (n_2 - 1)}, \quad \text{where } k = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$$
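The formulae in Table 2 translate directly into code. The sketch below is one possible transcription using only the Python standard library; the function names are hypothetical, and critical values still have to be looked up in tables or a statistics package. The calls at the end use the summary figures quoted in Box 2.

```python
import math


def t_one_sample(mean, mu, s, n):
    """Sample mean compared with a long-term (population) mean mu."""
    return (mean - mu) / (s / math.sqrt(n))


def t_paired(d_mean, s_d, n, two_tailed=True):
    """Paired differences; the sign of d_mean is kept for a one-tailed test."""
    numerator = abs(d_mean) if two_tailed else d_mean
    return numerator * math.sqrt(n) / s_d


def pooled_sd(s1, n1, s2, n2):
    """Combined standard deviation s_c for two samples with similar variances."""
    return math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))


def t_equal_variances(mean1, mean2, s_c, n1, n2):
    """Independent sample means, equal variances (uses the pooled s_c)."""
    return (mean1 - mean2) / (s_c * math.sqrt(1 / n1 + 1 / n2))


def t_unequal_variances(mean1, s1, n1, mean2, s2, n2):
    """Independent sample means, unequal variances."""
    return (mean1 - mean2) / math.sqrt(s1**2 / n1 + s2**2 / n2)


def df_unequal_variances(s1, n1, s2, n2):
    """Degrees of freedom for the unequal-variance test (note to Table 2)."""
    k = s1**2 / n1 + s2**2 / n2
    inverse = (s1**4 / (k**2 * n1**2 * (n1 - 1))
               + s2**4 / (k**2 * n2**2 * (n2 - 1)))
    return 1 / inverse


# Summary figures quoted in Box 2 (Examples 1 to 3).
print(round(t_one_sample(23.5, 22.7, 0.9, 10), 2))   # Example 1
s_c = pooled_sd(1.471, 5, 2.750, 5)
print(round(s_c, 3))                                  # Example 2, pooled sd
print(round(t_equal_variances(5.40, 4.76, s_c, 5, 5), 3))
print(round(t_paired(-0.475, 0.700, 8), 2))           # Example 3
```

For the unequal-variance case the degrees of freedom are usually not a whole number; a common convention, assumed here rather than stated in the table, is to round down before consulting printed tables.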


Box 2

Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.
To test the null hypothesis we calculate the t-value as below. Note that the calculated t-value is the ratio of the difference between the means to a measure of the spread (standard deviation) and the amount of data available (n).

$$t = \frac{23.5 - 22.7}{0.9 / \sqrt{10}} = 2.81$$

In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means or only that there is a difference, i.e., are we performing a one-sided or two-sided t-test (see Table 1)? In the case above, it is the latter; therefore, the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n − 1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.
tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom.
As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods.
[Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are fit for purpose. The significance test is only one piece of information to be considered.]

Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3.

Table 3: Results from two methods used to determine concentrations of selenium.

            Results                        Mean    s
Method 1    4.2   4.5   6.8   7.2   4.3    5.40    1.471
Method 2    9.2   4.0   1.9   5.2   3.5    4.76    2.750

Using the t-test for independent sample means we define the null hypothesis H0 as x̄1 = x̄2. This means there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different then we can combine (or pool) the standard deviation (sc) (see What is an F-test?):

$$s_c = \sqrt{\frac{1.471^2 \times (5 - 1) + 2.750^2 \times (5 - 1)}{5 + 5 - 2}} = 2.205$$

If the standard deviations are significantly different then the t-test for unequal variances should be used (Table 2).
Evaluating the test statistic t:

$$t = \frac{5.40 - 4.76}{2.205 \sqrt{\frac{1}{5} + \frac{1}{5}}} = \frac{0.64}{2.205 \times 0.632} = \frac{0.64}{1.395} = 0.459$$

The 95% critical value is 2.306 for ν = 8 (n1 + n2 − 2) degrees of freedom. This exceeds the calculated value of 0.459, thus the null hypothesis (H0) cannot be rejected and we conclude there is no significant difference between the means of the results given by the two methods.

Example 3 (5)
Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.

Table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs (one column per sample matrix).

Method A (mg/g)    2.52    3.13    4.33    2.25    2.79    3.04    2.19    2.16
Method B (mg/g)    3.17    5.00    4.03    2.38    3.68    2.94    2.83    2.18
Difference (d)    -0.65   -1.87    0.30   -0.13   -0.89    0.10   -0.64   -0.02

The null hypothesis is H0: d̄ = 0 against the alternative H1: d̄ ≠ 0.
The test is a two-tailed test as we are interested in both d̄ < 0 and d̄ > 0.
The mean difference is d̄ = −0.475 (so |d̄| = 0.475) and the sample standard deviation of the paired differences is sd = 0.700.

$$t = \frac{|\bar{d}| \sqrt{n}}{s_d} = \frac{0.475 \times \sqrt{8}}{0.700} = 1.918$$

The tabulated value of tcrit (with ν = 7 degrees of freedom, at the 95% confidence limit) is 2.365. Since the calculated value is less than the critical value, H0 cannot be rejected and it follows that there is no significant difference between the two techniques.
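Example 3 can be reproduced from the raw readings in Table 4 with a few lines of Python. This is a sketch using only the standard library; the variable names are illustrative and the critical value is the tabulated 2.365 quoted in the example, not something the code derives.

```python
import math
import statistics

# Readings from Table 4, one entry per sample matrix (mg/g).
method_a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
method_b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]

# Paired design: work with the difference within each matrix (A - B).
differences = [a - b for a, b in zip(method_a, method_b)]
n = len(differences)

d_mean = statistics.fmean(differences)
s_d = statistics.stdev(differences)      # sample standard deviation (n - 1)

# Two-tailed paired t statistic: |mean difference| * sqrt(n) / s_d.
t_calc = abs(d_mean) * math.sqrt(n) / s_d

# Tabulated two-tailed critical value for 7 degrees of freedom at 95%.
T_CRIT = 2.365
reject_h0 = t_calc > T_CRIT

print(f"mean difference = {d_mean:.3f}, s_d = {s_d:.3f}")
print(f"t calculated = {t_calc:.3f}, t critical = {T_CRIT}")
print("reject H0" if reject_h0 else "cannot reject H0")
```

Working from the differences removes the large matrix-to-matrix variation from the comparison, which is the reason the paired form is preferred here over the independent-samples test.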


Analysis of Variance
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

Statistical methods can be powerful tools for unlocking the information


contained in analytical data. This second part in our statistics refresher series
looks at one of the most frequently used of these tools: Analysis of Variance
(ANOVA). In the previous paper we examined the initial steps in describing
the structure of the data and explained a number of alternative significance
tests (1). In particular, we showed that t-tests can be used to compare the
results from two analytical methods or chemical processes. In this article,
we will expand on the theme of significance testing by showing how ANOVA
can be used to compare the results from more than two sets of data at
the same time, and how it is particularly useful in analysing data from
designed experiments.

With the advent of built-in spreadsheet


functions and affordable dedicated
statistical software packages, Analysis of
Variance (ANOVA) has become relatively
simple to carry out. This article will
therefore concentrate on how to select the
correct variant of the ANOVA method, the
advantages of ANOVA, how to interpret
the results and how to avoid some of the
pitfalls. For those wanting more detailed
theory than is given in the following
section, several texts are available (2–5).
A bit of ANOVA theory
Whenever we make repeated
measurements there is always some
variation. Sometimes this variation (known
as within-group variation) makes it difficult
for analysts to see if there have been
significant changes between different groups
of replicates. For example, in Figure 1
(which shows the results from four replicate
analyses by 12 analysts), we can see that
the total variation is a combination of the
spread of results within groups and the
spread between the mean values (between-group variation). The statistic that measures
the within and between-group variations in
ANOVA is called the sum of squares and
often appears in the output tables
abbreviated as SS. It can be shown that the
different sums of squares calculated in
ANOVA are equivalent to variances (1). The

central tenet of ANOVA is that the total SS in


an experiment can be divided into the
components caused by random error, given
by the within-group (or sample) SS, and the
components resulting from differences
between means. It is these latter components
that are used to test for statistical
significance using a simple F-test (1).
Why not use multiple t-tests
instead of ANOVA?
Why should we use ANOVA in preference
to carrying out a series of t-tests? I think
this is best explained by using an example;
suppose we want to compare the results
from 12 analysts taking part in a training
exercise. If we were to use t-tests, we
would need to calculate 66 t-values. Not
only is this a lot of work but the chance of
reaching a wrong conclusion increases. The
correct way to analyse this sort of data is
to use one-way ANOVA.
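The figure of 66 t-tests, and the way the risk of a wrong conclusion grows, can be checked with a short calculation. The sketch below assumes each test is run at the 95% confidence level and, as a simplification not made in the article, treats the tests as independent; the function names are hypothetical.

```python
import math


def pairwise_tests(groups):
    """Number of distinct pairs that can be formed from `groups` groups."""
    return math.comb(groups, 2)


def chance_of_false_alarm(tests, alpha=0.05):
    """Probability of at least one false 'significant' result across all
    tests, ASSUMING the tests are independent (an approximation: t-tests
    that share the same analysts' data are not strictly independent)."""
    return 1 - (1 - alpha) ** tests


tests = pairwise_tests(12)
print(f"{tests} t-tests are needed to compare 12 analysts in pairs")
print(f"chance of at least one false alarm: {chance_of_false_alarm(tests):.0%}")
```

Under these assumptions a false alarm somewhere among the 66 comparisons is close to certain, which is why a single ANOVA is the better first step.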
One-way ANOVA
One-way ANOVA will answer the question:
'Is there a significant difference between
the mean values (or levels), given that the
means are calculated from a number of
replicate observations?' 'Significant' refers
to an observed spread of means that
would not normally arise from the chance
variation within groups. We have already
seen an example of this type of problem in

the form of the data contained in Figure 1,


which shows the results from 12 different
analysts analysing the same material. Using
these data and a spreadsheet, the results
obtained from carrying out one-way
ANOVA are reported in Example 1. In this
example, the ANOVA shows there are
significant differences between analysts
(Fvalue > Fcrit at the 95% confidence level).
This result is obvious from a plot of the
data (Figure 1) but in many situations a
visual inspection of a plot will not give such
a clear-cut result. Notice that the output
also includes a p-value (see Interpretation
of the result(s) section, which follows).
Note: ANOVA cannot tell us which
individual mean or means are different
from the consensus value and in what
direction they deviate. The most effective
way to show this is to plot the data (Figure
1) or alternatively, but less effectively, carry
out a multiple comparison test such as
Scheffe's test (2). It is also important to
make sure the right questions are being
asked and that the right data are being
captured. In Example 1, it is possible that
the time difference between the analysts
carrying out the determinations is the
reason for the difference in the mean
values. This example shows how good
experimental design procedures could have
prevented ambiguity in the conclusions.


Example 1: An example of one-way ANOVA carried out by Excel

               A_1      A_2      A_3      A_4      A_5      A_6
Replicate 1    34.1     35.84    36.67    40.54    41.19    41.22
Replicate 2    34.1     36.58    37.33    40.67    40.29    39.61
Replicate 3    34.69    31.3     36.96    40.81    40.99    37.89
Replicate 4    34.6     34.19    36.83    40.78    40.4     36.67

               A_7      A_8      A_9      A_10     A_11     A_12
Replicate 1    40.71    39.2     42.5     39.75    36.04    44.36
Replicate 2    40.91    39.3     42.3     39.69    37.03    45.73
Replicate 3    40.8     39.3     42.5     39.23    36.85    45.25
Replicate 4    38.42    39.3     42.5     39.73    36.24    45.34

(Note: the data table has been split into two sections (A_1 to A_6, A_7 to A_12) for display purposes. The ANOVA is carried out on a single table.)

Anova: Single Factor

Source of Variation    SS          df    MS          F           P-value    F crit
Between Groups         438.7988    11    39.8908     40.31545    6.6E-17    2.066606
Within Groups          35.6208     36    0.989467

SS = sum of squares, df = degrees of freedom, MS = mean square (SS/df).

The P-value is < 0.05 (Fvalue is > Fcrit at the 95% confidence level for 11 and 36 degrees of freedom), therefore it can be concluded that there is a significant difference between the analysts' results.

Example 2 Two-way ANOVA

The analysis of tinned ham was carried out at three temperatures (415, 435 and 460 °C)
and three times (30, 60 and 90 minutes). Three analyses, determining protein
yield, were made at each temperature and time. The measurements are summarized
in the diagram below and the results of the two-way ANOVA are given in the table.

[Diagram: the individual results plotted against time (30, 60 and 90 min) for each temperature (415, 435 and 460 °C).]

Time (min)   Temp (°C)
             415      435      460
30           27.13    27.2     27.03
30           27.2     26.97    27.1
30           27.13    27.13    27.13
60           27.29    27.07    27.1
60           27.13    27.1     27.07
60           27.23    27.03    27.03
90           27.03    27.2     27.03
90           27.13    27.23    27.07
90           27.07    27.27    26.9

Anova: Two-factor with replication

Source of Variation      SS         df   MS         F          P-value    F crit
Sample (=Time)           0.000867   2    0.000433   0.100429   0.904952   3.554561
Columns (=Temperature)   0.049689   2    0.024844   5.75794    0.011667   3.554561
Interaction              0.087644   4    0.021911   5.078112   0.006437   2.927749
Within                   0.077667   18   0.004315
Total                    0.215867   26

Note: in the above example, the spreadsheet (Excel) labels Source of Variation as Sample, Columns, Interaction and Within.
Sample = Time, Columns = Temperature, Interaction is the interaction between temperature and time, and Within is a
measure of the within-group variation.
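The sums of squares reported for Example 2 can be reproduced directly from the raw results. The following Python sketch does this for a balanced design with replication; the function name two_way_anova and the dictionary layout are illustrative choices, not part of the spreadsheet output, and the p-values and Fcrit values are not computed here.

```python
from statistics import mean

# Protein yield results from Example 2, keyed as data[temperature][time] -> three replicates
data = {
    415: {30: [27.13, 27.20, 27.13], 60: [27.29, 27.13, 27.23], 90: [27.03, 27.13, 27.07]},
    435: {30: [27.20, 26.97, 27.13], 60: [27.07, 27.10, 27.03], 90: [27.20, 27.23, 27.27]},
    460: {30: [27.03, 27.10, 27.13], 60: [27.10, 27.07, 27.03], 90: [27.03, 27.07, 26.90]},
}


def two_way_anova(data):
    """Sums of squares, degrees of freedom and mean squares for a balanced two-way design."""
    temps = sorted(data)
    times = sorted(data[temps[0]])
    reps = len(data[temps[0]][times[0]])
    grand = mean(x for T in temps for t in times for x in data[T][t])

    # Main effects: spread of each factor-level mean about the grand mean
    ss_temp = len(times) * reps * sum(
        (mean(x for t in times for x in data[T][t]) - grand) ** 2 for T in temps)
    ss_time = len(temps) * reps * sum(
        (mean(x for T in temps for x in data[T][t]) - grand) ** 2 for t in times)
    # Interaction: what is left of the cell-mean variation after removing both main effects
    ss_cells = reps * sum((mean(data[T][t]) - grand) ** 2 for T in temps for t in times)
    ss_inter = ss_cells - ss_temp - ss_time
    # Within: spread of the replicates about their own cell mean
    ss_within = sum((x - mean(data[T][t])) ** 2
                    for T in temps for t in times for x in data[T][t])

    ss = {"time": ss_time, "temperature": ss_temp,
          "interaction": ss_inter, "within": ss_within}
    df = {"time": len(times) - 1, "temperature": len(temps) - 1,
          "interaction": (len(times) - 1) * (len(temps) - 1),
          "within": len(temps) * len(times) * (reps - 1)}
    ms = {key: ss[key] / df[key] for key in ss}
    return ss, df, ms


ss, df, ms = two_way_anova(data)
for source in ("time", "temperature", "interaction", "within"):
    print(f"{source:12s} SS = {ss[source]:.6f}  df = {df[source]:2d}  MS = {ms[source]:.6f}")
```

Dividing each mean square by the within-group mean square gives the F column of the spreadsheet table.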

Two-way ANOVA
In a typical experiment things can be more
complex than described previously. For
example, in Example 2 the aim is to find
out if time and/or temperature have any
effect on protein yield when analysing
samples of tinned ham. When analysing
data from this type of experiment we use
two-way ANOVA. Two-way ANOVA can
test the significance of each of two
experimental variables (factors or
treatments) with respect to the response,
such as an instrument's output. When
replicate measurements are made we can
also examine whether or not there are
significant interactions between variables.
An interaction is said to be present when
the response being measured changes
more than can be explained from the
change in level of an individual factor. This
is illustrated in Figure 2 for a process with
two factors (Y and Z) when both factors
are studied at two levels (low and high). In
Figure 2(b), the changes in response
caused by Y depend on Z, and vice versa.
In two-way ANOVA we ask the
following questions:
• Is there a significant interaction between
the two factors (variables)?
• Does a change in any of the factors
affect the measured result?
It is important to check the answers in the
right order: Figure 3 illustrates the
decision process. In the case of Example
2 the questions are:
• Is there an interaction between
temperature and time which affects the
protein yield?
• Does time and/or temperature affect the
protein yield?
Using the built-in functions of a
spreadsheet (in this case Excel's data
analysis tool 'Anova: Two-Factor With
Replication') we see that there is a
significant interaction between time and
temperature and a significant effect of
temperature alone (both p-value < 0.05
and F > Fcrit). Following the process
outlined in Figure 3, we consider the
interaction question first by comparing the
mean squares (MS) for the within-group
variation with the interaction MS. This is
reported in the results table of Example 2.
F = 0.021911/0.004315 = 5.078
If the interaction is significant (F > Fcrit),
as in this case, then the individual factors
(time and temperature) should each be
compared with the MS for the interaction
(not the within-group MS) thus:
Ftemp = 0.024844/0.021911 = 1.134
Ftime = 0.000433/0.021911 = 0.020
Fcrit = 6.944, for 2 and 4 degrees of freedom (at the 95% confidence level)

In other words, neither time nor temperature is significant when tested against the
interaction (both F-values are well below Fcrit), and, therefore, the interaction
of temperature with time is worth further investigation. If one or both of the individual
factors were significant compared with the interaction, then the individual factor or factors
would dominate and for all practical purposes any interaction could be ignored.
If the interaction term is not significant then it can be considered to be another small
error term and can thus be pooled with the within-group (error) sums of squares term. It is
the pooled value (s²pooled) that is then used as the denominator in the F-test to
determine if the individual factors affect the measured results significantly. To combine the
sums of squares the following formula is used:

$$ s^2_{pooled} = \frac{SS_{inter} + SS_{within}}{dof_{inter} + dof_{within}} $$

where dof_inter and dof_within are the degrees of freedom for the interaction term and
error term, and SS_inter and SS_within are the sums of squares for the interaction term and
error term, respectively (dof_pooled = dof_inter + dof_within).

Selecting the ANOVA method
One-way ANOVA should be used when there is only one factor being considered and
replicate data from changing the level of that factor are available. Two-way ANOVA (with
or without replication) is used when there are two factors being considered. If no replicate
data are collected then the interactions between the two factors cannot be calculated.
Higher level ANOVAs are also available for looking at more than two factors.

Interpretation of the result(s)
To reiterate the interpretation of ANOVA results, a calculated F-value that is greater
than Fcrit for a stated level of confidence (typically 95%) means that the difference
being tested is statistically significant at that level. As an alternative to using the
F-values, the p-value can be used to indicate the degree of confidence we have that
there is a significant difference between means (i.e., (1-p) * 100 is the percentage
confidence). Normally a p-value of ≤ 0.05 is considered to denote a significant
difference.
Note: Extrapolation of ANOVA results is not advisable, so in Example 2 for instance,
it is impossible to say if a time of 15 or 120 minutes would lead to a measurable effect
on protein yield. It is, therefore, always more economic in the long run to design
the experiment in advance, in order to cover the likely ranges of the parameter(s)
of interest.
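The order of the comparisons for Example 2 (interaction first, then the individual factors against either the interaction or the pooled term) can be written out as a short Python sketch. The numbers are copied from the Example 2 results table; the variable names are illustrative only.

```python
# Values copied from the Example 2 two-way ANOVA table
ms = {"time": 0.000433, "temperature": 0.024844, "interaction": 0.021911, "within": 0.004315}
ss = {"interaction": 0.087644, "within": 0.077667}
dof = {"interaction": 4, "within": 18}
F_CRIT_INTERACTION = 2.927749  # tabulated value for 4 and 18 degrees of freedom (95%)

# Step 1: test the interaction against the within-group variation
f_interaction = ms["interaction"] / ms["within"]

# The pooled term is only needed when the interaction is NOT significant;
# it is calculated here as well to illustrate the formula
s2_pooled = (ss["interaction"] + ss["within"]) / (dof["interaction"] + dof["within"])

# Step 2: choose the denominator for testing the individual factors
if f_interaction > F_CRIT_INTERACTION:
    denominator = ms["interaction"]   # significant interaction
else:
    denominator = s2_pooled           # interaction treated as extra error

f_temperature = ms["temperature"] / denominator
f_time = ms["time"] / denominator

print(f"F interaction = {f_interaction:.3f}")
print(f"F temperature = {f_temperature:.3f}, F time = {f_time:.3f}")
print(f"pooled term (if it were needed) = {s2_pooled:.6f}")
```

With these data the interaction is significant, so the factor F-values are calculated against the interaction mean square and then compared with Fcrit for 2 and 4 degrees of freedom.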

Avoiding some of the pitfalls using ANOVA
In ANOVA it is assumed that the data for each variable are normally distributed.
Usually in ANOVA we don't have a large amount of data so it is difficult to prove
any departure from normality. It has been shown, however, that even quite large
deviations do not affect the decisions made on the basis of the F-test.
A more important assumption about ANOVA is that the variance (spread)
between groups is homogeneous (homoscedastic). If this is not the case (this
often happens in chemistry, see Figure 1) then the F-test can suggest a statistically
significant difference when none is present. The best way to avoid this pitfall
is, as ever, to plot the data. There also exist a number of tests for heteroscedasticity
(e.g., Bartlett's test (5) and Levene's test (2)). It may be possible to overcome this
type of problem in the data structure by transforming it, such as by taking logs (7).
If the variability within a group is correlated with its mean value then
ANOVA may not be appropriate and/or it may indicate the presence of outliers in the
data (Figure 4). Cochran's test (5) can be used to test for variance outliers.

Advantages of ANOVA
Compared with using multiple t-tests, one-way and two-way ANOVA require fewer
measurements to discover significant effects (i.e., the tests are said to have more power).
This is one reason why ANOVA is used frequently when analysing data from statistically
designed experiments.
Other ANOVA and multivariate ANOVA (MANOVA) methods exist for more complex
experimental situations but a description of these is beyond the scope of this introductory
article. More details can be found in reference 6.

figure 1 Plot comparing the results from 12 analysts. [Analyte concentration (ppm) plotted against Analyst ID (A1 to A12), with the mean and the total standard deviation marked.]

figure 2 Interactive factors. [Response plotted against Y (low and high) at low and high levels of Z: (a) Y and Z are independent; (b) Y and Z are interacting.]

figure 3 Comparing mean squares in two-way ANOVA with replication. [Flow chart: Start; compare within-group mean squares with interaction mean squares; significant difference (F > Fcrit)? Yes: compare interaction mean squares with individual factor mean squares. No: pool the within-group and interaction sums of squares, then compare pooled mean squares with individual factor mean squares.]

figure 4 A plot of variance versus the mean. [Variance plotted against mean value, marking significantly different means by ANOVA and an unreliable high mean (may contain outliers).]
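As a numerical companion to plotting the data, the group means and variances can be tabulated and the Cochran statistic (largest group variance divided by the sum of all the group variances) calculated. The Python sketch below assumes equal group sizes and uses hypothetical data; the critical value for the test must still be looked up in tables (for example, in reference 5) and is not included here.

```python
from statistics import mean, variance


def cochran_c(groups):
    """Cochran's statistic: largest group variance divided by the sum of the group variances."""
    variances = [variance(g) for g in groups]
    return max(variances) / sum(variances)


# Hypothetical illustrative data: three groups of three replicates
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 4.0, 6.0],
    "C": [10.0, 11.0, 12.0],
}

# Listing mean and variance side by side shows whether the spread tracks the mean
for name, values in groups.items():
    print(f"group {name}: mean = {mean(values):.2f}, variance = {variance(values):.2f}")

c_value = cochran_c(list(groups.values()))
print(f"Cochran's C = {c_value:.3f}")
```

A value of C close to 1 means that a single group dominates the total variance and should be examined before the ANOVA result is trusted.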
Conclusions
• ANOVA is a powerful tool for
determining if there is a statistically
significant difference between two or
more sets of data.
• One-way ANOVA should be used
when we are comparing several sets
of observations.
• Two-way ANOVA is the method
used when there are two separate
factors that may be influencing a result.
• Except for the smallest of data sets
ANOVA is best carried out using a
spreadsheet or statistical software
package.
• You should always plot your data to
make sure the assumptions ANOVA is
based on are not violated.
Acknowledgements
The preparation of this paper was
supported under a contract with the UK
Department of Trade and Industry as part
of the National Measurement System Valid
Analytical Measurement Programme (VAM)
(8).
References
(1) S. Burke, Scientific Data Management 1(1),
32–38, September 1997.
(2) G.A. Milliken and D.E. Johnson, Analysis of
Messy Data, Volume 1: Designed Experiments,
Van Nostrand Reinhold Company, New York,
USA (1984).
(3) J.C. Miller and J.N. Miller, Statistics for
Analytical Chemistry, Ellis Horwood PTR
Prentice Hall, London, UK (ISBN 0 13 0309907).
(4) C. Chatfield, Statistics for Technology,
Chapman & Hall, London, UK (ISBN 0412
25340 2).
(5) T.J. Farrant, Practical Statistics for the Analytical
Scientist, A Bench Guide, Royal Society of
Chemistry, London, UK (ISBN 0 85404 442 6)
(1997).
(6) K.V. Mardia, J.T. Kent and J.M. Bibby,
Multivariate Analysis, Academic Press Inc. (ISBN
0 12 471252 5) (1979).
(7) ISO 4259: 1992. Petroleum Products:
Determination and Application of Precision
Data in Relation to Methods of Test. Annex E,
International Organisation for Standardisation,
Geneva, Switzerland (1992).
(8) M. Sargent, VAM Bulletin, Issue 13, 4–5,
Laboratory of the Government Chemist
(Autumn 1995).

Shaun Burke currently works in the Food
Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).


Regression
and Calibration
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

One of the most frequently used statistical methods in calibration is linear
regression. This third paper in our statistics refresher series concentrates on
the practical applications of linear regression and the interpretation of the
regression statistics.

Calibration is fundamental to achieving
consistency of measurement. Often
calibration involves establishing the
relationship between an instrument
response and one or more reference
values. Linear regression is one of the most
frequently used statistical methods in
calibration. Once the relationship between
the input value and the response value
(assumed to be represented by a straight
line) is established, the calibration model is
used in reverse; that is, to predict a value
from an instrument response. In general,
regression methods are also useful for
establishing relationships of all kinds, not
just linear relationships. This paper
concentrates on the practical applications
of linear regression and the interpretation
of the regression statistics. For those of you
who want to know about the theory of
regression there are some excellent
references (1–6).
For anyone intending to apply linear
least-squares regression to their own data,
it is recommended that a statistics/graphics
package is used. This will speed up the
production of the graphs needed to
confirm the validity of the regression
statistics. The built-in functions of a
spreadsheet can also be used if the
routines have been validated for accuracy
(e.g., using standard data sets (7)).
What is regression?
In statistics, the term regression is used to
describe a group of methods that
summarize the degree of association
between one variable (or set of variables)
and another variable (or set of variables).
The most common statistical method used
to do this is least-squares regression, which
works by finding the best curve through
the data that minimizes the sums of
squares of the residuals. The important
term here is the best curve, not the
method by which this is achieved. There
are a number of least-squares regression
models, for example, linear (the most
common type), logarithmic, exponential
and power. As already stated, this paper
will concentrate on linear least-squares
regression.
[You should also be aware that there are
other regression methods, such as ranked
regression, multiple linear regression,
non-linear regression, principal-component
regression, partial least-squares regression,
etc., which are useful for analysing instrument
or chemically derived data, but are beyond
the scope of this introductory text.]
What do the linear least-squares
regression statistics mean?
Correlation coefficient: Whether you use a
calculator's built-in functions, a
spreadsheet or a statistics package, the
first statistic most chemists look at when
performing this analysis is the correlation
coefficient (r). The correlation coefficient
ranges from −1, a perfect negative
relationship, through zero (no relationship),
to +1, a perfect positive relationship
(Figures 1(a–c)). The correlation coefficient
is, therefore, a measure of the degree of
linear relationship between two sets of
data. However, the r value is open to
misinterpretation (8) (Figures 1(d) and (e)
show instances in which the r values alone
would give the wrong impression of the
underlying relationship). Indeed, it is
possible for several different data sets to
yield identical regression statistics (r value,
residual sum of squares, slope and
intercept), but still not satisfy the linear
assumption in all cases (9). It, therefore,
remains essential to plot the data in order
to check that linear least-squares statistics
are appropriate.
As in the t-tests discussed in the first
paper (10) in this series, the statistical
significance of the correlation coefficient is
dependent on the number of data points.
To test if a particular r value indicates a
statistically significant relationship we can
use Pearson's correlation coefficient
test (Table 1). Thus, if we only have four
points (for which the number of degrees of
freedom is 2) a linear least-squares
correlation coefficient of 0.94 will not be
significant at the 95% confidence level.
However, if there are more than 60 points
an r value of just 0.26 (r² = 0.0676) would
indicate a significant, but not very strong,
positive linear relationship. In other words,
a relationship can be statistically significant
but of no practical value. Note that the test
used here simply shows whether two sets
are linearly related; it does not prove
linearity or adequacy of fit.
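The entries in Table 1 are related to the two-sided critical values of Student's t by r_crit = t / sqrt(t² + df), where df = n − 2. The Python sketch below uses this relationship to repeat the two cases discussed above. The small dictionary of t values is copied from standard t-tables (an assumption of this sketch, not data from the article), and the function names are illustrative.

```python
import math

# Two-sided 95% critical values of Student's t from standard t-tables, keyed by degrees of freedom
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 10: 2.228, 60: 2.000}


def critical_r(df, t_crit):
    """Smallest |r| that is significant for df = n - 2 degrees of freedom."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


def r_is_significant(r, n, t_table=T_CRIT_95):
    """Compare |r| with the critical value for n data pairs (df must be in the table)."""
    df = n - 2
    return abs(r) >= critical_r(df, t_table[df])


# Four points with r = 0.94: not significant at the 95% confidence level
print(round(critical_r(2, T_CRIT_95[2]), 3), r_is_significant(0.94, n=4))
# Sixty-two points (60 degrees of freedom) with r = 0.26: significant, but weak
print(round(critical_r(60, T_CRIT_95[60]), 3), r_is_significant(0.26, n=62))
```

The same comparison can of course be made by reading the critical value straight from Table 1; the sketch only shows where those values come from.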
It is also important to note that a
significant correlation between one
variable and another should not be taken
as an indication of causality. For example,
there is a negative correlation between
time (measured in months) and catalyst
performance in car exhaust systems.
However, time is not the cause of the
deterioration; it is the build-up of sulfur
and phosphorus compounds that
gradually poisons the catalyst. Causality is,

in fact, very difficult to prove unless the
chemist can vary systematically and
independently all critical parameters, while
measuring the response for each change.

Slope and intercept
In linear regression the relationship between the X and Y data is assumed to
be represented by a straight line, Y = a + bX (see Figure 2), where Y is the estimated
response/dependent variable, b is the slope (gradient) of the regression line and a is
the intercept (Y value when X = 0). This straight-line model is only appropriate if
the data approximately fit the assumption of linearity. This can be tested for by
plotting the data and looking for curvature (e.g., Figure 1(d)) or by plotting the
residuals against the predicted Y values or X values (see Figure 3).
Although the relationship may be known to be non-linear (i.e., follow a different
functional form, such as an exponential curve), it can sometimes be made to fit the
linear assumption by transforming the data in line with the function, for example, by
taking logarithms or squaring the Y and/or X data. Note that if such transformations
are performed, weighted regression (discussed later) should be used to obtain
an accurate model. Weighting is required because of changes in the residual/error
structure of the regression model. Using non-linear regression may, however, be a
better alternative to transforming the data when this option is available in the
statistical packages you are using.

Residuals and residual standard error
A residual value is calculated by taking the difference between the predicted value
and the actual value (see Figure 2). When the residuals are plotted against the
predicted (or actual) data values the plot becomes a powerful diagnostic tool,
enabling patterns and curvature in the data to be recognized (Figure 3). It can also be
used to highlight points of influence (see 'Bias, leverage and outliers' below).
The residual standard error (RSE, also known as the residual standard deviation,
RSD) is a statistical measure of the average residual. In other words, it is an estimate
of the average error (or deviation) about the regression line. The RSE is used to
calculate many useful regression statistics including confidence intervals and outlier
test values.

$$ RSE = s(y) \sqrt{\frac{(n - 1)}{(n - 2)} \left(1 - r^2\right)} $$

where s(y) is the standard deviation of the y values in the calibration, n is the number of
data pairs and r is the least-squares regression correlation coefficient.
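The statistics described so far can be calculated with very little code. The following Python sketch fits Y = a + bX by least squares and obtains the RSE in two ways: directly from the residuals, and from s(y) and r using the formula above. The calibration data and the function names are hypothetical and serve only to show that the two routes agree.

```python
import math
from statistics import mean, stdev


def linear_fit(x, y):
    """Least-squares intercept a, slope b and correlation coefficient r for Y = a + bX."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


def rse_from_residuals(x, y, a, b):
    """RSE from the residuals (actual minus predicted) with n - 2 degrees of freedom."""
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))


def rse_from_r(y, r):
    """RSE from the standard deviation of y and the correlation coefficient."""
    n = len(y)
    return stdev(y) * math.sqrt((n - 1) / (n - 2) * (1 - r ** 2))


# Hypothetical calibration data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

a, b, r = linear_fit(x, y)
rse_direct = rse_from_residuals(x, y, a, b)
rse_formula = rse_from_r(y, r)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.4f}")
print(f"RSE (residuals) = {rse_direct:.4f}, RSE (formula) = {rse_formula:.4f}")
```

Listing the residuals, or plotting them against x, is the natural next step, since the summary statistics alone do not reveal curvature.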
Confidence intervals
As with most statistics, the slope (b) and intercept (a) are estimates based on a finite
sample, so there is some uncertainty in the values. (Note: Strictly, the uncertainty arises
from random variability between sets of data. There may be other uncertainties, such as
measurement bias, but these are outside the scope of this article.) This uncertainty is
quantified in most statistical routines by displaying the confidence limits and other
statistics, such as the standard error and p values. Examples of these statistics are given in
Table 2.

table 1 Pearson's correlation coefficient test.

Degrees of freedom   Confidence level
(n-2)                95% (α = 0.05)   99% (α = 0.01)
2                    0.950            0.990
3                    0.878            0.959
4                    0.811            0.917
5                    0.754            0.875
6                    0.707            0.834
7                    0.666            0.798
8                    0.632            0.765
9                    0.602            0.735
10                   0.576            0.708
11                   0.553            0.684
12                   0.532            0.661
13                   0.514            0.641
14                   0.497            0.623
15                   0.482            0.606
20                   0.423            0.537
30                   0.349            0.449
40                   0.304            0.393
60                   0.250            0.325

Significant correlation when |r| ≥ table value.

[Plot accompanying the table: critical correlation coefficient (r) versus degrees of freedom (n-2) at the 95% and 99% confidence levels.]


The p value is the probability that a value could arise by chance if the true value was
zero. By convention a p value of less than 0.05 indicates a significant non-zero statistic.
Thus, examining the spreadsheet's results (Table 2), we can see that there is no reason to reject the
hypothesis that the intercept is zero, but there is a significant non-zero positive
gradient/relationship. The confidence interval for the regression line can be plotted for all
points along the x-axis and is dumbbell in shape (Figure 2). In practice, this means that the
model is more certain in the middle than at the extremes, which in turn has important
consequences for extrapolating relationships.
When regression is used to construct a calibration model, the calibration graph is used
in reverse (i.e., we predict the X value from the instrument response [Y-value]). This
prediction has an associated uncertainty (expressed as a confidence interval):

$$ X_{predicted} = \frac{\bar{Y} - a}{b} $$

Conf. interval for the prediction is:

$$ X_{predicted} \pm \frac{t \, RSE}{b} \sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(\bar{Y} - \bar{y})^2}{b^2 (n - 1) s(x)^2}} $$

where a is the intercept and b is the slope obtained from the regression equation.
Ȳ is the mean value of the response (e.g., instrument readings) for m replicates (replicates
are repeat measurements made at the same level).
ȳ is the mean of the y data for the n points in the calibration. t is the critical value obtained
from t-tables for n−2 degrees of freedom. s(x) is the standard deviation for the
x data for the n points in the calibration.
RSE is the residual standard error for the calibration.
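The prediction and its confidence interval can be sketched in Python as follows. The function name, the calibration data and the replicate readings are hypothetical; the intercept, slope and RSE quoted are the least-squares values for that hypothetical data set, and the critical t value (4.303 for 2 degrees of freedom at the 95% level) is taken from standard t-tables.

```python
import math
from statistics import mean, stdev


def predict_x_with_interval(y_readings, a, b, rse, x_cal, y_cal, t_crit):
    """Return (x_predicted, half_width) for the mean of m replicate readings."""
    m, n = len(y_readings), len(x_cal)
    y_obs = mean(y_readings)
    x_pred = (y_obs - a) / b
    # (n - 1) * s(x)^2 is the sum of squared deviations of the calibration x values
    spread = (y_obs - mean(y_cal)) ** 2 / (b ** 2 * (n - 1) * stdev(x_cal) ** 2)
    half_width = (t_crit * rse / b) * math.sqrt(1 / m + 1 / n + spread)
    return x_pred, half_width


# Hypothetical calibration: least squares gives a = 0.0, b = 1.9, RSE = sqrt(0.35)
x_cal = [1.0, 2.0, 3.0, 4.0]
y_cal = [2.0, 4.0, 5.0, 8.0]
a, b, rse = 0.0, 1.9, math.sqrt(0.35)
T_CRIT = 4.303  # two-sided 95% value for n - 2 = 2 degrees of freedom

# A reading at the centroid and a reading near the top of the calibration
x_mid, hw_mid = predict_x_with_interval([4.75], a, b, rse, x_cal, y_cal, T_CRIT)
x_top, hw_top = predict_x_with_interval([7.6], a, b, rse, x_cal, y_cal, T_CRIT)
# The same reading near the top, but measured in triplicate
x_rep, hw_rep = predict_x_with_interval([7.6, 7.6, 7.6], a, b, rse, x_cal, y_cal, T_CRIT)

print(f"centroid:    x = {x_mid:.2f} +/- {hw_mid:.2f}")
print(f"top of range: x = {x_top:.2f} +/- {hw_top:.2f}")
print(f"top, m = 3:   x = {x_rep:.2f} +/- {hw_rep:.2f}")
```

The interval is narrowest at the centroid of the calibration and shrinks as the number of replicate readings increases, which is the behaviour described in the text.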
If we want, therefore, to reduce the size of the confidence interval of the prediction
there are several things that can be done.
1. Make sure that the unknown determinations of interest are close to
the centre of the calibration (i.e., close to the values x̄, ȳ [the centroid point]).
This suggests that if we want a small confidence interval at low values of x
then the standards/reference samples used in the calibration should be
concentrated around this region. For example, in analytical chemistry, a typical
pattern of standard concentrations might be 0.05, 0.1, 0.2, 0.4, 0.8, 1.6
(i.e., only one or two standards are used at higher concentrations). While this will lead
to a smaller confidence interval at lower concentrations the calibration model will
be prone to leverage errors (see below).
2. Increase the number of points in the calibration (n). There is, however, little
improvement to be gained by going above 10 calibration points unless
standard preparation and analysis is rapid and cheap.
3. Increase the number of replicate determinations for estimating the
unknown (m). Once again there is a law of diminishing returns, so the
number of replicates should typically be in the range 2 to 5.
4. The range of the calibration can be extended, providing the calibration is still
linear.

figure 1 Correlation coefficients and goodness of fit. [Panels: (a) r = -1; (b) r = 0; (c) r = +1; (d) r = 0; (e) r = 0.99; (f) r = 0.9; (g) r = 0.9.]

figure 2 Calibration graph. [Y plotted against X with the fitted line Y = -0.046 + 0.1124 * X, r = 0.98731; the intercept, slope, residuals, confidence limits for the regression line and confidence limits for the prediction are marked.]

figure 3 Residuals plot. [Residuals plotted against X, with a possible outlier marked.]
Bias, leverage and outliers
Points of influence, which may or may not be outliers, can have a significant effect on
the regression model and, therefore, on its predictive ability. If a point is in the middle
of the model (i.e., close to x̄) but outlying on the Y axis, its effect will be to move the
regression line up or down. The point is then said to have influence because it
introduces an offset (or bias) in the predicted values (see Figure 1(f)). If the
point is towards one of the extreme ends of the plot its effect will be to tilt the
regression line. The point is then said to have high leverage because it acts as a
lever and changes the slope of the regression model (see Figure 1(g)).
Leverage can be a major problem if one or two data points are a long way from all the
other points along the X axis.
A leverage statistic (ranging between 1/n and 1) can be calculated for each value
of x. There is no set value above which this leverage statistic indicates a point of
influence. A value of 0.9 is, however, used by some statistical software packages.

$$ Leverage_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2} $$

where x_i is the x value for which the leverage statistic is to be calculated, n is the
number of points in the calibration and x̄ is the mean of all the x values in the calibration.
To test if a data point (x_i, y_i) is an outlier (relative to the regression model) the following
outlier test can be applied.

$$ Test\ value = \frac{residual_{max}}{RSE \sqrt{1 - \frac{1}{n} - \frac{(Y_i - \bar{y})^2}{(n - 1) s_y^2}}} $$

where RSE is the residual standard error, s_y is the standard deviation of the Y values, Y_i is
the y value, n is the number of points, ȳ is the mean of all the y values in the calibration
and residual_max is the largest residual value.
For example, the test value for the suspected outlier in Figure 3 is 1.78 and the critical
value is 2.37 (Table 3 for 10 data points). Although the point appears extreme, it could
reasonably be expected to arise by chance within the data set.

Extrapolation and interpolation
We have already mentioned that the regression line is subject to some uncertainty and that
this uncertainty becomes greater at the extremes of the line. If we, therefore, try to
extrapolate much beyond the point where we have real data (10%) there may be
relatively large errors associated with the predicted value. Conversely, interpolation near
the middle of the calibration will minimize the prediction uncertainty. It follows, therefore,
that when constructing a calibration graph, the standards should cover a larger range of
concentrations than the analyst is interested in. Alternatively, several calibration graphs
covering smaller, overlapping, concentration ranges can be constructed.

figure 4 Plots of typical instrument response versus concentration. [Panels: (a) response plotted against concentration; (b) residuals plotted against predicted value.]

table 2 Statistics obtained using Excel 5.0 regression analysis function from the data used to generate the calibration graph in Figure 2.

            Coefficients    Standard Error   t Stat         p value       Lower 95%      Upper 95%
Intercept   -0.046000012    0.039648848      -1.160185324   0.279423552   -0.137430479   0.045430455
Slope       0.112363638     0.00638999       17.58432015    1.11755E-07   0.097628284    0.127098992

*Note the large number of significant figures. In fact none of the values above warrant more than 3 significant figures!
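The leverage statistic described under 'Bias, leverage and outliers' is easily calculated. The Python sketch below applies it to the pattern of standard concentrations quoted earlier (0.05 to 1.6), which is the kind of design that is prone to leverage errors. The function name is illustrative, and the 0.9 threshold is simply the value the text says some software packages use.

```python
from statistics import mean


def leverage(x):
    """Leverage statistic for each calibration x value (between 1/n and 1)."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Standard concentrations following the doubling pattern mentioned in the text
concentrations = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
h = leverage(concentrations)

for conc, hi in zip(concentrations, h):
    flag = "  <-- high leverage" if hi > 0.9 else ""
    print(f"x = {conc:5.2f}  leverage = {hi:.3f}{flag}")
```

In this design the highest standard carries by far the largest leverage, so an error in that single standard would tilt the whole calibration line.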


Weighted linear regression and calibration
In analytical science we often find that the precision changes with concentration. In
particular, the standard deviation of the data is proportional to the magnitude of the value
being measured (see Figure 4(a)). A residuals plot will tend to show this relationship even
more clearly (Figure 4(b)). When this relationship is observed (or if the data has been
transformed before regression analysis), weighted linear regression should be used for
obtaining the calibration curve (3). The following description shows how the weighted
regression works. Don't be put off by the equations as most modern statistical software
packages will perform the calculations for you. They are only included in the text for
completeness.
Weighted regression works by giving points known to have a better precision a higher
weighting than those with lower precision. During method validation the way the standard
deviation varies with concentration should have been investigated. This relationship can
then be used to calculate the initial weightings (w_i = 1/s_i²) at each of the n
concentrations in the calibration.
These initial weightings can then be standardized by multiplying by the number
of calibration points divided by the sum of all the weights to give the final weights (W_i).

$$ W_i = \frac{w_i \, n}{\sum_{j=1}^{n} w_j} $$

The regression model generated will be similar to that for non-weighted linear
regression. The prediction confidence intervals will, however, be different.
The weighted prediction (x_w) for a given instrument reading (y) for the regression
model forcing the line through the origin (y = bx) is:

$$ X_{(w)predicted} = \frac{\bar{Y}}{b_{(w)}} $$

with

$$ b_{(w)} = \frac{\sum_{i=1}^{n} W_i x_i y_i}{\sum_{i=1}^{n} W_i x_i^2} $$

where Ȳ is the mean value of the response (e.g., instrument readings) for m
replicates and x_i and y_i are the data pair for the ith point.
By assuming the regression line goes through the origin a better estimate of the
slope is obtained, providing that the assumption of a zero intercept is correct.
This may be a reasonable assumption in some instrument calibrations. However, in
most cases, the regression line will no longer represent the least-squares best
line through the data.

table 3 Outlier test for simple linear least-squares regression.

Sample size   Confidence table-value
(n)           95%    99%
5             1.74   1.75
6             1.93   1.98
7             2.08   2.17
8             2.20   2.23
9             2.29   2.44
10            2.37   2.55
12            2.49   2.70
14            2.58   2.82
16            2.66   2.92
18            2.72   3.00
20            2.77   3.06
25            2.88   3.25
30            2.96   3.36
35            3.02   3.40
40            3.08   3.43
45            3.12   3.47
50            3.16   3.51
60            3.23   3.57
70            3.29   3.62
80            3.33   3.68
90            3.37   3.73
100           3.41   3.78

[Plot accompanying the table: test value versus number of samples (n) at the 95% and 99% confidence levels.]
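The weighting scheme and the through-origin slope described above can be sketched in Python as follows. The function names, the calibration data and the standard deviations are hypothetical; the standard deviations are simply chosen to be proportional to concentration, as in the situation described in the text.

```python
from statistics import mean


def standardized_weights(s):
    """Final weights W_i from the standard deviation s_i at each calibration level."""
    w = [1 / si ** 2 for si in s]          # initial weightings w_i = 1 / s_i^2
    n, total = len(w), sum(w)
    return [wi * n / total for wi in w]    # standardized so that the W_i sum to n


def weighted_slope_through_origin(x, y, W):
    """Weighted least-squares slope b(w) for the model y = bx."""
    numerator = sum(Wi * xi * yi for Wi, xi, yi in zip(W, x, y))
    denominator = sum(Wi * xi ** 2 for Wi, xi in zip(W, x))
    return numerator / denominator


def weighted_prediction(y_readings, b_w):
    """Predicted x for the mean of the replicate instrument readings."""
    return mean(y_readings) / b_w


# Hypothetical calibration with standard deviation proportional to concentration
x = [1.0, 2.0, 4.0]
y = [2.0, 4.0, 8.0]
s = [0.1, 0.2, 0.4]

W = standardized_weights(s)
b_w = weighted_slope_through_origin(x, y, W)
x_pred = weighted_prediction([6.0], b_w)

print("weights:", [round(Wi, 4) for Wi in W])
print(f"b(w) = {b_w:.4f}, predicted x for a reading of 6.0 = {x_pred:.4f}")
```

Note how the lowest standard, which has the best precision, receives the largest weight and therefore has the greatest say in the fitted slope.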


References

The associated uncertainty for the weighted prediction, expressed as a confidence


interval is then:
Conf. interval for the prediction is

X(w)predicted 

t RSE(w)
b(w)

1 
mWi

2
b(w)
Wj xj2
j=1

where t is the critical value obtained from t tables for n2 degrees of freedom at a
stated significance level (typically a = 0.05), Wi is the weighted standard deviation for the
x data for the ith point in the calibration, m is the number of replicates and the weighted
residual.
n

Standard error for the calibration RSE(w) 

2
Wj yj2  b(w)
Wj xj2
j=1
j=1

n 1

Conclusions
Always plot the data. Dont rely on the regression statistics to indicate a linear
relationship. For example, the correlation coefficient is not a reliable measure of
goodness-of-fit.
Always examine the residuals plot. This is a valuable diagnostic tool.
Remove points of influence (leverage, bias and outlying points) only if a reason can be
found for their aberrant behaviour.
Be aware that a regression line is an estimate of the best line through the data and
that there is some uncertainty associated with it. The uncertainty, in the form of a
confidence interval, should be reported with the interpolated result obtained from any
linear regression calibrations.

Acknowledgement
The preparation of this paper was supported under a contract with the Department of
Trade and Industry as part of the National Measurement System Valid Analytical
Measurement Programme (VAM) (11).

(1) G.W. Snedecor and W.G. Cochran, Statistical


Methods, The Iowa State University Press, USA,
6th edition (1967).
(2) N. Draper and H. Smith, Applied Regression
Analysis, John Wiley & Sons Inc., New York,
USA, 2nd edition (1981).
(3) BS ISO 11095: Linear Calibration Using
Reference Materials (1996).
(4) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
(5) A.R. Hoshmand, Statistical Methods for
Environmental and Agricultural Sciences, 2nd
edition, CRC Press (ISBN 0-8493-3152-8)
(1998).
(6) T.J. Farrant, Practical Statistics for the Analytical
Scientist, A Bench Guide, Royal Society of
Chemistry, London, UK (ISBN 0 85404 4226)
(1997).
(7) Statistical Software Qualification: Reference
Data Sets, Eds. B.P. Butler, M.G. Cox, S.L.R.
Ellison and W.A. Hardcastle, Royal Society of
Chemistry, London, UK (ISBN 0-85404-422-1)
(1996).
(8) H. Sahai and R.P. Singh, Virginia J. Sci., 40(1),
59, (1989).
(9) F.J. Anscombe, Graphs in Statistical Analysis, American Statistician, 27, 17–21, February 1973.
(10) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(11) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).

Shaun Burke currently works in the Food
Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).

Missing Values, Outliers,
Robust Statistics &
Non-parametric Methods
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This article, the fourth and final part of our statistics refresher series, looks
at how to deal with messy data that contain transcription errors or extreme
and skewed results.

This is the last article in a series of short papers introducing basic statistical methods of use in analytical science. In the three previous papers (1–3) we have assumed the data to be 'tidy'; that is, normally distributed with no anomalous and/or missing results. In the real world, however, we often need to deal with 'messy' data, for example data sets that contain transcription errors or unexpected extreme results, or that are skewed. How we deal with this type of data is the subject of this article.
Transcription errors
Transcription errors can normally be
corrected by implementing good quality
control procedures before statistical
analysis is carried out. For example, the
data can be independently checked or,
more rarely, the data can be entered, again
independently, into two separate files and
the files compared electronically to
highlight any discrepancies. There are also
a number of outlier tests that can be used
to highlight anomalous values before other
statistics are calculated. These tests do not
remove the need for good quality
assurance; rather they should be seen as
an additional quality check.
Missing data
No matter how well our experiments are
planned there will always be times when
something goes wrong, resulting in gaps in
the data. Some statistical procedures will
not work as well, or at all, with some data
missing. The best recourse is always to
repeat the experiment to generate the
complete data set. Sometimes, however,
this is not feasible, particularly where
readings are taken at set times or the cost
of retesting is prohibitive, so alternative
ways of addressing this problem are needed.
Current statistical software packages
typically deal with missing data by one of
three methods:
Casewise deletion excludes all examples (cases) that have missing data in at least one of the selected variables. For example, in ICP–AAS (inductively coupled plasma–atomic absorption spectroscopy) calibrated with a number of standard solutions containing several metal ions at different concentrations, if the aluminium value were missing for a particular test portion, all the results for that test portion would be disregarded (see Table 1).
This is the usual way of dealing with missing data, but it does not guarantee correct answers. This is particularly so in complex (multivariate) data sets, where it is possible to end up deleting the majority of your data if the missing data are randomly distributed across cases and variables.
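As a minimal sketch of casewise deletion, the Python below applies the rule to the Table 1 data. The dictionary layout, the use of `None` to mark a missing result and the function name `casewise_delete` are illustrative assumptions, not part of any particular statistics package.

```python
# Table 1 data; None marks a missing aluminium result (assumed encoding).
table1 = {
    "Solution 1": {"Al": None, "B": 94.5, "Fe": 578, "Ni": 23.1},
    "Solution 2": {"Al": 567, "B": 72.1, "Fe": 673, "Ni": 7.6},
    "Solution 3": {"Al": None, "B": 34.0, "Fe": 674, "Ni": 44.7},
    "Solution 4": {"Al": 234, "B": 97.4, "Fe": 429, "Ni": 82.9},
}


def casewise_delete(cases):
    """Keep only the cases that have no missing value in any variable."""
    return {
        name: row
        for name, row in cases.items()
        if all(value is not None for value in row.values())
    }


reduced = casewise_delete(table1)
print(sorted(reduced))  # the cases that survive casewise deletion
```

Only Solutions 2 and 4 survive, so half of the boron, iron and nickel results are discarded even though none of them is missing.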

|            | Al      | B    | Fe  | Ni   |
|------------|---------|------|-----|------|
| Solution 1 | missing | 94.5 | 578 | 23.1 |
| Solution 2 | 567     | 72.1 | 673 | 7.6  |
| Solution 3 | missing | 34.0 | 674 | 44.7 |
| Solution 4 | 234     | 97.4 | 429 | 82.9 |

Casewise deletion. Statistical analysis only carried out on the reduced data set.

|            | Al  | B    | Fe  | Ni   |
|------------|-----|------|-----|------|
| Solution 2 | 567 | 72.1 | 673 | 7.6  |
| Solution 4 | 234 | 97.4 | 429 | 82.9 |

table 1 Casewise deletion.

Pairwise deletion can be used as an alternative to casewise deletion in situations where parameters (correlation coefficients, for example) are calculated on successive pairs of variables. For example, in a recovery experiment we may be interested in the correlations between material recovered and extraction time, temperature, particle size, polarity, etc. With pairwise deletion, if one solvent polarity measurement was missing only this single pair would be deleted from the correlation, and the correlations for recovery versus extraction time and particle size would be unaffected (see Table 2).
Pairwise deletion can, however, lead to serious problems. For example, if there is a hidden systematic distribution of missing points then a bias may result when calculating a correlation matrix (i.e., different correlation coefficients in the matrix can be based on different subsets of cases).
Mean substitution replaces all missing data in a variable by the mean value for that variable. Though this looks as if the


data set is now complete, mean substitution has its own disadvantages. The variability in the data set is artificially decreased in direct proportion to the number of missing data points, leading to underestimates of dispersion (the spread of the data). Mean substitution may also considerably change the values of some other statistics, such as linear regression statistics (3), particularly where correlations are strong (see Table 3).
Examples of these three approaches are illustrated in Figure 1, for the calculation of a correlation matrix, where the correlation coefficient (r) (3) is determined for each paired combination of the five variables, A to E. Note how the r value can increase, diminish or even reverse sign depending on which method is chosen to handle the missing data (i.e., the A, B correlation coefficients).
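The sketch below illustrates pairwise deletion and mean substitution in plain Python, using the recovery data of Table 2 and the aluminium results of Table 1. The helper name `pearson_r` and the use of `None` for a missing value are assumptions made for the illustration.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson r using only the pairs where both values are present
    (pairwise deletion); also returns the number of pairs used."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy), len(pairs)


# Table 2 data for Samples 1 to 4; None marks the missing polarity value.
recovery = [93, 105, 99, 73]
extraction_time = [20, 120, 180, 10]
particle_size = [90, 150, 50, 500]
polarity = [None, 1.8, 1.0, 1.5]

r_time, n_time = pearson_r(recovery, extraction_time)
r_size, n_size = pearson_r(recovery, particle_size)
r_pol, n_pol = pearson_r(recovery, polarity)
print(r_time, n_time, r_size, n_size, r_pol, n_pol)

# Mean substitution on the Table 1 aluminium results.
al = [None, 567, None, 234]
observed = [v for v in al if v is not None]
al_filled = [mean(observed) if v is None else v for v in al]
print(al_filled, stdev(observed), stdev(al_filled))
```

The three correlations reproduce the values in Table 2, with the polarity correlation based on three pairs rather than four. The standard deviation of the mean-substituted aluminium column is smaller than that of the two observed results, which is the artificial loss of variability described above.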

|          | Recovery (%) | Extraction time (mins) | Particle size (µm) | Solvent polarity (pKa) |
|----------|--------------|------------------------|--------------------|------------------------|
| Sample 1 | 93           | 20                     | 90                 | missing                |
| Sample 2 | 105          | 120                    | 150                | 1.8                    |
| Sample 3 | 99           | 180                    | 50                 | 1.0                    |
| Sample 4 | 73           | 10                     | 500                | 1.5                    |

Pairwise deletion. Statistical analysis unaffected except for when one of a pair of data points is missing.

|                                            | Recovery vs Extraction time | Recovery vs Particle size | Recovery vs Solvent polarity |
|--------------------------------------------|-----------------------------|---------------------------|------------------------------|
| r                                          | 0.728886                    | -0.87495                  | 0.033942                     |
| (number of data points in the correlation) | (4)                         | (4)                       | (3)                          |

table 2 Pairwise deletion.

|            | Al      | B    | Fe  | Ni   |
|------------|---------|------|-----|------|
| Solution 1 | missing | 94.5 | 578 | 23.1 |
| Solution 2 | 567     | 72.1 | 673 | 7.6  |
| Solution 3 | missing | 34.0 | 674 | 44.7 |
| Solution 4 | 234     | 97.4 | 429 | 82.9 |

Mean substitution. Statistical analysis carried out on pseudo completed data with no allowance made for errors in estimated values.

|            | Al    | B    | Fe  | Ni   |
|------------|-------|------|-----|------|
| Solution 1 | 400.5 | 94.5 | 578 | 23.1 |
| Solution 2 | 567   | 72.1 | 673 | 7.6  |
| Solution 3 | 400.5 | 34.0 | 674 | 44.7 |
| Solution 4 | 234   | 97.4 | 429 | 82.9 |

table 3 Mean substitution.

Box 1: Imputation (4,5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form an imputed value is substituted for the missing value (e.g., mean substitution, already discussed above, is a form of imputation). In its more general/systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data are randomly distributed and the model used to predict the imputed values is sensible.

Extreme values, stragglers and outliers
Extreme values are defined as observations in a sample, so far separated in value from the remainder as to suggest that they may be from a different population, or the result of an error in measurement (6). Extreme values can also be subdivided into stragglers, extreme values detected between the 95% and 99% confidence levels; and outliers, extreme values at > 99% confidence level.
It is tempting to remove extreme values automatically from a data set, because they can alter the calculated statistics, e.g., increase the estimate of variance (a measure of spread), or possibly introduce a bias in the calculated mean. There is one golden rule however: no value should be removed from a data set on statistical grounds alone. Statistical grounds include outlier testing.
Outlier tests tell you, on the basis of some simple assumptions, where you are most likely to have a technical error; they do not tell you that the point is wrong. No matter how extreme a value is in a set of data, the suspect value could nonetheless be a correct piece of information (1). Only with experience or the identification of a particular cause can data be declared wrong and removed.
So, given that we understand that the tests only tell us where to look, how do we test for outliers? If we have good grounds for believing our data is normally distributed then a number of outlier tests (sometimes called Q-tests) are available that identify extreme values in an objective way (7,8). Good grounds for believing the data is normal are:
• past experience of similar data
• passing normality tests, for example, the Kolmogorov–Smirnov–Lilliefors test, Shapiro–Wilk test, skewness test, kurtosis test (7,9), etc.
• plots of the data, e.g., frequency histogram, normal probability plots (1,7).
Note that the tests used to check normality usually require a significant amount of data (a minimum of 10–15 results are recommended depending on the normality test applied). For this reason there will be many examples in analytical science where either it will be impractical to carry out such tests, or the tests will not tell us anything meaningful.
If we are not sure the data set is normally distributed then robust statistics and/or non-parametric (distribution independent) tests can be applied to the data. These three approaches (outlier tests, robust estimates and non-parametric methods) are examined in more detail below.

[Figure 1: data for the five variables A to E (15 cases) and the resulting correlation matrices. Panels: No missing data (15 cases); Casewise deletion (only 5 cases remain); Pairwise deletion (variable number of cases). In the figure, highlighted values mark data removed to show the effects of missing data, and "mean" marks mean values replacing missing data.]

Outlier tests
In analytical chemistry it is rare that we
have large numbers of replicate data, and
small data sets often show fortuitous
grouping and consequent apparent
outliers. Outlier tests should, therefore, be
used with care and, of course, identified
data points should only be removed if a
technical reason can be found for their
aberrant behaviour.
Most outlier tests look at some measure
of the relative distance of a suspect point
from the mean value. This measure is then
assessed to see if the extreme value could
reasonably be expected to have arisen by
chance. Most of the tests look for single
extreme values (Figure 2(a)), but
sometimes it is possible for several
outliers to be present in the same data
set. These can be identified in one of two ways:
• by iteratively applying the outlier test
• by using tests that look for pairs of extreme values, i.e., outliers that are masking each other (see Figure 2(b) and 2(c)).
Note, as a rule of thumb, if more than 20% of the data are identified as outlying you should start to question your assumption about the data distribution and/or the quality of the data collected.
The appropriate outlier tests for the three situations described in Figure 2 are: 2(a) Grubbs' 1, Dixon or Nalimov; 2(b) Grubbs' 2; and 2(c) Grubbs' 3.
We will concentrate on the three Grubbs' tests (7). The test values are calculated using the formulae below, after the data are arranged in ascending order.

[Figure 1, continued: Mean substitution (15 cases) panel. At the 95% confidence level, significant correlations are indicated at critical r values of 0.514 (15 cases), 0.576 (12 cases), 0.602 (11 cases) and 0.632 (10 cases); a further critical value of 0.950 is listed without a legible case count.]
figure 1 Effect of missing data on a correlation matrix.

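[Figure 2: dot plots of data sets showing (a) a single outlier at one end of the data or the other, (b) an outlier at each end and (c) a pair of outliers at one end or the other.]
figure 2 Outliers and masking.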

$$G_1 = \frac{\left|\bar{x} - x_i\right|}{s} \qquad G_2 = \frac{x_n - x_1}{s} \qquad G_3 = 1 - \frac{(n-3)\, s_{n-2}^2}{(n-1)\, s^2}$$

where s is the standard deviation for the whole data set, x_i is the suspected single outlier, i.e., the value furthest away from the mean, | | is the modulus (the value of a calculation ignoring the sign of the result), x̄ is the mean, n is the number of data points, x_n and x_1 are the most extreme values, and s_{n−2} is the standard deviation for the data set excluding the suspected pair of outlier values, i.e., the pair of values furthest away from the mean.
If the test values (G1, G2, G3) are greater
than the critical value obtained from tables
(see Table 4) then the extreme value(s) are
unlikely to have occurred by chance at the
stated confidence level (see Box 2).
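The three Grubbs' test values can be sketched in a few lines of Python. The function name `grubbs_statistics` is an assumption for this illustration; the data are the 13 replicates of Box 2 and the critical values are the 95% entries for n = 13 from Table 4. For G3 the sketch tries both candidate pairs (two lowest, two highest) and keeps the pair that leaves the smaller remaining variance.

```python
from statistics import mean, stdev, variance


def grubbs_statistics(data):
    """Return (G1, G2, G3) for a data set of at least four values."""
    x = sorted(data)
    n = len(x)
    x_bar, s = mean(x), stdev(x)
    # G1: distance of the most extreme single value from the mean, in units of s.
    g1 = max(abs(x_bar - x[0]), abs(x_bar - x[-1])) / s
    # G2: range of the data in units of s (one suspect value at each end).
    g2 = (x[-1] - x[0]) / s
    # G3: suspect pair at one end; use the pair whose removal leaves the
    # smaller variance in the remaining n - 2 values.
    s2_remaining = min(variance(x[2:]), variance(x[:-2]))
    g3 = 1 - ((n - 3) * s2_remaining) / ((n - 1) * s ** 2)
    return g1, g2, g3


replicates = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
              48.559, 48.634, 48.711, 49.005, 49.166, 49.484]
g1, g2, g3 = grubbs_statistics(replicates)

critical_95 = (2.331, 4.00, 0.6705)  # Table 4, n = 13, 95% confidence level
flags = [g > c for g, c in zip((g1, g2, g3), critical_95)]
print(round(g1, 2), round(g2, 2), round(g3, 3), flags)
```

Working from unrounded intermediate values gives G3 of about 0.586 rather than the 0.587 obtained in Box 2 from rounded inputs; the conclusion is unchanged, since none of the three test values exceeds its critical value.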
Pitfalls of outlier tests
Figure 3 shows three situations where
outlier tests can misleadingly identify an
extreme value.
Figure 3(a) shows a situation common in
chemical analysis. Because of limited
measurement precision (rounding errors) it
is possible to end up comparing a result
which, no matter how close it is to the
other values, is an infinite number of
standard deviations away from the mean
of the remaining results. This value will
therefore always be flagged as an outlier.
In Figure 3(b) there is a genuine long tail
on the distribution that may cause
successive outlying points to be identified.
This type of distribution is surprisingly
common in some types of chemical
analysis, e.g., pesticide residues.
If there is very little data (Figure 3(c)) an
outlier can be identified by chance. In this
situation it is possible that the identified
point is closer to the true value and it is
the other values that are the outliers. This
occurs more often than we would like to
admit; how many times do your procedures
state 'average the best two out of three
determinations'?
Outliers by variance
When the data are from different groups (for example when comparing test methods via interlaboratory comparison) it is not only possible for individual points within a group to be outlying but also for the group means to have outliers with respect to each other. Another type of outlier that can occur is when the spread of data within one particular group is unusually small or large when compared with the spread of the other groups (see Figure 4).
• The same Grubbs' tests that are used to determine the presence of within group outlying replicates may also be used to test for suspected outlying means.
• Cochran's test can be used to test for the third case, that of a suspected outlying variance.
To carry out Cochran's test, the suspect variance is compared with the sum of all group variances. (The variance is a measure of spread and is simply the square of the standard deviation (1).)

$$C_{\bar{n}} = \frac{\text{suspected } s^2}{\sum_{i=1}^{g} s_i^2} \qquad \text{where } g \text{ is the number of groups and } \bar{n} = \frac{\sum_{i=1}^{g} n_i}{g}$$

If this calculated ratio, C_n̄, exceeds the critical value obtained from statistical tables (7) then the suspect group spread is extreme. The value of n̄ is the average number of sample results produced per group.
Cochran's test assumes the number of replicates within the groups is the same or at least similar (± 1). It also assumes that none of the data have been rounded and there are sufficient numbers of replicates to get a reasonable estimate of the variance. Cochran's test should not be used iteratively as this could lead to a large percentage of data being removed (see Box 3).
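A minimal Python sketch of the Cochran's test calculation is given here, using the laboratory standard deviations of Box 3. The function name `cochran_c` is an assumption for the illustration, and the critical value is the one quoted in Box 3 rather than one calculated by the code.

```python
def cochran_c(std_devs):
    """Ratio of the largest group variance to the sum of all group variances."""
    variances = [s ** 2 for s in std_devs]
    return max(variances) / sum(variances)


# Standard deviations reported by the 13 laboratories in Box 3.
lab_sds = [0.202, 0.402, 0.332, 0.236, 0.318, 0.452, 0.210,
           0.074, 0.525, 0.067, 0.609, 0.246, 0.198]
total_results = 85

n_bar = round(total_results / len(lab_sds))  # average replicates per group
c = cochran_c(lab_sds)
critical_95 = 0.23  # quoted in Box 3 for n = 7 and g = 13

print(n_bar, round(c, 3), c > critical_95)
```

The test value of 0.252 exceeds the critical value, so the laboratory with the largest standard deviation is flagged for investigation, in agreement with Box 3.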
Robust statistics
Robust statistics include methods that are largely unaffected by the presence of extreme
values. The most commonly used of these statistics are as follows:
Median: The median is a measure of central tendency (1) and can be used instead of the mean. To calculate the median (x̃) the data are arranged in order of magnitude and the median is then the central member of the series (or the mean of the two central members when there is an even number of data, i.e., there are equal numbers of observations smaller and greater than the median). For a symmetrical distribution the mean and median have the same value.

$$\tilde{x} = \begin{cases} x_m & \text{when } n \text{ is odd } (1, 3, 5, \ldots) \\ \dfrac{x_m + x_{m+1}}{2} & \text{when } n \text{ is even } (2, 4, 6, \ldots) \end{cases} \qquad \text{where } m = \text{round up}\left(\frac{n}{2}\right)$$

Box 2: Grubbs' tests (worked example).
13 replicates are ordered in ascending order (x_1 = 47.876, x_n = 49.484):
47.876 47.997 48.065 48.118 48.151 48.211 48.251 48.559 48.634 48.711 49.005 49.166 49.484
n = 13, mean = 48.479, s = 0.498, s²_{n−2} = 0.123

$$G_1 = \frac{49.484 - 48.479}{0.498} = 2.02 \qquad G_2 = \frac{49.484 - 47.876}{0.498} = 3.23 \qquad G_3 = 1 - \frac{10 \times 0.123}{12 \times 0.498^2} = 0.587$$

Grubbs' critical values for 13 values are G1 = 2.331 and 2.607, G2 = 4.00 and 4.24, G3 = 0.6705 and 0.7667 for the 95% and 99% confidence levels. Since the test values are less than their respective critical values, in all cases, it can be concluded there are no outlying values.

Median Absolute Deviation (MAD): The MAD value is an estimate of the spread in the data similar to the standard deviation. For n values:

$$\mathrm{MAD} = \mathrm{median}\left(\left|x_i - \tilde{x}\right|\right), \quad i = 1, 2, \ldots, n$$

where x̃ is the median. If the MAD value is scaled by a factor of 1.483 it becomes comparable with a standard deviation; this is the MAD_E value:

$$\mathrm{MAD_E} = 1.483 \times \mathrm{MAD}$$
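The median and MAD_E are simple to compute. The sketch below uses the standard-library `statistics.median` and, purely as an illustration, the 13 replicates of Box 2; the function name `mad` is an assumption.

```python
from statistics import median, stdev


def mad(data):
    """Median absolute deviation: the median of |x_i - median(x)|."""
    centre = median(data)
    return median([abs(x - centre) for x in data])


replicates = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
              48.559, 48.634, 48.711, 49.005, 49.166, 49.484]

x_median = median(replicates)
mad_value = mad(replicates)
mad_e = 1.483 * mad_value  # scaled to be comparable with a standard deviation

print(x_median, round(mad_value, 3), round(mad_e, 3), round(stdev(replicates), 3))
```

For these data the median is 48.251 and MAD_E is about 0.457, close to the classical standard deviation of 0.498, as would be expected for a data set in which no outliers were found.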

| n   | G(1) 95% | G(2) 95% | G(3) 95% | G(1) 99% | G(2) 99% | G(3) 99% |
|-----|----------|----------|----------|----------|----------|----------|
| 3   | 1.153    | 2.00     | ---      | 1.155    | 2.00     | ---      |
| 4   | 1.463    | 2.43     | 0.9992   | 1.492    | 2.44     | 1.0000   |
| 5   | 1.672    | 2.75     | 0.9817   | 1.749    | 2.80     | 0.9965   |
| 6   | 1.822    | 3.01     | 0.9436   | 1.944    | 3.10     | 0.9814   |
| 7   | 1.938    | 3.22     | 0.8980   | 2.097    | 3.34     | 0.9560   |
| 8   | 2.032    | 3.40     | 0.8522   | 2.221    | 3.54     | 0.9250   |
| 9   | 2.110    | 3.55     | 0.8091   | 2.323    | 3.72     | 0.8918   |
| 10  | 2.176    | 3.68     | 0.7695   | 2.410    | 3.88     | 0.8586   |
| 12  | 2.285    | 3.91     | 0.7004   | 2.550    | 4.13     | 0.7957   |
| 13  | 2.331    | 4.00     | 0.6705   | 2.607    | 4.24     | 0.7667   |
| 15  | 2.409    | 4.17     | 0.6182   | 2.705    | 4.43     | 0.7141   |
| 20  | 2.557    | 4.49     | 0.5196   | 2.884    | 4.79     | 0.6091   |
| 25  | 2.663    | 4.73     | 0.4505   | 3.009    | 5.03     | 0.5320   |
| 30  | 2.745    | 4.89     | 0.3992   | 3.103    | 5.19     | 0.4732   |
| 35  | 2.811    | 5.026    | 0.3595   | 3.178    | 5.326    | 0.4270   |
| 40  | 2.866    | 5.150    | 0.3276   | 3.240    | 5.450    | 0.3896   |
| 50  | 2.956    | 5.350    | 0.2797   | 3.336    | 5.650    | 0.3328   |
| 60  | 3.025    | 5.500    | 0.2450   | 3.411    | 5.800    | 0.2914   |
| 70  | 3.082    | 5.638    | 0.2187   | 3.471    | 5.938    | 0.2599   |
| 80  | 3.130    | 5.730    | 0.1979   | 3.521    | 6.030    | 0.2350   |
| 90  | 3.171    | 5.820    | 0.1810   | 3.563    | 6.120    | 0.2147   |
| 100 | 3.207    | 5.900    | 0.1671   | 3.600    | 6.200    | 0.1980   |
| 110 | 3.239    | 5.968    | 0.1553   | 3.632    | 6.268    | 0.1838   |
| 120 | 3.267    | 6.030    | 0.1452   | 3.662    | 6.330    | 0.1716   |
| 130 | 3.294    | 6.086    | 0.1364   | 3.688    | 6.386    | 0.1611   |
| 140 | 3.318    | 6.137    | 0.1288   | 3.712    | 6.437    | 0.1519   |


Other robust statistical estimates include
trimmed mean and deviations, Winsorized
mean and deviation, least median of
squares (robust regression), Levene's test
(heterogeneity in ANOVA), etc. A
discussion of robust statistics in analytical
chemistry can be found elsewhere (10, 11).
Non-parametric tests
Typical statistical tests incorporate
assumptions about the underlying
distribution of data (such as normality),
and hence rely on distribution parameters.
Non-parametric tests are so called
because they make few or no assumptions
about the distributions, and do not rely on
distribution parameters. Their chief
advantage is improved reliability when the
distribution is unknown. There is at least
one non-parametric equivalent for each
parametric type of test (see Table 5). In a
short article, such as this, it is impossible to
describe the methodology for all these
tests but more information can be found in
other publications (12, 13).
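To give a flavour of how a rank-based test avoids distribution parameters, the sketch below computes the Mann–Whitney U statistic for two independent groups. The function name `mann_whitney_u` and both data sets are hypothetical, invented for this illustration only; the sketch stops at the test statistic, which would then be compared with tabulated critical values (12, 13).

```python
def mann_whitney_u(a, b):
    """Return the Mann-Whitney U statistic (the smaller of U_a and U_b)."""
    # For every pair (x from a, y from b), score 1 if x > y, 0.5 for a tie.
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)


# Hypothetical results from two analytical methods (illustrative values only).
old_method = [10.2, 10.5, 10.1, 10.8]
new_method = [10.9, 11.2, 10.7, 11.5]

u = mann_whitney_u(old_method, new_method)
print(u)
```

Only the ordering of the results enters the calculation, so a single extreme value has no more influence on U than any other value on the same side of the comparison. A small U indicates little overlap between the two groups.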
Conclusions
• Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
• Delete extreme values only when a technical reason for their aberrant behaviour can be found.
• Missing data can result in misinterpretation of the resulting statistics so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.

table 4 Grubbs critical value table (5).

Box 3: Cochran's test (worked example).
An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:
Std. Dev.: 0.202 0.402 0.332 0.236 0.318 0.452 0.210 0.074 0.525 0.067 0.609 0.246 0.198

$$\bar{n} = \frac{85}{13} = 6.54 \approx 7$$

$$C_{\bar{n}} = \frac{0.609^2}{0.202^2 + 0.402^2 + \ldots + 0.246^2 + 0.198^2} = \frac{0.371}{1.474} = 0.252$$

Cochran's critical value for n̄ = 7 and g = 13 is 0.23 at the 95% confidence level (7).
As the test value is greater than the critical value it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates and this laboratory's results therefore need to be investigated further. It is normal practice in inter-laboratory comparisons not to test for low variance outliers, i.e., laboratories reporting unusually precise results.


• Outlier tests assume the data distribution is known. This assumption should be checked for validity before these tests are applied.
• Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
• When knowledge about the underlying data distribution is limited, non-parametric methods should be used.

NB: It should be noted that following a judgement in a US court, the Food and Drug Administration (FDA), in a guide, "Guide to inspection of pharmaceutical quality control laboratories", has specifically prohibited the use of outlier tests.

[Figure 4: box and whisker plot of analyte concentration against laboratory ID, with one laboratory marked as an outlying variance and one as an outlying mean.]
figure 4 Different types of outlier in grouped data.

| Types of comparison                            | Parametric methods                                        | Non-parametric methods (12, 13) |
|------------------------------------------------|-----------------------------------------------------------|---------------------------------|
| Differences between independent groups of data | t-test for independent groups (2); ANOVA/MANOVA (2)       | Wald–Wolfowitz runs test; Mann–Whitney U test; Kolmogorov–Smirnov two-sample test; Kruskal–Wallis analysis of ranks; Median test |
| Differences between dependent groups of data   | t-test for dependent groups (2); ANOVA with replication (2) | Sign test; Wilcoxon's matched pairs test; McNemar's test; χ² (Chi-square) test; Friedman's two-way ANOVA; Cochran Q test |
| Relationships between continuous variables     | Linear regression (3); Correlation coefficient (3)        | Spearman R; Kendall Tau         |
| Homogeneity of variance                        | Bartlett's test (7)                                       | Levene's test, Brown & Forsythe |
| Relationships between counted variables        |                                                           | coefficient Gamma; χ² (Chi-square) test; Phi coefficient; Fisher exact test; Kendall coefficient of concordance |

table 5 Non-parametric alternatives to parametric statistical tests.

Acknowledgement
The preparation of this paper was supported under a contract with the UK's Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (14).

References
(1) S. Burke, Scientific Data Management 1(1), 32–38, 1997.
(2) S. Burke, Scientific Data Management 2(1), 36–41, 1998.
(3) S. Burke, Scientific Data Management 2(2), 32–40, 1998.
(4) J.L. Schafer, Monographs on Statistics and Applied Probability 72: Analysis of Incomplete Multivariate Data, Chapman & Hall (1997), ISBN 0-412-04061-1.
(5) R.J.A. Little & D.B. Rubin, Statistical Analysis With Missing Data, John Wiley & Sons (1987), ISBN 0-471-80243-9.
(6) ISO 3534. Statistics: Vocabulary and Symbols. Part 1: Probability and general statistical terms, section 2.64. Geneva 1993.
(7) T.J. Farrant, Practical statistics for the analytical scientist: A bench guide, Royal Society of Chemistry 1997 (ISBN 0 85404 442 6).
(8) V. Barnett & T. Lewis, Outliers in Statistical Data, 3rd Edition, John Wiley (1994).
(9) William H. Kruskal & Judith M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers, 1978. ISBN 0-02-917960-2.
(10) Analytical Methods Committee, Robust Statistics: How Not to Reject Outliers Part 2. Analyst 1989, 114, 1693–7.
(11) D.C. Hoaglin, F. Mosteller & J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons (1983), ISBN 0-471-09777-2.
(12) M. Hollander & D.A. Wolfe, Non-parametric statistical methods, Wiley & Sons, New York 1973.
(13) W.W. Daniel, Applied non-parametric statistics, Houghton Mifflin, Boston 1978.
(14) M. Sargent, VAM Bulletin, Issue 13, 4–5, Autumn. Laboratory of the Government Chemist, 1995.