
Understanding the Structure of Scientific Data

Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This is the first in a series of articles that aims to promote the better use of statistics by scientists. The series intends to show everyone from bench chemists to laboratory managers that the application of many statistical methods does not require the services of a statistician or a mathematician to convert chemical data into useful information. Each article will be a concise introduction to a small subset of methods. Wherever possible, diagrams will be used and equations kept to a minimum; for those wanting more theory, references to relevant statistical books and standards will be included. By the end of the series, the scientist should have an understanding of the most common statistical methods and be able to perform the tests while avoiding the pitfalls inherent in their misapplication.

We begin with graphical data analysis (i.e., exploratory data analysis) and with the calculation of the basic summary statistics (the mean and sample standard deviation). These two processes, which increase our understanding of the data structure, are vital if the correct selection of more advanced statistical methods and interpretation of their results are to be achieved. From that base we will progress to significance testing (t-tests and the F-test). These statistics allow a comparison between two sets of results in an objective and unbiased way. For example, significance tests are useful when comparing a new analytical method with an old method, or when comparing the current day's production with that of the previous day.

Exploratory Data Analysis

Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that shed light on the structure of the data. Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation.

The statistics and graphs referred to in this first section are applicable to a single column of data (i.e., univariate data), such as the number of analyses performed in a laboratory each month. For small amounts of data (<15 points), a blob plot (also known as a dot plot) can be used to explore how the data set is distributed (Figure 1). Blob plots are constructed simply by drawing a line, marking it off with a suitable scale and plotting the data along the axis.

A stem-and-leaf plot is yet another method for examining patterns in a data set. These are complex to describe and perceived as old fashioned, especially with the modern graphical packages available today; for the sake of completeness they are described in Box 1.

For larger data sets, frequency histograms (Figure 2(a)) and box-and-whisker plots (Figure 2(b)) may be better options to display the data distribution. Once the data set is entered or, as is more usual with modern instrumentation, electronically imported, most modern PC statistical packages can construct these plots at the click of a mouse. All of these plots can give an indication of the presence or absence of outliers (1). The frequency histogram, stem-and-leaf plot and blob plot can also indicate the type of distribution the data belong to. It should be remembered that if the data set is from a non-normal (2) distribution (Figure 2(a) and possibly Figure 1(a)), what looks like an outlier may in fact be a good piece of information. The outliers are the most extreme points on the right-hand side of Figures 1(a) and 2(a). Note: outliers, outlier tests and robust methods will be the subject of a later article.

Assuming there are no obvious outliers, we still have to do one more plot to make sure we understand the data structure. The individual results should be plotted against a time index (i.e., the order in which the data were obtained). If trends or patterns are observed (Figures 3(a)-3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)); if this is not the case, the interpretation of the statistics can be erroneous.

[Figure 1: Blob plots (a) and (b), each with the scale and the mean marked.]

Summary Statistics

Summary statistics are used to make sense of large amounts of data. Typically, the mean, sample standard deviation, range, confidence intervals, quantiles (1), and measures of the skewness and spread/peakedness of the distribution (kurtosis) are reported (2). The mean and sample standard deviation are the most widely used and are discussed below, together with confidence intervals for normally distributed data.

Box 1: Stem-and-leaf plots

A stem-and-leaf plot is another method of examining patterns in a data set. They show the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting each value into the stem (the leading digits; in the figure below, from 0.1 to 0.6) and the leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note, the decimal places are truncated and not rounded in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range). Thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) enclosed in parentheses.

Stem-and-leaf plot (units = 0.1, count = 42, 1|2 = 0.12)

      5    1|22677
     14    2|112224578
    (15)   3|000011122333355
     13    4|0047889
      6    5|56669
      1    6|3


The Mean

The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. It is easily found using a calculator or spreadsheet, and simply involves summing the individual results (x1, x2, x3, ..., xn) and dividing by the number of results (n):

    mean = (x1 + x2 + x3 + ... + xn) / n

i.e., the sum of the xi, for i = 1 to n, divided by n.


The Standard Deviation

The standard deviation is a measure of the spread of the data (dispersion) about the mean and can again be calculated using a calculator or spreadsheet. There is, however, a slight added complication: if you look at a typical scientific calculator you will notice there are two types of standard deviation key

[Figure 2(b): Box-and-whisker plot, showing the median, the lower and upper quartile values, whiskers extending 1.5 x the interquartile range* from the quartiles, and an outlier. *The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.]

[Figure 3(a)-(c): Plots of magnitude against time showing trends or patterns in the data (n = 7, mean = 6, standard deviation = 2.16; n = 9, mean = 6, standard deviation = 2.65).]

It is tempting to use the mean as an estimate of the true value (μ) of whatever is being measured, without considering the underlying distribution. This is a mistake. Before any statistic is calculated, it is important that the raw data are carefully scrutinized and plotted as described above. An outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

[Figure 3(d): Magnitude against time for data randomly distributed about the mean (n = 9, mean = 6, standard deviation = 1.80; n = 9, mean = 6, standard deviation = 2.06).]

[Figure 4: The normal distribution curve, the mean and standard deviation - 68% of results lie within 1 standard deviation of the mean, 95% within 2 and 99.7% within 3.]

labelled with the symbols σn and σn-1 (or σ and s). The correct one to use depends upon how the problem is framed. For example, suppose each batch of a chemical contains 10 sub-units. You are asked to analyse each sub-unit, in a single batch, for mercury contamination and report the mean mercury content and standard deviation. Now, if the mean and standard deviation are to be used solely with this analysed batch, then the 10 results represent the whole population (i.e., all are tested) and the correct standard deviation to use is the one for a population (σn). If, however, the intended use of the results is to estimate the mercury content of batches of the

[Figure 5: Blob plots illustrating significance testing, each panel comparing data sets (i) and (ii). (a) means numerically different but, statistically, the same (would 'pass' the t-test, tcrit > tcalculated); (b) means significantly different (would 'fail' the t-test, tcrit < tcalculated); (c) too few data points to be sure (would 'pass' the t-test, tcrit > tcalculated); (d) practically identical means, but with so many data points there is a small but statistically significant ('real') difference (would 'fail' the t-test, tcrit < tcalculated); (e) spread in the data, as measured by the variance, is similar (would 'pass' the F-test, Fcrit > Fcalculated); (f) variances different (would 'fail' the F-test, Fcrit < Fcalculated), hence (i) gives more consistent results than (ii); (g) too few points to say for sure (would 'pass' the F-test, Fcrit > Fcalculated).]

chemical, the 10 results then represent a sample from the whole population and the correct standard deviation to use is that for a sample (σn-1). If you are using a statistical package you should always check that the correct standard deviation is being calculated for your particular problem.
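As a quick check of which standard deviation your software computes, the two definitions can be compared directly. This is a minimal sketch in Python; the mercury values are invented purely for illustration:

```python
import statistics

# Hypothetical mercury results (g/kg) for the 10 sub-units of one batch
results = [0.51, 0.48, 0.53, 0.50, 0.49, 0.52, 0.47, 0.50, 0.51, 0.49]

mean = statistics.fmean(results)

# Population standard deviation (sigma_n): summed squared deviations
# divided by n - use when the 10 results ARE the whole population.
sd_population = statistics.pstdev(results)

# Sample standard deviation (sigma_n-1, or s): divided by n - 1 - use
# when the 10 results are a sample estimating a wider population.
sd_sample = statistics.stdev(results)

print(mean, sd_population, sd_sample)
```

The sample value is always slightly larger than the population value, since dividing by n - 1 instead of n inflates the estimate to allow for the extra uncertainty.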

The Normal Distribution

If the distribution is normal (i.e., when the data are plotted they approximate to the curve shown in Figure 4) then the mean is located at the centre of the distribution. Sixty-eight per cent of the results will be contained within 1 standard deviation of the mean, 95% within 2 standard deviations and 99.7% within 3 standard deviations.

Using the above facts it is possible to estimate a standard deviation from a stated confidence interval and, vice versa, a confidence interval from a standard deviation. For example, if a mean value of 0.72 ± 0.02 g/L at the 95% confidence level is quoted, then it follows that the standard deviation = 0.02/2 or 0.01 g/L. If the same figure was quoted at the 99.7% confidence level, the standard deviation would be 0.02/3 or 0.0067 g/L.
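The conversion in this example is simple arithmetic; a sketch of both directions:

```python
# Convert a quoted confidence half-width to a standard deviation and back,
# using the 2-sigma (95%) and 3-sigma (99.7%) rules from Figure 4.
mean = 0.72           # g/L, the quoted mean
half_width_95 = 0.02  # g/L, the +/- term at the 95% confidence level

sd_from_95 = half_width_95 / 2    # 95% is about 2 standard deviations
sd_from_997 = half_width_95 / 3   # if 0.02 had been quoted at 99.7%

# Going the other way: a 99.7% half-width from the estimated sd
half_width_997 = 3 * sd_from_95

print(sd_from_95, sd_from_997, half_width_997)
```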

Significance Testing

Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, by simply looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. In many situations, however, the difference between two sets of data is not so clear. The application of significance tests gives us a more systematic way of assessing the results, with the added advantage of allowing us to express our conclusion with a stated degree of confidence.

What does significance mean?

In statistics the words significant and significance have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to occur because of purely random variation.

As previously mentioned, whether one set of results is significantly different from another depends not only on the magnitude of the difference in the means but also on the amount of data available. This is illustrated by the


blob plots shown in Figure 5. For the two data sets shown in Figure 5(a), the means for set (i) and set (ii) are numerically different; from the limited amount of information available, however, they are from a statistical point of view the same. For Figure 5(b), the means for set (i) and set (ii) are probably different, but when fewer data points are available (Figure 5(c)) we cannot be sure with any degree of confidence that the means are different, even if they are a long way apart. With a large number of data points, even a very small difference can be significant (Figure 5(d)). Similarly, when we are interested in comparing the spread of results (for example, when we want to know if method (i) gives more consistent results than method (ii)), we have to take note of the amount of information available (Figures 5(e)-(g)).

It is fortunate that tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance. These are critical t-values for differences between means, and critical F-values for differences between the spread of results (4).

Note: significance is a function of sample size. Comparing very large samples will

Table 1: Jargon used in significance testing.

Alternative hypothesis (H1): the hypothesis that there is a real difference (i.e., there is a difference between the means [see two-tailed], or mean1 is greater than or less than mean2 [see one-tailed]).

Critical value (tcrit or Fcrit): the value at a given confidence level against which the result of applying a significance test is compared.

Null hypothesis (H0): the hypothesis that there is no real difference (i.e., there is no difference between the two means [mean1 = mean2]).

One-tailed: a one-tailed t-test answers the question when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value >=, >, <, or <= is used in the alternate hypothesis). In these cases the calculation to determine the t-value is the same as that for the two-tailed t-test, but the critical value is different.

Population: all of the items or measurements under consideration (e.g., 2500 lots from a single batch of a certified reference material).

Sample: a group of items or measurements taken from the population (e.g., 25 lots of a certified reference material taken from a batch containing 2500 lots).

Two-tailed: a two-tailed t-test is performed if the analyst is interested in any change, for example, is method A different from method B (i.e., 'not equal to' is used in the alternate hypothesis). Under most circumstances two-tailed t-tests should be performed.

almost always show up a difference, but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?

What is a t-test?

A t-test is a statistical procedure that can be used to compare mean values. A lot of jargon surrounds these tests (see Table 1 for definitions of the terms used below) but they are relatively simple to apply using the built-in functions of a spreadsheet such as Excel, or a statistical software package. Using a calculator is also an option, but you have to know the correct formula to apply (see Table 2) and have access to statistical tables to look up the so-called critical values (4).

Three worked examples are shown in Box 2 (5) to illustrate how the different t-tests are carried out and how to interpret the results.

What is an F-test?

An F-test compares the spread of results in two data sets to determine if they could reasonably be considered to come from the same parent distribution. The test can, therefore, be used to answer questions such as: are two methods equally precise? The measure of spread used in the F-test is the variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., the variance of one set of data is divided by the variance of the other) to get the test value:

    F = s1² / s2²

This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1 - 1) and (n2 - 1) degrees of freedom, at the appropriate level of confidence. [Note: it is usual to arrange s1 and s2 so that F > 1.] If the standard deviations are to be considered to come from the same population, then Fcrit > F. As an example we use the data in Example 2 (see Box 2):

    F = 2.750² / 1.471² = 3.49

Fcrit = 9.60 for (4, 4) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated, we can conclude that the spread of results in the two data sets is not significantly different, and it is, therefore, reasonable to combine the two standard deviations as we have done.
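The F-test above can be sketched in a few lines of Python; the critical value 9.60 still comes from tables for (4, 4) degrees of freedom at the 97.5% confidence level:

```python
# F-test comparing the spread (variance) of two data sets,
# using the standard deviations from Example 2 (Box 2).
s1 = 2.750  # larger standard deviation (Method 2)
s2 = 1.471  # smaller standard deviation (Method 1)

# Ratio the variances, arranged so that F > 1
F = s1**2 / s2**2

F_crit = 9.60  # from tables: (4, 4) degrees of freedom, 97.5% confidence

same_population = F_crit > F
print(round(F, 2), same_population)
```

Since F is about 3.49 and Fcrit is 9.60, the two spreads are consistent with a single parent distribution.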

What is a p-value?

When you use statistical software packages and some spreadsheet functions, the results of performing a significance test are often summarized as a p-value. The p-value represents an inverse index of the reliability of the statistic (i.e., the probability of error in accepting the observed result as valid). Thus, if we are comparing two means to see if they are different, a p-value of 0.10 is equivalent to saying we are 90% certain that the means are different; 0.05 is equivalent to saying we are 95% certain; and 0.01 that we are 99% certain, i.e., [(1 - p) x 100%]. It is usual when analysing chemical data (but somewhat arbitrary) to say that p-values <= 0.05 are statistically significant.
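The p-value to confidence conversion is just (1 - p) x 100%; a minimal sketch:

```python
# Convert p-values from a significance test into the equivalent
# percentage confidence that a real difference exists: (1 - p) x 100%.
def confidence_percent(p):
    return (1 - p) * 100

for p in (0.10, 0.05, 0.01):
    significant = p <= 0.05  # the usual (but somewhat arbitrary) cut-off
    print(p, round(confidence_percent(p), 1), significant)
```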

Some assumptions behind significance testing

In most statistical tests it is assumed that the sample correctly represents the population and that the population follows a normal distribution. Although these assumptions are never complied with precisely, in a large number of situations where laboratory data are being used they are not grossly violated.

Conclusions

- Always plot your data and understand the patterns in it before calculating any statistic, even the arithmetic mean.
- Make sure the correct standard deviation is calculated for your particular circumstance. This will nearly always be the sample standard deviation (σn-1).
- Significance tests are used to compare, in an unbiased way, the means or spread (variance) of two data sets.
- The tests are easily performed using statistical routines in spreadsheets and statistical packages.
- The p-value is a measure of confidence in the result obtained when applying a significance test.

Acknowledgement

The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (6).

References

(1) ISO 3534 part 1: Statistics - Vocabulary and Symbols. Part 1: Probability and General Statistical Terms (1993).
(2) BS 2846 part 7: Tests for Departure from Normality (1984).
(3) Estimation Relating to Means and Variances (1976).
(4) D.V. Lindley and W.F. Scott, New Cambridge Elementary Statistical Tables (ISBN: 0 521 48485 5), Cambridge University Press (1995).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry (1997).
(6) M. Sargent, VAM Bulletin, Issue 13, 45 (Laboratory of the Government Chemist, Teddington, UK), Autumn 1995.

Shaun Burke currently works in the Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).

Bibliography

1. G.B. Wetherill, Elementary Statistical Methods, Chapman and Hall, London, UK.
2. J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
3. J.W. Tukey, Exploratory Data Analysis, Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry, London, UK (1997).

Table 2: t-test equations.

Comparing a sample mean with a stated or true value:

    t = (x̄ - μ) / (s / √n)

Paired samples (one-tailed, where the sign is important):

    t = d̄ √n / sd

Paired samples (two-tailed):

    t = |d̄| √n / sd

Independent sample means with equal variances, using the combined standard deviation sc:

    t = (x̄1 - x̄2) / (sc √(1/n1 + 1/n2))

    sc = √( (s1²(n1 - 1) + s2²(n2 - 1)) / (n1 + n2 - 2) )

Independent sample means with unequal variances:

    t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)

where: x̄ is the sample mean, μ is the population mean, s is the standard deviation for the sample, n is the number of items in the sample, |d̄| is the absolute mean difference between pairs, d̄ is the mean difference between pairs, sd is the sample standard deviation for the pairs, x̄1 and x̄2 are two independent sample means, and n1 and n2 are the number of items making up each sample.

Note: the degrees of freedom (ν) used for looking up the critical t-value for independent sample means with unequal variances is given by

    ν = 1 / ( k²/(n1 - 1) + (1 - k)²/(n2 - 1) ),  where  k = (s1²/n1) / (s1²/n1 + s2²/n2)

Box 2

Example 1

A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:

The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.

Using the equation for comparing a sample mean with a stated value (Table 2), we calculate the t-value as below. Note, the calculated t-value is the ratio of the difference between the means to a measure of the spread (standard deviation) and the amount of data available (n):

    t = (23.5 - 22.7) / (0.9 / √10) = 2.81

Next, we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value, we need to know three pieces of information:

(i) Are we interested in the direction of the difference between the two means, or only that there is a difference? That is, are we performing a one-tailed or two-tailed t-test (see Table 1)? In the case above it is the latter; therefore, the two-tailed critical value is used.

(ii) The degrees of freedom: this is simply the number of data points minus one (n - 1).

(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.

Table 3: Results from two methods for determining the concentration of selenium.

Method 1: 4.2  4.5  6.8  7.2  4.3   (mean = 5.40, standard deviation = 1.471)
Method 2: 9.2  4.0  1.9  5.2  3.5   (mean = 4.76, standard deviation = 2.750)

The critical t-value (4) is 2.262 at the 95% confidence level for 9 degrees of freedom. As tcalculated > tcrit, we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods.

[Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are fit for purpose. The significance test is only one piece of information to be considered.]
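Example 1's calculation can be reproduced in a few lines; the critical value 2.262 for 9 degrees of freedom would still come from tables:

```python
import math

# t-test comparing a sample mean against a long-term (reference) mean,
# using the figures from Example 1 in Box 2.
reference_mean = 22.7  # g/L, long-term check-sample mean (old method)
sample_mean = 23.5     # g/L, new method mean
s = 0.9                # g/L, standard deviation of the new method
n = 10                 # number of results

t = (sample_mean - reference_mean) / (s / math.sqrt(n))

t_crit = 2.262  # two-tailed, 95% confidence, n - 1 = 9 degrees of freedom
print(round(t, 2), t > t_crit)  # t = 2.81: reject the null hypothesis
```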

Example 2 (5)

Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3. Using the t-test for independent sample means, we define the null hypothesis H0 as x̄1 = x̄2, i.e., there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different, then we can combine (or pool) the standard deviations to give sc. (If the sample standard deviations are significantly different, then the t-test for unequal variances should be used (Table 2).)

    sc = √( (1.471² x (5 - 1) + 2.750² x (5 - 1)) / (5 + 5 - 2) ) = 2.205

Evaluating the test statistic t:

    t = (5.40 - 4.76) / (2.205 x √(1/5 + 1/5)) = 0.64 / (2.205 x 0.632) = 0.64 / 1.395 = 0.459

The critical t-value (4) at the 95% confidence level is 2.306 for n1 + n2 - 2 = 8 degrees of freedom. This exceeds the calculated value of 0.459; thus the null hypothesis (H0) cannot be rejected, and we conclude there is no significant difference between the means of the results given by the two methods.
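A sketch of Example 2's pooled t-test in Python, computed directly from the raw data in Table 3:

```python
import math
import statistics

# Independent-samples t-test with pooled (combined) standard deviation,
# using the selenium data from Table 3.
method1 = [4.2, 4.5, 6.8, 7.2, 4.3]
method2 = [9.2, 4.0, 1.9, 5.2, 3.5]

n1, n2 = len(method1), len(method2)
s1, s2 = statistics.stdev(method1), statistics.stdev(method2)

# Pooled standard deviation
sc = math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))

t = (statistics.fmean(method1) - statistics.fmean(method2)) / (
    sc * math.sqrt(1 / n1 + 1 / n2))

t_crit = 2.306  # two-tailed, 95% confidence, n1 + n2 - 2 = 8 df (tables)
print(round(sc, 3), round(t, 3), t < t_crit)  # H0 cannot be rejected
```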

Example 3 (5)

Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods, several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots, and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.

The test is a two-tailed test, as we are interested in a difference between the methods in either direction, so the paired t-test equation (Table 2) is used. The mean of the paired differences is d̄ = -0.475 (|d̄| = 0.475), and the standard deviation of the paired differences is sd = 0.700.

Evaluating the test statistic t:

    t = |d̄| √n / sd = (0.475 x √8) / 0.700 = 1.918

The critical t-value for n - 1 = 7 degrees of freedom, at the 95% confidence level, is 2.365. Since the calculated value is less than the critical value, H0 cannot be rejected, and it follows that there is no significant difference between the two techniques.
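Example 3's paired t-test, sketched from the data in Table 4:

```python
import math
import statistics

# Paired (two-tailed) t-test on the vitamin data from Table 4:
# each matrix is measured once by Method A and once by Method B.
method_a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
method_b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]

d = [a - b for a, b in zip(method_a, method_b)]  # paired differences
n = len(d)

d_bar = statistics.fmean(d)  # mean difference (-0.475)
s_d = statistics.stdev(d)    # sd of the differences (0.700)

t = abs(d_bar) * math.sqrt(n) / s_d

t_crit = 2.365  # two-tailed, 95% confidence, n - 1 = 7 df (tables)
print(round(t, 3), t < t_crit)  # t = 1.918: H0 cannot be rejected
```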

Table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs.

Matrix   Method A (mg/g)   Method B (mg/g)   Difference (d)
1        2.52              3.17              -0.65
2        3.13              5.00              -1.87
3        4.33              4.03               0.30
4        2.25              2.38              -0.13
5        2.79              3.68              -0.89
6        3.04              2.94               0.10
7        2.19              2.83              -0.64
8        2.16              2.18              -0.02

Analysis of Variance

Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

Statistics can help to unlock the information contained in analytical data. This second part in our statistics refresher series looks at one of the most frequently used of these tools: Analysis of Variance (ANOVA). In the previous paper we examined the initial steps in describing the structure of the data and explained a number of alternative significance tests (1). In particular, we showed that t-tests can be used to compare the results from two analytical methods or chemical processes. In this article, we will expand on the theme of significance testing by showing how ANOVA can be used to compare the results from more than two sets of data at the same time, and how it is particularly useful in analysing data from designed experiments.

With the advent of spreadsheets with built-in statistical functions and affordable dedicated statistical software packages, Analysis of Variance (ANOVA) has become relatively simple to carry out. This article will therefore concentrate on how to select the correct variant of the ANOVA method, the advantages of ANOVA, how to interpret the results and how to avoid some of the pitfalls. For those wanting more detailed theory than is given in the following section, several texts are available (2-5).

A bit of ANOVA theory

Whenever we make repeated measurements there is always some variation. Sometimes this variation (known as within-group variation) makes it difficult for analysts to see if there have been significant changes between different groups of replicates. For example, in Figure 1 (which shows the results from four replicate analyses by 12 analysts), we can see that the total variation is a combination of the spread of results within groups and the spread between the mean values (between-group variation). The statistic that measures the within- and between-group variations in ANOVA is called the sum of squares, and it often appears in the output tables abbreviated as SS. It can be shown that the different sums of squares calculated in ANOVA are equivalent to variances (1). The total sum of squares in an experiment can be divided into the components caused by random error, given by the within-group (or sample) SS, and the components resulting from differences between means. It is these latter components that are used to test for statistical significance using a simple F-test (1).

Why not use multiple t-tests instead of ANOVA?

Why should we use ANOVA in preference to carrying out a series of t-tests? This is best explained using an example. Suppose we want to compare the results from 12 analysts taking part in a training exercise. If we were to use t-tests, we would need to calculate 66 t-values, one for each pair of analysts. Not only is this a lot of work, but the chance of reaching a wrong conclusion increases with the number of tests performed. The correct way to analyse this sort of data is to use one-way ANOVA.
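The 66 comes from the number of ways of pairing 12 analysts, and the inflation of the false-positive risk can also be sketched (assuming, as a simplification, independent tests each at the 95% confidence level):

```python
import math

analysts = 12

# Number of pairwise t-tests needed to compare every analyst with every other
n_tests = math.comb(analysts, 2)  # 12 choose 2 = 66

# If each test is run at the 95% confidence level and treated as independent,
# the chance of at least one false positive across all 66 tests is large:
p_at_least_one_error = 1 - 0.95 ** n_tests

print(n_tests, round(p_at_least_one_error, 2))
```

This is why a single one-way ANOVA, which controls the overall error rate in one test, is preferred.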

One-way ANOVA

One-way ANOVA will answer the question: is there a significant difference between the mean values (or levels), given that the means are calculated from a number of replicate observations? 'Significant' here refers to a spread of means that would not normally arise from the chance variation within groups. We have already seen an example of this type of problem in Figure 1, which shows the results from 12 different analysts analysing the same material. Using these data and a spreadsheet, the results obtained from carrying out one-way ANOVA are reported in Example 1. In this example, the ANOVA shows there are significant differences between analysts (Fvalue > Fcrit at the 95% confidence level). This result is obvious from a plot of the data (Figure 1), but in many situations a visual inspection of a plot will not give such a clear-cut result. Notice that the output also includes a p-value (see the 'Interpretation of the result(s)' section, which follows).

Note: ANOVA cannot tell us which individual mean or means are different from the consensus value, nor in what direction they deviate. The most effective way to show this is to plot the data (Figure 1) or, alternatively but less effectively, to carry out a multiple comparison test such as Scheffe's test (2). It is also important to make sure the right questions are being asked and that the right data are being captured. In Example 1, it is possible that the time difference between the analysts carrying out the determinations is the reason for the difference in the mean values. This example shows how good experimental design procedures could have prevented ambiguity in the conclusions.

Example 1: One-way ANOVA on the results from four replicate analyses by each of 12 analysts (A_1 to A_12).

              A_1     A_2     A_3     A_4     A_5     A_6
Replicate 1   34.1    35.84   36.67   40.54   41.19   41.22
Replicate 2   34.1    36.58   37.33   40.67   40.29   39.61
Replicate 3   34.69   31.3    36.96   40.81   40.99   37.89
Replicate 4   34.6    34.19   36.83   40.78   40.4    36.67

              A_7     A_8     A_9     A_10    A_11    A_12
Replicate 1   42.5    40.71   39.2    39.75   36.04   44.36
Replicate 2   42.3    40.91   39.3    39.69   37.03   45.73
Replicate 3   42.5    40.8    39.3    39.23   36.85   45.25
Replicate 4   42.5    38.42   39.3    39.73   36.24   45.34

Source of Variation   SS         df   MS         F          P-value   F crit
Between Groups        438.7988   11   39.8908    40.31545   6.6E-17   2.066606
Within Groups         35.6208    36   0.989467

(Note: the data table has been split into two sections (A_1 to A_6, A_7 to A_12) for display purposes. The ANOVA is carried out on a single table.)

The P-value is < 0.05 (Fvalue > Fcrit at the 95% confidence level for 11 and 36 degrees of freedom), therefore it can be concluded that there is a significant difference between the analysts' results.
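The F value in the one-way ANOVA table is just the ratio of the two mean squares; reproducing it from the SS and df values above:

```python
# Recompute the one-way ANOVA F statistic from the sums of squares (SS)
# and degrees of freedom (df) reported in Example 1.
ss_between, df_between = 438.7988, 11
ss_within, df_within = 35.6208, 36

ms_between = ss_between / df_between  # mean square between groups
ms_within = ss_within / df_within     # mean square within groups

F = ms_between / ms_within

F_crit = 2.066606  # 95% confidence, (11, 36) degrees of freedom
print(round(F, 2), F > F_crit)  # significant difference between analysts
```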

Example 2: Two-way ANOVA.

The analysis of tinned ham was carried out at three temperatures (415, 435 and 460 °C) and three times (30, 60 and 90 minutes). Three analyses determining protein yield were made at each temperature and time. The measurements are summarized in the diagram below, and the results of the two-way ANOVA are given in the table.

[Diagram: protein yield (values around 27 units) measured in triplicate at each combination of temperature (415, 435 and 460 °C) and time (30, 60 and 90 min).]

Source of Variation      SS         df   MS         F          P-value    F crit
Sample (=Time)           0.000867   2    0.000433   0.100429   0.904952   3.554561
Columns (=Temperature)   0.049689   2    0.024844   5.75794    0.011667   3.554561
Interaction              0.087644   4    0.021911   5.078112   0.006437   2.927749
Within                   0.077667   18   0.004315
Total                    0.215867   26

Note: in the above example, the spreadsheet (Excel) labels the sources of variation as Sample, Columns, Interaction and Within. Sample = Time, Columns = Temperature, Interaction is the interaction between temperature and time, and Within is a measure of the within-group variation.

Two-way ANOVA

In a typical experiment things can be more

complex than described previously. For

example, in Example 2 the aim is to find

out if time and/or temperature have any

effect on protein yield when analysing

samples of tinned ham. When analysing

data from this type of experiment we use

two-way ANOVA. Two-way ANOVA can

test the significance of each of two

experimental variables (factors or

treatments) with respect to the response,

such as an instrument's output. When

replicate measurements are made we can

also examine whether or not there are

significant interactions between variables.

An interaction is said to be present when

the response being measured changes

more than can be explained from the

change in level of an individual factor. This

is illustrated in Figure 2 for a process with

two factors (Y and Z) when both factors

are studied at two levels (low and high). In

Figure 2(b), the changes in response

caused by Y depend on Z, and vice versa.

In two-way ANOVA we ask the

following questions:

Is there a significant interaction between

the two factors (variables)?

Does a change in any of the factors

affect the measured result?

It is important to check the answers in the

right order: Figure 3 illustrates the

decision process. In the case of Example

2 the questions are:

Is there an interaction between

temperature and time which affects the

protein yield?

Does time and/or temperature affect the

protein yield?

Using the built-in functions of a

spreadsheet (in this case Excel's data analysis tool "two-factor analysis with replication") we see that there is a

significant interaction between time and

temperature and a significant effect of

temperature alone (both p-value < 0.05

and F > Fcrit). Following the process

outlined in Figure 3, we consider the

interaction question first by comparing the

mean squares (MS) for the within-group

variation with the interaction MS. This is

reported in the results table of Example 2.

F = 0.021911/0.004315 = 5.078

If the interaction is significant (F > Fcrit),

as in this case, then the individual factors

(time and temperature) should each be

compared with the MS for the interaction

(not the within-group MS) thus:

Ftemp = 0.024844/0.021911 = 1.134

Fcrit = 6.944, for 2 and 4 degrees of freedom (at the 95% confidence level)
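These F-ratios and critical values can be checked in a few lines of code; the sketch below uses scipy's F distribution (a modern convenience, not part of the original article).

```python
from scipy.stats import f

# Mean squares taken from the two-way ANOVA table in Example 2
ms_temp, ms_inter, ms_within = 0.024844, 0.021911, 0.004315

# Step 1: test the interaction against the within-group mean squares
F_inter = ms_inter / ms_within        # ~5.078
F_crit_inter = f.ppf(0.95, 4, 18)     # ~2.93, so the interaction is significant

# Step 2: the interaction is significant, so each factor is compared
# with the interaction mean squares rather than the within-group MS
F_temp = ms_temp / ms_inter           # ~1.134
F_crit_factor = f.ppf(0.95, 2, 4)     # ~6.94, so temperature is not significant here
```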

To reiterate the interpretation of ANOVA

results, a calculated F-value that is greater

than Fcrit for a stated level of confidence

(typically 95%) means that the difference

being tested is statistically significant at

that level. As an alternative to using the F-values, the p-value can be used to indicate

the degree of confidence we have that

there is a significant difference between

means (i.e., (1-p) * 100 is the percentage

confidence). Normally a p-value of less than 0.05 is considered to denote a significant difference.

Note: Extrapolation of ANOVA results is

not advisable, so in Example 2 for instance,

it is impossible to say if a time of 15 or 120

minutes would lead to a measurable effect

on protein yield. It is, therefore, always

more economic in the long run to design

the experiment in advance, in order to

cover the likely ranges of the parameter(s)

of interest.

In other words, there is no significant difference between the interaction of time and

temperature with respect to either of the individual factors, and, therefore, the interaction

of temperature with time is worth further investigation. If one or both of the individual

factors were significant compared with the interaction, then the individual factor or factors

would dominate and for all practical purposes any interaction could be ignored.

If the interaction term is not significant then it can be considered to be another small

error term and can thus be pooled with the within-group (error) sums of squares term. It is

the pooled value (SS2pooled) that is then used as the denominator in the F-test to

determine if the individual factors affect the measured results significantly. To combine the

sums of squares the following formula is used:

s²pooled = (SSinter + SSwithin) / (dofinter + dofwithin)

where dofinter and dofwithin are the degrees of freedom for the interaction term and error term, and SSinter and SSwithin are the sums of squares for the interaction term and error term, respectively (dofpooled = dofinter + dofwithin).
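The pooling arithmetic is straightforward. The sketch below reuses Example 2's sums of squares purely to show the calculation (in Example 2 itself the interaction is significant, so pooling would not actually be applied).

```python
# Sums of squares and degrees of freedom for the interaction and error terms
ss_inter, dof_inter = 0.087644, 4
ss_within, dof_within = 0.077667, 18

# Pooled mean square and pooled degrees of freedom
ms_pooled = (ss_inter + ss_within) / (dof_inter + dof_within)
dof_pooled = dof_inter + dof_within
```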

Selecting the ANOVA method

One-way ANOVA should be used when there is only one factor being considered and

replicate data from changing the level of that factor are available. Two-way ANOVA (with

or without replication) is used when there are two factors being considered. If no replicate

data are collected then the interactions between the two factors cannot be calculated.

Higher level ANOVAs are also available for looking at more than two factors.
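For a balanced design, the sums of squares behind a two-way ANOVA with replication can also be computed directly. A minimal numpy sketch, using synthetic data rather than the article's, is:

```python
import numpy as np

# data[i, j, k] = replicate k at level i of factor A and level j of factor B
rng = np.random.default_rng(1)
data = 27.1 + rng.normal(0, 0.05, size=(3, 3, 3))  # synthetic protein yields

grand = data.mean()
n_i, n_j, n_k = data.shape

ss_total = ((data - grand) ** 2).sum()
ss_a = n_j * n_k * ((data.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_b = n_i * n_k * ((data.mean(axis=(0, 2)) - grand) ** 2).sum()
ss_cells = n_k * ((data.mean(axis=2) - grand) ** 2).sum()
ss_inter = ss_cells - ss_a - ss_b        # interaction sum of squares
ss_within = ss_total - ss_cells          # within-group (error) sum of squares
```

The decomposition can be checked by confirming that the component sums of squares add up to the total.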

Avoiding some of

the pitfalls using ANOVA

In ANOVA it is assumed that the data for

each variable are normally distributed.

Usually in ANOVA we don't have a large

amount of data so it is difficult to prove

any departure from normality. It has been

shown, however, that even quite large

deviations do not affect the decisions

made on the basis of the F-test.

A more important assumption in ANOVA is that the variance (spread) between groups is homogeneous (homoscedastic). If this is not the case (this often happens in chemistry, see Figure 1) then the F-test can suggest a statistically significant difference where none is present. The best way to avoid this pitfall is, as ever, to plot the data. There are also formal tests for homogeneity of variance (Bartlett's test (5) and Levene's test (2)). It may be possible to overcome this type of problem by transforming the data, such as by taking logs (7).

If the variability within a group is correlated with its mean value then ANOVA may not be appropriate and/or it may indicate the presence of outliers in the data (Figure 4). Cochran's test (5) can be used to test for variance outliers.

Advantages of ANOVA

Compared with using multiple t-tests, one-way and two-way ANOVA require fewer measurements to discover significant effects (i.e., the tests are said to have more power). This is one reason why ANOVA is used frequently when analysing data from statistically designed experiments.

Other ANOVA and multivariate ANOVA (MANOVA) methods exist for more complex experimental situations but a description of these is beyond the scope of this introductory article. More details can be found in reference 6.

[Figure 1: each analyst's replicate results (A1 to A12) plotted about the overall mean, with the total standard deviation indicated.]

[Figure 2: response plots for two factors, Y and Z, each at low and high levels: (a) no interaction; (b) an interaction, where the change in response caused by Y depends on Z, and vice versa.]

[Figure 3: decision flowchart for two-way ANOVA. Start by comparing the within-group mean squares with the interaction mean squares. If the difference is significant (F > F crit), compare the individual factor mean squares with the interaction mean squares; if not, pool the within-group and interaction sums of squares and compare the individual factor mean squares with the pooled mean squares.]

[Figure 4: within-group variance plotted against mean value.]
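The homogeneity-of-variance tests mentioned in this section are available in modern libraries; the sketch below shows Bartlett's and Levene's tests via scipy (Cochran's test is not in scipy), on made-up groups.

```python
from scipy.stats import bartlett, levene

# Replicate results for three hypothetical groups
g1 = [39.7, 40.1, 39.9, 39.8]
g2 = [37.0, 36.8, 37.3, 36.9]
g3 = [45.7, 45.3, 45.5, 45.6]

stat_b, p_b = bartlett(g1, g2, g3)   # sensitive to departures from normality
stat_l, p_l = levene(g1, g2, g3)     # more robust alternative
# p < 0.05 would suggest the group variances are not homogeneous
```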

Conclusions

ANOVA is a powerful tool for

determining if there is a statistically

significant difference between two or

more sets of data.

One-way ANOVA should be used

when we are comparing several sets

of observations.

Two-way ANOVA is the method

used when there are two separate

factors that may be influencing a result.

Except for the smallest of data sets

ANOVA is best carried out using a

spreadsheet or statistical software

package.

You should always plot your data to

make sure the assumptions ANOVA is

based on are not violated.

Acknowledgements

The preparation of this paper was

supported under a contract with the UK

Department of Trade and Industry as part

of the National Measurement System Valid

Analytical Measurement Programme (VAM)

(8).

References

(1) S. Burke, Scientific Data Management 1(1),

32–38, September 1997.

(2) G.A. Milliken and D.E. Johnson, Analysis of

Messy Data, Volume 1: Designed Experiments,

Van Nostrand Reinhold Company, New York,

USA (1984).

(3) J.C. Miller and J.N. Miller, Statistics for

Analytical Chemistry, Ellis Horwood PTR

Prentice Hall, London, UK (ISBN 0 13 0309907).

(4) C. Chatfield, Statistics for Technology,

Chapman & Hall, London, UK (ISBN 0412

25340 2).

(5) T.J. Farrant, Practical Statistics for the Analytical

Scientist, A Bench Guide, Royal Society of

Chemistry, London, UK (ISBN 0 85404 442 6)

(1997).

(6) K.V. Mardia, J.T. Kent and J.M. Bibby,

Multivariate Analysis, Academic Press Inc. (ISBN

0 12 471252 5) (1979).

(7) ISO 4259: 1992. Petroleum Products Determination and Application of Precision

Data in Relation to Methods of Test. Annex E,

International Organisation for Standardisation,

Geneva, Switzerland (1992).

(8) M. Sargent, VAM Bulletin, Issue 13, 45,

Laboratory of the Government Chemist

(Autumn 1995).

Technology Department of RHM Technology

Ltd, High Wycombe, Buckinghamshire, UK.

However, these articles were produced while

he was working at LGC, Teddington,

Middlesex, UK (http://www.lgc.co.uk).

Regression

and Calibration

Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

regression. This third paper in our statistics refresher series concentrates on

the practical applications of linear regression and the interpretation of the

regression statistics.

consistency of measurement. Often

calibration involves establishing the

relationship between an instrument

response and one or more reference

values. Linear regression is one of the most

frequently used statistical methods in

calibration. Once the relationship between

the input value and the response value

(assumed to be represented by a straight

line) is established, the calibration model is

used in reverse; that is, to predict a value

from an instrument response. In general,

regression methods are also useful for

establishing relationships of all kinds, not

just linear relationships. This paper

concentrates on the practical applications

of linear regression and the interpretation

of the regression statistics. For those of you

who want to know about the theory of

regression there are some excellent

references (1–6).

For anyone intending to apply linear

least-squares regression to their own data,

it is recommended that a statistics/graphics

package is used. This will speed up the

production of the graphs needed to

confirm the validity of the regression

statistics. The built-in functions of a

spreadsheet can also be used if the

routines have been validated for accuracy

(e.g., using standard data sets (7)).

What is regression?

In statistics, the term regression is used to

describe a group of methods that

summarize the degree of association

between one variable (or set of variables)

and another variable (or set of variables).

The most common statistical method used

works by finding the best curve through

the data that minimizes the sums of

squares of the residuals. The important

term here is the best curve, not the

method by which this is achieved. There

are a number of least-squares regression

models, for example, linear (the most

common type), logarithmic, exponential

and power. As already stated, this paper

will concentrate on linear least-squares

regression.

[You should also be aware that there are

other regression methods, such as ranked

regression, multiple linear regression, nonlinear regression, principal-component

regression, partial least-squares regression,

etc., which are useful for analysing instrument

or chemically derived data, but are beyond

the scope of this introductory text.]

What do the linear least-squares

regression statistics mean?

Correlation coefficient: Whether you use a

calculator's built-in functions, a

spreadsheet or a statistics package, the

first statistic most chemists look at when

performing this analysis is the correlation

coefficient (r). The correlation coefficient ranges from −1, a perfect negative relationship, through zero (no relationship), to +1, a perfect positive relationship (Figures 1(a–c)). The correlation coefficient

is, therefore, a measure of the degree of

linear relationship between two sets of

data. However, the r value is open to

misinterpretation (8) (Figures 1(d) and (e),

show instances in which the r values alone

would give the wrong impression of the

underlying relationship). Indeed, it is possible for quite different data sets to yield identical regression statistics (r value,

residual sum of squares, slope and

intercept), but still not satisfy the linear

assumption in all cases (9). It, therefore,

remains essential to plot the data in order

to check that linear least-squares statistics

are appropriate.

As in the t-tests discussed in the first

paper (10) in this series, the statistical

significance of the correlation coefficient is

dependent on the number of data points.

To test if a particular r value indicates a

statistically significant relationship we can

use Pearson's correlation coefficient

test (Table 1). Thus, if we only have four

points (for which the number of degrees of

freedom is 2) a linear least-squares

correlation coefficient of 0.94 will not be

significant at the 95% confidence level.

However, if there are more than 60 points

an r value of just 0.26 (r2 = 0.0676) would

indicate a significant, but not very strong,

positive linear relationship. In other words,

a relationship can be statistically significant

but of no practical value. Note that the test

used here simply shows whether two sets

are linearly related; it does not prove

linearity or adequacy of fit.
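The same check can be made directly from the p-value reported by a statistics library. A sketch using scipy's pearsonr, with illustrative data:

```python
from scipy.stats import pearsonr

# Four data points give only n - 2 = 2 degrees of freedom
x = [1, 2, 3, 4]
y = [1.1, 1.9, 3.2, 3.8]

r, p = pearsonr(x, y)
# Equivalent check against Table 1: with 2 degrees of freedom the 95%
# critical value is 0.950, so |r| must exceed 0.950 to be significant
significant_95 = abs(r) > 0.950
```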

It is also important to note that a

significant correlation between one

variable and another should not be taken

as an indication of causality. For example,

there is a negative correlation between

time (measured in months) and catalyst

performance in car exhaust systems.

However, time is not the cause of the

deterioration, it is the build up of sulfur

and phosphorus compounds that

gradually poisons the catalyst. Causality is,

RSE = s(y) × √[ (1 − r²)(n − 1) / (n − 2) ]

in fact, very difficult to prove unless the

chemist can vary systematically and

independently all critical parameters, while

measuring the response for each change.

Slope and intercept

In linear regression the relationship

between the X and Y data is assumed to

be represented by a straight line, Y = a +

bX (see Figure 2), where Y is the estimated

response/dependent variable, b is the slope

(gradient) of the regression line and a is

the intercept (Y value when X = 0). This

straight-line model is only appropriate if

the data approximately fits the assumption

of linearity. This can be tested for by

plotting the data and looking for curvature

(e.g., Figure 1(d)) or by plotting the

residuals against the predicted Y values or

X values (see Figure 3).

Although the relationship may be known

to be non-linear (i.e., follow a different

functional form, such as an exponential

curve), it can sometimes be made to fit the

linear assumption by transforming the data

in line with the function, for example, by

taking logarithms or squaring the Y and/or

X data. Note that if such transformations

are performed, weighted regression

(discussed later) should be used to obtain

an accurate model. Weighting is required

because of changes in the residual/error

structure of the regression model. Using

non-linear regression may, however, be a

better alternative to transforming the data

when this option is available in the

statistical packages you are using.

Residuals and residual standard error

A residual value is calculated by taking the

difference between the predicted value

and the actual value (see Figure 2). When

the residuals are plotted against the

predicted (or actual) data values the plot

becomes a powerful diagnostic tool,

enabling patterns and curvature in the data

to be recognized (Figure 3). It can also be

used to highlight points of influence (see

Bias, leverage and outliers overleaf).

The residual standard error (RSE, also

known as the residual standard deviation,

RSD) is a statistical measure of the average

residual. In other words, it is an estimate

of the average error (or deviation) about

the regression line. The RSE is used to

calculate many useful regression statistics

including confidence intervals and outlier

test values.

where s(y) is the standard deviation of the y values in the calibration, n is the number of

data pairs and r is the least-squares regression correlation coefficient.
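The RSE formula can be cross-checked against the direct definition (the square root of the residual sum of squares divided by n − 2). A numpy/scipy sketch with illustrative calibration data:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical calibration data
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.02, 0.24, 0.44, 0.69, 0.88, 1.13])

fit = linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

n = len(x)
# RSE from the formula above
rse = y.std(ddof=1) * np.sqrt((1 - fit.rvalue**2) * (n - 1) / (n - 2))
# Direct definition: sqrt(sum of squared residuals / (n - 2))
rse_direct = np.sqrt((residuals**2).sum() / (n - 2))
```

The two calculations agree, which is a useful sanity check when validating spreadsheet routines.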

Confidence intervals

As with most statistics, the slope (b) and intercept (a) are estimates based on a finite

sample, so there is some uncertainty in the values. (Note: Strictly, the uncertainty arises

from random variability between sets of data. There may be other uncertainties, such as

measurement bias, but these are outside the scope of this article.) This uncertainty is

quantified in most statistical routines by displaying the confidence limits and other

statistics, such as the standard error and p values. Examples of these statistics are given in

Table 2.

Table 1: Critical values of the Pearson correlation coefficient.

Degrees of freedom (n−2)   95% (α = 0.05)   99% (α = 0.01)
 2                         0.950            0.990
 3                         0.878            0.959
 4                         0.811            0.917
 5                         0.754            0.875
 6                         0.707            0.834
 7                         0.666            0.798
 8                         0.632            0.765
 9                         0.602            0.735
10                         0.576            0.708
11                         0.553            0.684
12                         0.532            0.661
13                         0.514            0.641
14                         0.497            0.623
15                         0.482            0.606
20                         0.423            0.537
30                         0.349            0.449
40                         0.304            0.393
60                         0.250            0.325

[Figure: the critical correlation coefficient (r) at the 95% confidence level plotted against degrees of freedom (n−2).]

The p value is the probability that a value could arise by chance if the true value was

zero. By convention a p value of less than 0.05 indicates a significant non-zero statistic.

Thus, examining the spreadsheet's results, we can see that there is no reason to reject the hypothesis that the intercept is zero, but there is a significant non-zero positive gradient/relationship. The confidence intervals for the regression line can be plotted for all points along the x-axis and are dumbbell shaped (Figure 2). In practice, this means that the

model is more certain in the middle than at the extremes, which in turn has important

consequences for extrapolating relationships.

When regression is used to construct a calibration model, the calibration graph is used

in reverse (i.e., we predict the X value from the instrument response [Y-value]). This

prediction has an associated uncertainty (expressed as a confidence interval)

Xpredicted = (Y̅ − a)/b ± (t × RSE / b) × √[ 1/m + 1/n + (Y̅ − ȳ)² / (b²(n − 1)s(x)²) ]

where a is the intercept and b is the slope obtained from the regression equation, Y̅ is the mean value of the response (e.g., instrument readings) for m replicates (replicates are repeat measurements made at the same level), ȳ is the mean of the y data for the n points in the calibration, t is the critical value obtained from t-tables for n − 2 degrees of freedom, s(x) is the standard deviation of the x values in the calibration, and RSE is the residual standard error for the calibration.
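The prediction formula translates directly into code. The sketch below uses illustrative data, with the t critical value taken from scipy:

```python
import numpy as np
from scipy.stats import linregress, t as t_dist

# Hypothetical calibration
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.02, 0.24, 0.44, 0.69, 0.88, 1.13])
fit = linregress(x, y)
a, b, n = fit.intercept, fit.slope, len(x)

rse = np.sqrt(((y - (a + b * x)) ** 2).sum() / (n - 2))

y_new = np.array([0.51, 0.53])        # m replicate readings of an unknown
m, Y_bar = len(y_new), y_new.mean()

# Inverse prediction and its 95% confidence half-width
x_pred = (Y_bar - a) / b
half_width = (t_dist.ppf(0.975, n - 2) * rse / b) * np.sqrt(
    1/m + 1/n + (Y_bar - y.mean())**2 / (b**2 * (n - 1) * x.std(ddof=1)**2)
)
# The predicted concentration is x_pred +/- half_width
```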

If we want, therefore, to reduce the size

of the confidence interval of the prediction

there are several things that can be done.

1. Make sure that the unknown

determinations of interest are close to

the centre of the calibration (i.e., close

to the values x̄, ȳ [the centroid point]).

This suggests that if we want a small

confidence interval at low values of x

then the standards/reference samples

used in the calibration should be

concentrated around this region. For

example, in analytical chemistry, a typical

pattern of standard concentrations

might be 0.05, 0.1, 0.2, 0.4, 0.8, 1.6

(i.e., with the standards concentrated at the low end and only a few at higher concentrations). While this will lead to a smaller confidence interval at lower concentrations, the calibration model will be prone to leverage errors (see below).

[Figure 1(a–g): example scatter plots with their correlation coefficients: (a) r = −1; (b) r = 0; (c) r = +1; (d) r = 0, a curved relationship; (e) r = 0.99; (f) r = 0.9, with a point introducing bias; (g) r = 0.9, with a point exerting high leverage. The r value alone is not a reliable measure of goodness of fit.]

[Figure 2: calibration graph, Y = −0.046 + 0.1124 × X, r = 0.98731, showing the regression line, intercept, slope and the confidence limits for the prediction.]

[Figure 3: residuals plot with a possible outlier.]

2. Increase the number of points in the

calibration (n). There is, however, little

improvement to be gained by going

above 10 calibration points unless

standard preparation and analysis is

rapid and cheap.

3. Increase the number of replicate

determinations for estimating the

unknown (m). Once again there is a

law of diminishing returns, so the

number of replicates should typically

be in the range 2 to 5.

4. The range of the calibration can be

extended, providing the calibration is still

linear.

Bias, leverage and outliers

Points of influence, which may or may not

be outliers, can have a significant effect on

the regression model and therefore, on its

predictive ability. If a point is in the middle

of the model (i.e., close to x ) but outlying

on the Y axis, its effect will be to move the

regression line up or down. The point is

then said to have influence because it

introduces an offset (or bias) in the

predicted values (see Figure 1(f)). If the

point is towards one of the extreme ends

of the plot its effect will be to tilt the

regression line. The point is then said to

have high leverage because it acts as a

lever and changes the slope of the

regression model (see Figure 1(g)).

Leverage can be a major problem if one or

two data points are a long way from all the

other points along the X axis.

A leverage statistic (ranging between 1/n and 1) can be calculated for each value of x. There is no set value above which this leverage statistic indicates a point of influence. A value of 0.9 is, however, used by some statistical software packages.

Leverage_i = 1/n + (x_i − x̄)² / Σ_{j=1}^{n} (x_j − x̄)²

where x_i is the x value for which the leverage statistic is to be calculated, n is the number of points in the calibration and x̄ is the mean of all the x values in the calibration.

To test if a data point (xi,yi) is an outlier (relative to the regression model) the following

outlier test can be applied.

Test value = residual_max / ( RSE × √[ 1 − 1/n − (Y_i − ȳ)² / ((n − 1)s_y²) ] )

where RSE is the residual standard error, s_y is the standard deviation of the Y values, Y_i is the y value for the point being tested, n is the number of points in the calibration, ȳ is the mean of all the y values in the calibration and residual_max is the largest residual value.

For example, the test value for the suspected outlier in Figure 3 is 1.78 and the critical

value is 2.37 (Table 3 for 10 data points). Although the point appears extreme, it could

reasonably be expected to arise by chance within the data set.
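The leverage statistic defined above is easy to compute for a whole calibration set. A numpy sketch with an artificial high-leverage point:

```python
import numpy as np

# Calibration x values, with one point far from the rest
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 40.0])
x_bar = x.mean()

# Leverage_i = 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2)
leverage = 1/len(x) + (x - x_bar)**2 / ((x - x_bar)**2).sum()
# The isolated point at x = 40 carries by far the largest leverage
```

A useful property for checking the calculation: the leverages of a simple linear fit sum to 2 (one degree of freedom each for the slope and intercept).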

We have already mentioned that the regression line is subject to some uncertainty and that

this uncertainty becomes greater at the extremes of the line. If we, therefore, try to

extrapolate much beyond the point where we have real data (±10%) there may be

relatively large errors associated with the predicted value. Conversely, interpolation near

the middle of the calibration will minimize the prediction uncertainty. It follows, therefore,

that when constructing a calibration graph, the standards should cover a larger range of

concentrations than the analyst is interested in. Alternatively, several calibration graphs

covering smaller, overlapping, concentration ranges can be constructed.

[Figure 4: (a) instrument response plotted against concentration, with the spread increasing with concentration; (b) the corresponding residuals plotted against predicted value.]

Table 2: Statistics obtained using the Excel 5.0 regression analysis function from the data used to generate the calibration graph in Figure 2.*

            Coefficients    Standard Error   t Stat         p value        Lower 95%      Upper 95%
Intercept   -0.046000012    0.039648848      -1.160185324   0.279423552    -0.137430479   0.045430455
Slope        0.112363638    0.00638999       17.58432015    1.11755E-07     0.097628284   0.127098992

*Note the large number of significant figures. In fact none of the values above warrants more than 3 significant figures!

In analytical science we often find that the precision changes with concentration. In

particular, the standard deviation of the data is proportional to the magnitude of the value

being measured, (see Figure 4(a)). A residuals plot will tend to show this relationship even

more clearly (Figure 4(b)). When this relationship is observed (or if the data has been

transformed before regression analysis), weighted linear regression should be used for

obtaining the calibration curve (3). The following description shows how the weighted

regression works. Don't be put off by the equations, as most modern statistical software

packages will perform the calculations for you. They are only included in the text for

completeness.

Weighted regression works by giving points known to have a better precision a higher

weighting than those with lower precision. During method validation the way the standard

deviation varies with concentration should have been investigated. This relationship can

then be used to calculate the initial weightings

Table 3: Critical values for the regression outlier test.

Sample size (n)    95%     99%
 5                 1.74    1.75
 6                 1.93    1.98
 7                 2.08    2.17
 8                 2.20    2.23
 9                 2.29    2.44
10                 2.37    2.55
12                 2.49    2.70
14                 2.58    2.82
16                 2.66    2.92
18                 2.72    3.00
20                 2.77    3.06
25                 2.88    3.25
30                 2.96    3.36
35                 3.02    3.40
40                 3.08    3.43
45                 3.12    3.47
50                 3.16    3.51
60                 3.23    3.57
70                 3.29    3.62
80                 3.33    3.68
90                 3.37    3.73
100                3.41    3.78

The initial weightings are given by w_i = 1/s_i², where s_i is the standard deviation at each of the n calibration levels. These initial weightings can then be standardized by multiplying by the number of calibration points divided by the sum of all the weights, to give the final weights (W_i):

W_i = w_i × n / Σ_{j=1}^{n} w_j

The weighted slope and intercept are then calculated in a manner similar to that for non-weighted linear regression. The prediction confidence intervals will, however, be different.

By assuming the regression line goes through the origin a better estimate of the slope is obtained, providing that the assumption of a zero intercept is correct. This may be a reasonable assumption in some instrument calibrations. However, in most cases, the regression line will no longer represent the least-squares best line through the data. For a line forced through the origin, the weighted slope is

b(w) = Σ_{i=1}^{n} W_i x_i y_i / Σ_{i=1}^{n} W_i x_i²

The weighted prediction (x_w) for a given instrument reading (y) for the regression model forcing the line through the origin (y = bx) is

X(w)predicted = Y̅ / b(w)

where Y̅ is the mean value of the response (e.g., instrument readings) for m replicates and x_i and y_i are the data pair for the ith point. The confidence interval for the prediction is then

X(w)predicted ± ( t × RSE(w) / b(w) ) × √[ 1/(m W_i) + Y̅² / (b(w)² Σ_{j=1}^{n} W_j x_j²) ]

where t is the critical value obtained from t-tables for n − 2 degrees of freedom at a stated significance level (typically α = 0.05), m is the number of replicates and the weighted residual standard error is

RSE(w) = √[ ( Σ_{j=1}^{n} W_j y_j² − b(w)² Σ_{j=1}^{n} W_j x_j² ) / (n − 1) ]
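The weighted through-origin calculation can be sketched as follows. The data are illustrative, with the per-level standard deviations assumed proportional to concentration:

```python
import numpy as np

# Hypothetical calibration where precision worsens with concentration
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([0.11, 0.20, 0.41, 0.83, 1.58])
s = 0.01 * x                        # per-level standard deviations

w = 1 / s**2                        # initial weightings
W = w * len(x) / w.sum()            # standardized weights (sum to n)

# Weighted slope for a line forced through the origin
b_w = (W * x * y).sum() / (W * x**2).sum()

y_new = np.array([0.60, 0.62])      # m replicate readings of the unknown
x_pred = y_new.mean() / b_w
```

Because the weights fall off as 1/x², the low-concentration standards dominate the slope estimate, which is exactly the behaviour weighted regression is designed to give.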

Conclusions

Always plot the data. Dont rely on the regression statistics to indicate a linear

relationship. For example, the correlation coefficient is not a reliable measure of

goodness-of-fit.

Always examine the residuals plot. This is a valuable diagnostic tool.

Remove points of influence (leverage, bias and outlying points) only if a reason can be

found for their aberrant behaviour.

Be aware that a regression line is an estimate of the best line through the data and

that there is some uncertainty associated with it. The uncertainty, in the form of a

confidence interval, should be reported with the interpolated result obtained from any

linear regression calibrations.

Acknowledgement

The preparation of this paper was supported under a contract with the Department of

Trade and Industry as part of the National Measurement System Valid Analytical

Measurement Programme (VAM) (11).

References

(1) G.W. Snedecor and W.G. Cochran, Statistical Methods, The Iowa State University Press, USA, 6th edition (1967).

(2) N. Draper and H. Smith, Applied Regression

Analysis, John Wiley & Sons Inc., New York,

USA, 2nd edition (1981).

(3) BS ISO 11095: Linear Calibration Using

Reference Materials (1996).

(4) J.C. Miller and J.N. Miller, Statistics for

Analytical Chemistry, Ellis Horwood PTR Prentice

Hall, London, UK.

(5) A.R. Hoshmand, Statistical Methods for

Environmental and Agricultural Sciences, 2nd

edition, CRC Press (ISBN 0-8493-3152-8)

(1998).

(6) T.J. Farrant, Practical Statistics for the Analytical

Scientist, A Bench Guide, Royal Society of

Chemistry, London, UK (ISBN 0 85404 442 6)

(1997).

(7) Statistical Software Qualification: Reference

Data Sets, Eds. B.P. Butler, M.G. Cox, S.L.R.

Ellison and W.A. Hardcastle, Royal Society of

Chemistry, London, UK (ISBN 0-85404-422-1)

(1996).

(8) H. Sahai and R.P. Singh, Virginia J. Sci., 40(1),

59, (1989).

(9) F.J. Anscombe, Graphs in Statistical Analysis,

American Statistician, 27, 17–21, February

1973.

(10) S. Burke, Scientific Data Management, 1(1),

32–38, September 1997.

(11) M. Sargent, VAM Bulletin, Issue 13, 45,

Laboratory of the Government Chemist

(Autumn 1995).

Technology Department of RHM Technology

Ltd, High Wycombe, Buckinghamshire, UK.

However, these articles were produced while

he was working at LGC, Teddington,

Middlesex, UK (http://www.lgc.co.uk).

Robust Statistics &

Non-parametric Methods

Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This article, the fourth and final part of our statistics refresher series, looks

at how to deal with messy data that contain transcription errors or extreme

and skewed results.

papers introducing basic statistical methods

of use in analytical science. In the three

previous papers (1–3) we have assumed

the data has been tidy; that is, normally

distributed with no anomalous and/or

missing results. In the real world, however,

we often need to deal with messy data,

for example data sets that contain

transcription errors, unexpected extreme

results or are skewed. How we deal with

this type of data is the subject of this article.

Transcription errors

Transcription errors can normally be

corrected by implementing good quality

control procedures before statistical

analysis is carried out. For example, the

data can be independently checked or,

more rarely, the data can be entered, again

independently, into two separate files and

the files compared electronically to

highlight any discrepancies. There are also

a number of outlier tests that can be used

to highlight anomalous values before other

statistics are calculated. These tests do not

remove the need for good quality

assurance; rather they should be seen as

an additional quality check.

Missing data

No matter how well our experiments are

planned there will always be times when

something goes wrong, resulting in gaps in

the data. Some statistical procedures will

not work as well, or at all, with some data

missing. The best recourse is always to

repeat the experiment to generate the

complete data set. Sometimes, however,

this is not feasible, particularly where the cost of retesting is prohibitive, so alternative

ways of addressing this problem are needed.

Current statistical software packages

typically deal with missing data by one of

three methods:

Casewise deletion excludes all examples

(cases) that have missing data in at least

one of the selected variables. For example,

in ICP–AAS (inductively coupled plasma–atomic absorption spectroscopy)

calibrated with a number of standard

solutions containing several metal ions at

different concentrations, if the aluminium

value were missing for a particular test

portion, all the results for that test portion

would be disregarded (See Table 1).

This is the usual way of dealing with

missing data, but it does not guarantee

correct answers. This is particularly so, in

complex (multivariate) data sets where it is

possible to end up deleting the majority

of your data if the missing data are

randomly distributed across cases

and variables.

Table 1: Casewise deletion of missing data. All the statistical calculations are only carried out on the reduced data set.

Original data (- = missing):

            Al   B     Fe   Ni
Solution 1  -    94.5  578  23.1
Solution 2  567  72.1  673  7.6
Solution 3  -    34.0  674  44.7
Solution 4  234  97.4  429  82.9

After casewise deletion:

            Al   B     Fe   Ni
Solution 2  567  72.1  673  7.6
Solution 4  234  97.4  429  82.9

Pairwise deletion excludes only the affected pairs of values in situations where parameters (correlation coefficients, for example) are calculated on successive pairs of variables. For example, in a recovery experiment we may be interested in the correlations between material recovered and extraction time, temperature, particle size, polarity, etc. With pairwise deletion, if one solvent polarity measurement was missing, only this single pair would be deleted from the correlation, and the correlations for recovery versus extraction time and particle size would be unaffected (see Table 2).

Pairwise deletion can, however, lead to serious problems. For example, if there is a hidden systematic distribution of missing points then a bias may result when calculating a correlation matrix (i.e., different correlation coefficients in the matrix can be based on different subsets of cases).

Mean substitution replaces all missing data in a variable by the mean value for that variable. Though this looks as if the gaps have been filled, mean substitution has its own disadvantages. The variability in the data set is artificially decreased in direct proportion to the number of missing data points, leading to underestimates of dispersion (the spread of the data). Mean substitution may also considerably change the values of some other statistics, such as linear regression statistics (3), particularly where correlations are strong (see Table 3).

Examples of these three approaches are illustrated in Figure 1, for the calculation of a correlation matrix, where the correlation coefficient (r) (3) is determined for each paired combination of the five variables, A to E. Note how the r value can increase, diminish or even reverse sign depending on which method is chosen to handle the missing data (i.e., the A, B correlation coefficients).
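As a concrete illustration, casewise deletion and mean substitution can be sketched in a few lines of pure Python using the Table 1 values; this is a minimal sketch, not a reference to any particular statistics package, and the variable names are illustrative only.

```python
from statistics import mean

# Table 1 data: four calibration solutions, with the Al value
# missing (None) for Solutions 1 and 3.
data = {
    "Al": [None, 567, None, 234],
    "B":  [94.5, 72.1, 34.0, 97.4],
    "Fe": [578, 673, 674, 429],
    "Ni": [23.1, 7.6, 44.7, 82.9],
}
variables = list(data)
n_cases = 4

# Casewise deletion: drop every case (solution) that has any
# missing value in the selected variables.
complete_cases = [i for i in range(n_cases)
                  if all(data[v][i] is not None for v in variables)]

# Mean substitution: replace each missing value by the mean of the
# observed values for that variable.
filled = {}
for v, column in data.items():
    observed = [x for x in column if x is not None]
    m = mean(observed)
    filled[v] = [m if x is None else x for x in column]
```

Running this, `complete_cases` keeps only Solutions 2 and 4, and the two missing Al values are both replaced by the mean of 567 and 234, i.e., 400.5, reproducing Tables 1 and 3.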

Table 2: Pairwise deletion of missing data. Correlations are calculated except for when one of a pair of data points is missing; the figure in brackets after each r value is the number of data points used in the correlation.

          Recovery  Extraction   Particle   Solvent
          (%)       time (mins)  size (μm)  polarity (pKa)
Sample 1  93        20           90         -
Sample 2  105       120          150        1.8
Sample 3  99        180          50         1.0
Sample 4  73        10           500        1.5

Recovery vs extraction time:    r = 0.728886  (4)
Recovery vs particle size:      r = -0.87495  (4)
Recovery vs solvent polarity:   r = 0.033942  (3)

Table 3: Mean substitution of missing data. All the statistical calculations are carried out on pseudo-completed data, with no allowance made for errors in the estimated values.

Original data (- = missing):

            Al   B     Fe   Ni
Solution 1  -    94.5  578  23.1
Solution 2  567  72.1  673  7.6
Solution 3  -    34.0  674  44.7
Solution 4  234  97.4  429  82.9

After mean substitution:

            Al     B     Fe   Ni
Solution 1  400.5  94.5  578  23.1
Solution 2  567    72.1  673  7.6
Solution 3  400.5  34.0  674  44.7
Solution 4  234    97.4  429  82.9

Box 1: Imputation (4,5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form an imputed value is substituted for the missing value (e.g., the mean substitution already discussed above is a form of imputation). In its more general, systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then the m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data are randomly distributed and the model used to predict the imputed values is sensible.

Extreme values, stragglers and outliers

Extreme values are defined as observations in a sample, so far separated in value from the remainder as to suggest that they may be from a different population, or the result of an error in measurement (6). Extreme values can also be subdivided into stragglers, extreme values detected between the 95% and 99% confidence levels; and outliers, extreme values detected at greater than the 99% confidence level.

It is tempting to remove extreme values automatically from a data set, because they can alter the calculated statistics, e.g., increase the estimate of variance (a measure of spread), or possibly introduce a bias in the calculated mean. There is one golden rule, however: no value should be removed from a data set on statistical grounds alone. Statistical grounds include outlier testing.

Outlier tests tell you, on the basis of some simple assumptions, where you are most likely to have a technical error; they do not tell you that the point is wrong. No matter how extreme a value is in a set of data, the suspect value could nonetheless be a correct piece of information (1). Only with experience, or the identification of a particular cause, can data be declared wrong and removed.

So, given that we understand that the tests only tell us where to look, how do we test for outliers? If we have good grounds for believing our data is normally distributed, then a number of outlier tests (sometimes called Q-tests) are available that identify extreme values in an objective way. Grounds for believing the data is normal are: past experience of similar data; passing normality tests, for example, the Kolmogorov-Smirnov (Lilliefors) test, kurtosis test (7,9), etc.; or plots of the data, e.g., frequency histograms and normal probability plots (1,7).

Note that the tests used to check for normality require a reasonable amount of data (a minimum of 10-15 results is recommended, depending on the normality test applied). For this reason there will be many examples in analytical science where either it will be impractical to carry out such tests, or the tests will not tell us anything meaningful.

If we are not sure the data set is normally distributed, then robust statistics and/or non-parametric (distribution independent) tests can be applied to the data. These three approaches (outlier tests, robust estimates and non-parametric methods) are examined in more detail below.

[Figure 1: Correlation matrices for the five variables A to E, illustrating the approaches selected for missing data: no missing data (15 cases), casewise deletion, pairwise deletion (variable number of cases) and mean substitution (mean values replacing the missing data). Note, at the 95% confidence level, significant correlations are indicated at r above approximately 0.514 (15 cases), 0.576 (12), 0.602 (11) and 0.632 (10).]
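Pairwise deletion is easiest to see in code. The sketch below (a minimal pure-Python illustration, with an assumed helper name `pearson_r`) computes correlation coefficients from the Table 2 recovery data, dropping a pair only when one of its two values is missing, so each coefficient can be based on a different number of cases.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation with pairwise deletion: a pair is dropped
    only when either of its two values is missing (None).
    Returns (r, number of pairs used)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n

# Table 2 recovery data; the solvent polarity for Sample 1 is missing.
recovery = [93, 105, 99, 73]
ext_time = [20, 120, 180, 10]
polarity = [None, 1.8, 1.0, 1.5]

r_time, n_time = pearson_r(recovery, ext_time)  # all 4 pairs used
r_pol, n_pol = pearson_r(recovery, polarity)    # only 3 pairs used
```

With these values, recovery vs extraction time uses all four pairs (r close to 0.729), while recovery vs polarity silently drops to three pairs, reproducing the Table 2 figures.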

Outlier tests

In analytical chemistry it is rare that we have large numbers of replicate data, and small data sets often show fortuitous grouping and consequent apparent outliers. Outlier tests should, therefore, be used with care and, of course, identified data points should only be removed if a technical reason can be found for their aberrant behaviour.

Most outlier tests look at some measure of the relative distance of a suspect point from the mean value. This measure is then assessed to see if the extreme value could reasonably be expected to have arisen by chance. Most of the tests look for single extreme values (Figure 2(a)), but sometimes it is possible for several outliers to be present in the same data set. These can be identified in one of two ways: by iteratively applying the outlier test, or by using tests that look for pairs of extreme values, i.e., outliers that are masking each other (see Figures 2(b) and 2(c)).

Note, as a rule of thumb, if more than 20% of the data are identified as outlying you should start to question your assumption about the data distribution and/or the quality of the data collected.

The appropriate outlier tests for the three situations described in Figure 2 are: 2(a) Grubbs 1, Dixon or Nalimov; 2(b) Grubbs 2; and 2(c) Grubbs 3. We will concentrate on the three Grubbs tests (7). The test values are calculated using the formulae below, after the data are arranged in ascending order.

[Figure 2: Patterns of extreme values in a data set: (a) a single outlier, at either end of the distribution; (b) an outlier at each end; (c) a pair of outliers at the same end.]

G1 = | x̄ - x(i) | / s

G2 = ( x(n) - x(1) ) / s

G3 = 1 - [ (n - 3) s(n-2)² ] / [ (n - 1) s² ]

where s is the standard deviation for the whole data set; x(i) is the suspected single outlier, i.e., the value furthest away from the mean; | | is the modulus, the value of a calculation ignoring the sign of the result; x̄ is the mean; n is the number of data points; x(n) and x(1) are the most extreme values; and s(n-2) is the standard deviation for the data set excluding the two most extreme values, i.e., the pair of values furthest away from the mean.

If the test values (G1, G2, G3) are greater than the critical value obtained from tables (see Table 4), then the extreme value(s) are unlikely to have occurred by chance at the stated confidence level (see Box 2).
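The three Grubbs statistics follow directly from the formulae above. The sketch below is a minimal illustration (the function name, and the choice of the G3 suspect pair as the two values at the same end as the point furthest from the mean, are assumptions of this sketch); it is checked against the Box 2 replicates and the Table 4 critical values for n = 13.

```python
from statistics import mean, stdev

def grubbs(values):
    """Grubbs' test statistics G1, G2 and G3 for a sample,
    using sample standard deviations throughout."""
    x = sorted(values)
    n = len(x)
    m, s = mean(x), stdev(x)
    # G1: the single point furthest from the mean.
    g1 = max(abs(v - m) for v in x) / s
    # G2: the two most extreme values, one at each end.
    g2 = (x[-1] - x[0]) / s
    # G3: a pair of suspect values at the same end as the furthest
    # point; s(n-2) is the std dev with that pair removed.
    trimmed = x[:-2] if abs(x[-1] - m) >= abs(x[0] - m) else x[2:]
    s_n2 = stdev(trimmed)
    g3 = 1 - ((n - 3) * s_n2 ** 2) / ((n - 1) * s ** 2)
    return g1, g2, g3

# The 13 replicates from Box 2.
data = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
        48.559, 48.634, 48.711, 49.005, 49.166, 49.484]
g1, g2, g3 = grubbs(data)
```

Comparing the results with the n = 13 critical values in Table 4 (2.331, 4.00 and 0.6705 at 95% confidence), none of the statistics exceeds its critical value, matching the Box 2 conclusion that there are no outlying values.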

Pitfalls of outlier tests

Figure 3 shows three situations where outlier tests can misleadingly identify an extreme value.

Figure 3(a) shows a situation common in chemical analysis. Because of limited measurement precision (rounding errors) it is possible to end up comparing a result which, no matter how close it is to the other values, is an infinite number of standard deviations away from the mean of the remaining results. This value will therefore always be flagged as an outlier.

In Figure 3(b) there is a genuine long tail on the distribution that may cause successive outlying points to be identified. This type of distribution is surprisingly common in some types of chemical analysis, e.g., pesticide residues.

If there is very little data (Figure 3(c)), an outlier can be identified by chance. In this situation it is possible that the identified point is closer to the true value and it is the other values that are the outliers. This occurs more often than we would like to admit; how many times do your procedures state "average the best two out of three determinations"?

Outliers by variance

When the data are from different groups (for example, when comparing test methods via an interlaboratory comparison) it is not only possible for individual points within a group to be outlying, but also for the group means to be outliers with respect to each other. Another type of outlier can occur when the spread of data within one particular group is unusually small or large when compared with the spread of the other groups (see Figure 4).

The same Grubbs tests that are used to determine the presence of within-group outlying replicates may also be used to test for suspected outlying means. Cochran's test can be used for the third case, that of a suspected outlying variance.

To carry out Cochran's test, the suspect variance is compared with the sum of all the group variances. (The variance is a measure of spread and is simply the square of the standard deviation (1).)

C(n̄) = s²(suspect) / Σ s²(i),  summed over the g groups (i = 1, 2, …, g)

If this calculated ratio, C(n̄), exceeds the critical value obtained from statistical tables (7), then the suspect group spread is extreme. The critical value is selected using n̄, the average number of sample results produced by all the groups:

n̄ = [ Σ n(i) ] / g,  i = 1, 2, …, g

Cochran's test assumes the numbers of replicates within the groups are the same, or at least similar (±1). It also assumes that none of the data have been rounded and that there are sufficient numbers of replicates to get a reasonable estimate of each variance. Cochran's test should not be used iteratively, as this could lead to a large percentage of the data being removed (see Box 3).

Robust statistics

Robust statistics include methods that are largely unaffected by the presence of extreme values. The most commonly used of these statistics are as follows:

Median: The median is a measure of central tendency and can be used instead of the mean. To calculate the median (x̃) the data are arranged in order of magnitude; the median is then the central member of the series (or the mean of the two central members when there is an even number of data points), i.e., there are equal numbers of observations smaller and greater than the median. For a symmetrical distribution the mean and median have the same value.

x̃ = x(m)                     when n is odd (1, 3, 5, …)

x̃ = [ x(m) + x(m+1) ] / 2    when n is even (2, 4, 6, …)

where m = n/2, rounded up.

Median Absolute Deviation (MAD): The MAD value is an estimate of the spread in the data, similar to the standard deviation.

Box 2: Grubbs' tests worked example. 13 replicates are ordered in ascending order (x(1) to x(n)):

47.876 47.997 48.065 48.118 48.151 48.211 48.251 48.559 48.634 48.711 49.005 49.166 49.484

n = 13, mean = 48.479, s = 0.498, s(n-2)² = 0.123

G1 = | 48.479 - 49.484 | / 0.498 = 2.018
G2 = ( 49.484 - 47.876 ) / 0.498 = 3.229
G3 = 1 - ( 10 × 0.123 ) / ( 12 × 0.498² ) = 0.587

Grubbs critical values for 13 values are G1 = 2.331 and 2.607, G2 = 4.00 and 4.24, and G3 = 0.6705 and 0.7667 for the 95% and 99% confidence levels respectively. Since the test values are less than their respective critical values in all cases, it can be concluded that there are no outlying values.

MAD = median( | x(i) - x̃ | ),  i = 1, 2, …, n

If the MAD value is scaled by a factor of 1.483, it becomes comparable with a standard deviation; this is the MADE value.
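The median, MAD and MADE are simple to compute; the sketch below (function names are illustrative) shows how little a gross outlier moves these robust estimates.

```python
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median(abs(v - med) for v in values)

def made(values):
    """MAD scaled by 1.483, making it comparable with a standard
    deviation for normally distributed data."""
    return 1.483 * mad(values)

# A small data set with one gross outlier: the robust estimates are
# barely affected by the value 100.0.
data = [9.8, 10.1, 10.0, 10.3, 100.0]
med, spread = median(data), made(data)
```

Here the median stays at 10.1 and the MADE value is about 0.30, whereas the mean (27.99 with 1 value in 5 extreme) and standard deviation would be dragged far from the bulk of the data.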

Table 4: Critical values for the Grubbs tests at the 95% and 99% confidence levels.

        95% confidence level        99% confidence level
n      G(1)   G(2)   G(3)         G(1)   G(2)   G(3)
3      1.153  2.00   ---          1.155  2.00   ---
4      1.463  2.43   0.9992       1.492  2.44   1.0000
5      1.672  2.75   0.9817       1.749  2.80   0.9965
6      1.822  3.01   0.9436       1.944  3.10   0.9814
7      1.938  3.22   0.8980       2.097  3.34   0.9560
8      2.032  3.40   0.8522       2.221  3.54   0.9250
9      2.110  3.55   0.8091       2.323  3.72   0.8918
10     2.176  3.68   0.7695       2.410  3.88   0.8586
12     2.285  3.91   0.7004       2.550  4.13   0.7957
13     2.331  4.00   0.6705       2.607  4.24   0.7667
15     2.409  4.17   0.6182       2.705  4.43   0.7141
20     2.557  4.49   0.5196       2.884  4.79   0.6091
25     2.663  4.73   0.4505       3.009  5.03   0.5320
30     2.745  4.89   0.3992       3.103  5.19   0.4732
35     2.811  5.026  0.3595       3.178  5.326  0.4270
40     2.866  5.150  0.3276       3.240  5.450  0.3896
50     2.956  5.350  0.2797       3.336  5.650  0.3328
60     3.025  5.500  0.2450       3.411  5.800  0.2914
70     3.082  5.638  0.2187       3.471  5.938  0.2599
80     3.130  5.730  0.1979       3.521  6.030  0.2350
90     3.171  5.820  0.1810       3.563  6.120  0.2147
100    3.207  5.900  0.1671       3.600  6.200  0.1980
110    3.239  5.968  0.1553       3.632  6.268  0.1838
120    3.267  6.030  0.1452       3.662  6.330  0.1716
130    3.294  6.086  0.1364       3.688  6.386  0.1611
140    3.318  6.137  0.1288       3.712  6.437  0.1519

Other robust estimators include the trimmed mean and deviations, the Winsorized mean and deviation, least median of squares (robust regression), Levene's test (heterogeneity in ANOVA), etc. A discussion of robust statistics in analytical chemistry can be found elsewhere (10, 11).

Non-parametric tests

Typical statistical tests incorporate assumptions about the underlying distribution of data (such as normality), and hence rely on distribution parameters. Non-parametric tests are so called because they make few or no assumptions about the distributions, and do not rely on distribution parameters. Their chief advantage is improved reliability when the distribution is unknown. There is at least one non-parametric equivalent for each parametric type of test (see Table 5). In a short article such as this it is impossible to describe the methodology for all these tests, but more information can be found in other publications (12, 13).
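As a flavour of how simple some non-parametric tests are, here is a minimal sketch of the sign test from Table 5 for paired data (the function name and the illustrative difference values are assumptions of this sketch, not taken from the article).

```python
from math import comb

def sign_test_p(differences):
    """Two-sided sign test for paired data. Under the null
    hypothesis the signs of the differences are equally likely to
    be + or -. Zero differences are discarded. Returns the exact
    binomial p-value."""
    signs = [d for d in differences if d != 0]
    n = len(signs)
    plus = sum(1 for d in signs if d > 0)
    k = min(plus, n - plus)
    # Exact binomial tail probability with p = 0.5, doubled for a
    # two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative paired differences (method A minus method B) for
# eight samples; all eight signs are positive.
diffs = [0.8, 1.1, 0.4, 0.9, 1.3, 0.2, 0.7, 1.0]
p = sign_test_p(diffs)
```

With all eight differences positive, p = 2/256, which is about 0.008, so the two methods would be judged significantly different at the 95% confidence level without any assumption about the distribution of the differences.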

Conclusions

- Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
- Delete extreme values only when a technical reason for their aberrant behaviour can be found.
- Missing data can result in misinterpretation of the resulting statistics, so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.

Box 3: Cochran's test worked example. An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:

Std. dev.: 0.202 0.402 0.332 0.236 0.318 0.452 0.210 0.074 0.525 0.067 0.609 0.246 0.198

n̄ = 85 / 13 = 6.54, rounded to 7

C(n̄) = 0.609² / ( 0.202² + 0.402² + … + 0.246² + 0.198² ) = 0.371 / 1.474 = 0.252

Cochran's critical value for n̄ = 7 and g = 13 is 0.23 at the 95% confidence level (7). As the test value is greater than the critical value, it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates, and this laboratory's results therefore need to be investigated further. It is normal practice in interlaboratory comparisons not to test for low variance outliers, i.e., laboratories reporting unusually precise results.
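The Cochran calculation in Box 3 is a one-liner once the group standard deviations are known; the sketch below (the function name is illustrative) reproduces the worked example.

```python
def cochran_c(std_devs, suspect_index):
    """Cochran's test statistic: the suspect group variance divided
    by the sum of all the group variances."""
    variances = [s ** 2 for s in std_devs]
    return variances[suspect_index] / sum(variances)

# Standard deviations of the 13 laboratories from Box 3; the largest
# (0.609) is the suspect spread.
sds = [0.202, 0.402, 0.332, 0.236, 0.318, 0.452, 0.210,
       0.074, 0.525, 0.067, 0.609, 0.246, 0.198]
c = cochran_c(sds, sds.index(max(sds)))
```

The statistic comes out at about 0.252, above the tabulated critical value of 0.23 for g = 13 groups and an average of 7 replicates (7), so the suspect laboratory's spread is flagged as extreme, as in Box 3.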


- Most outlier tests assume that the underlying distribution of the data is known. This assumption should be checked for validity before these tests are applied.
- Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
- When knowledge about the underlying data distribution is limited, non-parametric methods should be used.

[Figure 4: Analyte concentration plotted against laboratory ID for an interlaboratory comparison, showing one laboratory with an outlying variance and another with an outlying mean.]

Note: following a judgement in a US court, the Food and Drug Administration (FDA), in its "Guide to inspection of pharmaceutical quality control laboratories", has specifically prohibited the use of outlier tests.

Table 5: Parametric methods and their non-parametric equivalents.

Differences between independent groups of data
  Parametric: t-test for independent groups (2); ANOVA/MANOVA (2)
  Non-parametric: Wald-Wolfowitz runs test; Mann-Whitney U test; Kolmogorov-Smirnov two-sample test; Kruskal-Wallis analysis of ranks; Median test

Differences between dependent groups of data
  Parametric: t-test for dependent groups (2)
  Non-parametric: Sign test; Wilcoxon's matched pairs test; McNemar's test; χ² (Chi-square) test; Friedman's two-way ANOVA; Cochran Q test

Relationships between continuous variables
  Parametric: Linear regression (3); Correlation coefficient (3)
  Non-parametric: Spearman R; Kendall Tau; coefficient Gamma

Relationships between counted variables
  Non-parametric: χ² (Chi-square) test; Phi coefficient; Fisher exact test; Kendall coefficient of concordance

Acknowledgement

The preparation of this paper was supported under a contract with the UK's Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (14).

References

(1) S. Burke, Scientific Data Management, 32-38, 1997.
(2) S. Burke, Scientific Data Management, 2(1), 36-41, 1998.
(3) S. Burke, Scientific Data Management, 2(2), 32-40, 1998.
(4) J.L. Schafer, Monographs on Statistics and Applied Probability 72: Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997. ISBN 0-412-04061-1.
(5) R.J.A. Little & D.B. Rubin, Statistical Analysis With Missing Data, John Wiley & Sons, 1987. ISBN 0-471-80243-9.
(6) ISO 3534, Statistics: Vocabulary and Symbols. Part 1: Probability and general statistical terms, section 2.64. Geneva, 1993.
(7) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, 1997. ISBN 0-85404-442-6.
(8) V. Barnett & T. Lewis, Outliers in Statistical Data, 3rd Edition, John Wiley, 1994.
(9) William H. Kruskal & Judith M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers, 1978. ISBN 0-02-917960-2.
(10) Analytical Methods Committee, "Robust Statistics: How Not to Reject Outliers, Part 2", Analyst, 114, 1693-1697, 1989.
(11) D.C. Hoaglin, F. Mosteller & J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, 1983. ISBN 0-471-09777-2.
(12) M. Hollander & D.A. Wolfe, Non-parametric Statistical Methods, Wiley & Sons, New York, 1973.
(13) W.W. Daniel, Applied Non-parametric Statistics, Houghton Mifflin, Boston, 1978.
(14) M. Sargent, VAM Bulletin, Issue 13, 4-5, Autumn 1995. Laboratory of the Government Chemist.
