The problem of comparing several averages arises in many contexts: compare five bioassay treatments against a control, compare four new polymers for sludge conditioning, or compare eight new combinations of media for treating odorous ventilation air. One multiple paired comparison problem is to compare all possible pairs of k treatments. Another is to compare k − 1 treatments with a control. Knowing how to do a t-test may tempt us to compare several combinations of treatments using a series of paired t-tests. If there are k treatments, the number of pair-wise comparisons that could be made is k(k − 1)/2. For k = 4 there are 6 possible comparisons, for k = 5 there are 10, for k = 10 there are 45, and for k = 15 there are 105. Checking 5, 10, 45, or even 105 combinations is manageable but not recommended. Statisticians call this data snooping (Sokal and Rohlf, 1969) or data dredging (Tukey, 1991).

We need to understand why data snooping is dangerous. Suppose, to take a not too extreme example, that we have 15 different treatments. The number of possible pair-wise comparisons is 15(15 − 1)/2 = 105. If, before the results are known, we make one selected comparison using a t-test with a 100α% = 5% error rate, there is a 5% chance of reaching the wrong decision each time we repeat the data collection experiment for those two treatments. If, however, several pairs of treatments are tested for possible differences using this procedure, the error rate will be larger than the expected 5% rate. Imagine that a two-sample t-test is used to compare the largest of the 15 average values against the smallest. The null hypothesis that this difference is zero, the largest of all the 105 possible pair-wise differences, is likely to be rejected almost every time the experiment is repeated, instead of just at the 5% rate that would apply to making one pair-wise comparison selected at random from among the 105 possible comparisons.
The number of comparisons does not have to be large for problems to arise. If there are just three treatment methods and, of the three averages, A is larger than B and C is slightly larger than A (ȳ_C > ȳ_A > ȳ_B), it is possible for the three possible t-tests to indicate that A gives higher results than B (A > B), A is not different from C (A = C), and B is not different from C (B = C). This apparent contradiction can happen because different variances are used to make the different comparisons. Analysis of variance (Chapter 21) eliminates this problem by using a common variance to make a single test of significance (using the F statistic).

The multiple comparison test is similar to a t-test, but an allowance is made in the error rate to keep the collective error rate at the stated level. This collective rate can be defined in two ways. Returning to the example of 15 treatments and 105 possible pair-wise comparisons, the probability of getting the wrong conclusion for a single randomly selected comparison is the individual error rate. The family error rate (also called the Bonferroni error rate) is the chance of getting one or more of the 105 comparisons wrong in each repetition of data collection for all 15 treatments; it counts an error for each wrong comparison in each repetition. Thus, to make valid statistical comparisons, the individual per-comparison error rate must be shrunk to keep the simultaneous family error rate at the desired level.
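The inflation of the family error rate can be sketched numerically. If, purely for illustration, the m comparisons were treated as independent tests at individual rate α, the chance of at least one false rejection would be 1 − (1 − α)^m; pairwise tests on shared data are correlated, so this is only a rough approximation, not the exact family rate:

```python
# Sketch: growth of the family error rate with the number of comparisons.
# Assumes independent tests, which pairwise t-tests on shared data are not,
# so 1 - (1 - alpha)^m is only an approximate upper bound.
def n_pairs(k):
    """Number of pair-wise comparisons among k treatments: k(k - 1)/2."""
    return k * (k - 1) // 2

def family_error_rate(m, alpha=0.05):
    """P(at least one false rejection) if the m tests were independent."""
    return 1.0 - (1.0 - alpha) ** m

for k in (4, 5, 10, 15):
    m = n_pairs(k)
    print(f"k = {k:2d}: {m:3d} comparisons, family error rate ~ {family_error_rate(m):.3f}")
```

For k = 15 and 105 comparisons this approximation gives roughly 0.995, which is why the largest observed difference is almost certain to look "significant" at the nominal 5% level.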
TABLE 20.1 Ten Measurements of Lead Concentration (µg/L) Measured on Identical Wastewater Specimens by Five Laboratories
Lab j   Measurements                                       Mean ȳ_j   Variance s²_j
Lab 1   3.4  3.0  3.4  5.0  5.1  5.5  5.4  4.2  3.8  4.2     4.30        0.82
Lab 2   4.5  3.7  3.8  3.9  4.3  3.9  4.1  4.0  3.0  4.5     3.97        0.19
Lab 3   5.3  4.7  3.6  5.0  3.6  4.5  4.6  5.3  3.9  4.1     4.46        0.41
Lab 4   3.2  3.4  3.1  3.0  3.9  2.0  1.9  2.7  3.8  4.2     3.12        0.58
Lab 5   3.3  2.4  2.7  3.2  3.3  2.9  4.4  3.4  4.8  3.0     3.34        0.54
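The summary statistics in Table 20.1 can be reproduced directly from the raw measurements; a minimal sketch using the sample mean and the sample variance (n − 1 denominator):

```python
# Recompute the mean and variance columns of Table 20.1 from the raw data.
labs = {
    1: [3.4, 3.0, 3.4, 5.0, 5.1, 5.5, 5.4, 4.2, 3.8, 4.2],
    2: [4.5, 3.7, 3.8, 3.9, 4.3, 3.9, 4.1, 4.0, 3.0, 4.5],
    3: [5.3, 4.7, 3.6, 5.0, 3.6, 4.5, 4.6, 5.3, 3.9, 4.1],
    4: [3.2, 3.4, 3.1, 3.0, 3.9, 2.0, 1.9, 2.7, 3.8, 4.2],
    5: [3.3, 2.4, 2.7, 3.2, 3.3, 2.9, 4.4, 3.4, 4.8, 3.0],
}

def mean(x):
    return sum(x) / len(x)

def var(x):
    """Sample variance with n - 1 in the denominator."""
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

for j, data in labs.items():
    print(f"Lab {j}: mean = {mean(data):.2f}, variance = {var(data):.2f}")
```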
TABLE 20.3 Values of the Studentized Range Statistic q_{k,ν,α/2} for k(k − 1)/2 Two-Sided Comparisons at a Joint 95% Confidence Level, Where There Are a Total of k Treatments
            Degrees of Freedom ν
  k      5     10     15     20     30     60      ∞
  2   4.47   3.73   3.52   3.43   3.34   3.25   3.17
  3   5.56   4.47   4.18   4.05   3.92   3.80   3.68
  4   6.26   4.94   4.59   4.43   4.27   4.12   3.98
  5   6.78   5.29   4.89   4.70   4.52   4.36   4.20
  6   7.19   5.56   5.12   4.91   4.72   4.54   4.36
  8   7.82   5.97   5.47   5.24   5.02   4.81   4.61
 10   8.29   6.29   5.74   5.48   5.24   5.01   4.78
Note: Family error rate = 5%; α/2 = 0.05/2 = 0.025. Source: Harter, H. L. (1960). Annals Math. Stat., 31, 1122–1147.
where it is assumed that the two treatments have the same variance σ², which is estimated by pooling the two sample variances:

    s²_pool = [(n_i − 1)s_i² + (n_j − 1)s_j²] / (n_i + n_j − 2)

The chance that the interval includes the true value for any single comparison is exactly 1 − α. But the chance that all possible k(k − 1)/2 intervals will simultaneously contain their true values is less than 1 − α. Tukey (1949) showed that the confidence interval for the difference in two means (η_i and η_j), taking into account that all possible comparisons of k treatments may be made, is given by:

    ȳ_i − ȳ_j ± (q_{k,ν,α/2}/√2) s_pool √(1/n_i + 1/n_j)

where q_{k,ν,α/2} is the upper significance level of the studentized range for k means and ν degrees of freedom in the estimate s²_pool of the variance σ². This formula is exact if the numbers of observations in all the averages are equal, and approximate if the k treatments have different numbers of observations. The value of s²_pool is obtained by pooling sample variances over all k treatments:

    s²_pool = [(n₁ − 1)s₁² + ⋯ + (n_k − 1)s_k²] / (n₁ + ⋯ + n_k − k)
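With the five sample variances of Table 20.1 and equal group sizes n_i = 10, the pooled variance works out as follows; a minimal sketch of the formula above:

```python
# Pool the five sample variances of Table 20.1 over all k treatments:
# s2_pool = sum((n_i - 1) * s_i^2) / (sum(n_i) - k), here with equal n = 10.
variances = [0.82, 0.19, 0.41, 0.58, 0.54]  # s_j^2 from Table 20.1
n = 10                                      # observations per laboratory
k = len(variances)

nu = n * k - k                              # degrees of freedom = 50 - 5 = 45
s2_pool = sum((n - 1) * s2 for s2 in variances) / nu
print(f"s2_pool = {s2_pool:.3f}, s_pool = {s2_pool ** 0.5:.2f}, nu = {nu}")
```

This gives s²_pool = 0.508 (s_pool ≈ 0.71) with ν = 45 degrees of freedom, the values used in the worked comparisons below.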
The confidence interval is larger when q_{k,ν,α/2} is used than when the t statistic is used. This is because the studentized range allows for the possibility that any one of the k(k − 1)/2 possible pair-wise comparisons might be selected for the test. Critical values of q_{k,ν,α/2} have been tabulated by Harter (1960) and may be found in the statistical tables of Rohlf and Sokal (1981) and Pearson and Hartley (1966). Table 20.3 gives a few values for computing the two-sided 95% confidence interval.
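As a sketch of how a table value is used, the Tukey interval half-width for the laboratory data can be computed from the pooled statistics of Table 20.1. The value q ≈ 4.44 is interpolated from Table 20.3 for k = 5 and ν = 45 (an assumption; the chapter's own calculation quotes a half-width of 1.01 µg/L):

```python
import math
from itertools import combinations

# Tukey multiple comparisons for the five laboratories of Table 20.1.
means = {1: 4.30, 2: 3.97, 3: 4.46, 4: 3.12, 5: 3.34}
s_pool = math.sqrt(0.508)   # pooled over all five labs, nu = 45
n = 10                      # observations per laboratory
q = 4.44                    # q_{5,45,0.025}, interpolated from Table 20.3 (assumption)

# Half-width of the joint 95% interval: (q / sqrt(2)) * s_pool * sqrt(1/n_i + 1/n_j)
half_width = (q / math.sqrt(2)) * s_pool * math.sqrt(1 / n + 1 / n)
print(f"interval half-width = {half_width:.2f}")

for i, j in combinations(means, 2):
    diff = means[i] - means[j]
    verdict = "different" if abs(diff) > half_width else "not distinguishable"
    print(f"Lab {i} - Lab {j}: {diff:+.2f}  {verdict}")
```

The pairs flagged as different are (1, 4), (3, 4), and (3, 5), matching the conclusion drawn in the text.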
TABLE 20.4 Values of t_{k−1,ν,0.05/2} for k − 1 Two-Sided Comparisons at a Joint 95% Confidence Level, Where There Are a Total of k Treatments, One of Which Is a Control
                 Degrees of Freedom ν
 k − 1      5     10     15     20     30     60      ∞
   2     3.03   2.57   2.44   2.38   2.32   2.27   2.21
   3     3.29   2.76   2.61   2.54   2.47   2.41   2.35
   4     3.48   2.89   2.73   2.65   2.58   2.51   2.44
   5     3.62   2.99   2.82   2.73   2.66   2.58   2.51
   6     3.73   3.07   2.89   2.80   2.72   2.64   2.57
   8     3.90   3.19   3.00   2.90   2.82   2.73   2.65
  10     4.03   3.29   3.08   2.98   2.89   2.80   2.72

Note: k − 1 = number of treatments excluding the control.
and the difference in the true means is, with 95% confidence, within the interval:

    ȳ_i − ȳ_j ± 1.01

We can say, with a high degree of confidence, that any observed difference larger than 1.01 µg/L or smaller than −1.01 µg/L is not likely to be zero. We conclude that laboratories 3 and 1 measure higher than laboratory 4, and that laboratory 3 is also different from laboratory 5. We cannot say which laboratory is correct, or which one is best, without knowing the true concentration of the test specimens.
confidence limits are:

    ȳ_i − ȳ_c ± 2.55 (0.71) √(1/10 + 1/10)

    ȳ_i − ȳ_c ± 0.81

We can say with 95% confidence that any observed difference greater than 0.81 or smaller than −0.81 is unlikely to be zero. The four comparisons with laboratory 2 shown in Table 20.5 indicate that the measurements from laboratory 4 are smaller than those of the control laboratory.
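The Dunnett comparisons above can be sketched in a few lines, using the chapter's values: t = 2.55 from Table 20.4 (k − 1 = 4 treatments, ν = 45), s_pool = 0.71, n = 10 per laboratory, and laboratory 2 as the control:

```python
import math

# Dunnett comparisons of four laboratories against a control (lab 2),
# using the numbers from the worked example in the text.
means = {1: 4.30, 3: 4.46, 4: 3.12, 5: 3.34}
control_mean = 3.97          # laboratory 2 (the control)
t = 2.55                     # t_{4,45,0.05/2} from Table 20.4
s_pool = 0.71
n = 10

# Allowance (half-width of the joint 95% interval): t * s_pool * sqrt(1/n_i + 1/n_c)
allowance = t * s_pool * math.sqrt(1 / n + 1 / n)
print(f"allowance = {allowance:.2f}")

for j, m in means.items():
    diff = m - control_mean
    verdict = "different from control" if abs(diff) > allowance else "no difference shown"
    print(f"Lab {j} - Lab 2: {diff:+.2f}  {verdict}")
```

Only laboratory 4, with an observed difference of −0.85 µg/L, falls outside the ±0.81 allowance, which is the conclusion stated in the text.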
Comments
Box et al. (1978) describe yet another way of making multiple comparisons. The simple idea is that if the k treatment averages had the same mean, they would appear to be k observations from the same, nearly normal distribution with standard deviation σ/√n. The plausibility of this outcome is examined graphically by constructing such a normal reference distribution and superimposing upon it a dot diagram of the k average values. The reference distribution is then moved along the horizontal axis to see if there is a way to locate it so that all the observed averages appear to be typical random values selected from it. This sliding reference distribution is a rough method for making multiple comparisons. The Tukey and Dunnett methods are more formal ways of making these comparisons.

Dunnett (1955) discussed the allocation of observations between the control group and the other p = k − 1 treatment groups. For practical purposes, if the experimenter is working with a joint confidence level in the neighborhood of 95% or greater, the experiment should be designed so that n_c/n ≈ √p, where n_c is the number of observations on the control and n is the number on each of the p noncontrol treatments. Thus, for an experiment that compares four treatments to a control, p = 4 and n_c is approximately 2n.
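Dunnett's allocation rule is simple to apply; a minimal sketch (the rounding to whole observations is an assumption for illustration):

```python
import math

# Dunnett's allocation rule: with p non-control treatments and n observations
# per treatment, give the control roughly n_c = sqrt(p) * n observations.
def control_allocation(p, n):
    """Suggested number of control observations, rounded to the nearest integer."""
    return round(math.sqrt(p) * n)

# Four treatments versus a control, n = 10 per treatment: n_c ~ 2n = 20.
print(control_allocation(4, 10))
```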
References
Box, G. E. P., W. G. Hunter, and J. S. Hunter (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, New York, Wiley Interscience.
Dunnett, C. W. (1955). "A Multiple Comparison Procedure for Comparing Several Treatments with a Control," J. Am. Stat. Assoc., 50, 1096–1121.
Dunnett, C. W. (1964). "New Tables for Multiple Comparisons with a Control," Biometrics, 20, 482–491.
Harter, H. L. (1960). "Tables of Range and Studentized Range," Annals Math. Stat., 31, 1122–1147.
Pearson, E. S. and H. O. Hartley (1966). Biometrika Tables for Statisticians, Vol. 1, 3rd ed., Cambridge, England, Cambridge University Press.
Rohlf, F. J. and R. R. Sokal (1981). Statistical Tables, 2nd ed., New York, W. H. Freeman & Co.
Sokal, R. R. and F. J. Rohlf (1969). Biometry: The Principles and Practice of Statistics in Biological Research, New York, W. H. Freeman and Co.
Tukey, J. W. (1949). "Comparing Individual Means in the Analysis of Variance," Biometrics, 5, 99.
Tukey, J. W. (1991). "The Philosophy of Multiple Comparisons," Stat. Sci., 6(6), 100–116.
2002 By CRC Press LLC