
20 Multiple Paired Comparisons of k Averages


KEY WORDS data snooping, data dredging, Dunnett's procedure, multiple comparisons, sliding reference distribution, studentized range, t-tests, Tukey's procedure.

The problem of comparing several averages arises in many contexts: compare five bioassay treatments against a control, compare four new polymers for sludge conditioning, or compare eight new combinations of media for treating odorous ventilation air. One multiple paired comparison problem is to compare all possible pairs of k treatments. Another is to compare k − 1 treatments with a control.

Knowing how to do a t-test may tempt us to compare several combinations of treatments using a series of paired t-tests. If there are k treatments, the number of pair-wise comparisons that could be made is k(k − 1)/2. For k = 4 there are 6 possible comparisons, for k = 5 there are 10, for k = 10 there are 45, and for k = 15 there are 105. Checking 5, 10, 45, or even 105 combinations is manageable but not recommended. Statisticians call this data snooping (Sokal and Rohlf, 1969) or data dredging (Tukey, 1991).

We need to understand why data snooping is dangerous. Suppose, to take a not too extreme example, that we have 15 different treatments. The number of possible pair-wise comparisons is 15(15 − 1)/2 = 105. If, before the results are known, we make one selected comparison using a t-test with a 100α% = 5% error rate, there is a 5% chance of reaching the wrong decision each time we repeat the data collection experiment for those two treatments. If, however, several pairs of treatments are tested for possible differences using this procedure, the error rate will be larger than the nominal 5%. Imagine that a two-sample t-test is used to compare the largest of the 15 average values against the smallest. The null hypothesis for this difference, the largest of all 105 possible pair-wise differences, is likely to be rejected almost every time the experiment is repeated, instead of at the 5% rate that would apply to a single pair-wise comparison selected at random from among the 105 possible comparisons.

The number of comparisons does not have to be large for problems to arise. If there are just three treatment methods and, of the three averages, A is larger than B and C is slightly larger than A (ȳ_C > ȳ_A > ȳ_B), it is possible for the three possible t-tests to indicate that A gives higher results than B (A > B), A is not different from C (A = C), and B is not different from C (B = C). This apparent contradiction can happen because different variances are used to make the different comparisons. Analysis of variance (Chapter 21) eliminates this problem by using a common variance to make a single test of significance (using the F statistic).

The multiple comparison test is similar to a t-test, but an allowance is made in the error rate to keep the collective error rate at the stated level. This collective rate can be defined in two ways. Returning to the example of 15 treatments and 105 possible pair-wise comparisons, the probability of reaching the wrong conclusion for a single randomly selected comparison is the individual error rate. The family error rate (also called the Bonferroni error rate) is the chance of getting one or more of the 105 comparisons wrong in each repetition of data collection for all 15 treatments; it counts an error for each wrong comparison in each repetition. Thus, to make valid statistical comparisons, the individual per-comparison error rate must be shrunk to keep the simultaneous family error rate at the desired level.
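To see the inflation numerically, the short sketch below (not from the text) computes the chance of at least one wrong conclusion when m comparisons are each made at an individual 5% error rate, under the simplifying assumption that the comparisons are independent. Real pair-wise comparisons share data and are not independent, so this is only a rough illustration of why the family error rate grows so quickly.

```python
# Rough illustration: family error rate for m nominally independent
# comparisons, each made at an individual 5% error rate.
alpha = 0.05
for m in (1, 6, 10, 45, 105):          # comparisons for k = 2, 4, 5, 10, 15
    family_rate = 1 - (1 - alpha) ** m
    print(f"{m:3d} comparisons: P(at least one false conclusion) ~ {family_rate:.3f}")
```

For 105 comparisons this crude calculation gives a family error rate above 99%, which is why the largest-versus-smallest comparison is almost certain to look "significant."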


TABLE 20.1
Ten Measurements of Lead Concentration (μg/L) Measured on Identical Wastewater Specimens by Five Laboratories

                 Lab 1   Lab 2   Lab 3   Lab 4   Lab 5
                  3.4     4.5     5.3     3.2     3.3
                  3.0     3.7     4.7     3.4     2.4
                  3.4     3.8     3.6     3.1     2.7
                  5.0     3.9     5.0     3.0     3.2
                  5.1     4.3     3.6     3.9     3.3
                  5.5     3.9     4.5     2.0     2.9
                  5.4     4.1     4.6     1.9     4.4
                  4.2     4.0     5.3     2.7     3.4
                  3.8     3.0     3.9     3.8     4.8
                  4.2     4.5     4.1     4.2     3.0
Mean ȳ_i          4.30    3.97    4.46    3.12    3.34
Variance s²_i     0.82    0.19    0.41    0.58    0.54

TABLE 20.2
Ten Possible Differences of Means (ȳ_i − ȳ_j) Between Five Laboratories

                          Laboratory i (Average = ȳ_i)
Laboratory j     1 (4.30)   2 (3.97)   3 (4.46)   4 (3.12)   5 (3.34)
1 (4.30)            —
2 (3.97)           0.33        —
3 (4.46)          −0.16      −0.49        —
4 (3.12)           1.18       0.85       1.34        —
5 (3.34)           0.96       0.63       1.12      −0.22        —

Case Study: Measurements of Lead by Five Laboratories


Five laboratories each made measurements of lead on ten replicate wastewater specimens. The data are given in Table 20.1 along with the mean and variance for each laboratory. The ten possible comparisons of mean lead concentrations are given in Table 20.2. Laboratory 3 has the highest mean (4.46 μg/L) and laboratory 4 has the lowest (3.12 μg/L). Are the differences consistent with what one might expect from random sampling and measurement error, or can they be attributed to real differences in the performance of the laboratories? We will illustrate Tukey's multiple t-test and Dunnett's method of multiple comparisons with a control, with a minimal explanation of statistical theory.
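As a check on the printed tables, the sketch below reproduces the laboratory means, sample variances, and the ten pair-wise differences from the raw measurements copied out of Table 20.1; it is only a convenience, not part of the original text.

```python
import itertools
import numpy as np

# Raw lead measurements (ug/L) from Table 20.1, one list per laboratory.
labs = {
    1: [3.4, 3.0, 3.4, 5.0, 5.1, 5.5, 5.4, 4.2, 3.8, 4.2],
    2: [4.5, 3.7, 3.8, 3.9, 4.3, 3.9, 4.1, 4.0, 3.0, 4.5],
    3: [5.3, 4.7, 3.6, 5.0, 3.6, 4.5, 4.6, 5.3, 3.9, 4.1],
    4: [3.2, 3.4, 3.1, 3.0, 3.9, 2.0, 1.9, 2.7, 3.8, 4.2],
    5: [3.3, 2.4, 2.7, 3.2, 3.3, 2.9, 4.4, 3.4, 4.8, 3.0],
}

means = {i: np.mean(x) for i, x in labs.items()}
variances = {i: np.var(x, ddof=1) for i, x in labs.items()}   # sample variances

for i in labs:
    print(f"Lab {i}: mean = {means[i]:.2f}, variance = {variances[i]:.2f}")

# The ten possible pair-wise differences of means (Table 20.2).
for i, j in itertools.combinations(labs, 2):
    print(f"Lab {i} - Lab {j}: {means[i] - means[j]:+.2f}")
```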

Tukey's Paired Comparison Method


A (1 − α)100% confidence interval for the true difference between the means of two treatments, say treatments i and j, is:

$$(\bar{y}_i - \bar{y}_j) \pm t_{\nu,\alpha/2}\, s_{pool} \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$

TABLE 20.3
Values of the Studentized Range Statistic q_{k,ν,α/2} for k(k − 1)/2 Two-Sided Comparisons for a Joint 95% Confidence Interval Where There Are a Total of k Treatments

                                   k
  ν        2       3       4       5       6       8      10
   5     4.47    5.56    6.26    6.78    7.19    7.82    8.29
  10     3.73    4.47    4.94    5.29    5.56    5.97    6.29
  15     3.52    4.18    4.59    4.89    5.12    5.47    5.74
  20     3.43    4.05    4.43    4.70    4.91    5.24    5.48
  30     3.34    3.92    4.27    4.52    4.72    5.02    5.24
  60     3.25    3.80    4.12    4.36    4.54    4.81    5.01
   ∞     3.17    3.68    3.98    4.20    4.36    4.61    4.78

Note: Family error rate = 5%; α/2 = 0.05/2 = 0.025.
Source: Harter, H. L. (1960). Annals Math. Stat., 31, 1122–1147.

where it is assumed that the two treatments have the same variance, which is estimated by pooling the two sample variances:

$$s_{pool}^2 = \frac{(n_i - 1)s_i^2 + (n_j - 1)s_j^2}{n_i + n_j - 2}$$

The chance that the interval includes the true value for any single comparison is exactly 1 − α. But the chance that all possible k(k − 1)/2 intervals will simultaneously contain their true values is less than 1 − α. Tukey (1949) showed that the confidence interval for the difference in two means (i and j), taking into account that all possible comparisons of k treatments may be made, is given by:

$$\bar{y}_i - \bar{y}_j \pm \frac{q_{k,\nu,\alpha/2}}{\sqrt{2}}\, s_{pool} \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$

where q_{k,ν,α/2} is the upper significance level of the studentized range for k means and ν degrees of freedom in the estimate s²_pool of the variance σ². This formula is exact if the numbers of observations in all the averages are equal, and approximate if the k treatments have different numbers of observations. The value of s²_pool is obtained by pooling sample variances over all k treatments:

$$s_{pool}^2 = \frac{(n_1 - 1)s_1^2 + \cdots + (n_k - 1)s_k^2}{n_1 + \cdots + n_k - k}$$

The confidence interval is wider when q_{k,ν,α/2} is used than when the t statistic is used. This is because the studentized range allows for the possibility that any one of the k(k − 1)/2 possible pair-wise comparisons might be selected for the test. Critical values of q_{k,ν,α/2} have been tabulated by Harter (1960) and may be found in the statistical tables of Rohlf and Sokal (1981) and Pearson and Hartley (1966). Table 20.3 gives a few values for computing the two-sided 95% confidence interval.
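A minimal sketch of these two formulas follows; the function names are illustrative, and the studentized-range value q is assumed to be read from a table such as Table 20.3 rather than computed.

```python
import math

def pooled_variance(variances, ns):
    """Pool sample variances over all k treatments (equal variances assumed)."""
    numerator = sum((n - 1) * s2 for s2, n in zip(variances, ns))
    denominator = sum(ns) - len(ns)
    return numerator / denominator

def tukey_half_width(q, s_pool, n_i, n_j):
    """Half-width of the Tukey confidence interval for y_i - y_j,
    given the tabulated studentized-range value q = q_{k, nu, alpha/2}."""
    return (q / math.sqrt(2)) * s_pool * math.sqrt(1.0 / n_i + 1.0 / n_j)
```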

Solution: Tukey's Method


For this example, k = 5, s²_pool = 0.51, s_pool = 0.71, ν = 50 − 5 = 45, and q_{5,40,0.05/2} = 4.49. This gives the 95% confidence limits of:

$$(\bar{y}_i - \bar{y}_j) \pm \frac{4.49}{\sqrt{2}}\, (0.71) \sqrt{\frac{1}{10} + \frac{1}{10}}$$

TABLE 20.4
Table of t_{k−1,ν,0.05/2} for k − 1 Two-Sided Comparisons for a Joint 95% Confidence Level Where There Are a Total of k Treatments, One of Which Is a Control

             k − 1 = Number of Treatments Excluding the Control
  ν        2       3       4       5       6       8      10
   5     3.03    3.29    3.48    3.62    3.73    3.90    4.03
  10     2.57    2.76    2.89    2.99    3.07    3.19    3.29
  15     2.44    2.61    2.73    2.82    2.89    3.00    3.08
  20     2.38    2.54    2.65    2.73    2.80    2.90    2.98
  30     2.32    2.47    2.58    2.66    2.72    2.82    2.89
  60     2.27    2.41    2.51    2.58    2.64    2.73    2.80
   ∞     2.21    2.35    2.44    2.51    2.57    2.65    2.72

Source: Dunnett, C. W. (1964). Biometrics, 20, 482–491.

and the difference in the true means is, with 95% confidence, within the interval:

$$(\bar{y}_i - \bar{y}_j) \pm 1.01$$

We can say, with a high degree of confidence, that any observed difference larger than 1.01 μg/L or smaller than −1.01 μg/L is not likely to be zero. We conclude that laboratories 3 and 1 are higher than laboratory 4 and that laboratory 3 is also different from laboratory 5. We cannot say which laboratory is correct, or which one is best, without knowing the true concentration of the test specimens.
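The same arithmetic can be written out in a few lines. The sketch below assumes the values quoted above (q = 4.49, s_pool = 0.71, n = 10 per laboratory) and the lab means from Table 20.1; it reproduces the ±1.01 μg/L bound and flags the same three pairs.

```python
import itertools
import math

# Laboratory means from Table 20.1 and the assumed Tukey ingredients.
means = {1: 4.30, 2: 3.97, 3: 4.46, 4: 3.12, 5: 3.34}
half_width = (4.49 / math.sqrt(2)) * 0.71 * math.sqrt(1 / 10 + 1 / 10)   # ~1.01

for i, j in itertools.combinations(means, 2):
    d = means[i] - means[j]
    if abs(d) > half_width:
        print(f"Lab {i} vs Lab {j}: difference {d:+.2f} exceeds +/-{half_width:.2f}")
```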

Dunnett's Method for Multiple Comparisons with a Control


In many experiments and monitoring programs, one experimental condition (treatment, location, etc.) is a standard or control treatment. In bioassays, there is always an unexposed group of organisms that serves as a control. In river monitoring, one location above a waste outfall may serve as a control or reference station. Now, instead of k treatments to compare, there are only k − 1. And there is a strong likelihood that the control will be different from at least one of the other treatments. The quantities to be tested are the differences ȳ_i − ȳ_c, where ȳ_c is the observed average response for the control treatment. The (1 − α)100% confidence intervals for all k − 1 comparisons with the control are given by:

$$(\bar{y}_i - \bar{y}_c) \pm t_{k-1,\nu,\alpha/2}\, s_{pool} \sqrt{\frac{1}{n_i} + \frac{1}{n_c}}$$

This expression is similar to Tukey's as used in the previous section, except that the quantity q_{k,ν,α/2}/√2 is replaced with Dunnett's t_{k−1,ν,α/2}. The value of s_pool is obtained by pooling over all treatments. An abbreviated table for 95% confidence intervals is reproduced in Table 20.4. More extensive tables for one- and two-sided tests are found in Dunnett (1964).
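A minimal sketch of the corresponding half-width, parallel to the Tukey helper above; Dunnett's t is assumed to be read from a table such as Table 20.4, and the names are illustrative.

```python
import math

def dunnett_half_width(t_dunnett, s_pool, n_i, n_c):
    """Half-width of the confidence interval for y_i - y_c, given the
    tabulated Dunnett value t_dunnett = t_{k-1, nu, alpha/2}."""
    return t_dunnett * s_pool * math.sqrt(1.0 / n_i + 1.0 / n_c)
```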

Solution: Dunnett's Method


Rather than create a new example, we reconsider the data in Table 20.1, supposing that laboratory 2 is a reference (control) laboratory. Pooling sample variances over all five laboratories gives the estimated within-laboratory variance s²_pool = 0.51 and s_pool = 0.71. For k − 1 = 4 treatments to be compared with the control and ν = 45 degrees of freedom, the value t_{4,45,0.05/2} = 2.55 is found in Table 20.4. The 95%

TABLE 20.5
Comparing Four Laboratories with a Reference Laboratory

Laboratory                  Control (Lab 2)   Lab 1   Lab 3   Lab 4   Lab 5
Average                          3.97          4.30    4.46    3.12    3.34
Difference (ȳ_i − ȳ_c)            —             0.33    0.49   −0.85   −0.63

confidence limits are:

$$(\bar{y}_i - \bar{y}_c) \pm 2.55\,(0.71) \sqrt{\frac{1}{10} + \frac{1}{10}}$$

which gives

$$(\bar{y}_i - \bar{y}_c) \pm 0.81$$

We can say with 95% confidence that any observed difference greater than 0.81 or smaller than −0.81 is unlikely to be zero. The four comparisons with laboratory 2 shown in Table 20.5 indicate that the measurements from laboratory 4 are smaller than those of the control laboratory.
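The same check in code, assuming t = 2.55, s_pool = 0.71, and n = 10 as above, reproduces the ±0.81 bound and flags only laboratory 4.

```python
# Laboratory means from Table 20.1, with laboratory 2 taken as the control.
means = {"lab 1": 4.30, "lab 3": 4.46, "lab 4": 3.12, "lab 5": 3.34}
y_control = 3.97

half_width = 2.55 * 0.71 * (1 / 10 + 1 / 10) ** 0.5   # ~0.81

for lab, y_i in means.items():
    d = y_i - y_control
    verdict = "differs from control" if abs(d) > half_width else "not distinguishable"
    print(f"{lab}: {d:+.2f}  ({verdict})")
```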

Comments
Box et al. (1978) describe yet another way of making multiple comparisons. The simple idea is that if the k treatment averages had the same mean, they would appear to be k observations from the same, nearly normal distribution with standard deviation σ/√n. The plausibility of this outcome is examined graphically by constructing such a normal reference distribution and superimposing upon it a dot diagram of the k average values. The reference distribution is then moved along the horizontal axis to see if there is a way to locate it so that all the observed averages appear to be typical random values selected from it. This sliding reference distribution is a rough method for making what are called multiple comparisons; the Tukey and Dunnett methods are more formal ways of making these comparisons.

Dunnett (1955) discussed the allocation of observations between the control group and the other p = k − 1 treatment groups. For practical purposes, if the experimenter is working with a joint confidence level in the neighborhood of 95% or greater, then the experiment should be designed so that n_c/n = √p, approximately, where n_c is the number of observations on the control and n is the number on each of the p noncontrol treatments. Thus, for an experiment that compares four treatments to a control, p = 4 and n_c is approximately 2n.
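Both ideas reduce to small calculations. The sketch below assumes the pooled values used earlier in the chapter (s_pool = 0.71, n = 10 per laboratory) and simply evaluates the scale of the reference distribution for an average and the control-group allocation rule.

```python
import math

# Scale of the sliding reference distribution for an average of n observations,
# using the pooled standard deviation from the case study.
s_pool, n = 0.71, 10
print(f"Reference distribution standard deviation: {s_pool / math.sqrt(n):.2f}")   # ~0.22

# Dunnett's allocation rule n_c / n ~ sqrt(p) for p noncontrol treatments.
p = 4
n_control = round(math.sqrt(p) * n)   # ~2n = 20 observations when p = 4, n = 10
print(f"Suggested number of control observations: {n_control}")
```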

References
Box, G. E. P., W. G. Hunter, and J. S. Hunter (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, New York, Wiley Interscience.
Dunnett, C. W. (1955). Multiple Comparison Procedure for Comparing Several Treatments with a Control, J. Am. Stat. Assoc., 50, 1096–1121.
Dunnett, C. W. (1964). New Tables for Multiple Comparisons with a Control, Biometrics, 20, 482–491.
Harter, H. L. (1960). Tables of Range and Studentized Range, Annals Math. Stat., 31, 1122–1147.
Pearson, E. S. and H. O. Hartley (1966). Biometrika Tables for Statisticians, Vol. 1, 3rd ed., Cambridge, England, Cambridge University Press.
Rohlf, F. J. and R. R. Sokal (1981). Statistical Tables, 2nd ed., New York, W. H. Freeman & Co.
Sokal, R. R. and F. J. Rohlf (1969). Biometry: The Principles and Practice of Statistics in Biological Research, New York, W. H. Freeman and Co.
Tukey, J. W. (1949). Comparing Individual Means in the Analysis of Variance, Biometrics, 5, 99.
Tukey, J. W. (1991). The Philosophy of Multiple Comparisons, Stat. Sci., 6(6), 100–116.
