
The Journal of General Psychology, 2011, 138(4), 292–299. Copyright © 2011 Taylor & Francis Group, LLC

Correcting Overestimated Effect Size Estimates in Multiple Trials


WOLFGANG WIEDERMANN, BARTOSZ GULA, PAUL CZECH, AND DENISE MUSCHIK
University of Klagenfurt

ABSTRACT. In a simulation study, Brand, Bradley, Best, and Stoica (2011) have shown that Cohen's d is notably overestimated if computed for data aggregated over multiple trials. Although the phenomenon is highly important for studies, and for meta-analyses of studies, structurally similar to the simulated scenario, the authors do not comprehensively address how the problem could be handled. In this comment, we first suggest a corrective term d_c that incorporates the number of trials and the inter-trial correlation. Next, the results of a simulation study provide evidence that the proposed d_c results in a more precise estimation of trial-level effects. We conclude that, in practice, d_c together with plausible estimates of the inter-trial correlation will produce a more precise effect size range than that suggested by Brand and colleagues (2011).

Keywords: aggregation bias, Cohen's d, effect size, multiple trials

RECENTLY, BRAND, BRADLEY, BEST, AND STOICA (2011) addressed the important topic of highly overestimated effect size estimates obtained from data aggregated across multiple trials, henceforth referred to as aggregation bias. Via simulations, the authors showed that Cohen's d (Cohen, 1988), the standardized effect size of the mean difference between two independent samples, is generally overestimated if computed from the sums or means of repeated measurements, assuming that the effects of interest occur at the trial level. Under some circumstances (e.g., 30 independent trials per person), the population effect is overestimated by up to nearly six times. The authors explain this by the fact that the pooled variance of aggregated trials decreases as the number of trials increases. They suggest the additional use of a more conservative split-plot ANOVA and the reporting of the average inter-trial correlation together with the number of trials.
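
As a quick check of that figure: for uncorrelated trials the inflation factor is √k, as follows from the correction term derived later in this comment (the snippet is our illustration, not part of the original simulations):

    ## With k = 30 independent trials, the aggregated d overestimates the
    ## trial-level d by a factor of sqrt(k)
    sqrt(30)  # approx. 5.48, i.e., nearly a six-fold overestimation
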
The authors wish to thank Rainer Alexandrowicz and Reinhold Hatzinger for valuable comments on an earlier version of the article. Address correspondence to Wolfgang Wiedermann, University of Klagenfurt, Department of Psychology, Universitaetsstrasse 65-67, 9020 Klagenfurt, Austria; wolfgang.wiedermann@aau.at (e-mail).

In this comment, we (1) argue that the suggested computation of an effect size range is opaque and of marginal practical value, (2) propose a simple corrective term that attenuates the aggregation bias, and (3) demonstrate the adequacy of the proposed correction method in a simulation study.

As a consequence of effect size overestimation, Brand and colleagues (2011, p. 9) suggest ". . . a liberal test and the more conservative split-plot design to gain a full understanding of a range of potential effect sizes." Beyond this suggestion to incorporate trials as a separate factor in the analysis, the authors provide no further information on why and how this would be an improvement. Although many effect size measures for split-plot ANOVAs exist, such as η², ω², or generalized η² (Olejnik & Algina, 2003), η² is the one commonly reported in the psychological literature (Pierce, Block, & Aguinis, 2004). Partial η² is part of the standard output of statistical packages such as SPSS, which is likely to provoke imprecise reporting of η² values, where partial η² is erroneously referred to as classic η² (Levine & Hullet, 2002; Pierce, Block, & Aguinis, 2004). Partial η² is defined as SS_effect/(SS_effect + SS_error), whereas classic η² is defined as SS_effect/SS_total, SS being the sum of squares. For single-trial studies, the two measures yield identical results because SS_total = SS_effect + SS_error. Both η² measures can be transformed into the same metric as Cohen's d by applying d = 2√(η²/(1 − η²)) (e.g., see Cohen, 1988).
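
For illustration, this conversion can be sketched in R, the language used for the simulations reported below (the function name and the example value are ours, not part of the original article):

    ## Convert (partial or classic) eta squared into the Cohen's d metric,
    ## d = 2 * sqrt(eta2 / (1 - eta2)); see Cohen (1988).
    eta2_to_d <- function(eta2) 2 * sqrt(eta2 / (1 - eta2))
    eta2_to_d(0.059)  # approx. 0.50, i.e., a medium effect in the d metric
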
If partial η² is used, the overestimation will be the same as that for Cohen's d calculated from aggregated variables. This follows from the fact that both the aggregated Cohen's d and the partial η² for the main effect of treatment in the corresponding split-plot ANOVA rely on the same within-group sums of squares. Only classical η² provides a more conservative estimate of the true underlying trial-level effect. However, even if an effect size range is computed from the classical η² (transformed into the Cohen's d metric) and the mean-aggregated Cohen's d, this interval will be of marginal value and misleading for the following reasons. First, depending on the inter-trial correlation and the number of trials, the range may be too large to be informative. Second, should classical η² provide an accurate estimate of the true effect, then the least-biased estimate of the effect is not located somewhere in the middle of the range but equals its lower bound. Furthermore, it has been shown that such split-plot analyses entail a power loss for the between-subjects effect as the number of trials and the inter-trial correlation increase (Bradley & Russell, 1998). Therefore, we suggest that classical η² and the corrected Cohen's d proposed below be compared in future studies. In the present article, however, we focus on the improvement of Cohen's d, because it is more commonly reported in studies dealing with two-group comparisons.

A further conclusion the authors arrive at concerns a more sophisticated reporting practice, in which the number of trials as well as the average inter-trial correlation should be routinely reported to allow researchers to draw inferences about the strength of the results.


Although we generally support the idea of a more sophisticated reporting routine in the presentation of empirical results, we doubt that this additional information alone allows for empirically sound conclusions about the actual inflation of effect size estimates. Instead of simply reporting the additional numbers, we suggest an empirical approach to assess the amount of overestimation, which is outlined in more detail in the next section.

A Simple Correction Method

The central limit theorem states that as the sample size (n) of independent samples increases, the distribution of the sample means approaches a normal distribution with mean μ and variance σ²/n. Thus, for k independent trials, the individual means follow a normal distribution with mean μ and variance σ²/k, where σ² denotes the variance of each respondent's trials. Consequently, the bias in the effect size estimate d_a obtained from averaged responses depends only on the number of trials and can be corrected using d_c = d_a/√k. However, this approach requires all responses to be mutually independent, which seems rather implausible in practice. For correlated variables, the variance of mean-aggregated measures of k trials is defined as

    ((k − 1)/k)ρσ² + σ²/k,

with ρ and σ² being the inter-trial correlation and the trial variance (assuming equal trial variances), respectively. Transforming the latter equation shows that the mean-aggregated effect size estimate can be corrected using

    d_c = d_a / (k/√(k + ρ(k² − k))).

Here, knowledge of the true population correlation ρ is required. Therefore, we performed a simulation study to investigate the properties of this correction when the sample-based average inter-trial correlation is used in place of the true population correlation.

Methods

In essence, we adopted the simulation design of Brand and colleagues (2011)¹ in order to ensure comparability of results. However, instead of rearranging simulated values to obtain an average correlation between trials, two independent n × k matrices (n = number of participants, k = number of trials) following a multivariate normal distribution were generated. Each trial had a mean of 10 and a standard deviation of 2. Sample size was n = 38 per group. The covariance structures of the multivariate distributions were varied to obtain average inter-trial correlations of ρ = 0, 0.2, 0.5, and 0.8. The means of the second matrix were increased to obtain the three most common effect sizes, d = 0.20 (small), 0.50 (medium), and 0.80 (large). For each of 100,000 iterations, mean-aggregated variables were computed and the Cohen's d estimate (d_a) as well as the corrected d_c were logged. Because the proposed correction assumes knowledge of the true population correlation ρ, we additionally calculated d_c′ = d_a/(k/√(k + r(k² − k))), replacing ρ with the averaged sample estimate r. We consider only the case of mean-aggregated measures and not aggregation based on sums, which was also simulated by Brand and colleagues (2011); mean aggregation is more common in practice because missing values impair sum aggregation due to unequal numbers of values per participant. The simulation study was performed in R (R Development Core Team, 2011).
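
A condensed R sketch of a single iteration of this design (assuming the MASS package for multivariate normal generation; variable names and the parameter values shown are ours):

    library(MASS)  # provides mvrnorm()

    n <- 38; k <- 30          # participants per group, number of trials
    rho <- 0.5; sigma <- 2    # inter-trial correlation, trial SD
    d_true <- 0.5             # true trial-level effect

    ## Compound-symmetric covariance matrix of the k trials
    Sigma <- sigma^2 * ((1 - rho) * diag(k) + rho * matrix(1, k, k))

    g1 <- mvrnorm(n, mu = rep(10, k), Sigma = Sigma)
    g2 <- mvrnorm(n, mu = rep(10 + d_true * sigma, k), Sigma = Sigma)

    ## Aggregated Cohen's d computed from the participant means
    m1 <- rowMeans(g1); m2 <- rowMeans(g2)
    s_pool <- sqrt((var(m1) + var(m2)) / 2)  # pooled SD, equal group sizes
    d_a <- (mean(m2) - mean(m1)) / s_pool

    ## Average sample inter-trial correlation, pooled across both groups
    r <- mean(c(cor(g1)[lower.tri(cor(g1))], cor(g2)[lower.tri(cor(g2))]))

    ## Corrected estimate based on the sample correlation (d_c' in the text)
    d_c <- d_a / (k / sqrt(k + r * (k^2 - k)))

Across many iterations, d_a concentrates near 0.71 for these parameter values (cf. Table 1), whereas d_c stays close to the true trial-level effect of 0.5.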


Results

Table 1 shows the amount of effect size inflation as a function of ρ and k for small, medium, and large true effects. For the uncorrected estimates (d_a), the results are virtually identical to those of Brand and colleagues (2011). Generally, d_a values are heavily inflated as the number of trials increases. For uncorrelated trials this overestimation is generally more pronounced (with distortions ranging from 129% to 461%) than for correlated trials. Correcting Cohen's d using the true population correlation (d_c) eliminates the inflation in all simulated scenarios. However, d_c values are still slightly above the true population effect; across all scenarios the overestimation ranged from 0% to 5%. This might be attributable to the fact that the ordinary Cohen's d is biased in the case of small sample sizes (Hedges & Olkin, 1985). Correcting these values using Hedges and Olkin's formula g = d(1 − 3/(4N − 9)), where N is the total sample size, would additionally mitigate the overestimation (not shown in Table 1 owing to space limitations). Because the true population correlation is unknown in practice, the rows in Table 1 labeled d_c′ refer to the corrected effect size estimates based on the sample average inter-trial correlation r. For independent trials (ρ = 0) and small effects (d = 0.20) the distortion of d_c′ ranges from 0% to 15%. For medium true effects (d = 0.50) a maximum distortion of 68% is observed. In the case of large true effects (d = 0.80) and k > 1, the overestimation ranges from 28% to 130%. However, as Brand and colleagues (2011) already noted, a zero inter-trial correlation appears extremely unrealistic; scenarios involving correlated trials are therefore of greater interest for practice. For small and medium true effects, the correction performs almost as well as when the true population correlation is used: across all levels of trials and correlations the maximum distortion is 12%. For large true effects (d = 0.80) together with a rather small inter-trial correlation (ρ = 0.2), the effect size inflation ranges from 15% to 24% depending on the number of trials.
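
The Hedges and Olkin correction mentioned above is straightforward to apply; a one-line R sketch (the function name is ours):

    ## Hedges & Olkin (1985): g = d * (1 - 3 / (4N - 9)), N = total sample size
    hedges_g <- function(d, N) d * (1 - 3 / (4 * N - 9))
    hedges_g(0.51, N = 76)  # approx. 0.505 for two groups of n = 38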

TABLE 1. Averaged Cohen's d (See Text) of Either Small, Medium, or Large True Effects Derived From Aggregation Over 1, 5, 10, 20, and 30 Trials. Values in Parentheses Represent the Percentage of Overestimation Computed Based on Rounded Estimates.

                                        Number of Trials
ρ     Estimate   1            5             10            20            30

Cohen's d = 0.2 (small effect)
0     d_a        0.20 (0.0)   0.46 (130.0)  0.65 (225.0)  0.92 (360.0)  1.12 (460.0)
      d_c        0.20 (0.0)   0.20 (0.0)    0.20 (0.0)    0.20 (0.0)    0.20 (0.0)
      d_c′       0.20 (0.0)   0.21 (5.0)    0.21 (5.0)    0.22 (10.0)   0.23 (15.0)
0.2   d_a        0.20 (0.0)   0.34 (70.0)   0.39 (95.0)   0.42 (110.0)  0.43 (115.0)
      d_c        0.20 (0.0)   0.20 (0.0)    0.20 (0.0)    0.20 (0.0)    0.21 (5.0)
      d_c′       0.20 (0.0)   0.21 (5.0)    0.21 (5.0)    0.21 (5.0)    0.21 (5.0)
0.5   d_a        0.20 (0.0)   0.27 (35.0)   0.28 (40.0)   0.28 (40.0)   0.28 (40.0)
      d_c        0.20 (0.0)   0.21 (5.0)    0.20 (0.0)    0.20 (0.0)    0.20 (0.0)
      d_c′       0.20 (0.0)   0.21 (5.0)    0.21 (5.0)    0.21 (5.0)    0.21 (5.0)
0.8   d_a        0.20 (0.0)   0.22 (10.0)   0.23 (15.0)   0.23 (15.0)   0.23 (15.0)
      d_c        0.20 (0.0)   0.21 (5.0)    0.20 (0.0)    0.20 (0.0)    0.21 (5.0)
      d_c′       0.20 (0.0)   0.21 (5.0)    0.20 (0.0)    0.21 (5.0)    0.21 (5.0)

Cohen's d = 0.5 (medium effect)
0     d_a        0.51 (2.0)   1.15 (130.0)  1.62 (224.0)  2.29 (358.0)  2.80 (460.0)
      d_c        0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
      d_c′       0.51 (2.0)   0.57 (14.0)   0.63 (26.0)   0.75 (50.0)   0.84 (68.0)
0.2   d_a        0.51 (2.0)   0.85 (70.0)   0.97 (94.0)   1.05 (110.0)  1.07 (114.0)
      d_c        0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
      d_c′       0.51 (2.0)   0.54 (8.0)    0.55 (10.0)   0.56 (12.0)   0.56 (12.0)
0.5   d_a        0.51 (2.0)   0.66 (32.0)   0.69 (38.0)   0.70 (40.0)   0.71 (42.0)
      d_c        0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
      d_c′       0.51 (2.0)   0.52 (4.0)    0.52 (4.0)    0.52 (4.0)    0.53 (6.0)
0.8   d_a        0.51 (2.0)   0.56 (12.0)   0.57 (14.0)   0.57 (14.0)   0.57 (14.0)
      d_c        0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
      d_c′       0.51 (2.0)   0.51 (2.0)    0.52 (4.0)    0.51 (2.0)    0.52 (4.0)

Cohen's d = 0.8 (large effect)
0     d_a        0.82 (2.5)   1.83 (128.8)  2.59 (223.8)  3.66 (357.5)  4.49 (461.3)
      d_c        0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
      d_c′       0.82 (2.5)   1.02 (27.5)   1.23 (53.8)   1.56 (95.0)   1.84 (130.0)
0.2   d_a        0.82 (2.5)   1.37 (71.3)   1.55 (93.8)   1.67 (108.8)  1.72 (115.0)
      d_c        0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
      d_c′       0.82 (2.5)   0.92 (15.0)   0.95 (18.8)   0.98 (22.5)   0.99 (23.8)
0.5   d_a        0.82 (2.5)   1.06 (32.5)   1.10 (37.5)   1.13 (41.3)   1.14 (42.5)
      d_c        0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
      d_c′       0.82 (2.5)   0.86 (7.5)    0.86 (7.5)    0.87 (8.58)   0.87 (8.75)
0.8   d_a        0.82 (2.5)   0.89 (11.3)   0.90 (12.5)   0.91 (13.8)   0.91 (13.8)
      d_c        0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
      d_c′       0.82 (2.5)   0.83 (3.8)    0.83 (3.8)    0.83 (3.8)    0.83 (3.8)

Note. d_a = uncorrected estimate from mean-aggregated data; d_c = corrected using the true population correlation ρ; d_c′ = corrected using the sample average inter-trial correlation r.


Conclusions

The simulation shows that the distortion of effect size estimates can be corrected if the true underlying inter-trial correlation is known. If, on the other hand, the true correlation is unknown, the sample estimate can be used to notably reduce the overestimation reported by Brand and colleagues (2011). In the more realistic case of ρ > 0 and the most extreme case of a true Cohen's d = 0.8, the overestimation was 24% (average d_c′ = 0.99), compared to the distortion of 115% obtained from the uncorrected measure. Moreover, this correction does not vary as much as a function of the number of trials, in contrast to the uncorrected Cohen's d. Hence, d_c will be less prone to inflation in meta-analyses in which studies with different numbers of trials are combined.

A related problem and correction approach for inflated effect sizes has been discussed by Dunlap, Cortina, Vaslow, and Burke (1996) for dependent samples. For meta-analyses in which the sample estimate of the correlation between two dependent variables is unavailable, they suggested using estimates from previous studies. Similarly, if in meta-analyses of studies employing independent samples neither the true nor the sample estimate of the inter-trial correlation is given, surrogate values could be derived from previous studies. These would still allow for the application of the more precise d_c. For example, given a study with 30 trials, each trial drawn from a population with true d = 0.5 and ρ = 0.5 (so that the aggregated estimate is d_a = 0.71; cf. Table 1), and previous studies suggesting a plausible correlation range of 0.4 to 0.6, d_c is biased by merely −8% to +12%:

    d_c(lower) = 0.71/(30/√(30 + 0.4(30² − 30))) = 0.46,
    d_c(upper) = 0.71/(30/√(30 + 0.6(30² − 30))) = 0.56.

This range still constitutes a considerable improvement over d_a. An important question for future research on meta-analytic methods is to what degree the effect sizes reported in practice are distorted and how corrective terms, such as the suggested d_c, should be included in the methodological repertoire in order to obtain valid knowledge of the empirical phenomena of interest.
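
The bounds of this example can be reproduced with a small R helper (a sketch; the function name is ours):

    ## Correct an aggregated estimate d_a given k trials and correlation rho
    d_correct <- function(d_a, k, rho) d_a / (k / sqrt(k + rho * (k^2 - k)))
    d_correct(0.71, k = 30, rho = 0.4)  # lower bound, approx. 0.46
    d_correct(0.71, k = 30, rho = 0.6)  # upper bound, approx. 0.56
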
NOTE

1. The authors want to thank Andrew Brand for supplying the R code used in the study.

AUTHOR NOTES

Wolfgang Wiedermann is a research associate at the Applied Psychology and Methods Research Unit, University of Klagenfurt, and at the Department of Health Care Management, Carinthia University of Applied Sciences. His research focuses on statistical methods, addiction research, and health-related cognition. Bartosz Gula is an assistant professor at the Cognitive Psychology Unit, University of Klagenfurt. His main research focuses on judgment and decision making, learning, and memory. Paul Czech is a graduate student at the Applied Psychology and Methods Research Unit, University of Klagenfurt. His primary interests are statistical methods. Denise Muschik is a graduate student at the Applied Psychology and Methods Research Unit, University of Klagenfurt. Her primary interests are applied statistics and consumer research.


REFERENCES

Bradley, D. R., & Russell, R. L. (1998). Some cautions regarding statistical power in split-plot designs. Behavior Research Methods, Instruments, & Computers, 30, 462–477.
Brand, A., Bradley, M. T., Best, L. A., & Stoica, G. (2011). Multiple trials may yield exaggerated effect size estimates. The Journal of General Psychology, 138, 1–11.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170–177.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Levine, T. R., & Hullet, C. G. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28, 612–625.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434–447.
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64, 916–924.
R Development Core Team (2011). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. URL http://www.R-project.org

Original manuscript received May 9, 2011
Final version accepted July 8, 2011
