UNDERSTANDING STATISTICS, 3(4), 201-230 Copyright 2004, Lawrence Erlbaum Associates, Inc.

Post Hoc Power: A Concept Whose Time Has Come


Anthony J. Onwuegbuzie
Department of Educational Measurement and Research University of South Florida

Nancy L. Leech
Department of Educational Psychology University of Colorado at Denver and Health Sciences

This article advocates the use of post hoc power analyses. First, reasons for the nonuse of a priori power analyses are presented. Next, post hoc power is defined and its utility delineated. Third, a step-by-step guide is provided for conducting post hoc power analyses. Fourth, a heuristic example is provided to illustrate how post hoc power can help to rule in/out rival explanations in the presence of statistically nonsignificant findings. Finally, several methods are outlined that describe how post hoc power analyses can be used to improve the design of independent replications.

Keywords: power, a priori power, post hoc power, significance testing, effect size

Requests for reprints should be sent to Anthony J. Onwuegbuzie, Department of Educational Measurement and Research, College of Education, University of South Florida, 4202 East Fowler Avenue, EDU 162, Tampa, FL 33620-7750. E-mail: tonyonwuegbuzie@aol.com

For more than 75 years, null hypothesis significance testing (NHST) has dominated the quantitative paradigm stemming from the seminal works of Fisher (1925/1941) and Neyman and Pearson (1928). NHST was designed to provide a means of ruling out a chance finding, thereby reducing the chance of falsely rejecting the null hypothesis in favor of the alternative hypothesis (i.e., committing a Type I error). Although NHST has permeated the behavioral and social science field since its inception, its practice has been subjected to severe criticism with the number of critics growing throughout the years. The most consistent criticism of NHST that has emerged is that statistical significance is not synonymous with practical importance. More specifically, statistical significance does not provide any information about how important or meaningful an observed finding is (e.g., Bakan, 1966; Cahan, 2000; Carver, 1978, 1993; Cohen, 1994, 1997; Guttman, 1985; Loftus, 1996; Meehl, 1967, 1978; Nix & Barnette, 1998; Onwuegbuzie & Daniel, 2003; Rozeboom, 1960; Schmidt, 1992, 1996; Schmidt & Hunter, 1997). As a result of this limitation of NHST, some researchers (e.g., Carver, 1993) contend that effect sizes, which represent measures of practical importance, should replace statistical significance testing completely. However, reporting and interpreting only effect sizes could lead to the overinterpretation of a finding (Onwuegbuzie & Levin, 2003; Onwuegbuzie, Levin, & Leech, 2003). As noted by Robinson and Levin (1997),
Although effect sizes speak loads about the magnitude of a difference or relationship, they are, in and of themselves, silent with respect to the probability that the estimated difference or relationship is due to chance (sampling error). Permitting authors to promote and publish seemingly interesting or unusual outcomes when it can be documented that such outcomes are not really that unusual would open the publication floodgates to chance occurrences and other strange phenomena. (p. 25)

Moreover, interpreting effect sizes in the absence of an NHST could adversely affect conclusion validity just as interpreting p values without interpreting effect sizes could lead to conclusion invalidity (Onwuegbuzie, 2001). Table 1 illustrates the four possible outcomes and their conclusion validity when combining p values and effect sizes. This table echoes the decision table that demonstrates the relationship between Type I error and Type II error. It can be seen from Table 1 that conclusion validity occurs when (a) both statistical significance (e.g., p < .05) and a large effect size prevail or (b) both statistical nonsignificance (e.g., p > .05) and a small effect size occur. Conversely, conclusion invalidity exists if (a) statistical nonsignificance (e.g., p > .05) is combined with a large effect size (i.e., Type A error) or (b) statistical significance (e.g., p < .05) is combined with a small effect size (i.e., Type B error). Both Type A and Type B errors suggest that any declaration by the researcher that the observed finding is meaningful would be misleading and hence result in conclusion invalidity (Onwuegbuzie, 2001). In this respect, conclusion validity is similar to what Levin and Robinson (2000) termed conclusion coherence. Levin and Robinson (2000) defined conclusion coherence as "consistency between the hypothesis-testing and estimation phases of the decision-oriented empirical study" (p. 35). Therefore, as can be seen in Table 1, always interpreting effect sizes regardless of whether statistical significance is observed may lead to a Type A or Type B error being committed. Similarly, Type A and Type B errors also prevail when NHST is used by itself. In fact, conclusion validity is maximized only if both p values and effect sizes are utilized.

TABLE 1
Possible Outcomes and Conclusion Validity When Null Hypothesis Significance Testing and Effect Sizes Are Combined

                                  Effect Size
p Value                           Large                  Small
Not statistically significant     Type A error           Conclusion validity
Statistically significant         Conclusion validity    Type B error

In an attempt to maximize conclusion validity (i.e., to reduce the probability of Type A and Type B error), Robinson and Levin (1997) proposed what they termed a two-step process for making statistical inferences. According to this model, a statistically significant observed finding is followed by the reporting and interpreting of one or more indexes of practical importance; however, no effect sizes are reported in light of a statistically nonsignificant finding. In other words, analysts should determine first whether the observed result is statistically significant (Step 1), and if and only if statistical significance is found, then they should report how large or important the observed finding is (Step 2). In this way, the statistical significance test in Step 1 serves as a gatekeeper for the reporting and interpreting of effect sizes in Step 2. This two-step process is indirectly endorsed by the latest edition of the Publication Manual of the American Psychological Association (American Psychological Association [APA], 2001):
When reporting inferential statistics (e.g., t tests, F tests, and chi-square), include [italics added] information about the obtained magnitude or value of the test statistic, the degrees of freedom, the probability of obtaining a value as extreme as or more extreme than the one obtained, and the direction of the effect. Be sure to include [italics added] sufficient descriptive statistics (e.g., per-cell sample size, means, correlations, standard deviations) so that the nature of the effect being reported can be understood by the reader and for future meta-analyses. (p. 22)

Three pages later, the APA (2001) stated


Neither of the two types of probability value directly reflects the magnitude of an effect or the strength of a relationship. For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of effect size or strength of relationship in your Results section. (p. 25)

On the following page, APA (2001) stated that


The general principle to be followed, however, is to provide the reader not only with information about statistical significance but also with enough information to assess the magnitude of the observed effect or relationship. (p. 26)
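
The four outcomes in Table 1, and the gatekeeping role of Step 1 in the two-step process, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the .05 significance level, and the cutoff used to call an effect size "large" (here 0.5) are hypothetical choices made for the example, not values prescribed by the sources cited above.

```python
def classify_outcome(p_value, effect_size, alpha=0.05, large_cutoff=0.5):
    """Classify a result into one of the four cells of Table 1.

    alpha and large_cutoff are illustrative assumptions, not prescribed values.
    """
    significant = p_value < alpha
    large = abs(effect_size) >= large_cutoff
    if significant and large:
        return "conclusion validity"
    if not significant and not large:
        return "conclusion validity"
    if not significant and large:
        return "Type A error"
    return "Type B error"  # statistically significant but small effect


def two_step_report(p_value, effect_size, alpha=0.05):
    """Two-step process: report the effect size only after a significant test."""
    if p_value < alpha:  # Step 1: the significance test acts as gatekeeper
        return f"p = {p_value:.3f}, effect size = {effect_size:.2f}"  # Step 2
    return f"p = {p_value:.3f} (not statistically significant; no effect size reported)"


# Hypothetical result: a large observed effect that fails to reach significance.
print(classify_outcome(0.20, 0.80))
print(two_step_report(0.20, 0.80))
```

The second call shows the case of interest here: a nontrivial effect goes unreported because the significance test in Step 1 was not passed.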


Most recently, Onwuegbuzie and Levin (2003) proposed a three-step procedure when two or more hypothesis tests are conducted within the same study, which involves testing the trend of the set of hypotheses at the third step. Using either the two-step method or the three-step method helps to reduce not only the probability of committing a Type A error but also the probability of committing a Type B error, namely, declaring as important a statistically significant finding with a small effect size (Onwuegbuzie, 2001). However, whereas Type B error almost certainly will be reduced by using one of these methods compared to using NHST alone, the reduction in the probability of Type A error is not guaranteed using these procedures. This is because if statistical power is lacking, then the first step of the two-step method and the first and third steps of the three-step procedure, which serve as gatekeepers for computing effect sizes, may lead to the nonreporting of a nontrivial effect (i.e., Type A error). Simply put, sample sizes that are too small increase the probability of a Type II error (not rejecting a false null hypothesis) and subsequently increase the probability of committing a Type A error. Clearly, both error probabilities can be reduced if researchers conduct a priori power analyses to select appropriate sample sizes. Unfortunately, such analyses are rarely employed (Cohen, 1992; Keselman et al., 1998; Onwuegbuzie, 2002). When a priori power analyses have been omitted, researchers should conduct post hoc power analyses, especially for statistically nonsignificant findings. This would help researchers determine whether low power threatens the internal validity of findings (i.e., Type A error). Yet, researchers typically have not used this technique. Thus, in this article, we advocate the use of post hoc power analyses. First, we present reasons for the nonuse of a priori power analyses. Next, we define post hoc power and delineate its utility.
Third, we provide a step-by-step guide for conducting post hoc power analyses. Fourth, we provide a heuristic example to illustrate how post hoc power can help to rule in/out rival explanations in the presence of statistically nonsignificant findings. Finally, we outline several methods that describe how post hoc power analyses can be used to improve the design of independent replications.

DEFINITION OF STATISTICAL POWER

Neyman and Pearson (1933a, 1933b) were the first to discuss the concepts of Type I and Type II error. Type I error occurs when the researcher rejects the null hypothesis when it is true. As noted previously, the Type I error probability is determined by the significance level (α). For example, if a 5% level of significance is designated, then the Type I error rate is 5%. Stated another way, α represents the conditional probability of making a Type I error when the null hypothesis is true. Neyman and Pearson (1928) defined α as the long-run relative frequency by which Type I errors are made over repeated samples from the same population under the same null and alternative hypothesis, assuming the null hypothesis is true. Conversely, Type II error occurs when the analyst accepts the null hypothesis when the alternative hypothesis is true. The conditional probability of making a Type II error under the alternative hypothesis is denoted by β. Statistical power is the conditional probability of rejecting the null hypothesis (i.e., accepting the alternative hypothesis) when the alternative hypothesis is true. The most common definition of power comes from Cohen (1988), who defined the power of a statistical test as "the probability [assuming the null hypothesis is false] that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists" (p. 4). Power can be viewed as how likely it is that the researcher will find a relationship or difference that really prevails. It is given by 1 − β.

Statistical power estimates are affected by three factors.¹ The first factor is level of significance. Holding all other aspects constant, increasing the level of significance increases power but also increases the probability of rejecting the null hypothesis when it is true. The second influential factor is the effect size. Specifically, the larger the difference between the value of the parameter under the null hypothesis and the parameter under the alternative hypothesis, the greater the power to detect it. The third instrumental component is the sample size. The larger the sample size, the greater the likelihood of rejecting the null hypothesis² (Chase & Tucker, 1976; Cohen, 1965, 1969, 1988, 1992). However, caution is needed because large sample sizes can increase power to the point of rejecting the null hypothesis when the associated observed finding is not practically important.
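
The influence of these three factors can be illustrated numerically. The following minimal Python sketch uses a normal approximation for a two-sided comparison of two group means with equal group sizes. The function name `approx_power`, the choice of test, and the approximation itself are assumptions made for illustration and are not taken from Cohen (1988); an exact calculation would use the noncentral t distribution.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means.

    d is the standardized mean difference; a normal approximation is assumed.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    upper = 1 - _STD_NORMAL.cdf(z_crit - shift)  # rejection in the upper tail
    lower = _STD_NORMAL.cdf(-z_crit - shift)     # rejection in the lower tail
    return upper + lower


# Vary one factor at a time around d = 0.5, n = 30 per group, alpha = .05.
print("baseline      ", round(approx_power(0.5, 30), 3))
print("alpha = .10   ", round(approx_power(0.5, 30, alpha=0.10), 3))
print("larger effect ", round(approx_power(0.8, 30), 3))
print("larger sample ", round(approx_power(0.5, 64), 3))
```

Each of the last three lines should show higher power than the baseline, and when the effect size is zero the function returns the significance level itself, which is the Type I error rate.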

¹Recently, Onwuegbuzie and Daniel (2004) demonstrated empirically that power also is affected by score reliability. Specifically, Onwuegbuzie and Daniel showed that the higher the score reliability of measures used, the greater the statistical power. Also, see Onwuegbuzie and Daniel (2002).
²Power also depends on the type of alternative hypothesis, that is, whether it is one-tailed or two-tailed.

Cohen (1965), in accordance with McNemar (1960), recommended a probability of .80 or greater for correctly rejecting the null hypothesis representing a medium effect at the 5% level of significance. This recommendation was based on considering the ratio of the probability of committing a Type I error (i.e., 5%) to the probability of committing a Type II error (i.e., 1 − .80 = .20). In this case, the ratio was 1:4, reflecting the contention that Type I errors are generally more serious than are Type II errors. Power, level of significance, effect size, and sample size are related such that any one of these components is a function of the other three components. As noted by Cohen (1988), "when any three of them are fixed, the fourth is completely determined" (p. 14). Thus, there are four possible types of power analyses in which one of the parameters is determined as a function of the other three as follows: (a) power as a function of level of significance, effect size, and sample size; (b) effect size as a function of level of significance, sample size, and power; (c) level of significance as a function of sample size, effect size, and power; and (d) sample size as a function of level of significance, effect size, and power (Cohen, 1965, 1988). The latter type of power analysis is the most popular and most useful for planning research studies (Cohen, 1992). This form of power analysis, which is called an a priori power analysis, helps the researcher to ascertain the sample size necessary to obtain a desired level of power for a specified effect size and level of significance. Conventionally, most researchers set the power coefficient at .80 and the level of significance at .05. Thus, once the expected effect size and type of analysis are specified, then the sample size needed to meet all specifications can be determined. The value of an a priori power analysis is that it helps the researcher in planning research studies (Sherron, 1988). By conducting such an analysis, researchers put themselves in the position to select a sample size that is large enough to lead to a rejection of the null hypothesis for a given effect size. Alternatively stated, a priori power analyses help researchers to obtain the necessary sample sizes to reach a decision with adequate power. Indeed, the optimum time to conduct a power analysis is during the research design phase (Wooley & Dawson, 1983).

Failing to consider statistical power can have dire consequences for researchers. First and foremost, low statistical power reduces the probability of rejecting the null hypothesis and therefore increases the probability of committing a Type II error (Bakan, 1966; Cohen, 1988).
It may also increase the probability of committing a Type I error (Overall, 1969), yield misleading results in power studies (Chase & Tucker, 1976), and prevent potentially important studies from being published as a result of publication bias (Greenwald, 1975), which has been called the file-drawer problem, representing the tendency to keep statistically nonsignificant results in file drawers (Rosenthal, 1979).

It has been exactly 40 years since Jacob Cohen (1962) conducted the first survey of power. In this seminal work, Cohen assessed the power of studies published in the abnormal-social psychology literature. Using the reported sample size and a nondirectional significance level of 5%, Cohen calculated the average power to detect a hypothesized effect (i.e., hypothesized power) across the 70 selected studies for nine frequently used statistical tests using small, medium, and large estimated effect size values. The average power of the 2,088 major statistical tests was .18, .48, and .83 for detecting a small, medium, and large effect size, respectively. The average hypothesized statistical power of .48 for medium effects indicated that studies in the abnormal psychology field had, on average, less than a 50% chance of correctly rejecting the null hypothesis (Brewer, 1972; Halpin & Easterday, 1999).

During the three decades after Cohen's (1962) investigation, several researchers conducted hypothetical power surveys across a myriad of disciplines including the following: applied and abnormal psychology (Chase & Chase, 1976), educational research (Brewer, 1972), educational measurement (Brewer & Owen, 1973), communication (Chase & Tucker, 1975; Katzer & Sodt, 1973), communication disorders (Kroll & Chase, 1975), mass communication (Chase & Baran, 1976), counselor education (Haase, 1974), social work education (Orme & Tolman, 1986), science education (Penick & Brewer, 1972; Wooley & Dawson, 1983), English education (Daly & Hexamer, 1983), gerontology (Levenson, 1980), marketing research (Sawyer & Ball, 1981), and mathematics education (Halpin & Easterday, 1999). The average hypothetical power of these 15 studies was .24, .63, and .85 for small, medium, and large effects, respectively. Assuming that a medium effect size is appropriate for use in most studies because of its combination of being practically meaningful and realistic (Cohen, 1965; Cooper & Findley, 1982; Haase, Waechter, & Solomon, 1982), the average power of .63 across these studies is disturbing. Similarly disturbing is the average hypothesized power of .64 for a medium effect reported by Rossi (1990) across 25 power surveys involving more than 1,500 journal articles and 40,000 statistical tests. An even more alarming picture is painted by Schmidt and Hunter (1997), who reported that "the average [hypothesized] power of null hypothesis significance tests in typical studies and research literature is in the .40 to .60 range (Cohen, 1962, 1965, 1988, 1992; Schmidt, 1996; Schmidt, Hunter, & Urry, 1976; Sedlmeier & Gigerenzer, 1989) [with] .50 as a rough average" (p. 40). Unfortunately, an average hypothetical power of .5 indicates that more than one half of all statistical tests in the social and behavioral science literature will be statistically nonsignificant. As noted by Schmidt and Hunter (1997), "This level of accuracy is so low that it could be achieved just by flipping a (unbiased) coin!" (p. 40). Yet, the fact that power is unacceptably low in most studies suggests that misuse of NHST is to blame, not the logic of NHST.
Moreover, the publication bias that prevails in research suggests that the hypothetical power estimates provided previously likely represent an upper bound. Thus, as declared by Rossi (1997), "it is possible that at least some controversies in the social and behavioral sciences may be artifactual in nature" (p. 178). Indeed, it can be argued that low statistical power represents more of a research design issue than a statistical issue because it can be rectified by using a larger sample.

Bearing in mind the importance of conducting statistical power analyses, it is extremely surprising that very few researchers conduct and report power analyses for their studies (Brewer, 1972; Cohen, 1962, 1965, 1988, 1992; Keselman et al., 1998; Onwuegbuzie, 2002; Sherron, 1988) even though statistical power has been promoted actively since the 1960s (Cohen, 1962, 1965, 1969) and even though for many types of statistical analyses (e.g., r, z, F, χ²), tables have been provided by Cohen (1988, 1992) to determine the necessary sample size. Even when a priori power has been calculated, it is rarely reported (Wooley & Dawson, 1983). This lack of power analyses still prevails despite the recommendations of the APA (2001) to take power seriously and to "provide evidence that your study has sufficient power to detect effects of substantive interest" (p. 24).
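
As a rough illustration of the a priori analysis described earlier (sample size as a function of level of significance, effect size, and power), the sketch below inverts a normal approximation for a two-sided comparison of two group means. The function name `n_per_group` and the approximation are assumptions made for this example. Cohen's (1988) tables are based on exact distributions and may give slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample test of means.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2, a normal approximation.
    """
    if d == 0:
        raise ValueError("effect size must be nonzero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / abs(d)) ** 2)


# Conventional inputs: power = .80, alpha = .05, and standardized mean
# differences of 0.2, 0.5, and 0.8 (Cohen's small, medium, and large).
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

The required sample size grows quickly as the hypothesized effect shrinks, which is one reason the choice of effect size deserves an explicit rationale.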


The lack of use of power analysis might be the result of one or more of the following factors. First and foremost, evidence exists that statistical power is not sufficiently understood by researchers (Cohen, 1988, 1992). Second, it appears that the concept and applications of power are not taught in many undergraduate- and graduate-level statistical courses. Moreover, when power is taught, it is likely that inadequate coverage is given. Disturbingly, Mundfrom, Shaw, Thomas, Young, and Moore (1998) reported that the issue of statistical power is regarded by instructors of research methodology, statistics, and measurement as being only the 34th most important topic in their fields out of the 39 topics presented. Also in the Mundfrom et al. study, power received the same low ranking with respect to coverage in the instructors' classes. Clearly, if power is not being given a high status in quantitative-based research courses, then students similarly will not take it seriously. In any case, these students will not be suitably equipped to conduct such analyses. Another reason for the spasmodic use of statistical power possibly stems from the incongruency between endorsement and practice. For instance, although APA (2001) stipulated that power analyses be conducted, and although it provides several NHST examples, the manual does not provide any examples of how to report statistical power (Fidler, 2002). Harris (1997) also provided an additional rationale for the lack of power analyses:
I suspect that this low rate of use of power analysis is largely due to the lack of proportionality between the effort required to learn and execute power analyses (e.g., dealing with noncentral distributions or learning the appropriate effect-size measure with which to enter the power tables in a given chapter of Cohen, 1977) and the low payoff from such an analysis (e.g., the high probability that resource constraints will force you to settle for a lower N than your power analysis says you should have), especially given the uncertainties involved in a priori estimates of effect sizes and standard deviations, which render the resulting power calculation rather suspect. If calculation of the sample size needed for adequate power and for choosing between alternative interpretations of a nonsignificant result could be made more nearly equal in difficulty to the effort we've grown accustomed to putting into significance testing itself, more of us might in fact carry out these preliminary and supplementary analyses. (p. 165)

A further reason why a priori power analyses are not conducted likely stems from the fact that the most commonly used statistical packages, such as the Statistical Package for the Social Sciences (SPSS; SPSS Inc., 2001) and the Statistical Analysis System (SAS Institute Inc., 2002), do not allow researchers to conduct power analyses directly. Furthermore, the statistical software programs that conduct power analyses (e.g., Erdfelder, Faul, & Buchner, 1996; Morse, 2001), although extremely useful, typically do not conduct other types of analyses, and thus, researchers are forced to use at least two types of statistical software to conduct quantitative research studies, which is both inconvenient and possibly expensive. Even when researchers have power software in their possession, the lack of information regarding components needed to calculate power (e.g., effect size, variance) serves as an additional impediment to a priori power analyses.

It is likely that the lack of power analyses coupled with a publication bias promulgates the publishing of findings that are statistically significant but have small effect sizes (possibly increasing Type B error) as well as leading researchers to eliminate valuable hypotheses (Halpin & Easterday, 1999). Thus, we recommend in the strongest possible manner that all quantitative researchers conduct a priori power analyses whenever possible. These analyses should be reported in the Method section of research reports. This report also should include a rationale for criteria used for all input variables (i.e., power, significance level, effect size; APA, 2001; Cohen, 1973, 1988). Inclusion of such analyses will help researchers to make optimum choices on the components (e.g., sample size, number of variables studied) needed to design a trustworthy study.

POST HOC POWER ANALYSES

Whether or not an a priori power analysis is undertaken and reported, problems can still arise. One problem that commonly occurs in educational research is when the study is completed and a statistically nonsignificant result is found. In many cases, the researcher then disregards the study (i.e., file-drawer problem) or, when he or she submits the final report to a journal for review, finds it is rejected (i.e., publication bias). Unfortunately, most researchers do not determine whether the statistically nonsignificant result is the result of insufficient statistical power. That is, without knowing the power of the statistical test, it is not possible to rule in or rule out low statistical power as a threat to internal validity (Onwuegbuzie, 2003). Nor can an a priori power analysis necessarily rule in/out this threat.
This is because a priori power analyses involve the use of a priori estimates of effect sizes and standard deviations (Harris, 1997). As such, a priori power analyses do not represent the power to detect the observed effect of the ensuing study; rather, they represent the power to detect hypothesized effects. Before the study is conducted, researchers do not know what the observed effect size will be. All they can do is try to estimate it based on previous research and theory (Wilkinson & the Task Force on Statistical Inference, 1999). The observed effect size could end up being much smaller or much larger than the hypothesized effect size on which the power analysis is undertaken. (Indeed, this is a criticism of the power surveys highlighted previously; Mulaik, Raju, & Harshman, 1997.) In particular, if the observed effect size is smaller than what is proposed, the sample size yielded by the a priori power analysis might be smaller than is needed to detect it. In other words, a smaller effect size than anticipated increases the chances of Type II error.

On the other hand, the effect of power on a statistically nonsignificant finding can be assessed more appropriately by using the observed (true) effect to investigate the performance of an NHST (Mulaik et al., 1997; Schmidt, 1996; Sherron, 1988). Such a technique leads to what is often called a post hoc power analysis. Interestingly, several authors have recommended the use of post hoc power analyses for statistically nonsignificant findings (Cohen, 1969; Dayton, Schafer, & Rogers, 1973; Fagley, 1985; Fagley & McKinney, 1983; Sawyer & Ball, 1981; Wooley & Dawson, 1983).

When post hoc power should be reported has been the subject of debate. Although some researchers advocate that post hoc power always be reported (e.g., Wooley & Dawson, 1983), the majority of researchers advocate reporting post hoc power only for statistically nonsignificant results (Cohen, 1965; Fagley, 1985; Fagley & McKinney, 1983; Sawyer & Ball, 1981). However, both sets of analysts have agreed that estimating the power of significance tests that yield statistically nonsignificant findings plays an important role in their interpretations (e.g., Fagley, 1985; Fagley & McKinney, 1983; Sawyer & Ball, 1981; Tversky & Kahneman, 1971). Specifically, statistically nonsignificant results in a study with low power suggest ambiguity. Conversely, statistically nonsignificant results in a study with high power contribute to the body of knowledge because power can be ruled out as a threat to internal validity (e.g., Fagley, 1985; Fagley & McKinney, 1983; Sawyer & Ball, 1981; Tversky & Kahneman, 1971). To this end, statistically nonsignificant results can make a greater contribution to the research community than they presently do. As noted by Fagley (1985), "Just as rejecting the null does not guarantee large and meaningful effects, accepting the null does not preclude interpretable results" (p. 392).

Conveniently, post hoc power analyses can be conducted relatively easily because some of the major statistical software programs compute post hoc power estimates. In fact, post hoc power coefficients are available in SPSS for the general linear model (GLM).
Post hoc power (or observed power, as it is called in the SPSS output) is based on taking the observed effect size as the assumed population effect, which produces a positively biased but consistent estimate of the effect (D. Nichols,³ personal communication, November 4, 2002). For example, the post hoc power procedure for analyses of variance (ANOVAs) and multiple ANOVAs is contained within the options button.⁴ It should be noted that due to sampling error, the observed effect size used to compute the post hoc power estimate might be very different from the true (population) effect size, culminating in a misleading evaluation of power.

³David Nichols is a Principal Support Statistician and Manager of Statistical Support, SPSS Technical Support.
⁴Details of the post hoc power computations in the statistical algorithms for the GLM and MANOVA procedures can be obtained from the following Web site: http://www.spss.com/tech/stat/Algorithms.htm.
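
To see how sensitive such an estimate is to the observed effect size, the sketch below computes post hoc power for a test of a Pearson correlation by treating the observed r as the population value. It relies on the Fisher z normal approximation, and the function name `observed_power_r` and the numbers are hypothetical. This is not the algorithm SPSS uses for its observed power output.

```python
from math import atanh, sqrt
from statistics import NormalDist


def observed_power_r(r_obs, n, alpha=0.05):
    """Approximate post hoc power for a two-sided test that rho = 0.

    The observed correlation r_obs is treated as the population effect,
    using the Fisher z approximation (standard error 1 / sqrt(n - 3)).
    """
    if n <= 3 or not -1 < r_obs < 1:
        raise ValueError("need n > 3 and -1 < r_obs < 1")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = atanh(r_obs) * sqrt(n - 3)  # expected value of the z statistic
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical study with n = 50: nearby values of r, any of which could
# arise through sampling error, lead to quite different power estimates.
for r in (0.10, 0.20, 0.30):
    print(f"observed r = {r:.2f}: post hoc power of about {observed_power_r(r, 50):.2f}")
```

Because the estimate moves so much across nearby values of r, a post hoc power coefficient is best read alongside the uncertainty in the observed effect size rather than as a fixed property of the study.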


PROCEDURE FOR CONDUCTING POST HOC POWER ANALYSES

Although post hoc power analysis procedures exist for multivariate tests of statistical significance such as multivariate analysis of variance and multivariate analysis of covariance (cf. Cohen, 1988), these methods are sufficiently complex to warrant use of statistical software. Thus, for this article, we restrict our attention to post hoc power analysis procedures for univariate tests of statistical significance. What follows is a description of how to undertake a post hoc power analysis by hand for univariate tests, that is, for situations in which there is one quantitative dependent variable and one or more independent variables. Here, we use Cohen's (1988) framework for conducting post hoc power for multiple regression and correlation analysis. Furthermore, these steps can also be used for a priori power analyses by substituting in a hypothesized effect size value instead of an observed effect size value.

As noted by Cohen (1988), the advantage of this framework is that it can be used for any type of relationship between the independent variable(s) and dependent variable (e.g., linear or curvilinear, whole or partial, general or partial). Furthermore, the independent variables can be quantitative or qualitative (i.e., nominal scale), direct variates or covariates, and main effects or interactions. Also, the independent variables can represent ex post facto research or retrospective research and can be naturally occurring or the results of experimental manipulation. Most important, however, because multiple regression subsumes all other univariate statistical techniques as special cases (Cohen, 1968), this post hoc power analysis framework can be used for other members of the univariate GLM, including correlation analysis, t tests, ANOVA, and analysis of covariance.
The F test can be used for all univariate members of the GLM to test the null hypothesis that the proportion of variance in the dependent variable that is accounted for by one or more independent variables (PVi) is zero in the population. This F test can be expressed as
$$F = \frac{PV_i / df_i}{PV_e / df_e}, \tag{1}$$

where PVi is the proportion of the variance in the dependent variable explained by the set of independent variables, PVe is the proportion of error variance, dfi is the degrees of freedom for the numerator (i.e., the number of independent variables in the set), and dfe is degrees of freedom for the denominator (i.e., degrees of freedom for the error variance). Equation 1 can be rewritten as

$$F = \frac{PV_i}{PV_e} \cdot \frac{df_e}{df_i}. \tag{2}$$
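As a quick numerical check that Equations 1 and 2 are the same quantity written two ways, the following minimal sketch computes F both ways. The function names and the input values (20% of variance explained by two predictors, with 40 error degrees of freedom) are hypothetical and chosen only for illustration.

```python
def f_from_proportions(pv_i, pv_e, df_i, df_e):
    """Equation 1: each variance proportion divided by its degrees of freedom."""
    return (pv_i / df_i) / (pv_e / df_e)


def f_from_effect_and_study_size(pv_i, pv_e, df_i, df_e):
    """Equation 2: effect size (PVi / PVe) times study size (dfe / dfi)."""
    return (pv_i / pv_e) * (df_e / df_i)


# Hypothetical inputs, not taken from the article.
pv_i, df_i, df_e = 0.20, 2, 40
pv_e = 1.0 - pv_i  # proportion of error variance

print(f_from_proportions(pv_i, pv_e, df_i, df_e))
print(f_from_effect_and_study_size(pv_i, pv_e, df_i, df_e))
```

Both calls should print the same F value (up to floating-point rounding), which illustrates that the F statistic is the product of an effect size term and a study size term.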


The left-hand term in Equation 2, which expresses the proportion of variance in the dependent variable that is accounted for by the set of independent variables relative to the proportion of error variance, is a measure of the effect size (ES) in the sample. The right-hand term in Equation 2, namely, the study size or sensitivity of the statistical test, provides information about the sample size and number of independent variables. Thus, degree of statistical significance is a function of the product of the effect size and the study size (Cohen, 1988).

Because most theoretically selected independent variables have a nonzero relationship (i.e., nonzero effect size) with dependent variables, the distribution of F values in any particular investigation likely takes the form of a noncentral F distribution (Murphy & Myors, 1998). Thus, the (post hoc) power of a statistical significance test can be defined as the proportion of the noncentral distribution that exceeds the critical value used to define statistical significance (Murphy & Myors, 1998, p. 24). The shape of the noncentral F distribution as well as its range are a function of both the noncentrality parameter λ and the respective degrees of freedom (i.e., dfi and dfe). The noncentrality parameter, which indexes the extent to which the null hypothesis is false, is given by

$$\lambda = ES\,(df_i + df_e + 1), \tag{3}$$

where ES, the effect size, is given by

$$ES = \frac{PV_i}{PV_e}, \tag{4}$$

and dfi and dfe are the numerator and denominator degrees of freedom, respectively. Once the noncentrality parameter, λ, has been calculated, Table 2 or Table 3, depending on the level of α, can be used to determine the post hoc power of the statistical test. Specifically, these tables are entered for values of α, λ, dfi, and dfe.

Step 1: Statistical Significance Criterion (α)
Table 2 represents α = .05, and Table 3 represents α = .01. (Interpolation and extrapolation techniques can be used for values between .01 and .05 and outside these limits, respectively.)

Step 2: The Noncentrality Parameter (λ)
In Tables 2 and 3, power values are presented for 15 λ values. Because λ is a continuous variable, linear interpolation typically is needed. According to Cohen (1988),

TABLE 2
Power of the F Test as a Function of λ, dfi, and dfe

Row layout: λ, then power values (decimal points omitted) in consecutive groups of four for dfi = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; within each group, dfe = 20, 60, 120, ∞.

2 27 29 29 29 20 22 22 23 17 19 19 19 15 17 17 17 13 15 16 16 12 14 14 15 11 17 13 14 10 12 12 13 10 11 11 13 09 10 11 12

4 48 50 51 52 36 40 41 42 30 34 35 36 26 30 31 32 23 27 29 29 21 25 27 27 19 24 25 25 18 23 24 24 17 21 22 23 16 20 21 21

6 64 67 68 69 52 56 57 58 44 49 50 52 38 44 46 47 34 40 41 43 30 37 39 40 28 35 37 38 26 33 35 36 24 31 33 34 23 30 31 32

8 77 79 80 81 65 69 71 72 56 62 64 65 49 57 58 60 44 52 54 56 40 48 50 53 37 45 47 50 34 43 45 48 32 41 44 45 30 39 41 43

10 85 88 88 89 75 79 80 82 67 73 75 76 60 68 70 72 54 63 65 68 50 59 62 64 46 56 59 61 42 52 55 59 39 50 53 56 37 48 51 54

12 91 92 93 93 83 87 87 88 75 81 83 84 69 77 78 80 63 72 75 77 59 68 71 74 54 65 68 71 50 62 65 68 47 58 62 66 44 56 60 64

14 95 96 96 96 88 91 92 93 82 87 89 90 76 83 85 87 71 80 82 84 66 76 79 81 62 73 76 79 58 70 73 77 54 67 71 74 51 65 69 72

16 97 98 98 98 92 95 95 96 87 92 93 93 83 89 90 91 78 86 87 89 73 83 85 87 69 80 82 85 65 77 80 83 61 74 78 81 58 72 75 79

18 98 99 99 99 95 97 97 97 91 95 95 96 87 92 93 94 83 90 91 93 79 87 89 91 75 85 87 89 71 83 85 88 68 80 83 86 64 78 81 85

20 99 99 99 99 97 98 98 99 94 97 97 98 91 95 96 96 87 93 94 95 84 91 93 94 80 89 91 93 76 87 89 92 73 85 88 90 70 83 86 89

24 * * * * 99 * * * 97 98 99 99 95 98 98 99 93 97 98 98 91 96 97 97 88 94 96 97 85 93 95 96 82 92 94 95 79 90 93 94

28

32

36

40

10

99 * * * 98 99 99 * 96 99 99 99 95 98 99 99 93 97 98 99 91 97 98 99 88 96 97 98 86 95 96 98

99 * * 98 * * * 97 99 99 * 96 99 99 99 94 98 99 99 93 98 99 99 91 97 98 99

99

99 * *

98 99 99 * * * 97 98 99 * * * 96 97 99 * * * 94 96 99 99 99 * * (continued)

TABLE 2 (Continued)

Row layout: λ, then power values (decimal points omitted) in consecutive groups of four for dfi = 11, 12, 13, 14, 15, 18, 20, 24, 30, 40, 48; within each group, dfe = 20, 60, 120, ∞.

2 09 11 11 12 08 09 10 11 09 10 10 11 09 10 10 11 08 09 10 10 08 09 09 10 08 08 09 09 07 08 08 09 07 08 08 08 07 07 07 08 07 07 07 08
4 15 18 19 21 14 18 20 20 13 16 18 19 13 16 17 19 12 15 16 18 11 14 15 17 11 13 14 16 10 12 13 15 09 11 12 13 08 10 10 12 08 09 10 11
6 21 27 29 31 20 28 30 30 19 24 26 29 18 23 25 28 17 22 24 27 15 20 22 25 14 19 21 24 13 17 19 21 11 15 16 19 10 12 14 17 09 11 13 15
8 27 36 39 42 27 36 40 40 24 33 36 40 23 31 34 38 22 30 33 37 19 27 30 34 18 25 28 32 16 22 25 29 14 19 22 26 11 16 18 22 10 14 16 21
10 34 45 49 52 33 44 48 50 30 41 45 50 29 40 43 48 27 38 42 47 24 34 38 43 22 31 36 41 19 28 32 37 16 24 28 33 13 19 23 29 12 17 20 26
12 41 54 58 62 39 52 56 60 37 50 54 59 35 48 52 58 33 46 51 56 29 41 46 52 26 38 43 50 23 34 39 46 19 29 34 41 15 23 28 35 14 20 24 32
14 48 62 67 70 45 59 63 69 43 58 62 67 41 56 61 65 39 54 59 64 34 48 54 60 31 45 51 58 27 40 46 54 22 34 40 48 18 27 33 42 15 24 29 38
16 55 70 74 78 52 67 72 76 49 65 70 75 47 63 68 73 44 61 66 72 39 55 61 68 36 52 58 65 31 46 53 61 25 40 46 56 20 32 38 49 17 28 34 45
18 61 76 80 83 58 73 78 82 55 71 76 81 52 69 74 79 50 67 73 78 44 62 68 76 40 58 65 72 35 52 60 68 29 45 53 62 22 36 44 55 19 31 39 51
20 67 81 85 89 64 79 83 87 61 77 81 85 58 75 80 84 55 73 78 83 49 68 74 80 45 64 71 78 39 58 66 74 32 51 59 69 25 41 49 61 21 35 44 57
24 76 89 91 94 74 87 90 93 71 86 89 92 68 84 88 91 65 73 78 83 58 78 83 88 54 75 81 87 47 69 76 83 39 61 69 79 30 50 60 72 25 44 54 68
28 84 94 96 97 81 93 95 97 79 92 94 96 76 90 93 96 74 89 92 95 67 85 90 93 63 83 88 92 55 78 84 90 45 70 78 87 35 59 69 81 30 52 63 74
32 89 97 98 99 87 96 97 98 85 95 97 98 83 94 97 98 81 94 96 97 74 91 94 97 70 89 93 96 62 84 90 94 53 78 85 92 41 67 77 87 35 59 71 84
36 93 98 99 99 91 98 99 99 90 97 98 99 88 97 98 99 86 96 98 99 80 94 97 98 77 93 96 98 69 89 94 97 59 84 90 95 47 74 83 92 39 67 78 89
40 96 99 * * 94 99 99 * 93 99 99 * 92 98 99 * 90 98 99 99 85 97 98 99 82 96 98 99 75 93 96 98 65 89 94 97 52 80 88 95 44 73 83 93
(continued)

TABLE 2 (Continued)

Row layout: λ, then power values (decimal points omitted) in consecutive groups of four for dfi = 60, 120; within each group, dfe = 20, 60, 120, ∞.

2 07 07 07 07 06 06 06 06
4 08 08 09 10 07 07 07 08
6 09 10 11 14 08 08 08 11
8 10 12 14 18 08 09 10 13
10 12 15 17 23 09 10 11 16
12 13 17 21 28 10 11 13 19
14 14 20 25 34 10 12 15 23
16 16 23 28 39 11 13 17 24
18 18 26 33 45 12 15 19 31
20 19 29 37 01 12 16 21 35
24 23 36 46 62 14 19 26 43
28 26 43 54 71 15 23 31 52
32 30 50 62 79 17 26 36 60
36 34 57 70 85 19 30 41 68
40 38 63 76 70 20 34 47 74

Note. α = .05. From Statistical Power Analysis for the Behavioral Sciences (2nd ed., pp. 420–423), by J. Cohen, 1988, Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. Reprinted with permission.
*Power values here and to the right are > .995.

TABLE 3
Power of the F Test as a Function of λ, dfi, and dfe

Row layout: λ, then power values (decimal points omitted) in consecutive groups of four for dfi = 1, 2, 3, 4, 5, 6; within each group, dfe = 20, 60, 120, ∞.

2 10 12 12 12 06 08 08 08 05 06 06 07 04 05 05 06 03 04 04 05 03 04 04 05

4 23 26 27 28 15 18 19 20 11 14 15 16 09 12 13 14 07 10 11 12 06 09 10 11

6 37 42 44 45 26 30 33 35 20 25 27 29 16 21 23 25 13 18 20 22 11 16 17 19

8 51 57 58 60 37 45 46 49 29 37 39 42 23 31 34 37 20 28 30 33 17 24 27 30

10 63 69 71 72 48 57 59 61 39 49 51 54 32 42 45 49 27 37 41 44 23 34 37 41

12 73 79 80 81 58 68 70 72 48 60 62 65 41 53 56 60 35 48 51 55 30 43 47 51

14 80 86 87 88 67 76 78 80 57 69 72 74 49 62 66 69 43 57 61 65 37 52 57 61

16 86 91 92 92 75 83 85 87 65 77 79 82 57 71 74 77 50 66 70 74 45 61 66 70

18 90 94 95 95 81 88 90 91 72 83 85 87 64 78 81 84 58 73 77 80 52 69 73 77

20 94 96 97 97 86 92 93 94 78 88 90 91 71 83 86 89 94 79 83 86 58 75 79 83

24 97 99 99 99 93 97 97 98 87 94 95 96 81 91 93 95 76 88 91 93 70 85 89 91

28 99 * * * 96 99 99 99 93 97 98 98 89 96 97 98 84 94 95 97 79 92 94 96

32 *

36

40

98 99 * * 96 99 99 * 93 98 98 99 90 97 98 99 86 96 97 98

99 *

98 * * 96 99 * * 93 * 98 99 99 91 * 98 99 99 (continued)

TABLE 3 (Continued)

Row layout: λ, then power values (decimal points omitted) in consecutive groups of four for dfi = 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 20; within each group, dfe = 20, 60, 120, ∞.

2 03 04 04 04 02 03 03 04 02 03 03 04 02 02 02 03 02 03 03 03 02 02 02 03 02 02 02 03 02 02 02 03 02 02 02 02 02 02 02 02 02 02 02 02

4 05 08 09 10 05 07 08 09 04 06 07 08 04 06 07 08 04 05 06 07 03 05 06 07 04 05 05 06 04 05 05 06 03 04 05 06 03 04 04 05 03 04 04 05

6 09 14 16 18 08 13 14 16 07 11 13 15 06 10 12 14 07 10 11 13 05 09 11 12 06 08 10 11 05 08 09 11 05 07 09 10 04 06 07 09 04 06 07 08

8 15 22 24 27 13 20 22 25 11 18 20 23 10 16 19 22 10 15 17 20 08 14 17 19 08 13 15 18 08 12 14 17 07 11 13 16 06 10 11 14 05 09 10 13

10 20 30 34 37 18 27 31 35 16 25 29 33 14 23 27 31 13 21 25 29 12 20 24 27 11 18 22 26 10 17 21 25 09 16 19 24 08 14 17 21 07 12 15 19

12 26 39 43 48 23 36 40 45 21 33 37 42 19 30 35 40 17 29 33 38 15 26 31 36 14 25 29 35 13 23 28 33 12 22 26 32 10 18 22 28 09 16 20 26

14 33 48 53 58 29 44 49 55 26 41 46 52 24 38 44 49 22 36 41 47 20 33 38 45 18 32 37 44 17 30 35 42 15 28 33 40 13 23 29 36 11 21 26 33

16 40 57 62 67 36 53 58 64 32 49 55 61 29 46 52 58 26 44 50 56 24 41 47 54 22 39 45 52 20 36 43 50 19 34 41 49 15 29 36 44 14 26 32 42

18 46 64 69 74 42 61 66 72 38 57 63 69 34 54 60 66 31 51 58 64 29 48 55 62 26 46 53 60 24 43 50 59 23 41 48 57 18 35 43 52 16 32 39 49

20 53 71 76 81 48 67 73 78 44 64 70 76 40 61 67 74 37 58 65 71 34 55 62 69 31 52 60 68 29 50 58 66 27 47 55 64 22 41 49 59 19 37 46 56

24 65 82 86 90 60 79 84 88 55 76 81 86 51 73 79 84 47 71 77 83 44 68 75 81 40 65 72 80 38 62 70 78 35 60 68 77 29 53 62 72 25 49 59 69

28 74 90 93 95 70 87 91 94 65 85 89 93 61 83 87 91 57 80 86 90 53 78 84 89 50 76 82 88 47 73 80 87 44 71 78 86 36 64 73 82 32 60 70 80

32 82 94 96 98 78 93 95 97 74 91 94 96 70 89 93 96 66 87 92 95 62 85 90 94 59 84 89 93 55 82 88 92 52 80 86 92 44 74 82 89 39 70 79 87

36

40

10

11

12

13

14

15

18

20

88 99 97 * 96 99 84 98 96 * 98 99 81 86 95 97 97 98 98 99 77 83 94 96 96 98 98 99 74 80 92 95 95 97 97 99 70 77 91 94 95 97 97 99 67 74 89 93 93 96 96 98 63 70 88 92 92 96 96 98 60 67 86 91 90 95 95 98 51 59 81 87 88 93 94 96 46 53 78 84 86 91 92 96 (continued)

TABLE 3 (Continued)

Row layout: λ, then power values (decimal points omitted) in consecutive groups of four for dfi = 24, 30, 40, 48, 60, 120; within each group, dfe = 20, 60, 120, ∞.

2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 01 01 01 01 01 01 01 01

4 03 03 03 04 02 03 03 04 02 02 03 03 02 02 02 03 02 02 02 02 02 02 02 02

6 04 05 06 07 03 04 05 06 03 03 04 05 02 03 03 05 02 03 03 04 02 02 02 03

8 05 07 09 12 04 06 07 10 03 05 06 08 03 04 04 07 03 03 04 06 02 02 02 04

10 06 10 13 17 05 08 10 14 04 06 08 11 03 05 06 10 03 04 05 08 02 02 03 05

12 07 14 17 23 06 11 14 19 05 08 10 15 04 06 08 13 03 05 07 11 02 03 04 06

14 09 17 22 29 07 14 18 25 05 10 13 20 04 08 11 17 04 06 08 14 02 03 04 08

16 11 22 28 37 08 17 22 31 06 12 17 25 05 10 14 21 04 08 11 18 03 04 05 10

18 13 26 34 44 10 21 27 37 07 15 20 31 06 12 17 26 05 09 13 22 03 04 06 12

20 15 31 40 51 12 25 33 44 08 18 24 37 07 14 20 32 05 11 15 27 03 05 07 15

24 20 42 52 64 15 34 44 57 11 24 33 49 08 19 28 43 07 14 21 37 03 06 09 20

28 26 52 63 75 19 43 55 69 13 32 41 60 11 25 36 54 08 19 28 47 04 08 12 27

32 32 62 73 84 24 52 65 79 16 39 52 71 13 32 45 65 10 24 35 58 04 09 15 35

36 38 71 81 90 29 61 73 86 20 47 61 79 15 39 53 74 11 29 43 67 05 11 18 43

40 44 78 87 94 34 69 80 91 23 55 70 86 18 45 62 81 13 35 51 75 06 13 22 51

30

40

48

60

120

Note. α = .01. From Statistical Power Analysis for the Behavioral Sciences (2nd ed., pp. 420–423), by J. Cohen, 1988, Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. Reprinted with permission.
*Power values here and to the right are > .995.

for interpolation when λ < 2, it should be noted that at λ = 0, for all values of dfi, power = α.

Step 3: Numerator Degrees of Freedom (dfi)
Both Table 2 and Table 3 present entries for 23 values of dfi. Again, linear interpolation can be used for other values.

Step 4: Denominator Degrees of Freedom (dfe)
For each value of dfi, power coefficients for the following four values of dfe are presented: 20, 60, 120, and ∞. Interpolation between dfe values should be linear with respect to the reciprocals of dfe, that is, 1/20, 1/60, 1/120, and 0, respectively. In particular, when the power coefficient is needed for a given λ and a given dfe, where dfe falls between Ldfe and Udfe, the lower and upper values presented in Table 2 and Table 3, the power values PowerL and PowerU are obtained for λ at Ldfe and Udfe by linear interpolation. Then, to compute power for the given dfe, the following formula is used:

$$\text{Power} = \text{Power}_L + \frac{1/L_{df_e} - 1/df_e}{1/L_{df_e} - 1/U_{df_e}}\,(\text{Power}_U - \text{Power}_L). \tag{5}$$
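The interpolation in Equation 5 can be sketched in a few lines of Python. This is only an illustration: the function name and argument names are hypothetical, the first pair of table entries used below (.50 and .54) is invented for the example, and an infinite upper dfe is represented by `math.inf`.

```python
import math


def interpolate_power(df_e, lower_df_e, upper_df_e, power_lower, power_upper):
    """Equation 5: interpolate power linearly in the reciprocal of dfe.

    upper_df_e may be math.inf, in which case its reciprocal is taken as 0.
    """
    inv_lower = 1.0 / lower_df_e
    inv_upper = 0.0 if math.isinf(upper_df_e) else 1.0 / upper_df_e
    weight = (inv_lower - 1.0 / df_e) / (inv_lower - inv_upper)
    return power_lower + weight * (power_upper - power_lower)


# Hypothetical tabled values at dfe = 60 and dfe = 120, wanted at dfe = 80.
print(interpolate_power(80, 60, 120, 0.50, 0.54))

# dfe = 199 lies between 120 and infinity; equal bounds give the same value back.
print(interpolate_power(199, 120, math.inf, 0.12, 0.12))
```

Because the weight is computed on reciprocals, dfe = 80 sits exactly halfway between 60 and 120 on this scale, even though it is not halfway on the raw dfe scale.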

For Udfe = ∞, 1/Udfe = 0, such that for dfe > 120, the denominator of Equation 5 is 1/120 − 0 = .00833 (Cohen, 1988).

Virtually all univariate tests of statistical significance fall into one of the following three cases: Case A, Case B, or Case C. Case A, the simplest case, involves testing a set (A) of one or more independent variables. As before, dfi is the number of independent variables contained in the set, and Equation 3 is used, with the effect size being represented by

$$\frac{R^2_{Y \cdot A}}{1 - R^2_{Y \cdot A}}, \tag{6}$$

where the numerator represents the proportion of variance explained by the variables in Set A and the denominator represents the error variance proportion.

With respect to Case B, the proportion of variance in the dependent variable accounted for by Set A, over and above what is explained by another set B, is ascertained. This is represented by $R^2_{Y \cdot A,B} - R^2_{Y \cdot B}$. The error variance proportion is then $1 - R^2_{Y \cdot A,B}$. Equation 3 can then be used, with the effect size being represented by

$$\frac{R^2_{Y \cdot A,B} - R^2_{Y \cdot B}}{1 - R^2_{Y \cdot A,B}}. \tag{7}$$

The error degrees of freedom for Case B is given by

$$df_e = N - df_i - df_b - 1, \tag{8}$$

where dfi is the number of variables contained in Set A, and dfb is the number of variables in Set B.

Case C is similar to Case B, inasmuch as the test of interest is the unique contribution of Set A in accounting for the variance in the dependent variable. However, there are other variables (Set C) that form part of the error variation, yielding $R^2_{Y \cdot A,B,C}$, so that the error variance proportion is $1 - R^2_{Y \cdot A,B,C}$. Therefore, Equation 3 can then be used, with the effect size being represented by

$$\frac{R^2_{Y \cdot A,B} - R^2_{Y \cdot B}}{1 - R^2_{Y \cdot A,B,C}}. \tag{9}$$

The error degrees of freedom for Case C is given by

$$df_e = N - df_i - df_b - df_c - 1 \tag{10}$$
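Since Cases A and B are special cases of Case C, the three effect size expressions (Equations 6, 7, and 9) and the error degrees of freedom (Equations 8 and 10) can be covered by one pair of helper functions. The sketch below is illustrative only: the function names, default arguments, and R² values are hypothetical, and the Case A error degrees of freedom (N − dfi − 1) are obtained here by setting dfb = dfc = 0 in Equation 10.

```python
def effect_size(r2_ab, r2_b=0.0, r2_abc=None):
    """Effect size for Case C (Equation 9).

    Case B (Equation 7): leave r2_abc as None so the error term uses r2_ab.
    Case A (Equation 6): additionally leave r2_b at 0, so r2_ab is R^2 for Set A.
    """
    if r2_abc is None:
        r2_abc = r2_ab
    return (r2_ab - r2_b) / (1.0 - r2_abc)


def error_df(n, df_i, df_b=0, df_c=0):
    """Error degrees of freedom (Equation 10; Equation 8 when df_c = 0)."""
    return n - df_i - df_b - df_c - 1


# Hypothetical R^2 values for illustration only.
print(effect_size(0.20))              # Case A
print(effect_size(0.30, 0.10))        # Case B
print(effect_size(0.30, 0.10, 0.50))  # Case C
print(error_df(100, 3, 2, 4))         # N = 100 with 3, 2, and 4 variables in Sets A, B, C
```

The resulting effect size and degrees of freedom are the inputs to the noncentrality parameter in Equation 3.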

Here, dfi is the number of variables contained in Set A, dfb is the number of variables in Set B, and dfc is the number of variables in Set C. As noted by Cohen (1988), Case C represents the most general case, with Cases A and B being special cases of Case C.

FRAMEWORK FOR CONDUCTING POST HOC POWER ANALYSES

We agree that post hoc power analyses should accompany statistically nonsignificant findings.⁵ In fact, such analyses can provide useful information for replication studies. In particular, the components of the post hoc power analysis can be used to conduct a priori power analyses in subsequent replication investigations. Figure 1 displays our power-based framework for conducting NHST. Specifically, once the research purpose and hypotheses have been determined, the next step is to use an a priori power analysis to design the study. Once data have been collected, the next step is to test the hypotheses. For each hypothesis, if statistical significance is reached (e.g., at the 5% level), then the researcher should report the effect size and confidence interval around the effect size (e.g., Bird, 2002; Chandler, 1957; Cumming & Finch, 2001; Fleishman, 1980; Steiger & Fouladi, 1992, 1997; Thompson, 2002). Conversely, if statistical significance is not reached, then the researcher should conduct a post hoc power analysis in an attempt to rule in or to rule out inadequate power (e.g., power < .80) as a threat to the internal validity of the finding.

HEURISTIC EXAMPLE

Recently, Onwuegbuzie, Witcher, Filer, Collins, and Moore (2003) conducted a study investigating characteristics associated with teachers' views on discipline.
⁵Moreover, we recommend that upper bounds for post hoc power estimates be computed (Steiger & Fouladi, 1997). This upper bound is estimated via the noncentrality parameter. However, this is beyond the scope of this article. For an example of how to compute upper bounds for post hoc power estimates, the reader is referred to Steiger and Fouladi (1997).


[Figure 1 flowchart. Steps: identify research purpose; determine and state hypotheses; conduct a priori power analysis as part of planning research design; collect data; conduct null hypothesis significance test. Decision: statistical significance achieved? If yes, report effect size and confidence interval around effect size. If no, report and interpret post hoc power coefficient, then ask whether the post hoc power coefficient is low: if yes, lack of power is a rival explanation of the statistically nonsignificant finding; if no, rule out power as a rival explanation of the statistical nonsignificance.]

FIGURE 1 Power-based framework for conducting null hypothesis significance tests.
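The decision logic in Figure 1 can be written as a small function. This is a hedged sketch rather than part of the original framework: the function name, the returned strings, and the default cutoff of .80 for adequate power (taken from the "power < .80" example in the text) are assumptions made for illustration.

```python
def framework_step(p_value, alpha=0.05, post_hoc_power=None, adequate_power=0.80):
    """Return the reporting step suggested by the power-based framework."""
    if p_value < alpha:
        return "report effect size and confidence interval around effect size"
    if post_hoc_power is None:
        return "conduct post hoc power analysis"
    if post_hoc_power < adequate_power:
        return "lack of power is a rival explanation"
    return "rule out power as a rival explanation"


print(framework_step(0.03))
print(framework_step(0.20))
print(framework_step(0.20, post_hoc_power=0.12))
print(framework_step(0.20, post_hoc_power=0.90))
```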

The theoretical framework for this investigation, although not presented here, can be found by examining the original study. Although several independent variables were examined by Onwuegbuzie, Witcher, et al., we restrict our attention to one of them, namely, ethnicity (i.e., Caucasian American vs. minority) and its relationship to discipline styles. Participants were 201 students at a large mid-southern university who were either preservice (77.0%) or in-service (23.0%) teachers. The sample size was selected via an a priori power analysis because it provided acceptable statistical power (i.e., .82) for detecting a moderate difference in means (i.e., Cohen's [1988] d = .5) at the (two-tailed) .05 level of significance, maintaining a family-wise error of 5% (i.e., approximately .01 for each set of statistical tests comprising the three subscales used; Erdfelder et al., 1996). The preservice teachers were selected from several sections of an introductory-level undergraduate education class. On the other hand, the in-service teachers represented graduate students who were enrolled in one of two sections of a research methodology course.

On the 1st week of class, participants were administered the Beliefs on Discipline Inventory (BODI), which was developed by Roy T. Tamashiro and Carl D. Glickman (as cited in Wolfgang & Glickman, 1986). This measure was constructed to assess teachers' beliefs on classroom discipline by indicating the degree to which they are noninterventionists, interventionists, and interactionalists. The BODI contains 12 multiple-choice items, each with two response options. For each item, participants are asked to select the statement with which they most agree. The BODI contains three subscales representing the Noninterventionist, Interventionist, and Interactionalist orientations, with scores on each subscale ranging from 0 to 8. A high score on any of these scales represents a teacher's proclivity toward the particular discipline approach. For our study, the Noninterventionist, Interventionist, and Interactionalist subscales generated scores that had a classical theory alpha reliability coefficient of .72 (95% confidence interval [CI] = .66, .77), .75 (95% CI = .69, .80), and .94 (95% CI = .93, .95), respectively.

A series of independent t tests, using the Bonferroni adjustment to maintain a family-wise error of 5%, revealed no statistically significant difference between Caucasian American (n = 175) and minority participants (n = 26) for scores on the Interventionist, t(199) = 1.47, p > .05; Noninterventionist, t(199) = 0.88, p > .05; and Interactionalist, t(199) = 0.52, p > .05 subscales.
After finding statistical nonsignificance, Onwuegbuzie, Witcher, et al. (2003) could have concluded that there were no ethnic differences in discipline beliefs. However, Onwuegbuzie, Witcher, et al. decided to conduct a post hoc power analysis. The post hoc power analysis for this test of ethnic differences revealed low statistical power. Thus, Onwuegbuzie, Witcher, et al. (2003) concluded the following:

The finding of no ethnic differences in discipline beliefs also is not congruent with Witcher et al. (2001), who reported that minority preservice teachers less often endorsed classroom and behavior management skills as characteristic of effective teachers than did Caucasian-American preservice teachers. Again, the non-significance could have stemmed from the relatively small proportion of minority students (i.e., 12.9%), which induced relatively low statistical power (i.e., 0.59) for detecting a moderate effect size for comparing the two groups with respect to three outcomes (Erdfelder et al., 1996). More specifically, using the actual observed effect sizes pertaining to these three differences, and applying the Bonferroni adjustment, the post-hoc statistical power estimates were .12, .06, and .03 for interventionist, non-interventionist, and interactionalist orientations, respectively, which all represent extremely low statistical power for detecting the small observed effect sizes. Replications are thus needed to determine the reliability of the present findings of no ethnic differences in discipline belief. (p. 19)

Although Onwuegbuzie, Witcher, et al. (2003) computed their power coefficients using a statistical power program (i.e., Erdfelder et al., 1996), these estimates could have been calculated by hand using Equation 3 and Tables 2 and 3. For example, to determine the power of the test of racial differences in levels of interventionist orientation, Table 3 would be used (i.e., α = .01) as a result of the Bonferroni adjustment used (i.e., .05/3). The proportion of variance in interventionist orientation accounted for by race (i.e., effect size) can be obtained by using one of the following three transformation formulae:

$$ES = \frac{t^2}{t^2 + df_e}, \tag{11}$$

$$ES = \frac{df_i(F)}{df_i(F) + df_e}, \tag{12}$$

$$ES = \frac{d^2}{d^2 + 4}, \tag{13}$$

where t² is the square of the t value, F is the F value, d² is the square of Cohen's d statistic, and dfi and dfe are the numerator and denominator degrees of freedom, respectively (Murphy & Myors, 1998). Using the t value of 1.47 reported previously by Onwuegbuzie, Witcher, et al. pertaining to the racial difference in interventionist orientation, the associated numerator degrees of freedom (dfi) of 1 (i.e., one independent variable: race) and the corresponding denominator degrees of freedom (dfe) of 199 (i.e., total sample size − 2) from Equation 11, we obtain

$$ES = \frac{(1.47)^2}{(1.47)^2 + 199} = .0107. \tag{14}$$
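The three transformations in Equations 11 to 13 are simple enough to check numerically. The sketch below uses hypothetical function names. It relies on the standard identity F = t² when dfi = 1, and it treats Equation 13 as applying to two groups of equal size, which is an assumption on our part rather than something stated above.

```python
def es_from_t(t, df_e):
    """Equation 11: proportion of variance from a t value."""
    return t ** 2 / (t ** 2 + df_e)


def es_from_f(f, df_i, df_e):
    """Equation 12: proportion of variance from an F value."""
    return df_i * f / (df_i * f + df_e)


def es_from_d(d):
    """Equation 13: proportion of variance from Cohen's d (equal group sizes assumed)."""
    return d ** 2 / (d ** 2 + 4.0)


print(round(es_from_t(1.47, 199), 4))          # 0.0107, as in Equation 14
print(round(es_from_f(1.47 ** 2, 1, 199), 4))  # same value, because F = t^2 when dfi = 1
print(round(es_from_d(0.5), 4))
```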

Next, we substitute this value of ES and the numerator and denominator degrees of freedom into Equation 3 to yield a noncentrality parameter of λ = .0107(1 + 199 + 1) = 2.15 ≈ 2. From Table 3, we use dfi = 1 and λ = 2. Now, dfe = 199 is between Ldfe = 120 and Udfe = ∞. We could use Equation 5; however, because both the lower and upper power values are .12, our estimated power value will be .12, which represents extremely low statistical power for detecting the small observed effect size of 1.07%. The same procedure can be used to verify the post hoc power values of .06 and .03 for noninterventionist and interactionalist orientations, respectively.
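The hand calculation can be cross-checked without the tables. The sketch below computes λ from Equation 3 and then approximates power with a normal distribution. This is not the tabled noncentral F computation: it assumes dfi = 1 (so the F test is equivalent to a two-sided t test with noncentrality √λ) and a large dfe (so the t statistic is roughly normal). The function names are hypothetical.

```python
from statistics import NormalDist


def noncentrality(es, df_i, df_e):
    """Equation 3: lambda = ES * (dfi + dfe + 1)."""
    return es * (df_i + df_e + 1)


def approx_power_one_df(lam, alpha):
    """Large-dfe normal approximation to the power of a test with dfi = 1."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    delta = lam ** 0.5                     # noncentrality on the t (z) scale
    return (1.0 - z.cdf(z_crit - delta)) + z.cdf(-z_crit - delta)


lam = noncentrality(0.0107, 1, 199)
print(round(lam, 2))  # 2.15
print(round(approx_power_one_df(lam, 0.01), 2))
```

The approximation should land close to the tabled value of .12; a small difference is expected because the table was entered at λ = 2 rather than 2.15 and because the normal approximation ignores the finite dfe.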


If Onwuegbuzie, Witcher, et al. (2003) had stopped after finding statistically nonsignificant results, this would have been yet another instance of the deepening file-drawer problem (Rosenthal, 1979). The post hoc power analysis allowed the statistically nonsignificant finding pertaining to ethnicity to be placed in a more appropriate context. This in turn helped the researchers to understand that their statistically nonsignificant finding was not necessarily due to an absence of a relationship between ethnicity and discipline style in the population but was due at least in part to a lack of statistical power. By conducting a post hoc analysis, Onwuegbuzie, Witcher, et al. (2003) realized that the conclusion that no ethnic differences prevailed in discipline beliefs likely would have resulted in a Type II error being committed. Similarly, if Onwuegbuzie, Witcher, et al. had bypassed conducting a statistical significance test and had merely computed and interpreted the associated effect size, they would have increased the probability of committing a Type A error. Either way, conclusion validity would have been threatened. As noted by Onwuegbuzie and Levin (2003) and Onwuegbuzie, Levin, et al. (2003), the best way to increase conclusion validity is through independent (external) replications of results (i.e., two or more independent studies yielding similar findings that produce statistically and practically compatible outcomes). Indeed, independent replications make invaluable contributions to the cumulative knowledge in a given domain (Robinson & Levin, 1997, p. 25). Thus, post hoc analyses can play a very important role in promoting as well as improving the quality of external replications. The post hoc analysis documented previously, by revealing low statistical power first and foremost, strengthens the rationale for conducting independent replications. 
Moreover, researchers interested in performing independent replications could use the information from Onwuegbuzie, Witcher, et al.'s (2003) post hoc analysis for their subsequent a priori power analyses. Indeed, as part of their post hoc power analysis, Onwuegbuzie, Witcher, et al. could have conducted what we call a "what-if" post hoc power analysis to determine how many more cases would have been needed to increase the power coefficient to a more acceptable level (e.g., .80). As such, what-if post hoc power analyses provide useful information for future researchers. Even novice researchers who are unable or unwilling to conduct a priori power analyses can use the post hoc power values as a warning to employ larger samples (i.e., n > 201) in independent replication studies than the one used in Onwuegbuzie, Witcher, et al.'s (2003) investigation. Moreover, novice researchers can use information from what-if post hoc power analyses to set their minimum group/sample sizes. Although we strongly encourage these researchers to perform a priori power analyses, we believe that what-if post hoc power analyses place the novice researcher in a better position to select appropriate sample sizes, ones that reduce the probability of committing both Type II and Type A errors. Even more information about future sample sizes can be gleaned by constructing confidence intervals around what-if post hoc power estimates using the (95%) upper and lower confidence limits around the observed effect size (cf. Bird, 2002). For instance, for the two-sample t-test context, Hedges and Olkin's (1985) z-based 95% CI⁶ could be used to determine the upper and lower limits that subsequently would be used to derive a confidence interval around the estimated minimum sample size needed to obtain adequate power in an independent replication.

Post hoc power analyses also could be used for meta-analytic purposes. In particular, if researchers routinely were to conduct post hoc power analyses, meta-analysts could determine the average observed power across studies for any hypothesis of interest. Knowledge of observed power across studies in turn would help researchers interpret the extent to which studies in a particular area were truly contributing to the knowledge base. Low average observed power across independent replications would necessitate research design modifications in future external replications. Additionally, what we call "weighted" meta-analyses could be conducted wherein effect sizes across studies are weighted by their respective post hoc power estimates. For example, if an analysis revealed a post hoc power estimate of .80, the effect-size estimate of interest would be multiplied by this value to yield a power-adjusted, effect-size coefficient. Similarly, a post hoc power estimate of .60 would adjust the observed effect-size coefficient by this amount. By using this technique when conducting meta-analyses, effect sizes resulting from studies with adequate power would be given more weight than effect sizes derived from investigations with low power.

In addition, if a researcher conducts both an a priori and a post hoc power analysis within the same study, then a power discrepancy index (PDI) could be computed, which represents the difference between the a priori and post hoc power coefficients.
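A rough sketch of a what-if analysis, the power-adjusted effect size, and the PDI follows. It rests on several assumptions that are ours, not the article's: power is approximated with a normal distribution for a test with dfi = 1 and dfe = N − 2 (so that λ = ES × N by Equation 3), the observed effect size (.0107) is held fixed as N grows, and all function names are hypothetical. A noncentral F routine would be needed for exact values.

```python
from statistics import NormalDist

_Z = NormalDist()


def approx_power(es, n_total, alpha):
    """Normal approximation to power for dfi = 1, dfe = N - 2 (lambda = ES * N)."""
    z_crit = _Z.inv_cdf(1.0 - alpha / 2.0)
    delta = (es * n_total) ** 0.5
    return (1.0 - _Z.cdf(z_crit - delta)) + _Z.cdf(-z_crit - delta)


def what_if_total_n(es, alpha, target=0.80, max_n=1_000_000):
    """Smallest total N whose approximate power reaches the target."""
    for n_total in range(3, max_n + 1):
        if approx_power(es, n_total, alpha) >= target:
            return n_total
    raise ValueError("target power not reached within max_n")


def power_adjusted_effect(es, post_hoc_power):
    """Effect size multiplied by its post hoc power estimate."""
    return es * post_hoc_power


def power_discrepancy_index(a_priori_power, post_hoc_power):
    """PDI: a priori power minus post hoc power."""
    return a_priori_power - post_hoc_power


print(what_if_total_n(0.0107, 0.01))        # total N for power of about .80
print(power_adjusted_effect(0.0107, 0.12))  # power-adjusted effect size
print(power_discrepancy_index(0.82, 0.12))  # a priori .82 versus post hoc .12
```

Under these assumptions the required total sample size comes out at roughly five times the 201 participants in the example study, which illustrates how a very small observed effect size drives the what-if sample size upward.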
The PDI would then provide information to researchers for increasing the sensitivity (i.e., the degree to which sampling error introduces imprecision into the results of a study; Murphy & Myors, 1998, p. 5) of future a priori power analyses. Sensitivity might be improved by increasing the sample size in independent replication studies. Alternatively, sensitivity might be increased by using measures that yield more reliable and valid scores or a study design that allows researchers to control for undesirable sources of variability in their data (Murphy & Myors, 1998).

We have outlined several ways in which post hoc power analyses can be used to improve the quality of present and future studies. However, this list is by no means exhaustive. Indeed, we are convinced that other uses of post hoc power analyses are waiting to be uncovered. Regardless of the way in which a post hoc power analysis is utilized, it is clear that this approach represents good detective work because it enables researchers to get to know their data better and consequently puts them in a position to interpret their data in a more meaningful way and with more conclusion validity. Therefore, we believe that post hoc power analyses should play a central role in NHST.

⁶According to Hess and Kromrey (2002), Hedges and Olkin's (1985) z-based confidence interval procedure compares favorably with all other methods of constructing confidence bands including Steiger and Fouladi's (1992, 1997) interval inversion approach.

SUMMARY AND CONCLUSIONS

Robinson and Levin (1997) proposed a two-step procedure for analyzing empirical data whereby researchers first evaluate the probability of an observed effect (i.e., statistical significance), and if and only if statistical significance is found, then they assess the effect size. Recently, Onwuegbuzie and Levin (2003) proposed a three-step procedure when two or more hypothesis tests are conducted within the same study, which involves testing the trend of the set of hypotheses at the third step. Although both methods are appealing, their effectiveness depends on the statistical power of the underlying hypothesis tests. Specifically, if power is lacking, then the first step of the two-step method and the first and third steps of the three-step procedure, which serve as gatekeepers for computing effect sizes, may lead to the nonreporting of a nontrivial effect (i.e., Type A error; Onwuegbuzie, 2001). Because the typical level of power for medium effect sizes in the behavioral and social sciences is around .50 (Cohen, 1962), the incidence of Type A error likely is high. Clearly, this incidence can be reduced if researchers conduct an a priori power analysis to select appropriate sample sizes. However, such analyses are rarely employed (Cohen, 1992). Regardless, when a statistically nonsignificant finding emerges, researchers should then conduct a post hoc power analysis. This would help researchers determine whether low power threatens the internal validity of their findings (i.e., Type A error). Moreover, if researchers conduct post hoc power analyses regardless of whether statistical significance is reached, then meta-analysts would be able to assess whether independent replications are making significant contributions to the accumulation of knowledge in a given domain.
Unfortunately, virtually no meta-analyst has formally used post hoc power analyses in this way. Thus, in this article, we advocate the use of post hoc power analyses in empirical studies, especially when statistically nonsignificant findings prevail. First, reasons for the nonuse of a priori power analyses were presented. Second, post hoc power was defined and its utility delineated. Third, a step-by-step guide was provided for conducting post hoc power analyses. Fourth, a heuristic example was provided to illustrate how post hoc power can help to rule in/out rival explanations to observed findings. Finally, several methods were outlined that describe how post hoc power analyses can be used to improve the design of independent replications. Although we advocate the use of post hoc power analyses, we believe that such approaches should never be used as a substitute for a priori power analyses. Moreover, we recommend that a priori power analyses always be conducted and reported. Nevertheless, even when an a priori power analysis has been conducted, we believe that a post hoc analysis also should be performed, especially if one or


more statistically nonsignificant findings emerge. Post hoc power analyses rely more on available data and less on speculation than do a priori power analyses that are based on hypothesized effect sizes. Indeed, we agree with Wooley and Dawson (1983), who suggested "editorial policies to require all such information relating to a priori design considerations and post hoc interpretation to be incorporated as a standard component of any research report submitted for publication" (p. 680). Although it could be argued that this recommendation is ambitious, it is no more ambitious than the editorial policies at 20 journals that now formally stipulate that effect sizes be reported for all statistically significant findings (Capraro & Capraro, 2002). In fact, reporting post hoc power provides balance in report writing: We believe that post hoc power coefficients are to statistically nonsignificant findings what effect sizes are to statistically significant findings. In any case, we believe that such a policy of conducting and reporting a priori and post hoc power analyses would simultaneously reduce the incidence of Type II and Type A errors and subsequently reduce the incidence of publication bias and the file-drawer problem. This can only help to increase the accumulation of knowledge across studies because meta-analysts will have much more information to use. This surely would represent a step in the right direction.

REFERENCES
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 6, 432-437.
Bird, K. D. (2002). Confidence intervals for effect sizes in analysis of variance. Educational and Psychological Measurement, 62, 197-226.
Brewer, J. K. (1972). On the power of statistical tests in the American Educational Research Journal. American Educational Research Journal, 9, 391-401.
Brewer, J. K., & Owen, P. W. (1973). A note on the power of statistical tests in the Journal of Educational Measurement. Journal of Educational Measurement, 10, 71-74.
Cahan, S. (2000). Statistical significance is not a kosher certificate for observed effects: A critical analysis of the two-step approach to the evaluation of empirical results. Educational Researcher, 29(1), 31-34.
Capraro, R. M., & Capraro, M. M. (2002). Treatments of effect sizes and statistical significance tests in textbooks. Educational and Psychological Measurement, 62, 771-782.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Carver, R. P. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287-292.
Chandler, R. E. (1957). The statistical concepts of confidence and significance. Psychological Bulletin, 54, 429-430.
Chase, L. J., & Baran, S. J. (1976). An assessment of quantitative research in mass communication. Journalism Quarterly, 53, 308-311.


Chase, L. J., & Chase, R. B. (1976). A statistical power analysis of applied psychological research. Journal of Applied Psychology, 61, 234-237.
Chase, L. J., & Tucker, R. K. (1975). A power-analytical examination of contemporary communication research. Speech Monographs, 42(1), 29-41.
Chase, L. J., & Tucker, R. K. (1976). Statistical power: Derivation, development, and data-analytic implications. The Psychological Record, 26, 473-486.
Cohen, J. (1962). The statistical power of abnormal social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology (pp. 95-121). New York: McGraw-Hill.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic.
Cohen, J. (1973). Statistical power analysis and research results. American Educational Research Journal, 10, 225-230.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York: Academic.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cohen, J. (1997). The earth is round (p < .05). In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 21-35). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Cooper, H. M., & Findley, M. (1982). Expected effect sizes: Estimates for statistical power analysis in social psychology. Personality and Social Psychology Bulletin, 8, 168-173.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-575.
Daly, J., & Hexamer, A. (1983). Statistical power in research in English education. Research in the Teaching of English, 17, 157-164.
Dayton, C. M., Schafer, W. D., & Rogers, B. G. (1973). On appropriate uses and interpretations of power analysis: A comment. American Educational Research Journal, 10, 231-234.
Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, Instruments, & Computers, 28, 1-11.
Fagley, N. S. (1985). Applied statistical power analysis and the interpretation of nonsignificant results by research consumers. Journal of Counseling Psychology, 32, 391-396.
Fagley, N. S., & McKinney, I. J. (1983). Reviewer bias for statistically significant results: A reexamination. Journal of Counseling Psychology, 30, 298-300.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62, 749-770.
Fisher, R. A. (1941). Statistical methods for research workers (8th ed.). Edinburgh, Scotland: Oliver & Boyd. (Original work published 1925)
Fleishman, A. I. (1980). Confidence intervals for correlation ratios. Educational and Psychological Measurement, 40, 659-670.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.
Guttman, L. B. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.
Haase, R. F. (1974). Power analysis of research in counselor education. Counselor Education and Supervision, 14, 124-132.


Haase, R. F., Waechter, D. M., & Solomon, G. S. (1982). How significant is a significant difference? Average effect size of research in counseling psychology. Journal of Counseling Psychology, 29, 58-65.
Halpin, R., & Easterday, K. E. (1999). The importance of statistical power for quantitative research: A post-hoc review. Louisiana Educational Research Journal, 24, 5-28.
Harris, R. J. (1997). Reforming significance testing via three-valued logic. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 145-174). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic.
Hess, M. R., & Kromrey, J. D. (2002, April). Interval estimates of effect size. An empirical comparison of methods for constructing confidence bands around standardized mean differences. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Katzer, J., & Sodt, J. (1973). An analysis of the use of statistical testing in communication research. The Journal of Communication, 23, 251-265.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., et al. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.
Kroll, R. M., & Chase, L. J. (1975). Communication disorders: A power analytical assessment of recent research. Journal of Communication Disorders, 8, 237-247.
Levenson, R. L. (1980). Statistical power analysis: Implications for researchers, planners, and practitioners in gerontology. Gerontologist, 20, 494-498.
Levin, J. R., & Robinson, D. H. (2000). Statistical hypothesis testing, effect-size estimation, and the conclusion coherence of primary research studies. Educational Researcher, 29(1), 34-36.
Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychology, 5, 161-171.
McNemar, Q. (1960). At random: Sense and nonsense. American Psychologist, 15, 295-300.
Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Morse, D. T. (2001, November). YAPP: Yet another power program. Paper presented at the annual meeting of the Mid-South Educational Research Association, Chattanooga, TN.
Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and a place for significance testing. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 65-115). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Mundfrom, D. J., Shaw, D. G., Thomas, A., Young, S., & Moore, A. D. (1998, April). Introductory graduate research courses: An examination of the knowledge base. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Murphy, K. R., & Myors, B. (1998). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for the purposes of statistical inference. Biometrika, 20A, 175-200, 263-294.
Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Transactions of the Royal Society of London, Series A, 231, 289-337.
Neyman, J., & Pearson, E. S. (1933b). Testing statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492-510.
Nix, T. W., & Barnette, J. J. (1998). The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing. Research in the Schools, 5, 3-14.
Onwuegbuzie, A. J. (2001, November). A new proposed binomial test of result direction. Paper presented at the annual meeting of the Mid-South Educational Research Association, Little Rock, AR.


Onwuegbuzie, A. J. (2002). Common analytical and interpretational errors in educational research: An analysis of the 1998 volume of the British Journal of Educational Psychology. Educational Research Quarterly, 26, 11-22.
Onwuegbuzie, A. J. (2003). Expanding the framework of internal and external validity in quantitative research. Research in the Schools, 10, 71-90.
Onwuegbuzie, A. J., & Daniel, L. G. (2002). A framework for reporting and interpreting internal consistency reliability estimates. Measurement and Evaluation in Counseling and Development, 35, 89-103.
Onwuegbuzie, A. J., & Daniel, L. G. (2003, February 12). Typology of analytical and interpretational errors in quantitative and qualitative educational research. Current Issues in Education, 6(2) [Online]. Retrieved October 10, 2004, from http://cie.ed.asu.edu/volume6/number2/
Onwuegbuzie, A. J., & Daniel, L. G. (2004). Reliability generalization: The importance of considering sample specificity, confidence intervals, and subgroup differences. Research in the Schools, 11(1), 61-72.
Onwuegbuzie, A. J., & Levin, J. R. (2003). A new proposed three-step method for testing result direction. Manuscript in preparation.
Onwuegbuzie, A. J., & Levin, J. R. (2003). Without supporting statistical evidence, where would reported measures of substantive importance lead? To no good effect. Journal of Modern Applied Statistical Methods, 2, 133-151.
Onwuegbuzie, A. J., Levin, J. R., & Leech, N. L. (2003). Do effect-size measures measure up? A brief assessment. Learning Disabilities: A Contemporary Journal, 1, 37-40.
Onwuegbuzie, A. J., Witcher, A. E., Filer, J., Collins, K. M. T., & Moore, J. (2003). Factors associated with teachers' beliefs about discipline in the context of practice. Research in the Schools, 10(2), 35-44.
Orme, J. G., & Tolman, R. M. (1986). The statistical power of a decade of social work education research. Social Science Review, 60, 619-632.
Overall, J. E. (1969). Classical statistical hypotheses within the context of Bayesian theory. Psychological Bulletin, 71, 285-292.
Penick, J. E., & Brewer, J. K. (1972). The power of statistical tests in science teaching research. Journal of Research in Science Teaching, 9, 377-381.
Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26(5), 21-26.
Rosenthal, R. (1979). The file-drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641.
Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656.
Rossi, J. S. (1997). A case study in the failure of psychology as a cumulative science: The spontaneous recovery of verbal learning. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 175-197). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
SAS Institute, Inc. (2002). SAS/STAT User's Guide (Version 8.2) [Computer software]. Cary, NC: Author.
Sawyer, A. G., & Ball, A. D. (1981). Statistical power and effect size in marketing research. Journal of Marketing Research, 18, 275-290.
Schmidt, F. L. (1992). What do data really mean? American Psychologist, 47, 1173-1181.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-129.
Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Schmidt, F. L., Hunter, J. E., & Urry, V. E. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473-485.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
Sherron, R. H. (1988). Power analysis: The other half of the coin. Community/Junior College Quarterly, 12, 169-175.
SPSS, Inc. (2001). SPSS 11.0 for Windows [Computer software]. Chicago: Author.
Steiger, J. H., & Fouladi, R. T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, and Computers, 4, 581-582.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31(3), 25-32.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.
Wilkinson, L., & The Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Witcher, A. E., Onwuegbuzie, A. J., & Minor, L. C. (2001). Characteristics of effective teachers: Perceptions of preservice teachers. Research in the Schools, 8(2), 45-57.
Wolfgang, D. H., & Glickman, C. D. (1986). Solving discipline problems: Strategies for classroom teachers (2nd ed.). Boston: Allyn & Bacon.
Wooley, T. W., & Dawson, G. O. (1983). A follow-up power analysis of the statistical tests used in the Journal of Research in Science Teaching. Journal of Research in Science Teaching, 20, 673-681.