
Psychological Methods
2003, Vol. 8, No. 2, 206–224

Copyright 2003 by the American Psychological Association, Inc.
1082-989X/03/$12.00 DOI: 10.1037/1082-989X.8.2.206

Beyond Alpha: An Empirical Examination of the Effects of
Different Sources of Measurement Error on Reliability Estimates
for Measures of Individual Differences Constructs

Frank L. Schmidt and Huy Le, University of Iowa
Remus Ilies, University of Florida

On the basis of an empirical study of measures of constructs from the cognitive domain, the personality domain, and the domain of affective traits, the authors of this study examine the implications of transient measurement error for the measurement of frequently studied individual differences variables. The authors clarify relevant reliability concepts as they relate to transient error and present a procedure for estimating the coefficient of equivalence and stability (L. J. Cronbach, 1947), the only classical reliability coefficient that assesses all 3 major sources of measurement error (random response, transient, and specific factor errors). The authors conclude that transient error exists in all 3 trait domains and is especially large in the domain of affective traits. Their findings indicate that the nearly universal use of the coefficient of equivalence (Cronbach's alpha; L. J. Cronbach, 1951), which fails to assess transient error, leads to overestimates of reliability and undercorrections for biases due to measurement error.

Psychological measurement and measurement theory are fundamental tools that make progress in psychological research possible. Estimating relationships between constructs and testing hypotheses are essential to the advancement of psychological theories. However, research cannot estimate relationships between constructs directly. Constructs are measured, and it is relationships between scores on the construct measures that are directly estimated. The process of measurement always contains error that biases estimates of relationships between constructs (there is no such thing as a perfectly reliable measure). Because of measurement error, the relationships between specific measures (observed relationships) underestimate the relationships between the constructs (true score relationships). To estimate true score relationships (the relationships of substantive interest), the observed relationships are often adjusted for the effect of measurement error; for these true score relationship estimates to be accurate, the adjustments need to be precise.

An important class of measurement errors that is often ignored when estimating the reliability of measures is transient error (Becker, 2000; Thorndike, 1951). If transient errors indeed exist, ignoring them in the estimation of the reliability of a specific measure leads to overestimates of reliability. Perhaps the most important consequence of overestimating reliability is the underestimation of the relationships between psychological constructs due to undercorrecting for the bias due to measurement error. This may lead to undesirable consequences potentially hindering scientific progress (Schmidt, 1999; Schmidt & Hunter, 1999). Transient errors are defined as longitudinal variations in responses to measures that are produced by random variations in respondents' psychological states across time. Thus transient errors are randomly distributed across time (occasions). Consequently, the score variance produced by these random variations is not relevant to the construct that is measured and should be partialed out when estimating true score relationships.

Frank L. Schmidt and Huy Le, Department of Management and Organizations, Henry B. Tippie College of Business, University of Iowa; Remus Ilies, Department of Management, Warrington College of Business Administration, University of Florida.
Correspondence concerning this article should be addressed to Frank L. Schmidt, Department of Management and Organizations, Henry B. Tippie College of Business, University of Iowa, Iowa City, Iowa 52242. E-mail: frank-schmidt@uiowa.edu

Measurement experts have long been aware of transient error. Thorndike (1951), for example, defined transient measurement error over 50 years ago and called for empirical research to calibrate its magnitude. However, to our knowledge, Becker (2000) provided the first empirical examination of the effects of transient error on the reliability of measures of psychological constructs. His study showed that the effect of transient error, while being negligible in some measures (0.02% of total score variance for a measure of verbal aggression), can be substantial in others (10.01% of total score variance for a measure of hostility). Measurement error variance of such magnitude can substantially attenuate observed relationships between constructs and lead to erroneous conclusions in substantive research unless it is appropriately corrected for. Becker's findings indicate the desirability of more extensive examination of the effects of transient error in measures of psychological constructs.
The research presented here investigates the implications of transient error in the measurement of important individual differences variables. In an attempt to cover a significant part of the individual differences area, we examined measures of important constructs from (a) the cognitive domain (cognitive ability), (b) the broad personality domain (the Big Five factors of personality, generalized self-efficacy, and self-esteem), and (c) the domain of affective traits (positive affectivity [PA]; negative affectivity [NA]). Becker's (2000) study focused on the four subscales of the Buss–Perry Aggression Questionnaire (Buss & Perry, 1992). In this study, we examine 10 constructs more central to research in differential psychology.
To clarify the impact of transient error processes,
we first review classical measurement methods used
to correct for biases introduced by measurement error
when estimating relationships between constructs and
develop a method to improve the accuracy of this
correction process. We then apply this method to
study the impact of transient errors on the measurement of individual differences constructs from the categories noted above, which is the substantive purpose
of this study.
In classical measurement theory, the fundamental formula for the observed correlation in the population between two measures (x and y) is

\rho_{xy} = \rho_{x_t y_t} (\rho_{xx} \rho_{yy})^{1/2},    (1)

where \rho_{xy} is the observed correlation, \rho_{x_t y_t} is the correlation between the true scores of (constructs underlying) measures x and y, and \rho_{xx} and \rho_{yy} are the reliabilities of x and y, respectively. This is called the attenuation formula because it shows how measurement error in the x and y measures reduces the observed correlation (\rho_{xy}) below the true score correlation (\rho_{x_t y_t}). Solving this equation for \rho_{x_t y_t} yields the disattenuation formula:

\rho_{x_t y_t} = \rho_{xy} / (\rho_{xx} \rho_{yy})^{1/2}.    (2)

With an infinite sample size (i.e., in the population), this formula is perfectly accurate (so long as the appropriate type of reliability estimates is used). In the smaller samples used in research, there are sampling errors in the estimates of \rho_{xy}, \rho_{xx}, and \rho_{yy}, and therefore there is also sampling error in the estimate of \rho_{x_t y_t}. Because of this, a circumflex is used in Equation 3 below to indicate that \rho_{x_t y_t} is estimated. For the same reason, r_{xy}, r_{xx}, and r_{yy} are used instead of \rho_{xy}, \rho_{xx}, and \rho_{yy}, respectively. The disattenuation formula therefore can be rewritten as follows:

\hat{\rho}_{x_t y_t} = r_{xy} / (r_{xx} r_{yy})^{1/2}.    (3)

Thus, \hat{\rho}_{x_t y_t} is the estimated correlation between the construct underlying the measure x and the construct underlying the measure y; this estimate approximates the population value (\rho_{x_t y_t}) when the sample size is reasonably large.
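As a computational illustration of Equation 3, consider the minimal Python sketch below (the function and example values are ours and purely illustrative; any statistical language would do):

```python
def disattenuate(r_xy, r_xx, r_yy):
    """Equation 3: estimate the true score correlation from an observed
    correlation and the reliability estimates of the two measures."""
    return r_xy / (r_xx * r_yy) ** 0.5

# Illustrative values: an observed correlation of .30 between measures
# with reliabilities .80 and .70 implies an estimated true score
# correlation of .30 / (.80 * .70)**0.5, or about .40.
print(round(disattenuate(0.30, 0.80, 0.70), 2))  # 0.4
```

Note that the estimate is only as good as the reliabilities supplied; that is the point developed in the remainder of the article.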
As Equation 3 shows, for the estimation of the true score correlation between constructs measured by x and y (\hat{\rho}_{x_t y_t} in Equation 3) to be unbiased, the estimates of the reliabilities of measures x and y need to be accurate. To accurately estimate the reliabilities of construct measures, one must understand and account for the processes that produce measurement error.
Three major error processes are relevant to all psychological measurement: random response error, transient error, and specific factor error. (A fourth error process, disagreement between raters, is relevant only to measurement that uses multiple raters of the constructs that are measured.) Depending on which sources of error variance are taken into account when computing reliability estimates, the magnitude of the reliability estimates can vary substantially, which can lead to erroneous interpretations of observed relationships between psychological constructs (Schmidt, 1999; Schmidt & Hunter, 1999; Thorndike, 1951).

Schmidt and Hunter (1999) observed that measurement error is typically not defined substantively in psychological research. These authors asserted that unless the psychological processes that produce measurement error are described, one is left with the impression that measurement error "springs from hidden and unknown sources and its nature is mysterious" (Schmidt & Hunter, 1999, p. 192). Thus, following Schmidt and Hunter, we first describe the substantive nature of the three main types of error variance sources. Next we examine the bias (in estimating the reliability of measures) associated with selective consideration of these sources. Finally, we propose a method of computing reliability estimates that takes into consideration all main sources of measurement error, and we present an empirical study of the effect of the different types of error variance sources on measures of important individual differences constructs.
It has long been known that many sources of variance influence the observed variance of measures in
addition to the constructs they are meant to measure
(Cronbach, 1947; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Thorndike, 1951). Variance due to
these sources is measurement error variance, and its
biasing effects must be removed in the process of
estimating the true relationships between constructs.
Although the multifaceted nature of measurement error has occasionally been examined in classical measurement theory (e.g., Cronbach, 1947), it is the central focus of generalizability theory (Cronbach et al.,
1972). Although there is potentially a large number of
measurement error sources (termed facets in generalizability theory; Cronbach et al., 1972), only a few
sources significantly contribute to the observed variance of self-report measures (Le & Schmidt, 2001).
These sources are random response error, transient
error, and specific factor error.

Random Response Error


Random response error is caused by momentary variations in attention, mental efficiency, distractions, and so forth within a given occasion. It is specific to a moment when subjects respond to an item of a measure. As such, a subject may, for example, provide different answers to the same item if it appears in two places in a questionnaire. These variations can be thought of as noise in the human central nervous system (Schmidt & Hunter, 1999; Thorndike, 1951). Random response error is reduced by summing item scores. Thus, other things equal, the larger the number of items in a specific scale, the greater the extent to which random response error is reduced.

Transient Error
Whereas random response error occurs across items (i.e., discrete time moments) within the same occasion, transient error occurs across occasions. Transient errors are produced by longitudinal variations in respondents' mood, feelings, or in the efficiency of the information processing mechanisms used to answer questionnaires (Becker, 2000; Cronbach, 1947; DeShon, 1998; Schmidt & Hunter, 1999). Thus, those occasion-characteristic variations influence the responses to all items that are answered on that specific occasion. For example, respondents' affective state (mood) will influence their responses to various satisfaction questionnaires, in that respondents in a more pleasurable affective state on that occasion will be more likely to rate their satisfaction with various aspects of their lives or jobs higher. If those responses are used to examine cross-sectional relationships between satisfaction constructs and other constructs, the variations in the satisfaction scores across occasions should be treated as measurement error (i.e., transient errors). Under the conceptualization of generalizability theory (Cronbach et al., 1972), transient error is the interaction between persons (measurement objects) and occasions.
Transient error can be defined only when the construct to be measured is assumed to be substantively
stable across the time period over which transient error is assessed. Individual differences constructs such
as intelligence or various personality traits can safely
be assumed to be substantively stable across at least
short periods of time, their score variation across occasions being caused by transient error. For constructs
that exhibit systematic variations even across short
time periods, such as mood, changes in the magnitude
of the measurement cannot be considered the result of
transient errors because those changes have substantive (theoretically significant) causes that can and
should be studied and understood (Watson, 2000).
If transient error indeed has a nonnegligible impact
on the measurement of individual differences (i.e., it
reduces the reliability of such measures), this impact
needs to be estimated, and relationships among constructs need to be corrected for the downwardly biasing effect of transient error on these relationships.
That is, observed relationships should be corrected for
unreliability using reliability estimates that take into
account transient error. To assess the impact of transient error on the measurement of psychological constructs, one estimates the amount of transient error in
the scores of a measure by correlating responses
across occasions. As noted, much too often transient
error is ignored when estimating reliability of measures (Becker, 2000).

Specific Factor Error


This type of measurement error arises from the subject's idiosyncratic response to some aspect of the measurement situation (Schmidt & Hunter, 1999). At the item level, specific factor errors are produced by respondent-specific interpretation of the wording of questionnaire items. In that sense, specific factor errors correspond to the interaction between respondents and items. Specific factors are not part of the constructs being measured because they do not correlate with scores on other items measuring that construct (i.e., they are item specific); thus they are measurement errors. Specific factor errors tend to cancel each other out across different items, but for a specific item they replicate across occasions. At the scale level, different scales that were created to measure the same construct may also contain specific factor errors. Scale-specific factor error can be controlled only by using multiple scales to measure the same construct, thus defining the construct being measured by what is measured and shared by all scales (Le & Schmidt, 2001; Schmidt & Hunter, 1999). That is, the construct is defined as the factor that the scales have in common.

Calibrating Measurement Error Processes With Reliability Coefficients
By definition, reliability is the ratio of true score variance to observed score variance. The basic conceptual difference between methods of computing reliability lies in the ways in which they define and estimate true score variance. Estimates of true score variance based on different methods may include variance components due to measurement errors in addition to those due to the construct of interest. To put it another way, different methods implicitly assess the extent to which different error processes (i.e., random response, transient, and specific factor) produce variation in test scores that is not relevant to the underlying trait to be measured (i.e., measurement error). The magnitude of an error process is assessed by a reliability estimate only when that error process is allowed to reduce the size of the reliability estimate; the magnitude of the reduction is a measure of the impact of the error process that produced it. It follows that some methods of computing reliability are more appropriate than others because they better model the various error processes. These issues have in fact long been known and discussed by measurement researchers (e.g., Cronbach, 1947; Thorndike, 1951). Nevertheless, we reexamine the topic here to clearly define the problem and then offer a solution. Specifically, we discuss below how popular estimates of reliability assess different sources of measurement error.

Internal Consistency Reliability


This method of computing reliability is probably the one most frequently used in psychological research, and computations of internal consistency are included in all standard statistical analysis programs. Internal consistency reliability assesses specific factor error and random response error. Specific factor errors and random response errors are independently distributed across items and tend to cancel each other out when item scores are summated.

The most popular index of internal consistency of a measure is coefficient alpha (Cronbach, 1951). Cronbach (1951) showed that coefficient alpha is the average of all the split-half reliabilities. A split-half reliability is the correlation between two halves of a scale, adjusted to reflect the reliability of the full-length scale. As such, coefficient alpha is conceptually equivalent to the correlation between two parallel forms of a scale administered on the same occasion (Cronbach, 1951). Parallel forms of a scale are different sets of items having identical loadings on the construct to be measured but having different specific factors (further definition and discussion of parallel forms is provided in a subsequent section). When parallel forms of a scale administered on the same occasion are correlated, measurement error variance due to specific factor error and random response error reduces the correlation between the parallel forms. Thus, in practice, internal consistency reliability can be estimated three different ways: (a) by the correlation between parallel forms of a scale; (b) by splitting a scale into halves, computing the correlation between the resulting subscales, and correcting this estimate for the reduced number of items with the Spearman–Brown prophecy formula; and (c) by using formulas specific to the measurement situation, such as those for Cronbach's alpha (when items are scored on a continuum; Cronbach, 1951) or Kuder–Richardson 20 (when items are scored dichotomously). Because the most direct method for computing internal consistency is to correlate parallel forms of the scale administered on one occasion, this class of reliability coefficient estimates is referred to as the coefficient of equivalence (CE; Anastasi, 1988; Cronbach, 1947).
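Methods (b) and (c) can be sketched computationally as follows (our illustration; the first-half/second-half split shown is naive, and in practice the halves should be constructed to be as parallel as possible, as discussed later in this article):

```python
import numpy as np

def coefficient_alpha(items):
    """Cronbach's alpha for an (n_persons, k_items) score matrix."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

def split_half_spearman_brown(items):
    """Split-half correlation stepped up to full length with the
    Spearman-Brown prophecy formula (assumes an even item count and
    an illustrative first-half/second-half split)."""
    k = items.shape[1]
    half1 = items[:, : k // 2].sum(axis=1)
    half2 = items[:, k // 2 :].sum(axis=1)
    r_hh = np.corrcoef(half1, half2)[0, 1]
    return 2 * r_hh / (1 + r_hh)
```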
Feldt and Brennan (1989) provided a formula illustrating the variance components included as true score variance by the CE (Equation 102, p. 135). The CE assesses the magnitude of measurement error produced by specific factor and random response error but not by transient error processes.

Test–Retest Reliability

Test–retest reliability, as its name implies, is computed by correlating the scores on the same form of a measure across two different occasions. The test–retest reliability coefficient is alternatively referred to as the coefficient of stability (CS; Cronbach, 1947) because it estimates the extent to which scores on a specific measure are stable across time (occasions). The CS assesses the magnitude of transient error and random response error, but it does not assess specific factor error, as shown by Feldt and Brennan (1989; Equation 100, p. 135).

Coefficient of Equivalence and Stability

The coefficient of equivalence assesses specific factor error and random response error but not transient error, whereas the CS assesses transient and random response error but not specific factor error. The only type of reliability coefficient that estimates the magnitude of the measurement error produced by all three error processes is the coefficient of equivalence and stability (CES; Cronbach, 1947). The CES is computed by correlating two parallel forms of a measure, each administered on a different occasion. The use of parallel forms administered on one occasion assesses only specific factor error and random response error, and the administration of the same measure across two different occasions assesses only transient error and random response error. Hence, the CES is the ideal reliability estimate because its magnitude is appropriately reduced by all three sources of measurement error. Supporting this argument, Feldt and Brennan (1989) provided a formula for the CES, showing that its estimated true score variance includes only variance due to the construct of interest (Equation 101, p. 135). The other types of reliability coefficients selectively take into account only some sources of measurement error, and thus they overestimate the reliability of the scale.

Computing the Coefficient of Equivalence and Stability

As already noted, to compute the CES of a measure for a specific sample, two parallel forms of the measure, each administered on a different occasion, are required. The correlation of the scores on these parallel forms is reduced by all three main sources of measurement error, and it equals the CES. However, in many cases, parallel forms of the same measure may not be available. For such cases, we present here an alternative method for estimating the CES of a scale from the CES of its half scales, which can be obtained either from (a) administering the scale on two distinct occasions and then splitting it post hoc to form parallel half scales or (b) splitting the scale into parallel halves, then administering each half on a separate occasion. The former approach (i.e., post hoc division) provides us with multiple estimates of the CE and CES coefficients of the half scales, which we can average to reduce the effect of sampling error, thereby obtaining more accurate estimates of the coefficients of interest. However, administering the same scale on two occasions with a short interval can result in measurement reactivity (i.e., subjects' memory of item content could affect their responses on the second occasion; Stanley, 1971; Thorndike, 1951). This problem is especially likely for measures of cognitive constructs. In such a situation, it may be advisable to adopt the latter approach: The scale should be split into halves first, with each half administered on only one occasion.
After estimating the half-scale CES, one needs to compute the CES for the full scale. Using the Spearman–Brown prophecy formula to adjust for the increased (double) number of items for the full scale appears to be the most direct approach. However, Feldt and Brennan (1989) have cautioned that the Spearman–Brown formula does not apply when there are multiple sources of measurement error involved (Feldt & Brennan, 1989, p. 134), which is the case here. We demonstrate below that use of the Spearman–Brown formula is not accurate in this case.
Our derivations assume that all the statistics are obtained from a very large sample, so there is no sampling error; in other words, they are population parameters. We further assume that there exist parallel forms for the scale in question, and these forms can be further split into parallel subscales. Parallel forms are defined under classical measurement theory to be scales that (a) have the same true score and (b) have the same measurement error variance (Traub, 1994). It can be readily inferred from this definition that parallel forms have the same standard deviation and reliability coefficient (ratio of true score variance to observed score variance). Parallel forms thus defined are strictly parallel, as contrasted to randomly parallel forms, which are scales having the same number of items randomly sampled from a hypothetical infinite item domain of the construct in question (Nunnally & Bernstein, 1994; Stanley, 1971). Randomly parallel forms may have different true score and observed score variances due to sampling error resulting from the sampling of items.
Let A and B be two parallel forms of a scale that were both administered at two times: Time 1 and Time 2. 1A symbolizes Form A administered at Time 1; 2A symbolizes Form A administered at Time 2; 1B symbolizes Form B administered at Time 1; and 2B symbolizes Form B administered at Time 2. With this notation, the CES can be written as

CES = \rho(1A, 2B) = \rho(2A, 1B).    (4)

If we split A to create two parallel half scales, a_1 and a_2, and split B to create two parallel half scales, b_1 and b_2, we then have four parallel half scales, a_1, a_2, b_1, and b_2, and

A = a_1 + a_2,    (5)

B = b_1 + b_2.    (6)

From Equations 4, 5, and 6, with prefixes 1 and 2 denoting the time when a half scale is administered, we have

CES = \rho(1a_1 + 1a_2, 2b_1 + 2b_2) = \rho(2a_1 + 2a_2, 1b_1 + 1b_2).    (7)

Because a_1, a_2, b_1, and b_2 are parallel scales, their standard deviations are the same. Further, the standard deviation should be invariant across Times 1 and 2. Symbolizing this value \sigma, we have

\sigma(1a_1) = \sigma(1a_2) = \sigma(1b_1) = \sigma(1b_2) = \sigma(2a_1) = \sigma(2a_2) = \sigma(2b_1) = \sigma(2b_2) = \sigma.    (8)

Further, let the CE of the half scales be symbolized as ce; this coefficient can be written as

ce = \rho(1a_1, 1a_2) = \rho(2a_1, 2a_2) = \rho(1b_1, 1b_2) = \rho(2b_1, 2b_2).    (9)

Let the CES of the half scales be symbolized as ces; by definition this coefficient is the correlation between any two different half scales across times:

ces = \rho(1a_1, 2a_2) = \rho(1a_1, 2b_1) = \rho(1a_1, 2b_2)
    = \rho(1a_2, 2a_1) = \rho(1a_2, 2b_1) = \rho(1a_2, 2b_2)
    = \rho(1b_1, 2a_1) = \rho(1b_1, 2a_2) = \rho(1b_1, 2b_2)
    = \rho(1b_2, 2a_1) = \rho(1b_2, 2a_2) = \rho(1b_2, 2b_1).    (10)

From Equation 7 above and the definition of the correlation coefficient, we have

CES = \rho(1a_1 + 1a_2, 2b_1 + 2b_2)
    = Cov[1a_1 + 1a_2, 2b_1 + 2b_2] / {[Var(1a_1 + 1a_2)]^{1/2} [Var(2b_1 + 2b_2)]^{1/2}}.    (11)

Using Equations 8 and 10 above, we can rewrite the numerator of the right side of Equation 11 as follows:

Cov(1a_1, 2b_1) + Cov(1a_1, 2b_2) + Cov(1a_2, 2b_1) + Cov(1a_2, 2b_2)
    = \rho(1a_1, 2b_1)\sigma(1a_1)\sigma(2b_1) + \rho(1a_1, 2b_2)\sigma(1a_1)\sigma(2b_2)
    + \rho(1a_2, 2b_1)\sigma(1a_2)\sigma(2b_1) + \rho(1a_2, 2b_2)\sigma(1a_2)\sigma(2b_2)
    = 4\sigma^2 ces.    (12)

Using Equations 8 and 9 above, we can rewrite the denominator of the right side of Equation 11 as

[\sigma^2(1a_1) + \sigma^2(1a_2) + 2\rho(1a_1, 1a_2)\sigma(1a_1)\sigma(1a_2)]^{1/2}
    \times [\sigma^2(2b_1) + \sigma^2(2b_2) + 2\rho(2b_1, 2b_2)\sigma(2b_1)\sigma(2b_2)]^{1/2}
    = 2\sigma^2(1 + ce).    (13)

Finally, from Equations 12 and 13, we have

CES = 2 ces / (1 + ce).    (14)

Equation 14 gives a simple formula for computing the CES of a full scale that was split into halves from the CES of the half scales. It can be inferred from the above derivations that this equation yields the exact value of the CES when (a) all statistics are obtained from a very large sample and (b) strictly parallel half scales can be formed by appropriately splitting the full scale. To the extent that those assumptions do not hold in practice (i.e., there is a limited sample size and departures from strict parallelism of half scales), the estimate of the CES is affected by sampling error in both subjects and items (Feldt, 1965; Lord, 1955). Nevertheless, the formula presented here provides a convenient way to obtain an unbiased and asymptotically accurate estimate of the CES for any measure with real samples (i.e., sample sizes are limited) and in the absence of perfectly parallel forms. When half scales are formed by post hoc splitting of a full scale as described above, sampling error can be reduced by averaging statistics of subscales to obtain more accurate estimates of population values. If the scales are split in random halves (instead of strictly parallel halves), the resulting reliability coefficient estimates the random CES (Nunnally & Bernstein, 1994).
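In code, Equation 14 is a one-liner; the sketch below (our illustration, with made-up half-scale values) shows it alongside the Spearman–Brown step-up discussed next:

```python
def ces_full_scale(ces_half, ce_half):
    """Equation 14: CES of the full scale from the CES and CE of its
    parallel half scales."""
    return 2 * ces_half / (1 + ce_half)

# Illustrative half-scale values: ce = .70, ces = .60.
# Equation 14 gives 2(.60)/(1 + .70) = .71 (rounded), whereas the
# Spearman-Brown formula (Equation 15, below) would give
# 2(.60)/(1 + .60) = .75, an overestimate.
print(round(ces_full_scale(0.60, 0.70), 2))  # 0.71
```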
We note here that the use of the Spearman–Brown prophecy formula is not appropriate to adjust the CES for the length of the full scale (though it would be appropriate to adjust the CE). If the Spearman–Brown formula were erroneously applied here to estimate CES from ces, the erroneous estimate of the adjusted CES would be

CES = 2 ces / (1 + ces).    (15)

Comparing Equations 14 and 15 reveals that adjusting CES with the Spearman–Brown formula results in overestimation of the full-scale CES (because ces is always smaller than ce).

Extending the Formula for Estimating the CES

Equation 14 provides a formula for computing the CES of a full scale from the CE and CES of its half scales. To derive the equation, we have assumed that (a) a scale can be split into parallel half scales and (b) the psychometric properties of the half scales (standard deviations and ce) remain unchanged across occasions. Arguably, these assumptions are not strong and are likely to be met in practice. Nevertheless, as an anonymous reviewer pointed out, the assumptions can be further relaxed. When measurement reactivity is not a problem and thus the same scale can be administered at both times (i.e., the post hoc splitting approach mentioned earlier is adopted), the following equation can be used to estimate the CES of the scale:¹

CES = 2[Cov(1a_1, 2a_2) + Cov(2a_1, 1a_2)] / {[Var(1A)]^{1/2} [Var(2A)]^{1/2}},    (16)

where Cov(1a_1, 2a_2) is the covariance between the half-scale a_1 administered at Time 1 (1a_1) and the half-scale a_2 administered at Time 2 (2a_2); Cov(2a_1, 1a_2) is the covariance between the half-scale a_1 administered at Time 2 (2a_1) and the half-scale a_2 administered at Time 1 (1a_2); Var(1A) is the variance of the Full-Scale A administered at Time 1; and Var(2A) is the variance of the Full-Scale A administered at Time 2.

¹ We are grateful to an anonymous reviewer for suggesting the way to relax the assumptions underlying Equation 14. The reviewer also provided Equations 16 and 17 and their derivations. Derivation details are available from Frank L. Schmidt upon request.

Equation 16 assumes only that Scale A can be divided into two parallel forms, a_1 and a_2. In practice, however, there are scales with an odd number of items; hence it is not possible to divide them into parallel half scales. The assumption underlying Equation 16 can then be further relaxed to cover this situation. A scale can now be simply divided into two subscales, and then the following equation can be used to estimate the CES of the full scale from information on the subscales:

CES = [Cov(1a_1, 2a_2) + Cov(2a_1, 1a_2)] / {2 p_1 p_2 [Var(1A)]^{1/2} [Var(2A)]^{1/2}},    (17a)

where Cov(1a_1, 2a_2) is the covariance between the subscale a_1 administered at Time 1 (1a_1) and the subscale a_2 administered at Time 2 (2a_2); Cov(2a_1, 1a_2) is the covariance between the subscale a_1 administered at Time 2 (2a_1) and the subscale a_2 administered at Time 1 (1a_2); Var(1A) is the variance of the Full-Scale A administered at Time 1; Var(2A) is the variance of the Full-Scale A administered at Time 2; p_1 is the ratio of the number of items in the subscale a_1 to that in the full scale; and p_2 is the ratio of the number of items in the subscale a_2 to that in the full scale.
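Equation 17a (and Equation 16 as its special case with p_1 = p_2 = 1/2) can be rendered directly in code. A minimal Python sketch (ours; argument names are illustrative, and complete score arrays for the same persons on both occasions are assumed):

```python
import numpy as np

def ces_posthoc_split(a1_t1, a2_t1, a1_t2, a2_t2, p1, p2):
    """Equation 17a: CES when the full scale is administered on both
    occasions and split post hoc into subscales a1 and a2.
    Each a*_t* argument is a 1-D array of subscale total scores;
    p1 and p2 are the subscales' proportions of items.
    With p1 = p2 = 1/2 this reduces to Equation 16."""
    cov = lambda x, y: np.cov(x, y, ddof=1)[0, 1]
    var_1A = np.var(a1_t1 + a2_t1, ddof=1)  # full-scale variance, Time 1
    var_2A = np.var(a1_t2 + a2_t2, ddof=1)  # full-scale variance, Time 2
    numerator = cov(a1_t1, a2_t2) + cov(a1_t2, a2_t1)
    return numerator / (2 * p1 * p2 * np.sqrt(var_1A * var_2A))
```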
Equations 16 and 17a can be applied only when the same scale is used at both times (the post hoc splitting approach). In situations where it is impossible to apply this approach because of time constraints or concern about measurement reactivity, a modification can be introduced into Equation 17a to enable estimating the CES for scales with an odd number of items:

CES = Cov(1a_1, 2a_2) / {p_1 p_2 [Var(1A)]^{1/2} [Var(2A)]^{1/2}},    (17b)

where Cov(1a_1, 2a_2) is the covariance between the subscale a_1 administered at Time 1 (1a_1) and the subscale a_2 administered at Time 2 (2a_2); p_1 and p_2 are the proportions of total items in the subscales a_1 and a_2; and Var(1A) and Var(2A) are the estimated variances of the Full-Scale A.
As seen from above, we need to estimate the variances of the Full-Scale A when it is administered at Times 1 and 2 (which are not directly available because only one subscale is administered at each time in this situation). These values can be obtained from the following equations (Appendix A provides derivations of these equations):

Var(1A) = Var(1a_1)[p_1 + (1 − p_1) ce_1] / p_1^2,    (18a)

Var(2A) = Var(2a_2)[p_2 + (1 − p_2) ce_2] / p_2^2,    (18b)

where ce_1 and ce_2 are the coefficients of equivalence of subscale a_1 and subscale a_2, respectively.
From Equation 16 to Equation 17b above, we have gradually relaxed the assumptions required for estimating the CES. This is achieved at the cost of simplicity: those equations are more complicated than Equation 14. Using Equations 16 and 17a also requires an additional assumption, namely, that measurement reactivity is not a problem. Equation 17b further relaxes this assumption by allowing the CES to be estimated from subscales administered at two different times. All these equations, especially Equation 17b, can be used in most practical situations where the assumptions underlying Equation 14 are not likely to hold.²
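For completeness, Equations 17b, 18a, and 18b combine as in the following sketch (again our own illustration; ce1 and ce2 would be the coefficient alphas of the two subscales):

```python
import numpy as np

def ces_separate_occasions(a1_t1, a2_t2, p1, p2, ce1, ce2):
    """Equations 17b, 18a, and 18b: CES when subscale a1 is administered
    only at Time 1 and subscale a2 only at Time 2.
    ce1 and ce2 are the subscales' coefficients of equivalence."""
    # Equations 18a and 18b: estimate full-scale variances from the
    # observed subscale variances.
    var_1A = np.var(a1_t1, ddof=1) * (p1 + (1 - p1) * ce1) / p1**2
    var_2A = np.var(a2_t2, ddof=1) * (p2 + (1 - p2) * ce2) / p2**2
    # Equation 17b.
    cov_12 = np.cov(a1_t1, a2_t2, ddof=1)[0, 1]
    return cov_12 / (p1 * p2 * np.sqrt(var_1A * var_2A))
```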

The Impact of Transient Error on the Measurement of Individual Differences Constructs
As shown above, the only type of reliability that takes into consideration all the main sources of measurement error is the CES. The most widely used reliability coefficient, Cronbach's alpha, does not take into account transient error. Thus, Cronbach's alpha (i.e., the CE) is a potential overestimate of the reliability of a measure (if transient error processes indeed impact the measured scores). By computing the CES with the method detailed in the preceding section, we can estimate the extent to which the CE overestimates reliability: subtracting the CES from the CE yields the proportion of observed variance produced by transient error processes (Feldt & Brennan, 1989; Le & Schmidt, 2001). This approach was previously adopted by Becker (2000) to estimate the effect of transient error.

For constructs that exhibit trait-like characteristics, the processes that produce temporal variations in the measurements of a specific construct should be treated as measurement error processes, but for state-like constructs, temporal variations of the measurements cannot be treated as measurement error. Extending the trait–state dichotomy along a continuum, it follows that, except for constructs that can be placed at the ends of this continuum (such as intelligence, which can be placed at the trait end, or emotions, which are at the state end), the trait versus state distinction depends, at least for the purpose of defining transient error, on the specific (theoretical) definition of the construct that is measured. For example, the Positive and Negative Affect Schedule (PANAS; Watson, Clark, & Tellegen, 1988) scales can be used to assess both temporary mood-state (affect) and stable-trait (affectivity) constructs,³ the difference being only in the instructions given to the respondents (momentary instructions vs. general or long-term instructions). Furthermore, empirical evidence shows that average levels of daily state affect are substantially correlated with affectivity trait measures (r = .64 and r = .53 for Positive Affect [PA] and Negative Affect [NA], respectively; Watson & Clark, 1994); thus scores of momentary affect ratings (i.e., mood states) can be considered indicators of trait-level affectivity if such measures are averaged across multiple occasions. The larger the number of occasions that are averaged over, the more closely the average approaches a trait measure.
A particularly illustrative example of how the same measures can be used to construct scores whose position on the trait–state continuum varies from state to trait was presented by Watson (2000). He presented test–retest correlations for scores that were constructed by aggregating daily mood measures across a number of days increasing from 1 to 14. The test–retest correlation for the composite scores increased monotonically with the number of days entering the composite, from .44 to .85 for Positive Affect and from .37 to .77 for Negative Affect (Watson, 2000, p. 147, Table 5.1). That is, the more closely the measure approximated a trait measure, the more stable it became.
Transient measurement error processes were expected to affect the measurement of all constructs included in this study, but the impact on the measurement of the different traits may not be equal. Conley (1984) estimated the annual stabilities (i.e., average 1-year test–retest correlations between measures, corrected for measurement error using coefficient alpha) of measures of intelligence, personality, and self-opinion (i.e., self-esteem) to be .99, .98, and .94, respectively. Those findings may actually reflect differences in (a) the stabilities of the constructs (as argued by Conley), (b) the susceptibilities of their measures to transient error, or (c) both. Schmidt and Hunter (1996, 1999), on the basis of initial empirical evidence, hypothesized that transient error is lower in the cognitive domain than in the noncognitive domain. In accordance with Schmidt and Hunter's hypothesis (1996, 1999), we expected transient error to be smallest in the cognitive domain. Within the personality domain, we expected transient error to influence affectivity traits to a larger extent than broad personality traits. This expectation is suggested by the relatively lower test–retest correlations often observed in measures of affectivity traits when compared with those of other dispositional measures (Judge & Bretz, 1993). Even when compared with the Big Five traits to which they are frequently related (i.e., Extraversion and Neuroticism, respectively; e.g., Brief, 1998; Watson, 2000), positive affectivity and negative affectivity display relatively smaller stability coefficients over moderate time periods (2.5–3 years; Vaidya, Gray, Haig, & Watson, 2001). Vaidya et al. (2001) found the average test–retest correlation for positive affectivity and negative affectivity to be .54, whereas the average test–retest correlation for Extraversion and Neuroticism was .63. However, as mentioned above, with such a relatively long interval, the lower test–retest correlations could be due to the lower stability of the constructs, or larger transient error, or both. It is not clear from the current research evidence which source (instability of the constructs or transient error) is responsible for the low test–retest correlations often observed. Our study attempts to disentangle the effects of these sources by examining only the effect of transient error. We accomplished this by using a short test–retest interval (one week), during which we can be fairly certain that the constructs in question are stable.

From a theoretical perspective, people may not be able to entirely separate self-assessments of trait affectivity from current affective states (Schmidt, 1999). That is, mood states may, to some extent, influence self-ratings of trait affectivity. This influence will translate into increased levels of transient error in the measurement of trait-level affectivity. We expect the impact of mood states on the self-reports of trait affectivity to be largest for the PANAS measures because these scales use mood descriptors to assess affectivity; consequently, the amount of transient error should be largest for the PANAS measures among those examined in the study.

² From our available data (not from this study), all these equations yield very similar results, even when the assumptions for Equation 14 are violated (i.e., when the standard deviations of a scale are different across testing occasions). In other words, Equation 14 appears to be quite robust to violation of its assumptions.

³ In the literature, affective states (moods) and affective traits (personality) are referred to as affect and affectivity, respectively. We adopt this terminology in this article.

Method
Procedure
To investigate these hypotheses, we conducted a study at a midwestern university. Two hundred thirty-five students enrolled in an introductory management course participated in the study. Participation was voluntary and was rewarded with course credits. A total of 123 women (52%) and 112 men (48%) were in the initial sample. The average age of the participants was 21.5 years. All participants were requested to complete two laboratory sessions separated by an interval of approximately 1 week. A total of 18 sessions, each having about 20 participants, was arranged. The sessions spanned 3 weeks. Measures were organized into two distinct questionnaires. Each questionnaire included one of a pair of parallel forms of measures for each construct studied (described below) in this project. In each session, participants were given one questionnaire to answer. Each participant received different questionnaires in his or her two sessions. Sixty-eight of the initial participants failed to attend the second session, so the final sample consists of 167 participants (83 men and 84 women; average age = 21.3 years). There was no difference between the participants who dropped out and those in the final sample in terms of age and gender. Because it is possible that the dropout participants were different from those in the final sample in some other characteristics that could influence the results (e.g., those who dropped out were lower in conscientiousness, which could have resulted in some range restriction in this variable in the final sample), we carried out further analyses to examine the potential bias due to sample attrition. Mean scores of the participants who did not complete the study were compared with those in the final sample on all the available measures. No statistically significant differences were found.

Measures
As mentioned above, we measured constructs from the cognitive domain, the broad personality domain, and the affective traits domain. More specifically, we measured general mental ability (GMA), the Big Five personality traits, self-esteem, self-efficacy, positive affectivity, and negative affectivity. Table 1 presents the measures used in the study.

Table 1
Measures Used in the Study

Construct                                Measure
Cognitive
  General mental ability                 Wonderlic
Broad personality
  Big Five (Conscientiousness,           PCIᵃ
    Extraversion, Agreeableness,         IPIP Big Fiveᵃ
    Neuroticism, and Openness)
  Self-esteem                            TSBI, Forms A and B
                                         Rosenbergᵃ
  Generalized self-efficacy              GSE Scaleᵃ
Affective traits
  Positive and negative affectivity      PANASᵃ
                                         DE scaleᵃ
                                         MPIᵃ

Note. Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Inventory Pool (Goldberg, 1997); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); DE = Diener and Emmons's (1984) Affect-Adjective Scale; MPI = Multidimensional Personality Index (Watson & Tellegen, 1985).
ᵃ These measures have no parallel forms. We used the method described in the text to calculate coefficient of equivalence and stability reliability values.

As shown therein, we used the Wonderlic Personnel Test (Wonderlic Personnel Test, Inc., 1998) to measure GMA. This test is popular among employers as a valid personnel selection tool. It has also been used in research to operationalize the GMA construct (e.g., Barrick, Stewart, Neubert, & Mount, 1998; Hansen, 1989; Mackaman, 1982). Available empirical evidence shows that the test is highly correlated with many other popular tests of GMA, such as the Wechsler Adult Intelligence Scale–Revised (Wechsler, 1981; .75–.96), the Otis–Lennon (Otis & Lennon, 1979; .87–.99), and the General Aptitude Test Battery (U.S. Department of Labor, 1970; .74; Wonderlic Personnel Test, 1998).
For the Big Five personality constructs, we used two inventories: the Personal Characteristics Inventory (PCI; Barrick & Mount, 1995) and the International Personality Inventory Pool Big Five scale (IPIP; Goldberg, 1997). The respective manuals of these scales show that they are highly correlated with other measures of the Big Five personality constructs popularly used in research and practice, that is, the Neuroticism–Extraversion–Openness Personality Inventory–Revised (NEO-PI-R; Costa & McCrae, 1985), the Hogan Personality Inventory (Hogan, 1986), and Goldberg's markers (Goldberg, 1992). There have also been empirical studies using these scales in the literature (e.g., Barrick, Mount, & Strauss, 1994; Barrick et al., 1998; Heaven & Bucci, 2001).
Two scales were used for the self-esteem construct: the Rosenberg Self-Esteem Scale (Rosenberg, 1965) and the Texas Social Behavior Inventory (Helmreich & Stapp, 1974). According to a recent review (Blascovich & Tomaka, 1991), these scales are among the most popularly used measures of self-esteem in the literature. Because generalized self-efficacy is a relatively new construct, there are not many scales available. We chose to use the scale developed by Sherer et al. (1982), which appears to be one of the most widely used measures of this construct.

The PANAS (Watson et al., 1988) is probably the most popular measure of affective traits (Price, 1997), so we used it in our study. To obtain additional estimates of transient error for the affective constructs, we included two other measures of trait affectivity: Diener and Emmons's (1984) Affect-Adjective Scale and the Multidimensional Personality Index (Watson & Tellegen, 1985).
As shown in Table 1, only two of the measures used have parallel forms available: the Wonderlic Personnel Test (Wonderlic Personnel Test, 1998) for the construct of GMA and the Texas Social Behavior Inventory (Helmreich & Stapp, 1974) for the construct of self-esteem. In the case of the other measures, we split them into halves to form parallel half scales. Efforts were made to create half scales as strictly parallel as possible (cf. Becker, 2000; Clause, Mullins, Nee, Pulakos, & Schmitt, 1998; Cronbach, 1943): Items in the measures were allocated into the halves on the basis of (a) item content, (b) their loadings on respective factors (from previous empirical studies providing this information), and (c) other relevant statistics (i.e., standard deviations, intercorrelations with other items) when available. For each measure of this type, a different half scale was included in the questionnaire to be administered on each occasion. Because of time constraints and concerns about measurement reactivity, we did not include full scales for those measures each time and so did not apply post hoc division of scales, as discussed in the earlier section.

Analyses

Two types of reliability coefficients were computed: the CE was computed with Cronbach's (1951) formula, and the CES was computed using the method detailed earlier. Specifically, for the measures with parallel forms (Wonderlic Personnel Test and Texas Social Behavior Inventory), the CE value was estimated by averaging the coefficient alphas of the parallel forms; the CES values were the correlations between the parallel forms administered on different occasions. For those measures without parallel forms, their half-scale coefficients of equivalence (ce) were estimated by averaging the coefficient alphas of the parallel half scales of the same measures; their half-scale coefficients of equivalence and stability (ces) were the correlations between the parallel half scales of the same measures administered on two different occasions. The Spearman–Brown formula was then used to estimate the CE of the full scales from those of the half scales (ce). The CES of the full scales was estimated from those of the half scales (ces) by use of Equation 14, derived in the previous section.

Next we estimated the ratio of transient error variance to observed score variance for measures of each construct by subtracting the CES from the CE; this ratio or proportion is hereafter referred to as TEV. When available, TEVs for measures of the same construct were averaged to obtain the best estimate of the effect of transient error on measures of that construct. Finally, we examined the effect of transient error on measures of different constructs by comparing proportions of TEV across measures and constructs.
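For a measure without parallel forms, this estimation sequence can be sketched as follows (our illustration of the procedure just described; argument names are ours):

```python
import numpy as np

def ce_ces_tev(half1_t1, half2_t2, alpha_half_t1, alpha_half_t2):
    """Illustrative pipeline: full-scale CE via Spearman-Brown,
    full-scale CES via Equation 14, and TEV = CE - CES.
    half1_t1 and half2_t2 are total scores on the parallel half
    scales administered on the two occasions."""
    ce_half = (alpha_half_t1 + alpha_half_t2) / 2       # averaged alphas
    ces_half = np.corrcoef(half1_t1, half2_t2)[0, 1]    # cross-occasion r
    ce_full = 2 * ce_half / (1 + ce_half)               # Spearman-Brown
    ces_full = 2 * ces_half / (1 + ce_half)             # Equation 14
    return ce_full, ces_full, ce_full - ces_full        # CE, CES, TEV
```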
As discussed earlier, when sample size is limited and half scales are likely to be randomly parallel instead of strictly parallel, the values of the CES estimated by Equation 14 could be affected by sampling error in both subjects and items. It is therefore important to have information about the distribution of the CES estimates, as well as those of TEV and CE. Unfortunately, the nonlinearity of Equation 14 and the complexity of the sampling distributions of its components (e.g., ce is estimated by averaging two coefficient alphas of the half scales) render analytical derivation of a formula for estimating the standard error of the CES difficult. Here the Monte Carlo simulation technique can provide a tool for empirically examining the distributions of interest (cf. Mooney, 1997). We simulated data based on the results (i.e., estimates of CE, CES, and TEV) obtained from the previous analysis. One thousand data sets were generated for each measure included in the study under the same conditions (see Appendix B for further details of the simulation procedure). For each data set, the same analysis procedure described in the previous section was carried out to estimate the CE, CES, and TEV. Standard deviations of the distributions of these estimates provide the standard errors of interest.
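The logic of such a simulation can be sketched with a simplified generative model (ours, not necessarily the exact procedure of Appendix B; the variance components shown are placeholders, which in the actual simulation would be chosen to reproduce the estimated CE and CES of each measure):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(n=167, k=10, v_true=1.0, v_trans=0.1,
                     v_spec=0.2, v_rand=0.5):
    """One simulated data set: k item scores for n persons on each of
    two occasions, built from true score, transient (person x occasion),
    specific factor (person x item), and random response components."""
    true = rng.normal(0.0, np.sqrt(v_true), n)            # stable true scores
    spec = rng.normal(0.0, np.sqrt(v_spec), (n, k))       # person x item
    occasions = {}
    for occ in (1, 2):
        trans = rng.normal(0.0, np.sqrt(v_trans), n)      # person x occasion
        noise = rng.normal(0.0, np.sqrt(v_rand), (n, k))  # random response
        occasions[occ] = (true + trans)[:, None] + spec + noise
    return occasions  # occasions[1], occasions[2]: (n, k) matrices
```

Generating 1,000 such data sets per measure and reapplying the estimation procedure to each yields empirical sampling distributions; their standard deviations are the standard errors reported below.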

Results

Table 2 shows the results of the study. The standard deviations given are for full-length scales. Where the observed data were for half-length scales, the standard deviations were computed using the methods presented in Appendix A. The next two columns present the CE and CES values, with the following column being the estimate of the TEV. The last column indicates the percentage by which the CE (coefficient alpha) overestimates the actual reliability. Overall, estimates of the CES are smaller than the estimates of the CE (coefficient alpha), indicating that transient error exists in scores obtained with the measures used in this study. For GMA, the CE overestimated the reliability of the measure by 6.70%; for the Big Five personality measures, the extent of the overestimation varied between 0 and 14.90% across the ten measures (two measures of each of the Big Five constructs). Among the measures of the Big Five traits, Neuroticism measures contained the largest amount of transient error (TEV = .09 across the two measures), which is consistent with the affective nature of this trait (Watson, 2000). The estimated value of TEV is negative for the measures of Openness to Experience and for the PCI measure of Extraversion. Because a proportion of variance cannot be negative by definition, negative values must be attributed to sampling error, so we considered the TEV of these measures to be zero. Hunter and Schmidt (1990) provided a detailed discussion of the problem of negative variance estimates, a common phenomenon in meta-analysis due to second-order sampling error. Negative variance estimates also occur in analysis of variance. When a variance is estimated as the difference between two other variance estimates, the obtained value can be negative because of sampling error in the two estimates. For example, if the population value of the variance in question is zero, the probability that it is estimated to be negative will be about 50% (because about half of its sampling distribution is less than zero). The practice of considering negative variance estimates as due to sampling error and treating them as zero is also the rule in generalizability theory research (e.g., Cronbach et al., 1972; Shavelson, Webb, & Rowley, 1989).
Table 2
Estimates of Reliability Coefficients and Proportions of Transient Error Variances

Construct and measure      SDᵃ      CE     CES    TEV            % Overestimateᵇ
GMA
  Wonderlic                5.36     .79    .74    .05             6.7
Conscientiousness
  PCI                     10.09     .88    .78    .10            13.0
  IPIP                    18.08     .93    .89    .04             4.5
  Average                           .91    .84    .07             8.3
Extraversion
  PCI                      8.43     .81    .83    .00 (-.02)ᶜ     0.0
  IPIP                    16.90     .92    .88    .04             4.5
  Average                           .87    .86    .02             2.2
Agreeableness
  PCI                      6.93     .85    .80    .05             6.3
  IPIP                    14.82     .88    .87    .01             1.1
  Average                           .87    .84    .03             3.6
Neuroticism
  PCI                      8.14     .85    .74    .11            14.9
  IPIP                    19.69     .93    .86    .07             8.1
  Average                           .89    .80    .09            11.3
Openness
  PCI                      6.17     .81    .87    .00 (-.05)ᶜ     0.0
  IPIP                    15.16     .88    .90    .00 (-.02)ᶜ     0.0
  Average                           .85    .89    .00 (-.04)ᶜ     0.0
GSE
  Sherer's GSE            28.83     .89    .83    .05             6.0
Self-esteem
  TSBI                     8.31     .85    .80    .05             6.3
  Rosenberg                         .84    .79    .05             6.3
  Average                           .85    .80    .05             6.3
Positive affectivity
  PANAS                    4.96     .82    .63    .19            30.2
  MPI                      7.94     .91    .81    .09            11.1
  DE scale                 3.72     .86    .74    .12            16.2
  Average                           .86    .73    .13            17.8
Negative affectivity
  PANAS                    5.78     .83    .78    .05             6.4
  MPI                      9.33     .90    .80    .09            11.2
  DE scale                 4.17     .90    .69    .21            30.4
  Average                           .88    .76    .11            14.5

Note. N = 167. Blank cells indicate that data are not applicable. CE = coefficient of equivalence (alpha); CES = coefficient of equivalence and stability; TEV = transient error variance over observed score variance; GMA = general mental ability; Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Inventory Pool (Goldberg, 1997); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); MPI = Multidimensional Personality Index (Watson & Tellegen, 1985); DE = Diener and Emmons's (1984) Affect-Adjective Scale.
ᵃ Full-scale standard deviation. ᵇ Percentage by which the CES is overestimated by the CE. ᶜ The estimated ratio of transient error variance to observed score variance is negative for these measures. Because a variance proportion cannot be negative, zero is the appropriate estimate of its population value.

For the generalized self-efficacy construct, the CE of the Sherer et al. (1982) measure of generalized self-efficacy overestimated reliability by 6.00%. The overestimation was 6.30% for both measures of self-esteem.
Against our expectation, measures of GMA and broad personality traits appear to have comparable TEVs (.05 for GMA and an average of .04 across the seven broad personality traits; Table 2). Hence, the comparison of the amount of transient error in measures of GMA with the amounts of transient error present in measures of broad personality constructs does not support Schmidt and Hunter's (1996, 1999) hypothesis that transient error phenomena affect measures of ability to a lesser extent than measures of broad personality factors.⁴

The findings indicate that the proportion of the transient error component is quite large for measures of affectivity, averaging (across the three affectivity inventories) .13 for positive affectivity and .11 for negative affectivity. These results show that the CE overestimates the reliability of PA and NA measures, on average, by about 18% and 15%, respectively. Consistent with our prediction, the proportion of transient error in measures of PA is largest for the PANAS measure (.19). However, among measures of NA, the PANAS contained the smallest TEV (.05), contrary to our expectation.
Table 3 shows the standard errors of the CE, CES,
and TEV. These values were obtained from the Monte
Carlo simulation, except for those of averaged estimates, that is, the averaged CE, CES, or TEV for
constructs that have more than one measure. Standard
errors of these averaged estimates were calculated by
the usual formula for averaged statistics.5 As can be
seen therein, the standard errors for CE estimates
(ranging from .008 for IPIP Conscientiousness to .023
for PCI Openness) are consistently smaller than those
of the CES and TEV. Standard errors of the CES
(ranging from .022 to .065) in turn are slightly but
consistently larger than those of TEV (from .019 to
.062). In general, scales with larger numbers of items
have smaller standard errors. This finding is expected,
as the estimates were affected by sampling errors in
both subjects and items. Of special interest are the
standard errors of the TEV estimates. These values
are small, but because the TEV estimates are also
small, the standard errors are somewhat large relative
to the TEV estimates. Obviously, this is due to the
modest sample size of the study. Nevertheless, the
fact that the 90% confidence intervals of the majority of the TEV estimates (12 cases out of the 20 measures included in the study) do not cover the zero point indicates the existence of transient error in those measures.
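To illustrate (our arithmetic), the TEV estimate for PANAS Positive Affectivity is .19 with a standard error of .062 (Table 3); the approximate 90% confidence interval, .19 ± 1.645 × .062, runs from about .09 to .29 and thus excludes zero.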

Discussion and Conclusion


This study replicates and expands the findings presented by Becker (2000). The primary implication of
these findings is that the nearly universal use of the
CE as the reliability estimate for measures of important and widely used psychological constructs, such as
those studied in this project, leads to overestimates of
scale reliability. As shown in this article, the overestimation of reliability occurs because the CE does not
capture transient error. As Becker noted, few empirical investigations of the effect of transient errors go
beyond computer simulations. The present study contributes to the literature on measurement by calibrating empirically the effect of transient error on the
measurement of important constructs from the individual differences area and by formulating a simple
method of computing the CES, the reliability coefficient that takes into account random response, transient, and specific factor error processes.
4
This general conclusion remained essentially unchanged even after adjusting for the fact that, as might be expected, our student sample was less heterogeneous on Wonderlic scores than the general population (SDs = 5.36 and 7.50, respectively, where the 7.50 is from the Wonderlic Test Manual). We adjusted both the CE and the CES using the following formula:

    r_xxA = 1 − u_x²(1 − r_xxB),

where u_x = sample SD/population SD, r_xxB = the sample reliability (from Table 2), and r_xxA = the reliability in the general (more heterogeneous) population (Nunnally & Bernstein, 1994). This adjustment reduced the TEV estimate for the Wonderlic from .05 to .03. This is still not appreciably different from the average figure of .04 for the personality measures. It is to be expected that college students will be somewhat range restricted on GMA. As is usually the case, no such differences in variability were found for the noncognitive measures.
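As an arithmetic check (our illustration, using the Wonderlic values of CE = .79 and CES = .74 from Table 2): u_x = 5.36/7.50 = .715, so the adjusted CE is 1 − .715²(1 − .79) = .89 and the adjusted CES is 1 − .715²(1 − .74) = .87, yielding the adjusted TEV estimate of .89 − .87 ≈ .03 reported above.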
5
The following formula was used to calculate the standard errors of the averaged estimates (CE, CES, or TEV):

    SE_m = (1/k)(Σ_{i=1}^{k} SE_i²)^{1/2},

where SE_m is the standard error of the averaged estimate, SE_i is the standard error of an individual estimate, and k is the number of estimates averaged.
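For example (our arithmetic), the averaged CES for Conscientiousness combines standard errors of .042 (PCI) and .022 (IPIP): SE_m = (1/2)(.042² + .022²)^{1/2} = .024, the value reported for the averaged estimate in Table 3.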


Table 3
Standard Errors for the Reliability Coefficients and Transient Error Variance Estimates

Construct and measure     CE     SE(CE)a    CES    SE(CES)b    TEV    SE(TEV)c
GMA
  Wonderlic               .79    .020       .74    .035        .05    .029
Conscientiousness
  PCI                     .88    .013       .78    .042        .10    .039
  IPIP                    .93    .008       .89    .022        .04    .019
  Average                 .91    .008d      .84    .024d       .07    .022d
Extraversion
  PCI                     .81    .020       .83    .039        .00    .038
  IPIP                    .92    .009       .88    .025        .04    .022
  Average                 .86    .011d      .86    .023d       .02    .022d
Agreeableness
  PCI                     .85    .017       .80    .041        .05    .039
  IPIP                    .88    .014       .87    .030        .01    .027
  Average                 .87    .011d      .84    .025d       .03    .024d
Neuroticism
  PCI                     .85    .016       .74    .048        .11    .045
  IPIP                    .93    .008       .86    .027        .07    .025
  Average                 .89    .009d      .80    .028d       .09    .026d
Openness
  PCI                     .81    .023       .87    .045        .00    .045
  IPIP                    .88    .013       .90    .028        .00    .027
  Average                 .85    .013d      .89    .027d       .00    .026d
GSE
  Sherer's GSE            .89    .013       .83    .035        .05    .032
Self-Esteem
  TSBI                    .85    .015       .80    .029        .05    .023
  Rosenberg               .84    .019       .79    .044        .05    .043
  Average                 .85    .012d      .80    .026d       .05    .024d
Positive affectivity
  PANAS                   .82    .022       .63    .065        .19    .062
  MPI                     .91    .011       .81    .033        .09    .031
  DE scale                .86    .018       .74    .045        .12    .044
  Average                 .86    .010d      .73    .029d       .13    .027d
Negative affectivity
  PANAS                   .83    .021       .78    .047        .05    .046
  MPI                     .90    .011       .80    .036        .09    .033
  DE scale                .90    .012       .69    .049        .21    .046
  Average                 .88    .009d      .76    .026d       .11    .024d

Note. All the standard errors were obtained from the Monte Carlo simulation. CE = coefficient of equivalence (alpha); CES = coefficient of equivalence and stability; TEV = transient error variance over observed score variance; GMA = general mental ability; Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); MPI = Multidimensional Personality Index (Watson & Tellegen, 1985); DE = Diener and Emmons's (1984) Affect-Adjective Scale.
a Standard error of the CE estimates. b Standard error of the CES estimates. c Standard error of the TEV estimates. d Standard error of the averaged estimates.


With the exception of the personality trait of Openness to Experience, transient error affected the measurement of all constructs measured in this study.6 Results showed that the proportion of transient error in measures of GMA is comparable to the transient error proportion in measures of broad personality traits, which was inconsistent with our expectation based on Schmidt and Hunter's (1996) speculation that transient error is smaller in magnitude in the cognitive domain (as compared with the noncognitive
domain). These findings, however, are consistent with
the arguments presented by Conley (1984). It may be
the case that, in the long term, personality indeed
changes more than intelligence does, thus causing decreased levels of testretest stability over long periods
for personality constructs, whereas in the short term,
transient error affects cognitive and noncognitive
measures similarly. Future research might examine
this hypothesis.
As expected, in the affectivity domain the overestimation of scale reliability was substantial and the
largest among the construct domains sampled in this
study. For positive affectivity, we found that the CE
overestimates the reliability of the measures examined
in the study by about 18%. For the PANAS, the most
popular measure of affectivity, the overestimate is
higher, at about 30%. For negative affectivity the
amount of TEV was also substantial, leading to an
overestimation of the reliability of NA scores by
about 15%. The extent of bias resulting from using the
CE as the reliability estimate for measures of GMA
and broad personality factors is relatively smaller, but
it can be potentially consequential. Our estimate of
the proportion of TEV in the measures of self-esteem
(.05 across the two measures; Table 2) is very similar
in magnitude to the .04 estimate obtained by Becker
(2000).
As noted earlier, we tried to analyze widely used measures of the constructs of cognitive ability, personality, and affectivity in this study. Nevertheless, it is obviously impossible for us to include all the popular measures of these constructs used by researchers in the literature. To generalize the findings confidently, more studies similar to this one are needed, using both different measures of these constructs and measures of other important constructs in psychology and social science.
Just as with other statistical estimates, the estimates
of TEV obtained here are susceptible to sampling error. As discussed earlier, sampling fluctuation obviously accounts for the negative estimates of transient
error for the measures of Openness to Experience.

Results of our simulation showed that standard errors for the estimates were relatively large, as expected
from the modest sample size of the current study.
Consequently, specific figures obtained here should
be interpreted cautiously, pending future replications.
The problem of sampling error can be adequately addressed only by application of meta-analysis when
sufficient studies become available (Hunter &
Schmidt, 1990). Accordingly, we encourage further
studies examining the effects of transient error.
Recent research (e.g., Becker, 2000; Schmidt &
Hunter, 1999) suggested transient error has a potentially large impact on the measurement of psychological constructs. This potential impact should be taken
into account when estimating the reliability of measures and when correcting observed relationships for
biases introduced by measurement error (DeShon,
1998; Schmidt & Hunter, 1999). Our study provides
some initial empirical estimates of the magnitude of
the problem for measures of widely used constructs.
The finding that transient error indeed has nontrivial
effects calls for more comprehensive treatment of
measurement errors in empirical research. One obvious step toward minimizing transient error in the measures used in research is to repeatedly administer measures to the same group of subjects with appropriate
time intervals and then average (or sum) the scores
across all the testing occasions. TEV in the scores
thus obtained will be reduced in proportion to the
number of testing occasions (Feldt & Brennan, 1989).
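For example (our illustration, assuming transient effects are independent across occasions), averaging scores from three testing occasions would cut a single-occasion TEV of .12 to about .12/3 = .04.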
However, a large number of testing occasions may be
needed to virtually eradicate the effect of transient
error, and the majority of research projects probably
do not have the resources to do this. We therefore
need a more general approach to correction for the
biasing effects of measurement errors in general and
that of transient error in particular.

6
Our estimates of TEV (and those of Becker, 2000) are actually slight underestimates. Strictly speaking, coefficient alpha estimates the random CE rather than the strict parallelism CE (Cronbach, 1951; Nunnally & Bernstein, 1994). Hence coefficient alpha can be expected to be slightly smaller than the strictly parallel CE, and therefore the quantity alpha − CES slightly underestimates TEV. This difference, however, is small and is reduced somewhat by the fact that the measures used in computing the CES may have fallen somewhat short of strict parallelism.


Studies like this one, specifically designed to estimate the CES of measures of important psychological constructs, can provide the needed reliability estimates to be subsequently used to correct for measurement error in
observed correlations between measures.7 The cumulative development of such a database for the CES
estimates of measures of widely used psychological
constructs can thus enable substantive researchers to
obtain unbiased estimates of construct-level relationships among their variables of interest and to minimize the research resources needed. Rothstein (1990)
presented a precedent for this general approach: She
provided large-sample meta-analytically derived figures for interrater reliability of supervisory ratings of
overall job performance that have subsequently been
used to make corrections for measurement error in
ratings in many published studies. A similar CES database could serve this same purpose for widely used
measures of individual differences constructs.
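To illustrate the intended use of such a database (our example; the observed correlation of .30 is hypothetical), consider correcting an observed correlation between the PANAS PA and NA scales with the standard disattenuation formula, r_c = r_xy/(r_xx r_yy)^{1/2}. Using the CES values from Table 3 (.63 and .78) gives r_c = .30/(.63 × .78)^{1/2} = .43, whereas using the alpha values (.82 and .83) gives only .30/(.82 × .83)^{1/2} = .36, an undercorrection of the construct-level relationship.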
7

As is well known, reliability varies with sample heterogeneity: Reliability is higher in samples (and populations) in
which the scale standard deviation (SDx) is larger. However,
given the reliability and the SDx in one sample, a simple
formula (see Footnote 4) is available for computing the
corresponding reliability in another sample with a different
SDx (Nunnally & Bernstein, 1994).

References
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Barrick, M. R., & Mount, M. K. (1995). The Personal Characteristics Inventory manual. Unpublished manuscript, University of Iowa, Iowa City.
Barrick, M. R., Mount, M. K., & Strauss, J. P. (1994). Antecedents of involuntary turnover due to a reduction in force. Personnel Psychology, 47, 515–535.
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83, 377–391.
Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370–379.
Blascovich, J., & Tomaka, J. (1991). Measures of self-esteem. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 115–160). San Diego, CA: Academic Press.
Brief, A. P. (1998). Attitudes in and around organizations. Thousand Oaks, CA: Sage.
Buss, A. H., & Perry, M. (1992). The Aggression Questionnaire. Journal of Personality and Social Psychology, 63, 452–459.
Clause, C. S., Mullins, M. E., Nee, M. T., Pulakos, E., & Schmitt, N. (1998). Parallel test form development: A procedure for alternate predictors and an example. Personnel Psychology, 51, 193–208.
Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality, and self-opinion. Personality and Individual Differences, 5, 11–25.
Costa, P. T., & McCrae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.
Cronbach, L. J. (1943). On estimates of test reliability. Journal of Educational Psychology, 34, 485–494.
Cronbach, L. J. (1947). Test reliability: Its meaning and determination. Psychometrika, 12, 1–16.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
DeShon, R. P. (1998). A cautionary note on measurement error corrections in structural equation models. Psychological Methods, 3, 412–423.
Diener, E., & Emmons, R. A. (1984). The independence of positive and negative affect. Journal of Personality and Social Psychology, 47, 1105–1117.
Feldt, L. S. (1965). The approximate sampling distribution of Kuder–Richardson reliability coefficient twenty. Psychometrika, 30, 357–370.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4, 26–42.
Goldberg, L. R. (1997). A broad-bandwidth, public-domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, the Netherlands: Tilburg University Press.
Hansen, C. P. (1989). A causal model of the relationship among accidents, biodata, personality, and cognitive factors. Journal of Applied Psychology, 74, 81–90.
Heaven, P. C. L., & Bucci, S. (2001). Right-wing authoritarianism, social dominance orientation and personality: An analysis using the IPIP measure. European Journal of Personality, 15, 49–56.
Helmreich, R., & Stapp, J. (1974). Short forms of the Texas Social Behavior Inventory (TSBI): An objective measure of self-esteem. Bulletin of the Psychonomic Society, 4, 473–475.
Hogan, R. (1986). Hogan Personality Inventory manual. Minneapolis, MN: National Computer Systems.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Judge, T. A., & Bretz, R. D. (1993). Report on an alternative measure of affective disposition. Educational and Psychological Measurement, 53, 1095–1104.
Le, H., & Schmidt, F. L. (2001, August). The multi-faceted nature of measurement error and its implications for measurement error corrections. Paper presented at the 109th Annual Convention of the American Psychological Association, San Francisco, CA.
Lord, F. M. (1955). Sampling fluctuations resulting from the sampling of test items. Psychometrika, 20, 1–22.
Mackaman, S. L. (1982). Performance evaluation tests for environmental research: Wonderlic Personnel Test. Psychological Reports, 51, 635–644.
Mooney, C. Z. (1997). Monte Carlo simulation. Newbury Park, CA: Sage.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Otis, A. S., & Lennon, R. T. (1979). Otis–Lennon School Ability Test. New York: The Psychological Corporation.
Price, J. L. (1997). Handbook of organizational measurement. International Journal of Manpower, 18, 303–558.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322–327.
Schmidt, F. L. (1999, March 25). Measurement error and cumulative knowledge in psychological research. Invited address presented at Purdue University, West Lafayette, IN.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Schmidt, F. L., & Hunter, J. E. (1999). Theory testing and measurement error. Intelligence, 27, 183–198.
Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44, 922–932.
Sherer, M., Maddux, J. E., Mercandante, B., Prentice-Dunn, S., Jacobs, B., & Rogers, R. W. (1982). The Self-Efficacy Scale. Psychological Reports, 76, 707–710.
Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (pp. 356–442). Washington, DC: American Council on Education.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560–620). Washington, DC: American Council on Education.
Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage.
U.S. Department of Labor. (1970). Manual for the USES General Aptitude Test Battery, Section III: Development. Washington, DC: Manpower Administration, U.S. Department of Labor.
Vaidya, J. G., Gray, E. K., Haig, J., & Watson, D. (2001). On the differential stability of traits: Comparing trait affect and Big Five scales. Manuscript submitted for publication.
Watson, D. (2000). Mood and temperament. New York: Guilford Press.
Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the Positive and Negative Affect Schedule–expanded form. Iowa City: University of Iowa.
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063–1070.
Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological Bulletin, 98, 219–235.
Wechsler, D. (1981). Manual for the Wechsler Adult Intelligence Scale–Revised. New York: The Psychological Corporation.
Wonderlic Personnel Test. (1998). Wonderlic Personnel Test & Scholastic Level Exam: User's manual. Northfield, IL: Wonderlic and Associates.


Appendix A
Estimating the Standard Deviation of a Full Scale From Its Subscale Standard Deviation

We need a general method that can enable us to estimate the standard deviation of a full scale from values of subscales with any number of items. We derive such a method here.

Our derivations assume that all the values come from an unlimited sample size, that is, they are population parameters. We further adopt the item domain model (Nunnally & Bernstein, 1994), which assumes that all items of a scale are randomly sampled from a hypothetical pool of an infinite number of items.

Consider Scale A1 with k1 items. Assuming that we have the standard deviation of A1 (σ_A1) and its coefficient alpha (α1), we need to derive a formula to estimate the standard deviation of Scale A2 that has k2 items from the same item domain. Following the item domain model, items from Scales A1 and A2 are randomly sampled from the same hypothetical item domain.

The variance of A1 can be written as a function of the average variance and intercorrelation of a single item as follows:

    σ²_A1 = k1·σ̄² + k1(k1 − 1)·ρ̄·σ̄²,                     (A1)

where σ̄ is the average standard deviation of one item, and ρ̄ is the average intercorrelation among all the items included in the scale. Similarly, the variance of A2 can be obtained from the average variance and intercorrelation of a single item:

    σ²_A2 = k2·σ̄² + k2(k2 − 1)·ρ̄·σ̄².                     (A2)

Factoring Equations A1 and A2 above for k1σ̄² and k2σ̄², respectively, we obtain

    σ²_A1 = k1·σ̄²[1 + (k1 − 1)ρ̄]                          (A3)

and

    σ²_A2 = k2·σ̄²[1 + (k2 − 1)ρ̄].                         (A4)

Solving Equation A3 for σ̄², we have

    σ̄² = σ²_A1/[k1 + k1(k1 − 1)ρ̄].                        (A5)

The average intercorrelation among all items (ρ̄) can be computed from the coefficient alpha of Scale A1 (α1) by the reversed Spearman–Brown formula:

    ρ̄ = α1/[k1 − (k1 − 1)α1].                             (A6)

Substituting the value for ρ̄ in Equation A6 into Equation A5 and simplifying the result, we have

    σ̄² = σ²_A1[k1 − (k1 − 1)α1]/k1².                      (A7)

Now replacing the value of ρ̄ in Equation A6 into Equation A4 and simplifying, we obtain Equation A8:

    σ²_A2 = k2·σ̄²[k1 − α1(k1 − k2)]/[k1 − (k1 − 1)α1].    (A8)

Next replacing the value of σ̄² in Equation A7 into Equation A8 above and simplifying, we have the equation of interest:

    σ²_A2 = (k2/k1²)·σ²_A1[k1 − (k1 − k2)α1].             (A9)

Taking the square root of both sides of Equation A9 yields the formula used to compute the standard deviation of a scale having k2 items from the standard deviation and coefficient alpha of its subscale having k1 items:

    σ_A2 = (σ_A1/k1)·{k2[k1 − (k1 − k2)α1]}^{1/2}.        (A10)

As can be seen from the derivations above, this formula can be used to compute the standard deviation of one scale from the standard deviation and coefficient alpha of any other scale if the two scales sample items from the same item domain.

In practice, the values of σ_A1 and α1 are estimated and therefore subject to sampling error. Consequently, the estimate of σ_A2 is also subject to sampling error. Using the circumflex to denote that all the values are estimated, we can rewrite Equation A10 as follows:

    σ̂_A2 = (σ̂_A1/k1)·{k2[k1 − (k1 − k2)α̂1]}^{1/2}.        (A11)

Equation A11 above is the same as Equations 18a and 18b in the text.

To reduce the impact of sampling error, we should average the values of the subscales. Equation A10 was used in our study to estimate the standard deviations of the full scales of PCI Extraversion, PCI Openness, Sherer's GSE, and MPI Positive Affectivity and Negative Affectivity.

If we call p1 the ratio of the number of items in the subscale A1 to that of the full scale A2 (i.e., p1 = k1/k2), Equation A9 above can alternatively be written as follows:

    σ²_A2 = σ²_A1[p1 − (p1 − 1)α1]/p1².
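To make Equation A10 concrete, the following is a minimal Python sketch (our illustration; the function name and the example values are hypothetical and are not taken from the study):

    def full_scale_sd(sd_sub, alpha_sub, k1, k2):
        """Equation A10: estimate the SD of a k2-item scale from the SD and
        coefficient alpha of a k1-item subscale from the same item domain."""
        return sd_sub * (k2 * (k1 - (k1 - k2) * alpha_sub)) ** 0.5 / k1

    # Hypothetical example: a 10-item subscale with SD = 6.0 and alpha = .80,
    # projected to the 20-item full scale.
    print(round(full_scale_sd(6.0, 0.80, k1=10, k2=20), 2))  # prints 11.38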



Appendix B
Simulation Procedure to Estimate Standard Errors of CE, CES, and TEV

Data were simulated for each subject on each test item. A subject's score on an item of a test includes four components:

    x_pij = t_pi + s_pi + o_pj + e_pij,                    (B1)

where x_pij is the observed score of subject p on item i on occasion j, t_pi is the true score of subject p on item i, s_pi is the specific factor error of item i (the interaction between subject p and item i), o_pj is the transient error on occasion j (the interaction between subject p and occasion j), and e_pij is the random response error of subject p on item i on occasion j.

The components on the right side of Equation B1 can be determined on the basis of information (reliability coefficients) about the scale that includes item i. Specifically, let us consider a subject's score on a scale X that has k items. It is the sum of his or her scores across the k items. From Equation B1 above, we have

    X_pj = Σ_{i=1}^{k} (t_pi + s_pi + o_pj + e_pij).       (B2)

The variance (across subjects) of scale X is then

    Var(X) = Var(Σ_{i=1}^{k} t_pi) + Var(Σ_{i=1}^{k} s_pi) + Var(k·o_pj) + Var(Σ_{i=1}^{k} e_pij).   (B3)

Because s_pi and e_pij are not correlated across items (i.e., Cov(s_i, s_i′) = 0 and Cov(e_ij, e_i′j) = 0 for all i ≠ i′), Equation B3 can be expanded as

    Var(X) = k²·Var(t) + k·Var(s) + k²·Var(o) + k·Var(e),  (B4)

where Var(t) is the true score variance of an item, Var(s) is the variance of specific factor error of an item, Var(o) is the variance of transient error of an item, and Var(e) is the variance of random response error of an item.

By definition, the components on the right side of Equation B4 are

    true score variance (TV) = k²·Var(t),                  (B5)

    specific factor error variance (SEV) = k·Var(s),       (B6)

    transient error variance (TEV) = k²·Var(o),            (B7)

and

    random response error variance (REV) = k·Var(e).       (B8)

Standardizing X (i.e., letting Var(X) = 1), we have

    TV = CES,                                              (B9)

    TEV = CE − CES,                                        (B10)

and

    SEV + REV = 1 − CE,                                    (B11)

where CES is the coefficient of equivalence and stability of scale X and CE is the coefficient of equivalence of scale X. From Equation B5 to Equation B11, the variance components for an item can be determined from the reliability coefficients (CE and CES) of the scale in which it is included:

    Var(t) = CES/k²,                                       (B12)

    Var(o) = (CE − CES)/k²,                                (B13)

    Var(s) = SEV/k,                                        (B14)

and

    Var(e) = (1 − CE − SEV)/k.                             (B15)

The value of SEV in Equation B14 above can be any positive number that is smaller than 1 − CE (we tested different values of SEV within that range and obtained virtually the same results). The reliability coefficients (CE and CES) and the number of scale items (k) can be obtained for each measure included in the study (Table 2).

With the values of Var(t), Var(o), Var(s), and Var(e) estimated from Equations B12 to B15 above, we can simulate a score for each subject on each item on each occasion. Specifically, the following equation was used:

    x_pij = z1[Var(t)]^{1/2} + z2[Var(s)]^{1/2} + z3[Var(o)]^{1/2} + z4[Var(e)]^{1/2},   (B16)

where z1, z2, z3, and z4 were randomly drawn from four independent standard normal distributions. In total, scores for p subjects (p = 167), on k items (k = the number of items of the scale examined), on two occasions were simulated. For a given subject, the same values of z1 and z3 were used across items, and the same value of z2 (for each item) was used across occasions.

Two parallel half scales were then created based on the simulated data. Relevant statistics (coefficient alphas and the correlation between the half scales) were then computed. The program used these data to calculate CE, CES, and TEV for each scale included in the study, following the method described in the text. One thousand data sets were simulated, and the standard deviations of the distributions of the estimated CE, CES, and TEV were recorded. These are the standard errors of interest.

A computer program (in SAS 8.01) was written to implement the simulation procedures described above. The program is available from Frank L. Schmidt upon request.
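For readers without access to the SAS program, the following is a minimal Python sketch of the simulation (our reconstruction under the model of Equation B1; the parameter defaults are hypothetical, and the split-half covariance estimator of the CES used here is our own simplification, which may differ in detail from the method described in the text):

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_scale(n=167, k=20, ce=0.85, ces=0.80, sev=0.05):
        """Simulate item scores on two occasions from Equation B16, with
        variance components from Equations B12-B15 (Var(X) = 1)."""
        var_t = ces / k**2          # Equation B12: true score
        var_o = (ce - ces) / k**2   # Equation B13: transient error
        var_s = sev / k             # Equation B14: specific factor error
        var_e = (1 - ce - sev) / k  # Equation B15: random response error
        z1 = rng.standard_normal((n, 1, 1))  # constant over items and occasions
        z2 = rng.standard_normal((n, k, 1))  # constant over occasions
        z3 = rng.standard_normal((n, 1, 2))  # constant over items
        z4 = rng.standard_normal((n, k, 2))  # varies freely
        return (z1 * var_t**0.5 + z2 * var_s**0.5
                + z3 * var_o**0.5 + z4 * var_e**0.5)  # shape (n, k, 2)

    def alpha(items):
        """Coefficient alpha (the CE) for an (n_subjects, k_items) matrix."""
        k = items.shape[1]
        return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                              / items.sum(axis=1).var(ddof=1))

    def estimate(x):
        """Return (CE, CES, TEV). The CES is taken as four times the
        cross-occasion covariance between parallel half scales divided by
        the full-scale variance; only the true score component is shared
        across both halves and occasions, so the ratio is consistent
        under Equation B1."""
        k = x.shape[1]
        h1 = x[:, :k // 2, :].sum(axis=1)  # half scale A on both occasions
        h2 = x[:, k // 2:, :].sum(axis=1)  # half scale B on both occasions
        ce = (alpha(x[:, :, 0]) + alpha(x[:, :, 1])) / 2
        cov = (np.cov(h1[:, 0], h2[:, 1])[0, 1]
               + np.cov(h2[:, 0], h1[:, 1])[0, 1]) / 2
        ces = 4 * cov / (h1 + h2).var(axis=0, ddof=1).mean()
        return ce, ces, ce - ces

    # One thousand replications; the standard deviations of the estimates
    # across replications are the standard errors of interest.
    reps = np.array([estimate(simulate_scale()) for _ in range(1000)])
    se_ce, se_ces, se_tev = reps.std(axis=0, ddof=1)
    print(f"SE(CE) = {se_ce:.3f}, SE(CES) = {se_ces:.3f}, SE(TEV) = {se_tev:.3f}")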

Received September 5, 2001


Revision received October 7, 2002
Accepted January 13, 2003
