1 Introduction
The general issues concerning reliability and agreement in psychiatric measurement
have been discussed extensively in psychiatry journals1-7 and in statistical and
psychometric writing.8-14 Many exceptionally clear introductions to the topic have
been written for scientists and statisticians,15,16 and the inclusion of reliability
statistics in cutting-edge research reports is now common.
The reliability literature has documented the adverse effects of extraneous random
variation on measures of symptom severity and extent, measures of functioning,
formulation of psychiatric diagnoses, and the assessment of risk status. Extraneous
random variation may be introduced by many processes, including ambiguous items
and diagnostic criteria, improperly trained raters, memory lapses by informants, and
lack of insight by patients. However, extraneous measurement variation is not always
due to mistakes or errors. Variation may also be introduced when legitimate
perspectives are sampled rather than systematically assessed. For example, mothers
and fathers may provide different, but worthwhile, information about their child.
Randomly chosen raters may have different but legitimate theories about how to
interpret common observations.
Whatever its cause, measurement variation means that one assessment differs from
an independent replicate assessment. If the amount of measurement variation is
extreme, the patient may receive one diagnosis rather than another, the wrong persons
may be assigned to treatment and control groups, and the statistical analysis of validity
studies may lead to biased results.
Recognition of the need for reliable measurement and diagnosis had a substantial
impact on the development of the third and fourth editions of the Diagnostic and
Statistical Manual in the United States17,18 and the International Classification of
Diseases outside the USA.19 Diagnostic distinctions that could be made reliably were
included in these manuals, while distinctions that were hard to replicate were left out.
It became widely recognized that a measure that cannot be reproduced is one that is
unlikely to be valid. Perhaps more than in other areas of medicine, psychiatry has
been profoundly affected by reliability theory and methods.
Despite all that has been written and all the progress that has been made using
reliability statistics in psychiatry, there remain important controversies about
fundamental issues such as how measurement precision should be characterized,
and how reliability studies in psychiatry should be designed. Some of these standing
controversies are covered below, and recent contributions to the statistical treatment
of reliability and agreement are reviewed. These statistical/psychometric discussions
are placed in the context of psychiatric research and clinical practice. Below is a brief
review of formal reliability models.
2 Reliability models
2.1 Classical theory
Classical reliability theory was originally developed for measurement in education and
cognitive psychology. Lord and Novick14 and Cronbach et al.10 provide the most
thorough developments of the statistical basis of classical theory.
Suppose we have a population of measures or measurement devices, and we sample
measure j to assess a fixed person i. Call that measurement X_ij. In education, the
measure j might be a particular item or achievement subtest, and, in psychiatry,
the measure might be an expert rater or a family informant. Following the notation
and terminology of Kraemer,16 we write X_ij as the sum of a person parameter, ξ_i, and a
residual term that carries the unique effect of measure j:

$$X_{ij} = \xi_i + \epsilon_{ij}$$

The fixed person parameter, ξ_i, is defined as the expected value of X over the
population of measures. Kraemer calls it the consensus score, while Lord and Novick14
call it the true score, and Cronbach et al.10 call it the universe score. Because ξ_i is
E(X_ij) for the fixed person, E(ε_ij) = 0. For the fixed person, we will write
Var(X_ij) = Var(ε_ij) = σ_i²(ε), which is known as the squared standard error of
measurement for person i. Clearly, the smaller σ_i²(ε) is, the more precise the
measurement.
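The model is easy to make concrete in a few lines of code. The following is a minimal simulation sketch of this decomposition; the person-score scale (mean 50, SD 10), the error SD of 5, and the use of four replicate measures are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n_persons, n_measures = 500, 4
xi = rng.normal(loc=50.0, scale=10.0, size=n_persons)                # consensus scores
eps = rng.normal(loc=0.0, scale=5.0, size=(n_persons, n_measures))   # measurement error
X = xi[:, None] + eps                                                # X_ij = xi_i + eps_ij

# The mean over replicate measures recovers the consensus score,
# because E(eps_ij) = 0 for each fixed person.
print(np.corrcoef(xi, X.mean(axis=1))[0, 1])   # approaches 1 as n_measures grows
```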
Whether σ²(ε) is interpreted to be large or small usually depends on its size relative
to the variance of the measured characteristic over persons. Suppose that X now
represents a measurement of a randomly selected patient. Under this sampling plan,
we consider the consensus score, ξ_i, to be a random effect with mean μ and
variance σ²(ξ). It is customary in classical reliability theory to assume that the
variance of the error component, σ²(ε), is the same for different subjects. This
assumption is one that is rejected by so-called modern test theory, or item response
theory.14
2.2 The reliability coefficient
The total variance of X over randomly sampled patients is σ²(X) = σ²(ξ) + σ²(ε), and
the reliability coefficient is the ratio

$$R_X = \frac{\sigma^2(\xi)}{\sigma^2(\xi) + \sigma^2(\epsilon)} \qquad (2.1)$$
The reliability coefficient, Rx, is readily interpreted as the proportion of σ²(X) that is due
to replicable differences in patients. It turns out to be a useful quantity in statistical
analyses as well. For example, Snedecor and Cochran20 show that this is the quantity
that describes the bias of the least squares estimate of the bivariate regression on a fallible
independent variable (see also Fuller21). Lord and Novick14 show how product
moment correlations are attenuated by the reliabilities of the two variables being
analysed, and work on bias in maximum likelihood logistic regression has also shown
bias to be indexed by Rx. Kraemer12 describes how the power of hypothesis tests is
affected by changes in Rx for the dependent measures.
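The attenuation result of Lord and Novick14 mentioned above is easy to verify numerically. The sketch below is a hedged illustration: the true correlation of 0.5 and the reliabilities 0.6 and 0.8 are invented values, and the error variances are chosen so that Var(ξ)/Var(X) equals the stated reliability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
rho = 0.5                      # true correlation between consensus scores
R_x, R_y = 0.6, 0.8            # reliabilities of the two observed measures

# Correlated consensus scores with unit variance.
xi = rng.normal(size=n)
eta = rho * xi + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Error variance (1 - R)/R makes Var(true)/Var(observed) equal the reliability.
X = xi + rng.normal(scale=np.sqrt((1 - R_x) / R_x), size=n)
Y = eta + rng.normal(scale=np.sqrt((1 - R_y) / R_y), size=n)

print(np.corrcoef(X, Y)[0, 1])    # observed, attenuated correlation
print(rho * np.sqrt(R_x * R_y))   # classical attenuation formula
```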
2.3 Estimation of Rx
Rx is estimated using data on actual replicate measures of Xij. While, in theory,
measurements are selected from a large population of potential measures, in
practice they are often selected from a finite set of raters, items, or test forms. Many of
the statistical issues in reliability theory are concerned with models for representing
the kinds of replicate measurements that are available in empirical studies.
The most common way to estimate Rx is to use some form of intraclass correlation.
Several forms have been described for quantitative ratings.23-25 These vary according
to whether the reliability design has raters crossed with patients or whether it has
unique raters nested within patients. The forms also differ according to whether the
raters are considered to be random or fixed effects. For binary or categorical ratings,
the family of kappa coefficients provides the most common estimates.9,12,26 Dunn15
provides an extensive discussion of many approaches to the estimation of various
forms of Rx.
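As one concrete member of the intraclass correlation family, the sketch below implements the ICC(1,1) form of Shrout and Fleiss,25 the one-way random-effects estimator appropriate when unique raters are nested within patients. The simulated variances (σ²(ξ) = 4, σ²(ε) = 1, so the true Rx is 0.8) are illustrative assumptions.

```python
import numpy as np

def icc_oneway(X):
    """ICC(1,1) of Shrout and Fleiss: unique raters nested within patients,
    estimated from one-way random-effects ANOVA mean squares."""
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)
    bms = k * np.sum((row_means - grand) ** 2) / (n - 1)             # between-patient MS
    wms = np.sum((X - row_means[:, None]) ** 2) / (n * (k - 1))      # within-patient MS
    return (bms - wms) / (bms + (k - 1) * wms)

rng = np.random.default_rng(2)
xi = rng.normal(0, 2, size=(200, 1))        # sigma^2(xi) = 4
X = xi + rng.normal(0, 1, size=(200, 3))    # sigma^2(eps) = 1; true Rx = 0.8
print(icc_oneway(X))                        # estimate should be near 0.8
```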
2.4 Composite measures
The reliability of any measurement procedure that has Rx > 0 can be improved by
averaging replicate measurements, when replicate measurements are available.
Suppose W is the average of k independent X measurements. Spearman27 and
Brown28 described the reliability of W, Rw, in terms of the reliability of the measures
that are averaged, Rx. The so-called Spearman-Brown formula is
kRx
~~~~~~~~~~~(2.2)
()1 +Rw(k)
(k-lI)Rx(2)
This expression is derived assuming that the replicate measures have the same
expected value, and the same error variance. Replications that meet these criteria are
called parallel measures. Composites of measures that are not strictly parallel will also
be more reliable than the components, but the reliability, Rw, depends on the pattern
of variances and covariances of the measures in that case.
Improving the reliability of a measurement by summing or averaging replicate
measures is one of the most important strategies researchers have for dealing with
measurement error. Classical reliability theory gives the researcher the tools needed to
plan how many replicate measures are needed to obtain adequate reliability (see, for
example, Shrout and Fleiss25).
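Equation (2.2) also answers the reverse planning question: how many replicates are needed to reach a target reliability? A short sketch; the starting reliability of 0.5 and the target of 0.9 are illustrative values.

```python
import math

def spearman_brown(r_x, k):
    """Reliability of the mean of k parallel measures (equation 2.2)."""
    return k * r_x / (1 + (k - 1) * r_x)

def replicates_needed(r_x, target):
    """Smallest k for which the averaged measure reaches the target
    reliability, obtained by inverting equation 2.2."""
    return math.ceil(target * (1 - r_x) / (r_x * (1 - target)))

print(spearman_brown(0.5, 3))        # 0.75
print(replicates_needed(0.5, 0.9))   # 9 replicate measures needed
```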
2.5 Extensions of classical theory
Kraemer12,16 showed that many of the results of classical reliability theory apply to
X variables that are binary (e.g. diagnosis present versus absent). She showed that
certain forms of the kappa family of statistics, first proposed by Cohen9 as ad hoc
agreement indices, are actually equivalent to intraclass correlation estimates of Rx.
The population model defined by Kraemer represents the consensus scores, ξ_i, as
continuous probability values in the interval (0, 1). Although the consensus scores can
be linked conceptually to some unknown latent categories (e.g. disease truly present,
disease truly absent), these latent categories are neither necessary nor helpful in
deriving the reliability theory results.
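Kraemer's equivalence can be checked by simulation. The sketch below is one hedged rendering of her population model, with consensus probabilities drawn from a Beta(2, 2) distribution (an arbitrary choice, for which the true intraclass correlation works out to 0.2); Cohen's kappa and a one-way intraclass correlation computed on the same binary ratings should nearly coincide in large samples.

```python
import numpy as np

rng = np.random.default_rng(3)

# Each subject has a consensus probability in (0, 1); two interchangeable
# raters make independent Bernoulli judgements.
p = rng.beta(2, 2, size=5000)
r1 = rng.binomial(1, p)
r2 = rng.binomial(1, p)

def cohen_kappa(a, b):
    po = np.mean(a == b)
    pe = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))
    return (po - pe) / (1 - pe)

def icc_oneway(X):
    n, k = X.shape
    bms = k * np.sum((X.mean(axis=1) - X.mean()) ** 2) / (n - 1)
    wms = np.sum((X - X.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

X = np.column_stack([r1, r2]).astype(float)
print(cohen_kappa(r1, r2))   # kappa and the intraclass correlation agree,
print(icc_oneway(X))         # both near 0.2 for this Beta(2, 2) model
```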
Cronbach et al.10 extended classical reliability theory by developing a framework for
systematically exploring determinants of error variation and of consensus score
variation. For example, suppose multiple raters are sampled from a population of
experts that includes psychiatrists, psychologists and psychiatric social workers.
Suppose also that some of the experts interview the patients in their offices and others
conduct the interview in a hospital ward. Does the selection of type of expert and place
of the interview have an effect on the ratings? The analysis of such measurement facets
is what Cronbach and his colleagues call generalizability analyses. Like simple
reliability analyses, generalizability theory makes use of variance components
methods and often reports summaries of results as variance ratios that resemble Rx.
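A minimal generalizability-style analysis can be carried out with ordinary ANOVA mean squares. The sketch below estimates patient, rater, and residual variance components for a fully crossed patient-by-rater design with one observation per cell; the component values (4, 1, and 1) are invented for illustration.

```python
import numpy as np

def variance_components_twoway(X):
    """Patient x rater crossed design, one observation per cell: ANOVA
    estimators of the patient, rater, and residual variance components."""
    n, k = X.shape
    grand = X.mean()
    ms_p = k * np.sum((X.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_r = n * np.sum((X.mean(axis=0) - grand) ** 2) / (k - 1)
    resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return {"patient": (ms_p - ms_e) / k,    # E(MS_p) = s2_e + k * s2_p
            "rater": (ms_r - ms_e) / n,      # E(MS_r) = s2_e + n * s2_r
            "residual": ms_e}

rng = np.random.default_rng(4)
n, k = 300, 5
X = (rng.normal(0, 2, (n, 1))     # patient effects, variance 4
     + rng.normal(0, 1, (1, k))   # rater severity effects, variance 1
     + rng.normal(0, 1, (n, k)))  # residual error, variance 1
print(variance_components_twoway(X))
```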
3 Continuing controversies
3.1 Representing degree of measurement precision
A recurring issue among those concerned with measurement precision is whether the
reliability coefficient is an optimal or even appropriate index of measurement
precision. Many of the arguments against Rx centre on its dependence on σ²(ξ), the
variance of the consensus scores. As I mentioned above, the reliability coefficient
essentially calibrates the magnitude of the error variance, σ²(ε), against the
between-subject variance of the ξ_i, σ²(ξ):
$$R_X = \frac{\sigma^2(\xi)}{\sigma^2(\xi) + \sigma^2(\epsilon)} = \left[1 + \frac{\sigma^2(\epsilon)}{\sigma^2(\xi)}\right]^{-1}$$
Although the error variation may be minuscule, the reliability coefficient will
approach zero as the between-subject consensus-score variation approaches zero.
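A small numeric illustration of this point; the fixed error variance of 0.1 is arbitrary.

```python
# The standard error of measurement is fixed and small, yet Rx collapses
# as the between-subject consensus variance shrinks.
var_eps = 0.1
for var_xi in (10.0, 1.0, 0.1, 0.01):
    print(f"var(xi) = {var_xi:5.2f}  ->  Rx = {var_xi / (var_xi + var_eps):.3f}")
```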
From the IRT perspective, summary measures of reliability should be replaced with
information functions for fixed items or measures. One set of measures may be
informative about distinctions between persons with severe psychopathology, while
another set of measures may be useful in distinguishing subclinical disorder from
persons who are free of symptoms. The impact of different levels of measurement
precision is more or less critical depending on the distribution of θ in any given study
population.
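As a concrete example of an information function, the sketch below uses the standard two-parameter logistic IRT item, for which I(θ) = a²P(θ)[1 − P(θ)]; the discrimination and difficulty values are invented for illustration.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item:
    I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)
# A 'severity' item (difficulty b = 2) is most informative for severe cases;
# a 'screening' item (b = -1) discriminates near the subclinical range.
print(item_information(theta, a=1.5, b=2.0))
print(item_information(theta, a=1.5, b=-1.0))
```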
The controversy between advocates of classical reliability and those who reject
summary measures of chance-corrected agreement might be sustained by the different
assumptions made by each side. The former may assume an interest in population-based
studies of psychiatric phenomena, while the latter may assume an interest in
specific subjects.
Advocates of classical reliability estimates can point to the utility of Rx in assessing
the impact of measurement error on substantive analyses.21,38 They can argue that the
variance of the consensus score, σ²(ξ), is important because it affects the observed
variance of X in study samples.6 They can show the empirical benefit of improving
reliability in terms of statistical power12 for population-based studies. They also can
argue that attention to classical reliability before substantive studies are undertaken is
a practical and realistic way to improve the quality of scientific research.16 Finally,
they can show that alternative summary measures of measurement precision are either
flawed6,39 or reducible to forms of the classical reliability coefficient.6,39-42
Advocates for an emphasis on the standard error of measurement without regard to
the distribution of either the consensus score, ξ, or some other individual difference
quantity such as IRT's θ, are on firm ground when the emphasis is on single subjects,
or on conditional inference.37 Unless the specific subjects are being compared to
others in the study population, the variance of the study population may be irrelevant.
As discussed by Mellenbergh,37 focus on measurement information need not preclude
a later consideration of reliability, if the form of the distribution of individual
differences is known.
Unfortunately, some of the participants in the controversy regarding the usefulness
of reliability estimates have not focused on the carefully considered positions
reviewed above. It can only be hoped that the literature will move away from
complaints about the intuitiveness, or lack thereof, of statistical results, the difficulty
of demonstrating reliability in population-based studies with limited variation, and
the sad fact that reliability has to be reassessed in each new study population.
Kraemer16 warned against 'sugar-coating' the discussion of measurement precision,
but unfortunately there are still instances of this practice.
3.2 How involved should reliability studies be?
Another area of some contention in the reliability literature is how much time and
effort should be allocated to the study of the magnitude and sources of measurement
variation. Many of the models and analyses proposed by statisticians and
psychometricians require large samples with multiple replicate measures. Methodological advances can be readily applied in educational research, where thousands, if
not hundreds of thousands, of students submit to standardized testing. However,
psychiatric clinical researchers often have limited populations of patients and
methodological literature,12,49 recent papers have attempted to make this cost even
clearer with regard to the estimation of reliability.29 As Donner and Eliasziw pointed
out, the situation for inferences about reliability is made difficult by the fact that
rather subtle differences in precision, say between reliability of 0.4 and 0.8, may be of
clinical and research interest.29 When such distinctions are investigated with binary or
categorical ratings, provisions for much larger samples must be made.
The message that is beginning to emerge from the consideration of sampling
distributions of even the simpler reliability statistics is that sample sizes for
measurement studies need to be larger than has been common practice. Psychiatric researchers
need help in changing their expectations about how much effort is needed to study
reliability. In the context of these new expectations, some of the multivariate methods
that are being developed to study measurement quality may be newly appraised as
feasible and desirable.
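The point can be made with a simple Monte Carlo experiment rather than the analytic results of Donner and Eliasziw.29 The sketch below simulates the sampling distribution of a two-rater intraclass kappa under an assumed beta-binomial consensus model, with the beta parameters solved to give a chosen true reliability; the slow narrowing of the intervals as n grows illustrates why measurement studies need larger samples than is common.

```python
import numpy as np

rng = np.random.default_rng(5)

def sim_kappa_interval(n_subjects, true_icc, reps=2000):
    """Monte Carlo 95% interval of a two-rater intraclass kappa for binary
    ratings under a beta-binomial consensus model."""
    # A symmetric Beta(a, a) prior gives ICC = 1/(2a + 1), so solve for a.
    a = 0.5 * (1.0 / true_icc - 1.0)
    stats = []
    for _ in range(reps):
        p = rng.beta(a, a, size=n_subjects)
        r1, r2 = rng.binomial(1, p), rng.binomial(1, p)
        po = np.mean(r1 == r2)
        pbar = 0.5 * (np.mean(r1) + np.mean(r2))    # pooled marginal
        pe = pbar ** 2 + (1 - pbar) ** 2
        stats.append((po - pe) / (1 - pe))          # intraclass kappa
    return np.percentile(stats, [2.5, 97.5])

for n in (25, 100, 400):
    print(n, sim_kappa_interval(n, true_icc=0.6))   # intervals shrink slowly with n
```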
3.3 Standards for reliability results: what is bad?
Contributions to reliability theory that have appeared since the paper by Kraemer16
are discussed below. It is clear that the majority of the contributions relevant to
psychiatry adopt the perspective that classical reliability coefficients are useful and
deserving of further refinement.
4.1 Defining and estimating reliability
item loading was also small. He concluded that alpha provides useful estimates when
the number of items is four or more, and when the items are known to be strongly
related to a common construct.
Raykov did not, however, consider the magnitude of alpha's bias when the items in a
composite were measures of different constructs. For example, it is common in
psychiatric research to study psychosocial risk factors such as stressful life events. A
quantity of interest may be the number of stressful events from a fixed list that
occurred in the past week. Events might reflect different independent processes such
as work stress and family stress, but the sum of the stressors is still of interest as a risk
factor. For applications such as this, Cronbach's alpha is likely to produce a very
biased estimate of the reliability of the stress composite, even if subjects are capable of
answering the items reliably. Although alpha is not very useful in this context, it is
often misinterpreted as evidence that measurement quality is poor.
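The size of the problem is easy to demonstrate by simulation. In the hedged sketch below, four individually reliable items measure two independent 'stress' processes; the factor and error variances are invented so that the true composite reliability is about 0.91 while alpha is about 0.61.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

# Two independent stress processes; two reliable items measure each.
work = rng.normal(size=n)
family = rng.normal(size=n)
err = rng.normal(scale=np.sqrt(0.2), size=(n, 4))
items = np.column_stack([work, work, family, family]) + err
total = items.sum(axis=1)

k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))
true_rel = np.var(2 * work + 2 * family, ddof=1) / total.var(ddof=1)
print(round(alpha, 3), round(true_rel, 3))   # alpha ~ 0.61 versus ~ 0.91
```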
Li et al.6 examined the reliability of composites under very general conditions,
including the case in which items are related to different latent variables. Their
analysis assumed that the reliabilities of the items combined in the composite are
known individually. They showed that regularities implied by the Spearman-Brown
formula (equation 2.2) do not necessarily hold if items are congeneric or are related to
more than one factor. By designing item weights as a function of item reliabilities and
item variances, they developed a weighted composite that has maximum reliability
under general conditions. Their results might be applicable to psychiatric research
settings where a fixed set of expert raters is available but the raters differ in
their average reliability.
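One generic way to realize a maximally reliable weighted composite, under the assumption (as in Li et al.6) that item or rater error variances are known, is to maximize the variance ratio w'(Σ − Θ)w / w'Σw as a generalized eigenvalue problem. The sketch below is that generic construction, not a reproduction of Li et al.'s derivation; the covariance structure is invented.

```python
import numpy as np
from scipy.linalg import eigh

def max_reliability_weights(sigma, error_var):
    """Weights maximizing w'(Sigma - Theta)w / w'Sigma w, the reliability of
    the weighted composite, with Theta = diag(error_var). Solved as a
    generalized eigenproblem; the top eigenvalue is the achieved reliability."""
    theta = np.diag(error_var)
    vals, vecs = eigh(sigma - theta, sigma)   # eigenvalues in ascending order
    return vals[-1], vecs[:, -1]

# Three raters share a common target but have unequal error variances.
sigma_true = np.full((3, 3), 1.0)           # shared consensus variance
error_var = np.array([0.25, 1.0, 4.0])      # rater-specific error variances
sigma = sigma_true + np.diag(error_var)

rel, w = max_reliability_weights(sigma, error_var)
print(rel)           # reliability of the optimally weighted composite
print(w / w.sum())   # down-weights the noisiest rater
```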
Fagot57,58 considered ways to generalize the assessment of reliability in a very
different way. He addressed the possibility that different raters may use numbers in
different ways, leading to apparent discrepancies that are simply due to rating scale
use. For example, raters may make magnitude estimates with different comparison
references. If one wishes to estimate the reliability of ratings made on different scales,
one needs to transform one or both sets of ratings to a uniform metric. Fagot described
general classes of admissible transformations and discussed the implications of
ignoring the issue of scale.
When raters are making judgements with regard to multiple outcomes, it might be
useful to summarize the overall degree of agreement using multivariate intraclass
correlation estimates developed by Konishi et al.59 for population genetics. However,
Aickin and Ritenbaugh38 argued that the impact of multivariate unreliability on linear
model estimation needs to be studied using reliability arrays that characterize the
impact of error on both the variances and covariances of the explanatory variables.
Kraemer16 observed that different chance-corrected agreement indices and
reliability statistics may be estimating the same quantity when the raters can be
considered to be randomly selected. Since that statement, Blackman and Koval41
showed that four different measures of agreement in 2 x 2 tables, including Cohen's
kappa9 and Mak's rho,60 were estimating the same intraclass correlation as the
reliability coefficient. Similar conclusions were reached by Bodian,61 who developed
alternative intraclass correlation approaches for binary ratings under the assumptions
that raters were random and nested within subject, and that raters were either fixed or
random effects in a rater-by-subject crossed design.
4.3 Inferences regarding reliability comparisons
Although the methods developed by Donner and others for comparing different
reliability estimates are likely to be useful in many cases, the interpretation of these
comparisons is complicated by the fact that both consensus score variation and error
score variation influence reliability. If two reliability statistics differ, one must ask
whether the smaller value is due to the homogeneity of its population or due to an
increase in rater error.
This problem was addressed by Cronbach and his colleagues when they introduced
generalizability studies.10 As described by Dunn and others,15,74 generalizability
theory attempts to determine which facets of the measurement process are most
important in a given application. Analysis of variance and variance component
estimation provides the basic methodology for generalizability studies of quantitative
measures. Of special interest is the magnitude of target and error variation that is
attributable to subject groups and assessment techniques. Although the original
formulation of generalizability theory assumed that the error variance was constant
across subjects and subpopulations, modern variance component methods make this
assumption unnecessary.
Consistent with generalizability theory's focus on variance components,
Bartko40 recently proposed a unified approach to the representation and analysis of
variance components from two groups. His approach uses graphical tools and is
particularly easy for substantive researchers to apply.
For categorical ratings, log linear analyses can be used to describe the effect of
variation in overall base rates and within strata association on reliability. Graham75
adapted the methods of Tanner and Young76 to look at the effects of covariates on
agreement. Barlow22 proposed a model based on conditional logistic regression for
5 Concluding comments
Statistical advances in reliability theory continue to enhance the prospects for
improved measurement in psychiatry. The benefits of these advances are delayed
somewhat by controversies about the utility of reliability coefficients versus
alternatives, about the absolute level of needed reliability, and about the need to
dedicate substantial resources to measurement.
As Kraemer noted,16 some of the misunderstandings have their roots in the phase of
psychiatric measurement when agreement indices were ad hoc, and advice on
reliability studies was based on conventions rather than on statistical theory. The
recent work on the equivalence of various forms of reliability statistics, the
determination of sample sizes needed for informed estimation and inference, and
development of new tools for comparing levels of precision and agreement across
samples and measurement procedures should help move measurement practice
beyond the earlier phase of controversy.
These new results complement an older but rich tradition in psychometric theory
that was not fully implemented in psychiatry because of the resources it requires. In
particular, the tools of Cronbach's generalizability theory have been underused,
because generalizability studies require large samples and expensive designs. Recent
insights regarding the size of samples needed to study even simple categorical
agreement may put the expense of generalizability studies in a new light. Latent trait
models and other multivariate methods may also be more useful in psychiatry if larger
samples of clinically evaluated subjects become available.
Although much of this review was focused on estimation of the reliability
coefficient, Rx, it is important to recognize the uses and limitations of this statistic.
As a variance ratio, it is useful in any research application in which between-subject
variation is important. Clinical trials, cross-sectional surveys, and panel studies are
examples where we need to know about the population variances, and where we should
know about the reliability of key measurements.
It is less important to know about Rx when a single patient is being tracked,
although the component of Rx reflecting the standard error of measurement is critical
to know. It is also not obvious that we routinely need to know how Rx varies in
subgroups of the study sample, such as women who come from a certain ethnic group.
If a subgroup coincides with a risk (or protective) factor, the conditional variance of X
will be restricted and this will make Rx appear small, even if the standard error of
measurement is the same in the group. The arguments for knowing Rx are based on
examination of the marginal distribution of X in the sample rather than on the
conditional variances. One would only need to know about Rx in the subgroup if it was
to be analysed separately.
For example, in epidemiological work it is often useful to include a brief mental
status screen to determine if a respondent is cognitively unimpaired. Typical questions
include asking what month it is, or who is president or prime minister at the time.
These screens are quite reliable in random samples of the general population, but they
would appear to be unreliable in a university sample. In the general population, they
have been shown to correlate with behaviours and family reports, but in university
samples they would correlate with nothing. This lack of correlation is due to a bad
choice of a construct, not to measurement error.
References
1 Bartko JJ, Carpenter WT Jr. On the methods
and theory of reliability. Journal of Nervous and
Mental Disease 1976; 163: 307-17.
2 Carey G, Gottesman II. Reliability and
validity in binary ratings: areas of common
misunderstanding in diagnosis and symptom
ratings. Archives of General Psychiatry 1978; 35:
1454-59.
3 Grove WM, Andreasen NC, McDonald-Scott
P, Keller MB, Shapiro R. Reliability studies of
psychiatric diagnosis: theory and practice.
Archives of General Psychiatry 1981; 38: 408-13.
24 McGraw KO, Wong SP. Forming inferences
about some intraclass correlation coefficients.
Psychological Methods 1996; 1: 30-46.
25 Shrout PE, Fleiss JL. Intraclass correlations:
uses in assessing rater reliability. Psychological
Bulletin 1979; 86: 420-28.
26 Kraemer HC, Bloch DA. Kappa coefficients in
epidemiology: an appraisal of a reappraisal.
Journal of Clinical Epidemiology 1988; 41:
959-68.
27 Spearman C. Correlation calculated from
faulty data. British Journal of Psychology 1910;
3: 271-95.
28 Brown W. Some experimental results in the
correlation of mental abilities. British Journal
of Psychology 1910; 3: 296-322.
29 Donner A, Eliasziw M. Statistical implications
of the choice between a dichotomous or
continuous trait in studies of interobserver
agreement. Biometrics 1994; 50: 550-55.
30 Bland JM, Altman DG. A note on the use of
the intraclass correlation coefficient in the
evaluation of agreement between two methods
of measurement. Computers in Biology and
Medicine 1990; 20: 337-40.
31 Guggenmoos-Holzmann I. How reliable are
chance-corrected measures of agreement?
Statistics in Medicine 1993; 12: 2191-205.
32 Spitznagel EL, Helzer JE. A proposed solution
to the base rate problem in the kappa statistic.
Archives of General Psychiatry 1985; 42: 725-28.
33 Lindell MK, Brandt CJ. Measuring interrater
agreement for ratings of a single target.
Applied Psychological Measurement 1997; 21:
271-78.
34 James LR, Demaree RG, Wolf G. Estimating
within-group interrater reliability with and
without response bias. Journal of Applied
Psychology 1984; 69: 85-98.
35 James LR, Demaree RG, Wolf G. rwg: an
assessment of within-group rater agreement.
Journal of Applied Psychology 1993; 78: 306-309.
36 Kozlowski SWJ, Hattrup K. A disagreement
about within-group agreement: disentangling
issues of consistency versus consensus. Journal
of Applied Psychology 1992; 77: 161-67.
37 Mellenbergh GJ. Measurement precision in
test score and item response models.
Psychological Methods 1996; 1: 293-99.
38 Aickin M, Ritenbaugh C. Analysis of
multivariate reliability structures and the
induced bias in linear model estimation.
Statistics in Medicine 1996; 15: 1647-61.
39 Langenbucher J, Labouvie E, Morgenstern J.
Measuring diagnostic agreement. Journal of
Consulting and Clinical Psychology 1996; 64: 1285-89.