
Statistical Methods in Medical Research 1998; 7: 301-317

Measurement reliability and agreement in psychiatry

Patrick E Shrout, Department of Psychology, New York University, New York, USA

Address for correspondence: PE Shrout, Department of Psychology, New York University, 6 Washington Place, New York, NY 10003, USA. E-mail: shrout@psych.nyu.edu
Psychiatric research has benefited from attention to measurement theories of reliability, and reliability/
agreement statistics for psychopathology ratings and diagnoses are regularly reported in empirical
reports. Nevertheless, there are still controversies regarding how reliability should be measured, and the
amount of resources that should be spent on studying measurement quality in research programs. These
issues are discussed in the context of recent theoretical and technical contributions to the statistical
analysis of reliability. Special attention is paid to statistical studies published since Kraemer's 1992 review
of reliability methods in this journal.

1 Introduction
The general issues concerning reliability and agreement in psychiatric measurement
have been discussed extensively in psychiatry journals1-7 and in statistical and
psychometric writing.8-14 Many exceptionally clear introductions to the topic have
been written for scientists and statisticians,15,16 and the inclusion of reliability
statistics in cutting-edge research reports is now common.
The reliability literature has documented the adverse effects of extraneous random
variation on measures of symptom severity and extent, measures of functioning,
formulation of psychiatric diagnoses, and the assessment of risk status. Extraneous
random variation may be introduced by many processes, including ambiguous items
and diagnostic criteria, improperly trained raters, memory lapses by informants, and
lack of insight by patients. However, extraneous measurement variation is not always
due to mistakes or errors. Variation may also be introduced when legitimate
perspectives are sampled rather than systematically assessed. For example, mothers
and fathers may provide different, but worthwhile, information about their child.
Randomly chosen raters may have different but legitimate theories about how to
interpret common observations.
Whatever its cause, measurement variation means that one assessment differs from
an independent replicate assessment. If the amount of measurement variation is
extreme, the patient may receive one diagnosis rather than another, the wrong persons
may be assigned to treatment and control groups, and the statistical analysis of validity
studies may lead to biased results.
Recognition of the need for reliable measurement and diagnosis had a substantial
impact on the development of the third and fourth editions of the Diagnostic and
Statistical Manual in the United States17,18 and the International Classification of
Diseases outside the USA.19 Diagnostic distinctions that could be made reliably were
included in these manuals, while distinctions that were hard to replicate were left out.
It became widely recognized that a measure that cannot be reproduced is one that is
unlikely to be valid. Perhaps more than in other areas of medicine, psychiatry has
been profoundly affected by reliability theory and methods.
Despite all that has been written and all the progress that has been made using
reliability statistics in psychiatry, there remain important controversies about
fundamental issues such as how measurement precision should be characterized,
and how reliability studies in psychiatry should be designed. Some of the standing
controversies are covered below, while recent contributions to the statistical treatment
of reliability and agreement are reviewed. These statistical/psychometric discussions
are placed in the context of psychiatric research and clinical practice. Below is a brief
review of formal reliability models.

2 Recap of classical reliability theory


2.1 The additive error model

Classical reliability theory was originally developed for measurement in education and
cognitive psychology. Lord and Novick14 and Cronbach et al.10 provide the most
thorough developments of the statistical basis of classical theory.
Suppose we have a population of measures or measurement devices, and we sample
measure j to assess a fixed person i. Call that measurement X_ij. In education, the
measure j might be a particular item or achievement subtest, and, in psychiatry,
the measure might be an expert rater or a family informant. Following the notation
and terminology of Kraemer,16 we write X_ij as the sum of a person parameter, ζ_i, and a
residual term that carries the unique effect of measure j:

X_ij = ζ_i + ε_ij

The fixed person parameter, ζ_i, is defined as the expected value of X over the
population of measures. Kraemer calls it the consensus score, while Lord and Novick14
call it the true score, and Cronbach et al.10 call it the universe score. Because ζ_i is
E(X_ij) for the fixed person, E(ε_ij) = 0. For the fixed person, we will write
Var(X_ij) = Var(ε_ij) = σ²_i(ε), which is known as the squared standard error of
measurement for person i. Clearly, the smaller σ²(ε) is, the more precise the
measurement.
Whether σ²(ε) is interpreted to be large or small usually depends on its size relative
to the variance of the measured characteristic over persons. Suppose that X now
represents a measurement of a randomly selected patient. Under this sampling plan,
we consider the consensus score, ζ_i, to be a random effect with mean μ and
variance σ²(ζ). It is customary in classical reliability theory to assume that the
variance of the error component, σ²(ε), is the same for different subjects. This
assumption is one that is rejected by so-called modern test theory, or item response
theory.14

2.2 The reliability coefficient
In classical reliability theory, the reliability coefficient is defined as the ratio of the
true score variance to the total variance of the observed measure, X. The patients/
respondents are assumed to be sampled independently of the error process, and so the
observed variance of X can be decomposed as the sum of the true score and error
variances:

Rx = σ²(ζ) / [σ²(ζ) + σ²(ε)]     (2.1)

The reliability coefficient, Rx, is readily interpreted as the proportion of σ²(X) that is due
to replicable differences in patients. It turns out to be a useful quantity in statistical
analyses as well. For example, Snedecor and Cochran20 show that this is the quantity
that describes the bias of the least squares estimate of the bivariate regression on a fallible
independent variable (see also Fuller21). Lord and Novick14 show how product
moment correlations are attenuated by the reliabilities of the two variables being
analysed, and work on bias in maximum likelihood logistic regression has also shown
bias to be indexed by Rx. Kraemer12 describes how the power of hypothesis tests is
affected by changes in Rx for the dependent measures.
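The practical meaning of the attenuation result can be illustrated with a short simulation. The sketch below is not taken from the sources cited above; it simply generates data under the additive error model of Section 2.1 (the variances and the slope of 2.0 are arbitrary assumed values) and shows that the least squares slope of an outcome on the fallible measure X is shrunk by roughly the factor Rx.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                      # large sample, so sampling noise is negligible
sigma2_zeta, sigma2_eps = 1.0, 0.5               # assumed consensus-score and error variances
R_x = sigma2_zeta / (sigma2_zeta + sigma2_eps)   # population reliability = 2/3

zeta = rng.normal(0.0, np.sqrt(sigma2_zeta), n)        # consensus scores
x = zeta + rng.normal(0.0, np.sqrt(sigma2_eps), n)     # fallible measure X
y = 2.0 * zeta + rng.normal(0.0, 1.0, n)               # outcome related to the consensus score

slope_true = np.polyfit(zeta, y, 1)[0]   # close to 2.0
slope_obs = np.polyfit(x, y, 1)[0]       # close to 2.0 * R_x, i.e. attenuated
print(round(R_x, 3), round(slope_true, 3), round(slope_obs, 3))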
2.3 Estimation of Rx
Rx is estimated using data on actual replicate measures of X_ij. While, in theory,
measurements are selected from a large population of potential measures, in
practice they are often selected from a finite set of raters, items, or test forms. Many of
the statistical issues in reliability theory are concerned with models for representing
the kinds of replicate measurements that are available in empirical studies.
The most common way to estimate Rx is to use some form of intraclass correlation.
Several forms have been described for quantitative ratings.23-25 These vary according
to whether the reliability design has raters crossed with patients or whether it has
unique raters nested within patients. The forms also differ according to whether the
raters are considered to be random or fixed effects. For binary or categorical ratings,
the family of kappa coefficients provides the most common estimates.9,12,26 Dunn15
provides an extensive discussion of many approaches to the estimation of various
forms of Rx.
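As an illustration of how these intraclass correlation forms differ, the sketch below computes the three single-rater coefficients in the style of Shrout and Fleiss25 from the ANOVA mean squares of a small ratings table. The data matrix is hypothetical, and the formulas assume a complete subjects-by-raters layout with no missing ratings.

import numpy as np

def icc_single_rater(ratings):
    # Single-rater intraclass correlations from an n x k table (subjects x raters).
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_rater = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_resid = ss_total - ss_subj - ss_rater              # interaction + error
    ss_within = ss_total - ss_subj                        # within subjects (one-way design)

    bms = ss_subj / (n - 1)
    jms = ss_rater / (k - 1)
    ems = ss_resid / ((n - 1) * (k - 1))
    wms = ss_within / (n * (k - 1))

    icc_1 = (bms - wms) / (bms + (k - 1) * wms)                        # raters nested within subjects
    icc_2 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)  # raters crossed, random
    icc_3 = (bms - ems) / (bms + (k - 1) * ems)                        # raters crossed, fixed
    return icc_1, icc_2, icc_3

# Hypothetical ratings: 6 patients each rated by the same 4 raters
ratings = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
print([round(v, 2) for v in icc_single_rater(ratings)])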
2.4 Composite measures
The reliability of any measurement procedure that has Rx > 0 can be improved by
averaging replicate measurements, when replicate measurements are available.
Suppose W is the average of k independent X measurements. Spearman27 and
Brown28 described the reliability of W, Rw, in terms of the reliability of the measures
that are averaged, Rx. The so-called Spearman-Brown formula is

Rw(k) = kRx / [1 + (k-1)Rx]     (2.2)

This expression is derived assuming that the replicate measures have the same
expected value, and the same error variance. Replications that meet these criteria are

called parallel measures. Composites of measures that are not strictly parallel will also
be more reliable than the components, but the reliability, Rw, depends on the pattern
of variances and covariances of the measures in that case.
Improving the reliability of a measurement by summing or averaging replicate
measures is one of the most important strategies researchers have for dealing with
measurement error. Classical reliability theory gives the researcher the tools needed to
plan how many replicate measures are needed to obtain adequate reliability (see, for
example, Shrout and Fleiss25).
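For planning purposes, equation (2.2) can be inverted to give the number of parallel replicate measures needed to reach a target reliability. The helper below is a minimal sketch of that calculation; the example values (a single-measure reliability of 0.5 and a target of 0.8) are arbitrary assumptions.

import math

def spearman_brown(r_x, k):
    # Reliability of the average of k parallel measures, each with reliability r_x (equation 2.2).
    return k * r_x / (1.0 + (k - 1) * r_x)

def replicates_needed(r_x, r_target):
    # Smallest number of parallel replicates whose average reaches the target reliability.
    return math.ceil(r_target * (1.0 - r_x) / (r_x * (1.0 - r_target)))

print(spearman_brown(0.5, 3))       # 0.75
print(replicates_needed(0.5, 0.8))  # 4 replicate measures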
2.5 Extensions of classical theory

Kraemer12,16 showed that many of the results of classical reliability theory apply to
X variables that are binary (e.g. diagnosis present versus absent). She showed that
certain forms of the kappa family of statistics, first proposed by Cohen9 as ad hoc
agreement indices, are actually equivalent to intraclass correlation estimates of Rx.
The population model defined by Kraemer represents the consensus scores, ζ_i, as
continuous probability values in the interval (0, 1). Although the consensus scores can
be linked conceptually to some unknown latent categories (e.g. disease truly present,
disease truly absent), these latent categories are neither necessary nor helpful in
deriving the reliability theory results.
Cronbach et al.10 extended classical reliability theory by developing a framework for
systematically exploring determinants of error variation and of consensus score
variation. For example, suppose multiple raters are sampled from a population of
experts that includes psychiatrists, psychologists and psychiatric social workers.
Suppose also that some of the experts interview the patients in their offices and others
conduct the interview in a hospital ward. Does the selection of type of expert and place
of the interview have an effect on the ratings? The analysis of such measurement facets
is what Cronbach and his colleagues call generalizability analysis. Like simple
reliability analyses, generalizability theory makes use of variance components
methods and often reports summaries of results as variance ratios that resemble Rx.
3 Continuing controversies
3.1 Representing degree of measurement precision
A recurring issue among those concerned with measurement precision is whether the
reliability coefficient is an optimal or even appropriate index of measurement
precision. Many of the arguments against Rx centre on its dependence on σ²(ζ), the
variance of the consensus scores. As I mentioned above, the reliability coefficient
essentially calibrates the magnitude of the error variance, σ²(ε), with the between-subject
variance of ζ_i, σ²(ζ):

Rx = σ²(ζ) / [σ²(ζ) + σ²(ε)]

Although the error variation may be minuscule, the reliability coefficient will
approach zero as the between-subject consensus-score variation approaches zero.

Moreover, even if the reliability of a measurement procedure is shown to be good in
one population, it might be poor in another population that is more homogeneous,30 as
reflected by a smaller value of σ²(ζ).
The dependence of Rx on between-subject variance has been called the 'base rate
problem' when X is a binary variable. This version of the complaint is usually made
against Cohen's kappa, which for binary judgements is a legitimate estimate of Rx.
The base rate problem has been discussed by many authors2,3,31,32 who often argue
from an intuitive perspective that something must be wrong with an index of precision
that varies systematically with the prevalence (base rate) of the binary characteristic.
The necessity and desirability of considering σ²(ζ) in indices of measurement
precision comes up in a different context when a population has only a single
member.33 Consider a situation in which families of patients who were being
discharged from a specific psychiatric hospital are asked to rate their satisfaction with
the facility. Suppose 1 is the response for 'completely satisfied' and 9 means
'completely dissatisfied'. If 99 out of 100 families in a given year chose the response
alternative, '9', one would be inclined to think that this was bad news for the
institution. However, from a classical reliability perspective the family feedback might
be disregarded because the reliability is not established. The nearly perfect consensus
would lead to a very small value of σ²(ε), but administrators might speculate that all
hospitals get negative ratings, which would mean that σ²(ζ) is hypothetically smaller.
Clearly, this academic argument would not be reassuring to those in the real world
who are concerned with quality of medical care.
In organizational psychology, where the issue of rating single institutions comes up
quite often, a suggestion was made to create a metaphorical form of Rx by replacing
σ²(ζ) with hypothetical values based on assumed distributional forms.34 This
suggestion has been controversial.5,33,35,36 One reasonable alternative to a rigid
adherence to the form of Rx, when σ²(ζ) is not known or not defined, is to focus the
substantive discussion of measurement quality and interpretation on the magnitude of
σ²(ε), the classical squared standard error of measurement.5 In many contexts, such as
the hypothetical ratings of the psychiatric hospital, the interpretation of σ²(ε) is clear
without the need to resort to σ²(ζ).
A different school of statisticians and psychometricians has recommended that
classical reliability analysis be displaced by a new emphasis on understanding the
response processes of individual measures. Developed in the context of educational
measurement, this perspective is known as item response theory (IRT).14,37 IRT
attempts to relate the responses to each item in a test to an underlying latent
dimension, θ, that reflects the construct to be measured. For quantitative ratings,
factor analysis would be an example of a latent variable model that relates an
unobserved dimension to manifest ratings.15
IRT theorists focus their attention on the precision of estimates of θ, which they
describe using the concept of information adapted from maximum likelihood theory.
Treating each item or measure as fixed, IRT represents the precision of measurement
(information) as a function of θ. For most measures, IRT shows that information
varies considerably across the dimension one is trying to measure. By conditioning on
θ, IRT theorists hope to avoid the intuitive unpleasantness of having measurement
precision depend on the characteristics of each population being measured.37

From the IRT perspective, summary measures of reliability should be replaced with
information functions for fixed items or measures. One set of measures may be
informative about distinctions between persons with severe psychopathology, while
another set of measures may be useful in distinguishing subclinical disorder from
persons who are free of symptoms. The impact of different levels of measurement
precision is more or less critical depending on the distribution of θ in any given study
population.
The controversy between advocates of classical reliability and those who reject
summary measures of chance-corrected agreement might be maintained by different
assumptions made by each side. The former may assume an interest in population-based
studies of psychiatric phenomena, while the latter may assume an interest in
specific subjects.
Advocates of classical reliability estimates can point to the utility of Rx in assessing
the impact of measurement error on substantive analyses.21,38 They can argue that the
variance of the consensus score, σ²(ζ), is important because it affects the observed
variance of X in study samples.6 They can show the empirical benefit of improving
reliability in terms of statistical power12 for population-based studies. They also can
argue that attention to classical reliability before substantive studies are undertaken is
a practical and realistic way to improve the quality of scientific research.16 Finally,
they can show that alternative summary measures of measurement precision are either
flawed,6,39 or reducible to forms of the classical reliability coefficient.6,39,40-42
Advocates for an emphasis on the standard error of measurement without regard to
the distribution of either the consensus score, ζ, or some other individual difference
quantity such as IRT's θ, are on firm ground when the emphasis is on single subjects,
or on conditional inference.37 Unless the specific subjects are being compared to
others in the study population, the variance of the study population may be irrelevant.
As discussed by Mellenbergh,37 focus on measurement information need not preclude
a later consideration of reliability, if the form of the distribution of individual
differences is known.
Unfortunately, some of the participants in the controversy regarding the usefulness
of reliability estimates have not focused on the carefully considered positions
reviewed above. It can only be hoped that the literature will move away from
complaints about the intuitiveness, or lack thereof, of statistical results, the difficulty
of demonstrating reliability in population-based studies with limited variation, and
the sad fact that reliability has to be reassessed in each new study population.
Kraemer16 warned against 'sugar-coating' the discussion of measurement precision,
but unfortunately there are still instances of this practice.
3.2 How involved should reliability studies be?

Another area of some contention in the reliability literature is how much time and
effort should be allocated to the study of the magnitude and sources of measurement
variation. Many of the models and analyses proposed by statisticians and
psychometricians require large samples with multiple replicate measures. Methodological advances can be readily applied in educational research, where thousands, if
not hundreds of thousands, of students submit to standardized testing. However,
psychiatric clinical researchers often have limited populations of patients and

restricted numbers of colleagues who can serve as replicate expert raters. For this
reason, some approaches to the study of measurement quality are described as not
feasible for psychiatry. The question is how far psychiatric researchers can, and
should, be pushed.
The answer to the 'should' part of the question no doubt depends on the research
question and the scope of the research enterprise. Small preliminary studies of
potential individual differences may need only preliminary attention to measurement
reliability, while large epidemiological surveys deserve more comprehensive attention
to measurement quality. Fortunately, the statistical literature in recent years has
begun to provide researchers with concrete information about how large a reliability
study must be to be useful. It turns out that even simple studies of rater reliability and
agreement often need samples that are larger than those commonly used in current
psychiatric research.
Cantor43 provided tables to facilitate the sample size calculations for studies
planning to estimate Cohen's kappa. They assumed that inferences would be made
using the asymptotic standard error of kappa reported by Fleiss et al.44 Their
illustrations make it clear that more than a hundred subjects are needed in each 2 x 2
table, if one wants statistical assurance of detecting modest reliability. Detecting
modest reliability means constructing confidence intervals with lower bounds greater
than 0.3 or 0.4, or rejecting a null hypothesis that kappa is equal to 0.3 against
alternatives that it is greater than 0.3.
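A brute-force way to appreciate why such large samples are needed is to simulate the sampling spread of kappa directly. The sketch below is not Cantor's procedure and does not use the Fleiss et al. asymptotic standard error; it simply draws repeated 2 x 2 tables from an assumed population table (the same cell proportions used in the numerical example of Section 4.5, population kappa about 0.52) and reports the empirical 95% range of the kappa estimates at several sample sizes.

import numpy as np

def kappa_2x2(p11, p10, p01, p00):
    # Cohen's kappa from the cell proportions of a rater-by-rater 2 x 2 table.
    po = p11 + p00
    pe = (p11 + p10) * (p11 + p01) + (p01 + p00) * (p10 + p00)
    return (po - pe) / (1.0 - pe) if pe < 1.0 else np.nan

def kappa_sampling_range(cell_probs, n_subjects, reps=5000, seed=0):
    # Empirical 95% range of kappa when n_subjects are cross-classified by two raters.
    rng = np.random.default_rng(seed)
    tables = rng.multinomial(n_subjects, cell_probs, size=reps) / n_subjects
    ks = np.array([kappa_2x2(*t) for t in tables])
    return np.nanpercentile(ks, [2.5, 97.5])

population_table = (0.2, 0.1, 0.1, 0.6)   # assumed population cells; kappa is about 0.52
for n in (30, 100, 200):
    print(n, np.round(kappa_sampling_range(population_table, n), 2))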
Walter et al.45 provided useful tables for planning reliability studies in which the
outcome rating is quantitative and approximately normal, and when estimates of
intraclass correlation are to be used as reliability statistics. They used an
approximation, based on Fisher,46 to produce a closed form expression of the sample
size for various hypotheses about reliability. By studying results for different numbers
of raters, as well as subjects, they made recommendations about optimal designs of
reliability studies. Their results suggest that studies of quantitative ratings can be
substantially smaller in terms of numbers of subjects than studies of binary ratings.
For example, if we want to distinguish a true reliability of 0.6 from a null value of 0.4
with 80% power, and we have three raters, we would only need 32 subjects in the
reliability study.
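A planned design can also be checked by simulation rather than by closed-form tables. The function below is a Monte Carlo sketch, not the Fisher-type approximation used by Walter et al.45: it assumes a one-way (raters nested within subjects) normal model and a one-sided 5% F-test of the null intraclass correlation, so the power it reports for any particular configuration need not match tabled values that rest on a different approximation or design.

import numpy as np
from scipy.stats import f as f_dist

def icc_power_oneway(n, k, rho_true, rho_null, alpha=0.05, reps=4000, seed=0):
    # Monte Carlo power of the one-sided, one-way ANOVA F-test of H0: ICC = rho_null.
    rng = np.random.default_rng(seed)
    scale0 = 1.0 + k * rho_null / (1.0 - rho_null)        # E[MSB]/E[MSW] under the null
    f_crit = f_dist.ppf(1.0 - alpha, n - 1, n * (k - 1))
    rejections = 0
    for _ in range(reps):
        subj = rng.normal(0.0, np.sqrt(rho_true), (n, 1))             # between-subject component
        x = subj + rng.normal(0.0, np.sqrt(1.0 - rho_true), (n, k))   # plus within-subject error
        bms = k * ((x.mean(axis=1) - x.mean()) ** 2).sum() / (n - 1)
        wms = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
        rejections += (bms / wms) > scale0 * f_crit
    return rejections / reps

for n_subjects in (32, 50, 80):
    print(n_subjects, icc_power_oneway(n_subjects, k=3, rho_true=0.6, rho_null=0.4))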
Donner47 extended the discussion to consider sample size requirements for
comparing reliability results from two or more independent samples. He considered
reliability studies of both continuous measurements and binary decisions when two
raters are used. The results for continuous measurements were based on the
approximation that he and his colleagues used for single sample inference,45 while
the results for binary ratings were based on goodness-of-fit tests.48 Like the
calculations for single samples, the sizes of samples needed to compare reliabilities
of continuous ratings are substantially smaller than those needed to compare binary
ratings. To distinguish between reliabilities of 0.2 and 0.6 with two raters (Type I error
0.05, Type II error, 0.20), sample sizes of around 60 were needed in each group for
quantitative ratings. For binary ratings, between 100 and 200 were needed, with the
larger number associated with studies with smaller base rates.
The different results for continuous ratings versus binary ratings are important and
persistent. Although the cost of dichotomization has been described repeatedly in the

methodological literature,12,49 recent papers have attempted to make this cost even
clearer with regard to the estimation of reliability.29 As Donner and Eliasziw pointed
out, the situation for inferences about reliability is made difficult by the fact that
rather subtle differences in precision, say between reliability of 0.4 and 0.8, may be of
clinical and research interest.29 When such distinctions are investigated with binary or
categorical ratings, provisions for much larger samples must be made.
The message that is beginning to emerge from the consideration of sampling
distributions of even the simpler reliability statistics is that sample sizes for
measurement studies need to be larger than is common practice. Psychiatric researchers
need help in changing their expectations about how much effort is needed to study
reliability. In the context of these new expectations, some of the multivariate methods
that are being developed to study measurement quality may be newly appraised as
feasible and desirable.
3.3 Standards for reliability results: what is bad?

A third area of some controversy in the reliability literature is what constitutes
adequate and inadequate reliability. Landis and Koch13 provided adjectives to
describe ranges of reliability values. These are (0, 0.20), slight; (0.21, 0.40), fair; (0.41,
0.60), moderate; (0.61, 0.80), substantial; and (0.81, 1.0), almost perfect. Although these
labels have been quoted by some methodologists,5 they have been criticized by
others.5
The issue of which labels to provide may seem to be a trivial matter of semantics, or
a simple matter of subjective value. However, these labels are used by researchers as
they decide whether to invest additional effort into improving measurement precision.
Researchers who are able to show that the reliability of a measure, X, is around 0.60
may be able to claim 'moderate' or 'substantial' reliability, depending on the sampling
fluctuations of the data. In either case, they will be unlikely to devote additional effort to
improving the quality of the measure. However, if the reliability of X is truly 0.6, then
the regression of an outcome on X will be attenuated by more than a third, other
variables that are supposed to be adjusted for X will remain substantially confounded,
and the power of the study to find true differences will be diminished.
Given that reliability can usually be improved by averaging replicate measures, it is
unfortunate that the conventional labels for the quality of reliability do not serve to
encourage constant refinement of measurement. To this end, the following revision of
the Landis and Koch adjectives is proposed:
(0.00, 0.10) - virtually none;
(0.11, 0.40) - slight;
(0.41, 0.60) - fair;
(0.61, 0.80) - moderate;
(0.81, 1.0) - substantial.
This suggested revision shifts the Landis and Koch adjectives to the next interval,
with a recognition that reliability values in the range of (0, 0.1) are so low that even
averaging several replicate measures together would not help. Measures with slight
reliability are ones that can be improved using averages of many replicate measures.
Measures with fair reliability can be said to have nearly half their variance attributable

to consensus score variation. Moderately reliable measures would still lead to bias as
explanatory variables in regression analyses, and would be somewhat inefficient as
outcome measures, but these untoward effects are limited. Substantial reliability is the
only category where the adjective conveys the sense that measures are generally
adequate.
When applying any standards to empirical results, it is important to recognize the
limitations of reliability estimates from any given study. The adjective label is usually
chosen to describe the point estimate, but the confidence interval may show that the
study is consistent with a range of labels. More important, some estimates of reliability
are known to be biased. Cronbach's alpha,51 as an estimate of scale reliability, is
known to be systematically too small, except when the items are redundant. When
applied to items that measure unique aspects of process, the alpha estimate may be low
even though the items are individually reported with good reliability (see below).

4 Recent contributions to reliability theory

Contributions to reliability theory that have appeared since the paper by Kraemer16
are discussed below. It is clear that the majority of the contributions relevant to
psychiatry adopt the perspective that classical reliability coefficients are useful and
deserving of further refinement.
4.1 Defining and estimating reliability

Given the breadth of Dunn's15 comprehensive coverage of approaches to estimating
reliability, one might well wonder what other refinements or methods are needed.
One set of contributions aims to further an understanding of Cronbach's alpha51
and other estimators of composite reliability. If k items or raters, X_i, are used to
measure n subjects, and a composite measure, W = ΣX_i, is constructed as the sum of
the k components, then Cronbach's alpha provides an estimate of the reliability of that
composite:

α = [k/(k-1)] [1 - Σ Var(X_i) / Var(W)]

As shown by Lord and Novick,14 as well as Dunn,15 alpha is a lower bound of the
composite's true reliability, with the bias eliminated only when the items have the
same loadings on a common single factor. In that case, the items are said to be
essentially tau equivalent. When the items load on the same single factor but with
different loading weights, the items are said to be congeneric, and in that case the
alpha estimate produces a value that is systematically less than the unknown
parameter. When the items are congeneric, an alternative estimate of the reliability of
the composite can be written15,52-54 using estimates from a confirmatory factor
analysis of the item data.
In practice, coefficient alpha is often used by psychiatric researchers when it can
only be interpreted as a lower bound. Raykov55 examined the magnitude of alpha's
bias when items were congeneric rather than essentially tau equivalent. He found that
the bias was only substantial when the number of items was very small and the average

item loading was also small. He concluded that alpha provides useful estimates when
the number of items is four or more, and when the items are known to be strongly
related to a common construct.
Raykov did not, however, consider the magnitude of alpha's bias when the items in a
composite were measures of different constructs. For example, it is common in
psychiatric research to study psychosocial risk factors such as stressful life events. A
quantity of interest may be the number of stressful events from a fixed list that
occurred in the past week. Events might reflect different independent processes such
as work stress and family stress, but the sum of the stressors is still of interest as a risk
factor. For applications such as this, Cronbach's alpha is likely to produce a very
biased estimate of the reliability of the stress composite, even if subjects are capable of
answering the items reliably. Although alpha is not very useful in this context, it is
often misinterpreted as evidence that measurement quality is poor.
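The point can be made concrete with a small simulation, sketched below under assumed values: six items that tap six independent processes, each item measured with reliability 0.8. Cronbach's alpha for the sum hovers near zero because the items do not covary, while the true reliability of the composite, computed from the known variance components, is 0.8.

import numpy as np

rng = np.random.default_rng(1)
n_subjects, k_items = 2000, 6
var_true, var_err = 1.0, 0.25          # each item reliable on its own: 1.0 / 1.25 = 0.8

true_parts = rng.normal(0.0, np.sqrt(var_true), (n_subjects, k_items))   # independent processes
items = true_parts + rng.normal(0.0, np.sqrt(var_err), (n_subjects, k_items))
w = items.sum(axis=1)                  # composite, e.g. a count-like stress score

alpha = (k_items / (k_items - 1)) * (1.0 - items.var(axis=0, ddof=1).sum() / w.var(ddof=1))
true_reliability = (k_items * var_true) / (k_items * (var_true + var_err))   # = 0.8 by construction
print(round(alpha, 2), round(true_reliability, 2))   # alpha near 0; the composite is actually reliable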
Li et al.56 examined the reliability of composites under very general conditions,
including the case in which items are related to different latent variables. Their
analysis assumed that the reliabilities of the items combined in the composite are
known individually. They showed that regularities implied by the Spearman-Brown
formula (equation 2.2) do not necessarily hold if items are congeneric or are related to
more than one factor. By designing item weights as a function of item reliabilities and
item variances, they developed a weighted composite that has maximum reliability
under general conditions. Their results might be applicable to psychiatric research
settings where fixed raters are available to conduct expert ratings, but they differ in
their average reliability.
Fagot57,58 considered ways to generalize the assessment of reliability in a very
different way. He addressed the possibility that different raters may use numbers in
different ways leading to apparent discrepancies that are simply due to rating scale
use. For example, raters may make magnitude estimates with different comparison
references. If one wishes to estimate the reliability of ratings made on different scales,
one needs to transform one or both sets of ratings to a uniform metric. Fagot described
general classes of admissible transformations and discussed the implications of
ignoring the issue of scale.
When raters are making judgements with regard to multiple outcomes, it might be
useful to summarize the overall degree of agreement using multivariate intraclass
correlation estimates developed by Konishi et al.59 for population genetics. However,
Aickin and Ritenbaugh38 argued that the impact of multivariate unreliability on linear
model estimation needs to be studied using reliability arrays that characterize the
impact of error on both the variances and covariances of the explanatory variables.
Kraemer16 observed that different chance-corrected agreement indices and
reliability statistics may be estimating the same quantity when the raters can be
considered to be randomly selected. Since that statement, Blackman and Koval41
showed that four different measures of agreement in 2 x 2 tables, including Cohen's
kappa9 and Mak's rho,60 were estimating the same intraclass correlation as the
reliability coefficient. Similar conclusions were reached by Bodian,61 who developed
alternative intraclass correlation approaches for binary ratings under the assumptions
that raters were random and nested within subject, and that raters were either fixed or
random effects in a rater-by-subject crossed design.

4.2 Interval estimation
A variety of methods has been proposed for estimating confidence intervals around
reliability coefficients, but they are still not used much in psychiatric research.
Hopefully, this will change. If confidence intervals were reported, more people would
appreciate the need to use larger samples in reliability studies, and they would avoid
making false inferences about whether reliability is either good enough, or fatally low.
The main reason confidence intervals are not reported in reliability studies is that
such reports are not yet part of the scientific culture. However, another reason may be
that the computation of confidence intervals is not easy for scientists. The issue of
computational ease is confounded with the fact that accurate calculation of confidence
intervals is challenging.
An earlier study25 summarized confidence intervals for six forms of intraclass
correlations under the assumption that the ratings were normally distributed. Four of
the six intervals were based on F-tests used to test the significance of the intraclass
correlation, while two were based on asymptotic approximations.62 Although these
intervals were somewhat tedious for scientists to implement, they are now computed
by at least one commercial software program (SPSS).
Barchard and Hakstian63 examined the robustness of the F-test-based confidence
interval for Cronbach's alpha, which was one of the six intervals we reported.
Although the interval had proper coverage when the assumptions for the intraclass
correlation F-test were satisfied, it was not robust. Intervals were too narrow when
the rating residuals were allowed to correlate. The bias was sometimes extreme. One of
their conditions led to intervals that included the parameter only 77% of the time.
Equally troubling, McClelland and Nickerson64 reported that in preliminary
simulations, the asymptotic interval proposed by Fleiss and Shrout62 for the intraclass
correlation based on the mixed model (subjects random, raters fixed) was too wide.
These results are consistent with results reported by Lyles and Chambless,65 although
they did not address the issue of confidence intervals explicitly.
Confidence bounds for the reliability of binary ratings are no easier to compute than
for quantitative ratings. Hale and Fleiss66 developed a confidence interval using
Cornfield's test-based approach for 2 x 2 tables.67 In simulations, they determined
that their approach provided better coverage than a symmetric interval based on an
asymptotic standard error estimate and another due to Flack.68 However, they also
reported that the approach led to anomalies when the sample size was less than 40.
Donner and Eliasziw69 developed a procedure for estimating the confidence interval
around kappa using a goodness-of-fit test. In simulations, the goodness-of-fit
confidence interval outperformed intervals developed by Bloch and Kraemer70 and it
gave reasonable results for sample sizes as small as 25, suggesting that it might be
preferred to the method suggested by Hale and Fleiss.
Dunn15 recommends that bootstrap methods be used to calculate confidence
intervals around a variety of forms of estimates of reliability coefficients. Although
these methods are flexible, they are computationally intensive. Unless they become
automated and accessible to scientists, they are unlikely to be adopted by the
psychiatric research community as a whole. The further development of accurate,
easily computed confidence intervals is likely to be of ongoing interest to statisticians.
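For readers who want to try the bootstrap approach, the sketch below resamples subjects (rows) with replacement and reports a percentile interval around a one-way intraclass correlation. It is a minimal illustration on simulated ratings, not the specific procedures recommended by Dunn,15 and percentile intervals can be too short in small samples.

import numpy as np

def icc_oneway(x):
    # One-way random-effects intraclass correlation from an n x k array of ratings.
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    bms = k * ((x.mean(axis=1) - x.mean()) ** 2).sum() / (n - 1)
    wms = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

def bootstrap_icc_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap interval obtained by resampling subjects (rows) with replacement.
    x = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    boot = [icc_oneway(x[rng.integers(0, len(x), len(x))]) for _ in range(n_boot)]
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return icc_oneway(x), (round(lo, 2), round(hi, 2))

# Simulated example: 20 patients rated by 3 raters, true ICC about 0.6
rng = np.random.default_rng(42)
truth = rng.normal(size=(20, 1))
ratings = truth + 0.8 * rng.normal(size=(20, 3))
print(bootstrap_icc_ci(ratings))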

4.3 Inferences regarding reliability comparisons

McGraw and Wong24,71 provided a summary of inferential methods available for
intraclass correlations calculated from quantitative ratings. They focused on methods
based on the analysis of variance of normally distributed measures.
Donner and his colleagues69 developed a goodness-of-fit approach to the analysis of
several problems involving kappa. One application is the comparison of agreement
level across several independent samples. Donner et al.48 were able to show that their
goodness-of-fit test worked well for even modest sample sizes, while asymptotic normal
theory tests only worked well for sample sizes greater than 100.
Donner and Eliasziw have also developed a series of goodness-of-fit-based tests to
examine associations within a multinomial table.72 In contrast to multinomial models
that compare each category to the remaining categories, Donner and Eliasziw propose a
hierarchical series of statistically independent hypotheses. These models
are likely to be of interest to psychiatric nosologists who are considering whether
diagnostic categories can be split reliably.
A new approach to the definition of intraclass correlations for variables in the
exponential family was proposed by Commenges and Jacqmin.73 They use a
generalized estimating equation formulation to derive a score test for inferences
about a random effects version of the reliability coefficient.
4.4 Modelling influences on agreement and disagreement

Although the methods developed by Donner and others for comparing different
reliability estimates are likely to be useful in many cases, the interpretation of these
comparisons is complicated by the fact that both consensus score variation and error
score variation influence reliability. If two reliability statistics differ, one must ask
whether the smaller value is due to the homogeneity of its population or due to an
increase in rater error.
This problem was addressed by Cronbach and his colleagues when they introduced
generalizability studies.10 As described by Dunn and others,15,74 generalizability
theory attempts to determine which facets of the measurement process are most
important in a given application. Analysis of variance and variance component
estimation provide the basic methodology for generalizability studies of quantitative
measures. Of special interest is the magnitude of target and error variation that is
attributable to subject groups and assessment techniques. Although the original
formulation of generalizability theory assumed that the error variance was constant
across subjects and subpopulations, modern variance component methods make this
assumption unnecessary.
Consistent with generalizability theory's focus on variance components,
Bartko40 recently proposed a unified approach to the representation and analysis of
variance components from two groups. His approach uses graphical tools and is
particularly easy for substantive researchers to apply.
For categorical ratings, log linear analyses can be used to describe the effect of
variation in overall base rates and within-strata association on reliability. Graham75
adapted the methods of Tanner and Young76 to look at the effects of covariates on
agreement. Barlow22 proposed a model based on conditional logistic regression for

adjusting for covariates that affect marginal rates. Such analyses can be informative
about changes in Rx from group to group, even if they do not represent the variance
components directly.
4.5 Reliability of modal rule for combining diagnoses

As mentioned several times, the practical payoff for developing a mathematical
theory of measurement error is the ability to improve measurement quality using
replicate measures, as documented by the Spearman-Brown formula and Cronbach's
alpha. Because of these results, one might expect that a binary rule based on ratings by
two independent raters would also be an improvement over the individual binary
ratings. Fleiss and Shrout,77 however, showed that this expectation is not generally
correct if the ratings are combined using a 'two-out-of-two' rule (2/2). This is the rule
that requires that both raters agree that a person is a case for the consensus rating to
be recorded as a case.
If two raters had the same base rate, the Spearman-Brown formula would describe
the reliability of the sum of their binary ratings. If the ratings were coded (0, 1), the
sum would take three values, 2, both positive; 1, one positive, one negative; 0, neither
positive. Assuming slight to fair reliability at the individual level, the sum would have
improved reliability. However, the 2/2 rule dichotomizes this three point scale. This
dichotomization offsets the benefits of combining the two ratings. Another way to
think about the 2/2 rule is that either rater has veto power regarding who is assigned
to the case group. While the rule provides a check on false positives, it essentially
accumulates the false negatives.
Consider the following numerical example. Suppose two raters have an underlying
reliability of 0.52 as measured by the kappa on values from the 2 x 2 table with the
proportions: (0.2, 0.1, 0.1, 0.6). This table might have arisen from a classification
exercise in which 30% of the patients have a certain disorder, and in which there was a
constant sensitivity of 0.807 and a constant specificity of 0.920.
If we were able to conduct a separate reliability study of the 2/2 rule, we would find
the expected concordant positive cell in the table that reflects the classification of
two teams of paired raters to be given by:
(prevalence) × (sensitivity)⁴ + (1 − prevalence) × (1 − specificity)⁴
= (0.3) × (0.807)⁴ + (0.7) × (0.08)⁴ = 0.127
The cells in the reliability table for this composite rule would have the expected
values, (0.127, 0.073, 0.073, 0.727), and its kappa value would be 0.54.
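The arithmetic of this example is easy to verify. The sketch below recomputes kappa from the single-rater table, the expected concordant positive cell for the 2/2 rule, and the kappa of the composite rule from the rounded cells given above (small rounding differences aside).

def kappa_2x2(p11, p10, p01, p00):
    # Cohen's kappa from 2 x 2 cell proportions (concordant positive cell first).
    po = p11 + p00
    pe = (p11 + p10) * (p11 + p01) + (p01 + p00) * (p10 + p00)
    return (po - pe) / (1.0 - pe)

prev, sens, spec = 0.30, 0.807, 0.920

# Single-rater reliability table quoted above
print(round(kappa_2x2(0.2, 0.1, 0.1, 0.6), 2))                  # 0.52

# Expected concordant positive cell for the 2/2 rule (two independent pairs of raters)
print(round(prev * sens**4 + (1 - prev) * (1 - spec)**4, 3))    # 0.127

# Kappa of the 2/2 composite rule, from the expected cells given in the text
print(round(kappa_2x2(0.127, 0.073, 0.073, 0.727), 2))          # 0.54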
In this case, we would have doubled the cost of the assessment, and subjected all the
patients to double burden, but we would be left with a classification rule with virtually
the same reliability. Using the new 2/2 rule, the correlations would still be attenuated,
the regression estimates would be equally biased, and the inefficiency of any
significance tests would remain.
The only benefit of the 2/2 rule is that the proportion of false positives in the
proband group would be reduced. While the purity of the proband group would be
viewed as a strength by some, others might worry that the sample of cases is not
representative of all cases. As Kraemer has noted, it is likely that sensitivity is not
constant across cases, but varies with the severity of the case.

5 Concluding comments
Statistical advances in reliability theory continue to enhance the prospects for
improved measurement in psychiatry. The benefits of these advances are delayed
somewhat by controversies about the utility of reliability coefficients versus
alternatives, about the absolute level of needed reliability, and about the need to
dedicate substantial resources to measurement.
As Kraemer noted,16 some of the misunderstandings have their roots in the phase of
psychiatric measurement when agreement indices were ad hoc, and advice on
reliability studies was based on conventions rather than on statistical theory. The
recent work on the equivalence of various forms of reliability statistics, the
determination of sample sizes needed for informed estimation and inference, and
development of new tools for comparing levels of precision and agreement across
samples and measurement procedures should help move measurement practice
beyond the earlier phase of controversy.
These new results complement an older but rich tradition in psychometric theory
that was not fully implemented in psychiatry because of the resources its methods require. In
particular, the tools of Cronbach's generalizability theory have been underused,
because generalizability studies require large samples and expensive designs. Recent
insights regarding the size of samples needed to study even simple categorical
agreement may put the expense of generalizability studies in a new light. Latent trait
models and other multivariate methods may also be more useful in psychiatry if larger
samples of clinically evaluated subjects become available.
Although much of this review was focused on estimation of the reliability
coefficient, Rx, it is important to recognize the uses and limitations of this statistic.
As a variance ratio, it is useful in any research application in which between subject
variation is important. Clinical trials, cross-sectional surveys, and panel studies are
examples where we need to know about the population variances, and where we should
know about the reliability of key measurements.
It is less important to know about Rx when a single patient is being tracked,
although the component of Rx reflecting the standard error of measurement is critical
to know. It is also not obvious that we routinely need to know how Rx varies in
subgroups of the study sample, such as women who come from a certain ethnic group.
If a subgroup coincides with a risk (or protective) factor, the conditional variance of X
will be restricted and this will make Rx appear small, even if the standard error of
measurement is the same in the group. The arguments for knowing Rx are based on
examination of the marginal distribution of X in the sample rather than on the
conditional variances. One would only need to know about Rx in the subgroup if it was
to be analysed separately.
For example, in epidemiological work it is often useful to include a brief mental
status screen to determine if a respondent is cognitively unimpaired. Typical questions
include asking what month it is, or who is president or prime minister at the time.
These screens are quite reliable in random samples of the general population, but they
would appear to be unreliable in a university sample. In the general population, they
have been shown to correlate with behaviours and family reports, but in university
samples they would correlate with nothing. This lack of correlation is due to a bad
choice of a construct, not to measurement error.

We might be interested in measurement precision in a subgroup defined by
immigrant status and low education level, but in this case it is the standard error in
which we are interested. The focus of item response theorists on measurement
information in the latent trait region where the immigrants are located is also relevant.
As a last comment, it is worth reiterating the desirability of studying quantitative
measures whenever possible. If a diagnostic distinction is to be studied, but it is not
made reliably by a single rater, researchers should consider using quantitative
averages of diagnostic ratings rather than a modal rule, such as a 2/2 rule. If the
quantitative version of this distinction proves to be predictive, then separate decision
analyses can be done to refine the binary consensus rule.

References
1 Bartko JJ, Carpenter WT Jr. On the methods
and theory of reliability. Journal of Nervous and
Mental Disease 1976; 163: 307-17.
2 Carey G, Gottesman II. Reliability and
validity in binary ratings: areas of common
misunderstanding in diagnosis and symptom
ratings. Archives of General Psychiatry 1978; 35:
1454-59.
3 Grove WM, Andreasen NC, McDonald-Scott
P, Keller MB, Shapiro R. Reliability studies of
psychiatric diagnosis: theory and practice.
Archives of General Psychiatry 1981; 38: 408-13.


4 Maxwell AE. Coefficients of agreement
between observers and their interpretation.
British Journal of Psychiatry 1977; 130: 79-83.
5 Schmidt FL, Hunter JE. Interrater reliability
coefficients cannot be computed when only
one stimulus is rated. Journal of Applied
Psychology 1989; 74: 368-70.
6 Shrout PE, Spitzer RL, Fleiss JL.
Quantification of agreement in psychiatric
diagnosis revisited. Archives of General
Psychiatry 1987; 44: 172-77.
7 Spitzer RL. Psychiatric diagnosis: are
clinicians still necessary? Comprehensive
Psychiatry 1983; 24: 399-411.
8 Cochran WG. Errors in measurement in
statistics. Technometrics 1968; 10: 637-66.
9 Cohen J. A coefficient of agreement for
nominal scales. Educational and Psychological
Measurement 1960; 20: 37-46.
10 Cronbach LJ, Gleser GC, Nanda H,
Rajaratnam N. The dependability of behavioral
measurements: theory of generalizability for scores
and profiles. New York: John Wiley, 1972.
11 Fleiss JL. Estimating the reliability of
interview data. Psychometrika 1970; 35: 143-62.

12 Kraemer HC. Ramifications of a population
model for kappa as a coefficient of reliability.
Psychometrika 1979; 44: 461-72.
13 Landis JR, Koch GG. The measurement of
observer agreement for categorical data.
Biometrics 1977; 33: 159-74.
14 Lord FM, Novick MR. Statistical theories of
mental test scores. Reading, MA: Addison-Wesley, 1968.
15 Dunn G. Design and analysis of reliability studies.
New York: Oxford University Press, 1989.
16 Kraemer HC. Measurement of reliability for
categorical data in medical research. Statistical
Methods in Medical Research 1992; 1: 183-99.
17 American Psychiatric Association. Diagnostic
and statistical manual of mental disorders, 3rd
edition. Washington, DC: American
Psychiatric Association, 1980.
18 American Psychiatric Association. Diagnostic
and statistical manual of mental disorders, 4th
edition. Washington, DC: American
Psychiatric Association, 1994.
19 World Health Organization. Draft international
classification of diseases and related health
problems, 10th edition. Geneva: World Health
Organization, 1992.
20 Snedecor GW, Cochran WG. Statistical
methods, 6th edition. Ames, IA: Iowa State
University Press, 1967.
21 Fuller WA. Measurement error models. New
York: John Wiley, 1987.
22 Barlow W. Measurement of interrater
agreement with adjustment for covariates.
Biometrics 1996; 52: 695-702.
23 Bartko JJ. The intraclass correlation as a
measure of reliability. Psychological Reports
1966; 19: 3-11.

24 McGraw KO, Wong SP. Forming inferences
about some intraclass correlation coefficients.
Psychological Methods 1996; 1: 30-46.
25 Shrout PE, Fleiss JL. Intraclass correlations:
uses in assessing rater reliability. Psychological
Bulletin 1979; 86: 420-28.
26 Kraemer HC, Bloch DA. Kappa coefficients in
epidemiology: an appraisal of a reappraisal.
Journal of Clinical Epidemiology 1988; 41:
959-68.
27 Spearman C. Correlation calculated from
faulty data. British Journal of Psychology 1910;
3: 271-95.
28 Brown W. Some experimental results in the
correlation of mental abilities. British Journal
of Psychology 1910; 3: 296-322.
29 Donner A, Eliasziw M. Statistical implications
of the choice between a dichotomous or
continuous trait in studies of interobserver
agreement. Biometrics 1994; 50: 550-55.
30 Bland JM, Altman DG. A note on the use of
the intraclass correlation coefficient in the
evaluation of agreement between two methods
of measurement. Computers in Biology and
Medicine 1990; 20: 337-40.
31 Guggenmoos-Holzmann I. How reliable are
chance-corrected measures of agreement?
Statistics in Medicine 1993; 12: 2191-205.
32 Spitznagel EL, Helzer JE. A proposed solution
to the base rate problem in the kappa statistic.
Archives of General Psychiatry 1985; 42: 725-28.
33 Lindell MK, Brandt CJ. Measuring interrater
agreement for ratings of a single target.
Applied Psychological Measurement 1997; 21:
271-78.
34 James LR, Demaree RG, Wolf G. Estimating
within-group interrater reliability with and
without response bias. Journal of Applied
Psychology 1984; 69: 85-98.
35 James LR, Demaree RG, Wolf G. rwg: an
assessment of within-group rater agreement.
Journal of Applied Psychology 1993; 78: 306-309.
36 Kozlowski SWJ, Hattrup K. A disagreement
about within-group agreement: disentangling
issues of consistency versus consensus. Journal
of Applied Psychology 1992; 77: 161-67.
37 Mellenbergh GJ. Measurement precision in
test score and item response models.
Psychological Methods 1996; 1: 293-99.
38 Aickin M, Ritenbaugh C. Analysis of
multivariate reliability structures and the
induced bias in linear model estimation.
Statistics in Medicine 1996; 15: 1647-61.
39 Langenbucher J, Labouvie E, Morgenstern J. Measuring diagnostic agreement. Journal of Consulting and Clinical Psychology 1996; 64: 1285-89.
40 Bartko JJ. Measures of agreement: a single procedure. Statistics in Medicine 1994; 13: 737-45.
41 Blackman NJ-M, Koval JJ. Estimating rater agreement in 2 x 2 tables: correction for chance and intraclass correlation. Applied Psychological Measurement 1993; 17: 211-23.
42 Nickerson CAE. A note on 'a concordance correlation coefficient to evaluate reproducibility'. Biometrics 1997; 53: 1503-507.
43 Cantor AB. Sample-size calculations for Cohen's kappa. Psychological Methods 1996; 1: 150-55.
44 Fleiss JL, Cohen J, Everitt BS. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 1969; 72: 323-27.
45 Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Statistics in Medicine 1998; 17: 101-10.
46 Fisher RA. Statistical methods for research workers, 13th edition. Riverside, NJ: Hafner, 1958.
47 Donner A. Sample size requirements for the comparison of two or more coefficients of inter-observer agreement. Statistics in Medicine 1998; 17: 1157-68.
48 Donner A, Eliasziw M, Klar N. Testing the homogeneity of kappa statistics. Biometrics 1996; 52: 176-83.
49 Cohen J. The cost of dichotomization. Applied Psychological Measurement 1983; 7: 249-53.
50 Fleiss JL. Statistical methods for rates and proportions, 2nd edition. New York: John Wiley, 1981.
51 Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297-334.
52 Miller MB. Coefficient alpha: a basic introduction from the perspectives of classical test theory and structural equation modeling. Structural Equation Modeling 1995; 2: 258-73.
53 O'Grady KE, Medoff DR. Rater reliability: a maximum likelihood confirmatory factor-analytic approach. Multivariate Behavioral Research 1991; 26: 363-87.
54 Raykov T. Estimating composite reliability for congeneric measures. Applied Psychological Measurement 1997; 21: 173-84.
55 Raykov T. Scale reliability, Cronbach's coefficient alpha, and violations of essential tau-equivalence with fixed congeneric components. Multivariate Behavioral Research 1997; 32: 329-53.

56 Li H, Rosenthal R, Rubin DB. Reliability of
measurement in psychology: from
Spearman-Brown to maximal reliability.
Psychological Methods 1996; 1: 98-107.
57 Fagot RF. Reliability of ratings for multiple
judges: intraclass correlation and metric
scales. Applied Psychological Measurement 1991;
15: 1-11.
58 Fagot RF. An ordinal coefficient of relational
agreement for multiple judges. Psychometrika
1994; 59: 241-51.
59 Konishi S, Khatri CG, Rao CR. Inferences on
multivariate measures of interclass and
intraclass correlations in familial data. Journal
of the Royal Statistical Society 1991; 53: 649-59.
60 Mak TK. Analyzing intraclass correlation for
dichotomous variables. Journal of the Royal
Statistical Society 1988; 37: 344-52.
61 Bodian CA. Intraclass correlation for two-by-two tables under three sampling designs.
Biometrics 1994; 50: 183-93.
62 Fleiss JL, Shrout PE. Approximate interval
estimation for a certain intraclass correlation
coefficient. Psychometrika 1978; 43: 259-62.
63 Barchard K, Hakstian AR. The robustness of
confidence intervals for coefficient alpha
under violation of the assumption of essential
parallelism. Multivariate Behavioral Research
1997; 32: 169-91.
64 McClelland G, Nickerson C. E-mail on
ICC(2,1) confidence interval, 1997.
65 Lyles RH, Chambless LE. Effects of model
misspecification in the estimation of variance
components and intraclass correlation for
paired data. Statistics in Medicine 1995; 14:
1693-706.
66 Hale CA, Fleiss JL. Interval estimation under
two study designs for kappa with binary
classifications. Biometrics 1993; 49: 523-34.

67 Cornfield J. A statistical problem arising from
retrospective studies. In: Neyman J ed.
Proceedings of the third Berkeley symposium on
mathematical statistics and probability. Berkeley,
CA: University of California Press, 1956.
68 Flack VF. Confidence intervals for the
interrater agreement measure kappa.
Communications in Statistics - Theory and
Methods 1987; 16: 953-68.
69 Donner A, Eliasziw M. A goodness-of-fit
approach to inference procedures for the
kappa statistic: confidence interval
construction, significance-testing and sample
size estimation. Statistics in Medicine 1992; 11:
1511-19.
70 Bloch DA, Kraemer HC. 2 x 2 Kappa
coefficients: measures of agreement or
association. Biometrics 1988; 45: 269-87.
71 McGraw KO, Wong SP. Correction.
Psychological Methods 1996; 1: 390.
72 Donner A, Eliasziw M. A hierarchical
approach to inferences concerning
interobserver agreement for multinomial data.
Statistics in Medicine 1997; 16: 1097-106.
73 Commenges D, Jacqmin H. The intraclass
correlation coefficient: distribution free
definition and test. Biometrics 1994; 50: 517-26.
74 Shavelson RJ, Webb NM. Generalizability
theory: a primer. Newbury Park, CA: Sage,
1991.
75 Graham P. Modeling covariate effects in
observer agreement studies: the case of
nominal scale agreement. Statistics in Medicine
1995; 14: 299-310.
76 Tanner MA, Young MA. Modelling agreement
among raters. Journal of the American Statistical
Association 1985; 80: 175-80.
77 Fleiss JL, Shrout PE. Reliability considerations
in planning diagnostic validity studies. New York:
Guilford Press, 1989: 279-29.
