Fall 2011

“It is rather surprising that systematic studies of human abilities were not undertaken until the second half of the last century. . . An accurate method was available for measuring the circumference of the earth 2,000 years before the first systematic measures of human ability were developed.” —Nunnally and Bernstein, 1994

[Portraits: E. H. Weber (1795–1878), S. S. Stevens (1906–1973), L. L. Thurstone (1887–1955)]
Although it is often easy to tell whether a given object is taller than another object, it is harder to tell whether someone is more proficient in a given discipline than another person. Likewise, it is difficult to assess groupwise opinions by asking a random set of questions. Assessing one’s preference for a set of objects is also a difficult task. Why?

[Portraits: F. Galton (1822–1911), K. Pearson (1857–1936), L. L. Thurstone (1887–1955), L. J. Cronbach (1916–2001), A. Jensen (1923–); J. McKeen Cattell (1860–1944), C. Spearman (1863–1945), G. Rasch (1901–1980), F. M. Lord (1912–2000)]

Mostly because the preceding points call for a process that allows us to rate and scale the above objects of measurement. Moreover, judgments elicited this way will tend to yield subjective measures, unlike most physical measurements.
never − almost never ≡ always − almost always ? (does the interval between ‘never’ and ‘almost never’ equal that between ‘always’ and ‘almost always’?)

Measurement and instrument

An instrument is used to relate, or ‘map’, something observed in the real world to some other thing measured as part of a theory. The former is called a manifest variable, while the latter is usually subsumed under the term latent variable or factor.

Ultimately, we may want to build a measurement tool that shares the same characteristics as the ones shown below, or at least one that allows us to locate and compare individuals’ responses or objects’ properties using some metrical system.

Measurement is best understood as the task of assigning numbers to categories (Stevens, 1946), in order for the researcher to “provide a reasonable and consistent way to summarize the responses that people make to express their achievements, attitudes, or personal points of view through instruments such as attitude scales, achievement tests, questionnaires, surveys, and psychological scales.” (Wilson, 2005)
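The “numbers assigned to categories” idea can be sketched in a few lines. The category labels and integer codes below are purely illustrative (they are not taken from any instrument discussed above); the point is only the mapping from observed responses to numbers:

```python
# Measurement as assigning numbers to categories (Stevens, 1946).
# The 5-point labels and their codes are a hypothetical example.
LIKERT_CODES = {
    "never": 0,
    "almost never": 1,
    "sometimes": 2,
    "almost always": 3,
    "always": 4,
}

def score_responses(responses):
    """Map observed category labels (manifest variable) to numeric codes."""
    return [LIKERT_CODES[r] for r in responses]

print(score_responses(["never", "sometimes", "always"]))  # [0, 2, 4]
```

Note that summing or averaging such codes tacitly treats the categories as equally spaced, which is precisely the assumption questioned by “never − almost never ≡ always − almost always”.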
The CTT approach relies on a few simple mathematical formulations, with straightforward estimation of model parameters; it also yields scoring rules of practical interest.

IRT models, on the contrary, rely on a mathematical model whereby the probability of endorsing an item depends on both item and person locations (parameters) on the latent trait. For every item in a scale, a set of properties (item parameters) is estimated. The item slope, or discrimination parameter, describes how well the item performs in the scale, in terms of the strength of the relationship between the item and the scale. The item difficulty, or threshold, parameter(s) identifies the location along the construct’s latent continuum where the item best discriminates among individuals.

A well-known example is the Rasch model, which deals with binary responses and assumes that all items are equally discriminative. IRT models are more flexible and do not assume a fixed, axiomatic relationship between the observed and true scores. They allow for the joint scaling of person and item parameters when mapping the relationship between latent traits and responses to test items.

The use of IRT models is particularly interesting when the objective is to develop banks of calibrated items (Reeve et al., 2007), to deliver targeted and adaptive testing measures (Rebollo et al., 2010), or to study measurement invariance across groups of patients (Teresi, 2006).
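The role of the slope and difficulty parameters can be made concrete with the standard two-parameter logistic (2PL) item response function, P(X = 1 | θ) = 1 / (1 + exp(−a(θ − b))); the parameter values below are illustrative, not estimates from any real scale:

```python
import math

def irt_2pl(theta, a, b):
    """Probability of endorsing an item under a 2-parameter logistic model.
    theta = person location on the latent trait,
    a = discrimination (slope), b = difficulty (threshold).
    The Rasch model is the special case where a = 1 for every item.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the endorsement probability is exactly 0.5: the item
# discriminates best among individuals located near its difficulty.
print(round(irt_2pl(theta=0.0, a=1.5, b=0.0), 2))  # 0.5

# A larger slope makes the curve steeper around b, i.e. the item
# separates persons above and below b more sharply.
print(irt_2pl(1.0, a=2.0, b=0.0) > irt_2pl(1.0, a=0.5, b=0.0))  # True
```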
Another illustration: a probabilistic account of item responses

The following example is taken from Partchev, 2004. Let us consider an item like [figure: a circle of radius 3 cm, the examinee being asked for its area], with response options:

1. 9.00 cm²
2. 18.85 cm²
3. 28.27 cm²

The first of the three options is possibly the most naive one, the second is wrong but implies more advanced knowledge (the area is confused with the circumference), and the third is the correct one.
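The arithmetic behind the three distractors can be checked directly. Assuming, as the printed values suggest, a circle of radius 3 cm, each option corresponds to a distinct (mis)conception:

```python
import math

r = 3.0  # radius in cm, consistent with the three printed options

naive = r ** 2                   # squares the radius but forgets pi
circumference = 2 * math.pi * r  # confuses area with circumference
area = math.pi * r ** 2          # the correct formula

print(f"{naive:.2f} {circumference:.2f} {area:.2f}")  # 9.00 18.85 28.27
```

Under an IRT-style account, each wrong option is thus informative: which distractor an examinee picks reflects how far along the latent trait their partial knowledge has progressed.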
Measurement validity

[Diagram: facets of measurement quality — face validity; construct validity, comprising structural validity, hypotheses testing, and cross-cultural validity; criterion validity (concurrent validity, predictive validity); measurement error and reliability (test-retest, inter-rater, intra-rater); responsiveness; interpretability.]

Let’s consider the following definitions (Falissard, 2008):

• Content validity reflects the adequacy of the domains or dimensions spanned by the items;
• Criterion validity demonstrates that scales have empirical associations with external criteria, such as gold standards or other instruments purported to measure equivalent concepts;
• Construct validity relates inter-item and item-scale relationships from a theoretical point of view.
• repeated assessment of several subjects by the same rater or clinician;
• alternative assessment of a given subject by different raters;
• alternative administration of the same questionnaire or a parallel form;
• use of different subscales of a single questionnaire to infer one’s performance.

Reliability analysis aims at quantifying these variations, in order to provide an idea of the error of measurement.

• ‘Statistical’ significance evaluates the probability or likelihood of the sample results, with reference to a population where the null is exactly true; ‘practical’ and ‘clinical’ significance are closely related concepts reflecting the extent to which sample results diverge from the null (as measured by effect-size statistics) or the way treated patients may not be distinguished from control or normal ones.
• However, ‘statistical’, ‘practical’, and ‘clinical’ significance all stand on the assumption that scores are meaningful and reliable indicators of individuals’ performance. . .
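One standard way to quantify such variation, not named above but ubiquitous in CTT reliability analysis, is Cronbach’s alpha, α = k/(k−1) · (1 − Σ var(itemᵢ) / var(total)). A minimal sketch on toy data (the scores below are invented for illustration):

```python
def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score columns (one list per item).
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
    """
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Three items answered by four subjects (rows = items, columns = subjects);
# highly consistent responses, hence a high alpha.
scores = [[3, 4, 3, 5], [2, 4, 3, 5], [3, 5, 4, 5]]
print(round(cronbach_alpha(scores), 2))  # 0.96
```

Higher values indicate that the items covary strongly relative to the total-score variance, i.e. that less of the observed variation is attributable to measurement error.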
11. Herrmann, C. (1997). International experiences with the Hospital Anxiety and Depression Scale – a review of validation data and clinical results. Journal of Psychosomatic Research, 42(1), 17–41.
12. McCrae, R. and John, O. (1992). An introduction to the five-factor model and its applications. Journal of Personality, 60(2), 175–215.
13. McCrae, R. and Costa, P. (2004). A contemplated revision of the NEO Five-Factor Inventory. Personality and Individual Differences, 36(3), 587–596.
14. Wu, A., Hays, R., Kelly, S., Malitz, F. and Bozzette, S. (1997a). Applications of the Medical Outcomes Study health-related quality of life measures in HIV/AIDS. Quality of Life Research, 6(6), 531–554.
15. Wu, A., Revicki, D., Jacobson, D. and Malitz, F. (1997b). Evidence for reliability, validity and usefulness of the Medical Outcomes Study HIV Health Survey (MOS-HIV). Quality of Life Research, 6(6), 481–493.
16. Reeve, B., Hays, R., Bjorner, J., Cook, K. and Crane, P. et al. (2007). Psychometric evaluation and calibration of health-related quality of life item banks. Medical Care, 45(5), S22–S31. PMID: 17443115.
17. Rebollo, P., Castejón, I., Cuervo, J., Villa, G. and García-Cueto, E. et al. (2010). Validation of a computer-adaptive test to evaluate generic health-related quality of life. Health and Quality of Life Outcomes, 8, 147. PMID: 21129169.
18. Teresi, J. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, S152–S170.
19. Partchev, I. (2004). A visual guide to item response theory. Friedrich-Schiller-Universität Jena. URL: website.
20. Mokkink, L., Terwee, C., Patrick, D., Alonso, J. and Stratford, P. et al. (2010). The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. Journal of Clinical Epidemiology, 63, 737–745. PMID: 20494804.
21. Fayers, P. and Machin, D. (2000). Quality of Life: The Assessment, Analysis and Interpretation of Patient-Reported Outcomes. Wiley.
22. Falissard, B. (2008). Mesurer la subjectivité en santé. Perspective méthodologique et statistique. Masson.
23. Zumbo, B. (2007). Validity: Foundational issues and statistical methodology. In Rao, C. and Sinharay, S., editors, Handbook of Statistics, Vol. 26: Psychometrics, pages 45–79. Elsevier Science B.V.: The Netherlands.
24. Borsboom, D., Mellenbergh, G. and van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071.
25. Borsboom, D., Cramer, A., Kievit, R., Scholten, A. and Franic, S. (2009). The end of construct validity. In Lissitz, R., editor, The Concept of Validity: Revisions, New Directions, and Applications. Information Age Publishers.
26. Pilgrim, D. (2009). Key Concepts in Mental Health. 2nd edition. Sage.
27. Dunn, G. (2000). Statistics in Psychiatry. Hodder Arnold.
28. Thompson, B., editor (2003). Score Reliability: Contemporary Thinking on Reliability Issues. Sage Publications.