
Introduction to psychometrics
Measurement and instrument

Christophe Lalanne

Fall 2011

Outline

What is psychometrics?
Key concepts
Latent variable models
Items and scales
Measurement models
Reliability and validity

“It is rather surprising that systematic studies of human abilities were not undertaken
until the second half of the last century. . . An accurate method was available
for measuring the circumference of the earth 2,000 years before the first systematic
measures of human ability were developed.” (Nunnally and Bernstein, 1994)

What is psychometrics?

Psychometrics has evolved as a subfield of psychology to become the
science of measurement of unobservable individual characteristics. It
encompasses
• the construction and analysis of measurement instruments;
• the development of theoretical approaches to measurement.
It is sometimes opposed to clinimetrics, which encourages the use of
clinical expertise instead of formal statistical models, or even to
biometrics (de Vet et al., 2003).

Foometrics

“If Foo is a science then Foo often has both an area Foometrics
and an area Mathematical Foo. Mathematical Foo applies
mathematical modeling to the Foo subject area, while
Foometrics develops and studies data analysis techniques for
empirical data collected in Foo. Each of the social and behavioural
sciences has a form of Foometrics, although they may not all use a name
in this family.” (de Leeuw, 2006)

[Figure: Psychometrics and related fields (Educometrics, Sociometrics), with
associated techniques: MDS, 3Mode, IRT, CA, FA, SEM, HLM, LogLin.]

Some landmarks

E. H. Weber (1795–1878), G. T. Fechner (1801–1887), W. Wundt (1832–1920),
L. L. Thurstone (1887–1955), G. von Békésy (1899–1972), R. Likert (1903–1981),
S. S. Stevens (1906–1973)

F. Galton (1822–1911), K. Pearson (1857–1936), J. McKeen Cattell (1860–1944),
C. Spearman (1863–1945), L. L. Thurstone (1887–1955), G. Rasch (1901–1980),
F. M. Lord (1912–2000), L. J. Cronbach (1916–2001), A. Jensen (1923–)

Key concepts

We, human beings, like to make either relative or absolute judgments.
Although it is often easy to tell whether a given object is taller than
another object, it is harder to tell whether someone is more proficient
in a given discipline than another person. Likewise, it is difficult
to assess groupwise opinions by asking a random set of questions.
Assessing one’s preference for a set of objects is also a difficult task.
Why?

Mostly because the preceding points call for a process for rating and
scaling the object of measurement. Moreover, judgments elicited this way
tend to yield subjective measures, unlike most physical measurements.
Is never − almost never ≡ always − almost always?

Ultimately, we may want to build a measurement tool that shares the
same characteristics as physical measuring instruments, or at least one that
allows us to locate and compare individuals’ responses or objects’
properties using some metrical system.

Measurement and instrument

An instrument is used to relate, or ‘map’, something observed in
the real world to some other thing measured as part of a theory.
The former is called a manifest variable, while the latter is usually
subsumed under the term latent variable or factor.
Measurement is best understood as the task of assigning numbers to
categories (Stevens, 1946), in order for the researcher to “provide a
reasonable and consistent way to summarize the responses that people
make to express their achievements, attitudes, or personal points of
view through instruments such as attitude scales, achievement tests,
questionnaires, surveys, and psychological scales.” (Wilson, 2005)

Latent variable models

A good overview of the use of LVMs in biomedical research is provided
by Rabe-Hesketh and Skrondal, 2008.
For a more general framework, we can consider how latent and manifest
variables relate to each other (Bartholomew and Knott, 2011):

                                 Manifest variables
Latent variables    Metrical                   Categorical
Metrical            Factor analysis            Latent trait analysis
Categorical         Latent profile analysis    Latent class analysis

The four building blocks (Wilson, 2005)

• A Construct map features a coherent and substantive definition for
the content of the construct, which is composed of an underlying
continuum (for ordering respondents and/or item responses).
• Items design deals with the standardized construction of items
that are supposed to stimulate responses, assimilable to observations
about the construct.
• The Outcome space is the set of well-defined categories, finite and
exhaustive, ordered, context-specific, and research-based.
• A Measurement model is needed in order to relate the scored
outcomes from the items design and the outcome space back to
the construct that was the original inspiration of the items.
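
As a small aid to memory, the classification of latent variable models by variable type attributed above to Bartholomew and Knott (2011) can be written as a lookup table. This is only a sketch: the function name and the lowercase string labels are illustrative choices, not part of the original slides.

```python
# Classification of latent variable models by type of latent and manifest
# variables, as tabulated in the text (Bartholomew and Knott, 2011).
# Names and labels below are illustrative assumptions.
LVM_FAMILY = {
    # (latent type, manifest type): model family
    ("metrical", "metrical"): "factor analysis",
    ("metrical", "categorical"): "latent trait analysis",
    ("categorical", "metrical"): "latent profile analysis",
    ("categorical", "categorical"): "latent class analysis",
}


def lvm_family(latent: str, manifest: str) -> str:
    """Return the model family for a (latent, manifest) pair of variable types."""
    return LVM_FAMILY[(latent.lower(), manifest.lower())]


# Continuous latent trait measured by categorical (e.g. binary) items.
print(lvm_family("metrical", "categorical"))
```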

Self-reported measures

What is an item?

An item constitutes the fundamental unit of measurement, and it can
be considered as a criterion-based way to measure a psychological construct
that is otherwise unobservable (e.g. reading fluency, specific cognitive ability,
quality of life, personality traits).
Example: Below are nine items from the MOS 36-item short form,
scored 1 (‘All of the Time’) to 6 (‘None of the Time’) points.

From items to scale

Aggregating items together leads to a scale, which is often confused
with the dimension it is assumed to reflect. Provided it only measures
a single construct (unidimensionality), we can assign scores to
patterns of observed responses (e.g., summated-scale score) to infer
one’s location on the underlying latent trait, and more generally to rank
items or individuals on this scale.

Two desired properties of a scale are unidimensionality and local
independence. That is,
• The instrument should measure a single construct which accounts
for one’s response to a specific item given their location on the latent
trait (and not other factors);
• If the latent variable is maintained at a given level, all the item
responses should be independent.

Quoting Falissard, 2006, “A set of items is unidimensional if there exists
a variable (often called a latent variable, as this variable may not
be observed) which ‘explains’ all the correlations observed between
the items”, but conceiving a single construct as underlying a broad
concept (e.g., depression) might be misleading, see e.g. (Borsboom,
2006). Factor analytic methods can be used to study the dimensionality
of an instrument, and provide much better information than
single-number summaries, which are often inappropriately used (e.g.,
Cronbach’s alpha). Local independence is required in many models,
including IRT or LCA models.
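
To make the summated-scale score and Cronbach’s alpha mentioned above concrete, here is a minimal sketch using only the Python standard library. The response matrix is made-up illustrative data (5 respondents, 4 items), and the function names are assumptions. In line with the caution above, alpha is a single-number summary and says nothing by itself about dimensionality.

```python
from statistics import variance


def sum_scores(responses):
    """Summated-scale score: one total per respondent (row)."""
    return [sum(row) for row in responses]


def cronbach_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(responses[0])  # number of items
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance(sum_scores(responses))
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: rows are respondents, columns are items scored 1 to 5.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

print("sum scores:", sum_scores(responses))
print("alpha: %.3f" % cronbach_alpha(responses))
```

A high alpha for these data would only indicate that the items covary strongly; a factor analysis would still be needed to study how many dimensions underlie them.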

Examples of multidimensional instruments

• The HADS includes two scales (2 × 7 items, 21 points max.) for
exploring anxiety and depression-related symptoms (Zigmond and
Snaith, 1983 and Herrmann, 1997).
• The NEOPI-R is a personality inventory (240 or 60 items) relying
on the five-factor model (extraversion, agreeableness, conscientiousness,
neuroticism, and openness; 6 facets) (McCrae and
John, 1992 and McCrae and Costa, 2004).
• The MOS-HIV for HIV/AIDS includes 35 items assessing: general
health perceptions, pain, physical functioning, role functioning,
social functioning, energy/fatigue, mental health, health distress,
cognitive function, and quality of life (Wu et al., 1997a, 1997b).
A list of existing HRQL instruments is available on http://www.proqolid.org/.

Two common measurement models

The classical test theory assumes that there exists a true score, defined
as the expected response of an individual following repeated
independent administration of the same test (error-free measure). But
we only have access to

observed score = true score + some error

Although it has long been considered as a principled method in measurement,
it suffers from different flaws, well reviewed in (Borsboom,
2006). Connected conceptual issues are reliability of test scores and
item analysis. Here, a respondent having a ‘higher amount of the
latent trait’ will have a higher score, usually computed as the sum of
his responses.
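
The decomposition observed score = true score + error described above can be illustrated with a short simulation. This is a sketch under stated assumptions, none of which come from the slides: true scores and errors are drawn as independent normal variables, the means and standard deviations are arbitrary, and reliability is taken in the usual classical sense as the ratio of true-score variance to observed-score variance.

```python
import random
from statistics import pvariance


def simulate_ctt(n_persons=10000, true_sd=10.0, error_sd=5.0, seed=2011):
    """Simulate true and observed scores under observed = true + error.

    All parameter values are arbitrary illustrative assumptions.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_persons)]
    # Errors are independent of the true score and have mean zero.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


true_scores, observed = simulate_ctt()

# Reliability as the share of observed variance due to true scores.
# With the values above it should be close to 10**2 / (10**2 + 5**2) = 0.8.
reliability = pvariance(true_scores) / pvariance(observed)
print("estimated reliability: %.3f" % reliability)
```

In practice the true scores are never observed, so this ratio has to be estimated indirectly, for example from repeated administrations or from the internal structure of the test.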

The CTT approach involves a few simple mathematical formulations,
with straightforward estimation of model parameters; it also
yields scoring rules of practical interest.

IRT models, on the contrary, rely on a mathematical model whereby
the probability of endorsing an item depends on both item and person
location (parameters) on the latent trait. For every item in a scale, a
set of properties (item parameters) is estimated. The item slope or
discrimination parameter describes how well the item performs in the
scale in terms of the strength of the relationship between the item
and the scale. The item difficulty or threshold parameter(s) identifies
the location along the construct’s latent continuum where the item
best discriminates among individuals.

A well-known example is the Rasch model, which deals with binary
responses and assumes that all items are equally discriminative. IRT
models are more flexible and do not assume an axiomatically
fixed relationship between the observed and true score. They allow
for the joint scaling of person and item parameters when mapping the
relationship between latent traits and responses to test items.

The use of IRT models is particularly interesting when the objective
is to develop banks of calibrated items (Reeve et al., 2007), to deliver
targeted and adaptive testing measures (Rebollo et al., 2010), or
to study measurement invariance across groups of patients (Teresi,
2006).
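
The roles of the difficulty and discrimination parameters described above can be sketched with the standard logistic item response function. The function name and the parameter values below are illustrative assumptions; with the discrimination held at 1 for every item, the function reduces to the Rasch model for binary responses.

```python
import math


def item_response_prob(theta, difficulty, discrimination=1.0):
    """Probability of endorsing an item under a logistic IRT model.

    theta          : person location on the latent trait
    difficulty     : item location on the same continuum
    discrimination : item slope; fixed at 1.0 for all items in the Rasch model
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical item of difficulty 0.5, evaluated for three persons.
for theta in (-1.0, 0.5, 2.0):
    rasch = item_response_prob(theta, difficulty=0.5)
    steep = item_response_prob(theta, difficulty=0.5, discrimination=2.0)
    print("theta=%+.1f  Rasch=%.3f  slope 2=%.3f" % (theta, rasch, steep))
```

When the person location equals the item difficulty, the probability is exactly 0.5 whatever the slope; a larger slope makes the curve steeper around that point, which is where the item separates individuals best.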
Another illustration

A probabilistic account of item responses

The following example is taken from Partchev, 2004.
Let us consider an item like

What is the area of a circle having a radius of 3 cm?

1. 9.00 cm²
2. 18.85 cm²
3. 28.27 cm²

The first of the three options is possibly the most naive one, the
second is wrong but implies more advanced knowledge (the area is
confused with the circumference), and the third is the correct one.
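
The three response options can be checked with a few lines of arithmetic. Reading the first option as the radius squared (π forgotten) is an interpretation rather than something stated in the source; the second is the circumference 2πr, as noted above, and the third is the area πr².

```python
import math

radius_cm = 3.0

# Option 1: assumed here to be r squared, i.e. the area formula without pi.
naive = radius_cm ** 2
# Option 2: the circumference 2*pi*r, mistaken for the area.
circumference = 2 * math.pi * radius_cm
# Option 3: the correct area pi*r^2.
area = math.pi * radius_cm ** 2

print("option 1: %.2f" % naive)
print("option 2: %.2f" % circumference)
print("option 3: %.2f" % area)
```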

COSMIN Taxonomy (Mokkink et al., 2010)

[Figure: COSMIN taxonomy of measurement properties.
Reliability: internal consistency, reliability (a), measurement error (a).
Validity: content validity (face validity), construct validity (structural
validity, hypotheses testing, cross-cultural validity), criterion validity (b).
Responsiveness: responsiveness.
Interpretability.
(a) test-retest, inter-rater, intra-rater
(b) concurrent validity, predictive validity]

The many facets of validity

“Validation of instruments is the process of determining whether there are
grounds for believing that the instrument measures what it is intended to
measure, and that it is useful for its intended purpose.” (Fayers and Machin,
2000)

Let’s consider the following definitions: (Falissard, 2008)
• Content validity reflects the adequacy of the domains or dimensions
spanned by the items;
• Criterion validity demonstrates that scales have empirical association
with external criteria, such as gold standards or other instruments
purported to measure equivalent concepts;
• Construct validity examines inter-item and item-scale relationships
from a theoretical point of view.

Convergent and discriminant validities are both subsumed in the general
and theoretical concept of construct validity. A critical appraisal
of the concept of validity can be found in (Zumbo, 2007 and Borsboom
et al., 2004, 2009).
A questionnaire is valid to the extent that test scores are interpreted
in relation to the construct that the questionnaire purports to assess.
An alternative (and more elegant) definition of construct validity is
that it “is a property of measurement instruments that codes whether
these instruments are sensitive to variation in a targeted attribute”
(Borsboom et al., 2009).

Some caveats about clinical validity

In clinical settings, the validity of a medical diagnosis requires a clear
aetiology. However, most functional diagnoses of mental health-related
disorders (e.g. personality disorder, schizophrenia) are defined
in a circular fashion: the diagnosis is made on the basis of symptoms
and the symptoms are accounted for by the diagnosis.
Likewise, although most psychiatrists will agree on what depression is,
demonstrating categorically that clinical depression differs from dysphoria
or everyday unhappiness is nearly impossible (Pilgrim, 2009).
Finally, the problems of valid and reliable case identification in psychiatric
epidemiology remain unsolved because no physical causes are
generally associated with the disease under consideration.
Different types of reliability

The sources of score variability may be very different, e.g. (Dunn,
2000)

• repeated assessment of several subjects by the same rater or clinician;
• alternative assessment of a given subject by different raters;
• alternative administration of the same questionnaire or a parallel
form;
• use of different subscales of a single questionnaire to infer one’s
performance;

Reliability analysis aims at quantifying these variations, in order to
provide an idea of the error of measurement.

Reliability vs. significance

A note of caution: One must distinguish between reliability and significance:
(Thompson, 2003)

• ‘statistical’ significance evaluates the probability or likelihood of
the sample results, with reference to a reference population where
the null is exactly true; ‘practical’ and ‘clinical’ significance are
closely related concepts underlying the extent to which sample
results diverge from the null (as measured by effect size statistics)
or the way treated patients may not be distinguished from control
or normal ones.

• However, ‘statistical’, ‘practical’, and ‘clinical’ significance all
stand on the assumption that scores are meaningful and reliable
indicators of individuals’ performance. . .
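
As one concrete instance of the reliability analysis described above, agreement between two raters assessing the same subjects can be summarised with Cohen’s kappa, which corrects the observed agreement for the agreement expected by chance. The choice of kappa and the ratings below are illustrative assumptions; the source does not name a particular statistic.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters classifying the same subjects.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined (division by zero) if both raters always use one same category.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from the product of each rater's marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical diagnoses given by two clinicians to the same eight patients.
clinician_1 = ["case", "case", "non-case", "non-case",
               "case", "non-case", "non-case", "case"]
clinician_2 = ["case", "non-case", "non-case", "non-case",
               "case", "non-case", "case", "case"]

print("kappa: %.3f" % cohen_kappa(clinician_1, clinician_2))
```

Other sources of variability listed above (test-retest, parallel forms) would call for other statistics, such as a correlation or an intraclass correlation coefficient.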

Construction of a questionnaire

[Figure: stages of questionnaire construction and validation, shown in
correspondence with RCT phases II, III and IV. Labels: test construction,
test administration, follow-up; field testing, large-scale testing; content
validity, concurrent validity, construct validity, face validity,
responsiveness; scores reliability.]

Bibliography

1 Nunnally, J. and Bernstein, I. (1994). Psychometric Theory. 3rd edition. McGraw-Hill.
2 de Vet, H., Terwee, C. and Bouter, L. (2003). Clinimetrics and psychometrics: two sides of the same coin. Journal of Clinical Epidemiology, 56, 1146–1147. DOI: 10.1016/j.jclinepi.2003.08.010.
3 de Leeuw, J. (2006). R in Psychometrics and Psychometrics in R. In The R User Conference, UseR! 2009. R Foundation for Statistical Computing.
4 Stevens, S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
5 Wilson, M. (2005). Constructing measures. An item response modeling approach. Taylor & Francis Group.
6 Rabe-Hesketh, S. and Skrondal, A. (2008). Classical latent variable models for medical research. Statistical methods in medical research.
7 Bartholomew, D. and Knott, M. (2011). Latent Variable Models and Factor Analysis: A Unified Approach. 3rd edition. Wiley.
8 Falissard, B. (2006). The unidimensionality of a psychiatric scale: a statistical point of view. International Journal of Methods in Psychiatric Research, 8 (3), 162–167. DOI: 10.1002/mpr.66.
9 Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71 (3), 425–440. DOI: 10.1007/s11336-006-1447-6.
10 Zigmond, A. and Snaith, R. (1983). The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica, 67 (6), 361–370.

11 Herrmann, C. (1997). International experiences with the hospital anxiety and depression scale: a review of validation data and clinical results. Journal of Psychosomatic Research, 42 (1), 17–41.
12 McCrae, R. and John, O. (1992). An introduction to the five-factor model and its applications. Journal of Personality, 60 (2), 175–215.
13 McCrae, R. and Costa, P. (2004). A contemplated revision of the NEO five-factor inventory. Personality and Individual Differences, 36 (3), 587–596.
14 Wu, A., Hays, R., Kelly, S., Malitz, F. and Bozzette, S. (1997a). Applications of the medical outcomes study health-related quality of life measures in HIV/AIDS. Quality of Life Research, 6 (6), 531–554.
15 Wu, A., Revicki, D., Jacobson, D. and Malitz, F. (1997b). Evidence for reliability, validity and usefulness of the medical outcomes study HIV health survey (MOS-HIV). Quality of Life Research, 6 (6), 481–493.
16 Reeve, B., Hays, R., Bjorner, J., Cook, K. and Crane, P. et al. (2007). Psychometric evaluation and calibration of health-related quality of life item banks. Medical Care, 45 (5), S22–S31. PMID: 17443115.
17 Rebollo, P., Castejón, I., Cuervo, J., Villa, G. and García-Cueto, E. et al. (2010). Validation of a computer-adaptive test to evaluate generic health-related quality of life. Health and Quality of Life Outcomes, 8, 147. PMID: 21129169.
18 Teresi, J. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, S152–S170.
19 Partchev, I. (2004). A visual guide to item response theory. Friedrich-Schiller-Universität Jena. URL: website.
20 Mokkink, L., Terwee, C., Patrick, D., Alonso, J. and Stratford, P. et al. (2010). The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. Journal of Clinical Epidemiology, 63, 737–745. PMID: 20494804.
21 Fayers, P. and Machin, D. (2000). Quality of life. The assessment, analysis and interpretation of patient-reported outcomes. Wiley.
22 Falissard, B. (2008). Mesurer la subjectivité en santé. Perspective méthodologique et statistique. Masson.
23 Zumbo, B. (2007). Validity: Foundational issues and statistical methodology. In Rao, C. and Sinharay, S., editors, Handbook of Statistics, Vol. 26: Psychometrics, pages 45–79. Elsevier Science B.V.: The Netherlands.
24 Borsboom, D., Mellenbergh, G. and Heerden, J. v. (2004). The concept of validity. Psychological Review, 111, 1061–1071.
25 Borsboom, D., Cramer, A., Kievit, R., Scholten, A. and Franic, S. (2009). The end of construct validity. In Lissitz, R., editor, The concept of validity: Revisions, new directions, and applications. Information Age Publishers.
26 Pilgrim, D. (2009). Key Concepts in Mental Health. 2nd edition. Sage.
27 Dunn, G. (2000). Statistics in Psychiatry. Hodder Arnold.
28 Thompson, B., editor (2003). Score Reliability. Contemporary Thinking on Reliability issues. Sage Publications.


Made with ConTEXt version 2011.05.18 18:04 (pdfTEX version 140),


psychometrics.tex, 285c54e on 2011/10/14
