Psychological Assessment
1995, Vol. 7, No. 3, 300-308

Copyright 1995 by the American Psychological Association, Inc.


1040-3590/95/$3.00

Methodological Considerations in the Refinement of Clinical Assessment Instruments

Gregory T. Smith and Denis M. McCarthy
University of Kentucky
Instrument refinement refers to any set of procedures designed to improve an instrument's representation of a construct. Though often neglected, it is essential to the development of reliable and valid measures. Five objectives of instrument refinement are proposed: identification of measures' hierarchical or aggregational structure, establishment of internal consistency of unidimensional facets of measures, determination of content homogeneity of unidimensional facets, inclusion of items that discriminate at the desired level of attribute intensity, and replication of instrument properties on an independent sample. The use of abbreviated scales is not recommended. The refinement of behavioral observation procedures is discussed, and the role of measure refinement in theory development is emphasized.

Imagine a hypothetical manuscript that introduces Instrument A, a measure of Self-Centeredness. Suppose the author of the manuscript reports that he or she wrote items to be indicators of that underlying construct, administered them to a sample, obtained a coefficient alpha of .75, and found that scores on the measure correlated significantly with scores on measures of narcissism (r = .53) and hostile relationships (r = .25), but nonsignificantly with social avoidance (r = .11). The author concludes that the new measure is reliable and has construct validity.
This hypothetical report is in fact highly representative of manuscripts submitted to Psychological Assessment (see below). Unfortunately, however, the report's inattention to many aspects of the psychometric properties of the new measure creates a number of potential difficulties. We propose five objectives of clinical assessment instrument refinement that are not met in this hypothetical example; each will be discussed in more detail in subsequent sections of this article. To introduce these objectives, suppose that, unbeknown to the author, Instrument A actually measures three unidimensional, moderately correlated facets of Self-Centeredness: Low Empathy, Internal Self-Preoccupation, and Self-Aggrandizement. (As we describe below, such a situation could certainly occur.) The five objectives are listed here.
1. Identification of the measure's hierarchical or aggregational structure. Failure to identify this measure's three facets or unidimensional constructs could lead to inaccurate specifications of theory as well as misleading correlational and experimental findings. For example, certain subscales might correlate with a given criterion whereas others might not. In such a case, the aggregate measure's correlation with the criterion might reflect the average of two or three different effects.
2. Establishment of internal consistency of the instrument's unidimensional facets. Failure either to investigate or to take steps to improve internal consistency of the instrument's facets could result in unnecessarily high levels of error variance in the measurement of each unidimensional construct. 1
3. Determination of content homogeneity of each unidimensional facet. In the absence of independent, expert ratings of content validity, it is quite possible that some items reflect correlates of the target construct but are not prototypic of it, that some aspects of the target construct are under- or overrepresented in the scale composition, and that the wording of some items introduces variance unrelated to the target construct.
4. Inclusion of items that discriminate among participants at the desired level of intensity of the attribute. For example, if the measure is intended to identify individuals with clinical levels of self-absorption, it may require more extreme items than if it were intended to differentiate among normal-range individuals for the purpose of general population-based theory development.
5. Replication of the psychometric properties of the measure across independent samples. Without evidence of replication, it is quite possible that items will intercorrelate less highly
in a new sample, that the factor structure will prove to vary as a
function of the target population, or that modifications made
based on sample statistics will not have the same effect on the
measure in a new sample.
The purpose of instrument refinement is to address these issues. Each is crucial for specifying precisely the construct being measured and for ensuring that the instrument represents that construct accurately (Foster & Cone, 1995). Each is therefore a prerequisite to further "elaborative validity" analyses that investigate the construct's relation to other variables (Foster & Cone, 1995).

Gregory T. Smith and Denis M. McCarthy, Department of Psychology, University of Kentucky.
We are grateful to Michael T. Nietzel for his many helpful comments on a draft of this article.
Correspondence concerning this article should be addressed to Gregory T. Smith, Department of Psychology, University of Kentucky, 115 Kastle Hall, Lexington, Kentucky 40506-0044. Electronic mail may be sent via Internet to psy247@ukcc.uky.edu.

1 We use the terms unidimensional facet and unidimensional construct interchangeably to refer to factors that cannot be decomposed into noncollinear subfactors.



Broadly defined, then, instrument (measure) refinement refers to any set of procedures performed on an instrument designed to improve its representation of a construct. This
definition includes procedures applied as part of the original construction and refinement process. It may also include steps taken to modify an instrument for some new use, as with a different population. Of course, there is no clear boundary between construction and refinement; the two are indeed aspects of an ongoing process (see Clark & Watson, 1995). We focus on refinement specifically because of the widespread failure to appreciate the need to modify instruments after their initial construction.
In the following sections of this article, we present a review of a sample of manuscripts submitted to Psychological Assessment to illustrate the state of instrument refinement practices; present a more detailed conceptual and technical discussion of the five refinement objectives outlined earlier; highlight additional, neglected issues in instrument refinement; discuss issues in the refinement of behavioral observation procedures that follow from the same general principles; and discuss the role of instrument refinement in the ongoing process of developing theory and sharpening construct definition.
Current Instrument Refinement Practice: Neglect and Imprecision
Despite its fundamental importance, instrument refinement is simply neglected by most researchers. Where not neglected, it is often undertaken with little rigor and with little apparent thought relating the nature and purpose of an instrument to a specified set of procedures for its evaluation and refinement.
For this article, we evaluated 76 manuscripts submitted for
publication to Psychological Assessment from 1990 to 1994 for
the use and appropriateness of instrument refinement methods. 2 For each manuscript, we first judged whether refinement
was appropriate to the goals of the article. Manuscripts for
which refinement was appropriate were then rated on whether
each of the five refinement objectives described earlier was itself
appropriate to the study and whether some attempt was made
to meet each appropriate objective. For each objective, this process yielded a percentage of cases in which a procedure would
have been appropriate but was not used. In addition, we rated
manuscripts on whether an attempt was made to assess discriminant validity (an oft-neglected issue), and we performed two
global ratings: whether more than one step in the refinement
process was performed and whether analyses, when performed,
were appropriate to the target measure and its conceptualization. Ratings represent the consensus reached by the two authors in each case.
For 16 manuscripts (21%), refinement was not judged to be appropriate to the goals of the article. This category included studies in which, for example, an experimental manipulation was performed and the dependent variable was a standardized instrument or the validity of a published measure was further evaluated on a sample common to the standardization sample (e.g., Millon Clinical Multiaxial Inventory [MCMI] profiles of personality-disordered participants). Data from the remaining 60 manuscripts are presented in Table 1.

Table 1
Rate of Failure in the Application of Refinement Procedures

Procedure                                    Percentage not using procedure when appropriate
Dimensionality assessment                                     31
Item-level statistical analysis                               64
Content-based item-level analysis                             85
Analysis of item difficulty level                            100
Independent sample replication                                68
Assessment of discriminant validity                           71
More than one refinement step                                 85
Use of appropriate refinement procedures                      50

Note. Based on N = 76 original submissions to Psychological Assessment during 1990-1994, sampled from the files of an associate editor to approximate acceptance rates. Numbers are for cases in which a procedure was judged appropriate but not applied. Examples of refinement procedures include factor analysis, examination of item-total correlations, expert ratings of content validity, examination of frequency of item endorsement, replication of factor analyses, convergent and discriminant validity correlations, examination of both item-total correlations and content validity ratings, and use of factor analysis to assess unidimensionality versus multidimensionality.

We adopted a very permissive stance toward assigning ratings of refinement shortcomings; for each category we describe the threshold for rating and give typical examples.

Identification of Hierarchical or Aggregational Structure


This objective refers to identifying the presence of subfactors or superordinate factors in an instrument, a task most commonly undertaken by means of factor analysis (Floyd & Widaman, 1995). Both exploratory and confirmatory factor analytic techniques are integral to this process (Hoyle, 1991; Hoyle & Smith, 1994). Dimensionality assessment is part of instrument refinement in two senses: the interpretation of an instrument's scores may change when its dimensionality is uncovered, and definition of instrument structure is a prerequisite to subsequent instrument refinement. Appropriate cases were rated as successful if any dimensionality analysis was performed, whether or not a rationale for the selected analysis was presented or was convincing: 31% of submissions failed this standard. The typical failing case was one in which investigators reported a coefficient alpha and apparently presumed the value indicated unidimensionality without testing that assumption, as described in the hypothetical manuscript discussed earlier. As we note later, numerous investigators have shown that acceptable alpha levels can be obtained by aggregating distinct but correlated subscales.

2 One goal of this special issue is to improve the overall quality of clinical assessment research, as reflected in the quality of submissions to Psychological Assessment; accordingly, we sampled manuscripts to approximate the journal's acceptance rate. The sample was obtained from one of the journal's associate editors. Ten of the 76 manuscripts were ultimately accepted for publication; all manuscripts were original (not revised) submissions.


Establishment of Internal Consistency of Unidimensional Facets

For a scale to measure a unidimensional construct, its items must be parallel, alternative indicators of the same underlying construct. This property should be reflected in an empirical finding that item scores are highly intercorrelated: the tendency to endorse one construct indicator should be associated with the tendency to endorse other, alternative construct indicators. Because no author can expect to produce uniformly optimal indicators at the outset, empirically based item inclusion and exclusion are usually crucial to establishing or enhancing a scale's reliability and validity. Empirical indicators of item consistency include item-total correlations; interitem correlations; alpha value changes when items are deleted; and, in the case of factor analysis, factor loadings. Cases were rated as successful if they gave any indication that they examined item-level statistics; it was not necessary to have deleted items or added parallel items. "All item-total correlations were satisfactory," "All item-total correlations were above .30," "Items with factor loadings greater than .30 were retained for subscales," "All scale items had factor loadings greater than .45," or "Items with the lowest factor loadings were dropped from the scale" all were rated as successes. Despite this generous standard, 64% of cases failed. Typical failing cases reported a coefficient alpha but appeared to have looked no further into their data; no mention was made of any other item analysis. Failing factor analysis cases typically stated that scales consisted of items loading on each factor, and nothing more. (Very few such cases reported the important step of examining items for simple structure.) Almost no cases reported item deletion or modification followed by replication.

Determination of Content Homogeneity of Unidimensional Facets

It is also important in representing a unidimensional construct faithfully that item content be judged to represent the content domain of the construct accurately. Cases were rated as successful if they gave any indication that they analyzed items systematically for potential exclusion or addition based on content. (It was not necessary to have described an exemplary process of replicated ratings by independent, trained raters.) As Table 1 shows, 85% of submissions failed this standard. In the typical failing cases, investigators apparently relied either on coefficient alpha or on an interpretable factor solution to define their scales; they took no steps to distinguish parallel items from those that correlate but reflect different content.

Inclusion of Items That Discriminate Among Participants at the Desired Level of Attribute Intensity

Some measures are constructed with the goal of differentiating extreme, clinical cases from others; other measures are meant to tap individual differences within the normal population; and many are meant to perform both tasks. To assess the attribute intensity level at which an item discriminates, the proportion of individuals endorsing it within different populations should be examined. Cases would have been rated successful if they had reported any attempt to examine frequency of item endorsement; it was not necessary to have reported adding, deleting, or altering items to attain desired endorsement proportions. Even so, no case passed this criterion: no studies included reports of item endorsement frequencies, much less showed recognition that this property varies as a function of the population studied (e.g., clinical vs. normal).

Replication on an Independent Sample


This category was initially defined as "replicating measure refinements on an independent sample," but we found so few cases that reported both refinement and multiple samples that we had to broaden the category. Here, any indication of replication was rated as successful, including reporting exploratory factor analyses on two samples with no quantitative index of factor congruence. (Very few cases reported the more rigorous process of conducting item-level refinements on one sample and then replicating measure performance on a second sample.) Sixty-eight percent of cases failed this standard. Typical failures reported neither refinement nor replication. Other cases performed extensive confirmatory factor analysis model respecifications to obtain an acceptable solution without replicating the respecified solution on a new sample.

Assessment of Discriminant Validity


A great deal of measure refinement involves differentiating one construct from another and developing sufficiently specific measures (cf. Clark & Watson, 1991; Watson, Clark, et al., 1995; Watson, Weber, et al., 1995), so we included discriminant validity analyses as a component of successful refinement. Cases were rated successful on this criterion if they reported any tests of discriminant validity, including either reports that some factors correlated with a criterion but others did not or more rigorous, convergent and discriminant validity investigations. Seventy-one percent failed this standard. Typical failures included reports of convergent correlations with similar scales only or reports in which all factors from a factor analysis correlated comparably with a criterion, with no other validity analyses reported.

Performance of More Than One Step in the Refinement Process

On this aggregate index, articles were rated as successes if they performed two or more steps in the refinement process, such as deleting items based on a factor analysis in one sample and then replicating the factor structure with the remaining items on a second sample. With this index we attempted to identify the percentage of submissions meeting a more realistic basic standard than the generous individual categories we used. (Still, the refinement process did not have to be rated comprehensive.) Eighty-five percent of cases failed to report two or more refinement-related analyses. Most of the typical failing cases have been described already: factor analysis-based refinement without replication and factor analysis without item-level analysis are two examples.


Use of Appropriate Refinement Procedures


For cases in which some refinement procedures were used, we rated whether the procedures were conducted properly or were appropriate to the target measure. Fifty percent of cases failed this criterion. Procedures judged inappropriate were varied. Some investigators reported analyses on multiple samples without conducting any replication analyses; for example, they conducted a factor analysis on one sample, assessed reliability on a second sample, and performed validity correlations on a third sample. Item analyses were not performed, and the opportunity to replicate analyses was missed. Other investigators computed coefficient alpha and item-level statistics, deleted items, and then concluded that the resulting acceptable alpha demonstrated that the measure was unidimensional. Other investigators deleted items that correlated negatively with their scale without investigating whether finding a negative correlation between two presumed parallel indicators revealed ambiguity in the construct definition.
Clearly, these findings point to significant shortcomings in researchers' application of standard, well-known instrument refinement strategies. These shortcomings are likely to have a significant impact on the validity of clinical research. When reliability is not improved over modest levels, the result is to
attenuate estimates of validity coefficients or causal relationships. More broadly, when the content and correlational properties of items are not examined, and when measure dimensionality is not established through analysis and replication,
constructs are likely to be imprecisely defined. Investigators
may be using instruments that do not accurately operationalize
a target construct, they may be studying constructs that are really aggregates of more than one subconstruct, they may be introducing method variance through poorly worded items, and
they may be correlating measures with different names that actually include overlapping content. The potential for misleading
findings is great.
Too often researchers appear to settle for reporting an alpha or a factor analysis as if checking off points on a list of standard analyses. In the following sections, we offer a sample of conceptual considerations and methodological recommendations to help guide instrument refinement research. (For extended discussions of technical issues, see Anastasi, 1988, and Nunnally & Bernstein, 1994.)
Conceptual Considerations in Instrument Refinement

Levels of Hierarchy or Aggregation of Constructs


Return to our hypothetical measure of Self-Centeredness, Instrument A, that consisted of three factors despite an overall coefficient alpha of .75. As we noted earlier, such a situation is not only possible, it is quite common. Because coefficient alpha is influenced by both internal consistency and scale length, it can be high when two internally consistent subscales, themselves only modestly interrelated, are combined (Briggs & Cheek, 1986; Clark & Watson, 1995; Green, Lissitz, & Mulaik, 1977; McDonald, 1981).
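The arithmetic behind this point is easy to demonstrate. The following is a minimal simulation sketch (ours, not the authors'; it assumes only NumPy and entirely made-up data) in which two internally consistent facets whose latent scores correlate only .30 still yield a composite alpha well above conventional cutoffs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def cronbach_alpha(items):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

# Two internally consistent facets whose latent scores correlate only r = .30.
f1, f2 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n).T
facet_a = np.column_stack([f1 + rng.normal(0, 1, n) for _ in range(5)])
facet_b = np.column_stack([f2 + rng.normal(0, 1, n) for _ in range(5)])

print(cronbach_alpha(facet_a))                        # high within-facet alpha
print(cronbach_alpha(facet_b))                        # high within-facet alpha
print(cronbach_alpha(np.hstack([facet_a, facet_b])))  # remains high despite two distinct facets
```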
It is easy to imagine that the facets of Self-Centeredness (Low Empathy, Internal Self-Preoccupation, and Self-Aggrandizement) might have different patterns of relationships with the criterion scales of narcissism, hostile relationships, and social avoidance that were obscured by using the aggregate variable. Social avoidance could be related to Internal Self-Preoccupation but unrelated to Low Empathy and Self-Aggrandizement, producing the nonsignificant correlation between social avoidance and the aggregate measure. Similarly, a modest correlation between the aggregate Self-Centeredness score and hostile relationships could be due to the latter variable having strong correlations with both Low Empathy and Self-Aggrandizement and a zero correlation with Internal Self-Preoccupation. Narcissism might have significant correlations of varying magnitudes with each subscale. Thus, identification of the facets of Self-Centeredness would sharpen our conceptualization of that construct and lead to discovery of otherwise unavailable empirical relationships (Briggs & Cheek, 1986; Carver, 1989).
Constructs exist at different levels of hierarchy or aggregation. Understanding the hierarchical level of a construct is essential for theory. It is also a prerequisite to instrument refinement; for example, internal consistency and content homogeneity are not appropriate goals for measures combining lower level
constructs (Clark & Watson, 1995). A high coefficient alpha
does not indicate unidimensionality, and there are direct means
of assessing the degree of subscale covariance (e.g., examination of correlation matrices and use of confirmatory factor analysis to test the degree of loss of model fit when combining scales).
When can lower level constructs be combined and analyzed at an aggregate level? The situation is analogous to the interpretation of main effects in light of interactions in analysis of variance. If there is a significant A × B interaction, it is misleading to interpret the main effect of A on the dependent variable without noting that the simple effects of A are different at the different levels of B. Similarly, it is misleading to report the correlations of an aggregate construct with criterion measures without noting that the facets of the aggregate construct correlate differently with the criteria. If the correlations between the facets and the criteria do not differ, then straightforward interpretation of the main effect aggregate correlation is permissible.
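As an illustration of how an aggregate correlation can blend distinct facet-level effects, here is a small simulated sketch (hypothetical data of our own, not an analysis from the article) in which a criterion relates to one facet but not the other:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two modestly correlated facets of a broader construct.
facet1, facet2 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n).T

# A criterion driven by facet1 only (loosely analogous to social avoidance
# relating to Internal Self-Preoccupation but not to the other facets).
criterion = 0.5 * facet1 + rng.normal(0, 1, n)

aggregate = facet1 + facet2

print(np.corrcoef(facet1, criterion)[0, 1])     # substantial
print(np.corrcoef(facet2, criterion)[0, 1])     # near zero (only via facet overlap)
print(np.corrcoef(aggregate, criterion)[0, 1])  # attenuated relative to the facet driving the effect
```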

Items as Effect Indicators or Causal Indicators of Constructs

The items of Instrument A were written to be parallel indicators of a presumed underlying construct. In structural equation modeling language (cf. Bollen, 1989) such items are called effect indicators, because presumably individual differences in Self-Centeredness account for individual differences in item endorsement. High self-centeredness causes one to endorse the items. Analyses of internal consistency proved to have been misleading at the higher (broader) construct level but would have been appropriate at the lower level (again assuming the lower level constructs to be unidimensional).
Imagine a second instrument, Instrument B, designed to
measure Life Stress in the past year. Items were written to include different potential sources of stress, including job change,
divorce, birth of a child, serious illness, death of a loved one, and
so on. In structural equation modeling language, these items
are referred to as cause or causal indicators, because individual
differences in the experience of each of the items (e.g., job change) account for individual differences in overall Stress level


(Blalock, 1964, pp. 162-169; Bollen, 1984). It makes no sense to assert that changes in Stress produce birth of a child, only that birth of a child produces a change in Stress level. This perspective carries an important psychometric implication: the fact that the indicators are combined to produce an overall Stress score does not in itself imply they are necessarily intercorrelated (Bollen & Lennox, 1991). Whether they are or not is irrelevant to the reliability and validity of the measure. Measures of internal consistency cannot be construed as indexes of reliability in this case, although they may represent interesting empirical findings (e.g., to the extent they are interrelated, one might hypothesize an underlying stressor-proneness personality construct). 3
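A brief simulated sketch (hypothetical and ours, not from the article) makes the psychometric point concrete: a composite of essentially independent causal indicators can track the construct it defines even though coefficient alpha for those indicators is near zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

def cronbach_alpha(items):
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

# Independent life events (causal indicators): job change, divorce, illness, and so on.
# Base rates are invented for illustration.
events = rng.binomial(1, [0.30, 0.10, 0.15, 0.05, 0.20], size=(n, 5)).astype(float)

# Overall stress is produced by the events, not the other way around.
stress = events.sum(axis=1) + rng.normal(0, 0.5, n)

print(cronbach_alpha(events))                          # near zero: indicators are unrelated
print(np.corrcoef(events.sum(axis=1), stress)[0, 1])   # composite still tracks the construct
```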
These considerations point to the importance of giving careful thought to construct definition and to the appropriate means
of analyzing and refining instruments. Rote applications of coefficient alpha are inappropriate. Casual, single-sample dimensionality analyses without consideration of alternative structural models can produce misleading results.

Objectives and Techniques of Instrument Refinement


In this section we discuss instrument refinement objectives in
more detail and offer methodological suggestions for implementing them. 4 We encourage investigators to evaluate whether
they have met each of these objectives before publishing their
findings.

Identification of Hierarchical or Aggregational Structure


Methods for analyzing dimensional structure are discussed elsewhere in this issue (Clark & Watson, 1995; Floyd & Widaman, 1995); we emphasize four points. First, clarification of instrument structure is important for criterion-keyed instruments as well as theoretically derived ones. In criterion keying, items are retained on the basis of their correlations with an external criterion, not their correlations with each other. It is important to appreciate that measures of this form carry an implicit dimensionality: if unrelated items each predict a common criterion, then the items reflect different lower level constructs, each of which accounts for a different aspect of a given criterion. If the dimensionality is left unexplored, the result is that each lower level predictive construct contributes in an unknown, and probably disproportionately weighted (Haynes, Richard, & Kubany, 1995), way. Clarifying the dimensionality of criterion-keyed measures points to the importance of ensuring the reliability of the lower level constructs and provides theoretical clarification of the observed empirical relationship.
Second, researchers should consider whether their measures fit effect-indicator or causal-indicator models (Bollen & Lennox, 1991; Hoyle, 1991). The procedures are different. Discussion of practical issues in analysis of causal-indicator models is beyond the scope of this article (see MacCallum & Browne, 1993).
Third, when conducting factor analysis or other dimensionality analyses, careful inspection of results, rather than "automatic pilot" applications of procedures, is in order. Investigators are discouraged from approaching their data with the sole aim of finding a defensible solution so as to move on. For example, a model might appear to fit the data poorly because one of the dimensions was poorly measured (e.g., a small number of nonparallel items). Although rejection of a factor structure that includes that dimension might be the most readily apparent course, improvement or addition of items might result in identification of an important facet or construct. Although item-level improvement is cumbersome in the short term, necessitating validation on an independent sample, the long-term savings from having correctly specified a construct domain are certainly far greater.
Fourth, confirmatory factor analysis (cf. Bollen, 1989) is often useful because it provides indexes of the degree to which a factor model fits sample data and it permits systematic comparison of alternative factor models. This technique may be particularly helpful in determining whether one has identified a unidimensional facet: if decomposition into subfactors does not produce a statistically significant improvement in model fit, the hypothesis of unidimensionality can be retained.
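As a rough illustration of this model-comparison logic, the sketch below uses the third-party semopy package and lavaan-style model syntax (our assumptions; the article does not prescribe any software) to compare a one-factor and a two-factor specification on simulated data:

```python
import numpy as np
import pandas as pd
from semopy import Model, calc_stats  # assumed third-party SEM package; not part of the article

rng = np.random.default_rng(3)
n = 400
g = rng.normal(size=n)  # simulate item data driven by a single latent factor
df = pd.DataFrame({f"x{i}": g + rng.normal(0, 1, n) for i in range(1, 7)})

one_factor = Model("G =~ x1 + x2 + x3 + x4 + x5 + x6")
two_factor = Model("F1 =~ x1 + x2 + x3\nF2 =~ x4 + x5 + x6\nF1 ~~ F2")

for label, model in [("one factor", one_factor), ("two factors", two_factor)]:
    model.fit(df)
    print(label)
    print(calc_stats(model))  # fit indexes (chi-square, df, CFI, RMSEA, and so on)

# If the two-factor decomposition does not improve fit meaningfully (e.g., a
# nonsignificant chi-square difference), the unidimensional model is retained.
```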

Establishment of Internal Consistency of Unidimensional Facets

Conceptual considerations. All unidimensional scales should be internally consistent. This fact has not always been appreciated. For example, Bollen and Lennox (1991) have noted that causal indicators, such as different stress domains, need not be intercorrelated. Similarly, Meehl (1992) and others have noted that when the goal is to predict a specific criterion, unrelated items are to be preferred because the unique predictive contribution of each item is thereby maximized. The situation is analogous to multiple regression, in which additional predictors are valuable to the extent that they add nonredundant predictive power. Although true at a higher level of aggregation, such statements neglect the concern with the reliability of the individual components themselves. Although the reliability of our hypothetical Instrument B Stress measure does not hinge on intercorrelations among causal indicators, it does hinge on the reliability of each of the indicators. If the indicator components are unstable, instability is introduced into measurement of the aggregate construct. Increasing the reliability of the indicator components may involve item refinement; most often, it likely involves the addition of parallel items to create item clusters or packets (Horst, 1966; Meehl, 1992). Psychometricians may benefit from the modeling of behavioral assessment investigators, who pay strict attention to the adequate sampling of specific, component target behaviors so as to attain reliability of measurement (Cone, 1988; Haynes, 1990; Lanyon & Goodstein, 1982).
3 There are also levels of generality within the causal indicator case. Stress items may combine into categories: there may be an occupational stress category that includes insufficient compensation, job insecurity, sexual harassment, and a hostile supervisor.
4 For detailed exposition of these methods, readers should consult psychometric texts (cf. Anastasi, 1988; Nunnally & Bernstein, 1994). For discussion of topics beyond the scope of this article, such as question construction, question ordering, and response alternative composition, researchers are referred to Converse and Presser (1986) and Hippler, Schwarz, and Sudman (1987).


Techniques and technical considerations. Internal consistency-based refinement procedures typically make use of an index such as coefficient alpha as well as a number of item-level statistics, such as the impact on alpha if a given item is deleted, corrected item-total correlations, interitem correlations, and the Spearman-Brown formula relating reliability estimates to scale length. All of these statistics are based on intercorrelations among items. Although some have argued that moderate interitem correlations are preferable so as to measure a sufficiently broad construct (e.g., .20-.40; Briggs & Cheek, 1986), this argument fails to consider the concept of construct hierarchies. At the lowest hierarchical level, items should be parallel, alternative indicators of a common construct (Comrey, 1988) and so should correlate highly (Bollen & Lennox, 1991; Horst, 1966). Breadth of coverage, if appropriate, is obtained by combining related facets of a higher order construct.
To improve a scale's internal consistency, investigators have three options: remove poor items, revise poor items, or add parallel items. In each case, investigators should test the impact of their modifications on an independent sample. Poor items (statistically) are those that have low item-total correlations and low correlations with other items, and whose removal would lead to an increase in coefficient alpha. Investigators should inspect carefully their good and poor items; such scrutiny can, at times, help clarify the definition of the target construct.
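A compact sketch of these item-level statistics (ours, assuming NumPy and simulated data) shows how an off-construct item is flagged by a low corrected item-total correlation and by the rise in alpha when it is deleted:

```python
import numpy as np

def item_analysis(items):
    """Return coefficient alpha, corrected item-total correlations, and alpha if each item is deleted.

    `items` is an (n_respondents x n_items) array of item scores.
    """
    k = items.shape[1]
    total = items.sum(axis=1)

    def alpha(mat):
        m = mat.shape[1]
        return m / (m - 1) * (1 - mat.var(axis=0, ddof=1).sum()
                              / mat.sum(axis=1).var(ddof=1))

    item_total = np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]  # corrected: item removed from total
        for j in range(k)
    ])
    alpha_if_deleted = np.array([alpha(np.delete(items, j, axis=1)) for j in range(k)])
    return alpha(items), item_total, alpha_if_deleted

# Hypothetical usage: flag statistically poor items for content review, then
# test any deletions on an independent sample (as the text emphasizes).
rng = np.random.default_rng(4)
latent = rng.normal(size=300)
items = np.column_stack([latent + rng.normal(0, 1, 300) for _ in range(6)])
items[:, 5] = rng.normal(size=300)  # one item unrelated to the construct
a, r_it, a_del = item_analysis(items)
print(a, r_it.round(2), a_del.round(2))
```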

Determination of Content Homogeneity of Unidimensional Facets

Conceptual considerations. Items reflecting unidimensional constructs should be homogeneous in content. Reliance on item statistics is insufficient for determination of content homogeneity; as Burisch (1984) has noted, only content considerations can separate items that are prototypic of a construct from items that are mere correlates of a construct. McCarthy and Smith (1995) recently reviewed the content prototypicality of several measures of Impulsivity, Aggression, and Novelty Seeking using 15 trained raters. Consider two sample items from this review, both taken from the Weinberger Adjustment Inventory (WAI; Weinberger, 1991) Impulsivity Scale, a measure with solid internal consistency reliability: "I'm the kind of person who will try anything once, even if it is not safe" and "I like to do new and different things that many people would consider weird or not really safe." Both items were rated as nonprototypic of Impulsivity: individuals can try new, different, and risky activities without doing so on impulse. The items were judged to have considerable overlap with Novelty Seeking, and in fact the second item was judged prototypic of Novelty Seeking. Impulsive individuals may be more likely to endorse these items, and so the items correlate with other indicators of impulsivity, but there is nothing about the item itself that describes a necessarily impulsive act. Contrast these items with, "I say the first thing that comes into my mind without thinking enough about it" and "I stop and think things through before I act," two items from the WAI Impulsivity Scale that were judged prototypic of Impulsivity. Content heterogeneity like this can lead to imprecise, confounded measurements and misleading findings, such as a potentially inflated relationship between the WAI Impulsivity Scale and measures of Novelty Seeking.


Also, items should be as simple and clear as possible. In general, the longer the item, the more likely it is to reflect multiple sources of variance (Clark & Watson, 1995).
Techniques and technical considerations. We recommend that content-based item refinement be undertaken at multiple stages of the instrument development and refinement process. Particularly for inductively developed measures, construct definitions often become clarified during the construction and refinement process (Tellegen & Waller, in press), and investigators become better able to rate the content fidelity of their items over the course of the investigation. Similarly, repeated passes through items often result in changes that simplify or clarify them. Far too often, once items are first written, their content validity is taken as a given and is never scrutinized again.
We recommend a conservative approach to reviewing items for content fidelity. First, multiple raters should be trained thoroughly on construct definition, both from convergent and discriminant validity perspectives. Training should include a review of the prototypic versus correlate distinction; provision of sample items with correct answers; a discussion of those items; provision of a new set of sample items, without provided answers, to be scored independently; discussion of raters' scores; and an additional set of independently scored items following the discussion. Investigators may wish to add additional rounds of scoring and discussion. Raters should then be asked to review the target measure's items and provide ratings of prototypicality. Discrimination ratings are often helpful, such as requiring raters to rate items both on the target construct (e.g., Impulsivity) and a correlated construct (e.g., Novelty Seeking). Retained items will be rated prototypic of the target construct but not prototypic of the companion construct. When the target dimension is part of a hierarchy of scales, discrimination ratings can use other dimensions or subfactors from the same level in the hierarchy to ensure differentiation among construct components (Comrey, 1988). This process may need to be repeated with a second, independent sample of raters.
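A minimal sketch of such a rating-based retention rule (hypothetical raters, items, scale values, and cutoffs; all of it illustrative rather than prescribed by the article) might look like this:

```python
import numpy as np

# Hypothetical data: 5 trained raters rate 4 items (1-7) for prototypicality of
# the target construct (e.g., Impulsivity) and of a correlated companion
# construct (e.g., Novelty Seeking).
target = np.array([[6, 7, 6, 2],
                   [7, 6, 6, 3],
                   [6, 6, 7, 2],
                   [7, 7, 6, 3],
                   [6, 6, 6, 2]])
companion = np.array([[2, 3, 2, 6],
                      [3, 2, 2, 7],
                      [2, 2, 3, 6],
                      [2, 3, 2, 6],
                      [3, 2, 2, 7]])

mean_target = target.mean(axis=0)
mean_companion = companion.mean(axis=0)

# Retain items rated prototypic of the target but not of the companion construct
# (cutoffs of 5.5 and 3.5 are invented for the example).
retain = (mean_target >= 5.5) & (mean_companion <= 3.5)
print(mean_target, mean_companion, retain)
```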
Of course, content-based refinement should be part of the initial development and refinement process (Clark & Watson, 1995; Haynes et al., 1995). However, at times new constructs are identified that, because of their similarity to existing constructs, point to the need for refinement of previously existing measures. The emergence of the construct Novelty Seeking makes it possible to reexamine and further refine measures of Impulsivity, for example.

Inclusion of Items That Discriminate Among Participants at the Desired Level of Attribute Intensity

Item difficulty, in the context of ability or achievement testing, traditionally has referred to the proportion of individuals in a sample giving the correct response to an item (Nunnally & Bernstein, 1994). Items of varying difficulty levels are needed to discriminate along a wide continuum of abilities. The concept of difficulty can easily be translated to "attribute intensity" and hence applied to domains such as trait assessment. For example, the item "I would be jealous if I saw my partner talking with a man (woman) I didn't know" discriminates more intensely jealous individuals from others, whereas the item "I would be jealous if I saw my partner kissing a man (woman) I didn't know" discriminates individuals with very little jealousy from others.
When scales are developed on one population (e.g., undergraduates) and then applied to another that represents a different point along an attribute continuum (e.g., clinic patients), they may need refinement to improve their discriminating power in this new population. For example, if an item is endorsed by only 3% of undergraduates, its item-total and interitem correlations are likely to be quite low in that population. It may be deleted from the scale on that basis, yet it may prove to have excellent discriminating power (e.g., a 50% endorsement rate) in a clinically identified population. An item endorsed by 50% of undergraduates may be endorsed by, say, 90% of patients, rendering it less useful for discriminating among the patient population. The general point is that investigators should give careful thought to their items' difficulty levels and be prepared to refine their instruments as a function of the range along which they intend to discriminate.
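The point about population-dependent endorsement rates can be checked with a trivial sketch (simulated endorsement data; the base rates are hypothetical and chosen to mirror the example above):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical endorsement data (1 = endorsed) for one item in two populations.
undergrads = rng.binomial(1, 0.03, 800)  # rarely endorsed among undergraduates
patients = rng.binomial(1, 0.50, 200)    # near-optimal discrimination among patients

print(undergrads.mean())  # ~.03: the item looks statistically "poor" in this sample
print(patients.mean())    # ~.50: the same item discriminates well in a clinical sample
```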
Item response theory (Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980) may make an important contribution to clinical instrument refinement in this respect. An item response theory model describes a function defining the relationship between examinees' attribute level and the probability of item endorsement; the function permits calibration of an item's difficulty level (Lord, 1980; Reise, Widaman, & Pugh, 1993). Clark and Watson (1995) provide an overview of this topic in this issue.
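For readers unfamiliar with the notation, the following sketch of the standard two-parameter logistic item response function (textbook IRT material, not code from the article; the parameter values are invented) shows how difficulty, or intensity, governs where along the attribute continuum an item discriminates:

```python
import numpy as np

def two_pl(theta, a, b):
    """2-parameter logistic item response function: probability of endorsement
    given attribute level theta, discrimination a, and difficulty/intensity b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])   # low, average, and high attribute levels
print(two_pl(theta, a=1.5, b=-1.0))  # low-intensity item: endorsed by most respondents
print(two_pl(theta, a=1.5, b=1.5))   # high-intensity item: discriminates at the extreme end
```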

Replication on Independent Samples


Instrument refinement, including evaluation of the properties of the refined instrument, cannot be carried out in one sample. In our review, we found examples of researchers who wrote a set of items, conducted an exploratory factor analysis, derived subscales based on the items loading highest on each factor, computed item statistics for each subscale, deleted items with the lowest item-total correlations, and then correlated the subscales with external criteria in a validity analysis, all in one sample! Although the use of item-level analyses is laudatory, this approach is inappropriate because it capitalizes on chance properties of the one sample. Instead, it is important to show that good scale properties (e.g., high internal consistency) that result from refinement procedures (e.g., item deletion) replicate on an independent sample. Very often, an instrument construction and refinement study will require at least three samples: a pilot sample for preliminary analyses of item statistics, a development sample in which dimensionality is explored and internal consistency is improved through item-level refinement, and a confirmatory sample in which the dimensionality findings and instrument properties are replicated and initial validity evidence is obtained. In an inductive study, it may be necessary to interview a cross-section of participants to develop the initial item pool, prior to administration to a pilot sample.
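A bare-bones sketch of the development/confirmation logic (simulated data; the .30 item-total cutoff is illustrative, not a recommendation from the article) is the following:

```python
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(6)
latent = rng.normal(size=600)
items = np.column_stack([latent + rng.normal(0, 1, 600) for _ in range(8)])
items[:, 7] = rng.normal(size=600)       # one weak, off-construct item

dev, confirm = items[:300], items[300:]  # development and confirmatory samples

# Refine on the development sample only.
total = dev.sum(axis=1)
r_it = np.array([np.corrcoef(dev[:, j], total - dev[:, j])[0, 1] for j in range(8)])
keep = r_it >= 0.30                       # illustrative cutoff

# Check that the refined scale's properties hold up on the independent sample.
print(cronbach_alpha(dev[:, keep]), cronbach_alpha(confirm[:, keep]))
```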
Additional Considerations in Instrument Refinement

Efficiency of Measurement

Because internal consistency estimates of reliability are enhanced by increasing test length, many researchers feel a tension between reliability on the one hand and efficiency on the other. We offer four suggestions for enhancing efficiency without reducing reliability appreciably. First, retain prototypic items but eliminate mere correlate items. Second, retain items that are parallel but not those that are identical. The two WAI Impulsivity prototypic items noted earlier, "I say the first thing that comes into my mind without thinking enough about it" and "I stop and think things through before I act," satisfy this criterion. In contrast, using two items such as "I stop and think things through before I act" and "I stop and think before acting," which are nearly identical, is inefficient. The second item in such a pair does not represent additional sampling from the content domain, and so one aspect of the domain may be oversampled; furthermore, the same error variance associated with the first item is likely also associated with the second. Whatever increase in alpha it brings is, in this sense, misleading. Third, experiment with reducing scale length, both using the Spearman-Brown formula and by recomputing alpha for shortened scales. Of course, reliability and validity estimates for reduced scales must be replicated on a second, independent sample. Fourth, where appropriate to the construct, experiment with use of Likert-type items that capture more variability than do dichotomous items: if each item taps more variance in the target construct, fewer items may be necessary to measure the desired level of variability.
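The Spearman-Brown projection mentioned in the third suggestion can be computed directly; a one-function sketch (ours, with invented numbers) projects the reliability of a hypothetical half-length form:

```python
def spearman_brown(alpha, length_ratio):
    """Projected reliability when scale length is multiplied by `length_ratio`
    (Spearman-Brown prophecy formula)."""
    return length_ratio * alpha / (1 + (length_ratio - 1) * alpha)

# Hypothetical example: a 20-item scale with alpha = .90 shortened to 10 items.
print(spearman_brown(0.90, 10 / 20))  # about .82; must still be verified empirically
                                      # on an independent sample, as the text stresses
```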

The Use of Abbreviated Measures: The Improper Pursuit of Efficiency

Despite repeated exhortations to the contrary in both the intelligence (Silverstein, 1990) and personality (Butcher & Hostetler, 1990) literatures, the use of abbreviated forms of measures appears more widespread than ever. The availability of multivariate data-analytic techniques has perhaps encouraged investigators to study models involving relationships among numerous variables simultaneously; unfortunately, this has frequently led to practices such as eliminating items from each scale in a protocol. We advise against such practices: the psychometric properties of a measure cannot be imputed to a short form without empirical testing. Often, use of abbreviated measures attenuates reliability. Even more frequently, internal consistency is preserved but validity is attenuated because of reduced coverage of the target construct. Systematic measure refinement analyses should demonstrate retained content coverage, maintained reliability, and maintained validity prior to use of abbreviated measures.

Refining Measures of Taxa Versus Measures of Continua


Researchers should give careful thought to whether their target construct exists on a continuum or is categorical (Meehl, 1992). The creation of a multiple-item measure that, when summed, yields scores along a continuum does not in itself establish that the latent construct is actually continuous. If several items are valid but fallible indicators of a latent taxon (Widiger & Trull, 1991), then when the item scores are summed, higher scores increase the probability of membership in the target group (Meehl, 1992). Meehl and colleagues describe numerous


statistical methods for differentiating taxa from continua (cf. Meehl, 1992; Meehl & Golden, 1982).
Most of the refinement principles described earlier apply to measures of taxa as well as to continua. Although taxa may have heterogeneous indicators (Meehl, 1992), investigators must still concern themselves with the reliability of the individual component indicators: the component structure of the indicators should be identified; the components should be measured with parallel, internally consistent indicators whose content is demonstrably homogeneous; and their performance should be replicated across independent samples.

Refinement of Behavioral Observation Procedures


The general principles described here in psychometric language apply also to behavioral observation procedures: refinement can improve the definition of what is observed and the consistency, content validity, and efficiency with which it is observed. Because the method of observation is different, the specific threats to validity are different, as are the resulting refinement strategies. There are a number of excellent reviews of behavioral observation techniques (cf. Cone, 1988; Foster, Bell-Dolan, & Burge, 1988; Haynes, 1978, 1990); we provide a brief overview of a sample of refinement issues.
Appropriate targets of behavioral observation procedure refinement include definition of the target behavior, range of contexts across which the target behavior is sampled, the time length of sampling, the time interval between samples, the frequency and duration of observation sessions, definition of target behavior to observers, clarity of the rating scheme, procedures for recording observations, location of the observer, and degree of interaction between the target subject and the observer. As with psychometric instruments, it is usually important to refine the observation task repeatedly so as to eliminate as many potential sources of distortion as possible. The success of refinements must be evaluated on a new sample.

Instrument Refinement as Part of Theory Building


In the foregoing discussion, we have discussed instrument refinement as a prerequisite to theory testing. When done rigorously, however, instrument refinement can operate in a dialectic with theory. Because refinement involves close scrutiny of item content and performance, it can lead to advances in theory such as modified construct definitions or identification of new constructs. Such advances can, in turn, lead to the construction or further refinement of measures. Clark, Watson, and colleagues have provided an instructive example of this process. They began by studying the content and the correlational properties of numerous measures of depression and anxiety. Where measures of the two constructs showed little discriminant validity, they found considerable overlapping content: both measures included items assessing negative affectivity or general distress (Watson & Clark, 1984). Where measures exhibited discriminant validity, the depression measure focused on low positive affectivity (anhedonia that is specific to depression) and the anxiety measure focused on physiologic hyperarousal (specific to anxiety). These findings led to a reformulation of the relationship between depression and anxiety and the refinement of instrumentation: the investigators developed new measures to reflect the new, more precise construct definitions (Clark & Watson, 1991; Watson, Clark, et al., 1995; Watson, Weber, et al., 1995).
A related point is that it is necessary to revisit instruments
periodically and assess the need for further refinement. The
identification of new constructs similar to a target construct and
the evolution in conceptualization of a construct may require
additional refinement of an existing measure.

Conclusion
Basic, test construction-based instrument refinement is often neglected by researchers, to the detriment of substantive research. We encourage investigators to conduct systematic,
multisample refinement analyses with the goals described
above. On a broader level, instrument refinement is a part of a dialectic between measurement and substantive research. Well-developed measures facilitate substantive advances, which in
turn may necessitate revisions in measures. The periodic reexamination of measures in light of theoretical and empirical
findings may contribute to advances in clinical research.

References
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Blalock, H. M. (1964). Causal inferences in nonexperimental research. Chapel Hill: University of North Carolina Press.
Bollen, K. (1984). Multiple indicators: Internal consistency or no necessary relationship? Quality and Quantity, 18, 377-385.
Bollen, K. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equations perspective. Psychological Bulletin, 110, 305-314.
Briggs, S. R., & Cheek, J. M. (1986). The role of factor analysis in the development and evaluation of personality scales. Journal of Personality, 54, 106-148.
Burisch, M. (1984). Approaches to personality inventory construction: A comparison of merits. American Psychologist, 39, 214-227.
Butcher, J. N., & Hostetler, K. (1990). Abbreviating MMPI item administration: What can be learned from the MMPI for the MMPI-2? Psychological Assessment, 2, 12-21.
Carver, C. S. (1989). How should multifaceted personality constructs be tested? Issues illustrated by self-monitoring, attribution style, and hardiness. Journal of Personality and Social Psychology, 56, 577-585.
Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric evidence and taxonomic implications. Journal of Abnormal Psychology, 100, 316-336.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319.
Comrey, A. L. (1988). Factor-analytic methods of scale development in personality and clinical psychology. Journal of Consulting and Clinical Psychology, 56, 754-761.
Cone, J. D. (1988). Psychometric considerations and the multiple modes of behavioral assessment. In A. S. Bellack & M. Hersen (Eds.), Behavioral assessment: A practical handbook (3rd ed., pp. 42-66). New York: Pergamon Press.
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized questionnaire. Beverly Hills, CA: Sage.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.
Foster, S. L., Bell-Dolan, D. J., & Burge, D. A. (1988). Behavioral observation. In A. S. Bellack & M. Hersen (Eds.), Behavioral assessment: A practical handbook (3rd ed., pp. 119-160). New York: Pergamon Press.
Foster, S. L., & Cone, J. D. (1995). Validity issues in clinical assessment. Psychological Assessment, 7, 248-260.
Green, S. B., Lissitz, R. W., & Mulaik, S. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827-839.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Haynes, S. (1978). Principles of behavioral assessment. New York: Gardner Press.
Haynes, S. (1990). Behavioral assessment of adults. In G. Goldstein & M. Hersen (Eds.), Handbook of psychological assessment (2nd ed., pp. 423-466). New York: Pergamon Press.
Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238-247.
Hippler, H. J., Schwarz, N., & Sudman, S. (1987). Social information processing and survey methodology. New York: Springer-Verlag.
Horst, P. (1966). Psychological measurement and prediction. Belmont, CA: Wadsworth.
Hoyle, R. H. (1991). Evaluating measurement models in clinical research: Covariance structure analysis of latent variable models of self-conception. Journal of Consulting and Clinical Psychology, 59, 67-76.
Hoyle, R. H., & Smith, G. T. (1994). Formulating clinical research hypotheses as structural equation models: A conceptual overview. Journal of Consulting and Clinical Psychology, 62, 429-440.
Lanyon, R. I., & Goodstein, L. D. (1982). Personality assessment (2nd ed.). New York: Wiley.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
MacCallum, R. C., & Browne, M. W. (1993). The use of causal indicators in covariance structure models: Some practical issues. Psychological Bulletin, 114, 533-541.
McCarthy, D. M., & Smith, G. T. (1995, August). Issues in the measurement of risk for substance use. Paper presented at the 103rd Annual Convention of the American Psychological Association, New York.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
Meehl, P. E. (1992). Factors and taxa, traits and types, differences of degree and differences in kind. Journal of Personality, 60, 117-174.
Meehl, P. E., & Golden, R. R. (1982). Taxometric methods. In P. C. Kendall & J. N. Butcher (Eds.), Handbook of research methods in clinical psychology (pp. 127-181). New York: Wiley.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Silverstein, A. B. (1990). Short forms of individual intelligence tests. Psychological Assessment, 2, 3-11.
Tellegen, A., & Waller, N. G. (in press). Exploring personality through test construction: Development of the Multidimensional Personality Questionnaire. In S. R. Briggs & J. M. Cheek (Eds.), Personality measures: Development and evaluation (Vol. 1). Greenwich, CT: JAI Press.
Watson, D., & Clark, L. A. (1984). Negative affectivity: The disposition to experience aversive emotional states. Psychological Bulletin, 96, 465-490.
Watson, D., Clark, L. A., Weber, K., Assenheimer, J. S., Strauss, M. E., & McCormick, R. A. (1995). Testing a tripartite model: II. Exploring the symptom structure of anxiety and depression in student, adult, and patient samples. Journal of Abnormal Psychology, 104, 15-25.
Watson, D., Weber, K., Assenheimer, J. S., Clark, L. A., Strauss, M. E., & McCormick, R. A. (1995). Testing a tripartite model: I. Evaluating the convergent and discriminant validity of anxiety and depression symptom scales. Journal of Abnormal Psychology, 104, 3-14.
Weinberger, D. A. (1991). Social-emotional adjustment in older children and adults: Validation of the Weinberger Adjustment Inventory. Unpublished manuscript.
Widiger, T. A., & Trull, T. J. (1991). Diagnosis and clinical assessment. Annual Review of Psychology, 42, 109-134.

Received February 2, 1995


Revision received April 17, 1995
Accepted April 20, 1995
