Psychological Assessment
1995, Vol. 7, No. 3, 300-308
Imagine a hypothetical manuscript that introduces Instrument A, a measure of Self-Centeredness. Suppose the author of the manuscript reports that he or she wrote items to be indicators of that underlying construct, administered them to a sample, obtained a coefficient alpha of .75, and found that scores on the measure correlated significantly with scores on measures of narcissism (r = .53) and hostile relationships (r = .25), but nonsignificantly with social avoidance (r = .11). The author concludes that the new measure is reliable and has construct validity.
This hypothetical report is in fact highly representative of manuscripts submitted to Psychological Assessment (see below). Unfortunately, however, the report's inattention to many aspects of the psychometric properties of the new measure creates a number of potential difficulties. We propose five objectives of clinical assessment instrument refinement that are not met in this hypothetical example; each will be discussed in more detail in subsequent sections of this article. To introduce these objectives, suppose that, unbeknown to the author, Instrument A actually measures three unidimensional, moderately correlated facets of Self-Centeredness: Low Empathy, Internal Self-Preoccupation, and Self-Aggrandizement. (As we describe below, such a situation could certainly occur.) The five objectives are listed here.
1. Identification of the measure's hierarchical or aggregational structure. Failure to identify this measure's three facets or unidimensional constructs could lead to inaccurate specifications of theory as well as misleading correlational and experimental findings. For example, certain subscales might correlate
¹ We use the terms unidimensional facet and unidimensional construct interchangeably to refer to factors that cannot be decomposed
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Table 1
Rate of Failure in the Application of Refinement Procedures

Procedure                                     %
Dimensionality assessment                     31
Item-level statistical analysis               64
Content-based item-level analysis             85
Analysis of item difficulty level            100
Independent sample replication                68
Assessment of discriminant validity           71
More than one refinement step                 85
Use of appropriate refinement procedures      50

Note. Based on N = 76 original submissions to Psychological Assessment during 1990-1994, sampled from the files of an associate editor to approximate acceptance rates. Numbers are for cases in which the procedure was judged appropriate but not applied. Examples of refinement procedures include factor analysis, examination of item-total correlations, expert ratings of content validity, examination of frequency of item endorsement, replication of factor analyses, convergent and discriminant validity correlations, examination of both item-total correlations and content validity ratings, and use of factor analysis to assess unidimensionality versus multidimensionality.
² One goal of this special issue is to improve the overall quality of clinical assessment research, as reflected in the quality of submissions to Psychological Assessment; accordingly, we sampled manuscripts to approximate the journal's acceptance rate. The sample was obtained from one of the journal's associate editors. Ten of the 76 manuscripts were ultimately accepted for publication; all manuscripts were original (not revised) submissions.
the criterion scales of narcissism, hostile relationships, and social avoidance that were obscured by using the aggregate variable. Social avoidance could be related to Internal Self-Preoccupation but unrelated to Low Empathy and Self-Aggrandizement, producing the nonsignificant correlation between social avoidance and the aggregate measure. Similarly, a modest correlation between the aggregate Self-Centeredness score and hostile relationships could be due to the latter variable having strong correlations with both Low Empathy and Self-Aggrandizement and a zero correlation with Internal Self-Preoccupation. Narcissism might have significant correlations of varying magnitudes with each subscale. Thus, identification of the facets of Self-Centeredness would sharpen our conceptualization of that construct and lead to discovery of otherwise unavailable empirical relationships (Briggs & Cheek, 1986; Carver, 1989).
Constructs exist at different levels of hierarchy or aggregation. Understanding the hierarchical level of a construct is essential for theory. It is also a prerequisite to instrument refinement; for example, internal consistency and content homogeneity are not appropriate goals for measures combining lower level constructs (Clark & Watson, 1995). A high coefficient alpha does not indicate unidimensionality, and there are direct means of assessing the degree of subscale covariance (e.g., examination of correlation matrices and use of confirmatory factor analysis to test the degree of loss of model fit when combining scales).
When can lower level constructs be combined and analyzed at an aggregate level? The situation is analogous to the interpretation of main effects in light of interactions in analysis of variance. If there is a significant A × B interaction, it is misleading to interpret the main effect of A on the dependent variable without noting that the simple effects of A are different at the different levels of B. Similarly, it is misleading to report the correlations of an aggregate construct with criterion measures without noting that the facets of the aggregate construct correlate differently with the criteria. If the correlations between the facets and the criteria do not differ, then straightforward interpretation of the main effect aggregate correlation is permissible.
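A short simulation can make this masking effect concrete. Everything below, the coefficients, the sample size, and the facet labels in the comments, is a hypothetical construction for illustration, not data from this article:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Three moderately correlated facets sharing a common source of variance.
shared = rng.normal(size=n)
f1 = 0.5 * shared + rng.normal(size=n)   # e.g., Internal Self-Preoccupation
f2 = 0.5 * shared + rng.normal(size=n)   # e.g., Low Empathy
f3 = 0.5 * shared + rng.normal(size=n)   # e.g., Self-Aggrandizement
aggregate = f1 + f2 + f3

# A criterion driven by only one facet (analogous to social avoidance).
criterion = 0.6 * f1 + rng.normal(size=n)

r_f1 = np.corrcoef(f1, criterion)[0, 1]          # substantial
r_f2 = np.corrcoef(f2, criterion)[0, 1]          # near zero
r_agg = np.corrcoef(aggregate, criterion)[0, 1]  # diluted by unrelated facets
```

The aggregate correlation falls well below the correlation of the single relevant facet, which is exactly the pattern that facet-level analysis would expose.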
For example, a model might appear to fit the data poorly because one of the dimensions was poorly measured (e.g., a small number of nonparallel items). Although rejection of a factor structure that includes that dimension might be the most readily apparent course, improvement or addition of items might result in identification of an important facet or construct. Although item-level improvement is cumbersome in the short term, necessitating validation on an independent sample, the long-term savings from having correctly specified a construct domain are certainly far greater.
Fourth, confirmatory factor analysis (cf. Bollen, 1989) is often useful because it provides indexes of the degree to which a factor model fits sample data and it permits systematic comparison of alternative factor models. This technique may be particularly helpful in determining whether one has identified a unidimensional facet: if decomposition into subfactors does not produce a statistically significant improvement in model fit, the hypothesis of unidimensionality can be retained.
Techniques and technical considerations. Internal consistency-based refinement procedures typically make use of an index such as coefficient alpha as well as a number of item-level statistics, such as the impact on alpha if a given item is deleted, corrected item-total correlations, interitem correlations, and the Spearman-Brown formula relating reliability estimates to scale length. All of these statistics are based on intercorrelations among items. Although some have argued that moderate interitem correlations are preferable so as to measure a sufficiently broad construct (e.g., .20-.40; Briggs & Cheek, 1986), this argument fails to consider the concept of construct hierarchies.

At the lowest hierarchical level, items should be parallel, alternative indicators of a common construct (Comrey, 1988) and so should correlate highly (Bollen & Lennox, 1991; Horst, 1966). Breadth of coverage, if appropriate, is obtained by combining related facets of a higher order construct.
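The statistics named above are straightforward to compute from an item-score matrix. A minimal numpy sketch (the function names are ours, not the article's):

```python
import numpy as np

def coefficient_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

def spearman_brown(reliability, length_factor):
    """Projected reliability when scale length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)
```

For instance, `spearman_brown(0.75, 2)` projects the alpha of .75 from the opening example onto a scale twice as long, yielding roughly .86.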
To improve a scale's internal consistency, investigators have three options: remove poor items, revise poor items, or add parallel items. In each case, investigators should test the impact of their modifications on an independent sample. Statistically poor items are those that have low item-total correlations and low correlations with other items, and whose removal would lead to an increase in coefficient alpha. Investigators should inspect their good and poor items carefully; such scrutiny can, at times, help clarify the definition of the target construct.
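The screening just described can be sketched as follows, assuming scores arrive as a respondents-by-items matrix (an illustrative helper, not a prescribed procedure):

```python
import numpy as np

def _alpha(m):
    """Cronbach's alpha for an (n_respondents, n_items) matrix."""
    k = m.shape[1]
    return (k / (k - 1)) * (1 - m.var(axis=0, ddof=1).sum()
                            / m.sum(axis=1).var(ddof=1))

def item_diagnostics(items):
    """Corrected item-total correlation and alpha-if-item-deleted
    for each column of an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    rows = []
    for j in range(items.shape[1]):
        rest = total - items[:, j]                    # total excluding item j
        r_it = np.corrcoef(items[:, j], rest)[0, 1]   # corrected item-total r
        alpha_wo = _alpha(np.delete(items, j, axis=1))
        rows.append((r_it, alpha_wo))
    return rows
```

An item is flagged as statistically poor when its corrected item-total correlation is low and alpha rises once it is deleted; the substantive inspection of such items remains the investigator's job.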
305
Also, items should be as simple and clear as possible. In general, the longer the item, the more likely it is to reflect multiple sources of variance (Clark & Watson, 1995).
Techniques and technical considerations. We recommend that content-based item refinement be undertaken at multiple stages of the instrument development and refinement process. Particularly for inductively developed measures, construct definitions often become clarified during the construction and refinement process (Tellegen & Waller, in press), and investigators become better able to rate the content fidelity of their items
over the course of the investigation. Similarly, repeated passes through items often result in changes that simplify or clarify them. Far too often, once items are first written, their content validity is taken as a given and never scrutinized again.
We recommend a conservative approach to reviewing items
for content fidelity. First, multiple raters should be trained thoroughly on the construct definition, both from convergent and discriminant validity perspectives. Training should include a review of the prototypic versus correlate distinction; provision of sample items with correct answers; a discussion of those items; provision of a new set of sample items, without provided answers, to be scored independently; discussion of raters' scores; and an additional set of independently scored items following the discussion. Investigators may wish to add additional rounds of scoring and discussion. Raters should then be asked to review the target measure's items and provide ratings of prototypicality. Discrimination ratings are often helpful, such as requiring raters to rate items both on the target construct (e.g., Impulsivity) and a correlated construct (e.g., Novelty Seeking). Retained items will be rated prototypic of the target construct but not prototypic of the companion construct. When the target dimension is part of a hierarchy of scales, discrimination ratings can use other dimensions or subfactors from the same level in the hierarchy to ensure differentiation among construct components (Comrey, 1988). This process may need to be repeated with a second, independent sample of raters.
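The retention rule in this paragraph amounts to a simple filter over mean prototypicality ratings. The rating scale and the two cutoffs below are assumptions for illustration, not values from the article:

```python
import numpy as np

def select_items(target_ratings, companion_ratings, hi=4.0, lo=2.5):
    """Return indices of items rated prototypic of the target construct
    (mean rating >= hi) but not of the correlated companion construct
    (mean rating <= lo). Ratings are (n_raters, n_items) matrices on,
    say, a 1-5 prototypicality scale; thresholds are hypothetical."""
    t = np.asarray(target_ratings, dtype=float).mean(axis=0)
    c = np.asarray(companion_ratings, dtype=float).mean(axis=0)
    return np.where((t >= hi) & (c <= lo))[0]
```

With, for example, Impulsivity as the target and Novelty Seeking as the companion, only items high on the first mean and low on the second would survive.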
Of course, content-based refinement should be part of the initial development and refinement process (Clark & Watson, 1995; Haynes et al., 1995). However, at times new constructs are identified that, because of their similarity to existing constructs, point to the need for refinement of previously existing measures. The emergence of the construct Novelty Seeking makes it possible to reexamine and further refine measures of Impulsivity, for example.
Efficiency ofMeasurement
Because internal consistency estimates of reliability are enhanced by increasing test length, many researchers feel a ten-
strumentation: the investigators developed new measures to reflect the new, more precise construct definitions (Clark & Watson, 1991; Watson, Clark, et al., 1995; Watson, Weber, et al., 1995).
A related point is that it is necessary to revisit instruments
periodically and assess the need for further refinement. The
identification of new constructs similar to a target construct and
the evolution in conceptualization of a construct may require
additional refinement of an existing measure.
Conclusion
Basic, test construction-based instrument refinement is often neglected by researchers, to the detriment of substantive research. We encourage investigators to conduct systematic, multisample refinement analyses with the goals described above. On a broader level, instrument refinement is a part of a dialectic between measurement and substantive research. Well-developed measures facilitate substantive advances, which in turn may necessitate revisions in measures. The periodic reexamination of measures in light of theoretical and empirical findings may contribute to advances in clinical research.
References
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Blalock, H. M. (1964). Causal inferences in nonexperimental research. Chapel Hill: University of North Carolina Press.
Bollen, K. (1984). Multiple indicators: Internal consistency or no necessary relationship? Quality and Quantity, 18, 377-385.
Bollen, K. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equations perspective. Psychological Bulletin, 110, 305-314.
Briggs, S. R., & Cheek, J. M. (1986). The role of factor analysis in the development and evaluation of personality scales. Journal of Personality, 54, 106-148.
Burisch, M. (1984). Approaches to personality inventory construction: A comparison of merits. American Psychologist, 39, 214-227.
Butcher, J. N., & Hostetler, K. (1990). Abbreviating MMPI item administration: What can be learned from the MMPI for the MMPI-2? Psychological Assessment, 2, 12-21.
Carver, C. S. (1989). How should multifaceted personality constructs be tested? Issues illustrated by self-monitoring, attributional style, and hardiness. Journal of Personality and Social Psychology, 56, 577-585.
Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric evidence and taxonomic implications. Journal of Abnormal Psychology, 100, 316-336.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319.
Comrey, A. L. (1988). Factor-analytic methods of scale development in personality and clinical psychology. Journal of Consulting and Clinical Psychology, 56, 754-761.
Cone, J. D. (1988). Psychometric considerations and the multiple modes of behavioral assessment. In A. S. Bellack & M. Hersen (Eds.), Behavioral assessment: A practical handbook (3rd ed., pp. 42-66). New York: Pergamon Press.
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized questionnaire. Beverly Hills, CA: Sage.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.
Foster, S. L., Bell-Dolan, D. J., & Burge, D. A. (1988). Behavioral observation. In A. S. Bellack & M. Hersen (Eds.), Behavioral assessment: A practical handbook (3rd ed., pp. 119-160). New York: Pergamon Press.
Foster, S. L., & Cone, J. D. (1995). Validity issues in clinical assessment. Psychological Assessment, 7, 248-260.
Green, S. B., Lissitz, R. W., & Mulaik, S. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827-839.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Haynes, S. (1978). Principles of behavioral assessment. New York: Gardner Press.
Haynes, S. (1990). Behavioral assessment of adults. In G. Goldstein & M. Hersen (Eds.), Handbook of psychological assessment (2nd ed., pp. 423-466). New York: Pergamon Press.
Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238-247.
Hippler, H. J., Schwarz, N., & Sudman, S. (1987). Social information processing and survey methodology. New York: Springer-Verlag.
Horst, P. (1966). Psychological measurement and prediction. Belmont, CA: Wadsworth.
Hoyle, R. H. (1991). Evaluating measurement models in clinical research: Covariance structure analysis of latent variable models of self-conception. Journal of Consulting and Clinical Psychology, 59, 67-76.
Hoyle, R. H., & Smith, G. T. (1994). Formulating clinical research hypotheses as structural equation models: A conceptual overview. Journal of Consulting and Clinical Psychology, 62, 429-440.
Lanyon, R. I., & Goodstein, L. D. (1982). Personality assessment (2nd ed.). New York: Wiley.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
MacCallum, R. C., & Browne, M. W. (1993). The use of causal indicators in covariance structure models: Some practical issues. Psychological Bulletin, 114, 533-541.
McCarthy, D. M., & Smith, G. T. (1995, August). Issues in the measurement of risk for substance use. Paper presented at the 103rd Annual Convention of the American Psychological Association, New York.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
Meehl, P. E. (1992). Factors and taxa, traits and types, differences of degree and differences in kind. Journal of Personality, 60, 117-174.
Meehl, P. E., & Golden, R. R. (1982). Taxometric methods. In P. C. Kendall & J. N. Butcher (Eds.), Handbook of research methods in clinical psychology (pp. 127-181). New York: Wiley.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Silverstein, A. B. (1990). Short forms of individual intelligence tests. Psychological Assessment, 2, 3-11.
Tellegen, A., & Waller, N. G. (in press). Exploring personality through test construction: Development of the Multidimensional Personality Questionnaire. In S. R. Briggs & J. M. Cheek (Eds.), Personality measures: Development and evaluation (Vol. 1). Greenwich, CT: JAI Press.
Watson, D., & Clark, L. A. (1984). Negative affectivity: The disposition to experience aversive emotional states. Psychological Bulletin, 96, 465-490.
Watson, D., Clark, L. A., Weber, K., Assenheimer, J. S., Strauss, M. E., & McCormick, R. A. (1995). Testing a tripartite model: II. Exploring the symptom structure of anxiety and depression in student, adult, and patient samples. Journal of Abnormal Psychology, 104, 15-25.
Watson, D., Weber, K., Assenheimer, J. S., Clark, L. A., Strauss, M. E., & McCormick, R. A. (1995). Testing a tripartite model: I. Evaluating the convergent and discriminant validity of anxiety and depression symptom scales. Journal of Abnormal Psychology, 104, 3-14.
Weinberger, D. A. (1991). Social-emotional adjustment in older children and adults: Validation of the Weinberger Adjustment Inventory. Unpublished manuscript.
Widiger, T. A., & Trull, T. J. (1991). Diagnosis and clinical assessment. Annual Review of Psychology, 42, 109-134.