
Designing a New Scale/Questionnaire:

Optimal Psychometric Practice

Alex M. Wood, PhD


Senior Lecturer in Psychology

School of Psychological Sciences, University of Manchester, England.

Aims:

To describe the steps used in the construction of a new questionnaire/scale

Psychometrics is the branch of psychology concerned with the measurement of
individual differences. It is also used in many other fields (e.g., economics, medicine).
Accurate measurement of individual differences is vital for the scientific credibility
of the discipline and the research.
Essential reading (all essential, but perhaps start with Worthington):

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7, 309-319.
Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and
assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15,
446-455.
Smith, G. T., Fischer, S., & Fister, S. M. (2003). Incremental validity principles in test
construction. Psychological Assessment, 15, 467-477.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of
clinical assessment instruments. Psychological Assessment, 7, 300-308.
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis
and recommendations for best practices. The Counseling Psychologist, 34, 806-838.
Book: Coaley, K. (2010). An introduction to psychological assessment and
psychometrics. London: Sage.

Psychometric development involves:


1. Developing a clear rationale for the need to develop the scale
2. Developing a clear definition of the construct and a representative item
pool
3. Identifying the scale's structure and selecting items based on factor
analysis
4. Testing the structure with confirmatory factor analysis
5. Testing internal consistency (reliability)
6. Testing temporal stability (reliability)
7. Showing face validity (validity)
8. Testing criterion validity (validity)
9. Testing predictive validity (validity)
10. Testing discriminant validity (validity)
11. Testing incremental validity (validity)

Step 1: Develop a strong rationale for the need for the scale
50 years of personality psychology has developed scales for pretty
much everything; your scale is probably not needed!
There is an increasing view in the field that there are enough scales and that
people should use them rather than develop more (gratitude example).
It may not be immediately clear that there is a scale to suit a need, but a
full search will show that one probably exists, though perhaps with an unusual
name, from an unfamiliar theoretical position, or not citing/not cited by
other papers.
People who develop new scales normally pretend that they measure a
new trait or conception; hence most of psychology is in a mess, with
a huge number of almost certainly synonymous traits and no theory to
integrate them.
Represent the entire continuum of the construct (not just the positive
or negative aspects).
Question your motives for making a new scale. The best work in
psychology integrates existing perspectives and research; this is
where real progress is made.

Step 2: Develop a representative item pool (~100 items, at 10 to 30 per expected factor)
The operational definition of the construct; it is essential to
represent the universe of the construct.
Could be developed through qualitative research (sleep,
genetic counselling examples).
Could use pre-existing exhaustive lists of potential items
(Lexical Hypothesis example).
Could include items from other scales (dodgy: what item
pools did they use?).
Could be designed to map onto a pre-existing theoretical
conception (dodgy: would bias results, both by
over-defining and by missing parts of the construct).
Could simply choose items that make sense to the
researchers (unacceptable).

Step 3: Perform factor analysis on the item pool, and select items
Used both to find the number of underlying factors
and to select the most representative items.
All decisions are critical here; once you've chosen your items there is no turning
back!
Participants should be representative of who is going to use the scale in the
future. Ideally, have multiple groups (e.g., community, clinical), conduct all
analyses below separately for each group, and base decisions on a balance of the
findings between the samples (which should be largely consistent anyway).
Should be exploratory factor analysis (maximum likelihood with oblique [oblimin]
rotation).
Should use parallel analysis to determine the number of factors.
Extract the correct number of factors. Be careful of factors defined by all positive
or all negative items.
Choose the highest-loading items for each factor.
Decide on how many items to have per factor. A difficult decision: shorter scales are
more widely used but perhaps don't fully represent the construct. Could base this on
Cronbach's alpha, but there are issues of bloated specifics.

Step 4: Perform confirmatory factor analysis


Differs from exploratory factor analysis in that it tests the plausibility of a particular
factor structure.
In many senses weaker: the fit of other factor structures may be equally
good.
You should (a) test the expected factor structure, (b) compare it to other
factor structures, and (c) perform multi-group comparisons.

The next steps test reliability and validity


Reliability refers to how consistent the scores on the test are
over time or across equivalent versions of the test.
Reliability also refers to how well a test measures true and
systematic variation in a subject rather than error, bias, or
random variation.
Validity refers to how well the test measures what it is
supposed to. This requires independent criteria against which to
judge the validity of the test score.

Step 5: Test internal consistency


Asks whether the items are highly inter-correlated.
Important because the relationship between the scale and other variables will be
attenuated as a function of the unreliability of the scale due to weakly correlating
items.
Historically used split-half reliability:
Split the test to get equivalent halves
(odd and even items).
The resultant reliability is for only half the test;
the longer the test, the more reliable it will be.
Now use Cronbach's alpha (due to better computing):
- Effectively the average of all possible split-half reliabilities
(with some adjustment). Gives a value between 0 and 1. Less than .60
very poor, .60 to .70 poor, .70 to .80 good, more than .80 excellent.
Correlations between the scale and external variables can be corrected for
alpha (can be theoretically important (gratitude and appreciation example), but
can be dodgy as it excuses a poor scale).
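Both calculations follow directly from the definitions above: alpha from the item and total-score variances, and the correction for attenuation from dividing an observed correlation by the square root of the product of the two reliabilities. A minimal stdlib-only sketch with made-up item data:

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per item, each the same length
    (one entry per respondent). Alpha = k/(k-1) * (1 - sum of item
    variances / variance of the total scores)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(statistics.variance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_var / statistics.variance(totals))

def disattenuate(r_xy, alpha_x, alpha_y):
    # Correction for attenuation: the correlation two scales would show
    # if both were measured without error.
    return r_xy / (alpha_x * alpha_y) ** 0.5

# Three items answered by five respondents (hypothetical data).
items = [[3, 4, 2, 5, 4],
         [2, 4, 3, 5, 3],
         [3, 5, 2, 4, 4]]
print(round(cronbach_alpha(items), 2))           # → 0.87
print(round(disattenuate(0.40, 0.70, 0.80), 2))  # → 0.53
```

The second call shows the "dodgy" side: an observed r = .40 between two unreliable scales (alphas of .70 and .80) is presented as a disattenuated .53, which flatters the scale rather than improving it.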

Step 6: Test for temporal stability (test-retest reliability)


If the construct has trait-like properties then it should be stable over time,
BUT tests should also be sensitive to real change.
Show that the test is stable over time for most people (use all
methods).
Method 1. Give the same test to the same people at two time points.
The interval should be short enough to preclude genuine change, but long
enough for participants not to simply remember their answers.
Ideally, use two different time intervals with different groups (e.g., 2
weeks and 4 weeks, or 4 weeks and 3 months).
Method 2. Show that the mean of the group does not significantly
change over time (e.g., everyone's score goes up on reflection). Be
careful of issues of power.
Method 3. Show that the scale DOES change when expected to. Can
be demonstrated longitudinally (e.g., therapy) or experimentally (e.g.,
social desirability scales).

Step 7: Show face validity


Refers to what the test appears to measure.
Affects the acceptability of the test and whether it will be effective in practice.
Should have already been shown through the item pool creation.
Might perform further qualitative work with the final items (genetic
counselling example).
Can't be substituted for objective validity.

Step 8: Show criterion validity


Refers to the scale correlating with what it is meant to.
If there are existing scales, there should be high correlations (but then why are
you making the new scale? Shorter? Better? In which case correlations
may be lower).
What is a high correlation? Conventionally, scales measuring the
same construct should correlate higher than r = .80. Cohen (1988,
1992) defines effect sizes for r as .10 small, .30 medium, .50 large. These are rules
of thumb and can't apply to every situation. Unique, causal, multiply
determined, or objective relationships are always smaller.
Should correlate with theoretically related constructs.
Peer correlations: self-report should correlate with peer ratings. But what
is a high correlation here? Issues of visibility, a huge literature on judgements of
others (David Funder and others), halo bias.
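Whether an observed convergent correlation is compatible with the conventional r = .80 benchmark can be checked with Fisher's z transformation. A sketch with hypothetical numbers (the function name is ours, not a standard API):

```python
import math
from statistics import NormalDist

def fisher_p_vs_benchmark(r_obs, n, r_benchmark=0.80):
    """Two-sided p-value for whether an observed correlation differs
    from a fixed benchmark value, via Fisher's z transformation
    (z is approximately normal with standard error 1/sqrt(n - 3))."""
    z = (math.atanh(r_obs) - math.atanh(r_benchmark)) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A convergent correlation of .72 in a sample of 300 falls reliably
# short of the .80 convention.
print(fisher_p_vs_benchmark(0.72, 300) < 0.05)  # → True
```

The same test illustrates the power issue raised above: in small samples even a sizeable shortfall from .80 will not reach significance.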

Step 9: Show predictive validity


Similar to criterion validity, but differs in that predictive validity concerns
predicting a future outcome.
Could be a behaviour. For example, extroversion should predict how
much a person talks in a small-group task, and sensation seeking should
predict relevant behaviour. Again there are issues of what counts as a high
correlation (Epstein's work on behavioural prediction).
Could be a longitudinal change in functioning (Emmons's work on goal
striving).

Step 10: Show discriminant validity


Show that the scale is NOT correlated with what it is not theoretically
expected to correlate with.
Issues of power.
Issues of whether the other measures are actually any good.
Examples:
Social desirability, but what do those scales actually measure?
Mood inductions. But should the scale be affected? Is the induction any good
(show that it changes other measures, but not yours)? Is this actually
evidence of a lack of sensitivity to change (must be used in combination
with convergent validity)?
Often determined theoretically (e.g., PANAS).

Step 11: Show incremental validity


You must show that your scale can predict some variables after
controlling for other existing scales (re-demonstrate criterion or predictive
validity after controlling for other scales).
Very important, given the duplication of scales in psychology research.
Critical that you choose the right scales to control for (huge potential for
bias).
The Big Five are commonly used (but problematic, due to their hierarchical
organization).
Best to conduct a tough test where you select the most similar other
scales, but where theory suggests that what you're measuring may provide
additional prediction (and watch your results disappear).
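With a single control scale, the gain in R² can be computed directly from the three pairwise correlations using the standard two-predictor formula. A sketch with hypothetical values (new scale correlates .40 with the outcome, the established scale .35, and the two scales .60 with each other):

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Multiple R² for a criterion regressed on two standardized
    predictors, from the three pairwise correlations."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

def incremental_r2(r_y_new, r_y_old, r_old_new):
    """R² gained by adding the new scale after the established one:
    the full two-predictor R² minus R² for the old scale alone."""
    return r_squared_two_predictors(r_y_old, r_y_new, r_old_new) - r_y_old**2

# Despite a respectable zero-order r = .40, the new scale adds only
# about 5.6% explained variance once the old scale is controlled.
print(round(incremental_r2(0.40, 0.35, 0.60), 3))  # → 0.056
```

This is the "watch your results disappear" effect in miniature: the higher the correlation between the new and old scales, the smaller the increment, regardless of the new scale's zero-order validity.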

Summary
Psychometric development involves:
1. Developing a clear rationale for the need to develop the scale
2. Developing a clear definition of the construct and a representative item
pool
3. Identifying the scale's structure and selecting items based on factor
analysis
4. Testing the structure with confirmatory factor analysis
5. Testing internal consistency (reliability)
6. Testing temporal stability (reliability)
7. Showing face validity (validity)
8. Testing criterion validity (validity)
9. Testing predictive validity (validity)
10. Testing discriminant validity (validity)
11. Testing incremental validity (validity)
