
Title

Psychometric Properties of Psychological Assessment Measures

by

Dr. Celeste Fabrie



Contents

1 Introduction

2 Definition of key concepts
2.1 Different types of norms
2.2 Criterion referenced tests
2.3 Psychological measures
2.4 Reliability and validity of psychological measures

3 Various types of norms
3.1 Mental age scales and grade equivalents
3.2 Percentiles
3.3 Stanines and Sten scales
3.4 Deviation IQ

4 Criterion referenced tests
4.1 Expectancy tables

5 Constructing a psychological measure
5.1 The planning phase
5.2 Format of the items
5.3 Item analysis phase:
- Item difficulty value
- Item discrimination value
- Item-total correlation

6 Reliability of a psychological measure
6.1 Correlation coefficient
6.2 Statistical significance
6.3 Reliability coefficient
6.4 Observed score
6.5 True and error scores
6.6 True variance and error variance in a test score

7 Stability of a test – the advantages and limitations
7.1 Test-retest reliability
7.2 Alternate-form reliability

8 Internal consistency of a test – advantages and limitations
8.1 Split-half reliability
8.2 Kuder-Richardson 20 and Cronbach’s Alpha

9 Validity of a psychological measure
9.1 Content validity
9.2 Criterion-related validity
9.3 Predictive validity
9.4 Concurrent validity

10 Major methods of establishing construct validity
10.1 Developmental changes
10.2 Correlations with other tests
10.3 Factor analysis
10.4 Internal consistency
10.5 Convergent and discriminant validation
10.6 Experimental interventions

11 Summary

12 Conclusion

References

1. Introduction

To understand what is meant by the psychometric properties of psychological
assessment measures, it is necessary to consider the two parts of the phrase
separately. In other words, what do we mean when we speak about
psychometric properties, and what is implied by the concept of psychological
measures?

Psychometrics is essentially the study of mental traits and behavioural
characteristics expressed as amounts or scores. For example, psychometric
tests can take the form of intelligence scales, or of ratings of different
attitudes against a specific population standard or norm. In short,
psychometrics involves the assessment of human data according to known,
specific standards applied by the experienced researcher or scientist.

Without psychometric theory, it would be difficult to develop a reliable
and valid psychological measure. It would be impossible to study human
intelligence without comparing a person or group to a normative sample.
Psychological assessment measures, on the other hand, involve five different
processes: diagnosis, classification, planning of treatment, self-knowledge,
and research and program evaluation (Gregory, 2000, p. 41).

The aim of this essay is to demonstrate the principles of psychometric
theory, and how these principles are integrated to explain how a
psychological measure is developed. The following major concepts, including
their related values and tests, will be discussed: the different types of
norms, criterion referenced tests, psychological measures, and the
reliability and validity of these measures, as well as the advantages and
limitations of each test.

2 Definition of key concepts
Our discussion begins with an overview of the following concepts:

2.1 Different types of norms


Norms can be defined as a table of values that allows an individual’s raw
score to be interpreted against the performance of others in a particular
group. Furthermore, norms help us to make important standard comparisons,
whereby we can judge how much a person’s score deviates from the average of
the population or representative group in a sample (Rosnow & Rosenthal,
1999, p. 222).

2.2 Criterion referenced tests

These tests are essentially the opposite of norm-referenced tests.
Criterion-referenced tests are concerned with the personal achievements of
the tested person rather than with comparisons against the performance or
abilities of external groups. For example, these tests are ideal for the
testing of individual educational needs.

2.3 Psychological measures


To recap: psychological measures, or assessment measures, involve
quantification techniques. In other words, psychological measurement is a
dynamic process, forever changing but retaining its original structure.
That is, psychological measurement can be a sophisticated process with the
following characteristics, as mentioned by Gregory (2000, p. 30):
- scores or categories
- behaviour samples
- norms or standards
- standardized procedures
- prediction of nontest behaviour

2.4 Reliability and validity of psychological measures


Test reliability can be defined as the consistency with which a test
measures whatever it measures. Unfortunately, problems such as “random
error” can greatly influence the reliability of an instrument (more about
that later, in section 6).

Validity, by contrast, often pertains to the contents of a test or measuring
instrument. For instance, if a particular personality trait is being
measured by a certain test, then it is expected that this test actually
measures that trait; otherwise it is invalid.

3 Various types of norms


The term “normed score” on its own means very little to test takers if they
do not know what kind of norms they are being tested against. It is up to
the tester to explain an individual’s raw score against the background of a
representative population group. The following norms will be discussed
briefly to demonstrate the importance of understanding a person’s test score
on a norm-referenced test.

3.1 Mental age scales and Grade Equivalents


These two types of norms are both referred to as “developmental norms”, but
they differ slightly. Mental age scales encourage similar-age test
comparisons. In other words, the performance level of a child who is 10
years of age will be compared to the performance level of other children of
the same age group. These age norms are a convenient way of testing
children’s developmental characteristics, which change dynamically in
comparison to the more stable traits of adults.

Grade equivalents are quite similar to age norms. However, instead of
checking the aptitude or ability of a similar age group, grade norms give
the standard of test performance for every individual grade depicted in the
normative sample. This means that school performance is measured against a
normative sample from the same class or grade in the school. This is a more
convenient method of testing a child’s or scholar’s academic performance
against that of a similar grade (Gregory, 2000, p. 71).

3.2 Percentiles
A percentile is a relative measure that ranges between 0 and 100. It
expresses the percentage of scores in the norm sample that fall below a
particular score. For example, we speak of the 50th percentile (also
referred to as the “median”) when half of the scores fall below a given
score and half fall above it (Rosnow & Rosenthal, 1999, p. 233).
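
As a minimal sketch of this idea, the percentile rank of a score can be
computed as the percentage of scores in a norm group that fall below it. The
function name `percentile_rank` and the scores are hypothetical, and other
definitions (for example, counting half of the tied scores) are also in use.

```python
def percentile_rank(scores, value):
    """Percentage of scores in the norm group that fall below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100.0 * below / len(scores)


# Hypothetical raw scores from a small norm group (illustrative only).
norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]

# Five of the ten scores lie below 22, so 22 sits at the 50th percentile.
print(percentile_rank(norm_group, 22))  # 50.0
```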

3.3 Stanines and Sten Scales (standard scores)


Stanines transform test takers’ raw scores onto a 9-point scale (1 to 9).
The 10-unit sten scale, which Canfield (1951) recommended, is a slight
variation on the stanine scale. Both measures were useful devices for
working with norms in the pre-computer age. Stanines always have a mean of
5 and a standard deviation of approximately 2. Scores are ranked from the
lowest to the highest, with the bottom 4% of scores receiving a stanine of
1 and so forth (Gregory, 2000, p. 68).
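
The ranking described above can be sketched as a lookup from percentile rank
to stanine. The text only gives the bottom 4%; the remaining cut-offs below
assume the conventional 4-7-12-17-20-17-12-7-4 percentage split, and the
name `stanine_from_percentile` is hypothetical.

```python
from bisect import bisect_right

# Cumulative upper bounds (in percent) of stanines 1 to 8; anything above
# the last bound is stanine 9. Assumes the conventional
# 4-7-12-17-20-17-12-7-4 split of the ranked scores.
CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(pct):
    """Map a percentile rank (percentage of scores below) to a stanine 1-9."""
    return bisect_right(CUTOFFS, pct) + 1


for pct in (2, 50, 97):
    print(pct, stanine_from_percentile(pct))
```

A score in the bottom 4% maps to stanine 1, a median score to stanine 5, and
a score in the top 4% to stanine 9.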

3.4 Deviation IQ
This norm is often used to express the scores obtained on intelligence
tests. However, this kind of scale is often misinterpreted by inexperienced
persons wanting to find an easy “labelling” system to describe a person’s
intelligence as above or below a particular marker of 100. It would be
foolish to use one single IQ test as a final verdict on one’s intelligence
or character. For example, using obsolete tests can either inflate or
deflate a participant’s IQ scores.
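
A rough sketch of how a deviation IQ is derived: the raw score is converted
to a standard (z) score relative to the norm group and then rescaled around
the marker of 100. The standard deviation of 15 is an assumption (it is a
common choice, but some tests use other values), and the norm scores and the
name `deviation_iq` are hypothetical.

```python
from statistics import mean, pstdev


def deviation_iq(raw, norm_scores, iq_mean=100, iq_sd=15):
    """Rescale a raw score to a deviation IQ (iq_sd=15 is an assumption)."""
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)
    return iq_mean + iq_sd * z


# Hypothetical raw scores for the relevant age group (illustrative only).
norm = [40, 45, 50, 55, 60]

print(deviation_iq(50, norm))  # a raw score at the group mean gives 100
print(round(deviation_iq(60, norm), 1))
```

The sketch also shows why outdated norms matter: the same raw score yields a
different IQ as soon as `norm_scores` changes.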

4. Criterion referenced tests


A criterion referenced test is the opposite of a norm-referenced test. Where
the latter measures an individual’s performance against a representative
group, the former measures a person’s mastery or nonmastery of skills in a
particular content domain, such as a task on dexterity or memory
performance (Gregory, 2000, p. 74). An expectancy table is a good example
of such tests.

4.1 Expectancy tables
This kind of table gives a practical overview of the relationship between
candidates’ predictor results and a specific criterion. For example,
expectancy tables relate an individual’s test scores to what he or she is
able to accomplish later on, whether it be success in a certain career path
or good grades in an upcoming college entrance exam. However, such tables
also have limitations, mainly because they reflect the results of large
representative groups, and therefore the social or school standards of
their time. In fact, such tables, like most other norm-based tests, require
constant updates or checks in order to deliver what they are supposed to,
namely reliable and valid results (Gregory, 2000, p. 72).
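
As a hedged illustration, an expectancy table can be built by grouping
predictor scores into bands and recording the proportion of people in each
band who later met the criterion. The data, the band width of 10 and the
name `expectancy_table` are hypothetical.

```python
from collections import defaultdict


def expectancy_table(pairs, band_width=10):
    """pairs: (predictor_score, met_criterion) tuples.

    Returns {band_start: proportion of that band who met the criterion}.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [successes, total]
    for score, met_criterion in pairs:
        band = (score // band_width) * band_width
        counts[band][0] += int(met_criterion)
        counts[band][1] += 1
    return {band: s / n for band, (s, n) in sorted(counts.items())}


# Hypothetical entrance-test scores and whether each applicant later passed.
applicants = [(42, False), (47, False), (48, True), (53, True),
              (55, False), (58, True), (61, True), (66, True)]

for band, proportion in expectancy_table(applicants).items():
    print(f"{band}-{band + 9}: {proportion:.2f}")
```

Because the proportions come from one particular group at one point in time,
the table has to be rebuilt when that group no longer resembles current
candidates.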

5. Constructing a psychological measure


5.1 The planning phase
This is probably one of the most important steps in constructing a
psychological measure. The planning phase involves very careful decision
making. For example, an engineer will plan every move and step along the
way before he delivers his proposal for a new, effective railway bridge.
Likewise, when planning a psychological measure, the tester will have to
consider, for instance, the choice, format and length of the items he or
she will include in the measure.

5.2 Format of the items


Test construction is not just a simple matter of throwing any kind of item
into a main batch. It is crucial to decide what type of item format is
required. It is of no value to mix a questionnaire on a personality trait
with items that test for physical speed, such as running 500 metres in a
certain time. Similarly, it would make no sense to test a preschooler’s
performance on an arithmetic test meant for an 8-year-old school child. The
length of the measure should also be suitable for the particular group. It
would be a waste of time to give someone with a major depressive disorder
(and on strong medication) a test which requires 3 hours of heavy
concentration.

In other words, test items can come in different formats and styles such
as multiple-choice questionnaires, true-false items, forced-choice,
closed/open-response and so forth.

It must be remembered, however, that item selection is never a perfect
system. It always introduces some measurement error into assessment tests.
That is why careful consideration is given to the planning and
implementation of item selection, from the beginning to the end stages, to
keep measurement error to a minimum.

5.3 Item analysis phase


The item analysis phase involves 3 different types of item statistics:
- item difficulty value
- discrimination value
- item-total correlation
These item statistics help the researcher to choose the most suitable items
for the final measure. It is always wise to adapt tests to the group being
tested, which means taking into account the demographic features of the
test takers, such as age, sex, socio-economic background, educational
status and, most importantly, cultural differences.

The item difficulty value is the proportion of a large group of students
who answer a single test question correctly. If only a very small
percentage get it wrong, the item is clearly too easy and should be
adjusted. The reverse is also true.

The discrimination value shows how well an item discriminates between those
who obtain high scores and those who obtain low scores on the complete test.
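
The two statistics just described can be sketched as follows. The comparison
of the top and bottom 27% of scorers is an assumption (a common convention,
not something stated in the sources cited here), and the data and function
names are hypothetical.

```python
def item_difficulty(item_scores):
    """Proportion answering the item correctly (1 = right, 0 = wrong)."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Item difficulty among top scorers minus that among bottom scorers.

    `fraction` sets the size of each extreme group (0.27 is an assumption).
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, round(len(order) * fraction))
    low = [item_scores[i] for i in order[:k]]
    high = [item_scores[i] for i in order[-k:]]
    return item_difficulty(high) - item_difficulty(low)


# Hypothetical data: one item's scores and the total test scores of the
# same ten students (illustrative only).
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [28, 25, 24, 22, 20, 15, 14, 12, 9, 7]

print(item_difficulty(item))  # 0.5
print(round(discrimination_index(item, totals), 2))
```

A difficulty near 1 flags an item that is too easy, a value near 0 one that
is too hard, and a discrimination value near zero or below flags an item
that fails to separate strong from weak test takers.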

The item-total correlation is a point-biserial correlation (a special case
of the Pearson r), which expresses the relationship between 2 variables,
here the score on a single item and the total score. The higher the
correlation between a single item and the total score, the better the item
is considered with regard to internal consistency. In other words, a good
measurement should have items that are homogeneous, with a high level of
internal consistency.

6. Reliability of a psychological measure
Reliability, according to Gregory (2000), “expresses the relative influence
of true and error scores on obtained test scores.” To understand what
reliability means, imagine a scale weighing a kilo of grapes. The
greengrocer weighs the grapes twice in a row, and the second reading differs
slightly from the first. In other words, reliability is not an absolute
measure. There will always be a slight inconsistency between the first test
and the second test, and the size of these fluctuations is a matter of
degree. Repeating a measurement helps the tester to confirm the consistency
of scores, but this will not mean much without validity, which is discussed
later in this essay.

6.1 Correlation coefficient


A correlation coefficient r takes values ranging from –1.00 to +1.00.
A value of +1.00 indicates a perfect positive linear relationship between 2
sets of test results. A zero correlation occurs when 2 variables, such as
height and reaction time, have no relationship to one another.
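
For readers who want to see the calculation, here is a short sketch of the
Pearson r written out from its definition. The function name `pearson_r` and
the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


first = [1, 2, 3, 4]
print(pearson_r(first, [2, 4, 6, 8]))  # perfectly linear, rising
print(pearson_r(first, [8, 6, 4, 2]))  # perfectly linear, falling
```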

To estimate the reliability of psychological test scores, the same test can
be administered twice, namely with a test-retest method, and the two sets of
scores correlated. The resulting coefficient indicates how much of the
variance in the obtained scores reflects variance in true scores.

6.2 Statistical significance


Significance testing goes beyond simply computing a correlation coefficient
between 2 variables. The psychometrician is not just interested in a small
sample of test persons, but would like to generalize the results to a
larger population. The larger the sample, the more readily a given
correlation reaches statistical significance. For instance, it is better to
try for fewer errors by increasing the size and homogeneity of our sample.
If a correlation is significant at the .01 level, a correlation that large
would be expected to arise by chance alone in fewer than 1 out of 100
samples, which is a rather strict standard.
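
The effect of sample size can be sketched with the usual t statistic for
testing whether a correlation differs from zero. This formula is a standard
textbook result rather than something taken from the sources cited here, and
the name `t_statistic_for_r` is hypothetical. Turning t into a probability
still requires a t table or a statistics package.

```python
from math import sqrt


def t_statistic_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for a Pearson r from n pairs."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# The same modest correlation in a small and in a larger sample.
for n in (20, 100):
    print(n, round(t_statistic_for_r(0.30, n), 2))
```

The same r of .30 produces a much larger t with 100 test persons than with
20, so it is far more likely to pass a strict significance level.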

6.3 Reliability coefficient
The reliability coefficient is the ratio of true score variance (factors
which are consistent) to the total variance of test results. In plain terms,
the total variance is the true score variance (the stable attribute which we
are testing) plus the error score variance (errors of measurement), and the
coefficient tells us what share of that total is true score variance.
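
Written as a small function, assuming the two variance components were
known (in practice they have to be estimated), the ratio looks like this.
The name `reliability_coefficient` and the numbers are hypothetical.

```python
def reliability_coefficient(true_variance, error_variance):
    """Share of the total score variance that is true score variance."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance


# Hypothetical variance components (illustrative only).
print(reliability_coefficient(80, 20))  # 0.8
```

A coefficient of 1.0 would mean no measurement error at all, and 0.0 would
mean the scores consist of error only.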

6.4 Observed score


The observed score, or obtained score, can be drastically altered by random
events or measurement errors. It is therefore up to the researcher to reduce
as many of these nuisances as possible in order to have a reliable measure.
Classical theory (Gregory, 2000, pp. 77-79) stipulates that a negative
measurement error results in an obtained score that is lower than the true
score. A positive measurement error, on the other hand, results in an
obtained score that is higher than the true score. Either way, one student
taking a specific knowledge test may come out better than another because
of unbalanced item selection or some other measurement error.
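
The classical model, in which each observed score is a true score plus a
positive or negative error, can be illustrated with a small simulation. All
numbers (true scores with a standard deviation of 10, errors with a standard
deviation of 5, 5000 simulated test takers) are assumptions chosen for the
sketch, not values from the literature.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical true scores and independent random measurement errors.
true_scores = [random.gauss(50, 10) for _ in range(5000)]
errors = [random.gauss(0, 5) for _ in range(5000)]

# Observed score = true score + error (error may be positive or negative).
observed = [t + e for t, e in zip(true_scores, errors)]

# Share of observed variance that is true variance; the assumed standard
# deviations imply a value of about 100 / (100 + 25) = 0.8.
estimated_reliability = pvariance(true_scores) / pvariance(observed)
print(round(estimated_reliability, 2))
```

In real testing the true scores are never known, which is why reliability
has to be estimated indirectly, for example from repeated or split tests.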

6.5 True and error scores


According to classical measurement theory, true and error scores are
uncorrelated. True scores are hypothetical; they are never really known.
It is error scores, however, that give test developers headaches. For
example, the researcher sets out to measure nervousness and keeps on
obtaining a measurement of confidence. Clearly there is something
inconsistent about this test measure. It could be that the researcher has
chosen incorrect test items based on obsolete tests, or that the persons
being tested are not suitable test candidates. The fact is that errors of
measurement give false observed scores. If the same test were repeated, the
end results would be inconsistent. Test construction should therefore be
carefully planned from the beginning in order to prevent such measurement
errors from creeping into the results.

6.6 True variance and error variance in a test score
Briefly, true variance reflects consistent differences between test takers
and goes together with homogeneous, internally consistent items, whereas
error variance reflects inconsistency. Error variance results from poor
content sampling, as in alternate-form and split-half reliability, as well
as from heterogeneity of the traits under observation. A high inter-item
consistency, on the other hand, indicates homogeneous items with little
inconsistency. For example, if 2 half-tests give 2 different results, we
speak of error variance: the two half-tests are inconsistent with one
another.

7 Stability of a test – the advantages and limitations

7.1 Test-retest reliability


In this kind of measurement, the same test is administered twice to the same
test group. This sample group is of a heterogeneous nature and is
representative of the general population. The idea behind this kind of test
is to correlate the two sets of scores to obtain a measure of reliability.
The advantage of such a test is that the second score can be predicted from
the results of the first test, provided the two sets of scores correlate.
There are limitations, however, to such tests. Sources of error variance,
such as experience, maturation, lengthy time spans between tests, illness
and so forth, could affect retest reliability (PSY498-8, p. 6).

7.2 Alternate-form reliability


Alternate forms of the same test are issued to the test persons, and the
correlation between the two sets of scores is calculated (which is quite
similar to test-retest reliability). However, there is a difference between
the two. The alternate-form method introduces item-sampling differences
(error variance), which can limit reliability. For example, some students
may cope very well with the items on test 1 but do quite badly on the second
test because its items are not identical to those of the first. Another
limitation is the high cost of producing alternate forms of a test, and the
difficulty of producing truly parallel forms (Gregory, 2000, p. 83).

8. Internal consistency of a test – advantages and limitations
Apart from alternate-form reliability and test-retest reliability, there
are other methods of testing items for consistency. These include split-half
reliability, the Kuder-Richardson 20 and Cronbach’s Alpha.

8.1 Split-half reliability


As the name implies, this kind of test correlates 2 scores obtained from a
single test. This is achieved by “splitting” the test into equivalent
halves. It sounds complicated, but it is actually quite an effective
measure. For example, if the test scores on both halves show a strong
correlation, then the scores on two complete tests from 2 different
administrations should in principle show the same correlation (Gregory,
2000, p. 84). Internal consistency is therefore estimated from only a
single administration.

Of course there are advantages to this method, such as lengthening the test
to produce more reliability or studying a large behaviour domain. But there
are also limitations as to how one can “split” the items on a single test.
One can try dividing the items into even and odd numbers, or separating easy
and difficult items. However, this becomes a problem when the test developer
has to split drawings or comprehension texts.

8.2 Kuder-Richardson 20 and Cronbach’s Alpha
We use the Kuder-Richardson 20 or KR20 (1937) formula if we want to find the
internal consistency of a single administration of one test, as discussed
under the split-half procedure. The formula applies to test items that are
scored as a 0 for wrong and a 1 for right. However, when the items go beyond
what the KR20 formula can handle, such as heterogeneous items that are not
simply right or wrong, we use the Coefficient Alpha (Cronbach, 1951). This
formula is suitable, for example, for attitude scales where test persons
must rate their answers as strongly agree, disagree and so forth (Gregory,
2000, p. 86).
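
Both coefficients can be sketched in a few lines. The versions below use the
population variance throughout (an assumption; some texts use the sample
variance), in which case the two formulas give the same value for items
scored 0 or 1. The data and function names are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a matrix of item scores (rows = test takers)."""
    k = len(responses[0])
    item_variances = [pvariance([person[j] for person in responses])
                      for j in range(k)]
    total_variance = pvariance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def kr20(responses):
    """KR20 for items scored 0 (wrong) or 1 (right)."""
    n, k = len(responses), len(responses[0])
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in responses) / n  # proportion correct
        sum_pq += p * (1 - p)
    total_variance = pvariance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum_pq / total_variance)


# Hypothetical 0/1 responses of six test takers to six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(kr20(responses), 2), round(cronbach_alpha(responses), 2))
```

Unlike `kr20`, `cronbach_alpha` would also accept rating-scale items, for
example answers coded from 1 (strongly disagree) to 5 (strongly agree).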

9 Validity of a psychological measure
Validity can be described as the degree to which a measure does what it is
supposed to do. In other words, there should be a well-grounded
correspondence between the trait being tested and the operational
definition of the construct. Furthermore, the measuring instrument must
test what it was designed to test, and only that. For example, it is no use
designing an instrument to measure intelligence and then using the same
measure to test “running speed” (Blanche & Durrheim, 2002, p. 83). The
following validity procedures will be discussed:
- content validity
- criterion-related validity
- predictive validity
- concurrent validity

9.1 Content validity


Content validity is a suitable measure when testing for traits such as
knowledge, as in an examination paper (Blanche et al., 2002, p. 85). In
other words, it concerns whether the items on a test are a fair sample of a
greater pool of material, which could be several text books covering one
field or domain topic. It would be impossible to test an examinee on the
entire contents of a particular subject such as engineering, since time is
normally limited in such tests.

Content validation sometimes runs into difficulties when abstract traits,
such as personality and aptitudes, have to be tested. It is difficult to
give an accurate test description of something like racism or morals, as
these traits do not fit snugly between the pages of a subject book (Blanche
et al., 2002, p. 85).

Face validity is another matter to consider. For instance, how does the test
appear to others? Does it look too complicated, or does it have an
unprofessional appearance? Face validity needs to be taken seriously if the
measure is going to be accepted by persons in authority, namely from a
legal and educational point of view (PSY498-8/102).

9.2 Criterion-related validity
Criterion-related validity is normally established by correlating a measure
with other similar tests or research. In other words, a researcher who
discovers a new form of “job mobbing” in corporations and industry will
compare previous studies in this field with his or her new findings. There
are 2 types of criterion-related validity, namely predictive validity and
concurrent validity.

9.3 Predictive validity


As the name implies, predictive validity helps to predict future events from
existing scores, budgets, educational performance and so forth. For
example, future inflation rates can be predicted from present statistics on
a country’s economic performance in relation to the rest of the world.
Concurrent validation, on the other hand, replaces predictive validity
measures when it comes to making a present diagnosis of a pupil’s immediate
performance, and not of future events (PSY498-8/102).

9.4 Concurrent validity


This type of method is more suitable when testing abstract traits, as with
someone suffering from an immediate problem of depression. The clinician
can judge the patient’s observable behaviour and cognitive performance, and
make a suitable diagnosis. It would be difficult, however, to use a method
of predictive validity in such a case. One cannot “predict” whether someone
who is suffering from a dark mood one day is going to suffer from
depression in the future. A positive feature of concurrent validity is that
costs are kept to a minimum and results are normally immediate, compared
to predictive validity.

10 Major methods of establishing construct validity


Examples of constructs or traits are technical and mechanical knowledge,
running speed, frustration, reading and spelling abilities and so forth.
How do we measure such constructs? Firstly, the researcher gathers as much
data as possible on a particular trait, through observations,
interrelationships with other behavioural or cognitive measures and so
forth.

We are looking at both theoretical and empirical methods of establishing
construct validity. Several methods are discussed below.

10.1 Developmental changes


It is common knowledge that developmental changes take place between
childhood and adulthood. Both behaviour and cognitive abilities change,
perhaps more rapidly in childhood than in later years, when they tend to
“stabilize”.

Age differentiation is also shaped by the specific culture. Different
cultures have different child-rearing patterns or beliefs. The Piagetian
ordinal scales, for example, use the sequential patterning of development,
or schemas, to trace the gradual growth of conceptual skills from early
childhood to early adulthood. This is an example of construct validation of
ordinal scales over several developmental levels (PSY498-8/102).

10.2 Correlations with other tests


When a new test is correlated with other tests, it is a necessary condition
that the correlation is not so high that the new test becomes redundant. A
good result lies between low and “moderately” high, but no more. Equally,
it would be ridiculous to compare a new test of a factor of intelligence
with a similar test, and then find out later that the new test is actually
measuring a personality disorder!

10.3 Factor analysis


This is a particular family of statistics which many researchers adopt to
explain relationships between variables or constructs that correlate highly
with one another. The method is used to obtain a strictly parsimonious set
of data. In other words, factor analysis allows for the testing of a
multitude of major mental abilities, such as comprehension, memory, number
recognition and so forth, compared to more conservative tests such as the
Stanford-Binet tests (Gregory, 2000, p. 23). Factor analysis has one primary
goal, and that is to produce a neat, comprehensible set of statistics by
cutting back too many “untidy” test variables to a more efficient,
economical set of common traits.

10.4 Internal consistency


Briefly, internal consistency aims for significant item-test correlations,
with the items pointing in the same key direction as the test. Another way
of testing for internal consistency is to correlate subtest scores with the
total score. Take, for instance, certain intelligence test factors: reading
ability, arithmetic, spelling and so forth. All the subscores are added
together to give a total test score. Of course, it is necessary that items
are homogeneous in order to achieve good internal consistency.

10.5 Convergent and Discriminant validation


Convergent validation means that a test correlates highly with other tests
or traits that share a common factor. Such tests are normally administered
to a heterogeneous sample to test for convergence. It also means that the
test should not correlate with unrelated variables. For example, a test of
vocabulary ability should not correlate with a test of arithmetic
reasoning.

Discriminant validation is important to personality tests. Discriminant
validity is demonstrated when a test does not correlate with measures of
variables from which it should differ, such as popularity and intelligence.
We would expect a correlation close to zero, if any correlation at all.

The multitrait-multimethod matrix (Campbell & Fiske, 1959) combines the
assessment of two or more variables with two or more methods (Gregory,
2000, pp. 110-111). This matrix provides a good source of data on
discriminant and convergent validity, as well as reliability.

10.6 Experimental interventions
Any form of experimental intervention a researcher does will involve
“control” of the test situation. This is done in order to “isolate” common
treatment factors, and remove any unwanted interferences that could
invalidate results. There are numerous research designs to choose from,
such as a standard one-group pretest-posttest design for testing construct
validation in a scholastic test.

Then there are other designs, such as the Equivalent Time Series, that are
spread out over lengthy time periods (Neumann, 1997, pp. 183-197). Whatever
design is chosen, there will always be a certain amount of experimental
interference. The researcher seeks solutions to problems, or tries to find a
better experimental method to test different hypotheses for present and
future generations.

11. Summary
The goal of this essay was to explain what was meant by psychometric
properties of psychological assessment measures. The principles necessary
to psychometric theory were discussed, namely, the different types of norms,
criterion referenced tests, psychological measures and the reliability and
validity of these measures, as well as the advantages and limitations of each
test.

12. Conclusion
Psychometric testing of psychological measurements is an extensive
procedure. There are a number of processes involved in assessing human
data, and these cannot be carried out in a vacuum. People do not remain
stable over time, which is why researchers are always testing and retesting
their products against the dynamics of man. It is therefore safe to
conclude that no test is a complete test. As this essay has demonstrated,
there are always advantages and limitations to assessment measures. What
works for one test may not necessarily work for another. Sometimes the
question is not the degree to which a test measures what it is supposed to
measure, but how the test sample relates to the real world of people.

References

Durrheim, K. (2002). Research in Practice. In M.T. Blanche (Ed.),
Quantitative Measurement (pp. 72-95). Cape Town: UCT Press.

Gregory, R.J. (2000). Psychological Testing. 3rd edition. Illinois: Allyn
and Bacon, Inc.

Neumann, W.L. (1997). Social Research Methods. 3rd edition. Needham Heights:
Allyn & Bacon.

Rosnow, R.L. & Rosenthal, R. (1999). Beginning Behavioral Research. 3rd
edition. New Jersey: Prentice Hall.

Tutorial Letter 102 for PSY498-8. (2003). Psychological Assessment.
Pretoria: Unisa Press. (sections from pp. 3-24).
