
PSYCH 162 University of the Philippines-Manila Prof. Leni Ramos-Carballo

Handout 1:

Psychometric Properties of Psychological Assessment Measures

by Dr. Celeste Fabrie (2013)

To understand what is meant by the psychometric properties of psychological assessment measures, it is necessary to examine the two component concepts separately. In other words, what do we mean when we speak about psychometric properties, and what is implied by the concept of psychological measures?

Psychometrics is essentially the study of mental traits and behavioural characteristics expressed in amounts or scores. For example, these measures can take the form of intelligence scales, or ratings of different attitudes against a specific population standard or norm. In short, psychometrics involves the assessment of human data according to known, specific standards applied by the experienced researcher or scientist.

Without psychometric theory, it would be difficult to develop a reliable and valid psychological measure. It would be impossible to study human intelligence without comparing a person or group to a normative sample. Psychological assessment measures, on the other hand, involve five different processes: diagnosis, classification, treatment planning, self-knowledge, and research and program evaluation (Gregory, 2000, p. 41).

The aim of this essay is to present the principles of psychometric theory and to show how these principles are integrated in the development of a psychological measure. The following major concepts, together with their related values and tests, will be discussed: the different types of norms, criterion-referenced tests, psychological measures and the reliability and validity of these measures, as well as the advantages and limitations of each test.

I. Definition of key concepts

Our discussion begins with an overview of the following concepts:

1. Different types of norms. Norms can be defined as a table of values summarizing the performance of a particular reference group, against which an individual's raw score is measured. Furthermore, norms help us to make important standard comparisons, whereby we can judge how much a person's score deviates from the average of the population or of a representative group in a sample (Rosnow & Rosenthal, 1999, p. 222).
2. Criterion-referenced tests. These tests are basically the opposite of norm-referenced tests. Criterion-referenced tests are mainly concerned with the personal achievements of the tested person, rather than with comparisons against the performance or abilities of external groups. For example, these tests are ideal for assessing individual educational needs.
3. Psychological measures. To recapitulate: psychological measures or assessment measures involve quantification techniques. In other words, psychological measurement is a dynamic process, forever changing, yet retaining its original structure. That is, a psychological measurement can be a sophisticated process with the following characteristics:

- scores or categories
- behaviour samples
- norms or standards
- standardized procedures
- prediction of nontest behaviour

4. Reliability and validity of psychological measures. Test reliability can be defined as the degree to which a test consistently measures what it is supposed to measure. Unfortunately, problems such as random error can greatly influence the reliability of an instrument. Validity, in turn, often pertains to the contents of a test or measuring instrument. For instance, if a particular personality trait is being measured by a certain test, then it is expected that the test actually measures that trait; otherwise it is invalid.

II. Various types of norms

The term "normed score" on its own means very little to test takers if they do not know what kind of norms they are being tested against. It is up to the tester to explain an individual's raw score against the background of a representative population group. The following norms will be discussed briefly to demonstrate the importance of understanding a person's test score on a norm-referenced test.

1. Mental age scales and Grade Equivalents. These two types of norms are both referred to as "developmental norms", but with a slight difference. Mental age scales encourage similar-age test comparisons. In other words, the performance level of a child who is 10 years of age will be compared to the performance level of other children of the same age group. These age norms are a convenient way of testing children's developmental characteristics, which are dynamically changing in comparison to the more stable traits of adults.
Grade equivalents are quite similar to age norms. However, instead of checking the aptitude or ability of a similar age group, grade norms measure the standard of test performance for every individual grade depicted in the normative sample. This means that school performance is measured against a normative sample from the same class or grade in the school, which is a convenient method of testing a child's academic performance against a similar grade equivalent (Gregory, 2000, p. 71).

2. Percentiles. A percentile is a relative measure that can range between 0 and 100. A percentile describes the percentage of scores in the sample falling below a particular score. For example, we talk about the 50th percentile (also referred to as the "median") when half the scores fall below that point and half above it (Rosnow & Rosenthal, 1999, p. 233). A short code sketch at the end of this section illustrates these conversions.

3. Stanines and Sten Scales (standard scores). Stanines transform test takers' raw scores onto a 1-9 point scale; the 10-unit sten scale which Canfield (1951) recommended is basically a slight variation on the stanine scale. Both measures were useful norming devices in the pre-computer age. Stanines always have a mean of 5 and a standard deviation of approximately 2. Scores are ranked from the lowest to the highest, with the bottom 4% of scores receiving a stanine of 1, and so forth (Gregory, 2000, p. 68).

4. Deviation IQ. This norm is often used to express the scores obtained on intelligence tests. However, this kind of scale is often misinterpreted by inexperienced persons wanting an easy "labelling" system to describe a person's intelligence above or below the marker of 100. It would be foolish to treat a single IQ test as a final verdict on one's intelligence or character. For example, using obsolete tests can either inflate or deflate a participant's IQ scores.
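
To make these norm-based conversions concrete, here is a minimal Python sketch (not part of the original handout; the normative sample and the raw score are fictitious) showing how one raw score might be expressed as a percentile rank, a stanine and a deviation IQ against a normative sample:

```python
import numpy as np

def percentile_rank(score, sample):
    """Percentage of scores in the norm sample falling below the given score."""
    return 100.0 * np.mean(np.asarray(sample) < score)

def stanine(score, sample):
    """Map a raw score onto the 1-9 stanine scale (mean 5, SD ~2)."""
    # Classic stanine bands of 4-7-12-17-20-17-12-7-4 percent give these
    # cumulative cut points.
    cuts = np.percentile(sample, [4, 11, 23, 40, 60, 77, 89, 96])
    return int(np.searchsorted(cuts, score)) + 1

def deviation_iq(score, sample):
    """Standard score rescaled to the familiar mean-100, SD-15 metric."""
    z = (score - np.mean(sample)) / np.std(sample)
    return 100 + 15 * z

norm_sample = np.random.default_rng(0).normal(50, 10, 1000)  # fictitious norms
print(percentile_rank(62, norm_sample))  # roughly the 88th percentile
print(stanine(62, norm_sample))          # roughly stanine 7
print(deviation_iq(62, norm_sample))     # roughly 118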

III. Criterion-referenced tests

A criterion-referenced test is the opposite of a norm-referenced test. Where the latter measures an individual's performance against a representative group, the former measures a person's mastery or nonmastery of skills in a particular content domain, such as a task on dexterity or memory performance (Gregory, 2000, p. 74). An expectancy table is a good example of such a test.

Expectancy tables. This kind of table normally gives a practical overview of candidates' predictor results against a specific criterion. For example, expectancy tables show the relationship between an individual's test scores and what he or she is able to accomplish later on in life, whether it be a certain career path or achieving good grades in an upcoming college entrance exam, as sketched below. However, such tables also have limitations, mainly because they reflect the results of large representative group scores, which mirror the social or school standards of their time. In fact, such tables, like most other norm-based tests, require constant updates or checks in order to accomplish what they are supposed to: reliable and valid results (Gregory, 2000, p. 72).
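
As an illustration (a hypothetical sketch, not from the source; the entrance-test scores and the pass/fail criterion are simulated), an expectancy table can be built by banding predictor scores and tabulating the observed success rate on the criterion within each band:

```python
import numpy as np

# Fictitious data: entrance-test scores, and whether each student later
# achieved good first-year grades (True = yes).
rng = np.random.default_rng(1)
scores = rng.normal(500, 100, 400)
passed = (scores + rng.normal(0, 120, 400)) > 500  # noisy later criterion

bands = [(300, 400), (400, 500), (500, 600), (600, 700)]
print("Score band   % later successful")
for lo, hi in bands:
    in_band = (scores >= lo) & (scores < hi)
    if in_band.any():
        rate = 100.0 * passed[in_band].mean()
        print(f"{lo}-{hi}      {rate:5.1f}%")
```

Each row of the printed table is an expectancy statement: of candidates scoring in this band, this percentage later met the criterion.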

IV. Constructing a psychological measure

1. The planning phase. This is probably one of the most important steps in developing a psychological measure. The planning phase involves very careful decision making. For example, an engineer will plan every move and step along the way before delivering a proposal for an effective new railway bridge. Likewise, when planning a psychological measure, the tester will have to consider, for instance, the choice, format and length of the items he or she will include in the test measure.

2. Format of the items. Test construction is not just a simple matter of throwing any kind of item into a main batch. It is crucial to decide what type of item format is required. It is of no value to test a questionnaire on a personality trait with items that test for physical speed, such as running 500 metres in a certain time. Likewise, it would make no sense to test a preschooler's performance on an arithmetic test meant for an 8-year-old school child. The length of the measure should also be suitable for the particular group: it would be a waste of time to give someone with a major depressive disorder (and on strong medication) a test which requires 3 hours of heavy concentration. Test items can come in different formats and styles, such as multiple-choice questionnaires, true-false items, forced-choice and closed/open-response items, and so forth. It must be remembered, however, that item selection is never a perfect system; it always introduces some item measurement error into assessment tests. That is why careful consideration is applied to the planning and implementation of item selection from the beginning to the end stages, to keep measurement error as small as possible.
3. Item analysis phase. The item analysis phase involves three different types of item statistics: 1) the item difficulty value, 2) the discrimination value, and 3) the item-total correlation.

These item statistics help the researcher to choose the most suitable items for the final measure. It is always wise to try to adapt tests on a homogeneous basis, which means taking into account the demographic features of the test person, such as age, sex, socio-economic background, educational status and, most importantly, cultural differences.

The item difficulty value is the proportion of examinees who answer a given test question correctly. If only a minimal percentage get it wrong, then the item is clearly too easy and should be adjusted; the reverse is also true.
The discrimination value shows how well an item discriminates between those who obtain high and low scores on the complete test.
The item-total correlation is a point-biserial correlation (a special case of the Pearson r), which expresses the relationship between two variables. The higher the correlation between a single item and the total score, the better the item is considered to be with regard to internal consistency. In other words, a good measure should have homogeneous items with a high level of internal consistency. A sketch of all three statistics follows below.
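
The three item statistics can be sketched as follows (illustrative Python only; the 0/1 response matrix is simulated, and the upper/lower 27% split used for the discrimination index is one common convention, not prescribed by the handout):

```python
import numpy as np

# Fictitious 0/1 item-response matrix: rows = examinees, columns = items.
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, 200)
X = (ability[:, None] + rng.normal(0, 1, (200, 8)) > 0).astype(int)
total = X.sum(axis=1)

for i in range(X.shape[1]):
    item = X[:, i]
    p = item.mean()  # item difficulty: proportion answering correctly

    # Discrimination index: pass rate of the top 27% on the total score
    # minus the pass rate of the bottom 27%.
    order = np.argsort(total)
    n = int(0.27 * len(total))
    d = item[order[-n:]].mean() - item[order[:n]].mean()

    # Item-total (point-biserial) correlation; the item is removed from
    # the total so it is not correlated with itself.
    r = np.corrcoef(item, total - item)[0, 1]

    print(f"item {i}: difficulty={p:.2f}  discrimination={d:.2f}  r_it={r:.2f}")
```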

V. Reliability of a psychological measure

Reliability, according to Gregory (2000), "expresses the relative influence of true and error scores on obtained test scores." To understand what reliability means, imagine a scale weighing a kilo of grapes. The greengrocer weighs the grapes twice in a row and each time gets a slightly different reading, never quite the same as the first weighing. In other words, reliability is not an absolute measure; there will always be a slight inconsistency between the first test and the second test, and such fluctuations are a matter of degree. Repeated results help the tester confirm some form of accuracy in scores, but this will not mean much without validity, which will be discussed later in this essay.

1. Correlation coefficient. A correlation coefficient r takes values ranging from -1.00 to +1.00. A value of +1.00 indicates a perfect positive linear relationship between two test results. A zero correlation occurs when two variables, such as height and reaction time, have no relationship to one another. To assess the reliability of psychological test scores, the same test can be taken twice, namely with a test-retest method. We can then compare the variance in the obtained scores with the variance in true scores.
2. Statistical significance. This method goes beyond simply computing a correlation coefficient between two variables. The psychometrician, for example, is not just interested in a small sample of test persons, but would like to generalize the results to a larger population. In general, the larger the sample, the greater the statistical power, so it is better to reduce error by increasing the size and homogeneity of our sample. If a correlation is significant at the .01 level, we know that the probability of such a result arising by chance is 1 in 100, which is a rather good estimate.
3. Reliability coefficient. The reliability coefficient is the proportion of true score variance (the consistent factors) to the total variance of the test results. In plain terms, the total variance is the sum of the true score variance (the stable attribute which we are testing) and the error score variance, or errors of measurement. A brief simulation at the end of this section illustrates this decomposition.
4. Observed score. The observed or obtained score can be drastically altered by random events or measurement errors. To avoid this problem, it is up to the researcher to reduce as many of these nuisances as possible in order to have a reliable measure. Classical theory (Gregory, 2000, pp. 77-79) stipulates that a negative measurement error can contribute to an obtained score being much lower than the true score. A positive measurement error, on the other hand, can contribute to an obtained score higher than the actual true score. Either way, one of the students taking a specific knowledge test will come out better due to unbalanced item selection or some other measurement error.
5. True and error scores. True and error scores are uncorrelated according to classical measurement theory. True scores are hypothetical; they are never really known. It is the error scores, however, that give test developers headaches. For example, a researcher sets out to measure a trait of nervousness and keeps obtaining a measurement of confidence. Obviously something is inconsistent in this test measure. It could be that the researcher has chosen incorrect test items based on obsolete tests, or that the persons being tested are not suitable test candidates. The fact is that errors of measurement will give false observed scores, and if the same test were repeated, the end results would be inconsistent. Test construction should therefore be carefully planned from the beginning in order to prevent such measurement errors from creeping into the results.

6. True variance and error variance in a test score. Briefly, true variance reflects a more homogeneous, internally consistent set of items than error variance. Error variance results from poor content sampling, as in alternate-form and split-half reliability, as well as from heterogeneity of the traits under observation. A high inter-item consistency, on the other hand, indicates a more homogeneous measure with little inconsistency. For example, if two half-tests yield two different results, we speak of error variance: the two half-tests are inconsistent with one another.
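
The classical decomposition of observed scores into true and error components, and its link to test-retest reliability, can be illustrated with a small simulation (a sketch with fictitious parameters, not from the source text):

```python
import numpy as np

# Classical test theory: observed = true + error, with true and error
# scores uncorrelated. Reliability = true variance / total variance.
rng = np.random.default_rng(3)
true = rng.normal(100, 15, 1000)        # hypothetical true scores
test1 = true + rng.normal(0, 5, 1000)   # first administration
test2 = true + rng.normal(0, 5, 1000)   # second administration (retest)

theoretical = true.var() / (true.var() + 5**2)  # error variance = 5**2
retest_r = np.corrcoef(test1, test2)[0, 1]      # test-retest estimate
print(f"theoretical reliability ~ {theoretical:.2f}")
print(f"test-retest correlation ~ {retest_r:.2f}")
```

With these parameters the true-variance ratio is 225/250 = 0.90, and the observed test-retest correlation lands close to it, showing why the retest correlation serves as a reliability estimate.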

VI. Stability of a test – the advantages and limitations

1. Test-retest reliability. In this kind of measurement, the same test is administered twice to the same test group. This sample group is of a heterogeneous nature and representative of the general population. The idea behind this method is to correlate the two sets of scores to obtain a reliability estimate. The advantage of such a design is that the second score can be predicted from the results of the first test, provided there is a correlation between the two. There are limitations, however: sources of error variance such as experience, maturation, lengthy time spans between tests, illness and so forth can affect retest reliability (PSY498-8, p. 6).
2. Alternate-form reliability. Alternate forms of the same test are issued to the test persons, and the correlation between the two sets of scores is computed (quite similar to test-retest reliability). However, there is a difference between the two: the alternate-form method introduces item-sampling differences (error variance) which can limit reliability. For example, some students may cope very well with the items on the first form but do quite badly on the second because its items are not identical to those of the first. Another limitation is the high cost of producing alternate forms of a test, and the difficulty involved in constructing truly parallel forms (Gregory, 2000, p. 83).

VII. Internal consistency of a test – advantages and limitations

Apart from alternate-form reliability and test-retest reliability, there are other methods to test items for consistency: for example, split-half reliability, the Kuder-Richardson 20 and Cronbach's alpha.

1. Split-half reliability. As the name implies, this method correlates two scores derived from a single test, achieved by "splitting" the test into equivalent halves. It sounds complicated, but it is actually quite an effective measure. For example, if the scores on the two halves correlate strongly, then the scores on two complete tests from two different administrations should in principle show the same correlation (Gregory, 2000, p. 84). Internal consistency is therefore estimated from only a single administration. There are advantages to this method, such as lengthening the test to produce more reliability or sampling a large behaviour domain. But there are also limitations as to how one can "split" the items of a single test. One can try dividing odd- and even-numbered items, or separating easy and difficult items; however, this becomes a problem when the test developer has to split drawings or comprehension texts.
2. Kuder-Richardson 20. We use the Kuder-Richardson 20 or KR-20 formula (Kuder & Richardson, 1937) to find the internal consistency of a single administration of one test, as discussed in the split-half procedure. What this formula actually does is score each test item as 0 for wrong and 1 for right. However, when tests go beyond the scope of the KR-20 formula, such as in the testing of heterogeneous item formats, we use coefficient alpha (Cronbach, 1951). This formula is suitable, for example, for attitude scales where test persons must rate their answers as strongly agree, disagree, and so forth (Gregory, 2000, p. 86). A sketch of these coefficients follows below.
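
Here is a brief illustrative sketch (not from the handout; the response matrix is simulated) of split-half reliability with the Spearman-Brown step-up, and of coefficient alpha, which reduces to KR-20 when items are scored 0/1:

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha; matches KR-20 (up to the variance estimator)
    when items are dichotomous 0/1."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def split_half(X):
    """Odd-even split-half r, stepped up with the Spearman-Brown formula."""
    odd, even = X[:, 0::2].sum(axis=1), X[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# Fictitious 0/1 response matrix: 200 examinees x 10 items.
rng = np.random.default_rng(4)
ability = rng.normal(0, 1, 200)
X = (ability[:, None] + rng.normal(0, 1, (200, 10)) > 0).astype(int)

print(f"split-half (Spearman-Brown): {split_half(X):.2f}")
print(f"KR-20 / coefficient alpha:   {cronbach_alpha(X):.2f}")
```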

VIII. Validity of a psychological measure

Validity can be described as the degree to which a measure does what it is supposed to do. In other words, the psychological measure should give a well-grounded indication of the correspondence between the trait being tested and the operational definition of the construct. Furthermore, the measuring instrument must test, and only test, what it was designed to test. For example, it is no use designing an instrument for an intelligence scale and then using the same measure to test for "running speed" (Blanche & Durrheim, 2002, p. 83). The following validity procedures will be discussed:

- content validity

- criterion-related validity

- predictive validity

- concurrent validity

1. Content validity. Content validity is a suitable measure when testing for traits such as knowledge, as in an examination paper (Blanche et al., 2002, p. 85). In other words, this type of validation examines whether the item samples on a test are representative of a greater domain, which could be several textbooks covering one field or topic. It would be impossible to test an examinee on the entire contents of a particular subject such as engineering (time is normally limited with such tests).

Content validation sometimes runs into difficulties when abstract traits, such as personality and aptitudes, have to be tested. It is difficult to give an accurate test description of something like racism or morals, as these traits do not fit snugly between the pages of a subject book (Blanche et al., 2002, p. 85).

Face validity is another matter to consider. For instance, how does the test appear to others? Does it look too complicated, or does it have an unprofessional appearance? Face validity needs to be taken seriously if the measure is going to be accepted by persons in authority, for example from a legal or educational point of view (PSY498-8/102).

2. Criterion-related validity. Criterion-related validity normally involves correlating a test with other, similar tests or research. In other words, a researcher who discovers a new form of "job mobbing" in corporations and industry will compare previous studies in this field with his or her new findings. There are two types of validity measures used to test for criterion validity, namely predictive validity and concurrent validity.

3. Predictive validity. As the name implies, predictive validity helps predict future events from existing scores, budgets, educational performance and so forth. For example, future inflation rates can be predicted from present statistics on a country's economic performance in relation to the rest of the world. Concurrent validation, on the other hand, replaces predictive validity measures when it comes to making a present diagnosis of a pupil's immediate performance, rather than predicting future events (PSY498-8/102).

4. Concurrent validity. This method is more suitable when testing abstract traits, such as an immediate problem of depression. The clinician can judge the patient's observable behaviour and cognitive performance and make a suitable diagnosis. It would be difficult, however, to use a method of predictive validity in such a case: one cannot "predict" whether someone suffering from a dark mood one day is going to suffer from depression in the future. A positive feature of concurrent validity is that costs are kept to a minimum and results are normally immediate, compared to predictive validity. A brief sketch contrasting the two approaches follows below.
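
The contrast between concurrent and predictive validity can be sketched as two correlations with the same criterion measured at different times (a hypothetical simulation; the trait, test and noise levels are fictitious assumptions):

```python
import numpy as np

# Fictitious data: a screening test, a criterion measured at the same
# time (concurrent), and the same criterion measured a year later
# (predictive), which has drifted and is therefore noisier.
rng = np.random.default_rng(5)
trait = rng.normal(0, 1, 300)
test = trait + rng.normal(0, 0.5, 300)
criterion_now = trait + rng.normal(0, 0.5, 300)
criterion_later = trait + rng.normal(0, 1.0, 300)

concurrent = np.corrcoef(test, criterion_now)[0, 1]
predictive = np.corrcoef(test, criterion_later)[0, 1]
print(f"concurrent validity ~ {concurrent:.2f}")  # typically higher
print(f"predictive validity ~ {predictive:.2f}")  # attenuated by time
```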

IX. Major methods of establishing construct validity

Examples of constructs or traits are: technical and mechanical knowledge, running speed, frustration, reading and spelling abilities, and so forth. How do we measure such constructs? Firstly, the researcher gathers as much data as possible on a particular trait, through observations, interrelationships with other behavioural or cognitive measures, and so forth.

We are looking at both theoretical and empirical methods of establishing construct validity.

1. Developmental changes

It is common knowledge that developmental changes take place between childhood and adulthood, which means that both behaviour and cognitive abilities change, perhaps more rapidly in childhood than in later years, when they tend to "stabilize".

Age differentiation is also shaped by a specific culture, since different cultures have different child-rearing patterns and beliefs. The Piagetian ordinal scales, for example, with their sequential patterning of developmental schemas, indicate the gradual progression of conceptual skills from early childhood to early adulthood. This is an example of construct validation of ordinal scales across several developmental levels (PSY498-8/102).

2. Correlations with other tests. It is a necessary condition, when correlating a new test with existing tests, that the correlation is not so high as to render the new test redundant. A good mix would be correlations ranging between low and moderately high, but no more. In other words, it would be ridiculous to compare a new test of an intelligence factor with a similar test, only to find out later that the new test is actually measuring a personality disorder!

3. Factor analysis. This is a family of statistical techniques which many researchers adopt to explain relationships between variables or constructs that correlate highly with one another, and to obtain a parsimonious description of the data. Factor analysis allows for the testing of a multitude of major mental abilities, such as comprehension, memory and number recognition, compared to more conservative instruments such as the Stanford-Binet tests (Gregory, 2000, p. 23). Its primary goal is to produce a neat, comprehensible set of statistics by reducing many "untidy" test variables to a more efficient, economical set of common traits (a minimal sketch appears at the end of this section).

4. Internal consistency. Briefly, internal consistency aims for significant item-test correlations, with every item keyed in the same direction as the test as a whole. Another way of testing for internal consistency is to correlate subtest scores with the total score. Take, for instance, certain intelligence test factors such as reading ability, arithmetic, spelling and so forth: all the subtest scores are added together to give a total test score. Of course, it is necessary that items are homogeneous in order to achieve good internal test consistency.

5. Convergent and discriminant validation. Convergent validation of a test means that the test correlates highly with other tests or traits that share a common factor. Such validation is normally carried out on a heterogeneous sample to test for convergence. At the same time, the test should not correlate with unrelated variables; for example, a test of vocabulary ability should not correlate with a test of arithmetic reasoning.

Discriminant validation is particularly important for personality tests. Discriminant validation occurs when there is a non-correlation between two theoretically unrelated variables, such as popularity and intelligence; this would be a negligible correlation, if any correlation at all.
The multitrait-multimethod matrix (Campbell & Fiske, 1959) combines the assessment of two or more traits with two or more methods (Gregory, 2000, pp. 110-111). This matrix provides a good source of data on convergent and discriminant validity, as well as reliability.

6. Experimental interventions. Any form of experimental intervention a researcher undertakes will involve "control" of the test situation. This is done in order to "isolate" common treatment factors and remove any unwanted interference that could invalidate the results. There are numerous research designs to choose from, such as a standard one-group pretest-posttest design for testing construct validation in a scholastic test.

There are also other designs, such as the equivalent time-series design, which spread out over lengthy time periods (Neumann, 1997, pp. 183-197). Whatever design is chosen, there will always be a certain amount of experimental interference. The researcher seeks solutions to problems, or tries to find a better experimental method to test different hypotheses for present and future generations.
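
As a final illustration of the factor-analysis method discussed above, here is a minimal sketch using scikit-learn's FactorAnalysis (not from the handout; the subtest names and the two latent abilities are fictitious assumptions):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Fictitious scores on six subtests driven by two latent abilities
# (verbal and numerical), plus noise.
rng = np.random.default_rng(6)
verbal, numeric = rng.normal(0, 1, (2, 300))
X = np.column_stack([
    verbal + rng.normal(0, 0.5, 300),   # vocabulary
    verbal + rng.normal(0, 0.5, 300),   # comprehension
    verbal + rng.normal(0, 0.5, 300),   # spelling
    numeric + rng.normal(0, 0.5, 300),  # arithmetic
    numeric + rng.normal(0, 0.5, 300),  # number series
    numeric + rng.normal(0, 0.5, 300),  # number recognition
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
# Loadings: the three verbal subtests should load on one factor and the
# three numerical subtests on the other, recovering the common traits.
print(np.round(fa.components_, 2))
```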

Summary

The goal of this essay was to explain what is meant by the psychometric properties of psychological assessment measures. The principles central to psychometric theory were discussed, namely the different types of norms, criterion-referenced tests, psychological measures and the reliability and validity of these measures, as well as the advantages and limitations of each test.

Conclusion

Psychometric testing of psychological measurements is an extensive procedure. A number of processes are involved in assessing human data, and none of them can be carried out in a vacuum. Human traits do not remain stable over time, which is why researchers are always testing and retesting their instruments against the dynamics of human behaviour. It is therefore safe to conclude that no test is a complete test. As this essay has demonstrated, there are always advantages and limitations to assessment measures; what works for one test may not necessarily work for another. Sometimes the question is not merely the degree to which a test measures what it is supposed to measure, but how well the test sample relates to the real world of people.

References
Durrheim, K. (2002). Quantitative measurement. In M. Terre Blanche & K. Durrheim (Eds.), Research in Practice (pp. 72-95). Cape Town: UCT Press.

Gregory, R.J. (2000). Psychological Testing (3rd ed.). Illinois: Allyn and Bacon.

Neumann, W.L. (1997). Social Research Methods (3rd ed.). Needham Heights: Allyn & Bacon.

Rosnow, R.L. & Rosenthal, R. (1999). Beginning Behavioral Research (3rd ed.). New Jersey: Prentice Hall.

Tutorial Letter 102 for PSY498-8. (2003). Psychological Assessment (sections from pp. 3-24). Pretoria: Unisa Press.
