
CHARACTERISTICS, CONSTRUCTION and EVALUATION of PSYCHOLOGICAL TESTS

CHARACTERISTICS of PSYCHOLOGICAL TESTS
STANDARDIZATION
• Refers to establishing norms, a frame of reference or point of
comparison, for the performance of different individuals on the
same mental attribute.
• Meaning that all test takers are tested on
the same attributes or knowledge.
• Refers to fixing materials, directions, and
scoring rules so that a test can be given in
the same way by different examiners;
uniformity.
RELIABILITY
• The extent to which a test is consistent in
measuring whatever it does measure;
dependability, stability, trustworthiness,
relative freedom from errors of
measurement.
• A reliable test will provide a consistent
measure of current knowledge, skills, or
characteristics.
VALIDITY
• Indicates whether the test measures
what it was designed to measure.
• When a test is valid, it measures
test-taker characteristics that are
relevant to the purpose of the test.
OBJECTIVITY
• Denotes definiteness in what the test measures: a test must
have a definite purpose or intention. This may also refer to the
degree to which the measure is independent of the personal
opinions, subjective judgments, biases, and beliefs of individual
test users.
ADMINISTRABILITY
• It is critical that all test takers receive
the same instructions and materials and
have the same amount of time to complete
the test.
• It is necessary to minimize the effect of
irrelevant variables or factors other than
the test taker’s knowledge, skills and
characteristics.
• Individual Administration. Requires an examiner to work with a
single test taker. It provides an opportunity for the examiner to
observe test-taker behavior.
• Group Administration. A cost-efficient way to evaluate people.
Group tests minimize the amount of time needed to test a large
number of people, which in turn lowers the cost by reducing the
amount of professional administration time.
• Computer-Assisted Administration. A personal computer or
computer terminal is used to present test items and record
test-taker responses. The computer can be programmed to provide
instructions to the test taker at the beginning of testing, and
help menus can be written to provide additional instructions
during testing.
SCORABILITY
• Tests are designed to measure attributes of
the test taker, and measurement implies the
assignment of numerical values. Tests vary
considerably in the precision and detail of
scoring rules.
• A standard scoring procedure must be applied
the same way to all individuals who take the
test.
• It is a process in which responses are converted
to numbers by comparing them to lists of
possible answers; objective scoring.
PRACTICABILITY
• It is concerned with the aspects of
skills, cost and time. It explains how
much practical and usable the test is;
it is good for one sitting. It also
implies commonness regardless of
race and culture. Another term for
practicability is feasibility.
INTERPRETABILITY
• This implies that the test has a uniform or widely accepted set
of guidelines for interpreting the results or test scores. The
basis of such a set of guidelines is objective data, backed up by
research, that provide the information necessary for test users
to make fully informed interpretations. This also minimizes the
incidence of misuse and abuse of test results.
ECONOMY
• Standardized measures are generally much more economical of
time and money after they have been developed, and they often
free professionals for more important work. Progress generally
favors measures that either require relatively little effort to
employ or allow less highly trained technicians to do the
administration and scoring.
TEST
CONSTRUCTION
1. What are the topics and materials on
which respondents are to be tested?
2. What kind of questions should be
constructed?
3. What item and test formats or layouts
should be used?
Outline:
a. Defining the Test
b. Selecting a scaling method
c. Constructing the items
d. Testing the items
e. Revising the test
f. Publishing the test
a. DEFINING THE TEST
• Scope and purposes must be stated
• Define what you want to measure and how the newly constructed
test differs from any existing test instrument. (Gregory)
• How could the new test make a useful contribution to the field
of psychology?
a. DEFINING THE TEST
6 PRIMARY GOALS (K-ABC):
1. Measure intelligence from a strong theoretical and research
basis
2. Separate acquired factual knowledge from the ability to solve
unfamiliar problems
3. Yield scores that translate to educational intervention
4. Include novel (new) tasks
5. Be easy to administer and objective to score
6. Be appropriate for preschool, minority, and exceptional
children
a. DEFINING THE TEST
Ex. SATT by Dr. Santos:
1. Measures two sets of components: intellective and
non-intellective
2. Easy to administer: there is a definite answer for the
intellective components and a numerical equivalent for the
non-intellective components
3. A Philippine-made test to measure Filipino traits; determines
probable performance or proficiency level
Outline:
a. Defining the Test
b. Selecting a scaling method
c. Constructing the items
d. Testing the items
e. Revising the test
f. Publishing the test
b. SELECTING A SCALING METHOD
• To assign numbers to the responses on the test so that
examinees can be assessed accurately on the characteristic being
measured. The scaling method should suit the trait being
measured.
• Levels of Measurement:
– Nominal – numbers serve only as category names
– Ordinal – some form of ordering or ranking
– Interval – provides ranking information, with equal intervals
between adjacent scale points
– Ratio – has all the characteristics of an interval scale and
also possesses a conceptually meaningful zero
b. SELECTING A SCALING METHOD
5 TYPES of SCALING METHOD:
1. Ranking of Experts – relies on experts' rankings of the
behavior
2. Method of Equal-Appearing Intervals
a. Collect as many true/false or yes/no statements as possible
b. Ask 10 experts to rate each statement to determine its degree
of favorability/unfavorability toward the attitude
c. Determine the mean favorability rating (1-10) and the standard
deviation for each item
d. A large standard deviation reflects ambiguity; items with
large standard deviations should be deleted, leaving roughly
20-30 items
e. A respondent's score is determined by averaging the scale
values of the items endorsed (see the sketch below)
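The arithmetic in steps (b)-(e) is easy to automate. A minimal sketch, assuming ten judges' ratings per statement and an illustrative standard-deviation cutoff (the source gives no numeric threshold):

```python
# Equal-appearing-intervals sketch (steps c-e). Item names, ratings,
# and the SD cutoff are illustrative assumptions, not from the source.
import statistics

# Judges' favorability ratings (1-10), one list per statement.
judge_ratings = {
    "item1": [8, 9, 8, 7, 9, 8, 8, 9, 7, 8],   # consistent -> keep
    "item2": [2, 9, 1, 8, 3, 10, 2, 9, 1, 8],  # ambiguous -> delete
    "item3": [3, 2, 3, 4, 2, 3, 3, 2, 4, 3],
}

SD_CUTOFF = 2.0  # assumed; the source only says "large standard deviation"

# Steps (c)-(d): keep items whose ratings agree across judges,
# using the mean rating as the item's scale value.
scale_values = {
    item: statistics.mean(r)
    for item, r in judge_ratings.items()
    if statistics.stdev(r) <= SD_CUTOFF
}

# Step (e): a respondent's score is the mean scale value of endorsed items.
endorsed = ["item1", "item3"]
score = statistics.mean(scale_values[i] for i in endorsed if i in scale_values)
print(f"scale values: {scale_values}, respondent score: {score:.2f}")
```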
b. SELECTING A SCALING METHOD
5 TYPES of SCALING METHOD:
3. Method of Absolute Scaling – a procedure for obtaining a
measure of absolute item difficulty based upon the results for
different age groups
4. Likert Scale – (Rensis Likert) a widely used method of
attitude assessment
5. Guttman Scales – determine whether a set of attitude
statements is unidimensional
b. SELECTING A SCALING METHOD
LIKERT SCALE:
ADVANTAGES:
1. Quick and economical to administer and score
2. Adapts easily to most attitude-measurement situations
3. Provides a direct and reliable assessment of attitude when
scales are well constructed
4. Lends itself to item-analysis procedures
DISADVANTAGES:
1. Easily faked
2. Intervals between points on the scale do not correspond to
equal changes in attitude
3. Internal consistency of the scale may be difficult to achieve
4. Good attitude statements take time to construct
5. Time-consuming to construct an attitude scale
(A minimal scoring sketch follows.)
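Scoring a Likert scale is mechanical once the items are written: reverse-score the negatively keyed items, then sum. A minimal sketch; the item names, responses, and reverse-keyed set are illustrative assumptions:

```python
# Likert scoring sketch: sum item ratings after reverse-scoring
# negatively worded items. All data here are illustrative.
N_POINTS = 5                   # 1 = strongly disagree ... 5 = strongly agree
reverse_keyed = {"q2", "q4"}   # negatively worded items (assumed)

responses = {"q1": 4, "q2": 2, "q3": 5, "q4": 1}

total = sum(
    (N_POINTS + 1 - r) if item in reverse_keyed else r
    for item, r in responses.items()
)
print(total)  # 4 + (6-2) + 5 + (6-1) = 18
```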
Outline:
a. Defining the Test
b. Selecting a scaling method
c. Constructing the items
d. Testing the items
e. Revising the test
f. Publishing the test
c. CONSTRUCTING THE ITEMS
1. Item Content – the number of items should be specified
COMPONENTS:
Verbal Ability
Abstract Reasoning
Numerical Ability
Judgement
Reading Comprehension
2. Item Format
a. Matching-Type Questions – options must be very closely
related, or the items become too easy
b. True or False Questions – easy to understand
c. Multiple-Choice Questions
d. Short-Answer Items
Outline:
a. Defining the Test
b. Selecting a scaling method
c. Constructing the items
d. Testing the items
e. Revising the test
f. Publishing the test
d. TESTING THE ITEMS
- Determine which items should be retained, revised, or discarded
A. ITEM DIFFICULTY – defined as the proportion of examinees in
large tryout samples who got the item correct. It is a useful
tool for identifying items that should be altered or discarded.
Stocklein Formula:
DF = (PU + PL) / 2
where DF = difficulty index
PU = proportion of correct answers among the upper 27% of the
examinees who took the test
PL = proportion of correct answers among the lower 27% of the
examinees who took the test
STEPS:
1. Arrange the papers from highest to lowest score
2. Identify and separate the upper 27%, counting from the top
3. Identify and separate the lower 27%, counting from the bottom
4. Find PU by dividing the number of correct answers in the upper
group by the number of examinees in that group
5. Find PL by dividing the number of correct answers in the lower
group by the number of examinees in that group
6. Add PU and PL and divide by 2 (a worked sketch follows the
scale below)
.91 - 1.00  Very easy item
.76 - .90   Easy item
.25 - .75   Average item
.10 - .24   Difficult item
.00 - .09   Very difficult item
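A minimal sketch of the computation above; the group size and counts of correct answers are illustrative:

```python
# Difficulty index per the Stocklein formula: DF = (PU + PL) / 2.
def difficulty_index(upper_correct, lower_correct, group_size):
    """DF from counts of correct answers in the upper and lower 27% groups."""
    pu = upper_correct / group_size   # proportion correct, upper 27%
    pl = lower_correct / group_size   # proportion correct, lower 27%
    return (pu + pl) / 2

# Example: 200 examinees -> upper/lower groups of 54 each (27% of 200).
df = difficulty_index(upper_correct=48, lower_correct=20, group_size=54)
print(f"DF = {df:.2f}")  # 0.63 -> "average" item on the scale above
```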
d. TESTING THE ITEMS
B. ITEM RELIABILITY INDEX – a measure of internal consistency,
reflecting how homogeneous the test items are. Computed from the
item's standard deviation and its point-biserial correlation with
the total score.
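A minimal sketch of the underlying correlation; Pearson r computed on a dichotomous item equals the point-biserial coefficient, and the data here are illustrative:

```python
# Item reliability index sketch: the item's standard deviation
# multiplied by its point-biserial item-total correlation.
import numpy as np

item = np.array([1, 0, 1, 1, 0, 1, 0, 1])           # 0/1 scores on one item
total = np.array([42, 30, 38, 45, 28, 40, 33, 44])  # total test scores

r_pb = np.corrcoef(item, total)[0, 1]  # Pearson r on a 0/1 variable is
                                       # the point-biserial correlation
s_item = item.std()                    # item standard deviation
print(f"reliability index = {s_item * r_pb:.3f}")
```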

C. ITEM VALIDITY INDEX – a useful tool for identifying the
potentially good items. The test should possess the highest
possible concurrent or predictive validity. The index can
identify items that should be eliminated or rewritten, and so
produce a revised instrument with greater practical advantage.
D. ITEM CHARACTERISTIC CURVE – a graphical representation of the
relationship between the probability of a correct response and
the examinee's position on the underlying trait measured by the
test
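The source names no specific model, but item characteristic curves are commonly drawn from a logistic function. A sketch using the two-parameter logistic (2PL) model, with assumed parameter values:

```python
# Item characteristic curve under the two-parameter logistic (2PL)
# model, one common choice; the source does not name a model.
import numpy as np

def icc(theta, a=1.2, b=0.0):
    """P(correct) as a function of ability theta.

    a = discrimination (slope), b = difficulty (location); values assumed."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Probability of a correct response rises with the trait level.
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}  P(correct)={icc(theta):.2f}")
```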
E. ITEM DISCRIMINATION INDEX – used to determine whether
examinees who have done well on a particular item have also done
well on the whole test
D = PU - PL
d. TESTING THE ITEMS
E. ITEM DISCRIMINATION INDEX

Hypothetical Data in Computing the Discrimination Index

Item   U    L    D     Interpretation
1      59   28   .25   Reasonably good item
2      89   29   .48   Very good item
3      55   55   .00   Poor item
4      96   0    .77   Very good item
5      29   93  -.51   Poor item

Discrimination Scale
.40 and above   Very good item
.30 - .39       Reasonably good item
.20 - .29       Marginal item
.19 and below   Poor item
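A sketch reproducing the table's D values. The source does not state the group size, but the figures are consistent with upper and lower groups of 125 examinees each, which the code assumes:

```python
# Discrimination index D = PU - PL, from counts of correct answers
# in the upper and lower groups. Group size of 125 is an inference.
def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

data = [(59, 28), (89, 29), (55, 55), (96, 0), (29, 93)]  # (U, L) per item
for i, (u, l) in enumerate(data, start=1):
    print(f"item {i}: D = {discrimination_index(u, l, 125):+.2f}")
# item 1: +0.25, item 2: +0.48, item 3: +0.00, item 4: +0.77, item 5: -0.51
```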
Outline:
a. Defining the Test
b. Selecting a scaling method
c. Constructing the items
d. Testing the items
e. Revising the test
f. Publishing the test
e. REVISING THE ITEMS
• To identify unproductive items in the initial test so that they
can be eliminated, revised, or rewritten
• Further validate the test items in terms of their clarity and
comprehensiveness
• Produce the final form of the test (this could be the basis for
establishing reliability and validity indices and norms)
Outline:
a. Defining the Test
b. Selecting a scaling method
c. Constructing the items
d. Testing the items
e. Revising the test
f. Publishing the test
f. PUBLISHING THE TEST

Manual:
a. Description of the Test
b. What the test measures
c. Test Design and Construction
d. Validation process
e. Test Administration Guide
f. Scoring and Interpretation
g. Table of Norms
EVALUATING
PSYCHOLOGICAL
TESTS
EVALUATING THE
TEST RELIABILITY
TEST-RETEST
METHOD
• Test-retest reliability estimates are used to
evaluate the error associated with administering a
test at two different times
• This type of analysis is of value only when we
measure “traits” or characteristics that do not
change over time
• Administer the same test on two well-specified
occasions and then find the correlation between
scores from the two administrations using the
Pearson Product Moment Correlation Coefficient
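A minimal sketch of the computation, using NumPy's built-in Pearson correlation and illustrative scores:

```python
# Test-retest reliability: Pearson correlation between the scores
# from two administrations of the same test. Data are illustrative.
import numpy as np

time1 = np.array([85, 72, 90, 66, 78, 88, 61, 74])  # first administration
time2 = np.array([83, 75, 92, 64, 80, 85, 65, 72])  # second administration

r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")
```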
TEST-RETEST
METHOD
TAKE NOTE:
• Carryover and Practice Effects. These occur when the first
testing session influences scores from the second session.
Because of these problems, the time interval between testing
sessions must be selected and evaluated carefully!
PARALLEL FORMS
METHOD
• Parallel forms reliability compares two
equivalent forms of a test that measure
the same attribute. The two forms use
different items; however, the rules used
to select items of a particular difficulty
level are the same.
• The Pearson product moment correlation
coefficient is used as an estimate of the
reliability
SPLIT-HALF METHOD
• In split-half reliability, a test is given and divided into
halves that are scored separately. The results of one half of the
test are then compared with the results of the other by computing
the correlation coefficient (Pearson r) between them.
• Odd-Even System: one subscore is obtained for the odd-numbered
items in the test and another for the even-numbered items
SPLIT-HALF METHOD
• The resulting statistic is the reliability of the half test. To
obtain the reliability of the entire test, a correction formula
must be applied, which is the Spearman-Brown Formula:

r_whole = 2 * r_half / (1 + r_half)
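A minimal sketch combining the odd-even split with the Spearman-Brown correction; the item-score matrix is illustrative:

```python
# Split-half reliability with an odd-even split, corrected to full
# length with the Spearman-Brown formula. Data are illustrative.
import numpy as np

# Rows = examinees, columns = items scored 0/1.
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)   # subscore on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)  # subscore on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_whole = 2 * r_half / (1 + r_half)     # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, corrected r = {r_whole:.2f}")
```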
KR20 FORMULA
• Takes care of possible problems that
may arise in splitting the tests into
halves. This method gets at the
internal consistency of the test
through an analysis of the individual
test items.
KR20 FORMULA
KR20 = r = [N / (N - 1)] * [(S² - ∑pq) / S²]

KR20 = the reliability estimate (r)
N = the number of items on the test
S² = the variance of the total test scores
p = the proportion of people getting each item correct (found
separately for each item)
q = the proportion of people getting each item incorrect; for
each item, q = 1 - p
∑pq = the sum of the products of p times q over all items on the
test
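A minimal sketch of the formula on a small illustrative 0/1 item-score matrix (the population-variance form is assumed, since the source does not specify):

```python
# KR20 internal-consistency estimate from a 0/1 item-score matrix.
import numpy as np

# Rows = examinees, columns = items scored 0/1. Data are illustrative.
items = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
])

n_items = items.shape[1]
s2 = items.sum(axis=1).var()   # S^2: variance of the total scores
p = items.mean(axis=0)         # proportion correct, per item
q = 1 - p                      # proportion incorrect, per item

kr20 = (n_items / (n_items - 1)) * (s2 - (p * q).sum()) / s2
print(f"KR20 = {kr20:.2f}")
```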
COEFFICIENT ALPHA
• This is used for tests that do not have right or wrong answers,
such as many personality and attitude scales.
• Cronbach developed a formula that estimates the internal
consistency of tests in which the items are not scored as 0 or 1
(right or wrong).
COEFFICIENT ALPHA
r = α = [N / (N - 1)] * [(S² - ∑Si²) / S²]

• Si² denotes the variance of each item, whether or not the items
are in a right-wrong format
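A minimal sketch of the formula; the rating data are illustrative, and the same population-variance convention as the KR20 sketch is assumed:

```python
# Cronbach's alpha for items on an arbitrary rating scale.
import numpy as np

# Rows = respondents, columns = items rated 1-5. Data are illustrative.
items = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [3, 4, 3, 3],
])

n_items = items.shape[1]
item_vars = items.var(axis=0)        # Si^2 for each item
total_var = items.sum(axis=1).var()  # S^2 of the total scores

alpha = (n_items / (n_items - 1)) * (total_var - item_vars.sum()) / total_var
print(f"alpha = {alpha:.2f}")
```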
EVALUATING THE
TEST VALIDITY
FACE VALIDITY
• The crudest type of validity, pertaining to whether the test
looks valid; that is, on the face of the instrument, it looks
like it can measure what you intend to measure.
• An instrument that presents only face validity is an open
target of criticism, because this type of validity is not
supported by any evidence that the test really measures anything.
CONTENT-RELATED
VALIDITY
• This refers to the most crucial procedure of test construction,
because content validity sets the pace for the succeeding
validity and reliability measures.
• The test items need to be a representative sample of the
content of the variable being measured, because it is the items
that really reflect the "whatness" of the property the test
intends to measure.
CONTENT-RELATED
VALIDITY
• Content validity is the degree to
which the test represents the
essence, the topics and the areas
that the test is designed to measure.
• This is reported in terms of non-numerical data, unlike the
other types of validity.
CRITERION-RELATED
VALIDITY
• Criterion validity evidence tells us
just how well a test corresponds with
a particular criterion. Such evidence
is provided by high correlations
between a test and a well-defined
criterion measure. A criterion is the
standard against which the test is
compared.
CRITERION-RELATED
VALIDITY
• Predictive Validity. The forecasting
function of tests is actually a type or form
of criterion validity evidence. The purpose
of the test is to predict the likelihood of
succeeding on the criterion.
• Concurrent Validity comes from
assessments of the simultaneous
relationship between the test and the
criterion. It applies when the test and the
criterion can be measured at the same
time.
CONSTRUCT-RELATED
VALIDITY
• It is the validation of the theory or concept behind the test.
It involves discovering positive correlations between and among
the variables/constructs that define the concept.
• Why a test is successful in predicting a criterion is the
question asked in construct validation.
