EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT

1968, 28, 105-113.

AN INDEX OF AN INDIVIDUAL’S AGREEMENT WITH
GROUP-DETERMINED ITEM DIFFICULTIES

THOMAS F. DONLON
Educational Testing Service
AND

FREDERIC E. FISCHER
State University of New York—
College at Oswego

The major aims of this article are to define an index called the
“personal biserial” (rperbis) and to discuss the application of this
index in the light of its apparent properties. The “personal biserial”
is the correlation between a person’s distribution of item difficulties
on a specific test and the distribution of item difficulties generated
by some reference group. The index is a function of the same
responses which yield a person’s total test score. A major impetus
for this discussion, therefore, is the prospect that with the use of
this index a single aptitude or achievement test could yield two
essentially independent predictive measures, thus offering additional
information concerning the examinee without requiring additional
test-administration time.

The Index
In item analysis the biserial correlation coefficient, rbis, is
frequently used to measure the extent to which success on an item
reflects success on the test. The meaning and function of rbis may
be made clearer by reference to a rectangular matrix of the
responses (Rij) of N persons to a test of K items (see Table 1).
In such a matrix, a row represents a person’s responses across items,
while a column represents an item’s “successes” across persons.
Correct responses may be indicated by 1’s and incorrect responses

Downloaded from epm.sagepub.com at UNIV OF MICHIGAN on April 25, 2015


by 0’s. (Note in this matrix that for the sake of simplicity it is
assumed that all items on the test have been answered by all
examinees.) An additional column, the item-criterion column,
typically lists a score for each person which may be the total
rights-only score on the items being analyzed (Ti.), a function of the
total rights-only score, or even a score on a test which is
independent of the items at hand (Ci). Each rbis is a correlation between
the column of responses to an item and the column of criterion
scores. This biserial correlation, sometimes called the item
discrimination index, is defined as:

$$ r_{bis} = \frac{\bar{Y}_{+} - \bar{Y}_{R}}{S_{Y_R}} \cdot \frac{p_R}{u} $$

where: $\bar{Y}_{+}$ = mean criterion score of examinees who mark the
item correctly
$\bar{Y}_{R}$ = mean criterion score of examinees who reach the item
$S_{Y_R}$ = standard deviation of $Y_R$
$p_R$ = number of examinees who mark the item correctly
divided by the number who reach the item
$u$ = ordinate in the unit normal distribution which
divides the area under the curve into the proportions
$p_R$ and $1 - p_R$
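As a minimal sketch of the item biserial just defined, the following Python function computes rbis for one item column against a criterion column. The function name and the toy data are hypothetical, every examinee is assumed to have reached the item, and the population standard deviation is used as an assumption, since the definition above does not say which standard deviation is intended.

```python
from statistics import NormalDist, mean, pstdev

def item_biserial(item_responses, criterion_scores):
    """Biserial correlation of one item (0/1 column) with a criterion column.

    Assumes every examinee reached the item and uses the population
    standard deviation (an assumption, not stated in the source).
    """
    n = len(item_responses)
    p = sum(item_responses) / n  # proportion marking the item correctly
    if p in (0.0, 1.0):
        raise ValueError("rbis is undefined when everyone passes or everyone fails")
    mean_right = mean(y for r, y in zip(item_responses, criterion_scores) if r == 1)
    mean_reached = mean(criterion_scores)
    sd_reached = pstdev(criterion_scores)
    # Ordinate of the unit normal curve at the point cutting off proportion p.
    u = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean_right - mean_reached) / sd_reached * p / u

# Hypothetical data: eight examinees, one item, total score as the criterion.
item = [1, 1, 1, 0, 0, 1, 0, 1]
total = [30, 28, 25, 24, 15, 22, 26, 27]
r_bis = item_biserial(item, total)
print(round(r_bis, 2))
```

With these made-up scores the examinees who pass the item have the higher mean criterion score, so the coefficient is positive. Note that a biserial is not strictly bounded by 1 in small or non-normal samples.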

The new index proposed here is essentially rbis applied to the
transpose of the response matrix. That is, the new index is a
correlation between the row of responses by a person and a row of
criterion scores, the criterion scores in this case being the item
difficulty indices (Δ’s).¹ Each Δj is a function of the total rights-only
score across subjects (T.j) and could conceivably be based on either
(1) the sample of persons being analyzed, or (2) an independent
sample. In this discussion, the difficulty index is considered to be
defined in such a way that large values correspond to difficult items,
small values to easy items. Accordingly, in the following formula
for rperbis, the logically corresponding means are reversed from the
order in which they appear in rbis, thus assuring a positive
correlation when a person’s correct responses agree with the difficulties,
i.e., when a person’s correct responses tend to be on the easiest items
(which have small Δ’s). This “personal biserial” is therefore
defined as:

$$ r_{perbis} = \frac{\bar{\Delta}_{R} - \bar{\Delta}_{+}}{S_{\Delta_R}} \cdot \frac{p_R'}{u'} $$

where: $\bar{\Delta}_{+}$ = mean item difficulty for items marked correctly
$\bar{\Delta}_{R}$ = mean item difficulty for items reached
$S_{\Delta_R}$ = standard deviation of $\Delta_R$
$p_R'$ = number of items marked correctly divided by
the number of items reached
$u'$ = ordinate in the unit normal distribution which
divides the area under the curve into the proportions
$p_R'$ and $1 - p_R'$
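The definition above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' own procedure: the function name, the ten made-up difficulty values, and the two response patterns are hypothetical, all items are treated as reached, and the population standard deviation is assumed.

```python
from statistics import NormalDist, mean, pstdev

def personal_biserial(responses, difficulties):
    """rperbis for one person: 0/1 responses against item difficulties.

    Larger difficulty values mean harder items. All items are assumed
    reached; the population standard deviation is an assumption.
    """
    n = len(responses)
    p = sum(responses) / n  # proportion of reached items marked correctly
    if p in (0.0, 1.0):
        raise ValueError("rperbis is undefined for all-right or all-wrong patterns")
    mean_right = mean(d for r, d in zip(responses, difficulties) if r == 1)
    mean_reached = mean(difficulties)
    sd_reached = pstdev(difficulties)
    u = NormalDist().pdf(NormalDist().inv_cdf(p))
    # Means are reversed relative to rbis, so success on easy items is positive.
    return (mean_reached - mean_right) / sd_reached * p / u

# Hypothetical difficulties for ten items, easiest first.
deltas = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
conforming = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]        # mostly easy items right
reversed_pattern = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # mostly hard items right

r_conforming = personal_biserial(conforming, deltas)
r_reversed = personal_biserial(reversed_pattern, deltas)
print(round(r_conforming, 2), round(r_reversed, 2))
```

Both patterns have six items right, yet the first yields a positive index and the second a negative index of the same size, because only the placement of the correct responses along the difficulty scale differs.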

Thus, the personal biserial index measures the relationship
between the difficulty of the items in the test for the person, as
evidenced by his passes and failures, and the difficulty of the items,
as evidenced by the group-determined item difficulties. It is positive
when there is agreement. A person who would tend to get difficult
items correct and easy items incorrect, thus disagreeing with the
group, would have a negative rperbis. Similarly, scores which resulted
from pure chance responses would generate personal biserials with
a mean of zero, because responses made in this manner would not,
on the average, be correlated with difficulty (or, indeed, with
anything); the group of correct responses should tend to include just
as many difficult items as easy ones.

¹ Δ is the name of the standard-score difficulty measure used by
Educational Testing Service. However, as used in this discussion it could
be any difficulty index with an approximately normal distribution.

The foregoing remarks assume that the item difficulties are
derived from “different” samples. However, the item difficulties
could be derived from the same sample for which rperbis is being
calculated. It is intuitively clear that this situation builds into
rperbis a component of “forced agreement.” That is, each person’s
responses tend to agree more or less with the item difficulties by
virtue of the fact that the Δ’s are partially determined by his
responses. Therefore, when Δ’s are not independently derived, the
expected value of rperbis must be somewhat greater than zero.
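Both claims about chance responding, a mean of zero with externally derived difficulties and a positive bias with same-sample difficulties, can be checked with a small simulation. The sketch below is illustrative only: the group size, test length, chance level of .2 (five-choice items), seed, and the normally distributed "external" difficulties are all assumed values, and the same-sample difficulty is taken, for simplicity, to be the proportion of the group missing each item rather than a standard-score index.

```python
import random
from statistics import NormalDist, mean, pstdev

def personal_biserial(responses, difficulties):
    """Return rperbis, or None where it is undefined."""
    p = sum(responses) / len(responses)
    sd = pstdev(difficulties)
    if p in (0.0, 1.0) or sd == 0:
        return None
    mean_right = mean(d for r, d in zip(responses, difficulties) if r == 1)
    u = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(difficulties) - mean_right) / sd * p / u

def simulate(n_groups=300, n_persons=10, n_items=40, p_chance=0.2, seed=0):
    """Mean rperbis of pure guessers under external and same-sample difficulties."""
    rng = random.Random(seed)
    # Hypothetical difficulties from an outside sample, unrelated to these guessers.
    external = [rng.gauss(13, 3) for _ in range(n_items)]
    r_external, r_internal = [], []
    for _ in range(n_groups):
        matrix = [[int(rng.random() < p_chance) for _ in range(n_items)]
                  for _ in range(n_persons)]
        # Same-sample difficulty: proportion of this group missing each item.
        internal = [1 - sum(row[j] for row in matrix) / n_persons
                    for j in range(n_items)]
        for row in matrix:
            for diffs, out in ((external, r_external), (internal, r_internal)):
                r = personal_biserial(row, diffs)
                if r is not None:
                    out.append(r)
    return mean(r_external), mean(r_internal)

mean_external, mean_internal = simulate()
print(round(mean_external, 3), round(mean_internal, 3))
```

The small group size is chosen deliberately: each guesser contributes one tenth of every same-sample difficulty, which makes the forced agreement easy to see. With a larger group the bias shrinks, because any one person's share in each difficulty becomes smaller.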
A person’s level of ability need not determine his personal
biserial. At any score level, persons may vary on rperbis depending
upon the agreement of their response pattern with the item
difficulties. Thus, the value of rperbis will be high and positive for all
persons, regardless of score level, whose responses “conform” with
the group that determined the item difficulties (i.e., persons who
tend to get the easier items correct). It will be low, and possibly
even negative, for persons who, for some reason or other, show a
tendency to get an unusual number of the more difficult questions
correct and easier ones wrong. For most persons, there will be a
substantial agreement with an appropriate group. But what kind
of person, or what sort of behavior, would demonstrate a low
personal biserial?
Guessing behavior is probably the most obvious example of
behavior which tends to yield low biserials. As the proportion of items
on which a person guesses increases, the expected value of rperbis
decreases, since, as mentioned above, for pure chance responding
success on the item and item difficulty are virtually independent.²
This correspondence between amount of guessing and value of rperbis
opens an intriguing possibility. A person who guesses wildly at most
of the items on a test may achieve some positive score, but it is
not a good measure of his knowledge of the subject matter tested.
For such a person, rperbis may well be low. This situation contrasts
sharply with that of a person who achieves the same test score by
carefully considering the five alternatives on each item and
choosing the one he thinks best answers the question. He answers the
same number of questions, with the same ostensible success, but
he has succeeded only on the easy items. His rperbis is higher. It
would seem, therefore, that rperbis might be able to differentiate
between the behaviors of these two persons, whereas the total test
score cannot.

² Perfectly independent if the difficulties are externally derived.

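The contrast drawn above between the wild guesser and the careful examinee can be illustrated with two response patterns that earn the same score. The following Python sketch is hypothetical throughout: the twenty evenly spaced difficulty values and both patterns are invented for illustration, all items are treated as reached, and the population standard deviation is assumed.

```python
from statistics import NormalDist, mean, pstdev

def personal_biserial(responses, difficulties):
    """rperbis for one person; larger difficulty values mean harder items."""
    p = sum(responses) / len(responses)
    if p in (0.0, 1.0):
        raise ValueError("rperbis is undefined for all-right or all-wrong patterns")
    mean_right = mean(d for r, d in zip(responses, difficulties) if r == 1)
    u = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(difficulties) - mean_right) / pstdev(difficulties) * p / u

def pattern(correct_items, n_items=20):
    """Build a 0/1 response row from the set of item positions answered correctly."""
    return [1 if j in correct_items else 0 for j in range(n_items)]

# Hypothetical difficulties for twenty items, easiest first.
difficulties = [6 + 0.5 * j for j in range(20)]

careful = pattern({0, 1, 3, 4, 7, 10})    # six rights, concentrated on easy items
guesser = pattern({1, 5, 8, 12, 15, 19})  # six rights, scattered over the whole range

r_careful = personal_biserial(careful, difficulties)
r_guesser = personal_biserial(guesser, difficulties)
print(sum(careful), sum(guesser), round(r_careful, 2), round(r_guesser, 2))
```

The two total scores are identical, but the index is large and positive for the first pattern and close to zero for the second, which is the distinction the total score alone cannot make.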
The test-taking behavior of the so-called “creative” person may
also produce low biserials. There has been occasional adverse
criticism of multiple-choice testing in recent years (Hoffmann, 1962)
to the effect that objective tests discriminate against creative
persons. It might also be hypothesized that a creative person finds easy
items boring and, therefore, responds with carelessness, missing
several items of low difficulty. On the other hand, the most difficult
and unique types of items may present an interesting challenge to
this same creative person, who will respond with correct answers
on some of the most difficult items. If this hypothesis is true, the
creative person would be expected to attain somewhat lower
personal biserials.
Guessing and creative behavior are but two of possibly many
types of behavior which might be measured by the personal biserial
index. Guilford (1959) discusses a personality factor which he calls
“Restraint vs. Rhathymia.” This factor is described as a
“self-controlled, serious, conscientious disposition versus a
happy-go-lucky, carefree, and unconcerned disposition.” It would seem that
rperbis might very well be sensitive to this factor. If so, the prospect
of finding evidence of the predictive validity of rperbis (with college
success as the criterion) appears encouraging since, according to
Guilford, Goedinghaus (1954) found a correlation of .42 between
a score for restraint and college grade average.

Previous Research
The study of pattern analysis is not new in psychological testing,
as is revealed by Gaier and Lee (1953):


One of the more promising trends in current psychometric
research is an increasing concern with methods of evaluating
patterns of test scores and test responses ... our initial hypothesis
is that consideration of response configurations will yield more
fruitful results than the usual method of reporting merely the
total score for a test ... our basic assumption is that test data
may be so treated as to yield a higher degree of predictive
utility than that obtainable by the more traditional additive
methods. (p. 140)

The major development of various indices of pattern similarity
has occurred within the area of personality testing.³ Most of these
techniques provide indices for the comparison of the score patterns
of two individuals. Cronbach and Gleser (1953) introduce the
general model for the concept of pattern similarity between persons.
None of the literature on pattern analysis describes rperbis, or any
similar index which would be particularly appropriate for
comparing the individual to the group on aptitude or ability tests.
There does exist one unpublished study (Myers, 1963) which deals
directly with rperbis. Myers’ study was essentially an interested look
at what would happen if the usual item analysis were turned around
into a “person analysis.” Myers found that the distribution of
rperbis for persons looked like the usual rbis distribution for items. He
reports also that rperbis seemed to differentiate between two selected
groups: those who had had a course in trigonometry (low rperbis)
and those who had not. He explains that the trigonometry items
on the test were typically difficult. Thus, since those who had had
a trigonometry course tended to do well on the trigonometry items,
their behavior was atypical with respect to the entire group, and
therefore, they tended to get slightly lower personal biserials. In
addition, Myers asserts that:

... the person-group biserial correlation may be interpreted to
show the homogeneity of the person with the group, the extent
to which he tended to answer “easy” items correctly and
proportionately to answer “difficult” items incorrectly. Hence, one
might assume that if the current criticisms of objective testing
were correct, “creative” persons would be those who received
unusually low coefficients of correlation with the members of
the group who have taken the test.

³ Pearson (1928), Zubin (1937), duMas (1946), Cattell (1949), Stephenson
(1950), and Osgood and Suci (1952).

An index related to rperbis was independently developed by Jacobs
(1963) in a study of large score changes on the Scholastic Aptitude
Test. He took the K items of a test, ranked them according to
difficulty, divided them into quintile intervals, and assigned a weight
of “4” to the K/5 items in the most difficult interval, “3” to the
next most difficult, etc. For each person, the rights score on the K/5
items in each interval was obtained, and the weighted average Rp
of these scores computed:

where: Ri (i = 1, 2, ..., 5) is the number of rights in the i-th
quintile interval
This index is similar to rperbis in that it reflects a person’s agreement
with the group with respect to item difficulty; however, since R1
includes the easiest items, R2 the next-easiest, etc., the greater the
person’s Rp index, the greater the tendency for the items he got
right to be the more difficult ones. Thus, a high Rp would
correspond to a low rperbis. Jacobs studied the relationships between this
index and scores obtained by persons on repeated administrations
of parallel forms of the same test.⁴ There is some evidence that
the use of Rp along with a score on an initial form predicted the
score on the next form more accurately than did the first score
alone, indicating that Rp (and possibly similar measures) might
increase the reliability of scores.
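Jacobs’ index as described above can be sketched in Python. The text specifies only the quintile weights (4 for the most difficult interval down to 0 for the easiest) and the per-interval rights scores; the normalization used here, dividing the weighted sum by the person’s total number of rights, is an assumption made so that the result is an average weight per correct item. The function name and the data are hypothetical, and K is assumed to be divisible by 5.

```python
def jacobs_rp(responses, difficulties):
    """Weighted average of quintile rights scores (weights 0..4, hardest = 4).

    Assumption: the weighted sum is divided by the total number of rights.
    """
    k = len(responses)
    if k % 5 != 0:
        raise ValueError("this sketch assumes the number of items is divisible by 5")
    # Rank items from easiest to hardest (larger difficulty value = harder).
    order = sorted(range(k), key=lambda j: difficulties[j])
    size = k // 5
    # rights[0] is R1 (easiest quintile), rights[4] is R5 (hardest quintile).
    rights = [sum(responses[j] for j in order[q * size:(q + 1) * size])
              for q in range(5)]
    total = sum(rights)
    if total == 0:
        raise ValueError("undefined when no item is answered correctly")
    return sum(weight * r for weight, r in zip(range(5), rights)) / total

# Hypothetical difficulties for twenty items, easiest first.
difficulties = [6 + 0.5 * j for j in range(20)]
easy_rights = [1 if j in {0, 1, 3, 4, 7, 10} else 0 for j in range(20)]
scattered_rights = [1 if j in {1, 5, 8, 12, 15, 19} else 0 for j in range(20)]

rp_easy = jacobs_rp(easy_rights, difficulties)            # 4/6
rp_scattered = jacobs_rp(scattered_rights, difficulties)  # 13/6
print(round(rp_easy, 3), round(rp_scattered, 3))
```

Under this normalization the pattern concentrated on easy items gets the lower Rp, in line with the statement that a high Rp corresponds to a low rperbis.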
Donlon and Fischer conducted a small study to investigate the
descriptive characteristics of the personal biserial index and to find
evidence that the index might have some predictive value. A group
of 614 subjects who scored in the so-called “chance range” on a
special form of the Preliminary Scholastic Aptitude Verbal Test
was selected for this investigation (only scores < 22).⁵ The PSAT
Math scores for this group were also analyzed. Two personal
biserials were calculated for each person, one based on verbal
responses, the other on math responses. The Δ’s used in these
calculations were derived externally (obtained from the original item
analyses). The two biserial distributions were similar, fairly normal,
with mean verbal biserial = .50 and mean math biserial = .57. The
biserial standard deviations were .19 and .20, respectively. It can
be established from these distributions that zero-order and even
negative biserials do exist. However, most of these subjects had
personal biserials which were significantly different from chance,
despite the fact that their verbal scores were all in the chance range.

⁴ Actually, six separate tests were used: the SAT-Verbal and SAT-Math
given in March, 1960, May, 1960, and January, 1961.
⁵ Since random responses tend to produce zero-order biserial indices, it
was hypothesized that if negative and zero-order indices do exist, they would
most likely be found in the chance score range.

The two test scores and two biserial indices were then
intercorrelated. In the absence of any outside criterion such as college
grade point average, the math score was treated as a criterion with
the verbal score and verbal biserial as predictors. The correlation
between verbal and math scores was .28. However, the multiple
correlation between math score and the best linear combination of
the verbal score and verbal biserial was .35. This result is considered
encouraging evidence for the possible use of rperbis as a predictor of
college success.
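The size of this gain can be made concrete by comparing squared correlations, which give the proportion of math-score variance accounted for. The arithmetic below uses only the two coefficients reported above:

$$ r^2 = (.28)^2 = .0784, \qquad R^2 = (.35)^2 = .1225, \qquad R^2 - r^2 = .0441 $$

On these figures the verbal score alone accounts for about 8 percent of the variance in math scores, and the combination with the verbal biserial for about 12 percent. For reference, the general two-predictor multiple correlation is

$$ R^2_{y \cdot 12} = \frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2} $$

where the subscripts 1, 2, and y are labels assumed here for the verbal score, the verbal biserial, and the math score. It cannot be evaluated from the figures given, since the correlations involving the verbal biserial are not reported.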

Summary
The new index has been most useful thus far in demonstrating
that many chance range scores are not random scores. This result
has clear relevance for the use of these scores. For example, Sax
(1962) has criticized test publishers for reporting percentile equiv-
alents for scores in the chance range. In the light of the seemingly
large values of rperbis observed in the chance score range, such
percentiles may, in fact, be appropriate.
At the present time a more thorough empirical investigation is
being conducted, using rperbis as derived from Verbal and Math
SAT scores over the full score range. The major aims of the study
are (1) to analyze the descriptive characteristics of rperbis and its
distribution, (2) to estimate the reliability of rperbis, (3) to test the
hypothesis that a knowledge of rperbis significantly enhances the
accuracy of predicting freshman college index from SAT scores,
and (4) to suggest and analyze traits or characteristics of the
testee which correlate with the personal biserial index.


REFERENCES
Cattell, R. B. rp and Other Coefficients of Pattern Similarity.
Psychometrika, 1949, 14, 279-298.
Cronbach, L. J. and Gleser, G. C. Assessing Similarity between
Profiles. Psychological Bulletin, 1953, 50, 456-473.
duMas, F. M. A Quick Method for Analyzing the Similarity of
Profiles. Journal of Clinical Psychology, 1946, 2, 80-83.
Gaier, E. L. and Lee, M. C. Pattern Analysis: The Configural
Approach to Predictive Measurement. Psychological Bulletin, 1953,
50, 140-148.
Goedinghaus, C. H. A Study of the Relationship between
Temperament and Academic Achievement. Master’s thesis, Los Angeles,
University of Southern California, 1954.
Guilford, J. P. Personality. New York: McGraw-Hill, 1959.
Hoffmann, B. The Tyranny of Testing. New York: Crowell-Collier,
1962.
Jacobs, P. I. A Study of Large Score Changes on the Scholastic
Aptitude Test. Research Bulletin, Princeton, New Jersey,
Educational Testing Service, 1963. (Multilithed Report).
Myers, C. Item Analysis Procedures Applied to Persons.
Unpublished manuscript, 1963.
Osgood, C. E. and Suci, G. A Measure of Relation Determined by
both Mean Difference and Profile Information. Psychological
Bulletin, 1952, 49, 251-262.
Pearson, K. On the Coefficient of Racial Likeness. Biometrika, 1928,
18, 105-117.
Sax, G. Theoretically Derived Chance Scores and Their Normative
Equivalents on a Selected Number of Standardized Tests.
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1962, 22, 573-576.
Stephenson, W. A Statistical Approach to Typology: The Study
of Trait-Universes. Journal of Clinical Psychology, 1950, 6,
26-38.
Zubin, J. The Determination of Response Patterns in Personality
Adjustment Inventories. Journal of Educational Psychology,
1937, 28, 401-413.
