Classical Test Theory in Historical Perspective

Ross E. Traub
The Ontario Institute for Studies in Education of the University of Toronto

What were the historic origins of classical test theory? What have been major milestones in its development?

Ross E. Traub is a Professor at the Ontario Institute for Studies in Education, OISE, University of Toronto, 252 Bloor St. W., Toronto, Ontario, Canada M5S 1V6. His specializations are test theory and educational assessment.

Classical test theory is an emanation of the early 20th century. It was born of a ferment, the ingredients of which included three remarkable achievements of the previous 150 years: a recognition of the presence of errors in measurements, a conception of that error as a random variable, and a conception of correlation and how to index it. Then in 1904 Charles Spearman showed us how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction. Spearman's demonstration marked the beginning, as I see it, of classical test theory. Subsequently, the framework of classical theory was elaborated and refined by Spearman, George Udny Yule, Truman Lee Kelley, and others over the quarter century or so following 1904. Another milestone was laid in 1937 with the publication of the Kuder-Richardson formulas. This event was followed, shortly thereafter, by the idea of lower bounds to reliability and the framework for enhanced understanding found in the work of Louis Guttman. The culmination of classical test theory was realized in the systematic treatment it received from Melvin Novick (1966; Lord & Novick, 1968).

What follows is an attempt to add a little flesh to the bare bones of the foregoing outline. Before doing so, however, I should be clear about what I take classical test theory to be.

Classical Theory Circumscribed

Classical¹ test theory is founded on the proposition that measurement error, a random latent variable, is a component of the observed score random variable. The latter variable is realized in the measurements that may be taken of a characteristic of the persons² in some more-or-less well-defined population. Add to this proposition (a) a result that is true by construction, namely, that the error variable has zero covariance with that other latent component of observed measurements, the true score variable (Lord & Novick, 1968), and (b) a crucial assumption that the error component of a measure is independent of the error components of other measures, either of the same characteristic or of different characteristics, and one is able to prove the basic theorems of classical test theory. Essential adjuncts to the theory are the ancillary assumptions invoked and experimental procedures followed in estimating coefficients of reliability and standard errors of measurement, two of classical test theory's basic results.

The Zeitgeist

A prominent feature of the zeitgeist at the turn of the 20th century was the notion of errors in scientific observations. As early as the 17th century, apparently, "Galileo had . . . reasoned that errors of observation are distributed symmetrically and tend to cluster around their true value" (Read, 1985, p. 348). "By the eighteenth century," according to Churchill Eisenhart (1983a),

    the practice of taking the arithmetic mean [of a set of observations] "for truth" had become fairly widespread. Nonetheless, Thomas Simpson (1710-1761) wrote to the president of the Royal Society in March 1755 that some persons of considerable note maintained that one single observation, taken with due care, was as much to be relied on as the mean of a great number. As this appeared to him to be a matter of much importance, he said that he had a strong inclination to ascertain whether by the application of mathematical principles the utility and advantage of the practice might be demonstrated. Measurements and functions of measurements, such as their arithmetic means, are not amenable to mathematical theory, however, as long as individual measurements are regarded as unique entities, that is as fixed numbers $y_1, y_2, \ldots$. A mathematical theory of measurements, and of functions of measurements, is possible only when particular measurements $y_1, y_2, \ldots$ are regarded as instances characteristic of hypothetical measurements $Y_1, Y_2, \ldots$ that might have been, or might be, yielded by the same measurement process under the same circumstances. Consequently, Simpson hypothesized that the respective chances of the different errors to
Educational Measurement : Issues and Practice

    which any single observation is subject could be expressed as a discrete probability distribution of error. . . ." (p. 531)

By the beginning of the 19th century, astronomers had come to recognize errors in observations as an area worthy of research. Carl Friedrich Gauss derived the normal distribution while trying to prove that the mean of many observations of an unknown quantity, such as a parameter of the orbit of a planet, is the most likely value of that quantity (Eisenhart, 1983b; Read, 1985).³ In other work (for example, that of Friedrich Wilhelm Bessel⁴), errors in astronomical observations were thought to be composed of many independent, elementary parts. The distribution of the sum of these errors was shown to be normal under various assumptions, hence the references, commonplace in the 19th century, to the law of errors in observations (Eisenhart, 1983b; Read, 1985). Later in the 19th century, however, as noted by O. B. Sheynin (1968),

    [the normal distribution,] as far as the theory of errors is concerned, had been almost forgotten and even the concept of random errors of observation . . . [had become] divorced from the concept of random quantities in the theory of probability. . . . A qualifying remark should be added: . . . in the second half of the 19th century a definition of a random quantity as "dependent on chance" and possessing a certain law of distribution had become . . . natural. . . . As to random errors, these were usually taken to be errors with certain probabilistic properties, their specific distribution . . . being not so important. . . . It seems that Vassiliev (1885, Theory of Probabilities, in Russian, a lithographic edition, Kazan, p. 133) was the first who definitely held that random errors of observation are to be ranked among random quantities. (pp. 236-237)

We might expect the notion of correlation, that other essential feature of Spearman's zeitgeist, to have emerged from a study of the distribution of joint errors. Indeed, several individuals separately derived expressions resembling the "ordinates of the probability surface of normal correlation for two variables" (Walker, 1929, p. 93). These individuals included Robert Adrain, an Irishman who emigrated to the U.S., taught at Rutgers, Columbia College, and the University of Pennsylvania, and published such an equation in 1808; Pierre-Simon, Marquis de Laplace, who published such an equation in 1810; Giovanni Antonio Amedeo Plana, professor of astronomy at Turin, who published an equation in 1812; Gauss, in 1823; and Auguste Bravais, professor of astronomy at Lyons and professor of physics at Paris, in 1846. But none of these scholars seems to have interpreted the cross-product term in the exponents of their expressions as an indicator of covariation or correlation. This enormous idea was left to Francis Galton. Although he did not actually derive the formula for the bivariate normal distribution (it can be argued that this was done by a Cambridge mathematician named J. D. Hamilton Dickson, to whom Galton gave the relevant concepts⁵), Galton did originate the basic ideas we associate with the bivariate distribution. The notion of the scatterplot, on which is displayed the two regression lines, can be found in Galton's 1885 presidential address to the Anthropological Section of the British Association, subsequently an article published in 1886 in the Journal of the Anthropological Institute under the title "Regression Towards Mediocrity in Hereditary Stature."⁶ The term correlation was first used in a technical sense 2 years later in an article entitled "Co-Relations and Their Measurement" (Galton, 1888). Galton, however, used the symbol r to refer to reversion or regression. Credit for calling r the coefficient of correlation belongs to Francis Y. Edgeworth, and dates from 1892 (Boring, 1957).⁷

Karl Pearson contributed important mathematical work in support of his friend Galton's coefficient. For example, Pearson (1896) proved that the best value of r is the covariance divided by the product of the standard deviations.

Defining Achievements in the History of Classical Test Theory

We find ourselves at the very beginning of the 20th century, a time when the idea of errors in measurements had become widely accepted. Also, the coefficient of correlation was well established as an important statistical concept, although Pearson's product moment expression had not yet gained acceptance as the computational formula of choice. In addition, we encounter around this time the first scientific publications in which coefficients of correlation were the principal results of the research. For example, Karl Pearson, who, like Galton, had an abiding interest in eugenics, investigated the correlation between characteristics of pairs of brothers (Pearson, 1904; Pearson & Lee, 1903). Pearson found, to his astonishment, that the magnitude of this correlation was about 0.5, regardless of characteristic, physical or psychical. So this was the time, and this was the context for the first of the defining achievements in the history of classical test theory: the correction of a correlation coefficient for attenuation due to measurement error. In what follows, I describe this and four other milestones marking the development of classical test theory.

Spearman's correction for attenuation. Spearman was a psychologist. At the turn of the 20th century, he was just beginning his study of intelligence. In the course of this early work, he had discovered that independent measurements of a psychical characteristic of a person (for example, mental ability) vary in a random fashion (Spearman, 1904, p. 89, called it "accidental") from one measurement trial to another. In other words, the coefficient of correlation between such independent measurements for a group of persons is not perfect. Spearman then had the insight that the absolute value of the coefficient of correlation between the measurements for any pair of variables must be smaller when the measurements for either or both variables are influenced by accidental variation than it would otherwise be. Eliminating this attenuating effect of accidental error was just one of the matters, albeit in retrospect the most important of those matters, addressed by Spearman in his 1904 article, entitled "The Proof and Measurement of
Winter 1997 9
Association Between Two Things" and published in the American Journal of Psychology.

Spearman's article angered Karl Pearson. The reason was that Spearman had had the temerity to challenge Pearson's conclusion that the coefficient of correlation between pairs of brothers was 0.5 for psychic characteristics, just as it had been found to be for physical characteristics. Using reliability estimates from his own work, Spearman estimated the corrected correlation coefficient for mental ability to be 0.8.

Pearson was co-editor of a newly founded journal, Biometrika. He inserted the following petulant note in his 1904 article in that journal on the correlation between selected characteristics of sibling pairs.

    I hardly know whether it is needful to refer here to a recent article by Mr. C. Spearman . . . criticising my results for the similarity of inheritance in the physical and psychical characters. Without waiting to read my paper in full he seems to think I must have disregarded "home influences" and the personal equation of the school teachers. He proceeded to "correct" my results for the error of what he calls dilation on the double basis (i) of a formula invented by himself, but given without proof, and (ii) of his own experience that two observers' observations or measurements of the same series of two characters were such that the correlation between their determinations was .58 in one case and .22 in the other. The formula invented by Mr. Spearman for his so-called "dilation" is clearly wrong, for applied to perfectly definite cases, it gives values greater than unity for the correlation coefficient. As to his second basis, all I can say is that if the correlation between two observers of the same thing in Mr. Spearman's case can be as low as .22, he must have employed the most incompetent observers, or given them the most imperfect instructions, or chosen a character [more] suitable for random guessing than observation in the scientific sense. Mr. Spearman says that "it is difficult to avoid the conclusion that the remarkable coincidence announced between physical and mental heredity can be [nothing] more than mere accidental coincidence" (p. 98). I think I may safely leave him to calculate the odds for or against this most remarkable "mere accidental coincidence". . . . Perhaps the best thing at present would be for Mr. Spearman to write a paper giving algebraic proofs of all the formulae he has used; and if he did not discover their erroneous nature in the process, he would at least provide tangible material for definite criticism, which it is difficult to apply to mere unproven assertions. (Pearson, 1904, p. 160)

Stung into responding, Spearman published a proof of the correction for attenuation in 1907, again using the American Journal of Psychology. Subsequently, a proof in a form often encountered in present day textbooks on educational and psychological measurement was given by Spearman (1910) and also by William Brown (1910), both of whom ascribed it to Yule. This derivation stressed that the error components of all measures should be independent, and hence uncorrelated. Spearman's earlier "proof" had not emphasized this restriction (Walker, 1929).

Pearson was not the only critic of the correction for attenuation. In particular, Brown (1910) challenged it on the grounds that measurement error is not really random (accidental). Brown proposed a way of testing the equality of the covariances $S(x_1 y_1)$ and $S(x_2 y_2)$, where $x_1$, $x_2$, $y_1$, and $y_2$ are observed-score variables. Assuming $x_1$ is parallel to $x_2$ and $y_1$ is parallel to $y_2$, then both these covariances, according to classical theory, should equal $S(xy)$, where $x$ and $y$ are the true-score variables associated with $\{x_1, x_2\}$ and $\{y_1, y_2\}$, respectively. Brown (1913) reported results based on an application of this proposal, results he claimed did show that measurement errors are not accidental.

Spearman was aware of Brown's criticism, among others, and presciently responded as follows in his British Journal of Psychology article of 1910:

1. Spearman reiterated his position that of the two kinds of errors in measurement, "regular" and "accidental" (p. 273), the correction formula applies only to the latter. As regards Brown's contention that accidental errors can be correlated, Spearman observed that, if errors were indeed found to be linked (as might be the case, e.g., if a person were ill at the time of taking both tests x and y), then the investigator should employ a better experimental design.
2. It had been suggested, according to Spearman (1910, p. 272), that investigators should make measurements so "efficient" that no correction would be needed. Spearman wondered how an investigator would know his measurements were efficient enough except by using the correction formula.
3. To Pearson's criticism that the correction could produce coefficients greater than one, Spearman countered that this might occur due to sampling error. He recommended (p. 277) the coefficient be set to one whenever this happened.

The Spearman-Brown formula. Coefficients of reliability are needed in order to apply the correction for attenuation. In his 1904 article, Spearman had assumed the availability of two independent measurements of both the characteristics for which a corrected correlation coefficient is desired. The breakthrough, apparently achieved independently by Spearman and Brown, to a formula by which to calculate a reliability coefficient from the two halves of just one composite measure was published in adjacent articles in a 1910 issue of The British Journal of Psychology. Brown's proof of the formula is the more elegant and bears the stamp of Yule.

During the second and third decades of the 20th century, numerous experiments were conducted to test predictions of the Spearman-Brown formula. A thoughtful review of publications emanating from this preoccupation of early psychometricians can be found in the notes on test theory prepared by Louis Leon Thurstone (1932).

The index of reliability and other results. It is a worthwhile experience, though in the late 20th cen-

tury a humbling one, to read the articles on test theory that were published between 1910 and 1925 by Spearman, Brown, and Kelley, among others. These documents contain a great many of the basic results of classical test theory. Kelley's 1923 text, Statistical Method, included a compilation of these results in a section on reliability theory. Kelley also stated in this text the definition of reliability that he championed throughout his long career: the coefficient of correlation between "comparable tests" (p. 203). Kelley laid down three conditions for test comparability:

    (1) sufficient fore-exercise should be provided to establish an attitude or set, thus lessening the likelihood of the second test being different from the first, due to a new level of familiarity with the mechanical features, etc.; (2) the elements of the first test should be as similar in difficulty and type to those in the second, pair by pair, as possible; but (3) should not be so identical in word or form as to commonly lead to a memory transfer of correlation between errors. (p. 203)

Kelley was critical of Brown (1910) for using the term reliability coefficient to refer to the correlation between scores on repeated administrations of the same test. Brown's definition fails to meet the third of Kelley's conditions. (Kelley's conclusion: "Accordingly the repetition of a test to secure a reliability coefficient is to be deprecated"; Kelley, 1923, p. 203.)

An important result, used by Spearman in his 1910 proof of the prophecy formula, was the expression for the correlation between two composite measures in terms of the variances and covariances of the components. From this result, Abelson (1911, p. 314) derived the formula for what came to be known several years later as the index of reliability.⁸ It seems this name was coined quite by accident. Kelley independently derived the formula for the index in 1916 and then in using it wrote that "the extent to which the grade determined by means of this test of forty words would correlate with the true spelling ability of the individual is probably an even more significant index of reliability" (p. 74). Walker reported that "[w]hen Munroe published the formula in his Introduction to the Theory of Educational Measurements (1923) he . . . [set the phrase in capital letters], ascribed it to Kelley, and established Index of Reliability as a definite term" (1929, p. 118).

The Kuder-Richardson formulas. Writing the coefficient of correlation between two composite measures in terms of the variances and covariances of the measures' components made it possible to study the effects of the characteristics of item scores on the characteristics of total test scores. By 1936, Marion Richardson, in an article in the first volume of Psychometrika, had demonstrated several propositions for tests composed of discrete, dichotomously scored items. Invoking the assumption that all test items have equal variances (Richardson noted that this assumption is close to true for a wide range of item difficulty values), he showed that "the rejection of items with low item-test correlations raises the reliability of a test, if the number of items is held constant" (p. 72). Richardson also showed that "for tests of homogeneous [item] difficulty and constant length, the true variance is proportional to the average item intercorrelation" (p. 75).

Given his work on the relationship between item and total test scores, it is perhaps not surprising to find Richardson a co-author, with Frederic Kuder, of the blockbuster article of 1937, the one containing the famous Formulas 20 and 21. (In a footnote to the article, Kuder and Richardson reported they had independently arrived at the results contained in the article. They apparently discovered this fact quite by accident, at which time they decided to publish the results jointly.) Kuder and Richardson began their article with a critique of existing approaches to the estimation of reliability. Of the test-retest method, they said,

    [Using] the same form gives, in general, estimates that are too high because of material remembered on the second application of the test. This memory factor cannot be eliminated by increasing the length of time between the two applications, because of variable growth in the function tested within the population of individuals. These difficulties are so serious that the method is rarely used. (p. 151)

Of the split-half approach, Kuder and Richardson concluded that "although the authors have made no actual count, it seems safe to say that most technicians use the split-half method of estimating reliability" (p. 151). They then observed that the number of splits possible for an n-item test (n being an even number) is

$$\frac{n!}{2\left[\left(\tfrac{n}{2}\right)!\right]^{2}},$$

a number so large for any test of reasonable length that the reliability estimates from all possible split-halves of the test are very likely to vary considerably. So the issue for Kuder and Richardson was to devise a procedure by which the information in the item scores of a test could be used to produce a single estimate of reliability.

In deriving Formulas 20 and 21, Kuder and Richardson began by writing the following expression, their Equation 3, for the correlation of a test composed of n dichotomously scored items with a hypothetical second test of n such items:

$$r_{tt} = \frac{\sigma_t^2 - \sum_{i=1}^{n} p_i q_i + \sum_{i=1}^{n} r_{ii}\, p_i q_i}{\sigma_t^2} \qquad (3)$$

where $\sigma_t^2$ is the test variance, $\sum_{i=1}^{n} p_i q_i$ is the sum of the item variances, and $\sum_{i=1}^{n} r_{ii}\, p_i q_i$ is the sum of the item true score variances. The problem Kuder and Richardson set for themselves was that of estimating $r_{ii}$, the item reliability coefficient. They did so under various assumptions. Those leading to Equation 20 were that the matrix of interitem correlation coefficients has Rank 1 and that all interitem correlation coefficients are equal. Setting $r_{ii}$ equal to $\bar{r}_{ij}$, the interitem correlation coefficient, Kuder and Richardson derived the result:

$$\bar{r}_{ij} = \frac{\sigma_t^2 - \sum p_i q_i}{\left(\sum \sqrt{p_i q_i}\right)^{2} - \sum p_i q_i}$$
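The equal-correlation step can be checked numerically. What follows is a minimal sketch, not Kuder and Richardson's own procedure: the function names are hypothetical, the item difficulties and test variances are made-up inputs, and the simplified form rests on two assumptions introduced here for illustration, namely that $n\,\overline{pq}$ stands for the sum of the item variances and that the item variances are treated as equal, so that $(\sum\sqrt{p_i q_i})^2$ becomes $n^2\,\overline{pq}$. The code computes the common interitem correlation both ways and substitutes it for $r_{ii}$ in Equation 3.

```python
from math import sqrt


def common_interitem_correlation(p, test_variance):
    """Return (exact, simplified) estimates of the common inter-item correlation.

    p             -- item difficulties (proportion passing each item)
    test_variance -- variance of total test scores (sigma_t squared)
    """
    item_var = [pi * (1.0 - pi) for pi in p]      # p_i * q_i for each item
    sum_var = sum(item_var)                        # sum of item variances
    n = len(p)
    numerator = test_variance - sum_var
    # Form with (sum of item standard deviations)^2 in the denominator.
    exact = numerator / (sum(sqrt(v) for v in item_var) ** 2 - sum_var)
    # Simplified form: (n - 1) * n * mean(pq), written as (n - 1) * sum(pq).
    simplified = numerator / ((n - 1) * sum_var)
    return exact, simplified


def reliability_from_equation_3(p, test_variance, r_ii):
    """Evaluate Equation 3 with one common item reliability r_ii."""
    sum_var = sum(pi * (1.0 - pi) for pi in p)
    return (test_variance - sum_var + r_ii * sum_var) / test_variance


if __name__ == "__main__":
    # Four items of equal difficulty; test variance chosen for illustration.
    p = [0.5, 0.5, 0.5, 0.5]
    exact, simplified = common_interitem_correlation(p, 2.5)
    print(exact, simplified)
    print(reliability_from_equation_3(p, 2.5, simplified))
```

With equal item variances the two forms coincide. With unequal variances the simplified form can be no larger in absolute value, because $(\sum\sqrt{p_i q_i})^2 \le n\sum p_i q_i$.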
$$\bar{r}_{ij} = \frac{\sigma_t^2 - n\,\overline{pq}}{(n-1)\,n\,\overline{pq}},$$

which, when substituted in Equation 3 and simplified, gives

$$r_{tt} = \frac{n}{n-1} \cdot \frac{\sigma_t^2 - n\,\overline{pq}}{\sigma_t^2}.$$

Formula 21 was derived under the additional assumption of equal item difficulty indices.

Kelley criticized the KR formulas in a 1942 article in Psychometrika on the grounds that they are valid only for tests "with unity of purpose," that is, for tests composed of items that share just one factor in common. Reiterating his long-standing advocacy of the parallel-test design for reliability estimation, Kelley went on to say "we conclude that a belief that two or more measures of a mental function exist is prerequisite to the concept of reliability, and, further, not only that they exist but that they are available before a measure of reliability is possible" (p. 76). As a further challenge to the KR formulation, Kelley demonstrated that a test with zero interitem covariances could produce a reasonable correlation with a "similar" (p. 81) test, even though its KR20 index would be zero.

It is obvious, a half-century later, that Kelley's view did not prevail. The KR formulations quickly received widespread acceptance, abetted in part by the publication of an article by Paul Dressel (1940). Dressel showed that, when all the items of a test intercorrelate perfectly and all item variances are equal, KR20 attains the value of 1; otherwise, it is less. He further demonstrated that KR20 can take values less than 0. Dressel also increased the applicability of KR20 by deriving a version for tests to which the correction for guessing is applied.

Lower bounds to reliability. Perhaps it was Dressel's demonstration that KR20 can be negative that marks the beginning of work on lower bounds to reliability. Alternatively, a case might be made for Philip Rulon's (1939) article in the Harvard Educational Review. Rulon introduced the notion, not the name, of essentially tau-equivalent test halves. The halves of a test are essentially tau equivalent if, for any examinee in the test population, the true score on one test half differs from the true score on the other half by a constant, which is the same for all examinees. Also, under the essentially tau-equivalent assumption, as distinct from the parallel test-halves assumption, the error variance for an examinee on one test half is not necessarily equal to the error variance for the examinee on the other half (see Lord & Novick, 1968, p. 50).

Whatever the wellspring of work on lower bounds, Louis Guttman (1945) published the first article, as far as I know, in which lower bounds to reliability are explicitly derived. But this Psychometrika article is important for a reason other than the lower bounds it contains. Guttman also offered a theoretical framework within which to treat, if not actually to reconcile, the antagonistic views of Brown and Kelley regarding how the reliability coefficient should be estimated. Guttman did this by first identifying three sources of variation in test responses: persons, items, and trials. Guttman defined error variance exclusively in terms of variation in responses over the universe of trials. This definition leads to a proof of total test variance as the sum of true-score and error variance, without the need to assume zero covariance of true and error scores. (The latter assumption lies at the heart of Yule's proof of the correction for attenuation.) Defining the reliability coefficient "as the complement of the ratio of error variance to total [test] variance" (p. 257), Guttman then went on to demonstrate (pp. 267-268) that the reliability coefficient can be estimated as the correlation between the test scores for a group of examinees on two "experimentally independent" (p. 264) trials of a test. Given the results from only one trial, however, Guttman showed that the best result possible is an estimate of a lower bound to reliability. This result was shown to rest on the assumption

    that the errors of observation are independent between items and between persons over the universe of trials. In the conventional [Yule] approach, independence is taken over persons rather than trials, and the problem of observability from a single trial is not explicitly analyzed. (p. 257)

Guttman derived six lower bounds to reliability, of which three are noted here. One of these is a generalization of KR20 to tests composed of items scored on any scale, dichotomous or otherwise. He labeled this index $\lambda_3$. Subsequently, $\lambda_3$ became better known as coefficient alpha (Cronbach, 1951). Two of the other lower bounds to reliability were labeled, not surprisingly, $\lambda_1$ and $\lambda_2$, with the former typically smaller than alpha, the latter typically larger. Research on lower bounds to reliability constitutes a small but still active line of psychometric research.

Formalization

Various attempts to formalize classical test theory have been made over the years. Already mentioned is the section on reliability in Kelley's (1923) Statistical Method. Another early work is that by Thurstone (1932). The next presentation of note is Theory of Mental Tests by Harold Gulliksen (1950). The culmination of such efforts as these was realized in the work of Melvin Novick (1966) and in the early chapters of Statistical Theories of Mental Test Scores (1968) by Frederic Lord and Novick.

Concluding Remarks

Several important topics from the realm of classical test theory have not been covered in this brief retrospective. Among them are the effects of range restriction on the magnitude of the reliability coefficient, the application of analysis of variance to the study of measurement error and reliability (this before the advent of generalizability theory), and the modeling of test data generally addressed under the topic of congeneric models. Clearly, there is more to classical test theory, and its history, than the work reviewed in this article.

Lest we leave this topic thinking classical test theory an unduly important area of research in the history of empirical research in psychology and education, we will find it salutary to reflect on the following

remarks from the preface of the 1932 edition of Thurstone's notes on test theory:

    Since this volume is devoted to the validity and reliability concepts in their applications to mental tests and related correlational procedures, it is only fair to say that personally I do not believe that these correlational methods and particularly the reliability formulae have been responsible for much that can be called fundamental, important or significant in psychology. On the contrary, the correlational methods have probably stifled scientific imagination as often as they have been of service. As tools in their proper place they are useful but as the central theme of mental measurement they are rather sterile.

Notes

This article is a revised version of a paper presented at the 1996 Annual Meeting of the National Council on Measurement in Education (Session B1), New York.

¹ Something classical, according to Webster's New Collegiate Dictionary, is accepted as standard and authoritative, as distinguished from novel or experimental: viz. classical physics.

² Classical test theory applies to the measurements of a characteristic of the members of any collection of objects whatsoever. In particular, it is not restricted, as the wording here might imply, to measurements of human characteristics.

time of his firing they were almost a second later than Maskelyne's. This matter might not have attracted much interest had not Maskelyne recorded it in Astronomical Observations at Greenwich (1799). Seventeen years later, in a history of Greenwich Observatory published in German, Kinnebrook's tribulation came to the attention of Bessel, an astronomer at Konigsberg. Bessel conducted a series of studies culminating in the notion of the personal equation, the name given the systematic difference in recording times found to characterize the stellar transits of almost any pair of astronomers. From the perspective of reliability theory, the personal equation itself was not a highly significant discovery, for it refers to systematic error, not the random error treated by reliability theory. What interests us, instead, is Bessel's finding that the personal equation itself is a variable quantity, one that differs from one pair of astronomers to another. This variation suggests random or accidental errors in observations, errors that, if neither controllable nor amenable to elimination, at the least demand an explanation grounded in a theory or a scientific law.

⁵ Karl Pearson (1930) noted that Dickson did not actually write down the equation for the bivariate normal distribution but stopped one step short of doing so.

⁶ According to Walker (1929, pp. 98-101), H. P. Bowditch published a two-way table of height and age for 24,000 Boston school boys in 1877 in a manuscript entitled Growth of Children. Although he described one of the regression lines of a bivariate distribu-

    Nature presents us with a group of objects of every kind, it is using a rather bold metaphor to speak in this case also of a law of error, as if she had been aiming at something all the time, and had like the rest of us missed her mark more or less in every instance. (p. 42)

⁸ Walker (1929, p. 117) suggested the derivation of the formula may have been given to Abelson by Spearman. Abelson's article is, however, unclear on this point. The article contains two appendices. The first is described as having been "kindly supplied by Prof. Spearman" (p. 312). But the second appendix, which contains the index of reliability, bears no attribution to Spearman.

References

Abelson, A. R. (1911). The measurement of mental ability of "backward children." British Journal of Psychology, 4, 268-314.
Boring, E. G. (1957). A history of experimental psychology (2nd ed.). New York: Appleton-Century-Crofts.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Brown, W. (1913). The effects of "observational errors" and other factors upon correlation coefficients in psychology. British Journal of Psychology, 6, 223-235.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Dressel, P. L. (1940). Some remarks on
s Although the normal distribution is the Kuder-Richardson reliability coef-
tion in his work, Bowditch neither pro-
commonly referred to as the Gaussian duced both lines nor conceived of a ficient . Psychometrika, 5, 305-310 .
distribution, priority in its formulation measure of the relationship between the Eisenhart, C . (1983a) . Laws of error I :
rightfully belongs to Abraham de variables. Development of the concept. In S .
Moivre, who obtained it in 1733 (Read, Another of Galton's ideas was that Kotz & N . L . Johnson (Eds .), Encyclo-
1985) . the (normal) law of errors in observa- pedia of statistical sciences (Vol . 4,
4 In the 18th and early 19th centuries, tions might describe the frequency dis- pp . 530-547) . Toronto : Wiley
astronomers were required to make dif- tributions of measurements of such Eisenhart, C . (1983b) . Laws of error II :
ficult judgments, based on a combina- human characteristics as mental ability. The Gaussian distribution . In S .
tion of auditory and visual cues, in order This idea was accepted readily enough Kotz & N. L . Johnson (Eds .), Encyclo-
to time stellar transits . A well-known by John Venn (1888), who wrote as fol- pedia of statistical sciences (Vol. 4,
story from the history of science (Bor- lows : "That our mental qualities, if they pp . 547-562) . Toronto : Wiley.
ing, 1957) is the firing in 1796 of could be submitted to accurate mea- Galton, F. (1888) . Co-relations and
Kinnebrook, an assistant to Maskelyne, surement, would be found to follow the their measurement, chiefly from an-
the Astronomer Royal of England . usual Law of Error may be assumed thropometric data . Proceedings of the
Kinnebrook was relieved of his job for without much hesitation" (p. 49). Venn Royal Society, 45, 135-145 .
giving inaccurate readings of stellar expressed skepticism, however, over the Gulliksen, H . (1950) . Theory of mental
transits . Although he had provided idea that a normal distribution of men- tests . New York : Wiley.
readings in agreement with Maske- tal measurements is yet another mani- Guttman, L. (1945). A basis for analyz-
festation of the law of error :
lyne's 18 months prior to his dismissal, ing test-retest reliability. Psychome-
the hapless Kinnebrook by August 1795 trika, 10, 255-282 .
When we perform an operation our-
had begun to give times that differed Kelley, T. L. (1916). A simplified method
selves with a clear consciousness of
from Maskelyne's by one-half second. what we are aiming at, we may quite of using scaled data for purposes of
Subsequently, Kinnebrook's readings correctly speak of every deviation testing. School and Society, 4, 34-37,
grew even more discrepant, so by the from this as being an error; but when 71-75 .
Winter 1997 13
Kelley, T. L. (1923). Statistical method. New York: Macmillan.

Kelley, T. L. (1942). The reliability coefficient. Psychometrika, 7, 75-83.

Kuder, G. F., & Richardson, M. W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151-160.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philosophical Transactions, A, 187, 252-318.

Pearson, K. (1904). On the laws of inheritance in man. II. On the inheritance of the mental and moral characters in man, and its comparison with the inheritance of physical characters. Biometrika, 3, 131-190.

Pearson, K. (1930). The life, letters, and labours of Francis Galton. Vol. IIIA. Correlation, personal identification and eugenics. Cambridge: The University Press.

Pearson, K., & Lee, A. (1903). On the laws of inheritance in man. I. Inheritance of physical characters. Biometrika, 2, 357-462.

Read, C. B. (1985). Normal distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 6, pp. 347-359). Toronto: Wiley.

Richardson, M. W. (1936). Notes on the rationale of item analysis. Psychometrika, 1(1), 69-76.

Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.

Sheynin, O. B. (1968). On the early history of the law of large numbers. Biometrika, 55, 459-467.

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 160-169.

Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.

Thurstone, L. L. (1932). The reliability and validity of tests. Ann Arbor, MI: N. p.

Venn, J. (1888). The logic of chance (3rd ed.). London: Macmillan.

Walker, H. M. (1929). Studies in the history of statistical method. Baltimore: Williams & Wilkins.
A Perspective on the History of Generalizability Theory
Robert L. Brennan
University of Iowa

What psychometric and scientific perspectives influenced the development of G theory? What practical testing problems gave impetus to its adoption? What work remains to be done?

Overviews of various parts of the history of generalizability (G) theory are provided elsewhere. An indispensable starting point is the preface and parts of the first chapter of Cronbach, Gleser, Nanda, and Rajaratnam (1972) entitled The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. The Cronbach et al. monograph is still the most definitive treatment of G theory. Shavelson and Webb (1981) review the G theory literature from 1973-1980, and Shavelson, Webb, and Rowley (1989) cover additional contributions in the 1980s. A very brief historical overview is provided by Brennan (1983, 1992a, pp. 1-2). In addition, Cronbach (1976, 1989, 1991) offers numerous perspectives on G theory and its history. Cronbach (1991) is particularly rich with first-person reflections.

This historical overview is not intended to repeat everything already covered in published reviews, although a summary is provided. Parts of this article are based largely on my personal experience with G theory. Consequently, this article provides a somewhat idiosyncratic perspective on the history of G theory and what I perceive as unfinished work for the theory. Almost certainly, other reviewers would see the landscape somewhat differently.

Theory Development and Enabling Work

In discussing the genesis of G theory, Cronbach (1991) states:

In 1957 I obtained funds from the National Institute of Mental Health to produce, with Gleser's

Robert L. Brennan is Lindquist Professor of Educational Measurement and Director of the Iowa Testing Programs, University of Iowa, 334A Lindquist Center, Iowa City, IA 52242. His specializations are generalizability theory, equating, and scaling.
14 Educational Measurement: Issues and Practice
