Construct-Irrelevant Variance in High-Stakes Testing
Thomas M. Haladyna, Arizona State University West
Steven M. Downing, University of Illinois at Chicago
Spring 2004 17
refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9). Validation is an investigative process by which we (a) create a plausible argument regarding a desired interpretation or use of test scores, (b) collect and organize validity evidence bearing on this argument, and (c) evaluate the argument and the evidence concerning the validity of the interpretation. Kane (1992) described the process for establishing a plausible argument and criteria we might use to evaluate this argument. Kane (2002) also described some nuances of descriptive and policy-based interpretations in high-stakes student achievement testing. He argued that current validation does not go far enough to justify the full array of interpretations and uses.

As the stakes for testing increase, the need for validity evidence also increases (Linn, 2002). The quest for validity evidence can be very complex. This evidence will likely consist of documentation of procedures in test development, administration, scoring and reporting, and empirical studies (Downing & Haladyna, 1996; Haladyna, 2002a). The Standards (AERA, APA, & NCME, 1999) present categories of validity evidence that include content, cognitive processes, internal structure of item responses or ratings of performance, reliability, relations of test scores to other variables, and consequences. Essays by Messick (1995a, 1995b) also provide suggestions for types of validity evidence and their importance.

Although we may view validation as a process for strengthening an argument about the validity of a particular interpretation or test score use, Cronbach (1988) noted that validation should also include the testing of alternate hypotheses concerning the validity of an interpretation. Crooks, Kane, and Cohen (1996) provided a comprehensive model for the study of threats to validity. They identified eight linked inferences and argued that a weakness in any link weakens the entire chain. As we acquire and evaluate validity evidence, we may conclude that some evidence is weak or negative. By eliminating or reducing these threats to validity, we can increase our confidence that a desired test score interpretation or use is more valid.

Several writers have proposed ways of organizing and describing threats to validity (e.g., Crooks et al., 1996; Messick, 1984, 1989). At least five major threats to validity stand out: construct underrepresentation arising from poorly conceptualized or inadequately operationalized constructs, faulty logic of the causal inference regarding test scores, negative consequences of test score interpretations and uses, lack of reproducibility of test scores, and CIV. Although all of these threats deserve attention in validation research, this article concentrates on CIV.

Part 2: Construct-Irrelevant Variance

    The major point here is that educational achievement tests, at best, reflect not only the psychological constructs of knowledge and skills that are intended to be measured, but invariably a number of contaminants. These adulterating influences include a variety of other psychological and situational factors that technically constitute either construct-irrelevant test difficulty or construct-irrelevant contamination in score interpretation. (Messick, 1984, p. 216)

CIV is error variance that arises from systematic error. A good way to think about systematic error is to compare it with random error. If we were to write the linear model representing what we know about random and systematic errors, the model would be:

    y = t + e_r + e_s,

where y is the observed test score for any student, t is the true score, e_r is random error, and e_s is systematic error due to CIV. Lord and Novick (1968, pp. 43-44) developed this idea and presented systematic error as a redefined true score that is essentially biased.

Random error is the difference between any observed and corresponding true score for each examinee. Both classical and generalizability theory (Brennan, 2001) present methods to study random error. Random error can be large or small, positive or negative. We never know the size of random error for any examinee. Reliability is the ratio of true-score variance to observed-score variance for a set of test scores. Random error and the observed scores are random variables, whereas the true score is a constant (Lord & Novick, 1968, p. 35). Random error is uncorrelated with true and observed scores. The expected value of random error across a set of test scores is zero.

Systematic error is not random, but group- or person-specific. Construct-irrelevant easiness refers to a contaminating influence on test scores that tends to systematically increase test scores for a specific examinee or a group of examinees; construct-irrelevant difficulty does the opposite: it systematically decreases test scores for a specific examinee or a group of examinees. Lord and Novick (1968, p. 43) discussed systematic error as an undesirable change in true score. The change is caused by a variable that is unrelated to the construct being measured. Thus, the change in test score is construct irrelevant. Although random error varies from examinee to examinee, systematic error does not. It is predictable. It also manifests in two types.

The first type of systematic error is constant error for all members of a particular group. This kind of error is characterized by members of a specific group having systematic over- or underestimation of their true scores. A good example of this type of constant error is rater severity in a performance test of a cognitive ability. Suppose two raters are consistently harsh. Student papers evaluated by these two raters will likely contain systematic error that lowers their test scores. The group being scored by these two harsh raters gets lower scores than they deserve. Test form difficulty is another example of group-specific CIV. Those taking the more difficult test form will have underestimated scores unless test score equating is carried out.

The second type of systematic error is the over- or underestimation of individual examinee scores due to a CIV source that affects examinees differentially. Messick (1989) cited reading ability on subject-matter tests as an example of this kind of CIV. If performance on a test of a construct that is not reading comprehension is strongly dependent on one's level of reading comprehension, then reading comprehension is construct-irrelevant, because the definition of achievement does not include reading comprehension. For instance, two students having equal science achievement may differ in their test performance because one is a better reader than the other. It is reading comprehension that differentiates these two students, not science achievement. By increasing the reading comprehension
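The error model y = t + e_r + e_s can be made concrete with a small simulation. This is an illustrative sketch only; the score scale, error standard deviations, and harsh-rater group are invented for the illustration. It shows that random error averages to about zero across examinees, that a group-specific systematic error (here, consistently harsh raters) biases an entire group's scores by a constant amount, and that reliability behaves as the ratio of true-score variance to observed-score variance.

```python
import random
import statistics

random.seed(7)

# Hypothetical illustration of y = t + e_r + e_s:
# observed score = true score + random error + systematic (CIV) error.
N = 10_000
true_scores = [random.gauss(70, 8) for _ in range(N)]
random_err = [random.gauss(0, 4) for _ in range(N)]
# Constant group-specific systematic error: the first half of the examinees
# are scored by consistently harsh raters (5 points low for the whole group).
sys_err = [-5.0 if i < N // 2 else 0.0 for i in range(N)]

observed = [t + er + es for t, er, es in zip(true_scores, random_err, sys_err)]

# Random error averages to (about) zero across examinees...
print(round(statistics.mean(random_err), 2))  # near 0

# ...but systematic error does not: the harsh-rater group is biased low.
harsh = statistics.mean(observed[: N // 2]) - statistics.mean(true_scores[: N // 2])
fair = statistics.mean(observed[N // 2 :]) - statistics.mean(true_scores[N // 2 :])
print(round(harsh, 1), round(fair, 1))  # about -5.0 and 0.0

# Reliability as the ratio of true-score variance to observed-score variance,
# computed within one group, where only random error varies.
grp_true = true_scores[N // 2 :]
grp_obs = observed[N // 2 :]
reliability = statistics.variance(grp_true) / statistics.variance(grp_obs)
print(round(reliability, 2))  # roughly 8**2 / (8**2 + 4**2) = 0.8
```

Note that no amount of averaging removes the harsh-rater bias: it shifts the group's expected score, which is exactly why systematic error, unlike random error, threatens validity rather than just reliability.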
Table 1. A Taxonomy for the Study of Systematic Errors Associated with CIV

Category                    Instance                                          Construct  Need for Research¹  Type

Uniformity and types        1. Whether or not students get test preparation   Both       Adequate            Group
of test preparation         2. The extensiveness of test preparation          Both       Adequate            Group
                            3. Unethical test preparation                     Domain     Needed              Group

Test development,           Test Development
administration,             1. Item quality                                   Domain     Needed              Group
and scoring                 2. Test item format                               Both       Adequate            Group
                            3. Differential item functioning                  Both       Abundant            Group
                            Test Administration
                            1. Location of test site                          Both       Needed              Group
                            2. Altering the administration                    Both       Needed              Group
                            3. Participation and exclusion                    Both       Needed              Group
                            4. Computer-based testing                         Domain     Needed              Group
                            5. Calculators in testing                         Domain     Needed              Group
                            Test Scoring
                            1. Scoring errors                                 Domain     Inadequate          Group
                            2. Sanitizing answer sheets                       Domain     Inadequate          Group
                            3. Test form comparability                        Domain     Inadequate          Group
                            4. Rater severity and prompt choice               Ability    Adequate            Group
                            5. Accuracy of passing scores                     Both       Needed              Group

Students                    1. The influence of verbal abilities on           Both       Needed              Ind.
                               test performance
                            2. Test anxiety, motivation, and fatigue          Both       Needed              Ind.
                            3. Accommodations for special student             Both       Needed              Ind.
                               populations

Cheating                    1. Institutional                                  Both       Needed              Group
                            2. Individual                                     Both       Needed              Group

¹ Rating scale: abundant research exists; research on this topic is adequate; more research is needed. Ind. = individual.
construct associated with the source of CIV (either domain, ability, or both), give a subjective appraisal of the adequacy of research (abundant, adequate, or needed), and identify the type of CIV error (individual or group specific).

Uniformity and Types of Test Preparation

As noted previously, AERA (2000) has provided a useful set of guidelines regarding high-stakes testing that includes advice about alignment of content and cognitive processes, instruction, and assessment. These guidelines also address opportunity to learn and the providing of remedial opportunities. After assuring that these guidelines have been met, we should also consider the issue of uniform and ethical test preparation. Most testing specialists recommend test preparation (e.g., Nitko, 2001, chapter 14). Two guiding principles in test preparation are (a) no test preparation should violate ethical standards of our profession and (b) increases in test scores should be correlated with a corresponding increase in student learning (Popham, 1991).

There are many aspects to test preparation, including (a) giving advice to parents, (b) instructing students based on the curriculum represented by the test, (c) providing examples of different test item formats, (d) motivating students to do their best, and (e) teaching testwiseness: test-taking strategies that include efficient time use, error avoidance, informed guessing, and deductive reasoning.

Whether or not students received test preparation can be a source of CIV. If some students in a reportable unit of analysis, such as a school or school district, have received test preparation and another group of these students has not, how does this difference in test preparation affect the validity of test score interpretations? Differences in performance might not be attributable to sound curriculum design and appropriate and effective instruction, but to the fact that some students received test preparation and others did not.

A second type of CIV associated with test preparation is its extensiveness. There should be some evidence that all students received uniform test preparation. For instance, Nolen, Haladyna, and Haas (1992) reported considerable variation in the amount of test preparation by teachers in one state. Lomax, West, Harmon, Viator, and Madaus (1995) provided evidence of excessive test preparation with educationally disadvantaged students.

A third type of CIV involves unethical types of test preparation. In an article on preparing for a performance test, Mehrens, Popham, and Ryan (1998) offered a set of guidelines that seems applicable to all high-stakes tests. Their first guideline has to do with criterion performance being task- or domain-specific. Their second guideline is that if the criterion performance is domain specific, we should not teach to the ex-
scores and by so doing misrepresent the achievement of a class, school, or even a school district. Thus, differences in group performance may not be based on actual achievement differences but on who was sampled and excluded.

The problem is more serious when one considers the recent policy where schools are labeled as failing, as called for in the new Elementary and Secondary Education Act, No Child Left Behind (http://www.ed.gov/nclb/landing.jhtml). A report from the Nation's Report Card (NCES, 2002) for NAEP Science shows participation rates by states range considerably from national averages. A National Assessment of Educational Progress (NAEP) report showed participation rates for students with disabilities can vary by state from 2.6% to 6.7%. Given that these students tend to be low scoring, greater fluctuations in participation can contribute sizably to CIV (Grissmer, Flanagan, Kawata, & Williamson, 2000). Large disparities in participation rates for students with disabilities have also been observed (Erickson, Ysseldyke, & Thurlow, 1996). They stated that such variability in participation rates may be due to the need for accountability and achieving high test scores. Erickson et al. concluded:

    Such variability prohibits valid comparisons between states, and prevents policy-relevant findings to be drawn about how students with disabilities are benefitting from their educational experiences.

Without a doubt, there is an urgent need to ensure through policies and procedures that standardization exists in test participation and exclusion. Variations in these rates directly contribute to CIV when comparisons are made within any unit of analysis. Policies that provide clear guidelines regarding participation and exclusion, coupled with research and documentation of uniform practices, would help alleviate this threat to valid interpretations of achievement scores for schools, school districts, and states.

Computer-based testing. We would not offer computer-based testing (CBT) to any student if we thought the results would be lower than those obtained by paper-and-pencil administration. There is increasing use of CBT, but less frequently do we see documented evidence of the equivalence of CBT and paper-and-pencil testing. Huff and Sireci (2001) raised several important CIV issues related to CBT, such as student proficiency in taking a computerized test, computer platform familiarity, user interface, speededness, and test anxiety. They also noted the potential of incorrect estimates of student scores due to problems with scoring algorithms. Another potential problem with computerized adaptive testing is the heavy demand on middle-difficulty items that provide maximum information. Since these items are the most frequently used, they quickly become overexposed, which is another source of CIV. The potential threats of CIV in the CBT environment have only begun to be explored at this time. Indeed, Standard 12.19 of the Standards (AERA, APA, & NCME, 1999) provides specific warning about the dangers of CIV related to computerized testing. Besides research reports addressing these problems, technical reports on such testing programs offer an opportunity to document that CBT does not contribute CIV.

Calculators in testing. The role of calculators in testing has been an active research topic in item development and test design (Haladyna, 2004). The plausible hypothesis is that students who have calculators have an added advantage over those without calculators in mathematics tests and in other content that may require calculation. A recent report in the Nation's Report Card (NCES, 2002) presented results from the 2000 NAEP showing an interaction of grade level with calculator usage. More frequent use of calculators was correlated with lower scores in grade four, but the opposite was true at grades eight and 12. Also, some item types seem more susceptible to better performance by using calculators. Thus, calculator usage seems associated with CIV and the type of item being offered. The use of calculators would seem to enhance testing of many types of achievement by providing a higher-fidelity experience. At the same time, the use of calculators must not be permitted to increase CIV. Thus, research is constantly needed to address each new application involving calculators or any other technological innovation.

Test Scoring

Scoring errors. Standard 11.10 reads, "Test users should be alert to the possibility of scoring errors; they should arrange for rescoring if individual scores or aggregated data suggest the need for it" (AERA, APA, & NCME, 1999, p. 115). Indeed, an epidemic of scoring errors has arisen throughout the United States. For example, in Minnesota, 47,000 students received incorrect scores, leading to serious negative consequences for these students and to subsequent lawsuits (Henriques & Steinberg, 2001). More than 20 states have been affected by scoring errors. In Arizona, 12,000 students received incorrect scores due to an error in the scoring key (Bowman, 2001). In Washington, 204,000 essays had to be rescored. Scoring errors or delays also occurred in California, Florida, Georgia, Indiana, Mississippi, New York City, Nevada, North Carolina, South Carolina, Tennessee, and Wisconsin. In the Education Week on the Web Archives (2004), there are 8 listings for scoring error incidents. In high-stakes testing, especially where critical pass-fail decisions are made, we need stronger, more independent assurance of score accuracy and additional documentation of extra scrutiny in scoring.

Sanitizing answer sheets. "Cleaning up answer sheets" is a recommended practice. For instance, the National Association of Test Directors (2004) provides specific examples of how answer sheets should be sanitized: "Erase all stray marks, darken light marks, and clean up incomplete erasures." Volunteer parents may be asked to "clean up" answer sheets before scoring. For example, they might make incomplete erasures more thorough, since double-marked items are scanned as incorrect. That some schools and school districts may sanitize answer sheets while others do not introduces potential CIV. The solution to this validity threat is to have all answer sheets sanitized, as is recommended by nearly all test scoring services. This threat to validity is primed for studies that explore the frequency of sanitizing and its consequences on test scores.

Test form comparability. The equating of test forms is a standard practice in testing programs. There are many methods for adjusting test scores so that one test form is no more difficult or easy than any other test form. However, it is possible that errors can occur in equating studies. Although research on equating methods is active and important, we have few mechanisms and little documentation for ensuring that equating is
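Equating methods themselves are treated at length in the measurement literature; the following is only a minimal linear (mean-sigma) equating sketch with invented scores, showing how a raw score on a harder form can be placed on the easier form's scale so that form difficulty does not become a group-specific source of CIV.

```python
import statistics

# Hypothetical randomly-equivalent-groups design: two same-ability groups
# took two forms; Form B runs about six points harder. Scores are invented.
form_a = [52, 61, 58, 70, 66, 63, 55, 74, 68, 60]   # easier form
form_b = [46, 55, 52, 64, 60, 57, 49, 68, 62, 54]   # harder form

mean_a, sd_a = statistics.mean(form_a), statistics.stdev(form_a)
mean_b, sd_b = statistics.mean(form_b), statistics.stdev(form_b)

def equate_b_to_a(score_b: float) -> float:
    """Linear equating: match z-scores across forms, then rescale to Form A."""
    z = (score_b - mean_b) / sd_b
    return mean_a + sd_a * z

# A raw 60 on the harder Form B maps to a higher Form A equivalent,
# removing form difficulty as a group-specific systematic error.
print(round(equate_b_to_a(60), 1))  # → 66.0
```

With these invented data the two forms differ by a constant shift, so linear equating simply adds the difficulty difference back; real equating studies must also contend with sampling error in the means and standard deviations, which is exactly where the equating errors mentioned above can creep in.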
With so many students being deficient in these verbal abilities, the threat of CIV in these challenging performance tests suggests that research on this problem is very much needed.

Test Anxiety, Motivation, and Fatigue

We know that test anxiety can sometimes increase test performance but more generally lowers it. In a meta-analysis of 562 studies, the pattern of student performance in relation to test anxiety is unmistakable and conclusive (Hembree, 1988). Test anxiety can be pernicious in three ways. First, it afflicts many test takers. Test anxiety is estimated to affect about 25% of the general population (Hill & Wigfield, 1984). Second, test anxiety can be exacerbated or reduced by imposing certain conditions on the examinees. Hancock (2001) provided experimental evidence in a study with college students that an evaluative threat can increase test anxiety. Zahar (1998) also provided complex experimental evidence that disposition to anxiety and the high-stakes situation contribute to test anxiety. Third, test anxiety can have consequences. For example, Thornton (2001) reported that teachers in training in Great Britain have been so intimidated by teacher testing that they are dropping out of their teacher education programs and making alternative plans. As we can see, not only is test anxiety a powerful source of CIV, it also affects students and preservice teachers.

The motivational level of students may affect test score performance, no matter the achievement level of the student. The manifestation of low motivation may be non-compliance with the test-taking protocol. Students may seriously underperform, make random marks on the answer sheets, omit answers, or not finish the test. The frequency of omitted responses and items not reached are signals of low motivation and non-compliance. Paris, Lawton, Turner, and Roth (1991) found that younger students take large-scale tests more seriously than older students.

Schools and school districts take very different approaches to motivating students to perform on these tests. Tactics include threats, parties, prizes, awards, and pep rallies. Whether the tactic is positive or negative, knowing the extensiveness of these practices and the degree of the influence of each of these motivational tactics is important. If the motivational strategies work, then test scores contain CIV, because not all students or schools receive uniform motivational stimulation from school leaders. What may be accounting for differences among schools or school districts might not be real learning, but more effective motivational techniques. Although these motivational techniques are desirable, they should be uniformly applied to ensure that motivation does not become a source of CIV.

While there is no research to report about fatigue in testing, we hypothesize that young students may be more susceptible to fatigue in long testing situations than older students, and the conditions for test administration may interact with different types of students. The effects of fatigue are not well understood or studied, but should we be concerned with the energy level of students as they take long, high-stakes tests? Or is fatigue not a factor in test performance? A related area of concern is the extent to which students eat before testing and are allowed breaks and snacks during a long testing day.

Although we have a promising emerging science of person-fit analysis (Meijer & Sijtsma, 1995), we do not routinely study item response patterns of students to find out if their response patterns suggest anxiety, poor motivation, or fatigue. Some students are plodders who work slowly and correctly but do not finish tests in the allotted time. Studies of examinee fit ought to be routine in large-scale, high-stakes assessments, and evidence supporting any of these student sources of CIV should invalidate the scores or cause us to look for reasons for underperformance other than inadequate learning. Another aspect of this problem is nonresponse, items omitted or not reached (Koretz, Lewis, Skewes-Cox, & Burstein, 1993). The frequency of omitted and not-reached items should signal potential problems with test anxiety, motivation, fatigue, or timing. Yet, there is surprisingly little research on these threats to validity.

Unique Problems of Special Populations

Keeping in mind the admonitions of Messick (1984) that CIV can contaminate both interpretations of test scores and implications we make from knowledge of test scores, we confront the unique problems associated with four often co-mingled populations: students with disabilities, LEP students, students living in poverty, and students living in cultural isolation.

Students with Disabilities or LEP

Federal guidelines and the new Standards (AERA, APA, & NCME, 1999) give considerable attention to the necessity of altering the administration conditions or the test itself to eliminate a disability as a source of CIV. As discussed previously, reading comprehension may be a serious source of CIV. With LEP students, this type of CIV is likely to occur. Chapters 9 and 10 of the Standards provide considerable discussion and offer many standards bearing on what is needed to eliminate CIV when testing students with disabilities and students with LEP. Policies of excluding these students from assessments vary not only within classrooms and schools, but also across school districts and states. Federal law requires that students with disabilities be included in assessments, but the law does not explain which accommodations are acceptable or specify the criteria for accommodation. If such accommodations are carried out uniformly in all school districts and states, then differences in performance will not be due to this source of CIV. Until we have full participation and more uniformity in the way accommodations are offered, comparisons of performance of students with disabilities and LEP among units of analysis such as classrooms, schools, and states cannot be considered reasonable or validly interpreted.

Students Living in Cultural Isolation

The measurement of achievement of students living in culturally homogeneous, isolated communities can be affected in many ways. For instance, students living on Native American reservations have a variety of characteristics that work against effective test performance (Haladyna, 2002b). These students, too, need accommodations in testing and, in some circumstances, alternative assessments. The same case might be made for racial or ethnic communities that live in isolation from the rest of society. While test scores may traditionally be low for these groups, the lack or failure of accommodations and modifications in assessments might account for some of this low performance. The threat of CIV for these populations is similar to that of students with disabil-
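A routine screen for the nonresponse signals discussed above could be as simple as counting omitted and not-reached items per examinee. The sketch below is hypothetical: the response-coding scheme and the 10% flagging threshold are invented for illustration, not drawn from the person-fit literature.

```python
# Hypothetical screening sketch: flag response strings whose omit or
# not-reached rates suggest low motivation, anxiety, or fatigue rather
# than low achievement. Coding (invented): '1'/'0' = right/wrong answer,
# letter 'O' = omitted item, 'N' = item not reached.

def flag_pattern(responses: str, max_rate: float = 0.10) -> bool:
    """Return True if the omit + not-reached rate exceeds the threshold."""
    missing = sum(1 for r in responses if r in "ON")
    return missing / len(responses) > max_rate

students = {
    "A": "1101101110011011",   # ordinary response pattern
    "B": "110110NNNNNNNNNN",   # quit early: a run of not-reached items
    "C": "11O11O1O01101101",   # scattered omits (letter O, not zero)
}
flagged = [sid for sid, resp in students.items() if flag_pattern(resp)]
print(flagged)  # → ['B', 'C']
```

Flagged records would then warrant follow-up, such as the person-fit analyses mentioned above, before their scores are interpreted as evidence of low achievement.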
References

Chronicle of Higher Education. (2002, November 21). Two students arrested for alleged high-tech cheating on the GRE. Retrieved March 17, 2004, from http://chronicle.com/free/2002/11/2002112102t.htm
Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Erlbaum.
Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-220). New York: American Council on Education and Macmillan.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to valid use of assessment. Assessment in Education, 3(3), 265-285.
DeMars, C. E. (1998). Gender differences in mathematics and science on a high school proficiency exam: The role of response format. Applied Measurement in Education, 11(3), 279-299.
Downing, S. M. (2002a). Construct-irrelevant variance and flawed test questions: Do multiple-choice item writing principles make any difference? Academic Medicine, 77(10), S103-S104.
Downing, S. M. (2002b). Threats to the validity of locally developed multiple-choice tests in medical education: Construct-irrelevant variance and construct underrepresentation. Advances in Health Sciences Education, 7, 235-241.
Downing, S. M., & Haladyna, T. M. (1996). Model for evaluating high-stakes testing programs: Why the fox should not guard the chicken coop. Educational Measurement: Issues and Practice, 15(1), 5-12.
Education Week on the Web Archives. (2004). Scoring errors. Retrieved March 17, 2004, from http://www.edweek.org/search/
Engelhard, G. E., Jr. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment for all students: Validity, technical adequacy, and implementation (pp. 261-287). Mahwah, NJ: Erlbaum.
Erickson, R. N., Ysseldyke, J. E., & Thurlow, M. L. (1996). Neglected numerators, drifting denominators, and fractured fractions: Determining participation rates for students with disabilities in statewide assessment programs (Synthesis Report No. 23). Minneapolis: University of Minnesota, National Center on Educational Outcomes. Retrieved March 17, 2004, from http://education.umn.edu/NCEO/OnlinePubs/Synthesis23.html
Fitzgerald, J. (1995). English-as-a-second-language learners' cognitive reading processes: A review of research in the United States. Review of Educational Research, 65, 145-190.
Garcia, G. E. (1991). Factors influencing the English reading test performance of Spanish-speaking Hispanic children. Reading Research Quarterly, 26, 371-391.
Grissmer, D. W., Flanagan, A. E., Kawata, J. H., & Williamson, S. (2000). Improving student achievement: What state NAEP test scores tell us. Santa Monica, CA: RAND Corporation.
Haertel, E., & Calfee, R. (1983). School achievement: Thinking about what to test. Journal of Educational Measurement, 20(2), 119-131.
Haladyna, T. M. (2002a). Supporting documentation: Assuring more valid test score interpretations and uses. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment for all students: Validity, technical adequacy, and implementation (pp. 89-108). Mahwah, NJ: Erlbaum.
Haladyna, T. M. (2002b). Standardized achievement testing: Validity and accountability. Boston: Allyn & Bacon.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Erlbaum.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.
Hancock, D. R. (2001). Effects of test anxiety and evaluative threat on students' achievement. Journal of Educational Research, 94(5), 284-290.
Hembree, R. (1988). Correlates, causes, effects, and treatment of test anxiety. Review of Educational Research, 58, 47-77.
Henriques, D. B., & Steinberg, J. (2001, May 20). Right answer, wrong score: Test flaws take toll. New York Times. Retrieved March 17, 2004, from http://www.nytimes.com/2001/05/20/business/20EXAM.html?ex=1079672400&en=fbcef3a39c75ddbd&ei=5070
Herman, J., & Golan, S. (1993). The effects of testing on teaching and schools. Educational Measurement: Issues and Practice, 12, 20-25, 41.
Hill, K., & Wigfield, A. (1984). Test anxiety: A major educational problem and what can be done about it. The Elementary School Journal, 85, 105-126.
Hoff, D. J. (2000, June 21). As stakes rise, definition of cheating blurs. Education Week, 19(41), 1.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Mahwah, NJ: Erlbaum.
Huff, K. L., & Sireci, S. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practice, 20(3), 16-25.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31-41.
Koretz, D., Lewis, E., Skewes-Cox, T., & Burstein, L. (1993). Omitted and not-reached items in mathematics in the 1990 National Assessment of Educational Progress (CSE Technical Report 347). Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing.
Li, K. (2003, March 5). Fraudulent TOEFL takers face possible deportation. Daily Princetonian. Retrieved March 17, 2004, from http://www.dailyprincetonian.com/archives/2003/03/05/news/7516.shtml
Linacre, J. M., & Wright, B. D. (2004). FACETS: Computer program for many-faceted Rasch measurement [Computer software]. Chicago: MESA Press.
Linn, R. L. (2002). Validation of the uses and interpretations of results of state assessment and accountability systems. In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Development, implementation, and analysis. Mahwah, NJ: Erlbaum.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
Linn, R. L., Betebenner, D. W., & Wheeler, K. S. (1998). Problem choice by test takers: Implications for comparability and construct validity (CSE Technical Report 485). Boulder: University of Colorado at Boulder, Center for Research on Evaluation, Standards, and Student Testing.
Linn, R. L., Graue, M. E., & Sanders, N. M. (1990). Comparing state and district results to national norms: The validity of claims that "Everyone is above average." Educational Measurement: Issues and Practice, 9(3), 5-14.
Lohman, D. F. (1993). Teaching and testing to develop fluid abilities. Educational Researcher, 22(1), 12-23.
Lomax, R. G., West, M. M., Harmon, M. C., Viator, K. A., & Madaus, G. F. (1995). The impact of mandated standardized testing on minority students. Journal of Negro Education, 64, 171-185.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Martinez, M. E. (1999). Cognition and the questions of test item format. Educational Psychologist, 34(4), 207-218.
Mehrens, W. A., & Kaminski, J. (1989). Methods for improving standardized test scores: Fruitful, fruitless, or fraudulent? Educational Measurement: Issues and Practice, 8, 14-22.
Mehrens, W. A., Popham, W. J., & Ryan, J. R. (1998). How to prepare students for performance assessments. Educational Measurement: Issues and Practice, 17, 18-22.