Вы находитесь на странице: 1из 11

Construct-Irrelevant Variance

in High-Stakes Testing
Thomas M. Haladyna, Arizona State University West
Steven M. Downing, University of Illinois at Chicago

developing abilities (Messick, 1984),


and learned abilities (Sternberg, 1998).
There are many threats to validity in high-stakes achievement Each cognitive ability involves contex
testing. One major threat is construct-irrelevant variance (CIV). tualized mental models, schemas, or
This article defines ClV in the con text of the contemporary, unitary frarn1s, and complex performance that
view of validity and presents logical arguments, hypotheses, 111:1y have multiple correct pathways that
and documentation for a variety of CIV sources that commonly depend on knowledge and skills. These
abilities are slow growing. They are
threaten interpretations of test scores. A more thorough study difficult to teach and learn. The item
of ClV is recommended. format with greatest fidelity to mea
sure this kind of construct is perfor
Keywords: construct-irrelevant variance, high-stakes testing, validity mance. Messick (1984) referred to this
measurement approach as construct
referenced because the performance
test focuses on the ability itself and not
C urrently, achievement testing can
be characterized as driven by con
tent standards that affect the planning
Part 1: Validity
The most fundamental. step in valida
on the domain of knowledge and skills
that support it. Mislevy (1996) traced
tion is defining the construct. Cronbach the history of achievement testing and
and delivery of instruction and the de and Meehl ( 1955) called this construct forecasted that cognitive psychology will
sign of student assessments. The align formulation. Two kinds of achievement lead us in a direction that merges learn
ment of content standards, instruction, constructs seem represented in state ing theory and test theory. For our pur
,md the t(\Sts used to evaluate student ancl national content standards. poses both kinds of achievement con
l,arning are commonly held paradigms The first kind of achievement con structs are SUS('Pptible to CIV, hut. as we
in education. Most states either have struct can be envisioned as a large will see in different ways.
in place or are developing comprehen domain of knowledge and skills, some According to the Standards for
sive assessment systems that have these times called declarative and procedural Edu.rational and Psychological TesHng
fr:atures. knowledge. Any achievement test is in- (Strwdards) (American Educational
In many states, test scores can have 1 ended to be a representative sample Rcscarch Association [AERA], Ameri
several high-stakes uses. For instance, from that domain (Haertel & Calfee, can Psychological Association [APA], &
students must pass tests to graduate 1983). Although test specifications guide National Council on Measurement in
from high school or to be promoted the design of these tests, the sample of Education [NCME], 1999): "Validity
to the next grade. Schools are evalu this domain is usually small. Each stu
ated based on test scores and annual dent's test score is intended to show
progress. Low-performing schools may status in this domain. This type of
be subject to intervention. Teachers and achievement construct can be viewed
school leaders may be evaluated based Thomas M. Haladyna is Professor r!f'
as traditional. Tests of domains of knowl Educational Psychology, College of Teacher
on student test performance, and their edge and skills are consistent with the Education and Leadership, Arizona State
employment and pay may be affected by 'criterion-referenced" movement of the U,1,ivenntv West, FAE 240 South, 4701 West
this evaluation. 1970s and 1980s. Multiple-choice (MC) Thunderbird Road, PO Box 37100, Phoeni:x,
This article addresses a serious prob formats work well to measure this kind AZ 85069-7100, thomas. haladyna@asu.edu.
lem in high-stakes testing: construct of achievement construct. He specializes in developing and evaluating
irrelevant variance (CIV). Part 1 reviews A second kind of achievement con testing programs and in item develop
the contemporary view of validity, with struct focuses on a cognitive abil:itu, ment and validation.
its ,mphasis on construct validity. In such as reading, writing, or mathemati Steve11 JI Downing is Associate Professor
Part 2, CIV is defined. In Part 3, a tax cal prnll!Pm solving. We can conceive of of Medi(/// Fr/!ication, Unh)(lnity of Illinois
onomy is presented that organizes this ability as consisting of a domain of at Chi,e,1.1;,1, n ,//ege ofMedicine, Department
ofMedical Education (MC 591), 8088. Wood
sources of CIV. Evidence is presented complex tasks. The theoretical ratio Street, Chicago, IL 60612; sdowning@uic.
of CIV's extensiveness in high-stakes nale for any cognitive ability comes edu. He specializes in test development
testing. Some of these sources of Cl\ from cognitive psychology. Other terms and ps,lfchrnnr:tric issues.fin achievement
have rnceived more research attention used to signify a cognitive ability in and credrmlialing examinations in the
than others. clude fluid abilities (Lohman, 1993), profess lo n s.

Spring 2004 17
refers to the degree to which evidence lidity ( e.g., Crooks, et al. 1996; Messick, lated with true and observed scores. The
and theory support the interpretations 1984, 1989). At least five major threats expected value of random error across a
of test scores entailed by proposed uses to validity stand out: construct under set of test scores is zero.
of tests" (p. 9). Validation is an inves representation arising form poorly con Systematic error is not random, but
tigative process by which we (a) create ceptualized or inadequately operational group- or person specific. Construct
a plausible argument regarding a de ized constructs, faulty logic of the causal irrelevant easiness refers to a contami
sired interpretation or use of test scores, inference regarding test scores, negative nating influence on test scores that
(b) collect and organize validity evi consequences of test score interpreta tends to systematically increase test
dence bearing on this argument, and tions and uses, lack of reproducibility of scores for a specific examinee or a group
(c) evaluate the argument and the ev test scores, and CIV. Although all of of examinees; construct-irrelevant dif
idence concerning the validity of the these threats deserve attention in vali ficulty does the opposite. It systemati
interpretation. Kane (1992) described dation research, this article concen cally decreases test scores for a specific
the process for establishing a plausible trates on CIV. examinee or a group of examinees. Lord
argument and criteria we might use to and Novick (1968, p. 43) discussed sys
evaluate this argument. Kane (2002) tematic error as an undesirable change
also described some nuances of de Part 2: Construct-Irrelevant in true score. The change is caused by a
scriptive and policy-based interpreta Variance variable that is unrelated to the con
tions in high-stakes student achieve The major point here is that educa struct being measured. Thus, the change
ment testing. He argued that current tional achievement tests, at best, in test score is construct irrelevant.
validation does not go far enough to reflect not only the psychological con Although random error is variable from
justify the full array of interpretations structs of knowledge and skills that examinee to examinee, systematic error
and uses. are intended to be measured, but in is not. It is predictable. It also manifests
As the stakes for testing increase, the variably a number of contaminants. in two types.
need for validity evidence also increases These adulterating influences include The first type of systematic error is
( Linn, 2002). The quest for validity evi a variety of other psychological and constant error for all members of a pa;-
dence can be very complex. This evi situational factors that technically ticular group. This kind of error is char
constitute either construct-irrelevant
dence will likely consist of documenta test difficulty or construct-irrelevant acterized by members of a specific group
tion of procedures in test development, contamination in score interpreta having systematic over- or underestima
administration, scoring and report tion. (Messick, 1984, p. 216) tion of their true scores. A good example
ing, and empirical studies (Downing & of this type of constant error is rater
Haladyna, 1996; Haladyna, 2002a). The CIV is error variance that arises from severity in a performance test of a cogni
Standards (AERA, APA, & NCME, 1999) systematic error. A good way to think tive ability. Two raters are consis
present categories of validity evidence about systematic error is to compare it tently harsh. Student papers evaluated
that include content, cognitive pro with random error. If we were to write by these two raters will likely result in
cesses, internal structure of item re the linear model representing what we systematic error that lowers their test
sponses or ratings of performance, know about random and systematic er scores. The group being scored by these
reliability, relations of test scores to rors, the model would be: two harsh raters gets lower scores than
other variables, and consequences. Es they deserve. Test form difficulty is
says by Messick ( 1995a, 1995b) also pro y = t +er+ es, another example of group-specific CIV.
vide suggestions for types of validity ev Those taking the more difficult test form
idence and their importance. where y is the observed test score for will have underestimated scores unless
Although we may view validation as a any student, t is the true score, en test score equating is carried out.
process for strengthening an argument is random error, and es is systematic The second type of systematic error is
about the validity of a particular inter error due to CIV. Lord and Novick the over- or underestimation of individ
pretation or test score use, Cronbach (1968, pp. 43-44) developed this idea ual examinee scores due to a CIV source
(1988) noted that validation should and presented systematic error as a that affects examinees differentially.
also include the testing of alternate redefined true score that is essentially Messick (1989) cited reading ability
hypotheses concerning the validity of biased. on subject-matter tests as an example
an interpretation. Crooks, Kane, and Random error is the difference be of this kind of CIV. If performance
Cohen (1996) provided a comprehen tween any observed and correspond on a test of a construct that is not
sive model for the study of threats to ing true score for each examinee. Both reading comprehension is strongly de
validity. They identified eight linked classical and generalizability theory pendent on one's level of reading com
inferences and argued that a weakness (Brennan, 2001) present methods to prehension, then reading comprehen
in any link weakens the entire chain. study random error. Random error can sion is construct-irrelevant, because
As we acquire and evaluate validity ev be large or small, positive or negative. the definition of achievement does
idence, we may conclude that some ev We never know the size ofrandom error not include reading comprehension. For
idence is weak or negative. By elimi for any examinee. Reliability is the ratio instance, two students having equal
nating or reducing these threats to of true-score variance and observed science achievement may differ in their
validity, we can increase our confi score variance for a set of test scores. test performance because one is a better
dence that a desired test score inter Random error and the observed scores reader than the other. It is reading com
pretation or use is more valid. are random variables, whereas the true prehension that differentiates these two
Several writers have proposed ways of score is a constant (Lord & Novick, students, not science achievement. By
organizing and describing threats to va- 1968, p. 35). Random error is uncorre- increasing the reading comprehension

18 Educational Measurement: Issues and Practice


demand on this science test, reading some sources of CIV arising from MC tested and before we make decisions
comprehension potentially contami or performance tests. These include based on these test scores.
nates the measure of science achieve ConQuest (Adams, Wu,& Wilson, 1998), Cole and Moss ( 1989) described a du
ment and its subsequent interpretation Facets (Linacre & Wright, 2004), ality with the term bias. The first conno
or use. By lowering the reading demand Parscale (Muraki & Bock, 2003), and tation of bias involves social justice and
on this test, CIV's threat due to reading Rumm (Andrich, Lyne, Sheridan, & equal treatment of students, such as is
comprehension is lessened. Other exam Luo, 1997). Generalizability theory embodied in the idea of opportunity to
ples of person-specific CIV include moti (Brennan, 2001) offers a useful way to learn. The second connotation is a more
vation to perform on a test, test anxiety, study the precision of test scores and technical issue that relates to the test as
and fatigue. TheStandards (AERA, APA, sources of random error, but it also pro a cause of differences between groups
& NCME, 1999) also give some examples vides a basis for studying CIV as well. believed to be equal in achievement.
of potential sources of CIV, such as read However, not all sources of CIV are so This second meaning seems to resem
ing comprehension, item formats, anxi easily identifiable, and adjustment to ble what we think of as CIV. In fact,
ety, and test administration conditions. reduce or eliminate its influence will be the Standards (AERA, APA, & NCME,
Both Messick (1989) and Lord and a considerable challenge to measure 1999, p. 76) stated, "The term bias in
Novick (1968) offer examples. This arti ment specialists. tests and testing refers to construct
cle extends and expands these discus irrelevant components that result in
sions of sources of CIV. systematically higher or lower scores
Related Ideas in the for identifiable groups of examinees."
Contrasts and Comparisons of Measurement Literature Neither the Standards nor Cole and
Random and Systemati,c Error Messick (1989) devoted only several Moss gave us detailed, specific infor
Some contrasts and comparisons be paragraphs to CIV in his influential essay mation about categories of CIV.
tween random and systematic error may on validity. Lord and Novick (1968) rec
further clarify this definition of CIV. ognized this problem and incorporated it
While random error is uncorrelated to into their discussion of true scores. The Part 3: A Proposed Taxonomy for
true and observed scores, systematic Standards (AERA, APA, & NCME, 1999, Studying CIV and Some Evidence
error is correlated to both true and ob p. 10) identifies only six standards bear We begin this section by providing a
served scores. This is true because both ing on CIV, and provides a brief discus simple taxonomy for classifying vari
individuals and groups are either af sion. The six standards and their points ables that produce CIV and then pro
fected or unaffected by CIV. The un of discussion are: vide logical arguments, hypotheses, and
affected individuals or groups have Standard 7.1-subgroup differ empirical evidence for each source.
observed scores that are closer to true ences The extensiveness of CIV weakens the
scores. Although the expected value of Standard 7.2-the validity of in validity of interpretations and uses of
random error is zero, if CIV is present, ferences when CIV is present test scores in high-stakes accountabil
then the expected value of systematic Standard 7.3-differential item ity systems. Our purpose is to present
error is a non-zero value. Systematic functioning, one source of CIV some documentation of sources of CIV
error is quantifiable. The larger this error Standard 7.10-the importance of as a serious threat to validity and call
variance, the more serious the threat studying systematic error among groups for needed research or point out where
to validity. Lord and Novick (1968) con where no differences are believed to research is being conducted.
tended that if the estimation of system exist The documentation has many ori
atic error is impossible, then test scores Standard 12.19-the interpreta gins. While we rely on scholarly reports
may not be validly interpreted and, tion of test results when CIV may be as a primary source, secondary sources
thus, be of no use to the test sponsor. present include newspaper articles, and other
Standard 13.18-concerns of CIV non-scholarly periodicals. One of the
Estimating CJV and Adjusting Test in computerized testing. most productive secondary sources is
Scores to Eliminate CJV These standards are not very specific as the World Wide Web. Another useful
The estimation of systematic error can to the many sources of CIV identified in secondary source is an in-house research
be accomplished by using the general this article. Considering the seriousness report, such as those from the extensive
linear model. The test score serves of the CIV threat to validity for high collection of the Educational Testing
as the dependent variable, and a CIV stakes test score interpretations and Service. States and testing organiza
source serves as the independent vari uses, it is important to expand our un tions often provide such information in
able. The percentage of accounted vari derstanding of CIV. Chapter 7 of the their archives and on Web pages. We
ance provides a measure of effect size Standards (AERA, APA, & NCME, 1999) believe the mix of evidence for each
for the source of CIV. With individual addresses fairness in testing. As pointed source of CIV lends credibility to the
specific error, the product-moment out in that chapter, there is no common claims made about the extensiveness of
correlation provides this index. With meaning or technical definition for fair CIV in high-stakes achievement testing.
group-specific error, R2 provides this ness. Fairness has four connotations: Occasionally, we suggest viable research
index. Analysis of variance also provides (a) all tests should be free from bias, hypotheses to explore a particular source
a basis for computing the effect size as (b) all examinees are treated equitably, of CIV and indicate what research needs
well as the appropriate statistical test (c) the test scores should be equal for to be done.
of significance. groups thought to be equal, and (d) stu Table 1 presents the taxonomy in four
There are many computer programs dents being tested have equal and fair broad areas. In the three columns to the
for the adjustment and elimination of opportunities to learn before being right, we name the type of achievement

Spring 2004 19
Table 1. A Taxonomy for the Study of Systematic Errors Associated with CIV
Category Instances Construct Need for Research 1 Type
Uniformity and 1 . Whether or not students Both Adequate Group
types of test get test preparation
preparation 2. The extensiveness of Both Adequate Group
test preparation
3. Unethical test preparation Domain Needed Group
Test development, Test Development
administration, 1. Item quality Domain Needed Group
and scoring 2. Test item format Both Adequate Group
3. Differential item functioning Both Abundant Group
Test Administration
1 . Location of test site Both Needed Group
2. Altering the administration Both Needed Group
3. Participation and exclusion Both Needed Group
4. Computer-based testing Domain Needed Group
5. Calculators in testing Domain Needed Group
Test Scoring
1. Scoring errors Domain Inadequate Group
2. Sanitizing answer sheet Domain Inadequate Group
3. Test form comparability Domain Inadequate Group
4. Rater severity and prompt choice Ability Adequate Group
5. Accuracy of passing scores Both Needed Group
Students 1 The influence of verbal
abilities on test performance Both Needed Ind.
2. Test anxiety, motivation,
and fatigue Both Needed Ind.
3. Accommodations for special
student populations Both Needed Ind.
Cheating 1. Institutional Both Needed Group
2. Individual Both Needed Group
1
Rating Scale: Abundant research exists; research on this topic is adequate; more research is needed. Ind. = individual.

construct associated with the source and (b) increases in test scores should priate and effective instruction, but to
of CIV (either domain, ability, or both), be correlated with a corresponding in the fact that some students received
give a subjective appraisal of the ade crease in student learning (Popham, test preparation and others did not.
quacy of research (abundant, adequate, 1991). A second type of CIV associated with
or needed), and identify the type of CIV There are many aspects to test prepa test preparation is its extensiveness.
error (individual or group specific). ration, including (a) giving advice to There should be some evidence that all
parents, (b) instructing students based students received uniform test prepara
Uniformity and Types of on the curriculum represented by the tion. For instance, Nolen, Haladyna, and
Test Preparation test, (c) providing examples of different Haas (1992) reported considerable vari
As noted previously, AERA (2000) has test item formats, (d) motivating stu ation in the amount of test preparation
provided a useful set of guidelines dents to do their best, and (e) teaching by teachers in one state. Lomax, West,
regarding high-stakes testing that in testwiseness-test-taking strategies that Harmon, Viator, and Madaus (1995) pro
cludes advice about alignment of con include efficient time use, error avoid vided evidence of excessive test prepara
tent and cognitive processes, instruc ance, informed guessing, and deductive tion with educationally disadvantaged
tion, and assessment. These guidelines reasoning. students.
also address opportunity to learn and Whether or not students received A third type of CIV involves unethical
the providing of remedial opportuni test preparation can be a source of CIV. types of test preparation. In an article
ties. After assuring that these guide If some students in a reportable unit of on preparing for a performance test,
lines have been met, we should also analysis, such as a school or school dis Mehrens, Popham, and Ryan (1998)
consider the issue of uniform and ethi trict, have received test preparation offered a set of guidelines that seems ap
cal test preparation. Most testing spe and another group of these students plicable to all high-stakes tests. Their
cialists recommend test preparation has not, how does this difference in test first guideline has to do with criterion
(e.g., Nitko, 2001, chapter 14). Two preparation affect the validity of test performance being task- or domain spe
guiding principles in test preparation score interpretations? Differences in cific. Their second guideline is that if
are (a) no test preparation should vio performance might not be attributable the criterion performance is domain
late ethical standards of our profession to sound curriculum design and appro- specific, we should not teach to the ex-

20 Educational Measurement: Issues and Practice


tent that the inference from the test and test characteristics? A study by tions for new methodology for the study
score to the domain of knowledge or Downing (2002a) showed that flawed of DIF. Thus, DIF as a source of CIV
skills or a cognitive ability is inaccurate. MC test items were about 7 percentage applies to both kinds of achievement
By focusing test preparation on a subset points more difficult than non-flawed constructs, domain-based and abilities.
of the domain that happens to include items measuring the same content. Research on DIF seems abundant, and
items appearing on the test, the test Lower achieving students had greater the analysis of test items for DIF seems
scores would be higher than deserved. difficulty with these flawed items than to be standard practice in high-quality
If, however, the achievement construct did higher achieving students. Although testing programs.
is an ability, such as writing, it would be this problem may be more common with
misleading to the consumers of test teacher-made tests, it points to item Test Administration
score information to teach writing in quality as a potential CIV threat to valid Location of the testing site. Adverse
only the one genre that is assessed in a ity. Most professionally developed, large testing conditions may be a source of
high-stakes test. These examples are in scale tests are systematically reviewed CIV. The location in which a test is
stances of construct-irrelevant easiness, and edited to remove such CIV-inducing given is not necessarily standardized.
and may also be related to construct items, but locally developed tests may The classroom may be the most natural
under representation. Specifically, un have a greater tendency to exhibit this environment, but when students are
ethical test preparation might include source of CIV (Downing, 2002b). displaced from their classrooms to take
(a) developing a curriculum based on a large-scale test, are the results the
test content instead of following the Test item format. Although research same as those from comparable students
state's content standards, (b) present suggests that item format may not intro taking the test in their own classroom?
ing items that are similar or identical in duce CIV for boys and girls, the threat is Is there an age-location interaction?
content and format to those on the test, omnipresent that some formats tend Younger students may perform better in
or (c) using published instructional to advantage some groups of students their own classroom, but as they ma
materials that focus on exactly what the while other formats do not. For example, ture, they might be relocated for testing
specific test measures or instructional Beller and Gafni (2000) found a gender without affecting their performance.
practices directly aimed at the specific by-format interaction in 2 different Research studies are much needed to
content of the test, pejoratively called years of the International Assessment of evaluate the effect of specific testing en
narrowing the curriculum. All these Educational Progress that essentially re vironments on test performance.
practices can increase test scores with versed itself. Upon closer examination of
out materially affecting the broader do this interaction, they found that difficult Altming the administration. We can
main that the test samples. Studies at performance items were more difficult extend the administration time, hop
test to different problems associated for girls than for boys. Thus, under some ing that this extended time will help
with test preparation (e.g., Herman & circumstances, performance differed as students who need more time. Scores
Golan, 1993; Mehrens & Kaminski, a function of both the item format and might be higher than comparable groups
1989; Nolen et al., 1992). Mehrens and item difficulty. DeMars (1998) studied where administration time was not ex
Kaminski voiced the concern that if test the consequences of performance under tended. To what extent do such non
preparation is unethical, then the public varying item format conditions by gen standardized time extensions occur?
is led to believe that students achieved der and ethnicity. Gender-by-format What is the effect of this alteration on
more than they really did achieve. interactions were reported. Rodriguez test scores? Nolen et al. (1992) reported
Although test preparation should be (2002) reviewed research on format dif that 8% of elementary and 3% of sec
a standardized practice in all class ferences and found, typically, that format ondary teachers altered the administra
rooms, the variations in test preparation is not construct irrelevant. However, tion time. Wodtke, Harper, Schommer,
and the use of unethical test prepara Martinez (1999) provided evidence that and Brunelli (1989) concluded after
tion constitute sources of CIV. We need occasionally item format can be con their observation of 10 kindergarten
more research on the unethical types of struct irrelevant. Test item format con classrooms that standardized tests were
test preparation and better documenta tinues to be a fruitful area of study of CIV. administered in such a non-standardized
tion in all student achievement testing way as to render them incomparable.
programs of the role of test preparation. Differential item functioning. An They also stated that non-standardized
All students should receive ethical test item shows differential item functioning administration may result in confusion,
preparation and the extensiveness of (DIF) if the probability of a correct re anxiety, behavioral resistance, nega
this ethical test preparation should be sponse depends on group membership tive attitudes, and other problems that
uniform. and the groups are assumed to be equal caused these researchers to wonder why
with respect to the achievement con we even bother to test. Research on test
Test Development, Administration, struct that the test measures. Standard administration, more frequent monitor
and Test Scoring 7.3 specifically provides for research on ing, and independent supervision of test
Item quality. While the principles DIF and the taking of appropriate ac ing are needed to ensure that altered
of writing effective MC test items are tion to eliminate DIF. Methods for the administration does not contribute CIV.
documented in textbooks and research study of DIF are numerous. Holland
(Haladyna, Downing, & Rodriguez, and Wainer (1993) provided a thorough Parl'icipation and exclusion. The
2002), the writing of performance test treatment of this source of CIV for MC percentage of participation in a school,
items has little scientific or research tests. Penfield and Lam (2000) reviewed school district, or state can contribute
basis. What effect, if any, does the use methods of study of DIF with perfor CIV. By excluding a group of low-scoring
of poorly crafted items have on item mance formats and made recommenda- students, we can raise average test

Spring 2004 21
scores and by so doing misrepresent the sues related to CBT, such as student need for it" (AERA, APA, & NCME, 1999,
achievement of a class, school, or even proficiency in taking a computerized p. 115). Indeed, an epidemic of scor
a school district. Thus, differences test, computer platform familiarity, user ing errors has arisen throughout the
in group performance may not be based interface, speededness, and test anxi United States. For example, in Min
on actual achievement differences but ety. They also noted the potential of in nesota, 47,000 students received in
on who was sampled and excluded. correct estimates of student scores due correct scores, leading to serious nega
The problem is more serious when to problems with scoring algorithms. tive consequences for these students
one considers the recent policy where Another potential problem with com and to subsequent lawsuits (Henriques
schools are labeled asfailing as called puterized adaptive testing is the heavy & Steinberg, 2001). More than 20 states
for in the new Elementary and Secondary demand on mid-difficult items that pro have been affected by scoring errors.
Education Act-No Child Left Behind vide maximum information. Since these In Arizona, 12,000 students received
(http://www.ed.gov/nclb/landing.j html). items are the most frequently used, these incorrect scores due to an error in
A report from theNat'ions Report Card items quickly become overexposed, the scoring key (Bowman, 2001). In
(NCES, 2002) for NAEP Science shows which is another source of CIV. The po Washington, 204,000 essays had to be
participation rates by states range tential threats of CIV in the CBT envi rescored. Scoring errors or delays also
considerably from national averages. ronment have only begun to be explored occurred in California, Florida, Georgia,
A National Assessment of Educational at this time. Indeed, Standard 12.19 of Indiana, Mississippi, New York City,
Progress (NAEP) report showed partici the Standards (AERA, APA, & NCME, Nevada, North Carolina, South Carolina,
pation rates for students with disabilities 1999) provides specific warning about Tennessee, and Wisconsin. In the Edu
can vary by state from 2.6% to 6.7%. Given the dangers of CIV related to computer cation Week on the Web Archives (2004),
that these students tend to be low scor ized testing. Besides research reports there are 8 listings for scoring error in
ing, greater fluctuations in participation addressing these problems, technical cidents. In high-stakes testing, espe
can contribute sizably to CIV (Grissmer, reports on such testing programs offer cially where critical pass-fail decisions
Flanagan, Kawata, & Williamson, 2000). an opportunity to document that CBT are made, we need stronger, more inde
Large disparities in participation rates does not contribute CIV. pendent assurance of score accuracy
for students with disabilities have also and additional documentation of extra
been observed (Erickson, Ysseldyke, & Calculators in testing. The role of scrutiny in scoring.
Thurlow, 1996). They stated that such calculators in testing has been an ac
variability in participation rates may be tive research topic in item development Sanitizing answer sheets. "Cleaning
due to the need for accountability and and test design (Haladyna, 2004). The up answer sheets" is a practice that is
achieving high test scores. Erickson, plausible hypothesis is that students recommended. For instance the National
et al. concluded: who have calculators have an added ad Association of Test Directors (2004) pro
vantage over those without calculators vides specific examples of how answer
Such variability prohibits valid com in mathematics tests and in other con sheets should be sanitized: "Erase
parisons between states, and prevents tent that may require calculation. A re all stray marks, darken light marks,
policy-relevant findings to be drawn
about how students with disabilities cent report in The Nation :S- Report Card and clean up incomplete erasures."
are benefitting from their educational (NCES, 2002) presented results from Volunteer parents may be asked to
experiences. the 2000 NAEP showing an interaction of "clean up" answer sheets before scor
grade level with calculator usage. More ing. For example, they might make
Without a doubt, there is an urgent need frequent use of calculators was corre incomplete erasures more thorough,
to ensure through policies and proce lated with lower scores in grade four, but since double-marked items are scanned
dures that standardization exists in test the opposite was true at grades eight as incorrect. That some schools and dis
participation and exclusion. Variations and 12. Also, some item types seem tricts may sanitize answer sheets while
in these rates directly contribute to CIV more susceptible to better performance other schools and school districts intro
when comparisons are made within any by using calculators. Thus, calculator duce potential CIV. The solution to this
unit of analysis. Policies that provide usage seems associated with CIV and validity threat is to have all answer
clear guidelines regarding participation the type of item being offered. The use sheets sanitized as is recommended by
and exclusion coupled with research of calculators would seem to enhance nearly all test scoring services. This
and documentation of uniform practices testing of many types of achievement by threat to validity is primed for studies
would help alleviate this threat to valid providing a higher fidelity experience.. that explore the frequency of sanitizing
interpretations of achievement scores At the same time, the use of calculators and its consequences on test scores.
for schools, school districts, and state. must not be permitted to increase CIV.
Thus, research is constantly needed to Testform comparability. The equat
Computer-based testing. We would address each new application involving ing of test forms is a standard practice
not offer computer-based testing ( CBT) calculators or any other technological in testing programs. There are many
to any student if we thought the results innovation. methods for adjusting test scores so that
would be lower than those obtained by one test form is no more difficult or easy
paper-and-pencil administration. There Test Scoring than any other test form. However, it is
is increasing use of CBT, but less fre Scoring errors. Standard 11.10 reads, possible that errors can occur in equat
quently do we see documented evidence "Test users should be alert to the pos ing studies. Although research on equat
of the equivalence of CBT and paper sibility of scoring errors; they should ing methods is active and important, we
and-pencil testing. Huff and Sireci arrange for rescoring if individual have few mechanisms and little docu
(2001) raised several important CIV is- scores or aggregated data suggest the mentation for ensuring that equating is

22 Educational Measurement: Issues and Practice


done accurately. Without adequate qual the population of content experts from 2. A construct where verbal abilities
ity control and technical reports that which a committee was convened. are very important in the perfor
thoroughly document these procedures, Another dimension to this problem mance test. A good example of this
such errors can go undetected and in arises when a jurisdiction applies stan would be an advanced placement
troduce CIV. dards in non-uniform ways. States vary test that requires reading a pas
A related problem is test-score drift. in the way they identify and label "failing sage, exercising critical thinking,
From the time of first administration of schools." While Michigan scores above and writing ability.
a new test, test scores tend to increase the national average on the NAEP, it re 3. A construct where verbal abilities
for comparable groups as a function of ported the most failing schools (40%), are used but are not deemed cru
test age. It would be easy and presump while states like Arkansas and Wyoming cial in the performance task. This
tuous to conclude that these increases reported no failing schools. Thus, we construct might involve a domain
are due to improved learning. However, have a clear indication that CIV is of knowledge and skills with a
this growth should be validated by com present in standard setting. Rothstein low cognitive demand. Minimal
parison to growth on another standard (September 8, 2002) reported that the reading and writing skills may be
ized achievement test, such as the NAEP. different standards employed in states needed to perform adequately on
In a study by Linn, Graue, and Sanders to designate schools as failing has re a test of this kind of construct.
(1990), comparisons of publishers' stan sulted in some states engaging in more 4. A construct where there is lit
dardized achievement test results with busing so that students can change tle demand for reading and writ
NAEP test results suggested that test schools. ing abilities. This construct might
scores may depend on non-achievement These differences in passing stan involve computation, symbolic
factors, such as item exposure. Test dards and standard-setting outcomes representation associated with
score drift may be related to unethical cause us to question the extent to chemistry or physics, or one of the
test preparation previously discussed or which the method of standard setting performing arts.
to other methods used to increase scores may contribute CIV. A confounding To what extent do deficits in read
for some examinees in a construct factor is the type of achievement test ing interfere with performance on an
irrelevant way. The problem of drift im used in a state and its connectedness achievement test? Research involving
plies that we have continuing problems to instruction. Based on the disparities students with limited English proficiency
with the accuracy of equating. among states on how failing schools are (LEP) by Abedi, Lord, Hofstetter, and
identified and labeled, one is inclined to Baker (2000) showed that vocabulary
Rater severity and prompt choice. conclude that standards are indeed very has a powerful influence on test perfor
The threat of CIV when equating per arbitrary. Research is sorely needed into mance. They experimented with NAEP
formance tests principally comes from the validity of standards. Part of this va test items, using simplified vocabulary
rater effects, particularly rater severity, lidity research should examine the con to improve students' test performance
and the issue of prompt choice in writing sistency of judges who set standards by allowing students to use a glossary
assessments. Linn, Baker, and Dunbar and potential bias introduced by panels and permitting students extra testing
(1991, p. 8) stated, "The training and of judges who may differ from the pop time. Fitzgerald (1995) characterized
calibration of raters is critical in this ulation of potential judges. Consistency students with LEP as slow readers,
regard." The practice of letting students of ratings is another issue within this whose test performance is obviously im
choose writing prompts assumes that need for research on standard setting. paired with time limits. Garcia (1991)
prompt choice does not contribute to showed that LEP students may have
CIV. Testing specialists have argued that trouble with the familiarity of main
prompt choice indeed introduces CIV Students stream American topics, contributing
(Linn, Betebenner, & Wheeler, 1998; Compared with others sources of CIV, to their lower performance. Thus, we
Wainer & Thissen, 1994). Fortunately, students potentially provide the most se have increasing evidence that reading
we have an extensive literature on rater rious CIV threat to validity. In these in comprehension plays a vital role in test
severity and a growing literature on the stances, students comprise individual performance. Students with LEP are
problem of prompt choice (Engelhard, specific CIV versus the group-specific probably most affected by deficits in
2002). However, this research litera CIV. reading comprehension. The measure
ture has not yet affected practice of ment of achievement in other subject
scoring performance tests and remov matter may be hopelessly contaminated
ing CIV due to rater severity, other The Influence of Verbal Abilities by their deficiency in reading compre
rater effects, and prompt choice. on Test Perfonnance hension. However, this problem is not
Another important source of CIV may be limited to LEP students. Non-LEP stu
Accuracy of passing scores. The es the demand for verbal abilities needed in dents can also have low reading compre
tablishment of a passing score for a pass the measurement of another ability. By hension. The treatment of this threat to
fail decision or the setting of multiple verbal abilities, we mean reading, writ validity should include a careful con
points on a test score scale for setting ing, speaking, and listening. Ryan and sideration of the definition of the con
benchmarks, (as in the NAEP) is usually Demark (2002) hypothesized four types struct and the role that verbal ability
a judgmental process involving subject of achievement constructs: plays in this definition. For the mea
matter experts (SME). The threat of CIV 1. A construct, such as writing, surement of some abilities, reading and
is that the SME group, which recom where the highest fidelity mea writing at a high level are necessary,
mends a passing score or set of bench surement technique is a perfor and for other abilities, the demand for
marks, may not be representative of mance test. these verbal abilities is less important.

Spring 2004 23
With so many students being deficient motivational strategies work, then test often co-mingled populations: students
in these verbal abilities, the threat of scores containCIV, because not all stu with disabilities, LEP students, stu
CIV in these challenging performance dents or schools receive uniform motiva dents living in poverty, and students liv
tests suggests that research on this tional stimulation from school leaders. ing in cultural isolation.
problem is very much needed. What may be accounting for differences
among schools or school districts might Students with Di<;abilities or LEP
not be real learning, but more effective Federal guidelines and the new Stand
Test Anxiety, Motivation, and Fatigue motivational techniques. Although these ards (AERA, APA, & NCME, 1999) give
We know that test anxiety can increase motivational techniques are desirable, considerable attention to the necessity
test performance but more generally these techniques should be uniformly of altering the administration conditions
lowers test performance. In a meta applied to ensure that motivation does or the test itself to eliminate a disabil
analysis of 562 studies, the pattern of not become a source ofCIV. ity as a source of CIV. As discussed pre
student performance in relation to test While there is no research to report viously, reading comprehension may be
anxiety is unmistakable and conclusive about fatigue in testing, we hypothe a serious source of CIV. With LEP stu
(Hembree, 1988). Test anxiety can be size that young students may be more dents, this type ofCIVis likely to occur.
pernicious in three ways. First, it af susceptible to fatigue in long testing Chapters 9 and 10 of the Standards
flicts many test takers. Test anxiety situations than older students, and the provide considerable discussion and
is estimated to include about 25% of conditions for test administration may offer many standards bearing on what
the general population (Hill & Wigfield, interact with different types of stu is needed to eliminate CIV when testing
1984). Second, test anxiety can be exac dents. The effects of fatigue are not well students with disabilities and students
erbated or reduced by imposing certain understood or studied, but should we be with LEP. Policies of excluding these
conditions on the examinees. Hancock concerned with the energy level of stu students from assessments vary not
(2001) provided experimental evidence dents as they take long, high-stakes only within classrooms and schools, but
in a study with college students that an tests? Or is fatigue not a factor in test also across school districts and states.
evaluative threat can increase test anx performance? A related area of concern Federal law requires that students with
iety. Zahar ( 1998) also provided com is the extent to which students eat be disabilities be included in assessments,
plex experimental evidence that dispo fore testing and are allowed breaks and but the law does not explain which ac
sition to anxiety and the high-stakes snacks during a long testing day. commodations are acceptable or spec
situation contribute to test anxiety. Although we have a promising emerg ify the criteria for accommodation. If
Third, test anxiety can have conse ing science of person-fit analysis (Meijer such accommodations are carried out
quences. For example, Thornton (2001) & Sjitsma, 1995), we do not routinely uniformly in all school districts and
reported that teachers in training in study item response patterns of students states, then differences in performance
Great Britain have been so intimidated to find out if students' response patterns will not be due to this source of CIV.
by teacher testing that they are drop suggest anxiety, poor motivation, or Until we have full participation and
ping out of their teacher education pro fatigue. Some students are plodders more uniformity in the way accommo
grams and making alternative plans. As who work slowly and correctly but do dations are offered, comparisons of per
we can see, not only is test anxiety a not finish tests in the allotted time. formance of students with disabilities
powerful source of CIV, it also affects Studies of examinee fit ought to be and LEP among units of analysis such
students and preservice teachers. routine in large-scale, high-stakes as as classrooms, schools, and states can
The motivational level of students sessments, and evidence supporting not be considered reasonable or validly
may affect test score performance, no any of these student sources of CIV interpreted.
matter the achievement level of the should invalidate the scores or cause us
student. The manifestation of low mo to look for reasons for underperfor Students Living in Cultural Isolation
tivation may be non-compliance with mance other than inadequate learning. The measurement of achievement of
the test-taking protocol. Students may Another aspect of this problem is non students living in culturally homoge
seriously underperform, make random response, items omitted or not reached neous, isolated communities can be
marks on the answer sheets, omit an (Koretz, Lewis, Skewes-Cox, & Burstein, affected in many ways. For instance,
swers, or not finish the test. The fre 1993). The frequency of omitted and students living on Native American
quency of omitted responses and items not-reached items should signal poten reservations have a variety of charac
not reached are signals of low motiva tial problems with test anxiety, motiva teristics that work against effective test
tion and non-compliance. Paris, Lawton, tion, fatigue, or timing. Yet, there is performance (Haladyna, 2002b). These
Turner, and Roth (1991) found that surprisingly little research on threats to students, too, need accommodations in
younger students take large-scale tests validity. testing and, in some circumstances, al
more seriously than older students. ternative assessments. The same case
Schools and school districts take very might be made for racial or ethnic com
different approaches to motivating stu Unique Problems of munities that live in isolation from the
dents to perform on these tests. Tactics Special Populations rest of society. While test scores may
include threats, parties, prizes, awards, Keeping in mind the admonitions of traditionally be low for these groups, the
and pep rallies. Whether the tactic is Messick (1984) that CIV can contami lack or failure of accommodations and
positive or negative, knowing the ex nate both interpretations of test scores modifications in assessments might ac
tensiveness of these practices and the and implications we make from knowl count for some of this low performance.
degree of the influence of each of these edge of test scores, we confront the The threat of CIV for these populations
motivational tactics is important. If the unique problems associated with four is similar to that of students with disabil-

24 Educational Measurement: Issues and Practice


ities and with LEP. Research on the mo ples. The Educational Testing Service CIV threat, and then eliminate or re
tivation and preparedness for achieve (ETS) conducted an investigation of duce it. A primary type of documenta
ment tests for these special population student cheating on one of its tests that tion is the annual technical report. This
students can reveal much about how led to the arrest of 61 students who were report should contain references to va
they learn and why their performance is accused of fraudulent test taking (Li, lidity evidence and its relationship to
lower than expected. Such studies can 2002). This type of cheating involved hir the appropriate standards (AERA, APA,
benefit from methods where students ing professional test takers as substi & NCME, 1999) and detail specific CIV
are actually interviewed after the test tutes. There have been recent incidents threats and their management.
or "think-aloud" during a test-taking of electronic devices used to copy test Sponsors of state and local high
session. items or pirate test items. In China, stakes achievement testing programs
64 students had their scores invalidated and test companies that develop these
for copying from other sources in a per tests should work together to consider
Cheating formance test for the Graduate Record the seriousness of all threats to validity,
Any deception committed to misrepre Examination (Taipei Times, 2002). Two including CIV. A shared responsibility
sent a group's or a student's level of students from Columbia University were and honest examination of these CIV
achievement is cheating. Institutional arrested for using high-tech transmit threats are likely to improve the effec
cheating is a deception used to mis ters and walkie-talkies to cheat on tiveness and increase the validity evi
represent student achievement for a the Graduate Record Examination dence for these testing programs.
class, school, school district, or state. (Chronicle ofHigher Education, 2002).
Some examples of institutional cheat This small sample of incidents shows
ing include teachers getting an advance that CIV from individual student cheat References
copy of the test and planning some ing is a worldwide problem. Abedi, J., Lord, C., Hofstetter, C., & Baker, E.
lessons based on specific items or pro The volume by Cizek (1999) is a (2000). Impact of accommodation strate
viding students "practice" with actual, milestone in the study and documen gies on English language learners' test
secure test items (Popham, 2003); read tation of cheating as a source of CIV. performance. Educational Measurement:
ing answers to students during the test Nonetheless, more research is needed Issues and Practice, 19(3), 16-26.
or helping students select correct an into the motivation for institutional Adams, R., Wu, M., & Wilson, M. ( 1998). Con
swers; giving hints to correct answers in cheating and its extensiveness partic Quest [ Computer Program J. Camberwell,
ularly within high-stakes accountabil Victoria, Australia: Australian Council
the classroom during the test or chang for Educational Research.
ing wrong answers to right answers after ity testing programs. American Educational Research Asso
the test is given. Unintentionally ex ciation (2000). Position statement of the
cluding low-achieving students from a American Educational Research Asso
test may raise the average score for a Conclusion ciation concerning high-stakes testing
unit of analysis, but intentionally ex Every interpretation or use of an in pre K-12 education. Educational Re
cluding such students can also be con achievement test score in a high-stakes searcher, 29(8), 24-25.
sidered cheating. This problem was a environment is vulnerable to many va American Educational Research Associa
topic of discussion over a decade ago lidity threats, such as inadequate con tion, American Psychological Associa
(Cannell, 1989). One of the most publi struct definition, construct under repre tion, & National Council on Measurement
in Education. (1999). Standards for
cized cheating scandals occurred in the sentation, illogical reasoning regarding educational and psychological testing.
New Orleans public schools (Meitrodt the causes of student learning, negative Washington, DC: American Educational
& Nabonne, 1997). A variety of score consequences of test score uses, and Research Association.
boosting techniques were used that in low reliability of test scores. CIV is one Andrich, D., Lyne, A., Sheridan, B, & Luo, G.
cluded exclusion oflow-scoring students, of these threats to validity. We have (1997). RUMM 2010: A windows pro
unethical test preparation, early distri shown that CIV has many sources, and gramfor Rasch unidimensional models
bution of tests and subsequent teaching evidence has been presented of its ex for measurement I Computer Progi:am].
of the test, and unusual test administra tensiveness. We have argued that CIV Duncraig, Western Australia: Murdoch
tion arrangements. We have witnessed needs increased attention, especially in University-Social Measurement Labora
outbreaks of cheating in Austin, Texas; high-stakes testing programs. tory.
Beller, M., & Gafni, N. (2000). Can item for
Chicago, Illinois; New York City; Reston, Researchers should systematically mat (multiple-choice vs. open-ended)
Virginia; and Rhode Island (Hoff, 2000). address understudied sources of CIV. account for gender differences in mathe
Cizek (1999, pp. 62-69) provided exten Identification and assessment of the se matics achievement? Sex roles, 42(),
sive evidence of institutional cheating in riousness of each source in high-stakes 1-22.
schools and school districts. The trust testing programs is the first step. This Bowman, D. H. (2001, December 12).
worthiness of student achievement data research should build a robust litera Arizona reports scoring errors on state
should be questioned, particularlywhen ture that provides a clear picture of the exam. Education Week on the Web. Re
large gains are observed for a school or seriousness of this threat to validity. trieved March 17, 2004, from http://www.
school district, as in the New Orleans Documentation is needed to assure edweek.org/ew/newstory.cfm?slug=l5ca
scandal. Like other sources, institutional the public that each specific threat to ps.h21
Brennan, R. L. (2001). Generalizability
cheating in a school or school district validity is not serious for the particular theory. New York: Springer.
level introduces CIV that is difficult to score interpretations proposed. In in Cannell, J. J. (1989). How public educa
detect and document. stances where this documentation tors cheat on standardized achieve
Individual cheating is also a veryseri shows that a threat is serious, we need ment tests. Albuquerque, NM: Friends
ous source of CIV. We have many exam- to take appropriate action to study the for Education.

Spring 2004 25
Chronicle of Higher Education (2002, Nov processes: A review of research in the Kane, M. T. (2002). Validating high-stakes
ember 21). Two students arrested for United States. Review of Educational testing programs. Educational Measure
alleged high tech cheating on the GRE. Research, 65, 145-190. ment: Issues and Practices, 21 (1), 31-41.
Retrieved March 17, 2004, from http:// Garcia, G. E. (1991). Factors influencing Koretz, D., Lewis, E., Skewes-Cox, T., &
chronicle.com/free/2002/11/2002112102t. the English reading test performance Burstein, L. (1993). Omitted and not
htm of Spanish-speaking Hispanic children. reached items in mathematics in the
Cizek, G. J. (1999). Cheating on tests: How Reading Research Quarterly, 26, 371-391. 1990 National Assessment ofEducational
to do it, detect it, and prevent it. Mahwah, Grissmer, D. W., Flanagan, A. E., Kawata, Progress (CRE: Technical Report 347).
NJ: Erlbaum. J. H., & Williamson, S. (2000). Improving Los Angeles, CA: Center for Research
Cole, N. S., & Moss, P. A. (1989). Bias in test student achievement: What state NAEF on Evaluation, Standards, and Student
use. In R. L. Linn (Ed.),Educationalmea t;est scores tell us. Santa Monica, CA: Rand Testing.
surement (3rd ed., pp. 201-220). New Corporation. Li, K. (2003, March 5). Fraudelent TOEFL
York: American Council on Education and Haertel, E., & Calfee, R. (1983). School takers face possible deportation. Daily
Macmillan. achievement: Thinking about what to test. Princetonian. Retrieved on March 17,
Cronbach, L. J. (1988). Five perspectives on Journal of Educational Measurement, 2004, from http://www.dailyprinceton
validity argument. In H. Wainer & H. I. 20(2), 119-131. ian.com/archives/2003/03/05/news/7516.
Braun (Eds.), Test validity (pp. 3-17). Haladyna, T. M. (2002a). Supporting docu shtml
Hillsdale, NJ: Erlbaum. mentation: Assuring more valid test score Linacre, J. M., & Wright, B. D. (2004).
Cronbach, L. J., & Meehl, P. E. (1955). interpretations and uses. In G. Tindal & FACETS: Computer program for many
Construct validity in psychological tests. T. M. Haladyna (Eds.), Large-scale (1}; faceted Rasch measurement. [ Computer
Psychological Bulletin, 52, 281-302. sessmentfor all students: Validity, tech Software]. Chicago: MESA Press.
Crooks, T. J., Kane, M. T., & Cohen, A. S. nical adequacy, and implementation Linn, R. L. (2002). Validation of the uses
(1996). Threats to valid use ofassessment. (pp. 89-108). Mahwah, NJ: Erlbaum. and interpretations of results of state as
Assessment in Education, 3(3), 265-285. Haladyna, T. M. (2002b). Standard'ized sessment and accountability systems. In
DeMars, C. E. (1998). Gender differences in achievement testing: Validity and ac G. Tindal & T. Haladyna (Eds.), Large
mathematics and science on a high school countability. Boston: Allyn & Bacon. scale assessment programs jor all stu
proficiency exam: The role of response for Haladyna, T. M. (2004). Developing and dents: Development, implementation,
mat.Applied Measurement in Education, validating multiple-choice test items and analy. Mahwah, NJ: Erlbaum.
11 (3), 279-299. (3rd ed.). Mahwah, NJ: Erlbaum. Linn, R. L., Baker, E. L., & Dunbar, S. B.
Downing, S. M. (2002a). Construct-irrele Haladyna, T. M.,Downing, S. M., & Rodriguez, (1991). Complex performance assess
vant variance and flawed test questions: M. C. (2002). A review of multiple-choice ment: Expectations and validation cri
Do multiple-choice item writing princi item-writing guidelines for classroom teria. Educational Researcher, 20 ( 8),
ples make any difference? Academic assessment. Applied Measurement in 15-21.
Medicine, 77(10), S103-104. Education, 15(3), 309-334. Linn, R. L., Betebenner, D. W., & Wheeler,
Downing, S. M. (2002b). Threats to the Hancock, D.R. (2001). Effects of test anxi K. S. (1998). Problem choice by test tak
validity of locally developed multiple
ety and evaluative threat on students' ers: Implications for comparabuity and
achievement. Journal of Educational construct validity (CSE Technical Report
choice tests in medical education: Research, 94(5), 284-290. 485). Boulder: University of Colorado at
Construct-irrelevant variance and con Hembree, R. (1988). Correlates, causes, Boulder, Center for Research on Eval
struct underrepresentation. Advances in effects, and treatment of test anxiety. uation, Standards and Student Testing.
Health Sciences Education, 7, 235-241. Review of Educational Research, 58, Linn, R. L., Graue, M. E., & Sanders, N. M.
Downing, S. M., & Haladyna, T. M. (1996). 47-77. (1990). Comparing state and district re
Model for evaluating high-stakes testing Henriques, D. B., & Steinberg, J. (2001, sults to national norms: The validity of
programs: Why the fox should not guard May 20). Right answer, wrong score: claims that "Everyone is above average."
the chicken coop. Educational Mea Test flaws take toll. New York Times. Educational Measurement: Issues and
surement: Issues and Practice, 15 (1), Retrieved March 17, 2004, from http: Practice, 9(3), 5-14.
5-12. //www.nytimes.com/2001/05/20/business/ Lohman, D. F. (1993). Teaching and testing
Education Week on the Web Archives. 20EXAM.html?ex=1079672400&en=fbce to develop fluid abilities. Educational
(2004). Scoring errors. Retrieved March f3a39c75ddbd&ei=5070 Researcher, 22(1), 12-23.
17, 2004, from http://www.edweek.org/ Herman, J., & Golan, S. (1993). The ef Lomax, R. G., West, M. M., Harmon, M. C.,
search/ fects of testing on teaching and schools. Viator, K. A.,& Madaus, G. F. (1995). The
Engelhard, Jr., G. E. (2002). Monitoring Educational Measurement: Issues and impact of mandated standardized testing
raters in performance assessments. In G. Practice, 12, 20-25, 41. on minority students. Journal of Negro
Tindal & T. M. Haladyna (Eds.), Large Hill, K., & Wigfield, A. (1984). Test anxiety: Education, 64, 171-185.
scale (J};Sessrnentfor all student: Validity, A major educational problem and what Lord, F. M., & Novick, M. R. (1968).
technical adequacy, and implementation can be done about it. The Elementary Statistical theories of mental test scores.
(pp. 261-287). Mahwah, NJ: Erlbaum. School Journal, 85, 105-126. Reading, MA: Addison Wesley.
Erickson,R. N., Ysseldyke, J. E., & Thurlow, Hoff, D. J. (2000, June 21) As stakes rise Martinez, M. E. (1999). Cognition and the
M. L. (1996). Neglected numerators, definition of cheating blurs. Education questions of test item format. Educational
drifting denominators, and fractured Week, 19(41), 1. Psychology, 34( 4), 207-218.
fractions: Determining participation rates Holland, P. W., & Wainer, H. (Eds.). (1993). Mehrens, W. A., & Kaminski, J. (1989).
for students with disabilities in state Dlfferential item functioning. Mahwah, Methods for improving a standardized
wide assessment programs (Synthesis NJ: Erlbaum. test scores: Fruitful, fruitless, or fraudu
Report No. 23). Minneapolis: University Huff,K. L.,&Sireci,S. (2001). Validity issues lent? Educational Measurement: Issues
of Minnesota, National Center on Educa in computer-based testing. Educational and Practices, 8, 14-22.
tional Outcomes. Retrieved March 17, Measurement: Issues and Practices, Mehrens, W. A., Popham, W. J., & Ryan,
2004, from http:/!education. umn.edu/ 20(3), 16-25. J. R. (1998). How to prepare students for
NCEO/OnlinePubs/Synthesis23.html Kane, M. T. (1992). An argument-based ap performance assessments. Educational
Fitzgerald, J. (1995). English-as-a-second proach to validity. Psychological Bulletin, Measurement: Issues and Practice, 17,
language learners' cognitive reading 112, 527-535. 18-22.

26 Educational Measurement: Issues and Practice


Meijer, R.R., & Sjitsma, K. (1995). Detection www.natd.org/Case_3_Cherry_Creek_ and implementatwn 'issues (pp. 211-
of aberrant item score patterns: A re part_E.PDF 229). Mahwah, NJ: Erlbaum.
view of recent developments. Applied National Center of Educational Statis Rothstein, R. (2002, September 18). How
Measurement inEducation, 8(3), 261- tics. (2002). The nation's report card: U.S. punishes states with higher stan
272. Mathematics highlights (NCES 2001- dards. The New York Times. Retrieved
Meitrodt, J., & Nabonne, R. ( 1997). Scores, 518). Washington, DC: U.S. Department March 17, 2004, from httpJ/www.ny
testing practices raise suspicions of ex of Education: Office of Educational Re times.com/2002/09/18
perts. New Orleans Times-Picayune search and Improvement. Retrieved Ryan, J.M., & Demark, S. (2002). Variation
Special Report. Retrieved March 17, March 17, 2004, from (http://nces.ed.gov/ in achievement scores related to gender,
2004, from http://www.nola.com/speced/ nationsreportcard/) item format and content area tested. In
toogood/rnain.html Nitko, A. J. (2001). Educational assessment G. A. Tindal & T. M. Haladyna (Eds.),
Messick, S. (1984). The psychology of edu ofstudents (3rd ed.). Upper Saddle River: Large-scale assessment programs for all
cational measurement. Journal ofEdu NJ: Merrill/Prentice Hall. students: Development, implementation,
catwnal Measurement, 21, 215-237. Nolen, S. B., Haladyna, T. M., & Haas, N. S. and analysis (pp. 67-88). Mahwah, NJ:
Messick, S. (1989). Validity. In R. L. (1992). Uses and abuses of achievement Erlbaum.
Linn (Ed.), Educational measurement test scores.Educational Measurement: Sternberg, R. J. (1998). Abilities are forms
(3rd ed., pp. 13-104). New York: American Issues and Practices, 11, 9-15. of developing expertise. Educational
Council on Education and Macmillan. Paris, S. G., Lawton, T. A., Turner, J. C., & Researcher, 27(3), 11-20.
Messick, S. (1995a). Validity of psychological Roth, J. L. (1991). A developmental per Taipei Times (2002, August 9). GRE cheat
assessment: Validation of inferences from spective on standardized achievement ing probe uncovers Asian internet sites.
persons' responses and performances as testing.Educational Researcher, 20 (1) , Retrieved March 17, 2004, from http://
scientific inquiry into score meaning. 2-7. www.taipeitimes.com/News/front/archiv
American Psychologist, 50, 741-749. Penfield, R. D., & Lam, R. C. M. (2000). es/2002/08/09/159537
Messick, S. (1995b). Standards of validity Assessing differential item functioning in Thornton, K. (2001, May 4). Test pressures
and the validity of standards in perfor perfo rmance assessment: Review and forces trainees to quit. The Times Edu
mance assessment. Educational Measure recommendations. Educatwnal Measure catwnal Supplement, N. 4427.
ment: Issues and Practice, 14 ( 4), 5-8. ment: Issues and Practice, 19(5), 5-15. Wainer, H., & Thissen, D. (1994). On exami
Mislevy, R. J. (1996). Test theory recon Popham, W. J. (1991). Appropriateness of nee choice in educational testing. &view
ceived. Journal of Educational Mea teachers' test-preparation practices. Edu of EducationalResearch, 64(1), 159-195.
surement, 33(4), 379-416. cational Measurement: Issues and Prac Wodtke, K. H., Harper, F., Schommer, M., &
Muraki, E., &Bock, R. D. (2003). PARSCALE: tices, 10(4), 12-16. Brunelli, P. (1989). How standardized is
IRT based test scoring and item analy Popham, W. J. (2003). Seeking redemption school testing? An exploratory study of
sis for graded open-ended exercises and for our psychometric sins. Educational standardized group testing in kinder
performance tests. Version 3.1 plus. Measurement: Issues and Practices, garten. Educational Evaluation and
[ Computer Software]. Chicago: Scientific 22(1), 45--48. Policy Analysis, 11 (3), 223-235.
Software. Rodriguez, M. (2002). Choosing an item for Zohar, D. (1998) An additive model of test
National Association of Test Directors mat. InG. Tindal&T. M. Haladyna (Eds.), anxiety: Role of exam-specific expecta
(2004). Cleaning up answer sheets. 7. Large-scale assessment programs for all tions. Journal ofEducational Psychology,
Retrieved March 17, 2004, from http:// students: Validity, technical adequacy, 90(2), 330-340.

Spring 2004 27

Вам также может понравиться