Background/Context: Many studies have concluded that educational accountability policies increase data use, but we know little about how to design accountability systems to
encourage productive versus distortive uses of test score data.
Purpose: I propose that five features of accountability systems affect how test score data are
used and examine how individual and organizational characteristics interact with system
features to influence teachers' data use. First, systems apply varying amounts of pressure.
Second, the locus of pressure varies across systems. Third, systems diverge in the distributional goals they set for student performance. Fourth, the characteristics of assessments vary
across systems. Finally, systems differ in scope, that is, whether they incorporate multiple
measures or are process- or outcome-oriented.
Research Design: I review the literature on the effects of accountability systems on teachers'
data use and propose a research agenda to further our understanding of this area.
Conclusions/Recommendations: Researchers have spent much more time analyzing test
score data than investigating how teachers use data in their work. Evolving accountability
systems provide new opportunities for scholars to study how the interactions between
accountability features, individual characteristics, and organizational contexts affect teachers' test score data use.
The focus on data, I would say, is the driving force [behind education] reform. No longer can we guess. We need to challenge ourselves every day to see what the data mean.
Secretary of Education Arne Duncan, 2010 (quoted in Prabhu, 2010)
require exit exams for students to graduate (Urbina, 2010). Merit pay
proposals in districts such as Houston, Charlotte-Mecklenburg, and
Minneapolis tie some component of a teacher's compensation to his or
her students' performance on standardized tests (Papay & Moore
Johnson, 2009).
Johnson, 2009). More recently, many federal Race to the Top winners
committed to linking up to 50% of teacher evaluations to student test
scores (New Teacher Project, 2010), and in September 2010, the Los
Angeles Times published individual value-added data for elementary teachers (Buddin, 2010). Principals in many cities can earn substantial
bonuses or lose their jobs based on test scores, and NCLB holds both
schools and districts accountable for improving test scores.
Does the locus of pressure affect how much, and how, data are used?
We need to understand whether within-school variation in data use
increases or decreases when accountability moves to the level of the
teacher. An intriguing finding in the current literature is that most of the
variation in data use currently exists within rather than between schools
(Marsh et al., 2006). This means that individual user characteristics need
to be a focus of study along with organizational characteristics. Because
current studies focus on outcomes rather than process, we can only use
existing results to generate hypotheses about between-teacher variation
in data use.
Recent studies by Papay (2010) and Corcoran, Jennings, and Beveridge
(2010) compared teacher effects on high- and low-stakes tests under the
assumption that the stakes attached to tests matter for teacher responses.
These studies suggest that teachers who appear effective on high-stakes
tests are not necessarily effective on low-stakes tests. Furthermore,
teacher effects on high-stakes tests decay more quickly than do teacher
effects on low-stakes tests, perhaps because teachers face different pressures to increase scores on high-stakes tests that lead them to use test data
in different ways. Corcoran et al. (2010) found that there are particularly
large gaps in measured teacher effectiveness on the two tests for inexperienced teachers. These teachers may be using, or experiencing the stimulus effects of, high-stakes test data differently than their more
experienced peers. This may be because these teachers face pretenure
pressure to increase test scores; the Reback et al. (2010) study found that
in schools facing accountability pressure, untenured teachers work substantially more hours per week.
These findings lead to two specific lines of inquiry. The first task is to
explain whether data use plays a role in making teachers effective on a
high-stakes test but not a low-stakes test. For example, do some teachers
use test score data from high-stakes tests as the dominant lens in making
sense of school performance, and thus focus their time and attention
studies suggest that high-powered individual incentives focused on a limited set of easily measurable goals are likely to distort behavior and lead
to undesirable uses of data (Baker, Gibbons, & Murphy, 1994; Campbell,
1979; Holmstrom & Milgrom, 1991). If this precept applies to schools, we
can predict that individual accountability focused largely on test scores
will encourage distortive uses of data. But we may also expect that these
responses will be mediated by the organizational contexts in which teachers work (Coburn, 2001). In some places, this pressure may be experienced acutely, whereas in others, principals and colleagues may act as a
buffer. As I will discuss later, subjective performance measures have been
proposed as a way to offset these potentially distortive uses of data.
DISTRIBUTIONAL GOALS OF THE ACCOUNTABILITY SYSTEM:
PROFICIENCY, GROWTH, AND EQUITY
The goals of an accountability system affect how student performance is
measured, which in turn may affect how data are used. The three major
models currently in use are a status (i.e., proficiency) model, a growth
model, and some combination of the two. These models create different
incentives for using data as both diagnosis and compass to target
resources to students. In the case of status models, by which I mean
models that focus on students' proficiency, teachers have incentives to
move as many students over the cut score as possible but need not attend
to the average growth in their class. In a growth model, teachers have
incentives to focus on those students who they believe have the greatest
propensity to exhibit growth. Because state tests generally have strong
ceiling effects that limit the measurable growth of high-performing students (Koedel & Betts, 2009), teachers may focus on lower performing
students in a growth system.
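To make these competing incentives concrete, the following sketch (my own illustration; the cut score, ceiling, student scores, and priority functions are hypothetical simplifications, not claims about any cited model) ranks students by how much attention each system rewards giving them:

```python
# Illustrative only: which students a teacher has the strongest incentive
# to target under a status model versus a growth model. All values are
# hypothetical assumptions, not parameters from any cited study.

CUT_SCORE = 300   # hypothetical proficiency cut
CEILING = 400     # hypothetical test ceiling that limits measurable growth

students = {"A": 220, "B": 290, "C": 298, "D": 310, "E": 370}  # prior scores

def status_priority(score: int) -> float:
    """Status incentive: only students below the cut can add to the
    proficiency rate, and those just below it ("bubble" students) are the
    cheapest to move over."""
    return 1.0 / (CUT_SCORE - score) if score < CUT_SCORE else 0.0

def growth_priority(score: int) -> int:
    """Growth incentive: remaining headroom before the ceiling; high
    performers have little measurable growth left (Koedel & Betts, 2009)."""
    return CEILING - score

for name, score in sorted(students.items()):
    print(f"{name}: status={status_priority(score):.3f} "
          f"growth={growth_priority(score)}")
# Student C (298) ranks first under the status model; student A (220)
# ranks first under the growth model.
```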
No studies to date have investigated whether status models have different effects on data use than growth models. However, we can generate
hypotheses about these effects by considering a growing body of literature that has assessed how accountability systems affect student achievement across the test score distribution. A prime suspect in producing
uneven distributional effects is the reliance of current accountability systems
on proficiency rates, a threshold measure of achievement. Measuring
achievement this way can lead teachers to manipulate systems of measurement to create the appearance of improvement. For example, teachers can focus on "bubble" students, those close to the proficiency cut
score (Booher-Jennings, 2005; Hamilton et al., 2007; Neal &
Schanzenbach, 2007; Reback, 2008). Test score data appear to play a
central role in making these targeting choices and may also be used to
legitimize these choices as "data-driven decision making" (Booher-Jennings, 2005). Because sanctions are a function of passing rates, slightly
increasing the scores of a small number of students can positively impact
the school's accountability rating.
A large body of evidence addresses the issue of distributional effects
and provides insight into the extent to which teachers are using data to
target resources to students. The literature is decidedly mixed. One study
in Chicago found negative effects of accountability pressure on the lowest performing students (Neal & Schanzenbach, 2007), whereas another
in Texas found larger gains for marginal students and positive effects for
low-performing students as well (Reback, 2008). Four studies identified
positive effects on low-performing students (Dee & Jacob, 2009; Jacob,
2005; Ladd & Lauen, 2010; Springer, 2007), whereas four found negative
effects on high-performing students (Dee & Jacob, 2009; Krieg, 2008;
Ladd & Lauen, 2010; Reback, 2008). Because these studies intended to
establish effects at the level of the population, they did not directly attend
to how teachers varied in their use of data to target students and how
organizational context may have mediated these responses. I return to
these issues in my proposed research agenda.
Only one study to date has compared the effects of status and growth
models on achievement. Analyzing data from North Carolina, which has
a low proficiency bar, Ladd and Lauen (2010) found that low-achieving
students made more progress under a status-based accountability system.
In contrast, higher achieving students made more progress under a
growth-based system. This suggests that teachers' allocation of resources
is responsive to the goals of the measurement system; teachers targeted
students below the proficiency bar under a status system, and those
expected to make larger gains (higher performing students) under a
growth system. As more states implement growth models, researchers will
have additional opportunities to address this question and determine
what role the difficulty of the proficiency cut plays in affecting how teachers allocate their attention.
A second feature that may be important for how data are used to target
students is whether the system requires subgroup accountability, and
what cutoffs are established for separately counting a subgroup. States
vary widely in how they set their subgroup cutoffs. In Georgia and
Pennsylvania, 40 students count as a subgroup, whereas in California,
schools must enroll 100 students or 50 students if that constitutes 15% of
school enrollment (Hamilton et al., 2007). Only one study, by Lauen and
Gaddis (2010), has addressed the impact of NCLB's subgroup requirements. Though they found weak and inconsistent effects of subgroup
accountability overall, Lauen and Gaddis found large effects of subgroup
accountability on low-achieving students' test scores in reported subgroups; these effects were largest for Hispanic students. These findings suggest
that we need to know more about how data are used for targeting in
schools that are separately accountable for subgroups compared with
similar schools that are not.
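As a small illustration of how mechanical these thresholds are, the cutoff rules just described can be written as two short predicates (the function names, and my reading of California's 15% condition, are assumptions rather than statutory language):

```python
# A sketch of the subgroup-counting rules reported in Hamilton et al.
# (2007); the reading of the California condition is an assumption.

def counts_in_georgia_or_pennsylvania(subgroup_n: int) -> bool:
    """GA and PA: a subgroup is separately accountable at 40 students."""
    return subgroup_n >= 40

def counts_in_california(subgroup_n: int, enrollment: int) -> bool:
    """CA: 100 students, or 50 students if they make up at least 15% of
    school enrollment."""
    return subgroup_n >= 100 or (subgroup_n >= 50
                                 and subgroup_n >= 0.15 * enrollment)

# A 60-student subgroup always counts in GA/PA, but in CA only when the
# school enrolls no more than 400 students (60 >= 0.15 * 400).
print(counts_in_georgia_or_pennsylvania(60))  # True
print(counts_in_california(60, 500))          # False
print(counts_in_california(60, 400))          # True
```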
To summarize, most of our knowledge about the effects of distributional goals of accountability systems comes from studies that examine
test scores, rather than data use, as the focus of study. These studies raise
many questions about the role data use played in producing these outcomes. First, data use has made targeting more sophisticated, real-time,
and accurate, but we know little about how targeting varies across teachers and schools. Second, we need to know whether teachers, administrators, or both are driving targeting behavior. For example, whereas
77%–90% of elementary school principals reported encouraging teachers to focus their efforts on students close to meeting the standards,
only 29%–37% of teachers reported doing so (Hamilton et al., 2007).
Third, we need to know more about the uses of summative data for monitoring the effectiveness of targeting processes. How do teachers interpret students' increases in proficiency when they are applying targeting
practices? Depending on the inference teachers want to make, targeting
can be perceived as a productive or distortive use of data. Targeting students below passing creates the illusion of substantial progress on proficiency, making it distortive if the object of interest is change in student
learning. On the other hand, the inferences made based on test scores
would not be as distortive if teachers examined average student scale
scores. At present, we do not know to what extent teachers draw these distinctions in schools. A final area of interest, which will be discussed in
more detail in the following section, is the extent to which targeting
increases students skills generally or is tailored to predictable test items
that will push students over the cut score.
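The distinction between proficiency rates and average scale scores drawn above can be illustrated with a simple calculation (all numbers invented): nudging a few bubble students over the cut moves the proficiency rate sharply while leaving the class mean nearly flat.

```python
# Invented numbers illustrating the inference problem described above:
# the same targeting looks like large progress on proficiency and like
# almost no progress on the average scale score.

CUT = 300

before = [220, 250, 296, 298, 299, 310, 330, 360]
after  = [220, 250, 301, 303, 304, 310, 330, 360]  # three bubble students nudged over

def proficiency_rate(scores):
    return sum(s >= CUT for s in scores) / len(scores)

def mean_score(scores):
    return sum(scores) / len(scores)

print(f"proficiency: {proficiency_rate(before):.0%} -> {proficiency_rate(after):.0%}")
print(f"mean score:  {mean_score(before):.1f} -> {mean_score(after):.1f}")
# proficiency: 38% -> 75%; mean score: 295.4 -> 297.2
```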
FEATURES OF ASSESSMENTS
Features of assessments may affect whether teachers use data in productive or distortive ways. Here, I focus on three attributes of assessments (the framing of standards, the sampling of standards, and the cognitive demand of the skills represented on state tests) because they are most
relevant to the potentially distortive uses of data. Many more features of
assessments, such as the extent to which items are "coachable," should
also be explored. The specificity of standards and their content varies
widely across states (Finn, Petrilli, & Julian, 2006). Framing standards too
broadly leads teachers to use test data to focus on tests rather than on the
(2003), and Wolf and McIver (1999) all demonstrate how teachers focus
their instruction not only on the content of the test but also on its format,
presenting material in the formats in which it will appear on the test and
designing tasks to mirror the content of the tests. To the extent that students learn how to correctly answer questions when they are presented in
a specific format but struggle with the same skills when they are presented in a different format, this use of test data to inform instruction is
distortive because it inflates scores.
Taken together, these studies suggest that different assessments create
incentives for teachers to use data in different ways. They also suggest
that teachers are using relatively detailed data about student performance on state exams in their teaching but provide little insight into the
types of teachers or schools where these practices are most likely to be
prevalent. School organizational characteristics, such as the availability of
data systems, instructional support staff, and professional development
opportunities, may affect how features of assessments are distilled for
teachers. Teachers' own beliefs about whether these uses constitute
good teaching may also matter. Some teachers view focusing on frequently tested standards as best practice, whereas others see it as
teaching to the test. Another important area for inquiry is whether this
type of data use is arising from the bottom up or the top down. In many
cases, teachers are not making these decisions alone. Rather, school and
district leaders may mandate uses of data and changes in practice that
will increase test scores, and teachers may unevenly respond to these
demands (Bulkley, Fairman, Martinez, & Hicks, 2004; Hannaway, 2007;
Koretz, Barron, Mitchell, & Stecher, 1996; Koretz, Mitchell, Barron, &
Keith, 1996; Ladd & Zelli, 2002).
SCOPE
Many have hypothesized that accountability systems based on multiple
process and outcome measures may encourage more productive uses of
data than those based only on test scores. In a policy address, Ladd
(2007) proposed the use of teams of inspectors that would produce qualitative evaluations of school quality so that accountability systems would
more proactively promote good practice. Hamilton et al. (2007) suggested creating a broader set of indicators to provide more complete
information to the public about how schools are performing and to
lessen the unwanted consequences of test-based accountability. The hope
is that by measuring multiple outcomes and taking account of the
processes through which outcomes are produced, educators will have
weaker incentives to use distortive means.
A large body of literature in economics and management has considered how multiple measures may influence distortive responses to incentives, particularly when firms have multiple goals. In their seminal article,
Holmstrom and Milgrom (1991) outlined two central problems in this
area: Organizations have multiple goals that are not equally easy to measure, and success in one goal area may not lead to improved performance
in other goal areas. They showed that when organizational effectiveness
in achieving goals is weakly correlated across goal domains and information across these domains is asymmetric, workers will focus their attention on easily measured goals to the exclusion of others. They
recommended minimizing strong objective performance incentives in
these cases.
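A stylized two-task version of their argument conveys the intuition (the notation below is mine, a simplification of the original model rather than a restatement of it):

```latex
% A stylized sketch of the Holmstrom-Milgrom (1991) multitask intuition;
% notation and simplifications are mine.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Suppose a teacher divides effort between a measured task $e_1$ (raising
test scores, observed as $x = e_1 + \varepsilon$) and an unmeasured task
$e_2$ (broader learning), with convex cost $C(e_1, e_2)$. Under a linear
incentive $w = \alpha x$ and a private return $\beta$ to the unmeasured
task, the teacher solves
\[
  \max_{e_1,\, e_2} \; \alpha e_1 + \beta e_2 - C(e_1, e_2).
\]
When the two efforts are substitutes in cost
($\partial^2 C / \partial e_1 \, \partial e_2 > 0$), raising $\alpha$
increases $e_1$ and crowds out $e_2$: stronger incentives on the easily
measured goal pull effort away from the hard-to-measure one, which is
why muting objective performance incentives can be optimal.
\end{document}
```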
Because comprehensive multiple measures systems have not been
implemented in the United States, it is currently difficult to study their
effects on data use. Existing studies have examined how the use of multiple measures affects the validity of inferences we can make about school
quality (Chester, 2005) or have described the features of these systems
(Brown, Wohlstetter, & Liu, 2008), but none has evaluated their effects
on data use. The European and U.K. experience with inspectorates provides little clear guidance on this issue; scholars continue to debate
whether these systems have improved performance or simply led to gaming of inspections (Ehren & Visscher, 2006).
A PROPOSED RESEARCH AGENDA
To improve accountability system design and advance our scholarly
knowledge of data use, researchers should study both productive and distortive uses of data. Next, I propose a series of studies that would help
build our understanding of how accountability system features affect
teachers' data use and what factors produce variability in teachers'
responses.
AMOUNT OF PRESSURE
There are two specific features of regulatory accountability systems that
should be explored to understand their impacts on data use: where the
level of expected proficiency or gain is set, and how quickly schools are
expected to improve. There is wide variation in the cut scores for proficiency that states set under NCLB as well as substantial variation in the
required pace of improvement under NCLB. As Koretz and Hamilton
(2006) have written, establishing the right amount of required improvement and its effects has been a recurrent problem in performance
data, scholars should study whether and how teachers' data use changes
in response to measures of their value-added. For example, we need to
understand whether teachers pursue productive or distortive forms of
data use as they try to improve their value-added scores and how these
reactions are mediated by individual and organizational characteristics.
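For studies of this kind, even a rudimentary value-added measure makes clear what teachers are being asked to respond to. The sketch below (invented data; a deliberately simplified model, not the one used in the studies cited here) computes each teacher's effect as the mean residual from a regression of current on prior scores:

```python
# Toy value-added calculation on invented data: a teacher's "effect" is
# the average residual of her students' current scores after adjusting
# for prior scores. Real value-added models are far more elaborate.
import numpy as np

prior   = np.array([250.0, 260, 270, 280, 290, 300, 310, 320])
current = np.array([262.0, 268, 285, 288, 305, 303, 330, 326])
teacher = np.array(["T1", "T1", "T1", "T1", "T2", "T2", "T2", "T2"])

slope, intercept = np.polyfit(prior, current, 1)   # current = a + b * prior
residuals = current - (intercept + slope * prior)

for t in ("T1", "T2"):
    print(t, round(float(residuals[teacher == t].mean()), 2))
# A positive mean residual means the teacher's students scored above
# expectation given their prior achievement.
```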
GOALS OF THE ACCOUNTABILITY SYSTEM: STATUS, GROWTH, AND
EQUITY
Although scholars have proposed a variety of theories to predict how
teachers will respond to status versus growth-oriented systems, few of
these theories have been tested empirically. What seems clear is that status- and growth-based systems create different incentives for using data to
inform instruction. Though there is no extant research in this area, one
might hypothesize that in a status-based system, a teacher might use formative assessment data to reteach the skill deficits of students below the
threshold. Under a growth system, the same teacher might use these data
to maximize classroom growth, which could occur by focusing on the
skills that the majority of the class missed. Alternatively, a more sophisticated approach could be used whereby teachers determine which students have the greatest propensity to exhibit growth and focus on the
skill deficits of these students. As accountability systems evolve,
researchers should study how the incentives implied by status and growth
systems affect data use practices.
All the ideas posed in the preceding paragraphs assume that teachers
have a clear understanding of how status versus growth systems work.
Because growth systems are based on sophisticated statistical models,
there are good reasons to suspect that this is untrue. Researchers should also
study how teachers vary in their understanding of these systems, how
these understandings vary between and within organizations, and how
they shape teachers' use of data.
FEATURES OF ASSESSMENTS
As described in the literature review, assessments offer different opportunities to use data to improve test scores. This area will become particularly relevant with the implementation of the Common Core standards
and assessments, which focus on fewer concepts but promote deeper
understanding of them. The switch to a new set of standards will provide
a unique opportunity for researchers to observe the transition in data use
that occurs when the features of standards and assessments change.
We also need to know how teachers vary in the extent to which they use
distortive approaches to increase scores, which can be enabled by assessments that build in predictable features. Existing research suggests that
there is substantial variation across states in such responses, which suggests that features of assessments may matter. More investigation at the
organizational level is needed. For example, there is substantial variation
in the fraction of teachers reporting that they emphasize certain assessment styles and formats of problems in their classrooms. One hypothesis
is that features of assessments make these behaviors a higher return strategy in some places than others. Another hypothesis is that variation in
score reporting produces variation in teachers' data use. Researchers
could contrast teachers' responses in states and districts that disaggregate
subscores in great detail, relative to those that provide only overall
scores.3 Once we better understand the features of assessments and
assessment reporting that contribute to these differences, researchers
can design assessments that promote desired uses of data and minimize
undesired uses. Using existing student-by-item-level administrative data,
it is now possible to model teachers' responses to these incentives, but
what is missing from such studies is an understanding of the data-related
behaviors that produced them. Future studies using both survey and
qualitative approaches to study data use can help to unpack these
findings.
SCOPE
Many have hypothesized that accountability systems based on multiple
measures, and in particular those that are both process- and outcome-oriented, may produce more productive uses of data. Future studies
should establish how teachers interpret multiple measures systems. These
systems will put different weights on different types of measures, requiring teachers to decide how to allocate their time among them.
For example, in New York City, 85% of schools' letter grades are based on
student test scores, whereas 15% are based on student, teacher, and parent surveys and attendance records. Likewise, new systems of teacher
evaluation incorporate both test scores (in some cases, up to 51% of the
evaluation) and other evaluations. We need to know how teachers understand these systems in practice and how the weights put on different types
of measures influence their understanding.
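A composite of this kind is easy to state formally, and doing so shows why the weights matter for teachers' attention. The sketch below uses the New York City weights cited above; the component scores and the function name are hypothetical:

```python
# Weighted composite with the NYC weights cited above (85% test scores,
# 15% surveys and attendance). Component values are hypothetical 0-100
# scores; the function name is an assumption, not an official formula.

def school_grade(test_component: float, survey_attendance_component: float,
                 w_test: float = 0.85, w_other: float = 0.15) -> float:
    """The heavier a measure's weight, the more a unit of improvement on
    it moves the overall grade, and the more attention it commands."""
    return w_test * test_component + w_other * survey_attendance_component

# Strong surveys cannot offset weak test scores under these weights:
print(school_grade(60, 95))  # 65.25
print(school_grade(95, 60))  # 89.75
```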
CONCLUSION
The rise of the educational accountability movement has created a flurry
of enthusiasm for the use of data to transform practice and generated
reams of test score data that teachers now work with every day.
Researchers have spent much more time analyzing these test score data
themselves than trying to understand how teachers use data in their
work. What this literature review makes clear is just how scant our knowledge is about what teachers are doing with these data on a day-to-day
basis. Given the widespread policy interest in redesigning accountability
systems to minimize the undesired consequences of these policies, understanding how accountability features influence teachers' data use is an
important first step in that enterprise.
Notes
1. Federal, state, and district policy makers, of course, formally use data to measure
how schools are doing and to apply rewards or sanctions, but I focus here on use of data by
teachers.
2. Recent regulations now make it possible for states to request waivers from this
requirement.
3. I thank a reviewer for this point.
References
Baker, G., Gibbons, R., & Murphy, K. J. (1994). Subjective performance measures in optimal incentive contracts. Quarterly Journal of Economics, 109, 1125–1156.
Booher-Jennings, J. (2005). Below the bubble: Educational triage and the Texas accountability system. American Educational Research Journal, 42, 231–268.
Borko, H., & Elliott, R. (1999). Hands-on pedagogy versus hands-off accountability:
Tensions between competing commitments for exemplary math teachers in Kentucky.
Phi Delta Kappan, 80, 394–400.
Boudett, K. P., City, E. A., & Murnane, R. J. (Eds.). (2005). Data wise: A step-by-step guide to using assessment results to improve teaching and learning. Cambridge, MA: Harvard Education Press.
Brown, R. S., Wohlstetter, P., & Liu, S. (2008). Developing an indicator system for schools of choice: A balanced scorecard approach. Journal of School Choice, 2, 392–414.
Buddin, R. (2010). Los Angeles teacher ratings: FAQ and about. Los Angeles Times. Retrieved
from http://projects.latimes.com/value-added/faq/
Bulkley, K., Fairman, J., Martinez, M. C., & Hicks, J. E. (2004). The district and test preparation. In W. A. Firestone & R. Y. Schorr (Eds.), The ambiguity of test preparation (pp. 113–142). Mahwah, NJ: Erlbaum.
Campbell, D. T. (1979). Assessing the impact of planned social change. Evaluation and
Program Planning, 2, 67–90.
Center on Education Policy. (2007). Choices, changes, and challenges: Curriculum and instruction in the NCLB era. Retrieved from http://www.cep-dc.org
Chester, M. D. (2005). Making valid and consistent inferences about school effectiveness
from multiple measures. Educational Measurement: Issues and Practice, 24, 40–52.
Coburn, C. E. (2001). Collective sensemaking about reading: How teachers mediate reading policy in their professional communities. Educational Evaluation and Policy Analysis,
23, 145–170.
Coburn, C. E. (2005). Shaping teacher sensemaking: School leaders and the enactment of
reading policy. Educational Policy, 19, 476–509.
Coburn, C. E. (2006). Framing the problem of reading instruction: Using frame analysis to
uncover the microprocesses of policy implementation. American Educational Research
Journal, 43, 343–379.
Corcoran, S. P. (2010). Can teachers be evaluated by their students' test scores? Should they be?
Providence, RI: Annenberg Institute, Brown University.
Corcoran, S. P., Jennings, J. L., & Beveridge, A. A. (2010). Teacher effectiveness on high- and low-stakes tests (Working paper). New York University.
Darling-Hammond, L., & Wise, A. E. (1985). Beyond standardization: State standards and school improvement. Elementary School Journal, 85, 315–336.
Dee, T. S., & Jacob, B. (2009). The impact of No Child Left Behind on student achievement (NBER
working paper). Cambridge, MA: National Bureau of Economic Research.
Diamond, J. B. (2007). Where rubber meets the road: Rethinking the connection between
high-stakes testing policy and classroom instruction. Sociology of Education, 80, 285–313.
Ehren, M., & Visscher, A. J. (2006). Towards a theory on the impact of school inspections. British Journal of Educational Studies, 54, 51–72.
Finn, C. E., Petrilli, M. J., & Julian, L. (2006). The state of state standards. Washington, DC:
Thomas B. Fordham Foundation.
Hamilton, L. S., & Stecher, B. M. (2006). Measuring instructional responses to standards-based
accountability. Santa Monica, CA: RAND.
Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., . . .
Barney, H. (2007). Implementing standards-based accountability under No Child Left Behind:
Responses of superintendents, principals, and teachers in three states. Santa Monica, CA: RAND.
Hannaway, J. (2007, November). Unbounding rationality: Politics and policy in a data rich system.
Mistisfer lecture, University Council of Education Administration, Alexandria, VA.
Holmstrom, B., & Milgrom, P. (1991). Multitask principal-agent analyses: Incentive contracts, asset ownership, and job design. Journal of Law, Economics, and Organization, 7,
24–52.
Jacob, B. A. (2005). Accountability, incentives, and behavior: Evidence from school reform
in Chicago. Journal of Public Economics, 89, 761–796.
Jennings, J. L., & Bearak, J. (2010, August). State test predictability and teaching to the test:
Evidence from three states. Paper presented at the annual meeting of the American
Sociological Association, Atlanta, GA.
Jennings, J. L., & Crosta, P. (2010, November). The unaccountables. Paper presented at the
annual meeting of APPAM, Boston, MA.
Kerr, K. A., Marsh, J. A., Ikemoto, G. S., & Barney, H. (2006). Strategies to promote data use
for instructional improvement: Actions, outcomes, and lessons from three urban districts. American Journal of Education, 112, 496–520.
Koedel, C., & Betts, J. (2009). Value-added to what? How a ceiling in the testing instrument influences value-added estimation (Working paper). University of Missouri.
Koretz, D. (2008). Measuring up: What standardized testing really tells us. Cambridge, MA:
Harvard University Press.
Koretz, D., Barron, S., Mitchell, K., & Stecher, B. (1996a). The perceived effects of the Kentucky
Instructional Results Information System. Santa Monica, CA: RAND.
Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L. Brennan
(Ed.), Educational measurement (4th ed., pp. 531–578). Westport, CT: American Council
on Education/Praeger.
Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996b). The perceived effects of the Maryland
School Performance Assessment Program (CSE Tech. Rep. No. 409). Los Angeles: University of California, Los Angeles.